# Exploratory: extract data from a webpage

This notebook walks through extracting a sample of data from a webpage with Python, reshaping it, and saving it in several formats.

## Setup packages

In [2]:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
from dask import diagnostics


## Identify website with data 

In [10]:
# Make a request to the website
r = requests.get("https://www.capp.ca/resources/glossary/")

# Stop early if the request failed (4xx or 5xx status)
r.raise_for_status()

# Use the 'html.parser' to parse the page
soup = BeautifulSoup(r.content, 'html.parser')

# Use a CSS selector to locate the <dl> element that holds the glossary
content = soup.select_one('body > section > div:nth-of-type(3) > dl')

# Print the text
if content:
    print(content.text)
else:
    print("No content found at the specified path.")




Abandonment 

The process of changing a once active well (one that will no longer produce oil or natural gas), to a state where it can be left indefinitely. All equipment that was used to produce oil and gas is removed and work is completed on the well to ensure that it will not cause harm to any environmental or human surroundings. 

Accelerated capital cost allowance (CCA) 

CCA is essentially a yearly deduction allowed by Revenue Canada (CRA) to expense a portion of an asset. For example if you a purchase a computer you are not allowed to expense it all in one year since the computer will likely last at least 3 years or longer. A useful life is assigned to an asset and an annual rate at which they should be expensed is applied. An accelerated CCA allows the company to shorten the number of years of an assets useful life, enabling them to claim the capital cost in a shorter period of time. 

Active Well 

A well that is currently producing oil or natural gas. 

API Gravity 

The Am

## Arrange data in data frame

In [11]:
# Pair each term (<dt>) with its definition (<dd>) in a dictionary
glossary = {}
for dt, dd in zip(content.find_all('dt'), content.find_all('dd')):
    glossary[dt.text] = dd.text

# Build a second dictionary that also records the source URL for each definition
glossary_with_source = {}
for term, definition in glossary.items():
    glossary_with_source[term] = {'definition': definition, 'source': 'https://www.capp.ca/resources/glossary/'}
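
The keys printed further down in this notebook carry a leading newline and a trailing space (for example `'\nAbandonment '`), because `.text` returns the raw text of each tag. A minimal sketch of one way to tidy them, assuming stray whitespace is the only debris. `clean_glossary` is a hypothetical helper that is not part of the original notebook, and the one-entry sample only mimics the shape of the scraped data:

```python
def clean_glossary(raw):
    """Return a copy of `raw` with whitespace trimmed and collapsed in keys and values."""
    cleaned = {}
    for term, definition in raw.items():
        # split() with no arguments drops leading/trailing whitespace and
        # collapses internal runs (newlines, tabs, double spaces) into one space
        cleaned[" ".join(term.split())] = " ".join(definition.split())
    return cleaned


# Illustrative sample shaped like the scraped data (hypothetical, one entry only)
sample = {
    "\nActive Well ": "\nA well that is currently producing oil or natural gas. ",
}
print(clean_glossary(sample))
```

With BeautifulSoup, `dt.get_text(strip=True)` is another option that trims the text at extraction time.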



## Save the formatted data 

In [12]:
# Write the dictionary to a JSON file
json_string = json.dumps(glossary, indent=4)
with open('glossary.json', 'w') as file:
    file.write(json_string)

# Write the dictionary to an Excel file
df = pd.DataFrame(list(glossary.items()), columns=['Term', 'Definition'])
df.to_excel('glossary.xlsx', index=False)

# Write the dictionary to a CSV file
df.to_csv('glossary.csv', index=False)

# Write the dictionary to an HTML file
df.to_html('glossary.html', index=False)

# Write the dictionary with source to a JSON file
json_string_with_source = json.dumps(glossary_with_source, indent=4)
with open('glossary_with_source.json', 'w') as file:
    file.write(json_string_with_source)

# Write the dictionary with source to an Excel file
df_with_source = pd.DataFrame(list(glossary_with_source.items()), columns=['Term', 'Definition with Source'])
df_with_source.to_excel('glossary_with_source.xlsx', index=False)

# Write the dictionary with source to a CSV file
df_with_source.to_csv('glossary_with_source.csv', index=False)

# Write the dictionary with source to an HTML file
df_with_source.to_html('glossary_with_source.html', index=False)
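
Because each value in `glossary_with_source` is itself a dictionary, the `Definition with Source` column above holds whole dictionaries, which are written to the CSV file as stringified dicts. A hedged sketch of flattening the entries into one column per field first. It uses the standard-library `csv` module rather than pandas so that it runs on its own; `flatten_entries` and the sample entry are hypothetical:

```python
import csv
import io


def flatten_entries(nested):
    """Turn {term: {'definition': ..., 'source': ...}} into a list of flat rows."""
    return [
        {"term": term, "definition": entry["definition"], "source": entry["source"]}
        for term, entry in nested.items()
    ]


# Illustrative entry shaped like glossary_with_source
sample = {
    "Active Well": {
        "definition": "A well that is currently producing oil or natural gas.",
        "source": "https://www.capp.ca/resources/glossary/",
    }
}

rows = flatten_entries(sample)

# Write to an in-memory buffer here; pass an open file object to write to disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["term", "definition", "source"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```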

## Inspect the data

In [14]:
# Reuse the in-memory dictionary built above (same content as glossary_with_source.json)
data = glossary_with_source

# Print the keys in the data
print(f"Keys in data: {list(data.keys())}")

# Print the first few values in the data
for key in list(data.keys())[:5]:
    print(f"Value for key {key}: {data[key]}")






Keys in data: ['\nAbandonment ', '\nAccelerated capital cost allowance (CCA) ', '\nActive Well ', '\nAPI Gravity ', '\nBarrel ', '\nBattery ', '\nBenchmarking Measures ', '\nBenzene ', '\nBitumen ', '\nCarbon Capture and Storage (CCS) ', '\nCarbon Leakage ', '\nCentrifugal Pump ', '\nCoalbed Methane (CBM) ', '\nCondensate ', '\nConventional Crude Oil ', '\nC-ring Tanks ', '\nCriteria Air Contaminants (CAC) ', '\nCumulative Effects ', '\nCumulative Production ', '\nCyclic Steam Stimulation (CSS) ', '\nDeclining balance ', '\nDensity ', '\nDevelopment Well ', '\nDilbit ', '\nDiluent ', '\nDirectional Well ', '\nDiscovery Well ', '\nDownstream Sector ', '\nEcosystem ', '\nEnhanced Oil Recovery (EOR) ', '\nEstablished Reserves ', '\nExploratory Well ', '\nExtraction ', '\nFeedstock ', '\nField ', '\nFlaring/Venting ', '\nFlow Line ', '\nFlowback ', '\nFracking - see Hydraulic fracturing ', '\nFugitive Emissions ', '\nGlycol Dehydrator ', '\nGreenhouse Gas Intensity (GHG Intensity) ', '\nGr
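
The cell above works on the in-memory `glossary_with_source` dictionary. To read the saved file back from disk instead, `json.load` can be used. A minimal, self-contained sketch that first writes a one-entry sample to a temporary directory, so the path and contents here are illustrative stand-ins for the real file:

```python
import json
import tempfile
from pathlib import Path

# Illustrative entry shaped like glossary_with_source
sample = {
    "Active Well": {
        "definition": "A well that is currently producing oil or natural gas.",
        "source": "https://www.capp.ca/resources/glossary/",
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "glossary_with_source.json"

    # Write the sample with the same json.dumps settings the notebook uses
    path.write_text(json.dumps(sample, indent=4), encoding="utf-8")

    # Read it back from disk
    with open(path, encoding="utf-8") as file:
        data_from_file = json.load(file)

# The round trip should give back an equal dictionary
print(data_from_file == sample)
```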

## Convert the data into a list of dictionaries

In [16]:
# Convert the data into a list of dictionaries with 'term', 'definition', and 'source' as keys
output = []
for key, item in data.items():
    if isinstance(item, dict):
        row = {
            'term': key,
            'definition': item.get('definition', None),
            'source': item.get('source', None)
        }
        output.append(row)
    else:
        print(f"Unexpected item in data: {item}")



## Output glossary in pretty table

In [18]:
from prettytable import PrettyTable
table = PrettyTable()
table.field_names = ["Term", "Definition", "Source"]
for item in output:
    table.add_row([item['term'], item['definition'], item['source']])
print(table)

# Write the list of dictionaries to a JSON file (pretty printed)
json_string = json.dumps(output, indent=4)
with open('glossary_with_source_reshape.json', 'w') as file:
    file.write(json_string)

# Write the list of dictionaries to an Excel file (pandas is already imported as pd)
df = pd.DataFrame(output)
df.to_excel('glossary_with_source_reshape.xlsx', index=False)
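
Long definitions make each table row extremely wide, as the output below shows. One possible remedy is to wrap the cell text before adding it to the table. A sketch using the standard-library `textwrap` module; `wrap_cell` is a hypothetical helper and the width of 60 is an arbitrary choice:

```python
import textwrap


def wrap_cell(text, width=60):
    """Collapse whitespace and wrap `text` so no line exceeds `width` characters."""
    return textwrap.fill(" ".join(text.split()), width=width)


definition = (
    "The process of changing a once active well (one that will no longer "
    "produce oil or natural gas), to a state where it can be left indefinitely."
)
wrapped = wrap_cell(definition)
print(wrapped)
```

Passing `wrap_cell(item['definition'])` to `table.add_row` instead of the raw definition should keep the Definition column to about 60 characters, since PrettyTable renders multi-line cells.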


+----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+
|                     Term                     |                                                                                                                                                                                                                                                                                      Definition      