# Web Data Set Scraping in Python - Greenhouse Gas Emissions Data

## Introduction

This project scrapes data from a Wikipedia page for use in an exploratory analysis of the relationship between greenhouse gas emissions and GDP per capita by country.

For the web scraping portion, we use BeautifulSoup and Pandas dataframes to create, clean, and manipulate a data set from the internet. The original Wikipedia article is available here:
https://en.wikipedia.org/wiki/List_of_countries_by_greenhouse_gas_emissions

### Import required libraries

In [1]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd

### Request and receive page information

In [2]:
# Send request to Wikipedia article
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_greenhouse_gas_emissions'
response = requests.get(url)

# Check if the request was successful (status code 200); stop otherwise so later cells never run without html_content
if response.status_code == 200:
    html_content = response.content
else:
    raise RuntimeError(f"Failed to retrieve the webpage. Status code: {response.status_code}")
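A request can also hang indefinitely or be rejected when no client identification is sent. The helper below is a minimal sketch of a more defensive fetch, not the code this notebook ran: the function name `fetch_html`, the 10-second timeout, and the User-Agent string are all illustrative assumptions. The `get` parameter exists only so the function can be exercised without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as bytes, raising on any HTTP error status."""
    # Hypothetical User-Agent value; replace it with one that identifies your project.
    headers = {"User-Agent": "ghg-scraper-example/0.1"}
    response = get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses,
    # so the caller never continues with a missing page body.
    response.raise_for_status()
    return response.content
```

With this helper, the cell above could reduce to `html_content = fetch_html(url)`.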


In [3]:
# Parse the HTML and select the first table on the page
soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find('table')
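`soup.find('table')` returns whichever table appears first in the page source, which is only correct while the data table stays first. As a hedged sketch, the example below selects a table by CSS class instead. The inline HTML is a made-up stand-in, and the assumption that the article's data table carries the `wikitable` class should be checked against the live page before relying on it.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page with two tables; not the real article markup.
sample_html = """
<table class="infobox"><tr><td>Not the data</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Country</th><th>2022</th></tr>
  <tr><td>Aruba</td><td>496.7</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find('table') picks the first table in document order.
first_table = sample_soup.find("table")
# class_ matches any one of the element's classes, so 'wikitable sortable' qualifies.
data_table = sample_soup.find("table", class_="wikitable")

print(first_table.get("class"))
print(len(data_table.find_all("tr")))
```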

In [10]:
# Initialize the lists, then parse each table row and append its cell text
data = []
arrows = []

for row in table.find_all('tr'):
    cells = row.find_all(['td', 'th'])
    row_data = [cell.text.strip() for cell in cells]
    data.append(row_data)
    
    # Record the arrow image from the original chart so the trend direction is not lost.
    # '▲' marks a row tagged 'Positive decrease'; its percentage is negated further below.
    span = row.find('span', {'title': 'Positive decrease'})
    img_tag = span.find('img', alt='Positive decrease') if span else None
    src = img_tag['src'] if img_tag else ''
    if 'Decrease_Positive.svg' in src:
        arrow = '▲'
    elif 'Increase_Positive.svg' in src:
        arrow = '▼'
    else:
        arrow = ''
    arrows.append(arrow)

# Create a Pandas dataframe with new headers and data lists above
headers = ['Country/territory','GHG emissions 1970','GHG emissions 1990','GHG emissions 2005',
           'GHG emissions 2017','GHG emissions 2022','GHG per capita 2022', '% of world emissions 2022',
           '% Change from 1990']    
df = pd.DataFrame(data, columns=headers)

# Strip thousands separators and percent signs, then convert to numeric float values (empty cells become 0)
df['% Change from 1990'] = pd.to_numeric(df['% Change from 1990'].str.replace(',', '').str.replace('%', '').replace('', '0'), errors='coerce')
df['% of world emissions 2022'] = pd.to_numeric(df['% of world emissions 2022'].str.replace('%', '').replace('', '0'), errors='coerce')



# Add temporary 'Color Arrow' column to the dataframe
df['Color Arrow'] = arrows

# Handle formatting of the '% Change from 1990' column, assign negative values for decrease in emissions
df.loc[df['Color Arrow'] == '▲', '% Change from 1990'] *= -1
df = df.drop('Color Arrow',axis=1)
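The arrow handling in the cell above is easier to follow on a tiny input. The sketch below runs the same idea on two rows. The HTML is a hypothetical reduction that reuses only the `title`, `alt`, and `src` values the loop searches for, and the names `is_decrease` and `sample` are illustrative. It also swaps the temporary arrow-character column for a boolean column, which is a simplification rather than the notebook's own approach.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical two-row table; only the span/img attributes mirror the scraping loop.
sample_html = """
<table>
  <tr><td>Albania</td><td><span title="Positive decrease"><img alt="Positive decrease" src="/img/Decrease_Positive.svg"></span> 69.0%</td></tr>
  <tr><td>Aruba</td><td>231.7%</td></tr>
</table>
"""


def is_decrease(row):
    """Return True when the row carries the 'Positive decrease' arrow image."""
    span = row.find("span", {"title": "Positive decrease"})
    img = span.find("img", alt="Positive decrease") if span else None
    return bool(img and "Decrease_Positive.svg" in img.get("src", ""))


rows = BeautifulSoup(sample_html, "html.parser").find_all("tr")
sample = pd.DataFrame(
    {
        "Country/territory": [r.find_all("td")[0].text.strip() for r in rows],
        "% Change from 1990": [r.find_all("td")[1].text.strip() for r in rows],
        "decrease": [is_decrease(r) for r in rows],
    }
)

# Strip the percent sign, convert to float, then negate the rows flagged as decreases.
sample["% Change from 1990"] = pd.to_numeric(
    sample["% Change from 1990"].str.replace("%", "", regex=False), errors="coerce"
)
sample.loc[sample["decrease"], "% Change from 1990"] *= -1
print(sample.drop(columns="decrease"))
```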



### Clean and modify the dataframe 

In [5]:
# Drop the first 6 rows (index 0-5), which precede the per-country data and include the cumulative categories
new_set = df.iloc[6:]

# View the first 20 rows of data to validate results with original webpage
new_set.iloc[:20]

| | Country/territory | GHG emissions 1970 | GHG emissions 1990 | GHG emissions 2005 | GHG emissions 2017 | GHG emissions 2022 | GHG per capita 2022 | % of world emissions 2022 | % Change from 1990 |
|---|---|---|---|---|---|---|---|---|---|
| 6 | Aruba | 45.2 | 214.4 | 462.6 | 467.2 | 496.7 | 4.64 | 0.001 | 231.7 |
| 7 | Afghanistan | 17336.2 | 13775.6 | 18191.3 | 31773.0 | 29117.9 | 0.73 | 0.054 | 211.4 |
| 8 | Angola | 20138.4 | 34957.0 | 73533.9 | 81888.0 | 66480.1 | 1.9 | 0.124 | 190.2 |
| 9 | Anguilla | 4.3 | 8.8 | 18.0 | 34.5 | 28.1 | 1.87 | 0.0 | 317.6 |
| 10 | Albania | 8261.3 | 11568.1 | 8070.1 | 9281.3 | 7983.4 | 2.71 | 0.015 | -69.0 |
| 11 | Netherlands Antilles | 15078.6 | 2855.9 | 6146.2 | 4171.5 | 2154.5 | 13.06 | 0.004 | -75.4 |
| 12 | United Arab Emirates | 29374.9 | 84786.9 | 169629.1 | 270987.8 | 295110.3 | 29.33 | 0.549 | 348.1 |
| 13 | Argentina | 223210.9 | 262670.2 | 344827.1 | 379169.3 | 382992.5 | 8.27 | 0.712 | 145.8 |
| 14 | Armenia | 15350.1 | 24373.3 | 7429.7 | 8624.0 | 9377.0 | 3.19 | 0.017 | -38.5 |
| 15 | Antigua and Barbuda | 170.5 | 246.6 | 265.2 | 335.7 | 359.8 | 3.36 | 0.001 | 145.9 |


In [6]:
# Change data types from string to float
# Copy first so the assignment targets an independent dataframe rather than a slice of df
new_set = new_set.copy()
format_headers = headers[1:-2]
new_set[format_headers] = new_set[format_headers].replace({',': ''}, regex=True).astype(float)
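Pandas raises `SettingWithCopyWarning` when columns are assigned on a dataframe that was produced by slicing another one, because it cannot tell whether the write should reach the parent. A minimal sketch of the usual remedy follows: take an explicit `.copy()` of the slice before converting. The three sample rows reuse emissions values shown earlier, but their comma-formatted string form is an assumption about how the scraped text looks, and the names `raw` and `subset` are illustrative.

```python
import pandas as pd

# Sample rows; the comma-formatted strings are an assumed raw format.
raw = pd.DataFrame(
    {
        "Country/territory": ["Aruba", "Afghanistan", "Argentina"],
        "GHG emissions 2022": ["496.7", "29,117.9", "382,992.5"],
    }
)

# .copy() makes the slice an independent dataframe, so the assignment below
# is unambiguous and does not trigger SettingWithCopyWarning.
subset = raw.iloc[1:].copy()

numeric_cols = ["GHG emissions 2022"]
for col in numeric_cols:
    # Remove thousands separators, then convert; unparseable cells become NaN.
    subset[col] = pd.to_numeric(
        subset[col].str.replace(",", "", regex=False), errors="coerce"
    )

print(subset.dtypes)
```

Using `errors="coerce"` turns unparseable cells into `NaN` instead of raising, which differs from `.astype(float)` and may or may not be the behavior wanted here.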


In [7]:
# Rename dataframe and use describe to inspect baseline trends and statistics
country_data = new_set
country_data.describe()

| | GHG emissions 1970 | GHG emissions 1990 | GHG emissions 2005 | GHG emissions 2017 | GHG emissions 2022 | GHG per capita 2022 | % of world emissions 2022 | % Change from 1990 |
|---|---|---|---|---|---|---|---|---|
| count | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 |
| mean | 115148.1 | 156786.4 | 198499.3 | 239694.7 | 252929.7 | 6.975144 | 0.470236 | 204.021635 |
| std | 467674.1 | 579411.7 | 803933.5 | 1087207.0 | 1212423.0 | 8.709528 | 2.254147 | 472.802651 |
| min | 3.8 | 7.0 | 11.4 | 18.6 | 20.3 | 0.59 | 0.0 | -98.7 |
| 25% | 1886.25 | 2448.9 | 3641.3 | 4535.6 | 5175.45 | 2.2225 | 0.00975 | 101.075 |
| 50% | 13111.4 | 21402.95 | 24365.6 | 32996.5 | 35578.5 | 4.505 | 0.066 | 187.3 |
| 75% | 57437.85 | 82327.73 | 95295.88 | 101641.9 | 114216.8 | 8.3525 | 0.21225 | 264.675 |
| max | 5750030.0 | 6163742.0 | 8431922.0 | 13710100.0 | 15684630.0 | 67.38 | 29.161 | 6336.7 |


### Export the dataframe to CSV for use in SQL data exploration

In [8]:
# Export to csv without index 
country_data.to_csv('GHG_Emission_Data.csv', index=False)
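Passing `index=False` matters for downstream tools: without it, the row index is written as an unnamed first column, which `pd.read_csv` then loads as `Unnamed: 0`. The sketch below checks a round trip through an in-memory buffer rather than a file. The two sample rows reuse values shown earlier, and the names `sample`, `buffer`, and `restored` are illustrative.

```python
import io

import pandas as pd

sample = pd.DataFrame(
    {
        "Country/territory": ["Aruba", "Albania"],
        "GHG emissions 2022": [496.7, 7983.4],
        "% Change from 1990": [231.7, -69.0],
    }
)

# Write to an in-memory buffer instead of disk, omitting the index column.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# float_precision='round_trip' asks the parser to reproduce the written floats exactly.
restored = pd.read_csv(buffer, float_precision="round_trip")

# Raises AssertionError if columns, dtypes, or values differ after the round trip.
pd.testing.assert_frame_equal(sample, restored)
print(restored.dtypes)
```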

## Conclusion and Recap

In this notebook, I used BeautifulSoup to scrape the emissions table from the original Wikipedia article. The resulting dataframe captures the values from the original chart, including the direction of the percentage changes in the last column.

This data can be used for visualizations and further analysis in Python, or exported for use with SQL or Microsoft Excel. Web scraping made it straightforward to build and format a new dataset for future analysis without relying on existing databases.

Please visit my corresponding SQL exploration and Power BI dashboard in my project portfolio, or at my personal portfolio website: www.johnsieve.com