# CS 3654 Team Project

### Team Info:  
Project Title:  Correlations on Climate Change  
Team name:  Greenhouse Guys  
Team member names and PIDs: Atharva Haldankar (ahaldankar), Fairuz Ahmed (ahfairuz), Andrew Ahn (aandrew17), Jonathan Jwa (jonathanyjwa23), Justin Perez (justinmp)

### Project Introduction:

**Initial Description:** We plan to analyze climate data based on country to understand which countries are responsible for the majority of greenhouse gas emissions, what the characteristics of those countries are, and what negative effects greenhouse emissions have on people and the environment.

**Potential research questions:**  
    1. Which countries produce the most greenhouse gases? Which countries produce the least?  
    2. Is there a correlation between GDP and greenhouse gas emissions?  
    3. Does a country's use of renewable energy decrease its emissions?  
    4. Does a country's population or land area have anything to do with greenhouse emissions?  
    5. What forms of government do the countries that produce the most greenhouse gases have?  
    6. Do greenhouse emissions come primarily from urban or rural settings?  
    7. Which countries are affected most by greenhouse emissions?  
    8. Do emissions impact human life expectancy?  
    
**Potential source data:**
1. https://www.kaggle.com/datasets/sudalairajkumar/undata-country-profiles
2. https://worldpopulationreview.com/country-rankings/greenhouse-gas-emissions-by-country
3. https://www.kaggle.com/saurabhshahane/green-house-gas-historical-emission-data  
4. https://www.kaggle.com/brendan45774/countries-life-expectancy

**Question: Does a Country's Population or Land Area have anything to do with greenhouse emissions? (Atharva)**

Does population or land area affect the volume of greenhouse emissions? Measuring the correlation between these variables will help us identify which countries are major contributors of greenhouse emissions. For example, if population and greenhouse emissions are strongly correlated, then we can focus on countries with large populations, since those nations will have the greatest influence on the volume of emissions. The analysis will also give us a better geographic sense of where the major emitters are.

To answer this question, we will use data from https://www.kaggle.com/datasets/sudalairajkumar/undata-country-profiles. This dataset contains general information about each country as well as social, economic, and environmental indicators. It was extracted from information published by the United Nations, so we treat it as an authoritative source.

Before analyzing the data, it helps to state the units used for each variable, following the dataset's column labels. Population is measured in thousands of people, land area in square kilometers, and greenhouse emissions (represented here by the dataset's CO2 emission estimates column) in million tons / tons per capita.

//Maybe add more detail later


In [1]:
import pandas
import numpy

In [2]:
# Read the original data into a pandas dataframe. 
dirty = pandas.read_csv("country_profile_variables.csv")

In [3]:
dirty.head()

Unnamed: 0,country,Region,Surface area (km2),Population in thousands (2017),"Population density (per km2, 2017)","Sex ratio (m per 100 f, 2017)",GDP: Gross domestic product (million current US$),"GDP growth rate (annual %, const. 2005 prices)",GDP per capita (current US$),Economy: Agriculture (% of GVA),...,Mobile-cellular subscriptions (per 100 inhabitants).1,Individuals using the Internet (per 100 inhabitants),Threatened species (number),Forested area (% of land area),CO2 emission estimates (million tons/tons per capita),"Energy production, primary (Petajoules)",Energy supply per capita (Gigajoules),"Pop. using improved drinking water (urban/rural, %)","Pop. using improved sanitation facilities (urban/rural, %)",Net Official Development Assist. received (% of GNI)
0,Afghanistan,SouthernAsia,652864,35530,54.4,106.3,20270,-2.4,623.2,23.3,...,8.3,42,2.1,9.8/0.3,63,5,78.2/47.0,45.1/27.0,21.43,-99
1,Albania,SouthernEurope,28748,2930,106.9,101.9,11541,2.6,3984.2,22.4,...,63.3,130,28.2,5.7/2.0,84,36,94.9/95.2,95.5/90.2,2.96,-99
2,Algeria,NorthernAfrica,2381741,41318,17.3,102.0,164779,3.8,4154.1,12.2,...,38.2,135,0.8,145.4/3.7,5900,55,84.3/81.8,89.8/82.2,0.05,-99
3,American Samoa,Polynesia,199,56,278.2,103.6,-99,-99.0,-99.0,-99.0,...,-99.0,92,87.9,-99,-99,-99,100.0/100.0,62.5/62.5,-99.0,-99
4,Andorra,SouthernEurope,468,77,163.8,102.3,2812,0.8,39896.4,0.5,...,96.9,13,34.0,0.5/6.4,1,119,100.0/100.0,100.0/100.0,-99.0,-99


In [4]:
dirty.dtypes

country                                                        object
Region                                                         object
Surface area (km2)                                             object
Population in thousands (2017)                                  int64
Population density (per km2, 2017)                            float64
Sex ratio (m per 100 f, 2017)                                 float64
GDP: Gross domestic product (million current US$)               int64
GDP growth rate (annual %, const. 2005 prices)                 object
GDP per capita (current US$)                                  float64
Economy: Agriculture (% of GVA)                                object
Economy: Industry (% of GVA)                                  float64
Economy: Services and other activity (% of GVA)               float64
Employment: Agriculture (% of employed)                        object
Employment: Industry (% of employed)                           object
Employment: Services

In [5]:
# Make a copy of the original dataframe and process data for analysis
clean = dirty.copy()

In [21]:
# Take out the ~ symbol
# Note: For computation purposes, we will treat countries that have a really small land area (~0) 
# as having no land area, even though this is clearly not the case. 
clean['Surface area (km2)'] = dirty['Surface area (km2)'].map(lambda val: int(val.replace('~', '')))
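
The cell above assumes every value in `Surface area (km2)` is a string. A slightly more defensive variant is sketched below on invented sample values. The helper name `parse_surface_area` is hypothetical and not part of the notebook. It casts to string first and lets `pandas.to_numeric` do the conversion, so an unparseable entry becomes NaN instead of raising an error.

```python
import pandas

def parse_surface_area(series: pandas.Series) -> pandas.Series:
    """Remove '~' characters and convert to numbers; unparseable entries become NaN."""
    stripped = series.astype(str).str.replace("~", "", regex=False)
    return pandas.to_numeric(stripped, errors="coerce")

# Invented sample values mimicking the raw column ('~0' marks a very small area,
# -99 is the sentinel that shows up in the notebook's output).
sample = pandas.Series(["652864", "~0", "-99"])
parsed = parse_surface_area(sample)
print(parsed.tolist())
```

In the notebook this could be applied as `clean['Surface area (km2)'] = parse_surface_area(dirty['Surface area (km2)'])`, assuming the column holds only numbers and `~`-prefixed numbers.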

In [30]:
# TODO: Maybe interpolate or replace the country surface area with data pulled from other sources
clean[clean['Surface area (km2)'] < 0]

Unnamed: 0,country,Region,Surface area (km2),Population in thousands (2017),"Population density (per km2, 2017)","Sex ratio (m per 100 f, 2017)",GDP: Gross domestic product (million current US$),"GDP growth rate (annual %, const. 2005 prices)",GDP per capita (current US$),Economy: Agriculture (% of GVA),...,Mobile-cellular subscriptions (per 100 inhabitants).1,Individuals using the Internet (per 100 inhabitants),Threatened species (number),Forested area (% of land area),CO2 emission estimates (million tons/tons per capita),"Energy production, primary (Petajoules)",Energy supply per capita (Gigajoules),"Pop. using improved drinking water (urban/rural, %)","Pop. using improved sanitation facilities (urban/rural, %)",Net Official Development Assist. received (% of GNI)
25,"Bonaire, Sint Eustatius and Saba",Caribbean,-99,25,77.4,-99.0,-99,-99.0,-99.0,-99.0,...,-99,56,-99.0,0.3/13.3,0,208,-99,-99,-99.0,-99
130,Mayotte,EasternAfrica,-99,253,674.8,96.7,-99,-99.0,-99.0,-99.0,...,...,88,16.1,-99,0,19,-99,-99,-99.0,-99
193,Sudan,NorthernAfrica,-99,40533,23.0,99.9,79546,4.9,1977.0,32.4,...,26.6,133,-99.0,15.4/0.4,682,16,-99,-99,1.09,-99


In [28]:
# Countries with a negative CO2 emission estimate; these rows will be removed.
clean[clean['CO2 emission estimates (million tons/tons per capita)'] < 0]

Unnamed: 0,country,Region,Surface area (km2),Population in thousands (2017),"Population density (per km2, 2017)","Sex ratio (m per 100 f, 2017)",GDP: Gross domestic product (million current US$),"GDP growth rate (annual %, const. 2005 prices)",GDP per capita (current US$),Economy: Agriculture (% of GVA),...,Mobile-cellular subscriptions (per 100 inhabitants).1,Individuals using the Internet (per 100 inhabitants),Threatened species (number),Forested area (% of land area),CO2 emission estimates (million tons/tons per capita),"Energy production, primary (Petajoules)",Energy supply per capita (Gigajoules),"Pop. using improved drinking water (urban/rural, %)","Pop. using improved sanitation facilities (urban/rural, %)",Net Official Development Assist. received (% of GNI)
3,American Samoa,Polynesia,199,56,278.2,103.6,-99,-99.0,-99.0,-99.0,...,-99.0,92,87.9,-99,-99,-99,100.0/100.0,62.5/62.5,-99.0,-99
7,Antigua and Barbuda,Caribbean,442,102,231.8,92.3,1356,4.1,14764.5,1.9,...,65.2,55,22.3,0.5/5.8,-99,84,97.9/97.9,91.4/91.4,0.12,-99
38,Cayman Islands,Caribbean,264,62,256.5,100.4,3726,0.7,62132.0,0.3,...,77.0,74,52.9,0.5/9.2,-99,130,97.4/...,95.6/...,-99.0,-99
41,Channel Islands,NorthernEurope,180,165,870.1,98.4,-99,-99.0,-99.0,-99.0,...,-99.0,-99,4.2,-99,-99,-99,-99,-99,-99.0,-99
43,"China, Hong Kong SAR",EasternAsia,1106,7365,7014.2,85.1,309236,2.4,42431.0,0.1,...,84.9,64,-99.0,46.2/6.4,-99,83,-99,-99,-99.0,-99
49,Cook Islands,Polynesia,236,17,72.4,97.4,294,5.5,14118.7,8.1,...,-99.0,75,62.9,0.1/3.4,-99,48,99.9/99.9,97.6/97.6,-99.0,-99
80,Gibraltar,SouthernEurope,6,35,3457.1,101.8,-99,-99.0,-99.0,-99.0,...,65.0,31,0.0,0.5/16.5,-99,259,-99,-99,-99.0,-99
85,Guam,Micronesia,549,164,304.1,102.6,-99,-99.0,-99.0,-99.0,...,73.1,99,46.3,-99,-99,-99,99.5/99.5,89.8/89.8,-99.0,-99
91,Holy See,SouthernEurope,0,1,1800.0,219.2,-99,-99.0,-99.0,-99.0,...,-99.0,1,-99.0,-99,-99,-99,-99,-99,-99.0,-99
133,Monaco,WesternEurope,2,39,25969.8,94.7,6258,5.4,165870.6,-99.0,...,93.4,21,-99.0,-99,-99,-99,100.0/...,100.0/...,-99.0,-99


In [32]:
# Drop rows that have either a negative Surface area (km2) or a negative CO2 emission estimate.
# A value of -99 appears to mark missing data; the UN likely didn't have accurate figures for those countries.
nonNegSA = clean[clean['Surface area (km2)'] >= 0]
filtClean = nonNegSA[nonNegSA['CO2 emission estimates (million tons/tons per capita)'] >= 0]
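
An alternative to filtering rows directly is to mark the negative sentinel values as NaN first and drop them afterwards, which keeps the missing-data step explicit. The sketch below uses an invented four-row frame; only the two column names are taken from the dataset, and treating every negative value as missing is an assumption.

```python
import pandas

SA = "Surface area (km2)"
CO2 = "CO2 emission estimates (million tons/tons per capita)"

# Invented rows for illustration only.
toy = pandas.DataFrame({
    "country": ["A", "B", "C", "D"],
    SA: [1000, -99, 250, 40],
    CO2: [12.5, 3.0, -99.0, 0.7],
})

masked = toy.copy()
for col in (SA, CO2):
    # Series.where keeps values where the condition holds and puts NaN elsewhere.
    masked[col] = masked[col].where(masked[col] >= 0)

# Drop any row that is missing either quantity.
kept = masked.dropna(subset=[SA, CO2])
print(kept["country"].tolist())
```

Keeping the NaN-marked frame around (`masked` here) also makes it possible to fill the gaps from another source later instead of discarding those countries.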

In [35]:
# Sanity check: we expect 20 rows to be filtered out by the emission estimates column
# and 3 rows by surface area. That gives 229 rows total - 23 rows = 206 rows.
filtClean.shape

(206, 50)
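
The subtraction in the sanity check is only exact if no row fails both filters. A minimal sketch of how the overlap could be checked is shown below, using an invented five-row frame in which one row fails both conditions. The column names come from the dataset; the values do not.

```python
import pandas

SA = "Surface area (km2)"
CO2 = "CO2 emission estimates (million tons/tons per capita)"

# Invented rows: one fails only the area filter, one only the CO2 filter, one fails both.
toy = pandas.DataFrame({
    SA: [100, -99, 50, -99, 7],
    CO2: [1.2, 0.4, -99.0, -99.0, 3.3],
})

bad_area = toy[SA] < 0
bad_co2 = toy[CO2] < 0

both = int((bad_area & bad_co2).sum())       # rows failing both filters
removed = int((bad_area | bad_co2).sum())    # rows the combined filter drops
expected_rows = len(toy) - removed
naive_rows = len(toy) - int(bad_area.sum()) - int(bad_co2.sum())

print(both, expected_rows, naive_rows)
```

When `both` is zero the two counts agree, which is consistent with the 206 rows reported above.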

In [37]:
# Now that the data is thoroughly cleaned, we can begin analysis
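
As a rough sketch of the analysis step, the correlation between emissions and each candidate variable could be read off a Pearson correlation matrix. The example below uses a small invented frame with exactly linear relationships so the result is easy to verify by hand; the column names are the dataset's, the values are not, and Pearson correlation is an assumed choice rather than the method the team has settled on.

```python
import pandas

POP = "Population in thousands (2017)"
SA = "Surface area (km2)"
CO2 = "CO2 emission estimates (million tons/tons per capita)"

# Invented values: emissions rise linearly with population and fall linearly with area.
toy = pandas.DataFrame({
    POP: [10, 20, 30, 40],
    SA: [40, 30, 20, 10],
    CO2: [1.0, 2.0, 3.0, 4.0],
})

corr = toy.corr()             # DataFrame.corr computes Pearson correlation by default
r_pop = corr.loc[POP, CO2]    # close to +1 for this toy data
r_area = corr.loc[SA, CO2]    # close to -1 for this toy data
print(round(r_pop, 3), round(r_area, 3))
```

On the real data the same call would be made on `filtClean[[POP, SA, CO2]]`, assuming all three columns are numeric after cleaning.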