# CS 3654 Team Project

### Team Info:  
Project Title:  Correlations on Climate Change  
Team name:  Greenhouse Guys  
Team member names and PIDs: Atharva Haldankar (ahaldankar), Fairuz Ahmed (ahfairuz), Andrew Ahn (aandrew17), Jonathan Jwa (jonathanyjwa23), Justin Perez (justinmp)

### Project Introduction:

**Initial Description:** We plan to analyze climate data based on country to understand which countries are responsible for the majority of greenhouse gas emissions, what the characteristics of those countries are, and what negative effects greenhouse emissions have on people and the environment.

**Potential research questions:**  
    1. Which countries produce the most greenhouse gases? Which countries produce the least?  
    2. Is there a correlation between GDP and greenhouse gas emissions?  
    3. Does a country's use of renewable energy decrease their emissions?  
    4. Does a country's population or land area have anything to do with greenhouse emissions?  
    5. What forms of government do the countries that produce the most greenhouse gases have?  
    6. Do greenhouse emissions come primarily from urban or rural settings?  
    7. Which countries are affected most by greenhouse emissions?  
    8. Do emissions impact human life expectancy?  
    
**Potential source data:**
1. https://www.kaggle.com/datasets/sudalairajkumar/undata-country-profiles
2. https://worldpopulationreview.com/country-rankings/greenhouse-gas-emissions-by-country
3. https://www.kaggle.com/saurabhshahane/green-house-gas-historical-emission-data  
4. https://www.kaggle.com/brendan45774/countries-life-expectancy
5. https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

### Individual Contributions: 
Atharva:  
- Completed QACs for questions 4 and 5 in potential research questions section (Population/Land Area vs. Greenhouse Emissions and Government Forms vs. CO2 Emissions).  
- Edited final report and added more information to Q and C sections of other team members' reports.  

Justin:  
- QAC on GDP vs. Greenhouse Emissions  

Andrew:  
- QAC on Life Expectancy vs. Greenhouse Emissions

Fairuz:  
- QAC on Renewable Energy and CO2 emissions

Jonathan:  
- QAC on Countries that Produce the Most and Least CO2 Emissions

### Procedural Notes
- When analyzing population vs. CO2 emissions, we tried fitting a logarithmic model to the data, due to how the data was structured. However, since this model had a lower R-value than the linear model, we decided not to include those results. 
- For population vs. CO2 emissions, we also tried fitting a polynomial model. However, despite specifying the model as a polynomial of degree 2, we still got a linear fit, since the model computed an x^2 coefficient of 0. 
- For our initial questions, we used the CO2 emission estimates column from the country profiles Kaggle dataset (link 1). However, since the emissions data from the World Population Review site (link 2) is slightly more up to date, we used that source for some of the later questions. 
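
The model comparison described in these notes can be sketched as follows. This is a minimal illustration on synthetic data: the arrays `x` and `y` and the helper `r_value` are hypothetical, not the project's dataset or code, and `numpy.polyfit` stands in for whichever fitting routine was originally used.

```python
import numpy as np

# Synthetic, hypothetical data (not the project's dataset): y grows linearly with x.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 100.0, 50)
y = 3.0 * x + 5.0 + rng.normal(0.0, 10.0, size=x.size)

def r_value(y_true, y_pred):
    """Pearson correlation between observed and fitted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Linear model: y ~ a*x + b
lin_coefs = np.polyfit(x, y, deg=1)
r_linear = r_value(y, np.polyval(lin_coefs, x))

# Logarithmic model: y ~ a*ln(x) + b (a straight line fitted against ln(x))
log_coefs = np.polyfit(np.log(x), y, deg=1)
r_log = r_value(y, np.polyval(log_coefs, np.log(x)))

# Degree-2 polynomial: y ~ c2*x^2 + c1*x + c0.
# A c2 close to 0 means the quadratic fit is effectively a straight line.
quad_coefs = np.polyfit(x, y, deg=2)

print(f"linear R = {r_linear:.3f}, log R = {r_log:.3f}, x^2 coefficient = {quad_coefs[0]:.4f}")
```

When the underlying relationship is close to linear, the linear model should score the higher R-value and the fitted x^2 coefficient should sit near zero, which is consistent with the behavior described in the first two notes.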

**DimRed and Analysis:** 

In [None]:
import pandas
import numpy
import sklearn.metrics.pairwise
import sklearn.manifold
import sklearn.cluster
import matplotlib
import matplotlib.pyplot as plt
# pandas.options.mode.chained_assignment = None  # default='warn'

In [None]:
dirtyCountries = pandas.read_csv("country_profile_variables.csv")
dirtyCountries

cont_preJoin = dirtyCountries.copy()

# Relevant Columns: 
# Country, Surface Area, Population (from other data set), 
# GDP, GDP growth rate, Economy: Agriculture, 
# Threatened Species, CO2 emission estimates (other dataset), Energy production (Petajoules)
cont_preJoin = cont_preJoin[['country', 'Surface area (km2)', 'GDP: Gross domestic product (million current US$)', 
                               'GDP growth rate (annual %, const. 2005 prices)', 'Economy: Agriculture (% of GVA)', 
                             'Economy: Industry (% of GVA)', 'Economy: Services and other activity (% of GVA)',
                               'Threatened species (number)', 'Energy production, primary (Petajoules)',
                                'Population in thousands (2017)', 'Population density (per km2, 2017)']]
cont_preJoin.shape
cont_preJoin.dtypes
# cont_preJoin.head()

In [None]:
# Cleaning: Make sure columns have appropriate data types. 
cont_preJoin['Surface area (km2)'] = dirtyCountries['Surface area (km2)'].map(lambda val: int(val.replace('~', '')))
cont_preJoin['GDP growth rate (annual %, const. 2005 prices)'] = dirtyCountries['GDP growth rate (annual %, const. 2005 prices)'].map(lambda val : float(val.replace('~', '')))
cont_preJoin['Economy: Agriculture (% of GVA)'] = dirtyCountries['Economy: Agriculture (% of GVA)'].map(lambda val : float(val.replace('~', '')))
cont_preJoin['Threatened species (number)'] = dirtyCountries['Threatened species (number)'].map(lambda val : float(val.replace('~', '')))

In [None]:
cont_preJoin.shape
cont_preJoin.head(5)
# cont_preJoin[:-10]
cont_preJoin.dtypes

In [None]:
# Now, make sure values are in proper range
filteredOutRows = cont_preJoin.loc[cont_preJoin.eq(-99).any(axis=1), :]  # axis must be passed by keyword in pandas 2.x
len(filteredOutRows)

# Map all -99s to NAs. 
cont_preJoin = cont_preJoin.replace(-99, numpy.nan)
cont_preJoin = cont_preJoin.dropna()
cont_preJoin.shape
cont_preJoin.head()
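
The sentinel handling in this cell can be illustrated on a tiny table. This is a minimal sketch, assuming that -99 is the dataset's marker for a missing value; the frame `demo` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; -99 marks a missing measurement.
demo = pd.DataFrame({
    "country": ["A", "B", "C"],
    "Surface area (km2)": [100, -99, 300],
    "Threatened species (number)": [5.0, 7.0, -99.0],
})

# Flag rows containing at least one sentinel value (axis=1 checks across columns).
has_sentinel = demo.eq(-99).any(axis=1)

# Replace the sentinel with NaN, then drop the incomplete rows.
cleaned = demo.replace(-99, np.nan).dropna()
print(int(has_sentinel.sum()), "rows flagged;", len(cleaned), "rows kept")
```

Counting the flagged rows before dropping them gives a quick check that `dropna()` removes exactly the rows expected.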

In [None]:
# Now, process the other dataset. 
dirtyEmissions = pandas.read_csv("CO2Emissions.csv")

# Create a copy of this dataset to preserve the original. Only keep the relevant columns. 
emissions_preJoin = dirtyEmissions.copy()
emissions_preJoin = emissions_preJoin[['Country Name', '2017']]

# Overview information about the dataset. 
emissions_preJoin.shape
emissions_preJoin.dtypes
emissions_preJoin.head()

In [None]:
# Clean emissions_preJoin by dropping NaN values. Let's rename the '2017' column to 'Emissions 2017'
emissions_preJoin = emissions_preJoin.dropna()
emissions_preJoin.rename(columns={'2017':'Emissions 2017 (Metric Tons Per Capita)'}, inplace=True)
# emissions_preJoin = emissions_preJoin.reset_index(drop=True)
emissions_preJoin.head()
# emissions_preJoin[emissions_preJoin['Country Name'] == 'Russia']

In [None]:
# Check for any missing or invalid values in the emissions dataset
len(emissions_preJoin[emissions_preJoin['Emissions 2017 (Metric Tons Per Capita)'] < 0])

In [None]:
# Before we join: Let's make sure that the major emitters of CO2 emissions are all represented. 
# Create a dictionary which maps country names in cont_preJoin to the corresponding names in emissions_preJoin
countryMappings = {
    'United States of America': 'United States',
    # 'Russian Federation' : 'Russia',
    'Republic of Korea' : 'Korea, Rep.',
    'Viet Nam' : 'Vietnam',
    'Czechia' : 'Czech Republic'
}

In [None]:
# It looks like we are all good with cleaning. Now, let's join cont_preJoin with emissions_preJoin. 
# Do an inner join so we don't get any missing or NaN values. 
cont_preJoin.country = cont_preJoin.country.map(lambda c : countryMappings[c] if c in countryMappings.keys() else c)
cont_preJoin.country = cont_preJoin.country.map(lambda c : c[:(c.find("(") - 1)] if c.find("(") != -1 else c)
# cont_preJoin.country = cont_preJoin.country.map(lambda c : c.upper())
countryStats = pandas.merge(cont_preJoin, emissions_preJoin, how='inner', left_on='country', right_on='Country Name')
countryStats.shape
# emissions_preJoin[emissions_preJoin['Country Name'] == 'Russian Federation']
# cont_preJoin.head()
# cont_preJoin.shape
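
Because an inner join silently drops countries whose names differ between the two sources, it can help to list the unmatched names before merging. The sketch below uses plain Python and hypothetical name lists (`un_names`, `emission_names`, `name_map`); it is not the project's actual data.

```python
# Hypothetical name lists standing in for the two datasets' country columns.
un_names = [
    "United States of America",
    "Viet Nam",
    "Bolivia (Plurinational State of)",
    "Russian Federation",
]
emission_names = {"United States", "Vietnam", "Bolivia", "Russia"}

# Explicit renames, in the same spirit as countryMappings above.
name_map = {"United States of America": "United States", "Viet Nam": "Vietnam"}

def normalize(name):
    """Apply the explicit mapping, then drop any trailing parenthetical qualifier."""
    name = name_map.get(name, name)
    paren = name.find("(")
    return name[:paren].rstrip() if paren != -1 else name

normalized = [normalize(n) for n in un_names]

# Names that would be lost in an inner join.
unmatched = [n for n in normalized if n not in emission_names]
print(unmatched)
```

Using `rstrip()` instead of slicing one character before the parenthesis avoids assuming that a space always precedes the `(`.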

In [None]:
# We got the majority of countries from both datasets. Here's a sample of the joined data. 
countryStats.shape
countryStats.head()

In [None]:
# Now, we are ready to analyze the data. 
# Let's begin by normalizing the data, so that the columns are weighted equally. 
countryStats.head()
preNorm = countryStats.drop(['country', 'Country Name'], axis=1)
preNorm.head()
norm = (preNorm-preNorm.mean())/(preNorm.std())
norm.head()
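
The normalization above is a z-score: each column has its mean subtracted and is divided by its standard deviation. Here is a from-scratch sketch on one hypothetical column, using only the standard library. pandas' `.std()` defaults to the sample standard deviation (n - 1 in the denominator), which is what `statistics.stdev` computes.

```python
import statistics

# Hypothetical column values; not taken from the project's data.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(values)
sample_std = statistics.stdev(values)  # sample standard deviation (n - 1)

# z-score: how many standard deviations each value lies from the mean.
z_scores = [(v - mean) / sample_std for v in values]
print([round(z, 3) for z in z_scores])
```

After this transformation every column has mean 0 and standard deviation 1, so no single column dominates the Euclidean distances used later.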

In [None]:
# Let's visualize the data using a parallel coordinates plot. 
normWithCountries = norm.join(countryStats.country)
# Rename the columns for readability
normWithCountries.columns = ['Surface area', 'GDP', 'GDP growth rate', 'Economy: Agriculture', 
                                'Economy: Industry', 'Economy: Services and other', 
                                'Threatened species', 'Energy production', 'Population', 
                                'Population density', 'CO2 estimates', 'country']
parallelPlot = pandas.plotting.parallel_coordinates(normWithCountries, class_column='country', colormap='rainbow_r')
parallelPlot.figure.set_size_inches(30,30, forward=True)

In [None]:
norm.index = normWithCountries.country
norm.head()

In [None]:
# Now let's perform dimension reduction. 
# First, compute the distance matrix. 
distHD = sklearn.metrics.pairwise.euclidean_distances(norm)
distHD = pandas.DataFrame(distHD, columns=norm.index, index=norm.index)
distHD

In [None]:
# Compute the distance matrix for the weighted high-dimensional data using the L2 (Euclidean) distance function.
#  Input HD data should already be weighted.
def distance_matrix_HD(dataHDw):  # dataHDw (pandas or numpy) -> distance matrix (numpy)
    dist_matrix = sklearn.metrics.pairwise.euclidean_distances(dataHDw)
    #m = pd.DataFrame(m, columns=dataHD.index, index=dataHD.index)  # keep as np array for performance
    return dist_matrix

# Compute the distance matrix for 2D projected data using L2 distance function.
def distance_matrix_2D(data2D):  # data2d (pandas or numpy) -> distance matrix (numpy)
    dist_matrix = sklearn.metrics.pairwise.euclidean_distances(data2D) 
    #m = pd.DataFrame(m, columns=data2D.index, index=data2D.index) # keep as np array for performance
    return dist_matrix

#def dist(x,y):
#    return np.linalg.norm(x-y, ord=2)


In [None]:
# Calculate the MDS stress metric between HD and 2D distances.  Uses numpy for efficiency.
def stress(distHD, dist2D):  #  distHD, dist2D (numpy) -> stress (float)
    #s = np.sqrt((distHD-dist2D).pow(2).sum().sum() / distHD.pow(2).sum().sum())  # pandas
    #s = np.sqrt(((distHD-dist2D)**2).sum() / (distHD**2).sum())   # numpy
    s = ((distHD-dist2D)**2).sum() / (distHD**2).sum()   # numpy, eliminate sqrt for efficiency
    return s

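def compute_mds(dataHDw):  # dataHDw -> data2D (numpy array, n x 2)
    # NOTE: distances are computed from the global `norm`, not from `dataHDw`,
    # so the weights passed to dimension_reduction() do not affect the projection.
    # distHD = distance_matrix_HD(dataHDw)
    distHD = sklearn.metrics.pairwise.euclidean_distances(norm)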
    # Adjust these parameters for performance/accuracy tradeoff
    mds = sklearn.manifold.MDS(n_components=2, dissimilarity='precomputed', n_init=10, max_iter=1000)
    # Reduction algorithm happens here:  data2D is nx2 matrix
    data2D = mds.fit_transform(distHD)
    
    # Rotate the resulting 2D projection to make it more consistent across multiple runs.
    # Set the 1st PC to the y axis, plot looks better to spread data vertically with horizontal text labels
    # pca = sklearn.decomposition.PCA(n_components=2)
    # data2D = pca.fit_transform(data2D)
    # data2D = pd.DataFrame(data2D, columns=['y','x'], index=dataHDw.index)
    
    # data2D.stress_value = stress(distHD, distance_matrix_2D(data2D))
    return data2D

def dimension_reduction(dataHD, wts): # dataHD, wts -> data2D (numpy array, n x 2)
    # Normalize the weights to sum to 1
    wts = wts/wts.sum()
    
    # Apply weights to the HD data 
    dataHDw = dataHD * wts
    
    # DR algorithm
    data2D = compute_mds(dataHDw)

    # Compute row relevances as:  data dot weights
    # High relevance means large values in upweighted dimensions
    # data2D['relevance'] = dataHDw.sum(axis=1)
    return data2D
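
To make the `stress` function above concrete, here is a small worked sketch. The point sets and the helper names (`pairwise_euclidean`, `squared_stress`) are hypothetical. Like `stress`, it omits the square root, so it returns the squared normalized stress.

```python
import numpy as np

def pairwise_euclidean(points):
    """Full pairwise Euclidean (L2) distance matrix for an (n, d) array."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

def squared_stress(dist_hd, dist_2d):
    """Sum of squared distance errors, normalized by the squared HD distances."""
    return ((dist_hd - dist_2d) ** 2).sum() / (dist_hd ** 2).sum()

# Hypothetical points whose third coordinate is always zero:
# dropping it loses nothing, so the stress is 0.
flat_hd = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [6.0, 0.0, 0.0]])
perfect = squared_stress(pairwise_euclidean(flat_hd), pairwise_euclidean(flat_hd[:, :2]))

# Here the third coordinate carries information, so projecting onto
# the first two coordinates distorts the distances and the stress is positive.
tall_hd = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 5.0]])
lossy = squared_stress(pairwise_euclidean(tall_hd), pairwise_euclidean(tall_hd[:, :2]))

print(perfect, lossy)
```

A value near 0 means the 2D layout preserves the high-dimensional distances well; larger values indicate more distortion.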

In [None]:
# Now, use the MDS algorithm to reduce the data down to 2 dimensions. 

weights = pandas.Series(1.0, index=norm.columns)  # one equal weight per column of norm
data2D = dimension_reduction(norm, weights)

# mds = sklearn.manifold.MDS(n_components=2, dissimilarity='precomputed') # TODO: Change parameters if necessary. 
# data2D = mds.fit_transform(distHD)
data2D = pandas.DataFrame(data2D, columns=['x', 'y'], index=norm.index)
data2D
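
When a weight Series multiplies a DataFrame, pandas aligns the Series index with the DataFrame's column labels. The sketch below, on a hypothetical two-column frame, shows why the weights need to be indexed by column name for the multiplication to have any effect.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the normalized data.
frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Default integer index (0, 1) does not match the column labels ("a", "b"),
# so every cell of the product is NaN.
misaligned = frame * pd.Series([0.5, 0.5])

# Indexing the weights by column name applies one weight per column.
aligned = frame * pd.Series(0.5, index=frame.columns)

print(misaligned.isna().all().all())
print(aligned)
```

Building the weights with `index=frame.columns` also guarantees that the number of weights matches the number of columns.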

In [None]:
## Plot the 2D data
data2D_v = data2D.join(norm['Emissions 2017 (Metric Tons Per Capita)'])
data2D_v.head()
ax = data2D_v.plot.scatter('x', 'y', c='Emissions 2017 (Metric Tons Per Capita)', 
                           s=40, colormap=plt.cm.rainbow, figsize=(15,15), sharex=False)
# ax.axis('scaled')
for i,r in data2D.iterrows():
   ax.text(r.x, r.y, i[0:3])

In [None]:
# Let's Cluster the data to see how we can group countries together. 
# First, let's find the optimal number of clusters. 

kVals = []
twcv = []
for k in range(1, len(norm)):
    centroids = norm.iloc[0:k]
    km = sklearn.cluster.KMeans(n_clusters=k, init=centroids, n_init=1, max_iter=10)
    km.fit(norm)
    kVals.append(k)
    twcv.append(km.inertia_)
d = {'K': kVals, 'Inertia': twcv}
Answer2 = pandas.DataFrame(data=d)
plt.figure(figsize=(8, 8))
plt.plot(Answer2.K, Answer2.Inertia, marker='o')
Answer2[Answer2.K == 7]
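
The quantity plotted above, `km.inertia_`, is the sum of squared distances from each sample to its closest cluster center. A from-scratch sketch on hypothetical one-dimensional data shows why it falls sharply once the number of centroids matches the natural grouping.

```python
# Hypothetical 1-D data with two obvious groups.
points = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points)

one_cluster = inertia(points, [7.0])         # single centroid at the overall mean
two_clusters = inertia(points, [2.0, 12.0])  # one centroid per group

print(one_cluster, two_clusters)
```

Inertia keeps shrinking as K grows, so the elbow method looks for the point where extra clusters stop producing a large drop rather than for the minimum.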

In [None]:
# It looks like there's a steep drop in within-cluster variance just around K = 7. 
# Choosing more clusters will make it more difficult to meaningfully group countries, so 
# let's choose 7 clusters for now. 
km = sklearn.cluster.KMeans(n_clusters=7)
labels = km.fit_predict(norm)
# labels
labels = pandas.DataFrame(labels, columns=['Cluster'], index=norm.index)
labels

# labels.sort_values('Cluster')

In [None]:
# Now, let's plot the results of the clustering. 
data2DClustered = data2D.join(labels.Cluster)
data2DClustered
ax = data2DClustered.plot.scatter('x', 'y', c='Cluster', colormap=plt.cm.viridis, figsize=(10, 10), sharex=False)
ax.axis('scaled')
# for i,r in data2D.iterrows():
#     ax.text(r.x, r.y, i)

## Question: Does a Country's Population or Land Area have anything to do with greenhouse emissions? (Atharva)

Does population or land area affect the volume of greenhouse emissions? By determining whether these variables are correlated, we can better identify which countries are major contributors of greenhouse emissions. For example, if population and greenhouse emissions are strongly correlated, then we can focus on countries with large populations, since those nations will have the greatest influence over the global volume of emissions. Furthermore, we'll get a better geographic sense of which countries are major contributors of emissions.  

Hypothesis 1: We should expect countries with larger populations to emit more CO2 into the atmosphere. This will most likely be the case, since a larger population typically consumes more energy than a smaller population. Many countries meet their energy needs by burning coal and other fossil fuels, and these sources of energy release CO2 into the atmosphere. 
  
  
Hypothesis 2: Countries with larger land areas will, on average, emit more CO2 into the atmosphere than smaller countries. Many of the major exporters of the world are nations which have a large surface area, and countries which have more economic activity would most likely release greater amounts of CO2 than countries with less active economies.

In order to answer this question, data from https://www.kaggle.com/datasets/sudalairajkumar/undata-country-profiles will be utilized. This data contains general information about each of the countries as well as social, economic, and environmental indicators. The dataset was extracted from information published by the United Nations, so it is a good authoritative source. 

Before analyzing the data, it will be helpful to define what units population, land area, and greenhouse emissions are measured in. Population will be measured in thousands of people, land area will be measured in square kilometers, and greenhouse emissions will be quantified in million tons / tons per capita.  

Throughout this report, the only major bias present in our work is that the authors of this report believe that human activity has altered Earth's climate system and that an excess amount of greenhouse gases in the atmosphere can have a negative effect on the environment. 

## Analysis: 
First we import some libraries that we will need. Pandas is a general purpose data analysis library and numpy is useful for certain mathematical operations, like matrix multiplications. The sklearn.linear_model module will allow for a linear regression line to be fitted to the given data. 

In [None]:
import pandas
import numpy
from sklearn.linear_model import LinearRegression

Let's use the countryStats dataset from the clustering analysis above. It contains all the relevant columns and is already cleaned, so we can begin right away with visualization and analysis.  

Below are 2-D scatterplots which show the relationships between Surface Area vs. CO2 emissions and Population (thousands) vs. CO2 emissions

In [None]:
# Let's add a column for raw emissions. 
countryStats['Raw Emissions (Million Metric Tons)'] = countryStats['Emissions 2017 (Metric Tons Per Capita)'] * countryStats['Population in thousands (2017)'] / 1000
countryStats.plot.scatter(x='Surface area (km2)', y='Raw Emissions (Million Metric Tons)', figsize=(10,5))
countryStats.plot.scatter(x='Population in thousands (2017)', y='Raw Emissions (Million Metric Tons)', color='green', figsize=(10,5))
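
The unit conversion behind the raw emissions column can be checked with a short worked example. The function name and the sample country below are hypothetical.

```python
def raw_emissions_million_tons(tons_per_capita, population_thousands):
    """Convert per-capita emissions to total emissions in million metric tons."""
    people = population_thousands * 1_000   # thousands of people -> people
    total_tons = tons_per_capita * people   # metric tons
    return total_tons / 1_000_000           # metric tons -> million metric tons

# Hypothetical country: 5 t per person, population of 20,000 thousand (20 million people).
print(raw_emissions_million_tons(5.0, 20_000))
```

Multiplying by 1,000 and dividing by 1,000,000 is the same as the single division by 1,000 used in the cell above.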

Now, let's compute the Pearson Correlation Coefficients for Surface Area vs. Raw CO2 emissions and for Population (thousands) vs. Raw CO2 emissions.

In [None]:
countryStats['Surface area (km2)'].corr(countryStats['Raw Emissions (Million Metric Tons)'])

In [None]:
countryStats['Population in thousands (2017)'].corr(countryStats['Raw Emissions (Million Metric Tons)'])
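
For reference, the Pearson coefficient returned by `.corr()` is the covariance of the two variables divided by the product of their standard deviations. A from-scratch sketch on hypothetical values:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and the two sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, not taken from the project's data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(xs, ys)
print(round(r, 4))
```

The coefficient ranges from -1 to 1 and measures only the strength of a linear relationship.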

We get a fairly strong correlation coefficient for Population vs. CO2 Emissions. The correlation coefficient for Surface area vs. CO2 is slightly weaker, probably due to outliers. Let's see what happens if we remove these outliers. 

In [None]:
countryStatsNoOutliers = countryStats[countryStats['Raw Emissions (Million Metric Tons)'] < 4000]
countryStatsNoOutliers['Surface area (km2)'].corr(countryStatsNoOutliers['Raw Emissions (Million Metric Tons)'])

Interestingly, we get about the same correlation coefficient. It looks like the outliers didn't really affect the Pearson coefficient.  
Let's create a 3-D visualization of the data with Surface area and Population on the x and y axes and CO2 emissions on the z axis. 

In [None]:
# Graphing libraries
# %matplotlib notebook
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
sc = ax.scatter(countryStats['Surface area (km2)'], countryStats['Population in thousands (2017)'], countryStats['Raw Emissions (Million Metric Tons)'], 
                s=30)
xl = ax.set_xlabel('Surface area (km2)')
yl = ax.set_ylabel('Population in thousands')
zl = ax.set_zlabel('Raw Emissions (Million Metric Tons)')

Now, what happens if we do a multiple linear regression analysis using both surface area and population as independent variables?

In [None]:
X = countryStats[['Surface area (km2)','Population in thousands (2017)']]
lmMult = LinearRegression().fit(X, countryStats['Raw Emissions (Million Metric Tons)'])
lmMult.coef_,lmMult.intercept_
# multiRegR = (lmMult.score(X, countryStats['Raw Emissions (Million Metric Tons)']))**0.5
# multiRegR

Our multiple linear regression model gives an R-value of ~0.85, which is better than both of the individual R-values.  
Let's overlay the predicted data from the multiple linear regression model with the actual data. This will help us see sources of error between the predictions and actual data. 
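
The R-value of a multiple regression is the square root of its coefficient of determination (R^2), which is what `LinearRegression.score` returns. The sketch below recomputes it with an ordinary least squares fit in NumPy on synthetic data. The arrays `X` and `y` are hypothetical stand-ins, not the project's data.

```python
import numpy as np

# Hypothetical predictors and response (stand-ins for surface area, population, emissions).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(40, 2))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 1.0, size=40)

# Ordinary least squares with an intercept column.
design = np.column_stack([X, np.ones(len(X))])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ coefs

# Coefficient of determination and the multiple correlation coefficient.
ss_res = ((y - y_hat) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1.0 - ss_res / ss_tot
multiple_r = np.sqrt(r_squared)

# For OLS with an intercept, multiple R equals the correlation between y and y_hat.
r_check = np.corrcoef(y, y_hat)[0, 1]

# Correlation of each single predictor with the response, for comparison.
single_r = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
print(multiple_r, single_r)
```

On the data used to fit the model, the multiple R can never be lower than the absolute correlation of any single predictor, so a higher value does not by itself show that the second variable adds much predictive power.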

In [None]:
# Overlay the predicted CO2 levels on the plot with actual Surface Area vs. CO2 data. 
# Create a predictedCO2 column
# y = m1*x1 + m2*x2 + b
predictedCO2 = lmMult.coef_[0]*countryStats['Surface area (km2)'] + lmMult.coef_[1]*countryStats['Population in thousands (2017)'] + lmMult.intercept_
countryStatsPred = countryStats.assign(Predicted_CO2 = predictedCO2)
countryStatsPred.head()

sc2 = ax.scatter(countryStatsPred['Surface area (km2)'], countryStatsPred['Population in thousands (2017)'], 
           countryStatsPred['Predicted_CO2'], c='r', marker='x', s=30)
fig

We see the overall trend, but there's a really large cluster of points in the bottom right corner. Let's zoom in on that area of the plot. 

In [None]:
ax.set_xlim([0, 0.25e+07])
ax.set_zlim([0, 1000])
fig

This gives us a better sense of the predictions in relation to the actual data points. It looks like, for the most part, the predictions are fairly close to the actual values. This would explain the high R-value. 

Read in the original country profile data

In [None]:
# Read the original data into a pandas dataframe. 
dirty = pandas.read_csv("country_profile_variables.csv")

Here's a glance at the data:   

In [None]:
dirty.head()

These are the data types. Notice how the Surface area (km2) column has an object data type. We should probably clean this column so the data is in a more useful form. 

In [None]:
dirty.dtypes

Before doing any further analysis, we retain a copy of the original data. This way, we can track any modifications we choose to make. 

In [None]:
# Make a copy of the original dataframe and process data for analysis
clean = dirty.copy()

Clean the data by taking out the ~ symbol  
Note: For computation purposes, we will treat countries that have a really small land area (~0) as having no land area, even though this is clearly not the case. 

In [None]:
clean['Surface area (km2)'] = dirty['Surface area (km2)'].map(lambda val: int(val.replace('~', '')))

Some countries have a land area < 0 or emission estimates that are < 0. 

In [None]:
# Display the countries that meet these criteria (they are filtered out in the next cell). 
clean[clean['Surface area (km2)'] < 0]
clean[clean['CO2 emission estimates (million tons/tons per capita)'] < 0]

We will filter out rows that have a negative Surface area (km2) or negative CO2 emission estimates.  
The UN likely didn't have accurate data on those countries

In [None]:
nonNegSA = clean[clean['Surface area (km2)'] >= 0]
filtClean = nonNegSA[nonNegSA['CO2 emission estimates (million tons/tons per capita)'] >= 0]

Sanity check: We would expect that 20 rows are filtered out based on the emission estimates column and 3 rows are filtered out based on surface area. That gives 229 rows - 23 rows = 206 rows. 

In [None]:
filtClean.shape

In [None]:
filtClean.head()

Now that the data is thoroughly cleaned, we can begin visualization and analysis.  
We first create scatterplots for both Surface Area vs. CO2 emissions and Population (thousands) vs. CO2 emissions

In [None]:
filtClean.plot.scatter(x='Surface area (km2)', y='CO2 emission estimates (million tons/tons per capita)', figsize=(10,5))
filtClean.plot.scatter(x='Population in thousands (2017)', y='CO2 emission estimates (million tons/tons per capita)', color='green', figsize=(10,5))

Then, we compute the Pearson correlation coefficients for Surface Area vs. CO2 emissions and for Population (thousands) vs. CO2.  

In [None]:
filtClean['Surface area (km2)'].corr(filtClean['CO2 emission estimates (million tons/tons per capita)'])

In [None]:
filtClean['Population in thousands (2017)'].corr(filtClean['CO2 emission estimates (million tons/tons per capita)'])

These R values are both near or in the 0.70-0.80 range, so they indicate a relatively good linear fit.  
Let's go ahead and create a linear regression model for both pairs of x,y data. 

In [None]:
# Fit the Surface Area vs. CO2 emissions data to a linear regression model. 
lmSA = LinearRegression().fit(filtClean[['Surface area (km2)']], filtClean[['CO2 emission estimates (million tons/tons per capita)']])
lmSA.coef_, lmSA.intercept_

In [None]:
# Fit the Population (thousands) vs. CO2 emissions data to a linear regression model. 
lmPop = LinearRegression().fit(filtClean[['Population in thousands (2017)']], filtClean[['CO2 emission estimates (million tons/tons per capita)']])
lmPop.coef_, lmPop.intercept_

Create a new data table which has a column for predicted CO2 levels with the SA vs. CO2 data

In [None]:
# Used assign() to create a new DataFrame with the Predicted_CO2 column because of the 
# SettingWithCopy warning. 

# Create a predictedCO2 column for Surface Area
predictedCO2 = filtClean['Surface area (km2)']*lmSA.coef_[0] + lmSA.intercept_
filtCleanSA = filtClean.assign(Predicted_CO2 = predictedCO2)
filtCleanSA.head()

Do the same, except for the Pop. (thousands) vs. CO2 data

In [None]:
# Create a predictedCO2 column for Population
predictedCO2 = filtClean['Population in thousands (2017)']*lmPop.coef_[0] + lmPop.intercept_
filtCleanPop = filtClean.assign(Predicted_CO2 = predictedCO2)
filtCleanPop.head()

Visualize the results.  
Overlay the actual data with the predicted data for both x,y pairs. 

In [None]:
# Overlay the predicted CO2 levels on the plot with actual Surface Area vs. CO2 data. 
axSA = filtCleanSA.plot.scatter(x='Surface area (km2)', y='CO2 emission estimates (million tons/tons per capita)', figsize=(10,8))
filtCleanSA.plot.scatter(x='Surface area (km2)', y='Predicted_CO2', ax=axSA, color='red')
axSA = axSA.set_ylabel('CO2 emissions (million tons/tons per capita)')

In [None]:
# Visualization Population (thousands) vs. CO2 data
axPop = filtCleanPop.plot.scatter(x='Population in thousands (2017)', y='CO2 emission estimates (million tons/tons per capita)', color='green', figsize=(10,8))
filtCleanPop.plot.scatter(x='Population in thousands (2017)', y='Predicted_CO2', ax=axPop, color='red')
axPop = axPop.set_ylabel('CO2 emissions (million tons/tons per capita)')

Make sure the regression models are consistent with what we would expect. 

In [None]:
# Checkpoint: Verify the predicted columns are correct. 
filtCleanSA['Surface area (km2)'].corr(filtCleanSA['Predicted_CO2']), filtCleanPop['Population in thousands (2017)'].corr(filtCleanPop['Predicted_CO2'])
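
This checkpoint works because a prediction of the form slope * x + intercept is an exact linear function of x, so its correlation with x is 1 for a positive slope and -1 for a negative one. A minimal sketch with hypothetical values:

```python
import numpy as np

x = np.array([10.0, 250.0, 3_000.0, 45_000.0, 900_000.0])  # hypothetical surface areas
slope, intercept = 4.0e-3, 12.0                              # hypothetical fitted coefficients

# Predictions are a linear function of x, so they correlate perfectly with x.
predicted = slope * x + intercept
r_up = np.corrcoef(x, predicted)[0, 1]

# A negative slope flips the sign of the correlation but not its magnitude.
r_down = np.corrcoef(x, -slope * x + intercept)[0, 1]

print(r_up, r_down)
```

A result of 1 therefore confirms that the predicted column was built correctly, but it says nothing about how well the model fits the observed emissions.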

## Conclusion:

From the above analysis, it appears that land area and population do have some relation to CO2 emissions. In both cases, there is a positive correlation with a small slope. As surface area increases by 1 km^2, the model predicts an increase of 4.33*10^-3 millions of tons of CO2 / tons per capita. Similarly, as population increases by a thousand people, the model indicates that there should be about an increase of 0.052 millions of tons of CO2 / tons per capita. According to the linear regression model, countries which have a larger surface area on average produce more CO2 emissions. Likewise, countries which have a greater population on average produce more CO2 than countries with smaller populations. These results support both of the hypotheses stated above. 

However, the models created in the analysis section are only approximations and leave out important information. From the graph of Surface Area (km2) vs. CO2 emissions, we can see a few countries which emit a substantially larger volume of CO2 than other nations. These countries also lie above the regression line for both plots, which means they emit more CO2 than the models predict. 

One additional question that can be explored is whether population density is a better predictor of CO2 emissions than total population. A high population density could be associated with urban areas and cities, and may therefore correlate strongly with CO2 emissions. Another potential question to consider is whether CO2 emissions can be better predicted using both surface area and population as independent variables. This would require a multiple linear regression analysis. Finally, while the CO2 emissions from this dataset are normalized by population (i.e. reported per capita), it may be worth considering what effect population and surface area have on raw CO2 emissions, for example in units of cubic meters. 

## Question: Is there a correlation between GDP and greenhouse gas emissions? (Justin)

Does GDP affect the volume of greenhouse emissions? By determining a correlation between these variables, we can better determine how a country's economy affects its CO2 emissions. We would expect countries with larger economies, and therefore larger GDPs, to produce more CO2 emissions, due to more energy demands and pollution from factories. 

In order to answer this question, data from https://www.kaggle.com/datasets/sudalairajkumar/undata-country-profiles will be utilized. This data contains general information about each of the countries as well as social, economic, and environmental indicators. The dataset was extracted from information published by the United Nations, so it is a good authoritative source. 

Before analyzing the data, it will be helpful to define what units GDP and greenhouse emissions are measured in. GDP will be measured in GDP per capita, in USD. CO2 emissions will be quantified in million tons / tons per capita. GDP per capita will be used instead of raw GDP, since CO2 emissions are already measured per person, so both variables will be scaled in the same way. 

## Analysis: 
We'll use numpy, pandas, and matplotlib to analyze the data. We'll also use the sklearn.linear_model module to fit a linear regression model if there is a correlation.

Here, we'll compare GDP per capita to CO2 emissions per capita. This is because both variables are measured per person, and thus are scaled the same relative to each country. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy

We'll start by reading in our data.

In [None]:
dirty = pd.read_csv("country_profile_variables.csv")

Preview the data to see what the data looks like and the data types.

In [None]:
dirty.head()

Let's extract the GDP and CO2 emissions data from the original dataset into one clean dataframe.

In [None]:
clean = pd.DataFrame()

# copy 'GDP per capita (current US$)' and 'CO2 emission estimates (million tons/tons per capita)'
clean['Country'] = dirty['country']
clean['Region'] = dirty['Region']
clean['CO2 emission estimates (million tons/tons per capita)'] = dirty['CO2 emission estimates (million tons/tons per capita)']
clean['GDP per capita (current US$)'] = dirty['GDP per capita (current US$)']
clean.head()

There seems to be a lot of missing data in both columns, here labeled with -99. We'll remove all rows with missing data.

In [None]:
# remove rows with values less than 0
clean = clean[clean['CO2 emission estimates (million tons/tons per capita)'] >= 0]
clean = clean[clean['GDP per capita (current US$)'] >= 0]
clean.head()

We removed some rows. Let's count how many rows were removed.

In [None]:
dirty.shape[0] - clean.shape[0]

Let's plot the data.

In [None]:
# plot CO2 emission estimates vs. GDP per capita
clean.plot.scatter(x='CO2 emission estimates (million tons/tons per capita)', y='GDP per capita (current US$)', figsize=(10,5))

This plot is not very helpful, since many points are crowded close to zero. Let's zoom in closer to zero to get a better understanding of the plot.

In [None]:
# plot, limit x to 15000
clean.plot.scatter(x='CO2 emission estimates (million tons/tons per capita)', y='GDP per capita (current US$)', figsize=(10,5), xlim=(-500,15000))

To the eye, there does not seem to be a correlation. Let's compute Pearson's correlation coefficient.

In [None]:
# compute r
clean['CO2 emission estimates (million tons/tons per capita)'].corr(clean['GDP per capita (current US$)'])

## Conclusion:

We found no meaningful linear correlation between GDP per capita and CO2 emissions in this dataset. This could be because GDP is not a good predictor of CO2 emissions, or because the data is not representative of the real world.

If our result is true, it could mean that CO2 emissions do not have any relationship with GDP. This could mean that countries with lots of exports do not have high CO2 emissions, and that they might rely on other countries for energy and other needs which produce CO2.

This hypothesis could be true due to the number of outliers we see on our plot. Most countries have low CO2 emissions, with the exception of a few countries with very high CO2 emissions.

## Question: What forms of government do the countries that produce the most greenhouse gases have? (Atharva)
What forms of government do nations which are major emitters of greenhouse gases have? By answering this question, we may be able to gain insight into whether certain forms of government are more effective than others in terms of reducing emissions. 

Data for this question will be taken from https://cddrl.fsi.stanford.edu/research/autocracies_of_the_world_dataset and https://worldpopulationreview.com/country-rankings/greenhouse-gas-emissions-by-country. 

The Stanford Center on Democracy, Development, and the Rule of Law is responsible for producing the first dataset. This dataset includes information on countries and their forms of government through the years 1950-2012. While this data might seem to be somewhat outdated, only data from 2012 will be analyzed. Furthermore, most governments throughout the world have remained stable for at least the last 10 years, so we expect the data to be accurate. In this dataset, government types are grouped into 5 categories: Democracy, Military, Monarchy, Multiparty, and Single Party. 

The second link lists out countries and their CO2 emissions in millions of tons. The world population review site was responsible for collecting this data, and we expect it to be both accurate and reliable. The data is also current, since it was taken in 2022. 

## Analysis: 
First, let's import both datasets and get a sense of what the data looks like. 

In [None]:
dirtyGovt = pd.read_excel("countries_by_govt.xls")
dirtyEmissions = pd.read_csv("emissions_Mt_country_2022.csv")

In [None]:
dirtyGovt.head()

In [None]:
dirtyEmissions.head()

Here we display the shape of the data as well as the data types present. 

In [None]:
print(dirtyGovt.shape)  # only a cell's last expression is displayed, so print the shape
dirtyGovt.dtypes

In [None]:
print(dirtyEmissions.shape)  # only a cell's last expression is displayed, so print the shape
dirtyEmissions.dtypes

First, let's make a copy of the DataFrames. This will allow us to refer back to the original data if necessary. 

In [None]:
# Make a copy of the original dataframe and process data for analysis
newEmissions = dirtyEmissions.copy()
newGovt = dirtyGovt.copy()

Let's filter out all rows in newGovt which have years other than 2012. 

In [None]:
# Filter by year (only 2012)
newGovt2012 = newGovt[newGovt.year == 2012]
newGovt2012

Are there any missing or negative values for the columns we're interested in? 

In [None]:
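# Count negative emission values, then check both govt columns for missing entries
print(len(newEmissions[newEmissions.totalCO2emission < 0]))
print(newGovt2012.country.isnull().values.any())
print(newGovt2012.regime_nr.isnull().values.any())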

It looks like we are OK to proceed with visualization and analysis. The data types seem to be consistent, and there are no missing or negative values in the columns we checked.  
First, let's join the relevant columns from both of these tables together. 

In [None]:
# Upper-case the country names so the join key is consistent across both tables.
govt2012 = newGovt2012.assign(country_upper=newGovt2012.country.str.upper())

In [None]:
print(govt2012.shape)
govt2012.head(10)

In [None]:
# Now, join by country name. 
joinedData = pd.merge(govt2012, newEmissions, how='inner', left_on='country_upper', right_on='country')
joinedData.shape

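An inner join was used so that only countries present in both tables are kept, which means the join itself introduces no NaN values in the newly formed table. Countries whose names do not match exactly in both tables are dropped. Now that we've joined, we can filter out the columns which aren't relevant. 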
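Because an inner join silently discards rows whose keys do not match, it can be useful to see which countries fall out. The sketch below uses an outer join with `indicator=True` for that purpose. The rows and regime labels are made up, and only the column names mirror `govt2012` and `newEmissions`.

```python
import pandas as pd

# Made-up rows; the column names mirror govt2012 and newEmissions above.
govt = pd.DataFrame({
    "country_upper": ["CHINA", "UNITED STATES", "KOREA SOUTH"],
    "regime_nr": ["Single Party", "Democracy", "Democracy"],
})
emissions = pd.DataFrame({
    "country": ["CHINA", "UNITED STATES", "SOUTH KOREA"],
    "totalCO2emission": [100.0, 50.0, 6.0],
})

# An outer join with indicator=True records which table each row came from.
check = pd.merge(govt, emissions, how="outer",
                 left_on="country_upper", right_on="country", indicator=True)

# Rows that are not "both" would be silently dropped by an inner join.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["country_upper", "country", "_merge"]])
```

In this toy case the two spellings of one country fail to match, which is the kind of mismatch that is worth checking for before trusting the joined row count.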

In [None]:
joinedData = joinedData[['country_x', 'regime_nr', 'totalCO2emission']]
print(joinedData.shape)
joinedData.head()

It looks like we haven't lost any rows, so all the data from the joined table for each of these columns should be present.  
Let's now do a groupby on government type. 

In [None]:
g = joinedData.groupby('regime_nr')
g.size()

The next step is to aggregate data for the groupby object we have generated. Let's first try taking the mean of the CO2 emissions for each of these groups. 

In [None]:
g.totalCO2emission.mean()

It looks like countries which have a single party government have a large mean of CO2 emissions. Perhaps this is because there are outlier countries, like China. Let's take the median, which is less prone to outliers, and see if our results differ. 

In [None]:
g.totalCO2emission.median()

Median results give more insight into what may be going on. Democracies, Monarchies, and Single Party states appear to emit more CO2 than nations which have Military governments or Multiparty systems. However, based on the differences between the mean and median, Single Party states and Democracies contain a few outlier countries which emit a lot more CO2 than the rest.  

Let's now visualize the results. 

In [None]:
joinedData.sort_values('totalCO2emission').plot.bar('country_x', 'totalCO2emission', figsize=(25,5))

Due to the large number of countries, let's focus on the countries which are major emitters. 

In [None]:
majorEmitters = joinedData[joinedData.totalCO2emission > 20000]
majorEmitters.shape

In [None]:
axP3 = majorEmitters.sort_values('totalCO2emission', ascending=False).plot('country_x', 'totalCO2emission', 
                                                                    kind='bar', figsize=(20,5))
axP3 = axP3.set(xlabel='Country', ylabel='CO2 emissions (millions of tons)')


As predicted, a few countries that emit a lot of CO2, like China and the United States, impact the mean for Single Party systems and Democracies significantly. 

## Conclusion: 


Of the five types of governmental systems in the Stanford dataset, it appears that Democracies, Single Party states, and Monarchies emit more CO2 into the atmosphere than Military states and Multiparty systems. However, as seen from the bar graph, a few countries emit substantially more CO2 than the vast majority of other nations. These include China, the United States, and India. In fact, China emits about 2x as much CO2 as the US, which in turn emits more than 2x as much CO2 as India. Countries like China and the United States are responsible for drastically increasing the mean emissions of their respective government type. 

One limitation of this analysis is that a specific government type may be significantly impacted by one or two countries. For example, if China were removed from this analysis, then the mean emissions of Single Party states would be substantially lower. Therefore, the aggregation step is outlier-prone, especially when aggregating by mean. 
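The sensitivity described above can be illustrated with a short sketch that reports the mean and median side by side, then repeats the aggregation after leaving out the largest emitter in each group. The numbers and country labels are made up, and only the column names mirror `joinedData`.

```python
import pandas as pd

# Made-up numbers; the column names mirror joinedData above.
toy = pd.DataFrame({
    "country_x": ["A", "B", "C", "D", "E", "F"],
    "regime_nr": ["Single Party"] * 3 + ["Democracy"] * 3,
    "totalCO2emission": [10000.0, 30.0, 50.0, 5000.0, 400.0, 300.0],
})

# Mean and median side by side for each government type.
summary = toy.groupby("regime_nr")["totalCO2emission"].agg(["mean", "median"])

# Repeat after leaving out the single largest emitter in each group.
top_idx = toy.groupby("regime_nr")["totalCO2emission"].idxmax()
trimmed = (toy.drop(index=top_idx)
              .groupby("regime_nr")["totalCO2emission"]
              .agg(["mean", "median"]))

print(summary)
print(trimmed)
```

A large gap between the two tables for a group suggests that its mean is dominated by a single country.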

A logical next step from this analysis would be to analyze which of the nations that are major emitters of CO2 are world powers. Moreover, it may be more valuable to group countries by other criteria, like whether a given nation is developed, developing, or underdeveloped. Another approach, for which we have already conducted some analysis, would be to classify countries by their economic status.

## Question: Do emissions impact human life expectancy? (Andrew)

Do greenhouse emissions impact human life expectancy? By answering this question, we will better understand how harmful greenhouse gases are towards people. We hypothesize that as CO2 emissions increase, human life expectancy rates will decrease. 

We can answer this question by gathering data on life expectancy and greenhouse gas emissions of various countries, and finding a Pearson correlation coefficient between the two data sets. Two good sources of data to help answer this question are: https://worldpopulationreview.com/country-rankings/greenhouse-gas-emissions-by-country and https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who.

The first link contains data about total CO2 emissions in each country, as well as population size. For this analysis, we will be looking at total CO2 emissions. The second link contains information about life expectancy and other factors that can affect life expectancy. For this analysis, we will be looking at the life expectancy for each country. Since our life expectancy data spans many years, we will aggregate by mean life expectancy per country and use it in correlation with the CO2 emissions data.

To analyze this data, we will create a scatter plot and find the Pearson correlation coefficient to evaluate the two datasets. We chose this method because we are trying to find how one factor relates to another, in this case how greenhouse gas emissions relate to human life expectancy.

## Analysis:

First, let's import the life expectancy dataset and get a sense of what the data looks like. The emissions data (dirtyEmissions) was already loaded in the previous analysis.

In [None]:
dirtyLife = pd.read_csv('life_expectancy_data.csv')

In [None]:
dirtyLife.head()

In [None]:
dirtyEmissions.head()

Let's make a copy of the original data frames so that the original data stays unmodified and can be referred back to if needed.

In [None]:
newLife = dirtyLife.copy()
newEmiss = dirtyEmissions.copy()

In [None]:
print(newLife.dtypes)  # only a cell's last expression is displayed, so print the first one
newEmiss.dtypes

We must make the keys for joining the two datasets consistent, so we will make the country names be in uppercase letters.

In [None]:
# First make sure the primary key columns are consistent for both tables. 
countryUpper = [c.upper() for c in newLife.Country]
len(countryUpper)
newLife['Country'] = countryUpper
newLife.columns = newLife.columns.str.replace(' ', '_')
newLife

In [None]:
newLife.shape, newEmiss.shape

Next, we will aggregate the mean life expectancy for each country.

In [None]:
# Mean life expectancy per country across all years in the dataset
g = newLife.groupby('Country')
gLife = g.aggregate({'Life_expectancy_': 'mean'})
gLife = gLife.reset_index(level=0)
gLife

In [None]:
gLife.shape, newEmiss.shape

Let's join the two data sets so we can find a correlation.

In [None]:
j = pd.merge(gLife, newEmiss, how='inner', left_on='Country', right_on='country')
clean = j.copy()

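Let's sort the data by total CO2 emissions so we can get a sense of any trends in the data. We will also drop rows with missing values. Note that dropna() with no arguments removes a row if any of its columns is NaN, which includes rows where the life expectancy is NaN.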

In [None]:
cleanSort = clean.sort_values('totalCO2emission', ascending=False).dropna()
cleanSort
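The difference between dropping rows with any missing value and dropping only rows that lack the columns used in the correlation can be seen in a small sketch. The rows are made up, and the `other` column is a hypothetical stand-in for any additional column in the joined table.

```python
import numpy as np
import pandas as pd

# Made-up rows; "other" stands in for any additional column in the joined table.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Life_expectancy_": [70.0, np.nan, 80.0],
    "totalCO2emission": [10.0, 20.0, 30.0],
    "other": [1.0, 2.0, np.nan],
})

# dropna() with no arguments removes a row if ANY column is missing.
any_missing = toy.dropna()

# subset= restricts the check to the columns used in the correlation.
needed = toy.dropna(subset=["Life_expectancy_", "totalCO2emission"])

print(len(any_missing), len(needed))
```

Using `subset=` keeps more countries in the analysis when the missing values sit in columns that the correlation never touches.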

Let's plot the data and find the Pearson correlation coefficient.

In [None]:
axLE = cleanSort.plot.scatter(x='totalCO2emission', y='Life_expectancy_', figsize=(15, 7))
axLE = axLE.set(xlabel='Total CO2 emissions (millions of tons)', ylabel='Life Expectancy (years)')

In [None]:
# Pearson Correlation
cleanSort.totalCO2emission.corr(cleanSort.Life_expectancy_)

## Conclusion:

There is no apparent correlation between life expectancy and CO2 emissions. This could be because CO2 emissions are not a good predictor of life expectancy, or because the data is not representative of the real world.

If our result is true, it could mean that CO2 emissions do not have any relationship with life expectancy. One reason that there could be no correlation is that more developed countries could have higher CO2 emissions due to industrialization. These more developed countries may have better healthcare and living conditions, causing the adverse effects of more emissions to be nullified. In future studies, the correlation between developed countries and CO2 emissions should be researched further. A country's classification as developing or developed could be used to predict CO2 emissions. 

This claim could be true due to the outliers we see on our plot. There are many countries with low CO2 emissions, but the outliers with high CO2 emissions show a life expectancy that is average or slightly above average.

## Question: Does a country's use of renewable energy decrease their emissions? (Fairuz)
Does a country's use of renewable energy decrease their emissions? By evaluating this question, we can reach a conclusion about the extent to which renewable energy reduces CO2 emissions and thus benefits the environment. By computing the Pearson correlation coefficient for these two variables, we will gain insight into the effectiveness of renewable energy sources. 

Hypothesis: A general assumption can be made that increasing the use of renewable energy will decrease the use of non-renewable energy sources such as fossil fuels, which will lead to a decrease in CO2 emissions.


To answer this inquiry, the following datasets will be used:

Renewable energy consumption (% of total final energy consumption): https://data.worldbank.org/indicator/EG.FEC.RNEW.ZS
* This data was taken from the World Bank and therefore it is assumed to be accurate data.
* The data shows the percentage of a country's total energy consumption that originated from renewable energy sources
* The data spans from 1990 to 2018

CO2 emissions (metric tons per capita): https://data.worldbank.org/indicator/EN.ATM.CO2E.PC
* This data was taken from the World Bank and therefore it is assumed to be accurate data.
* The data shows the CO2 emissions of a country in metric tons per capita
* The data spans from 1960 to 2018

## Analysis
First, several libraries need to be imported for analyzing and evaluating the data. Furthermore, the data will need to be read. A simple display of the data will help see how the tables are organized.

Note: Some reformatting of the csv files was required for the pandas library to read in the data. 

In [None]:
import pandas
import numpy
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

renewableEnergy = pandas.read_csv("RenewableEnergyConsumption.csv")
CO2Emissions = pandas.read_csv("CO2Emissions.csv")

renewableEnergy.head()

In [None]:
CO2Emissions.head()

Clean the data by removing unneeded columns. Make sure to retain copies of the original data in case reference to the original data is needed at any point. Furthermore, to make computations and visualizations less complex, I've decided to remove any countries with missing data. This will ensure we can see data that can be correlated and we will still have a large pool of data to compute with.

In [None]:
cleanEnergy = renewableEnergy.copy()

cleanEnergy.drop(['Country Code', 'Indicator Name', 'Indicator Code'], inplace=True, axis=1)
cleanEnergy = cleanEnergy.dropna()
cleanEnergy = cleanEnergy.reset_index(drop=True)

cleanEnergy

Do the same as above, but here remove the columns from 1960 to 1989 as well, since in this case that data is irrelevant.

In [None]:
cleanCO2 = CO2Emissions.copy()

# Drop columns 1-33 by position: the three metadata columns plus the years 1960-1989
cleanCO2.drop(columns=cleanCO2.columns[1:34], inplace=True)
cleanCO2 = cleanCO2.dropna()
cleanCO2 = cleanCO2.reset_index(drop=True)

cleanCO2

Now we need to merge the datasets to easily view the data as a whole. Both datasets use the same year labels as column names, so merging them side by side would make those column names ambiguous. To avoid this, we add a suffix to the overlapping column names as part of the merge.

In [None]:
mergedData = cleanEnergy.merge(cleanCO2, on='Country Name', suffixes=('_energy', '_CO2'))
mergedData
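To illustrate what `suffixes` does to overlapping column names, here is a minimal sketch on two tiny frames in the same wide layout. The values and the single-letter country names are made up.

```python
import pandas as pd

# Made-up values in the same wide layout: one column per year.
energy = pd.DataFrame({"Country Name": ["A", "B"],
                       "2017": [10.0, 60.0], "2018": [12.0, 58.0]})
co2 = pd.DataFrame({"Country Name": ["A", "C"],
                    "2017": [9.0, 1.0], "2018": [8.5, 1.1]})

# Overlapping year columns get a suffix; the join key keeps its name.
merged = energy.merge(co2, on="Country Name", suffixes=("_energy", "_CO2"))
print(merged.columns.tolist())
print(merged)
```

Since `merge` defaults to an inner join, only countries present in both frames survive, which is worth keeping in mind when comparing row counts before and after the merge.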

### Note
To see an interactive graphic visualization of this data, follow the links in the Question section above and you will be able to see the world data as well as the data for each country.

The next part of this analysis will deal with only the latest year: 2018. We will find the correlation between emissions and renewable energy usage for the most current data, instead of relying on potentially outdated data.

In [None]:
# Keep the country name and the two 2018 columns, then give them descriptive names
data2018 = mergedData[['Country Name', '2018_energy', '2018_CO2']].rename(columns={
    '2018_energy': 'Renewable Energy Consumption Percentage',
    '2018_CO2': 'CO2 Emissions (Metric Tons Per Capita)'})

data2018

Now we can visualize the data for both energy consumption and CO2 emissions in the year 2018.

In [None]:
data2018.plot.scatter(x='Renewable Energy Consumption Percentage', y='CO2 Emissions (Metric Tons Per Capita)', figsize=(10,5), xlim=(0 ,100))

We can see a general trend in the fact that countries with lower renewable energy consumption percentages had higher CO2 emissions per capita. However, this trend seems to become less pronounced at renewable energy consumption percentages above 40%.

Finally, let's look at the correlation value between the two variables for the year 2018.

In [None]:
data2018['Renewable Energy Consumption Percentage'].corr(data2018['CO2 Emissions (Metric Tons Per Capita)'])
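The observation that the trend weakens above roughly 40% renewable consumption could be checked by computing the correlation separately on each side of that threshold. The sketch below does this on made-up values with shortened, hypothetical column names. It is an illustrative follow-up, not part of the original analysis.

```python
import pandas as pd

# Made-up values; the column names are shortened for the sketch.
toy = pd.DataFrame({
    "renewable_pct": [2.0, 5.0, 10.0, 20.0, 35.0, 45.0, 60.0, 75.0, 90.0],
    "co2_per_capita": [18.0, 15.0, 11.0, 7.0, 3.0, 1.0, 1.2, 0.8, 1.1],
})

# Split at the 40% mark suggested by the scatter plot.
low = toy[toy["renewable_pct"] < 40]
high = toy[toy["renewable_pct"] >= 40]

# Pearson's r within each part.
r_low = low["renewable_pct"].corr(low["co2_per_capita"])
r_high = high["renewable_pct"].corr(high["co2_per_capita"])
print(r_low, r_high)
```

A strongly negative coefficient below the threshold alongside a near-zero one above it would be consistent with the flattening seen in the plot.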

## Conclusion

As per the analysis above, there is some truth to the hypothesis made in the question section, which stated that increasing renewable energy use would decrease CO2 emissions. From the analysis we can see that the data has a correlation coefficient of about -0.5. While this doesn't strongly support the hypothesis, it does indicate that the two variables are moderately related: lower renewable energy use is associated with higher CO2 emissions per capita. 

Furthermore, visualizing the data for 2018 helps to understand the trend in the data a little bit better. From the scatterplot above, one can see that the countries with the highest CO2 emissions per capita had the lowest renewable energy consumption. However, as renewable energy consumption increased, the trend discontinued at around 40% of renewable energy consumption. This may indicate that there is a threshold at which the renewable energy consumption gives diminishing returns in terms of reducing CO2 emissions.

Limitations and errors may have occurred in this data analysis, as several factors could lead to different results. First and foremost, if raw CO2 emissions in metric tons were used instead of CO2 emissions per capita, we could have seen a different trend in the data. This could be a potential update to this analysis. Another limitation is that only one year was evaluated, out of the 29 years (1990 to 2018) that were available. Although it might complicate and lengthen the evaluation, one could review the trends for each of those years and come up with an average trend.
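The per-year review suggested above could be sketched as follows, assuming a wide table shaped like `mergedData` with `_energy` and `_CO2` suffixed year columns. The values are made up and cover only two years.

```python
import pandas as pd

# Made-up values in the shape of mergedData: one suffixed column per year and variable.
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C", "D"],
    "2017_energy": [5.0, 20.0, 50.0, 80.0],
    "2017_CO2": [12.0, 8.0, 3.0, 1.0],
    "2018_energy": [6.0, 22.0, 51.0, 82.0],
    "2018_CO2": [11.5, 7.5, 3.2, 0.9],
})

# Correlate the two variables separately within each year.
years = [2017, 2018]
r_by_year = pd.Series(
    {y: toy[f"{y}_energy"].corr(toy[f"{y}_CO2"]) for y in years}
)

print(r_by_year)
print("mean r across years:", r_by_year.mean())
```

On the real table, `years` would be `range(1990, 2019)`, and plotting `r_by_year` would show whether the relationship has strengthened or weakened over time.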


## Question: Which countries produce the most greenhouse gases? Which countries produce the least? (Jonathan)

Which countries produce the most and least greenhouse gas emissions? By determining which countries have been producing the most and least greenhouse gases, we can better understand what greenhouse gas emissions have to do with human activity. By finding the countries that produce the most and least greenhouse gases, we can also generate new questions about the characteristics of those countries which may cause them to produce a greater amount of emissions. 

In order to answer this question, data from https://www.kaggle.com/saurabhshahane/green-house-gas-historical-emission-data will be utilized. This data contains greenhouse gas emissions data for 194 countries from 1990-2018. The dataset was extracted from the World Resources Institute.

The unit for this dataset is MtCO2e, which stands for million metric tons of carbon dioxide equivalent.

## Analysis:
We will use pandas, numpy, and matplotlib to sort and analyze the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

We will start by pulling our data from the csv file we have acquired from the website.

In [None]:
dirty = pd.read_csv("ghg-emissions.csv")

Preview of the data.

In [None]:
dirty.head()

In [None]:
dirty

As we can see, there are a couple of rows that we do not need for our analysis. Let's clean up the dataset.

In [None]:
# Keep only rows reported in MtCO2e; copy so later column assignments leave dirty untouched
clean = dirty[dirty["unit"] == "MtCO2e"].copy()
clean

Let's find the total MtCO2e values for each country through the time period 1990-2018.

In [None]:
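# Sum across each row using only numeric columns; the country and unit columns are text
clean["total"] = clean.sum(axis=1, numeric_only=True)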

Here, I have added a new column named "total" that represents the total greenhouse gas emissions from 1990 to 2018.
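A row-wise total is safest when the year columns are selected explicitly, so that text columns such as the country name never enter the sum. The sketch below shows one way to do this on made-up values. It assumes the year columns are named with digits only, as in a wide one-column-per-year CSV.

```python
import pandas as pd

# Made-up values; the layout (text columns plus one column per year) is an assumption.
toy = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "unit": ["MtCO2e", "MtCO2e"],
    "1990": [1.0, 10.0],
    "1991": [2.0, 20.0],
})

# Select the year columns explicitly so text columns never enter the sum.
year_cols = [c for c in toy.columns if c.isdigit()]
toy["total"] = toy[year_cols].sum(axis=1)
print(toy)
```

Selecting columns by name also guards against accidentally including a previously computed "total" column if the cell is run twice.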

In [None]:
clean

Here, I have created a final data frame that takes in the clean data from before and sorts it by the "Total Emission from 1990 to 2018" value.

In [None]:
final = pd.DataFrame()
final["Country"] = clean["Country/Region"]
final["unit"] = clean["unit"]
final["Total Emission from 1990 to 2018"] = clean["total"]
final = final.sort_values(by = "Total Emission from 1990 to 2018")
final

We can see that some rows have a negative total, which is not helpful for ranking emitters. We will remove these rows. 

In [None]:
final = final[final["Total Emission from 1990 to 2018"] >= 0]
final

This is the number of rows we removed from the original data during cleaning.

In [None]:
dirty.shape[0] - final.shape[0]

Since the data is sorted by total emissions, the first and last rows are the countries with the least and most greenhouse gas emissions from 1990 to 2018.

In [None]:
least = final.iloc[0]
most = final.iloc[-1]

In [None]:
least

In [None]:
most
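To look beyond the single largest and smallest emitter, `nlargest` and `nsmallest` return the top and bottom few rows directly. The sketch below uses made-up totals and placeholder country names. Only the column names match the `final` DataFrame.

```python
import pandas as pd

# Made-up totals; the column names match the final DataFrame above.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Total Emission from 1990 to 2018": [500.0, 3.0, 120.0, 0.5, 9000.0, 42.0],
})

col = "Total Emission from 1990 to 2018"

# Three largest and three smallest totals, each ordered from most extreme inward.
top3 = toy.nlargest(3, col)
bottom3 = toy.nsmallest(3, col)

print(top3)
print(bottom3)
```

Applied to `final` with a count of five, this would list the other major and minor emitters alongside the two extremes.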

## Conclusion:
We found that the country that produces the most greenhouse gas emissions is China, while the country that produces the least is Niue. This makes sense, as Niue is a small, isolated island located in the South Pacific Ocean, while China is a large country in East Asia and is one of the biggest, if not the biggest, manufacturing countries in the world.  

Other major contributors of greenhouse gases include the United States, India, Russia, and Brazil. Besides Niue, countries which have the smallest carbon footprint include Tuvalu, the Cook Islands, Kiribati, and Nauru. In general, it appears that industrial powerhouses are responsible for the majority of emissions. On the other hand, island nations which are isolated from the rest of the world and which may have relatively self-contained economies have the lowest emissions. 

We can use this information to determine if greenhouse gas emissions have any correlations with trade. Countries like China, the United States, and India all have large economies with a lot of exports. On the other hand, smaller island nations, like Niue and Tuvalu, probably have a relatively small export market. Therefore, it is quite possible that trade has a correlation with CO2 emissions. A logical next step would be to analyze the effects of both domestic and international trade on CO2 emissions. 

One limitation of this analysis is that emission data from 1990-2018 was used. Therefore, the data may not be quite as up to date as current data, which may affect our results slightly. However, we still anticipate that our data is mostly accurate, and it certainly helps us determine what characteristics of a country may be associated with higher volumes of CO2 emissions. 