# CO2 Emissions Analysis

In this project, I analyze the CO2 and greenhouse gas emissions data provided by *Our World in Data*. The columns used in this analysis are described below:

* **country**              : Geographic location.
* **year**                 : Year of observation.
* **iso_code**             : ISO 3166-1 alpha-3, three-letter country codes.
* **co2**                  : Annual production-based emissions of carbon dioxide (CO2), measured in million tonnes.
* **coal_co2**             : Annual production-based emissions of carbon dioxide (CO2) from coal, measured in million tonnes.
* **gas_co2**              : Annual production-based emissions of carbon dioxide (CO2) from gas, measured in million tonnes.
* **oil_co2**              : Annual production-based emissions of carbon dioxide (CO2) from oil, measured in million tonnes.
* **share_global_co2**     : Annual production-based emissions of carbon dioxide (CO2), measured as a percentage of global production-based emissions of CO2 in the same year.
* **share_global_cumulative_co2**: Cumulative production-based emissions of carbon dioxide (CO2) since the first year of data availability, measured as a percentage of global cumulative production-based emissions of CO2 since the first year of data availability.


The full list of columns and their descriptions can be found [here](https://github.com/owid/co2-data/blob/master/owid-co2-codebook.csv).

## Imports

In [26]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
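import plotly
import plotly.express    # loaded explicitly so plotly.express.line is available below
import plotly.subplots   # loaded explicitly so plotly.subplots.make_subplots is available below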
import cufflinks as cf
import plotly.graph_objects as go
cf.go_offline()
%matplotlib inline

## Getting the Data

In [2]:
df = pd.read_csv('owid-co2-data.csv')

In [3]:
df.head()

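| | country | year | iso_code | population | gdp | cement_co2 | cement_co2_per_capita | co2 | co2_growth_abs | co2_growth_prct | ... | share_global_cumulative_oil_co2 | share_global_cumulative_other_co2 | share_global_flaring_co2 | share_global_gas_co2 | share_global_oil_co2 | share_global_other_co2 | total_ghg | total_ghg_excluding_lucf | trade_co2 | trade_co2_share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1949 | AFG | 7624058.0 | NaN | NaN | NaN | 0.015 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Afghanistan | 1950 | AFG | 7752117.0 | 9421400000.0 | NaN | NaN | 0.084 | 0.07 | 475.0 | ... | 0.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | Afghanistan | 1951 | AFG | 7840151.0 | 9692280000.0 | NaN | NaN | 0.092 | 0.007 | 8.7 | ... | 0.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | Afghanistan | 1952 | AFG | 7935996.0 | 10017330000.0 | NaN | NaN | 0.092 | 0.0 | 0.0 | ... | 0.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 4 | Afghanistan | 1953 | AFG | 8039684.0 | 10630520000.0 | NaN | NaN | 0.106 | 0.015 | 16.0 | ... | 0.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |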


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26008 entries, 0 to 26007
Data columns (total 60 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   country                              26008 non-null  object 
 1   year                                 26008 non-null  int64  
 2   iso_code                             21913 non-null  object 
 3   population                           23878 non-null  float64
 4   gdp                                  13479 non-null  float64
 5   cement_co2                           12668 non-null  float64
 6   cement_co2_per_capita                12638 non-null  float64
 7   co2                                  24670 non-null  float64
 8   co2_growth_abs                       24294 non-null  float64
 9   co2_growth_prct                      25696 non-null  float64
 10  co2_per_capita                       24032 non-null  float64
 11  co2_per_gdp                 

In [5]:
df.describe()

| | year | population | gdp | cement_co2 | cement_co2_per_capita | co2 | co2_growth_abs | co2_growth_prct | co2_per_capita | co2_per_gdp | ... | share_global_cumulative_oil_co2 | share_global_cumulative_other_co2 | share_global_flaring_co2 | share_global_gas_co2 | share_global_oil_co2 | share_global_other_co2 | total_ghg | total_ghg_excluding_lucf | trade_co2 | trade_co2_share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 26008.0 | 23878.0 | 13479.0 | 12668.0 | 12638.0 | 24670.0 | 24294.0 | 25696.0 | 24032.0 | 15851.0 | ... | 21100.0 | 2208.0 | 4641.0 | 9245.0 | 21100.0 | 2208.0 | 6149.0 | 6149.0 | 4096.0 | 4096.0 |
| mean | 1952.169525 | 93362740.0 | 288744500000.0 | 15.853638 | 0.111586 | 326.658348 | 6.383185 | 20.760047 | 4.115845 | 0.428189 | ... | 3.684855 | 17.432921 | 6.870909 | 6.664776 | 3.669233 | 18.240177 | 771.485169 | 748.578503 | -8.124469 | 21.584172 |
| std | 54.562304 | 406602100.0 | 2184803000000.0 | 84.179826 | 0.147532 | 1677.027132 | 61.526038 | 692.088419 | 14.700555 | 0.480043 | ... | 13.60132 | 30.325683 | 15.449948 | 20.434607 | 13.393958 | 31.649508 | 3553.425128 | 3488.004831 | 262.090604 | 45.459095 |
| min | 1750.0 | 1490.0 | 55432000.0 | 0.0 | 0.0 | 0.0 | -1895.244 | -99.64 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -186.55 | 0.01 | -2233.0 | -96.76 |
| 25% | 1923.0 | 1401949.0 | 9759778000.0 | 0.134 | 0.019 | 0.557 | -0.007 | -0.3325 | 0.253 | 0.145 | ... | 0.01 | 0.23 | 0.11 | 0.03 | 0.01 | 0.34 | 8.44 | 7.03 | -1.66375 | -2.8575 |
| 50% | 1966.0 | 5288000.0 | 30419140000.0 | 0.603 | 0.068 | 5.333 | 0.066 | 3.37 | 1.226 | 0.285 | ... | 0.08 | 1.105 | 0.79 | 0.22 | 0.09 | 1.55 | 38.05 | 30.6 | 1.8745 | 10.79 |
| 75% | 1994.0 | 22344030.0 | 127386600000.0 | 3.255 | 0.155 | 48.15325 | 1.242 | 10.38 | 4.61225 | 0.544 | ... | 0.63 | 18.7625 | 5.78 | 1.5 | 0.65 | 19.3625 | 151.15 | 131.32 | 9.7005 | 35.0175 |
| max | 2020.0 | 7794789000.0 | 113630200000000.0 | 1626.371 | 2.738 | 36702.504 | 1736.258 | 102318.508 | 748.639 | 7.718 | ... | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 49758.23 | 48116.559 | 2047.575 | 366.15 |


## Exploratory Data Analysis

In [6]:
df.isnull().sum()

country                                    0
year                                       0
iso_code                                4095
population                              2130
gdp                                    12529
cement_co2                             13340
cement_co2_per_capita                  13370
co2                                     1338
co2_growth_abs                          1714
co2_growth_prct                          312
co2_per_capita                          1976
co2_per_gdp                            10157
co2_per_unit_energy                    16050
coal_co2                                8099
coal_co2_per_capita                     8472
consumption_co2                        21912
consumption_co2_per_capita             21912
consumption_co2_per_gdp                22131
cumulative_cement_co2                  13340
cumulative_co2                          1338
cumulative_coal_co2                     8099
cumulative_flaring_co2                 21367
cumulative

There seem to be a lot of missing values. The data starts in 1750 and the first quartile of 'year' is 1923, so the earliest records are sparse and unlikely to carry much information. For the purpose of this analysis, we restrict the data to the years after 1920 (1921 to 2020).

In [7]:
df[df['year']>1920].isnull().sum()

country                                    0
year                                       0
iso_code                                2232
population                              1245
gdp                                     7930
cement_co2                              7243
cement_co2_per_capita                   7273
co2                                      700
co2_growth_abs                           886
co2_growth_prct                          220
co2_per_capita                          1010
co2_per_gdp                             6508
co2_per_unit_energy                     9737
coal_co2                                7378
coal_co2_per_capita                     7423
consumption_co2                        15599
consumption_co2_per_capita             15599
consumption_co2_per_gdp                15818
cumulative_cement_co2                   7243
cumulative_co2                           700
cumulative_coal_co2                     7378
cumulative_flaring_co2                 15054
cumulative

In [8]:
df = df[df['year']>1920]

In [9]:
df.shape

(19695, 60)
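
Because the filtered frame has fewer rows, raw missing-value counts before and after the filter are not directly comparable. A minimal sketch of comparing the share of missing values instead, on an invented toy frame (the values and the helper name `missing_share` are illustrative assumptions, not part of the dataset or the notebook):

```python
import numpy as np
import pandas as pd

# Toy data standing in for owid-co2-data.csv (values are invented).
toy = pd.DataFrame({
    "year": [1800, 1850, 1921, 1950, 2000, 2020],
    "co2":  [np.nan, np.nan, 1.0, 2.0, np.nan, 4.0],
    "gdp":  [np.nan, np.nan, np.nan, 5.0, 6.0, 7.0],
})

def missing_share(frame):
    """Fraction of missing values per column (hypothetical helper)."""
    return frame.isnull().mean()

before = missing_share(toy)
after = missing_share(toy[toy["year"] > 1920])  # strict: 1920 itself is excluded

# Side-by-side comparison of the missing share per column.
print(pd.DataFrame({"before": before, "after": after}))
```

On the real data the same two calls could be applied to `df` before and after the year filter.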

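We can drop the columns with more than 80% missing values. In `dropna`, `thresh` is the minimum number of non-null values a column needs in order to be kept, so a threshold of 20% of the rows removes every column that is more than 80% empty.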

In [10]:
# Keep only the columns that have at least 20% non-null values
thresh = int(len(df) * 0.2)
df = df.dropna(axis=1, thresh=thresh)

In [11]:
df.shape

(19695, 54)
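
The `thresh` argument used above can be easy to misread, so here is a small sketch of its behaviour on an invented toy frame (column names and values are assumptions for illustration only). A column survives when its count of non-null values is at least `thresh`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_full":  [1.0, 2.0, 3.0, 4.0, np.nan],               # 20% missing
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, np.nan],   # 100% missing
    "borderline":   [1.0, np.nan, np.nan, np.nan, np.nan],      # exactly 80% missing
})

# thresh is the minimum number of non-null values a column needs to be kept.
min_non_null = int(len(toy) * 0.2)  # 1 of 5 rows
kept = toy.dropna(axis=1, thresh=min_non_null)

print(kept.columns.tolist())
```

In this sketch the column that is exactly 80% empty is kept, which matches the "more than 80%" wording: only columns below the 20% non-null mark are dropped.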

In [12]:
df.columns

Index(['country', 'year', 'iso_code', 'population', 'gdp', 'cement_co2',
       'cement_co2_per_capita', 'co2', 'co2_growth_abs', 'co2_growth_prct',
       'co2_per_capita', 'co2_per_gdp', 'co2_per_unit_energy', 'coal_co2',
       'coal_co2_per_capita', 'consumption_co2', 'consumption_co2_per_capita',
       'cumulative_cement_co2', 'cumulative_co2', 'cumulative_coal_co2',
       'cumulative_flaring_co2', 'cumulative_gas_co2', 'cumulative_oil_co2',
       'energy_per_capita', 'energy_per_gdp', 'flaring_co2',
       'flaring_co2_per_capita', 'gas_co2', 'gas_co2_per_capita',
       'ghg_excluding_lucf_per_capita', 'ghg_per_capita', 'methane',
       'methane_per_capita', 'nitrous_oxide', 'nitrous_oxide_per_capita',
       'oil_co2', 'oil_co2_per_capita', 'primary_energy_consumption',
       'share_global_cement_co2', 'share_global_co2', 'share_global_coal_co2',
       'share_global_cumulative_cement_co2', 'share_global_cumulative_co2',
       'share_global_cumulative_coal_co2',
       's

### Global CO2 Emissions Trend

In [59]:
worlddf = df[df['country']=='World']
fig = go.Figure()

fig.add_traces(go.Scatter(x=worlddf['year'], y=worlddf['co2'], mode='lines', showlegend=False))

fig.update_layout(title='Global CO2 Emissions')
fig.update_xaxes(title='Year')
fig.update_yaxes(title='CO2 Emissions (in million tonnes)')

Next, we plot markers on the troughs in 1992, 2009 and 2020.

In [60]:
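# Copy of the world series that keeps co2 only in the trough years
newdf = worlddf[['year','co2']].copy()
def get_markers(co2,year):
    # Return the co2 value for the marker years and NaN for every other year
    if year in [1992,2009,2020]:
        return co2
    else:
        return np.nan  # np.NaN was removed in NumPy 2.0
newdf['co2'] = newdf.apply(lambda x: get_markers(x.co2,x.year), axis=1)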
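
The same marker column can probably be built without a row-wise `apply`. A minimal sketch using `Series.where` on an invented toy frame (the values and the names `toy`, `marker_years` and `markers` are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the 'World' rows (values are invented).
toy = pd.DataFrame({
    "year": [1990, 1991, 1992, 1993],
    "co2":  [10.0, 11.0, 9.0, 12.0],
})
marker_years = [1992]

markers = toy[["year", "co2"]].copy()
# Keep co2 only in the marker years; every other row becomes NaN.
markers["co2"] = markers["co2"].where(markers["year"].isin(marker_years))

print(markers)
```

The vectorized form avoids calling a Python function once per row, which tends to matter more on larger frames than on a single-country series.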

In [61]:
fig = go.Figure()

fig.add_traces(go.Scatter(x=worlddf['year'], y=worlddf['co2'], mode='lines', showlegend=False))

fig.add_traces(go.Scatter(x=newdf['year'], y=newdf['co2'], mode='markers', marker=dict(color='red',symbol='square'), showlegend=False))

fig.update_layout(title='Global CO2 Emissions')
fig.update_xaxes(title='Year')
fig.update_yaxes(title='CO2 Emissions (in million tonnes)')

**Interestingly, the marked low points (1992, 2009 and 2020) coincide with or closely follow economic crises and major events: the COVID-19 pandemic in 2020, the global financial crisis and oil crisis in 2008, and the collapse of the Soviet Union, the Gulf War and the Indian economic crisis in 1991.**
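
The trough years above are hard-coded. As a rough sketch of how local low points could be located programmatically, the snippet below flags every year whose value is lower than both neighbours, and also accepts the final year when it is lower than the year before it. The toy values and the names `toy`, `is_trough` and `trough_years` are assumptions for illustration; on the real series this rule may flag more years than the three marked above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the global series (values are invented).
toy = pd.DataFrame({
    "year": [2006, 2007, 2008, 2009, 2010, 2011],
    "co2":  [103.0, 106.0, 107.0, 105.0, 110.0, 108.0],
})

co2 = toy["co2"]
prev_val = co2.shift(1)                   # NaN for the first year, so it is never flagged
next_val = co2.shift(-1).fillna(np.inf)   # the last year only needs to be below its predecessor
is_trough = (co2 < prev_val) & (co2 < next_val)

trough_years = toy.loc[is_trough, "year"].tolist()
print(trough_years)
```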

In [62]:
df[df['country']=='World'].iplot(x='year',y=['co2','gas_co2','oil_co2','coal_co2'],title='Global CO2 Emissions due to Gas, Oil and Coal',
                                 xTitle='Year',yTitle='CO2 Emissions (in million tonnes)')

### Analyzing top contributors to global emissions

In [15]:
df[df['year']==2020][['country','co2']].sort_values(by='co2',ascending=False).head(10)

| | country | co2 |
|---|---|---|
| 25747 | World | 34807.258 |
| 1250 | Asia | 20317.059 |
| 24871 | Upper-middle-income countries | 15895.922 |
| 10662 | High-income countries | 11828.815 |
| 4730 | China | 10667.888 |
| 1441 | Asia (excl. China & India) | 7207.378 |
| 14146 | Lower-middle-income countries | 5825.887 |
| 17018 | North America | 5775.159 |
| 7279 | Europe | 4946.035 |
| 24680 | United States | 4712.771 |


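The dataset also contains records for aggregate categories such as continents and income groups. These records have no ISO code because they are not countries, so we build a list of every entity without an ISO code and exclude those entities where needed. Note that this list also picks up a few historical or disputed territories (for example Kosovo and the Panama Canal Zone) that lack an ISO code in this dataset.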

In [16]:
entitylist = df[df['iso_code'].isnull()]['country'].unique().tolist()

In [17]:
entitylist

['Africa',
 'Asia',
 'Asia (excl. China & India)',
 'Europe',
 'Europe (excl. EU-27)',
 'Europe (excl. EU-28)',
 'European Union (27)',
 'European Union (28)',
 'French Equatorial Africa',
 'French West Africa',
 'High-income countries',
 'International transport',
 'Kosovo',
 'Kuwaiti Oil Fires',
 'Leeward Islands',
 'Low-income countries',
 'Lower-middle-income countries',
 'North America',
 'North America (excl. USA)',
 'Oceania',
 'Panama Canal Zone',
 'Ryukyu Islands',
 'South America',
 'St. Kitts-Nevis-Anguilla',
 'Upper-middle-income countries',
 'World']
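
Since the list above also contains entries such as Kosovo, one possible refinement is to keep a small allow-list of real countries that lack an ISO code. This is only a sketch on invented toy data; the set `real_countries_without_iso` is a hypothetical, hand-maintained assumption and not something the dataset provides.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset (values are invented).
toy = pd.DataFrame({
    "country":  ["World", "Asia", "China", "India", "Kosovo"],
    "iso_code": [np.nan, np.nan, "CHN", "IND", np.nan],
    "co2":      [100.0, 60.0, 30.0, 7.0, 0.1],
})

# Hypothetical allow-list of real countries that have no ISO code in the data.
real_countries_without_iso = {"Kosovo"}

no_iso = toy["iso_code"].isnull()
entities = [name for name in toy.loc[no_iso, "country"].unique()
            if name not in real_countries_without_iso]
countries_only = toy[~toy["country"].isin(entities)]

print(entities)
print(countries_only["country"].tolist())
```

For a top-10 ranking the difference is unlikely to matter, but it may for analyses that count or list countries.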

We identify the 10 countries with the highest CO2 emissions in 2020 and plot their trends over the last 100 years.

In [18]:
top10co2 = df[(~df['country'].isin(entitylist)) & (df['year']==2020)][['country','co2']].sort_values(by='co2',ascending=False).head(10)
top10 = top10co2['country'].to_list()
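
The sort-then-head pattern above can also be written with `DataFrame.nlargest`, which reads a little more directly. A minimal sketch on invented toy data (country labels, values and the name `top2` are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the dataset (values are invented).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year":    [2019, 2019, 2019, 2020, 2020, 2020],
    "co2":     [5.0, 9.0, 1.0, 6.0, 8.0, 2.0],
})

latest = toy[toy["year"] == 2020]
# Rows with the two largest co2 values, already ordered from high to low.
top2 = latest.nlargest(2, "co2")["country"].tolist()

print(top2)
```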

In [19]:
plotly.express.line(data_frame=df[df['country'].isin(top10)], x='year', y='co2', line_group='country',
                    color='country', title='Annual CO2 Emissions by Country', category_orders={'country':top10},
                    labels={'year':'Year','co2':'Annual CO2 Emissions (in million tonnes)','country':'Country'})

**Russia's sudden decline in emissions after 1991 may be attributed to the collapse of the Soviet Union.**

**The increase in China's CO2 emissions throughout the 2000s may be attributed to its rapid economic growth and focus on manufacturing during that period.**

### Global Share of the top 10 countries 

In [20]:
pie_df = df[(df['country'].isin(top10)) & (df['year']==2020)]

Next, we drop every column that does not contain global share details (except 'country') from the new dataframe.

In [21]:
# Columns to drop: everything that is not a 'share*' column, except 'country'
filter_col = [col for col in df.columns if not col.startswith('share')]
filter_col.remove('country')
pie_df = pie_df.drop(filter_col, axis=1)

We add a row 'Rest of the World' to represent all remaining countries. Each of its shares is 100 minus the sum of the top 10 countries' shares in that column.

In [22]:
newdf = pd.DataFrame(columns=pie_df.columns)
newdf['country']=['Rest of the World']
for x in pie_df.columns:
    if x == 'country':
        continue 
    newdf[x] = [100 - pie_df[x].sum()]
pie_df = pd.concat([pie_df,newdf])
top10.append('Rest of the World')
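
The loop above could likely be replaced by a single column-wise subtraction. The sketch below shows the idea on invented toy data; the names `share_cols`, `rest` and `rest_row` are assumptions for illustration. Note that `sum()` skips missing values, so a column with NaN shares would push the missing amount into 'Rest of the World' in either version.

```python
import pandas as pd

# Toy stand-in for pie_df (values are invented).
toy = pd.DataFrame({
    "country": ["A", "B"],
    "share_global_co2": [30.0, 15.0],
    "share_global_cumulative_co2": [14.0, 25.0],
})

share_cols = [c for c in toy.columns if c.startswith("share")]
# One remainder per share column: 100 minus the column total.
rest = 100 - toy[share_cols].sum()
rest_row = pd.DataFrame([{"country": "Rest of the World", **rest.to_dict()}])
result = pd.concat([toy, rest_row], ignore_index=True)

print(result)
```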

The dataframe to plot the pie chart

In [23]:
pie_df.head(11)

| | country | share_global_cement_co2 | share_global_co2 | share_global_coal_co2 | share_global_cumulative_cement_co2 | share_global_cumulative_co2 | share_global_cumulative_coal_co2 | share_global_cumulative_flaring_co2 | share_global_cumulative_gas_co2 | share_global_cumulative_oil_co2 | share_global_flaring_co2 | share_global_gas_co2 | share_global_oil_co2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4730 | China | 52.77 | 30.65 | 53.1 | 34.3 | 13.89 | 22.48 | NaN | 2.49 | 5.52 | NaN | 8.18 | 14.56 |
| 9597 | Germany | 0.82 | 1.85 | 1.42 | 2.76 | 5.46 | 8.17 | 0.52 | 2.86 | 3.31 | 0.46 | 2.31 | 2.26 |
| 11255 | India | 7.56 | 7.02 | 11.36 | 5.73 | 3.21 | 4.47 | 0.93 | 0.96 | 2.39 | 0.39 | 1.72 | 5.44 |
| 11387 | Indonesia | 2.08 | 1.69 | 2.15 | 1.56 | 0.85 | 0.47 | 2.77 | 0.99 | 1.19 | 2.55 | 1.06 | 1.5 |
| 11573 | Iran | 1.47 | 2.14 | 0.04 | 1.2 | 1.11 | 0.03 | 10.84 | 2.86 | 1.56 | 13.84 | 5.87 | 2.0 |
| 12314 | Japan | 1.56 | 2.96 | 2.88 | 4.59 | 3.87 | 3.01 | 0.09 | 2.79 | 5.53 | 0.09 | 2.93 | 3.41 |
| 19595 | Russia | 1.25 | 4.53 | 2.55 | 3.79 | 6.8 | 5.72 | 6.53 | 13.93 | 5.52 | 10.75 | 10.1 | 3.51 |
| 20268 | Saudi Arabia | 1.57 | 1.8 | NaN | 1.23 | 0.94 | NaN | 5.07 | 1.94 | 1.64 | 0.02 | 3.48 | 3.09 |
| 21589 | South Korea | 1.41 | 1.72 | 2.02 | 2.17 | 1.08 | 1.02 | NaN | 0.8 | 1.21 | NaN | 1.55 | 1.51 |
| 24680 | United States | 2.51 | 13.54 | 6.36 | 6.36 | 24.56 | 22.28 | 12.67 | 31.22 | 26.68 | 19.43 | 22.37 | 18.25 |


In [24]:
fig = plotly.subplots.make_subplots(rows=1,cols=2,
                                    subplot_titles=['Percentage of Global Emissions','Percentage of Cumulative Global Emissions'],
                                    specs=[[{'type': 'pie'}, {'type': 'pie'}]])

fig.add_trace(go.Pie(values=pie_df['share_global_co2'], labels=pie_df['country'], 
                     domain=dict(x=[0,0.5]), name='% of Emissions'), row=1, col=1)

fig.add_trace(go.Pie(values=pie_df['share_global_cumulative_co2'], labels=pie_df['country'], 
                     domain=dict(x=[0.5,1.0]), name='% of Cumulative Emissions'), row=1, col=2)



**China's and India's shares of global emissions in 2020 are higher than their shares of cumulative emissions, with China's 2020 share almost 17 percentage points above its cumulative share. The United States, on the other hand, had a lower share in 2020 than its share of cumulative emissions.**
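
The comparison above can be checked with a short calculation. The sketch below copies the 2020 values for the three countries from the table shown earlier in this section; the frame name `shares` and the `difference` column are assumptions introduced for illustration.

```python
import pandas as pd

# Values copied from the 2020 share table above (percent of the global total).
shares = pd.DataFrame({
    "country": ["China", "India", "United States"],
    "share_global_co2": [30.65, 7.02, 13.54],
    "share_global_cumulative_co2": [13.89, 3.21, 24.56],
})

# Positive: the 2020 share exceeds the cumulative share; negative: it falls short.
shares["difference"] = (
    shares["share_global_co2"] - shares["share_global_cumulative_co2"]
).round(2)

print(shares)
```

China's 2020 share comes out roughly 16.8 points above its cumulative share, India's roughly 3.8 points above, and the United States' roughly 11 points below.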