## Imports

In [None]:
# Import pandas, numpy, and matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# seaborn is a data visualization library built on matplotlib
import seaborn as sns 

# set the plotting style 
sns.set_style("whitegrid")

# missingno provides plots for visualizing missing values
import missingno as msno

## Lab introduction

Use the greenhouse gas emissions data set `owid-co2-data.csv` from Our World in Data to describe how the emission levels of the current top 10 CO$_2$ emitters have changed over the last 50 years (1971 to 2020).



## Import and set up the data set

##### $\rightarrow$ Use Pandas to load the file `owid-co2-data.csv` from https://github.com/owid/co2-data as a `DataFrame`. Name the `DataFrame` `df`.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv')

##### Solution

Consult the [codebook](https://github.com/owid/co2-data/blob/master/owid-co2-codebook.csv) to see the description of each column.



In [None]:
# Keep only the columns used in this lab
df_want = df[['country', 'year', 'population', 'co2']]

##### $\rightarrow$ Select the rows corresponding to individual countries 

In [None]:
df_want.loc[df_want['country'].isin(['Canada', 'United States'])]

Unnamed: 0,country,year,population,co2
7722,Canada,1785,,0.004
7723,Canada,1786,,0.004
7724,Canada,1787,,0.004
7725,Canada,1788,,0.004
7726,Canada,1789,,0.004
...,...,...,...,...
44218,United States,2017,329791232.0,5210.958
44219,United States,2018,332140032.0,5376.657
44220,United States,2019,334319680.0,5259.144
44221,United States,2020,335942016.0,4715.691


The `country` column of the data set contains some values that are not individual countries, such as continents, income groups, and the world total. We will remove these observations from the data set.

In [None]:
non_countries = ['Africa', 'Africa (GCP)', 'Asia', 'Asia (GCP)', 'Asia (excl. China and India)', 'Central America (GCP)',
                  'EU-27', 'Europe', 'Europe (excl. EU-27)', 'European Union (27) (GCP)', 'Europe (GCP)',
                  'Europe (excl. EU-28)', 'European Union (27)', 
                  'European Union (28)', 'French Equatorial Africa', 
                  'French Guiana', 'French Polynesia', 'French West Africa',
                  'High-income countries', 'International transport', 
                  'Low-income countries', 'Lower-middle-income countries', 'Mayotte', 'Middle East (GCP)',
                  'Non-OECD (GCP)',
                  'North America',  'North America (excl. USA)', 'North America (GCP)',
                  'Oceania (GCP)', 'OECD (GCP)', 
                  'Panama Canal Zone','South America', 'South America (GCP)', 'Upper-middle-income countries', 
                  'World']

Remove the rows corresponding to the non-countries.

In [None]:
# ~ negates the boolean mask, keeping rows whose country is not in non_countries
df_want.loc[~df_want['country'].isin(non_countries)]

Unnamed: 0,country,year,population,co2
0,Afghanistan,1850,3752993.0,
1,Afghanistan,1851,3769828.0,
2,Afghanistan,1852,3787706.0,
3,Afghanistan,1853,3806634.0,
4,Afghanistan,1854,3825655.0,
...,...,...,...,...
46518,Zimbabwe,2017,14751101.0,9.596
46519,Zimbabwe,2018,15052191.0,11.795
46520,Zimbabwe,2019,15354606.0,11.115
46521,Zimbabwe,2020,15669663.0,10.608
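
Note that `.loc` with a boolean mask returns a new `DataFrame` and leaves the original unchanged, so the filtered rows persist only if the result is assigned to a name. A minimal sketch on a made-up frame (the names `toy`, `non_countries_toy`, and `toy_countries` and all values are illustrative, not taken from the data set):

```python
import pandas as pd

# Toy stand-in for df_want; the numbers are made up for illustration
toy = pd.DataFrame({
    "country": ["Canada", "World", "Asia", "Japan"],
    "co2": [1.0, 10.0, 5.0, 2.0],
})
non_countries_toy = ["World", "Asia"]

# ~ negates the boolean mask; assigning the result keeps the filtered rows
toy_countries = toy.loc[~toy["country"].isin(non_countries_toy)]

print(toy_countries["country"].tolist())  # ['Canada', 'Japan']
print(len(toy))                           # 4: the original frame is unchanged
```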


## Explore the data set

##### $\rightarrow$ Display the head of the data frame

##### Solution

In [None]:
df_want.head()

Unnamed: 0,country,year,population,co2
0,Afghanistan,1850,3752993.0,
1,Afghanistan,1851,3769828.0,
2,Afghanistan,1852,3787706.0,
3,Afghanistan,1853,3806634.0,
4,Afghanistan,1854,3825655.0,


##### $\rightarrow$ Use the `info` method to further explore the data.
1.  Are there any columns where the data type is obviously incorrect? For example, is there a variable that should be a number, but is coded as a string?
2.  Do any of the columns have missing (null) values?

In [None]:
df_want.info

<bound method DataFrame.info of            country  year  population     co2
0      Afghanistan  1850   3752993.0     NaN
1      Afghanistan  1851   3769828.0     NaN
2      Afghanistan  1852   3787706.0     NaN
3      Afghanistan  1853   3806634.0     NaN
4      Afghanistan  1854   3825655.0     NaN
...            ...   ...         ...     ...
46518     Zimbabwe  2017  14751101.0   9.596
46519     Zimbabwe  2018  15052191.0  11.795
46520     Zimbabwe  2019  15354606.0  11.115
46521     Zimbabwe  2020  15669663.0  10.608
46522     Zimbabwe  2021  15993525.0  11.296

[46523 rows x 4 columns]>
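
`info` is a method, so it has to be called with parentheses. The output above is the repr of the bound method (which embeds a preview of the frame), not the summary of data types and non-null counts that the questions refer to; the same applies to any other `.info` reference without parentheses. A small sketch on a made-up frame (the name `toy` and its values are illustrative assumptions) shows the call, together with `isna().sum()` as a direct count of missing values per column:

```python
import pandas as pd

# Toy stand-in for df_want; values are made up for illustration
toy = pd.DataFrame({
    "country": ["Canada", "Canada", "Japan"],
    "year": [1785, 2020, 2020],
    "population": [None, 38.0e6, 126.0e6],
    "co2": [0.004, 535.0, 1040.0],
})

# Calling info() prints each column's dtype and non-null count
toy.info()

# isna().sum() counts the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```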

##### Solution

In [None]:
df_new = df_want.dropna()

In [None]:
df_new.info

<bound method DataFrame.info of            country  year  population     co2
99     Afghanistan  1949   7624058.0   0.015
100    Afghanistan  1950   7480464.0   0.084
101    Afghanistan  1951   7571542.0   0.092
102    Afghanistan  1952   7667534.0   0.092
103    Afghanistan  1953   7764549.0   0.106
...            ...   ...         ...     ...
46518     Zimbabwe  2017  14751101.0   9.596
46519     Zimbabwe  2018  15052191.0  11.795
46520     Zimbabwe  2019  15354606.0  11.115
46521     Zimbabwe  2020  15669663.0  10.608
46522     Zimbabwe  2021  15993525.0  11.296

[25558 rows x 4 columns]>
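
With no arguments, `dropna()` removes every row that has a missing value in any column. Rows such as Canada in 1785, which have a `co2` value but no `population`, are therefore dropped as well. If only `co2` is needed, the `subset` parameter restricts the check to that column. A hedged sketch on a tiny frame (the 2020 row is made up; the names `toy`, `drop_any`, and `drop_co2` are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["Canada", "Canada", "Afghanistan"],
    "year": [1785, 2020, 1850],
    "population": [np.nan, 38.0e6, 3752993.0],  # the 2020 value is made up
    "co2": [0.004, 535.0, np.nan],              # the 2020 value is made up
})

drop_any = toy.dropna()                # drops rows with a NaN in any column
drop_co2 = toy.dropna(subset=["co2"])  # drops rows only where co2 is NaN

print(len(drop_any))  # 1
print(len(drop_co2))  # 2
```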

##### $\rightarrow$ What years are present in the data set?

In [None]:
# List the distinct years present in the data, in ascending order
np.sort(df_new['year'].unique())

##### Solution

## Analysis of top emissions in 2020

##### $\rightarrow$ Find the top 10 emitters of total CO$_2$ in 2020.



##### Solution
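
One possible approach, sketched on synthetic data rather than the real data set: filter to the year of interest, then use `nlargest` to pick the rows with the highest `co2`. The country names and values below are hypothetical placeholders, and `toy`, `toy_2020`, and `top10` are illustrative names.

```python
import pandas as pd

# Synthetic stand-in: 12 hypothetical countries, two years, made-up co2 values
countries = [f"Country {c}" for c in "ABCDEFGHIJKL"]
toy = pd.DataFrame({
    "country": countries * 2,
    "year": [2019] * 12 + [2020] * 12,
    "co2": [float(v) for v in range(1, 13)] + [float(v * 10) for v in range(12, 0, -1)],
})

# Keep the 2020 rows, then take the 10 rows with the largest co2 values
toy_2020 = toy[toy["year"] == 2020]
top10 = toy_2020.nlargest(10, "co2")
top10_names = top10["country"].tolist()
print(top10_names)
```

On the real data, the same steps only give country-level results if aggregate rows such as `World` have been removed first.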

##### $\rightarrow$ Make a histogram of total CO$_2$ emissions in 2020. Make the plot on a density scale.

##### Solution
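
A density-scale histogram rescales the bar heights so that the total bar area equals 1. The sketch below uses Matplotlib's `hist` with `density=True` on synthetic, right-skewed values (generated here purely for illustration; they are not emissions data) and checks the area numerically.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic, right-skewed stand-in for the 2020 co2 values (made-up numbers)
rng = np.random.default_rng(0)
toy_2020 = pd.DataFrame({"co2": rng.lognormal(mean=2.0, sigma=1.5, size=200)})

fig, ax = plt.subplots()
# density=True rescales bar heights so that the total bar area equals 1
heights, edges, _ = ax.hist(toy_2020["co2"], bins=30, density=True)
ax.set_xlabel("CO2 emissions (synthetic values)")
ax.set_ylabel("Density")

# Sum of height * width over all bars
total_area = float(np.sum(heights * np.diff(edges)))
print(total_area)
```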

##### $\rightarrow$ Make a boxplot of total CO$_2$ emissions in 2020. Add a strip plot on top of the boxplot.

##### Solution
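
A sketch of one way to layer the two plots with seaborn, assuming both are drawn on the same Matplotlib axes. The data are synthetic, and hiding the boxplot's outlier markers with `showfliers=False` is a design choice made here so that points are not drawn twice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic, right-skewed stand-in for the 2020 co2 values (made-up numbers)
rng = np.random.default_rng(0)
toy_2020 = pd.DataFrame({"co2": rng.lognormal(mean=2.0, sigma=1.5, size=200)})

fig, ax = plt.subplots()
# Draw the boxplot first, without its own outlier markers
sns.boxplot(data=toy_2020, x="co2", showfliers=False, ax=ax)
# Overlay every observation as a small jittered point
sns.stripplot(data=toy_2020, x="co2", color="black", alpha=0.4, size=3, ax=ax)
ax.set_xlabel("CO2 emissions (synthetic values)")
```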

##### $\rightarrow$ Are the CO$_2$ emissions of the top 10 emitters in 2020 outliers in the distribution?

##### Solution
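
The lab does not fix a definition of "outlier". One common convention, assumed here, is the boxplot rule: a value is a high outlier if it exceeds the third quartile by more than 1.5 times the interquartile range. A minimal sketch on made-up values for hypothetical countries:

```python
import pandas as pd

# Made-up 2020 co2 values for 12 hypothetical countries
co2_2020 = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 500.0, 1000.0],
    index=[f"Country {c}" for c in "ABCDEFGHIJKL"],
)

# Quartiles and the upper fence used by the 1.5 * IQR rule
q1, q3 = co2_2020.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Flag which of the 10 largest values lie above the fence
top10 = co2_2020.nlargest(10)
is_outlier = top10 > upper_fence
print(is_outlier)
```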

## Emission trend over time

##### $\rightarrow$ Is the data set missing any CO$_2$ emission values for the top 10 emitters in 2020 over the years 1971 to 2020?

##### Solution
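
Data can be missing in two ways: a row exists with `NaN` in `co2`, or the row for a country and year is absent altogether. Pivoting to a year-by-country table and reindexing over the full range of years exposes both cases as `NaN`. The sketch below uses three hypothetical countries with made-up values; `toy`, `wide`, and `missing_counts` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 3 hypothetical emitters, years 1971 to 2020, made-up co2 values
years = list(range(1971, 2021))
names = ["Country A", "Country B", "Country C"]
toy = pd.DataFrame(
    [(name, year, float(year - 1970)) for name in names for year in years],
    columns=["country", "year", "co2"],
)
# Blank one value and delete one row to imitate the two kinds of missing data
toy.loc[(toy["country"] == "Country B") & (toy["year"] == 1980), "co2"] = np.nan
toy = toy[~((toy["country"] == "Country C") & (toy["year"] == 1990))]

subset = toy[toy["country"].isin(names) & toy["year"].between(1971, 2020)]
# One row per year, one column per country; absent rows become NaN
wide = subset.pivot(index="year", columns="country", values="co2")
wide = wide.reindex(index=years, columns=names)

missing_counts = wide.isna().sum()
print(missing_counts)
```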

##### $\rightarrow$ Plot the time plot of the total CO$_2$ emissions from 1971 to 2020 for the top 10 emitters in 2020.

##### Solution
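
A sketch of one way to draw the time plot, assuming the data are first pivoted so that each country becomes a column indexed by year; pandas then draws one line per column. The countries and linear trends below are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

years = list(range(1971, 2021))
# Made-up linear trends for three hypothetical countries
slopes = {"Country A": 10.0, "Country B": 5.0, "Country C": 1.0}
toy = pd.DataFrame(
    [(name, year, slope * (year - 1970)) for name, slope in slopes.items() for year in years],
    columns=["country", "year", "co2"],
)

# One column per country, indexed by year, so each column is drawn as its own line
wide = toy.pivot(index="year", columns="country", values="co2")

fig, ax = plt.subplots()
wide.plot(ax=ax)
ax.set_xlabel("Year")
ax.set_ylabel("CO2 emissions (synthetic values)")
ax.set_title("Total CO2 emissions, 1971 to 2020")
```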

##### $\rightarrow$ Again, plot the time plot of the total CO$_2$ emissions from 1971 to 2020 for the top 10 emitters in 2020, but now also include a plot of the mean total CO$_2$ emissions over all countries on the same plot.

##### Solution
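
The yearly mean over all countries can be computed with `groupby("year")` on the full frame (not only the top emitters) and drawn on the same axes. The sketch below is an illustration on synthetic data: four hypothetical countries, of which two play the role of the top emitters.

```python
import pandas as pd
import matplotlib.pyplot as plt

years = list(range(1971, 2021))
# Made-up linear trends for four hypothetical countries
slopes = {"Country A": 10.0, "Country B": 5.0, "Country C": 1.0, "Country D": 0.0}
toy = pd.DataFrame(
    [(name, year, slope * (year - 1970)) for name, slope in slopes.items() for year in years],
    columns=["country", "year", "co2"],
)
top = ["Country A", "Country B"]  # stand-in for the top emitters

# Mean over all countries for each year, computed from the full frame
mean_by_year = toy.groupby("year")["co2"].mean()
wide = toy.pivot(index="year", columns="country", values="co2")

fig, ax = plt.subplots()
wide[top].plot(ax=ax)
mean_by_year.plot(ax=ax, color="black", linestyle="--", label="Mean over all countries")
ax.set_ylabel("CO2 emissions (synthetic values)")
ax.legend()
```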

##### $\rightarrow$ Given the large difference between the smallest and largest values, it can help to plot the results on a log scale. Produce the plot of the top 10 emitters and the mean with CO$_2$ emissions on a log scale.

##### Solution
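
Switching an existing Matplotlib axis to a log scale takes a single call, `ax.set_yscale("log")`; passing `logy=True` to the pandas `plot` method is an alternative. The sketch uses two hypothetical series on very different scales with made-up exponential growth. All values must be positive for a log axis.

```python
import pandas as pd
import matplotlib.pyplot as plt

years = list(range(1971, 2021))
# Made-up exponential growth on very different scales
toy = pd.DataFrame(
    {
        "Country A": [1000.0 * 1.02 ** i for i in range(len(years))],
        "Country B": [1.0 * 1.05 ** i for i in range(len(years))],
    },
    index=pd.Index(years, name="year"),
)

fig, ax = plt.subplots()
toy.plot(ax=ax)
ax.set_yscale("log")  # log-scale the y axis so both series remain readable
ax.set_ylabel("CO2 emissions (synthetic values, log scale)")
```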

##### $\rightarrow$ Comment on the trend in CO$_2$ emissions from these countries over the last 50 years.

##### Solution