#### Data Processing with Python

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


###### IN CASE OF PROBLEMS IMPORTING PACKAGES


In [None]:
# SOLUTION A: select this cell and type Shift-Enter to execute the code below.

%conda install openpyxl pandas seaborn

# Now restart the kernel (Menu -> Kernel -> Restart Kernel)

In [None]:
# SOLUTION B: select this cell and type Shift-Enter to execute the code below.

%pip install openpyxl pandas seaborn

# Now restart the kernel (Menu -> Kernel -> Restart Kernel)

<hr>

Run the following cell to rebuild the three `DataFrames`:

In [None]:

# countries
countries = pd.read_excel("data_geographies_v1.xlsx", sheet_name="list-of-countries-etc")

# co2
data = pd.read_csv("yearly_co2_emissions_1000_tonnes.csv")
co2 = data.melt(id_vars=['country'], var_name='year', value_name='kt')
co2.dropna(inplace=True)
co2["year"] = co2["year"].astype(int)

# stats97
data = pd.read_csv('stats_1997.csv', header=None)
df = data[0].str.split('-', expand=True)
df.columns = ['geo','statistic']
df['value'] = data[1]
stats97 = df.pivot(index='geo',columns='statistic',values='value')


# 5. Manipulating Data

In this notebook, we will look at some more of pandas's data-handling tools.

***
## 5.1 Joining tables

To compare emissions between countries in a fair way, it would make sense to convert them to a per-capita basis. 
Let's start with the figures for 1997 to see how this can be done.

First we will make a new dataframe containing only the 1997 emissions:

In [None]:
co2_1997 = co2.query('year==1997')
co2_1997

The `co2` DataFrame does not contain population data, so we will need to look it up from another DataFrame by matching on the country name.

This type of **relational data**, where information must be collected from multiple tables, requires careful handling to make sure that rows in different tables are correctly associated with each other. The country name acts as a **key** to unlock the correct data from the associated table.

The relevant population data is in the stats97 table: 

In [None]:
stats97

However, this is indexed by the `geo` code, rather than the `country` name that we find in `co2_1997`. Fortunately, the `countries` table contains both:

In [None]:
countries.head()

Taking the `co2_1997` data, we apply a `join()` to relate its `country` variable to the `name` variable in `countries`.

To do this, we need to set these columns as the index in each table, because `join()` matches rows on their index values:

In [None]:
a = co2_1997.set_index('country')
b = countries.set_index('name')
c = a.join(b)
c.head()


For every row in the table `a`, `join()` tries to match its index with a row index in `b`.
The resulting table imports the additional columns from the `countries` DataFrame, so now we can associate each `geo` code with the correct CO2 emissions.
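
The matching behaviour is easiest to see on a small example. The sketch below uses two toy tables with made-up country names and values (they are not taken from the course data files). By default `join()` performs a left join, so every row of the left table is kept and rows without a partner receive missing values:

```python
import pandas as pd

# Toy tables with hypothetical values, for illustration only
emissions = pd.DataFrame(
    {"country": ["Aland", "Borduria", "Syldavia"], "kt": [10.0, 250.0, 40.0]}
).set_index("country")
codes = pd.DataFrame(
    {"name": ["Aland", "Syldavia"], "geo": ["ala", "syl"]}
).set_index("name")

# join() defaults to how="left": every row of `emissions` is kept
joined = emissions.join(codes)

# Rows with no partner in `codes` end up with a missing geo code
unmatched = joined[joined["geo"].isna()]

print(joined)
print(unmatched.index.tolist())
```

Checking for unmatched rows in this way is a quick test of whether the keys in the two tables are spelled consistently.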

`join()` is just one of several pandas functions for working with relational data.
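
One of the alternatives is `merge()`, which matches on ordinary columns rather than on the index. The following is a minimal sketch on toy data (the names and values are invented for illustration). The `validate` argument is optional and asks pandas to raise an error if a key appears more than once in the right-hand table:

```python
import pandas as pd

# Toy tables with hypothetical values, for illustration only
left = pd.DataFrame({"country": ["Aland", "Syldavia"], "kt": [10.0, 40.0]})
right = pd.DataFrame({"name": ["Syldavia", "Aland"], "geo": ["syl", "ala"]})

# Match left["country"] against right["name"] without touching the index
merged = left.merge(
    right,
    left_on="country",
    right_on="name",
    how="left",              # keep every row of `left`
    validate="many_to_one",  # fail if a name is duplicated in `right`
)

print(merged)
```

Note that both key columns (`country` and `name`) are kept in the result, so one of them can be dropped afterwards if it is not needed.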


#### Exercise

Use another `join()` to connect `c` to `stats97`.

*Hint*: you can move the current index column back into the body of the DataFrame using the method [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html?highlight=reset_index#pandas.DataFrame.reset_index).
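
As a rough illustration of the hint, the sketch below shows `reset_index()` on a toy table (hypothetical names and values, not the exercise data). The index becomes an ordinary column again, after which a different column can be chosen as the key:

```python
import pandas as pd

# Toy table with hypothetical values, indexed by name
t = pd.DataFrame(
    {"name": ["Aland", "Syldavia"], "geo": ["ala", "syl"], "kt": [10.0, 40.0]}
).set_index("name")

# reset_index() moves the index back into the body as a column
flat = t.reset_index()

# A different column can then be used as the key
rekeyed = flat.set_index("geo")

print(rekeyed)
```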

#### Exercise

Calculate the per-capita emissions for 1997 as a new column and plot these on a histogram.

#### Exercise

The file `population_total.csv` contains (real or predicted) population data for each country for the years 1800-2100.

Write a workflow to construct a new DataFrame `co2_pp` containing the following columns:

* country
* year
* kt = total CO2 emissions (in kilotonnes)
* pop = total population
* t_pp = per-capita CO2 emissions (in tonnes)



***

## 5.2 Summarising data across groups

Cases often belong to distinct groups that we want to compare with each other in some way.


#### Exercise

Using the output of the previous exercise and the `countries` dataframe, add a column for the `eight_regions` grouping.

## Box plots

Let's look at the data for 2014 only. Here's a more complex visualisation of the data:

In [None]:
d = co2_pp.query('year==2014')
g = sns.catplot(x="eight_regions",y="t_pp",data=d,kind="box")
g.set_axis_labels("", "CO2 emissions per capita / tonnes")
g.set_xticklabels(rotation=90)
plt.show()

## groupby()

Pandas allows you to define groups of rows and construct summary statistics for each group:

In [None]:
grouped = co2_pp.groupby("eight_regions")

In [None]:
grouped['kt'].sum()
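
Several summary statistics can be computed in one step with `agg()`. The sketch below uses a small toy table with invented region names and values rather than `co2_pp`, so that the results are easy to check by hand:

```python
import pandas as pd

# Toy data with hypothetical regions and values
toy = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "kt": [10.0, 30.0, 5.0, 15.0, 40.0],
})

# One row per group, one column per summary statistic
summary = toy.groupby("region")["kt"].agg(["sum", "mean", "median", "count"])

print(summary)
```

The result is a DataFrame indexed by the grouping column, so it can be passed on to plotting functions or joined to other tables.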

##### *Exercise*

Plot the total global CO2 emissions for each year.


##### *Exercise*

Plot the yearly median per-capita CO2 emissions for the eight regions, from 1950 onwards.


***