#### Data Processing with Python

In [None]:
import pandas as pd
import numpy as np

<hr>
###### IN CASE OF PROBLEMS IMPORTING PACKAGES


In [None]:
# SOLUTION A: select this cell and press Shift-Enter to execute the code below.

%conda install numpy openpyxl pandas seaborn

# Now restart the kernel (Menu -> Kernel -> Restart Kernel)

In [None]:
# SOLUTION B: select this cell and press Shift-Enter to execute the code below.

%pip install numpy openpyxl pandas seaborn

# Now restart the kernel (Menu -> Kernel -> Restart Kernel)

<hr>

Run the following cell to rebuild the three `DataFrames` from the last notebook:

In [None]:

# countries
countries = pd.read_excel("data_geographies_v1.xlsx", sheet_name = "list-of-countries-etc")

# co2
data = pd.read_csv("yearly_co2_emissions_1000_tonnes.csv")
co2 = data.melt(id_vars=['country'], var_name='year', value_name='kt')
co2.dropna(inplace=True)
co2["year"] = co2["year"].astype(int)

# stats97
data = pd.read_csv('stats_1997.csv', header=None)
df = data[0].str.split('-', expand=True)
df.columns = ['geo','statistic']
df['value'] = data[1]
stats97 = df.pivot(index='geo',columns='statistic',values='value')


# 4. Visualising Data

Having loaded and tidied some data, a sensible next step is to visualise the distributions of variables to check for any issues.

[Matplotlib](https://matplotlib.org) is the base graphics library in Python. It has many useful plotting functions, but in this session we will focus on [seaborn](https://seaborn.pydata.org), a package built on top of matplotlib that supports a consistent approach to data visualisation and integrates well with the pandas data structures.

We need to import both matplotlib and seaborn:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# some adjustments to the default image resolution
plt.rcParams['figure.dpi']= 100
plt.rcParams['figure.figsize'] = (4.5,2.5)
plt.rc("savefig", dpi=150)

Seaborn's defaults make it easier to produce presentation- and publication-quality graphics in Python.

The following commands override the matplotlib defaults with the chosen seaborn style, so that any plots produced by matplotlib will have the seaborn styling:

In [None]:
sns.set()
sns.set_style("darkgrid")
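
To see what a seaborn style actually changes, one option is to compare a couple of its settings with matplotlib's defaults. The sketch below is only an illustration: the two settings it inspects (`axes.facecolor` and `axes.grid`) are arbitrary examples, and `axes_style()` is used because it returns the settings without applying them.

```python
import matplotlib as mpl
import seaborn as sns

# axes_style() returns the settings of a named style without applying them
darkgrid = sns.axes_style("darkgrid")

# Compare two example settings against matplotlib's built-in defaults
for key in ("axes.facecolor", "axes.grid"):
    print(key, "| matplotlib default:", mpl.rcParamsDefault[key],
          "| darkgrid:", darkgrid[key])
```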

## 4.1 Histogram

Let's start with a histogram for the GDP data from `stats97`.

A `DataFrame` has a set of built-in plotting methods that call the matplotlib functions, for example:

In [None]:
stats97.hist("gdp")
plt.show()
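
Because the pandas plotting methods draw with matplotlib, they hand back matplotlib objects that can be adjusted afterwards. A minimal sketch, using made-up data (the `toy` DataFrame and its `height_cm` column are hypothetical and unrelated to `stats97`):

```python
import numpy as np
import pandas as pd
from matplotlib.axes import Axes

# Hypothetical toy data, for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({"height_cm": rng.normal(170, 10, size=200)})

# Series.plot.hist() draws with matplotlib and returns the Axes it drew on
ax = toy["height_cm"].plot.hist(bins=20)

# The returned Axes accepts the usual matplotlib calls
ax.set_xlabel("height (cm)")

print(isinstance(ax, Axes))
```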

In seaborn, histograms are drawn with [`displot`](https://seaborn.pydata.org/generated/seaborn.displot.html#seaborn.displot), a general function for plotting distributions whose default plot type is a histogram.

It is a bit fancier than matplotlib's:

In [None]:
sns.displot(stats97['gdp'])
plt.show()

The data are highly positively skewed, so let's take the base-10 logarithm of the data using the NumPy function `log10`.

In [None]:
sns.displot(np.log10(stats97['gdp']))
plt.show()
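
The effect of the log transform can also be checked numerically. The sketch below uses a synthetic log-normal sample (an assumption standing in for a skewed variable such as GDP, not the real data) and compares the sample skewness before and after `log10`:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed sample (log-normal), standing in for a variable like GDP
rng = np.random.default_rng(42)
raw = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=1000))
logged = np.log10(raw)

# Series.skew() is near 0 for a symmetric distribution, and large and
# positive when there is a long right-hand tail
print("skewness before log10:", round(raw.skew(), 2))
print("skewness after log10: ", round(logged.skew(), 2))
```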

Keyword arguments to the seaborn functions control the aesthetics:

In [None]:
sns.displot(np.log10(stats97['gdp']),
             bins=30, 
             color='red')
plt.show()

The seaborn function `displot` returns a [`FacetGrid`](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html#seaborn.FacetGrid) object, which we can use to set titles etc:

In [None]:
g = sns.displot(np.log10(stats97['gdp']),
                bins=30,
                color='red')
# set_titles() only labels facets, so use set() to title this single plot
g.set(title='GDP in 1997')
g.set_xlabels('log10(GDP / USD_2010)')
plt.show()

##### *Exercise*
Make a similar histogram for 1997 population. Make it a different colour.

### Saving plots to file

We can save a plot to a file like so (here `g` is assumed to hold the population plot from the exercise above):

In [None]:
g.savefig('population.png')

The file type is determined by the file extension.

## Box plot

The same distribution can be summarised using the `boxplot()` function:


In [None]:
sns.boxplot(x=np.log10(stats97['gdp']))
plt.show()
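
The numbers behind a box plot can be computed directly. The sketch below uses a small made-up sample (not the GDP data) and assumes the default whisker rule of 1.5 times the interquartile range:

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Default whisker rule: points beyond 1.5 * IQR from the box are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"box: {q1} to {q3}, median {median}")
print("outliers:", outliers.tolist())
```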

## Violin plot

The `violinplot()` function gives a similar view, but also shows the shape of the distribution as a kernel density estimate:


In [None]:
sns.violinplot(x=np.log10(stats97['gdp']))
plt.show()
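
The difference between the two plot types is easiest to see on data with more than one peak. A sketch using a synthetic bimodal sample (hypothetical, not from `stats97`), with the two plots drawn side by side via the `ax` argument:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Hypothetical bimodal sample: two clusters centred on 0 and 6
rng = np.random.default_rng(3)
values = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])

fig, axes = plt.subplots(1, 2, figsize=(8, 2.5))

# The box plot summarises the sample with quartiles only
sns.boxplot(x=values, ax=axes[0])
axes[0].set_title("Box plot")

# The violin plot also traces a density estimate, so both peaks are visible
sns.violinplot(x=values, ax=axes[1])
axes[1].set_title("Violin plot")
```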

## Scatter plot

We can visualise covariation between variables using a scatter plot, for example GDP vs population. This uses `relplot()`; both variables are log-transformed first (here with the natural log, `np.log`):


In [None]:
sns.relplot(x='pop', y='gdp', data=np.log(stats97))
plt.show()
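
An alternative to transforming the data is to keep the original values and switch the axes to a logarithmic scale, so that the tick labels stay in the original units. A sketch on made-up data (the `toy` DataFrame and its `population` and `income` columns are hypothetical):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data spanning several orders of magnitude
rng = np.random.default_rng(4)
toy = pd.DataFrame({"population": rng.lognormal(10, 2, 100)})
toy["income"] = toy["population"] * rng.lognormal(0, 0.5, 100)

g = sns.relplot(x="population", y="income", data=toy)

# Leave the data untransformed and put both axes on a log scale instead
g.set(xscale="log", yscale="log")
```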

Using the `jointplot()` function with `kind='kde'`, we can use a kernel density estimate (KDE) to summarise this joint distribution:

In [None]:
sns.jointplot(x='pop', y='gdp', data=np.log(stats97), kind='kde')
plt.show()

## Derived variables

It might be more useful to compare countries' GDP on a per-capita basis. We need to make a new variable to show per-capita GDP. 

To do this, we will add a new column to the `DataFrame`.
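
The general pattern for a derived column can be sketched on an unrelated toy table (the `trips` DataFrame and its column names are hypothetical), leaving the GDP case for the exercise:

```python
import pandas as pd

# Hypothetical table, unrelated to stats97
trips = pd.DataFrame(
    {"distance_km": [120.0, 45.0, 300.0], "time_h": [2.0, 0.5, 4.0]},
    index=["A", "B", "C"],
)

# Arithmetic on columns is element-wise; assigning to a new name adds a column
trips["speed_kmh"] = trips["distance_km"] / trips["time_h"]

print(trips)
```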

##### *Exercise*

1. Add GDP per person (**gdp_pp**) as a new column of `stats97`.

2. Visualise the distribution of **gdp_pp**.


## Line plot

##### *Exercise*

Starting with the `co2` `DataFrame`, plot the annual emissions of a country of your choice.

*Hint*: use the seaborn `lineplot()` function.
