# ECB Data Academy - Evolve Programme

[Krisolis](http://www.theanalyticsstore.com)

## Advanced Data Manipulation With Pandas

This notebook explores more advanced data manipulation operations in pandas that allow us to create sophisticated datasets from multiple sources. 

Load some libraries

In [None]:
import pandas as pd
import numpy as np

### An Example Dataset

Load a dataset containing details about different world nations.

In [None]:
country_ind = pd.read_csv("..//Data//world_indicators_data_ext.csv", index_col = 0)
print(country_ind.shape)
display(country_ind)

### Joining DataFrames

#### Add Columns

The file *first_languages.csv* contains a list of the first languages spoken in the countries in the indicators dataset.

In [None]:
first_langs = pd.read_csv("..//Data//first_languages.csv")
print(first_langs.shape)
display(first_langs)

We can add these columns directly onto the data frame - **this assumes the data is in the same order**.

In [None]:
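# axis = 1 places the DataFrames side by side (adds columns);
# rows are matched on their index labels
country_ind_2 = pd.concat([country_ind, first_langs], 
                        axis = 1)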
display(country_ind_2)
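
The "same order" assumption deserves a closer look: with `axis = 1`, `pd.concat` matches rows on index labels rather than on position. The sketch below uses two small made-up DataFrames (the names `left` and `right` and all values are illustrative, not taken from the course files) to show what happens when the indexes differ, and one way to force a purely positional join.

```python
import pandas as pd

# Toy stand-ins for the notebook's data (all values are made up)
left = pd.DataFrame({"Country": ["A", "B", "C"], "Score": [1.0, 2.0, 3.0]},
                    index=[10, 11, 12])
right = pd.DataFrame({"Language": ["x", "y", "z"]})  # default index 0, 1, 2

# The index labels do not overlap, so concat pads with NaN instead of pairing rows
misaligned = pd.concat([left, right], axis=1)

# Resetting both indexes makes the join purely positional
positional = pd.concat([left.reset_index(drop=True),
                        right.reset_index(drop=True)], axis=1)

print(misaligned.shape)
print(positional.shape)
```

If the two shapes differ, the indexes did not line up and the rows were not paired as intended.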

#### Add Rows

Adding new rows to a DataFrame is easy using the **pd.concat** function with `axis = 0` (the older **append** method was removed in pandas 2.0). First load up some new data.

In [None]:
extra_rows = pd.read_csv("..//Data//world_indicators_data_short.csv", index_col = 0)
display(extra_rows)

Append the rows from the new dataframe to the existing one - notice how it handles missing and extra columns.

In [None]:
pd.concat([country_ind, extra_rows], 
          axis = 0)
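
To see the handling of missing and extra columns in isolation, here is a minimal sketch on made-up data (the frames `base` and `extra` and their values are hypothetical). The result contains the union of the columns, with `NaN` wherever a frame had no value for a column.

```python
import pandas as pd

# Made-up frames: "extra" lacks "Life Exp." and has an additional "Region" column
base = pd.DataFrame({"Country": ["A", "B"], "Life Exp.": [70.0, 80.0]})
extra = pd.DataFrame({"Country": ["C"], "Region": ["R1"]})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping the old labels
stacked = pd.concat([base, extra], axis=0, ignore_index=True)
print(stacked)
```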

#### Merging Data

We can also *merge* DataFrames in SQL-style join operations using the **merge** function. First load some extra data - in this case the populations of countries.

In [None]:
populations = pd.read_csv("..//Data//populations.csv",
                         index_col=0)
display(populations)

Now perform a merge of country indicators and populations using country names as the common key.

In [None]:
country_ind_with_pop = pd.merge(country_ind, 
                                populations, 
                                on="Country")
display(country_ind_with_pop)
print(country_ind.shape)
print(populations.shape)
print(country_ind_with_pop.shape)

The **how** parameter of merge determines the type of join that is performed. The options covered here are 'left', 'right', 'outer' and 'inner' (the default is 'inner').

In [None]:
country_ind_with_pop = pd.merge(country_ind, 
                                populations, 
                                on="Country", 
                                how = 'left')
display(country_ind_with_pop)
print(country_ind.shape)
print(populations.shape)
print(country_ind_with_pop.shape)

In [None]:
country_ind_with_pop = pd.merge(country_ind, 
                                populations, 
                                on="Country", 
                                how = 'right')
display(country_ind_with_pop)
print(country_ind.shape)
print(populations.shape)
print(country_ind_with_pop.shape)

In [None]:
country_ind_with_pop = pd.merge(country_ind, 
                                populations, 
                                on="Country", 
                                how = 'outer')
display(country_ind_with_pop)
print(country_ind.shape)
print(populations.shape)
print(country_ind_with_pop.shape)
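
The effect of each join type is easiest to see on a tiny example. The sketch below uses two made-up DataFrames (`indicators` and `pops`, with hypothetical values) that share only some keys, and records the number of rows each join produces. The `indicator=True` argument adds a `_merge` column showing which side each row came from.

```python
import pandas as pd

# Made-up data: "A" appears only on the left, "D" only on the right
indicators = pd.DataFrame({"Country": ["A", "B", "C"],
                           "Life Exp.": [70.0, 75.0, 80.0]})
pops = pd.DataFrame({"Country": ["B", "C", "D"],
                     "Population": [10, 20, 30]})

# Count the rows produced by each join type
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(indicators, pops, on="Country", how=how)
    row_counts[how] = len(merged)
print(row_counts)

# indicator=True labels each row as 'both', 'left_only' or 'right_only'
outer = pd.merge(indicators, pops, on="Country", how="outer", indicator=True)
print(outer[["Country", "_merge"]])
```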

#### Now you try ...

Load the dataset stored in the file **TriathloneData.csv**. The variables in this dataset are as follows:

* **Place:** The place in which the athlete finished the race (missing for non-finishers)
* **Number:** The athlete's race bib number
* **Wave:** The wave with which the athlete started (one of 1, 2, or 3)
* **Age_Cat:** The athlete's age category (one of 16-19, 20-29, 30-39, 40-49, or 50+)
* **Gender:** The gender that the athlete declared (one of 'M' or 'F')
* **TI_Number:** Some athletes are members of the Triathlon Ireland association and, if so, declare their membership number
* **Swim:** The time taken for the swimming leg of the event (in seconds)
* **T1:** The time taken for the first transition of the event (in seconds)
* **Cycle:** The time taken for the cycling leg of the event (in seconds)
* **T2:** The time taken for the second transition of the event (in seconds)
* **Run:** The time taken for the running leg of the event (in seconds)
* **Finish:** The time taken for the total event (in seconds)
    
Also load the dataset stored in the file **provinces.csv**. This dataset stores the Irish province to which each athlete belongs. The variables in this dataset are as follows:

* **Number:** The athlete's race bib number
* **Province:** The Irish province in which the athlete lives

Join the province data to the main race dataset using the bib number as a key.

Using the new joined dataset, calculate the average finishing time for each province.

### Aggregating DataFrames

If there are categorical variables in a dataset we can use them to define groups. Once groups are defined it is possible to perform analysis based on these groups.

To define groups within a DataFrame we use the **groupby** function, passing it the name of the column we would like to group by. Using the grouped data we can then perform grouped analysis.

In [None]:
country_ind_grp = country_ind.groupby(['Region'])
# numeric_only=True skips text columns, which cannot be averaged
country_ind_grp.mean(numeric_only=True)

Using groups we can also perform **data aggregation** jobs - rolling up multiple rows of data into a single row that aggregates them. To do this we use the **agg** function in connection with grouped data. For example, to create a dataset containing the mean life expectancy of each region we could use:

In [None]:
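# String names such as "mean" avoid the FutureWarning that newer pandas versions raise for NumPy callables
country_ind_grp['Life Exp.'].agg(["mean"])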

We can add multiple measures to this aggregation - for example including max and min as well as mean:

In [None]:
country_ind_grp['Life Exp.'].agg(["mean", "min", "max", len])

We can do this for multiple columns from the original dataset to be even more expressive.

In [None]:
country_ind_grp[['Life Exp.', 'Infant Mort.']].agg(["mean", "min", "max"])

Alternatively we can pass a dictionary specifying different functions to use for each column.

In [None]:
country_ind_grp.agg({"Region": len, "Life Exp.": ["mean"], "Infant Mort.": ["min"], "School Years": ["max"]})
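
A related option is pandas' *named aggregation*, where each output column is given as `new_name=(source_column, function)`. This gives flat, readable column names instead of a two-level header. The sketch below runs on a small made-up DataFrame (`toy` and its values are hypothetical; only the column names mirror the ones used in this notebook).

```python
import pandas as pd

# Made-up data reusing this notebook's column names
toy = pd.DataFrame({
    "Region": ["R1", "R1", "R2", "R2", "R2"],
    "Life Exp.": [70.0, 80.0, 60.0, 65.0, 70.0],
    "Infant Mort.": [5.0, 3.0, 20.0, 15.0, 10.0],
})

# Each keyword becomes one output column: name=(source column, function)
summary = toy.groupby("Region").agg(
    n_countries=("Life Exp.", "count"),      # counts non-missing values
    mean_life_exp=("Life Exp.", "mean"),
    min_infant_mort=("Infant Mort.", "min"),
)
print(summary)
```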

#### Now you try ...

Using the triathlon dataset, create an aggregated version of the data that stores the min, median, and max cycling, swimming and running times for each province.

### Reshaping Data - Optional Extra

Different data analytics tools and techniques work with data in different formats, or shapes.

#### Transpose

The simplest reshaping we can do is a **transpose**, which swaps rows and columns.

In [None]:
display(country_ind.head())

In [None]:
trans = country_ind.transpose()
display(trans.head())

#### Melt

We can convert from *short, fat* data to *long, skinny* data using a **melt** operation.

In [None]:
short_fat_olympic_data = pd.read_csv("..//Data//Olympic_Medal_Count.csv")
short_fat_olympic_data = short_fat_olympic_data.loc[0:11]
short_fat_olympic_data

In [None]:
# Melt the data
long_skinny_olympic_data = pd.melt(
    short_fat_olympic_data,
    id_vars=["Year"],
    var_name="Medal",
    value_name="Count",
)


display(long_skinny_olympic_data)
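
To make the mechanics of a melt explicit, the sketch below applies the same call to a tiny made-up table. The `Year` column name comes from the cell above; the `Gold` and `Silver` columns and all counts are assumptions for illustration. Every column not listed in `id_vars` is stacked into name/value pairs, so 2 rows with 2 medal columns become 4 rows.

```python
import pandas as pd

# Made-up wide table: one row per year, one column per medal type
wide = pd.DataFrame({
    "Year": [2000, 2004],
    "Gold": [1, 2],
    "Silver": [3, 4],
})

# Columns other than "Year" are stacked into (Medal, Count) pairs
long = pd.melt(wide, id_vars=["Year"], var_name="Medal", value_name="Count")
print(long)
```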

#### Cast

We can convert from *long, skinny* data to *short, fat* data using a **cast** (or pivot) operation.

In [None]:
# Load some data
quarterly_liabilities_data="..//Data//quarterly_liabilities.csv"
long_skinny_liabilities = pd.read_csv(quarterly_liabilities_data)
long_skinny_liabilities

In [None]:
# Cast to short fat data
short_fat_liabilities = long_skinny_liabilities.pivot(
    index="Account", columns="Quarter", values="Liability"
)

display(short_fat_liabilities)
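
One caveat worth knowing: `pivot` needs each index/column pair to appear at most once, because it has no rule for combining duplicates. The sketch below, on made-up data that reuses the column names from the cell above (the values are hypothetical), shows a successful pivot, the `ValueError` raised when a pair is repeated, and `pivot_table` resolving the duplicates with an aggregation function.

```python
import pandas as pd

# Made-up long data with one row per (Account, Quarter) pair
long = pd.DataFrame({
    "Account": ["acc1", "acc1", "acc2", "acc2"],
    "Quarter": ["Q1", "Q2", "Q1", "Q2"],
    "Liability": [100, 110, 200, 190],
})
wide = long.pivot(index="Account", columns="Quarter", values="Liability")

# Repeat the first row so that (acc1, Q1) appears twice
dupes = pd.concat([long, long.iloc[[0]]], ignore_index=True)
try:
    dupes.pivot(index="Account", columns="Quarter", values="Liability")
    pivot_failed = False
except ValueError:
    pivot_failed = True  # pivot cannot decide which duplicate to keep

# pivot_table combines duplicates with the given aggregation
wide_sum = dupes.pivot_table(index="Account", columns="Quarter",
                             values="Liability", aggfunc="sum")
print(wide)
print(pivot_failed)
print(wide_sum)
```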

Counting occurrences of different levels is a really common application of casting.

In [None]:
# Load some data
form_submission_data="..//Data//form_submission.csv"
long_skinny_form_sub = pd.read_csv(form_submission_data)

# Cast to one row per subject
#one_row_per_cust_form_sub = long_skinny_form_sub.pivot(index = "Cust_ID", columns = "Type")

one_row_per_cust_form_sub = pd.pivot_table(
    long_skinny_form_sub,
    index="Cust_ID",
    columns="Type",
    aggfunc=len,
    fill_value=0,
)
display(one_row_per_cust_form_sub)
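
For pure counting, `pd.crosstab` is a possible alternative that needs no aggregation function. The sketch below uses a small made-up table (`forms` and its values are hypothetical; only the `Cust_ID` and `Type` column names come from the cell above) and checks the result against an equivalent `groupby` followed by `unstack`.

```python
import pandas as pd

# Made-up submissions: one row per form submitted
forms = pd.DataFrame({
    "Cust_ID": [1, 1, 1, 2, 3, 3],
    "Type": ["A", "A", "B", "B", "A", "C"],
})

# One row per customer, one column per form type, cells hold counts
counts = pd.crosstab(forms["Cust_ID"], forms["Type"])

# The same table built by counting group sizes and spreading Type into columns
via_groupby = forms.groupby(["Cust_ID", "Type"]).size().unstack(fill_value=0)

print(counts)
```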