# ECB Data Academy - Evolve Programme
[Krisolis](http://www.krisolis.ie)

## Analysing Data in Data Frames

In this notebook we explore ways in which data in pandas data frames can be analysed.

In [None]:
import pandas as pd
# this will stop output being in scientific notation so much
pd.set_option('display.float_format', lambda x: '%.5f' % x) 
import numpy as np

### Loading Data 

We will use the world indicators dataset used previously to demonstrate analysis techniques.

In [None]:
country_indicators = pd.read_csv("..//Data//world_indicators_data.csv")
print(country_indicators.shape)
display(country_indicators)

### Basic Summary Statistics

pandas offers a range of easy-to-use analysis operations for DataFrames, including **mean**, **median**, **mode**, **var**, **std**, **quantile**, **min**, **max**, and **describe**. These basic statistics can be calculated for a single column or for a selection of columns.

In [None]:
country_indicators['Population'].mean()

In [None]:
country_indicators[['Population', 'Life Exp.']].mean()

In [None]:
country_indicators['Region'].mode()
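
The other operations listed above work in the same way. As a minimal sketch, the cell below uses a small made-up data frame (the values are hypothetical and are not taken from the world indicators file) to show **median**, **quantile**, **std**, and **var**.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the method calls
toy = pd.DataFrame({
    "Population": [10.0, 20.0, 30.0, 40.0],
    "Life Exp.": [60.0, 70.0, 80.0, 90.0],
})

median_pop = toy["Population"].median()            # middle value of the column
q90_pop = toy["Population"].quantile(0.9)          # 90th percentile (linear interpolation)
std_pop = toy["Population"].std()                  # sample standard deviation (ddof=1)
col_vars = toy[["Population", "Life Exp."]].var()  # one variance per selected column

print(median_pop, q90_pop, std_pop)
print(col_vars)
```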

We can also calculate descriptive statistics for all columns (**although we need to be careful about mixed type data frames**).

In [None]:
country_indicators.max()

In [None]:
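# numeric_only=True skips text columns, which recent pandas versions would otherwise reject
country_indicators.mean(numeric_only=True)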
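
To make the caution about mixed type data frames concrete, here is a small sketch on a made-up frame (hypothetical values) with one text column and one numeric column. It shows two ways of restricting a calculation to numeric columns: the `numeric_only` argument and `select_dtypes`.

```python
import pandas as pd

# Hypothetical mixed-type frame: one text column, one numeric column
mixed = pd.DataFrame({
    "Region": ["Europe", "Asia", "Europe"],
    "Population": [5.0, 100.0, 15.0],
})

# Option 1: ask mean() to ignore non-numeric columns
numeric_means = mixed.mean(numeric_only=True)

# Option 2: select the numeric columns first, then summarise
same_means = mixed.select_dtypes(include="number").mean()

print(numeric_means)
```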

The **describe** function is a great way to generate a suite of summary statistics for each column in a data frame. **Note:** if the DataFrame has mixed types, only numeric columns are included by default.

In [None]:
country_indicators.describe()

We can specify the quantiles to include for numeric column summaries.

In [None]:
country_indicators.describe(percentiles=[0.1, 0.5, 0.9])
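
As a quick illustration of how the requested quantiles appear in the output, the sketch below runs **describe** on a made-up column (hypothetical values). The requested values show up as rows labelled with percentages.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Population": [10.0, 20.0, 30.0, 40.0]})

# Request the 10th, 50th and 90th percentiles instead of the default quartiles
summary = toy.describe(percentiles=[0.1, 0.5, 0.9])

print(summary)
print(summary.loc["10%", "Population"])  # value at the 10th percentile
```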

We can also run **describe** separately for numeric and categorical columns.

In [None]:
country_indicators.describe(include='number')

In [None]:
country_indicators.describe(include='object')

We can force inclusion of both types, which leads to missing values wherever a statistic does not apply to a column's type.

In [None]:
country_indicators.describe(include='all')
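
The following sketch, on a made-up frame with hypothetical values, shows where those missing values come from: numeric statistics such as the mean are undefined for a text column, and text statistics such as the number of unique values are not reported for a numeric column.

```python
import pandas as pd

# Hypothetical mixed-type frame
toy = pd.DataFrame({
    "Region": ["Europe", "Asia", "Europe"],
    "Population": [5.0, 100.0, 15.0],
})

summary = toy.describe(include="all")
print(summary)

# 'mean' is missing for the text column, 'unique' is missing for the numeric one
print(pd.isna(summary.loc["mean", "Region"]))
print(pd.isna(summary.loc["unique", "Population"]))
```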

#### Now you try ...

Load the dataset stored in the file **TriathloneData.csv**. The variables in this dataset are as follows:

* **Place:** The place in which the athlete finished the race (missing for non-finishers)
* **Number:** The athlete's race bib number
* **Wave:** The wave with which the athlete started (one of 1, 2, or 3)
* **Age_Cat:** The athlete's age category (one of 16-19, 20-29, 30-39, 40-49, or 50+)
* **Gender:** The gender that the athlete declared (one of 'M' or 'F')
* **TI_Number:** Some athletes are members of Triathlon Ireland and, if so, declare their membership number
* **Swim:** The time taken for the swimming leg of the event (in seconds)
* **T1:** The time taken for the first transition of the event (in seconds)
* **Cycle:** The time taken for the cycling leg of the event (in seconds)
* **T2:** The time taken for the second transition of the event (in seconds)
* **Run:** The time taken for the running leg of the event (in seconds)
* **Finish:** The time taken for the total event (in seconds)

What is the average finish time?

What are the minimum, mean, and maximum cycling times?

On average, which component takes the longest: swimming, cycling, or running?

### Frequency Tables

Cross tabulations are a great way to measure how often values in different columns occur together.

In [None]:
pd.crosstab(country_indicators["Region"], 
            country_indicators["Landlocked"])

In [None]:
pd.crosstab(country_indicators["Region"], 
            country_indicators["First Language"])

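We can normalise counts within a frequency table. Notice the different behaviour depending on the value of the **normalize** parameter: **"all"** divides by the grand total, **"index"** divides each row by its row total, and **"columns"** divides each column by its column total.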

In [None]:
pd.crosstab(country_indicators["Region"], 
            country_indicators["Landlocked"], normalize="all")

In [None]:
pd.crosstab(country_indicators["Region"], 
            country_indicators["Landlocked"], normalize="index")

In [None]:
pd.crosstab(country_indicators["Region"], 
            country_indicators["Landlocked"], normalize="columns")
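
To check what each normalisation option does, the sketch below builds the three tables on a small made-up frame (hypothetical values) and confirms which totals add up to 1 in each case.

```python
import pandas as pd

# Hypothetical toy data: five countries in two regions
toy = pd.DataFrame({
    "Region": ["Europe", "Europe", "Asia", "Asia", "Asia"],
    "Landlocked": ["Yes", "No", "No", "No", "Yes"],
})

by_all = pd.crosstab(toy["Region"], toy["Landlocked"], normalize="all")
by_row = pd.crosstab(toy["Region"], toy["Landlocked"], normalize="index")
by_col = pd.crosstab(toy["Region"], toy["Landlocked"], normalize="columns")

print(by_all.values.sum())  # all cells together sum to 1
print(by_row.sum(axis=1))   # each row sums to 1
print(by_col.sum(axis=0))   # each column sums to 1
```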

Adding margins creates summary totals across rows and columns. 

In [None]:
pd.crosstab(country_indicators["Region"], 
            country_indicators["Landlocked"], margins=True)
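
By default the totals are labelled **All**. As a small sketch on made-up data (hypothetical values), the `margins_name` argument can be used to give the totals a different label, here "Total".

```python
import pandas as pd

# Hypothetical toy data: five countries in two regions
toy = pd.DataFrame({
    "Region": ["Europe", "Europe", "Asia", "Asia", "Asia"],
    "Landlocked": ["Yes", "No", "No", "No", "Yes"],
})

counts = pd.crosstab(toy["Region"], toy["Landlocked"],
                     margins=True, margins_name="Total")
print(counts)
```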

#### Now you try ...

Using the triathlon data, build a frequency table of the different age categories that took part in the race. Which age category had the most people?

Build a cross tabulation of wave numbers and age categories. Is there an even distribution of ages across waves?

### Pivot Tables

A pivot table is a real data analysis workhorse, allowing us to see summary statistics for different values grouped by the values of one or more columns. By default the summary statistic is the mean.

In [None]:
pd.pivot_table(country_indicators, 
               values = "Life Exp.", 
               index = "Region")
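
When no aggregation function is given, **pivot_table** computes the mean of each group. The sketch below checks this on a made-up frame (hypothetical values) by comparing the pivot table with a plain group-by mean, which is used here only as a cross-check.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Region": ["Europe", "Europe", "Asia", "Asia"],
    "Life Exp.": [80.0, 82.0, 70.0, 74.0],
})

pivot = pd.pivot_table(toy, values="Life Exp.", index="Region")
grouped = toy.groupby("Region")["Life Exp."].mean()

print(pivot)
# Largest absolute difference between the two results (0 when they agree)
print((pivot["Life Exp."] - grouped).abs().max())
```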

We can partition by another variable.

In [None]:
pd.pivot_table(country_indicators, 
               values = "Life Exp.", 
               index = "Region", 
               columns = "Landlocked")

We can partition by even more values.

In [None]:
pd.pivot_table(country_indicators, 
               values = ["Population", "Life Exp."], 
               index = "Region", 
               columns = "Landlocked")

We can also add different or multiple aggregation functions.

In [None]:
pd.pivot_table(country_indicators, 
               values = "Population", 
               index = "Region", 
               columns = "Landlocked", 
               aggfunc=["mean", "std", "min", "max"])
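
It is also possible to apply a different aggregation function to each value column by passing a dictionary as `aggfunc`. This is a minimal sketch on a made-up frame (hypothetical values), summing the population and averaging the life expectancy within each region.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Region": ["Europe", "Europe", "Asia", "Asia"],
    "Population": [5.0, 15.0, 100.0, 300.0],
    "Life Exp.": [80.0, 82.0, 70.0, 74.0],
})

# One aggregation function per value column
per_column = pd.pivot_table(
    toy,
    values=["Population", "Life Exp."],
    index="Region",
    aggfunc={"Population": "sum", "Life Exp.": "mean"},
)
print(per_column)
```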

#### Now you try ...

Using the triathlon data, use a pivot table to calculate the average finish time for the different age categories.

Using a pivot table, examine the maximum finish time broken down by age category and gender.