# ECB Data Academy - Evolve Programme
[Krisolis](http://www.krisolis.ie)

## Data Manipulation Operations in pandas

This notebook explores data manipulation operations in pandas that allow us to create sophisticated datasets from multiple sources. 

![Data Manipulation Operations](Data_Ops_1.png)

Load some libraries

In [None]:
import pandas as pd
import numpy as np

### An Example Dataset

Load a dataset containing details about different world nations.

In [None]:
country_ind = pd.read_csv("..//Data//world_indicators_data_ext.csv", index_col = 0)
print(country_ind.shape)
display(country_ind)

### Filtering Columns

Accessing *columns* in a DataFrame is simply a matter of using the name of the column (similar to dictionary selection) to give a single column Series:

In [None]:
school = country_ind["School Years"]
display(school)

If we want to keep single columns as a DataFrame rather than a Series we provide the name in a single-item list.

In [None]:
school = country_ind[["School Years"]]
display(school)

We can easily select multiple columns by passing a list of column names:

In [None]:
school_details = country_ind[["Country",
                                     "School Years"]]
display(school_details)

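We can also specify the columns we don't want to include. Note that **columns.difference** returns the remaining column names in sorted order rather than in their original order.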

In [None]:
pos_details = country_ind[
        country_ind.columns.difference(["Mil. Spend",
                                               "CPI"])]
display(pos_details)
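
As a minimal sketch of an alternative, **drop** with the `columns` argument excludes columns while keeping the original column order. The `toy` DataFrame below is a hypothetical stand-in for `country_ind`, with made-up values.

```python
import pandas as pd

# Hypothetical toy data standing in for country_ind (values are illustrative)
toy = pd.DataFrame(
    {"School Years": [7.2, 11.6],
     "Mil. Spend": [1.5, 0.3],
     "CPI": [3.8, 7.5],
     "Country": ["Brazil", "Ireland"]},
    index=["BR", "IE"])

# columns.difference returns the remaining names in sorted order
via_difference = toy[toy.columns.difference(["Mil. Spend", "CPI"])]
print(list(via_difference.columns))

# drop(columns=...) keeps the original column order and leaves toy unchanged
via_drop = toy.drop(columns=["Mil. Spend", "CPI"])
print(list(via_drop.columns))
```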

Columns in a DataFrame are easily removed using the **del** statement, which modifies the DataFrame in place:

In [None]:
del country_ind["GDP"]
country_ind.head()

### Filtering Rows

We can access rows by row label using **loc** or by integer position using **iloc**. Selecting a single row with either indexer returns a Series.

In [None]:
country_ind.loc['BR']

In [None]:
country_ind.iloc[7]
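
The following sketch illustrates the return types, using a hypothetical `toy` DataFrame with made-up values: a single label or position gives a Series, while a list of labels gives a DataFrame.

```python
import pandas as pd

# Hypothetical toy data standing in for country_ind (values are illustrative)
toy = pd.DataFrame(
    {"Country": ["Argentina", "Brazil", "China"],
     "School Years": [9.3, 7.2, 7.5]},
    index=["AR", "BR", "CN"])

row_by_label = toy.loc["BR"]       # single label -> Series
row_by_position = toy.iloc[1]      # single integer position -> Series
rows_as_frame = toy.loc[["BR"]]    # list of labels -> DataFrame with one row

print(type(row_by_label).__name__)
print(type(rows_as_frame).__name__)
```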

We can also easily slice by rows to get an extract from a DataFrame:

In [None]:
country_ind.iloc[4:9]

In [None]:
country_ind.iloc[:9]

In [None]:
country_ind.iloc[4:]

One very useful way to slice a DataFrame is using a condition. We can pass a sequence of Boolean values (typically a Boolean Series produced by a comparison) to **loc**, indicating which rows should be retained (True) and which should be filtered out (False).

In [None]:
military_country_ind = country_ind.loc[
                        country_ind["Mil. Spend"] > 2]
display(military_country_ind)

In [None]:
country_ind.loc[country_ind["School Years"] < 10]

In [None]:
country_ind.loc[
    (country_ind["Mil. Spend"] > 1) 
            & (country_ind["School Years"] < 10)]
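
To make the mechanics of conditions concrete, here is a small sketch on a hypothetical `toy` DataFrame with made-up values. It prints the Boolean Series that a comparison produces, then combines conditions with `|` (or) and `~` (not). Each condition needs its own parentheses because `&`, `|` and `~` bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical toy data standing in for country_ind (values are illustrative)
toy = pd.DataFrame(
    {"Mil. Spend": [0.8, 1.5, 2.1, 0.3],
     "School Years": [9.3, 7.2, 7.5, 11.6]},
    index=["AR", "BR", "CN", "IE"])

# A comparison yields a Boolean Series aligned with the row index
mask = toy["Mil. Spend"] > 1
print(mask)

# Rows satisfying at least one condition
either = toy.loc[(toy["Mil. Spend"] > 1) | (toy["School Years"] > 10)]

# Rows satisfying neither condition (negation of the combined mask)
neither = toy.loc[~((toy["Mil. Spend"] > 1) | (toy["School Years"] > 10))]

print(either)
print(neither)
```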

We can also delete rows by passing their row labels to the **drop** method, which returns a new DataFrame.

In [None]:
country_ind = country_ind.drop(['AR', 'CN'])
display(country_ind)

### Filtering Columns & Rows

We can combine row selection and column selection using **loc**. We pass it the row selection first (a label, slice, or Boolean condition), followed by a list of column headings. For example:

In [None]:
reduced_data = country_ind.loc[
                (country_ind["School Years"] < 10), 
                ["Country", "Infant Mort."]]
display(reduced_data)

### Now you try ...

Load the dataset stored in the file **TriathloneData.csv**. The variables in this dataset are as follows:

* **Place:** The place in which the athlete finished the race (missing for non-finishers)
* **Number:** The athlete's race bib number
* **Wave:** The wave with which the athlete started (one of 1, 2, or 3)
* **Age_Cat:** The athlete's age category (one of 16-19, 20-29, 30-39, 40-49, or 50+)
* **Gender:** The gender that the athlete declared (one of 'M' or 'F')
* **TI_Number:** Some athletes are members of the Triathlon Ireland association and, if so, declare their membership number
* **Swim:** The time taken for the swimming leg of the event (in seconds)
* **T1:** The time taken for the first transition of the event (in seconds)
* **Cycle:** The time taken for the cycling leg of the event (in seconds)
* **T2:** The time taken for the second transition of the event (in seconds)
* **Run:** The time taken for the running leg of the event (in seconds)
* **Finish:** The time taken for the total event (in seconds)
    
Create a subset of this data including just the **Number**, **Age_Cat**, and **Finish** columns. 

Create a subset of the triathlon data including just the **Number**, **Age_Cat**, and **Finish** columns and entries with finish times less than 5000 seconds.

### Sampling From DataFrames

We can easily sample from a pandas DataFrame using the **sample** method.

In [None]:
country_ind.sample(n = 5)

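Notice that the sampled rows keep their original index labels. This is the default behaviour (ignore_index = False), which the next cell makes explicit; setting ignore_index to True would instead relabel the rows 0, 1, ..., n - 1.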

In [None]:
country_ind.sample(n = 5, 
                   ignore_index = False)
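
The effect of `ignore_index` can be sketched on a hypothetical `toy` DataFrame with made-up values. The `random_state` argument is an addition here, used only so that both calls draw the same rows; the `ignore_index` argument assumes pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical toy data standing in for country_ind (values are illustrative)
toy = pd.DataFrame({"School Years": [9.3, 7.2, 7.5, 11.6]},
                   index=["AR", "BR", "CN", "IE"])

# Default behaviour: sampled rows keep their original labels
kept = toy.sample(n=2, random_state=0)

# ignore_index=True: the same rows, relabelled 0 and 1
relabelled = toy.sample(n=2, random_state=0, ignore_index=True)

print(kept)
print(relabelled)
```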

We can also perform sampling with replacement - note duplication of rows.

In [None]:
country_ind.sample(frac=0.5, 
                   replace = True)

Sampling 100% of the data without replacement shuffles the rows.

In [None]:
country_ind.sample(frac=1.0,
                  replace = False)
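
A quick way to convince yourself that a shuffle only reorders rows is sketched below on a hypothetical `toy` DataFrame with made-up values. The seed passed to `random_state` is arbitrary and only makes the shuffle repeatable.

```python
import pandas as pd

# Hypothetical toy data standing in for country_ind (values are illustrative)
toy = pd.DataFrame({"School Years": [9.3, 7.2, 7.5, 11.6]},
                   index=["AR", "BR", "CN", "IE"])

# Shuffle all rows; the seed value is arbitrary
shuffled = toy.sample(frac=1.0, replace=False, random_state=42)

# Sorting by index undoes the shuffle because toy's index is already sorted
restored = shuffled.sort_index()

print(shuffled)
print(restored.equals(toy))
```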

### Sorting DataFrames

The **sort_values** method sorts the rows of a DataFrame based on the values in one or more columns. The simplest example is to sort by a single column.

In [None]:
country_ind.sort_values(by = "Life Exp.")

We can change the sort order using the **ascending** argument.

In [None]:
country_ind.sort_values(by = "Life Exp.", 
                        ascending = False)

We can sort by multiple columns.

In [None]:
country_ind.sort_values(by = ["Region", 
                              "School Years"])

We can change the sort order of different columns independently.

In [None]:
country_ind.sort_values(by = ["Region", 
                              "School Years"], 
                      ascending = [True, 
                                   False])
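
One point worth noting is that **sort_values** returns a new, sorted DataFrame and leaves the original untouched unless the result is assigned. The sketch below shows this on a hypothetical `toy` DataFrame with made-up values.

```python
import pandas as pd

# Hypothetical toy data standing in for country_ind (values are illustrative)
toy = pd.DataFrame(
    {"Region": ["Europe", "Americas", "Americas", "Asia"],
     "School Years": [11.6, 9.3, 7.2, 7.5]},
    index=["IE", "AR", "BR", "CN"])

# Region ascending, then School Years descending within each region
ordered = toy.sort_values(by=["Region", "School Years"],
                          ascending=[True, False])

print(list(ordered.index))  # sorted result
print(list(toy.index))      # original order is unchanged
```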

### Deriving New Fields

We can easily add new columns to a DataFrame by simply referring to the new column name in an expression. For example:

In [None]:
country_ind["High Education"] = True
display(country_ind)

Most interestingly we can use other columns in the DataFrame to define the new value.

In [None]:
country_ind["High Education"] \
        = country_ind["School Years"] > 10
display(country_ind)

Or we can create the new field as the result of a calculation from existing fields.

In [None]:
country_ind["Mil School Ratio"] = \
            country_ind["Mil. Spend"]  /   \
            country_ind["School Years"]
display(country_ind)

### Recoding Fields

A common way to derive a new variable is to recode an existing variable. We can recode categorical variables using the **replace** method.

In [None]:
old = [False, True]
new = ["Low", "High"]

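# Assign the result back rather than using inplace = True on a selected
# column, which may not update the DataFrame in recent pandas versions
country_ind["High Education"] = \
        country_ind["High Education"].replace(old, new)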
display(country_ind)
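
As an alternative sketch, not the approach used above, the **map** method with a dictionary performs the same kind of recoding. The `flags` Series below is hypothetical. Any value missing from the dictionary becomes NaN, so the mapping should cover every category.

```python
import pandas as pd

# Hypothetical Boolean column (values are illustrative)
flags = pd.Series([False, True, True], index=["BR", "IE", "DE"])

# Each key is an old value and each dictionary value is its replacement
recoded = flags.map({False: "Low", True: "High"})

print(recoded)
```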

The **cut** function can be used to recode numeric values to categorical ones. It returns the bin to which each value in a Series belongs, based on a set of breakpoints.

In [None]:
break_points = [0, 60, 80, 120] 
life_exp_labels = ["low","medium","high"]

country_ind["Life Exp. Bin"] =  pd.cut(country_ind["Life Exp."],
                                       bins = break_points,
                                       labels = life_exp_labels,
                                       include_lowest = True)
display(country_ind)
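
How values that fall exactly on a breakpoint are binned depends on the `right` argument of **cut**. The sketch below uses a hypothetical Series of made-up values. By default (`right=True`) each bin includes its upper edge, whereas `right=False` makes each bin include its lower edge instead.

```python
import pandas as pd

# Hypothetical values chosen to sit on and around the breakpoints
values = pd.Series([0, 60, 60.1, 80, 95])
break_points = [0, 60, 80, 120]
bin_labels = ["low", "medium", "high"]

# Default: bins are (0, 60], (60, 80], (80, 120];
# include_lowest=True also admits the value 0
right_closed = pd.cut(values, bins=break_points, labels=bin_labels,
                      include_lowest=True)

# right=False: bins are [0, 60), [60, 80), [80, 120)
left_closed = pd.cut(values, bins=break_points, labels=bin_labels,
                     right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```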

### Now you try ...
Using the triathlon data, derive a new field that stores the percentage of their total time (**Finish**) that athletes spend in transition (**T1** and **T2**).

Using the **cut** function create a new field that categorises racers based on their finish times as follows:
- "fast": finish times < 5000 seconds
- "average": finish times between 5000 and 6000 seconds
- "slow": finish times > 6000 seconds