<p><font size="6"><b> 02 - Pandas: Basic operations on Series and DataFrames</b></font></p>


> *DS Data manipulation, analysis and visualisation in Python*  
> *December, 2017*

> *© 2016, Joris Van den Bossche and Stijn Van Hoey  (<mailto:jorisvandenbossche@gmail.com>, <mailto:stijnvanhoey@gmail.com>). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*

---

In [None]:
%matplotlib inline

import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

As you play around with DataFrames, you'll notice that many operations that work on NumPy arrays also work on DataFrames.


In [None]:
# redefining the example objects

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
                          'population': [11.3, 64.3, 81.3, 16.9, 64.9],
                          'area': [30510, 671308, 357050, 41526, 244820],
                          'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

In [None]:
countries.head()

# The 'new' concepts

## Element-wise operations

Just like with NumPy arrays, many operations are element-wise:

In [None]:
population / 100

In [None]:
countries['population'] / countries['area']

In [None]:
np.log(countries['population'])

The result of such an operation can be added as a new column, as follows:

In [None]:
countries["log_population"] = np.log(countries['population'])

In [None]:
countries.columns

In [None]:
countries['population'] > 40
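
The comparison above is also evaluated element-wise and returns a boolean Series. As a minimal sketch (the small `pop` Series is redefined here so the snippet is self-contained; the names `pop`, `mask` and `large` are illustrative), such a mask can be inspected or used to select the matching entries:

```python
import pandas as pd

# Illustrative copy of the population data (millions of inhabitants)
pop = pd.Series({'Belgium': 11.3, 'France': 64.3, 'Germany': 81.3,
                 'Netherlands': 16.9, 'United Kingdom': 64.9})

# Element-wise comparison: one True/False value per country
mask = pop > 40

# Boolean indexing keeps only the entries where the mask is True
large = pop[mask]

print(mask)
print(large)
```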

<div class="alert alert-info">

<b>REMEMBER</b>:

 <ul>
  <li>When an operation does NOT work element-wise, or you have no idea how to do it directly in pandas, use the <code>apply()</code> method</li>
  <li>A typical use case is applying a custom-written function or a <code>lambda</code> function</li>
</ul>
</div>

In [None]:
countries["population"].apply(np.log) # apply() works here too, although np.log is already element-wise

In [None]:
countries["capital"].apply(lambda x: len(x)) # equivalent to the built-in string method: countries["capital"].str.len()

In [None]:
def population_annotater(population):
    """annotate as large or small"""
    if population > 50:
        return 'large'
    else:
        return 'small'

In [None]:
countries["population"].apply(population_annotater) # a custom user function
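
For a simple two-way rule like this one, a vectorized alternative usually exists. The sketch below is not part of the original material: it assumes `np.where` as the replacement, uses an illustrative `pop` Series and a hypothetical `annotate` helper, and compares the labels produced by both approaches.

```python
import numpy as np
import pandas as pd

# Illustrative copy of the population data (millions of inhabitants)
pop = pd.Series([11.3, 64.3, 81.3, 16.9, 64.9],
                index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])

def annotate(value):
    """Label a population (in millions) as 'large' or 'small'."""
    return 'large' if value > 50 else 'small'

# Option 1: apply() calls the Python function once per element
labels_apply = pop.apply(annotate)

# Option 2: np.where evaluates the condition on the whole array at once
labels_where = pd.Series(np.where(pop > 50, 'large', 'small'), index=pop.index)

# Compare the two results element by element
print((labels_apply == labels_where).all())
```

On large data the vectorized form is typically faster, while `apply()` remains the fallback for logic that cannot easily be expressed with array operations.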

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>Calculate the population numbers relative to Belgium</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_02_basic_operations14.py

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>Calculate the population density for each country and add this as a new column to the dataframe.</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_02_basic_operations15.py

In [None]:
# %load _solutions/pandas_02_basic_operations16.py

<div class="alert alert-danger">

**WARNING**: **Alignment!** (unlike NumPy)

 <ul>
  <li>Pay attention to <b>alignment</b>: operations between Series align on the index labels, not on position:</li>
</ul> 

</div>

In [None]:
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]

In [None]:
s1

In [None]:
s2

In [None]:
s1 + s2
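
Because only `'France'` occurs in both indexes, the sum above contains `NaN` for `'Belgium'` and `'Germany'`. As a sketch of one possible way to handle this (treating a missing label as zero via `fill_value=0` is an assumption about what is wanted, not a general recommendation), the `add` method can be used instead of the `+` operator:

```python
import pandas as pd

s1 = pd.Series({'Belgium': 11.3, 'France': 64.3})
s2 = pd.Series({'France': 64.3, 'Germany': 81.3})

# The + operator aligns on the index; labels present in only one Series give NaN
total = s1 + s2

# Series.add with fill_value substitutes 0 for a label missing on one side
total_filled = s1.add(s2, fill_value=0)

print(total)
print(total_filled)
```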

## Aggregations (reductions)

Pandas provides a large set of **summary** functions that operate on different kinds of pandas objects (DataFrames, Series, Index) and produce a single value. When applied to a DataFrame, the result is returned as a pandas Series (one value for each column).

The average population number:

In [None]:
population.mean()

The minimum area:

In [None]:
countries['area'].min()

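For DataFrames, the aggregation is usually restricted to the numeric columns. Recent pandas versions (2.0 and later) no longer drop non-numeric columns automatically, so request this explicitly with `numeric_only=True`:

In [None]:
countries.median(numeric_only=True)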
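
To illustrate the difference between reducing a Series and reducing a DataFrame, here is a minimal sketch on a small illustrative frame (the `toy` name is made up for the example and holds a subset of the data above). Passing `numeric_only=True` makes the restriction to numeric columns explicit, so the text column is skipped rather than causing an error:

```python
import pandas as pd

toy = pd.DataFrame({'country': ['Belgium', 'France', 'Germany'],
                    'population': [11.3, 64.3, 81.3],
                    'area': [30510, 671308, 357050]})

# Reducing a single column (a Series) gives one scalar
max_area = toy['area'].max()

# Reducing the DataFrame gives a Series with one value per numeric column
medians = toy.median(numeric_only=True)

print(max_area)
print(medians)
```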

# Application on a real dataset

Reading in the Titanic data set...

In [None]:
df = pd.read_csv("../data/titanic.csv")

Quick exploration first...

In [None]:
df.head()

In [None]:
len(df)

The available metadata of the Titanic data set provides the following information:

VARIABLE   |  DESCRIPTION
------ | --------
survival       | Survival (0 = No; 1 = Yes)
pclass         | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name           | Name
sex            | Sex
age            | Age
sibsp          | Number of Siblings/Spouses Aboard
parch          | Number of Parents/Children Aboard
ticket         | Ticket Number
fare           | Passenger Fare
cabin          | Cabin
embarked       | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)


<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>What is the average age of the passengers?</li>
</ul>

</div>

In [None]:
# %load _solutions/pandas_02_basic_operations27.py

<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>Plot the age distribution of the Titanic passengers</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_02_basic_operations28.py

<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>What is the survival rate? (the relative number of people that survived)</li>
</ul>

Note: the 'Survived' column indicates whether someone survived (1) or not (0).
</div>

In [None]:
# %load _solutions/pandas_02_basic_operations29.py

In [None]:
# %load _solutions/pandas_02_basic_operations30.py

<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>What is the maximum Fare? And the median?</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_02_basic_operations31.py

In [None]:
# %load _solutions/pandas_02_basic_operations32.py

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>Calculate the 75th percentile (<code>quantile</code>) of the Fare price (Tip: check the docstring for how to specify the percentile)</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_02_basic_operations33.py

<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>Calculate the normalized Fares (relative to their mean)</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_02_basic_operations34.py

# Acknowledgement


> This notebook is partly based on material by Jake VanderPlas (https://github.com/jakevdp/OsloWorkshop2014).

---