<p><font size="6"><b> 02 - Pandas: Basic operations on Series and DataFrames</b></font></p>


> *Data wrangling in Python*  
> *November, 2020*
>
> *© 2020, Joris Van den Bossche and Stijn Van Hoey  (<mailto:jorisvandenbossche@gmail.com>, <mailto:stijnvanhoey@gmail.com>). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*

---

In [None]:
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

In [None]:
# redefining the example DataFrame

countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

In [None]:
countries.head()

# Element-wise operations

The typical arithmetic (+, -, \*, /) and comparison (==, >, <, ...) operations work *element-wise*.

With a scalar:

In [None]:
population = countries['population']
population

In [None]:
population * 1000

In [None]:
population > 50

With two Series objects:

In [None]:
countries['population'] / countries['area']
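
As a compact recap of the cells above, here is a minimal sketch on hypothetical toy Series (`s1` and `s2` are made up for illustration and are not part of the `countries` data). It shows a scalar operation, an operation between two Series, and a comparison:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
s1 = pd.Series([1.0, 2.0, 3.0])
s2 = pd.Series([10.0, 20.0, 30.0])

scaled = s1 * 2   # the scalar is applied to every element
ratio = s2 / s1   # elements are paired up by their index label
mask = s1 > 1.5   # a comparison returns a boolean Series

print(scaled.tolist(), ratio.tolist(), mask.tolist())
```

Note that the result of each operation is again a Series with the same index as the input.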

## Adding new columns

We can add a new column to a DataFrame with syntax similar to selecting a column: assign the result to the DataFrame, with the new column name between the `[]`.

For example, to add the population density calculated above, we can do:

In [None]:
countries['population_density'] = countries['population'] / countries['area'] * 1e6

In [None]:
countries
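
The same pattern can be sketched on a small, self-contained example. The `cities` DataFrame and its column names are hypothetical and chosen only to keep the arithmetic easy to check:

```python
import pandas as pd

# Hypothetical toy data: absolute inhabitant counts and areas in km²
cities = pd.DataFrame({
    "city": ["A", "B"],
    "inhabitants": [1000, 600],
    "area_km2": [4, 2],
})

# Assigning to a column name that does not exist yet creates the column
cities["density"] = cities["inhabitants"] / cities["area_km2"]

print(cities)
```

Assigning to an existing column name with the same syntax overwrites that column instead of adding a new one.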

# Aggregations (reductions)

Pandas provides a large set of **summary** functions that operate on different kinds of pandas objects (DataFrames, Series, Index) and produce a single value. When applied to a DataFrame, the result is returned as a pandas Series (one value for each column).

The average population number:

In [None]:
population.mean()

The minimum area:

In [None]:
countries['area'].min()

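For DataFrames, reductions such as `median` are only defined for the numeric columns. Pass `numeric_only=True` to restrict the calculation to those columns (recent pandas versions raise an error on the string columns otherwise):

In [None]:
countries.median(numeric_only=True)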
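
To make the difference between reducing a Series and reducing a DataFrame concrete, here is a minimal sketch on a hypothetical `toy` DataFrame (the data and names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "x": [1.0, 2.0, 6.0],
    "y": [10, 20, 30],
})

x_mean = toy["x"].mean()                      # Series -> a single scalar
col_max = toy[["x", "y"]].max()               # DataFrame -> Series, one value per column
col_median = toy.median(numeric_only=True)    # skips the non-numeric "name" column

print(x_mean)
print(col_max)
print(col_median)
```

The index of the resulting Series holds the column names of the original DataFrame.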

# Application on a real dataset

Reading in the Titanic data set...

In [None]:
df = pd.read_csv("data/titanic.csv")

Quick exploration first...

In [None]:
df.head()

In [None]:
len(df)

The available metadata of the Titanic data set provides the following information:

VARIABLE   |  DESCRIPTION
------ | --------
Survived       | Survival (0 = No; 1 = Yes)
Pclass         | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name           | Name
Sex            | Sex
Age            | Age
SibSp          | Number of Siblings/Spouses Aboard
Parch          | Number of Parents/Children Aboard
Ticket         | Ticket Number
Fare           | Passenger Fare
Cabin          | Cabin
Embarked       | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)


<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>What is the average age of the passengers?</li>
</ul>

</div>

In [None]:
# %load _solutions/pandas_02_basic_operations1.py

<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>Plot the age distribution of the titanic passengers</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_02_basic_operations2.py

<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>What is the survival rate? (the relative number of people that survived)</li>
</ul>
<br>

Note: the 'Survived' column indicates whether someone survived (1) or not (0).
</div>

In [None]:
# %load _solutions/pandas_02_basic_operations3.py

In [None]:
# %load _solutions/pandas_02_basic_operations4.py

<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>What is the maximum Fare? And the median?</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_02_basic_operations5.py

In [None]:
# %load _solutions/pandas_02_basic_operations6.py

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>Calculate the 75th percentile (<code>quantile</code>) of the Fare price (Tip: check the docstring to see how to specify the percentile)</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_02_basic_operations7.py

<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>Calculate the normalized Fares (normalized relative to its mean), and add this as a new column ('Fare_normalized') to the DataFrame.</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_02_basic_operations8.py

In [None]:
# %load _solutions/pandas_02_basic_operations9.py

<div class="alert alert-success">
    
**EXERCISE**:

* Calculate the log of the Fares. Tip: check the `np.log` function.

</div>

In [None]:
# %load _solutions/pandas_02_basic_operations10.py

# NumPy - multidimensional data arrays

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

* a powerful N-dimensional array/vector/matrix object
* sophisticated (broadcasting) functions
* functions implemented in C/Fortran, ensuring good performance when the code is vectorized
* tools for integrating C/C++ and Fortran code
* useful linear algebra, Fourier transform, and random number capabilities

This style of working is also known as *array-oriented computing*. The recommended convention to import NumPy is:

In [None]:
import numpy as np

## Speed

NumPy arrays are memory-efficient containers that provide fast numerical operations:

In [None]:
L = range(1000)
%timeit [i**2 for i in L]

In [None]:
a = np.arange(1000)
%timeit a**2
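
The `%timeit` magic only works in IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used. The variable names and the choice of `number=200` repetitions are assumptions for illustration, and the timings depend on the machine:

```python
import timeit

import numpy as np

values = range(1000)
arr = np.arange(1000)

# Both approaches compute the same squares
squares_list = [i**2 for i in values]
squares_arr = arr**2

# Total time in seconds for 200 repetitions of each approach
t_list = timeit.timeit(lambda: [i**2 for i in values], number=200)
t_arr = timeit.timeit(lambda: arr**2, number=200)

print(f"list comprehension: {t_list:.4f} s, NumPy: {t_arr:.4f} s")
```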

## It's used by Pandas under the hood

The columns of a DataFrame are internally stored using NumPy arrays. We can also retrieve this data as NumPy arrays, for example using the `to_numpy()` method:

In [None]:
arr = countries["population"].to_numpy()
arr

What we said above about element-wise operations and reductions works the same for NumPy arrays:

In [None]:
arr + 10

In [None]:
arr.mean()

NumPy contains more numerical functions than pandas, for example to calculate the logarithm:

In [None]:
np.log(arr)

Those functions can *also* be applied on pandas objects:

In [None]:
np.log(countries["population"])
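
A small sketch of what this means in practice, using a hypothetical labeled Series `s` (not part of the example data): applying a NumPy function to a Series returns a Series again, with the index preserved.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with string labels
s = pd.Series([1.0, 10.0, 100.0], index=["a", "b", "c"])

log_s = np.log(s)   # natural logarithm, applied element-wise

print(type(log_s))
print(log_s)
```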

<div class="alert alert-info">

__NumPy__ provides

* multi-dimensional, homogeneously typed arrays  (single data type!)

<br>

__Pandas__ provides

* 2D, heterogeneous data structure (multiple data types!)
* labeled (named) row and column index

</div>
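
The difference in data types summarized above can be illustrated with a minimal sketch. The array and the `frame` DataFrame are hypothetical examples:

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype: the integers are upcast to float
arr = np.array([1, 2.5, 3])
print(arr.dtype)

# A DataFrame keeps a separate dtype for each column
frame = pd.DataFrame({
    "name": ["a", "b"],
    "count": [1, 2],
    "ratio": [0.5, 1.5],
})
print(frame.dtypes)
```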

# Acknowledgement


> This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014).

---