<p><font size="6"><b>01 - Pandas: Data Structures </b></font></p>


> *DS Data manipulation, analysis and visualisation in Python*  
> *December, 2017*

> *© 2016, Joris Van den Bossche and Stijn Van Hoey  (<mailto:jorisvandenbossche@gmail.com>, <mailto:stijnvanhoey@gmail.com>). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*

---

In [None]:
import pandas as pd

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Introduction

Let's start right away by importing some data: the `titanic` dataset about the passengers of the Titanic and their survival:

In [None]:
df = pd.read_csv("../data/titanic.csv")

In [None]:
df.head()

Starting from such a tabular dataset, pandas provides the functionality to answer questions about the data in a few lines of code. Let's look at a few examples:

<div class="alert alert-warning">

 <ul>
  <li>What is the age distribution of the passengers?</li>
</ul> 

</div>

In [None]:
df['Age'].hist()

<div class="alert alert-warning">

 <ul>
  <li>How does the survival rate of the passengers differ between sexes?</li>
</ul> 

</div>

In [None]:
df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))
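
The lambda above divides the number of survivors by the group size. Assuming `Survived` is coded as 0/1 (which the sum-based calculation implies), that ratio is simply the group mean. A minimal sketch on a small made-up table (the `toy` frame and its values are illustrative, not taken from the Titanic data):

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic dataset
toy = pd.DataFrame({'Sex': ['female', 'female', 'male', 'male', 'male'],
                    'Survived': [1, 1, 0, 1, 0]})

# Ratio of survivors per group, written as in the cell above
rate_lambda = toy.groupby('Sex')['Survived'].aggregate(lambda x: x.sum() / len(x))

# The same quantity via the built-in mean of a 0/1 column
rate_mean = toy.groupby('Sex')['Survived'].mean()

print(rate_mean)
```

Both spellings give the same numbers; `mean()` is shorter and states the intent more directly.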

<div class="alert alert-warning">

 <ul>
  <li>Or how does the survival rate differ between the different classes of the Titanic?</li>
</ul> 

</div>

In [None]:
df.groupby('Pclass')['Survived'].aggregate(lambda x: x.sum() / len(x)).plot(kind='bar')

<div class="alert alert-warning">

 <ul>
  <li>Are young people (e.g. aged 25 or younger) likely to survive?</li>
</ul> 

</div>

In [None]:
df['Survived'].sum() / df['Survived'].count()

In [None]:
df25 = df[df['Age'] <= 25]
df25['Survived'].sum() / len(df25['Survived'])
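
The selection `df[df['Age'] <= 25]` relies on a boolean mask. The sketch below shows how that works on made-up values (the `toy` frame is hypothetical). Note that a comparison with a missing age evaluates to `False`, so rows without a known age drop out of the subset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 4.0],
                    'Survived': [0, 1, 1, 1]})

mask = toy['Age'] <= 25   # boolean Series, one entry per row
young = toy[mask]         # keeps only the rows where the mask is True

print(mask.tolist())
print(young['Survived'].mean())
```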

All the functionality used in the examples above will be explained throughout the course. We start with the data structures pandas works with.

# Data structures

Pandas provides two fundamental data objects, for 1D (``Series``) and 2D data (``DataFrame``).

## Series

A Series is a basic holder for **one-dimensional labeled data**. It can be created much as a NumPy array is created:

In [None]:
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s

### Attributes of a Series: `index` and `values`

The Series has a built-in concept of an **index**, which by default is the numbers *0* through *N - 1*:

In [None]:
s.index

You can access the underlying NumPy array representation with the `.values` attribute:

In [None]:
s.values

We can access series values via the index, just like for NumPy arrays:

In [None]:
s[0]

Unlike the NumPy array, though, this index can be something other than integers:

In [None]:
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2

In [None]:
s2['c']
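
To make the role of the index concrete, the sketch below rebuilds `s2` and looks at its labels and data separately. It also shows that a membership test checks the labels rather than the values, which is what makes a Series behave like a dictionary:

```python
import numpy as np
import pandas as pd

s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

print(s2.index.tolist())    # the labels
print(s2.values.tolist())   # the underlying data
print(s2['c'])              # value looked up by its label
print('c' in s2)            # membership is tested against the labels
print(2 in s2)              # 2 is a value, not a label
```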

### Pandas Series versus dictionaries

In this way, a ``Series`` object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value.

In fact, it's possible to construct a series directly from a Python dictionary:

In [None]:
pop_dict = {'Germany': 81.3, 
            'Belgium': 11.3, 
            'France': 64.3, 
            'United Kingdom': 64.9, 
            'Netherlands': 16.9}
population = pd.Series(pop_dict)
population

We can index the populations like a dict as expected:

In [None]:
population['France']

but with the power of NumPy arrays, for example element-wise arithmetic:

In [None]:
population * 1000
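
Other array-style operations work the same way. As a small sketch using the population figures from above, an element-wise comparison yields a boolean Series that can be used to select entries, and reductions such as `max()` are available directly:

```python
import pandas as pd

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

# Element-wise comparison gives a boolean Series with the same labels,
# which can then be used to filter the original Series
large = population[population > 20]
print(large)

# Reductions work as on a NumPy array
print(population.max())
```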

## DataFrames: Multi-dimensional Data

A DataFrame is a **tabular data structure** (a multi-dimensional object that holds labeled data) composed of rows and columns, akin to a spreadsheet, a database table, or R's data.frame object. You can think of it as multiple Series objects that share the same index.

<img src="../img/schema-dataframe.svg" width=50%>

One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.

Note that in the IPython notebook, the dataframe will display in a rich HTML view:

In [None]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
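
The idea of a DataFrame as several Series sharing one index can be sketched by building a frame from Series directly. The two small Series below are illustrative and deliberately list the countries in a different order, to show that columns are matched on labels rather than on position:

```python
import pandas as pd

population = pd.Series({'Belgium': 11.3, 'France': 64.3})
capital = pd.Series({'France': 'Paris', 'Belgium': 'Brussels'})  # different order on purpose

# Each Series becomes a column; rows are aligned on the index labels
small = pd.DataFrame({'population': population, 'capital': capital})
print(small)
```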

### Attributes of the DataFrame

Besides an `index` attribute, a DataFrame also has a `columns` attribute:

In [None]:
countries.index

In [None]:
countries.columns

To check the data types of the different columns:

In [None]:
countries.dtypes

An overview of that information can be given with the `info()` method:

In [None]:
countries.info()

A DataFrame also has a `values` attribute, but beware: when the columns hold heterogeneous data, all values are upcast to a common type:

In [None]:
countries.values
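
A short sketch of what this upcasting means in practice, on a reduced version of the `countries` table: mixing strings with numbers gives an array of generic Python objects, while mixing only floats and integers gives a float array.

```python
import pandas as pd

countries = pd.DataFrame({'country': ['Belgium', 'France'],
                         'population': [11.3, 64.3],
                         'area': [30510, 671308]})

# Strings and numbers together: the common type is the generic 'object'
print(countries.values.dtype)

# Floats and integers only: the integers are upcast to float
print(countries[['population', 'area']].values.dtype)
```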

To access a Series representing a column in the data, use typical indexing syntax:

In [None]:
countries['area']

### Changing the DataFrame index

If we don't like what the index looks like, we can set one of our columns as the index:

In [None]:
countries = countries.set_index('country')
countries

The reverse operation is `reset_index`:

In [None]:
countries.reset_index('country')
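
Both methods return a new DataFrame and leave the original untouched, which is why the `set_index` result above was assigned back to `countries`. A minimal round trip on a reduced table (the names `indexed` and `restored` are only used for this illustration):

```python
import pandas as pd

countries = pd.DataFrame({'country': ['Belgium', 'France'],
                         'population': [11.3, 64.3]})

indexed = countries.set_index('country')   # 'country' becomes the index
restored = indexed.reset_index()           # 'country' is a regular column again

print(indexed.index.name)
print(restored.columns.tolist())
print(countries.columns.tolist())          # the original frame is unchanged
```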

## Some useful methods on these data structures

Exploring a Series or DataFrame is essential: check what you're dealing with.

In [None]:
countries.head() # Top rows

In [None]:
countries.tail() # Bottom rows

One useful method is ``describe``, which computes summary statistics for each numeric column:

In [None]:
countries.describe()
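
As a small sketch of what `describe` returns, the reduced table below mixes a numeric column with a text column. With default settings, only the numeric column shows up in the summary, and individual statistics can be read from the result like from any other DataFrame:

```python
import pandas as pd

countries = pd.DataFrame({'population': [11.3, 64.3, 81.3],
                         'capital': ['Brussels', 'Paris', 'Berlin']})

summary = countries.describe()

print(summary.columns.tolist())     # only the numeric column is summarised
print(summary['population']['mean'])
```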

**Sort**ing your data **by** a specific column is another important first-check:

In [None]:
countries.sort_values(by='population')

<div class="alert alert-success">
    <b>EXERCISE</b>:

     <ul>
      <li>Check the help of the <code>sort_values</code> function and find out how to sort from the largest to the smallest values</li>
    </ul>
</div>

In [None]:
# %load _solutions/pandas_01_data_structures32.py

The **`plot`** method can be used to quickly visualize the data in different ways:

In [None]:
countries.plot()

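However, for this dataset the default plot does not say much: all numeric columns are drawn on one shared axis, and `area` is several orders of magnitude larger than `population`. Plotting a single column is more informative: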

In [None]:
countries['population'].plot(kind='barh')

<div class="alert alert-success">
    <b>EXERCISE</b>:

     <ul>
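      <li>You can play with the <code>kind</code> keyword of the <code>plot</code> function in the figure above: 'line', 'bar', 'hist', 'density', 'area', 'pie'. The kinds 'scatter' and 'hexbin' need two columns (<code>x</code> and <code>y</code>) and therefore only work on a DataFrame, not on a single column.</li>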
    </ul>
</div>

# Importing and exporting data

A wide range of input/output formats is natively supported by pandas:

* CSV, text
* SQL database
* Excel
* HDF5
* JSON
* HTML
* pickle
* ...

In [None]:
# pd.read_

In [None]:
# countries.to_
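
The `read_*` functions and `to_*` methods come in pairs. As a minimal sketch of a CSV round trip, the example below writes a small table to an in-memory text buffer instead of a file (the buffer is only used to keep the example self-contained) and reads it back:

```python
import io
import pandas as pd

countries = pd.DataFrame({'country': ['Belgium', 'France'],
                         'population': [11.3, 64.3]})

buffer = io.StringIO()
countries.to_csv(buffer, index=False)   # write without the row index

buffer.seek(0)                          # rewind before reading
roundtrip = pd.read_csv(buffer)
print(roundtrip)
```

With a real file, the buffer would simply be replaced by a file path in both calls.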

---
# Acknowledgement


> This notebook is partly based on material by Jake VanderPlas (https://github.com/jakevdp/OsloWorkshop2014).
