# Pandas Primer

This little primer on `pandas` is designed to help you get to grips with this very useful library and make the most of the bootcamp.

First you need to load the python libraries that you need. Libraries are like extensions to the base `python` that add functionality or help to make tasks more convenient to do. You will load some libraries that will boost your data handling capacity.

The main libraries include `numpy` and `pandas`, which are the most prominent libraries to work efficiently with data in python. Here you just use the `import` statement to, you guessed it, import the pandas library and make it accessible `as` `pd` in the following code to save some typing (4 characters to be precise...).

In [1]:
# libraries to work efficiently with data in python
import numpy as np
import pandas as pd

## What is `pandas`

Throughout this bootcamp you will be using `pandas`, which is a python library that makes it muuuuuch more convenient to work with data than the base `python` methods. `pandas` is built on top of `numpy`, which is the library that brings efficient numerical operations to `python`. As the author of `pandas`, Wes McKinney, puts it: '*pandas provides high level data manipulation tools built on top of NumPy*'. `pandas` makes it easy to work with tabular data by providing selections, merging, summary statistics and filling in of missing values, and it offers solutions to many other challenges that would be cumbersome to overcome with base `python`.

When you load data into python with `pandas` it is put into a special structure called a `DataFrame`. `DataFrame`s are what makes `pandas` so convenient to work with for data analysis. It is worth taking the time to understand what kind of an object a `DataFrame` is, or in other words what it is made of, in order to get all the benefits it has on offer.

![Table Anatomy Class](./table_anatomy.png)

You are familiar with what a table is: it has columns and rows, and often these are annotated with column labels and row labels respectively. But how are tables encoded in `pandas`? The raw data is stored in `numpy` **arrays**, and this is where `pandas` can leverage all the numerical data processing that `numpy` offers.

To add more convenient selection on top of this array, it is wrapped in a so-called `pd.Series`, which can be thought of as a table with a single column. Crucially, a `pd.Series` can have a (column) label and row labels. A `pd.Series` also stores data of a given type, e.g. numbers, words or times. Row labels are called indexes in `pandas`; much of what `pandas` does relies on them, and you will meet some of them later in the module. The `pd.Series` comes with many of the convenience methods that are included in `numpy`, such as `.sum()`, `.max()` etc. However, it has some additional functionality that `numpy` is missing. For instance, it is very easy to count the number of unique entries in a `pd.Series` by simply using the `.nunique()` method.

Finally a bunch of `pd.Series` in one table constitute the `DataFrame`, with column labels and an index. Crucially, `DataFrame`s can have different types of data in different columns, which is essential when representing tables. Within a `DataFrame`, the different columns of the table can easily be accessed via the *name* of the columns. Similarly, you can select individual rows via the indices.
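
To make this layering concrete, here is a minimal sketch that peels a `DataFrame` apart into a `pd.Series` and then into the underlying `numpy` array. The table contents and the variable names (`table`, `column`, `raw`) are made up for this illustration.

```python
import numpy as np
import pandas as pd

# hypothetical two-column table, invented for this illustration
table = pd.DataFrame(
    {"height": [180, 180, 210], "hobby": ["fencing", "football", "painting"]},
    index=["Peter", "Natalie", "Mark"],
)

column = table["height"]   # one column of the table is a pd.Series
raw = column.to_numpy()    # the data behind the Series is a numpy array

print(type(column).__name__)  # Series
print(type(raw).__name__)     # ndarray
print(column.name)            # the column label: height
print(list(column.index))     # the row labels
```

The column keeps both its label (`height`) and the row labels of the table it came from, while the array holds only the bare numbers.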

In this primer you will go through a lot of the basic pandas functionality and try to understand how these structures are built, so it will be easier to manipulate them later.

## Numpy arrays - `np.array`s

You will start with a simple `python` `list` of heights of individuals. This is base `python` and has no bells and whistles.

In [2]:
heights_list = [180, 180, 210]  # in cm for instance
heights_list

[180, 180, 210]

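Now if you want to find out the tallest person in this list, it is not possible to just do `heights_list.max()`, because a `list` has no such method. Spelled out by hand, the search is a tedious for loop (the built-in `max()` function would also work here, but the `list` itself carries no methods for this kind of calculation).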

In [3]:
max_height = 0
for height in heights_list:
    if height > max_height:
        max_height = height
print(max_height)

210


If you pack the heights_list into a `numpy` array, this becomes much easier. First you define the array.

In [4]:
heights_np_array = np.array([180, 180, 210])
heights_np_array

array([180, 180, 210])

You will notice that when you print the new `heights_np_array` it shows that you are dealing with a `numpy` array.

### Predefined `numpy` methods

On this array you can use the predefined method `.max()`.

In [5]:
heights_np_array.max()

210

And you see that with virtually no code, you got to the same correct answer. But you can do better with `pd.Series`.

## Pandas series - `pd.Series`

In [6]:
heights_pd_series = pd.Series([180, 180, 210])
heights_pd_series

0    180
1    180
2    210
dtype: int64

You see that you have something that resembles a column with indices on the left. You can also see the `dtype` entry that tells you what type of data is contained in the `pd.Series`, in this case it is numeric (`int64` for an integer). 

### `pandas` indices

What is most important here are the indices. They are essential and will be generated for you so that you can start selecting rows (as you will see later). One could also put names as the indices.

In [7]:
heights_pd_series = pd.Series([180, 180, 210], index=["Peter", "Natalie", "Mark"])
heights_pd_series

Peter      180
Natalie    180
Mark       210
dtype: int64

Now you can see that instead of the autogenerated indices, you have people's names. These indices can now be used to select rows, either by name with `.loc` or by numerical position with `.iloc` (note python has 0-based indexing, so the first row is accessed with 0). Older versions of `pandas` offered a single `.ix` accessor for both, but it was deprecated in version 0.20 and removed in version 1.0. Selecting by names is generally speaking a more robust way to do selections as you are less likely to get mixed up if the order of rows or columns changes without you realising.

In [8]:
heights_pd_series.iloc[0]

180

In [9]:
heights_pd_series.loc["Peter"]

180

You can also select multiple rows by providing a list of the values you want to be returned.

In [10]:
heights_pd_series.loc[["Peter", "Mark"]]

Peter    180
Mark     210
dtype: int64

Strictly speaking you do not need to specify the `.loc` when selecting by label in these cases, as it is implicit that you want to select rows, given that the `pd.Series` does not have columns. However, later in the example of `DataFrame`s it will be required.
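
A minimal sketch of the two kinds of row selection, using the label-based `.loc` and position-based `.iloc` accessors that current `pandas` releases provide. The variable names are illustrative only.

```python
import pandas as pd

heights = pd.Series([180, 180, 210], index=["Peter", "Natalie", "Mark"])

by_label = heights.loc["Mark"]   # select by row label
by_position = heights.iloc[2]    # select by position (0-based)

# position slices exclude the end point, like python lists
last_two = heights.iloc[1:3]
# label slices include the end point
peter_to_natalie = heights.loc["Peter":"Natalie"]

print(by_label, by_position)
print(list(last_two.index))
print(list(peter_to_natalie.index))
```

Note the difference between the two slices: slicing by label includes the end label, whereas slicing by position stops one before the end.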

Just to have a look at how the `pd.Series` stores the list of heights you can use the `.values` attribute.

In [11]:
heights_pd_series.values

array([180, 180, 210])

As you can see, the data is indeed stored as a `numpy` array. You can also access the row labels with `.index`.

In [12]:
heights_pd_series.index

Index(['Peter', 'Natalie', 'Mark'], dtype='object')

Furthermore you can assign a name to the index. This is achieved by simply assigning a value to the `.index.name` attribute of the `pd.Series`, as shown below.

In [13]:
heights_pd_series.index.name = 'name'
heights_pd_series

name
Peter      180
Natalie    180
Mark       210
dtype: int64

You can now see that the index '*column*' has a name. This can be useful when you convert between indexes and when you want to switch them back to normal columns, which you will see a bit later with `.reset_index()`.
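
As a quick preview, the sketch below shows what the index name is good for. The `name="height"` argument and the variable names are assumptions added for this illustration; they are not part of the cells above.

```python
import pandas as pd

heights = pd.Series(
    [180, 180, 210],
    index=["Peter", "Natalie", "Mark"],
    name="height",  # label for the values, chosen here for illustration
)
heights.index.name = "name"

# the named index becomes an ordinary column called "name"
as_table = heights.reset_index()
print(list(as_table.columns))  # ['name', 'height']
```

Without an index name the new column would get a generic label, so naming the index first keeps the resulting table readable.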

### `numpy` methods work on series

Getting back to how `pd.Series` can be useful, many of the methods that work on `np.array`s will also work on the `pd.Series`. For instance the `.max()`...

In [14]:
heights_pd_series.max()

210

...and `.mean()`.

In [15]:
heights_pd_series.mean()

190.0

Crucially however, there are additional goodies that work with `pd.Series`. For instance you will make use of `.nunique()` quite a bit in this module to count the unique values in a `pd.Series`, which can be a very useful summary statistic.

In [16]:
heights_pd_series.nunique()

2
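
Two closely related methods are worth knowing alongside `.nunique()`. The sketch below is illustrative and its variable names are made up: `.unique()` returns the distinct values themselves and `.value_counts()` reports how often each one occurs.

```python
import pandas as pd

heights = pd.Series([180, 180, 210], index=["Peter", "Natalie", "Mark"])

distinct = heights.unique()      # the distinct values, as a numpy array
counts = heights.value_counts()  # a Series mapping each value to its count

print(distinct)
print(counts.loc[180])  # 2
print(counts.loc[210])  # 1
```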

If you want to learn more about `pd.Series`, follow this [link](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) or type `pd.Series?`.

## Pandas dataframes - `pd.DataFrame`s

Now you have gone through all the parts that make up the `pd.DataFrame`s and it is time to put it all together.

There are different ways to build a `pd.DataFrame`: you can load a table into `python` with `pd.read_csv()`, build it from scratch, or put together matching `pd.Series`. Loading the data into `python` is the most common way to generate the `pd.DataFrame`.

Let's start by building one from scratch.

In [17]:
people_pd_df = pd.DataFrame({'height': [180, 180, 210], 'age': [20, 32, 43]})

You can do the same while specifying an index, just like you did with the `pd.Series`.

In [18]:
people_pd_df = pd.DataFrame(
    {
        'age': [20, 32, 43],
        'height': [180, 180, 210],
        'hobby': ['fencing', 'football','painting']
    },
    index=["Peter", "Natalie", "Mark"]
)
people_pd_df

| | age | height | hobby |
|---|---|---|---|
| Peter | 20 | 180 | fencing |
| Natalie | 32 | 180 | football |
| Mark | 43 | 210 | painting |


Let's also assign a name to the index to have a fully elaborate data structure for demonstration of all its functionality.

In [19]:
people_pd_df.index.name = 'name'
people_pd_df

| name | age | height | hobby |
|---|---|---|---|
| Peter | 20 | 180 | fencing |
| Natalie | 32 | 180 | football |
| Mark | 43 | 210 | painting |
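
The table above was built from scratch. The other two routes mentioned at the start of this section can be sketched as follows. This is a minimal illustration: the CSV text is invented, and `io.StringIO` stands in for a file on disk so that the example is self-contained.

```python
import io
import pandas as pd

names = ["Peter", "Natalie", "Mark"]

# route 1: combine Series that share the same index
age = pd.Series([20, 32, 43], index=names)
height = pd.Series([180, 180, 210], index=names)
from_series = pd.DataFrame({"age": age, "height": height})

# route 2: parse CSV text (normally you would pass a file path instead)
csv_text = "name,age,height\nPeter,20,180\nNatalie,32,180\nMark,43,210\n"
from_csv = pd.read_csv(io.StringIO(csv_text), index_col="name")

print(from_series)
print(from_csv)
```

The `index_col="name"` argument tells `pd.read_csv()` to use the `name` column as the row labels, which also gives the index its name.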


### Selections in `pd.DataFrames`

With such a `pd.DataFrame` it is easy to pick out individual rows with the `.loc` accessor, where you specify the label of the row that you want to retrieve (use `.iloc` if you want to specify its position instead).

In [20]:
peter_pd_series = people_pd_df.loc['Peter']
peter_pd_series

age            20
height        180
hobby     fencing
Name: Peter, dtype: object

This returns one row as a `pd.Series` where the indices are `age`, `height` and `hobby`. From here you can repeat what you learned above and select a specific entry with `.loc[]` to, for instance, pick just Peter's `age`.

In [21]:
peter_pd_series.loc['age']

20

And you get the single value corresponding to Peter's `age`.

However, when working with data you will mostly select columns first. That is why square brackets (`[]`) used directly on the `pd.DataFrame`, without any accessor, return a column.

In [22]:
people_pd_df['age']

name
Peter      20
Natalie    32
Mark       43
Name: age, dtype: int64

Similarly, you can then pick a specific row with `.loc` (but remember when selecting from a `pd.Series` by label the `.loc` is not required, as it is implicit when working with a single 'column').

In [23]:
people_pd_df['age'].loc['Peter']

20

Finally you can select multiple columns and multiple indices in the context of `pd.DataFrames`.

In [24]:
people_pd_df[['height','hobby']].loc['Peter']

height        180
hobby     fencing
Name: Peter, dtype: object

Of course this goes for both the indices and the columns.

In [25]:
people_pd_df[['height','hobby']].loc[['Peter','Mark']]

| name | height | hobby |
|---|---|---|
| Peter | 180 | fencing |
| Mark | 210 | painting |
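
The same selection can also be written as a single step, since `.loc` accepts row labels and column labels together. A minimal sketch follows; the table is rebuilt here so the example runs on its own, and the variable names are illustrative.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "age": [20, 32, 43],
        "height": [180, 180, 210],
        "hobby": ["fencing", "football", "painting"],
    },
    index=["Peter", "Natalie", "Mark"],
)

# rows first, then columns, in one call
subset = people.loc[["Peter", "Mark"], ["height", "hobby"]]

# column selection followed by row selection, for comparison
chained = people[["height", "hobby"]].loc[["Peter", "Mark"]]

print(subset)
print(subset.equals(chained))  # True
```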


### Removing columns or rows

Sometimes you might not want to select columns, but just remove some. This can easily be achieved with the `.drop()` method. You just need to make sure you specify the `axis` parameter - `1` for columns...

In [26]:
people_pd_df.drop(['height'], axis=1)

| name | age | hobby |
|---|---|---|
| Peter | 20 | fencing |
| Natalie | 32 | football |
| Mark | 43 | painting |


...and `0` for rows.

In [27]:
people_pd_df.drop(['Peter'], axis=0)

| name | age | height | hobby |
|---|---|---|---|
| Natalie | 32 | 180 | football |
| Mark | 43 | 210 | painting |
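
Two details of `.drop()` are easy to miss, and the sketch below illustrates both (the table is rebuilt so the example runs on its own; the variable names are illustrative). First, the `columns=` and `index=` keywords are an alternative to remembering which number `axis` takes. Second, `.drop()` returns a new table and leaves the original untouched.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "age": [20, 32, 43],
        "height": [180, 180, 210],
        "hobby": ["fencing", "football", "painting"],
    },
    index=["Peter", "Natalie", "Mark"],
)

without_height = people.drop(columns=["height"])  # like drop(["height"], axis=1)
without_peter = people.drop(index=["Peter"])      # like drop(["Peter"], axis=0)

# the original table still has all rows and columns
print(people.shape)          # (3, 3)
print(without_height.shape)  # (3, 2)
print(without_peter.shape)   # (2, 3)
```

To keep the result, assign it to a variable as done here.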


### Resetting index

As a final point, it is possible to convert the index into a column, which is especially convenient when the index has a name. This can for instance be useful after a `.groupby()` call, whose result typically carries the grouping keys in the index (you will see this later on in the bootcamp).

In [28]:
people_pd_df.reset_index()

| | name | age | height | hobby |
|---|---|---|---|---|
| 0 | Peter | 20 | 180 | fencing |
| 1 | Natalie | 32 | 180 | football |
| 2 | Mark | 43 | 210 | painting |
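
The conversion also works in the other direction. As a minimal sketch (the table is rebuilt so the example runs on its own, and the variable names are illustrative), `.set_index()` turns a column back into the index:

```python
import pandas as pd

people = pd.DataFrame(
    {
        "age": [20, 32, 43],
        "height": [180, 180, 210],
        "hobby": ["fencing", "football", "painting"],
    },
    index=["Peter", "Natalie", "Mark"],
)
people.index.name = "name"

flat = people.reset_index()        # index -> ordinary column "name"
restored = flat.set_index("name")  # column "name" -> index again

print(list(flat.columns))   # ['name', 'age', 'height', 'hobby']
print(restored.index.name)  # name
```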


From this lesson you should have learned the two main ways to select data from a `pd.DataFrame` object, which can be quite confusing to newcomers. The two commands below yield the same result: the first selects the row and then the column, and the second goes the other way round.

In [29]:
people_pd_df.loc['Peter']['age']
people_pd_df['age']['Peter']

20
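
For completeness, a single value can also be fetched in one step by passing the row label and the column label together. The sketch below is illustrative (the table is rebuilt so it runs on its own) and uses `.loc` as well as `.at`, which is the accessor `pandas` provides specifically for single values.

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [20, 32, 43], "height": [180, 180, 210]},
    index=["Peter", "Natalie", "Mark"],
)

via_loc = people.loc["Peter", "age"]  # row label, then column label
via_at = people.at["Peter", "age"]    # same lookup, single values only

print(via_loc, via_at)
```

Both give the same value as the two chained forms above.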

If you want to learn more about `pd.DataFrames`, follow this [link](http://pandas.pydata.org/pandas-docs/stable/api.html#dataframe) or type `pd.DataFrame?`. If you want even more detail, type `pd.DataFrame??`. There is much more functionality to be found there.