# Fundamentals of Data Analysis with Python 

## Day 3: Scientific Computing with NumPy, Pandas, Matplotlib, and Seaborn 

49th [GESIS Spring Seminar: Digital Behavioral Data](https://training.gesis.org/?site=pDetails&pID=0xA33E4024A2554302B3EF4AECFC3484FD)   
Cologne, Germany, March 2-6 2010

### Course Developers and Instructors 

* Dr. [John McLevey](www.johnmclevey.com), University of Waterloo (john.mclevey@uwaterloo.ca)     
* [Jillian Anderson](https://ca.linkedin.com/in/jillian-anderson-34435714a?challengeId=AQGaFXECVnyVqAAAAW_TLnwJ9VHAlBfinArnfKV6DqlEBpTIolp6O2Bau4MmjzZNgXlHqEIpS5piD4nNjEy0wsqNo-aZGkj57A&submissionId=16582ced-1f90-ec15-cddf-eb876f4fe004), Simon Fraser University (jillianderson8@gmail.com) 

<hr>

### Overview 

High-level overview coming soon... 

### Plan for the Day

1. [Introduction to NumPy](#numpy)
    * Arrays 
2. [Introduction to Pandas](#pandas)
    * Series 
    * Dataframes 
    * Groupby and descriptive statistics
3. [Simple Data Visualization](#viz)
    * A gentle introduction to Matplotlib 
    * Data Visualization with Seaborn 
4. [Best practices when working with Pandas Series and DataFrames](#pandasbp)
    * Understanding how data is stored in Pandas
    * Initialization
    * Transformation
5. [Open Work Time](#open)

<hr>

# Introduction to NumPy<a id='numpy'></a>

# Introduction to Pandas<a id='pandas'></a>

Most social scientists are used to working with data in tabular form, such as a `dataframe` with variables in the columns and observations in the rows. In Python, the `Pandas` package enables us to use `dataframes` to organize, manipulate, and analyze data. `Pandas` is an extremely popular package in the scientific computing community, regardless of discipline (physics, sociology, neuroscience, history) or sector (academia, government, industry). 

This part of the notebook covers some basic functionality of `Pandas dataframes`. Of course, it does not cover *everything* that is possible to do with `dataframes`. As with the previous content, the goal is to build a basic foundation that we can build on throughout the week. 

# Simple Data Visualization<a id='viz'></a>

# Best practices when working with Pandas Series and DataFrames <a id='pandasbp'></a>

From [pandas](https://pandas.pydata.org/):
>pandas is a **fast**, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. 

This is all true, with a pretty large caveat. Pandas is fast (and generally efficient) if you avoid some of the common pitfalls. Unfortunately, these traps are easy to fall into, and many pandas users (even senior data scientists) don't know they might be slowing their code down 10-1000x. These people will often be hesitant to use pandas on large datasets and may dissuade others from using the library. 

However, by understanding a little about what is going on in the backend, we can avoid the worst of the problems and write relatively fast pandas code. 

[How is data stored in pandas?](#storage)   
[Efficient Transformation](#transform)   
[Efficient Initialization](#init)   

## How is data stored in Pandas? <a id='storage'></a>

<h3><font color='tomato'>Series</font></h3>

### DataFrames
DataFrames are really just a collection of Series, with each column corresponding to its own Series. As we mentioned above, each item in a Series (or column) is stored right after the one before it. This means that the entire column is stored within a single range of memory.

However, the multiple Series (columns) that make a DataFrame can be stored anywhere in memory and are often not stored side-by-side. 

We can think of this like a grocery list for sandwiches. Let's imagine that each kind of sandwich we make is composed of 1 type of bread, 1 type of meat, and 1 type of vegetable. We could arrange our grocery list into a table like this: 

| sandwich_id | bread_type | meat_type  | vegetable_type |
|-------------|------------|------------|----------------|
| 0           | sourdough  | ham        | lettuce        |
| 1           | baguette   | turkey     | tomato         |
| 2           | rye        | roast beef | onion          |

We buy all of our bread products from a bakery, meat from a deli, and vegetables from a grocer. The result is that to get everything in a column, you can go to one location (e.g. bakery for bread_type). But to get everything from a row you will have to visit all three locations. 
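
The grocery list can be written down as an actual DataFrame to make the analogy concrete. This is only a sketch: the `sandwiches` variable is made up for illustration and is not used elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical DataFrame mirroring the grocery list table above.
sandwiches = pd.DataFrame({
    "sandwich_id": [0, 1, 2],
    "bread_type": ["sourdough", "baguette", "rye"],
    "meat_type": ["ham", "turkey", "roast beef"],
    "vegetable_type": ["lettuce", "tomato", "onion"],
})

# One column is one Series: a single trip to one "shop".
breads = sandwiches["bread_type"]
print(type(breads).__name__)
print(breads.tolist())

# One row has to be assembled from every column ("visit every shop").
# The result is a new Series whose index is the column names.
row = sandwiches.iloc[1]
print(row.index.tolist())
print(row["meat_type"])
```

Notice that the row comes back as a freshly built Series labelled by column name, which hints at the extra work pandas does to produce it.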

This means that it is really fast to access an entire column, but really slow to access a row. Let's check it out!

>**Aside**   
>In the code below, the `%timeit` line is called a [magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html#). The `timeit` magic lets us time the execution of a Python statement. 
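
The `%timeit` magic only works inside IPython or Jupyter. If you want similar measurements in a plain Python script, the standard library's `timeit` module is one option. The sketch below is an assumption about how you might use it, not part of the course code; note that it reports the best run rather than the mean and standard deviation that `%timeit` prints.

```python
import timeit

values = list(range(1_000))

# number=... is how many times the statement runs per measurement;
# repeat=... is how many measurements are taken.
timings = timeit.repeat(lambda: sum(values), number=1_000, repeat=5)

# Report the best time per single execution, in microseconds.
best_us = min(timings) / 1_000 * 1e6
print(f"best of 5: {best_us:.2f} µs per loop")
```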

In [2]:
# Setup code -- ideally this is changed to a dataset we are using 
import numpy as np
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.sample(5)

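|     | sepal_length | sepal_width | petal_length | petal_width | species    |
|-----|--------------|-------------|--------------|-------------|------------|
| 137 | 6.4          | 3.1         | 5.5          | 1.8         | virginica  |
| 78  | 6.0          | 2.9         | 4.5          | 1.5         | versicolor |
| 94  | 5.6          | 2.7         | 4.2          | 1.3         | versicolor |
| 72  | 6.3          | 2.5         | 4.9          | 1.5         | versicolor |
| 29  | 4.7          | 3.2         | 1.6          | 0.2         | setosa     |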


In [3]:
print("Column\n------")
%timeit sl = iris['petal_length']

print("Row\n------")
%timeit example_1 = iris.iloc[12]

Column
------
2.38 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Row
------
139 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


That is more than a 50x difference in speed! 

## Transformations <a id='transform'></a>
One situation where the underlying storage structure, and its consequences for speed, really matters is when applying transformations (calculations or other functions) to a DataFrame. 

A common case of this is when we want to add a new column to our DataFrame based on values in other columns. For example, we may want to:  
* Extract month from a date column
* Calculate area from width & length columns
* Predict whether a flight will be late by applying a deep learning model to the values of 5 other columns. 

There is a long list of transformations we might be interested in, many of which operate on a single row, independent of other rows. 

There are many ways to implement transformations in pandas, some of which take advantage of how DataFrames are stored and others that do not. Below, we are going to look at 6 methods for implementing transformations:
1. [For Loops](#for)
2. [`iterrows`](#iter)
3. [Apply Method](#apply)
4. [Zip & Iterate](#zip)
5. [Vectorized Functions](#vec) 
6. [NumPy Vectorized Functions](#np)

We will use a common example across transformation methods allowing us to compare the speed of each one. For each method we will create a new column called `petal_area` by multiplying `petal_length` by `petal_width`.
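
Before comparing speeds, it is worth convincing ourselves that the different styles produce the same numbers. The sketch below uses a tiny hand-made DataFrame (`flowers`, a hypothetical stand-in for `iris`) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for iris: a few rows with the two columns we need.
flowers = pd.DataFrame({
    "petal_length": [1.4, 4.5, 5.5],
    "petal_width": [0.2, 1.5, 1.8],
})

# Row-by-row style: build each row, then multiply its two values.
by_row = [row["petal_length"] * row["petal_width"] for _, row in flowers.iterrows()]

# Column-wise style: multiply the two columns in one step.
by_column = (flowers["petal_length"] * flowers["petal_width"]).tolist()

# Both routes should give the same numbers; only the speed differs.
print(np.allclose(by_row, by_column))
```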

### 1 For Loops <a id='for'></a>
Perhaps the most obvious way to approach a transformation is to go row-by-row through the DataFrame using a [for loop](01_introduction.ipynb#conditional), doing the necessary calculations one row at a time.

For each row in the DataFrame, we calculate the area for that example by multiplying `petal_length` by `petal_width` and place the result in a list that would eventually be added as a column to the DataFrame.

In [4]:
%%timeit
# Looping over the rows
area_column = []
for i in range(0, len(iris)):
    row = iris.loc[i]
    row_area = row['petal_length'] * row['petal_width']
    area_column.append(row_area)

30.4 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


While we haven't checked any other methods yet, I'll let you know that this is _really_ slow. If we think about how DataFrames are stored it becomes clear why this is so slow. 

Before I get into this, let's look at the second method. 



### 2  `iterrows()` <a id='iter'></a>
A second method we can use to add our new column is the `iterrows()` method ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows)). This is a built-in method pandas has implemented to iterate over the rows in a DataFrame. 

This method creates a [generator object](https://wiki.python.org/moin/Generators), a special Python object, which we can use a for loop to iterate over. 
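
To see what the generator hands back, we can pull a single item from it by hand with `next()`. The two-column DataFrame below is hypothetical and exists only for this illustration.

```python
import pandas as pd

# Hypothetical two-column frame with one integer and one float column.
df = pd.DataFrame({"count": [1, 2], "weight": [1.5, 2.5]})

rows = df.iterrows()      # a generator: no rows have been produced yet
idx, row = next(rows)     # pull the first (index, Series) pair

print(idx)
print(type(row).__name__)

# Each row is packed into a single Series, so the integer column
# is upcast to float to share one dtype with "weight".
print(df["count"].dtype, "->", row["count"].dtype)
```

Each step of the loop therefore builds a brand new Series for the row, which is part of why this approach is costly.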

In [5]:
%%timeit
area_column = []
for idx, row in iris.iterrows():
    row_area = row['petal_length'] * row['petal_width']
    area_column.append(row_area)

18.6 ms ± 339 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


While this was definitely faster than the basic [for loop](#for) approach, it's still really slow. In fact, the underlying reason why these two approaches are so slow is the same.

Both approaches use a `for` loop to go row-by-row through the DataFrame, gathering the data for each row as it's needed. 

In our sandwich example, this is the equivalent of buying ingredients for sandwich 1, then buying ingredients for sandwich 2, etc. This results in visiting each shop (bakery, deli, grocer) once for every sandwich recipe!

The same thing is happening in pandas. To iterate over the rows using a for loop we retrieve all values for row 1, then all values for row 2, etc. 

This is incredibly inefficient (imagine the funny looks you'd get on your 3rd visit to the bakery)! In fact, I would venture to say that **you should never use `for` loops to go row-by-row through a pandas DataFrame**. There might be cases when I'm wrong, but there is almost always a better approach. 

### 3 Apply Method <a id=apply></a>
A third approach we can use is the `apply()` method ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply)). This built-in pandas method applies a specific function across some axis (rows or columns). In our case, we want to apply a function along the column axis, applying the function to each row. 

To use apply, you have to define the function you want to apply. This function needs to take in a row, apply the function, and return some value. For our case, we'll define an `area()` function. 

In [6]:
%%timeit
def area(row):
    return row['petal_length'] * row['petal_width']

area_column = iris.apply(area, axis=1)

6.63 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Thankfully this is faster than the previous two approaches. But we are still in the realm of milliseconds. This method is still relatively slow because it continues to go row-by-row through the DataFrame. 

It is faster than the loops because `apply()` is used for a specific purpose, so pandas is able to make assumptions and include optimizations that the more general approaches don't have access to. For example, `apply()` checks to see if your function is compatible with its "fast" mode ([docs](https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/frame.py#L6737-L6928)). As well, it offloads some of the work to C (a low-level language known for speed), only performing the functions itself in Python. 

I almost always avoid using `apply()`, although it does make for readable code.
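
The `axis` argument is easy to mix up, so here is a small sketch of what the function receives in each case. The `flowers` DataFrame is a hypothetical stand-in for two of the iris columns.

```python
import pandas as pd

# Hypothetical frame standing in for two of the iris columns.
flowers = pd.DataFrame({
    "petal_length": [1.4, 4.5, 5.5],
    "petal_width": [0.2, 1.5, 1.8],
})

# axis=0 (the default): the function is called once per column.
column_max = flowers.apply(lambda col: col.max(), axis=0)

# axis=1: the function is called once per row.
row_max = flowers.apply(lambda row: row.max(), axis=1)

print(column_max.to_dict())
print(row_max.tolist())
```

With `axis=1` the function runs once for every row, so the cost grows with the number of rows in the same way the loops above do.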

### 4 Zip & Iterate <a id=zip></a>
A fourth method is to use Python's built-in `zip()` function ([docs](https://docs.python.org/3.3/library/functions.html#zip)). This function takes in a group of iterables (lists, dictionaries, tuples, etc.) and creates a new iterator where the i-th element is a tuple containing the i-th elements from each of the original iterables. 

For example:   
```
>>> l1 = [1, 2, 3]
>>> l2 = ['a', 'b', 'c']
>>> z = zip(l1, l2)
>>> list(z)
    [(1, 'a'), (2, 'b'), (3, 'c')]
```

In general this is quite a useful function that can be used for lots of different purposes. In our case we will:
1. Determine which columns are needed for the transformation
2. Zip these columns together
3. Iterate over the zipped object to retrieve pairs one at a time, applying some function to the pairs and storing the result in a list which will later become our new column

In [7]:
%%timeit
area_column = []
for length, width in zip(iris['petal_length'], iris['petal_width']):
    area = length * width
    area_column.append(area)

52.1 µs ± 544 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


As you can see, this method offers a great improvement over our last method (~120x faster). The reason for this large improvement is that this is the first method that avoids assembling a full row from the DataFrame at every step. 

Instead of performing many (#rows x #columns) costly read operations, this method reads each column only once. The resulting data is stored temporarily in fast memory, where it can be accessed at little cost when it is needed for calculations. 

This is the method I typically use for calculations. It offers a good balance of efficiency, readability, and flexibility. 
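
The same zip pattern can be written more compactly as a list comprehension, and the resulting list can be assigned straight back to the DataFrame as a new column. This is a sketch on a small made-up DataFrame (`flowers`), not the iris data.

```python
import pandas as pd

# Hypothetical frame with round numbers so the result is easy to check by eye.
flowers = pd.DataFrame({
    "petal_length": [1.0, 2.0, 3.0],
    "petal_width": [0.5, 0.5, 2.0],
})

# Same zip pattern as above, written as a list comprehension.
areas = [
    length * width
    for length, width in zip(flowers["petal_length"], flowers["petal_width"])
]

# A list with one entry per row can be assigned directly as a new column.
flowers["petal_area"] = areas
print(flowers["petal_area"].tolist())
```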

### 5 Use Vectorized Functions <a id=vec></a>
Depending on the transformation we are undertaking, we might be able to use a vectorized function (also known as a vector function). These functions take in and operate on entire pandas Series, rather than on individual values. 

There are many built-in vectorized functions, such as `*` (shown below), `add()`, `between()`, and `shift()`. You can also build your own vectorized function as a combination of these built-in methods.  

In [8]:
%%timeit
# Straight up vector calculations
iris['petal_area'] = iris['petal_length'] * iris['petal_width']

354 µs ± 110 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


Similar to option 4, this method is significantly faster than the first three approaches. Once again, this is because we are avoiding accessing rows one-by-one. 

In addition, vectorized functions are able to further optimize by making use of pre-compiled code written in a lower-level (and faster) language like C. 

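Honestly, I'm unsure why this method appears to be slower than our 4th option, the zip method. One thing to note is that this cell, unlike the others, also assigns the result back to the DataFrame as a new column, so the timings are not measuring quite the same work. I suspect this method will not always be slower, especially when functions become more complex. 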
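
For reference, here is a short sketch of the other built-in vectorized methods named in this section, applied to two small made-up Series. Each call works on the whole Series at once.

```python
import pandas as pd

# Hypothetical values chosen so the results are easy to verify.
lengths = pd.Series([1.0, 4.0, 5.0])
widths = pd.Series([0.5, 1.5, 2.0])

total = lengths.add(widths)         # same as lengths + widths
mid_sized = lengths.between(2, 5)   # inclusive on both ends by default
previous = lengths.shift(1)         # move values down one position

print(total.tolist())
print(mid_sized.tolist())
print(previous.tolist())            # first entry is NaN: nothing came before it
```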

### 6 Use NumPy Vectorized Functions<a id=np></a>
For an improvement over method 5, we take one extra step and convert our pandas Series into NumPy arrays and apply the same vectorized functions to obtain our transformation. 

In [9]:
%%timeit
areas = np.array(iris['petal_length']) * np.array(iris['petal_width'])

44.3 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


As with the last two options, this approach avoids costly row-by-row reads. Like option 5, this method also uses pre-compiled code to achieve further optimization. 

In addition, by converting the pandas Series to NumPy arrays, this method removes the overhead incurred by pandas' additional functionality. 
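
Besides wrapping a Series in `np.array()`, pandas also provides the `Series.to_numpy()` method for getting at the underlying values. The sketch below uses it on a small hypothetical DataFrame; it is an alternative spelling of the same idea, not a different technique.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with round numbers so the result is easy to check.
flowers = pd.DataFrame({
    "petal_length": [1.0, 2.0, 3.0],
    "petal_width": [0.5, 0.5, 2.0],
})

# Series.to_numpy() returns the column's values as a NumPy array,
# without the index that a Series carries around.
lengths = flowers["petal_length"].to_numpy()
widths = flowers["petal_width"].to_numpy()

areas = lengths * widths
print(type(areas).__name__)
print(areas.tolist())
```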

### and Beyond <a id='beyond'></a>
For the cases when even these options aren't fast enough, you can implement more advanced techniques to enhance performance. The improvements these advanced techniques can offer differ based on the problem at hand. For example, some techniques use functions and methods that are optimized for boolean comparisons (e.g. greater than) but offer little improvement when working with other functions like addition. 

Some other approaches to check out include: 
* Using [NumExpr](https://pypi.org/project/numexpr/2.6.1/) for extra fast numerical expressions
* Rewriting functions in [Cython](https://cython.org/)
* Using [Numba](https://numba.pydata.org/) to convert Python code to fast machine code. 

## Takeaways

While this difference in speed is hard (if not impossible) to notice for small datasets, it can become hugely consequential when working with large datasets or performing complex calculations. 
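
To get a feel for how the gap behaves on something larger than iris, you could time a row-by-row and a column-wise version on synthetic data. Everything in this sketch (the `big` DataFrame, the 2,000-row size, the two helper functions) is made up for illustration, and the timings you see will depend on your machine.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: 2,000 rows of random "petal" measurements (illustration only).
rng = np.random.default_rng(seed=0)
big = pd.DataFrame({
    "petal_length": rng.uniform(1, 7, size=2_000),
    "petal_width": rng.uniform(0.1, 2.5, size=2_000),
})

def row_by_row(df):
    # Builds a Series for every row, then multiplies two of its values.
    return [row["petal_length"] * row["petal_width"] for _, row in df.iterrows()]

def column_wise(df):
    # Multiplies the two columns as NumPy arrays in a single step.
    return (df["petal_length"].to_numpy() * df["petal_width"].to_numpy()).tolist()

# Same answer either way...
assert np.allclose(row_by_row(big), column_wise(big))

# ...but the cost differs. Average over 3 runs of each.
slow = timeit.timeit(lambda: row_by_row(big), number=3) / 3
fast = timeit.timeit(lambda: column_wise(big), number=3) / 3
print(f"row-by-row: {slow * 1e3:.2f} ms, column-wise: {fast * 1e3:.2f} ms")
```

Try increasing the number of rows and watching how each timing grows.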

We have to remember that optimizing code should not come at the expense of functionality. Often it's best to get something that works before going back and finding the most optimal solution. However, I hope that by introducing a couple of "Do's & Don'ts", your first instincts can help you avoid some of the easiest traps.


1. Never directly iterate over the rows in a DataFrame. Avoid anything that goes row-by-row.  
2. Working with NumPy arrays is generally faster than working with pandas Series.
3. DataFrame data is stored based on columns, not rows. This means it's much faster to access a column than a row. 


# Open Work Time <a id='open'></a>