# Fundamentals of Data Analysis with Python 

## Day 3: Scientific Computing with NumPy, Pandas, Matplotlib, and Seaborn 

49th [GESIS Spring Seminar: Digital Behavioral Data](https://training.gesis.org/?site=pDetails&pID=0xA33E4024A2554302B3EF4AECFC3484FD)   
Cologne, Germany, March 2-6, 2020

### Course Developers and Instructors 

* Dr. [John McLevey](www.johnmclevey.com), University of Waterloo (john.mclevey@uwaterloo.ca)     
* [Jillian Anderson](https://ca.linkedin.com/in/jillian-anderson-34435714a), Simon Fraser University (jillianderson8@gmail.com) 

<hr>

### Overview 

High-level overview coming soon... 

### Plan for the Day

1. [Introduction to NumPy](#numpy)
    * Arrays 
2. [Introduction to Pandas](#pandas)
    * Series 
    * Dataframes 
    * Groupby and descriptive statistics
3. [Simple Data Visualization](#viz)
    * A gentle introduction to Matplotlib 
    * Data Visualization with Seaborn 
4. [Best practices when working with Pandas Series and DataFrames](#pandasbp)
    * Understanding how data is stored in Pandas
    * Initialization
    * Transformation
5. [Open Work Time](#open)

<hr>

# Introduction to NumPy<a id='numpy'></a>

# Introduction to Pandas<a id='pandas'></a>

# Simple Data Visualization<a id='viz'></a>

# Best practices when working with Pandas Series and DataFrames <a id='pandasbp'></a>

From [pandas](https://pandas.pydata.org/):
>pandas is a **fast**, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. 

This is all true, with one large caveat. Pandas is fast (and generally efficient) if you avoid some common pitfalls. Unfortunately, these traps are easy to fall into, and many pandas users (even senior data scientists) don't realize they might be slowing their code down by 10-1000x. As a result, they are often hesitant to use pandas on large datasets and may dissuade others from using the library. 

However, by understanding a little about what is going on in the backend, we can avoid the worst of the problems and write relatively fast pandas code. 

[How is data stored in pandas?](#storage)   
[Efficient Transformation](#transform)   
[Efficient Initialization](#init)   

## How is data stored in Pandas? <a id='storage'></a>

### Series

### DataFrames
DataFrames are really just a collection of Series, with each column corresponding to its own Series. As we mentioned above, each item in a Series (or column) is stored right after the one before it. This means that the entire column is stored within a single range of memory.

However, the multiple Series (columns) that make a DataFrame can be stored anywhere in memory and are often not stored side-by-side. 

We can think of this like a grocery list for sandwiches. Let's imagine that each kind of sandwich we make is composed of 1 type of bread, 1 type of meat, and 1 type of vegetable. We could arrange our grocery list into a table like this: 

| sandwich_id | bread_type | meat_type  | vegetable_type |
|-------------|------------|------------|----------------|
| 0           | sourdough  | ham        | lettuce        |
| 1           | baguette   | turkey     | tomato         |
| 2           | rye        | roast beef | onion          |

We buy all of our bread products from a bakery, meat from a deli, and vegetables from a grocer. The result is that to get everything in a column, you can go to one location (e.g. bakery for bread_type). But to get everything from a row you will have to visit all three locations. 

This means that it is really fast to access an entire column, but really slow to access a row. Let's check it out!

In [None]:
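import pandas as pd
# Load the iris dataset (the same data used in the Transformations section below)
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

print("Column\n------")
%timeit sl = iris['petal_length']

print("Row\n------")
%timeit example_1 = iris.iloc[12]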

Accessing the column is more than 50x faster than accessing the row (exact timings will vary by machine)! 
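
To make the bakery, deli, and grocer picture concrete, here is a minimal sketch that rebuilds the sandwich table as a small DataFrame. The variable names `sandwiches`, `bread`, and `row` are illustrative only and are not part of the workshop data. Each column keeps its own dtype, while pulling out a row has to gather one value from every column into a brand-new Series.

```python
import pandas as pd

# The grocery list from the table above, one column per "shop"
sandwiches = pd.DataFrame({
    "sandwich_id": [0, 1, 2],
    "bread_type": ["sourdough", "baguette", "rye"],
    "meat_type": ["ham", "turkey", "roast beef"],
    "vegetable_type": ["lettuce", "tomato", "onion"],
})

# One trip to the bakery: a single column, which already exists as a Series
bread = sandwiches["bread_type"]

# A trip to every shop: one value is collected from each column
# and packed into a new Series
row = sandwiches.iloc[1]

print(sandwiches.dtypes)  # every column has its own dtype
print(row.dtype)          # the row needs one dtype that fits all of its values
```

Because the row mixes integers and strings, pandas falls back to the generic `object` dtype for it. Building this new, mixed-type Series on every access is likely one reason row access costs more than column access.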



## Transformations <a id='transform'></a>
One instance where the underlying storage structure and its consequences for speed matter is when applying transformations (calculations or other functions) to a DataFrame. 

A common case of this is when we want to add a new column to our DataFrame based on values in other columns. For example, we may want to:  
* Extract month from a date column
* Calculate area from width & length columns
* Predict whether a flight will be late by applying a deep learning model to the values of 5 other columns. 

There is a long list of transformations we might be interested in, many of which operate on a single row, independent of other rows. 
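
As a minimal sketch of the first two transformations in the list above, assume a small made-up DataFrame (the name `records` and its columns `date`, `width`, and `length` are hypothetical). This is only one of several ways to write them; the sections that follow compare the speed of different approaches.

```python
import pandas as pd

# Hypothetical data: one date column and two measurement columns
records = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-15", "2020-03-02", "2020-12-24"]),
    "width": [2.0, 3.5, 1.0],
    "length": [4.0, 2.0, 6.5],
})

# Extract the month from the date column
records["month"] = records["date"].dt.month

# Calculate area from the width and length columns
records["area"] = records["width"] * records["length"]

print(records)
```

Each new value depends only on other values in the same row, which is the kind of row-independent transformation described above.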

There are many ways to implement transformations in pandas, some of which take advantage of how DataFrames are stored and others that do not. Below, we are going to look at six methods for implementing transformations:
1. [For Loops](#for)
2. `iterrows()`
3. Apply
4. Zip & Iterate
5. Vectorized Functions
6. Vectorization with NumPy Arrays

We will use the same example for every transformation method, which lets us compare the speed of each one. For each method we will create a new column called `petal_area` by multiplying `petal_length` by `petal_width`.

In [None]:
# Setup code -- ideally this is changed to a dataset we are using 
import numpy as np
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.sample(5)

### 1 For Loops <a id='for'></a>
Perhaps one of the most obvious ways to approach a transformation is to go row-by-row through the dataframe, doing the necessary transformations one at a time. A simple way to do this is using a for loop.

In [None]:
%%timeit
# Looping over the rows
area = []
for i in range(0, len(iris)):
    row = iris.loc[i]
    i_area = row['petal_length'] * row['petal_width']
    area.append(i_area)

In the code above, the `%%timeit` line is called a [magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html#). The `timeit` magic runs the code in the cell many times and reports how long it takes to execute. 
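
The `%%timeit` magic only works inside IPython or Jupyter. In a plain Python script, the standard library's `timeit` module offers similar functionality. A minimal sketch, where `square_sum` is just a hypothetical function to time:

```python
import timeit

def square_sum(n):
    """Sum of the squares of 0, 1, ..., n-1 (a stand-in for the code being timed)."""
    return sum(i * i for i in range(n))

# Run the call 100 times and get the total elapsed time in seconds
n_runs = 100
elapsed = timeit.timeit(lambda: square_sum(1000), number=n_runs)

print(f"{elapsed / n_runs:.6f} seconds per call")
```

Unlike the magic, `timeit.timeit` does not choose the number of repetitions for you, so `number` has to be set by hand.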

### 2  `iterrows()`
`iterrows()` is a built-in Pandas method for iterating over the rows of a DataFrame. The method creates a generator (a special kind of Python iterator that produces values one at a time), which we can then loop over with a for loop. Each iteration yields the row's index and the row itself as a Series. 

In [None]:
%%timeit
areas = []
for idx, row in iris.iterrows():
    area = row['petal_length'] * row['petal_width']
    areas.append(area)
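
To see what `iterrows()` actually hands back, here is a minimal sketch on a small made-up DataFrame (the name `flowers` and the `n_petals` column are hypothetical). It pulls a single item out of the generator and inspects it.

```python
import pandas as pd

# Hypothetical data: one integer column and one float column
flowers = pd.DataFrame({"n_petals": [5, 4], "petal_length": [1.4, 4.7]})

pairs = flowers.iterrows()  # a generator; no rows have been built yet
idx, row = next(pairs)      # the first (index, Series) pair

print(idx)
print(type(row))
print(flowers["n_petals"].dtype, "->", row.dtype)
```

Because each row is packed into a single Series, the integer in `n_petals` is converted to a float to match `petal_length`. Building a new Series for every row is extra work that the column-wise approaches below avoid.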

### 3 Apply
The `apply()` method with `axis=1` calls a function once for each row of the DataFrame, passing the row in as a Series, and collects the results into a new Series.

In [None]:
%%timeit
def my_area(row):
    return row['petal_length'] * row['petal_width']

areas = iris.apply(my_area, axis=1)

### 4 Zip & Iterate

*Honestly, I'm surprised this is faster than 5 & 6.*   
I'm guessing it must have to do with the specific transformation we are using. I will have to check when I do transformations on a dataset we are using during the workshop, with something a bit more interesting than adding. Edit distance might be nice.

In [None]:
%%timeit
areas = []
for length, width in zip(iris['petal_length'], iris['petal_width']):
    area = length * width
    areas.append(area)

### 5 Use Vectorized Functions
Sometimes we can use vectorized functions. These functions operate on an entire Series in the DataFrame at once, instead of on individual elements. In addition, vectorized functions make use of pre-compiled code written in a lower-level language like C (the language in which CPython, the standard Python interpreter, is itself written). 

In [None]:
%%timeit
# Straight up vector calculations
iris['petal_area'] = iris['petal_length'] * iris['petal_width']

### 6 Vectorization with NumPy Arrays
We can also convert each column to a NumPy array and do the calculation directly on the arrays, which skips some of the overhead that comes with pandas Series.



In [None]:
%%timeit
areas = np.array(iris['petal_length']) * np.array(iris['petal_width'])
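
Speed only matters if the methods agree on the answer. The sketch below runs all six approaches on a tiny made-up DataFrame (the name `flowers` and its values are hypothetical) and checks that they produce the same areas. It uses `.to_numpy()` for the last method; `np.array(...)` as in the cell above works as well.

```python
import numpy as np
import pandas as pd

flowers = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
})

# 1. For loop with .loc
loop_areas = []
for i in range(len(flowers)):
    row = flowers.loc[i]
    loop_areas.append(row["petal_length"] * row["petal_width"])

# 2. iterrows()
iterrows_areas = [r["petal_length"] * r["petal_width"]
                  for _, r in flowers.iterrows()]

# 3. apply() across rows
apply_areas = flowers.apply(
    lambda r: r["petal_length"] * r["petal_width"], axis=1)

# 4. zip over the two columns
zip_areas = [length * width for length, width
             in zip(flowers["petal_length"], flowers["petal_width"])]

# 5. Vectorized Series arithmetic
series_areas = flowers["petal_length"] * flowers["petal_width"]

# 6. Vectorized NumPy arithmetic
numpy_areas = (flowers["petal_length"].to_numpy()
               * flowers["petal_width"].to_numpy())

# Compare every result against the NumPy version
results = [loop_areas, iterrows_areas, apply_areas,
           zip_areas, series_areas, numpy_areas]
all_match = all(np.allclose(numpy_areas, r) for r in results)
print(all_match)
```

`np.allclose` is used instead of `==` so that tiny floating-point differences between methods would not count as a mismatch.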

### Pandas 1.0 Improvements
https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html#using-numba-in-rolling-apply-and-expanding-apply

### and Beyond
For cases when even these aren't fast enough, you can use more [advanced techniques](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=vectorization) to improve performance. The improvement these techniques offer depends on the problem at hand. For example, some techniques use functions and methods that are optimized for boolean comparisons (e.g. greater than) but offer little improvement when working with addition. 

Also look into `NumExpr` when you need really fast transformations. 
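
One way to reach `NumExpr` without leaving pandas is `DataFrame.eval()`, which evaluates an expression written as a string. The sketch below is a minimal illustration on a small made-up DataFrame (the name `flowers` is hypothetical). By default, pandas tries the `numexpr` engine if that package is installed and otherwise falls back to regular Python evaluation. Any speed-up generally only shows up on large DataFrames, so treat this as a starting point to benchmark rather than a guaranteed win.

```python
import pandas as pd

flowers = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
})

# Column names can be used directly inside the expression string
areas = flowers.eval("petal_length * petal_width")

print(areas)
```

The result is a Series aligned with the original index, so it can be assigned back as a new column in the usual way.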

## Takeaways

While this difference in speed is hard (if not impossible) to notice for small datasets, it can become hugely consequential when working with large datasets or performing complex calculations. 

Similar to initialization, we have to remember that optimization should not come at the expense of functionality. Often it's best to get something that works before going back and finding the optimal solution. However, I hope that by introducing a couple of Do's and Don'ts, your first instincts can help you avoid some of the easiest traps.



Checkout: https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=vectorization -- take an image of the big warning here actually

*Adapted from/Inspired by a lecture by Greg Baker, SFU*

1. Never directly iterate over the rows in a DataFrame. 
2. ...  

# Open Work Time <a id='open'></a>