# Fundamentals of Data Analysis with Python 

## Day 3: Scientific Computing with NumPy, Pandas, Matplotlib, and Seaborn 

49th [GESIS Spring Seminar: Digital Behavioral Data](https://training.gesis.org/?site=pDetails&pID=0xA33E4024A2554302B3EF4AECFC3484FD)   
Cologne, Germany, March 2-6 2010

### Course Developers and Instructors 

* Dr. [John McLevey](https://www.johnmclevey.com), University of Waterloo (john.mclevey@uwaterloo.ca)
* [Jillian Anderson](https://ca.linkedin.com/in/jillian-anderson-34435714a), Simon Fraser University (jillianderson8@gmail.com) 

<hr>

### Overview 

High-level overview coming soon... 

### Plan for the Day

1. [Introduction to NumPy](#numpy)
    * Arrays 
2. [Introduction to Pandas](#pandas)
    * Series 
    * Dataframes 
    * Groupby and descriptive statistics
3. [Simple Data Visualization](#viz)
    * A gentle introduction to Matplotlib 
    * Data Visualization with Seaborn 
4. [Best practices when working with Pandas Series and DataFrames](#pandasbp)
    * Understanding how data is stored in Pandas
    * Initialization
    * Transformation
5. [Open Work Time](#open)

<hr>

# Best practices when working with Pandas Series and DataFrames <a id='pandasbp'></a>

In [22]:
# Setup code -- ideally this is changed to a dataset we are using 
import numpy as np
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.head(10)

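| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |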


## Transformations
Sometimes we want to add new features to the data we are given. Often we can do this by transforming the data we already have. For example, we might want to:
* extract the month from a date column
* use a width column and a height column to calculate area
* apply a conditional statement to a set of columns to populate a new column (e.g. set the new column value to X if either the Y column or the Z column contains the word "X")
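
The three transformations listed above can be sketched on a small toy DataFrame. The data and column names here (`date`, `width`, `height`, `title`, `notes`, `colour`) are hypothetical and chosen only for illustration; this is one possible way to write each step, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-15", "2020-03-02", "2020-03-06"]),
    "width": [2.0, 3.0, 4.0],
    "height": [1.5, 2.0, 0.5],
    "title": ["red fox", "blue jay", "grey wolf"],
    "notes": ["seen at dusk", "a red crest", "no markings"],
})

# Extract the month from a date column.
df["month"] = df["date"].dt.month

# Use a width column and a height column to calculate area.
df["area"] = df["width"] * df["height"]

# Conditional column: "red" if either text column contains the word "red".
has_red = df["title"].str.contains("red") | df["notes"].str.contains("red")
df["colour"] = np.where(has_red, "red", "other")

print(df[["month", "area", "colour"]])
```

Each new column is computed from a row's own values only, which is the kind of row-independent transformation discussed in this section.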

There is a long list of transformations we might be interested in, many of which operate on a single row, independently of all other rows. 

However, some common practices should be avoided when transforming Pandas DataFrames. Below we look at six ways to apply these transformations to a simple example, starting with the slowest approaches and moving toward faster ones. 

As with initialization, we have to remember that optimizing your code should not come at the expense of functionality. Often it's best to get something that works before going back to look for a faster solution. However, I hope that by introducing a couple of Do's and Don'ts, your first instincts can help you avoid some of the easiest traps.


See also the Pandas documentation, and in particular the prominent warning it contains: https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=vectorization

*Adapted from/Inspired by a lecture by Greg Baker, SFU*

### 1 For Loops
Perhaps one of the most obvious ways to approach a transformation is to go row-by-row through the dataframe, doing the necessary transformations one at a time. A simple way to do this is using a for loop.

In [23]:
%%timeit
# Looping over the rows
area = []
for i in range(0, len(iris)):
    row = iris.loc[i]
    i_area = row['petal_length'] * row['petal_width']
    area.append(i_area)

30.5 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In the code above, the `%%timeit` line is an IPython [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html#). The `%%timeit` cell magic runs the whole cell repeatedly and reports the mean and standard deviation of its execution time. 
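
Outside a notebook, a similar measurement can be taken with the `timeit` module from the Python standard library. The sketch below uses a small stand-in DataFrame (an assumption, so that it runs without downloading the iris data) and a hypothetical helper function `loop_area`; the numbers it prints will differ from the notebook timings shown here.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for the iris columns used above (150 rows).
df = pd.DataFrame({"petal_length": [1.4, 4.7, 6.0] * 50,
                   "petal_width": [0.2, 1.4, 2.5] * 50})


def loop_area():
    """Row-by-row loop, as in the cell above."""
    area = []
    for i in range(len(df)):
        row = df.loc[i]
        area.append(row["petal_length"] * row["petal_width"])
    return area


# Run the function 10 times and report the average time per run.
n_runs = 10
seconds_per_run = timeit.timeit(loop_area, number=n_runs) / n_runs
print(f"{seconds_per_run * 1e3:.2f} ms per loop")
```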

### 2  `iterrows()`
`iterrows()` is a built-in Pandas method for iterating over the rows of a DataFrame. It returns a generator (a Python object that produces values one at a time) that yields an `(index, row)` pair for each row, with the row as a Series. We can then loop over it with a for loop. 

In [40]:
%%timeit
areas = []
for idx, row in iris.iterrows():
    area = row['petal_length'] * row['petal_width']
    areas.append(area)

18.5 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
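
To see what `iterrows()` hands to the loop, the sketch below pulls out a single item by hand. The tiny DataFrame and its column names are hypothetical. It also illustrates a side effect described in the Pandas documentation: because each row is returned as one Series, columns with different dtypes are converted to a common dtype.

```python
import pandas as pd

# Hypothetical two-column frame: one integer column, one float column.
df = pd.DataFrame({"count": [1, 2], "length": [1.5, 2.5]})

# Each item produced by iterrows() is an (index label, row Series) pair.
idx, row = next(df.iterrows())

print(idx)                  # index label of the first row
print(type(row).__name__)   # the row is a pandas Series
print(df["count"].dtype)    # integer dtype inside the DataFrame
print(row.dtype)            # the row Series has a single, common dtype
```

Building one Series per row is part of why this approach is slow compared with the column-wise methods later in this section.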


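### 3 Apply
`apply()` with `axis=1` calls a function once per row, passing each row as a Series, and collects the return values into a new Series. The work still happens row by row, but we no longer write the loop ourselves.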

In [49]:
%%timeit
def my_area(row):
    return row['petal_length'] * row['petal_width']

areas = iris.apply(my_area, axis=1)

5.4 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
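
As a small illustration of how the `axis` argument changes what the function receives, the sketch below applies one function per row and another per column. The toy DataFrame is an assumption made so the example is self-contained.

```python
import pandas as pd

df = pd.DataFrame({"petal_length": [1.4, 4.7, 6.0],
                   "petal_width": [0.2, 1.4, 2.5]})


def my_area(row):
    # With axis=1, `row` is a Series indexed by the column names.
    return row["petal_length"] * row["petal_width"]


# axis=1: one call per row, result indexed like the DataFrame's rows.
areas = df.apply(my_area, axis=1)

# axis=0 (the default): one call per column, result indexed by column name.
col_max = df.apply(lambda col: col.max())

print(areas)
print(col_max)
```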


### 4 Zip & Iterate

*Honestly, I'm surprised this is faster than 5 & 6. I'm guessing it must have to do with ...*

In [46]:
%%timeit
areas = []
for length, width in zip(iris['petal_length'], iris['petal_width']):
    area = length * width
    areas.append(area)

66.2 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
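
The same idea can be written as a list comprehension, sketched below on a hypothetical three-row DataFrame. One plausible reason this approach times so well here, offered as an assumption rather than something measured in this notebook, is that with only 150 rows the fixed overhead of each Pandas operation outweighs the cost of a short Python loop. On larger data the ranking may well change.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"petal_length": [1.4, 4.7, 6.0],
                   "petal_width": [0.2, 1.4, 2.5]})

# zip pairs the two columns element by element, so no row Series is built.
areas = [length * width
         for length, width in zip(df["petal_length"], df["petal_width"])]

# The result is a plain Python list; it matches the column-wise product.
expected = df["petal_length"] * df["petal_width"]
print(np.allclose(areas, expected))
```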


### 5 Use Vectorized Functions
Sometimes we can use vectorized functions. These functions operate on an entire Series in the DataFrame at once instead of on individual elements. They also rely on pre-compiled code written in a lower-level language such as C (the language in which CPython, the standard Python interpreter, is written). 

In [45]:
%%timeit
# Straight up vector calculations
iris['petal_area'] = iris['petal_length'] * iris['petal_width']

308 µs ± 4.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
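
A few other whole-column operations follow the same pattern. The sketch below uses a hypothetical three-row DataFrame, and the new column names (`is_long`, `log_length`, `species_upper`) are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})

# Arithmetic and comparisons apply to the whole column at once.
df["petal_area"] = df["petal_length"] * df["petal_width"]
df["is_long"] = df["petal_length"] > 4.0

# NumPy universal functions accept a Series and return a Series.
df["log_length"] = np.log(df["petal_length"])

# String methods are available for a whole column through the .str accessor.
df["species_upper"] = df["species"].str.upper()

print(df)
```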


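### 6 Vectorization with NumPy Arrays
We can also convert each column to a NumPy array first and multiply the arrays directly. The arrays carry no index, so Pandas does not need to align the two columns or build a new Series for the result.
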
In [60]:
%%timeit
areas = np.array(iris['petal_length']) * np.array(iris['petal_width'])

106 µs ± 4.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
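
A variation on the cell above uses `Series.to_numpy()`, which is another way to obtain the underlying array, and then assigns the result back to the DataFrame. The toy data is an assumption made so the example is self-contained.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"petal_length": [1.4, 4.7, 6.0],
                   "petal_width": [0.2, 1.4, 2.5]})

# to_numpy() returns the column's values as a NumPy array, without the index.
lengths = df["petal_length"].to_numpy()
widths = df["petal_width"].to_numpy()

# Plain ndarray arithmetic: the result is an array, not a Series.
areas = lengths * widths

# Assigning an array to a column matches values to rows by position.
df["petal_area"] = areas
print(type(areas).__name__, areas.shape)
```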


### and Beyond
For cases where even these aren't fast enough, you can use more [advanced techniques](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=vectorization) to improve performance. How much these techniques help depends on the problem at hand. For example, some techniques use functions and methods that are optimized for boolean comparisons (e.g. greater than) but offer little improvement when working with addition. 

Also look into `NumExpr` when you need really fast transformations. 
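
One low-effort way to try this is `DataFrame.eval`, which evaluates a string expression over the columns. Pandas uses the `numexpr` engine for this when the package is installed and falls back to ordinary Python evaluation otherwise. The sketch below uses a hypothetical three-row DataFrame. Any speed benefit would be expected mainly on much larger data, so treat it as a syntax illustration rather than a benchmark.

```python
import pandas as pd

df = pd.DataFrame({"petal_length": [1.4, 4.7, 6.0],
                   "petal_width": [0.2, 1.4, 2.5]})

# Column names can be used directly inside the expression string.
areas = df.eval("petal_length * petal_width")

print(areas)
```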

## Takeaways

1. Never directly iterate over the rows in a DataFrame. 
2. ...  