## Idiomatic Pandas - NYC PyData Conference - Nov 30, 2017
[Tutorial Session from 9 a.m. to 12:30 p.m.](https://pydata.org/nyc2017/schedule/presentation/10/)

# Background Info
My name is [Ted Petrou](https://twitter.com/TedPetrou) and I am the author of [Pandas Cookbook](https://www.amazon.com/Pandas-Cookbook-Ted-Petrou/dp/1784393878), which provides nearly 100 recipes with step-by-step instructions for developing powerful and efficient routines for exploring, analyzing, and visualizing real-world messy datasets. Buy my book and get a [30-minute one-on-one tutorial with me](http://tedpetrou.com/pandas-cookbook.html).

![](../images/idiomatic_pandas/pc_amazon.png)
____
I am founder of [Dunder Data](http://dunderdata.com/), a company dedicated to teaching the fundamentals of data science.

![](../images/idiomatic_pandas/dd_logo.png?1)

___
I earned a master's degree in statistics from Rice University and used these analytical skills to play poker professionally. I then taught math before becoming a data analyst and eventually a data scientist for Schlumberger in Houston, Texas.
___
I founded the Houston Data Science Meetup group:

![](../images/idiomatic_pandas/hds2.png)


I now live in Toronto.

I really enjoy answering questions on Stack Overflow. It sharpens my ability to write idiomatic pandas.

![](../images/idiomatic_pandas/my_so.png?534)


# Before Getting Started
* Use the latest version of pandas - 0.21. Update with command **`conda update pandas`** or **`pip install pandas -U`**

# Target Audience
This is not an introduction to pandas. If you want a beginner's guide to pandas, please see:
* Tom Augspurger's pandas [.head to .tail tutorial](https://github.com/tomaugspurger/pydata-nyc-ph2t) scheduled Wednesday, November 29, 2017 from 1:30 - 5 p.m.
* My article [How to Learn Pandas](https://medium.com/dunder-data/how-to-learn-pandas-108905ab4955) for strategies on mastering pandas.

This tutorial assumes you have prior exposure to doing data analysis with pandas but...
* Would not feel comfortable answering questions tagged as pandas on Stack Overflow
* Don't know the difference between **`[]`**, **`.iloc`**, **`.loc`**, **`.ix`**, **`.at`**, **`.iat`**
* Use **`reset_index`** frequently because you have no idea how to deal with MultiIndexes
* Use for-loops frequently
* Use **`apply`** frequently
* Struggle with pandas, and find yourself wishing it was as easy as R
* Absolutely hate pandas and wish to see it obliterated

## Pandas Overview
* Pandas is one of the most popular tools for data analysis. Approximately 1% of all new Stack Overflow questions are tagged as pandas.

![](../images/idiomatic_pandas/so_trends.png)
* The library has evolved substantially since it started becoming mainstream in 2012
* Many answers on Stack Overflow use older syntax that has not been updated
* There are multiple ways to accomplish the same task
* For beginners, there is not always an obvious way to accomplish a task
* The documentation is over 2,000 pages long
* It is easy to write inefficient pandas

# Attendance

In [1]:
from IPython.display import IFrame

In [2]:
IFrame('http://etc.ch/32Rr', 400, 300)

In [4]:
IFrame('https://directpoll.com/r?XDbzPBd3ixYqg8lsM9v4IBajClR6cnLre6kxM2M3', 300, 250)

In [5]:
IFrame('http://etc.ch/AanT', 400, 400)

In [6]:
IFrame('https://directpoll.com/r?XDbzPBd3ixYqg8OK6KgIliK6457pdCle7Ir3aFf7e7lJ', 600, 400)

In [7]:
IFrame('http://etc.ch/kg2C', 500, 400)

In [8]:
IFrame('https://directpoll.com/r?XDbzPBd3ixYqg83aLxg811VZnaTvl0O3T2i8JxQzX', 400, 300)

# How well do you know pandas?

In [9]:
import pandas as pd
import numpy as np

In [10]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### Exercise 1
<span  style="color:green; font-size:16px">Select the columns **`height`** and **`color`**.</span>

In [14]:
df[['height', 'color']]


Unnamed: 0,height,color
Jane,165,blue
Niko,70,green
Aaron,120,red
Penelope,80,white
Dean,180,gray
Christina,172,black
Cornelia,150,red


### Exercise 2
<span  style="color:green; font-size:16px">Select the columns **`height`** and **`color`** along with the rows **`Niko`** and **`Penelope`**.</span>

In [15]:
df.loc[['Niko', 'Penelope'], ['height', 'color']]

Unnamed: 0,height,color
Niko,70,green
Penelope,80,white


### Exercise 3
<span  style="color:green; font-size:16px">Select rows 3 and 5 and the last three columns using 0-based indexing.</span>

In [42]:
df.iloc[[3, 5], -3:]

Unnamed: 0,age,height,score
Penelope,4,80,3.3
Christina,33,172,9.5


### Exercise 4
<span  style="color:green; font-size:16px">Select all the people with **`color`** equal to red or green or with height less than 90. Only return the **`score`** column.</span>

In [29]:
criteria1 = df['color'].isin(['red', 'green']) 
criteria2 = df['height'] < 90
df.loc[(criteria1 | criteria2), 'score']

Niko        8.3
Aaron       9.0
Penelope    3.3
Cornelia    2.2
Name: score, dtype: float64

### Exercise 5
<span  style="color:green; font-size:16px">Two DataFrames are defined below. What will **`df1`** look like when displayed below?</span>

In [30]:
df1 = pd.DataFrame({'state':['Texas', 'California', 'Florida'], 
                    'oranges':[10, 5, 12]})
df2 = pd.DataFrame({'apples':[3, 4, 5]}, 
                   index=[1, 2, 3])
df1

Unnamed: 0,oranges,state
0,10,Texas
1,5,California
2,12,Florida


In [31]:
df2

Unnamed: 0,apples
1,3
2,4
3,5


In [32]:
df1['apples'] = df2['apples']

In [33]:
#  Answer question before executing
df1

Unnamed: 0,oranges,state,apples
0,10,Texas,
1,5,California,3.0
2,12,Florida,4.0
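
The missing value in row 0 comes from index alignment: assigning a Series to a column matches on index labels, not on position. Here is a minimal sketch of that behavior using the two frames from this exercise. The **`.values`** variant at the end is not part of the exercise; it is shown only as one assumed way to assign by position instead.

```python
import pandas as pd

df1 = pd.DataFrame({'state': ['Texas', 'California', 'Florida'],
                    'oranges': [10, 5, 12]})
df2 = pd.DataFrame({'apples': [3, 4, 5]}, index=[1, 2, 3])

# Assignment aligns df2's index (1, 2, 3) to df1's index (0, 1, 2).
# Label 0 has no match, so it becomes NaN; label 3 is silently dropped.
df1['apples'] = df2['apples']
aligned = df1['apples'].tolist()

# Assigning the underlying array ignores the index and fills by position.
df1['apples_by_position'] = df2['apples'].values
by_position = df1['apples_by_position'].tolist()
```

Because a NaN was introduced, the aligned column is upcast to float, which is why the output shows `3.0` and `4.0` rather than integers.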


### Exercise 6
<span  style="color:green; font-size:16px">What will be the output when the following two Series are added together?</span>

In [34]:
s1 = pd.Series(index=['a', 'a', 'b', 'b'], data=[1, 2, 3, 4])
s2 = pd.Series(index=['a', 'a', 'b', 'b'], data=[1, 2, 3, 4])

s1

a    1
a    2
b    3
b    4
dtype: int64

In [35]:
s2

a    1
a    2
b    3
b    4
dtype: int64

In [36]:
# Answer question before executing
s1 + s2

a    2
a    4
b    6
b    8
dtype: int64

### Exercise 7
<span  style="color:green; font-size:16px">What will be the output when the following two Series are added together?</span>

In [37]:
s1 = pd.Series(index=['a', 'a', 'b', 'b'], data=[1, 2, 3, 4])
s2 = pd.Series(index=['a', 'a', 'b', 'b', 'c'], data=[1, 2, 3, 4, 5])

s1

a    1
a    2
b    3
b    4
dtype: int64

In [38]:
s2

a    1
a    2
b    3
b    4
c    5
dtype: int64

In [39]:
# Answer question before executing
s1 + s2

a    2.0
a    3.0
a    3.0
a    4.0
b    6.0
b    7.0
b    7.0
b    8.0
c    NaN
dtype: float64
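
The repeated rows appear because the two indexes are not identical, so pandas aligns them label by label, and every duplicate label in one Series is paired with every matching label in the other (a Cartesian product per label). The sketch below reuses the Series from this exercise and counts the rows produced for each label; the variable name `rows_per_label` is only illustrative.

```python
import pandas as pd

s1 = pd.Series(index=['a', 'a', 'b', 'b'], data=[1, 2, 3, 4])
s2 = pd.Series(index=['a', 'a', 'b', 'b', 'c'], data=[1, 2, 3, 4, 5])

result = s1 + s2

# Two 'a' labels on each side give 2 * 2 = 4 rows, and likewise for 'b'.
# 'c' exists only in s2, so it contributes a single NaN row.
rows_per_label = result.index.value_counts().to_dict()
```

In Exercise 6 the indexes were identical, so no alignment was needed and the values were added position by position.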

# Enter your results of the quiz below

In [40]:
IFrame('http://etc.ch/pbjf', 300, 450)

In [41]:
IFrame('https://directpoll.com/r?XDbzPBd3ixYqg8YX2glNsS7zNMFhXmApFoUu5jClJ', 300, 250)

# What does Idiomatic Pandas mean?
Let's come up with a definition for **idiomatic**. Idiomatic code, in general, refers to the most efficient and common convention for completing a specific task. Every language and library has its own idioms. In pandas, the term usually refers to short expressions for which one version is clearly better than the alternatives.

In general, idiomatic pandas will be:
* Explicit and easy to read
* Performant 
* Commonly used by pandas experts

### The College Scorecard dataset
We will use the College Scorecard dataset for the following examples. This is U.S. Department of Education data on 7,535 colleges. Only a sample of the available columns is included in this dataset. Visit [the website](https://collegescorecard.ed.gov/data/) for more info. The data was pulled in January 2017.

In [43]:
college = pd.read_csv('../data/college.csv', index_col='INSTNM')
college.head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### College Scorecard data dictionary
Several of the columns are difficult to decipher. Use the following data dictionary to help you understand the columns.

In [44]:
pd.read_csv('../data/college_data_dictionary.csv')

Unnamed: 0,column_name,description
0,INSTNM,Institution Name
1,CITY,City Location
2,STABBR,State Abbreviation
3,HBCU,Historically Black College or University
4,MENONLY,0/1 Men Only
5,WOMENONLY,0/1 Women only
6,RELAFFIL,0/1 Religious Affiliation
7,SATVRMID,SAT Verbal Median
8,SATMTMID,SAT Math Median
9,DISTANCEONLY,Distance Education Only


# Comparisons of non-idiomatic vs idiomatic pandas (Basic)
Let's see some examples of terrible pandas code vs their more idiomatic counterparts.

## Reading in data: `read_csv` vs `read_table`
Both the **`read_csv`** and **`read_table`** functions call the exact same underlying code. There is only a single minor difference. **`read_csv`** uses a **comma** as its default delimiter, while **`read_table`** uses a **tab**. That's it. In my opinion **`read_table`** should be deprecated as it adds no additional functionality.

In [45]:
c1 = pd.read_csv('../data/college.csv')
c2 = pd.read_table('../data/college.csv', delimiter=',')
c1.equals(c2)

True
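
The same equivalence holds in the other direction. As a small sketch, the example below parses tab-separated text with both functions; the in-memory string `tab_text` is a hypothetical stand-in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for a tab-separated file.
tab_text = "name\tscore\nJane\t4.6\nNiko\t8.3\n"

# read_table splits on tabs by default.
from_table = pd.read_table(io.StringIO(tab_text))

# read_csv needs the tab passed explicitly through sep.
from_csv = pd.read_csv(io.StringIO(tab_text), sep='\t')

same = from_table.equals(from_csv)
```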

## Find the total count of historically black colleges

#### Non-idiomatic: using a loop

In [46]:
total = 0
for i in college['HBCU']:
    total += i
total

nan

So bad it didn't work. Let's drop the missing values and try again:

In [47]:
total = 0
for i in college['HBCU'].dropna():
    total += i
total

102.0

#### Idiomatic

In [48]:
college['HBCU'].sum()

102.0
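
The loop returned `nan` while the method did not because the Series **`sum`** method skips missing values by default. A minimal sketch on a hypothetical four-row flag column (not the college data) shows both behaviors side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 flag column with one missing value.
flags = pd.Series([1.0, 0.0, np.nan, 1.0])

loop_total = 0
for value in flags:
    loop_total += value  # once nan is added, every later total is nan

method_total = flags.sum()              # skipna=True is the default
strict_total = flags.sum(skipna=False)  # opt in to nan propagation
```

Passing `skipna=False` reproduces the loop's result, which can be useful when a missing value should not be silently ignored.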

## Find the percentage of historically black colleges

#### Non-idiomatic: summing and then dividing

In [49]:
college['HBCU'].sum() / college['HBCU'].count()

0.01423785594639866

#### Idiomatic

In [50]:
college['HBCU'].mean()

0.01423785594639866

## Find the percentage of schools with math SAT scores greater than 700

#### Non-idiomatic

In [51]:
s_greater_700 = college['SATMTMID'].dropna() > 700
s_greater_700.head()

INSTNM
Alabama A & M University               False
University of Alabama at Birmingham    False
University of Alabama in Huntsville    False
Alabama State University               False
The University of Alabama              False
Name: SATMTMID, dtype: bool

In [52]:
s_greater_700 = s_greater_700.astype(int)
s_greater_700.head()

INSTNM
Alabama A & M University               0
University of Alabama at Birmingham    0
University of Alabama in Huntsville    0
Alabama State University               0
The University of Alabama              0
Name: SATMTMID, dtype: int64

In [53]:
s_greater_700.sum() / s_greater_700.count()

0.03678929765886288

#### Idiomatic

In [54]:
college['SATMTMID'].dropna().gt(700).mean()

0.03678929765886288

In [55]:
# or
(college['SATMTMID'].dropna() > 700).mean()

0.03678929765886288
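
The **`dropna`** call matters here. A comparison against a missing value evaluates to `False`, so leaving it out would count schools with no reported score in the denominator. The sketch below uses a hypothetical four-value Series to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value.
scores = pd.Series([750.0, 600.0, np.nan, 720.0])

# NaN > 700 is False, so the missing row still counts in the denominator.
share_of_all_rows = (scores > 700).mean()        # 2 of 4 rows

# Dropping missing values first restricts the denominator to known scores.
share_of_known = scores.dropna().gt(700).mean()  # 2 of 3 known scores
```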

## Testing multiple 'or' clauses on same column

In [56]:
states = ['AL', 'LA', 'TX', 'FL', 'GA']

#### Non-idiomatic

In [57]:
college[[sa in states for sa in college['STABBR']]].shape

(1309, 26)

In [58]:
criteria = ((college['STABBR'] == 'AL') | (college['STABBR'] == 'LA') | 
            (college['STABBR'] == 'TX') | (college['STABBR'] == 'FL') | 
            (college['STABBR'] == 'GA'))
college[criteria].shape

(1309, 26)

#### Idiomatic

In [59]:
college[college['STABBR'].isin(states)].shape

(1309, 26)
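
As a quick check that the approaches really build the same boolean mask, here is a sketch on a hypothetical five-row stand-in for the **`STABBR`** column. It also shows that the mask can be inverted with `~` to select everything outside the list.

```python
import pandas as pd

# Hypothetical miniature of the STABBR column.
stabbr = pd.Series(['AL', 'NY', 'TX', 'CA', 'GA'])
states = ['AL', 'LA', 'TX', 'FL', 'GA']

chained = ((stabbr == 'AL') | (stabbr == 'LA') | (stabbr == 'TX') |
           (stabbr == 'FL') | (stabbr == 'GA'))
with_isin = stabbr.isin(states)

masks_match = chained.equals(with_isin)

# ~ flips the mask, keeping only the rows NOT in the list.
excluded = stabbr[~with_isin].tolist()
```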

## `sum(s)` vs `s.sum()` 
Using the built-in **`sum`** function returns the same result as the **`sum`** Series method. Why should you care if you write it one way or the other?

**Answer: Prefer `s.sum()`. The built-in `sum()` is a generic Python function that iterates over the Series one value at a time, while `.sum()` is the pandas method. On pandas objects, the pandas method is the better choice.**

Let's find the total undergraduate population.

In [60]:
pop = college['UGDS'].dropna()
pop.shape

(6874,)

In [61]:
sum(pop)

16200904.0

In [62]:
pop.sum()

16200904.0

Let's time the difference between the two:

In [65]:
%timeit sum(pop)

214 µs ± 6.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [66]:
%timeit pop.sum()

107 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


#### Lots of overhead with pandas

In [67]:
%timeit pop.values.sum()

6.69 µs ± 336 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


#### Larger performance difference with more data

In [68]:
pop_alot = pop.sample(n=1000000, replace=True)

In [69]:
%timeit sum(pop_alot)

52.4 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [70]:
%timeit pop_alot.sum()

7.27 ms ± 695 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [71]:
%timeit pop_alot.values.sum()

565 µs ± 34.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
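
**`%timeit`** is an IPython magic and is not available in a plain Python script. A rough equivalent can be sketched with the standard-library **`timeit`** module. The array size, the repetition count, and the random stand-in for `pop_alot` below are all arbitrary assumptions, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for pop_alot: 100,000 random enrollment counts.
rng = np.random.default_rng(0)
values = pd.Series(rng.integers(0, 50_000, size=100_000).astype('float64'))

candidates = {
    'sum(s)': lambda: sum(values),
    's.sum()': lambda: values.sum(),
    's.values.sum()': lambda: values.values.sum(),
}

# Average seconds per call over a handful of runs.
runs = 5
timings = {name: timeit.timeit(func, number=runs) / runs
           for name, func in candidates.items()}

for name, seconds in timings.items():
    print(f'{name:>15}: {seconds * 1e6:10.1f} us')
```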


#### What about taking the absolute value? 

In [72]:
s = pd.Series(np.random.randn(1000000))

In [73]:
%timeit abs(s)

3.51 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [74]:
%timeit s.abs()

3.33 ms ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Both ways of finding the absolute value perform almost identically. Why, then, is the built-in **`sum`** so much slower than the **`sum`** method?

#### Special method `__abs__`
The discrepancy comes down to how much control Python gives developers over each built-in. The built-in **`sum`** function follows a fixed protocol: it iterates over its argument. In contrast, developers can implement **`abs`** however they choose by defining the special method **`__abs__`** on their object.

* **`sum`** - you have no control
* **`abs`** - you have complete control

The built-in Python **`sum`** function only accepts iterable objects. It steps through the Series one value at a time through the iterator protocol, handling each value as a separate Python object before adding it to the running total.

The Series **`sum`** method takes advantage of NumPy's pre-compiled C code to sum.

When the built-in Python **`abs`** function is passed a DataFrame or Series, the underlying **`__abs__`** method is invoked which also uses NumPy. So **`abs(s)`** and **`s.abs()`** are equivalent.
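
The difference between the two protocols can be sketched with a small class. `Tally` below is a hypothetical container written only for this illustration and has nothing to do with how pandas is implemented: **`abs`** hands control to the object's **`__abs__`** method in a single call, while **`sum`** can only pull items out one at a time through **`__iter__`**.

```python
class Tally:
    """Hypothetical container used only to illustrate the two protocols."""

    def __init__(self, values):
        self.values = list(values)
        self.items_yielded = 0

    def __abs__(self):
        # abs(obj) calls this once; the class decides how to do the work.
        return Tally(abs(v) for v in self.values)

    def __iter__(self):
        # sum(obj) only sees the object through iteration, item by item.
        for v in self.values:
            self.items_yielded += 1
            yield v


t = Tally([-1, 2, -3])
absolute = abs(t).values  # computed inside __abs__, in one call
total = sum(t)            # computed by pulling three items through __iter__
```

A library author can make `__abs__` as fast as they like, but has no hook to replace the item-by-item loop that the built-in **`sum`** performs.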

#### More to the story when converting data to a list
The built-in Python **`sum`** function performs well once the data has been converted from a NumPy array to a list. Summing up a list in Python happens in C and not in interpreted Python bytecode. [See this SO answer for more](https://stackoverflow.com/a/24578976/3707607).

In [75]:
v = pop_alot.tolist()

This gets closer to NumPy performance, but a Python list stores pointers to separate Python objects, whereas NumPy stores C primitives directly in the array and requires homogeneous data.

In [None]:
%timeit sum(v)

In [None]:
%timeit pop_alot.sum()

NumPy is now much slower when data is in a list! 

In [None]:
%timeit np.sum(v)

Most of this time is spent converting the list to a NumPy array.

In [None]:
%timeit np.array(v)

# Use pandas DataFrame/Series methods for consistency
Although the built-in **`abs`** function is identical to the DataFrame/Series **`abs`** methods, it is preferable to use pandas operations when available. This builds the habit of reaching for Series methods, which generally perform better.

### Exercise 1
<span  style="color:green; font-size:16px">Take a look at the following table of all the built-in Python functions. Can you find all the functions that accept a Series and return a useful result? From these functions, can you determine if a pandas special method is being invoked?</span>

In [76]:
IFrame('https://docs.python.org/3/library/functions.html#built-in-functions', 1000, 500)

Summary
* No Series implementation - **`any`**, **`all`**, **`max`**, **`min`**, **`sum`** - all iterate through each value
* Has Series implementation (a special method is invoked) - **`abs`**, **`divmod`**, **`pow`**, **`round`**
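
As a small sketch of the second group, the example below passes a hypothetical three-value Series to these built-ins. Each one dispatches to the corresponding Series special method, so the result is a pandas object rather than a Python scalar.

```python
import pandas as pd

# Hypothetical values chosen only for illustration.
s = pd.Series([1.5, 2.5, 7.0])

rounded = round(s)                  # dispatches to Series.__round__
squared = pow(s, 2)                 # dispatches to Series.__pow__
quotient, remainder = divmod(s, 2)  # dispatches to Series.__divmod__

# Every result is still a Series.
all_series = all(isinstance(obj, pd.Series)
                 for obj in (rounded, squared, quotient, remainder))
```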

In [79]:
# define a Series
s = college['UGDS']

In [None]:
# your code here

# Summary

* Idiomatic pandas is the most efficient, readable, and effective way to write pandas code
* Use **`read_csv`** and not **`read_table`** - they are the same except for the default delimiter
* Use **`s.mean()`** on a boolean Series to find percentage of values that meet a condition 
* Use **`isin`** to test multiple 'or' conditions
* Use DataFrame/Series methods and not their Python function equivalents
