# What does Idiomatic Pandas mean?
Let's come up with a definition for **idiomatic**. Idiomatic code, in general, refers to the most efficient and common convention for completing a specific task. Every language and library has its own idioms. In pandas, we usually use the term for short expressions where one version is clearly better than the alternatives.

In general, idiomatic pandas will be:
* Explicit and easy to read
* Performant 
* Commonly used by pandas experts

### The College Scorecard dataset
We will use the College Scorecard dataset for the following examples. This is US Department of Education data on 7,535 colleges. Only a sample of the available columns is included in this dataset. Visit [the website](https://collegescorecard.ed.gov/data/) for more info. The data was pulled in January 2017.

In [22]:
import numpy as np
import pandas as pd
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college.head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### College Scorecard data dictionary
Several of the columns are difficult to decipher. Use the following data dictionary to help you understand them.

In [None]:
pd.read_csv('../data/college_data_dictionary.csv')

# Comparisons of non-idiomatic vs idiomatic pandas (Basic)
Let's see some examples of terrible pandas code vs their more idiomatic counterparts.

## Find the total count of historically black colleges

#### Non-idiomatic: using a loop

In [None]:
total = 0
for i in college['HBCU']:
    total += i
total

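So bad it didn't work. The loop returns `nan` because the column contains missing values, and adding `NaN` to any number yields `NaN`. Let's drop the missing values and try again: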

In [None]:
total = 0
for i in college['HBCU'].dropna():
    total += i
total

#### Idiomatic

In [None]:
college['HBCU'].sum()
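
The difference in missing-value handling can be sketched on a tiny Series. The `flags` data below is a made-up stand-in for the `HBCU` column, not taken from the dataset. The method version needs no `dropna` because `Series.sum` skips missing values by default (`skipna=True`).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the HBCU column: 1.0/0.0 flags with one missing value
flags = pd.Series([1.0, 0.0, np.nan, 1.0, 0.0])

# The manual loop propagates NaN: once a NaN is added, the total stays NaN
loop_total = 0
for value in flags:
    loop_total += value

# Series.sum skips missing values by default (skipna=True)
method_total = flags.sum()

print(loop_total)    # nan
print(method_total)  # 2.0
```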

## Find the percentage of historically black colleges

#### Non-idiomatic: summing and then dividing

In [None]:
college['HBCU'].sum() / college['HBCU'].count()

#### Idiomatic

In [None]:
college['HBCU'].mean()

## Find the percentage of schools with math SAT scores greater than 700

#### Non-idiomatic

In [None]:
s_greater_700 = college['SATMTMID'].dropna() > 700
s_greater_700.head()

In [None]:
s_greater_700 = s_greater_700.astype(int)
s_greater_700.head()

In [None]:
s_greater_700.sum() / s_greater_700.count()

#### Idiomatic

In [None]:
college['SATMTMID'].dropna().gt(700).mean()

In [None]:
# or
(college['SATMTMID'].dropna() > 700).mean()
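
The `dropna` call matters here. In this minimal sketch with made-up scores (a hypothetical stand-in for `SATMTMID`), comparisons against `NaN` evaluate to `False`, so skipping `dropna` silently counts schools with no reported score in the denominator.

```python
import numpy as np
import pandas as pd

# Hypothetical scores standing in for SATMTMID; two of the five are missing
scores = pd.Series([720.0, 650.0, np.nan, 710.0, np.nan])

# NaN > 700 is False, so missing rows stay in the denominator (2 of 5)
share_with_missing = (scores > 700).mean()

# Dropping missing values first restricts the denominator to reported scores (2 of 3)
share_reported = scores.dropna().gt(700).mean()

print(share_with_missing)
print(share_reported)
```

Which denominator is appropriate depends on the question being asked; the point is only that the two expressions are not interchangeable when the column has missing values.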

## Testing multiple 'or' clauses on the same column

In [None]:
states = ['AL', 'LA', 'TX', 'FL', 'GA']

#### Non-idiomatic

In [None]:
college[[sa in states for sa in college['STABBR']]].shape

In [None]:
criteria = ((college['STABBR'] == 'AL') | (college['STABBR'] == 'LA') | 
            (college['STABBR'] == 'TX') | (college['STABBR'] == 'FL') | 
            (college['STABBR'] == 'GA'))
college[criteria].shape

#### Idiomatic

In [None]:
college[college['STABBR'].isin(states)].shape
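
As a quick check that the approaches agree, here is a small sketch on a made-up frame (the `df` data and its index labels are hypothetical). It also shows that prefixing the boolean Series with `~` selects the rows outside the list.

```python
import pandas as pd

# Hypothetical miniature of the college table
df = pd.DataFrame({'STABBR': ['AL', 'NY', 'TX', 'CA', 'GA']},
                  index=['A', 'B', 'C', 'D', 'E'])
states = ['AL', 'LA', 'TX', 'FL', 'GA']

# Chained comparisons: one clause per state
chained = ((df['STABBR'] == 'AL') | (df['STABBR'] == 'LA') |
           (df['STABBR'] == 'TX') | (df['STABBR'] == 'FL') |
           (df['STABBR'] == 'GA'))

# isin: a single membership test against the whole list
with_isin = df['STABBR'].isin(states)

print(with_isin.equals(chained))      # True
print(df[with_isin].index.tolist())   # ['A', 'C', 'E']
print(df[~with_isin].index.tolist())  # ['B', 'D']
```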

## `sum(s)` vs `s.sum()` 
Using the built-in **`sum`** function returns the same result as the **`sum`** Series method. Why should you care if you write it one way or the other?

Let's find the total undergraduate population.

In [23]:
pop = college['UGDS'].dropna()
pop.shape

(6874,)

In [24]:
sum(pop)

16200904.0

In [25]:
pop.sum()

16200904.0

Let's time the difference between the two:

In [26]:
%timeit sum(pop)

209 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [27]:
%timeit pop.sum()

122 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


#### Lots of overhead with pandas

In [28]:
%timeit pop.values.sum()

7.57 µs ± 899 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


#### Larger performance difference with more data

In [29]:
pop_alot = pop.sample(n=1000000, replace=True)

In [30]:
%timeit sum(pop_alot)

61 ms ± 7.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [31]:
%timeit pop_alot.sum()

6.32 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [32]:
%timeit pop_alot.values.sum()

837 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


#### What about taking the absolute value? 

In [33]:
s = pd.Series(np.random.randn(1000000))

In [34]:
%timeit abs(s)

2.58 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [35]:
%timeit s.abs()

2.69 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Both ways of finding the absolute value perform nearly identically. Why, then, is the built-in **`sum`** an order of magnitude slower than the **`sum`** method on the larger Series?

#### Special method `__abs__`
The discrepancy comes down to how much control Python gives developers over each built-in. The built-in **`sum`** function follows a fixed protocol: it iterates over its argument and adds the elements one at a time, and there is no special method for overriding it. In contrast, developers can implement **`abs`** however they choose by defining the special method **`__abs__`** on their object.

* **`sum`** - you have no control
* **`abs`** - you have complete control

The built-in Python **`sum`** function only accepts iterable objects. It steps through the Series one value at a time, turning each value into a Python object and adding it to a running total.

The Series **`sum`** method takes advantage of NumPy's precompiled C code to sum the values.

When the built-in Python **`abs`** function is passed a DataFrame or Series, the underlying **`__abs__`** method is invoked, which also uses NumPy. So **`abs(s)`** and **`s.abs()`** are equivalent.
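
The two protocols can be observed with a small sketch. `Tracked` is a hypothetical class written only for this illustration (it is not part of pandas); it records which special method each built-in ends up calling.

```python
class Tracked:
    """Hypothetical container that records which protocol a built-in uses."""

    def __init__(self, values):
        self.values = list(values)
        self.calls = []

    def __abs__(self):
        # abs(obj) delegates here, so the class decides how the work is done
        self.calls.append('__abs__')
        return Tracked(abs(v) for v in self.values)

    def __iter__(self):
        # sum(obj) has no dedicated hook; it can only ask for one element at a time
        self.calls.append('__iter__')
        return iter(self.values)


t = Tracked([-1, 2, -3])
abs_result = abs(t)
sum_result = sum(t)

print(t.calls)            # ['__abs__', '__iter__']
print(abs_result.values)  # [1, 2, 3]
print(sum_result)         # -2
```

A library like pandas can make `__abs__` as fast as it likes, but it cannot change the element-by-element iteration that the built-in `sum` performs.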

#### More to the story when converting data to a list
The built-in Python **`sum`** function performs well once the data has been converted from a NumPy array to a list. Summing a list happens in C rather than in interpreted Python bytecode. [See this SO answer for more](https://stackoverflow.com/a/24578976/3707607).

In [36]:
v = pop_alot.tolist()

This gets closer to NumPy performance. A gap remains because a Python list stores pointers to separate Python objects, whereas NumPy stores homogeneous C primitives directly in the array.

In [37]:
%timeit sum(v)

7.47 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [38]:
%timeit pop_alot.sum()

6.26 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


NumPy's **`sum`** function is now much slower because the data is in a list!

In [39]:
%timeit np.sum(v)

41.4 ms ± 5.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Most of this time is spent converting the list to a NumPy array:

In [40]:
%timeit np.array(v)

37 ms ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
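
One practical consequence, sketched below under the assumption that the data starts out as a plain list: convert to an array once and reuse it, so the conversion cost is not paid on every call. The `values` list is randomly generated for illustration, and the timings will vary by machine, so no expected numbers are shown.

```python
import timeit

import numpy as np

# Hypothetical data: one million floats held in a plain Python list
values = np.random.default_rng(0).random(1_000_000).tolist()

# Convert once, then reuse the array for every later reduction
arr = np.asarray(values)

# np.sum(values) must build a temporary array from the list on each call
t_list = timeit.timeit(lambda: np.sum(values), number=5)
# np.sum(arr) works on the existing array directly
t_array = timeit.timeit(lambda: np.sum(arr), number=5)

print(f'np.sum on list:  {t_list / 5 * 1e3:.2f} ms per call')
print(f'np.sum on array: {t_array / 5 * 1e3:.2f} ms per call')
```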


# Use pandas DataFrame/Series methods for consistency
Although the built-in **`abs`** function behaves identically to the DataFrame/Series **`abs`** method, it is preferable to use pandas methods when available. This builds the habit of reaching for Series methods, which for operations such as **`sum`** perform much better.

### Exercise 1
<span  style="color:green; font-size:16px">Take a look at the following table of all the built-in Python functions. Can you find all the functions that accept a Series and return a useful result? For these functions, can you determine whether a pandas special method is being invoked?</span>

In [6]:
from IPython.display import IFrame
IFrame('https://docs.python.org/3/library/functions.html#built-in-functions', 1000, 500)

In [None]:
# define a Series
s = college['UGDS']

In [None]:
# your code here

## Do not try to reinvent the wheel: Namespaces

Pandas has been expanding its use of namespaces (or accessors) on `DataFrame` to group together related methods. This also limits the number of methods directly attached to `DataFrame` itself, which can be overwhelming.

Currently, we have these namespaces:

- `.str`: defined on `Series` and `Index` objects containing strings (object dtype)
- `.dt`: defined on `Series` with `datetime` or `timedelta` dtype
- `.cat`: defined on `Series` and `Index` objects with `category` dtype
- `.plot`: defined on `Series` and `DataFrame` objects

See [this link](http://pandas.pydata.org/pandas-docs/stable/api.html#plotting) for more details.
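
A brief sketch of three of these accessors in use follows. The Series values are made up purely to illustrate the syntax, and `.plot` is left out because it requires a plotting backend such as matplotlib.

```python
import pandas as pd

# Hypothetical values chosen only to illustrate each accessor
names = pd.Series(['Alabama State University', 'Amridge University'])
dates = pd.Series(pd.to_datetime(['2017-01-15', '2016-12-31']))
states = pd.Series(['AL', 'TX', 'AL'], dtype='category')

# .str: vectorized string methods
print(names.str.contains('State').tolist())  # [True, False]

# .dt: datetime components
print(dates.dt.year.tolist())                # [2017, 2016]

# .cat: category metadata
print(states.cat.categories.tolist())        # ['AL', 'TX']
```

Each accessor replaces what would otherwise be a Python-level loop or `apply` over the values.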

# Summary

* Idiomatic Pandas is the most efficient, readable and effective way to write pandas
* Use **`s.mean()`** on a boolean Series to find the percentage of values that meet a condition
* Use **`isin`** to test multiple 'or' conditions on the same column
* Use DataFrame/Series methods rather than their built-in Python function equivalents
