# Solutions

* [01: How well do you know pandas?](#How-well-do-you-know-pandas?)
* [01: Use pandas DataFrame/Series methods for consistency](#Use-pandas-DataFrame/Series-methods-for-consistency)


* [02: Assigning new values to subsets of data](#Assigning-new-values-to-subsets-of-data)
* [02: Two common scenarios when assigning subsets of data](#Two-common-scenarios-when-assigning-subsets-of-data)


* [03: **`map`** vs **`apply`**](#map-vs-apply)
* [03: Do we really need apply?](#Do-we-really-need-apply?)
* [03: Tips for debugging **`apply`**](#Tips-for-debugging-apply)


* [04: Components of a groupby operation](#Components-of-a-groupby-operation)


* [05: Random Exercises](#Random-Exercises)

In [8]:
import pandas as pd
import numpy as np

# How well do you know pandas?

In [2]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### Exercise 1
<span  style="color:green; font-size:16px">Select the columns **`height`** and **`color`**.</span>

In [3]:
df[['height', 'color']]

Unnamed: 0,height,color
Jane,165,blue
Niko,70,green
Aaron,120,red
Penelope,80,white
Dean,180,gray
Christina,172,black
Cornelia,150,red


### Exercise 2
<span  style="color:green; font-size:16px">Select the columns **`height`** and **`color`** along with the rows **`Niko`** and **`Penelope`**.</span>

In [4]:
df.loc[['Niko', 'Penelope'], ['height', 'color']]

Unnamed: 0,height,color
Niko,70,green
Penelope,80,white


### Exercise 3
<span  style="color:green; font-size:16px">Select rows 3 and 5 and the last three columns using 0-based indexing.</span>

In [5]:
df.iloc[[3, 5], -3:]

Unnamed: 0,age,height,score
Penelope,4,80,3.3
Christina,33,172,9.5


### Exercise 4
<span  style="color:green; font-size:16px">Select all the people with **`color`** equal to red or green or with height less than 90. Only return the **`score`** column.</span>

In [8]:
criteria1 = df['color'].isin(['red', 'green']) 
criteria2 = df['height'] < 90
df.loc[(criteria1 | criteria2), 'score']

Niko        8.3
Aaron       9.0
Penelope    3.3
Cornelia    2.2
Name: score, dtype: float64

### Exercise 5
<span  style="color:green; font-size:16px">Two DataFrames are defined below. What will **`df1`** look like after the **`apples`** column of **`df2`** is assigned to it?</span>

In [13]:
df1 = pd.DataFrame({'state':['Texas', 'California', 'Florida'], 
                    'oranges':[10, 5, 12]})
df2 = pd.DataFrame({'apples':[3, 4, 5]}, 
                   index=[1, 2, 3])
df1

Unnamed: 0,oranges,state
0,10,Texas
1,5,California
2,12,Florida


In [14]:
df2

Unnamed: 0,apples
1,3
2,4
3,5


In [15]:
df1['apples'] = df2['apples']

In [17]:
# automatic alignment on the index. Only row labels 1 and 2 align
df1

Unnamed: 0,oranges,state,apples
0,10,Texas,
1,5,California,3.0
2,12,Florida,4.0
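
The alignment behavior above can be contrasted with a purely positional assignment. The following is a minimal sketch, not part of the original exercise: it rebuilds the two small DataFrames and uses `to_numpy()` to drop the index of `df2` so that values are assigned in order instead of by label. The names `aligned` and `positional` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'state': ['Texas', 'California', 'Florida'],
                    'oranges': [10, 5, 12]})
df2 = pd.DataFrame({'apples': [3, 4, 5]}, index=[1, 2, 3])

# Assigning a Series aligns on index labels: label 0 has no match in df2,
# and label 3 of df2 has no match in df1
aligned = df1.copy()
aligned['apples'] = df2['apples']

# Assigning a NumPy array ignores labels and fills by position
positional = df1.copy()
positional['apples'] = df2['apples'].to_numpy()

print(aligned)
print(positional)
```

Positional assignment only makes sense when both objects are known to be in the same row order; otherwise the label alignment is usually what protects you from silent mismatches.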


### Exercise 6
<span  style="color:green; font-size:16px">What will be the output when the following two Series are added together?</span>

In [25]:
s1 = pd.Series(index=['a', 'a', 'b', 'b'], data=[1, 2, 3, 4])
s2 = pd.Series(index=['a', 'a', 'b', 'b'], data=[1, 2, 3, 4])

s1

a    1
a    2
b    3
b    4
dtype: int64

In [26]:
s2

a    1
a    2
b    3
b    4
dtype: int64

In [27]:
s1 + s2

a    2
a    4
b    6
b    8
dtype: int64

### Exercise 7
<span  style="color:green; font-size:16px">What will be the output when the following two Series are added together?</span>

In [28]:
s1 = pd.Series(index=['a', 'a', 'b', 'b'], data=[1, 2, 3, 4])
s2 = pd.Series(index=['a', 'a', 'b', 'b', 'c'], data=[1, 2, 3, 4, 5])

s1

a    1
a    2
b    3
b    4
dtype: int64

In [29]:
s2

a    1
a    2
b    3
b    4
c    5
dtype: int64

In [30]:
# Answer question before executing
s1 + s2

a    2.0
a    3.0
a    3.0
a    4.0
b    6.0
b    7.0
b    7.0
b    8.0
c    NaN
dtype: float64
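
The difference between Exercise 6 and Exercise 7 comes down to whether the two indexes are identical. A small sketch of this, reusing the same Series: when the indexes are equal, values are added position by position, and when they differ, each duplicated label is matched with every occurrence of that label on the other side (a Cartesian product per label).

```python
import pandas as pd

s1 = pd.Series(index=['a', 'a', 'b', 'b'], data=[1, 2, 3, 4])
s2 = pd.Series(index=['a', 'a', 'b', 'b', 'c'], data=[1, 2, 3, 4, 5])

# Identical indexes: no alignment is needed, so the result has 4 rows
same_index_total = s1 + s1.copy()

# Different indexes: 2 x 2 rows for 'a', 2 x 2 rows for 'b', 1 row for 'c'
different_index_total = s1 + s2

# A quick check for duplicate labels before doing arithmetic
has_duplicates = not s1.index.is_unique

print(len(same_index_total), len(different_index_total), has_duplicates)
```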

# Use pandas DataFrame/Series methods for consistency

### Exercise 1
<span  style="color:green; font-size:16px">Take a look at the following table of all the built-in Python functions. Can you find all the functions that accept a Series and return a useful result? For each of these functions, can you determine whether a pandas special method is being invoked?</span>

In [4]:
from IPython.display import IFrame

In [178]:
IFrame('https://docs.python.org/3/library/functions.html#built-in-functions', 1000, 500)

In [10]:
college = pd.read_csv('../data/college.csv', index_col=0)

In [170]:
s = college['UGDS']

### `any`

In [127]:
s.any()

True

In [128]:
any(s)

True

In [129]:
%timeit s.any()

61.6 µs ± 3.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [14]:
%timeit any(s)

180 µs ± 6.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


No pandas special method

### `all`

In [130]:
s.all()

False

In [131]:
all(s)

False

In [132]:
%timeit s.all()

61.2 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [18]:
%timeit all(s)

176 µs ± 4.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


No pandas special method

### Python `divmod` and Series `floordiv` and `mod`

In [38]:
q, r = s.floordiv(5), s.mod(5)

In [39]:
q1, r1 = divmod(s, 5)

In [40]:
q1.equals(q)

True

In [41]:
r1.equals(r)

True

In [42]:
%timeit q, r = s.floordiv(5), s.mod(5)

687 µs ± 25.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [43]:
%timeit q1, r1 = divmod(s, 5) # This is still done in pandas see below

314 µs ± 7.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [138]:
%timeit s.apply(lambda x: divmod(x, 5)) # so slow

3.16 ms ± 88.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


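pandas Series implement the special method **`__divmod__`**, so the built-in **`divmod`** is computed by pandas rather than by iterating over the values in Python. Recent pandas versions also provide a regular **`Series.divmod`** method.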

### `max`

In [81]:
s.max()

151558.0

In [82]:
max(s)

151558.0

In [83]:
%timeit s.max()

81.3 µs ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [84]:
%timeit max(s)

320 µs ± 16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


No pandas special method

### `min`

In [85]:
s.min()

0.0

In [86]:
min(s)

0.0

In [87]:
%timeit s.min()

83.3 µs ± 4.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [88]:
%timeit min(s)

301 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


No pandas special method

### `pow`

In [90]:
s_pow = s.pow(2)

In [91]:
s_pow1 = pow(s, 2)

In [92]:
s_pow.equals(s_pow1)

True

In [97]:
%timeit s.pow(2)

122 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [165]:
%timeit pow(s, 2)   # this is implemented with s.__pow__

82.9 µs ± 2.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [166]:
%timeit s ** 2

82.7 µs ± 2.54 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


A pandas special method exists with a twist - **`s.__pow__()`** and **`s.pow()`** are implemented differently.

### `round`

In [171]:
s_round = s.round(-2)

In [172]:
s_round1 = round(s, -2)

In [173]:
s_round.equals(s_round1)

True

In [174]:
%timeit s.round(-2)

83.8 µs ± 2.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [175]:
%timeit round(s, -2)

84.7 µs ± 1.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [177]:
%timeit s.values.round(-2)  # NumPy

58.6 µs ± 2.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Pandas special method exists

### Summary
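* No pandas special method - the built-ins **`any`**, **`all`**, **`max`**, **`min`**, **`sum`** all iterate through each value in Python, so the equivalent Series methods are faster
* pandas special method exists - the built-ins **`divmod`**, **`pow`**, **`round`** are handled by pandas itself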
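
The two groups in the summary can be checked directly. This is a minimal sketch on a small made-up Series (the values are illustrative, not taken from the college data): built-ins that dispatch to a special method return the same Series as the pandas method, while `max` has no such hook in Python and simply iterates.

```python
import pandas as pd

s = pd.Series([12.0, 151.0, 7.0])  # illustrative values

# Built-ins backed by a special method are computed by pandas
q, r = divmod(s, 5)
same_divmod = q.equals(s.floordiv(5)) and r.equals(s.mod(5))
same_pow = pow(s, 2).equals(s ** 2)
same_round = round(s, -2).equals(s.round(-2))

# The hooks that make the calls above work
has_hooks = {name: hasattr(pd.Series, name)
             for name in ('__divmod__', '__pow__', '__round__')}

# Python has no __max__ protocol, so max(s) loops over the values;
# the result matches s.max(), only the speed differs
same_max = max(s) == s.max()

print(same_divmod, same_pow, same_round, has_hooks, same_max)
```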

# Assigning new values to subsets of data

### Exercise 1
<span style="color:green; font-size:16px">Use **`.iloc`** to select the first 5 rows and columns. Is a view or copy created?</span>

In [179]:
c1 = college.copy()
df1 = c1.iloc[:5, :5]  # df1 is our new subset
df1

Unnamed: 0_level_0,CITY,STABBR,HBCU,MENONLY,WOMENONLY
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,Normal,AL,1.0,0.0,0.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0
Amridge University,Montgomery,AL,0.0,0.0,0.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0


In [180]:
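# modify the subset in two different ways
df1.iloc[:2] = 99
df1['STABBR'] = 'TX'

# the first rows of c1 below still hold their original values,
# so these assignments did not reach c1
c1.head()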

Unnamed: 0_level_0,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### Exercise 2
<span style="color:green; font-size:16px">Select a few rows and all the columns with **`.loc`**. Is a view or copy created?</span>

In [181]:
c1 = college.copy()
idx = ['University of Alabama at Birmingham', 'Amridge University']
df1 = c1.loc[idx]
df1.head()

Unnamed: 0_level_0,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0


In [182]:
df1['MENONLY'] = 99
df1.head()

Unnamed: 0_level_0,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
University of Alabama at Birmingham,Birmingham,AL,0.0,99,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,99,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0


In [184]:
c1.head()

Unnamed: 0_level_0,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


A copy is created: **`df1`** shows the new **`MENONLY`** values while **`c1`** is unchanged.

# Two common scenarios when assigning subsets of data

### Exercise 1
<span style="color:green; font-size:16px">Turn to your neighbors. Explain to them why the following did not work. 
Write another incorrect version of it as well as a correct version.</span>

In [237]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)

df[df['age'] < 30]['score'] = 100
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2
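
One way to see why the assignment above had no effect is to split the chained expression into its two steps. The sketch below uses a small hand-built DataFrame with a few of the rows shown above; the variable name `subset` is illustrative. The warning is silenced only to keep the output short.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'age': [30, 2, 12], 'score': [4.6, 8.3, 9.0]},
                  index=['Jane', 'Niko', 'Aaron'])

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    # df[mask]['score'] = 100 is equivalent to these two steps
    subset = df[df['age'] < 30]   # step 1: boolean selection builds a new DataFrame
    subset['score'] = 100         # step 2: the assignment lands on that new object

print(df)      # original scores are untouched
print(subset)  # only the temporary object changed
```

In the chained form the temporary object is never bound to a name, so the assigned values are discarded immediately.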


In [274]:
# another incorrect version: chained indexing with the column selected first
df = pd.read_csv('../data/sample_data.csv', index_col=0)

df['score'][df['age'] < 30] = 100
df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,100.0
Aaron,FL,red,Mango,12,120,100.0
Penelope,AL,white,Apple,4,80,100.0
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


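As we saw earlier, selecting a Series, as with **`df['score']`**, creates a view, which is why this version happened to modify **`df`**. pandas does not guarantee that behavior, so the warning is still raised and this form should be avoided.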

In [276]:
# correct
df.loc[df['age'] < 30, 'score'] = 100
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,100.0
Aaron,FL,red,Mango,12,120,100.0
Penelope,AL,white,Apple,4,80,100.0
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### Exercise 2
<span style="color:green; font-size:16px">Make the following idiomatic</span>

In [240]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)

df.loc[df['color'].isin(['red', 'green']), 'score'] = 99

### Exercise 3
<span style="color:green; font-size:16px">Make the following idiomatic</span>

In [250]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)

df[df['state'] == 'TX'][df['age'] > 30][['height', 'score']] = 99



In [251]:
criteria = (df['state'] == 'TX') & (df['age'] > 30)
df.loc[criteria, ['height', 'score']] = 99
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,99,99.0
Cornelia,TX,red,Beans,69,99,99.0


### Exercise 4
<span style="color:green; font-size:16px">Make the following idiomatic</span>

In [253]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)

df.iloc[[6,4,2,3]].iloc[:, :3] = 'CHANGED'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [255]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df.iloc[[6,4,2,3], :3] = 'CHANGED'
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,CHANGED,CHANGED,CHANGED,12,120,9.0
Penelope,CHANGED,CHANGED,CHANGED,4,80,3.3
Dean,CHANGED,CHANGED,CHANGED,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,CHANGED,CHANGED,CHANGED,69,150,2.2


### Exercise 5
<span style="color:green; font-size:16px">Select the first three rows and first three columns into a new variable and then change all the values to **`CHANGED`** without getting a **`SettingWithCopy`** warning.</span>

In [294]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)

df1 = df[['color', 'food']]

In [296]:
df1.loc[:, :] = 'CHANGED'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [297]:
# or this
df1.iloc[:, :] = 'CHANGED'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
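
Both attempts above still warn because **`df1`** was created by selecting from **`df`**. One way to meet the exercise's requirement, sketched here under the assumption that an independent subset is what is wanted, is to take an explicit copy at selection time. The DataFrame is rebuilt by hand from the first rows and columns of the sample data so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {'state': ['NY', 'TX', 'FL', 'AL'],
     'color': ['blue', 'green', 'red', 'white'],
     'food': ['Steak', 'Lamb', 'Mango', 'Apple'],
     'age': [30, 2, 12, 4]},
    index=['Jane', 'Niko', 'Aaron', 'Penelope'])

# .copy() makes df1 an independent object, so pandas has no reason to warn
df1 = df.iloc[:3, :3].copy()
df1.loc[:, :] = 'CHANGED'

print(df1)
print(df.head(3))  # the original is left as it was
```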
  


# map vs apply

In [2]:
import pandas as pd
import numpy as np

### Exercise 1
<span style="color:green; font-size:16px">Can you use the **`apply`** method to do the same thing? Time the difference between **`apply`** and **`map`**.</span>

In [331]:
n = 1000000 # 1 million
s = pd.Series(np.random.randint(1, 101, n))

In [303]:
s.apply(lambda x: 'odd' if x % 2 else 'even').head()

0     odd
1    even
2     odd
3     odd
4     odd
dtype: object

In [304]:
%timeit s.apply(lambda x: 'odd' if x % 2 else 'even')

218 ms ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [306]:
%%timeit
d = {i: 'odd' if i % 2 else 'even' for i in range(1, 101)}
s.map(d)

59.9 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Exercise 2
<span style="color:green; font-size:16px">Use the **`map`** method with a two-item dictionary to convert the Series of integers to 'even/odd' strings. You will need to perform an operation on the Series first.</span>

In [311]:
%timeit s.mod(2).map({1:'odd', 0:'even'})

85.3 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Exercise 3
<span style="color:green; font-size:16px">Write a for-loop to convert each value in the  Series to 'even/odd' strings. Time the operation.</span>

In [333]:
%%timeit 
d = {i: 'odd' if i % 2 else 'even' for i in range(1, 101)}
pd.Series([d[val] for val in s])

81 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


This is faster than **`apply`** was in Exercise 1.
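
For comparison, the same conversion can also be written without any per-element Python function. The following sketch checks on a tiny Series that **`apply`**, **`map`** and a vectorized `np.where` call agree; it makes no timing claim, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])

# One Python function call per element
via_apply = s.apply(lambda x: 'odd' if x % 2 else 'even')

# Vectorized remainder, then a dictionary lookup
via_map = s.mod(2).map({1: 'odd', 0: 'even'})

# Fully vectorized: the condition and the selection both run in NumPy
via_where = pd.Series(np.where(s % 2 == 1, 'odd', 'even'), index=s.index)

print(via_apply.tolist())
```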

### Exercise 4
<span style="color:green; font-size:16px">Convert the values from 1-33 to 'low', 34-67 to 'medium' and the rest 'high'.</span>

In [332]:
np.where(s <= 33, 'low', np.where(s <= 67, 'medium', 'high'))

array(['high', 'medium', 'medium', ..., 'medium', 'medium', 'medium'],
      dtype='<U6')
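
Nested `np.where` calls become hard to read as the number of bins grows. As an alternative sketch, `pd.cut` expresses the same binning with explicit edges; the bin edges below are chosen to match the 1-33, 34-67 and 68-100 ranges of the exercise, and the short Series holds only the boundary values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 33, 34, 67, 68, 100])

# The nested np.where from the solution
via_where = np.where(s <= 33, 'low', np.where(s <= 67, 'medium', 'high'))

# pd.cut uses right-inclusive bins by default: (0, 33], (33, 67], (67, 100]
via_cut = pd.cut(s, bins=[0, 33, 67, 100], labels=['low', 'medium', 'high'])

print(via_where)
print(via_cut)
```

Note that `pd.cut` returns a categorical Series rather than an array of strings, which also preserves the low < medium < high ordering.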

# Do we really need apply?

In [334]:
df = pd.DataFrame(np.random.rand(100, 5), columns=['a', 'b', 'c', 'd', 'e'])
df.head()

Unnamed: 0,a,b,c,d,e
0,0.64961,0.806852,0.389017,0.582023,0.140793
1,0.216948,0.914255,0.121187,0.806617,0.17075
2,0.30204,0.863148,0.630202,0.814147,0.575576
3,0.695118,0.789862,0.846326,0.267981,0.389699
4,0.903842,0.948717,0.235339,0.823927,0.418497


### Exercise 1
<span style="color:green; font-size:16px">Make the following idiomatic</span>

In [336]:
df.apply(np.cumsum).head()

Unnamed: 0,a,b,c,d,e
0,0.64961,0.806852,0.389017,0.582023,0.140793
1,0.866558,1.721107,0.510204,1.388639,0.311543
2,1.168598,2.584254,1.140406,2.202786,0.887119
3,1.863716,3.374116,1.986731,2.470767,1.276818
4,2.767558,4.322833,2.22207,3.294694,1.695315


In [337]:
df.cumsum().head()

Unnamed: 0,a,b,c,d,e
0,0.64961,0.806852,0.389017,0.582023,0.140793
1,0.866558,1.721107,0.510204,1.388639,0.311543
2,1.168598,2.584254,1.140406,2.202786,0.887119
3,1.863716,3.374116,1.986731,2.470767,1.276818
4,2.767558,4.322833,2.22207,3.294694,1.695315


### Exercise 2
<span style="color:green; font-size:16px">Make the following idiomatic</span>

In [338]:
df.apply(lambda x: x.max() - x.min())

a    0.991043
b    0.996544
c    0.969750
d    0.971780
e    0.976607
dtype: float64

In [339]:
df.max() - df.min()

a    0.991043
b    0.996544
c    0.969750
d    0.971780
e    0.976607
dtype: float64

### Exercise 3
<span style="color:green; font-size:16px">Add a column named **`distance`** to the following DataFrame that computes the euclidean distance between points **`(x1, y1)`** and **`(x2, y2)`**. Calculate it once with **`apply`** and again idiomatically using vectorized operations. Time the difference between them.</span>

In [4]:
df = pd.DataFrame(np.random.randint(0, 20, (100000, 4)), 
                  columns=['x1', 'y1', 'x2', 'y2'])
df.head()

Unnamed: 0,x1,y1,x2,y2
0,16,4,16,12
1,0,16,3,12
2,16,7,11,1
3,10,9,9,2
4,2,15,5,0


In [9]:
def dist_calc(s):
    x_diff = (s['x1'] - s['x2']) ** 2
    y_diff = (s['y1'] - s['y2']) ** 2
    return np.sqrt(x_diff + y_diff).round(2)

In [10]:
df['distance'] = df.apply(dist_calc, axis='columns')
df.head()

Unnamed: 0,x1,y1,x2,y2,distance
0,16,4,16,12,8.0
1,0,16,3,12,5.0
2,16,7,11,1,7.81
3,10,9,9,2,7.07
4,2,15,5,0,15.3


In [12]:
np.sqrt((df['x1'] - df['x2']) ** 2 + (df['y1'] - df['y2']) ** 2).head()

0     8.000000
1     5.000000
2     7.810250
3     7.071068
4    15.297059
dtype: float64

In [13]:
%timeit df.apply(dist_calc, axis='columns')

4.8 s ± 49.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
%timeit np.sqrt((df['x1'] - df['x2']) ** 2 + (df['y1'] - df['y2']) ** 2)

3.9 ms ± 354 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


More than 1000x faster!
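
The exercise asks for a **`distance`** column, so here is a short sketch of the vectorized version assigned back to the DataFrame, with the same rounding that `dist_calc` applies. The five rows are the ones displayed by `df.head()` above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': [16, 0, 16, 10, 2],
                   'y1': [4, 16, 7, 9, 15],
                   'x2': [16, 3, 11, 9, 5],
                   'y2': [12, 12, 1, 2, 0]})

# Whole-column arithmetic: one subtraction, square and sqrt per column,
# instead of one Python function call per row
squared = (df['x1'] - df['x2']) ** 2 + (df['y1'] - df['y2']) ** 2
df['distance'] = np.sqrt(squared).round(2)

print(df)
```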

### Exercise 4
<span style="color:green; font-size:16px">The following example is from the documentation. Produce the same result without using **`apply`** by creating a function that accepts a DataFrame and returns a DataFrame.</span>

In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 20, (10, 4)), 
                  columns=['x1', 'y1', 'x2', 'y2'])
df.head()

In [21]:
def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

In [23]:
df.apply(subtract_and_divide, args=(5,), divide=3)

Unnamed: 0,x1,y1,x2,y2
0,-1.0,3.0,-1.666667,4.333333
1,2.333333,4.666667,-1.0,3.666667
2,3.0,-0.333333,0.0,0.0
3,1.333333,4.666667,1.666667,3.666667
4,4.666667,4.666667,-0.666667,0.333333
5,-1.333333,4.666667,-0.333333,1.333333
6,4.0,4.333333,-0.666667,4.333333
7,0.666667,1.333333,-1.666667,4.333333
8,1.333333,2.666667,0.0,-0.666667
9,2.333333,4.333333,-1.666667,2.333333


In [24]:
# answer
def subtract_and_divide2(df, sub, divide=1):
    return (df - sub) / divide

In [26]:
subtract_and_divide2(df, 5, 3)

Unnamed: 0,x1,y1,x2,y2
0,-1.0,3.0,-1.666667,4.333333
1,2.333333,4.666667,-1.0,3.666667
2,3.0,-0.333333,0.0,0.0
3,1.333333,4.666667,1.666667,3.666667
4,4.666667,4.666667,-0.666667,0.333333
5,-1.333333,4.666667,-0.333333,1.333333
6,4.0,4.333333,-0.666667,4.333333
7,0.666667,1.333333,-1.666667,4.333333
8,1.333333,2.666667,0.0,-0.666667
9,2.333333,4.333333,-1.666667,2.333333


### Exercise 5
<span style="color:green; font-size:16px">Make the following idiomatic:</span>

In [27]:
college = pd.read_csv('../data/college.csv', 
                      usecols=lambda x: 'UGDS' in x or x == 'INSTNM', 
                      index_col='INSTNM')
college = college.dropna()
college.shape

(6874, 10)

In [28]:
college.head()

Unnamed: 0_level_0,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Alabama A & M University,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [29]:
def max_race_count(s):
    max_race_pct = s.iloc[1:].max()
    return (max_race_pct * s.loc['UGDS']).astype(int)

In [30]:
college.apply(max_race_count, axis=1).head()

INSTNM
Alabama A & M University               3933
University of Alabama at Birmingham    6741
Amridge University                      121
University of Alabama in Huntsville    3809
Alabama State University               4429
dtype: int64

In [36]:
# answer
(college.iloc[:, 1:].max(axis=1) * college['UGDS']).astype(int).head()

INSTNM
Alabama A & M University               3933
University of Alabama at Birmingham    6741
Amridge University                      121
University of Alabama in Huntsville    3809
Alabama State University               4429
dtype: int64

In [37]:
%timeit college.apply(max_race_count, axis=1)

1.36 s ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [38]:
%timeit (college.iloc[:, 1:].max(axis=1) * college['UGDS']).astype(int)

957 µs ± 7.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# Tips for debugging `apply`

### Exercise 1
<span style="color:green; font-size:16px">Use **`display`** after each line in a custom function that gets used with **`apply`** and **`axis='columns'`** to find the population of the second highest race. Make sure you raise an exception, or else you will have to kill your kernel because of the massive output.</span>

In [58]:
college = pd.read_csv('../data/college.csv', 
                      usecols=lambda x: 'UGDS' in x or x == 'INSTNM', 
                      index_col='INSTNM')
college = college.dropna()

In [74]:
def second_most(s):
    ugds, s = s.iloc[0], s.iloc[1:]
    print("ugds is ", ugds)
    display(s)
    s = s.sort_values(ascending=False)
    display(s)
    second_pct = s.iloc[1]
    print("second pct", second_pct)
    second_pop = (second_pct * ugds).astype(int)
    print('second_pop', second_pop)
    print("\n\n\n")
    raise

In [75]:
college.apply(second_most, axis='columns')

ugds is  4206.0


UGDS_WHITE    0.0333
UGDS_BLACK    0.9353
UGDS_HISP     0.0055
UGDS_ASIAN    0.0019
UGDS_AIAN     0.0024
UGDS_NHPI     0.0019
UGDS_2MOR     0.0000
UGDS_NRA      0.0059
UGDS_UNKN     0.0138
Name: Alabama A & M University, dtype: float64

UGDS_BLACK    0.9353
UGDS_WHITE    0.0333
UGDS_UNKN     0.0138
UGDS_NRA      0.0059
UGDS_HISP     0.0055
UGDS_AIAN     0.0024
UGDS_NHPI     0.0019
UGDS_ASIAN    0.0019
UGDS_2MOR     0.0000
Name: Alabama A & M University, dtype: float64

second pct 0.0333
second_pop 140




ugds is  4206.0


UGDS_WHITE    0.0333
UGDS_BLACK    0.9353
UGDS_HISP     0.0055
UGDS_ASIAN    0.0019
UGDS_AIAN     0.0024
UGDS_NHPI     0.0019
UGDS_2MOR     0.0000
UGDS_NRA      0.0059
UGDS_UNKN     0.0138
Name: Alabama A & M University, dtype: float64

UGDS_BLACK    0.9353
UGDS_WHITE    0.0333
UGDS_UNKN     0.0138
UGDS_NRA      0.0059
UGDS_HISP     0.0055
UGDS_AIAN     0.0024
UGDS_NHPI     0.0019
UGDS_ASIAN    0.0019
UGDS_2MOR     0.0000
Name: Alabama A & M University, dtype: float64

second pct 0.0333
second_pop 140






RuntimeError: ('No active exception to reraise', 'occurred at index Alabama A & M University')

Once you are happy with the results, remove the print/display statements.

In [79]:
def second_most(s):
    ugds, s = s.iloc[0], s.iloc[1:]
    s = s.sort_values(ascending=False)
    second_pct = s.iloc[1]
    second_pop = (second_pct * ugds).astype(int)
    return second_pop

In [80]:
college.apply(second_most, axis='columns').head(10)

INSTNM
Alabama A & M University                140
University of Alabama at Birmingham    2959
Amridge University                       87
University of Alabama in Huntsville     684
Alabama State University                116
The University of Alabama              3340
Central Alabama Community College       415
Athens State University                 358
Auburn University at Montgomery        1453
Auburn University                      1444
dtype: int64

### Exercise 2 - Very difficult
<span style="color:green; font-size:16px">Can you do this without using **`apply`**?</span>

In [86]:
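# sort each row in ascending order; UGDS (a head count) ends up last
# because every other column is a fraction between 0 and 1
college_sorted = np.sort(college, axis=1)
# position -2 is the largest race percentage, so -3 is the second largest
(college_sorted[:, -3] * college_sorted[:, -1]).astype(int)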

array([ 140, 2959,   87, ...,   22,    0,  446])

In [87]:
%%timeit
college_sorted = np.sort(college, axis=1)
(college_sorted[:, -3] * college_sorted[:, -1]).astype(int)

947 µs ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [88]:
%timeit college.apply(second_most, axis='columns')

1.97 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


2000x faster!
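
The column positions used in the NumPy version are easier to follow on a tiny example. This sketch uses the first two schools and only three of the race columns from the data shown earlier, so the sorted rows can be read by eye; the names `total`, `second_pct` and `second_pop` are illustrative.

```python
import numpy as np
import pandas as pd

college = pd.DataFrame(
    {'UGDS': [4206.0, 11383.0],
     'UGDS_WHITE': [0.0333, 0.5922],
     'UGDS_BLACK': [0.9353, 0.2600],
     'UGDS_HISP': [0.0055, 0.0283]},
    index=['Alabama A & M University',
           'University of Alabama at Birmingham'])

# Sort every row independently, smallest value first
sorted_vals = np.sort(college.to_numpy(), axis=1)

total = sorted_vals[:, -1]       # UGDS is the largest value in each row
second_pct = sorted_vals[:, -3]  # -2 is the top percentage, -3 the runner-up
second_pop = (second_pct * total).astype(int)

print(second_pop)
```

The trick relies on the total population being larger than any percentage in the same row, which holds here because the percentages are fractions of at most 1.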

### Exercise 3
<span style="color:green; font-size:16px">When **`apply`** is called on a Series, what is the data type that gets passed to the function?</span>

In [90]:
def foo(x):
    print(type(x))
    raise

In [91]:
college['UGDS'].apply(foo)

<class 'float'>


RuntimeError: No active exception to reraise
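
Instead of raising to stop the output, the types can also be collected in a list. A small sketch, assuming a two-row stand-in for the college data: **`apply`** on a Series hands the function one scalar at a time, while **`apply`** on a DataFrame with **`axis='columns'`** hands it one row as a Series.

```python
import pandas as pd

df = pd.DataFrame({'UGDS': [4206.0, 11383.0],
                   'UGDS_WHITE': [0.0333, 0.5922]})

# Series.apply: record the type of each value passed in
series_types = []
df['UGDS'].apply(lambda x: series_types.append(type(x)))

# DataFrame.apply along the columns axis: record the type of each row passed in
row_types = []
df.apply(lambda row: row_types.append(type(row)), axis='columns')

print(series_types)
print(row_types)
```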

# Components of a groupby operation

In [92]:
import pandas as pd
import numpy as np

In [94]:
college = pd.read_csv('../data/college.csv')
college.head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### Exercise 1
<span style="color:green; font-size:16px">Find the average and max SAT Math and Verbal scores by state and religious affiliation.</span>

In [131]:
# select several columns from a groupby with a list (double brackets)
state_sat = (college.groupby(['STABBR', 'RELAFFIL'])[['SATVRMID', 'SATMTMID']]
                    .agg(['mean', 'max'])
                    .dropna()
                    .astype(int))
state_sat.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,SATVRMID,SATVRMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,max,mean,max
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
AK,1,555,555,503,503
AL,0,514,595,515,590
AL,1,498,565,485,560
AR,0,481,555,503,565
AR,1,505,600,531,600
AZ,0,549,565,548,580
AZ,1,485,485,480,480
CA,0,561,765,584,785
CA,1,529,665,528,665
CO,0,537,635,541,680


### Exercise 2
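<span style="color:green; font-size:16px">Why would we ever use **`map`** with a format string when **`join`** is more straightforward? Convert **`state_sat`** into a DataFrame with a single-level index and single-level columns.</span>

**`'_'.join`** fails on numerics: it raises a **`TypeError`** when any level value is not a string, as with the integer **`RELAFFIL`** level of the index. A format string converts each value for us.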

In [132]:
state_sat.columns = state_sat.columns.map('_'.join)
state_sat.index = state_sat.index.map('{0[0]}_{0[1]}'.format)
state_sat.head()

Unnamed: 0,SATVRMID_mean,SATVRMID_max,SATMTMID_mean,SATMTMID_max
AK_1,555,555,503,503
AL_0,514,595,515,590
AL_1,498,565,485,560
AR_0,481,555,503,565
AR_1,505,600,531,600


### Exercise 3
<span style="color:green; font-size:16px">Verify that the object passed to the custom function in **`apply`** is a DataFrame</span>

In [133]:
from IPython.display import display

In [138]:
def return_single(x):
    print(type(x))
    display(x.head(3))
    raise

In [139]:
college.groupby('STABBR').apply(return_single)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
60,University of Alaska Anchorage,Anchorage,AK,0.0,0.0,0.0,0,,,0.0,...,0.098,0.0181,0.0457,0.4539,1,0.2385,0.2647,0.4386,42500.0,19449.5
61,Alaska Bible College,Palmer,AK,0.0,0.0,0.0,1,,,0.0,...,0.037,0.0,0.0,0.1481,1,0.3571,0.2857,0.4286,,PrivacySuppressed
62,University of Alaska Fairbanks,Fairbanks,AK,0.0,0.0,0.0,0,,,0.0,...,0.0401,0.011,0.306,0.3887,1,0.2263,0.255,0.4519,36200.0,19355


<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
60,University of Alaska Anchorage,Anchorage,AK,0.0,0.0,0.0,0,,,0.0,...,0.098,0.0181,0.0457,0.4539,1,0.2385,0.2647,0.4386,42500.0,19449.5
61,Alaska Bible College,Palmer,AK,0.0,0.0,0.0,1,,,0.0,...,0.037,0.0,0.0,0.1481,1,0.3571,0.2857,0.4286,,PrivacySuppressed
62,University of Alaska Fairbanks,Fairbanks,AK,0.0,0.0,0.0,0,,,0.0,...,0.0401,0.011,0.306,0.3887,1,0.2263,0.255,0.4519,36200.0,19355


RuntimeError: No active exception to reraise

### Exercise 4
<span style="color:green; font-size:16px">Calculate the average SAT Math scores per state weighted by undergraduate population</span>

In [174]:
college_drop = college[['STABBR', 'SATMTMID', 'UGDS']].dropna()

In [177]:
def calc_wa(df):
    wa =  (df['SATMTMID'] * df['UGDS']).sum() / df['UGDS'].sum()
    return wa.astype(int)

In [178]:
college_drop.groupby('STABBR').apply(calc_wa).head(10)

STABBR
AK    503
AL    536
AR    529
AZ    569
CA    564
CO    553
CT    545
DC    621
DE    569
FL    565
dtype: int64
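
The weighted average can also be computed without **`apply`** by grouping the numerator and the denominator separately. This is a sketch on made-up numbers (three hypothetical schools in two states), not on the college data, so the arithmetic can be verified by hand.

```python
import pandas as pd

# Hypothetical data: two schools in AK, one in AL
college_drop = pd.DataFrame({'STABBR': ['AK', 'AK', 'AL'],
                             'SATMTMID': [500.0, 600.0, 520.0],
                             'UGDS': [100.0, 300.0, 50.0]})

# Numerator: score times population, summed within each state
weighted_sum = (college_drop['SATMTMID'] * college_drop['UGDS']) \
    .groupby(college_drop['STABBR']).sum()

# Denominator: total population within each state
total_pop = college_drop.groupby('STABBR')['UGDS'].sum()

# Both results are indexed by state, so the division aligns automatically
weighted_avg = (weighted_sum / total_pop).astype(int)

print(weighted_avg)
```

For AK this gives (500 * 100 + 600 * 300) / 400 = 575, which is pulled toward the larger school as expected.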

# Random Exercises

### Exercise 1
<span style="color:green; font-size:16px">Convert **`df1`** to **`df2`**</span>

In [184]:
df1 = pd.read_csv('../data/e1.csv')
df2 = pd.read_csv('../data/e2.csv')
df1

Unnamed: 0,A,B,C
0,1,0,0
1,1,1,0
2,0,1,0
3,0,0,1


In [185]:
df2

Unnamed: 0,A,B,C,label
0,1,0,0,A
1,1,1,0,AB
2,0,1,0,B
3,0,0,1,C


In [183]:
df1['label'] = np.where(df1, df1.columns, '').sum(axis=1)
df1

Unnamed: 0,A,B,C,label
0,1,0,0,A
1,1,1,0,AB
2,0,1,0,B
3,0,0,1,C
