# solutions

In [1]:
import pandas as pd

## problem 6
Use the file `shakespeare_words.tsv` to display the total words Shakespeare wrote by genre.

In [2]:
SHAKES_FILE = '~/gits/gads_26/datasets/shakespeare_words.tsv'
df6 = pd.read_csv(SHAKES_FILE, sep='\t')

# select 'Words' explicitly so any non-numeric columns are left out of the sum
df6.groupby('Genre')[['Words']].sum()

Genre,Words
Comedy,283011
History,263358
Tragedy,289628


## problem 7
Use the file `state_hts.tsv` to print the highest peak for only those states beginning with the letter 'A'.

In [3]:
STATES_FILE = '~/gits/gads_26/datasets/state_hts.tsv'
df7 = pd.read_csv(STATES_FILE, sep='\t')

df7[df7.state.apply(lambda k: k[0] == 'A')]

,state,peak,elev_ft
0,Alabama,Cheaha Mountain,2405
1,Alaska,Denali,20320
2,Arizona,Humphreys Peak,12633
3,Arkansas,Magazine Mountain,2753


By the way, `lambda` functions can be useful but sometimes they're not the right tool for the job. If you plan to use a function more than once, or if using `lambda` makes your code harder to understand, then it's better to define the function independently.

Here's a solution to problem 7 that takes things more slowly. Can you see why they're the same? Note that `k` and `my_string` are playing the same role.

In [4]:
def starts_with_A(my_string):
    return my_string[0] == 'A'

mask = df7.state.apply(starts_with_A)    # apply starts_with_A to each element of df7.state
mask[:10]                                # first 10 elements of the boolean mask

0     True
1     True
2     True
3     True
4    False
5    False
6    False
7    False
8    False
9    False
Name: state, dtype: bool

In [5]:
df7[mask]

,state,peak,elev_ft
0,Alabama,Cheaha Mountain,2405
1,Alaska,Denali,20320
2,Arizona,Humphreys Peak,12633
3,Arkansas,Magazine Mountain,2753
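
A third option, sketched here under some assumptions: pandas string methods can build the same mask without `apply`, and `idxmax` can then pick out the single highest peak that the problem statement asks for. The four 'A' rows below are copied from the output above. The last row is a made-up placeholder (not from `state_hts.tsv`), and `df7_demo`, `mask_vec`, `a_states` and `highest` are illustrative names, not part of the original solution.

```python
import pandas as pd

# Four 'A' rows copied from the output above; the final row is a made-up
# placeholder so the filter has something to exclude.
df7_demo = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'Placeholder'],
    'peak': ['Cheaha Mountain', 'Denali', 'Humphreys Peak',
             'Magazine Mountain', 'Placeholder Peak'],
    'elev_ft': [2405, 20320, 12633, 2753, 1],
})

# Vectorized equivalent of apply(starts_with_A)
mask_vec = df7_demo['state'].str.startswith('A')
a_states = df7_demo[mask_vec]

# Row with the largest elevation among the 'A' states
highest = a_states.loc[a_states['elev_ft'].idxmax()]
print(highest['peak'], highest['elev_ft'])
```

On the real data, the same two steps would be applied to `df7` instead of `df7_demo`.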


## problem 8
Use the file `admissions.tsv` to look at a) total admissions rates by gender and b) departmental admissions rates by gender. What do you think is going on here?

### part a

In [6]:
ADM_FILE = '~/gits/gads_26/datasets/admissions.tsv'
df8 = pd.read_csv(ADM_FILE, sep='\t')

m_total_rate = df8.m_admit.sum() / float(df8.m_appl.sum())   # need float for python2.x
f_total_rate = df8.f_admit.sum() / float(df8.f_appl.sum())

print('m_total_rate = {}, f_total_rate = {}'.format(m_total_rate, f_total_rate))   # parentheses work in python2.x and 3.x

m_total_rate = 0.460231660232, f_total_rate = 0.303542234332


At first glance, it looks like men are admitted at a substantially higher rate than women. But...

### part b

In [7]:
df8['m_rate'] = df8.m_admit / df8.m_appl
df8['f_rate'] = df8.f_admit / df8.f_appl
df8['f_wins'] = df8.f_rate > df8.m_rate

df8

,dept,m_appl,m_admit,f_appl,f_admit,m_rate,f_rate,f_wins
0,A,825,512,108,89,0.620606,0.824074,True
1,B,560,353,25,17,0.630357,0.68,True
2,C,325,120,593,202,0.369231,0.340641,False
3,D,417,138,375,131,0.330935,0.349333,True
4,E,191,53,393,94,0.277487,0.239186,False
5,F,272,16,341,24,0.058824,0.070381,True


But in fact, women are admitted at higher rates than men in 4 of 6 departments! This one is a thinker...see if you can figure out what's going on (hint: conditional probabilities).
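
One way to start exploring the hint, offered as a rough sketch rather than the intended answer: each total rate is a weighted average of the departmental rates, where the weights are the share of that gender's applications sent to each department. The counts below are copied from the table above. `df8_demo`, `m_share` and `f_share` are illustrative names, and Python 3 division is assumed.

```python
import pandas as pd

# Counts copied from the part b table
df8_demo = pd.DataFrame({
    'dept': list('ABCDEF'),
    'm_appl': [825, 560, 325, 417, 191, 272],
    'm_admit': [512, 353, 120, 138, 53, 16],
    'f_appl': [108, 25, 593, 375, 393, 341],
    'f_admit': [89, 17, 202, 131, 94, 24],
})

# Share of each gender's applications that went to each department
m_share = df8_demo['m_appl'] / df8_demo['m_appl'].sum()
f_share = df8_demo['f_appl'] / df8_demo['f_appl'].sum()

# Departmental admission rates, as in part b
m_rate = df8_demo['m_admit'] / df8_demo['m_appl']
f_rate = df8_demo['f_admit'] / df8_demo['f_appl']

# Total rate = sum over departments of (departmental rate * application share)
m_total = (m_rate * m_share).sum()
f_total = (f_rate * f_share).sum()

print(pd.DataFrame({'dept': df8_demo['dept'], 'm_share': m_share, 'f_share': f_share}))
print('m_total = {}, f_total = {}'.format(m_total, f_total))
```

Comparing the `m_share` and `f_share` columns against the departmental rates shows where each group's applications are concentrated.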

## problem 9
See `presidents.py`.