# INFO 212: Data Science Programming 1
___

### Week 9: Data Analysis Examples
___

### Mon., May 28, 2018 (Holiday, no class), and Wed., May 30, 2018
---

**Question:**
- What can I learn from real-world data analysis examples?

**Objectives:**
- Apply the techniques learned in this course to real-world data analysis problems

```
import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)```

## USDA Food Database
The US Department of Agriculture makes available a database of food nutrient information. 
The records look like this:

```
{
  "id": 21441,
  "description": "KENTUCKY FRIED CHICKEN, Fried Chicken, EXTRA CRISPY,
Wing, meat and skin with breading",
  "tags": ["KFC"],
  "manufacturer": "Kentucky Fried Chicken",
  "group": "Fast Foods",
  "portions": [
    {
      "amount": 1,
      "unit": "wing, with skin",
      "grams": 68.0
    },

    ...
  ],
  "nutrients": [
    {
      "value": 20.8,
      "units": "g",
      "description": "Protein",
      "group": "Composition"
    },

    ...
  ]
}
```

Each food has a number of identifying attributes along with two lists, one of nutrients and one of portion sizes. Data in this form is not particularly amenable to analysis, so we need to do some work to wrangle it into a better form.

```
import json
# Use a context manager so the file is closed after loading
with open('datasets/usda-food-database.json') as fh:
    db = json.load(fh)
len(db)```

Each entry in db is a dict containing all the data for a single food. The 'nutrients'
field is a list of dicts, one for each nutrient:

```
import pandas as pd
db[0].keys()```

```
db[0]['nutrients'][0]```

```
nutrients = pd.DataFrame(db[0]['nutrients'])
nutrients.head()```

When converting a list of dicts to a DataFrame, we can specify a list of fields to
extract. We’ll take the food names, group, ID, and manufacturer:

```
info_keys = ['description', 'group', 'id', 'manufacturer']```

```
info = pd.DataFrame(db, columns=info_keys)
info.head()```

```
info.info()```

You can see the distribution of food groups with value_counts:

```
info['group'].value_counts()[:10]```

Now, to do some analysis on all of the nutrient data, it’s easiest to assemble the
nutrients for each food into a single large table. To do so, we need to take several
steps. First, I’ll convert each list of food nutrients to a DataFrame, add a column for
the food id, and append the DataFrame to a list. Then, these can be concatenated
together with concat:

```
nutrients = []

for rec in db:
    fnuts = pd.DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id']
    nutrients.append(fnuts)

nutrients = pd.concat(nutrients, ignore_index=True)```

```
nutrients```
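
As an alternative to the explicit loop, `pd.json_normalize` can flatten a nested list field in a single call. The sketch below runs on two made-up records (`toy_db` is hypothetical, not the USDA file) that mimic the structure shown earlier. It is meant only to illustrate the idea, and whether it is faster on the full database has not been checked here.

```python
import pandas as pd

# Hypothetical records that mimic the structure of the USDA entries
toy_db = [
    {"id": 1, "description": "Food A",
     "nutrients": [
         {"value": 20.8, "units": "g", "description": "Protein", "group": "Composition"},
         {"value": 1.5, "units": "mg", "description": "Zinc, Zn", "group": "Elements"},
     ]},
    {"id": 2, "description": "Food B",
     "nutrients": [
         {"value": 3.1, "units": "g", "description": "Protein", "group": "Composition"},
     ]},
]

# One row per nutrient; the parent record's 'id' is carried along as a column
toy_nutrients = pd.json_normalize(toy_db, record_path="nutrients", meta=["id"])
print(toy_nutrients)
```

The result has the same columns as the loop-based version: the four nutrient fields plus `id`.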

Check for duplicate rows, then drop them:

```
nutrients.duplicated().sum()  # number of duplicates```

```
nutrients = nutrients.drop_duplicates()```
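
To see what `duplicated` counts, here is a small sketch on a made-up table (`toy` is hypothetical). By default only the later copies of a row are flagged; passing `keep=False` flags every row that has a twin, which is handy for inspecting duplicates before dropping them.

```python
import pandas as pd

# Hypothetical nutrient rows; the first two are exact copies of each other
toy = pd.DataFrame({"id": [1, 1, 1, 2],
                    "nutrient": ["Protein", "Protein", "Zinc, Zn", "Protein"],
                    "value": [20.8, 20.8, 1.5, 3.1]})

n_dupes = toy.duplicated().sum()              # counts later copies only
all_copies = toy[toy.duplicated(keep=False)]  # every row that has a twin
deduped = toy.drop_duplicates()               # keeps the first copy of each row
print(n_dupes, len(all_copies), len(deduped))
```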

Since 'group' and 'description' are in both DataFrame objects, we can rename for
clarity:

```
col_mapping = {'description' : 'food',
               'group'       : 'fgroup'}```

```
info = info.rename(columns=col_mapping)
info.info()```

```
col_mapping = {'description' : 'nutrient',
               'group' : 'nutgroup'}
nutrients = nutrients.rename(columns=col_mapping)
nutrients.head()```

With all of this done, we’re ready to merge info with nutrients:

```
ndata = pd.merge(nutrients, info, on='id', how='outer')
ndata.info()```

```
ndata.iloc[30000]```
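
When merging a detail table onto a lookup table, it can be worth checking the assumptions behind the join. The sketch below uses two made-up tables (`toy_info` and `toy_nutrients` are hypothetical) with the `validate` and `indicator` options of `pd.merge`. It is an optional safeguard, not something the analysis above depends on.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
toy_info = pd.DataFrame({"id": [1, 2, 3],
                         "food": ["Food A", "Food B", "Food C"]})
toy_nutrients = pd.DataFrame({"id": [1, 1, 2],
                              "nutrient": ["Protein", "Zinc, Zn", "Protein"],
                              "value": [20.8, 1.5, 3.1]})

# validate raises MergeError if 'id' is not unique in the right-hand table;
# indicator adds a '_merge' column saying which table each row came from
toy_ndata = pd.merge(toy_nutrients, toy_info, on="id", how="outer",
                     validate="many_to_one", indicator=True)
print(toy_ndata)
```

Food C has no nutrient rows, so the outer join keeps it with missing nutrient values and marks it `right_only`.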

We could now make a plot of median values by food group and nutrient type:

```
fig = plt.figure()```

```
result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)```

```
result['Zinc, Zn'].sort_values().plot(kind='barh')```
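
The 0.5 quantile is the median, so `quantile(0.5)` and `median()` agree. A minimal sketch on made-up values (`toy` is hypothetical) also shows how selecting on the first index level leaves a Series indexed by food group, which is what gets plotted:

```python
import pandas as pd

# Hypothetical zinc values for two food groups
toy = pd.DataFrame({
    "nutrient": ["Zinc, Zn"] * 5,
    "fgroup": ["Beef", "Beef", "Beef", "Fruit", "Fruit"],
    "value": [4.0, 6.0, 5.0, 1.0, 3.0],
})

by_q = toy.groupby(["nutrient", "fgroup"])["value"].quantile(0.5)
by_med = toy.groupby(["nutrient", "fgroup"])["value"].median()

# Indexing with the nutrient name drops that level, leaving food groups as the index
zinc = by_q["Zinc, Zn"].sort_values()
print(zinc)
```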

We can find which food is most dense in each nutrient:

```
by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])

get_maximum = lambda x: x.loc[x.value.idxmax()]
get_minimum = lambda x: x.loc[x.value.idxmin()]

max_foods = by_nutrient.apply(get_maximum)[['value', 'food']]

# truncate long food names to 50 characters
max_foods.food = max_foods.food.str[:50]```

```
max_foods.loc['Amino Acids']['food']```
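
The same "row with the largest value per group" lookup can also be written without `apply`. The sketch below uses made-up rows (`toy` and its food names are hypothetical) and assumes there are no missing values in `value`; it is an alternative formulation, not the approach used above.

```python
import pandas as pd

# Hypothetical nutrient table with two candidate foods per nutrient
toy = pd.DataFrame({
    "nutgroup": ["Amino Acids", "Amino Acids", "Elements", "Elements"],
    "nutrient": ["Alanine", "Alanine", "Zinc, Zn", "Zinc, Zn"],
    "food": ["Gelatin", "Apple", "Oyster", "Rice"],
    "value": [8.0, 0.01, 90.0, 1.1],
})

# idxmax returns, for each group, the row label of the largest 'value'
top_rows = toy.groupby(["nutgroup", "nutrient"])["value"].idxmax()

# Pull those rows and index them by group so lookups match the apply-based result
toy_max_foods = (toy.loc[top_rows]
                    .set_index(["nutgroup", "nutrient"])[["value", "food"]])
print(toy_max_foods)
```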

## 2012 Federal Election Commission Database
The US Federal Election Commission publishes data on contributions to political
campaigns. This includes contributor names, occupation and employer, address, and
contribution amount. An interesting dataset is from the 2012 US presidential election.
A version of the dataset is a 150-megabyte CSV file, P00000001-ALL.csv:

```
fec = pd.read_csv('datasets/fec/P00000001-ALL.csv')
fec.info()```

A sample record looks like this:

```
fec.iloc[123456]```

You may think of some ways to start slicing and dicing this data to extract informative
statistics about donors and patterns in the campaign contributions. Here are
a number of different analyses that apply techniques from this course.

You can see that there are no political party affiliations in the data, so this would be
useful to add. You can get a list of all the unique political candidates using unique:

```
unique_cands = fec.cand_nm.unique()
unique_cands```

One way to indicate party affiliation is using a dict (for illustration purposes only):

```
parties = {'Bachmann, Michelle': 'Republican',
           'Cain, Herman': 'Republican',
           'Gingrich, Newt': 'Republican',
           'Huntsman, Jon': 'Republican',
           'Johnson, Gary Earl': 'Republican',
           'McCotter, Thaddeus G': 'Republican',
           'Obama, Barack': 'Democrat',
           'Paul, Ron': 'Republican',
           'Pawlenty, Timothy': 'Republican',
           'Perry, Rick': 'Republican',
           "Roemer, Charles E. 'Buddy' III": 'Republican',
           'Romney, Mitt': 'Republican',
           'Santorum, Rick': 'Republican'}```

Now, using this mapping and the map method on Series objects, you can compute an
array of political parties from the candidate names:

```
fec.cand_nm[123456:123461]```

```
fec.cand_nm[123456:123461].map(parties)```

```
fec['party'] = fec.cand_nm.map(parties)
fec['party'].value_counts()```
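
One thing to watch with `map` and a dict: any name missing from the dict becomes `NaN`, and `value_counts` drops `NaN` by default, so gaps in the mapping are easy to overlook. A small sketch on made-up input (`toy_parties`, `names`, and "Doe, Jane" are hypothetical):

```python
import pandas as pd

toy_parties = {"Obama, Barack": "Democrat", "Romney, Mitt": "Republican"}
# "Doe, Jane" is a made-up name with no entry in the dict
names = pd.Series(["Obama, Barack", "Romney, Mitt", "Doe, Jane"])

mapped = names.map(toy_parties)                  # missing keys become NaN
unmapped = names[mapped.isna()].unique()         # names that still need a party
counts = mapped.value_counts(dropna=False)       # include the NaN bucket in the tally
print(unmapped)
print(counts)
```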

A couple of data preparation points. First, this data includes both contributions and
refunds (negative contribution amount):

```
(fec.contb_receipt_amt > 0).value_counts()```

To simplify the analysis, I’ll restrict the dataset to positive contributions:

```
fec = fec[fec.contb_receipt_amt > 0].copy()```

Since Barack Obama and Mitt Romney were the main two candidates, I’ll also prepare
a subset that just has contributions to their campaigns:

```
fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]```

### Donation Statistics by Occupation and Employer
Donations by occupation are another oft-studied statistic. For example, lawyers (attorneys)
tend to donate more money to Democrats, while business executives tend to
donate more to Republicans. You need not take that claim on trust; you can check it yourself
in the data. First, the total number of donations by occupation is easy to compute:

```
fec.contbr_occupation.value_counts()[:10]```

You will notice by looking at the occupations that many refer to the same basic job
type, or there are several variants of the same thing. The following code snippet illustrates
a technique for cleaning up a few of them by mapping from one occupation to
another; note the “trick” of using dict.get to allow occupations with no mapping to
“pass through”:

```
occ_mapping = {
   'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
   'INFORMATION REQUESTED' : 'NOT PROVIDED',
   'INFORMATION REQUESTED (BEST EFFORTS)' : 'NOT PROVIDED',
   'C.E.O.': 'CEO'
}```

```
f = lambda x: occ_mapping.get(x, x)
fec.contbr_occupation = fec.contbr_occupation.map(f)```
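
To see why the `dict.get` fallback matters, compare it with passing the dict straight to `map`. This sketch uses a made-up Series (`occ` and `toy_mapping` are hypothetical):

```python
import pandas as pd

toy_mapping = {"C.E.O.": "CEO", "INFORMATION REQUESTED": "NOT PROVIDED"}
occ = pd.Series(["C.E.O.", "ATTORNEY", "INFORMATION REQUESTED"])

# Passing the dict directly: values without a mapping turn into NaN
direct = occ.map(toy_mapping)

# dict.get(x, x) returns x itself when there is no mapping, so it passes through
passthrough = occ.map(lambda x: toy_mapping.get(x, x))
print(direct.tolist())
print(passthrough.tolist())
```

With the fallback, "ATTORNEY" survives unchanged instead of being lost.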

Do the same thing for employer:

```
emp_mapping = {
   'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
   'INFORMATION REQUESTED' : 'NOT PROVIDED',
   'SELF' : 'SELF-EMPLOYED',
   'SELF EMPLOYED' : 'SELF-EMPLOYED',
}

# If no mapping provided, return x
f = lambda x: emp_mapping.get(x, x)
fec.contbr_employer = fec.contbr_employer.map(f)```

Now, you can use pivot_table to aggregate the data by party and occupation, then
filter down to the subset that donated at least $2 million overall:

```
by_occupation = fec.pivot_table('contb_receipt_amt',
                                index='contbr_occupation',
                                columns='party', aggfunc='sum')```

```
over_2mm = by_occupation[by_occupation.sum(1) > 2000000]
over_2mm```
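
The pivot-then-filter pattern is easier to follow on a handful of rows. The sketch below uses made-up donations (`toy` is hypothetical) and a threshold of 500 instead of 2 million:

```python
import pandas as pd

# Hypothetical donations
toy = pd.DataFrame({
    "contbr_occupation": ["ATTORNEY", "ATTORNEY", "CEO", "RETIRED", "RETIRED"],
    "party": ["Democrat", "Republican", "Republican", "Democrat", "Democrat"],
    "contb_receipt_amt": [500.0, 200.0, 1000.0, 50.0, 25.0],
})

# One row per occupation, one column per party, cells hold summed amounts
by_occ = toy.pivot_table(values="contb_receipt_amt", index="contbr_occupation",
                         columns="party", aggfunc="sum")

# Row totals skip NaN cells (occupations with no donations to one party)
over_500 = by_occ[by_occ.sum(axis=1) > 500]
print(over_500)
```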

It can be easier to look at this data graphically as a bar plot ('barh' means horizontal
bar plot):

```
plt.figure()```

```
over_2mm.plot(kind='barh')```

You might be interested in the top donor occupations or top companies that donated
to Obama and Romney. To do this, you can group by candidate name and use a variant
of the top method from earlier in the chapter:

```
def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)['contb_receipt_amt'].sum()
    return totals.nlargest(n)```

Then aggregate by occupation and employer:

```
grouped = fec_mrbo.groupby('cand_nm')```

```
grouped.apply(get_top_amounts, 'contbr_occupation', n=10)```

```
grouped.apply(get_top_amounts, 'contbr_employer', n=10)```
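
For comparison, the same top-n-per-candidate idea can be sketched without `apply`: sum by candidate and employer, sort, and keep the first n rows of each candidate. The data below is made up (`toy` and the employer "ACME" are hypothetical), and n is 2 to keep the output short.

```python
import pandas as pd

# Hypothetical donations to the two candidates
toy = pd.DataFrame({
    "cand_nm": ["Obama, Barack"] * 4 + ["Romney, Mitt"] * 3,
    "contbr_employer": ["SELF-EMPLOYED", "RETIRED", "SELF-EMPLOYED", "ACME",
                        "RETIRED", "ACME", "ACME"],
    "contb_receipt_amt": [100.0, 50.0, 25.0, 10.0, 300.0, 20.0, 30.0],
})

# Total donated per (candidate, employer) pair
totals = toy.groupby(["cand_nm", "contbr_employer"])["contb_receipt_amt"].sum()

# After sorting by amount, head(2) keeps the two largest totals for each candidate
top2 = totals.sort_values(ascending=False).groupby(level="cand_nm").head(2)
print(top2)
```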

### Bucketing Donation Amounts
A useful way to analyze this data is to use the cut function to discretize the contribution
amounts into buckets by contribution size:

```
bins = np.array([0, 1, 10, 100, 1000, 10000,
                 100000, 1000000, 10000000])
labels = pd.cut(fec_mrbo.contb_receipt_amt, bins)
labels```
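
By default `pd.cut` builds intervals that are open on the left and closed on the right, so a donation of exactly 10 lands in `(1, 10]`, and a value equal to the lowest edge gets no bin at all. A minimal sketch on made-up amounts (`amounts` is hypothetical):

```python
import numpy as np
import pandas as pd

amounts = pd.Series([0.5, 1, 10, 10.01, 250, 2500])  # made-up donation amounts
bins = np.array([0, 1, 10, 100, 1000, 10000])

# Default right=True: each bin is (left, right]
toy_labels = pd.cut(amounts, bins)
counts = toy_labels.value_counts().sort_index()

# A value equal to the lowest edge is outside every bin and becomes NaN
edge_case = pd.cut(pd.Series([0.0]), bins)
print(counts)
```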

We can then group the data for Obama and Romney by name and bin label to get a
histogram by donation size:

```
grouped = fec_mrbo.groupby(['cand_nm', labels])
grouped.size().unstack(0)```

This data shows that Obama received a significantly larger number of small donations
than Romney. You can also sum the contribution amounts and normalize within
buckets to visualize percentage of total donations of each size by candidate:

```
bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
normed_sums = bucket_sums.div(bucket_sums.sum(axis=1), axis=0)
normed_sums```

```
normed_sums[:-2].plot(kind='barh')```
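
The normalization step relies on `div` with `axis=0`, which aligns the row totals with the row index so that each row is divided by its own total. A small sketch with made-up bucket sums (`toy_sums` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical summed donations per bucket and candidate
toy_sums = pd.DataFrame(
    {"Obama, Barack": [300.0, 100.0], "Romney, Mitt": [100.0, 300.0]},
    index=["(10, 100]", "(100, 1000]"],
)

# Each row is divided by its own total, so every row sums to 1
toy_normed = toy_sums.div(toy_sums.sum(axis=1), axis=0)
print(toy_normed)
```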

### Donation Statistics by State
Aggregating the data by candidate and state is a routine affair:

```
grouped = fec_mrbo.groupby(['cand_nm', 'contbr_st'])
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
totals = totals[totals.sum(1) > 100000]
totals[:10]```

If you divide each row by its total contribution amount, you get each candidate's relative
share (as a fraction) of the donations from that state:

```
percent = totals.div(totals.sum(1), axis=0)
percent[:10]```