## Data Analysis with pandas

Each survey respondent was asked questions about what they typically eat for Thanksgiving, along with some demographic questions, like their gender, income, and location. This dataset will allow us to discover regional and income-based patterns in what Americans eat for Thanksgiving dinner.

Using this Thanksgiving survey data, we can answer quite a few interesting questions, like:

- Do people in Suburban areas eat more Tofurkey than people in Rural areas?
- Where do people go to Black Friday sales most often?
- Is there a correlation between praying on Thanksgiving and income?
- What income groups are most likely to have homemade cranberry sauce?

### Task 1 - Load Data

- Read about the [data set](https://github.com/fivethirtyeight/data/tree/master/thanksgiving-2015)
- Read the data into pandas
- Check the shape of the dataframe and examine the different columns
- Check the summary statistics of the dataframe

In [61]:
import pandas as pd
import numpy as np

df = pd.read_csv('thanksgiving.csv', encoding='Latin-1')
df_r, df_c = df.shape
print(df_r, df_c)
#print(df.head(5))
print(df.columns)
#print(df.index)
#print(df.dtypes)
#df.describe(exclude=[np.number])

1058 65
Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypicall
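
For the summary-statistics step, note that `describe()` reports only numeric columns by default, while most columns in this survey appear to hold text answers. A minimal sketch on a small made-up frame (the rows below are illustrative, not taken from the survey) shows how `exclude=[np.number]` brings the text columns in:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the survey frame; the values are made up
toy = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4],
    "Do you celebrate Thanksgiving?": ["Yes", "Yes", "No", "Yes"],
    "Do you typically have gravy?": ["Yes", "No", np.nan, "Yes"],
})

# Default describe(): numeric columns only
numeric_summary = toy.describe()

# Text columns only: count, unique, top (most frequent value), freq
text_summary = toy.describe(exclude=[np.number])

print(numeric_summary.columns.tolist())
print(text_summary)
```

The `count` row excludes missing answers, so it doubles as a quick check of how many respondents skipped each question.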

### Task 2 - Initial Data Analysis

- Create a new dataframe that only includes people who celebrate Thanksgiving
- Find out what the most popular main dish is
- Do people who eat Tofurkey also have gravy as a side dish (calculate proportions)?
- What is the most popular pie?
- How many pies are eaten other than Apple, Pecan and Pumpkin?
- Look at the age distribution.
- Write a function that approximates each respondent's age given the ranges and apply it to all cells (remember type conversion).
- Examine the summary statistics of the age column now.
- Look at the income distribution.
- Write a function that approximates each respondent's income given the ranges and apply it to all cells (remember type conversion).
- Examine the summary statistics of the income column now.
- Compare the distance traveled by lower-income (< $50,000) and higher-income (> $150,000) respondents.
- Use the `pivot_table` function to examine which ages and incomes are more likely to "attend a Friendsgiving" or "meet up with hometown friends".
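
The two "approximate the value from a range" items above can be sketched with a single helper that turns a range string into a number. The sketch assumes ages look like `18 - 29` or `60+`, and incomes like `$25,000 to $49,999`, `$200,000 and up`, or `Prefer not to answer`. The helper name `range_midpoint` and the sample values are hypothetical, so it is worth running `value_counts()` on the real columns first.

```python
import re

import numpy as np
import pandas as pd


def range_midpoint(value):
    """Approximate a range string by the midpoint of the numbers it contains.

    Open-ended ranges such as '60+' or '$200,000 and up' fall back to
    their single bound; anything without digits becomes NaN.
    """
    if pd.isna(value):
        return np.nan
    # Remove thousands separators, then pull out every run of digits
    numbers = [float(n) for n in re.findall(r"\d+", str(value).replace(",", ""))]
    if not numbers:
        return np.nan
    return sum(numbers) / len(numbers)


# Made-up sample values in the assumed formats
ages = pd.Series(["18 - 29", "30 - 44", "60+", np.nan])
incomes = pd.Series(["$25,000 to $49,999", "$200,000 and up", "Prefer not to answer"])

approx_age = ages.apply(range_midpoint)
approx_income = incomes.apply(range_midpoint)

print(approx_age.tolist())
print(approx_income.describe())
```

On the real data this would be applied as something like `df_t['Age'].apply(range_midpoint)`, followed by `describe()` on the result. Using the lower bound for open-ended ranges understates the top group, which matters when comparing means.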


In [59]:
df_t = df[df['Do you celebrate Thanksgiving?'] == 'Yes'].copy()

main_dish_str = 'What is typically the main dish at your Thanksgiving dinner?'
main_dish_other_str = 'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)'
gravy_str = "Do you typically have gravy?"

# Blank out the 'Other (please specify)' placeholder, keeping any other values
df_t[main_dish_other_str] = df_t[main_dish_other_str].replace('Other (please specify)', np.nan)
maindish = df_t[[main_dish_str, main_dish_other_str]]

#print(maindish)
# idxmax() returns the most frequent label; max() returns its count
counts = maindish[main_dish_str].value_counts()
value = counts.idxmax()
key = counts.max()
print("Most popular main dish:", value, " votes:", key)

# cleaning up data using map, an example
#df['How much total combined money did all members of your HOUSEHOLD earn last year?'] = \
#df['How much total combined money did all members of your HOUSEHOLD earn last year?'].\
#map(lambda x: x.replace('$','').replace('Prefer not to answer', 'NaN').replace('to', '-').replace(',','').\
#    replace('and up', '+').strip())

# convert to category
to_cat_columns = df.select_dtypes(['object']).columns
df[to_cat_columns] = df[to_cat_columns].apply(lambda s: s.astype('category'))

#df['How much total combined money did all members of your HOUSEHOLD earn last year?'] = \
#df['How much total combined money did all members of your HOUSEHOLD earn last year?'].astype('float64')

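# pivot table: number of respondents in each age and gender group
pd.pivot_table(df, index=['Age', 'What is your gender?'], values='RespondentID', aggfunc='count', observed=True)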

# Flag Tofurkey eaters and gravy eaters as 0/1, then compute the share of
# Tofurkey eaters who also typically have gravy
df_t['tfk'] = (df_t[main_dish_str] == "Tofurkey").astype('Int64')
df_t['gravy'] = (df_t[gravy_str] == "Yes").astype('Int64')
df_t['tfk_gravy'] = df_t.tfk & df_t.gravy
print("Proportion of people who eat Tofurkey and gravy = {}%".format(df_t['tfk_gravy'].sum()*100/df_t['tfk'].sum()))

#df_t_grav = df_grav[df_grav[gravy_str] == 1 & df_grav[main_dis_str] == 1]
#df_tfk_gra = df_tfk[df_tfk[gravy_str] == "Yes"]
#print(df_tfk_gra[gravy_str].map



Most popular main dish: Turkey  votes: 859
Proportion of people who eat Tofurkey and gravy = 60.0%
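
For the `pivot_table` item in the task list, one possible approach is to turn a Yes/No answer into 0/1 so that its mean per group is the share answering Yes. The sketch below runs on a small made-up frame. The short column names `age`, `income`, and `friendsgiving` are placeholders for the real survey columns, which should be looked up in `df.columns`.

```python
import pandas as pd

# Made-up rows standing in for the survey; column names are placeholders
toy = pd.DataFrame({
    "age": ["18 - 29", "18 - 29", "18 - 29", "60+", "60+", "60+"],
    "income": ["low", "low", "high", "low", "high", "high"],
    "friendsgiving": ["Yes", "Yes", "No", "No", "No", "Yes"],
})

# Encode Yes as 1 and everything else as 0, so the group mean is the Yes share
toy["friendsgiving_yes"] = (toy["friendsgiving"] == "Yes").astype(int)

share = pd.pivot_table(
    toy,
    index="age",
    columns="income",
    values="friendsgiving_yes",
    aggfunc="mean",
)
print(share)
```

Missing answers are counted as "No" by this encoding; dropping them first with `dropna(subset=[...])` is a reasonable alternative if non-response is common.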


### Task 3 - Visualization

Using matplotlib:
- Use groupby to examine the breakdown of income by type of cranberry sauce
- Use agg to compute the mean and plot the results in a bar chart
- Use agg to find the average income of people who eat "Homemade" cranberry sauce and "Tofurkey"
- Choose an appropriate plot or chart to visualize the results
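
The groupby and agg steps above could look roughly like the following sketch. It assumes income has already been converted to a number per respondent. The frame, its values, and the column names `cranberry`, `main_dish`, and `income` are made up for illustration and stand in for the real survey columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; the real frame would carry the approximated income column
toy = pd.DataFrame({
    "cranberry": ["Homemade", "Canned", "Homemade", "Canned", "None"],
    "main_dish": ["Tofurkey", "Turkey", "Turkey", "Turkey", "Turkey"],
    "income": [80000.0, 40000.0, 100000.0, 60000.0, 50000.0],
})

# Mean income per type of cranberry sauce
mean_income = toy.groupby("cranberry")["income"].agg("mean")

# Bar chart of the group means
ax = mean_income.plot(kind="bar")
ax.set_xlabel("Type of cranberry sauce")
ax.set_ylabel("Mean approximate income")
plt.tight_layout()

# Mean income for each (sauce, main dish) combination
combo = toy.groupby(["cranberry", "main_dish"])["income"].agg("mean")
print(combo.loc[("Homemade", "Tofurkey")])
```

A bar chart suits this comparison because the x-axis is a small set of unordered categories. Combinations such as Homemade plus Tofurkey may contain very few respondents, so it helps to pass `["mean", "count"]` to `agg` and read the two together.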