## The Thanksgiving data**

This data comes from FiveThirtyEight's data repository (https://github.com/fivethirtyeight/data/tree/master/thanksgiving-2015). I give them full credit for collecting it.

\*\* *This data is in no way representative of the total US population. FiveThirtyEight polled 1058 respondents on Nov 17th 2015. Furthermore, due to the small sample size, the results are unlikely to be statistically significant. This project was for learning purposes (NumPy, Pandas, Jupyter).*


## Goals

**Some important questions that we set out to answer during this project:**

* What is the most popular dish at Thanksgiving dinner?  What is the most popular meal?
* What percent of people don't have pie during Thanksgiving?
* Is travel correlated with age or income?
* Is celebrating Friendsgiving correlated with age? 

## Motivations 

* To learn Pandas and NumPy, two cool Python libraries used for data analysis.

**NumPy:**
* Makes it easy to create and manipulate data stored in multi-dimensional arrays.

**Pandas:**
* Built on top of NumPy
* Flexibility to hold data of different types!
* Flexibility to deal with missing values.
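
A minimal sketch of these points, using made-up values rather than the survey data: a NumPy array holds a single data type in a grid, while a Pandas DataFrame can mix types across columns and tolerate missing values.

```python
import numpy as np
import pandas as pd

# NumPy: a 2-D array of a single dtype, summed along one axis.
grid = np.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]
column_totals = grid.sum(axis=0)       # sums down each column

# Pandas: columns of different types, with a missing value in "age".
# The names and values below are invented for illustration.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [34, np.nan, 51],
    "celebrates": [True, False, True],
})

print(column_totals)
print(toy.dtypes)
print(toy["age"].mean())   # missing values are skipped by default
```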

In [1]:


import pandas as pd  

data = pd.read_csv("thanksgiving.csv", 
                   encoding="Latin-1")

## Remove responses from people who are not celebrating the holiday:

In [2]:
is_american = 'Do you celebrate Thanksgiving?'
print(data[is_american].value_counts())

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64


In [11]:
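america_bool = data[is_american] == "Yes"
# Copy the filtered rows so that columns can be added later without
# triggering pandas' SettingWithCopyWarning.
americans = data[america_bool].copy()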
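
The filtering step relies on boolean indexing: comparing a column to a value gives a Series of True/False, and indexing the DataFrame with it keeps only the True rows. Here is a small sketch on a toy frame (the four rows are invented; only the question text comes from the survey):

```python
import pandas as pd

# Toy stand-in for the survey data.
toy = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4],
    "Do you celebrate Thanksgiving?": ["Yes", "No", "Yes", "Yes"],
})

celebrates = toy["Do you celebrate Thanksgiving?"] == "Yes"  # boolean Series
kept = toy[celebrates].copy()   # .copy() so later column assignments are safe

print(celebrates.tolist())
print(len(kept))
```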


### What is the most common meal at Thanksgiving?

In [40]:
main_dish = "What is typically the main dish at your Thanksgiving dinner?"
main_cooked = 'How is the main dish typically cooked?'
stuffing = 'What kind of stuffing/dressing do you typically have?'
cran = 'What type of cranberry saucedo you typically have?'

main_dishes = americans[[main_cooked, main_dish, stuffing, cran]].copy()
main_dishes.mode()

|   | How is the main dish typically cooked? | What is typically the main dish at your Thanksgiving dinner? | What kind of stuffing/dressing do you typically have? | What type of cranberry saucedo you typically have? |
|---|---|---|---|---|
| 0 | Baked | Turkey | Bread-based | Canned |


### Baked turkey with bread-based stuffing was by far the most common main dish at people's Thanksgiving dinners

### Next: Find the most common 1) side dish, 2) pie, 3) dessert

##### Side Dish

In [62]:
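# `mac` and `brus` are assumed to hold the full column names of two
# side-dish questions; they are not defined in the cells shown here.
[americans[mac].value_counts().idxmax(), americans[mac].value_counts().max()]
a2 = americans[[mac, brus]].copy()
a2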

# Use regex to select columns pertaining to side dishes

import re

def filter_names(names, dish_type):
    # Keep only the column names that match the dish_type pattern.
    regex = re.compile(dish_type).search
    return [name for name in names if regex(name)]

col_names = americans.columns.values
side_dish_names = filter_names(col_names, 'side dishes')

# Create new dataframe to analyze side dishes.

side_dishes = americans[side_dish_names].copy()

# Use regex to shorten column names

def shorten_list(names, search):
    # Return the matched text for every name where the search succeeds.
    return [m.group(0) for m in map(search, names) if m]

searchRegex = re.compile('(?<= - ).*').search
short_dishes = shorten_list(side_dish_names, searchRegex)

side_dishes.columns = short_dishes 

side_dishes.head(5)
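
To see what the two regex steps do, here is a small standalone sketch. The column names are shortened, invented stand-ins for the real survey columns; the lookbehind `(?<= - )` matches only text that follows `" - "`, so the long question prefix is dropped.

```python
import re

# Illustrative column names; the real survey columns are much longer.
names = [
    "Which side dishes are served? - Mashed potatoes",
    "Which side dishes are served? - Corn",
    "Age",
]

after_dash = re.compile(r"(?<= - ).*").search

# Keep names that mention side dishes, then keep only the text after " - ".
side_names = [n for n in names if re.search("side dishes", n)]
short_names = [after_dash(n).group(0) for n in side_names]

print(short_names)
```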


most_common_sides = {}
for food in side_dishes.columns:
    count_food = side_dishes[food].value_counts()
    most_common_sides[food] = count_food.max()

most_common_sides

best_side_dish = max(most_common_sides, key=most_common_sides.get)
best_side_dish

'Mashed potatoes'
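
The loop above can likely be shortened. Assuming each side-dish column holds either the dish name or a missing value (which is how the select-all-that-apply answers appear to be stored), `DataFrame.count()` gives the number of selections per column directly. A sketch on invented data:

```python
import numpy as np
import pandas as pd

# Toy version of side_dishes: a cell holds the dish name if selected, else NaN.
toy_sides = pd.DataFrame({
    "Corn": ["Corn", np.nan, np.nan],
    "Mashed potatoes": ["Mashed potatoes", "Mashed potatoes", np.nan],
})

counts = toy_sides.count()      # non-null answers per column
print(counts.idxmax(), counts.max())
```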

In [None]:
##### The most common side dish reported was mashed potatoes

In [None]:
ate_gravy = "Do you typically have gravy?"
tofurkey_eaters_bool = americans[main_dish] == "Tofurkey"
tofurkeyers = americans[tofurkey_eaters_bool]

In [None]:
apple_q = "Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple"
pumpkin_q = "Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin"
pecan_q = "Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan"

ate_apple = americans[apple_q]
ate_pumpkin = americans[pumpkin_q]
ate_pecan = americans[pecan_q]

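# True for respondents who selected none of these three pies
# (apple, pumpkin, pecan); other pie types are not checked here.
no_pies = (pd.isnull(ate_apple) &
           pd.isnull(ate_pumpkin) &
           pd.isnull(ate_pecan))

print(no_pies.value_counts())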
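
The goal asks for a percentage rather than raw counts. One way to get it, sketched here on invented answers, is to take the mean of the boolean mask, since True counts as 1 and False as 0. Like the cell above, this only considers the three pie columns shown.

```python
import numpy as np
import pandas as pd

# Toy answers: a pie name if that pie is served, NaN otherwise.
toy = pd.DataFrame({
    "apple": ["Apple", np.nan, np.nan, np.nan],
    "pumpkin": [np.nan, "Pumpkin", np.nan, np.nan],
    "pecan": ["Pecan", np.nan, np.nan, "Pecan"],
})

no_pie = toy.isnull().all(axis=1)   # True when all three answers are missing
pct_no_pie = no_pie.mean() * 100    # the mean of booleans is the share of True

print(pct_no_pie)
```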

In [None]:
def convert_age(n):
    if pd.isnull(n):
        return None
    
    age = str(n).split(" ")[0]
    age = age.replace("+","")
    return int(age)

In [None]:
americans["int_age"] = americans["Age"].apply(convert_age)
print(americans["int_age"].describe())

# Actual ages not used
### Each age range was converted to its lower bound for simplicity.

| Original Range  | Age used for analysis  |
|-----------------|------------------------|
| 18 - 29 | 18  | 
| 30 - 44  | 30  | 
| 45 - 59  |  45 |
| 60+  | 60  |
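
As a quick check of the mapping in the table, the sketch below applies the same `convert_age` logic to one sample string per range plus a missing answer. The sample Series is invented for illustration.

```python
import pandas as pd

def convert_age(n):
    """Return the lower bound of an age range such as "18 - 29" or "60+"."""
    if pd.isnull(n):
        return None
    age = str(n).split(" ")[0]
    return int(age.replace("+", ""))

ages = pd.Series(["18 - 29", "30 - 44", "45 - 59", "60+", None])
converted = ages.apply(convert_age)

print(converted)
```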
 


In [None]:
def convert_income(n): # i.e. $10,000
    if pd.isnull(n):
        return None 
    
    income = n.split(" ")[0]
    
    if (income == "Prefer"):
        return None
    
    income = income.replace("$", "")
    income = income.replace(",", "")
    
    return int(income) #==> 10000

In [None]:
income_q = "How much total combined money did all members of your HOUSEHOLD earn last year?"

americans["int_income"] = americans[income_q].apply(convert_income)


# Not a true depiction of incomes
### Each income range was converted to its lower bound for simplicity.


| Original Income Range  | Income used for analysis  |
|------------------------|---------------------------|
|  \$0 - \$9,999          |  \$0                        | 
|  \$10,000 - \$24,999    |  \$10,000                    | 
|  \$25,000 - \$49,999     |  \$25,000                    |
|  \$50,000 - \$74,999     |  \$50,000                    |
|  \$75,000 - \$99,999     |  \$75,000                    |
|  \$100,000 - \$124,999   |  \$100,000                   |
|  \$125,000 - \$149,999   |  \$125,000                   |
|  \$150,000 - \$174,999   |  \$150,000                   |
|  \$175,000 - \$199,999   |  \$175,000                   |
|  \$200,000+             |  \$200,000                   |
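
The conversion can be checked on a few sample strings. The exact response labels below are assumptions made for illustration; only the leading dollar amount and the word "Prefer" matter to the function.

```python
import pandas as pd

def convert_income(n):
    """Return the lower bound of an income range as an int, or None."""
    if pd.isnull(n):
        return None
    first = n.split(" ")[0]
    if first == "Prefer":          # e.g. a "Prefer not to answer" response
        return None
    return int(first.replace("$", "").replace(",", ""))

# Assumed labels, not copied from the survey file.
samples = ["$25,000 to $49,999", "$200,000 and up", "Prefer not to answer", None]
converted = [convert_income(s) for s in samples]

print(converted)
```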

### The seemingly larger number of responses in the final category (\$200,000+) is just because it covers the widest income range.

| Income  | Respondents  |
|---------|--------------|
|  \$0       |       52  |
|  \$10,000   |       60  |
|  \$25,000   |      166  |
|  \$50,000   |      127  |
|  \$75,000   |      127  |
|  \$100,000  |      109  |
|  \$125,000  |       48  |
|  \$150,000  |       38  |
|  \$175,000  |       26  |
|  \$200,000  |       76  |

## Calculate breakdown of travel by Income
##### Test the hypothesis that low-income earners are more likely to travel because they travel to their parents' house


###### First, create a pivot table broken down by travel and income

In [None]:
travel_q = "How far will you travel for Thanksgiving?"
index_id = "RespondentID"
income = "int_income"
no_travel = "Thanksgiving is happening at my home--I won't travel at all"


travel_by_income = pd.pivot_table(americans, 
                                  values=[index_id], 
                                  index=[travel_q],
                                  columns=[income],
                                  aggfunc=len, margins=True)

# Next, divide each column by its "All" total to get the share of each travel answer within an income group

travel_by_income_as_pct = travel_by_income.div(travel_by_income.iloc[-1,:], axis=1)
pct_no_travel = travel_by_income_as_pct.loc[no_travel]

In [None]:
pct_travel = (1 - pct_no_travel)
print(pct_travel)

| Income  | Percent of Income Group who Travel  |
|---------|-------------------------------------|
|  \$10,000   |       58%  |
|  \$25,000   |      64%  |
|  \$50,000   |      68%  |
|  \$75,000   |      54%  |
|  \$100,000  |      54%  |
|  \$125,000  |       46%  |
|  \$150,000  |       55%  |
|  \$175,000  |       58%  |
|  \$200,000  |       50%  |
|  All        |       58%  |
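
An alternative sketch of the same calculation uses `pd.crosstab` with `normalize="columns"`, which returns within-column shares directly and avoids dividing by the margins row. The data and the short column names here are invented stand-ins for the survey columns.

```python
import pandas as pd

# Toy data; column names are shortened stand-ins for the survey questions.
toy = pd.DataFrame({
    "income": [10000, 10000, 10000, 25000, 25000, 25000, 25000],
    "travel": ["home", "local", "far", "home", "home", "home", "local"],
})

# Share of each travel answer within every income column (columns sum to 1).
shares = pd.crosstab(toy["travel"], toy["income"], normalize="columns")
pct_travel = 1 - shares.loc["home"]

print(pct_travel)
```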

#### Without performing statistical analysis on the data, it is hard to say whether these differences are significant.
* Overall, 58% of respondents travelled for Thanksgiving.
* The \$10k, \$25k, and \$50k income groups travelled 58%, 64%, and 68% of the time, respectively.
* All of these buckets were at or above the overall average.
* Again, we can't yet tell whether these differences are significant.
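
One possible next step would be a chi-square test of independence between income group and travel. The sketch below computes the test statistic by hand on invented counts (not the survey results), just to show the mechanics; this test was not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Invented counts of respondents, not the survey results.
observed = pd.DataFrame(
    {"travels": [30, 20], "stays home": [20, 30]},
    index=["lower income", "higher income"],
)

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
total = observed.to_numpy().sum()

# Expected counts if travel were independent of income group.
expected = np.outer(row_totals, col_totals) / total

chi2 = ((observed.to_numpy() - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(chi2, dof)
```

A larger statistic means the observed counts sit further from what independence would predict. If SciPy is available, `scipy.stats.chi2_contingency` runs this kind of test and also reports a p-value.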

## Calculate breakdown of travel by age group
#### Let's look at the same question, but by age group.

In [None]:
travel_by_age = pd.pivot_table(americans,
                               values=[index_id],
                               index=[travel_q], 
                               columns=["int_age"],
                               aggfunc=len,
                               margins=True)




In [None]:
print(travel_by_age)

# Linking Friendship with Age and Income

In [None]:
hometown_q = "Have you ever tried to meet up with hometown friends on Thanksgiving night?"
friendsgiving_q = 'Have you ever attended a "Friendsgiving?"'

friendship_age = americans.pivot_table(index=hometown_q, 
                                       columns=[friendsgiving_q], 
                                       values="int_age")

print(friendship_age)

In [None]:
friendship_income = americans.pivot_table(index=hometown_q, 
                                          columns=[friendsgiving_q], 
                                          values="int_income")
print(friendship_income)
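
Note that these two pivot tables do not count respondents. When no `aggfunc` is given, `pivot_table` averages the `values` column, so each cell is the mean converted age (or income) of that group. A small sketch on invented data, with shortened made-up column names:

```python
import pandas as pd

# Toy data with shortened, made-up column names.
toy = pd.DataFrame({
    "met_hometown_friends": ["Yes", "Yes", "No", "No"],
    "friendsgiving": ["Yes", "Yes", "No", "Yes"],
    "int_age": [18, 30, 60, 45],
})

# With no aggfunc given, pivot_table averages the values in each cell.
mean_age = toy.pivot_table(index="met_hometown_friends",
                           columns="friendsgiving",
                           values="int_age")

print(mean_age)
```

Cells with no respondents, such as the "Yes"/"No" combination here, come back as NaN.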