# Analysis of American Thanksgiving using Pandas

This analysis is based on a dataset from FiveThirtyEight. Each row is one individual's response to a survey that included questions about the dishes served, location, and income. From here, we will be able to draw conclusions about the types of meals eaten during Thanksgiving by geographical location and income. Let's see what we can find!

In [6]:
import pandas as pd

#import the thanksgiving.csv into a dataframe.
data = pd.read_csv("thanksgiving.csv", encoding="Latin-1")


#Here we can see that 980 of the 1058 respondents who answered celebrate Thanksgiving and 78 do not.
counts = data["Do you celebrate Thanksgiving?"].value_counts()
print(counts)



Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64


In [12]:
#Let's now remove the respondents who do not celebrate Thanksgiving, as we are
#only interested in the "Yes" data

yes_filter = data["Do you celebrate Thanksgiving?"] == "Yes"
#Only "Yes" rows are left; .copy() avoids a SettingWithCopyWarning when new columns are added later
data = data[yes_filter].copy()


#Let's find out which main dishes respondents reported. After running the code,
#we can see that the main dish is Turkey by far!
main_dish_counts = data["What is typically the main dish at your Thanksgiving dinner?"].value_counts()
print(main_dish_counts)


Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64


In [15]:
#For fun, let's see how many respondents also have Gravy when they have Tofurkey.

tofurkey_filter = data["What is typically the main dish at your Thanksgiving dinner?"] == "Tofurkey"
gravy_tofurkey = data[tofurkey_filter]["Do you typically have gravy?"]

#When printed, we can see roughly 3:2 ratio of "Yes" to "No"
print(gravy_tofurkey)


4      Yes
33     Yes
69      No
72      No
77     Yes
145    Yes
175    Yes
218     No
243    Yes
275     No
393    Yes
399    Yes
571    Yes
594    Yes
628     No
774     No
820     No
837    Yes
860     No
953    Yes
Name: Do you typically have gravy?, dtype: object
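
Rather than counting the printed rows by eye, the same comparison can be summarized with `value_counts()`. The following is a minimal sketch on a made-up toy table; the short column names `main_dish` and `gravy` are hypothetical stand-ins for the long survey questions.

```python
import pandas as pd

# Toy stand-in for the survey; these values are invented for illustration only.
toy = pd.DataFrame({
    "main_dish": ["Tofurkey", "Turkey", "Tofurkey", "Tofurkey", "Turkey"],
    "gravy": ["Yes", "Yes", "No", "Yes", "No"],
})

tofurkey_rows = toy["main_dish"] == "Tofurkey"

# .loc selects the matching rows and the gravy column in a single step,
# then value_counts() tallies each distinct answer.
gravy_counts = toy.loc[tofurkey_rows, "gravy"].value_counts()
print(gravy_counts)
```

Applied to the real data, the same call on `gravy_tofurkey` would give the "Yes" and "No" totals directly.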


In [28]:
#Let's now look at desserts. Specifically, how many people eat apple, pecan, or pumpkin pie during Thanksgiving

apple_isnull = data["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple"].isnull()
pumpkin_isnull = data["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin"].isnull()
pecan_isnull = data["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan"].isnull()


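#True means all three pie columns are null, i.e. the respondent selected none of them
no_pies = apple_isnull & pumpkin_isnull & pecan_isnull
#We can see that 104 respondents eat none of these three pies, while 876 eat at least one
print(no_pies.value_counts())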

False    876
True     104
dtype: int64
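
Combining `isnull()` masks with `&` answers "none of these", not "all of these", which is easy to mix up. The sketch below shows the three related questions side by side on an invented toy table; the column names `apple`, `pumpkin`, and `pecan` are hypothetical shorthand for the survey's pie columns.

```python
import numpy as np
import pandas as pd

# Toy data: a filled cell means the respondent ticked that pie, NaN means they did not.
pies = pd.DataFrame({
    "apple": ["Apple", np.nan, "Apple", np.nan],
    "pumpkin": ["Pumpkin", np.nan, np.nan, "Pumpkin"],
    "pecan": ["Pecan", np.nan, np.nan, np.nan],
})

none_of_three = pies.isnull().all(axis=1)   # every pie column is empty
all_three = pies.notnull().all(axis=1)      # every pie column is filled
at_least_one = pies.notnull().any(axis=1)   # one or more columns filled

print("none:", none_of_three.sum())
print("all three:", all_three.sum())
print("at least one:", at_least_one.sum())
```

Note that `none_of_three` and `at_least_one` are exact complements, which matches the True/False split in the output above.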


In [125]:
#We can now look at the age column, but first we need to clean up this data slightly.
def age_to_int(row):
    #Return the lower bound of the respondent's age bracket, e.g. "60+" becomes "60".
    #Despite the name, the value is still a string; it is converted to a number later.
    row_age = row["Age"]
    if pd.isnull(row_age):
        return None
    elif "+" in row_age:
        first_item = row_age.replace("+", "")
    else:
        first_item = row_age.split(" ",1)[0]
    return first_item
    
data["int_age"] = data.apply(age_to_int, axis=1)

print(data["int_age"].value_counts())

    

45    269
60    258
30    235
18    185
Name: int_age, dtype: int64


Using these values, one thing stands out: the age distribution looks off.

This points to a problem with the sample. There are more respondents in the older brackets than in the youngest one, so the 18-29 group is probably underrepresented, most likely because younger people were less likely to respond to the survey. In addition, the int_age column holds only the lower bound of each bracket, so it understates actual ages. No conclusions can be drawn from this data at this point.
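
As an alternative to the row-wise function, the lower bound of each bracket can be pulled out with a vectorized string method. This is a sketch only: the bracket labels below are assumed from the parsing logic and the printed values, not copied from the CSV.

```python
import numpy as np
import pandas as pd

# Assumed bracket labels; the real "Age" column may be formatted slightly differently.
ages = pd.Series(["18 - 29", "30 - 44", "45 - 59", "60+", np.nan])

# Grab the first run of digits in each label; missing values stay missing.
lower_bound = ages.str.extract(r"(\d+)", expand=False).astype(float)
print(lower_bound)
```

Because the result is numeric from the start, no later `astype(float)` conversion would be needed. The values are still bracket floors, not real ages.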


In [137]:
#Let's now look at income bracket to see if we can see any trends

def income_to_int(row):
    #Return the lower bound of the respondent's income bracket as an int,
    #or None for missing and "Prefer not to answer" responses.
    row_income = row["How much total combined money did all members of your HOUSEHOLD earn last year?"]

    if pd.isnull(row_income):
        return None
    else:
        first_item = row_income.split(" ",1)[0]
        if "Prefer" in first_item:
            return None
        first_item = first_item.replace("$","").replace(",","")
        first_item = int(first_item)
    return first_item
    
data["int_income"] = data.apply(income_to_int, axis=1)
print(data["int_income"].value_counts())

25000.0     166
75000.0     127
50000.0     127
100000.0    109
200000.0     76
10000.0      60
0.0          52
125000.0     48
150000.0     38
175000.0     26
Name: int_income, dtype: int64


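Looking at this data, it doesn't seem representative of the survey population. We can be almost 100% certain that no one has exactly $0 of household income. That is a dead giveaway that our method is imprecise: it keeps only the first number of each bracket, so every int_income value is the lower bound of a range rather than an actual income, and 0 simply marks the lowest bracket.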
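
To make the lower-bound behaviour concrete, here is a small sketch that applies the same parsing steps to a single column instead of whole rows. The bracket labels are assumptions inferred from the parsing code and its output, and `lower_bound` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Assumed bracket labels; check them against the real column before relying on this.
incomes = pd.Series([
    "$0 to $9,999",
    "$25,000 to $49,999",
    "$200,000 and up",
    "Prefer not to answer",
    np.nan,
])

def lower_bound(label):
    """Return the first number in an income bracket, or NaN if there is none."""
    if pd.isnull(label) or label.startswith("Prefer"):
        return np.nan
    first_item = label.split(" ", 1)[0]
    return int(first_item.replace("$", "").replace(",", ""))

# Series.apply passes one cell at a time, so no axis argument is needed.
parsed = incomes.apply(lower_bound)
print(parsed)
```

The first label parses to 0, which shows where the 0.0 rows come from: they are respondents in the lowest bracket, not households with no income.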

In [140]:
#Now let's see if we can find a correlation between distance traveled and income.
#The hypothesis is that lower-income earners are more likely to travel farther
#for Thanksgiving instead of being able to host their own party.
#Note: respondents whose int_income is exactly 150000 fall into neither group.

less_150000 = (data["int_income"] < 150000)
greater_150000 = (data["int_income"] > 150000)

print(data[less_150000]["How far will you travel for Thanksgiving?"].value_counts())
print(data[greater_150000]["How far will you travel for Thanksgiving?"].value_counts())

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64
Thanksgiving is happening at my home--I won't travel at all                         49
Thanksgiving is local--it will take place in the town I live in                     25
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    16
Thanksgiving is out of town and far away--I have to drive several hours or fly      12
Name: How far will you travel for Thanksgiving?, dtype: int64


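Well, it seems our hypothesis was wrong. Comparing percentages rather than raw counts, the two groups look broadly similar: about 41% of respondents under $150,000 and 48% of those over $150,000 celebrate at home, while about 8% and 12% respectively travel far away. Lower-income respondents do not appear to travel farther; whatever the income, the breakdown is fairly consistent.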
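
Since the two groups differ a lot in size, the comparison is easier to read as percentages. The sketch below rebuilds the two outputs above by hand (counts copied from the printed results, answer labels shortened for readability) and converts them to shares of each group.

```python
import pandas as pd

# Shortened, hypothetical labels for the four travel answers.
labels = ["At my home", "Local", "Out of town, a few hours", "Out of town, far away"]

# Counts copied from the two value_counts() outputs above.
under_150k = pd.Series([281, 203, 150, 55], index=labels)
over_150k = pd.Series([49, 25, 16, 12], index=labels)

# Divide each group by its own total so the columns are comparable.
shares = pd.DataFrame({
    "under_150k": under_150k / under_150k.sum(),
    "over_150k": over_150k / over_150k.sum(),
}).mul(100).round(1)
print(shares)
```

On the real DataFrame, passing `normalize=True` to `value_counts()` gives the same proportions without rebuilding the counts by hand.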

In [173]:
#Now let's look at the data pertaining to "Friendsgivings" and meeting with hometown friends on Thanksgiving night.
#The hypothesis here is that both are more common among younger people.
data["int_age"] = data["int_age"].astype(float)


print(pd.pivot_table(data,index="Have you ever tried to meet up with hometown friends on Thanksgiving night?", columns='Have you ever attended a "Friendsgiving?"',values="int_age"))
pd.pivot_table(data,index="Have you ever tried to meet up with hometown friends on Thanksgiving night?", columns='Have you ever attended a "Friendsgiving?"',values="int_income")


Have you ever attended a "Friendsgiving?"                  No        Yes
Have you ever tried to meet up with hometown fr...                      
No                                                  42.283702  37.010526
Yes                                                 41.475410  33.976744


Have you ever attended a "Friendsgiving?"                     No           Yes
Have you ever tried to meet up with hometown fr...
No                                                  78914.549654  72894.736842
Yes                                                 78750.000000  66019.736842
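
The pivot tables above rely on `pd.pivot_table` averaging the `values` column within each index/column combination (mean is the default aggregation). A minimal sketch on invented data makes that explicit and also counts how many rows feed each cell; the column names `met_friends`, `friendsgiving`, and `age` are hypothetical.

```python
import pandas as pd

# Toy responses, invented for illustration only.
toy = pd.DataFrame({
    "met_friends": ["No", "No", "Yes", "Yes", "Yes"],
    "friendsgiving": ["No", "Yes", "No", "Yes", "Yes"],
    "age": [45.0, 30.0, 60.0, 18.0, 30.0],
})

# Mean age per combination of the two answers (aggfunc spelled out for clarity).
table = pd.pivot_table(
    toy, index="met_friends", columns="friendsgiving", values="age", aggfunc="mean"
)

# Number of respondents behind each cell, useful for judging how reliable a mean is.
cell_sizes = pd.crosstab(toy["met_friends"], toy["friendsgiving"])

print(table)
print(cell_sizes)
```

Checking cell sizes alongside the means is worthwhile here, since a cell built from only a handful of respondents can shift a lot with one extra answer.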


Using these results, we can see our hypothesis is somewhat correct. The clearest case is where someone has attended a "Friendsgiving" AND met up with friends on Thanksgiving night: the average age is 33.98, whereas with both no