# Analyzing Thanksgiving Dinner using Pandas


This project analyzes what people ate for Thanksgiving and looks for interesting patterns in the responses, 
using the pandas package in Python.

The data used for the project can be found [here](https://github.com/fivethirtyeight/data/blob/master/thanksgiving-2015/thanksgiving-2015-poll-data.csv).

The first step is to read in the data correctly and to analyze the first few rows and columns of the dataframe. 

In [2]:
#Importing the pandas package

import pandas as pd

#Reading in the file with the proper encoding

data = pd.read_csv("thanksgiving.csv", encoding = "Latin-1")

#Viewing the first few rows of the dataframe

data.head(5)

Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
0,4337954960,Yes,Turkey,,Baked,,Bread-based,,,,...,Yes,No,No,No,,Suburban,18 - 29,Male,"$75,000 to $99,999",Middle Atlantic
1,4337951949,Yes,Turkey,,Baked,,Bread-based,,Other (please specify),Homemade cranberry gelatin ring,...,No,No,Yes,No,,Rural,18 - 29,Female,"$50,000 to $74,999",East South Central
2,4337935621,Yes,Turkey,,Roasted,,Rice-based,,Homemade,,...,Yes,Yes,Yes,No,,Suburban,18 - 29,Male,"$0 to $9,999",Mountain
3,4337933040,Yes,Turkey,,Baked,,Bread-based,,Homemade,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$200,000 and up",Pacific
4,4337931983,Yes,Tofurkey,,Baked,,Bread-based,,Canned,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$100,000 to $124,999",Pacific
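
The `encoding` argument matters when a file contains bytes that are not valid UTF-8, which is what pandas assumes by default. The following is a minimal sketch using a made-up two-row CSV held in memory rather than the survey file; the column names and values are hypothetical and chosen only to include accented characters.

```python
import io

import pandas as pd

# Hypothetical CSV content, encoded as Latin-1. The accented characters
# become single bytes that do not form valid UTF-8 sequences.
raw_bytes = "RespondentID,Main dish\n1,Turkey\n2,Pâté\n".encode("latin-1")

# Plain Python check: the same bytes cannot be decoded as UTF-8.
try:
    raw_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Telling pandas the actual encoding lets the file parse cleanly.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")

print(valid_utf8)
print(sample)
```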


As a next step, we look at the column names. Here we display the first five:

In [6]:
columns = data.columns
columns[:5]

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?'],
      dtype='object')
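
Because the column names are full survey questions, it can help to search them by keyword instead of typing them out. This is a small sketch on an empty stand-in DataFrame that reuses a few of the survey's column names; the variable names `toy` and `gravy_columns` are made up for illustration.

```python
import pandas as pd

# Stand-in frame: only the column labels matter for this example.
toy = pd.DataFrame(columns=[
    "RespondentID",
    "Do you celebrate Thanksgiving?",
    "Do you typically have gravy?",
    "How far will you travel for Thanksgiving?",
])

# Case-insensitive substring match over the column labels.
gravy_columns = [name for name in toy.columns if "gravy" in name.lower()]

print(gravy_columns)
```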

## Filtering by people who celebrate Thanksgiving only

Since we only want to analyze respondents who actually celebrate Thanksgiving, 
we keep only those who answered **Yes** to the question:
    
- **Do you celebrate Thanksgiving?**

In [7]:
#Finding out the counts of the number of people who celebrate thanksgiving

data['Do you celebrate Thanksgiving?'].value_counts()

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64

We see that 980 people celebrate Thanksgiving, so after filtering, our new 
dataframe should have 980 rows.

In [13]:
#Keeping only the rows where the respondents answered 'Yes'

data_yes = data[data['Do you celebrate Thanksgiving?'] == 'Yes']
data_yes['Do you celebrate Thanksgiving?'].count()

980
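
The filter works through a boolean mask: the comparison produces one `True`/`False` value per row, and indexing with that mask keeps only the `True` rows. A minimal sketch on a made-up four-row frame (the `RespondentID` values are invented) shows the mechanics and the same row-count check:

```python
import pandas as pd

# Hypothetical miniature version of the survey.
toy = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4],
    "Do you celebrate Thanksgiving?": ["Yes", "No", "Yes", "Yes"],
})

# Element-wise comparison gives a boolean Series aligned with the rows.
is_yes = toy["Do you celebrate Thanksgiving?"] == "Yes"

# Indexing with the mask keeps only the rows where it is True.
toy_yes = toy[is_yes]

# The filtered row count should match the 'Yes' count from value_counts().
expected_rows = toy["Do you celebrate Thanksgiving?"].value_counts()["Yes"]
print(len(toy_yes), expected_rows)
```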

## Analyzing the main dish served at Thanksgiving

We can now explore the main dishes that people eat for Thanksgiving.

In [15]:
#Finding out the counts of the main dishes that people eat

data_yes['What is typically the main dish at your Thanksgiving dinner?'].value_counts()

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64

From the data above we can observe that:

- Turkey is by far the most common main dish served at Thanksgiving, with 859 responses.
- Turducken is the least common of the listed main dishes, with only 3 responses.

Tofurkey is a dish that's usually enjoyed with gravy. Let's explore our data and find out if people
who have Tofurkey as their main dish also have gravy with it.

In [19]:
#Filtering out the data that have the main dish as Tofurkey

data_tofurkey = data_yes[data_yes['What is typically the main dish at your Thanksgiving dinner?'] == 'Tofurkey']

#Displaying the gravy answers of the people who have Tofurkey

data_tofurkey['Do you typically have gravy?']

4      Yes
33     Yes
69      No
72      No
77     Yes
145    Yes
175    Yes
218     No
243    Yes
275     No
393    Yes
399    Yes
571    Yes
594    Yes
628     No
774     No
820     No
837    Yes
860     No
953    Yes
Name: Do you typically have gravy?, dtype: object

We can observe that **12 people have gravy** with their Tofurkey while **8 people do not**.
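
Rather than counting the answers by eye, `value_counts()` can tally them directly. Here is a minimal sketch on made-up data; the short column names `main` and `gravy` are hypothetical stand-ins for the full survey questions.

```python
import pandas as pd

# Hypothetical responses: main dish and whether gravy is served.
toy = pd.DataFrame({
    "main": ["Tofurkey", "Turkey", "Tofurkey", "Tofurkey", "Turkey"],
    "gravy": ["Yes", "Yes", "No", "Yes", "No"],
})

# Keep the Tofurkey rows, then tally the gravy answers.
tofurkey_rows = toy[toy["main"] == "Tofurkey"]
gravy_counts = tofurkey_rows["gravy"].value_counts()

print(gravy_counts)
```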

## Analyzing the desserts served during Thanksgiving

In the section below we are going to explore the dessert dishes that people have for 
Thanksgiving. We look at three types of pie:

- Apple
- Pecan
- Pumpkin

Let's analyze the 3 columns and find out how many respondents have a null value in all of them:

In [20]:
#Column-name prefix shared by the three pie questions
pie_question = "Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - "

#True when a respondent left all three pie columns empty, i.e. selected none of these pies
no_pies = (
    pd.isnull(data_yes[pie_question + "Apple"])
    & pd.isnull(data_yes[pie_question + "Pecan"])
    & pd.isnull(data_yes[pie_question + "Pumpkin"])
)

#Counting the respondents with (False) and without (True) at least one of the three pies

no_pies.value_counts()

False    876
True     104
dtype: int64

So 104 respondents selected none of the 3 pies (all three columns are null), while 876 selected at least one.
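
The same "null in every column" check can be written without chaining `&` by calling `isnull()` on the selected columns and reducing across each row with `all(axis=1)`. This is a sketch on made-up data; the short column names stand in for the long survey questions.

```python
import numpy as np
import pandas as pd

# Hypothetical answers: a pie name when selected, NaN when not.
toy = pd.DataFrame({
    "Apple":   ["Apple", np.nan, np.nan, "Apple"],
    "Pecan":   [np.nan, np.nan, "Pecan", "Pecan"],
    "Pumpkin": ["Pumpkin", np.nan, np.nan, np.nan],
})

# isnull() gives a boolean frame; all(axis=1) is True only for rows
# where every one of the three columns is null.
toy_no_pie = toy[["Apple", "Pecan", "Pumpkin"]].isnull().all(axis=1)

print(toy_no_pie.value_counts())
```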

## Calculating summary statistics for the age of the people who took the survey

Let's now analyze the age of the participants: 

In [24]:
data_yes["Age"].value_counts()

45 - 59    269
60+        258
30 - 44    235
18 - 29    185
Name: Age, dtype: int64

The ages are recorded as ranges, and the oldest bracket is written as '60+'. 
This means the column holds strings, so we cannot perform numerical computations 
on it to obtain useful summary statistics like the mean or median.

To work around this, we write a function that returns the lower bound of each range as an integer:

In [32]:
def extract_age(age_str):
    if pd.isnull(age_str):
        return None
    age_str = age_str.split(" ")[0]
    age_str = age_str.replace("+", "")
    return int(age_str)

In [34]:
#Applying the function we created

data_yes["Age_int"] = data_yes["Age"].apply(extract_age)

#Extracting the summary statistics for the age

data_yes["Age_int"].describe()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: Age_int, dtype: float64

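From the data above we can see that:

- The mean of the extracted ages is about 40.
- The minimum is 18 and the maximum is 60. These are simply the lower bounds of the youngest (18 - 29) and oldest (60+) brackets.
- The statistics are biased downward because we took the first age in every range 
  (for example, 45 - 59 becomes 45), so they understate the respondents' actual ages.
- Only 947 of the 980 respondents reported an age; the missing values are excluded from the statistics.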
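
One way to reduce the downward bias is to use the midpoint of each range instead of its lower bound. The function below is a hypothetical variant of `extract_age`, not part of the original analysis. It assumes ranges are written as `"18 - 29"`, and because the open-ended `"60+"` bracket has no upper bound, it falls back to 60 for that bracket.

```python
import pandas as pd

def extract_age_midpoint(age_str):
    """Hypothetical variant of extract_age returning the range midpoint."""
    if pd.isnull(age_str):
        return None
    # "18 - 29" -> ["18", "29"]; "60+" -> ["60"] (no upper bound available)
    parts = age_str.replace("+", "").split(" - ")
    bounds = [int(part) for part in parts]
    return sum(bounds) / len(bounds)

# The four brackets that appear in the survey, plus a missing value.
ages = pd.Series(["18 - 29", "30 - 44", "45 - 59", "60+", None])
midpoints = ages.apply(extract_age_midpoint)

print(midpoints)
```

The result is still an approximation, since every respondent in a bracket is assigned the same value.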

## Analyzing the household income of the people who took the survey

We have another numeric column stored as strings for which we want to 
extract summary statistics. This column is called 'How much total combined 
money did all members of your HOUSEHOLD earn last year?'.


In [38]:
#Analyzing the column 
data_yes["How much total combined money did all members of your HOUSEHOLD earn last year?"].value_counts()

$25,000 to $49,999      166
$75,000 to $99,999      127
$50,000 to $74,999      127
Prefer not to answer    118
$100,000 to $124,999    109
$200,000 and up          76
$10,000 to $24,999       60
$0 to $9,999             52
$125,000 to $149,999     48
$150,000 to $174,999     38
$175,000 to $199,999     26
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: int64

We now have to remove the '$' symbol and the commas and extract the first number in every range so that we can compute
the summary statistics of interest. Respondents who chose 'Prefer not to answer' are treated as missing.

In [39]:
def extract_income(income_str):
    if pd.isnull(income_str):
        return None
    income_str = income_str.split(" ")[0]
    if income_str == "Prefer":
        return None
    income_str = income_str.replace(",", "")
    income_str = income_str.replace("$", "")
    return int(income_str)

data_yes["Money_int"] = data_yes["How much total combined money did all members of your HOUSEHOLD earn last year?"].apply(extract_income)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [40]:
#Displaying the summary statistics of the incomes

data_yes["Money_int"].describe()

count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%       25000.000000
50%       75000.000000
75%      100000.000000
max      200000.000000
Name: Money_int, dtype: float64

From the data above we can see that:

- The mean of the extracted household incomes is about 76K USD.
- The lowest value is 0 USD. This is not a data entry error; it is the lower bound of the 
  '$0 to $9,999' bracket.
- The maximum value is 200K USD, the lower bound of the '$200,000 and up' bracket.
- The standard deviation (about 59K USD) is fairly high, suggesting a wide spread of incomes.
- Only 829 respondents are included, because missing and 'Prefer not to answer' responses are excluded.
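
The message printed after the `Age_int` and `Money_int` assignments is pandas' `SettingWithCopyWarning`. It appears because `data_yes` was produced by filtering `data`, so pandas cannot tell whether the new column is meant for a copy or for the original frame. One common way to avoid it is to take an explicit `.copy()` when filtering. The sketch below uses a made-up three-row frame with hypothetical column names (`celebrates`, `income`, `income_known`), not the survey data.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the survey.
toy = pd.DataFrame({
    "celebrates": ["Yes", "No", "Yes"],
    "income": ["$25,000 to $49,999", "$0 to $9,999", "Prefer not to answer"],
})

# .copy() makes the filtered frame independent of the original,
# so adding a column to it is unambiguous.
toy_yes = toy[toy["celebrates"] == "Yes"].copy()
toy_yes["income_known"] = toy_yes["income"] != "Prefer not to answer"

print(toy_yes)
print(list(toy.columns))  # the original frame is unchanged
```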

## Analyzing travel behavior of people based on income for thanksgiving dinners

Next, we want to analyze how far people travel for Thanksgiving dinner. 
My hypothesis is that people who travel far are probably young and earn
less, and are travelling back to their parents' home for dinner.

People who are older and/or have a higher income are more likely to have dinner at their own house and hence 
do not travel far. 

To test this claim, we compare travel distances for a low income and a high income group:

In [45]:
#Keeping the rows where the extracted income is less than 35000 (brackets starting at $25,000 or below)

low_income = data_yes[data_yes['Money_int'] < 35000]

#Counting the different distances that people will travel in the low income bracket

low_income['How far will you travel for Thanksgiving?'].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         106
Thanksgiving is local--it will take place in the town I live in                      92
Thanksgiving is out of town but not too far--it's a drive of a few hours or less     64
Thanksgiving is out of town and far away--I have to drive several hours or fly       16
Name: How far will you travel for Thanksgiving?, dtype: int64

From the data above:

- Most people in the low income bracket (198 of 278) won't travel at all
  or will have dinner in the town they live in.

In [49]:
#Keeping the rows where the extracted income is greater than 150,000 (brackets starting at $175,000 or above)

high_income = data_yes[data_yes['Money_int'] > 150000]

#Counting the different distances that people will travel in the high income bracket

high_income['How far will you travel for Thanksgiving?'].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         49
Thanksgiving is local--it will take place in the town I live in                     25
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    16
Thanksgiving is out of town and far away--I have to drive several hours or fly      12
Name: How far will you travel for Thanksgiving?, dtype: int64

From the data above:
    
- Most people in the high income bracket (74 of 102) also stay at home or have dinner in the town they live in.

However, raw counts do not let us compare the two income brackets, because the groups differ in size. Instead, we compute 
the proportion of people who stayed at home out of the total number of people in each bracket.

In [52]:
#Proportion for low income 

len(low_income[low_income["How far will you travel for Thanksgiving?"] == 
               "Thanksgiving is happening at my home--I won't travel at all"])/len(low_income)

0.381294964028777

In [53]:
#Proportion for high income

len(high_income[high_income["How far will you travel for Thanksgiving?"] == 
               "Thanksgiving is happening at my home--I won't travel at all"])/len(high_income)

0.4803921568627451

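About 48% of the people in the high income bracket stay at home, compared with about 38% in the low 
income bracket. This is consistent with our initial hypothesis, although it is not a formal test: the comparison 
looks only at the share staying home, ignores age, and relies on the lower bound of each income bracket.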
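
The two proportions can also be computed for every travel category at once with `pd.crosstab` and `normalize="index"`, which divides each row by its total. This is a sketch on made-up data; the column names `bracket` and `travel` and the short category labels are hypothetical.

```python
import pandas as pd

# Hypothetical respondents: income bracket and travel answer.
toy = pd.DataFrame({
    "bracket": ["low", "low", "low", "low", "high", "high"],
    "travel":  ["home", "local", "local", "far", "home", "far"],
})

# Rows are brackets, columns are travel answers; normalize="index"
# turns the counts in each row into shares that sum to 1.
shares = pd.crosstab(toy["bracket"], toy["travel"], normalize="index")

print(shares)
```

Reading across a row then gives the full travel profile for one bracket rather than a single "stayed at home" figure.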

# Conclusion

Using the pandas package in Python we have discovered several interesting insights and patterns with respect to:

- Main dishes served at Thanksgiving.
- Income of the people taking the survey.
- Age of the people taking the survey.
- Desserts served at Thanksgiving.
- Travel behaviours of people during Thanksgiving based on income.