This is a guided project by DataQuest. I will explore data on what Americans eat during Thanksgiving.

The dataset is from [here](https://github.com/fivethirtyeight/data/tree/master/thanksgiving-2015). The data was collected via a SurveyMonkey poll of 1,058 respondents on 17 November 2015.

Note that this project is exploratory and uses only descriptive statistics. No significance tests are run, so any patterns noted below are suggestive rather than statistically established.

## 1. Read in data

In [1]:
import pandas as pd
data = pd.read_csv("thanksgiving-2015-poll-data.csv", encoding="Latin-1")
print("Number of respondents: " + str(len(data)))

# display first five rows
data[:5]

Number of respondents: 1058


|  | RespondentID | Do you celebrate Thanksgiving? | What is typically the main dish at your Thanksgiving dinner? | What is typically the main dish at your Thanksgiving dinner? - Other (please specify) | How is the main dish typically cooked? | How is the main dish typically cooked? - Other (please specify) | What kind of stuffing/dressing do you typically have? | What kind of stuffing/dressing do you typically have? - Other (please specify) | What type of cranberry saucedo you typically have? | What type of cranberry saucedo you typically have? - Other (please specify) | ... | Have you ever tried to meet up with hometown friends on Thanksgiving night? | Have you ever attended a "Friendsgiving?" | Will you shop any Black Friday sales on Thanksgiving Day? | Do you work in retail? | Will you employer make you work on Black Friday? | How would you describe where you live? | Age | What is your gender? | How much total combined money did all members of your HOUSEHOLD earn last year? | US Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4337954960 | Yes | Turkey |  | Baked |  | Bread-based |  |  |  | ... | Yes | No | No | No |  | Suburban | 18 - 29 | Male | $75,000 to $99,999 | Middle Atlantic |
| 1 | 4337951949 | Yes | Turkey |  | Baked |  | Bread-based |  | Other (please specify) | Homemade cranberry gelatin ring | ... | No | No | Yes | No |  | Rural | 18 - 29 | Female | $50,000 to $74,999 | East South Central |
| 2 | 4337935621 | Yes | Turkey |  | Roasted |  | Rice-based |  | Homemade |  | ... | Yes | Yes | Yes | No |  | Suburban | 18 - 29 | Male | $0 to $9,999 | Mountain |
| 3 | 4337933040 | Yes | Turkey |  | Baked |  | Bread-based |  | Homemade |  | ... | Yes | No | No | No |  | Urban | 30 - 44 | Male | $200,000 and up | Pacific |
| 4 | 4337931983 | Yes | Tofurkey |  | Baked |  | Bread-based |  | Canned |  | ... | Yes | No | No | No |  | Urban | 30 - 44 | Male | $100,000 to $124,999 | Pacific |


## 2. Display column names for an overview

In [2]:
data.columns.tolist()

['RespondentID',
 'Do you celebrate Thanksgiving?',
 'What is typically the main dish at your Thanksgiving dinner?',
 'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
 'How is the main dish typically cooked?',
 'How is the main dish typically cooked? - Other (please specify)',
 'What kind of stuffing/dressing do you typically have?',
 'What kind of stuffing/dressing do you typically have? - Other (please specify)',
 'What type of cranberry saucedo you typically have?',
 'What type of cranberry saucedo you typically have? - Other (please specify)',
 'Do you typically have gravy?',
 'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
 'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
 'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Cauliflower',
 

## 3. Exclude respondents who do not celebrate Thanksgiving

In [3]:
# remove individuals from data if their response to "Do you celebrate Thanksgiving?" is not a "Yes".
cInd = "Do you celebrate Thanksgiving?"
data[cInd].value_counts()
data = data[data[cInd] == "Yes"]
print("Number of respondents celebrating thanksgiving: " + str(len(data)))

# this can also be done differently (source https://stackoverflow.com/a/27360130/7194743)
#data.drop(data[data["Do you celebrate Thanksgiving?"] != "Yes"].index)
#print("Number of respondents celebrating thanksgiving: " + str(len(data)))

Number of respondents celebrating thanksgiving: 980


## 4. Find main dishes eaten for Thanksgiving

### 4.1. Overview

In [4]:
main_dish_q = "What is typically the main dish at your Thanksgiving dinner?"

data[main_dish_q].value_counts()

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64

Turkey is by far the most common main dish: 859 of the 974 respondents who answered this question (about 88%) chose it.
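As a rough illustration, the counts printed above can be turned into shares. The sketch below rebuilds the counts by hand from that output, so the variable names `main_dish_counts` and `main_dish_share` are my own; on the real data, `data[main_dish_q].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied by hand from the value_counts() output above
main_dish_counts = pd.Series({
    "Turkey": 859,
    "Other (please specify)": 35,
    "Ham/Pork": 29,
    "Tofurkey": 20,
    "Chicken": 12,
    "Roast beef": 11,
    "I don't know": 5,
    "Turducken": 3,
})

# Share of each answer among respondents who answered the question
main_dish_share = (main_dish_counts / main_dish_counts.sum()).round(3)
print(main_dish_share)
```

Note that the counts sum to 974 rather than 980, because `value_counts()` leaves out missing answers by default.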

### 4.2. Food eaten with gravy

Was turkey eaten with gravy?

In [5]:
sub_dish_q = "Do you typically have gravy?"

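def findCombination(main_dish_q, sub_dish_q, dish):
    """
    Count the answers to sub_dish_q among respondents whose answer to main_dish_q is dish.

    main_dish_q: column name of the main-dish question
    sub_dish_q: column name of the side-dish question (e.g. gravy)
    dish: main dish to filter on
    """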
    dish_eaters = data[data[main_dish_q] == dish]

    dish_eaters_withSubDish = dish_eaters[sub_dish_q]
    
    return dish_eaters_withSubDish.value_counts()

findCombination(main_dish_q, sub_dish_q, "Turkey")

Yes    814
No      45
Name: Do you typically have gravy?, dtype: int64

Yes! Turkey was served with gravy for most respondents: 814 of 859 (about 95%).

What about others on the menu?

Let's check Tofurkey and Roast beef.

#### 4.2.1. Tofurkey

In [6]:

findCombination(main_dish_q, sub_dish_q, "Tofurkey")

Yes    12
No      8
Name: Do you typically have gravy?, dtype: int64

#### 4.2.2. Roast beef

In [7]:
findCombination(main_dish_q, sub_dish_q, "Roast beef")

Yes    7
No     4
Name: Do you typically have gravy?, dtype: int64

Less often than for Turkey, but gravy was still served with Tofurkey (12 of 20) and Roast beef (7 of 11) most of the time. These two groups are small, so the shares are only rough indications.
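To compare the three dishes side by side, the counts from the outputs above can be collected in one small table. This is only a sketch: the frame `gravy_counts` and its column names are made up for illustration, and the numbers are copied by hand from the outputs.

```python
import pandas as pd

# Gravy answers per main dish, copied by hand from the outputs above
gravy_counts = pd.DataFrame(
    {"Yes": [814, 12, 7], "No": [45, 8, 4]},
    index=["Turkey", "Tofurkey", "Roast beef"],
)

# Group size and share of respondents who typically have gravy
gravy_counts["n"] = gravy_counts["Yes"] + gravy_counts["No"]
gravy_counts["share_yes"] = (gravy_counts["Yes"] / gravy_counts["n"]).round(3)
print(gravy_counts)
```

On the full data, `pd.crosstab(data[main_dish_q], data[sub_dish_q], normalize="index")` would probably be a more direct way to get the same shares for every main dish at once.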

Now let's check **desserts**.

## 5. Find main desserts eaten for Thanksgiving

In [8]:
# column names for each type of pie
app_p = "Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple"
pec_p = "Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan"
pum_p = "Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin"

# find people who did NOT select each type of pie (missing value = not selected)
apple_isnull = data[app_p].isnull()
pecan_isnull = data[pec_p].isnull()
pumpkin_isnull = data[pum_p].isnull()

# find people who did NOT select any of the three pies
ate_pies_not = apple_isnull & pecan_isnull & pumpkin_isnull
ate_pies_not.value_counts()

False    876
True     104
dtype: int64

So, 876 of the 980 respondents ate at least one of apple, pecan and pumpkin pie, and 104 ate none of the three.

Then, how many people ate all of them?

In [9]:
# find people who ate all types of pies
apple_notnull = ~apple_isnull
pecan_notnull = ~pecan_isnull
pumpkin_notnull = ~pumpkin_isnull

ate_allpies = apple_notnull & pecan_notnull & pumpkin_notnull
ate_allpies.value_counts()

False    843
True     137
dtype: int64

Well, only 137 people (about 14% of the 980 respondents) ate all three types of pie.
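Both questions (none of the pies, all of the pies) can also be answered from a single per-respondent count of selected pies. The sketch below uses a small made-up frame `toy` with shortened column names, assuming, as in the survey data, that a cell is missing when the pie was not selected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a cell holds the pie name if selected, NaN otherwise
toy = pd.DataFrame({
    "Apple":   ["Apple", np.nan, "Apple", np.nan],
    "Pecan":   ["Pecan", np.nan, np.nan, np.nan],
    "Pumpkin": ["Pumpkin", "Pumpkin", np.nan, np.nan],
})

# Number of the three pies each respondent selected
n_pies = toy.notnull().sum(axis=1)

ate_none = n_pies == 0             # selected none of the pies
ate_all = n_pies == toy.shape[1]   # selected every pie

print(n_pies.value_counts().sort_index())
```

The distribution of `n_pies` also shows how many respondents selected exactly one or exactly two pies, which the two boolean masks alone do not reveal.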

## 6. Distributions of respondents

Now, let's find out a bit about our respondents.

How about the age distribution? Are they old? Young?

### 6.1. Age

So, how is the age data stored?

In [10]:
# show age groups in the data
print(data["Age"].unique())

['18 - 29' '30 - 44' '60+' '45 - 59' nan]


Well, ages are stored categorically, so they can be only approximated. For consistency, I will use the lower bound in each group (i.e. 18, 30, 45 and 60).

In [11]:
def strToNum(string, bound="lower", delimiter=" ", outUnit=int, remove=(), ignore=None):
    """
    Extract the lower or upper bound of a range-like string and convert it to a number.
    
    string: string to process
    bound: either "lower" or "upper"; selects which end of the range to return
    delimiter: character to split string by
    outUnit: type to convert the extracted substring to (e.g. int or float)
    remove: sequence of substrings to strip from string before splitting
    ignore: return None if string equals this value
    """
    
    # handle missing values and the answer to ignore
    if pd.isnull(string) or (string == ignore):
        return None

    # remove unwanted substrings
    for char in remove:
        string = string.replace(char, "")
    
    # take the first or last delimiter-separated token
    if bound == "lower":
        return outUnit(string.split(delimiter)[0])
    elif bound == "upper":
        upper = string.split(delimiter)[-1]
        # open-ended groups such as "$200,000 and up" leave an empty last token
        return outUnit(upper) if upper else None
    
    # any other value of bound is a mistake by the caller
    raise ValueError('bound must be "lower" or "upper"')

# get descriptive statistics for age, using lower bound of each age group
int_age = data["Age"].apply(lambda x: strToNum(x, remove=["+"]))
data["int_age"] = int_age
data["int_age"].describe()

count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64

The mean of the lower bounds is around 40. The median (45) is greater than the mean, which suggests that the distribution is skewed to the left.

This may simply be an artifact of taking the lower bound of each age group. Let's check by using the upper bound this time.

In [12]:
# use upper bound of age groups to get descriptive statistics
data["Age"].apply(lambda x: strToNum(x, bound="upper", remove=["+"])).describe()[["mean", "50%"]]

mean    49.689546
50%     59.000000
Name: Age, dtype: float64

Oh, no. Picking the upper bound still results in a median (59) greater than the mean (about 50).

Does this mean the age distribution is truly skewed to the left? Unfortunately, I cannot answer that: the '60+' group has no upper bound, so it is still counted as 60 here.
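Since age is only available as four groups, one alternative is to look at the group frequencies directly instead of relying on the mean and median of assumed bounds. The sketch below uses a small made-up series `toy_age`, not the survey data; on the real data the same call would be applied to `data["Age"]`.

```python
import pandas as pd

# Age groups in their natural order
age_order = ["18 - 29", "30 - 44", "45 - 59", "60+"]

# Hypothetical answers, including one missing value
toy_age = pd.Series(["18 - 29", "60+", "45 - 59", "30 - 44", "60+", None, "45 - 59"])

# Count respondents per group (missing answers are dropped) and put the groups in order
age_counts = toy_age.value_counts().reindex(age_order, fill_value=0)
print(age_counts)
```

Reading the counts in group order shows where respondents are concentrated without having to pick a representative age for the open-ended '60+' group.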


### 6.2. Income
So, how about the income distribution? Shall we have a look?

In [13]:
# show income groups in the data
column = "How much total combined money did all members of your HOUSEHOLD earn last year?"
print(data[column].unique())

['$75,000 to $99,999' '$50,000 to $74,999' '$0 to $9,999' '$200,000 and up'
 '$100,000 to $124,999' '$25,000 to $49,999' 'Prefer not to answer'
 '$10,000 to $24,999' '$175,000 to $199,999' '$150,000 to $174,999'
 '$125,000 to $149,999' nan]


OK, it's categorical data again. So, I'll use the lower bound this time as well.

In [14]:
# get descriptive statistics for income, using lower bound of each income group
int_income = data[column].apply(lambda x: strToNum(x, outUnit=float, \
                               remove=["$", ",", "and", "up"], ignore="Prefer not to answer"))
data["int_income"] = int_income
int_income.describe()

count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%       25000.000000
50%       75000.000000
75%      100000.000000
max      200000.000000
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: float64

Wow, the standard deviation (about $59,000) is large relative to the mean (about $76,000), so incomes are widely spread.

On the other hand, the median ($75,000) and the mean are very close, which suggests a roughly symmetric distribution (though not necessarily a normal one). Let's see what happens if we use the upper bound of each income group.

In [15]:
# use upper bound of income groups to get descriptive statistics
data[column].apply(lambda x: strToNum(x, bound="upper", outUnit=float, \
                  remove=["$", ",", "and", "up"], ignore="Prefer not to answer")).describe()

count       753.000000
mean      86612.545817
std       48652.241615
min        9999.000000
25%       49999.000000
50%       74999.000000
75%      124999.000000
max      199999.000000
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: float64

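Now the mean (about $86,600) exceeds the median ($74,999), so the distribution looks right-skewed. Note that the count drops from 829 to 753, because the "$200,000 and up" group has no upper bound and is excluded. So, again, the true distribution of income remains obscure with the upper bound of the top income group unknown, and I cannot go beyond exploratory checks with this categorical data.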
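The lower count in the upper-bound output (753 instead of 829) comes from how an open-ended label is parsed. The function `upper_bound` below is a simplified stand-in for the `bound="upper"` branch of `strToNum`, written only to illustrate that step.

```python
def upper_bound(label, remove=("$", ",", "and", "up")):
    """Simplified stand-in for strToNum(..., bound="upper", outUnit=float)."""
    # strip the unwanted substrings, as strToNum does
    for token in remove:
        label = label.replace(token, "")
    # the last space-separated token is the upper bound, if there is one
    last = label.split(" ")[-1]
    return float(last) if last else None

print(upper_bound("$175,000 to $199,999"))  # 199999.0
print(upper_bound("$200,000 and up"))       # None
```

After the removals, "$200,000 and up" becomes "200000" followed by two spaces, so the last token is empty and the function returns None. Those respondents then drop out of `describe()`, which is why the maximum becomes 199999.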

## 7. Question: Are people with higher income more likely to have Thanksgiving at home?

The hypothesis made by the DataQuest team is that people earning less money would be younger and would travel to the houses of their parents, who earn more. Let's see if the data supports the idea.

First, how far do people earning less than $150,000 travel?

In [18]:
# get travel distances of people earning < $150,000
data[int_income < 150000]["How far will you travel for Thanksgiving?"].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64

Contrary to the hypothesis, this lower-income group tends not to travel much: 484 of 689 (about 70%) stay at home or in their own town.

What about the higher income group then, earning $150,000 or more?

In [19]:
# get travel distances of people earning >= $150,000
data[int_income >= 150000]["How far will you travel for Thanksgiving?"].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         66
Thanksgiving is local--it will take place in the town I live in                     34
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    25
Thanksgiving is out of town and far away--I have to drive several hours or fly      15
Name: How far will you travel for Thanksgiving?, dtype: int64

OK, the higher-income group does not seem to travel much, either: 100 of 140 (about 71%) stay at home or in their own town.

So, overall, most people appear to have Thanksgiving at home or somewhere not far, regardless of their income level.
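Because the two income groups differ a lot in size (689 versus 140 respondents), shares are easier to compare than raw counts. The sketch below rebuilds the two outputs above by hand; the frame `travel` and its shortened labels are my own naming.

```python
import pandas as pd

# Counts copied by hand from the two value_counts() outputs above
travel = pd.DataFrame(
    {"under_150k": [281, 203, 150, 55], "150k_or_more": [66, 34, 25, 15]},
    index=["home", "local", "few hours", "far away"],
)

# Share of each travel answer within each income group
travel_share = (travel / travel.sum()).round(3)
print(travel_share)
```

On these counts the higher-income group stays at home somewhat more often (about 47% versus 41%), which points in the direction of the hypothesis. With only 140 respondents in that group and no significance test, the gap may well be noise.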

## 8. Question: Are younger people more likely to have Thanksgiving with friends?

This is another question proposed by the DataQuest team.

Let's check the answers to two questions.

*"Have you ever tried to meet up with hometown friends on Thanksgiving night?"*

*'Have you ever attended a "Friendsgiving?"'*

In [35]:
# average ages of those who spent thanksgiving with friends
data.pivot_table(index="Have you ever tried to meet up with hometown friends on Thanksgiving night?",
                 columns='Have you ever attended a "Friendsgiving?"', values="int_age")

| Have you ever tried to meet up with hometown friends on Thanksgiving night? | Have you ever attended a "Friendsgiving?": No | Have you ever attended a "Friendsgiving?": Yes |
|---|---|---|
| No | 42.283702 | 37.010526 |
| Yes | 41.47541 | 33.976744 |


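The overall trend is consistent with the hypothesis: the mean age (based on lower bounds) is lowest, about 34, among those who answered "Yes" to both questions, and highest, about 42, among those who answered "No" to both.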
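A mean per cell says nothing about how many respondents stand behind it. As a sketch, the same breakdown can report the mean and the group size together. The example uses `groupby` in place of `pivot_table` and a small made-up frame `toy` with shortened, hypothetical column names.

```python
import pandas as pd

# Hypothetical toy data; the column names are shortened stand-ins for the survey questions
toy = pd.DataFrame({
    "met_friends":   ["Yes", "Yes", "No", "No", "No", "Yes"],
    "friendsgiving": ["Yes", "No", "No", "Yes", "No", "Yes"],
    "int_age":       [18, 45, 60, 30, 45, 30],
})

# Mean age and number of respondents for each combination of answers
summary = toy.groupby(["met_friends", "friendsgiving"])["int_age"].agg(["mean", "count"])
print(summary)
```

Cells with only a handful of respondents deserve less weight when comparing the means.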

## 9. Question: Are lower-earning people more likely to have Thanksgiving with friends?

Perhaps this is so, considering that students generally have more time for friends and also earn less?

In [39]:
# average income level of those who spent thanksgiving with friends
data.pivot_table(index="Have you ever tried to meet up with hometown friends on Thanksgiving night?",
                 columns='Have you ever attended a "Friendsgiving?"', values="int_income")

| Have you ever tried to meet up with hometown friends on Thanksgiving night? | Have you ever attended a "Friendsgiving?": No | Have you ever attended a "Friendsgiving?": Yes |
|---|---|---|
| No | 78914.549654 | 72894.736842 |
| Yes | 78750.0 | 66019.736842 |


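Again, the overall trend is consistent with the hypothesis: the mean income (based on lower bounds) is lowest, about $66,000, among those who answered "Yes" to both questions, and about $79,000 among those who have never attended a Friendsgiving.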