# Project: Analyzing Thanksgiving Dinner

## 1: Introducing Thanksgiving Dinner Data
In this project, we will be analyzing data on Thanksgiving dinner in the US. The dataset came from [FiveThirtyEight](https://www.fivethirtyeight.com/), and can be found [here](https://github.com/fivethirtyeight/data/tree/master/thanksgiving-2015).

The dataset is stored in the `thanksgiving.csv` file. It contains 1058 responses to an online survey about what Americans eat for Thanksgiving dinner. Each survey respondent was asked questions about what they typically eat for Thanksgiving, along with some demographic questions, like their gender, income, and location. This dataset will allow us to discover regional and income-based patterns in what Americans eat for Thanksgiving dinner.

The dataset has 65 columns, and 1058 rows. Most of the column names are questions, and most of the column values are string responses to the questions. Most of the columns are categorical, as a survey respondent had to select one of a few options. For example, one of the first column names is `What is typically the main dish at your Thanksgiving dinner?`. The potential responses are:
* `Turkey`
* `Other (please specify)`
* `Ham/Pork`
* `Tofurkey`
* `Chicken`
* `Roast beef`
* `I don't know`
* `Turducken`

Most of the columns follow the same question/response format as the above. There are also quite a few `NaN` values in the columns, which occurred when a survey respondent didn't fill out a question because they didn't want to, or it didn't apply to them.

Here are descriptions of some of the most important columns:
* `RespondentID` -- a unique ID of the respondent to the survey.
* `Do you celebrate Thanksgiving?` -- a Yes/No response to the question.
* `How would you describe where you live?` -- responses are Suburban, Urban, and Rural.
* `Age` -- responses are one of several categories, such as 18-29, and 30-44.
* `How much total combined money did all members of your HOUSEHOLD earn last year?` -- one of several categories, such as \$75,000 to \$99,999.

In this project, we will explore the data, and try to find interesting patterns. Our first step is to read in and display the data.

* Import the `pandas` package. Use the `pandas.read_csv()` function to read the `thanksgiving.csv` file in. Specify the keyword argument `encoding="Latin-1"`, because the CSV file doesn't use the default encoding that pandas expects (UTF-8). Assign the result to the variable `data`.
* Display the first few rows of `data` to see what the columns and rows look like.
* Display all of the column names to get a sense of what the data consists of.
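
Why the `encoding` argument matters can be seen on a tiny example. The sketch below does not touch the real file: it builds a made-up two-row CSV in memory (the `Main dish` column and its values are hypothetical), stores it as Latin-1 bytes, and reads it back with the matching encoding.

```python
import io

import pandas

# Hypothetical stand-in for thanksgiving.csv: two rows, one containing a
# non-ASCII character, stored as Latin-1 bytes.
raw_bytes = "RespondentID,Main dish\n1,Turkey\n2,Pâté\n".encode("latin-1")

# Reading with the matching encoding decodes the text correctly.
sample = pandas.read_csv(io.BytesIO(raw_bytes), encoding="Latin-1")

print(sample.head())          # first rows
print(list(sample.columns))   # all column names
print(sample.shape)           # (rows, columns)
```

The same three inspection calls (`head()`, `columns`, `shape`) apply unchanged to the full survey data.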

In [3]:
import pandas

# Read the survey responses; the file is Latin-1 encoded
data = pandas.read_csv("data/thanksgiving.csv", encoding="Latin-1")
print(data.head(5))

   RespondentID Do you celebrate Thanksgiving?  \
0    4337954960                            Yes   
1    4337951949                            Yes   
2    4337935621                            Yes   
3    4337933040                            Yes   
4    4337931983                            Yes   

  What is typically the main dish at your Thanksgiving dinner?  \
0                                             Turkey             
1                                             Turkey             
2                                             Turkey             
3                                             Turkey             
4                                           Tofurkey             

  What is typically the main dish at your Thanksgiving dinner? - Other (please specify)  \
0                                                NaN                                      
1                                                NaN                                      
2                            

In [6]:
print(data.columns)

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

In [4]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1058 entries, 0 to 1057
Data columns (total 65 columns):
RespondentID                                                                                                                                    1058 non-null int64
Do you celebrate Thanksgiving?                                                                                                                  1058 non-null object
What is typically the main dish at your Thanksgiving dinner?                                                                                    974 non-null object
What is typically the main dish at your Thanksgiving dinner? - Other (please specify)                                                           35 non-null object
How is the main dish typically cooked?                                                                                                          974 non-null object
How is the main dish typically cooked? - Other (please specify)          

## 2: Filtering Out Rows From A DataFrame
Because we want to understand what people ate for Thanksgiving, we'll remove any responses from people who don't celebrate it. The column `Do you celebrate Thanksgiving?` contains this information. We only want to keep data for people who answered `Yes` to this question.

* Display counts of how many times each category occurs in the `Do you celebrate Thanksgiving?` column.
* Filter out any rows in data where the response to `Do you celebrate Thanksgiving?` is not `Yes`. At the end, all of the values in the `Do you celebrate Thanksgiving?` column of data should be `Yes`.

In [8]:
data["Do you celebrate Thanksgiving?"].value_counts()

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64

In [13]:
celebrate_filter = data["Do you celebrate Thanksgiving?"] == "Yes"
data = data[celebrate_filter]
data["Do you celebrate Thanksgiving?"].value_counts()

Yes    980
Name: Do you celebrate Thanksgiving?, dtype: int64
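
One detail worth knowing: a DataFrame produced by boolean filtering can trigger a `SettingWithCopyWarning` in some pandas versions when new columns are assigned to it later (as we do in the following sections). A minimal sketch on made-up rows, assuming we want an independent filtered copy; `toy`, `toy_yes`, and `flag` are hypothetical names:

```python
import pandas

# Made-up responses; the column name matches the survey.
toy = pandas.DataFrame({
    "RespondentID": [1, 2, 3, 4],
    "Do you celebrate Thanksgiving?": ["Yes", "No", "Yes", "Yes"],
})

# Boolean mask: True for the rows we want to keep.
celebrates = toy["Do you celebrate Thanksgiving?"] == "Yes"

# .copy() makes the filtered result an independent DataFrame, so adding
# columns to it later does not affect (or warn about) the original.
toy_yes = toy[celebrates].copy()
toy_yes["flag"] = 1

print(toy_yes["Do you celebrate Thanksgiving?"].value_counts())
```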

## 3: Using Value_counts To Explore Main Dishes
Let's explore what main dishes people tend to eat during Thanksgiving dinner.

* Display counts of how many times each category occurs in the `What is typically the main dish at your Thanksgiving dinner?` column.
* Display the `Do you typically have gravy?` column for any rows from data where the `What is typically the main dish at your Thanksgiving dinner?` column equals `Tofurkey`.

In [14]:
print(data["What is typically the main dish at your Thanksgiving dinner?"].value_counts())

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64


In [15]:
filter_Tofurkey = data["What is typically the main dish at your Thanksgiving dinner?"] == "Tofurkey"
data_Tofurkey = data[filter_Tofurkey]
data_gravy = data_Tofurkey["Do you typically have gravy?"]
print(data_gravy)

4      Yes
33     Yes
69      No
72      No
77     Yes
145    Yes
175    Yes
218     No
243    Yes
275     No
393    Yes
399    Yes
571    Yes
594    Yes
628     No
774     No
820     No
837    Yes
860     No
953    Yes
Name: Do you typically have gravy?, dtype: object
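
Printing all 20 rows works, but a count per answer is easier to read (in the output above, 12 respondents answered `Yes` and 8 answered `No`). The sketch below shows one way to filter the rows, select the column, and summarize in a single chain using `.loc`. The rows are made up; only the column names come from the survey.

```python
import pandas

main_col = "What is typically the main dish at your Thanksgiving dinner?"
gravy_col = "Do you typically have gravy?"

# Made-up responses using the survey's column names.
toy = pandas.DataFrame({
    main_col: ["Turkey", "Tofurkey", "Tofurkey", "Ham/Pork"],
    gravy_col: ["Yes", "No", "Yes", "Yes"],
})

# .loc[row_mask, column] filters rows and selects the column in one step.
tofurkey_gravy = toy.loc[toy[main_col] == "Tofurkey", gravy_col]

print(tofurkey_gravy.value_counts())
```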


## 4: Figuring Out What Pies People Eat
Now that we've looked into the main dishes, let's explore the dessert dishes. Specifically, we'll look at how many people eat Apple, Pecan, or Pumpkin pie during Thanksgiving dinner. This data is encoded in the following three columns:
* `Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple`
* `Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin`
* `Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan`

In all three columns, the value is either the name of the pie if the person eats it for Thanksgiving dinner, or null otherwise.

We can find out how many people eat at least one of these three pies for Thanksgiving dinner by counting the respondents for whom all three columns are null; everyone else selected at least one pie.

* Generate a Boolean Series indicating where the `Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple` column is null. Assign to the `apple_isnull` variable.
* Generate a Boolean Series indicating where the `Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin` column is null. Assign to the `pumpkin_isnull` variable.
* Generate a Boolean Series indicating where the `Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan` column is null. Assign to the `pecan_isnull` variable.
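* Join all three Series using the `&` operator, and assign the result to `ate_pies`. Note that, despite its name, `ate_pies` is `True` for respondents who selected none of the three pies.
* Display the unique values and how many times each occurs in the `ate_pies` Series.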

In [17]:
apple_isnull = pandas.isnull(
    data["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple"])
pumpkin_isnull = pandas.isnull(
    data["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin"])
pecan_isnull = pandas.isnull(
    data["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan"])
# True where the respondent selected none of the three pies
ate_pies = apple_isnull & pumpkin_isnull & pecan_isnull
print(ate_pies.head(5))
# Display the unique values and how many times each occurs in ate_pies
print(ate_pies.value_counts())

0    False
1    False
2    False
3    False
4    False
dtype: bool
False    876
True     104
dtype: int64
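
Because `ate_pies` is `True` only where all three columns are null, the `False` count above (876) is the number of respondents who eat at least one of the three pies, and the `True` count (104) is the number who eat none. A minimal sketch of making that explicit by inverting the mask, using made-up rows; `no_pies` and `any_pie` are hypothetical names chosen for clarity:

```python
import pandas

prefix = ("Which type of pie is typically served at your Thanksgiving "
          "dinner? Please select all that apply. - ")
pie_cols = [prefix + name for name in ["Apple", "Pumpkin", "Pecan"]]

# Made-up responses: the pie name if selected, None otherwise.
toy = pandas.DataFrame({
    pie_cols[0]: ["Apple", None, None, "Apple"],
    pie_cols[1]: ["Pumpkin", None, "Pumpkin", None],
    pie_cols[2]: [None, None, None, "Pecan"],
})

# True where all three columns are null (same meaning as ate_pies above).
no_pies = toy[pie_cols].isnull().all(axis=1)

# Inverting the mask gives respondents who chose at least one pie.
any_pie = ~no_pies

print(any_pie.value_counts())
```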


## 5: Converting Age To Numeric
Let's analyze the `Age` column in more depth. In order to analyze the `Age` column, we'll first need to convert it to numeric values. This will make it simple to figure out things like the average age of survey respondents. The `Age` column contains values that fall into one of a few categories:
* `18 - 29`
* `30 - 44`
* `45 - 59`
* `60+`
* `null`

Because each response is an age range rather than an exact age, we can't recover an exact integer value. Instead, we'll extract the first (lowest) age in each string.

We can do this by splitting each value on the space character (` `), then taking the first item in the resulting list. We'll also have to replace the "+" character to account for "60+", which follows a different format than the rest.

* Write a function to convert a single string to an appropriate integer value. This will allow us to convert the values in the `Age` column to integers.
  * Use the `pandas.isnull()` function to check if the value is null. If it is, return `None`.
  * Split the string on the space character (` `), and extract the first item of the resulting list.
  * Replace the `+` character in the result with an empty string to remove it.
  * Use `int()` to convert the result to an integer.
  * Return the result.
* Use the `pandas.Series.apply()` method to apply the function to each value in the `Age` column of `data`.
  * Assign the result to the `int_age` column of `data`.
* Call the `pandas.Series.describe()` method on the `int_age` column of `data`, and display the result.
* In a separate markdown cell, write up your findings.
  * Is there anything that we should be aware of about the results or our methodology?
  * Is this a true depiction of the ages of survey participants?

In [23]:
print(len(data["Age"]))
data["Age"].value_counts()

980


45 - 59    269
60+        258
30 - 44    235
18 - 29    185
Name: Age, dtype: int64

In [26]:
def str_to_int(value):
    # Null responses stay null
    if pandas.isnull(value):
        return None
    # Keep the first token ("18" from "18 - 29"), then drop the "+" in "60+"
    first = str(value).split(" ")[0].replace("+", "")
    return int(first)

data["int_age"] = data["Age"].apply(str_to_int)
data["int_age"].describe()




count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%             NaN
50%             NaN
75%             NaN
max       60.000000
Name: int_age, dtype: float64
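
Two things to keep in mind when reading this output. First, every value is the lower bound of an age bracket, so the mean understates the respondents' actual ages. Second, the `NaN` quartiles are most likely a side effect of the null values in `int_age` under the pandas version used here; recent pandas versions skip nulls when computing them. The sketch below shows an alternative, vectorized way to do the same conversion with `Series.str.extract()` on a made-up sample; it is not the method the exercise asks for.

```python
import pandas

# Made-up sample covering every category in the Age column, plus a null.
ages = pandas.Series(["18 - 29", "30 - 44", "45 - 59", "60+", None])

# Capture the first run of digits in each string; nulls stay null.
first_number = ages.str.extract(r"(\d+)", expand=False)

# Floats rather than ints, because a plain integer column cannot hold NaN.
int_age = first_number.astype(float)

print(int_age.describe())
```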

## 6: Converting Income To Numeric
The `How much total combined money did all members of your HOUSEHOLD earn last year?` column is very similar to the `Age` column. It contains categories, but can be converted to numerical values. Here are the unique values in the column:

* `Prefer not to answer`
* `$0 to $9,999`
* `$10,000 to $24,999`
* `$25,000 to $49,999`
* `$50,000 to $74,999`
* `$75,000 to $99,999`
* `$100,000 to $124,999`
* `$125,000 to $149,999`
* `$150,000 to $174,999`
* `$175,000 to $199,999`
* `$200,000 and up`
* `null`

We can convert these values to numeric by again splitting on the space character ( ). We'll then have to account for the string `Prefer`. Finally, we'll be able to replace the dollar sign character "$" and the comma "," and return the result.

* Write a function to convert a single string to an appropriate integer income value.
  * Use the `isnull()` function to check if the value is null. If it is, return `None`.
  * Split the string on the space character (` `), and extract the first item of the resulting list.
  * If the result equals `Prefer`, return `None`.
  * Replace the "$" and "," characters in the result with empty strings to remove them.
  * Use `int()` to convert the result to an integer.
  * Return the result.
* Use the `pandas.Series.apply()` method to apply the function to each value in the `How much total combined money did all members of your HOUSEHOLD earn last year?` column of data.
  * Assign the result to the `int_income` column of data.
* Call the `pandas.Series.describe()` method on the `int_income` column of data, and display the result.
* In a separate markdown cell, write up your findings.
  * Is there anything that we should be aware of about the results or our methodology?
  * Is this a true depiction of the incomes of survey participants?

In [30]:
print(data["How much total combined money did all members of your HOUSEHOLD earn last year?"].value_counts())

$25,000 to $49,999      166
$75,000 to $99,999      127
$50,000 to $74,999      127
Prefer not to answer    118
$100,000 to $124,999    109
$200,000 and up          76
$10,000 to $24,999       60
$0 to $9,999             52
$125,000 to $149,999     48
$150,000 to $174,999     38
$175,000 to $199,999     26
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: int64


In [31]:
def str_to_int_income(value):
    # Null responses stay null
    if pandas.isnull(value):
        return None
    # Keep the first token, e.g. "$25,000" from "$25,000 to $49,999"
    first = value.split(" ")[0]
    # "Prefer not to answer" carries no income information
    if first == "Prefer":
        return None
    # Strip the dollar sign and thousands separators before converting
    return int(first.replace("$", "").replace(",", ""))

data["int_income"] = (
    data["How much total combined money did all members of your HOUSEHOLD earn last year?"].apply(str_to_int_income))

data["int_income"].describe()



count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%                NaN
50%                NaN
75%                NaN
max      200000.000000
Name: int_income, dtype: float64
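
As with age, each converted value is the lower bound of an income bracket, so the mean understates actual incomes, and `$200,000 and up` is capped at 200000. A quick way to check that the conversion did what we intended is to list each distinct category next to its converted value. The sketch below does this on a made-up sample; `lower_bound` is a hypothetical helper that follows the same steps as the function above.

```python
import pandas

income_col = ("How much total combined money did all members of your "
              "HOUSEHOLD earn last year?")

def lower_bound(value):
    """Return the first dollar amount in a bracket string, or None."""
    if pandas.isnull(value):
        return None
    first = value.split(" ")[0]
    if first == "Prefer":
        return None
    return int(first.replace("$", "").replace(",", ""))

# Made-up sample using category strings from the survey.
toy = pandas.DataFrame({income_col: [
    "$0 to $9,999", "$25,000 to $49,999", "$200,000 and up",
    "Prefer not to answer", None, "$25,000 to $49,999",
]})
toy["int_income"] = toy[income_col].apply(lower_bound)

# One row per distinct (category, converted value) pair.
mapping = toy.drop_duplicates().reset_index(drop=True)
print(mapping)
```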

## 7: Correlating Travel Distance And Income
We can now see how the distance someone travels for Thanksgiving dinner relates to their income level. One plausible hypothesis is that people earning less money tend to be younger and would travel to their parents' houses for Thanksgiving, while people earning more are more likely to host Thanksgiving at their own house.

We can test this by filtering data based on `int_income` and seeing what the values in the `How far will you travel for Thanksgiving?` column are.

* See how far people earning under 150000 will travel.
  * Filter data, and only select rows where `int_income` is less than 150000.
  * Use indexing to select the `How far will you travel for Thanksgiving?` column.
  * Use the `value_counts()` method to count up how many times each value occurs in the column.
  * Display the results.
* See how far people earning over 150000 will travel.
  * Filter data, and only select rows where `int_income` is greater than 150000.
  * Use indexing to select the `How far will you travel for Thanksgiving?` column.
  * Use the `value_counts()` method to count up how many times each value occurs in the column.
  * Display the results.
* Write up your findings in a markdown cell.

In [44]:
filter_income_lt_150000 = data["int_income"] < 150000
data1 = data[filter_income_lt_150000]
travel1 = data1["How far will you travel for Thanksgiving?"]
print(len(travel1))
print()
print(travel1.value_counts())

travel1_per = (travel1.value_counts())/len(travel1)
print(travel1_per)

689

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64
Thanksgiving is happening at my home--I won't travel at all                         0.407837
Thanksgiving is local--it will take place in the town I live in                     0.294630
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    0.217707
Thanksgiving is out of town and far away--I have to drive several hours or fly      0.079826
Name: How far will you travel for Thanksgiving?, dtype: float64


In [45]:
filter_income_gt_150000 = data["int_income"] > 150000
data2 = data[filter_income_gt_150000]
travel2 = data2["How far will you travel for Thanksgiving?"]
print(len(travel2))
print()
print(travel2.value_counts())

travel2_per = (travel2.value_counts())/len(travel2)
print(travel2_per)

102

Thanksgiving is happening at my home--I won't travel at all                         49
Thanksgiving is local--it will take place in the town I live in                     25
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    16
Thanksgiving is out of town and far away--I have to drive several hours or fly      12
Name: How far will you travel for Thanksgiving?, dtype: int64
Thanksgiving is happening at my home--I won't travel at all                         0.480392
Thanksgiving is local--it will take place in the town I live in                     0.245098
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    0.156863
Thanksgiving is out of town and far away--I have to drive several hours or fly      0.117647
Name: How far will you travel for Thanksgiving?, dtype: float64



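People with higher income levels (brackets starting above \$150,000) are more likely to host Thanksgiving at their home (48%) than people with lower income levels (41%).
They are also more likely to travel far, driving several hours or flying (12% versus 8%).
Two caveats apply: the higher-income group has only 102 respondents, and the 38 respondents in the `$150,000 to $174,999` bracket convert to exactly 150000, so the strict `<` and `>` filters leave them out of both groups.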
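
The two percentage tables can also be produced side by side with `pandas.crosstab()`. The sketch below uses made-up rows, with `home`, `local`, and `far` standing in for the full survey answers. It also uses `>=` so that the 150000 bracket lands in the upper group; that is an assumption made for illustration, not what the exercise specifies.

```python
import pandas

travel_col = "How far will you travel for Thanksgiving?"

# Made-up rows; "home", "local" and "far" stand in for the full answers.
toy = pandas.DataFrame({
    "int_income": [0, 25000, 75000, 100000, 175000, 200000, 200000, 150000],
    travel_col: ["home", "local", "far", "home",
                 "home", "home", "home", "local"],
})

# Label each row by income group. Using >= keeps the 150000 bracket in the
# upper group instead of dropping it (an illustrative choice).
group = (toy["int_income"] >= 150000).map(
    {True: "150000 and up", False: "under 150000"}).rename("income group")

# Row-normalized cross-tabulation: each row sums to 1.
shares = pandas.crosstab(group, toy[travel_col], normalize="index")
print(shares)
```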

---

## 8: Linking Friendship And Age
Two columns directly pertain to friendship: `Have you ever tried to meet up with hometown friends on Thanksgiving night?` and `Have you ever attended a "Friendsgiving?"`. In the US, a "Friendsgiving" is when, instead of traveling home for the holiday, you celebrate it with friends who live in your area. Both activities seem more likely among younger people. Let's see if this hypothesis holds up.

In order to see the average ages of people who have done both, we can generate a pivot table with the `pandas.DataFrame.pivot_table()` method. By calling this method on data, and passing in the right keyword arguments, we can generate a table showing the average ages of people who answered `Yes` to both questions, answered `Yes` to one question, and so on.

* Generate a pivot table showing the average age of respondents for each category of `Have you ever tried to meet up with hometown friends on Thanksgiving night?` and `Have you ever attended a "Friendsgiving?"`.
  * Call the `pivot_table()` method on `data`.
  * Pass in `"Have you ever tried to meet up with hometown friends on Thanksgiving night?"` to the `index` keyword argument.
  * Pass in `'Have you ever attended a "Friendsgiving?"'` to the `columns` keyword argument.
  * Pass in `"int_age"` to the `values` keyword argument.
  * Display the results.
* Generate a pivot table showing the average income of respondents for each category of `Have you ever tried to meet up with hometown friends on Thanksgiving night?` and `Have you ever attended a "Friendsgiving?"`.
* Write up a markdown cell with your findings.

In [46]:
print(data.pivot_table(index="Have you ever tried to meet up with hometown friends on Thanksgiving night?",
                       columns='Have you ever attended a "Friendsgiving?"',values="int_age"))


Have you ever attended a "Friendsgiving?"                  No        Yes
Have you ever tried to meet up with hometown fr...                      
No                                                  42.283702  37.010526
Yes                                                 41.475410  33.976744


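Younger people seem to be more likely to attend a Friendsgiving than older people. Respondents who answered `Yes` to both questions have the lowest average age (about 34), and those who answered `No` to both have the highest (about 42). The gap is larger for Friendsgiving attendance (about 5 to 8 years) than for hometown meetups (about 1 to 3 years). Because `int_age` holds the lower bound of each age bracket, these averages understate actual ages.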
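
The exercise also asks for a pivot table of average income, which isn't run in this notebook. A minimal sketch of that call on made-up rows is shown below; the column names come from the survey, while the `toy` data and `avg_income` name are hypothetical. On the real data, the same call would be made on `data`.

```python
import pandas

meetup_col = ("Have you ever tried to meet up with hometown friends "
              "on Thanksgiving night?")
friends_col = 'Have you ever attended a "Friendsgiving?"'

# Made-up rows; int_income mimics the lower-bound values from section 6.
toy = pandas.DataFrame({
    meetup_col: ["Yes", "Yes", "No", "No", "No", "Yes"],
    friends_col: ["Yes", "No", "Yes", "No", "No", "Yes"],
    "int_income": [25000, 75000, 50000, 100000, 200000, 75000],
})

# pivot_table averages the values column by default (aggfunc="mean").
avg_income = toy.pivot_table(index=meetup_col, columns=friends_col,
                             values="int_income")
print(avg_income)
```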

---

## 9: Next Steps
That's it for the guided steps! We recommend exploring the data more on your own.

Here are some potential next steps:

* Figure out the most common dessert people eat.
* Figure out the most common complete meal people eat.
* Identify how many people work on Thanksgiving.
* Find regional patterns in the dinner menus.
* Find age, gender, and income based patterns in dinner menus.
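
As a possible starting point for the first item, the sketch below counts selections across a group of "select all that apply" columns that share a prefix. It uses the pie columns from section 4 with made-up rows; applying it to desserts assumes the dessert columns follow the same layout, which should be checked against `data.columns` first.

```python
import pandas

prefix = ("Which type of pie is typically served at your Thanksgiving "
          "dinner? Please select all that apply. - ")

# Made-up responses: the pie name if selected, None otherwise.
toy = pandas.DataFrame({
    "RespondentID": [1, 2, 3, 4],
    prefix + "Apple": ["Apple", None, "Apple", None],
    prefix + "Pumpkin": ["Pumpkin", "Pumpkin", "Pumpkin", None],
    prefix + "Pecan": [None, None, "Pecan", None],
})

# Select every column in the group by its shared prefix.
pie_cols = [col for col in toy.columns if col.startswith(prefix)]

# count() ignores nulls, so this is the number of respondents per pie.
pie_counts = toy[pie_cols].count().sort_values(ascending=False)

# Shorten the labels to just the pie name for readability.
pie_counts.index = [col[len(prefix):] for col in pie_counts.index]
print(pie_counts)
```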