# Basic Data Analysis Review: Thanksgiving

## Importing Libraries

In [3]:
import numpy as np
import pandas as pd
import pandas_profiling

import matplotlib.pyplot as plt
import seaborn as sns


## Overview

In 2017, 277 million people celebrated Thanksgiving in the United States <sup>[1](#myfootnote1)</sup>. The population in 2017 was 325 million <sup>[2](#myfootnote1)</sup>, which implies a participation rate of roughly 85% (277 / 325 ≈ 0.85). Given this popularity, there ought to be data that can be used to measure and predict many aspects of the holiday.
In 2015, the news site FiveThirtyEight ran a survey asking participants what they serve at Thanksgiving dinner. This analysis aims to build a machine learning model that predicts where respondents live in the United States based on what they eat.


### The Data

The data is freely available from [FiveThirtyEight's GitHub repository](https://github.com/fivethirtyeight/data/blob/master/thanksgiving-2015/thanksgiving-2015-poll-data.csv). It contains the results of a SurveyMonkey poll of 1,058 respondents taken on Nov. 17, 2015.
The dataset has 65 columns and 1,058 rows.

In [7]:
# Loading in data
data = pd.read_csv("data/thanksgiving-2015-poll-data.csv", encoding="latin1")

In [8]:
data.head()

| | RespondentID | Do you celebrate Thanksgiving? | What is typically the main dish at your Thanksgiving dinner? | What is typically the main dish at your Thanksgiving dinner? - Other (please specify) | How is the main dish typically cooked? | How is the main dish typically cooked? - Other (please specify) | What kind of stuffing/dressing do you typically have? | What kind of stuffing/dressing do you typically have? - Other (please specify) | What type of cranberry saucedo you typically have? | What type of cranberry saucedo you typically have? - Other (please specify) | ... | Have you ever tried to meet up with hometown friends on Thanksgiving night? | Have you ever attended a "Friendsgiving?" | Will you shop any Black Friday sales on Thanksgiving Day? | Do you work in retail? | Will you employer make you work on Black Friday? | How would you describe where you live? | Age | What is your gender? | How much total combined money did all members of your HOUSEHOLD earn last year? | US Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4337954960 | Yes | Turkey | NaN | Baked | NaN | Bread-based | NaN | NaN | NaN | ... | Yes | No | No | No | NaN | Suburban | 18 - 29 | Male | \$75,000 to \$99,999 | Middle Atlantic |
| 1 | 4337951949 | Yes | Turkey | NaN | Baked | NaN | Bread-based | NaN | Other (please specify) | Homemade cranberry gelatin ring | ... | No | No | Yes | No | NaN | Rural | 18 - 29 | Female | \$50,000 to \$74,999 | East South Central |
| 2 | 4337935621 | Yes | Turkey | NaN | Roasted | NaN | Rice-based | NaN | Homemade | NaN | ... | Yes | Yes | Yes | No | NaN | Suburban | 18 - 29 | Male | \$0 to \$9,999 | Mountain |
| 3 | 4337933040 | Yes | Turkey | NaN | Baked | NaN | Bread-based | NaN | Homemade | NaN | ... | Yes | No | No | No | NaN | Urban | 30 - 44 | Male | \$200,000 and up | Pacific |
| 4 | 4337931983 | Yes | Tofurkey | NaN | Baked | NaN | Bread-based | NaN | Canned | NaN | ... | Yes | No | No | No | NaN | Urban | 30 - 44 | Male | \$100,000 to \$124,999 | Pacific |


Let's take a look at the column titles.

In [4]:
list(data.columns) 

['RespondentID',
 'Do you celebrate Thanksgiving?',
 'What is typically the main dish at your Thanksgiving dinner?',
 'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
 'How is the main dish typically cooked?',
 'How is the main dish typically cooked? - Other (please specify)',
 'What kind of stuffing/dressing do you typically have?',
 'What kind of stuffing/dressing do you typically have? - Other (please specify)',
 'What type of cranberry saucedo you typically have?',
 'What type of cranberry saucedo you typically have? - Other (please specify)',
 'Do you typically have gravy?',
 'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
 'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
 'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Cauliflower',
 ...]

These names are long and awkward to work with. Let's rename them to something more readable and easier to use.

In [13]:
# Rename all 65 columns positionally; the order must match the original CSV columns
data.columns = ['respondentid', 'celebrate_thanksgiving', 'main_dish', 'main_dish_other',
                'how_main_dish_cooked', 'how_main_dish_cooked_other', 'stuffing_dressing',
                'stuffing_dressing_other', 'cranberry_sauce', 'cranberry_sauce_other',
                'gravy', 'sidedish_brusselsprouts', 'sidedish_carrots', 'sidedish_cauliflower',
                'sidedish_corn', 'sidedish_cornbread', 'sidedish_fruitsalad',
                'sidedish_green_beans_or_gb_casserole', 'sidedish_mac_cheese',
                'sidedish_mashed_potatoes', 'sidedish_rolls_biscuits', 'sidedish_squash',
                'sidedish_vegetable_salad', 'sidedish_yams_sweet_potato_casserole',
                'sidedish_other', 'sidedish_other2', 'pie_apple', 'pie_buttermilk',
                'pie_cherry',  'pie_chocolate', 'pie_coconut_cream', 'pie_key_lime',
                'pie_peach', 'pie_pecan', 'pie_pumpkin', 'pie_sweet_potato', 'pie_none',
                'pie_other', 'pie_other2', 'dessert_apple_cobbler', 'dessert_blondies',
                'dessert_brownies', 'dessert_carrot_cake', 'dessert_cheesecake',
                'dessert_cookies', 'dessert_fudge', 'dessert_ice_cream', 'dessert_peach_cobbler',
                'dessert_none', 'dessert_other', 'dessert_other2', 'prayers', 'travel_distance',
                'watch_macys_parade', 'kidstable_max_age', 'hometown_friends_meetup',
                'friendsgiving',  'black_fiday_sales', 'work_retail',  'work_on_black_friday',
                'describe_where_you_live', 'age', 'gender', 'household_income', 'us_region']
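
Assigning a list to `data.columns` relies on the list having exactly one name per column, in the original order. A minimal sketch of a guarded version is shown here on a toy three-column frame rather than the survey data; the helper `rename_columns` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def rename_columns(df, new_names):
    """Return a copy of df with its columns renamed positionally.

    Hypothetical helper: raises early if the number of names does not
    match the number of columns, or if the new names are not unique.
    """
    if len(new_names) != len(df.columns):
        raise ValueError(
            f"expected {len(df.columns)} names, got {len(new_names)}"
        )
    if len(set(new_names)) != len(new_names):
        raise ValueError("new column names must be unique")
    # Build an explicit old -> new mapping so the pairing can be inspected.
    mapping = dict(zip(df.columns, new_names))
    return df.rename(columns=mapping)


# Toy frame standing in for the survey data (values are made up).
toy = pd.DataFrame({
    "RespondentID": [1, 2],
    "Do you celebrate Thanksgiving?": ["Yes", "No"],
    "US Region": ["Pacific", None],
})
renamed = rename_columns(
    toy, ["respondentid", "celebrate_thanksgiving", "us_region"]
)
print(list(renamed.columns))
```

Because the helper returns a new frame, the original column names stay available for cross-checking until the result is assigned back.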

In [14]:
data.head()

| | respondentid | celebrate_thanksgiving | main_dish | main_dish_other | how_main_dish_cooked | how_main_dish_cooked_other | stuffing_dressing | stuffing_dressing_other | cranberry_sauce | cranberry_sauce_other | ... | hometown_friends_meetup | friendsgiving | black_fiday_sales | work_retail | work_on_black_friday | describe_where_you_live | age | gender | household_income | us_region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4337954960 | Yes | Turkey | NaN | Baked | NaN | Bread-based | NaN | NaN | NaN | ... | Yes | No | No | No | NaN | Suburban | 18 - 29 | Male | \$75,000 to \$99,999 | Middle Atlantic |
| 1 | 4337951949 | Yes | Turkey | NaN | Baked | NaN | Bread-based | NaN | Other (please specify) | Homemade cranberry gelatin ring | ... | No | No | Yes | No | NaN | Rural | 18 - 29 | Female | \$50,000 to \$74,999 | East South Central |
| 2 | 4337935621 | Yes | Turkey | NaN | Roasted | NaN | Rice-based | NaN | Homemade | NaN | ... | Yes | Yes | Yes | No | NaN | Suburban | 18 - 29 | Male | \$0 to \$9,999 | Mountain |
| 3 | 4337933040 | Yes | Turkey | NaN | Baked | NaN | Bread-based | NaN | Homemade | NaN | ... | Yes | No | No | No | NaN | Urban | 30 - 44 | Male | \$200,000 and up | Pacific |
| 4 | 4337931983 | Yes | Tofurkey | NaN | Baked | NaN | Bread-based | NaN | Canned | NaN | ... | Yes | No | No | No | NaN | Urban | 30 - 44 | Male | \$100,000 to \$124,999 | Pacific |


The columns are defined below (_to do later_):

* `respondentid`:
* `celebrate_thanksgiving`:
* `main_dish`:
* `main_dish_other`:
* `how_main_dish_cooked`:
* `how_main_dish_cooked_other`:
* `stuffing_dressing`: 
* `stuffing_dressing_other`:
* `cranberry_sauce`:
* `cranberry_sauce_other`: 
* `gravy`: 
* `sidedish_brusselsprouts`: 
* `sidedish_carrots`:
* `sidedish_cauliflower`:
* `sidedish_corn`:
* `sidedish_cornbread`:
* `sidedish_fruitsalad`:
* `sidedish_green_beans_or_gb_casserole`: 
* `sidedish_mac_cheese`:
* `sidedish_mashed_potatoes`:
* `sidedish_rolls_biscuits`: 
* `sidedish_squash`: 
* `sidedish_vegetable_salad`:
* `sidedish_yams_sweet_potato_casserole`:
* `sidedish_other`:
* `sidedish_other2`:
* `pie_apple`:
* `pie_buttermilk`:
* `pie_cherry`:
* `pie_chocolate`:
* `pie_coconut_cream`:
* `pie_key_lime`:
* `pie_peach`:
* `pie_pecan`:
* `pie_pumpkin`:
* `pie_sweet_potato`:
* `pie_none`:
* `pie_other`:
* `pie_other2`:
* `dessert_apple_cobbler`:
* `dessert_blondies`:
* `dessert_brownies`:
* `dessert_carrot_cake`:
* `dessert_cheesecake`:
* `dessert_cookies`:
* `dessert_fudge`:
* `dessert_ice_cream`:
* `dessert_peach_cobbler`: 
* `dessert_none`:
* `dessert_other`:
* `dessert_other2`:
* `prayers`:
* `travel_distance`:
* `watch_macys_parade`:
* `kidstable_max_age`: 
* `hometown_friends_meetup`:
* `friendsgiving`:
* `black_fiday_sales`:
* `work_retail`:
* `work_on_black_friday`:
* `describe_where_you_live`:
* `age`:
* `gender`:
* `household_income`:
* `us_region`: 

## Feature Exploration

In [15]:
data.describe(include="all")

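| | respondentid | celebrate_thanksgiving | main_dish | main_dish_other | how_main_dish_cooked | how_main_dish_cooked_other | stuffing_dressing | stuffing_dressing_other | cranberry_sauce | cranberry_sauce_other | ... | hometown_friends_meetup | friendsgiving | black_fiday_sales | work_retail | work_on_black_friday | describe_where_you_live | age | gender | household_income | us_region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1058.0 | 1058 | 974 | 35 | 974 | 51 | 974 | 36 | 974 | 25 | ... | 951 | 951 | 951 | 951 | 70 | 948 | 1025 | 1025 | 1025 | 999 |
| unique | NaN | 2 | 8 | 32 | 5 | 34 | 4 | 29 | 4 | 24 | ... | 2 | 2 | 2 | 2 | 3 | 3 | 4 | 2 | 11 | 9 |
| top | NaN | Yes | Turkey | Turkey and Ham | Baked | Smoked | Bread-based | cornbread | Canned | Both Canned and Homemade | ... | No | No | No | No | Yes | Suburban | 45 - 59 | Female | \$25,000 to \$49,999 | South Atlantic |
| freq | NaN | 980 | 859 | 2 | 481 | 7 | 836 | 6 | 502 | 2 | ... | 594 | 683 | 727 | 881 | 43 | 496 | 286 | 544 | 180 | 214 |
| mean | 4336731000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| std | 493783.4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | 4335895000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | 4336339000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | 4336797000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | 4337012000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |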
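
The `count` row above is the number of non-missing values, so the number of missing values in a column is the row count minus `count`; for `age` that is 1058 - 1025 = 33. A small sketch that tabulates this directly is shown below on a toy frame. The helper name `missing_summary` and the toy values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, largest first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy frame standing in for the survey data (values are made up).
toy = pd.DataFrame({
    "age": ["18 - 29", None, "30 - 44", "45 - 59"],
    "gender": ["Male", "Female", None, None],
    "us_region": ["Pacific", "Mountain", "Pacific", "Mountain"],
})
summary = missing_summary(toy)
print(summary)
```

Applied to `data`, the same function would list every column with its missing count, which makes repeated counts across features easy to spot.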


Next, I use `pandas_profiling` to take a closer look at each feature.

In [18]:
data.profile_report(style={'full_width':True})

  return np.sqrt(phi2corr / min((kcorr - 1.0), (rcorr - 1.0)))
(using `df.profile_report(correlations={"cramers": False}`)
If this is problematic for your use case, please report this as an issue:
https://github.com/pandas-profiling/pandas-profiling/issues
(include the error message: 'The internally computed table of expected frequencies has a zero element at (0, 39).')
  correlation_name=correlation_name, error=error




###  Feature observations 

* There are a lot of missing values that will need to be handled properly.
    * Some are in the boolean dessert columns, which should not cause a problem.
    * Some `NA` values can be converted into an additional category.
* The features whose missing values concern me are:
    * `black_fiday_sales` - this is a boolean, so I am unsure whether we should drop the 107 rows, convert it to a categorical variable, or drop the feature.
    * `age` - 33 missing; add a category or drop?
    * `cranberry_sauce` and `stuffing_dressing` - can the NAs here be interpreted as None?
    * `cranberry_sauce_other` - this can be folded into `cranberry_sauce` as a new category, `both`.
    * `describe_where_you_live` - can the missing values become a new category?
    * `dessert_other` - can add a new feature as a generic `pie`.
    * Still need to explore the 90 values in `dessert_other2` and the 574 in `household_income`.
    * `friendsgiving` - same question as `black_fiday_sales`: drop the 107 rows, convert it to a categorical variable, or drop the feature?
    * `gender` - convert to a categorical instead of a boolean, or drop the 33 rows?
    * `gravy`, `hometown_friends_meetup` - convert missing to false?
    * `how_main_dish_cooked` - this may need to become baked, roasted, other, and not_specified.
    * `main_dish` - maybe make it a boolean?
    * `prayers` - boolean, or drop the 99 rows?

* 59 respondents are missing `us_region` (1,058 rows, of which 999 are non-missing in the summary table) and need to be dropped immediately, since the region is what we want to predict.
* 107 keeps coming up as the number of missing values across features. I want to check whether it is the same 107 people; if so, it could be a good idea to drop them, especially if they are also the ones missing `us_region`.
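
One way to check whether the recurring missing values belong to the same respondents is to compare the row-wise missing masks of the affected columns. The sketch below does this on a small made-up frame. The column names follow the renamed columns, but the data and the helper `same_rows_missing` are illustrative assumptions.

```python
import pandas as pd


def same_rows_missing(df, columns):
    """True if every listed column is missing in exactly the same rows."""
    masks = df[columns].isna()
    first = masks[columns[0]]
    return all((masks[col] == first).all() for col in columns[1:])


# Made-up data standing in for the survey responses.
toy = pd.DataFrame({
    "friendsgiving": ["Yes", None, "No", None, "No"],
    "black_fiday_sales": ["No", None, "Yes", None, "No"],
    "prayers": ["Yes", "No", None, None, "Yes"],
    "us_region": ["Pacific", None, "Mountain", "Pacific", None],
})

cols = ["friendsgiving", "black_fiday_sales"]
print(same_rows_missing(toy, cols))

# How many of the rows missing both answers also lack a region?
missing_both = toy[cols].isna().all(axis=1)
overlap = int((missing_both & toy["us_region"].isna()).sum())
print(overlap)

# Rows without a target label cannot be used for training, so drop them.
labelled = toy.dropna(subset=["us_region"])
print(len(labelled))
```

If the masks match on the real data, the affected rows can be treated as a single group of respondents when deciding whether to drop them or encode the gap as its own category.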

## Get feedback before proceeding

## References 

<a name="myfootnote1">1</a>: https://www.finder.com/american-thanksgiving-turkey-spend   
<a name="myfootnote1">2</a>: https://www.census.gov/popclock/