# Association between Protein Level and Calorie Count in Recipes

**Name(s)**: Hillary Chang, Paige Pagaduan

**Website Link**: https://hillarychang.github.io/Association-between-Protein-and-Calorie-Count/

## Code

In [1]:
import pandas as pd
import numpy as np
import os

import plotly.express as px
pd.options.plotting.backend = 'plotly'

### Cleaning and EDA

In [2]:
# Merge Dataframes
raw_interactions = pd.read_csv('data/RAW_interactions.csv')
raw_recipes = pd.read_csv('data/RAW_recipes.csv')
merged_data = raw_recipes.merge(raw_interactions, left_on='id', right_on='recipe_id', how='left')
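# Treat a rating of 0 as missing (no rating given) rather than as a real score
merged_data['rating'] = merged_data['rating'].replace(0, np.nan)
# Average rating per recipe id. Note: the chained assignment also writes this
# id-indexed Series into merged_data['average_rating'], where it does not align
# with the row index. That copy is the empty 'average_rating_x' column after the
# merge below; 'average_rating_y' holds the usable per-recipe averages.
avg = merged_data['average_rating'] = merged_data.groupby('id')['rating'].mean()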

df = avg.to_frame()
df = df.rename(columns={'rating': 'average_rating'})

df = merged_data.merge(df, on='id', how='left')
df

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients,user_id,recipe_id,date,rating,review,average_rating_x,average_rating_y
0,1 brownies in the world best ever,333281,40,985201,2008-10-27,"['60-minutes-or-less', 'time-to-make', 'course...","[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",10,['heat the oven to 350f and arrange the rack i...,"these are the most; chocolatey, moist, rich, d...","['bittersweet chocolate', 'unsalted butter', '...",9,3.865850e+05,333281.0,2008-11-19,4.0,"These were pretty good, but took forever to ba...",,4.0
1,1 in canada chocolate chip cookies,453467,45,1848091,2011-04-11,"['60-minutes-or-less', 'time-to-make', 'cuisin...","[595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0]",12,"['pre-heat oven the 350 degrees f', 'in a mixi...",this is the recipe that we use at my school ca...,"['white sugar', 'brown sugar', 'salt', 'margar...",11,4.246800e+05,453467.0,2012-01-26,5.0,Originally I was gonna cut the recipe in half ...,,5.0
2,412 broccoli casserole,306168,40,50969,2008-05-30,"['60-minutes-or-less', 'time-to-make', 'course...","[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",6,"['preheat oven to 350 degrees', 'spray a 2 qua...",since there are already 411 recipes for brocco...,"['frozen broccoli cuts', 'cream of chicken sou...",9,2.978200e+04,306168.0,2008-12-31,5.0,This was one of the best broccoli casseroles t...,,5.0
3,412 broccoli casserole,306168,40,50969,2008-05-30,"['60-minutes-or-less', 'time-to-make', 'course...","[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",6,"['preheat oven to 350 degrees', 'spray a 2 qua...",since there are already 411 recipes for brocco...,"['frozen broccoli cuts', 'cream of chicken sou...",9,1.196280e+06,306168.0,2009-04-13,5.0,I made this for my son's first birthday party ...,,5.0
4,412 broccoli casserole,306168,40,50969,2008-05-30,"['60-minutes-or-less', 'time-to-make', 'course...","[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",6,"['preheat oven to 350 degrees', 'spray a 2 qua...",since there are already 411 recipes for brocco...,"['frozen broccoli cuts', 'cream of chicken sou...",9,7.688280e+05,306168.0,2013-08-02,5.0,Loved this. Be sure to completely thaw the br...,,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
234424,zydeco ya ya deviled eggs,308080,40,37779,2008-06-07,"['60-minutes-or-less', 'time-to-make', 'course...","[59.2, 6.0, 2.0, 3.0, 6.0, 5.0, 0.0]",7,"['in a bowl , combine the mashed yolks and may...","deviled eggs, cajun-style","['hard-cooked eggs', 'mayonnaise', 'dijon must...",8,8.445540e+05,308080.0,2009-10-14,5.0,These were very good. I meant to add some jala...,,5.0
234425,cookies by design cookies on a stick,298512,29,506822,2008-04-15,"['30-minutes-or-less', 'time-to-make', 'course...","[188.0, 11.0, 57.0, 11.0, 7.0, 21.0, 9.0]",9,['place melted butter in a large mixing bowl a...,"i've heard of the 'cookies by design' company,...","['butter', 'eagle brand condensed milk', 'ligh...",10,8.042340e+05,298512.0,2008-05-02,1.0,I would rate this a zero if I could. I followe...,,1.0
234426,cookies by design sugar shortbread cookies,298509,20,506822,2008-04-15,"['30-minutes-or-less', 'time-to-make', 'course...","[174.9, 14.0, 33.0, 4.0, 4.0, 11.0, 6.0]",5,"['whip sugar and shortening in a large bowl , ...","i've heard of the 'cookies by design' company,...","['granulated sugar', 'shortening', 'eggs', 'fl...",7,8.666510e+05,298509.0,2008-06-19,1.0,This recipe tastes nothing like the Cookies by...,,3.0
234427,cookies by design sugar shortbread cookies,298509,20,506822,2008-04-15,"['30-minutes-or-less', 'time-to-make', 'course...","[174.9, 14.0, 33.0, 4.0, 4.0, 11.0, 6.0]",5,"['whip sugar and shortening in a large bowl , ...","i've heard of the 'cookies by design' company,...","['granulated sugar', 'shortening', 'eggs', 'fl...",7,1.546277e+06,298509.0,2010-02-08,5.0,"yummy cookies, i love this recipe me and my sm...",,3.0
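The per-recipe average can also be attached without a second merge. The sketch below is one possible alternative, not the notebook's method: it uses `groupby(...).transform('mean')` on a made-up toy frame (the `toy` name and its values are assumptions for illustration), which returns a Series aligned to the original rows, so no `_x`/`_y` suffixes appear.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged recipes/interactions table (values are made up).
toy = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "rating": [4.0, 0.0, 5.0, 0.0],
})

# Treat a rating of 0 as "no rating given", as in the cell above.
toy["rating"] = toy["rating"].replace(0, np.nan)

# transform('mean') keeps the original row index, so the result can be
# assigned straight back as a new column.
toy["average_rating"] = toy.groupby("id")["rating"].transform("mean")
print(toy)
```

Recipes whose only ratings were 0 end up with a missing average, which matches how the notebook treats unrated recipes.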


In [3]:
# Split the stringified nutrition list into one numeric column per nutrient
import ast

vals = df['nutrition'].apply(ast.literal_eval)
df = df.assign(
    calories=vals.str[0],
    total_fat=vals.str[1],
    sugar=vals.str[2],
    sodium=vals.str[3],
    protein=vals.str[4],
    saturated_fat=vals.str[5],
    carbs=vals.str[6],
)
df

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,...,review,average_rating_x,average_rating_y,calories,total_fat,sugar,sodium,protein,saturated_fat,carbs
0,1 brownies in the world best ever,333281,40,985201,2008-10-27,"['60-minutes-or-less', 'time-to-make', 'course...","[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",10,['heat the oven to 350f and arrange the rack i...,"these are the most; chocolatey, moist, rich, d...",...,"These were pretty good, but took forever to ba...",,4.0,138.4,10.0,50.0,3.0,3.0,19.0,6.0
1,1 in canada chocolate chip cookies,453467,45,1848091,2011-04-11,"['60-minutes-or-less', 'time-to-make', 'cuisin...","[595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0]",12,"['pre-heat oven the 350 degrees f', 'in a mixi...",this is the recipe that we use at my school ca...,...,Originally I was gonna cut the recipe in half ...,,5.0,595.1,46.0,211.0,22.0,13.0,51.0,26.0
2,412 broccoli casserole,306168,40,50969,2008-05-30,"['60-minutes-or-less', 'time-to-make', 'course...","[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",6,"['preheat oven to 350 degrees', 'spray a 2 qua...",since there are already 411 recipes for brocco...,...,This was one of the best broccoli casseroles t...,,5.0,194.8,20.0,6.0,32.0,22.0,36.0,3.0
3,412 broccoli casserole,306168,40,50969,2008-05-30,"['60-minutes-or-less', 'time-to-make', 'course...","[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",6,"['preheat oven to 350 degrees', 'spray a 2 qua...",since there are already 411 recipes for brocco...,...,I made this for my son's first birthday party ...,,5.0,194.8,20.0,6.0,32.0,22.0,36.0,3.0
4,412 broccoli casserole,306168,40,50969,2008-05-30,"['60-minutes-or-less', 'time-to-make', 'course...","[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",6,"['preheat oven to 350 degrees', 'spray a 2 qua...",since there are already 411 recipes for brocco...,...,Loved this. Be sure to completely thaw the br...,,5.0,194.8,20.0,6.0,32.0,22.0,36.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
234424,zydeco ya ya deviled eggs,308080,40,37779,2008-06-07,"['60-minutes-or-less', 'time-to-make', 'course...","[59.2, 6.0, 2.0, 3.0, 6.0, 5.0, 0.0]",7,"['in a bowl , combine the mashed yolks and may...","deviled eggs, cajun-style",...,These were very good. I meant to add some jala...,,5.0,59.2,6.0,2.0,3.0,6.0,5.0,0.0
234425,cookies by design cookies on a stick,298512,29,506822,2008-04-15,"['30-minutes-or-less', 'time-to-make', 'course...","[188.0, 11.0, 57.0, 11.0, 7.0, 21.0, 9.0]",9,['place melted butter in a large mixing bowl a...,"i've heard of the 'cookies by design' company,...",...,I would rate this a zero if I could. I followe...,,1.0,188.0,11.0,57.0,11.0,7.0,21.0,9.0
234426,cookies by design sugar shortbread cookies,298509,20,506822,2008-04-15,"['30-minutes-or-less', 'time-to-make', 'course...","[174.9, 14.0, 33.0, 4.0, 4.0, 11.0, 6.0]",5,"['whip sugar and shortening in a large bowl , ...","i've heard of the 'cookies by design' company,...",...,This recipe tastes nothing like the Cookies by...,,3.0,174.9,14.0,33.0,4.0,4.0,11.0,6.0
234427,cookies by design sugar shortbread cookies,298509,20,506822,2008-04-15,"['30-minutes-or-less', 'time-to-make', 'course...","[174.9, 14.0, 33.0, 4.0, 4.0, 11.0, 6.0]",5,"['whip sugar and shortening in a large bowl , ...","i've heard of the 'cookies by design' company,...",...,"yummy cookies, i love this recipe me and my sm...",,3.0,174.9,14.0,33.0,4.0,4.0,11.0,6.0
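An equivalent way to expand the nutrition lists is to build a small DataFrame from the parsed values and join it back. This is a sketch on toy data; the `NUTRITION_COLUMNS` constant and the `toy` frame are names introduced here for illustration, with the column order taken from the `assign` call above.

```python
import ast
import pandas as pd

# Column order follows the assign(...) call in the notebook cell.
NUTRITION_COLUMNS = [
    "calories", "total_fat", "sugar", "sodium",
    "protein", "saturated_fat", "carbs",
]

# Two nutrition strings copied from the displayed rows above.
toy = pd.DataFrame({
    "name": ["a", "b"],
    "nutrition": [
        "[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",
        "[59.2, 6.0, 2.0, 3.0, 6.0, 5.0, 0.0]",
    ],
})

# Parse each string into a Python list, then spread the lists across columns.
parsed = toy["nutrition"].apply(ast.literal_eval)
nutrition = pd.DataFrame(parsed.tolist(), columns=NUTRITION_COLUMNS, index=toy.index)
toy = toy.join(nutrition)
print(toy[["name", "calories", "protein"]])
```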


In [4]:
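# Collapse to one row per recipe name (each review produced its own row) by averaging the numeric columns.
# numeric_only=True is required on recent pandas versions, which no longer drop text columns silently.
df_section1 = df.groupby('name').mean(numeric_only=True)
# df_section1[['id', 'rating', 'calories', 'protein']]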
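Grouping by `name` averages together any distinct recipes that happen to share a name. If that is not wanted, one option is to collapse on `id` instead. The sketch below assumes a toy frame with hypothetical values and uses named aggregation to keep the name, the (constant per recipe) calories, and the mean rating.

```python
import pandas as pd

# Toy review-level table: recipe 10 has two reviews, and recipes 10 and 11
# share a name but are different recipes (values are made up).
toy = pd.DataFrame({
    "id": [10, 10, 11, 12],
    "name": ["soup", "soup", "soup", "cake"],
    "calories": [200.0, 200.0, 350.0, 500.0],
    "rating": [5.0, 3.0, 4.0, None],
})

# One row per recipe id; nutrition is identical across a recipe's reviews,
# so taking the first value is enough, while ratings are averaged.
per_recipe = (
    toy.groupby("id")
    .agg(
        name=("name", "first"),
        calories=("calories", "first"),
        average_rating=("rating", "mean"),
    )
    .reset_index()
)
print(per_recipe)
```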

In [5]:
# Check for NaNs
missing_values = df_section1[['calories']].isnull().sum()
missing_values

calories    0
dtype: int64

In [6]:
# Look at extremely high calories
max_calorie_entry = df_section1.loc[df_section1['calories'].idxmax()]
max_calorie_entry

# Find why there is an Entry with 30k Calories
df_calories_greater_than_2000 = df[df['calories'] > 2000]
df_calories_greater_than_2000

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,...,review,average_rating_x,average_rating_y,calories,total_fat,sugar,sodium,protein,saturated_fat,carbs
161,buffalo wing mushrooms,535020,30,33186,2018-01-29,"['30-minutes-or-less', 'time-to-make', 'course...","[2526.8, 345.0, 35.0, 106.0, 43.0, 87.0, 37.0]",8,"['in a medium-large mixing bowl , add in panko...",courtesy of ps kitchen.,...,Made these paleo style. I subbed cassava flour...,,,2526.8,345.0,35.0,106.0,43.0,87.0,37.0
230,funny bones cake,360086,60,191015,2009-03-10,"['60-minutes-or-less', 'time-to-make', 'course...","[7016.6, 652.0, 2109.0, 260.0, 263.0, 568.0, 2...",18,['for the filling: beat the cream cheese and p...,this is a chocolate bundt cake with peanut but...,...,"Five plus I should say! This was awesome, and...",,4.75,7016.6,652.0,2109.0,260.0,263.0,568.0,245.0
231,funny bones cake,360086,60,191015,2009-03-10,"['60-minutes-or-less', 'time-to-make', 'course...","[7016.6, 652.0, 2109.0, 260.0, 263.0, 568.0, 2...",18,['for the filling: beat the cream cheese and p...,this is a chocolate bundt cake with peanut but...,...,This cake tasted just like funny bones but the...,,4.75,7016.6,652.0,2109.0,260.0,263.0,568.0,245.0
232,funny bones cake,360086,60,191015,2009-03-10,"['60-minutes-or-less', 'time-to-make', 'course...","[7016.6, 652.0, 2109.0, 260.0, 263.0, 568.0, 2...",18,['for the filling: beat the cream cheese and p...,this is a chocolate bundt cake with peanut but...,...,Really good cake. I took the suggestion of ba...,,4.75,7016.6,652.0,2109.0,260.0,263.0,568.0,245.0
233,funny bones cake,360086,60,191015,2009-03-10,"['60-minutes-or-less', 'time-to-make', 'course...","[7016.6, 652.0, 2109.0, 260.0, 263.0, 568.0, 2...",18,['for the filling: beat the cream cheese and p...,this is a chocolate bundt cake with peanut but...,...,"I love this recipe, although I didn&#039;t coo...",,4.75,7016.6,652.0,2109.0,260.0,263.0,568.0,245.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
234335,zucchini sweet corn ricotta quiche,302346,45,573325,2008-05-06,"['60-minutes-or-less', 'time-to-make', 'course...","[2041.6, 154.0, 115.0, 164.0, 193.0, 296.0, 66.0]",8,['grate the zucchini and preheat the oven to 1...,so easy to put together and so delicious.,...,"Lalaoula you did it aging. Very different, hea...",,5.00,2041.6,154.0,115.0,164.0,193.0,296.0,66.0
234336,zucchini sweet corn ricotta quiche,302346,45,573325,2008-05-06,"['60-minutes-or-less', 'time-to-make', 'course...","[2041.6, 154.0, 115.0, 164.0, 193.0, 296.0, 66.0]",8,['grate the zucchini and preheat the oven to 1...,so easy to put together and so delicious.,...,Followed your recipe right on down & put toget...,,5.00,2041.6,154.0,115.0,164.0,193.0,296.0,66.0
234375,zuppa di cipolla al vino rosso,529308,60,2001245595,2016-12-02,"['60-minutes-or-less', 'time-to-make', 'cuisin...","[9282.1, 136.0, 340.0, 715.0, 746.0, 142.0, 58...",22,"['1', 'peel the onions and cut them in thin ri...",onion soup,...,Flavorful,,5.00,9282.1,136.0,340.0,715.0,746.0,142.0,583.0
234376,zuppa di cipolla al vino rosso,529308,60,2001245595,2016-12-02,"['60-minutes-or-less', 'time-to-make', 'cuisin...","[9282.1, 136.0, 340.0, 715.0, 746.0, 142.0, 58...",22,"['1', 'peel the onions and cut them in thin ri...",onion soup,...,A seafood Lover,,5.00,9282.1,136.0,340.0,715.0,746.0,142.0,583.0


In [7]:
# Two Histograms: 0-2000 and 2000+
# Threshold is 2000 because it is the recommended average daily intake

df_calories_0_to_2000 = df_section1[df_section1['calories'] <= 2000]

df_calories_greater_than_2000 = df_section1[df_section1['calories'] > 2000]


fig1 = px.histogram(df_calories_0_to_2000, x='calories', nbins=50, title='Calories Distribution (0-2000)',
                    labels={'calories': 'Calories'}, range_x=[0, 2000])

fig2 = px.histogram(df_calories_greater_than_2000, x='calories', nbins=50, title='Calories Distribution (>2000)',
                    labels={'calories': 'Calories'}, range_x=[2000, df['calories'].max()])

fig1.show()
fig2.show()

In [8]:
fig1.write_html('calorie_below.html', include_plotlyjs='cdn')
fig2.write_html('calorie_above.html', include_plotlyjs='cdn')

In [9]:
# The plots suggest a positive relationship between calories and protein.
# Notably, very high-calorie recipes span the whole range, from 0 protein to very high protein.
fig3 = px.scatter(df_calories_0_to_2000, x='calories', y='protein', title='Scatter Plot: Calories vs. Protein (0-2000)',
                  labels={'calories': 'Calories', 'protein': 'Protein'})

fig4 = px.scatter(df_calories_greater_than_2000, x='calories', y='protein', title='Scatter Plot: Calories vs. Protein (>2000)',
                  labels={'calories': 'Calories', 'protein': 'Protein'})

fig3.show()
fig4.show()

In [10]:
fig3.write_html('bivariate.html', include_plotlyjs='cdn')
fig4.write_html('bivariate_above.html', include_plotlyjs='cdn')

In [11]:
pivot_table_recipe_type = pd.pivot_table(df_section1, values='calories', index='protein', aggfunc='mean')

# Sort the pivot table by average calories in descending order
pivot_table_recipe_type = pivot_table_recipe_type.sort_values(by='calories', ascending=False)
pivot_table_recipe_type

protein,calories
4356.000000,45609.000
360.000000,28930.200
446.000000,26604.400
3605.000000,21497.800
329.000000,18927.250
...,...
1.250000,91.925
7.250000,87.775
0.035714,82.625
7.500000,54.400
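Because the pivot table is indexed on raw protein values, every distinct value gets its own row. A binned version can be easier to read. The following is a sketch only: the bin edges and the toy values are assumptions chosen for illustration, not values from the dataset.

```python
import pandas as pd

# Toy per-recipe table (values are made up).
toy = pd.DataFrame({
    "protein": [1.0, 4.0, 12.0, 18.0, 40.0, 75.0],
    "calories": [90.0, 110.0, 250.0, 310.0, 600.0, 900.0],
})

# Hypothetical bin edges; real edges should be chosen from the data.
bins = [0, 10, 25, 50, 100]
toy["protein_bin"] = pd.cut(toy["protein"], bins=bins, include_lowest=True)

# Mean calories per protein bin; observed=True keeps only bins that occur.
binned = toy.pivot_table(
    values="calories", index="protein_bin", aggfunc="mean", observed=True
)
print(binned)
```

Quantile-based bins (`pd.qcut`) would be another reasonable choice when the protein values are heavily skewed.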


### Assessment of Missingness

In [12]:
# Columns with missing values: description, review, and rating (the rating NaNs come from replacing 0 with NaN above)
# Use df rather than df_section1: missingness of 'review' is a per-review property, so the repeated recipe rows are needed

missing_values = df.isnull().sum()
missing_values

name                     1
id                       0
minutes                  0
contributor_id           0
submitted                0
tags                     0
nutrition                0
n_steps                  0
steps                    0
description            114
ingredients              0
n_ingredients            0
user_id                  1
recipe_id                1
date                     1
rating               15036
review                  58
average_rating_x    234429
average_rating_y      2777
calories                 0
total_fat                0
sugar                    0
sodium                   0
protein                  0
saturated_fat            0
carbs                    0
dtype: int64


In [14]:
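def calc_tvd_p(df, missing_col, dep_col):
    """Permutation test inputs for whether missingness of `missing_col` depends on `dep_col`.

    Returns [list of simulated TVDs, observed TVD]; the p-value is not computed here.
    """
    df = df.assign(is_missing=df[missing_col].isna())
    # Counts of each dep_col value, split by whether missing_col is missing
    distribution = (
        df
        .pivot_table(index=dep_col, columns='is_missing', aggfunc='size')
    )
    # Normalize each column so it is a distribution over dep_col
    distribution = distribution / distribution.sum()
    # Caveat: a dep_col value that occurs in only one group is NaN in the other,
    # so its difference is NaN and sum() skips it instead of counting it.
    observed_tvd = distribution.diff(axis=1).iloc[:, -1].abs().sum() / 2

    n_repetitions = 500
    shuffled = df.copy()

    tvds = []
    for _ in range(n_repetitions):
        # Shuffle dep_col to break any link with missingness
        shuffled[dep_col] = np.random.permutation(shuffled[dep_col])

        # Computing and storing the TVD.
        pivoted = (
            shuffled
            .pivot_table(index=dep_col, columns='is_missing', aggfunc='size')
            .apply(lambda x: x / x.sum())
        )

        tvd = pivoted.diff(axis=1).iloc[:, -1].abs().sum() / 2
        tvds.append(tvd)

    return [tvds, observed_tvd]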
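The function above returns the simulated TVDs and the observed TVD but stops short of a p-value. The sketch below shows one way to finish the test on toy data. The helper names (`tvd_between_groups`, `permutation_p_value`) and the toy frame are assumptions introduced here, and it uses `pd.crosstab`, which fills absent category/group combinations with 0, rather than the notebook's `pivot_table` call.

```python
import numpy as np
import pandas as pd

def tvd_between_groups(frame, group_col, category_col):
    """TVD between the category distributions of the two groups in group_col."""
    # crosstab fills absent (category, group) pairs with 0, so every
    # category contributes to the distance.
    counts = pd.crosstab(frame[category_col], frame[group_col])
    props = counts / counts.sum()
    return props.diff(axis=1).iloc[:, -1].abs().sum() / 2

def permutation_p_value(frame, group_col, category_col, n_repetitions=200, seed=0):
    """Return (observed TVD, share of shuffled TVDs at least as large)."""
    rng = np.random.default_rng(seed)
    observed = tvd_between_groups(frame, group_col, category_col)
    shuffled = frame.copy()
    simulated = np.empty(n_repetitions)
    for i in range(n_repetitions):
        # Shuffling the categories breaks any link with the group labels.
        shuffled[category_col] = rng.permutation(shuffled[category_col].to_numpy())
        simulated[i] = tvd_between_groups(shuffled, group_col, category_col)
    return observed, (simulated >= observed).mean()

# Toy data: rows with a missing review only ever carry low ratings.
toy = pd.DataFrame({
    "is_missing": [True] * 4 + [False] * 8,
    "rating": [1, 1, 1, 2] + [5, 5, 5, 5, 4, 4, 5, 5],
})
observed, p_value = permutation_p_value(toy, "is_missing", "rating")
print(observed, p_value)
```

A small p-value here would be read as evidence that missingness depends on the tested column; a large one is consistent with no dependence on that column.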

In [15]:
stats = calc_tvd_p(df, 'review', 'contributor_id')
tvds = stats[0]
observed_tvd = stats[1]

fig = px.histogram(pd.DataFrame(tvds), x=0, nbins=25, histnorm='probability', 
                   title='Empirical Distribution of the TVD')
fig.add_vline(x=observed_tvd, line_color='red')
fig.add_annotation(text=f'<span style="color:red">Observed TVD = {round(observed_tvd, 2)}</span>',
                   x=0.48, showarrow=False, y=0.16)
fig.update_layout(yaxis_range=[0, 0.2], xaxis_range=[0.4, 0.5])

# fig.write_html('review_contributor_id.html', include_plotlyjs='cdn')


In [16]:
stats1 = calc_tvd_p(df, 'review', 'rating')
tvds1 = stats1[0]
observed_tvd1 = stats1[1]

fig = px.histogram(pd.DataFrame(tvds1), x=0, nbins=25, histnorm='probability', 
                   title='Empirical Distribution of the TVD')
fig.add_vline(x=observed_tvd1, line_color='red')
fig.add_annotation(text=f'<span style="color:red">Observed TVD = {round(observed_tvd1, 2)}</span>',
                   x= 0.25, showarrow=False, y=0.16)
fig.update_layout(yaxis_range=[0, 0.2], xaxis_range=[0, 0.3])

# fig.write_html('review_rating.html', include_plotlyjs='cdn')

In [17]:
stats2 = calc_tvd_p(df, 'review', 'n_steps')
tvds2 = stats2[0]
observed_tvd2 = stats2[1]

fig = px.histogram(pd.DataFrame(tvds2), x=0, nbins=25, histnorm='probability', 
                   title='Empirical Distribution of the TVD')
fig.add_vline(x=observed_tvd2, line_color='red')
fig.add_annotation(text=f'<span style="color:red">Observed TVD = {round(observed_tvd2, 2)}</span>',
                   x=observed_tvd2, showarrow=False, y=0.16)  # anchor at the observed value; x=0 lies outside xaxis_range
fig.update_layout(yaxis_range=[0, 0.2], xaxis_range=[0.1, 0.4])

# fig.write_html('review_n_steps.html', include_plotlyjs='cdn')

### Hypothesis Testing

In [18]:
# Null hypothesis: recipes at or below the protein threshold and recipes above it
#   have the same calorie distribution.
# Alternative hypothesis: recipes at or below the protein threshold have lower
#   calories than recipes above it.
#
# above_threshold column: True if protein is above the threshold (the mean protein), False if at or below it
# shuffled_calories: calories randomly permuted across recipes
# Test statistic: difference in mean calories (above threshold minus at/below threshold)

In [19]:
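# Permutation test: do recipes above the mean protein level have higher mean calories?
mean_protein_threshold = df_section1['protein'].mean()
df_section1['above_threshold'] = df_section1['protein'] > mean_protein_threshold  # True if protein is above the threshold
num_repetitions = 500
protein_diffs = np.array([])

# Observed statistic: mean calories above the threshold minus mean calories at or below it
observed_diff = df_section1.groupby("above_threshold").mean().loc[:,"calories"].diff().iloc[-1]

for i in range(num_repetitions):
    # Shuffle calories to simulate the null of no association with protein level
    df_section1['shuffled_calories'] = np.random.permutation(df_section1['calories'])
    difference = df_section1.groupby("above_threshold").mean().loc[:,"shuffled_calories"].diff().iloc[-1]
    protein_diffs = np.append(protein_diffs, difference)

# One-sided p-value: share of shuffled differences at least as large as the observed one
p_value = (protein_diffs >= observed_diff).mean()
p_value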


0.0
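A reported p-value of 0.0 means that none of the 500 shuffles reached the observed difference, not that the true p-value is exactly zero. A common convention is to count the observed statistic as one of the permutations, which bounds the estimate below by 1 / (repetitions + 1). The sketch below applies that convention to toy data; the function name and the toy values are assumptions for illustration, not the notebook's code.

```python
import numpy as np

def permutation_test_mean_diff(values, is_high, n_repetitions=500, seed=0):
    """One-sided permutation test for mean(high group) - mean(low group)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_high = np.asarray(is_high, dtype=bool)

    observed = values[is_high].mean() - values[~is_high].mean()

    simulated = np.empty(n_repetitions)
    for i in range(n_repetitions):
        # Shuffle the values while keeping the group labels fixed.
        shuffled = rng.permutation(values)
        simulated[i] = shuffled[is_high].mean() - shuffled[~is_high].mean()

    # Add-one correction: the observed statistic counts as one permutation,
    # so the estimate can never be exactly 0.
    exceed = int((simulated >= observed).sum())
    p_value = (exceed + 1) / (n_repetitions + 1)
    return observed, p_value

# Toy data: four low-protein and four high-protein recipes (made-up calories).
calories = [100, 120, 90, 110, 400, 420, 380, 410]
high_protein = [False] * 4 + [True] * 4
observed, p_value = permutation_test_mean_diff(calories, high_protein)
print(observed, p_value)
```

With 500 repetitions, the smallest value this estimate can take is 1/501, so a result like the one above would be reported as "p < 0.002" rather than "p = 0".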

In [20]:
fig_protein = px.histogram(
    pd.DataFrame(protein_diffs), x=0, nbins=50, histnorm='probability', 
    title='Empirical Distribution of Calorie Differences for Recipes with High Protein vs. Low/Equal Protein')
fig_protein.add_vline(x=observed_diff, line_color='red')
fig_protein.update_layout(xaxis_range=[-30, 446], margin=dict(t=60))
fig_protein.update_traces(marker_line_color='black', marker_line_width=1)


In [21]:
fig_protein.write_html('protein-calorie.html', include_plotlyjs='cdn')