# An Investigation into Nutrition vs Yummy: How does Nutritional Content Impact the Average Rating of a Recipe?

**Name(s)**: Mia Jerphagnon, Alyssa

**Website Link**: (your website link)

In [123]:
import pandas as pd
import numpy as np
from pathlib import Path

import plotly.express as px
pd.options.plotting.backend = 'plotly'

from ast import literal_eval

from scipy.stats import pearsonr

from dsc80_utils import * 

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, r2_score



## Step 1: Introduction

If you have ever been a college student, you know just how much unhealthy food we consume on a daily basis. Without the help from our parents, we struggle with eating healthy food. We crave quick and tasty meals in order to get through our busy schedules, sleepless nights, and hours of homework. But, does it have to be this way? What if we could eat healthy _and_ yummy food? 

Well, as part of UC San Diego's DSC 80 curriculum, this project explores __how the nutritional content of a recipe (calories, total fat, sugar, sodium, protein, saturated fat, carbohydrates) affects its average rating__. In Step 4, we specifically examine the relationship between protein and average rating. In Steps 5-8, we develop a multivariate predictive model to predict average rating based on nutritional content. 

We analyze two datasets from food.com. These datasets include recipes and ratings posted until and including 2008.

The first dataset is called `recipes`. It has 83,782 rows and 12 columns. 

| Column | Description |
|--------|-------------|
| 'name' | Recipe name |
| 'id' | Recipe ID |
| 'minutes' | Minutes to prepare recipe |
| 'contributor_id'| User ID who submitted this recipe |
| 'submitted'    | Date recipe was submitted |
| 'tags' | Food.com tags for recipe |
| 'nutrition' | Nutrition information in the form [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]; PDV stands for “percentage of daily value” |
| 'n_steps' | Number of steps in recipe |
| 'steps' | Text for recipe steps, in order |
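| 'description' | Description of the recipe |
| 'ingredients' | Text for recipe ingredients |
| 'n_ingredients' | Number of ingredients in recipe |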

The second dataset is called `interactions`. It has 731,927 rows and 5 columns. 

| Column | Description |
|--------|-------------|
| 'user_id' | User ID |
| 'recipe_id' | Recipe ID |
| 'date' | Date of interaction |
| 'rating' | Rating given |
| 'review' | Review text |

## Step 2: Data Cleaning and Exploratory Data Analysis

#### Data Cleaning

In order to clean the data, we implement the following steps:
1. Read the data sets
2. Left merge the datasets on recipe ID (`id` in `recipes`, `recipe_id` in `interactions`)
3. Fill ratings of 0 with NaN
4. Calculate average rating per recipe
5. Merge average ratings back to recipes dataset
6. Rename the column containing the average rating
7. Create new columns from the `nutrition` column
8. Drop the `nutrition` column
9. Create a new column called `fulfills_protein_DV` with Boolean values. If the value is True, then the protein PDV $\geq100$, and if the value is False, the protein PDV $<100$. This column aims to differentiate recipes that meet or do not meet the recommended daily value of protein.
10. Reorder the columns so more relevant features are to the left
11. Fill calorie values of 0 with NaN

Ratings for recipes can only be between 1 and 5, so intuitively, ratings of 0 imply a user did not properly rate the recipe. 
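
To make the effect of this choice concrete, here is a minimal sketch on a made-up interactions table (the values are invented for illustration and are not from food.com). It shows that leaving the zeros in drags the average down, while replacing them with NaN makes `mean()` skip them.

```python
import numpy as np
import pandas as pd

# Toy interactions table (made-up values): each recipe has one "0" placeholder rating
toy = pd.DataFrame({
    'recipe_id': [1, 1, 1, 2, 2],
    'rating':    [5, 4, 0, 3, 0],
})

# Averaging with the zeros kept pulls recipe 1 down to (5 + 4 + 0) / 3 = 3.0
mean_with_zeros = toy.groupby('recipe_id')['rating'].mean()

# Replacing 0 with NaN makes mean() skip those rows: (5 + 4) / 2 = 4.5
toy['rating'] = toy['rating'].replace(0, np.nan)
mean_without_zeros = toy.groupby('recipe_id')['rating'].mean()

print(mean_with_zeros.to_dict())     # {1: 3.0, 2: 1.5}
print(mean_without_zeros.to_dict())  # {1: 4.5, 2: 3.0}
```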

Because the `nutrition` column stores several features together as the string representation of a list, we split it into one new column per feature. These columns, in addition to recipe `name`, `avg_rating`, and `fulfills_protein_DV`, are the ones most relevant to our analysis.

The full list of these columns is:

| Relevant Column | Description |
|--------|-------------|
| 'name' | Recipe name |
| 'avg_rating' | Average rating for recipe | 
| 'fulfills_protein_DV' | True/false whether protein content fulfills recommended daily value intake |
| 'calories (#)' | Number of kilocalories | 
| 'total fat (PDV)' | Percent daily value of total fat |
| 'sugar (PDV)' | Percent daily value of sugar |
| 'sodium (PDV)' | Percent daily value of sodium | 
| 'protein (PDV)' | Percent daily value of protein | 
| 'saturated fat (PDV)' | Percent daily value of saturated fat |
| 'carbohydrates (PDV)' | Percent daily value of carbohydrates | 


In [76]:
# Read datasets
data_dir = Path('data')
recipes = pd.read_csv(data_dir / 'RAW_recipes.csv')
interactions = pd.read_csv(data_dir / 'RAW_interactions.csv')

# Left merge datasets on 'id'
merged = pd.merge(recipes, interactions, left_on='id', right_on='recipe_id', how='left')

# Fill ratings of 0 with NaN (np.nan is the spelling that works across NumPy versions)
merged['rating'] = merged['rating'].replace(0, np.nan)

# Calculate average rating per recipe
average_ratings = merged.groupby('id')['rating'].mean()

# Merge average ratings back to recipes dataset
recipes_avg_ratings = pd.merge(recipes, average_ratings, left_on='id', right_index=True, how='left')

# Rename the column containing the average rating
recipes_avg_ratings.rename(columns={'rating': 'avg_rating'}, inplace=True)

# Display the resulting dataset
recipes_avg_ratings.head()

# Apply the literal_eval method to the strings in nutrition in order to get them in list form
recipes_avg_ratings['nutrition'] = recipes_avg_ratings['nutrition'].apply(literal_eval)

# Create columns for the nutrition data
nutrition_columns = ['calories (#)', 'total fat (PDV)', 'sugar (PDV)', 'sodium (PDV)', 'protein (PDV)', 'saturated fat (PDV)', 'carbohydrates (PDV)']

# Loop through nutrition variables and fill the new columns
for i, col in enumerate(nutrition_columns):
    recipes_avg_ratings[col] = recipes_avg_ratings['nutrition'].apply(lambda x: x[i] if len(x) > i else np.nan)

recipes_avg_ratings = recipes_avg_ratings.drop(columns=['nutrition'])

# Create fulfills protein DV column 
recipes_avg_ratings['fulfills_protein_DV'] = recipes_avg_ratings['protein (PDV)'] >= 100

# Reorder columns 
all_other_columns = recipes_avg_ratings.drop(columns=['name', 'avg_rating', 'fulfills_protein_DV']+nutrition_columns).columns.to_list()
recipes_avg_ratings = recipes_avg_ratings[['name', 'avg_rating', 'fulfills_protein_DV']+nutrition_columns+all_other_columns]

# Fill calories of 0 with NaN
recipes_avg_ratings['calories (#)'] = recipes_avg_ratings['calories (#)'].replace(0, np.nan)

Here is a peek of the first 5 rows of the cleaned dataset. It has 83,782 rows and 19 columns. Because our dataset has so many columns, we selected the most relevant to display on the left. Please scroll to the right to see the remaining columns.

In [36]:
# Here is a peek at the first 5 rows of the cleaned dataset with the most relevant features
recipes_avg_ratings.head()

| | name | avg_rating | fulfills_protein_DV | calories (#) | ... | steps | description | ingredients | n_ingredients |
|---|------|------------|---------------------|--------------|-----|-------|-------------|-------------|---------------|
| 0 | 1 brownies in the world best ever | 4.0 | False | 138.4 | ... | ['heat the oven to 350f and arrange the rack i... | these are the most; chocolatey, moist, rich, d... | ['bittersweet chocolate', 'unsalted butter', '... | 9 |
| 1 | 1 in canada chocolate chip cookies | 5.0 | False | 595.1 | ... | ['pre-heat oven the 350 degrees f', 'in a mixi... | this is the recipe that we use at my school ca... | ['white sugar', 'brown sugar', 'salt', 'margar... | 11 |
| 2 | 412 broccoli casserole | 5.0 | False | 194.8 | ... | ['preheat oven to 350 degrees', 'spray a 2 qua... | since there are already 411 recipes for brocco... | ['frozen broccoli cuts', 'cream of chicken sou... | 9 |
| 3 | millionaire pound cake | 5.0 | False | 878.3 | ... | ['freheat the oven to 300 degrees', 'grease a ... | why a millionaire pound cake?  because it's su... | ['butter', 'sugar', 'eggs', 'all-purpose flour... | 7 |
| 4 | 2000 meatloaf | 5.0 | False | 267.0 | ... | ['pan fry bacon , and set aside on a paper tow... | ready, set, cook! special edition contest entr... | ['meatloaf mixture', 'unsmoked bacon', 'goat c... | 13 |


#### Univariate Analysis

The distribution of protein in the dataset is skewed to the right. Most recipes are 200% of daily value or less. As the PDV of protein increases, the number of recipes decreases. Below is a plot of the distribution of protein after removing outliers with protein above 1,000 PDV. 

In [37]:
df = recipes_avg_ratings[recipes_avg_ratings['protein (PDV)'] <= 1000]
fig_1 = px.histogram(df, x="protein (PDV)", title="Distribution of Protein (PDV)")
fig_1

The distribution of calories in the dataset is also skewed to the right. Most recipes have 1,500 calories or less. As the number of calories increases, the number of recipes decreases. Below is a plot of the distribution of calories after removing outliers with more than 5,000 calories. 

In [38]:
df = recipes_avg_ratings[recipes_avg_ratings['calories (#)'] <= 5000]

fig_2 = px.histogram(df, x='calories (#)', title='Distribution of Calories (#)')
fig_2
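
Besides reading the histograms, right skew can be checked numerically. The sketch below uses synthetic log-normal data as a stand-in for a nutrition column (an assumption; it is not the real data) and shows the two usual signs: the mean sits above the median, and the sample skewness is positive.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a nutrition column (not the real data)
rng = np.random.default_rng(0)
calories = pd.Series(rng.lognormal(mean=5.5, sigma=0.8, size=10_000))

# In a right-skewed distribution the long tail pulls the mean above the median,
# and the sample skewness is positive
print('mean:  ', calories.mean())
print('median:', calories.median())
print('skew:  ', calories.skew())
```

The same three calls could be applied to `recipes_avg_ratings['calories (#)']` or `recipes_avg_ratings['protein (PDV)']`.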

#### Bivariate Analysis

The relationship between calories and average rating is quite weak: the correlation between the two variables is approximately 0. Recipes with higher calorie content are spread across all average ratings, and the same is true for recipes with lower calorie content. Looking at the scatter plot, knowing a recipe's calorie content does not tell you much about what its average rating would be.

In [39]:
print('Correlation between calories and average rating: '+ str(df['calories (#)'].corr(df['avg_rating'])))

Correlation between calories and average rating: -0.0047284500629757404
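
For reference, `Series.corr` computes the Pearson correlation by default and excludes missing values. A minimal sketch of the same formula written out by hand, on made-up numbers chosen only to show the computation:

```python
import numpy as np
import pandas as pd

# Made-up values, only to show the computation
toy = pd.DataFrame({
    'calories (#)': [100.0, 250.0, 400.0, 800.0, 1200.0],
    'avg_rating':   [4.5, 5.0, 4.0, 4.8, 4.2],
})

# Deviations of each variable from its own mean
x_dev = toy['calories (#)'] - toy['calories (#)'].mean()
y_dev = toy['avg_rating'] - toy['avg_rating'].mean()

# r = sum(x_dev * y_dev) / sqrt(sum(x_dev^2) * sum(y_dev^2))
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())
r_pandas = toy['calories (#)'].corr(toy['avg_rating'])

print(r_manual, r_pandas)
```

A value near 0, like the one we observe, means there is almost no linear association; it does not by itself rule out a nonlinear one.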


In [40]:
fig_3 = px.scatter(df, x='calories (#)', y='avg_rating')
fig_3

In our next bivariate analysis, we looked at the distribution of the average rating of a recipe conditioned on whether its protein content meets the recommended daily value. Based on the bar chart below, the proportion of non-protein-fulfilling recipes with an average rating of 5 is higher than the proportion of protein-fulfilling recipes with an average rating of 5. The same goes for recipes with an average rating between [1,2) and between [2,3). For average rating groups [3,4) and [4,5), the proportion of non-protein-fulfilling recipes is smaller than the proportion of protein-fulfilling recipes. 

However, the differences in proportions are quite small. As the chart displays, the proportions are __almost the same__ for each range of average rating. Thus, we cannot find any pattern that indicates whether reaching 100 protein PDV impacts a recipe's average rating. We need further analysis to see whether we can find a pattern or whether there is truly no relationship between protein and rating. 

In [41]:
# Set up a table with the value counts for each avg_rating 
df = recipes_avg_ratings.copy()

# Group average ratings by the nearest integer (1, 2, 3, 4, or 5)
df['avg_rating'] = df['avg_rating'].round()

# Total number of recipes in each fulfills_protein_DV group (unrated recipes included)
group_totals = df['fulfills_protein_DV'].value_counts()

# Count recipes per (group, rounded rating), then divide each count by its group total.
# Mapping the totals by group avoids assuming that every group has exactly 5 rating values.
df = df.groupby(['fulfills_protein_DV', 'avg_rating']).size().rename('proportion').reset_index()
df['proportion'] = df['proportion'] / df['fulfills_protein_DV'].map(group_totals)

# Plot
fig_4 = px.bar(df, x='avg_rating', y='proportion', color='fulfills_protein_DV', barmode='group', title='Distribution of avg_rating conditional on fulfills_protein_DV')
fig_4
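
A compact alternative for the same kind of conditional table is `pd.crosstab` with `normalize='index'`. The sketch below runs on made-up rows, and it drops unrated recipes before normalizing so that each row sums to 1; that is an assumption that differs slightly from the cell above, which divides by all recipes in each group.

```python
import pandas as pd

# Made-up rows standing in for the cleaned recipes table
toy = pd.DataFrame({
    'fulfills_protein_DV': [True, True, True, True, False, False, False, False, False],
    'avg_rating':          [5.0, 4.6, 4.4, 3.2, 5.0, 5.0, 4.7, 4.1, 2.8],
})

# Drop unrated recipes, then round to the nearest integer rating
rated = toy.dropna(subset=['avg_rating']).copy()
rated['rating_group'] = rated['avg_rating'].round().astype(int)

# normalize='index' divides each row by its row total:
# P(rating group | fulfills_protein_DV)
conditional = pd.crosstab(rated['fulfills_protein_DV'], rated['rating_group'], normalize='index')
print(conditional)
```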

## Step 3: Assessment of Missingness

In [85]:
rec = recipes_avg_ratings.copy(deep=True)
rec[rec['description'].isna() == True]

| | name | avg_rating | fulfills_protein_DV | calories (#) | ... | steps | description | ingredients | n_ingredients |
|---|------|------------|---------------------|--------------|-----|-------|-------------|-------------|---------------|
| 1486 | almond cookie bites | 2.67 | False | 40.1 | ... | ['preheat oven to 350 degrees f', 'in medium b... | NaN | ['all-purpose flour', "fisher chef's naturals ... | 9 |
| 3087 | apricot gorgonzola crescent appetizers | 4.67 | False | 139.9 | ... | ['heat oven to 350f spray large cookie sheet w... | NaN | ['pillsbury refrigerated crescent dinner rolls... | 6 |
| 3685 | asparagus milanese | 4.50 | False | 225.2 | ... | ['snap off the tough ends of the asparagus', '... | NaN | ['asparagus', 'parmigiano-reggiano cheese', 'b... | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 81188 | wasatch mountain chili | 5.00 | False | 672.9 | ... | ['in a large saucepan over medium heat cook on... | NaN | ['onion', 'olive oil', 'hominy', 'great northe... | 14 |
| 81701 | white bean chicken chili giada de laurentiis | 5.00 | False | 409.0 | ... | ['in a large heavy-bottomed saucepan or dutch ... | NaN | ['olive oil', 'onion', 'garlic cloves', 'groun... | 18 |
| 83070 | yukon gold potatoes jacques pepin style | 4.00 | False | 292.1 | ... | ['place the potatoes in a deep skillet and add... | NaN | ['yukon gold potatoes', 'salt', 'fresh ground ... | 6 |


In [86]:
rec[rec['avg_rating'].isna() == True]

| | name | avg_rating | fulfills_protein_DV | calories (#) | ... | steps | description | ingredients | n_ingredients |
|---|------|------------|---------------------|--------------|-----|-------|-------------|-------------|---------------|
| 10 | lplermagronen | NaN | False | 651.8 | ... | ['heat oven to 375f set a large pot of salted ... | known as swiss mac n cheese, älplermagronen wa... | ['potato', 'penne pasta', 'onions', 'butter', ... | 8 |
| 14 | der wiener schnitzel style chili dog sauce | NaN | False | 259.7 | ... | ['in a large size dutch oven or large size dee... | this was the best chili dog ever invented! i l... | ['ground beef', 'ground pork', 'water', 'corns... | 14 |
| 64 | boo tiful jell o cups | NaN | False | 150.0 | ... | ['add boiling water to gelatin mix in large bo... | this was so good...everyone loved them. | ['boiling water', 'orange gelatin', 'ice cubes... | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83631 | zucchini lemon poppyseed bread | NaN | False | 2144.9 | ... | ['preheat oven to 375 degrees', 'in a medium b... | i wanted to try something a little different w... | ['sugar', 'eggs', 'vanilla', 'vegetable oil', ... | 14 |
| 83651 | zucchini pancakes with a difference | NaN | False | 197.6 | ... | ["i peel the zucchini if i buy it , not if it'... | this started as a recipe from a mexican cookbo... | ['zucchini', 'corn', 'jalapeno', 'egg', 'bisqu... | 8 |
| 83737 | zucchini oat bread | NaN | True | 4588.6 | ... | ['preheat oven to 350 degrees f', 'lightly coa... | the bh&g $400 winner of the bread and rolls ca... | ['nonstick cooking spray', 'sugar', 'ground ci... | 15 |


Upon analyzing the data, we see that many of the columns describe characteristics of the recipes themselves. For example, for each recipe, we can see the breakdown of the nutritional content, the number of steps, the ingredients, and a brief description of the recipe. A column with many missing values is 'avg_rating'. Of all columns, 'avg_rating' is the only one based on ordinal data, and its missingness cannot be inferred from the rest of the data because ratings are fundamentally different in nature: they are an extrinsic measure, reflecting user preferences, user experience, and other factors that are not captured in the dataset. Because of this, we believe the missingness mechanism of 'avg_rating' is not missing at random (NMAR): the missing values are likely related to factors that exist outside the scope of our data. Additional data that might help explain the missingness of 'avg_rating', and move its missingness mechanism towards missing at random (MAR), include user feedback and trends, such as the number of users who reviewed the recipe or how popular the recipe is, since the average rating of a recipe reviewed by fewer people can easily be skewed in one direction.
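
To illustrate why this distinction matters, here is a small simulation on synthetic data (not food.com data; the column values and the missingness probabilities are invented). It creates missingness under each mechanism and then compares the mean of an observed column between rows with and without a missing value. Only the MAR case leaves a visible trace in the observed column, which is why NMAR cannot be confirmed from the data alone.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Synthetic data: a fully observed column and a column that will lose values
toy = pd.DataFrame({
    'n_ingredients': rng.integers(2, 20, size=n),
    'rating': rng.normal(4.5, 0.5, size=n),
})

# MCAR: every value has the same 10% chance of being missing
mcar_mask = rng.random(n) < 0.10

# MAR: the chance of being missing depends on another observed column
mar_mask = rng.random(n) < np.where(toy['n_ingredients'] > 10, 0.20, 0.02)

# NMAR: the chance of being missing depends on the (unobserved) value itself
nmar_mask = rng.random(n) < np.where(toy['rating'] < 4.0, 0.30, 0.02)

# Difference in mean n_ingredients between rows with and without a missing rating
gaps = {}
for label, mask in [('MCAR', mcar_mask), ('MAR', mar_mask), ('NMAR', nmar_mask)]:
    missing_mean = toy.loc[mask, 'n_ingredients'].mean()
    present_mean = toy.loc[~mask, 'n_ingredients'].mean()
    gaps[label] = missing_mean - present_mean
    print(label, round(gaps[label], 3))
```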

### Missingness Dependency

#### Description and Number of Ingredients
 
Null hypothesis: The missingness of description does not depend on the number of ingredients.

Alternative hypothesis: The missingness of description depends on the number of ingredients.

Here, we analyze the dependency between the missingness of `description` and the column `n_ingredients` using a permutation test to see if missing values of description are related to the number of ingredients. The test statistic used in this permutation test is the absolute difference in group means of `n_ingredients` between recipes with descriptions and recipes without descriptions.



In [87]:
obs_with_description = rec[rec['description'].notna()]['n_ingredients'].mean()
obs_without_description = rec[rec['description'].isna()]['n_ingredients'].mean()
obs_diff = abs(obs_with_description - obs_without_description)
obs_diff

n = 1000
perm_means = []

for i in range(n):
    # Shuffle the description column to break any link with n_ingredients
    shuffled = rec['description'].sample(frac=1, replace=False).values
    shuffled_df = rec.copy()
    shuffled_df['description'] = shuffled

    # Find the absolute difference in group means for this shuffle
    no_description = shuffled_df[shuffled_df['description'].isna()]['n_ingredients'].mean()
    with_description = shuffled_df[shuffled_df['description'].notna()]['n_ingredients'].mean()
    obs_shuff = abs(no_description - with_description)
    perm_means.append(obs_shuff)

In [89]:
diff_means_plot = px.histogram(x=perm_means, title='Empirical Distribution of Absolute Differences in Group Means')
diff_means_plot.add_vline(x=obs_diff, line_dash="dash", line_color="red", annotation_text=f'Observed Diff in Group Means')
diff_means_plot

In [90]:
p_val = (np.array(perm_means) >= obs_diff).mean()
p_val

0.001

With a p-value of 0.001, we reject the null hypothesis. The permutation test, which compares the absolute difference in mean `n_ingredients` between recipes with missing descriptions and recipes with valid descriptions, suggests that the missingness of `description` depends on `n_ingredients`, making the missingness mechanism missing at random (MAR).
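
The permutation test above can also be written as a reusable function that shuffles only the missingness labels instead of copying the whole DataFrame on every repetition. This is a sketch under assumptions: the helper name `missingness_permutation_test`, its parameters, and the toy data are hypothetical and not part of our notebook.

```python
import numpy as np
import pandas as pd

def missingness_permutation_test(df, missing_col, numeric_col, n_repetitions=1000, seed=0):
    """Hypothetical helper: test whether missingness of `missing_col` depends on `numeric_col`.

    Returns the observed absolute difference in group means and the p-value.
    """
    rng = np.random.default_rng(seed)
    is_missing = df[missing_col].isna().to_numpy()
    values = df[numeric_col].to_numpy(dtype=float)

    def abs_diff(mask):
        # nanmean skips missing values in numeric_col, like Series.mean()
        return abs(np.nanmean(values[mask]) - np.nanmean(values[~mask]))

    observed = abs_diff(is_missing)
    # Shuffling the missingness labels breaks any link with numeric_col
    simulated = np.array([abs_diff(rng.permutation(is_missing)) for _ in range(n_repetitions)])
    p_value = (simulated >= observed).mean()
    return observed, p_value

# Made-up data where recipes with many ingredients are less likely to have a description
rng = np.random.default_rng(1)
toy = pd.DataFrame({'n_ingredients': rng.integers(2, 20, size=2000)})
missing_prob = np.where(toy['n_ingredients'] > 12, 0.30, 0.05)
toy['description'] = np.where(rng.random(2000) < missing_prob, None, 'some text')

obs, p = missingness_permutation_test(toy, 'description', 'n_ingredients')
print(obs, p)
```

On the real data, the call would look like `missingness_permutation_test(rec, 'description', 'n_ingredients')`.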

#### Description and Calories (#)

Null hypothesis: The missingness of description does not depend on the number of calories.

Alternative hypothesis: The missingness of description depends on the number of calories.

Here, we analyze the dependency between the missingness of `description` and the column `calories (#)` using a permutation test to see if missing values of description are related to a recipe's calories. The test statistic used in this permutation test is the absolute difference in group means of `calories (#)` between recipes with descriptions and recipes without descriptions.

In [91]:
obs_with_description = rec[rec['description'].notna()]['calories (#)'].mean()
obs_without_description = rec[rec['description'].isna()]['calories (#)'].mean()
obs_cals_diff = abs(obs_with_description - obs_without_description)
obs_cals_diff

n = 1000
perm_means_cals = []

for i in range(n):
    # Shuffle the description column to break any link with calories
    shuffled_cals = rec['description'].sample(frac=1, replace=False).values
    shuffled_df_cals = rec.copy()
    shuffled_df_cals['description'] = shuffled_cals

    # Find the absolute difference in group means for this shuffle
    no_description_cals = shuffled_df_cals[shuffled_df_cals['description'].isna()]['calories (#)'].mean()
    with_description_cals = shuffled_df_cals[shuffled_df_cals['description'].notna()]['calories (#)'].mean()
    obs_shuff_cals = abs(no_description_cals - with_description_cals)
    perm_means_cals.append(obs_shuff_cals)

In [92]:
diff_means_cals = px.histogram(x=perm_means_cals, title='Empirical Distribution of Absolute Differences in Group Means')
diff_means_cals.add_vline(x=obs_cals_diff, line_dash="dash", line_color="red", annotation_text='Observed Diff in Group Means')
diff_means_cals

In [93]:
cals_p_val = (np.array(perm_means_cals) >= obs_cals_diff).mean()
cals_p_val


0.233

After shuffling the `description` column and comparing the absolute difference in mean `calories (#)` between recipes with and without descriptions, we obtain a p-value of 0.233, which is not statistically significant. We fail to reject the null hypothesis: the observed difference is consistent with random chance, so this test gives no evidence that the missingness of `description` depends on `calories (#)`. With respect to calories, the missingness of description looks missing completely at random (MCAR).

## Step 4: Hypothesis Testing

#### Research Question

Is there a difference between the ratings of protein-fulfilling and non-protein-fulfilling recipes? 

#### Hypothesis

$H_0$: There is no difference between the population mean rating of protein-fulfilling recipes (`protein (PDV)` $\geq 100$) and the population mean rating of non-protein-fulfilling recipes. 

$H_a$: There is a difference between the population mean rating of protein-fulfilling recipes (`protein (PDV)` $\geq 100$) and the population mean rating of non-protein-fulfilling recipes. 

#### Test Statistic

Absolute value of the difference between the mean rating for protein-fulfilling and non-protein-fulfilling recipes.

$|\mu_{\text{protein PDV} \geq 100} - \mu_{\text{protein PDV} < 100}|$

#### Significance Level

$\alpha=0.05$ 


To test our hypothesis, we run a permutation test. We simulate the null hypothesis by shuffling the `fulfills_protein_DV` column, then check whether the observed absolute mean difference would be unlikely under the null, that is, whether there is statistically significant evidence in favor of the alternative hypothesis. The observed absolute mean difference between the two groups in the dataset is approximately 0.0031266. 

To run the test, we split the dataset into two groups, one where `fulfills_protein_DV` is true and one where it is false. Then, we shuffle the `fulfills_protein_DV` labels $n=1000$ times and compute the absolute difference in mean rating between the two groups for each of the thousand simulations. 

In [42]:
# Remove observations without an average rating 
df = recipes_avg_ratings[recipes_avg_ratings['avg_rating'].isna()==False]

fulfill = df[df['fulfills_protein_DV']==True]
unfulfilled = df[df['fulfills_protein_DV']==False]

observed_diff = abs(fulfill['avg_rating'].mean() - unfulfilled['avg_rating'].mean())
observed_diff

0.0031266155172193777

In [49]:
sample_diffs = []
for i in range(1000):
    shuffled = np.random.permutation(df['fulfills_protein_DV'])
    shuffled_df = df.copy()
    shuffled_df['fulfills_protein_DV'] = shuffled
    shuffled_df = shuffled_df.groupby('fulfills_protein_DV')['avg_rating'].mean()
    sample_diffs.append(abs(shuffled_df.iloc[1] - shuffled_df.iloc[0]))
(np.array(sample_diffs) >= observed_diff).mean()

0.739

Because our p-value of 0.739 is greater than the significance level, we fail to reject the null hypothesis. There is no statistically significant evidence to suggest that the mean average rating differs between protein-fulfilling and non-protein-fulfilling recipes in the population. Based on this permutation test, and previous bivariate analysis, it does not seem that people rate protein-heavy foods higher or lower than non-protein-heavy foods. 

## Step 5: Framing a Prediction Problem

For our predictive model, we plan on predicting the average rating of a recipe using multivariate regression. Rather than predict the ordinal values of an individual recipe rating (1-5), we decided to predict a recipe's average rating, a continuous response variable that we believe is a better representation of the overall reception and popularity of a recipe. 

We hope to use variables related to nutritional content (e.g. sugar, calories) as predictors of average rating. We initially hoped to use protein as a predictor, but our previous analyses show that there might not be such a strong correlation between protein and average rating. Thus, we will include protein in the model, but might not weight it heavily compared to other variables in the regression. 

We will evaluate our model using the $R^2$ score and root mean squared error (RMSE). The $R^2$ score shows the proportion of the variance in average rating that is predictable from the predictor variables, and thus helps us understand how well our model's predictions match the data. The RMSE measures the typical magnitude of the errors in our model's predictions, in the same units as the rating, and will help us understand the accuracy of our model. We will not use other scores such as F1 because they are designed for classification.
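
Both metrics can be written out directly from their definitions. The sketch below uses made-up true and predicted ratings (not model output) and checks the hand-computed values against `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted average ratings
y_true = np.array([5.0, 4.0, 4.5, 3.0, 5.0])
y_pred = np.array([4.6, 4.6, 4.7, 4.4, 4.7])

residuals = y_true - y_pred

# RMSE = sqrt(mean(residual^2)), in the same units as the rating
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2 = 1 - SS_res / SS_tot, where SS_tot is the variation around the mean rating
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```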

At the time of prediction, we should have access to the nutritional content and all the other features in the `recipes` dataset as described in the Introduction section. These features are related to the recipe and do not include data on users' opinions on the recipe.

## Step 6: Baseline Model

For the baseline model, we want to look at `calories (#)`, `sugar (PDV)`, and `carbohydrates (PDV)`. These features are all continuous and quantitative. 

In [132]:
fig_5 = px.scatter(recipes_avg_ratings, x="calories (#)", y="avg_rating", title='Calories vs Average Rating ')
fig_5.show()

In [133]:
fig_6 = px.scatter(recipes_avg_ratings, x="sugar (PDV)", y="avg_rating", title='Sugar vs Average Rating ')
fig_6.show()

In [134]:
fig_7 = px.scatter(recipes_avg_ratings, x="carbohydrates (PDV)", y="avg_rating", title='Carbohydrates vs Average Rating ')
fig_7.show()

Based on the figures, it might be a good idea to log-transform the predictor variables. However, because many rows have `carbohydrates (PDV)` or `sugar (PDV)` equal to 0, where the logarithm is undefined, a log transform does not work for those two variables. So, they are passed through as is for linear regression. Calories can be log-transformed because calorie values of 0 were set to NaN during cleaning and those rows are dropped before fitting.

Our model is thus a linear regression on untransformed carbohydrates and sugar together with log-transformed calories. 
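
The problem with zeros is easy to see on a few made-up PDV values. The sketch also shows `np.log1p`, which computes $\log(1 + x)$ and is defined at 0; we mention it only as one possible alternative, since it is not what our baseline model uses.

```python
import numpy as np

# Made-up PDV values; the zero mirrors recipes with no sugar
sugar_pdv = np.array([0.0, 5.0, 50.0, 400.0])

# np.log(0) is -inf (normally with a RuntimeWarning), which a regression cannot be fit on
with np.errstate(divide='ignore'):
    plain_log = np.log(sugar_pdv)

# np.log1p(x) = log(1 + x) is defined at 0 and still compresses large values
shifted_log = np.log1p(sugar_pdv)

print(plain_log)    # first entry is -inf
print(shifted_log)  # first entry is 0.0
```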

In [135]:
def baseline_model(data):
    X = data[['calories (#)', 'sugar (PDV)', 'carbohydrates (PDV)']]
    
    y = data['avg_rating']
    
    log_transformer = FunctionTransformer(np.log, validate=True)
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('carbs_sugar', 'passthrough', ['sugar (PDV)', 'carbohydrates (PDV)']), # Use sugar and carbs as is
            ('log', log_transformer, ['calories (#)']),  # Log-scale calories only
        ]
    )
    
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])
    
    pipeline.fit(X, y)
    
    predictions = pipeline.predict(X)

    # Take the square root of the MSE; the `squared` argument is deprecated in newer scikit-learn releases
    print('Root mean squared error: ' + str(np.sqrt(mean_squared_error(y, predictions))))
    print('R^2: ' + str(r2_score(y, predictions)))
    
    return (pipeline, predictions)

baseline_df = recipes_avg_ratings.dropna()
baseline_model(baseline_df)

Root mean squared error: 0.6403491353052199
R^2: 0.0002890072336504401


(Pipeline(steps=[('preprocessor',
                  ColumnTransformer(transformers=[('carbs_sugar', 'passthrough',
                                                   ['sugar (PDV)',
                                                    'carbohydrates (PDV)']),
                                                  ('log',
                                                   FunctionTransformer(func=<ufunc 'log'>,
                                                                       validate=True),
                                                   ['calories (#)'])])),
                 ('regressor', LinearRegression())]),
 array([4.63, 4.62, 4.63, ..., 4.64, 4.63, 4.63]))

Based on the high root mean squared error (0.6403) and an $R^2$ score of almost 0 (0.00029), our baseline model performed extremely poorly. The low $R^2$ indicates that our predictors have low explanatory power. The high RMSE indicates high error and inaccuracy. We must change the model completely going forward.
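
One way to put these numbers in context is to compare against a model that always predicts the mean rating. Since $R^2 = 1 - \text{MSE}/\text{Var}(y)$ on the fitted data, an $R^2$ near 0 means the RMSE is close to the standard deviation of `avg_rating`, so the baseline is barely better than guessing the mean. The sketch below demonstrates this relationship on synthetic ratings (an assumption; the numbers are generated, not taken from our dataset) using scikit-learn's `DummyRegressor`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic ratings (not the real data), clipped to the 1-5 scale
rng = np.random.default_rng(0)
y = np.clip(rng.normal(4.6, 0.64, size=5000), 1, 5)
X = rng.normal(size=(5000, 3))  # features are ignored by the dummy model

# A model that always predicts the mean rating
mean_model = DummyRegressor(strategy='mean').fit(X, y)
pred = mean_model.predict(X)

rmse = np.sqrt(mean_squared_error(y, pred))
r2 = r2_score(y, pred)

print('RMSE of mean-only model:', rmse)  # equals the standard deviation of y
print('Std of y:               ', y.std())
print('R^2 of mean-only model: ', r2)    # 0 up to floating-point error
```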

## Step 7: Final Model

In [None]:
# TODO

## Step 8: Fairness Analysis

In [None]:
# TODO