# Exploratory Analysis of Food Recipes and Reviews

**Name(s)**: Kristen Lee, Jordi Pham

**Website Link**: <a href='https://kristen-lee-120.github.io/wokingwithdata/'>Woking with Data</a>

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

import plotly.express as px
pd.options.plotting.backend = 'plotly'

# from dsc80_utils import * # Feel free to uncomment and use this.

In [None]:
# sklearn
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

## Step 1: Introduction

The recipes and ratings dataset revolves around food recipes and the reviews describing how well those recipes were received. For our project, we are particularly interested in the relationship between cooking time and the average rating of a recipe. Readers should care about our dataset and question because the answer can provide a baseline when choosing a recipe to cook for their next meal. Per our given CSV files, `interactions` has 731927 rows, `recipes` has 83782 rows, and the left-merged dataset has 234429 rows. The columns most relevant to our question are `rating` (the user-given rating of the recipe) and `minutes` (the preparation time of the recipe). Using these two columns, we expect to be able to answer our data science question.

In [None]:
recipes = pd.read_csv('../RAW_recipes_copy.csv')
interactions = pd.read_csv('../RAW_interactions_copy.csv')

In [None]:
print(recipes.shape)
recipes.head()

In [None]:
print(interactions.shape)
interactions.head()

In [None]:
# Left Merge Interactions to Recipes - Keep All Recipes
main = recipes.merge(interactions, how='left', left_on='id', right_on='recipe_id')

In [None]:
print(main.shape)
main.head()
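
To illustrate what the left merge does to the row count, here is a minimal sketch on toy data. The frames `toy_recipes` and `toy_interactions` and all of their values are made up for illustration; only the column names and the merge call mirror the cell above.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are invented)
toy_recipes = pd.DataFrame({'id': [1, 2, 3], 'minutes': [10, 45, 30]})
toy_interactions = pd.DataFrame({'recipe_id': [1, 1, 3], 'rating': [5, 4, 0]})

# Left merge: every recipe is kept, with one output row per matching review.
# Recipe 2 has no review, so it appears once with NaN in the review columns.
toy_main = toy_recipes.merge(
    toy_interactions, how='left', left_on='id', right_on='recipe_id'
)

print(toy_main.shape)
print(toy_main['rating'].isna().sum())
```

In this sketch, a recipe with several reviews contributes several rows, which is why the merged table can have more rows than `recipes`.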

## Step 2: Data Cleaning and Exploratory Data Analysis

In [None]:
# Cleaning Rating Column - 0 to NaN
main['rating'] = main['rating'].replace(0, np.nan)

In [None]:
# view to check
main.head()

In [None]:
# Creating an Average Rating Column
# Creating Series of avg ratings
avg_ratings = main.groupby('id')['rating'].mean()

# map it onto the original df for the new column
main['avg_rating'] = main['id'].map(avg_ratings)

In [None]:
# view to check
main.head()
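
The effect of replacing 0 with NaN before averaging can be seen on a small example. This sketch assumes that a rating of 0 stands for "no rating given" rather than a genuine score, which is our reading of the cleaning step; the toy values are invented. It also uses `transform('mean')` as an alternative to the `groupby` plus `map` pattern above.

```python
import numpy as np
import pandas as pd

# Toy ratings (invented): recipe 1 has one real rating and one 0,
# recipe 3 has only a 0
toy = pd.DataFrame({'id': [1, 1, 2, 2, 3], 'rating': [5, 0, 4, 2, 0]})

# Averages if zeros are treated as real ratings
raw_avg = toy.groupby('id')['rating'].mean()

# Averages after treating zeros as missing; NaN is skipped by mean()
toy['rating'] = toy['rating'].replace(0, np.nan)
toy['avg_rating'] = toy.groupby('id')['rating'].transform('mean')

print(raw_avg)
print(toy)
```

A recipe whose only ratings were 0 ends up with a missing `avg_rating`, which matters for the missingness analysis later in the notebook.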

In [None]:
# Cleaning Nutrition Column
# Define column names
nutrition_cols = ['calories', 
                  'total_fat_pdv', 
                  'sugar_pdv', 
                  'sodium_pdv', 
                  'protein_pdv', 
                  'saturated_fat_pdv', 
                  'carbohydrates_pdv']

# Function to process nutrition data
def process_nutrients(nutri_str):
    try:
        nutri_list = [float(x) for x in nutri_str.strip('[]').split(', ')]  # Convert to float for numerical operations
        if len(nutri_list) == len(nutrition_cols):
            return nutri_list
        else:
            return [None] * len(nutrition_cols)  # Handle unexpected lengths
    except (AttributeError, ValueError):  # non-string input (e.g. NaN) or non-numeric entries
        return [None] * len(nutrition_cols)  # Handle errors gracefully

# Apply function and expand into multiple columns
main[nutrition_cols] = main['nutrition'].apply(process_nutrients).apply(pd.Series)

# Drop the original column if no longer needed
main.drop(columns=['nutrition'], inplace=True)

In [None]:
# view to check
main.head()
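
As a possible alternative to splitting the string by hand, the nutrition strings could be parsed with `ast.literal_eval`. This is only a sketch: the two sample strings are invented, `cols` repeats the column names defined above, and it assumes every entry is a well-formed list literal of seven numbers (the function above is more forgiving of malformed input).

```python
import ast
import pandas as pd

cols = ['calories', 'total_fat_pdv', 'sugar_pdv', 'sodium_pdv',
        'protein_pdv', 'saturated_fat_pdv', 'carbohydrates_pdv']

# Invented sample rows in the same "[a, b, ...]" string format
toy = pd.DataFrame({'nutrition': [
    '[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]',
    '[595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0]',
]})

# Parse each string into a Python list, then expand into named columns
parsed = toy['nutrition'].apply(ast.literal_eval)
expanded = pd.DataFrame(parsed.tolist(), columns=cols, index=toy.index)
toy = toy.join(expanded).drop(columns='nutrition')

print(toy)
```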

In [None]:
# Cleaning tags and adding column for the number of tags per recipe
main['tags'] = main['tags'].str.strip('[]').str.replace("'", '').str.split(', ')
main['n_tags'] = main['tags'].apply(len)

# view to check
main.head()
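
A small sketch of the tag parsing on invented strings shows one edge case worth keeping in mind: with this string-splitting approach, an empty list `'[]'` becomes a list containing one empty string, so it is counted as one tag rather than zero.

```python
import pandas as pd

# Invented tag strings in the same "['a', 'b']" format
tags = pd.Series(["['60-minutes-or-less', 'desserts']", "['easy']", "[]"])

# Same steps as above: drop brackets, drop quotes, split on ", "
parsed = tags.str.strip('[]').str.replace("'", '').str.split(', ')
n_tags = parsed.str.len()

print(parsed.tolist())
print(n_tags.tolist())
```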

In [None]:
# check NA's
n_missing = main.isna().sum()
n_missing

### Univariate, Bivariate, and Aggregate analysis

In [None]:
# isolate some information
time_n_rating = main[['id', 'minutes', 'n_steps', 'rating', 'avg_rating']].drop_duplicates()
time_n_rating.head()

In [None]:
# Univariate
# histogram after removing extreme outliers, such as the recipe listed at roughly a million minutes
Q1 = time_n_rating['minutes'].quantile(0.25)
Q3 = time_n_rating['minutes'].quantile(0.75)
IQR = Q3 - Q1

outliers_condition = (time_n_rating['minutes'] > (Q3 + 1.5 * IQR)) | (time_n_rating['minutes'] < (Q1 - 1.5 * IQR))

# Filter out the outliers
clean_time = time_n_rating[~outliers_condition]

px.histogram(clean_time, x = 'minutes', nbins=20, histnorm='density', title='Density Histogram of cook time for a recipe')
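
The 1.5 × IQR rule used above can be checked by hand on a tiny example. The values in `minutes` below are invented, with one absurdly large entry standing in for an extreme recipe.

```python
import pandas as pd

# Invented cook times with one extreme value
minutes = pd.Series([10, 20, 30, 40, 50, 1_000_000])

q1 = minutes.quantile(0.25)
q3 = minutes.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive, matching the negated strict inequalities above
kept = minutes[minutes.between(lower, upper)]

print(lower, upper)
print(kept.tolist())
```

Because the fences are built from quartiles, a single extreme value barely moves them, which is what makes this rule reasonable for a heavily right-skewed column like `minutes`.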

In [None]:
# Create a scatter plot with no outliers
fig = px.scatter(clean_time, x='minutes', y=clean_time.index, title="Minutes for a recipe without Outliers")
fig.show()

In [None]:
# histogram of average rating
px.histogram(time_n_rating[['avg_rating']], title='Average rating of different recipes')

In [None]:
# box plot of average rating across recipes
px.box(time_n_rating, x='avg_rating', title='average rating of recipes')

In [None]:
# Bivariate
# scatterplot of minutes vs. average rating, outliers in minutes not removed
px.scatter(time_n_rating, x='avg_rating', y='minutes')

In [None]:
# scatterplot with outliers in minutes removed, plus a line of best fit

fig = px.scatter(clean_time, x='avg_rating', y='minutes', trendline='ols', title='Cook time vs Average Rating of recipes')
fig.update_traces(line_color='red')
fig.show()

In [None]:
main.columns

In [None]:
main[(main['minutes']==1680) & (main['avg_rating'] == 5)]

In [None]:
# interesting aggregates
main.pivot_table(index='avg_rating',
               columns='n_steps',
               values='minutes',
               aggfunc='mean')

To read this table, take the first cell, where the average rating is 1 and the recipe takes 1 step: recipes with an average rating of 1 and a single step have a mean cooking time of 12 minutes. All other cells can be read in the same way.

From this table, we can also see that the largest number of steps in the entire `main` dataframe is 100. At 100 steps and an average rating of 5, the mean cook time was 1680 minutes.
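
The way a pivot table cell is computed can be sketched on a few invented rows. Each cell is the mean of `minutes` over all rows sharing that `avg_rating` and `n_steps`; combinations that never occur show up as NaN.

```python
import pandas as pd

# Invented rows: two recipes share avg_rating 1.0 and n_steps 1
toy = pd.DataFrame({
    'avg_rating': [1.0, 1.0, 5.0, 5.0],
    'n_steps':    [1,   1,   1,   2],
    'minutes':    [10,  14,  30,  60],
})

table = toy.pivot_table(index='avg_rating',
                        columns='n_steps',
                        values='minutes',
                        aggfunc='mean')

print(table)
```

Note that in `main` each recipe appears once per review, so the means in the real table are weighted by the number of reviews each recipe received.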

## Step 3: Assessment of Missingness

In [None]:
# NMAR Analysis

In [None]:
main.columns

In [None]:
# Optional check: preview rows containing any missing value
# main[main.isna().any(axis=1)].head()

The `review` column could be deemed NMAR, or not missing at random. This is a dataset of food recipes and their respective reviews, so a missing review can really only be traced back to how a reviewer interacted with the recipe. The food could have been so bad that the person felt no need to leave a review, the recipe may never have been tried in the first place, or the reviewer may simply have forgotten to write a review when logging their interaction.

If we were to change the missingness from NMAR to MAR, where some other column in the dataset could explain the missingness of the `review` column, the additional data that we could possibly obtain would be 

In [None]:
# look at cols with missing values again
n_missing

In [None]:
# Optional check: number of non-missing reviews per recipe
# main.groupby('id')['review'].count()

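Missingness (expected dependence): missingness of `avg_rating` vs. `protein_pdv` (protein, percent daily value), using the difference of means.

Non-missingness (no expected dependence): missingness of `avg_rating` vs. `calories`, using the difference of means.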

In [None]:
# IQR elimination to deal with extreme outliers (maybe move this to cleaning)
Q1 = main['minutes'].quantile(0.25)
Q3 = main['minutes'].quantile(0.75)
IQR = Q3 - Q1

iqr_conditions = ((main['minutes'] > (Q3 + IQR * 1.5)) | (main['minutes'] < (Q1 - IQR * 1.5)))
iqr_main = main[~iqr_conditions]
iqr_main['minutes'].max()


In [None]:
# Missing Mechanism 1 - MAR between average rating and protein_pdv
m_rating_n_tag = iqr_main[iqr_main['avg_rating'].isna()]['protein_pdv'].mean()
nm_rating_n_tag = iqr_main[iqr_main['avg_rating'].notna()]['protein_pdv'].mean()
m_t_stat1 = m_rating_n_tag - nm_rating_n_tag
m_t_stat1

In [None]:
# Shuffle avg rating column
n = 500
shuffler1 = iqr_main.copy()

diff_means1 = []
for _ in range(n):
    # shuffle the column 
    shuffler1['avg_rating'] = np.random.permutation(shuffler1['avg_rating'])
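    # compute the difference in mean protein_pdv (missing minus non-missing)
    dummy_na = shuffler1[shuffler1['avg_rating'].isna()]['protein_pdv'].mean()
    dummy_notna = shuffler1[shuffler1['avg_rating'].notna()]['protein_pdv'].mean()
    diff_means1.append(dummy_na - dummy_notna)

# two-sided p-value: compare absolute shuffled differences to the absolute observed statistic
np.mean(np.abs(diff_means1) >= np.abs(m_t_stat1))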

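The p-value is 0.016, which is greater than our chosen significance level of 0.01, so at this level we fail to reject the null hypothesis that the missingness of `avg_rating` is unrelated to `protein_pdv`.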
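
The shuffle-and-compare procedure used for this test could be wrapped in a small reusable helper. The sketch below is not the code used to produce the p-value above: the function name `missingness_perm_test`, its parameters, the fixed seed, and the toy data are all our own assumptions. It shuffles the missingness indicator rather than the column itself, which is equivalent for this statistic, and uses the absolute difference of means so the test is two-sided.

```python
import numpy as np
import pandas as pd

def missingness_perm_test(df, missing_col, value_col, n_perm=500, seed=0):
    """Permutation test: does missingness of missing_col relate to value_col?

    Returns the observed absolute difference in means and the p-value.
    (Hypothetical helper, written for illustration.)
    """
    rng = np.random.default_rng(seed)
    is_missing = df[missing_col].isna().to_numpy()
    values = df[value_col].to_numpy(dtype=float)

    def abs_diff(mask):
        # |mean(values where missing) - mean(values where present)|
        return abs(np.nanmean(values[mask]) - np.nanmean(values[~mask]))

    observed = abs_diff(is_missing)
    # Shuffling the indicator breaks any link between missingness and values
    sims = np.array([abs_diff(rng.permutation(is_missing))
                     for _ in range(n_perm)])
    return observed, float(np.mean(sims >= observed))

# Invented data in which missing ratings go with high protein values
toy = pd.DataFrame({
    'avg_rating':  [np.nan, np.nan, np.nan, 4.0, 5.0, 4.5, 3.0, 5.0],
    'protein_pdv': [90.0, 85.0, 95.0, 10.0, 12.0, 8.0, 11.0, 9.0],
})

observed, p_value = missingness_perm_test(toy, 'avg_rating', 'protein_pdv')
print(observed, p_value)
```

Fixing the seed makes the reported p-value reproducible from run to run, which the unseeded `np.random.permutation` calls in the notebook do not guarantee.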

In [None]:
# Missing Mechanisms 2 - MAR is nonexistent between average rating and calories
m_rating_cal = main[main['avg_rating'].isna()]['calories'].mean()
nm_rating_cal = main[main['avg_rating'].notna()]['calories'].mean()
m_t_stat2 = m_rating_cal - nm_rating_cal
m_t_stat2

In [None]:
# Shuffle avg rating column
n = 500
shuffler2 = main.copy()

diff_means2 = []
for _ in range(n):
    # shuffle the column
    shuffler2['avg_rating'] = np.random.permutation(shuffler2['avg_rating'])
    # determine the differences
    dummy_na = shuffler2[shuffler2['avg_rating'].isna()]['calories'].mean()
    dummy_isna = shuffler2[shuffler2['avg_rating'].notna()]['calories'].mean()
    diff_means2.append(np.abs(dummy_na - dummy_isna))

np.mean(np.array(diff_means2) >= np.abs(m_t_stat2))

## Step 4: Hypothesis Testing

Null: There is no relationship between average rating of a recipe and its cooking time.

Alt: There is a relationship between the average rating of a recipe and its cooking time.

Test statistic: Pearson's r between `minutes` and `avg_rating`, assessed with a permutation test.
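
For reference, the statistic and the permutation p-value can be written out as follows. The notation is ours: $x_i$ is `minutes` and $y_i$ is `avg_rating` for row $i$, $n$ is the number of rows where both are present (pandas drops pairs with a missing value when computing a correlation), $B$ is the number of shuffles, and $r_b$ is the correlation after the $b$-th shuffle.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

$$
p = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\left[\, |r_b| \ge |r_{\text{obs}}| \,\right]
$$

Taking absolute values on both sides makes the test two-sided, matching the alternative hypothesis that there is a relationship in either direction.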

In [None]:
# observed test statistic, correlation between minutes and average rating
obs_test_stat = iqr_main['minutes'].corr(iqr_main['avg_rating'])
obs_test_stat

In [None]:
n = 500
shuffler3 = iqr_main.copy()

corrs = []
for _ in range(n):
    # shuffle the column
    shuffler3['avg_rating'] = np.random.permutation(shuffler3['avg_rating'])
    # compute the correlation under the shuffle
    dummy_corr = iqr_main['minutes'].corr(shuffler3['avg_rating'])
    corrs.append(dummy_corr)

np.mean(np.abs(corrs) >= np.abs(obs_test_stat))

In [None]:
# Create the histogram using Plotly Express
fig = px.histogram(
    x=corrs, 
    nbins=30, 
    opacity=0.7, 
    title="Permutation Test: Distribution of Shuffled Correlations",
    labels={"x": "Shuffled Correlations", "y": "Frequency"},
    template="plotly_white"
)

# Add observed test statistic as a vertical line
fig.add_vline(x=obs_test_stat, line=dict(color="red", width=2, dash="dash"), annotation_text="Observed Test Stat", annotation_position="top")
fig.add_vline(x=-np.abs(obs_test_stat), line=dict(color="blue", width=2, dash="dash"), annotation_text="- |Observed|", annotation_position="top left")
fig.add_vline(x=np.abs(obs_test_stat), line=dict(color="blue", width=2, dash="dash"), annotation_text="+ |Observed|", annotation_position="top right")

# Show the plot
fig.show()

In [None]:
# shown with a scatterplot as well
fig = px.scatter(iqr_main, x="minutes", y="avg_rating", 
                 title="Scatter Plot of Minutes vs. Avg Rating",
                 labels={"minutes": "Minutes", "avg_rating": "Average Rating"},
                 template="plotly_white",
                 trendline="ols")  # Ordinary Least Squares Regression

# Compute Pearson correlation coefficient (r)
pearson_r = iqr_main['minutes'].corr(iqr_main['avg_rating'])

# Add annotation with Pearson's r value
fig.add_annotation(
    x=min(iqr_main["minutes"]),  # Position on the x-axis
    y=max(iqr_main["avg_rating"]),  # Position on the y-axis
    text=f"Pearson's r = {pearson_r:.3f}",  # Show r rounded to 3 decimals
    showarrow=False,
    font=dict(size=14, color="black"),
    align="left",
    xanchor="left",
    yanchor="top"
)

# Show plot
fig.show()

## Step 5: Framing a Prediction Problem

We plan to predict the number of steps in a recipe (`n_steps`) from its cooking time (`minutes`) and its number of ingredients (`n_ingredients`). Because the target is numeric, this is a regression problem.

## Step 6: Baseline Model

In [None]:
# Predictive Models
# Training predictive model to predict number of steps based on cooking time and number of ingredients

In [None]:
pipeline = Pipeline([
    ('model', LinearRegression())
    ])

In [None]:
X = iqr_main[['minutes', 'n_ingredients']]
Y = iqr_main['n_steps']

In [None]:
pipeline.fit(X, Y)

In [None]:
pipeline.score(X, Y)
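
The score above is computed on the same rows the model was fit on, so it says little about how the model generalizes. One common way to estimate that is to hold out a test set. The sketch below uses synthetic data, not the recipes data: the frame `toy`, its coefficients, the 75/25 split, and the seeds are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the recipe features (all values are generated)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'minutes': rng.integers(5, 120, size=n),
    'n_ingredients': rng.integers(2, 20, size=n),
})
# Synthetic target: linear in both features plus noise
toy['n_steps'] = (2 + 0.05 * toy['minutes']
                  + 0.5 * toy['n_ingredients']
                  + rng.normal(0, 1, size=n))

X = toy[['minutes', 'n_ingredients']]
y = toy['n_steps']

# Hold out 25% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = Pipeline([('model', LinearRegression())])
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)   # R^2 on seen data
test_r2 = model.score(X_test, y_test)      # R^2 on unseen data
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print(train_r2, test_r2, test_rmse)
```

If this were applied to `iqr_main`, it might be worth deduplicating to one row per recipe first, since each recipe appears once per review and a row-level split would otherwise place the same recipe in both the training and test sets.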

In [None]:
# Create a DataFrame for Plotly
results_df = pd.DataFrame({
    'Actual Steps': Y,
    'Predicted Steps': pipeline.predict(X)
})

# Scatter plot using Plotly
fig = px.scatter(
    results_df, 
    x='Actual Steps', 
    y='Predicted Steps',
    title='Actual vs. Predicted Number of Steps',
    opacity=0.5,  # Equivalent to `alpha` in matplotlib
)

# Final touches
fig.update_layout(
    xaxis_title='Actual Steps',
    yaxis_title='Predicted Steps',
    template='plotly_white'
)

fig.show()

In [None]:
# Create a DataFrame for Plotly
results_df = pd.DataFrame({
    'Predicted Steps': pipeline.predict(X)
})

# Distribution plot
fig = px.histogram(
    results_df, 
    x='Predicted Steps',
    nbins=30,  # Adjust the number of bins as needed
    title='Distribution of Predicted Steps',
    marginal='box',  # Adds a box plot for additional insights
)

# Final touches
fig.update_layout(
    xaxis_title='Predicted Steps',
    yaxis_title='Count',
    template='plotly_white'
)

fig.show()

In [None]:
# Scatter plot 1: Number of Steps vs Cooking Minutes
fig1 = px.scatter(
    iqr_main, 
    x='minutes', 
    y='n_steps',
    title='Number of Steps vs Cooking Minutes',
    labels={'minutes': 'Cooking Minutes', 'n_steps': 'Number of Steps'},
    opacity=0.5
)
fig1.show()

# Scatter plot 2: Number of Steps vs Number of Ingredients
fig2 = px.scatter(
    iqr_main, 
    x='n_ingredients', 
    y='n_steps',
    title='Number of Steps vs Number of Ingredients',
    labels={'n_ingredients': 'Number of Ingredients', 'n_steps': 'Number of Steps'},
    opacity=0.5
)
fig2.show()

## Step 7: Final Model

In [None]:
# TODO

## Step 8: Fairness Analysis

In [None]:
# TODO