# Recipe Project

**Name(s)**: Ryan Lindberg

**Website Link**: lindbergryan04.github.io/reponame

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import os
import plotly.express as px
import ast
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.plotting.backend = 'plotly'
from dsc80_utils import * 

## Step 1: Introduction

In [None]:
pd.set_option('display.max_rows', 50)
pd.set_option('display.float_format', '{:.2f}'.format)

In [None]:
food_data = pd.read_csv('data/RAW_recipes.csv')
interactions_data = pd.read_csv('data/RAW_interactions.csv')
interactions_data['recipe_id'] = interactions_data['recipe_id'].astype(int)
recipes = food_data.merge(interactions_data, how = 'left', right_on = 'recipe_id', left_on = 'id')
recipes = recipes.set_index('id').drop(columns = ['Unnamed: 0_x','Unnamed: 0_y','recipe_id'])

In [None]:
recipes

## Understanding the Data

### Index
- **id:** Recipe ID

### Columns
- **name:** Recipe name
- **minutes:** Minutes required to prepare the recipe
- **contributor_id:** User ID who submitted the recipe
- **submitted:** Date the recipe was submitted
- **tags:** Food.com tags associated with the recipe
- **nutrition:** Nutrition information in the format:  
  `[calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]`
- **n_steps:** Number of steps in the recipe
- **steps:** Text for the recipe steps, in order
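- **description:** User-provided description of the recipe
- **ingredients:** Ingredients used in the recipe
- **n_ingredients:** Number of ingredients in the recipe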
- **user_id:** ID of the user who interacted with the recipe
- **date:** Date of the interaction
- **rating:** Rating given in the interaction
- **review:** Review text from the interaction

**Note:** Because this is a left merge on the recipes table, recipes with no interactions are kept, so some rows have missing values for `user_id`, `date`, `rating`, and `review`.
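
As a minimal sketch of why the left merge leaves missing values, the toy tables below (hypothetical values, not taken from the dataset) mimic the recipe and interaction tables. A recipe with no matching interaction still appears once in the result, with `NaN` in the interaction columns.

```python
import pandas as pd

# Toy stand-ins for the recipe and interaction tables (hypothetical values).
toy_recipes = pd.DataFrame({"id": [1, 2, 3], "name": ["soup", "salad", "stew"]})
toy_interactions = pd.DataFrame({"recipe_id": [1, 1, 3], "rating": [5, 4, 3]})

# Same join shape as the notebook: keep every recipe, attach interactions where they exist.
merged = toy_recipes.merge(
    toy_interactions, how="left", left_on="id", right_on="recipe_id"
)

# Recipe 2 has no interactions, so its interaction columns are NaN.
n_unrated = merged["rating"].isna().sum()
print(merged)
print("rows without a rating:", n_unrated)
```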

### Possible Questions:

**1.** What types of recipes tend to be healthier (i.e. more protein, fewer carbs)?

**2.** What types of recipes tend to have higher average ratings?

**3.** What is the relationship between the number of steps and rating of recipes?

**4.** What is the relationship between nutritional metrics (e.g., calories, total fat, sugar) and user ratings?

**5.** Which recipes offer the best balance between healthiness, quick preparation, and taste?

**Motivation:**

Since transferring to UCSD, my cooking has become less healthy because of limited food access and limited time. I'm curious which ingredients and recipes could help me fix this. Ideally these recipes are cost-effective in both time and money and use simple ingredients available at the campus Target. The food should also be tasty, of course!

I think **Question #5** will be the best to help me solve my predicament. 

## Step 2: Data Cleaning and Exploratory Data Analysis

### Cleaning

Replace 0 ratings with `np.nan`:

In [None]:
recipes['rating'] = recipes['rating'].replace(0, np.nan)

Add avg_rating column (avg rating for each recipe):

In [None]:
recipes['avg_rating'] = recipes.groupby('id')['rating'].transform('mean')
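
The two cleaning steps above interact: once 0 ratings become `NaN`, the per-recipe mean skips them. A small sketch on hypothetical data (the `toy` frame and its values are assumptions for illustration) shows the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews: recipe 10 has two rows, recipes 20 and 30 have one each.
toy = pd.DataFrame(
    {"rating": [5, 0, 4, 0]},
    index=pd.Index([10, 10, 20, 30], name="id"),
)

# Treat 0 as missing, then broadcast each recipe's mean rating back to its rows.
toy["rating"] = toy["rating"].replace(0, np.nan)
toy["avg_rating"] = toy.groupby("id")["rating"].transform("mean")
print(toy)
```

Recipe 10 ends up with an average based only on its non-missing rating, and recipe 30, whose only rating was 0, has a missing average.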

Remove unnecessary columns:

In [None]:
recipes = recipes.drop(columns = ['review','contributor_id','user_id','date','submitted','description','steps'])

Convert string lists to actual lists:

For example: The string "['60-minutes-or-less', 'time-to-make', 'course']" from tags column, or the string '[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]' from nutrition column.

In [None]:
columns_to_convert = ['tags','nutrition','ingredients']  

for col in columns_to_convert:
    recipes[col] = recipes[col].apply(ast.literal_eval)
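
For a single value, the conversion looks like this. The sketch uses the two example strings quoted above and only the standard-library `ast.literal_eval`.

```python
import ast

# The example strings from the tags and nutrition columns.
raw_tags = "['60-minutes-or-less', 'time-to-make', 'course']"
raw_nutrition = "[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]"

# literal_eval safely parses Python literals without executing arbitrary code.
tags = ast.literal_eval(raw_tags)
nutrition = ast.literal_eval(raw_nutrition)

print(type(raw_tags).__name__, "->", type(tags).__name__)
print("calories:", nutrition[0])
```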

Parse nutrition lists into their own columns:

  `[calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]`


In [None]:
recipes[[
    'calories',
    'total_fat',
    'sugar',
    'sodium',
    'protein',
    'sat_fat',
    'carbs'
]] = pd.DataFrame(recipes['nutrition'].tolist(), index=recipes.index)

In [None]:
recipes = recipes.drop(columns = ['nutrition'])

In [None]:
recipes

### Exploratory Analysis

**EDA**

What is the distribution of preparation times (minutes) across recipes?

How do ratings (as a proxy for taste) correlate with preparation time and nutritional metrics (e.g., calories, total fat, sugar, etc.)?

**What to look at**

Plot histograms of preparation times and ratings.

Create scatter plots comparing minutes versus rating, and overlay nutritional metrics (e.g., using color or size) to see if quicker recipes tend to have better or worse nutritional profiles.

Compute correlation matrices to quantify relationships between variables such as minutes, calories, fat, and rating.

How does preparation time (minutes) correlate with rating?

In [None]:
eda_recipes = recipes.copy()

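Remove duplicate recipes so that each recipe appears once. After this step, `rating` holds only the rating from the first merged row of each recipe, while `avg_rating` still holds the per-recipe mean: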

In [None]:
eda_recipes = eda_recipes[~eda_recipes.index.duplicated(keep='first')]
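
A minimal sketch of what index-based deduplication keeps, using a hypothetical `toy` frame: only the first row per `id` survives, so a per-row column such as `rating` reflects that single row, while a per-recipe column such as `avg_rating` is unaffected.

```python
import pandas as pd

# Hypothetical rows: recipe 10 appears twice with different ratings.
toy = pd.DataFrame(
    {"rating": [5.0, 3.0, 4.0], "avg_rating": [4.0, 4.0, 4.0]},
    index=pd.Index([10, 10, 20], name="id"),
)

# index.duplicated marks repeats of an id; ~ keeps the first occurrence only.
one_per_recipe = toy[~toy.index.duplicated(keep="first")]
print(one_per_recipe)
```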


Let's look at the descriptive statistics of rating and prep time:

In [None]:
print(eda_recipes['minutes'].describe())
print(eda_recipes['rating'].describe())


In [None]:
np.count_nonzero(eda_recipes['minutes'] > (60 * 4))

The maximum prep time is 1051200 minutes, which is two years. Let's remove the outliers with a prep time of 4 hours or more so we can plot the distribution.

The mean rating for all recipes is 4.65, which is oddly high. Doubled onto a 10-point scale, that is equivalent to 9.3. Either the recipes in this dataset are very good, or the ratings are biased toward higher values.

In [None]:
eda_recipes = eda_recipes[eda_recipes['minutes'] < (60 * 4)]

In [None]:
# Use the built-in 'fivethirtyeight' style (plt and np are imported at the top)
plt.style.use('fivethirtyeight')

plt.figure(figsize=(8, 6))
plt.hist(eda_recipes['minutes'], bins=np.arange(0, 185, 5),
         edgecolor='black', color='#4c72b0')  # A pleasing blue

plt.ticklabel_format(style='plain', axis='x')
plt.xlabel("Minutes", fontsize=12, fontweight='bold')
plt.ylabel("Frequency", fontsize=12, fontweight='bold')
plt.title("Distribution of Recipe Minutes", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()


# Second figure: distribution of ratings (plt and np are imported at the top)
plt.style.use('fivethirtyeight')

plt.figure(figsize=(8, 6))
# Drop missing ratings explicitly instead of relying on plt.hist to skip NaN
plt.hist(eda_recipes['rating'].dropna(), bins=5,
         edgecolor='black', color='#4c72b0')  # A pleasing blue

plt.ticklabel_format(style='plain', axis='x')
plt.xlabel("Rating", fontsize=12, fontweight='bold')
plt.ylabel("Frequency", fontsize=12, fontweight='bold')
plt.title("Distribution of Ratings", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Next, which nutritional values are most linearly correlated with rating?

In [None]:
# List of numeric columns you want to correlate
cols_to_correlate = [
    'rating',
    'calories',
    'total_fat',
    'sugar',
    'sodium',
    'protein',
    'sat_fat',
    'carbs'
]

# Calculate the correlation matrix (Pearson correlation by default)
corr_matrix = eda_recipes[cols_to_correlate].corr()

# Print the correlation matrix
print(corr_matrix)

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(
    corr_matrix,
    annot=True,      # Annotate each cell with the correlation value
    cmap='coolwarm', # Color palette; try 'viridis', 'YlGnBu', etc.
    vmin=-1, vmax=1  # Correlation ranges from -1 to 1
)
plt.title("Correlation Heatmap")
plt.show()

There appears to be no linear correlation between any nutritional metric and rating. My intuition was that sugar and carbs might have some positive association with rating. We can also see that calories and total fat are highly correlated, as are sugar and carbs.

In [None]:
# List of numeric columns you want to correlate
cols_to_correlate = [
    'rating',
    'minutes',
    'n_steps',
    'n_ingredients',
]

# Calculate the correlation matrix (Pearson correlation by default)
corr_matrix = eda_recipes[cols_to_correlate].corr()

# Print the correlation matrix
print(corr_matrix)

So if the nutrition facts don't tell us about the rating, what does?

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(
    corr_matrix,
    annot=True,      # Annotate each cell with the correlation value
    cmap='coolwarm', # Color palette; try 'viridis', 'YlGnBu', etc.
    vmin=-1, vmax=1  # Correlation ranges from -1 to 1
)
plt.title("Correlation Heatmap")
plt.show()

So it seems that there is no linear correlation between rating and any of our variables. Let's take a look at the scatter plots to verify.
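
One reason to look at scatter plots: a Pearson correlation near zero only rules out a linear relationship. The sketch below uses synthetic data (not the recipe data) in which `y` is fully determined by `x`, yet the Pearson coefficient is essentially zero.

```python
import numpy as np

# Synthetic data: y depends on x exactly, but the dependence is not linear.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation measures only the linear component of the relationship.
pearson_r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r between x and x**2: {pearson_r:.3f}")
```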

## Step 3: Assessment of Missingness

In [None]:
# TODO

## Step 4: Hypothesis Testing

In [None]:
# TODO

## Step 5: Framing a Prediction Problem

### 2. Data Preprocessing and Feature Engineering

**Extract Nutritional Data:**
Parse the `nutrition` column into individual features (calories, total fat, sugar, sodium, protein, saturated fat, carbohydrates). This will allow you to standardize each metric.

**Normalization:**
Normalize each feature so that all metrics are on a similar scale. For example, convert calories and protein into z-scores or min-max scaled values.
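
The two normalizations mentioned above can be sketched as follows. The `toy` values are hypothetical, and the population standard deviation (`ddof=0`) is an assumption; the sample version would work equally well here.

```python
import pandas as pd

# Hypothetical nutrition values; column names follow the cleaned table.
toy = pd.DataFrame(
    {"calories": [100.0, 200.0, 300.0], "protein": [5.0, 10.0, 30.0]}
)

# z-score: subtract the column mean, divide by the column standard deviation.
z_scores = (toy - toy.mean()) / toy.std(ddof=0)

# min-max: rescale each column to the range [0, 1].
min_max = (toy - toy.min()) / (toy.max() - toy.min())

print(z_scores)
print(min_max)
```

Z-scores keep outliers visible as large values, while min-max scaling compresses everything into a fixed range, so a single extreme recipe can dominate the min-max scale.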

**Create Derived Metrics:**

- **Healthiness Score:** A possible formulation might be:
  `Healthiness = Normalized Protein - (Normalized Calories + Normalized Saturated Fat + Normalized Sodium)`
  (You can adjust the formula and include other nutrients as needed.)
- **Quickness Score:** Compute something like:
  `Quickness = 1 / (1 + minutes)`
  so that lower preparation times yield a higher score.
- **Taste Score:** Directly use the normalized average rating.
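
The three derived metrics could be computed roughly as below. This is a sketch on hypothetical recipes; the helper `zscore` and the choice of z-scores as the normalization are assumptions, not a fixed part of the plan.

```python
import pandas as pd

# Hypothetical recipes; column names follow the cleaned table.
toy = pd.DataFrame(
    {
        "calories": [200.0, 600.0, 400.0],
        "protein": [30.0, 10.0, 20.0],
        "sat_fat": [5.0, 40.0, 20.0],
        "sodium": [10.0, 50.0, 30.0],
        "minutes": [15, 90, 45],
        "avg_rating": [4.5, 5.0, 4.0],
    }
)


def zscore(col: pd.Series) -> pd.Series:
    """Standardize a column to mean 0 and population standard deviation 1."""
    return (col - col.mean()) / col.std(ddof=0)


# Healthiness: reward protein, penalize calories, saturated fat, and sodium.
toy["healthiness"] = zscore(toy["protein"]) - (
    zscore(toy["calories"]) + zscore(toy["sat_fat"]) + zscore(toy["sodium"])
)

# Quickness: shorter preparation times give a higher score.
toy["quickness"] = 1 / (1 + toy["minutes"])

# Taste: normalized average rating.
toy["taste"] = zscore(toy["avg_rating"])

print(toy[["healthiness", "quickness", "taste"]])
```

Note that `quickness` lives on a different scale from the two z-score-based metrics, so it would likely need its own normalization before the three are combined.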

### 3. Constructing the Composite Score

Combine the three components using a weighted sum:

$$\text{Composite Score} = \alpha \times \text{Healthiness Score} + \beta \times \text{Taste Score} + \gamma \times \text{Quickness Score}$$
**Weights (α, β, γ):**
These values represent how important each component is relative to the others. You can start with equal weights or adjust them based on your personal preference (e.g., if healthiness is your top priority, choose a higher α).
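
A weighted sum of this kind might look like the sketch below. The component scores are hypothetical and assumed to be on comparable scales already; the equal weights are only a starting point, not a tuned choice.

```python
import pandas as pd

# Hypothetical component scores for three recipes.
scores = pd.DataFrame(
    {
        "healthiness": [1.0, -1.0, 0.0],
        "taste": [0.0, 1.0, -0.5],
        "quickness": [1.0, -1.0, 0.0],
    },
    index=["recipe_a", "recipe_b", "recipe_c"],
)

# alpha, beta, gamma: equal weights as a starting assumption.
weights = pd.Series({"healthiness": 1 / 3, "taste": 1 / 3, "quickness": 1 / 3})

# Multiply each column by its weight, then sum across columns per recipe.
composite = scores.mul(weights, axis="columns").sum(axis="columns")
ranking = composite.sort_values(ascending=False)
print(ranking)
```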


### 4. Model Development Options

You have a couple of paths for implementing this:

**Rule-Based Model:**
Start with the weighted sum approach described above. This is transparent and easy to adjust, especially if you don't have a target variable that directly represents the "ideal" balance.

**Supervised Machine Learning:**
If you have historical data where recipes are labeled (or you can derive a target composite score from user interactions), you could train a regression model (e.g., linear regression, random forest, gradient boosting) to predict the composite score from the features.

- **Training Data:** Use your engineered features (healthiness, quickness, taste) and any additional recipe metadata.
- **Validation:** Apply cross-validation to ensure that your model generalizes well.

**Multi-Criteria Decision Making (MCDM):**
Techniques like the Analytic Hierarchy Process (AHP) or TOPSIS can also be used to rank recipes based on multiple attributes. This might be particularly useful if you want a model that doesn't necessarily require a large training dataset.
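
To make the MCDM option more concrete, here is a minimal NumPy sketch of the standard TOPSIS steps (vector normalization, weighting, distance to the ideal best and worst alternatives). The function name `topsis`, the criteria chosen, the equal weights, and the recipe values are all assumptions for illustration.

```python
import numpy as np


def topsis(matrix, weights, is_benefit):
    """Return a closeness score in [0, 1] per alternative (higher is better)."""
    matrix = np.asarray(matrix, dtype=float)
    weights = np.asarray(weights, dtype=float)
    is_benefit = np.asarray(is_benefit, dtype=bool)

    # Vector-normalize each criterion column, then apply the weights.
    weighted = matrix / np.linalg.norm(matrix, axis=0) * weights

    # Ideal best: max for benefit criteria, min for cost criteria (worst is the reverse).
    ideal_best = np.where(is_benefit, weighted.max(axis=0), weighted.min(axis=0))
    ideal_worst = np.where(is_benefit, weighted.min(axis=0), weighted.max(axis=0))

    # Euclidean distance of each alternative to the two ideal points.
    dist_best = np.linalg.norm(weighted - ideal_best, axis=1)
    dist_worst = np.linalg.norm(weighted - ideal_worst, axis=1)
    return dist_worst / (dist_best + dist_worst)


# Columns: protein (benefit), avg_rating (benefit), minutes (cost). Hypothetical values.
recipes_matrix = [[30, 4.8, 15], [10, 4.9, 90], [20, 4.0, 45]]
closeness = topsis(
    recipes_matrix, weights=[1 / 3, 1 / 3, 1 / 3], is_benefit=[True, True, False]
)
best = int(np.argmax(closeness))
print(closeness, "best row:", best)
```

The sketch does not guard against the degenerate case where all alternatives are identical, which would make both distances zero.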

### 5. Implementation and Evaluation

**Model Tuning:**
Experiment with different weight combinations (α, β, γ) to see how your recipe rankings change. You might use grid search or a similar optimization technique to find the best performing combination.

**Validation:**
Compare the composite scores against user feedback (ratings or reviews) to check if higher scores correlate with better perceived recipes.

**Iterative Refinement:**
Gather feedback and adjust the model. For instance, if you notice that recipes with short preparation times but very unhealthy profiles are scoring too high, you might need to penalize the healthiness score more heavily.
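
Experimenting with weights could start with a simple grid like the one sketched below. The component scores are hypothetical, the weights are constrained to sum to 1 (an assumption), and the sketch only reports which recipe ranks first for each combination; choosing a "best" combination would need an objective that is not defined here.

```python
import itertools

import pandas as pd

# Hypothetical component scores, assumed to be on comparable scales.
scores = pd.DataFrame(
    {
        "healthiness": [0.9, 0.1, 0.45],
        "taste": [0.2, 0.9, 0.55],
        "quickness": [0.3, 0.8, 0.5],
    },
    index=["recipe_a", "recipe_b", "recipe_c"],
)

# Grid of (alpha, beta, gamma) with gamma = 1 - alpha - beta and all weights >= 0.
steps = [0.0, 0.25, 0.5, 0.75, 1.0]
rows = []
for alpha, beta in itertools.product(steps, repeat=2):
    gamma = 1.0 - alpha - beta
    if gamma < 0:
        continue
    composite = (
        alpha * scores["healthiness"]
        + beta * scores["taste"]
        + gamma * scores["quickness"]
    )
    rows.append(
        {"alpha": alpha, "beta": beta, "gamma": gamma, "top_recipe": composite.idxmax()}
    )

results = pd.DataFrame(rows)
print(results["top_recipe"].value_counts())
```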

## Step 6: Baseline Model

In [None]:
# TODO

## Step 7: Final Model

In [None]:
# TODO

## Step 8: Fairness Analysis

In [None]:
# TODO