# Do Longer Recipes Get Higher Ratings?

**Name(s)**: Casey So and Keilani Li

**Website Link**: https://keil4ni.github.io/recipe-analysis/

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

import plotly.express as px
pd.options.plotting.backend = 'plotly'

# from dsc80_utils import * # Feel free to uncomment and use this.

## Step 1: Introduction

When looking for a recipe online, one of the first things people notice besides the ingredients is how long it takes to cook. Some users are looking for quick meals they can prepare in under 30 minutes, while others are willing to invest time in more complex dishes. But does the time required to cook a recipe actually affect how well it's rated?

This project explores the connection between cooking time and user ratings of recipes. The goal is to find out whether recipes that take longer to make tend to receive better ratings, or if users prefer faster, simpler options. To do this, we will be working with a dataset of recipes that includes details like total cooking time, ingredients, steps, and user ratings.

By analyzing these variables, we want to see if there's a pattern: do people reward effort with higher ratings, or do they value convenience more? The results might help explain what makes a recipe more appealing to home cooks, and whether time investment is actually reflected in how satisfied users are with the outcome.

## Data Sets


We are analyzing two datasets from Food.com, containing recipes and user ratings posted between 2008 and 2018. These datasets were originally compiled for a research paper on recommender systems titled "Generating Personalized Recipes from Historical User Preferences" by Majumder et al.

The first dataset, called recipes, includes 83,782 entries, each representing a unique recipe. It contains the 12 columns listed below, which capture various attributes of each recipe:

      Column             | Description
      -------------------|------------------
      'name'             | Recipe name
      'id'               | Recipe ID
      'minutes'          | Minutes to prepare recipe
      'contributor_id'   | User ID who submitted this recipe
      'submitted'        | Date recipe was submitted
      'tags'             | Food.com tags for recipe
      'nutrition'        | Nutrition information in the form
                         | [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV),
                         | saturated fat (PDV), carbohydrates (PDV)];
                         | PDV stands for "percentage of daily value"
      'n_steps'          | Number of steps in recipe
      'steps'            | Text for recipe steps, in order
      'description'      | User-provided description
      'ingredients'      | Text for recipe ingredients
      'n_ingredients'    | Number of ingredients in recipe
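
The `'nutrition'` column packs seven values into one list-like entry, which is awkward to analyze directly. The sketch below shows one way it could be split into separate columns, assuming the column is read in as a string such as `'[138.4, 10.0, ...]'`. The column names in `NUTRITION_COLS` and the toy values are hypothetical and only follow the order given in the table above.

```python
import ast
import pandas as pd

# Hypothetical names for the seven nutrition values, in the table's order.
NUTRITION_COLS = [
    'calories', 'total_fat_pdv', 'sugar_pdv', 'sodium_pdv',
    'protein_pdv', 'saturated_fat_pdv', 'carbohydrates_pdv',
]

# Toy rows standing in for the real recipes table (values are made up).
toy = pd.DataFrame({
    'id': [1, 2],
    'nutrition': ['[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]',
                  '[595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0]'],
})

# Assumption: each entry is a string that looks like a Python list.
parsed = toy['nutrition'].apply(ast.literal_eval)

# Expand the parsed lists into one column per nutrition value.
nutrition = pd.DataFrame(parsed.tolist(), columns=NUTRITION_COLS, index=toy.index)
toy_expanded = pd.concat([toy.drop(columns='nutrition'), nutrition], axis=1)
```

If the real column is already stored as lists rather than strings, the `ast.literal_eval` step would not be needed.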

The second dataset, interactions, contains 731,927 entries, with each row representing a user's interaction with a specific recipe—typically a review or rating. This dataset helps capture user preferences and engagement over time. The columns included are:

      Column             | Description
      -------------------|------------------
      'user_id'          | User ID
      'recipe_id'        | Recipe ID
      'date'             | Date of interaction
      'rating'           | Rating given
      'review'           | Review text

In [None]:
# TODO

## Step 2: Data Cleaning and Exploratory Data Analysis

In [None]:
# read in recipes df
recipes_path = Path('data') / 'RAW_recipes.csv'
recipes = pd.read_csv(recipes_path)
recipes.head()

In [None]:
# read in interactions df
interactions_path = Path('data') / 'interactions.csv'
interactions = pd.read_csv(interactions_path)
interactions.head()

In [None]:
# merge recipes + interactions dfs
df = recipes.merge(interactions, how = 'left', left_on = 'id', right_on = 'recipe_id')
df.head()
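
A left merge keeps every recipe, including any that have no matching interaction, and those rows end up with missing values in the interaction columns. The sketch below uses made-up toy frames to show how pandas' `indicator=True` option could be used to count such rows. The names `toy_recipes`, `toy_interactions`, and `unmatched` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up).
toy_recipes = pd.DataFrame({'id': [1, 2, 3], 'minutes': [20, 45, 90]})
toy_interactions = pd.DataFrame({'recipe_id': [1, 1, 2], 'rating': [5, 4, 0]})

# indicator=True adds a '_merge' column saying where each row came from.
merged = toy_recipes.merge(
    toy_interactions,
    how='left',
    left_on='id',
    right_on='recipe_id',
    indicator=True,
)

# Recipes with no interactions show up as 'left_only' and have NaN recipe_id.
unmatched = merged[merged['_merge'] == 'left_only']
```

Here recipe 3 has no reviews, so it appears once in `merged` with a missing `recipe_id` and `rating`.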

In [None]:
# num of (rows, cols)
df.shape

In [None]:
# num of nans before replacement
df[df['rating'].isnull()].shape

In [None]:
# num of 0 ratings before replacement
df[df['rating'] == 0.0].shape

In [None]:
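# replace 0 ratings with nan in the rating col only, then recount nans
# (df.replace(0.0, np.nan) would also turn real zeros in other cols, e.g. minutes, into nan)
df['rating'] = df['rating'].replace(0, np.nan)
df[df['rating'].isnull()].shape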
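
To illustrate why the zeros are replaced, the toy sketch below assumes that a rating of 0 marks a review left without a score rather than a genuine rating. Under that assumption, keeping the zeros would pull a recipe's mean rating down. The frame and its values are made up, and the replacement is limited to the `rating` column so zeros in other columns stay as they are.

```python
import numpy as np
import pandas as pd

# Toy reviews for a single recipe (values are made up).
toy = pd.DataFrame({
    'recipe_id': [1, 1, 1],
    'minutes': [0, 0, 0],   # a zero that should NOT become NaN
    'rating': [5, 0, 4],    # assumption: 0 means "no rating given"
})

# Mean with the placeholder zero counted as a real score.
mean_with_zeros = toy['rating'].mean()

# Replace zeros in the rating column only, then recompute the mean.
cleaned = toy.assign(rating=toy['rating'].replace(0, np.nan))
mean_without_zeros = cleaned['rating'].mean()
```

Because `mean` skips NaN by default, the cleaned mean averages only the two actual scores.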

In [None]:
# find avg rating per recipe
df['avg_rating'] = df.groupby('id')['rating'].transform('mean')
df
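
As a small illustration of what `transform('mean')` does here, the toy sketch below (made-up values) shows that it returns one value per original row, so every review of a recipe carries that recipe's average. A plain `.mean()` on the same groupby would instead return one row per recipe.

```python
import numpy as np
import pandas as pd

# Toy merged table: two reviews each for recipes 1 and 2, one for recipe 3.
toy = pd.DataFrame({
    'id': [1, 1, 2, 2, 3],
    'rating': [5.0, 4.0, np.nan, 3.0, np.nan],
})

# transform broadcasts each group's mean back onto that group's rows.
toy['avg_rating'] = toy.groupby('id')['rating'].transform('mean')

# For contrast: one aggregated value per recipe.
per_recipe = toy.groupby('id')['rating'].mean()
```

A recipe whose ratings are all missing (recipe 3 above) gets a missing average as well.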

In [None]:
# check if avg rating is correct for any recipe
df[(df['id'] == 306168)]

In [None]:
df.columns

In [None]:
# df[['id', 'recipe_id']]

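# drop id bc it duplicates recipe_id for every matched row; recipe_id is a more specific col name
# (note: after the left merge, recipe_id is nan for any recipe with no interactions)
# drop contributor_id bc it's only an identifier and doesn't contribute to our analysis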

df = df.drop(columns = ['id', 'contributor_id'])

In [None]:
df

In [None]:
# TODO: reorder cols for better readability
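
One possible way to do the reordering is sketched below on a made-up frame. The particular order in `front` is hypothetical, chosen only to put the identifier and the variables of interest first. Any columns not named in `front` keep their existing relative order.

```python
import pandas as pd

# Toy frame with a few of the merged table's columns (values are made up).
toy = pd.DataFrame({
    'name': ['a'], 'minutes': [30], 'n_steps': [4],
    'recipe_id': [1], 'rating': [5.0], 'avg_rating': [5.0],
})

# Hypothetical preferred order for the leading columns.
front = ['recipe_id', 'name', 'minutes', 'rating', 'avg_rating']

# Everything else follows in its current order.
rest = [c for c in toy.columns if c not in front]
toy = toy[front + rest]
```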

## Step 3: Assessment of Missingness

In [None]:
# TODO

## Step 4: Hypothesis Testing

In [None]:
# TODO

## Step 5: Framing a Prediction Problem

In [None]:
# TODO

## Step 6: Baseline Model

In [None]:
# TODO

## Step 7: Final Model

In [None]:
# TODO

## Step 8: Fairness Analysis

In [None]:
# TODO