<a href="https://colab.research.google.com/github/juliahumphrys/data-2000/blob/main/Julia_midterm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DATA-2000 Midterm Exam

## Recipe Rating Prediction

For this exercise, we are going to use a dataset of recipes and their ratings, taken from [the website Epicurious](https://www.epicurious.com/recipes-menus).

Our dataset contains basic information about the dish (its name, description, ingredients, and directions), as well as nutritional content (calories, protein, sodium, and fat contents). Based on this information, we want to try and predict how well or poorly the dish will be rated by users.


## Grading Rubric

This midterm will be worth 15% of your total grade for this course. It will be graded out of 50 points, divided into 4 sections:

  - Data Prep: 10 points
    - 5 points will be awarded for the actual data cleaning (evaluating your Python code)
    - 5 points will be awarded for the text commentary narrating your choices and explaining your rationale for the data quality checks that you chose to use
  - Feature Engineering: 12 points
    - 2 points will be awarded by default, but may be subtracted from if there are substantial errors in your data prep that reduce the quality of your engineered features
    - 5 points will be awarded for the actual feature engineering (evaluating your Python code)
    - 5 points will be awarded for the text commentary narrating your choices and explaining your rationale
  - Model Building: 14 points
    - 4 points will be awarded by default, but may be subtracted from if there are substantial errors in your feature engineering that reduce the quality of your model
    - 5 points will be awarded for the actual model building (evaluating your Python code)
    - 5 points will be awarded for the text commentary narrating your choices and explaining your rationale
  - Model Validation/Evaluation: 14 points
    - 4 points will be awarded by default, but may be subtracted from if there are substantial errors in your model building that negatively impact the validity of your model
    - 5 points will be awarded for the actual model validation and evaluation (evaluating your Python code)
    - 5 points will be awarded for the text commentary narrating your choices and explaining your rationale

> **NOTE:** You will NOT be evaluated on whether your model actually makes accurate predictions.


## Using Additional Resources

This is an open-resource exam. You may use any available resources as references. I will be available for any questions that you have during the exam.

Remember that all work must still be your own, and that this exam is governed by the [Policy on Academic Honesty outlined in our course syllabus](https://docs.google.com/document/d/1Aoh7LvTKTEZO74eOsNhLzorkLtljkuchpg3ScNM_VEs/edit#heading=h.r0b18a8gh450).

-----

## Importing the Data

First, let's download our dataset and take a look at what it contains:

In [67]:
import pandas as pd
import numpy as np
from scipy import stats

data = pd.read_json('https://cdn.c18l.org/full_format_recipes.json')

In [68]:
data.head()

Unnamed: 0,directions,fat,date,categories,calories,desc,protein,rating,title,ingredients,sodium
0,"[1. Place the stock, lentils, celery, carrot, ...",7.0,2006-09-01 04:00:00+00:00,"[Sandwich, Bean, Fruit, Tomato, turkey, Vegeta...",426.0,,30.0,2.5,"Lentil, Apple, and Turkey Wrap","[4 cups low-sodium vegetable or chicken stock,...",559.0
1,[Combine first 9 ingredients in heavy medium s...,23.0,2004-08-20 04:00:00+00:00,"[Food Processor, Onion, Pork, Bake, Bastille D...",403.0,This uses the same ingredients found in boudin...,18.0,4.375,Boudin Blanc Terrine with Red Onion Confit,"[1 1/2 cups whipping cream, 2 medium onions, c...",1439.0
2,[In a large heavy saucepan cook diced fennel a...,7.0,2004-08-20 04:00:00+00:00,"[Soup/Stew, Dairy, Potato, Vegetable, Fennel, ...",165.0,,6.0,3.75,Potato and Fennel Soup Hodge,"[1 fennel bulb (sometimes called anise), stalk...",165.0
3,[Heat oil in heavy large skillet over medium-h...,,2009-03-27 04:00:00+00:00,"[Fish, Olive, Tomato, Sauté, Low Fat, Low Cal,...",,The Sicilian-style tomato sauce has tons of Me...,,5.0,Mahi-Mahi in Tomato Olive Sauce,"[2 tablespoons extra-virgin olive oil, 1 cup c...",
4,[Preheat oven to 350°F. Lightly grease 8x8x2-i...,32.0,2004-08-20 04:00:00+00:00,"[Cheese, Dairy, Pasta, Vegetable, Side, Bake, ...",547.0,,20.0,3.125,Spinach Noodle Casserole,"[1 12-ounce package frozen spinach soufflé, th...",452.0


## Data Prep & Cleaning

Perform any data quality checks and data cleaning that you believe is appropriate. Convert any categorical columns to numeric ones, if needed. Provide a narrative explanation of your choices to accompany any code.

First, I dropped some of the columns, keeping directions, fat, calories, description, protein, rating, title, ingredients, and sodium (that is, dropping date and categories). I kept all of the numerical features because I think they will be useful for a predictive model. I also kept the text columns (directions, description, title, and ingredients) because they may be useful later in the process. Then, I dropped all recipes containing null values in any of these columns. Lastly, I removed outliers using z-scores: any recipe whose fat, calories, protein, or sodium value is more than 3 standard deviations from its column mean.

In [69]:
new_data = data.loc[:, ["directions", "fat", "calories", "desc", "protein", "rating", "title", "ingredients", "sodium"]]
new_data.head()

Unnamed: 0,directions,fat,calories,desc,protein,rating,title,ingredients,sodium
0,"[1. Place the stock, lentils, celery, carrot, ...",7.0,426.0,,30.0,2.5,"Lentil, Apple, and Turkey Wrap","[4 cups low-sodium vegetable or chicken stock,...",559.0
1,[Combine first 9 ingredients in heavy medium s...,23.0,403.0,This uses the same ingredients found in boudin...,18.0,4.375,Boudin Blanc Terrine with Red Onion Confit,"[1 1/2 cups whipping cream, 2 medium onions, c...",1439.0
2,[In a large heavy saucepan cook diced fennel a...,7.0,165.0,,6.0,3.75,Potato and Fennel Soup Hodge,"[1 fennel bulb (sometimes called anise), stalk...",165.0
3,[Heat oil in heavy large skillet over medium-h...,,,The Sicilian-style tomato sauce has tons of Me...,,5.0,Mahi-Mahi in Tomato Olive Sauce,"[2 tablespoons extra-virgin olive oil, 1 cup c...",
4,[Preheat oven to 350°F. Lightly grease 8x8x2-i...,32.0,547.0,,20.0,3.125,Spinach Noodle Casserole,"[1 12-ounce package frozen spinach soufflé, th...",452.0


In [70]:
new_data = new_data.dropna(subset = ["directions", "fat", "calories", "desc", "protein", "rating", "title", "ingredients", "sodium"])
new_data.head()

Unnamed: 0,directions,fat,calories,desc,protein,rating,title,ingredients,sodium
1,[Combine first 9 ingredients in heavy medium s...,23.0,403.0,This uses the same ingredients found in boudin...,18.0,4.375,Boudin Blanc Terrine with Red Onion Confit,"[1 1/2 cups whipping cream, 2 medium onions, c...",1439.0
5,"[Mix basil, mayonnaise and butter in processor...",79.0,948.0,This recipe can be prepared in 45 minutes or l...,19.0,4.375,The Best Blts,[2 1/2 cups (lightly packed) fresh basil leave...,1042.0
8,"[Stir together soy sauce, sugar, sesame oil, w...",10.0,170.0,Bulgogi,7.0,4.375,Korean Marinated Beef,"[1/4 cup soy sauce, 1 tablespoon sugar, 2 teas...",1272.0
9,[Chop enough parsley leaves to measure 1 table...,41.0,602.0,Transform your picnic into un pique-nique to r...,23.0,3.75,Ham Persillade with Mustard Potato Salad and M...,"[6 long parsley sprigs, divided, 1 3/4 cups re...",1696.0
10,[Heat oil in heavy large skillet over medium-h...,5.0,256.0,Simmering the yams fills them with flavor and ...,4.0,3.75,"Yams Braised with Cream, Rosemary and Nutmeg","[4 teaspoons olive oil, 1/2 cup finely chopped...",30.0
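
Before dropping rows, it can help to see how many values are missing in each column and how many rows the drop removes. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative, not taken from the recipe data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the recipe data (values are made up)
toy = pd.DataFrame({
    "calories": [426.0, 403.0, np.nan, 547.0],
    "desc": [None, "A terrine", "A tomato sauce", None],
    "rating": [2.5, 4.375, 5.0, 3.125],
})

# Number of missing values in each column
null_counts = toy.isna().sum()
print(null_counts)

# Rows that remain after dropping every row with at least one missing value
cleaned = toy.dropna()
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

In the rows shown by `data.head()`, `desc` is missing for three of the five recipes, so requiring a description may remove a large share of the data.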


In [71]:
limit = 3
z_scores = stats.zscore(new_data[["fat", "calories", "protein", "sodium"]])
# Use a distinct name so the built-in abs() is not shadowed
abs_z_scores = np.abs(z_scores)
outliers = (abs_z_scores > limit).any(axis=1)
new_data = new_data[~outliers]

new_data.head()

Unnamed: 0,directions,fat,calories,desc,protein,rating,title,ingredients,sodium
1,[Combine first 9 ingredients in heavy medium s...,23.0,403.0,This uses the same ingredients found in boudin...,18.0,4.375,Boudin Blanc Terrine with Red Onion Confit,"[1 1/2 cups whipping cream, 2 medium onions, c...",1439.0
5,"[Mix basil, mayonnaise and butter in processor...",79.0,948.0,This recipe can be prepared in 45 minutes or l...,19.0,4.375,The Best Blts,[2 1/2 cups (lightly packed) fresh basil leave...,1042.0
8,"[Stir together soy sauce, sugar, sesame oil, w...",10.0,170.0,Bulgogi,7.0,4.375,Korean Marinated Beef,"[1/4 cup soy sauce, 1 tablespoon sugar, 2 teas...",1272.0
9,[Chop enough parsley leaves to measure 1 table...,41.0,602.0,Transform your picnic into un pique-nique to r...,23.0,3.75,Ham Persillade with Mustard Potato Salad and M...,"[6 long parsley sprigs, divided, 1 3/4 cups re...",1696.0
10,[Heat oil in heavy large skillet over medium-h...,5.0,256.0,Simmering the yams fills them with flavor and ...,4.0,3.75,"Yams Braised with Cream, Rosemary and Nutmeg","[4 teaspoons olive oil, 1/2 cup finely chopped...",30.0
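
To make the z-score rule concrete, here is a minimal sketch on made-up calorie values. It computes the z-scores by hand with the population standard deviation (`ddof=0`), which is also the default used by `scipy.stats.zscore`; the `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data: 19 ordinary calorie values plus one extreme value (all made up)
toy = pd.DataFrame({"calories": [100.0 + i for i in range(19)] + [10000.0]})

# z = (x - mean) / std, using the population standard deviation (ddof=0)
col = toy["calories"]
z = (col - col.mean()) / col.std(ddof=0)

# Flag rows more than `limit` standard deviations from the mean
limit = 3
is_outlier = z.abs() > limit
kept = toy[~is_outlier]

print(f"flagged {int(is_outlier.sum())} of {len(toy)} rows")
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so a single pass may not flag every unusual row.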


## Feature Engineering

Develop any new feature(s) that you feel may be relevant to a model. Provide a narrative explanation of your choices to accompany any code.

To help, I've included a `column_builder()` utility function that will create a new 0/1 indicator column based on whether a string of text appears in either (1) the recipe title or (2) the recipe description.

In [72]:
def column_builder(category: str, new_data: pd.DataFrame) -> pd.DataFrame:
    new_data[f'is_{category}'] = ((
        new_data['title'].str.contains(f'{category}', na=False, case=False)
    ) | (
        new_data['desc'].str.contains(f'{category}', na=False, case=False)
    )).astype(int)

    return new_data


categories = [
    'easy',
    'breakfast',
    'dinner',
    'lunch',
    'beginner'
]

for category in categories:
    new_data = column_builder(category, new_data)

new_data['is_easy'].describe()

count    10603.000000
mean         0.034896
std          0.183525
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: is_easy, dtype: float64
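
Because `str.contains` matches substrings, a keyword such as `easy` also matches words that merely contain it. The sketch below compares substring matching with whole-word matching on made-up titles; the word-boundary pattern is an alternative shown for illustration, not what `column_builder()` does.

```python
import re

import pandas as pd

# Toy titles (made up) to compare substring and whole-word matching
titles = pd.Series(["Easy Weeknight Pasta", "Greasy Spoon Hash", "Roast Chicken"])
keyword = "easy"

# Substring match, as in column_builder(): also matches "Greasy"
substring_hits = titles.str.contains(keyword, case=False, na=False)

# Whole-word match: \b marks a word boundary, re.escape guards special characters
pattern = rf"\b{re.escape(keyword)}\b"
word_hits = titles.str.contains(pattern, case=False, na=False, regex=True)

print(substring_hits.tolist())
print(word_hits.tolist())
```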

I created a column that calculates the protein-to-calorie ratio of each recipe, which represents its protein density. This may be useful in the predictive model, as users may rate protein-dense recipes higher or lower.

In [73]:
new_data['protein_calories_ratio'] = new_data['protein'] / new_data['calories']
new_data.head()

Unnamed: 0,directions,fat,calories,desc,protein,rating,title,ingredients,sodium,is_easy,is_breakfast,is_dinner,is_lunch,is_beginner,protein_calories_ratio
1,[Combine first 9 ingredients in heavy medium s...,23.0,403.0,This uses the same ingredients found in boudin...,18.0,4.375,Boudin Blanc Terrine with Red Onion Confit,"[1 1/2 cups whipping cream, 2 medium onions, c...",1439.0,0,0,0,0,0,0.044665
5,"[Mix basil, mayonnaise and butter in processor...",79.0,948.0,This recipe can be prepared in 45 minutes or l...,19.0,4.375,The Best Blts,[2 1/2 cups (lightly packed) fresh basil leave...,1042.0,0,0,0,0,0,0.020042
8,"[Stir together soy sauce, sugar, sesame oil, w...",10.0,170.0,Bulgogi,7.0,4.375,Korean Marinated Beef,"[1/4 cup soy sauce, 1 tablespoon sugar, 2 teas...",1272.0,0,0,0,0,0,0.041176
9,[Chop enough parsley leaves to measure 1 table...,41.0,602.0,Transform your picnic into un pique-nique to r...,23.0,3.75,Ham Persillade with Mustard Potato Salad and M...,"[6 long parsley sprigs, divided, 1 3/4 cups re...",1696.0,0,0,0,0,0,0.038206
10,[Heat oil in heavy large skillet over medium-h...,5.0,256.0,Simmering the yams fills them with flavor and ...,4.0,3.75,"Yams Braised with Cream, Rosemary and Nutmeg","[4 teaspoons olive oil, 1/2 cup finely chopped...",30.0,0,0,0,0,0,0.015625
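
A ratio like this is undefined when `calories` is 0. Whether such rows exist in the recipe data is not checked here; the sketch below only shows, on made-up values, what pandas produces in that case and one way to turn the results into missing values so that a later `dropna` can remove them.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): a normal recipe, one with zero calories, one with both zero
toy = pd.DataFrame({
    "protein": [18.0, 5.0, 0.0],
    "calories": [403.0, 0.0, 0.0],
})

# Float division by zero does not raise: 5/0 gives inf and 0/0 gives NaN
ratio = toy["protein"] / toy["calories"]

# dropna() ignores inf, so convert infinite values to NaN first
ratio_clean = ratio.replace([np.inf, -np.inf], np.nan)

print(ratio.tolist())
print(ratio_clean.isna().sum())
```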


I then computed the length of the directions for each recipe, measured in words. This can be used to test whether recipes with longer directions are perceived as harder and therefore rated lower, or whether some similar relationship exists.

In [74]:
new_data['directions'] = new_data['directions'].astype(str)
new_data['directions_length'] = new_data['directions'].str.split().apply(len)

new_data.head()

Unnamed: 0,directions,fat,calories,desc,protein,rating,title,ingredients,sodium,is_easy,is_breakfast,is_dinner,is_lunch,is_beginner,protein_calories_ratio,directions_length
1,['Combine first 9 ingredients in heavy medium ...,23.0,403.0,This uses the same ingredients found in boudin...,18.0,4.375,Boudin Blanc Terrine with Red Onion Confit,"[1 1/2 cups whipping cream, 2 medium onions, c...",1439.0,0,0,0,0,0,0.044665,249
5,"['Mix basil, mayonnaise and butter in processo...",79.0,948.0,This recipe can be prepared in 45 minutes or l...,19.0,4.375,The Best Blts,[2 1/2 cups (lightly packed) fresh basil leave...,1042.0,0,0,0,0,0,0.020042,108
8,"['Stir together soy sauce, sugar, sesame oil, ...",10.0,170.0,Bulgogi,7.0,4.375,Korean Marinated Beef,"[1/4 cup soy sauce, 1 tablespoon sugar, 2 teas...",1272.0,0,0,0,0,0,0.041176,97
9,['Chop enough parsley leaves to measure 1 tabl...,41.0,602.0,Transform your picnic into un pique-nique to r...,23.0,3.75,Ham Persillade with Mustard Potato Salad and M...,"[6 long parsley sprigs, divided, 1 3/4 cups re...",1696.0,0,0,0,0,0,0.038206,142
10,['Heat oil in heavy large skillet over medium-...,5.0,256.0,Simmering the yams fills them with flavor and ...,4.0,3.75,"Yams Braised with Cream, Rosemary and Nutmeg","[4 teaspoons olive oil, 1/2 cup finely chopped...",30.0,0,0,0,0,0,0.015625,100
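
As an alternative sketch, the same kind of feature can be computed from the list directly, without converting it to a string first. This assumes each entry of `directions` is a list of step strings, as the `data.head()` output suggests; `n_steps` and `n_words` are hypothetical column names, and the rows are made up.

```python
import pandas as pd

# Toy frame (made up) where each recipe's directions is a list of step strings
toy = pd.DataFrame({
    "directions": [
        ["Heat oil in a skillet.", "Add onion and cook until soft."],
        ["Mix everything.", "Bake.", "Serve warm."],
    ]
})

# Number of steps in each recipe
toy["n_steps"] = toy["directions"].apply(len)

# Total number of whitespace-separated words across all steps
toy["n_words"] = toy["directions"].apply(
    lambda steps: sum(len(step.split()) for step in steps)
)

print(toy[["n_steps", "n_words"]])
```

Keeping the step count as a separate column makes it possible to check whether the number of steps or the total amount of text relates more closely to rating.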


In [79]:
new_data = new_data.dropna(subset = ["fat", "calories", "protein", "sodium", "directions_length", "is_easy", "is_breakfast", "is_dinner", "is_lunch", "is_beginner", "protein_calories_ratio"])

## Model Building

Build a model (either a regression or a neural network) to predict a recipe's rating based on any relevant attributes that you defined in the prior steps.

You may choose to predict rating as a continuous value (0.0 to 5.0), or as a categorical (low/medium/high or similar).

Provide a narrative explanation of your choices to accompany any code.

I built a linear regression model that predicts rating from all of the numerical features. I did not include the text columns themselves, only the numerical columns I derived from them, such as the length of the directions. I believe that all the variables I included are important for predicting the rating of each recipe. I built the model in two different ways, based on examples from class and online, and both ran successfully.

In [86]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

features = ["fat", "calories", "protein", "sodium", "directions_length", "is_easy", "is_breakfast", "is_dinner", "is_lunch", "is_beginner", "protein_calories_ratio"]

target = "rating"


X = new_data[features]
y = new_data[target]

# NOTE: test_size=.8 holds out 80% of the rows for testing, so this model is fit on only 20% of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= .8, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
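
The `test_size` and `train_size` arguments of `train_test_split` are easy to mix up. The sketch below, on ten made-up rows, shows how the same fraction of 0.8 produces opposite splits depending on which argument receives it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy rows (made up) so the split sizes are easy to read
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# test_size=0.8: 80% of the rows are held out, only 20% are used for training
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X, y, test_size=0.8, random_state=42
)

# train_size=0.8: 80% of the rows are used for training, 20% are held out
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X, y, train_size=0.8, random_state=42
)

print(len(X_tr_a), len(X_te_a))
print(len(X_tr_b), len(X_te_b))
```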

In [94]:
train_data, test_data = train_test_split(
    new_data,
    train_size=0.8,
    random_state=42)

model = LinearRegression().fit(
    X=train_data.loc[:, ["fat", "calories", "protein", "sodium", "directions_length", "is_easy", "is_breakfast", "is_dinner", "is_lunch", "is_beginner", "protein_calories_ratio"]],
    y=train_data['rating']
)

In [96]:
model

In [92]:
model.score(X_train, y_train)

0.019187304571031016

In [93]:
model.score(X_test, y_test)

0.006343020861714899

In [95]:
model.score(
    X = new_data.loc[:, ["fat", "calories", "protein", "sodium", "directions_length", "is_easy", "is_breakfast", "is_dinner", "is_lunch", "is_beginner", "protein_calories_ratio"]],
    y= new_data['rating']
)

0.012000410103762071

## Model Evaluation

After training your model, evaluate its performance. What metric(s) did you choose to optimize on? Would you say that your model performed well or poorly? How did you evaluate its performance to arrive at that conclusion?

Again, I used only the numerical features. The model trained and scored without errors, but it performed poorly. For a linear regression, `model.score` returns R², and the values were about 0.019 on the training split, 0.006 on the test split, and 0.012 on the full dataset, so the features explain less than 2% of the variance in rating. Other models we've built in class scored much higher. However, the low scores could be partly due to the dataset used.
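
For reference, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up ratings and predictions (not results from the model above) and also shows that always predicting the mean rating gives an R² of 0, which is the baseline a score near 0 should be compared against.

```python
import numpy as np

# Toy ratings and predictions (made up for illustration)
y_true = np.array([2.5, 4.375, 3.75, 5.0, 3.125])
y_pred = np.array([3.0, 4.0, 3.5, 4.5, 3.5])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Baseline: predict the mean rating for every recipe
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / ss_tot

# Root mean squared error, in the same units as the rating
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f"R^2: {r2:.3f}, baseline R^2: {r2_baseline:.3f}, RMSE: {rmse:.3f}")
```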

-----

# Midterm Submission

To submit this exam, in Canvas navigate to DATA-2000-51 > Assignments > Midterm Exam ([link](https://canvas.jcu.edu/courses/33514/assignments/407120)). You can either upload the `.ipynb` file directly to Canvas, or you can provide a link to the assignment on your GitHub.