# Your Title Here

**Name(s)**: Nahuel Canavy

**Website Link**: (your website link)

## Introduction:

We're looking at a dataset from Food.com, a large recipe-sharing site. The dataset contains a lot of information about recipes and people's reactions to them. I've already analyzed this dataset to see whether recipes with more calories get lower ratings.

However, today we're going to tackle a new subject.

## Framing the Problem

**Prediction Problem:** Regression

**Response Variable:** Number of Steps

**Metric:** Mean Square Error and R2 Score

**Explanation:**
In this problem, the goal is to predict the number of steps in a recipe from other features in the dataset. Model performance is evaluated with Mean Squared Error (MSE), which is the average squared difference between predicted and true values, so larger errors are penalized more heavily. R-squared is reported alongside it because it measures the proportion of variance in the number of steps that the predictors explain: a value near 1 suggests a good fit, while a value near 0 indicates a poor one. Together, MSE and R-squared describe both the size of the model's errors and its explanatory power.
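To make the two metrics concrete, here is a small worked example on made-up numbers (the arrays `y_true` and `y_pred` are hypothetical, not taken from the dataset). It computes MSE and R-squared by hand with NumPy.

```python
import numpy as np

# Hypothetical true and predicted step counts
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

errors = y_true - y_pred                          # [1, 0, -1, -3]
mse = np.mean(errors ** 2)                        # (1 + 0 + 1 + 9) / 4

ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot                          # share of variance explained

print("MSE:", mse)
print("R2:", r2)
```

Because the errors are squared, the single miss of 3 steps contributes 9 of the 11 units of squared error here, which is why a few bad predictions can dominate the MSE.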

## Code

In [1]:
import pandas as pd
import numpy as np
import os
import plotly.express as px
pd.options.plotting.backend = 'plotly'
from scipy import stats

## Load the data and merge the data

In [2]:
# Load the datasets
food_recipe = pd.read_csv(os.path.join('food_data', 'RAW_recipes.csv'))
food_interaction = pd.read_csv(os.path.join('food_data', 'RAW_interactions.csv'))

# Left merge the recipes and interactions datasets together.
merged_data = pd.merge(food_recipe, food_interaction, how='left', left_on='id', right_on='recipe_id')

# Fill all ratings of 0 with np.nan.
# This is a reasonable step because we can assume that a rating of 0 implies no rating given,
# and replacing these with NaN will prevent these 0 ratings from skewing the average rating calculation.
merged_data['rating'] = merged_data['rating'].replace(0, np.nan)

# Find the average rating per recipe
# Define a custom aggregation function that calculates the mean while excluding NaN values
def mean_without_nan(series):
    if series.dropna().empty:
        return np.nan
    else:
        return np.nanmean(series)

# Calculate the mean without including NaN values
average_rating = merged_data.groupby('id')['rating'].apply(mean_without_nan)
# Add this Series containing the average rating per recipe back to the recipes dataset
# Here we create a new dataframe which is a copy of the original food_recipe dataframe and adds the new 'average_rating' column
food_recipe_with_ratings = food_recipe.copy()
food_recipe_with_ratings = pd.merge(food_recipe_with_ratings, average_rating, how='left', left_on='id', right_on='id')
food_recipe_with_ratings.head()


|    | name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients | rating |
|---:|:-----|---:|--------:|---------------:|:----------|:-----|:----------|--------:|:------|:------------|:------------|--------------:|-------:|
| 0 | 1 brownies in the world best ever | 333281 | 40 | 985201 | 2008-10-27 | ['60-minutes-or-less', 'time-to-make', 'course... | [138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0] | 10 | ['heat the oven to 350f and arrange the rack i... | these are the most; chocolatey, moist, rich, d... | ['bittersweet chocolate', 'unsalted butter', '... | 9 | 4.0 |
| 1 | 1 in canada chocolate chip cookies | 453467 | 45 | 1848091 | 2011-04-11 | ['60-minutes-or-less', 'time-to-make', 'cuisin... | [595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0] | 12 | ['pre-heat oven the 350 degrees f', 'in a mixi... | this is the recipe that we use at my school ca... | ['white sugar', 'brown sugar', 'salt', 'margar... | 11 | 5.0 |
| 2 | 412 broccoli casserole | 306168 | 40 | 50969 | 2008-05-30 | ['60-minutes-or-less', 'time-to-make', 'course... | [194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0] | 6 | ['preheat oven to 350 degrees', 'spray a 2 qua... | since there are already 411 recipes for brocco... | ['frozen broccoli cuts', 'cream of chicken sou... | 9 | 5.0 |
| 3 | millionaire pound cake | 286009 | 120 | 461724 | 2008-02-12 | ['time-to-make', 'course', 'cuisine', 'prepara... | [878.3, 63.0, 326.0, 13.0, 20.0, 123.0, 39.0] | 7 | ['freheat the oven to 300 degrees', 'grease a ... | why a millionaire pound cake? because it's su... | ['butter', 'sugar', 'eggs', 'all-purpose flour... | 7 | 5.0 |
| 4 | 2000 meatloaf | 475785 | 90 | 2202916 | 2012-03-06 | ['time-to-make', 'course', 'main-ingredient', ... | [267.0, 30.0, 12.0, 12.0, 29.0, 48.0, 2.0] | 17 | ['pan fry bacon , and set aside on a paper tow... | ready, set, cook! special edition contest entr... | ['meatloaf mixture', 'unsmoked bacon', 'goat c... | 13 | 5.0 |
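The averaging step above can be checked on a toy example. The two small frames below are made up for illustration; only the column names (`id`, `recipe_id`, `rating`) follow the notebook. The sketch also suggests that pandas' built-in group mean, which skips NaN by default, should give the same result as the custom `mean_without_nan` helper.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the recipes and interactions tables
recipes = pd.DataFrame({'id': [1, 2, 3]})
interactions = pd.DataFrame({
    'recipe_id': [1, 1, 2, 2],
    'rating': [5, 0, 0, 0],   # 0 means "no rating given"
})

merged = recipes.merge(interactions, how='left', left_on='id', right_on='recipe_id')
merged['rating'] = merged['rating'].replace(0, np.nan)

# GroupBy.mean ignores NaN; a group with only NaN values yields NaN
avg = merged.groupby('id')['rating'].mean()
print(avg)
```

Recipe 1 keeps its single real rating, recipe 2 has only unrated interactions, and recipe 3 has no interactions at all, so the last two end up with a missing average and would later be removed by `dropna(subset=['rating', ...])`.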


## Cleaning and preparing the data


I need to clean my data. I'm going to drop the columns that I don't expect to help predict the number of steps:
name, id, contributor_id, submitted, tags and nutrition.
Furthermore, I need to convert the string columns to numbers so my models can use them.
I'm going to replace the steps, description and ingredients columns with the number of words they contain.
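To make the word-count transformation concrete, here is a minimal sketch on a made-up two-row frame. The helper `count_words` is hypothetical (the notebook uses a lambda, with `astype(str)` for some columns). It counts whitespace-separated tokens and treats missing values as 0 words, whereas converting a missing value to a string first may turn it into a literal such as 'nan' that then counts as one word.

```python
import pandas as pd

def count_words(text):
    # Hypothetical helper: whitespace-separated token count, 0 for missing values
    return len(text.split()) if isinstance(text, str) else 0

# Made-up rows; 'steps' is stored as the string form of a list, as in the raw data
toy = pd.DataFrame({
    'steps': ["['preheat oven to 350 degrees', 'mix well']", "['stir']"],
    'description': ['quick and easy', None],
})

toy['steps_words'] = toy['steps'].apply(count_words)
toy['description_words'] = toy['description'].apply(count_words)
print(toy[['steps_words', 'description_words']])
```

Brackets and quotes stay attached to their neighbouring tokens, so they do not change the count.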


In [12]:
# Remove rows where rating or nutrition information is missing
clean_data = food_recipe_with_ratings.dropna(subset=['rating', 'nutrition'])
columns_to_drop=['name','id','contributor_id','submitted', 'tags','nutrition']
clean_data=clean_data.drop(columns=columns_to_drop,axis=1)
# Convert string columns to the number of words they contain
clean_data['steps_words'] = clean_data['steps'].apply(lambda x: len(x.split()))
clean_data['description_words'] = clean_data['description'].astype(str).apply(lambda x: len(x.split()))
clean_data['ingredients_words'] = clean_data['ingredients'].astype(str).apply(lambda x: len(x.split()))

columns_to_drop_v2=['steps','description','ingredients']
clean_data=clean_data.drop(columns=columns_to_drop_v2,axis=1)
clean_data.head().to_markdown(index=False)
#clean_data.dtypes

|   minutes |   n_steps |   n_ingredients |   rating |   steps_words |   description_words |   ingredients_words |
|----------:|----------:|----------------:|---------:|--------------:|--------------------:|--------------------:|
|        40 |        10 |               9 |        4 |           128 |                  41 |                  18 |
|        45 |        12 |              11 |        5 |           148 |                  42 |                  18 |
|        40 |         6 |               9 |        5 |            92 |                  64 |                  21 |
|       120 |         7 |               7 |        5 |           128 |                  34 |                  12 |
|        90 |        17 |              13 |        5 |           257 |                  29 |                  29 |

### Code for plotting my data

In [87]:
data = result_finalmodel_2t

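In [90]:
# Run this cell after the plotting cell (In [88]), which defines fig1 and fig2.
# The raw string must not end with a backslash, so the file names are joined with os.path.join.
path = r'D:\IMT-Atlantrique\TC\Cours\Spring_Quarter\DSC_80\lab\dsc80-2023-sp\projects\05-topics-II\assets'
fig1.write_html(os.path.join(path, 'Predicted_vs_Actual_Number_of_Steps_final_model_2t.html'), include_plotlyjs='cdn')
fig2.write_html(os.path.join(path, 'Residual_Plot_baseline_final_model_2t.html'), include_plotlyjs='cdn')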

In [88]:
# Actual and predicted values
actual_steps = data['n_steps']
predicted_steps = data['predicted_n_steps']

# Calculate residuals
residuals = np.array(actual_steps) - np.array(predicted_steps)

# Plotting predicted versus actual steps
fig1 = px.scatter(data_frame=data, x='n_steps', y='predicted_n_steps',
                  labels={'n_steps': 'Actual Number of Steps', 'predicted_n_steps': 'Predicted Number of Steps'},
                  title='Predicted vs Actual Number of Steps')
fig1.add_shape(type='line', x0=0, y0=0, x1=max(actual_steps), y1=max(actual_steps), line=dict(color='red', dash='dash'))
fig1.show()



# Plotting residuals
fig2 = px.scatter(data_frame=data, x='n_steps', y=residuals,
                  labels={'n_steps': 'Actual Number of Steps', 'y': 'Residuals'},
                  title='Residual Plot')
fig2.add_shape(type='line', x0=min(actual_steps), y0=0, x1=max(actual_steps), y1=0, line=dict(color='red', dash='dash'))
fig2.show()

# Percentiles of the absolute residuals: e.g. 90% of predictions are within precision_90 steps of the actual value
precision_90 = np.percentile(np.abs(residuals), 90)
precision_95 = np.percentile(np.abs(residuals), 95)
precision_99 = np.percentile(np.abs(residuals), 99)

print('Precision at 90%:', precision_90)
print('Precision at 95%:', precision_95)
print('Precision at 99%:', precision_99)

Precision at 90%: 4.0
Precision at 95%: 6.0
Precision at 99%: 10.0
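These values read as error bounds rather than "precision" in the classification sense: 90% of the predictions are off by at most 4 steps. A small sketch on made-up numbers (the `actual` and `predicted` arrays are hypothetical) shows how such a bound comes out of the absolute residuals.

```python
import numpy as np

# Hypothetical actual and predicted step counts
actual = np.array([4, 7, 10, 3, 12, 6, 8, 5, 9, 11])
predicted = np.array([5, 7, 8, 6, 11, 6, 8, 10, 11, 10])

abs_err = np.abs(actual - predicted)

# 90th percentile of the absolute error (NumPy interpolates linearly by default)
p90 = np.percentile(abs_err, 90)

# Share of predictions whose absolute error does not exceed that bound
within_p90 = (abs_err <= p90).mean()

print("90th percentile of |error|:", p90)
print("Share of predictions within that bound:", within_p90)
```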


### Baseline Model

The baseline model is a linear regression that predicts the number of steps from two features, 'minutes' and 'n_ingredients'.
Its performance is evaluated using Mean Squared Error (MSE) and R2 Score.

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [38]:
def baseline_model(data):
    
    X= data.drop(["n_steps",'rating','steps_words','description_words' , 'ingredients_words' ],axis=1)
    y=data["n_steps"]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    
    pipeline = Pipeline([
        ('lin-reg', LinearRegression())
    ])
    
    pipeline.fit(X_train, y_train)
    
    y_pred = pipeline.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print("Mean Squared Error:", mse)
    print("R2 Score:", r2)
    
     # Create a new DataFrame with predicted values
    result = X_test.copy()
    result['n_steps'] = y_test
    result["predicted_n_steps"] = y_pred
    
    return result
  
    

In [39]:
result_baseline=baseline_model(clean_data)

Mean Squared Error: 31.637901359032252
R2 Score: 0.1763121834190734


In [44]:
result_baseline.head().to_markdown(index=False)

|   minutes |   n_ingredients |   n_steps |   predicted_n_steps |
|----------:|----------------:|----------:|--------------------:|
|       495 |              15 |         9 |             14.2936 |
|        25 |               9 |        10 |              9.9304 |
|        35 |              12 |        11 |             12.1088 |
|        75 |              21 |        22 |             18.6441 |
|       170 |              15 |        38 |             14.289  |

#### Final Model - Step 1 


In [79]:
from sklearn.model_selection import GridSearchCV

In [48]:
def final_model_1(data):
    
    X= data.drop(["n_steps"],axis=1)
    y=data["n_steps"]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    
    pipeline = Pipeline([
        ('lin-reg', LinearRegression())
    ])
    
    pipeline.fit(X_train, y_train)
    
    y_pred = pipeline.predict(X_test).round()
    
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print("Mean Squared Error:", mse)
    print("R2 Score:", r2)
    
     # Create a new DataFrame with predicted values
    result = X_test.copy()
    result['n_steps'] = y_test
    result["predicted_n_steps"] = y_pred
    
    return result

In [49]:
result_finalmodel_1=final_model_1(clean_data)

Mean Squared Error: 8.683256134818173
R2 Score: 0.7823139674906084


In [50]:
result_finalmodel_1.head().to_markdown(index=False)

|   minutes |   n_ingredients |   rating |   steps_words |   description_words |   ingredients_words |   n_steps |   predicted_n_steps |
|----------:|----------------:|---------:|--------------:|--------------------:|--------------------:|----------:|--------------------:|
|         1 |               2 |      4.5 |             5 |                   9 |                   3 |         1 |                   2 |
|        70 |              15 |      1   |            94 |                  37 |                  21 |         8 |                  10 |
|        10 |               3 |      5   |            28 |                  22 |                   3 |         3 |                   4 |
|        25 |              11 |      2.5 |           114 |                  31 |                  22 |         8 |                  11 |
|        40 |               7 |      5   |            63 |                  28 |                  11 |         4 |                   7 |

#### Final Model - V2

This model uses RandomForestRegressor and includes all columns except 'n_steps' as features. The hyperparameters 'max_depth' and 'n_estimators' are set to 10 and 200, respectively, the best values found by a grid search. The model's performance is again evaluated using Mean Squared Error (MSE) and R2 Score.

In [58]:
from sklearn.ensemble import RandomForestRegressor

In [59]:
def final_model_2(data):
    X = data.drop(["n_steps"], axis=1)
    y = data["n_steps"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    pipeline = Pipeline([
        ('rf-reg', RandomForestRegressor(max_depth=10, n_estimators=200))
    ])

    pipeline.fit(X_train, y_train)

    y_pred = pipeline.predict(X_test).round()

    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print("Mean Squared Error:", mse)
    print("R2 Score:", r2)

    # Create a new DataFrame with predicted values
    result = X_test.copy()
    result['n_steps'] = y_test
    result["predicted_n_steps"] = y_pred

    return result

In [60]:
result_finalmodel_2=final_model_2(clean_data)

Mean Squared Error: 8.883265989947768
R2 Score: 0.783468767587091


In [61]:
result_finalmodel_2.head().to_markdown(index=False)

|   minutes |   n_ingredients |   rating |   steps_words |   description_words |   ingredients_words |   n_steps |   predicted_n_steps |
|----------:|----------------:|---------:|--------------:|--------------------:|--------------------:|----------:|--------------------:|
|        13 |               8 |  5       |            63 |                  41 |                  17 |         5 |                   7 |
|       135 |              12 |  5       |           198 |                  49 |                  17 |        14 |                  17 |
|        70 |               9 |  4       |           171 |                   3 |                  21 |        22 |                  16 |
|        35 |              12 |  5       |           131 |                  84 |                  25 |        13 |                  12 |
|        25 |               4 |  4.90909 |            32 |                  12 |                   7 |         4 |                   4 |

(Output of the grid search, for those interested:

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV 1/5] END regression__max_depth=None, regression__n_estimators=50;, score=0.792 total time=  57.2s
[CV 2/5] END regression__max_depth=None, regression__n_estimators=50;, score=0.778 total time=  57.3s
[CV 3/5] END regression__max_depth=None, regression__n_estimators=50;, score=0.777 total time=  57.1s
[CV 4/5] END regression__max_depth=None, regression__n_estimators=50;, score=0.777 total time=  57.8s
[CV 5/5] END regression__max_depth=None, regression__n_estimators=50;, score=0.785 total time=  57.3s
[CV 1/5] END regression__max_depth=None, regression__n_estimators=100;, score=0.794 total time= 1.9min
[CV 2/5] END regression__max_depth=None, regression__n_estimators=100;, score=0.780 total time= 1.9min
[CV 3/5] END regression__max_depth=None, regression__n_estimators=100;, score=0.779 total time= 1.9min
[CV 4/5] END regression__max_depth=None, regression__n_estimators=100;, score=0.779 total time= 1.9min
[CV 5/5] END regression__max_depth=None, regression__n_estimators=100;, score=0.787 total time= 1.8min
[CV 1/5] END regression__max_depth=None, regression__n_estimators=200;, score=0.794 total time= 3.7min
[CV 2/5] END regression__max_depth=None, regression__n_estimators=200;, score=0.782 total time= 3.6min
[CV 3/5] END regression__max_depth=None, regression__n_estimators=200;, score=0.780 total time= 3.6min
[CV 4/5] END regression__max_depth=None, regression__n_estimators=200;, score=0.781 total time= 3.6min
[CV 5/5] END regression__max_depth=None, regression__n_estimators=200;, score=0.790 total time= 3.7min
[CV 1/5] END regression__max_depth=10, regression__n_estimators=50;, score=0.797 total time=  28.0s
[CV 2/5] END regression__max_depth=10, regression__n_estimators=50;, score=0.785 total time=  27.1s
[CV 3/5] END regression__max_depth=10, regression__n_estimators=50;, score=0.784 total time=  26.8s
[CV 4/5] END regression__max_depth=10, regression__n_estimators=50;, score=0.782 total time=  26.8s
[CV 5/5] END regression__max_depth=10, regression__n_estimators=50;, score=0.792 total time=  25.9s
[CV 1/5] END regression__max_depth=10, regression__n_estimators=100;, score=0.798 total time=  52.0s
[CV 2/5] END regression__max_depth=10, regression__n_estimators=100;, score=0.785 total time=  51.9s
[CV 3/5] END regression__max_depth=10, regression__n_estimators=100;, score=0.784 total time=  52.2s
[CV 4/5] END regression__max_depth=10, regression__n_estimators=100;, score=0.782 total time=  51.9s
[CV 5/5] END regression__max_depth=10, regression__n_estimators=100;, score=0.793 total time=  53.4s
[CV 1/5] END regression__max_depth=10, regression__n_estimators=200;, score=0.798 total time= 1.8min
[CV 2/5] END regression__max_depth=10, regression__n_estimators=200;, score=0.786 total time= 1.8min
[CV 3/5] END regression__max_depth=10, regression__n_estimators=200;, score=0.784 total time= 1.7min
[CV 4/5] END regression__max_depth=10, regression__n_estimators=200;, score=0.783 total time= 1.7min
[CV 5/5] END regression__max_depth=10, regression__n_estimators=200;, score=0.793 total time= 1.8min
[CV 1/5] END regression__max_depth=20, regression__n_estimators=50;, score=0.793 total time=  51.1s
[CV 2/5] END regression__max_depth=20, regression__n_estimators=50;, score=0.780 total time=  52.5s
[CV 3/5] END regression__max_depth=20, regression__n_estimators=50;, score=0.779 total time=  49.9s
[CV 4/5] END regression__max_depth=20, regression__n_estimators=50;, score=0.778 total time=  48.9s
[CV 5/5] END regression__max_depth=20, regression__n_estimators=50;, score=0.786 total time=  50.3s
[CV 1/5] END regression__max_depth=20, regression__n_estimators=100;, score=0.795 total time= 1.7min
[CV 2/5] END regression__max_depth=20, regression__n_estimators=100;, score=0.781 total time= 1.6min
[CV 3/5] END regression__max_depth=20, regression__n_estimators=100;, score=0.780 total time= 1.6min
[CV 4/5] END regression__max_depth=20, regression__n_estimators=100;, score=0.780 total time= 1.6min
[CV 5/5] END regression__max_depth=20, regression__n_estimators=100;, score=0.788 total time= 1.6min
[CV 1/5] END regression__max_depth=20, regression__n_estimators=200;, score=0.795 total time= 3.3min
[CV 2/5] END regression__max_depth=20, regression__n_estimators=200;, score=0.783 total time= 3.1min
[CV 3/5] END regression__max_depth=20, regression__n_estimators=200;, score=0.780 total time= 1.3min
[CV 4/5] END regression__max_depth=20, regression__n_estimators=200;, score=0.781 total time= 1.3min
[CV 5/5] END regression__max_depth=20, regression__n_estimators=200;, score=0.790 total time= 1.3min
[CV 1/5] END regression__max_depth=30, regression__n_estimators=50;, score=0.792 total time=  21.0s
[CV 2/5] END regression__max_depth=30, regression__n_estimators=50;, score=0.779 total time=  20.9s
[CV 3/5] END regression__max_depth=30, regression__n_estimators=50;, score=0.777 total time=  21.0s
[CV 4/5] END regression__max_depth=30, regression__n_estimators=50;, score=0.777 total time=  20.9s
[CV 5/5] END regression__max_depth=30, regression__n_estimators=50;, score=0.785 total time=  20.9s
[CV 1/5] END regression__max_depth=30, regression__n_estimators=100;, score=0.794 total time=  42.2s
[CV 2/5] END regression__max_depth=30, regression__n_estimators=100;, score=0.781 total time=  43.3s
[CV 3/5] END regression__max_depth=30, regression__n_estimators=100;, score=0.779 total time=  44.4s
[CV 4/5] END regression__max_depth=30, regression__n_estimators=100;, score=0.779 total time=  46.2s
[CV 5/5] END regression__max_depth=30, regression__n_estimators=100;, score=0.787 total time=  46.9s
[CV 1/5] END regression__max_depth=30, regression__n_estimators=200;, score=0.794 total time= 1.6min
[CV 2/5] END regression__max_depth=30, regression__n_estimators=200;, score=0.782 total time= 1.5min
[CV 3/5] END regression__max_depth=30, regression__n_estimators=200;, score=0.780 total time= 1.6min
[CV 4/5] END regression__max_depth=30, regression__n_estimators=200;, score=0.781 total time= 1.4min
[CV 5/5] END regression__max_depth=30, regression__n_estimators=200;, score=0.790 total time= 1.4min
Best parameters: {'regression__max_depth': 10, 'regression__n_estimators': 200}
Mean Squared Error: 8.389467200492762, R2 Score: 0.7928470071274102

)
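The search code itself is not shown in this notebook, so the following is only a sketch of how a log like the one above could be produced. It assumes a pipeline whose step is named `regression` (taken from the parameter names in the log) and uses synthetic data with a much smaller grid so that it runs in seconds. The data, the grid values and the variable names are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the recipe features (assumption, not the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)

# The step name 'regression' gives parameter names like regression__max_depth
pipeline = Pipeline([('regression', RandomForestRegressor(random_state=0))])

# Reduced grid for illustration; the log above covers more values
param_grid = {
    'regression__max_depth': [None, 10],
    'regression__n_estimators': [10, 20],
}

# 5-fold cross-validation; regressors are scored with R2 by default
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated R2:", search.best_score_)
```

Passing `verbose=3` to `GridSearchCV` prints one line per fold and candidate, similar to the `[CV 1/5] END ...` lines above.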


In [67]:
from sklearn.preprocessing import StandardScaler

In [107]:
def final_model_2_thune(data):
    # Work on a copy so the caller's DataFrame is not modified
    data = data.copy()

    # Feature Engineering
    data['total_words'] = data['steps_words'] + data['description_words'] + data['ingredients_words']
    
    # Splitting into features and target variable
    X = data.drop(['n_steps'], axis=1)
    y = data['n_steps']
    
    # Splitting into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    
    # Pipeline with feature scaling and random forest regressor
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  # Feature scaling
        ('rf-reg', RandomForestRegressor(max_depth=10, n_estimators=300, random_state=42))
    ])
    

    # Fitting the model
    pipeline.fit(X_train, y_train)
    
    # Predictions
    y_pred = pipeline.predict(X_test).round()
    
    # Evaluation metrics
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print("Mean Squared Error:", mse)
    print("R2 Score:", r2)

    
    # Create a new DataFrame with predicted values
    result = X_test.copy()
    result['n_steps'] = y_test
    result['predicted_n_steps'] = y_pred
    
    return result

In [108]:
result_finalmodel_2t=final_model_2_thune(clean_data)

Mean Squared Error: 8.458509904405243
R2 Score: 0.792055755312701


In [89]:
result_finalmodel_2t.head().to_markdown(index=False)

|   minutes |   n_ingredients |   rating |   steps_words |   description_words |   ingredients_words |   total_words |   n_steps |   predicted_n_steps |
|----------:|----------------:|---------:|--------------:|--------------------:|--------------------:|--------------:|----------:|--------------------:|
|        20 |               6 |  4.66667 |            52 |                  40 |                   8 |           100 |         7 |                   6 |
|        50 |              10 |  4       |           102 |                  21 |                  30 |           153 |        13 |                  10 |
|        35 |               8 |  5       |            92 |                  70 |                  13 |           175 |        11 |                   9 |
|        40 |              18 |  5       |            86 |                  26 |                  37 |           149 |         9 |                   8 |
|        10 |               5 |  5       |            92 |

## Fairness Analysis

For our fairness analysis, let's split the recipes into two categories: simple recipes, which take fewer than 10 ingredients, and complex recipes, which take 10 or more ingredients.

Let's define the null and alternative hypotheses:

Null Hypothesis: The model's mean squared error (MSE) or R2 score is the same for both simple recipes and complex recipes.

Alternative Hypothesis: The model's MSE or R2 score is different for simple recipes and complex recipes.

In [109]:
# Divide our data into two categories
data_simple_recipe=result_finalmodel_2t[result_finalmodel_2t['n_ingredients'] < 10]
data_complex_recipe= result_finalmodel_2t[result_finalmodel_2t['n_ingredients'] >= 10]


# Compute the MSE and R2 score for each category
mse_simple_recipe = mean_squared_error(data_simple_recipe['n_steps'], data_simple_recipe['predicted_n_steps'])
r2_simple_recipe = r2_score(data_simple_recipe['n_steps'], data_simple_recipe['predicted_n_steps'])
mse_complex_recipe = mean_squared_error(data_complex_recipe['n_steps'], data_complex_recipe['predicted_n_steps'])
r2_complex_recipe = r2_score(data_complex_recipe['n_steps'], data_complex_recipe['predicted_n_steps'])


abs_r2_diff = abs(r2_complex_recipe - r2_simple_recipe)
abs_mse_diff = abs(mse_complex_recipe - mse_simple_recipe)

percentage_r2_diff = abs_r2_diff / r2_simple_recipe * 100
percentage_mse_diff = abs_mse_diff / mse_simple_recipe * 100

print("The absolute r2 difference is:", abs_r2_diff)
print("The absolute mse difference is:", abs_mse_diff)
print("The percentage r2 difference is:", percentage_r2_diff,'%.')
print("The percentage mse difference is:", percentage_mse_diff,'%.')

The absolute r2 difference is: 0.030291847744350675
The absolute mse difference is: 6.48260169085053
The percentage r2 difference is: 3.8714647336710275 %.
The percentage mse difference is: 114.83283054389571 %.


As we can see, the difference is really high for the MSE. Let's perform a permutation test now.
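Before running it on the model results, the idea of the test can be sketched on toy data: shuffle the group labels many times, recompute the gap between the two groups each time, and see how often a shuffled gap is at least as large as the observed one. Everything in this sketch (the simulated squared errors, the group sizes, the helper `mse_gap`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-recipe squared errors: 300 "simple" and 200 "complex" recipes
sq_err = np.concatenate([
    rng.exponential(4.0, size=300),
    rng.exponential(9.0, size=200),
])
is_complex = np.concatenate([np.zeros(300, dtype=bool), np.ones(200, dtype=bool)])

def mse_gap(sq_err, labels):
    # Absolute difference between the mean squared error of the two groups
    return abs(sq_err[labels].mean() - sq_err[~labels].mean())

observed = mse_gap(sq_err, is_complex)

# Shuffling the labels breaks any link between group membership and error size
permuted = np.array([
    mse_gap(sq_err, rng.permutation(is_complex)) for _ in range(1000)
])

# Share of shuffled gaps at least as large as the observed gap
p_value = (permuted >= observed).mean()
print("Observed gap:", observed)
print("p-value:", p_value)
```

The key point is that the groups must be rebuilt from the shuffled labels on every iteration. If they are rebuilt from the original labels, every "permuted" gap equals the observed one and the p-value is always 1.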

In [110]:
# Lists to hold permuted differences
permuted_r2_diffs = []
permuted_mse_diffs = []
# Permutation test
n_permutations = 1000
for _ in range(n_permutations):
    # Randomly permute the 'n_ingredients' series
    permuted_n_ingredients = np.random.permutation(result_finalmodel_2t['n_ingredients'])
    
    # Create two dataframes based on the permuted series (not the original column,
    # otherwise every iteration reproduces the observed difference)
    permuted_simple_recipe = result_finalmodel_2t[permuted_n_ingredients < 10]
    permuted_complex_recipe = result_finalmodel_2t[permuted_n_ingredients >= 10]

    # Compute MSE and R2 score
    mse_permuted_simple = mean_squared_error(permuted_simple_recipe['n_steps'], permuted_simple_recipe['predicted_n_steps'])
    r2_permuted_simple = r2_score(permuted_simple_recipe['n_steps'], permuted_simple_recipe['predicted_n_steps'])
    mse_permuted_complex = mean_squared_error(permuted_complex_recipe['n_steps'], permuted_complex_recipe['predicted_n_steps'])
    r2_permuted_complex = r2_score(permuted_complex_recipe['n_steps'], permuted_complex_recipe['predicted_n_steps'])

    # Compute the differences
    permuted_r2_diff = abs(r2_permuted_complex - r2_permuted_simple)
    permuted_mse_diff = abs(mse_permuted_complex - mse_permuted_simple)

    # Append the differences
    permuted_r2_diffs.append(permuted_r2_diff)
    permuted_mse_diffs.append(permuted_mse_diff)

# Convert to pandas series
permuted_r2_diffs = pd.Series(permuted_r2_diffs)
permuted_mse_diffs = pd.Series(permuted_mse_diffs)

# Compute p-values: share of permuted differences at least as large as the observed one
p_val_r2 = (permuted_r2_diffs >= abs_r2_diff).mean()
p_val_mse = (permuted_mse_diffs >= abs_mse_diff).mean()

print(f'P-value for R2 score difference: {p_val_r2}')
print(f'P-value for MSE difference: {p_val_mse}')
