##### **Disclaimer: We use some advanced packages here without detailed explanation. You can use these, but we do not provide any support.**

In [1]:
# To install them, you can uncomment the following lines:
# (%pip will call pip from the currently active python environment)
# %pip install scikit-learn

# Note: Some of these packages are still not compatible with Python 3.12 yet
# %pip install sweetviz
# %pip install ydata_profiling
# %pip install shap

## CRISP-DM

In [2]:
import pandas as pd
import numpy as np
import sklearn

import matplotlib.pyplot as plt
import seaborn as sns

# Note: The following do not work with Python 3.12
#import shap
#from ydata_profiling import ProfileReport
#import sweetviz as sv

#### Reproducibility 

A best practice in data analytics projects is to work with *seeds* to ensure the reproducability of results. 
This is especially important in the Analytics Cup, since the rules require you to write a self-contained
script that produces reproducable results. 

To achieve this, we can set seeds for all used random number generators.

In [3]:
seed = 55

### Phase 1: Business Understanding

Serves to assess use cases, feasibility, requirements, and
risks of the endeavored data driven project.

Startup that suggests new recipes to users\
But we have been having many cancelations of subscriptions\
Problem was that the users found that the recipes suggested (even though they had high quality) did not match the customer's diet and needs\
Now we have a system of likes and dislikes for the recipes and a new user interface, where the users can enter information about what they want

### Phase 2: Data Understanding

Assess the data quality and content.

In [4]:
# load the data
diet = pd.read_csv("diet.csv")
recipes = pd.read_csv("recipes.csv")
requests = pd.read_csv("requests.csv")
reviews = pd.read_csv("reviews.csv")

  reviews = pd.read_csv("reviews.csv")


have a look at the data and its attributes \
check if columns are properly named \
general overview over data, check for missing values, etc.

#### Diet pre-processing

In [5]:
diet["Diet"] = diet["Diet"].astype('category')

#### Recipes pre-processing

In [6]:
# Change types of column
def refactorIngredients(ingredients):
    if ingredients == "character(0)":
        return []
    ingredients = ingredients.replace("\\", '').replace("\"", '').replace('c(','').replace(')', '')
    ingredients = ingredients.split(",")
    return ingredients

recipes["RecipeIngredientQuantities"] = recipes["RecipeIngredientQuantities"].apply(lambda x: refactorIngredients(x))
recipes["RecipeIngredientParts"] = recipes["RecipeIngredientParts"].apply(lambda x: refactorIngredients(x))

recipes.head()

Unnamed: 0,RecipeId,Name,CookTime,PrepTime,RecipeCategory,RecipeIngredientQuantities,RecipeIngredientParts,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield
0,73440,Bow Ties With Broccoli Pesto,0,1800,Other,"[6, 2, 1 1/2, 1/4, 1/2, 4, 1 1/2, 1 1/2...","[hazelnuts, broccoli florets, fresh parsley ...",241.3,10.1,1.2,0.0,13.1,31.8,2.3,1.4,6.7,9.0,
1,365718,Cashew-chutney Rice,3600,600,Other,"[1, 3/4, 6, 5, 2, 1, 2]","[celery, onion, butter, chicken broth, lon...",370.8,17.5,7.2,22.9,553.3,44.3,1.6,2.2,9.4,8.0,
2,141757,Copycat Taco Bell Nacho Fries BellGrande,3600,2700,Other,"[3, 1/2, 1, 1, 3, 2, 1, 2 1/2, 2, 1, ...","[Copycat Taco Bell Seasoned Beef, yellow onio...",377.6,20.9,10.5,45.7,1501.8,36.6,3.8,6.1,12.9,8.0,
3,280351,Slow Cooker Jalapeno Cheddar Cheese Soup,18000,1800,Other,"[2, 1, 2, 2, 1, 1, 1/8, 1/4, 1, 4, 3...","[unsalted butter, yellow onion, carrots, ga...",282.8,16.5,10.3,50.5,630.2,22.8,2.3,2.7,11.7,6.0,
4,180505,Cool & Crisp Citrus Chiffon Pie,3600,1800,Other,"[1, 1/4, 1/2, 1/2, 1, 1/2, 4, 4, 1/2, ...","[unflavored gelatin, water, sugar, lemon, ...",257.5,8.6,2.4,110.7,160.9,39.8,0.4,30.2,6.3,6.0,


In [7]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75604 entries, 0 to 75603
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   RecipeId                    75604 non-null  int64  
 1   Name                        75604 non-null  object 
 2   CookTime                    75604 non-null  int64  
 3   PrepTime                    75604 non-null  int64  
 4   RecipeCategory              75604 non-null  object 
 5   RecipeIngredientQuantities  75604 non-null  object 
 6   RecipeIngredientParts       75604 non-null  object 
 7   Calories                    75604 non-null  float64
 8   FatContent                  75604 non-null  float64
 9   SaturatedFatContent         75604 non-null  float64
 10  CholesterolContent          75604 non-null  float64
 11  SodiumContent               75604 non-null  float64
 12  CarbohydrateContent         75604 non-null  float64
 13  FiberContent                756

In [8]:
# Determines if recipe is veggie, vegan or omnivore
def categorizeRecipe(ingredients):
    meat_derivates = ["pork", "beef", "meat", "fish", "tuna", "chicken", "squid", "schrimp", "trout", "mussels", 
                      "fillet", "lamb", "scallops", "sardine", "salmon", "lobster", "steak", "bacon", "ham", "oyster"]
    animal_derivates = ["milk", "egg", "honey", "gelatin", "butter", "mayonnaise", "cheese", "margarine", 
                    " heavy", "yogurt", "pudding", "shortening", "ice cream", "chocolate", "alfredo", "Miracle Whip", "half-and-half"]
    vegan_exclusions = ["substitute", "peanut", "apple", "vegan", "soymilk"]
    vegan = True
    for ingredient in ingredients:
        if any(word in ingredient.lower() for word in meat_derivates):
            return "Omnivore"
        if ingredient in vegan_exclusions:
            continue
        if any(word in ingredient.lower() for word in animal_derivates):
            vegan = False
    if vegan: 
        return "Vegan"
    else: 
        return "Vegetarian"

recipes["RecipeDiet"] = recipes["RecipeIngredientParts"].apply(lambda x: categorizeRecipe(x))
recipes['RecipeDiet'] = recipes['RecipeDiet'].astype('category')

# Create another table "recipe extra info" columns category, ingredient quatities, parts
selected_columns = ['RecipeCategory', 'RecipeIngredientQuantities', 'RecipeIngredientParts', 'RecipeServings', 'RecipeYield']
recipe_extra_info = recipes[selected_columns]
recipes = recipes.drop(columns=selected_columns)

recipes

recipe_extra_info.head()


Unnamed: 0,RecipeCategory,RecipeIngredientQuantities,RecipeIngredientParts,RecipeServings,RecipeYield
0,Other,"[6, 2, 1 1/2, 1/4, 1/2, 4, 1 1/2, 1 1/2...","[hazelnuts, broccoli florets, fresh parsley ...",9.0,
1,Other,"[1, 3/4, 6, 5, 2, 1, 2]","[celery, onion, butter, chicken broth, lon...",8.0,
2,Other,"[3, 1/2, 1, 1, 3, 2, 1, 2 1/2, 2, 1, ...","[Copycat Taco Bell Seasoned Beef, yellow onio...",8.0,
3,Other,"[2, 1, 2, 2, 1, 1, 1/8, 1/4, 1, 4, 3...","[unsalted butter, yellow onion, carrots, ga...",6.0,
4,Other,"[1, 1/4, 1/2, 1/2, 1, 1/2, 4, 4, 1/2, ...","[unflavored gelatin, water, sugar, lemon, ...",6.0,


#### Requests pre-processing

In [9]:
requests.head()

Unnamed: 0,AuthorId,RecipeId,Time,HighCalories,HighProtein,LowFat,LowSugar,HighFiber
0,2001012259B,73440,1799.950949,0.0,Indifferent,0,0,0
1,437641B,365718,4201.82098,0.0,Yes,0,Indifferent,1
2,1803340263D,141757,6299.861496,0.0,Indifferent,1,Indifferent,0
3,854048B,280351,19801.365796,0.0,Yes,1,0,1
4,2277685E,180505,5400.093457,0.0,Indifferent,0,0,0


In [10]:
requests.info()
# no missing values: GOOD!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140195 entries, 0 to 140194
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   AuthorId      140195 non-null  object 
 1   RecipeId      140195 non-null  int64  
 2   Time          140195 non-null  float64
 3   HighCalories  140195 non-null  float64
 4   HighProtein   140195 non-null  object 
 5   LowFat        140195 non-null  int64  
 6   LowSugar      140195 non-null  object 
 7   HighFiber     140195 non-null  int64  
dtypes: float64(2), int64(3), object(3)
memory usage: 8.6+ MB


In [11]:
# renaming the columns
requests = requests.rename(columns={"HighCalories": "Calories", "HighProtein":"Protein", "LowFat": "Fat", "LowSugar": "Sugar", "HighFiber":"Fiber"})

In [12]:
# standardizing column Calorie to the same format
requests["Calories"] = requests["Calories"].astype("int")

# standardizing column Protein Yes->1
requests["Protein"] = requests["Protein"].replace("Yes","1")

# changing 0 -> 1 in column Sugar 
requests["Sugar"] = requests["Sugar"].replace("0","1")

# changing 0 -> 1 and 1 -> 0  column Fat
#requests["Fat"] = requests["Fat"].replace({1 : 0, 0 : 1})
requests["Fat"] = 1 - requests["Fat"]

# transforming macronutrients columns -> categories 
#requests[["Calories", "Protein", "Fiber", "Sugar","]] = requests[["Calories", "Protein", "Fiber", "Sugar", "Fat"]].astype("category")

requests


Unnamed: 0,AuthorId,RecipeId,Time,Calories,Protein,Fat,Sugar,Fiber
0,2001012259B,73440,1799.950949,0,Indifferent,1,1,0
1,437641B,365718,4201.820980,0,1,1,Indifferent,1
2,1803340263D,141757,6299.861496,0,Indifferent,0,Indifferent,0
3,854048B,280351,19801.365796,0,1,0,1,1
4,2277685E,180505,5400.093457,0,Indifferent,1,1,0
...,...,...,...,...,...,...,...,...
140190,163793B,78171,1560.649725,0,Indifferent,1,1,1
140191,33888B,333262,1502.011466,1,Indifferent,0,1,0
140192,401942C,49200,5999.274269,0,Indifferent,1,1,1
140193,346866B,214815,899.523513,0,1,0,Indifferent,1


#### Reviews pre-processing

In [13]:
reviews = reviews.drop(columns = ["Rating"])

In [14]:
"""
df_grouped_by_class = df.groupby(by="variety")

df_setosa = df_grouped_by_class.get_group("Setosa")
df_versicolor = df_grouped_by_class.get_group("Versicolor")
df_virginica = df_grouped_by_class.get_group("Virginica")

class_labels = {
    "Setosa" : {
        "color" : "blue",
        "data" : df_setosa
    },
    "Versicolor" : {
        "color" : "green",
        "data" : df_versicolor
    },
    "Virginica" : {
        "color" : "red",
        "data" : df_virginica
    }
}

for class_i in class_labels:
    class_color = class_labels[class_i]["color"]
    class_df = class_labels[class_i]["data"]
    p = sns.pairplot(class_df, diag_kind="hist", diag_kws={"color" : class_color}, plot_kws={"color" : class_color, "label" : class_i})
    p.fig.suptitle(class_i, y=1.0, size=15)
"""

'\ndf_grouped_by_class = df.groupby(by="variety")\n\ndf_setosa = df_grouped_by_class.get_group("Setosa")\ndf_versicolor = df_grouped_by_class.get_group("Versicolor")\ndf_virginica = df_grouped_by_class.get_group("Virginica")\n\nclass_labels = {\n    "Setosa" : {\n        "color" : "blue",\n        "data" : df_setosa\n    },\n    "Versicolor" : {\n        "color" : "green",\n        "data" : df_versicolor\n    },\n    "Virginica" : {\n        "color" : "red",\n        "data" : df_virginica\n    }\n}\n\nfor class_i in class_labels:\n    class_color = class_labels[class_i]["color"]\n    class_df = class_labels[class_i]["data"]\n    p = sns.pairplot(class_df, diag_kind="hist", diag_kws={"color" : class_color}, plot_kws={"color" : class_color, "label" : class_i})\n    p.fig.suptitle(class_i, y=1.0, size=15)\n'

In [15]:
"""
# We can also leverage the dataprep package to get a nice summary report
report = sv.analyze(df)
report.show_notebook()

# We can also leverage the yadata_profiling package to get a nice summary report
profile = ProfileReport(df, title="Iris Data - Summary Report")
profile
"""

'\n# We can also leverage the dataprep package to get a nice summary report\nreport = sv.analyze(df)\nreport.show_notebook()\n\n# We can also leverage the yadata_profiling package to get a nice summary report\nprofile = ProfileReport(df, title="Iris Data - Summary Report")\nprofile\n'

### Phase 3: Data Preparation

The goal is assure data quality: includes removing wrong/corrupt 
data entries and making sure the entries are standardized, e.g. enforcing certain encodings. 
Then transforms the data in order to make it suitable for the modelling step. This includes scaling, dimensionality
reduction, data augmentation, outlier removal, etc.\
 \
In practise, this will rarely be the case. On average, this step takes up to **80%** of 
the time of the whole project.

In [16]:
#To do: transform categorical feature into categorical variables (exemplo df["variety"] = df["variety"].astype("category"))
# fill/remove/change missing/corrupt values

# To do: ver se precisamos standardize alguma feature (exemplo na celula seguinte com o StandardScaler), se precisamos imputar valores em registros com valores nulos, 
# se precisamos lidar com outliers, se precisamos usar alguma estretégia de redução de dimensionalidade (tipo PCA na próxima celula)...

In [17]:
"""
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# data scaling
transform_scaler = StandardScaler()

# dimensionality reduction
transform_pca = PCA()

# value imputing

# outlier detection/removal
"""

'\nfrom sklearn.decomposition import PCA\nfrom sklearn.preprocessing import StandardScaler\n\n# data scaling\ntransform_scaler = StandardScaler()\n\n# dimensionality reduction\ntransform_pca = PCA()\n\n# value imputing\n\n# outlier detection/removal\n'

Join das 4 tabelas
- há users na tabela "diet" que nao estao na tabela "reviews" -- Ok!
- match perfeito de recipeid and authorid entre requests e reviews -- Otimo!
- todas as receitas de "recipes" estao sendo mostradas para pelo menos um usuario -- Ok!

In [18]:
# tabelas: diet, requests, reviews, recipes
dietrequestsmerged = diet.merge(requests, on = ["AuthorId"])
dietrequestsreviewsmerged = dietrequestsmerged.merge(reviews, on = ["AuthorId", "RecipeId"])
dietrequestsreviewsmerged = dietrequestsreviewsmerged.rename(columns={"Calories" : "Requested_Calories"})
mergedtables = dietrequestsreviewsmerged.merge(recipes, on = ["RecipeId"])
mergedtables = mergedtables.rename(columns={"Calories" : "Recipe_Calories"})

In [19]:
mergedtables["Total_time_recipe"] = mergedtables["CookTime"] + mergedtables["PrepTime"]
mergedtables[["Total_time_recipe", "CookTime", "PrepTime"]]
mergedtables = mergedtables.drop(columns=["PrepTime", "CookTime"])
mergedtables["Time"] = np.where(mergedtables["Time"] < 0, 28_000_000, mergedtables["Time"])
mergedtables["Recipe_Time_Match"] = (mergedtables["Total_time_recipe"] <= (1.2 * mergedtables["Time"]))


In [20]:
mergedtables[(mergedtables["Recipe_Time_Match"] == False)][["Recipe_Time_Match", "Total_time_recipe", "Time"]]
mergedtables = mergedtables.drop(columns=["Time", "Total_time_recipe"])

In [21]:
# categorical_values = ['Diet', 'RecipeDiet', 'Requested_Calories', 'Protein', 'Sugar', 'Fiber']

# fat_labels= [0 ,1]
# bins = [-1, 22.0, np.inf]
# mergedtables["FatCategory"]= pd.cut(mergedtables["FatContent"], bins = bins , labels= fat_labels)
# mergedtables["MatchFat"] = mergedtables["FatCategory"] == mergedtables["Fat"]
# mergedtables = mergedtables.drop(columns=["FatContent", "Fat", "FatCategory"])

# sugar_labels= [0 ,1]
# bins = [-1, 10.0, np.inf]
# mergedtables["SugarCategory"]= pd.cut(mergedtables["SugarContent"], bins = bins , labels= sugar_labels)
# mergedtables["MatchSugar"] = mergedtables["SugarCategory"] == mergedtables["Sugar"]
# mergedtables = mergedtables.drop(columns=["SugarContent", "Sugar", "SugarCategory"])

# protein_labels= [0 ,1]
# bins = [-1, 10.0, np.inf]
# mergedtables["ProteinCategory"]= pd.cut(mergedtables["ProteinContent"], bins = bins , labels= protein_labels)
# mergedtables["MatchProtein"] = mergedtables["ProteinCategory"] == mergedtables["Protein"]
# mergedtables = mergedtables.drop(columns=["ProteinContent", "Protein", "ProteinCategory"])

# fiber_labels= [0 ,1]
# bins = [-1, 5.0, np.inf]
# mergedtables["FiberCategory"]= pd.cut(mergedtables["FiberContent"], bins = bins, labels= fiber_labels)
# mergedtables["MatchFiber"] = mergedtables["FiberCategory"] == mergedtables["Fiber"]
# mergedtables = mergedtables.drop(columns=["FiberContent", "Fiber", "FiberCategory"])


# calories_labels= [0 ,1]
# bins = [-1, 5.0, np.inf]
# mergedtables["CaloriesCategory"]= pd.cut(mergedtables["Recipe_Calories"], bins = bins, labels= calories_labels)
# mergedtables["MatchCalories"] = mergedtables["Requested_Calories"] == mergedtables["Recipe_Calories"]
# mergedtables = mergedtables.drop(columns=["Requested_Calories", "Recipe_Calories", "CaloriesCategory"])'

mergedtables = mergedtables.drop(columns=["SaturatedFatContent", "CholesterolContent", "SodiumContent", "CarbohydrateContent"])


In [22]:
def diet_match(person_diet, recipe_diet):
    if person_diet == "Omnivore":
        return True
    if person_diet == "Vegetarian" and recipe_diet != "Omnivore":
        return True
    if person_diet == "Vegan" and recipe_diet == "Vegan":
        return True
    
    return False

mergedtables["RecipeMatch"] = mergedtables.apply(lambda row: diet_match(row["Diet"], row["RecipeDiet"]), axis= 1)

mergedtables[["RecipeMatch", "Diet", "RecipeDiet"]]


Unnamed: 0,RecipeMatch,Diet,RecipeDiet
0,True,Vegetarian,Vegan
1,False,Vegetarian,Omnivore
2,False,Vegetarian,Omnivore
3,False,Vegetarian,Omnivore
4,True,Omnivore,Omnivore
...,...,...,...
140190,True,Vegetarian,Vegetarian
140191,True,Omnivore,Omnivore
140192,True,Vegan,Vegan
140193,True,Vegetarian,Vegetarian


In [23]:
mergedtables = mergedtables.drop(columns=["Diet", "RecipeDiet"])

In [24]:
submissiondataset = mergedtables[mergedtables["Like"].isna()] #com Null na coluna Like
trainandtestdataset = mergedtables[mergedtables["Like"].notna()] #sem Null na coluna Like

trainandtestdataset.info()
submissiondataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 97381 entries, 0 to 140194
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   AuthorId            97381 non-null  object 
 1   Age                 97381 non-null  int64  
 2   RecipeId            97381 non-null  int64  
 3   Requested_Calories  97381 non-null  int32  
 4   Protein             97381 non-null  object 
 5   Fat                 97381 non-null  int64  
 6   Sugar               97381 non-null  object 
 7   Fiber               97381 non-null  int64  
 8   Like                97381 non-null  object 
 9   TestSetId           0 non-null      float64
 10  Name                97381 non-null  object 
 11  Recipe_Calories     97381 non-null  float64
 12  FatContent          97381 non-null  float64
 13  FiberContent        97381 non-null  float64
 14  SugarContent        97381 non-null  float64
 15  ProteinContent      97381 non-null  float64
 16  Recipe_T

#### Sampling

Split our data set into *train* and *test* data set.

In [25]:
# TODO: ver se vamos usar um split para validação, ou usar cross validation

In [26]:
from sklearn.model_selection import train_test_split

# Drop columns that should not be considered
# Drop Name because is string and Random Forest doesn't accept strings
selected_columns_test = ['AuthorId', 'RecipeId', 'TestSetId', 'Name']
test_extra_info = trainandtestdataset[selected_columns_test]
submission_extra_info = submissiondataset[selected_columns_test]

submissiondataset = submissiondataset.drop(columns= selected_columns_test)
trainandtestdataset = trainandtestdataset.drop(columns=selected_columns_test)

# Drop categorical values and transform them into one column for each of possible categories
# This also removes remaining string values
# ATTENTION: Eu nao sei se essa parte eh necessaria para o Linear Regression. Acredito que sim, mas, se nao, reorganizamos o codigo de repente
categorical_values = ["Requested_Calories", "Protein", "Fat", "Sugar", "Fiber"] # 'Diet', 'RecipeDiet', 

for column in categorical_values:
    new_data = pd.get_dummies(trainandtestdataset[column], prefix=column)
    trainandtestdataset = pd.concat([trainandtestdataset, new_data], axis=1)

    new_data = pd.get_dummies(submissiondataset[column], prefix=column)
    submissiondataset = pd.concat([submissiondataset, new_data], axis=1)
    
trainandtestdataset = trainandtestdataset.drop(columns=categorical_values)
submissiondataset = submissiondataset.drop(columns=categorical_values)

submissiondataset.head()


Unnamed: 0,Age,Like,Recipe_Calories,FatContent,FiberContent,SugarContent,ProteinContent,Recipe_Time_Match,RecipeMatch,Requested_Calories_0,Requested_Calories_1,Protein_1,Protein_Indifferent,Fat_0,Fat_1,Sugar_1,Sugar_Indifferent,Fiber_0,Fiber_1
5,52,,395.7,19.2,0.8,4.3,16.3,True,False,False,True,False,True,False,True,True,False,True,False
15,37,,104.4,8.2,2.0,4.0,2.1,True,True,False,True,False,True,False,True,True,False,True,False
20,55,,239.1,12.1,0.7,24.1,2.4,True,True,False,True,True,False,False,True,True,False,False,True
22,61,,239.1,12.1,0.7,24.1,2.4,True,True,True,False,False,True,False,True,False,True,True,False
23,45,,239.1,12.1,0.7,24.1,2.4,True,True,True,False,False,True,False,True,True,False,True,False


In [27]:
# Separate train and test data and X and Y variables

X_features = trainandtestdataset.drop(columns="Like")
Y_classes = trainandtestdataset["Like"]
Y_classes = Y_classes.astype('category')

trainandtestdataset.info()

X_train, X_test, Y_train, Y_test = train_test_split(X_features, Y_classes,
                                                    test_size=0.2, 
                                                    shuffle=True,
                                                    random_state=seed) # for reproducibility
train_df = X_train
train_df["Y_train"] = Y_train
train_df = train_df.loc[train_df["Recipe_Calories"] > 0]

to_be_filtered = ["Recipe_Calories", "FatContent", "FiberContent", "SugarContent", "ProteinContent"]

for column in to_be_filtered:
    good_max_value = train_df[column].mean() + 5 * train_df[column].std()
    good_min_value = train_df[column].mean() - 5 * train_df[column].std()

    train_df = train_df.loc[(train_df[column] > good_min_value) & (train_df[column] < good_max_value)]

X_train = train_df.drop(columns=["Y_train"])
Y_train = train_df["Y_train"]






<class 'pandas.core.frame.DataFrame'>
Index: 97381 entries, 0 to 140194
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Age                   97381 non-null  int64  
 1   Like                  97381 non-null  object 
 2   Recipe_Calories       97381 non-null  float64
 3   FatContent            97381 non-null  float64
 4   FiberContent          97381 non-null  float64
 5   SugarContent          97381 non-null  float64
 6   ProteinContent        97381 non-null  float64
 7   Recipe_Time_Match     97381 non-null  bool   
 8   RecipeMatch           97381 non-null  bool   
 9   Requested_Calories_0  97381 non-null  bool   
 10  Requested_Calories_1  97381 non-null  bool   
 11  Protein_1             97381 non-null  bool   
 12  Protein_Indifferent   97381 non-null  bool   
 13  Fat_0                 97381 non-null  bool   
 14  Fat_1                 97381 non-null  bool   
 15  Sugar_1               9

- X_train: 77.904 rows × 24 columns
- Y_train: 77.904 rows
- X_test: 19.477 rows × 24 columns
- Y_test: 19.477 rows

### Phase 4: Modeling

In this phase, the model is trained and tuned.

#### Logistic Regression

In [28]:
"""from sklearn.linear_model import LogisticRegression

#trying to adjust feature balance
logistic_regression = LogisticRegression(max_iter= 1000, class_weight='balanced' )

logistic_regression.fit(X_train, Y_train)

Y_pred = logistic_regression.predict(X_test)"""

"from sklearn.linear_model import LogisticRegression\n\n#trying to adjust feature balance\nlogistic_regression = LogisticRegression(max_iter= 1000, class_weight='balanced' )\n\nlogistic_regression.fit(X_train, Y_train)\n\nY_pred = logistic_regression.predict(X_test)"

#### Random Forest

In [29]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier()

random_forest.fit(X_train, Y_train)

Y_pred = random_forest.predict(X_test)

Thrsehold Adjustment

In [36]:
Y_probabilities = random_forest.predict_proba(X_test)[:, 1]

# lower the threshold greater the sensitivity
new_threshold = 0.25

Y_pred_adjusted = (Y_probabilities > new_threshold).astype(int)

##### Random Forest Analysis

Comparision between predictions 

Adjusted 
-accuracy: 0.87
-sensitivity =  0.72
-specificity =  0.89

No adjusted:
-accuracy: 0.90
-sensitivity =  0.44
-specificity =  0.97

In [37]:
from sklearn.metrics import classification_report

print("Threshold" , classification_report(Y_test, Y_pred_adjusted))

#print("ROC-AUC:", roc_auc_score(Y_test, Y_probabilities))

print("Sem ajuste", classification_report(Y_test, Y_pred))


Threshold               precision    recall  f1-score   support

         0.0       0.95      0.89      0.92     16935
         1.0       0.49      0.72      0.58      2542

    accuracy                           0.87     19477
   macro avg       0.72      0.80      0.75     19477
weighted avg       0.89      0.87      0.88     19477

Sem ajuste               precision    recall  f1-score   support

         0.0       0.92      0.97      0.94     16935
         1.0       0.68      0.43      0.53      2542

    accuracy                           0.90     19477
   macro avg       0.80      0.70      0.74     19477
weighted avg       0.89      0.90      0.89     19477



In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

accuracy = accuracy_score(Y_test, Y_pred)
print("Accuracy:", accuracy)
confusion_matrix = confusion_matrix(Y_test, Y_pred)
print(confusion_matrix)

true_negatives = confusion_matrix[0][0]
false_negatives = confusion_matrix[1][0]
false_positives = confusion_matrix[0][1]
true_positives = confusion_matrix[1][1]

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)

print("sensitivity = ", sensitivity)
print("specificity = ", specificity)

# Too many False predictions

# Possible ways to improve
# Re add the recipe name in some way - parse the string and see if the title is vegetarian. 
# Group the cook time in discrete chunks?
# Group the other nutritional facts columns of recipe in discrete chunks?
# Group age in chunks ?
# Drop some columns from recipe like sodium 
# Reduce dimensionality. I guess fat, saturated fat and cholesterol are correlated.


Accuracy: 0.9016789033218668
[[16437   498]
 [ 1417  1125]]
sensitivity =  0.44256490952006294
specificity =  0.9705934455270151


In [None]:
tabela = X_test
tabela["Like"] = Y_test
tabela["Pred"] = Y_pred
tabela[(tabela["Pred"] == 1) & (tabela["Like"] == False)]

Unnamed: 0,Age,Time,CookTime,PrepTime,Recipe_Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,...,Protein_1,Protein_Indifferent,Fat_0,Fat_1,Sugar_1,Sugar_Indifferent,Fiber_0,Fiber_1,Like,Pred
101507,56,599.866500,0,600,121.5,0.7,0.1,0.0,863.2,29.3,...,False,True,True,False,True,False,True,False,False,1.0
79998,37,1200.521945,0,1200,82.8,0.8,0.5,2.5,55.6,17.4,...,False,True,True,False,False,True,True,False,False,1.0
92766,77,300.899904,0,300,230.1,4.9,3.0,18.0,29.2,6.5,...,False,True,False,True,False,True,True,False,False,1.0
96770,75,2699.582278,1800,900,8.7,0.1,0.0,0.0,363.8,1.9,...,False,True,False,True,True,False,True,False,False,1.0
53604,77,43199.644528,0,43200,311.2,5.5,1.0,0.0,2.0,53.3,...,True,False,False,True,True,False,True,False,False,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105361,60,7799.063718,7200,600,90.8,3.5,0.5,0.0,1091.9,13.6,...,False,True,True,False,True,False,True,False,False,1.0
20278,78,5100.777689,4500,600,804.8,55.5,26.8,17.4,21.4,68.4,...,False,True,False,True,True,False,False,True,False,1.0
70597,71,3600.422585,2400,1200,82.3,4.9,2.8,44.5,40.0,7.9,...,False,True,True,False,True,False,True,False,False,1.0
20099,65,5401.158953,4500,900,1077.5,76.2,44.9,303.9,833.6,86.0,...,False,True,False,True,True,False,False,True,False,1.0


In [None]:
tabela.groupby(["Like", "Pred"]).mean()

  tabela.groupby(["Like", "Pred"]).mean()


Unnamed: 0_level_0,Unnamed: 1_level_0,Age,Time,CookTime,PrepTime,Recipe_Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,...,Requested_Calories_0,Requested_Calories_1,Protein_1,Protein_Indifferent,Fat_0,Fat_1,Sugar_1,Sugar_Indifferent,Fiber_0,Fiber_1
Like,Pred,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
False,0.0,45.484847,9211.543932,6768.098724,2438.393322,474.50862,25.064576,9.756361,86.37956,744.404815,45.282584,...,0.596818,0.403182,0.404452,0.595548,0.295989,0.704011,0.709092,0.290908,0.597665,0.402335
False,1.0,61.386139,6170.814155,4007.970297,2162.821782,747.667327,32.85099,11.962624,119.574752,948.912871,91.105198,...,0.643564,0.356436,0.341584,0.658416,0.356436,0.643564,0.673267,0.326733,0.601485,0.398515
True,0.0,55.551341,5193.365955,3131.45847,2061.77894,580.381426,27.49588,11.093394,98.080641,733.502027,64.582145,...,0.671027,0.328973,0.352518,0.647482,0.268803,0.731197,0.644866,0.355134,0.605625,0.394375
True,1.0,64.83613,8115.294357,3789.358342,4325.982231,806.904344,35.204047,13.716782,124.929911,1138.600592,100.924975,...,0.59921,0.40079,0.430405,0.569595,0.396841,0.603159,0.615005,0.384995,0.601185,0.398815


#### Submission

In [None]:
X_features_submission = submissiondataset.drop(columns="Like")
X_features_submission.head()

Unnamed: 0,Age,Recipe_Calories,FatContent,FiberContent,SugarContent,ProteinContent,Recipe_Time_Match,RecipeMatch,Requested_Calories_0,Requested_Calories_1,Protein_1,Protein_Indifferent,Fat_0,Fat_1,Sugar_1,Sugar_Indifferent,Fiber_0,Fiber_1
5,52,395.7,19.2,0.8,4.3,16.3,True,False,False,True,False,True,False,True,True,False,True,False
15,37,104.4,8.2,2.0,4.0,2.1,True,True,False,True,False,True,False,True,True,False,True,False
20,55,239.1,12.1,0.7,24.1,2.4,True,True,False,True,True,False,False,True,True,False,False,True
22,61,239.1,12.1,0.7,24.1,2.4,True,True,True,False,False,True,False,True,False,True,True,False
23,45,239.1,12.1,0.7,24.1,2.4,True,True,True,False,False,True,False,True,True,False,True,False


In [None]:
# submissiion

# Let's assume that our id column is the index of the dataframe

id = submission_extra_info['TestSetId']
Y_pred_submission = random_forest.predict(X_features_submission)

output = pd.DataFrame({'id': id, 'prediction': Y_pred_submission})

#output = output.rename(columns={'TestSetId': 'id'})

#output
output.info()
output['id'] = output["id"].astype('int')
output['prediction'] = output["prediction"].astype('int')
output.to_csv('analzticscuppredictionfile.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
Index: 42814 entries, 5 to 140189
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          42814 non-null  float64
 1   prediction  42814 non-null  float64
dtypes: float64(2)
memory usage: 1003.5 KB


In [None]:
output.head()

Unnamed: 0,id,prediction
5,41190.0,0.0
15,18123.0,0.0
20,36379.0,0.0
22,33658.0,0.0
23,24872.0,0.0


In [None]:
# Here, you want to find the best classifier. As candidates, consider
#   1. LogisticRegression
#   2. RandomForestClassifier
#   3. other algorithms from sklearn (easy to add)
#   4. custom algorithms (more difficult to implement)
    
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

model_logistic_regression = LogisticRegression(max_iter=30)
model_random_forest = RandomForestClassifier()
model_gradient_boosting = GradientBoostingClassifier()

# train the models
pipeline = Pipeline(steps=[("scaler", transform_scaler), 
                           ("pca", transform_pca),
                           ("model", None)])

parameter_grid_preprocessing = {
  "pca__n_components" : [1, 2, 3, 4],
}

parameter_grid_logistic_regression = {
  "model" : [model_logistic_regression],
  "model__C" : [0.1, 1, 10],  # inverse regularization strength
}

parameter_grid_gradient_boosting = {
  "model" : [model_gradient_boosting],
  "model__n_estimators" : [10, 20, 30]
}

parameter_grid_random_forest = {
  "model" : [model_random_forest],
  "model__n_estimators" : [10, 20, 50],  # number of max trees in the forest
  "model__max_depth" : [2, 3, 4],
}

meta_parameter_grid = [parameter_grid_logistic_regression,
                       parameter_grid_random_forest,
                       parameter_grid_gradient_boosting]

meta_parameter_grid = [{**parameter_grid_preprocessing, **model_grid}
                       for model_grid in meta_parameter_grid]

search = GridSearchCV(pipeline,
                      meta_parameter_grid, 
                      scoring="balanced_accuracy",
                      n_jobs=2, 
                      cv=5,  # number of folds for cross-validation 
                      error_score="raise"
)

# here, the actual training and grid search happens
search.fit(X_train, Y_train.values.ravel())

print("best parameter:", search.best_params_ ,"(CV score=%0.3f)" % search.best_score_)

### Step 5: Evaluation

Once the appropriate models are chosen, they are evaluated on the test set. For
this, different evaluation metrics can be used. Furthermore, this step is where
the models and their predictions are analyzed resp. different properties, including
feature importance, robustness to outliers, etc.

In [None]:
# evaluate performance of model on test set
print("Score on test set:", search.score(X_test, Y_test.values.ravel()))

# contingency table
ct = pd.crosstab(search.best_estimator_.predict(X_test), Y_test.values.ravel(),
                 rownames=["pred"], colnames=["true"])
print(ct)

In [None]:
# (optional, if you're curious) 
# for a detailed look on the performance of the different models
def get_search_score_overview():
  for c,s in zip(search.cv_results_["params"],search.cv_results_["mean_test_score"]):
      print(c, s)

print(get_search_score_overview())

#### Interpretability

##### Disclaimer: This only works if shap is installed.

In addition to models and their predictions, it is often important to understand _why_ a model makes certain predictions. 
There is a lot of literature on how this can be achieved (explainability), but we will only show the use of Shapley values
using the python module "shap", which is a combination of Shapley values and LIME. 
You can find more information on this topic [here](https://christophm.github.io/interpretable-ml-book/shap.html).

In [None]:
# assume random forest model
model = RandomForestClassifier(n_estimators=10, random_state=seed)
model.fit(X_train, Y_train.values.ravel())

# compute shapley values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)
shap_interaction_values = explainer.shap_interaction_values(X_train)

expected_value = explainer.expected_value
print(expected_value)

In [None]:
# class dependent plots of shapley values for each feature
for i,c in enumerate(df.variety.unique()):
    shap.summary_plot(shap_values[i], X_train, show=False)
    plt.title("Shapley values for "+str(c))
    plt.show()

From the computed SHAP values, we can interpret that the *petal.width* has a positive impact on the output of the model 
if the feature value is moderate. For high aand low values, the impact is negative. The same observation
holds for *petal.length*. Besides, the impact of the *sepal.length* and *sepal.width* features are rather low. By impact on a 
the target, we model the probability that we classify that target. Thus, if *petal.width* is high, it is more likely
that we classify the data point as Versicolor.

### Step 6: Deployment

Now that you have chosen and trained your model, it is time to deploy it to your
clients system. 

In [None]:
def micro_service_classify_iris(datapoint):
    
  # make sure the provided datapoints adhere to the correct format for model input

  # fetch your trained model
  model = search.best_estimator_

  # make prediction with the model
  prediction = model.predict(datapoint)

  return prediction


In the Analytics Cup, you need to export your prediction in a very specific output format. This is a csv file without an index and two columns, *id* and *prediction*. Note that the values in both columns need to be integer values, and especially in the *prediction* column either 1 or 0.

In [None]:
# To do: arrumar a celula abaixo com os nossos dataframes