# Data Scientist Example Practical Exam Solution - Tasty Bytes

## Data Validation
This data set has 947 observations and 8 columns, one of which is the target variable. I validated every column against the data dictionary and identified a few violations. Data dictionary with the validation issues identified:
 

| Column       | Type      | Description                                                     | Validation Issues                                              |
|--------------|-----------|-----------------------------------------------------------------|---------------------------------------------------------------|
| recipe       | Numeric   | Unique identifier of recipe                                     | As described. No cleaning needed.                              |
| calories     | Numeric   | Number of calories                                              | Type as described. 52 missing values.                          |
| carbohydrate | Numeric   | Amount of carbohydrates in grams                                | Type as described. 52 missing values.                          |
| sugar        | Numeric   | Amount of sugar in grams                                        | Type as described. 52 missing values.                          |
| protein      | Numeric   | Amount of protein in grams                                      | Type as described. 52 missing values.                          |
| category     | Character | Type of recipe. One of ten possible categories listed           | One extra category ('Chicken Breast') in the dataset.          |
| servings     | Numeric   | Number of servings for the recipe                               | Three values contain non-numeric characters.                   |
| high_traffic | Character | Indicates if the traffic to the site was high for this recipe   | Values marked with "High" if high traffic.                     |

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

plt.style.use('ggplot')

In [None]:
df = pd.read_csv('recipe_site_traffic_2212.csv')
df.info()

In [None]:
df.head()

In [None]:
# check the presence of null values in the dataset
df.isna().sum()

In [None]:
# locate the null values in the calories, carbo., sugar and protein.
df[df['calories'].isna()]

In [None]:
# duplicate recipe ids?
df['recipe'].nunique()

In [None]:
# validate the list of categories
df['category'].nunique()

In [None]:
# Identification of violating category
expected_cats = set(['Lunch/Snacks', 'Beverages', 'Potato', 'Vegetable', 'Meat', 'Chicken', 'Pork', 'Dessert', 'Breakfast', 'One Dish Meal'])
existing_cats = set(df['category'].unique().tolist()) 
extra_cat = existing_cats.difference(expected_cats)
print(extra_cat)

In [None]:
# Since the violating category is a variety of an existing category, let's merge it into that category
df.loc[df['category']=='Chicken Breast', 'category'] = 'Chicken'

In [None]:
# are categories ok now?
existing_cats = set(df['category'].unique().tolist()) 
extra_cat = existing_cats.difference(expected_cats)
print(extra_cat)

In [None]:
# Overview on servings
df['servings'].unique()

In [None]:
# Is servings numeric?
df['servings'].dtype

In [None]:
# Convert the violating values to valid ones and adapt the type
df.loc[df['servings']=='4 as a snack', 'servings'] = '4'
df.loc[df['servings']=='6 as a snack', 'servings'] = '6'
df['servings'] = df['servings'].astype('int')
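
The two replacements above handle the specific strings found in this dataset. A slightly more general alternative is to keep only the leading digits of each value. The snippet below is a minimal sketch of that idea on a made-up sample (the `servings_raw` values are hypothetical, not taken from the file):

```python
import pandas as pd

# Hypothetical sample mimicking the raw 'servings' column, which is read as strings
servings_raw = pd.Series(['4', '6 as a snack', '2', '4 as a snack', '1'])

# Keep the leading run of digits and convert to integers.
# A value without leading digits would become NaN and make astype(int) fail,
# which doubles as a validation check.
servings_clean = servings_raw.str.extract(r'^(\d+)', expand=False).astype(int)

print(servings_clean.tolist())
```

This avoids listing every offending value by hand, at the cost of silently dropping whatever text follows the number.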

In [None]:
# High traffic values 
df['high_traffic'].value_counts()

In [None]:
# Let's get rid of the null values in the target variable by converting high_traffic into a boolean
df['high_traffic'] = df['high_traffic']=='High' 
df['high_traffic'].value_counts()

In [None]:
# The only data validation problem left is the missing values in the nutritional info columns.
# 52 rows are about 5% of the observations, which is too many to simply remove.
# Let's see how the different nutritional features vary by category

nutritional_cols = ['protein', 'sugar', 'carbohydrate', 'calories']

for feature in nutritional_cols:
    plt.figure(figsize=(12,6))
    sns.boxplot(data=df, hue='category', y=feature, linewidth=0.5)
    plt.title('{} distribution by category'.format(feature))
    plt.show()


In [None]:
# Let's do the same for servings

for feature in nutritional_cols:
    plt.figure(figsize=(12,6))
    sns.boxplot(data=df, hue='servings', y=feature, linewidth=0.5)
    plt.title('{} distribution by servings'.format(feature))
    plt.show()

In [None]:
# Since the distributions of the nutritional features seem to be affected by the category and not by the servings,
# let's use the category mean to impute the missing nutritional values

for column in nutritional_cols:
    df[column] = df.groupby('category')[column].transform(lambda x: x.fillna(x.mean()))
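
To make the imputation step concrete, here is a small worked example of the same `groupby` + `transform` pattern on a toy frame. The values are invented for illustration only:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one missing calories value per category
toy = pd.DataFrame({
    'category': ['Pork', 'Pork', 'Pork', 'Dessert', 'Dessert'],
    'calories': [100.0, 300.0, np.nan, 50.0, np.nan],
})

# Replace each missing value with the mean of its own category
toy['calories'] = (
    toy.groupby('category')['calories']
       .transform(lambda s: s.fillna(s.mean()))
)

print(toy)
```

The missing Pork value becomes 200 (mean of 100 and 300) and the missing Dessert value becomes 50. Because the nutritional features are right skewed, the category median could be a reasonable alternative to the mean; that is a design choice, not something tested here.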

### Post Data Validation Corrections

After fixing the violations identified against the provided data dictionary, the dataset looks as follows:

| Column       | Type      | Actions undertaken                                                                                           |
|--------------|-----------|-------------------------------------------------------------------------------------------------------------|
| recipe       | Numeric   | No cleaning needed.                                              |
| calories     | Numeric   | 52 missing values filled with the mean of the **category**.                              |
| carbohydrate | Numeric   | 52 missing values filled with the mean of the **category**.                |
| sugar        | Numeric   | 52 missing values filled with the mean of the **category**.                        |
| protein      | Numeric   | 52 missing values filled with the mean of the **category**.                      |
| category     | Character | One extra category value ('Chicken Breast') merged with 'Chicken'.                           |
| servings     | Numeric   | Non-numeric characters converted with no information loss.                |
| high_traffic | Boolean   | Converted to boolean where True indicates 'High' and False indicates not 'High'.                             |

## Exploratory Analysis

After having investigated all the features and their relationship with the target variable I decided to apply the following changes to ease the modeling part:
- log transformation to the nutritional features
- servings as category
- drop recipe

### Target Variable - high_traffic

*high_traffic* is the variable we are trying to predict. It is moderately imbalanced (roughly 60% high traffic vs. 40% not).

In [None]:
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x='high_traffic', hue='high_traffic')
plt.title('Distribution of the target variable (high_traffic)')
plt.show()

### Numeric Variables - Calories, Sugar, Protein, Carbohydrate and Servings

The distributions of the nutritional features (calories, sugar, protein and carbohydrate) are right skewed. A log transformation reduces the skew and the influence of extreme values, which tends to help linear models such as Logistic Regression, even though these models do not strictly require normally distributed features.

In [None]:
# Let's first see how the nutritional columns are distributed
fig, axes = plt.subplots(1,4,figsize=(15,5))
sns.histplot(data=df, x='calories', ax=axes[0]).set(title='Distribution of Calories')
sns.histplot(data=df, x='sugar', ax=axes[1]).set(title='Distribution of Sugar')
sns.histplot(data=df, x='protein', ax=axes[2]).set(title='Distribution of Protein')
sns.histplot(data=df, x='carbohydrate', ax=axes[3]).set(title='Distribution of Carbo.')

In [None]:
# Since the distributions are right skewed, let's apply a log transformation to try to normalize them
for col in nutritional_cols: 
    df[col + '_log'] = np.log1p(df[col])
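
Besides the histograms, the effect of the transformation can be checked numerically with the sample skewness. The sketch below uses a synthetic log-normal sample as a stand-in for a right-skewed column such as calories; it is not the real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed sample (assumption: log-normal, standing in for a nutritional column)
sample = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=1000))

# Skewness near 0 indicates a roughly symmetric distribution
skew_before = sample.skew()
skew_after = np.log1p(sample).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

On the real dataset the same check would be `df[nutritional_cols].skew()` compared against the skewness of the `_log` columns.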

In [None]:
fig, axes = plt.subplots(1,4,figsize=(15,5))
sns.histplot(data=df, x='calories_log', ax=axes[0]).set(title='Distribution of log(Calories)')
sns.histplot(data=df, x='sugar_log', ax=axes[1]).set(title='Distribution of log(Sugar)')
sns.histplot(data=df, x='protein_log', ax=axes[2]).set(title='Distribution of log(Protein)')
sns.histplot(data=df, x='carbohydrate_log', ax=axes[3]).set(title='Distribution of log(Carbo.)')

In [None]:
# Once transformed, let's see how the different nutritional features relate to the target variable
fig, axes = plt.subplots(2,2,figsize=(15,5))
sns.boxplot(data=df, y='calories_log', hue='high_traffic', width=0.3, ax=axes[0][0]).set(title='Relation log(Calories) and High_Traffic')
sns.boxplot(data=df, y='sugar_log', hue='high_traffic', width=0.3, ax=axes[0][1]).set(title='Relation log(Sugar) and High_Traffic')
sns.boxplot(data=df, y='protein_log', hue='high_traffic', width=0.3, ax=axes[1][0]).set(title='Relation log(Protein) and High_Traffic')
sns.boxplot(data=df, y='carbohydrate_log', hue='high_traffic', width=0.3, ax=axes[1][1]).set(title='Relation log(Carbo.) and High_Traffic')

In [None]:
# Let's see how the different servings values are distributed
sns.countplot(data=df, x='servings')

In [None]:
# Let's now see how servings is linked to the target variable
sns.countplot(data=df, x='servings', hue='high_traffic')

### Categorical Variables - Category

After analysing the category feature and its relationship with the target variable, it seems to be the feature that best explains the target variable.

In [None]:
# Let's convert category into a categorical feature.
df['category'] = df['category'].astype('category')

In [None]:
# And see how many samples of each category we have
plt.figure(figsize=(12,6))
sns.countplot(data=df, x='category', hue='category', palette="dark")
plt.xticks(rotation=45)
plt.title('Distribution of categories')
plt.show()

In [None]:
# And let's now see how the recipes in each category are split between high traffic and not
plt.figure(figsize=(12,6))
sns.countplot(data=df, x='category', hue='high_traffic')
plt.xticks(rotation=45)
plt.title('Relationship categories and target variable')
plt.show()

So far, *category* seems to be the feature that best explains the target variable.

## Model Fitting & Evaluation

Predicting a boolean variable is a classification problem in machine learning. Due to the small number of features and observations I am picking **Logistic Regression** as the model of choice and I will add a **Gradient Boosting Classifier** to compare. Logistic Regression can be interpreted through its coefficients, Gradient Boosting exposes feature importances, and both should be flexible enough to predict the target variable in this case.

To ensure the best possible hyperparameters, I will perform a grid search during the training process. The model’s performance will be evaluated using cross-validation to avoid overfitting and provide a more reliable estimate of its generalization ability.

For the final evaluation, the two metrics I am choosing are **accuracy** and **precision**. Accuracy provides an overall sense of how well the model is predicting both classes correctly. However, precision is particularly important for this scenario because it focuses on the positive class (high traffic recipes) and indicates the percentage of correctly identified high traffic recipes out of all recipes predicted as high traffic.
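
As a small illustration of the two metrics, the sketch below computes accuracy and precision from a confusion matrix. The labels are made up for the example (1 = high traffic) and are not taken from the dataset:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Hypothetical true and predicted labels (1 = high traffic, 0 = not high traffic)
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

# For binary labels sorted as [0, 1], ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```

Here 4 of the 5 recipes predicted as high traffic really are high traffic, so precision is 0.80 even though accuracy is only 0.70.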

In our specific case, we aren’t concerned if a true high traffic recipe does not make it to the homepage. What we want to avoid is incorrectly displaying a low traffic recipe. Therefore, precision will help minimize these false positives and align with our objective.

As a baseline, displaying a recipe picked at random from the provided dataset is right about 60% of the time, which corresponds to an accuracy (and a precision) of roughly 60% for a model that always predicts high traffic.
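
The baseline can be reproduced with a dummy model that always predicts the majority class. This is a minimal sketch on a synthetic target with a 60/40 split, assumed here to mirror the class balance described above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score

# Synthetic target (assumption): 60 high traffic recipes, 40 not
y_base = np.array([1] * 60 + [0] * 40)
X_base = np.zeros((len(y_base), 1))  # features are ignored by the dummy model

# Always predict the most frequent class (high traffic)
baseline = DummyClassifier(strategy='most_frequent').fit(X_base, y_base)
y_hat = baseline.predict(X_base)

baseline_accuracy = accuracy_score(y_base, y_hat)
baseline_precision = precision_score(y_base, y_hat)

print(baseline_accuracy, baseline_precision)
```

Any trained model should clearly beat both numbers to be worth deploying.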

## Prepare Data for Modelling

I am going to use all the variables (except recipe) as features and the high_traffic column as the target variable.

On top of the imputation of the null values and the log transformation of the nutritional features, I have applied the following modifications to the input features:
- Drop unused features
- Normalize the numeric features
- Convert the categorical variables into numeric features
- Split the data into a training set and a test set

In [None]:
df = df.drop(['recipe', 'calories', 'sugar', 'protein', 'carbohydrate'], axis=1)

In [None]:
from sklearn.preprocessing import StandardScaler

numeric_cols = ['calories_log', 'protein_log', 'sugar_log', 'carbohydrate_log']

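# Note: the scaler is fitted on the full dataset, before the train/test split,
# so the test rows contribute to the scaling statistics
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])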

In [None]:
df['servings'] = df['servings'].astype('category')

df_encoded = pd.get_dummies(df, columns=['category', 'servings'], drop_first=True)


In [None]:
X = df_encoded.drop('high_traffic', axis=1)
y = df_encoded['high_traffic']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
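
One caveat with the preparation above is that scaling happens before the split. A common way to keep the test set fully unseen is to split first and wrap the preprocessing in a pipeline, so that it is fitted on the training rows only. The following is a sketch of that alternative on synthetic data; the `demo` frame and its values are hypothetical and only borrow the column names used in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in for the prepared dataset (hypothetical values)
demo = pd.DataFrame({
    'calories_log': rng.normal(5, 1, n),
    'category': rng.choice(['Pork', 'Dessert', 'Beverages'], n),
    'servings': rng.choice([1, 2, 4, 6], n),
})
demo['high_traffic'] = demo['category'].eq('Pork') | (rng.random(n) < 0.3)

X_demo = demo.drop(columns='high_traffic')
y_demo = demo['high_traffic']

# Split first, so nothing is learned from the test rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scale numeric columns and one-hot encode categorical ones inside the pipeline
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['calories_log']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['category', 'servings']),
])
pipe = Pipeline([
    ('prep', preprocess),
    ('clf', LogisticRegression(max_iter=1500)),
])

pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.3f}")
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the scaler is refitted within each cross-validation fold.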

## Training and Evaluation

In [None]:
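# Initialize models and parameters
# class_weight doubles the weight of the negative class (not high traffic),
# which penalizes false positives more heavily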
models = {
    'Logistic Regression': LogisticRegression(max_iter=1500, class_weight={0: 2, 1: 1}),
    'Gradient Boosting': GradientBoostingClassifier()
}

# Parameter grids
param_grids = {
    'Logistic Regression': {
        'C': [0.01, 0.1, 1, 10, 100],
        'penalty': ['l2'],
        'solver': ['lbfgs', 'liblinear']
    },
    'Gradient Boosting': {
        'n_estimators': [20, 40, 70, 100, 200],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [2, 4, 7]
    }
}

# Capture the trained models
trained_models={}

# Loop through models
for model_name, model in models.items():
    print(f"\nTraining Model: {model_name}")
    
    # Perform grid search
    grid_search = GridSearchCV(estimator=model, param_grid=param_grids[model_name], 
                               scoring='accuracy', cv=5, verbose=1, n_jobs=-1)
    
    # Fit the model
    grid_search.fit(X_train, y_train)
    
    # Best model from grid search
    best_model = grid_search.best_estimator_
    
    # Predictions
    y_pred_train = best_model.predict(X_train)
    y_pred_test = best_model.predict(X_test)
    
    # Calculate accuracies
    train_accuracy = accuracy_score(y_train, y_pred_train)
    test_accuracy = accuracy_score(y_test, y_pred_test)
    
    # Print results
    print(f"Best Parameters: {grid_search.best_params_}")
    print(f"Train Accuracy: {train_accuracy:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred_test)}")
    print(classification_report(y_test, y_pred_test))

    # Capture trained model 
    trained_models[model_name] = best_model
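
The grid search above selects hyperparameters by accuracy. Since precision is the metric that matters most for this use case, one option is to track both metrics and refit on precision. The sketch below shows that variant on synthetic data generated with `make_classification`; it is not the search that was run in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (assumption, for illustration only)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

c_values = [0.01, 0.1, 1, 10]
search = GridSearchCV(
    LogisticRegression(max_iter=1500),
    param_grid={'C': c_values},
    scoring={'accuracy': 'accuracy', 'precision': 'precision'},  # track both metrics
    refit='precision',  # pick the final model by cross-validated precision
    cv=5,
)
search.fit(X_syn, y_syn)

best_c = search.best_params_['C']
cv_precision = search.cv_results_['mean_test_precision'][search.best_index_]
cv_accuracy = search.cv_results_['mean_test_accuracy'][search.best_index_]

print(f"best C={best_c}, CV precision={cv_precision:.3f}, CV accuracy={cv_accuracy:.3f}")
```

With `refit='precision'`, `best_estimator_` is the model with the highest mean cross-validated precision, while the accuracy of that same configuration remains available in `cv_results_`.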
   

In [None]:
def display_feat_imp(feature_importance, model_name):
    # Rank features by absolute importance (coefficients can be negative)
    # without modifying the caller's DataFrame
    feature_importance = feature_importance.assign(
        Importance=feature_importance['Importance'].abs()
    ).sort_values(by='Importance', ascending=False)
    
    plt.figure(figsize=(12,6))
    sns.barplot(
        data=feature_importance,
        x='Importance',
        y='Feature',  # one labelled bar per feature
        hue='Feature',
        palette='Spectral'
    )
    plt.title('Feature Importances ({})'.format(model_name))
    plt.show()

In [None]:
# feature importances of the logistic regression model
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': trained_models['Logistic Regression'].coef_[0]
})

display_feat_imp(feature_importance, 'Logistic Regression')

In [None]:
# feature importances of the gradient boosting model
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': trained_models['Gradient Boosting'].feature_importances_
})

display_feat_imp(feature_importance, 'Gradient Boosting')
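
The plots above use absolute values, which hides the direction of each effect in the Logistic Regression model. As a complement, the signed coefficients can be turned into odds ratios. This is a minimal sketch on synthetic data with placeholder feature names (`f0` to `f3`), not the fitted model from this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and placeholder feature names (assumptions for illustration)
X_arr, y_arr = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
features = ['f0', 'f1', 'f2', 'f3']

clf = LogisticRegression(max_iter=1500).fit(X_arr, y_arr)

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-unit increase in the feature, holding the others fixed
coef_table = pd.DataFrame({
    'Feature': features,
    'Coefficient': clf.coef_[0],
    'OddsRatio': np.exp(clf.coef_[0]),
}).sort_values('Coefficient', key=abs, ascending=False)

print(coef_table)
```

An odds ratio above 1 means the feature pushes predictions towards high traffic, and below 1 away from it. For one-hot encoded categories the comparison is against the dropped reference level.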

## Results

In terms of **test accuracy**, the Logistic Regression model beats Gradient Boosting, probably because the latter seems to be overfitting the training data (its training accuracy is higher than its test accuracy). The **precision**, or the ability to _minimize false positives_, is also higher for the Logistic Regression model. So for this scenario, we have a clear winner. Since the best model was found using Grid Search, I do not consider it necessary to further fine tune the parameters already found.

## Evaluation based on Business Criteria

Tasty Bytes wants to predict which recipes will lead to high traffic, and to correctly predict high traffic recipes 80% of the time.

With the Logistic Regression model we have trained we can even go beyond the initial business requirements and provide a tool that, when predicting a high traffic recipe, will be right 88% of the time. 

We can monitor this with a KPI defined as *the percentage of high traffic recipes among those displayed on the home page*, assuming these recipes have been picked by the tool.
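
A minimal sketch of how this KPI could be computed from a log of displayed recipes. The `displayed` table, its column names and its values are hypothetical; no such log exists in the provided data:

```python
import pandas as pd

# Hypothetical log of recipes shown on the home page
displayed = pd.DataFrame({
    'recipe': [101, 102, 103, 104, 105],
    'picked_by_model': [True, True, True, True, False],
    'high_traffic': [True, True, False, True, True],
})

# Only count recipes that were selected by the tool
model_picks = displayed[displayed['picked_by_model']]

# KPI: share of model-picked recipes that actually produced high traffic
kpi = model_picks['high_traffic'].mean() * 100

print(f"High traffic recipes among model picks: {kpi:.1f}%")
```

In this toy log 3 of the 4 model-picked recipes led to high traffic, so the KPI is 75%, which would be below the 80% target.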


## Recommendation 

To make the success of the recipes displayed on the home page more consistent, to remove the dependency on the expertise of the person manually picking the recipes, and to increase the level of automation at Tasty Bytes, I recommend deploying the Logistic Regression model and starting to use it as soon as possible.

In a first phase, I would recommend using it to assist the manual selection currently happening, in order to identify and fix errors, fine tune the model if necessary, etc.

I would also like to assess how easy it would be to get the rest of the data currently on the platform that I saw on the website, such as ingredients, cost per serving and time to make. These features could have a positive impact on the performance of the model, and if they are readily available it would be worth testing their inclusion.

In any case, once we get the OK from the person in charge, in a second phase I would deploy the model to production and automate the selection of the recipe displayed on the home page using the model's predictions.

Along the way, I would keep collecting data to monitor the model, identify performance drops and possible data or concept drift, and retrain it regularly as new data becomes available.
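
As an illustration of what such monitoring could look like, the sketch below tracks precision per week and flags the weeks that fall below the 80% business target. The `log` table, its columns and its values are hypothetical:

```python
import pandas as pd

TARGET_PRECISION = 0.80  # business target mentioned above

# Hypothetical production log: model predictions and the traffic observed afterwards
log = pd.DataFrame({
    'week': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'predicted_high': [True] * 12,
    'actual_high': [True, True, True, True,
                    True, True, True, False,
                    True, False, False, True],
})

# Precision per week: share of predicted-high recipes that were really high traffic
predicted = log[log['predicted_high']]
weekly_precision = predicted.groupby('week')['actual_high'].mean()

# Weeks where precision dropped below the target and a closer look may be needed
alerts = weekly_precision[weekly_precision < TARGET_PRECISION].index.tolist()

print(weekly_precision)
print("Weeks below target:", alerts)
```

A sustained drop like the one in weeks 2 and 3 of this toy log would be a signal to investigate drift and consider retraining.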


