# Recipe Site Traffic 2

## 1. Define the Problem and Project Objectives
Tasty Bytes currently selects the recipe displayed on the home page manually. Featuring a 'popular' recipe can increase traffic to the rest of the website by up to 40%, leading to more subscriptions and therefore more revenue.

The goal of this notebook is to analyze:
- how to correctly predict which recipes will be popular 80% of the time (accuracy)
- and minimize the chance of showing unpopular recipes (precision)
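
As a minimal sketch of the two metrics named above, the snippet below computes accuracy and precision with scikit-learn. The labels are hypothetical and are not taken from the recipe dataset; 1 stands for a high-traffic recipe.

```python
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical labels: 1 = high traffic, 0 = not high traffic
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1, 1, 0]

# Accuracy: share of all predictions that are correct
accuracy = accuracy_score(y_true, y_pred)
# Precision: share of predicted-popular recipes that really are popular
precision = precision_score(y_true, y_pred)

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}")
```

Accuracy counts every correct prediction, while precision only looks at the recipes the model flags as popular, which is what matters when choosing what to show on the home page.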

The dataset for this analysis is provided in the file recipe_site_traffic_2212.csv and contains the following fields:

| Column Name   | Details                                                                                                       |
|---------------|---------------------------------------------------------------------------------------------------------------|
| recipe        | Numeric, unique identifier of recipe                                                                          |
| calories      | Numeric, number of calories                                                                                   |
| carbohydrate  | Numeric, amount of carbohydrates in grams                                                                     |
| sugar         | Numeric, amount of sugar in grams                                                                             |
| protein       | Numeric, amount of protein in grams                                                                           |
| category      | Character, type of recipe. Recipes are listed in one of ten possible groupings ('Lunch/Snacks', 'Beverages', 'Potato', 'Vegetable', 'Meat', 'Chicken', 'Pork', '

## 2. Data Collection and Understanding

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

Let's read the data and get a first impression of its quality in terms of types and null values.

In [None]:
df = pd.read_csv('recipe_site_traffic_2212.csv')


In [None]:
df.head()

In [None]:
df.isna().sum()

There are nulls in the dataset.

In [None]:
df.info()

- The *servings* feature should be numeric
- *high_traffic* can be boolean and has nulls too

In [None]:
df['recipe'].nunique()

In [None]:
df.describe()

The range of *calories* is much larger than that of the other numeric features.

In [None]:
df[df['calories'].isna()]

In [None]:
df.shape

After a first look at the dataset, we can conclude:
- We have 947 recipes
- We have 6 independent variables and one target (*high_traffic*)
- 4 of the features have 2 null values, all of them in the same rows. These rows belong to different categories and servings values
- The target variable contains 377 null values, which should be encoded as 'False' as stated in the data dictionary provided
- The *servings* feature, which should contain integers, contains 3 faulty entries that must be cleaned
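
One way to check that the nulls in the four nutrient columns fall on the same rows is sketched below. The `toy` frame and its values are hypothetical stand-ins; the same mask logic could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real data comes from recipe_site_traffic_2212.csv
toy = pd.DataFrame({
    "calories": [120.0, np.nan, 300.0, np.nan],
    "carbohydrate": [10.0, np.nan, 25.0, np.nan],
    "sugar": [5.0, np.nan, 2.0, np.nan],
    "protein": [8.0, np.nan, 30.0, np.nan],
})
nutrient_cols = ["calories", "carbohydrate", "sugar", "protein"]

null_mask = toy[nutrient_cols].isna()
rows_any_null = null_mask.any(axis=1)  # at least one nutrient missing
rows_all_null = null_mask.all(axis=1)  # all four nutrients missing

# True when every row with a missing nutrient is missing all four
same_rows = bool((rows_any_null == rows_all_null).all())
print(same_rows, int(rows_any_null.sum()))
```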


## 3. Data Cleaning

### 3.1. Category type

The category feature has been loaded as an 'object'. We can encode it as a category. It will help us save some memory.
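
The memory claim can be checked with `memory_usage(deep=True)`. This is a small sketch on a hypothetical column of repeated labels, not on the real `category` column.

```python
import pandas as pd

# Hypothetical column with a few labels repeated many times
labels = pd.Series(["Chicken", "Pork", "Vegetable", "Beverages"] * 250)
as_object = labels.astype(object)
as_category = labels.astype("category")

# deep=True also counts the memory held by the Python strings themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object: {object_bytes} bytes, category: {category_bytes} bytes")
```

The saving comes from storing each distinct label once and keeping only small integer codes per row.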

In [None]:
df['category'].value_counts()

In [None]:
df['category'] = df['category'].astype('category')

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(data=df, x='category', hue='category')
plt.xticks(rotation=45)
plt.title('Distribution of categories')
plt.show()

### 3.2. Servings non numeric

Some of the *servings* values are non-numeric.

In [None]:
df['servings'].value_counts()

There are 3 odd *servings* values. Let's clean them up and treat servings as a category rather than a numeric field.

In [None]:
df['servings'] = df['servings'].str[0].astype('category')
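
Note that `.str[0]` keeps only the first character, so it would truncate a two-digit serving count if one existed. A slightly more defensive alternative, sketched here on hypothetical values, is to extract the leading run of digits with a regular expression.

```python
import pandas as pd

# Hypothetical values; the text suffix stands in for the faulty entries
servings = pd.Series(["4", "6", "4 as a snack", "1", "2"])

# Capture the first run of digits in each entry
digits = servings.str.extract(r"(\d+)", expand=False)
cleaned = digits.astype(int)

print(cleaned.tolist())
```

Calling `.astype('category')` on the result afterwards would match the notebook's choice of treating servings as categorical.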

In [None]:
sns.countplot(data=df, x='servings')

### 3.3. Null values in Calories, Protein, Carbohydrate and Sugar

Since these four features share the same issue, let's handle them together.

In [None]:
sns.histplot(data=df.calories)
plt.show()

In [None]:
sns.histplot(data=df.protein)
plt.show()

In [None]:
sns.histplot(data=df.sugar)
plt.show()

In [None]:
sns.histplot(data=df.carbohydrate)
plt.show()

The distributions of all four numeric features are highly right-skewed.

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=df, hue='category', y='calories', linewidth=0.5)
plt.title('Calories distribution by category')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=df, hue='category', y='protein', linewidth=0.5)
plt.title('Protein distribution by category')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=df, hue='category', y='sugar', linewidth=0.5)
plt.title('Sugar distribution by category')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=df, hue='category', y='carbohydrate', linewidth=0.5)
plt.title('Carbohydrate distribution by category')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=df, hue='servings', y='protein', linewidth=0.5)

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=df, hue='servings', y='sugar', linewidth=0.5)

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=df, hue='servings', y='calories', linewidth=0.5)

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=df, hue='servings', y='carbohydrate', linewidth=0.5)

Let's impute the missing calories, sugar, protein and carbohydrate values with the mean for each category. I initially thought about using the *servings* feature to refine the imputed mean, but these columns do not seem to vary with the number of servings.

In [None]:
for column in ['sugar', 'calories', 'protein', 'carbohydrate']:
    df[column] = df.groupby('category')[column].transform(lambda x: x.fillna(x.mean()))
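
To make the group-wise imputation concrete, here is a small sketch on a hypothetical two-category frame. It uses `transform("mean")`, which ignores missing values when computing each group's mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per category
toy = pd.DataFrame({
    "category": ["Pork", "Pork", "Pork", "Chicken", "Chicken"],
    "calories": [100.0, 300.0, np.nan, 50.0, np.nan],
})

# Per-row mean of the row's own category, aligned to the original index
group_means = toy.groupby("category")["calories"].transform("mean")
toy["calories_filled"] = toy["calories"].fillna(group_means)

print(toy)
```

Because the distributions are right-skewed, passing `"median"` instead of `"mean"` would be a reasonable variant to compare.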


In [None]:
df.head(20)

Let's convert the target column into a binary (0/1) one. This also gets rid of the null values in this column.

In [None]:
df['high_traffic'] = np.where(df['high_traffic'] == 'High', 1, 0)

In [None]:
df['high_traffic'].value_counts(normalize=True)

In [None]:
sns.countplot(data=df, x='high_traffic', hue='high_traffic')
plt.show()

If we chose a recipe at random, it would be a high-traffic one about 60% of the time. This could be our baseline.
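
A baseline like this can be written down explicitly with scikit-learn's `DummyClassifier`. The sketch below assumes a hypothetical target with exactly 60% positives; always predicting the majority class then gives 60% accuracy and 60% precision.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical target with 60% positives
y = np.array([1] * 60 + [0] * 40)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class (here: 1)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_precision = precision_score(y, y_pred)
print(baseline_accuracy, baseline_precision)
```

Any trained model should beat both numbers to be worth deploying.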

In [None]:
df.head(20)

In [None]:
df.describe()

In [None]:
df.info()

The data seems clean now that we have imputed the missing values, fixed the wrong types and cleaned the messy servings entries. We are now ready to prepare the data to train a model.

Since the protein, sugar, carbohydrate and calories columns are highly skewed, we will apply a log transformation to them. This compresses their ranges, reduces the influence of outliers and makes the distributions more symmetric, which should have a positive impact on training.

In [None]:
df['log_calories'] = np.log(df['calories'])  # np.log assumes calories are strictly positive
sns.kdeplot(data=df, x='calories')
plt.show()
sns.kdeplot(data=df, x='log_calories')
plt.show()

In [None]:
df['log_protein'] = np.log1p(df['protein']) # protein contains 0's
sns.kdeplot(data=df, x='protein')
plt.show()
sns.kdeplot(data=df, x='log_protein')
plt.show()

In [None]:
df['log_sugar'] = np.log1p(df['sugar'])
sns.kdeplot(data=df, x='sugar')
plt.show()
sns.kdeplot(data=df, x='log_sugar')
plt.show()

In [None]:
df['log_carbohydrate'] = np.log1p(df['carbohydrate'])
sns.kdeplot(data=df, x='carbohydrate')
plt.show()
sns.kdeplot(data=df, x='log_carbohydrate')
plt.show()

These new log_* features are more symmetric than the originals.
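
The gain in symmetry can also be quantified with the sample skewness, in addition to the visual check. This sketch uses synthetic log-normal values as a stand-in for a nutrient column; on the real data, `df['calories'].skew()` and `df['log_calories'].skew()` could be compared in the same way before the original columns are dropped.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for a nutrient column
values = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

skew_raw = values.skew()            # strongly positive for right-skewed data
skew_log = np.log1p(values).skew()  # closer to 0 means more symmetric

print(f"raw skew={skew_raw:.2f}, log skew={skew_log:.2f}")
```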

Let's now:
- get rid of *recipe* id, useless for training
- scale the numeric fields
- encode the *category* and *servings* features

In [None]:
df.head(20)

In [None]:
df = df.drop(['recipe', 'calories', 'sugar', 'protein', 'carbohydrate'], axis=1)

In [None]:
df.head()

In [None]:
from sklearn.preprocessing import StandardScaler

numeric_cols = ['log_calories', 'log_protein', 'log_sugar', 'log_carbohydrate']

scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

In [None]:
numeric_cols

In [None]:
df.head()

In [None]:
df_encoded = pd.get_dummies(df, columns=['category', 'servings'], drop_first=True)
df_encoded.head()

In [None]:
from sklearn.model_selection import train_test_split

X = df_encoded.drop('high_traffic', axis=1)
y = df_encoded['high_traffic']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
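

Scaling the full dataset before splitting lets statistics from the test rows influence the scaler. One possible alternative, sketched here on hypothetical toy data with invented values, is to wrap scaling and encoding in a scikit-learn `Pipeline` so that they are fitted on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names mirror the notebook, values are synthetic
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "log_calories": rng.normal(size=n),
    "log_protein": rng.normal(size=n),
    "category": rng.choice(["Pork", "Chicken", "Vegetable"], size=n),
    "servings": rng.choice(["1", "2", "4", "6"], size=n),
})
toy["high_traffic"] = (toy["category"] == "Vegetable").astype(int)

X = toy.drop(columns="high_traffic")
y = toy["high_traffic"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["log_calories", "log_protein"]),
    ("cat", OneHotEncoder(drop="first"), ["category", "servings"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns scaling statistics and encodings from the training rows only
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

A pipeline like this can be passed directly to `GridSearchCV`, in which case the preprocessing is refitted inside each cross-validation fold.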

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV

# Initialize models and parameters
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42)  # fixed seed for reproducible results
}

# Parameter grids
param_grids = {
    'Logistic Regression': {
        'C': [0.01, 0.1, 1, 10, 100],
        'penalty': ['l2'],
        'solver': ['lbfgs', 'liblinear']
    },
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    }
}

# Loop through models
for model_name, model in models.items():
    print(f"\nTraining Model: {model_name}")
    
    # For Logistic Regression and Random Forest, perform grid search
    grid_search = GridSearchCV(estimator=model, param_grid=param_grids[model_name], 
                               scoring='accuracy', cv=5, verbose=1, n_jobs=-1)
    
    # Fit the model
    grid_search.fit(X_train, y_train)
    
    # Best model from grid search
    best_model = grid_search.best_estimator_
    
    # Predictions
    y_pred_train = best_model.predict(X_train)
    y_pred_test = best_model.predict(X_test)
    
    # Calculate accuracies
    train_accuracy = accuracy_score(y_train, y_pred_train)
    test_accuracy = accuracy_score(y_test, y_pred_test)
    
    # Print results
    print(f"Best Parameters: {grid_search.best_params_}")
    print(f"Train Accuracy: {train_accuracy:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred_test)}")
    print(classification_report(y_test, y_pred_test))

In [None]:
# Define the best parameters from the previous grid search
current_best_params = {'C': 100, 'penalty': 'l2', 'solver': 'lbfgs'}

# Create a refined parameter grid for Logistic Regression
refined_param_grid = {
    'C': [80, 100, 120],  # Narrowing around the best C
    'penalty': ['l2'],  # Keeping the same penalty
    'solver': ['lbfgs', 'liblinear']  # Retaining the previously successful solvers
}

# Initialize Logistic Regression
logistic_model = LogisticRegression(max_iter=1000, class_weight={0: 2, 1: 1})

# Create grid search for refined Logistic Regression
refined_lr_grid_search = GridSearchCV(estimator=logistic_model, param_grid=refined_param_grid, 
                                       scoring='accuracy', cv=5, verbose=1, n_jobs=-1)

# Fit the grid search
refined_lr_grid_search.fit(X_train, y_train)

# Get the best refined Logistic Regression model and parameters
best_refined_lr_model = refined_lr_grid_search.best_estimator_
best_refined_lr_params = refined_lr_grid_search.best_params_

# Make predictions on the test set
refined_lr_y_pred = best_refined_lr_model.predict(X_test)

# Calculate training accuracy
train_accuracy = best_refined_lr_model.score(X_train, y_train)

# Print results for refined Logistic Regression
print("Refined Logistic Regression:")
print(f"Best Parameters: {best_refined_lr_params}")
print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {accuracy_score(y_test, refined_lr_y_pred):.4f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, refined_lr_y_pred)}")
print(classification_report(y_test, refined_lr_y_pred))

The overall accuracy of this first model trained on the available data is 75%, meaning that we correctly predict whether a recipe will be popular about 3 times out of 4. After adjusting the class weights, the precision of the model is around 87%: when it predicts that a recipe is popular, the recipe is actually popular about 87 times out of 100.
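
The effect of the class weights can be examined by fitting the same model with and without them. The sketch below uses synthetic data from `make_classification` as a stand-in for the recipe features, so its numbers say nothing about the results reported above. Weighting class 0 more heavily typically makes the model more cautious about predicting class 1, which tends to trade recall for precision.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 60% positives, standing in for the recipe features
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.4, 0.6], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for label, weights in [("unweighted", None), ("class 0 weighted x2", {0: 2, 1: 1})]:
    model = LogisticRegression(max_iter=1000, class_weight=weights)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Store both metrics so the trade-off is visible side by side
    results[label] = {
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }
    print(label, results[label])
```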

In [None]:
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': best_refined_lr_model.coef_[0]
})

# Display feature importance sorted by absolute value
feature_importance['Importance'] = feature_importance['Importance'].abs()  # Optional: for magnitude sorting
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)
print(feature_importance)

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(
    data=feature_importance,
    x='Importance',
    y='Feature',
    hue='Feature',  # palette without hue is deprecated in recent seaborn versions
    palette='viridis',
    legend=False
)
plt.title('Feature Importances (Logistic Regression Coefficients)')
plt.show()