# **Restaurant revenue forecast**

This notebook builds regression models that predict a restaurant's monthly revenue. It compares a random forest, a multilayer perceptron (MLP) neural network, and XGBoost.

#### Notebook structure (**follows CRISP-DM framework**)

1. Business understanding<br>
2. Data understanding (Exploratory Data Analysis)<br>
3. Data preparation<br>
4. Feature selection<br>
5. Modelling<br>
6. Evaluation<br>
7. Deployment<br>

At the end of the notebook, you will find directions for how to submit your work.  Let's get started by importing the necessary libraries and reading in the data.

# <label style="color:blue">Part I : Business understanding</label>
Forecasting business metrics matters to every business because it lets the business act proactively.
Imagine you know that next month your business is not going to make a profit, meaning it will run at a loss. <br/><br/>
With that knowledge you can react early, for example by launching campaigns that help you avoid the loss.

In [1]:
# import all the libraries (duplicates and unused classification imports removed)
import pickle
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from scipy.stats import pearsonr, skew
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# silence library warnings in the notebook output
warnings.filterwarnings('ignore')



In [2]:
#read data from csv to dataframe
restaurant_data = pd.read_csv("Restaurant_revenue.csv")

# <label style="color:blue" id="Exploratory-Data-Analysis">Part II : Data understanding (Exploratory Data Analysis)</label>


`1.` What is the distribution of all numeric variables?  <br/>
`2.` What is the distribution of our target variable, monthly revenue? <br/>
`3.` What is the distribution of customer spending, menu price, and number of customers? <br/>
`4.` What are the summary statistics of the numerical columns? 

In [3]:
#check data dimensions
restaurant_data.shape

(1000, 8)

In [4]:
#check columns and their datatypes
restaurant_data.dtypes

Number_of_Customers            int64
Menu_Price                   float64
Marketing_Spend              float64
Cuisine_Type                  object
Average_Customer_Spending    float64
Promotions                     int64
Reviews                        int64
Monthly_Revenue              float64
dtype: object

In [5]:
#show data statistical attributes
restaurant_data.describe()

| Statistic | Number_of_Customers | Menu_Price | Marketing_Spend | Average_Customer_Spending | Promotions | Reviews | Monthly_Revenue |
|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 53.271 | 30.21912 | 9.958726 | 29.477085 | 0.497 | 49.837 | 268.724172 |
| std | 26.364914 | 11.27876 | 5.845586 | 11.471686 | 0.500241 | 29.226334 | 103.98295 |
| min | 10.0 | 10.009501 | 0.003768 | 10.037177 | 0.0 | 0.0 | -28.977809 |
| 25% | 30.0 | 20.396828 | 4.690724 | 19.603041 | 0.0 | 24.0 | 197.103642 |
| 50% | 54.0 | 30.860614 | 10.092047 | 29.251365 | 0.0 | 50.0 | 270.213964 |
| 75% | 74.0 | 39.843868 | 14.992436 | 39.55322 | 1.0 | 76.0 | 343.395793 |
| max | 99.0 | 49.97414 | 19.994276 | 49.900725 | 1.0 | 99.0 | 563.381332 |


In [None]:
#show cuisine type distributions
restaurant_data.Cuisine_Type.value_counts()

In [None]:
#show Promotions distributions
restaurant_data.Promotions.value_counts()

In [None]:
# show variable relationships and distributions having hue as cuisine type
restaurant_data = restaurant_data.loc[:, ~restaurant_data.columns.duplicated()]
required_columns = ['Number_of_Customers', 'Menu_Price', 'Marketing_Spend', 'Average_Customer_Spending', 'Promotions', 'Reviews','Monthly_Revenue','Cuisine_Type']
missing_columns = [col for col in required_columns if col not in restaurant_data.columns]
if missing_columns:
    raise ValueError(f"Missing columns in DataFrame: {missing_columns}")
# copy so the dtype change below never writes into a slice of restaurant_data
training_data_cleaned = restaurant_data.dropna(subset=required_columns).copy()
training_data_cleaned['Cuisine_Type'] = training_data_cleaned['Cuisine_Type'].astype('category')
sns.set_style('whitegrid')
# pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(training_data_cleaned[required_columns], height=2, hue='Cuisine_Type')
plt.show()


In [None]:
# show variable relationships and distributions having hue as promotion
restaurant_data = restaurant_data.loc[:, ~restaurant_data.columns.duplicated()]
required_columns = ['Number_of_Customers', 'Menu_Price', 'Marketing_Spend', 'Average_Customer_Spending', 'Promotions', 'Reviews','Monthly_Revenue','Cuisine_Type']
missing_columns = [col for col in required_columns if col not in restaurant_data.columns]
if missing_columns:
    raise ValueError(f"Missing columns in DataFrame: {missing_columns}")
# copy so the dtype change below never writes into a slice of restaurant_data
training_data_cleaned = restaurant_data.dropna(subset=required_columns).copy()
training_data_cleaned['Promotions'] = training_data_cleaned['Promotions'].astype('category')
sns.set_style('whitegrid')
# pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(training_data_cleaned[required_columns], height=2, hue='Promotions')
plt.show()

In [None]:
# show variable means having hue as promotion
plt.figure(figsize=(10, 3))
df_melted = restaurant_data.melt(id_vars='Promotions', value_vars=['Number_of_Customers', 'Menu_Price', 'Marketing_Spend', 'Average_Customer_Spending', 'Reviews','Monthly_Revenue'], var_name='variable', value_name='value')
mean_values = df_melted.groupby(['variable', 'Promotions']).mean().reset_index()
sns.barplot(data=mean_values, x='variable', y='value', hue='Promotions')
plt.xlabel('Variable')
plt.ylabel('Averages')
plt.title('Variable means by promotion status')
plt.xticks(rotation=45)
plt.show()

In [None]:
# show variable means having hue as cuisine type
plt.figure(figsize=(10, 3))
df_melted = restaurant_data.melt(id_vars='Cuisine_Type', value_vars=['Number_of_Customers', 'Menu_Price', 'Marketing_Spend', 'Average_Customer_Spending', 'Reviews','Monthly_Revenue'], var_name='variable', value_name='value')
mean_values = df_melted.groupby(['variable', 'Cuisine_Type']).mean().reset_index()
sns.barplot(data=mean_values, x='variable', y='value', hue='Cuisine_Type')
plt.xlabel('Variable')
plt.ylabel('Averages')
plt.title('Variable means by cuisine type')
plt.xticks(rotation=45)
plt.show()

In [None]:
# show variables distribution

# (column, plot title, x-axis label) for each histogram
histogram_specs = [
    ('Monthly_Revenue', 'Monthly revenue distribution', 'Revenue'),
    ('Average_Customer_Spending', 'Average customer spending distribution', 'Spending'),
    ('Marketing_Spend', 'Marketing spend distribution', 'Spend'),
    ('Number_of_Customers', 'Number of customers distribution', 'Customers'),
    ('Menu_Price', 'Menu price distribution', 'Menu price'),
]

# Creating one histogram per variable with Seaborn
for column, title, x_label in histogram_specs:
    plt.figure(figsize=(10, 3))
    sns.histplot(list(restaurant_data[column]), bins=8, kde=True, color='red', edgecolor='blue')
    plt.title(title, fontsize=16)
    plt.xlabel(x_label, fontsize=14)
    plt.ylabel('Frequency', fontsize=14)
    plt.show()

# <label style="color:blue" id="Exploratory-Data-Analysis">Part III : Data preparation</label>

In [6]:
def missing_value_percentages(df):
    """
    Calculate the percentage of missing values in each column of the DataFrame.

    Parameters:
    - df (pandas DataFrame): The DataFrame for which missing values percentages are calculated.

    Returns:
    - pandas DataFrame: A DataFrame containing two columns:
        - 'column_name': The name of each column in the input DataFrame.
        - 'percent_missing': The percentage of missing values in each column.
    """
    percent_missing = df.isnull().sum() * 100 / len(df)
    missing_value_df = pd.DataFrame({'column_name': df.columns,
                                     'percent_missing': percent_missing})
    return missing_value_df
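
The calculation inside `missing_value_percentages` can be checked on a tiny example. This is a minimal sketch on a made-up DataFrame (the values are illustrative, not taken from the restaurant data); it relies on the fact that the mean of the boolean missing-value flags is the missing fraction.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Menu_Price": [12.5, np.nan, 30.0, 45.0],  # one of four values is missing
    "Reviews": [10, 20, 30, 40],               # complete column
})

# isnull() flags missing cells; the column mean of those flags is the missing fraction
percent_missing = toy.isnull().mean() * 100
print(percent_missing.to_dict())
```

The expression `df.isnull().mean() * 100` gives the same numbers as `df.isnull().sum() * 100 / len(df)` used in the function above.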


In [7]:
#creating a dataframe of the columns and their missing-value percentages
cust_data_missing = missing_value_percentages(restaurant_data)

In [8]:
cust_data_missing

| Index | column_name | percent_missing |
|---|---|---|
| Number_of_Customers | Number_of_Customers | 0.0 |
| Menu_Price | Menu_Price | 0.0 |
| Marketing_Spend | Marketing_Spend | 0.0 |
| Cuisine_Type | Cuisine_Type | 0.0 |
| Average_Customer_Spending | Average_Customer_Spending | 0.0 |
| Promotions | Promotions | 0.0 |
| Reviews | Reviews | 0.0 |
| Monthly_Revenue | Monthly_Revenue | 0.0 |


In [9]:
def skewness_detector(dataset, col):
    """
    Detect the skewness of a specified column in a dataset.

    Parameters:
    - dataset (pandas DataFrame): The dataset containing the column.
    - col (str): The name of the column for which skewness is to be detected.

    Returns:
    - str: A string indicating the skewness of the column's distribution:
        - "left-skewed" if the distribution is left-skewed.
        - "right-skewed" if the distribution is right-skewed.
        - "symmetrical" if the distribution is symmetrical.
    """
    skewness = skew(dataset[col])
    if skewness > 0:
        return "right-skewed"
    elif skewness < 0:
        return "left-skewed"
    else:
        return "symmetrical"
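
A sample skewness is almost never exactly zero, so a sign-only check labels nearly every column as skewed. The sketch below is one possible refinement, not part of the original analysis: the helper `skewness_label` and its tolerance of 0.5 are assumptions (a common rule of thumb), and the data is synthetic.

```python
import numpy as np
from scipy.stats import skew

def skewness_label(values, tolerance=0.5):
    """Hypothetical helper: label a sample by the size of its skewness, not only the sign."""
    s = float(skew(values))
    if abs(s) <= tolerance:
        return s, "approximately symmetrical"
    return s, ("right-skewed" if s > 0 else "left-skewed")

rng = np.random.default_rng(0)
uniform_sample = rng.uniform(10, 50, size=1000)        # flat distribution, no long tail
exponential_sample = rng.exponential(5.0, size=1000)   # long right tail

uniform_skew, uniform_label = skewness_label(uniform_sample)
exp_skew, exp_label = skewness_label(exponential_sample)
print(uniform_label, "|", exp_label)
```

Reporting the numeric skewness next to the label makes it easier to judge whether a transformation is worth considering.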

In [10]:
# extracting numerical columns from the dataframe
numeric_columns = restaurant_data.select_dtypes(include=['number']).columns

In [11]:
print(list(numeric_columns))

['Number_of_Customers', 'Menu_Price', 'Marketing_Spend', 'Average_Customer_Spending', 'Promotions', 'Reviews', 'Monthly_Revenue']


In [12]:
# determining columns skewness
# DataFrame.append was removed in pandas 2.0, so collect the rows first
skewness_rows = [
    {'column': column, 'skewness': skewness_detector(restaurant_data, column)}
    for column in numeric_columns
]
cols_skewness = pd.DataFrame(skewness_rows)
cols_skewness

| Index | column | skewness |
|---|---|---|
| 0 | Number_of_Customers | right-skewed |
| 1 | Menu_Price | left-skewed |
| 2 | Marketing_Spend | left-skewed |
| 3 | Average_Customer_Spending | right-skewed |
| 4 | Promotions | right-skewed |
| 5 | Reviews | right-skewed |
| 6 | Monthly_Revenue | left-skewed |


In [13]:
def remove_outliers_iqr(data_frame, cols):
    """
    Remove outliers from the specified columns of a DataFrame using the Interquartile Range (IQR) method.

    Parameters:
    - data_frame (pandas DataFrame): The DataFrame containing the data.
    - cols (list of str): A list of column names from which outliers should be removed.

    Returns:
    - pandas DataFrame: A DataFrame with outliers removed from the specified columns.
    """
    cleaned_df = data_frame.copy()
    for column_name in cols:
        Q1 = cleaned_df[column_name].quantile(0.25)
        Q3 = cleaned_df[column_name].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        cleaned_df = cleaned_df[(cleaned_df[column_name] >= lower_bound) & (cleaned_df[column_name] <= upper_bound)]
    return cleaned_df
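
To make the 1.5 × IQR rule concrete, here is a small worked sketch on a made-up column (the values are illustrative, not taken from the restaurant data). It computes the same bounds as `remove_outliers_iqr` for a single column.

```python
import pandas as pd

toy = pd.DataFrame({"revenue": [10, 12, 13, 14, 15, 16, 18, 100]})

q1 = toy["revenue"].quantile(0.25)
q3 = toy["revenue"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# keep only rows whose value lies inside [lower_bound, upper_bound]
kept = toy[toy["revenue"].between(lower_bound, upper_bound)]
print(lower_bound, upper_bound, kept["revenue"].tolist())
```

Only the value 100 falls outside the bounds and is dropped. Note that the function above filters column by column, so the quartiles of later columns are computed on rows that survived the earlier columns.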


In [14]:
#creating a dataframe where outliers are removed
restaurant_data_wou = remove_outliers_iqr(restaurant_data,numeric_columns)

In [15]:
# drop rows with negative revenue, then show summary statistics of the data without outliers
restaurant_data_wou = restaurant_data_wou[restaurant_data_wou.Monthly_Revenue>=0]
restaurant_data_wou.describe()

| Statistic | Number_of_Customers | Menu_Price | Marketing_Spend | Average_Customer_Spending | Promotions | Reviews | Monthly_Revenue |
|---|---|---|---|---|---|---|---|
| count | 994.0 | 994.0 | 994.0 | 994.0 | 994.0 | 994.0 | 994.0 |
| mean | 53.434608 | 30.248379 | 9.987001 | 29.498219 | 0.49497 | 49.902414 | 269.898773 |
| std | 26.247184 | 11.283711 | 5.836555 | 11.466233 | 0.500226 | 29.235191 | 101.774306 |
| min | 10.0 | 10.009501 | 0.003768 | 10.037177 | 0.0 | 0.0 | 3.819308 |
| 25% | 31.0 | 20.443839 | 4.704846 | 19.676892 | 0.0 | 24.0 | 199.160139 |
| 50% | 54.0 | 30.860614 | 10.160257 | 29.251365 | 0.0 | 50.0 | 270.527956 |
| 75% | 74.0 | 39.904905 | 14.995489 | 39.560178 | 1.0 | 76.0 | 343.461651 |
| max | 99.0 | 49.97414 | 19.994276 | 49.900725 | 1.0 | 99.0 | 542.467282 |


In [16]:
# dropping duplicates
has_duplicates = restaurant_data_wou.duplicated().any()

if has_duplicates:
    print("DataFrame has duplicates.")
else:
    print("DataFrame has no duplicates.")

# Get duplicate counts
duplicate_counts = restaurant_data_wou.duplicated().sum()

print("Number of duplicate rows:", duplicate_counts)
restaurant_data_wou = restaurant_data_wou.drop_duplicates()

DataFrame has no duplicates.
Number of duplicate rows: 0


In [17]:
#applying onehot encoding on nominal variable
restaurant_data_wou = pd.get_dummies(restaurant_data_wou, columns=['Cuisine_Type']) 
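
One-hot encoding replaces a nominal column with one indicator column per category. The sketch below shows the effect on a made-up frame; the cuisine labels are illustrative assumptions, and `dtype=int` is passed only so the indicators print as 0/1.

```python
import pandas as pd

toy = pd.DataFrame({
    "Cuisine_Type": ["Italian", "Mexican", "Italian"],  # illustrative labels
    "Menu_Price": [20.0, 35.5, 41.0],
})

# each category becomes its own column named <original column>_<category>
encoded = pd.get_dummies(toy, columns=["Cuisine_Type"], dtype=int)
print(encoded.columns.tolist())
print(encoded["Cuisine_Type_Italian"].tolist())
```

The original `Cuisine_Type` column disappears and the untouched numeric columns keep their position at the front of the frame.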

# <label style="color:blue" id="Exploratory-Data-Analysis">Part IV : Feature selection</label>

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
from scipy.stats import pearsonr
from sklearn.preprocessing import StandardScaler

# Assigning target and predictor variables
X = restaurant_data_wou.drop(columns=['Monthly_Revenue'])
y = restaurant_data_wou['Monthly_Revenue']
feature_names = X.columns

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def variance_threshold_selector(data, threshold=0.01):
    """
    Perform feature selection based on variance thresholding.

    Parameters:
    - data (array-like or sparse matrix): The input data.
    - threshold (float, optional): The threshold below which features will be removed. 
                                   Features with a variance lower than this threshold will be removed.
                                   Default is 0.01.

    Returns:
    - tuple: A tuple containing two elements:
        - Transformed data after removing features with low variance.
        - Indices of the selected features.
    """
    selector = VarianceThreshold(threshold)
    return selector.fit_transform(data), selector.get_support(indices=True)


# Apply Variance Threshold to the training and testing sets
X_train_var, selected_indices_var = variance_threshold_selector(X_train, threshold=0.01)
X_test_var = X_test.iloc[:, selected_indices_var].values  # Apply same indices to test set

# Update feature names after Variance Threshold
feature_names_var = feature_names[selected_indices_var]

def correlation_coefficient_selector(X, y, threshold=0.2):
    """
    Select features based on their correlation coefficient with the target variable.

    Parameters:
    - X (array-like): The feature matrix.
    - y (array-like): The target variable.
    - threshold (float, optional): The threshold for selecting features based on correlation coefficient.
                                   Default is 0.2.

    Returns:
    - list: A list containing the indices of the selected features.
    """
    selected_features = []
    for i in range(X.shape[1]):
        corr, _ = pearsonr(X[:, i], y)
        if abs(corr) >= threshold:
            selected_features.append(i)
    return selected_features

# Get the selected feature indices based on correlation coefficient
selected_features_corr = correlation_coefficient_selector(X_train_var, y_train, threshold=0.2)

# Reduce data to selected features
X_train_selected = X_train_var[:, selected_features_corr]
X_test_selected = X_test_var[:, selected_features_corr]

# Update feature names after Correlation Coefficient
feature_names_selected = feature_names_var[selected_features_corr]
print(f'Selected Feature Names: {list(feature_names_selected)}')

# Standardize the predictors
scaler = StandardScaler()
X_train_selected_normalized = scaler.fit_transform(X_train_selected)
X_test_selected_normalized = scaler.transform(X_test_selected)

Selected Feature Names: ['Number_of_Customers', 'Menu_Price', 'Marketing_Spend']
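
The two filters used above can be seen in isolation on synthetic data. This is a minimal sketch under stated assumptions: the three columns (`signal`, `noise`, `constant`) are made up so that each filter has something to remove, and the thresholds match the ones used in this section.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(42)
n = 500
signal = rng.normal(size=n)      # strongly related to the target
noise = rng.normal(size=n)       # unrelated to the target
constant = np.full(n, 3.0)       # zero variance
X_demo = np.column_stack([signal, noise, constant])
y_demo = 2.0 * signal + rng.normal(scale=0.5, size=n)

# Step 1: drop columns whose variance is below the threshold
selector = VarianceThreshold(threshold=0.01)
X_var = selector.fit_transform(X_demo)
kept_after_variance = selector.get_support(indices=True)

# Step 2: keep columns whose absolute Pearson correlation with the target is at least 0.2
kept_after_corr = []
for i in range(X_var.shape[1]):
    corr, _ = pearsonr(X_var[:, i], y_demo)
    if abs(corr) >= 0.2:
        kept_after_corr.append(int(kept_after_variance[i]))

print(kept_after_variance.tolist(), kept_after_corr)
```

The variance filter removes the constant column and the correlation filter then removes the noise column. A correlation filter only detects linear relationships, so a feature with a strong non-linear effect could still be discarded.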


# <label style="color:blue" id="Exploratory-Data-Analysis">Part V : Modelling</label>

In [20]:
# Define the regressors to compare
models = {
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42),
    'MLP': MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000, random_state=42),
    'XGBoost': xgb.XGBRegressor(n_estimators=100, random_state=42)
}

In [None]:
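results = []
for name, model in models.items():
    # fit on the selected, standardized features prepared in Part IV
    model.fit(X_train_selected_normalized, y_train)
    train_predictions = model.predict(X_train_selected_normalized)
    test_predictions = model.predict(X_test_selected_normalized)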
    train_mse = mean_squared_error(y_train, train_predictions)
    test_mse = mean_squared_error(y_test, test_predictions)
    train_rmse = np.sqrt(train_mse)
    test_rmse = np.sqrt(test_mse)
    train_mae = mean_absolute_error(y_train, train_predictions)
    test_mae = mean_absolute_error(y_test, test_predictions)
    train_r2 = r2_score(y_train, train_predictions)
    test_r2 = r2_score(y_test, test_predictions)
    results.append({
        'Model': name,
        'Train MSE': round(train_mse,2),
        'Test MSE': round(test_mse,2),
        'Train RMSE': round(train_rmse,2),
        'Test RMSE': round(test_rmse,2),
        'Train MAE': round(train_mae,2),
        'Test MAE': round(test_mae,2),
        'Train R2': round(train_r2,2),
        'Test R2': round(test_r2,2)
    })

results_df = pd.DataFrame(results)

results_df = results_df.sort_values(by='Test MAE', ascending=True)
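
To make the metrics in the results table concrete, the sketch below computes MSE, RMSE, MAE and R² by hand on four made-up revenue values and compares them with the scikit-learn functions used above. The numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_pred                      # -10, 10, -20, 20
mse = np.mean(errors ** 2)                    # mean of squared errors
rmse = np.sqrt(mse)                           # same unit as the target
mae = np.mean(np.abs(errors))                 # mean of absolute errors
ss_res = np.sum(errors ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mse, round(rmse, 2), mae, r2)
print(mean_squared_error(y_true, y_pred), mean_absolute_error(y_true, y_pred), r2_score(y_true, y_pred))
```

RMSE and MAE are expressed in revenue units, which makes them easier to compare with the revenue range in the data than MSE.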

# <label style="color:blue" id="Exploratory-Data-Analysis">Part VI : Evaluation</label>

In [None]:
# Display Models evaluation metrics matrix 
results_df

In [None]:
plt.figure(figsize=(15, 5))
for i, (name, model) in enumerate(models.items(), 1):
    # refit on the selected, standardized features so predictions match the inputs
    model.fit(X_train_selected_normalized, y_train)

    test_predictions = model.predict(X_test_selected_normalized)

    plt.subplot(1, 3, i)
    plt.scatter(y_test, test_predictions, color='red', label='Predicted', alpha=0.5)
    # the diagonal marks perfect predictions (predicted == actual)
    limits = [y_test.min(), y_test.max()]
    plt.plot(limits, limits, color='blue', label='Actual')

    plt.title(name)
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.legend()

plt.tight_layout()
plt.show()

#### Hyperparameter tuning

In [None]:
# Define parameter grids for each model
param_grids = {
    'RandomForest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30]
    },
    'MLP': {
        'hidden_layer_sizes': [(50,), (100,), (50, 50)],
        'activation': ['tanh', 'relu'],
        'solver': ['sgd', 'adam'],
        'max_iter': [500, 1000]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 9],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

# Initialize the regressors
models = {
    'RandomForest': RandomForestRegressor(random_state=42),
    'MLP': MLPRegressor(random_state=42),
    'XGBoost': xgb.XGBRegressor(random_state=42)
}

# Perform hyperparameter tuning using GridSearchCV and compute metrics
# on the selected, standardized features prepared in Part IV
results = []
for name, model in models.items():
    grid_search = GridSearchCV(model, param_grids[name], cv=5, scoring='neg_mean_squared_error')
    grid_search.fit(X_train_selected_normalized, y_train)
    best_model = grid_search.best_estimator_
    predictions_train = best_model.predict(X_train_selected_normalized)
    mse_train = mean_squared_error(y_train, predictions_train)
    rmse_train = np.sqrt(mse_train)
    mae_train = mean_absolute_error(y_train, predictions_train)
    r2_train = r2_score(y_train, predictions_train)
    
    predictions_test = best_model.predict(X_test_selected_normalized)
    mse_test = mean_squared_error(y_test, predictions_test)
    rmse_test = np.sqrt(mse_test)
    mae_test = mean_absolute_error(y_test, predictions_test)
    r2_test = r2_score(y_test, predictions_test)
    
    results.append({
        'Model': name,
        'Best_Params': grid_search.best_params_,
        'Train_MSE': round(mse_train,2),
        'Test_MSE': round(mse_test,2),
        'Train_RMSE': round(rmse_train,2),
        'Test_RMSE': round(rmse_test,2),
        'Train_MAE': round(mae_train,2),
        'Test_MAE': round(mae_test,2),
        'Train_R2': round(r2_train,2),
        'Test_R2': round(r2_test,2)
    })

results_df = pd.DataFrame(results)

In [None]:
results_df = results_df.sort_values(by='Test_MAE', ascending=True)
results_df
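
When features are standardized before cross-validation, every validation fold has already influenced the scaler. One way to avoid that is to put the scaler inside a scikit-learn `Pipeline` and tune the whole pipeline. The sketch below illustrates the idea on synthetic data; the data, the step names (`scaler`, `mlp`) and the small grid are assumptions chosen to keep the example fast, not the settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the restaurant features (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = X_demo @ np.array([2.0, 1.0, 0.5]) + rng.normal(scale=5.0, size=200)

# The scaler is a pipeline step, so it is refit on the training part of every fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPRegressor(max_iter=500, random_state=42)),
])

# Parameters of a step are addressed as <step name>__<parameter>
param_grid = {"mlp__hidden_layer_sizes": [(20,), (50,)]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)
print(search.best_params_)
```

`search.best_estimator_` is then a fitted pipeline that scales and predicts in one call, which also simplifies saving the model later.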


# <label style="color:blue" id="Exploratory-Data-Analysis">Part VII : Deployment</label>

In [None]:

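# `mlp` is not defined anywhere above. Assumption: the MLP from Part V is the model to deploy,
# so fit it on the selected, standardized features (new inputs must be scaled with `scaler` too)
mlp = MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)
mlp.fit(X_train_selected_normalized, y_train)
with open("model_artifact.pkl", 'wb') as file:
    pickle.dump(mlp, file)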
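
A saved model is only useful if the same preprocessing is applied at prediction time. The sketch below shows one possible round trip on synthetic data: it stores a scaler and a model together, reloads them, and predicts for a new row. The names `artifact`, `scaler_demo`, `model_demo` and the file `model_artifact_demo.pkl` are hypothetical and chosen so the example does not touch the notebook's own artifact.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = X_demo @ np.array([2.0, 1.0, 0.5]) + rng.normal(scale=1.0, size=200)

scaler_demo = StandardScaler().fit(X_demo)
model_demo = MLPRegressor(hidden_layer_sizes=(20,), max_iter=500, random_state=42)
model_demo.fit(scaler_demo.transform(X_demo), y_demo)

# Store the scaler with the model so inference can repeat the preprocessing
artifact = {"scaler": scaler_demo, "model": model_demo}
path = os.path.join(tempfile.mkdtemp(), "model_artifact_demo.pkl")
with open(path, "wb") as file:
    pickle.dump(artifact, file)

# Later, for example in a serving script: load and predict
with open(path, "rb") as file:
    loaded = pickle.load(file)

new_rows = np.array([[50.0, 30.0, 10.0]])
prediction = loaded["model"].predict(loaded["scaler"].transform(new_rows))
print(prediction)
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used for loading should match the one used for saving.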