# Restaurant Revenue Prediction

## Problem Statement

With over 1,200 quick service restaurants across the globe, TFI is the company behind some of the world's most well-known brands: Burger King, Sbarro, Popeyes, Usta Donerci, and Arby’s. They employ over 20,000 people in Europe and Asia and make significant daily investments in developing new restaurant sites.

We are building a regression model that predicts the revenue of new restaurants and identifies what TFI should take into consideration when investing in a new restaurant to achieve high profitability. R² and RMSE will be used to choose the final model.
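
As a rough illustration of the two selection metrics, the sketch below computes RMSE and R² by hand with NumPy. The arrays are made-up numbers and the helper names `rmse` and `r2` are hypothetical; neither comes from the competition data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical values, for illustration only.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 9.0])
example_rmse = rmse(y_true_demo, y_pred_demo)
example_r2 = r2(y_true_demo, y_pred_demo)
```

A lower RMSE means smaller typical errors in the units of the target, while R² closer to 1 means more of the target's variance is explained.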

###  _Data Collection and Data Dictionary_ 


The dataset comes from a Kaggle competition: https://www.kaggle.com/c/restaurant-revenue-prediction/data.


1. `Id` : Restaurant id.
2. `Open Date` : Opening date of the restaurant.
3. `City` : City in which the restaurant is located. Note that the names contain Unicode characters.
4. `City Group` : Type of the city: Big Cities or Other.
5. `Type` : Type of the restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile
6. `P1`, `P2` - `P37` : There are three categories of these obfuscated data. Demographic data are gathered from third party providers with GIS systems. These include population in any given area, age and gender distribution, development scales. Real estate data mainly relate to the m2 of the location, front facade of the location, car park availability. Commercial data mainly include the existence of points of interest including schools, banks, other QSR operators.
7. `Revenue` : The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis. Please note that the values are transformed so they don't mean real dollar values.

### _Imports_

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np 
from xgboost import XGBRegressor  # requires the xgboost package (e.g. `pip install xgboost`)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNetCV, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing
from sklearn.metrics import r2_score,explained_variance_score,max_error,mean_absolute_error,mean_squared_error,confusion_matrix,mean_absolute_percentage_error,mean_squared_log_error
from sklearn.pipeline import Pipeline



In [None]:
train = pd.read_csv('./data/train.csv') # training data

In [None]:
test = pd.read_csv('./data/test.csv') # Test data

### _Reading the Data_

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.dtypes

In [None]:
train.describe(include='all')

In [None]:
train['revenue'].describe()

In [None]:
test.shape

In [None]:
test.head()

### _Data Cleaning_

In [None]:
def missing_values(data):
    """Return the count and percentage of missing values for each column that has any."""
    mis_total = data.isnull().sum()
    mis_pct = 100 * data.isnull().sum() / len(data)
    mis_value_table = pd.concat([mis_total, mis_pct], axis = 1)
    mis_value_table_columns = mis_value_table.rename(columns = {0 : 'No. of Missing Value', 1 : '% of Total Missing Value'})

    # keep only columns with at least one missing value, largest share first
    mis_value_table_columns = mis_value_table_columns[mis_value_table_columns.iloc[:,1] != 0].sort_values('% of Total Missing Value', ascending = False).round(2)
    return mis_value_table_columns

In [None]:
missing_values(train) # find missing values in the training data

In [None]:
missing_values(test) # no missing values in the test data

In [None]:
train['Open Date'] = pd.to_datetime(train['Open Date']) # change `Open Date` datatype

In [None]:
test['Open Date'] = pd.to_datetime(test['Open Date'])

In [None]:
train['year'] = train['Open Date'].dt.year # extract the year from the dataset 

In [None]:
test['year'] = test['Open Date'].dt.year

In [None]:
train['month'] = train['Open Date'].dt.month # extract month from the dataset 

In [None]:
test['month'] = test['Open Date'].dt.month

In [None]:
lookup = {
    11: 'Winter',
    12: 'Winter',
    1: 'Winter',
    2: 'Spring',
    3: 'Spring',
    4: 'Spring',
    5: 'Summer',
    6: 'Summer',
    7: 'Summer',
    8: 'Fall',
    9: 'Fall',
    10: 'Fall'
}

In [None]:
train['season'] = train['Open Date'].apply(lambda x : lookup[x.month]) # convert month to seasons. 

In [None]:
test['season'] = test['Open Date'].apply(lambda x : lookup[x.month])

In [None]:
train['City Group'].unique() # find the unique values

In [None]:
train['City'].nunique() 

In [None]:
test['City'].nunique() # the sets of cities in train and test differ significantly, so this column is dropped

In [None]:
train.drop('City', axis = 1, inplace = True)
test.drop('City', axis = 1, inplace = True)

##### _The `test` data has far more unique `City` values than the `train` data, so many test cities never appear in training. The `City` feature would therefore be of little use to any model._
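
The mismatch between the two sets of cities can be quantified before deciding to drop the column. The sketch below is one way to do that, shown on toy frames; `unseen_categories`, `toy_train` and `toy_test` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def unseen_categories(train_col, test_col):
    """Return the sorted values that occur in test_col but never in train_col."""
    return sorted(set(test_col.dropna().unique()) - set(train_col.dropna().unique()))

# Hypothetical toy frames standing in for the real train/test data.
toy_train = pd.DataFrame({'City': ['A', 'B', 'A']})
toy_test = pd.DataFrame({'City': ['A', 'C', 'D', 'B']})

new_cities = unseen_categories(toy_train['City'], toy_test['City'])
# Share of test rows whose city was never seen in training.
share_unseen = toy_test['City'].isin(new_cities).mean()
```

A high `share_unseen` on the real data would support dropping the column, since a one-hot encoding would carry no signal for those rows.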

In [None]:
train['Type'].unique()

In [None]:
test['Type'].unique() # the test data has one extra type, MB (mobile); we will recode it as DT

In [None]:
test.loc[test['Type'] == 'MB', 'Type'] = 'DT'

##### _The `train` dataset contains no `MB` restaurants, while the `test` data does. `MB` stands for mobile, which is similar in nature to `DT` (drive-thru), so we recode `MB` as `DT` for better predictions._

In [None]:
train['revenue'].describe()

In [None]:
test['Type'].unique()

In [None]:
train.drop(columns = ['Id', 'Open Date','month'], inplace = True) # drop Id, Open Date and month from the training data; they are not needed for modeling

In [None]:
test.drop(columns = ['Open Date','month'], inplace = True) # drop Open Date and month from the test data (Id is kept for the submission file)

In [None]:
train.head()

### _EDA_

In [None]:
# Split columns into continuous (> 50 unique values) and discrete (<= 50 unique values)
var_unique = train.nunique()
con_columns = var_unique[var_unique > 50].index.tolist()
dis_columns = var_unique[var_unique <= 50].index.tolist()

In [None]:
dis_columns # inspect the discrete variables (as opposed to the continuous ones)

In [None]:
sns.heatmap(train.corr(numeric_only = True)[['revenue']].sort_values(by = 'revenue', key = abs, ascending = False)) # correlation of the numeric features with revenue

In [None]:
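# numeric_only=True skips the remaining text column (season), which recent pandas versions would otherwise reject
revenue_corr = train.drop(['City Group','Type'],axis=1).corr(numeric_only = True)['revenue'].sort_values(ascending=False)
plt.figure(figsize=(10,7))
revenue_corr.drop('revenue').plot.bar(color = '#33415C')
plt.show() # correlation of each feature with revenue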

In [None]:
dis_columns.append('revenue') 

In [None]:
spearman = train[dis_columns].corr(method ='spearman', numeric_only = True) # Spearman correlations; no strong correlations were identified
spearman_corr = spearman['revenue'].sort_values(ascending = False)
spearman_corr

In [None]:
plt.figure(figsize=(10,7))
spearman_corr.plot.bar()
plt.show() # Spearman correlation of the discrete features with revenue

In [None]:
train[dis_columns].apply(pd.Series.nunique, axis = 0) # number of unique values in each discrete column

In [None]:
test.hist(figsize = (36,20));# looking at the distribution of the features

In [None]:
#plt.figure(figsize = (20, 12))
#sns.scatterplot(x="revenue", y="City",s=35, alpha = 0.6,data=train)

In [None]:
train['revenue'].max() # the maximum helps identify the outliers

In [None]:
index_drop1 = train[train['revenue'] == 19696939.0].index # dropping outliers 
train.drop(index_drop1, inplace = True)

In [None]:
index_drop2 = train[train['revenue'] == 16549064.0].index # dropping outliers 
train.drop(index_drop2, inplace = True)

In [None]:
index_drop3 = train[train['revenue'] == 13575224.0].index # dropping outliers
train.drop(index_drop3, inplace = True)
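
The three cells above remove outliers by their exact revenue values. As an alternative that does not rely on hard-coded numbers, the sketch below flags outliers with the interquartile-range rule. This is a swapped-in technique, not the approach used in this notebook; the `1.5` multiplier is the usual convention, and `iqr_outlier_mask` and `toy_revenue` are hypothetical names.

```python
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask that is True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical revenues, for illustration only.
toy_revenue = pd.Series([3.1e6, 4.0e6, 4.4e6, 5.2e6, 3.8e6, 4.9e6, 19.7e6])
outlier_mask = iqr_outlier_mask(toy_revenue)
cleaned_revenue = toy_revenue[~outlier_mask]
```

On the real data this would be applied as `train = train[~iqr_outlier_mask(train['revenue'])]`, which may remove a different set of rows than the three values dropped above.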

In [None]:
plt.figure (figsize = (16, 9))
sns.histplot(train['revenue'], kde = True) # histplot replaces the deprecated distplot
plt.title('Revenue Distribution (Train Data)', fontdict={'fontsize':20})

In [None]:
train['log_revenue'] = np.log1p(train['revenue']) # log transformation. 

In [None]:
plt.figure (figsize = (16, 9))
sns.histplot(train['log_revenue'], kde = True) # histplot replaces the deprecated distplot
plt.title('Log Transformation Revenue Distribution (Train Data)', fontdict={'fontsize':20})
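
Because the target is transformed with `np.log1p`, predictions have to be mapped back with its exact inverse, `np.expm1`. The short sketch below shows the round trip on hypothetical values and how a plain `np.exp` would be off by one.

```python
import numpy as np

# Hypothetical revenues, for illustration only.
revenue_demo = np.array([1_500_000.0, 4_200_000.0, 8_900_000.0])

log_revenue_demo = np.log1p(revenue_demo)    # log(1 + x)
restored = np.expm1(log_revenue_demo)        # exp(y) - 1, the exact inverse
off_by_one = np.exp(log_revenue_demo)        # exp(y) = x + 1, not the inverse
```

At revenues in the millions the difference is negligible in practice, but using the matching inverse keeps the pipeline consistent.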

In [None]:
plt.figure (figsize = (16, 9))
fig = sns.scatterplot(x="Type", y="log_revenue",s=50, alpha = 0.6,data=train);

In [None]:
plt.figure (figsize = (16, 9))
fig = sns.scatterplot(x="season", y="revenue",s=50, alpha = 0.6,data=train);

In [None]:
plt.figure (figsize = (16, 9))
fig = sns.scatterplot(x="P28", y="log_revenue",s=50,hue="Type", size = 'Type', alpha = 0.6,data=train);

In [None]:
plt.figure (figsize = (16, 9))
fig = sns.scatterplot(x="year", y="log_revenue",s=50,hue="Type", size = 'revenue', alpha = 0.6,data=train);

In [None]:
plt.figure (figsize = (16, 9))
sns.countplot(x = 'Type',hue ='City Group', data = train, palette='Set3')
plt.title('Train Number of Restaurant in each Restaurant Type (by City Group)', fontdict={'fontsize':20})

In [None]:
plt.figure (figsize = (16, 9))
sns.countplot(x = 'Type',hue ='City Group', data = test, palette='Set3')
plt.title('Test Data Number of Restaurant in each Restaurant Type (by City Group)', fontdict={'fontsize':20})

In [None]:
plt.figure (figsize = (16, 9))
sns.countplot(x = 'season',hue ='City Group', data = train, palette='Set1')
plt.title('Number of Restaurants Opened in each Season (by City Group / Train Data)', fontdict={'fontsize':20})

In [None]:
plt.figure (figsize = (16, 9))
sns.countplot(x = 'season',hue ='City Group', data = test, palette='Set1')
plt.title('Number of Restaurants Opened in each Season (by City Group / Test Data)', fontdict={'fontsize':20})

### _Modeling_ 

##### _Pre-Processing_

In [None]:
columns_to_dummy = train.select_dtypes(include = ['object']).columns # one-hot encode the categorical columns
train = pd.get_dummies(train, columns = columns_to_dummy, drop_first = False)
test = pd.get_dummies(test, columns = columns_to_dummy, drop_first = False)

In [None]:
train.head()

In [None]:
train.shape

In [None]:
test.shape
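
Encoding `train` and `test` separately only yields matching columns when both contain the same categories. A minimal sketch of one way to guard against a mismatch is to reindex the encoded test frame to the training columns; the toy frames and variable names below are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames: the test set contains a category (MB) unseen in training.
toy_train = pd.DataFrame({'Type': ['FC', 'IL', 'DT'], 'P1': [1, 2, 3]})
toy_test = pd.DataFrame({'Type': ['FC', 'MB'], 'P1': [4, 5]})

train_enc = pd.get_dummies(toy_train, columns=['Type'])
test_enc = pd.get_dummies(toy_test, columns=['Type'])

# Align test to the training layout: dummy columns missing from test are
# filled with 0, and dummy columns unseen in training are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

After alignment, a model fitted on `train_enc` can score `test_aligned` without a feature-count or feature-name mismatch.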

In [None]:
X, y = train.drop(columns=['revenue','log_revenue'], axis=1), train['log_revenue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

#### _Baseline Model - Linear Regression_

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(X_train, y_train)

In [None]:
lr.score(X_train, y_train)

In [None]:
lr.score(X_test, y_test)

In [None]:
mean_squared_error(y_train, lr.predict(X_train))

In [None]:
mean_squared_error(y_test, lr.predict(X_test))

In [None]:
y_pred = lr.predict(X_test)

In [None]:
print('Explained variance score',explained_variance_score(y_test, y_pred),'\n',
   'Mean absolute error      :',mean_absolute_error(y_test, y_pred),'\n',
   'Mean squared error       :',mean_squared_error(y_test, y_pred),'\n',
   'R² score            :',r2_score(y_test, y_pred))

In [None]:
score_df_lr = pd.DataFrame(columns=['Method','Linear Regression'])
score_df_lr['Method']=[
                    'Mean Absolute Error',
                    'Mean Squared Error',
                    'RMSE',
                    'R²']
score_df_lr['Linear Regression']=[
                   mean_absolute_error(y_test, y_pred),
                   mean_squared_error(y_test, y_pred),
                   np.sqrt(mean_squared_error(y_test, y_pred)),  # RMSE, computed without the deprecated `squared` argument
                   r2_score(y_test, y_pred)]

In [None]:
score_df_lr
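
The same four-row score table is rebuilt for every model in this notebook. A small helper could remove that repetition; the sketch below is one possible version, where `score_table` is a hypothetical name and the inputs are made-up numbers.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def score_table(y_true, y_pred, model_name):
    """Build a two-column frame of MAE, MSE, RMSE and R² for one model."""
    mse = mean_squared_error(y_true, y_pred)
    return pd.DataFrame({
        'Method': ['Mean Absolute Error', 'Mean Squared Error', 'RMSE', 'R²'],
        model_name: [
            mean_absolute_error(y_true, y_pred),
            mse,
            np.sqrt(mse),              # RMSE is the square root of the MSE
            r2_score(y_true, y_pred),
        ],
    })

# Hypothetical values, for illustration only.
toy_scores = score_table([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0], 'Toy Model')
```

Each model section could then call `score_table(y_test, y_pred, 'Lasso Regression')` and so on, and the resulting frames can still be merged on `Method`.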

#### _Lasso Regression_

In [None]:
params_lasso = {
    'alpha' : [.01, .1, .5, .7, .9, .95, .99, 1, 5, 10, 20],}

In [None]:
lasso = Lasso()

In [None]:
lasso_regressor = GridSearchCV(lasso, params_lasso, cv=5, n_jobs=8)

In [None]:
lasso_regressor.fit(X_train, y_train)

In [None]:
lasso_regressor.best_params_
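
Rather than copying the best parameters into a new estimator by hand, the fitted search object can be used directly. The sketch below runs on synthetic data from `make_regression` (not the restaurant data), and the `scoring` choice is an assumption made here so that the search optimizes RMSE, one of the two metrics named in the problem statement.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

alphas = [0.01, 0.1, 1, 10]
search = GridSearchCV(Lasso(), {'alpha': alphas}, cv=5,
                      scoring='neg_root_mean_squared_error')
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr.
best_lasso = search.best_estimator_
demo_r2 = best_lasso.score(X_te, y_te)
```

Using `best_estimator_` keeps the final model in sync with the search even when the grid changes.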

In [None]:
lasso_model = Lasso(alpha = 0.5)

In [None]:
lasso_model.fit(X_train, y_train)

In [None]:
lasso_model.score(X_train, y_train)

In [None]:
lasso_model.score(X_test, y_test)

In [None]:
y_pred = lasso_model.predict(X_test)

In [None]:
print('Explained variance score',explained_variance_score(y_test, y_pred),'\n',
   'Mean absolute error      :',mean_absolute_error(y_test, y_pred),'\n',
   'Mean squared error       :',mean_squared_error(y_test, y_pred),'\n',
   'R² score            :',r2_score(y_test, y_pred))

In [None]:
score_df_lasso = pd.DataFrame(columns=['Method','Lasso Regression'])
score_df_lasso['Method']=[
                    'Mean Absolute Error',
                    'Mean Squared Error',
                    'RMSE',
                    'R²']
score_df_lasso['Lasso Regression']=[
                   mean_absolute_error(y_test, y_pred),
                   mean_squared_error(y_test, y_pred),
                   np.sqrt(mean_squared_error(y_test, y_pred)),  # RMSE, computed without the deprecated `squared` argument
                   r2_score(y_test, y_pred)]

In [None]:
score_df_lasso

#### _Ridge Regression_

In [None]:
ridge = Ridge()

In [None]:
params_ridge = {
    'alpha' : [.01, .1, .5, .7, .9, .95, .99, 1, 5, 10, 20],
    'solver' : ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
}

In [None]:
ridge_regressor = GridSearchCV(ridge, params_ridge,cv=5, n_jobs=-1)

In [None]:
ridge_regressor.fit(X_train, y_train)

In [None]:
ridge_regressor.best_params_

In [None]:
ridge_model = Ridge(alpha = 20,
                    solver = 'saga')

In [None]:
ridge_model.fit(X_train, y_train)

In [None]:
ridge_model.score(X_train, y_train)

In [None]:
ridge_model.score(X_test, y_test)

In [None]:
y_pred = ridge_model.predict(X_test)

In [None]:
print('Explained variance score',explained_variance_score(y_test, y_pred),'\n',
   'Mean absolute error      :',mean_absolute_error(y_test, y_pred),'\n',
   'Mean squared error       :',mean_squared_error(y_test, y_pred),'\n',
   'R² score            :',r2_score(y_test, y_pred))

In [None]:
score_df_ridge = pd.DataFrame(columns=['Method','Ridge Regression'])
score_df_ridge['Method']=[
                    'Mean Absolute Error',
                    'Mean Squared Error',
                    'RMSE',
                    'R²']
score_df_ridge['Ridge Regression']=[
                   mean_absolute_error(y_test, y_pred),
                   mean_squared_error(y_test, y_pred),
                   np.sqrt(mean_squared_error(y_test, y_pred)),  # RMSE, computed without the deprecated `squared` argument
                   r2_score(y_test, y_pred)]

In [None]:
score_df_ridge

#### _Random Forest Regressor_

In [None]:
params_rf = {
    'max_depth': [None, 1, 5, 10, 30, 35],
    'max_features': [.1, .2, .3],
    'n_estimators': [200, 300, 400,500]
}

In [None]:
rf = RandomForestRegressor()

In [None]:
rf_regressor = GridSearchCV(rf, params_rf,cv = 10, n_jobs = -1)

In [None]:
rf_regressor.fit(X_train, y_train)

In [None]:
rf_regressor.best_params_

In [None]:
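# Note: max_depth = 65 and n_estimators = 30 are not among the values searched in params_rf;
# rf_regressor.best_estimator_ holds the model actually selected by the grid search.
rf_model = RandomForestRegressor(max_depth = 65,
                                 max_features = 0.3,
                                 n_estimators = 30)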

In [None]:
rf_model.fit(X_train, y_train)

In [None]:
rf_model.score(X_train, y_train) # the random forest has a coefficient of determination (R²) of about 85% on the training data

In [None]:
rf_model.score(X_test, y_test) # R² of about 24% on unseen data; the large gap from the training score indicates high variance (overfitting)
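
A single train/test split on a small dataset gives a noisy picture of that gap. One way to get a steadier estimate is k-fold cross-validation; the sketch below illustrates the idea on synthetic data (not the restaurant data), with hypothetical variable names and a deliberately small forest to keep it fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_regression(n_samples=100, n_features=20, noise=30.0, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0)

# R² on five held-out folds; each fold is scored by a model that never saw it.
cv_r2 = cross_val_score(forest, X_demo, y_demo, cv=5, scoring='r2')

# R² on the same data the model was fitted on, for comparison.
train_r2 = forest.fit(X_demo, y_demo).score(X_demo, y_demo)
variance_gap = train_r2 - cv_r2.mean()
```

A large `variance_gap` is the same overfitting signal as the train/test difference above, averaged over several splits.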

In [None]:
y_pred = rf_model.predict(X_test)

In [None]:
print('Explained variance score',explained_variance_score(y_test, y_pred),'\n',
   'Mean absolute error      :',mean_absolute_error(y_test, y_pred),'\n',
   'Mean squared error       :',mean_squared_error(y_test, y_pred),'\n',
   'RMSE                    :', np.sqrt(mean_squared_error(y_test, y_pred)), '\n',
   'R² score            :',r2_score(y_test, y_pred))

In [None]:
score_df_rf = pd.DataFrame(columns=['Method','Random Forest'])
score_df_rf['Method']=[
                    'Mean Absolute Error',
                    'Mean Squared Error',
                    'RMSE',
                    'R²']
score_df_rf['Random Forest']=[
                   mean_absolute_error(y_test, y_pred),
                   mean_squared_error(y_test, y_pred),
                   np.sqrt(mean_squared_error(y_test, y_pred)),  # RMSE, computed without the deprecated `squared` argument
                   r2_score(y_test, y_pred)]

In [None]:
score_df_rf

In [None]:
pred_sub_rf = rf_model.predict(test.drop(columns = 'Id'))

In [None]:
pred_sub_rf = np.expm1(pred_sub_rf) # invert the log1p transform applied to revenue

In [None]:
residual_rf = np.expm1(y_test) - np.expm1(y_pred)

In [None]:
plt.scatter(residual_rf, np.expm1(y_pred))

In [None]:
submission_rf= pd.DataFrame(columns = ['Id', 'Prediction'])
submission_rf['Id'] = test['Id']
submission_rf['Prediction'] = pred_sub_rf
submission_rf.to_csv('submission_rf.csv', index = False)

In [None]:
rf_feature = pd.Series(index = X_train.columns, data = np.abs(rf_model.feature_importances_))

In [None]:
rf_feature.sort_values().plot(kind = 'bar', figsize = (16, 8))

#### _KNeighborsRegressor_

In [None]:
knn = KNeighborsRegressor()

In [None]:
params_knn = {
    'n_neighbors' : [3, 5, 7, 9, 11],
}

In [None]:
knn_regressor = GridSearchCV(knn, params_knn,cv=10, n_jobs=-1)

In [None]:
knn_regressor.fit(X_train, y_train)

In [None]:
knn_regressor.best_params_

In [None]:
knn_model = KNeighborsRegressor(n_neighbors = 9)

In [None]:
knn_model.fit(X_train, y_train)

In [None]:
knn_model.score(X_train, y_train) # KNN has a coefficient of determination (R²) of about 27% on the training data

In [None]:
knn_model.score(X_test, y_test) # R² of about 19% on unseen data; the score is low on both sets, which points to underfitting more than to high variance

In [None]:
y_pred = knn_model.predict(X_test)

In [None]:
print('Explained variance score',explained_variance_score(y_test, y_pred),'\n',
   'Mean absolute error      :',mean_absolute_error(y_test, y_pred),'\n',
   'Mean squared error       :',mean_squared_error(y_test, y_pred),'\n',
   'R² score            :',r2_score(y_test, y_pred))

In [None]:
score_df_knn = pd.DataFrame(columns=['Method','KNN'])
score_df_knn['Method']=[
                    'Mean Absolute Error',
                    'Mean Squared Error',
                    'RMSE',
                    'R²']
score_df_knn['KNN']=[
                   mean_absolute_error(y_test, y_pred),
                   mean_squared_error(y_test, y_pred),
                   np.sqrt(mean_squared_error(y_test, y_pred)),  # RMSE, computed without the deprecated `squared` argument
                   r2_score(y_test, y_pred)]

In [None]:
score_df_knn
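
KNN relies on distances, so features on larger scales dominate the neighbor search when the inputs are not standardized. A possible refinement, not applied in this notebook, is to put a scaler and the regressor in one `Pipeline` so that scaling is refitted inside every cross-validation fold. The sketch below uses synthetic data and hypothetical step names (`scale`, `knn`).

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)
X_demo[:, 0] *= 1000.0  # exaggerate one feature's scale

knn_pipe = Pipeline([
    ('scale', StandardScaler()),      # zero mean, unit variance per feature
    ('knn', KNeighborsRegressor()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
neighbor_grid = [3, 5, 7, 9, 11]
knn_search = GridSearchCV(knn_pipe, {'knn__n_neighbors': neighbor_grid}, cv=5)
knn_search.fit(X_demo, y_demo)

best_k = knn_search.best_params_['knn__n_neighbors']
demo_pred = knn_search.predict(X_demo[:5])
```

Because the scaler lives inside the pipeline, its statistics are learned from the training folds only, which avoids leaking information from the validation fold.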

In [None]:
pred_sub_knn = knn_model.predict(test.drop(columns = 'Id'))

In [None]:
pred_sub_knn = np.expm1(pred_sub_knn) # invert the log1p transform applied to revenue

In [None]:
residual_knn = np.expm1(y_test) - np.expm1(y_pred)

In [None]:
plt.scatter(residual_knn, np.expm1(y_pred))

In [None]:
submission_knn= pd.DataFrame(columns = ['Id', 'Prediction'])
submission_knn['Id'] = test['Id']
submission_knn['Prediction'] = pred_sub_knn
submission_knn.to_csv('submission_knn.csv', index = False)

#### _XGBoost Regressor_

In [None]:
xgb = XGBRegressor()

In [None]:
params_xgb = {
    'learning_rate': [.05,.1],
    'max_depth': [4, 9],
    'subsample': [.5, .7],
    'n_estimators': [100,200]
}

In [None]:
xgb_regressor = GridSearchCV(xgb, params_xgb, cv = 10, n_jobs=-1)

In [None]:
xgb_regressor.fit(X_train, y_train)

In [None]:
xgb_regressor.best_params_

In [None]:
xgb_model = XGBRegressor(learning_rate = 0.1, max_depth = 9, n_estimators = 200, subsample = 0.5)

In [None]:
xgb_model.fit(X_train, y_train)

In [None]:
xgb_model.score(X_train, y_train)

In [None]:
xgb_model.score(X_test, y_test)

In [None]:
y_pred = xgb_model.predict(X_test)

In [None]:
print('Explained variance score',explained_variance_score(y_test, y_pred),'\n',
   'Mean absolute error      :',mean_absolute_error(y_test, y_pred),'\n',
   'Mean squared error       :',mean_squared_error(y_test, y_pred),'\n',
   'R² score            :',r2_score(y_test, y_pred))

In [None]:
score_df_xgb = pd.DataFrame(columns=['Method','XGBoost'])
score_df_xgb['Method']=[
                    'Mean Absolute Error',
                    'Mean Squared Error',
                    'RMSE',
                    'R²']
score_df_xgb['XGBoost']=[
                   mean_absolute_error(y_test, y_pred),
                   mean_squared_error(y_test, y_pred),
                   np.sqrt(mean_squared_error(y_test, y_pred)),  # RMSE, computed without the deprecated `squared` argument
                   r2_score(y_test, y_pred)]

In [None]:
score_df_xgb

In [None]:
residual_xgb = np.expm1(y_test) - np.expm1(y_pred) # invert the log1p transform before computing residuals

In [None]:
plt.scatter(residual_xgb, np.expm1(y_pred))

In [None]:
pred_sub_xgb = xgb_model.predict(test.drop(columns = 'Id'))

In [None]:
pred_sub_xgb = np.expm1(pred_sub_xgb) # invert the log1p transform applied to revenue

In [None]:
submission_xgb= pd.DataFrame(columns = ['Id', 'Prediction'])
submission_xgb['Id'] = test['Id']
submission_xgb['Prediction'] = pred_sub_xgb
submission_xgb.to_csv('submission_xgb.csv', index = False)

In [None]:
# combine the score tables of all models into one comparison table
(score_df_lr
 .merge(score_df_ridge, on = 'Method')
 .merge(score_df_lasso, on = 'Method')
 .merge(score_df_rf, on = 'Method')
 .merge(score_df_knn, on = 'Method')
 .merge(score_df_xgb, on = 'Method'))

### _Conclusion_

### Reflection

Looking back on this project, the models did not score well, most likely because the dataset needed more cleaning and feature engineering. Among the `P1`-`P37` features, some zero values may represent missing data rather than a true value of 0. One way to address this is to use a **KNN imputer**. We could also find a way to keep the `City` column, because location is an important factor in predicting the profitability of a restaurant.
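
The imputation idea can be sketched as follows, under the assumption that every zero in a `P` column really is a missing value (which would need to be checked against the data). The toy frame and the choice of `n_neighbors=2` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame: the 0.0 in P1 is assumed to mean "missing".
toy = pd.DataFrame({
    'P1': [1.0, 2.0, 0.0, 4.0, 5.0],
    'P2': [2.0, 4.0, 6.0, 8.0, 10.0],
})

# Mark zeros as missing so the imputer knows which cells to fill.
toy_nan = toy.replace(0, np.nan)

# Each missing cell is filled with the mean of that feature over the
# n_neighbors rows that are closest on the remaining features.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy_nan), columns=toy.columns)
```

Here the missing `P1` is filled from the two rows whose `P2` is closest to 6. On the real data the imputer would be fitted on the training `P` columns and then applied to the test set with `transform`.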