<a href="https://colab.research.google.com/github/michaelcerda/Kaggle-Projects/blob/main/Regression_Restaurant_Revenue.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Description

This dataset contains demographic, real estate, and commercial data used to predict the annual sales of 100,000 regional restaurant locations. Predictions are scored with RMSE (Root Mean Squared Error).
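
For reference, RMSE is the square root of the mean squared difference between predicted and actual revenue. A minimal sketch with made-up numbers (the `rmse` helper and the sample values are illustrative, not part of the competition data):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up revenues, for illustration only
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.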

### Load data

In [None]:
import pandas as pd
import numpy as np

In [None]:
train_df = pd.read_csv('/content/drive/MyDrive/Restaurant Revenue/train.csv.zip')
test_df = pd.read_csv('/content/drive/MyDrive/Restaurant Revenue/test.csv.zip')

Let's look at the first few rows of the dataset:

In [None]:
train_df.head()

Unnamed: 0,Id,Open Date,City,City Group,Type,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37,revenue
0,0,07/17/1999,İstanbul,Big Cities,IL,4,5.0,4.0,4.0,2,2,5,4,5,5,3,5,5.0,1,2,2,2,4,5,4,1,3,3,1,1,1.0,4.0,2.0,3.0,5,3,4,5,5,4,3,4,5653753.0
1,1,02/14/2008,Ankara,Big Cities,FC,4,5.0,4.0,4.0,1,2,5,5,5,5,1,5,5.0,0,0,0,0,0,3,2,1,3,2,0,0,0.0,0.0,3.0,3.0,0,0,0,0,0,0,0,0,6923131.0
2,2,03/09/2013,Diyarbakır,Other,IL,2,4.0,2.0,5.0,2,3,5,5,5,5,2,5,5.0,0,0,0,0,0,1,1,1,1,1,0,0,0.0,0.0,1.0,3.0,0,0,0,0,0,0,0,0,2055379.0
3,3,02/02/2012,Tokat,Other,IL,6,4.5,6.0,6.0,4,4,10,8,10,10,8,10,7.5,6,4,9,3,12,20,12,6,1,10,2,2,2.5,2.5,2.5,7.5,25,12,10,6,18,12,12,6,2675511.0
4,4,05/09/2009,Gaziantep,Other,IL,3,4.0,3.0,4.0,2,2,5,5,5,5,2,5,5.0,2,1,2,1,4,2,2,1,2,1,2,3,3.0,5.0,1.0,3.0,5,1,3,2,3,4,3,3,4316715.0


Each row contains an open date (presumably the restaurant's opening date), the city where the restaurant is located, a city group that categorizes the location, a restaurant type, and 37 numerical columns (P1 to P37). These numerical columns encode commercial, demographic, and real estate data.

Let's look at the total number of rows and columns of the train dataset:

In [None]:
train_df.shape

(137, 43)

Let's also look at the total number of rows and columns of the test dataset:

In [None]:
test_df.shape

(100000, 42)

We can see a huge disparity between the sizes of the train and test datasets (137 rows vs. 100,000). We'll keep this in mind later when we select a machine learning model to fit on the training data.

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137 entries, 0 to 136
Data columns (total 43 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Id          137 non-null    int64  
 1   Open Date   137 non-null    object 
 2   City        137 non-null    object 
 3   City Group  137 non-null    object 
 4   Type        137 non-null    object 
 5   P1          137 non-null    int64  
 6   P2          137 non-null    float64
 7   P3          137 non-null    float64
 8   P4          137 non-null    float64
 9   P5          137 non-null    int64  
 10  P6          137 non-null    int64  
 11  P7          137 non-null    int64  
 12  P8          137 non-null    int64  
 13  P9          137 non-null    int64  
 14  P10         137 non-null    int64  
 15  P11         137 non-null    int64  
 16  P12         137 non-null    int64  
 17  P13         137 non-null    float64
 18  P14         137 non-null    int64  
 19  P15         137 non-null    i

In [None]:
train_df.describe()

Unnamed: 0,Id,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37,revenue
count,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0
mean,68.0,4.014599,4.408759,4.317518,4.372263,2.007299,3.357664,5.423358,5.153285,5.445255,5.489051,3.262774,5.29927,5.080292,1.416058,1.386861,1.941606,1.036496,1.941606,4.905109,4.547445,2.270073,2.226277,3.423358,1.372263,1.211679,1.470803,1.145985,3.222628,3.135036,2.729927,1.941606,2.525547,1.138686,2.489051,2.029197,2.211679,1.116788,4453533.0
std,39.692569,2.910391,1.5149,1.032337,1.016462,1.20962,2.134235,2.296809,1.858567,1.834793,1.847561,1.910767,1.941668,1.036527,2.729583,2.398677,3.505807,2.030679,3.300549,5.604467,3.708041,2.05263,1.23069,4.559609,2.304112,2.133179,2.612024,2.067039,2.308806,1.680887,5.536647,3.512093,5.230117,1.69854,5.165093,3.436272,4.168211,1.790768,2576072.0
min,0.0,1.0,1.0,0.0,3.0,1.0,1.0,1.0,1.0,4.0,4.0,1.0,2.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1149870.0
25%,34.0,2.0,4.0,4.0,4.0,1.0,2.0,5.0,4.0,4.0,5.0,2.0,4.0,5.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,2.0,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2999068.0
50%,68.0,3.0,5.0,4.0,4.0,2.0,3.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,3.0,4.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,2.5,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3939804.0
75%,102.0,4.0,5.0,5.0,5.0,2.0,4.0,5.0,5.0,5.0,5.0,4.0,5.0,5.0,2.0,2.0,3.0,1.0,4.0,5.0,5.0,3.0,3.0,5.0,2.0,2.0,2.5,2.0,4.0,3.0,4.0,3.0,3.0,2.0,3.0,4.0,3.0,2.0,5166635.0
max,136.0,12.0,7.5,7.5,7.5,8.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,7.5,15.0,10.0,15.0,15.0,12.0,25.0,15.0,15.0,5.0,25.0,10.0,10.0,12.5,12.5,12.5,7.5,25.0,15.0,25.0,6.0,24.0,15.0,20.0,8.0,19696940.0


### Check for null values

Let's check if the training dataset has null values so we can fill those with appropriate values.

In [None]:
train_na = train_df.isnull().sum()/len(train_df) * 100
train_na = train_na.drop(train_na[train_na == 0].index).sort_values(ascending=False)
print('There are {} missing values'.format(len(train_na)))

There are 0 missing values


It turns out that we don't need to worry about missing values, as there are none in the training dataset. Let's do the same for the test dataset.

In [None]:
test_na = test_df.isnull().sum()/len(test_df) * 100
test_na = test_na.drop(test_na[test_na == 0].index).sort_values(ascending=False)
print('There are {} missing values'.format(len(test_na)))

There are 0 missing values


We can see that there are also no missing values in the test dataset.
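
Note that `len(train_na)` counts the columns that contain at least one null, not the individual null cells. Both are zero here, but on other data the two numbers can differ. A small sketch on a made-up frame (the `toy` data is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: column "a" has two nulls, "b" has one, "c" has none
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [np.nan, 2.0, 3.0],
    "c": [1, 2, 3],
})

null_per_column = toy.isnull().sum()
columns_with_nulls = int((null_per_column > 0).sum())  # number of affected columns
total_null_cells = int(null_per_column.sum())          # number of missing cells

print(columns_with_nulls, total_null_cells)
```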

To apply the same preprocessing to both datasets in one pass, let's combine the train and test datasets for now.

In [None]:
#Save the number of rows for each dataset
len_train = train_df.shape[0]
len_test = test_df.shape[0]

In [None]:
data = pd.concat((train_df, test_df)).reset_index(drop=True)

Since machine learning models require numerical values for training, we need to convert the "Open Date", "City", "City Group" and "Type" columns to numerical values. Let's start with the "Open Date" column.

In [None]:
import datetime

data['date'] = pd.to_datetime(data['Open Date'])
data.drop(['Open Date'], axis = 1, inplace=True)
data['day'] = data['date'].dt.day
data['month'] = data['date'].dt.month
data['year'] = data['date'].dt.year

data.drop(['date'],axis=1,inplace=True)

data.head()

Unnamed: 0,Id,City,City Group,Type,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37,revenue,day,month,year
0,0,İstanbul,Big Cities,IL,4,5.0,4.0,4.0,2,2,5,4,5,5,3,5,5.0,1,2,2,2,4,5,4,1,3,3,1,1,1.0,4.0,2.0,3.0,5,3,4,5,5,4,3,4,5653753.0,17,7,1999
1,1,Ankara,Big Cities,FC,4,5.0,4.0,4.0,1,2,5,5,5,5,1,5,5.0,0,0,0,0,0,3,2,1,3,2,0,0,0.0,0.0,3.0,3.0,0,0,0,0,0,0,0,0,6923131.0,14,2,2008
2,2,Diyarbakır,Other,IL,2,4.0,2.0,5.0,2,3,5,5,5,5,2,5,5.0,0,0,0,0,0,1,1,1,1,1,0,0,0.0,0.0,1.0,3.0,0,0,0,0,0,0,0,0,2055379.0,9,3,2013
3,3,Tokat,Other,IL,6,4.5,6.0,6.0,4,4,10,8,10,10,8,10,7.5,6,4,9,3,12,20,12,6,1,10,2,2,2.5,2.5,2.5,7.5,25,12,10,6,18,12,12,6,2675511.0,2,2,2012
4,4,Gaziantep,Other,IL,3,4.0,3.0,4.0,2,2,5,5,5,5,2,5,5.0,2,1,2,1,4,2,2,1,2,1,2,3,3.0,5.0,1.0,3.0,5,1,3,2,3,4,3,3,4316715.0,9,5,2009


We've converted "Open Date" into numerical values by adding the columns "day", "month" and "year".
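
`pd.to_datetime` infers the format when none is given. The dates in the preview look like month/day/year, so passing the format explicitly removes any ambiguity. A sketch on two of the dates shown earlier (the `format` string is an assumption based on those rows):

```python
import pandas as pd

sample = pd.DataFrame({"Open Date": ["07/17/1999", "02/14/2008"]})

# Assumes MM/DD/YYYY, as the first rows of the training data suggest
parsed = pd.to_datetime(sample["Open Date"], format="%m/%d/%Y")
sample["day"] = parsed.dt.day
sample["month"] = parsed.dt.month
sample["year"] = parsed.dt.year

print(sample[["day", "month", "year"]])
```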

Now let's convert 'City', 'City Group' and 'Type' columns to numerical values using One Hot Encoding:

In [None]:
from sklearn.preprocessing import OneHotEncoder

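def OHE(df, cols):
  # One-hot encode the given columns and append the result as new columns
  enc = OneHotEncoder(handle_unknown = 'ignore')
  ohe_cols = pd.DataFrame(enc.fit_transform(df[cols]).toarray())
  df = df.join(ohe_cols)

  # Drop the original categorical columns (use the argument, not a global)
  df.drop(cols, axis = 1, inplace = True)

  return df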

In [None]:
columns = ['City', 'City Group', 'Type']

data = OHE(data, columns)

In [None]:
data.head()

Unnamed: 0,Id,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37,revenue,day,...,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68
0,0,4,5.0,4.0,4.0,2,2,5,4,5,5,3,5,5.0,1,2,2,2,4,5,4,1,3,3,1,1,1.0,4.0,2.0,3.0,5,3,4,5,5,4,3,4,5653753.0,17,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,1,4,5.0,4.0,4.0,1,2,5,5,5,5,1,5,5.0,0,0,0,0,0,3,2,1,3,2,0,0,0.0,0.0,3.0,3.0,0,0,0,0,0,0,0,0,6923131.0,14,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,2,2,4.0,2.0,5.0,2,3,5,5,5,5,2,5,5.0,0,0,0,0,0,1,1,1,1,1,0,0,0.0,0.0,1.0,3.0,0,0,0,0,0,0,0,0,2055379.0,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,3,6,4.5,6.0,6.0,4,4,10,8,10,10,8,10,7.5,6,4,9,3,12,20,12,6,1,10,2,2,2.5,2.5,2.5,7.5,25,12,10,6,18,12,12,6,2675511.0,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,4,3,4.0,3.0,4.0,2,2,5,5,5,5,2,5,5.0,2,1,2,1,4,2,2,1,2,1,2,3,3.0,5.0,1.0,3.0,5,1,3,2,3,4,3,3,4316715.0,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


Let's split the data back into train and test:

In [None]:
# .copy() gives independent frames, so adding the "kfold" column below
# does not trigger pandas' SettingWithCopyWarning
train = data[:len_train].copy()
test = data[len_train:].copy()

After splitting the data into train and test, let's split the train data into folds. Cross-validation does not prevent overfitting by itself, but scoring each model on rows it was not trained on gives a more reliable error estimate, which matters with only 137 training rows.

In [None]:
from sklearn.model_selection import KFold

train.loc[:, "kfold"] = 0

kf = KFold(n_splits = 3)
y = train.revenue.values

for f, (tr, val) in enumerate(kf.split(X=train, y=y)):
  train.loc[val, "kfold"] = f
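
`KFold` splits rows in their original order unless `shuffle=True` is passed. A sketch of the same fold assignment on a toy frame, with shuffling and a fixed seed for reproducibility (both are assumptions, not what the cell above uses):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

toy = pd.DataFrame({"revenue": np.arange(9, dtype=float)})
toy["kfold"] = -1  # placeholder until every row is assigned a fold

kf = KFold(n_splits=3, shuffle=True, random_state=42)
for fold, (_, val_idx) in enumerate(kf.split(toy)):
    # val_idx holds positional indices, so assign by position
    toy.iloc[val_idx, toy.columns.get_loc("kfold")] = fold

print(toy["kfold"].value_counts().sort_index())
```

Shuffling matters when the row order carries information, for example if the file is sorted by date or city.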


In [None]:
train.tail()

Unnamed: 0,Id,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37,revenue,day,...,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,kfold
132,132,2,3.0,3.0,5.0,4,2,4,4,4,4,4,4,4.0,0,0,0,0,0,4,3,2,1,1,0,0,0.0,0.0,2.0,3.0,0,0,0,0,0,0,0,0,5787594.0,25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,2
133,133,4,5.0,4.0,4.0,2,3,5,4,4,5,5,4,5.0,0,0,0,0,0,3,2,2,1,1,0,0,0.0,0.0,3.0,3.0,0,0,0,0,0,0,0,0,9262754.0,12,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,2
134,134,3,4.0,4.0,4.0,2,3,5,5,5,5,1,5,5.0,0,0,0,0,0,2,3,1,2,2,0,0,0.0,0.0,2.0,3.0,0,0,0,0,0,0,0,0,2544857.0,8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,2
135,135,4,5.0,4.0,5.0,2,2,5,5,5,5,2,5,5.0,0,0,0,0,0,1,1,1,1,1,0,0,0.0,0.0,3.0,3.0,0,0,0,0,0,0,0,0,7217634.0,29,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,2
136,136,4,5.0,3.0,5.0,2,2,5,4,4,5,4,4,5.0,0,0,0,0,0,2,1,1,1,1,0,0,0.0,0.0,3.0,3.0,0,0,0,0,0,0,0,0,6363241.0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,2


In [None]:
train.shape

(137, 112)

In [None]:
test.shape

(100000, 111)

In [None]:
test.loc[:, ~test.columns.isin(['Id', 'revenue','kfold'])].values

array([[ 1. ,  4. ,  4. , ...,  1. ,  0. ,  0. ],
       [ 3. ,  4. ,  4. , ...,  0. ,  1. ,  0. ],
       [ 3. ,  4. ,  4. , ...,  1. ,  0. ,  0. ],
       ...,
       [ 4. ,  5. ,  4. , ...,  0. ,  1. ,  0. ],
       [12. ,  7.5,  6. , ...,  1. ,  0. ,  0. ],
       [ 2. ,  5. ,  4. , ...,  0. ,  1. ,  0. ]])

In [None]:
test.head()

Unnamed: 0,Id,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37,revenue,day,...,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68
137,0,1,4.0,4.0,4.0,1,2,5,4,5,5,5,3,4.0,0,0,0,2,0,5,5,3,1,4,0,0,0.0,0.0,2.0,3.0,0,0,0,0,0,0,0,0,,22,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
138,1,3,4.0,4.0,4.0,2,2,5,3,4,4,2,4,5.0,0,0,0,0,0,5,5,3,2,1,0,0,0.0,0.0,1.0,3.0,0,0,0,0,0,0,0,0,,18,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
139,2,3,4.0,4.0,4.0,2,2,5,4,4,5,4,5,5.0,0,0,0,0,0,5,5,5,5,5,0,0,0.0,0.0,2.0,3.0,0,0,0,0,0,0,0,0,,30,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
140,3,2,4.0,4.0,4.0,2,3,5,4,5,4,3,4,5.0,0,0,0,0,4,4,4,3,2,2,0,0,0.0,0.0,2.0,3.0,0,4,0,0,0,0,0,0,,6,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
141,4,2,4.0,4.0,4.0,1,2,5,4,5,4,3,5,4.0,0,0,0,0,0,1,5,3,1,1,0,0,0.0,0.0,5.0,3.0,0,0,0,0,0,0,0,0,,31,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


In [None]:
test_id = test.Id
print(test_id)

137           0
138           1
139           2
140           3
141           4
          ...  
100132    99995
100133    99996
100134    99997
100135    99998
100136    99999
Name: Id, Length: 100000, dtype: int64


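It's time to train models on the training data. Each fold's score is printed with the label "mse", but because squared=False is passed to mean_squared_error, the value is actually the RMSE. (Newer scikit-learn releases replace this argument with a separate root_mean_squared_error function.)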

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [None]:
def run_training(fold):
  train_data = train[train.kfold != fold].reset_index(drop=True)
  valid_data = train[train.kfold == fold].reset_index(drop=True)

  xtrain = train_data.loc[:, ~train_data.columns.isin(['Id', 'revenue','kfold'])].values
  ytrain = train_data.revenue.values

  xvalid = valid_data.loc[:, ~train_data.columns.isin(['Id', 'revenue', 'kfold'])].values
  yvalid = valid_data.revenue.values

  rf = RandomForestRegressor()
  rf.fit(xtrain, ytrain)
  pred = rf.predict(xvalid)

  mse = mean_squared_error(yvalid, pred, squared=False)
  print(f'fold = {fold}, mse = {mse}')

  valid_data.loc[:, 'rf_pred'] = pred

  return valid_data

In [None]:
updated_df = []

for i in range(3):
  temp_df = run_training(i)
  updated_df.append(temp_df)

fin_valid_data = pd.concat(updated_df)
fin_valid_data.to_csv('/content/drive/MyDrive/Restaurant Revenue/Predictions/rf_pred.csv', index = False)

fold = 0, mse = 2596939.707438737
fold = 1, mse = 2517076.809780324
fold = 2, mse = 2302642.513225501
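
The cells that follow repeat this function with a different model each time. A hedged sketch of one way to factor that out is a helper that takes any scikit-learn style regressor and returns out-of-fold predictions. The name `oof_predictions`, the synthetic data, and the use of `LinearRegression` are all illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

def oof_predictions(model, X, y, folds):
    """Fit a fresh copy of `model` per fold; return out-of-fold predictions and per-fold RMSE."""
    preds = np.zeros(len(y), dtype=float)
    scores = []
    for fold in np.unique(folds):
        tr, va = folds != fold, folds == fold
        fitted = clone(model).fit(X[tr], y[tr])
        preds[va] = fitted.predict(X[va])
        scores.append(float(np.sqrt(np.mean((y[va] - preds[va]) ** 2))))
    return preds, scores

# Synthetic data standing in for the feature matrix and revenue
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=30)
folds = np.arange(30) % 3

preds, scores = oof_predictions(LinearRegression(), X, y, folds)
print(scores)
```

Passing a configured estimator instance (rather than a class) keeps tuned hyperparameters attached to the model, and `clone` ensures each fold starts from an unfitted copy.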


In [None]:
import lightgbm as lgb

In [None]:
def run_training(fold):
  train_data = train[train.kfold != fold].reset_index(drop=True)
  valid_data = train[train.kfold == fold].reset_index(drop=True)

  xtrain = train_data.loc[:, ~train_data.columns.isin(['Id', 'revenue','kfold'])].values
  ytrain = train_data.revenue.values

  xvalid = valid_data.loc[:, ~train_data.columns.isin(['Id', 'revenue', 'kfold'])].values
  yvalid = valid_data.revenue.values

  reg = lgb.LGBMRegressor()
  reg.fit(xtrain, ytrain)
  pred = reg.predict(xvalid)

  mse = mean_squared_error(yvalid, pred, squared=False)
  print(f'fold = {fold}, mse = {mse}')

  valid_data.loc[:, 'lgb_pred'] = pred

  return valid_data

In [None]:
updated_df = []

for i in range(3):
  temp_df = run_training(i)
  updated_df.append(temp_df)

fin_valid_data = pd.concat(updated_df)
fin_valid_data.to_csv('/content/drive/MyDrive/Restaurant Revenue/Predictions/lgb_pred.csv', index = False)

fold = 0, mse = 2737726.0277667986
fold = 1, mse = 2365974.345421145
fold = 2, mse = 2395673.696898725


In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
def run_training(fold):
  train_data = train[train.kfold != fold].reset_index(drop=True)
  valid_data = train[train.kfold == fold].reset_index(drop=True)

  xtrain = train_data.loc[:, ~train_data.columns.isin(['Id', 'revenue','kfold'])].values
  ytrain = train_data.revenue.values

  xvalid = valid_data.loc[:, ~train_data.columns.isin(['Id', 'revenue', 'kfold'])].values
  yvalid = valid_data.revenue.values

  reg = GradientBoostingRegressor(max_depth=5,
          learning_rate=0.038347,
          n_estimators=3673,
          min_samples_split=3,
          min_samples_leaf=6,
          loss='huber',
          max_features='log2',
          criterion='friedman_mse')
  reg.fit(xtrain, ytrain)
  pred = reg.predict(xvalid)

  mse = mean_squared_error(yvalid, pred, squared=False)
  print(f'fold = {fold}, mse = {mse}')

  valid_data.loc[:, 'gbr_pred'] = pred

  return valid_data

In [None]:
updated_df = []

for i in range(3):
  temp_df = run_training(i)
  updated_df.append(temp_df)

fin_valid_data = pd.concat(updated_df)
fin_valid_data.to_csv('/content/drive/MyDrive/Restaurant Revenue/Predictions/gbr_pred.csv', index = False)

fold = 0, mse = 2528472.5830231095
fold = 1, mse = 2074709.3147441002
fold = 2, mse = 2172178.4222910353


In [None]:
import xgboost as xgb

In [None]:
def run_training(fold):
  train_data = train[train.kfold != fold].reset_index(drop=True)
  valid_data = train[train.kfold == fold].reset_index(drop=True)

  xtrain = train_data.loc[:, ~train_data.columns.isin(['Id', 'revenue','kfold'])].values
  ytrain = train_data.revenue.values

  xvalid = valid_data.loc[:, ~train_data.columns.isin(['Id', 'revenue', 'kfold'])].values
  yvalid = valid_data.revenue.values

  reg = xgb.XGBRegressor()
  reg.fit(xtrain, ytrain)
  pred = reg.predict(xvalid)

  mse = mean_squared_error(yvalid, pred, squared=False)
  print(f'fold = {fold}, mse = {mse}')

  valid_data.loc[:, 'xgb_pred'] = pred

  return valid_data

In [None]:
updated_df = []

for i in range(3):
  temp_df = run_training(i)
  updated_df.append(temp_df)

fin_valid_data = pd.concat(updated_df)
fin_valid_data.to_csv('/content/drive/MyDrive/Restaurant Revenue/Predictions/xgb_pred.csv', index = False)

fold = 0, mse = 2570828.1273529655
fold = 1, mse = 3590278.0630257195
fold = 2, mse = 2806370.590218401


In [None]:
from sklearn.ensemble import StackingRegressor

In [None]:
def run_training(fold):
  train_data = train[train.kfold != fold].reset_index(drop=True)
  valid_data = train[train.kfold == fold].reset_index(drop=True)

  xtrain = train_data.loc[:, ~train_data.columns.isin(['Id', 'revenue','kfold'])].values
  ytrain = train_data.revenue.values

  xvalid = valid_data.loc[:, ~train_data.columns.isin(['Id', 'revenue', 'kfold'])].values
  yvalid = valid_data.revenue.values

  estimators = [
              ('gbr', GradientBoostingRegressor(max_depth=5,
          learning_rate=0.038347,
          n_estimators=3673,
          min_samples_split=3,
          min_samples_leaf=6,
          loss='huber',
          max_features='log2',
          criterion='friedman_mse')),
          ('xgb', xgb.XGBRegressor()),
          ('lgb', lgb.LGBMRegressor()),
          ('rf', RandomForestRegressor())
  ]

  reg = StackingRegressor(
      estimators = estimators, 
      final_estimator = RandomForestRegressor()
  )
  reg.fit(xtrain, ytrain)
  pred = reg.predict(xvalid)

  mse = mean_squared_error(yvalid, pred, squared=False)
  print(f'fold = {fold}, mse = {mse}')

  valid_data.loc[:, 'stack_pred'] = pred

  return valid_data

In [None]:
updated_df = []

for i in range(3):
  temp_df = run_training(i)
  updated_df.append(temp_df)

fin_valid_data = pd.concat(updated_df)
fin_valid_data.to_csv('/content/drive/MyDrive/Restaurant Revenue/Predictions/stack_pred.csv', index = False)

fold = 0, mse = 2874014.450060041
fold = 1, mse = 2916853.991674585
fold = 2, mse = 2735267.397245524


In [None]:
import glob

files = glob.glob('/content/drive/MyDrive/Restaurant Revenue/Predictions/*.csv')
data = None
for f in files:
  if data is None:
    data = pd.read_csv(f)
  else:
    temp_data = pd.read_csv(f)
    data = data.merge(temp_data, on="Id", how="left")
  
  
  print(data.head(10))

   Id  P1   P2   P3   P4  P5  P6  ...   64   65   66   67   68  kfold      lgb_pred
0   0   4  5.0  4.0  4.0   2   2  ...  0.0  0.0  0.0  1.0  0.0      0  4.567630e+06
1   1   4  5.0  4.0  4.0   1   2  ...  0.0  0.0  1.0  0.0  0.0      0  3.724448e+06
2   2   2  4.0  2.0  5.0   2   3  ...  1.0  0.0  0.0  1.0  0.0      0  2.756154e+06
3   3   6  4.5  6.0  6.0   4   4  ...  1.0  0.0  0.0  1.0  0.0      0  2.867801e+06
4   4   3  4.0  3.0  4.0   2   2  ...  1.0  0.0  0.0  1.0  0.0      0  2.774238e+06
5   5   6  6.0  4.5  7.5   8  10  ...  0.0  0.0  1.0  0.0  0.0      0  3.945803e+06
6   6   2  3.0  4.0  4.0   1   5  ...  0.0  0.0  0.0  1.0  0.0      0  3.499442e+06
7   7   4  5.0  4.0  5.0   2   3  ...  0.0  0.0  0.0  1.0  0.0      0  5.600981e+06
8   8   1  1.0  4.0  4.0   1   2  ...  1.0  0.0  0.0  1.0  0.0      0  3.652422e+06
9   9   6  4.5  6.0  7.5   6   4  ...  1.0  0.0  0.0  1.0  0.0      0  4.544136e+06

[10 rows x 113 columns]
   Id  P1_x  P2_x  P3_x  P4_x  ...  66_y  67_y  68_

In [None]:
data = data.loc[:,~data.columns.duplicated()]

In [None]:
data["kfold"]

0      0
1      0
2      0
3      0
4      0
      ..
132    2
133    2
134    2
135    2
136    2
Name: kfold, Length: 137, dtype: int64

In [None]:
from functools import partial
from scipy.optimize import fmin
from sklearn.metrics import mean_squared_error

In [None]:
class OptimizeRMSE:
  def __init__(self):
    self.coef = 0

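  def _rmse(self, coef, X, y):
    # Despite the name, this returns the MSE (no square root is taken).
    # Minimizing MSE and minimizing RMSE lead to the same coefficients.
    x_coef = X * coef
    preds = np.sum(x_coef, axis=1)
    rmse = mean_squared_error(y, preds)
    return rmse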

  def fit(self, X, y):
    partial_loss = partial(self._rmse, X=X, y=y)
    init_coef = np.random.dirichlet(np.ones(X.shape[1]))
    self.coef = fmin(partial_loss, init_coef, disp=True)

  def predict(self, X):
    x_coef = X * self.coef
    preds = np.sum(x_coef, axis = 1)
    return preds

def run_training(pred_df, fold):
  train_df = pred_df[pred_df["kfold"] != fold].reset_index(drop=True)
  valid_df = pred_df[pred_df["kfold"] == fold].reset_index(drop=True)

  xtrain = train_df[["rf_pred", "lgb_pred", "gbr_pred", "xgb_pred", "stack_pred"]].values
  xvalid = valid_df[["rf_pred", "lgb_pred", "gbr_pred", "xgb_pred", "stack_pred"]].values

  opt = OptimizeRMSE()
  opt.fit(xtrain, train_df.revenue_x.values)
  preds = opt.predict(xvalid)
  rmse = mean_squared_error(valid_df.revenue.values, preds, squared=False)
  print(f"{fold}, {rmse}")

  return opt.coef
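
To see what the optimizer is doing, here is a self-contained sketch that finds blend weights for three synthetic prediction columns with `scipy.optimize.fmin` (Nelder-Mead). The data and the helper `blend_rmse` are made up for illustration, and the search starts from equal weights rather than a random Dirichlet draw so that the run is repeatable.

```python
import numpy as np
from scipy.optimize import fmin

rng = np.random.default_rng(0)
y = rng.normal(size=200)
# Three synthetic "models": the target plus independent noise
X = np.column_stack([
    y + rng.normal(scale=1.0, size=200),
    y + rng.normal(scale=1.0, size=200),
    y + rng.normal(scale=2.0, size=200),
])

def blend_rmse(coef, X, y):
    """RMSE of the weighted sum of prediction columns."""
    return float(np.sqrt(np.mean((X @ coef - y) ** 2)))

init = np.full(X.shape[1], 1.0 / X.shape[1])  # start from equal weights
best = fmin(blend_rmse, init, args=(X, y), disp=False)

# RMSE of each column used on its own, for comparison
single = [blend_rmse(np.eye(3)[i], X, y) for i in range(3)]
print(best, blend_rmse(best, X, y), min(single))
```

The weights are unconstrained, so they can be negative or sum to something other than 1, which is also what the coefficients in this notebook show.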


In [None]:
coefs = []

for j in range(3):
  coefs.append(run_training(data, j))

coefs = np.array(coefs)
print(coefs)
coefs = np.mean(coefs, axis=0)
print(coefs)

0, 2515632.261202532
Optimization terminated successfully.
         Current function value: 5258556740053.269531
         Iterations: 419
         Function evaluations: 838
1, 2190830.9601212363
Optimization terminated successfully.
         Current function value: 5262135968197.158203
         Iterations: 424
         Function evaluations: 831
2, 2133262.866300568
[[ 0.38518311 -0.1670858   1.24019076 -0.21378988 -0.20078169]
 [ 0.43961042 -0.01057172  1.14305194 -0.1624409  -0.31959643]
 [ 0.32791327 -0.15903329  1.1989344  -0.18009787 -0.18110892]]
[ 0.3842356  -0.11223027  1.19405904 -0.18544288 -0.23382901]
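
Rather than retyping the averaged weights by hand, they can be applied directly with a matrix product, which keeps the column order and the numbers in sync. A sketch with made-up predictions (the `oof` frame is illustrative; the weights are the averages printed above):

```python
import numpy as np
import pandas as pd

pred_cols = ["rf_pred", "lgb_pred", "gbr_pred", "xgb_pred", "stack_pred"]

# Made-up out-of-fold predictions for two rows
oof = pd.DataFrame(
    [[4.0e6, 4.1e6, 3.9e6, 4.2e6, 4.0e6],
     [2.0e6, 2.2e6, 2.1e6, 1.9e6, 2.0e6]],
    columns=pred_cols,
)
weights = np.array([0.3842356, -0.11223027, 1.19405904, -0.18544288, -0.23382901])

blended = oof[pred_cols].to_numpy() @ weights  # one weighted sum per row
print(blended)
```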


In [None]:
preds = 0.38809062*data["rf_pred"] - 0.03215735*data["lgb_pred"] + 1.07991009*data["gbr_pred"] - 0.14640134*data["xgb_pred"] - 0.2009682*data["stack_pred"]

In [None]:
preds

0      3.988728e+06
1      4.286524e+06
2      4.075744e+05
3      4.577076e+06
4      3.068914e+06
           ...     
132    4.582815e+06
133    5.579772e+06
134    3.141127e+06
135    5.748977e+06
136    6.570532e+06
Length: 137, dtype: float64

In [None]:
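def run_training_test(xtrain,ytrain,test_df, model):
  # model is a regressor class; it is instantiated with default parameters
  reg = model()
  reg.fit(xtrain, ytrain)
  pred = reg.predict(test_df)
  # Use the class name (a string) as the column label instead of the class object
  ml_name = model.__name__

  finpred = pd.DataFrame({ml_name: pred})

  return finpred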

In [None]:
xtrain = train.loc[:, ~train.columns.isin(['Id', 'revenue','kfold'])].values
ytrain = train.revenue.values
test_df = test.loc[:, ~test.columns.isin(['Id', 'revenue','kfold'])].values

In [None]:
rf_finpred = run_training_test(xtrain,ytrain,test_df, RandomForestRegressor)

In [None]:
lgb_finpred = run_training_test(xtrain,ytrain,test_df, lgb.LGBMRegressor)

In [None]:
gbr_finpred = run_training_test(xtrain,ytrain,test_df, GradientBoostingRegressor)

In [None]:
xgb_finpred = run_training_test(xtrain,ytrain,test_df, xgb.XGBRegressor)



In [None]:
def run_training_stack(xtrain,ytrain,test_df):

  estimators = [
              ('gbr', GradientBoostingRegressor(max_depth=5,
          learning_rate=0.038347,
          n_estimators=3673,
          min_samples_split=3,
          min_samples_leaf=6,
          loss='huber',
          max_features='log2',
          criterion='friedman_mse')),
          ('xgb', xgb.XGBRegressor()),
          ('lgb', lgb.LGBMRegressor()),
          ('rf', RandomForestRegressor())
  ]

  reg = StackingRegressor(
      estimators = estimators, 
      final_estimator = RandomForestRegressor()
  )
  reg.fit(xtrain, ytrain)
  pred = reg.predict(test_df)
  
  ml_name = 'stack_pred'
  finpred = pd.DataFrame({ml_name: pred})

  return finpred

In [None]:
stack_finpred = run_training_stack(xtrain,ytrain,test_df)



In [None]:
preds = 0.8*lgb_finpred.iloc[:,0]  + 0.2*gbr_finpred.iloc[:,0]

In [None]:
preds

0        4.156105e+06
1        2.684037e+06
2        3.659415e+06
3        2.602642e+06
4        4.555648e+06
             ...     
99995    5.538774e+06
99996    2.021949e+06
99997    4.503523e+06
99998    4.574999e+06
99999    5.354395e+06
Length: 100000, dtype: float64

In [None]:
# test_df is a NumPy array by now, so take the ids saved earlier in test_id
output = pd.DataFrame({'Id': test_id.values, 'Prediction': preds})
output.to_csv('submission.csv', index = False)
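
One thing to watch when building the submission: the test rows keep their index from the combined frame (137 onwards), while the predictions are indexed from 0. Building a DataFrame from two Series aligns them on the index, so the ids are safer passed as a plain array. A toy illustration with made-up values:

```python
import pandas as pd

# Index carried over from the combined frame
ids = pd.Series([0, 1, 2], index=[137, 138, 139], name="Id")
# Predictions with the default index 0, 1, 2
preds = pd.Series([4.1e6, 2.6e6, 3.6e6])

misaligned = pd.DataFrame({"Id": ids, "Prediction": preds})
aligned = pd.DataFrame({"Id": ids.to_numpy(), "Prediction": preds})

print(misaligned.shape, aligned.shape)
```

In the misaligned frame every row is missing either its id or its prediction, because no index label appears in both Series.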

In [None]:
output.head()

Unnamed: 0,Id,Prediction
0,0,4156105.0
1,1,2684037.0
2,2,3659415.0
3,3,2602642.0
4,4,4555648.0
