# Recruit Restaurant Forecasting - Time series or Normal Regression?

### In this notebook I use the inferences from the EDA notebook, together with the discussions on the Kaggle competition page, to understand the problem and decide how to proceed in this competition

* I will build on some of the public kernels and improve on them where necessary. I will cite every source I used to create this notebook

* In this notebook I will address the main question: should this problem be treated as a time series problem or as a regression problem?

* The cross validation technique will also be discussed (Time Series CV or KFold CV ?)

* I will use H2O AutoML to create a benchmark stacked ensemble

* Finally the conclusions will be drawn

# 1) Importing the necessary modules

In [130]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn import model_selection
from xgboost import XGBRegressor
import lightgbm as lgb
import scipy.stats as st
import warnings
warnings.filterwarnings("ignore")

# 2) Data I/O

In [131]:
# air_visit_data: visitor data used as the main training set
train_data=pd.read_csv('air_visit_data.csv')

#air store info and hpg store informations
air_store_info=pd.read_csv('air_store_info.csv')
hpg_store_info=pd.read_csv('hpg_store_info.csv')

#reservation info from air_reserve and hpg reserve
air_reserve=pd.read_csv('air_reserve.csv')
hpg_reserve=pd.read_csv('hpg_reserve.csv')

#store id relation between hpg and air

store_id_relation=pd.read_csv('store_id_relation.csv')
date_info=pd.read_csv('date_info.csv').rename(columns={'calendar_date':'visit_date'})
test_data=pd.read_csv('sample_submission.csv')

train_size=train_data.shape[0]

# 3) Feature engineering

* The feature engineering here is based on the kernel from the1owl: https://www.kaggle.com/the1owl/surprise-me

* I have also extended it with some features of my own

### Exploring the temporal features in the dataset
#### Using the air reserve data file
1) Convert the datetime features and evaluate the reservation datetime difference from the visit datetime

In [132]:
air_reserve['visit_datetime'] = pd.to_datetime(air_reserve['visit_datetime'])
air_reserve['visit_datetime'] = air_reserve['visit_datetime'].dt.date
air_reserve['reserve_datetime'] = pd.to_datetime(air_reserve['reserve_datetime'])
air_reserve['reserve_datetime'] = air_reserve['reserve_datetime'].dt.date
air_reserve['reserve_datetime_diff'] = air_reserve.apply(lambda r: (r['visit_datetime'] - r['reserve_datetime']).days, axis=1)

In [133]:
air_reserve.head()

Unnamed: 0,air_store_id,visit_datetime,reserve_datetime,reserve_visitors,reserve_datetime_diff
0,air_877f79706adbfb06,2016-01-01,2016-01-01,1,0
1,air_db4b38ebe7a7ceff,2016-01-01,2016-01-01,3,0
2,air_db4b38ebe7a7ceff,2016-01-01,2016-01-01,6,0
3,air_877f79706adbfb06,2016-01-01,2016-01-01,2,0
4,air_db80363d35f10926,2016-01-01,2016-01-01,5,0
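
The row-wise `apply` above works, but it calls a Python function once per reservation. A minimal vectorized sketch of the same calendar-day difference is shown here on a made-up three-row frame (the toy timestamps are assumptions, not competition data). Normalizing to midnight first keeps the result a difference in calendar days, as in the cell above.

```python
import pandas as pd

# Toy reservations; the values are invented for illustration only.
reserve_toy = pd.DataFrame({
    'visit_datetime':   ['2016-01-03 19:00:00', '2016-01-01 12:00:00', '2016-01-02 01:00:00'],
    'reserve_datetime': ['2016-01-01 16:00:00', '2016-01-01 09:00:00', '2016-01-01 23:00:00'],
})

# Drop the time of day so the subtraction counts calendar days.
visit_day = pd.to_datetime(reserve_toy['visit_datetime']).dt.normalize()
reserve_day = pd.to_datetime(reserve_toy['reserve_datetime']).dt.normalize()

# Vectorized difference in whole days, no per-row lambda needed.
reserve_toy['reserve_datetime_diff'] = (visit_day - reserve_day).dt.days
```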


2) Sum and mean of the reserve visitors and reserve datetime difference

In [134]:
tmp1 = air_reserve.groupby(['air_store_id','visit_datetime'], as_index=False)[['reserve_datetime_diff', 'reserve_visitors']].sum().rename(columns={'visit_datetime':'visit_date', 'reserve_datetime_diff': 'rs1', 'reserve_visitors':'rv1'})
tmp2 = air_reserve.groupby(['air_store_id','visit_datetime'], as_index=False)[['reserve_datetime_diff', 'reserve_visitors']].mean().rename(columns={'visit_datetime':'visit_date', 'reserve_datetime_diff': 'rs2', 'reserve_visitors':'rv2'})
air_reserve = pd.merge(tmp1, tmp2, how='inner', on=['air_store_id','visit_date'])

In [135]:
air_reserve.head()
# rs1: sum of reserve datetime diff per air store id and visit date
# rv1: sum of reserve visitors per air store id and visit date
# rs2: mean of reserve datetime diff per air store id and visit date
# rv2: mean of reserve visitors per air store id and visit date

Unnamed: 0,air_store_id,visit_date,rs1,rv1,rs2,rv2
0,air_00a91d42b08b08d9,2016-10-31,0,2,0.0,2.0
1,air_00a91d42b08b08d9,2016-12-05,4,9,4.0,9.0
2,air_00a91d42b08b08d9,2016-12-14,6,18,6.0,18.0
3,air_00a91d42b08b08d9,2016-12-17,6,2,6.0,2.0
4,air_00a91d42b08b08d9,2016-12-20,2,4,2.0,4.0


#### Using the hpg reserve data file
1) Convert the datetime features and evaluate the reservation datetime difference from the visit datetime

2) Sum and mean of the reserve visitors and reserve datetime difference

In [136]:
hpg_reserve= pd.merge(hpg_reserve, store_id_relation, how='inner', on=['hpg_store_id'])

In [137]:
hpg_reserve['visit_datetime'] = pd.to_datetime(hpg_reserve['visit_datetime'])
hpg_reserve['visit_datetime'] = hpg_reserve['visit_datetime'].dt.date
hpg_reserve['reserve_datetime'] = pd.to_datetime(hpg_reserve['reserve_datetime'])
hpg_reserve['reserve_datetime'] = hpg_reserve['reserve_datetime'].dt.date
hpg_reserve['reserve_datetime_diff'] = hpg_reserve.apply(lambda r: (r['visit_datetime'] - r['reserve_datetime']).days, axis=1)
tmp1 = hpg_reserve.groupby(['air_store_id','visit_datetime'], as_index=False)[['reserve_datetime_diff', 'reserve_visitors']].sum().rename(columns={'visit_datetime':'visit_date', 'reserve_datetime_diff': 'rs1', 'reserve_visitors':'rv1'})
tmp2 = hpg_reserve.groupby(['air_store_id','visit_datetime'], as_index=False)[['reserve_datetime_diff', 'reserve_visitors']].mean().rename(columns={'visit_datetime':'visit_date', 'reserve_datetime_diff': 'rs2', 'reserve_visitors':'rv2'})
hpg_reserve = pd.merge(tmp1, tmp2, how='inner', on=['air_store_id','visit_date'])

#### Using the air visits data file -> train_data df
+ Extracting the temporal information from the air visits data. I have created some extra features below

In [138]:
train_data['visit_date'] = pd.to_datetime(train_data['visit_date'])
train_data['dow'] = train_data['visit_date'].dt.dayofweek
train_data['year'] = train_data['visit_date'].dt.year
train_data['month'] = train_data['visit_date'].dt.month
train_data['doy'] = train_data['visit_date'].dt.dayofyear
train_data['dim'] = train_data['visit_date'].dt.day
train_data['woy'] = train_data['visit_date'].dt.isocalendar().week.astype(int)  # ISO week of year
train_data['is_month_end'] = train_data['visit_date'].dt.is_month_end
train_data['visit_date'] = train_data['visit_date'].dt.date

In [139]:
train_data.head()

Unnamed: 0,air_store_id,visit_date,visitors,dow,year,month,doy,dim,woy,is_month_end
0,air_ba937bf13d40fb24,2016-01-13,25,2,2016,1,13,13,2,False
1,air_ba937bf13d40fb24,2016-01-14,32,3,2016,1,14,14,2,False
2,air_ba937bf13d40fb24,2016-01-15,29,4,2016,1,15,15,2,False
3,air_ba937bf13d40fb24,2016-01-16,22,5,2016,1,16,16,2,False
4,air_ba937bf13d40fb24,2016-01-18,6,0,2016,1,18,18,3,False
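
Since the same calendar features are needed for both the train and the test frame, they could be wrapped in one helper. This is only a sketch: the function name `add_date_features` is my own, and it assumes pandas 1.1 or newer for `dt.isocalendar()`.

```python
import pandas as pd

def add_date_features(df, date_col='visit_date'):
    """Return a copy of df with the calendar features used in this notebook."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out['dow'] = dates.dt.dayofweek            # Monday=0 ... Sunday=6
    out['year'] = dates.dt.year
    out['month'] = dates.dt.month
    out['doy'] = dates.dt.dayofyear
    out['dim'] = dates.dt.day
    out['woy'] = dates.dt.isocalendar().week.astype(int)  # ISO week number
    out['is_month_end'] = dates.dt.is_month_end
    out[date_col] = dates.dt.date              # keep plain dates for later merges
    return out

# Two dates taken from the outputs shown in this notebook.
demo = add_date_features(pd.DataFrame({'visit_date': ['2016-01-13', '2017-04-23']}))
```

The two demo dates reproduce the `dow`, `doy` and `woy` values shown in the train and test outputs of this notebook.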


#### Using the sample submission data file -> test_data df
* Extracting the temporal information from forecasted data. I have created some extra features below to match the train set

In [140]:
test_data['visit_date'] = test_data['id'].map(lambda x: str(x).split('_')[2])
test_data['air_store_id'] = test_data['id'].map(lambda x: '_'.join(x.split('_')[:2]))
test_data['visit_date'] = pd.to_datetime(test_data['visit_date'])
test_data['dow'] = test_data['visit_date'].dt.dayofweek
test_data['year'] = test_data['visit_date'].dt.year
test_data['month'] = test_data['visit_date'].dt.month
test_data['doy'] = test_data['visit_date'].dt.dayofyear
test_data['dim'] = test_data['visit_date'].dt.day
test_data['woy'] = test_data['visit_date'].dt.isocalendar().week.astype(int)  # ISO week of year
test_data['is_month_end'] = test_data['visit_date'].dt.is_month_end
test_data['visit_date'] = test_data['visit_date'].dt.date

In [141]:
test_data.head()

Unnamed: 0,id,visitors,visit_date,air_store_id,dow,year,month,doy,dim,woy,is_month_end
0,air_00a91d42b08b08d9_2017-04-23,0,2017-04-23,air_00a91d42b08b08d9,6,2017,4,113,23,16,False
1,air_00a91d42b08b08d9_2017-04-24,0,2017-04-24,air_00a91d42b08b08d9,0,2017,4,114,24,17,False
2,air_00a91d42b08b08d9_2017-04-25,0,2017-04-25,air_00a91d42b08b08d9,1,2017,4,115,25,17,False
3,air_00a91d42b08b08d9_2017-04-26,0,2017-04-26,air_00a91d42b08b08d9,2,2017,4,116,26,17,False
4,air_00a91d42b08b08d9_2017-04-27,0,2017-04-27,air_00a91d42b08b08d9,3,2017,4,117,27,17,False


**Collecting the list of unique store IDs**

In [142]:
unique_stores = test_data['air_store_id'].unique()
stores = pd.concat([pd.DataFrame({'air_store_id': unique_stores, 'dow': [i]*len(unique_stores)}) for i in range(7)], axis=0, ignore_index=True).reset_index(drop=True)

#### I used this code block from the kernel to create various statistics on the number of visitors per store and day of week

In [143]:
tmp = train_data.groupby(['air_store_id','dow'], as_index=False)['visitors'].min().rename(columns={'visitors':'min_visitors'})
stores = pd.merge(stores, tmp, how='left', on=['air_store_id','dow']) 
tmp = train_data.groupby(['air_store_id','dow'], as_index=False)['visitors'].mean().rename(columns={'visitors':'mean_visitors'})
stores = pd.merge(stores, tmp, how='left', on=['air_store_id','dow'])
tmp = train_data.groupby(['air_store_id','dow'], as_index=False)['visitors'].median().rename(columns={'visitors':'median_visitors'})
stores = pd.merge(stores, tmp, how='left', on=['air_store_id','dow'])
tmp = train_data.groupby(['air_store_id','dow'], as_index=False)['visitors'].max().rename(columns={'visitors':'max_visitors'})
stores = pd.merge(stores, tmp, how='left', on=['air_store_id','dow'])
tmp = train_data.groupby(['air_store_id','dow'], as_index=False)['visitors'].count().rename(columns={'visitors':'count_observations'})
stores = pd.merge(stores, tmp, how='left', on=['air_store_id','dow']) 

In [144]:
stores.head()

Unnamed: 0,air_store_id,dow,min_visitors,mean_visitors,median_visitors,max_visitors,count_observations
0,air_00a91d42b08b08d9,0,1.0,22.457143,19.0,47.0,35.0
1,air_0164b9927d20bcc3,0,2.0,7.5,6.0,19.0,20.0
2,air_0241aa3964b7f861,0,2.0,8.920635,8.0,23.0,63.0
3,air_0328696196e46f18,0,2.0,6.416667,4.0,27.0,12.0
4,air_034a3d5b40d5b1b1,0,1.0,11.864865,10.0,66.0,37.0
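
The store-by-weekday grid and the five statistics can also be built in one pass. Here is a hedged sketch on toy data (the frame `visits_toy` and its values are invented), using named aggregation, which assumes pandas 0.25 or newer:

```python
import pandas as pd

# Toy visits; store ids and counts are placeholders.
visits_toy = pd.DataFrame({
    'air_store_id': ['a', 'a', 'a', 'b'],
    'dow':          [0, 0, 1, 0],
    'visitors':     [10, 20, 5, 7],
})

# Every (store, day-of-week) combination, including ones never observed.
grid = pd.MultiIndex.from_product(
    [visits_toy['air_store_id'].unique(), range(7)],
    names=['air_store_id', 'dow'],
).to_frame(index=False)

# All five statistics from a single groupby.
stats = visits_toy.groupby(['air_store_id', 'dow'], as_index=False).agg(
    min_visitors=('visitors', 'min'),
    mean_visitors=('visitors', 'mean'),
    median_visitors=('visitors', 'median'),
    max_visitors=('visitors', 'max'),
    count_observations=('visitors', 'count'),
)

# Left merge keeps unobserved combinations as NaN, as in the cell above.
stores_toy = grid.merge(stats, how='left', on=['air_store_id', 'dow'])
```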


In [145]:
stores = pd.merge(stores, air_store_info, how='left', on=['air_store_id'])

**I added the dictionary below to reduce the number of genre categories, as there are too many genres in the dataset. 
I bucketed the genres into five main categories:**

1) Bar or Club 	
2) European 	
3) Japanese 	
4) Other 	
5) Asian (excluding Japanese)

In [146]:
genres = {
    'Japanese style':'Japanese',
    'International cuisine':'Other',
    'Grilled meat':'Asian',
    'Creation':'Japanese',
    'Italian':'European',
    'Seafood':'Other',
    'Spain Bar/Italian Bar':'European',
    'Japanese food in general':'Japanese',
    'Shabu-shabu/Sukiyaki':'Japanese',
    'Chinese general':'Asian',
    'Creative Japanese food':'Japanese',
    'Japanese cuisine/Kaiseki':'Japanese',
    'Korean cuisine':'Asian',
    'Okonomiyaki/Monja/Teppanyaki':'Japanese',
    'Karaoke':'Bar or Club',
    'Steak/Hamburger/Curry':'Other',
    'French':'European',
    'Cafe':'European',
    'Bistro':'Other',
    'Sushi':'Japanese',
    'Party':'Bar or Club',
    'Western food':'Other',
    'Pasta/Pizza':'Other',
    'Thai/Vietnamese food':'Asian',
    'Bar/Cocktail':'Bar or Club',
    'Amusement bar':'Bar or Club',
    'Cantonese food':'Asian',
    'Dim Sum/Dumplings':'Asian',
    'Sichuan food':'Asian',
    'Sweets':'Other',
    'Spain/Mediterranean cuisine':'European',
    'Udon/Soba':'Japanese',
    'Shanghai food':'Asian',
    'Taiwanese/Hong Kong cuisine':'Asian',
    'Japanese food':'Japanese',
    'Dining bar':'Bar or Club',
    'Izakaya':'Japanese',
    'Italian/French':'European',
    'Cafe/Sweets':'Other',
    'Yakiniku/Korean food':'Asian',
    'Other':'Other',
    'Creative cuisine':'Japanese',
    'Karaoke/Party':'Bar or Club',
    'Asian':'Asian',
    'None':'None',
    'No Data':'No Data'}

In [147]:
stores['air_genre_name']=stores['air_genre_name'].map(genres) # bucket the raw genres into the coarse categories

In [148]:
stores.head()

Unnamed: 0,air_store_id,dow,min_visitors,mean_visitors,median_visitors,max_visitors,count_observations,air_genre_name,air_area_name,latitude,longitude
0,air_00a91d42b08b08d9,0,1.0,22.457143,19.0,47.0,35.0,European,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595
1,air_0164b9927d20bcc3,0,2.0,7.5,6.0,19.0,20.0,European,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599
2,air_0241aa3964b7f861,0,2.0,8.920635,8.0,23.0,63.0,Japanese,Tōkyō-to Taitō-ku Higashiueno,35.712607,139.779996
3,air_0328696196e46f18,0,2.0,6.416667,4.0,27.0,12.0,Bar or Club,Ōsaka-fu Ōsaka-shi Nakanochō,34.701279,135.52809
4,air_034a3d5b40d5b1b1,0,1.0,11.864865,10.0,66.0,37.0,Other,Ōsaka-fu Ōsaka-shi Ōhiraki,34.692337,135.472229
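
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a genre missing from `genres` would silently become a missing value. Below is a small sketch of how one might check for that before filling. The two-entry mapping and the sample series are made up for illustration.

```python
import pandas as pd

# Deliberately incomplete toy mapping (not the full `genres` dictionary).
genre_map_toy = {'Izakaya': 'Japanese', 'Italian/French': 'European'}
raw_genres = pd.Series(['Izakaya', 'Italian/French', 'Yakiniku/Korean food', 'Izakaya'])

mapped = raw_genres.map(genre_map_toy)

# Raw genre names that the mapping did not cover.
unmapped = sorted(raw_genres[mapped.isna()].unique())

# Fallback bucket for anything unmapped.
bucketed = mapped.fillna('Other')
```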


**Tackling the air area name:** From my exploration, each air area name is made up of three nested local areas.
For example, **Tōkyō-to Chiyoda-ku Kudanminami** breaks down as:
* local area 1 -> Tōkyō-to 

* local area 2 -> Chiyoda-ku 

* local area 3 -> Kudanminami

This was inferred from the maps in the EDA notebook. I will explore these local areas further in my next notebook.

For now I will just stick to splitting the area name and label encoding the parts.

In [149]:
stores['air_area_name'] = stores['air_area_name'].map(lambda x: str(str(x).replace('-',' ')))
lbl = LabelEncoder()
for i in range(10):
    stores['air_area_name'+str(i)] = lbl.fit_transform(stores['air_area_name'].map(lambda x: str(str(x).split(' ')[i]) if len(str(x).split(' '))>i else ''))
stores['air_area_name'] = lbl.fit_transform(stores['air_area_name'])
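
As an alternative view of the three-level structure described above, the name can be split on spaces only, which keeps suffixes such as `-to` and `-ku` attached. This is a sketch under assumptions: the column names `area_level1..3` are my own, it differs from the cell above (which also splits on `-`), and it uses pandas category codes as a stand-in for `LabelEncoder`.

```python
import pandas as pd

# Area names taken from the stores output shown earlier.
areas = pd.Series([
    'Tōkyō-to Chiyoda-ku Kudanminami',
    'Ōsaka-fu Ōsaka-shi Ōhiraki',
    'Tōkyō-to Minato-ku Shibakōen',
])

# Split into at most three parts: prefecture, ward/city, neighbourhood.
parts = areas.str.split(' ', n=2, expand=True)
parts.columns = ['area_level1', 'area_level2', 'area_level3']

# Integer codes per level, assigned in sorted order of the distinct values.
encoded = pd.DataFrame({c: parts[c].astype('category').cat.codes for c in parts.columns})
```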

In [150]:
date_info['visit_date'] = pd.to_datetime(date_info['visit_date'])
date_info['day_of_week'] = lbl.fit_transform(date_info['day_of_week'])
date_info['visit_date'] = date_info['visit_date'].dt.date

**Merging different dataframes to form the train and test set**

In [151]:
train = pd.merge(train_data, date_info, how='left', on=['visit_date']) 
test = pd.merge(test_data, date_info, how='left', on=['visit_date']) 

In [152]:
train = pd.merge(train, stores, how='left', on=['air_store_id','dow']) 
test = pd.merge(test, stores, how='left', on=['air_store_id','dow'])

In [153]:
train = pd.merge(train, air_reserve, how='left', on=['air_store_id','visit_date']) 
test = pd.merge(test, air_reserve, how='left', on=['air_store_id','visit_date'])

In [154]:
train = pd.merge(train, hpg_reserve, how='left', on=['air_store_id','visit_date']) 
test = pd.merge(test, hpg_reserve, how='left', on=['air_store_id','visit_date'])

In [155]:
train.head()

Unnamed: 0,air_store_id,visit_date,visitors,dow,year,month,doy,dim,woy,is_month_end,day_of_week,holiday_flg,min_visitors,mean_visitors,median_visitors,max_visitors,count_observations,air_genre_name,air_area_name,latitude,longitude,air_area_name0,air_area_name1,air_area_name2,air_area_name3,air_area_name4,air_area_name5,air_area_name6,air_area_name7,air_area_name8,air_area_name9,rs1_x,rv1_x,rs2_x,rv2_x,rs1_y,rv1_y,rs2_y,rv2_y
0,air_ba937bf13d40fb24,2016-01-13,25,2,2016,1,13,13,2,False,6,0,7.0,23.84375,25.0,57.0,64.0,Bar or Club,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,
1,air_ba937bf13d40fb24,2016-01-14,32,3,2016,1,14,14,2,False,4,0,2.0,20.292308,21.0,54.0,65.0,Bar or Club,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,
2,air_ba937bf13d40fb24,2016-01-15,29,4,2016,1,15,15,2,False,0,0,4.0,34.738462,35.0,61.0,65.0,Bar or Club,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,
3,air_ba937bf13d40fb24,2016-01-16,22,5,2016,1,16,16,2,False,2,0,6.0,27.651515,27.0,53.0,66.0,Bar or Club,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,
4,air_ba937bf13d40fb24,2016-01-18,6,0,2016,1,18,18,3,False,1,0,2.0,13.754386,12.0,34.0,57.0,Bar or Club,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,


In [156]:
test.head()

Unnamed: 0,id,visitors,visit_date,air_store_id,dow,year,month,doy,dim,woy,is_month_end,day_of_week,holiday_flg,min_visitors,mean_visitors,median_visitors,max_visitors,count_observations,air_genre_name,air_area_name,latitude,longitude,air_area_name0,air_area_name1,air_area_name2,air_area_name3,air_area_name4,air_area_name5,air_area_name6,air_area_name7,air_area_name8,air_area_name9,rs1_x,rv1_x,rs2_x,rv2_x,rs1_y,rv1_y,rs2_y,rv2_y
0,air_00a91d42b08b08d9_2017-04-23,0,2017-04-23,air_00a91d42b08b08d9,6,2017,4,113,23,16,False,3,0,2.0,2.0,2.0,2.0,1.0,European,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,,,,,,,,
1,air_00a91d42b08b08d9_2017-04-24,0,2017-04-24,air_00a91d42b08b08d9,0,2017,4,114,24,17,False,1,0,1.0,22.457143,19.0,47.0,35.0,European,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,,,,,,,,
2,air_00a91d42b08b08d9_2017-04-25,0,2017-04-25,air_00a91d42b08b08d9,1,2017,4,115,25,17,False,5,0,1.0,24.35,24.5,43.0,40.0,European,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,,,,,,,,
3,air_00a91d42b08b08d9_2017-04-26,0,2017-04-26,air_00a91d42b08b08d9,2,2017,4,116,26,17,False,6,0,15.0,28.125,28.0,52.0,40.0,European,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,,,,,,,,
4,air_00a91d42b08b08d9_2017-04-27,0,2017-04-27,air_00a91d42b08b08d9,3,2017,4,117,27,17,False,4,0,15.0,29.868421,30.0,47.0,38.0,European,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,,,,,,,,
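
The `_x` and `_y` suffixes in the columns above come from pandas' default handling of overlapping column names: `_x` marks the left frame (here the air reservations, merged first) and `_y` the right frame (the hpg reservations). Here is a sketch on toy frames of passing explicit `suffixes` so the origin stays readable. The `_air`/`_hpg` names are my own choice.

```python
import pandas as pd

keys = ['air_store_id', 'visit_date']
base = pd.DataFrame({'air_store_id': ['a', 'b'], 'visit_date': ['2017-01-01', '2017-01-01']})
air_toy = pd.DataFrame({'air_store_id': ['a'], 'visit_date': ['2017-01-01'], 'rv1': [3]})
hpg_toy = pd.DataFrame({'air_store_id': ['a'], 'visit_date': ['2017-01-01'], 'rv1': [5]})

merged = (
    base.merge(air_toy, how='left', on=keys)
        # 'rv1' now exists on both sides, so the suffixes are applied here.
        .merge(hpg_toy, how='left', on=keys, suffixes=('_air', '_hpg'))
)
```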


In [157]:
train['id'] = train.apply(lambda r: '_'.join([str(r['air_store_id']), str(r['visit_date'])]), axis=1)

**Some more feature engineering to create insights about reservations**

In [158]:
train['total_reserv_sum'] = train['rv1_x'] + train['rv1_y']
train['total_reserv_mean'] = (train['rv2_x'] + train['rv2_y']) / 2
train['total_reserv_dt_diff_mean'] = (train['rs2_x'] + train['rs2_y']) / 2

test['total_reserv_sum'] = test['rv1_x'] + test['rv1_y']
test['total_reserv_mean'] = (test['rv2_x'] + test['rv2_y']) / 2
test['total_reserv_dt_diff_mean'] = (test['rs2_x'] + test['rs2_y']) /2
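
One thing to be aware of with the `+` above: the result is `NaN` whenever either source is missing, even if the other one has reservations. A hedged sketch of the difference, using a toy frame with the same column names (the values are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'rv1_x': [3.0, np.nan, np.nan],
                    'rv1_y': [5.0, 2.0, np.nan]})

# Plain addition: NaN as soon as one side is missing.
strict_sum = toy['rv1_x'] + toy['rv1_y']

# Row-wise sum that only needs one non-missing value to produce a number.
lenient_sum = toy[['rv1_x', 'rv1_y']].sum(axis=1, min_count=1)
```

Which behaviour is preferable depends on whether a missing source should count as zero reservations. I have not tested its effect on the score here.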

In [159]:
train['date_int'] = train['visit_date'].apply(lambda x: x.strftime('%Y%m%d')).astype(int)
test['date_int'] = test['visit_date'].apply(lambda x: x.strftime('%Y%m%d')).astype(int)
train['var_max_lat'] = train['latitude'].max() - train['latitude']
train['var_max_long'] = train['longitude'].max() - train['longitude']
test['var_max_lat'] = test['latitude'].max() - test['latitude']
test['var_max_long'] = test['longitude'].max() - test['longitude']

**The feature `lon_plus_lat` just gives us combined information; I would still keep longitude and latitude as separate features. I will use this in the next notebook.**

In [160]:
# NEW FEATURES FROM Georgii Vyshnia
train['lon_plus_lat'] = train['longitude'] + train['latitude'] 
test['lon_plus_lat'] = test['longitude'] + test['latitude']

In [161]:
lbl = LabelEncoder()
train['air_store_id2'] = lbl.fit_transform(train['air_store_id'])
test['air_store_id2'] = lbl.transform(test['air_store_id'])

In [162]:
train['air_genre_name']=train['air_genre_name'].fillna('Other')

In [163]:
test['air_genre_name']=test['air_genre_name'].fillna('Other')

**Observation:** After bucketing the genres into 5 categories, I still found missing values in the train and test sets, so I put them in the 'Other' category for simplicity.

In [164]:
train=pd.get_dummies(train,columns=['air_genre_name'])

In [165]:
test=pd.get_dummies(test,columns=['air_genre_name'])
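
Calling `get_dummies` separately on train and test only yields the same columns if both frames contain the same categories. Here is a small sketch of aligning them explicitly, on made-up frames with a single column `g`:

```python
import pandas as pd

train_toy = pd.get_dummies(pd.DataFrame({'g': ['A', 'B']}), columns=['g'])
test_toy = pd.get_dummies(pd.DataFrame({'g': ['A', 'A']}), columns=['g'])  # 'B' never appears

# Give the test frame exactly the training columns; absent categories become 0.
test_toy = test_toy.reindex(columns=train_toy.columns, fill_value=0)
```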

Selecting the feature columns: everything except the string/identifier columns and the target

In [166]:
col = [c for c in train if c not in ['id', 'air_store_id', 'visit_date','visitors']]

In [167]:
train = train.fillna(-999)
test = test.fillna(-999)

**The train and test sets are ready for analysis now**

**Some Remarks:**

* From the ARIMA notebook I observed that the time series have a lot of gaps when viewed per individual restaurant

* New restaurants were also added to the air visits database in June 2016

* So the air restaurants do not all cover the same time period. This is inconvenient because we do not have enough training data for some restaurants

**Therefore, instead of treating this as a time series problem, I will treat it as a regression problem**
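
The gaps and late starts mentioned in the remarks can be quantified per restaurant. Here is a minimal sketch on made-up visits (store ids and dates are placeholders): for each store, compare the number of observed days with the length of its active period.

```python
import pandas as pd

visits_toy = pd.DataFrame({
    'air_store_id': ['a', 'a', 'a', 'b', 'b'],
    'visit_date': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-05',
                                  '2016-06-01', '2016-06-02']),
})

# First day, last day and number of observed days per store.
span = visits_toy.groupby('air_store_id')['visit_date'].agg(
    first_day='min', last_day='max', n_obs='count')

# Length of the active period in calendar days, and the share actually observed.
span['n_days'] = (span['last_day'] - span['first_day']).dt.days + 1
span['coverage'] = span['n_obs'] / span['n_days']
```

A low `coverage` points to gaps, and a late `first_day` points to a restaurant that joined the database later.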

In [168]:
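def RMSLE(y, pred):
    # Both arguments are expected on the log1p scale, so this RMSE equals the RMSLE on raw visitor counts
    return metrics.mean_squared_error(y, pred)**0.5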
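
For reference, here is a sketch of the same metric written out for raw (untransformed) visitor counts. The function name `rmsle_from_counts` is my own; it applies `log1p` to both inputs and then takes the root mean squared difference.

```python
import numpy as np

def rmsle_from_counts(y_true, y_pred):
    """Root mean squared logarithmic error on raw, non-negative counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

perfect = rmsle_from_counts([0, 9, 25], [0, 9, 25])   # identical inputs
one_log_unit = rmsle_from_counts([np.expm1(1.0)], [0.0])  # exactly one log1p unit apart
```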

In [169]:
pd.set_option('display.max_columns',None)

In [170]:
train.head()

Unnamed: 0,air_store_id,visit_date,visitors,dow,year,month,doy,dim,woy,is_month_end,day_of_week,holiday_flg,min_visitors,mean_visitors,median_visitors,max_visitors,count_observations,air_area_name,latitude,longitude,air_area_name0,air_area_name1,air_area_name2,air_area_name3,air_area_name4,air_area_name5,air_area_name6,air_area_name7,air_area_name8,air_area_name9,rs1_x,rv1_x,rs2_x,rv2_x,rs1_y,rv1_y,rs2_y,rv2_y,id,total_reserv_sum,total_reserv_mean,total_reserv_dt_diff_mean,date_int,var_max_lat,var_max_long,lon_plus_lat,air_store_id2,air_genre_name_Asian,air_genre_name_Bar or Club,air_genre_name_European,air_genre_name_Japanese,air_genre_name_Other
0,air_ba937bf13d40fb24,2016-01-13,25,2,2016,1,13,13,2,False,6,0,7.0,23.84375,25.0,57.0,64.0,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_ba937bf13d40fb24_2016-01-13,-999.0,-999.0,-999.0,20160113,8.362564,4.521799,175.409667,603,0,1,0,0,0
1,air_ba937bf13d40fb24,2016-01-14,32,3,2016,1,14,14,2,False,4,0,2.0,20.292308,21.0,54.0,65.0,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_ba937bf13d40fb24_2016-01-14,-999.0,-999.0,-999.0,20160114,8.362564,4.521799,175.409667,603,0,1,0,0,0
2,air_ba937bf13d40fb24,2016-01-15,29,4,2016,1,15,15,2,False,0,0,4.0,34.738462,35.0,61.0,65.0,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_ba937bf13d40fb24_2016-01-15,-999.0,-999.0,-999.0,20160115,8.362564,4.521799,175.409667,603,0,1,0,0,0
3,air_ba937bf13d40fb24,2016-01-16,22,5,2016,1,16,16,2,False,2,0,6.0,27.651515,27.0,53.0,66.0,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_ba937bf13d40fb24_2016-01-16,-999.0,-999.0,-999.0,20160116,8.362564,4.521799,175.409667,603,0,1,0,0,0
4,air_ba937bf13d40fb24,2016-01-18,6,0,2016,1,18,18,3,False,1,0,2.0,13.754386,12.0,34.0,57.0,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_ba937bf13d40fb24_2016-01-18,-999.0,-999.0,-999.0,20160118,8.362564,4.521799,175.409667,603,0,1,0,0,0


In [171]:
test.head()

Unnamed: 0,id,visitors,visit_date,air_store_id,dow,year,month,doy,dim,woy,is_month_end,day_of_week,holiday_flg,min_visitors,mean_visitors,median_visitors,max_visitors,count_observations,air_area_name,latitude,longitude,air_area_name0,air_area_name1,air_area_name2,air_area_name3,air_area_name4,air_area_name5,air_area_name6,air_area_name7,air_area_name8,air_area_name9,rs1_x,rv1_x,rs2_x,rv2_x,rs1_y,rv1_y,rs2_y,rv2_y,total_reserv_sum,total_reserv_mean,total_reserv_dt_diff_mean,date_int,var_max_lat,var_max_long,lon_plus_lat,air_store_id2,air_genre_name_Asian,air_genre_name_Bar or Club,air_genre_name_European,air_genre_name_Japanese,air_genre_name_Other
0,air_00a91d42b08b08d9_2017-04-23,0,2017-04-23,air_00a91d42b08b08d9,6,2017,4,113,23,16,False,3,0,2.0,2.0,2.0,2.0,1.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170423,8.326629,4.519803,175.447598,0,0,0,1,0,0
1,air_00a91d42b08b08d9_2017-04-24,0,2017-04-24,air_00a91d42b08b08d9,0,2017,4,114,24,17,False,1,0,1.0,22.457143,19.0,47.0,35.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170424,8.326629,4.519803,175.447598,0,0,0,1,0,0
2,air_00a91d42b08b08d9_2017-04-25,0,2017-04-25,air_00a91d42b08b08d9,1,2017,4,115,25,17,False,5,0,1.0,24.35,24.5,43.0,40.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170425,8.326629,4.519803,175.447598,0,0,0,1,0,0
3,air_00a91d42b08b08d9_2017-04-26,0,2017-04-26,air_00a91d42b08b08d9,2,2017,4,116,26,17,False,6,0,15.0,28.125,28.0,52.0,40.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170426,8.326629,4.519803,175.447598,0,0,0,1,0,0
4,air_00a91d42b08b08d9_2017-04-27,0,2017-04-27,air_00a91d42b08b08d9,3,2017,4,117,27,17,False,4,0,15.0,29.868421,30.0,47.0,38.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170427,8.326629,4.519803,175.447598,0,0,0,1,0,0


**First, training naive models without CV or randomized hyperparameter search**

Training: 

1) GBM regressor
2) KNN regressor
3) Random Forest regressor
4) LightGBM Regressor
5) XGB regressor

In [44]:
model_stack_no_cv=train[['air_store_id','visit_date']].copy()  # copy so prediction columns can be added safely

In [45]:
model1 = GradientBoostingRegressor(learning_rate=0.2, random_state=3, n_estimators=200, subsample=0.8,max_depth =10)
model2 = KNeighborsRegressor(n_jobs=-1, n_neighbors=4)
model3 = XGBRegressor(learning_rate=0.2, random_state=3, n_estimators=280, subsample=0.8,colsample_bytree=0.8, max_depth =12)
model4 = RandomForestRegressor(random_state=2,n_estimators=300,max_depth=8,n_jobs=-1)
model1.fit(train[col], np.log1p(train['visitors'].values))
model2.fit(train[col], np.log1p(train['visitors'].values))
model3.fit(train[col], np.log1p(train['visitors'].values))
model4.fit(train[col], np.log1p(train['visitors'].values))
preds1 = model1.predict(train[col])
preds2 = model2.predict(train[col])
preds3 = model3.predict(train[col])
preds4 = model4.predict(train[col])

In [46]:
print('RMSE GradientBoostingRegressor: ', RMSLE(np.log1p(train['visitors'].values), preds1))
print('RMSE KNeighborsRegressor: ', RMSLE(np.log1p(train['visitors'].values), preds2))
print('RMSE XGBRegressor: ', RMSLE(np.log1p(train['visitors'].values), preds3))
print('RMSE RandomForestRegressor: ', RMSLE(np.log1p(train['visitors'].values), preds4))

RMSE GradientBoostingRegressor:  0.33919258425181953
RMSE KNeighborsRegressor:  0.42655751350463705
RMSE XGBRegressor:  0.20241664418844002
RMSE RandomForestRegressor:  0.5047672877663344
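
Since the test period lies after the training period, a quick extra check is to hold out the most recent days and score the models on those instead. This is only a sketch: the helper name `time_holdout`, the toy frame and the 3-day window are illustrative assumptions.

```python
import pandas as pd

def time_holdout(df, date_col, n_valid_days):
    """Split df into (earlier rows, last n_valid_days of rows) by date."""
    dates = pd.to_datetime(df[date_col])
    cutoff = dates.max() - pd.Timedelta(days=n_valid_days)
    return df[dates <= cutoff], df[dates > cutoff]

# Ten consecutive days of made-up data.
toy = pd.DataFrame({
    'visit_date': pd.date_range('2017-01-01', periods=10).date,
    'visitors': range(10),
})
fit_part, valid_part = time_holdout(toy, 'visit_date', n_valid_days=3)
```

The models would then be fitted on `fit_part` and scored on `valid_part` with the same `RMSLE` function.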


In [47]:
lgbm=lgb.LGBMRegressor(boosting_type='gbdt', colsample_bytree=1.0, learning_rate=0.2,max_bin=255, max_depth=-1, min_child_samples=20,min_child_weight=0.001, min_split_gain=0.0, n_estimators=1000,n_jobs=-1, num_leaves=40, objective='regression', random_state=0)
lgbm.fit(train[col],np.log1p(train['visitors'].values))

LGBMRegressor(boosting_type='gbdt', colsample_bytree=1.0, learning_rate=0.2,
       max_bin=255, max_depth=-1, min_child_samples=20,
       min_child_weight=0.001, min_split_gain=0.0, n_estimators=1000,
       n_jobs=-1, num_leaves=40, objective='regression', random_state=0,
       reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
       subsample_for_bin=200000, subsample_freq=1)

In [48]:
preds_lgb=lgbm.predict(train[col])
print('RMSE LGBMRegressor: ', RMSLE(np.log1p(train['visitors'].values), preds_lgb))

RMSE LGBMRegressor:  0.41274091788730466


In [49]:
model_stack_no_cv['pred_gb']=preds1
model_stack_no_cv['pred_knn']=preds2
model_stack_no_cv['pred_xgb']=preds3
model_stack_no_cv['pred_rf']=preds4
model_stack_no_cv['pred_lgb']=preds_lgb

In [56]:
model_stack_no_cv.to_csv('stacking_input/stack_no_cv_pub_mine.csv',index=False)

In [50]:
preds1 = model1.predict(test[col])
preds2 = model2.predict(test[col])
preds3 = model3.predict(test[col])
preds4 = model4.predict(test[col])
preds5 = lgbm.predict(test[col])

**Observations:**

* These scores are computed on the training data itself, so they measure fit rather than generalization. XGBRegressor has the lowest training RMSE, so I will submit it as a single model to check the LB score and report it with a picture

* Next I will take a weighted average of the predictions (weighting based on the RMSLE score from the website)

* Apart from XGBRegressor, I will just report the individual model scores here:


In [51]:
# Single XGB Model
test['visitors'] = preds3
test['visitors'] = np.expm1(test['visitors']).clip(lower=0.)
sub1 = test[['id','visitors']].copy()
sub1.to_csv('sub_single_xgb_nocv.csv',index=False)

**Single XGB Model**: Public Leaderboard vs Private Leaderboard 
![Public LB](submissions/xgb_no_cv.PNG)

**Observations:** A single XGB model with no cross validation or random search did not give a very good score. Maybe an averaged model could do better

In [52]:
# Averaged predictions
test['visitors'] = 0.2*preds1+0.2*preds2+0.3*preds3+0.1*preds4+0.2*preds5
test['visitors'] = np.expm1(test['visitors']).clip(lower=0.)
sub1 = test[['id','visitors']].copy()

In [53]:
sub1.to_csv('sub_average_5models_nocv.csv',index=False)
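
The averaging above happens on the log1p scale and is converted back to visitor counts afterwards. Here is a hedged sketch of that step as a reusable function. The name `blend_log_predictions` is my own, and unlike the cell above it also normalizes the weights so they always sum to 1.

```python
import numpy as np

def blend_log_predictions(log_preds, weights):
    """Weighted average of log1p-scale predictions, returned as counts >= 0."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize so the weights sum to 1
    stacked = np.column_stack(log_preds)  # shape: (n_rows, n_models)
    return np.clip(np.expm1(stacked @ w), 0.0, None)

# Two toy "models" that both predict 9 visitors on the log1p scale.
blended = blend_log_predictions([np.log1p([9.0]), np.log1p([9.0])], weights=[0.3, 0.7])
```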

**Averaged model**: Public Leaderboard vs Private Leaderboard 
![Public LB](submissions/5models_avg_nocv.PNG)

**Observations:** The 5 models averaged with no CV or random search did OK on the public score, but the private score is still bad. I think cross validation could fix this

In [54]:
chk=pd.read_csv('sub_average_5models_nocv.csv')
chk.head(10)

Unnamed: 0,id,visitors
0,air_00a91d42b08b08d9_2017-04-23,1.935118
1,air_00a91d42b08b08d9_2017-04-24,22.742191
2,air_00a91d42b08b08d9_2017-04-25,25.257026
3,air_00a91d42b08b08d9_2017-04-26,28.765542
4,air_00a91d42b08b08d9_2017-04-27,33.028357
5,air_00a91d42b08b08d9_2017-04-28,37.363631
6,air_00a91d42b08b08d9_2017-04-29,9.344577
7,air_00a91d42b08b08d9_2017-04-30,1.854709
8,air_00a91d42b08b08d9_2017-05-01,20.271127
9,air_00a91d42b08b08d9_2017-05-02,22.612261


## Cross Validation: TimeSeries CV or K-Fold ?
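
To make the question in the heading concrete, here is a small sketch of how the two splitters differ. It assumes the rows are sorted by date (index 0 is the oldest); the 12-row array is a placeholder. With plain `KFold`, some folds train on rows that come after the validation rows, while `TimeSeriesSplit` never does.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 rows, assumed to be in chronological order

# True when a fold trains on data that is later than its earliest validation row.
kfold_uses_future = [bool(tr.max() > va.min()) for tr, va in KFold(n_splits=3).split(X)]
tscv_uses_future = [bool(tr.max() > va.min()) for tr, va in TimeSeriesSplit(n_splits=3).split(X)]
```

Whether training on "future" rows matters depends on how strongly the features encode time, which is what the rest of this section tries to find out.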

In [57]:
import time
from sklearn.model_selection import RandomizedSearchCV

In [58]:
train.head()

Unnamed: 0,air_store_id,visit_date,visitors,dow,year,month,doy,dim,woy,is_month_end,day_of_week,holiday_flg,min_visitors,mean_visitors,median_visitors,max_visitors,count_observations,air_area_name,latitude,longitude,air_area_name0,air_area_name1,air_area_name2,air_area_name3,air_area_name4,air_area_name5,air_area_name6,air_area_name7,air_area_name8,air_area_name9,rs1_x,rv1_x,rs2_x,rv2_x,rs1_y,rv1_y,rs2_y,rv2_y,id,total_reserv_sum,total_reserv_mean,total_reserv_dt_diff_mean,date_int,var_max_lat,var_max_long,lon_plus_lat,air_store_id2,air_genre_name_Asian,air_genre_name_Bar or Club,air_genre_name_European,air_genre_name_Japanese,air_genre_name_Other
0,air_ba937bf13d40fb24,2016-01-13,25,2,2016,1,13,13,2,False,6,0,7.0,23.84375,25.0,57.0,64.0,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_ba937bf13d40fb24_2016-01-13,-999.0,-999.0,-999.0,20160113,8.362564,4.521799,175.409667,603,0,1,0,0,0
1,air_ba937bf13d40fb24,2016-01-14,32,3,2016,1,14,14,2,False,4,0,2.0,20.292308,21.0,54.0,65.0,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_ba937bf13d40fb24_2016-01-14,-999.0,-999.0,-999.0,20160114,8.362564,4.521799,175.409667,603,0,1,0,0,0
2,air_ba937bf13d40fb24,2016-01-15,29,4,2016,1,15,15,2,False,0,0,4.0,34.738462,35.0,61.0,65.0,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_ba937bf13d40fb24_2016-01-15,-999.0,-999.0,-999.0,20160115,8.362564,4.521799,175.409667,603,0,1,0,0,0
3,air_ba937bf13d40fb24,2016-01-16,22,5,2016,1,16,16,2,False,2,0,6.0,27.651515,27.0,53.0,66.0,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_ba937bf13d40fb24_2016-01-16,-999.0,-999.0,-999.0,20160116,8.362564,4.521799,175.409667,603,0,1,0,0,0
4,air_ba937bf13d40fb24,2016-01-18,6,0,2016,1,18,18,3,False,1,0,2.0,13.754386,12.0,34.0,57.0,62.0,35.658068,139.751599,7.0,6.0,26.0,6.0,78.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_ba937bf13d40fb24_2016-01-18,-999.0,-999.0,-999.0,20160118,8.362564,4.521799,175.409667,603,0,1,0,0,0


In [59]:
train_orig=train.copy()
test_orig=test.copy()

### K-Fold CV with RandomSearch

#### Model1: KFold+RS for GBM

**Observation:** GBM is slow to train, so the randomized search is limited to a handful of iterations (`n_iter=5` in the cell below).
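
The search spaces in the cells below are built from `scipy.stats` distributions, and their parameterisation is easy to misread. A small sketch, with an arbitrary sample size and seed chosen only for illustration: `st.uniform(loc, scale)` samples from `[loc, loc + scale]`, and `st.randint(low, high)` excludes `high`.

```python
import scipy.stats as st

# st.uniform(loc, scale) covers [loc, loc + scale], not [loc, scale]
lr_dist = st.uniform(0.1, 0.25)
lr_samples = lr_dist.rvs(size=1000, random_state=0)

# st.randint(low, high) draws integers from low up to high - 1
depth_dist = st.randint(8, 12)
depth_samples = depth_dist.rvs(size=1000, random_state=0)

print(lr_samples.min(), lr_samples.max())     # both inside [0.1, 0.35]
print(sorted(set(depth_samples.tolist())))    # a subset of [8, 9, 10, 11]
```

So `'learning_rate': st.uniform(0.1, 0.25)` can propose learning rates up to 0.35, and `'max_depth': st.randint(8, 12)` never proposes 12.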

In [60]:
start=time.time()
model1 = GradientBoostingRegressor(random_state=3,verbose=True)
kf=model_selection.KFold(n_splits=5) 
params_dist_gb = {'n_estimators': st.randint(100, 300),'max_depth':st.randint(8, 12) ,'min_samples_leaf': [2,4],'learning_rate':st.uniform(0.1,0.25)}
random_search_gb = RandomizedSearchCV(model1, param_distributions=params_dist_gb,n_iter=5,cv=kf,n_jobs=-1)
gb_g=random_search_gb.fit(train[col],np.log1p(train['visitors'].values))
gb_best=gb_g.best_estimator_
gb_pred_best=gb_best.predict(test[col])
gb_pred_best=np.expm1(gb_pred_best)#predictions
print(gb_best)
print('Time elapsed for the GBM with CV: ',time.time()-start)

(Verbose GBM training log: iteration, train loss and remaining time printed by the parallel workers; the output is interleaved and truncated.)

In [61]:
print('RMSE GBMRegressor: ', RMSLE(np.log1p(train['visitors'].values), gb_best.predict(train[col])))

RMSE GBMRegressor:  0.44125854239531354
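
The number above is a root mean squared error between `log1p(visitors)` and predictions that are themselves on the log1p scale, which is the same as an RMSLE on raw visitor counts. It is measured on the training rows, so it is a fit score rather than a validation score. A minimal sketch of the equivalence, with hypothetical helper names and made-up predictions:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error on whatever scale the inputs are given."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def rmsle_on_counts(visitors_true, visitors_pred):
    """RMSLE computed from raw visitor counts (hypothetical helper)."""
    return rmse(np.log1p(visitors_true), np.log1p(visitors_pred))

visitors_true = np.array([25, 32, 29, 22, 6])
visitors_pred = np.array([20.0, 30.0, 35.0, 25.0, 10.0])   # made-up predictions

log_true = np.log1p(visitors_true)
log_pred = np.log1p(visitors_pred)    # what a model trained on log1p targets returns

print(rmse(log_true, log_pred))
print(rmsle_on_counts(visitors_true, visitors_pred))
```

Both lines print the same value, which is why a helper that only computes RMSE can be used here as long as both arguments stay on the log1p scale.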


#### Model2: KFold+RS for KNN

In [62]:
start=time.time()
model2 = KNeighborsRegressor(n_jobs=-1)
kf=model_selection.KFold(n_splits=5) 
params_dist_knn={'n_neighbors':st.randint(3,7), 'algorithm':['ball_tree','kd_tree'], 'leaf_size':st.randint(20,30)}
random_search_knn = RandomizedSearchCV(model2, param_distributions=params_dist_knn,n_iter=10,cv=kf,n_jobs=-1)
knn_g=random_search_knn.fit(train[col],np.log1p(train['visitors'].values))
knn_best=knn_g.best_estimator_
knn_pred_best=knn_best.predict(test[col])
pred_knn=np.expm1(knn_pred_best)
print(knn_best)
print('Time elapsed for the KNN with CV: ',time.time()-start)

KNeighborsRegressor(algorithm='ball_tree', leaf_size=26, metric='minkowski',
          metric_params=None, n_jobs=-1, n_neighbors=6, p=2,
          weights='uniform')
Time elapsed for the KNN with CV:  2605.003373861313


In [63]:
print('RMSE KNNRegressor: ', RMSLE(np.log1p(train['visitors'].values), knn_best.predict(train[col])))

RMSE KNNRegressor:  0.4572716331390319


#### Model3: KFold+RS for XGB

In [64]:
start=time.time()
model3 = XGBRegressor(random_state=3,verbose=True)
kf=model_selection.KFold(n_splits=5) 
params_xgb={'learning_rate':st.uniform(0.1,0.25), 'n_estimators':st.randint(500,900), 'subsample':[0.8,0.9], 'colsample_bytree':[0.8,0.9], 'max_depth' :st.randint(8,12)}
random_search_xGB = RandomizedSearchCV(model3, param_distributions=params_xgb,n_iter=10,cv=kf,n_jobs=-1)
xgb_g=random_search_xGB.fit(train[col],np.log1p(train['visitors'].values))
xgb_best=xgb_g.best_estimator_
xgb_pred_best=xgb_best.predict(test[col])
pred_xgb=np.expm1(xgb_pred_best)
print(xgb_best)
print('Time elapsed for the XGB with CV: ',time.time()-start)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.9, gamma=0, learning_rate=0.11331219142398294,
       max_delta_step=0, max_depth=11, min_child_weight=1, missing=None,
       n_estimators=725, n_jobs=1, nthread=None, objective='reg:linear',
       random_state=3, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=0.9, verbose=True)
Time elapsed for the XGB with CV:  2648.544768333435


In [65]:
print('RMSE XGBRegressor: ', RMSLE(np.log1p(train['visitors'].values), xgb_best.predict(train[col])))

RMSE XGBRegressor:  0.21019385302205423


#### Model4: KFold+RS for RF

In [66]:
start=time.time()
model4 = RandomForestRegressor(random_state=2,n_jobs=4)
kf=model_selection.KFold(n_splits=5) 
params_dist_rf = {"n_estimators": st.randint(100, 400),"max_depth": st.randint(6, 12),"min_samples_leaf": st.randint(1, 11)}
random_search_rf = RandomizedSearchCV(model4, param_distributions=params_dist_rf,n_iter=10,cv=kf,n_jobs=4)
rf_g=random_search_rf.fit(train[col],np.log1p(train['visitors'].values))
rf_best=rf_g.best_estimator_
rf_pred_best=rf_best.predict(test[col])
rf_pred_best=np.expm1(rf_pred_best)
print(rf_best)
print('Time elapsed for the Random Forests with CV: ',time.time()-start)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=11,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=10, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=157, n_jobs=4,
           oob_score=False, random_state=2, verbose=0, warm_start=False)
Time elapsed for the Random Forests with CV:  940.4802186489105


In [67]:
print('RMSE RandomForestRegressor: ', RMSLE(np.log1p(train['visitors'].values), rf_best.predict(train[col])))

RMSE RandomForestRegressor:  0.4901221720897668


#### Model5: KFold+RS for LGB

In [68]:
start=time.time()
kf=model_selection.KFold(n_splits=5)
params = {  
    "n_estimators": st.randint(100, 500),
    "learning_rate": st.uniform(0.05, 0.4),
    "num_leaves": st.randint(20, 60),
    "min_child_weight": st.randint(1, 10),
    "min_child_samples": st.randint(1, 100),
}   
lgbm = lgb.LGBMRegressor()
lgbmrscv = model_selection.RandomizedSearchCV(lgbm, params, n_iter=10,cv=kf,n_jobs=-1)  
lgbmrscv.fit(train[col],np.log1p(train['visitors'].values))
lgb_best=lgbmrscv.best_estimator_
print(lgbmrscv.best_params_)
print('Time elapsed for the LGB with CV: ',time.time()-start)

{'learning_rate': 0.09069205943969272, 'min_child_samples': 57, 'min_child_weight': 9, 'n_estimators': 216, 'num_leaves': 23}
Time elapsed for the LGB with CV:  56.27379536628723


In [69]:
lgb_pred_best=lgb_best.predict(test[col])
lgb_pred_best=np.expm1(lgb_pred_best)

In [70]:
lgb_pred_best

array([ 1.78388192, 17.04145619, 22.96955575, ...,  3.61593392,
        4.10192504,  4.09603635])

In [71]:
print('RMSE LGBMRegressor: ', RMSLE(np.log1p(train['visitors'].values), lgbmrscv.predict(train[col])))

RMSE LGBMRegressor:  0.48768595899145545


#### Model averaging with KFold CV and randomized search

In [72]:
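# NOTE: knn_pred_best and xgb_pred_best are still on the log1p scale here (pred_knn and pred_xgb hold
# the back-transformed values), while the GBM, RF and LGB predictions are on the visitor scale
model_avg=(0.2*gb_pred_best+0.2*knn_pred_best+0.3*xgb_pred_best+0.1*rf_pred_best+0.2*lgb_pred_best)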

In [73]:
model_avg

array([ 1.64369932, 10.6781351 , 14.10006104, ...,  2.20553293,
        2.48201789,  2.09655501])
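
The blend above is a fixed-weight average of the model outputs. A minimal sketch of the same idea, under the assumption that every model's predictions are first back-transformed with `np.expm1` so that all inputs share the visitor scale. The model names, values and weights are made up for illustration.

```python
import numpy as np

# Hypothetical log1p-scale predictions from three models for four rows
log_preds = {
    "gbm": np.array([0.9, 2.8, 3.1, 1.2]),
    "knn": np.array([1.1, 2.6, 3.0, 1.4]),
    "xgb": np.array([1.0, 2.9, 3.2, 1.3]),
}
weights = {"gbm": 0.3, "knn": 0.2, "xgb": 0.5}   # illustrative weights that sum to 1

# Back-transform every model first so that all inputs are visitor counts
visitor_preds = {name: np.expm1(p) for name, p in log_preds.items()}

blend = sum(weights[name] * visitor_preds[name] for name in weights)
blend = np.clip(blend, 0.0, None)   # visitor counts cannot be negative

print(blend)
```

With weights that sum to 1, each blended value stays between the smallest and largest individual prediction for that row, which is a quick sanity check for scale mix-ups.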

In [74]:
test['visitors']=model_avg
test['visitors'] =(test['visitors']).clip(lower=0.)

In [75]:
test[['id','visitors']].to_csv('submission_avg.csv',index=False)

**5 models, KFold + randomized search, averaged**: Public Leaderboard vs Private Leaderboard 
![Public LB](submissions/model_avg_5models.PNG)

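**Observations:** **K-Fold CV with randomized search made things worse**, as expected for time series data, because the folds ignore the ordering of the dates. Many public kernels used this approach, so this part is included for reference. Note that `knn_pred_best` and `xgb_pred_best` enter the average above on the log1p scale (their back-transformed versions are `pred_knn` and `pred_xgb`), so part of the drop may come from mixing scales rather than from the CV scheme.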
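
To make the leakage concern concrete, here is a small sketch on toy data (not the competition frame). It assumes the rows are sorted by date, so that row position stands in for time, and checks for each fold whether the model would be trained on rows that come after the start of the validation block.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# 20 rows assumed to be sorted by date
X = np.arange(20).reshape(-1, 1)

def trains_on_future(splitter):
    """Per fold: does any training row come after the first validation row?"""
    return [bool(train_idx.max() > valid_idx.min())
            for train_idx, valid_idx in splitter.split(X)]

kfold_leaks = trains_on_future(KFold(n_splits=5))
tscv_leaks = trains_on_future(TimeSeriesSplit(n_splits=4))

print("KFold:", kfold_leaks)
print("TimeSeriesSplit:", tscv_leaks)
```

With plain `KFold`, every fold except the last one trains on rows from the future of its validation block, while `TimeSeriesSplit` never does.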

In [76]:
sub1=test[['id','visitors']].copy()

In [78]:
stack_with_kfold_cv=train_orig[['air_store_id','visit_date']].copy()  # copy so the new columns are not written to a slice

In [81]:
stack_with_kfold_cv['gb_preds']=gb_best.predict(train[col])
stack_with_kfold_cv['knn_preds']=knn_best.predict(train[col])
stack_with_kfold_cv['xgb_preds']=xgb_best.predict(train[col])
stack_with_kfold_cv['rf_preds']=rf_best.predict(train[col])
stack_with_kfold_cv['lgb_preds']=lgb_best.predict(train[col])

In [82]:
stack_with_kfold_cv.to_csv('stacking_input/stack_kfold_cv_pub_mine.csv',index=False)

## Results Block for KFold CV + Randomized Grid Search

### Strategy-1: TimeSeries CV  with RandomSearch

**Strategy-1:** Use the out-of-the-box `TimeSeriesSplit` from sklearn for cross-validation.

**Observations:** I am considering two approaches:

* method-1: Create the `TimeSeriesSplit` generator and pass it directly to `RandomizedSearchCV`

* method-2: Apply `TimeSeriesSplit` to the unique dates, then use those dates to collect the rows of all restaurants into the train and validation sets (this makes more sense to me)

**RMSLE has to be measured for both cases**
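
A minimal sketch of method-2 on a toy frame, assuming three hypothetical stores observed on ten consecutive days. The split is computed on the unique dates and then mapped back to rows, so all stores of a given day always land in the same set.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy frame: 3 hypothetical stores observed on 10 consecutive days
dates = pd.date_range("2016-01-01", periods=10, freq="D")
toy = pd.DataFrame(
    [(store, d) for d in dates for store in ("store_a", "store_b", "store_c")],
    columns=["air_store_id", "visit_date"],
)

unique_dates = toy["visit_date"].drop_duplicates().sort_values().reset_index(drop=True)

folds = []
for date_train_pos, date_valid_pos in TimeSeriesSplit(n_splits=4).split(unique_dates):
    train_dates = unique_dates.iloc[date_train_pos]
    valid_dates = unique_dates.iloc[date_valid_pos]
    # Map the date-level split back to row positions
    train_rows = toy.index[toy["visit_date"].isin(train_dates)].to_numpy()
    valid_rows = toy.index[toy["visit_date"].isin(valid_dates)].to_numpy()
    folds.append((train_rows, valid_rows))

for train_rows, valid_rows in folds:
    print(len(train_rows), len(valid_rows))
```

A list of `(train_rows, valid_rows)` pairs like `folds` can be passed as the `cv` argument of `RandomizedSearchCV`, provided the frame has a default integer index so that labels and positions coincide.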

*Using method-1*

In [188]:
tscv = model_selection.TimeSeriesSplit(n_splits=4)
tscv_cv = tscv.split(train)

In [189]:
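# NOTE: TimeSeriesSplit splits by row position, so after sorting by store (not by date)
# the folds used by the search below are not chronological
train=train.sort_values(ascending=[True],by=['air_store_id'])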
train.head()

Unnamed: 0,air_store_id,visit_date,visitors,dow,year,month,doy,dim,woy,is_month_end,day_of_week,holiday_flg,min_visitors,mean_visitors,median_visitors,max_visitors,count_observations,air_area_name,latitude,longitude,air_area_name0,air_area_name1,air_area_name2,air_area_name3,air_area_name4,air_area_name5,air_area_name6,air_area_name7,air_area_name8,air_area_name9,rs1_x,rv1_x,rs2_x,rv2_x,rs1_y,rv1_y,rs2_y,rv2_y,id,total_reserv_sum,total_reserv_mean,total_reserv_dt_diff_mean,date_int,var_max_lat,var_max_long,lon_plus_lat,air_store_id2,air_genre_name_Asian,air_genre_name_Bar or Club,air_genre_name_European,air_genre_name_Japanese,air_genre_name_Other
52539,air_00a91d42b08b08d9,2016-07-08,42,4,2016,7,190,8,27,False,0,0,17.0,36.5,35.5,57.0,40.0,44.0,35.694003,139.753595,7.0,6.0,3.0,6.0,48.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,0.0,9.0,0.0,9.0,air_00a91d42b08b08d9_2016-07-08,-999.0,-999.0,-999.0,20160708,8.326629,4.519803,175.447598,0,0,0,1,0,0
65560,air_00a91d42b08b08d9,2016-07-27,24,2,2016,7,209,27,30,False,6,0,15.0,28.125,28.0,52.0,40.0,44.0,35.694003,139.753595,7.0,6.0,3.0,6.0,48.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,18.0,2.0,18.0,2.0,air_00a91d42b08b08d9_2016-07-27,-999.0,-999.0,-999.0,20160727,8.326629,4.519803,175.447598,0,0,0,1,0,0
130082,air_00a91d42b08b08d9,2016-10-29,7,5,2016,10,303,29,43,False,2,0,3.0,14.973684,11.0,99.0,38.0,44.0,35.694003,139.753595,7.0,6.0,3.0,6.0,48.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_00a91d42b08b08d9_2016-10-29,-999.0,-999.0,-999.0,20161029,8.326629,4.519803,175.447598,0,0,0,1,0,0
200630,air_00a91d42b08b08d9,2017-02-10,26,4,2017,2,41,10,6,False,0,0,17.0,36.5,35.5,57.0,40.0,44.0,35.694003,139.753595,7.0,6.0,3.0,6.0,48.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_00a91d42b08b08d9_2017-02-10,-999.0,-999.0,-999.0,20170210,8.326629,4.519803,175.447598,0,0,0,1,0,0
84966,air_00a91d42b08b08d9,2016-08-25,34,3,2016,8,238,25,34,False,4,0,15.0,29.868421,30.0,47.0,38.0,44.0,35.694003,139.753595,7.0,6.0,3.0,6.0,48.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,9.0,14.0,9.0,14.0,air_00a91d42b08b08d9_2016-08-25,-999.0,-999.0,-999.0,20160825,8.326629,4.519803,175.447598,0,0,0,1,0,0


In [190]:
train.reset_index(drop=True,inplace=True)
train.head()

Unnamed: 0,air_store_id,visit_date,visitors,dow,year,month,doy,dim,woy,is_month_end,day_of_week,holiday_flg,min_visitors,mean_visitors,median_visitors,max_visitors,count_observations,air_area_name,latitude,longitude,air_area_name0,air_area_name1,air_area_name2,air_area_name3,air_area_name4,air_area_name5,air_area_name6,air_area_name7,air_area_name8,air_area_name9,rs1_x,rv1_x,rs2_x,rv2_x,rs1_y,rv1_y,rs2_y,rv2_y,id,total_reserv_sum,total_reserv_mean,total_reserv_dt_diff_mean,date_int,var_max_lat,var_max_long,lon_plus_lat,air_store_id2,air_genre_name_Asian,air_genre_name_Bar or Club,air_genre_name_European,air_genre_name_Japanese,air_genre_name_Other
0,air_00a91d42b08b08d9,2016-07-08,42,4,2016,7,190,8,27,False,0,0,17.0,36.5,35.5,57.0,40.0,44.0,35.694003,139.753595,7.0,6.0,3.0,6.0,48.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,0.0,9.0,0.0,9.0,air_00a91d42b08b08d9_2016-07-08,-999.0,-999.0,-999.0,20160708,8.326629,4.519803,175.447598,0,0,0,1,0,0
1,air_00a91d42b08b08d9,2016-07-27,24,2,2016,7,209,27,30,False,6,0,15.0,28.125,28.0,52.0,40.0,44.0,35.694003,139.753595,7.0,6.0,3.0,6.0,48.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,18.0,2.0,18.0,2.0,air_00a91d42b08b08d9_2016-07-27,-999.0,-999.0,-999.0,20160727,8.326629,4.519803,175.447598,0,0,0,1,0,0
2,air_00a91d42b08b08d9,2016-10-29,7,5,2016,10,303,29,43,False,2,0,3.0,14.973684,11.0,99.0,38.0,44.0,35.694003,139.753595,7.0,6.0,3.0,6.0,48.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_00a91d42b08b08d9_2016-10-29,-999.0,-999.0,-999.0,20161029,8.326629,4.519803,175.447598,0,0,0,1,0,0
3,air_00a91d42b08b08d9,2017-02-10,26,4,2017,2,41,10,6,False,0,0,17.0,36.5,35.5,57.0,40.0,44.0,35.694003,139.753595,7.0,6.0,3.0,6.0,48.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,air_00a91d42b08b08d9_2017-02-10,-999.0,-999.0,-999.0,20170210,8.326629,4.519803,175.447598,0,0,0,1,0,0
4,air_00a91d42b08b08d9,2016-08-25,34,3,2016,8,238,25,34,False,4,0,15.0,29.868421,30.0,47.0,38.0,44.0,35.694003,139.753595,7.0,6.0,3.0,6.0,48.0,0.0,0.0,0.0,0.0,0.0,-999.0,-999.0,-999.0,-999.0,9.0,14.0,9.0,14.0,air_00a91d42b08b08d9_2016-08-25,-999.0,-999.0,-999.0,20160825,8.326629,4.519803,175.447598,0,0,0,1,0,0


In [191]:
lgbm = lgb.LGBMRegressor()
lgbmrscv = model_selection.RandomizedSearchCV(lgbm, params, n_iter=10,cv=tscv,n_jobs=-1)  
lgbmrscv.fit(train[col],np.log1p(train['visitors'].values))
print(lgbmrscv.best_params_)

{'learning_rate': 0.057303305011305206, 'min_child_samples': 18, 'min_child_weight': 2, 'n_estimators': 409, 'num_leaves': 27}


In [192]:
print('RMSE LGBMRegressor: ', RMSLE(np.log1p(train['visitors'].values), lgbmrscv.predict(train[col])))

RMSE LGBMRegressor:  0.48244613355390464


In [193]:
df_train = train.sort_values(by='date_int').copy()
df_test=test.sort_values(by='date_int').copy()

In [194]:
tscv = model_selection.TimeSeriesSplit(n_splits=4)
tscv_cv = tscv.split(df_train)
y=np.log1p(df_train['visitors'].values)
y_test_pred = 0
lgbmtscv = lgb_best  # same object as lgb_best, so every fit below refits it in place
for i, (train_index, test_index) in enumerate(tscv.split(df_train)):
    # Create data for this fold
    y_train, y_valid = y[train_index].copy(), y[test_index]
    X_train, X_valid = df_train.iloc[train_index, :].copy(), df_train.iloc[test_index, :].copy()
    print("\nFold ", i)

    fit_model = lgbmtscv.fit(X_train[col], y_train)
    pred = lgbmtscv.predict(X_valid[col])
    print('RMSE LightGBM, fold ', i, ': ', RMSLE(y_valid, pred))
    print('Prediction length on validation set, LightGBM, fold ', i, ': ', len(pred))
    # Accumulate test set predictions

    pred = lgbmtscv.predict(df_test[col])
    print('Prediction length on test set, LightGBM, fold ', i, ': ', len(pred))
    y_test_pred += pred


Fold  0
RMSE LightGBM, fold  0 :  0.5235935977170681
Prediction length on validation set, LightGBM, fold  0 :  50421
Prediction length on test set, LightGBM, fold  0 :  32019

Fold  1
RMSE LightGBM, fold  1 :  0.5086122795383349
Prediction length on validation set, LightGBM, fold  1 :  50421
Prediction length on test set, LightGBM, fold  1 :  32019

Fold  2
RMSE LightGBM, fold  2 :  0.5350779999324135
Prediction length on validation set, LightGBM, fold  2 :  50421
Prediction length on test set, LightGBM, fold  2 :  32019

Fold  3
RMSE LightGBM, fold  3 :  0.49509989605412535
Prediction length on validation set, LightGBM, fold  3 :  50421
Prediction length on test set, LightGBM, fold  3 :  32019


In [195]:
y_test_pred/=4
df_test_out = df_test.copy()
df_test_out['visitors'] = y_test_pred
df_test_out['visitors'] = np.expm1(df_test_out['visitors']).clip(lower=1.)
df_test_out = df_test_out.sort_index()
sub1['visitors'] = df_test_out['visitors'].values
lgbmtscv_out = sub1[['id','visitors']].copy()
lgbmtscv_out.to_csv('lgbmtscv_out_tscv_m1.csv',index=False)

**Single LGB Model with 4-fold Timeseries CV**: Public Leaderboard vs Private Leaderboard 
![Public LB](submissions/light_gbm_tscv_s1_m1.PNG)

**Observations:** 4-fold time series CV with randomized search did well compared to a single XGB model with no CV, but the overall scores are still not great. Maybe the time series split needs to be changed a bit.
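
For reference, a small sketch of how the out-of-the-box `TimeSeriesSplit` lays out its windows, using 20 toy rows as a stand-in for a date-sorted frame. Each validation block has `n_samples // (n_splits + 1)` rows, and the training window grows by one block per fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rows = np.arange(20)   # stand-in for a date-sorted training frame

sizes = [(len(train_idx), len(valid_idx))
         for train_idx, valid_idx in TimeSeriesSplit(n_splits=4).split(rows)]
print(sizes)
```

This is why every fold in the log above reports the same validation length: the blocks are equal-sized slices of rows, not calendar periods, so they do not line up with events such as Golden Week or New Year.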

*Method-2: Using the date set*

In [196]:
dates_list=train["visit_date"].unique()

In [197]:
dates_list=list(dates_list)
dateset=pd.DataFrame({'date':dates_list,'type':'train'})
dateset=dateset.sort_values(ascending=True,by="date")
dateset.head()

Unnamed: 0,date,type
476,2016-01-01,train
477,2016-01-02,train
396,2016-01-03,train
416,2016-01-04,train
438,2016-01-05,train


In [198]:
dateset.reset_index(inplace=True)

In [199]:
dateset.drop('index',inplace=True,axis=1)

In [200]:
train['visit_date']=pd.to_datetime(train['visit_date'])

In [201]:
from sklearn.metrics import mean_squared_error,make_scorer

In [202]:
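# mean_squared_error is a loss, so tell sklearn that lower is better;
# with the default greater_is_better=True the search would favour the highest MSE
scored=make_scorer(mean_squared_error,greater_is_better=False)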
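
`RandomizedSearchCV` keeps the candidate with the highest score, so a metric where lower is better has to be wrapped accordingly. A small sketch of the difference, using a `DummyRegressor` and toy data that are not part of this notebook:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

X = np.arange(6).reshape(-1, 1)
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
model = DummyRegressor(strategy="mean").fit(X, y)   # always predicts the mean, 3.5

# Default: the raw MSE is returned and treated as "higher is better"
as_score = make_scorer(mean_squared_error)
# Loss: the sign is flipped, so maximising the score minimises the MSE
as_loss = make_scorer(mean_squared_error, greater_is_better=False)

score_value = as_score(model, X, y)
loss_value = as_loss(model, X, y)
print(score_value, loss_value)
```

The second scorer returns the negated MSE, so the candidate with the smallest error ends up with the highest score.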

In [203]:
# params from above
lbm = lgb.LGBMRegressor()
df_test=df_test.sort_values(ascending=[True],by=["visit_date"])
train=train.sort_values(ascending=[True,True],by=['visit_date','air_store_id'])
train.reset_index(inplace=True)
train.drop('index',axis=1,inplace=True)
y_tscv=0
it=0
cv_set=[]
for train_index, test_index in tscv.split(dateset.values):
    hold_out=dateset.index.values[test_index[-1]+1:]
    it=it+1
    print("Cv set indices captured for Fold -- ",it)
    date_max_train=dateset.date[train_index[-1]]

    mask = ((train['visit_date']>'2016-01-01') & (train['visit_date']<date_max_train))
    X_train= train.loc[mask]

    X_training= X_train.drop(['visitors','air_store_id','visit_date'], axis=1)
    y_training = np.log1p(X_train['visitors'].values)
    date_max_valid=dateset.date[test_index[-1]]

    mask2 = ((train['visit_date']>=date_max_train) & (train['visit_date']<date_max_valid))
    X_valid= train.loc[mask2]
    X_validate= X_valid.drop(['visitors','air_store_id','visit_date'], axis=1)
    y_validate = np.log1p(X_valid['visitors'].values)
    
    cv_set.append(X_training.index.tolist())
    cv_set.append(X_validate.index.tolist())
    # not required to capture the holdout dataset

Cv set indices captured for Fold --  1
Cv set indices captured for Fold --  2
Cv set indices captured for Fold --  3
Cv set indices captured for Fold --  4


In [204]:
id_set=[0,2,4,6]

cvd=[]
cvd = [(cv_set[i],cv_set[i+1])
      for i in id_set]

In [207]:
params_new = {  
    "n_estimators": st.randint(400, 800),
    "learning_rate": st.uniform(0.05, 0.15),
    "num_leaves": st.randint(20, 40),
    "min_child_weight": st.randint(1, 10),
    "min_child_samples": st.randint(1, 100),
}   

In [213]:
# Now implementing RandomSearchCV for every iter
lgbmrstcv = model_selection.RandomizedSearchCV(lbm, params_new, n_iter=10,n_jobs=-1,cv=cvd,scoring=scored)  
f_train=train[col]
f_target=np.log1p(train[['visitors']].values)
lgbmrstcv.fit(f_train.values,f_target)
print(lgbmrstcv.best_params_)
pred = lgbmrstcv.predict(df_test[col].values)
#print('Prediction length on test set, LightGBM, fold ', it, ': ', len(pred))
#y_tscv += pred

{'learning_rate': 0.19436047582152, 'min_child_samples': 21, 'min_child_weight': 2, 'n_estimators': 694, 'num_leaves': 29}


In [214]:
#y_tscv/=5
df_test_out = df_test.copy()
df_test_out['visitors'] = pred
df_test_out['visitors'] = np.expm1(df_test_out['visitors']).clip(lower=0.)
df_test_out = df_test_out.sort_index()
sub1['visitors'] = df_test_out['visitors'].values
lgbmtscv_out = sub1[['id','visitors']].copy()
lgbmtscv_out.to_csv('lgbmtscv_out_tscv_m2.csv',index=False)

In [215]:
lgbm_train=train[['air_store_id','visit_date']].copy()  # copy so the new column is not written to a slice
lgbm_train['visitors'] = lgbmrstcv.predict(train[col].values)

In [217]:
lgbm_train.to_csv('stacking_input/lgbm_tscv_train.csv',index=False)

**Single LGB model with date-based time series split + randomized search CV**: Public Leaderboard vs Private Leaderboard 
![Public LB](submissions/light_gbm_tscv_s1_m2.PNG)

**Observations:** Splitting the time series on the date set helped improve the score. A more customized split built around Golden Week and New Year might give better results.

### Strategy-2: TimeSeries Custom CV with RandomSearch

* Fold-1: training data: up to 2016-04-22 (1 week before Golden Week); validation data: 2016-04-22 to 2016-05-12; hold-out: remaining training data
* Fold-2: training data: up to 2016-08-31; validation data: 2016-08-31 to 2016-09-30; hold-out: remaining training data
* Fold-3: training data: up to 2016-12-18; validation data: 2016-12-18 to 2017-01-15; hold-out: remaining training data
* Fold-4: training data: up to 2017-03-23; validation data: 2017-03-23 to 2017-04-23; hold-out: nothing
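
The fold list above can be turned into index pairs with a few boolean masks. A minimal sketch, assuming a toy frame with one row per day and treating each validation end date as exclusive; the frame and the name `custom_cv` are illustrative only.

```python
import pandas as pd

# One hypothetical row per day over the training period
toy = pd.DataFrame({"visit_date": pd.date_range("2016-01-01", "2017-04-22", freq="D")})

# (validation start, validation end) pairs from the fold list; the end is exclusive
windows = [
    ("2016-04-22", "2016-05-12"),
    ("2016-08-31", "2016-09-30"),
    ("2016-12-18", "2017-01-15"),
    ("2017-03-23", "2017-04-23"),
]

custom_cv = []
for valid_start, valid_end in windows:
    train_mask = toy["visit_date"] < valid_start
    valid_mask = (toy["visit_date"] >= valid_start) & (toy["visit_date"] < valid_end)
    custom_cv.append((toy.index[train_mask].to_numpy(), toy.index[valid_mask].to_numpy()))

for train_idx, valid_idx in custom_cv:
    print(len(train_idx), len(valid_idx))
```

Unlike the equal-sized blocks of `TimeSeriesSplit`, these validation windows have different lengths and are placed deliberately around Golden Week, the end of summer, New Year and the weeks just before the test period.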

**Prediction on the complete test set**

In [218]:
test.head()

Unnamed: 0,id,visitors,visit_date,air_store_id,dow,year,month,doy,dim,woy,is_month_end,day_of_week,holiday_flg,min_visitors,mean_visitors,median_visitors,max_visitors,count_observations,air_area_name,latitude,longitude,air_area_name0,air_area_name1,air_area_name2,air_area_name3,air_area_name4,air_area_name5,air_area_name6,air_area_name7,air_area_name8,air_area_name9,rs1_x,rv1_x,rs2_x,rv2_x,rs1_y,rv1_y,rs2_y,rv2_y,total_reserv_sum,total_reserv_mean,total_reserv_dt_diff_mean,date_int,var_max_lat,var_max_long,lon_plus_lat,air_store_id2,air_genre_name_Asian,air_genre_name_Bar or Club,air_genre_name_European,air_genre_name_Japanese,air_genre_name_Other
0,air_00a91d42b08b08d9_2017-04-23,0,2017-04-23,air_00a91d42b08b08d9,6,2017,4,113,23,16,False,3,0,2.0,2.0,2.0,2.0,1.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170423,8.326629,4.519803,175.447598,0,0,0,1,0,0
1,air_00a91d42b08b08d9_2017-04-24,0,2017-04-24,air_00a91d42b08b08d9,0,2017,4,114,24,17,False,1,0,1.0,22.457143,19.0,47.0,35.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170424,8.326629,4.519803,175.447598,0,0,0,1,0,0
2,air_00a91d42b08b08d9_2017-04-25,0,2017-04-25,air_00a91d42b08b08d9,1,2017,4,115,25,17,False,5,0,1.0,24.35,24.5,43.0,40.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170425,8.326629,4.519803,175.447598,0,0,0,1,0,0
3,air_00a91d42b08b08d9_2017-04-26,0,2017-04-26,air_00a91d42b08b08d9,2,2017,4,116,26,17,False,6,0,15.0,28.125,28.0,52.0,40.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170426,8.326629,4.519803,175.447598,0,0,0,1,0,0
4,air_00a91d42b08b08d9_2017-04-27,0,2017-04-27,air_00a91d42b08b08d9,3,2017,4,117,27,17,False,4,0,15.0,29.868421,30.0,47.0,38.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170427,8.326629,4.519803,175.447598,0,0,0,1,0,0


In [236]:
params = {  
    "n_estimators": st.randint(100, 500),
    "learning_rate": st.uniform(0.05, 0.15),
    "num_leaves": st.randint(20, 60),
    "min_child_weight": st.randint(1, 10),
    "min_child_samples": st.randint(1, 100),
}   

In [241]:
# Build the custom folds: cut_dates holds (validation start, validation end) pairs
lbm = lgb.LGBMRegressor()
df_test=df_test.sort_values(ascending=[True,True],by=['visit_date','air_store_id'])
train=train.sort_values(ascending=[True,True],by=['visit_date','air_store_id'])
train.reset_index(drop=True,inplace=True)
cv_set=[]

cut_dates=['2016-04-22','2016-05-12','2016-08-31','2016-09-30','2016-12-18','2017-01-15','2017-03-23','2017-04-23']
for valid_start, valid_end in zip(cut_dates[0::2], cut_dates[1::2]):
    # training rows: everything before the validation window
    mask = ((train['visit_date']>='2016-01-01') & (train['visit_date']<valid_start))
    # validation rows: valid_start (inclusive) to valid_end (exclusive)
    mask2 = ((train['visit_date']>=valid_start) & (train['visit_date']<valid_end))

    cv_set.append(train.index[mask].tolist())
    cv_set.append(train.index[mask2].tolist())

In [245]:
id_set=[0,2,4,6]

cvd=[]
cvd = [(cv_set[i],cv_set[i+1])
      for i in id_set]

In [249]:
#cvd

In [250]:
lgbmrstcv = model_selection.RandomizedSearchCV(lbm, param_distributions=params,n_iter=10,n_jobs=1,scoring=scored,cv=cvd,verbose=True)
f_train=train[col]
f_target=np.log1p(train[['visitors']])
lgbmrstcv.fit(f_train.values,f_target.values)
print(lgbmrstcv.best_params_)

Fitting 4 folds for each of 10 candidates, totalling 40 fits


[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:  6.7min finished


{'learning_rate': 0.17944079948127378, 'min_child_samples': 44, 'min_child_weight': 2, 'n_estimators': 319, 'num_leaves': 52}


In [251]:
#train.head()

In [253]:
y_tscv_s2=df_test[['id','visitors']].copy()  # copy so the assignment below does not write to a slice
y_tscv_s2['visitors']=lgbmrstcv.predict(df_test[col].values)

preds_s2=np.expm1(y_tscv_s2['visitors'].values)

df_test['visitors']=preds_s2

df_test.sort_index(inplace=True)

In [254]:
df_test[['id','visitors']].copy().to_csv('lgbmtscv_out_tscv_s2.csv',index=False)

In [260]:
lgbm_train=train[['air_store_id','visit_date']].copy()  # copy so the new column is not written to a slice
lgbm_train['visitors'] = lgbmrstcv.predict(train[col].values)
lgbm_train.to_csv('stacking_input/lgbm_tscv_train_s2.csv',index=False)

**Single LGB model with custom time series split and randomized search CV**: Public Leaderboard vs Private Leaderboard 
![Public LB](submissions/light_gbm_tscv_s2.PNG)

**Observations:** The custom time series split improved the results: both leaderboard scores came down. More feature engineering combined with this split strategy could help further; that is left for the next notebook.

# Using H2O AutoML

In [258]:
test.head()

Unnamed: 0,id,visitors,visit_date,air_store_id,dow,year,month,doy,dim,woy,is_month_end,day_of_week,holiday_flg,min_visitors,mean_visitors,median_visitors,max_visitors,count_observations,air_area_name,latitude,longitude,air_area_name0,air_area_name1,air_area_name2,air_area_name3,air_area_name4,air_area_name5,air_area_name6,air_area_name7,air_area_name8,air_area_name9,rs1_x,rv1_x,rs2_x,rv2_x,rs1_y,rv1_y,rs2_y,rv2_y,total_reserv_sum,total_reserv_mean,total_reserv_dt_diff_mean,date_int,var_max_lat,var_max_long,lon_plus_lat,air_store_id2,air_genre_name_Asian,air_genre_name_Bar or Club,air_genre_name_European,air_genre_name_Japanese,air_genre_name_Other
0,air_00a91d42b08b08d9_2017-04-23,0,2017-04-23,air_00a91d42b08b08d9,6,2017,4,113,23,16,False,3,0,2.0,2.0,2.0,2.0,1.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170423,8.326629,4.519803,175.447598,0,0,0,1,0,0
1,air_00a91d42b08b08d9_2017-04-24,0,2017-04-24,air_00a91d42b08b08d9,0,2017,4,114,24,17,False,1,0,1.0,22.457143,19.0,47.0,35.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170424,8.326629,4.519803,175.447598,0,0,0,1,0,0
2,air_00a91d42b08b08d9_2017-04-25,0,2017-04-25,air_00a91d42b08b08d9,1,2017,4,115,25,17,False,5,0,1.0,24.35,24.5,43.0,40.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170425,8.326629,4.519803,175.447598,0,0,0,1,0,0
3,air_00a91d42b08b08d9_2017-04-26,0,2017-04-26,air_00a91d42b08b08d9,2,2017,4,116,26,17,False,6,0,15.0,28.125,28.0,52.0,40.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170426,8.326629,4.519803,175.447598,0,0,0,1,0,0
4,air_00a91d42b08b08d9_2017-04-27,0,2017-04-27,air_00a91d42b08b08d9,3,2017,4,117,27,17,False,4,0,15.0,29.868421,30.0,47.0,38.0,44,35.694003,139.753595,7,6,3,6,48,0,0,0,0,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,20170427,8.326629,4.519803,175.447598,0,0,0,1,0,0


In [259]:
import h2o
h2o.init()
from h2o.automl import H2OAutoML
df_train=train.copy()  # copy so that the log1p transform below does not modify train in place
df_test=test.copy()
df_train['visitors']=np.log1p(df_train['visitors'])
htrain = h2o.H2OFrame(df_train)
htest = h2o.H2OFrame(df_test)

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_141"; Java(TM) SE Runtime Environment (build 1.8.0_141-b15); Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)
  Starting server from /home/namanda/anaconda3.6/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /home/namanda/tmp/tmp_f8a_pcr
  JVM stdout: /home/namanda/tmp/tmp_f8a_pcr/h2o_namanda_started_from_python.out
  JVM stderr: /home/namanda/tmp/tmp_f8a_pcr/h2o_namanda_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


0,1
H2O cluster uptime:,03 secs
H2O cluster version:,3.16.0.4
H2O cluster version age:,1 month and 2 days
H2O cluster name:,H2O_from_python_namanda_mv53ym
H2O cluster total nodes:,1
H2O cluster free memory:,26.67 Gb
H2O cluster total cores:,32
H2O cluster allowed cores:,32
H2O cluster status:,"accepting new members, healthy"
H2O connection url:,http://127.0.0.1:54321


Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [261]:
htrain=htrain.drop(['air_store_id','visit_date','id'])
htest=htest.drop(['air_store_id','visit_date','id'])

In [262]:
test.visitors=0

In [263]:
htest['visitors']=0

In [264]:
htest.head()

visitors,dow,year,month,doy,dim,woy,is_month_end,day_of_week,holiday_flg,min_visitors,mean_visitors,median_visitors,max_visitors,count_observations,air_area_name,latitude,longitude,air_area_name0,air_area_name1,air_area_name2,air_area_name3,air_area_name4,air_area_name5,air_area_name6,air_area_name7,air_area_name8,air_area_name9,rs1_x,rv1_x,rs2_x,rv2_x,rs1_y,rv1_y,rs2_y,rv2_y,total_reserv_sum,total_reserv_mean,total_reserv_dt_diff_mean,date_int,var_max_lat,var_max_long,lon_plus_lat,air_store_id2,air_genre_name_Asian,air_genre_name_Bar or Club,air_genre_name_European,air_genre_name_Japanese,air_genre_name_Other
0,6,2017,4,113,23,16,False,3,0,2,2.0,2.0,2,1,44,35.694,139.754,7,6,3,6,48,0,0,0,0,0,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,20170400.0,8.32663,4.5198,175.448,0,0,0,1,0,0
0,0,2017,4,114,24,17,False,1,0,1,22.4571,19.0,47,35,44,35.694,139.754,7,6,3,6,48,0,0,0,0,0,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,20170400.0,8.32663,4.5198,175.448,0,0,0,1,0,0
0,1,2017,4,115,25,17,False,5,0,1,24.35,24.5,43,40,44,35.694,139.754,7,6,3,6,48,0,0,0,0,0,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,20170400.0,8.32663,4.5198,175.448,0,0,0,1,0,0
0,2,2017,4,116,26,17,False,6,0,15,28.125,28.0,52,40,44,35.694,139.754,7,6,3,6,48,0,0,0,0,0,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,20170400.0,8.32663,4.5198,175.448,0,0,0,1,0,0
0,3,2017,4,117,27,17,False,4,0,15,29.8684,30.0,47,38,44,35.694,139.754,7,6,3,6,48,0,0,0,0,0,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,20170400.0,8.32663,4.5198,175.448,0,0,0,1,0,0
0,4,2017,4,118,28,17,False,0,0,17,36.5,35.5,57,40,44,35.694,139.754,7,6,3,6,48,0,0,0,0,0,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,20170400.0,8.32663,4.5198,175.448,0,0,0,1,0,0
0,5,2017,4,119,29,17,False,2,1,3,14.9737,11.0,99,38,44,35.694,139.754,7,6,3,6,48,0,0,0,0,0,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,20170400.0,8.32663,4.5198,175.448,0,0,0,1,0,0
0,6,2017,4,120,30,17,True,3,0,2,2.0,2.0,2,1,44,35.694,139.754,7,6,3,6,48,0,0,0,0,0,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,20170400.0,8.32663,4.5198,175.448,0,0,0,1,0,0
0,0,2017,5,121,1,18,False,1,0,1,22.4571,19.0,47,35,44,35.694,139.754,7,6,3,6,48,0,0,0,0,0,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,20170500.0,8.32663,4.5198,175.448,0,0,0,1,0,0
0,1,2017,5,122,2,18,False,5,0,1,24.35,24.5,43,40,44,35.694,139.754,7,6,3,6,48,0,0,0,0,0,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,20170500.0,8.32663,4.5198,175.448,0,0,0,1,0,0




In [265]:
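x = htrain.columns
y = 'visitors'
x.remove(y)

def RMSLE(y_, pred):
    # The model predicts on the log1p scale (predictions go through expm1 below),
    # so plain RMSE on that scale corresponds to RMSLE on the raw counts.
    return metrics.mean_squared_error(y_, pred)**0.5

print('Starting h2o autoML model!')

aml = H2OAutoML(max_runtime_secs=3600, seed=0)
# Note: htest['visitors'] is a placeholder of zeros, so leaderboard metrics
# computed on htest do not measure real test performance.
aml.train(x=x, y=y, training_frame=htrain, leaderboard_frame=htest)

print('Generate predictions...')
# H2OFrame.drop returns a new frame, so the original un-assigned drop calls had no effect.
# predict() only needs the feature columns, so the response column can stay in the frames.
preds = aml.leader.predict(htrain)
preds = preds.as_data_frame()
print('RMSLE H2O automl leader: ', RMSLE(df_train['visitors'].values, preds['predict'].values))

preds = aml.leader.predict(htest)
preds = preds.as_data_frame()

# Use the 'predict' column explicitly, then map back from the log1p scale to visitor counts
df_test['visitors'] = preds['predict'].values
df_test['visitors'] = np.expm1(df_test['visitors']).clip(lower=0.)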

Starting h2o autoML model!
AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Generate predictions...
gbm prediction progress: |████████████████████████████████████████████████| 100%
RMSLE H2O automl leader:  0.4875156015616508
gbm prediction progress: |████████████████████████████████████████████████| 100%
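
The `RMSLE` helper above computes a plain RMSE. That matches the competition metric only if `visitors` is already on the log1p scale, which the later `np.expm1` call suggests. A minimal sketch of that equivalence follows; the visitor counts are made-up numbers and the lowercase helper names are illustrative, not taken from the notebook.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original count scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def rmse(y_true, y_pred):
    """Plain root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Toy visitor counts (made-up values, for illustration only)
visitors_true = np.array([10, 25, 3, 40])
visitors_pred = np.array([12, 20, 5, 38])

# RMSLE on raw counts ...
score_raw = rmsle(visitors_true, visitors_pred)
# ... equals RMSE computed after a log1p transform of both arrays
score_log = rmse(np.log1p(visitors_true), np.log1p(visitors_pred))
```

If the target were not log1p-transformed, the helper would report RMSE in visitor units instead, and the printed value would not be comparable to leaderboard scores.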


In [266]:
aml.leaderboard

model_id,mean_residual_deviance,rmse,mae,rmsle
GBM_grid_0_AutoML_20180218_142713_model_12,7.71868,2.77825,2.6716,1.29675
GBM_grid_0_AutoML_20180218_142713_model_16,7.72193,2.77884,2.67659,1.29801
GBM_grid_0_AutoML_20180218_142713_model_27,7.7349,2.78117,2.68215,1.29953
GBM_grid_0_AutoML_20180218_142713_model_25,7.74062,2.7822,2.68552,1.30058
GBM_grid_0_AutoML_20180218_142713_model_10,7.75977,2.78564,2.70154,1.30524
GBM_grid_0_AutoML_20180218_142713_model_3,7.76081,2.78582,2.68836,1.30137
GBM_grid_0_AutoML_20180218_142713_model_24,7.76473,2.78653,2.70872,1.3074
GBM_grid_0_AutoML_20180218_142713_model_26,7.76518,2.78661,2.74388,1.31814
StackedEnsemble_AllModels_0_AutoML_20180218_142713,7.76623,2.7868,2.69041,1.30185
GBM_grid_0_AutoML_20180218_142713_model_35,7.76638,2.78682,2.78247,1.33017




In [267]:
aml.leader

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_grid_0_AutoML_20180218_142713_model_12


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.2331872830443194
RMSE: 0.482894691464215
MAE: 0.35804586117738657
RMSLE: 0.15028048421100834
Mean Residual Deviance: 0.2331872830443194

ModelMetricsRegression: gbm
** Reported on validation data. **

MSE: 0.2555888973449663
RMSE: 0.5055580059152128
MAE: 0.37382854306331365
RMSLE: 0.15718308692105717
Mean Residual Deviance: 0.2555888973449663

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 0.2605610696397608
RMSE: 0.5104518289121519
MAE: 0.3770617801689382
RMSLE: 0.157832029557106
Mean Residual Deviance: 0.2605610696397608
Cross-Validation Metrics Summary: 


0,1,2,3,4,5,6,7
,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,0.3770618,0.0011118,0.3780981,0.3762416,0.3744774,0.3774941,0.3789977
mean_residual_deviance,0.2605611,0.0016315,0.260425,0.2613394,0.2565641,0.2607569,0.26372
mse,0.2605611,0.0016315,0.260425,0.2613394,0.2565641,0.2607569,0.26372
r2,0.600893,0.0022056,0.5968726,0.6007088,0.6061941,0.5989762,0.6017132
residual_deviance,0.2605611,0.0016315,0.260425,0.2613394,0.2565641,0.2607569,0.26372
rmse,0.5104469,0.0016000,0.5103185,0.5112136,0.5065216,0.5106437,0.5135367
rmsle,0.1578285,0.0007421,0.1577742,0.1580161,0.1563944,0.1573464,0.1596116


Scoring History: 


0,1,2,3,4,5,6,7,8,9
,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance,validation_rmse,validation_mae,validation_deviance
,2018-02-18 14:54:13,11 min 51.444 sec,0.0,0.8080147,0.6520295,0.6528877,0.8065630,0.6509310,0.6505439
,2018-02-18 14:54:13,11 min 51.549 sec,5.0,0.5166784,0.3819990,0.2669566,0.5181354,0.3832785,0.2684642
,2018-02-18 14:54:14,11 min 51.658 sec,10.0,0.5053873,0.3742740,0.2554163,0.5105719,0.3780267,0.2606837
,2018-02-18 14:54:14,11 min 51.766 sec,15.0,0.5008362,0.3709402,0.2508369,0.5089720,0.3769887,0.2590525
,2018-02-18 14:54:14,11 min 51.881 sec,20.0,0.4962641,0.3675901,0.2462781,0.5070175,0.3754647,0.2570668
,2018-02-18 14:54:14,11 min 52.011 sec,25.0,0.4918862,0.3646518,0.2419520,0.5069887,0.3752952,0.2570376
,2018-02-18 14:54:14,11 min 52.141 sec,30.0,0.4868324,0.3611151,0.2370058,0.5061417,0.3746670,0.2561794
,2018-02-18 14:54:14,11 min 52.198 sec,35.0,0.4854487,0.3599098,0.2356604,0.5055516,0.3740753,0.2555824
,2018-02-18 14:54:14,11 min 52.279 sec,40.0,0.4839460,0.3587025,0.2342037,0.5056551,0.3737574,0.2556870


Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
median_visitors,65928.0859375,1.0,0.7475248
max_visitors,5371.8168945,0.0814799,0.0609083
mean_visitors,3977.7641602,0.0603349,0.0451018
rv2_y,1902.2526855,0.0288534,0.0215687
rv1_x,1095.4378662,0.0166156,0.0124206
---,---,---,---
total_reserv_sum,18.5574551,0.0002815,0.0002104
total_reserv_mean,12.2695417,0.0001861,0.0001391
air_area_name8,10.4317493,0.0001582,0.0001183



See the whole table with table.as_data_frame()




**Observations:** The stacked ensemble on the leaderboard does not perform very well; the single GBM models score better.

In [270]:
df_test[['id','visitors']].to_csv('h2o_automl_pubmine.csv',index=False)

In [285]:
preds_train= aml.leader.predict(htrain)
h2o_train=df_train[['air_store_id','visit_date']].copy()  # copy so a column can be added without a SettingWithCopyWarning

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [286]:
preds_train=preds_train.as_data_frame()
h2o_train['visitors']=preds_train['predict'].values  # H2O returns predictions in the 'predict' column

In [288]:
h2o_train.to_csv('stacking_input/h2o_train_kfld.csv',index=False)

**H2O Auto ML Model**: Public Leaderboard vs Private Leaderboard 
![Public LB](submissions/h2o_automl.PNG)

**Observations:** H2O AutoML uses KFold CV and random grid search, so, as I expected, the results were not very good. They are still better than my 5 model KFold average.
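
One likely reason KFold CV does poorly here is that shuffled folds let the model train on rows that are later in time than the rows it is validated on. The sketch below checks this on a toy, time-ordered index using scikit-learn's `KFold` and `TimeSeriesSplit`. The `leaks_future` helper and the 12-row dataset are illustrative assumptions, and the exact fold assignment H2O uses internally is not reproduced here.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Twelve time-ordered samples (row 0 is the oldest observation)
X = np.arange(12).reshape(-1, 1)

def leaks_future(train_idx, valid_idx):
    """True if any training row is later in time than the earliest validation row."""
    return bool(train_idx.max() > valid_idx.min())

# Shuffled KFold: validation rows are scattered across the whole timeline
kfold_leaks = [
    leaks_future(tr, va)
    for tr, va in KFold(n_splits=3, shuffle=True, random_state=0).split(X)
]

# TimeSeriesSplit: every validation block comes strictly after its training block
tscv_leaks = [
    leaks_future(tr, va)
    for tr, va in TimeSeriesSplit(n_splits=3).split(X)
]
```

With shuffled folds at least some splits train on the future, while the time series splits never do, so the latter give a more honest estimate for a forecasting task.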

# Hklee Public kernel

**The code below is used to average the results from above with the predictions of this public kernel**

In [289]:
# air_visit_data: Visitors data as main training set
air_visit_data=pd.read_csv('air_visit_data.csv')

#air store info and hpg store informations
air_store_info=pd.read_csv('air_store_info.csv')
hpg_store_info=pd.read_csv('hpg_store_info.csv')

#reservation info
air_reserve=pd.read_csv('air_reserve.csv')
hpg_reserve=pd.read_csv('hpg_reserve.csv')

#store id relation between hpg and air

store_id_relation=pd.read_csv('store_id_relation.csv')
date_info=pd.read_csv('date_info.csv')
sample_submission=pd.read_csv('sample_submission.csv')

In [290]:
wkend_holidays = date_info.apply(
    (lambda x:(x.day_of_week=='Sunday' or x.day_of_week=='Saturday') and x.holiday_flg==1), axis=1)
date_info.loc[wkend_holidays, 'holiday_flg'] = 0
# Weighted means without a rolling window: later dates get much larger weights via the power of 5
date_info['weight'] = ((date_info.index + 1) / len(date_info)) ** 5  

visit_data = air_visit_data.merge(date_info, left_on='visit_date', right_on='calendar_date', how='left')
visit_data.drop('calendar_date', axis=1, inplace=True)
visit_data['visitors'] = visit_data.visitors.map(np.log1p)  # pd.np is deprecated; use numpy directly

wmean = lambda x:( (x.weight * x.visitors).sum() / x.weight.sum() )
visitors = visit_data.groupby(['air_store_id', 'day_of_week', 'holiday_flg']).apply(wmean).reset_index()
visitors.rename(columns={0:'visitors'}, inplace=True)

sample_submission['air_store_id'] = sample_submission.id.map(lambda x: '_'.join(x.split('_')[:-1]))
sample_submission['calendar_date'] = sample_submission.id.map(lambda x: x.split('_')[2])
sample_submission.drop('visitors', axis=1, inplace=True)
sample_submission = sample_submission.merge(date_info, on='calendar_date', how='left')
sample_submission = sample_submission.merge(visitors, on=[
    'air_store_id', 'day_of_week', 'holiday_flg'], how='left')

missings = sample_submission.visitors.isnull()
sample_submission.loc[missings, 'visitors'] = sample_submission[missings].merge(
    visitors[visitors.holiday_flg==0], on=['air_store_id', 'day_of_week'], 
    how='left')['visitors_y'].values

missings = sample_submission.visitors.isnull()
sample_submission.loc[missings, 'visitors'] = sample_submission[missings].merge(
    visitors[['air_store_id', 'visitors']].groupby('air_store_id').mean().reset_index(), 
    on='air_store_id', how='left')['visitors_y'].values

sample_submission['visitors'] = sample_submission.visitors.map(np.expm1)  # back from the log1p scale to visitor counts
sub2 = sample_submission[['id', 'visitors']].copy()
sub_merge = pd.merge(sub1, sub2, on='id', how='inner')

sub_merge['visitors'] = 0.7*sub_merge['visitors_x'] + 0.3*sub_merge['visitors_y']* 1.1
sub_merge[['id', 'visitors']].to_csv('submission_hk_stack.csv', index=False)

In [296]:
visitors.shape

(9103, 4)
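
The kernel's key idea is a weighted mean in which recent dates dominate because the weight grows with the fifth power of the date position. A minimal sketch on a toy frame follows. The store id, the four visitor values, and the row-position weights are made up for illustration; the kernel itself derives the weights from positions in `date_info` and works on log1p-transformed visitors.

```python
import numpy as np
import pandas as pd

# Toy visit log for one hypothetical store, oldest row first (made-up values)
toy = pd.DataFrame({
    "air_store_id": ["air_a"] * 4,
    "day_of_week": ["Monday"] * 4,
    "visitors": [10.0, 10.0, 30.0, 30.0],
})

# Same weighting form as the kernel: (position / n) ** 5, so late rows dominate
toy["weight"] = ((np.arange(len(toy)) + 1) / len(toy)) ** 5

# Weighted mean per group = sum(weight * value) / sum(weight)
toy["weighted_visitors"] = toy["weight"] * toy["visitors"]
sums = toy.groupby(["air_store_id", "day_of_week"])[["weighted_visitors", "weight"]].sum()
weighted = (sums["weighted_visitors"] / sums["weight"]).iloc[0]

# Unweighted mean for comparison
plain = toy["visitors"].mean()
```

Here the plain mean sits halfway between the old and new levels, while the weighted mean lands close to the most recent level, which is the behavior the kernel relies on when a store's traffic drifts over time.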

**5 Model with KFold + Random CV avg with HK lee Public kernel**: Public Leaderboard vs Private Leaderboard 
![Public LB](submissions/model_avg5_hklee.PNG)

**Observations:** When I averaged my Time series CV strategy-1 and method-2 output with the public HkLee kernel, my score improved considerably compared with my custom Time series split alone. Hence this public kernel result could be valuable.
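
The blending step used above is a weighted average of two submissions joined on `id`. A small sketch with made-up predictions follows; the `blend` function, its parameter names, and the ids are hypothetical and only mirror the shape of the `sub_merge` line.

```python
import pandas as pd

def blend(sub_a, sub_b, w_a=0.7, w_b=0.3, scale_b=1.0):
    """Weighted average of two submissions, aligned on the `id` column."""
    merged = pd.merge(sub_a, sub_b, on="id", how="inner", suffixes=("_a", "_b"))
    merged["visitors"] = (
        w_a * merged["visitors_a"] + w_b * scale_b * merged["visitors_b"]
    )
    return merged[["id", "visitors"]]

# Two toy submissions with rows in different order (made-up values)
sub_a = pd.DataFrame({"id": ["s1_2017-04-23", "s2_2017-04-23"], "visitors": [10.0, 20.0]})
sub_b = pd.DataFrame({"id": ["s2_2017-04-23", "s1_2017-04-23"], "visitors": [40.0, 20.0]})

blended = blend(sub_a, sub_b)
```

Merging on `id` rather than relying on row order keeps the two files aligned. Note that with `scale_b=1.1`, as in the cell above, the effective weights sum to 0.7 + 0.33 = 1.03, so the blend is scaled up slightly rather than being a pure average.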

# Final Conclusions

* This problem should be treated as a regression problem with time series effects in it

* The cross-validation technique to use is time series cross-validation with custom splitting to address the Golden Week and New Year effects

* H2O AutoML created a reference stacked ensemble, and it performed worse than a single LGBM model with time series CV and random grid search

* I think a lot more feature engineering is needed in order to get better scores for the stacked model
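
As a rough illustration of the custom splitting idea, the sketch below validates on chosen date windows and trains only on strictly earlier dates. The `date_window_splits` function and the two windows are hypothetical, chosen to sit around Golden Week and New Year; they are not the splits actually used in this notebook.

```python
import numpy as np
import pandas as pd

def date_window_splits(dates, windows):
    """Yield (train_idx, valid_idx): validate on each window, train on earlier dates only."""
    dates = pd.to_datetime(pd.Series(dates)).reset_index(drop=True)
    for start, end in windows:
        start, end = pd.Timestamp(start), pd.Timestamp(end)
        train_idx = np.flatnonzero((dates < start).to_numpy())
        valid_idx = np.flatnonzero(((dates >= start) & (dates <= end)).to_numpy())
        # Skip windows that have no history before them or no rows inside them
        if len(train_idx) and len(valid_idx):
            yield train_idx, valid_idx

# Hypothetical validation windows (illustrative dates only)
windows = [("2016-04-29", "2016-05-05"), ("2016-12-29", "2017-01-04")]

# One row per day over a toy date range
dates = pd.date_range("2016-01-01", "2017-04-22", freq="D")
splits = list(date_window_splits(dates, windows))
```

Each pair of index arrays can be passed to any estimator loop in place of the folds from `KFold`, so the validation score reflects how the model handles holiday periods it has not seen.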

# Preliminary Leaderboard

| Rank|Model| Cross Validation Strategy| Private Score|Public Score|
| -------------| ------------- |:-------------:| -----:| ------------- |
| 1 | Hklee + 5 model avg*| No CV with fixed params | 0.529|0.489|
| 2 | Light GBM| Time series CV + Random grid with custom Time split** | 0.534|0.495|
| 3 | 5 model avg*| No CV with fixed params | 0.542|0.496|
| 4 | Light GBM| Time series CV with timeseries split on the dates | 0.546|0.500|
| 5 | Light GBM|  Time series CV with timeseries split on the data directly | 0.557|0.528|
| 6 | **H2OAutoML***| Kfold CV with Random grid search | 0.566|0.524|
| 7 | XGBmodel|  No CV with fixed params | 0.588|0.542|
| 8 | 5 model avg*|  Kfold CV with Random grid | 0.737|0.707|
| 9 | **Benchmark ARIMA**|  No CV with fixed params | 0.787|0.552|

**Note:** The 5 model avg comprises GBM, KNN, XGBoost, Random Forests, and Light GBM