### Making Predictions, Preparing a Kaggle Submission File

Submissions must follow the format Kaggle specifies: a comma-separated file with two columns.
The first column, id, is the air_store_id joined to the visit date with an underscore. The second, visitors, is the predicted number of visitors to that restaurant on that date.

The prediction dates run from 2017-04-23 through 2017-05-31.

Each air_store_id should have a prediction for every date in that range.
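
As a minimal sketch of that layout, the snippet below builds one id per store and date over the prediction window. The helper name `build_submission_ids` is made up for illustration, and the two store ids are simply ones that appear in this notebook's output.

```python
import pandas as pd

def build_submission_ids(store_ids, start="2017-04-23", end="2017-05-31"):
    """Return one '<air_store_id>_<YYYY-MM-DD>' id per store and date."""
    dates = pd.date_range(start, end, freq="D").strftime("%Y-%m-%d")
    return [f"{store}_{day}" for store in store_ids for day in dates]

# Two store ids used only for illustration.
store_ids = ["air_00a91d42b08b08d9", "air_0164b9927d20bcc3"]
ids = build_submission_ids(store_ids)

# Placeholder predictions of 0, mirroring the sample submission.
submission = pd.DataFrame({"id": ids, "visitors": 0})
print(submission.head())
print(len(submission))  # 2 stores x 39 days = 78 rows
```

The window covers 39 days (8 in April, 31 in May), so a complete file has 39 rows per store.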

This first file is a very naive prediction, used mainly as a first pass and for testing.
Here, the predicted number of visitors for each restaurant is based on the historical mean number of visitors to that restaurant on that weekday. The code below restricts the history to 2017 and does no further feature engineering. Stay tuned for future parts for improvements on this!
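
To make the weekday-mean idea concrete, here is a small sketch on made-up data. The store ids and visitor counts are invented, and the column names only mirror the ones used in this notebook.

```python
import pandas as pd

# Toy visit data (made-up numbers, not from the competition files).
visits = pd.DataFrame({
    "air_store_id": ["air_a"] * 4 + ["air_b"] * 2,
    "visit_date": pd.to_datetime([
        "2017-01-02", "2017-01-09",  # two Mondays
        "2017-01-03", "2017-01-10",  # two Tuesdays
        "2017-01-02", "2017-01-09",  # two Mondays
    ]),
    "visitors": [10, 20, 5, 7, 30, 50],
})

# Derive the weekday name, then average visitors per store and weekday.
visits["day_of_week"] = visits["visit_date"].dt.day_name()
weekday_mean = visits.groupby(
    ["air_store_id", "day_of_week"], as_index=False
)["visitors"].mean()
print(weekday_mean)
```

A store with no history for a given weekday gets no row at all here, which is why a fallback value is needed later when the means are joined onto the submission dates.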

In [1]:
#setup
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
#import missingno

aReserveDF = pd.read_csv('air_reserve.csv', parse_dates = ['visit_datetime', 'reserve_datetime']) 
aVisitDF = pd.read_csv('air_visit_data.csv', parse_dates = ['visit_date']) 
aStoreDF = pd.read_csv('air_store_info.csv')

hReserveDF = pd.read_csv('hpg_reserve.csv', parse_dates = ['visit_datetime', 'reserve_datetime']) 
hStoreDF = pd.read_csv('hpg_store_info.csv') 

dateInfoDF = pd.read_csv('date_info.csv', parse_dates = ['calendar_date'])

sampleSubmissionDF = pd.read_csv('sample_submission.csv') 

storeIdRelationDF = pd.read_csv('store_id_relation.csv') 
hReserveDF['visit_year'] = hReserveDF['visit_datetime'].dt.year
hReserveDF['visit_month'] = hReserveDF['visit_datetime'].dt.month
hReserveDF['visit_day'] = hReserveDF['visit_datetime'].dt.day
hReserveDF['reserve_year'] = hReserveDF['reserve_datetime'].dt.year
hReserveDF['reserve_month'] = hReserveDF['reserve_datetime'].dt.month
hReserveDF['reserve_day'] = hReserveDF['reserve_datetime'].dt.day
#hReserveDF.drop(['visit_datetime','reserve_datetime'], axis=1, inplace=True)

hReserveDF = hReserveDF.groupby(['hpg_store_id', 'visit_year', 'visit_month','visit_day','reserve_year','reserve_month','reserve_day', 'reserve_datetime', 'visit_datetime'], as_index=False).sum()
aReserveDF['visit_year'] = aReserveDF['visit_datetime'].dt.year
aReserveDF['visit_month'] = aReserveDF['visit_datetime'].dt.month
aReserveDF['visit_day'] = aReserveDF['visit_datetime'].dt.day
aReserveDF['reserve_year'] = aReserveDF['reserve_datetime'].dt.year
aReserveDF['reserve_month'] = aReserveDF['reserve_datetime'].dt.month
aReserveDF['reserve_day'] = aReserveDF['reserve_datetime'].dt.day

#aReserveDF.drop(['visit_datetime','reserve_datetime'], axis=1, inplace=True)
dateInfoDF['calendar_year'] = dateInfoDF['calendar_date'].dt.year
dateInfoDF['calendar_month'] = dateInfoDF['calendar_date'].dt.month
dateInfoDF['calendar_day'] = dateInfoDF['calendar_date'].dt.day
#dateInfoDF.drop(['calendar_date'], axis=1, inplace=True)
aVisitDF['visit_year'] = aVisitDF['visit_date'].dt.year
aVisitDF['visit_month'] = aVisitDF['visit_date'].dt.month
aVisitDF['visit_day'] = aVisitDF['visit_date'].dt.day
aVisitDF.drop(['visit_date'], axis=1, inplace=True)

# Map hpg reservations onto air store ids; the inner join keeps only stores listed in both systems.
hReserveDF = pd.merge(hReserveDF, storeIdRelationDF, on='hpg_store_id', how='inner')
hReserveDF.drop(['hpg_store_id'], axis=1, inplace=True)
# Stack air and hpg reservations, then sum reserve_visitors over identical store/visit/reserve timestamps.
aReserveDF = pd.concat([aReserveDF, hReserveDF])
aReserveDF = aReserveDF.groupby(['air_store_id', 'visit_year', 'visit_month','visit_day', 'visit_datetime', 'reserve_datetime'],
                         as_index=False).sum().drop(['reserve_day','reserve_month','reserve_year'], axis=1)
# Attach calendar info (day_of_week, holiday_flg) and store info to each reservation row.
aReserveDF = pd.merge(aReserveDF, dateInfoDF, left_on=['visit_year','visit_month','visit_day'], right_on=['calendar_year','calendar_month','calendar_day'], how='left')
aReserveDF.drop(['calendar_year','calendar_month','calendar_day'], axis=1, inplace=True)
aReserveDF = pd.merge(aReserveDF, aStoreDF, on='air_store_id', how='left')
# Attach the observed visitor counts; rows with no matching visit record become 0 after fillna.
trainDF = pd.merge(aReserveDF, aVisitDF, on=['air_store_id','visit_year','visit_month','visit_day'], how='left')
trainDF.fillna(0,inplace=True)

trainDF.sort_values(by=['visit_year','visit_month', 'visit_day', 'air_store_id'],ascending=[True,True,True,True],inplace=True)
grouped=trainDF.groupby(['visit_year','visit_month', 'visit_day','air_store_id','visitors', 'day_of_week', 'holiday_flg', 'air_genre_name'], as_index=False)['reserve_visitors'].sum()

grouped['day_of_week'] = grouped['day_of_week'].astype('category')
grouped['day_of_week_codes'] = grouped['day_of_week'].cat.codes
grouped = grouped.loc[grouped['visit_year']==2017]
grouped2 = grouped.groupby(['air_store_id', 'day_of_week_codes'])['visitors'].mean()



In [2]:
grouped2 = grouped2.to_frame().reset_index()
grouped2

Unnamed: 0,air_store_id,day_of_week_codes,visitors
0,air_00a91d42b08b08d9,0,39.333333
1,air_00a91d42b08b08d9,1,18.000000
2,air_00a91d42b08b08d9,2,13.500000
3,air_00a91d42b08b08d9,4,31.166667
4,air_00a91d42b08b08d9,5,29.428571
5,air_00a91d42b08b08d9,6,35.400000
6,air_0164b9927d20bcc3,0,10.153846
7,air_0164b9927d20bcc3,1,7.600000
8,air_0164b9927d20bcc3,2,12.000000
9,air_0164b9927d20bcc3,4,11.250000


In [3]:
grouped3 = grouped.groupby(['air_store_id', 'day_of_week_codes'])['reserve_visitors'].mean()

In [4]:
grouped3 = grouped3.to_frame().reset_index()

In [5]:
grouped3.head()
grouped2= grouped2.merge(grouped3,on =['air_store_id','day_of_week_codes'], how='left')

### Create submission file

The submission file should match the format shown in the sampleSubmissionDF.head() output below. The id is a concatenation of the air_store_id and the calendar date.

Our training data keeps the store id and the date in separate columns, so we split each id into its air_store_id and calendar date, join the predictions on those parts, and keep the original id column for the output.

The sample submission file has '0' entered for visitors. That placeholder column is dropped, and a new column holding our predicted number of visitors is added in its place.
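
A short sketch of the split, using `str.rsplit` as one possible approach (the notebook's own cells use a `map` with a lambda instead). The two ids are taken from the sample submission output shown in this notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    "id": ["air_00a91d42b08b08d9_2017-04-23", "air_00a91d42b08b08d9_2017-04-24"],
    "visitors": [0, 0],
})

# Split on the last underscore only: the store id contains an underscore itself.
parts = sample["id"].str.rsplit("_", n=1, expand=True)
sample["air_store_id"] = parts[0]
sample["calendar_date"] = pd.to_datetime(parts[1])

# Rebuild the id from its parts to confirm the split is reversible.
rebuilt = sample["air_store_id"] + "_" + sample["calendar_date"].dt.strftime("%Y-%m-%d")
print((rebuilt == sample["id"]).all())
```

Splitting from the right avoids depending on how many underscores the store id happens to contain.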

In [6]:
from sklearn import neighbors
# Features are every column that is not an identifier, a date part, or the target.
col = [c for c in grouped2 if c not in ['air_store_id', 'visit_year','visit_month','visit_day', 'visitors','air_genre_name','holiday_flg','day_of_week']]
print(col)
# Fit on log1p(visitors); the predictions are mapped back with expm1 in the next cell.
model1 = neighbors.KNeighborsRegressor(n_jobs=-1,n_neighbors=2)
model1.fit(grouped2[col],np.log1p(grouped2['visitors'].values))

['day_of_week_codes', 'reserve_visitors']




KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=-1, n_neighbors=2, p=2,
          weights='uniform')

In [7]:
grouped2['mypredictions']= model1.predict(grouped2[col])
grouped2['mypredictions']=np.expm1(grouped2['mypredictions']).clip(lower=0.)
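
The log1p/expm1 pair used above can be checked in isolation. The sketch below shows that the two functions invert each other and adds an illustrative `rmsle` helper (a name introduced here, not part of the notebook) to show why fitting on a log scale is a natural fit when errors are scored logarithmically.

```python
import numpy as np

visitors = np.array([0.0, 3.0, 18.0, 39.333333])

log_target = np.log1p(visitors)   # log(1 + x), defined even when x is 0
recovered = np.expm1(log_target)  # exp(y) - 1, the inverse of log1p
print(np.allclose(recovered, visitors))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (illustrative helper)."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Clip negatives to 0 before scoring, as log1p is undefined below -1.
predictions = np.clip(np.array([-0.5, 3.0, 18.0, 39.333333]), 0, None)
print(rmsle(visitors, predictions))
```

Clipping at zero after the inverse transform guards against small negative values, since a visitor count cannot be negative.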

In [8]:
sampleSubmissionDF['air_store_id'] = sampleSubmissionDF.id.map(lambda x:'_'.join(x.split('_')[:-1]))

In [9]:
sampleSubmissionDF['calendar_date'] = sampleSubmissionDF.id.map(lambda x:x.split('_')[-1])

In [10]:
sampleSubmissionDF.head()

Unnamed: 0,id,visitors,air_store_id,calendar_date
0,air_00a91d42b08b08d9_2017-04-23,0,air_00a91d42b08b08d9,2017-04-23
1,air_00a91d42b08b08d9_2017-04-24,0,air_00a91d42b08b08d9,2017-04-24
2,air_00a91d42b08b08d9_2017-04-25,0,air_00a91d42b08b08d9,2017-04-25
3,air_00a91d42b08b08d9_2017-04-26,0,air_00a91d42b08b08d9,2017-04-26
4,air_00a91d42b08b08d9_2017-04-27,0,air_00a91d42b08b08d9,2017-04-27


In [11]:
sampleSubmissionDF['calendar_date'] = pd.to_datetime(sampleSubmissionDF['calendar_date'])

In [12]:
dateInfoDF['day_of_week'] = dateInfoDF['day_of_week'].astype('category')
dateInfoDF.dtypes
dateInfoDF['day_of_week_codes'] = dateInfoDF['day_of_week'].cat.codes
dateInfoDF.head()

Unnamed: 0,calendar_date,day_of_week,holiday_flg,calendar_year,calendar_month,calendar_day,day_of_week_codes
0,2016-01-01,Friday,1,2016,1,1,0
1,2016-01-02,Saturday,1,2016,1,2,2
2,2016-01-03,Sunday,1,2016,1,3,3
3,2016-01-04,Monday,0,2016,1,4,1
4,2016-01-05,Tuesday,0,2016,1,5,5
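
The codes above (Friday as 0, Tuesday as 5) come from pandas sorting the inferred categories alphabetically. They line up between two frames only when both contain the same set of weekday labels. The sketch below shows that dependence and one possible way to pin the mapping with an explicit `CategoricalDtype`; the `WEEKDAYS` list and `weekday_dtype` name are introduced here for illustration.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
weekday_dtype = pd.CategoricalDtype(categories=WEEKDAYS, ordered=True)

full = pd.Series(["Friday", "Saturday", "Sunday", "Monday", "Tuesday"])
partial = pd.Series(["Sunday", "Monday"])  # a frame missing most weekdays

# Inferred categories are sorted alphabetically, so a label's code
# depends on which other labels happen to be present.
inferred_full = full.astype("category").cat.codes.tolist()
inferred_partial = partial.astype("category").cat.codes.tolist()
print(inferred_full, inferred_partial)

# With an explicit dtype, each weekday always maps to the same code.
fixed_full = full.astype(weekday_dtype).cat.codes.tolist()
fixed_partial = partial.astype(weekday_dtype).cat.codes.tolist()
print(fixed_full, fixed_partial)
```

In the inferred case Sunday is code 3 in one series and code 1 in the other; with the shared dtype it is 6 in both, so a join on the code column stays meaningful.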


In [13]:
sampleSubmissionDF = sampleSubmissionDF.merge(dateInfoDF,on ='calendar_date', how='left')

In [14]:
sampleSubmissionDF.drop(['calendar_month', 'calendar_day','calendar_date','holiday_flg','calendar_year','day_of_week'], axis=1, inplace=True)
sampleSubmissionDF.head()

Unnamed: 0,id,visitors,air_store_id,day_of_week_codes
0,air_00a91d42b08b08d9_2017-04-23,0,air_00a91d42b08b08d9,3
1,air_00a91d42b08b08d9_2017-04-24,0,air_00a91d42b08b08d9,1
2,air_00a91d42b08b08d9_2017-04-25,0,air_00a91d42b08b08d9,5
3,air_00a91d42b08b08d9_2017-04-26,0,air_00a91d42b08b08d9,6
4,air_00a91d42b08b08d9_2017-04-27,0,air_00a91d42b08b08d9,4


In [15]:
sampleSubmissionDF = sampleSubmissionDF.merge(grouped2,on=['air_store_id','day_of_week_codes'],how='left')
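
With `how='left'`, every submission row is kept, and rows whose store and weekday have no history come back with NaN in the joined columns. A small sketch on made-up frames shows how the `indicator` option of `merge` can surface those rows before they are filled; the frame names and values are invented.

```python
import pandas as pd

# Rows that need a prediction (toy data).
submission = pd.DataFrame({
    "air_store_id": ["air_a", "air_a", "air_b"],
    "day_of_week_codes": [0, 3, 0],
})
# Available per-store, per-weekday predictions (toy data).
history = pd.DataFrame({
    "air_store_id": ["air_a", "air_b"],
    "day_of_week_codes": [0, 0],
    "mypredictions": [12.5, 40.0],
})

merged = submission.merge(
    history, on=["air_store_id", "day_of_week_codes"],
    how="left", indicator=True,
)
# 'left_only' marks submission rows with no matching history.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["air_store_id", "day_of_week_codes"]])
```

Counting the unmatched rows gives a quick sense of how many predictions will end up as the fallback value rather than a value learned from history.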

In [16]:
sampleSubmissionDF.drop(['day_of_week_codes'],axis=1,inplace=True)
sampleSubmissionDF.drop(['visitors_x'],axis=1,inplace=True)

In [17]:
sampleSubmissionDF.drop(['visitors_y'],axis=1,inplace=True)
sampleSubmissionDF.drop(['air_store_id'],axis=1,inplace=True)

sampleSubmissionDF.drop(['reserve_visitors'],axis=1,inplace=True)

In [18]:
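# Assign the result back; calling fillna with inplace=True on a selected column may not update the frame in recent pandas.
sampleSubmissionDF['mypredictions'] = sampleSubmissionDF['mypredictions'].fillna(0)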

In [19]:
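# Write the predictions under the 'visitors' header used by the sample submission.
sampleSubmissionDF.rename(columns={'mypredictions': 'visitors'}).to_csv('prediction1.csv',float_format='%.4f',index=False)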

In [20]:
sampleSubmissionDF.head()

Unnamed: 0,id,mypredictions
0,air_00a91d42b08b08d9_2017-04-23,0.0
1,air_00a91d42b08b08d9_2017-04-24,24.971138
2,air_00a91d42b08b08d9_2017-04-25,12.5119
3,air_00a91d42b08b08d9_2017-04-26,26.249954
4,air_00a91d42b08b08d9_2017-04-27,40.483933
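
Before uploading, it can help to run a few sanity checks on the frame. The sketch below assumes the file should carry the same two headers as the sample submission (id and visitors); the `check_submission` helper and the toy ids are hypothetical.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found; an empty list means none."""
    problems = []
    if list(submission.columns) != ["id", "visitors"]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems  # the remaining checks assume both columns exist
    if submission["visitors"].isna().any():
        problems.append("missing predictions")
    if (submission["visitors"] < 0).any():
        problems.append("negative predictions")
    if submission["id"].duplicated().any():
        problems.append("duplicate ids")
    if set(submission["id"]) != set(expected_ids):
        problems.append("ids do not match the sample submission")
    return problems

# Toy ids and values, for illustration only.
expected = ["air_a_2017-04-23", "air_a_2017-04-24"]
good = pd.DataFrame({"id": expected, "visitors": [0.0, 24.97]})
bad = good.rename(columns={"visitors": "mypredictions"})

print(check_submission(good, expected))
print(check_submission(bad, expected))
```

In practice, `expected_ids` would be the id column of sample_submission.csv, read fresh so that it is unaffected by the drops and merges above.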
