<h1>OVERVIEW</h1>
<p>Recruit Holdings owns Hot Pepper Gourmet (a restaurant review service), AirREGI (a restaurant point-of-sale service) and Restaurant Board (reservation log management software).</p>
<p>Recruit Holdings wants to expand its service offering by predicting the number of visitors for each individual restaurant. Restaurants would benefit from more accurate supply ordering and better staffing decisions. For Recruit Holdings, this means a higher customer retention rate and a stronger competitive advantage.</p>
<p>We are pleased to propose a data analysis solution to support Recruit Holdings in achieving these goals.</p>
<h1>THE SOLUTION</h1>
<p>The data analysis solution comprises the following steps:</p>
<h2>Cleaning and preparing data</h2>
<ul>
<li>Choosing the required data sets.</li>
<li>Removing outliers (visitor counts that are unusually high, possibly due to a one-time event or a technical issue).</li>
<li>Analyzing missing data and filling it in with the most plausible values for each restaurant. We filled in missing data under the following assumption: on any given day, a restaurant's visitor count follows the average visitor count of restaurants in the same area and of the same genre. This number is then scaled to the size of the restaurant, which we define as the average number of visitors to that restaurant.</li>
</ul>
<h2>Data analysis includes</h2>
<ul>
<li><i>Simple and straightforward method</i>, which could be implemented with on-site expertise. It is based on the following algorithm: the number of visitors on the prediction day equals the average number of visitors on the same day of the week during the last year. If the prediction day is a holiday, the number equals the average number of visitors on holidays during the last year.</li>
<li><i>Advanced method</i>, which requires a third-party library and possibly third-party technical support in the future. For the advanced method, we considered [Prophet](https://research.fb.com/prophet-forecasting-at-scale/method) the most suitable approach for the task. It is open-source software developed by [Facebook's Core Data Science team.](https://research.fb.com/category/data-science)</li>
<li>Comparing the accuracy estimates of both methods.</li>
</ul>
<h1>CONCLUSIONS</h1>
<ul>
<li>For the purpose of making a prediction, there is no need to collect reservation data.</li>
<li>The Simple method predicts with decent accuracy and is easy to implement and support in the future product. It does not take seasonal changes into account.</li>
<li>The Advanced method improves forecasting accuracy by about 6%. However, it requires a third-party library and proper on-site expertise to update the model and support it in the future. The Advanced method uses additional information on the number of visitors a few days before and/or after holidays. To improve accuracy, it needs input from someone who knows how visitors behave during Japanese holidays. One year of historical data is not enough to statistically identify visitor behavior before and/or after different kinds of holidays.</li>
<li>The accuracy of both methods will improve once several years of data are available.</li>
</ul>
<h1>RECOMMENDATION</h1>
<p>If Recruit Holdings already has the expertise to implement and support the advanced method, we recommend using it. If not, Recruit Holdings should weigh the cost of implementing and supporting the advanced method against its benefit before deciding which method to use.</p>
<h1>RATIONALE</h1>
<h2>Cleaning and preparing data </h2>
<h3>Import required libraries</h3>
<p>Anywhere below, press the <b>Code</b> button to see the actual code and/or the <b>Output</b> button to see its output. You might want to try it now.</p>

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import zscore
from prophet import Prophet  # the package was published as "fbprophet" before version 1.0
print('Required libraries have been imported')

<h3>Capturing data</h3>
<p>For the purpose of making a prediction, there is no need to collect and analyse reservation data. Although reservations seem relevant, too many unknown factors contribute significantly to the visitor count. For example, visitors may have reserved through another reservation network, called the restaurant directly, or come without a reservation. Moreover, these factors change over time and cannot be reliably predicted. Hence these data should not be used in the future product. In other words, we could estimate the number of visitors who came through Recruit Holdings reservation systems, but we could not reliably predict the total number of visitors based only on the HPG or AirREGI reservations.</p>
<p>Let's read only the data files that we need and expose them as local variables.</p>


In [None]:
path ='../input/'
dfs = {
    'air_visit_data': pd.read_csv(path+'air_visit_data.csv'),
    'air_store_info': pd.read_csv(path+'air_store_info.csv'),
    'sample_submission': pd.read_csv(path+'sample_submission.csv'),
    'date_info': pd.read_csv(path+'date_info.csv')
}
print('files read:{}'.format(list(dfs.keys())))
for key, name in dfs.items(): locals()[key] = name
print('Data captured')

<h3>Removing outliers </h3>
<p>To identify outliers, let’s use the widely used <b>Z-score</b>. The idea is to calculate how many standard deviations a particular value lies from the mean, computed separately for each restaurant. Values with a Z-score above 3 are commonly treated as outliers. Let’s remove them from the data.</p>


In [None]:
outliers = (air_visit_data.groupby( ['air_store_id'])['visitors'].transform(zscore) > 3)
air_visit_data[outliers]=np.nan
air_visit_data.dropna(inplace=True)
print(str(outliers.sum())+' outliers have been removed')
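
<p>To make the Z-score rule concrete, here is a minimal sketch on made-up data (the store ids and counts are hypothetical). It uses the population standard deviation, which is also the default of <code>scipy.stats.zscore</code>, and keeps every row whose Z-score is not above 3.</p>

```python
import pandas as pd

# Toy data: store "a" has ten ordinary days and one spike, store "b" is steady.
visits = pd.DataFrame({
    "air_store_id": ["a"] * 11 + ["b"] * 3,
    "visitors": [10] * 10 + [100] + [20, 22, 21],
})

def population_zscore(values):
    # Same definition as scipy.stats.zscore with its default ddof=0.
    return (values - values.mean()) / values.std(ddof=0)

# Z-score of every row, computed within its own store.
z = visits.groupby("air_store_id")["visitors"].transform(population_zscore)

# Keep rows that are not flagged as outliers.
cleaned = visits[~(z > 3)]

print(round(z.iloc[10], 3))                       # Z-score of the spike
print(len(visits) - len(cleaned), "row(s) removed")
```

<p>Note that with the population standard deviation, a single value among n observations can reach a Z-score of at most the square root of n - 1, so a restaurant needs at least 11 records before any of them can exceed the threshold of 3.</p>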

<h3>Filling in missed data</h3>
<p>A quick look at the AirREGI visit data shows that the number of visitors is NOT recorded for many restaurant-day combinations. Missing data will significantly impact prediction performance. This is especially important for rare events such as the Golden Week holidays.</p>
<p><i>Let's recover these data.</i></p>
<p>We use the following assumption for the recovery: on any given day, a restaurant's visitor count follows the average visitor count of restaurants in the same area and of the same genre. This number is then scaled to the size of the restaurant, which we define as the average number of visitors to that restaurant.</p>
<p>First, let's split the restaurants into several <i>clusters</i>. Each cluster has a unique combination of genre_name and area_name. Then we define a function that fills in NaNs (missing values) where possible. If no similar restaurant in the same area had visitors on a certain date, the value cannot be filled in.</p>
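
<p>The arithmetic behind this assumption can be sketched in plain Python on a hypothetical two-restaurant cluster (the names and counts are made up). Each recorded day is expressed relative to the restaurant's own average, the cluster's relative level on the missing day is averaged, and the result is scaled back to the restaurant's size.</p>

```python
from statistics import mean

# Hypothetical cluster: two restaurants of the same genre and area.
# None marks a day with no record.
visits = {
    "small_cafe": [10, None, 12, 8],
    "big_cafe":   [40, 60, 44, 36],
}

# "Size" of each restaurant = its average over the recorded days.
size = {rid: mean(v for v in days if v is not None) for rid, days in visits.items()}

# Express every recorded day relative to the restaurant's own size.
relative = {rid: [None if v is None else v / size[rid] for v in days]
            for rid, days in visits.items()}

def cluster_level(day):
    """Average relative level of the cluster on one day, or None if nobody has data."""
    known = [r[day] for r in relative.values() if r[day] is not None]
    return mean(known) if known else None

filled = {}
for rid, days in visits.items():
    filled[rid] = []
    for day, v in enumerate(days):
        level = cluster_level(day)
        if v is None and level is not None:
            v = level * size[rid]   # scale back to this restaurant
        filled[rid].append(v)

print(filled["small_cafe"])
```

<p>Here the big cafe was at 60 / 45 of its usual level on the missing day, so the small cafe (average 10) is filled in with roughly 13.3 visitors.</p>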


In [None]:
#Fill in NaNs where possible with the cluster average on that day, adjusted to the size of the particular restaurant
def fill_nans_in_cluster(genre_name,area_name):
    #get the list of restaurants of the same genre in the same area
    neighbors_bool = (air_store_info['air_genre_name']==genre_name) & (air_store_info['air_area_name']==area_name)
    neighbors_ids=air_store_info[neighbors_bool]
    neighbors_restaurants= air_visit_data.merge(neighbors_ids,on='air_store_id',how='inner')[['air_store_id','visit_date','visitors']]
 
    #pivot neighbors_restaurants (dates as rows, restaurants as columns) so that missing dates are easy to fill in
    neighbors_restaurants=neighbors_restaurants.pivot_table(index='visit_date',columns='air_store_id', values='visitors',aggfunc='sum')
    
    #Add missing dates (if any) as rows of NaNs
    idx = pd.date_range('2016-01-01', '2017-04-22')
    neighbors_restaurants.index = pd.DatetimeIndex(neighbors_restaurants.index)
    neighbors_restaurants = neighbors_restaurants.reindex(idx, fill_value=np.nan)

    # Get the visitor rate, normalized to each restaurant's average number of visitors per day
    neighbors_restaurants_average= neighbors_restaurants.mean(axis=0).tolist()
    normalized_neighbors_restaurants = neighbors_restaurants.div(neighbors_restaurants_average,axis=1)

    # Fill in NaNs with the average normalized number of visitors in the neighboring restaurants on that day
    # fillna does not implement the axis argument here, so we have to transpose
    normalized_neighbors_restaurants_with_filled_nans=normalized_neighbors_restaurants.T.fillna(normalized_neighbors_restaurants.mean(axis=1))
    
    #convert normalized values back to visitor counts by multiplying by each restaurant's average
    neighbors_restaurants_with_filled_nans = normalized_neighbors_restaurants_with_filled_nans.mul(neighbors_restaurants_average,axis=0).reset_index()

    #return visit data in the original (long) format
    df_columns = neighbors_restaurants_with_filled_nans.columns[1:]
    return  pd.melt(neighbors_restaurants_with_filled_nans,id_vars=['air_store_id'], value_vars=df_columns)
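
<p>The reshaping steps used in the function (long to wide, completing the calendar, wide back to long) may be easier to follow on a tiny example. The sketch below uses a made-up three-row log. Unlike the function above, it keeps dates as rows when melting, purely for readability.</p>

```python
import pandas as pd

# Toy long-format visit log: store "b" has no record on 2016-01-02,
# and nobody has a record on 2016-01-03.
log = pd.DataFrame({
    "air_store_id": ["a", "a", "b"],
    "visit_date": ["2016-01-01", "2016-01-02", "2016-01-01"],
    "visitors": [5, 7, 3],
})

# Long -> wide: one row per date, one column per store; absent pairs become NaN.
wide = log.pivot_table(index="visit_date", columns="air_store_id",
                       values="visitors", aggfunc="sum")

# Make sure every calendar day is present, even days with no record at all.
wide.index = pd.DatetimeIndex(wide.index)
wide = wide.reindex(pd.date_range("2016-01-01", "2016-01-03"))

# Wide -> long again, now with an explicit (possibly NaN) row per store and date.
long = (wide.rename_axis("visit_date")
            .reset_index()
            .melt(id_vars="visit_date", var_name="air_store_id", value_name="visitors"))
print(long)
```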

<p>Secondly, let’s process all clusters one by one.</p>

In [None]:
# Each cluster is a unique (genre, area) pair
clusters = air_store_info[['air_genre_name', 'air_area_name']].drop_duplicates()

# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
cluster_frames = []
for genre_name, area_name in clusters.itertuples(index=False):
    cluster_data = fill_nans_in_cluster(genre_name, area_name)
    cluster_data.rename(columns={'variable':'visit_date','value':'visitors'},inplace=True )
    cluster_frames.append(cluster_data)
full_data = pd.concat(cluster_frames, ignore_index=True)
print('Missing data filling complete')

<p>Only 821 restaurants are required for the submission, so let's focus on those.</p>

In [None]:
target_restaurants= pd.DataFrame({'air_store_id':sample_submission['id'].str[:-len('_2017-04-23')].unique()})
full_data=full_data.merge(target_restaurants,on='air_store_id',how='inner')
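
<p>The slicing above relies on every submission id ending in an underscore followed by a ten-character date. A minimal sketch of the same idea in plain Python, using a made-up id and a hypothetical helper name:</p>

```python
def split_submission_id(submission_id):
    """Split '<air_store_id>_<YYYY-MM-DD>' into its two parts."""
    # Store ids contain underscores themselves, so split only at the last one.
    store_id, visit_date = submission_id.rsplit("_", 1)
    return store_id, visit_date

# The id below is made up for illustration.
print(split_submission_id("air_0123456789abcdef_2017-04-23"))
```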

<h3>Preparing data for analysis</h3>
<p>For further analysis it is more convenient to merge all required information into one DataFrame. 
So let's add the genre and location to each restaurant, as well as the day of the week and the holiday flag.</p>


In [None]:
visit_data = full_data.merge(air_store_info[['air_store_id','air_genre_name','air_area_name']],
                             left_on='air_store_id',right_on='air_store_id',how='left')
date_info['calendar_date']= pd.to_datetime(date_info['calendar_date'])

visit_data = visit_data.merge(date_info, left_on='visit_date', right_on='calendar_date', how='left')
visit_data.drop('calendar_date', axis=1, inplace=True)

print('Data ready for analysis')

<h2>Data Analysis</h2>

<h3> Simple method</h3>

<p>The simple method is based on the following algorithm:</p>
<ul>
<li>The number of visitors on the prediction day equals the average number of visitors on the same day of the week over the last year.</li>
<li>If the prediction day is a holiday, the number equals the average number of visitors on holidays during the last year.</li>
</ul>

<p>The main advantage of the simple method is that it can be implemented with on-site expertise in pretty much any programming language. The disadvantage is that it does not take seasonal changes into account.</p>
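
<p>To illustrate how little machinery the algorithm needs, here is a minimal sketch for a single restaurant that uses only the Python standard library. The function names, the history and the holiday set are hypothetical; the notebook's own pandas implementation is the reference.</p>

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def day_key(day, holidays):
    # A holiday is treated as its own "day of the week".
    return "Holiday" if day in holidays else day.weekday()  # Monday is 0

def fit_simple_model(history, holidays):
    """history: list of (date, visitors) pairs; holidays: set of dates."""
    buckets = defaultdict(list)
    for day, visitors in history:
        buckets[day_key(day, holidays)].append(visitors)
    return {key: mean(values) for key, values in buckets.items()}

def predict(model, day, holidays):
    # Returns None if this kind of day never occurred in the history.
    return model.get(day_key(day, holidays))

holidays = {date(2016, 2, 11), date(2017, 5, 3)}
history = [
    (date(2016, 2, 3), 20),   # Wednesday
    (date(2016, 2, 10), 24),  # Wednesday
    (date(2016, 2, 11), 40),  # Thursday, but a holiday
    (date(2016, 2, 18), 15),  # Thursday
]
model = fit_simple_model(history, holidays)
print(predict(model, date(2017, 4, 26), holidays))  # an ordinary Wednesday
print(predict(model, date(2017, 5, 3), holidays))   # a Wednesday that is a holiday
```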
<p>Based on the available visit data, let’s calculate the average number of visitors per restaurant per day of the week.</p>

In [None]:
#Make a copy of the visit data so that we can reuse the original later
simple_visit_data = visit_data.copy()

#If a day is a holiday, mark its day of the week as 'Holiday'
simple_visit_data.loc[simple_visit_data['holiday_flg']==1,'day_of_week'] = 'Holiday'

# Calculate the average number of visitors per day of the week. A holiday is treated as a day of the week
# Only 'visitors' is averaged: recent pandas versions raise an error on mean() over text columns
visitors_per_day_of_the_week = simple_visit_data.groupby(['air_store_id', 'day_of_week'])['visitors'].mean().reset_index()

<h3>Prepare submission file</h3>

In [None]:
#Make a copy of the sample submission so that the original stays untouched
simple_submission = sample_submission.copy()

#extract required restaurant ids and required dates
simple_submission['air_store_id'] = simple_submission['id'].str[:-len('_2017-04-23')] 
simple_submission['calendar_date'] = simple_submission['id'].str[-len('2017-04-23'):] 
simple_submission.drop(['visitors','id'], axis=1, inplace=True)
simple_submission['calendar_date']= pd.to_datetime(simple_submission['calendar_date'])

# Using visitors_per_day_of_the_week fill in required position in the submission file
simple_submission = simple_submission.merge(date_info, on='calendar_date', how='left')
simple_submission.loc[simple_submission['holiday_flg']==1,'day_of_week'] = 'Holiday'
simple_submission = simple_submission.merge(visitors_per_day_of_the_week, on=['air_store_id', 'day_of_week'], how='left')

print('simple submission file is ready')


<h3>Write simple submission to file</h3>

In [None]:
simple_submission['id']= simple_submission.apply(lambda row: str(row.air_store_id)+'_' + str(row.calendar_date)[:len('2017-04-23')], axis=1)
simple_submission[['id', 'visitors']].to_csv('simple_submission.csv', index=None)
print("Submission for simple method is done")

<h3>Scoring of Simple method</h3>
<p>If we submit the file for scoring, we get</p>
<p><b>Score is 0.548</b></p>
<p>This is not the best score in the competition, but it is a clear, straightforward approach that uses no weighted coefficients, which in many cases are difficult to explain and rationalize. Tuning such coefficients usually improves performance on specific dates that are known before the experiment. However, the same approach often fails to reproduce those results on another set of dates, which makes it impractical.</p>
<p><b>Conclusion:</b> The simple approach still gets reasonably good results while being very easy to implement and support in the future.</p>
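
<p>The score itself is computed by Kaggle. The notebook does not name the metric; the sketch below assumes it is the root mean squared logarithmic error (RMSLE) used to rank this competition, where lower is better. The function name and the sample numbers are illustrative only.</p>

```python
import math

def rmsle(actual, predicted):
    """Root mean squared logarithmic error of two equally long sequences."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

actual = [10, 20, 30]
print(rmsle(actual, actual))          # a perfect forecast scores 0.0
print(rmsle(actual, [12, 18, 33]))    # small relative errors give a small score
```

<p>Because the error is measured on logarithms, the metric penalizes relative rather than absolute deviations, so a miss of 5 visitors matters more for a small restaurant than for a large one.</p>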
<h2>Advanced method</h2>
Several advanced methods were considered and compared in [Be my guest - Recruit Restaurant EDA.](https://www.kaggle.com/headsortails/be-my-guest-recruit-restaurant-eda) Many thanks to the author of this kernel for the awesome work. Based on this work, we can see that many of the advanced methods considered there are very close in terms of accuracy. However, we think that [Prophet](https://research.fb.com/prophet-forecasting-at-scale/method) is the most suitable approach for the task. It is open-source software developed by [Facebook's Core Data Science team.](https://research.fb.com/category/data-science)
<p>Prophet utilizes an additive regression model which decomposes a time series into</p>
<ol>
<li>the linear/logistic trend,</li>
<li>a yearly seasonal component,</li>
<li>a weekly seasonal component, and</li>
<li>an optional list of important days (such as holidays, special events, &hellip;).</li>
</ol>
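<p>Written as a formula, in the notation commonly used for Prophet, the decomposition listed above reads as follows. Here g(t) is the trend, the two s terms are the yearly and weekly seasonal components, h(t) is the effect of the listed important days, and the last term is the error the model does not explain.</p>

$$y(t) = g(t) + s_{\text{yearly}}(t) + s_{\text{weekly}}(t) + h(t) + \varepsilon_t$$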
<p>This tool is backed by the Facebook brand and presumably builds on the work of some of the world's best data scientists.</p>
<p>It fits our purposes because it was designed to forecast time series and to handle data around holidays. If we assume that visitors behave differently during holidays, as well as a few days before and after them, then Prophet can model this irregularity explicitly.</p>
<h3>Holidays classification</h3>
<p>Let us assume that visitors behave differently on different holidays. Unfortunately, we have only about one year of historical data, and hence only one occurrence of most holidays. This is not enough to predict reliably. To increase the statistical representation of each holiday, we need to classify holidays and split them into groups, each consisting of holidays on which restaurant visitors behave similarly. This way, we collect more data for each group of holidays.</p>
<p>We manually converted the date_info file into Prophet’s format and assigned a meaningful name to each holiday. This had to be done manually because different kinds of holidays must be tracked separately, and the date_info file does not contain holiday names.</p>
Please see Prophet’s holiday definitions below. Prophet requires us to specify for how many days before and after each holiday visitor behavior differs. The lower_window and upper_window parameters are described in more detail [here.](https://facebook.github.io/prophet/docs/seasonality_and_holiday_effects.html)
<p>Someone more familiar with Japanese traditions would be needed to adjust these parameters.</p>
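
<p>As a rough illustration of what the two window parameters mean, the sketch below lists the calendar days that one holiday covers. The helper <code>affected_days</code> is hypothetical and not part of Prophet; it only mirrors the documented convention that lower_window counts days before the holiday (zero or negative) and upper_window counts days after it.</p>

```python
from datetime import date, timedelta

def affected_days(holiday, lower_window, upper_window):
    """Dates covered by a holiday with Prophet-style windows (both ends inclusive)."""
    return [holiday + timedelta(days=offset)
            for offset in range(lower_window, upper_window + 1)]

# New Year's Day with the windows used in this analysis: 2 days before, 1 day after.
for day in affected_days(date(2017, 1, 1), lower_window=-2, upper_window=1):
    print(day)
```

<p>With lower_window = -2 and upper_window = 1, a single holiday therefore influences four consecutive days.</p>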

In [None]:
#Holidays are presented in the format required by Prophet 
new_year_day = pd.DataFrame({
  'holiday': 'new_year_day',
  'ds': pd.to_datetime(['2016-01-01', '2017-01-01']),
  'lower_window': -2, #how many days before holiday are significant 
  'upper_window': 1,  #how many days after holiday are significant 
})
bank_holiday = pd.DataFrame({
  'holiday': 'bank_holiday',
  'ds': pd.to_datetime(['2016-01-02','2016-01-03', '2016-12-31', '2017-01-02','2017-01-03']),
  'lower_window': 0,
  'upper_window': 0,
})
coming_of_age_day = pd.DataFrame({
  'holiday': 'coming_of_age_day',
  'ds': pd.to_datetime(['2016-01-11','2017-01-09']),  # second Monday of January
  'lower_window': 0,
  'upper_window': 0,
})
national_foundation_day = pd.DataFrame({
  'holiday': 'national_foundation_day',
  'ds': pd.to_datetime(['2016-02-11','2017-02-11']),
  'lower_window': 0,
  'upper_window': 0,
})
valentines_day = pd.DataFrame({
  'holiday': 'valentines_day',
  'ds': pd.to_datetime(['2016-02-14','2017-02-14']),
  'lower_window':-1,
  'upper_window': 1,
})
dolls_girls_festival = pd.DataFrame({
  'holiday': 'dolls_girls_festival',
  'ds': pd.to_datetime(['2016-03-03','2017-03-03']),
  'lower_window':-1,
  'upper_window': 1,
})
equinox = pd.DataFrame({
  'holiday': 'equinox',
  'ds': pd.to_datetime(['2016-03-20','2016-03-21','2016-09-22','2016-06-20','2017-03-20']),
  'lower_window': 0,
  'upper_window': 0,
})
golden_week = pd.DataFrame({
  'holiday': 'golden_week',
  'ds': pd.to_datetime(['2016-04-29','2016-05-03','2016-05-04','2016-05-05','2017-04-29','2017-05-03','2017-05-04','2017-05-05']),
  'lower_window':-2,
  'upper_window': 1,
})
star_festival = pd.DataFrame({
  'holiday': 'star_festival',
  'ds': pd.to_datetime(['2016-07-07']),
  'lower_window': 0,
  'upper_window': 0,
})
sea_day = pd.DataFrame({
  'holiday': 'sea_day',
  'ds': pd.to_datetime(['2016-07-18']),
  'lower_window': 0,
  'upper_window': 0,
})
mountain_day = pd.DataFrame({
  'holiday': 'mountain_day',
  'ds': pd.to_datetime(['2016-08-11']),
  'lower_window': 0,
  'upper_window': 0,
})
respect_for_the_aged = pd.DataFrame({
  'holiday': 'respect_for_the_aged',
  'ds': pd.to_datetime(['2016-09-19']),
  'lower_window': 0,
  'upper_window': 0,
})
sports_day = pd.DataFrame({
  'holiday': 'sports_day',
  'ds': pd.to_datetime(['2016-10-10']),
  'lower_window': 0,
  'upper_window': 0,
})
culture_day = pd.DataFrame({
  'holiday': 'culture_day',
  'ds': pd.to_datetime(['2016-11-03']),
  'lower_window': 0,
  'upper_window': 0,
})
day_7_5_3 = pd.DataFrame({
  'holiday': 'day_7_5_3',
  'ds': pd.to_datetime(['2016-11-15']),
  'lower_window': 0,
  'upper_window': 0,
})
labor_thanksgiving_day = pd.DataFrame({
  'holiday': 'labor_thanksgiving_day',
  'ds': pd.to_datetime(['2016-11-23']),
  'lower_window':-1,
  'upper_window': 1,
})
christmas = pd.DataFrame({
  'holiday': 'christmas',
  'ds': pd.to_datetime(['2016-12-21','2016-12-23','2016-12-25']),
  'lower_window':-2,
  'upper_window': 1,
})

holidays = pd.concat((new_year_day, bank_holiday,coming_of_age_day,national_foundation_day,\
                      valentines_day,dolls_girls_festival,equinox,golden_week,\
                     star_festival,sea_day,mountain_day,respect_for_the_aged,\
                     sports_day,culture_day,day_7_5_3,labor_thanksgiving_day,\
                     christmas))


<p>Prophet starts the prediction period from the date of the last available data. So, before we begin, let's make sure that we have data for '2017-04-22' (the last date of the training set). If not, we fill it in with that restaurant's average number of visitors on Saturdays.</p>

In [None]:
# get restaurants with missing data on '2017-04-22'
missings= visit_data['visitors'].isnull() & (visit_data['visit_date'] == '2017-04-22')

#get average number of visitors on Saturdays for these restaurants
#(only 'visitors' is averaged: recent pandas versions raise an error on mean() over text columns)
visitors_per_day_of_the_week = visit_data.groupby(['air_store_id', 'day_of_week'])['visitors'].mean().reset_index()
visitors_on_Saturdays = visitors_per_day_of_the_week[visitors_per_day_of_the_week['day_of_week']== 'Saturday']

#apply to visit data
visit_data.loc[missings, 'visitors'] = visit_data[missings].merge(
    visitors_on_Saturdays[['air_store_id', 'visitors']], on='air_store_id', how='left')['visitors_y'].values
print('Data ready for advanced method')


<h3> Prophet prediction for one restaurant</h3>
<p>Let's define a function that runs the Prophet prediction for one restaurant.</p>

In [None]:
def prophet_prediction (air_id):
    #get restaurant data
    restaurant_visit_data = visit_data[visit_data['air_store_id'] == air_id]
    
    #fill it into Prophet model and fit the model
    df=pd.DataFrame()
    df['ds']=restaurant_visit_data['visit_date']
    df['y'] = np.log(restaurant_visit_data['visitors'])
    model = Prophet(changepoint_prior_scale=0.5, yearly_seasonality=False, holidays=holidays)
    model.fit(df)
    
    #run prediction for the next 39 days
    future_data = model.make_future_dataframe(periods=39)
    forecast_data = model.predict(future_data)
    return forecast_data.iloc[-39:,:][['ds', 'yhat']]
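
<p>Two details of the function are worth spelling out with a small standard-library sketch: which dates a 39-day horizon covers when it starts the day after the last training date, and how the log transform applied before fitting is undone afterwards. The variable names are illustrative only.</p>

```python
import math
from datetime import date, timedelta

# A 39-day horizon that starts the day after the last training date.
last_training_day = date(2017, 4, 22)
horizon = [last_training_day + timedelta(days=n) for n in range(1, 40)]
print(horizon[0], horizon[-1], len(horizon))

# The model is fitted on log(visitors), so a forecast is mapped back with exp().
visitors = 25
y = math.log(visitors)      # value the model is trained on
restored = math.exp(y)      # value written to the submission
print(round(restored, 6))
```

<p>Fitting on the logarithm keeps the back-transformed forecast positive and makes the model respond to relative rather than absolute changes in the number of visitors.</p>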


<p>Let's calculate the prediction for each restaurant and prepare the submission file.</p>

In [None]:
#prepare submission file
submission_parts = []
for air_id in target_restaurants['air_store_id']:
    forecast = prophet_prediction(air_id)
    submission_parts.append(pd.DataFrame({
        'id': air_id + '_' + forecast['ds'].dt.strftime('%Y-%m-%d'),
        'visitors': np.exp(forecast['yhat'])  # undo the log transform applied before fitting
    }))
# DataFrame.append was removed in pandas 2.0, so concatenate the pieces once
submission = pd.concat(submission_parts, ignore_index=True)


<p>Save the submission to a file.</p>

In [None]:
submission[['id','visitors']].to_csv('prophet_submission.csv', index=False)
print("Submission for advanced method is done")

<h3>Scoring of Advanced method</h3>
<p>If we submit the file for scoring, we get</p>
<p><b>Score is 0.512</b></p>

<p>The Advanced method improves forecasting accuracy by about 6% (the score drops from 0.548 to 0.512). However, it requires a third-party library and proper on-site expertise to update the model and support it in the future. The Advanced method uses additional information on the number of visitors a few days before and/or after holidays. To improve accuracy, it needs input from someone who knows how visitors behave during Japanese holidays. One year of historical data is not enough to statistically identify visitor behavior before and/or after different kinds of holidays.</p>
<p>The accuracy of both methods will improve once several years of data are available.</p>
<h1>RECOMMENDATION</h1>
<p>If Recruit Holdings already has the expertise to implement and support the advanced method, we recommend using it. If not, Recruit Holdings should weigh the cost of implementing and supporting the advanced method against its benefit before deciding which method to use.</p>