# Recruit Restaurant Visitor Forecasting Project Report

### OVERVIEW
This capstone project aims to build a model that forecasts the number of visitors a restaurant will receive on a future date.
The forecast draws on each restaurant's past reservations, historical visit data, geographical location,
the kind of day (weekend or holiday), the genre of food offered, and supplementary weather data for Japan.
Given a restaurant id and a date, the model predicts the number of visitors.
The prediction dates cover the last week of April and May of 2017, a period that includes "Golden Week" (a holiday week in Japan).

### The Client
The client is Recruit Holdings, which wants to automate the prediction of how many visitors a restaurant can expect on a future date.
Predicting the number of visitors to a restaurant is not simple. It depends on many factors, such as competition from nearby restaurants
and the weather where the restaurant is located.

Recruit Holdings has unique access to the datasets that could make this forecast possible. It owns Hot Pepper Gourmet (a restaurant review service), AirREGI (a restaurant point of sales service), and Restaurant Board (reservation log management software).
From these sources, past reservation data, visitor counts, and geographical information about the restaurants are made available.

The forecast would serve the following purposes for the client:
   1. Estimate how many customers will visit a restaurant during the Golden Week season in Japan.
   2. Manage costs effectively by purchasing ingredients and scheduling staff according to the expected number of visitors.
   3. Minimize the losses associated with inventory.
   4. Identify the most popular food genres and understand customers' tastes better.

With these forecasts available in advance, a restaurant can focus on giving better food and service to its customers.
Additionally, the factors that lead to a restaurant's success can be understood better and could serve as guiding principles for new
restaurants.

### Dataset
As part of the Kaggle competition, Recruit Holdings has provided the data (in CSV format), which comes from two systems it owns: the Air system, made up of AirREGI (a restaurant point of sales service) and Restaurant Board (reservation log management software), and the HPG system, Hot Pepper Gourmet (a restaurant review service).

1.) The competition fixes the date ranges of the training and test sets.</br>
2.) The training data covers the dates from 2016 until April 2017. </br>
3.) The test set covers the last week of April and May of 2017, is limited to a chosen subset </br>
     of the air restaurants, and intentionally spans a holiday week in Japan called "Golden Week".

The data comes in several .csv files: </br>
    <b> air_reserve.csv </b> Reservation data for restaurants in the Air system </br>
    <b> hpg_reserve.csv </b> Reservation data for restaurants in the HPG system </br>
    <b> air_store_info.csv </b> Location and food genre of selected air restaurants </br>
    <b> hpg_store_info.csv </b> Location and food genre of selected hpg restaurants </br>
    <b> store_id_relation.csv </b> Mapping that joins the selected restaurants present in both the air and hpg systems </br>
    <b> air_visit_data.csv </b> Historical visit data for the air restaurants </br>
    <b> date_info.csv </b> Basic information about the calendar dates in the dataset, such as holiday flags </br>



#### Data Wrangling
The first step is to load the given datasets into pandas DataFrames and clean them. Because the data is spread across
    different files, merge/join operations are needed to assemble meaningful data for analysis and prediction.


<img src="./img/data_wrangler.png">


1. Performed an inner join between the hpg_reserve data and the store_id_relation table on <b>'hpg_store_id'</b> to keep only the rows present in both. </br>
2. Converted calendar dates from strings to datetime format.
3. Mapped hpg stores to their equivalent air store ids, because the final test set identifies restaurants by air store id.
4. Performed further joins on the hpg and air system tables to collect the complete data in a single table for analysis.
5. Added a holiday column from the date_info table, matched on calendar date.
6. Added the min_visitors, max_visitors, median_visitors, and reserve_visitors columns.
7. Added a reserve_time_diff column, which represents the time difference between the reservation date and the visit date.
8. After cleaning, the train and test data are stored separately as Train_data.csv and Test_data.csv.
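
The steps above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the project's actual notebook code: the column names (`visit_datetime`, `reserve_datetime`, `holiday_flg`, and so on) are assumed to match the Kaggle files, and measuring `reserve_time_diff` in hours and averaging it per day are assumptions made for the example.

```python
import pandas as pd

# Tiny synthetic stand-ins for the competition files (all values are made up).
hpg_reserve = pd.DataFrame({
    "hpg_store_id": ["hpg_1", "hpg_1", "hpg_2"],
    "visit_datetime": ["2017-01-02 19:00:00", "2017-01-03 18:00:00", "2017-01-02 20:00:00"],
    "reserve_datetime": ["2017-01-01 10:00:00", "2017-01-03 09:00:00", "2017-01-02 12:00:00"],
    "reserve_visitors": [4, 2, 3],
})
store_id_relation = pd.DataFrame({"air_store_id": ["air_a"], "hpg_store_id": ["hpg_1"]})
air_visit_data = pd.DataFrame({
    "air_store_id": ["air_a", "air_a", "air_a"],
    "visit_date": ["2017-01-01", "2017-01-02", "2017-01-03"],
    "visitors": [10, 25, 16],
})
date_info = pd.DataFrame({
    "calendar_date": ["2017-01-01", "2017-01-02", "2017-01-03"],
    "holiday_flg": [1, 1, 0],
})

# Steps 1 and 3: the inner join keeps only hpg reservations that map to an air store.
hpg = hpg_reserve.merge(store_id_relation, on="hpg_store_id", how="inner")

# Step 2: parse date strings into datetimes.
for col in ("visit_datetime", "reserve_datetime"):
    hpg[col] = pd.to_datetime(hpg[col])
air_visit_data["visit_date"] = pd.to_datetime(air_visit_data["visit_date"])
date_info["calendar_date"] = pd.to_datetime(date_info["calendar_date"])

# Step 7: gap between making the reservation and the visit, in hours (assumed unit).
hpg["reserve_time_diff"] = (
    hpg["visit_datetime"] - hpg["reserve_datetime"]
).dt.total_seconds() / 3600
hpg["visit_date"] = hpg["visit_datetime"].dt.normalize()

# Aggregate reservations per store and day before joining them onto the visits.
reserve_daily = hpg.groupby(["air_store_id", "visit_date"], as_index=False).agg(
    reserve_visitors=("reserve_visitors", "sum"),
    reserve_time_diff=("reserve_time_diff", "mean"),
)

# Steps 4 and 5: one table with visits, reservations, and the holiday flag.
train = (
    air_visit_data
    .merge(reserve_daily, on=["air_store_id", "visit_date"], how="left")
    .merge(date_info, left_on="visit_date", right_on="calendar_date", how="left")
    .drop(columns="calendar_date")
)

# Step 6: per-store visitor statistics.
stats = (
    air_visit_data.groupby("air_store_id")["visitors"]
    .agg(min_visitors="min", max_visitors="max", median_visitors="median")
    .reset_index()
)
train = train.merge(stats, on="air_store_id", how="left")
print(train)
```

Days without any reservation keep `NaN` in the reservation columns after the left join; how to fill them (for example with 0) is a modelling choice left open here.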

### Exploratory Data Analysis and Visualization
This section presents insights obtained through data visualization.
#### Trend in visitor count with and without Reservations

<img src='./img/trend_line_visitors.png'>
The plot above shows that, across the combined air and hpg systems, a very large number of customers visit restaurants
without a reservation.

#### Most preferred food type
<img src = './img/food_preference.png'>


1.) Across all restaurants, <b> Izakaya </b> is the most preferred food type, with a visitor count of more than 1,400,000. </br>
2.) Cafes serving snacks and sweets come next after Izakaya. </br>
3.) <b> Italian/French </b> cuisine is the third most popular food type in these Japanese restaurants. </br>


#### Holiday effect on visitors
<img src = './img/visitor_count_in_week_with_holiday_effect.png'>

1.) On non-holidays, a higher number of visitors is expected on weekends (Saturday, Sunday). </br>
2.) Surprisingly, on holidays, Friday has a higher number of visitors. </br>
3.) On both holidays and non-holidays, Monday has a lower number of visitors.
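

A comparison like the one in the plot can be produced by grouping visits by weekday and holiday flag. The sketch below uses made-up rows and assumes a `holiday_flg` column as in date_info.csv; it only illustrates the grouping, not the project's actual plotting code.

```python
import pandas as pd

# Made-up visits; holiday_flg is assumed to mirror the flag in date_info.csv.
visits = pd.DataFrame({
    "visit_date": pd.to_datetime([
        "2017-01-06", "2017-01-07", "2017-01-08", "2017-01-09",
        "2017-01-13", "2017-01-14", "2017-01-16",
    ]),
    "visitors": [20, 30, 28, 15, 22, 32, 12],
    "holiday_flg": [0, 0, 0, 1, 0, 0, 0],
})
visits["day_of_week"] = visits["visit_date"].dt.day_name()

# Mean visitors per weekday, with one column per holiday flag value (0 or 1).
holiday_effect = (
    visits.groupby(["day_of_week", "holiday_flg"])["visitors"]
    .mean()
    .unstack("holiday_flg")
)
print(holiday_effect)
```

Weekday and flag combinations that never occur in the data show up as `NaN` in the resulting table.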

### Inferential Statistics
#### Forming Hypothesis
<b>Null Hypothesis</b> $H_{0}$: the average number of visitors ordering Izakaya food is equal to the average number of visitors ordering other food types, $\mu_{1} = \mu_{2}$ <br>
<b>Alternative Hypothesis</b> $H_{A}$: the average number of visitors ordering Izakaya food is greater than the average number of visitors ordering other food types, $\mu_{1} > \mu_{2}$ <br>
The significance level $\alpha$ is set to 0.05, and the test is carried out assuming the null hypothesis is true.

#### Distribution of visitors for Izakaya genre
<img src='./img/izakaya_distribution.png' width='800'>

#### Distribution of visitors for other genres
<img src='./img/other_distribution.png' width='800'>

<b> A Student's t-test with pooled variances was used, as the population variance is unknown. </b> </br>
<b> The p-value from the test is 1.0145574677365108e-284, far below $\alpha$ = 0.05, so the null hypothesis is rejected: the difference in the mean number of customers between Izakaya and the other food types is statistically significant. </b>
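

For reference, the pooled-variance t statistic can be computed as in the sketch below. The helper `pooled_t_statistic` and the two small samples are hypothetical and only illustrate the formula; they are not the project's data or code.

```python
import math
import statistics


def pooled_t_statistic(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for a pooled-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    # statistics.variance is the sample variance (divides by n - 1).
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    dof = n_a + n_b - 2
    # Pooled variance: both groups are assumed to share one population variance.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, dof


# Made-up daily visitor counts for the two groups.
izakaya = [22, 25, 30, 28, 35]
other = [18, 20, 24, 19, 21]

t_stat, dof = pooled_t_statistic(izakaya, other)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

Turning the statistic into a p-value requires the t distribution with `dof` degrees of freedom; `scipy.stats.ttest_ind(izakaya, other, equal_var=True)` performs the same pooled test and returns the p-value directly.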

#### Conclusion

So far, this report has covered data wrangling, exploratory data analysis, data visualization,</br>
and hypothesis testing on the restaurant visitor dataset. </br>

### Machine Learning

### Baseline model

Treating this as a regression problem, a baseline model was built using linear regression. </br>
The competition metric is the Root Mean Squared Logarithmic Error (RMSLE), that is, the RMSE computed on log-transformed visitor counts. The score obtained on the test set in the Kaggle submission is <b> 0.568</b>
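

The Kaggle competition scores submissions with the root mean squared logarithmic error, which is the RMSE applied to log(1 + visitors). A minimal sketch of that metric follows; the function name `rmsle` and the sample numbers are illustrative, not taken from the project code.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # log1p(x) = log(1 + x), so zero-visitor days are handled without error.
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Made-up visitor counts and predictions.
actual_visitors = [10, 25, 16]
predicted_visitors = [12, 20, 16]
print(round(rmsle(actual_visitors, predicted_visitors), 4))
```

Because the error is taken on a log scale, the metric penalizes relative rather than absolute differences, so a miss of 5 visitors matters more for a small restaurant than for a busy one.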

### Ensemble modelling

An ensemble modelling approach was considered for this forecasting problem,
using a RandomForest regressor and an XGBoost regressor.

The random forest regressor was trained first, tuning the hyperparameters max_depth, max_features, min_samples_leaf, and n_estimators. </br>
K-fold cross-validation with grid search was used to find the optimal parameters, which were found to be as shown below. </br>
<b>
{'max_depth': 15, </br>
 'max_features': 10, </br>
 'min_samples_leaf': 1, </br>
 'n_estimators': 1000} </b>
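

A grid search of this kind can be set up with scikit-learn roughly as below. The synthetic data, the deliberately tiny grid, the 3-fold split, and the scoring choice are all assumptions made to keep the sketch fast; the project's actual grid and fold count are not stated in this report.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data standing in for the engineered features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=120)

# Small illustrative grid over the same four hyperparameters.
param_grid = {
    "max_depth": [3, 5],
    "max_features": [2, 4],
    "min_samples_leaf": [1, 5],
    "n_estimators": [20],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X, y)
print(search.best_params_)
```

`GridSearchCV` refits the best configuration on the full data by default, so `search.best_estimator_` can be used directly for prediction afterwards.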

### Feature importances

The feature importances are taken from the random forest regressor model.

<img src='./img/feature_imps.png' width='800'>

The plot of feature importances shows that the average number of visitors to a restaurant (mean_visitors) has high importance as a feature.
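

Importances like these can be read from a fitted scikit-learn random forest through its `feature_importances_` attribute. The sketch below uses synthetic data in which the first column is constructed to dominate; the feature names are borrowed from this report purely as labels, and the numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Labels only: the synthetic columns below are not the real features.
feature_names = ["mean_visitors", "reserve_visitors", "holiday_flg"]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# The target depends mostly on column 0 by construction.
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Pair each name with its impurity-based importance and sort in descending order.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```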

#### [The GitHub repository for the code can be found here](https://github.com/nishalpattan/SPRINGBOARD/tree/master/Capstone_Project)