<div style="background-color:#2563eb; padding:12px; border-radius:8px;">
    <h2 style="color:white; margin:0; text-align:center; letter-spacing:1px;">
        OVERVIEW
    </h2>
</div>
This project uses the Recruit Restaurant Visitor Forecasting dataset, which contains historical restaurant visit information from Japan. The dataset combines restaurant details, daily visitor counts, reservation records, and calendar data. It is designed to predict the number of customers visiting each restaurant on a given date using time-series, location, and reservation-based features.

# Curriculum

1.  Setup Environment
2.  Load Dataset
3.  Understand the Data
4.  Data Prerocessing for EDA
5.  EDA
6.  Feature Engineering
7.  EDA part 2
8.  Statistical Significance Test
9.  Feature Encoding
10. Data Preprocessing for Model Building
11. Evaluation Metrics - Regression
12. Time Series Forecasting - ARIMA
13. Time Series Forecasting - Prophet
14. XGBoost Regression
15. XGBoost
16. CatBoost
17. LightGBM
18. Improve the Model - Hyperparameter Tuning
19. Ensemble (stacking)
20. Final RMSLE score and the recommendations

<div style="background-color:#2563eb; padding:12px; border-radius:8px;">
    <h2 style="color:white; margin:0; text-align:center; letter-spacing:1px;">
        Dataset
    </h2>
</div>

In [1]:
import pandas as pd
import numpy as np

In [2]:
air_reserve_df = pd.read_csv(r"Data/air_reserve.csv")
air_store_df = pd.read_csv(r"Data/air_store_info.csv")
air_visit_data_df = pd.read_csv(r"Data/air_visit_data.csv")
hpg_reserve_df = pd.read_csv(r"Data/hpg_reserve.csv")
date_info_df = pd.read_csv(r"Data/date_info.csv")
hpg_store_info_df = pd.read_csv(r"Data/hpg_store_info.csv")
sample_submission_df = pd.read_csv(r"Data/sample_submission.csv")
store_id_relation_df = pd.read_csv(r"Data/store_id_relation.csv")

<h2>Airstore Reserve info</h2>

In [3]:
air_reserve_df.head()

Unnamed: 0,air_store_id,visit_datetime,reserve_datetime,reserve_visitors
0,air_877f79706adbfb06,2016-01-01 19:00:00,2016-01-01 16:00:00,1
1,air_db4b38ebe7a7ceff,2016-01-01 19:00:00,2016-01-01 19:00:00,3
2,air_db4b38ebe7a7ceff,2016-01-01 19:00:00,2016-01-01 19:00:00,6
3,air_877f79706adbfb06,2016-01-01 20:00:00,2016-01-01 16:00:00,2
4,air_db80363d35f10926,2016-01-01 20:00:00,2016-01-01 01:00:00,5


Contains reservation-level records

Used to create daily aggregated features

Enhances forecasting accuracy significantly

Critical feature source for LightGBM models

<h2>Airstore info</h2>

In [4]:
air_store_df.head()

Unnamed: 0,air_store_id,air_genre_name,air_area_name,latitude,longitude
0,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197853
1,air_7cc17a324ae5c7dc,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197853
2,air_fee8dcf4d619598e,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197853
3,air_a17f0778617c76e2,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197853
4,air_83db5aff8f50478e,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599


Contains static restaurant metadata

One row per restaurant

Used for feature enrichment

Plays a key role in improving visitor forecasting accuracy

<h2>Air visit data info</h2>

In [5]:
air_visit_data_df.head()

Unnamed: 0,air_store_id,visit_date,visitors
0,air_ba937bf13d40fb24,2016-01-13,25
1,air_ba937bf13d40fb24,2016-01-14,32
2,air_ba937bf13d40fb24,2016-01-15,29
3,air_ba937bf13d40fb24,2016-01-16,22
4,air_ba937bf13d40fb24,2016-01-18,6


It is the core training dataset

Used to learn:

Daily demand patterns

Weekly seasonality

Weekend vs weekday trends

Restaurant popularity over time

<h2>hpg Reserve Info</h2>

In [6]:
hpg_reserve_df.head()

Unnamed: 0,hpg_store_id,visit_datetime,reserve_datetime,reserve_visitors
0,hpg_c63f6f42e088e50f,2016-01-01 11:00:00,2016-01-01 09:00:00,1
1,hpg_dac72789163a3f47,2016-01-01 13:00:00,2016-01-01 06:00:00,3
2,hpg_c8e24dcf51ca1eb5,2016-01-01 16:00:00,2016-01-01 14:00:00,2
3,hpg_24bb207e5fd49d4a,2016-01-01 17:00:00,2016-01-01 11:00:00,5
4,hpg_25291c542ebb3bc2,2016-01-01 17:00:00,2016-01-01 03:00:00,13


Reservation data acts as a leading indicator of demand and helps the model identify:

High-traffic days

Weekend surges

Festival and holiday spikes

Advance booking behavior

<h2>date Info</h2>

In [7]:
date_info_df.head()

Unnamed: 0,calendar_date,day_of_week,holiday_flg
0,2016-01-01,Friday,1
1,2016-01-02,Saturday,1
2,2016-01-03,Sunday,1
3,2016-01-04,Monday,0
4,2016-01-05,Tuesday,0


This dataset tells the model what day it is and whether it is a holiday, which strongly affects how many customers visit restaurants.

<h2>hpg tore info</h2>

In [8]:
hpg_store_info_df.head()

Unnamed: 0,hpg_store_id,hpg_genre_name,hpg_area_name,latitude,longitude
0,hpg_6622b62385aec8bf,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
1,hpg_e9e068dd49c5fa00,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
2,hpg_2976f7acb4b3a3bc,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
3,hpg_e51a522e098f024c,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
4,hpg_e3d0e1519894f275,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221


pg_store_info.csv contains restaurant metadata from the Hot Pepper Gourmet platform, including cuisine type, location, and geographic coordinates.

<h2>sample submission </h2>

In [9]:
sample_submission_df.head()

Unnamed: 0,id,visitors
0,air_00a91d42b08b08d9_2017-04-23,0
1,air_00a91d42b08b08d9_2017-04-24,0
2,air_00a91d42b08b08d9_2017-04-25,0
3,air_00a91d42b08b08d9_2017-04-26,0
4,air_00a91d42b08b08d9_2017-04-27,0


<h2>store id relation </h2>

In [10]:
store_id_relation_df.head()

Unnamed: 0,air_store_id,hpg_store_id
0,air_63b13c56b7201bd9,hpg_4bc649e72e2a239a
1,air_a24bf50c3e90d583,hpg_c34b496d0305a809
2,air_c7f78b4f3cba33ff,hpg_cd8ae0d9bbd58ff9
3,air_947eb2cae4f3e8f2,hpg_de24ea49dc25d6b8
4,air_965b2e0cf4119003,hpg_653238a84804d8e7


<div style="background-color:#2563eb; padding:12px; border-radius:8px;">
    <h2 style="color:white; margin:0; text-align:center; letter-spacing:1px;">
        UNDERSTANDING THE DATA
    </h2>
</div>

In [41]:
data_description = pd.DataFrame(columns=["Number_of_uniqueVlues","Size","Null_count","Null_percentage","most_frequent_value","most_frequent_value_perc"])
def about_data(df):
    for col in df.columns:
      data_description.loc[col,"Number_of_uniqueVlues"] = df[col].nunique()
      data_description.loc[col,"Size"] = len(df[col])
      data_description.loc[col,"Null_count"] = df[col].isnull().sum()
      data_description.loc[col,"Null_percentage"] = np.where(data_description.loc[col,"Null_count"]>=0,data_description.loc[col,"Null_count"]/len(df),0)
      data_description.loc[col,"most_frequent_value"] = df[col].value_counts().index[0]
      data_description.loc[col,"most_frequent_value_perc"] = df[col].value_counts(normalize=True).tolist()[1]
    return data_description

In [42]:
about_data(air_reserve_df)

Unnamed: 0,Number_of_uniqueVlues,Size,Null_count,Null_percentage,most_frequent_value,most_frequent_value_perc
air_store_id,314,92378,0,0.0,air_8093d0b565e9dbdf,0.0206
visit_datetime,4975,92378,0,0.0,2016-12-24 19:00:00,0.00249
reserve_datetime,7513,92378,0,0.0,2016-11-24 18:00:00,0.001115
reserve_visitors,71,92378,0,0.0,2,0.164531
