### Description

This Kaggle competition asks us to predict how many visitors a particular restaurant will receive on a given date.

The data comes from two separate restaurant reservation sites: Hot Pepper Gourmet (hpg) and AirREGI (air).

The 'air_reserve.csv' file contains reservation information only. It shows the restaurant ID number, the date and time a reservation was made, the date and time for which a reservation was made, and the number of visitors that the reservation is for. 

The file 'air_visit_data.csv' provides information on the number of visitors to a specific restaurant on a specific date.

The 'air_store_info.csv' file provides general information on the restaurants indexed in the above two files. It describes the type of food that the restaurant provides as well as the restaurant's latitude and longitude coordinates. Finally, it gives the name of the general geographic area in which the restaurant is located.

The file 'hpg_reserve.csv' provides the same type of information as the 'air_reserve.csv' file, but using the HPG reservations system.

The file 'hpg_store_info.csv' provides the same type of information as the 'air_store_info.csv' file, but using the HPG reservation system.

There is no 'hpg_visit_data.csv' file analogous to the 'air_visit_data.csv' file. Instead, another file called 'store_id_relation.csv' cross-references the ID numbers used in the two different reservation systems.

The file 'date_info.csv' provides information on the weekday corresponding to specific calendar dates, and contains a flag indicating whether or not a specific calendar date was a holiday.

Finally, the file 'sample_submission.csv' shows the format that submissions to Kaggle must follow. A submission should consist of two columns. The first column is an id that combines the store's 'air' reservation system ID number with the visit date, and the second column lists the predicted number of visitors to the restaurant on that date.
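The sample submission rows shown later in this notebook have ids of the form `air_00a91d42b08b08d9_2017-04-23`, i.e. an air store ID and a visit date joined by an underscore. The following is a minimal sketch of building and splitting such ids. The `preds` frame and its values are made up for illustration; only the column names follow the files described above.

```python
import pandas as pd

# Hypothetical predictions for illustration only; not real competition data.
preds = pd.DataFrame({
    "air_store_id": ["air_00a91d42b08b08d9", "air_00a91d42b08b08d9"],
    "visit_date": ["2017-04-23", "2017-04-24"],
    "visitors": [0, 0],
})

# Build the submission id as "<air_store_id>_<visit_date>".
submission = pd.DataFrame({
    "id": preds["air_store_id"] + "_" + preds["visit_date"],
    "visitors": preds["visitors"],
})

# Going the other way: split on the last underscore to recover both parts.
parts = submission["id"].str.rsplit("_", n=1, expand=True)
parts.columns = ["air_store_id", "visit_date"]
```

Splitting on the last underscore matters because the store ID itself contains an underscore after the `air` prefix.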



In [1]:
#setup
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
%matplotlib inline

#for visualization of missing data
import missingno

aReserveDF = pd.read_csv('air_reserve.csv') 
aVisitDF = pd.read_csv('air_visit_data.csv') 
aStoreDF = pd.read_csv('air_store_info.csv')

hReserveDF = pd.read_csv('hpg_reserve.csv') 
hStoreDF = pd.read_csv('hpg_store_info.csv') 

dateInfoDF = pd.read_csv('date_info.csv')
sampleSubmissionDF = pd.read_csv('sample_submission.csv') 
storeIdRelationDF = pd.read_csv('store_id_relation.csv') 

Let's briefly examine the first few rows of each file.

In [2]:
aReserveDF.head()

Unnamed: 0,air_store_id,visit_datetime,reserve_datetime,reserve_visitors
0,air_877f79706adbfb06,2016-01-01 19:00:00,2016-01-01 16:00:00,1
1,air_db4b38ebe7a7ceff,2016-01-01 19:00:00,2016-01-01 19:00:00,3
2,air_db4b38ebe7a7ceff,2016-01-01 19:00:00,2016-01-01 19:00:00,6
3,air_877f79706adbfb06,2016-01-01 20:00:00,2016-01-01 16:00:00,2
4,air_db80363d35f10926,2016-01-01 20:00:00,2016-01-01 01:00:00,5


In [3]:
aVisitDF.head()

Unnamed: 0,air_store_id,visit_date,visitors
0,air_ba937bf13d40fb24,2016-01-13,25
1,air_ba937bf13d40fb24,2016-01-14,32
2,air_ba937bf13d40fb24,2016-01-15,29
3,air_ba937bf13d40fb24,2016-01-16,22
4,air_ba937bf13d40fb24,2016-01-18,6


In [4]:
aStoreDF.head()

Unnamed: 0,air_store_id,air_genre_name,air_area_name,latitude,longitude
0,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
1,air_7cc17a324ae5c7dc,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
2,air_fee8dcf4d619598e,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
3,air_a17f0778617c76e2,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
4,air_83db5aff8f50478e,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599


In [5]:
hReserveDF.head()

Unnamed: 0,hpg_store_id,visit_datetime,reserve_datetime,reserve_visitors
0,hpg_c63f6f42e088e50f,2016-01-01 11:00:00,2016-01-01 09:00:00,1
1,hpg_dac72789163a3f47,2016-01-01 13:00:00,2016-01-01 06:00:00,3
2,hpg_c8e24dcf51ca1eb5,2016-01-01 16:00:00,2016-01-01 14:00:00,2
3,hpg_24bb207e5fd49d4a,2016-01-01 17:00:00,2016-01-01 11:00:00,5
4,hpg_25291c542ebb3bc2,2016-01-01 17:00:00,2016-01-01 03:00:00,13


In [6]:
hStoreDF.head()

Unnamed: 0,hpg_store_id,hpg_genre_name,hpg_area_name,latitude,longitude
0,hpg_6622b62385aec8bf,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
1,hpg_e9e068dd49c5fa00,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
2,hpg_2976f7acb4b3a3bc,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
3,hpg_e51a522e098f024c,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
4,hpg_e3d0e1519894f275,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221


In [7]:
storeIdRelationDF.head()

Unnamed: 0,air_store_id,hpg_store_id
0,air_63b13c56b7201bd9,hpg_4bc649e72e2a239a
1,air_a24bf50c3e90d583,hpg_c34b496d0305a809
2,air_c7f78b4f3cba33ff,hpg_cd8ae0d9bbd58ff9
3,air_947eb2cae4f3e8f2,hpg_de24ea49dc25d6b8
4,air_965b2e0cf4119003,hpg_653238a84804d8e7


In [8]:
dateInfoDF.head()

Unnamed: 0,calendar_date,day_of_week,holiday_flg
0,2016-01-01,Friday,1
1,2016-01-02,Saturday,1
2,2016-01-03,Sunday,1
3,2016-01-04,Monday,0
4,2016-01-05,Tuesday,0


In [9]:
sampleSubmissionDF.head()

Unnamed: 0,id,visitors
0,air_00a91d42b08b08d9_2017-04-23,0
1,air_00a91d42b08b08d9_2017-04-24,0
2,air_00a91d42b08b08d9_2017-04-25,0
3,air_00a91d42b08b08d9_2017-04-26,0
4,air_00a91d42b08b08d9_2017-04-27,0


### Checking for Missing Data

Next, we need to check whether any rows have missing data.

In [10]:
aReserveDF.isnull().sum()

air_store_id        0
visit_datetime      0
reserve_datetime    0
reserve_visitors    0
dtype: int64

In [11]:
aVisitDF.isnull().sum()

air_store_id    0
visit_date      0
visitors        0
dtype: int64

In [12]:
aStoreDF.isnull().sum()

air_store_id      0
air_genre_name    0
air_area_name     0
latitude          0
longitude         0
dtype: int64

In [13]:
hReserveDF.isnull().sum()

hpg_store_id        0
visit_datetime      0
reserve_datetime    0
reserve_visitors    0
dtype: int64

In [14]:
hStoreDF.isnull().sum()

hpg_store_id      0
hpg_genre_name    0
hpg_area_name     0
latitude          0
longitude         0
dtype: int64

In [15]:
dateInfoDF.isnull().sum()

calendar_date    0
day_of_week      0
holiday_flg      0
dtype: int64

In [16]:
storeIdRelationDF.isnull().sum()

air_store_id    0
hpg_store_id    0
dtype: int64

### Checking the size of our dataframes

Let's also check how many rows we have in each dataframe.

In [17]:
aReserveDF.shape

(92378, 4)

In [18]:
aVisitDF.shape

(252108, 3)

In [19]:
aStoreDF.shape

(829, 5)

In [20]:
hReserveDF.shape

(2000320, 4)

In [21]:
hStoreDF.shape

(4690, 5)

In [22]:
dateInfoDF.shape

(517, 3)

In [23]:
storeIdRelationDF.shape

(150, 2)
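
The per-file checks above (missing values and shapes) could also be collected into a single table. This is a sketch only: `summarize_frames` is a hypothetical helper, and the toy frames stand in for `aReserveDF`, `aVisitDF`, and the others.

```python
import pandas as pd

def summarize_frames(frames):
    """Return one row per DataFrame: row count, column count, total missing cells."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "name": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "missing": int(df.isnull().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("name")

# Toy stand-ins; in the notebook you would pass the loaded DataFrames instead.
toy_frames = {
    "complete": pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}),
    "with_gap": pd.DataFrame({"a": [1.0, None], "b": ["x", None]}),
}
overview = summarize_frames(toy_frames)
```

In the notebook this might be called as `summarize_frames({'air_reserve': aReserveDF, 'air_visit': aVisitDF})`, extended with the remaining frames.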

### Summary of number of rows in files

Number of reservations made in the Air Reservation System: 92,378

Number of reservations made in the HPG System: 2,000,320


Number of store-day records for which we have a visit date and number of visitors: 252,108

Number of calendar dates for which we have a weekday and a flag (holiday or not): 517

Number of stores for which we have restaurant descriptions in the Air Reservation System: 829

Number of stores for which we have actual visit data (i.e., a reservation was fulfilled): 829

Number of stores for which we have restaurant descriptions in the HPG System: 4,690

Number of restaurants in both reservation systems: 150



### Points to consider regarding the number of entries in each data file

We see that many more reservations were made using the HPG System than the Air Reservation System: 2,000,320 versus 92,378, or roughly 22 times as many.

Also, we will have to calculate the number of unique days for which we have information on actual visits. We've been provided with weekday information for 517 calendar dates, but if we want to use it as a feature in our predictions, we will have to check that those dates cover every day for which visits were recorded and reservations were made.
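
One way to run that coverage check is a set difference between the visit dates and the calendar dates. The sketch below uses small made-up frames in place of `aVisitDF` and `dateInfoDF`; the column names follow the files above.

```python
import pandas as pd

# Toy stand-ins for aVisitDF and dateInfoDF (values are illustrative only).
visits = pd.DataFrame({"visit_date": ["2016-01-01", "2016-01-02", "2016-01-05"]})
calendar = pd.DataFrame({
    "calendar_date": ["2016-01-01", "2016-01-02", "2016-01-03"],
    "day_of_week": ["Friday", "Saturday", "Sunday"],
})

# Parse both columns so the comparison is between dates, not strings.
visit_dates = pd.to_datetime(visits["visit_date"])
calendar_dates = pd.to_datetime(calendar["calendar_date"])

# Dates that appear in the visit data but have no calendar entry.
uncovered = sorted(set(visit_dates.dt.date) - set(calendar_dates.dt.date))
all_covered = len(uncovered) == 0
```

For reservation data, the same check would first reduce `visit_datetime` to a date before comparing.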

Also, some other cases to think about are:

How will we interpret the case where individuals may be making multiple reservations at different times for the same restaurant, but only want to actually go at one time on one day? For example, somebody may be unsure of the exact time they want to go, so may make multiple reservations for different times on the same day, but later cancel all the reservations except one.

How will we interpret the case where individuals may be making multiple reservations at different restaurants for the same date/time, but only intend to go to one of them? They may be making 'extra' reservations with the intention of cancelling most of them later on.
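
The reservation files carry no customer identifier, so duplicate bookings by the same person cannot be identified directly. One indirect check, sketched here with made-up data, is to total the reserved visitors per store and day and compare that total with the actual visitors. The frames below are toy stand-ins for `aReserveDF` and `aVisitDF`, and the `over_reserved` column is a hypothetical name.

```python
import pandas as pd

# Toy reservations and visits; all values are made up.
reservations = pd.DataFrame({
    "air_store_id": ["air_A", "air_A", "air_A", "air_B"],
    "visit_datetime": ["2016-01-01 19:00:00", "2016-01-01 20:00:00",
                       "2016-01-02 19:00:00", "2016-01-01 19:00:00"],
    "reserve_visitors": [2, 4, 3, 5],
})
visits = pd.DataFrame({
    "air_store_id": ["air_A", "air_A", "air_B"],
    "visit_date": ["2016-01-01", "2016-01-02", "2016-01-01"],
    "visitors": [4, 10, 20],
})

# Reduce each reservation to its visit day, then total per store and day.
reservations["visit_date"] = (
    pd.to_datetime(reservations["visit_datetime"]).dt.strftime("%Y-%m-%d")
)
daily = (
    reservations
    .groupby(["air_store_id", "visit_date"], as_index=False)
    .agg(n_reservations=("reserve_visitors", "size"),
         reserved_visitors=("reserve_visitors", "sum"))
)

# Join to actual visits; more reserved than actual visitors may hint at unused reservations.
compared = daily.merge(visits, on=["air_store_id", "visit_date"], how="left")
compared["over_reserved"] = compared["reserved_visitors"] > compared["visitors"]
```

A flagged row is only a hint: walk-in customers and no-shows both affect the visitor count, so this comparison cannot prove that a reservation was cancelled.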



Let's start by checking the number of unique entries.

In [24]:
aReserveDF.describe(include = ['O'])

Unnamed: 0,air_store_id,visit_datetime,reserve_datetime
count,92378,92378,92378
unique,314,4975,7513
top,air_8093d0b565e9dbdf,2016-12-24 19:00:00,2016-11-24 18:00:00
freq,2263,255,106


In [25]:
aVisitDF.describe(include = ['O'])

Unnamed: 0,air_store_id,visit_date
count,252108,252108
unique,829,478
top,air_5c817ef28f236bdf,2017-03-17
freq,477,799


In [26]:
aStoreDF.describe(include = ['O'])

Unnamed: 0,air_store_id,air_genre_name,air_area_name
count,829,829,829
unique,829,14,103
top,air_f267dd70a6a6b5d3,Izakaya,Fukuoka-ken Fukuoka-shi Daimyō
freq,1,197,64


Check whether the set of store IDs in aReserveDF is a proper subset of the store IDs in aStoreDF, i.e., that aStoreDF includes a listing for every store in aReserveDF.

In [27]:
set(aReserveDF.air_store_id) < set(aStoreDF.air_store_id)

True

This makes sense since we have 829 store descriptions in aStoreDF, but only 314 unique air_store_id entries in aReserveDF.

In [28]:
hReserveDF.describe(include = ['O'])

Unnamed: 0,hpg_store_id,visit_datetime,reserve_datetime
count,2000320,2000320,2000320
unique,13325,9847,11450
top,hpg_2afd5b187409eeb4,2016-12-16 19:00:00,2016-12-12 21:00:00
freq,1155,10528,907


In [29]:
hStoreDF.describe(include = ['O'])

Unnamed: 0,hpg_store_id,hpg_genre_name,hpg_area_name
count,4690,4690,4690
unique,4690,34,119
top,hpg_665a2f84da330d4c,Japanese style,Tōkyō-to Shinjuku-ku None
freq,1,1750,257


Check whether the set of store IDs in hReserveDF is a proper subset of the store IDs in hStoreDF, i.e., that hStoreDF includes a listing for every store in hReserveDF.

In [30]:
set(hReserveDF.hpg_store_id) < set(hStoreDF.hpg_store_id)

False

The outcome from the above command is "False", which means that some HPG stores with reservations are missing descriptions. This is expected: hReserveDF contains 13,325 unique store IDs, but hStoreDF describes only 4,690.
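
For Python sets, `<` tests for a proper subset and `<=` for a subset, while set difference lists the IDs that are actually missing. The sketch below uses toy ID sets; in the notebook the sets would be built from `hReserveDF.hpg_store_id` and `hStoreDF.hpg_store_id`.

```python
# Toy ID collections; all IDs here are made up.
reserve_ids = {"hpg_1", "hpg_2", "hpg_3"}
store_ids = {"hpg_1", "hpg_2"}

# "<=" is the subset test: True only if every reserved store has a description.
all_described = reserve_ids <= store_ids

# "<" is the proper-subset test; a set is never a proper subset of itself.
self_check = store_ids < store_ids

# Set difference names the IDs that have reservations but no description.
missing_descriptions = reserve_ids - store_ids
```

Taking `len(missing_descriptions)` on the real data would give the number of HPG stores that have reservations but no entry in the store information file.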

In [31]:
dateInfoDF.describe(include = ['O'])

Unnamed: 0,calendar_date,day_of_week
count,517,517
unique,517,7
top,2017-03-29,Friday
freq,1,74


In [32]:
storeIdRelationDF.describe(include = ['O'])

Unnamed: 0,air_store_id,hpg_store_id
count,150,150
unique,150,150
top,air_622375b4815cf5cb,hpg_3c41f028563beac3
freq,1,1


### Summary

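We find that there are 314 unique store IDs in the Air reservations (out of 829 described Air stores) and 13,325 unique store IDs in the HPG reservations, while 'hpg_store_info.csv' describes only 4,690 HPG stores.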

150 of the HPG Store IDs have a corresponding Air Reservation Store ID. For those stores, in the next Jupyter notebook, let's replace the HPG Store ID with the corresponding Air Reservation Store ID so that we don't do any double-counting.
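
A minimal sketch of that ID replacement, assuming it is done with a lookup built from the relation table. The frames below are toy stand-ins for `hReserveDF` and `storeIdRelationDF`, and the IDs are made up.

```python
import pandas as pd

# Toy stand-ins for hReserveDF and storeIdRelationDF.
hpg_reservations = pd.DataFrame({
    "hpg_store_id": ["hpg_1", "hpg_2", "hpg_1"],
    "reserve_visitors": [2, 3, 4],
})
relation = pd.DataFrame({
    "air_store_id": ["air_X"],
    "hpg_store_id": ["hpg_1"],
})

# Look up the Air ID for each HPG ID; stores without a match become NaN.
hpg_to_air = relation.set_index("hpg_store_id")["air_store_id"]
hpg_reservations["air_store_id"] = hpg_reservations["hpg_store_id"].map(hpg_to_air)

# Keep only the reservations that belong to a store known to both systems.
shared = hpg_reservations.dropna(subset=["air_store_id"])
```

Because the relation table lists each HPG ID exactly once (150 unique values in 150 rows), the lookup is unambiguous.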



### Transforming the date information

In [33]:
dateInfoDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 3 columns):
calendar_date    517 non-null object
day_of_week      517 non-null object
holiday_flg      517 non-null int64
dtypes: int64(1), object(2)
memory usage: 12.2+ KB


In [34]:
dateInfoDF['calendar_date'] = pd.DatetimeIndex(dateInfoDF['calendar_date'])

In [35]:
dateInfoDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 3 columns):
calendar_date    517 non-null datetime64[ns]
day_of_week      517 non-null object
holiday_flg      517 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 12.2+ KB
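
Once `calendar_date` has a datetime dtype, the `.dt` accessor becomes available. The sketch below cross-checks the supplied weekday names against the dates and derives a weekend indicator. It uses the first three rows from the `dateInfoDF.head()` output above; the `is_weekend` column is a hypothetical feature name.

```python
import pandas as pd

# First three rows of date_info.csv as shown in the head() output.
calendar = pd.DataFrame({
    "calendar_date": ["2016-01-01", "2016-01-02", "2016-01-03"],
    "day_of_week": ["Friday", "Saturday", "Sunday"],
    "holiday_flg": [1, 1, 1],
})
calendar["calendar_date"] = pd.to_datetime(calendar["calendar_date"])

# day_name() derives the English weekday name from each date.
derived = calendar["calendar_date"].dt.day_name()
weekdays_match = bool((derived == calendar["day_of_week"]).all())

# dayofweek counts Monday as 0, so Saturday and Sunday are 5 and 6.
calendar["is_weekend"] = calendar["calendar_date"].dt.dayofweek >= 5
```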


In [36]:
hReserveDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000320 entries, 0 to 2000319
Data columns (total 4 columns):
hpg_store_id        object
visit_datetime      object
reserve_datetime    object
reserve_visitors    int64
dtypes: int64(1), object(3)
memory usage: 61.0+ MB
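
The `visit_datetime` and `reserve_datetime` columns are still stored as strings (dtype `object`). The following sketch shows one way they could be converted, using two rows copied from the `hReserveDF.head()` output above. The `lead_hours` column is a hypothetical addition, included only to show what the conversion makes possible.

```python
import pandas as pd

# Toy stand-in for hReserveDF (two rows from the head() output).
reservations = pd.DataFrame({
    "hpg_store_id": ["hpg_c63f6f42e088e50f", "hpg_dac72789163a3f47"],
    "visit_datetime": ["2016-01-01 11:00:00", "2016-01-01 13:00:00"],
    "reserve_datetime": ["2016-01-01 09:00:00", "2016-01-01 06:00:00"],
    "reserve_visitors": [1, 3],
})

# Convert both timestamp columns from strings to datetime64.
for col in ["visit_datetime", "reserve_datetime"]:
    reservations[col] = pd.to_datetime(reservations[col])

# With real datetimes, the gap between booking and visit is a simple subtraction.
reservations["lead_hours"] = (
    (reservations["visit_datetime"] - reservations["reserve_datetime"])
    .dt.total_seconds() / 3600
)
```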
