**The following analysis replicates what 'Heads or Tails' has done in 'Be my guest - Recruit Restaurant EDA'.
The key difference is that the code here is written in Python. The description of the analysis remains more or less as is.
Feedback and upvotes are appreciated.**

***1. Introduction:***

This is an initial Exploratory Data Analysis for the Recruit Restaurant Visitor Forecasting competition using Python's Matplotlib and Seaborn.

The aim of this challenge is to predict the future numbers of restaurant visitors. This makes it a Time Series Forecasting problem. The data was collected from Japanese restaurants. As we will see, the data set is small and easily accessible without requiring much memory or computing power. Therefore, this competition is particularly suited for beginners.

The data comes in the shape of 8 relational files which are derived from two separate Japanese websites that collect user information: “Hot Pepper Gourmet (hpg): similar to Yelp” (search and reserve) and “AirREGI / Restaurant Board (air): similar to Square” (reservation control and cash register). The training data covers Jan 2016 through most of Apr 2017, while the test set includes the last week of Apr plus May 2017. The test data “intentionally spans a holiday week in Japan called the ‘Golden Week.’” The data description further notes that: “There are days in the test set where the restaurant were closed and had no visitors. These are ignored in scoring. The training set omits days where the restaurants were closed.”

These are the individual files:

air_visit_data.csv: historical visit data for the air restaurants. This is essentially the main training data set.

air_reserve.csv / hpg_reserve.csv: reservations made through the air / hpg systems.

air_store_info.csv / hpg_store_info.csv: details about the air / hpg restaurants including genre and location.

store_id_relation.csv: connects the air and hpg ids.

date_info.csv: essentially flags the Japanese holidays.

sample_submission.csv: serves as the test set. The id is formed by combining the air id with the visit date.
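
Because the submission id combines the air id with the visit date, the two parts can be separated again for later joins. The following is a minimal sketch on made-up rows; it assumes the date is the part after the last underscore, and the ids shown are hypothetical.

```python
import pandas as pd

# Toy rows mimicking sample_submission.csv (ids and values are made up).
sample = pd.DataFrame({
    "id": ["air_0000000000000001_2017-04-23", "air_0000000000000001_2017-04-24"],
    "visitors": [0, 0],
})

# Assumption: the visit date follows the last underscore,
# and everything before it is the air store id.
parts = sample["id"].str.rsplit("_", n=1, expand=True)
sample["air_store_id"] = parts[0]
sample["visit_date"] = pd.to_datetime(parts[1])

print(sample[["air_store_id", "visit_date"]])
```

Splitting from the right avoids depending on how many underscores the store id itself contains.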

**2. Reading the data**
First, we check the contents of the input folder.
We then load the required libraries and read the data using the pandas read_csv method.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')

pd.options.display.max_columns = 50
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
#Read the input data

AIR_VISIT_DATA = pd.read_csv('../input/air_visit_data.csv')
AIR_STORE_INFO = pd.read_csv('../input/air_store_info.csv')
HPG_STORE_INFO = pd.read_csv('../input/hpg_store_info.csv')
AIR_RESERVE = pd.read_csv('../input/air_reserve.csv')
HPG_RESERVE = pd.read_csv('../input/hpg_reserve.csv')
STORE_ID_RELATION = pd.read_csv('../input/store_id_relation.csv')
SAMPLE_SUBMISSION = pd.read_csv('../input/sample_submission.csv')
DATE_INFO = pd.read_csv('../input/date_info.csv').rename(columns={'calendar_date': 'visit_date'})

**3 Overview: File structure and content**
As a first step let’s have an overview of the data sets using the pandas describe and head methods.

In [None]:
AIR_VISIT_DATA.describe()
AIR_VISIT_DATA.head()

In [None]:
AIR_RESERVE.describe()
AIR_RESERVE.head()

In [None]:
HPG_RESERVE.describe()
HPG_RESERVE.head()

In [None]:
AIR_STORE_INFO.describe()
AIR_STORE_INFO.head()

In [None]:
HPG_STORE_INFO.describe()
HPG_STORE_INFO.head()

In [None]:
DATE_INFO.describe()
DATE_INFO.head()

In [None]:
STORE_ID_RELATION.describe()
STORE_ID_RELATION.head()

In [None]:
SAMPLE_SUBMISSION.describe()
SAMPLE_SUBMISSION.head()

*Let's check if there are any missing values in the data*

The easiest way to do this is with heatmaps.
If a feature contains null values, they show up as horizontal yellow marks in the column for that feature.

Spoiler alert: we don't have any missing values :P

In [None]:
data1 = sns.heatmap(AIR_VISIT_DATA.isnull(),yticklabels=False,cbar=False,cmap='viridis')
data1.set_title('AIR_VISIT_DATA')

In [None]:
data2 = sns.heatmap(AIR_STORE_INFO.isnull(),yticklabels=False,cbar=False,cmap='viridis')
data2.set_title('AIR_STORE_INFO')

In [None]:
data3 = sns.heatmap(HPG_STORE_INFO.isnull(),yticklabels=False,cbar=False,cmap='viridis')
data3.set_title('HPG_STORE_INFO')

In [None]:
data4 = sns.heatmap(AIR_RESERVE.isnull(),yticklabels=False,cbar=False,cmap='viridis')
data4.set_title('AIR_RESERVE')

In [None]:
data5 = sns.heatmap(HPG_RESERVE.isnull(),yticklabels=False,cbar=False,cmap='viridis')
data5.set_title('HPG_RESERVE')

In [None]:
data6 = sns.heatmap(STORE_ID_RELATION.isnull(),yticklabels=False,cbar=False,cmap='viridis')
data6.set_title('STORE_ID_RELATION')

In [None]:
data7 = sns.heatmap(SAMPLE_SUBMISSION.isnull(),yticklabels=False,cbar=False,cmap='viridis')
data7.set_title('SAMPLE_SUBMISSION')

In [None]:
data8 = sns.heatmap(DATE_INFO.isnull(),yticklabels=False,cbar=False,cmap='viridis')
data8.set_title('DATE_INFO')

There are no missing values, so we don't have to worry about imputation
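
The heatmaps can be cross-checked numerically by counting nulls per column. This is a small sketch on made-up frames (the names `complete` and `with_gap` are hypothetical stand-ins for the competition tables), not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the competition tables (values are made up).
frames = {
    "complete": pd.DataFrame({"visitors": [3, 10, 7], "holiday_flg": [0, 1, 0]}),
    "with_gap": pd.DataFrame({"latitude": [35.6, np.nan, 33.6],
                              "genre": ["Cafe", "Izakaya", None]}),
}

# Count the missing cells per column for every table.
missing = {name: df.isnull().sum() for name, df in frames.items()}
for name, counts in missing.items():
    print(name, counts.to_dict())
```

A table without gaps reports zero for every column, which is easier to verify than a uniformly coloured heatmap.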

**4 Individual feature visualisations**
Here we have a first look at the distributions of the features in our individual data files before combining them for a more detailed analysis. This initial visualisation will be the foundation on which we build our analysis.

**4.1 Air Visits**
We start with the number of visits to the air restaurants. Here we plot the total number of visitors per day over the full training time range together with the median visitors per day of the week and month of the year:

In [None]:
# Convert to datetime first so that the x-axis is a proper time axis
AIR_VISIT_DATA['visit_date'] = pd.to_datetime(AIR_VISIT_DATA['visit_date'])

# Total visitors per day over the full training range
a = AIR_VISIT_DATA.groupby('visit_date')['visitors'].sum()
plt.figure(figsize=(15,7))
plt.plot(a.index, a)
plt.ylabel("Number of Visitors",fontsize= 20)

# Median visitors per day of the week (0 = Monday, 6 = Sunday)
AIR_VISIT_DATA['day_of_week'] = AIR_VISIT_DATA['visit_date'].dt.dayofweek
b = AIR_VISIT_DATA.groupby(['day_of_week'])['visitors'].median()

# Median visitors per month of the year
AIR_VISIT_DATA['month'] = AIR_VISIT_DATA['visit_date'].dt.month
c = AIR_VISIT_DATA.groupby(['month'])['visitors'].median()

fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True,figsize=(14,4))
sns.barplot(x=b.index, y=b, ax=ax1)
sns.barplot(x=c.index, y=c, ax=ax2)


Friday and the weekend appear to be the most popular days, which is to be expected. Monday and Tuesday have the lowest median numbers of visitors.

Also during the year there is a certain amount of variation. Dec appears to be the most popular month for restaurant visits. The period of Mar - May is consistently busy.
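
The weekday comparison is easier to read with day names instead of the integers 0 to 6. Below is a minimal sketch on made-up visit counts; the frame `visits` is a hypothetical stand-in with the same columns as air_visit_data.csv.

```python
import pandas as pd

# Toy visit data (made-up values) with the columns of air_visit_data.csv.
visits = pd.DataFrame({
    "visit_date": pd.to_datetime(
        ["2016-01-04", "2016-01-05", "2016-01-08", "2016-01-09", "2016-01-11"]),
    "visitors": [10, 12, 30, 40, 14],
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Label each visit with its weekday name, then take the median per weekday.
visits["day_name"] = visits["visit_date"].dt.day_name()
median_by_day = (visits.groupby("day_name")["visitors"]
                 .median()
                 .reindex(day_order))  # calendar order instead of alphabetical

print(median_by_day)
```

Weekdays that do not occur in the toy data appear as NaN after the reindex; on the full training set every weekday is expected to be present.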

In [None]:
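plt.figure(figsize=(15,7))
# distplot is deprecated in recent seaborn releases; histplot (seaborn >= 0.11) draws the same histogram with a KDE
sns.histplot(np.log(AIR_VISIT_DATA['visitors']), kde=True, stat='density')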

In [None]:
df = AIR_VISIT_DATA[((AIR_VISIT_DATA['visit_date'] > '2016-04-15') & (AIR_VISIT_DATA['visit_date'] < '2016-06-15'))]
df1 = df.groupby(df['visit_date'])['visitors'].sum()
plt.figure(figsize=(12,4))
plt.plot(df1.index,df1)
plt.xlabel("Date")
plt.ylabel("Visitors")

**4.2 Air Reservations**
Let’s see how our reservations data compares to the actual visitor numbers. We start with the air restaurants and visualise their visitor volume through reservations for each day, alongside the hours of these visits.

In [None]:
AIR_RESERVE['visit_datetime'] = pd.to_datetime(AIR_RESERVE['visit_datetime'])
AIR_RESERVE['reserve_datetime'] = pd.to_datetime(AIR_RESERVE['reserve_datetime'])
AIR_RESERVE['visit_hour'] = AIR_RESERVE['visit_datetime'].dt.hour
AIR_RESERVE['visit_date'] = AIR_RESERVE['visit_datetime'].dt.date

# Total reserved visitors per visit date
air_reserve_date = AIR_RESERVE.groupby(['visit_date'])['reserve_visitors'].sum()
plt.figure(figsize=(12,4))
plt.plot(air_reserve_date.index,air_reserve_date,lw = 2)
plt.xlabel("Visit_Date")
plt.ylabel("Visitors")

# Total reserved visitors per hour of the visit
air_reserve_hour = AIR_RESERVE.groupby(['visit_hour'])['reserve_visitors'].sum()
plt.figure(figsize=(12,4))
plt.bar(air_reserve_hour.index,air_reserve_hour)
plt.xlabel("Visit_hour")
plt.ylabel("Visitors")

In [None]:
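# Hours between making the reservation and the visit.
# total_seconds() keeps the full span including whole days; .seconds would drop the days.
AIR_RESERVE['delta'] = AIR_RESERVE['visit_datetime']-AIR_RESERVE['reserve_datetime']
AIR_RESERVE['delta1'] = AIR_RESERVE['delta'].dt.total_seconds()/3600
d = AIR_RESERVE.groupby(AIR_RESERVE['delta1'])['reserve_visitors'].sum().reset_index()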

We find:

There were far fewer reservations made in 2016 through the air system, and none at all for a long stretch of time. The volume only increased towards the end of that year. In 2017 the visitor numbers stayed strong. The artificial decline we see after the first quarter is most likely related to these reservations being at the end of the training time frame, which means that long-term reservations would not be part of this data set.

Reservations are made typically for the dinner hours in the evening.
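
One detail worth care when measuring the time between making a reservation and the visit: on a pandas timedelta, `seconds` holds only the sub-day remainder, while `total_seconds()` covers the full span. The sketch below uses made-up reservations to show the difference; the frame `reservations` is hypothetical.

```python
import pandas as pd

# Toy reservations (made-up values) with the relevant air_reserve.csv columns.
reservations = pd.DataFrame({
    "visit_datetime": pd.to_datetime(
        ["2017-01-10 19:00:00", "2017-01-10 19:00:00", "2017-01-12 18:00:00"]),
    "reserve_datetime": pd.to_datetime(
        ["2017-01-10 16:00:00", "2017-01-08 19:00:00", "2017-01-11 18:00:00"]),
    "reserve_visitors": [2, 4, 3],
})

delta = reservations["visit_datetime"] - reservations["reserve_datetime"]

# Full lead time in hours, including whole days.
reservations["lead_hours"] = delta.dt.total_seconds() / 3600

# For comparison: .seconds ignores the day component of each timedelta.
sub_day_only = delta.dt.seconds / 3600

# Reserved visitors summed per lead time.
visitors_by_lead = reservations.groupby("lead_hours")["reserve_visitors"].sum()
print(visitors_by_lead)
```

Here the reservations made exactly one and two days ahead would both collapse to a lead time of zero hours if only `seconds` were used.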


**4.3 HPG Reservations**
In the same style as above, here are the hpg reservations:

In [None]:
HPG_RESERVE['visit_datetime'] = pd.to_datetime(HPG_RESERVE['visit_datetime'])
HPG_RESERVE['reserve_datetime'] = pd.to_datetime(HPG_RESERVE['reserve_datetime'])
HPG_RESERVE['visit_hour'] = HPG_RESERVE['visit_datetime'].dt.hour
HPG_RESERVE['visit_date'] = HPG_RESERVE['visit_datetime'].dt.date

# Total reserved visitors per visit date
hpg_reserve_date = HPG_RESERVE.groupby(['visit_date'])['reserve_visitors'].sum()
plt.figure(figsize=(12,4))
plt.plot(hpg_reserve_date.index,hpg_reserve_date,lw = 2)
plt.xlabel("Visit_Date")
plt.ylabel("Visitors")

# Total reserved visitors per hour of the visit
hpg_reserve_hour = HPG_RESERVE.groupby(['visit_hour'])['reserve_visitors'].sum()
plt.figure(figsize=(12,4))
plt.bar(hpg_reserve_hour.index,hpg_reserve_hour)
plt.xlabel("Visit_hour")
plt.ylabel("Visitors")

We find:

Here the visits after reservation follow a more orderly pattern, with a clear spike in Dec 2016. As above for the air data, we also see reservation visits dropping off as we get closer to the end of the time frame.

Again, most reservations are for dinner, and we see another nice 24-hour pattern for making these reservations. It’s worth noting that here the last few hours before the visit don’t see more volume than the 24 or 48 hours before. This is in stark contrast to the air data.

In [None]:
import folium
from folium import plugins

In [None]:
# Centre the map on the mean store position rather than on a corner of the bounding box
m = folium.Map(location=[AIR_STORE_INFO['latitude'].mean(), AIR_STORE_INFO['longitude'].mean()], zoom_start=4)
m

**4.4 Air Store**
We plot the numbers of different types of cuisine (or air_genre_names) alongside the areas with the most air restaurants:

In [None]:
pd.options.display.max_rows = 4000
pd.options.display.max_seq_items = 2000

air_store_genre = AIR_STORE_INFO.groupby(AIR_STORE_INFO['air_genre_name'])['air_store_id'].count().reset_index()
air_store_genre = air_store_genre.sort_values(['air_store_id'],ascending=False)

plt.figure(figsize=(8,5))
sns.barplot(x='air_store_id', y='air_genre_name', data = air_store_genre)
plt.xlabel("Number of Restaurants")
plt.ylabel("Type of Cuisine")

# Top 15 areas by number of air restaurants
air_area = AIR_STORE_INFO.groupby(AIR_STORE_INFO['air_area_name'])['air_store_id'].count().reset_index()
air_area = air_area.sort_values(['air_store_id'],ascending=False)
air_area = air_area.head(15)

plt.figure(figsize=(8,5))
sns.barplot(x = 'air_store_id',y = 'air_area_name',data = air_area)
plt.xlabel("Number of Restaurants")
plt.ylabel("Area")

We find:

There are lots of Izakaya gastropubs in our data, followed by cafes. We don’t have many Karaoke places in the air data set and also only a few that describe themselves as generically “International” or “Asian”. I have to admit, I’m kind of intrigued by “creative cuisine”.

Fukuoka has the largest number of air restaurants per area, followed by many Tokyo areas.

**4.5 HPG Store**
Here is the breakdown of genre and area for the hpg restaurants:

In [None]:
HPG_STORE_INFO.head()

In [None]:
hpg_store_genre = HPG_STORE_INFO.groupby(HPG_STORE_INFO['hpg_genre_name'])['hpg_store_id'].count().reset_index()
hpg_store_genre = hpg_store_genre.sort_values(['hpg_store_id'],ascending=False)

fig, (ax1, ax2) = plt.subplots(ncols=2,figsize=(12,10))
plt1 = sns.barplot(x= 'hpg_store_id', y='hpg_genre_name',data = hpg_store_genre,ax = ax1)
plt1.set(xlabel="Number of HPG Restaurants",ylabel='HPG Genre Name')

hpg_store_area = HPG_STORE_INFO.groupby(HPG_STORE_INFO['hpg_area_name'])['hpg_store_id'].count().reset_index()
hpg_store_area = hpg_store_area.sort_values(['hpg_store_id'],ascending=False)
hpg_store_area1 = hpg_store_area.head(15)

plt2 = sns.barplot(x= 'hpg_store_id', y='hpg_area_name',data = hpg_store_area1, ax = ax2)
plt2.set(xlabel="Number of HPG Restaurants",ylabel="Area Name")
plt.tight_layout()


The hpg description contains a larger variety of genres than in the air data. Here, “Japanese style” appears to contain many more places that are categorised more specifically in the air data. The same applies to “International cuisine”.

In the top 15 areas we again find Tokyo and Osaka to be prominently present.

**4.6 Holidays**

Let’s have a quick look at the holidays. We’ll plot how many there are in total and also how they are distributed during our prediction time range in 2017 and the corresponding time in 2016:

In [None]:
DATE_INFO['visit_date'] = pd.to_datetime(DATE_INFO['visit_date'])
holidays16 = DATE_INFO[((DATE_INFO['visit_date'] >'2016-04-15') & (DATE_INFO['visit_date'] < '2016-06-01'))]
holidays17 = DATE_INFO[((DATE_INFO['visit_date'] >'2017-04-15') & (DATE_INFO['visit_date'] < '2017-06-01'))]

In [None]:
# Total count of holidays vs. regular days
sns.countplot(x="holiday_flg",data = DATE_INFO)

# Holiday flags in the prediction window of 2017 and the same window in 2016
fig, (ax1, ax2) = plt.subplots(ncols=2,figsize=(10,4))
plt2 = sns.stripplot(x='visit_date',y='holiday_flg',data=holidays16, ax=ax1)
plt2.set(xticks=[])

plt3 = sns.stripplot(x='visit_date',y='holiday_flg',data=holidays17, ax=ax2)
plt3.set(xticks=[])

plt.tight_layout()

We find:

The same days were holidays in late Apr / May in 2016 as in 2017.

There are about 7% holidays in our data:
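
The share of holidays is simply the mean of the 0/1 flag. Here is a minimal sketch on a made-up ten-day calendar (the frame `date_info` and its flags are hypothetical, so the resulting share is not the figure quoted above):

```python
import pandas as pd

# Toy calendar (made-up flags) with the columns used from date_info.csv.
date_info = pd.DataFrame({
    "visit_date": pd.date_range("2017-05-01", periods=10, freq="D"),
    "holiday_flg": [0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
})

# The mean of a 0/1 flag is the fraction of rows where the flag is set.
holiday_share = date_info["holiday_flg"].mean()
print(f"{holiday_share:.1%} of the days are holidays")
```

Applied to the full DATE_INFO frame, the same one-liner gives the overall holiday fraction.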

In [None]:
air_visit = AIR_VISIT_DATA.groupby(AIR_VISIT_DATA['visit_date'])['visitors'].sum().reset_index()
air_visit['visit_date'] = pd.to_datetime(air_visit['visit_date'])
air_visit['year'] = air_visit['visit_date'].dt.year
air_visit['month'] = air_visit['visit_date'].dt.month
air_visit['day'] = air_visit['visit_date'].dt.day

In [None]:
air_visit[['visit_date','year']].set_index('visit_date').plot()

In [None]:
# Extract the date from every id (not just the first row); ids look like air_<store>_<date>
SAMPLE_SUBMISSION['date'] = SAMPLE_SUBMISSION['id'].str.split('_').str[2]
SAMPLE_SUBMISSION['date'] = pd.to_datetime(SAMPLE_SUBMISSION['date'])
SAMPLE_SUBMISSION['year'] = SAMPLE_SUBMISSION['date'].dt.year

In [None]:
test = SAMPLE_SUBMISSION.groupby(SAMPLE_SUBMISSION['date'])['year'].max().reset_index()

**5 Feature relations**
After looking at every data set individually, let’s get to the real fun and start combining them. This will tell us something about the relations between the various features and how these relations might affect the visitor numbers. Any signal we find will need to be interpreted in the context of the individual feature distributions, which is why it was one of our first steps to study those.
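
As one illustration of combining files, the hpg reservations can be mapped onto air ids through the relation table. This is a hedged sketch on made-up rows; it assumes store_id_relation.csv has the columns `air_store_id` and `hpg_store_id`, and the store ids used here are hypothetical.

```python
import pandas as pd

# Toy relation table (assumed columns: air_store_id, hpg_store_id).
relation = pd.DataFrame({
    "air_store_id": ["air_a", "air_b"],
    "hpg_store_id": ["hpg_1", "hpg_2"],
})

# Toy hpg reservations (made-up values).
hpg_reserve = pd.DataFrame({
    "hpg_store_id": ["hpg_1", "hpg_1", "hpg_3"],
    "visit_datetime": pd.to_datetime(
        ["2017-01-10 19:00:00", "2017-01-10 20:00:00", "2017-01-10 19:00:00"]),
    "reserve_visitors": [2, 4, 5],
})

# The inner join keeps only hpg stores that have an air counterpart.
linked = hpg_reserve.merge(relation, on="hpg_store_id", how="inner")

# Aggregate reserved visitors per air store and calendar day.
linked["visit_date"] = linked["visit_datetime"].dt.normalize()
daily = (linked.groupby(["air_store_id", "visit_date"], as_index=False)
         ["reserve_visitors"].sum())
print(daily)
```

The resulting frame shares the keys `air_store_id` and `visit_date` with the visit data, so it could be merged onto it as an additional feature.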

**5.1 Visitors per genre**
Our first plot of the multi-feature space deals with the average number of air restaurant visitors broken down by type of cuisine; i.e. the air_genre_name. 

In [None]:
ab = pd.merge(AIR_VISIT_DATA,AIR_STORE_INFO,on='air_store_id')
ab1 = ab.groupby(['visit_date','air_genre_name'])['visitors'].mean().reset_index()
ab13 = ab1.pivot_table(values='visitors',index='visit_date',columns='air_genre_name')
ab13.plot(subplots=True,figsize=(12,60))
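
Alongside the per-day time series, a single ranking of genres by mean visitors can be a useful summary. The following is a small sketch on made-up rows; the frames `visits` and `stores` are hypothetical stand-ins for the visit and store tables.

```python
import pandas as pd

# Toy visit and store tables (made-up values).
visits = pd.DataFrame({
    "air_store_id": ["s1", "s1", "s2", "s3"],
    "visitors": [10, 20, 40, 6],
})
stores = pd.DataFrame({
    "air_store_id": ["s1", "s2", "s3"],
    "air_genre_name": ["Cafe/Sweets", "Izakaya", "Cafe/Sweets"],
})

# Attach the genre to every visit, then average the visitors per genre.
merged = visits.merge(stores, on="air_store_id", how="left")
genre_mean = (merged.groupby("air_genre_name")["visitors"]
              .mean()
              .sort_values(ascending=False))
print(genre_mean)
```

Note that this averages over individual visit records, so stores with more recorded days weigh more heavily within their genre.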

I will add a few more plots in the coming days.