# [Seoul Bike Rental Prediction](https://www.kaggle.com/c/seoul-bike-rental-ai-pro-iti)

**Abstract:**

Rental bikes have been introduced in many cities to improve mobility and comfort. Making bikes available and accessible to the public at the right time matters because it shortens waiting times, so providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is predicting the number of bikes required at each hour to keep that supply stable.

**Data Description**

Hourly rental data is provided along with weather data. For this competition, the training set comprises the first 20 days of each month, while the test set covers the 21st to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
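
The day-of-month split described above can be sketched on a synthetic hourly index. The frame below is made-up toy data, not the competition files, and the names `toy`, `train_part` and `test_part` are hypothetical.

```python
import pandas as pd

# Synthetic hourly timestamps covering two months (toy data, not the competition files).
idx = pd.date_range("2018-01-01", "2018-02-28 23:00", freq="h")
toy = pd.DataFrame({"y": range(len(idx))}, index=idx)

# Days 1-20 of every month go to training, day 21 onward to testing.
is_train = toy.index.day <= 20
train_part = toy[is_train]
test_part = toy[~is_train]

print(train_part.index.day.max(), test_part.index.day.min())
print(len(train_part), len(test_part))
```

Because the split is made on the calendar day, every month contributes rows to both parts, which is worth keeping in mind when building a validation set that imitates the test set.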

**Data fields**

The dataset contains the count of public bikes rented at each hour in the Seoul Bike Sharing System, with the corresponding weather data and holiday information.

* ID - an ID for this instance
* Date - year-month-day
* Hour - Hour of the day
* Temperature - Temperature in Celsius
* Humidity - %
* Windspeed - m/s
* Visibility - in units of 10 m
* Dew point temperature - Celsius
* Solar radiation - MJ/m2
* Rainfall - mm
* Snowfall - cm
* Seasons - Winter, Spring, Summer, Autumn
* Holiday - Holiday/No holiday
* Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)
* y - Rented Bike count (Target), Count of bikes rented at each hour

**File descriptions**
* train.csv - the training set.
* test.csv - the test set.
* sample_submission.csv - a sample submission file in the correct format

# **Exploratory Data Analysis:**

This is the very first exploratory data analysis I have done on my own. If you find it useful, please leave a comment or an upvote. Don't hesitate to leave me a comment if my findings make no sense or need improvement.

**Import Libraries:**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns; sns.set(style="ticks", color_codes=True)
sns.set_theme(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')

**1. Reading and Exploring the data:**

**1. 1. Read dataset:**

In [2]:
filepath = '../input/seoul-bike-rental-ai-pro-iti/train.csv'
df = pd.read_csv(filepath,index_col="Date", parse_dates=True)

In [3]:
filepath = '../input/seoul-bike-rental-ai-pro-iti/test.csv'
dftest = pd.read_csv(filepath)

**1. 2. Explore dataset:**

In [4]:
df.head()

**1. 3. Display general statistics:**

In [5]:
df.shape

In [6]:
df.columns

In [7]:
df.info()

In [8]:
df.describe()

**Count all unique values in each column**

In [9]:
df.nunique()

**2. Clean the data**

**2. 1. Check and Remove Duplicates:**

In [10]:
print(len(df) - len(df.drop_duplicates()))
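
The cell above only counts duplicates. A minimal sketch of counting and then removing them is shown below on a synthetic frame (the values and the names `toy`, `n_dupes` and `deduped` are made up for illustration).

```python
import pandas as pd

# Toy frame with one repeated row (synthetic values).
toy = pd.DataFrame({"Hour": [0, 1, 1, 2], "y": [10, 20, 20, 30]})

n_dupes = toy.duplicated().sum()   # rows identical to an earlier row
deduped = toy.drop_duplicates()    # keeps the first occurrence of each row

print(n_dupes, len(deduped))
```

Note that `duplicated` compares column values only and ignores the index, so with `Date` as the index and a unique `ID` column still present, no row can be flagged as a duplicate.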

**2. 2. Check null values:**

In [11]:
df.isnull().sum()

**2. 3. Remove non-important columns:**

In [12]:
df.head()

In [13]:
df.drop(['ID', 'Dew point temperature(�C)'],axis=1,inplace=True)

**2. 4. Change the data types and column names to their correct form:**

**Sort the date order, add day name:**

In [14]:
df.sort_index(inplace=True)  # sort the rows chronologically by the Date index
df['Month'] = df.index.month
df['Year'] = df.index.year
df['Day'] = df.index.day
df['DayName'] = df.index.day_name()

**Ensure index date is of type datetime**

In [15]:
assert df.index.inferred_type == 'datetime64', "must have a datetime index"
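
As a small illustration of the two steps above (sorting by date and checking the index type), here is a sketch on a synthetic frame. The dates and values are made up, and `toy` is a hypothetical name.

```python
import pandas as pd

# Toy frame whose dates arrive out of order (synthetic values).
toy = pd.DataFrame(
    {"y": [5, 3, 4]},
    index=pd.to_datetime(["2018-01-03", "2018-01-01", "2018-01-02"]),
)
toy.index.name = "Date"

# sort_index returns a new frame ordered by the index.
toy = toy.sort_index()

# Two cheap sanity checks: the index is a DatetimeIndex and is in order.
assert isinstance(toy.index, pd.DatetimeIndex)
assert toy.index.is_monotonic_increasing

# Calendar features can then be derived directly from the index.
toy["DayName"] = toy.index.day_name()
print(toy)
```

Sorting the frame itself (rather than only its index object) is what keeps each row's values attached to the right date.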

In [16]:
df.tail()

**Change column names to more readable names**

In [17]:
# Rename all columns in a single call
df.rename(columns={
    "y": "Bikes count (Target)",
    "Temperature(�C)": "Temperature",
    "Wind speed (m/s)": "Wind speed",
    "Visibility (10m)": "Visibility",
    "Solar Radiation (MJ/m2)": "Solar Radiation",
    "Rainfall(mm)": "Rainfall",
    "Snowfall (cm)": "Snowfall",
    "Humidity(%)": "Humidity",
}, inplace=True)

In [18]:
df.columns

In [19]:
df.info()

In [20]:
df.head()

In [21]:
categorical_data = df.drop(['Month','Year','Day','Bikes count (Target)','Hour', 'Temperature', 'Humidity', 'Wind speed', 'Visibility', 'Solar Radiation','Rainfall','Snowfall'], axis=1)
numeric_data = df.drop(['Seasons','Holiday','Functioning Day','DayName'], axis=1)
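
An alternative to listing the columns by hand is to split on dtype. The sketch below uses a synthetic frame with a few of the same column names; the values and the names `numeric_part` and `categorical_part` are made up for illustration.

```python
import pandas as pd

# Toy frame mixing numeric and text columns (synthetic values).
toy = pd.DataFrame({
    "Hour": [0, 1, 2],
    "Temperature": [-5.2, -5.5, -6.0],
    "Seasons": ["Winter", "Winter", "Winter"],
    "Holiday": ["No Holiday", "No Holiday", "Holiday"],
})

# Select by dtype instead of dropping an explicit list of names.
numeric_part = toy.select_dtypes(include="number")
categorical_part = toy.select_dtypes(exclude="number")

print(list(numeric_part.columns))
print(list(categorical_part.columns))
```

This keeps the split correct if columns are added or renamed later, at the cost of treating integer-coded categories such as `Hour` as numeric.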

In [22]:
categorical_data.head()

In [23]:
numeric_data.head()

**3. Data Analysis and Visualization**

**3. 1. Data Distribution:**

In [24]:
plt.figure(figsize=(14,7))
sns.lineplot(data=df['Bikes count (Target)'])
plt.show()

> **Observation**
> 
> * The number of rental bikes dramatically increases in **2018**.

In [25]:
numeric_data.hist(figsize=(16, 20), color='navy',xlabelsize=8, ylabelsize=8); 

> **Observation**
> 
> From the graphs above we conclude that:
> * Temperature and Humidity are approximately normally distributed.
> * Hour, Day and Month take discrete values (categorical).
> * Almost all Snowfall and Rainfall values are zero, in the test data as well, so they should be dropped.
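
The "almost all zeros" reading of the histograms can be checked numerically. A minimal sketch on synthetic values (the numbers below are made up, and `zero_share` is a hypothetical name):

```python
import pandas as pd

# Toy weather columns (synthetic values).
toy = pd.DataFrame({
    "Rainfall": [0.0, 0.0, 0.0, 1.5],
    "Snowfall": [0.0, 0.0, 0.0, 0.0],
    "Humidity": [37, 38, 39, 40],
})

# Share of rows that are exactly zero, per column.
zero_share = (toy == 0).mean()
print(zero_share)
```

Running the same expression on both the training and test frames would show whether the two sets agree before any column is dropped.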

In [26]:
plt.figure(figsize=(20,20))
for i, column in enumerate(numeric_data,1):
    plt.subplot(4,3,i)
    plt.title(column)
    plt.boxplot(numeric_data[column], vert=False, patch_artist=True)


> **Observation**
> * The boxplots for Wind speed and Solar Radiation show outliers, so these features should be normalized.
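
The points a boxplot draws beyond its whiskers follow the 1.5 × IQR rule by default, so the outliers can also be counted directly. The sketch below does that on synthetic wind speeds and then applies min-max scaling as one possible form of normalization; the helper `iqr_outlier_mask` and the values are hypothetical, not part of the notebook.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy wind speeds with one unusually high reading (synthetic values).
wind = pd.Series([0.5, 1.0, 1.2, 1.5, 1.8, 2.0, 7.4], name="Wind speed")

outliers = iqr_outlier_mask(wind)

# Min-max scaling maps the column onto [0, 1]; it rescales but does not remove outliers.
wind_scaled = (wind - wind.min()) / (wind.max() - wind.min())

print(int(outliers.sum()), wind_scaled.min(), wind_scaled.max())
```

Whether scaling alone is enough, or a transformation or clipping is also needed, depends on the model used afterwards.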

**3. 2. Correlation Analysis**

In [27]:
correlation = numeric_data.corr()

In [28]:
correlation

In [29]:
matrix = np.triu(correlation)
plt.figure(figsize=(15,10))
sns.heatmap(correlation,xticklabels = correlation.columns,yticklabels = correlation.columns,mask=matrix,annot=True);

> **Observation**
> * Snowfall and Rainfall have very **weak** correlation with Bikes count(Target).
> * Temperature and Hour have **strong** correlation with Bikes count(Target).

* A heatmap alone is not enough to reveal the relations between features.
* A pairplot (scatter plots) clarifies the relation between the features and Bikes count (Target) further.
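
One reason a correlation heatmap can mislead is that the Pearson coefficient only measures linear association. The sketch below uses synthetic data in which `y` is fully determined by `x`, yet the coefficient is essentially zero; the data and names are made up for illustration.

```python
import numpy as np
import pandas as pd

# y depends entirely on x, but the relationship is symmetric, not linear.
x = np.linspace(-1, 1, 201)
toy = pd.DataFrame({"x": x, "y": x ** 2})

pearson = toy["x"].corr(toy["y"])  # Pearson correlation by default
print(abs(pearson) < 1e-6)
```

A feature with a peaked or U-shaped relation to the target can therefore look weak in the heatmap while being clearly visible in a scatter plot.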

In [30]:
sns.pairplot(numeric_data.drop(['Month','Year','Day'],axis=1));

**Regression plots to further clarify the correlation between the features and the target**

In [31]:
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(25,5))
sns.regplot(x='Rainfall',y='Bikes count (Target)', data=df, ax=ax1,scatter_kws={"color": "darkred"},line_kws={"color": "#FFD880"})
sns.regplot(x='Snowfall', y='Bikes count (Target)', data=df, ax=ax2,scatter_kws={"color": "darkred"},line_kws={"color": "#FFD880"})
sns.regplot(x='Humidity', y='Bikes count (Target)', data=df, ax=ax3,scatter_kws={"color": "darkred"},line_kws={"color": "#FFD880"})
plt.show()

> **Observation**
> 
> These regression plots show:
> * Bikes count and Rainfall have a **weak negative** correlation.
> * Bikes count and Snowfall have a **weak negative** correlation.
> * Bikes count and Humidity have a **weak negative** correlation.
> * **These 3 columns should be dropped due to their weak correlation.**
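
Screening features by their correlation with the target can be written as a short ranking step. The sketch below runs on randomly generated data, and the cut-off `threshold = 0.2` is an arbitrary assumption chosen for illustration, not a value taken from this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends on Temperature but not on Rainfall.
rng = np.random.default_rng(0)
n = 500
temperature = rng.normal(10, 10, n)
rainfall = rng.exponential(0.1, n)
toy = pd.DataFrame({
    "Temperature": temperature,
    "Rainfall": rainfall,
    "target": 20 * temperature + rng.normal(0, 50, n),
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr()["target"].drop("target")

# Hypothetical cut-off: features below it are candidates for removal.
threshold = 0.2
weak = corr_with_target[corr_with_target.abs() < threshold].index.tolist()

print(corr_with_target.round(2))
print(weak)
```

A weak linear correlation is only a hint: a feature that is zero most of the time can still matter a lot in the few hours when it is not, so comparing model scores with and without the column is a safer final check.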

In [32]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,4))
sns.regplot(x='Wind speed', y='Bikes count (Target)', data=df, ax=ax1,scatter_kws={"color": "darkred"},line_kws={"color": "#FFD880"})
sns.regplot(x='Temperature', y='Bikes count (Target)', data=df, ax=ax2,scatter_kws={"color": "darkred"},line_kws={"color": "#FFD880"});


> **Observation**
> 
> These regression plots show:
> * Bikes count and Wind speed have a **weak positive** correlation.
> * Bikes count and Temperature have a **strong positive** correlation.

**3. 3. Data Analysis for Categorical Data**

> **Strip plots for categorical data:**
> 
> Strip plots are essentially a type of scatter plot. They are used to show the spread of observations within each category.

In [33]:
fig, (ax1,ax2, ax3) = plt.subplots(1,3, figsize=(18,5))
sns.stripplot(x=df['Seasons'],y=df['Bikes count (Target)'],palette='nipy_spectral',linewidth=0.5, alpha=0.6,ax=ax1)
sns.stripplot(x=df['Functioning Day'],y=df['Bikes count (Target)'],palette='nipy_spectral',linewidth=0.5, alpha=0.6,ax=ax2)
sns.stripplot(x=df['Holiday'],y=df['Bikes count (Target)'],linewidth=0.5, alpha=0.6,ax=ax3);

> **Observation:**
> 
> * Bikes count and Seasons: Summer is the season with the most rentals.
> * Bikes count and Functioning Day: no bikes are rented on non-functioning days.
> * Bikes count and Holiday: holidays show lower rental counts than non-holidays.
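
Readings taken from a strip plot can be cross-checked with a group-wise summary. The sketch below uses a synthetic frame; the category labels (`"Yes"`/`"No"`, `"Holiday"`/`"No Holiday"`) and the counts are assumptions made for illustration.

```python
import pandas as pd

# Toy rentals with the two categorical flags (synthetic values, assumed labels).
toy = pd.DataFrame({
    "Functioning Day": ["Yes", "Yes", "No", "No", "Yes"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday", "Holiday", "No Holiday"],
    "y": [400, 150, 0, 0, 650],
})

# Row count, mean and maximum of the target within each category.
by_functioning = toy.groupby("Functioning Day")["y"].agg(["count", "mean", "max"])
by_holiday = toy.groupby("Holiday")["y"].agg(["count", "mean", "max"])

print(by_functioning)
print(by_holiday)
```

A maximum of zero within a group confirms "no rentals at all", while a lower mean only indicates fewer rentals, and the two cases call for different handling in a model.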

**Scatterplots:**

In [34]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,5))
sns.scatterplot(x='Temperature', y='Bikes count (Target)',hue='Seasons', data=df,ax=ax1)
sns.scatterplot(x='Temperature', y='Bikes count (Target)',hue='Holiday', data=df, ax=ax2)
plt.show()

**Barplots:**

In [35]:
plt.figure(figsize=(14,7))
sns.barplot(x='Hour', y='Bikes count (Target)', hue= 'Holiday',palette='Reds',data=df);

> **Observation:**
> 
> * Hour 18 has the highest rentals on **No holiday** days.

In [36]:
plt.figure(figsize=(14,7))
sns.barplot(x='Hour', y='Bikes count (Target)', hue= 'Seasons',data=df);

> **Observation:**
> 
> * Hour 18 has the highest rentals in **summer**.
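
By default `sns.barplot` draws the mean of the target in each group, so the peak hour can also be read from a grouped mean. A minimal sketch on synthetic values (the numbers and the names `hourly_mean` and `peak_hour` are made up):

```python
import pandas as pd

# Toy rentals for three hours in two seasons (synthetic values).
toy = pd.DataFrame({
    "Seasons": ["Summer"] * 3 + ["Winter"] * 3,
    "Hour": [8, 18, 23] * 2,
    "y": [900, 1500, 600, 200, 300, 100],
})

# Mean rentals per hour, one column per season.
hourly_mean = toy.groupby(["Seasons", "Hour"])["y"].mean().unstack("Seasons")

# Hour with the highest mean in each season.
peak_hour = hourly_mean.idxmax()
print(peak_hour)
```

The same pattern works with `Holiday` or `Month` in place of `Seasons`.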

In [37]:
plt.figure(figsize=(14,7))
sns.barplot(x=df['Month'],y=df['Bikes count (Target)'],palette='nipy_spectral')
plt.show()

> **Observation**
> * The barplot shows that **June and July** have the highest number of rental bikes.

In [38]:
plt.figure(figsize=(14,7))
sns.barplot(x=df['Month'],y=df['Bikes count (Target)'], hue = df['Holiday'], palette='Blues')
plt.show()

In [39]:
plt.figure(figsize=(14,7))
sns.barplot(x=df['Year'],y=df['Bikes count (Target)'])
plt.show()