- In this study, we perform Exploratory Data Analysis (EDA) on the London Bike Share dataset.
- The study aims to be beginner friendly and explains each step along the way.
- The dataset has 17414 instances, each with the bike share count, temperature, and other features.
- The data covers bike shares in London from 2015 to 2017.

- 'Ride into a wise, healthy world that’s eco-friendly, efficient, and fun.' from the https://www.pbsc.com/about-us website


- Let's import the required libraries

In [20]:
import pandas as pd
import numpy as np

import plotly.io as pio
pio.renderers.default = 'iframe'

import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

### Overview Stage

- Read the csv
- Look for basic information about the dataset

In [12]:
df = pd.read_csv('london_merged.csv')
df.head()

Unnamed: 0,timestamp,cnt,t1,t2,hum,wind_speed,weather_code,is_holiday,is_weekend,season
0,2015-01-04 00:00:00,182,3.0,2.0,93.0,6.0,3.0,0.0,1.0,3.0
1,2015-01-04 01:00:00,138,3.0,2.5,93.0,5.0,1.0,0.0,1.0,3.0
2,2015-01-04 02:00:00,134,2.5,2.5,96.5,0.0,1.0,0.0,1.0,3.0
3,2015-01-04 03:00:00,72,2.0,2.0,100.0,0.0,1.0,0.0,1.0,3.0
4,2015-01-04 04:00:00,47,2.0,0.0,93.0,6.5,1.0,0.0,1.0,3.0


Metadata:
- "timestamp" - timestamp field for grouping the data
- "cnt" - the count of new bike shares
- "t1" - real temperature in C
- "t2" - "feels like" temperature in C
- "hum" - humidity in percentage
- "wind_speed" - wind speed in km/h
- "weather_code" - category of the weather
- "is_holiday" - boolean field - 1 holiday / 0 non holiday
- "is_weekend" - boolean field - 1 if the day is weekend
- "season" - category field for meteorological seasons: 0-spring; 1-summer; 2-fall; 3-winter.

- "weather_code" category description:
   - 1 = Clear; mostly clear but have some values with haze/fog/patches of fog/fog in vicinity
   - 2 = Scattered clouds / few clouds
   - 3 = Broken clouds
   - 4 = Cloudy
   - 7 = Rain / light rain shower / light rain
   - 10 = Rain with thunderstorm
   - 26 = Snowfall
   - 94 = Freezing fog
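- The code list above can be kept as a lookup table so that plots and tables show readable names instead of numbers. The following is a minimal sketch: the dictionary name `WEATHER_LABELS` and the short labels are illustrative choices, and the example codes are a hand-made sample rather than the real column.

```python
import pandas as pd

# Labels condensed from the category description above; the dict name is illustrative.
WEATHER_LABELS = {
    1: "Clear",
    2: "Scattered clouds",
    3: "Broken clouds",
    4: "Cloudy",
    7: "Rain",
    10: "Rain with thunderstorm",
    26: "Snowfall",
    94: "Freezing fog",
}

# A few example codes stored as floats, the way the CSV loads them.
codes = pd.Series([3.0, 1.0, 7.0, 26.0])

# Cast to int first so the float codes line up with the integer keys.
weather_names = codes.astype(int).map(WEATHER_LABELS)
print(weather_names.tolist())
```

- On the real data the same idea would be applied to the 'weather_code' column, for example to create a readable helper column.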

In [3]:
df.shape

(17414, 10)

- We have 17414 instances with 10 different variables to work on.

In [4]:
df.isnull().sum()

timestamp       0
cnt             0
t1              0
t2              0
hum             0
wind_speed      0
weather_code    0
is_holiday      0
is_weekend      0
season          0
dtype: int64

- Yes, the data is very clean: there are no missing values in any of the 17414 instances.
- In the real world it is very hard to find data this clean. Enjoy!

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17414 entries, 0 to 17413
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   timestamp     17414 non-null  object 
 1   cnt           17414 non-null  int64  
 2   t1            17414 non-null  float64
 3   t2            17414 non-null  float64
 4   hum           17414 non-null  float64
 5   wind_speed    17414 non-null  float64
 6   weather_code  17414 non-null  float64
 7   is_holiday    17414 non-null  float64
 8   is_weekend    17414 non-null  float64
 9   season        17414 non-null  float64
dtypes: float64(8), int64(1), object(1)
memory usage: 1.3+ MB


- It looks like we have 9 numeric variables. But is that so?
- We also have 1 non-numeric variable.
- The non-numeric variable, timestamp, is stored as object but holds date-time values. It needs further adjustment. Noted.
- The boolean variables (is_holiday, is_weekend) are coded as 0 and 1. Noted.
- The categorical variables **season** and **weather_code** are also coded as numbers. Noted.
- "t1" (real temperature in C) and "t2" ("feels like" temperature in C) seem to measure nearly the same thing, so we need to check their correlation. Noted.
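- One possible way to act on the notes about flags and coded categories is to give those columns explicit dtypes. This is only a sketch on a tiny hand-made frame (the name `sample` and its values are assumptions); the rest of this study keeps the original float coding.

```python
import pandas as pd

# Tiny stand-in frame with the same column names and float encoding as the dataset.
sample = pd.DataFrame({
    "is_holiday": [0.0, 0.0, 1.0],
    "is_weekend": [1.0, 0.0, 0.0],
    "season": [3.0, 0.0, 1.0],
    "weather_code": [3.0, 1.0, 7.0],
})

# Flags become real booleans; coded categories become the pandas category dtype.
typed = sample.astype({
    "is_holiday": bool,
    "is_weekend": bool,
    "season": "category",
    "weather_code": "category",
})
print(typed.dtypes)
```

- With explicit dtypes, `describe()` and `info()` no longer present these columns as if they were continuous measurements.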

In [8]:
df.drop(['season', 'weather_code', 'is_holiday','is_weekend'], axis=1).describe()

Unnamed: 0,cnt,t1,t2,hum,wind_speed
count,17414.0,17414.0,17414.0,17414.0,17414.0
mean,1143.101642,12.468091,11.520836,72.324954,15.913063
std,1085.108068,5.571818,6.615145,14.313186,7.89457
min,0.0,-1.5,-6.0,20.5,0.0
25%,257.0,8.0,6.0,63.0,10.0
50%,844.0,12.5,12.5,74.5,15.0
75%,1671.75,16.0,16.0,83.0,20.5
max,7860.0,34.0,34.0,100.0,56.5


Before going further, let's summarize what we have learned about the dataset.

- Our dataset has 17414 hourly records of bike shares.
- "t1" (real temperature in C) and "t2" ("feels like" temperature in C) seem to measure nearly the same thing, so we need to check their correlation. We need to be careful about multicollinearity.

- The timestamp is stored as an object and needs to be converted to a datetime.

- The numerically coded variables (season and weather_code) can be used as groups to compare bike share counts.

- 'cnt', the count of bike shares, will be our target variable.

- The numerical columns most probably have outliers (judging by the gap between mean and median, between the 75% value and the maximum, and between the 25% value and the minimum), so we have to check them.

- Let's make the necessary adjustments before moving to the analysis part.
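- The outlier suspicion in the summary can be made concrete with the common 1.5 * IQR rule, which is also what a standard box plot uses for its whiskers. The helper name `iqr_outlier_bounds` and the sample values below are hypothetical; this is a sketch of the rule, not a step the study itself runs.

```python
import pandas as pd

def iqr_outlier_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hand-made sample with one obviously extreme value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

lower, upper = iqr_outlier_bounds(values)
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```

- Applying the same arithmetic to the 'cnt' quartiles from the describe() table (25% = 257, 75% = 1671.75) gives an upper fence of 1671.75 + 1.5 * 1414.75 = 3793.875, so hourly counts above roughly 3794 would be flagged by this rule.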

#### **Temperature**

- Let's check the correlation between the real temperature and the "feels like" temperature.
- If the correlation is high, the two variables are effectively redundant (multicollinearity), and keeping only one of them is the usual choice when building a model.
- Even though this study is only a detailed EDA, it is still good practice to follow.

In [39]:
df['t1'].corr(df['t2'])

0.9883442218765803

- The correlation is extremely high (0.988), so we will use only "t1", the real temperature in C, in our analysis.
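- The same check can be generalized to every pair of numeric columns. The sketch below uses a hypothetical helper, `highly_correlated_pairs`, and a small synthetic frame in which 't2' closely follows 't1'; the 0.9 threshold is an arbitrary illustrative choice.

```python
import pandas as pd

def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9):
    """List (col_a, col_b, r) for column pairs whose |Pearson r| exceeds threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, float(r)))
    return pairs

# Synthetic stand-in: t2 tracks t1 closely, hum does not.
demo = pd.DataFrame({
    "t1": [3.0, 5.0, 8.0, 12.0, 16.0, 20.0],
    "t2": [2.0, 4.5, 7.0, 12.0, 16.0, 20.5],
    "hum": [80.0, 60.0, 90.0, 55.0, 85.0, 65.0],
})

pairs = highly_correlated_pairs(demo, threshold=0.9)
print(pairs)
```

- Only the t1/t2 pair is reported for this synthetic frame, which mirrors the decision to keep a single temperature column.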

#### **timestamp**

- Let's convert 'timestamp' to a datetime object and use its values to create new columns.

In [13]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df= df.set_index('timestamp')

In [14]:
df['year_month']= df.index.strftime('%Y-%m')
df['year'] = df.index.year
df['month']= df.index.month
df['day_of_month']= df.index.day
df['day_of_week']=df.index.dayofweek
df['hour']=df.index.hour

df.head()

timestamp,cnt,t1,t2,hum,wind_speed,weather_code,is_holiday,is_weekend,season,year_month,year,month,day_of_month,day_of_week,hour
2015-01-04 00:00:00,182,3.0,2.0,93.0,6.0,3.0,0.0,1.0,3.0,2015-01,2015,1,4,6,0
2015-01-04 01:00:00,138,3.0,2.5,93.0,5.0,1.0,0.0,1.0,3.0,2015-01,2015,1,4,6,1
2015-01-04 02:00:00,134,2.5,2.5,96.5,0.0,1.0,0.0,1.0,3.0,2015-01,2015,1,4,6,2
2015-01-04 03:00:00,72,2.0,2.0,100.0,0.0,1.0,0.0,1.0,3.0,2015-01,2015,1,4,6,3
2015-01-04 04:00:00,47,2.0,0.0,93.0,6.5,1.0,0.0,1.0,3.0,2015-01,2015,1,4,6,4


- Seems much better
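- As an alternative to converting after loading, the timestamp can be parsed while reading the file. The sketch below feeds the first three rows shown in the head() output through `io.StringIO`, so it runs without the CSV file; the variable names are illustrative.

```python
import io
import pandas as pd

# First rows of the dataset as shown in the head() output earlier.
csv_text = """timestamp,cnt,t1,t2,hum,wind_speed,weather_code,is_holiday,is_weekend,season
2015-01-04 00:00:00,182,3.0,2.0,93.0,6.0,3.0,0.0,1.0,3.0
2015-01-04 01:00:00,138,3.0,2.5,93.0,5.0,1.0,0.0,1.0,3.0
2015-01-04 02:00:00,134,2.5,2.5,96.5,0.0,1.0,0.0,1.0,3.0
"""

# parse_dates + index_col yields a DatetimeIndex directly at load time.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"], index_col="timestamp")

# Derived columns work the same way as in the cells above.
sample["hour"] = sample.index.hour
sample["day_of_week"] = sample.index.dayofweek  # Monday=0 ... Sunday=6
print(sample[["cnt", "hour", "day_of_week"]])
```

- 2015-01-04 was a Sunday, so day_of_week is 6, matching the values in the table above.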

#### Look at the **season** and **weather_code** 

In [15]:
df['season'].value_counts()

0.0    4394
1.0    4387
3.0    4330
2.0    4303
Name: season, dtype: int64

- That's good: the seasons are evenly represented, so they can be used as groups to compare bike share counts.

In [17]:
df['weather_code'].value_counts()

1.0     6150
2.0     4034
3.0     3551
7.0     2141
4.0     1464
26.0      60
10.0      14
Name: weather_code, dtype: int64

- It seems OK, can be used in a groupby.

### Analysis Part

### **Correlation**

In [118]:
df[['cnt','t1','hum','wind_speed']].corr()

Unnamed: 0,cnt,t1,hum,wind_speed
cnt,1.0,0.388798,-0.462901,0.116295
t1,0.388798,1.0,-0.447781,0.145471
hum,-0.462901,-0.447781,1.0,-0.287789
wind_speed,0.116295,0.145471,-0.287789,1.0


In [119]:
index_vals = df['season'].astype('category').cat.codes

fig = go.Figure(data=go.Splom(
                dimensions=[dict(label='Number of Bike Share',
                                 values=df['cnt']),
                            dict(label='Temperature',
                                 values=df['t1']),
                            dict(label='Humidity',
                                 values=df['hum']),
                           dict(label='Wind Speed',
                                 values=df['wind_speed'])],
                showupperhalf=False, 
                text=df['season'],
                marker=dict(color=index_vals,
                            showscale=False,
                            line_color='white', line_width=0.5)
                ))


fig.update_layout(
    title='Bike Share in London',
    width=1000,
    height=1000,
)

fig.show()

- Based on the correlation matrix:
    - There is a weak positive relationship (0.389) between temperature and the number of bike shares.
    - There is a somewhat stronger negative relationship (-0.463) between humidity and the number of bike shares.

#### **Season**

In [18]:
df['season'].value_counts(normalize=True)

0.0    0.252326
1.0    0.251924
3.0    0.248651
2.0    0.247100
Name: season, dtype: float64

- The dataset contains almost the same number of instances for each of the four seasons.

In [25]:
fig = px.bar(x= df['season'].value_counts().index, y=df['season'].value_counts().values, 
             title='Seasons', labels={'y':'Count', 'x':'Seasons'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

#### **weather_code**

In [26]:
df['weather_code'].value_counts(normalize=True)

1.0     0.353164
2.0     0.231653
3.0     0.203916
7.0     0.122947
4.0     0.084070
26.0    0.003446
10.0    0.000804
Name: weather_code, dtype: float64

- 35% of the time, the weather code is 'clear' (1.0)
- 23% of the time, the weather code is 'scattered clouds, few clouds'
- 20% of the time, the weather code is 'broken clouds'
- 12% of the time, it is 'rain, light rain'

- By the way, remember that we are looking at London data, so rain and clouds are to be expected.

In [55]:
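# Pair each count with its own label so the names follow the value_counts() order
labels = {1: 'Clear', 2: 'Scattered clouds', 3: 'Broken clouds', 4: 'Cloudy',
          7: 'Rain', 10: 'Rain with thunderstorm', 26: 'Snowfall', 94: 'Freezing fog'}
weather_counts = df['weather_code'].value_counts()
fig = px.pie(values=weather_counts.values, names=[labels[int(code)] for code in weather_counts.index])
fig.show()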



#### **Count of a New Bike Shares**

In [31]:
df['cnt'].describe()

count    17414.000000
mean      1143.101642
std       1085.108068
min          0.000000
25%        257.000000
50%        844.000000
75%       1671.750000
max       7860.000000
Name: cnt, dtype: float64

- There is a large difference between the mean and the median (mean = 1143, median = 844).
- We can therefore expect a highly right-skewed distribution with possible outliers on the maximum side.
- Let's see it.
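- The mean-versus-median comparison can be backed by a single number, the sample skewness, which pandas exposes as `Series.skew()`. The values below are a synthetic right-skewed sample, not the real 'cnt' column.

```python
import pandas as pd

# Synthetic right-skewed sample: most values are small, a few are very large.
counts = pd.Series([50, 120, 200, 260, 400, 850, 900, 1700, 3500, 7800])

# A mean well above the median hints at a long right tail.
mean_minus_median = counts.mean() - counts.median()

# Positive skewness points the same way; negative would suggest a left tail.
skewness = counts.skew()
print(mean_minus_median, skewness)
```

- On the real data, `df['cnt'].skew()` would be the corresponding call; a clearly positive value would agree with the right-skew expectation above.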

In [32]:
fig = px.histogram(df, x='cnt', title='Count of a New Bike Shares', marginal="box", hover_data=['season'])
fig.show()

- As expected, the distribution is highly right skewed, with outliers on the maximum side.

- All of the extreme outliers (counts of 5560 and above) are in season 1, which is summer.

- Any surprise?

#### **real temperature in C**

In [40]:
df['t1'].describe()

count    17414.000000
mean        12.468091
std          5.571818
min         -1.500000
25%          8.000000
50%         12.500000
75%         16.000000
max         34.000000
Name: t1, dtype: float64

- The mean and median are very close to each other; the median is slightly higher than the mean.
- So we can expect a very slightly left-skewed distribution.
- The distribution should still be close to a normal distribution, with several outliers.
- Let's see it.

In [42]:
fig = px.histogram(df, x='t1', title='Temperatures', marginal="box", hover_data=['season'])
fig.show()

- As we expected, the distribution is close to normal, with several outliers.
- As the box plot shows more clearly, the distribution is very slightly left skewed.

#### **Wind Speed**

In [43]:
df['wind_speed'].describe()

count    17414.000000
mean        15.913063
std          7.894570
min          0.000000
25%         10.000000
50%         15.000000
75%         20.500000
max         56.500000
Name: wind_speed, dtype: float64

- We can expect a slightly right-skewed distribution (mean = 15.9, median = 15).
- It should be very close to a normal distribution.
- We can expect outliers on the maximum side.

In [44]:
fig = px.histogram(df, x='wind_speed', title='Wind Speed', marginal="box", hover_data=['season'])
fig.show()

- As we expected, there are several outliers on the right side.
- The distribution is slightly right skewed.

#### **Humidity**

In [45]:
df['hum'].describe()

count    17414.000000
mean        72.324954
std         14.313186
min         20.500000
25%         63.000000
50%         74.500000
75%         83.000000
max        100.000000
Name: hum, dtype: float64

- The mean and median are close to each other.
- Since the median is a little higher than the mean, we can expect a slightly left-skewed distribution.
- There may be outliers on the minimum side.

In [46]:
fig = px.histogram(df, x='hum', title='Humidity', marginal="box", hover_data=['season'])
fig.show()

- As we expected, the distribution is left skewed, with outliers on the left side.

#### **Holiday or Not?**

In [47]:
df['is_holiday'].value_counts()

0.0    17030
1.0      384
Name: is_holiday, dtype: int64

In [56]:
fig = px.pie(values=df['is_holiday'].value_counts().values, 
             names= ['Normal Day','Holiday'] )
fig.show()

#### **Weekend or Not?**

In [60]:
df['is_weekend'].value_counts()

In [58]:
fig = px.pie(values=df['is_weekend'].value_counts().values, 
             names= ['Weekday','Weekend'] )
fig.show()

- OK, let's go deeper.

### **Bike Share by Year**

In [115]:
fig = px.scatter(df, x="year", y="cnt")
fig.show()

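- The scatter plot suggests a drop in bike share counts in 2017. However, 17414 hourly records starting on 2015-01-04 span only about two years, so 2017 can contain at most a few days of data; the apparent decrease most likely reflects this partial coverage rather than a real decline.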
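- Before reading a trend from a per-year plot, it can help to check how many records each year contributes. The sketch below uses a synthetic hourly index of five days that straddles a year boundary (the dates and counts are made up for illustration) and counts records per year.

```python
import numpy as np
import pandas as pd

# Synthetic hourly index: 5 days straddling a year boundary (illustrative only).
start = pd.Timestamp("2016-12-30 00:00:00")
index = start + pd.to_timedelta(np.arange(120), unit="h")
demo = pd.DataFrame({"cnt": np.arange(120)}, index=index)

# Number of records and mean count per calendar year.
per_year = demo.groupby(demo.index.year)["cnt"].agg(["count", "mean"])
print(per_year)
```

- On the real data, the equivalent call would group by `df.index.year` (or the 'year' column); a year with far fewer records than the others should not be compared on raw counts.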

### **Bike Share by Year and Months**

In [107]:
fig = px.scatter(df, x="year_month", y="cnt")
fig.show()

- As the scatter plot shows, bike shares increase significantly during the summer months.
- During the winter months, they decrease.

#### **Bike Share by Seasons**

In [77]:
df['season1']= df['season'].replace({0:'Spring',1:'Summer',2:'Fall',3:'Winter'})
fig = px.bar(df, x='season1', y= 'cnt',  hover_data=['year_month'], color='season1', 
             labels={'season1':'Seasons','cnt':'Number of Bike Share'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- What we saw by year_month also holds by season.
- Bike shares peak in the summer and reach their lowest point in the winter.

#### **Bike Share During the Holiday**

In [98]:
holiday = df.groupby('is_holiday')['cnt'].mean().reset_index().rename(columns={'is_holiday': 'Holiday', 'cnt':'Number of Bike Shared'}, )
holiday['Holiday']= holiday['Holiday'].replace({0: 'Normal Day', 1:'Holiday'})

fig = px.bar(holiday, x='Holiday', y= 'Number of Bike Shared', color='Holiday', )
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Normal days have a higher average hourly bike share count than holidays.

#### **Bike Share During the Weekend**

In [100]:
weekend = df.groupby('is_weekend')['cnt'].mean().reset_index().rename(columns={'is_weekend': 'Weekend', 'cnt':'Number of Bike Shared'}, )
weekend['Weekend']= weekend['Weekend'].replace({0: 'Weekday', 1:'Weekend'})

fig = px.bar(weekend, x='Weekend', y= 'Number of Bike Shared', color='Weekend', )
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Weekdays have a higher average hourly bike share count than weekends.

#### **Bike Shares by Hour**

In [110]:
fig = px.scatter(df, x="hour", y="cnt", color='is_holiday')
fig.show()

- The peak hours for bike sharing are 8-10 in the morning and 17-18 in the afternoon.
- We can speculate about this result: for example, the peaks may correspond to trips to work or school in the morning and back home afterwards.
- Still, we would need more variables to support these assumptions.

In [111]:
fig = px.scatter(df, x="hour", y="cnt", color='is_weekend')
fig.show()

- The weekend shows a different pattern.
- On weekends, the peak time for bike sharing is between 10 and 16.
- And yes, even at midnight somebody needs a ride!
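- Peak hours can also be read from a small table of average counts per hour, split by the weekend flag, instead of from the scatter plot. The frame below is a hypothetical mini-sample with made-up counts; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical mini-sample: two hours, weekday and weekend, two records each.
demo = pd.DataFrame({
    "hour":       [8, 8, 13, 13, 8, 8, 13, 13],
    "is_weekend": [0, 0, 0, 0, 1, 1, 1, 1],
    "cnt":        [3000, 3400, 1000, 1200, 300, 500, 1800, 2000],
})

# Rows are hours, columns are the weekend flag, cells are mean counts.
hourly_mean = demo.groupby(["is_weekend", "hour"])["cnt"].mean().unstack("is_weekend")

# Hour with the highest mean count for weekdays (0) and weekends (1).
peak_hours = hourly_mean.idxmax()
print(hourly_mean)
print(peak_hours)
```

- Averaging first makes the weekday and weekend profiles directly comparable, because the number of weekday records is much larger than the number of weekend records.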

In [114]:
fig = px.scatter(df, x="day_of_week", y="cnt", color='is_weekend', hover_data=['hour'])
fig.show()

- Except for Thursdays, the distribution is almost the same across the weekdays.
- Thursdays have peaks at 8 a.m. and in the afternoon between 16 and 18.
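- Day-by-day differences like these are easier to compare in an hour by day-of-week table of mean counts. The sketch below builds such a table with `pivot_table` on a hand-made frame; the values are invented and only the column names follow the dataset.

```python
import pandas as pd

# Hand-made records: Monday (0), Thursday (3) and Saturday (5).
demo = pd.DataFrame({
    "day_of_week": [0, 0, 3, 3, 3, 5],
    "hour":        [8, 17, 8, 8, 17, 12],
    "cnt":         [3000, 2800, 3500, 3700, 3000, 1500],
})

# Mean count for each (hour, day_of_week) combination; missing combinations are NaN.
heat = demo.pivot_table(index="hour", columns="day_of_week", values="cnt", aggfunc="mean")
print(heat)
```

- Each column of the table is one day of the week, so a day whose peak differs from the others stands out when the columns are read side by side.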

- Thanks to the dataset contributor for this data. I really enjoyed working on it.

- It was a real pleasure to share this detailed, beginner-friendly EDA with you. Thanks for your time.

- All the best