
# <span style = "color:blue">Bike Sharing Dataset</span>
==========================================


Hadi Fanaee-T

Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto
INESC Porto, Campus da FEUP
Rua Dr. Roberto Frias, 378
4200 - 465 Porto, Portugal




=========================================
# <span style = "color:blue">Data Set</span>
=========================================

Bike rentals depend strongly on environmental and seasonal factors such as the day of the week, the hour of the day, and the weather conditions. The dataset is a two-year historical log (2011 and 2012) from the Capital Bikeshare system, Washington D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data. The data has been aggregated on both an hourly and a daily basis; this analysis uses only the hourly dataset. Weather information was extracted from http://www.freemeteo.com.


=========================================
# <span style = "color:blue">License </span>
=========================================

Publications that use this dataset must cite the following publication:

[1] Fanaee-T, Hadi, and Gama, Joao, "Event labeling combining ensemble detectors and background knowledge", Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.

@article{
	year={2013},
	issn={2192-6352},
	journal={Progress in Artificial Intelligence},
	doi={10.1007/s13748-013-0040-3},
	title={Event labeling combining ensemble detectors and background knowledge},
	url={http://dx.doi.org/10.1007/s13748-013-0040-3},
	publisher={Springer Berlin Heidelberg},
	keywords={Event labeling; Event detection; Ensemble learning; Background knowledge},
	author={Fanaee-T, Hadi and Gama, Joao},
	pages={1-15}
}



=========================================
# <span style = "color:blue">Dataset characteristics</span>
=========================================	
We have chosen hour.csv for the analysis. The variables in the dataset are listed below.

**instant** : record index

**dteday** : date

**season** : season (1:spring, 2:summer, 3:fall, 4:winter)

**yr** : year (0: 2011, 1:2012)

**mnth** : month ( 1 to 12)

**hr** : hour (0 to 23)

**holiday** : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)

**weekday** : day of the week (0 = Sunday, 1 = Monday, ..., 6 = Saturday)

**workingday** : 1 if the day is neither a weekend nor a holiday, otherwise 0.

**weathersit** : weather situation, coded as

- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
        
**temp** : Normalized temperature in Celsius. The values are divided by 41 (max)

**atemp**: Normalized feeling temperature in Celsius. The values are divided by 50 (max)

**hum**: Normalized humidity. The values are divided by 100 (max)

**windspeed**: Normalized wind speed. The values are divided by 67 (max)

**casual**: count of casual users

**registered**: count of registered users

**cnt**: count of total rental bikes including both casual and registered
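The normalized weather columns can be mapped back to their original units by multiplying with the divisors listed above. The sketch below assumes that "divided by N (max)" means `value = raw / N`; the helper name `denormalize` and the `*_raw` column names are made up for illustration, and the two sample rows are copied from hour.csv.

```python
import pandas as pd

# Divisors quoted in the variable descriptions above.
SCALE = {"temp": 41, "atemp": 50, "hum": 100, "windspeed": 67}


def denormalize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with extra *_raw columns in the original units.

    Assumption: each normalized value is simply raw / divisor.
    """
    out = df.copy()
    for column, divisor in SCALE.items():
        out[f"{column}_raw"] = out[column] * divisor
    return out


# Two rows of hour.csv (weather columns only).
sample = pd.DataFrame(
    {
        "temp": [0.24, 0.26],
        "atemp": [0.2879, 0.2727],
        "hum": [0.81, 0.56],
        "windspeed": [0.0, 0.1343],
    }
)
restored = denormalize(sample)
print(restored.filter(like="_raw").round(2))
```

Under that assumption a `temp` of 0.24 corresponds to roughly 9.8 degrees Celsius.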


<span style="background-color: #DAF7A6">The aim of this project is to analyze the hour.csv dataset. The analysis looks at how registered and casual users rent bicycles by hour, day, weekday, month, season, year, holiday and weather. We will also look at how the different variables correlate with bicycle use.</span>




# <span style="background-color: #CBBDB6">Importing Libraries</span>

In [1]:
import pandas as pd
import numpy as np
#%matplotlib inline
#import matplotlib
#import matplotlib.pyplot as plt
#import seaborn as sns

from plotly.offline import iplot, init_notebook_mode
import cufflinks as cf
init_notebook_mode(connected=True)
cf.go_offline(connected=True)
cf.set_config_file(theme="pearl")


# <span style="background-color: #CBBDB6">Loading dataset</span>

In [2]:
bike_share = pd.read_csv("hour.csv",sep=',',index_col='instant')
bike_share

| instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0000 | 3 | 13 | 16 |
| 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 | 8 | 32 | 40 |
| 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 | 5 | 27 | 32 |
| 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 | 3 | 10 | 13 |
| 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17375 | 2012-12-31 | 1 | 1 | 12 | 19 | 0 | 1 | 1 | 2 | 0.26 | 0.2576 | 0.60 | 0.1642 | 11 | 108 | 119 |
| 17376 | 2012-12-31 | 1 | 1 | 12 | 20 | 0 | 1 | 1 | 2 | 0.26 | 0.2576 | 0.60 | 0.1642 | 8 | 81 | 89 |
| 17377 | 2012-12-31 | 1 | 1 | 12 | 21 | 0 | 1 | 1 | 1 | 0.26 | 0.2576 | 0.60 | 0.1642 | 7 | 83 | 90 |
| 17378 | 2012-12-31 | 1 | 1 | 12 | 22 | 0 | 1 | 1 | 1 | 0.26 | 0.2727 | 0.56 | 0.1343 | 13 | 48 | 61 |


## <span style="background-color: #CBBDB6">Dropping some columns</span>

In [3]:
bike_share= bike_share.drop(columns=['yr','mnth'])

In [4]:
bike_share.head()

| instant | dteday | season | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2011-01-01 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 2 | 2011-01-01 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 3 | 2011-01-01 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 4 | 2011-01-01 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 5 | 2011-01-01 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |


<span style="background-color: #DAF7A6">From the table above, we can see that the data was collected every hour, 24 times a day, starting at 00:00 on January 1, 2011, which was a Saturday (weekday code 6).</span>
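The weekday coding can be cross-checked against the calendar. This is a small sketch that assumes the `weekday` column counts from 0 = Sunday to 6 = Saturday; the helper `dataset_weekday` is hypothetical and simply shifts the pandas convention (Monday = 0) by one day.

```python
import pandas as pd


def dataset_weekday(date: str) -> int:
    """Weekday code with 0 = Sunday ... 6 = Saturday.

    Assumption: this is the convention used by the `weekday` column.
    pandas itself counts from Monday = 0, hence the shift by one.
    """
    return (pd.Timestamp(date).dayofweek + 1) % 7


# First and last dates that appear in hour.csv.
for date in ("2011-01-01", "2012-12-31"):
    print(date, pd.Timestamp(date).day_name(), dataset_weekday(date))
```

Both dates give the codes stored in the dataset for those days (6 and 1), which supports the assumed convention.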

# <span style="background-color: #CBBDB6">Data Structure</span>

In [5]:
# Let's see the shape of the dataset
bike_share.shape

(17379, 14)

In [6]:
# Let's check the columns
bike_share.columns

Index(['dteday', 'season', 'hr', 'holiday', 'weekday', 'workingday',
       'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual',
       'registered', 'cnt'],
      dtype='object')

In [7]:
# Let's check the information of the dataset
bike_share.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17379 entries, 1 to 17379
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   dteday      17379 non-null  object 
 1   season      17379 non-null  int64  
 2   hr          17379 non-null  int64  
 3   holiday     17379 non-null  int64  
 4   weekday     17379 non-null  int64  
 5   workingday  17379 non-null  int64  
 6   weathersit  17379 non-null  int64  
 7   temp        17379 non-null  float64
 8   atemp       17379 non-null  float64
 9   hum         17379 non-null  float64
 10  windspeed   17379 non-null  float64
 11  casual      17379 non-null  int64  
 12  registered  17379 non-null  int64  
 13  cnt         17379 non-null  int64  
dtypes: float64(4), int64(9), object(1)
memory usage: 2.0+ MB



<span style="background-color: #DAF7A6">From bike_share.info(), we can see that there are 17379 rows and 14 columns. Three data types are used: object, int64 and float64. There are no null values anywhere in the dataset.</span>
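The same facts can be pulled out programmatically instead of being read off the printed summary. The helper `summarize_columns` below is a hypothetical name, and the excerpt holds only three columns from the first two rows of hour.csv, so it is a sketch of the check rather than the check itself.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: its dtype and how many values are missing."""
    return pd.DataFrame(
        {"dtype": df.dtypes.astype(str), "missing": df.isna().sum()}
    )


# Three columns from the first two rows of hour.csv.
excerpt = pd.DataFrame(
    {"dteday": ["2011-01-01", "2011-01-01"], "temp": [0.24, 0.22], "cnt": [16, 40]}
)
summary = summarize_columns(excerpt)
print(summary)
print("any missing values:", bool(summary["missing"].any()))
```

On the full frame, `summarize_columns(bike_share)` would list all 14 columns in the same way.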

In [8]:
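# Convert the date column to datetime format
# (pd.to_datetime is used because newer pandas releases reject the unit-less 'datetime64' dtype)
bike_share['dteday'] = pd.to_datetime(bike_share['dteday'])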


In [9]:
# Let us check the description of the dataset
bike_share.describe()

| | season | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 |
| mean | 2.50164 | 11.546752 | 0.02877 | 3.003683 | 0.682721 | 1.425283 | 0.496987 | 0.475775 | 0.627229 | 0.190098 | 35.676218 | 153.786869 | 189.463088 |
| std | 1.106918 | 6.914405 | 0.167165 | 2.005771 | 0.465431 | 0.639357 | 0.192556 | 0.17185 | 0.19293 | 0.12234 | 49.30503 | 151.357286 | 181.387599 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 2.0 | 6.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.34 | 0.3333 | 0.48 | 0.1045 | 4.0 | 34.0 | 40.0 |
| 50% | 3.0 | 12.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.5 | 0.4848 | 0.63 | 0.194 | 17.0 | 115.0 | 142.0 |
| 75% | 3.0 | 18.0 | 0.0 | 5.0 | 1.0 | 2.0 | 0.66 | 0.6212 | 0.78 | 0.2537 | 48.0 | 220.0 | 281.0 |
| max | 4.0 | 23.0 | 1.0 | 6.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.0 | 0.8507 | 367.0 | 886.0 | 977.0 |


# <span style="background-color: #CBBDB6">Data Analysis</span>

In [10]:
bikes= bike_share.set_index(keys='dteday')

<span style="background-color: #DAF7A6">Bicycle use is closely related to weather, temperature and similar factors, so several variables in the dataset are correlated with each other.</span>

In [11]:
corr_ = bikes.corr()
#corr_

In [12]:
import plotly.figure_factory as ff
fig=ff.create_annotated_heatmap(z=corr_.values,
                           x=list(corr_.columns),
                           y=list(corr_.index),
                           annotation_text=corr_.round(2).values,
                           showscale=True,
                               colorscale='Earth',)
fig.layout.margin=dict(l=200,t=200)
fig.layout.height=800
fig.layout.width=1000
iplot(fig)

<span style="background-color: #DAF7A6">From the correlation heatmap above we can see that humidity and bicycle use are negatively correlated, as are weather situation and bicycle use. Temperature and hour are positively correlated with bicycle use.</span>
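Reading individual cells off a heatmap is error-prone, so it can help to rank the correlations with the rental count directly. The sketch below uses a hypothetical helper `correlation_with` and only the nine rows of hour.csv displayed earlier, so the coefficients it prints are illustrative and will differ from those of the full dataset.

```python
import pandas as pd


def correlation_with(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`, ascending."""
    corr = df.select_dtypes("number").corr()
    return corr[target].drop(target).sort_values()


# The nine rows of hour.csv displayed earlier (four columns only).
excerpt = pd.DataFrame(
    {
        "hr": [0, 1, 2, 3, 4, 19, 20, 21, 22],
        "temp": [0.24, 0.22, 0.22, 0.24, 0.24, 0.26, 0.26, 0.26, 0.26],
        "hum": [0.81, 0.80, 0.80, 0.75, 0.75, 0.60, 0.60, 0.60, 0.56],
        "cnt": [16, 40, 32, 13, 1, 119, 89, 90, 61],
    }
)
ranked = correlation_with(excerpt, "cnt")
print(ranked.round(2))
```

Even in this tiny excerpt humidity comes out negatively correlated with `cnt`, while temperature and hour come out positive. On the full data the call would be `correlation_with(bikes, "cnt")`.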

<span style="background-color: #DAF7A6">Below we look at the number of bikes used on an hourly basis.</span>

In [13]:
# Let's check the unique values of hours
bike_share['hr'].unique()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])

# <span style="background-color: #CBBDB6">Grouping by hour</span>

In [14]:
# numeric_only=True keeps the datetime column 'dteday' out of the sum
hourly = bike_share.groupby('hr').sum(numeric_only=True)
#hourly.head()

In [15]:
# Total rentals by casual and registered users for each hour of the day
hourly[['casual','registered']].iplot(kind='scatter',
                            title='Use of Registered bikes Vs Casual bikes by hour',
                            xTitle='Hours of day',
                            yTitle='Number of users')

<span style="background-color: #DAF7A6">From the figure above, we can conclude that rentals by registered users peak at 17:00 and rentals by casual users peak at 14:00. The number of registered users is much higher than the number of casual users. A plausible explanation is that 17:00 is when the working day ends, so registered users are commuting at that hour. Casual users may largely be tourists who ride during the day, which would explain their peak at 14:00.</span>
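The peak hours can also be computed rather than read off the plot. The helper `peak_hour` is a hypothetical name, and the excerpt contains only the nine rows of hour.csv displayed earlier, so the hours it reports are not the 17:00 and 14:00 peaks of the full dataset.

```python
import pandas as pd


def peak_hour(df: pd.DataFrame, column: str) -> int:
    """Hour of day (0-23) with the largest total in `column`."""
    totals = df.groupby("hr")[column].sum()
    return int(totals.idxmax())


# The nine rows of hour.csv displayed earlier (three columns only).
excerpt = pd.DataFrame(
    {
        "hr": [0, 1, 2, 3, 4, 19, 20, 21, 22],
        "casual": [3, 8, 5, 3, 0, 11, 8, 7, 13],
        "registered": [13, 32, 27, 10, 1, 108, 81, 83, 48],
    }
)
print("registered peak hour in excerpt:", peak_hour(excerpt, "registered"))
print("casual peak hour in excerpt:", peak_hour(excerpt, "casual"))
```

Calling `peak_hour(bike_share, "registered")` and `peak_hour(bike_share, "casual")` on the full frame should reproduce the peaks seen in the figure.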

# <span style="background-color: #CBBDB6">Resampling the dataset by day</span>


In [16]:
bikes= bike_share.set_index(keys='dteday')


In [17]:
daily = bikes.resample('D').sum()
#daily.head()

In [18]:
daily[['registered','casual']].iplot(kind='bar',
                            title='Use of Registered bikes Vs Casual bikes by day',         
                            xTitle='Day',
                            yTitle='Number of users')

## Better to use candle chart for the above figure

<span style="background-color: #DAF7A6">From the bar chart above, we can conclude that bicycle use rose significantly in 2012 compared to 2011. Casual use peaked on May 19, 2012 (3,410 bikes), whereas registered use peaked on September 26, 2012 (6,946 bikes).</span> 
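The busiest day for each user type can be looked up from the daily totals instead of hovering over the chart. The helper `busiest_day` is hypothetical, and the excerpt covers only two partial days of hour.csv, so its result is not the May 19 or September 26 peak of the full data.

```python
import pandas as pd


def busiest_day(df: pd.DataFrame, column: str):
    """Return (date, total) for the day with the highest total in `column`."""
    daily = df.set_index("dteday")[column].resample("D").sum()
    return daily.idxmax(), int(daily.max())


# The nine rows of hour.csv displayed earlier (three columns only).
excerpt = pd.DataFrame(
    {
        "dteday": pd.to_datetime(["2011-01-01"] * 5 + ["2012-12-31"] * 4),
        "casual": [3, 8, 5, 3, 0, 11, 8, 7, 13],
        "registered": [13, 32, 27, 10, 1, 108, 81, 83, 48],
    }
)
for column in ("casual", "registered"):
    day, total = busiest_day(excerpt, column)
    print(f"{column}: {day.date()} with {total} rentals")
```

Days with no records are summed as zero by `resample`, so they never win the maximum.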

# <span style="background-color: #CBBDB6">Grouping by weekday</span>

In [19]:
byWeekday = bike_share.groupby('weekday').sum(numeric_only=True)
#byWeekday

In [20]:
byWeekday[['registered','casual']].iplot(kind='bar',
                            title='Use of Registered bikes Vs Casual bikes by Weekdays',         
                            xTitle='Weekdays (1=Monday)',
                            yTitle='Number of users')

<span style="background-color: #DAF7A6">From the diagram above, we can conclude that registered users ride most on Thursdays, whereas casual users ride most on Saturdays.</span>

# <span style="background-color: #CBBDB6">Resampling by month</span>

In [21]:
monthly = bikes.resample('M').sum()
monthly.head(2)
 

| dteday | season | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-01-31 | 688 | 8168 | 24 | 2018 | 429 | 1014 | 135.82 | 140.1854 | 394.95 | 135.8697 | 3073 | 35116 | 38189 |
| 2011-02-28 | 649 | 7649 | 24 | 1957 | 436 | 944 | 184.3 | 185.4385 | 363.25 | 148.9139 | 6242 | 41973 | 48215 |


In [22]:
monthly[['registered','casual']].iplot(kind='bar',
                            title='Use of Registered bikes Vs Casual bikes by month',         
                            xTitle='Months',
                            yTitle='Number of users')

<span style="background-color: #DAF7A6">The figure above shows that casual use was highest in May 2012, whereas registered use was highest in September 2012.</span>

# <span style="background-color: #CBBDB6">Grouping by season</span>

In [23]:
bySeason = bike_share.groupby('season').sum(numeric_only=True)
#bySeason


In [24]:
#sns.kdeplot(bySeason['casual'],shade=True)
#sns.kdeplot(bySeason['registered'],shade=True)
#plt.show()

In [25]:
bySeason[['registered','casual']].iplot(kind='bar',
                            title='Use of Registered bikes Vs Casual bikes by seasons',         
                            xTitle='Seasons (1:spring, 2:summer, 3:fall, 4:winter)',
                            yTitle='Number of users')

<span style="background-color: #DAF7A6">Both registered and casual users ride most in the fall and comparatively little in spring.</span>

# <span style="background-color: #CBBDB6">Resampling by year</span>

In [26]:
yearly = bikes.resample('Y').sum()
#yearly

In [27]:
yearly[['registered','casual']].iplot(kind='bar',
                            title='Use of Registered bikes Vs Casual bikes by year',         
                            xTitle='Year',
                            yTitle='Number of users')

<span style="background-color: #DAF7A6">There were 125,513 more casual bike users in 2012 than in 2011, and the number of registered bike users increased by 680,960. Overall, bike use grew considerably from 2011 to 2012.</span>
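The year-over-year differences can be computed from the yearly totals with `diff()`. This sketch groups by calendar year instead of calling `resample('Y')`, which is a swapped-in but equivalent way to get one row per year; the helper `yearly_change` is hypothetical and the excerpt holds only nine rows of hour.csv, so its numbers are far smaller than the real ones.

```python
import pandas as pd


def yearly_change(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Difference in yearly totals between consecutive years."""
    totals = df.groupby(df["dteday"].dt.year)[columns].sum()
    # The first year has no predecessor, so its row is dropped.
    return totals.diff().dropna().astype(int)


# The nine rows of hour.csv displayed earlier (three columns only).
excerpt = pd.DataFrame(
    {
        "dteday": pd.to_datetime(["2011-01-01"] * 5 + ["2012-12-31"] * 4),
        "casual": [3, 8, 5, 3, 0, 11, 8, 7, 13],
        "registered": [13, 32, 27, 10, 1, 108, 81, 83, 48],
    }
)
change = yearly_change(excerpt, ["casual", "registered"])
print(change)
```

Because `dteday` must be a column here, the full-data call would be `yearly_change(bike_share, ["casual", "registered"])` after the datetime conversion.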

# <span style="background-color: #CBBDB6">Grouping by holiday</span>

In [28]:
byHoliday = bike_share.groupby('holiday').sum(numeric_only=True)
#byHoliday

In [29]:
byHoliday[['registered','casual']].iplot(kind='bar',
                            title='Use of Registered bikes Vs Casual bikes by holiday',         
                            xTitle='Holidays',
                            yTitle='Number of users')

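<span style="background-color: #DAF7A6">Total bicycle use on holidays is dramatically lower than on other days. On non-holidays, there are considerably more registered users than casual users. Note that these are totals, and only about 3% of the hourly records fall on holidays (the mean of holiday in the summary table is 0.029), so part of the gap simply reflects how few holiday hours there are.</span>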
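Because holiday hours are rare, totals and per-hour averages can tell different stories. The sketch below uses made-up numbers (not taken from hour.csv) purely to show the effect of comparing `sum` with `mean` when the two groups differ in size.

```python
import pandas as pd

# Hypothetical hourly records, NOT from hour.csv:
# six ordinary hours and two holiday hours.
records = pd.DataFrame(
    {
        "holiday": [0, 0, 0, 0, 0, 0, 1, 1],
        "cnt": [100, 120, 80, 110, 90, 100, 95, 85],
    }
)

# Number of hours, total rentals and average rentals per hour in each group.
summary = records.groupby("holiday")["cnt"].agg(["count", "sum", "mean"])
print(summary)
```

In this toy example the holiday total is less than a third of the non-holiday total, although the per-hour averages differ by only ten rentals. Applying the same `agg` call to `bike_share` would show how much of the real gap remains once group size is accounted for.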

# <span style="background-color: #CBBDB6">Grouping by weather</span>

In [30]:
byWeather = bike_share.groupby('weathersit').mean()
#byWeather

1- Clear, Few clouds, Partly cloudy

2- Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

3- Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

4- Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

In [31]:
byWeather[['registered','casual']].iplot(kind='bar',
                            title='Use of Registered bikes Vs Casual bikes by Weather',         
                            xTitle='Weather (1,2,3,4)',
                            yTitle='Average number of users per hour')

<span style="background-color: #DAF7A6">From the figure above, we can see that bicycle use depends heavily on the weather. As the weather deteriorates, bike use drops significantly.</span>

# <span style="background-color: #CBBDB6">Final conclusion</span>

<span style="background-color: #DAF7A6">From this data analysis, we can conclude that the number of registered bike users increased significantly from 2011 to 2012. Casual use also increased, but by much less than registered use. Bike use is higher on weekdays than on weekends.</span>