<font color='black'> <h1> <center> Exploratory Data Analysis </center> </h1> </font>

### Bike Sharing Dataset

The dataset and more information can be found at the following URL:
https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

Please take a few moments to read the description of the dataset and its features; it will help with the analysis that follows.

The dataset has the following fields:
	
	- instant: record index
	- dteday : date
	- season : season (1:winter, 2:spring, 3:summer, 4:fall)
	- yr : year (0: 2011, 1:2012)
	- mnth : month ( 1 to 12)
	- hr : hour (0 to 23)
	- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
	- weekday : day of the week
	- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
	- weathersit : 
		- 1: Clear, Few clouds, Partly cloudy
		- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
		- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
		- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
	- temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
	- atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
	- hum: Normalized humidity. The values are divided by 100 (max)
	- windspeed: Normalized wind speed. The values are divided by 67 (max)
	- casual: count of casual users
	- registered: count of registered users
	- cnt: count of total rental bikes including both casual and registered
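
The normalization formulas for temp and atemp can be inverted to recover degrees Celsius. The sketch below is only an illustration: the helper name `denormalize` is hypothetical, the bounds are the t_min and t_max values quoted in the field list, and the sample values are copied from the first rows of the data.

```python
import pandas as pd

# Sample values copied from the first two rows of the dataset.
sample = pd.DataFrame({"temp": [0.24, 0.22], "atemp": [0.2879, 0.2727]})


def denormalize(scaled, t_min, t_max):
    """Invert (t - t_min) / (t_max - t_min) to get back degrees Celsius."""
    return scaled * (t_max - t_min) + t_min


# Bounds are the ones stated in the field descriptions.
sample["temp_c"] = denormalize(sample["temp"], t_min=-8, t_max=39)
sample["atemp_c"] = denormalize(sample["atemp"], t_min=-16, t_max=50)
print(sample)
```

For the first row, 0.24 * (39 - (-8)) + (-8) = 3.28, so the normalized value 0.24 corresponds to roughly 3.3 degrees Celsius.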

In [1]:
# Importing the Python libraries

import pandas as pd
import numpy as np

#### Step 1 - Reading Data From CSV

As the data is stored in CSV format, use the read_csv function of pandas to read it into a pandas structure called a DataFrame.

In [2]:
bikes = pd.read_csv("../Data/bike_shairing_hourly.csv")
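
To try read_csv without the file on disk, the same layout can be read from an in-memory string. This is a minimal sketch: the two rows are copied from the data, `csv_text` and `sample` are illustrative names, and the final check relies on the field description that cnt is the sum of casual and registered.

```python
import io

import pandas as pd

# Two rows in the same column layout as the CSV file; a stand-in for the real data.
csv_text = """instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
"""

# read_csv accepts any file-like object, not only a path.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)

# cnt is described as casual + registered, so this should hold for every row.
counts_add_up = (sample["casual"] + sample["registered"] == sample["cnt"]).all()
print(counts_add_up)
```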

<b> Data Viewing </b>

Let's explore the first and last few rows of the DataFrame.

In [3]:
bikes.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


In [4]:
bikes.tail()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
17374,17375,2012-12-31,1,1,12,19,0,1,1,2,0.26,0.2576,0.6,0.1642,11,108,119
17375,17376,2012-12-31,1,1,12,20,0,1,1,2,0.26,0.2576,0.6,0.1642,8,81,89
17376,17377,2012-12-31,1,1,12,21,0,1,1,1,0.26,0.2576,0.6,0.1642,7,83,90
17377,17378,2012-12-31,1,1,12,22,0,1,1,1,0.26,0.2727,0.56,0.1343,13,48,61
17378,17379,2012-12-31,1,1,12,23,0,1,1,1,0.26,0.2727,0.65,0.1343,12,37,49


Let's find out the number of rows and columns in the dataset.

In [5]:
bikes.shape

(17379, 17)

#### Step 2 - Formatting, cleaning and filtering Data Frames

Let's check how many non-null values are present in each column.

In [6]:
bikes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB


As every column contains the same number of non-null values (17379, equal to the number of rows), there are no missing values in the dataset.

<b> Properties of data </b>

Let's check some properties of the dataset:

In [7]:
bikes.columns

Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'casual', 'registered', 'cnt'],
      dtype='object')

The dataset has 17 features, most of which look like integers or floats.

In [8]:
bikes.dtypes  

instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
hr              int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object

<b> Unique feature values </b>

Let's explore the unique values present in each feature. These unique values can give us some hints when grouping the data.

In [9]:
bikes.dteday.unique() 

array(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
       '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
       '2011-01-09', '2011-01-10', '2011-01-11', '2011-01-12',
       '2011-01-13', '2011-01-14', '2011-01-15', '2011-01-16',
       '2011-01-17', '2011-01-18', '2011-01-19', '2011-01-20',
       '2011-01-21', '2011-01-22', '2011-01-23', '2011-01-24',
       '2011-01-25', '2011-01-26', '2011-01-27', '2011-01-28',
       '2011-01-29', '2011-01-30', '2011-01-31', '2011-02-01',
       '2011-02-02', '2011-02-03', '2011-02-04', '2011-02-05',
       '2011-02-06', '2011-02-07', '2011-02-08', '2011-02-09',
       '2011-02-10', '2011-02-11', '2011-02-12', '2011-02-13',
       '2011-02-14', '2011-02-15', '2011-02-16', '2011-02-17',
       '2011-02-18', '2011-02-19', '2011-02-20', '2011-02-21',
       '2011-02-22', '2011-02-23', '2011-02-24', '2011-02-25',
       '2011-02-26', '2011-02-27', '2011-02-28', '2011-03-01',
       '2011-03-02', '2011-03-03', '2011-03-04', '2011-

The dates run from January 2011 to December 2012.
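
The date range can also be checked directly instead of reading it off the printed array. A minimal sketch, assuming a small stand-in series built from dates that appear in the data (with the full data, `bikes.dteday` would take its place):

```python
import pandas as pd

# Stand-in for bikes.dteday, using dates that appear in the dataset.
dteday = pd.Series(
    ["2011-01-01", "2011-01-02", "2011-01-03", "2012-12-31"], name="dteday"
)

# ISO-formatted date strings sort chronologically, so min/max work on plain strings.
first, last = dteday.min(), dteday.max()
n_days = dteday.nunique()
print(first, last, n_days)
```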

In [10]:
bikes.season

0        1
1        1
2        1
3        1
4        1
        ..
17374    1
17375    1
17376    1
17377    1
17378    1
Name: season, Length: 17379, dtype: int64

In [11]:
bikes.season.unique()

array([1, 2, 3, 4], dtype=int64)

There are 4 unique seasons in the current dataset.

In [12]:
bikes.yr

0        0
1        0
2        0
3        0
4        0
        ..
17374    1
17375    1
17376    1
17377    1
17378    1
Name: yr, Length: 17379, dtype: int64

In [13]:
bikes.yr.unique()

array([0, 1], dtype=int64)

As described in the field list, yr is coded as 0 for the year 2011 and 1 for the year 2012.
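
Coded columns such as season and yr are easier to read once decoded. The following sketch maps the codes to labels with Series.map; the label dictionaries follow the field list at the top of this notebook, and the new column names `season_name` and `year` are illustrative choices.

```python
import pandas as pd

# Codes copied from the first and last rows of the data.
sample = pd.DataFrame({"season": [1, 1], "yr": [0, 1]})

# Label dictionaries taken from the field descriptions.
season_labels = {1: "winter", 2: "spring", 3: "summer", 4: "fall"}
year_labels = {0: 2011, 1: 2012}

sample["season_name"] = sample["season"].map(season_labels)
sample["year"] = sample["yr"].map(year_labels)
print(sample)
```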

In [14]:
bikes.mnth

0         1
1         1
2         1
3         1
4         1
         ..
17374    12
17375    12
17376    12
17377    12
17378    12
Name: mnth, Length: 17379, dtype: int64

In [15]:
bikes.mnth.unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12], dtype=int64)

Rentals are recorded in all 12 months across 2011 and 2012.

In [16]:
bikes.hr

0         0
1         1
2         2
3         3
4         4
         ..
17374    19
17375    20
17376    21
17377    22
17378    23
Name: hr, Length: 17379, dtype: int64

In [17]:
bikes.hr.unique()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23], dtype=int64)

Rentals are recorded for all 24 hours of the day.

In [18]:
bikes.holiday

0        0
1        0
2        0
3        0
4        0
        ..
17374    0
17375    0
17376    0
17377    0
17378    0
Name: holiday, Length: 17379, dtype: int64

In [19]:
bikes.holiday.unique()

array([0, 1], dtype=int64)

Here 0 represents a non-holiday and 1 represents a holiday.

In [20]:
bikes.weekday

0        6
1        6
2        6
3        6
4        6
        ..
17374    1
17375    1
17376    1
17377    1
17378    1
Name: weekday, Length: 17379, dtype: int64

In [21]:
bikes.weekday.unique()

array([6, 0, 1, 2, 3, 4, 5], dtype=int64)

Bikes are shared on all seven days of the week. The codes run from 0 (Sunday) to 6 (Saturday): 2011-01-01, a Saturday, has weekday 6.
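
The weekday coding can be cross-checked against the calendar. This is a sketch under the assumption that the codes follow Sunday = 0 through Saturday = 6, which is consistent with the two rows used here (copied from the first and last rows of the data).

```python
import pandas as pd

# (dteday, weekday) pairs copied from the first and last rows of the data.
sample = pd.DataFrame({"dteday": ["2011-01-01", "2012-12-31"], "weekday": [6, 1]})
dates = pd.to_datetime(sample["dteday"])

# pandas counts Monday as 0 and Sunday as 6; shifting by one gives Sunday = 0.
derived = (dates.dt.dayofweek + 1) % 7
names = dates.dt.day_name()

print(pd.DataFrame({"weekday": sample["weekday"], "derived": derived, "name": names}))
```

If the derived codes match the weekday column on the full data, the Sunday = 0 reading of the column is confirmed.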

In [22]:
bikes.workingday

0        0
1        0
2        0
3        0
4        0
        ..
17374    1
17375    1
17376    1
17377    1
17378    1
Name: workingday, Length: 17379, dtype: int64

In [23]:
bikes.workingday.unique()

array([0, 1], dtype=int64)

Here 0 marks non-working days and 1 marks working days.
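
Since workingday is defined as "neither weekend nor holiday", it can be recomputed from weekday and holiday and compared with the stored column. A minimal sketch, assuming weekend days are coded 0 (Sunday) and 6 (Saturday); the first two rows come from the data and the third (a holiday falling on a Monday) is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "weekday": [6, 1, 1],
        "holiday": [0, 0, 1],  # third row is a made-up holiday for illustration
        "workingday": [0, 1, 0],
    }
)

# Assumption: 0 = Sunday and 6 = Saturday are the weekend codes.
is_weekend = sample["weekday"].isin([0, 6])
expected = (~is_weekend & (sample["holiday"] == 0)).astype(int)

# Number of rows where the stored flag disagrees with the recomputed one.
mismatches = int((expected != sample["workingday"]).sum())
print(mismatches)
```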

In [24]:
bikes.weathersit

0        1
1        1
2        1
3        1
4        1
        ..
17374    2
17375    2
17376    1
17377    1
17378    1
Name: weathersit, Length: 17379, dtype: int64

In [25]:
bikes.weathersit.unique()

array([1, 2, 3, 4], dtype=int64)

weathersit has four values: 1, 2, 3 and 4.

The remaining columns, such as temp, atemp and windspeed, hold continuous measurements, so listing their unique values is not informative.
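
Rather than inspecting one column per cell, the unique values of all coded columns can be collected in a single loop. The sketch below uses a small made-up frame and an illustrative column list; with the full data, `bikes` would replace `sample`.

```python
import pandas as pd

# Small illustrative frame; values are in the same coded form as the dataset.
sample = pd.DataFrame(
    {
        "season": [1, 1, 1],
        "yr": [0, 0, 1],
        "holiday": [0, 0, 0],
        "weathersit": [1, 1, 2],
    }
)

categorical_cols = ["season", "yr", "holiday", "weathersit"]

# Map each column name to its sorted list of distinct values.
summary = {col: sorted(sample[col].unique().tolist()) for col in categorical_cols}
for col, values in summary.items():
    print(f"{col}: {values}")
```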

In [26]:
bikes.isnull().sum()  # checking for missing (null) values in each column

instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

In [27]:
bikes.isna().sum()  # checking for NA values (isna is an alias of isnull)

instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

In [28]:
bikes.dteday.isnull().sum()  # checking for null values

0

In [29]:
bikes.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
17374    False
17375    False
17376    False
17377    False
17378    False
Length: 17379, dtype: bool

In [30]:
bikes.duplicated().sum() # checking for duplicate records in the dataframe

0
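
Because instant is a running record index, two rows can never be identical across all columns, so a full-row duplicate check is expected to return 0. A stricter check looks for repeated (dteday, hr) pairs. The sketch below illustrates the difference on a made-up frame whose last row is a hypothetical repeated hour.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "instant": [1, 2, 3],
        "dteday": ["2011-01-01", "2011-01-01", "2011-01-01"],
        "hr": [0, 1, 1],  # last row repeats hour 1 (hypothetical)
        "cnt": [16, 40, 41],
    }
)

# Full-row comparison: no two rows agree in every column.
full_row_dupes = int(sample.duplicated().sum())

# Key-based comparison: flags the second occurrence of the same date and hour.
key_dupes = int(sample.duplicated(subset=["dteday", "hr"]).sum())

print(full_row_dupes, key_dupes)
```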

<b> Data transformation </b>

The dteday column can be converted to datetime for better analysis.

In [31]:
bikes.dteday

0        2011-01-01
1        2011-01-01
2        2011-01-01
3        2011-01-01
4        2011-01-01
            ...    
17374    2012-12-31
17375    2012-12-31
17376    2012-12-31
17377    2012-12-31
17378    2012-12-31
Name: dteday, Length: 17379, dtype: object

In [32]:
bikes.dteday.head()

0    2011-01-01
1    2011-01-01
2    2011-01-01
3    2011-01-01
4    2011-01-01
Name: dteday, dtype: object

In [33]:
dte = pd.to_datetime(bikes.dteday)

In [34]:
dte

0       2011-01-01
1       2011-01-01
2       2011-01-01
3       2011-01-01
4       2011-01-01
           ...    
17374   2012-12-31
17375   2012-12-31
17376   2012-12-31
17377   2012-12-31
17378   2012-12-31
Name: dteday, Length: 17379, dtype: datetime64[ns]

In [35]:
type(dte)

pandas.core.series.Series
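
The conversion above leaves the original column unchanged and stores the result in a separate series. If the DataFrame itself should carry datetime values, the converted series can be assigned back. A minimal sketch on a stand-in frame, assuming the dates are always written as YYYY-MM-DD as in the output shown:

```python
import pandas as pd

# Stand-in for the bikes DataFrame with only the date column.
sample = pd.DataFrame({"dteday": ["2011-01-01", "2012-12-31"]})

# An explicit format makes parsing strict: a malformed date raises an error.
sample["dteday"] = pd.to_datetime(sample["dteday"], format="%Y-%m-%d")
print(sample.dtypes)
```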

<b> Feature Extraction </b>

Now extract year, month and day from "dte" series.
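
One way to do this is the .dt accessor that pandas provides on datetime series. The sketch below rebuilds a two-row version of the series so that it runs on its own; the frame name `parts` and its column names are illustrative.

```python
import pandas as pd

# Two-row stand-in for the converted series, using the first and last dates.
dte = pd.to_datetime(pd.Series(["2011-01-01", "2012-12-31"], name="dteday"))

# The .dt accessor exposes the calendar components of each timestamp.
parts = pd.DataFrame(
    {
        "year": dte.dt.year,
        "month": dte.dt.month,
        "day": dte.dt.day,
    }
)
print(parts)
```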