# Project Exploratory Data Analysis

Gowtham K

March 08, 2016

=========================================================================================
# Bike Sharing
===================================
    # Background
===================================

Bike sharing systems are a new generation of traditional bike rentals in which the whole process of membership, rental,
and return has become automatic. Through these systems, a user can easily rent a bike at one location and return it at
another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles.
Today, there is great interest in these systems because of their important role in traffic, environmental, and health
issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by
these systems make them attractive for research. Unlike other transport services such as bus or subway, the duration
of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike
sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that
most important events in the city could be detected by monitoring these data.

===================================
    # Dataset
===================================


The bike-sharing rental process is highly correlated with environmental and seasonal settings. For instance, weather
conditions, precipitation, day of the week, season, and hour of the day can affect rental behavior. The core data set is the
two-year historical log for 2011 and 2012 from the Capital Bikeshare system, Washington D.C., USA, which is
publicly available at http://capitalbikeshare.com/system-data. We aggregated the data on an hourly and a daily basis and then
extracted and added the corresponding weather and seasonal information. Weather information is extracted from http://www.freemeteo.com.
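
The aggregation step is described only in prose, so the following is a minimal sketch of one way hourly records could be rolled up into daily totals. The function name `aggregate_daily` and the sample rows are hypothetical and made up for illustration; they are not taken from the project's scripts or from the real data.

```python
from collections import defaultdict

def aggregate_daily(hourly_rows):
    """Sum the casual, registered, and cnt fields of hourly rows per date."""
    totals = defaultdict(lambda: {"casual": 0, "registered": 0, "cnt": 0})
    for row in hourly_rows:
        day = totals[row["dteday"]]
        for field in ("casual", "registered", "cnt"):
            day[field] += int(row[field])
    return dict(totals)

# Made-up rows for illustration only (not real records).
sample = [
    {"dteday": "2011-01-01", "hr": 0, "casual": 3, "registered": 13, "cnt": 16},
    {"dteday": "2011-01-01", "hr": 1, "casual": 8, "registered": 32, "cnt": 40},
    {"dteday": "2011-01-02", "hr": 0, "casual": 4, "registered": 13, "cnt": 17},
]

daily = aggregate_daily(sample)
for date, fields in sorted(daily.items()):
    print(date, fields)
```

Weather fields such as temp or hum would need a different treatment (for example, averaging) rather than summing; that choice is left open here.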

=================================
    # Dataset characteristics
=================================

Both hour.csv and day.csv have the following fields, except hr, which is not available in day.csv:

    1) instant: record index
    2) dteday: date
    3) season: season (1: spring, 2: summer, 3: fall, 4: winter)
    4) yr: year (0: 2011, 1: 2012)
    5) mnth: month (1 to 12)
    6) hr: hour (0 to 23)
    7) holiday: whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
    8) weekday: day of the week
    9) workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
    10) weathersit:
         1: Clear, Few clouds, Partly cloudy
         2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
         3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
         4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
    11) temp: normalized temperature in Celsius. The values are divided by 41 (max)
    12) atemp: normalized feeling temperature in Celsius. The values are divided by 50 (max)
    13) hum: normalized humidity. The values are divided by 100 (max)
    14) windspeed: normalized wind speed. The values are divided by 67 (max)
    15) casual: count of casual users
    16) registered: count of registered users
    17) cnt: count of total rental bikes including both casual and registered
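
Since temp, atemp, hum, and windspeed are stored as normalized values, it can help to convert them back to their original scale when reading the plots. The sketch below assumes the simple "divide by max" scaling stated in the field list; the helper name `denormalize` and the example record are hypothetical.

```python
# Divisors taken from the field descriptions above.
SCALE = {"temp": 41.0, "atemp": 50.0, "hum": 100.0, "windspeed": 67.0}

def denormalize(record):
    """Return a copy of record with the normalized fields rescaled."""
    restored = dict(record)
    for field, max_value in SCALE.items():
        if field in restored:
            restored[field] = float(restored[field]) * max_value
    return restored

# Made-up record for illustration only.
example = {"temp": 0.5, "atemp": 0.5, "hum": 0.25, "windspeed": 0.0}
restored = denormalize(example)
print(restored)
```

Under that assumption, a normalized temp of 0.5 corresponds to 20.5 degrees Celsius.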

# Dataset and its Summary
    # Required task: 
        Prediction of the hourly or daily bike rental count based on the environmental and seasonal settings.
    

================================
    # Summary for Daily bikes
================================

In [6]:
run DailyBike.py

           atemp       casual          cnt      dteday   holiday       hum  \
1st Qu  0.337842   315.500000  3152.000000         NaN  0.000000  0.520000   
2nd Qu  0.486733   713.000000  4548.000000         NaN  0.000000  0.626667   
Max     0.840896  3410.000000  8714.000000  2012-12-31  1.000000  0.972500   
Mean    0.474354   848.176471  4504.348837         NaN  0.028728  0.627894   
Median  0.486733   713.000000  4548.000000         NaN  0.000000  0.626667   
Min     0.079070     2.000000    22.000000  2011-01-01  0.000000  0.000000   

        instant       mnth   registered   season      temp  weathersit  \
1st Qu    183.5   4.000000  2497.000000  2.00000  0.337083    1.000000   
2nd Qu    366.0   7.000000  3662.000000  3.00000  0.498333    1.000000   
Max       731.0  12.000000  6946.000000  4.00000  0.861667    3.000000   
Mean      366.0   6.519836  3656.172367  2.49658  0.495385    1.395349   
Median    366.0   7.000000  3662.000000  3.00000  0.498333    1.000000   
Min      

From the above summary, we can observe that the maximum number of bikes rented in a single day is 8714. Our task is to examine how weather affects daily bike rentals. Let's go further with the analysis.
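
The contents of DailyBike.py are not shown in this document, so the following is only a sketch of how the row labels in the summary (1st Qu, 2nd Qu, Max, Mean, Median, Min) could be computed for a single column using the Python standard library. The function name `summarize`, the choice of the "inclusive" quartile method, and the input values are assumptions made for illustration.

```python
import statistics

def summarize(values):
    """Compute the summary rows used above for one numeric column."""
    data = sorted(values)
    # quantiles(n=4) returns the three quartile cut points.
    q1, q2, _q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "1st Qu": q1,
        "2nd Qu": q2,
        "Max": max(data),
        "Mean": statistics.mean(data),
        "Median": statistics.median(data),
        "Min": min(data),
    }

# Made-up values for illustration only.
counts = [1, 2, 3, 4, 5]
summary = summarize(counts)
for label, value in summary.items():
    print(label, value)
```

With this method the 2nd quartile and the median coincide, which is consistent with the identical "2nd Qu" and "Median" rows in the summary.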

==========================================
       # Summary for Hourly bikes
==========================================

In [7]:
run HourlyBike.py

           atemp      casual         cnt      dteday  holiday         hr  \
1st Qu  0.333300    4.000000   40.000000         NaN  0.00000   6.000000   
2nd Qu  0.484800   17.000000  142.000000         NaN  0.00000  12.000000   
Max     1.000000  367.000000  977.000000  2012-12-31  1.00000  23.000000   
Mean    0.475775   35.676218  189.463088         NaN  0.02877  11.546752   
Median  0.484800   17.000000  142.000000         NaN  0.00000  12.000000   
Min     0.000000    0.000000    1.000000  2011-01-01  0.00000   0.000000   

             hum  instant       mnth  registered   season      temp  \
1st Qu  0.480000   4345.5   4.000000   34.000000  2.00000  0.340000   
2nd Qu  0.630000   8690.0   7.000000  115.000000  3.00000  0.500000   
Max     1.000000  17379.0  12.000000  886.000000  4.00000  1.000000   
Mean    0.627229   8690.0   6.537775  153.786869  2.50164  0.496987   
Median  0.630000   8690.0   7.000000  115.000000  3.00000  0.500000   
Min     0.000000      1.0   1.000000    0

Here the maximum number of bikes rented in a single hour is 977. Weather also affects hourly bike rentals, as the analysis below shows.

First, let's look at the columns of each dataset and their data types. These will be useful when performing the different analyses.

In [8]:
run DailyBike.py

instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object


In [10]:
run HourlyBike.py

instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
hr              int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object


    # The only difference between the daily and hourly datasets is the 'hr' variable, which contains the values 0-23 representing the hours of a day.
    

# Histogram Study

#### Let's look at the weather-related histograms and see how these variables affect hourly bike rentals. The sections below show each histogram. The variable definitions are given at the top of this document.
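
The plotting code behind these histograms is not included in this document. As a rough sketch of what a histogram computes, the snippet below counts how many values fall into equal-width bins. The function name `histogram`, the bin count, and the sample values are hypothetical and chosen only for illustration.

```python
def histogram(values, bins, low, high):
    """Count values in `bins` equal-width intervals covering [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value equal to `high` is placed in the last bin.
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts

# Made-up normalized temperatures for illustration only.
temps = [0.05, 0.24, 0.26, 0.5, 0.51, 0.74, 0.98, 1.0]
bin_counts = histogram(temps, bins=4, low=0.0, high=1.0)
print(bin_counts)
```

A plotting library would draw one bar per bin with these counts as the bar heights.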

======================================
    Histogram for Hourly User Count
======================================

<img src="./histograms/count.png" style="width:512px;height:512px;float:left">

===========================================
    Histogram for Normalized Temperature
===========================================

<img src="./histograms/temp.png" style="width:512px;height:512px;float:left">

=================================
    Histogram for weather
=================================

<img src="./histograms/weathersit.png" style="width:512px;height:512px;float:left">

========================================
    Histogram for Normalized Humidity
========================================

<img src="./histograms/humidity.png" style="width:512px;height:512px;float:left">

=================================
    Histogram for windspeed
=================================

<img src="./histograms/windspeed.png" style="width:512px;height:512px;float:left">

=================================
    Histogram for season
=================================

<img src="./histograms/season.png" style="width:512px;height:512px;float:left">

=================================
    Histogram for Hours
=================================

<img src="./histograms/hour.png" style="width:512px;height:512px;float:left">

# Bivariate relationships between total count and other variables

================================================
   #### Bivariate graph between season and count
    This plot shows the relationship between season and the total rental count (cnt, hourly maximum 977). Season has a pronounced effect on the number of bike rentals: seasons 2 (summer) and 3 (fall) have the most rentals.
================================================

   <img src="./bivariate/seasonvscount.png" style="width:512px;height:512px;float:left">

================================================
   #### Bivariate graph between Year and Total count
		This plot shows the relationship between year and the total rental count (cnt, hourly maximum 977). The number of bike rentals appears to have increased in 2012 compared to 2011.
================================================

   <img src="./bivariate/yrvscoun.png" style="width:512px;height:512px;float:left">

=========================================================
   #### Bivariate graph between Hours and Total count
		This plot shows the relationship between hour of the day and the total rental count (cnt, hourly maximum 977). Rentals are highest around hours 17 and 18 (5 PM to 6 PM), and hour 8 (8 AM) also shows a sizable cluster. Since hr is the hour of the day (0 to 23) rather than a rental duration, this indicates when bikes are rented, not how long they are kept.
=========================================================

   <img src="./bivariate/hoursvsCnt.png" style="width:512px;height:512px;float:left">

=========================================================
   #### Bivariate graph between weekdays and Total count
		This plot shows the relationship between weekday and the total rental count (cnt, hourly maximum 977). Rentals appear to be higher on Wednesdays and Thursdays.
=========================================================

   <img src="./bivariate/weekdayvsCount.png" style="width:512px;height:512px;float:left">

=========================================================
   #### Bivariate graph between Normalized Temperature and Total count
		This plot shows the relationship between normalized temperature and the total rental count (cnt, hourly maximum 977). Weather has a pronounced effect on bike rentals: rentals appear to be higher at moderate temperatures.
=========================================================

   <img src="./bivariate/tempvscount.png" style="width:512px;height:512px;float:left">

=========================================================
   #### Bivariate graph between Normalized Humidity and Total count
		This plot shows the relationship between normalized humidity and the total rental count (cnt, hourly maximum 977). As discussed above, weather has a pronounced effect on bike rentals: rentals appear to be higher at moderate humidity.
=========================================================

   <img src="./bivariate/humvscount.png" style="width:512px;height:512px;float:left">

=========================================================
   #### Bivariate graph between Months and Total count
		This plot shows the relationship between month and the total rental count (cnt, hourly maximum 977). Rentals are higher in months 6 through 10 (June to October) and lower in the winter months.
=========================================================

   <img src="./bivariate/monthvscount.png" style="width:512px;height:512px;float:left">

=========================================================
   #### Bivariate graph between Holidays and Total count
    Interestingly, the number of bike rentals is higher on days that are not holidays (holiday = 0).
=========================================================

   <img src="./bivariate/cntonholidays.png" style="width:512px;height:512px;float:left">

# Multivariate relationship between Total count, Season, Hours.

#### I have created a 3-dimensional plot of the relationship between Season (on the x-axis), Total count (on the y-axis), and Hours (on the z-axis)

   <img src="./multivariate/seacnthr.png" style="width:512px;height:512px;float:left">
    

## Data Distribution

    1st Qu(25%) : 40.000000
    2nd Qu(50%) : 142.000000
    Max    : 977.000000
    Mean   : 189.463088
    Median : 142.000000
    Min    : 1.000000 

These figures show that the maximum number of hourly bike rentals is 977 and the mean is about 189, above the median of 142, which suggests that the cnt histogram is skewed toward low counts. Several factors affect hourly rentals. Weather has a pronounced effect, as does the season: many users rent bikes in summer and fall, and rentals decrease in winter. Against normalized temperature, rentals cluster near the center of the range, which suggests that moderate temperatures favor hourly bike rentals.
Interestingly, rentals are also higher on non-holidays, and weekdays mostly have more rentals than weekends. Based on this, making more bikes available on weekdays could increase rentals.

In the hours histogram, rentals are lowest in the early morning between 2 AM and 4 AM, which are non-business hours, and rise during the daytime.
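
One way to check an hour-of-day pattern like this numerically is to average cnt for each value of hr. The sketch below is not the project's code; the helper name `mean_by` and the sample rows are hypothetical and made up for illustration.

```python
from collections import defaultdict

def mean_by(rows, key, value="cnt"):
    """Average `value` for each distinct value of `key`."""
    sums = defaultdict(float)
    sizes = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[value]
        sizes[row[key]] += 1
    return {k: sums[k] / sizes[k] for k in sorted(sums)}

# Made-up rows for illustration only (not real records).
rows = [
    {"hr": 8, "cnt": 300},
    {"hr": 8, "cnt": 500},
    {"hr": 3, "cnt": 4},
    {"hr": 3, "cnt": 6},
    {"hr": 17, "cnt": 600},
]

hourly_mean = mean_by(rows, "hr")
print(hourly_mean)
```

The same helper could be applied with `key="season"`, `key="weekday"`, or `key="holiday"` to compare group averages instead of reading them off the scatter plots.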

## Final Plots and Summary

#### Plot One
#### Description One

Most bike rentals occur on working days. Regardless of whether the day is a holiday or a working day, rentals increase during business hours. Interestingly, holidays did not have much effect on bike rentals.

#### Plot Two
#### Description Two

In the year histogram, 2012 has more hourly bike rentals than 2011. As an assumption, rentals are increasing every year; confirming this would require hourly rental data from the following years.

#### Plot Three
#### Description Three

Comparing the seasons in the bivariate graph against total hourly bike rentals, spring has far fewer rentals than the other seasons. This may be due to snow in Washington DC from early December to the end of March. According to <a href="https://www.fema.gov/news-release/2011/11/08/president-declares-disaster-district-columbia">District of Columbia (DC) Earthquake (DR-4044)</a>, there was an earthquake in August (23-28), but it did not have much effect on hourly bike rentals.

## Reflection

The dataset in this Bike Sharing project consists of different attributes whose values can be compared with one another to produce the desired analysis. I used the hourly rental data and compared it with other attributes such as weather, month, temperature, and working day. All of these attributes are defined at the top of this document. The main task of this project is to understand how these attributes affect hourly bike rentals in the Washington DC area. The project analysed the data and generated different types of plots: univariate (histograms), bivariate (2D), and multivariate relationships between the variables. There are two types of users who rent bikes by the hour: casual users and registered users. Registered users account for more hourly rentals than casual users, but this project analysed both together using their combined count. The main factors affecting bike rentals are weather and natural disasters. Hurricane Sandy affected the bike rental business in November, and snow also affected rentals from December to March.

## Reference

[1] Fanaee-T, Hadi, and Gama, Joao, "Event labeling combining ensemble detectors and background knowledge", Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.
