# 01 - Kaggle - bike share system - Problem formulation

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems make them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to **combine historical usage patterns with weather data** in order to **forecast bike rental demand** in the Capital Bikeshare program in Washington, D.C.

<img src="data/bikes.png">


## Description of the problem

You are provided hourly rental data spanning two years. For this competition, the *training set* is comprised of the first 19 days of each month, while the *test set* is the 20th to the end of the month. You must predict the **total count** of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
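The split rule above can be sketched in a few lines. This is a minimal illustration using only the standard library; the helper name `is_training_timestamp` is hypothetical and not part of the competition materials, and the timestamps are made-up examples in the `YYYY-MM-DD HH:00:00` format.

```python
from datetime import datetime

def is_training_timestamp(stamp):
    """Return True if the timestamp falls on days 1-19 of its month (training set)."""
    day = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").day
    return day <= 19

# Made-up timestamps, for illustration only.
stamps = ["2011-01-19 23:00:00", "2011-01-20 00:00:00", "2012-12-31 23:00:00"]
for s in stamps:
    print(s, "train" if is_training_timestamp(s) else "test")
```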



## Data set

The **training set** includes:

<center> datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count </center>


- **datetime**: `YYYY-MM-DD HH:00:00` --> hourly date + timestamp by hour  
    - `YYYY` = 2011 or 2012
    - `MM` = 1 - 12  
    - `DD` = 1 - 19
    - `HH` = 0 - 23

- **season**: Kaggle's [website](https://www.kaggle.com/c/bike-sharing-demand/data) says "`1 = spring, 2 = summer, 3 = fall, 4 = winter`", but the season indices in the dataset correspond to
    - 1 = Winter (January-March)
    - 2 = Spring (April-June)
    - 3 = Summer (July-September)
    - 4 = Fall (October-December)

- **holiday**: whether the day is considered a holiday 
    - 0 = non-holiday
    - 1 = holiday

- **workingday**: whether the day is neither a weekend nor holiday
    - 0 = day is weekend or holiday
    - 1 = otherwise

- **weather**: categorical code for the weather conditions, with higher values marking more extreme weather
    - 1 = Clear, Few clouds, Partly cloudy
    - 2 = Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3 = Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4 = Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

- **temp**: temperature in Celsius. 

- **atemp**: "feels like" temperature in Celsius.

- **humidity**: relative humidity

- **windspeed**: wind speed

- **casual**: number of non-registered user rentals initiated

- **registered**: number of registered user rentals initiated

- **count**: number of total rentals (casual + registered)
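Two of the field definitions above can be expressed as simple checks: the season index as it appears in the data (one index per calendar quarter) and the relation `count = casual + registered`. The sketch below is only illustrative; `season_from_month` is a hypothetical helper and the row is made up, not taken from the dataset.

```python
def season_from_month(month):
    """Map a calendar month (1-12) to the season index observed in the data."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return (month - 1) // 3 + 1

SEASON_NAMES = {1: "Winter", 2: "Spring", 3: "Summer", 4: "Fall"}

# A made-up row, for illustration only (not taken from the dataset).
row = {"datetime": "2011-07-04 08:00:00", "season": 3,
       "casual": 5, "registered": 20, "count": 25}

month = int(row["datetime"][5:7])  # characters 5-6 hold MM
assert season_from_month(month) == row["season"]
assert row["casual"] + row["registered"] == row["count"]
print(SEASON_NAMES[season_from_month(month)])
```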


The **test set** includes the same features except `casual`, `registered` and `count`. Also, `DD` in **`datetime`** is from 20 to the end of the month.

## Evaluation

Submissions are evaluated on the Root Mean Squared Logarithmic Error (RMSLE). The RMSLE is calculated as

$$
\epsilon 
= 
\sqrt{{1\over n} \sum_{i=1}^n \left[ \log(p_i+1) - \log(a_i+1)\right]^2}
= 
\sqrt{{1\over n} \sum_{i=1}^n \left[ \log\left(\frac{p_i+1}{a_i+1}\right)\right]^2}
$$

*Where:*

- $n$ is the number of instances in the test set
- $p_i$ is your predicted count
- $a_i$ is the actual count
- $\log(x)$ is the natural logarithm
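The formula translates directly into NumPy. The function below is a minimal sketch, not the competition's official scoring code; the name `rmsle` and the input values are chosen here for illustration.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error between two sequences of counts."""
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(actual, dtype=float)
    if p.shape != a.shape:
        raise ValueError("predicted and actual must have the same shape")
    # log1p(x) computes log(x + 1), matching the terms in the formula.
    return float(np.sqrt(np.mean((np.log1p(p) - np.log1p(a)) ** 2)))

# Made-up values, for illustration only.
print(rmsle([1, 10, 100], [1, 10, 100]))  # a perfect forecast
print(rmsle([10, 100], [100, 10]))        # each prediction off by roughly a factor of 10
```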


For the same absolute difference, RMSLE penalizes an under-prediction more heavily than an over-prediction.

```python
import numpy as np

def f(p, a):
    """Signed log error log(p + 1) - log(a + 1) for a single prediction."""
    return np.log1p(p) - np.log1p(a)

# Paired predicted (p) and actual (a) counts to compare.
plist = [10, 1, 1, 91, 100, 1000, 10000]
alist = [100, 10, 1, 1, 10, 100, 1000]

for p, a in zip(plist, alist):
    print('error = {0:.3f} for (p, a) = ({1:.0f} , {2:.0f}) where p-a = {3:.0f}'.format(f(p, a), p, a, p - a))
```


```
error = -2.217 for (p, a) = (10 , 100) where p-a = -90
error = -1.705 for (p, a) = (1 , 10) where p-a = -9
error = 0.000 for (p, a) = (1 , 1) where p-a = 0
error = 3.829 for (p, a) = (91 , 1) where p-a = 90
error = 2.217 for (p, a) = (100 , 10) where p-a = 90
error = 2.294 for (p, a) = (1000 , 100) where p-a = 900
error = 2.302 for (p, a) = (10000 , 1000) where p-a = 9000
```


Because $\log\left(\frac{p_i+1}{a_i+1}\right) = \log\left(1+\frac{p_i-a_i}{a_i+1}\right)$, the log error depends on the difference $p_i - a_i$ *relative to* the actual value, so it is most sensitive to that difference when $a_i$ is small. A large difference $p_i - a_i$ is penalized less for large $a_i$ than the same difference is for small $a_i$. For example, $(p,a) = (91,1)$ and $(100,10)$ have the same difference of 90, but the smaller $a$ gives the larger log error (3.829 versus 2.217). Conversely, as the actual value grows, predictions that are off by the same factor are penalized *almost* equally: $(100,10)$, $(1000,100)$, and $(10000,1000)$ give 2.217, 2.294, and 2.302, approaching $\log 10 \approx 2.303$. Taking the log before computing the mean squared error therefore measures the error in order of magnitude rather than the difference of the original values. The mean squared error of the logs is also less sensitive to outliers than the MSE of the original values.
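The asymmetry between under- and over-prediction can be checked by holding the actual value fixed and moving the prediction by the same amount in each direction. The sketch below uses only the standard library; the helper `log_error` and the chosen numbers (an actual count of 100, a difference of 90) are illustrative assumptions.

```python
import math

def log_error(p, a):
    """Signed log error log(p + 1) - log(a + 1) for one prediction."""
    return math.log1p(p) - math.log1p(a)

actual = 100
under = log_error(actual - 90, actual)  # predict 10, i.e. 90 too low
over = log_error(actual + 90, actual)   # predict 190, i.e. 90 too high

# Same absolute difference, but the under-prediction has the larger magnitude.
print('under: {0:.3f}, over: {1:.3f}'.format(under, over))

# The two forms of the log ratio agree.
p, a = 91, 1
assert math.isclose(log_error(p, a), math.log(1 + (p - a) / (a + 1)))
```

Since RMSLE squares these terms, the gap between the two cases widens further in the final score.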