# Regression task - Bike sharing 1

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, is automated. Through these systems, a user can easily rent a bike at one location and return it at another.

The dataset contains the hourly count of rental bikes in 2011 and 2012 in the Capital Bikeshare system (Washington, D.C.), together with the corresponding weather and seasonal information.

The goal of this task is to train a regressor that predicts the total count of bike rentals for a given hour from the provided features.

## Data source
[http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset](http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset)

## Feature description
* **dteday** - date stamp (for example, 2011-01-01)
* **season** - season (1: spring, 2: summer, 3: fall, 4: winter)
* **yr** - year (0: 2011, 1: 2012)
* **mnth** - month (1 to 12)
* **hr** - hour (0 to 23)
* **holiday** - 1 if the day is a holiday, else 0 (extracted from [holiday schedules](https://dchr.dc.gov/page/holiday-schedules))
* **weekday** - day of the week (0 to 6)
* **workingday** - 1 if the day is neither a weekend nor a holiday, else 0
* **weathersit** - weather situation:
    * 1: Clear, Few clouds, Partly cloudy
    * 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* **temp** - Normalized temperature in degrees Celsius.
* **atemp** - Normalized feels-like temperature in degrees Celsius.
* **hum** - Normalized relative humidity.
* **windspeed** - Normalized wind speed.
* **casual** - Count of casual users.
* **registered** - Count of registered users.
* **cnt** - Total count of rental bikes, including both casual and registered users. This is the target value.

In [1]:
import pandas as pd
data = pd.read_csv('../data/bikes.csv', sep=',')
data.head()

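| Unnamed: 0 | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |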


## Simple regressor

Implement a simple regressor based on all reasonable features of the input data set. Note that some columns cannot be used as features: for example, **casual** and **registered** add up to the target **cnt**.
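Here is one quick way to see why some columns must be excluded. The sketch below rebuilds the `casual`, `registered` and `cnt` columns from the five rows shown by `data.head()` and checks that the first two sum to the target. The small `head` frame is a hand-copied stand-in, not the full data set, so the check only covers those rows.

```python
import pandas as pd

# Hand-copied from the five rows printed by data.head() (illustration only).
head = pd.DataFrame({
    'casual': [3, 8, 5, 3, 0],
    'registered': [13, 32, 27, 10, 1],
    'cnt': [16, 40, 32, 13, 1],
})

# If casual + registered reproduces cnt, using them as inputs leaks the target.
leaks = (head['casual'] + head['registered'] == head['cnt']).all()
print('casual + registered == cnt on these rows:', leaks)
```

The same expression can be run on the full `data` frame. Columns such as `dteday` (a string) and `Unnamed: 0` (a row index) are also poor candidates for a plain regressor, though that is a judgment call rather than a hard rule.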

### Data preparation

Prepare train and test data sets.

In [2]:
from sklearn.model_selection import train_test_split

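# Feature columns for the first model. 'casual' and 'registered' are excluded
# because they sum to the target; 'hr' is also left out of this feature set.
X_all = data[['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
              'weathersit', 'temp', 'atemp', 'hum', 'windspeed']]
# Target: total hourly rental count.
y_all = data['cnt']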


X_train, X_test, y_train, y_test = train_test_split(
    X_all, 
    y_all,
    random_state=1,
    test_size=0.2)

print('Train size: {}'.format(len(X_train)))
print('Test size: {}'.format(len(X_test)))

Train size: 13903
Test size: 3476
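The reported sizes can be checked by hand. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to a whole number of samples and gives the remainder to the training set; the total row count is taken from the two sizes printed above.

```python
import math

# Total number of rows, reconstructed from the printed train and test sizes.
n_samples = 13903 + 3476
test_size = 0.2

# Assumed rounding rule: the test share is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print('Total: {}, train: {}, test: {}'.format(n_samples, n_train, n_test))
```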


### Training a regressor

Train a regressor using the following models:
* [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
* [Support Vector Machines for regression](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) (experiment with different kernels)
* [Gradient Boosted Trees](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) (experiment with different depths and numbers of trees)
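One possible way to organize these experiments is to put the candidate models in a dictionary and fit them in a loop. The sketch below is only an illustration: it uses a small synthetic data set as a stand-in for the bike data, and the candidate names and hyperparameters (`C=100`, the chosen depths) are arbitrary assumptions, not tuned values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the bike data (assumption): a non-linear target.
rng = np.random.RandomState(1)
X = rng.uniform(0, 1, size=(400, 3))
y = 100 * np.sin(6 * X[:, 0]) + 50 * X[:, 1] ** 2 + rng.normal(0, 5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Candidate models; SVR is wrapped in a pipeline because it is scale-sensitive.
candidates = {
    'linear': LinearRegression(),
    'svr_linear': Pipeline([('std', StandardScaler()),
                            ('svr', SVR(kernel='linear'))]),
    'svr_rbf': Pipeline([('std', StandardScaler()),
                         ('svr', SVR(kernel='rbf', C=100))]),
    'gbt_depth2': GradientBoostingRegressor(n_estimators=100, max_depth=2,
                                            random_state=1),
    'gbt_depth4': GradientBoostingRegressor(n_estimators=100, max_depth=4,
                                            random_state=1),
}

# Fit every candidate and record its test mean absolute error.
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_test, model.predict(X_test))

# Print the candidates from best to worst.
for name, mae in sorted(results.items(), key=lambda item: item[1]):
    print('{:<12} test MAE: {:.2f}'.format(name, mae))
```

On the real data, the same loop can be run with `X_train`, `X_test`, `y_train` and `y_test` from the data preparation step. Keep in mind that SVR training time grows quickly with the number of samples.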

In [3]:
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

regr = LinearRegression()

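# Alternative models: uncomment one of the assignments below to try it.
#regr = Pipeline([('std', StandardScaler()),
#                 ('svr', svm.SVR(kernel='linear'))])

# Note: in scikit-learn >= 1.0 use loss='absolute_error'
# ('lad' was deprecated in 1.0 and removed in 1.2).
#regr = GradientBoostingRegressor(n_estimators=100, max_depth=4, loss='lad')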

regr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Evaluate the models

Measure the mean squared error (MSE) and the mean absolute error (MAE) on both the train and test data sets. Compute the mean and standard deviation of the target values to put the errors in context. Decide which model performs best on the given problem.
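To make the two metrics concrete, the following sketch computes them by hand with NumPy for a trivial baseline that always predicts the mean. The toy targets are the `cnt` values of the five rows shown by `data.head()`; the baseline itself is an assumption added for illustration. It also shows why the standard deviation is a useful reference: for this baseline, the MSE equals the variance of the targets.

```python
import numpy as np

# Toy targets: the 'cnt' values of the five rows shown by data.head().
y_true = np.array([16, 40, 32, 13, 1], dtype=float)

# Baseline (assumption for illustration): always predict the mean target.
y_baseline = np.full_like(y_true, y_true.mean())

mse = np.mean((y_true - y_baseline) ** 2)   # mean squared error
mae = np.mean(np.abs(y_true - y_baseline))  # mean absolute error

print('mean: {:.2f}, std: {:.2f}'.format(y_true.mean(), y_true.std()))
print('baseline MSE: {:.2f}, MAE: {:.2f}'.format(mse, mae))

# For the mean baseline, the MSE is exactly the (population) variance.
assert np.isclose(mse, y_true.var())
```

A trained model is only useful if its MSE is clearly below the variance of the targets it is evaluated on.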

In [4]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Predict on the held-out test set and report the metrics next to the
# mean and standard deviation of the test targets.
y_pred = regr.predict(X_test)
print("Test mean: {}, std: {}".format(np.mean(y_test), np.std(y_test)))
print("Test Mean squared error: {:.2f}".format(mean_squared_error(y_test, y_pred)))
print("Test Mean absolute error: {:.2f}".format(mean_absolute_error(y_test, y_pred)))

Test mean: 191.16168009205984, std: 182.65244995043594
Test Mean squared error: 22910.06
Test Mean absolute error: 114.17


In [5]:
# Predict on the training set to compare against the test metrics.
y_pred = regr.predict(X_train)
print("Train Mean squared error: {:.2f}".format(mean_squared_error(y_train, y_pred)))
print("Train Mean absolute error: {:.2f}".format(mean_absolute_error(y_train, y_pred)))

Train Mean squared error: 22541.49
Train Mean absolute error: 113.26
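The printed numbers can be read as follows. The sketch below copies the reported values and derives three rough indicators: the root mean squared error, an approximate R² on the test set, and the ratio of test to train MSE. The R² formula assumes that the printed standard deviation is the population standard deviation of the test targets (the `np.std` default).

```python
import math

# Values copied from the outputs printed above.
test_std = 182.65244995043594
test_mse = 22910.06
train_mse = 22541.49

# RMSE is in the same units as the target, so it can be compared with the std.
test_rmse = math.sqrt(test_mse)

# R^2 = 1 - MSE / variance: the share of variance explained by the model.
r2_test = 1.0 - test_mse / test_std ** 2

# A ratio close to 1 suggests little overfitting.
mse_ratio = test_mse / train_mse

print('Test RMSE: {:.2f} (target std: {:.2f})'.format(test_rmse, test_std))
print('Approximate test R^2: {:.3f}'.format(r2_test))
print('Test/train MSE ratio: {:.3f}'.format(mse_ratio))
```

With these numbers, the linear model explains roughly 31% of the variance, and train and test errors are close. This suggests that the model is underfitting rather than overfitting.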


### Feature importance

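Print the coefficients of the linear regression model and decide which features are most important. Keep in mind that coefficient magnitudes are only directly comparable for features on a similar scale.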


In [6]:
print('Coefficients: \n', regr.coef_)

Coefficients: 
 [  17.80667933   76.64323597    1.03487081  -24.57869294    1.64737296
    3.86627466    6.25513072   -3.42563606  365.41018721 -285.52572173
   63.18158246]
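The raw array is easier to read when each coefficient is paired with its feature name. The sketch below copies the printed coefficients and assumes they follow the column order used to build `X_all`; it then sorts them by absolute value.

```python
# Column order used to build X_all in the data preparation step.
feature_names = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
                 'weathersit', 'temp', 'atemp', 'hum', 'windspeed']

# Coefficients copied from the output printed above.
coefficients = [17.80667933, 76.64323597, 1.03487081, -24.57869294,
                1.64737296, 3.86627466, 6.25513072, -3.42563606,
                365.41018721, -285.52572173, 63.18158246]

# Rank features by the absolute size of their coefficient.
ranked = sorted(zip(feature_names, coefficients),
                key=lambda pair: abs(pair[1]), reverse=True)

for name, coef in ranked:
    print('{:<11} {:>9.2f}'.format(name, coef))
```

With a fitted model, the same pairing can be done directly with `zip(X_all.columns, regr.coef_)`. The ranking should be read with care: the normalized weather features span a much smaller range than `season` or `mnth`, which inflates their coefficients. In addition, `temp` and `atemp` are likely to be strongly correlated, so the split of weight between them may be unstable.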
