Capstone Project
Machine Learning Engineer Nanodegree

# Demand prediction for a bike sharing system

May 2016  
Philipp Vogler  

## Definition

### Project Overview  

- Looked for a problem in the area of transport and logistics that is solvable with machine learning.
- Utilizing machine learning to forecast the demand for the Washington, D.C. bike sharing system Capital Bikeshare.  
- Using different types of regression to find an algorithm that predicts the demand for bikes based on calendar and weather information.  
- Weather, calendar, and demand information is provided in a dataset by the University of Porto at the UCI Machine Learning Repository.  
- This project tries to create a forecasting function based on two years of historical data by utilizing the machine learning libraries scikit-learn and TensorFlow.  

> http://www.capitalbikeshare.com   
> http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset  
> http://freemeteo.de/wetter/  
> http://dchr.dc.gov/page/holiday-schedules  
> http://scikit-learn.org/stable/  
> https://www.tensorflow.org  

### Problem Statement  

The goal is to forecast the demand for bikes as a function of weather conditions, such as outside temperature, and calendar information, such as holidays. This information and the demand history are provided in a dataset with two years of daily historical data.  
The demand is given as the total daily demand and as a split into registered users and casual users. To increase the quality of the prediction, registered user demand and casual user demand are predicted separately in a second step.  
To make predictions, machine learning is used to train regressors. The scikit-learn guide to choosing an estimator recommends a support vector regressor (SVR) for this kind of problem and dataset. In addition, a deep neural network (DNN) regressor is trained for comparison. Grid search and randomized search are used to find the hyperparameters for these regressors. Because the dataset is small, cross-validation is applied.  

> http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html  
> http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR  
> https://github.com/tensorflow/skflow/blob/master/g3doc/api_docs/python/estimators.md  
> http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html  
> http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.RandomizedSearchCV.html

### Metrics

To measure the performance of the regressors, two standard regression metrics are used: mean squared error (MSE) and the coefficient of determination (R^2). Both metrics are calculated for both regressor types. For comparison and parameter tuning, only R^2 is used because it is easier to read.

> http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error  
> http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score  
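
Both metrics can be written out in a few lines. The following is a minimal sketch in plain Python, not the scikit-learn implementation; the helper names `mse` and `r2` are made up for illustration. It uses the first five daily counts from `day.csv` and shows why an R^2 near zero means "no better than predicting the mean" and why R^2 can become negative.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    n = float(len(y_true))
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# First five daily totals (cnt) from day.csv.
y_true = [985, 801, 1349, 1562, 1600]

# A constant prediction equal to the mean of the targets.
mean_pred = [sum(y_true) / float(len(y_true))] * len(y_true)
# A prediction that is worse than the constant mean.
zero_pred = [0.0] * len(y_true)

print(r2(y_true, y_true))     # perfect prediction
print(r2(y_true, mean_pred))  # constant mean prediction
print(r2(y_true, zero_pred))  # worse than the mean: negative
print(mse(y_true, mean_pred))
```

Because MSE is expressed in squared bikes per day, its magnitude depends on the scale of the target, whereas R^2 is scale-free. This is one reason R^2 is easier to compare across the total, casual, and registered models.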

In [1]:
# Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import r2_score, mean_squared_error
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from tensorflow.contrib import skflow

## Analysis

In [2]:
# Fetching Dataset

bike_data = pd.read_csv("day.csv")

print "Data read successfully!"

Data read successfully!


### Data Exploration

In [3]:
# Extracting

feature_cols = bike_data.columns[:-3]  # all columns except the last three (casual, registered, cnt) are features
target_col = bike_data.columns[-1]  # last column (cnt) is the target

print ("Feature column(s):\n{}\n".format(feature_cols))
print ("Target column:\n{}".format(target_col))

Feature column(s):
Index([u'instant', u'dteday', u'season', u'yr', u'mnth', u'holiday',
       u'weekday', u'workingday', u'weathersit', u'temp', u'atemp', u'hum',
       u'windspeed'],
      dtype='object')

Target column:
cnt


In [4]:
# Exploration

print "\n Data values:"
print bike_data.head()  # print the first 5 rows

print "\n Data stats:"
bike_data.describe() # shows stats 


 Data values:
   instant      dteday  season  yr  mnth  holiday  weekday  workingday  \
0        1  2011-01-01       1   0     1        0        6           0   
1        2  2011-01-02       1   0     1        0        0           0   
2        3  2011-01-03       1   0     1        0        1           1   
3        4  2011-01-04       1   0     1        0        2           1   
4        5  2011-01-05       1   0     1        0        3           1   

   weathersit      temp     atemp       hum  windspeed  casual  registered  \
0           2  0.344167  0.363625  0.805833   0.160446     331         654   
1           2  0.363478  0.353739  0.696087   0.248539     131         670   
2           1  0.196364  0.189405  0.437273   0.248309     120        1229   
3           1  0.200000  0.212122  0.590435   0.160296     108        1454   
4           1  0.226957  0.229270  0.436957   0.186900      82        1518   

    cnt  
0   985  
1   801  
2  1349  
3  1562  
4  1600  


| | instant | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 |
| mean | 366.0 | 2.49658 | 0.500684 | 6.519836 | 0.028728 | 2.997264 | 0.683995 | 1.395349 | 0.495385 | 0.474354 | 0.627894 | 0.190486 | 848.176471 | 3656.172367 | 4504.348837 |
| std | 211.165812 | 1.110807 | 0.500342 | 3.451913 | 0.167155 | 2.004787 | 0.465233 | 0.544894 | 0.183051 | 0.162961 | 0.142429 | 0.077498 | 686.622488 | 1560.256377 | 1937.211452 |
| min | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.05913 | 0.07907 | 0.0 | 0.022392 | 2.0 | 20.0 | 22.0 |
| 25% | 183.5 | 2.0 | 0.0 | 4.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.337083 | 0.337842 | 0.52 | 0.13495 | 315.5 | 2497.0 | 3152.0 |
| 50% | 366.0 | 3.0 | 1.0 | 7.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.498333 | 0.486733 | 0.626667 | 0.180975 | 713.0 | 3662.0 | 4548.0 |
| 75% | 548.5 | 3.0 | 1.0 | 10.0 | 0.0 | 5.0 | 1.0 | 2.0 | 0.655417 | 0.608602 | 0.730209 | 0.233214 | 1096.0 | 4776.5 | 5956.0 |
| max | 731.0 | 4.0 | 1.0 | 12.0 | 1.0 | 6.0 | 1.0 | 3.0 | 0.861667 | 0.840896 | 0.9725 | 0.507463 | 3410.0 | 6946.0 | 8714.0 |


#### Characteristics

Most of the features are already normalized or binary.  
The dataset is compact and has no missing values: every column contains 731 entries. Categorical data such as weekday or workingday are already encoded numerically.

### Data Preprocessing (Methodology)

The date column (dteday) is dropped because the regressors cannot read this data type, and the ordering information is already stored in the index and replicated by the instant variable.

In [5]:
# Pre-processing

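# Index.drop expects all labels in its first argument (the second positional
# argument is `errors`), so only 'dteday' is removed and 'instant' stays a feature.
X = bike_data[feature_cols.drop(['dteday'])]  # feature values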
y = bike_data[target_col]  # corresponding targets

### Exploratory Visualization

The visualization shows a classic seasonal pattern with an upward trend year over year. There are some outliers. These are left in the dataset because they are not due to measurement errors but to extreme weather conditions. Extreme weather conditions are part of the problem, so the data is not excluded.

In [6]:
# Visualization

plt.style.use('ggplot')
plt.figure(1)
      
plt.plot(bike_data.cnt,'go')
#plt.plot(bike_data.casual,'yx')
#plt.plot(bike_data.registered,'bx')

plt.title('Number of bikes rented per day')
plt.xlabel('Days')
plt.ylabel('Number of bikes')

plt.show()

# source: http://matplotlib.org/examples/showcase/bachelors_degrees_by_gender.html

### Algorithms and Techniques

In [7]:
# Split

X_train, X_test, y_train, y_test = train_test_split(X, y)  # test_size defaults to 0.25
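
The split above is drawn without a fixed seed, so the rows in the test set, and with them every score in this report, change from run to run. As an illustration only, the following plain-Python sketch shows a seeded 75/25 shuffle split on row indices. The function `shuffled_split` and its `seed` parameter are hypothetical; in scikit-learn the corresponding argument of `train_test_split` is `random_state`.

```python
import math
import random


def shuffled_split(n_rows, test_size=0.25, seed=None):
    """Return (train_idx, test_idx) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Round the test share up, the remainder goes to training.
    n_test = int(math.ceil(test_size * n_rows))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


# 731 daily records, as reported by bike_data.describe().
train_idx, test_idx = shuffled_split(731, test_size=0.25, seed=42)
print(len(train_idx), len(test_idx))

# The same seed reproduces the same split.
assert shuffled_split(731, seed=42) == (train_idx, test_idx)
```

With 731 rows this gives 548 training rows and 183 test rows.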

Two types of regressors are trained: an SVR and a DNN regressor. Both are first used "off the shelf" with default parameters to create a benchmark.

> http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR  
> https://github.com/tensorflow/skflow/blob/master/g3doc/api_docs/python/estimators.md  

### Benchmark

Both benchmarks for the coefficient of determination are very low: the default SVR reaches an R^2 of about 0.0005 and the untuned DNN regressor about -0.24. Parameter tuning is mandatory.

In [8]:
# Training SVR

svr = SVR()
svr.fit(X_train, y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [9]:
# Validation SVR

svr_pred = svr.predict(X_test)

# score_svr = mean_squared_error(y_test, svr_pred)
score_svr = r2_score(y_test, svr_pred)

print("Score SVR: %f" % score_svr)

Score SVR: 0.000482


In [10]:
# DNN-Regressor

# Build 2 layer fully connected DNN with 10, 10 units respectively.
regressor = skflow.TensorFlowDNNRegressor(hidden_units=[10,10], steps=5000, learning_rate=0.1, batch_size=1)

# Fit
regressor.fit(X_train, y_train)

# Predict and validate
#score_regressor = metrics.mean_squared_error( y_test, regressor.predict(X_test))
score_regressor = r2_score(y_test, regressor.predict(X_test))

print('\n Score: {0:f}'.format(score_regressor))

#  Copyright 2015-present The Scikit Flow Authors. All Rights Reserved.
#  Licensed under the Apache License, Version 2.0 (the "License");
# source https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/skflow/boston.py

Step #99, avg. train loss: 3475705.00000
Step #199, avg. train loss: 2000593.87500
Step #299, avg. train loss: 2129308.50000
Step #399, avg. train loss: 2617349.50000
Step #499, avg. train loss: 2149781.00000
Step #600, epoch #1, avg. train loss: 2001944.00000
Step #700, epoch #1, avg. train loss: 2128975.75000
Step #800, epoch #1, avg. train loss: 2106880.25000
Step #900, epoch #1, avg. train loss: 2099041.25000
Step #1000, epoch #1, avg. train loss: 1477907.62500
Step #1100, epoch #2, avg. train loss: 2811893.00000
Step #1200, epoch #2, avg. train loss: 2109668.50000
Step #1300, epoch #2, avg. train loss: 1854669.00000
Step #1400, epoch #2, avg. train loss: 2182509.00000
Step #1500, epoch #2, avg. train loss: 1442949.87500
Step #1600, epoch #2, avg. train loss: 2523864.75000
Step #1700, epoch #3, avg. train loss: 1882376.00000
Step #1800, epoch #3, avg. train loss: 1453228.00000
Step #1900, epoch #3, avg. train loss: 2197455.50000
Step #2000, epoch #3, avg. train loss: 2259222.50000


## Methodology

### Implementation

The regressors are tuned with cross-validation in two steps. First, a grid search identifies the region of the best parameters. Then a randomized search samples parameter values within that region.

> http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html  
> http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.RandomizedSearchCV.html
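
The difference between the two search strategies can be sketched without any library support. The snippet below is an illustration in plain Python, not how scikit-learn implements it: a grid search enumerates every combination of the listed values, while a randomized search draws a fixed number of candidates from distributions. The helper `sample_candidates` and its `seed` are hypothetical names, and the parameter values are only examples.

```python
import itertools
import random

# Grid search: every combination of the listed values is evaluated.
grid = {'C': [1000, 3000, 10000], 'kernel': ['linear', 'rbf']}
names = sorted(grid)
grid_candidates = [dict(zip(names, values))
                   for values in itertools.product(*(grid[n] for n in names))]


# Randomized search: n_iter candidates are drawn from distributions instead.
def sample_candidates(n_iter, seed=0):
    rng = random.Random(seed)
    return [{'C': rng.uniform(1000, 11000), 'kernel': 'linear'}
            for _ in range(n_iter)]


random_candidates = sample_candidates(n_iter=5)

# Each candidate is fitted once per cross-validation fold
# (the final refit on the full training set is not counted here).
n_folds = 3
print(len(grid_candidates) * n_folds)    # fits needed for the grid
print(len(random_candidates) * n_folds)  # fits needed for the random draws
```

The grid grows multiplicatively with every added parameter value, whereas the cost of the randomized search is fixed by `n_iter` regardless of how wide the distributions are.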

In [26]:
# Tuning SVR with GridSearch

tuned_parameters = [{'C': [1000, 3000, 10000], 
                     'kernel': ['linear', 'rbf']}
                   ]

# Default 3-fold cross-validation; candidates are ranked by R^2.
#svr_tuned = GridSearchCV(SVR (C=1), param_grid = tuned_parameters, scoring = 'mean_squared_error')
svr_tuned_GS = GridSearchCV(SVR (C=1), param_grid = tuned_parameters, scoring = 'r2', n_jobs=-1)

svr_tuned_GS.fit(X_train, y_train)

print (svr_tuned_GS)
print ('\n' "Best parameter from grid search: " + str(svr_tuned_GS.best_params_) +'\n')

GridSearchCV(cv=None, error_score='raise',
       estimator=SVR(C=1, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'kernel': ['linear', 'rbf'], 'C': [1000, 3000, 10000]}],
       pre_dispatch='2*n_jobs', refit=True, scoring='r2', verbose=0)

Best parameter from grid search: {'kernel': 'linear', 'C': 3000}



In [12]:
# Validation - SVR tuned 

svr_tuned_pred_GS = svr_tuned_GS.predict(X_test)

#score_svr_tuned = mean_squared_error(y_test, svr_tuned_pred)
score_svr_tuned_GS = r2_score(y_test, svr_tuned_pred_GS)

print('SVR Results\n')
print("Score SVR: %f" % score_svr)
print("Score SVR tuned GS: %f" % score_svr_tuned_GS)

SVR Results

Score SVR: 0.000482
Score SVR tuned GS: 0.757696


In [13]:
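# SVR tuned with RandomizedSearch
# may take a while!

# Parameters
# sp_uniform(loc, scale) samples from [loc, loc + scale], here C in [1000, 11000].
param_dist = {  'C': sp_uniform (1000, 10000), 
                'kernel': ['linear']
             }

n_iter_search = 1  # number of parameter settings that are sampled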

# MSE optimized
#SVR_tuned_RS = RandomizedSearchCV(SVR (C=1), param_distributions = param_dist, scoring = 'mean_squared_error', n_iter=n_iter_search)

# R^2 optimized
SVR_tuned_RS = RandomizedSearchCV(SVR (C=1), param_distributions = param_dist, scoring = 'r2', n_iter=n_iter_search)

# Fit
SVR_tuned_RS.fit(X_train, y_train)

# Best score and corresponding parameters.
print('best CV score from grid search: {0:f}'.format(SVR_tuned_RS.best_score_))
print('corresponding parameters: {}'.format(SVR_tuned_RS.best_params_))

# Predict and score
predict = SVR_tuned_RS.predict(X_test)

#score_regressor_tuned_RS = mean_squared_error(y_test, predict)
score_svr_tuned_RS = r2_score(y_test, predict)

best CV score from grid search: 0.783958
corresponding parameters: {'kernel': 'linear', 'C': 3548.376157985727}
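
A detail that is easy to miss: `scipy.stats.uniform(loc, scale)` is parameterized by a start point and a width, not by a lower and an upper bound. The call `sp_uniform(1000, 10000)` therefore samples C from the interval [1000, 11000]. The following sketch mimics this parameterization with the standard library; `uniform_loc_scale` is a hypothetical helper, not part of SciPy.

```python
import random


def uniform_loc_scale(loc, scale, rng):
    """Draw from [loc, loc + scale], mirroring the loc/scale convention."""
    return loc + scale * rng.random()


rng = random.Random(0)
draws = [uniform_loc_scale(1000, 10000, rng) for _ in range(10000)]

# All draws lie between loc and loc + scale.
print(min(draws) >= 1000, max(draws) <= 11000)

# To search C between 1000 and 10000 instead, the width would be 9000.
narrow = [uniform_loc_scale(1000, 10000 - 1000, rng) for _ in range(10000)]
print(max(narrow) <= 10000)
```

Note also that with `n_iter_search = 1` only a single value of C is drawn and evaluated, so the recorded result reflects one random candidate rather than a search over the interval.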


In [14]:
print('SVR Results\n')
print("Score SVR: %f" % score_svr)
print("Score SVR tuned GS: %f" % score_svr_tuned_GS)
print("Score SVR tuned RS: %f" % score_svr_tuned_RS)

SVR Results

Score SVR: 0.000482
Score SVR tuned GS: 0.757696
Score SVR tuned RS: 0.756611


The tuning works for the SVR: R^2 on the test set rises from about 0.0005 to about 0.76 with both search methods.

In [15]:
# DNN-Regressor tuned with GS

# param_grid
param_grid = {'hidden_units': [[11,11], [12,12], [13,13], [14,14], [15,15]], 
              'steps': [100],
              'learning_rate': [0.3, 0.7, 1.0],
              'batch_size': [250, 300, 350, 400, 450]
             }

# GS with MSE
#regressor_tuned = GridSearchCV(skflow.TensorFlowDNNRegressor (hidden_units = [10, 10]), param_grid, scoring = 'mean_squared_error')

# GS with R^2
regressor_tuned_GS = GridSearchCV(skflow.TensorFlowDNNRegressor (hidden_units = [10, 10]), param_grid, scoring = 'r2', n_jobs=-1)

# Fit
regressor_tuned_GS.fit(X_train, y_train)

# Best score and corresponding parameters.
print('best CV score from grid search: {0:f}'.format(regressor_tuned_GS.best_score_))
print('corresponding parameters: {}'.format(regressor_tuned_GS.best_params_))

# source: https://github.com/tensorflow/skflow/pull/126/files

# Predict and score
predict = regressor_tuned_GS.predict(X_test)

#score_regressor_tuned_GS = mean_squared_error(y_test, predict)
score_regressor_tuned_GS = r2_score(y_test, predict)

print('Score: {0:f}'.format(score_regressor_tuned_GS))

Step #100, epoch #50, avg. train loss: 2010684.00000
Step #100, epoch #50, avg. train loss: 1978598.50000
Step #100, epoch #50, avg. train loss: 2005605.12500
Step #100, epoch #50, avg. train loss: 2649574.50000
Step #100, epoch #50, avg. train loss: 2565630.75000
Step #100, epoch #50, avg. train loss: 2564813.50000
Step #100, epoch #50, avg. train loss: 2663323.75000
Step #100, epoch #50, avg. train loss: 2601203.00000
Step #100, epoch #50, avg. train loss: 2707351.75000
Step #100, epoch #50, avg. train loss: 1990417.25000
Step #100, epoch #50, avg. train loss: 1953482.25000
Step #100, epoch #50, avg. train loss: 2029625.12500
Step #100, epoch #50, avg. train loss: 2498554.25000
Step #100, epoch #50, avg. train loss: 2420813.00000
Step #100, epoch #50, avg. train loss: 2468398.00000
Step #100, epoch #50, avg. train loss: 2943557.50000
Step #100, epoch #50, avg. train loss: 2864365.50000
Step #100, epoch #50, avg. train loss: 2871355.25000
Step #100, epoch #50, avg. train loss: 2053932

In [16]:
print('DNN Regressor Results\n')
print("DNN: %f" % score_regressor)
print("DNN tuned grid: %f" % score_regressor_tuned_GS)

DNN Regressor Results

DNN: -0.236513
DNN tuned grid: -0.021819


In [17]:
# DNN-Regressor tuned with RandomizedSearch

# Parameters
param_dist = {  'hidden_units': [[11,11], [12,12], [13,13]], 
                'learning_rate': sp_uniform(0.0,1.0), 
                'batch_size': sp_randint(250, 350)
             }

n_iter_search = 1

# MSE optimized
#regressor_tuned_RS = RandomizedSearchCV(skflow.TensorFlowDNNRegressor (hidden_units = [10, 10]), param_distributions = param_dist, scoring = 'mean_squared_error', n_iter=n_iter_search)

# R^2 optimized
regressor_tuned_RS = RandomizedSearchCV(skflow.TensorFlowDNNRegressor (hidden_units = [10, 10]), param_distributions = param_dist, scoring = 'r2', n_iter=n_iter_search)

# Fit
regressor_tuned_RS.fit(X_train, y_train)

# Best score and corresponding parameters.
print('\n best CV score from grid search: {0:f}'.format(regressor_tuned_RS.best_score_))
print('\n corresponding parameters: {}'.format(regressor_tuned_RS.best_params_))

# source: https://github.com/tensorflow/skflow/pull/126/files

# Predict and score
predict = regressor_tuned_RS.predict(X_test)

#score_regressor_tuned_RS = mean_squared_error(y_test, predict)
score_regressor_tuned_RS = r2_score(y_test, predict)

print('\n Score: {0:f}'.format(score_regressor_tuned_RS))

Step #100, epoch #50, avg. train loss: 2038259.25000
Step #200, epoch #100, avg. train loss: 1549159.25000
Step #100, epoch #50, avg. train loss: 2031939.50000
Step #200, epoch #100, avg. train loss: 1647863.87500
Step #100, epoch #50, avg. train loss: 2057597.25000
Step #200, epoch #100, avg. train loss: 1664315.37500
Step #100, epoch #50, avg. train loss: 2046084.12500
Step #200, epoch #100, avg. train loss: 1598422.25000

 best CV score from grid search: 0.180538

 corresponding parameters: {'learning_rate': 0.1844694727284093, 'hidden_units': [11, 11], 'batch_size': 316}

 Score: -0.073531


The picture is similar for the DNN regressor: tuning helps, but the results are still underwhelming, with R^2 on the test set remaining below zero. Even the best DNN result is no match for the tuned SVR.

In [18]:
print('Results\n')

print("SVR: %f" % score_svr)
print("SVR tuned grid: %f" % score_svr_tuned_GS)
print("SVR tuned random: %f" % score_svr_tuned_RS)

print('\n')
print("DNN: %f" % score_regressor)
print("DNN tuned grid: %f" % score_regressor_tuned_GS)
print("DNN tuned random: %f" % score_regressor_tuned_RS)

Results

SVR: 0.000482
SVR tuned grid: 0.757696
SVR tuned random: 0.756611


DNN: -0.236513
DNN tuned grid: -0.021819
DNN tuned random: -0.073531


SVR works better than the DNN Regressor.

### Refinement  
The count of rented bikes (cnt) is simply the sum of the columns casual and registered. Two separate models are trained to predict these two quantities, and their predictions are added up afterwards. This should improve the forecast.
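
The additive structure can be checked directly on the first five rows of the dataset shown in the Data Exploration output. The sketch below verifies it and shows how two per-segment forecasts are combined; the prediction values and the helper `combine` are hypothetical and serve only as an illustration.

```python
# (casual, registered, cnt) for the first five days of day.csv.
rows = [
    (331, 654, 985),
    (131, 670, 801),
    (120, 1229, 1349),
    (108, 1454, 1562),
    (82, 1518, 1600),
]
# cnt is exactly casual + registered in every row.
assert all(casual + registered == cnt for casual, registered, cnt in rows)


def combine(pred_casual, pred_registered):
    """Add the two per-segment forecasts to get the total forecast."""
    return [c + r for c, r in zip(pred_casual, pred_registered)]


# Hypothetical per-segment predictions, for illustration only.
pred_casual = [300.0, 150.0, 110.0, 100.0, 90.0]
pred_registered = [700.0, 650.0, 1200.0, 1500.0, 1500.0]
pred_total = combine(pred_casual, pred_registered)
print(pred_total)
```

The idea behind the split is that casual and registered riders may respond differently to weather and calendar effects, so each model can specialize before the forecasts are added.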

In [19]:
# SVR with GridSearch - for casual users

# Extracting
feature_cols_cas = bike_data.columns[:-3]  # all columns except the last three are features
target_col_cas = bike_data.columns[-3]  # third-to-last column (casual) is the target
print ("Feature columns:\n{}\n".format(feature_cols_cas))
print ("Target column:\n{}\n".format(target_col_cas))

# Pre-processing
X_cas = bike_data[feature_cols_cas.drop(['dteday'])]  # feature values; only 'dteday' is removed, 'instant' stays
y_cas = bike_data[target_col_cas]  # corresponding targets

# Split Set
X_train_cas, X_test_cas, y_train_cas, y_test_cas = train_test_split(X_cas, y_cas)# test size is set to 0.25

# Tuning SVR
param_grid = [
             {'C': [1, 3, 10, 30, 100, 300, 1000, 3000],
              'kernel': ['linear', 'rbf']}
             ]

# MSE optimized
#svr_tuned_cas = GridSearchCV(SVR (C=1), param_grid = param_grid, scoring = 'mean_squared_error')

# R^2 optimized
svr_tuned_cas_GS = GridSearchCV(SVR (C=1), param_grid = param_grid, scoring = 'r2', n_jobs=-1)

# Fitting
svr_tuned_cas_GS.fit(X_train_cas, y_train_cas)

print (svr_tuned_cas_GS)
print ('\n' "Best parameter from grid search: {}".format(svr_tuned_cas_GS.best_params_))

Feature columns:
Index([u'instant', u'dteday', u'season', u'yr', u'mnth', u'holiday',
       u'weekday', u'workingday', u'weathersit', u'temp', u'atemp', u'hum',
       u'windspeed'],
      dtype='object')

Target column:
casual

GridSearchCV(cv=None, error_score='raise',
       estimator=SVR(C=1, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'kernel': ['linear', 'rbf'], 'C': [1, 3, 10, 30, 100, 300, 1000, 3000]}],
       pre_dispatch='2*n_jobs', refit=True, scoring='r2', verbose=0)

Best parameter from grid search: {'kernel': 'linear', 'C': 300}


In [20]:
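# SVR with RandomizedSearch - for casual users
# may take a while!

# Parameters
# sp_uniform(loc, scale) samples from [loc, loc + scale], here C in [300, 3300].
param_dist = {  'C': sp_uniform (300, 3000), 
                'kernel': ['linear']
             }

n_iter_search = 1  # number of parameter settings that are sampled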

svr_tuned_cas_RS = RandomizedSearchCV(SVR (C=1), param_distributions = param_dist, scoring = 'r2', n_iter=n_iter_search)

# Fit
svr_tuned_cas_RS.fit(X_train_cas, y_train_cas)

# Best score and corresponding parameters.
print('best CV score from grid search: {0:f}'.format(svr_tuned_cas_RS.best_score_))
print('corresponding parameters: {}'.format(svr_tuned_cas_RS.best_params_))

# Predict and score on the casual-user test split
predict = svr_tuned_cas_RS.predict(X_test_cas)

#score_regressor_tuned_RS = mean_squared_error(y_test_cas, predict)
score_svr_tuned_cas_RS = r2_score(y_test_cas, predict)

best CV score from grid search: 0.645595
corresponding parameters: {'kernel': 'linear', 'C': 1092.914166471615}


In [None]:
# SVR with GridSearch - for registered users

# Extracting
feature_cols_reg = bike_data.columns[:-3]  # all columns except the last three are features
target_col_reg = bike_data.columns[-2]  # second-to-last column (registered) is the target
print ("Feature column(s):\n{}\n".format(feature_cols_reg))
print ("Target column:\n{}\n".format(target_col_reg))

# Pre-processing
X_reg = bike_data[feature_cols_reg.drop(['dteday'])]  # feature values; 'casual' is not among the feature columns
y_reg = bike_data[target_col_reg]  # corresponding targets

# Split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg)# test size is set to 0.25

# Tuning SVR
param_grid = [
             {'C': [1000, 3000, 10000],
              'kernel': ['linear', 'rbf']}
             ]

#svr_tuned_reg = GridSearchCV(SVR (C=1), param_grid = param_grid, scoring = 'mean_squared_error')
svr_tuned_reg_GS = GridSearchCV(SVR (C=1), param_grid = param_grid, scoring = 'r2', n_jobs=-1)


# Fitting 
svr_tuned_reg_GS.fit(X_train_reg, y_train_reg)

print (svr_tuned_reg_GS)
print ('\n' "Best parameter from grid search:{}".format(svr_tuned_reg_GS.best_params_))

Feature column(s):
Index([u'instant', u'dteday', u'season', u'yr', u'mnth', u'holiday',
       u'weekday', u'workingday', u'weathersit', u'temp', u'atemp', u'hum',
       u'windspeed'],
      dtype='object')

Target column:
registered



In [22]:
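# SVR with RandomizedSearch - for registered users
# may take a while!

# Parameters
# sp_uniform(loc, scale) samples from [loc, loc + scale], here C in [1000, 4000].
param_dist = {  'C': sp_uniform (1000, 3000), 
                'kernel': ['linear']
             }

n_iter_search = 1  # number of parameter settings that are sampled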

svr_tuned_reg_RS = RandomizedSearchCV(SVR (C=1), param_distributions = param_dist, scoring = 'r2', n_iter=n_iter_search)

# Fit
svr_tuned_reg_RS.fit(X_train_reg, y_train_reg)

# Best score and corresponding parameters of the registered-user search.
print('best CV score from grid search: {0:f}'.format(svr_tuned_reg_RS.best_score_))
print('corresponding parameters: {}'.format(svr_tuned_reg_RS.best_params_))

# Predict and score on the registered-user test split
predict = svr_tuned_reg_RS.predict(X_test_reg)

#score_regressor_tuned_RS = mean_squared_error(y_test_reg, predict)
score_SVR_tuned_reg_RS = r2_score(y_test_reg, predict)


## Results

### Model Evaluation and Validation

In [23]:
# Prediction

#print ('Score cas: {0:f}'.format(mean_squared_error(y_test_cas,svr_tuned_cas.predict(X_test_cas))))
#print ('Score reg: {0:f}'.format(mean_squared_error(y_test_reg,svr_tuned_reg.predict(X_test_reg))))
print ('Score cas: {0:f}'.format(r2_score(y_test_cas,svr_tuned_cas_RS.predict(X_test_cas))))
print ('Score reg: {0:f}'.format(r2_score(y_test_reg,svr_tuned_reg_RS.predict(X_test_reg))))

predict_sum_test = svr_tuned_cas_RS.predict(X_test) + svr_tuned_reg_RS.predict(X_test)

#score = mean_squared_error(y_test, predict_sum)
score = r2_score(y_test, predict_sum_test)

print('Score sum: {0:f}'.format(score))

Score cas: 0.640793
Score reg: 0.754691
Score sum: 0.776282
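
One caveat about the summed score: `X_test` comes from the split drawn for cnt, while the casual and registered models were fitted on their own, independently drawn splits. Rows of `X_test` may therefore have been part of the training data of those two models, which would make the summed score optimistic. The plain-Python sketch below illustrates the effect on row indices and shows the alternative of one shared split. The helper `split_indices` and the seeds are hypothetical.

```python
import math
import random


def split_indices(n_rows, seed, test_size=0.25):
    """Shuffled hold-out split returning (train, test) as index sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(math.ceil(test_size * n_rows))
    return set(indices[n_test:]), set(indices[:n_test])


n_rows = 731

# Independent splits per target (arbitrary seeds stand in for unseeded calls).
_, test_cnt = split_indices(n_rows, seed=1)
train_cas, _ = split_indices(n_rows, seed=2)
train_reg, _ = split_indices(n_rows, seed=3)
# Test rows that at least one of the two models has already seen in training.
leaked = test_cnt & (train_cas | train_reg)

# One shared split reused for cnt, casual, and registered.
train_shared, test_shared = split_indices(n_rows, seed=1)
leaked_shared = test_shared & train_shared

print(len(leaked), len(leaked_shared))
```

Reusing a single index split for all three targets keeps the test rows unseen by every model, so the individual and the summed R^2 values are measured on the same held-out days.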


### Justification

In [24]:
# Results
print("SVR: %f" % score_svr)
print("SVR tuned grid: %f" % score_svr_tuned_GS)
print("SVR tuned RS: %f" % score_svr_tuned_RS)
print('\n')
print("DNN: %f" % score_regressor)
print("DNN tuned grid: %f" % score_regressor_tuned_GS)
print("DNN tuned random: %f" % score_regressor_tuned_RS)
print('\n')
print('SVR sum: {0:f}'.format(score))

SVR: 0.000482
SVR tuned grid: 0.757696
Score SVR tuned RS: 0.756611
DNN: -0.236513
DNN tuned grid: -0.021819
DNN tuned random: -0.073531


SVR sum: 0.776282


- The SVR beats the DNN regressor by far.  
- The separate prediction of casual and registered customers increases the R^2 slightly, from about 0.76 to about 0.78.  
- An R^2 of about 0.78 is a decent result.  

## Conclusion

### Free-Form Visualization

In [25]:
# Visualization

predict_sum_all = svr_tuned_cas_RS.predict(X) + svr_tuned_reg_RS.predict(X)

plt.style.use('ggplot')
plt.figure(1)
      
plt.plot(bike_data.cnt,'go', label='truth')
plt.plot(predict_sum_all,'bx', label='prediction')

plt.title('Number of bikes rented per day')
plt.xlabel('Days')
plt.ylabel('Number of bikes')

plt.legend(loc='best')

plt.show()

# source: http://matplotlib.org/examples/showcase/bachelors_degrees_by_gender.html

### Reflection

- I had high hopes for the DNN regressor. It was somewhat disappointing that it does not even come close. Maybe my tuning was not right, or it needs more data or computational power.
- Utilizing grid and randomized search in a way that makes sense was a little tricky. It makes more sense to start with a broad grid search and then use randomized search on the resulting interval, instead of vice versa. It is also computationally more efficient.

### Improvement

- An R^2 of about 0.78 is a decent result.  
- The score could possibly be increased by increasing the number of training iterations and the number of folds in the cross-validation, at the expense of computing time.
- Chaining multiple estimators is another option.
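
The trade-off between the number of folds and computing time can be made concrete. The sketch below generates contiguous, unshuffled folds in plain Python and counts the model fits for a grid of six candidates (three values of C times two kernels, as in the first SVR grid). The function `k_fold_indices` is a hypothetical helper; in scikit-learn the number of folds is set through the `cv` argument of `GridSearchCV` and `RandomizedSearchCV`.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    # The first n_rows % n_folds folds get one extra row.
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size


n_train = 548      # training rows after the 75/25 split of 731 days
n_candidates = 6   # 3 values of C x 2 kernels

for n_folds in (3, 5, 10):
    folds = list(k_fold_indices(n_train, n_folds))
    # folds, rows in the first validation fold, total fits for the grid
    print(n_folds, len(folds[0][1]), n_candidates * n_folds)
```

More folds leave more rows for training in each fit and give a less noisy score estimate, but the number of fits grows linearly with the fold count.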