# Regression task - Bike sharing 2

In [0]:
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/mlcollege/introduction-to-ml/master/data/bikes.csv', sep=',')
data.head()

## Add some features from the past

Since every measurement has a timestamp, we can treat the data as a time series and use values from the past as features. Add one or more feature columns computed from the data of the previous hour.

You can use the following pandas methods:
* [df.sort_values('column_name')](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) - sorts the rows of a data frame by the column with name 'column_name'.
* [df.shift(periods)](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html) - shifts the values of a data frame by the given number of periods (rows), leaving the index in place.
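
As a minimal sketch of how the two methods combine, the toy frame below uses made-up counts and reuses the column names `dteday`, `hr` and `cnt` from the bike data. The column name `cnt_prev` is only an illustration.

```python
import pandas as pd

# Toy hourly counts, deliberately out of order (values are made up).
toy = pd.DataFrame({
    'dteday': ['2011-01-01', '2011-01-01', '2011-01-01', '2011-01-01'],
    'hr': [2, 0, 3, 1],
    'cnt': [32, 16, 13, 40],
})

# sort_values returns a new frame, so the result has to be assigned.
toy = toy.sort_values(['dteday', 'hr']).reset_index(drop=True)

# shift(1) moves each value down one row: row i now holds the count of row i-1.
toy['cnt_prev'] = toy['cnt'].shift(1)

print(toy)
```

The first row has no predecessor, so its `cnt_prev` is `NaN` and needs to be dropped or filled before training.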

In [0]:
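# sort_values returns a new frame, so assign the result and reset the index to the new order
data = data.sort_values(['dteday', 'hr']).reset_index(drop=True)
# Count from the previous row (the previous hour, provided no hours are missing)
data['hist'] = data['cnt'].shift(1)
# The first row has no history (NaN), so drop it
data = data.iloc[1:]
data.head()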
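
If you want more than one history column, the same idea extends to longer lags and rolling statistics. The sketch below runs on a synthetic stand-in frame rather than the bike data, and the feature names `hist_2`, `hist_24` and `hist_mean_3` are hypothetical. It also assumes one row per hour with no gaps; if hours are missing, a row shift is not exactly a time lag.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sorted data: 48 consecutive hourly counts.
demo = pd.DataFrame({'cnt': np.arange(48, dtype=float)})

# Hypothetical extra history features (names are illustrative only).
demo['hist_2'] = demo['cnt'].shift(2)     # count two rows earlier
demo['hist_24'] = demo['cnt'].shift(24)   # same hour on the previous day
# Mean of the three preceding rows; shift(1) first so the current row is excluded.
demo['hist_mean_3'] = demo['cnt'].shift(1).rolling(3).mean()

# Rows without a full history contain NaN and are dropped before training.
demo = demo.dropna().reset_index(drop=True)
print(demo.head())
```

Any column added this way also has to be listed in the feature selection and in the scaler's column list to reach the model.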

### Data preparation

Prepare train and test data sets.

In [0]:
from sklearn.model_selection import train_test_split

X_all = data[['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit','temp', 'atemp', 'hum', 'windspeed', 'hist']]
y_all = data['cnt']


X_train, X_test, y_train, y_test = train_test_split(
    X_all, 
    y_all,
    random_state=1,
    test_size=0.2)

print('Train size: {}'.format(len(X_train)))
print('Test size: {}'.format(len(X_test)))

## Transform categorical attributes using one-hot encoding

It doesn't make sense to treat categorical attributes (e.g. weekday or weathersit) as numerical values, because their codes are labels rather than quantities. Use [One-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) instead.
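
To see what the encoder produces, here is a small sketch on made-up weather codes. Each distinct code becomes its own 0/1 column, so the model no longer sees code 3 as "three times" code 1.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up weather codes; 1, 2 and 3 are category labels, not quantities.
weathersit = np.array([[1], [3], [2], [1]])

encoder = OneHotEncoder(categories='auto')
# fit_transform returns a sparse matrix; toarray() makes it printable.
encoded = encoder.fit_transform(weathersit).toarray()

print(encoder.categories_)  # the distinct codes found in the column
print(encoded)              # one row per sample, one column per code
```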

In [0]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

categorical_cols = ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit']
numerical_cols = ['temp', 'atemp', 'hum', 'windspeed', 'hist']

# One-hot encode the categorical columns and standardize the numerical ones
column_trans = ColumnTransformer(
    [('ohe', OneHotEncoder(categories='auto'), categorical_cols),
     ('std', StandardScaler(), numerical_cols)
    ], remainder='passthrough')

### Training a regressor

Train a regressor using the following models:
* [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
* [Support Vector Machines for regression](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) (experiment with different kernels)
* [Gradient Boosted Trees](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) (experiment with different depths and numbers of trees)
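
One possible way to compare the three models is to fit them in a loop and collect one score per model. The sketch below uses synthetic data instead of the bike data, and the hyperparameters (`C=10.0`, `n_estimators=100`, `max_depth=4`) are arbitrary starting points, not tuned values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data: a nonlinear target of two features plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Candidate models; the settings are illustrative assumptions.
candidates = {
    'linear': LinearRegression(),
    'svr_rbf': SVR(kernel='rbf', C=10.0),
    'gbr': GradientBoostingRegressor(n_estimators=100, max_depth=4, random_state=0),
}

# Fit each model on the same split and record its test mean absolute error.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = mean_absolute_error(y_te, model.predict(X_te))
    print('{}: test MAE {:.3f}'.format(name, scores[name]))
```

On the bike data, each candidate would take the place of the final step of a pipeline that starts with `column_trans`, so that every model sees the same preprocessed features.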

In [0]:
from sklearn import svm
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Preprocess the columns, then fit the regressor; swap the last step to try another model
regr = Pipeline([('transformer', column_trans),
                 ('linear', LinearRegression())
                 #('svr', svm.SVR(kernel='rbf'))
                 #('gbr', GradientBoostingRegressor(n_estimators=100, max_depth=4))
                ])

regr.fit(X_train, y_train)

### Evaluate the models

Measure the root mean squared error (RMSE) and the mean absolute error (MAE) on both the train and test data sets. Compute the mean and standard deviation of the target values to put the errors in context. Decide which model performs best on the given problem.
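
The small worked example below, on made-up numbers, shows how the two metrics are computed and why the standard deviation of the targets is a useful reference. A model that always predicts the mean has an RMSE equal to that standard deviation, so a useful model should score clearly below it.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions; the errors are 2, -2, 3 and -5.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 35.0])

mse = mean_squared_error(y_true, y_hat)   # (4 + 4 + 9 + 25) / 4 = 10.5
rmse = np.sqrt(mse)                       # same units as the target
mae = mean_absolute_error(y_true, y_hat)  # (2 + 2 + 3 + 5) / 4 = 3.0

# Baseline: always predicting the mean gives an RMSE equal to np.std(y_true).
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_true, baseline_pred))

print('RMSE: {:.2f}, MAE: {:.2f}'.format(rmse, mae))
print('Baseline RMSE: {:.2f}, target std: {:.2f}'.format(baseline_rmse, np.std(y_true)))
```

RMSE penalizes large errors more heavily than MAE, which is why it is the larger of the two here.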

In [0]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

y_pred = regr.predict(X_test)
print("Test mean: {:.2f}, std: {:.2f}".format(np.mean(y_test), np.std(y_test)))
print("Test Root mean squared error: {:.2f}".format(np.sqrt(mean_squared_error(y_test, y_pred))))
print("Test Mean absolute error: {:.2f}".format(mean_absolute_error(y_test, y_pred)))

In [0]:
y_pred_train = regr.predict(X_train)
print("Train Root mean squared error: {:.2f}".format(np.sqrt(mean_squared_error(y_train, y_pred_train))))
print("Train Mean absolute error: {:.2f}".format(mean_absolute_error(y_train, y_pred_train)))