### How many bikes were checked out in a specific timeframe?

#### Predicting demand (a continuous number) is a regression problem, and regression problems need a regression metric:

**Root Mean Squared Logarithmic Error**

Note: this week it is better for the model to overshoot than to undershoot (e.g. 2 bikes too many end up at one station).
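As a minimal sketch of the metric named above, RMSLE can be written with the standard library alone. The function name `rmsle` and the toy numbers are illustrative and not part of the notebook. The example also hints at why overshooting is the safer side: for the same absolute error, an under-prediction gets a larger log error than an over-prediction.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# same absolute error of 2 bikes, once too many and once too few
over = rmsle([10], [12])    # predicted 2 bikes too many
under = rmsle([10], [8])    # predicted 2 bikes too few
print(over, under)
```
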

In [13]:
import pandas as pd

### Import data

In [14]:
df = pd.read_csv('data/train.csv')
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


### Convert datetime column to datetime format and add columns for month, dayofweek, hour, dayofyear

In [15]:
df['datetime'] = pd.to_datetime(df['datetime']) # convert column

df['month'] = df['datetime'].dt.month
df['dayofweek'] = df['datetime'].dt.dayofweek
df['hour'] = df['datetime'].dt.hour
df['dayofyear'] = df['datetime'].dt.dayofyear

df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,month,dayofweek,hour,dayofyear
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,1,5,0,1
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,1,5,1,1


In [16]:
df.shape

(10886, 16)

In [17]:
        
# drop the one row with weather category 4
# pattern: df.drop(df[condition].index, axis=0, inplace=True)

cond = df['weather'] == 4
df.drop(df[cond].index, axis=0, inplace=True)
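
The same row removal can also be written as a boolean filter, which avoids `inplace=True`. A small sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"weather": [1, 2, 4, 3], "count": [16, 40, 164, 13]})

# keep every row whose weather category is not 4, then renumber the rows
filtered = toy[toy["weather"] != 4].reset_index(drop=True)
print(filtered)
```
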

### Train-test split



In [18]:
X = df.drop('count', axis=1)

type(X) # feature matrix



pandas.core.frame.DataFrame

In [19]:
X.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,month,dayofweek,hour,dayofyear
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,1,5,0,1
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,1,5,1,1
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,1,5,2,1
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,1,5,3,1
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1,5,4,1


In [20]:
X.shape

(10885, 15)

In [21]:
y = pd.to_numeric(df['count'])
type(y) # series 

pandas.core.series.Series

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) # stratify=y

# reset indexes
X_train.reset_index(inplace=True)
X_test.reset_index(inplace=True)

y_train.reset_index(inplace=True, drop=True)
y_test.reset_index(inplace=True, drop=True)

# check shapes
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)


(8708, 16) (8708,) (2177, 16) (2177,)


### Feature Engineering Function

Notes:

- If there is no clear, measurable distance between the categories (e.g. seasons), use one-hot encoding.
- Use one-hot encoding for categories where order is not important, and ordinal encoding where you think order matters (linear regression WILL treat ordinal codes as numeric distances).
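
To make the difference concrete, here is a small sketch on a made-up `season` column (the frame `toy` is hypothetical). It assumes `pd.get_dummies` with `drop_first=True`, which drops the first category as the reference level, much like the manual `drop('month_1', ...)` used in the function below.

```python
import pandas as pd

toy = pd.DataFrame({"season": [1, 2, 3, 4, 2]})

# one-hot encoding: one 0/1 column per category, first category dropped as reference
season_dummies = pd.get_dummies(toy["season"], prefix="season", drop_first=True).astype(int)
print(season_dummies)

# ordinal encoding: the raw integer code is kept, so a linear model
# treats season 4 as "three steps" away from season 1
season_ordinal = toy[["season"]]
print(season_ordinal)
```
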

In [28]:
import datetime as dt
def feature_engineer(df): # take any dataframe, no matter if test or train
        # select relevant features
        df_sub = df[['hour', 'atemp', 'temp', 'humidity', 'month', 'workingday', 'weather']] # , 'windspeed', 'weather', 'month',
        
        df_sub2 = df_sub[['humidity', 'atemp', 'workingday', 'temp']]
        
        # one-hot encoding of month (month_1 is dropped as the reference category)
        season_binary_df = pd.get_dummies(df_sub['month'], prefix='month') 
        season_binary_df = season_binary_df.drop('month_1', axis=1)
        
        # one hot encoding of weather cat
        weat_binary_df = pd.get_dummies(df_sub['weather'], prefix='weather_cat')
        weat_binary_df = weat_binary_df.drop('weather_cat_1', axis=1)
        
        # one hot encoding of hour
        hour_binary_df = pd.get_dummies(df_sub['hour'], prefix='hour')
        hour_binary_df = hour_binary_df.drop('hour_0', axis=1)
        
        # join with the sub_df
        df_fe = pd.DataFrame(df_sub2.join([season_binary_df, weat_binary_df, hour_binary_df], how='left')) #  
        
        # interaction term humidity and temperature
        df_fe['temp_hum_interact'] = df_fe['temp'] * df_fe['humidity']
        # drop temp col
        #df_fe = df_fe.drop('temp', axis=1)
        
        # interaction terms: working day x rush hours (hours 7, 8, 9 and 17, 18, 19)
        df_fe['workday_hour_7_interact'] = df_fe['workingday'] * hour_binary_df['hour_7'] 
        df_fe['workday_hour_8_interact'] = df_fe['workingday'] * hour_binary_df['hour_8'] 
        df_fe['workday_hour_9_interact'] = df_fe['workingday'] * hour_binary_df['hour_9']
        df_fe['workday_hour_17_interact'] = df_fe['workingday'] * hour_binary_df['hour_17']  
        df_fe['workday_hour_18_interact'] = df_fe['workingday'] * hour_binary_df['hour_18']
        df_fe['workday_hour_19_interact'] = df_fe['workingday'] * hour_binary_df['hour_19']
        
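        # interaction term non-working day and hour of day 
        # create non-working day column
        # NOTE: df_sub is a slice of df, so this assignment triggers pandas' SettingWithCopyWarning;
        # selecting with df[[...]].copy() above would avoid it
        df_sub['non_workingday'] = df_sub['workingday'].replace({0:1, 1:0})
        # NOTE: each assignment below overwrites df_fe['non_workingday'], so only the
        # hour_14 interaction survives; distinct column names would keep all four
        df_fe['non_workingday'] = df_sub['non_workingday'] * hour_binary_df['hour_11'] 
        df_fe['non_workingday'] = df_sub['non_workingday'] * hour_binary_df['hour_12'] 
        df_fe['non_workingday'] = df_sub['non_workingday'] * hour_binary_df['hour_13']
        df_fe['non_workingday'] = df_sub['non_workingday'] * hour_binary_df['hour_14']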

        
        
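        # reset index
        # NOTE: reset_index() returns a new frame and the result is not assigned, so this line has no effect
        df_fe.reset_index()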

        
        return df_fe
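
A possible variant of the non-working-day block, sketched on a toy frame: selecting with `.copy()` avoids the `SettingWithCopyWarning`, and giving each interaction its own column keeps all of them. The column names `nonworkday_hour_<h>_interact` are an assumption, not the notebook's. Applying this to the real function would change the number of features and therefore the scores recorded further down.

```python
import pandas as pd

toy = pd.DataFrame({
    "hour": [11, 12, 13, 14, 11],
    "workingday": [0, 0, 1, 0, 1],
})

df_sub = toy[["hour", "workingday"]].copy()       # explicit copy: safe to add columns
df_sub["non_workingday"] = 1 - df_sub["workingday"]

hour_dummies = pd.get_dummies(df_sub["hour"], prefix="hour").astype(int)

df_fe = df_sub[["workingday"]].copy()
for h in (11, 12, 13, 14):
    # one column per hour, so no assignment overwrites another
    df_fe[f"nonworkday_hour_{h}_interact"] = (
        df_sub["non_workingday"] * hour_dummies[f"hour_{h}"]
    )

print(df_fe)
```
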

In [29]:
X_train_fe = feature_engineer(X_train)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [30]:
X_train_fe.head()

Unnamed: 0,humidity,atemp,workingday,temp,month_2,month_3,month_4,month_5,month_6,month_7,...,hour_22,hour_23,temp_hum_interact,workday_hour_7_interact,workday_hour_8_interact,workday_hour_9_interact,workday_hour_17_interact,workday_hour_18_interact,workday_hour_19_interact,non_workingday
0,75,15.91,0,11.48,0,0,0,0,0,0,...,0,0,861.0,0,0,0,0,0,0,0
1,61,32.575,0,28.7,0,0,0,0,0,0,...,0,0,1750.7,0,0,0,0,0,0,0
2,59,38.635,1,32.8,0,0,0,0,0,0,...,0,0,1935.2,0,0,0,0,1,0,0
3,65,13.635,1,12.3,0,0,1,0,0,0,...,0,0,799.5,1,0,0,0,0,0,0
4,62,34.09,0,29.52,0,0,0,0,1,0,...,0,0,1830.24,0,0,0,0,0,0,0


In [31]:
X_train_fe.shape

(8708, 48)

### Polynomial Features 
**has to be done BEFORE SCALING**
**mostly done on continuous data (like temperature, humidity)**

In [32]:
from sklearn.preprocessing import PolynomialFeatures
pt = PolynomialFeatures()

p_features = pt.fit_transform(X_train_fe[['atemp']])
#p_features[:,2]
p_features.shape

(8708, 3)

In [33]:
polynomial_temp_df = pd.DataFrame.from_records(p_features)


# the three columns are: bias term (all ones), atemp, atemp**2
polynomial_temp_df.columns = ['t', 't2', 'temp_pol2']
polynomial_temp_df.head()

Unnamed: 0,t,t2,temp_pol2
0,1.0,15.91,253.1281
1,1.0,32.575,1061.130625
2,1.0,38.635,1492.663225
3,1.0,13.635,185.913225
4,1.0,34.09,1162.1281


In [34]:
polynomial_temp_df.shape

(8708, 3)

In [35]:
X_train_fe = X_train_fe.join(polynomial_temp_df['temp_pol2'], how='left')
X_train_fe.head()

Unnamed: 0,humidity,atemp,workingday,temp,month_2,month_3,month_4,month_5,month_6,month_7,...,hour_23,temp_hum_interact,workday_hour_7_interact,workday_hour_8_interact,workday_hour_9_interact,workday_hour_17_interact,workday_hour_18_interact,workday_hour_19_interact,non_workingday,temp_pol2
0,75,15.91,0,11.48,0,0,0,0,0,0,...,0,861.0,0,0,0,0,0,0,0,253.1281
1,61,32.575,0,28.7,0,0,0,0,0,0,...,0,1750.7,0,0,0,0,0,0,0,1061.130625
2,59,38.635,1,32.8,0,0,0,0,0,0,...,0,1935.2,0,0,0,0,1,0,0,1492.663225
3,65,13.635,1,12.3,0,0,1,0,0,0,...,0,799.5,1,0,0,0,0,0,0,185.913225
4,62,34.09,0,29.52,0,0,0,0,1,0,...,0,1830.24,0,0,0,0,0,0,0,1162.1281


In [36]:
X_train_fe.shape

(8708, 49)

In [37]:
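# NOTE: without drop=True this adds the old index as an 'index' column,
# which then enters the model as a feature (49 -> 50 columns)
X_train_fe = X_train_fe.reset_index()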

In [38]:
##X_train_fe = X_train_fe.fillna(0)
## X_train_fe.info()

### Scaling

In [39]:
# # scaling with min max
# from sklearn.preprocessing import MinMaxScaler
# scaler = MinMaxScaler()
# scaler.fit(X_train_fe) # memorizes the min and max for each column, no y 
# X_train_fe = scaler.transform(X_train_fe) # does the actual scaling; still no y
# X_train_fe  # numpy array

In [40]:
# # scaling with standard scaler
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# X_train_fe = scaler.fit_transform(X_train_fe)
# X_train_fe
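
If scaling is switched back on, the scaler should learn its statistics from the training data only and then be reused for the test data. A minimal sketch with made-up numbers (`X_train_toy` and `X_test_toy` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[20.0], [40.0]])

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train_toy)   # learn mean/std from training data only
X_test_sc = scaler.transform(X_test_toy)         # reuse the training mean/std
print(X_train_sc.ravel(), X_test_sc.ravel())
```
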

In [41]:
X_train_fe.shape

(8708, 50)

### Train model: Linear regression with scikit-learn

In [42]:
from sklearn.linear_model import LinearRegression

### Fit the model

In [43]:
# Create the model
# note: the normalize parameter was deprecated in scikit-learn 1.0 and removed in 1.2
m = LinearRegression(normalize=True)


In [44]:
m.fit(X_train_fe, y_train)


### Reduce the complexity through regularization
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet # combination of Ridge and Lasso

m_ridge = Ridge(alpha=5.0)
m_ridge.fit(X_train_fe, y_train)

m_lasso = Lasso(alpha=5.0)
m_lasso.fit(X_train_fe, y_train)

m_ridge_lasso = ElasticNet(alpha=0.5)
m_ridge_lasso.fit(X_train_fe, y_train)

ElasticNet(alpha=0.5, copy_X=True, fit_intercept=True, l1_ratio=0.5,
           max_iter=1000, normalize=False, positive=False, precompute=False,
           random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
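
To illustrate how the two penalties differ, here is a sketch on synthetic data in which only the first feature carries signal (the data and the `alpha` values are made up). Ridge shrinks all coefficients, while Lasso can set irrelevant ones to exactly zero. With a large `alpha` on unscaled features, Lasso may also remove useful features, which could be one reason for the lower Lasso and ElasticNet training scores reported further down.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# only the first feature carries signal
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=5.0).fit(X_toy, y_toy)
lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)

print("ridge:", ridge.coef_.round(3))   # all coefficients shrunk, none exactly zero
print("lasso:", lasso.coef_.round(3))   # irrelevant coefficients set to exactly zero
```
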

In [45]:
m.coef_

array([-2.89625258e-04,  1.27133023e+00,  2.73906161e+00, -5.26711103e+01,
        1.01995845e+01,  4.56209694e+00,  2.01216958e+01,  3.45036315e+01,
        6.62194732e+01,  4.52337705e+01,  1.78884024e+01,  3.27973330e+01,
        7.70122161e+01,  8.76125704e+01,  7.23997670e+01,  6.69442800e+01,
       -7.93451353e+00, -6.56881015e+01, -1.49520617e+01, -2.65421059e+01,
       -3.84229248e+01, -3.77526481e+01, -2.05616762e+01,  3.66829089e+01,
       -3.08360095e+01,  2.78874741e+01,  8.33430008e+01,  1.04122122e+02,
        1.28694067e+02,  1.61946893e+02,  1.59167146e+02,  9.23541306e+01,
        1.49924867e+02,  2.10926387e+02,  2.05847347e+02,  1.66728464e+02,
        1.33868401e+02,  1.54325069e+02,  1.09085917e+02,  6.92811471e+01,
        3.41513847e+01, -1.18542398e-01,  2.96484287e+02,  4.16896024e+02,
        1.14663963e+02,  2.37527607e+02,  2.52534095e+02,  1.50114653e+02,
        1.36504564e+02, -7.69181023e-03])

In [46]:
# Coefficients
w_0 = m.intercept_
w_1 = m.coef_[0]  # first column of X_train_fe, i.e. the 'index' column added by reset_index()

In [47]:
# Intercept w_0 and first coefficient w_1
w_0, w_1

(-127.60906115406877, -0.00028962525780861236)

### Evaluate/Optimize the model

What kind of evaluation metrics can we use?

- MSE (mean squared error)
- RMSLE (root mean squared logarithmic error)
- R-squared (coefficient of determination)

You should do cross-validation (on your own).
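
All three metrics are available in `sklearn.metrics`. A short sketch on made-up values (`y_true_toy` and `y_pred_toy` are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error, r2_score

y_true_toy = np.array([3.0, 5.0, 2.0])
y_pred_toy = np.array([2.0, 5.0, 4.0])

mse = mean_squared_error(y_true_toy, y_pred_toy)
# mean_squared_log_error requires non-negative inputs; the root gives RMSLE
rmsle_value = np.sqrt(mean_squared_log_error(y_true_toy, y_pred_toy))
r2 = r2_score(y_true_toy, y_pred_toy)
print(mse, rmsle_value, r2)
```

Because the log-based metric rejects negative values, predictions from a linear model usually need to be clipped at 0 before it is computed.
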



In [48]:
# Look at the training score
# score() returns the coefficient of determination R^2 of the prediction
m.score(X_train_fe, y_train)
# R-squared 0.6357555992294879
# incl weather cat one hot encoded 0.6389685222446879
# incl interaction working day and hour_18  0.6498956851149124
# incl interaction working day and hour_17  0.6615055908463631
# incl interaction working day and hour_19  0.6649612525052413
# incl interaction working day and hour_7   0.6850735381594927
# incl interaction working day and hour_8   0.7295004828181124
# incl interaction working day and hour_9   0.7333398004517597
# incl interaction non working day and hour_11   0.7366122513219515
# incl interaction non working day and hour_12   0.7376443860278101
# incl interaction non working day and hour_13   0.7378722116507113
# incl interaction non working day and hour_14   0.7387041310756237
# incl temperature polynomial, degree = 2    0.7457480384640465

0.7447261196143407

In [49]:
# with regularization
print('m_ridge\n',
m_ridge.score(X_train_fe, y_train),
      '\nm_lasso\n',
m_lasso.score(X_train_fe, y_train),
      '\nm_ridge_lasso\n',
m_ridge_lasso.score(X_train_fe, y_train))



m_ridge
 0.7436770858009656 
m_lasso
 0.41564517603523254 
m_ridge_lasso
 0.389740197391364


### Optimize/ Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
cross_val_result_m_negMSE = cross_val_score(m, X_train_fe, y_train, cv=10, scoring='neg_mean_squared_error')
cross_val_result_m_negMSE

In [None]:
cross_val_result_m_r2 = cross_val_score(m, X_train_fe, y_train, cv=10, scoring='r2')
cross_val_result_m_r2

In [None]:
print('cross_val_result_m_negMSE mean',
cross_val_result_m_negMSE.mean(), 
      '\ncross_val_result_m_r2 mean',
      cross_val_result_m_r2.mean())   # no overfitting !
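
The `neg_mean_squared_error` scores are negated MSE values, so they are always zero or below. A sketch on synthetic data (the data and the fold count are made up) shows how to turn them into an RMSE per fold:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                          scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)   # flip the sign, then take the root
print(rmse_per_fold.mean())
```
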

### Predictions

In [73]:
# Make predictions for the training data
y_pred_train = m.predict(X_train_fe)
y_pred_train

array([ 96.79815553, 355.15971854, 546.27311084, ..., 385.54207295,
       187.32296237, 403.7473712 ])

### Calculate Test Score

In [50]:
X_test_fe = feature_engineer(X_test)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [51]:
pt = PolynomialFeatures(degree=2)
p_features = pt.fit_transform(X_test_fe[['atemp']])
polynomial_temp_df = pd.DataFrame.from_records(p_features)


polynomial_temp_df.columns = ['t', 't2', 'temp_pol2']
X_test_fe = X_test_fe.join(polynomial_temp_df['temp_pol2'], how='left')
#X_test_fe = X_test_fe.fillna(0)
X_test_fe = X_test_fe.reset_index()

In [52]:
# # scaling with standard scaler
# # reuse the scaler fitted on the training data: transform only, do not fit on the test set
# X_test_fe_sc = scaler.transform(X_test_fe)

In [53]:
X_test_fe.shape

(2177, 50)

In [54]:
y_test.shape

(2177,)

In [55]:
m.score(X_test_fe, y_test)

0.7452193465126459

In [72]:
m.predict(X_test_fe)


array([-743.6701556 , -756.17280795, -767.76285208, ..., -565.59674219,
       -599.53983217, -646.04515887])

In [79]:
Xtest_kaggle = pd.read_csv('data/test.csv')
Xtest_kaggle.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed
0,2011-01-20 00:00:00,1,0,1,1,10.66,11.365,56,26.0027
1,2011-01-20 01:00:00,1,0,1,1,10.66,13.635,56,0.0
2,2011-01-20 02:00:00,1,0,1,1,10.66,13.635,56,0.0
3,2011-01-20 03:00:00,1,0,1,1,10.66,12.88,56,11.0014
4,2011-01-20 04:00:00,1,0,1,1,10.66,12.88,56,11.0014


In [82]:
Xtest_kaggle['datetime'] = pd.to_datetime(Xtest_kaggle['datetime']) # convert column

Xtest_kaggle['month'] = Xtest_kaggle['datetime'].dt.month
Xtest_kaggle['dayofweek'] = Xtest_kaggle['datetime'].dt.dayofweek
Xtest_kaggle['hour'] = Xtest_kaggle['datetime'].dt.hour
Xtest_kaggle['dayofyear'] = Xtest_kaggle['datetime'].dt.dayofyear

Kaggle_Xtest_fe = feature_engineer(Xtest_kaggle)

pt = PolynomialFeatures(degree=2)
p_features = pt.fit_transform(Kaggle_Xtest_fe[['atemp']])
polynomial_temp_df = pd.DataFrame.from_records(p_features)


polynomial_temp_df.columns = ['t', 't2', 'temp_pol2']
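# NOTE: the joined frame is assigned to X_test_fe, not Kaggle_Xtest_fe, so Kaggle_Xtest_fe
# never receives temp_pol2 and X_test_fe from the test-score cells is overwritten
X_test_fe = Kaggle_Xtest_fe.join(polynomial_temp_df['temp_pol2'], how='left')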
#X_test_fe = X_test_fe.fillna(0)
Kaggle_Xtest_fe = Kaggle_Xtest_fe.reset_index()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [86]:
y_pred_lr = m.predict(Kaggle_Xtest_fe)
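y_pred_lr   # implausibly large counts: Kaggle_Xtest_fe lacks temp_pol2, so its columns do not match the training features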

array([177020.03119196, 176999.70646631, 176987.82535777, ...,
       189805.70194065, 177125.46456892, 205547.33880248])
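
Predictions in the hundreds of thousands suggest that the columns of `Kaggle_Xtest_fe` do not line up with the columns the model was trained on. One common guard, sketched here on toy frames (all names and values are hypothetical), is to reindex new data to the training columns before predicting. This also covers the case where `pd.get_dummies` produces different columns because a category is missing or new.

```python
import pandas as pd

train_toy = pd.DataFrame({"weather": [1, 2, 3, 2]})
new_toy = pd.DataFrame({"weather": [1, 4, 2]})     # has category 4, lacks category 3

train_fe = pd.get_dummies(train_toy["weather"], prefix="weather_cat").astype(int)
new_fe = pd.get_dummies(new_toy["weather"], prefix="weather_cat").astype(int)

# force the new data into the training layout:
# missing columns are added as 0, unseen columns are dropped, order is copied
new_aligned = new_fe.reindex(columns=train_fe.columns, fill_value=0)
print(list(new_aligned.columns))
```
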

In [87]:
# y_pred_lr = m.predict(Xtest_kaggle)
# y_pred_lr[y_pred_lr < 0] = 0
# submission = pd.DataFrame({"datetime": Xtest_kaggle["datetime"], "count": y_pred_lr})
# submission.shape
# submission.to_csv("kaggle_bike_predictions_AF_2nd", index=False)

### Kaggle submission

Kaggle evaluates all submissions with the Root Mean Squared Logarithmic Error (RMSLE).

The purpose of this metric is to measure the error relative to the bike count. If the number of bikes is 100, an error of 10 bikes does not matter that much, but if the count is only 10, the same error is large. Taking the logarithm accounts for that.

To optimize your model against the RMSLE, take the logarithm of the target column (y). Because 0 is a valid target value, use the log of y+1 instead.
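
A minimal sketch of that idea on synthetic counts (the data is made up, and the plain `LinearRegression` stands in for whichever model is used): train on `np.log1p(y)`, then map predictions back with `np.expm1` and clip them at 0.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(50, 1))
y_toy = np.round(np.exp(0.3 * X_toy[:, 0]))      # made-up non-negative counts

y_log = np.log1p(y_toy)                          # log(y + 1), defined for y = 0
model = LinearRegression().fit(X_toy, y_log)     # train on the log scale

y_pred = np.expm1(model.predict(X_toy))          # back to the count scale
y_pred = np.clip(y_pred, 0, None)                # counts cannot be negative
print(y_pred[:5])
```
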