### Citybike statistical analysis with multiple linear regression (test)

##### Import libraries

In [28]:
import pandas as pd
import sys, os
import matplotlib.pyplot as plt
import numpy as np


##### Read and combine data

In [29]:
avail_data = "hourly-avg-2017-06-all-stations.csv"

rain_data = "fmi-raindata-Helsinki-Kaisaniemi-2017-06.csv"

temp_data = "fmi-weatherdata-Helsinki-Kaisaniemi-2017.csv"

A = pd.read_csv(avail_data, sep=',')
B = pd.read_csv(rain_data, sep=',')
C = pd.read_csv(temp_data, sep=",")

# Rename only the columns whose names contain spaces or special characters
C.rename(columns={'Sateen intensiteetti (mm/h)': 'Sateen_intensiteetti_mm_h', 'Ilman lämpötila (degC)': 'Ilman_lampotila'}, inplace=True)

# Take 720 temperature values from the column 'Ilman_lampotila', starting at the first June (Kk == 6) row.
# The positions 1465:2185 are hardcoded; this is only for model-testing purposes.
C_ilma = C['Ilman_lampotila'].iloc[1465:2185]
#C.drop(columns='Ilman_lampotila')
A['Sade'] = B['Sade']
A['Ilman_lampotila'] = C_ilma
#A['Ilman_lampotila'].isnull().values.any()

##### Double-check that the DataFrame 'A' has 4 columns named 'timehour', 'sumofhourlyavg', 'Sade' and 'Ilman_lampotila'

In [30]:
print(A.shape)
print(A.columns)
print(A.dtypes) # Double-check: the three model variables are float64 ('timehour' is object but is not used in the model), so there are no categorical variables to encode.

(720, 4)
Index(['timehour', 'sumofhourlyavg', 'Sade', 'Ilman_lampotila'], dtype='object')
timehour            object
sumofhourlyavg     float64
Sade               float64
Ilman_lampotila    float64
dtype: object


In [31]:
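# Testing workaround: overwrite the temperature column with a constant 18 degC.
# C itself had no nulls, but the assigned column did. The most likely cause is index alignment:
# C_ilma keeps the row labels 1465-2184 while A is labelled 0-719, so no labels match on assignment.
A['Ilman_lampotila'] = np.float64(18)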

In [32]:
A.dtypes

timehour            object
sumofhourlyavg     float64
Sade               float64
Ilman_lampotila    float64
dtype: object

In [33]:
A['Ilman_lampotila'].isnull().values.any() # confirm that no nulls remain after the constant fill

False
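
The nulls mentioned in the cells above are most likely an effect of index alignment: pandas matches a Series to a DataFrame by row label, not by position. The sketch below reproduces the effect on made-up data; `frame`, `temps` and the column names `misaligned` and `aligned` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 'frame' plays the role of A, 'temps' the role of C['Ilman_lampotila'].
frame = pd.DataFrame({"Sade": [0.0, 0.2, 0.0]})            # row labels 0, 1, 2
temps = pd.Series(np.arange(10, dtype="float64") + 10.0)   # row labels 0..9

sliced = temps.iloc[5:8]   # positional slice, but it keeps the labels 5, 6, 7

# Assignment aligns on labels: none of 5, 6, 7 exist in frame, so every value becomes NaN.
frame["misaligned"] = sliced

# Converting to a plain array drops the labels, so the assignment is positional.
frame["aligned"] = sliced.to_numpy()

print(frame)
```

Applied to this notebook, that would mean `A['Ilman_lampotila'] = C_ilma.to_numpy()`, assuming the 720 sliced rows really correspond one-to-one, in order, with the rows of `A`.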

##### Assigning 'sumofhourlyavg' as the dependent variable y
##### Assigning 'Sade' and 'Ilman_lampotila' as independent variables X (X is an n x 2 matrix here; it still needs weekday and hour columns to be complete)

In [34]:
X = A.loc[:, ['Sade','Ilman_lampotila']] # matrix of independent variables 'Sade' and 'Ilman_lampotila'
y = A['sumofhourlyavg'] # vector of the dependent variable 'sumofhourlyavg'
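
The weekday and hour columns that X still lacks could be derived from 'timehour'. This is only a sketch: the timestamp format in the sample is an assumption (the actual format in the CSV is not shown here), and `sample`, `ts` and `X_ext` are hypothetical names.

```python
import pandas as pd

# Hypothetical sample; the real 'timehour' format in the CSV may differ.
sample = pd.DataFrame({
    "timehour": ["2017-06-01 00:00", "2017-06-01 13:00", "2017-06-04 23:00"],
    "Sade": [0.0, 0.4, 0.0],
})

ts = pd.to_datetime(sample["timehour"])
sample["hour"] = ts.dt.hour            # 0..23
sample["weekday"] = ts.dt.dayofweek    # Monday=0 ... Sunday=6

# One-hot encode so the linear model does not treat the codes as ordered quantities.
X_ext = pd.get_dummies(sample[["Sade", "hour", "weekday"]], columns=["hour", "weekday"])
print(X_ext.columns.tolist())
```

One-hot encoding is one possible choice; it lets each hour and weekday get its own coefficient instead of forcing a single linear trend across the codes.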

##### Splitting the dataset 'A' into the Training set and Test set

In [35]:
from sklearn.model_selection import train_test_split # sklearn.cross_validation is deprecated and removed in newer scikit-learn versions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) # 144 observations (20%) go to the test set and 576 (80%) to the training set

# Feature scaling


##### Fitting Multiple Linear Regression to the Training set

In [36]:
from sklearn.linear_model import LinearRegression
# Creating a regressor object of LinearRegression class
regressor = LinearRegression()

# Applying fit-method to the training set
regressor.fit(X_train, y_train)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

##### Testing the performance of multiple linear regression model (predicting test set results)

In [37]:
y_pred = regressor.predict(X_test)

##### Checking how well the linear regression model fits the observations

In [38]:
R_2 = regressor.score(X_train, y_train) # note: this is R-squared on the training set, not on the test set
print(R_2)

0.015354948259910128


##### R-squared is a statistical measure of how close the data are to the fitted regression line. On the training data of a least-squares model with an intercept it ranges from 0 to 1, where 1 means a perfect fit and 0 means the model explains none of the variation in the response variable; on held-out data it can also be negative. Here the score is computed on the training set and comes out at about 0.015, so the model explains only about 1.5% of the variation. The hardcoded temperature is part of the reason: a constant column carries no information, so rainfall is effectively the only predictor. Rule of thumb: the larger the R-squared, the better the regression model fits the observations.
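
To make the definition concrete, R-squared can be computed by hand as 1 - SS_res / SS_tot. The sketch below uses synthetic data and plain NumPy least squares rather than the notebook's data or scikit-learn; `r_squared` and `fit_predict` are hypothetical helper names. It also illustrates that adding a constant column, like the hardcoded temperature, leaves the fit unchanged.

```python
import numpy as np

def r_squared(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_predict(X, y):
    """Ordinary least squares with an intercept; returns the fitted values."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef

rng = np.random.default_rng(0)
rain = rng.random(200)
y = 50.0 - 5.0 * rain + rng.normal(0.0, 10.0, size=200)   # synthetic, noisy response

const_temp = np.full(200, 18.0)                            # mimics the hardcoded 18

r2_rain_only = r_squared(y, fit_predict(rain.reshape(-1, 1), y))
r2_with_const = r_squared(y, fit_predict(np.column_stack([rain, const_temp]), y))

# A predictor that is worse than simply predicting the mean gives a negative score.
r2_bad = r_squared(y, np.full_like(y, y.mean() + 100.0))

print(r2_rain_only, r2_with_const, r2_bad)
```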

##### Exporting the prediction vector y_pred to a CSV file

In [39]:
y_pred_df = pd.DataFrame(y_pred)
y_pred_df.to_csv('test_predictions.csv')
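
The exported file contains only the predictions, numbered from 0, so they cannot be traced back to the original rows. One possible variant, sketched here on toy data, keeps the test-set row labels and the observed values next to the predictions. `y_test_demo`, `y_pred_demo` and the column names are illustrative, and an in-memory buffer stands in for the file path.

```python
import io
import numpy as np
import pandas as pd

# Toy stand-ins for y_test and y_pred from the cells above.
y_test_demo = pd.Series([12.0, 30.5, 8.25], index=[17, 402, 655], name="sumofhourlyavg")
y_pred_demo = np.array([15.0, 28.0, 10.0])

# Reuse the test-set index so every prediction stays tied to its original row.
out = pd.DataFrame({"y_true": y_test_demo, "y_pred": y_pred_demo}, index=y_test_demo.index)

buffer = io.StringIO()                 # stands in for a path such as 'test_predictions.csv'
out.to_csv(buffer, index_label="row")
print(buffer.getvalue())
```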