Build a regression model.

In [None]:
import numpy as np
import pandas as pd
from sklearn import linear_model, datasets
import statsmodels.api as sm

In [4]:
# Load the CSV file with the bike station and point-of-interest data
Bikes_Toronto = pd.read_csv('../data/Bikes_Toronto.csv', encoding="unicode_escape")
Bikes_Toronto

Unnamed: 0,id,name,lat_lon,free_bikes,poi_name,dist2station,rating,reviews
0,7303,Queen St E / Woodward Ave,"43.665269,-79.319796",5,Jaclyn's,172.375134,4.5,15
1,7303,Queen St E / Woodward Ave,"43.665269,-79.319796",5,Casa Di Giorgios,408.715724,4.0,74
2,7303,Queen St E / Woodward Ave,"43.665269,-79.319796",5,Lake Inez,794.661955,4.5,94
3,7303,Queen St E / Woodward Ave,"43.665269,-79.319796",5,Mattachioni,815.148564,4.5,5
4,7303,Queen St E / Woodward Ave,"43.665269,-79.319796",5,Chino Locos Original,467.122546,4.0,190
...,...,...,...,...,...,...,...,...
30450,7681,25 Booth Ave,"43.6544839,-79.34105699999999",7,Dave's Hot Chicken,1026.824835,4.0,21
30451,7681,25 Booth Ave,"43.6544839,-79.34105699999999",7,My Roti Place,720.231561,4.0,13
30452,7681,25 Booth Ave,"43.6544839,-79.34105699999999",7,Riverside Burgers,914.845418,4.5,14
30453,7681,25 Booth Ave,"43.6544839,-79.34105699999999",7,Hanley's Nashville Hot Chicken,762.212241,3.5,5


#### Building the model using the Backward Selection method

1st step of the method: *All independent variables are included: 'dist2station', 'rating', and 'reviews'*

In [9]:
# y is the dependent variable and x the independent variables
y = Bikes_Toronto['free_bikes']
x = Bikes_Toronto[['dist2station', 'rating', 'reviews']]

# Regression model, step 1: add an intercept column, then fit OLS
x = sm.add_constant(x)
model_stp1 = sm.OLS(y, x).fit()
print(model_stp1.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.007
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     69.31
Date:                Mon, 27 Mar 2023   Prob (F-statistic):           1.15e-44
Time:                        12:46:28   Log-Likelihood:                -98090.
No. Observations:               30455   AIC:                         1.962e+05
Df Residuals:                   30451   BIC:                         1.962e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            8.0749      0.158     51.238   

The R-squared value of 0.007 is very low: the model explains less than 1% of the variance in 'free_bikes', so it is not a good fit for the data. Regarding the p-values, 'dist2station' (0.265) and 'rating' (0.186) are both above the conventional 0.05 threshold, so neither is statistically significant. Because 'dist2station' has the higher p-value, it is dropped in the next step.

2nd step of the method: *The 'dist2station' variable is dropped because it is not statistically significant. The 'rating' and 'reviews' independent variables are included.*

In [10]:
# y is the dependent variable and x the independent variables
y = Bikes_Toronto['free_bikes']
x = Bikes_Toronto[['rating', 'reviews']]

# Regression model, step 2: add an intercept column, then fit OLS
x = sm.add_constant(x)
model_stp2 = sm.OLS(y, x).fit()
print(model_stp2.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.007
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     103.3
Date:                Mon, 27 Mar 2023   Prob (F-statistic):           1.86e-45
Time:                        12:49:49   Log-Likelihood:                -98091.
No. Observations:               30455   AIC:                         1.962e+05
Df Residuals:                   30452   BIC:                         1.962e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.1458      0.144     56.476      0.0

Again, in this step (model_stp2), the R-squared value (0.007) remains very low. Additionally, the independent variable 'rating', with a p-value of 0.191, is not statistically significant, so it is the next candidate to be dropped in step 3.

3rd step of the method: *The 'rating' variable is dropped because it is not statistically significant. 'reviews' is the only independent variable in the model at this step.*

In [11]:
# y is the dependent variable and x the independent variable
y = Bikes_Toronto['free_bikes']
x = Bikes_Toronto[['reviews']]

# Regression model, step 3: add an intercept column, then fit OLS
x = sm.add_constant(x)
model_stp3 = sm.OLS(y, x).fit()
print(model_stp3.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.007
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     205.0
Date:                Mon, 27 Mar 2023   Prob (F-statistic):           2.42e-46
Time:                        12:56:27   Log-Likelihood:                -98091.
No. Observations:               30455   AIC:                         1.962e+05
Df Residuals:                   30453   BIC:                         1.962e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.9640      0.038    209.305      0.0

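In this last step, the model still exhibits a very low R-squared value of only 0.007. Although the p-value of the 'reviews' variable is reported as zero, the model explains less than 1% of the variance in 'free_bikes'. With more than 30,000 observations, even a very weak relationship can be statistically significant. It can therefore be concluded that this model is NOT a good fit for the data (in other words, it does not explain the data).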
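A highly significant F-statistic and a tiny R-squared are consistent with each other. For OLS with an intercept, R² = k·F / (k·F + df_resid), where k is the number of predictors. The short check below plugs in the F-statistics and degrees of freedom printed in the three summaries above. The function name `r2_from_f` is a hypothetical helper introduced only for this illustration.

```python
def r2_from_f(f_stat, df_model, df_resid):
    """Recover R-squared from the overall F-statistic of an OLS fit
    that includes an intercept: R^2 = k*F / (k*F + df_resid)."""
    return df_model * f_stat / (df_model * f_stat + df_resid)


# (F-statistic, Df Model, Df Residuals) as printed in each summary above
steps = {
    "model_stp1": (69.31, 3, 30451),
    "model_stp2": (103.3, 2, 30452),
    "model_stp3": (205.0, 1, 30453),
}

r2_values = {name: r2_from_f(*vals) for name, vals in steps.items()}
for name, r2 in r2_values.items():
    print(f"{name}: R-squared = {r2:.4f}")
```

All three values come out just under 0.007, matching the summaries. The large F-statistics reflect the sample size rather than explanatory power, so dropping variables did not change how little of the variance the model explains.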