Build a regression model.

In [3]:
# import libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import sqlite3

In [4]:
con = sqlite3.connect('../data/ebikes.db') 
query = """
    SELECT *
    FROM points_of_interst"""
df = pd.read_sql_query(query, con)
con.close()
df

Unnamed: 0,station_id,name,distance,rating,free_bikes
0,7a19c49f486d7c0c02b3685d7b240448,La Taqueria Pinche Taco Shop,164.000000,8.4,8
1,7a19c49f486d7c0c02b3685d7b240448,Whole Foods,182.000000,8.6,8
2,7a19c49f486d7c0c02b3685d7b240448,Hokkaido Ramen Santouka,186.000000,7.8,8
3,7a19c49f486d7c0c02b3685d7b240448,Cactus Club Cafe Broadway + Ash,232.000000,7.5,8
4,7a19c49f486d7c0c02b3685d7b240448,Solly's Bagelry,283.000000,7.9,8
...,...,...,...,...,...
852,ee620d77724c8993b0d366e7ec655b64,Nicli Antica Pizzeria,72.869669,5.0,4
853,ee620d77724c8993b0d366e7ec655b64,Togo Sushi,61.784294,4.6,4
854,ee620d77724c8993b0d366e7ec655b64,Freshii,70.443227,5.6,4
855,ee620d77724c8993b0d366e7ec655b64,Bubble Waffle Cafe,1038.485055,0.0,4


Simple linear regression: Distance vs # of free bikes available

In [5]:
y = df['free_bikes']

In [14]:
x = df['distance']
x = sm.add_constant(x) # adding a constant

In [15]:
lin_reg = sm.OLS(y,x)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.027
Model:                            OLS   Adj. R-squared:                  0.026
Method:                 Least Squares   F-statistic:                     23.43
Date:                Thu, 16 May 2024   Prob (F-statistic):           1.54e-06
Time:                        02:50:06   Log-Likelihood:                -2791.7
No. Observations:                 857   AIC:                             5587.
Df Residuals:                     855   BIC:                             5597.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.1181      0.396     17.982      0.0

In [16]:
# free_bikes is a station-level value that is repeated on every POI row for that station, so averaging per station (one row per station) is more representative
df_group = df.groupby('station_id')[['free_bikes', 'distance']].mean().sort_values(by='distance')
df_group

Unnamed: 0_level_0,free_bikes,distance
station_id,Unnamed: 1_level_1,Unnamed: 2_level_1
b9baf685b7053de899bf3467f61c2781,16.0,9.832073
0c42f45e4a14957ad4a6d521d0ba8bc3,24.0,26.698440
7d231a181d21056e4ba80c4d9939fe92,6.0,27.523596
e9b37f2d9b7b2e2e3ade73f13acb69b2,3.0,28.430047
fbb1d30d7f30b049873f5be5688563d4,0.0,31.074777
...,...,...
ace20c241ee38643c9060f290b215b9d,8.0,754.000000
e1ff428dadc7c32141b9d89a7b56728a,10.0,755.855286
cadc004f0903ef45e898032143c0832f,6.0,788.250000
afacb2133f2462aba7748cc21b71c788,6.0,813.000000
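
The comment in the cell above assumes that `free_bikes` is identical on every row of a given station. A minimal sketch of how that assumption could be checked before aggregating, using a small made-up table (the `demo` frame and its values are hypothetical, not the notebook's data):

```python
import pandas as pd

# Hypothetical stand-in for the POI table: two stations, several POIs each
demo = pd.DataFrame({
    "station_id": ["a", "a", "a", "b", "b"],
    "distance": [100.0, 200.0, 300.0, 50.0, 150.0],
    "free_bikes": [8, 8, 8, 4, 4],
})

# Count distinct free_bikes values per station; 1 everywhere means the value is repeated
values_per_station = demo.groupby("station_id")["free_bikes"].nunique()
constant_within_station = bool((values_per_station == 1).all())

# Collapse to one row per station, as in the cell above
station_level = demo.groupby("station_id")[["free_bikes", "distance"]].mean()

print(constant_within_station)
print(station_level)
```

If the check fails on the real data, the mean of `free_bikes` would no longer be the station's actual count and the aggregation would need a second look.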


In [17]:
y = df_group['free_bikes']
x = df_group['distance']
x = sm.add_constant(x) # adding a constant
lin_reg = sm.OLS(y,x)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.007
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.078
Date:                Thu, 16 May 2024   Prob (F-statistic):              0.301
Time:                        02:50:14   Log-Likelihood:                -529.97
No. Observations:                 162   AIC:                             1064.
Df Residuals:                     160   BIC:                             1070.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.5748      1.086      6.972      0.0

Provide model output and an interpretation of the results. 

H0: POI distance has no significant effect on the number of ebikes available

H1: POI distance has a significant effect on the number of ebikes available

R squared: 0.007 (adjusted 0.000) in the station-level model, so the independent variable (POI distance) explains almost none of the variability in the dependent variable (free ebikes)

P value: the p-value is 0.301, which is much greater than 0.05, so we fail to reject the null hypothesis: the observed data are consistent with POI distance having no significant effect on the number of ebikes available. The row-level model above reports a much smaller p-value (1.54e-06), but it treats all 857 POI rows as independent even though free_bikes is repeated for every POI of the same station, so the station-level result is likely the more reliable one.

Simple linear regression: Rating vs # of free bikes available

In [18]:
y = df['free_bikes']
x = df['rating']
x = sm.add_constant(x) # adding a constant
lin_reg = sm.OLS(y,x)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.2349
Date:                Thu, 16 May 2024   Prob (F-statistic):              0.628
Time:                        02:50:19   Log-Likelihood:                -2803.2
No. Observations:                 857   AIC:                             5610.
Df Residuals:                     855   BIC:                             5620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.2669      0.974      8.489      0.0

H0: There is no significant relationship between POI rating and the number of ebikes available

H1: There is a significant relationship between POI rating and the number of ebikes available

R squared: 0.000, so the independent variable (POI rating) doesn't explain the variability in the dependent variable (free ebikes)

P value: the p-value is 0.628, which is greater than 0.05, so we fail to reject the null hypothesis: the observed data are consistent with POI rating having no significant effect on the number of ebikes available

Simple linear regression: distance vs rating

In [19]:
y = df['distance']
x = df['rating']
x = sm.add_constant(x) # adding a constant
lin_reg = sm.OLS(y,x)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:               distance   R-squared:                       0.007
Model:                            OLS   Adj. R-squared:                  0.006
Method:                 Least Squares   F-statistic:                     6.017
Date:                Thu, 16 May 2024   Prob (F-statistic):             0.0144
Time:                        04:58:16   Log-Likelihood:                -5887.9
No. Observations:                 857   AIC:                         1.178e+04
Df Residuals:                     855   BIC:                         1.179e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        276.3689     35.620      7.759      0.0

Multiple linear regression

In [6]:
y = df['free_bikes'] # reset the target; y was last set to df['distance'] in the cell above
x = df[['distance','rating']]
x = sm.add_constant(x) # adding a constant
lin_reg = sm.OLS(y,x)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.027
Model:                            OLS   Adj. R-squared:                  0.024
Method:                 Least Squares   F-statistic:                     11.71
Date:                Thu, 16 May 2024   Prob (F-statistic):           9.65e-06
Time:                        05:56:16   Log-Likelihood:                -2791.7
No. Observations:                 857   AIC:                             5589.
Df Residuals:                     854   BIC:                             5604.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.0389      0.995      7.077      0.0

# Stretch

How can you turn the regression model into a classification model?