Build a regression model.

In [None]:
import pandas as pd
import statsmodels.api as sm

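# Load the cleaned dataset
df = pd.read_csv('data/cleaned_data.csv')

# Predictors: distance and rating; target: number of free bikes
X = df[['Distance (m)', 'Rating']].values
y = df['free_bikes'].values

# statsmodels does not add an intercept automatically
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

print(model.summary())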


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.147
Model:                            OLS   Adj. R-squared:                 -0.024
Method:                 Least Squares   F-statistic:                    0.8615
Date:                Wed, 20 Nov 2024   Prob (F-statistic):              0.452
Time:                        11:18:14   Log-Likelihood:                -37.667
No. Observations:                  13   AIC:                             81.33
Df Residuals:                      10   BIC:                             83.03
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -13.0284     13.862     -0.940      0.3



Provide model output and an interpretation of the results. 

The low R-squared value (0.147) indicates that the model explains only about 14.7% of the variability in the number of free bikes, and the negative adjusted R-squared (-0.024) shows that, after penalizing for the two predictors, the model fits no better than an intercept-only model. Neither distance nor rating is a statistically significant predictor of the number of free bikes, as indicated by their high p-values. The overall F-test (Prob (F-statistic) = 0.452, well above 0.05) shows that the model as a whole is not statistically significant. With only 13 observations, more data is needed before any relationship can be estimated reliably.
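
As a quick sanity check on the summary, the adjusted R-squared and the F-statistic can be recomputed from the reported R-squared, the number of observations and the number of predictors. This is a minimal sketch using the standard textbook formulas and the rounded values printed in the summary, so it agrees with the table only up to rounding. The variable names are illustrative.

```python
# Values copied from the OLS summary (rounded as printed)
r_squared = 0.147
n_obs = 13   # No. Observations
k = 2        # Df Model (predictors, excluding the constant)

df_resid = n_obs - k - 1  # 10, matches "Df Residuals"

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Overall F-statistic: explained variation per predictor divided by
# unexplained variation per residual degree of freedom
f_stat = (r_squared / k) / ((1 - r_squared) / df_resid)

print(f"Adjusted R-squared: {adj_r_squared:.3f}")
print(f"F-statistic: {f_stat:.2f}")
```

The adjusted value comes out at about -0.024, matching the summary: with two predictors and only 13 observations, the penalty for model size outweighs the small share of variance explained.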

# Stretch

How can you turn the regression model into a classification model?
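
One common approach is to bin the continuous target into ordered categories and then fit a classifier on those categories. The sketch below illustrates the left-closed binning rule that `pd.cut(..., right=False)` applies with edges 0, 5 and 15, using only the standard library. The `categorize` helper and the sample counts are hypothetical and exist only for illustration.

```python
from bisect import bisect_right

# Bin edges and labels, as in the notebook cell
bins = [0, 5, 15, float('inf')]
labels = ['Low', 'Medium', 'High']

def categorize(free_bikes):
    """Return the label of the left-closed bin [edge_i, edge_{i+1}), or None."""
    idx = bisect_right(bins, free_bikes) - 1
    if 0 <= idx < len(labels):
        return labels[idx]
    return None  # value falls outside every bin

# Hypothetical counts, chosen to hit the bin boundaries
for count in [0, 4, 5, 14, 15, 30]:
    print(count, categorize(count))
```

Because the bins are closed on the left, a station with exactly 5 free bikes is labelled Medium and one with exactly 15 is labelled High.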

In [6]:
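# Left-closed bins: [0, 5) -> Low, [5, 15) -> Medium, [15, inf) -> High
bins = [0, 5, 15, float('inf')]
labels = ['Low', 'Medium', 'High']

# pd.cut already returns a categorical column, so no extra conversion is needed
df['free_bikes_category'] = pd.cut(df['free_bikes'], bins=bins, labels=labels, right=False)

X = df[['Distance (m)', 'Rating']].values
# Integer codes for the categories: Low = 0, Medium = 1, High = 2
y = df['free_bikes_category'].cat.codes

# Add an intercept, then fit a multinomial logistic regression
X = sm.add_constant(X)
model = sm.MNLogit(y, X).fit()

print(model.summary())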

         Current function value: 0.231396
         Iterations: 35
                          MNLogit Regression Results                          
Dep. Variable:                      y   No. Observations:                   13
Model:                        MNLogit   Df Residuals:                        7
Method:                           MLE   Df Model:                            4
Date:                Wed, 20 Nov 2024   Pseudo R-squ.:                  0.6632
Time:                        11:25:04   Log-Likelihood:                -3.0081
converged:                      False   LL-Null:                       -8.9322
Covariance Type:            nonrobust   LLR p-value:                   0.01852
       y=1       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -9.6707     15.453     -0.626      0.531     -39.958      20.616
x1             0.2266      0.207      1.097      0.273      -0.17
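
Two details of this summary are worth checking before reading the coefficients. First, `converged: False` means the optimizer stopped after 35 iterations without converging, so the coefficients and standard errors should be treated with caution. Second, the fit statistics can be recomputed from the reported log-likelihoods. The sketch below does this with the rounded values printed in the summary, using McFadden's pseudo R-squared and a likelihood-ratio test. The closed-form chi-square tail used here holds only for 4 degrees of freedom, and the variable names are illustrative.

```python
import math

# Values copied from the MNLogit summary (rounded as printed)
ll_model = -3.0081   # Log-Likelihood
ll_null = -8.9322    # LL-Null (intercept-only model)

# McFadden's pseudo R-squared: improvement in log-likelihood over the null model
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic, compared against a chi-square with Df Model = 4
llr = 2 * (ll_model - ll_null)

# Chi-square survival function in closed form, valid for 4 degrees of freedom only
llr_p_value = math.exp(-llr / 2) * (1 + llr / 2)

print(f"Pseudo R-squared: {pseudo_r2:.4f}")
print(f"LLR p-value: {llr_p_value:.5f}")
```

Both values agree with the summary up to rounding. Since the model estimates six parameters from 13 observations (Df Residuals = 7) and did not converge, the seemingly strong pseudo R-squared is fragile evidence rather than a sign of a reliable classifier.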

