Build a regression model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import statsmodels.api as sm

In [2]:
stations_venues_df = pd.read_csv("stations_venues.csv")
station_categories_df = pd.read_csv("station_categories.csv")

In [3]:
stations_venues_df.head()

Unnamed: 0,station_name,station_index,station_latitude,station_longitude,total_slots,venue_name,venue_id,venue_categories_str,venue_latitude,venue_longitude,venue_address,distance_from_station,venue_category_group
0,Chilco & Barclay,0,49.291909,-123.140713,18,Blenz Coffee,4aa97278f964a520b25320e3,Coffee Shop,49.290251,-123.13747,935 Denman St,298,Restaurants & Bars
1,Chilco & Barclay,0,49.291909,-123.140713,18,Haro Street,4d5474a0cc65a1434d2f425e,Neighborhood,49.29121,-123.137897,,218,Other
2,Chilco & Barclay,0,49.291909,-123.140713,18,RSVP Finish Line,520ff63111d2fbdd436c4aaf,Plaza,49.289984,-123.141443,,220,Other
3,Chilco & Barclay,0,49.291909,-123.140713,18,Ted and Mary Greig Rhododendron Garden,4bd4c2b3637ba593fd68f570,Garden,49.293592,-123.142806,Stanley Park,240,Other
4,Chilco & Barclay,0,49.291909,-123.140713,18,Jungle Room,6455be524d616826ab184696,Bar,49.28999,-123.138117,961 Denman St. P,284,Restaurants & Bars


In [4]:
station_categories_df.head()

Unnamed: 0,station_index,distance_from_station,total_slots,total_venues,Banks & Shops,Landmarks & Museums,Other,Restaurants & Bars
0,0,252.0,18,5,0,0,3,2
1,1,170.384615,14,13,3,0,3,7
2,2,200.818182,14,33,12,1,1,19
3,3,132.94,0,50,8,3,11,28
4,4,214.875,14,8,2,1,4,1


In [5]:
# Define X and y
X = station_categories_df['distance_from_station']
y = station_categories_df['total_slots']

# Add constant for intercept
X_const = sm.add_constant(X)

# Fit regression model
model = sm.OLS(y, X_const).fit()

# Print summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:            total_slots   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     2.953
Date:                Mon, 28 Jul 2025   Prob (F-statistic):             0.0869
Time:                        11:11:18   Log-Likelihood:                -841.16
No. Observations:                 261   AIC:                             1686.
Df Residuals:                     259   BIC:                             1693.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                    21.68

### Simple Regression Model Interpretation

**Model: total_slots = constant + β₁ × distance_from_station**

#### Key Findings:

1. **R-squared = 0.011**: Only 1.1% of the variance in total bike slots is explained by `distance_from_station` (the mean distance of nearby venues from each station). This indicates a very weak relationship.

2. **Coefficient for distance_from_station = -0.0022**: 
   - For every 1-meter increase in distance from station, bike slots decrease by 0.0022 slots on average
   - However, this coefficient is **not statistically significant** at the 5% level (p-value = 0.087 > 0.05). With a single predictor, the slope's p-value equals the F-test p-value.

3. **Statistical Significance**: 
   - The F-statistic p-value (0.087) indicates the overall model is not statistically significant at the 5% level
   - We cannot reject the null hypothesis that distance has no effect on bike slots

4. **Intercept = 21.68**: When the mean venue distance is 0, the predicted number of bike slots is approximately 22. No station in the rows shown has a distance near 0, so this value is an extrapolation.

#### Conclusion:
Distance from station alone is **not a good predictor** of the number of bike slots. The relationship is weak and not statistically significant.

In [8]:
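# Prepare predictors: exclude 'station_index', 'total_slots', and 'Other'.
# 'Other' is dropped because total_venues equals the sum of the four category
# counts in the rows shown above, so keeping all of them would be perfectly collinear.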
x = station_categories_df.drop(columns=['station_index', 'total_slots', 'Other'])

# Add constant for intercept
x_const = sm.add_constant(x)

# Fit regression model
model_multi = sm.OLS(station_categories_df['total_slots'], x_const).fit()

# Print summary
print(model_multi.summary())

                            OLS Regression Results                            
Dep. Variable:            total_slots   R-squared:                       0.051
Model:                            OLS   Adj. R-squared:                  0.032
Method:                 Least Squares   F-statistic:                     2.728
Date:                Mon, 28 Jul 2025   Prob (F-statistic):             0.0202
Time:                        11:16:28   Log-Likelihood:                -835.84
No. Observations:                 261   AIC:                             1684.
Df Residuals:                     255   BIC:                             1705.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                    18.81
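
The two summaries printed so far can be compared with a little arithmetic. Adjusted R-squared and the F-statistic both follow from R-squared, the number of observations, and the number of predictors. The sketch below recomputes them from the rounded values in the printed output (0.011 with 1 predictor, 0.051 with 5 predictors, 261 observations). The helper names are hypothetical, and because the inputs are rounded to three decimals, the F-statistics match the printed ones only approximately.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalises R-squared for each extra predictor."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic of an OLS model with an intercept."""
    return (r2 / n_predictors) / ((1 - r2) / (n_obs - n_predictors - 1))


n_obs = 261
# (rounded R-squared, number of predictors) as printed in the two summaries
reported = {"simple": (0.011, 1), "multiple": (0.051, 5)}

for name, (r2, k) in reported.items():
    print(
        f"{name}: adjusted R2 = {adjusted_r2(r2, n_obs, k):.3f}, "
        f"F = {f_statistic(r2, n_obs, k):.2f}"
    )
```

The adjusted values round to 0.007 and 0.032, matching the summaries. Even with five predictors, the model accounts for only about 3% of the variance after the penalty for model size.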

In [7]:
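# Perform backward elimination for the multiple regression model:
# refit, drop the predictor with the largest p-value, and repeat until
# every remaining predictor has p <= 0.05 (sm is already imported in the first cell)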

# Start with all predictors
predictors_be = x.copy()
predictors_be_const = sm.add_constant(predictors_be)
y_be = station_categories_df['total_slots']

# Backward elimination loop
while True:
    model_be = sm.OLS(y_be, predictors_be_const).fit()
    pvalues = model_be.pvalues.drop('const')
    max_pval = pvalues.max()
    if max_pval > 0.05:
        worst_feature = pvalues.idxmax()
        predictors_be = predictors_be.drop(columns=[worst_feature])
        predictors_be_const = sm.add_constant(predictors_be)
    else:
        break

print(model_be.summary())

                            OLS Regression Results                            
Dep. Variable:            total_slots   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                  0.034
Method:                 Least Squares   F-statistic:                     10.22
Date:                Mon, 28 Jul 2025   Prob (F-statistic):            0.00156
Time:                        11:11:18   Log-Likelihood:                -837.59
No. Observations:                 261   AIC:                             1679.
Df Residuals:                     259   BIC:                             1686.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         17.1892      0.660     26.046      0.0

Provide model output and an interpretation of the results. 

# Stretch

How can you turn the regression model into a classification model?