In [1]:
##Import our stuff
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm


In [2]:
df=pd.read_csv('df_final.csv')

Build a regression model.

In [8]:
df.head()

Unnamed: 0,station_id,num_bikes,b_pop,b_rating,num_bars,c_pop,c_rating,num_cafes
0,7a19c49f486d7c0c02b3685d7b240448,21,0.578193,7.35,5.0,0.810487,7.56,25.0
1,32603a87cfca71d0f7dfa3513bad69d5,9,0.790734,6.914286,50.0,0.793092,7.081818,24.0
2,6d42fa40360f9a6b2bf641c7b8bb2862,13,0.790377,6.705882,39.0,0.89179,7.244828,50.0
3,66f873d641d448bd1572ab086665a458,2,0.578193,7.35,4.0,0.807795,7.7125,18.0
4,485d4d24c803cfde829ab89699fed833,9,0.744556,7.366667,7.0,0.788617,6.836364,26.0


In [3]:
## First we set our dependent and independent variables
y = df['num_bikes']
X = df.drop(columns = ['station_id', 'num_bikes', 'num_bars','b_pop','b_rating','c_rating','c_pop'])
## Add constant to our X so our model will have an intercept
X = sm.add_constant(X) 

In [4]:
## Instantiate and fit the model
model = sm.OLS(y, X) 
results = model.fit() 
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              num_bikes   R-squared:                       0.053
Model:                            OLS   Adj. R-squared:                  0.048
Method:                 Least Squares   F-statistic:                     10.90
Date:                Sun, 26 Mar 2023   Prob (F-statistic):            0.00114
Time:                        10:17:21   Log-Likelihood:                -610.43
No. Observations:                 198   AIC:                             1225.
Df Residuals:                     196   BIC:                             1231.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.0796      0.638     11.099      0.0

####
Note: I arrived at this model by backward selection, removing independent variables one at a time based on p-value and keeping the model with the highest adjusted R-squared value. This process was done in a separate notebook so as not to clutter up this one. By adjusted R-squared, the second-best model is one with two independent variables, num_cafes and num_bars.


Provide model output and an interpretation of the results. 

In [5]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              num_bikes   R-squared:                       0.053
Model:                            OLS   Adj. R-squared:                  0.048
Method:                 Least Squares   F-statistic:                     10.90
Date:                Sun, 26 Mar 2023   Prob (F-statistic):            0.00114
Time:                        10:19:52   Log-Likelihood:                -610.43
No. Observations:                 198   AIC:                             1225.
Df Residuals:                     196   BIC:                             1231.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.0796      0.638     11.099      0.0

Right off the bat, let me just say that even though this is the best model we can get based on the data I have chosen to work with, the adjusted R-squared value of 0.048 is still very low. R-squared measures the proportion of the variance in the dependent variable that is explained by our independent variable(s), so this model accounts for only about 5% of the variation in num_bikes. Realistically speaking, this means the model will be poor at actually predicting our dependent variable. Perhaps the only real insight we can get is that none of our independent variables (the number, average rating and average popularity of bars and cafes within a 500m radius of a bike station) is a good predictor of how many bikes will be available at a particular bike station.
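
To make the R-squared and adjusted R-squared figures more concrete, the sketch below computes both by hand for a one-predictor least-squares fit. The data is synthetic and only meant to mimic a weak signal in heavy noise; the variable names are placeholders, not columns from df_final.csv.

```python
import numpy as np

# Synthetic stand-in data (NOT the project data): a weak linear signal in heavy noise.
rng = np.random.default_rng(1)
n = 198
x = rng.integers(0, 50, n).astype(float)
y = 7 + 0.05 * x + rng.normal(0, 5, n)

# Ordinary least squares fit of y = intercept + slope * x.
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot

p = 1                                  # number of predictors, excluding the intercept
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R-squared: {r2:.3f}, adjusted R-squared: {adj_r2:.3f}")
```

Plugging the values reported in the summary into the same formula (R-squared of 0.053, 198 observations, one predictor) gives 1 - 0.947 * 197 / 196, which is roughly 0.048 and matches the adjusted R-squared in the output.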

That being said, for the sake of argument, if this model had a higher R-squared value (for example, a value of 0.7), we could say that the number of bikes available at a specific coordinate can be predicted from the number of cafes within a 500m radius of that coordinate. We have a positive coefficient for num_cafes (albeit again a very small value), which means that the more cafes there are in a given area, the more bikes we would expect to find in that area.

# Stretch

How can you turn the regression model into a classification model?

For this project, we have a regression model because our dependent variable (the number of bikes at a bike station) is treated as a numeric variable. To turn our model into a classification model, our dependent variable would have to be categorical. So instead of using the count of bikes directly as our dependent variable, we could bin it into categories. For example, we could label the number of bikes as "Low", "Medium" or "High" based on percentiles. From there we can build a classification model such as a multinomial logistic regression model.