Build a regression model.

In [5]:
import numpy as np
import pandas as pd
from sklearn import linear_model

import statsmodels.api as sm

In [10]:
# Load data from previous activities (reading the CSV files rather than the database, since it takes less code)
yelp_pois_df = pd.read_csv("../data/Yelp_POIs.csv")
stations_df = pd.read_csv("../data/citybike_stations.csv")
merged_df = pd.merge(yelp_pois_df, stations_df, on='station_id')

merged_df.head(5)

Unnamed: 0,station_id,distance_from_station,POI_name,rating,review_count,price,address,category_names,station_name,free_bikes,empty_slots,latitude,longitude
0,af1d0f25cbc75377878349fde4d86133,206.25878,BrewDog DogHouse,4.5,48,££,99 Hutcheson Street,"['Pubs', 'Beer, Wine & Spirits', 'Barbeque']",Merchant Square - ELECTRIC,7,2.0,55.858167,-4.245483
1,af1d0f25cbc75377878349fde4d86133,85.938424,Café Gandolfi,4.5,67,££,64 Albion Street,"['Seafood', 'British']",Merchant Square - ELECTRIC,7,2.0,55.858167,-4.245483
2,af1d0f25cbc75377878349fde4d86133,48.999836,The Wilson Street Pantry,4.5,88,£,6 Wilson Street,"['Cafes', 'Coffee & Tea', 'Breakfast & Brunch']",Merchant Square - ELECTRIC,7,2.0,55.858167,-4.245483
3,af1d0f25cbc75377878349fde4d86133,54.506141,Blackfriars,4.0,59,££,36 Bell Street,['Pubs'],Merchant Square - ELECTRIC,7,2.0,55.858167,-4.245483
4,af1d0f25cbc75377878349fde4d86133,176.960268,Babbity Bowster,4.0,33,££,16-18 Blackfriars Street,['Pubs'],Merchant Square - ELECTRIC,7,2.0,55.858167,-4.245483


In [23]:
x = merged_df[['distance_from_station']]
y = merged_df['rating']

In [24]:
# Add a constant term to the independent variables
X = sm.add_constant(x)

# Fit the linear regression model
model = sm.OLS(y, X).fit()
print_model = model.summary()

Provide model output and an interpretation of the results. 

In [25]:
# Print the model summary
print(print_model)

                            OLS Regression Results                            
Dep. Variable:                 rating   R-squared:                       0.028
Model:                            OLS   Adj. R-squared:                  0.027
Method:                 Least Squares   F-statistic:                     19.16
Date:                Sun, 03 Sep 2023   Prob (F-statistic):           1.40e-05
Time:                        20:25:10   Log-Likelihood:                -831.83
No. Observations:                 659   AIC:                             1668.
Df Residuals:                     657   BIC:                             1677.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                     4.04

In the above model I attempted to relate `distance_from_station` to `rating` to see whether there was any tangible connection between the two. According to the regression model, the coefficient for `distance_from_station` is -0.0004, which indicates a slight negative relationship between distance and rating (i.e., as the distance from the station increases, ratings tend to decrease).

However, the R-squared value is very low (0.028), meaning that distance explains only about 2.8% of the variance in ratings. The relationship is statistically significant (Prob (F-statistic) = 1.40e-05), but distance from the station is probably not the primary factor contributing to a venue's rating.
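
To put the size of the effect in perspective, the short sketch below applies the slope of -0.0004 and the R-squared of 0.028 reported above to a few distance differences. The helper `rating_change` is a hypothetical name introduced here, and the distance unit is whatever `distance_from_station` is stored in (possibly meters, which the data above does not state).

```python
# Values taken from the regression output and interpretation above
slope = -0.0004    # change in rating per one unit of distance_from_station
r_squared = 0.028  # share of rating variance explained by distance


def rating_change(distance_delta, slope=slope):
    """Predicted change in rating for a given change in distance."""
    return slope * distance_delta


# Even large distance differences move the predicted rating very little
for delta in (100, 500, 1000):
    print(f"{delta:>5} units farther -> {rating_change(delta):+.2f} rating points")

# Share of the variance in rating that distance does not account for
unexplained = 1 - r_squared
print(f"Variance in rating left unexplained: {unexplained:.1%}")
```

Under these numbers, a venue 1000 distance units farther from a station is predicted to be rated only about 0.4 points lower, and roughly 97% of the variation in ratings is left to other factors.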

# Stretch

How can you turn the regression model into a classification model?
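
One possible approach, sketched here rather than given as a definitive answer, is to replace the continuous `rating` target with a binary label (for example, "highly rated" or not) and fit a logistic regression instead of OLS. The threshold of 4.5, the column name `is_highly_rated`, and the synthetic `demo_df` below are all assumptions made for illustration. With the real data, `merged_df` would take the place of `demo_df`.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic stand-in for merged_df (hypothetical values, not the real data)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "distance_from_station": rng.uniform(0, 1000, size=200),
    "rating": rng.choice([3.0, 3.5, 4.0, 4.5, 5.0], size=200),
})

# Assumed cutoff: ratings at or above this value count as "highly rated"
HIGH_RATING_THRESHOLD = 4.5
demo_df["is_highly_rated"] = (
    demo_df["rating"] >= HIGH_RATING_THRESHOLD
).astype(int)

X_demo = demo_df[["distance_from_station"]]
y_demo = demo_df["is_highly_rated"]

# Logistic regression models the probability of the positive class
clf = linear_model.LogisticRegression(max_iter=1000)
clf.fit(X_demo, y_demo)

predicted_labels = clf.predict(X_demo)            # 0 or 1 per venue
predicted_probs = clf.predict_proba(X_demo)[:, 1]  # P(highly rated)

print("Training accuracy:", clf.score(X_demo, y_demo))
```

The choice of threshold determines how balanced the two classes are, so it is worth checking `y_demo.value_counts()` before judging the accuracy figure.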