In [1]:
# imports
import numpy as np
import pandas as pd
import statsmodels.api as sm

Build a regression model.

In [2]:
CombinedPointData = pd.read_csv('../data/CombinedPointData.csv')
CombinedPointData

Unnamed: 0,Data Point,Foursquare Name,Foursquare Popularity,Foursquare Price,Matched Name,Yelp Name,Yelp Price,Search,Category,Rounded Latitude,Rounded Longitude,Rating,Free Bikes
0,Point of Interest,10 Hammas,0.000000,Not Listed,10 Hammas,10 Hammas,Not Listed,15007,Dentist,60.211,25.080,8.000000,0.0
1,Point of Interest,,,,,100 Dogs,Affordable,cocktailbars,Cocktail Bars,60.167,24.931,8.600000,0.0
2,Point of Interest,,,,,3 Amigos,Not Listed,mexican,Mexican,60.219,24.813,6.000000,0.0
3,Point of Interest,3Amigos,0.970977,Most Affordable,3 Amigos,3 Amigos,Most Affordable,13305,Burrito Restaurant,60.293,25.037,4.950000,0.0
4,Point of Interest,60° Bar & Brewery,0.995157,Not Listed,60° Bar & Brewery,,,13006,Beer Bar,60.316,24.972,5.100000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6286,Bike Spot,,,,,,,,,60.319,24.850,7.139481,1.0
6287,Bike Spot,,,,,,,,,60.276,24.846,7.139481,5.0
6288,Bike Spot,,,,,,,,,60.318,24.831,7.139481,3.0
6289,Bike Spot,,,,,,,,,60.251,24.987,7.139481,27.0


In [3]:
# X holds the features; y is the target.
X = CombinedPointData[['Rounded Latitude','Rounded Longitude','Rating']]
y = CombinedPointData['Free Bikes']

In [4]:
X

Unnamed: 0,Rounded Latitude,Rounded Longitude,Rating
0,60.211,25.080,8.000000
1,60.167,24.931,8.600000
2,60.219,24.813,6.000000
3,60.293,25.037,4.950000
4,60.316,24.972,5.100000
...,...,...,...
6286,60.319,24.850,7.139481
6287,60.276,24.846,7.139481
6288,60.318,24.831,7.139481
6289,60.251,24.987,7.139481


In [5]:
y

0        0.0
1        0.0
2        0.0
3        0.0
4        0.0
        ... 
6286     1.0
6287     5.0
6288     3.0
6289    27.0
6290     5.0
Name: Free Bikes, Length: 6291, dtype: float64

In [6]:
# Checking for correlation: the closer the absolute value is to 1, the stronger the correlation. No multicollinearity issues among the features here.
CorrelationMatrix = CombinedPointData.corr(numeric_only=True)
CorrelationMatrix

Unnamed: 0,Foursquare Popularity,Rounded Latitude,Rounded Longitude,Rating,Free Bikes
Foursquare Popularity,1.0,-0.136776,-0.016235,-0.044245,
Rounded Latitude,-0.136776,1.0,0.290084,-0.133262,-0.001672
Rounded Longitude,-0.016235,0.290084,1.0,-0.004775,-0.003081
Rating,-0.044245,-0.133262,-0.004775,1.0,5.723884e-16
Free Bikes,,-0.001672,-0.003081,5.723884e-16,1.0
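
The collinearity check described in the comment can also be done programmatically rather than by eye. The following is a minimal sketch that copies the three feature-to-feature correlations from the matrix above and reports the strongest pair. The `THRESHOLD` value of 0.8 is an assumed rule of thumb, not something the notebook specifies.

```python
from itertools import combinations

import pandas as pd

features = ["Rounded Latitude", "Rounded Longitude", "Rating"]

# Feature-to-feature correlations copied from the matrix above.
feature_corr = pd.DataFrame(
    [
        [1.0, 0.290084, -0.133262],
        [0.290084, 1.0, -0.004775],
        [-0.133262, -0.004775, 1.0],
    ],
    index=features,
    columns=features,
)

# Visit each unordered pair of features exactly once.
pairs = {(a, b): feature_corr.loc[a, b] for a, b in combinations(features, 2)}

# Assumed cutoff for "too correlated"; adjust to taste.
THRESHOLD = 0.8
strongest_pair = max(pairs, key=lambda p: abs(pairs[p]))
flagged = {p: r for p, r in pairs.items() if abs(r) > THRESHOLD}

print("Strongest pair:", strongest_pair, pairs[strongest_pair])
print("Pairs above threshold:", flagged)
```

With these values the strongest pair is latitude and longitude, at about 0.29, and nothing crosses the assumed cutoff.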


In [7]:
X = sm.add_constant(X) # adding a constant
LinReg = sm.OLS(y,X)

In [8]:
Model = LinReg.fit()

Provide model output and an interpretation of the results. 

In [9]:
ModelOutput = Model.summary()
print(ModelOutput)

                            OLS Regression Results                            
Dep. Variable:             Free Bikes   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                   0.02132
Date:                Wed, 07 Aug 2024   Prob (F-statistic):              0.996
Time:                        19:54:22   Log-Likelihood:                -17257.
No. Observations:                6291   AIC:                         3.452e+04
Df Residuals:                    6287   BIC:                         3.455e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 7.7523     63.49

The adjusted R-squared of -0.000 shows that the model explains essentially none of the variance in Free Bikes.
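
As a sanity check on why the adjusted R-squared prints with a minus sign, both values can be recomputed from numbers in the summary. This sketch assumes the standard identities for an OLS model with an intercept; the variable names are illustrative.

```python
# Values read from the first OLS summary above.
n_obs = 6291       # No. Observations
df_model = 3       # Df Model (predictors, excluding the constant)
df_resid = 6287    # Df Residuals
f_stat = 0.02132   # F-statistic

# For a model with an intercept, F = (R^2 / df_model) / ((1 - R^2) / df_resid),
# which rearranges to the expression below.
r_squared = (f_stat * df_model) / (f_stat * df_model + df_resid)

# Adjusted R^2 penalises R^2 for the number of predictors used.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(f"R-squared:      {r_squared:.6f}")
print(f"Adj. R-squared: {adj_r_squared:.6f}")
```

R-squared comes out as a tiny positive number, and the penalty for three predictors pushes the adjusted value slightly below zero, which is consistent with the -0.000 shown in the summary.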

The p-values are all above the 0.05 threshold, so we cannot reject the null hypothesis that each coefficient is zero. Any apparent relationship between a feature (Rounded Latitude, Rounded Longitude, or Rating) and the number of free bikes is consistent with random variation rather than a real association. The overall F-test agrees, with Prob (F-statistic) = 0.996.

The coefficients suggest that all of the features have a weak negative association with the number of free bikes (as one goes up, the other goes down, and vice versa). But given the R-squared and p-values, the features likely have no real effect on the target.


In [17]:
# adjusting the model by removing the feature with the highest p-value
xAdjusted = X.drop(columns=['Rating'])
xAdjusted = sm.add_constant(xAdjusted) # no-op here: X already carries the 'const' column added in In [7]
LinRegAdj = sm.OLS(y,xAdjusted)
ModelAdjusted = LinRegAdj.fit()
ModelAdjustedOutput = ModelAdjusted.summary()
print(ModelAdjustedOutput)

                            OLS Regression Results                            
Dep. Variable:             Free Bikes   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                   0.03193
Date:                Wed, 07 Aug 2024   Prob (F-statistic):              0.969
Time:                        19:59:31   Log-Likelihood:                -17257.
No. Observations:                6291   AIC:                         3.452e+04
Df Residuals:                    6288   BIC:                         3.454e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 7.6607     62.84

In [19]:
# p-values are still above the 0.05 threshold, so adjusting again
xReadjusted = xAdjusted.drop(columns=['Rounded Latitude'])
xReadjusted = sm.add_constant(xReadjusted) # no-op here: the 'const' column is already present
LinRegReadj = sm.OLS(y,xReadjusted)
ModelReadjusted = LinRegReadj.fit()
ModelReadjustedOutput = ModelReadjusted.summary()
print(ModelReadjustedOutput)

                            OLS Regression Results                            
Dep. Variable:             Free Bikes   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                   0.05970
Date:                Wed, 07 Aug 2024   Prob (F-statistic):              0.807
Time:                        20:00:23   Log-Likelihood:                -17257.
No. Observations:                6291   AIC:                         3.452e+04
Df Residuals:                    6289   BIC:                         3.453e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 3.6808     11.90

Removing features does not improve the fit: the adjusted R-squared stays at -0.000 in every version of the model.

# Stretch

How can you turn the regression model into a classification model?

I would get the 'Category' and 'Price' columns ready by one-hot encoding them: each distinct value becomes its own column, with 1 meaning the data point has that value and 0 meaning it does not. Using price as an example, there are 5 possible values: ['Not Listed', 'Most Affordable', 'Affordable', 'Expensive', 'Most Expensive'].

I would create 'Not Listed', 'Most Affordable', 'Affordable', 'Expensive', and 'Most Expensive' columns. A data point with a Price value of 'Not Listed' would have 1 in the 'Not Listed' column and 0 in the other 4 price columns.
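
The price encoding described above can be sketched with `pd.get_dummies`. The rows below are illustrative only, and the column name 'Price' follows the wording of the text rather than the exact column names in the CSV.

```python
import pandas as pd

price_levels = ["Not Listed", "Most Affordable", "Affordable", "Expensive", "Most Expensive"]

# A few illustrative rows (assumption: a single 'Price' column).
demo = pd.DataFrame({"Price": ["Not Listed", "Affordable", "Most Affordable", "Not Listed"]})

# A Categorical with fixed categories guarantees all five columns exist,
# even for levels that do not occur in this small sample.
demo["Price"] = pd.Categorical(demo["Price"], categories=price_levels)
price_dummies = pd.get_dummies(demo["Price"], dtype=int)

print(price_dummies)
```

Each row ends up with exactly one 1 across the five price columns.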

The same approach applies to categories. I would first clean up and group the categories, then make a column for each final category. A data point that is a restaurant, for example, would have 1 in the 'Restaurant' column and 0 in all other category columns.