In [3]:
# imports
import numpy as np
import pandas as pd
import statsmodels.api as sm

Build a regression model.

In [4]:
CombinedPointData = pd.read_csv('../data/CombinedPointData.csv')
CombinedPointData

Unnamed: 0,Foursquare Name,Foursquare Popularity,Foursquare Price,Matched Name,Yelp Name,Yelp Price,Search,Category,Rounded Latitude,Rounded Longitude,Rating,Free Bikes
0,Gula Villan,0.305123,Expensive,Gula Villan,Gula Villan,Affordable,13032,"Cafe, Coffee, And Tea House",60.141,24.757,8.20,0.0
1,Iso-Vasikkasaaren kahvila,0.000000,Not Listed,,,,13034,Café,60.141,24.757,0.00,0.0
2,Villa Pentry,0.982601,Affordable,Villa Pentry,Villa Pentry,Not Listed,13336,Scandinavian Restaurant,60.145,24.736,8.85,0.0
3,Grande Buffet,0.949260,Not Listed,,,,13030,Buffet,60.145,24.914,0.00,0.0
4,Toten seinä,0.885058,Affordable,,,,13073,Australian Restaurant,60.146,24.752,0.00,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
6087,HSL V8002 Oskarintie,0.907233,Not Listed,,,,19043,Bus Stop,60.335,25.077,0.00,0.0
6088,HSL V8001 Oskarintie,0.195755,Not Listed,,,,19043,Bus Stop,60.335,25.077,0.00,0.0
6089,HSL V7213 Rekolanmäen koulu,0.096578,Not Listed,,,,19043,Bus Stop,60.336,25.057,0.00,0.0
6090,Rekolan koirapuisto,0.934974,Not Listed,Uutelan koirapuisto,,,16033,Dog Park,60.337,25.076,0.00,0.0


In [5]:
# X holds the features; y is the target.
X = CombinedPointData[['Rounded Latitude','Rounded Longitude','Rating']]
y = CombinedPointData['Free Bikes']

In [6]:
X

Unnamed: 0,Rounded Latitude,Rounded Longitude,Rating
0,60.141,24.757,8.20
1,60.141,24.757,0.00
2,60.145,24.736,8.85
3,60.145,24.914,0.00
4,60.146,24.752,0.00
...,...,...,...
6087,60.335,25.077,0.00
6088,60.335,25.077,0.00
6089,60.336,25.057,0.00
6090,60.337,25.076,0.00


In [7]:
y

0       0.0
1       0.0
2       0.0
3       0.0
4       0.0
       ... 
6087    0.0
6088    0.0
6089    0.0
6090    0.0
6091    0.0
Name: Free Bikes, Length: 6092, dtype: float64

In [8]:
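# Checking for correlation between features. The closer the absolute value of a coefficient is to 1, the stronger the correlation; no feature pair is strongly correlated, so multicollinearity is not a concern here.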
CorrelationMatrix = CombinedPointData.corr(numeric_only=True)
CorrelationMatrix

Unnamed: 0,Foursquare Popularity,Rounded Latitude,Rounded Longitude,Rating,Free Bikes
Foursquare Popularity,1.0,-0.136776,-0.016235,0.357924,0.033232
Rounded Latitude,-0.136776,1.0,0.287681,-0.285171,-0.008736
Rounded Longitude,-0.016235,0.287681,1.0,0.027814,0.004089
Rating,0.357924,-0.285171,0.027814,1.0,-0.006235
Free Bikes,0.033232,-0.008736,0.004089,-0.006235,1.0
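
As a rough illustration of the check described in the comment above, the sketch below scans the three feature-to-feature correlations from the matrix for pairs whose absolute correlation reaches a cutoff. The helper name `correlated_pairs` and the 0.8 cutoff are assumptions made for this example, not part of the original analysis.

```python
from itertools import combinations

import pandas as pd

features = ['Rounded Latitude', 'Rounded Longitude', 'Rating']

# Feature-to-feature correlations copied from the matrix above.
feature_corr = pd.DataFrame(
    [[1.0, 0.287681, -0.285171],
     [0.287681, 1.0, 0.027814],
     [-0.285171, 0.027814, 1.0]],
    index=features, columns=features,
)

def correlated_pairs(corr, threshold):
    """Return (feature_a, feature_b, r) for each pair with |r| >= threshold."""
    return [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)
        if abs(corr.loc[a, b]) >= threshold
    ]

# 0.8 is an assumed rule-of-thumb cutoff (hypothetical choice for this sketch).
print(correlated_pairs(feature_corr, threshold=0.8))  # prints []
```

The largest absolute value among the feature pairs is about 0.29 (Rounded Latitude vs. Rounded Longitude), well below that cutoff.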


In [9]:
X = sm.add_constant(X) # adding a constant
LinReg = sm.OLS(y,X)

In [10]:
Model = LinReg.fit()

Provide model output and an interpretation of the results. 

In [11]:
ModelOutput = Model.summary()
print(ModelOutput)

                            OLS Regression Results                            
Dep. Variable:             Free Bikes   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.4544
Date:                Sat, 10 Aug 2024   Prob (F-statistic):              0.714
Time:                        14:10:51   Log-Likelihood:                -17386.
No. Observations:                6092   AIC:                         3.478e+04
Df Residuals:                    6088   BIC:                         3.481e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                70.5209     74.65

The adjusted R-squared is approximately 0 (reported as -0.000), so the model explains essentially none of the variance in the number of free bikes.
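
To see why the adjusted R-squared can come out slightly negative, the sketch below recomputes it from figures printed in the summary (F-statistic 0.4544, Df Model 3, Df Residuals 6088, 6092 observations), using the standard formulas for an OLS model with an intercept. The function names are made up for this illustration.

```python
def r_squared_from_f(f_stat, df_model, df_resid):
    """Invert F = (R2 / df_model) / ((1 - R2) / df_resid) to recover R2."""
    scaled = f_stat * df_model
    return scaled / (scaled + df_resid)

def adjusted_r_squared(r2, n_obs, df_model):
    """Penalise R2 for the number of predictors (model with an intercept)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - df_model - 1)

# Values read off the first summary above.
r2 = r_squared_from_f(f_stat=0.4544, df_model=3, df_resid=6088)
adj_r2 = adjusted_r_squared(r2, n_obs=6092, df_model=3)

print(f"R-squared: {r2:.3f}, adjusted R-squared: {adj_r2:.3f}")
```

Because R-squared is so close to zero, the penalty for the three predictors outweighs it, which is why the summary rounds the adjusted value to -0.000.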

The p-values are all above the 0.05 threshold. This means we cannot reject the null hypothesis that each coefficient is zero: any apparent relationship between a feature (Rounded Latitude, Rounded Longitude, or Rating) and the number of free bikes is consistent with random variation rather than a real effect.

Taken together, the R-squared and p-values suggest that the features have no detectable effect on the target. The signs of the coefficients suggest that Rounded Latitude and Rating have a weak negative association with free bikes (when one goes up, the other goes down) and that Rounded Longitude has a weak positive one, but these coefficients are not statistically distinguishable from zero, so they should not be read as real effects.

In [13]:
# adjusting the model by removing the feature with the highest p-value
xAdjusted = X.drop(columns=['Rounded Longitude'])
xAdjusted = sm.add_constant(xAdjusted) # no-op: the constant column added to X earlier is still present
LinRegAdj = sm.OLS(y,xAdjusted)
ModelAdjusted = LinRegAdj.fit()
ModelAdjustedOutput = ModelAdjusted.summary()
print(ModelAdjustedOutput)

                            OLS Regression Results                            
Dep. Variable:             Free Bikes   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.4848
Date:                Sat, 10 Aug 2024   Prob (F-statistic):              0.616
Time:                        14:12:15   Log-Likelihood:                -17386.
No. Observations:                6092   AIC:                         3.478e+04
Df Residuals:                    6089   BIC:                         3.480e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               64.3989     74.013  

In [15]:
# p-values still above the 0.05 threshold, so adjusting again
xReadjusted = xAdjusted.drop(columns=['Rating'])
xReadjusted = sm.add_constant(xReadjusted) # no-op: the constant column is still present
LinRegReadj = sm.OLS(y,xReadjusted)
ModelReadjusted = LinRegReadj.fit()
ModelReadjustedOutput = ModelReadjusted.summary()
print(ModelReadjustedOutput)

                            OLS Regression Results                            
Dep. Variable:             Free Bikes   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.4648
Date:                Sat, 10 Aug 2024   Prob (F-statistic):              0.495
Time:                        14:13:11   Log-Likelihood:                -17386.
No. Observations:                6092   AIC:                         3.478e+04
Df Residuals:                    6090   BIC:                         3.479e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               49.3754     70.925  

Removing features does not improve the fit: R-squared stays at 0.000 in both adjusted models.

# Stretch

How can you turn the regression model into a classification model?

I would prepare the 'Category' and 'Price' columns by one-hot encoding them: each distinct value becomes its own column, where 1 means the data point has that value and 0 means it does not. Using price as an example, there are 5 possible values: ['Not Listed', 'Most Affordable', 'Affordable', 'Expensive', 'Most Expensive'].

I would make 'Not Listed', 'Most Affordable', 'Affordable', 'Expensive', and 'Most Expensive' columns. A data point with a Price value of 'Not Listed' would have 1 in the 'Not Listed' column and 0 in the other 4 price columns.

The same approach could be applied to categories. I would first clean up and group the categories, then make one column per final category. A data point that is a restaurant, for example, would have 1 in the 'Restaurant' column and 0 in all other Category-related columns.
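
The encoding described above can be sketched with `pd.get_dummies`. The five-row frame below reuses the 'Foursquare Price' values of the first five rows displayed earlier; the variable names are illustrative only.

```python
import pandas as pd

# The five possible price values listed above.
price_levels = ['Not Listed', 'Most Affordable', 'Affordable',
                'Expensive', 'Most Expensive']

# 'Foursquare Price' for rows 0-4 of CombinedPointData as displayed earlier.
sample = pd.DataFrame({
    'Foursquare Price': ['Expensive', 'Not Listed', 'Affordable',
                         'Not Listed', 'Affordable'],
})

# Declare all five levels so each gets a column even if unused in this sample.
sample['Foursquare Price'] = pd.Categorical(
    sample['Foursquare Price'], categories=price_levels
)

# One column per level: 1 where the row has that value, 0 otherwise.
encoded = pd.get_dummies(sample['Foursquare Price'], dtype=int)
print(encoded)
```

Declaring the column as categorical with all five levels keeps the 'Most Affordable' and 'Most Expensive' columns even though no row in this small sample has those values.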