Build a regression model.

In [51]:
import pandas as pd
import statsmodels.formula.api as smf
import sqlite3
conn = sqlite3.connect(r'../data/poi_Greenville_SC.db')
df = pd.read_sql_query("select * from poi", conn)
conn.close()

# independent variables
X = df[['price', 'rating']]

# dependent variable
y = df[['station_number_of_bikes']]

In [52]:
X.shape

(118, 2)

In [45]:
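# Find outliers
# Regress the number of bikes on price and rating, naming the columns explicitly
regression = smf.ols("station_number_of_bikes ~ price + rating", data=df).fit()
regression.summary()

# Studentized residuals with unadjusted and Bonferroni-adjusted p-values
test = regression.outlier_test()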

print('Bad data points (bonf(p) < 0.05):')
test[test['bonf(p)'] < 0.05]

Bad data points (bonf(p) < 0.05):


Unnamed: 0,student_resid,unadj_p,bonf(p)
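The empty table above means that no observation had a Bonferroni-adjusted p-value below 0.05. As a minimal sketch of how the `bonf(p)` column relates to `unadj_p`, the Bonferroni adjustment multiplies each p-value by the number of tests and caps the result at 1. The helper name `bonferroni` and the p-values below are made up for illustration and are not taken from the data.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    n = len(p_values)
    return [min(p * n, 1.0) for p in p_values]

# Hypothetical unadjusted p-values for four observations (illustrative only)
unadj_p = [0.0001, 0.02, 0.3, 0.9]
bonf_p = bonferroni(unadj_p)

# Indices of observations that would be flagged at the 0.05 level
flagged = [i for i, p in enumerate(bonf_p) if p < 0.05]
print(bonf_p)
print(flagged)
```

With 118 observations the multiplier is 118, so an observation needs an unadjusted p-value below roughly 0.0004 to be flagged.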


Provide model output and an interpretation of the results. 

In [47]:
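# Single variable

import statsmodels.api as sm
X = sm.add_constant(X) # adds a 'const' column to X
# Note: only X['rating'] is passed to OLS below, so the constant is not used
# and the model is fit without an intercept (hence the "uncentered" R-squared).
lin_reg = sm.OLS(y, X['rating'])
model = lin_reg.fit()
print_model = model.summary()
print(print_model)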

                                   OLS Regression Results                                   
Dep. Variable:     station_number_of_bikes   R-squared (uncentered):                   0.713
Model:                                 OLS   Adj. R-squared (uncentered):              0.711
Method:                      Least Squares   F-statistic:                              291.2
Date:                     Sat, 02 Sep 2023   Prob (F-statistic):                    1.56e-33
Time:                             15:22:45   Log-Likelihood:                         -268.37
No. Observations:                      118   AIC:                                      538.7
Df Residuals:                          117   BIC:                                      541.5
Df Model:                                1                                                  
Covariance Type:                 nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025    

In [None]:
'''
############################
Interpretation of the result
############################

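The model regresses the number of bikes at a station (dependent variable) on rating (independent variable) without an intercept, so the reported R-squared is uncentered.
An uncentered R-squared of 0.713 means that about 71.3% of the variation of the number of bikes around zero, not around its mean, is accounted for by rating.
This value is usually higher than the R-squared of a model with an intercept, so it should not be read on its own as evidence of a strong relationship.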

- R-squared (uncentered): 0.713

Here the dependent variable is "station_number_of_bikes" and the predictor is "rating", so this value says that rating accounts for approximately 71.3% of the (uncentered) variation in the number of bikes, not the reverse.

- Adj. R-squared (uncentered): 0.711

The adjusted R-squared penalizes the R-squared for the number of predictors in the model. 
With a single predictor and 118 observations the penalty is small, which is why it is very close to the R-squared value.

- F-statistic: 291.2

The F-statistic tests the overall significance of the regression model. 
A high F-statistic suggests that the independent variable in the model is significantly related to the dependent variable. 
In this case, the F-statistic is quite high, indicating that the model as a whole is statistically significant.

- Prob (F-statistic): 1.56e-33

The p-value associated with the F-statistic is very close to zero (1.56e-33), which is much smaller than the typical significance level of 0.05. 
This indicates that the model is highly significant, and the independent variable is contributing significantly to the model.

- Coefficient (coef) for "rating": 0.4668

The coefficient represents the change in the number of bikes for a one-unit change in "rating".
In this case, each additional rating point is associated with approximately 0.4668 more bikes at a station; because the model has no intercept, the fitted line is forced through the origin.


In summary, this regression suggests that "rating" is a statistically significant predictor of the number of bikes at a station ("station_number_of_bikes"). 
The model has an uncentered R-squared of 0.713, and each additional rating point is associated with approximately 0.4668 more bikes. 
The high R-squared and F-statistic partly reflect the missing intercept, so refitting with a constant term would give a fairer picture of the fit.

'''
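To illustrate why an uncentered R-squared can look high, the sketch below fits a line through the origin and scores the same fit against two baselines: variation around zero (uncentered) and variation around the mean (centered). The function name and the data are made up for illustration and are not drawn from the `poi` table.

```python
def r_squared_through_origin(x, y):
    """Fit y = slope * x (no intercept) and return slope plus both R-squared variants."""
    # Least-squares slope for a line forced through the origin
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    # Sum of squared residuals of that fit
    ssr = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    mean_y = sum(y) / len(y)
    # Uncentered: compare residuals with variation of y around zero
    uncentered = 1 - ssr / sum(b * b for b in y)
    # Centered: compare the same residuals with variation of y around its mean
    centered = 1 - ssr / sum((b - mean_y) ** 2 for b in y)
    return slope, uncentered, centered

# Hypothetical ratings (x) and bike counts (y), illustrative only
x = [3.0, 3.5, 4.0, 4.5, 5.0]
y = [10.0, 9.0, 11.0, 10.0, 10.0]

slope, uncentered, centered = r_squared_through_origin(x, y)
print(f"slope={slope:.3f} uncentered={uncentered:.3f} centered={centered:.3f}")
```

In this toy data the bike counts barely change with rating, yet the uncentered value is high because the counts are far from zero. Scored against the mean, the same fit is worse than simply predicting the average.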

# Stretch

How can you turn the regression model into a classification model?

In [48]:
'''

For instance, we can categorize ratings as 'High' or 'Low', defining 'High' as a rating greater than 5 and 'Low' as a rating less than or equal to 5. 
With these categories, we can use logistic regression to predict whether a station will receive a 'High' or 'Low' rating based on the number of bikes it has available.

In the logistic regression, 'High' is labeled 1 and 'Low' is labeled 0. 
The independent variable is 'station_number_of_bikes', and the goal is to determine whether the station's rating falls into the 'High' or 'Low' category.


'''
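The idea above can be sketched as follows. This is a minimal, dependency-free illustration that assumes the threshold of 5 from the text, uses made-up (rating, bikes) pairs, and fits the model by plain gradient descent. The names `fit_logistic` and `predict_high` are hypothetical; in practice a library routine such as statsmodels `Logit` would be the more usual choice.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, labels, lr=0.05, epochs=20000):
    """Fit P(label = 1) = sigmoid(b0 + b1 * x) by batch gradient descent on the log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for xi, yi in zip(x, labels):
            err = sigmoid(b0 + b1 * xi) - yi  # gradient of the log-loss w.r.t. the linear term
            g0 += err
            g1 += err * xi
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Hypothetical data, illustrative only: ratings and bike counts per station
ratings = [3.0, 4.0, 4.5, 5.0, 6.0, 6.5, 7.0, 8.0]
bikes = [2, 3, 6, 4, 5, 7, 8, 9]

# 'High' (1) if rating > 5, otherwise 'Low' (0)
labels = [1 if r > 5 else 0 for r in ratings]

b0, b1 = fit_logistic(bikes, labels)

def predict_high(n_bikes):
    """Estimated probability that a station with n_bikes has a 'High' rating."""
    return sigmoid(b0 + b1 * n_bikes)

print(labels)
print(predict_high(2) < 0.5, predict_high(9) > 0.5)
```

A station is classified as 'High' when the predicted probability exceeds 0.5. The cutoff can be moved if one kind of error is more costly than the other.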