#### Build a regression model.



This section builds a regression model to examine the relationship between the number of bikes at a station and two properties of nearby restaurants: their average review count and their average rating.

In [73]:
import pandas as pd
from sqlalchemy import create_engine

# Open the SQLite database that holds the bike station and restaurant tables
connection = create_engine('sqlite:///bike_and_restaurants.db')
    
sql = ''' SELECT      b.station_name,
                      number_of_bikes,
                      AVG(review_count) AS avg_review_count,
                      AVG(rating) AS avg_rating
          FROM        bike_stations b
          JOIN        nearby_restaurants nr
          ON          b.station_name = nr.station_name
          JOIN        restaurants r
          ON          nr.restaurant_name = r.restaurant_name
          WHERE       review_count IS NOT NULL
          GROUP BY    b.station_name, number_of_bikes '''

df = pd.read_sql_query(sql, connection)

#### Provide model output and an interpretation of the results. 

In [78]:
import statsmodels.api as sm

y = df['number_of_bikes']
x = df[['avg_review_count', 'avg_rating']]

lin_reg = sm.OLS(y, x)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                                 OLS Regression Results                                
Dep. Variable:        number_of_bikes   R-squared (uncentered):                   0.944
Model:                            OLS   Adj. R-squared (uncentered):              0.937
Method:                 Least Squares   F-statistic:                              127.3
Date:                Sat, 13 Apr 2024   Prob (F-statistic):                    3.90e-10
Time:                        19:56:48   Log-Likelihood:                         -38.668
No. Observations:                  17   AIC:                                      81.34
Df Residuals:                      15   BIC:                                      83.00
Df Model:                           2                                                  
Covariance Type:            nonrobust                                                  
                       coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------



avg_review_count was removed because of its high p-value (it was not statistically significant), and the model was refit with avg_rating alone to get a better-fitting model.

In [72]:
x = df['avg_rating']

lin_reg = sm.OLS(y, x)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                                 OLS Regression Results                                
Dep. Variable:        number_of_bikes   R-squared (uncentered):                   0.944
Model:                            OLS   Adj. R-squared (uncentered):              0.941
Method:                 Least Squares   F-statistic:                              270.7
Date:                Sat, 13 Apr 2024   Prob (F-statistic):                    1.89e-11
Time:                        19:47:35   Log-Likelihood:                         -38.690
No. Observations:                  17   AIC:                                      79.38
Df Residuals:                      16   BIC:                                      80.21
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------



From the regression model it can be inferred that there is a correlation between the number of bikes at a station and the average rating of restaurants near that station. Both the low p-value and the high Adj. R-squared indicate that the model fits the data well. However, a warning at the bottom of the output suggests the model may not be fully reliable because of the small sample size (17 observations).

# Stretch

#### How can you turn the regression model into a classification model?

The regression model can be turned into a classification model by splitting the rating into two classes, 'high' and 'low'. The rating column would be converted to an object or categorical type, with any rating higher than 2.5 labeled 'high' and anything below labeled 'low'.
Using this new rating column and the number of bikes, it's possible to set up a classification problem in which the number of bikes predicts whether nearby restaurants have a high or low rating.