Build a regression model.

In [12]:
#Importing libraries

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn import linear_model, datasets

In [None]:
#Reading merged_df csv file

df = pd.read_csv('merged_df.csv')

df.head()

In [None]:
#Dropping columns that won't be needed for training

df = df.drop(['Station Name', 'Business Name', 'Business Address', 'Category'], axis=1)

df.head()

In [15]:
#Separating the features (x) from the target (y)

x = df.drop("No. of bikes", axis=1)
y = df["No. of bikes"]  # indexing instead of pop() leaves df unchanged

In [None]:
x.head()

In [None]:
y.head()

In [18]:
#Adding a constant

x = sm.add_constant(x)
lin_reg = sm.OLS(y,x)

In [19]:
#Building the model

model = lin_reg.fit()

Provide model output and an interpretation of the results.

In [None]:
print_model = model.summary()

print(print_model)

# Model Summary:
# •	Dependent Variable (Dep. Variable): No. of bikes
# •	R-squared (R²): 0.091
# o	The R-squared value indicates the proportion of the variance in the dependent variable (No. of bikes) that is predictable from the independent variables.
# o	In this case, approximately 9.1% of the variance in the number of bikes is explained by the independent variables.
# •	Adjusted R-squared (Adj. R²): 0.015
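# o	Adjusted R-squared accounts for the number of predictors in the model, which makes it a more reliable measure when there are multiple predictors. It penalizes the inclusion of irrelevant variables.
# o	The drop from 0.091 to 0.015 suggests that the predictors add little explanatory power once their number is taken into account.
# •	F-statistic: 1.200
# o	The F-statistic tests whether the independent variables, taken together, explain more of the variability in the dependent variable than a model with only a constant.
# o	A larger F-statistic is stronger evidence that the model improves on the constant-only model. Here it is close to 1, which is roughly what we would expect if the predictors had no effect.
# •	Prob (F-statistic): 0.312
# o	This is the p-value associated with the F-statistic. A low p-value (typically less than 0.05) indicates that the overall regression model is statistically significant. Here the value is 0.312, so we cannot reject the hypothesis that all coefficients are zero, and the model is not statistically significant.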
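
The summary statistics above are linked by simple formulas, so they can be checked by hand. The sketch below recomputes the adjusted R-squared and the F-statistic from the reported R-squared. The sample size `n = 66` and predictor count `p = 5` are assumed values chosen for illustration (the summary table reports the real ones as "No. Observations" and "Df Model"); they are not taken from the output above.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (constant excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def f_statistic(r2, n, p):
    """Overall F-statistic: explained variance per predictor over residual variance per degree of freedom."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


r2 = 0.091    # R-squared reported in the summary
n, p = 66, 5  # ASSUMED sample size and number of predictors (illustrative only)

adj_r2 = adjusted_r_squared(r2, n, p)
f_stat = f_statistic(r2, n, p)

print(f"Adjusted R-squared: {adj_r2:.3f}")
print(f"F-statistic: {f_stat:.3f}")
```

With these assumed values the results come out close to the reported 0.015 and 1.200, which illustrates how a small R-squared spread over several predictors leaves almost nothing after the adjustment.
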
# Coefficients Table:
# •	Const (Constant): 1.182e+04 (11,820)
# o	The intercept of the regression equation: the predicted number of bikes when all independent variables are zero.
# •	Coefficients for Independent Variables (e.g., Latitude, Longitude, etc.):
# o	Each coefficient represents the change in the dependent variable for a one-unit change in the corresponding independent variable, holding other variables constant.
# •	P>|t| (p-value):
# o	The p-value associated with each coefficient. It tests the null hypothesis that the corresponding coefficient is equal to zero (no effect).
# o	A low p-value (typically less than 0.05) indicates that the variable is statistically significant.
# Additional Information:
# •	Omnibus: A test of the normality of residuals. A low p-value suggests that the residuals are not normally distributed.
# •	Durbin-Watson: A test for autocorrelation of residuals. Values close to 2 suggest no autocorrelation.
# •	Jarque-Bera (JB): Another test for normality. A low p-value suggests non-normality.
# •	Skewness (Skew): A measure of the asymmetry of the residuals.
# •	Kurtosis: A measure of the "tailedness" of the residuals.
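# •	Cond. No. (Condition Number): A measure of multicollinearity. Large values indicate that the independent variables are close to linearly dependent. Rules of thumb vary, with thresholds of roughly 20 to 30 often cited as a sign of problematic collinearity.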
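
Two of these diagnostics are easy to reproduce with NumPy alone, which helps in reading them. The sketch below computes the Durbin-Watson statistic from its definition and compares condition numbers for independent and nearly collinear columns. All data here is synthetic and the variable names are hypothetical.

```python
import numpy as np


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)


rng = np.random.default_rng(0)

# Independent residuals: the statistic should land near 2
white_noise = rng.normal(size=500)
dw_independent = durbin_watson(white_noise)

# A cumulative sum drifts slowly, so successive values are positively correlated
dw_correlated = durbin_watson(np.cumsum(white_noise))

# Condition number: two unrelated columns versus two nearly identical columns
a = rng.normal(size=200)
b = rng.normal(size=200)
cond_independent = np.linalg.cond(np.column_stack([a, b]))
cond_collinear = np.linalg.cond(np.column_stack([a, a + 1e-3 * b]))

print(f"Durbin-Watson, independent residuals: {dw_independent:.2f}")
print(f"Durbin-Watson, correlated residuals:  {dw_correlated:.2f}")
print(f"Condition number, independent columns: {cond_independent:.1f}")
print(f"Condition number, collinear columns:   {cond_collinear:.1f}")
```
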
# Interpretation:
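# •	The R-squared is low (9.1%) and the adjusted R-squared is lower still (1.5%), so the model explains only a small portion of the variability in the number of bikes.
# •	With Prob (F-statistic) = 0.312, the model as a whole is not statistically significant at the 0.05 level.
# •	The p-values associated with the individual coefficients should still be checked to determine whether any single variable is statistically significant.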

# Stretch

How can you turn the regression model into a classification model?

# To turn a regression model into a classification model, one needs to define a threshold or cutoff point, and then predict whether the outcome falls above or below that threshold. This usually involves converting the continuous output of the regression model into discrete categories. Here's a general step-by-step approach:
# 1.	Define Categories:
# o	Decide on the categories you want to predict. For example, you might want to predict whether the number of bikes is "High" or "Low."
# 2.	Choose a Threshold:
# o	Determine a threshold value that separates the categories. This could be based on domain knowledge, business requirements, or a statistical criterion.
# 3.	Create Binary Outcome:
# o	Create a new binary outcome variable based on whether the predicted value from the regression model is above or below the chosen threshold.
# o	Example: If predicted value > Threshold, assign category "High," else assign category "Low."
# 4.	Train a Classification Model:
# o	Use the binary outcome as the dependent variable and the same set of independent variables from the regression model.
# o	Choose a suitable classification algorithm (e.g., logistic regression, decision tree, random forest) and train the model on the binary outcome.
# 5.	Evaluate the Classification Model:
# o	Assess the performance of the classification model using standard classification metrics (accuracy, precision, recall, F1-score, etc.).
# o	Split your data into training and testing sets to evaluate the model on unseen data.
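
For step 2, one simple statistical criterion is the median of the observed values, which splits the data into two classes of roughly equal size. The sketch below shows the idea on made-up numbers; `bike_counts` is hypothetical data, not values from `merged_df.csv`.

```python
import numpy as np

# Hypothetical bike counts for six stations
bike_counts = np.array([3, 8, 12, 15, 20, 25])

# Use the median as the cutoff so that both classes are about the same size
threshold = np.median(bike_counts)

# 1 = "High" (above the median), 0 = "Low" (at or below the median)
labels = (bike_counts > threshold).astype(int)

print("Threshold:", threshold)
print("Labels:", labels)
```

A balanced split like this avoids a classifier that scores well simply by always predicting the majority class, although a threshold based on domain knowledge may be more meaningful.
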

# To be more specific for our data set, we can do the following:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Assuming df is the DataFrame used above, x is the feature matrix with the constant added,
# and model is the fitted OLS model
# Store the regression predictions so they can be compared with the threshold
df['PredictedNumberofBikes'] = model.predict(x)
# Let's say I want to predict if the number of bikes is "High" (1) or "Low" (0)
threshold = 12000  # Choose an appropriate threshold
# Create a binary outcome variable
df['BikeCategory'] = np.where(df['PredictedNumberofBikes'] > threshold, 1, 0)
# Define features (independent variables) and target (dependent variable)
X = df[['Latitude', 'Longitude', 'Distance', 'Review Count', 'Rating']]
y_class = df['BikeCategory']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_class, test_size=0.2, random_state=42)
# Train a logistic regression model (one can choose another classification algorithm)
# A separate name (clf) keeps the fitted OLS model available; max_iter is raised because the features are unscaled
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("Classification Report:\n", classification_rep)