Build a regression model.

Import the necessary libraries

In [1]:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


Loading the data

In [2]:
df = pd.read_csv('../data/city_poi_data.csv')
df.head()

Unnamed: 0,Station Name,empty_slots,free_bikes,ebikes,normal_bikes,name,distance,rating,price,review_count,poi_location
0,Chilco & Barclay,9,9,2,7,Kingyo,296.0,4.4,$$,1204.0,"(49.29, -123.14)"
1,Chilco & Barclay,9,9,2,7,Guu with Garlic,533.0,4.3,$$,1081.0,"(49.29, -123.13)"
2,Chilco & Barclay,9,9,2,7,Saku,682.0,4.4,$$,453.0,"(49.29, -123.13)"
3,Chilco & Barclay,9,9,2,7,Cardero's Restaurant & Marine Pub,954.0,3.9,$$$,715.0,"(49.29, -123.13)"
4,Chilco & Barclay,9,9,2,7,Forage,990.0,4.1,$$,736.0,"(49.29, -123.13)"


Let's build a sample regression model to predict the rating of a POI based on the number of bikes nearby.

In [3]:
df_copy = df.copy()
# Count the number of bikes for each POI
poi_bike_counts = df_copy.groupby('poi_location')['free_bikes'].sum().reset_index()
poi_bike_counts.columns = ['poi_location', 'bike_count']

# Merge the counts back to the original dataframe
df_copy = df_copy.merge(poi_bike_counts, on='poi_location', how='left')
df_copy.head()

X = df_copy['bike_count']
y = df_copy['rating']

# Add a constant to the independent variable
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                 rating   R-squared:                       0.018
Model:                            OLS   Adj. R-squared:                  0.018
Method:                 Least Squares   F-statistic:                     46.95
Date:                Sun, 20 Oct 2024   Prob (F-statistic):           9.14e-12
Time:                        19:18:15   Log-Likelihood:                -853.60
No. Observations:                2524   AIC:                             1711.
Df Residuals:                    2522   BIC:                             1723.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.0534      0.009    427.661      0.0

# Regression Model Results Interpretation

## Model Statistical Significance
- **p-value** (Prob F-statistic): 9.14e-12 < 0.05
- The model is statistically significant

## Model Explanatory Power
- **R-squared**: 0.018
- **Adjusted R-squared**: 0.018
- The model explains 1.8% of the variance in ratings

## Impact of Bike Count on Ratings
- **Coefficient**: 3.96e-05
- For each additional bike, the rating increases by an average of 0.0000396 points

## Model Equation
Rating = 4.0534 + 0.0000396 * bike_count

## Conclusions
1. There is a **statistically significant relationship** between the number of bikes and POI ratings
2. The practical impact is extremely small
3. The model's explanatory power is very limited (only 1.8%)
4. Other factors not included in this simple model likely have a much stronger influence on POI ratings

Let's add the ratio of ebikes to total bikes to the model.

In [4]:
# Check for division by zero cases
print((df_copy['free_bikes'] == 0).sum())

# Safely calculate the ratio, avoiding division by zero
df_copy['ebike_ratio'] = df_copy['ebikes'] / df_copy['free_bikes'].replace(0, np.nan)

# Check and handle infinite and NaN values
df_copy['ebike_ratio'] = df_copy['ebike_ratio'].replace([np.inf, -np.inf], np.nan)

# Remove rows containing NaN
df_copy = df_copy.dropna(subset=['bike_count', 'ebike_ratio'])

X = df_copy[['bike_count', 'ebike_ratio']]
y = df_copy['rating']

# Add a constant to the independent variable
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())


226
                            OLS Regression Results                            
Dep. Variable:                 rating   R-squared:                       0.021
Model:                            OLS   Adj. R-squared:                  0.020
Method:                 Least Squares   F-statistic:                     24.21
Date:                Sun, 20 Oct 2024   Prob (F-statistic):           3.94e-11
Time:                        19:18:15   Log-Likelihood:                -809.11
No. Observations:                2298   AIC:                             1624.
Df Residuals:                    2295   BIC:                             1641.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           4.0393      0.012    342.375  

The e-bike ratio is not statistically significant. Let's try adding empty_slots to the model.


In [5]:
X = df_copy[['bike_count', 'empty_slots']]
y = df_copy['rating']

# Add a constant to the independent variable
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())



                            OLS Regression Results                            
Dep. Variable:                 rating   R-squared:                       0.020
Model:                            OLS   Adj. R-squared:                  0.020
Method:                 Least Squares   F-statistic:                     24.00
Date:                Sun, 20 Oct 2024   Prob (F-statistic):           4.85e-11
Time:                        19:18:15   Log-Likelihood:                -809.32
No. Observations:                2298   AIC:                             1625.
Df Residuals:                    2295   BIC:                             1642.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           4.0401      0.016    249.435      

The number of empty slots is not statistically significant either. Let's try the average distance between each POI and its nearby bike stations.

In [6]:
# Calculate the average distance from each POI to its nearby bike stations
df_mean_distance = df_copy.groupby('poi_location')['distance'].mean().reset_index()
df_mean_distance.columns = ['poi_location', 'mean_distance']

# Merge the mean distance back into the dataframe
df_copy = df_copy.merge(df_mean_distance, on='poi_location', how='left')

# Select the independent variables and the target
X = df_copy[['mean_distance', 'bike_count']]
y = df_copy['rating']

# Add a constant to the independent variable
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                 rating   R-squared:                       0.022
Model:                            OLS   Adj. R-squared:                  0.021
Method:                 Least Squares   F-statistic:                     25.79
Date:                Sun, 20 Oct 2024   Prob (F-statistic):           8.36e-12
Time:                        19:18:15   Log-Likelihood:                -807.56
No. Observations:                2298   AIC:                             1621.
Df Residuals:                    2295   BIC:                             1638.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             3.8506      0.102     37.639

Again, the average distance is not statistically significant. However, its p-value is 0.05, right at the threshold, and the R-squared of 0.022 is slightly higher than that of the model with bike count only (0.018).

Finally, I will add the review count and the price to the model.


In [7]:
# Create a mapping dictionary to convert price symbols to numeric values
price_map = {
    '$': 1,
    '$$': 2,
    '$$$': 3,
    '$$$$': 4
}

# Convert the price column using the mapping
df_copy['price_category'] = df_copy['price'].map(price_map)

# Create dummy variables
price_dummies = pd.get_dummies(df_copy['price_category'], prefix='price')
price_dummies = price_dummies.astype(int)
df_copy = pd.concat([df_copy, price_dummies], axis=1)

# Select features and target
X = df_copy[['mean_distance', 'bike_count', 'price_1.0', 'price_2.0', 'price_3.0', 'price_4.0', 'review_count']]
y = df_copy['rating']

# Convert to numeric and handle missing values
X = X.apply(pd.to_numeric, errors='coerce')
X = X.dropna()
y = y[X.index]

# Ensure all columns are numeric
numeric_columns = X.select_dtypes(include=[np.number]).columns
X = X[numeric_columns]

# Add a constant to the independent variable
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                 rating   R-squared:                       0.117
Model:                            OLS   Adj. R-squared:                  0.114
Method:                 Least Squares   F-statistic:                     43.40
Date:                Sun, 20 Oct 2024   Prob (F-statistic):           7.24e-58
Time:                        19:18:15   Log-Likelihood:                -689.97
No. Observations:                2298   AIC:                             1396.
Df Residuals:                    2290   BIC:                             1442.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             3.8135      0.102     37.436

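Adding review count and price improves the fit noticeably: R-squared rises from 0.022 to 0.117 and AIC falls from 1621 to 1396, although the model still explains less than 12% of the variance in ratings. Moreover, some of the price dummies are not statistically significant. Let's drop them.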

In [9]:
# Drop the price_1.0 and price_4.0 columns, because they are not statistically significant
X = X.drop(columns=['price_1.0', 'price_4.0'])

# Fit the model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                 rating   R-squared:                       0.117
Model:                            OLS   Adj. R-squared:                  0.115
Method:                 Least Squares   F-statistic:                     60.48
Date:                Sun, 20 Oct 2024   Prob (F-statistic):           2.44e-59
Time:                        19:22:37   Log-Likelihood:                -690.70
No. Observations:                2298   AIC:                             1393.
Df Residuals:                    2292   BIC:                             1428.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             3.8155      0.098     38.857

Dropping the two price dummies does not improve the model: R-squared stays at 0.117, and adjusted R-squared (0.115 vs. 0.114) and AIC (1393 vs. 1396) barely change.
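
Dropping individual price dummies also changes what the remaining coefficients are measured against: the baseline becomes every row whose price is `$`, `$$$$`, or missing. A common alternative is to keep a single explicit baseline level by passing `drop_first=True` to `pd.get_dummies`. The sketch below uses a small made-up price column; `prices`, `levels`, and `dummies` are illustrative names.

```python
import pandas as pd

# Made-up price column for illustration (not the notebook's data)
prices = pd.Series(["$", "$$", "$$$", "$$", "$$$$", "$"], name="price")

# Same mapping as in the notebook
price_map = {"$": 1, "$$": 2, "$$$": 3, "$$$$": 4}
levels = prices.map(price_map)

# drop_first=True omits the lowest level, which becomes the baseline
dummies = pd.get_dummies(levels, prefix="price", drop_first=True).astype(int)

print(dummies.columns.tolist())
print(dummies)
```

With this encoding, each remaining price coefficient reads as the difference in rating relative to the `$` level. Rows with a missing price would still get zeros in every dummy column, so they need separate handling if they should not be pooled with the baseline.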


# Stretch

How can you turn the regression model into a classification model?

1. Define a threshold for the rating, e.g. if the rating is greater than 3, it is a positive review, otherwise it is a negative review.
2. Based on the threshold, create a new column for the binary classification, 1 for positive review and 0 for negative review.
3. Build a classification model using the new column.