Build a regression model.

In [105]:
import pandas as pd
import statsmodels.api as sm

# Load the fsq_dfmodel.csv
df = pd.read_csv('../data/fsq_dfmodel.csv')

In [106]:
df.head()

Unnamed: 0,id,name,slots,lat,long,accoms,biketrail,outdoor,park,sightsee,snack,min_distance
0,7a19c49f486d7c0c02b3685d7b240448,10th & Cambie,36,49.262487,-123.114397,0,0,1,0,2,47,96.0
1,32603a87cfca71d0f7dfa3513bad69d5,Yaletown-Roundhouse Station,16,49.274566,-123.121817,50,0,0,23,6,50,29.0
2,6d42fa40360f9a6b2bf641c7b8bb2862,Dunsmuir & Beatty,26,49.279764,-123.110154,50,0,7,22,22,50,59.0
3,66f873d641d448bd1572ab086665a458,12th & Yukon (City Hall),16,49.260599,-123.113504,10,1,1,16,2,0,128.0
4,485d4d24c803cfde829ab89699fed833,8th & Ash,16,49.264215,-123.117772,10,0,1,0,2,36,74.0


In [107]:
# Drop id, name, lat and long from the df because we do not need them as features in the linear regression
df = df.drop(columns=['id', 'name', 'lat', 'long'])
df.head()

Unnamed: 0,slots,accoms,biketrail,outdoor,park,sightsee,snack,min_distance
0,36,0,0,1,0,2,47,96.0
1,16,50,0,0,23,6,50,29.0
2,26,50,0,7,22,22,50,59.0
3,16,10,1,1,16,2,0,128.0
4,16,10,0,1,0,2,36,74.0


In [108]:
y = df['slots']

#
# Code credit:  Jeremy Eng (from W03D05-Model Evaluation lecture and notebook)
# (I have made only a few tweaks to naming and adding comments in some places)
#

# X=[]
# for column in df.columns[1:]:
#     X.append(sm.add_constant(df[column]))

# Create a model for each indep. variable
# Put into list of X's (with constants)

X = [sm.add_constant(df[column]) for column in df.columns[1:]] # accoms, biketrail, outdoor, park, sightsee, snack, min_distance

In [109]:
# Show what we have set up

for i in range(len(X)):
    print (f"X[{i}]: {X[i]}")

X[0]:      const  accoms
0      1.0       0
1      1.0      50
2      1.0      50
3      1.0      10
4      1.0      10
..     ...     ...
240    1.0       4
241    1.0      13
242    1.0       2
243    1.0       4
244    1.0       5

[245 rows x 2 columns]
X[1]:      const  biketrail
0      1.0          0
1      1.0          0
2      1.0          0
3      1.0          1
4      1.0          0
..     ...        ...
240    1.0          0
241    1.0          0
242    1.0          0
243    1.0          0
244    1.0          0

[245 rows x 2 columns]
X[2]:      const  outdoor
0      1.0        1
1      1.0        0
2      1.0        7
3      1.0        1
4      1.0        1
..     ...      ...
240    1.0        0
241    1.0        6
242    1.0        0
243    1.0        0
244    1.0        3

[245 rows x 2 columns]
X[3]:      const  park
0      1.0     0
1      1.0    23
2      1.0    22
3      1.0    16
4      1.0     0
..     ...   ...
240    1.0     9
241    1.0    29
242    1.0     3
24

In [110]:
# Create a list of Models
Models = [sm.OLS(y,x) for x in X]

# Create list of Results
Results = [model.fit() for model in Models]

# Create a list of Adj_Rsquared values for each model
Adj_Rsquared = [results.rsquared_adj for results in Results]

# Create list of p-values for each model
Pval = [results.pvalues for results in Results]

In [111]:
# Print the adjusted R-squared and the p-values (const, feature) for each single-feature model
for i in range(len(Adj_Rsquared)):
     print(f'adj_R2: {Adj_Rsquared[i]:.3f},\tP-values: {*Pval[i],},\t\tcolumn: {df.columns[i+1]}')

adj_R2: 0.003,	P-values: (9.648991077803056e-106, 0.2041096798150786),		column: accoms
adj_R2: 0.010,	P-values: (3.2991677144263846e-135, 0.06599128732164597),		column: biketrail
adj_R2: 0.081,	P-values: (1.3520809416106496e-107, 3.582120811743888e-06),		column: outdoor
adj_R2: -0.003,	P-values: (2.0643093847308137e-67, 0.6213063314285698),		column: park
adj_R2: 0.011,	P-values: (3.15377301830442e-109, 0.051805747907294815),		column: sightsee
adj_R2: 0.009,	P-values: (2.8385943108478697e-72, 0.07040979732475412),		column: snack
adj_R2: 0.012,	P-values: (1.806376798184194e-98, 0.05032746883191435),		column: min_distance


None of the adj_R2 values is strong, but the highest one belongs to "outdoor".  Next, I will keep "outdoor" in the model and add each of the remaining variables to it, one at a time.

For this run, the best adjusted R2 was 0.081.

In [96]:
remaining_var = df.drop(['slots', 'outdoor'], axis = 1)
remaining_var.head()

Unnamed: 0,accoms,biketrail,park,sightsee,snack,min_distance
0,0,0,0,2,47,96.0
1,50,0,23,6,50,29.0
2,50,0,22,22,50,59.0
3,10,1,16,2,0,128.0
4,10,0,0,2,36,74.0


In [97]:
included_df = df[['outdoor']]
included_df

Unnamed: 0,outdoor
0,1
1,0
2,7
3,1
4,1
...,...
240,0
241,6
242,0
243,0


In [98]:
# Create a new X which is a list of all X's for model building, with constants
X = [sm.add_constant(pd.merge(included_df, remaining_var[column], right_index=True, left_index=True)) for column in remaining_var.columns]
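
Because `included_df` and `remaining_var` come from the same DataFrame and share its index, the index-aligned merge above should give the same columns as selecting them directly from `df`. A small check of that idea, using the first three rows shown in `df.head()`; the names `toy`, `included` and `candidates` are introduced only for this sketch:

```python
import pandas as pd

# First three rows from df.head(), limited to a few columns.
toy = pd.DataFrame({
    "slots": [36, 16, 26],
    "outdoor": [1, 0, 7],
    "accoms": [0, 50, 50],
    "snack": [47, 50, 50],
})
included = ["outdoor"]
candidates = ["accoms", "snack"]

# Index-aligned merge, as in the cell above.
merged = [
    pd.merge(toy[included], toy[col], left_index=True, right_index=True)
    for col in candidates
]
# Direct column selection: no merge needed when the index is shared.
selected = [toy[included + [col]] for col in candidates]

# Compare column order and values of the two constructions.
same = all(
    list(m.columns) == list(s.columns) and (m.to_numpy() == s.to_numpy()).all()
    for m, s in zip(merged, selected)
)
print(same)
```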

In [99]:
# Show what we have set up

for i in range(len(X)):
    print (f"X[{i}]: {X[i]}")

print(len(X))

X[0]:      const  outdoor  accoms
0      1.0        1       0
1      1.0        0      50
2      1.0        7      50
3      1.0        1      10
4      1.0        1      10
..     ...      ...     ...
240    1.0        0       4
241    1.0        6      13
242    1.0        0       2
243    1.0        0       4
244    1.0        3       5

[245 rows x 3 columns]
X[1]:      const  outdoor  biketrail
0      1.0        1          0
1      1.0        0          0
2      1.0        7          0
3      1.0        1          1
4      1.0        1          0
..     ...      ...        ...
240    1.0        0          0
241    1.0        6          0
242    1.0        0          0
243    1.0        0          0
244    1.0        3          0

[245 rows x 3 columns]
X[2]:      const  outdoor  park
0      1.0        1     0
1      1.0        0    23
2      1.0        7    22
3      1.0        1    16
4      1.0        1     0
..     ...      ...   ...
240    1.0        0     9
241    1.0        

In [100]:
# Re-create all the "stuff": Models, Results, Adj_R2 and Pval for each combination of Outdoor and other variables

# Create a list of Models
Models = [sm.OLS(y,x) for x in X]

# Create list of Results
Results = [model.fit() for model in Models]

# Create a list of Adj_Rsquared values for each model
Adj_Rsquared = [results.rsquared_adj for results in Results]

# Create list of p-values for each model
Pval = [results.pvalues for results in Results]

# Print out the results:
for i in range(len(Adj_Rsquared)):
     print(f'adj_R2: {Adj_Rsquared[i]:.3f},\tP-values: {*Pval[i],},\t\tcolumn: {remaining_var.columns[i]}')

adj_R2: 0.081,	P-values: (8.714626913656873e-100, 5.282092213467692e-06, 0.3296080051089667),		column: accoms
adj_R2: 0.080,	P-values: (2.3238710752667275e-101, 1.5469448743598226e-05, 0.41758485492202646),		column: biketrail
adj_R2: 0.088,	P-values: (6.091218205040602e-67, 9.559813885169966e-07, 0.09024940803327114),		column: park
adj_R2: 0.077,	P-values: (1.0173095268852584e-99, 2.6609647004021233e-05, 0.875515023636293),		column: sightsee
adj_R2: 0.100,	P-values: (2.5528412740222574e-68, 9.162179131087871e-07, 0.01456245835721607),		column: snack
adj_R2: 0.083,	P-values: (5.24193101169686e-77, 1.3004892303643901e-05, 0.23066627334341067),		column: min_distance


The best adj_R2 is higher than before, though still weak.  The best one now corresponds to "snack".  Next, I will keep "outdoor" and "snack" in the model and add each of the remaining variables to it, one at a time.

For this run, the best adjusted R2 was 0.100.

In [69]:
df.columns

Index(['slots', 'accoms', 'biketrail', 'outdoor', 'park', 'sightsee', 'snack',
       'min_distance'],
      dtype='object')

In [70]:
remaining_var = df.drop(['slots', 'outdoor', 'snack'], axis = 1)
remaining_var.head()

Unnamed: 0,accoms,biketrail,park,sightsee,min_distance
0,0,0,0,2,96.0
1,50,0,23,6,29.0
2,50,0,22,22,59.0
3,10,1,16,2,128.0
4,10,0,0,2,74.0


In [71]:
included_df = df[['outdoor', 'snack']]

In [72]:
# Create a new X which is a list of all X's for model building, with constants
X = [sm.add_constant(pd.merge(included_df, remaining_var[column], right_index=True, left_index=True)) for column in remaining_var.columns]

In [73]:
# Show what we have set up

for i in range(len(X)):
    print (f"X[{i}]: {X[i]}")

X[0]:      const  outdoor  snack  accoms
0      1.0        1     47       0
1      1.0        0     50      50
2      1.0        7     50      50
3      1.0        1      0      10
4      1.0        1     36      10
..     ...      ...    ...     ...
240    1.0        0     10       4
241    1.0        6     50      13
242    1.0        0      3       2
243    1.0        0      6       4
244    1.0        3     19       5

[245 rows x 4 columns]
X[1]:      const  outdoor  snack  biketrail
0      1.0        1     47          0
1      1.0        0     50          0
2      1.0        7     50          0
3      1.0        1      0          1
4      1.0        1     36          0
..     ...      ...    ...        ...
240    1.0        0     10          0
241    1.0        6     50          0
242    1.0        0      3          0
243    1.0        0      6          0
244    1.0        3     19          0

[245 rows x 4 columns]
X[2]:      const  outdoor  snack  park
0      1.0        1     4

In [74]:
# Re-create all the "stuff": Models, Results, Adj_R2 and Pval for each combination of outdoor PLUS snack, and other variables

# Create a list of Models
Models = [sm.OLS(y,x) for x in X]

# Create list of Results
Results = [model.fit() for model in Models]

# Create a list of Adj_Rsquared values for each model
Adj_Rsquared = [results.rsquared_adj for results in Results]

# Create list of p-values for each model
Pval = [results.pvalues for results in Results]

# Print out the results:
for i in range(len(Adj_Rsquared)):
          print(f'adj_R2: {Adj_Rsquared[i]:.3f},\tP-values: {*Pval[i],},\t\tcolumn: {remaining_var.columns[i]}')

adj_R2: 0.096,	P-values: (1.8737219232141567e-67, 1.837433412467784e-05, 0.024956994061091305, 0.8637559010096311),		column: accoms
adj_R2: 0.099,	P-values: (1.4776261091056848e-66, 4.579990921001211e-06, 0.013692854235181507, 0.37477637185920276),		column: biketrail
adj_R2: 0.100,	P-values: (5.947515873576809e-59, 5.329234489337747e-07, 0.03968485322116129, 0.2845100806792245),		column: park
adj_R2: 0.102,	P-values: (2.7507509132929745e-68, 4.589591041219631e-05, 0.00593084180776688, 0.19872898606358624),		column: sightsee
adj_R2: 0.108,	P-values: (1.7630596442838366e-54, 4.222961343995133e-06, 0.00539503099783647, 0.07235017352529804),		column: min_distance


The best adj_R2 is again higher than before, though still weak.  The best one now corresponds to "min_distance".  Next, I will keep "outdoor", "snack" and "min_distance" in the model and add each of the remaining variables to it, one at a time.

For this run, the best adjusted R2 was 0.108.

In [76]:
remaining_var = df.drop(['slots', 'outdoor', 'snack', 'min_distance'], axis = 1)
remaining_var.head()

Unnamed: 0,accoms,biketrail,park,sightsee
0,0,0,0,2
1,50,0,23,6
2,50,0,22,22
3,10,1,16,2
4,10,0,0,2


In [77]:
included_df = df[['outdoor', 'snack', 'min_distance']]

In [78]:
# Create a new X which is a list of all X's for model building, with constants
X = [sm.add_constant(pd.merge(included_df, remaining_var[column], right_index=True, left_index=True)) for column in remaining_var.columns]

In [80]:
# Show what we have set up

for i in range(len(X)):
    print (f"X[{i}]: {X[i]}")

X[0]:      const  outdoor  snack  min_distance  accoms
0      1.0        1     47          96.0       0
1      1.0        0     50          29.0      50
2      1.0        7     50          59.0      50
3      1.0        1      0         128.0      10
4      1.0        1     36          74.0      10
..     ...      ...    ...           ...     ...
240    1.0        0     10         189.0       4
241    1.0        6     50          35.0      13
242    1.0        0      3          86.0       2
243    1.0        0      6         246.0       4
244    1.0        3     19          67.0       5

[245 rows x 5 columns]
X[1]:      const  outdoor  snack  min_distance  biketrail
0      1.0        1     47          96.0          0
1      1.0        0     50          29.0          0
2      1.0        7     50          59.0          0
3      1.0        1      0         128.0          1
4      1.0        1     36          74.0          0
..     ...      ...    ...           ...        ...
240    1.0  

In [81]:
# Re-create all the "stuff": Models, Results, Adj_R2 and Pval for each combination of outdoor PLUS snack PLUS min_distance, and other variables

# Create a list of Models
Models = [sm.OLS(y,x) for x in X]

# Create list of Results
Results = [model.fit() for model in Models]

# Create a list of Adj_Rsquared values for each model
Adj_Rsquared = [results.rsquared_adj for results in Results]

# Create list of p-values for each model
Pval = [results.pvalues for results in Results]

# Print out the results:
for i in range(len(Adj_Rsquared)):
          print(f'adj_R2: {Adj_Rsquared[i]:.3f},\tP-values: {*Pval[i],},\t\tcolumn: {remaining_var.columns[i]}')

adj_R2: 0.104,	P-values: (3.2099127732331563e-54, 3.6633394158086886e-05, 0.012867273341744203, 0.07423482704219862, 0.9792458831307089),		column: accoms
adj_R2: 0.107,	P-values: (1.1856003899502093e-53, 1.8928051793426897e-05, 0.005004679970170248, 0.0707483303247529, 0.36094313544405154),		column: biketrail
adj_R2: 0.111,	P-values: (2.7225445574046622e-48, 1.9148399157013025e-06, 0.017236970583423952, 0.05106041254780776, 0.18822051474939502),		column: park
adj_R2: 0.110,	P-values: (1.4732438453669472e-54, 0.000115104946110371, 0.002555217032966608, 0.08525483476706415, 0.2382136095610618),		column: sightsee


The best adj_R2 is still improving, but the gains are starting to plateau (recall that the best adjusted R2 from the last run was 0.108).  This will probably be the last run, but we will add one more variable.  The best one corresponds to "park", at 0.111 (barely above "sightsee" at 0.110).  Next, I will keep "outdoor", "snack", "min_distance" and "park" in the model and add each of the remaining variables to it, one at a time.

For this run, the best adjusted R2 was 0.111.

In [83]:
remaining_var = df.drop(['slots', 'outdoor', 'snack', 'min_distance', 'park'], axis = 1)
remaining_var.head()

Unnamed: 0,accoms,biketrail,sightsee
0,0,0,2
1,50,0,6
2,50,0,22
3,10,1,2
4,10,0,2


In [84]:
included_df = df[['outdoor', 'snack', 'min_distance', 'park']]

In [85]:
# Create a new X which is a list of all X's for model building, with constants
X = [sm.add_constant(pd.merge(included_df, remaining_var[column], right_index=True, left_index=True)) for column in remaining_var.columns]

In [86]:
# Show what we have set up

for i in range(len(X)):
    print (f"X[{i}]: {X[i]}")

X[0]:      const  outdoor  snack  min_distance  park  accoms
0      1.0        1     47          96.0     0       0
1      1.0        0     50          29.0    23      50
2      1.0        7     50          59.0    22      50
3      1.0        1      0         128.0    16      10
4      1.0        1     36          74.0     0      10
..     ...      ...    ...           ...   ...     ...
240    1.0        0     10         189.0     9       4
241    1.0        6     50          35.0    29      13
242    1.0        0      3          86.0     3       2
243    1.0        0      6         246.0     3       4
244    1.0        3     19          67.0     7       5

[245 rows x 6 columns]
X[1]:      const  outdoor  snack  min_distance  park  biketrail
0      1.0        1     47          96.0     0          0
1      1.0        0     50          29.0    23          0
2      1.0        7     50          59.0    22          0
3      1.0        1      0         128.0    16          1
4      1.0    

In [87]:
# Re-create all the "stuff": Models, Results, Adj_R2 and Pval for each combination of outdoor PLUS snack PLUS min_distance PLUS park, and other variables

# Create a list of Models
Models = [sm.OLS(y,x) for x in X]

# Create list of Results
Results = [model.fit() for model in Models]

# Create a list of Adj_Rsquared values for each model
Adj_Rsquared = [results.rsquared_adj for results in Results]

# Create list of p-values for each model
Pval = [results.pvalues for results in Results]

# Print out the results:
for i in range(len(Adj_Rsquared)):
          print(f'adj_R2: {Adj_Rsquared[i]:.3f},\tP-values: {*Pval[i],},\t\tcolumn: {remaining_var.columns[i]}')


adj_R2: 0.107,	P-values: (8.873579070974956e-48, 2.2277550270875802e-05, 0.023777064099365686, 0.05468097387934629, 0.1839294789934922, 0.8376269333366955),		column: accoms
adj_R2: 0.110,	P-values: (7.416788778088541e-48, 8.917014794616491e-06, 0.015975044218921312, 0.050281848293009544, 0.19630761545924952, 0.37821494870409134),		column: biketrail
adj_R2: 0.113,	P-values: (1.9884946019889366e-48, 5.6194173018598266e-05, 0.007640890797539014, 0.060179522270175555, 0.17205386651060156, 0.21671073856886824),		column: sightsee


The adjusted R2 values are no longer improving meaningfully (0.113 at best, versus 0.111 in the previous run), so we will stop here and keep the model as is, with these variables:
Y = number of slots of a bike station
X = number of outdoor POIs, number of places to stop for a snack (coffee/tea/juice), minimum distance to the nearest POI of any type, and number of parks -- within a 1 km radius of the station.

In [88]:
# Let's do our linear regression with those 4 variables: outdoor, snack, min_distance, and park.

y = df['slots']
X = df[['outdoor', 'snack', 'min_distance', 'park']]

X = sm.add_constant(X) # adding a constant
model = sm.OLS(y,X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  slots   R-squared:                       0.125
Model:                            OLS   Adj. R-squared:                  0.111
Method:                 Least Squares   F-statistic:                     8.599
Date:                Sat, 21 Oct 2023   Prob (F-statistic):           1.68e-06
Time:                        00:10:01   Log-Likelihood:                -745.84
No. Observations:                 245   AIC:                             1502.
Df Residuals:                     240   BIC:                             1519.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           20.9344      1.128     18.561   

Interestingly, despite the work done earlier to choose the variables that give the best model, the results above show that park is not a significant contributor to Y.  Let's drop it.  We may need to drop min_distance (borderline p-value) as well.

In [90]:
# Let's do our linear regression with 3 variables: outdoor, snack, min_distance.

y = df['slots']
X = df[['outdoor', 'snack', 'min_distance']]

X = sm.add_constant(X) # adding a constant
model = sm.OLS(y,X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  slots   R-squared:                       0.119
Model:                            OLS   Adj. R-squared:                  0.108
Method:                 Least Squares   F-statistic:                     10.85
Date:                Sat, 21 Oct 2023   Prob (F-statistic):           1.03e-06
Time:                        00:12:35   Log-Likelihood:                -746.73
No. Observations:                 245   AIC:                             1501.
Df Residuals:                     241   BIC:                             1515.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           20.2179      0.990     20.420   

Once 'park' is dropped, min_distance's coefficient (which is small at -0.0076 to begin with) is not significantly different from 0.  We should drop min_distance as well.  This tidies up the model, and it avoids requiring a variable that has little impact on the prediction, which could prevent others from using the model if they do not have that piece of data.
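
The gap between R-squared and adjusted R-squared in these summaries comes from the penalty for the number of predictors. As a rough check, the standard formula for a model with an intercept can be applied to the rounded values printed in the three-variable summary; the helper name `adjusted_r2` is introduced here for illustration only.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model with an intercept.

    adj = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    """
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read from the three-variable summary above: R-squared 0.119,
# 245 observations, 3 predictors. The inputs are rounded, so the result
# is approximate.
three_var = adjusted_r2(0.119, 245, 3)
print(f"{three_var:.3f}")  # prints 0.108
```

The penalty grows with each added predictor, which is why a variable that barely raises R-squared can leave adjusted R-squared nearly unchanged.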

In [91]:
# Let's do our linear regression with 2 variables: outdoor, snack.

y = df['slots']
X = df[['outdoor', 'snack']]

X = sm.add_constant(X) # adding a constant
model = sm.OLS(y,X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  slots   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.100
Method:                 Least Squares   F-statistic:                     14.51
Date:                Sat, 21 Oct 2023   Prob (F-statistic):           1.12e-06
Time:                        00:12:59   Log-Likelihood:                -748.37
No. Observations:                 245   AIC:                             1503.
Df Residuals:                     242   BIC:                             1513.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         19.0876      0.770     24.775      0.0

In [112]:
# Print the precise p-values:
print(results.pvalues)

const      2.552841e-68
outdoor    9.162179e-07
snack      1.456246e-02
dtype: float64


Provide model output and an interpretation of the results. 

In [92]:
# See above for model output.

## Interpretation of the results:

- An Adjusted R-Squared value of only 0.100 indicates that this linear regression model does not explain much of the variance in the dependent variable, "number of slots for each bike station" (or number of bikes that are "rentable" at each bike station).
- The low Adjusted R-Squared value indicates there may be other features/factors that are not in the model that contribute to the variance in the "number of slots".
- Only approximately 10% of the variance in the dependent variable is explained by "outdoor" (number of POIs categorized as "outdoor") and "snack" (number of POIs categorized as "snack") variables.
- Stated another way, this model is not a good fit for our data (Adjusted R-Squared is a measure of "goodness of fit").
- That being said, the p-values for "outdoor" (p-value = 0.000000916) and "snack" (p-value = 0.0146) are less than 0.01 and 0.05, respectively.
    - This means that there is strong reason to reject the "null hypothesis", which is that the coefficients associated with these variables, in the linear equation relating number of slots to number of outdoor POIs and number of snack POIs, are 0.  In other words, both coefficients are statistically distinguishable from 0.
    - **If** the model was doing a good job of explaining the variance in the dependent variable, the coefficients for "outdoor" and "snack" indicate:
        - for every 1 increase in number of "outdoor" POIs, there is a 0.67 increase in number of bike slots
        - for every 1 increase in number of "snack" POIs, there is a 0.05 *decrease* in number of bike slots

- The current model states:  the number of slots in a given bike station, Y, is influenced by both:
    - the number of POIs with the "outdoor" classification (i.e. Landmark, Beach, Historic Site, Scenic Lookout) within a 1 km radius, and
    - the number of POIs with the "snack" classification (i.e. coffee, tea, or juice outfits)

- If the linear regression line was represented as an equation, it would be something like:
    - $num\_slots = b_0+b_1(num\_outdoor\_POI)+b_2(num\_snack\_POI) \ ,\ where \ b_0=19.0876, \ b_1=0.6735, \ b_2=-0.0496$
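
To see what the equation implies for one station, here is a small sketch that plugs in the coefficients quoted above. The function name `predict_slots` is an assumption made for illustration; the inputs are the first station shown in `df.head()` (outdoor = 1, snack = 47, actual slots = 36).

```python
# Coefficients as quoted in the equation above.
B0, B1, B2 = 19.0876, 0.6735, -0.0496


def predict_slots(num_outdoor: int, num_snack: int) -> float:
    """Predicted number of slots from the two-variable linear model."""
    return B0 + B1 * num_outdoor + B2 * num_snack


# First station in df.head(): outdoor=1, snack=47, actual slots=36.
estimate = predict_slots(1, 47)
print(f"predicted {estimate:.2f} vs actual 36")  # predicted 17.43 vs actual 36
```

The large gap between prediction and actual value for this station is consistent with the low adjusted R-squared discussed above.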

------------

#### Side Note: Commentary on the Forward Selection Model Evaluation Approach Used:

Interestingly, we might have seen earlier that, after adding the first 2 x variables ("outdoor" and "snack", with an adj.R^2 of 0.100), it wasn't worth adding further x variables to the regression.  Recall the results below, from adding each of the other variables to the model with "outdoor" plus "snack":  even adding "min_distance" (giving 3 variables:  "outdoor", "snack" and "min_distance") raised the adj.R2 only slightly, from 0.100 to 0.108.  We could have "quit" at that point and used "outdoor" and "snack" as the only 2 x variables in the model.  This is the same conclusion I just reached above.  So I almost did a combination of forward *and* backward selection!

##### Earlier results (copied from above) of additional columns added to the model with "outdoor" and "snack" already in it:
- adj_R2: 0.096,	P-values: (1.8737219232141567e-67, 1.837433412467784e-05, 0.024956994061091305, 0.8637559010096311),		column: accoms
- adj_R2: 0.099,	P-values: (1.4776261091056848e-66, 4.579990921001211e-06, 0.013692854235181507, 0.37477637185920276),		column: biketrail
- adj_R2: 0.100,	P-values: (5.947515873576809e-59, 5.329234489337747e-07, 0.03968485322116129, 0.2845100806792245),		    column: park
- adj_R2: 0.102,	P-values: (2.7507509132929745e-68, 4.589591041219631e-05, 0.00593084180776688, 0.19872898606358624),		column: sightsee
- adj_R2: 0.108,	P-values: (1.7630596442838366e-54, 4.222961343995133e-06, 0.00539503099783647, 0.07235017352529804),		column: min_distance

# Stretch

Finally, execute a query to get all the POI information for a single station, randomly choosing station_id = 'cc25ae4f093b33ba0afd1dbc0dd20324', to demonstrate a JOIN via SQL executed from Python.
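
A minimal sketch of such a query, using Python's built-in `sqlite3` with an in-memory database. The table names (`stations`, `pois`), their columns, and every inserted row are placeholders invented for this illustration; only the station_id comes from the text above, and the project's actual schema may differ.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stations (id TEXT PRIMARY KEY, name TEXT, slots INTEGER);
    CREATE TABLE pois (
        poi_id INTEGER PRIMARY KEY,
        station_id TEXT,
        category TEXT,
        distance REAL
    );
    """
)

station_id = "cc25ae4f093b33ba0afd1dbc0dd20324"

# Placeholder rows, not real data.
conn.execute(
    "INSERT INTO stations VALUES (?, ?, ?)", (station_id, "Example Station", 16)
)
conn.executemany(
    "INSERT INTO pois (station_id, category, distance) VALUES (?, ?, ?)",
    [
        (station_id, "outdoor", 120.0),
        (station_id, "snack", 45.0),
        ("other-station", "park", 300.0),
    ],
)

# JOIN the station to its POIs; the parameter keeps the id out of the SQL string.
rows = conn.execute(
    """
    SELECT s.name, p.category, p.distance
    FROM stations AS s
    JOIN pois AS p ON p.station_id = s.id
    WHERE s.id = ?
    ORDER BY p.distance
    """,
    (station_id,),
).fetchall()
conn.close()

for row in rows:
    print(row)
```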

How can you turn the regression model into a classification model?

Answer:  You can turn the "thing to predict" into a categorical result.  The easiest way to turn this into a classification model is to make the prediction "over" or "under" a limit.  For example, the question could be "Will the number of slots at the station be < 15 or >= 15?".  Then, the x variables in the model would determine the probability that the number of slots at the station is >= 15 (for example, Y encoded as '1') rather than < 15 (for example, Y encoded as '0').

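With a two-category outcome like this, we would use logistic regression (Logit in statsmodels) to build and evaluate the model, and it accepts multiple x's.  Multinomial regression (MNLogit) would be needed only if the outcome had more than two categories.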