### CAPP 30122 Project Food in the Hood - Regression Analysis

In this part of the project, we use regression to examine the statistical relationship between crime rates and other demographic and health-related variables. The purpose of this section is to explore the topic from an angle different from the heatmap and decision tree analyses. The main question is: which model best explains the crime rate? We address it by fitting and comparing three regression models.

A multiple regression usually has the form $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
where $X_1, X_2, \dots, X_n$ represent the main effects. In some cases, interactions between main effects are also included; however, we chose not to include them because they add little to our analysis.

We use two pairs of criteria to compare the goodness of fit of our models. The first pair is $R^2$ and $R_{adj}^2$, both of which measure the proportion of variance in $Y$ explained by the predictors $X_s$. Compared to the original $R^2$, $R_{adj}^2$ penalizes the number of predictors to strike a balance between model simplicity/interpretability and model performance. For both, the larger the value, the better the model.
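As a rough illustration of that penalty, the standard adjusted $R^2$ formula for a model with an intercept, $R_{adj}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}$, can be sketched in a few lines. The helper name `adjusted_r2` is our own invention for this example; the inputs are the rounded $R^2$ values, sample size and predictor counts reported in the first two regression summaries of this notebook.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 for a model with an intercept (hypothetical helper)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Inputs copied from the summaries of the first and second models (n = 72).
full_model = adjusted_r2(0.646, n_obs=72, n_predictors=5)
health_model = adjusted_r2(0.206, n_obs=72, n_predictors=3)

print(round(full_model, 3), round(health_model, 3))  # 0.619 0.171
```

Both numbers agree with the `Adj. R-squared` entries in the corresponding summaries, which shows how the same $R^2$ is discounted more heavily as predictors are added.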

The second pair is $AIC$ and $BIC$. AIC stands for Akaike Information Criterion and is defined as $$AIC(M) = -2\log L(M) + 2p(M)$$

BIC stands for Bayesian Information Criterion and is defined as $$BIC(M) = -2\log L(M) + p(M)\log n$$
where $M$ is the model, $\log L(M)$ is its log-likelihood, $p(M)$ is its number of estimated parameters, and $n$ is the number of observations.

For AIC and BIC, the smaller the value, the better the model.
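The two formulas can be checked by hand against the regression summaries in this notebook. The sketch below assumes that $p(M)$ counts the estimated regression coefficients (including the intercept when there is one), which is consistent with the reported values; the function names `aic` and `bic` are ours.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """AIC(M) = -2 log L(M) + 2 p(M)."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """BIC(M) = -2 log L(M) + p(M) log n."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# (log-likelihood, number of coefficients) copied from the three summaries.
models = {
    "all predictors": (-541.46, 6),  # 5 predictors + intercept
    "health only": (-570.51, 4),     # 3 predictors + intercept
    "poverty + soda": (-542.69, 2),  # 2 predictors, no intercept
}

for name, (log_lik, k) in models.items():
    print(name, round(aic(log_lik, k)), round(bic(log_lik, k, n_obs=72)))
```

Because $\log 72 > 2$, BIC charges more per parameter than AIC here, so it favors smaller models more strongly.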

Now we begin our analysis, starting with loading packages. We primarily use the `statsmodels` package because its output is comprehensive and similar to the `lm` summary table in R.

In [2]:
from sklearn import datasets, linear_model 
from sklearn.linear_model import LinearRegression 
import statsmodels.api as sm
from scipy import stats
import pandas as pd 

We use the dataset `food_data.csv`, which was merged from two datasets scraped from the Chicago Data Portal and the Chicago Health Atlas. The dataset is already cleaned; therefore, no further data wrangling is needed.

In [4]:
data_name = "../data/food_data.csv"
data = pd.read_csv(data_name)

In the first model, we include all of our predictors on the right-hand side of the regression formula to look for statistically significant relationships between them and our outcome variable, crime rate.

In [5]:
X = data[["population", "poverty_rate",  "low_food_access", "adult_fruit_and_vegetable_servings_rate", "adult_soda_consumption_rate"]]
y = data["crime_rate"]

X2 = sm.add_constant(X) 
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

                            OLS Regression Results                            
Dep. Variable:             crime_rate   R-squared:                       0.646
Model:                            OLS   Adj. R-squared:                  0.619
Method:                 Least Squares   F-statistic:                     24.04
Date:                Tue, 15 Mar 2022   Prob (F-statistic):           1.09e-13
Time:                        15:22:38   Log-Likelihood:                -541.46
No. Observations:                  72   AIC:                             1095.
Df Residuals:                      66   BIC:                             1109.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                                              coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------

The model explains $61.9\%$ ($R_{adj}^2$) of the variance in crime rates. Surprisingly, only poverty rate is a statistically significant predictor of a neighborhood's crime rate ($p$ < .05). This might be caused by internal validity issues ("pseudo-zeros"), so we continue our analysis. In the second model, we include only the health-related predictors.

In [8]:
X3 = sm.add_constant(X.iloc[:, 2:5]) 
est2 = sm.OLS(y, X3)
est_fit = est2.fit()
print(est_fit.summary())

                            OLS Regression Results                            
Dep. Variable:             crime_rate   R-squared:                       0.206
Model:                            OLS   Adj. R-squared:                  0.171
Method:                 Least Squares   F-statistic:                     5.869
Date:                Wed, 16 Mar 2022   Prob (F-statistic):            0.00126
Time:                        10:34:04   Log-Likelihood:                -570.51
No. Observations:                  72   AIC:                             1149.
Df Residuals:                      68   BIC:                             1158.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                                              coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------

Only the adult soda consumption rate is statistically significant, though its p-value is very close to our threshold ($p$ = .05). Soda consumption rate appears to be positively associated with crime rate. This raises the question of whether soda consumption acts as an intermediate variable or confounder in the process (i.e., poorer neighborhoods consume more soda and also experience more crime $\rightarrow$ soda appears related to crime). This model is inferior to our first one: its $R^2$ (0.206 vs. 0.646) and $R_{adj}^2$ (0.171 vs. 0.619) are much lower, and its $AIC$ (1149 vs. 1095) and $BIC$ (1158 vs. 1109) are higher.

In our final model, we use only poverty rate and soda consumption rate as predictors.

In [9]:
X4 = data[["poverty_rate",  "adult_soda_consumption_rate"]]

est3 = sm.OLS(y, X4)
est_fit2 = est3.fit()
print(est_fit2.summary())

                                 OLS Regression Results                                
Dep. Variable:             crime_rate   R-squared (uncentered):                   0.884
Model:                            OLS   Adj. R-squared (uncentered):              0.880
Method:                 Least Squares   F-statistic:                              265.5
Date:                Wed, 16 Mar 2022   Prob (F-statistic):                    2.07e-33
Time:                        10:39:31   Log-Likelihood:                         -542.69
No. Observations:                  72   AIC:                                      1089.
Df Residuals:                      70   BIC:                                      1094.
Df Model:                           2                                                  
Covariance Type:            nonrobust                                                  
                                  coef    std err          t      P>|t|      [0.025      0.975]
------------------------

Interestingly, soda consumption is again not significant. However, this model has the lowest $AIC$ (1089) and $BIC$ (1094) of the three. Its $R^2$ and $R_{adj}^2$ are reported as uncentered because no constant was added to the predictors, so they are not directly comparable with those of the first two models. Why soda consumption switches between significant and insignificant should be examined further in future research.

Because this model has the lowest $AIC$ and $BIC$ of the three, we regard it as our final model.