# In this notebook, we use different food access factors to predict outcomes related to food deserts in order to identify the most important features for labelling a food desert. #

In [30]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [8]:
food_obesity = pd.read_csv('./data/food_obesity.csv')

In [14]:
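# Ratio of the tract's SNAP count to its 2010 population; despite the name, this is a fraction, not a value scaled to 100
food_obesity['TractSNAP_percent'] = food_obesity['TractSNAP'] / food_obesity['Pop2010']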

### Predicting obesity with 1-mile food access data ###

In [15]:
cols_1_mile = ['LowIncomeTracts', 'LATracts1', 'HUNVFlag', 'TractSNAP_percent']

In [20]:
X = food_obesity[cols_1_mile]
y = food_obesity['HCSOBP_2016-2018']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3224)
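
`train_test_split` is called here without `test_size`, so scikit-learn's default of a 25% test split applies, with the test count rounded up. The sketch below reproduces that arithmetic to show how a test set of 196 rows (the size used for the baseline predictions later in this notebook) can arise. The helper name `split_sizes` and the row count 784 are illustrative assumptions, not values read from `food_obesity.csv`.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test split, rounding the test count up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# 784 is a hypothetical row count chosen for illustration
print(split_sizes(784))
```

Any row count from 781 to 784 gives a 196-row test set under this rule.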

In [22]:
linreg = LinearRegression()

In [23]:
linreg.fit(X_train, y_train)

LinearRegression()

In [24]:
linreg.score(X_train, y_train)

0.32764373549260173

In [25]:
linreg.score(X_test, y_test)

0.288560034822962

In [26]:
preds = linreg.predict(X_test)

In [28]:
rmse_regression = mean_squared_error(y_test, preds)**0.5

In [31]:
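# Constant baseline: predict the test-set mean for every row (len(y_test) is 196 for this split)
base_preds = np.full(len(y_test), y_test.mean())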

In [33]:
rmse_base = mean_squared_error(y_test, base_preds)**0.5

In [34]:
print(f'Baseline RMSE: {rmse_base}')
print(f'Regression RMSE: {rmse_regression}')
print(f'Improvement over baseline: {rmse_base - rmse_regression}')
print(f'Proportional improvement: {(rmse_base - rmse_regression)/rmse_base}')

Baseline RMSE: 11.0220858983688
Regression RMSE: 9.296787843894386
Improvement over baseline: 1.7252980544744148
Proportional improvement: 0.1565309933512447


We see about a 16% improvement in RMSE over the baseline.
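
Because the baseline predicts the test-set mean, the proportional RMSE improvement follows directly from the test R² reported above: R² = 1 - MSE_model / MSE_base, so the improvement equals 1 - sqrt(1 - R²). The short check below uses the test score printed earlier; the function name is ours and is not part of the notebook.

```python
import math

def rmse_improvement_from_r2(r2):
    """Proportional RMSE improvement over a mean-of-test baseline, given test R^2."""
    return 1 - math.sqrt(1 - r2)

r2_test = 0.288560034822962  # test score printed by linreg.score(X_test, y_test)
print(rmse_improvement_from_r2(r2_test))
```

The result agrees with the `Proportional improvement` line printed above, so the two metrics carry the same information for this baseline.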

In [35]:
pd.DataFrame(zip(linreg.feature_names_in_, linreg.coef_), columns=['feature', 'coefficient']).sort_values(by='coefficient', ascending=False)

|   | feature | coefficient |
|---|---|---|
| 3 | TractSNAP_percent | 16.266496 |
| 0 | LowIncomeTracts | 9.766421 |
| 2 | HUNVFlag | 4.887343 |
| 1 | LATracts1 | -0.475046 |


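SNAP percentage and the Low Income flag have the largest coefficients, which suggests they are the most useful predictors. The 1-mile low-access flag (`LATracts1`) has a coefficient near zero (-0.48), so food access within 1 mile appears to be a poor predictor.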
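
The coefficients above are on each feature's raw scale: the flag columns are presumably 0/1 indicators, while `TractSNAP_percent` is a ratio with a much narrower spread, so a one-unit change means different things for each. One common way to compare predictors is to z-score the features before fitting. The sketch below illustrates this on synthetic stand-in data, not on `food_obesity`; the helper `standardized_coefs` and the generated columns are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def standardized_coefs(X, y):
    """Fit OLS on z-scored features so the coefficients share one scale."""
    X_scaled = StandardScaler().fit_transform(X)
    return LinearRegression().fit(X_scaled, y).coef_

# Synthetic stand-in data (NOT the food_obesity data): a 0/1 flag and a narrow-range ratio
rng = np.random.default_rng(0)
flag = rng.integers(0, 2, size=500)
ratio = rng.uniform(0.0, 0.3, size=500)
y = 5.0 * flag + 20.0 * ratio + rng.normal(0.0, 1.0, size=500)
X = np.column_stack([flag, ratio])

raw = LinearRegression().fit(X, y).coef_
scaled = standardized_coefs(X, y)
print("raw coefficients:         ", raw)
print("standardized coefficients:", scaled)
```

In this synthetic example the ratio has the larger raw coefficient, but the flag has the larger standardized one. Running the same comparison on the notebook's features would show whether the ranking above depends on feature scale.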

### Predicting obesity with half-mile food access data ###

In [36]:
cols_half_mile = ['LowIncomeTracts', 'LATracts_half', 'HUNVFlag', 'TractSNAP_percent']

In [37]:
X = food_obesity[cols_half_mile]
y = food_obesity['HCSOBP_2016-2018']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3224)

In [38]:
linreg.fit(X_train, y_train)

LinearRegression()

In [39]:
linreg.score(X_train, y_train)

0.3783168689440378

In [40]:
linreg.score(X_test, y_test)

0.32958420271894395

In [41]:
preds = linreg.predict(X_test)

In [42]:
rmse_regression = mean_squared_error(y_test, preds)**0.5

In [43]:
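# Constant baseline: predict the test-set mean for every row (len(y_test) is 196 for this split)
base_preds = np.full(len(y_test), y_test.mean())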

In [44]:
rmse_base = mean_squared_error(y_test, base_preds)**0.5

In [45]:
print(f'Baseline RMSE: {rmse_base}')
print(f'Regression RMSE: {rmse_regression}')
print(f'Improvement over baseline: {rmse_base - rmse_regression}')
print(f'Proportional improvement: {(rmse_base - rmse_regression)/rmse_base}')

Baseline RMSE: 11.0220858983688
Regression RMSE: 9.024765186122835
Improvement over baseline: 1.9973207122459655
Proportional improvement: 0.181210773592461


We see about 18% improvement over baseline.

In [46]:
pd.DataFrame(zip(linreg.feature_names_in_, linreg.coef_), columns=['feature', 'coefficient']).sort_values(by='coefficient', ascending=False)

|   | feature | coefficient |
|---|---|---|
| 3 | TractSNAP_percent | 17.591663 |
| 0 | LowIncomeTracts | 10.07391 |
| 1 | LATracts_half | 6.430744 |
| 2 | HUNVFlag | -0.246612 |


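SNAP percentage and the Low Income flag again have the largest coefficients. Food access within a half mile (`LATracts_half`, coefficient 6.43) appears to be a much better predictor than access within 1 mile (`LATracts1`, coefficient -0.48), and the test R² rises from 0.29 to 0.33. The `HUNVFlag` coefficient drops from 4.89 to -0.25 in this model, which may indicate that it overlaps with the half-mile access flag.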
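
The two sections above repeat the same split, fit, and scoring steps with different column lists. A minimal sketch of a helper that wraps those steps follows. The name `evaluate_features` and the synthetic demo frame are hypothetical, and the baseline uses the test-set mean as in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def evaluate_features(df, cols, target, random_state=3224):
    """Fit a linear regression on `cols` and report test R^2 and RMSE figures."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[cols], df[target], random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    rmse_model = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    # Baseline: predict the test-set mean for every row
    rmse_base = mean_squared_error(y_test, np.full(len(y_test), y_test.mean())) ** 0.5
    return {
        "test_r2": model.score(X_test, y_test),
        "rmse_model": rmse_model,
        "rmse_base": rmse_base,
        "improvement": (rmse_base - rmse_model) / rmse_base,
    }

# Synthetic demo frame (hypothetical columns "a", "b", "t"), only to show the call
rng = np.random.default_rng(1)
demo = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
demo["t"] = 3 * demo["a"] + rng.normal(scale=0.1, size=200)
summary = pd.DataFrame({
    "a only": evaluate_features(demo, ["a"], "t"),
    "a and b": evaluate_features(demo, ["a", "b"], "t"),
})
print(summary)
```

In this notebook the same call would take `food_obesity`, either `cols_1_mile` or `cols_half_mile`, and `'HCSOBP_2016-2018'`, giving a side-by-side table of the 1-mile and half-mile models.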