# In this notebook, we use food access factors to predict health outcomes associated with food deserts, in order to identify the most important features for labelling a food desert. #

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:
food_obesity = pd.read_csv('./data/food_obesity.csv')
food_diabetes = pd.read_csv('./data/food_diabetes.csv')

### Section A: A quick regression to see if our food access data predicts obesity rates ###

To get a rough idea of whether the features without NULL values can identify food deserts, let's run a quick, basic regression. We will check whether these features can predict the adult obesity rate, which we expect to be higher in food deserts.

In [3]:
# Drop the Population and Number of Housing Units for this quick fit, as they aren't directly related to food access.
quick_fit_df = food_obesity.drop(columns=['Pop2010', 'OHU2010'])
# Drop the Tract and Community Area columns, as they are only labels now that the tables are merged.
quick_fit_df.drop(columns=['Tract', 'Community Area'], inplace=True)

In [4]:
X = quick_fit_df.drop(columns=['HCSOBP_2016-2018'])
y = quick_fit_df['HCSOBP_2016-2018']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3224)

In [5]:
linreg = LinearRegression()

In [6]:
linreg.fit(X_train, y_train);

In [7]:
linreg.score(X_train, y_train)

0.6005628503983738

In [8]:
linreg.score(X_test, y_test)

0.41480123568922367

In [9]:
preds = linreg.predict(X_test)

In [10]:
rmse_regression = mean_squared_error(y_test, preds)**0.5

In [11]:
y_test.mean()

30.678571428571427

In [12]:
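# Use len(y_test) rather than a hard-coded size so the baseline always matches the test set.
base_preds = np.ones(len(y_test))*y_test.mean()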

In [13]:
rmse_base = mean_squared_error(y_test, base_preds)**0.5

In [14]:
print(f'Baseline RMSE: {rmse_base}')
print(f'Regression RMSE: {rmse_regression}')
print(f'Improvement over baseline: {rmse_base - rmse_regression}')
print(f'Proportional improvement: {(rmse_base - rmse_regression)/rmse_base}')

Baseline RMSE: 11.0220858983688
Regression RMSE: 8.43170670881344
Improvement over baseline: 2.5903791895553603
Proportional improvement: 0.23501714770148183


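This very simple linear fit already shows a predictive relationship between food accessibility and adult obesity rates: the test RMSE is about 23.5% lower than the mean baseline, with a test R2 of about 0.41. The gap between training R2 (0.60) and test R2 (0.41) suggests some overfitting, but the result is promising for using these features to identify food deserts.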
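
The baseline above takes its constant from `y_test.mean()`. A common alternative is to learn the baseline from the training targets only, so that it sees the same information as the regression. The sketch below shows one way to do this with scikit-learn's `DummyRegressor`. It runs on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical, not the notebook's tract data), so its numbers are not comparable to the results above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the notebook's tract data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 30 + 5 * X_demo[:, 0] + rng.normal(scale=3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=3224)

# The baseline learns its constant (the mean) from the training targets only.
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_tr, y_tr)

# Score the constant prediction on the held-out test targets.
rmse_base_train_mean = mean_squared_error(y_te, baseline.predict(X_te)) ** 0.5
print(f'Baseline RMSE (training mean): {rmse_base_train_mean}')
```

Because the test mean minimizes squared error on the test set, a training-mean baseline can only have an equal or slightly higher RMSE, which makes the comparison marginally more favorable to the regression.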

### Section B: Predicting Obesity and Diabetes Rates with Food Access Data ###

In [15]:
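# We expect SNAP participation relative to tract population to be more useful than the raw number.
# Note: the result is a ratio (not multiplied by 100), despite the '_percent' column name.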
food_obesity['TractSNAP_percent'] = food_obesity['TractSNAP'] / food_obesity['Pop2010']
food_diabetes['TractSNAP_percent'] = food_diabetes['TractSNAP'] / food_diabetes['Pop2010']

In [16]:
def linear_fit(df, features, target):
    
    """
    This function performs a simple linear regression on the data in a DataFrame.
    The following metrics are printed: train R2, test R2, baseline RMSE, test RMSE, flat RMSE change, proportional RMSE change
    The coefficient value for each feature is returned in a new DataFrame.
    
    df is the frame containing the data.
    features is a list of the columns to be used as features.
    target is the name of the target column.
    """
    
    # Set up train and test data.
    X = df[features]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3224)
    
    # Fit the model.
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    
    # Calculate test RMSE for the regression and for a mean-prediction baseline.
    preds = linreg.predict(X_test)
    rmse_regression = mean_squared_error(y_test, preds)**0.5
    base_preds = np.ones(len(y_test))*y_test.mean()
    rmse_base = mean_squared_error(y_test, base_preds)**0.5
    
    # Print metrics and return dataframe that shows the coefficient of each feature.
    print(f'Training R2: {linreg.score(X_train, y_train)}')
    print(f'Testing R2: {linreg.score(X_test, y_test)}')
    print(f'Baseline RMSE: {rmse_base}')
    print(f'Regression RMSE: {rmse_regression}')
    print(f'Improvement over baseline: {rmse_base - rmse_regression}')
    print(f'Proportional improvement: {(rmse_base - rmse_regression)/rmse_base}')
    coefs = pd.DataFrame({'feature': X_train.columns, 'coefficient': linreg.coef_})
    return coefs.sort_values(by='coefficient', ascending=False)

### *Predicting obesity with 1-mile food access data* ###

In [17]:
cols_1_mile = ['LowIncomeTracts', 'LATracts1', 'HUNVFlag', 'TractSNAP_percent']

In [18]:
linear_fit(food_obesity, cols_1_mile, 'HCSOBP_2016-2018')

Training R2: 0.32764373549260173
Testing R2: 0.288560034822962
Baseline RMSE: 11.0220858983688
Regression RMSE: 9.296787843894386
Improvement over baseline: 1.7252980544744148
Proportional improvement: 0.1565309933512447


| | feature | coefficient |
|---|---|---|
| 3 | TractSNAP_percent | 16.266496 |
| 0 | LowIncomeTracts | 9.766421 |
| 2 | HUNVFlag | 4.887343 |
| 1 | LATracts1 | -0.475046 |


We see about a 16% improvement over baseline. SNAP percentage and the Low Income flag have the largest coefficients and are the most useful predictors. Food access within 1 mile has a coefficient near zero (-0.48) and appears to be a poor predictor.
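
Raw coefficients depend on each feature's scale: the flag columns are presumably 0/1, while `TractSNAP_percent` is a ratio with a much narrower spread, so coefficient size alone may not reflect importance. The sketch below shows one way to check this, by standardizing features before comparing coefficients. It uses synthetic data, and the column names `flag`, `fraction`, and `target` are hypothetical rather than taken from the notebook's tables.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): one 0/1 flag and one small-range fraction.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'flag': rng.integers(0, 2, size=300),
    'fraction': rng.uniform(0, 0.3, size=300),
})
demo['target'] = 10 * demo['flag'] + 15 * demo['fraction'] + rng.normal(scale=1, size=300)

features = ['flag', 'fraction']

# Raw coefficients: change in the target per one unit of each feature.
raw_model = LinearRegression().fit(demo[features], demo['target'])
raw_coefs = pd.Series(raw_model.coef_, index=features)

# Standardized coefficients: change in the target per one standard deviation.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(demo[features], demo['target'])
std_coefs = pd.Series(pipe.named_steps['linearregression'].coef_, index=features)

# Order both by absolute standardized coefficient, largest first.
order = std_coefs.abs().sort_values(ascending=False).index
print(pd.DataFrame({'raw': raw_coefs, 'standardized': std_coefs}).reindex(order))
```

In this synthetic example the fraction has the larger raw coefficient, but the flag has the larger standardized one, because the fraction varies over a much smaller range. Whether the same reversal happens with the real tract data would need to be checked.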

### *Predicting obesity with half-mile food access data* ###

In [19]:
cols_half_mile = ['LowIncomeTracts', 'LATracts_half', 'HUNVFlag', 'TractSNAP_percent']

In [20]:
linear_fit(food_obesity, cols_half_mile, 'HCSOBP_2016-2018')

Training R2: 0.3783168689440378
Testing R2: 0.32958420271894395
Baseline RMSE: 11.0220858983688
Regression RMSE: 9.024765186122835
Improvement over baseline: 1.9973207122459655
Proportional improvement: 0.181210773592461


| | feature | coefficient |
|---|---|---|
| 3 | TractSNAP_percent | 17.591663 |
| 0 | LowIncomeTracts | 10.07391 |
| 1 | LATracts_half | 6.430744 |
| 2 | HUNVFlag | -0.246612 |


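We see about an 18% improvement over baseline. SNAP percentage and the Low Income flag are again the most useful predictors. Food access within a half mile (coefficient 6.43) appears to be a much better predictor than access within 1 mile (-0.48). The HUNVFlag coefficient falls from 4.89 to -0.25 once half-mile access is included, which suggests the two features carry overlapping information.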

### *Predicting diabetes with 1-mile food access data* ###

In [21]:
linear_fit(food_diabetes, cols_1_mile, 'HCSDIAP_2016-2018')

Training R2: 0.16989347897522222
Testing R2: 0.26334518862607115
Baseline RMSE: 3.5541007131375
Regression RMSE: 3.050434788560699
Improvement over baseline: 0.5036659245768011
Proportional improvement: 0.1417140270434749


| | feature | coefficient |
|---|---|---|
| 0 | LowIncomeTracts | 2.957291 |
| 1 | LATracts1 | 1.054966 |
| 2 | HUNVFlag | 0.937767 |
| 3 | TractSNAP_percent | -2.248496 |


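We see about a 14% improvement over baseline. The Low Income flag has the largest coefficient and is the most useful predictor. Food access within 1 mile is also a positive predictor. Unlike in the obesity models, the SNAP percentage coefficient is negative (-2.25). Training R2 (0.17) is lower than test R2 (0.26), which is unusual and suggests the metrics are sensitive to this particular train/test split.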

### *Predicting diabetes with half-mile food access data* ###

In [22]:
linear_fit(food_diabetes, cols_half_mile, 'HCSDIAP_2016-2018')

Training R2: 0.2097004066954048
Testing R2: 0.2745535522896497
Baseline RMSE: 3.5541007131375
Regression RMSE: 3.0271393221151346
Improvement over baseline: 0.5269613910223656
Proportional improvement: 0.14826855892813826


| | feature | coefficient |
|---|---|---|
| 0 | LowIncomeTracts | 3.102989 |
| 1 | LATracts_half | 2.030085 |
| 2 | HUNVFlag | -0.579829 |
| 3 | TractSNAP_percent | -2.511129 |


We see about a 15% improvement over baseline. The Low Income flag is again the most useful predictor. Food access within a half mile (coefficient 2.03) appears to be a better predictor than access within 1 mile (1.05).

Overall, the food access data is somewhat better at predicting obesity than diabetes (16 to 18% versus 14 to 15% RMSE improvement over baseline). Food access within a half mile is a better predictor than access within 1 mile for both outcomes, though the margin for diabetes is small. This analysis suggests that the Low Income flag and food access within a half mile are good features to use for food desert identification.
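
All of the comparisons above rest on a single train/test split (`random_state=3224`), so small differences between feature sets could partly reflect that split. The sketch below shows one possible way to check stability, by scoring two feature sets with k-fold cross-validation. The data is synthetic, and the names `income_flag`, `access_a`, `access_b`, and `target` are hypothetical stand-ins, not columns from `food_obesity` or `food_diabetes`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical column names and values).
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    'income_flag': rng.integers(0, 2, size=n),
    'access_a': rng.integers(0, 2, size=n),
    'access_b': rng.integers(0, 2, size=n),
})
# By construction, only income_flag and access_b drive the target.
demo['target'] = (25 + 8 * demo['income_flag'] + 5 * demo['access_b']
                  + rng.normal(scale=4, size=n))

feature_sets = {
    'set_a': ['income_flag', 'access_a'],
    'set_b': ['income_flag', 'access_b'],
}

# A fixed random_state gives both feature sets the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=3224)
cv_results = {}
for name, cols in feature_sets.items():
    cv_results[name] = cross_val_score(
        LinearRegression(), demo[cols], demo['target'], cv=cv, scoring='r2'
    )
    print(f'{name}: mean R2 = {cv_results[name].mean():.3f}, '
          f'std = {cv_results[name].std():.3f}')
```

If the gap between mean scores is large relative to the fold-to-fold spread, the ranking of feature sets is less likely to depend on one particular split.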