# Modeling

The goal is to use the variables in the dataset to predict the obesity rate in any county.

In [46]:
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [47]:
obesity_df = pd.read_csv('cleaned_obesity_data.csv')

## Linear Regression Using All Predictors

In [48]:
# Separate data into only the variables we will be using:

obesity_predictors = obesity_df[['Food Environment Index', '% Physically Inactive', '% Limited Access to Healthy Foods',
       '% Frequent Physical Distress', '% Frequent Mental Distress',
       '% With Access to Exercise Opportunities', 'Life Expectancy',
       'Median Household Income', '% Uninsured Adults', '% Excessive Drinking']]

obesity_rates = obesity_df[['% Adults with Obesity']]

obesity_rates

| | % Adults with Obesity |
|---|---|
| 0 | 37.0 |
| 1 | 33.0 |
| 2 | 46.0 |
| 3 | 38.0 |
| 4 | 33.0 |
| ... | ... |
| 2707 | 33.0 |
| 2708 | 20.0 |
| 2709 | 35.0 |
| 2710 | 30.0 |


In [49]:
# Create the Model

clf = LinearRegression()
clf.fit(obesity_predictors, obesity_rates)
print(clf.coef_)

[[ 3.69459285e-01  6.74911820e-01  3.58260684e-02 -4.62559804e-01
   7.17898407e-02 -1.57349841e-02 -2.56986101e-01 -4.83386293e-05
  -2.52574021e-02  5.88771389e-02]]
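The raw coefficient array is hard to read without knowing which predictor each value belongs to. A minimal sketch of pairing coefficients with column names follows; the frame, column names (`predictor_a`, `predictor_b`), and weights are made up for illustration and are not taken from the county dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data; not the county dataset.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "predictor_a": rng.normal(size=200),
    "predictor_b": rng.normal(size=200),
})
# Target built from known weights (2.0 and -1.0) plus a little noise.
y = 2.0 * X["predictor_a"] - 1.0 * X["predictor_b"] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Label each coefficient with its column so sign and size can be read off.
coef_table = pd.Series(model.coef_, index=X.columns).sort_values()
print(coef_table)
```

In the notebook, `obesity_rates` is a one-column DataFrame, so `clf.coef_` is two-dimensional; something like `pd.Series(clf.coef_[0], index=obesity_predictors.columns)` should give the labeled version. The coefficients are in the raw units of each predictor, which is likely why Median Household Income (presumably in dollars) has such a small value.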


In [50]:
# Split the data into training and testing sets, using an 80/20 train/test ratio.
X_train, X_test, y_train, y_test = train_test_split(obesity_predictors, obesity_rates, test_size = 0.2)
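
Because `train_test_split` shuffles the rows at random, the reported error will vary from run to run. A sketch of making the split repeatable with `random_state` is shown here; the seed 42 and the toy arrays are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 toy rows, 2 features
y = np.arange(50)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Repeating the call with the same seed yields the same partition.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```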

In [51]:
clf = LinearRegression()
clf.fit(X_train, y_train)

preds = clf.predict(X_test)

# scikit-learn's signature is (y_true, y_pred); MAE is symmetric, so the value is unchanged
mean_absolute_error(y_test, preds)

2.085757205126215

The mean absolute error above tells us that, on the held-out test set, the predictions were off by about 2.1 percentage points on average.
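
To make the metric concrete, here is a small worked example of the mean absolute error computed by hand. The observed and predicted rates are made-up values, not results from the model.

```python
import numpy as np

# Made-up observed and predicted obesity rates (percent), for illustration only.
y_true = np.array([37.0, 33.0, 46.0, 38.0])
y_pred = np.array([35.0, 34.0, 43.0, 40.0])

# MAE: the average absolute difference, in the same units as the target.
# Absolute errors are 2, 1, 3, 2, so the mean is 8 / 4.
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 2.0
```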

In [52]:
# R squared value (note: computed on the training data, not the test set)
clf.score(X_train, y_train)

0.6503455644355645

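The R squared value above means that the linear model explains about 65% of the variation in obesity rates using the predictors. Note that this score is computed on the training data, so it measures fit rather than out-of-sample performance.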
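
As a sketch of what R squared measures, the example below computes it from its definition, one minus the ratio of residual to total sum of squares, and checks the result against scikit-learn's `r2_score`. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed and predicted values, for illustration only.
y_true = np.array([37.0, 33.0, 46.0, 38.0])
y_pred = np.array([35.0, 34.0, 43.0, 40.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

In the notebook, calling `clf.score(X_test, y_test)` would give the held-out counterpart of the training-set score.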

## Linear Regression Model Using PCA Findings

The graphs derived from the principal component analysis showed that the first principal component correlated best with obesity rates. Within the first principal component, the food environment index, life expectancy, frequent physical distress, and median household income appear to correlate best with obesity rates. Therefore, this linear model uses only these four predictors.
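
For readers who want to see how the variables behind a principal component can be inspected, here is a minimal sketch using scikit-learn's `PCA` on synthetic data. It is not the PCA code used earlier in this analysis; the column names `feature_a`, `feature_b`, and `feature_c` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: two columns share a common signal, one is independent noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=300)
toy = pd.DataFrame({
    "feature_a": shared + rng.normal(scale=0.2, size=300),
    "feature_b": -shared + rng.normal(scale=0.2, size=300),
    "feature_c": rng.normal(size=300),
})

# Standardize first so no column dominates purely because of its scale.
scaled = StandardScaler().fit_transform(toy)
pca = PCA(n_components=2).fit(scaled)

# Row 0 of components_ holds the loadings of the first principal component.
loadings = pd.Series(pca.components_[0], index=toy.columns)
print(loadings.abs().sort_values(ascending=False))
```

Columns with the largest absolute loadings contribute most to the component, which is one way to arrive at a shortlist of predictors.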

In [53]:
# Separate data into only the variables we will be using:

obesity_predictors_pca = obesity_df[['Food Environment Index', '% Frequent Physical Distress',  'Life Expectancy', 'Median Household Income']]

obesity_rates = obesity_df[['% Adults with Obesity']]

obesity_predictors_pca

| | Food Environment Index | % Frequent Physical Distress | Life Expectancy | Median Household Income |
|---|---|---|---|---|
| 0 | 6.6 | 11.0 | 76.6 | 66444.0 |
| 1 | 7.5 | 10.0 | 77.7 | 65658.0 |
| 2 | 5.8 | 15.0 | 72.9 | 38649.0 |
| 3 | 7.4 | 13.0 | 73.6 | 48454.0 |
| 4 | 7.8 | 12.0 | 74.2 | 56894.0 |
| ... | ... | ... | ... | ... |
| 2707 | 7.9 | 9.0 | 76.5 | 74677.0 |
| 2708 | 8.6 | 7.0 | 86.7 | 102709.0 |
| 2709 | 8.4 | 9.0 | 77.0 | 70162.0 |
| 2710 | 8.3 | 9.0 | 78.8 | 62176.0 |


In [54]:
# Create the Model

clf = LinearRegression()
clf.fit(obesity_predictors_pca, obesity_rates)
print(clf.coef_)

[[ 2.39880233e-01  6.78475609e-01 -4.10442322e-01 -7.54669113e-05]]


In [55]:
# Split the data into training and testing sets, using an 80/20 train/test ratio.
X_train, X_test, y_train, y_test = train_test_split(obesity_predictors_pca, obesity_rates, test_size = 0.2)

In [56]:
clf = LinearRegression()
clf.fit(X_train, y_train)

preds = clf.predict(X_test)

# scikit-learn's signature is (y_true, y_pred); MAE is symmetric, so the value is unchanged
mean_absolute_error(y_test, preds)

2.313123515735018

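The mean absolute error rose only slightly compared with the previous linear model, from about 2.09 to about 2.31 percentage points. The two models were evaluated on different random splits, so part of this gap may be due to chance.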

In [57]:
# R squared value (note: computed on the training data, not the test set)
clf.score(X_train, y_train)

0.5201066112076

The R squared value, again computed on the training data, indicates that this linear model fits worse than the previous one: the chosen predictors explain only about 52% of the variation in obesity rates, compared with about 65% for the full set of predictors.
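
One way to compare a full and a reduced predictor set more fairly is to score both with the same cross-validation folds. The sketch below assumes synthetic data whose target depends on all four columns; the weights and the choice of five folds are arbitrary, and this is not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the target depends on all four columns plus noise.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(500, 4))
y = X_full @ np.array([1.0, 0.5, -0.8, 0.6]) + rng.normal(scale=0.5, size=500)
X_subset = X_full[:, :2]  # reduced model sees only the first two columns

results = {}
for name, X in [("all predictors", X_full), ("subset", X_subset)]:
    # cv=5 with an integer uses the same unshuffled folds for both models.
    scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
    results[name] = scores.mean()
    print(name, round(results[name], 3))
```

Applied to the notebook, the same pattern could be run with `obesity_predictors` and `obesity_predictors_pca` against `obesity_rates`, which would give held-out R squared values for both models on identical folds.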