# INFO2950 Project - Phase V

## Table of Contents
1. Introduction <br>
a. Background <br>
b. Research Questions 
2. Data Description
3. Data Cleaning
4. Preregistration Statement
5. Data Analysis
6. Evaluation of Significance
7. Interpretation and Conclusions
8. Limitations
9. Sources

## 1. Introduction

### 1a. Background

### 1b. Research Questions

Our overarching question is whether there is a relationship in the United States between grocery stores, fast food restaurants, and county demographics. This led us to ask the following subquestions:
- Is there a relationship between food availability, both grocery stores and restaurants, and county demographics across the United States? 
- Are there specific kinds of grocery stores and restaurants depending on the demographics in the county? 
- Is there a relationship between grocery stores and fast food restaurants within each county?

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import duckdb
from sklearn.linear_model import LinearRegression, LogisticRegression

## 2. Data Description

## 3. Data Cleaning

The original data was contained in an Excel file, with each subject on a different tab. The first step was to download each tab we were working with as a CSV file. We started with the grocery stores data.
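
The export step can also be scripted rather than done by hand. The following is a minimal sketch, assuming the workbook has already been loaded as a dictionary of data frames keyed by tab name (which is what `pd.read_excel(path, sheet_name=None)` returns). The helper name `write_tabs_to_csv` and its arguments are hypothetical and are not part of our project code.

```python
from pathlib import Path

import pandas as pd


def write_tabs_to_csv(tabs, out_dir):
    """Write each data frame in `tabs` (tab name -> DataFrame) to its own CSV.

    Hypothetical helper for illustration; returns the list of paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in tabs.items():
        path = out_dir / f"{name}.csv"
        # index=False keeps the row index out of the file.
        frame.to_csv(path, index=False)
        written.append(path)
    return written
```

Writing with `index=False` is a design choice here: a saved row index is the usual reason an `Unnamed: 0` column appears when a CSV is read back in.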

In [4]:
stores = pd.read_csv('data/Store_Access.csv')
stores.head()

Unnamed: 0,FIPS,State,County,GROC11,GROC16,PCH_GROC_11_16,GROCPTH11,GROCPTH16,PCH_GROCPTH_11_16,SUPERC11,...,PCH_SNAPS_12_17,SNAPSPTH12,SNAPSPTH17,PCH_SNAPSPTH_12_17,WICS11,WICS16,PCH_WICS_11_16,WICSPTH11,WICSPTH16,PCH_WICSPTH_11_16
0,1001,AL,Autauga,5,3,-40.0,0.090581,0.054271,-40.085748,1,...,19.376392,0.674004,0.804747,19.3979,5.0,5.0,0.0,0.090567,0.090511,-0.061543
1,1003,AL,Baldwin,27,29,7.407407,0.144746,0.139753,-3.449328,6,...,36.927711,0.725055,0.890836,22.864524,26.0,28.0,7.692307,0.13938,0.134802,-3.284727
2,1005,AL,Barbour,6,4,-33.333333,0.21937,0.155195,-29.254287,0,...,3.349282,1.28059,1.424614,11.246689,7.0,6.0,-14.285714,0.255942,0.232387,-9.203081
3,1007,AL,Bibb,6,5,-16.666667,0.263794,0.220916,-16.254289,1,...,11.794872,0.719122,0.801423,11.444711,6.0,5.0,-16.666666,0.263771,0.221474,-16.035471
4,1009,AL,Blount,7,5,-28.571429,0.121608,0.086863,-28.571429,1,...,5.701754,0.657144,0.692374,5.361034,8.0,8.0,0.0,0.139,0.139089,0.064332


Next, we determined which columns we might want to use based on our research questions and dropped the rest from our data frame.

In [5]:
stores = stores.drop(['PCH_GROCPTH_11_16',
                      'PCH_SUPERCPTH_11_16',
                      'PCH_CONVSPTH_11_16',
                      'PCH_SPECSPTH_11_16',
                      'SNAPS12', 'SNAPS17',
                      'PCH_SNAPS_12_17',
                      'SNAPSPTH12',
                      'SNAPSPTH17',
                      'PCH_SNAPSPTH_12_17',
                      'WICS11', 'WICS16',
                      'PCH_WICS_11_16',
                      'WICSPTH11',
                      'WICSPTH16',
                      'PCH_WICSPTH_11_16',
                      'FIPS',
                      'PCH_SUPERC_11_16',
                      'PCH_GROC_11_16',
                      'PCH_CONVS_11_16',
                      'PCH_SPECS_11_16',
                      'CONVS11', 'CONVS16',
                      'CONVSPTH11', 'CONVSPTH16',
                      'SPECS11', 'SPECS16',
                      'SPECSPTH11', 'SPECSPTH16'], axis=1)
stores.head()

Unnamed: 0,State,County,GROC11,GROC16,GROCPTH11,GROCPTH16,SUPERC11,SUPERC16,SUPERCPTH11,SUPERCPTH16
0,AL,Autauga,5,3,0.090581,0.054271,1,1,0.018116,0.01809
1,AL,Baldwin,27,29,0.144746,0.139753,6,7,0.032166,0.033733
2,AL,Barbour,6,4,0.21937,0.155195,0,1,0.0,0.038799
3,AL,Bibb,6,5,0.263794,0.220916,1,1,0.043966,0.044183
4,AL,Blount,7,5,0.121608,0.086863,1,1,0.017373,0.017373
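
When most columns are being discarded, listing the columns to keep can be shorter than listing the ones to drop. The following is a minimal sketch on a toy frame with illustrative values. The names `toy`, `keep`, `kept`, and `dropped` are hypothetical, and only a handful of the USDA column codes are included.

```python
import pandas as pd

# Toy frame with a few of the coded columns (illustrative values only).
toy = pd.DataFrame({
    "FIPS": [1001, 1003],
    "State": ["AL", "AL"],
    "County": ["Autauga", "Baldwin"],
    "GROC16": [3, 29],
    "PCH_GROC_11_16": [-40.0, 7.4],
    "WICS16": [5.0, 28.0],
})

# Select the columns to keep; everything else is discarded.
keep = ["State", "County", "GROC16"]
kept = toy[keep]

# Same result via drop(): remove every column that is not in the keep-list.
dropped = toy.drop(columns=[c for c in toy.columns if c not in keep])

print(kept.equals(dropped))
```

Both approaches give the same frame. The keep-list version makes it easier to see at a glance which variables remain in the analysis.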


The USDA Excel file includes a tab that explains what each variable means, and the remaining tabs use those variable codes as their column titles. To make the data frame easier to read, we referred to the variable tab and manually renamed each column with a descriptive name.

In [8]:
stores = stores.rename(columns={'GROC11': 'Grocery_2011',
                                'GROC16': 'Grocery_2016',
                                'GROCPTH11': 'Grocery_per_1000_2011',
                                'GROCPTH16': 'Grocery_per_1000_2016',
                                'SUPERC11': 'Supercenter_2011',
                                'SUPERC16': 'Supercenter_2016',
                                'SUPERCPTH11': 'Supercenter_per_1000_2011',
                                'SUPERCPTH16': 'Supercenter_per_1000_2016'})
# Combine grocery stores and supercenters. min_count=1 leaves the sum as NaN
# when both inputs are missing, instead of reporting 0.
stores['Combined_Grocery_2011'] = stores[['Grocery_2011', 'Supercenter_2011']].sum(axis=1, min_count=1)
stores['Combined_Grocery_2016'] = stores[['Grocery_2016', 'Supercenter_2016']].sum(axis=1, min_count=1)
stores['Combined_per_1000_2011'] = stores[['Grocery_per_1000_2011', 'Supercenter_per_1000_2011']].sum(axis=1, min_count=1)
stores['Combined_per_1000_2016'] = stores[['Grocery_per_1000_2016', 'Supercenter_per_1000_2016']].sum(axis=1, min_count=1)
stores.head()

Unnamed: 0,State,County,Grocery_2011,Grocery_2016,Grocery_per_1000_2011,Grocery_per_1000_2016,Supercenter_2011,Supercenter_2016,Supercenter_per_1000_2011,Supercenter_per_1000_2016,Combined_Grocery_2011,Combined_Grocery_2016,Combined_per1000_2011,Combined_per1000_2016,Combined_per_1000_2011,Combined_per_1000_2016
0,AL,Autauga,5,3,0.090581,0.054271,1,1,0.018116,0.01809,6,4,0.108698,0.072361,0.108698,0.072361
1,AL,Baldwin,27,29,0.144746,0.139753,6,7,0.032166,0.033733,33,36,0.176911,0.173486,0.176911,0.173486
2,AL,Barbour,6,4,0.21937,0.155195,0,1,0.0,0.038799,6,5,0.21937,0.193994,0.21937,0.193994
3,AL,Bibb,6,5,0.263794,0.220916,1,1,0.043966,0.044183,7,6,0.30776,0.2651,0.30776,0.2651
4,AL,Blount,7,5,0.121608,0.086863,1,1,0.017373,0.017373,8,6,0.138981,0.104235,0.138981,0.104235
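
The combined columns above are row-wise sums computed with `min_count=1`. The following is a minimal sketch of what that argument changes, using a toy frame with made-up values. The names `toy`, `default_sum`, and `guarded_sum` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame: the last row has no data in either column (made-up values).
toy = pd.DataFrame({
    "Grocery_2016": [3.0, np.nan, np.nan],
    "Supercenter_2016": [1.0, 2.0, np.nan],
})
cols = ["Grocery_2016", "Supercenter_2016"]

# Default behavior: NaN is skipped, so a row with no data at all sums to 0.
default_sum = toy[cols].sum(axis=1)

# min_count=1: at least one non-missing value is required, otherwise the result is NaN.
guarded_sum = toy[cols].sum(axis=1, min_count=1)

print(default_sum.tolist())  # [4.0, 2.0, 0.0]
print(guarded_sum.tolist())  # [4.0, 2.0, nan]
```

With `min_count=1`, a county with no store data is not reported as having zero stores.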


Next, we converted the demographics tab of the Excel sheet into a CSV file and performed the same procedure on the demographics data frame: first dropping the columns that were unnecessary for our research questions, and then converting the column names from the variable codes back to descriptive titles.

In [10]:
demographics = pd.read_csv('data/county_demographics.csv')
demographics.head()

Unnamed: 0,FIPS,State,County,PCT_NHWHITE10,PCT_NHBLACK10,PCT_HISP10,PCT_NHASIAN10,PCT_NHNA10,PCT_NHPI10,PCT_65OLDER10,PCT_18YOUNGER10,MEDHHINC15,POVRATE15,PERPOV10,CHILDPOVRATE15,PERCHLDPOV10,METRO13,POPLOSS10
0,1001,AL,Autauga,77.246156,17.582599,2.400542,0.855766,0.397647,0.040314,11.995382,26.777959,56580.0,12.7,0,18.8,0,1,0.0
1,1003,AL,Baldwin,83.504787,9.308425,4.384824,0.735193,0.628755,0.043343,16.771185,22.987408,52387.0,12.9,0,19.6,0,1,0.0
2,1005,AL,Barbour,46.753105,46.69119,5.051535,0.3897,0.218524,0.087409,14.236807,21.906982,31433.0,32.0,1,45.2,1,0,0.0
3,1007,AL,Bibb,75.020729,21.924504,1.771765,0.096007,0.279293,0.030548,12.68165,22.696923,40767.0,22.2,0,29.3,1,1,0.0
4,1009,AL,Blount,88.887338,1.26304,8.0702,0.200621,0.497191,0.031402,14.722096,24.608353,50487.0,14.7,0,22.2,0,1,0.0


In [4]:
# State-level regional price parities (RPP) for 2015. Source:
# https://fredblog.stlouisfed.org/2017/07/regional-price-parities/#:~:text=In%20general%2C%20price%20levels%20are,and%20New%20Jersey%20(113.4).
rpp = pd.read_csv('data/RPP_data2015.csv')
rpp.head()

Unnamed: 0,State,2015-01-01
0,AL,90.259
1,AK,104.43
2,AZ,97.901
3,AR,89.827
4,CA,109.26


In [11]:
demographics = demographics.drop(['PCT_65OLDER10', 'PCT_18YOUNGER10', 'PERCHLDPOV10',
                                  'CHILDPOVRATE15',
                                 'POPLOSS10', 'FIPS'], axis=1)
demographics.head()

Unnamed: 0,State,County,PCT_NHWHITE10,PCT_NHBLACK10,PCT_HISP10,PCT_NHASIAN10,PCT_NHNA10,PCT_NHPI10,MEDHHINC15,POVRATE15,PERPOV10,METRO13
0,AL,Autauga,77.246156,17.582599,2.400542,0.855766,0.397647,0.040314,56580.0,12.7,0,1
1,AL,Baldwin,83.504787,9.308425,4.384824,0.735193,0.628755,0.043343,52387.0,12.9,0,1
2,AL,Barbour,46.753105,46.69119,5.051535,0.3897,0.218524,0.087409,31433.0,32.0,1,0
3,AL,Bibb,75.020729,21.924504,1.771765,0.096007,0.279293,0.030548,40767.0,22.2,0,1
4,AL,Blount,88.887338,1.26304,8.0702,0.200621,0.497191,0.031402,50487.0,14.7,0,1


In [12]:
demographics = demographics.rename(columns={
    'PCT_NHWHITE10': 'Percent_White_2010',
    'PCT_NHBLACK10': 'Percent_Black_2010',
    'PCT_HISP10': 'Percent_Hispanic_2010',
    'PCT_NHASIAN10': 'Percent_Asian_2010',
    'PCT_NHNA10': 'Percent_Native_American_2010',
    'PCT_NHPI10': 'Percent_Hawaiian_2010',
    'MEDHHINC15': 'Median_household_income_2015',
    'PERPOV10':'Persistent_Poverty',
    'POVRATE15': 'Poverty_Rate_2015',
    'METRO13': 'Metro'
})
demographics.head()

Unnamed: 0,State,County,Percent_White_2010,Percent_Black_2010,Percent_Hispanic_2010,Percent_Asian_2010,Percent_Native_American_2010,Percent_Hawaiian_2010,Median_household_income_2015,Poverty_Rate_2015,Persistent_Poverty,Metro
0,AL,Autauga,77.246156,17.582599,2.400542,0.855766,0.397647,0.040314,56580.0,12.7,0,1
1,AL,Baldwin,83.504787,9.308425,4.384824,0.735193,0.628755,0.043343,52387.0,12.9,0,1
2,AL,Barbour,46.753105,46.69119,5.051535,0.3897,0.218524,0.087409,31433.0,32.0,1,0
3,AL,Bibb,75.020729,21.924504,1.771765,0.096007,0.279293,0.030548,40767.0,22.2,0,1
4,AL,Blount,88.887338,1.26304,8.0702,0.200621,0.497191,0.031402,50487.0,14.7,0,1
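
The renaming dictionaries above were typed by hand from the variable tab. As an alternative, the mapping could be built from a lookup table. The following is a minimal sketch that assumes the lookup has one column of codes and one column of descriptive labels. The column names `code` and `label`, and the frames `variables` and `toy`, are hypothetical.

```python
import pandas as pd

# Hypothetical excerpt of a variable lookup table: one row per variable code.
variables = pd.DataFrame({
    "code": ["PCT_NHWHITE10", "PCT_NHBLACK10", "MEDHHINC15"],
    "label": ["Percent_White_2010", "Percent_Black_2010", "Median_household_income_2015"],
})

# Turn the two columns into a {code: label} dictionary.
rename_map = dict(zip(variables["code"], variables["label"]))

# Toy data frame that retains only some of the coded columns.
toy = pd.DataFrame({"State": ["AL"], "PCT_NHWHITE10": [77.2], "MEDHHINC15": [56580.0]})

# rename() ignores keys that are absent from the frame by default, so the full
# lookup can be applied even after some coded columns have been dropped.
renamed = toy.rename(columns=rename_map)
print(list(renamed.columns))
```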


Then, we converted the restaurants tab of the Excel sheet into a CSV file, created another data frame, dropped the columns we were not interested in, and renamed the remaining columns from their variable codes.

In [12]:
restaurants = pd.read_csv('data/Restaurants.csv')
restaurants.head()

Unnamed: 0,FIPS,State,County,FFR11,FFR16,PCH_FFR_11_16,FFRPTH11,FFRPTH16,PCH_FFRPTH_11_16,FSR11,FSR16,PCH_FSR_11_16,FSRPTH11,FSRPTH16,PCH_FSRPTH_11_16,PC_FFRSALES07,PC_FFRSALES12,PC_FSRSALES07,PC_FSRSALES12
0,1001,AL,Autauga,34,44,29.411765,0.615953,0.795977,29.226817,32,31,-3.125,0.579721,0.560802,-3.263448,649.511367,674.80272,484.381507,512.280987
1,1003,AL,Baldwin,121,156,28.92562,0.648675,0.751775,15.893824,216,236,9.259259,1.157966,1.1373,-1.784662,649.511367,674.80272,484.381507,512.280987
2,1005,AL,Barbour,19,23,21.052632,0.694673,0.892372,28.45932,17,14,-17.647059,0.621549,0.543183,-12.608237,649.511367,674.80272,484.381507,512.280987
3,1007,AL,Bibb,6,7,16.666667,0.263794,0.309283,17.243995,5,7,40.0,0.219829,0.309283,40.692794,649.511367,674.80272,484.381507,512.280987
4,1009,AL,Blount,20,23,15.0,0.347451,0.399569,15.0,15,12,-20.0,0.260589,0.208471,-20.0,649.511367,674.80272,484.381507,512.280987


In [13]:
restaurants = restaurants.drop(['PCH_FFRPTH_11_16', 'PCH_FSRPTH_11_16',
                                'PC_FFRSALES07', 'PC_FSRSALES07',
                                'FIPS', 'PCH_FFR_11_16', 'PCH_FSR_11_16'], axis=1)
restaurants.head()

Unnamed: 0,State,County,FFR11,FFR16,FFRPTH11,FFRPTH16,FSR11,FSR16,FSRPTH11,FSRPTH16,PC_FFRSALES12,PC_FSRSALES12
0,AL,Autauga,34,44,0.615953,0.795977,32,31,0.579721,0.560802,674.80272,512.280987
1,AL,Baldwin,121,156,0.648675,0.751775,216,236,1.157966,1.1373,674.80272,512.280987
2,AL,Barbour,19,23,0.694673,0.892372,17,14,0.621549,0.543183,674.80272,512.280987
3,AL,Bibb,6,7,0.263794,0.309283,5,7,0.219829,0.309283,674.80272,512.280987
4,AL,Blount,20,23,0.347451,0.399569,15,12,0.260589,0.208471,674.80272,512.280987


In [14]:
restaurants = restaurants.rename(columns={'FFR11':'Fast_food_2011', 
                                          'FFR16':'Fast_food_2016', 
                                          'FFRPTH11':'Fast_food_per_1000_2011',
                                          'FFRPTH16':'Fast_food_per_1000_2016', 
                                          'FSR11':'Full_service_2011',
                                          'FSR16':'Full_service_2016',
                                          'FSRPTH11':'Full_service_per_1000_2011', 
                                          'FSRPTH16':'Full_service_per_1000_2016',
                                          'PC_FFRSALES12':'Fast_food_expenditures_per_capita_2012', 
                                          'PC_FSRSALES12':'Full_service_expenditures_per_capita_2012'})
restaurants.head()

Unnamed: 0,State,County,Fast_food_2011,Fast_food_2016,Fast_food_per_1000_2011,Fast_food_per_1000_2016,Full_service_2011,Full_service_2016,Full_service_per_1000_2011,Full_service_per_1000_2016,Fast_food_expenditures_per_capita_2012,Full_service_expenditures_per_capita_2012
0,AL,Autauga,34,44,0.615953,0.795977,32,31,0.579721,0.560802,674.80272,512.280987
1,AL,Baldwin,121,156,0.648675,0.751775,216,236,1.157966,1.1373,674.80272,512.280987
2,AL,Barbour,19,23,0.694673,0.892372,17,14,0.621549,0.543183,674.80272,512.280987
3,AL,Bibb,6,7,0.263794,0.309283,5,7,0.219829,0.309283,674.80272,512.280987
4,AL,Blount,20,23,0.347451,0.399569,15,12,0.260589,0.208471,674.80272,512.280987
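
The same drop-then-rename procedure was applied to each tab, so it could be wrapped in a small function. The following is a minimal sketch. The helper `clean_tab` and the toy frame are hypothetical and are not part of our notebook.

```python
import pandas as pd


def clean_tab(frame, drop_cols, rename_map):
    """Drop unneeded columns, then rename the remaining coded columns.

    Hypothetical helper; returns a new data frame and leaves `frame` unchanged.
    """
    return frame.drop(columns=drop_cols).rename(columns=rename_map)


# Toy frame using a few of the restaurant column codes (illustrative values).
toy = pd.DataFrame({
    "FIPS": [1001],
    "State": ["AL"],
    "County": ["Autauga"],
    "FFR16": [44],
    "PCH_FFR_11_16": [29.411765],
})

cleaned = clean_tab(
    toy,
    drop_cols=["FIPS", "PCH_FFR_11_16"],
    rename_map={"FFR16": "Fast_food_2016"},
)
print(list(cleaned.columns))
```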


Now all three data frames are formatted for our exploration. Next, we check each data frame for missing values.

In [15]:
print(stores.columns[stores.isnull().any()])
print(demographics.columns[demographics.isnull().any()])
print(restaurants.columns[restaurants.isnull().any()])

Index([], dtype='object')
Index(['Median_household_income_2015', 'Poverty_Rate_2015',
       'Child_poverty_rate_2015'],
      dtype='object')
Index([], dtype='object')
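
The check above reports which columns contain missing values. Counting them per column shows how much data is affected. The following is a minimal sketch on a toy frame with placeholder county names and made-up values. The names `toy` and `missing_counts` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with placeholder counties and made-up values.
toy = pd.DataFrame({
    "County": ["A", "B", "C"],
    "Median_household_income_2015": [56580.0, np.nan, np.nan],
    "Poverty_Rate_2015": [12.7, np.nan, 14.7],
})

# isna() marks missing cells; sum() counts them column by column.
missing_counts = toy.isna().sum()

# Keep only the columns that actually have missing values.
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```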


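We make sure missing data is represented as NaN, and then list the rows of the demographics data frame that contain missing values.

In [16]:
# pandas already reads empty CSV fields as NaN, so these calls leave the data unchanged.
stores.fillna(np.nan, inplace=True)
demographics.fillna(np.nan, inplace=True)
restaurants.fillna(np.nan, inplace=True)
# Show every row that has at least one missing value.
demographics[demographics.isnull().any(axis=1)]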

Unnamed: 0,State,County,Percent_White_2010,Percent_Black_2010,Percent_Hispanic_2010,Percent_Asian_2010,Percent_Native_American_2010,Percent_Hawaiian_2010,Median_household_income_2015,Poverty_Rate_2015,Child_poverty_rate_2015,Metro
92,AK,Wade Hampton,2.667918,0.013407,0.093846,0.227913,94.945703,0.0,,,,0
548,HI,Kalawao,26.666667,0.0,1.111111,7.777778,0.0,48.888889,,,,1
2417,SD,Shannon,2.804357,0.029442,2.193434,0.103047,94.096864,0.014721,,,,0
2916,VA,Bedford,75.072324,20.009643,2.153648,0.658952,0.112504,0.016072,,,,1
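
The following is a minimal sketch of how missing values behave after loading, using a toy CSV with placeholder names and made-up values. It also shows one possible way to handle such rows later (dropping them), which is an assumption for illustration rather than a step we have committed to.

```python
import io

import numpy as np
import pandas as pd

# Toy CSV with an empty income field in the second row (placeholder values).
csv_text = "County,Median_household_income_2015\nA,56580\nB,\n"
toy = pd.read_csv(io.StringIO(csv_text))

# read_csv already stores the empty field as NaN ...
print(toy["Median_household_income_2015"].isna().tolist())

# ... so filling NaN with NaN returns an identical frame.
refilled = toy.fillna(np.nan)
print(refilled.equals(toy))

# One possible follow-up: keep only rows where the income column is present.
complete = toy.dropna(subset=["Median_household_income_2015"])
print(len(complete))
```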


## 4. Preregistration Statement

## 5. Data Analysis

## 6. Evaluation of Significance

## 7. Interpretation and Conclusions

## 8. Limitations

## 9. Sources