<a href="https://colab.research.google.com/github/nidhicodes4045/datascience442/blob/main/Project_Dataset_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset

> #### Description

> The Food Access Research Atlas was developed by the United States Department of Agriculture's Economic Research Service (ERS) to understand and address limited access to healthy, affordable food across the US. This county-level dataset combines information from several sources, including the 2019 STARS directory of SNAP-authorized stores, a 2019 supermarket list, the 2010 Decennial Census, and the 2014-18 American Community Survey (ACS). By integrating data on store locations, demographics, and socioeconomic factors, the Atlas provides a comprehensive view of food access, enabling researchers to analyze the relationship between these factors and food insecurity within communities. Ryan Whitcomb, Joung Min Choi, and Bo Guan are associated with providing the dataset on CORGIS [Food Access](https://corgis-edu.github.io/corgis/csv/food_access/). It's possible they worked with the data after its initial creation by the ERS.

> The County Demographics dataset was originally compiled by the United States Census Bureau and made available through their QuickFacts tool. It provides county-level demographic information for the United States, designed to offer easy access to key population, business, and geographic statistics. It draws on several Census Bureau sources, including the Decennial Census, the American Community Survey (ACS), the County Business Patterns (CBP), and the Population Estimates Program. By integrating data from these sources, the dataset provides a snapshot of key demographic indicators, enabling researchers to analyze various population characteristics. Ryan Whitcomb, Joung Min Choi, and Bo Guan are associated with providing the dataset on CORGIS [County Demographics](https://corgis-edu.github.io/corgis/csv/county_demographics/). It's possible they compiled or curated the data after its initial collection by the U.S. Census Bureau.

We have prepared a dataset that is a subset of these two CORGIS datasets. We will be referencing the prepared files in our GitHub repository: [food_df](https://raw.githubusercontent.com/nidhicodes4045/datascience442/refs/heads/main/Food_access_mod.csv) and [county_df](https://raw.githubusercontent.com/nidhicodes4045/datascience442/refs/heads/main/county_demographics_mod.csv).

## Imports

In [None]:
import os
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
!pip install us
# Set before `import us` (next cell) so that DC is treated as a state in lookups
os.environ['DC_STATEHOOD'] = '1'

Collecting us
  Downloading us-3.2.0-py3-none-any.whl.metadata (10 kB)
Downloading us-3.2.0-py3-none-any.whl (13 kB)
Installing collected packages: us
Successfully installed us-3.2.0


In [None]:
import us

In [None]:
print(os.getcwd())

/content


## Our Dataset initialization

In [None]:
county_df = pd.read_csv("https://raw.githubusercontent.com/nidhicodes4045/datascience442/refs/heads/main/county_demographics_mod.csv")
food_df = pd.read_csv("https://raw.githubusercontent.com/nidhicodes4045/datascience442/refs/heads/main/Food_access_mod.csv")

## Dataframe info

### County df head and tail

In [None]:
county_df.head()

Unnamed: 0,County,State,Age.Percent 65 and Older,Age.Percent Under 18 Years,Age.Percent Under 5 Years,Education.Bachelor's Degree or Higher,Education.High School or Higher,Employment.Nonemployer Establishments,Ethnicities.American Indian and Alaska Native Alone,Ethnicities.Asian Alone,...,Population.Population per Square Mile,Sales.Accommodation and Food Services Sales,Sales.Retail Sales,Employment.Firms.Total,Employment.Firms.Women-Owned,Employment.Firms.Men-Owned,Employment.Firms.Minority-Owned,Employment.Firms.Nonminority-Owned,Employment.Firms.Veteran-Owned,Employment.Firms.Nonveteran-Owned
0,Abbeville County,SC,22.4,19.8,4.7,15.6,81.7,1416,0.3,0.4,...,51.8,12507.0,91371,1450,543,689,317,1080,187,1211
1,Acadia Parish,LA,15.8,25.8,6.9,13.3,79.0,4533,0.4,0.3,...,94.3,52706.0,602739,4664,1516,2629,705,3734,388,4007
2,Accomack County,VA,24.6,20.7,5.6,19.5,81.5,2387,0.7,0.8,...,73.8,53568.0,348195,2997,802,1716,335,2560,212,2536
3,Ada County,ID,14.9,23.2,5.6,38.5,95.2,41464,0.8,2.7,...,372.8,763099.0,5766679,41789,14661,19409,3099,36701,3803,35132
4,Adair County,IA,23.0,21.8,5.6,18.5,94.2,609,0.3,0.5,...,13.5,,63002,914,304,499,0,861,185,679


In [None]:
county_df.tail()

Unnamed: 0,County,State,Age.Percent 65 and Older,Age.Percent Under 18 Years,Age.Percent Under 5 Years,Education.Bachelor's Degree or Higher,Education.High School or Higher,Employment.Nonemployer Establishments,Ethnicities.American Indian and Alaska Native Alone,Ethnicities.Asian Alone,...,Population.Population per Square Mile,Sales.Accommodation and Food Services Sales,Sales.Retail Sales,Employment.Firms.Total,Employment.Firms.Women-Owned,Employment.Firms.Men-Owned,Employment.Firms.Minority-Owned,Employment.Firms.Nonminority-Owned,Employment.Firms.Veteran-Owned,Employment.Firms.Nonveteran-Owned
3134,Yuma County,AZ,19.3,25.1,7.1,15.0,73.3,9896,2.3,1.5,...,35.5,307540.0,1995974,10846,4298,4529,5749,4476,839,9265
3135,Yuma County,CO,18.7,27.4,7.5,21.8,88.6,1020,1,0.5,...,4.2,8501.0,125565,1492,391,797,45,1350,66,1278
3136,Zapata County,TX,13.2,33.1,8.6,11.6,61.9,1452,0.5,0.2,...,14.0,,75681,1964,818,1003,1680,235,181,1738
3137,Zavala County,TX,14.6,28.4,7.2,10.9,66.9,837,1.1,0.3,...,9.0,8808.0,45596,1232,486,674,1062,159,42,1178
3138,Ziebach County,SD,9.6,27.5,5.5,16.4,84.1,87,Unavailable,0.3,...,1.4,,15757,78,0,42,29,36,0,54


### Food df head and tail

In [None]:
food_df.head()

Unnamed: 0,County,Population,State,Housing Data.Residing in Group Quarters,Housing Data.Total Housing Units,Vehicle Access.1 Mile,Vehicle Access.1/2 Mile,Vehicle Access.10 Miles,Vehicle Access.20 Miles,Low Access Numbers.Children.1 Mile,...,Low Access Numbers.Low Income People.10 Miles,Low Access Numbers.Low Income People.20 Miles,Low Access Numbers.People.1 Mile,Low Access Numbers.People.1/2 Mile,Low Access Numbers.People.10 Miles,Low Access Numbers.People.20 Miles,Low Access Numbers.Seniors.1 Mile,Low Access Numbers.Seniors.1/2 Mile,Low Access Numbers.Seniors.10 Miles,Low Access Numbers.Seniors.20 Miles
0,Autauga County,54571,Alabama,455,20221,834,1045,222.0,0,9973,...,2307,0,37424.0,49497,5119,0,4393,5935.0,707,0
1,Baldwin County,182265,Alabama,2307,73180,1653,2178,32.0,0,30633,...,846,0,132442.0,165616,2308,0,21828,27241.0,390,0
2,Barbour County,27457,Alabama,3193,9820,545,742,201.0,0,3701,...,2440,0,,23762,4643,0,2537,3348.0,629,0
3,Bibb County,22915,Alabama,2224,7953,312,441,0.0,0,4198,...,102,0,17560.0,20989,365,0,2262,2630.0,72,0
4,Blount County,57322,Alabama,489,21578,752,822,0.0,0,12575,...,0,0,50848.0,54933,0,0,7114,7810.0,0,0


In [None]:
food_df.tail()

Unnamed: 0,County,Population,State,Housing Data.Residing in Group Quarters,Housing Data.Total Housing Units,Vehicle Access.1 Mile,Vehicle Access.1/2 Mile,Vehicle Access.10 Miles,Vehicle Access.20 Miles,Low Access Numbers.Children.1 Mile,...,Low Access Numbers.Low Income People.10 Miles,Low Access Numbers.Low Income People.20 Miles,Low Access Numbers.People.1 Mile,Low Access Numbers.People.1/2 Mile,Low Access Numbers.People.10 Miles,Low Access Numbers.People.20 Miles,Low Access Numbers.Seniors.1 Mile,Low Access Numbers.Seniors.1/2 Mile,Low Access Numbers.Seniors.10 Miles,Low Access Numbers.Seniors.20 Miles
3137,Sweetwater County,43806,Wyoming,679,16475,284,372,18.0,13,6280,...,759,614,24036.0,36045,2548,2063,2358,3232.0,255,203
3138,Teton County,21294,Wyoming,271,8973,74,144,0.0,0,2853,...,383,181,15298.0,18354,1677,572,1867,2009.0,218,85
3139,Uinta County,21118,Wyoming,270,7668,175,283,9.0,0,3755,...,290,3,12250.0,17998,726,14,1078,,97,3
3140,Washakie County,8533,Wyoming,140,3492,37,106,7.0,6,746,...,218,176,2869.0,4961,902,730,506,,218,191
3141,Weston County,7208,Wyoming,313,3021,22,43,2.0,0,735,...,247,55,3659.0,5237,840,188,518,824.0,154,44


### County df info

In [None]:
county_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3139 entries, 0 to 3138
Data columns (total 43 columns):
 #   Column                                                        Non-Null Count  Dtype  
---  ------                                                        --------------  -----  
 0   County                                                        3139 non-null   object 
 1   State                                                         3139 non-null   object 
 2   Age.Percent 65 and Older                                      3139 non-null   float64
 3   Age.Percent Under 18 Years                                    3139 non-null   object 
 4   Age.Percent Under 5 Years                                     3139 non-null   object 
 5   Education.Bachelor's Degree or Higher                         3139 non-null   float64
 6   Education.High School or Higher                               3139 non-null   float64
 7   Employment.Nonemployer Establishments                         3139 no

### Food df info

In [None]:
food_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3142 entries, 0 to 3141
Data columns (total 25 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   County                                         3142 non-null   object 
 1   Population                                     3142 non-null   int64  
 2   State                                          3142 non-null   object 
 3   Housing Data.Residing in Group Quarters        3142 non-null   int64  
 4   Housing Data.Total Housing Units               3142 non-null   int64  
 5   Vehicle Access.1 Mile                          3142 non-null   int64  
 6   Vehicle Access.1/2 Mile                        3142 non-null   int64  
 7   Vehicle Access.10 Miles                        2807 non-null   float64
 8   Vehicle Access.20 Miles                        3142 non-null   int64  
 9   Low Access Numbers.Children.1 Mile             3142 

### Null count

In [None]:
print(food_df.isnull().sum())

County                                             0
Population                                         0
State                                              0
Housing Data.Residing in Group Quarters            0
Housing Data.Total Housing Units                   0
Vehicle Access.1 Mile                              0
Vehicle Access.1/2 Mile                            0
Vehicle Access.10 Miles                          335
Vehicle Access.20 Miles                            0
Low Access Numbers.Children.1 Mile                 0
Low Access Numbers.Children.1/2 Mile               0
Low Access Numbers.Children.10 Miles               0
Low Access Numbers.Children.20 Miles               0
Low Access Numbers.Low Income People.1 Mile        0
Low Access Numbers.Low Income People.1/2 Mile      0
Low Access Numbers.Low Income People.10 Miles      0
Low Access Numbers.Low Income People.20 Miles      0
Low Access Numbers.People.1 Mile                 156
Low Access Numbers.People.1/2 Mile            


# Data Cleaning Step 1: Rectify structural misalignments
1. Check shape of the data – number of columns and rows  
2. Make sure the columns are named for convenience, follow the data dictionary
3. Do rows need to be named? Would default 0 order index work?
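
The three checks in this list can be sketched on a toy frame before running them on the real data. This is a minimal illustration only: `toy_df`, its column names, and its values are hypothetical and are not part of the project data.

```python
import pandas as pd

# Hypothetical toy frame standing in for one of the project dataframes
toy_df = pd.DataFrame(
    {
        "County": ["Example County", "Sample Parish"],
        "State": ["XX", "YY"],
        "Population": [100, 200],
    }
)

# 1. Shape: (number of rows, number of columns)
n_rows, n_cols = toy_df.shape

# 2. Column names, to compare against the data dictionary
column_names = toy_df.columns.tolist()

# 3. Row labels: a default RangeIndex means rows are simply numbered 0..n-1
has_default_index = isinstance(toy_df.index, pd.RangeIndex)

print(n_rows, n_cols)
print(column_names)
print(has_default_index)
```

If the index is already a default `RangeIndex`, the rows probably do not need explicit names, since the county and state columns identify each record.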


In [None]:
county_df_shape = county_df.shape
print(county_df_shape)

(3139, 43)


In [None]:
food_df_shape = food_df.shape
print(food_df_shape)

(3142, 25)


`food_df` has 3 more rows than `county_df` (3142 vs. 3139), while `county_df` has more columns (43 vs. 25).

In [None]:
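#function to remove all whitespace and lowercase the text
def lowerandstriptext(text):
  # \s already matches tabs, and with every whitespace character removed no strip() is needed
  newtext = re.sub(r"\s+", "", text)
  return newtext.lower()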
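
A quick way to see what this normalization does is to run it on a few sample strings. The sketch below mirrors the cleaning function above under a hypothetical name, `normalize_key`, and uses made-up inputs.

```python
import re


def normalize_key(text: str) -> str:
    """Remove every whitespace character, then lowercase (hypothetical helper)."""
    return re.sub(r"\s+", "", text).lower()


# Made-up inputs covering inner spaces, repeated spaces, and a trailing tab
samples = ["Abbeville County", " Acadia  Parish\t", "SC"]
normalized = [normalize_key(s) for s in samples]
print(normalized)  # ['abbevillecounty', 'acadiaparish', 'sc']
```

Because all inner whitespace is removed, spellings such as "De Kalb County" and "DeKalb County" collapse to the same key, which helps when the two sources format county names differently.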

In [None]:
#function to convert full state names to lowercase abbreviations
def convertstate(text):
  state = us.states.lookup(text)
  if state is None:
    # Sentinel string (not the None object) so failed lookups can be counted later
    return "None"
  return state.abbr.lower()

In [None]:
#Normalize the County and State columns of county_df (remove whitespace, lowercase)
county_df["County"] = county_df["County"].apply(lowerandstriptext)
county_df["State"] = county_df["State"].apply(lowerandstriptext)
county_df.head()

Unnamed: 0,County,State,Age.Percent 65 and Older,Age.Percent Under 18 Years,Age.Percent Under 5 Years,Education.Bachelor's Degree or Higher,Education.High School or Higher,Employment.Nonemployer Establishments,Ethnicities.American Indian and Alaska Native Alone,Ethnicities.Asian Alone,...,Population.Population per Square Mile,Sales.Accommodation and Food Services Sales,Sales.Retail Sales,Employment.Firms.Total,Employment.Firms.Women-Owned,Employment.Firms.Men-Owned,Employment.Firms.Minority-Owned,Employment.Firms.Nonminority-Owned,Employment.Firms.Veteran-Owned,Employment.Firms.Nonveteran-Owned
0,abbevillecounty,sc,22.4,19.8,4.7,15.6,81.7,1416,0.3,0.4,...,51.8,12507.0,91371,1450,543,689,317,1080,187,1211
1,acadiaparish,la,15.8,25.8,6.9,13.3,79.0,4533,0.4,0.3,...,94.3,52706.0,602739,4664,1516,2629,705,3734,388,4007
2,accomackcounty,va,24.6,20.7,5.6,19.5,81.5,2387,0.7,0.8,...,73.8,53568.0,348195,2997,802,1716,335,2560,212,2536
3,adacounty,id,14.9,23.2,5.6,38.5,95.2,41464,0.8,2.7,...,372.8,763099.0,5766679,41789,14661,19409,3099,36701,3803,35132
4,adaircounty,ia,23.0,21.8,5.6,18.5,94.2,609,0.3,0.5,...,13.5,,63002,914,304,499,0,861,185,679


In [None]:
#Apply the preprocessing functions to food df
food_df["County"] = food_df["County"].apply(lowerandstriptext)
food_df["State"] = food_df["State"].apply(convertstate)
food_df.head()

Unnamed: 0,County,Population,State,Housing Data.Residing in Group Quarters,Housing Data.Total Housing Units,Vehicle Access.1 Mile,Vehicle Access.1/2 Mile,Vehicle Access.10 Miles,Vehicle Access.20 Miles,Low Access Numbers.Children.1 Mile,...,Low Access Numbers.Low Income People.10 Miles,Low Access Numbers.Low Income People.20 Miles,Low Access Numbers.People.1 Mile,Low Access Numbers.People.1/2 Mile,Low Access Numbers.People.10 Miles,Low Access Numbers.People.20 Miles,Low Access Numbers.Seniors.1 Mile,Low Access Numbers.Seniors.1/2 Mile,Low Access Numbers.Seniors.10 Miles,Low Access Numbers.Seniors.20 Miles
0,autaugacounty,54571,al,455,20221,834,1045,222.0,0,9973,...,2307,0,37424.0,49497,5119,0,4393,5935.0,707,0
1,baldwincounty,182265,al,2307,73180,1653,2178,32.0,0,30633,...,846,0,132442.0,165616,2308,0,21828,27241.0,390,0
2,barbourcounty,27457,al,3193,9820,545,742,201.0,0,3701,...,2440,0,,23762,4643,0,2537,3348.0,629,0
3,bibbcounty,22915,al,2224,7953,312,441,0.0,0,4198,...,102,0,17560.0,20989,365,0,2262,2630.0,72,0
4,blountcounty,57322,al,489,21578,752,822,0.0,0,12575,...,0,0,50848.0,54933,0,0,7114,7810.0,0,0


In [None]:
#Verify that no state name failed the lookup (sentinel 'None') before merging
check_state_food_df = (food_df['State'] == 'None').sum()
print(check_state_food_df)

0


In [None]:
#Insert the merge key column (county + state) at index 2 in county_df and index 3 in food_df
county_df.insert(2, "mergedkeycol", county_df['County'] + county_df['State'])
food_df.insert(3, "mergedkeycol", food_df['County'] + food_df['State'])
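
Since the two dataframes differ in row count, it can help to check which keys appear in only one of them before merging. The following is a minimal sketch on hypothetical toy frames (`left`, `right`, and their values are made up); only the column name `mergedkeycol` comes from the notebook.

```python
import pandas as pd

# Hypothetical toy frames; only the key column name matches the notebook
left = pd.DataFrame(
    {"mergedkeycol": ["adacountyid", "adaircountyia", "yumacountyaz"], "a": [1, 2, 3]}
)
right = pd.DataFrame(
    {"mergedkeycol": ["adacountyid", "yumacountyaz", "westoncountywy"], "b": [10, 20, 30]}
)

# An outer merge with indicator=True labels each row by where its key was found
check = left.merge(right, on="mergedkeycol", how="outer", indicator=True)

only_left = check.loc[check["_merge"] == "left_only", "mergedkeycol"].tolist()
only_right = check.loc[check["_merge"] == "right_only", "mergedkeycol"].tolist()

# Duplicate keys would multiply rows in a merge, so count them as well
duplicate_keys = int(left["mergedkeycol"].duplicated().sum())

print(only_left, only_right, duplicate_keys)
```

Keys that show up as `left_only` or `right_only` are candidates for spelling differences between the sources or for counties that exist in only one dataset.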

In [None]:
#Check county df
county_df.head()

Unnamed: 0,County,State,mergedkeycol,Age.Percent 65 and Older,Age.Percent Under 18 Years,Age.Percent Under 5 Years,Education.Bachelor's Degree or Higher,Education.High School or Higher,Employment.Nonemployer Establishments,Ethnicities.American Indian and Alaska Native Alone,...,Population.Population per Square Mile,Sales.Accommodation and Food Services Sales,Sales.Retail Sales,Employment.Firms.Total,Employment.Firms.Women-Owned,Employment.Firms.Men-Owned,Employment.Firms.Minority-Owned,Employment.Firms.Nonminority-Owned,Employment.Firms.Veteran-Owned,Employment.Firms.Nonveteran-Owned
0,abbevillecounty,sc,abbevillecountysc,22.4,19.8,4.7,15.6,81.7,1416,0.3,...,51.8,12507.0,91371,1450,543,689,317,1080,187,1211
1,acadiaparish,la,acadiaparishla,15.8,25.8,6.9,13.3,79.0,4533,0.4,...,94.3,52706.0,602739,4664,1516,2629,705,3734,388,4007
2,accomackcounty,va,accomackcountyva,24.6,20.7,5.6,19.5,81.5,2387,0.7,...,73.8,53568.0,348195,2997,802,1716,335,2560,212,2536
3,adacounty,id,adacountyid,14.9,23.2,5.6,38.5,95.2,41464,0.8,...,372.8,763099.0,5766679,41789,14661,19409,3099,36701,3803,35132
4,adaircounty,ia,adaircountyia,23.0,21.8,5.6,18.5,94.2,609,0.3,...,13.5,,63002,914,304,499,0,861,185,679


In [None]:
#Check food df
food_df.head()

Unnamed: 0,County,Population,State,mergedkeycol,Housing Data.Residing in Group Quarters,Housing Data.Total Housing Units,Vehicle Access.1 Mile,Vehicle Access.1/2 Mile,Vehicle Access.10 Miles,Vehicle Access.20 Miles,...,Low Access Numbers.Low Income People.10 Miles,Low Access Numbers.Low Income People.20 Miles,Low Access Numbers.People.1 Mile,Low Access Numbers.People.1/2 Mile,Low Access Numbers.People.10 Miles,Low Access Numbers.People.20 Miles,Low Access Numbers.Seniors.1 Mile,Low Access Numbers.Seniors.1/2 Mile,Low Access Numbers.Seniors.10 Miles,Low Access Numbers.Seniors.20 Miles
0,autaugacounty,54571,al,autaugacountyal,455,20221,834,1045,222.0,0,...,2307,0,37424.0,49497,5119,0,4393,5935.0,707,0
1,baldwincounty,182265,al,baldwincountyal,2307,73180,1653,2178,32.0,0,...,846,0,132442.0,165616,2308,0,21828,27241.0,390,0
2,barbourcounty,27457,al,barbourcountyal,3193,9820,545,742,201.0,0,...,2440,0,,23762,4643,0,2537,3348.0,629,0
3,bibbcounty,22915,al,bibbcountyal,2224,7953,312,441,0.0,0,...,102,0,17560.0,20989,365,0,2262,2630.0,72,0
4,blountcounty,57322,al,blountcountyal,489,21578,752,822,0.0,0,...,0,0,50848.0,54933,0,0,7114,7810.0,0,0


In [None]:
county_df.dtypes

Unnamed: 0,0
County,object
State,object
mergedkeycol,object
Age.Percent 65 and Older,float64
Age.Percent Under 18 Years,object
Age.Percent Under 5 Years,object
Education.Bachelor's Degree or Higher,float64
Education.High School or Higher,float64
Employment.Nonemployer Establishments,object
Ethnicities.American Indian and Alaska Native Alone,object


In [None]:
#Coerce non-numeric entries (e.g. 'Unavailable') to NaN in columns whose names contain one of these words
words = ['Age', 'Miscellaneous', 'Housing', 'Ethnicities', 'Employment', 'Sales']
coerce_cols = [col for col in county_df.columns if any(word in col for word in words)]
county_df[coerce_cols] = county_df[coerce_cols].apply(pd.to_numeric, errors='coerce')
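
The effect of coercion is easiest to see on a small example. This sketch uses a hypothetical toy frame whose column names imitate the county data; the values are made up, with one `'Unavailable'` entry standing in for the placeholder strings found in the raw file.

```python
import pandas as pd

# Hypothetical toy frame with one placeholder string in a numeric column
toy = pd.DataFrame(
    {
        "County": ["a", "b", "c"],
        "Ethnicities.Asian Alone": ["0.3", "Unavailable", "1.5"],
        "Sales.Retail Sales": [100, 200, 300],
    }
)

# Select columns by name prefix (an assumption made for this sketch)
prefixes = ("Ethnicities", "Sales")
target_cols = [c for c in toy.columns if c.startswith(prefixes)]

# errors="coerce" turns anything that cannot be parsed as a number into NaN
toy[target_cols] = toy[target_cols].apply(pd.to_numeric, errors="coerce")

n_missing = int(toy["Ethnicities.Asian Alone"].isna().sum())
print(toy)
print(n_missing)
```

Comparing `isna().sum()` before and after coercion shows how many placeholder entries each column contained.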

In [None]:
#Iterate through each column for county
for col in county_df.columns:
    # Calculate the fraction of missing values (between 0 and 1)
    missing_fraction = county_df[col].isna().mean()

    if missing_fraction < 0.05:
        # Under 5% missing: drop the affected rows
        county_df = county_df.dropna(subset=[col])
    else:
        # 5% or more missing: fill with the column median (assumes a numeric column)
        median_value = county_df[col].median()
        county_df[col] = county_df[col].fillna(median_value)
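
The drop-or-fill rule used here can be wrapped in a small function and tried on toy data. This is a sketch under stated assumptions: `handle_missing`, the toy frame, and its column names are hypothetical, and the numeric-dtype guard is an addition for safety rather than part of the notebook's loop.

```python
import numpy as np
import pandas as pd


def handle_missing(df: pd.DataFrame, threshold: float = 0.05) -> pd.DataFrame:
    """Drop rows for sparsely missing columns; median-fill numeric columns otherwise."""
    out = df.copy()
    for col in out.columns:
        share = out[col].isna().mean()
        if share == 0:
            continue
        if share < threshold:
            # Few gaps: losing these rows costs little data
            out = out.dropna(subset=[col])
        elif pd.api.types.is_numeric_dtype(out[col]):
            # Many gaps: keep the rows and fill with the column median
            out[col] = out[col].fillna(out[col].median())
    return out


# Hypothetical toy data: 30 rows, one column with a rare gap, one with frequent gaps
toy = pd.DataFrame(
    {
        "rare_gap": [1.0] * 29 + [np.nan],      # 1 of 30 missing, below 5%
        "common_gap": [2.0, np.nan, 4.0] * 10,  # 10 of 30 missing, above 5%
    }
)

cleaned = handle_missing(toy)
print(cleaned.shape, int(cleaned.isna().sum().sum()))
```

Note that the order of columns matters: rows dropped for an early column change the missing share and the median computed for later columns.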

In [None]:
#check if there are any missing values
county_df.isna().sum()

Unnamed: 0,0
County,0
State,0
mergedkeycol,0
Age.Percent 65 and Older,0
Age.Percent Under 18 Years,0
Age.Percent Under 5 Years,0
Education.Bachelor's Degree or Higher,0
Education.High School or Higher,0
Employment.Nonemployer Establishments,0
Ethnicities.American Indian and Alaska Native Alone,0


In [None]:
county_df.head()

Unnamed: 0,County,State,mergedkeycol,Age.Percent 65 and Older,Age.Percent Under 18 Years,Age.Percent Under 5 Years,Education.Bachelor's Degree or Higher,Education.High School or Higher,Employment.Nonemployer Establishments,Ethnicities.American Indian and Alaska Native Alone,...,Population.Population per Square Mile,Sales.Accommodation and Food Services Sales,Sales.Retail Sales,Employment.Firms.Total,Employment.Firms.Women-Owned,Employment.Firms.Men-Owned,Employment.Firms.Minority-Owned,Employment.Firms.Nonminority-Owned,Employment.Firms.Veteran-Owned,Employment.Firms.Nonveteran-Owned
0,abbevillecounty,sc,abbevillecountysc,22.4,19.8,4.7,15.6,81.7,1416.0,0.3,...,51.8,12507.0,91371.0,1450,543,689,317.0,1080,187,1211
1,acadiaparish,la,acadiaparishla,15.8,25.8,6.9,13.3,79.0,4533.0,0.4,...,94.3,52706.0,602739.0,4664,1516,2629,705.0,3734,388,4007
2,accomackcounty,va,accomackcountyva,24.6,20.7,5.6,19.5,81.5,2387.0,0.7,...,73.8,53568.0,348195.0,2997,802,1716,335.0,2560,212,2536
3,adacounty,id,adacountyid,14.9,23.2,5.6,38.5,95.2,41464.0,0.8,...,372.8,763099.0,5766679.0,41789,14661,19409,3099.0,36701,3803,35132
4,adaircounty,ia,adaircountyia,23.0,21.8,5.6,18.5,94.2,609.0,0.3,...,13.5,44317.0,63002.0,914,304,499,0.0,861,185,679


## Cleaning Food_df

In [None]:
food_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3142 entries, 0 to 3141
Data columns (total 26 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   County                                         3142 non-null   object 
 1   Population                                     3142 non-null   int64  
 2   State                                          3142 non-null   object 
 3   mergedkeycol                                   3142 non-null   object 
 4   Housing Data.Residing in Group Quarters        3142 non-null   int64  
 5   Housing Data.Total Housing Units               3142 non-null   int64  
 6   Vehicle Access.1 Mile                          3142 non-null   int64  
 7   Vehicle Access.1/2 Mile                        3142 non-null   int64  
 8   Vehicle Access.10 Miles                        2807 non-null   float64
 9   Vehicle Access.20 Miles                        3142 