# Obtain CDC Social Vulnerability Data 
Read in CDC SVI

This code no longer works. The CDC has removed SVI data from their website in compliance with [Executive Order 14151](https://www.federalregister.gov/documents/2025/01/29/2025-01953/ending-radical-and-wasteful-government-dei-programs-and-preferencing).

## Description of Program
- program:    URSC645_1av1_CDCSVI
- task:       Obtain CDC SVI data
- Version:    2023-03-23
- 2024-02-09: Switched to reading in data from website and saving in Source Folder
- 2024-02-15: Running in class fixing issues
- 2024-02-22: Add code abstraction to bottom of notebook
- 2025-02-04: In compliance with [Executive Order 14151](https://www.federalregister.gov/documents/2025/01/29/2025-01953/ending-radical-and-wasteful-government-dei-programs-and-preferencing) CDC has removed SVI data
- project:    URSC 645
- funding:	  LAUP
- author:     Nathanael Rosenheim

## Step 0: Good Housekeeping
The following section accomplishes these good housekeeping steps:
1. sets up the python environment
2. loads the necessary packages
3. checks the versions of the packages used in this program
4. checks the current working directory
5. sets a program name variable and directory to name and save output files

These are all good steps to accomplish at the start of any program or notebook. 
These steps help ensure that the code will run as expected and that the results can be replicated in the future.

In [1]:
import pandas as pd     # For obtaining and cleaning tabular data
import geopandas as gpd # For obtaining and cleaning spatial data
import os # For saving output to path

import fiona # For reading ESRI geodatabases
import requests # For downloading data from the web 
from zipfile import ZipFile # For unzipping files
from io import BytesIO # For reading in zipped files

In [2]:
import sys
print("Python Version     ", sys.version)
print("pandas version:    ", pd.__version__)
print("geopandas version: ", gpd.__version__)

Python Version      3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:06:27) [MSC v.1942 64 bit (AMD64)]
pandas version:     2.2.3
geopandas version:  1.0.1


In [4]:
# Get information on current working directory (getcwd)
os.getcwd()

'c:\\Users\\nathanael99\\MyProjects\\GitHub\\URSC645\\WorkNPR'

In [3]:
# Store Program Name for output files to have the same name
# (later cells use programname as the output directory)
programname = "URSC645_1av2_SVI_2024-02-22"
# Make directory to save output
if not os.path.exists(programname):
    os.mkdir(programname)

## Step 1: Obtain Data

To obtain data from the web we will need to find the URL for the files.

From the CDC website we can find the data at the following URL:

https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html

I found the file URL by watching where the file was downloaded from after clicking the "GO" button on the website.

The CDC provides the data as a CSV and as an ESRI GeoDatabase. The following cells provide examples for reading in the CSV and then the ESRI GeoDatabase.

In [5]:
# CSV location
csv_file_url = "https://svi.cdc.gov/Documents/Data/2020/csv/states/"
csv_filename = "Iowa.csv" # selecting Iowa as an example because it is a small file
csv_filepath = os.path.join(csv_file_url, csv_filename)
# Read in the csv using pandas
cdcsvi_df = pd.read_csv(csv_filepath)

HTTPError: HTTP Error 404: Not Found

In [6]:
# Geodatabases are more complex than CSV files.
# Thank you Chat-GPT 4 for developing the following code: 
# https://chat.openai.com/g/g-cZDUaJHDk-pyncoda/c/06423213-e458-406a-8cbc-c4c8a7d0bb80 
# URL of the .zip file containing the ESRI File Geodatabase
# ESRI Geodatabase location
geo_file_url = "https://svi.cdc.gov/Documents/Data/2020/db/states/"
geo_filename = "Iowa.zip"
geo_filepath = os.path.join(geo_file_url, geo_filename)
print(geo_filepath)
# Download the zip file
response = requests.get(geo_filepath)
zipfile = ZipFile(BytesIO(response.content))

# Extract the .gdb file to a local directory
local_directory_path = programname
zipfile.extractall(local_directory_path)

https://svi.cdc.gov/Documents/Data/2020/db/states/Iowa.zip


BadZipFile: File is not a zip file
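
The `BadZipFile` error above indicates that the server returned something other than a zip archive, most likely an error page. The following is a minimal sketch of a guard that checks the downloaded bytes before unzipping. The helper name `open_zip_if_valid` is hypothetical and is not part of the original notebook; the in-memory archive only stands in for a real download.

```python
from io import BytesIO
from zipfile import ZipFile, is_zipfile


def open_zip_if_valid(content: bytes):
    """Return a ZipFile for the downloaded bytes, or None if they are not a zip archive.

    Hypothetical helper for illustration only.
    """
    buffer = BytesIO(content)
    # is_zipfile accepts a file-like object and looks for the zip signature
    if not is_zipfile(buffer):
        return None
    buffer.seek(0)
    return ZipFile(buffer)


# Build a small zip archive in memory to stand in for a real download
good_buffer = BytesIO()
with ZipFile(good_buffer, "w") as archive:
    archive.writestr("readme.txt", "example")

archive_or_none = open_zip_if_valid(good_buffer.getvalue())
print(archive_or_none is not None)                         # a real archive
print(open_zip_if_valid(b"<html>404 Not Found</html>"))    # an error page
```

In the download cell, `response.content` could be passed to such a helper so that a missing file produces a clear message instead of a `BadZipFile` traceback.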

In [7]:

# Get file path for the .gdb file
gdb_path = os.path.join(local_directory_path, "SVI2020_IOWA_tract.gdb")

# Use GeoPandas to read a specific layer from the .gdb
# If you're not sure about the layer names, you can list them using fiona
import fiona
layers = fiona.listlayers(gdb_path)
for layer in layers:
    print(layer)


SVI2020_IOWA_tract


In [8]:
# Read in layer SVI2020_IOWA_tract from the .gdb using GeoPandas 
cdcsvi_gdf = gpd.read_file(gdb_path, layer='SVI2020_IOWA_tract')

In [9]:
cdcsvi_gdf.head()

Unnamed: 0,ST,STATE,ST_ABBR,STCNTY,COUNTY,FIPS,LOCATION,AREA_SQMI,E_TOTPOP,M_TOTPOP,...,MP_AIAN,EP_NHPI,MP_NHPI,EP_TWOMORE,MP_TWOMORE,EP_OTHERRACE,MP_OTHERRACE,Shape_Length,Shape_Area,geometry
0,19,Iowa,IA,19001,Adair,19001960100,"Census Tract 9601, Adair County, Iowa",273.329109,2696,191,...,0.8,0.0,0.8,0.9,0.8,0.0,0.8,1.269744,0.076358,"MULTIPOLYGON (((-94.70057 41.48250, -94.70053 ..."
1,19,Iowa,IA,19001,Adair,19001960200,"Census Tract 9602, Adair County, Iowa",257.170375,1591,154,...,1.4,0.0,1.4,0.0,1.4,1.4,1.7,1.516442,0.07166,"MULTIPOLYGON (((-94.70059 41.20060, -94.70054 ..."
2,19,Iowa,IA,19001,Adair,19001960300,"Census Tract 9603, Adair County, Iowa",38.770927,2761,178,...,0.8,0.0,0.8,2.0,1.2,0.0,0.8,0.522311,0.010845,"MULTIPOLYGON (((-94.58517 41.30202, -94.57120 ..."
3,19,Iowa,IA,19003,Adams,19003950100,"Census Tract 9501, Adams County, Iowa",339.498225,1616,126,...,1.4,0.0,1.4,0.1,0.3,0.0,1.4,1.561917,0.094708,"MULTIPOLYGON (((-94.92818 40.98840, -94.92773 ..."
4,19,Iowa,IA,19003,Adams,19003950200,"Census Tract 9502, Adams County, Iowa",83.934528,2017,126,...,0.7,0.0,1.1,2.3,1.4,0.0,1.1,0.758603,0.023308,"MULTIPOLYGON (((-94.92860 40.90177, -94.92806 ..."


## Step 2: Clean Data

In [10]:
# Select one county by FIPS Code
county = "19001" # Adair County
cdcsvi_county_gdf = cdcsvi_gdf[cdcsvi_gdf["STCNTY"] == county]
cdcsvi_county_gdf.head()

Unnamed: 0,ST,STATE,ST_ABBR,STCNTY,COUNTY,FIPS,LOCATION,AREA_SQMI,E_TOTPOP,M_TOTPOP,...,MP_AIAN,EP_NHPI,MP_NHPI,EP_TWOMORE,MP_TWOMORE,EP_OTHERRACE,MP_OTHERRACE,Shape_Length,Shape_Area,geometry
0,19,Iowa,IA,19001,Adair,19001960100,"Census Tract 9601, Adair County, Iowa",273.329109,2696,191,...,0.8,0.0,0.8,0.9,0.8,0.0,0.8,1.269744,0.076358,"MULTIPOLYGON (((-94.70057 41.48250, -94.70053 ..."
1,19,Iowa,IA,19001,Adair,19001960200,"Census Tract 9602, Adair County, Iowa",257.170375,1591,154,...,1.4,0.0,1.4,0.0,1.4,1.4,1.7,1.516442,0.07166,"MULTIPOLYGON (((-94.70059 41.20060, -94.70054 ..."
2,19,Iowa,IA,19001,Adair,19001960300,"Census Tract 9603, Adair County, Iowa",38.770927,2761,178,...,0.8,0.0,0.8,2.0,1.2,0.0,0.8,0.522311,0.010845,"MULTIPOLYGON (((-94.58517 41.30202, -94.57120 ..."
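
The county filter above works because the first five digits of a tract `FIPS` code are the state and county codes stored in `STCNTY`. The sketch below splits a tract code into its parts. The function name `split_tract_fips` is hypothetical, and the zero-padding step assumes the code may have been read as a number (for example from a CSV), which would drop a leading zero.

```python
def split_tract_fips(fips) -> dict:
    """Split an 11-digit census tract FIPS code into its parts.

    Hypothetical helper for illustration only.
    """
    # Restore a leading zero that is lost if the code was read as a number
    fips = str(fips).zfill(11)
    return {
        "state": fips[:2],    # 2-digit state code
        "county": fips[:5],   # state code + 3-digit county code (matches STCNTY)
        "tract": fips[5:],    # 6-digit tract code
    }


parts = split_tract_fips("19001960100")
print(parts)
# {'state': '19', 'county': '19001', 'tract': '960100'}
```

Comparing codes as strings, as the cell above does, avoids mismatches between text and numeric versions of the same code.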


## Step 3: Explore Data

In [11]:
cdcsvi_gdf[["FIPS"]].describe()

Unnamed: 0,FIPS
count,896
unique,896
top,19001960100
freq,1


In [12]:
cdcsvi_county_gdf[["FIPS"]].describe()

Unnamed: 0,FIPS
count,3
unique,3
top,19001960100
freq,1


## Output files

In [13]:
# Save Work as CSV
savefile = programname+"/"+programname+".csv"
cdcsvi_county_gdf.to_csv(savefile, index=False)

In [14]:
# Shapefiles can only have 10 character column names
# list all columns with names longer than 10 characters
long_columns = [col for col in cdcsvi_county_gdf.columns if len(col) > 10]
long_columns

['EPL_UNINSUR',
 'E_OTHERRACE',
 'M_OTHERRACE',
 'EP_OTHERRACE',
 'MP_OTHERRACE',
 'Shape_Length']

In [15]:
# list the columns names truncated to 10 characters
short_columns = [col[:10] for col in long_columns]
print("The column names will be truncated to 10 characters:")
short_columns

The column names will be truncated to 10 characters:


['EPL_UNINSU',
 'E_OTHERRAC',
 'M_OTHERRAC',
 'EP_OTHERRA',
 'MP_OTHERRA',
 'Shape_Leng']
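
Truncation can be a problem if two column names become identical after they are shortened. The following sketch checks for such collisions before saving. The function name `truncated_name_collisions` is hypothetical, and `EP_OTHERRATE` is a made-up column name used only to show a collision. In practice the check should run on all column names, not just the long ones.

```python
from collections import Counter


def truncated_name_collisions(columns, max_length=10):
    """Return the truncated names that would be shared by more than one column.

    Hypothetical helper for illustration only.
    """
    short_names = [col[:max_length] for col in columns]
    counts = Counter(short_names)
    return sorted(name for name, count in counts.items() if count > 1)


long_columns = ['EPL_UNINSUR', 'E_OTHERRACE', 'M_OTHERRACE',
                'EP_OTHERRACE', 'MP_OTHERRACE', 'Shape_Length']

# The long column names listed above stay unique after truncation
print(truncated_name_collisions(long_columns))
# []

# A made-up pair of names that collide after truncation
print(truncated_name_collisions(["EP_OTHERRACE", "EP_OTHERRATE"]))
# ['EP_OTHERRA']
```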

In [16]:
# Save Work as Shapefile
savefile = programname+"/"+programname+".shp"
cdcsvi_county_gdf.to_file(savefile)



## Code Abstraction
Code abstraction refers to the process of making code more general so that it can be used in more situations.
The code provided above works for the state of Iowa. 
The code below shows how to abstract the code to work for any state.
This is accomplished by using a function that uses the state name as an argument.

In [17]:
def download_cdc_data(cdc_url: str = "https://svi.cdc.gov/Documents/Data/", 
                      year: str = "2020", 
                      state: str = "Texas", 
                      file_type: str = "csv", 
                      geolevel: str = "tract"):
    '''
    Data from the Centers for Disease Control and Prevention (CDC) Social Vulnerability Index (SVI) 
    can be manually obtained from the following URL:
    https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html

    This code attempts to automate the process of downloading the data from the CDC website.
    '''

    # remove spaces from state name
    state = state.replace(" ", "")
    
    # CSV location
    if geolevel == "tract":
        csv_file_url = f"{cdc_url}{year}/{file_type}/states/"
        csv_filename = f"{state}.{file_type}"
        csv_filepath = os.path.join(csv_file_url, csv_filename)
        print(f"Downloading csv file from: {csv_filepath}")
    elif geolevel == "county":
        csv_file_url = f"{cdc_url}{year}/{file_type}/states_counties/"
        csv_filename = f"{state}_county.{file_type}"
        csv_filepath = os.path.join(csv_file_url, csv_filename)
        print(f"Downloading csv file from: {csv_filepath}")
    else:
        print("geolevel must be either 'tract' or 'county'") 
        return None       
    # Read in the csv using pandas
    cdcsvi_df = pd.read_csv(csv_filepath)

    return cdcsvi_df

In [18]:
# look for common elements in the code
cdc_url = "https://svi.cdc.gov/Documents/Data/"
year = "2020" # possible years 2020, 2018, 2016, 2014, 2010, 2000
state = "North Dakota" 
file_type = "csv"
geolevel = "tract" # possible levels are "tract" or "county"
# Test the code
cdcsvi_df = download_cdc_data(cdc_url = cdc_url, 
                              year = year,
                              state = state,
                              file_type = file_type, 
                              geolevel = geolevel)

# Save Work as CSV
state_save = state.replace(" ", "")
savefile = f"{programname}/CDCSVI_{year}_{state_save}_{geolevel}.csv"
cdcsvi_df.to_csv(savefile, index=False)

Downloading csv file from: https://svi.cdc.gov/Documents/Data/2020/csv/states/NorthDakota.csv
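
The function above builds the URL with `os.path.join`, which is meant for file-system paths. It gives the expected URL here because the base URL ends with a slash, but on Windows it inserts a backslash when the first part has no trailing slash. The following sketch builds the same URL with `urllib.parse.urljoin` from the standard library. The function name `build_svi_csv_url` is hypothetical, and the folder layout is taken from the URLs used in this notebook.

```python
from urllib.parse import urljoin


def build_svi_csv_url(base_url: str, year: str, state: str, geolevel: str = "tract") -> str:
    """Build the CSV URL for a state without downloading anything.

    Hypothetical helper for illustration only. base_url must end with "/".
    """
    # remove spaces from state name
    state = state.replace(" ", "")
    if geolevel == "tract":
        relative_path = f"{year}/csv/states/{state}.csv"
    elif geolevel == "county":
        relative_path = f"{year}/csv/states_counties/{state}_county.csv"
    else:
        raise ValueError("geolevel must be either 'tract' or 'county'")
    return urljoin(base_url, relative_path)


url = build_svi_csv_url("https://svi.cdc.gov/Documents/Data/", "2020", "North Dakota")
print(url)
# https://svi.cdc.gov/Documents/Data/2020/csv/states/NorthDakota.csv
```

Separating URL construction from the download also makes it possible to test the abstraction without network access.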


In [19]:
cdcsvi_df.head()

Unnamed: 0,ST,STATE,ST_ABBR,STCNTY,COUNTY,FIPS,LOCATION,AREA_SQMI,E_TOTPOP,M_TOTPOP,...,EP_ASIAN,MP_ASIAN,EP_AIAN,MP_AIAN,EP_NHPI,MP_NHPI,EP_TWOMORE,MP_TWOMORE,EP_OTHERRACE,MP_OTHERRACE
0,38,North Dakota,ND,38001,Adams,38001965600,"Census Tract 9656, Adams County, North Dakota",987.546174,2271,0,...,2.3,1.6,0.4,0.4,0.0,1.0,0.5,0.6,0.0,1.0
1,38,North Dakota,ND,38003,Barnes,38003967900,"Census Tract 9679, Barnes County, North Dakota",623.064619,1751,151,...,0.0,1.3,0.5,0.9,0.0,1.3,0.8,1.3,0.0,1.3
2,38,North Dakota,ND,38003,Barnes,38003968000,"Census Tract 9680, Barnes County, North Dakota",859.647942,2395,154,...,0.0,0.9,0.3,0.7,0.2,0.4,5.7,6.1,0.0,0.9
3,38,North Dakota,ND,38003,Barnes,38003968200,"Census Tract 9682, Barnes County, North Dakota",5.848626,2519,248,...,1.9,2.2,0.4,0.9,0.0,0.9,0.6,0.7,0.0,0.9
4,38,North Dakota,ND,38003,Barnes,38003968300,"Census Tract 9683, Barnes County, North Dakota",2.99642,3927,270,...,0.8,0.9,3.0,2.0,0.0,0.6,5.3,3.6,0.0,0.6


### Code abstraction for geodatabases
The process for obtaining a geodatabase is significantly more complicated than obtaining a CSV file.

A simpler approach might be to download the shapefiles and merge them with the CSV file. 
This approach, however, would require a dictionary to find the appropriate shapefile for each state on www2.census.gov.
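
As a rough sketch of the dictionary idea, the code below maps a few state names to their two-digit state FIPS codes and builds a shapefile URL. The dictionary is deliberately partial, the function name `tiger_tract_url` is hypothetical, and the URL pattern is an assumption based on the Census Bureau's TIGER/Line naming convention; it should be checked against www2.census.gov before use.

```python
# Partial lookup of state name to two-digit state FIPS code (illustration only)
STATE_FIPS = {
    "Arkansas": "05",
    "Iowa": "19",
    "North Dakota": "38",
    "Texas": "48",
}


def tiger_tract_url(state: str, year: str = "2020") -> str:
    """Build the assumed URL of the TIGER/Line tract shapefile for a state.

    Hypothetical helper; the URL pattern is an assumption, not verified here.
    """
    state_fips = STATE_FIPS[state]
    return (f"https://www2.census.gov/geo/tiger/TIGER{year}/TRACT/"
            f"tl_{year}_{state_fips}_tract.zip")


print(tiger_tract_url("Iowa"))
# https://www2.census.gov/geo/tiger/TIGER2020/TRACT/tl_2020_19_tract.zip
```

After reading such a shapefile, the tract identifier in the shapefile could be merged with the `FIPS` column of the CSV, assuming both are stored as 11-character strings.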

In [20]:
def get_gdb_layer(local_directory_path: str = "", 
                  state: str = 'Texas', 
                  year: str = '2020', 
                  geolevel: str = "tract"):
    """
    Get the layer name from the .gdb file
    """
    # Dynamically get the directory name of the extracted .gdb
    if local_directory_path == "":
        local_directory_path = os.getcwd()
    
    extracted_folders = [f for f in os.listdir(local_directory_path)
                        if os.path.isdir(os.path.join(local_directory_path, f)) and f.endswith('.gdb')]
    print(extracted_folders)
    # set layer to none 
    layer = None
    for folder in extracted_folders:
        # For consistency compare the folder name in all caps:
        # "tract" is lower case in folder names but "COUNTY" is upper case
        folder_upper = folder.upper()
        print(folder_upper)
        # check if the folder has the .gdb extension
        # and has the state name in it
        # and has the year in it
        # and has the geolevel in it
        condition1 = folder_upper.endswith('.GDB')
        # state will be in all caps
        condition2 = state.upper() in folder_upper
        condition3 = year in folder_upper
        condition4 = geolevel.upper() in folder_upper
        print(f"looking for a .gdb folder with {state} and {year} for {geolevel}")
        if condition1 and condition2 and condition3 and condition4:
            print(f"found a .gdb folder with {state} and {year} for {geolevel}")
            # Use the folder name exactly as it exists on disk so the path
            # also resolves on case-sensitive file systems
            gdb_path = os.path.join(local_directory_path, folder)

            # Use GeoPandas to read a specific layer from the .gdb
            # If you're not sure about the layer names, you can list them using fiona
            layers = fiona.listlayers(gdb_path)
            for layer in layers:
                print(layer)
            
            # break for loop
            return gdb_path, layer

    if layer is None:        
        print("No .gdb folder found after extraction.")

    return None

def download_CDCSVI_ESRI_geodatabase(cdc_url: str = "https://svi.cdc.gov/Documents/Data/", 
                                     year: str = "2020", 
                                     state: str = "Texas", 
                                     file_type: str = "db", 
                                     local_directory_path: str = "", 
                                     geolevel: str = "tract"):
    '''
    Geodatabases are more complex than CSV files.
    Thank you Chat-GPT 4 for help developing the initial code: 
    https://chat.openai.com/g/g-cZDUaJHDk-pyncoda/c/06423213-e458-406a-8cbc-c4c8a7d0bb80 
    URL of the .zip file containing the ESRI File Geodatabase
    ESRI Geodatabase location
    '''


    if geolevel == "tract":
        geo_file_url = f"{cdc_url}{year}/{file_type}/states/"
        # ESRI Geodatabase file types are .zip
        geo_filename = f"{state}.zip"
        geo_filepath = os.path.join(geo_file_url, geo_filename)
        print(geo_filepath)
    elif geolevel == "county":
        geo_file_url = f"{cdc_url}{year}/{file_type}/states_counties/"
        # ESRI Geodatabase file types are .zip
        geo_filename = f"{state}_county.zip"
        geo_filepath = os.path.join(geo_file_url, geo_filename)
        print(geo_filepath)        
    else:
        print("geolevel must be either 'tract' or 'county'") 
        return None
    
    # Download the zip file
    response = requests.get(geo_filepath)
    zipfile = ZipFile(BytesIO(response.content))

    # Extract the .gdb file to a local directory
    if local_directory_path == "":
        local_directory_path = os.getcwd()
    
    zipfile.extractall(local_directory_path)
    gdb_path, layer = get_gdb_layer(local_directory_path, state, year, geolevel)
    # Read in the layer found in the .gdb (for example SVI2020_TEXAS_tract) using GeoPandas 
    cdcsvi_gdf = gpd.read_file(gdb_path, layer=layer)

    return cdcsvi_gdf



In [21]:
cdcsvi_gdf = download_CDCSVI_ESRI_geodatabase()

https://svi.cdc.gov/Documents/Data/2020/db/states/Texas.zip
['SVI2020_TEXAS_tract.gdb']
SVI2020_TEXAS_TRACT.GDB
looking for a .gdb folder with Texas and 2020 for tract
found a .gdb folder with Texas and 2020 for tract
SVI2020_TEXAS_tract


In [22]:
# look for common elements in the code
cdc_url = "https://svi.cdc.gov/Documents/Data/"
year = "2000" # possible years 2020, 2018, 2016, 2014, 2010, 2000
state = "Iowa" # Note if the State has a space in the name, do not include the space
file_type = "db"
local_directory_path = programname
geolevel = "county" # possible geolevels are "tract" or "county"

# Test the code
cdcsvi_gdf = download_CDCSVI_ESRI_geodatabase(cdc_url, year, state, file_type, local_directory_path, geolevel)

https://svi.cdc.gov/Documents/Data/2000/db/states_counties/Iowa_county.zip
['SVI2020_ARKANSAS_tract.gdb', 'SVI2018_ARKANSAS_tract.gdb', 'SVI2018_TEXAS_tract.gdb', 'SVI2014_ARKANSAS_tract.gdb', 'SVI2020_IOWA_tract.gdb', 'SVI2000_IOWA_tract.gdb', 'SVI2000_IOWA_COUNTY.gdb', 'SVI2020_TEXAS_tract.gdb']
SVI2020_ARKANSAS_TRACT.GDB
looking for a .gdb folder with Iowa and 2000 for county
SVI2018_ARKANSAS_TRACT.GDB
looking for a .gdb folder with Iowa and 2000 for county
SVI2018_TEXAS_TRACT.GDB
looking for a .gdb folder with Iowa and 2000 for county
SVI2014_ARKANSAS_TRACT.GDB
looking for a .gdb folder with Iowa and 2000 for county
SVI2020_IOWA_TRACT.GDB
looking for a .gdb folder with Iowa and 2000 for county
SVI2000_IOWA_TRACT.GDB
looking for a .gdb folder with Iowa and 2000 for county
SVI2000_IOWA_COUNTY.GDB
looking for a .gdb folder with Iowa and 2000 for county
found a .gdb folder with Iowa and 2000 for county
SVI2000_IOWA_COUNTY
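
The log above shows the folder search checking whether the state name appears anywhere in the folder name. A substring check like this can match the wrong folder: "KANSAS" also appears in "ARKANSAS". The sketch below compares the whole folder name instead. The function name `find_gdb_folder` is hypothetical, and the `SVI{year}_{STATE}_{geolevel}.gdb` naming pattern is inferred from the folder names printed above.

```python
def find_gdb_folder(folder_names, state: str, year: str, geolevel: str):
    """Return the folder whose name matches the expected pattern, ignoring case.

    Hypothetical helper for illustration only.
    """
    # Expected name, for example SVI2000_IOWA_COUNTY.GDB
    expected = f"SVI{year}_{state.replace(' ', '')}_{geolevel}.gdb".upper()
    for folder in folder_names:
        if folder.upper() == expected:
            # Return the name as it exists on disk
            return folder
    return None


folders = ['SVI2020_ARKANSAS_tract.gdb', 'SVI2018_ARKANSAS_tract.gdb',
           'SVI2018_TEXAS_tract.gdb', 'SVI2014_ARKANSAS_tract.gdb',
           'SVI2020_IOWA_tract.gdb', 'SVI2000_IOWA_tract.gdb',
           'SVI2000_IOWA_COUNTY.gdb', 'SVI2020_TEXAS_tract.gdb']

print(find_gdb_folder(folders, "Iowa", "2000", "county"))
# SVI2000_IOWA_COUNTY.gdb

print(find_gdb_folder(folders, "Kansas", "2020", "tract"))
# None
```

Because the function works on a plain list of names, it can be tested without downloading or extracting any files.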


In [23]:
cdcsvi_gdf.head()

Unnamed: 0,STATE_FIPS,CNTY_FIPS,STCOFIPS,STATE_NAME,STATE_ABBR,COUNTY,G1V1R,G1V2R,G1V3R,G1V4R,...,G3V1N,G3V2N,G4V1N,G4V2N,G4V3N,G4V4N,G4V5N,Shape_Length,Shape_Area,geometry
0,19,1,19001,Iowa,IA,Adair,0.0765,0.0224,17262.0,0.122,...,111,19,32,204,11,199,198,1.614211,0.15888,"MULTIPOLYGON (((-94.65782 41.50408, -94.65599 ..."
1,19,3,19003,Iowa,IA,Adams,0.0929,0.0323,15550.0,0.1546,...,43,24,37,154,14,97,121,1.43066,0.117986,"MULTIPOLYGON (((-94.81487 41.15832, -94.71956 ..."
2,19,5,19005,Iowa,IA,Allamakee,0.0957,0.0254,16599.0,0.1856,...,747,495,106,1320,137,338,414,1.845347,0.189205,"MULTIPOLYGON (((-91.36933 43.50083, -91.29205 ..."
3,19,7,19007,Iowa,IA,Appanoose,0.1454,0.0341,14644.0,0.1855,...,405,29,225,695,84,442,220,1.540748,0.142593,"MULTIPOLYGON (((-92.63917 40.89890, -92.63918 ..."
4,19,9,19009,Iowa,IA,Audubon,0.0773,0.0261,17489.0,0.1747,...,83,4,9,52,30,153,189,1.50097,0.124247,"MULTIPOLYGON (((-95.09294 41.86338, -94.97641 ..."
