# Visualizing COVID-19 Data at the State and County Levels in Python
## Part I: Downloading and Organizing Data

From casual observation, I surmise that the widespread stay-at-home orders initiated in March 2020 have left data scientists with a bit of extra time. With each passing day, I find new sources for COVID-19 data and data visualizations. I have written before about the [proper](https://www.ndsu.edu/centers/pcpe/news/detail/58432/) and [improper](https://www.aier.org/article/visualizations-are-powerful-but-often-misleading/) uses of data. In this post, my purpose is pedagogical. I intend to teach the reader how to download and organize COVID-19 data and how to honestly and meaningfully visualize this data.

First, a confession. I am a self-taught programmer. Like many of us, much of what I write can be described as _spaghetti code_, at least initially. I don't thoroughly plan a program before I write it. I roll up my sleeves and get to coding. This has its benefits. And, since I'm not writing my code for commercial use, I am able to efficiently produce results. 

One benefit of building code on the fly is that you may not know at the start of a project what sorts of qualities will be useful to include. Spaghetti code can be repurposed, usually by creating a new copy of the script and making some marginal adjustments. However, the more spaghetti code you write, the greater the difficulty of maintaining quality output. 

When I find myself returning time and again to a particular template, I eventually consolidate the scripts that I have developed so as to minimize costs of editing code by allowing one script to produce a variety of outputs. The script in this example that is the product of precisely this process of development and revision.


### Downloading the COVID-19 Data

We will use two datasets. First, we will import a shapefile to use with _geopandas_, which we will later use to generate a county level map that tracks COVID-19. the shapefile is provide for you in the Github folder housing this post. You can also download shapefiles from the U.S. Census [website](https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html). Then, we will download Johns Hopkins's COVID-19 data from the Associated Press's [account](https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker) at data.world using their [Python Module](https://data.world/integrations/python). Follow [these instructions](https://github.com/datadotworld/data.world-py/) to install the  _datadotworld_ module and access their API.

Once we have installed the _datadotworld_ module, we can get to work. First, we will need to import our modules. While not all of these modules will be used in _Part I_ of this series, it will be convenient to import them now so that we can use them later.

In [29]:
#createCOVID19StateAndCountyVisualization.py
import geopandas
import numpy as np
import pandas as pd
# We won't actually use datetime directly. Since the dataframe index will use data formatted as datetime64,
# I import it in case I need to use the datetime module to troubleshoot later 
import datetime
# you could technically call many of the submodules from matplotlib using mpl., but for convenience
# we explicitly import submodules
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.ticker as mtick
import datadotworld as dw

Now we are ready to import the shapefile and download the COVID-19 data. Let's start by creating a function to import the shapefile.

In [41]:
def import_geo_data(filename, index_col = "Date", FIPS_name = "FIPS"):
    # import county level shapefile
    map_data = geopandas.read_file(filename = filename,                                   
                                   index_col = index_col)
    # rename fips code to match variable name in COVID-19 data
    map_data.rename(columns={"State":"state"},
                    inplace = True)
    # Combine statefips and county fips to create a single fips value
    # that identifies each particular county without referencing the 
    # state separately
    # Warning: We will use .loc[] in order to avoid calling a copy of a slice
    map_data[FIPS_name] = map_data["STATEFP"].astype(str) + \
        map_data["COUNTYFP"].astype(str)
    map_data[FIPS_name] = map_data[FIPS_name].astype(np.int64)
    # set FIPS as index
    map_data.set_index(FIPS_name, inplace=True)
    
    return map_data

Next we create a function to download the COVID-19 data.

In [45]:
def import_covid_data(filename, FIPS_name):
    # Load COVID19 county data using datadotworld API
    # Data provided by Johns Hopkins, file provided by Associated Press
    dataset = dw.load_dataset("associatedpress/johns-hopkins-coronavirus-case-tracker",
                             auto_update=True)
    # the dataset includes multiple dataframes. We will only use #2
    covid_data = dataset.dataframes["2_cases_and_deaths_by_county_timeseries"]
    # Include only oberservation for political entities within states
    # i.e., not territories, etc...
    covid_data = covid_data[covid_data[FIPS_name] < 57000]
    # Transform FIPS codes into integers (not floats)
    # use .loc to avoid warning...
    covid_data.loc[:, fips_name] = covid_data[FIPS_name].astype(int)
    covid_data.set_index([FIPS_name, "date"], inplace = True)
    # Prepare a column for state abbreviations. We will draw these from a
    # dictionary created in the next step.
    covid_data.loc[:, "state_abr"] = ""
    for state, abr in state_dict.items():
        covid_data.loc[covid_data["state"] == state, "state_abr"] = abr

    return covid_data

In [46]:
# I include this dictionary to convenienlty cross reference state names and
# state abbreviations.
state_dict = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ',
    'Arkansas': 'AR', 'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT', 
    'Delaware': 'DE', 'District of Columbia': 'DC', 'Florida': 'FL', 
    'Georgia': 'GA', 'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL',
    'Indiana': 'IN', 'Iowa': 'IA','Kansas': 'KS', 'Kentucky': 'KY',
    'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD', 'Massachusetts': 'MA',
    'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO',
    'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH',
    'New Jersey': 'NJ', 'New Mexico': 'NM', 'New York': 'NY', 'North Carolina': 'NC',
    'North Dakota': 'ND', 'Ohio': 'OH', 'Oklahoma': 'OK',
    'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI',
    'South Carolina': 'SC', 'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX',
    'Utah': 'UT', 'Vermont': 'VT', 'Virginia': 'VA',
    'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY'
}

# When we complete our script, we will add an if statement that ensures that we
# only download the data one time. This will prevent us from rudely wasting 
# bandwidth from data.world.
fips_name = "fips_code"
covid_filename = "COVID19DataAP.csv"
# rename_FIPS matches map_data FIPS with COVID19 FIPS name
map_data = import_geo_data(filename = "countiesWithStatesAndPopulation.shp",
                index_col = "Date", FIPS_name= fips_name)
covid_data = import_covid_data(filename = covid_filename, FIPS_name = fips_name)

Call both dataframes in the  console to check that everything loaded properly.

In [47]:
map_data

Unnamed: 0_level_0,STATEFP,COUNTYFP,COUNTYNS,AFFGEOID,NAME,LSAD,ALAND,AWATER,Population,state,geometry
fips_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
21007,21,007,00516850,0500000US21007,Ballard,06,639387454,69473325,7888.0,Kentucky,"POLYGON ((-89.18137 37.04630, -89.17938 37.053..."
21017,21,017,00516855,0500000US21017,Bourbon,06,750439351,4829777,19788.0,Kentucky,"POLYGON ((-84.44266 38.28324, -84.44114 38.283..."
21031,21,031,00516862,0500000US21031,Butler,06,1103571974,13943044,12879.0,Kentucky,"POLYGON ((-86.94486 37.07341, -86.94346 37.074..."
21065,21,065,00516879,0500000US21065,Estill,06,655509930,6516335,14106.0,Kentucky,"POLYGON ((-84.12662 37.64540, -84.12483 37.646..."
21069,21,069,00516881,0500000US21069,Fleming,06,902727151,7182793,14581.0,Kentucky,"POLYGON ((-83.98428 38.44549, -83.98246 38.450..."
21093,21,093,00516893,0500000US21093,Hardin,06,1614569777,17463238,110958.0,Kentucky,"POLYGON ((-86.27756 37.58881, -86.27420 37.589..."
21099,21,099,00516896,0500000US21099,Hart,06,1068530028,13692536,19035.0,Kentucky,"POLYGON ((-86.16112 37.35080, -86.15845 37.351..."
21131,21,131,00516912,0500000US21131,Leslie,06,1038206077,9189732,9877.0,Kentucky,"POLYGON ((-83.55310 37.07928, -83.53528 37.103..."
21151,21,151,00516919,0500000US21151,Madison,06,1132729653,15306635,92987.0,Kentucky,"POLYGON ((-84.52564 37.76950, -84.52350 37.771..."
21155,21,155,00516921,0500000US21155,Marion,06,888463701,9891797,19273.0,Kentucky,"POLYGON ((-85.52129 37.55434, -85.50452 37.584..."


In [48]:
covid_data

Unnamed: 0_level_0,Unnamed: 1_level_0,uid,location_type,location_name,state,total_population,cumulative_cases,cumulative_cases_per_100_000,cumulative_deaths,cumulative_deaths_per_100_000,new_cases,new_deaths,new_cases_per_100_000,new_deaths_per_100_000,new_cases_rolling_7_day_avg,new_deaths_rolling_7_day_avg,state_abr
fips_code,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
1001,2020-01-22,84001001,county,Autauga,Alabama,55200.0,0,0.00,0,0.0,,,,,,,AL
1001,2020-01-23,84001001,county,Autauga,Alabama,55200.0,0,0.00,0,0.0,0.0,0.0,0.00,0.0,,,AL
1001,2020-01-24,84001001,county,Autauga,Alabama,55200.0,0,0.00,0,0.0,0.0,0.0,0.00,0.0,,,AL
1001,2020-01-25,84001001,county,Autauga,Alabama,55200.0,0,0.00,0,0.0,0.0,0.0,0.00,0.0,,,AL
1001,2020-01-26,84001001,county,Autauga,Alabama,55200.0,0,0.00,0,0.0,0.0,0.0,0.00,0.0,,,AL
1001,2020-01-27,84001001,county,Autauga,Alabama,55200.0,0,0.00,0,0.0,0.0,0.0,0.00,0.0,0.000000,0.0,AL
1001,2020-01-28,84001001,county,Autauga,Alabama,55200.0,0,0.00,0,0.0,0.0,0.0,0.00,0.0,0.000000,0.0,AL
1001,2020-01-29,84001001,county,Autauga,Alabama,55200.0,0,0.00,0,0.0,0.0,0.0,0.00,0.0,0.000000,0.0,AL
1001,2020-01-30,84001001,county,Autauga,Alabama,55200.0,0,0.00,0,0.0,0.0,0.0,0.00,0.0,0.000000,0.0,AL
1001,2020-01-31,84001001,county,Autauga,Alabama,55200.0,0,0.00,0,0.0,0.0,0.0,0.00,0.0,0.000000,0.0,AL


Next we will generate state level by summing the county level data. This is largely a pedagogical exercise as we could download state data directly. It is helpful, however, to understand how the .sum() and .groupby() function work in _pandas_.

In [49]:
def create_state_dataframe(covid_data):
    # the keys of state_dict are the names of the states
    states = list(state_dict.keys())
    # D.C. is included in the county level data, so I elect to remove D.C.
    # if you do not remove D.C., it will be called as a Series (i.e., not a DF),
    # and will require an extra step in the script
    states.remove("District of Columbia")
    # We want to sum data within each state by summing the county values for each 
    # date
    state_data = covid_data.reset_index().set_index(["date", "state","fips_code"]).groupby(["state", "date"]).sum(numeric_only = True,
              ignore_index = False)
    # These values will be recalculated since the sum of the county values
    # would need to be weighted to be meaningful
    drop_cols = ["uid", "location_name", "cumulative_cases_per_100_000", 
                 "cumulative_deaths_per_100_000", "new_cases_per_100_000",
                 "new_deaths_per_100_000",'new_cases_rolling_7_day_avg', 
                 'new_deaths_rolling_7_day_avg']
    state_data.drop(drop_cols, axis = 1, inplace = True)
    # .sum() concatenated the strings in the dataframe, so we must correct for this
    # by redefining these values
    state_data["location_type"] = "state"
    for state in states:
        state_data.loc[state_data.index.get_level_values("state") == state, "Location"] = state
        state_data.loc[state_data.index.get_level_values("state") == state, "state_abr"] = state_dict[state]
        
    return state_data    

At the bottom of the script after the line where *covid_data* is defined, create *state_data*.

In [50]:
state_data = create_state_dataframe(covid_data)

Call the result to check that *state_data* was correctly constructed.

In [51]:
state_data

Unnamed: 0_level_0,Unnamed: 1_level_0,location_type,total_population,cumulative_cases,cumulative_deaths,new_cases,new_deaths,state_abr,Location
state,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama,2020-01-22,state,4864680.0,0,0,0.0,0.0,AL,Alabama
Alabama,2020-01-23,state,4864680.0,0,0,0.0,0.0,AL,Alabama
Alabama,2020-01-24,state,4864680.0,0,0,0.0,0.0,AL,Alabama
Alabama,2020-01-25,state,4864680.0,0,0,0.0,0.0,AL,Alabama
Alabama,2020-01-26,state,4864680.0,0,0,0.0,0.0,AL,Alabama
Alabama,2020-01-27,state,4864680.0,0,0,0.0,0.0,AL,Alabama
Alabama,2020-01-28,state,4864680.0,0,0,0.0,0.0,AL,Alabama
Alabama,2020-01-29,state,4864680.0,0,0,0.0,0.0,AL,Alabama
Alabama,2020-01-30,state,4864680.0,0,0,0.0,0.0,AL,Alabama
Alabama,2020-01-31,state,4864680.0,0,0,0.0,0.0,AL,Alabama


Now it is time to merge the COVID-19 data with the data from the U.S. Census shapefile.We created *state_data* first since that that dataframe does not meed to include the data from the shapefile. 

In [52]:
def create_covid_geo_dataframe(covid_data, map_data, dates):
    # create geopandas dataframe with multiindex for date
    # original geopandas dataframe had no dates, so copies of the df are 
    # stacked vertically, with a new copy for each date in the covid_data index
    #(dates is a global)
    i = 0
    for date in dates:
        # select county observations from each date in dates
        df = covid_data[covid_data.index.get_level_values("date")==date]
        # use the fips_codes from the slice of covid_data to select counties
        # from the map_data index,making sure that the map_data index matches
        # the covid_data index
        counties = df.index.get_level_values("fips_code")
        agg_df = map_data.loc[counties]
        # each row for agg_df will reflect that 
        agg_df["date"] = date
        if i == 0:
            # create the geodataframe, select coordinate system (.crs) to
            # match map_data.crs
            matching_gpd = geopandas.GeoDataFrame(agg_df, crs = map_data.crs)
            i += 1
        else:
            # after initial geodataframe is created, stack a dataframe for
            # each date in dates. Once completed, index of matching_gpd
            # will match index of covid_data
            matching_gpd = matching_gpd.append(agg_df, ignore_index = False)         
    # Set mathcing_gpd index as["fips_code", "date"], liked covid_data index
    matching_gpd.reset_index(inplace=True)
    matching_gpd.set_index(["fips_code","date"], inplace = True)
    # add each column from covid_data to mathcing_gpd
    for key, val in covid_data.items():
        matching_gpd[key] = val
    # Create "Location" which concatenates county name and state abbreviation 
    matching_gpd["Location"] = matching_gpd["NAME"] + ", " + \
        matching_gpd["state_abr"]
    return matching_gpd       

In [53]:
# dates will be used to create a geopandas DataFrame with multiindex 
dates = sorted(list(set(covid_data.index.get_level_values("date"))))
covid_data = create_covid_geo_dataframe(covid_data, map_data, dates)

As before, let's check the result by calling the covid_data which we have redefined.

In [54]:
covid_data

Unnamed: 0_level_0,Unnamed: 1_level_0,STATEFP,COUNTYFP,COUNTYNS,AFFGEOID,NAME,LSAD,ALAND,AWATER,Population,state,...,cumulative_deaths,cumulative_deaths_per_100_000,new_cases,new_deaths,new_cases_per_100_000,new_deaths_per_100_000,new_cases_rolling_7_day_avg,new_deaths_rolling_7_day_avg,state_abr,Location
fips_code,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
1001,2020-01-22,1,001,00161526,0500000US01001,Autauga,06,1539602123,25706961,55869.0,Alabama,...,0,0.00,,,,,,,AL,"Autauga, AL"
1003,2020-01-22,1,003,00161527,0500000US01003,Baldwin,06,4117546676,1133055836,223234.0,Alabama,...,0,0.00,,,,,,,AL,"Baldwin, AL"
1005,2020-01-22,1,005,00161528,0500000US01005,Barbour,06,2292144655,50538698,24686.0,Alabama,...,0,0.00,,,,,,,AL,"Barbour, AL"
1007,2020-01-22,1,007,00161529,0500000US01007,Bibb,06,1612167481,9602089,22394.0,Alabama,...,0,0.00,,,,,,,AL,"Bibb, AL"
1009,2020-01-22,1,009,00161530,0500000US01009,Blount,06,1670103911,15015423,57826.0,Alabama,...,0,0.00,,,,,,,AL,"Blount, AL"
1011,2020-01-22,1,011,00161531,0500000US01011,Bullock,06,1613059160,6054988,10101.0,Alabama,...,0,0.00,,,,,,,AL,"Bullock, AL"
1013,2020-01-22,1,013,00161532,0500000US01013,Butler,06,2012002530,2701198,19448.0,Alabama,...,0,0.00,,,,,,,AL,"Butler, AL"
1015,2020-01-22,1,015,00161533,0500000US01015,Calhoun,06,1569189622,16627597,113605.0,Alabama,...,0,0.00,,,,,,,AL,"Calhoun, AL"
1017,2020-01-22,1,017,00161534,0500000US01017,Chambers,06,1545085607,16971701,33254.0,Alabama,...,0,0.00,,,,,,,AL,"Chambers, AL"
1019,2020-01-22,1,019,00161535,0500000US01019,Cherokee,06,1433623321,120308339,26196.0,Alabama,...,0,0.00,,,,,,,AL,"Cherokee, AL"


The result is that covid_data is now a geodataframe that can be used to generate maps that reflect data at the county level. We will create these maps in Part III. 

Next we will generate data that normalizes the number of cases and deaths per million population. For daily rates of both cases and deaths, we will create a 7 day moving average.

In [55]:
def create_new_vars(covid_data, moving_average_days):
    # use a for loop that performs the same operations on data for cases and for deaths
    for key in ["cases", "deaths"]:
        # create a version of the key with the first letter capitalized
        cap_key = key.title()
        covid_data[cap_key + " per Million"] = covid_data["cumulative_" + key].div(covid_data["total_population"]).mul(10 ** 6)
        # generate daily data normalized per million population by taking the daily difference within each
        # entity (covid_data.index.names[0]), dividing this value by population and multiplying that value by 
        # 1 million 10 ** 6
        covid_data["Daily " + cap_key + " per Million"] = \
            covid_data["cumulative_" + key ].groupby(covid_data.index.names[0])\
            .diff(1).div(covid_data["total_population"]).mul(10 ** 6)
        # taking the rolling average; choice of number of days is passed as moving_average_days
        covid_data["Daily " + cap_key + " per Million MA"] = covid_data["Daily " + \
                  cap_key + " per Million"].rolling(moving_average_days).mean()

At the bottom of the script, define the number of days for the rolling moving average. Call *create_new_vars()* to create new variables for *covid_data* and *state_data*

In [56]:
moving_average_days = 7
create_new_vars(covid_data, moving_average_days)
create_new_vars(state_data, moving_average_days)

Now check that the dataframes for the new variables.

In [57]:
covid_data

Unnamed: 0_level_0,Unnamed: 1_level_0,STATEFP,COUNTYFP,COUNTYNS,AFFGEOID,NAME,LSAD,ALAND,AWATER,Population,state,...,new_cases_rolling_7_day_avg,new_deaths_rolling_7_day_avg,state_abr,Location,Cases per Million,Daily Cases per Million,Daily Cases per Million MA,Deaths per Million,Daily Deaths per Million,Daily Deaths per Million MA
fips_code,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
1001,2020-01-22,1,001,00161526,0500000US01001,Autauga,06,1539602123,25706961,55869.0,Alabama,...,,,AL,"Autauga, AL",0.000000,,,0.000000,,
1003,2020-01-22,1,003,00161527,0500000US01003,Baldwin,06,4117546676,1133055836,223234.0,Alabama,...,,,AL,"Baldwin, AL",0.000000,,,0.000000,,
1005,2020-01-22,1,005,00161528,0500000US01005,Barbour,06,2292144655,50538698,24686.0,Alabama,...,,,AL,"Barbour, AL",0.000000,,,0.000000,,
1007,2020-01-22,1,007,00161529,0500000US01007,Bibb,06,1612167481,9602089,22394.0,Alabama,...,,,AL,"Bibb, AL",0.000000,,,0.000000,,
1009,2020-01-22,1,009,00161530,0500000US01009,Blount,06,1670103911,15015423,57826.0,Alabama,...,,,AL,"Blount, AL",0.000000,,,0.000000,,
1011,2020-01-22,1,011,00161531,0500000US01011,Bullock,06,1613059160,6054988,10101.0,Alabama,...,,,AL,"Bullock, AL",0.000000,,,0.000000,,
1013,2020-01-22,1,013,00161532,0500000US01013,Butler,06,2012002530,2701198,19448.0,Alabama,...,,,AL,"Butler, AL",0.000000,,,0.000000,,
1015,2020-01-22,1,015,00161533,0500000US01015,Calhoun,06,1569189622,16627597,113605.0,Alabama,...,,,AL,"Calhoun, AL",0.000000,,,0.000000,,
1017,2020-01-22,1,017,00161534,0500000US01017,Chambers,06,1545085607,16971701,33254.0,Alabama,...,,,AL,"Chambers, AL",0.000000,,,0.000000,,
1019,2020-01-22,1,019,00161535,0500000US01019,Cherokee,06,1433623321,120308339,26196.0,Alabama,...,,,AL,"Cherokee, AL",0.000000,,,0.000000,,


In [58]:
state_data

Unnamed: 0_level_0,Unnamed: 1_level_0,location_type,total_population,cumulative_cases,cumulative_deaths,new_cases,new_deaths,state_abr,Location,Cases per Million,Daily Cases per Million,Daily Cases per Million MA,Deaths per Million,Daily Deaths per Million,Daily Deaths per Million MA
state,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Alabama,2020-01-22,state,4864680.0,0,0,0.0,0.0,AL,Alabama,0.000000,,,0.000000,,
Alabama,2020-01-23,state,4864680.0,0,0,0.0,0.0,AL,Alabama,0.000000,0.000000,,0.000000,0.0,
Alabama,2020-01-24,state,4864680.0,0,0,0.0,0.0,AL,Alabama,0.000000,0.000000,,0.000000,0.0,
Alabama,2020-01-25,state,4864680.0,0,0,0.0,0.0,AL,Alabama,0.000000,0.000000,,0.000000,0.0,
Alabama,2020-01-26,state,4864680.0,0,0,0.0,0.0,AL,Alabama,0.000000,0.000000,,0.000000,0.0,
Alabama,2020-01-27,state,4864680.0,0,0,0.0,0.0,AL,Alabama,0.000000,0.000000,,0.000000,0.0,
Alabama,2020-01-28,state,4864680.0,0,0,0.0,0.0,AL,Alabama,0.000000,0.000000,,0.000000,0.0,
Alabama,2020-01-29,state,4864680.0,0,0,0.0,0.0,AL,Alabama,0.000000,0.000000,0.000000,0.000000,0.0,0.000000e+00
Alabama,2020-01-30,state,4864680.0,0,0,0.0,0.0,AL,Alabama,0.000000,0.000000,0.000000,0.000000,0.0,0.000000e+00
Alabama,2020-01-31,state,4864680.0,0,0,0.0,0.0,AL,Alabama,0.000000,0.000000,0.000000,0.000000,0.0,0.000000e+00


The new variables have been created successfully. You might notice that the value of Daily Deaths per Million MA is not exactly zero. This is a technicality, as Python will identify the number zero as an arbtirarily small float value.

The last step will be to compare data from each geographic entity by aligning the data in relation to the first day that cases per million or deaths per million passed a given threshold. This aligned data will be recorded in the *zero_day_dict*.

In [66]:
def create_zero_day_dict(covid_data, start_date):
    # Data from each entity will be stored in the dictionary
    zero_day_dict = {}
    # The dictionary will have a total of 4 keys
    # "Cases per Million", "Daily Cases per Million MA", "Deaths per Million", "Daily Deaths per Million MA"
    for key in ["Cases", "Deaths"]:
        zero_day_dict[key + " per Million"] = {}
        zero_day_dict["Daily " + key + " per Million MA"] = {}
    # Each key is associated with a minimal value that identifies day zero 
    # For deaths, the value is drawn from "Deaths per Million" 
    # For cases, the value is drawn from "Cases per Million" 
    day_zero_val = {}
    for key in zero_day_dict:
        day_zero_val[key] = 2 if "Deaths" in key else 10
    # create a list of entities (states or counties)
    entities = sorted(list(set(covid_data.index.get_level_values(0))))
    # for each key, identify the full set of values
    for key in zero_day_dict.keys():
        vals = covid_data[key]
        # select values that will be used to identify day zero
        thresh_vals = covid_data["Deaths per Million"] if "Deaths" in key else \
            covid_data["Cases per Million"]
        # for each entity, select the slice of values greater than the minimum value
        for entity in entities:
            dpc = vals[vals.index.get_level_values(0) == entity][thresh_vals > day_zero_val[key]]
            zero_day_dict[key][entity] = dpc.copy()
    return zero_day_dict, day_zero_val

In [67]:
start_date = "03-15-2020"     
end_date = dates[-1]
county_zero_day_dict, day_zero_val = create_zero_day_dict(covid_data, start_date)    
state_zero_day_dict, day_zero_val = create_zero_day_dict(state_data, start_date)

Check a key from each dictionary to make sure that the data has actually been aligned.

In [71]:
state_zero_day_dict["Deaths per Million"]

{'Alabama': state    date      
 Alabama  2020-03-29      2.055634
          2020-03-30      2.055634
          2020-03-31      3.083451
          2020-04-01      5.550211
          2020-04-02      5.550211
          2020-04-03      7.811408
          2020-04-04      9.044788
          2020-04-05      9.250352
          2020-04-06     10.072605
          2020-04-07     13.156055
          2020-04-08     13.567182
          2020-04-09     14.389436
          2020-04-10     16.445069
          2020-04-11     18.911830
          2020-04-12     19.117393
          2020-04-13     20.350773
          2020-04-14     23.434224
          2020-04-15     24.256477
          2020-04-16     27.339928
          2020-04-17     30.423378
          2020-04-18     31.451195
          2020-04-19     32.273449
          2020-04-20     33.506829
          2020-04-21     37.618096
          2020-04-22     40.290420
          2020-04-23     41.523800
          2020-04-24     42.962744
          2020-04-25   

In [72]:
county_zero_day_dict["Daily Deaths per Million MA"]

{1001: fips_code  date      
 1001       2020-04-07    2.587992e+00
            2020-04-08    0.000000e+00
            2020-04-09    0.000000e+00
            2020-04-10    0.000000e+00
            2020-04-11    0.000000e+00
            2020-04-12    0.000000e+00
            2020-04-13    0.000000e+00
            2020-04-14    0.000000e+00
            2020-04-15    2.030122e-15
            2020-04-16    1.674851e-14
            2020-04-17    2.587992e+00
            2020-04-18    0.000000e+00
            2020-04-19    0.000000e+00
            2020-04-20   -2.587992e+00
            2020-04-21    0.000000e+00
            2020-04-22    2.587992e+00
            2020-04-23    0.000000e+00
            2020-04-24    0.000000e+00
            2020-04-25    0.000000e+00
            2020-04-26    0.000000e+00
            2020-04-27    2.587992e+00
            2020-04-28    2.587992e+00
            2020-04-29    0.000000e+00
            2020-04-30    0.000000e+00
            2020-05-01   -2.587992e

You have completed the last step. All that is left now is to create a feature that does not unnecessarily download and process the data again once all steps have been completed. We will add a term, *data_processed*, that confirms when the data has been processed an if statement that checks if this variable has been created.

In [73]:
if "data_processed" not in locals():
    fips_name = "fips_code"
    covid_filename = "COVID19DataAP.csv"
    # rename_FIPS matches map_data FIPS with COVID19 FIPS name
    map_data = import_geo_data(filename = "countiesWithStatesAndPopulation.shp",
                    index_col = "Date", FIPS_name= fips_name)
    covid_data = import_covid_data(filename = covid_filename, FIPS_name = fips_name)
    state_data = create_state_dataframe(covid_data)
    # dates will be used to create a geopandas DataFrame with multiindex 
    dates = sorted(list(set(covid_data.index.get_level_values("date"))))
    covid_data = create_covid_geo_dataframe(covid_data, map_data, dates)
    moving_average_days = 7
    create_new_vars(covid_data, moving_average_days)
    create_new_vars(state_data, moving_average_days)
    start_date = "03-15-2020"     
    end_date = dates[-1]
    county_zero_day_dict, day_zero_val = create_zero_day_dict(covid_data, start_date)    
    state_zero_day_dict, day_zero_val = create_zero_day_dict(state_data, start_date)
    # once data is processed, it is saved in the memory
    # the if statement at the top of this block of code instructs the computer
    # not to repeat these operations 
    data_processed = True

In the next post, we will create visualizations of the data in *county_zero_day_dict* and *state_zero_day_dict*.