# Virginia COVID-19 Cases - Limited Exploration

__Import required libraries__

In [None]:
import os
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
%matplotlib inline

# importing plotly express for plot animation
try:
    import plotly.express as px
except ImportError:
    !pip install plotly
    import plotly.express as px

In [None]:
# move to the repo head
# os.chdir(r'C:\Users\jamel\myprojects\va-covid-eda')
%cd ../
os.getcwd()

/home/jovyan/work/va-covid-eda


'/home/jovyan/work/va-covid-eda'

In [None]:
# load python's `autoreload`, to update any module changes
%load_ext autoreload

# turn on `autoreload`
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
# importing from the "helpers" folder as a package
from helpers import helper_func
from helpers import save_pickle
from helpers import read_pickle

__Restore serialized objects__

In [None]:
# restoring dataframe objects
localities_df = read_pickle("pickles/localities-df.pkl")
cases_df = read_pickle("pickles/cases-df.pkl")
hospitalizations_df = read_pickle("pickles/hospitalizations-df.pkl")
deaths_df = read_pickle("pickles/deaths-df.pkl")

# preview a restored dataframe
cases_df.tail(2)

File restored from pickles/localities-df.pkl
File restored from pickles/cases-df.pkl
File restored from pickles/hospitalizations-df.pkl
File restored from pickles/deaths-df.pkl


Unnamed: 0,locality,total_cases
131,Bland,153
132,Bath,49


__Limit the localities of interest.__

Virginia's Hampton Roads region experienced outbreaks in late July. We will plot them alongside the state's capital of Richmond (City) and compare cases, hospitalizations, and deaths over time.

In [None]:
# list the localities
print(sorted(localities_df.locality.unique()))

['Accomack', 'Albemarle', 'Alexandria', 'Alleghany', 'Amelia', 'Amherst', 'Appomattox', 'Arlington', 'Augusta', 'Bath', 'Bedford', 'Bland', 'Botetourt', 'Bristol', 'Brunswick', 'Buchanan', 'Buckingham', 'Buena Vista City', 'Campbell', 'Caroline', 'Carroll', 'Charles City', 'Charlotte', 'Charlottesville', 'Chesapeake', 'Chesterfield', 'Clarke', 'Colonial Heights', 'Covington', 'Craig', 'Culpeper', 'Cumberland', 'Danville', 'Dickenson', 'Dinwiddie', 'Emporia', 'Essex', 'Fairfax', 'Fairfax City', 'Falls Church', 'Fauquier', 'Floyd', 'Fluvanna', 'Franklin City', 'Franklin County', 'Frederick', 'Fredericksburg', 'Galax', 'Giles', 'Gloucester', 'Goochland', 'Grayson', 'Greene', 'Greensville', 'Halifax', 'Hampton', 'Hanover', 'Harrisonburg', 'Henrico', 'Henry', 'Highland', 'Hopewell', 'Isle of Wight', 'James City', 'King George', 'King William', 'King and Queen', 'Lancaster', 'Lee', 'Lexington', 'Loudoun', 'Louisa', 'Lunenburg', 'Lynchburg', 'Madison', 'Manassas City', 'Manassas Park', 'Marti

In [None]:
# listing select localities for visual EDA
select_localities = ['Chesapeake', 'Norfolk', 'Richmond City', 'Virginia Beach']

# filtering `localities_df` for the selected localities
selected = localities_df.locality.isin(select_localities)

# instantiating a new dataframe with filtered localities, only
select_df = localities_df[selected]

# viewing the number of records
print(select_df.shape)

# viewing the last 5 records in the dataset
select_df.tail()

(544, 6)


Unnamed: 0,report_date,fips,locality,total_cases,hospitalizations,deaths
17950,07/29/2020,51760,Richmond City,2831,270,39
18054,07/30/2020,51550,Chesapeake,2391,199,27
18073,07/30/2020,51710,Norfolk,3080,165,22
18079,07/30/2020,51760,Richmond City,2857,273,38
18084,07/30/2020,51810,Virginia Beach,3979,193,43


### Bar Plot, Total Cases by Locality

In [None]:
# viewing an animated bar plot
fig = px.bar(select_df,  
             x ="locality",  
             y ="total_cases", 
             title ="Total Cases by Locality", 
             color ='deaths', 
             animation_frame ='report_date', 
             hover_name ='locality',  
             range_y =[0, 4250]) 
fig.show()

Through mid-July, cases in Richmond appear the most likely to have resulted in death. Virginia Beach then surpassed Richmond in both the number of deaths and the total number of cases.

### Bar Plot, Deaths by Locality

In [None]:
fig = px.bar(select_df,  
             x ="locality",  
             y ="deaths", 
             color ='hospitalizations', 
             title ="Deaths by Locality", 
             animation_frame ='report_date', 
             hover_name ='locality',  
             range_y =[0, 50]) 
fig.show()

The plot suggests that COVID-19 cases in Norfolk were less likely to receive hospital treatment than cases in Richmond. As the rate of death seems to slow for Richmond toward the end of July, it appears to pick up pace in Virginia Beach. Meanwhile, the number of Virginia Beach hospitalizations is well below that of Richmond.
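The comparisons above are about ratios rather than raw counts, so it can help to compute them directly. A minimal sketch, using only the four July 30, 2020 rows shown in the `select_df.tail()` output; the column names `hosp_per_case` and `deaths_per_case` are introduced here for illustration and are not part of the original notebook:

```python
import pandas as pd

# July 30, 2020 rows copied from the `select_df.tail()` output above
latest = pd.DataFrame({
    "locality": ["Chesapeake", "Norfolk", "Richmond City", "Virginia Beach"],
    "total_cases": [2391, 3080, 2857, 3979],
    "hospitalizations": [199, 165, 273, 193],
    "deaths": [27, 22, 38, 43],
}).set_index("locality")

# cumulative hospitalizations and deaths as a share of cumulative cases
# (hypothetical column names, used only in this sketch)
latest["hosp_per_case"] = latest["hospitalizations"] / latest["total_cases"]
latest["deaths_per_case"] = latest["deaths"] / latest["total_cases"]

print(latest[["hosp_per_case", "deaths_per_case"]].round(3))
```

These are crude cumulative ratios for a single report date. They say nothing about timing or testing volume, so they are best read as a rough check on what the animation suggests.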

### Scatter Plot: May - July, 2020 Hospitalizations, Deaths vs Cases by Locality

In [None]:
# animating a scatter plot, with hospitalizations determining data-point size
fig = px.scatter( 
    select_df[select_df.report_date > "04/30/2020"],  
    x ="deaths",  
    y ="total_cases", 
    title ="May - July, 2020 Hospitalizations, Deaths vs Cases by Locality",  
    animation_frame ="report_date",  
    animation_group ="locality", 
    size ="hospitalizations",  
    color ="locality",  
    hover_name ="locality",  
    facet_col ="locality", 
    size_max = 80, 
    range_x =[-50, 200], 
    range_y =[-10, 5000] 
) 
fig.show();

Size indicates hospitalizations.
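The filter `select_df.report_date > "04/30/2020"` compares strings, assuming `report_date` is stored as `MM/DD/YYYY` text, as the table output suggests. That ordering matches calendar order only while every date falls in the same year. A hedged sketch of a datetime-based filter, using a small made-up frame (the 2019 and 2021 dates are illustrative and not from the dataset):

```python
import pandas as pd

# illustrative dates only; the first and last are not from the Virginia dataset
demo = pd.DataFrame(
    {"report_date": ["12/01/2019", "04/30/2020", "07/30/2020", "01/15/2021"]}
)

# string comparison: "12/01/2019" sorts after "04/30/2020",
# while "01/15/2021" sorts before it
string_mask = demo["report_date"] > "04/30/2020"

# datetime comparison: parse once with an explicit format, then compare timestamps
parsed = pd.to_datetime(demo["report_date"], format="%m/%d/%Y")
date_mask = parsed > pd.Timestamp("2020-04-30")

print(demo.assign(string_mask=string_mask, date_mask=date_mask))
```

One option is to keep the original string column for `animation_frame` labels and use a parsed column only for filtering.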

### Scatter Plot: May - July, 2020 Totals, Deaths vs Hospitalizations by Locality

In [None]:
# using data-point size to reflect `total_cases`
fig = px.scatter( 
    select_df[select_df.report_date > "04/30/2020"],  
    x ="deaths",  
    y ="hospitalizations", 
    title ="May - July, 2020 Totals, Deaths vs Hospitalizations by Locality",  
    animation_frame ="report_date",  
    animation_group ="locality", 
    size ="total_cases",  
    color ="locality",  
    hover_name ="locality",  
    facet_col ="locality", 
    size_max = 100, 
    range_x =[-10, 75], 
    range_y =[-10, 325] 
) 
fig.show()

Scatter point size is not particularly informative, in this layout. By the completion of the animation, point sizes for each locality do not appear significantly different, despite the range in total cases they represent.

### Scatter Plot: May - July, 2020 Totals, Hospitalizations vs Cases by Locality

In [None]:
# plotting total cases against hospitalizations, with point size also reflecting `total_cases`
fig = px.scatter( 
    select_df[select_df.report_date > "04/30/2020"],  
    x ="hospitalizations", 
    y ="total_cases",  
    title ="May - July, 2020 Totals, Hospitalizations vs Cases by Locality",  
    animation_frame ="report_date",  
    animation_group ="locality", 
    size ="total_cases",  
    color ="locality",  
    hover_name ="locality",  
    facet_col ="locality", 
    size_max = 50, 
    range_x =[0, 300], 
    range_y =[-10, 4500] 
) 
fig.show()

In each of the preceding plots, we see cases growing more rapidly in Richmond at the start of our timeline, with Virginia Beach later overtaking the capital in daily deaths and total cases. While Virginia Beach led in the number of hospitalizations at the beginning of our timeline, it was far surpassed by Richmond from the second week of May through July.

## Feature Engineering

Let's bring in some population data.

This dataset was obtained from the University of Virginia's [Weldon Cooper Center for Public Service Demographics Research Group](https://demographics.coopercenter.org/virginia-population-estimates) and was published on January 27, 2020.

Column Name | Description
--- | ---
FIPS Code | 3-digit code (XXX) for the locality
Locality | Independent city or county in Virginia
April 1, 2010 Census | Official population count from the 2010 Census
July 1, 2019 Estimate | Population approximation "based on a variety of observed administrative record data, such as births, deaths, school enrollment, and residential housing construction"

In [None]:
pop_df = pd.read_csv('data/VAPopulationEstimates_2019-07_UVACooperCenter.xlsx - 2019 Table.csv', 
                     skiprows=4)
pop_df

Unnamed: 0,FIPS Code,Locality,"April 1, 2010 Census","July 1, 2019 Estimate",Numeric Change,Percent Change
0,,Virginia,8001024,8535519,534495,6.7%
1,,,,,,
2,1.0,Accomack County,33164,32561,-603,-1.8%
3,3.0,Albemarle County,99010,109722,10712,10.8%
4,5.0,Alleghany County,16250,14952,-1298,-8.0%
...,...,...,...,...,...,...
179,,18 Middle Peninsula,90826,91247,421,0.5%
180,,19 Crater,496955,530142,33187,6.7%
181,,22 Accomack-Northampton,45553,44371,-1182,-2.6%
182,,23 Hampton Roads,1666310,1729109,62799,3.8%


We will reduce the dataset to eliminate unneeded columns and rows.

Since we know there should be 133 Federal Information Processing Standard (FIPS) codes, we will check a few rows beyond that.

In [None]:
# printing to verify planned operation
print(pop_df.iloc[2:136,:4])

     FIPS Code             Locality April 1, 2010 Census July 1, 2019 Estimate
2          1.0      Accomack County               33,164                32,561
3          3.0     Albemarle County               99,010               109,722
4          5.0     Alleghany County               16,250                14,952
5          7.0        Amelia County               12,690                13,053
6          9.0       Amherst County               32,353                31,766
..         ...                  ...                  ...                   ...
131      810.0  Virginia Beach City              437,994               452,643
132      820.0      Waynesboro City               21,006                22,183
133      830.0    Williamsburg City               14,067                15,383
134      840.0      Winchester City               26,203                28,180
135        NaN       Total Counties            5,548,355             5,960,959

[134 rows x 4 columns]


Row 135 is a summary row.

* We will keep only the first four columns and drop the summary row.
* We will also drop the `April 1, 2010 Census` column and use the `July 1, 2019 Estimate` column (later relabeled `2019_estimate`) for our population data.
* `FIPS Code` will be converted to an integer.
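The positional slice depends on the summary row sitting at index 135. As an alternative sketch, locality rows can be selected by the presence of a FIPS code, since the previews show `NaN` in `FIPS Code` for the state total, the blank row, and the summary rows. The small frame below copies a few rows from the outputs above; the name `localities_only` is illustrative:

```python
import numpy as np
import pandas as pd

# a few rows copied from the `pop_df` previews above
sample = pd.DataFrame({
    "FIPS Code": [np.nan, np.nan, 1.0, 3.0, 840.0, np.nan],
    "Locality": ["Virginia", np.nan, "Accomack County", "Albemarle County",
                 "Winchester City", "Total Counties"],
    "July 1, 2019 Estimate": ["8,535,519", np.nan, "32,561", "109,722",
                              "28,180", "5,960,959"],
})

# keep only rows that carry a FIPS code, then make the code an integer
localities_only = sample[sample["FIPS Code"].notna()].copy()
localities_only["FIPS Code"] = localities_only["FIPS Code"].astype(int)

print(localities_only)
```

On the full table, checking that `len(localities_only)` equals the expected 133 would be a cheap guard against a changed file layout.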

In [None]:
# keeping the locality rows and first four columns, then dropping the 2010 census column
pop_df = pop_df.iloc[2:135,:4].drop(['April 1, 2010 Census'], axis=1)

# converting type
pop_df['FIPS Code'] = pop_df['FIPS Code'].astype(int)
pop_df

Unnamed: 0,FIPS Code,Locality,"July 1, 2019 Estimate"
2,1,Accomack County,32561
3,3,Albemarle County,109722
4,5,Alleghany County,14952
5,7,Amelia County,13053
6,9,Amherst County,31766
...,...,...,...
130,800,Suffolk City,93825
131,810,Virginia Beach City,452643
132,820,Waynesboro City,22183
133,830,Williamsburg City,15383


Now, we will rename our column labels.

In [None]:
# renaming columns, replacing spaces with underscores and converting to lowercase
# (`July 1, 2019 Estimate` keeps its label here; later cells reference it by that name)
pop_df.rename(columns = {'FIPS Code':'fips_code', 
                         'Locality':'locality'}, 
              inplace = True)

Let's add the Virginia prefix (51) to `fips_code`, to match our `fips` column in `localities_df`. First, we need to prepend zeros to codes with fewer than 3 digits.

In [None]:
# padding `fips_code` with zeros to fill to length 3
pop_df['fips_code']=pop_df['fips_code'].apply(lambda x: '{0:0>3}'.format(x))

print(pop_df['fips_code'].head())

2    001
3    003
4    005
5    007
6    009
Name: fips_code, dtype: object


Note: padding the values with leading zeros converts the data type to object / string.
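Because the state prefix and the 3-digit code are both numeric, the string round trip can be avoided. A sketch of two equivalent approaches, assuming `fips_code` holds integers between 1 and 999 as in the output above; the names `as_text` and `as_int` are illustrative:

```python
import pandas as pd

codes = pd.Series([1, 3, 5, 810], name="fips_code")

# string route: zero-pad to 3 characters, then prepend the state code
as_text = "51" + codes.astype(str).str.zfill(3)

# integer route: 51 * 1000 + code gives the same 5-digit value without changing dtype
as_int = 51_000 + codes

print(as_text.tolist())
print(as_int.tolist())
```

The integer route only suits states whose FIPS prefix has no leading zero, which holds for Virginia's 51.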

In [None]:
# preceding all `fips_code` values with VA state FIPS code "51"
pop_df['fips_code'] = pop_df['fips_code'].apply(lambda x: '51' + x)

# viewing first and last rows
pop_df

Unnamed: 0,fips_code,locality,"July 1, 2019 Estimate"
2,51001,Accomack County,32561
3,51003,Albemarle County,109722
4,51005,Alleghany County,14952
5,51007,Amelia County,13053
6,51009,Amherst County,31766
...,...,...,...
130,51800,Suffolk City,93825
131,51810,Virginia Beach City,452643
132,51820,Waynesboro City,22183
133,51830,Williamsburg City,15383




*   Convert the `fips_code` data type back to int.
*   Remove commas and convert the `July 1, 2019 Estimate` data type to int.
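The comma removal can also happen at load time. A minimal sketch using the `thousands` parameter of `pd.read_csv`, on an in-memory CSV built from two rows of the table above (the real file path and `skiprows=4` are left out here):

```python
from io import StringIO

import pandas as pd

# two rows copied from the population table, with quoted comma-grouped numbers
raw = StringIO(
    'FIPS Code,Locality,"July 1, 2019 Estimate"\n'
    '1,Accomack County,"32,561"\n'
    '3,Albemarle County,"109,722"\n'
)

# `thousands=","` asks the parser to treat commas inside numbers as digit separators
parsed = pd.read_csv(raw, thousands=",")

print(parsed.dtypes)
```

On the full file, the blank and summary rows would likely leave some numeric columns as float rather than int until those rows are dropped.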



In [None]:
# checking dtypes pre-conversion
print("Original dtypes\n\n", pop_df.dtypes, "\n\n", "="*60)

# correcting dtype
pop_df.fips_code = pop_df.fips_code.astype(int)

# removing commas and correcting dtype
pop_df["July 1, 2019 Estimate"] = pop_df["July 1, 2019 Estimate"].str.replace(
    ",", ""
    ).astype(int)

# checking dtypes post-conversion
print("Converted dtypes\n\n", pop_df.dtypes, "\n\n")

# viewing first rows
pop_df.head()

Original dtypes

 fips_code                object
locality                 object
July 1, 2019 Estimate    object
dtype: object 

Converted dtypes

 fips_code                 int64
locality                 object
July 1, 2019 Estimate     int64
dtype: object 




Unnamed: 0,fips_code,locality,"July 1, 2019 Estimate"
2,51001,Accomack County,32561
3,51003,Albemarle County,109722
4,51005,Alleghany County,14952
5,51007,Amelia County,13053
6,51009,Amherst County,31766


We can use our `fips_code` to merge population data with our `localities_df` data (matching on its `fips` column), to analyze our cases, hospitalizations, and deaths against population estimates. We will only need the code and estimate columns, though.
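The merge described above might look like the following sketch, which uses `left_on`/`right_on` because the key is named `fips` on one side and `fips_code` on the other. The two small frames reuse values printed earlier in this notebook; `indicator=True` is added here only to show which rows found a population match:

```python
import pandas as pd

# two July 30, 2020 rows from `select_df`
cases = pd.DataFrame({
    "fips": [51550, 51810],
    "locality": ["Chesapeake", "Virginia Beach"],
    "total_cases": [2391, 3979],
})

# two rows from the population table (Chesapeake is deliberately left out)
population = pd.DataFrame({
    "fips_code": [51001, 51810],
    "2019_estimate": [32561, 452643],
})

merged = cases.merge(
    population,
    how="left",               # keep every case row, matched or not
    left_on="fips",
    right_on="fips_code",
    validate="many_to_one",   # each FIPS code should appear once in `population`
    indicator=True,           # adds a `_merge` column: "both" or "left_only"
)

print(merged[["locality", "total_cases", "2019_estimate", "_merge"]])
```

Any `left_only` rows point to a FIPS code that is missing from the population table, which is worth checking before computing per-capita figures.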

In [None]:
# copying `pop_df`, dropping the locality column
pop_estimate_df = pop_df.copy().drop(['locality'], axis=1)

# viewing the new dataframe's first 5 rows
pop_estimate_df.head()

Unnamed: 0,fips_code,"July 1, 2019 Estimate"
2,51001,32561
3,51003,109722
4,51005,14952
5,51007,13053
6,51009,31766


Let's clean up the column label.

In [None]:
# updating column labels
pop_estimate_df.columns = ['fips_code', '2019_estimate']

# verifying updated labels
pop_estimate_df.columns

Index(['fips_code', '2019_estimate'], dtype='object')

Let's look at the range of our population values.

In [None]:
# sort `2019_estimate` to view its range of values
pop_estimate_df["2019_estimate"].sort_values()

46        2246
121       3879
10        4318
24        5108
105       5589
        ...   
22      350760
54      413546
131     452643
74      465498
30     1143528
Name: 2019_estimate, Length: 133, dtype: int64

The range of values suggests that we can engineer meaningful features per 1,000 residents for each locality.
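As a worked example of such a feature, the sketch below scales Virginia Beach's cumulative cases by its population estimate, both taken from outputs earlier in this notebook. The variable names are illustrative:

```python
# values printed earlier in this notebook for Virginia Beach (FIPS 51810)
total_cases = 3979          # cumulative cases on 07/30/2020
population_2019 = 452_643   # July 1, 2019 estimate

# cases per 1,000 residents
cases_per_1k = total_cases / population_2019 * 1000

print(round(cases_per_1k, 2))
```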

---

__Serialize Objects__

In [None]:
save_pickle(pop_estimate_df, "pop-estimate-df.pkl")
save_pickle(select_localities, "select-localities.pkl")

--------------- PICKLING pop-estimate-df.pkl -------------------------
Saved as  <_io.BufferedWriter name='pickles/pop-estimate-df.pkl'> 

--------------- PICKLING select-localities.pkl -------------------------
Saved as  <_io.BufferedWriter name='pickles/select-localities.pkl'> 
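The `helpers` module itself is not shown in this notebook. For readers without it, a hypothetical stand-in pair built on the standard `pickle` module might look like this. The function names `save_obj` and `load_obj`, and the folder handling, are assumptions and not the author's implementation:

```python
import pickle
import tempfile
from pathlib import Path


def save_obj(obj, filename, folder):
    """Hypothetical stand-in: pickle `obj` to `folder/filename`."""
    path = Path(folder) / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path


def load_obj(filename, folder):
    """Hypothetical stand-in: load a pickled object from `folder/filename`."""
    with open(Path(folder) / filename, "rb") as f:
        return pickle.load(f)


# round trip through a temporary folder, using the locality list from this notebook
select_localities = ["Chesapeake", "Norfolk", "Richmond City", "Virginia Beach"]
with tempfile.TemporaryDirectory() as tmp:
    save_obj(select_localities, "select-localities.pkl", tmp)
    restored = load_obj("select-localities.pkl", tmp)

print(restored == select_localities)
```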



##### [Return to the repository, on GitHub](https://github.com/jammy-bot/va-covid-eda)