### NYTimes COVID-19 Data: Exploratory Analysis

This notebook explores the NYTimes COVID-19 county-level dataset.

It walks through the steps taken to understand the dataset and to correct potential issues before joining it with other data.

### Issues found:
- FIPS is imported as float because of NULL values
- Some records have a null FIPS even though state and county are present (fix: add the correct FIPS)
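
The first issue can be reproduced on a tiny inline sample. This is a minimal sketch, assuming pandas is available; the `sample` text and the variable names are made up for illustration and are not rows taken from the NYTimes file.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the second row has no FIPS code.
sample = (
    "county,state,fips\n"
    "Orange,California,06059\n"
    "Unknown,Rhode Island,\n"
    "Cook,Illinois,17031\n"
)

# Default parsing: the empty cell becomes NaN, so the whole column is read
# as float and the leading zero of 06059 is lost.
as_float = pd.read_csv(io.StringIO(sample))

# Forcing a string dtype keeps the codes as text; the empty cell is still NaN.
as_text = pd.read_csv(io.StringIO(sample), dtype={"fips": str})

print(as_float["fips"].dtype)
print(as_text["fips"].tolist())
```

Reading the column as text is one option; the import cell below takes a different route and disables NA detection altogether, which leaves empty cells as empty strings.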

In [96]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 0)

In [33]:
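# Import the data
url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
# keep_default_na=False keeps empty cells as '' so fips is read as text, not float;
# parse_dates=['date'] converts the date column (parse_dates=True only targets the index)
covidCounties = pd.read_csv(url, parse_dates=['date'], keep_default_na=False)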

In [109]:
# Preview the first 10 rows
covidCounties.head(10)

Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-01-21,Snohomish,Washington,53061,1,0
1,2020-01-22,Snohomish,Washington,53061,1,0
2,2020-01-23,Snohomish,Washington,53061,1,0
3,2020-01-24,Cook,Illinois,17031,1,0
4,2020-01-24,Snohomish,Washington,53061,1,0
5,2020-01-25,Orange,California,6059,1,0
6,2020-01-25,Cook,Illinois,17031,1,0
7,2020-01-25,Snohomish,Washington,53061,1,0
8,2020-01-26,Maricopa,Arizona,4013,1,0
9,2020-01-26,Los Angeles,California,6037,1,0


### Cases

In [119]:
# Describe cases
covidCounties.describe().apply(lambda s: s.apply('{0:.5f}'.format))

Unnamed: 0,cases
count,1495060.0
mean,4918.1766
std,24578.26143
min,0.0
25%,121.0
50%,762.0
75%,2772.0
max,1254202.0


### Deaths

In [141]:
# Find issues with deaths

# Count the records whose deaths column is empty (33425)
print(f"There are {covidCounties[covidCounties['deaths']=='']['fips'].count()} records with empty deaths column")

# All of them belong to Puerto Rico
covidCounties[covidCounties['deaths']==''][['state']].drop_duplicates()

There are 33425 records with empty deaths column


Unnamed: 0,state
117486,Puerto Rico
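
Because the file is read with `keep_default_na=False`, the `deaths` column holds strings and the Puerto Rico gaps are empty strings. One possible way to make the column numeric again is sketched below on a made-up frame; the `demo` values are not from the dataset, and this is not necessarily the fix applied later in the notebook.

```python
import pandas as pd

# Hypothetical rows mimicking the string-typed column produced by keep_default_na=False.
demo = pd.DataFrame({
    "state": ["Washington", "Puerto Rico", "Illinois"],
    "deaths": ["0", "", "3"],
})

# With errors="coerce" the empty string comes out as NaN; the nullable Int64
# dtype then keeps whole numbers alongside the missing value.
demo["deaths"] = pd.to_numeric(demo["deaths"], errors="coerce").astype("Int64")

print(demo["deaths"].isna().sum())
```

Keeping the gap as a missing value, rather than replacing it with 0, avoids implying that Puerto Rico reported zero deaths.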


### State-County

In [157]:
print(f"Max length: {covidCounties['state'].str.len().max()}") # Northern Mariana Islands
print(f"Min length: {covidCounties['state'].str.len().min()}") # Utah

Max length: 24
Min length: 4


In [161]:
print(f"Max length: {covidCounties['county'].str.len().max()}") # Bristol Bay plus Lake and Peninsula - Alaska
print(f"Min length: {covidCounties['county'].str.len().min()}") # Lee - Florida

Max length: 35
Min length: 3


### FIPS

In [128]:
# Records with an empty FIPS, counted by state and county (most have county 'Unknown')
flipsEmpty = covidCounties[covidCounties['fips'] == ''].groupby(['state','county']).count().reset_index()

# Candidates that can be fixed: FIPS is empty but the county is known
flipsEmpty[flipsEmpty['county'] != 'Unknown']

# List every county that we can try to find a FIPS for
covidCounties[covidCounties['county'].isin(['Joplin','Kansas City','New York City'])][['county','state']].drop_duplicates()

Unnamed: 0,county,state
416,New York City,New York
5641,Kansas City,Missouri
272865,Joplin,Missouri
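
Filling in a FIPS where state and county are known could be done with a lookup keyed on `(state, county)`. The sketch below is an assumption about how that might look, not the notebook's actual fix. The frame is made up, and the codes in `assumed_fips` are illustrative: each of these areas overlaps more than one county, so the sketch simply borrows the code of one overlapping county, which is a convention you would need to choose yourself.

```python
import pandas as pd

# Hypothetical rows in the same string-typed shape as covidCounties.
demo = pd.DataFrame({
    "county": ["New York City", "Kansas City", "Joplin", "Unknown", "Cook"],
    "state": ["New York", "Missouri", "Missouri", "Rhode Island", "Illinois"],
    "fips": ["", "", "", "", "17031"],
})

# Assumed assignments (not official FIPS codes for these areas):
# each value is the code of a single county that the area overlaps.
assumed_fips = {
    ("New York", "New York City"): "36061",  # New York County
    ("Missouri", "Kansas City"): "29095",    # Jackson County
    ("Missouri", "Joplin"): "29097",         # Jasper County
}

# Keep existing codes; fill empty ones from the lookup and leave the
# value empty when the (state, county) pair has no entry.
demo["fips"] = [
    fips if fips != "" else assumed_fips.get((state, county), "")
    for state, county, fips in zip(demo["state"], demo["county"], demo["fips"])
]

print(demo)
```

Rows with county `Unknown` stay empty, since there is nothing to look up for them.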


In [121]:
# Check that every non-empty FIPS is exactly 5 characters, as expected
print(f"Max length: {covidCounties['fips'][covidCounties['fips']!=''].str.len().max()}")
print(f"Min length: {covidCounties['fips'][covidCounties['fips']!=''].str.len().min()}")

Max length: 5
Min length: 5
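
A length check alone would also accept a five-character value that contains letters. A stricter variant, sketched here on made-up values (the `codes` series is hypothetical), matches every non-empty code against a five-digit pattern with `Series.str.fullmatch`:

```python
import pandas as pd

# Hypothetical codes: two valid, one empty, one too short, one with a letter.
codes = pd.Series(["53061", "06059", "", "6059", "1703A"])

# Ignore empty values, then require exactly five digits.
non_empty = codes[codes != ""]
is_valid = non_empty.str.fullmatch(r"\d{5}")

# Values that fail the pattern.
invalid = non_empty[~is_valid].tolist()
print(invalid)
```

Applied to the notebook's data, the same pattern could be run on `covidCounties['fips']`; an empty `invalid` list would confirm the result of the length check.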


In [171]:
# Distribution of records with a non-empty FIPS by state and county
covidCounties[covidCounties['fips']!=''][['state','county','fips']].groupby(['state', 'county']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,fips
state,county,Unnamed: 2_level_1
Alabama,Autauga,471
Alabama,Baldwin,481
Alabama,Barbour,461
Alabama,Bibb,465
Alabama,Blount,470
Alabama,Bullock,469
Alabama,Butler,470
Alabama,Calhoun,477
Alabama,Chambers,476
Alabama,Cherokee,470
