### COVID-19 Data NYTimes - Exploratory

This notebook documents the exploratory analysis of the NYTimes COVID-19 dataset.

It covers the steps taken to understand the dataset and correct potential issues before joining it with other data.

### Summary of Findings
- _FIPS_ is imported as float by default because of NULL values
- All non-null _FIPS_ values are well formatted (5 numeric characters)
- Some records have a NULL _FIPS_ but state-county data present (we can add the correct FIPS for them):
    - New York City
    - Kansas City
    - Joplin
- Some records have an empty value in the _deaths_ column; all of them belong to Puerto Rico
- State and County have no empty values or malformed names
- 13605 records have null FIPS

In [1]:
import pandas as pd
import numpy as np

In [2]:
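# Import the data
url="https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
# Keep empty fields as empty strings (avoids NaN values) and parse the date column
# Note: parse_dates=True would only try to parse the index, so the column is named explicitly
covidCounties=pd.read_csv(url, parse_dates=['date'], keep_default_na=False)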

In [194]:
# Preview the first rows
covidCounties.head(10)

Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-01-21,Snohomish,Washington,53061,1,0
1,2020-01-22,Snohomish,Washington,53061,1,0
2,2020-01-23,Snohomish,Washington,53061,1,0
3,2020-01-24,Cook,Illinois,17031,1,0
4,2020-01-24,Snohomish,Washington,53061,1,0
5,2020-01-25,Orange,California,6059,1,0
6,2020-01-25,Cook,Illinois,17031,1,0
7,2020-01-25,Snohomish,Washington,53061,1,0
8,2020-01-26,Maricopa,Arizona,4013,1,0
9,2020-01-26,Los Angeles,California,6037,1,0

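The float import mentioned in the findings can be reproduced on a small sample. The sketch below uses a hypothetical inline CSV (`sample_csv`) instead of the real file and compares pandas' default type inference with an explicit `dtype` for `fips`. It is an illustration under those assumptions, not the loading code used in this notebook.

```python
import io
import pandas as pd

# Hypothetical three-row sample mimicking the us-counties.csv layout
sample_csv = """date,county,state,fips,cases,deaths
2020-01-25,Orange,California,06059,1,0
2020-03-01,New York City,New York,,1,0
2020-03-01,Snohomish,Washington,53061,2,0
"""

# Default inference: the empty FIPS becomes NaN, so the whole column becomes float
default_df = pd.read_csv(io.StringIO(sample_csv))

# Explicit string dtype keeps leading zeros; parse_dates names the column to parse
typed_df = pd.read_csv(
    io.StringIO(sample_csv),
    dtype={"fips": "string"},
    parse_dates=["date"],
)

print(default_df["fips"].dtype)
print(typed_df["fips"].tolist())
```

Reading `fips` as a string is one alternative to `keep_default_na=False`: missing codes stay missing, and codes such as `06059` keep their leading zero.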

### Cases

In [193]:
# Describe cases
covidCounties.describe().apply(lambda s: s.apply('{0:.2f}'.format))

Unnamed: 0,cases
count,1495060.0
mean,4918.18
std,24578.26
min,0.0
25%,121.0
50%,762.0
75%,2772.0
max,1254202.0


### Deaths

In [141]:
# Find issues with deaths

# There are 33425 records with an empty deaths value
print(f"There are {covidCounties[covidCounties['deaths']=='']['fips'].count()} records with empty deaths column")

# All of them belong to Puerto Rico
covidCounties[covidCounties['deaths']==''][['state']].drop_duplicates()

There are 33425 records with empty deaths column


Unnamed: 0,state
117486,Puerto Rico

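Because the file is read with `keep_default_na=False`, the empty `deaths` values are empty strings and the column is not numeric. One possible cleanup, sketched here on a hypothetical sample frame, converts the column with `pd.to_numeric` and a nullable integer dtype so the missing Puerto Rico values stay missing instead of becoming 0.

```python
import pandas as pd

# Hypothetical sample: deaths read as text, with an empty string for Puerto Rico
sample = pd.DataFrame({
    "state": ["Washington", "Puerto Rico", "Illinois"],
    "deaths": ["0", "", "3"],
})

# errors="coerce" turns the empty string into NaN; Int64 is pandas' nullable integer dtype
sample["deaths"] = pd.to_numeric(sample["deaths"], errors="coerce").astype("Int64")

# States that still have missing deaths after the conversion
missing_states = sample.loc[sample["deaths"].isna(), "state"].unique().tolist()
print(missing_states)
```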

### State-County

In [157]:
print(f"Max length: {covidCounties['state'].str.len().max()}") # Northern Mariana Islands
print(f"Min length: {covidCounties['state'].str.len().min()}") # Utah

Max length: 24
Min length: 4


In [161]:
print(f"Max length: {covidCounties['county'].str.len().max()}") # Bristol Bay plus Lake and Peninsula - Alaska
print(f"Min length: {covidCounties['county'].str.len().min()}") # Lee - Florida

Max length: 35
Min length: 3


In [191]:
# List of all states
covidCounties['state'].drop_duplicates()

0                      Washington
3                        Illinois
5                      California
8                         Arizona
44                  Massachusetts
78                      Wisconsin
143                         Texas
198                      Nebraska
310                          Utah
369                        Oregon
411                       Florida
416                      New York
418                  Rhode Island
442                       Georgia
447                 New Hampshire
                   ...           
1198                     Delaware
1265                  Mississippi
1280                   New Mexico
1293                 North Dakota
1363                      Wyoming
1364                       Alaska
1466                        Maine
1619                      Alabama
1712                        Idaho
1782                      Montana
1858                  Puerto Rico
2267               Virgin Islands
2422                         Guam
3744          

### FIPS

In [4]:
# 13605 records with null FIPS
covidCounties[covidCounties['fips']=='']

Unnamed: 0,date,county,state,fips,cases,deaths
416,2020-03-01,New York City,New York,,1,0
418,2020-03-01,Unknown,Rhode Island,,2,0
448,2020-03-02,New York City,New York,,1,0
450,2020-03-02,Unknown,Rhode Island,,2,0
482,2020-03-03,New York City,New York,,2,0
...,...,...,...,...,...,...
1494214,2021-07-07,Unknown,Puerto Rico,,5594,2552
1494226,2021-07-07,Unknown,Rhode Island,,11953,5
1494427,2021-07-07,Unknown,Tennessee,,7658,95
1494714,2021-07-07,Unknown,Utah,,1176,23


In [128]:
# State-county pairs with an empty FIPS
fipsEmpty = covidCounties[covidCounties['fips'] == ''].groupby(['state','county']).count().reset_index()

# Pairs with a named county are possible to fix
fipsEmpty[fipsEmpty['county'] != 'Unknown']

# Print all counties that we can try to find a FIPS for
covidCounties[covidCounties['county'].isin(['Joplin','Kansas City','New York City'])][['county','state']].drop_duplicates()

Unnamed: 0,county,state
416,New York City,New York
5641,Kansas City,Missouri
272865,Joplin,Missouri


In [121]:
# Check that all non-empty FIPS values are exactly 5 characters, as expected
print(f"Max length: {covidCounties['fips'][covidCounties['fips']!=''].str.len().max()}")
print(f"Min length: {covidCounties['fips'][covidCounties['fips']!=''].str.len().min()}")

Max length: 5
Min length: 5

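A length check alone does not confirm that the values are numeric. A stricter check, sketched here on hypothetical sample values rather than the real column, uses `str.fullmatch` with a five-digit pattern and lists whatever fails it.

```python
import pandas as pd

# Hypothetical sample: two valid codes, an empty value, and two malformed ones
fips = pd.Series(["01001", "53061", "", "6059", "36O61"])

# Ignore empty values, then require exactly five digits
non_empty = fips[fips != ""]
is_valid = non_empty.str.fullmatch(r"\d{5}")

# Values that are too short or contain a non-digit character
malformed = non_empty[~is_valid].tolist()
print(malformed)
```

On the real data the same pattern could be applied to `covidCounties['fips']`; an empty `malformed` list would support the "5 numeric characters" finding.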

In [173]:
# Distribution of non-empty FIPS by state
covidCounties[['state','fips']][covidCounties['fips']!=''].groupby(['state']).count()

Unnamed: 0_level_0,fips
state,Unnamed: 1_level_1
Alabama,31542
Alaska,11495
Arizona,7185
Arkansas,34997
California,27696
Colorado,29623
Connecticut,3836
Delaware,1439
District of Columbia,488
Florida,31820


In [174]:
# Distribution of non-empty FIPS by state-county
covidCounties[['state', 'county','fips']][covidCounties['fips']!=''].groupby(['state', 'county']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,fips
state,county,Unnamed: 2_level_1
Alabama,Autauga,471
Alabama,Baldwin,481
Alabama,Barbour,461
Alabama,Bibb,465
Alabama,Blount,470
Alabama,Bullock,469
Alabama,Butler,470
Alabama,Calhoun,477
Alabama,Chambers,476
Alabama,Cherokee,470


In [200]:
# Understanding the FIPS format: SSCCC, where SS represents the state and CCC the county
covidCounties[covidCounties['state'] =='Alabama'].sort_values(by='fips')

Unnamed: 0,date,county,state,fips,cases,deaths
544132,2020-09-18,Autauga,Alabama,01001,1664,24
1228836,2021-04-17,Autauga,Alabama,01001,6760,106
243246,2020-06-16,Autauga,Alabama,01001,373,7
959382,2021-01-24,Autauga,Alabama,01001,5376,62
1368449,2021-05-30,Autauga,Alabama,01001,7142,110
1478829,2021-07-03,Autauga,Alabama,01001,7262,113
35596,2020-04-06,Autauga,Alabama,01001,12,1
774398,2020-11-28,Autauga,Alabama,01001,2735,42
1167128,2021-03-29,Autauga,Alabama,01001,6577,99
460054,2020-08-23,Autauga,Alabama,01001,1324,23

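The SSCCC layout described in the cell above can be made explicit by slicing the string. This is a minimal sketch on a hypothetical sample frame; the column names `state_fips` and `county_fips` are illustrative and do not exist in the dataset.

```python
import pandas as pd

# Hypothetical sample of five-character FIPS strings
df = pd.DataFrame({"fips": ["01001", "17031", "53061"]})

# First two characters: state code (SS); last three: county code (CCC)
df["state_fips"] = df["fips"].str[:2]
df["county_fips"] = df["fips"].str[2:]

print(df)
```

Slicing only works as intended when the codes are stored as zero-padded strings, which is another reason to avoid importing them as numbers.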

In [28]:
# FIPS 36061 is missing from the New York codes, so we can try to use it for New York City
covidCounties[covidCounties['fips'].str.startswith('360')][['fips']].drop_duplicates().sort_values(['fips']).reset_index(drop=True)

covidCounties.loc[covidCounties['county'] == 'New York City', 'fips']  = '36061'

covidCounties[covidCounties['county'] == 'New York City']

Unnamed: 0,date,county,state,fips,cases,deaths
416,2020-03-01,New York City,New York,36061,1,0
448,2020-03-02,New York City,New York,36061,1,0
482,2020-03-03,New York City,New York,36061,2,0
518,2020-03-04,New York City,New York,36061,2,0
565,2020-03-05,New York City,New York,36061,4,0
...,...,...,...,...,...,...
1480703,2021-07-03,New York City,New York,36061,956230,33426
1483949,2021-07-04,New York City,New York,36061,956381,33433
1487195,2021-07-05,New York City,New York,36061,956616,33436
1490441,2021-07-06,New York City,New York,36061,956795,33438

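If codes are later chosen for the other special-case counties, the assignment could be driven by a lookup table instead of one `loc` statement per county. The sketch below uses a hypothetical `fips_overrides` dictionary on a hypothetical sample frame. Only the New York City entry comes from this notebook; no codes are assumed for Kansas City or Joplin.

```python
import pandas as pd

# Hypothetical sample with empty FIPS values, as read with keep_default_na=False
df = pd.DataFrame({
    "county": ["New York City", "Kansas City", "Unknown", "Cook"],
    "state": ["New York", "Missouri", "Rhode Island", "Illinois"],
    "fips": ["", "", "", "17031"],
})

# Hypothetical lookup keyed by (state, county); only New York City is filled in
fips_overrides = {("New York", "New York City"): "36061"}

# Look up each row's (state, county) pair; rows without an override get None
override = pd.Series(
    [fips_overrides.get(key) for key in zip(df["state"], df["county"])],
    index=df.index,
)

# Fill only rows whose FIPS is empty and that have an override
mask = (df["fips"] == "") & override.notna()
df.loc[mask, "fips"] = override[mask]

print(df["fips"].tolist())
```

Keying on both state and county avoids touching same-named counties in other states, and rows labelled `Unknown` are left empty.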