# Missing US data

As mentioned in the thread https://www.kaggle.com/c/covid19-global-forecasting-week-1/discussion/137564, US data is missing before 2020-03-10. As @Thomas Brandon pointed out, this is because cases were reported by county rather than by state before that date. The data is still there, however, and it is fairly simple to translate it into state-level data, since every county entry has the two-letter state abbreviation appended (e.g. 'Kitsap, WA').

Below I show how this can be done, using the state of New York as an example. I strongly urge Kaggle to incorporate this data into the training data for the competition!

In [1]:
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/covid19-global-forecasting-week-1/submission.csv
/kaggle/input/covid19-global-forecasting-week-1/test.csv
/kaggle/input/covid19-global-forecasting-week-1/train.csv
/kaggle/input/gthubdata-new/time_series_19-covid-Confirmed.csv
/kaggle/input/gthubdata-new/time_series_19-covid-Deaths.csv
/kaggle/input/gthubdata-new/time_series_19-covid-Recovered.csv


In [2]:
train = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-1/train.csv')
org = pd.read_csv('/kaggle/input/gthubdata-new/time_series_19-covid-Confirmed.csv')

The training set shows zero confirmed cases for New York before 2020-03-10:

In [3]:
train[train['Province/State']=='New York'].iloc[39:55]

Unnamed: 0,Id,Province/State,Country/Region,Lat,Long,Date,ConfirmedCases,Fatalities
14979,23197,New York,US,42.1657,-74.9481,2020-03-01,0.0,0.0
14980,23198,New York,US,42.1657,-74.9481,2020-03-02,0.0,0.0
14981,23199,New York,US,42.1657,-74.9481,2020-03-03,0.0,0.0
14982,23200,New York,US,42.1657,-74.9481,2020-03-04,0.0,0.0
14983,23201,New York,US,42.1657,-74.9481,2020-03-05,0.0,0.0
14984,23202,New York,US,42.1657,-74.9481,2020-03-06,0.0,0.0
14985,23203,New York,US,42.1657,-74.9481,2020-03-07,0.0,0.0
14986,23204,New York,US,42.1657,-74.9481,2020-03-08,0.0,0.0
14987,23205,New York,US,42.1657,-74.9481,2020-03-09,0.0,0.0
14988,23206,New York,US,42.1657,-74.9481,2020-03-10,173.0,0.0


The data from https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series is organized slightly differently, with one column per date:

In [4]:
org.columns

Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20', '1/29/20',
       '1/30/20', '1/31/20', '2/1/20', '2/2/20', '2/3/20', '2/4/20', '2/5/20',
       '2/6/20', '2/7/20', '2/8/20', '2/9/20', '2/10/20', '2/11/20', '2/12/20',
       '2/13/20', '2/14/20', '2/15/20', '2/16/20', '2/17/20', '2/18/20',
       '2/19/20', '2/20/20', '2/21/20', '2/22/20', '2/23/20', '2/24/20',
       '2/25/20', '2/26/20', '2/27/20', '2/28/20', '2/29/20', '3/1/20',
       '3/2/20', '3/3/20', '3/4/20', '3/5/20', '3/6/20', '3/7/20', '3/8/20',
       '3/9/20', '3/10/20', '3/11/20', '3/12/20', '3/13/20', '3/14/20',
       '3/15/20', '3/16/20', '3/17/20', '3/18/20', '3/19/20', '3/20/20',
       '3/21/20'],
      dtype='object')
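
To compare the two sources row by row, the wide layout can be reshaped into one row per date, as in train.csv. The following is a minimal sketch, not part of the original notebook: the frame `wide` is a toy stand-in for `org` that contains only the New York counts for three days, and the names `wide`, `id_cols` and `long` are chosen here for illustration.

```python
import pandas as pd

# Toy stand-in for the CSSE file: same wide layout, New York only, three days.
wide = pd.DataFrame({
    'Province/State': ['New York'],
    'Country/Region': ['US'],
    'Lat': [42.1657],
    'Long': [-74.9481],
    '3/8/20': [106],
    '3/9/20': [142],
    '3/10/20': [173],
})

id_cols = ['Province/State', 'Country/Region', 'Lat', 'Long']

# One row per (region, date) instead of one column per date.
long = wide.melt(id_vars=id_cols, var_name='Date', value_name='ConfirmedCases')

# Parse the M/D/YY column labels and render them as YYYY-MM-DD strings,
# which is the date format used in train.csv.
long['Date'] = pd.to_datetime(long['Date'], format='%m/%d/%y').dt.strftime('%Y-%m-%d')
long = long.sort_values(['Province/State', 'Date']).reset_index(drop=True)
```

After this step `long` has the columns Province/State, Country/Region, Lat, Long, Date and ConfirmedCases, so it could be joined to the training set on Province/State and Date.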

It incorporates both states and counties:

In [5]:
us = org[org['Country/Region']=='US']
us['Province/State'].unique()

array(['Washington', 'New York', 'California', 'Massachusetts',
       'Diamond Princess', 'Grand Princess', 'Georgia', 'Colorado',
       'Florida', 'New Jersey', 'Oregon', 'Texas', 'Illinois',
       'Pennsylvania', 'Iowa', 'Maryland', 'North Carolina',
       'South Carolina', 'Tennessee', 'Virginia', 'Arizona', 'Indiana',
       'Kentucky', 'District of Columbia', 'Nevada', 'New Hampshire',
       'Minnesota', 'Nebraska', 'Ohio', 'Rhode Island', 'Wisconsin',
       'Connecticut', 'Hawaii', 'Oklahoma', 'Utah', 'Kansas', 'Louisiana',
       'Missouri', 'Vermont', 'Alaska', 'Arkansas', 'Delaware', 'Idaho',
       'Maine', 'Michigan', 'Mississippi', 'Montana', 'New Mexico',
       'North Dakota', 'South Dakota', 'West Virginia', 'Wyoming',
       'Kitsap, WA', 'Solano, CA', 'Santa Cruz, CA', 'Napa, CA',
       'Ventura, CA', 'Worcester, MA', 'Gwinnett, GA', 'DeKalb, GA',
       'Floyd, GA', 'Fayette, GA', 'Gregg, TX', 'Monmouth, NJ',
       'Burlington, NJ', 'Camden, NJ', 'Passaic, NJ'

Here we add the missing data for New York:

In [6]:
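# Date columns start after Province/State, Country/Region, Lat and Long.
days = us.columns[4:]
ny_state = us[us['Province/State'] == 'New York']
# County rows are labelled 'County, NY'; match that suffix rather than any occurrence of 'NY'.
ny_counties = us[us['Province/State'].str.endswith(', NY', na=False)]
# Sum the county rows per day, add the state row, and show 3/1/20 to 3/16/20.
(ny_counties[days].sum() + ny_state[days])[days[39:55]]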

Unnamed: 0,3/1/20,3/2/20,3/3/20,3/4/20,3/5/20,3/6/20,3/7/20,3/8/20,3/9/20,3/10/20,3/11/20,3/12/20,3/13/20,3/14/20,3/15/20,3/16/20
99,0,1,2,11,23,31,76,106,142,173,220,328,421,525,732,967


Compare this with the data contained in the training set (as of 22nd March):

In [7]:
train[train['Province/State']=='New York'].iloc[39:55]

Unnamed: 0,Id,Province/State,Country/Region,Lat,Long,Date,ConfirmedCases,Fatalities
14979,23197,New York,US,42.1657,-74.9481,2020-03-01,0.0,0.0
14980,23198,New York,US,42.1657,-74.9481,2020-03-02,0.0,0.0
14981,23199,New York,US,42.1657,-74.9481,2020-03-03,0.0,0.0
14982,23200,New York,US,42.1657,-74.9481,2020-03-04,0.0,0.0
14983,23201,New York,US,42.1657,-74.9481,2020-03-05,0.0,0.0
14984,23202,New York,US,42.1657,-74.9481,2020-03-06,0.0,0.0
14985,23203,New York,US,42.1657,-74.9481,2020-03-07,0.0,0.0
14986,23204,New York,US,42.1657,-74.9481,2020-03-08,0.0,0.0
14987,23205,New York,US,42.1657,-74.9481,2020-03-09,0.0,0.0
14988,23206,New York,US,42.1657,-74.9481,2020-03-10,173.0,0.0
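
One possible way to write the recovered counts back into the training frame is sketched below. This is an assumption about how the patch could be applied, not something the competition data provides: `train_toy` is a three-row stand-in for `train`, and `corrected` holds the New York totals for 3/8/20 to 3/10/20 from the county sum above, keyed by date in the train.csv format.

```python
import pandas as pd

# Three-row stand-in for the New York slice of train.csv.
train_toy = pd.DataFrame({
    'Province/State': ['New York'] * 3,
    'Date': ['2020-03-08', '2020-03-09', '2020-03-10'],
    'ConfirmedCases': [0.0, 0.0, 173.0],
})

# Recovered state totals (county rows plus state row), keyed by date.
corrected = pd.Series({'2020-03-08': 106.0, '2020-03-09': 142.0, '2020-03-10': 173.0})

mask = train_toy['Province/State'] == 'New York'

# Look up each date in the corrected series; dates without a corrected
# value keep whatever the training set already contained.
patched = train_toy.loc[mask, 'Date'].map(corrected)
train_toy.loc[mask, 'ConfirmedCases'] = patched.fillna(train_toy.loc[mask, 'ConfirmedCases'])
```

Only ConfirmedCases is touched here; Fatalities would need the same treatment using the deaths file.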


Just in case, below is a dictionary mapping two-letter abbreviations to state names, taken from http://code.activestate.com/recipes/577305-python-dictionary-of-us-states-and-territories/

In [8]:
states = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AS': 'American Samoa',
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DC': 'District of Columbia',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'GU': 'Guam',
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MP': 'Northern Mariana Islands',
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NA': 'National',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'PR': 'Puerto Rico',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VI': 'Virgin Islands',
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming'
}
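
With such a dictionary the New York approach can be extended to every state at once. The sketch below is a hedged illustration rather than a tested pipeline: `us_toy` is a toy frame in the CSSE layout whose counts are invented, `abbrev_to_state` repeats two entries of the dictionary so that the block runs on its own, and `to_state` is a helper name introduced here.

```python
import pandas as pd

# Two entries of the abbreviation dictionary, repeated so the sketch is self-contained.
abbrev_to_state = {'CA': 'California', 'WA': 'Washington'}

# Toy frame in the CSSE layout; the counts are invented for illustration only.
us_toy = pd.DataFrame({
    'Province/State': ['California', 'Washington', 'Solano, CA', 'Napa, CA', 'Kitsap, WA'],
    '3/9/20': [0, 0, 2, 1, 3],
    '3/10/20': [10, 20, 0, 0, 0],
})
day_cols = ['3/9/20', '3/10/20']

def to_state(name):
    """Map 'County, XX' to the full state name; leave other names unchanged."""
    _, sep, abbrev = name.rpartition(', ')
    if sep and abbrev in abbrev_to_state:
        return abbrev_to_state[abbrev]
    return name

# Label every row with its state, then sum county and state rows per day.
by_state = (us_toy.assign(State=us_toy['Province/State'].map(to_state))
                  .groupby('State')[day_cols].sum())
```

Rows whose suffix is not a key of the dictionary are left under their original name, so any such entries in the real file would still need to be handled by hand.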