# COVID-19 Forecasting
#### By Amanda Marsden
---
I hope to test out some machine learning models to forecast and investigate the spread of COVID-19. 

---
<div class="alert alert-block alert-info">
<b>Note:</b> Data is from the Johns Hopkins Center for Systems Science and Engineering.</div>

#### Import Libraries

In [53]:
import pandas as pd
import numpy as np

#### Read in CSSE Data to DataFrames

In [33]:
global_confirmed = pd.read_csv('/Users/amandamarsden/Git/Covid-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
global_deaths = pd.read_csv('/Users/amandamarsden/Git/Covid-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')

us_confirmed = pd.read_csv('/Users/amandamarsden/Git/Covid-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_us.csv')
us_deaths = pd.read_csv('/Users/amandamarsden/Git/Covid-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_us.csv')
us_confirmed.head()

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,...,4/16/20,4/17/20,4/18/20,4/19/20,4/20/20,4/21/20,4/22/20,4/23/20,4/24/20,4/25/20
0,16.0,AS,ASM,16,60.0,,American Samoa,US,-14.271,-170.132,...,0,0,0,0,0,0,0,0,0,0
1,316.0,GU,GUM,316,66.0,,Guam,US,13.4443,144.7937,...,135,136,136,136,136,136,136,139,141,141
2,580.0,MP,MNP,580,69.0,,Northern Mariana Islands,US,15.0979,145.6739,...,13,13,14,14,14,14,14,14,14,14
3,630.0,PR,PRI,630,72.0,,Puerto Rico,US,18.2208,-66.5901,...,1043,1068,1118,1213,1252,1298,1252,1416,1276,1307
4,850.0,VI,VIR,850,78.0,,Virgin Islands,US,18.3358,-64.8963,...,51,51,53,53,53,53,54,54,54,55


#### Data Cleaning
My goal here is to create a new DataFrame that combines both the confirmed cases and the deaths.
<br/> <br/>
*Building the DataFrame*

1. First, I transpose the DataFrame and retrieve only the date index.

In [34]:
us_confirmed_transpose = us_confirmed.transpose()
us_df = us_confirmed_transpose.drop(['UID' , 'iso2', 'iso3', 'code3',
                                       'Country_Region','Combined_Key',
                                      'Lat', 'Long_', 'FIPS', 'Admin2',
                                      'Province_State'])



2. Next, I use two sets of for loops to fill a new DataFrame in the format I want. The first cell handles the US confirmed cases; the second does the same thing for US deaths.

In [35]:
us = []
us_date = pd.DataFrame(us_df.index,
               columns=['Date'])
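# One [date, count] row per county for each date column
for i in us_date['Date']:
    for k in us_confirmed[i]:
        row = [i, k]  # named `row` to avoid shadowing the built-in `list`
        us.append(row)
df = pd.DataFrame(us, columns = ['Date', 'Confirmed'])

# Repeat the county keys once per date so they line up with the rows above
us1=[]
for i in us_date['Date']:
    for k in us_confirmed['Combined_Key']:
        row = [k]
        us1.append(row)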
df2 = pd.DataFrame(us1, columns = ['Combined_Key'])
us_confirmed_df = pd.concat([df, df2], axis =1)

In [36]:
us = []
us_date = pd.DataFrame(us_df.index,
               columns=['Date'])
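# One [date, count] row per county for each date column
for i in us_date['Date']:
    for k in us_deaths[i]:
        row = [i, k]  # named `row` to avoid shadowing the built-in `list`
        us.append(row)
df = pd.DataFrame(us, columns = ['Date', 'Deaths'])

# Repeat the county keys once per date so they line up with the rows above
us1=[]
for i in us_date['Date']:
    for k in us_deaths['Combined_Key']:
        row = [k]
        us1.append(row)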
df2 = pd.DataFrame(us1, columns = ['Combined_Key'])
us_deaths_df = pd.concat([df, df2], axis =1)

3. Finally, I combine the two together after checking to make sure the information is correct. I then arrange the columns in the order I want. 

In [52]:
us_confirmed_dropped = us_confirmed_df.drop(['Combined_Key'], axis=1)
us_deaths_dropped = us_deaths_df.drop(['Date'], axis = 1)
us_df = pd.concat([us_confirmed_dropped, us_deaths_dropped], axis = 1)
us_df = us_df[['Date', 'Combined_Key', 'Confirmed', 'Deaths']]
us_df.head()

Unnamed: 0,Date,Combined_Key,Confirmed,Deaths
0,1/22/20,"American Samoa, US",0,0
1,1/22/20,"Guam, US",0,0
2,1/22/20,"Northern Mariana Islands, US",0,0
3,1/22/20,"Puerto Rico, US",0,0
4,1/22/20,"Virgin Islands, US",0,0
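
The wide-to-long reshaping done with the loops in steps 1 to 3 can also be sketched with `DataFrame.melt` followed by a merge on the key columns. This is only an illustrative alternative, not the method used in this notebook: the two small frames and the helper `to_long` are hypothetical stand-ins with made-up values, and with the real CSSE files the non-date metadata columns would first need to be dropped or listed in `id_vars`.

```python
import pandas as pd

# Toy stand-ins for the wide CSSE tables (values are made up for illustration).
confirmed_wide = pd.DataFrame({
    "Combined_Key": ["Autauga, Alabama, US", "Baldwin, Alabama, US"],
    "1/22/20": [0, 0],
    "1/23/20": [1, 2],
})
deaths_wide = pd.DataFrame({
    "Combined_Key": ["Autauga, Alabama, US", "Baldwin, Alabama, US"],
    "1/22/20": [0, 0],
    "1/23/20": [0, 1],
})


def to_long(wide, value_name):
    """Hypothetical helper: turn one column per date into one row per (key, date)."""
    return wide.melt(id_vars="Combined_Key", var_name="Date", value_name=value_name)


confirmed_long = to_long(confirmed_wide, "Confirmed")
deaths_long = to_long(deaths_wide, "Deaths")

# Merge on the keys instead of relying on both tables sharing the same row order.
combined = confirmed_long.merge(deaths_long, on=["Combined_Key", "Date"])
combined = combined[["Date", "Combined_Key", "Confirmed", "Deaths"]]
print(combined)
```

Merging on `Combined_Key` and `Date` avoids the assumption, implicit in a side-by-side `pd.concat`, that the confirmed and deaths tables list the counties in exactly the same order.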


Now I'll check to make sure there isn't any missing data in the tables:

In [56]:
def missing_data():
    NA = us_df.columns[us_df.isnull().any()].tolist()
    return NA

us_df[missing_data()].isnull().sum()

Series([], dtype: float64)

Looks good! Next, it's important to check the data types of the columns. I do that with the following code and then convert the <strong>Date</strong> column to a datetime type.

In [65]:
us_df.dtypes
us_df['Date'] = pd.to_datetime(us_df['Date'], format='%m/%d/%y')
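
The format string `'%m/%d/%y'` reads each value as month, day, and two-digit year, which matches the date headers in the CSSE files (for example `4/25/20`). A minimal sketch on two made-up strings, assuming nothing beyond pandas itself:

```python
import pandas as pd

# Two sample date strings in the same month/day/two-digit-year layout.
dates = pd.Series(["1/22/20", "4/25/20"])

# An explicit format avoids ambiguity between month-first and day-first dates.
parsed = pd.to_datetime(dates, format="%m/%d/%y")
print(parsed)

# Once parsed, the values support date arithmetic and sorting.
print((parsed.iloc[1] - parsed.iloc[0]).days)
```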

In [89]:
# Keep only rows that are not US territories.
# Guam is assumed to be the intended fifth territory in this filter.
us_states = us_df[(us_df.Combined_Key != 'American Samoa, US')
                   & (us_df.Combined_Key != 'Puerto Rico, US')
                   & (us_df.Combined_Key != 'Virgin Islands, US')
                   & (us_df.Combined_Key != 'Northern Mariana Islands, US')
                   & (us_df.Combined_Key != 'Guam, US')]

In [99]:
new = us_df["Combined_Key"].str.split(", ", n = 2, expand = True)
us_df['County'] = new[0]
us_df['State'] = new[1]
us_df['Country'] = new[2]

In [104]:
us_df[missing_data()].isnull().sum()
us_df = us_df.dropna()
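
To see why `dropna()` removes the territory rows here, it helps to look at what the split produces for a key with only two parts. The sketch below uses two made-up keys rather than the full table: a key such as `Guam, US` leaves the third column empty, so that row is dropped.

```python
import pandas as pd

# One county-level key (three parts) and one territory key (two parts).
keys = pd.Series(["Autauga, Alabama, US", "Guam, US"])

# Split into at most three pieces; missing pieces become missing values.
parts = keys.str.split(", ", n=2, expand=True)
parts.columns = ["County", "State", "Country"]
print(parts)

# Count missing values per column, then drop the incomplete rows.
print(parts.isna().sum())
complete = parts.dropna()
print(complete)
```

Note that for the two-part key the territory name lands in the `County` column and `US` lands in `State`, which is another reason to filter those rows out before any state-level analysis.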

In [105]:
us_df

Unnamed: 0,Date,Combined_Key,Confirmed,Deaths,County,State,Country
5,2020-01-22,"Autauga, Alabama, US",0,0,Autauga,Alabama,US
6,2020-01-22,"Baldwin, Alabama, US",0,0,Baldwin,Alabama,US
7,2020-01-22,"Barbour, Alabama, US",0,0,Barbour,Alabama,US
8,2020-01-22,"Bibb, Alabama, US",0,0,Bibb,Alabama,US
9,2020-01-22,"Blount, Alabama, US",0,0,Blount,Alabama,US
10,2020-01-22,"Bullock, Alabama, US",0,0,Bullock,Alabama,US
11,2020-01-22,"Butler, Alabama, US",0,0,Butler,Alabama,US
12,2020-01-22,"Calhoun, Alabama, US",0,0,Calhoun,Alabama,US
13,2020-01-22,"Chambers, Alabama, US",0,0,Chambers,Alabama,US
14,2020-01-22,"Cherokee, Alabama, US",0,0,Cherokee,Alabama,US
