# Script to get the data from Johns Hopkins CSSE

This script downloads and cleans data from the [2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE](https://github.com/CSSEGISandData/COVID-19).

The idea comes from this [Kaggle dataset kernel](https://www.kaggle.com/imdevskp/corona-virus-report), with some modifications to the code.

In [1]:
from urllib import request
import pandas as pd

## Downloading raw data

Since 23 March, CSSE no longer updates the time-series files previously used here, so the code was changed to download the new files.

For more information, see [this issue](https://github.com/CSSEGISandData/COVID-19/issues/1250).

In [2]:
# Raw time-series files published by CSSE (global confirmed cases and deaths)
URLS = {
    "Confirmed":
        'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
    "Deaths":
        'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv',
}

# Save each file under data/raw/ (the folder must already exist)
for url_key, url in URLS.items():
    print("- Downloading", url_key, "...")
    request.urlretrieve(url, 'data/raw/new_{}_cases.csv'.format(url_key))
print("- Done!")


# Load the downloaded files; recovered cases are not downloaded above, so that line stays disabled
confirmed = pd.read_csv('data/raw/new_Confirmed_cases.csv')
# recovered = pd.read_csv('data/raw/Recovered_cases.csv')
deaths = pd.read_csv('data/raw/new_Deaths_cases.csv')

confirmed.head()

- Downloading Confirmed ...
- Downloading Deaths ...
- Done!


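| | Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 3/27/20 | 3/28/20 | 3/29/20 | 3/30/20 | 3/31/20 | 4/1/20 | 4/2/20 | 4/3/20 | 4/4/20 | 4/5/20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | Afghanistan | 33.0 | 65.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 110 | 110 | 120 | 170 | 174 | 237 | 273 | 281 | 299 | 349 |
| 1 | | Albania | 41.1533 | 20.1683 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 186 | 197 | 212 | 223 | 243 | 259 | 277 | 304 | 333 | 361 |
| 2 | | Algeria | 28.0339 | 1.6596 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 409 | 454 | 511 | 584 | 716 | 847 | 986 | 1171 | 1251 | 1320 |
| 3 | | Andorra | 42.5063 | 1.5218 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 267 | 308 | 334 | 370 | 376 | 390 | 428 | 439 | 466 | 501 |
| 4 | | Angola | -11.2027 | 17.8739 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 5 | 7 | 7 | 7 | 8 | 8 | 8 | 10 | 14 |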
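
The download cell writes into `data/raw/`, a folder that `urlretrieve` does not create on its own. The following is a minimal sketch of a helper that creates the folder before downloading; `download_all` and its parameters are hypothetical names introduced here, not part of the original script.

```python
from pathlib import Path
from urllib import request


def download_all(urls, target_dir):
    """Download every URL in `urls` (label -> URL) into `target_dir`.

    Hypothetical helper: returns a dict mapping each label to the saved path.
    """
    target = Path(target_dir)
    # Create the folder (and any missing parents) before writing into it.
    target.mkdir(parents=True, exist_ok=True)

    saved = {}
    for label, url in urls.items():
        # Same file naming scheme as the download cell in this script.
        destination = target / "new_{}_cases.csv".format(label)
        request.urlretrieve(url, str(destination))
        saved[label] = destination
    return saved
```

With this in place, `download_all(URLS, 'data/raw')` could stand in for the download loop, assuming `URLS` is the dictionary defined in the cell.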


## Cleaning and saving data

In [3]:
# Date columns: everything after the four identifier columns
all_dates = confirmed.columns[4:]

new_confirmed = confirmed.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], 
                            value_vars=all_dates, var_name='Date', value_name='Confirmed')

# new_recovered = recovered.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], 
#                             value_vars=all_dates, var_name='Date', value_name='Recovered')

new_deaths = deaths.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], 
                            value_vars=all_dates, var_name='Date', value_name='Deaths')

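# Column-wise concat aligns on the row index, so this assumes both files
# list the same regions in the same order
clean_data = pd.concat([new_confirmed, new_deaths['Deaths']], axis=1)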


clean_data.head()

| | Province/State | Country/Region | Lat | Long | Date | Confirmed | Deaths |
|---|---|---|---|---|---|---|---|
| 0 | | Afghanistan | 33.0 | 65.0 | 1/22/20 | 0 | 0 |
| 1 | | Albania | 41.1533 | 20.1683 | 1/22/20 | 0 | 0 |
| 2 | | Algeria | 28.0339 | 1.6596 | 1/22/20 | 0 | 0 |
| 3 | | Andorra | 42.5063 | 1.5218 | 1/22/20 | 0 | 0 |
| 4 | | Angola | -11.2027 | 17.8739 | 1/22/20 | 0 | 0 |
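
The cleaning cell combines the two long tables by row position, which works as long as both files list the same regions in the same order. As a possible alternative, the sketch below joins on the identifier and date columns instead, and also parses the `Date` strings into real dates. It runs on made-up toy data; `to_long`, `ID_COLS` and the toy frames are hypothetical names for illustration, not the author's code.

```python
import pandas as pd

ID_COLS = ["Province/State", "Country/Region", "Lat", "Long"]


def to_long(wide, value_name):
    """Reshape a wide CSSE-style table (one column per date) into long format."""
    date_cols = [c for c in wide.columns if c not in ID_COLS]
    return wide.melt(id_vars=ID_COLS, value_vars=date_cols,
                     var_name="Date", value_name=value_name)


# Toy data only: two regions, two dates.
confirmed_toy = pd.DataFrame({
    "Province/State": ["A", "B"],
    "Country/Region": ["X", "X"],
    "Lat": [1.0, 2.0],
    "Long": [3.0, 4.0],
    "1/22/20": [0, 1],
    "1/23/20": [2, 3],
})
# Same regions in the opposite row order, to show the join does not rely on it.
deaths_toy = pd.DataFrame({
    "Province/State": ["B", "A"],
    "Country/Region": ["X", "X"],
    "Lat": [2.0, 1.0],
    "Long": [4.0, 3.0],
    "1/22/20": [10, 0],
    "1/23/20": [30, 20],
})

# Join on the identifier columns plus the date instead of on row position.
merged = to_long(confirmed_toy, "Confirmed").merge(
    to_long(deaths_toy, "Deaths"), on=ID_COLS + ["Date"], how="left")

# Dates arrive as strings such as '1/22/20'; parse them as month/day/year.
merged["Date"] = pd.to_datetime(merged["Date"], format="%m/%d/%y")
print(merged)
```

A key-based join costs a little more than `pd.concat`, but it does not silently pair the wrong rows if one file gains, loses or reorders a region.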


In [4]:
clean_data.to_csv("data/clean_data.csv")
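
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. If that column is not wanted, passing `index=False` is one option. A small sketch on a toy one-row frame, using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

frame = pd.DataFrame({"Country/Region": ["Angola"], "Confirmed": [14]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the CSV.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```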