# Covid-19 - Data Preprocessing

This notebook covers the data preprocessing for the [Covid-19 Tableau Dashboard](https://public.tableau.com/views/covid19_15924716772030/Dashboard?:language=en&:display_count=y&publish=yes&:origin=viz_share_link) by Mostofa Ahsan.

The Center for Systems Science and Engineering (CSSE) at Johns Hopkins University maintains a public, widely used data repository on the Covid-19 pandemic.

Source: [CSSE](https://github.com/CSSEGISandData/COVID-19)

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read datasets from CSSE github repo
confirmed = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
recoveries = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')

In [3]:
confirmed.head(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,9/26/20,9/27/20,9/28/20,9/29/20,9/30/20,10/1/20,10/2/20,10/3/20,10/4/20,10/5/20
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,39192,39227,39233,39254,39268,39285,39290,39297,39341,39422
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,13153,13259,13391,13518,13649,13806,13965,14117,14266,14410
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,50914,51067,51213,51368,51530,51690,51847,51995,52136,52270
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,1836,1836,1966,1966,2050,2050,2110,2110,2110,2370
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,4672,4718,4797,4905,4972,5114,5211,5370,5402,5530


In [4]:
deaths.head(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,9/26/20,9/27/20,9/28/20,9/29/20,9/30/20,10/1/20,10/2/20,10/3/20,10/4/20,10/5/20
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,1453,1453,1455,1458,1458,1458,1458,1462,1462,1466
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,375,377,380,384,387,388,389,392,396,400
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,1711,1714,1719,1726,1736,1741,1749,1756,1760,1768
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,53,53,53,53,53,53,53,53,53,53
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,171,174,176,179,183,185,189,193,195,199


In [5]:
recoveries.head(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,9/26/20,9/27/20,9/28/20,9/29/20,9/30/20,10/1/20,10/2/20,10/3/20,10/4/20,10/5/20
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,32635,32642,32642,32746,32789,32842,32842,32842,32852,32879
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,7397,7397,7629,7732,7847,8077,8342,8536,8675,8825
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,35756,35860,35962,36063,36174,36282,36385,36482,36578,36672
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,1263,1263,1265,1265,1432,1432,1540,1540,1540,1615
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,1639,1707,1813,1833,1941,2082,2215,2436,2577,2591


## Transforming Wide to Long format
The CSSE time series are in wide format (one column per date), which is awkward to work with in Tableau. A major preprocessing task is therefore to transform them into long format, with one row per location and date.
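
As a small illustration of the reshaping, the sketch below melts a toy table shaped like the CSSE files. The values and the name `long_df` are made up for the example, and parsing the `date` strings into timestamps is an optional extra step that the notebook itself does not perform.

```python
import pandas as pd

# Toy wide-format table shaped like the CSSE files (values are made up)
wide = pd.DataFrame({
    'Province/State': [None, None],
    'Country/Region': ['A', 'B'],
    'Lat': [0.0, 1.0],
    'Long': [0.0, 1.0],
    '1/22/20': [0, 1],
    '1/23/20': [2, 3],
})

# One row per (location, date) instead of one column per date;
# every column not listed in id_vars is treated as a value column
long_df = pd.melt(
    wide,
    id_vars=list(wide.columns[:4]),
    var_name='date',
    value_name='confirmed',
)

# Optional (assumption, not part of the notebook): parse m/d/yy strings into timestamps
long_df['date'] = pd.to_datetime(long_df['date'], format='%m/%d/%y')

print(long_df)
```

Each of the two input rows becomes two output rows, one per date column.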

In [6]:
# Transform wide format to long format
confirmed = pd.melt(confirmed, id_vars=confirmed.columns[:4], value_vars = confirmed.columns[4:], var_name = 'date', value_name = 'confirmed')
deaths = pd.melt(deaths, id_vars=deaths.columns[:4], value_vars = deaths.columns[4:], var_name = 'date', value_name = 'deaths')
recoveries = pd.melt(recoveries, id_vars=recoveries.columns[:4], value_vars = recoveries.columns[4:], var_name = 'date', value_name = 'recoveries')

In [7]:
confirmed.head(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,confirmed
0,,Afghanistan,33.93911,67.709953,1/22/20,0
1,,Albania,41.1533,20.1683,1/22/20,0
2,,Algeria,28.0339,1.6596,1/22/20,0
3,,Andorra,42.5063,1.5218,1/22/20,0
4,,Angola,-11.2027,17.8739,1/22/20,0


In [8]:
deaths.head(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,deaths
0,,Afghanistan,33.93911,67.709953,1/22/20,0
1,,Albania,41.1533,20.1683,1/22/20,0
2,,Algeria,28.0339,1.6596,1/22/20,0
3,,Andorra,42.5063,1.5218,1/22/20,0
4,,Angola,-11.2027,17.8739,1/22/20,0


In [9]:
recoveries.head(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,recoveries
0,,Afghanistan,33.93911,67.709953,1/22/20,0
1,,Albania,41.1533,20.1683,1/22/20,0
2,,Algeria,28.0339,1.6596,1/22/20,0
3,,Andorra,42.5063,1.5218,1/22/20,0
4,,Angola,-11.2027,17.8739,1/22/20,0


## Combining tables
The next step is to combine confirmed, deaths and recoveries tables into a single one for more convenient analysis.

One problem emerges, however, with Canada. The `confirmed` and `deaths` tables report Canada by `Province/State`, while the `recoveries` table reports only a single country-level total.

This mismatch has to be resolved before combining the tables, because an inner join drops rows whose join keys have no match in the other table.
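
One way such a mismatch could be spotted is by comparing the (country, province) pairs of two tables. The sketch below uses toy tables, and the helper names `location_keys` and `mismatched_countries` are hypothetical; it is not how the notebook found the Canada issue.

```python
import pandas as pd

def location_keys(df):
    """Return the set of (country, province) pairs; missing provinces become ''."""
    pairs = df[['Country/Region', 'Province/State']].fillna('').drop_duplicates()
    return set(map(tuple, pairs.to_numpy()))

def mismatched_countries(left, right):
    """Countries whose (country, province) pairs differ between the two tables."""
    diff = location_keys(left) ^ location_keys(right)
    return sorted({country for country, _ in diff})

# Toy tables (made-up values): 'X' is split by province on the left only
left = pd.DataFrame({
    'Country/Region': ['X', 'X', 'Y'],
    'Province/State': ['P1', 'P2', None],
})
right = pd.DataFrame({
    'Country/Region': ['X', 'Y'],
    'Province/State': [None, None],
})

print(mismatched_countries(left, right))
```

Applied to the real tables, a check like this would be expected to flag any country reported at different granularity in the two inputs.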

In [10]:
# Sum Canada's confirmed cases and deaths across provinces for each date
confirmed_canada = confirmed[confirmed['Country/Region'] == 'Canada'].groupby('date')[['confirmed']].sum()
deaths_canada = deaths[deaths['Country/Region'] == 'Canada'].groupby('date')[['deaths']].sum()

# Take the country-level columns (everything except the value column) from the recoveries table
recoveries_canada = recoveries[recoveries['Country/Region'] == 'Canada']
canada_template = recoveries_canada[recoveries_canada.columns[:-1]].reset_index(drop=True)

# Join the aggregated confirmed and deaths data with the extracted columns
confirmed_canada = canada_template.merge(confirmed_canada, how='inner', left_on='date', right_index=True)
deaths_canada = canada_template.merge(deaths_canada, how='inner', left_on='date', right_index=True)

# Add the aggregated data for Canada back to the confirmed and deaths tables
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
confirmed = pd.concat([confirmed[confirmed['Country/Region'] != 'Canada'], confirmed_canada])
deaths = pd.concat([deaths[deaths['Country/Region'] != 'Canada'], deaths_canada])

In [11]:
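# Join confirmed, deaths and recoveries data together.
# Lat/Long exist in all three tables: the first merge suffixes its copies (_x, _y),
# so the unsuffixed Lat/Long selected below come from the recoveries table.
data = confirmed.merge(deaths, how='inner', on=['Country/Region','Province/State','date']).merge(recoveries, how='inner', on=['Country/Region','Province/State','date'])
data = data[['Province/State','Country/Region','date','Lat','Long','confirmed','deaths','recoveries']]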
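
Because an inner join silently drops unmatched keys, it can help to check which rows would be lost. The sketch below is one possible check on toy tables (made-up values), using an outer join with `indicator=True`; it is an optional diagnostic, not part of the original pipeline.

```python
import pandas as pd

keys = ['Country/Region', 'Province/State', 'date']

# Toy long-format tables (made-up values); 'Z' is missing from the right-hand table
left = pd.DataFrame({
    'Country/Region': ['Y', 'Z'],
    'Province/State': [None, None],
    'date': ['1/22/20', '1/22/20'],
    'confirmed': [5, 7],
})
right = pd.DataFrame({
    'Country/Region': ['Y'],
    'Province/State': [None],
    'date': ['1/22/20'],
    'deaths': [1],
})

# An outer join with indicator=True labels each row 'both', 'left_only' or 'right_only'
check = left.merge(right, how='outer', on=keys, indicator=True)

# Rows that an inner join on the same keys would have dropped
dropped = check[check['_merge'] != 'both']

print(dropped[keys + ['_merge']])
```

An empty `dropped` frame would indicate that the inner join loses nothing.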

In [12]:
data.sample(10)

Unnamed: 0,Province/State,Country/Region,date,Lat,Long,confirmed,deaths,recoveries
868,,Guatemala,1/25/20,15.7835,-90.2308,0,0,0
32957,,Tanzania,5/31/20,-6.369028,34.888822,509,21,183
36827,,Central African Republic,6/16/20,6.6111,20.9394,2410,14,396
28713,,MS Zaandam,5/14/20,0.0,0.0,9,2,0
32401,,Mauritania,5/29/20,21.0079,-10.9408,423,20,21
25515,Shanghai,China,5/2/20,31.202,121.4491,652,7,612
38407,St Martin,France,6/22/20,18.0708,-63.0501,42,3,36
31330,Greenland,Denmark,5/25/20,71.7069,-42.6043,12,0,11
2782,Northern Territory,Australia,2/2/20,-12.4634,130.8456,0,0,0
45137,,Brunei,7/19/20,4.5353,114.7277,141,3,138


## Population Data
One metric used in the Covid-19 dashboard is the infection rate: $confirmed / population$. Country populations are not available in the CSSE dataset, so they have to be joined in from another source.

Source: [Tanu N Prabhu](https://www.kaggle.com/tanuprabhu/population-by-country-2020)
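
The infection rate defined above can be sketched in pandas as follows. The numbers are made up, the column name `infection_rate` is an assumption for illustration, and the dashboard may well compute this metric inside Tableau instead.

```python
import pandas as pd

# Toy inputs (made-up numbers); column names follow the tables in this notebook
cases = pd.DataFrame({
    'Country/Region': ['A', 'A', 'B'],
    'date': ['1/22/20', '1/23/20', '1/23/20'],
    'confirmed': [10, 20, 5],
})
population = pd.DataFrame({
    'Country (or dependency)': ['A', 'B'],
    'Population (2020)': [1000, 500],
})

# A left join keeps every case row; countries without a population match get NaN
merged = cases.merge(
    population,
    how='left',
    left_on='Country/Region',
    right_on='Country (or dependency)',
)

# infection rate = confirmed / population
merged['infection_rate'] = merged['confirmed'] / merged['Population (2020)']

print(merged[['Country/Region', 'date', 'infection_rate']])
```

A left join is used here so that country names without a population match show up as missing rates rather than disappearing.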

A very common problem when combining different data sources is that the same entity is named differently in each source.

In [14]:
# Read dataset
population = pd.read_csv('C:/Users/mosto/REPOSITORY/COVID-19 Repo/Covid-19 Dashboard/data/population.csv')

In [15]:
population.sample(10)

Unnamed: 0,Country (or dependency),Population (2020),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Med. Age,Urban Pop %,World Share
94,Tajikistan,9492342,2.32 %,216627,68,139960,-20000.0,3.6,22,27 %,0.12 %
194,Grenada,112418,0.46 %,520,331,340,-200.0,2.1,32,35 %,0.00 %
170,Suriname,585561,0.90 %,5260,4,156000,-1000.0,2.4,29,65 %,0.01 %
102,Sierra Leone,7942879,2.10 %,163768,111,72180,-4200.0,4.3,19,43 %,0.10 %
15,DR Congo,88972681,3.19 %,2770836,40,2267050,23861.0,6.0,17,46 %,1.15 %
31,Argentina,45111229,0.93 %,415097,17,2736690,4800.0,2.3,32,93 %,0.58 %
27,South Korea,51260395,0.09 %,43877,527,97230,11731.0,1.1,44,82 %,0.66 %
91,Honduras,9871892,1.63 %,158490,89,111890,-6800.0,2.5,24,57 %,0.13 %
59,Mali,20125282,3.02 %,592802,17,1220190,-40000.0,5.9,16,44 %,0.26 %
73,Zimbabwe,14818157,1.48 %,217456,38,386850,-116858.0,3.6,19,38 %,0.19 %


In [16]:
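# List countries in the Covid-19 data that have no match in the population table
population_names = set(population['Country (or dependency)'])
for c in data['Country/Region'].unique():
    if c not in population_names:
        print(c)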

Burma
Congo (Brazzaville)
Congo (Kinshasa)
Cote d'Ivoire
Czechia
Diamond Princess
Korea, South
Kosovo
MS Zaandam
Saint Kitts and Nevis
Saint Vincent and the Grenadines
Sao Tome and Principe
Taiwan*
US
West Bank and Gaza


Only a handful of names fail to match, so they are replaced manually. The two cruise ships (Diamond Princess, MS Zaandam) and Kosovo are left unmapped.

In [18]:
country_mapper = {
    'Congo (Brazzaville)': 'Congo',
    'Congo (Kinshasa)': 'DR Congo',
    "Cote d'Ivoire": "Côte d'Ivoire",
    'Czechia': 'Czech Republic (Czechia)',
    'Korea, South': 'South Korea',
    'Saint Vincent and the Grenadines': 'St. Vincent & Grenadines',
    'Taiwan*': 'Taiwan',
    'US': 'United States',
    'West Bank and Gaza': 'Israel',  # yields a second 'Israel' row per date; the counts are not summed
    'Saint Kitts and Nevis': 'Saint Kitts & Nevis',
    'Burma': 'Myanmar',
    'Sao Tome and Principe': 'Sao Tome & Principe'
}

data['Country/Region'] = data['Country/Region'].replace(country_mapper)
# Use the country name as the index without keeping a duplicate column in the export
data = data.set_index('Country/Region')
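
After renaming, it may be worth confirming which names still lack a population counterpart. The sketch below uses a toy subset of the names from this notebook; `data_names`, `population_names` and `mapper` are illustrative stand-ins, not the real tables.

```python
import pandas as pd

# Toy subset of country names as they appear in the Covid-19 data
data_names = pd.Series(['US', 'Czechia', 'Kosovo', 'Albania'])

# Toy subset of names available in the population table
population_names = {'United States', 'Czech Republic (Czechia)', 'Albania'}

# Illustrative subset of the name mapper
mapper = {'US': 'United States', 'Czechia': 'Czech Republic (Czechia)'}

# Apply the mapper, then list whatever still has no population counterpart
renamed = data_names.replace(mapper)
still_unmatched = sorted(set(renamed) - population_names)

print(still_unmatched)
```

Any name that remains in `still_unmatched` would end up without a population value in a later join.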

In [19]:
# Export data
data.to_csv('covid19.csv')