#  Introduction

> Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease was first identified in December 2019 in Wuhan, the capital of China's Hubei province, and has since spread globally, resulting in the ongoing 2019–20 coronavirus pandemic. Common symptoms include fever, cough, and shortness of breath. Other symptoms may include muscle pain, diarrhea, sore throat, loss of smell, and abdominal pain... *(Source: [Wikipedia](https://en.wikipedia.org/wiki/Coronavirus_disease_2019))*

In this notebook, I visualize the data on this disease. This involves several steps: scraping the data, visualizing it, and then building an interactive plot.

**Note: please re-run the notebook to get the latest data and for the interactive plot to work.**

# Import libraries

In [1]:
import numpy as np
import pandas as pd
# import requests
from datetime import datetime
from datetime import timedelta

# Get the data from github to dataframes

## Source: **2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE** on [GitHub](https://github.com/CSSEGISandData/COVID-19)

In [2]:
url = {}
url['confirmed'] = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'

url['deaths'] = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'

url['recovered'] = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'

In [3]:
df_confirmed=pd.read_csv(url['confirmed'])
df_deaths=pd.read_csv(url['deaths'])
df_recovered=pd.read_csv(url['recovered'])

# Combine df_confirmed, df_deaths and df_recovered

Preview the dataframe for **confirmed**, **deaths** and **recovered** cases

In [4]:
print(f'Total row in df_confirmed: {df_confirmed.shape[0]}')
print(f'Number of unique countries: {df_confirmed.groupby("Country/Region").count().shape[0]}')
df_confirmed.head()

Total row in df_confirmed: 266
Number of unique countries: 188


Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,6/7/20,6/8/20,6/9/20,6/10/20,6/11/20,6/12/20,6/13/20,6/14/20,6/15/20,6/16/20
0,,Afghanistan,33.0,65.0,0,0,0,0,0,0,...,20342,20917,21459,22142,22890,23546,24102,24766,25527,26310
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,1246,1263,1299,1341,1385,1416,1464,1521,1590,1672
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,10154,10265,10382,10484,10589,10698,10810,10919,11031,11147
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,852,852,852,852,852,853,853,853,853,854
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,91,92,96,113,118,130,138,140,142,148


In [5]:
print(f'Total row in df_deaths: {df_deaths.shape[0]}')
print(f'Number of unique countries: {df_deaths.groupby("Country/Region").count().shape[0]}')
df_deaths.head()

Total row in df_deaths: 266
Number of unique countries: 188


Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,6/7/20,6/8/20,6/9/20,6/10/20,6/11/20,6/12/20,6/13/20,6/14/20,6/15/20,6/16/20
0,,Afghanistan,33.0,65.0,0,0,0,0,0,0,...,357,369,384,405,426,446,451,471,478,491
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,34,34,34,34,35,36,36,36,36,37
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,707,715,724,732,741,751,760,767,777,788
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,51,51,51,51,51,51,51,51,51,52
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,4,4,4,4,5,5,6,6,6,6


In [6]:
print(f'Total row in df_recovered: {df_recovered.shape[0]}')
print(f'Number of unique countries: {df_recovered.groupby("Country/Region").count().shape[0]}')
df_recovered.head()

Total row in df_recovered: 253
Number of unique countries: 188


Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,6/7/20,6/8/20,6/9/20,6/10/20,6/11/20,6/12/20,6/13/20,6/14/20,6/15/20,6/16/20
0,,Afghanistan,33.0,65.0,0,0,0,0,0,0,...,1875,2171,2651,3013,3326,3928,4201,4725,5164,5508
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,938,945,960,980,1001,1034,1039,1044,1055,1064
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,6717,6799,6951,7074,7255,7322,7420,7606,7735,7842
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,744,751,757,759,780,781,781,781,789,789
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,24,38,38,40,41,42,61,61,64,64


The number of rows in the three dataframes doesn't match, but the number of countries does.

Since we will do the visualization on the country scale, we will group the values by country and drop the **Province/State** column altogether.

In [7]:
df_confirmed = df_confirmed.groupby('Country/Region').sum()
df_deaths = df_deaths.groupby('Country/Region').sum()
df_recovered = df_recovered.groupby('Country/Region').sum()

# Drop the Lat and Long columns: after summing per country they are no longer valid coordinates

df_confirmed.drop(['Lat', 'Long'], axis=1, inplace=True)
df_deaths.drop(['Lat', 'Long'], axis=1, inplace=True)
df_recovered.drop(['Lat', 'Long'], axis=1, inplace=True)

print(f'Confirmed DF {df_confirmed.shape}\nDeath DF {df_deaths.shape}\nRecovered DF {df_recovered.shape}')

Confirmed DF (188, 147)
Death DF (188, 147)
Recovered DF (188, 147)
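
To make the grouping step concrete, here is a minimal sketch on a toy table in the same wide layout. The values and the names `toy`, `date_cols` and `toy_by_country` are made up for illustration. It selects the date columns explicitly before summing, which is one way to keep the text and coordinate columns out of the result; it is not necessarily how the cell above behaves on every pandas version.

```python
import pandas as pd

# Toy data in the same wide layout as the JHU CSV (values are made up)
toy = pd.DataFrame({
    'Province/State': ['Hubei', 'Beijing', None],
    'Country/Region': ['China', 'China', 'Vietnam'],
    'Lat': [30.97, 40.18, 14.06],
    'Long': [112.27, 116.41, 108.28],
    '1/22/20': [444, 14, 0],
    '1/23/20': [444, 22, 2],
})

# Keep only the date columns, then sum the provinces within each country
meta_cols = ('Province/State', 'Country/Region', 'Lat', 'Long')
date_cols = [c for c in toy.columns if c not in meta_cols]
toy_by_country = toy.groupby('Country/Region')[date_cols].sum()

print(toy_by_country)
```

The two Chinese provinces collapse into a single row, so the result has one row per country and one column per date.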


### Adding coordinates for countries

We want to include coordinates for each country in the final data, so we use my coordinates dataset ([link](https://www.kaggle.com/dataset/48a48a5fe3252970243a19f0927a11df9a91886861bc29ec191bcbcc7683f76c)) and geopy's Nominatim geocoder to fill in any missing entries.

In [8]:
# Generate country list from scraped data

country_list = df_confirmed.index.values
print(f'Total: {len(country_list)} countries')

Total: 188 countries


Loading coordinates data

In [9]:
try:
    df_countries = pd.read_csv('countries_geocode.csv')
except FileNotFoundError:
    print('Could not load countries\' coordinates from file!')
    df_countries = pd.DataFrame(columns=['Country/Region', 'Lat', 'Long'])
    df_countries['Country/Region'] = country_list
else:
    print('Countries'' coordinates loaded from file!')

print(df_countries.shape)

# Find countries that are not yet in the coordinates list

country_coord_list = df_countries['Country/Region'].tolist()

missing_country = list(set(country_list)-set(country_coord_list))

print(f'Missing countries list: {missing_country}')

Countries coordinates loaded from file!
(188, 3)
Missing countries list: []


Getting countries' coordinates

In [10]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Create the geocoder up front so the loop below can use it even when
# no new country was added but an existing row still lacks coordinates
locator = Nominatim(user_agent='Kaggle_covid')
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

if missing_country:
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    new_rows = pd.DataFrame({'Country/Region': missing_country, 'Lat': np.nan, 'Long': np.nan})
    df_countries = pd.concat([df_countries, new_rows], ignore_index=True)
else:
    print('No countries with missing coordinates info!')

for i in df_countries.index:
    if pd.isna(df_countries.loc[i, 'Lat']) or pd.isna(df_countries.loc[i, 'Long']):
        print(f'Get coordinates for {df_countries.loc[i, "Country/Region"]}...')
        loc = geocode(df_countries.loc[i, 'Country/Region'])

        if loc is None:
            print(f'Get coordinates for {df_countries.loc[i, "Country/Region"]} failed!')
            continue

        df_countries.loc[i, 'Lat'] = loc.latitude
        df_countries.loc[i, 'Long'] = loc.longitude

print('Done!')

# Save to file
df_countries.to_csv('countries_geocode.csv', index=False)

df_countries.set_index('Country/Region', drop=True, inplace=True)


No countries with missing coordinates info!
Done!


## Combine and insert coordinates data

Let's take a look at a source dataframe before combining them. Since the three dataframes share the same structure, we only look at **df_confirmed**.

In [11]:
df_confirmed.sample()

Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,...,6/7/20,6/8/20,6/9/20,6/10/20,6/11/20,6/12/20,6/13/20,6/14/20,6/15/20,6/16/20
Burkina Faso,0,0,0,0,0,0,0,0,0,0,...,889,890,891,891,892,892,892,894,894,895


In [12]:
date_list = df_confirmed.columns.tolist()

final = {'Country/Region':[], 'Lat': [], 'Long': [], 'Date':[], 'Confirmed':[], 'Deaths':[], 'Recovered':[]}

for c in country_list:
    coord = df_countries.loc[c].tolist()
    lat = coord[0]
    long = coord[1]

    for d in date_list:
        final['Country/Region'].append(c)
        final['Lat'].append(lat)
        final['Long'].append(long)
        final['Date'].append(d)
        final['Confirmed'].append(df_confirmed.loc[c, d])
        final['Deaths'].append(df_deaths.loc[c, d])
        final['Recovered'].append(df_recovered.loc[c, d])

df_final = pd.DataFrame(final)

print(df_final.shape)

df_final.head()

(27636, 7)


Unnamed: 0,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
0,Afghanistan,33.768006,66.238514,1/22/20,0,0,0
1,Afghanistan,33.768006,66.238514,1/23/20,0,0,0
2,Afghanistan,33.768006,66.238514,1/24/20,0,0,0
3,Afghanistan,33.768006,66.238514,1/25/20,0,0,0
4,Afghanistan,33.768006,66.238514,1/26/20,0,0,0
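
The nested loop above builds the long table one cell at a time. As an alternative sketch, the same wide-to-long reshape can be expressed with `DataFrame.melt` and `merge`. The toy tables and the helper name `to_long` are hypothetical, and only two of the three measures are shown to keep the example short.

```python
import pandas as pd

# Toy per-country tables indexed by country (values are made up)
dates = ['1/22/20', '1/23/20']
confirmed = pd.DataFrame([[0, 1], [5, 8]], index=['A', 'B'], columns=dates)
deaths = pd.DataFrame([[0, 0], [1, 2]], index=['A', 'B'], columns=dates)
confirmed.index.name = 'Country/Region'
deaths.index.name = 'Country/Region'

def to_long(wide, value_name):
    """Turn one wide table (countries x dates) into (country, date, value) rows."""
    return wide.reset_index().melt(id_vars='Country/Region',
                                   var_name='Date',
                                   value_name=value_name)

# One long table per measure, joined on the (country, date) key
long_df = to_long(confirmed, 'Confirmed').merge(
    to_long(deaths, 'Deaths'), on=['Country/Region', 'Date'])

print(long_df)
```

Coordinates could then be attached with one more `merge` on `Country/Region`; that step is left out here.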


# Visualization

We will use **plotly** as our visualization library

In [13]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio

# Set default plot theme
pio.templates.default = 'plotly_white'

# Set the date range (in days) and the number of top countries to show
date_range = 30
num_country = 10

Create function to generate color array

In [14]:
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

def make_rainbow(num_color):
    colors_array = cm.rainbow(np.linspace(0, 1, num_color))
    return [colors.rgb2hex(i) for i in colors_array]

In [15]:
df = df_final

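# Convert the Date column (m/d/yy strings) to datetime64;
# a unit-less astype('datetime64') is rejected by pandas 2.x
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')

# Add an Active column: confirmed cases that are neither dead nor recovered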
df['Active'] = df['Confirmed'] - df['Deaths'] - df['Recovered']
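
As a small worked check of these two steps, the sketch below parses one date label and computes the active count for a single row. It uses the US figures for 6/16/20 that appear in the top-countries table later in this notebook; the dataframe name `row` is just for illustration, and `pd.to_datetime` with an explicit format is one way to parse the labels, shown here as an assumption about the m/d/yy layout.

```python
import pandas as pd

# US figures for 6/16/20, taken from the top-countries table in this notebook
row = pd.DataFrame({
    'Date': ['6/16/20'],
    'Confirmed': [2137731],
    'Deaths': [116963],
    'Recovered': [583503],
})

# The date labels are month/day/two-digit-year strings
row['Date'] = pd.to_datetime(row['Date'], format='%m/%d/%y')

# Active = Confirmed - Deaths - Recovered
row['Active'] = row['Confirmed'] - row['Deaths'] - row['Recovered']

print(row)
```

The computed value, 1,437,265, matches the **Active** figure shown for the US in that table.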

Get current countries with highest confirmed cases

In [16]:
# Get last date of the data
last_date = df['Date'].max()

df_latest = df[df['Date'] == last_date].sort_values(by=['Confirmed'], ascending = True).reset_index(drop=True)

df_top = df_latest.tail(num_country)
top_countries = df_top['Country/Region'].tolist()

df_top.tail()

Unnamed: 0,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active
183,United Kingdom,54.702355,-3.276575,2020-06-16,299600,42054,1293,256253
184,India,22.351115,78.667743,2020-06-16,354065,11903,186935,155227
185,Russia,64.686314,97.745306,2020-06-16,544725,7274,293780,243671
186,Brazil,-10.333333,-53.2,2020-06-16,923189,45241,490005,387943
187,US,39.78373,-100.445882,2020-06-16,2137731,116963,583503,1437265
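
The cell above sorts ascending and takes the tail so that the largest country ends up last. A minimal alternative sketch uses `DataFrame.nlargest`, shown here on the five rows from the output above; the names `latest`, `top3` and `top3_ascending` are illustrative only.

```python
import pandas as pd

# Confirmed counts for 6/16/20, copied from the output above
latest = pd.DataFrame({
    'Country/Region': ['India', 'US', 'United Kingdom', 'Brazil', 'Russia'],
    'Confirmed': [354065, 2137731, 299600, 923189, 544725],
})

# Pick the three largest rows directly, largest first
top3 = latest.nlargest(3, 'Confirmed')

# Re-sort ascending if the plot should end with the largest country
top3_ascending = top3.sort_values('Confirmed')

print(top3_ascending)
```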


In [17]:
plot = {}

cat_color = make_rainbow(4)
cat = ['Confirmed', 'Active', 'Recovered', 'Deaths']

color_list = {}
for k, v in zip(cat, cat_color):
    color_list[k] = v

plot['top'] = go.Figure()


for c in cat[1:]:
    plot['top'].add_trace(go.Bar(x=df_top['Country/Region'], y=df_top[c],
                                 marker_color=color_list[c], name=c,
                                 text=df_top['Confirmed'].apply('{:,.0f}'.format).astype(str) + ' confirmed cases',
                                 hovertemplate='<b>%{x}</b><br>%{fullData.name}: <b>%{y:,.0f}</b> / %{text}<extra></extra>'))

plot['top'].update_layout(title_text=f"Top {num_country} Countries with COVID-19<br>{last_date.strftime('%m/%d/%Y')}",
                          title_font_size=30, title_x=0.5, 
                          width=800, height=600,
                          barmode='stack');

plot['top']

In [18]:
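# df_latest is sorted ascending, so every row before the last num_country rows is outside the top group
not_top_10 = df_latest.head(df_latest.shape[0]-num_country)['Country/Region'].tolist()

# Relabel those countries and total the confirmed cases per label
# (selecting the column avoids summing the Date and coordinate columns)
df_temp = df_latest.replace(not_top_10, 'Other Countries').groupby('Country/Region')[['Confirmed']].sum()

# Pull the 'Other Countries' slice out of the pie
pull_out = [0.2 if c == 'Other Countries' else 0 for c in df_temp.index]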

plot['pie'] = go.Figure(data=[go.Pie(labels = df_temp.index, values = df_temp['Confirmed'], pull = pull_out,
                                     textinfo='label+percent', insidetextorientation='radial')])

plot['pie'].update_layout(title=go.layout.Title(text=f"Top {num_country} vs the others", font=dict(size=30), x=0.5),
                          legend = go.layout.Legend(bordercolor = 'black', borderwidth = 1), 
                          width = 800, height = 600);

plot['pie'].show()

# Interactive plot

Run the notebook to use the interactive plot.

In [19]:
from ipywidgets import widgets

# Get the first date in dataframe
min_date = df['Date'].min()

# Define widgets
country = widgets.Dropdown(
    description='Country/Region: ',
    value='Vietnam',
    options=df['Country/Region'].unique().tolist()
)

yaxis_log = widgets.Checkbox(
    description='Overview - Logarithmic y-axis ',
    value=False,
)

start_date = min_date

start_date_w = widgets.DatePicker(
    description='From:',
    value = start_date,
    disabled=False
)

end_date_w = widgets.DatePicker(
    description='To:',
    value = last_date,
    disabled=False
)

full_range_btn = widgets.Button(
    description='View all days',
    disabled=False,
    button_style='',
    tooltip='View from 1/22/2020'
)

last_7_btn = widgets.Button(description = 'Last 7 days',
                           disabled = False)

# Create df to use for plot
df_temp = df[df['Country/Region'] == 'Vietnam'].reset_index(drop=True)

# Insert daily columns
def insert_daily(inp_df, col_list, prefix = 'd_'):
    for c in col_list:
        inp_df[prefix+c] = inp_df[c]

        for i in range (1, len(inp_df)):
            inp_df.loc[i, prefix+c] = inp_df.loc[i, c] - inp_df.loc[i-1, c]
    
    return inp_df

df_temp = insert_daily(df_temp, cat)

df_int = df_temp[(df_temp['Date'] >= start_date) & (df_temp['Date'] <= last_date)].reset_index(drop=True)

df_pie = df_int[df_int['Date'] == end_date_w.value][['Active', 'Recovered', 'Deaths']].reset_index(drop=True).transpose()

# Draw initial plot
plot['Interactive'] = make_subplots(rows=2, cols=2, column_widths=[0.7, 0.3],
                                    shared_xaxes=True, vertical_spacing=0.05,
                                    specs=[[{"type": "xy"}, {"type": "domain", "rowspan": 2}], [{"type": "xy"}, None]],
                                    subplot_titles=("Overview","Distribution", "Daily change"))

pie_colors =[color_list[i] for i in cat[1:]]

# Overview chart
for k in cat:
    plot['Interactive'].add_trace(go.Scatter(x=df_int['Date'], y=df_int[k], 
                                             mode='lines+markers', line_shape='spline',
                                             marker=dict(size=4, color=color_list[k]), 
                                             name=k, text=df_int[k].astype(str)+' '+k, 
                                             legendgroup=k),
                                  row=1, col=1)

# Pie chart
plot['Interactive'].add_trace(go.Pie(labels = df_pie.index, values = df_pie[0], pull = pull_out,
                                     textinfo='label+value+percent', insidetextorientation='radial',
                                     marker=dict(colors=pie_colors),
                                     name='Distribution',
                                     showlegend=False),
                              row=1, col=2)

# Daily chart
for k in cat:
    plot['Interactive'].add_trace(go.Bar(x=df_int['Date'], y=df_int['d_'+k],
                                         name=f'Daily {k}',
                                         marker=dict(color=color_list[k]),
                                         legendgroup=k, showlegend=False),
                                 row=2, col=1)
    
plot['Interactive'].update_layout(title=dict(text='COVID-19 in Vietnam', font=dict(size=30), x=0.5), height=700, yaxis_type = '-')

g = go.FigureWidget(data=plot['Interactive'],
                   layout=go.Layout(height = 700))

def validate():
    if country.value not in df['Country/Region'].unique():
        return False

    # DatePicker values may be datetime.date; convert before comparing with Timestamps
    if pd.Timestamp(start_date_w.value) < min_date:
        start_date_w.value = min_date

    if pd.Timestamp(end_date_w.value) > last_date:
        end_date_w.value = last_date

    if pd.Timestamp(start_date_w.value) > pd.Timestamp(end_date_w.value):
        start_date_w.value, end_date_w.value = end_date_w.value, start_date_w.value

    return True


def response(change):
    if validate():
        global df_temp, df_int, df_pie
        df_temp = df[df['Country/Region'] == country.value].reset_index(drop=True)
        df_temp = insert_daily(df_temp, cat)

        df_int = df_temp[(df_temp['Date'] >= pd.Timestamp(start_date_w.value)) & (df_temp['Date'] <= pd.Timestamp(end_date_w.value))].reset_index(drop=True)
        df_pie = df_int[df_int['Date'] == pd.Timestamp(end_date_w.value)][['Active', 'Recovered', 'Deaths']].reset_index(drop=True).transpose()

        with g.batch_update():
            idx = 0
            
            # Update Overview chart
            for k in cat:
                g.data[idx].x = df_int['Date']
                g.data[idx].y = df_int[k]
                g.data[idx].name = k
                g.data[idx].text = df_int[k].apply('{:,.0f}'.format).astype(str)+' '+k
                idx += 1
            
            # Update Distribution chart
            g.data[idx].labels = df_pie.index
            g.data[idx].values = df_pie[0]
            idx += 1
            
            # Update Daily change chart
            for k in cat:
                g.data[idx].x = df_int['Date']
                g.data[idx].y = df_int['d_'+k]
                g.data[idx].name = f'Daily {k}'
                g.data[idx].text = df_int['d_'+k].apply('{:,.0f}'.format).astype(str)+' '+k
                idx += 1
            
            g.layout.title.text = f'COVID-19 in {country.value}'
                
    else: print('Error')

def response_log(change):
    g.layout.yaxis.type = 'log' if yaxis_log.value else '-'
#     g.layout.yaxis2.type = 'log' if yaxis_log.value else '-'

def response_fullrange(change):
    start_date_w.value = min_date
    end_date_w.value = last_date

def response_7(change):
    start_date_w.value = last_date - pd.to_timedelta(7, unit='day')
    end_date_w.value = last_date
        
country.observe(response, names="value")
start_date_w.observe(response, names='value')
end_date_w.observe(response, names='value')
yaxis_log.observe(response_log, names='value')
full_range_btn.on_click(response_fullrange)
last_7_btn.on_click(response_7)

# Define interactive widget layout
row1 = widgets.HBox([country])
row2 = widgets.HBox([start_date_w, end_date_w, full_range_btn, last_7_btn])
row3 = widgets.HBox([yaxis_log])
widgets.VBox([row1, row2, g, row3])

VBox(children=(HBox(children=(Dropdown(description='Country/Region: ', index=182, options=('Afghanistan', 'Alb…
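
The `insert_daily` helper in the cell above fills the daily columns row by row. As a sketch of a vectorized alternative, `Series.diff` computes the same day-over-day change in one call. The function name `insert_daily_diff` and the toy data are hypothetical, and unlike the original this version returns a copy instead of modifying its input.

```python
import pandas as pd

# Toy cumulative counts for one country (values are made up)
cumulative = pd.DataFrame({'Confirmed': [0, 2, 2, 5], 'Deaths': [0, 0, 1, 1]})

def insert_daily_diff(inp_df, col_list, prefix='d_'):
    """Add a daily-change column for each cumulative column in col_list."""
    out = inp_df.copy()
    for c in col_list:
        # diff() leaves NaN in the first row; keep the cumulative value there,
        # which matches what the row-by-row loop does
        out[prefix + c] = out[c].diff().fillna(out[c]).astype(out[c].dtype)
    return out

daily = insert_daily_diff(cumulative, ['Confirmed', 'Deaths'])
print(daily)
```

Since `response` recomputes the daily columns on every widget change, avoiding the Python-level loop may make the plot feel more responsive, though this has not been measured here.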