# Covid-19

In this notebook, we examine data on COVID-19, the disease caused by the novel coronavirus (nCoV).
The data is collected globally for each country (and sometimes region) and is updated every 24 hours.
The dataset is available here: https://www.kaggle.com/imdevskp/corona-virus-report

Our goal is to determine the mortality rate per country over time and to visualise which countries are hit hardest by the pandemic, by number of confirmed cases, deaths, active cases and recovered cases.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
import plotly.express as px

# Flourish app embed
from IPython.display import Javascript, display, HTML

import geopandas

In [2]:
covid_19 = pd.read_csv('data/covid_19_clean_complete.csv', parse_dates=['Date'])
covid_19.head()

| | Province/State | Country/Region | Lat | Long | Date | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|---|
| 0 | | Afghanistan | 33.0 | 65.0 | 2020-01-22 | 0 | 0 | 0 |
| 1 | | Albania | 41.1533 | 20.1683 | 2020-01-22 | 0 | 0 | 0 |
| 2 | | Algeria | 28.0339 | 1.6596 | 2020-01-22 | 0 | 0 | 0 |
| 3 | | Andorra | 42.5063 | 1.5218 | 2020-01-22 | 0 | 0 | 0 |
| 4 | | Angola | -11.2027 | 17.8739 | 2020-01-22 | 0 | 0 | 0 |


## Data preprocessing

We will only focus on:

- Confirmed cases
- Deaths
- Recovered
- Active cases

We can ignore the coordinates, fill in missing values and aggregate the provinces or states of each country, where present, into a single country total.

In [26]:
# case columns used throughout the notebook
cases = ['Confirmed', 'Deaths', 'Recovered', 'Active']

# replace 'Mainland China' with just 'China'
covid_19['Country/Region'] = covid_19['Country/Region'].replace('Mainland China', 'China')

# fill missing values first, so that a missing count does not turn 'Active' into NaN
covid_19[['Province/State']] = covid_19[['Province/State']].fillna('')
reported = ['Confirmed', 'Deaths', 'Recovered']
covid_19[reported] = covid_19[reported].fillna(0)

# fix datatypes
covid_19['Recovered'] = covid_19['Recovered'].astype(int)

# active cases = confirmed - deaths - recovered
covid_19['Active'] = covid_19['Confirmed'] - covid_19['Deaths'] - covid_19['Recovered']
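
The order of filling and subtracting matters whenever a count is missing. The sketch below uses made-up values (not taken from the dataset) and the hypothetical names `df`, `late_fill` and `early_fill` to show the difference; whether the real file contains such gaps is an assumption to check.

```python
import numpy as np
import pandas as pd

# Made-up rows: the first one has no 'Recovered' value reported.
df = pd.DataFrame({
    'Confirmed': [10, 5],
    'Deaths': [1, 0],
    'Recovered': [np.nan, 2],
})

# Subtract first, fill afterwards: the NaN propagates and is then replaced by 0.
late_fill = (df['Confirmed'] - df['Deaths'] - df['Recovered']).fillna(0)

# Fill first, subtract afterwards: the missing count is treated as 0 recovered.
filled = df.fillna(0)
early_fill = filled['Confirmed'] - filled['Deaths'] - filled['Recovered']

print(late_fill.tolist())
print(early_fill.tolist())
```

In the first variant the row with a missing `Recovered` value ends up with zero active cases, which hides nine unresolved cases.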

Next, we select the latest date and sum over all countries to find the global confirmed cases, deaths and mortality rate.

In [27]:
# rows for the latest date only
latest = covid_19[covid_19['Date'] == covid_19['Date'].max()].reset_index()

# latest, aggregated per country (selecting several columns needs a list, hence the double brackets)
latest_grouped = latest.groupby('Country/Region')[['Confirmed', 'Deaths', 'Recovered', 'Active']].sum().reset_index()

# TODO (ask Miriam): how to convert to a table where the index is the country and the columns are single days
#by_country_date = latest_grouped['Date'].reset_index('Country/Region')
#latest_grouped.to_csv(r'data/covid_19_merged.csv', index = False)
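
The open question in the cell above (a table with one row per country and one column per day) could be approached with `pivot_table`. This is a minimal sketch on made-up numbers, assuming the long format of `covid_19`; `long_df` and `wide` are names introduced here for illustration only.

```python
import pandas as pd

# Made-up long-format data with the same column names as covid_19.
long_df = pd.DataFrame({
    'Country/Region': ['Italy', 'Italy', 'China', 'China', 'China', 'China'],
    'Province/State': ['', '', 'Hubei', 'Hubei', 'Beijing', 'Beijing'],
    'Date': pd.to_datetime(['2020-03-26', '2020-03-27'] * 3),
    'Confirmed': [10, 20, 5, 6, 1, 2],
})

# One row per country, one column per day; provinces are summed per country.
wide = long_df.pivot_table(index='Country/Region', columns='Date',
                           values='Confirmed', aggfunc='sum')
print(wide)
```

Note that the pivot has to start from the full frame rather than from `latest_grouped`, which no longer has a `Date` column.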

In [28]:
total = covid_19.groupby(['Country/Region', 'Province/State'])[['Confirmed', 'Deaths', 'Recovered', 'Active']].max()

In [35]:
total = covid_19.groupby('Date')[['Confirmed', 'Deaths', 'Recovered', 'Active']].sum().reset_index()
total = total[total['Date'] == total['Date'].max()].reset_index(drop=True)
total['Global Mortality'] = total['Deaths'] / total['Confirmed']
total['Deaths per 100 Confirmed Cases'] = total['Global Mortality'] * 100
total.style.background_gradient(cmap='inferno')

| | Date | Confirmed | Deaths | Recovered | Active | Global Mortality | Deaths per 100 Confirmed Cases |
|---|---|---|---|---|---|---|---|
| 0 | 2020-03-27 00:00:00 | 593291 | 27198 | 130659 | 435434 | 0.0458426 | 4.58426 |


Now we can group by country and display the countries in order of confirmed cases.

In [36]:
by_confirmed = latest_grouped.sort_values(by='Confirmed', ascending=False)
by_confirmed = by_confirmed[['Country/Region', 'Confirmed', 'Active', 'Deaths', 'Recovered']]
by_confirmed = by_confirmed.reset_index(drop=True)

by_confirmed.style.background_gradient(cmap="Blues", subset=['Confirmed'])\
            .background_gradient(cmap="Oranges", subset=['Active'])\
            .background_gradient(cmap="Greens", subset=['Recovered'])\
            .background_gradient(cmap="Reds", subset=['Deaths'])

| | Country/Region | Confirmed | Active | Deaths | Recovered |
|---|---|---|---|---|---|
| 0 | US | 101657 | 99207 | 1581 | 869 |
| 1 | Italy | 86498 | 66414 | 9134 | 10950 |
| 2 | China | 81897 | 3881 | 3296 | 74720 |
| 3 | Spain | 65719 | 51224 | 5138 | 9357 |
| 4 | Germany | 50871 | 43871 | 342 | 6658 |
| 5 | France | 33402 | 25698 | 1997 | 5707 |
| 6 | Iran | 32332 | 18821 | 2378 | 11133 |
| 7 | United Kingdom | 14745 | 13833 | 761 | 151 |
| 8 | Switzerland | 12928 | 11167 | 231 | 1530 |
| 9 | South Korea | 9332 | 4665 | 139 | 4528 |


Let's do the same, this time displaying only overall deaths and calculating the mortality per country, reported as deaths per 100 confirmed cases:

Mortality rate = number of deaths / number of confirmed cases
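
As a worked check of this formula, the sketch below plugs in the 2020-03-27 figures for Italy and Germany from the by-country table in this notebook. The helper `deaths_per_100_cases` is a hypothetical name introduced here, not part of the notebook's code.

```python
def deaths_per_100_cases(deaths, confirmed):
    """Deaths per 100 confirmed cases, rounded to two decimals."""
    if confirmed == 0:
        # No confirmed cases: the ratio is undefined.
        return float('nan')
    return round(deaths / confirmed * 100, 2)


# Figures from the by-country table (2020-03-27).
italy = deaths_per_100_cases(9134, 86498)
germany = deaths_per_100_cases(342, 50871)
print(italy, germany)
```

The ratio depends on how many cases a country confirms through testing, so it is best read as deaths among reported cases rather than among all infections.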

In [40]:
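# keep only countries with at least one death; .copy() avoids a SettingWithCopyWarning
by_deaths = by_confirmed[by_confirmed['Deaths'] > 0][['Country/Region', 'Deaths']].copy()
# the ratio is computed on by_confirmed and aligned to by_deaths by index
by_deaths['Deaths / 100 Cases'] = (by_confirmed['Deaths'] / by_confirmed['Confirmed'] * 100).round(2)
by_deaths.sort_values('Deaths', ascending=False).reset_index(drop=True).style.background_gradient(cmap='Reds')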

| | Country/Region | Deaths | Deaths / 100 Cases |
|---|---|---|---|
| 0 | Italy | 9134 | 10.56 |
| 1 | Spain | 5138 | 7.82 |
| 2 | China | 3296 | 4.02 |
| 3 | Iran | 2378 | 7.35 |
| 4 | France | 1997 | 5.98 |
| 5 | US | 1581 | 1.56 |
| 6 | United Kingdom | 761 | 5.16 |
| 7 | Netherlands | 547 | 6.33 |
| 8 | Germany | 342 | 0.67 |
| 9 | Belgium | 289 | 3.97 |
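
The table above is a snapshot for the latest date. Since the stated goal is the mortality rate per country over time, one possible extension is to compute the same ratio for every date. This is a sketch on made-up numbers with hypothetical names (`df`, `daily`), assuming the column layout of `covid_19`.

```python
import pandas as pd

# Made-up data for two placeholder countries over three days.
df = pd.DataFrame({
    'Country/Region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime(['2020-03-25', '2020-03-26', '2020-03-27'] * 2),
    'Confirmed': [0, 100, 200, 50, 50, 80],
    'Deaths': [0, 5, 12, 1, 2, 4],
})

# Sum provinces (if any) per country and date.
daily = df.groupby(['Country/Region', 'Date'], as_index=False)[['Confirmed', 'Deaths']].sum()

# Mask days with zero confirmed cases so the ratio is NaN instead of a division by zero.
confirmed = daily['Confirmed'].where(daily['Confirmed'] > 0)
daily['Deaths / 100 Cases'] = (daily['Deaths'] / confirmed * 100).round(2)
print(daily)
```

The resulting frame has one row per country and date, which can be passed to a line plot with the date on the x-axis and one line per country.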


In [41]:
# Deaths
temp = latest_grouped[latest_grouped['Deaths']>0]
fig = px.choropleth(temp, 
                    locations="Country/Region", locationmode='country names',
                    color=np.log(temp["Deaths"]), hover_name="Country/Region", 
                    color_continuous_scale="Peach", hover_data=['Deaths'],
                    title='Countries with Deaths Reported')
fig.update(layout_coloraxis_showscale=False)
fig.show()
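
The map colours countries by the natural logarithm of deaths rather than by the raw count. The sketch below uses four death counts from the table above to illustrate, as an interpretation rather than the author's stated reasoning, how the logarithm compresses the range so that a few hard-hit countries do not wash out the rest. The names `deaths`, `raw_spread` and `log_spread` are introduced here.

```python
import numpy as np
import pandas as pd

# Death counts from the by-deaths table (2020-03-27).
deaths = pd.Series({'Italy': 9134, 'US': 1581, 'Germany': 342, 'Belgium': 289})

log_deaths = np.log(deaths)

# Ratio between the largest and smallest value, before and after the transform.
raw_spread = deaths.max() / deaths.min()
log_spread = log_deaths.max() / log_deaths.min()
print(raw_spread, log_spread)

# A country with exactly one death maps to 0 on this scale.
single_death = np.log(1)
```

On the raw scale Italy is more than thirty times Belgium, while on the log scale the largest value is less than twice the smallest.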

Live visualisation with https://app.flourish.studio/visualisation/1714161/edit

In [39]:
HTML('''<div class="flourish-embed flourish-bar-chart-race" data-src="visualisation/1571387"><script src="https://public.flourish.studio/resources/embed.js"></script></div>''')