Covid-19 - EDA

Exploration of Covid-19 datasets through Plotly (or, actually: exploration of Plotly through Covid-19 datasets).

My goal here was to combine several datasets and explore the Covid-19 pandemic as a way to learn a tool that was new to me: Plotly. I tried out its functionality for building interactive visualizations with various graphs and maps. Though I'm not sure it will replace Tableau any time soon, it was interesting to explore.

Dataset: https://github.com/laxmimerit/Covid-19-Preprocessed-Dataset/
Inspiration: https://www.udemy.com/course/complete-data-visualization-in-python

Imports

In [None]:
import plotly as py
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
py.offline.init_notebook_mode(connected=True)

import folium
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import math
import random
from datetime import timedelta
import warnings
warnings.filterwarnings('ignore')

#color palette
cnf = '#393e46'
dth = '#ff2e63'
rec = '#21bf73'
act = '#fe9801'

Datasets

In [None]:
df = pd.read_csv('covid_19_data_cleaned.csv', parse_dates=['Date'])
country_daywise = pd.read_csv('country_daywise.csv', parse_dates=['Date'])
countrywise = pd.read_csv('countrywise.csv')
daywise = pd.read_csv('daywise.csv', parse_dates=['Date'])

In [None]:
df.head()

In [None]:
df['Province/State'] = df['Province/State'].fillna('')

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
df.head()

Grouping to create 3 distinct DataFrames

In [None]:
# select the column before summing so pandas never tries to add up text columns
confirmed = df.groupby('Date')['Confirmed'].sum().reset_index()
recovered = df.groupby('Date')['Recovered'].sum().reset_index()
deaths = df.groupby('Date')['Deaths'].sum().reset_index()
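
To make the grouping step concrete, here is a minimal sketch on a made-up three-row frame. The values are illustrative and not taken from the dataset; only the column names mirror the notebook. Selecting the column before calling `sum()` means text columns such as `Province/State` are never part of the aggregation.

```python
import pandas as pd

# Toy data: hypothetical values, only the column names mirror the notebook.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2020-03-01', '2020-03-01', '2020-03-02']),
    'Province/State': ['', 'Somewhere', ''],
    'Confirmed': [10, 5, 12],
})

# One row per date, summing the chosen column across all rows of that date.
toy_confirmed = toy.groupby('Date')['Confirmed'].sum().reset_index()
print(toy_confirmed)
```

The result has one row per date with the two first-day rows added together, which is the shape the line chart expects.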

In [None]:
df.info()

Let's explore our dataset using Plotly, starting with total confirmed, recovered and death counts over time.

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x = confirmed['Date'], y= confirmed['Confirmed'], name='Confirmed', line=dict(color = 'Orange', width = 2)))
fig.add_trace(go.Scatter(x = recovered['Date'], y= recovered['Recovered'], name='Recovered', line=dict(color = 'Blue', width = 2)))
fig.add_trace(go.Scatter(x = deaths['Date'], y= deaths['Deaths'], name='Deaths', line=dict(color = 'Red', width = 2)))
fig.update_layout(title='Covid-19 Cases Worldwide', xaxis_tickfont_size = 14, yaxis = dict(title='Number of Cases'))
fig.show()

Observations:
- Recovered data stopped being reported after August 4th 2021
- We currently have more than a quarter billion confirmed cases
- Deaths seem to not have dramatically increased, hopefully due to our ability to manage Covid at this stage

On further inspection, it is clear that CSSE at Johns Hopkins University discontinued the Recovered data because they assessed the metric as unreliable globally. See also:
https://github.com/CSSEGISandData/COVID-19/issues/4465

In [None]:
df['Date'] = df['Date'].astype(str) #converting datetime to string for plotly.express

In [None]:
temp = df.groupby('Date')[['Deaths', 'Recovered', 'Active']].sum().reset_index() # a list of columns needs double brackets
temp = temp.melt(id_vars = 'Date', value_vars= ['Active', 'Recovered', 'Deaths'])
fig = px.area(temp, x='Date', y='value', color='variable', height = 600, title = 'Covid-19 Cases Worldwide (rangeslider)', color_discrete_sequence=[act,rec,dth])
fig.update_layout(xaxis_rangeslider_visible=True)
fig.show()
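
The area chart relies on reshaping the data from wide to long format with `melt`, so that a single `variable` column can drive the color. A small sketch with hypothetical numbers shows what that reshaping produces:

```python
import pandas as pd

# Toy wide frame: hypothetical numbers, column names as in the notebook.
wide = pd.DataFrame({
    'Date': ['2020-03-01', '2020-03-02'],
    'Active': [8, 9],
    'Recovered': [1, 2],
    'Deaths': [1, 1],
})

# melt stacks the value columns: one row per (Date, variable) pair.
long = wide.melt(id_vars='Date', value_vars=['Active', 'Recovered', 'Deaths'])
print(long)
```

The rows come out in the order given in `value_vars`, which, as far as I can tell, is why the colors in `[act, rec, dth]` line up with Active, Recovered and Deaths.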

Confirmed Cases & Deaths Worldwide - Line

In [None]:
fig = px.line(country_daywise, x = 'Date', y = 'Confirmed', color = 'Country', height = 600,
             title='Confirmed', color_discrete_sequence = px.colors.sequential.GnBu_r)
fig.update_layout(showlegend=False)
fig.show()

fig = px.line(country_daywise, x = 'Date', y = 'Deaths', color = 'Country', height = 600,
             title='Deaths', color_discrete_sequence = px.colors.sequential.Redor_r)
fig.update_layout(showlegend=False)
fig.show()

Observations:
- US, India and Brazil have many millions of confirmed cases (offset vs. population?)
- Graphs very hard to read, interactivity does little to help
- Local trends hard to estimate

Confirmed Cases Worldwide - Stacked

In [None]:
fig = px.bar(country_daywise, x = 'Date', y = 'Confirmed', color = 'Country', height = 600,
            title='Confirmed', color_discrete_sequence=px.colors.sequential.GnBu_r)
fig.update_layout(showlegend = False)
fig.show()

Observations:
- Beautiful visual
- Interactivity offers more to explore, but static would've worked just fine
- Due to large number of countries the abstraction of colors helps (it's about reading the difference vs. neighbor, not each individual country vs. whole)

Confirmed Cases per Country

In [None]:
gt_1m = country_daywise[country_daywise['Confirmed']>1000000]['Country'].unique()

temp = country_daywise.groupby(['Country', 'Date'])['Confirmed'].sum().reset_index()
temp = temp[temp['Country'].isin(gt_1m)]

countries = temp['Country'].unique()

ncols = 3
nrows = math.ceil(len(countries)/ncols)

fig = make_subplots(rows=nrows, cols = ncols, shared_xaxes= False, subplot_titles=countries)

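for ind, country in enumerate(countries):
    # fill the grid left to right, top to bottom (plotly positions are 1-based)
    row = ind // ncols + 1
    col = ind % ncols + 1
    country_data = temp[temp['Country'] == country] # filter once so x and y stay aligned
    fig.add_trace(go.Bar(x = country_data['Date'], y = country_data['Confirmed'], name = country), row = row, col = col)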
    
fig.update_layout(height=4000, title_text = 'Confirmed Cases in Countries with over 1M Confirmed Cases')
fig.update_layout(showlegend = False)
fig.show()

Observations:
- Separating each country makes it possible to read trends
- Different Y-axes make reading absolute values challenging
- Interactivity offers little value
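
The grid above places each country by turning its index into a 1-based row and column. A minimal sketch of that arithmetic, using a hypothetical helper `grid_position` and a made-up plot count:

```python
import math

def grid_position(index, ncols):
    """Return the 1-based (row, col) of the index-th subplot, filling row by row."""
    row, col = divmod(index, ncols)
    return row + 1, col + 1

n_plots, ncols = 7, 3  # hypothetical count, not the real number of countries
nrows = math.ceil(n_plots / ncols)
positions = [grid_position(i, ncols) for i in range(n_plots)]
print(nrows, positions)
```

With 7 plots in 3 columns this needs 3 rows, and the last row holds a single plot.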

Various Metrics Top 10

In [None]:
top  = 10

fig_cc = px.bar(countrywise.sort_values('Confirmed').tail(top), x = 'Confirmed', y = 'Country',
              text = 'Confirmed', orientation='h', color_discrete_sequence=[act])
fig_dr = px.bar(countrywise.sort_values('Deaths').tail(top), x = 'Deaths', y = 'Country',
              text = 'Deaths', orientation='h', color_discrete_sequence=[dth])

fig_nc = px.bar(countrywise.sort_values('New Cases').tail(top), x = 'New Cases', y = 'Country',
              text = 'New Cases', orientation='h', color_discrete_sequence=['#f04341'])
temp = countrywise[countrywise['Population']>1000000] #temp value 1M
fig_p = px.bar(temp.sort_values('Cases / Million People').tail(top), x = 'Cases / Million People', y = 'Country',
              text = 'Cases / Million People', orientation='h', color_discrete_sequence=['#b40398'])



fig_wc = px.bar(countrywise.sort_values('1 week change').tail(top), x = '1 week change', y = 'Country',
              text = '1 week change', orientation='h', color_discrete_sequence=['#c04041'])
temp = countrywise[countrywise['Confirmed']>100] #set threshold
fig_wi = px.bar(temp.sort_values('1 week % increase').tail(top), x = '1 week % increase', y = 'Country',
              text = '1 week % increase', orientation='h', color_discrete_sequence=['#b2327d'])


fig = make_subplots(rows = 3, cols = 2, shared_xaxes=False, horizontal_spacing=0.2, vertical_spacing=.05,
                   subplot_titles=('Confirmed Cases', 'Deaths Reported',
                                   'New Cases', 'Cases / Million People',
                                   '1 week change', '1 week % increase'))

fig.add_trace(fig_cc['data'][0], row = 1, col = 1)
fig.add_trace(fig_dr['data'][0], row = 1, col = 2)

fig.add_trace(fig_nc['data'][0], row = 2, col = 1)
fig.add_trace(fig_p['data'][0], row = 2, col = 2)

fig.add_trace(fig_wc['data'][0], row = 3, col = 1)
fig.add_trace(fig_wi['data'][0], row = 3, col = 2)

fig.update_layout(height = 1500)
fig.show()


Observations:
- Clear and easy to read - useful
- Interactivity offers nothing in these graphs - values suffice
- US tops the charts in many metrics, but in cases per 1M it's the lowest (US is a big country). This is also apparent from the 1 week change graphs, where the US doesn't even show up in % increase, though it has the highest total new cases.
- If your country's name ends in 'ia', you'll have a lot of cases per 1M (obvious correlation)
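
The gap between absolute and relative metrics is easier to see with numbers. The sketch below uses two made-up countries and assumed definitions for the three derived columns. The dataset's own derivation may differ, and `Confirmed last week` is a hypothetical column introduced only for this illustration.

```python
import pandas as pd

# Hypothetical figures for two made-up countries.
toy = pd.DataFrame({
    'Country': ['Big', 'Small'],
    'Population': [300_000_000, 2_000_000],
    'Confirmed': [3_000_000, 100_000],
    'Confirmed last week': [2_700_000, 50_000],  # hypothetical helper column
})

# Assumed definitions; the dataset's own derivation may differ.
toy['Cases / Million People'] = toy['Confirmed'] / toy['Population'] * 1_000_000
toy['1 week change'] = toy['Confirmed'] - toy['Confirmed last week']
toy['1 week % increase'] = toy['1 week change'] / toy['Confirmed last week'] * 100
print(toy)
```

Here 'Big' leads on the absolute weekly change, while 'Small' leads on both cases per million and percentage increase, the same pattern as in the charts above.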

Mapping

In [None]:
# log1p (log(1 + x)) keeps the color value finite for countries that still have 0 cases
fig = px.choropleth(country_daywise, locations= 'Country', locationmode='country names', color = np.log1p(country_daywise['Confirmed']),
                   hover_name = 'Country', animation_frame=country_daywise['Date'].dt.strftime('%Y-%m-%d'),
                   title='Cases over time', color_continuous_scale=px.colors.sequential.GnBu)

fig.update(layout_coloraxis_showscale = True)
fig.show()
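
The map colors countries by the logarithm of confirmed cases so that a few very large counts don't wash out the rest. A small sketch with hypothetical counts shows how the plain logarithm and `np.log1p` differ when a count is zero:

```python
import numpy as np

counts = np.array([0, 9, 99, 9999])  # hypothetical case counts

with np.errstate(divide='ignore'):
    plain = np.log(counts)       # log(0) evaluates to -inf
shifted = np.log1p(counts)       # log(1 + x), which is 0 at x = 0

print(plain)
print(shifted)
```

Early frames of the animation are where zero counts are most likely, so that is where the two transforms would differ.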

Scatterplot

In [None]:
top = 50
fig = px.scatter(countrywise.sort_values('Deaths', ascending = False).head(top), 
                x = 'Confirmed', y = 'Deaths', color = 'Country', size = 'Confirmed', height = 600,
                text = 'Country', size_max=50, log_x = True, log_y = True,
                title='Deaths vs Confirmed Cases (both axes on log10 scale)')

fig.update_traces(textposition = 'top center')
fig.update_layout(showlegend = False, font_size = 8)
fig.show()

Observations:
- Correlation between confirmed and deaths
- Countries towards bottom-right efficient at dealing with covid (Austria, Sweden, Morocco)
- Countries top-left not so much (Mexico, Peru, Indonesia)
- Would be interesting to offset with population sizes/healthcare
- Interactivity adds value to visual
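
On a log-log plot, a country's position relative to the diagonal reflects its ratio of deaths to confirmed cases. A rough sketch of that ratio follows, with entirely hypothetical figures; `deaths_per_100_cases` is an illustrative helper, not something from the dataset.

```python
# Hypothetical figures, purely to illustrate the ratio.
countries = {
    'Country X': {'confirmed': 1_000_000, 'deaths': 10_000},
    'Country Y': {'confirmed': 200_000, 'deaths': 10_000},
}

def deaths_per_100_cases(confirmed, deaths):
    """Deaths as a percentage of confirmed cases (a crude case fatality ratio)."""
    return deaths / confirmed * 100

ratios = {
    name: deaths_per_100_cases(row['confirmed'], row['deaths'])
    for name, row in countries.items()
}
print(ratios)
```

Both made-up countries report the same number of deaths, but 'Country Y' would sit further towards the top-left because its ratio is five times higher. Testing rates and reporting practices differ between countries, so a ratio like this is only a starting point.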

Conclusion

The use case for Plotly is not immediately obvious to me when more powerful and user-friendly visualization tools such as Tableau exist. Its direct integration into the Python ecosystem is great, but ultimately not a redeeming factor.