# Introduction

- The aim of the project is to visualize the spread of the COVID-19 virus across the world.

- The project uses the matplotlib, seaborn and plotly libraries for visualization.

- It also attempts to devise different methods and measures for looking at the pandemic data.

- The dataset used is the [Novel Corona Virus 2019 Dataset](https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset) by SRK from Kaggle.

- Special thanks to [S Revanth Shalon Raj](https://github.com/Revanthshalon) for his valuable contributions and [SRK](https://www.kaggle.com/sudalairajkumar) for the dataset.

# Libraries

In [0]:
# basic libraries
import numpy as np
import pandas as pd

In [2]:
# visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns



In [0]:
# plotly visualization
import plotly as py
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs

In [0]:
%matplotlib inline

# Dataset

There are multiple datasets within the folder. The <i>'covid_19_data.csv'</i> is used.

In [0]:
cov = pd.read_csv('covid_19_data.csv')

# Data Cleaning

Minor EDA and data cleaning.

In [6]:
cov.columns

Index(['SNo', 'ObservationDate', 'Province/State', 'Country/Region',
       'Last Update', 'Confirmed', 'Deaths', 'Recovered'],
      dtype='object')

In [7]:
cov.head()

| | SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01/22/2020 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 01/22/2020 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | 0.0 | 0.0 |
| 2 | 3 | 01/22/2020 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | 0.0 | 0.0 |
| 3 | 4 | 01/22/2020 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 4 | 5 | 01/22/2020 | Gansu | Mainland China | 1/22/2020 17:00 | 0.0 | 0.0 | 0.0 |


In [8]:
cov.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25959 entries, 0 to 25958
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   SNo              25959 non-null  int64  
 1   ObservationDate  25959 non-null  object 
 2   Province/State   12490 non-null  object 
 3   Country/Region   25959 non-null  object 
 4   Last Update      25959 non-null  object 
 5   Confirmed        25959 non-null  float64
 6   Deaths           25959 non-null  float64
 7   Recovered        25959 non-null  float64
dtypes: float64(3), int64(1), object(4)
memory usage: 1.6+ MB


**Observations**

From the initial bird's-eye view, we can say that:

1. The 'SNo' column is redundant for analysis, as it only serves as an index.

2. The 'ObservationDate' and 'Last Update' columns hold dates stored as strings. Converting them to datetime would represent them better, but it is not required here because plotly can use the strings directly.

3. 'Province/State' and 'Country/Region' are categorical in nature and can stay as the object data type.

4. 'Confirmed', 'Deaths' and 'Recovered' are of float data type. These columns hold the cumulative counts up to the last updated date, so they can be converted to int.
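
Before casting the count columns to int as point 4 suggests, it can help to confirm that the floats are whole numbers with no nulls, since `astype(int)` raises on NaN and truncates fractions. A minimal sketch on an invented frame (the `sample` data and the `is_safe_int_cast` helper are illustrative and not part of the notebook):

```python
import pandas as pd

# Toy frame standing in for the count columns of covid_19_data.csv (values invented).
sample = pd.DataFrame({
    'Confirmed': [1.0, 14.0, 6.0],
    'Deaths': [0.0, 0.0, 0.0],
    'Recovered': [0.0, 0.0, 0.0],
})

def is_safe_int_cast(series):
    """Return True when every value is non-null and has no fractional part."""
    return bool(series.notna().all() and (series % 1 == 0).all())

count_cols = ['Confirmed', 'Deaths', 'Recovered']
safe = {col: is_safe_int_cast(sample[col]) for col in count_cols}

# Cast only once every column has passed the check.
if all(safe.values()):
    sample = sample.astype({col: int for col in count_cols})
```

If any column fails the check, the offending rows can be inspected before deciding how to handle them.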

In [0]:
# dropping the 'SNo' column
cov.drop('SNo', axis=1,inplace=True)

In [0]:
# changing data type to datetime
# cov['ObservationDate'] = pd.to_datetime(cov['ObservationDate'])
# cov['Last Update'] = pd.to_datetime(cov['Last Update'])

In [0]:
# changing data type to int
cov['Confirmed'] = cov['Confirmed'].astype(int)
cov['Deaths'] = cov['Deaths'].astype(int)
cov['Recovered'] = cov['Recovered'].astype(int)

In [12]:
# Checking the total null values
cov.isna().sum()

# There are a lot of null values in the 'Province/State' column.
# Most state-level details are unavailable, but the country-level data is complete.
# The nulls can be replaced by 'Missing'.

ObservationDate        0
Province/State     13469
Country/Region         0
Last Update            0
Confirmed              0
Deaths                 0
Recovered              0
dtype: int64

In [0]:
# replacing the missing values in 'Province/State'
# assigning the result back avoids relying on an in-place fill of a selected column
cov['Province/State'] = cov['Province/State'].fillna('Missing')

In [14]:
len(cov['Country/Region']), len(cov['Country/Region'].unique())

(25959, 223)

In [15]:
cov['Country/Region'].unique()

# There are many names of countries that are repeating or are written in a different format.
# e.g. :
#      *'Republic of Ireland'             == 'Ireland'
#      *'occupied Palestinian territory'  == 'West Bank and Gaza'
#      *"('St. Martin',)"                 == 'St. Martin'
#      *'Republic of the Congo'           == 'Congo (Brazzaville)'
#      *'The Gambia'                      == 'Gambia'
#      *'Gambia, The'                     == 'Gambia'
#      *'Bahamas, The'                    == 'Bahamas'
#      *'The Bahamas'                     == 'Bahamas'

array(['Mainland China', 'Hong Kong', 'Macau', 'Taiwan', 'US', 'Japan',
       'Thailand', 'South Korea', 'Singapore', 'Philippines', 'Malaysia',
       'Vietnam', 'Australia', 'Mexico', 'Brazil', 'Colombia', 'France',
       'Nepal', 'Canada', 'Cambodia', 'Sri Lanka', 'Ivory Coast',
       'Germany', 'Finland', 'United Arab Emirates', 'India', 'Italy',
       'UK', 'Russia', 'Sweden', 'Spain', 'Belgium', 'Others', 'Egypt',
       'Iran', 'Israel', 'Lebanon', 'Iraq', 'Oman', 'Afghanistan',
       'Bahrain', 'Kuwait', 'Austria', 'Algeria', 'Croatia',
       'Switzerland', 'Pakistan', 'Georgia', 'Greece', 'North Macedonia',
       'Norway', 'Romania', 'Denmark', 'Estonia', 'Netherlands',
       'San Marino', ' Azerbaijan', 'Belarus', 'Iceland', 'Lithuania',
       'New Zealand', 'Nigeria', 'North Ireland', 'Ireland', 'Luxembourg',
       'Monaco', 'Qatar', 'Ecuador', 'Azerbaijan', 'Czech Republic',
       'Armenia', 'Dominican Republic', 'Indonesia', 'Portugal',
       'Andorra', 'Latvia

In [0]:
# Correcting the names: map each variant spelling to a single name
cov['Country/Region'] = cov['Country/Region'].replace({
    'Republic of Ireland': 'Ireland',
    'occupied Palestinian territory': 'West Bank and Gaza',
    "('St. Martin',)": 'St. Martin',
    'Republic of the Congo': 'Congo (Brazzaville)',
    'The Gambia': 'Gambia',
    'Gambia, The': 'Gambia',
    'Bahamas, The': 'Bahamas',
    'The Bahamas': 'Bahamas',
})

In [17]:
# checking the total number of unique 'country/region' names
len(cov['Country/Region'].unique())

215
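
The `unique()` output above also lists `' Azerbaijan'` with a leading space alongside `'Azerbaijan'`, which an exact-match replacement does not catch. One possible way to handle both kinds of duplicate, sketched on a small invented sample (`NAME_FIXES` and `normalize_country` are hypothetical names; the mapping reuses a few of the pairs listed above):

```python
import pandas as pd

# A subset of the variant-to-canonical pairs noted in the notebook.
NAME_FIXES = {
    'Republic of Ireland': 'Ireland',
    'The Gambia': 'Gambia',
    'Gambia, The': 'Gambia',
    'Bahamas, The': 'Bahamas',
    'The Bahamas': 'Bahamas',
}

def normalize_country(names):
    """Strip surrounding whitespace, then map known variants to one spelling."""
    return names.str.strip().replace(NAME_FIXES)

raw = pd.Series([' Azerbaijan', 'Azerbaijan', 'The Gambia', 'Gambia, The', 'Ireland'])
clean = normalize_country(raw)
```

Stripping first means the mapping only has to list genuinely different spellings, not whitespace variants.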

In [18]:
cov['ObservationDate'].unique()[-1]

# Finding the last observation date
# this can be used further as a mask to find the latest data

'05/17/2020'
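
Taking the last element of `unique()` relies on the file being ordered by date. A sketch of an order-independent alternative, assuming the `MM/DD/YYYY` format seen in `cov.head()` (the `dates` sample is invented and deliberately shuffled):

```python
import pandas as pd

# Shuffled observation dates in the dataset's MM/DD/YYYY string format.
dates = pd.Series(['05/17/2020', '01/22/2020', '12/01/2019', '03/05/2020'])

# Parse with an explicit format so the comparison is chronological, not textual.
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
latest = parsed.max()

# Convert back to the original string form so it can still be used as a mask.
latest_str = latest.strftime('%m/%d/%Y')
```

Comparing the raw strings would rank `'12/01/2019'` above every 2020 date here, which is why the sketch parses before taking the maximum.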

# EDA

## Basic

### Finding the countries with the most confirmed cases.

In [19]:
# checking for the most confirmed cases around the globe.
# grouping the data masking for most recent to find out the top 20 countries.
cov[
    cov['ObservationDate'] == cov['ObservationDate'].unique()[-1]  # masking to find most recent data
    ].groupby(
        'Country/Region'                                           # grouping the data by 'Country/Region'
        )[['Confirmed', 'Deaths', 'Recovered']].sum().sort_values( # summing only the count columns
            'Confirmed', ascending=False                           # sorting the values in descending by 'Confirmed' 
                ).head(20).style.background_gradient(
                    cmap='Blues'                                   # choosing a color gradient for the table
                        ).set_table_styles([
                              {'selector':'th',                    # choosing font size and weight for table header
                               'props':[('font-size','12px'),
                                        ('font-weight','bold')]},
                              {'selector':'td',                    # choosing font size and weight for table data
                               'props':[('font-size','11px'),
                                        ('font-weight','normal')]}
                               ]
                               )

| Country/Region | Confirmed | Deaths | Recovered |
|---|---|---|---|
| US | 1486757 | 89562 | 272265 |
| Russia | 281752 | 2631 | 67373 |
| UK | 244995 | 34716 | 1058 |
| Brazil | 241080 | 16118 | 94122 |
| Spain | 230698 | 27563 | 146446 |
| Italy | 225435 | 31908 | 125176 |
| France | 179693 | 28111 | 61327 |
| Germany | 176369 | 7962 | 154011 |
| Turkey | 149435 | 4140 | 109962 |
| Iran | 120198 | 6988 | 94464 |


In [20]:
# In the previous table, USA dominates by a huge margin making it difficult to analyze the data.
# Considering USA as an outlier.
# grouping the data masking for most recent.
cov[
    (cov['ObservationDate'] == cov['ObservationDate'].unique()[-1]) & # masking to find most recent data
    (cov['Country/Region'] != 'US')].groupby(                         # masking to remove US
        'Country/Region'                                              # grouping the data by 'Country/Region'
        )[['Confirmed', 'Deaths', 'Recovered']].sum().sort_values(    # summing only the count columns
            'Confirmed', ascending=False                              # sorting the values in descending by 'Confirmed' 
                ).head(20).style.background_gradient(
                    cmap='Blues'                                      # choosing a color gradient for the table
                        ).set_table_styles([
                              {'selector':'th',                       # choosing font size and weight for table header
                               'props':[('font-size','12px'),
                                        ('font-weight','bold')]},
                              {'selector':'td',                       # choosing font size and weight for table data
                               'props':[('font-size','11px'),
                                        ('font-weight','normal')]}
                               ]
                               )

| Country/Region | Confirmed | Deaths | Recovered |
|---|---|---|---|
| Russia | 281752 | 2631 | 67373 |
| UK | 244995 | 34716 | 1058 |
| Brazil | 241080 | 16118 | 94122 |
| Spain | 230698 | 27563 | 146446 |
| Italy | 225435 | 31908 | 125176 |
| France | 179693 | 28111 | 61327 |
| Germany | 176369 | 7962 | 154011 |
| Turkey | 149435 | 4140 | 109962 |
| Iran | 120198 | 6988 | 94464 |
| India | 95698 | 3025 | 36795 |


**Observation**

- <font color='cyan'>USA</font> dominates the globe in the sheer number of confirmed cases by a huge margin.

- It is followed by <font color='cyan'>Russia</font> and <font color='cyan'>UK</font> at $2^{nd}$ <i>&</i> $3^{rd}$ respectively.

- <font color='cyan'>UK</font>, <font color='cyan'>Brazil</font>, <font color='cyan'>Spain</font> and <font color='cyan'>Italy</font> are close in number of confirmed cases.

- <font color='cyan'>France</font> and <font color='cyan'>Germany</font> have comparable numbers.

- <font color='cyan'>India</font> too is in the top 20.
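
The filter, group and sort pipeline behind these tables recurs throughout this section. One way to package it, as a sketch only (`top_countries` is a hypothetical helper, and the toy rows are invented rather than taken from the dataset):

```python
import pandas as pd

COUNT_COLS = ['Confirmed', 'Deaths', 'Recovered']

def top_countries(df, by, n=20, exclude=()):
    """Return the n countries with the largest `by` total on the latest date."""
    latest = df['ObservationDate'].unique()[-1]   # same "last date" rule as the notebook
    mask = (df['ObservationDate'] == latest) & ~df['Country/Region'].isin(list(exclude))
    totals = df.loc[mask].groupby('Country/Region')[COUNT_COLS].sum()
    return totals.sort_values(by, ascending=False).head(n)

# Country 'A' is split over two provinces on the latest date.
toy = pd.DataFrame({
    'ObservationDate': ['05/16/2020', '05/17/2020', '05/17/2020', '05/17/2020'],
    'Country/Region': ['A', 'A', 'A', 'B'],
    'Confirmed': [5, 10, 20, 25],
    'Deaths': [0, 1, 2, 4],
    'Recovered': [1, 2, 3, 5],
})

with_all = top_countries(toy, 'Confirmed')
without_a = top_countries(toy, 'Confirmed', exclude=['A'])
```

The returned frame can be styled as before, for example with `.style.background_gradient(cmap='Blues')`.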

### Finding the countries with the most recovered.

In [21]:
# checking for the most recovered around the globe.
# grouping the data masking for most recent to find out the top 20 countries with most recovered
cov[
    cov['ObservationDate'] == cov['ObservationDate'].unique()[-1]  # masking to find most recent data
    ].groupby(
        'Country/Region'                                           # grouping the data by 'Country/Region'
        )[['Confirmed', 'Deaths', 'Recovered']].sum().sort_values( # summing only the count columns
            'Recovered', ascending=False                           # sorting the values in descending by 'Recovered' 
                ).head(20).style.background_gradient(
                    cmap='Greens'                                  # choosing a color gradient for the table
                        ).set_table_styles([
                              {'selector':'th',                    # choosing font size and weight for table header
                               'props':[('font-size','12px'),
                                        ('font-weight','bold')]},
                              {'selector':'td',                    # choosing font size and weight for table data
                               'props':[('font-size','11px'),
                                        ('font-weight','normal')]}
                               ]
                               )

| Country/Region | Confirmed | Deaths | Recovered |
|---|---|---|---|
| US | 1486757 | 89562 | 272265 |
| Germany | 176369 | 7962 | 154011 |
| Spain | 230698 | 27563 | 146446 |
| Italy | 225435 | 31908 | 125176 |
| Turkey | 149435 | 4140 | 109962 |
| Iran | 120198 | 6988 | 94464 |
| Brazil | 241080 | 16118 | 94122 |
| Mainland China | 82954 | 4634 | 78238 |
| Russia | 281752 | 2631 | 67373 |
| France | 179693 | 28111 | 61327 |


In [22]:
# In the previous table, USA dominates by a huge margin making it difficult to analyze the data.
# Considering USA as an outlier.
# grouping the data masking for most recent.
cov[
    (cov['ObservationDate'] == cov['ObservationDate'].unique()[-1]) & # masking to find most recent data
    (cov['Country/Region'] != 'US')].groupby(                         # masking to remove US
        'Country/Region'                                              # grouping the data by 'Country/Region'
        )[['Confirmed', 'Deaths', 'Recovered']].sum().sort_values(    # summing only the count columns
            'Recovered', ascending=False                              # sorting the values in descending 
                ).head(20).style.background_gradient(
                    cmap='Greens'                                     # choosing a color gradient for the table
                        ).set_table_styles([
                              {'selector':'th',                       # choosing font size and weight for table header
                               'props':[('font-size','12px'),
                                        ('font-weight','bold')]},
                              {'selector':'td',                       # choosing font size and weight for table data
                               'props':[('font-size','11px'),
                                        ('font-weight','normal')]}
                               ]
                               )

| Country/Region | Confirmed | Deaths | Recovered |
|---|---|---|---|
| Germany | 176369 | 7962 | 154011 |
| Spain | 230698 | 27563 | 146446 |
| Italy | 225435 | 31908 | 125176 |
| Turkey | 149435 | 4140 | 109962 |
| Iran | 120198 | 6988 | 94464 |
| Brazil | 241080 | 16118 | 94122 |
| Mainland China | 82954 | 4634 | 78238 |
| Russia | 281752 | 2631 | 67373 |
| France | 179693 | 28111 | 61327 |
| Canada | 78332 | 5903 | 38563 |


**Observation**

- <font color='green'>USA</font> has the most recovered followed by <font color='green'>Germany</font> and <font color='green'>Spain</font>.

- The country with the $2^{nd}$ most recoveries (Germany, 154,011) has a little over half as many as the one ranking $1^{st}$ (USA, 272,265).
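
The gap between first and second place can be quantified directly from the table above. A small sketch using the Recovered totals shown for the US, Germany and Spain:

```python
# Recovered totals copied from the table above (05/17/2020).
recovered = {'US': 272265, 'Germany': 154011, 'Spain': 146446}

# Rank countries by recovered count, largest first.
ranked = sorted(recovered.items(), key=lambda item: item[1], reverse=True)
(first_name, first_val), (second_name, second_val) = ranked[0], ranked[1]

# Share of the leader's total reached by the runner-up.
second_to_first = second_val / first_val
print(f"{second_name} / {first_name} = {second_to_first:.2f}")
```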

### Finding the countries with the highest fatalities.

In [23]:
# checking for the highest fatalities around the globe.
# grouping the data masking for most recent to find out the top 20 countries.
cov[
    cov['ObservationDate'] == cov['ObservationDate'].unique()[-1]  # masking to find most recent data
    ].groupby(
        'Country/Region'                                           # grouping the data by 'Country/Region'
        )[['Confirmed', 'Deaths', 'Recovered']].sum().sort_values( # summing only the count columns
            'Deaths', ascending=False                              # sorting the values in descending by 'Deaths' 
                ).head(20).style.background_gradient(
                    cmap='Reds'                                    # choosing a color gradient for the table
                        ).set_table_styles([
                              {'selector':'th',                    # choosing font size and weight for table header
                               'props':[('font-size','12px'),
                                        ('font-weight','bold')]},
                              {'selector':'td',                    # choosing font size and weight for table data
                               'props':[('font-size','11px'),
                                        ('font-weight','normal')]}
                               ]
                               )

| Country/Region | Confirmed | Deaths | Recovered |
|---|---|---|---|
| US | 1486757 | 89562 | 272265 |
| UK | 244995 | 34716 | 1058 |
| Italy | 225435 | 31908 | 125176 |
| France | 179693 | 28111 | 61327 |
| Spain | 230698 | 27563 | 146446 |
| Brazil | 241080 | 16118 | 94122 |
| Belgium | 55280 | 9052 | 14630 |
| Germany | 176369 | 7962 | 154011 |
| Iran | 120198 | 6988 | 94464 |
| Canada | 78332 | 5903 | 38563 |


In [24]:
# In the previous table, USA dominates by a huge margin making it difficult to analyze the data.
# Considering USA as an outlier.
# grouping the data masking for most recent.
cov[
    (cov['ObservationDate'] == cov['ObservationDate'].unique()[-1]) & # masking to find most recent data
    (cov['Country/Region'] != 'US')].groupby(                         # masking to remove US
        'Country/Region'                                              # grouping the data by 'Country/Region'
        )[['Confirmed', 'Deaths', 'Recovered']].sum().sort_values(    # summing only the count columns
            'Deaths', ascending=False                                 # sorting the values in descending 
                ).head(20).style.background_gradient(
                    cmap='Reds'                                       # choosing a color gradient for the table
                        ).set_table_styles([
                              {'selector':'th',                       # choosing font size and weight for table header
                               'props':[('font-size','12px'),
                                        ('font-weight','bold')]},
                              {'selector':'td',                       # choosing font size and weight for table data
                               'props':[('font-size','11px'),
                                        ('font-weight','normal')]}
                               ]
                               )

| Country/Region | Confirmed | Deaths | Recovered |
|---|---|---|---|
| UK | 244995 | 34716 | 1058 |
| Italy | 225435 | 31908 | 125176 |
| France | 179693 | 28111 | 61327 |
| Spain | 230698 | 27563 | 146446 |
| Brazil | 241080 | 16118 | 94122 |
| Belgium | 55280 | 9052 | 14630 |
| Germany | 176369 | 7962 | 154011 |
| Iran | 120198 | 6988 | 94464 |
| Canada | 78332 | 5903 | 38563 |
| Netherlands | 44195 | 5699 | 167 |


**Observations**

- <font color='red'>USA</font> has roughly 2.6 times the fatalities of the next nation, the <font color='red'>UK</font> (89,562 vs. 34,716).

- <font color='red'>UK</font>, <font color='red'>Italy</font>, <font color='red'>France</font> & <font color='red'>Spain</font> have the highest death tolls from COVID-19 after the <font color='red'>USA</font>.
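
Raw death counts scale with the size of each outbreak. One possible complementary measure, offered here as an assumption rather than something the notebook computes, is deaths as a share of confirmed cases. A sketch using rows from the fatalities table above (the `DeathsPer100Confirmed` column name is illustrative):

```python
import pandas as pd

# Rows copied from the fatalities table above (05/17/2020).
top = pd.DataFrame(
    {'Confirmed': [1486757, 244995, 225435, 179693, 230698],
     'Deaths': [89562, 34716, 31908, 28111, 27563]},
    index=['US', 'UK', 'Italy', 'France', 'Spain'],
)

# Deaths per 100 confirmed cases: a crude ratio that ignores
# testing coverage and reporting lag.
top['DeathsPer100Confirmed'] = (100 * top['Deaths'] / top['Confirmed']).round(2)
ranked = top.sort_values('DeathsPer100Confirmed', ascending=False)
```

On this measure the ordering differs from the raw counts, so the two views are best read side by side.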

## Visualization

In [0]:
# creating a dataframe for the choropleth.
# grouping by the country and date and sorting by the date gives us the last date of a case at the top.
# then we drop all the 'duplicates' of countries so only the last date for that country remains

crnt_nmbrs = cov.groupby([
                  'Country/Region',                       # grouped by country and date.
                  'ObservationDate'
                  ])[['Confirmed', 'Deaths', 'Recovered']].sum().reset_index().sort_values(
                      'ObservationDate', ascending=False  # sorting by date.
                      ).drop_duplicates(
                          subset=['Country/Region']       # dropping all the rows with the same country name.
                          )

In [0]:
# creating the above dataframe masking out USA
cn_wo_usa = crnt_nmbrs[crnt_nmbrs['Country/Region']!='US']
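
The sort-then-`drop_duplicates` step keeps the first row it meets for each country, which is the latest one because of the descending sort. A toy illustration of the same idea (the `toy` data are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    'Country/Region': ['A', 'A', 'B', 'B', 'B'],
    'ObservationDate': ['05/16/2020', '05/17/2020', '05/15/2020', '05/16/2020', '05/14/2020'],
    'Confirmed': [10, 12, 3, 5, 1],
})

latest_rows = (
    toy.sort_values('ObservationDate', ascending=False)  # newest dates first
       .drop_duplicates(subset=['Country/Region'])        # keep the first (newest) row per country
       .set_index('Country/Region')
)
```

Because the dates are strings, this ordering is only chronological while every date uses the zero-padded `MM/DD/YYYY` format and falls in the same year. Note also that each country keeps its own latest date, which need not be the same for all countries.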

In [67]:
# Creating a choropleth for the confirmed cases.
fig = go.Figure(data=go.Choropleth(
    locations = crnt_nmbrs['Country/Region'],
    locationmode = 'country names',
    z = crnt_nmbrs['Confirmed'],
    colorscale = 'spectral',
    marker_line_color = 'black',
    marker_line_width = 0.5
), layout=go.Layout(
    title=go.layout.Title(
        text='Global Confirmed',
        x=0.5))
)

fig.show()

In [68]:
# masking out USA to show the disparities in the other nations

fig = go.Figure(data=go.Choropleth(
    locations = cn_wo_usa['Country/Region'],
    locationmode = 'country names',
    z = cn_wo_usa['Confirmed'],
    colorscale = 'spectral',
    marker_line_color = 'black',
    marker_line_width = 0.5
), layout=go.Layout(
    title=go.layout.Title(
        text='Global Confirmed excl. USA',
        x=0.5))
)

fig.show()

In [57]:
# Creating a choropleth for the recovered cases.
fig = go.Figure(data=go.Choropleth(
    locations = crnt_nmbrs['Country/Region'],
    locationmode = 'country names',
    z = crnt_nmbrs['Recovered'],
    colorscale = 'spectral',
    marker_line_color = 'black',
    marker_line_width = 0.5
), layout=go.Layout(
    title=go.layout.Title(
        text='Global Recovered',
        x=0.5))
)

fig.show()

In [69]:
# masking out USA to show the disparities in the other nations

fig = go.Figure(data=go.Choropleth(
    locations = cn_wo_usa['Country/Region'],
    locationmode = 'country names',
    z = cn_wo_usa['Recovered'],
    colorscale = 'spectral',
    marker_line_color = 'black',
    marker_line_width = 0.5
), layout=go.Layout(
    title=go.layout.Title(
        text='Global Recovered excl. USA',
        x=0.5))
)

fig.show()

In [59]:
# Creating a choropleth for the fatalities.
fig = go.Figure(data=go.Choropleth(
    locations = crnt_nmbrs['Country/Region'],
    locationmode = 'country names',
    z = crnt_nmbrs['Deaths'],
    colorscale = 'spectral',
    marker_line_color = 'black',
    marker_line_width = 0.5
), layout=go.Layout(
    title=go.layout.Title(
        text='Global Fatalities',
        x=0.5))
)

fig.show()

In [71]:
# masking out USA to show the disparities in the other nations
fig = go.Figure(data=go.Choropleth(
    locations = cn_wo_usa['Country/Region'],
    locationmode = 'country names',
    z = cn_wo_usa['Deaths'],
    colorscale = 'spectral',
    marker_line_color = 'black',
    marker_line_width = 0.5
), layout=go.Layout(
    title=go.layout.Title(
        text='Global Fatalities excl. USA',
        x=0.5))
)

fig.show()

In [0]:
# creating the dataframe for the timelapse choropleth
df_tl = cov.groupby([
             'ObservationDate',      # grouping by observation date and country
             'Country/Region'
             ])[['Confirmed', 'Deaths', 'Recovered']].sum().reset_index()

In [72]:
fig = px.choropleth(df_tl,
                     locations='Country/Region',
                     locationmode='country names',
                     color='Confirmed',
                     hover_name='Country/Region',
                     animation_frame='ObservationDate'
                     )

fig.update_layout(
    title_text = 'COVID 19 Spread: Timelapse',
    title_x=0.5,
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    )
)
fig.show()

In [73]:
# masking out USA
fig = px.choropleth(df_tl[df_tl['Country/Region']!='US'],
                     locations='Country/Region',
                     locationmode='country names',
                     color='Confirmed',
                     hover_name='Country/Region',
                     animation_frame='ObservationDate'
                     )

fig.update_layout(
    title_text = 'COVID 19 Spread: Timelapse excl. USA',
    title_x=0.5,
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    )
)
fig.show()
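
An alternative to dropping the US is to color by a log-transformed count, so that one very large value does not flatten the rest of the scale. This is a suggestion rather than something the notebook does; a sketch with NumPy, using Confirmed totals from the tables above (the `log_confirmed` name is illustrative):

```python
import numpy as np
import pandas as pd

# Confirmed totals copied from the tables above (05/17/2020).
confirmed = pd.Series({'US': 1486757, 'Russia': 281752, 'India': 95698})

# log10(x + 1) keeps zero counts defined and compresses the range.
log_confirmed = np.log10(confirmed + 1)

# Ratio between the largest and smallest value, before and after the transform.
spread_raw = confirmed.max() / confirmed.min()
spread_log = log_confirmed.max() / log_confirmed.min()
```

A column built this way could be passed to `px.choropleth` through `color=`, with the raw count kept in the hover data for readability.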