# Analysis of Coronavirus Disease (COVID-19) Outbreak in Python

The outbreak originated in Wuhan, Hubei Province, China, in early December 2019. It is caused by a novel coronavirus, 
Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), and the resulting disease has been named COVID-19 by the 
World Health Organization (WHO). On January 30, 2020, the WHO declared the COVID-19 outbreak a Public Health Emergency of International Concern. 

The novel virus is transmitted from person to person principally by respiratory droplets, causing such symptoms as fever, 
cough, and shortness of breath after a period believed to range from 2 to 14 days following infection, according to 
the Centers for Disease Control and Prevention (CDC).

In an outbreak of an infectious disease, it is important to study not only the number of deaths 
but also the rate at which that number is growing.
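
To make the idea of a growth rate concrete, here is a minimal sketch that derives new deaths per day and the day-over-day growth rate from a cumulative count. The five values are made up for illustration and are not taken from the dataset; a single-day snapshot like the one analysed here has no time dimension, so a real calculation would need a time series.

```python
import pandas as pd

# Illustrative cumulative death counts for five consecutive days.
# These numbers are invented for the example, not taken from the dataset.
cumulative_deaths = pd.Series(
    [100, 120, 150, 180, 225],
    index=pd.date_range("2020-03-01", periods=5, freq="D"),
    name="Deaths",
)

# New deaths per day: difference between consecutive cumulative totals.
daily_new = cumulative_deaths.diff()

# Day-over-day growth rate of the cumulative total (0.20 means +20%).
growth_rate = cumulative_deaths.pct_change()

print(pd.DataFrame({
    "cumulative": cumulative_deaths,
    "new": daily_new,
    "growth_rate": growth_rate,
}))
```

The first day has no predecessor, so both derived columns start with a missing value.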

Web link to the dataset: https://github.com/CSSEGISandData/COVID-19.git 
(Provided by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE))

# Import the Dataset as of 16 March 2020

In [40]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [41]:
# Creating the dataframe 
df_covid1 = pd.read_csv("covid19.csv") 

# Print the dataframe 
df_covid1

,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude
0,Hubei,China,2020-03-16T14:38:45,67798,3099,55142,30.9756,112.2707
1,,Italy,2020-03-16T17:33:03,27980,2158,2749,41.8719,12.5674
2,,Iran,2020-03-16T14:38:45,14991,853,4590,32.4279,53.6880
3,,Spain,2020-03-16T20:13:11,9942,342,530,40.4637,-3.7492
4,,"Korea, South",2020-03-16T14:38:45,8236,75,1137,35.9078,127.7669
...,...,...,...,...,...,...,...,...
267,Cayman Islands,United Kingdom,2020-03-16T14:53:04,1,1,0,19.3133,-81.2546
268,Gibraltar,United Kingdom,2020-03-14T16:33:03,1,0,1,36.1408,-5.3536
269,From Diamond Princess,Australia,2020-03-14T02:33:04,0,0,0,35.4437,139.6380
270,West Virginia,US,2020-03-10T02:33:04,0,0,0,38.4912,-80.9545
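
As a possible alternative to converting the timestamp column later, `pd.read_csv` can parse it while loading. The sketch below inlines three rows from the table above so it runs without the `covid19.csv` file; `sample_csv` and `df_sample` are names introduced only for this example.

```python
import io
import pandas as pd

# Three rows copied from the table above, inlined so the example is self-contained.
sample_csv = io.StringIO(
    "Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude\n"
    "Hubei,China,2020-03-16T14:38:45,67798,3099,55142,30.9756,112.2707\n"
    ",Italy,2020-03-16T17:33:03,27980,2158,2749,41.8719,12.5674\n"
    ',"Korea, South",2020-03-16T14:38:45,8236,75,1137,35.9078,127.7669\n'
)

# parse_dates converts the listed columns to datetime while reading.
df_sample = pd.read_csv(sample_csv, parse_dates=["Last Update"])

print(df_sample.dtypes)
```

Empty `Province/State` fields are read as missing values, and the quoted country name `"Korea, South"` is kept as a single field.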


# Perform Data Cleaning Tasks

In [42]:
# Check the column info. Province/State has missing values (only 125 non-null entries, while every other column has 272).
# We will need to rename some columns and convert the 'Last Update' column to a datetime type.
df_covid1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272 entries, 0 to 271
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Province/State  125 non-null    object 
 1   Country/Region  272 non-null    object 
 2   Last Update     272 non-null    object 
 3   Confirmed       272 non-null    int64  
 4   Deaths          272 non-null    int64  
 5   Recovered       272 non-null    int64  
 6   Latitude        272 non-null    float64
 7   Longitude       272 non-null    float64
dtypes: float64(2), int64(3), object(3)
memory usage: 17.1+ KB
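
The non-null counts above imply 272 - 125 = 147 missing `Province/State` values. A more direct way to count missing entries per column is `isna().sum()`. The sketch below uses a small stand-in frame (`df_demo`, a hypothetical name) rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; in the notebook this would be df_covid1.
df_demo = pd.DataFrame({
    "Province/State": ["Hubei", np.nan, np.nan, "Gibraltar"],
    "Country/Region": ["China", "Italy", "Iran", "United Kingdom"],
    "Confirmed": [67798, 27980, 14991, 1],
})

# Number of missing values in each column.
missing_per_column = df_demo.isna().sum()

# Fraction of missing values in each column (the mean of a boolean mask).
missing_share = df_demo.isna().mean()

print(missing_per_column)
print(missing_share)
```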


In [45]:
# Delete the "Province/State" column from the dataframe
df_covid2 = df_covid1.drop("Province/State", axis=1)

In [46]:
df_covid2.head(7) #view the updated dataframe

,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude
0,China,2020-03-16T14:38:45,67798,3099,55142,30.9756,112.2707
1,Italy,2020-03-16T17:33:03,27980,2158,2749,41.8719,12.5674
2,Iran,2020-03-16T14:38:45,14991,853,4590,32.4279,53.688
3,Spain,2020-03-16T20:13:11,9942,342,530,40.4637,-3.7492
4,"Korea, South",2020-03-16T14:38:45,8236,75,1137,35.9078,127.7669
5,Germany,2020-03-16T20:13:11,7272,17,67,51.1657,10.4515
6,France,2020-03-16T20:13:11,6633,148,12,46.2276,2.2137


In [47]:
# Rename multiple columns in one go with a dictionary
df_covid2.rename(
    columns={
        "Country/Region": "Country",
        "Last Update": "Date"
    },
    inplace=True
)

In [48]:
df_covid2.head(5)

,Country,Date,Confirmed,Deaths,Recovered,Latitude,Longitude
0,China,2020-03-16T14:38:45,67798,3099,55142,30.9756,112.2707
1,Italy,2020-03-16T17:33:03,27980,2158,2749,41.8719,12.5674
2,Iran,2020-03-16T14:38:45,14991,853,4590,32.4279,53.688
3,Spain,2020-03-16T20:13:11,9942,342,530,40.4637,-3.7492
4,"Korea, South",2020-03-16T14:38:45,8236,75,1137,35.9078,127.7669


In [49]:
# Number of unique countries already infected

df_covid2["Country"].nunique()

156

In [50]:
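# Convert the Date column from object (string) to datetime.
# The call returns a new Series; it is not assigned back, so df_covid2['Date'] keeps its original dtype.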
pd.to_datetime(df_covid2.Date)

0     2020-03-16 14:38:45
1     2020-03-16 17:33:03
2     2020-03-16 14:38:45
3     2020-03-16 20:13:11
4     2020-03-16 14:38:45
              ...        
267   2020-03-16 14:53:04
268   2020-03-14 16:33:03
269   2020-03-14 02:33:04
270   2020-03-10 02:33:04
271   2020-03-11 20:53:02
Name: Date, Length: 272, dtype: datetime64[ns]
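
Because `pd.to_datetime` returns a new Series, the conversion only persists if the result is assigned back to the column. A minimal sketch on a two-row stand-in frame (`df_demo` is a hypothetical name):

```python
import pandas as pd

df_demo = pd.DataFrame({"Date": ["2020-03-16T14:38:45", "2020-03-14T16:33:03"]})

# Calling to_datetime alone leaves the frame unchanged.
converted = pd.to_datetime(df_demo["Date"])
dtype_before = df_demo["Date"].dtype

# Assigning the result back replaces the string column with datetimes.
df_demo["Date"] = converted
dtype_after = df_demo["Date"].dtype

print(dtype_before, "->", dtype_after)
```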

# Format the "Date" column

In [51]:
df_covid2['year'] = pd.DatetimeIndex(df_covid2['Date']).year

In [52]:
df_covid2['month'] = pd.DatetimeIndex(df_covid2['Date']).month

In [53]:
df_covid2['day'] = pd.DatetimeIndex(df_covid2['Date']).day

In [54]:
df_covid2['time'] = pd.DatetimeIndex(df_covid2['Date']).time

In [55]:
df_covid2.head()

,Country,Date,Confirmed,Deaths,Recovered,Latitude,Longitude,year,month,day,time
0,China,2020-03-16T14:38:45,67798,3099,55142,30.9756,112.2707,2020,3,16,14:38:45
1,Italy,2020-03-16T17:33:03,27980,2158,2749,41.8719,12.5674,2020,3,16,17:33:03
2,Iran,2020-03-16T14:38:45,14991,853,4590,32.4279,53.688,2020,3,16,14:38:45
3,Spain,2020-03-16T20:13:11,9942,342,530,40.4637,-3.7492,2020,3,16,20:13:11
4,"Korea, South",2020-03-16T14:38:45,8236,75,1137,35.9078,127.7669,2020,3,16,14:38:45
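
Once a column holds datetimes, the same components can also be read through the `.dt` accessor. This is a sketch of an equivalent approach on a stand-in frame, not the code used in the cells above.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-16T14:38:45", "2020-03-14T16:33:03"]),
})

# The .dt accessor exposes datetime components of a datetime Series.
df_demo["year"] = df_demo["Date"].dt.year
df_demo["month"] = df_demo["Date"].dt.month
df_demo["day"] = df_demo["Date"].dt.day
df_demo["time"] = df_demo["Date"].dt.time

print(df_demo)
```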


In [56]:
#Merge the 'year','month' and 'day' as a new date column
df_covid2['New_Date'] = df_covid2.apply(lambda row: datetime.strptime(f"{int(row.year)}-{int(row.month)}-{int(row.day)}", '%Y-%m-%d'), axis=1)

In [57]:
df_covid2.head()

,Country,Date,Confirmed,Deaths,Recovered,Latitude,Longitude,year,month,day,time,New_Date
0,China,2020-03-16T14:38:45,67798,3099,55142,30.9756,112.2707,2020,3,16,14:38:45,2020-03-16
1,Italy,2020-03-16T17:33:03,27980,2158,2749,41.8719,12.5674,2020,3,16,17:33:03,2020-03-16
2,Iran,2020-03-16T14:38:45,14991,853,4590,32.4279,53.688,2020,3,16,14:38:45,2020-03-16
3,Spain,2020-03-16T20:13:11,9942,342,530,40.4637,-3.7492,2020,3,16,20:13:11,2020-03-16
4,"Korea, South",2020-03-16T14:38:45,8236,75,1137,35.9078,127.7669,2020,3,16,14:38:45,2020-03-16
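
The row-wise `apply` above works, but a date-only column can also be produced without a Python-level loop. The sketch below shows two vectorized options on a stand-in frame: truncating the timestamp with `dt.normalize()`, and rebuilding the date from `year`, `month` and `day` columns with `pd.to_datetime`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-16T14:38:45", "2020-03-14T16:33:03"]),
})
df_demo["year"] = df_demo["Date"].dt.year
df_demo["month"] = df_demo["Date"].dt.month
df_demo["day"] = df_demo["Date"].dt.day

# Option 1: set the time part to midnight, keeping only the date.
df_demo["New_Date"] = df_demo["Date"].dt.normalize()

# Option 2: assemble a date from columns named year, month and day.
rebuilt = pd.to_datetime(df_demo[["year", "month", "day"]])

print(df_demo[["Date", "New_Date"]])
```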


In [58]:
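# Preview the dataframe without the old 'Date' column.
# drop() returns a new dataframe; the result is not assigned, so df_covid2 still contains 'Date'.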
df_covid2.drop("Date", axis=1)

,Country,Confirmed,Deaths,Recovered,Latitude,Longitude,year,month,day,time,New_Date
0,China,67798,3099,55142,30.9756,112.2707,2020,3,16,14:38:45,2020-03-16
1,Italy,27980,2158,2749,41.8719,12.5674,2020,3,16,17:33:03,2020-03-16
2,Iran,14991,853,4590,32.4279,53.6880,2020,3,16,14:38:45,2020-03-16
3,Spain,9942,342,530,40.4637,-3.7492,2020,3,16,20:13:11,2020-03-16
4,"Korea, South",8236,75,1137,35.9078,127.7669,2020,3,16,14:38:45,2020-03-16
...,...,...,...,...,...,...,...,...,...,...,...
267,United Kingdom,1,1,0,19.3133,-81.2546,2020,3,16,14:53:04,2020-03-16
268,United Kingdom,1,0,1,36.1408,-5.3536,2020,3,14,16:33:03,2020-03-14
269,Australia,0,0,0,35.4437,139.6380,2020,3,14,02:33:04,2020-03-14
270,US,0,0,0,38.4912,-80.9545,2020,3,10,02:33:04,2020-03-10
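
`DataFrame.drop` returns a new dataframe and leaves the original untouched, which is why the `Date` column is still present in later outputs. A short sketch of the difference, using a two-row stand-in frame:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Country": ["China", "Italy"],
    "Date": ["2020-03-16T14:38:45", "2020-03-16T17:33:03"],
    "Confirmed": [67798, 27980],
})

# drop() returns a new DataFrame; df_demo itself is not modified.
without_date = df_demo.drop(columns="Date")
still_has_date = "Date" in df_demo.columns

# Assigning the result back makes the removal permanent.
df_demo = df_demo.drop(columns="Date")

print(still_has_date, list(df_demo.columns))
```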


# Filtering through the Dataset and Exploratory Data Analysis

In [61]:
# Select rows with country name 'China'
df_covid2.loc[df_covid2['Country'] == 'China']

,Country,Date,Confirmed,Deaths,Recovered,Latitude,Longitude,year,month,day,time,New_Date
0,China,2020-03-16T14:38:45,67798,3099,55142,30.9756,112.2707,2020,3,16,14:38:45,2020-03-16
10,China,2020-03-16T01:53:03,1361,8,1306,23.3417,113.4244,2020,3,16,01:53:03,2020-03-16
12,China,2020-03-14T09:53:08,1273,22,1250,33.882,113.614,2020,3,14,09:53:08,2020-03-14
13,China,2020-03-16T14:38:45,1231,1,1216,29.1832,120.0934,2020,3,16,14:38:45,2020-03-16
17,China,2020-03-14T08:33:03,1018,4,1014,27.6104,111.7088,2020,3,14,08:33:03,2020-03-14
18,China,2020-03-11T02:18:14,990,6,984,31.8257,117.2264,2020,3,11,02:18:14,2020-03-11
20,China,2020-03-12T02:13:04,935,1,934,27.614,115.7221,2020,3,12,02:13:04,2020-03-12
24,China,2020-03-16T14:38:45,760,7,746,36.3427,118.1498,2020,3,16,14:38:45,2020-03-16
26,China,2020-03-15T01:53:02,631,0,631,32.9711,119.455,2020,3,15,01:53:02,2020-03-15
27,China,2020-03-15T03:53:04,576,6,570,30.0572,107.874,2020,3,15,03:53:04,2020-03-15
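
The same boolean-mask pattern extends to several countries or to combined conditions. The sketch below uses five rows taken from the tables above; `isin` and `&` are standard pandas tools, and the threshold of 10,000 is an arbitrary value chosen for the example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Country": ["China", "Italy", "Iran", "China", "Spain"],
    "Confirmed": [67798, 27980, 14991, 1361, 9942],
    "Deaths": [3099, 2158, 853, 8, 342],
})

# Rows for a single country.
china = df_demo[df_demo["Country"] == "China"]

# Rows for any of several countries.
selected = df_demo[df_demo["Country"].isin(["Italy", "Spain"])]

# Combine conditions with & (each condition needs its own parentheses).
large_china = df_demo[(df_demo["Country"] == "China") & (df_demo["Confirmed"] > 10000)]

print(china)
print(selected)
print(large_china)
```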


In [62]:
# Group Countries in the dataset
covid_class = df_covid2.groupby("Country")

In [66]:
# Largest death count reported in any single row (province/region) of each country; this is not the national total
covid_class['Deaths'].max().head(30)

Country
Afghanistan                    0
Albania                        1
Algeria                        4
Andorra                        0
Antigua and Barbuda            0
Argentina                      2
Armenia                        0
Aruba                          0
Australia                      2
Austria                        3
Azerbaijan                     1
Bahrain                        1
Bangladesh                     0
Belarus                        0
Belgium                        5
Benin                          0
Bhutan                         0
Bolivia                        0
Bosnia and Herzegovina         0
Brazil                         0
Brunei                         0
Bulgaria                       2
Burkina Faso                   0
Cambodia                       0
Cameroon                       0
Canada                         4
Central African Republic       0
Chile                          0
China                       3099
Colombia                       0
...
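
Because countries such as China appear in several rows (one per province), `max()` returns the largest single-region figure rather than a country total. The sketch below contrasts `max()` with `sum()` on four rows taken from the tables above.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Country": ["China", "China", "China", "Italy"],
    "Deaths": [3099, 8, 22, 2158],
})

grouped = df_demo.groupby("Country")["Deaths"]

# Largest value among the rows of each country.
largest_region = grouped.max()

# Total over all rows of each country.
country_total = grouped.sum()

print(pd.DataFrame({"largest_region": largest_region, "country_total": country_total}))
```

For a country with a single row, such as Italy here, the two figures coincide.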

In [67]:
# Largest confirmed-case count reported in any single row (province/region) of each country
covid_class['Confirmed'].max().head(30)

Country
Afghanistan                    21
Albania                        51
Algeria                        54
Andorra                         2
Antigua and Barbuda             1
Argentina                      56
Armenia                        52
Aruba                           2
Australia                     171
Austria                      1018
Azerbaijan                     15
Bahrain                       214
Bangladesh                      8
Belarus                        36
Belgium                      1058
Benin                           1
Bhutan                          1
Bolivia                        11
Bosnia and Herzegovina         25
Brazil                        200
Brunei                         54
Bulgaria                       52
Burkina Faso                   15
Cambodia                        7
Cameroon                        4
Canada                        177
Central African Republic        1
Chile                         155
China                       67798
...
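
An alphabetical listing makes it hard to see which countries are most affected. One way to rank them, sketched here on five rows from the tables above, is to sum the confirmed cases per country and take the largest values with `nlargest`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Country": ["China", "Italy", "Iran", "China", "Spain"],
    "Confirmed": [67798, 27980, 14991, 1361, 9942],
})

# Total confirmed cases per country, then the three largest totals.
totals = df_demo.groupby("Country")["Confirmed"].sum()
top3 = totals.nlargest(3)

print(top3)
```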

In [68]:
# Mean death count per row (province/region) within each country; countries with a single row return that row's value
covid_class['Deaths'].mean().head(30)

Country
Afghanistan                  0.000000
Albania                      1.000000
Algeria                      4.000000
Andorra                      0.000000
Antigua and Barbuda          0.000000
Argentina                    2.000000
Armenia                      0.000000
Aruba                        0.000000
Australia                    0.333333
Austria                      3.000000
Azerbaijan                   1.000000
Bahrain                      1.000000
Bangladesh                   0.000000
Belarus                      0.000000
Belgium                      5.000000
Benin                        0.000000
Bhutan                       0.000000
Bolivia                      0.000000
Bosnia and Herzegovina       0.000000
Brazil                       0.000000
Brunei                       0.000000
Bulgaria                     2.000000
Burkina Faso                 0.000000
Cambodia                     0.000000
Cameroon                     0.000000
Canada                       0.363636
...
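
A fractional mean such as 0.333333 for Australia arises because the average is taken over that country's rows, not over people. Showing the row count next to the mean makes this easier to read. The sketch below uses named aggregation on four rows taken from the tables above.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Country": ["China", "China", "China", "Italy"],
    "Deaths": [3099, 8, 22, 2158],
})

# Named aggregation: each keyword becomes a column in the result.
summary = df_demo.groupby("Country")["Deaths"].agg(rows="count", mean="mean")

print(summary)
```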

In [69]:
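# Cases not yet recovered (Confirmed - Recovered), used here as an estimate of people still quarantined.
# Note that this figure still includes deaths, so it overstates the number currently quarantined.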
df_covid2["Quarantined"] = df_covid2["Confirmed"] - df_covid2["Recovered"]
df_covid2

,Country,Date,Confirmed,Deaths,Recovered,Latitude,Longitude,year,month,day,time,New_Date,Quarantined
0,China,2020-03-16T14:38:45,67798,3099,55142,30.9756,112.2707,2020,3,16,14:38:45,2020-03-16,12656
1,Italy,2020-03-16T17:33:03,27980,2158,2749,41.8719,12.5674,2020,3,16,17:33:03,2020-03-16,25231
2,Iran,2020-03-16T14:38:45,14991,853,4590,32.4279,53.6880,2020,3,16,14:38:45,2020-03-16,10401
3,Spain,2020-03-16T20:13:11,9942,342,530,40.4637,-3.7492,2020,3,16,20:13:11,2020-03-16,9412
4,"Korea, South",2020-03-16T14:38:45,8236,75,1137,35.9078,127.7669,2020,3,16,14:38:45,2020-03-16,7099
...,...,...,...,...,...,...,...,...,...,...,...,...,...
267,United Kingdom,2020-03-16T14:53:04,1,1,0,19.3133,-81.2546,2020,3,16,14:53:04,2020-03-16,1
268,United Kingdom,2020-03-14T16:33:03,1,0,1,36.1408,-5.3536,2020,3,14,16:33:03,2020-03-14,0
269,Australia,2020-03-14T02:33:04,0,0,0,35.4437,139.6380,2020,3,14,02:33:04,2020-03-14,0
270,US,2020-03-10T02:33:04,0,0,0,38.4912,-80.9545,2020,3,10,02:33:04,2020-03-10,0
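
Since `Confirmed - Recovered` still counts people who have died (row 267 shows 1 quarantined although that single case was a death), a closer estimate of currently active cases also subtracts deaths. This is a sketch under that assumption; the `Active` column name is introduced here and is not part of the original notebook.

```python
import pandas as pd

# Three rows taken from the table above (the last one is row 267).
df_demo = pd.DataFrame({
    "Country": ["China", "Italy", "United Kingdom"],
    "Confirmed": [67798, 27980, 1],
    "Deaths": [3099, 2158, 1],
    "Recovered": [55142, 2749, 0],
})

# Active cases: confirmed cases that have neither recovered nor died.
df_demo["Active"] = df_demo["Confirmed"] - df_demo["Deaths"] - df_demo["Recovered"]

print(df_demo)
```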
