#  Covid-19 : Data Analysis Project

### Libraries used : Pandas, Numpy, Matplotlib, Plotly

# Project Plan


The data for this project was collected from the research titled "Coronavirus Pandemic (COVID-19)", published online at Our World in Data, and is available from its "Coronavirus Pandemic (COVID-19) - Statistics and Research" page. The primary data source for the confirmed cases and deaths is the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. The data for policy responses was obtained from the Oxford COVID-19 Government Response Tracker. The data for vaccinations, tests and positivity, and hospitals and ICU comes from official data collected by the Our World in Data team. The confirmed cases and deaths data is updated daily. The dataset contains COVID-19 information for almost all the countries in the world.

 The dataset contains the following variables:
 
1. location : country name
2. date : date on which the observation was made
3. total_cases : total cases of COVID-19 till date
4. new_cases :  COVID-19 cases recorded on the particular date
5. total_cases_per_million : Total confirmed cases of COVID-19 per 1,000,000 people.
6. total_deaths : Total deaths attributed to COVID-19 till date
7. new_deaths : New deaths attributed to COVID-19 on a specific date
8. new_tests : New tests conducted on a given date
9. total_tests : total tests done till date
10. new_tests_per_thousand : new tests done on a particular date per thousand population
11. total_tests_per_thousand : total tests done till date per thousand population
12. hosp_patients : Number of COVID-19 patients in hospital on a given day
13. new_vaccinations : New COVID-19 vaccination doses administered
14. people_vaccinated : Total number of people who received at least one vaccine dose
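
The per-capita variables in the list above are derived from the raw counts and the population. As a rough illustration only (the numbers are made up, and `population` is a further column of the same dataset that is not in the list), the relationship can be sketched as:

```python
import pandas as pd

# Toy rows with made-up numbers, only to illustrate the per-capita columns
toy = pd.DataFrame({
    "location": ["A", "B"],
    "total_cases": [5000.0, 120.0],
    "total_tests": [80000.0, 3000.0],
    "population": [10_000_000.0, 400_000.0],
})

# Cases per 1,000,000 people and tests per 1,000 people
toy["total_cases_per_million"] = toy["total_cases"] / toy["population"] * 1_000_000
toy["total_tests_per_thousand"] = toy["total_tests"] / toy["population"] * 1_000
print(toy)
```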




## Project Aim and Objectives 

The project aims to analyse the global trend of COVID-19. Live data is used for this project; the data is available from January 2020 to date. We show a map view of the world to show the countries affected by COVID-19 and the spread of the infection. We also study the impact of COVID-19 lockdown interventions in different countries on the spread of COVID-19 cases and the deaths due to COVID-19. For ease of analysis and understanding, after mapping the affected countries we choose the top five to analyse and visualise the impact of lockdown restrictions. We also study the relation between the COVID-19 positivity rate and the number of tests conducted.

### Specific Objective(s)

* __Objective 1:__ Analyse and Visualise the impact of COVID-19 Lockdown measures on the spread of COVID-19 infection - whether the infection rate increased or decreased.
* __Objective 2:__ Total COVID-19 tests conducted vs confirmed cases.
* __Objective 3:__ Visualise and analyse the impact of COVID vaccination on the hospitalisation of infected patients.


## System Design


### Architecture

The data is collected from a URL as a single CSV file from the "Our World in Data" repository. First, the data is cleaned by dropping the unwanted rows and columns. The cleaned data is then used to plot choropleth world maps for visualisation. Finally, the data is grouped and filtered by country and date to perform the exploratory data analysis; the analysis objectives are explained and visualised in the following sections.
  
### Processing Modules and Algorithms

* Data cleaning - We obtained the data from https://ourworldindata.org/coronavirus. The dataset was cleaned by dropping the unwanted rows and columns, and the analysis was restricted to the countries most affected by COVID-19. Rows with a missing total case count were dropped and most of the remaining missing values in the analysed columns were replaced with zeros.
* The dataset was filtered and visualised using Plotly.
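
The cleaning steps above can be sketched on a few made-up rows. This is only an illustration under stated assumptions: `clean_country_rows` is a hypothetical helper, not a function from the notebook, and it keeps just four columns. Note that the prefix filter also drops any territory to which the dataset itself assigns an `OWID_` code.

```python
import pandas as pd

def clean_country_rows(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep country-level rows and a few columns."""
    # Aggregate rows (continents, income groups, world) carry an OWID_ code
    countries = raw[~raw["iso_code"].str.startswith("OWID_")]
    cleaned = countries[["location", "date", "total_cases", "new_cases"]].copy()
    # Drop rows with no cumulative count, then treat a missing daily count as zero
    cleaned = cleaned.dropna(subset=["total_cases"])
    cleaned["new_cases"] = cleaned["new_cases"].fillna(0).astype(int)
    return cleaned.sort_values("date").reset_index(drop=True)

# Made-up rows standing in for the downloaded CSV
raw = pd.DataFrame({
    "iso_code": ["OWID_WRL", "IND", "IND", "GBR"],
    "location": ["World", "India", "India", "United Kingdom"],
    "date": ["2020-02-01", "2020-02-02", "2020-02-01", "2020-02-01"],
    "total_cases": [100.0, 2.0, 1.0, None],
    "new_cases": [10.0, None, 0.0, None],
})
cleaned = clean_country_rows(raw)
print(cleaned)
```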

# Program Code

## Importing the necessary libraries

The block below imports all the necessary libraries and packages.

In [4]:
%config Completer.use_jedi = False
import warnings
warnings.filterwarnings("ignore")


import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode
from pandas_profiling import ProfileReport
from datetime import datetime

init_notebook_mode(connected = True)

## Read and inspect data

In [5]:
# Read live data from Url
dataset_url = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv"
dataset_df = pd.read_csv(dataset_url)

In [6]:
dataset_df.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,


In [7]:
dataset_df.tail()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
166093,ZWE,Africa,Zimbabwe,2022-02-28,236380.0,577.0,401.286,5395.0,2.0,1.286,...,1.6,30.7,36.791,1.7,61.49,0.571,,,,
166094,ZWE,Africa,Zimbabwe,2022-03-01,236871.0,491.0,413.0,5395.0,0.0,1.0,...,1.6,30.7,36.791,1.7,61.49,0.571,,,,
166095,ZWE,Africa,Zimbabwe,2022-03-02,237503.0,632.0,416.286,5396.0,1.0,1.143,...,1.6,30.7,36.791,1.7,61.49,0.571,,,,
166096,ZWE,Africa,Zimbabwe,2022-03-03,237503.0,0.0,362.286,5396.0,0.0,0.857,...,1.6,30.7,36.791,1.7,61.49,0.571,,,,
166097,ZWE,Africa,Zimbabwe,2022-03-04,238739.0,1236.0,467.429,5397.0,1.0,0.714,...,1.6,30.7,36.791,1.7,61.49,0.571,,,,


In [8]:
dataset_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 166098 entries, 0 to 166097
Data columns (total 67 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   iso_code                                    166098 non-null  object 
 1   continent                                   156155 non-null  object 
 2   location                                    166098 non-null  object 
 3   date                                        166098 non-null  object 
 4   total_cases                                 163065 non-null  float64
 5   new_cases                                   162926 non-null  float64
 6   new_cases_smoothed                          160946 non-null  float64
 7   total_deaths                                145232 non-null  float64
 8   new_deaths                                  145273 non-null  float64
 9   new_deaths_smoothed                         143180 non-null  float64
 

## Data Cleaning
Remove the rows and columns that are not required for the data analysis and check for null values in the data. If there are any, replace them with zeros.

 * On examining the dataset, we found that it also contains rows for income groups and rows that sum the cases of all countries in a continent. As this project only analyses data at the country level, these rows are removed from the analysis.

In [9]:
# create a new dataframe of data only at country level

df_countries = dataset_df[~dataset_df.iso_code.str.match(r'^OWID')]
df_countries.shape

(155117, 67)

In [10]:
# list out the columns of the dataframe
df_countries.columns

Index(['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
       'new_cases_smoothed', 'total_deaths', 'new_deaths',
       'new_deaths_smoothed', 'total_cases_per_million',
       'new_cases_per_million', 'new_cases_smoothed_per_million',
       'total_deaths_per_million', 'new_deaths_per_million',
       'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients',
       'icu_patients_per_million', 'hosp_patients',
       'hosp_patients_per_million', 'weekly_icu_admissions',
       'weekly_icu_admissions_per_million', 'weekly_hosp_admissions',
       'weekly_hosp_admissions_per_million', 'new_tests', 'total_tests',
       'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
       'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated', 'total_boosters',
       'new_vaccinations', 'new_vaccinations_smoothed',
       't

In [11]:
#create a dataframe of columns that are needed for data analysis

columns = df_countries[['location','date', 'total_cases', 'new_cases','total_deaths', 'new_deaths', 'new_tests', 'total_tests',
       'total_tests_per_thousand','population', 'new_tests_per_thousand', 'new_cases_per_million', 'new_vaccinations', 'total_vaccinations', 'hosp_patients', 'people_vaccinated' ]]
dataset_df_drop = columns.copy()

In [12]:
dataset_df_drop

Unnamed: 0,location,date,total_cases,new_cases,total_deaths,new_deaths,new_tests,total_tests,total_tests_per_thousand,population,new_tests_per_thousand,new_cases_per_million,new_vaccinations,total_vaccinations,hosp_patients,people_vaccinated
0,Afghanistan,2020-02-24,5.0,5.0,,,,,,39835428.0,,0.126,,,,
1,Afghanistan,2020-02-25,5.0,0.0,,,,,,39835428.0,,0.000,,,,
2,Afghanistan,2020-02-26,5.0,0.0,,,,,,39835428.0,,0.000,,,,
3,Afghanistan,2020-02-27,5.0,0.0,,,,,,39835428.0,,0.000,,,,
4,Afghanistan,2020-02-28,5.0,0.0,,,,,,39835428.0,,0.000,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
166093,Zimbabwe,2022-02-28,236380.0,577.0,5395.0,2.0,,,,15092171.0,,38.232,8707.0,7890951.0,,4362150.0
166094,Zimbabwe,2022-03-01,236871.0,491.0,5395.0,0.0,,,,15092171.0,,32.533,10409.0,7901360.0,,4365856.0
166095,Zimbabwe,2022-03-02,237503.0,632.0,5396.0,1.0,,2085373.0,138.176,15092171.0,,41.876,9380.0,7910740.0,,4368726.0
166096,Zimbabwe,2022-03-03,237503.0,0.0,5396.0,0.0,,,,15092171.0,,0.000,10373.0,7921113.0,,4372925.0


In [13]:
#check for any null values in the data
dataset_df_drop.isnull().sum()

location                         0
date                             0
total_cases                   2709
new_cases                     2855
total_deaths                 20327
new_deaths                   20486
new_tests                    87806
total_tests                  85873
total_tests_per_thousand     85873
population                       0
new_tests_per_thousand       87806
new_cases_per_million         2855
new_vaccinations            123118
total_vaccinations          115436
hosp_patients               130513
people_vaccinated           117616
dtype: int64

In [14]:
#the data has null values for the total_cases, new_cases, total_deaths and new_deaths variables, among others
#all the rows that have null values for total_cases are removed and the remaining null values are replaced by zeros
#(new_vaccinations and total_vaccinations are left unchanged)

dataset_df_drop.dropna(subset = ['total_cases'], inplace = True)
dataset_df_drop['new_cases'] = dataset_df_drop['new_cases'].fillna(0).astype(int)
dataset_df_drop['total_deaths'] = dataset_df_drop['total_deaths'].fillna(0).astype(int)
dataset_df_drop['new_deaths'] = dataset_df_drop['new_deaths'].fillna(0).astype(int)
dataset_df_drop['new_tests'] = dataset_df_drop['new_tests'].fillna(0).astype(int)
dataset_df_drop['total_tests'] = dataset_df_drop['total_tests'].fillna(0).astype(int)
dataset_df_drop['total_tests_per_thousand'] = dataset_df_drop['total_tests_per_thousand'].fillna(0)
dataset_df_drop['new_tests_per_thousand'] = dataset_df_drop['new_tests_per_thousand'].fillna(0)
dataset_df_drop['new_cases_per_million'] = dataset_df_drop['new_cases_per_million'].fillna(0)
dataset_df_drop['hosp_patients'] = dataset_df_drop['hosp_patients'].fillna(0).astype(int)
dataset_df_drop['people_vaccinated'] = dataset_df_drop['people_vaccinated'].fillna(0).astype(int)

In [15]:
dataset_df_drop.isnull().sum()

location                         0
date                             0
total_cases                      0
new_cases                        0
total_deaths                     0
new_deaths                       0
new_tests                        0
total_tests                      0
total_tests_per_thousand         0
population                       0
new_tests_per_thousand           0
new_cases_per_million            0
new_vaccinations            120440
total_vaccinations          113000
hosp_patients                    0
people_vaccinated                0
dtype: int64
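
Filling every gap with zero is simple, but for cumulative columns such as `total_tests` a zero on a non-reporting day makes the series appear to drop. As a possible alternative, which is not what this notebook does, the last reported value can be carried forward within each country. A minimal sketch on made-up rows (the column `total_tests_ffill` is introduced here for illustration):

```python
import pandas as pd

# Made-up rows: country A skips reporting its cumulative test count on day 2
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-01", "2021-01-02"],
    "total_tests": [100.0, None, 150.0, None, 20.0],
})

toy = toy.sort_values(["location", "date"])
# Carry the last reported cumulative value forward within each country;
# days before the first report have nothing to carry, so they become 0
toy["total_tests_ffill"] = (
    toy.groupby("location")["total_tests"].ffill().fillna(0).astype(int)
)
print(toy)
```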

In [16]:
dataset_df_drop.head()

Unnamed: 0,location,date,total_cases,new_cases,total_deaths,new_deaths,new_tests,total_tests,total_tests_per_thousand,population,new_tests_per_thousand,new_cases_per_million,new_vaccinations,total_vaccinations,hosp_patients,people_vaccinated
0,Afghanistan,2020-02-24,5.0,5,0,0,0,0,0.0,39835428.0,0.0,0.126,,,0,0
1,Afghanistan,2020-02-25,5.0,0,0,0,0,0,0.0,39835428.0,0.0,0.0,,,0,0
2,Afghanistan,2020-02-26,5.0,0,0,0,0,0,0.0,39835428.0,0.0,0.0,,,0,0
3,Afghanistan,2020-02-27,5.0,0,0,0,0,0,0.0,39835428.0,0.0,0.0,,,0,0
4,Afghanistan,2020-02-28,5.0,0,0,0,0,0,0.0,39835428.0,0.0,0.0,,,0,0


In [17]:
#sort the dataframe by date to properly visualise the data in map view
# dataset_df_drop['date'] = pd.to_datetime(dataset_df_drop['date'])
dataset_df_sort = dataset_df_drop.sort_values(by = "date" , ascending= True)

In [18]:
dataset_df_sort.head()

Unnamed: 0,location,date,total_cases,new_cases,total_deaths,new_deaths,new_tests,total_tests,total_tests_per_thousand,population,new_tests_per_thousand,new_cases_per_million,new_vaccinations,total_vaccinations,hosp_patients,people_vaccinated
145953,Taiwan,2020-01-22,1.0,0,0,0,44,64,0.003,23855010.0,0.002,0.0,,,0,0
31490,China,2020-01-22,547.0,0,17,0,0,0,0.0,1444216000.0,0.0,0.0,,,0,0
148136,Thailand,2020-01-22,4.0,0,0,0,8,194,0.003,69950840.0,0.0,0.0,,,0,0
157007,United States,2020-01-22,1.0,0,0,0,0,0,0.0,332915100.0,0.0,0.0,,,0,0
139286,South Korea,2020-01-22,1.0,0,0,0,0,0,0.0,51305180.0,0.0,0.0,,,0,0


## Data Analysis

In [19]:
#create a new dataframe for analysis
dftime = dataset_df_sort.copy()
dftime['date'] = pd.to_datetime(dftime['date'])
dftime

Unnamed: 0,location,date,total_cases,new_cases,total_deaths,new_deaths,new_tests,total_tests,total_tests_per_thousand,population,new_tests_per_thousand,new_cases_per_million,new_vaccinations,total_vaccinations,hosp_patients,people_vaccinated
145953,Taiwan,2020-01-22,1.0,0,0,0,44,64,0.003,2.385501e+07,0.002,0.000,,,0,0
31490,China,2020-01-22,547.0,0,17,0,0,0,0.000,1.444216e+09,0.000,0.000,,,0,0
148136,Thailand,2020-01-22,4.0,0,0,0,8,194,0.003,6.995084e+07,0.000,0.000,,,0,0
157007,United States,2020-01-22,1.0,0,0,0,0,0,0.000,3.329151e+08,0.000,0.000,,,0,0
139286,South Korea,2020-01-22,1.0,0,0,0,0,0,0.000,5.130518e+07,0.000,0.000,,,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138543,South Africa,2022-03-04,3681437.0,1898,99517,18,30153,23196880,386.344,6.004200e+07,0.502,31.611,,,0,0
44154,Ecuador,2022-03-04,836216.0,0,35264,0,0,0,0.000,1.788847e+07,0.000,0.000,,,0,0
43420,Dominican Republic,2022-03-04,575592.0,155,4370,0,0,0,0.000,1.095371e+07,0.000,14.150,,,0,0
46340,Equatorial Guinea,2022-03-04,15885.0,0,183,0,0,0,0.000,1.449891e+06,0.000,0.000,,,0,0


In [20]:
#Add a new column with only month and year
dftime['year_month'] = dftime['date'].dt.strftime('%Y-%m')

In [21]:
dftime.head()

Unnamed: 0,location,date,total_cases,new_cases,total_deaths,new_deaths,new_tests,total_tests,total_tests_per_thousand,population,new_tests_per_thousand,new_cases_per_million,new_vaccinations,total_vaccinations,hosp_patients,people_vaccinated,year_month
145953,Taiwan,2020-01-22,1.0,0,0,0,44,64,0.003,23855010.0,0.002,0.0,,,0,0,2020-01
31490,China,2020-01-22,547.0,0,17,0,0,0,0.0,1444216000.0,0.0,0.0,,,0,0,2020-01
148136,Thailand,2020-01-22,4.0,0,0,0,8,194,0.003,69950840.0,0.0,0.0,,,0,0,2020-01
157007,United States,2020-01-22,1.0,0,0,0,0,0,0.0,332915100.0,0.0,0.0,,,0,0,2020-01
139286,South Korea,2020-01-22,1.0,0,0,0,0,0,0.0,51305180.0,0.0,0.0,,,0,0,2020-01
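
The `year_month` column makes it straightforward to aggregate the daily counts by month. A minimal sketch on made-up rows (the `monthly` dataframe is introduced here for illustration and is not part of the notebook):

```python
import pandas as pd

# Made-up daily rows for one country
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-01", "2020-02-02"]),
    "new_cases": [1, 2, 3, 4],
})
toy["year_month"] = toy["date"].dt.strftime("%Y-%m")

# Monthly totals of the daily new cases
monthly = toy.groupby("year_month", as_index=False)["new_cases"].sum()
print(monthly)
```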


In [22]:
#create a new dataframe grouping by date column
dftime_date = dftime.groupby("date").sum(numeric_only = True).reset_index()

In [23]:
dftime_date

Unnamed: 0,date,total_cases,new_cases,total_deaths,new_deaths,new_tests,total_tests,total_tests_per_thousand,population,new_tests_per_thousand,new_cases_per_million,new_vaccinations,total_vaccinations,hosp_patients,people_vaccinated
0,2020-01-22,557.0,0.0,17.0,0.0,52.0,2.580000e+02,0.006,2.048951e+09,0.002,0.000,0.0,0.000000e+00,0.0,0.000000e+00
1,2020-01-23,657.0,100.0,18.0,1.0,94.0,3.520000e+02,0.009,2.198195e+09,0.003,2.104,0.0,0.000000e+00,0.0,0.000000e+00
2,2020-01-24,944.0,287.0,26.0,8.0,76.0,4.280000e+02,0.011,2.265617e+09,0.002,0.735,0.0,0.000000e+00,0.0,0.000000e+00
3,2020-01-25,1437.0,493.0,42.0,16.0,122.0,5.520000e+02,0.016,2.328068e+09,0.004,0.916,0.0,0.000000e+00,0.0,0.000000e+00
4,2020-01-26,2120.0,683.0,56.0,14.0,192.0,7.440000e+02,0.023,2.353856e+09,0.006,5.866,0.0,0.000000e+00,0.0,0.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
768,2022-02-28,436764796.0,1403426.0,5953716.0,7758.0,6382649.0,3.870567e+09,198311.992,7.836839e+09,441.593,131666.414,14309124.0,9.167558e+09,126355.0,3.021056e+09
769,2022-03-01,438301602.0,1536806.0,5961576.0,8068.0,7336961.0,3.638338e+09,202131.680,7.836839e+09,477.150,125724.136,10881861.0,8.738324e+09,122100.0,2.816156e+09
770,2022-03-02,439952971.0,1651369.0,5969520.0,7944.0,6803250.0,3.515422e+09,200948.759,7.836839e+09,462.566,125750.461,10621255.0,8.526615e+09,146491.0,2.695943e+09
771,2022-03-03,441840367.0,1887396.0,5977903.0,8384.0,5815931.0,2.878401e+09,161992.168,7.836839e+09,395.792,119042.589,11733075.0,8.341447e+09,81094.0,2.528727e+09
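
One caveat about the grouped table: the per-capita columns are summed as well, so a value such as `new_cases_per_million` of 131666.414 on 2022-02-28 is the sum of the national rates rather than a global rate. If a global rate is wanted, it can be recomputed from the summed counts and population. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for two countries on two days
toy = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"],
    "location": ["A", "B", "A", "B"],
    "new_cases": [100, 300, 300, 500],
    "population": [1_000_000, 3_000_000, 1_000_000, 3_000_000],
})

# Sum the counts per day, then derive the rate from the sums
daily = toy.groupby("date", as_index=False)[["new_cases", "population"]].sum()
daily["new_cases_per_million"] = daily["new_cases"] / daily["population"] * 1_000_000
print(daily)
```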


In [24]:
#visualise the daily new cases
fig = px.bar(dftime_date,
             x = 'date',
             y = 'new_cases',
             color = 'new_cases',
             color_continuous_scale = 'reds',
             labels = {"new_cases": "New cases"})

fig.update_layout(title = 'Daily - New Cases',
                  title_x = 0.5,
                  title_font = dict(size= 18, color = 'Purple'),
                  xaxis = dict(title = 'Date'),
                  yaxis = dict(title = 'New cases'))

fig.show()

In [26]:
#visualise the daily new deaths
fig = px.bar(dftime_date,
             x = 'date',
             y = 'new_deaths',
             color = 'new_deaths',
             color_continuous_scale = 'reds',
             labels = {"new_deaths": "New deaths"})

fig.update_layout(title = 'Daily - New deaths',
                  title_x = 0.5,
                  title_font = dict(size= 18, color = 'Purple'),
                  xaxis = dict(title = 'Date'),
                  yaxis = dict(title = 'New deaths'))

fig.show()

In [27]:
#Top 30 countries with the highest number of confirmed cases of COVID-19
df1 = dataset_df_sort.groupby("location")["new_cases"].sum().sort_values(ascending = False).reset_index().head(30)

fig = px.bar(df1,
             x = 'location',
             y = 'new_cases',
             color = 'location',
             color_continuous_scale = 'rdpu',
             labels = {"new_cases":"Total cases"})

fig.update_layout(title = 'Top 30 Countries with the most Confirmed Cases',
                  title_x = 0.5,
                  title_font = dict(size = 18, color = 'Purple'),
                  yaxis = dict(title = 'Total_cases'),
                  xaxis = dict(tickangle = 45))
fig.show()

In [28]:
#Top 30 countries with most number of deaths
df2 = dataset_df_sort.groupby("location")["new_deaths"].sum().sort_values(ascending = False).reset_index().head(30)

fig = px.bar(df2,
             x = 'location',
             y = 'new_deaths',
             color = 'location',
             color_continuous_scale = 'rdpu',
             labels = {"new_deaths":"Total deaths"})

fig.update_layout(title = 'Top 30 Countries with the most deaths',
                  title_x = 0.5,
                  title_font = dict(size = 18, color = 'Purple'),
                  yaxis = dict(title = 'Total_deaths'),
                  xaxis = dict(tickangle = 45))
fig.show()

#### Plot a graph to visualise the maximum infection rate of each country - Most number of confirmed cases in a day

In [29]:
countries = list(dataset_df_sort['location'].unique())
max_infection_rate = []
for c in countries:
    infection_rate = dataset_df_sort[dataset_df_sort.location == c].new_cases.max()
    max_infection_rate.append(infection_rate)

In [30]:
df_infection_rate = pd.DataFrame()
df_infection_rate['Country'] = countries
df_infection_rate['Infection_rate'] = max_infection_rate
df_infection_rate_sort = df_infection_rate.sort_values(by = 'Infection_rate', ascending = False)
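
The loop and dataframe construction above compute, for each country, the largest single-day value of `new_cases`. The same result can be obtained with one `groupby`. A minimal sketch on made-up rows, reusing the column names `Country` and `Infection_rate` (the name `peak` is introduced here for illustration):

```python
import pandas as pd

# Made-up daily counts for three countries
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B", "C"],
    "new_cases": [5, 12, 40, 7, 3],
})

# Highest single-day count per country, sorted from largest to smallest
peak = (
    toy.groupby("location")["new_cases"].max()
       .sort_values(ascending=False)
       .rename_axis("Country")
       .reset_index(name="Infection_rate")
)
print(peak)
```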

In [31]:
fig = px.bar(df_infection_rate_sort, x = 'Country', y = 'Infection_rate', color= 'Country', log_y=True, title= "Maximum infection rate")
fig.show()

From the bar chart, we can infer that India has the highest maximum infection rate, i.e. India has recorded the highest single-day count of COVID-19 cases in the world.

### Create the country specific dataframes

In [32]:
#India

dataset_india = dataset_df_sort[dataset_df_sort.location == 'India']
dataset_india.head()

Unnamed: 0,location,date,total_cases,new_cases,total_deaths,new_deaths,new_tests,total_tests,total_tests_per_thousand,population,new_tests_per_thousand,new_cases_per_million,new_vaccinations,total_vaccinations,hosp_patients,people_vaccinated
69583,India,2020-01-30,1.0,1,0,0,0,0,0.0,1393409000.0,0.0,0.001,,,0,0
69584,India,2020-01-31,1.0,0,0,0,0,0,0.0,1393409000.0,0.0,0.0,,,0,0
69585,India,2020-02-01,1.0,0,0,0,0,0,0.0,1393409000.0,0.0,0.0,,,0,0
69586,India,2020-02-02,2.0,1,0,0,0,0,0.0,1393409000.0,0.0,0.001,,,0,0
69587,India,2020-02-03,3.0,1,0,0,0,0,0.0,1393409000.0,0.0,0.001,,,0,0


In [33]:
fig = px.line(dataset_india, x = 'date', y = ['total_cases', 'new_cases'], title = "Covid-19 trend in India")

fig.show()

In [34]:
#United Kingdom

dataset_UK = dataset_df_sort[dataset_df_sort.location == 'United Kingdom']
dataset_UK.head()

Unnamed: 0,location,date,total_cases,new_cases,total_deaths,new_deaths,new_tests,total_tests,total_tests_per_thousand,population,new_tests_per_thousand,new_cases_per_million,new_vaccinations,total_vaccinations,hosp_patients,people_vaccinated
156243,United Kingdom,2020-01-31,2.0,2,0,0,0,0,0.0,68207114.0,0.0,0.029,,,0,0
156244,United Kingdom,2020-02-01,2.0,0,0,0,0,0,0.0,68207114.0,0.0,0.0,,,0,0
156245,United Kingdom,2020-02-02,2.0,0,0,0,0,0,0.0,68207114.0,0.0,0.0,,,0,0
156246,United Kingdom,2020-02-03,8.0,6,0,0,0,0,0.0,68207114.0,0.0,0.088,,,0,0
156247,United Kingdom,2020-02-04,8.0,0,0,0,0,0,0.0,68207114.0,0.0,0.0,,,0,0


In [35]:
fig = px.line(dataset_UK, x = 'date', y = ['total_cases', 'new_cases'], title = "Covid-19 trend in UK")

fig.show()

In [36]:
#Italy

dataset_Italy = dataset_df_sort[dataset_df_sort.location == 'Italy']
dataset_Italy.head()

Unnamed: 0,location,date,total_cases,new_cases,total_deaths,new_deaths,new_tests,total_tests,total_tests_per_thousand,population,new_tests_per_thousand,new_cases_per_million,new_vaccinations,total_vaccinations,hosp_patients,people_vaccinated
75517,Italy,2020-01-31,2.0,2,0,0,0,0,0.0,60367471.0,0.0,0.033,,,0,0
75518,Italy,2020-02-01,2.0,0,0,0,0,0,0.0,60367471.0,0.0,0.0,,,0,0
75519,Italy,2020-02-02,2.0,0,0,0,0,0,0.0,60367471.0,0.0,0.0,,,0,0
75520,Italy,2020-02-03,2.0,0,0,0,0,0,0.0,60367471.0,0.0,0.0,,,0,0
75521,Italy,2020-02-04,2.0,0,0,0,0,0,0.0,60367471.0,0.0,0.0,,,0,0


In [37]:
fig = px.line(dataset_Italy, x = 'date', y = ['total_cases', 'new_cases'], title = "Covid-19 trend in Italy")

fig.show()

In [38]:
#Brazil

dataset_brazil = dataset_df_sort[dataset_df_sort.location == 'Brazil']
dataset_brazil.head()

Unnamed: 0,location,date,total_cases,new_cases,total_deaths,new_deaths,new_tests,total_tests,total_tests_per_thousand,population,new_tests_per_thousand,new_cases_per_million,new_vaccinations,total_vaccinations,hosp_patients,people_vaccinated
21280,Brazil,2020-02-26,1.0,1,0,0,0,0,0.0,213993441.0,0.0,0.005,,,0,0
21281,Brazil,2020-02-27,1.0,0,0,0,0,0,0.0,213993441.0,0.0,0.0,,,0,0
21282,Brazil,2020-02-28,1.0,0,0,0,0,0,0.0,213993441.0,0.0,0.0,,,0,0
21283,Brazil,2020-02-29,2.0,1,0,0,0,0,0.0,213993441.0,0.0,0.005,,,0,0
21284,Brazil,2020-03-01,2.0,0,0,0,0,0,0.0,213993441.0,0.0,0.0,,,0,0


In [39]:
fig = px.line(dataset_brazil, x = 'date', y = ['total_cases', 'new_cases'], title = "Covid-19 trend in Brazil")

fig.show()

In [40]:
#United States of America

dataset_USA = dataset_df_sort[dataset_df_sort.location == 'United States']
dataset_USA.head()

Unnamed: 0,location,date,total_cases,new_cases,total_deaths,new_deaths,new_tests,total_tests,total_tests_per_thousand,population,new_tests_per_thousand,new_cases_per_million,new_vaccinations,total_vaccinations,hosp_patients,people_vaccinated
157007,United States,2020-01-22,1.0,0,0,0,0,0,0.0,332915074.0,0.0,0.0,,,0,0
157008,United States,2020-01-23,1.0,0,0,0,0,0,0.0,332915074.0,0.0,0.0,,,0,0
157009,United States,2020-01-24,2.0,1,0,0,0,0,0.0,332915074.0,0.0,0.003,,,0,0
157010,United States,2020-01-25,2.0,0,0,0,0,0,0.0,332915074.0,0.0,0.0,,,0,0
157011,United States,2020-01-26,5.0,3,0,0,0,0,0.0,332915074.0,0.0,0.009,,,0,0


In [41]:
fig = px.line(dataset_USA, x = 'date', y = ['total_cases', 'new_cases'], title = "Covid-19 trend in United States of America")

fig.show()

In [42]:
#Germany

dataset_germany = dataset_df_sort[dataset_df_sort.location == 'Germany']
dataset_germany.head()

Unnamed: 0,location,date,total_cases,new_cases,total_deaths,new_deaths,new_tests,total_tests,total_tests_per_thousand,population,new_tests_per_thousand,new_cases_per_million,new_vaccinations,total_vaccinations,hosp_patients,people_vaccinated
57435,Germany,2020-01-27,1.0,1,0,0,0,0,0.0,83900471.0,0.0,0.012,,,0,0
57436,Germany,2020-01-28,4.0,3,0,0,0,0,0.0,83900471.0,0.0,0.036,,,0,0
57437,Germany,2020-01-29,4.0,0,0,0,0,0,0.0,83900471.0,0.0,0.0,,,0,0
57438,Germany,2020-01-30,4.0,0,0,0,0,0,0.0,83900471.0,0.0,0.0,,,0,0
57439,Germany,2020-01-31,5.0,1,0,0,0,0,0.0,83900471.0,0.0,0.012,,,0,0


In [43]:
fig = px.line(dataset_germany, x = 'date', y = ['total_cases', 'new_cases'], title = "Covid-19 trend in Germany")

fig.show()

In [44]:
#Russia

dataset_russia = dataset_df_sort[dataset_df_sort.location == 'Russia']
dataset_russia.head()

Unnamed: 0,location,date,total_cases,new_cases,total_deaths,new_deaths,new_tests,total_tests,total_tests_per_thousand,population,new_tests_per_thousand,new_cases_per_million,new_vaccinations,total_vaccinations,hosp_patients,people_vaccinated
123585,Russia,2020-01-31,2.0,2,0,0,0,0,0.0,145912022.0,0.0,0.014,,,0,0
123586,Russia,2020-02-01,2.0,0,0,0,0,0,0.0,145912022.0,0.0,0.0,,,0,0
123587,Russia,2020-02-02,2.0,0,0,0,0,0,0.0,145912022.0,0.0,0.0,,,0,0
123588,Russia,2020-02-03,2.0,0,0,0,0,0,0.0,145912022.0,0.0,0.0,,,0,0
123589,Russia,2020-02-04,2.0,0,0,0,0,0,0.0,145912022.0,0.0,0.0,,,0,0


In [45]:
fig = px.line(dataset_russia, x = 'date', y = ['total_cases', 'new_cases'], title = "Covid-19 trend in Russia")

fig.show()
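
The country-specific dataframes above all follow the same filter pattern, so they could also be built in one step. A minimal sketch on made-up rows, assuming a dictionary keyed by country name is acceptable (`selected` and `country_frames` are names introduced here, not taken from the notebook):

```python
import pandas as pd

# Made-up rows standing in for dataset_df_sort
toy = pd.DataFrame({
    "location": ["India", "Italy", "India", "Brazil"],
    "date": ["2020-01-30", "2020-01-31", "2020-01-31", "2020-02-26"],
    "new_cases": [1, 2, 0, 1],
})

selected = ["India", "Italy", "Brazil"]
# One filtered dataframe per country, keyed by country name
country_frames = {name: toy[toy["location"] == name] for name in selected}
print(country_frames["India"])
```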

In [46]:
#Function to visualise the effect of a Covid-19 lockdown

def Visualise_effect(dataset, start_date, end_date, country_name):
    fig = px.line(dataset, x = 'date', y = ['new_cases'])
    y_max = dataset['new_cases'].max()

    # draw a vertical marker and a label at the start and at the end of the lockdown
    for marker_date, colour, label in [(start_date, 'red', 'Start date'),
                                       (end_date, 'yellow', 'End date')]:
        fig.add_shape(type = "line",
                      x0 = marker_date, y0 = 0,
                      x1 = marker_date, y1 = y_max,
                      line = dict(color = colour))
        fig.add_annotation(x = marker_date, y = 0, text = label)

    fig.update_layout(title = f'Country: {country_name}',
                      title_x = 0.5,
                      title_font = dict(size= 18, color = 'Purple'),
                      xaxis = dict(title = 'Date'),
                      yaxis = dict(title = 'New_cases'))

    return fig.show()



In [47]:
#Create a new dataframe grouping by location
df3 = dataset_df_sort.groupby("location")[["new_cases_per_million", 'new_tests_per_thousand', 'new_cases', 'new_tests']].sum().sort_values(by = "new_cases", ascending = False).reset_index().head(30)

In [48]:
df3

Unnamed: 0,location,new_cases_per_million,new_tests_per_thousand,new_cases,new_tests
0,United States,238050.232,2464.96,79250508,820619379
1,India,30829.056,540.292,42957477,752852156
2,Brazil,134924.557,172.928,28872970,37005491
3,France,347584.573,3673.114,23434848,247648632
4,United Kingdom,268764.272,6831.574,18331636,465962237
5,Russia,113314.398,1069.24,16533932,156015815
6,Germany,187305.844,0.0,15715048,0
7,Turkey,158398.049,1712.988,13470603,145676988
8,Italy,214503.063,3129.931,12949007,188946370
9,Spain,239281.491,1835.043,11185264,85779178


In [49]:
#create a new dataframe with columns: date, location, new_vaccinations, hosp_patients, new_cases
dataset_vh = dataset_df_sort[['date' , 'location','new_vaccinations', 'hosp_patients', 'new_cases']].copy()

In [50]:
dataset_vh

Unnamed: 0,date,location,new_vaccinations,hosp_patients,new_cases
145953,2020-01-22,Taiwan,,0,0
31490,2020-01-22,China,,0,0
148136,2020-01-22,Thailand,,0,0
157007,2020-01-22,United States,,0,0
139286,2020-01-22,South Korea,,0,0
...,...,...,...,...,...
138543,2022-03-04,South Africa,,0,1898
44154,2022-03-04,Ecuador,,0,0
43420,2022-03-04,Dominican Republic,,0,155
46340,2022-03-04,Equatorial Guinea,,0,0


In [51]:
# drop the rows for which the new_vaccination column has null values
dataset_vh.dropna(subset = ['new_vaccinations'], inplace = True)

In [52]:
dataset_vh

Unnamed: 0,date,location,new_vaccinations,hosp_patients,new_cases
113268,2020-12-03,Norway,0.0,111,402
113269,2020-12-04,Norway,0.0,128,430
113270,2020-12-05,Norway,0.0,131,272
113271,2020-12-06,Norway,0.0,132,250
113272,2020-12-07,Norway,0.0,151,380
...,...,...,...,...,...
38309,2022-03-04,Curacao,114.0,0,51
144494,2022-03-04,Sweden,30103.0,1211,2030
140058,2022-03-04,South Korea,113037.0,0,254326
47815,2022-03-04,Estonia,1202.0,0,3311


In [53]:
#group the dataframe by date
df_vh = dataset_vh.groupby("date")[['new_vaccinations', 'hosp_patients', 'new_cases']].sum().sort_values(by = 'date', ascending = False).reset_index()

In [54]:
df_vh

Unnamed: 0,date,new_vaccinations,hosp_patients,new_cases
0,2022-03-04,10108170.0,23150,906731
1,2022-03-03,11733075.0,64814,1246530
2,2022-03-02,10621255.0,104489,1129848
3,2022-03-01,10881861.0,108439,1120892
4,2022-02-28,14309124.0,117667,1037997
...,...,...,...,...
452,2020-12-07,0.0,151,380
453,2020-12-06,0.0,132,250
454,2020-12-05,0.0,131,272
455,2020-12-04,0.0,128,430


### Visualise the COVID-19 transmission in the world - Map view


In [55]:
df1 = dataset_df_sort.groupby("location")[["new_cases"]].sum().sort_values("new_cases",ascending=False).reset_index()

fig = px.choropleth(df1, 
                    locations = 'location',
                    locationmode = 'country names',
                    color = 'new_cases',
                    hover_name = 'location',
                    color_continuous_scale = 'Twilight',
                    hover_data = ['new_cases'])

fig.update_layout(title = 'World - Covid-19 Cases',
                  title_x = 0.5,
                  title_font = dict(size = 18, color = 'Darkblue'),
                  geo = dict(showframe = False,
                             showcoastlines = False,
                             projection_type = 'equirectangular'
                            ))
fig.show()

From the above map, we can see that India, the United States, Brazil and Russia are among the countries most affected by the COVID-19 pandemic.
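The ranking read off the map can also be checked numerically. The snippet below is a minimal sketch under assumptions: the helper `top_locations` and the toy numbers are hypothetical, and only the `location` and `new_cases` column names come from the dataset. The Our World in Data file also contains aggregate rows such as `World`, so the sketch excludes them in case they are still present.

```python
import pandas as pd

# Toy stand-in for dataset_df_sort; the numbers are made up for illustration.
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B", "C", "World"],
    "new_cases": [10, 20, 5, 5, 50, 90],
})

def top_locations(df, n=5, exclude=("World",)):
    """Return the n locations with the highest summed new_cases (hypothetical helper)."""
    totals = (
        df[~df["location"].isin(exclude)]   # drop aggregate rows
        .groupby("location")["new_cases"]
        .sum()
    )
    return totals.nlargest(n)

top = top_locations(toy, n=2)
print(top)
```

On the real data the call would be something like `top_locations(dataset_df_sort, n=5)`, assuming that frame has the two columns used here.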

##  Analyse and Visualise the impact of COVID-19 Lockdown measures on the spread of COVID-19 infection - whether the infection rate increased or decreased.

### Visualisation

In [56]:
# Lockdown start and end dates

india_firstphse_start_date = '2020-03-23'
india_firstphse_end_date = '2020-05-30'
india_second_phase_start_date = '2020-09-01'
india_second_phase_end_date = '2020-10-31'

Germany_lockdown_start_date = '2020-03-22' 
Germany_lockdown_end_date = '2020-05-04'

Italy_lockdown_start_date = '2020-03-09' 
Italy_lockdown_end_date = '2020-05-18'

Russia_lockdown_start_date = '2020-03-28' 
Russia_lockdown_end_date = '2020-05-31'

UK_lockdown_firstphase_start_date = '2020-03-23'
UK_lockdown_firstphase_end_date = '2020-05-10'
UK_lockdown_secondphase_start_date = '2020-11-05'
UK_lockdown_secondphase_end_date = '2020-12-02'

USA_lockdown_firstphase_start_date = '2020-03-19'
USA_lockdown_firstphase_end_date = '2020-05-29'
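The date strings above are easy to mistype, so it can help to parse them once and check the resulting durations. This is a minimal sketch, not the notebook's method: the `lockdowns` mapping and the `durations` name are hypothetical, and only three of the windows are shown (with Germany's end date zero-padded).

```python
import pandas as pd

# Hypothetical mapping holding some of the lockdown windows defined above.
lockdowns = {
    "Germany": ("2020-03-22", "2020-05-04"),
    "Italy": ("2020-03-09", "2020-05-18"),
    "Russia": ("2020-03-28", "2020-05-31"),
}

# Parsing raises an error on a malformed date instead of silently
# comparing strings later on.
parsed = {
    name: (pd.Timestamp(start), pd.Timestamp(end))
    for name, (start, end) in lockdowns.items()
}

# Length of each lockdown window in days.
durations = {name: (end - start).days for name, (start, end) in parsed.items()}
print(durations)
```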


In [57]:
Visualise_effect(dataset_india, india_firstphse_start_date, india_firstphse_end_date, "India - First Phase")
Visualise_effect(dataset_india, india_second_phase_start_date, india_second_phase_end_date, "India - Second Phase")
Visualise_effect(dataset_germany, Germany_lockdown_start_date, Germany_lockdown_end_date, "Germany")
Visualise_effect(dataset_Italy, Italy_lockdown_start_date, Italy_lockdown_end_date, "Italy")
Visualise_effect(dataset_russia, Russia_lockdown_start_date, Russia_lockdown_end_date, "Russia")
Visualise_effect(dataset_UK, UK_lockdown_firstphase_start_date, UK_lockdown_firstphase_end_date, "UK - First Phase")
Visualise_effect(dataset_UK, UK_lockdown_secondphase_start_date, UK_lockdown_secondphase_end_date, "UK - Second Phase")
Visualise_effect(dataset_USA, USA_lockdown_firstphase_start_date, USA_lockdown_firstphase_end_date, "USA")

### Explanation of Results

The aim of this objective is to visualise and analyse the impact of COVID-19 lockdown measures on the spread of infection, that is, whether imposing strict restrictions can control it. For this analysis we have chosen India, Russia, UK, Germany, Italy, Brazil and the United States. We plotted a separate graph for each country, as this makes the plots easier to read.

The government of India imposed the first phase of a nationwide COVID-19 lockdown from 23 March 2020. By that time 499 people had been infected with COVID-19 and the infection rate was around 100 cases per day. As the graph shows, there was no decline in the infection rate by the end of the first phase. However, considering that the population of India is close to 1.3 billion, the lockdown measures were successful in limiting the rapid spread of infection: the daily count of new cases did not cross 9,000. A second phase of lockdown was imposed in India for two months, from 1 September to 31 October, and it reduced the infection rate by 45%. The United States shows the same trend as India's first phase, where the lockdown only helped to limit the rapid spread of infection.

The lockdown measures in Germany and Italy show a significant reduction in the infection rate. In Germany, the graph shows close to a 90% reduction in the infection rate, and Italy shows the same result. In Russia, the lockdown measures initially had no effect on the spread of infection, but from about halfway through the lockdown phase they started showing results.
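A figure such as the 45% or 90% reduction quoted above can be checked by comparing mean daily new cases at the start and at the end of a lockdown window. The following is a minimal sketch under assumptions: the helper `lockdown_change`, the 7-day averaging window and the toy data are hypothetical and not part of the original analysis; only the `date` and `new_cases` column names come from the dataset.

```python
import pandas as pd

def lockdown_change(df, start, end, window=7):
    """Percent change in mean daily new_cases between the first and the
    last `window` days of the lockdown period (hypothetical helper)."""
    data = df.assign(date=pd.to_datetime(df["date"]))
    inside = data[data["date"].between(pd.Timestamp(start), pd.Timestamp(end))]
    inside = inside.sort_values("date")
    first = inside["new_cases"].head(window).mean()
    last = inside["new_cases"].tail(window).mean()
    if first == 0:
        return float("nan")  # a change relative to zero is undefined
    return (last - first) / first * 100

# Toy data: 1000 cases/day for two weeks, then 100 cases/day for two weeks.
toy = pd.DataFrame({
    "date": pd.date_range("2020-03-01", periods=28).strftime("%Y-%m-%d"),
    "new_cases": [1000] * 14 + [100] * 14,
})

change = lockdown_change(toy, "2020-03-01", "2020-03-28")
print(f"{change:.1f}%")
```

Averaging over a week rather than comparing single days smooths out weekday reporting effects; the window length is a design choice, not something the dataset prescribes.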

## Compare the effect of total number of tests done in a country on the increase in the number of COVID-19 cases

### Visualisation

In [58]:
fig = px.scatter(df3, x="new_cases_per_million", y="new_tests_per_thousand", color="location",
                  hover_data=['location'])
fig.update_layout(title = 'Cases per million vs tests per thousand',
                  title_x = 0.5,
                  title_font = dict(size= 18, color = 'Purple'),
                  xaxis = dict(title = 'Cases per million'),
                  yaxis = dict(title = 'Tests per thousand'))
fig.show()

### Explanation of Results

The above scatter plot compares COVID-19 cases per million population with tests per thousand people (the `new_cases_per_million` and `new_tests_per_thousand` columns of `df3`) across countries.

We found that high-testing countries had more cases per million than low-testing countries. For low-testing countries, however, there was a positive correlation between the testing level and the number of cases per million. This suggests that high-testing countries tested in a preventive manner, while low-testing countries may have more cases than those confirmed. From the scatter plot we can also observe that the United Kingdom, which has done the highest number of tests per thousand population, has a significantly high positivity rate.
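The positivity rate mentioned above is the share of tests that come back positive, which can be approximated as confirmed cases divided by tests. Below is a minimal sketch with made-up numbers for two hypothetical locations; the `new_cases` and `new_tests` column names come from the dataset, everything else is an assumption for illustration.

```python
import pandas as pd

# Toy data: made-up daily figures for two hypothetical locations.
toy = pd.DataFrame({
    "location": ["High-test", "High-test", "Low-test", "Low-test"],
    "new_cases": [200, 300, 150, 250],
    "new_tests": [10000, 10000, 1000, 1000],
})

# Sum over the whole period, then divide cases by tests.
totals = toy.groupby("location")[["new_cases", "new_tests"]].sum()
totals["positivity_rate"] = totals["new_cases"] / totals["new_tests"]
print(totals)
```

Summing before dividing avoids averaging daily ratios, which would give days with very few tests the same weight as days with many.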

## Visualise and analyse the impact of COVID-19 vaccination on the hospitalisation of infected patients

### Visualisation

In [59]:
# trendline = 'ols' requires the statsmodels package to be installed
fig = px.scatter(df_vh, x="new_vaccinations", y="hosp_patients",
                  hover_data=['date'], trendline = 'ols', trendline_color_override="red", color = "new_cases")
fig.update_layout(title = 'Vaccinations vs Hospitalisation',
                  title_x = 0.5,
                  title_font = dict(size= 18, color = 'Purple'),
                  xaxis = dict(title = 'New vaccinations'),
                  yaxis = dict(title = 'Hospital patients'))
fig.show()

### Explanation of Results

The above scatter plot helps us analyse the impact of COVID-19 vaccinations on the hospitalisation of COVID-19 patients. The x-axis shows the new vaccinations recorded on a date (`new_vaccinations`, summed across countries), the y-axis shows the number of patients in hospital on that date (`hosp_patients`), and the colour represents the new COVID-19 cases on that date. As the plot shows, before the vaccine roll-out a large share of infected people needed to be hospitalised. The trendline shows a drop in hospitalisations as more people get vaccinated. From the scatter plot we can also infer that the hospitalisation rate per case declines as more people are vaccinated.
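The two quantities discussed above, the trendline slope and the hospitalisation rate per case, can also be computed directly. This is a minimal sketch on made-up numbers: `toy_vh` is a hypothetical stand-in for `df_vh`, and `np.polyfit` is used here as a simple substitute for the OLS trendline drawn by Plotly.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_vh; the numbers are made up for illustration.
toy_vh = pd.DataFrame({
    "new_vaccinations": [0.0, 1000.0, 2000.0, 3000.0],
    "hosp_patients": [400, 300, 200, 100],
    "new_cases": [1000, 1000, 1000, 1000],
})

# Straight-line least-squares fit: hosp_patients ~ slope * new_vaccinations + intercept
slope, intercept = np.polyfit(
    toy_vh["new_vaccinations"], toy_vh["hosp_patients"], deg=1
)

# Hospital patients per new case on each date (needs new_cases > 0)
toy_vh["hosp_per_case"] = toy_vh["hosp_patients"] / toy_vh["new_cases"]

print(slope, intercept)
print(toy_vh)
```

A negative slope corresponds to the downward trendline in the plot. Because `df_vh` sums over whichever countries reported on each date, the ratio is only comparable between dates with similar reporting coverage.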
