<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

# <center><b>Covid-19 Analysis</b></center>

---
# **Table of Contents**
---

**1.** [**Introduction**](#Section1)<br>
**2.** [**Problem Statement**](#Section2)<br>
**3.** [**Installing & Importing Libraries**](#Section3)<br>
  - **3.1** [**Installing Libraries**](#Section31)
  - **3.2** [**Upgrading Libraries**](#Section32)
  - **3.3** [**Importing Libraries**](#Section33)

**4.** [**Data Acquisition & Description**](#Section4)<br>
  - **4.1** [**Data Description**](#Section41)
  - **4.2** [**Data Information**](#Section42)

**5.** [**Data Pre-profiling**](#Section5)<br>
**6.** [**Data Cleaning**](#Section6)<br>
**7.** [**Data Post-profiling**](#Section7)<br>
**8.** [**Exploratory Data Analysis**](#Section8)<br>
**9.** [**Summarization**](#Section9)<br>

---
<a name = Section1></a>
# **1. Introduction**
---

- The **COVID-19 pandemic** has led to a dramatic loss of human life worldwide and presents an unprecedented challenge to **public health**, **food systems**, and the **world of work**.

- The **economic** and **social disruption** caused by the pandemic is devastating.

<center><img src="https://img2.chinadaily.com.cn/images/202003/17/5e705260a3101282065e02c4.jpeg" width=50%></center>

- Now is the time for **global solidarity** and **support**, especially with the most vulnerable in our societies, particularly in the emerging and developing world.

- With the advent of various vaccines, the world is trying to stop the progress and re-infection of the virus.

- In this analysis, we will see the **global effect** of the COVID-19 pandemic and the **progress of vaccinating** the entire globe against its spread.

---
<a name = Section2></a>
# **2. Problem Statement**
---

- **F.E.A.S.T.** (Food, Emergency Aid, Shelter and Training) is a charity to help the homeless and underprivileged, established around the globe.

- They have **shelters** established across **various countries** where they provide **food**, **shelter**, and **workforce training** for homeless and jobless people.

<center><img src="https://library.kissclipart.com/20181210/ktw/kissclipart-volunteer-hands-clipart-volunteering-non-profit-or-349928f96e9ae55e.png" width=50%></center>

- Due to the COVID-19 pandemic, they had to **stop their operations** and hand over their **shelters** to **aid the medical bodies**.

- After the worst has passed, they plan on **resuming their operations**, but they want to determine which countries would be the right candidates to resume operations.

- To determine this, they have hired you, a data scientist, to analyze the global effects of the pandemic and the vaccination progress across various countries.

- You have been provided with a dataset from **worldometer** and **WHO** which can be used for the required analysis.


---
<a name = Section3></a>
# **3. Installing & Importing Libraries**
---

<a name = Section31></a>
### **3.1 Installing Libraries**

In [None]:
!pip install -q datascience                                         # Package that is required by pandas profiling
!pip install -q pandas-profiling                                    # Library to generate basic statistics about data


<a name = Section32></a>
### **3.2 Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the upgraded versions are loaded.

- Make sure not to execute the install cell above (3.1) and the upgrade cell below (3.2) again after restarting the runtime.
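Once the upgrade cell has run and the runtime has been restarted, one way to confirm which versions are actually installed is to query the package metadata. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages touched by the install/upgrade cells in this section
for name in ["datascience", "pandas-profiling", "plotly", "pyyaml"]:
    print(f"{name}: {installed_version(name)}")
```

A value of `None` would suggest the package is missing from the current environment, while a version string lets you compare against what the upgrade cell was expected to install.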

In [None]:
!pip install -q --upgrade datascience                               # Package that is required by pandas profiling
!pip install -q --upgrade pandas-profiling                          # Library to generate basic statistics about data
!pip install --upgrade -q plotly
!pip install -q pyyaml==5.4.1

<a name = Section33></a>
### **3.3 Importing Libraries**

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
from pandas_profiling import ProfileReport                          # To perform data profiling
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # For numerical python operations
#-------------------------------------------------------------------------------------------------------------------------------
%matplotlib inline
import matplotlib.pyplot as plt                                     # A popular plotting library used along with pandas
import seaborn as sns                                               # A library, built on matplotlib, for beautiful plots
import plotly.graph_objs as go                                      # For interactive graphs
import plotly.express as px                                         # For interactive graphs
#-------------------------------------------------------------------------------------------------------------------------------
import random                                                       # To shuffle lists
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warning to disable runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings
#-------------------------------------------------------------------------------------------------------------------------------
from collections import Counter                                     # For Collections
#-------------------------------------------------------------------------------------------------------------------------------
import math                                                         # For Math Operations
#-------------------------------------------------------------------------------------------------------------------------------
from datetime import date, timedelta, datetime                      # To use timeseries data



---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---

- A high-level overview of the 3 datasets:

|Dataset| Records | Features | Dataset Size |
| :--: | :--: | :--: | :--: |
| worldometer_coronavirus_summary_data | 221 | 12 | 20 KB |

<br>

- The worldometer_coronavirus_summary_data dataset consists of the following features:

<details>

**<summary>Expand</summary>**

|ID|Feature name|Feature description|
|:--|:--|:--|
|1|**country**| Country |
|2|**continent**| Continent |
|3|**total_confirmed**| Total Cases Confirmed |
|4|**total_deaths**| Total Fatalities due to Covid |
|5|**total_recovered**| Total Cases recovered |
|6|**active_cases**| Number of Active Cases |
|7|**serious_or_critical**| Number of serious cases |
|8|**total_cases_per_1m_population**| Confirmed cases per one million inhabitants of the country as of the current date |
|9|**total_deaths_per_1m_population**| Deaths per one million inhabitants of the country as of the current date |
|10|**total_tests**| Total number of tests performed by the country |
|11|**total_tests_per_1m_population**| Tests performed per one million inhabitants of the country as of the current date |
|12|**population**| Population of the country |

<br>

</details>
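The per-1M-population columns are plain ratios scaled to one million inhabitants. As a small worked example, the sketch below recomputes the value for Afghanistan from the counts shown in the summary dataset preview; the helper name `per_million` is hypothetical.

```python
def per_million(count: float, population: int) -> float:
    """Scale a raw count to a rate per one million inhabitants."""
    return count / population * 1_000_000


# Afghanistan row from the summary dataset preview
total_confirmed = 157412
population = 40169237

print(round(per_million(total_confirmed, population)))
```

Rounded to the nearest integer, this reproduces the `total_cases_per_1m_population` value of 3919 listed for Afghanistan.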

---

|Dataset| Records | Features | Dataset Size |
| :--: | :--: | :--: | :--: |
| worldometer_coronavirus_daily_data | 145221 | 7 | 6.87 MB |

<br>

- The worldometer_coronavirus_daily_data dataset consists of the following features:

<details>

**<summary>Expand</summary>**

|ID|Feature name|Feature description|
|:--|:--|:--|
|1|**date**| Date Registered |
|2|**country**| Country |
|3|**cumulative_total_cases**| Total cases till that day for the country |
|4|**daily_new_cases**| New cases for corresponding date |
|5|**active_cases**| Active cases on the corresponding date |
|6|**cumulative_total_deaths**| Total deaths till that corresponding date |
|7|**daily_new_deaths**| New fatalities registered for the corresponding date |

</details>

<br>

---

|Dataset| Records | Features | Dataset Size |
| :--: | :--: | :--: | :--: |
| country_vaccinations | 61820 | 15 | 12.1 MB |

<br>

- The vaccination dataset consists of the following features:

<details>

**<summary>Expand</summary>**

<center>

|ID|Feature name|Feature description|
|:--|:--|:--|
|1|**country**| this is the country for which the vaccination information is provided; |
|2|**iso_code**| ISO code for the country;  |
|3|**date**| date for the data entry; for some of the dates we have only the daily vaccinations, for others, only the (cumulative) total;  |
|4|**total_vaccinations**| this is the absolute number of total immunizations in the country;  |
|5|**people_vaccinated**| a person, depending on the immunization scheme, will receive one or more (typically 2) vaccines; at a certain moment,<br>the number of vaccination might be larger than the number of people;  |
|6|**people_fully_vaccinated**| this is the number of people that received the entire set of immunization according to the immunization scheme (typically 2) at a certain moment in time, there might<br>be a certain number of people that received one vaccine and another number (smaller) of people that received all vaccines in the scheme; |
|7|**daily_vaccinations_raw**| for a certain data entry, the raw daily change in the total number of vaccinations for that date/country; |
|8|**daily_vaccinations**| for a certain data entry, the number of vaccinations for that date/country (7-day smoothed);  |
|9|**total_vaccinations_per_hundred**| ratio (in percent) between vaccination number and total population up to the date in the country;  |
|10|**people_vaccinated_per_hundred**| ratio (in percent) between population immunized and total population up to the date in the country;  |
|11|**people_fully_vaccinated_per_hundred**| ratio (in percent) between population fully immunized and total population up to the date in the country;  |
|12|**daily_vaccinations_per_million**| ratio (in ppm) between vaccination number and total population for the current date in the country;  |
|13|**vaccines**| names of the vaccines used in the country (up to date);  |
|14|**source_name**| source of the information (national authority, international organization, local organization etc.);  |
|15|**source_website**| website of the source of information; |

</center>

</details>

---


- **Summary Dataset**:

In [None]:
summary_df = pd.read_csv("/content/worldometer_coronavirus_summary_data.csv")
print('Shape of Summary Data:', summary_df.shape)
summary_df.head()

Shape of Summary Data: (221, 12)


Unnamed: 0,country,continent,total_confirmed,total_deaths,total_recovered,active_cases,serious_or_critical,total_cases_per_1m_population,total_deaths_per_1m_population,total_tests,total_tests_per_1m_population,population
0,Afghanistan,Asia,157412,7311.0,140597.0,9504.0,1124.0,3919,182.0,794668.0,19783.0,40169237
1,Albania,Europe,200639,3104.0,190902.0,6633.0,23.0,69828,1080.0,1398152.0,486595.0,2873339
2,Algeria,Africa,211112,6089.0,144909.0,60114.0,15.0,4694,135.0,230861.0,5133.0,44972408
3,Andorra,Europe,18010,132.0,16162.0,1716.0,4.0,232567,1705.0,193595.0,2499935.0,77440
4,Angola,Africa,65208,1735.0,63263.0,210.0,3.0,1900,51.0,1190871.0,34707.0,34311823


- **Daily Progress Dataset**:

In [None]:
daily_df = pd.read_csv("/content/worldometer_coronavirus_daily_data.csv")
print('Shape of Daily Cases Data:', daily_df.shape)
daily_df.head()

Shape of Daily Cases Data: (145221, 7)


Unnamed: 0,date,country,cumulative_total_cases,daily_new_cases,active_cases,cumulative_total_deaths,daily_new_deaths
0,2020-2-15,Afghanistan,0.0,,0.0,0.0,
1,2020-2-16,Afghanistan,0.0,,0.0,0.0,
2,2020-2-17,Afghanistan,0.0,,0.0,0.0,
3,2020-2-18,Afghanistan,0.0,,0.0,0.0,
4,2020-2-19,Afghanistan,0.0,,0.0,0.0,


- **Vaccine Dataset**:

In [None]:
vacc_df = pd.read_csv('/content/country_vaccinations.csv')
print('Dataset Shape:', vacc_df.shape)
vacc_df.head()

Dataset Shape: (61820, 15)


Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,0.0,0.0,,,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing",World Health Organization,https://reliefweb.int/sites/reliefweb.int/files/resources/weekly-epidemiological-bulletin_w47.pdf
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,,,,34.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing",World Health Organization,https://reliefweb.int/sites/reliefweb.int/files/resources/weekly-epidemiological-bulletin_w47.pdf
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,,,,34.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing",World Health Organization,https://reliefweb.int/sites/reliefweb.int/files/resources/weekly-epidemiological-bulletin_w47.pdf
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,,,,34.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing",World Health Organization,https://reliefweb.int/sites/reliefweb.int/files/resources/weekly-epidemiological-bulletin_w47.pdf
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,,,,34.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing",World Health Organization,https://reliefweb.int/sites/reliefweb.int/files/resources/weekly-epidemiological-bulletin_w47.pdf


- Our primary focus is on the **vaccine** and **summary** datasets.

<a name = Section41></a>
### **4.1 Data Description**

- In this section we will get **information about the data** and see some observations.

- **Vaccine Dataset**:

In [None]:
vacc_df.describe()

Unnamed: 0,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
count,33283.0,31679.0,28875.0,27289.0,61481.0,33283.0,31679.0,28875.0,61481.0
mean,30274820.0,12456600.0,8790462.0,264471.7,132942.3,58.45763,33.382572,26.87345,3473.981572
std,159353800.0,49030230.0,33287030.0,1242607.0,816510.0,53.626851,27.175492,25.258668,4160.734058
min,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,349858.0,252629.5,144805.0,5218.0,989.0,9.02,6.88,3.65,647.0
50%,2309864.0,1471101.0,1039638.0,25355.0,7371.0,43.16,28.8,18.79,2251.0
75%,10897200.0,6017616.0,5039238.0,122420.0,42540.0,101.845,58.145,48.265,5055.0
max,2543424000.0,1225000000.0,1110506000.0,24741000.0,22424290.0,296.07,121.68,118.2,117497.0


**Observations:**

- **total_vaccinations** ranges from **0.0** to **2543424000.0**, averaging at **30274819.73**.

- **people_vaccinated** ranges from **0.0** to **1225000000.0**, averaging at **12456601.16**.

- **people_fully_vaccinated** ranges from **1.0** to **1110506000.0**, averaging at **8790462.15**.

- **daily_vaccinations_raw** ranges from **0.0** to **24741000.0**, averaging at **264471.68**.

- **daily_vaccinations** ranges from **0.0** to **22424286.0**, averaging at **132942.28**.

- **total_vaccinations_per_hundred** ranges from **0.0** to **296.07**, averaging at **58.46**.

- **people_vaccinated_per_hundred** ranges from **0.0** to **121.68**, averaging at **33.38**.

- **people_fully_vaccinated_per_hundred** ranges from **0.0** to **118.2**, averaging at **26.87**.

- **daily_vaccinations_per_million** ranges from **0.0** to **117497.0**, averaging at **3473.98**.

- **Summary Dataset**:

In [None]:
summary_df.describe()

Unnamed: 0,total_confirmed,total_deaths,total_recovered,active_cases,serious_or_critical,total_cases_per_1m_population,total_deaths_per_1m_population,total_tests,total_tests_per_1m_population,population
count,221.0,210.0,214.0,214.0,157.0,221.0,210.0,208.0,208.0,221.0
mean,1198619.0,25023.752381,1107483.0,95880.59,550.732484,60470.135747,952.666667,20831100.0,1339148.0,35594250.0
std,4534845.0,85196.027853,4033162.0,664843.9,1637.323165,58313.661751,1010.324896,77551530.0,2297144.0,140174600.0
min,1.0,1.0,1.0,0.0,1.0,9.0,2.0,2989.0,3279.0,804.0
25%,16000.0,245.0,12214.0,336.0,7.0,5053.0,122.25,258204.5,111576.2,628179.0
50%,107148.0,2022.0,100147.5,2967.5,43.0,49765.0,627.0,1800836.0,562229.5,6532613.0
75%,591885.0,11799.5,564893.2,30330.5,347.0,99380.0,1519.5,10595480.0,1471086.0,23877920.0
max,49741460.0,806651.0,39397680.0,9537132.0,13485.0,251642.0,5987.0,757810200.0,16317780.0,1439324000.0


**Observations:**

- **total_confirmed** ranges from **1** to **49741464**, averaging at **1198619.44**.

- **total_deaths** ranges from **1.0** to **806651.0**, averaging at **25023.75**.

- **total_recovered** ranges from **1.0** to **39397681.0**, averaging at **1107483.02**.

- **active_cases** ranges from **0.0** to **9537132.0**, averaging at **95880.59**.

- **serious_or_critical** ranges from **1.0** to **13485.0**, averaging at **550.73**.

- **total_cases_per_1m_population** ranges from **9** to **251642**, averaging at **60470.14**.

- **total_deaths_per_1m_population** ranges from **2.0** to **5987.0**, averaging at **952.67**.

- **total_tests** ranges from **2989.0** to **757810159.0**, averaging at **20831096.79**.

- **total_tests_per_1m_population** ranges from **3279.0** to **16317777.0**, averaging at **1339148.32**.

- **population** ranges from **804** to **1439323776**, averaging at **35594252.44**.
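Observation bullets of this shape can be produced directly from the `describe()` output rather than typed by hand. The following is a minimal sketch on a toy frame; `toy_df` and the helper `observation` are hypothetical names, and the values are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for summary_df; the values are illustrative only
toy_df = pd.DataFrame({
    "total_confirmed": [100, 250, 400],
    "total_deaths": [1.0, 5.0, None],
})

stats = toy_df.describe()


def observation(column: str) -> str:
    """Format one 'ranges from ... averaging at ...' bullet from describe() output."""
    col = stats[column]
    low, high = float(col["min"]), float(col["max"])
    mean = round(float(col["mean"]), 2)
    return f"- **{column}** ranges from **{low}** to **{high}**, averaging at **{mean}**."


for name in stats.columns:
    print(observation(name))
```

Note that `describe()` ignores missing values, so the mean of a column with gaps is taken over the non-null entries only.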

<a name = Section42></a>
### **4.2 Data Information**

- In this section we will see the **information about the types of features**.

- **Vaccine Dataset**:

In [None]:
vacc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61820 entries, 0 to 61819
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   country                              61820 non-null  object 
 1   iso_code                             61820 non-null  object 
 2   date                                 61820 non-null  object 
 3   total_vaccinations                   33283 non-null  float64
 4   people_vaccinated                    31679 non-null  float64
 5   people_fully_vaccinated              28875 non-null  float64
 6   daily_vaccinations_raw               27289 non-null  float64
 7   daily_vaccinations                   61481 non-null  float64
 8   total_vaccinations_per_hundred       33283 non-null  float64
 9   people_vaccinated_per_hundred        31679 non-null  float64
 10  people_fully_vaccinated_per_hundred  28875 non-null  float64
 11  daily_vaccinations_per_milli

**Observations:**

- There are **9 float64 features** and **6 object data type** features.

- We will see the **profiling report** to get more information on these features.

- **Summary Dataset**:

In [None]:
summary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221 entries, 0 to 220
Data columns (total 12 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   country                         221 non-null    object 
 1   continent                       221 non-null    object 
 2   total_confirmed                 221 non-null    int64  
 3   total_deaths                    210 non-null    float64
 4   total_recovered                 214 non-null    float64
 5   active_cases                    214 non-null    float64
 6   serious_or_critical             157 non-null    float64
 7   total_cases_per_1m_population   221 non-null    int64  
 8   total_deaths_per_1m_population  210 non-null    float64
 9   total_tests                     208 non-null    float64
 10  total_tests_per_1m_population   208 non-null    float64
 11  population                      221 non-null    int64  
dtypes: float64(7), int64(3), object(2)
m

**Observations:**

- There are **3 int64**, **7 float64 features** and **2 object data type** features in summary dataframe.
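The dtype tally reported by `info()` can also be obtained programmatically, which helps when the column lists are needed later. The sketch below works on a toy frame (`toy_df` is a hypothetical stand-in, with illustrative values).

```python
import pandas as pd

# Toy frame with a text, an integer and a float column; values are illustrative only
toy_df = pd.DataFrame({
    "country": ["A", "B"],
    "total_confirmed": [10, 20],
    "total_deaths": [1.0, None],
})

# Number of columns per dtype, similar to the tally in the info() footer
dtype_counts = toy_df.dtypes.astype(str).value_counts()
print(dtype_counts)

# Column names split into numeric and non-numeric features
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
other_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)
```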

<a name = Section5></a>

---
# **5. Data Pre-Profiling**
---

- pandas profiling is very handy for a quick first look at the data.

- It generates profile reports from a pandas DataFrame.

- For each column, statistics are presented in an interactive HTML report.

In [None]:
profile = ProfileReport(df=daily_df)
profile.to_file(output_file='Pre Profiling Report - Daily.html')
print('Accomplished!')
profile

In [None]:
profile = ProfileReport(df=summary_df)
profile.to_file(output_file='Pre Profiling Report - Summary.html')
print('Accomplished!')
profile

In [None]:
profile = ProfileReport(df=vacc_df)
profile.to_file(output_file='Pre Profiling Report - Vaccine.html')
print('Accomplished!')
profile

In [None]:
vacc_df.isna().sum()

country                                    0
iso_code                                   0
date                                       0
total_vaccinations                     28537
people_vaccinated                      30141
people_fully_vaccinated                32945
daily_vaccinations_raw                 34531
daily_vaccinations                       339
total_vaccinations_per_hundred         28537
people_vaccinated_per_hundred          30141
people_fully_vaccinated_per_hundred    32945
daily_vaccinations_per_million           339
vaccines                                   0
source_name                                0
source_website                             0
dtype: int64

**Observations:**

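- Nine numeric features contain missing values; several are missing in roughly half of the 61820 rows (for example, **daily_vaccinations_raw** has 34531 missing values).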
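Raw missing counts are easier to judge as a share of all rows. The sketch below converts a few of the counts from the `isna().sum()` output above into percentages, using the 61820 rows reported for the vaccine dataset; the variable names are hypothetical.

```python
# Missing-value counts copied from the isna().sum() output above
total_rows = 61820
missing_counts = {
    "total_vaccinations": 28537,
    "people_vaccinated": 30141,
    "people_fully_vaccinated": 32945,
    "daily_vaccinations_raw": 34531,
    "daily_vaccinations": 339,
}

# Share of rows with a missing value, in percent
missing_pct = {col: round(n / total_rows * 100, 2) for col, n in missing_counts.items()}

# Print from most to least affected column
for col, pct in sorted(missing_pct.items(), key=lambda item: item[1], reverse=True):
    print(f"{col}: {pct}%")
```

On the DataFrame itself, the same percentages can be obtained with `vacc_df.isna().mean() * 100`.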

<a name = Section6></a>

---
# **6. Data Cleaning**
---

- In this section, we will perform the **cleaning** operations on the data using information from the previous section.

- We will first **standardize country names** between the vaccine and summary datasets.

- We will **combine** the **summary dataset** with a **subset of vaccine dataset**, based on **country** names since they both contain up-to-date cumulative data of the countries.

- Finally, we will **extract the date** based features from the daily cases dataset.
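Before renaming anything, it helps to see which country names differ between the two datasets. The sketch below uses plain Python sets on small hypothetical samples of the two `country` columns; the samples are assumptions for illustration, not the full lists.

```python
# Hypothetical samples of the country columns from the two datasets
vaccine_countries = {"Czechia", "United States", "United Kingdom", "India"}
summary_countries = {"Czech Republic", "USA", "UK", "India"}

# Names present in one dataset but not the other are candidates for renaming
only_in_vaccine = sorted(vaccine_countries - summary_countries)
only_in_summary = sorted(summary_countries - vaccine_countries)

print(only_in_vaccine)
print(only_in_summary)
```

With the real frames, the same comparison would be `set(vacc_df.country) - set(summary_df.country)` and its reverse.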

In [None]:
# Standardize country names in the vaccine dataset to match the summary dataset
vacc_df.country = vacc_df.country.replace({
    "Czechia": "Czech Republic", 
    "United States": "USA", 
    "United Kingdom": "UK", 
    "Isle of Man": "Isle Of Man",
    "Republic of Ireland": "Ireland",
    "Northern Cyprus" : "Cyprus"
})

# Drop these countries since they are already included in UK 
vacc_df = vacc_df[~vacc_df.country.isin(['England', 'Scotland', 'Wales', 'Northern Ireland'])]

In [None]:
# Function to aggregate a column to its maximum value per country
# (for cumulative columns, the maximum is the latest reported total)
def aggregate(df: pd.DataFrame, agg_col: str):
    
    data = df.groupby("country")[agg_col].max()
    data = pd.DataFrame(data)
    
    return data

# define the columns we want to summarize
cols_to_summarize = ['people_vaccinated', 
                     'people_vaccinated_per_hundred', 
                     'people_fully_vaccinated', 
                     'people_fully_vaccinated_per_hundred', 
                     'total_vaccinations_per_hundred', 
                     'total_vaccinations']

# We will join the 2 dataframes on country, so setting country as their index
summary = summary_df.set_index("country")
vaccines = vacc_df[['country', 'vaccines']].drop_duplicates().set_index('country')
summary = summary.join(vaccines)

for col in cols_to_summarize:   
    summary = summary.join(aggregate(vacc_df, col))

summary['percentage_vaccinated'] = (summary.people_fully_vaccinated / summary.population) * 100
summary['tested_positive'] = (summary.total_confirmed / summary.total_tests) * 100
summary.head()

Unnamed: 0_level_0,continent,total_confirmed,total_deaths,total_recovered,active_cases,serious_or_critical,total_cases_per_1m_population,total_deaths_per_1m_population,total_tests,total_tests_per_1m_population,population,vaccines,people_vaccinated,people_vaccinated_per_hundred,people_fully_vaccinated,people_fully_vaccinated_per_hundred,total_vaccinations_per_hundred,total_vaccinations,percentage_vaccinated,tested_positive
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
Afghanistan,Asia,157412,7311.0,140597.0,9504.0,1124.0,3919,182.0,794668.0,19783.0,40169237,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing",4397449.0,11.04,3566192.0,8.95,13.13,5228706.0,8.877918,19.808524
Albania,Europe,200639,3104.0,190902.0,6633.0,23.0,69828,1080.0,1398152.0,486595.0,2873339,"Oxford/AstraZeneca, Pfizer/BioNTech, Sinovac, Sputnik V",1075332.0,37.43,965964.0,33.62,73.51,2111797.0,33.61817,14.3503
Algeria,Africa,211112,6089.0,144909.0,60114.0,15.0,4694,135.0,230861.0,5133.0,44972408,"Oxford/AstraZeneca, Sinopharm/Beijing, Sinovac, Sputnik V",6740064.0,15.11,5380385.0,12.06,27.22,12145830.0,11.963747,91.445502
Andorra,Europe,18010,132.0,16162.0,1716.0,4.0,232567,1705.0,193595.0,2499935.0,77440,"Moderna, Oxford/AstraZeneca, Pfizer/BioNTech",54999.0,71.1,49535.0,64.04,135.14,104534.0,63.965651,9.302926
Angola,Africa,65208,1735.0,63263.0,210.0,3.0,1900,51.0,1190871.0,34707.0,34311823,Oxford/AstraZeneca,6774984.0,19.97,3072475.0,9.05,29.02,9847459.0,8.954566,5.475656
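The join above relies on two properties: `max()` skips missing values, and a cumulative column is assumed never to decrease, so its per-country maximum is the latest reported total. The sketch below shows this on toy data; `toy_vacc` and `toy_summary` are hypothetical frames with illustrative values.

```python
import pandas as pd

# Toy vaccination log: cumulative counts with gaps (NaN) on some dates
toy_vacc = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "date": ["2021-03-01", "2021-03-02", "2021-03-03", "2021-03-01", "2021-03-02"],
    "people_fully_vaccinated": [100.0, None, 250.0, 40.0, None],
})

# Toy summary table indexed by country, mirroring the join on country
toy_summary = pd.DataFrame(
    {"population": [1000, 400]},
    index=pd.Index(["A", "B"], name="country"),
)

# max() skips NaN, so each country keeps its largest reported cumulative value
latest = toy_vacc.groupby("country")[["people_fully_vaccinated"]].max()

toy_summary = toy_summary.join(latest)
toy_summary["percentage_vaccinated"] = (
    toy_summary["people_fully_vaccinated"] / toy_summary["population"] * 100
)
print(toy_summary)
```

If a source ever revised a cumulative figure downwards, the maximum would no longer equal the latest value, so sorting by date and taking the last non-null entry would be the safer choice in that case.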


In [None]:
# Extracting date based features in daily cases dataset
daily_df['date'] = pd.to_datetime(daily_df['date'])
daily_df['year'] = daily_df['date'].dt.year
daily_df['month'] = daily_df['date'].dt.month
daily_df['day'] = daily_df['date'].dt.day
daily_df.index=daily_df['date'].dt.date
daily_df.drop('date', axis=1, inplace=True)
daily_df.head()

Unnamed: 0_level_0,country,cumulative_total_cases,daily_new_cases,active_cases,cumulative_total_deaths,daily_new_deaths,year,month,day
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2020-02-15,Afghanistan,0.0,,0.0,0.0,,2020,2,15
2020-02-16,Afghanistan,0.0,,0.0,0.0,,2020,2,16
2020-02-17,Afghanistan,0.0,,0.0,0.0,,2020,2,17
2020-02-18,Afghanistan,0.0,,0.0,0.0,,2020,2,18
2020-02-19,Afghanistan,0.0,,0.0,0.0,,2020,2,19


- **Helper Functions**

In [None]:
# helper functions 

def get_title(title:str, subtitle:str):
    return f"{title}<br><sub>{subtitle}</sub>"

def plot_bar(data=None, values=None, labels=None, title="", n=10, orientation='v', xlabel="", ylabel=""):
    hovertemplate ='<br><b>%{x}</b>'+f'<br><b>{ylabel}: </b>'+'%{y}<br><extra></extra>'
    colors = ['blue', 'red', 'yellow', 'green', 'royalblue', 'rebeccapurple', 'saddlebrown', 'darkorange', 'purple', 'burlywood', 'magenta', 'mediumturquoise', 'lightgreen', 'gold', 'violet', 'brown', 'silver', 'maroon', 'lavender', 'cyan']
    random.shuffle(colors)          
    
    # Keep only the first n rows when a DataFrame is supplied (data defaults to None)
    if data is not None and n is not None:
        data = data.iloc[:n]

    # Initiate an empty figure
    fig = go.Figure()

    # Add a trace of bar to the figure
    fig.add_trace(trace=go.Bar(x=values,
                              y=labels,
                              orientation=orientation,
                              hovertemplate = hovertemplate,
                              ),
                  )

    # Update the layout with some cosmetics

    fig.update_layout(height=600, 
                      width=1200, 
                      title=title,
                      xaxis_title=xlabel,
                      yaxis_title=ylabel,
                      title_x=0.5,
                      plot_bgcolor='rgba(0,0,0,0)',
                      hovermode="x"
                      )

    fig.update_traces(marker_color=colors,
                      marker_line_color='black',
                      marker_line_width=1.5,
                      opacity=0.6)

    # Display the figure
    fig.show()

<a name = Section7></a>

---
# **7. Data Post-Profiling**
---

- In this section, we will observe the changes after performing data pre-processing, if present.

In [None]:
profile = ProfileReport(df=daily_df)
profile.to_file(output_file='Post Profiling Report - Daily.html')
print('Accomplished!')
profile

In [None]:
profile = ProfileReport(df=summary)
profile.to_file(output_file='Post Profiling Report - Summary-Vaccine.html')
print('Accomplished!')
profile

<a name = Section8></a>

---
# **8. Exploratory Data Analysis**
---

**<h4>Question:** How many new cases appeared every day since the beginning of the pandemic?</h4>

In [None]:
count_by_day = daily_df.groupby(daily_df.index)[['daily_new_cases']].sum()
fig = px.line(count_by_day, count_by_day.index, count_by_day['daily_new_cases'], markers=True)
title = get_title('Daily Cases', 'Number of global cases since the beginning of Pandemic')
fig.update_layout(title=title, 
                  title_x=0.5)
fig.show()
fig.write_html("/content/Q1.html")

**Observations**:

- We can observe an increasing trend over the course of the pandemic.

- We can also see three major rises in the number of cases around **January**, **April**, and **August** 2021.

- We can **expect yet another rise** if the number of cases keeps increasing day by day.

**<h4>Question:** How many deaths occurred every day due to COVID-19?</h4>

In [None]:
fatal_by_day = daily_df.groupby(daily_df.index)[['daily_new_deaths']].sum()
fig = px.line(fatal_by_day, fatal_by_day.index, fatal_by_day['daily_new_deaths'], markers=True)
title = get_title('Daily Deaths', 'Number of global deaths since the beginning of Pandemic')
fig.update_traces(marker_color='red')
fig.update_layout(title=title, 
                  title_x=0.5)
fig.show()
fig.write_html("/content/Q2.html")

**Observations**:

- We can observe an increasing trend over the course of the pandemic, analogous to the number of cases.

- We can see a **steep rise** in the number of deaths in April 2020, giving a glimpse of the horror the virus can cause.

- The number of deaths **peaked in January 2021**, with the highest being 17.52k on 27th January 2021.

- The number of deaths has **remained relatively constant** after dropping in October 2021.
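
A peak such as the one quoted above can be read off programmatically instead of from the chart. This is a minimal sketch on invented numbers. The `toy` series is hypothetical and only stands in for the `daily_new_deaths` column.

```python
import pandas as pd

# Hypothetical daily death counts (a stand-in for fatal_by_day)
toy = pd.Series(
    [120, 340, 910, 450, 300],
    index=pd.to_datetime(
        ["2021-01-25", "2021-01-26", "2021-01-27", "2021-01-28", "2021-01-29"]
    ),
    name="daily_new_deaths",
)

peak_date = toy.idxmax()   # index label of the largest value
peak_value = toy.max()     # the largest value itself

print(f"Peak of {peak_value} on {peak_date.date()}")
```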

**<h4>Question:** What is the distribution of total confirmed cases across the globe?</h4>

In [None]:
title = get_title("Covid-19 Continents", "The Horror ensued across each continent")

continent_confirmed = summary_df.groupby(['continent'])['total_confirmed'].sum()
continent_active = summary_df.groupby(['continent'])['active_cases'].sum()
continent_deaths = summary_df.groupby(['continent'])['total_deaths'].sum()
continent_recovered = summary_df.groupby(['continent'])['total_recovered'].sum()

data = summary_df.copy()
data['Total Confirmed'] = data.continent.apply(lambda x: continent_confirmed[x])
data['Active Cases'] = data.continent.apply(lambda x: continent_active[x])
data['Total Deaths'] = data.continent.apply(lambda x: continent_deaths[x])
data['Total Recovered'] = data.continent.apply(lambda x: continent_recovered[x])


fig = px.choropleth(data, locations="country", 
                    locationmode='country names',
                    color="Total Confirmed", 
                    hover_name="continent", 
                    hover_data=['Active Cases', 'Total Confirmed','Total Deaths', 'Total Recovered' ],
                    title=title,
                    color_continuous_scale="reds"
                   )


fig.update_layout(title=title, 
                  title_x=0.5)

fig.show()
fig.write_html("/content/Q3.html")

**Observations**:

- This data shows the number of cases up to the first week of December 2021.

- **South-Asian countries** have seen the **highest number of confirmed cases** to date, followed by **Europe**, **Russia**, **North America**, and **South America**.

- **Africa** and **Australia** have the lowest number of cases among all the continents.
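
The cell above attaches each continent's total to every country row by looking it up with `apply`. A possible alternative, sketched here on a hypothetical `toy` frame, is `groupby(...).transform("sum")`, which returns one value per original row in a single step.

```python
import pandas as pd

# Hypothetical country-level summary (a stand-in for summary_df)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["X", "X", "Y", "Y"],
    "total_confirmed": [100, 50, 30, 20],
})

# transform("sum") keeps the original index, so every country row
# carries the total of the continent it belongs to
toy["Total Confirmed"] = toy.groupby("continent")["total_confirmed"].transform("sum")

print(toy)
```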

**<h4>Question:** What is the percentage of active cases, recovered, and deaths due to covid across the globe?</h4>

In [None]:
data = summary.reset_index().dropna(subset=['active_cases', 'total_recovered', 'total_deaths','population'])
data['active_percent'] = data['active_cases']/data['population'] * 100
data['recovered_percent'] = data['total_recovered']/data['population'] * 100
data['deaths_percent'] = data['total_deaths']/data['population'] * 100
data['confirmed_percent'] = data['total_confirmed']/data['population'] * 100
data = data.sort_values('confirmed_percent', ascending=False).drop_duplicates(subset=['country'])

# Title for the plot
title = get_title("Percentage Statistics", "Active, Recovered and Deaths in terms of percentage of population")

# We will add a trace for every column - Active, Recovered, and Deaths
fig = go.Figure(data=[
                go.Bar(
                    name="Deaths",
                    x=data['country'], 
                    y=data['deaths_percent'],
                    marker_color='crimson',
                    marker=dict(line=dict(
                                  width=0.1,
                                  color='red'
                                  )
                    )
                    ),
                    go.Bar(
                        name="Active",
                        x=data['country'], 
                        y=data['active_percent'],
                        marker_color='royalblue',
                        marker=dict(
                            line=dict(
                                width=0.1,
                                color='blue'
                                )
                            )
                        ),
                      go.Bar(
                          name="Recovered",
                          x=data['country'], 
                          y=data['recovered_percent'],
                          marker_color='lightseagreen',
                          marker=dict(
                              line=dict(
                                  width=0.1,
                                  color='green'
                                  )
                              )
                          )
                      ]
                )

fig.update_layout(title=title,
                  xaxis_title="Country",
                  yaxis_title="Percentage (%)",
                  plot_bgcolor='rgba(0,0,0,0)',
                  hovermode="x",
                  barmode='stack'
                  )
fig.show()

fig.write_html("/content/Q4.html")

**Observations**:

- Deaths account for less than 1% of the population in every country, which is a favorable statistic.

- **More than 20 countries** have more than **10% of their population** infected and recovered from the disease.

- Countries like **India** and **China** have **very large populations** that are **sparsely spread** across many cities, so their overall infection percentages look low compared to other countries.
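
To illustrate why a large population pulls the percentage down, here is a minimal sketch with two invented countries. All names and figures in `toy` are hypothetical.

```python
import pandas as pd

# Two hypothetical countries: the larger one has 10x the cases
# but 100x the population
toy = pd.DataFrame({
    "country": ["Small", "Large"],
    "total_confirmed": [50_000, 500_000],
    "population": [1_000_000, 100_000_000],
})

# Same normalisation as in the cell above: share of the population, in percent
toy["confirmed_percent"] = toy["total_confirmed"] / toy["population"] * 100

print(toy[["country", "confirmed_percent"]])
```

Despite having ten times as many confirmed cases, the larger country ends up with a percentage ten times lower.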

**<h4>Question:** Which countries have the highest number of administered vaccine doses?</h4>

In [None]:
total_by_country = summary.groupby('country')[['total_vaccinations']].max()
total_by_country = total_by_country.sort_values(by='total_vaccinations', ascending=False).head(20)

values = total_by_country['total_vaccinations']
labels = total_by_country.index

# Plotting using the helper functions
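title = get_title("Total Registered Vaccinations", "Individuals who have taken at least one dose")
plot_bar(data=total_by_country, values=values, labels=labels, title=title, n=20, orientation='h', xlabel="Total Vaccinations", ylabel="Countries")

# Note: `fig` is not reassigned in this cell, so unless plot_bar updates a global `fig`,
# this saves the figure from the previous cell rather than the bar chart above.
fig.write_html("/content/Q5.html")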

**Observations**:

- **China** seems to have registered more than **2.5 Billion doses**, as of December 2021, which is **highest** among all the countries.

- **India** follows China with **1.27 Billion administered doses** in the country.

- India is followed by **USA, Brazil, Indonesia**, and **Japan**, each with **less than 500 Million** administered doses, due to relatively **lower populations** than India and China.

**<h4>Question:** Which countries have the highest number of fully vaccinated people?</h4>

In [None]:
full_by_country = summary.groupby('country')[['people_fully_vaccinated']].max()
full_by_country = full_by_country.sort_values(by='people_fully_vaccinated', ascending=False).head(20)

values = full_by_country['people_fully_vaccinated']
labels = full_by_country.index

# Plotting using the helper functions
title = get_title("People Fully Vaccinated", "Individuals who are fully vaccinated")
plot_bar(data=full_by_country, values=values, labels=labels, title=title, n=20, orientation='h', xlabel="Fully Vaccinated", ylabel="Countries")

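# Note: this saves whatever `fig` currently refers to; plot_bar is not assigned to it in this cell.
fig.write_html("/content/Q6.html")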

**Observations**:

- Again, **China** leads, having fully vaccinated **1.1 Billion** of its citizens against the disease.

- **India** has **474 Million fully vaccinated** citizens in the country.

- India is followed by **USA, Brazil, Indonesia**, and **Japan**, each with **less than 200 Million** fully vaccinated citizens.

**<h4>Question:** Since the population numbers don't give a relative idea, what is the percentage of fully vaccinated people in different countries?</h4>

In [None]:
percent_by_country = summary.groupby('country')[['percentage_vaccinated']].max()
percent_by_country = percent_by_country.sort_values(by='percentage_vaccinated', ascending=False).head(20)

values = percent_by_country['percentage_vaccinated']
labels = percent_by_country.index

title = get_title("Percentage Vaccinated", "Percentage of the total population that have been fully vaccinated")
plot_bar(percent_by_country, values=labels, labels=values, title=title, n=20, xlabel="Countries", ylabel="Fully Vaccinated (%)")

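# Note: this saves whatever `fig` currently refers to; plot_bar is not assigned to it in this cell.
fig.write_html("/content/Q7.html")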

**Observations**:

- **Gibraltar** seems to have vaccinated **more than 100%** of its population, which looks anomalous.

- It is followed by **Malta, Portugal, UAE, Singapore**, and **Spain**, each vaccinating more than **80% of their population**.

- Even though **China** has vaccinated **more people** than other countries, only **70% of their population** is fully vaccinated.

In [None]:
summary[summary.index == 'Gibraltar'][['population', 'people_fully_vaccinated', 'percentage_vaccinated']]

| country | population | people_fully_vaccinated | percentage_vaccinated |
|---|---|---|---|
| Gibraltar | 33677 | 39823.0 | 118.249844 |


**Observations**:

- We see a **mismatch** in data because according to **2020 data at Worldometer**, in Gibraltar, there are **33677 citizens**, whereas we have **39823 people fully vaccinated**.

- This mismatch can be due to **inward** and **outward** **migration** in the country.
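
A quick sanity check can surface rows like this one before plotting. The sketch below assumes the percentage is computed as `people_fully_vaccinated / population * 100`, which reproduces the Gibraltar value shown above. The second row, "Example Land", is invented for contrast.

```python
import pandas as pd

# Gibraltar figures are taken from the output above; "Example Land" is hypothetical
toy = pd.DataFrame(
    {
        "population": [33677, 1_000_000],
        "people_fully_vaccinated": [39823, 700_000],
    },
    index=pd.Index(["Gibraltar", "Example Land"], name="country"),
)
toy["percentage_vaccinated"] = toy["people_fully_vaccinated"] / toy["population"] * 100

# Rows where more people are reported vaccinated than the recorded population
suspect = toy[toy["percentage_vaccinated"] > 100]
print(suspect)
```

Rows flagged this way can then be inspected individually, as was done for Gibraltar, rather than silently dropped.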

**<h4>Question:** Which vaccines are used in which countries?</h4>

In [None]:
data = summary.reset_index().dropna(subset=['vaccines'])

title = get_title("Popular Vaccines", "Vaccines being administered around the world")
fig = px.choropleth(data, locations="country", 
                    locationmode='country names',
                    color="vaccines", 
                    hover_name="country", 
                    )

fig.update_layout(height=800,
                  title=title, 
                  title_x=0.5,
                  legend_orientation = 'h'
)

fig.show()
fig.write_html("/content/Q8.html")

**Observations**:

- We have India with **Covaxin**, **Oxford/AstraZeneca's Covishield**, and **Sputnik V**.

- Most countries have been using a combination of the following vaccines: **Pfizer/BioNTech, Oxford/AstraZeneca, Covaxin, Sputnik V, Johnson&Johnson, Sinovac, Moderna, Abdala, Sinopharm**, and some others.

**<h4>Question:** Which vaccines have been administered the most over time?</h4>

In [None]:
title = get_title("Vaccine Breakdown", "Daily doses of each vaccine combination administered over time")
vacc_plot = pd.DataFrame(vacc_df.groupby(['vaccines','date'])['daily_vaccinations'].sum()).reset_index()
dates = vacc_plot.date.unique().tolist() 
vaccines = vacc_plot.vaccines.unique().tolist()
vacc_plot = vacc_plot.set_index(['date', 'vaccines'])

data = []
for date in dates:
    for vac in vaccines:
        if (date, vac) not in vacc_plot.index:
            value = pd.NA
            if date == (min(dates)):
                value = 0
            data.append([date, vac, value])  
        else:
            data.append([date, vac, vacc_plot.loc[(date, vac)]['daily_vaccinations']])
            
data = pd.DataFrame(data, columns = ['date', 'vaccine', 'count'])
data = data.sort_values(['vaccine', 'date'])
# .ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas
data['count'] = data['count'].ffill()
data = data[data['date'] != max(dates)]

line_plots = []
for v in vaccines:
    vacc_data = data[data.vaccine == v]
    line_plots.append(
        go.Scatter(
            name = v,
            x = vacc_data.date,
            mode='lines+markers',
            y=vacc_data['count'],
        )
    )
    
fig = go.Figure(line_plots)
fig.update_layout(
    height=1000,
    title =title,
    xaxis_title="Date",
    yaxis_title="Count",
    hovermode='x',
    legend_orientation = 'h'
)

fig.show()
fig.write_html("/content/Q9.html")

**Observations**:

- India's combination of **Covaxin**, **Oxford/AstraZeneca's Covishield**, and **Sputnik V** peaked at 10 Million doses a day in October 2021, and has stayed consistently around **7-9 Million doses a day**.

- This is rivaled by **China**'s vaccination program where they peaked at **22 Million doses a day**.

- Then we have **USA**'s Pfizer/BioNTech, Oxford/AstraZeneca, Moderna, and Johnson&Johnson combination - peaking at **4.4 Million doses a day**.
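
The nested loop in the cell above fills in every missing (date, vaccine) pair by hand. As a possible alternative, the same gap-filling can be sketched with `pivot` and `ffill`. The `toy` frame below is hypothetical and only mimics the shape of the grouped data.

```python
import pandas as pd

# Hypothetical long-format data (a stand-in for the grouped vacc_plot frame)
toy = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-03"],
    "vaccines": ["V1", "V2", "V1", "V2"],
    "daily_vaccinations": [100, 40, 120, 60],
})

# One row per date, one column per vaccine; absent pairs become NaN
wide = toy.pivot(index="date", columns="vaccines", values="daily_vaccinations")

# Carry the last known value forward, then treat anything
# before a vaccine's first report as 0
wide = wide.ffill().fillna(0)

print(wide)
```

Each column of `wide` can then be passed to `go.Scatter` as one line, in the same way the loop over `vaccines` does.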

**<h4>Question:** What is the global rate of vaccination drives in various countries?</h4>

In [None]:
dates = vacc_df.date.unique().tolist()

# Add 2 dates to improve the animation
dates.extend(['2020-12-12', '2020-12-13'])

# Unique countries
countries = vacc_df.country.unique().tolist()

# For easy processing
short = vacc_df[['date', 'country', 'total_vaccinations']]

# Unique (date, country) pairs already in short; a set keeps the lookups fast.
# We want to make sure we have an entry for each pair, even if it is 0.
keys = set(zip(short.date.tolist(), short.country.tolist()))
first_date = min(dates)
missing_rows = []
for date in dates:
    for country in countries:
        if (date, country) not in keys:
            # No entry on the earliest date means zero doses so far;
            # a gap on any other date stays missing and is forward-filled later.
            missing_rows.append({
                "date": date,
                "country": country,
                "total_vaccinations": 0 if date == first_date else pd.NA
            })

# DataFrame.append was removed in pandas 2.0, so add all rows with a single concat
short = pd.concat([short, pd.DataFrame(missing_rows)], ignore_index=True)

In [None]:
#fill missing values with previous day values (this is OK since it is cumulative)
short = short.sort_values(['country', 'date'])

short['total_vaccinations'] = short['total_vaccinations'].ffill()

# scale the number by log to make the color transitions smoother
vaccines = short.sort_values('date')
vaccines['log_scale'] = vaccines['total_vaccinations'].apply(lambda x : math.log2(x+1))

fig = px.choropleth(vaccines, locations="country", 
                    locationmode='country names',
                    color="log_scale", 
                    hover_name="country", 
                    hover_data=['log_scale', "total_vaccinations"],
                    animation_frame="date",
                    color_continuous_scale="RdYlBu",
                   )

title = get_title("Vaccination Progress", "Number of Vaccines Administered Around the World")
fig.update_layout(coloraxis={"cmax":25,"cmin":0})
fig.update_layout(title=title, title_x=0.5, coloraxis_showscale=False)

fig.show()
fig.write_html("/content/Q10.html")

**Observations**:

- We can **first** see the vaccines being administered in **Norway** on **9th December, 2020**.

- This is followed by **USA** with more than **22k vaccines** administered on **13th December 2020**.

- China suddenly registers 1.5 Million administered vaccinations on **December 15th, 2020**.

- India started its vaccination program on **16th January 2021**, when **190k people** were vaccinated on the **first day of availability**.
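
The animated map above colours countries by `log2(total_vaccinations + 1)` and caps the colour axis at 25. The short sketch below shows what that transformation does to a few invented dose counts, and which count corresponds to the cap. It uses only the standard library.

```python
import math

# log2(x + 1) maps 0 doses to exactly 0 and compresses very large counts
for doses in [0, 1_000, 1_000_000, 1_000_000_000]:
    print(doses, round(math.log2(doses + 1), 2))

# With cmax=25, the colour scale stops distinguishing countries
# once log2(doses + 1) reaches 25, i.e. at 2**25 - 1 doses
saturation_threshold = 2 ** 25 - 1
print(saturation_threshold)
```

The `+ 1` is what keeps countries with zero recorded doses on the scale, since `log2(0)` is undefined.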

**<h4>Question:** What is the rate of vaccination drives in the top 10 countries with the highest registered vaccinations?</h4>

In [None]:
# Only top-10 
countries = short.groupby('country')['total_vaccinations'].max().sort_values(ascending=False)[:10].index.tolist()

title = get_title("Vaccination Progress", "Rate of vaccinations for the top-10 vaccinated countries")

line_plots = []
for c in countries:
    vacc_data = short[short.country == c]
    line_plots.append(
        go.Scatter(
            name = c,
            x = vacc_data.date,
            mode='lines+markers',
            y=vacc_data['total_vaccinations'],
        )
    )
    
fig = go.Figure(line_plots)
fig.update_layout(
    title =title,
    yaxis_title="Count",
    hovermode='x',
    legend_orientation = 'h',

)

fig.show()
fig.write_html("/content/Q11.html")

**Observations**:

- We have China, India, and USA, followed by **Brazil** and **Japan**, which picked up **pace after July 2021**.

- **India surpassed USA**'s total vaccinations in **early July**.

- A notable country is **Russia**, which seems to have started its vaccination drive early as well, in **December 2020**.

- By the end of 2020, Mexico and then Germany had started their programs, taking the count to 5 countries with vaccination programs.

- By **March 2021**, all the top-10 countries had started their vaccination programs.
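
Start dates like the ones listed above can also be derived from the cumulative totals rather than read off the chart. This is a minimal sketch on a hypothetical `toy` frame with the same columns as `short`.

```python
import pandas as pd

# Hypothetical cumulative totals (a stand-in for short)
toy = pd.DataFrame({
    "date": ["2020-12-15", "2020-12-16", "2020-12-15", "2020-12-16", "2020-12-17"],
    "country": ["A", "A", "B", "B", "B"],
    "total_vaccinations": [0, 500, 0, 0, 200],
})

# Keep rows with at least one dose, then take the earliest date per country.
# ISO-formatted date strings sort chronologically, so min() works on them.
start_dates = toy[toy["total_vaccinations"] > 0].groupby("country")["date"].min()

print(start_dates)
```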

---
<a name = Section9></a>
# **9. Summarization**
---

<a name = Section91></a>
### **9.1 Conclusion**

- The graphs above show that, slowly but surely, vaccines are being administered in increasingly large numbers each day.

- If we look carefully, we can also identify a slight downward trend in the number of new cases each day, as the vaccinations progress.

- We can also observe a **decrease in the death rate** due to COVID-19, thanks to the **vaccination programs**.

- **South-Asian countries** seem to have suffered the most, but they have recovered and are **getting immunized**.

- Many combinations of vaccines are in use across **many countries**, which boosts the vaccination programs.

- A **predictive model** can help to determine whether the **future trend** will **increase or decrease** in terms of **number of cases**.

<a name = Section92></a>
### **9.2 Actionable Insights**

- **F.E.A.S.T.** can check the countries that have been **fully vaccinated** and **resume their operations** there.

- Countries with large populations, like **India**, **cannot guarantee** safe operations, as new variants that infect and spread quickly keep emerging.

- Most countries show a **promising vaccination drive** and the death rate has gone down, so most operations can be resumed by next year.

- This must, of course, be done with **care**. **Safety** and **hygiene** measures must be given priority.

- **COVID-19** has taken a heavy toll on mankind. We have lost far too many people and suffered too much for too long.

- Now is the time to fight back. 2021 has been a rough year but **humanity has made progress**.

- Regardless of what people might say, **always wear a mask** when out in public and **maintain social distancing**. **DO NOT** give in to hearsay! **Get vaccinated** as soon as possible!

- **Humanity is on its way to victory!**