# COVID-19 Vaccine Analysis

## by Justin Sierchio

In this analysis, we will be looking at the COVID-19 vaccination data. We hope to answer some of the following questions:

<ul>
    <li>Which countries are ramping up their vaccine efforts the fastest?</li>
    <li>Where are vaccines lagging behind?</li>
    <li>Are there any other observations we can make from this dataset?</li>
</ul>

The data is a .csv file downloaded from Kaggle: https://www.kaggle.com/gpreda/covid-world-vaccination-progress/download. More information about the dataset can be found at: https://www.kaggle.com/gpreda/covid-world-vaccination-progress.

## Notebook Initialization

In [1]:
# Import Relevant Libraries
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

print('Initial libraries loaded into workspace!')

Initial libraries loaded into workspace!


In [2]:
# Load the dataset for study
df_COVID19 = pd.read_csv("country_vaccinations.csv")

print('Datasets uploaded!')

Datasets uploaded!


In [3]:
# Display the first 5 rows of the COVID-19 vaccine dataset
df_COVID19.head()

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
0,Argentina,ARG,2020-12-29,700.0,,,,,0.0,,,,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
1,Argentina,ARG,2020-12-30,,,,,15656.0,,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
2,Argentina,ARG,2020-12-31,32013.0,,,,15656.0,0.07,,,346.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
3,Argentina,ARG,2021-01-01,,,,,11070.0,,,,245.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
4,Argentina,ARG,2021-01-02,,,,,8776.0,,,,194.0,Sputnik V,Ministry of Health,http://datos.salud.gob.ar/dataset/vacunas-cont...
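
The `date` column in the preview above is read as plain text by default. As a minimal sketch, assuming the same column layout, the `parse_dates` argument of `pd.read_csv` can convert it while loading. The sample rows are copied from the preview and trimmed to a few columns; `sample_csv` and `df_sample` are illustrative names, not part of the notebook.

```python
import io
import pandas as pd

# Three rows from the preview above, trimmed to five columns for brevity.
sample_csv = io.StringIO(
    "country,iso_code,date,total_vaccinations,daily_vaccinations\n"
    "Argentina,ARG,2020-12-29,700.0,\n"
    "Argentina,ARG,2020-12-30,,15656.0\n"
    "Argentina,ARG,2020-12-31,32013.0,15656.0\n"
)

# parse_dates converts the 'date' column to datetime64 while reading.
df_sample = pd.read_csv(sample_csv, parse_dates=["date"])

# Empty fields are read as NaN; 'date' is no longer an object column.
print(df_sample.dtypes)
```

Parsing dates at load time avoids a separate `pd.to_datetime` step later, although either approach gives the same column type.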


Before beginning our data cleaning, let's explain what each of the columns represents:

<ul>
    <li>country - the country for which vaccination information is provided</li>
    <li>iso_code - the ISO code of the country</li>
    <li>date - the date of the data entry</li>
    <li>total_vaccinations - the absolute number of total immunizations in the country</li>
    <li>people_vaccinated - the number of people vaccinated</li>
    <li>people_fully_vaccinated - the number of people who have received the entire set of immunizations required by the specific scheme (in most cases 2)</li>
    <li>daily_vaccinations_raw - the number of vaccinations for that date and country</li>
    <li>total_vaccinations_per_hundred - ratio (%) between the number of vaccinations and the total population of the country</li>
    <li>people_vaccinated_per_hundred - ratio (%) between the number of people vaccinated and the total population of the country</li>
    <li>people_fully_vaccinated_per_hundred - ratio (%) between people fully vaccinated and total country population</li>
    <li>daily_vaccinations_per_million - ratio (ppm) between vaccination number and total population for current date in the country</li>
    <li>vaccines - vaccines available in the country</li>
    <li>source_name - source of the information (national authority, international organization, local organization etc.)</li>
    <li>source_website - website of the source of information</li>
</ul>
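
To make the ratio columns concrete, here is a small worked sketch of how a "per hundred" and a "per million" figure relate to a raw count. The population and counts below are hypothetical round numbers chosen for illustration (the dataset itself does not include a population column), and the helper names `per_hundred` and `per_million` are assumptions.

```python
def per_hundred(count: float, population: float) -> float:
    """Return count expressed per 100 people (a percentage of the population)."""
    return 100 * count / population


def per_million(count: float, population: float) -> float:
    """Return count expressed per 1,000,000 people."""
    return 1_000_000 * count / population


# Hypothetical country with 10 million residents (not a value from the dataset).
population = 10_000_000
total_vaccinations = 250_000
daily_vaccinations = 12_000

total_per_hundred = per_hundred(total_vaccinations, population)
daily_per_million = per_million(daily_vaccinations, population)

print(total_per_hundred, daily_per_million)
```

Because `total_vaccinations` counts doses rather than people, `total_vaccinations_per_hundred` can exceed `people_vaccinated_per_hundred` once second doses are administered.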

## Data Cleaning

Let's first get a sense of what the dataset looks like.

In [4]:
df_COVID19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1587 entries, 0 to 1586
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   country                              1587 non-null   object 
 1   iso_code                             1409 non-null   object 
 2   date                                 1587 non-null   object 
 3   total_vaccinations                   1052 non-null   float64
 4   people_vaccinated                    1000 non-null   float64
 5   people_fully_vaccinated              334 non-null    float64
 6   daily_vaccinations_raw               846 non-null    float64
 7   daily_vaccinations                   1525 non-null   float64
 8   total_vaccinations_per_hundred       1052 non-null   float64
 9   people_vaccinated_per_hundred        1000 non-null   float64
 10  people_fully_vaccinated_per_hundred  334 non-null    float64
 11  daily_vaccinations_per_million

Let's count the missing ('NaN' or null) values in each column.

In [5]:
df_COVID19.isna().sum()

country                                   0
iso_code                                178
date                                      0
total_vaccinations                      535
people_vaccinated                       587
people_fully_vaccinated                1253
daily_vaccinations_raw                  741
daily_vaccinations                       62
total_vaccinations_per_hundred          535
people_vaccinated_per_hundred           587
people_fully_vaccinated_per_hundred    1253
daily_vaccinations_per_million           62
vaccines                                  0
source_name                               0
source_website                            0
dtype: int64
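
Raw counts are easier to compare when expressed as a share of all rows. The sketch below shows one way to do that with `isna().mean()`; it runs on a tiny made-up frame (`df_demo` is a hypothetical name and its values are not from the dataset), but the same call would apply to `df_COVID19`.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; values are made up for the example.
df_demo = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "total_vaccinations": [700.0, np.nan, 32013.0, np.nan],
    "people_fully_vaccinated": [np.nan, np.nan, np.nan, 10.0],
})

# isna() gives booleans; their mean is the fraction of missing rows per column.
missing_share = df_demo.isna().mean().mul(100).round(1)

# Show the columns with the most missing data first.
print(missing_share.sort_values(ascending=False))
```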

Let's convert all the missing ('NaN' or null) values to 0 to make the analysis easier.

In [6]:
# Fill all missing values with 0 in the columns that contain them
cols_with_missing = ['iso_code', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated',
                     'daily_vaccinations_raw', 'daily_vaccinations', 'total_vaccinations_per_hundred',
                     'people_vaccinated_per_hundred', 'people_fully_vaccinated_per_hundred',
                     'daily_vaccinations_per_million']

# Assign the result back: calling fillna(inplace=True) on a selected column
# is discouraged and may not update the DataFrame in newer pandas versions
df_COVID19[cols_with_missing] = df_COVID19[cols_with_missing].fillna(0)

# Check the data values
df_COVID19.isna().sum()

country                                0
iso_code                               0
date                                   0
total_vaccinations                     0
people_vaccinated                      0
people_fully_vaccinated                0
daily_vaccinations_raw                 0
daily_vaccinations                     0
total_vaccinations_per_hundred         0
people_vaccinated_per_hundred          0
people_fully_vaccinated_per_hundred    0
daily_vaccinations_per_million         0
vaccines                               0
source_name                            0
source_website                         0
dtype: int64
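
Filling with 0 is the approach used in this notebook. For cumulative columns such as `total_vaccinations`, one alternative (shown only as a sketch, not as the method used here) is to carry the last reported value forward within each country, so a day without a report does not look like a drop to zero. The frame below is illustrative: the Argentina values come from the preview earlier, while the "Hypothetical" country and the `total_filled` column name are assumptions.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "country": ["Argentina"] * 3 + ["Hypothetical"] * 2,
    "date": pd.to_datetime(
        ["2020-12-29", "2020-12-30", "2020-12-31", "2021-01-01", "2021-01-02"]
    ),
    "total_vaccinations": [700.0, np.nan, 32013.0, np.nan, 50.0],
})

# Forward-filling only makes sense in chronological order within each country.
df_demo = df_demo.sort_values(["country", "date"])

# Carry the last known total forward per country; gaps before the first
# report have nothing to carry, so they fall back to 0.
df_demo["total_filled"] = (
    df_demo.groupby("country")["total_vaccinations"].ffill().fillna(0)
)

print(df_demo)
```

Daily columns such as `daily_vaccinations` are not cumulative, so forward-filling them would be harder to justify; the sketch applies only to running totals.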

At this juncture, let's select the columns we believe we will need going forward.

In [7]:
# Feature selection for the vaccination dataset
df_COVID19 = df_COVID19[['country', 'date', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated',
                        'daily_vaccinations_raw', 'daily_vaccinations', 'total_vaccinations_per_hundred', 
                        'people_vaccinated_per_hundred', 'people_fully_vaccinated_per_hundred', 
                        'daily_vaccinations_per_million', 'vaccines']]

Let's organize the data so that the latest information is shown first.

In [8]:
# Reorder columns so that date is the first column
df_COVID19 = df_COVID19.reindex(columns=['date', 'country', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated',
                        'daily_vaccinations_raw', 'daily_vaccinations', 'total_vaccinations_per_hundred', 
                        'people_vaccinated_per_hundred', 'people_fully_vaccinated_per_hundred', 
                        'daily_vaccinations_per_million', 'vaccines'])

# Sort the data by date, newest first
df_COVID19['date'] = pd.to_datetime(df_COVID19['date'])
df_COVID19 = df_COVID19.sort_values(by='date', ascending=False)
# The original row index is kept, so the index labels shown below refer to the unsorted rows

# Display result
df_COVID19.head(10)

Unnamed: 0,date,country,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines
199,2021-01-26,Canada,868454.0,868454.0,0.0,28505.0,31045.0,2.3,2.3,0.0,823.0,"Moderna, Pfizer/BioNTech"
786,2021-01-26,Italy,1525612.0,1298990.0,226622.0,59285.0,40059.0,2.52,2.15,0.37,663.0,Pfizer/BioNTech
1136,2021-01-26,Portugal,263499.0,263499.0,0.0,0.0,13847.0,2.58,2.58,0.0,1358.0,Pfizer/BioNTech
28,2021-01-26,Argentina,305880.0,266969.0,38911.0,13494.0,9626.0,0.68,0.59,0.09,213.0,Sputnik V
1542,2021-01-26,United States,23540994.0,19902237.0,3481921.0,806751.0,1119058.0,7.11,6.01,1.05,3381.0,"Moderna, Pfizer/BioNTech"
477,2021-01-26,Estonia,27180.0,27180.0,0.0,1054.0,1124.0,2.05,2.05,0.0,847.0,Pfizer/BioNTech
126,2021-01-26,Brazil,848883.0,848883.0,0.0,148275.0,119630.0,0.4,0.4,0.0,563.0,Sinovac
1166,2021-01-26,Romania,528378.0,487711.0,40667.0,43747.0,36905.0,2.75,2.54,0.21,1918.0,Pfizer/BioNTech
671,2021-01-26,India,2029480.0,2029480.0,0.0,5671.0,193521.0,0.15,0.15,0.0,140.0,"Covaxin, Covishield"
1467,2021-01-26,United Arab Emirates,2677680.0,2427680.0,250000.0,106589.0,87473.0,27.07,24.55,2.53,8844.0,"Pfizer/BioNTech, Sinopharm"
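
The leftmost values in the output above (199, 786, ...) are the original row labels, because sorting rearranges rows without renumbering them. The short sketch below, on a made-up three-row frame, shows the difference between sorting alone and sorting followed by `reset_index(drop=True)`. Note that `reset_index` returns a new DataFrame, so its result has to be assigned to take effect. The names `df_demo`, `df_sorted`, and `df_renumbered` are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-24", "2021-01-26", "2021-01-25"]),
    "country": ["A", "B", "C"],
})

# Sorting keeps each row's original index label.
df_sorted = df_demo.sort_values(by="date", ascending=False)

# reset_index returns a new frame; drop=True discards the old labels.
df_renumbered = df_sorted.reset_index(drop=True)

print(df_sorted.index.tolist())
print(df_renumbered.index.tolist())
```

Keeping the original labels, as this notebook does, makes it possible to trace a row back to its position in the source file.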


Now let's store the most recent vaccination record for each country in a separate variable for later analysis.

In [11]:
# Keep the most recent row for each country (the data is already sorted newest first)
df_COVID19_latest = df_COVID19.drop_duplicates(subset=['country'], keep='first')

# Display the latest rows in alphabetical order by country
df_COVID19_latest.sort_values(by=['country'])

Unnamed: 0,date,country,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines
28,2021-01-26,Argentina,305880.0,266969.0,38911.0,13494.0,9626.0,0.68,0.59,0.09,213.0,Sputnik V
50,2021-01-26,Austria,185643.0,185643.0,0.0,0.0,6487.0,2.06,2.06,0.00,720.0,Pfizer/BioNTech
79,2021-01-20,Bahrain,144130.0,144130.0,0.0,534.0,6110.0,8.47,8.47,0.00,3591.0,"Pfizer/BioNTech, Sinopharm"
108,2021-01-25,Belgium,213301.0,212618.0,683.0,3432.0,12415.0,1.84,1.83,0.01,1071.0,Pfizer/BioNTech
115,2021-01-16,Bermuda,1665.0,1665.0,0.0,0.0,278.0,2.67,2.67,0.00,4464.0,Pfizer/BioNTech
...,...,...,...,...,...,...,...,...,...,...,...,...
1445,2021-01-26,Turkey,1410273.0,1410273.0,0.0,107520.0,65663.0,1.67,1.67,0.00,779.0,Sinovac
1467,2021-01-26,United Arab Emirates,2677680.0,2427680.0,250000.0,106589.0,87473.0,27.07,24.55,2.53,8844.0,"Pfizer/BioNTech, Sinopharm"
1504,2021-01-25,United Kingdom,7325773.0,6853327.0,472446.0,281725.0,371761.0,10.79,10.10,0.70,5476.0,"Oxford/AstraZeneca, Pfizer/BioNTech"
1542,2021-01-26,United States,23540994.0,19902237.0,3481921.0,806751.0,1119058.0,7.11,6.01,1.05,3381.0,"Moderna, Pfizer/BioNTech"
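
As a sketch of how the latest-row table could feed the first question (which countries are ramping up fastest), the example below repeats the sort-then-`drop_duplicates` step on a handful of rows copied from the outputs above and then ranks countries by `daily_vaccinations_per_million`. Ranking by this column is one possible choice, not necessarily the one used later in the analysis; `df_demo`, `df_latest`, and `ranking` are illustrative names.

```python
import pandas as pd

# Rows copied from the outputs above, trimmed to three columns.
df_demo = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-02", "2021-01-26", "2021-01-26", "2021-01-25", "2021-01-20"]
    ),
    "country": [
        "Argentina", "Argentina", "United Arab Emirates",
        "United Kingdom", "Bahrain",
    ],
    "daily_vaccinations_per_million": [194.0, 213.0, 8844.0, 5476.0, 3591.0],
})

# Newest rows first, then keep the first (most recent) row per country.
df_latest = (
    df_demo.sort_values(by="date", ascending=False)
           .drop_duplicates(subset=["country"], keep="first")
)

# Rank countries by their most recent population-adjusted daily rate.
ranking = df_latest.sort_values(
    by="daily_vaccinations_per_million", ascending=False
)

print(ranking[["country", "date", "daily_vaccinations_per_million"]])
```

Since countries report on different dates (Bahrain's latest row here is from 2021-01-20), the "latest" values are not strictly comparable day for day.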


## Exploratory Data Analysis