# COVID-19 World Vaccination Progress

This is a personal analysis of the current COVID-19 vaccination progress worldwide. For this project, we'll be using three different datasets:
* **Vaccination Progress by Country** ('country_vaccinations.csv')
* **Vaccination Progress by Manufacturer** ('country_vaccinations_by_manufacturer.csv')
* **World Population** ('2021_population.csv')

*Please bear in mind that all these datasets are updated constantly, so the outputs may differ depending on when the data was downloaded.*

## Main imports

In [94]:
import pandas as pd
import numpy as np
import seaborn as sns
import os

import matplotlib.pyplot as plt
%matplotlib inline

import plotly.express as px

In [2]:
vcn = pd.read_csv('country_vaccinations.csv')
vbm = pd.read_csv('country_vaccinations_by_manufacturer.csv')
wp = pd.read_csv('2021_population.csv')
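
As a side note, the `date` column is loaded as plain text by default. A minimal sketch, assuming a tiny in-memory stand-in for `country_vaccinations.csv` (the `sample_csv` name and its two rows are only for illustration), shows how `parse_dates` could turn it into real timestamps at load time:

```python
import io
import pandas as pd

# Tiny stand-in for country_vaccinations.csv (illustrative rows only)
sample_csv = io.StringIO(
    "country,iso_code,date,total_vaccinations\n"
    "Afghanistan,AFG,2021-02-22,0\n"
    "Afghanistan,AFG,2021-02-23,\n"
)

# parse_dates converts the listed columns to datetime64 while reading
sample = pd.read_csv(sample_csv, parse_dates=["date"])
print(sample.dtypes)
```

The notebook itself keeps the dates as strings, which is enough for the steps that follow.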

## Getting familiar with the dataset layout

### Checking the Vaccination Progress by Country dataset first

In [3]:
vcn.head()

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,0.0,0.0,,,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi...",World Health Organization,https://covid19.who.int/
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,,,,35.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi...",World Health Organization,https://covid19.who.int/
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,,,,35.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi...",World Health Organization,https://covid19.who.int/
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,,,,35.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi...",World Health Organization,https://covid19.who.int/
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,,,,35.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi...",World Health Organization,https://covid19.who.int/


In [4]:
vcn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37538 entries, 0 to 37537
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   country                              37538 non-null  object 
 1   iso_code                             37538 non-null  object 
 2   date                                 37538 non-null  object 
 3   total_vaccinations                   20764 non-null  float64
 4   people_vaccinated                    19839 non-null  float64
 5   people_fully_vaccinated              16961 non-null  float64
 6   daily_vaccinations_raw               17088 non-null  float64
 7   daily_vaccinations                   37292 non-null  float64
 8   total_vaccinations_per_hundred       20764 non-null  float64
 9   people_vaccinated_per_hundred        19839 non-null  float64
 10  people_fully_vaccinated_per_hundred  16961 non-null  float64
 11  daily_vaccinations_per_milli

### Getting the number of countries listed
Our Vaccination Progress by Country dataset lists 222 countries.

In [137]:
vcn['country'].nunique()

222

### Now checking the Vaccination Progress by Manufacturer dataset

In [6]:
vbm.head()

Unnamed: 0,location,date,vaccine,total_vaccinations
0,Austria,2021-01-08,Johnson&Johnson,0
1,Austria,2021-01-08,Moderna,0
2,Austria,2021-01-08,Oxford/AstraZeneca,0
3,Austria,2021-01-08,Pfizer/BioNTech,31027
4,Austria,2021-01-15,Johnson&Johnson,0


In [7]:
vbm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11331 entries, 0 to 11330
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   location            11331 non-null  object
 1   date                11331 non-null  object
 2   vaccine             11331 non-null  object
 3   total_vaccinations  11331 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 354.2+ KB


### Checking the number of countries/locations listed
This dataset, by contrast, lists only 34 countries/locations.

In [138]:
vbm['location'].nunique()

34

### Lastly, checking the World Population dataset

In [9]:
wp.head()

Unnamed: 0,iso_code,country,2021_last_updated,2020_population,area,density_sq_km,growth_rate,world_%,rank
0,CHN,China,1444712023,1439323776,"9,706,961 sq_km",149/sq_km,0.34%,18.34%,1
1,IND,India,1394784323,1380004385,"3,287,590 sq_km",424/sq_km,0.97%,17.69%,2
2,USA,United States,333114077,331002651,"9,372,610 sq_km",36/sq_km,0.58%,4.23%,3
3,IDN,Indonesia,276653405,273523615,"1,904,569 sq_km",145/sq_km,1.04%,3.51%,4
4,PAK,Pakistan,225639396,220892340,"881,912 sq_km",255/sq_km,1.95%,2.86%,5


In [10]:
wp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228 entries, 0 to 227
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   iso_code           228 non-null    object
 1   country            228 non-null    object
 2   2021_last_updated  228 non-null    object
 3   2020_population    228 non-null    object
 4   area               228 non-null    object
 5   density_sq_km      228 non-null    object
 6   growth_rate        228 non-null    object
 7   world_%            228 non-null    object
 8   rank               228 non-null    int64 
dtypes: int64(1), object(8)
memory usage: 16.2+ KB


### Checking the number of countries listed

In [11]:
wp['country'].nunique()

228

### Concatenating the Vaccination Progress by Country and the World Population dataframes

In [12]:
wp.head(1)

Unnamed: 0,iso_code,country,2021_last_updated,2020_population,area,density_sq_km,growth_rate,world_%,rank
0,CHN,China,1444712023,1439323776,"9,706,961 sq_km",149/sq_km,0.34%,18.34%,1


Creating a new World Population dataframe with only the desired columns (country name and total population). The actual sorting by country happens later, once both dataframes are aligned.

In [139]:
wp_sort = wp[['country','2021_last_updated']]
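# Note: sort_values returns a new dataframe; the result is not assigned here,
# so wp_sort keeps its original order (as the output shows)
wp_sort.sort_values('country')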
wp_sort.head()

Unnamed: 0,country,2021_last_updated
0,China,1444712023
1,India,1394784323
2,United States,333114077
3,Indonesia,276653405
4,Pakistan,225639396
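
The output above is still in the original order because `sort_values` does not modify the dataframe in place. A small sketch with a made-up `demo` dataframe (the numbers are placeholders, not real populations) shows the difference between discarding and assigning the result:

```python
import pandas as pd

# Made-up frame, only to illustrate the behaviour
demo = pd.DataFrame({"country": ["China", "India", "Brazil"],
                     "population": [3, 2, 1]})

demo.sort_values("country")              # result discarded, demo is unchanged
unsorted_first = demo["country"].iloc[0]

# Assigning the result (and resetting the index) keeps the sorted order
demo_sorted = demo.sort_values("country").reset_index(drop=True)
sorted_first = demo_sorted["country"].iloc[0]

print(unsorted_first, sorted_first)
```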


Now doing the same with the Vaccination Progress by Country dataframe. First, keeping only the most recent row for each country.

In [140]:
vcn_drop = vcn.drop_duplicates('country', keep = "last")
vcn_sort = vcn_drop[['country','people_fully_vaccinated']]
vcn_sort.head()

Unnamed: 0,country,people_fully_vaccinated
170,Afghanistan,
383,Albania,560007.0
565,Algeria,724812.0
741,Andorra,33904.0
903,Angola,722610.0


After checking the Top 5 countries, I realized China was left out: it has not updated its numbers for the past few days, so its most recent row is empty.

In [141]:
vcn_sort.sort_values(by='people_fully_vaccinated', ascending=False).head()

Unnamed: 0,country,people_fully_vaccinated
36109,United States,168090925
15546,India,121270889
4934,Brazil,49297923
12777,Germany,47367929
17337,Japan,46422145


Getting China's last update on Vaccination Progress

In [142]:
vcn[vcn['country'] == 'China']['people_fully_vaccinated'].sort_values()

6928   223299000
6991   777046000
6751         nan
6752         nan
6753         nan
          ...   
6988         nan
6989         nan
6990         nan
6992         nan
6993         nan
Name: people_fully_vaccinated, Length: 243, dtype: float64

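Updating the Vaccination Progress dataset with the China figure from row 6928. Note that `sort_values()` orders by value rather than by date, and the output above also lists a larger figure at row 6991, so row 6928 is not necessarily China's most recent report.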

In [143]:
china_missing_value = vcn.loc[6928].people_fully_vaccinated
vcn_sort2 = vcn_sort.copy(deep=True)
vcn_sort2.loc[vcn_sort2['country'] == 'China', 'people_fully_vaccinated'] = china_missing_value
vcn_sort2.sort_values(by='people_fully_vaccinated', ascending=False).head()

Unnamed: 0,country,people_fully_vaccinated
6993,China,223299000
36109,United States,168090925
15546,India,121270889
4934,Brazil,49297923
12777,Germany,47367929
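
Hard-coding a row index works for one country, but the same gap can affect any country whose latest row is empty. As a possible alternative (not what this notebook does), `GroupBy.last()` skips nulls by default and returns the last known value per group. A minimal sketch on made-up rows, assuming the data is ordered by date:

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like country_vaccinations.csv
demo = pd.DataFrame({
    "country": ["China", "China", "China", "Albania", "Albania"],
    "date": ["2021-06-01", "2021-08-01", "2021-08-02",
             "2021-08-01", "2021-08-02"],
    "people_fully_vaccinated": [100.0, 300.0, np.nan, 40.0, 50.0],
})

# drop_duplicates(keep="last") keeps the last row, even if its value is NaN
last_row = demo.drop_duplicates("country", keep="last")

# GroupBy.last() returns the last non-null value in each group instead
last_known = (demo.sort_values("date")
                  .groupby("country")["people_fully_vaccinated"]
                  .last()
                  .reset_index())
print(last_known)
```

Here `last_row` has a missing value for China, while `last_known` carries the most recent reported figure.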


Stacking both dataframes and keeping only the countries that appear in both, so that the two tables end up with the same length

In [145]:
df_both = pd.concat([wp_sort,vcn_sort2])
df_both = df_both[df_both.groupby('country').country.transform(len) > 1]
df_both = df_both.drop_duplicates('country', keep = 'last')

Selecting the relevant columns, renaming `country`, and resetting the index

In [146]:
df_both_sort = df_both[['country','people_fully_vaccinated']]
df_both_sort = df_both_sort.rename(columns={'country':'country_vaccinations'})
df_both_sort.reset_index(drop=True, inplace=True)
df_both_sort.head()

Unnamed: 0,country_vaccinations,people_fully_vaccinated
0,Afghanistan,
1,Albania,560007.0
2,Algeria,724812.0
3,Andorra,33904.0
4,Angola,722610.0


Applying the same filter to the World Population dataframe, this time keeping its own rows for the shared countries

In [147]:
wp_2 = pd.concat([wp, df_both])
wp_2 = wp_2[wp_2.groupby('country').country.transform(len) > 1]
wp_2 = wp_2.drop_duplicates('country', keep = 'first')

Selecting the relevant columns, sorting by country, and resetting the index

In [148]:
wp_2_sort = wp_2[['country','2021_last_updated']]
wp_2_sort = wp_2_sort.sort_values('country')
wp_2_sort.reset_index(drop=True,inplace=True)
wp_2_sort.head()

Unnamed: 0,country,2021_last_updated
0,Afghanistan,39929284
1,Albania,2872370
2,Algeria,44694125
3,Andorra,77355
4,Angola,34043709


Concatenating the cleaned and sorted dataframes side by side

In [149]:
main1 = pd.concat([wp_2_sort,df_both_sort],axis=1)
main1.head()

Unnamed: 0,country,2021_last_updated,country_vaccinations,people_fully_vaccinated
0,Afghanistan,39929284,Afghanistan,
1,Albania,2872370,Albania,560007.0
2,Algeria,44694125,Algeria,724812.0
3,Andorra,77355,Andorra,33904.0
4,Angola,34043709,Angola,722610.0
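
The side-by-side concatenation above relies on both dataframes having exactly the same countries in the same row order. A key-based join avoids that dependency. The following is only a sketch of that alternative, using small hand-made frames (`Atlantis` and `Narnia` are placeholder names for countries present in only one table):

```python
import pandas as pd

# Hand-made stand-ins for wp_sort and vcn_sort2
population = pd.DataFrame({
    "country": ["China", "Albania", "Atlantis"],
    "2021_last_updated": ["1444712023", "2872370", "1"],
})
vaccinated = pd.DataFrame({
    "country": ["Albania", "China", "Narnia"],
    "people_fully_vaccinated": [560007.0, 223299000.0, 5.0],
})

# An inner merge keeps only countries present in both tables
# and matches rows by name, regardless of their order
merged = (population.merge(vaccinated, on="country", how="inner")
                    .sort_values("country")
                    .reset_index(drop=True))
print(merged)
```

With a merge there is a single `country` column, so the duplicated `country_vaccinations` column is not needed as a visual check.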


### Checking the Main Dataframe for nulls and data type

In [22]:
main1.isna().sum()

country                     0
2021_last_updated           0
country_vaccinations        0
people_fully_vaccinated    19
dtype: int64

Dropping the nulls

In [150]:
main1 = main1.dropna()
main1 = main1.reset_index()
main1.head()

Unnamed: 0,index,country,2021_last_updated,country_vaccinations,people_fully_vaccinated
0,1,Albania,2872370,Albania,560007
1,2,Algeria,44694125,Algeria,724812
2,3,Andorra,77355,Andorra,33904
3,4,Angola,34043709,Angola,722610
4,5,Anguilla,15117,Anguilla,8965


In [24]:
main1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172 entries, 0 to 171
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   index                    172 non-null    int64  
 1   country                  172 non-null    object 
 2   2021_last_updated        172 non-null    object 
 3   country_vaccinations     172 non-null    object 
 4   people_fully_vaccinated  172 non-null    float64
dtypes: float64(1), int64(1), object(3)
memory usage: 6.8+ KB


Converting the 2021_last_updated column from object to float

In [151]:
main1.replace(',','',regex=True,inplace=True)
main1['2021_last_updated'] = main1['2021_last_updated'].astype(float)
main1

Unnamed: 0,index,country,2021_last_updated,country_vaccinations,people_fully_vaccinated
0,1,Albania,2872370,Albania,560007
1,2,Algeria,44694125,Algeria,724812
2,3,Andorra,77355,Andorra,33904
3,4,Angola,34043709,Angola,722610
4,5,Anguilla,15117,Anguilla,8965
...,...,...,...,...,...
167,186,Venezuela,28728962,Venezuela,1100000
168,187,Vietnam,98253611,Vietnam,1271973
169,188,Yemen,30558838,Yemen,13322
170,189,Zambia,18975907,Zambia,212694


In [26]:
main1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172 entries, 0 to 171
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   index                    172 non-null    int64  
 1   country                  172 non-null    object 
 2   2021_last_updated        172 non-null    float64
 3   country_vaccinations     172 non-null    object 
 4   people_fully_vaccinated  172 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 6.8+ KB
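
The conversion above assumes every population string is a clean number once commas are removed. If that were not guaranteed, `pd.to_numeric` with `errors="coerce"` would turn unparseable entries into `NaN` instead of raising. A short sketch on illustrative strings (the `raw` series is made up):

```python
import pandas as pd

# Illustrative strings: with separators, without, and an unparseable entry
raw = pd.Series(["1,444,712,023", "2872370", "n/a"])

# Strip thousands separators, then convert; bad entries become NaN
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
print(cleaned)
```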


Sanity check: no country should report more fully vaccinated people than its total population. Finding any rows that violate this and dropping them

In [152]:
main1[main1['2021_last_updated'] < main1['people_fully_vaccinated']]

Unnamed: 0,index,country,2021_last_updated,country_vaccinations,people_fully_vaccinated
62,65,Gibraltar,33698,Gibraltar,39168


In [28]:
main1 = main1.drop(index=[62])
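
Dropping by a hard-coded index works for this snapshot, but the offending row may change when the data is refreshed. A possible variant, sketched on a two-row `demo` frame built from values shown earlier, filters with the condition itself:

```python
import pandas as pd

# Two rows taken from the outputs above, for illustration
demo = pd.DataFrame({
    "country": ["Gibraltar", "Andorra"],
    "2021_last_updated": [33698.0, 77355.0],
    "people_fully_vaccinated": [39168.0, 33904.0],
})

# Keep only rows where the vaccinated count does not exceed the population
valid = demo["people_fully_vaccinated"] <= demo["2021_last_updated"]
demo_clean = demo[valid].reset_index(drop=True)
print(demo_clean)
```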

## Data Analysis

### Top 5 Countries in terms of Total Number of Vaccines Administered

In [29]:
tnva = vcn.groupby('country')['total_vaccinations'].max().reset_index()
tnva_sort = tnva.sort_values('total_vaccinations',ascending=False).head(5)
pd.set_option('display.float_format', lambda x: '%.0f'% x)
tnva_sort.head()

Unnamed: 0,country,total_vaccinations
40,China,1853839000
90,India,543846290
211,United States,355768825
27,Brazil,163452602
99,Japan,108179498
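
Because `total_vaccinations` is cumulative, the per-country maximum is the latest known total, and `max()` ignores missing days. The same Top N could also be written with `nlargest`, as in this sketch on made-up numbers:

```python
import pandas as pd

# Made-up cumulative totals; None stands for a day without a report
demo = pd.DataFrame({
    "country": ["China", "China", "India", "India", "Japan"],
    "total_vaccinations": [1.0e9, 1.8e9, None, 5.4e8, 1.0e8],
})

# Max per country (NaN is skipped), then the two largest totals
top2 = (demo.groupby("country")["total_vaccinations"]
            .max()
            .nlargest(2)
            .reset_index())
print(top2)
```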


In [122]:
fig = px.bar(tnva_sort,
    x='country',
    y='total_vaccinations',    
    labels = {'country':'Country', 'total_vaccinations':'Total Number of Vaccines Administered'},
    title = 'Top 5 Countries in terms of Number of Vaccines Administered',
    color='total_vaccinations'
)
#fig.update_traces(texttemplate=tnva_sort['total_vaccinations'].map('{:,}'.format),textposition='outside')
fig.show()

### World Map of Number of Vaccines Administered
For this plot, we will build an aggregated copy of our Vaccination Progress dataset, with one row per ISO code

In [124]:
# Select several columns with a list (double brackets); a bare tuple of keys is deprecated
vcn_geo = vcn.groupby('iso_code')[['total_vaccinations','people_fully_vaccinated','people_vaccinated']].max().reset_index()
vcn_geo = vcn_geo.fillna(0)
fig = px.scatter_geo(vcn_geo, locations="iso_code", color="iso_code",
                     size="total_vaccinations",title="Total Vaccinations",
                     projection='mercator')
fig.show()



### Top 5 Countries in terms of Population Fully Vaccinated

In [31]:
top5vac = main1.sort_values('people_fully_vaccinated',ascending=False).head()
top5vac.head()

Unnamed: 0,index,country,2021_last_updated,country_vaccinations,people_fully_vaccinated
34,36,China,1444712023,China,223299000
164,182,United States,333114077,United States,168090925
74,77,India,1394784323,India,121270889
23,24,Brazil,214139548,Brazil,49297923
60,63,Germany,83913565,Germany,47367929


In [129]:
fig = px.bar(top5vac,
    x='country',
    y='people_fully_vaccinated',    
    labels = {'country':'Country', 'people_fully_vaccinated':'Total Population Vaccinated'},
    title = 'Top 5 Countries in terms of Number of People Fully Vaccinated',
    color = 'people_fully_vaccinated'
)
#fig.update_traces(texttemplate=top5vac['people_fully_vaccinated'],textposition='outside')
fig.show()

### World Map of Number of People Fully Vaccinated
This plot reuses the `vcn_geo` dataframe built for the previous world map

In [126]:
fig = px.scatter_geo(vcn_geo, locations="iso_code", color="iso_code",
                     size="people_fully_vaccinated",title="People Fully Vaccinated",
                     projection='mercator')
fig.show()

### Top 5 Countries in terms of % of Population Fully Vaccinated

In [133]:
main1['percentage'] = ((main1['people_fully_vaccinated'])/(main1['2021_last_updated']))*100
top5percentage = main1.sort_values('percentage',ascending=False).head(5)
top5percentage

Unnamed: 0,index,country,2021_last_updated,country_vaccinations,people_fully_vaccinated,percentage
98,106,Malta,442934,Malta,404213,91
73,76,Iceland,343578,Iceland,255322,74
30,32,Cayman Islands,66497,Cayman Islands,48512,73
162,180,United Arab Emirates,10001593,United Arab Emirates,7210000,72
135,146,San Marino,34017,San Marino,23700,70


In [134]:
fig = px.bar(top5percentage,
    x='country',
    y='percentage',    
    labels = {'country':'Country', 'percentage':'% of Population Fully Vaccinated'},
    title = 'Top 5 Countries in terms of % of Population Fully Vaccinated',
    color='percentage'
)
# fig.update_traces(texttemplate=top5percentage['percentage'].map('{:.2f}%'.format),textposition='outside')
fig.show()

### Bottom 5 Countries in terms of % of Population Fully Vaccinated

In [135]:
main1['percentage'] = ((main1['people_fully_vaccinated'])/(main1['2021_last_updated']))*100
top5percentage = main1.sort_values('percentage',ascending=False).tail(5)
top5percentage[::-1]

Unnamed: 0,index,country,2021_last_updated,country_vaccinations,people_fully_vaccinated,percentage
69,72,Haiti,11555977,Haiti,366,0
166,185,Vanuatu,315214,Vanuatu,114,0
146,159,South Sudan,11400885,South Sudan,4763,0
169,188,Yemen,30558838,Yemen,13322,0
32,34,Chad,16965252,Chad,10863,0
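
The percentages in the table above all display as `0` because of the `display.float_format` option set earlier (`'%.0f'`), not because the values are exactly zero. A small sketch, reusing three rows from the outputs above, recomputes the ratio and formats it with more decimals; `nsmallest` is used here as one possible way to get the bottom rows directly:

```python
import pandas as pd

# Three rows taken from the outputs above
demo = pd.DataFrame({
    "country": ["Malta", "Haiti", "Chad"],
    "2021_last_updated": [442934.0, 11555977.0, 16965252.0],
    "people_fully_vaccinated": [404213.0, 366.0, 10863.0],
})

# Share of the population that is fully vaccinated, in percent
demo["percentage"] = demo["people_fully_vaccinated"] / demo["2021_last_updated"] * 100

# Explicit formatting is independent of the global display option
demo["percentage_label"] = demo["percentage"].map("{:.4f}%".format)

# The two lowest shares, in ascending order
bottom2 = demo.nsmallest(2, "percentage")
print(bottom2[["country", "percentage_label"]])
```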


In [136]:
fig = px.bar(top5percentage[::-1],
    x='country',
    y='percentage',    
    labels = {'country':'Country', 'percentage':'% of Population Fully Vaccinated'},
    title = 'Bottom 5 Countries in terms of % People Fully Vaccinated',
    color='percentage'         
)
# Take each label from the bar's own y value, so the labels always match the plotted order
fig.update_traces(texttemplate='%{y:.2f}%',textposition='outside')
fig.show()

### Vaccine brand distribution worldwide

In [40]:
vbd = vbm.groupby(['vaccine','location'])['total_vaccinations'].max().reset_index()
vbd = vbd.groupby('vaccine')['total_vaccinations'].sum().reset_index()
vbd = vbd.sort_values('total_vaccinations',ascending=False)
vbd.head()

Unnamed: 0,vaccine,total_vaccinations
4,Pfizer/BioNTech,655977370
2,Moderna,200956257
3,Oxford/AstraZeneca,67224770
1,Johnson&Johnson,26233106
6,Sinovac,24702952
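
The two-step aggregation above matters because `total_vaccinations` is a running total per location and vaccine: summing every row would count the same doses many times. A minimal sketch with made-up numbers (`Otherland` is a placeholder location) compares the two approaches:

```python
import pandas as pd

# Made-up cumulative totals for one vaccine in two locations
demo = pd.DataFrame({
    "location": ["Austria", "Austria", "Otherland", "Otherland"],
    "date": ["2021-01-08", "2021-01-15", "2021-01-08", "2021-01-15"],
    "vaccine": ["Pfizer/BioNTech"] * 4,
    "total_vaccinations": [10, 30, 5, 20],
})

# Step 1: latest cumulative total per (vaccine, location)
latest = demo.groupby(["vaccine", "location"])["total_vaccinations"].max()

# Step 2: add up the locations for each vaccine
per_vaccine = latest.groupby(level="vaccine").sum()

# Summing all rows directly would double count earlier reports
naive_total = demo["total_vaccinations"].sum()
print(per_vaccine["Pfizer/BioNTech"], naive_total)
```

Keep in mind that this dataset covers only 34 locations, so the resulting distribution describes those locations rather than the whole world.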


In [41]:
fig = px.pie(vbd, values='total_vaccinations',names='vaccine',title='Vaccine Brand Distribution Worldwide')
fig.show()