## COVID-19 EDA USING WHO DATA 

**VACCINATION RATE VS. DEATH RATES**


**My data came directly from the** [WHO CORONAVIRUS DASHBOARD](https://covid19.who.int/data)


**Two datasets were used for analysis:**

[THE COVID19 GLOBAL TABLE DATA](https://covid19.who.int/WHO-COVID-19-global-table-data.csv)

[COVID19 VACCINATION DATA](https://covid19.who.int/who-data/vaccination-data.csv)  


<br/><br/>

**<font color="darkred">DATA DICTIONARY</font>**

**Latest reported counts of cases and deaths**

Download link: https://covid19.who.int/WHO-COVID-19-global-table-data.csv


| Field name                                                   | Type    | Description                                                                                                                                                      |
| ------------------------------------------------------------ | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Name                                                         | String  | Country, territory, area                                                                                                                                         |
| WHO\_region                                                  | String  | WHO Region                                                                                                                                                       |
| Cases - cumulative total                                     | Integer | Cumulative confirmed cases reported to WHO to date.                                                                                                              |
| Cases - cumulative total per 100000 population               | Decimal | Cumulative confirmed cases reported to WHO to date per 100,000 population.                                                                                       |
| Cases - newly reported in last 7 days                        | Integer | New confirmed cases reported in the last 7 days. Calculated by subtracting previous cumulative case count (8 days prior) from current cumulative cases count.    |
| Cases - newly reported in last 7 days per 100000 population  | Decimal | New confirmed cases reported in the last 7 days per 100,000 population.                                                                                          |
| Cases - newly reported in last 24 hours                      | Integer | New confirmed cases reported in the last 24 hours. Calculated by subtracting previous cumulative case count from current cumulative cases count.                 |
| Deaths - cumulative total                                    | Integer | Cumulative confirmed deaths reported to WHO to date.                                                                                                             |
| Deaths - cumulative total per 100000 population              | Decimal | Cumulative confirmed deaths reported to WHO to date per 100,000 population.                                                                                      |
| Deaths - newly reported in last 7 days                       | Integer | New confirmed deaths reported in the last 7 days. Calculated by subtracting previous cumulative death count (8 days prior) from current cumulative deaths count. |
| Deaths - newly reported in last 7 days per 100000 population | Decimal | New confirmed deaths reported in the last 7 days per 100,000 population.                                                                                         |
| Deaths - newly reported in last 24 hours                     | Integer | New confirmed deaths reported in the last 24 hours. Calculated by subtracting previous cumulative death count from current cumulative deaths count.              |
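
The per-100,000 columns are the raw counts scaled by population. As a rough sketch (the helper names `per_100k` and `implied_population` are illustrative and not part of the WHO files), the relationship can be checked against the United States row shown later in this notebook:

```python
def per_100k(count: float, population: float) -> float:
    """Scale a raw count to a rate per 100,000 population."""
    return count / population * 100_000


def implied_population(count: float, rate_per_100k: float) -> float:
    """Invert the rate to recover the population denominator it implies."""
    return count / rate_per_100k * 100_000


# Values copied from the United States of America row of the global table.
us_cases = 84_708_007
us_cases_per_100k = 25_591.338

us_population = implied_population(us_cases, us_cases_per_100k)
print(f"Implied US population: {us_population:,.0f}")

# Feeding that population back in reproduces the published rate.
print(round(per_100k(us_cases, us_population), 3))
```

The implied denominator is only as precise as the rounded rate in the file, so treat it as a sanity check rather than an official population figure.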


<br/><br/>

**Vaccination data**

Download link: https://covid19.who.int/who-data/vaccination-data.csv




| Field name                               | Type    | Description                                                                                                                                                                                                                                                                                                                                     |
| ---------------------------------------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| COUNTRY                                  | String  | Country, territory, area                                                                                                                                                                                                                                                                                                                        |
| ISO3                                     | String  | ISO Alpha-3 country code                                                                                                                                                                                                                                                                                                                        |
| WHO\_REGION                              | String  | WHO regional offices: WHO Member States are grouped into six WHO regions: Regional Office for Africa (AFRO), Regional Office for the Americas (AMRO), Regional Office for South-East Asia (SEARO), Regional Office for Europe (EURO), Regional Office for the Eastern Mediterranean (EMRO), and Regional Office for the Western Pacific (WPRO). |
| DATA\_SOURCE                             | String  | Indicates data source: - REPORTING: Data reported by Member States, or sourced from official reports - OWID: Data sourced from Our World in Data: https://ourworldindata.org/covid-vaccinations                                                                                                                                                 |
| DATE\_UPDATED                            | Date    | Date of last update                                                                                                                                                                                                                                                                                                                             |
| TOTAL\_VACCINATIONS                      | Integer | Cumulative total vaccine doses administered                                                                                                                                                                                                                                                                                                     |
| PERSONS\_VACCINATED\_1PLUS\_DOSE         | Integer | Cumulative number of persons vaccinated with at least one dose                                                                                                                                                                                                                                                                                  |
| TOTAL\_VACCINATIONS\_PER100              | Decimal | Cumulative total vaccine doses administered per 100 population                                                                                                                                                                                                                                                                                  |
| PERSONS\_VACCINATED\_1PLUS\_DOSE\_PER100 | Decimal | Cumulative persons vaccinated with at least one dose per 100 population                                                                                                                                                                                                                                                                         |
| PERSONS\_FULLY\_VACCINATED               | Integer | Cumulative number of persons fully vaccinated                                                                                                                                                                                                                                                                                                   |
| PERSONS\_FULLY\_VACCINATED\_PER100       | Decimal | Cumulative number of persons fully vaccinated per 100 population                                                                                                                                                                                                                                                                                |
| VACCINES\_USED                           | String  | Combined short name of vaccine: “Company - Product name” (see below)                                                                                                                                                                                                                                                                            |
| FIRST\_VACCINE\_DATE                     | Date    | Date of first vaccinations. Equivalent to start/launch date of the first vaccine administered in a country.                                                                                                                                                                                                                                     |
| NUMBER\_VACCINES\_TYPES\_USED            | Integer | Number of vaccine types used per country, territory, area                                                                                                                                                                                                                                                                                       |
| PERSONS\_BOOSTER\_ADD\_DOSE              | Integer | Persons received booster or additional dose                                                                                                                                                                                                                                                                                                     |
| PERSONS\_BOOSTER\_ADD\_DOSE\_PER100      | Decimal | Persons received booster or additional dose per 100 population                                                                                                                                                                                                                                                                                  |

# Checkpoint Two: Exploratory Data Analysis

Now that your chosen dataset is approved, it is time to start working on your analysis. Use this notebook to perform your EDA and make notes where directed to as you work.

## Getting Started

Since we have not provided your dataset for you, you will need to load the necessary files in this repository. Make sure to include a link back to the original dataset here as well.

My dataset: [WHO Coronavirus Dashboard](https://covid19.who.int/data)

Your first task in EDA is to import necessary libraries and create a dataframe(s). Make note in the form of code comments of what your thought process is as you work on this setup task.

In [1]:
# Import the appropriate libraries
import pandas as pd
import numpy as np
import matplotlib
from nltk.metrics import edit_distance  # Levenshtein edit distance between strings
import missingno as msno                # missing-data visualizations
import pandas_profiling as pp           # automated profiling reports

# Visualization imports
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
color = sns.color_palette()
get_ipython().run_line_magic('matplotlib', 'inline')
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px

# Print every expression's output from a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Show all columns when displaying wide dataframes
pd.set_option('display.max_columns', None)

In [2]:
# Import CSVs (local copies of the two WHO files linked above)

vacc_data = pd.read_csv(r"C:\Users\holly\Desktop\DAExercises\0 graded Assignment 4\WHO data\Core Data\vaccination-data.csv")
covid_stats = pd.read_csv(r"C:\Users\holly\Desktop\DAExercises\0 graded Assignment 4\WHO data\Core Data\WHO-COVID-19-global-table-data.csv")
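
The absolute paths above only resolve on one machine. A minimal sketch of a more portable loader follows, assuming the two CSVs are saved in a `data` folder next to the notebook; both the folder name and the `load_who_csv` helper are hypothetical and not part of the original setup.

```python
from pathlib import Path

import pandas as pd

# Hypothetical layout: the CSVs live in ./data next to the notebook.
DATA_DIR = Path("data")


def load_who_csv(filename: str, data_dir: Path = DATA_DIR) -> pd.DataFrame:
    """Read one CSV from data_dir, failing early with a clear message if it is missing."""
    path = data_dir / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected {path}; download it from the WHO links listed above."
        )
    return pd.read_csv(path)


# Usage (assumes both files were saved into ./data):
# vacc_data = load_who_csv("vaccination-data.csv")
# covid_stats = load_who_csv("WHO-COVID-19-global-table-data.csv")
```

Keeping the folder in one constant means anyone who clones the repository only has to change a single line, or nothing at all if they follow the same layout.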


## Get to Know the Numbers

Now that you have everything set up, put any code that you use to get to know the dataframe and its rows and columns better in the cell below. You can use whatever techniques you like, except for visualizations. You will put those in a separate section.

When working on your code, make sure to leave comments so that your mentors can understand your thought process.

**Ensuring that the data is properly imported**

In [3]:
#Viewing the first 5 rows of each dataset

vacc_data.head()

print('\n')

covid_stats.head()

Unnamed: 0,COUNTRY,ISO3,WHO_REGION,DATA_SOURCE,DATE_UPDATED,TOTAL_VACCINATIONS,PERSONS_VACCINATED_1PLUS_DOSE,TOTAL_VACCINATIONS_PER100,PERSONS_VACCINATED_1PLUS_DOSE_PER100,PERSONS_FULLY_VACCINATED,PERSONS_FULLY_VACCINATED_PER100,VACCINES_USED,FIRST_VACCINE_DATE,NUMBER_VACCINES_TYPES_USED,PERSONS_BOOSTER_ADD_DOSE,PERSONS_BOOSTER_ADD_DOSE_PER100
0,Afghanistan,AFG,EMRO,REPORTING,01/06/2022,6171652,5456919.0,15.854,14.018,4807917.0,12.351,"AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...",22/02/2021,11.0,,
1,Albania,ALB,EURO,REPORTING,29/05/2022,2873654,1320244.0,99.9,46.39,1241712.0,43.631,"AstraZeneca - Vaxzevria,Gamaleya - Gam-Covid-V...",13/01/2021,5.0,310249.0,10.901
2,Algeria,DZA,AFRO,REPORTING,05/06/2022,15205854,7840131.0,34.676,17.879,6481186.0,14.78,"Beijing CNBG - BBIBP-CorV,Gamaleya - Gam-Covid...",30/01/2021,4.0,514063.0,1.172
3,American Samoa,ASM,WPRO,REPORTING,05/06/2022,108220,44456.0,196.061,80.541,40945.0,74.18,"Janssen - Ad26.COV 2-S,Moderna - Spikevax,Pfiz...",21/12/2020,3.0,22893.0,41.475
4,Andorra,AND,EURO,REPORTING,29/05/2022,153072,57880.0,198.1,75.981,53450.0,70.166,"AstraZeneca - Vaxzevria,Moderna - Spikevax,Pfi...",20/01/2021,3.0,41742.0,54.796






Unnamed: 0,Name,WHO Region,Cases - cumulative total,Cases - cumulative total per 100000 population,Cases - newly reported in last 7 days,Cases - newly reported in last 7 days per 100000 population,Cases - newly reported in last 24 hours,Deaths - cumulative total,Deaths - cumulative total per 100000 population,Deaths - newly reported in last 7 days,Deaths - newly reported in last 7 days per 100000 population,Deaths - newly reported in last 24 hours
0,Global,,534495291,6857.303528,3485948,44.722945,571825,6311088,80.968059,8591,0.110218,1082
1,United States of America,Americas,84708007,25591.338,731052,220.86,114807,1001895,302.685,2340,0.707,297
2,India,South-East Asia,43245517,3133.723,55235,4.003,8822,524792,38.028,77,0.006,15
3,Brazil,Americas,31497038,14817.992,301920,142.04,40173,668180,314.35,1139,0.536,70
4,France,Europe,29009234,44602.583,251844,387.218,65425,145564,223.809,293,0.45,57


**Getting to know the dataframe and types better**

In [4]:
print('World Vaccination Data \n')
vacc_data.info()

print('\n Cumulative Cases and Deaths \n')

covid_stats.info()

World Vaccination Data 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228 entries, 0 to 227
Data columns (total 16 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   COUNTRY                               228 non-null    object 
 1   ISO3                                  228 non-null    object 
 2   WHO_REGION                            228 non-null    object 
 3   DATA_SOURCE                           228 non-null    object 
 4   DATE_UPDATED                          228 non-null    object 
 5   TOTAL_VACCINATIONS                    228 non-null    int64  
 6   PERSONS_VACCINATED_1PLUS_DOSE         227 non-null    float64
 7   TOTAL_VACCINATIONS_PER100             228 non-null    float64
 8   PERSONS_VACCINATED_1PLUS_DOSE_PER100  227 non-null    float64
 9   PERSONS_FULLY_VACCINATED              227 non-null    float64
 10  PERSONS_FULLY_VACCINATED_PER100       227 non-null    float64
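
The `info()` output lists `DATE_UPDATED` as `object`, meaning the dates are still plain strings. The values in `head()` (for example `22/02/2021`) are day-first, so one possible conversion, shown here on a few rows copied from that output, is:

```python
import pandas as pd

# Three DATE_UPDATED / FIRST_VACCINE_DATE pairs copied from the head() output above.
sample = pd.DataFrame({
    "COUNTRY": ["Afghanistan", "Albania", "American Samoa"],
    "DATE_UPDATED": ["01/06/2022", "29/05/2022", "05/06/2022"],
    "FIRST_VACCINE_DATE": ["22/02/2021", "13/01/2021", "21/12/2020"],
})

date_cols = ["DATE_UPDATED", "FIRST_VACCINE_DATE"]
for col in date_cols:
    # An explicit format avoids 01/06/2022 being read as 6 January.
    # errors="coerce" turns unparseable or missing entries into NaT.
    sample[col] = pd.to_datetime(sample[col], format="%d/%m/%Y", errors="coerce")

print(sample.dtypes)

# Days between the first vaccination and the last update, per country.
sample["DAYS_SINCE_FIRST_DOSE"] = (
    sample["DATE_UPDATED"] - sample["FIRST_VACCINE_DATE"]
).dt.days
print(sample[["COUNTRY", "DAYS_SINCE_FIRST_DOSE"]])
```

The same loop should apply to the full `vacc_data` frame, on the assumption that every row uses the `DD/MM/YYYY` layout seen in these samples.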

In [5]:
vacc_data.shape

print('\n')

covid_stats.shape

(228, 16)





(238, 12)

**Checking for missing data**

In [6]:
#percentage of missing data for each column and data set:

vacc_missing =(vacc_data.isna().mean().round(4))*100
vacc_missing

stats_missing =(covid_stats.isna().mean().round(4))*100
stats_missing

COUNTRY                                  0.00
ISO3                                     0.00
WHO_REGION                               0.00
DATA_SOURCE                              0.00
DATE_UPDATED                             0.00
TOTAL_VACCINATIONS                       0.00
PERSONS_VACCINATED_1PLUS_DOSE            0.44
TOTAL_VACCINATIONS_PER100                0.00
PERSONS_VACCINATED_1PLUS_DOSE_PER100     0.44
PERSONS_FULLY_VACCINATED                 0.44
PERSONS_FULLY_VACCINATED_PER100          0.44
VACCINES_USED                            1.32
FIRST_VACCINE_DATE                       9.21
NUMBER_VACCINES_TYPES_USED               1.32
PERSONS_BOOSTER_ADD_DOSE                14.04
PERSONS_BOOSTER_ADD_DOSE_PER100         14.04
dtype: float64

Name                                                            0.00
WHO Region                                                      0.42
Cases - cumulative total                                        0.00
Cases - cumulative total per 100000 population                  0.42
Cases - newly reported in last 7 days                           0.00
Cases - newly reported in last 7 days per 100000 population     0.42
Cases - newly reported in last 24 hours                         0.00
Deaths - cumulative total                                       0.00
Deaths - cumulative total per 100000 population                 0.42
Deaths - newly reported in last 7 days                          0.00
Deaths - newly reported in last 7 days per 100000 population    0.42
Deaths - newly reported in last 24 hours                        0.00
dtype: float64

In [7]:
#Sum of missing data

vacc_data.isnull().sum()

print('\n')

covid_stats.isnull().sum()

COUNTRY                                  0
ISO3                                     0
WHO_REGION                               0
DATA_SOURCE                              0
DATE_UPDATED                             0
TOTAL_VACCINATIONS                       0
PERSONS_VACCINATED_1PLUS_DOSE            1
TOTAL_VACCINATIONS_PER100                0
PERSONS_VACCINATED_1PLUS_DOSE_PER100     1
PERSONS_FULLY_VACCINATED                 1
PERSONS_FULLY_VACCINATED_PER100          1
VACCINES_USED                            3
FIRST_VACCINE_DATE                      21
NUMBER_VACCINES_TYPES_USED               3
PERSONS_BOOSTER_ADD_DOSE                32
PERSONS_BOOSTER_ADD_DOSE_PER100         32
dtype: int64





Name                                                            0
WHO Region                                                      1
Cases - cumulative total                                        0
Cases - cumulative total per 100000 population                  1
Cases - newly reported in last 7 days                           0
Cases - newly reported in last 7 days per 100000 population     1
Cases - newly reported in last 24 hours                         0
Deaths - cumulative total                                       0
Deaths - cumulative total per 100000 population                 1
Deaths - newly reported in last 7 days                          0
Deaths - newly reported in last 7 days per 100000 population    1
Deaths - newly reported in last 24 hours                        0
dtype: int64
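
The two checks above (percentage and count) can be folded into one table so each column's missingness is read in a single place. This is a small sketch with a hypothetical helper name, `missing_summary`, run on three rows copied from the `head()` output:

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, most-missing first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean().mul(100).round(2),
    })
    return summary.sort_values("n_missing", ascending=False)


# Booster columns for three rows from the head() output above;
# Afghanistan has no booster figures reported.
sample = pd.DataFrame({
    "COUNTRY": ["Afghanistan", "Albania", "Algeria"],
    "PERSONS_BOOSTER_ADD_DOSE": [np.nan, 310249.0, 514063.0],
    "PERSONS_BOOSTER_ADD_DOSE_PER100": [np.nan, 10.901, 1.172],
})
print(missing_summary(sample))
```

Calling `missing_summary(vacc_data)` and `missing_summary(covid_stats)` would give the same information as the two cells above in one view each.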

**Summary Stats on Vaccine data**

In [27]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)

vacc_data.describe().round(4)

# print('\n')

# # Alternative: use .agg on the numeric columns only
# vacc_data.select_dtypes(include='number').agg(["min", "max", "median", "skew", "mean"])



Unnamed: 0,TOTAL_VACCINATIONS,PERSONS_VACCINATED_1PLUS_DOSE,TOTAL_VACCINATIONS_PER100,PERSONS_VACCINATED_1PLUS_DOSE_PER100,PERSONS_FULLY_VACCINATED,PERSONS_FULLY_VACCINATED_PER100,NUMBER_VACCINES_TYPES_USED,PERSONS_BOOSTER_ADD_DOSE,PERSONS_BOOSTER_ADD_DOSE_PER100
count,228.0,227.0,228.0,227.0,227.0,227.0,225.0,196.0,196.0
mean,52036029.706,22870757.899,135.098,58.545,20800114.493,53.272,4.667,9541669.342,27.862
std,266402622.218,112292368.649,77.471,26.934,105442393.692,26.659,2.73,55596940.876,23.392
min,106.0,46.0,0.137,0.121,37.0,0.116,1.0,0.0,0.0
25%,382751.0,188454.5,71.125,38.486,173105.5,31.973,3.0,35225.25,5.955
50%,2954984.5,2092750.0,138.393,63.988,1556817.0,58.981,4.0,367440.0,23.293
75%,18355517.5,8211510.5,202.586,80.388,7129183.5,74.913,6.0,3275915.0,50.093
max,3397315599.0,1297285455.0,355.748,124.882,1266874166.0,122.944,12.0,756436639.0,107.922
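
The maxima above include person-based rates over 100 per 100 population (124.882 for at least one dose, 122.944 for fully vaccinated), which cannot be literally true and may point to a population denominator that does not match the people actually vaccinated. A minimal sketch for listing such rows follows; the helper name `flag_over_100` is hypothetical, and the last sample row is made up because the notebook does not show which territory holds the maxima.

```python
import pandas as pd

# Columns that count persons (not doses), so values above 100 per 100 people are suspect.
PERSON_RATE_COLS = [
    "PERSONS_VACCINATED_1PLUS_DOSE_PER100",
    "PERSONS_FULLY_VACCINATED_PER100",
]


def flag_over_100(df: pd.DataFrame, cols=PERSON_RATE_COLS) -> pd.DataFrame:
    """Return the rows where any person-based rate exceeds 100 per 100 population."""
    mask = df[cols].gt(100).any(axis=1)
    return df.loc[mask]


sample = pd.DataFrame({
    # The last row is hypothetical: it reuses the column maxima from describe()
    # because the notebook does not show which territory they belong to.
    "COUNTRY": ["Afghanistan", "Pitcairn Islands", "Hypothetical territory"],
    "PERSONS_VACCINATED_1PLUS_DOSE_PER100": [14.018, 92.0, 124.882],
    "PERSONS_FULLY_VACCINATED_PER100": [12.351, 74.0, 122.944],
})
print(flag_over_100(sample))
```

`TOTAL_VACCINATIONS_PER100` is deliberately left out, since doses per 100 people can legitimately exceed 100 when people receive more than one dose.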


**Summary Stats on Cases and deaths**

In [26]:
covid_stats.describe().round(4)
print('\n')

# # Alternative: use .agg on the numeric columns only
# covid_stats.select_dtypes(include='number').agg(["min", "max", "median", "skew", "mean"])

Unnamed: 0,Cases - cumulative total,Cases - cumulative total per 100000 population,Cases - newly reported in last 7 days,Cases - newly reported in last 7 days per 100000 population,Cases - newly reported in last 24 hours,Deaths - cumulative total,Deaths - cumulative total per 100000 population,Deaths - newly reported in last 7 days,Deaths - newly reported in last 7 days per 100000 population,Deaths - newly reported in last 24 hours
count,238.0,237.0,238.0,237.0,238.0,238.0,237.0,238.0,237.0,238.0
mean,4491557.067,15954.806,29293.681,115.712,4805.252,53034.353,113.53,72.193,0.415,9.092
std,35300645.671,16331.414,235096.656,231.739,38619.162,418255.855,120.332,586.608,3.385,73.878
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,19233.5,1186.863,15.25,0.405,0.0,153.0,13.754,0.0,0.0,0.0
50%,165040.0,10893.582,197.0,15.744,0.0,1474.5,73.847,0.0,0.0,0.0
75%,1091966.25,27005.418,1863.0,129.114,219.5,11852.5,174.253,5.0,0.13,0.0
max,534495291.0,70926.021,3485948.0,2131.827,571825.0,6311088.0,647.031,8591.0,51.733,1082.0
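
The maximum of 534,495,291 cumulative cases matches the `Global` row seen in `head()`, so these statistics mix a world aggregate with individual countries. One way to get country-level figures, sketched on five rows copied from that output, is to drop the aggregate before summarizing:

```python
import pandas as pd

# Five rows copied from the head() output above (cumulative columns only).
covid_sample = pd.DataFrame({
    "Name": ["Global", "United States of America", "India", "Brazil", "France"],
    "Cases - cumulative total": [534495291, 84708007, 43245517, 31497038, 29009234],
    "Deaths - cumulative total": [6311088, 1001895, 524792, 668180, 145564],
})

# Drop the aggregate row so it is not counted as if it were a country.
countries_only = covid_sample.loc[covid_sample["Name"] != "Global"]

# Summary statistics, including skew, over the numeric columns only.
stats = countries_only.select_dtypes(include="number").agg(
    ["min", "max", "median", "skew", "mean"]
)
print(stats)
```

On the full `covid_stats` frame the same filter would leave 237 rows, assuming `Global` is the only aggregate entry in the file.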






In [31]:
covid_stats.columns
print('\n')

vacc_data.columns

print('\n Countries with 0 reported cases \n')

covid_stats.loc[covid_stats['Cases - cumulative total'] == 0]

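print('\n Country with the fewest total vaccine doses administered \n')

# TOTAL_VACCINATIONS is a raw dose count, not a rate, so filter on its minimum
vacc_data.loc[vacc_data['TOTAL_VACCINATIONS'] == vacc_data['TOTAL_VACCINATIONS'].min()]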

Index(['Name', 'WHO Region', 'Cases - cumulative total',
       'Cases - cumulative total per 100000 population',
       'Cases - newly reported in last 7 days',
       'Cases - newly reported in last 7 days per 100000 population',
       'Cases - newly reported in last 24 hours', 'Deaths - cumulative total',
       'Deaths - cumulative total per 100000 population',
       'Deaths - newly reported in last 7 days',
       'Deaths - newly reported in last 7 days per 100000 population',
       'Deaths - newly reported in last 24 hours'],
      dtype='object')





Index(['COUNTRY', 'ISO3', 'WHO_REGION', 'DATA_SOURCE', 'DATE_UPDATED',
       'TOTAL_VACCINATIONS', 'PERSONS_VACCINATED_1PLUS_DOSE',
       'TOTAL_VACCINATIONS_PER100', 'PERSONS_VACCINATED_1PLUS_DOSE_PER100',
       'PERSONS_FULLY_VACCINATED', 'PERSONS_FULLY_VACCINATED_PER100',
       'VACCINES_USED', 'FIRST_VACCINE_DATE', 'NUMBER_VACCINES_TYPES_USED',
       'PERSONS_BOOSTER_ADD_DOSE', 'PERSONS_BOOSTER_ADD_DOSE_PER100'],
      dtype='object')


 Countries with 0 reported cases 



Unnamed: 0,Name,WHO Region,Cases - cumulative total,Cases - cumulative total per 100000 population,Cases - newly reported in last 7 days,Cases - newly reported in last 7 days per 100000 population,Cases - newly reported in last 24 hours,Deaths - cumulative total,Deaths - cumulative total per 100000 population,Deaths - newly reported in last 7 days,Deaths - newly reported in last 7 days per 100000 population,Deaths - newly reported in last 24 hours
233,Democratic People's Republic of Korea,South-East Asia,0,0.0,0,0.0,0,0,0.0,0,0.0,0
234,Pitcairn Islands,Western Pacific,0,0.0,0,0.0,0,0,0.0,0,0.0,0
235,Saint Helena,Africa,0,0.0,0,0.0,0,0,0.0,0,0.0,0
236,Tokelau,Western Pacific,0,0.0,0,0.0,0,0,0.0,0,0.0,0
237,Turkmenistan,Europe,0,0.0,0,0.0,0,0,0.0,0,0.0,0



 Country with the fewest total vaccine doses administered 



Unnamed: 0,COUNTRY,ISO3,WHO_REGION,DATA_SOURCE,DATE_UPDATED,TOTAL_VACCINATIONS,PERSONS_VACCINATED_1PLUS_DOSE,TOTAL_VACCINATIONS_PER100,PERSONS_VACCINATED_1PLUS_DOSE_PER100,PERSONS_FULLY_VACCINATED,PERSONS_FULLY_VACCINATED_PER100,VACCINES_USED,FIRST_VACCINE_DATE,NUMBER_VACCINES_TYPES_USED,PERSONS_BOOSTER_ADD_DOSE,PERSONS_BOOSTER_ADD_DOSE_PER100
162,Pitcairn Islands,PCN,WPRO,REPORTING,05/06/2022,106,46.0,212.0,92.0,37.0,74.0,AstraZeneca - Vaxzevria,17/05/2021,1.0,23.0,46.0
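
Pitcairn Islands has the smallest dose count (106) but also 212 doses per 100 population, so a raw count and a rate answer different questions. The sketch below contrasts the two on six rows copied from the outputs above; the results hold only for this sample, not for the full dataset.

```python
import pandas as pd

# Rows copied from the outputs above: the five head() rows plus Pitcairn Islands.
vacc_sample = pd.DataFrame({
    "COUNTRY": [
        "Afghanistan", "Albania", "Algeria",
        "American Samoa", "Andorra", "Pitcairn Islands",
    ],
    "TOTAL_VACCINATIONS": [6171652, 2873654, 15205854, 108220, 153072, 106],
    "TOTAL_VACCINATIONS_PER100": [15.854, 99.9, 34.676, 196.061, 198.1, 212.0],
})

# Fewest doses in absolute terms: dominated by population size.
fewest_doses = vacc_sample.nsmallest(1, "TOTAL_VACCINATIONS")

# Lowest rate: doses per 100 population, comparable across countries.
lowest_rate = vacc_sample.nsmallest(1, "TOTAL_VACCINATIONS_PER100")

print(fewest_doses[["COUNTRY", "TOTAL_VACCINATIONS"]])
print(lowest_rate[["COUNTRY", "TOTAL_VACCINATIONS_PER100"]])
```

Using `nsmallest` also avoids hard-coding a value such as 106, which would stop matching as soon as the WHO file is refreshed.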


## Visualize

Create any visualizations for your EDA here. Make note in the form of code comments of what your thought process is for your visualizations.

In [None]:
#pp.ProfileReport(vacc_data)

In [None]:
#pp.ProfileReport(covid_stats)
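
Comparing vaccination rates with death rates requires both tables in one dataframe, joined on the country name (`COUNTRY` in one file, `Name` in the other). A minimal sketch of that join on a few rows copied from the outputs above, using the merge indicator to show which names fail to line up:

```python
import pandas as pd

# A few rows copied from the outputs above; the full dataframes have many more.
vacc_sample = pd.DataFrame({
    "COUNTRY": ["Afghanistan", "Albania", "Pitcairn Islands"],
    "PERSONS_FULLY_VACCINATED_PER100": [12.351, 43.631, 74.0],
})
covid_sample = pd.DataFrame({
    "Name": ["Global", "France", "Pitcairn Islands"],
    "Deaths - cumulative total per 100000 population": [80.968059, 223.809, 0.0],
})

# Outer join on the country name; indicator=True adds a _merge column
# saying whether each row was found in the left table, the right table, or both.
merged = vacc_sample.merge(
    covid_sample,
    how="outer",
    left_on="COUNTRY",
    right_on="Name",
    indicator=True,
)

matched = merged.loc[merged["_merge"] == "both"]
unmatched = merged.loc[merged["_merge"] != "both"]
print(f"{len(matched)} matched, {len(unmatched)} unmatched")
```

Here most rows are unmatched simply because the two samples list different countries. On the full data, the unmatched rows would be the place to look for spelling differences between the files. Once the names are reconciled, `matched.plot.scatter(x="PERSONS_FULLY_VACCINATED_PER100", y="Deaths - cumulative total per 100000 population")` would be one way to get a first look at the relationship.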

## Summarize Your Results

With your EDA complete, answer the following questions.

1. Was there anything surprising about your dataset? 
2. Do you have any concerns about your dataset? 
3. Is there anything you want to make note of for the next phase of your analysis, which is cleaning data? 