# Project 2


### Dataset 3


https://www.kaggle.com/gpreda/covid-world-vaccination-progress

I found an amazing dataset on Kaggle that tracks worldwide COVID-19 vaccination progress. The table is wide (15 columns), and many of its columns contain null values.

This dataset can answer several interesting questions, including the ones already suggested on Kaggle.

Taken from Kaggle:

- In which country is the vaccination programme more advanced?
- Where are more people being vaccinated per day? And as a percentage of the entire population?

Other questions the dataset might be able to answer:

- Which countries are lagging behind in COVID-19 vaccinations?
- Which countries have the highest daily vaccinations on a certain date?


In [1]:
# Import libraries
import pandas as pd

In [2]:
# Load the dataset and unpack its dimensions
df_1 = pd.read_csv('dataset_3.csv')
rows, cols = df_1.shape
print(f"The dataset has {rows} rows and {cols} columns")
print("\nColumns:")
print("\n".join(df_1.columns.to_list()))
print("\n\n")

The dataset has 5457 rows and 15 columns

Columns:
country
iso_code
date
total_vaccinations
people_vaccinated
people_fully_vaccinated
daily_vaccinations_raw
daily_vaccinations
total_vaccinations_per_hundred
people_vaccinated_per_hundred
people_fully_vaccinated_per_hundred
daily_vaccinations_per_million
vaccines
source_name
source_website





In [4]:
df_1.head()

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
0,Albania,ALB,2021-01-10,0.0,0.0,,,,0.0,0.0,,,Pfizer/BioNTech,Ministry of Health,https://shendetesia.gov.al/covid19-ministria-e...
1,Albania,ALB,2021-01-11,,,,,64.0,,,,22.0,Pfizer/BioNTech,Ministry of Health,https://shendetesia.gov.al/covid19-ministria-e...
2,Albania,ALB,2021-01-12,128.0,128.0,,,64.0,0.0,0.0,,22.0,Pfizer/BioNTech,Ministry of Health,https://shendetesia.gov.al/covid19-ministria-e...
3,Albania,ALB,2021-01-13,188.0,188.0,,60.0,63.0,0.01,0.01,,22.0,Pfizer/BioNTech,Ministry of Health,https://shendetesia.gov.al/covid19-ministria-e...
4,Albania,ALB,2021-01-14,266.0,266.0,,78.0,66.0,0.01,0.01,,23.0,Pfizer/BioNTech,Ministry of Health,https://shendetesia.gov.al/covid19-ministria-e...
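
The preview already shows several empty cells. A minimal sketch of how the null values could be counted per column is given below. It rebuilds the first three Albania rows by hand (restricted to four columns) so that it runs on its own; the names `sample`, `null_counts`, `null_share` and `summary` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# First three Albania rows from the preview, restricted to four columns.
sample = pd.DataFrame({
    "country": ["Albania", "Albania", "Albania"],
    "total_vaccinations": [0.0, np.nan, 128.0],
    "people_fully_vaccinated": [np.nan, np.nan, np.nan],
    "daily_vaccinations": [np.nan, 64.0, 64.0],
})

# Count missing values per column and express them as a percentage of all rows.
null_counts = sample.isna().sum()
null_share = (sample.isna().mean() * 100).round(1)

summary = pd.DataFrame({"nulls": null_counts, "percent": null_share})
print(summary.sort_values("nulls", ascending=False))
```

Applied to the full `df_1`, the same two calls would show which columns are mostly empty before deciding how to fill them.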


## Cleaning
I would like to summarize the data by country. The date is not important for this part of the analysis.
I will drop `date`, `iso_code`, `source_name`, `source_website` and `vaccines`, and replace the remaining null values with 0.

In [5]:
df_1 = df_1.drop(columns=['date', 'iso_code', 'source_name', 'source_website', 'vaccines'])
df_1 = df_1.fillna(0)
df_1.head()

Unnamed: 0,country,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
0,Albania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Albania,0.0,0.0,0.0,0.0,64.0,0.0,0.0,0.0,22.0
2,Albania,128.0,128.0,0.0,0.0,64.0,0.0,0.0,0.0,22.0
3,Albania,188.0,188.0,0.0,60.0,63.0,0.01,0.01,0.0,22.0
4,Albania,266.0,266.0,0.0,78.0,66.0,0.01,0.01,0.0,23.0


In [6]:
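# Cumulative columns keep their maximum; per-day columns are summed or averaged.
# String names such as 'max' avoid the warning newer pandas raises for the builtin max.
aggregations = {
    'total_vaccinations': 'max',
    'people_vaccinated': 'max',
    'people_fully_vaccinated': 'max',
    'daily_vaccinations_raw': 'mean',
    'daily_vaccinations': 'sum',
    'total_vaccinations_per_hundred': 'max',
    'people_vaccinated_per_hundred': 'max',
    'people_fully_vaccinated_per_hundred': 'max',
    'daily_vaccinations_per_million': 'mean',
}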
grouped_df = df_1.groupby('country', as_index=False).agg(aggregations)

grouped_df.shape

(129, 10)

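After testing several aggregations and inspecting the output, I arrived at a DataFrame with one row per country.
Each column needs its own aggregation: per-day counts are summed or averaged, while cumulative columns (the totals and the per-hundred figures) take the maximum, because each row already holds the value accumulated up to that date.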
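
The reasoning above can be sketched on the five Albania rows shown earlier. This is a small illustration rather than the notebook's exact pipeline: it uses only three columns, and the names `raw`, `filled` and `unfilled` are made up for the example. It also compares the result with and without `fillna(0)`, since pandas skips nulls when averaging and filling them with 0 first lowers the mean.

```python
import numpy as np
import pandas as pd

# The five Albania rows shown earlier, before any null filling.
raw = pd.DataFrame({
    "country": ["Albania"] * 5,
    "total_vaccinations": [0.0, np.nan, 128.0, 188.0, 266.0],
    "daily_vaccinations_raw": [np.nan, np.nan, np.nan, 60.0, 78.0],
    "daily_vaccinations": [np.nan, 64.0, 64.0, 63.0, 66.0],
})

aggregations = {
    "total_vaccinations": "max",       # running total: keep the largest value
    "daily_vaccinations": "sum",       # per-day counts: add them up
    "daily_vaccinations_raw": "mean",  # per-day counts: average them
}

# Same aggregation with and without replacing nulls by 0 first.
filled = raw.fillna(0).groupby("country", as_index=False).agg(aggregations)
unfilled = raw.groupby("country", as_index=False).agg(aggregations)

print(filled)
print(unfilled)
```

On these rows the maximum and the sum are identical in both variants, but the mean of `daily_vaccinations_raw` is 27.6 after filling and 69.0 without it, so the averaged columns in `grouped_df` are best read with that in mind.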

In [7]:
grouped_df[grouped_df['country'].str.contains('United States')]

Unnamed: 0,country,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
124,United States,90351750.0,58873710.0,30686881.0,997216.564103,82977935.0,27.02,17.6,9.18,3180.897436


# Analysis

### Some analyses that could be performed on the data:
- In which country is the vaccination programme more advanced?
- Where are more people being vaccinated per day? And as a percentage of the entire population?

Other questions the dataset might be able to answer:

- Which countries are lagging behind in COVID-19 vaccinations?
- Which countries have the highest daily vaccinations on a certain date?


### 1. In which country is the vaccination programme more advanced?

I take this to be the country with the highest value of `people_fully_vaccinated_per_hundred`. I'll list the top 5.

In [8]:
grouped_df.sort_values('people_fully_vaccinated_per_hundred', ascending=False)[['country', 'people_fully_vaccinated_per_hundred']].head()

Unnamed: 0,country,people_fully_vaccinated_per_hundred
41,Gibraltar,46.22
57,Israel,43.78
107,Seychelles,25.04
122,United Arab Emirates,22.12
14,Bermuda,13.88


### 2. Where are more people being vaccinated per day? 

In [9]:
grouped_df.sort_values('daily_vaccinations', ascending=False)[['country', 'daily_vaccinations']].head()

Unnamed: 0,country,daily_vaccinations
124,United States,82977935.0
22,China,49687760.0
123,United Kingdom,22248026.0
34,England,18769920.0
52,India,18545913.0


### And as a percentage of the entire population?
To answer this question, we need the total population by country.
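
As a possible alternative to an external population table, the dataset's own `daily_vaccinations_per_million` column appears to be normalized by population already. The sketch below assumes that the column means doses per day per million inhabitants; it uses only the per-million values visible in the Albania and United States rows printed in this notebook, and the names `sample`, `per_million` and `percent_per_day` are illustrative.

```python
import numpy as np
import pandas as pd

# daily_vaccinations_per_million values copied from rows printed in this notebook.
sample = pd.DataFrame({
    "country": ["Albania"] * 5 + ["United States"] * 5,
    "daily_vaccinations_per_million": [
        np.nan, 22.0, 22.0, 22.0, 23.0,
        6012.0, 6108.0, 6217.0, 6453.0, 6457.0,
    ],
})

# Average daily doses per million inhabitants, highest first (nulls are skipped).
per_million = (
    sample.groupby("country")["daily_vaccinations_per_million"]
    .mean()
    .sort_values(ascending=False)
)

# One million people is 100%, so dividing by 10,000 gives percent of population per day.
percent_per_day = per_million / 10_000
print(percent_per_day)
```

Because only a handful of rows per country are used here, the numbers illustrate the method and should not be read as results for the full dataset.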

### 3. Which countries are lagging behind in COVID-19 vaccinations?


In [10]:
grouped_df.sort_values('people_fully_vaccinated_per_hundred', ascending=True)[['country', 'people_fully_vaccinated_per_hundred']].head()

Unnamed: 0,country,people_fully_vaccinated_per_hundred
128,Zimbabwe,0.0
32,Egypt,0.0
33,El Salvador,0.0
90,Panama,0.0
89,Pakistan,0.0


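Many countries show no fully vaccinated people. Because null values were replaced with 0 during cleaning, a 0 here can also mean that the figure was not reported. Therefore, I will analyze only the countries with a value above 0.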

In [11]:
# Keep only countries that report at least some fully vaccinated people
index_filtered = grouped_df['people_fully_vaccinated_per_hundred'] > 0

grouped_df[index_filtered].sort_values('people_fully_vaccinated_per_hundred', ascending=True)[['country', 'people_fully_vaccinated_per_hundred']].head()

Unnamed: 0,country,people_fully_vaccinated_per_hundred
62,Kazakhstan,0.01
0,Albania,0.02
31,Ecuador,0.04
15,Bolivia,0.08
111,South Africa,0.17


Now we can see the 5 countries with the fewest fully vaccinated people per hundred, among those that report at least one.
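
The same filter-then-rank step can also be written with `DataFrame.nsmallest`, which avoids sorting the whole table. The sketch below is self-contained: `ranking` is a hand-built frame holding seven of the per-hundred values printed above, and the variable names are illustrative.

```python
import pandas as pd

# Seven country values copied from the outputs above.
ranking = pd.DataFrame({
    "country": ["Zimbabwe", "Egypt", "Kazakhstan", "Albania",
                "Ecuador", "Bolivia", "South Africa"],
    "people_fully_vaccinated_per_hundred": [0.0, 0.0, 0.01, 0.02, 0.04, 0.08, 0.17],
})

# Drop countries with a value of 0, then take the three smallest remaining values.
started = ranking[ranking["people_fully_vaccinated_per_hundred"] > 0]
lowest = started.nsmallest(3, "people_fully_vaccinated_per_hundred")
print(lowest)
```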

### 4. Which countries have the highest daily vaccinations on a certain date?
#### To answer this question we need to go back to the initial data, which still has the date column.

In [12]:
# Reload the original dataset, including the date column
df_2 = pd.read_csv('dataset_3.csv')

In [13]:
df_2[df_2['daily_vaccinations']==df_2['daily_vaccinations'].max()]

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
5331,United States,USA,2021-03-07,90351750.0,58873710.0,30686881.0,2439427.0,2159392.0,27.02,17.6,9.18,6457.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...


### The United States has the highest number of daily vaccinations recorded on a single day: 2,159,392 on 2021-03-07
--------
#### Below we list the top 5 rows

In [14]:
df_2.sort_values('daily_vaccinations', ascending=False).head()

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
5331,United States,USA,2021-03-07,90351750.0,58873710.0,30686881.0,2439427.0,2159392.0,27.02,17.6,9.18,6457.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
5330,United States,USA,2021-03-06,87912323.0,57358849.0,29776160.0,2904229.0,2158020.0,26.29,17.15,8.9,6453.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
5329,United States,USA,2021-03-05,85008094.0,55547697.0,28701201.0,2435246.0,2079147.0,25.42,16.61,8.58,6217.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
5328,United States,USA,2021-03-04,82572848.0,54035670.0,27795980.0,2032374.0,2042676.0,24.69,16.16,8.31,6108.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
5327,United States,USA,2021-03-03,80540474.0,52855579.0,26957804.0,1908873.0,2010790.0,24.08,15.8,8.06,6012.0,"Moderna, Pfizer/BioNTech",Centers for Disease Control and Prevention,https://covid.cdc.gov/covid-data-tracker/#vacc...
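
The ranking above covers all dates at once, so a single country can fill every slot. To answer the question for one specific date, the rows could be filtered by date first. Below is a minimal sketch under stated assumptions: `top_countries_on` is a hypothetical helper, the `toy` frame uses placeholder countries A, B and C with made-up numbers, and `date` is assumed to be stored as a string in `YYYY-MM-DD` form, as it is when the CSV is read without date parsing.

```python
import pandas as pd


def top_countries_on(df, date, n=5):
    """Return the n rows with the highest daily_vaccinations on one date."""
    on_date = df[df["date"] == date]
    columns = ["country", "date", "daily_vaccinations"]
    return on_date.nlargest(n, "daily_vaccinations")[columns]


# Made-up values for placeholder countries, for illustration only.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "date": ["2021-03-06"] * 3 + ["2021-03-07"] * 3,
    "daily_vaccinations": [100.0, 300.0, 200.0, 150.0, float("nan"), 250.0],
})

result = top_countries_on(toy, "2021-03-07", n=2)
print(result)
```

With the real data, the call would be `top_countries_on(df_2, "2021-03-07")`; rows with a missing `daily_vaccinations` value are ranked last by `nlargest`.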
