# Cleaning COVID-19 Vaccination Dataset

### Data Sources
- Vaccinations Dataset: https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/vaccinations.csv

- Documentation/README: https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/README.md

### Importing Libraries:
`country_converter`: https://pypi.org/project/country-converter/

In [1]:
import pandas as pd
import numpy as np
import country_converter as coco

### 1. Load Dataset:

In [2]:
path_vaccinations = r'raw_data/vaccinations.csv'
df_csv = pd.read_csv(path_vaccinations)

In [3]:
# Create DataFrame copy to work with
df = df_csv.copy()

df.head(10)

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,,0.0,0.0,,,,,
1,Afghanistan,AFG,2021-02-23,,,,,,1367.0,,,,,33.0,1367.0,0.003
2,Afghanistan,AFG,2021-02-24,,,,,,1367.0,,,,,33.0,1367.0,0.003
3,Afghanistan,AFG,2021-02-25,,,,,,1367.0,,,,,33.0,1367.0,0.003
4,Afghanistan,AFG,2021-02-26,,,,,,1367.0,,,,,33.0,1367.0,0.003
5,Afghanistan,AFG,2021-02-27,,,,,,1367.0,,,,,33.0,1367.0,0.003
6,Afghanistan,AFG,2021-02-28,8200.0,8200.0,,,,1367.0,0.02,0.02,,,33.0,1367.0,0.003
7,Afghanistan,AFG,2021-03-01,,,,,,1580.0,,,,,38.0,1580.0,0.004
8,Afghanistan,AFG,2021-03-02,,,,,,1794.0,,,,,44.0,1794.0,0.004
9,Afghanistan,AFG,2021-03-03,,,,,,2008.0,,,,,49.0,2008.0,0.005


### 2. Drop Unnecessary Columns:

In [4]:
cols_to_drop = [
    'daily_vaccinations_raw', 
    'total_vaccinations_per_hundred', 
    'people_vaccinated_per_hundred',
    'people_fully_vaccinated_per_hundred',
    'total_boosters_per_hundred',
    'daily_vaccinations_per_million',
    'daily_people_vaccinated_per_hundred'
]

df_filtered = df.drop(columns = cols_to_drop)
df_filtered

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations,daily_people_vaccinated
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,1367.0
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,1367.0
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,1367.0
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,1367.0
...,...,...,...,...,...,...,...,...,...
195821,Zimbabwe,ZWE,2022-10-05,12219760.0,6436704.0,4750104.0,1032952.0,2076.0,638.0
195822,Zimbabwe,ZWE,2022-10-06,,,,,1714.0,563.0
195823,Zimbabwe,ZWE,2022-10-07,,,,,1529.0,462.0
195824,Zimbabwe,ZWE,2022-10-08,,,,,1344.0,361.0


### 3. Check Duplicate Rows:

In [5]:
df_filtered[df_filtered.duplicated()]

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations,daily_people_vaccinated
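
Called without arguments, `duplicated()` only flags rows that match in every column. A stricter check, sketched here on a small made-up frame, looks for repeated (`location`, `date`) pairs, since each location should appear at most once per day. The toy values are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy data: the last two rows share a (location, date) key
# but differ in value, so a full-row duplicate check does not flag them.
toy = pd.DataFrame({
    'location': ['Afghanistan', 'Afghanistan', 'Afghanistan'],
    'date': ['2021-02-22', '2021-02-23', '2021-02-23'],
    'total_vaccinations': [0.0, 1367.0, 1400.0],
})

# Rows identical in every column
full_row_dupes = toy[toy.duplicated()]

# Rows that share a (location, date) key; keep=False marks every member of the pair
key_dupes = toy[toy.duplicated(subset=['location', 'date'], keep=False)]

print(len(full_row_dupes))
print(len(key_dupes))
```

An empty result for the key-based check is a stronger guarantee that later per-location calculations see one row per day.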


### 4. Check Missing Values:

In [6]:
# Examine missing values
df_filtered.head(10)

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations,daily_people_vaccinated
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,1367.0
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,1367.0
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,1367.0
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,1367.0
5,Afghanistan,AFG,2021-02-27,,,,,1367.0,1367.0
6,Afghanistan,AFG,2021-02-28,8200.0,8200.0,,,1367.0,1367.0
7,Afghanistan,AFG,2021-03-01,,,,,1580.0,1580.0
8,Afghanistan,AFG,2021-03-02,,,,,1794.0,1794.0
9,Afghanistan,AFG,2021-03-03,,,,,2008.0,2008.0


### Observations:
- Many countries do not report vaccinations on a daily basis, resulting in missing values.
- According to the documentation, the `daily_vaccinations` and `daily_people_vaccinated` columns are calculated from the `total_vaccinations` and `people_vaccinated` columns, respectively. That is why the first value for each country is `NaN`.
    - These values are calculated by linearly interpolating the `total_vaccinations` and `people_vaccinated`, then taking the daily difference, and finally calculating a 7-day-window moving average for the daily counts (see [here](https://github.com/owid/covid-19-data/issues/333#issuecomment-763015298) for a detailed explanation).
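
The three steps described above (interpolate, difference, smooth) can be sketched on Afghanistan's first two reported totals from the table. This is not OWID's actual code, only an illustration of the idea; in particular `min_periods=1` is an assumption made here so that the short toy series produces values.

```python
import numpy as np
import pandas as pd

# Afghanistan's first two reported totals; the days in between are unreported (NaN)
dates = pd.date_range('2021-02-22', '2021-02-28', freq='D')
total = pd.Series([0.0, np.nan, np.nan, np.nan, np.nan, np.nan, 8200.0], index=dates)

interpolated = total.interpolate(method='linear')  # fill the gaps linearly
daily = interpolated.diff()                        # day-over-day change
# 7-day moving average; min_periods=1 is an assumption for this sketch
smoothed = daily.rolling(window=7, min_periods=1).mean().round()

print(smoothed)
```

The first day has no previous value to subtract, so it stays `NaN`; every later day comes out as 8200 / 6 rounded, which agrees with the 1367.0 shown in `daily_vaccinations` for these dates.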

### Calculations to Perform:
1. Fill in the missing values in `total_vaccinations`, `people_vaccinated`, `people_fully_vaccinated`, and `total_boosters` by linearly interpolating the missing values.
2. Create `daily_people_fully_vaccinated` and `daily_boosters` columns by taking the difference between the current and previous-day values of `people_fully_vaccinated` and `total_boosters`, respectively, then applying a 7-day-window moving average.

This will leave us with the following vaccination data columns:
- `total_vaccinations`, `people_vaccinated`, `people_fully_vaccinated`, `total_boosters`, `daily_vaccinations`, `daily_people_vaccinated`, `daily_people_fully_vaccinated`, `daily_boosters`

### Notes:
- The first row of each country's daily columns will be `NaN`. Some countries only started reporting some time after they began administering vaccinations, so their data starts with a non-zero `total_vaccinations` count, which makes it impossible to determine how many doses were administered on or before that day. **Therefore, these values will be kept as `NaN`.**
- For now, we keep the `iso_code` column in case `country-converter` cannot recognize country names from the `location` column.

### 5. Linearly Interpolating Missing Data:

In [7]:
# Linearly interpolates following columns for each group in groupby
def group_interpolate(group):
    columns = [
        'total_vaccinations',
        'people_vaccinated',
        'people_fully_vaccinated', 
        'total_boosters'
        ]
    return group[columns].interpolate(method='linear')

interpolated_cols = df_filtered.groupby('location').apply(group_interpolate).reset_index(drop=True).round()
interpolated_cols

Unnamed: 0,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters
0,0.0,0.0,,
1,1367.0,1367.0,,
2,2733.0,2733.0,,
3,4100.0,4100.0,,
4,5467.0,5467.0,,
...,...,...,...,...
195821,12219760.0,6436704.0,4750104.0,1032952.0
195822,12220508.0,6436980.0,4750396.0,1033133.0
195823,12221257.0,6437256.0,4750687.0,1033314.0
195824,12222006.0,6437532.0,4750978.0,1033495.0
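
Assigning the result back after `reset_index(drop=True)` relies on the rows coming back in the same order as the original DataFrame. A sketch of an alternative that does not depend on row order uses `groupby(...).transform`, which returns a result carrying the input's own index. The toy frame and the `total_interp` column name are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, deliberately not sorted by location
toy = pd.DataFrame({
    'location': ['B', 'A', 'B', 'A', 'B', 'A'],
    'total_vaccinations': [10.0, 0.0, np.nan, np.nan, 30.0, 100.0],
})

# transform keeps the original index, so the interpolated values line up
# with their source rows without any index reset
toy['total_interp'] = (
    toy.groupby('location')['total_vaccinations']
       .transform(lambda s: s.interpolate(method='linear'))
)

print(toy)
```

Each gap is filled only from values of the same location, even though the two locations are interleaved.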


In [8]:
df_interpolated = df_filtered.copy()

# Update df with interpolated values
df_interpolated[['total_vaccinations', 'people_vaccinated',
                 'people_fully_vaccinated', 'total_boosters']] = interpolated_cols
df_interpolated

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations,daily_people_vaccinated
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,
1,Afghanistan,AFG,2021-02-23,1367.0,1367.0,,,1367.0,1367.0
2,Afghanistan,AFG,2021-02-24,2733.0,2733.0,,,1367.0,1367.0
3,Afghanistan,AFG,2021-02-25,4100.0,4100.0,,,1367.0,1367.0
4,Afghanistan,AFG,2021-02-26,5467.0,5467.0,,,1367.0,1367.0
...,...,...,...,...,...,...,...,...,...
195821,Zimbabwe,ZWE,2022-10-05,12219760.0,6436704.0,4750104.0,1032952.0,2076.0,638.0
195822,Zimbabwe,ZWE,2022-10-06,12220508.0,6436980.0,4750396.0,1033133.0,1714.0,563.0
195823,Zimbabwe,ZWE,2022-10-07,12221257.0,6437256.0,4750687.0,1033314.0,1529.0,462.0
195824,Zimbabwe,ZWE,2022-10-08,12222006.0,6437532.0,4750978.0,1033495.0,1344.0,361.0


### 6. Calculating `daily_people_fully_vaccinated` and `daily_boosters` Columns:

In [9]:
# Calculates daily new counts for following columns
def diff_within_group(group):
    columns = [
        'people_fully_vaccinated',
        'total_boosters'
    ]
    return group[columns].diff()

diff_cols = (
    df_interpolated
    .groupby('location')
    .apply(diff_within_group)
    .reset_index(drop=True)
)
diff_cols

Unnamed: 0,people_fully_vaccinated,total_boosters
0,,
1,,
2,,
3,,
4,,
...,...,...
195821,582.0,507.0
195822,292.0,181.0
195823,291.0,181.0
195824,291.0,181.0
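
pandas also offers `diff()` directly on a grouped selection, which keeps the original index and leaves `NaN` in the first row of every group. A minimal sketch on hypothetical toy values:

```python
import pandas as pd

# Hypothetical cumulative counts for two locations
toy = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B'],
    'people_fully_vaccinated': [10.0, 15.0, 25.0, 100.0, 130.0],
    'total_boosters': [0.0, 2.0, 5.0, 50.0, 50.0],
})

cols = ['people_fully_vaccinated', 'total_boosters']

# Difference to the previous row within each location
daily = toy.groupby('location')[cols].diff()

print(daily)
```

The `NaN` at the start of location `B` shows that the difference is not taken across the boundary between two locations.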


In [10]:
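# Calculate 7-day moving average of daily values within each location,
# so that a window never spans two countries
moving_avg_cols = (diff_cols.groupby(df_interpolated['location'])
                   .transform(lambda s: s.rolling(window=7, min_periods=1).mean()).round())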
moving_avg_cols

Unnamed: 0,people_fully_vaccinated,total_boosters
0,,
1,,
2,,
3,,
4,,
...,...,...
195821,677.0,762.0
195822,597.0,553.0
195823,564.0,503.0
195824,531.0,453.0
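
Whether the rolling window is computed over the whole column or separately for each location matters at the boundary between two countries. The sketch below uses a hypothetical toy frame and a 3-row window (instead of 7) to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical daily values; each location starts with NaN, as in the real data
toy = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'daily': [np.nan, 100.0, 100.0, np.nan, 1.0, 1.0],
})

# Window over the whole column: B's first rows still include A's values
global_avg = toy['daily'].rolling(window=3, min_periods=1).mean()

# Window restarted for each location
grouped_avg = (
    toy.groupby('location')['daily']
       .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)

print(pd.DataFrame({'global': global_avg, 'grouped': grouped_avg}))
```

In the whole-column version, the first two rows of `B` are averaged together with values from `A`; in the grouped version they depend on `B` alone.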


In [11]:
df_new_cols = df_interpolated.copy()

# Update df with new moving average values
df_new_cols[['daily_people_fully_vaccinated', 'daily_boosters']] = moving_avg_cols

df_new_cols

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations,daily_people_vaccinated,daily_people_fully_vaccinated,daily_boosters
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,,,
1,Afghanistan,AFG,2021-02-23,1367.0,1367.0,,,1367.0,1367.0,,
2,Afghanistan,AFG,2021-02-24,2733.0,2733.0,,,1367.0,1367.0,,
3,Afghanistan,AFG,2021-02-25,4100.0,4100.0,,,1367.0,1367.0,,
4,Afghanistan,AFG,2021-02-26,5467.0,5467.0,,,1367.0,1367.0,,
...,...,...,...,...,...,...,...,...,...,...,...
195821,Zimbabwe,ZWE,2022-10-05,12219760.0,6436704.0,4750104.0,1032952.0,2076.0,638.0,677.0,762.0
195822,Zimbabwe,ZWE,2022-10-06,12220508.0,6436980.0,4750396.0,1033133.0,1714.0,563.0,597.0,553.0
195823,Zimbabwe,ZWE,2022-10-07,12221257.0,6437256.0,4750687.0,1033314.0,1529.0,462.0,564.0,503.0
195824,Zimbabwe,ZWE,2022-10-08,12222006.0,6437532.0,4750978.0,1033495.0,1344.0,361.0,531.0,453.0


### 7. Check for Impossible Values:

In [12]:
# Check for negative values of daily count
df_new_cols[df_new_cols['daily_people_fully_vaccinated'] < 0]

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations,daily_people_vaccinated,daily_people_fully_vaccinated,daily_boosters


In [13]:
# Check for negative values of daily count
df_new_cols[df_new_cols['daily_boosters'] < 0]

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations,daily_people_vaccinated,daily_people_fully_vaccinated,daily_boosters
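
The two checks above can be generalized. The sketch below, on hypothetical toy data, tests every `daily_*` column for negative values at once and also lists locations whose cumulative total ever decreases, which is another sign of inconsistent reporting.

```python
import pandas as pd

# Hypothetical toy data; B's cumulative total drops on its second row
toy = pd.DataFrame({
    'location': ['A', 'A', 'B', 'B'],
    'total_boosters': [5.0, 8.0, 20.0, 18.0],
    'daily_boosters': [None, 3.0, None, -2.0],
})

daily_cols = [c for c in toy.columns if c.startswith('daily_')]

# Rows where any daily column is negative (NaN compares as False)
negative_rows = toy[(toy[daily_cols] < 0).any(axis=1)]

# Locations whose cumulative total is not monotonically non-decreasing
decreasing = [
    loc
    for loc, s in toy.groupby('location')['total_boosters']
    if not s.dropna().is_monotonic_increasing
]

print(negative_rows)
print(decreasing)
```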


### 8. Standardize Country Names:
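- Standardize country names for consistency and remove non-country entries (OWID aggregates such as continents and income groups) using the `country_converter` package. The conversion is run on the `iso_code` column.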

In [14]:
# Create dict that maps each input identifier (here, an ISO code) to its standardized short name
def convert_country_names(country_names):
    standard_names = coco.convert(names=country_names, to='name_short')
    country_convert_dict = dict(zip(country_names, standard_names))
    return country_convert_dict

# Extract unique ISO codes
location_names = df_new_cols['iso_code'].unique()

# Convert ISO codes and create dict to match each code with its standardized name
country_dict = convert_country_names(location_names)
country_dict

OWID_AFR not found in regex
OWID_ASI not found in regex
OWID_ENG not found in regex
OWID_EUR not found in regex
OWID_EUN not found in regex
OWID_HIC not found in regex
OWID_KOS not found in regex
OWID_LIC not found in regex
OWID_LMC not found in regex
OWID_NAM not found in regex
OWID_CYN not found in regex
OWID_NIR not found in regex
OWID_OCE not found in regex
OWID_SCT not found in regex
OWID_SAM not found in regex
OWID_UMC not found in regex
OWID_WLS not found in regex
OWID_WRL not found in regex


{'AFG': 'Afghanistan',
 'OWID_AFR': 'not found',
 'ALB': 'Albania',
 'DZA': 'Algeria',
 'AND': 'Andorra',
 'AGO': 'Angola',
 'AIA': 'Anguilla',
 'ATG': 'Antigua and Barbuda',
 'ARG': 'Argentina',
 'ARM': 'Armenia',
 'ABW': 'Aruba',
 'OWID_ASI': 'not found',
 'AUS': 'Australia',
 'AUT': 'Austria',
 'AZE': 'Azerbaijan',
 'BHS': 'Bahamas',
 'BHR': 'Bahrain',
 'BGD': 'Bangladesh',
 'BRB': 'Barbados',
 'BLR': 'Belarus',
 'BEL': 'Belgium',
 'BLZ': 'Belize',
 'BEN': 'Benin',
 'BMU': 'Bermuda',
 'BTN': 'Bhutan',
 'BOL': 'Bolivia',
 'BES': 'Bonaire, Saint Eustatius and Saba',
 'BIH': 'Bosnia and Herzegovina',
 'BWA': 'Botswana',
 'BRA': 'Brazil',
 'VGB': 'British Virgin Islands',
 'BRN': 'Brunei Darussalam',
 'BGR': 'Bulgaria',
 'BFA': 'Burkina Faso',
 'BDI': 'Burundi',
 'KHM': 'Cambodia',
 'CMR': 'Cameroon',
 'CAN': 'Canada',
 'CPV': 'Cabo Verde',
 'CYM': 'Cayman Islands',
 'CAF': 'Central African Republic',
 'TCD': 'Chad',
 'CHL': 'Chile',
 'CHN': 'China',
 'COL': 'Colombia',
 'COM': 'Comoros',
 ...}
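
Before dropping anything, it can help to list exactly which codes were not matched. A minimal sketch using an excerpt of the mapping shown above; the name `country_dict_excerpt` is hypothetical and stands in for the full `country_dict`.

```python
# Excerpt of the mapping printed above (hypothetical variable name)
country_dict_excerpt = {
    'AFG': 'Afghanistan',
    'OWID_AFR': 'not found',
    'ALB': 'Albania',
    'OWID_ASI': 'not found',
}

# Codes that could not be matched to a country
unmatched = sorted(
    code for code, name in country_dict_excerpt.items() if name == 'not found'
)
print(unmatched)

# Check whether every unmatched code carries OWID's own prefix
only_owid = all(code.startswith('OWID_') for code in unmatched)
print(only_owid)
```

The log above also lists `OWID_KOS`, the code OWID uses for Kosovo, so it is worth reviewing the unmatched codes rather than assuming all of them are aggregates.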

In [15]:
# Create copy of df
df_standardized = df_new_cols.copy()

# Replace old country names with standardized names
df_standardized['location'] = df_standardized['iso_code'].replace(country_dict)

# Drop non-country locations by location == 'not found';
# .copy() avoids a SettingWithCopyWarning on later column assignments
df_standardized = df_standardized[df_standardized['location'] != 'not found'].copy()
df_standardized

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations,daily_people_vaccinated,daily_people_fully_vaccinated,daily_boosters
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,,,
1,Afghanistan,AFG,2021-02-23,1367.0,1367.0,,,1367.0,1367.0,,
2,Afghanistan,AFG,2021-02-24,2733.0,2733.0,,,1367.0,1367.0,,
3,Afghanistan,AFG,2021-02-25,4100.0,4100.0,,,1367.0,1367.0,,
4,Afghanistan,AFG,2021-02-26,5467.0,5467.0,,,1367.0,1367.0,,
...,...,...,...,...,...,...,...,...,...,...,...
195821,Zimbabwe,ZWE,2022-10-05,12219760.0,6436704.0,4750104.0,1032952.0,2076.0,638.0,677.0,762.0
195822,Zimbabwe,ZWE,2022-10-06,12220508.0,6436980.0,4750396.0,1033133.0,1714.0,563.0,597.0,553.0
195823,Zimbabwe,ZWE,2022-10-07,12221257.0,6437256.0,4750687.0,1033314.0,1529.0,462.0,564.0,503.0
195824,Zimbabwe,ZWE,2022-10-08,12222006.0,6437532.0,4750978.0,1033495.0,1344.0,361.0,531.0,453.0


### 9. Convert `date` to dtype `datetime`:

In [16]:
# Convert date column to datetime for consistency
df_standardized['date'] = pd.to_datetime(df_standardized['date'])
df_standardized

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations,daily_people_vaccinated,daily_people_fully_vaccinated,daily_boosters
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,,,
1,Afghanistan,AFG,2021-02-23,1367.0,1367.0,,,1367.0,1367.0,,
2,Afghanistan,AFG,2021-02-24,2733.0,2733.0,,,1367.0,1367.0,,
3,Afghanistan,AFG,2021-02-25,4100.0,4100.0,,,1367.0,1367.0,,
4,Afghanistan,AFG,2021-02-26,5467.0,5467.0,,,1367.0,1367.0,,
...,...,...,...,...,...,...,...,...,...,...,...
195821,Zimbabwe,ZWE,2022-10-05,12219760.0,6436704.0,4750104.0,1032952.0,2076.0,638.0,677.0,762.0
195822,Zimbabwe,ZWE,2022-10-06,12220508.0,6436980.0,4750396.0,1033133.0,1714.0,563.0,597.0,553.0
195823,Zimbabwe,ZWE,2022-10-07,12221257.0,6437256.0,4750687.0,1033314.0,1529.0,462.0,564.0,503.0
195824,Zimbabwe,ZWE,2022-10-08,12222006.0,6437532.0,4750978.0,1033495.0,1344.0,361.0,531.0,453.0


### 10. Rename Columns and Drop `iso_code` Column:

In [17]:
# Rename column for consistency and drop unnecessary column
df_cleaned = df_standardized.rename(columns={'location': 'country'}).drop(columns='iso_code')
df_cleaned

Unnamed: 0,country,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations,daily_people_vaccinated,daily_people_fully_vaccinated,daily_boosters
0,Afghanistan,2021-02-22,0.0,0.0,,,,,,
1,Afghanistan,2021-02-23,1367.0,1367.0,,,1367.0,1367.0,,
2,Afghanistan,2021-02-24,2733.0,2733.0,,,1367.0,1367.0,,
3,Afghanistan,2021-02-25,4100.0,4100.0,,,1367.0,1367.0,,
4,Afghanistan,2021-02-26,5467.0,5467.0,,,1367.0,1367.0,,
...,...,...,...,...,...,...,...,...,...,...
195821,Zimbabwe,2022-10-05,12219760.0,6436704.0,4750104.0,1032952.0,2076.0,638.0,677.0,762.0
195822,Zimbabwe,2022-10-06,12220508.0,6436980.0,4750396.0,1033133.0,1714.0,563.0,597.0,553.0
195823,Zimbabwe,2022-10-07,12221257.0,6437256.0,4750687.0,1033314.0,1529.0,462.0,564.0,503.0
195824,Zimbabwe,2022-10-08,12222006.0,6437532.0,4750978.0,1033495.0,1344.0,361.0,531.0,453.0


### 11. Export to .csv File:

In [18]:
path_export = r'cleaned_data/covid_vaccinations.csv'
df_cleaned.to_csv(path_export)
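
By default `to_csv` also writes the DataFrame index, which comes back as an unnamed first column when the file is read again, and the `datetime` dtype does not survive a CSV round trip on its own. The sketch below shows one way to handle both, using a hypothetical toy frame and a temporary directory instead of the real export path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame with a non-contiguous index, as left behind by row filtering
toy = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan'],
    'date': pd.to_datetime(['2021-02-22', '2021-02-23']),
    'total_vaccinations': [0.0, 1367.0],
}, index=[0, 5])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'covid_vaccinations.csv')
    toy.to_csv(path, index=False)                        # omit the index column
    reloaded = pd.read_csv(path, parse_dates=['date'])   # restore the datetime dtype

print(reloaded.columns.tolist())
print(reloaded.dtypes['date'])
```

Whether to keep the index in the exported file is a design choice; dropping it is reasonable here because the index carries no information beyond the original row position.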