# Northeastern COVID-19 Vaccinations
A look into the distribution of vaccines across the urban and rural parts of New York, New Jersey and Pennsylvania

Data Exploration and Analysis | DSC-530 T301 <br>
Jeremy Barton <br>
07.04.2025

<b>Contents</b>

[Introduction](#introduction)

[Preprocessing](#preprocessing)

1. [Data Selection](#data-selection)
2. [Data Cleaning and Preparation](#data-cleaning-and-preparation)

[Exploratory Data Analysis](#exploratory-data-analysis)

1. [Variable Descriptions with Summary Statistics](#variable-descriptions-with-summary-statistics)

## Introduction

The purpose of this paper is to conduct a comprehensive exploratory data analysis on a real-world dataset, using statistical techniques and visualization methods to uncover insights, patterns, and relationships between variables.

## Preprocessing

In [6]:
import pandas as pd
vax_data = pd.read_csv('data/vax_totals.csv')
vax_data

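| | Date | MMWR_week | Location | Distributed | Administered | Admin_Per_100K | Recip_Administered | Administered_Dose1_Recip | Series_Complete_Yes | Additional_Doses | Second_Booster | Administered_Bivalent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/30/2021 | 43 | TX | 44779195 | 34509513 | 119015 | 33678095 | 17794025 | 15467426 | 1448444.0 | NaN | NaN |
| 1 | 10/30/2021 | 43 | NJ | 15160695 | 12301100 | 138492 | 12664320 | 6662856 | 5892769 | 526010.0 | NaN | NaN |
| 2 | 12/27/2020 | 53 | PA | 291825 | 84826 | 663 | 0 | 0 | 0 | NaN | NaN | NaN |
| 3 | 12/27/2020 | 53 | AK | 45250 | 11427 | 1562 | 0 | 0 | 0 | NaN | NaN | NaN |
| 4 | 12/16/2020 | 51 | WI | 49725 | 192 | 3 | 0 | 0 | 0 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38483 | 12/13/2020 | 51 | AS | 3900 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN |
| 38484 | 12/13/2020 | 51 | VI | 975 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN |
| 38485 | 12/13/2020 | 51 | MP | 4875 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN |
| 38486 | 12/13/2020 | 51 | US | 13650 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN |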


### Data Selection

Checking the shape of this dataset.

In [7]:
vax_data.shape

(38488, 12)

With 38,488 records across 12 columns, the dataset should be large enough for the analysis conducted in this paper.

This data is provided by the United States Centers for Disease Control and Prevention (CDC), and the field descriptions clearly indicate the role of each field.
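Since the title focuses on New York, New Jersey and Pennsylvania, one possible selection step is to keep only those jurisdictions. The sketch below is a minimal illustration, assuming the two-letter codes `NY`, `NJ` and `PA` appear in `Location`. It uses a small made-up frame instead of the real `vax_data`, and the names `toy`, `northeast` and `ne_data` are hypothetical.

```python
import pandas as pd

# Illustrative stand-in for vax_data (values are made up, not CDC figures)
toy = pd.DataFrame({
    "Location": ["TX", "NJ", "PA", "NY", "AK"],
    "Distributed": [500, 400, 300, 200, 100],
    "Administered": [450, 350, 250, 150, 50],
})

# Hypothetical list of jurisdictions of interest (assumed two-letter codes)
northeast = ["NY", "NJ", "PA"]

# Keep only the rows whose Location is in the list
ne_data = toy[toy["Location"].isin(northeast)].reset_index(drop=True)
print(ne_data)
```

On the real data, the same `isin` filter could be applied to `vax_data` before computing any per-location statistics.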

### Data Cleaning and Preparation

Fill NaN values with 0.

In [8]:
vax_data = vax_data.fillna(0)  # fillna returns a copy, so assign it back
vax_data

| | Date | MMWR_week | Location | Distributed | Administered | Admin_Per_100K | Recip_Administered | Administered_Dose1_Recip | Series_Complete_Yes | Additional_Doses | Second_Booster | Administered_Bivalent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/30/2021 | 43 | TX | 44779195 | 34509513 | 119015 | 33678095 | 17794025 | 15467426 | 1448444.0 | 0.0 | 0.0 |
| 1 | 10/30/2021 | 43 | NJ | 15160695 | 12301100 | 138492 | 12664320 | 6662856 | 5892769 | 526010.0 | 0.0 | 0.0 |
| 2 | 12/27/2020 | 53 | PA | 291825 | 84826 | 663 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 3 | 12/27/2020 | 53 | AK | 45250 | 11427 | 1562 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 4 | 12/16/2020 | 51 | WI | 49725 | 192 | 3 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38483 | 12/13/2020 | 51 | AS | 3900 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 38484 | 12/13/2020 | 51 | VI | 975 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 38485 | 12/13/2020 | 51 | MP | 4875 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 38486 | 12/13/2020 | 51 | US | 13650 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
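Before filling, it can help to see how many values are missing in each column. The following is a minimal sketch on a made-up frame (`toy`, `missing_before`, `filled` and `missing_after` are hypothetical names). It also shows that `fillna` returns a new DataFrame and leaves the original untouched unless the result is assigned.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for vax_data (values are made up)
toy = pd.DataFrame({
    "Location": ["TX", "PA", "AK"],
    "Additional_Doses": [1000.0, np.nan, np.nan],
    "Second_Booster": [np.nan, np.nan, np.nan],
})

# Count missing values per column before filling
missing_before = toy.isna().sum()
print(missing_before)

# fillna returns a new DataFrame; toy itself still contains NaN
filled = toy.fillna(0)
missing_after = filled.isna().sum()
print(missing_after)
```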


Checking for outliers using the interquartile range (IQR).
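One common form of the IQR check flags values that fall more than 1.5 times the IQR below the first quartile or above the third quartile. The sketch below assumes that rule. The helper `iqr_outliers`, its parameter `k`, and the sample values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)

# Made-up sample with one obviously extreme value
toy = pd.Series([10, 12, 11, 13, 12, 11, 100], name="Distributed")

mask = iqr_outliers(toy)
print(toy[mask])
```

On the real data, the same helper could be called on a single column, for example `iqr_outliers(vax_data["Distributed"])`. Because the counts are cumulative and differ widely between jurisdictions, applying the check within each `Location` may be more informative than applying it to the whole column.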

## Exploratory Data Analysis

### Variable Descriptions with Summary Statistics

In this paper, we claim that the supply of vaccines affects how far a patient progresses through the vaccination series. More specifically, we examine what impact a surplus or deficit of vaccines (compared to the distribution in most jurisdictions) has on the completion of a second dose, booster, and additional doses.

In [9]:
vax_data.columns

Index(['Date', 'MMWR_week', 'Location', 'Distributed', 'Administered',
       'Admin_Per_100K', 'Recip_Administered', 'Administered_Dose1_Recip',
       'Series_Complete_Yes', 'Additional_Doses', 'Second_Booster',
       'Administered_Bivalent'],
      dtype='object')

The following are descriptions of variables used in the analysis. This is a subset containing totals from the full vaccinations dataset.

`Date`: Date data was reported on the CDC COVID Data Tracker

`MMWR_week`: Week of the epidemiologic year

`Location`: Jurisdiction (State/Territory/Federal Entity)

`Distributed`: Total number of delivered doses

`Administered`: Total number of administered doses based on the jurisdiction where they were administered

`Recip_Administered`: Total number of doses administered based on the jurisdiction where the recipient lives

`Series_Complete_Yes`: Total number of people with a completed primary series (second dose of a two-dose vaccine or one dose of a single-dose vaccine) based on the jurisdiction where the recipient lives

`Additional_Doses`: Total number of people who completed a primary series and have received a booster (or additional) dose

`Second_Booster`: Total number of people who have received a second booster dose in the US

`Administered_Bivalent`: Total number of people who have received an updated (bivalent) booster dose since September 1, 2022

A good question to start off the summary statistics with is, <b>"What is the average number of doses Distributed/Administered per Location?"</b>

In [31]:
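# Needed below for np.nan in the mode calculations
import numpy as np

# Display floats without scientific notation, rounded to zero decimal places
pd.set_option('display.float_format', '{:.0f}'.format)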

# Grouping by Location 
#   then using .mean() to find the average
#   vaccine_eda = vax_data[["Location", "Distributed", "Administered"]]
#   vaccine_eda = vaccine_eda.groupby("Location", as_index=False).mean()

# Using method chaining, it would look like this:
vaccine_eda = (
    vax_data
    .groupby("Location", as_index=False)
    [["Distributed","Administered"]]
    .mean()
)

# Rename to clarify that these columns hold per-location averages
vaccine_eda.rename(columns={
    "Distributed": "Avg. Dist",
    "Administered": "Avg. Admin"
}, inplace=True)

# Median Calculations
vaccine_eda["Med. Dist"] = vax_data.groupby("Location")["Distributed"].median().reindex(vaccine_eda["Location"]).values
vaccine_eda["Med. Admin"] = vax_data.groupby("Location")["Administered"].median().reindex(vaccine_eda["Location"]).values

# Mode Calculations
vaccine_eda["Mode Dist"] = vax_data.groupby("Location")["Distributed"] \
    .apply(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan) \
    .reindex(vaccine_eda["Location"]) \
    .values

vaccine_eda["Mode Admin"] = vax_data.groupby("Location")["Administered"] \
    .apply(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan) \
    .reindex(vaccine_eda["Location"]) \
    .values

# Kurtosis Calculations
vaccine_eda["Kurt Dist"] = vax_data.groupby("Location")["Distributed"] \
    .apply(pd.Series.kurt) \
    .reindex(vaccine_eda["Location"]) \
    .values
vaccine_eda["Kurt Admin"] = vax_data.groupby("Location")["Administered"] \
    .apply(pd.Series.kurt) \
    .reindex(vaccine_eda["Location"]) \
    .values



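# Add back in Administered and Distributed
# Note: this assignment aligns on the row index, not on Location,
# so row i of vaccine_eda receives the values from row i of vax_data
vaccine_eda["Dist"] = vax_data["Distributed"]
vaccine_eda["Admin"] = vax_data["Administered"]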


vaccine_eda

| | Location | Avg. Dist | Avg. Admin | Med. Dist | Med. Admin | Mode Dist | Mode Admin | Kurt Dist | Kurt Admin | Dist | Admin |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AK | 1060955 | 779085 | 1089275 | 810238 | 271550 | 946031 | -0 | -1 | 44779195 | 34509513 |
| 1 | AL | 6365886 | 4153595 | 6829510 | 4624892 | 39000 | 5191035 | -1 | -1 | 15160695 | 12301100 |
| 2 | AR | 4014389 | 2815701 | 4161500 | 3085323 | 25350 | 40879 | -1 | -1 | 291825 | 84826 |
| 3 | AS | 72018 | 63402 | 67550 | 63123 | 54030 | 0 | -1 | -1 | 45250 | 11427 |
| 4 | AZ | 9893396 | 7959849 | 10143070 | 8370474 | 58500 | 9458251 | -1 | -1 | 49725 | 192 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61 | VT | 1146741 | 929108 | 1075350 | 917788 | 22125 | 1101607 | -1 | -1 | 81100 | 20163 |
| 62 | WA | 11702485 | 9547576 | 11675975 | 9864304 | 221150 | 11228003 | -1 | -1 | 142725 | 16366 |
| 63 | WI | 7740788 | 6820004 | 7710425 | 6936268 | 49725 | 8080069 | -1 | -1 | 4259115 | 3333130 |
| 64 | WV | 2691873 | 1763784 | 3032115 | 1567789 | 16575 | 1689798 | -1 | -1 | 2231915 | 1730146 |
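Because the display option `'{:.0f}'` applies to every float, the kurtosis columns above are rounded to `-0` or `-1`. pandas' `kurt` reports excess kurtosis (Fisher's definition, where a normal distribution scores 0), so the decimals carry most of the information. The sketch below shows one way to format columns individually with `DataFrame.to_string(formatters=...)`. The frame `toy` and its decimal values are made up for illustration.

```python
import pandas as pd

# Illustrative summary frame (values are made up)
toy = pd.DataFrame({
    "Location": ["AK", "AL"],
    "Avg. Dist": [1060955.4, 6365886.2],
    "Kurt Dist": [-0.4321, -1.1234],
})

# Format count-like columns as whole numbers and kurtosis with two decimals
text = toy.to_string(
    index=False,
    formatters={
        "Avg. Dist": "{:,.0f}".format,
        "Kurt Dist": "{:.2f}".format,
    },
)
print(text)
```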


In [None]:
# vaccine_stats = (
#     vax_data
#     .groupby("Location", as_index=False)
#     .agg(
#         Admin_Mean      = ("Administered", "mean"),
#         Dist_Mean       = ("Distributed", "mean"),
#         Admin_Median    = ("Administered", "median"),
#         Dist_Median     = ("Distributed", "median"),
#         Admin_Mode      = ("Administered", lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan),
#         Dist_Mode       = ("Distributed", lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan),
#         Admin_Kurtosis  = ("Administered", pd.Series.kurt),
#         Dist_Kurtosis   = ("Distributed", pd.Series.kurt)
#     )
# )


Another point of interest is whether there is a surplus or deficit of doses delivered vs. administered. This can be found by subtracting the number of doses administered from the number delivered for each state.

In [32]:
vaccine_eda['+/- Supply'] = vaccine_eda['Dist'] - vaccine_eda['Admin']
vaccine_eda

| | Location | Avg. Dist | Avg. Admin | Med. Dist | Med. Admin | Mode Dist | Mode Admin | Kurt Dist | Kurt Admin | Dist | Admin | +/- Supply |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AK | 1060955 | 779085 | 1089275 | 810238 | 271550 | 946031 | -0 | -1 | 44779195 | 34509513 | 10269682 |
| 1 | AL | 6365886 | 4153595 | 6829510 | 4624892 | 39000 | 5191035 | -1 | -1 | 15160695 | 12301100 | 2859595 |
| 2 | AR | 4014389 | 2815701 | 4161500 | 3085323 | 25350 | 40879 | -1 | -1 | 291825 | 84826 | 206999 |
| 3 | AS | 72018 | 63402 | 67550 | 63123 | 54030 | 0 | -1 | -1 | 45250 | 11427 | 33823 |
| 4 | AZ | 9893396 | 7959849 | 10143070 | 8370474 | 58500 | 9458251 | -1 | -1 | 49725 | 192 | 49533 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61 | VT | 1146741 | 929108 | 1075350 | 917788 | 22125 | 1101607 | -1 | -1 | 81100 | 20163 | 60937 |
| 62 | WA | 11702485 | 9547576 | 11675975 | 9864304 | 221150 | 11228003 | -1 | -1 | 142725 | 16366 | 126359 |
| 63 | WI | 7740788 | 6820004 | 7710425 | 6936268 | 49725 | 8080069 | -1 | -1 | 4259115 | 3333130 | 925985 |
| 64 | WV | 2691873 | 1763784 | 3032115 | 1567789 | 16575 | 1689798 | -1 | -1 | 2231915 | 1730146 | 501769 |
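Note that `vaccine_eda["Dist"] = vax_data["Distributed"]` aligns on the row index rather than on `Location`, so the `Dist` and `Admin` values above come from the first rows of `vax_data`. For example, the AK row carries the 44,779,195 doses reported for TX in row 0. The sketch below shows one way to compute the difference per jurisdiction instead, assuming the intended values are the most recently reported cumulative totals for each `Location`. The frame `toy` and its values are made up, and `latest` and `supply` are hypothetical names.

```python
import pandas as pd

# Illustrative stand-in for vax_data (values are made up)
toy = pd.DataFrame({
    "Date": ["12/27/2020", "10/30/2021", "12/27/2020", "10/30/2021"],
    "Location": ["PA", "PA", "NJ", "NJ"],
    "Distributed": [100, 1000, 200, 1500],
    "Administered": [40, 800, 50, 1200],
})

# Parse the MM/DD/YYYY strings so dates compare chronologically
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y")

# Select the row with the latest Date within each Location
latest = toy.loc[toy.groupby("Location")["Date"].idxmax()].copy()

# Surplus (+) or deficit (-) of delivered vs. administered doses
latest["+/- Supply"] = latest["Distributed"] - latest["Administered"]

supply = latest.set_index("Location")["+/- Supply"]
print(supply)
```

A result keyed by `Location` like this could then be merged into `vaccine_eda` on the `Location` column, which keeps each total attached to its own jurisdiction.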
