# 1. Background

# 1.1 VAERS Dataset

The Vaccine Adverse Event Reporting System (VAERS) is an early warning system to detect possible safety problems in U.S.-licensed vaccines. VAERS accepts and analyzes reports of adverse events (possible side effects) after a person has received a vaccination.

VAERS is a passive reporting system, meaning it relies on individuals to send in reports of their experiences. As such, it is prone to self-reporting bias and is not designed to determine whether a vaccine caused a health problem. Nonetheless, it can be especially useful for detecting unusual or unexpected patterns of adverse event reporting that might indicate a possible safety problem with a vaccine. The primary objectives of VAERS are to:

1. Detect new, unusual, or rare vaccine adverse events
2. Monitor increases in known adverse events
3. Identify potential patient risk factors for particular types of adverse events
4. Assess the safety of newly licensed vaccines
5. Determine and address possible reporting clusters 
6. Recognize persistent safe-use problems and administration errors
7. Provide a national safety monitoring system that extends to the entire general population for response to public health emergencies, such as a large-scale pandemic influenza vaccination program.

The VAERS dataset contains each patient's demographics, limited health data, vaccination information and the symptoms relating to the adverse event.

# 1.2 Problem Statement

There have been concerns about a possible association between myocarditis and vaccination, which have caused some uproar on social media. Some groups, particularly anti-vaccination groups, have analysed the VAERS dataset and concluded that COVID-19 vaccinations are associated with myocarditis.

However, according to expert advice, the VAERS dataset is not suitable for this kind of analysis because it is a passive reporting system in which any individual can report an adverse event. This introduces reporting bias, which can lead to over-reporting of serious adverse events. In addition, the VAERS dataset does not have a control group for baseline comparison, which can lead to erroneous conclusions.

In this project, we attempt to create a predictive model using the VAERS dataset and, based on the model results, decide whether a suitable predictive model can be built from it.

The outcome of interest for this project is myocarditis-related and pericarditis-related symptoms.

# 2. Import Packages

In [1]:
# Dataframe and data manipulation

import pandas as pd
import numpy as np

# Imputation of missing values

from sklearn.experimental import enable_iterative_imputer  
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer

# Statistical package

from scipy import stats

# 3. Import Dataset

In [2]:
# Merging all the VAERS datasets using VAERS_ID column

df1 = pd.read_csv("2021VAERSData.csv")
df2 = pd.read_csv("2021VAERSVAX.csv")
df3 = pd.read_csv("2021VAERSSYMPTOMS.csv")
df4 = pd.merge(left=df1,right=df2,on="VAERS_ID",how="inner")
df = pd.merge(left=df4, right=df3, on="VAERS_ID", how="inner")
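
A report can appear on more than one row of the vaccine or symptom files (for example when it lists several vaccines or many symptoms), so an inner merge on `VAERS_ID` may return more rows than there are reports. The sketch below illustrates this on made-up toy frames; the IDs and values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for the three VAERS files (values are hypothetical)
data = pd.DataFrame({"VAERS_ID": [1, 2], "AGE_YRS": [33.0, 73.0]})
vax = pd.DataFrame({"VAERS_ID": [1, 2], "VAX_TYPE": ["COVID19", "COVID19"]})
# Report 2 occupies two symptom rows
symptoms = pd.DataFrame({
    "VAERS_ID": [1, 2, 2],
    "SYMPTOM1": ["Dysphagia", "Anxiety", "Chest pain"],
})

merged = (
    data.merge(vax, on="VAERS_ID", how="inner")
        .merge(symptoms, on="VAERS_ID", how="inner")
)

# Rows per report after merging; values above 1 indicate fan-out
rows_per_report = merged["VAERS_ID"].value_counts()
n_reports = merged["VAERS_ID"].nunique()
print(len(merged), n_reports)
```

Comparing `len(merged)` with `n_reports` is a quick way to see how many rows are repeated reports rather than distinct ones.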


In [3]:
# Set viewing option for viewing output

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 30)

# 4. Check dataset

In [4]:
# Dimensions of dataset and data types

print(df.shape)
print("")
print(df.dtypes)

(926328, 52)

VAERS_ID             int64
RECVDATE            object
STATE               object
AGE_YRS            float64
CAGE_YR            float64
                    ...   
SYMPTOMVERSION3    float64
SYMPTOM4            object
SYMPTOMVERSION4    float64
SYMPTOM5            object
SYMPTOMVERSION5    float64
Length: 52, dtype: object


In [5]:
# Check for null/missing values

df.isnull().sum()

VAERS_ID                0
RECVDATE                0
STATE              102663
AGE_YRS             83791
CAGE_YR            174954
                    ...  
SYMPTOMVERSION3    367615
SYMPTOM4           500835
SYMPTOMVERSION4    500835
SYMPTOM5           607764
SYMPTOMVERSION5    607764
Length: 52, dtype: int64

In [6]:
# Overview of the dataset

df.head()

Unnamed: 0,VAERS_ID,RECVDATE,STATE,AGE_YRS,CAGE_YR,CAGE_MO,SEX,RPT_DATE,SYMPTOM_TEXT,DIED,DATEDIED,L_THREAT,ER_VISIT,HOSPITAL,HOSPDAYS,X_STAY,DISABLE,RECOVD,VAX_DATE,ONSET_DATE,NUMDAYS,LAB_DATA,V_ADMINBY,V_FUNDBY,OTHER_MEDS,CUR_ILL,HISTORY,PRIOR_VAX,SPLTTYPE,FORM_VERS,TODAYS_DATE,BIRTH_DEFECT,OFC_VISIT,ER_ED_VISIT,ALLERGIES,VAX_TYPE,VAX_MANU,VAX_LOT,VAX_DOSE_SERIES,VAX_ROUTE,VAX_SITE,VAX_NAME,SYMPTOM1,SYMPTOMVERSION1,SYMPTOM2,SYMPTOMVERSION2,SYMPTOM3,SYMPTOMVERSION3,SYMPTOM4,SYMPTOMVERSION4,SYMPTOM5,SYMPTOMVERSION5
0,916600,01/01/2021,TX,33.0,33.0,,F,,Right side of epiglottis swelled up and hinder...,,,,,,,,,Y,12/28/2020,12/30/2020,2.0,,PVT,,,,,,,2,01/01/2021,,Y,,Pcn and bee venom,COVID19,MODERNA,037K20A,1,IM,LA,COVID19 (COVID19 (MODERNA)),Dysphagia,23.1,Epiglottitis,23.1,,,,,,
1,916601,01/01/2021,CA,73.0,73.0,,F,,Approximately 30 min post vaccination administ...,,,,,,,,,Y,12/31/2020,12/31/2020,0.0,,SEN,,Patient residing at nursing facility. See pati...,Patient residing at nursing facility. See pati...,Patient residing at nursing facility. See pati...,,,2,01/01/2021,,Y,,"""Dairy""",COVID19,MODERNA,025L20A,1,IM,RA,COVID19 (COVID19 (MODERNA)),Anxiety,23.1,Dyspnoea,23.1,,,,,,
2,916602,01/01/2021,WA,23.0,23.0,,F,,"About 15 minutes after receiving the vaccine, ...",,,,,,,,,U,12/31/2020,12/31/2020,0.0,,SEN,,,,,,,2,01/01/2021,,,Y,Shellfish,COVID19,PFIZER\BIONTECH,EL1284,1,IM,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Chest discomfort,23.1,Dysphagia,23.1,Pain in extremity,23.1,Visual impairment,23.1,,
3,916603,01/01/2021,WA,58.0,58.0,,F,,"extreme fatigue, dizziness,. could not lift my...",,,,,,,,,Y,12/23/2020,12/23/2020,0.0,none,WRK,,none,kidney infection,"diverticulitis, mitral valve prolapse, osteoar...","got measles from measel shot, mums from mumps ...",,2,01/01/2021,,,,"Diclofenac, novacaine, lidocaine, pickles, tom...",COVID19,MODERNA,unknown,UNK,,,COVID19 (COVID19 (MODERNA)),Dizziness,23.1,Fatigue,23.1,Mobility decreased,23.1,,,,
4,916604,01/01/2021,TX,47.0,47.0,,F,,"Injection site swelling, redness, warm to the ...",,,,,,,,,N,12/22/2020,12/29/2020,7.0,,PUB,,Na,Na,,,,2,01/01/2021,,,,Na,COVID19,MODERNA,,1,IM,LA,COVID19 (COVID19 (MODERNA)),Injection site erythema,23.1,Injection site pruritus,23.1,Injection site swelling,23.1,Injection site warmth,23.1,,


# 5. Data Cleaning 

This step involves:
    
1. Preliminary cleaning 
2. Removing missing data 
3. Imputation of missing data 
4. Removing redundant columns
5. Binarising outcome variables and collapsing them into a single column

# 5.1 Preliminary Cleaning 

In [7]:
# Remove duplicates

df = df.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

In [8]:
# There are two types of VAERS version records, type 1 and type 2. They differ in the reporting structure.
# As the number of type 2 records dominates the dataset, we will be just using type 2 records to preserve the 
# data integrity

print(df[df["FORM_VERS"] == 1].shape)
print(df[df["FORM_VERS"] == 2].shape)

df_clean = df[df["FORM_VERS"] == 2]

(541, 52)
(923801, 52)


In [9]:
# Restrict the data to only COVID-19 Vaccines

df_clean = df_clean[df_clean["VAX_TYPE"] == "COVID19"]

In [10]:
# Remove Redundant columns

df_clean = df_clean.drop(['CAGE_YR', 'CAGE_MO','RPT_DATE','DATEDIED','FORM_VERS', 'V_ADMINBY','V_FUNDBY','OFC_VISIT',
             'VAX_TYPE'], 
             axis=1)

# 5.2 Data Cleaning: Missing Data

In [11]:
# Overview of missing data

df_missing_data = df_clean.isnull().sum()
df_missing_data = pd.DataFrame(df_missing_data, columns=["number_missing"])
df_missing_data["percentage_missing"] = df_missing_data["number_missing"]*100/df_clean.shape[0]
df_missing_data

Unnamed: 0,number_missing,percentage_missing
VAERS_ID,0,0.000000
RECVDATE,0,0.000000
STATE,93447,10.683711
AGE_YRS,70593,8.070834
SEX,0,0.000000
...,...,...
SYMPTOMVERSION3,341172,39.005886
SYMPTOM4,467262,53.421641
SYMPTOMVERSION4,467262,53.421641
SYMPTOM5,569132,65.068346
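
The same summary can be computed a little more compactly with `isna().mean()`, which gives the fraction of missing values per column directly. A minimal sketch on a toy frame (the values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "STATE": ["TX", None, "WA", "CA"],
    "AGE_YRS": [33.0, 73.0, np.nan, np.nan],
})

# Count and percentage of missing values per column, worst column first
missing = pd.DataFrame({
    "number_missing": toy.isna().sum(),
    "percentage_missing": toy.isna().mean() * 100,
}).sort_values("percentage_missing", ascending=False)
print(missing)
```

Sorting by percentage puts the columns that most need attention at the top of the table.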


# 5.3 Age-Years (Continuous Variable)

In [12]:
# Remove missing values 

df_clean = df_clean[df_clean['AGE_YRS'].notna()]

# 5.4 State (Multi-Categorical Variable)

In [13]:
# Remove missing values 

df_clean = df_clean[df_clean['STATE'].notna()]

# 5.5 Sex (Binary Categorical Variable)

In [14]:
# Remove Sex that is labelled as 'U'. 

df_clean = df_clean[df_clean['SEX']!= 'U']

# Binarise the Sex labels

df_clean['SEX'] = df_clean['SEX'].replace({
    'F':1,
    'M':0
})

# Convert Sex into Integer format

df_clean['SEX']=df_clean['SEX'].astype(int)


# 5.6 Patient Death (Binary Categorical Variable)

In [15]:
# Replace 'DIED' nan value with 'N'

print(df_clean['DIED'].unique())
df_clean['DIED'] = df_clean['DIED'].fillna('N')


# Binarise the Death labels

df_clean['DIED'] = df_clean['DIED'].replace({
    'N':0,
    'Y':1
})


# Convert Death into Integer format

df_clean['DIED'] = df_clean['DIED'].astype(int)


# 5.7 Patient Life Threatening Condition (Binary Category)

In [16]:
# Replace 'L_THREAT' nan value with 'N'

print(df_clean['L_THREAT'].unique())
df_clean['L_THREAT'] = df_clean['L_THREAT'].fillna('N')


# Binarise the Life threatening labels

df_clean['L_THREAT'] = df_clean['L_THREAT'].replace({
    'N':0,
    'Y':1
})


# Convert Life threatening into Integer format

df_clean['L_THREAT']=df_clean['L_THREAT'].astype(int)

[nan 'Y']
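
The fill, map and cast steps used for `DIED` and `L_THREAT` recur for the other Y/blank flag columns, so they could be wrapped in a small helper. The function name `binarise_flag` is hypothetical, and treating a blank as 'N' is the same assumption the cells above make.

```python
import numpy as np
import pandas as pd

def binarise_flag(series: pd.Series) -> pd.Series:
    """Map 'Y' to 1 and 'N' or missing to 0, returning integers."""
    filled = series.fillna("N")
    # Stop early if a label other than 'Y' or 'N' is present
    unexpected = set(filled.unique()) - {"Y", "N"}
    if unexpected:
        raise ValueError(f"Unexpected labels: {sorted(unexpected)}")
    return filled.map({"N": 0, "Y": 1}).astype(int)

toy = pd.DataFrame({
    "DIED": [np.nan, "Y", np.nan],
    "L_THREAT": ["Y", np.nan, "N"],
})
for col in ["DIED", "L_THREAT"]:
    toy[col] = binarise_flag(toy[col])
print(toy)
```

Raising on unexpected labels (such as 'U') avoids silently producing missing values that would make the integer cast fail later.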


# 5.8 Emergency Visit (Unknown)

In [17]:
# Drop ER column because it is 100% Missing

df_clean = df_clean.drop(['ER_VISIT'],axis=1)


# 5.9 Hospitalised (Binary Category) and Number of days hospitalised (Continuous Variable)

In [18]:
# Screen labels for Hospitalised and Number of days hospitalized

print(df_clean['HOSPITAL'].unique())
print(df_clean['HOSPDAYS'].unique())

[nan 'Y']
[       nan 9.0000e+00 1.0000e+00 4.0000e+00 2.0000e+00 6.0000e+00
 3.0000e+00 7.0000e+00 5.0000e+00 1.1000e+01 1.0000e+01 8.0000e+00
 1.5000e+01 2.2000e+01 1.4000e+01 1.6000e+01 1.3000e+01 1.7000e+01
 1.2000e+01 2.1000e+01 2.0000e+01 2.4000e+01 2.6000e+01 3.6000e+01
 1.9000e+01 1.8000e+01 2.5000e+01 2.7000e+01 2.9000e+01 2.8000e+01
 3.9000e+01 3.3000e+01 3.8000e+01 3.7000e+01 3.0000e+01 4.3000e+01
 4.2000e+01 4.0000e+01 3.4000e+01 3.2000e+01 2.3000e+01 3.1000e+01
 9.0000e+01 5.5000e+01 3.5000e+01 5.0000e+01 9.9900e+02 4.5000e+01
 5.1000e+01 4.4000e+01 4.1000e+01 4.7000e+01 1.0400e+02 6.5000e+01
 6.8000e+01 5.9000e+01 6.0000e+01 5.4000e+01 8.0000e+01 9.9999e+04
 7.8000e+01 6.3000e+01 8.2000e+01 9.5000e+01 5.7000e+01 9.7000e+01
 9.1000e+01 1.2000e+02 9.2000e+01 4.9000e+01 5.2000e+01 5.6000e+01
 6.7000e+01 4.6000e+01 6.2000e+01 8.9000e+01 7.4000e+01 1.0100e+02
 9.6000e+01 6.6000e+01 1.2700e+02 1.3600e+02 8.7000e+01 1.7100e+02
 5.3000e+01 7.5000e+01 1.3200e+02 1.0000e+02 7.6000e

In [19]:
# Replace missing Hospitalised column with N

df_clean['HOSPITAL'] = df_clean['HOSPITAL'].fillna('N')

# Ensure that if the patient is not hospitalised, the number of days hospitalised is 0
# (HOSPDAYS is overwritten with 0 for these rows)

df_clean.loc[df_clean.HOSPITAL == 'N', 'HOSPDAYS'] = 0

# Remove missing values from HOSPDAYS
df_clean = df_clean[df_clean['HOSPDAYS'].notna()]

# Binarise HOSPITAL column
df_clean['HOSPITAL'] = df_clean['HOSPITAL'].replace({
    'N':0,
    'Y':1
})

# Convert HOSPDAYS and HOSPITAL into Integer format

df_clean['HOSPITAL']=df_clean['HOSPITAL'].astype(int)
df_clean['HOSPDAYS']=df_clean['HOSPDAYS'].astype(int)

# Check dataset

df_clean.shape

(719720, 42)
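
The unique values printed above include 999 and 99999, which look like placeholder codes rather than real lengths of stay. That reading is an assumption, and the 365-day cut-off below is likewise a hypothetical choice. One possible way to set such values to missing before further cleaning:

```python
import numpy as np
import pandas as pd

MAX_PLAUSIBLE_DAYS = 365  # hypothetical threshold, adjust as needed

toy = pd.DataFrame({"HOSPDAYS": [2.0, 999.0, 99999.0, np.nan, 14.0]})

# Series.where keeps values where the condition holds and sets the rest to NaN
toy["HOSPDAYS"] = toy["HOSPDAYS"].where(toy["HOSPDAYS"] <= MAX_PLAUSIBLE_DAYS)
print(toy["HOSPDAYS"].tolist())
```

Whether such rows should then be dropped or imputed is a separate modelling decision.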

# 5.10 Extended stay in hospitals (Binary Category)

In [20]:
# Prolongation of existing hospitalization. 
# Prolongation of hospital stay is 21 days or more due to adverse events only.

df_clean['X_STAY'].unique()
df_clean['X_STAY'] = df_clean['X_STAY'].fillna('N')

# Binarise X_STAY column
df_clean['X_STAY'] = df_clean['X_STAY'].replace({
    'N':0,
    'Y':1
})

# Convert into integer format

df_clean['X_STAY']=df_clean['X_STAY'].astype(int)

# 5.11 Presence of Disability in patients (Binary Category)

In [21]:
# DISABLE

# Replace any missing values with 'N'

df_clean['DISABLE'].unique()
df_clean['DISABLE'] = df_clean['DISABLE'].fillna('N')

# Binarise DISABLE column

df_clean['DISABLE']=df_clean['DISABLE'].replace({
    'N':0,
    'Y':1
})

# Convert into integer format

df_clean['DISABLE']=df_clean['DISABLE'].astype(int)

# 5.12 Recovery Status of the patients (Binary Category)

In [22]:
# RECOVD

# Inspect the label distribution, then drop the column

print(df_clean['RECOVD'].value_counts())
df_clean = df_clean.drop(['RECOVD'], axis=1)

N    307586
Y    246640
U    124615
Name: RECOVD, dtype: int64


# 5.13 Vaccination Date, Date of AE Onset and Number of days between Vaccination and AE (Date-Time Data)

In [23]:
# Check the variable

df_clean['VAX_DATE'].head()

0    12/28/2020
1    12/31/2020
2    12/31/2020
3    12/23/2020
4    12/22/2020
Name: VAX_DATE, dtype: object

In [24]:
# Vax Date, Onset Date and Numdays

# Remove missing values

df_clean = df_clean[df_clean['VAX_DATE'].notna()]
df_clean = df_clean[df_clean['ONSET_DATE'].notna()]

# Convert to Datetime format

df_clean['VAX_DATE'] = pd.to_datetime(df_clean['VAX_DATE'])
df_clean['ONSET_DATE'] = pd.to_datetime(df_clean['ONSET_DATE'])

# Keep only vaccinations dated after 1 April 2020

df_clean = df_clean[df_clean['VAX_DATE'] > '2020-04-01']

print(df_clean.shape)

df_clean['NUMDAYS'] = (df_clean['ONSET_DATE'] - df_clean['VAX_DATE']).dt.days

# Interval between vaccination and onset should not be less than 0, or else the adverse effect cannot be 
# attributed to the vaccination

df_clean = df_clean[df_clean['NUMDAYS'] >= 0]

# Convert into integer format

df_clean['NUMDAYS']=df_clean['NUMDAYS'].astype(int)

# Check Dataset

df_clean.shape

(694662, 41)


(683634, 41)
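
The date columns hold month/day/year strings (for example `12/28/2020`), so passing an explicit `format` to `pd.to_datetime` avoids ambiguous parsing, and `errors="coerce"` turns malformed entries into `NaT` instead of raising. A minimal sketch on toy rows (the values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "VAX_DATE": ["12/28/2020", "12/31/2020", "01/05/2021"],
    "ONSET_DATE": ["12/30/2020", "12/30/2020", "not a date"],
})

for col in ["VAX_DATE", "ONSET_DATE"]:
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y", errors="coerce")

# Days from vaccination to onset; NaT on either side gives NaN
toy["NUMDAYS"] = (toy["ONSET_DATE"] - toy["VAX_DATE"]).dt.days

# Keep rows with a valid, non-negative interval (NaN fails the comparison)
valid = toy[toy["NUMDAYS"] >= 0].copy()
valid["NUMDAYS"] = valid["NUMDAYS"].astype(int)
print(valid)
```

In this toy frame the second row is dropped because onset precedes vaccination, and the third because its onset date cannot be parsed.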

# 5.14 Text Data

In [25]:
# Exclude text data or columns with large sub-categories first during preliminary analysis

df_clean = df_clean.drop(['LAB_DATA', 'OTHER_MEDS','CUR_ILL', 'HISTORY', 'PRIOR_VAX', 
                          'SPLTTYPE','ALLERGIES', 'VAX_LOT', 'VAX_ROUTE','TODAYS_DATE','SYMPTOM_TEXT' ], axis=1)

# 5.15 Birth Defect (Binary Category)

In [26]:
# Birth Defect

print(df_clean['BIRTH_DEFECT'].unique())
df_clean['BIRTH_DEFECT'] = df_clean['BIRTH_DEFECT'].fillna('N')

# Binarise Birth_Defect column

df_clean['BIRTH_DEFECT'] = df_clean['BIRTH_DEFECT'].replace({
    'N':0,
    'Y':1
})

# Convert into integer format

df_clean['BIRTH_DEFECT']=df_clean['BIRTH_DEFECT'].astype(int)

[nan 'Y']


# 5.16 Emergency Department Visit (Binary Category)

In [27]:
# ER_ED_VISIT

df_clean['ER_ED_VISIT'].unique()

df_clean['ER_ED_VISIT'] = df_clean['ER_ED_VISIT'].fillna('N')

# Binarise ER_ED_VISIT

df_clean['ER_ED_VISIT'] = df_clean['ER_ED_VISIT'].replace({
    'N':0,
    'Y':1
})

# Convert into integer format

df_clean['ER_ED_VISIT']=df_clean['ER_ED_VISIT'].astype(int)

# 5.17 Vaccination Brand (4 Nominal Categories)

In [28]:
# Same as Vaccination Name
# No further cleaning required, will dummify this column prior to modelling

In [29]:
# VACCINATION BRAND

df_clean['VAX_MANU'].value_counts()

MODERNA                 320291
PFIZER\BIONTECH         305470
JANSSEN                  56664
UNKNOWN MANUFACTURER      1209
Name: VAX_MANU, dtype: int64

# 5.18 Vaccination Doses (Continuous Variable)

In [30]:
# Number of vaccination doses: remove nan and unknown number of doses

print(df_clean['VAX_DOSE_SERIES'].unique())
df_clean.loc[df_clean['VAX_DOSE_SERIES'] == 'UNK', 'VAX_DOSE_SERIES'] = np.nan
df_clean = df_clean[df_clean['VAX_DOSE_SERIES'].notna()]

# Binarise the vaccination dose 

df_clean['VAX_DOSE_SERIES'] = df_clean['VAX_DOSE_SERIES'].replace({
    '1':0,
    '2':0,
    '3':1,
    '4':1,
    '5':1,
    '6':1,
    '7+':1
})

# Convert into integer format

df_clean['VAX_DOSE_SERIES']=df_clean['VAX_DOSE_SERIES'].astype(int)

['1' 'UNK' '2' '3' '5' nan '4' '7+' '6']


In [31]:
df_clean['VAX_DOSE_SERIES'].value_counts()

0    590415
1     12826
Name: VAX_DOSE_SERIES, dtype: int64
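
The dose grouping can also be written with `Series.map`, which returns missing values for any label not in the lookup, so 'UNK' and blanks are handled in one step. The names `DOSE_GROUPS` and `DOSE_GROUP` are hypothetical, and reading 1 as "third dose or later" is an interpretation of the mapping above.

```python
import numpy as np
import pandas as pd

# 0 = first or second dose, 1 = third dose or later (taken from the mapping above)
DOSE_GROUPS = {"1": 0, "2": 0, "3": 1, "4": 1, "5": 1, "6": 1, "7+": 1}

toy = pd.DataFrame({"VAX_DOSE_SERIES": ["1", "UNK", "2", "3", np.nan, "7+"]})

# Labels absent from the dict ('UNK') and missing values both become NaN
toy["DOSE_GROUP"] = toy["VAX_DOSE_SERIES"].map(DOSE_GROUPS)
toy = toy.dropna(subset=["DOSE_GROUP"]).copy()
toy["DOSE_GROUP"] = toy["DOSE_GROUP"].astype(int)
print(toy["DOSE_GROUP"].value_counts().to_dict())
```

Keeping the grouped value in a separate column leaves the original dose label available for later checks.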

# 5.19 Vaccination Site (8 Nominal Categories)

In [32]:
# Overview of the column

print(df_clean['VAX_SITE'].value_counts())

# Remove missing values

df_clean = df_clean[df_clean['VAX_SITE'].notna()]

# Collapse all other minority categories into "others"

df_clean['VAX_SITE'] = df_clean['VAX_SITE'].replace({
    'AR':'Others',
    'OT':'Others',
    'LL':'Others',
    'RL':'Others',
    'LG':'Others',
    'GM':'Others',
    'MO':'Others',
    'NS':'Others'
})

print(df_clean['VAX_SITE'].value_counts())

LA    386029
RA    133652
AR     13050
UN     11237
OT       337
LL       210
RL       174
LG        15
GM         7
MO         6
NS         5
Name: VAX_SITE, dtype: int64
LA        386029
RA        133652
Others     13804
UN         11237
Name: VAX_SITE, dtype: int64
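
Instead of listing every minority code, the collapse can be expressed by naming the categories to keep and sending everything else to 'Others'. A sketch with toy values; the keep-list mirrors the categories left untouched in the output above.

```python
import pandas as pd

KEEP = ["LA", "RA", "UN"]  # categories retained in the output above

toy = pd.Series(["LA", "RA", "AR", "UN", "OT", "LL"], name="VAX_SITE")

# Series.where keeps values where the condition is True and substitutes 'Others' elsewhere
collapsed = toy.where(toy.isin(KEEP), "Others")
print(collapsed.value_counts().to_dict())
```

This variant should run after missing values are removed, since a missing entry would otherwise also fall into 'Others'.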


# 6. Outcome Variable: SYMPTOM1 to 5 (Multi-categorical)

In [33]:
# Binarise the outcome data to only Myocarditis/Pericarditis-related symptoms or Non-myocarditis/pericarditis 
# related symptoms

# Myocarditis/Pericarditis-related symptoms: 
# Acute myocaridal infarction, Myocaridal infarction, Myocarditis, Myocardial Oedema, Myocardial ischaemia, Myocardial strain,
# Viral myocarditis, Myocardial necrosis marker, normal, increased, Myocardial necrosis, Myocardial fibrosis, Myocardial injury,
# Antimyocardial antibody positive, Scan myocardial perfusion, Autoimmune myocarditis, myocarditis infectious, Myocardial rupture
# Silent myocardial infarction, ECG signs of myocardial ischaemia,
# Pericarditis, Pericardial effusion, Pericardial drainage, Pericardial fibrosis, Pericardial excision, 
# Viral pericarditis, Pericardial disease, Pericardial operation, Biopsy pericardium, 

df_clean["SYMPTOM1"].value_counts()

# Binarise the outcome for SYMPTOM1

df_clean["SYMPTOM1"].replace({
    "Acute myocardial infarction":"Myocarditis-related", 
    "Myocaridal infarction":"Myocarditis-related",
    "Myocarditis":"Myocarditis-related",
    "Myocardial Oedema":"Myocarditis-related",
    "Myocardial ischaemia":"Myocarditis-related",
    "Myocardial strain":"Myocarditis-related",
    "Viral myocarditis":"Myocarditis-related",
    "Myocardial necrosis marker":"Myocarditis-related",
    "Myocardial necrosis marker increased":"Myocarditis-related",
    "Myocardial necrosis marker normal":"Myocarditis-related",
    "Myocardial necrosis":"Myocarditis-related",
    "Myocardial fibrosis":"Myocarditis-related",
    "Myocardial injury":"Myocarditis-related",
    "Antimyocardial antibody positive":"Myocarditis-related",
    "Scan myocardial perfusion":"Myocarditis-related",
    "Autoimmune myocarditis":"Myocarditis-related",
    "Myocarditis infectious":"Myocarditis-related",
    "Myocardial rupture":"Myocarditis-related",
    "Silent myocardial infarction":"Myocarditis-related",
    "ECG signs of myocardial ischaemia":"Myocarditis-related",
    "Pericarditis":"Myocarditis-related",
    "Pericardial effusion":"Myocarditis-related",
    "Pericardial drainage":"Myocarditis-related",
    "Pericardial fibrosis":"Myocarditis-related",
    "Pericardial excision":"Myocarditis-related",
    "Viral pericarditis":"Myocarditis-related",
    "Pericardial disease":"Myocarditis-related",
    "Pericardial operation":"Myocarditis-related",
    "Biopsy pericardium":"Myocarditis-related"
    
}, inplace=True)



In [34]:
# Function to Binarise the outcome variable into myocarditis/pericarditis-related or non-myocarditis/pericarditis
# related symptom

def collapse_myocarditis(x):
    if x != "Myocarditis-related":
        return "Non-Myocarditis-related"
    else:
        return "Myocarditis-related"

In [35]:
df_clean["SYMPTOM1"] = df_clean["SYMPTOM1"].map(collapse_myocarditis)

In [36]:
df_clean["SYMPTOM1"].value_counts()

Non-Myocarditis-related    543966
Myocarditis-related           756
Name: SYMPTOM1, dtype: int64
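
The counts above imply a heavily imbalanced outcome, which matters for any later model evaluation. The arithmetic from the `SYMPTOM1` counts can be worked through as follows:

```python
positives = 756        # Myocarditis-related rows in SYMPTOM1
negatives = 543_966    # Non-Myocarditis-related rows in SYMPTOM1
total = positives + negatives

prevalence = positives / total
# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy = negatives / total

print(f"total rows: {total}")
print(f"prevalence: {prevalence:.4%}")
print(f"majority-class accuracy: {baseline_accuracy:.4%}")
```

Because always predicting the majority class already scores very high on accuracy, accuracy alone would say little about a model trained on this outcome.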

In [37]:
# Binarise the outcome for SYMPTOM2

df_clean["SYMPTOM2"].replace({
    "Acute myocardial infarction":"Myocarditis-related", 
    "Myocaridal infarction":"Myocarditis-related",
    "Myocarditis":"Myocarditis-related",
    "Myocardial Oedema":"Myocarditis-related",
    "Myocardial ischaemia":"Myocarditis-related",
    "Myocardial strain":"Myocarditis-related",
    "Viral myocarditis":"Myocarditis-related",
    "Myocardial necrosis marker":"Myocarditis-related",
    "Myocardial necrosis marker increased":"Myocarditis-related",
    "Myocardial necrosis marker normal":"Myocarditis-related",
    "Myocardial necrosis":"Myocarditis-related",
    "Myocardial fibrosis":"Myocarditis-related",
    "Myocardial injury":"Myocarditis-related",
    "Antimyocardial antibody positive":"Myocarditis-related",
    "Scan myocardial perfusion":"Myocarditis-related",
    "Autoimmune myocarditis":"Myocarditis-related",
    "Myocarditis infectious":"Myocarditis-related",
    "Myocardial rupture":"Myocarditis-related",
    "Silent myocardial infarction":"Myocarditis-related",
    "ECG signs of myocardial ischaemia":"Myocarditis-related",
    "Pericarditis":"Myocarditis-related",
    "Pericardial effusion":"Myocarditis-related",
    "Pericardial drainage":"Myocarditis-related",
    "Pericardial fibrosis":"Myocarditis-related",
    "Pericardial excision":"Myocarditis-related",
    "Viral pericarditis":"Myocarditis-related",
    "Pericardial disease":"Myocarditis-related",
    "Pericardial operation":"Myocarditis-related",
    "Biopsy pericardium":"Myocarditis-related"
    
}, inplace=True)

In [38]:
df_clean["SYMPTOM2"] = df_clean["SYMPTOM2"].map(collapse_myocarditis)

In [39]:
# Binarise the outcome for SYMPTOM3

df_clean["SYMPTOM3"].replace({
    "Acute myocardial infarction":"Myocarditis-related", 
    "Myocaridal infarction":"Myocarditis-related",
    "Myocarditis":"Myocarditis-related",
    "Myocardial Oedema":"Myocarditis-related",
    "Myocardial ischaemia":"Myocarditis-related",
    "Myocardial strain":"Myocarditis-related",
    "Viral myocarditis":"Myocarditis-related",
    "Myocardial necrosis marker":"Myocarditis-related",
    "Myocardial necrosis marker increased":"Myocarditis-related",
    "Myocardial necrosis marker normal":"Myocarditis-related",
    "Myocardial necrosis":"Myocarditis-related",
    "Myocardial fibrosis":"Myocarditis-related",
    "Myocardial injury":"Myocarditis-related",
    "Antimyocardial antibody positive":"Myocarditis-related",
    "Scan myocardial perfusion":"Myocarditis-related",
    "Autoimmune myocarditis":"Myocarditis-related",
    "Myocarditis infectious":"Myocarditis-related",
    "Myocardial rupture":"Myocarditis-related",
    "Silent myocardial infarction":"Myocarditis-related",
    "ECG signs of myocardial ischaemia":"Myocarditis-related",
    "Pericarditis":"Myocarditis-related",
    "Pericardial effusion":"Myocarditis-related",
    "Pericardial drainage":"Myocarditis-related",
    "Pericardial fibrosis":"Myocarditis-related",
    "Pericardial excision":"Myocarditis-related",
    "Viral pericarditis":"Myocarditis-related",
    "Pericardial disease":"Myocarditis-related",
    "Pericardial operation":"Myocarditis-related",
    "Biopsy pericardium":"Myocarditis-related"
    
}, inplace=True)

In [40]:
df_clean["SYMPTOM3"] = df_clean["SYMPTOM3"].map(collapse_myocarditis)

In [41]:
# Binarise the outcome for SYMPTOM4

df_clean["SYMPTOM4"].replace({
    "Acute myocardial infarction":"Myocarditis-related", 
    "Myocaridal infarction":"Myocarditis-related",
    "Myocarditis":"Myocarditis-related",
    "Myocardial Oedema":"Myocarditis-related",
    "Myocardial ischaemia":"Myocarditis-related",
    "Myocardial strain":"Myocarditis-related",
    "Viral myocarditis":"Myocarditis-related",
    "Myocardial necrosis marker":"Myocarditis-related",
    "Myocardial necrosis marker increased":"Myocarditis-related",
    "Myocardial necrosis marker normal":"Myocarditis-related",
    "Myocardial necrosis":"Myocarditis-related",
    "Myocardial fibrosis":"Myocarditis-related",
    "Myocardial injury":"Myocarditis-related",
    "Antimyocardial antibody positive":"Myocarditis-related",
    "Scan myocardial perfusion":"Myocarditis-related",
    "Autoimmune myocarditis":"Myocarditis-related",
    "Myocarditis infectious":"Myocarditis-related",
    "Myocardial rupture":"Myocarditis-related",
    "Silent myocardial infarction":"Myocarditis-related",
    "ECG signs of myocardial ischaemia":"Myocarditis-related",
    "Pericarditis":"Myocarditis-related",
    "Pericardial effusion":"Myocarditis-related",
    "Pericardial drainage":"Myocarditis-related",
    "Pericardial fibrosis":"Myocarditis-related",
    "Pericardial excision":"Myocarditis-related",
    "Viral pericarditis":"Myocarditis-related",
    "Pericardial disease":"Myocarditis-related",
    "Pericardial operation":"Myocarditis-related",
    "Biopsy pericardium":"Myocarditis-related"
    
}, inplace=True)

In [42]:
df_clean["SYMPTOM4"] = df_clean["SYMPTOM4"].map(collapse_myocarditis)

In [43]:
# Binarise the outcome for SYMPTOM5

df_clean["SYMPTOM5"].replace({
    "Acute myocardial infarction":"Myocarditis-related", 
    "Myocaridal infarction":"Myocarditis-related",
    "Myocarditis":"Myocarditis-related",
    "Myocardial Oedema":"Myocarditis-related",
    "Myocardial ischaemia":"Myocarditis-related",
    "Myocardial strain":"Myocarditis-related",
    "Viral myocarditis":"Myocarditis-related",
    "Myocardial necrosis marker":"Myocarditis-related",
    "Myocardial necrosis marker increased":"Myocarditis-related",
    "Myocardial necrosis marker normal":"Myocarditis-related",
    "Myocardial necrosis":"Myocarditis-related",
    "Myocardial fibrosis":"Myocarditis-related",
    "Myocardial injury":"Myocarditis-related",
    "Antimyocardial antibody positive":"Myocarditis-related",
    "Scan myocardial perfusion":"Myocarditis-related",
    "Autoimmune myocarditis":"Myocarditis-related",
    "Myocarditis infectious":"Myocarditis-related",
    "Myocardial rupture":"Myocarditis-related",
    "Silent myocardial infarction":"Myocarditis-related",
    "ECG signs of myocardial ischaemia":"Myocarditis-related",
    "Pericarditis":"Myocarditis-related",
    "Pericardial effusion":"Myocarditis-related",
    "Pericardial drainage":"Myocarditis-related",
    "Pericardial fibrosis":"Myocarditis-related",
    "Pericardial excision":"Myocarditis-related",
    "Viral pericarditis":"Myocarditis-related",
    "Pericardial disease":"Myocarditis-related",
    "Pericardial operation":"Myocarditis-related",
    "Biopsy pericardium":"Myocarditis-related"
    
}, inplace=True)

In [44]:
df_clean["SYMPTOM5"] = df_clean["SYMPTOM5"].map(collapse_myocarditis)

In [45]:
# Check outcome distribution in SYMPTOM2

df_clean["SYMPTOM2"].value_counts()

Non-Myocarditis-related    544128
Myocarditis-related           594
Name: SYMPTOM2, dtype: int64

In [46]:
# Check outcome distribution in SYMPTOM3

df_clean["SYMPTOM3"].value_counts()

Non-Myocarditis-related    544207
Myocarditis-related           515
Name: SYMPTOM3, dtype: int64

In [47]:
# Check outcome distribution in SYMPTOM4

df_clean["SYMPTOM4"].value_counts()

Non-Myocarditis-related    544214
Myocarditis-related           508
Name: SYMPTOM4, dtype: int64

In [48]:
# Check outcome distribution in SYMPTOM5

df_clean["SYMPTOM5"].value_counts()

Non-Myocarditis-related    544201
Myocarditis-related           521
Name: SYMPTOM5, dtype: int64
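
The five near-identical cells above could be condensed by keeping the term list in one place and checking all symptom columns at once, then collapsing the five flags into the single outcome column mentioned in Section 5. This is a sketch under assumptions: `MYO_TERMS` is abbreviated here, the column name `MYO_PERI` is hypothetical, and a row is counted as positive if any of its five symptom fields matches.

```python
import pandas as pd

# Abbreviated term list; the full list is the one used in the cells above
MYO_TERMS = ["Myocarditis", "Pericarditis", "Pericardial effusion", "Viral myocarditis"]
SYMPTOM_COLS = [f"SYMPTOM{i}" for i in range(1, 6)]

toy = pd.DataFrame({
    "SYMPTOM1": ["Dizziness", "Myocarditis", "Fatigue"],
    "SYMPTOM2": ["Pericarditis", None, "Anxiety"],
    "SYMPTOM3": [None, None, "Dyspnoea"],
    "SYMPTOM4": [None, None, None],
    "SYMPTOM5": [None, None, None],
})

# isin() returns False for missing values, so blanks count as non-matching
flags = toy[SYMPTOM_COLS].isin(MYO_TERMS)

# 1 if any of the five symptom fields is a listed term, else 0
toy["MYO_PERI"] = flags.any(axis=1).astype(int)
print(toy["MYO_PERI"].tolist())
```

Exact string matching is case- and spelling-sensitive, so keys such as `Myocaridal infarction` or `Myocardial Oedema` only match if the dataset spells the term exactly that way; checking the list against `value_counts()` of the raw symptom columns is one way to confirm.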

In [49]:
# Check Dataset is alright 

df_clean[df_clean["SYMPTOM2"]=="Myocarditis-related"]

Unnamed: 0,VAERS_ID,RECVDATE,STATE,AGE_YRS,SEX,DIED,L_THREAT,HOSPITAL,HOSPDAYS,X_STAY,DISABLE,VAX_DATE,ONSET_DATE,NUMDAYS,BIRTH_DEFECT,ER_ED_VISIT,VAX_MANU,VAX_DOSE_SERIES,VAX_SITE,VAX_NAME,SYMPTOM1,SYMPTOMVERSION1,SYMPTOM2,SYMPTOMVERSION2,SYMPTOM3,SYMPTOMVERSION3,SYMPTOM4,SYMPTOMVERSION4,SYMPTOM5,SYMPTOMVERSION5
13009,926655,01/07/2021,FL,59.0,1,0,0,0,0,0,0,2021-01-07,2021-01-07,0,0,1,MODERNA,0,LA,COVID19 (COVID19 (MODERNA)),Non-Myocarditis-related,23.1,Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,
27615,936002,01/11/2021,TX,44.0,1,0,0,0,0,0,0,2020-12-24,2020-12-24,0,0,1,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.0,Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Non-Myocarditis-related,,Non-Myocarditis-related,
38602,944489,01/14/2021,VA,50.0,0,0,0,0,0,0,0,2020-12-18,2020-12-28,10,0,0,PFIZER\BIONTECH,0,RA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.0,Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Non-Myocarditis-related,
47878,952497,01/18/2021,IL,40.0,0,0,0,1,5,0,0,2020-12-19,2021-01-08,20,0,1,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,23.1,Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Myocarditis-related,23.1
65759,967588,01/23/2021,NY,76.0,1,0,0,1,3,0,0,2021-01-18,2021-01-19,1,0,1,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,23.1,Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
923272,1825030,10/28/2021,MI,68.0,1,0,0,1,3,0,0,2021-04-05,2021-10-25,203,0,0,MODERNA,0,UN,COVID19 (COVID19 (MODERNA)),Non-Myocarditis-related,24.1,Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1
924348,1825676,10/28/2021,IL,28.0,0,0,0,0,0,0,0,2021-10-15,2021-10-20,5,0,0,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.1,Myocarditis-related,24.1,Non-Myocarditis-related,,Non-Myocarditis-related,,Non-Myocarditis-related,
924416,1825718,10/28/2021,NC,22.0,0,0,0,0,0,0,0,2021-10-22,2021-10-26,4,0,0,PFIZER\BIONTECH,0,RA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.1,Myocarditis-related,24.1,Non-Myocarditis-related,,Non-Myocarditis-related,,Non-Myocarditis-related,
924417,1825718,10/28/2021,NC,22.0,0,0,0,0,0,0,0,2021-10-22,2021-10-26,4,0,0,PFIZER\BIONTECH,0,RA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.1,Myocarditis-related,24.1,Non-Myocarditis-related,,Non-Myocarditis-related,,Non-Myocarditis-related,


In [50]:
df_clean[df_clean["SYMPTOM3"]=="Myocarditis-related"]

Unnamed: 0,VAERS_ID,RECVDATE,STATE,AGE_YRS,SEX,DIED,L_THREAT,HOSPITAL,HOSPDAYS,X_STAY,DISABLE,VAX_DATE,ONSET_DATE,NUMDAYS,BIRTH_DEFECT,ER_ED_VISIT,VAX_MANU,VAX_DOSE_SERIES,VAX_SITE,VAX_NAME,SYMPTOM1,SYMPTOMVERSION1,SYMPTOM2,SYMPTOMVERSION2,SYMPTOM3,SYMPTOMVERSION3,SYMPTOM4,SYMPTOMVERSION4,SYMPTOM5,SYMPTOMVERSION5
11637,925542,01/07/2021,NY,30.00,0,0,0,0,0,0,0,2020-12-28,2021-01-01,4,0,1,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,
20604,931507,01/09/2021,NY,62.00,1,0,0,1,2,0,0,2021-01-07,2021-01-07,0,0,1,MODERNA,0,RA,COVID19 (COVID19 (MODERNA)),Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,
22939,932325,01/10/2021,SC,39.00,1,0,0,0,0,0,0,2021-01-05,2021-01-05,0,0,1,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0
45252,950525,01/16/2021,OH,45.00,0,0,0,0,0,0,0,2020-12-26,2021-01-05,10,0,1,MODERNA,0,LA,COVID19 (COVID19 (MODERNA)),Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0
63287,965715,01/22/2021,FL,57.00,1,0,1,1,3,0,0,2021-01-08,2021-01-08,0,0,1,MODERNA,0,LA,COVID19 (COVID19 (MODERNA)),Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
914059,1813852,10/24/2021,NY,94.00,0,1,0,0,0,0,0,2021-07-07,2021-08-03,27,0,0,PFIZER\BIONTECH,0,UN,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Myocarditis-related,24.1,Non-Myocarditis-related,,Non-Myocarditis-related,
915802,1815220,10/25/2021,WA,28.00,0,0,0,0,0,0,0,2021-10-04,2021-10-07,3,0,1,JANSSEN,0,LA,COVID19 (COVID19 (JANSSEN)),Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1
917212,1817872,10/26/2021,DE,38.00,1,0,1,1,12,0,1,2021-03-26,2021-04-09,14,0,1,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Myocarditis-related,24.1,Myocarditis-related,24.1,Non-Myocarditis-related,24.1
920849,1821819,10/27/2021,NY,51.00,1,0,1,1,2,0,0,2021-10-13,2021-10-13,0,0,0,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1


In [51]:
df_clean[df_clean["SYMPTOM4"]=="Myocarditis-related"]

Unnamed: 0,VAERS_ID,RECVDATE,STATE,AGE_YRS,SEX,DIED,L_THREAT,HOSPITAL,HOSPDAYS,X_STAY,DISABLE,VAX_DATE,ONSET_DATE,NUMDAYS,BIRTH_DEFECT,ER_ED_VISIT,VAX_MANU,VAX_DOSE_SERIES,VAX_SITE,VAX_NAME,SYMPTOM1,SYMPTOMVERSION1,SYMPTOM2,SYMPTOMVERSION2,SYMPTOM3,SYMPTOMVERSION3,SYMPTOM4,SYMPTOMVERSION4,SYMPTOM5,SYMPTOMVERSION5
36511,942662,01/14/2021,NY,40.0,1,0,0,0,0,0,0,2020-12-29,2020-12-30,1,0,1,PFIZER\BIONTECH,0,RA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Myocarditis-related,23.1,Non-Myocarditis-related,23.1
37890,943795,01/14/2021,NY,47.0,1,0,0,0,0,0,0,2021-01-03,2021-01-09,6,0,0,PFIZER\BIONTECH,0,UN,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Myocarditis-related,24.0,Non-Myocarditis-related,24.0
44932,950189,01/16/2021,SD,59.0,1,0,0,0,0,0,0,2020-12-29,2020-12-29,0,0,1,MODERNA,0,RA,COVID19 (COVID19 (MODERNA)),Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Myocarditis-related,24.0,Non-Myocarditis-related,24.0
68993,970198,01/25/2021,GA,21.0,0,0,0,1,3,0,0,2021-01-12,2021-01-13,1,0,1,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Myocarditis-related,23.1,Non-Myocarditis-related,23.1
68995,970198,01/25/2021,GA,21.0,0,0,0,1,3,0,0,2021-01-12,2021-01-13,1,0,1,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Myocarditis-related,23.1,Non-Myocarditis-related,23.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
915589,1815113,10/25/2021,CA,57.0,0,0,1,1,1,0,0,2021-10-07,2021-10-20,13,0,1,PFIZER\BIONTECH,1,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Myocarditis-related,24.1,Non-Myocarditis-related,
916146,1815445,10/25/2021,CA,19.0,0,0,1,1,2,0,0,2021-10-19,2021-10-23,4,0,0,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Myocarditis-related,24.1,Non-Myocarditis-related,24.1
917212,1817872,10/26/2021,DE,38.0,1,0,1,1,12,0,1,2021-03-26,2021-04-09,14,0,1,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Myocarditis-related,24.1,Myocarditis-related,24.1,Non-Myocarditis-related,24.1
917702,1818177,10/26/2021,OR,29.0,0,0,0,1,1,0,0,2021-10-15,2021-10-19,4,0,1,MODERNA,1,RA,COVID19 (COVID19 (MODERNA)),Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Myocarditis-related,24.1,Non-Myocarditis-related,24.1


In [52]:
df_clean[df_clean["SYMPTOM5"]=="Myocarditis-related"]

Unnamed: 0,VAERS_ID,RECVDATE,STATE,AGE_YRS,SEX,DIED,L_THREAT,HOSPITAL,HOSPDAYS,X_STAY,DISABLE,VAX_DATE,ONSET_DATE,NUMDAYS,BIRTH_DEFECT,ER_ED_VISIT,VAX_MANU,VAX_DOSE_SERIES,VAX_SITE,VAX_NAME,SYMPTOM1,SYMPTOMVERSION1,SYMPTOM2,SYMPTOMVERSION2,SYMPTOM3,SYMPTOMVERSION3,SYMPTOM4,SYMPTOMVERSION4,SYMPTOM5,SYMPTOMVERSION5
3359,919087,01/04/2021,CA,50.0,0,0,1,1,1,0,0,2020-12-23,2020-12-27,4,0,0,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Myocarditis-related,23.1
3692,919334,01/04/2021,NY,30.0,0,0,0,0,0,0,0,2020-12-28,2021-01-01,4,0,1,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Myocarditis-related,23.1
20942,931752,01/09/2021,MN,60.0,1,0,0,0,0,0,0,2021-01-06,2021-01-07,1,0,1,MODERNA,0,RA,COVID19 (COVID19 (MODERNA)),Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Myocarditis-related,24.0
24030,933131,01/10/2021,FL,49.0,1,0,0,0,0,0,0,2020-12-23,2020-12-27,4,0,1,PFIZER\BIONTECH,0,LA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Non-Myocarditis-related,24.0,Myocarditis-related,24.0
26897,935452,01/11/2021,NM,44.0,1,0,0,1,2,0,0,2021-01-06,2021-01-06,0,0,0,PFIZER\BIONTECH,0,RA,COVID19 (COVID19 (PFIZER-BIONTECH)),Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Non-Myocarditis-related,23.1,Myocarditis-related,23.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
914206,1813969,10/24/2021,AK,37.0,0,0,1,1,3,0,0,2021-10-18,2021-10-19,1,0,1,MODERNA,0,RA,COVID19 (COVID19 (MODERNA)),Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Myocarditis-related,24.1,Myocarditis-related,24.1
916029,1815372,10/25/2021,MN,26.0,0,0,0,0,0,0,0,2021-04-15,2021-05-15,30,0,1,MODERNA,0,Others,COVID19 (COVID19 (MODERNA)),Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Myocarditis-related,24.1
916031,1815372,10/25/2021,MN,26.0,0,0,0,0,0,0,0,2021-04-15,2021-05-15,30,0,1,MODERNA,0,Others,COVID19 (COVID19 (MODERNA)),Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Myocarditis-related,24.1
923696,1825278,10/28/2021,SC,47.0,0,0,0,1,2,0,0,2021-09-17,2021-09-21,4,0,1,JANSSEN,0,RA,COVID19 (COVID19 (JANSSEN)),Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Non-Myocarditis-related,24.1,Myocarditis-related,24.1
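
The filtered outputs above include repeated VAERS_ID values (for example 970198 and 1815372), so the number of matching rows can exceed the number of distinct reports. A minimal sketch of how the two counts could be compared, using a made-up toy frame (`toy`, `n_rows` and `n_reports` are hypothetical names, and the IDs are not real reports):

```python
import pandas as pd

# Toy data: only the column names follow the notebook; the rows are invented.
toy = pd.DataFrame({
    "VAERS_ID": [101, 101, 102, 103],
    "SYMPTOM5": [
        "Myocarditis-related",
        "Myocarditis-related",
        "Non-Myocarditis-related",
        "Myocarditis-related",
    ],
})

flagged = toy[toy["SYMPTOM5"] == "Myocarditis-related"]

n_rows = len(flagged)                      # rows matching the filter
n_reports = flagged["VAERS_ID"].nunique()  # distinct reports among those rows

print(n_rows, n_reports)
```

If the two numbers differ on the real data, it may be worth deciding whether later counts should be per row or per report.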


In [53]:
# Compile the symptoms into a single column: SYMPTOM1
# If a myocarditis/pericarditis-related symptom appears in any of SYMPTOM2 to SYMPTOM5,
# the category in SYMPTOM1 is overwritten with "Myocarditis-related"

df_clean.loc[df_clean['SYMPTOM2'] == 'Myocarditis-related', 'SYMPTOM1'] = "Myocarditis-related"
df_clean.loc[df_clean['SYMPTOM3'] == 'Myocarditis-related', 'SYMPTOM1'] = "Myocarditis-related"
df_clean.loc[df_clean['SYMPTOM4'] == 'Myocarditis-related', 'SYMPTOM1'] = "Myocarditis-related"
df_clean.loc[df_clean['SYMPTOM5'] == 'Myocarditis-related', 'SYMPTOM1'] = "Myocarditis-related"

df_clean["SYMPTOM1"].value_counts()

Non-Myocarditis-related    542123
Myocarditis-related          2599
Name: SYMPTOM1, dtype: int64
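
The four assignments above could also be written as one row-wise check across all symptom columns. The sketch below is an alternative formulation, not the notebook's own code; it runs on a small made-up frame, and `LABEL`, `symptom_cols`, `toy` and `any_myo` are hypothetical names.

```python
import pandas as pd

LABEL = "Myocarditis-related"
OTHER = "Non-Myocarditis-related"
symptom_cols = [f"SYMPTOM{i}" for i in range(1, 6)]

# Toy data: three invented rows with the notebook's column names.
toy = pd.DataFrame(
    [
        [OTHER, OTHER, OTHER, OTHER, OTHER],  # no flagged symptom
        [OTHER, OTHER, LABEL, OTHER, OTHER],  # flagged only in SYMPTOM3
        [LABEL, OTHER, OTHER, OTHER, OTHER],  # already flagged in SYMPTOM1
    ],
    columns=symptom_cols,
)

# True for every row where at least one symptom column carries the label.
any_myo = toy[symptom_cols].eq(LABEL).any(axis=1)

# Collapse the result into SYMPTOM1, as the cell above does.
toy.loc[any_myo, "SYMPTOM1"] = LABEL

counts = toy["SYMPTOM1"].value_counts()
print(counts)
```

Keeping the boolean mask in its own variable also makes it easy to inspect which rows were relabelled before the other symptom columns are dropped.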

In [55]:
# Remove redundant columns: SYMPTOM1 now holds the combined label

redundant_cols = [
    'SYMPTOM2', 'SYMPTOM3', 'SYMPTOM4', 'SYMPTOM5',
    'SYMPTOMVERSION1', 'SYMPTOMVERSION2', 'SYMPTOMVERSION3',
    'SYMPTOMVERSION4', 'SYMPTOMVERSION5',
]
df_clean = df_clean.drop(columns=redundant_cols)

In [56]:
df_clean.isnull().sum()

VAERS_ID           0
RECVDATE           0
STATE              0
AGE_YRS            0
SEX                0
DIED               0
L_THREAT           0
HOSPITAL           0
HOSPDAYS           0
X_STAY             0
DISABLE            0
VAX_DATE           0
ONSET_DATE         0
NUMDAYS            0
BIRTH_DEFECT       0
ER_ED_VISIT        0
VAX_MANU           0
VAX_DOSE_SERIES    0
VAX_SITE           0
VAX_NAME           0
SYMPTOM1           0
dtype: int64

# 7. Replace Missing/Unknown Vaccination Site Values with the Mode

In [57]:
# Check VAX_SITE column

df_clean['VAX_SITE'].unique()

array(['LA', 'RA', 'Others', 'UN'], dtype=object)

In [58]:
# Check for other missing values

df_clean['VAX_SITE'].isna().sum()

0

In [59]:
# Check VAX_SITE column value counts

df_clean['VAX_SITE'].value_counts()

LA        386029
RA        133652
Others     13804
UN         11237
Name: VAX_SITE, dtype: int64

In [60]:
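# Obtain the mode of the VAX_SITE column
# (pandas Series.mode() works on string categories; recent SciPy releases
# restrict stats.mode to numeric arrays and no longer support [0][0] indexing)

vax_site_mode = df_clean['VAX_SITE'].mode()[0]

# Convert unknown ('UN') and missing values to the mode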

df_clean['VAX_SITE'] = df_clean['VAX_SITE'].replace({
    'UN':vax_site_mode
})

df_clean['VAX_SITE'] = df_clean['VAX_SITE'].fillna(vax_site_mode)
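
In the counts above 'LA' is the most frequent site whether or not 'UN' is included, so the result is the same either way. As a more defensive variant, the mode could be computed from known values only, so that 'UN' can never be selected as the fill value. A minimal sketch on a made-up Series (`toy_site`, `known`, `site_mode` and `imputed` are hypothetical names):

```python
import pandas as pd

# Toy data: invented site codes, including unknown ('UN') and missing (None).
toy_site = pd.Series(["LA", "RA", "UN", "LA", "Others", None, "UN"])

# Compute the mode from known values only.
known = toy_site[toy_site != "UN"].dropna()
site_mode = known.mode().iloc[0]

# Replace the unknown code and fill missing entries with that mode.
imputed = toy_site.replace("UN", site_mode).fillna(site_mode)

print(site_mode)
print(imputed.tolist())
```

`Series.mode()` returns every tied value in sorted order, so `.iloc[0]` picks the first one when there is a tie.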


In [61]:
df_clean.isnull().sum()

VAERS_ID           0
RECVDATE           0
STATE              0
AGE_YRS            0
SEX                0
DIED               0
L_THREAT           0
HOSPITAL           0
HOSPDAYS           0
X_STAY             0
DISABLE            0
VAX_DATE           0
ONSET_DATE         0
NUMDAYS            0
BIRTH_DEFECT       0
ER_ED_VISIT        0
VAX_MANU           0
VAX_DOSE_SERIES    0
VAX_SITE           0
VAX_NAME           0
SYMPTOM1           0
dtype: int64

In [62]:
# Output the dataset for further cleaning and EDA 

df_clean.to_csv("df_clean.csv",index=False)
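
When the exported file is read back for further cleaning, the date columns arrive as plain strings unless they are parsed again. The sketch below illustrates a round trip on a made-up frame written to an in-memory buffer; it assumes VAX_DATE and ONSET_DATE are meant to be datetimes, and `toy`, `buffer` and `reloaded` are hypothetical names.

```python
import io

import pandas as pd

# Toy data: invented rows using three of the notebook's column names.
toy = pd.DataFrame({
    "VAERS_ID": [101, 102],
    "VAX_DATE": pd.to_datetime(["2021-01-07", "2021-10-13"]),
    "ONSET_DATE": pd.to_datetime(["2021-01-07", "2021-10-13"]),
})

# Write without the index, as in the cell above, but to an in-memory buffer.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Re-parse the date columns on load; otherwise they come back as strings.
reloaded = pd.read_csv(buffer, parse_dates=["VAX_DATE", "ONSET_DATE"])

print(reloaded.dtypes)
```

Because `index=False` is used on export, the reloaded frame does not gain an extra "Unnamed: 0" column.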