# Adverse Reaction Cluster of the COVID-19 Vaccine: Potential Clinical Prediction Tool

### Andrea Gomez, Dung Mai, Mariana Maroto
### Graduate Center, CUNY - Machine Learning CSCI 740 Spring 2021

Our application project clusters COVID-19 vaccine adverse reactions. The purpose of the project is having a detailed understanding of the common types of adverse reactions and  identifying which adverse reactions are in need of immediate care. This project suggests a two-step approach. First, we use an unsupervised machine learning algorithm (clustering) to segment adverse reactions into groups. This will give us the most common symptoms for each group. Second, using the vaccine reaction clusters, along with additional patient information (gender, age, allergies) and vaccine manufacturer information, we will predict the need of urgent medical care by using fatalities and hospital visits. 

The dataset is provided by the Vaccine Adverse Event Reporting System VAERS and contains reports about adverse events that may be associated with COVID-19 vaccines. The database we chose was the dataset provided for the current year 2021, considering that our goal is to explore reactions due to Covid vaccines. The data contains reports processed as of 3/26/2021. 

Dataset Source: VAERS - Vaccine Adverse Event Reporting System. Data Retrieve on 4/8/2021 https://vaers.hhs.gov/data/datasets.html

## 1. Data Cleaning

1.1 Read Datasets

In [117]:
import pandas as pd
import numpy as np

all_data =  pd.read_csv('2021VAERSDATA.csv', sep=",", encoding = "ISO-8859-1")
symp = pd.read_csv('2021VAERSSYMPTOMS.csv', sep=",", encoding = "ISO-8859-1")
vax = pd.read_csv('2021VAERSVAX.csv', sep=",", encoding = "ISO-8859-1")

print('Main Dataset Size:'+ str(all_data.shape))
print('Symptoms Coded Dataset Size:' + str(symp.shape))
print('Vaccine Info Dataset Size:' + str(vax.shape))

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


Main Dataset Size:(40348, 35)
Symptoms Coded Dataset Size:(56533, 11)
Vaccine Info Dataset Size:(40937, 8)


Symptoms coded dataset requires the most cleaning as each symptom should become a feature in dummy coding format. In original file, there could be more than one row for one individual if they presented more than 5 symptoms.

In [118]:
symp.head()

Unnamed: 0,VAERS_ID,SYMPTOM1,SYMPTOMVERSION1,SYMPTOM2,SYMPTOMVERSION2,SYMPTOM3,SYMPTOMVERSION3,SYMPTOM4,SYMPTOMVERSION4,SYMPTOM5,SYMPTOMVERSION5
0,916600,Dysphagia,23.1,Epiglottitis,23.1,,,,,,
1,916601,Anxiety,23.1,Dyspnoea,23.1,,,,,,
2,916602,Chest discomfort,23.1,Dysphagia,23.1,Pain in extremity,23.1,Visual impairment,23.1,,
3,916603,Dizziness,23.1,Fatigue,23.1,Mobility decreased,23.1,,,,
4,916604,Injection site erythema,23.1,Injection site pruritus,23.1,Injection site swelling,23.1,Injection site warmth,23.1,,


#### Dropping the SYMPTOMVERSION columns

In [119]:
# Remove SYMPTOMVERSION columns that are unecessary
symp = symp[symp.columns.drop(list(symp.filter(regex='SYMPTOMVERSION')))]
symp.head()

Unnamed: 0,VAERS_ID,SYMPTOM1,SYMPTOM2,SYMPTOM3,SYMPTOM4,SYMPTOM5
0,916600,Dysphagia,Epiglottitis,,,
1,916601,Anxiety,Dyspnoea,,,
2,916602,Chest discomfort,Dysphagia,Pain in extremity,Visual impairment,
3,916603,Dizziness,Fatigue,Mobility decreased,,
4,916604,Injection site erythema,Injection site pruritus,Injection site swelling,Injection site warmth,


In [120]:
symp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56533 entries, 0 to 56532
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   VAERS_ID  56533 non-null  int64 
 1   SYMPTOM1  56533 non-null  object
 2   SYMPTOM2  46302 non-null  object
 3   SYMPTOM3  37031 non-null  object
 4   SYMPTOM4  28625 non-null  object
 5   SYMPTOM5  21601 non-null  object
dtypes: int64(1), object(5)
memory usage: 2.6+ MB


In [121]:
# Find the number of unique values for all the SYMPTOMS 
uniq_symp = pd.unique(symp[['SYMPTOM1', 'SYMPTOM2', 'SYMPTOM3', 'SYMPTOM4', 'SYMPTOM5']].values.ravel('K'))
len(uniq_symp)

4407

#### Find the most common symptoms in all the patients

In [122]:
# collect all the symptoms in tuples with their respective frequency

s = ['SYMPTOM1', 'SYMPTOM2', 'SYMPTOM3', 'SYMPTOM4', 'SYMPTOM5']
symptoms = []

for col in s:
    index = symp[col].value_counts().index
    values = symp[col].value_counts().values
    symptoms += zip(index, values)

In [123]:
from itertools import groupby
from operator import itemgetter

first = itemgetter(0)

# add the values of the tuples with the same symptom
symptoms = [(k, sum(item[1] for item in tups_to_sum))
        for k, tups_to_sum in groupby(sorted(symptoms, key=first), key=first)]
 
# Function to sort hte list by second item of tuple
def Sort_Tuple(tup): 
  
    # reverse = None (Sorts in Ascending order) 
    # key is set to sort using second element of 
    # sublist lambda has been used 
    tup.sort(reverse=True, key = lambda x: x[1]) 
    return tup
   
# printing the sorted list of tuples
symptoms = Sort_Tuple(symptoms)
#print(symptoms[:34])

# add all the symptoms to a list and keep the 35
# most common symptoms
c_symp = [tup[0] for tup in symptoms][:34]
print('---List of the most common symptoms---\n')
print(c_symp)

---List of the most common symptoms---

['Headache', 'Pyrexia', 'Chills', 'Fatigue', 'Pain', 'Nausea', 'Dizziness', 'Pain in extremity', 'Myalgia', 'Injection site pain', 'Injection site erythema', 'Arthralgia', 'Dyspnoea', 'Vomiting', 'Pruritus', 'Injection site swelling', 'Rash', 'Death', 'Asthenia', 'Injection site pruritus', 'Paraesthesia', 'Malaise', 'Erythema', 'Diarrhoea', 'SARS-CoV-2 test positive', 'Injection site warmth', 'Urticaria', 'Hypoaesthesia', 'Hyperhidrosis', 'Lymphadenopathy', 'COVID-19', 'Cough', 'Feeling abnormal', 'SARS-CoV-2 test negative']


#### Hot encoding the symtoms

In [124]:
# Stack all the symptom columns in one
symp['INDEX1'] = symp.index
symp = pd.wide_to_long(symp, stubnames='SYMPTOM', i=['INDEX1'], j='SYMPNUMBER')
symp.reset_index(drop=True, inplace=True)

# Most commom symptoms 
symp['SYMPTOM'].value_counts(ascending = False).head(50)

Headache                    8881
Pyrexia                     7204
Chills                      6865
Fatigue                     6418
Pain                        6034
Nausea                      5039
Dizziness                   4229
Pain in extremity           3678
Myalgia                     3416
Injection site pain         3320
Injection site erythema     2655
Arthralgia                  2505
Dyspnoea                    2407
Vomiting                    2050
Pruritus                    2044
Injection site swelling     1975
Rash                        1934
Death                       1813
Asthenia                    1799
Injection site pruritus     1619
Paraesthesia                1508
Malaise                     1495
Erythema                    1490
Diarrhoea                   1456
SARS-CoV-2 test positive    1416
Injection site warmth       1399
Urticaria                   1379
Hypoaesthesia               1280
Hyperhidrosis               1213
Lymphadenopathy             1212
COVID-19  

In [125]:
# Remove symptoms that appeared in less than 1000 patients
# This removes the patients who didn't have any of the most common symptoms
v = symp['SYMPTOM'].value_counts()
common_symp = symp[symp['SYMPTOM'].isin(v.index[v.values > 1000])]

In [126]:
# Store ID column
ids = common_symp['VAERS_ID']

# Apply one-hot encoding
common_symp = common_symp['SYMPTOM'].str.get_dummies()

# Add the VAERS_ID column
common_symp.insert(loc=0, column='VAERS_ID', value=ids)

# Merging all the rows with the same ID after using dummy encoding 
common_symp = common_symp.groupby(['VAERS_ID']).sum().reset_index()
common_symp

Unnamed: 0,VAERS_ID,Arthralgia,Asthenia,COVID-19,Chills,Cough,Death,Diarrhoea,Dizziness,Dyspnoea,...,Pain,Pain in extremity,Paraesthesia,Pruritus,Pyrexia,Rash,SARS-CoV-2 test negative,SARS-CoV-2 test positive,Urticaria,Vomiting
0,916601,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,916602,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,916603,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,916604,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,916607,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32671,1134697,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
32672,1134819,0,0,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
32673,1135949,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0
32674,1136535,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


#### Combining meaningful features from the other datasets

The 2021VAERSVAX.csv and 2021VAERSDATA.csv dataset contain important information for our analysis like gender, age, and vaccine manufacter. 

We will merge those features with the dataset we have with the encoded symptoms. We will only bring information about the patients that we have in our **commom_symp** dataset.

In [127]:
all_data.drop(all_data.columns.difference(['VAERS_ID',  'AGE_YRS', 'SEX', 'DIED', 'HOSPITAL', 'HOSPDAYS']), 1, inplace=True)
vax.drop(vax.columns.difference(['VAERS_ID','VAX_TYPE', 'VAX_MANU', 'VAX_DOSE_SERIES']), 1, inplace=True)

In [128]:
all_data

Unnamed: 0,VAERS_ID,AGE_YRS,SEX,DIED,HOSPITAL,HOSPDAYS
0,916600,33.0,F,,,
1,916601,73.0,F,,,
2,916602,23.0,F,,,
3,916603,58.0,F,,,
4,916604,47.0,F,,,
...,...,...,...,...,...,...
40343,1135429,81.0,F,,,
40344,1135949,47.0,F,,Y,2.0
40345,1136535,72.0,F,,Y,2.0
40346,1136622,70.0,M,,,


In [129]:
vax

Unnamed: 0,VAERS_ID,VAX_TYPE,VAX_MANU,VAX_DOSE_SERIES
0,916600,COVID19,MODERNA,1
1,916601,COVID19,MODERNA,1
2,916602,COVID19,PFIZER\BIONTECH,1
3,916603,COVID19,MODERNA,UNK
4,916604,COVID19,MODERNA,1
...,...,...,...,...
40932,1135429,COVID19,JANSSEN,UNK
40933,1135949,COVID19,JANSSEN,1
40934,1136535,COVID19,JANSSEN,UNK
40935,1136622,COVID19,JANSSEN,1


In [130]:
# keep records with only COVID19 vaccines
vax = vax.iloc[(vax['VAX_TYPE'] == 'COVID19').values]
# Detele duplicates
vax = vax.drop_duplicates(subset=['VAERS_ID'])

In [131]:
vax['VAERS_ID'].value_counts()

919551     1
937283     1
933197     1
931148     1
918858     1
          ..
1047178    1
1084036    1
1024641    1
1047047    1
917504     1
Name: VAERS_ID, Length: 40044, dtype: int64

In [132]:
# The VAX dataset has samples that are not contained in the 
# ALL_DATA dataset. We will keep only the samples with VAERS_ID 
# contained in our common symptoms dataset

# boolean list with the IDs of ALL_DATA contained in common_symp
ids_in = all_data['VAERS_ID'].isin(common_symp['VAERS_ID']).values
# Filter: includes only VAERS_ID in common_symp
all_data = all_data.iloc[ids_in]

# boolean list with the IDs of VAX contained in common_symp
ids_in = common_symp['VAERS_ID'].isin(vax['VAERS_ID']).values
print('length common symp', len(ids_in), 'values from common symp in vax', sum(ids_in))

print('Size of common_symptoms dataset:', len(common_symp))
print('Size of ALL_DATA dataset:',len(all_data))
print('Size of VAX dataset:',len(vax))

length common symp 32676 values from common symp in vax 32523
Size of common_symptoms dataset: 32676
Size of ALL_DATA dataset: 32676
Size of VAX dataset: 40044


We will use the **32523** samples for which we have the vaccination manufacter available. **We will include only the VAERS_ID samples for which we have information from the 3 datasets, SYMPTOMS, ALL_DATA and VAX.**

In [133]:
# boolean list with the IDs of common symp contained in vax
ids_in = common_symp['VAERS_ID'].isin(vax['VAERS_ID']).values
common_symp = common_symp.iloc[ids_in]

# Filter: includes only VAERS_ID in common_symp
all_data = all_data.iloc[ids_in]

ids_in = vax['VAERS_ID'].isin(common_symp['VAERS_ID']).values
# Filter: includes only VAERS_ID in common_symp
vax = vax.iloc[ids_in]

print('Size of common_symptoms dataset:', len(common_symp))
print('Size of ALL_DATA dataset:',len(all_data))
print('Size of VAX dataset:',len(vax))

Size of common_symptoms dataset: 32523
Size of ALL_DATA dataset: 32523
Size of VAX dataset: 32523


In [137]:
# Merging meaningful features to our sataset
#data = common_symp.merge(all_data, left_on='VAERS_ID')
data = common_symp.merge(all_data, on='VAERS_ID')
data

Unnamed: 0,VAERS_ID,Arthralgia,Asthenia,COVID-19,Chills,Cough,Death,Diarrhoea,Dizziness,Dyspnoea,...,Rash,SARS-CoV-2 test negative,SARS-CoV-2 test positive,Urticaria,Vomiting,AGE_YRS,SEX,DIED,HOSPITAL,HOSPDAYS
0,916601,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,73.0,F,,,
1,916602,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,23.0,F,,,
2,916603,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,58.0,F,,,
3,916604,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,47.0,F,,,
4,916607,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,50.0,M,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32381,1134697,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,51.0,M,Y,,
32382,1134819,0,0,0,0,0,1,0,0,1,...,0,0,0,0,0,51.0,F,Y,Y,5.0
32383,1135949,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,47.0,F,,Y,2.0
32384,1136535,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,72.0,F,,Y,2.0


In [95]:
data = data.merge(vax, on='VAERS_ID')
data

Unnamed: 0,VAERS_IDVAERS_ID,Arthralgia,Asthenia,COVID-19,Chills,Cough,Death,Diarrhoea,Dizziness,Dyspnoea,...,Vomiting,AGE_YRS,SEX,DIED,HOSPITAL,HOSPDAYS,VAERS_ID,VAX_TYPE,VAX_MANU,VAX_DOSE_SERIES


In [135]:
diff = common_symp['VAERS_ID'].compare(all_data['VAERS_ID'])

ValueError: Can only compare identically-labeled Series objects

In [141]:
# Merge vax with result2
result2 = pd.concat([common_symp.set_index('VAERS_ID'), all_data.set_index('VAERS_ID')], axis=1, join="inner")

In [142]:
result2

Unnamed: 0_level_0,Arthralgia,Asthenia,COVID-19,Chills,Cough,Death,Diarrhoea,Dizziness,Dyspnoea,Erythema,...,Rash,SARS-CoV-2 test negative,SARS-CoV-2 test positive,Urticaria,Vomiting,AGE_YRS,SEX,DIED,HOSPITAL,HOSPDAYS
VAERS_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
916601,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,73.0,F,,,
916602,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,23.0,F,,,
916603,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,58.0,F,,,
916604,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,47.0,F,,,
916607,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,50.0,M,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1134697,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,51.0,M,Y,,
1134819,0,0,0,0,0,1,0,0,1,0,...,0,0,0,0,0,51.0,F,Y,Y,5.0
1135949,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,47.0,F,,Y,2.0
1136535,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,72.0,F,,Y,2.0


In [143]:
result2['DIED'].value_counts()

Y    1912
Name: DIED, dtype: int64

In [146]:
# Merge vax with result2
result_final = pd.concat([result2, vax.set_index('VAERS_ID')], axis=1, join="inner")
result_final

Unnamed: 0_level_0,Arthralgia,Asthenia,COVID-19,Chills,Cough,Death,Diarrhoea,Dizziness,Dyspnoea,Erythema,...,Urticaria,Vomiting,AGE_YRS,SEX,DIED,HOSPITAL,HOSPDAYS,VAX_TYPE,VAX_MANU,VAX_DOSE_SERIES
VAERS_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
916601,0,0,0,0,0,0,0,0,1,0,...,0,0,73.0,F,,,,COVID19,MODERNA,1
916602,0,0,0,0,0,0,0,0,0,0,...,0,0,23.0,F,,,,COVID19,PFIZER\BIONTECH,1
916603,0,0,0,0,0,0,0,1,0,0,...,0,0,58.0,F,,,,COVID19,MODERNA,UNK
916604,0,0,0,0,0,0,0,0,0,0,...,0,0,47.0,F,,,,COVID19,MODERNA,1
916607,0,0,0,1,0,0,0,0,0,0,...,0,0,50.0,M,,,,COVID19,MODERNA,UNK
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1134697,0,0,0,0,0,1,0,0,0,0,...,0,0,51.0,M,Y,,,COVID19,JANSSEN,1
1134819,0,0,0,0,0,1,0,0,1,0,...,0,0,51.0,F,Y,Y,5.0,COVID19,JANSSEN,1
1135949,0,0,0,0,0,0,0,0,0,0,...,1,0,47.0,F,,Y,2.0,COVID19,JANSSEN,1
1136535,0,1,0,0,0,0,0,0,0,0,...,0,0,72.0,F,,Y,2.0,COVID19,JANSSEN,UNK
