# Adverse Reaction Cluster of the COVID-19 Vaccine: Potential Clinical Prediction Tool

### Andrea Gomez, Dung Mai, Mariana Maroto
### Graduate Center, CUNY - Machine Learning CSCI 740 Spring 2021

Our application project clusters adverse reactions to COVID-19 vaccines. The goal is to gain a detailed understanding of the common types of adverse reactions and to identify which ones require immediate care. We propose a two-step approach. First, we use an unsupervised machine learning algorithm (clustering) to segment adverse reactions into groups, which gives us the most common symptoms for each group. Second, using the vaccine reaction clusters along with additional patient information (gender, age, allergies) and vaccine manufacturer information, we predict the need for urgent medical care, using fatalities and hospital visits as the outcome.

The dataset is provided by the Vaccine Adverse Event Reporting System (VAERS) and contains reports of adverse events that may be associated with COVID-19 vaccines. We chose the dataset for the current year, 2021, because our goal is to explore reactions to COVID-19 vaccines. The data contains reports processed as of 3/26/2021.

Dataset source: VAERS - Vaccine Adverse Event Reporting System. Data retrieved on 4/8/2021 from https://vaers.hhs.gov/data/datasets.html

## 1. Data Cleaning

#### 1.1 Read Datasets

In [436]:
import pandas as pd
import numpy as np

all_data = pd.read_csv('2021VAERSDATA.csv', sep=",", encoding="ISO-8859-1")
symp = pd.read_csv('2021VAERSSYMPTOMS.csv', sep=",", encoding="ISO-8859-1")
vax = pd.read_csv('2021VAERSVAX.csv', sep=",", encoding="ISO-8859-1")

print('Main Dataset Size:'+ str(all_data.shape))
print('Symptoms Coded Dataset Size:' + str(symp.shape))
print('Vaccine Info Dataset Size:' + str(vax.shape))

Main Dataset Size:(40348, 35)
Symptoms Coded Dataset Size:(56533, 11)
Vaccine Info Dataset Size:(40937, 8)



The coded symptoms dataset requires the most cleaning, because each symptom should become a feature in dummy-coded format. In the original file, a single report can span more than one row when it lists more than 5 symptoms.

In [437]:
symp.head()

Unnamed: 0,VAERS_ID,SYMPTOM1,SYMPTOMVERSION1,SYMPTOM2,SYMPTOMVERSION2,SYMPTOM3,SYMPTOMVERSION3,SYMPTOM4,SYMPTOMVERSION4,SYMPTOM5,SYMPTOMVERSION5
0,916600,Dysphagia,23.1,Epiglottitis,23.1,,,,,,
1,916601,Anxiety,23.1,Dyspnoea,23.1,,,,,,
2,916602,Chest discomfort,23.1,Dysphagia,23.1,Pain in extremity,23.1,Visual impairment,23.1,,
3,916603,Dizziness,23.1,Fatigue,23.1,Mobility decreased,23.1,,,,
4,916604,Injection site erythema,23.1,Injection site pruritus,23.1,Injection site swelling,23.1,Injection site warmth,23.1,,
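To illustrate why one report can occupy several rows, here is a minimal sketch on a toy frame that mimics the symptoms layout. The IDs, symptoms, and the names `toy`, `rows_per_report`, and `multi_row_ids` are made up for illustration and are not taken from the VAERS files.

```python
import pandas as pd

# Toy data (hypothetical): report 1 lists seven symptoms,
# so it spills onto a second row with the same VAERS_ID.
toy = pd.DataFrame({
    "VAERS_ID": [1, 1, 2],
    "SYMPTOM1": ["Headache", "Rash", "Chills"],
    "SYMPTOM2": ["Pyrexia", "Cough", None],
    "SYMPTOM3": ["Chills", None, None],
    "SYMPTOM4": ["Fatigue", None, None],
    "SYMPTOM5": ["Pain", None, None],
})

# Count how many rows each report occupies
rows_per_report = toy.groupby("VAERS_ID").size()

# Reports that span more than one row
multi_row_ids = rows_per_report[rows_per_report > 1].index.tolist()
print(multi_row_ids)
```

The same `groupby(...).size()` check could be run on `symp` to see how many reports are affected.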


#### Dropping the SYMPTOMVERSION columns

In [438]:
# Remove the SYMPTOMVERSION columns, which are unnecessary
symp = symp[symp.columns.drop(list(symp.filter(regex='SYMPTOMVERSION')))]
symp.head()

Unnamed: 0,VAERS_ID,SYMPTOM1,SYMPTOM2,SYMPTOM3,SYMPTOM4,SYMPTOM5
0,916600,Dysphagia,Epiglottitis,,,
1,916601,Anxiety,Dyspnoea,,,
2,916602,Chest discomfort,Dysphagia,Pain in extremity,Visual impairment,
3,916603,Dizziness,Fatigue,Mobility decreased,,
4,916604,Injection site erythema,Injection site pruritus,Injection site swelling,Injection site warmth,


In [439]:
symp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56533 entries, 0 to 56532
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   VAERS_ID  56533 non-null  int64 
 1   SYMPTOM1  56533 non-null  object
 2   SYMPTOM2  46302 non-null  object
 3   SYMPTOM3  37031 non-null  object
 4   SYMPTOM4  28625 non-null  object
 5   SYMPTOM5  21601 non-null  object
dtypes: int64(1), object(5)
memory usage: 2.6+ MB


In [440]:
# Count the unique values across all the SYMPTOM columns (NaN is counted as one value)
uniq_symp = pd.unique(symp[['SYMPTOM1', 'SYMPTOM2', 'SYMPTOM3', 'SYMPTOM4', 'SYMPTOM5']].values.ravel('K'))
len(uniq_symp)

4407
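Because several symptom columns contain missing values, the count above most likely includes `NaN` as one of the "unique" entries. A small sketch on toy data (the frame and variable names are illustrative assumptions) shows the difference between `pd.unique`, which keeps `NaN`, and `Series.nunique`, which drops it by default:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one empty symptom slot
toy = pd.DataFrame({
    "SYMPTOM1": ["Headache", "Chills"],
    "SYMPTOM2": ["Pyrexia", np.nan],
})

# Flatten both columns into a single 1-D array
values = toy[["SYMPTOM1", "SYMPTOM2"]].values.ravel("K")

with_nan = len(pd.unique(values))           # NaN counted as a value
without_nan = pd.Series(values).nunique()   # nunique() drops NaN by default
print(with_nan, without_nan)
```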

#### Find the most common symptoms in all the patients

In [441]:
# collect all the symptoms in tuples with their respective frequency

s = ['SYMPTOM1', 'SYMPTOM2', 'SYMPTOM3', 'SYMPTOM4', 'SYMPTOM5']
symptoms = []

for col in s:
    index = symp[col].value_counts().index
    values = symp[col].value_counts().values
    symptoms += zip(index, values)

In [442]:
from itertools import groupby
from operator import itemgetter

first = itemgetter(0)

# add the values of the tuples with the same symptom
symptoms = [(k, sum(item[1] for item in tups_to_sum))
        for k, tups_to_sum in groupby(sorted(symptoms, key=first), key=first)]
 
# Function to sort the list by the second item of each tuple
def Sort_Tuple(tup):
    # reverse=True sorts in descending order;
    # the key lambda selects the second element (the count)
    tup.sort(reverse=True, key=lambda x: x[1])
    return tup

# sort the list of tuples by frequency
symptoms = Sort_Tuple(symptoms)
#print(symptoms[:34])

# add all the symptoms to a list and keep the 34
# most common symptoms
c_symp = [tup[0] for tup in symptoms][:34]
print('---List of the most common symptoms---\n')
print(c_symp)

---List of the most common symptoms---

['Headache', 'Pyrexia', 'Chills', 'Fatigue', 'Pain', 'Nausea', 'Dizziness', 'Pain in extremity', 'Myalgia', 'Injection site pain', 'Injection site erythema', 'Arthralgia', 'Dyspnoea', 'Vomiting', 'Pruritus', 'Injection site swelling', 'Rash', 'Death', 'Asthenia', 'Injection site pruritus', 'Paraesthesia', 'Malaise', 'Erythema', 'Diarrhoea', 'SARS-CoV-2 test positive', 'Injection site warmth', 'Urticaria', 'Hypoaesthesia', 'Hyperhidrosis', 'Lymphadenopathy', 'COVID-19', 'Cough', 'Feeling abnormal', 'SARS-CoV-2 test negative']
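The same frequency ranking can be obtained more compactly by melting the symptom columns into one and calling `value_counts()` once. This is a minimal sketch on a toy frame, not the notebook's own method; the data and the names `toy`, `long`, `counts`, and `top_2` are assumptions made for illustration.

```python
import pandas as pd

# Toy data (hypothetical) in the same wide layout as the symptoms file
toy = pd.DataFrame({
    "VAERS_ID": [1, 2, 3, 4],
    "SYMPTOM1": ["Headache", "Headache", "Chills", "Headache"],
    "SYMPTOM2": ["Chills", "Pyrexia", None, None],
})
symptom_cols = ["SYMPTOM1", "SYMPTOM2"]

# Stack every symptom column into a single SYMPTOM column
long = toy.melt(id_vars="VAERS_ID", value_vars=symptom_cols,
                value_name="SYMPTOM")

# value_counts() ignores missing values and sorts by descending frequency
counts = long["SYMPTOM"].value_counts()
top_2 = counts.head(2).index.tolist()
print(top_2)
```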


#### One-hot encoding the symptoms

In [443]:
# Stack all the symptom columns in one
symp['INDEX1'] = symp.index
symp = pd.wide_to_long(symp, stubnames='SYMPTOM', i=['INDEX1'], j='SYMPNUMBER')
symp.reset_index(drop=True, inplace=True)

# Most common symptoms
symp['SYMPTOM'].value_counts(ascending = False).head(50)

Headache                    8881
Pyrexia                     7204
Chills                      6865
Fatigue                     6418
Pain                        6034
Nausea                      5039
Dizziness                   4229
Pain in extremity           3678
Myalgia                     3416
Injection site pain         3320
Injection site erythema     2655
Arthralgia                  2505
Dyspnoea                    2407
Vomiting                    2050
Pruritus                    2044
Injection site swelling     1975
Rash                        1934
Death                       1813
Asthenia                    1799
Injection site pruritus     1619
Paraesthesia                1508
Malaise                     1495
Erythema                    1490
Diarrhoea                   1456
SARS-CoV-2 test positive    1416
Injection site warmth       1399
Urticaria                   1379
Hypoaesthesia               1280
Hyperhidrosis               1213
Lymphadenopathy             1212
COVID-19  
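To make the reshaping step concrete, here is a small sketch of `pd.wide_to_long` on a toy frame with two symptom columns (made-up data, illustrative names). Each `SYMPTOMn` column is stacked into a single `SYMPTOM` column, so the result has one row per symptom slot of each original row, including the empty slots.

```python
import pandas as pd

# Toy data (hypothetical) with two symptom slots per row
toy = pd.DataFrame({
    "VAERS_ID": [1, 2],
    "SYMPTOM1": ["Headache", "Chills"],
    "SYMPTOM2": ["Pyrexia", None],
})

# wide_to_long needs a column that uniquely identifies each row
toy["INDEX1"] = toy.index
long = pd.wide_to_long(toy, stubnames="SYMPTOM", i=["INDEX1"], j="SYMPNUMBER")
long = long.reset_index(drop=True)

# 2 rows x 2 symptom slots = 4 rows, one of which has no symptom
print(long)
```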

In [444]:
# Keep only the symptoms that appear more than 1000 times
# This also drops the patients who didn't have any of the most common symptoms
v = symp['SYMPTOM'].value_counts()
commom_symp = symp[symp['SYMPTOM'].isin(v.index[v.values > 1000])]


In [445]:
# Store ID column
ids = commom_symp['VAERS_ID']
# Apply one-hot encoding
commom_symp = commom_symp['SYMPTOM'].str.get_dummies()
# Add the VAERS_ID column (insert works in place and returns None)
commom_symp.insert(loc=0, column='VAERS_ID', value=ids)
commom_symp

Unnamed: 0,VAERS_ID,Arthralgia,Asthenia,COVID-19,Chills,Cough,Death,Diarrhoea,Dizziness,Dyspnoea,...,Pain,Pain in extremity,Paraesthesia,Pruritus,Pyrexia,Rash,SARS-CoV-2 test negative,SARS-CoV-2 test positive,Urticaria,Vomiting
3,916603,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,916604,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,916608,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
9,916610,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
11,916611,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
282645,1134120,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
282646,1134120,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
282655,1134819,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
282659,1135949,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
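A note on `DataFrame.insert`, which is used to add the `VAERS_ID` column: the method modifies the frame in place and returns `None`, so assigning its result to a new name does not produce a DataFrame. A minimal sketch on a toy frame (column names and values are illustrative only):

```python
import pandas as pd

# Toy frame (hypothetical)
df = pd.DataFrame({"Headache": [1, 0]})

# insert() changes df itself and returns None
result = df.insert(loc=0, column="VAERS_ID", value=[10, 20])

print(result)            # None
print(list(df.columns))  # VAERS_ID is now the first column
```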


In [450]:
# Merging all the rows with the same ID after using dummy encoding 
commom_symp = commom_symp.groupby(['VAERS_ID']).sum().reset_index()
commom_symp

Unnamed: 0,VAERS_ID,Arthralgia,Asthenia,COVID-19,Chills,Cough,Death,Diarrhoea,Dizziness,Dyspnoea,...,Pain,Pain in extremity,Paraesthesia,Pruritus,Pyrexia,Rash,SARS-CoV-2 test negative,SARS-CoV-2 test positive,Urticaria,Vomiting
0,916601,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,916602,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,916603,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,916604,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,916607,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32671,1134697,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
32672,1134819,0,0,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
32673,1135949,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0
32674,1136535,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
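One detail of the merge step is worth checking: `sum()` turns the indicators into counts, so a symptom coded twice for the same report would yield 2 instead of 1. The sketch below, on made-up data, contrasts `sum()` with `max()`, which keeps the columns strictly binary. Whether such duplicates actually occur in the VAERS files is not verified here.

```python
import pandas as pd

# Toy long-format data (hypothetical): report 1 lists "Headache" twice
long = pd.DataFrame({
    "VAERS_ID": [1, 1, 1, 2, 3],
    "SYMPTOM": ["Headache", "Chills", "Headache", "Chills", "Pyrexia"],
})

# One-hot encode each row, then put the ID back as the first column
dummies = long["SYMPTOM"].str.get_dummies()
dummies.insert(loc=0, column="VAERS_ID", value=long["VAERS_ID"])

# sum() counts repeated symptoms; max() keeps a 0/1 indicator
summed = dummies.groupby("VAERS_ID").sum().reset_index()
binary = dummies.groupby("VAERS_ID").max().reset_index()

print(summed)
print(binary)
```

If the downstream clustering expects 0/1 features, `max()` is the safer aggregation; if symptom counts are meaningful, `sum()` can be kept.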
