## 🧹 DRAFT

Before the analysis, the dataset was cleaned so that the results rest on complete, consistent records. The steps below remove the rows and columns that would otherwise introduce gaps or inconsistencies.

### Key Cleaning Steps:

1. **Handling Missing Values**:
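   - We dropped rows with missing values in the critical columns `STATE`, `AGE_YRS`, and `SEX` (and the `VAERS_ID` key), so every remaining row carries the fields the analysis depends on. Missing values in the free-text columns `PRIOR_VAX` and `ALLERGIES` were filled with placeholder strings instead.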

2. **Removing Invalid Entries** 🏛️:
   - In the `STATE` column, we kept only rows whose value is one of the 50 U.S. state abbreviations. Rows with any other code (the list does not include DC or territory codes) or with a missing state were removed.

3. **Filtering Out Unknown Values** 🧑‍🤝‍🧑:
   - Some entries are coded `U` (unknown) in the `SEX` column. To avoid potential biases, these rows were removed, leaving only records marked `M` or `F`.

4. **Dropping Irrelevant Columns** ✂️:
   - Columns not needed for this analysis (e.g., form metadata such as `FORM_VERS` and `TODAYS_DATE`, or administrative fields such as `V_ADMINBY` and `V_FUNDBY`) were dropped, reducing the dataset from 35 to 17 columns.
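
The four steps above can be sketched on a toy table. This is a minimal illustration, not the notebook's actual pipeline: the rows are made up, and `toy`, `valid_states_demo`, and `cleaned_demo` are hypothetical names; the real whitelist holds all 50 state codes.

```python
import pandas as pd

# Toy records standing in for VAERSDATA.csv (values are invented for illustration).
toy = pd.DataFrame({
    "VAERS_ID": [1, 2, 3, 4, 5, 6],
    "STATE": ["NJ", None, "XX", "AZ", "WV", "AZ"],
    "AGE_YRS": [56.0, 35.0, 41.0, None, 60.0, 42.0],
    "SEX": ["F", "M", "F", "M", "U", "M"],
    "FORM_VERS": [2, 2, 2, 2, 2, 2],
})

# Abbreviated whitelist for the demo only.
valid_states_demo = {"NJ", "AZ", "WV"}

cleaned_demo = (
    toy.drop(columns=["FORM_VERS"])                         # drop irrelevant columns
       .dropna(subset=["STATE", "AGE_YRS", "SEX"])          # handle missing values
       .loc[lambda d: d["STATE"].isin(valid_states_demo)]   # remove invalid states
       .loc[lambda d: d["SEX"] != "U"]                      # filter out unknown sex
)
print(cleaned_demo)
```

Only the rows with `VAERS_ID` 1 and 6 survive all four steps in this toy example.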


In [2]:
#Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline  
import nltk
import re

In [3]:
import pandas as pd

# Load the VAERS Data (Demographics and Event Information)
vaers_data = pd.read_csv('VAERSDATA.csv', low_memory=False)

In [4]:
vaers_data.head()

Unnamed: 0,VAERS_ID,RECVDATE,STATE,AGE_YRS,CAGE_YR,CAGE_MO,SEX,RPT_DATE,SYMPTOM_TEXT,DIED,...,CUR_ILL,HISTORY,PRIOR_VAX,SPLTTYPE,FORM_VERS,TODAYS_DATE,BIRTH_DEFECT,OFC_VISIT,ER_ED_VISIT,ALLERGIES
0,902418,12/15/2020,NJ,56.0,56.0,,F,,Patient experienced mild numbness traveling fr...,,...,none,none,,,2,12/15/2020,,,,none
1,902440,12/15/2020,AZ,35.0,35.0,,F,,C/O Headache,,...,,,,,2,12/15/2020,,,,
2,902446,12/15/2020,WV,55.0,55.0,,F,,"felt warm, hot and face and ears were red and ...",,...,none,"Hypertension, sleep apnea, hypothyroidism",,,2,12/15/2020,,,,"Contrast Dye IV contrast, shellfish, strawberry"
3,902464,12/15/2020,LA,42.0,42.0,,M,,within 15 minutes progressive light-headedness...,,...,none,none,,,2,12/15/2020,,,Y,none
4,902465,12/15/2020,AR,60.0,60.0,,F,,Pt felt wave come over body @ 1218 starting in...,,...,"Bronchitis, finished prednisone on 12-13-20","hypertension, fibromyalgia",,,2,12/15/2020,,,,Biaxin


In [5]:
vaers_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1012894 entries, 0 to 1012893
Data columns (total 35 columns):
 #   Column        Non-Null Count    Dtype  
---  ------        --------------    -----  
 0   VAERS_ID      1012894 non-null  int64  
 1   RECVDATE      1012894 non-null  object 
 2   STATE         842425 non-null   object 
 3   AGE_YRS       909608 non-null   float64
 4   CAGE_YR       809314 non-null   float64
 5   CAGE_MO       5374 non-null     float64
 6   SEX           1012894 non-null  object 
 7   RPT_DATE      1130 non-null     object 
 8   SYMPTOM_TEXT  1011423 non-null  object 
 9   DIED          18951 non-null    object 
 10  DATEDIED      16828 non-null    object 
 11  L_THREAT      15197 non-null    object 
 12  ER_VISIT      144 non-null      object 
 13  HOSPITAL      90081 non-null    object 
 14  HOSPDAYS      53040 non-null    float64
 15  X_STAY        505 non-null      object 
 16  DISABLE       18274 non-null    object 
 17  RECOVD        882224 non-nu

In [6]:
vaers_data.columns

Index(['VAERS_ID', 'RECVDATE', 'STATE', 'AGE_YRS', 'CAGE_YR', 'CAGE_MO', 'SEX',
       'RPT_DATE', 'SYMPTOM_TEXT', 'DIED', 'DATEDIED', 'L_THREAT', 'ER_VISIT',
       'HOSPITAL', 'HOSPDAYS', 'X_STAY', 'DISABLE', 'RECOVD', 'VAX_DATE',
       'ONSET_DATE', 'NUMDAYS', 'LAB_DATA', 'V_ADMINBY', 'V_FUNDBY',
       'OTHER_MEDS', 'CUR_ILL', 'HISTORY', 'PRIOR_VAX', 'SPLTTYPE',
       'FORM_VERS', 'TODAYS_DATE', 'BIRTH_DEFECT', 'OFC_VISIT', 'ER_ED_VISIT',
       'ALLERGIES'],
      dtype='object')

In [7]:
# List of columns to drop based on the analysis
columns_to_drop = [
    'CAGE_YR', 'CAGE_MO', 'RPT_DATE', 'HOSPDAYS', 'X_STAY', 'NUMDAYS', 
    'LAB_DATA', 'V_FUNDBY', 'OTHER_MEDS', 'SPLTTYPE', 'FORM_VERS', 
    'TODAYS_DATE', 'OFC_VISIT', 'ER_ED_VISIT', 'V_ADMINBY','HISTORY','ER_VISIT','CUR_ILL'
]

# Drop the unnecessary columns from the DataFrame
vaers_data_cleaned = vaers_data.drop(columns=columns_to_drop)

# Display the shape of the cleaned DataFrame to verify the result
print("Shape of VAERS DataFrame after dropping irrelevant columns:", vaers_data_cleaned.shape)

Shape of VAERS DataFrame after dropping irrelevant columns: (1012894, 17)


In [8]:
vaers_data_cleaned.head()

Unnamed: 0,VAERS_ID,RECVDATE,STATE,AGE_YRS,SEX,SYMPTOM_TEXT,DIED,DATEDIED,L_THREAT,HOSPITAL,DISABLE,RECOVD,VAX_DATE,ONSET_DATE,PRIOR_VAX,BIRTH_DEFECT,ALLERGIES
0,902418,12/15/2020,NJ,56.0,F,Patient experienced mild numbness traveling fr...,,,,,,Y,12/15/2020,12/15/2020,,,none
1,902440,12/15/2020,AZ,35.0,F,C/O Headache,,,,,,Y,12/15/2020,12/15/2020,,,
2,902446,12/15/2020,WV,55.0,F,"felt warm, hot and face and ears were red and ...",,,,,,Y,12/15/2020,12/15/2020,,,"Contrast Dye IV contrast, shellfish, strawberry"
3,902464,12/15/2020,LA,42.0,M,within 15 minutes progressive light-headedness...,,,,,,Y,12/15/2020,12/15/2020,,,none
4,902465,12/15/2020,AR,60.0,F,Pt felt wave come over body @ 1218 starting in...,,,,,,N,12/15/2020,12/15/2020,,,Biaxin


In [9]:
# Check for null values in the entire DataFrame
null_values = vaers_data_cleaned.isnull().sum()

# Display the count of null values for each column
print("Count of null values in each column:")
print(null_values)


Count of null values in each column:
VAERS_ID              0
RECVDATE              0
STATE            170469
AGE_YRS          103286
SEX                   0
SYMPTOM_TEXT       1471
DIED             993943
DATEDIED         996066
L_THREAT         997697
HOSPITAL         922813
DISABLE          994620
RECOVD           130670
VAX_DATE          73924
ONSET_DATE        97413
PRIOR_VAX        965799
BIRTH_DEFECT    1012281
ALLERGIES        640608
dtype: int64
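
Raw null counts are easier to compare as percentages; for instance, 170,469 missing `STATE` values out of 1,012,894 rows is roughly 16.8%. A small sketch of that calculation on a made-up frame (`demo` and `missing_share` are illustrative names):

```python
import pandas as pd

# Invented rows; only the pattern of missing values matters here.
demo = pd.DataFrame({
    "STATE": ["NJ", None, "AZ", None],
    "AGE_YRS": [56.0, 35.0, None, 60.0],
    "SEX": ["F", "F", "M", "M"],
})

# isna().mean() gives the fraction of missing values per column.
missing_share = (
    demo.isna().mean().mul(100).round(1).sort_values(ascending=False)
)
print(missing_share)
```

Sorting in descending order puts the sparsest columns first, which can help when deciding which columns to drop and which to fill.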


In [10]:
# Drop rows with missing values in the critical columns: VAERS_ID, STATE, AGE_YRS, and SEX
vaers_data_cleaned = vaers_data_cleaned.dropna(subset=['VAERS_ID', 'STATE', 'AGE_YRS', 'SEX'])

# Display the shape of the cleaned DataFrame to verify the result
print("Shape of VAERS DataFrame after dropping rows with nulls in critical columns:", vaers_data_cleaned.shape)

# Check if there are any remaining null values in those critical columns
print("\nRemaining missing values in critical columns:")
print(vaers_data_cleaned[['VAERS_ID', 'STATE', 'AGE_YRS', 'SEX']].isnull().sum())

Shape of VAERS DataFrame after dropping rows with nulls in critical columns: (787806, 17)

Remaining missing values in critical columns:
VAERS_ID    0
STATE       0
AGE_YRS     0
SEX         0
dtype: int64


In [11]:
# List of valid U.S. state abbreviations (50 states only)
valid_states = [
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS',
    'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY',
    'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 
    'WI', 'WY'
]

# Filter rows where the STATE column contains valid state abbreviations
vaers_data_cleaned = vaers_data_cleaned[vaers_data_cleaned['STATE'].isin(valid_states)]

In [12]:
# Remove rows where SEX is 'U' (Unknown)
vaers_data_cleaned = vaers_data_cleaned[vaers_data_cleaned['SEX'] != 'U']

# Display the shape of the DataFrame after removing 'Unknown' sex entries
print("Shape of VAERS DataFrame after removing 'Unknown' sex entries:", vaers_data_cleaned.shape)

Shape of VAERS DataFrame after removing 'Unknown' sex entries: (772636, 17)
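
Because the state filter prints no shape of its own, the drop from 787,806 to 772,636 rows (15,170 in total) cannot be split between the state filter and the sex filter from the output alone. One possible way to report each step separately is a small helper; `apply_filter` is a hypothetical function, and the rows below are invented.

```python
import pandas as pd

def apply_filter(df, mask, label):
    """Keep rows where mask is True and print how many rows the step removed."""
    kept = df[mask]
    print(f"{label}: removed {len(df) - len(kept)} rows, {len(kept)} remain")
    return kept

demo = pd.DataFrame({
    "STATE": ["NJ", "PR", "AZ", "XB", "WV"],
    "SEX": ["F", "M", "U", "F", "M"],
})
valid_states_demo = ["NJ", "AZ", "WV"]  # abbreviated whitelist for the demo

demo = apply_filter(demo, demo["STATE"].isin(valid_states_demo), "valid STATE")
demo = apply_filter(demo, demo["SEX"] != "U", "known SEX")
```

Each mask is built from the frame as it stands at that point, so the second call only sees rows that passed the first filter.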


In [13]:
# Check the value counts for the SEX column to confirm that no 'U' (unknown) values remain
sex_value_counts = vaers_data_cleaned['SEX'].value_counts(dropna=False)
print("SEX column value counts:")
print(sex_value_counts)


SEX column value counts:
SEX
F    522793
M    249843
Name: count, dtype: int64


In [14]:
# Convert the DIED column to binary: 1 for 'Y', 0 for empty or null
vaers_data_cleaned['DIED'] = vaers_data_cleaned['DIED'].apply(lambda x: 1 if x == 'Y' else 0)

# Check the result of the transformation
print("Value counts for the DIED column after transformation:")
print(vaers_data_cleaned['DIED'].value_counts())


Value counts for the DIED column after transformation:
DIED
0    760514
1     12122
Name: count, dtype: int64
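
The same `'Y'` to 1, everything else to 0 mapping can also be written without a row-wise `lambda`. The sketch below applies a vectorized comparison to several flag columns at once; the frame and the `flag_columns` list are illustrative, not taken from the dataset.

```python
import pandas as pd

# Invented flag values: 'Y' or missing, as in the VAERS outcome columns.
demo = pd.DataFrame({
    "DIED": ["Y", None, None],
    "HOSPITAL": [None, "Y", None],
    "DISABLE": [None, None, None],
})

flag_columns = ["DIED", "HOSPITAL", "DISABLE"]

# eq("Y") is False for missing values, so they become 0 after the cast.
demo[flag_columns] = demo[flag_columns].eq("Y").astype(int)
print(demo)
```

On a frame with hundreds of thousands of rows this form is usually faster than `apply`, and it keeps the list of flag columns in one place.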


In [15]:
# Convert the HOSPITAL column to binary: 1 for 'Y', 0 for empty or null
vaers_data_cleaned['HOSPITAL'] = vaers_data_cleaned['HOSPITAL'].apply(lambda x: 1 if x == 'Y' else 0)

# Convert the DISABLE column to binary: 1 for 'Y', 0 for empty or null
vaers_data_cleaned['DISABLE'] = vaers_data_cleaned['DISABLE'].apply(lambda x: 1 if x == 'Y' else 0)

# Convert the BIRTH_DEFECT column to binary: 1 for 'Y', 0 for empty or null
vaers_data_cleaned['BIRTH_DEFECT'] = vaers_data_cleaned['BIRTH_DEFECT'].apply(lambda x: 1 if x == 'Y' else 0)

In [16]:
vaers_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 772636 entries, 0 to 1012886
Data columns (total 17 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   VAERS_ID      772636 non-null  int64  
 1   RECVDATE      772636 non-null  object 
 2   STATE         772636 non-null  object 
 3   AGE_YRS       772636 non-null  float64
 4   SEX           772636 non-null  object 
 5   SYMPTOM_TEXT  771639 non-null  object 
 6   DIED          772636 non-null  int64  
 7   DATEDIED      11649 non-null   object 
 8   L_THREAT      13776 non-null   object 
 9   HOSPITAL      772636 non-null  int64  
 10  DISABLE       772636 non-null  int64  
 11  RECOVD        695682 non-null  object 
 12  VAX_DATE      760654 non-null  object 
 13  ONSET_DATE    744169 non-null  object 
 14  PRIOR_VAX     44519 non-null   object 
 15  BIRTH_DEFECT  772636 non-null  int64  
 16  ALLERGIES     356508 non-null  object 
dtypes: float64(1), int64(5), object(11)
memory usage: 10

In [17]:
vaers_data_cleaned = vaers_data_cleaned.dropna(subset=['STATE'])


In [18]:
vaers_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 772636 entries, 0 to 1012886
Data columns (total 17 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   VAERS_ID      772636 non-null  int64  
 1   RECVDATE      772636 non-null  object 
 2   STATE         772636 non-null  object 
 3   AGE_YRS       772636 non-null  float64
 4   SEX           772636 non-null  object 
 5   SYMPTOM_TEXT  771639 non-null  object 
 6   DIED          772636 non-null  int64  
 7   DATEDIED      11649 non-null   object 
 8   L_THREAT      13776 non-null   object 
 9   HOSPITAL      772636 non-null  int64  
 10  DISABLE       772636 non-null  int64  
 11  RECOVD        695682 non-null  object 
 12  VAX_DATE      760654 non-null  object 
 13  ONSET_DATE    744169 non-null  object 
 14  PRIOR_VAX     44519 non-null   object 
 15  BIRTH_DEFECT  772636 non-null  int64  
 16  ALLERGIES     356508 non-null  object 
dtypes: float64(1), int64(5), object(11)
memory usage: 10

In [19]:
# Example: Drop rows with missing data in critical columns in VAERS dataset
critical_columns_vaers = ['AGE_YRS', 'SEX', 'STATE', 'SYMPTOM_TEXT', 'DIED', 'VAX_DATE']
vaers_data_cleaned = vaers_data_cleaned.dropna(subset=critical_columns_vaers)

# Similarly, do this for the other datasets, like symptoms and vaccination data

print(vaers_data_cleaned[['VAERS_ID', 'STATE', 'AGE_YRS', 'SEX']].isnull().sum())


VAERS_ID    0
STATE       0
AGE_YRS     0
SEX         0
dtype: int64


In [20]:
# Check for missing values in SYMPTOM_TEXT
print(vaers_data_cleaned['SYMPTOM_TEXT'].isnull().sum())


0


In [21]:
# Convert VAX_DATE and ONSET_DATE to datetime
vaers_data_cleaned['VAX_DATE'] = pd.to_datetime(vaers_data_cleaned['VAX_DATE'], errors='coerce')
vaers_data_cleaned['ONSET_DATE'] = pd.to_datetime(vaers_data_cleaned['ONSET_DATE'], errors='coerce')

# Verify the conversion
print(vaers_data_cleaned[['VAX_DATE', 'ONSET_DATE']].dtypes)


VAX_DATE      datetime64[ns]
ONSET_DATE    datetime64[ns]
dtype: object
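
With `errors='coerce'`, any value that cannot be parsed silently becomes `NaT`, so it can be worth counting how many values were lost that way. The sketch below assumes the dates follow the `MM/DD/YYYY` pattern seen in the preview rows; `raw_dates` is a made-up series.

```python
import pandas as pd

raw_dates = pd.Series(["12/15/2020", "01/07/2021", "not a date", None])

# An explicit format avoids per-value format guessing (assumed MM/DD/YYYY).
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

# Values that were present but could not be parsed, as opposed to already missing.
newly_missing = parsed.isna() & raw_dates.notna()
print("Unparseable entries:", int(newly_missing.sum()))
print(parsed)
```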


In [22]:
vaers_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 759664 entries, 0 to 1012886
Data columns (total 17 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   VAERS_ID      759664 non-null  int64         
 1   RECVDATE      759664 non-null  object        
 2   STATE         759664 non-null  object        
 3   AGE_YRS       759664 non-null  float64       
 4   SEX           759664 non-null  object        
 5   SYMPTOM_TEXT  759664 non-null  object        
 6   DIED          759664 non-null  int64         
 7   DATEDIED      11576 non-null   object        
 8   L_THREAT      13674 non-null   object        
 9   HOSPITAL      759664 non-null  int64         
 10  DISABLE       759664 non-null  int64         
 11  RECOVD        683668 non-null  object        
 12  VAX_DATE      759664 non-null  datetime64[ns]
 13  ONSET_DATE    736100 non-null  datetime64[ns]
 14  PRIOR_VAX     44444 non-null   object        
 15  BIRTH_DEFECT  759664 

In [23]:
vaers_data_cleaned.head()

Unnamed: 0,VAERS_ID,RECVDATE,STATE,AGE_YRS,SEX,SYMPTOM_TEXT,DIED,DATEDIED,L_THREAT,HOSPITAL,DISABLE,RECOVD,VAX_DATE,ONSET_DATE,PRIOR_VAX,BIRTH_DEFECT,ALLERGIES
0,902418,12/15/2020,NJ,56.0,F,Patient experienced mild numbness traveling fr...,0,,,0,0,Y,2020-12-15,2020-12-15,,0,none
1,902440,12/15/2020,AZ,35.0,F,C/O Headache,0,,,0,0,Y,2020-12-15,2020-12-15,,0,
2,902446,12/15/2020,WV,55.0,F,"felt warm, hot and face and ears were red and ...",0,,,0,0,Y,2020-12-15,2020-12-15,,0,"Contrast Dye IV contrast, shellfish, strawberry"
3,902464,12/15/2020,LA,42.0,M,within 15 minutes progressive light-headedness...,0,,,0,0,Y,2020-12-15,2020-12-15,,0,none
4,902465,12/15/2020,AR,60.0,F,Pt felt wave come over body @ 1218 starting in...,0,,,0,0,N,2020-12-15,2020-12-15,,0,Biaxin


In [24]:
# Convert SEX column to binary: 1 for Male ('M'), 0 for Female ('F')
vaers_data_cleaned['SEX_BINARY'] = vaers_data_cleaned['SEX'].apply(lambda x: 1 if x == 'M' else 0)

# Verify the result
print(vaers_data_cleaned[['SEX', 'SEX_BINARY']].head())

  SEX  SEX_BINARY
0   F           0
1   F           0
2   F           0
3   M           1
4   F           0


In [25]:
# Check unique values in SEX to confirm the actual values
print(vaers_data_cleaned['SEX'].unique())

# Convert SEX to binary: 1 for Male ('M'), 0 for Female ('F')
# Any value other than 'M' or 'F' is mapped to None (stored as a missing value)
vaers_data_cleaned['SEX_BINARY'] = vaers_data_cleaned['SEX'].apply(lambda x: 1 if x == 'M' else (0 if x == 'F' else None))

# Verify the result
print(vaers_data_cleaned[['SEX', 'SEX_BINARY']].head())

['F' 'M']
  SEX  SEX_BINARY
0   F           0
1   F           0
2   F           0
3   M           1
4   F           0


In [26]:
# Confirm the distribution of 'SEX' values (should only be 'M' and 'F')
print(vaers_data_cleaned['SEX'].value_counts())

# Convert SEX to binary: 1 for Male ('M'), 0 for Female ('F')
vaers_data_cleaned['SEX_BINARY'] = vaers_data_cleaned['SEX'].apply(lambda x: 1 if x == 'M' else 0)

# Verify the result
print(vaers_data_cleaned[['SEX', 'SEX_BINARY']].head())

# Check value counts for the new SEX_BINARY column to confirm the transformation
print(vaers_data_cleaned['SEX_BINARY'].value_counts())


SEX
F    514900
M    244764
Name: count, dtype: int64
  SEX  SEX_BINARY
0   F           0
1   F           0
2   F           0
3   M           1
4   F           0
SEX_BINARY
0    514900
1    244764
Name: count, dtype: int64
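
A dictionary lookup is a compact alternative to the `lambda` above, and it leaves unexpected codes as missing instead of silently mapping them to 0. A minimal sketch on an invented series (the `'U'` entry is included only to show that behavior):

```python
import pandas as pd

sex = pd.Series(["F", "M", "U", "F"])

# Codes absent from the dictionary become missing; Int64 keeps integers alongside <NA>.
sex_binary = sex.map({"M": 1, "F": 0}).astype("Int64")
print(sex_binary)
```

Since the `U` rows were already removed earlier in the notebook, both approaches should give the same result on the cleaned data.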


In [27]:
# Replace the original 'SEX' column with the binary version
vaers_data_cleaned['SEX'] = vaers_data_cleaned['SEX_BINARY']

# Drop the temporary 'SEX_BINARY' column since it's now stored in 'SEX'
vaers_data_cleaned = vaers_data_cleaned.drop(columns=['SEX_BINARY'])

# Verify the result
print(vaers_data_cleaned.head())


   VAERS_ID    RECVDATE STATE  AGE_YRS  SEX  \
0    902418  12/15/2020    NJ     56.0    0   
1    902440  12/15/2020    AZ     35.0    0   
2    902446  12/15/2020    WV     55.0    0   
3    902464  12/15/2020    LA     42.0    1   
4    902465  12/15/2020    AR     60.0    0   

                                        SYMPTOM_TEXT  DIED DATEDIED L_THREAT  \
0  Patient experienced mild numbness traveling fr...     0      NaN      NaN   
1                                       C/O Headache     0      NaN      NaN   
2  felt warm, hot and face and ears were red and ...     0      NaN      NaN   
3  within 15 minutes progressive light-headedness...     0      NaN      NaN   
4  Pt felt wave come over body @ 1218 starting in...     0      NaN      NaN   

   HOSPITAL  DISABLE RECOVD   VAX_DATE ONSET_DATE PRIOR_VAX  BIRTH_DEFECT  \
0         0        0      Y 2020-12-15 2020-12-15       NaN             0   
1         0        0      Y 2020-12-15 2020-12-15       NaN             0   
2    

In [28]:
# Filter age to remove outliers (between 0.5 years and 100 years)
vaers_data_cleaned = vaers_data_cleaned[(vaers_data_cleaned['AGE_YRS'] >= 0.5) & (vaers_data_cleaned['AGE_YRS'] <= 100)]

# Verify the new age range
print("Min and Max Age after filtering:", vaers_data_cleaned['AGE_YRS'].min(), "-", vaers_data_cleaned['AGE_YRS'].max())

Min and Max Age after filtering: 0.5 - 100.0
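
This filter uses fixed bounds, not a statistical outlier test, and it removed 262 rows here (759,664 down to 759,402). The same range check can be expressed with `Series.between`, which also makes it easy to count the excluded rows first; the ages below are invented.

```python
import pandas as pd

ages = pd.DataFrame({"AGE_YRS": [0.08, 0.5, 42.0, 100.0, 119.0]})

# between() is inclusive on both ends by default.
in_range = ages["AGE_YRS"].between(0.5, 100)
print("Rows outside the range:", int((~in_range).sum()))

ages_filtered = ages[in_range]
print(ages_filtered)
```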


In [29]:
# Convert RECOVD to binary: 1 for 'Y', 0 for anything else (including NaN)
vaers_data_cleaned['RECOVD'] = vaers_data_cleaned['RECOVD'].apply(lambda x: 1 if x == 'Y' else 0)

# Convert L_THREAT to binary: 1 for 'Y', 0 for anything else (including NaN)
vaers_data_cleaned['L_THREAT'] = vaers_data_cleaned['L_THREAT'].apply(lambda x: 1 if x == 'Y' else 0)

# Verify the result
print("Value counts for RECOVD after binary transformation:")
print(vaers_data_cleaned['RECOVD'].value_counts())

print("\nValue counts for L_THREAT after binary transformation:")
print(vaers_data_cleaned['L_THREAT'].value_counts())

Value counts for RECOVD after binary transformation:
RECOVD
0    500340
1    259062
Name: count, dtype: int64

Value counts for L_THREAT after binary transformation:
L_THREAT
0    745737
1     13665
Name: count, dtype: int64


In [30]:
vaers_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 759402 entries, 0 to 1012886
Data columns (total 17 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   VAERS_ID      759402 non-null  int64         
 1   RECVDATE      759402 non-null  object        
 2   STATE         759402 non-null  object        
 3   AGE_YRS       759402 non-null  float64       
 4   SEX           759402 non-null  int64         
 5   SYMPTOM_TEXT  759402 non-null  object        
 6   DIED          759402 non-null  int64         
 7   DATEDIED      11531 non-null   object        
 8   L_THREAT      759402 non-null  int64         
 9   HOSPITAL      759402 non-null  int64         
 10  DISABLE       759402 non-null  int64         
 11  RECOVD        759402 non-null  int64         
 12  VAX_DATE      759402 non-null  datetime64[ns]
 13  ONSET_DATE    735849 non-null  datetime64[ns]
 14  PRIOR_VAX     44439 non-null   object        
 15  BIRTH_DEFECT  759402 

In [31]:
# Convert the RECVDATE column to datetime format
vaers_data_cleaned['RECVDATE'] = pd.to_datetime(vaers_data_cleaned['RECVDATE'], errors='coerce')

# Verify the conversion
print(vaers_data_cleaned['RECVDATE'].dtypes)
print(vaers_data_cleaned[['RECVDATE']].head())

datetime64[ns]
    RECVDATE
0 2020-12-15
1 2020-12-15
2 2020-12-15
3 2020-12-15
4 2020-12-15


In [32]:
# Check for missing values in each column
print(vaers_data_cleaned.isnull().sum())


VAERS_ID             0
RECVDATE             0
STATE                0
AGE_YRS              0
SEX                  0
SYMPTOM_TEXT         0
DIED                 0
DATEDIED        747871
L_THREAT             0
HOSPITAL             0
DISABLE              0
RECOVD               0
VAX_DATE             0
ONSET_DATE       23553
PRIOR_VAX       714963
BIRTH_DEFECT         0
ALLERGIES       403946
dtype: int64


In [33]:
# Drop rows where SYMPTOM_TEXT is missing
vaers_data_cleaned = vaers_data_cleaned.dropna(subset=['SYMPTOM_TEXT'])

# Verify the result
print(vaers_data_cleaned.isnull().sum())

VAERS_ID             0
RECVDATE             0
STATE                0
AGE_YRS              0
SEX                  0
SYMPTOM_TEXT         0
DIED                 0
DATEDIED        747871
L_THREAT             0
HOSPITAL             0
DISABLE              0
RECOVD               0
VAX_DATE             0
ONSET_DATE       23553
PRIOR_VAX       714963
BIRTH_DEFECT         0
ALLERGIES       403946
dtype: int64


In [34]:
# Fill missing PRIOR_VAX with 'Unknown'
vaers_data_cleaned['PRIOR_VAX'].fillna('Unknown', inplace=True)

# Fill missing ALLERGIES with 'None'
vaers_data_cleaned['ALLERGIES'].fillna('None', inplace=True)

# Optionally drop rows with missing VAX_DATE or ONSET_DATE if time-based analysis is important
vaers_data_cleaned = vaers_data_cleaned.dropna(subset=['VAX_DATE', 'ONSET_DATE'])

# Verify the remaining missing values
print(vaers_data_cleaned.isnull().sum())

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  vaers_data_cleaned['PRIOR_VAX'].fillna('Unknown', inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  vaers_data_cleaned['ALLERGIES'].fillna('None', inplace=True)


VAERS_ID             0
RECVDATE             0
STATE                0
AGE_YRS              0
SEX                  0
SYMPTOM_TEXT         0
DIED                 0
DATEDIED        724374
L_THREAT             0
HOSPITAL             0
DISABLE              0
RECOVD               0
VAX_DATE             0
ONSET_DATE           0
PRIOR_VAX            0
BIRTH_DEFECT         0
ALLERGIES            0
dtype: int64


In [35]:
# Fix for PRIOR_VAX: Directly assign the modified column back to the DataFrame
vaers_data_cleaned['PRIOR_VAX'] = vaers_data_cleaned['PRIOR_VAX'].fillna('Unknown')

# Fix for ALLERGIES: Directly assign the modified column back to the DataFrame
vaers_data_cleaned['ALLERGIES'] = vaers_data_cleaned['ALLERGIES'].fillna('None')

# Optionally drop rows with missing VAX_DATE or ONSET_DATE if time-based analysis is important
vaers_data_cleaned = vaers_data_cleaned.dropna(subset=['VAX_DATE', 'ONSET_DATE'])

# Verify the remaining missing values
print(vaers_data_cleaned.isnull().sum())

VAERS_ID             0
RECVDATE             0
STATE                0
AGE_YRS              0
SEX                  0
SYMPTOM_TEXT         0
DIED                 0
DATEDIED        724374
L_THREAT             0
HOSPITAL             0
DISABLE              0
RECOVD               0
VAX_DATE             0
ONSET_DATE           0
PRIOR_VAX            0
BIRTH_DEFECT         0
ALLERGIES            0
dtype: int64
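
Both fills can also be done in one call by passing a column-to-value dictionary to `DataFrame.fillna`, which avoids the chained `inplace` pattern altogether. A small sketch with invented values:

```python
import pandas as pd

demo = pd.DataFrame({
    "PRIOR_VAX": [None, "flu shot 2019", None],
    "ALLERGIES": ["none", None, "Biaxin"],
})

# One placeholder per column; columns not listed are left untouched.
demo = demo.fillna({"PRIOR_VAX": "Unknown", "ALLERGIES": "None"})
print(demo)
```

Note that a reported `none` and the filled `None` remain different strings, so later grouping on `ALLERGIES` may need to normalize case.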


In [36]:
# Feature engineering: Ensure consistency between DIED and DATEDIED
def correct_datedied(row):
    if row['DIED'] == 1 and row['DATEDIED'] == 'Not applicable':
        return 'Unknown death date'  # If they died but DATEDIED is missing
    elif row['DIED'] == 0 and row['DATEDIED'] != 'Not applicable':
        return 'Not applicable'  # If they didn't die but DATEDIED has a date
    else:
        return row['DATEDIED']

# Apply the function to ensure consistency between DIED and DATEDIED
vaers_data_cleaned['DATEDIED'] = vaers_data_cleaned.apply(correct_datedied, axis=1)

# Verify the result
print(vaers_data_cleaned[['DIED', 'DATEDIED']].head())

   DIED        DATEDIED
0     0  Not applicable
1     0  Not applicable
2     0  Not applicable
3     0  Not applicable
4     0  Not applicable


In [37]:
# Feature engineering: Ensure consistency between DIED and DATEDIED
def correct_datedied(row):
    if row['DIED'] == 1 and row['DATEDIED'] == 'Not applicable':
        return 'Unknown death date'  # If they died but DATEDIED is missing or incorrect
    elif row['DIED'] == 1 and pd.isna(row['DATEDIED']):
        return 'Unknown death date'  # If DIED is 1 but DATEDIED is NaN
    elif row['DIED'] == 0:
        return 'Not applicable'  # If they didn't die, DATEDIED should be 'Not applicable'
    else:
        return row['DATEDIED']

# Apply the function to ensure consistency between DIED and DATEDIED
vaers_data_cleaned['DATEDIED'] = vaers_data_cleaned.apply(correct_datedied, axis=1)

# Verify the result
print(vaers_data_cleaned[['DIED', 'DATEDIED']].head())

   DIED        DATEDIED
0     0  Not applicable
1     0  Not applicable
2     0  Not applicable
3     0  Not applicable
4     0  Not applicable


In [38]:
# Check rows where DIED == 1 to see if DATEDIED is being handled correctly
print(vaers_data_cleaned[vaers_data_cleaned['DIED'] == 1][['DIED', 'DATEDIED']].head(10))


      DIED    DATEDIED
4259     1  12/25/2020
5244     1  12/28/2020
7421     1  12/29/2020
7910     1  12/29/2020
8656     1  12/20/2020
8670     1  12/27/2020
8725     1  12/26/2020
8828     1  12/29/2020
8909     1  12/30/2020
8930     1  12/23/2020
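
An alternative to mixing label strings and dates in one column is to keep `DATEDIED` as a datetime and record the labels separately. The sketch below is one possible arrangement, not the notebook's method: `DEATH_DATE_STATUS` is a hypothetical column, the rows are invented, and the `MM/DD/YYYY` format is assumed from the output above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "DIED": [0, 1, 1, 0],
    "DATEDIED": [None, "12/25/2020", None, "12/28/2020"],
})

death_date = pd.to_datetime(demo["DATEDIED"], format="%m/%d/%Y", errors="coerce")

# A date on a record with DIED == 0 is inconsistent, so it is blanked out.
demo["DATEDIED"] = death_date.where(demo["DIED"] == 1)

# The first matching condition wins, so DIED == 0 always maps to "Not applicable".
demo["DEATH_DATE_STATUS"] = np.select(
    [demo["DIED"] == 0, demo["DATEDIED"].isna()],
    ["Not applicable", "Unknown death date"],
    default="Known",
)
print(demo)
```

This keeps the column usable for date arithmetic while preserving the distinction between "did not die" and "died, date unknown".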


In [39]:
vaers_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 735849 entries, 0 to 1012886
Data columns (total 17 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   VAERS_ID      735849 non-null  int64         
 1   RECVDATE      735849 non-null  datetime64[ns]
 2   STATE         735849 non-null  object        
 3   AGE_YRS       735849 non-null  float64       
 4   SEX           735849 non-null  int64         
 5   SYMPTOM_TEXT  735849 non-null  object        
 6   DIED          735849 non-null  int64         
 7   DATEDIED      735849 non-null  object        
 8   L_THREAT      735849 non-null  int64         
 9   HOSPITAL      735849 non-null  int64         
 10  DISABLE       735849 non-null  int64         
 11  RECOVD        735849 non-null  int64         
 12  VAX_DATE      735849 non-null  datetime64[ns]
 13  ONSET_DATE    735849 non-null  datetime64[ns]
 14  PRIOR_VAX     735849 non-null  object        
 15  BIRTH_DEFECT  735849 

In [40]:
# 1. Check the distribution of values in `DATEDIED` after feature engineering
datedied_counts = vaers_data_cleaned['DATEDIED'].value_counts(dropna=False)
print("Value counts for DATEDIED after feature engineering:")
print(datedied_counts)

# 2. Convert only the valid dates in `DATEDIED` to datetime
vaers_data_cleaned['DATEDIED'] = pd.to_datetime(vaers_data_cleaned['DATEDIED'], errors='coerce')

# Verify the conversion and check the types
print("\nData types after converting DATEDIED to datetime:")
print(vaers_data_cleaned.dtypes)

Value counts for DATEDIED after feature engineering:
DATEDIED
Not applicable        723985
Unknown death date       389
04/01/2021                46
03/05/2021                46
02/01/2021                44
                       ...  
04/17/2022                 1
04/23/2022                 1
04/26/2023                 1
10/23/2023                 1
10/16/2023                 1
Name: count, Length: 1087, dtype: int64

Data types after converting DATEDIED to datetime:
VAERS_ID                 int64
RECVDATE        datetime64[ns]
STATE                   object
AGE_YRS                float64
SEX                      int64
SYMPTOM_TEXT            object
DIED                     int64
DATEDIED        datetime64[ns]
L_THREAT                 int64
HOSPITAL                 int64
DISABLE                  int64
RECOVD                   int64
VAX_DATE        datetime64[ns]
ONSET_DATE      datetime64[ns]
PRIOR_VAX               object
BIRTH_DEFECT             int64
ALLERGIES               object
dtype: object



In [41]:
# Replace "Unknown death date" with NaT
vaers_data_cleaned['DATEDIED'] = vaers_data_cleaned['DATEDIED'].replace("Unknown death date", pd.NaT)

# Convert the remaining valid date entries to datetime, ensuring non-date values remain as NaT
vaers_data_cleaned['DATEDIED'] = pd.to_datetime(vaers_data_cleaned['DATEDIED'], errors='coerce')

# Verify the final data types and counts of NaT
print("Value counts for DATEDIED after handling 'Unknown death date':")
print(vaers_data_cleaned['DATEDIED'].isnull().sum())

# Check the final data types
print("\nFinal data types:")
print(vaers_data_cleaned.dtypes)


Value counts for DATEDIED after handling 'Unknown death date':
724374

Final data types:
VAERS_ID                 int64
RECVDATE        datetime64[ns]
STATE                   object
AGE_YRS                float64
SEX                      int64
SYMPTOM_TEXT            object
DIED                     int64
DATEDIED        datetime64[ns]
L_THREAT                 int64
HOSPITAL                 int64
DISABLE                  int64
RECOVD                   int64
VAX_DATE        datetime64[ns]
ONSET_DATE      datetime64[ns]
PRIOR_VAX               object
BIRTH_DEFECT             int64
ALLERGIES               object
dtype: object


In [1]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download stopwords and tokenizer resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Function to clean symptom text
def clean_symptom_text(text):
    # 1. Convert to lowercase
    text = text.lower()
    
    # 2. Remove special characters and numbers
    text = re.sub(r'[^a-z\s]', '', text)
    
    # 3. Tokenize the text into words
    words = word_tokenize(text)
    
    # 4. Remove stopwords
    words = [word for word in words if word not in stop_words]
    
    # 5. Lemmatize words (get base form)
    words = [lemmatizer.lemmatize(word) for word in words]
    
    # Join the cleaned words back into a single string
    cleaned_text = ' '.join(words)
    
    return cleaned_text

# Apply the cleaning function to the SYMPTOM_TEXT column
vaers_data_cleaned['SYMPTOM_TEXT_CLEANED'] = vaers_data_cleaned['SYMPTOM_TEXT'].apply(clean_symptom_text)

# Check the cleaned text
print(vaers_data_cleaned[['SYMPTOM_TEXT', 'SYMPTOM_TEXT_CLEANED']].head())


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sifre\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sifre\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sifre\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


NameError: name 'vaers_data_cleaned' is not defined

In [47]:
import nltk
nltk.download('punkt')         # Tokenizer data
nltk.download('stopwords')     # Stopwords data
nltk.download('wordnet')       # WordNet for lemmatization


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sifre\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sifre\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sifre\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [49]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

# Build the stop-word set and lemmatizer once, not once per row
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocessing function to clean and standardize text
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and numbers
    text = re.sub(r'[^\w\s]', '', text) # Remove punctuation
    text = re.sub(r'\d+', '', text) # Remove numbers
    
    # Tokenization
    tokens = nltk.word_tokenize(text)
    
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    
    # Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # Join tokens back into a string
    cleaned_text = ' '.join(tokens)
    
    return cleaned_text

# Apply the function to the symptom column
vaers_data_cleaned['SYMPTOM_TEXT_CLEANED'] = vaers_data_cleaned['SYMPTOM_TEXT'].apply(clean_text)

# View cleaned text
print(vaers_data_cleaned[['SYMPTOM_TEXT', 'SYMPTOM_TEXT_CLEANED']].head())


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - 'C:\\Users\\sifre/nltk_data'
    - 'c:\\Users\\sifre\\anaconda3\\envs\\VAC\\nltk_data'
    - 'c:\\Users\\sifre\\anaconda3\\envs\\VAC\\share\\nltk_data'
    - 'c:\\Users\\sifre\\anaconda3\\envs\\VAC\\lib\\nltk_data'
    - 'C:\\Users\\sifre\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:/Users/sifre/AppData/Roaming/nltk_data'
    - 'C:/Users/sifre/AppData/Roaming/nltk_data'
**********************************************************************
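
The error is raised because the tokenizer looks for the `punkt_tab` resource, which the earlier `nltk.download('punkt')` call does not provide; running `nltk.download('punkt_tab')`, as the message suggests, should resolve it. Until that resource is available, the cleaning logic can be sanity-checked with a dependency-free stand-in. The sketch below is only an approximation: `simple_clean` and `DEMO_STOP_WORDS` are hypothetical, the stop-word set is a tiny sample, there is no lemmatization, and non-letters are replaced with a space (not removed) so that tokens such as `C/O` do not fuse.

```python
import re

# Tiny illustrative stop-word set; NLTK's English list is much longer.
DEMO_STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "was", "were", "over"}

def simple_clean(text):
    """Lowercase, keep letters only, split on whitespace, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in DEMO_STOP_WORDS)

print(simple_clean("Pt felt wave come over body @ 1218"))
print(simple_clean("C/O Headache"))
```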


In [50]:
pip install polyfuzz


Note: you may need to restart the kernel to use updated packages.
Collecting polyfuzz
  Downloading polyfuzz-0.4.2-py2.py3-none-any.whl.metadata (14 kB)
Collecting rapidfuzz>=0.13.1 (from polyfuzz)
  Downloading rapidfuzz-3.10.0-cp39-cp39-win_amd64.whl.metadata (11 kB)
Downloading polyfuzz-0.4.2-py2.py3-none-any.whl (36 kB)
Downloading rapidfuzz-3.10.0-cp39-cp39-win_amd64.whl (1.6 MB)
Installing collected packages: rapidfuzz, polyfuzz
Successfully installed polyfuzz-0.4.2 rapidfuzz-3.10.0
