# HMS - Harmful Brain Activity Classification
## Classify seizures and other patterns of harmful brain activity in critically ill patients

## Dataset Description
The goal of this competition is to detect and classify seizures and other types of harmful brain activity in electroencephalography (EEG) data. Even experts find this to be a challenging task and often disagree about the correct labels.

This is a code competition. Only a few examples from the test set are available for download. When your submission is scored, the test folders will be replaced with versions containing the complete test set.

## Files

*train.csv* Metadata for the train set. The expert annotators reviewed 50-second-long EEG samples plus matched spectrograms covering a 10-minute window centered at the same time, and labeled the central 10 seconds. Many of these samples overlapped and have been consolidated. train.csv provides the metadata that allows you to extract the original subsets that the raters annotated.

* *eeg_id* - A unique identifier for the entire EEG recording.
* *eeg_sub_id* - An ID for the specific 50 second long subsample this row's labels apply to.
* *eeg_label_offset_seconds* - The time between the beginning of the consolidated EEG and this subsample.
* *spectrogram_id* - A unique identifier for the entire spectrogram.
* *spectrogram_sub_id* - An ID for the specific 10 minute subsample this row's labels apply to.
* *spectrogram_label_offset_seconds* - The time between the beginning of the consolidated spectrogram and this subsample.
* *label_id* - An ID for this set of labels.
* *patient_id* - An ID for the patient who donated the data.
* *expert_consensus* - The consensus annotator label. Provided for convenience only.
* *[seizure/lpd/gpd/lrda/grda/other]_vote* - The count of annotator votes for a given brain activity class. The full names of the activity classes are as follows: lpd: lateralized periodic discharges; gpd: generalized periodic discharges; lrda: lateralized rhythmic delta activity; and grda: generalized rhythmic delta activity. A detailed explanation of these patterns is <a href = 'https://www.acns.org/UserFiles/file/ACNSStandardizedCriticalCareEEGTerminology_rev2021.pdf'>available here.</a>

*test.csv* Metadata for the test set. As there are no overlapping samples in the test set, many columns in the train metadata don't apply.

* *eeg_id*
* *spectrogram_id*
* *patient_id*
 
## sample_submission.csv

* *eeg_id*
* *[seizure/lpd/gpd/lrda/grda/other]_vote* - The target columns. Your predictions must be probabilities. Note that the test samples had between 3 and 20 annotators.

* *train_eegs/* EEG data from one or more overlapping samples. Use the metadata in train.csv to select specific annotated subsets. The column names are the names of the individual electrode locations for EEG leads, with one exception. The EKG column is for an electrocardiogram lead that records data from the heart. All of the EEG data (for both train and test) was collected at a frequency of 200 samples per second.

* *test_eegs/* Exactly 50 seconds of EEG data.

* *train_spectrograms/* Spectrograms assembled from EEG data. Use the metadata in train.csv to select specific annotated subsets. The column names indicate the frequency in hertz and the recording regions of the EEG electrodes. The latter are abbreviated as LL = left lateral; RL = right lateral; LP = left parasagittal; RP = right parasagittal.

* *test_spectrograms/* Spectrograms assembled using exactly 10 minutes of EEG data.

* *example_figures/* Larger copies of the example case images used on the overview tab.
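
The way the train.csv offsets select an annotated EEG subsample can be sketched with a little index arithmetic. This is a minimal illustration, not the competition's reference code: it assumes each EEG file loads into a DataFrame with one row per sample at 200 samples per second, and the helper names `extract_eeg_window` and `central_label_span`, the channel names, and the synthetic recording are all hypothetical.

```python
import numpy as np
import pandas as pd

SAMPLE_RATE_HZ = 200   # stated in the dataset description
WINDOW_SECONDS = 50    # length of each annotated EEG subsample
LABEL_SECONDS = 10     # the central span that the raters labeled


def extract_eeg_window(eeg: pd.DataFrame, offset_seconds: float) -> pd.DataFrame:
    """Return the 50-second subsample that starts at the given offset (hypothetical helper)."""
    start = int(offset_seconds * SAMPLE_RATE_HZ)
    stop = start + WINDOW_SECONDS * SAMPLE_RATE_HZ
    return eeg.iloc[start:stop]


def central_label_span(window: pd.DataFrame) -> pd.DataFrame:
    """Return the central 10 seconds of a 50-second window (hypothetical helper)."""
    mid = len(window) // 2
    half = LABEL_SECONDS * SAMPLE_RATE_HZ // 2
    return window.iloc[mid - half: mid + half]


# Synthetic stand-in for one consolidated recording: 90 seconds, two made-up channels.
n_rows = 90 * SAMPLE_RATE_HZ
eeg = pd.DataFrame({"ch_a": np.arange(n_rows), "ch_b": np.zeros(n_rows)})

# An eeg_label_offset_seconds value of 6.0 selects rows 1200 to 11199.
window = extract_eeg_window(eeg, offset_seconds=6.0)
labeled = central_label_span(window)
print(window.shape, labeled.shape)  # (10000, 2) (2000, 2)
```

With real data, `eeg` would come from a file in train_eegs/ and `offset_seconds` from the matching train.csv row.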

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

import glob 
import os

In [2]:
train_data = pd.read_csv('../hms-harmful-brain-activity-classification/train.csv')
test_data = pd.read_csv('../hms-harmful-brain-activity-classification/test.csv')

print(f"Train Data Shape: {train_data.shape}")
print(f"Test Data Shape: {test_data.shape}")

Train Data Shape: (106800, 15)
Test Data Shape: (1, 3)


In [3]:
train_data.head(10)

| | eeg_id | eeg_sub_id | eeg_label_offset_seconds | spectrogram_id | spectrogram_sub_id | spectrogram_label_offset_seconds | label_id | patient_id | expert_consensus | seizure_vote | lpd_vote | gpd_vote | lrda_vote | grda_vote | other_vote |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1628180742 | 0 | 0.0 | 353733 | 0 | 0.0 | 127492639 | 42516 | Seizure | 3 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1628180742 | 1 | 6.0 | 353733 | 1 | 6.0 | 3887563113 | 42516 | Seizure | 3 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1628180742 | 2 | 8.0 | 353733 | 2 | 8.0 | 1142670488 | 42516 | Seizure | 3 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1628180742 | 3 | 18.0 | 353733 | 3 | 18.0 | 2718991173 | 42516 | Seizure | 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1628180742 | 4 | 24.0 | 353733 | 4 | 24.0 | 3080632009 | 42516 | Seizure | 3 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1628180742 | 5 | 26.0 | 353733 | 5 | 26.0 | 2413091605 | 42516 | Seizure | 3 | 0 | 0 | 0 | 0 | 0 |
| 6 | 1628180742 | 6 | 30.0 | 353733 | 6 | 30.0 | 364593930 | 42516 | Seizure | 3 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1628180742 | 7 | 36.0 | 353733 | 7 | 36.0 | 3811483573 | 42516 | Seizure | 3 | 0 | 0 | 0 | 0 | 0 |
| 8 | 1628180742 | 8 | 40.0 | 353733 | 8 | 40.0 | 3388718494 | 42516 | Seizure | 3 | 0 | 0 | 0 | 0 | 0 |
| 9 | 2277392603 | 0 | 0.0 | 924234 | 0 | 0.0 | 1978807404 | 30539 | GPD | 0 | 0 | 5 | 0 | 1 | 5 |


In [4]:
train_data['expert_consensus'].unique()

array(['Seizure', 'GPD', 'LRDA', 'Other', 'GRDA', 'LPD'], dtype=object)

The terms "Seizure," "GPD," "LRDA," "Other," "GRDA," and "LPD" are associated with EEG (electroencephalography) patterns:

- **Seizure**: Indicates the occurrence of seizures, which are typically characterized by sudden, abnormal electrical activity in the brain.
- **GPD (Generalized Periodic Discharges)**: Refers to a pattern of discharges that occur at regular intervals and are widespread across the brain.
- **LRDA (Lateralized Rhythmic Delta Activity)**: Represents rhythmic delta activity that is localized to one hemisphere of the brain and is often associated with seizures.
- **Other**: This category may include various other EEG patterns that do not fit into the standard classifications.
- **GRDA (Generalized Rhythmic Delta Activity)**: Involves rhythmic delta waves that occur throughout the brain but are not typically associated with seizures.
- **LPD (Lateralized Periodic Discharges)**: Refers to periodic discharges that are localized to one side of the brain and can be associated with seizures or other brain abnormalities.

These patterns are critical for neurophysiological assessment in EEG monitoring, particularly in critically ill patients, as they can have different clinical associations and implications for treatment.

[seizure/lpd/gpd/lrda/grda/other]_vote - The count of annotator votes for a given brain activity class. The full names of the activity classes are as follows:
* lpd: lateralized periodic discharges
* gpd: generalized periodic discharges
* lrda: lateralized rhythmic delta activity
* grda: generalized rhythmic delta activity

A detailed explanation of these patterns is <a href = 'https://www.acns.org/UserFiles/file/ACNSStandardizedCriticalCareEEGTerminology_rev2021.pdf'>available here.</a>
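
Because the vote columns are raw counts, the number of annotators and the level of agreement differ from row to row. The sketch below illustrates this on two rows copied from the `train_data.head()` output above. The derived column names `n_votes` and `top_share` are hypothetical, and nothing here is meant to describe how `expert_consensus` was actually derived.

```python
import pandas as pd

VOTE_COLS = ["seizure_vote", "lpd_vote", "gpd_vote", "lrda_vote", "grda_vote", "other_vote"]

# Rows 0 and 9 of the head() output, reduced to the label columns.
rows = pd.DataFrame(
    {
        "expert_consensus": ["Seizure", "GPD"],
        "seizure_vote": [3, 0],
        "lpd_vote": [0, 0],
        "gpd_vote": [0, 5],
        "lrda_vote": [0, 0],
        "grda_vote": [0, 1],
        "other_vote": [0, 5],
    }
)

# Total number of annotator votes cast on each row.
rows["n_votes"] = rows[VOTE_COLS].sum(axis=1)

# Share of votes that went to the most popular class (1.0 means full agreement).
rows["top_share"] = rows[VOTE_COLS].max(axis=1) / rows["n_votes"]

print(rows[["expert_consensus", "n_votes", "top_share"]])
```

The first row has 3 unanimous votes, while the second has 11 votes with GPD and Other tied at 5 each, which shows how contested a label can be.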

In [5]:
train_data.describe()

| | eeg_id | eeg_sub_id | eeg_label_offset_seconds | spectrogram_id | spectrogram_sub_id | spectrogram_label_offset_seconds | label_id | patient_id | seizure_vote | lpd_vote | gpd_vote | lrda_vote | grda_vote | other_vote |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 106800.0 | 106800.0 | 106800.0 | 106800.0 | 106800.0 | 106800.0 | 106800.0 | 106800.0 | 106800.0 | 106800.0 | 106800.0 | 106800.0 | 106800.0 | 106800.0 |
| mean | 2104387000.0 | 26.286189 | 118.817228 | 1067262000.0 | 43.733596 | 520.431404 | 2141415000.0 | 32304.428493 | 0.878024 | 1.138783 | 1.264925 | 0.948296 | 1.059185 | 1.966283 |
| std | 1233371000.0 | 69.757658 | 314.557803 | 629147500.0 | 104.292116 | 1449.759868 | 1241670000.0 | 18538.196252 | 1.538873 | 2.818845 | 3.131889 | 2.136799 | 2.228492 | 3.62118 |
| min | 568657.0 | 0.0 | 0.0 | 353733.0 | 0.0 | 0.0 | 338.0 | 56.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1026896000.0 | 1.0 | 6.0 | 523862600.0 | 2.0 | 12.0 | 1067419000.0 | 16707.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2071326000.0 | 5.0 | 26.0 | 1057904000.0 | 8.0 | 62.0 | 2138332000.0 | 32068.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 3172787000.0 | 16.0 | 82.0 | 1623195000.0 | 29.0 | 394.0 | 3217816000.0 | 48036.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 2.0 |
| max | 4294958000.0 | 742.0 | 3372.0 | 2147388000.0 | 1021.0 | 17632.0 | 4294934000.0 | 65494.0 | 19.0 | 18.0 | 16.0 | 15.0 | 15.0 | 25.0 |


In [6]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106800 entries, 0 to 106799
Data columns (total 15 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   eeg_id                            106800 non-null  int64  
 1   eeg_sub_id                        106800 non-null  int64  
 2   eeg_label_offset_seconds          106800 non-null  float64
 3   spectrogram_id                    106800 non-null  int64  
 4   spectrogram_sub_id                106800 non-null  int64  
 5   spectrogram_label_offset_seconds  106800 non-null  float64
 6   label_id                          106800 non-null  int64  
 7   patient_id                        106800 non-null  int64  
 8   expert_consensus                  106800 non-null  object 
 9   seizure_vote                      106800 non-null  int64  
 10  lpd_vote                          106800 non-null  int64  
 11  gpd_vote                          106800 non-null  int64  
 ...

## Checking for null values

In [7]:
train_data.isnull().sum()

eeg_id                              0
eeg_sub_id                          0
eeg_label_offset_seconds            0
spectrogram_id                      0
spectrogram_sub_id                  0
spectrogram_label_offset_seconds    0
label_id                            0
patient_id                          0
expert_consensus                    0
seizure_vote                        0
lpd_vote                            0
gpd_vote                            0
lrda_vote                           0
grda_vote                           0
other_vote                          0
dtype: int64

## Electroencephalography (EEG)
* Electroencephalography, commonly known as EEG, is a non-invasive method used by medical professionals to record electrical activity in the brain.
* This is done using electrodes placed along the scalp.
* EEG is a crucial tool in diagnosing neurological disorders, especially epilepsy, which is characterized by recurrent seizures.

In [8]:
eeg_dir = '../hms-harmful-brain-activity-classification/train_eegs/'
spectrogram_dir = '../hms-harmful-brain-activity-classification/train_spectrograms/'

eeg_files = os.listdir(eeg_dir)
print(f"Number of EEG parquet files: {len(eeg_files)}")

spectrogram_files = os.listdir(spectrogram_dir)
print(f"Number of Spectrogram parquet files: {len(spectrogram_files)}")

Number of EEG parquet files: 17300
Number of Spectrogram parquet files: 11138
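
Counting directory entries does not show whether every file has metadata rows, or the reverse. A minimal cross-check is sketched below. It assumes, without confirmation from this notebook, that each file is named `<id>.parquet`; the helper `ids_on_disk`, the placeholder files, and the toy `meta` frame are all hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def ids_on_disk(directory: str) -> set:
    """Collect integer IDs from the stems of the .parquet files in a directory."""
    return {int(p.stem) for p in Path(directory).glob("*.parquet")}


# Self-contained demo with empty placeholder files; the real directory would be eeg_dir.
with tempfile.TemporaryDirectory() as demo_dir:
    for file_id in (101, 102, 103):
        (Path(demo_dir) / f"{file_id}.parquet").touch()
    (Path(demo_dir) / "notes.txt").touch()  # skipped by the glob, but counted by os.listdir

    disk_ids = ids_on_disk(demo_dir)

meta = pd.DataFrame({"eeg_id": [101, 101, 103]})  # toy stand-in for train_data
meta_ids = set(meta["eeg_id"].tolist())

print("files without metadata:", sorted(disk_ids - meta_ids))
print("metadata without files:", sorted(meta_ids - disk_ids))
```

On the real data the same set differences would be taken between `ids_on_disk(eeg_dir)` and `train_data['eeg_id']`.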


In [9]:
train = train_data.groupby('eeg_id')[['spectrogram_id','spectrogram_label_offset_seconds']].agg(
    {'spectrogram_id':'first','spectrogram_label_offset_seconds':'min'})
train.columns = ['spec_id','min']

In [10]:
# Use a name other than `max` so the Python built-in is not shadowed.
max_offset = train_data.groupby('eeg_id')['spectrogram_label_offset_seconds'].max()
train['max'] = max_offset

In [11]:
first = train_data.groupby('eeg_id')[['patient_id']].agg('first')
train['patient_id'] = first
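
The three per-`eeg_id` aggregations above could also be written as a single call using pandas named aggregation. The sketch below runs on a toy frame with made-up values rather than on `train_data`, so treat it as an illustration of the pattern and not as a drop-in replacement for the cells above.

```python
import pandas as pd

# Toy stand-in for train_data with only the columns the aggregation needs.
toy = pd.DataFrame(
    {
        "eeg_id": [1, 1, 1, 2],
        "spectrogram_id": [10, 10, 10, 20],
        "spectrogram_label_offset_seconds": [0.0, 6.0, 8.0, 908.0],
        "patient_id": [7, 7, 7, 9],
    }
)

# Named aggregation: output_column=(input_column, function).
# `min` and `max` are keyword names here, so no built-in is reassigned.
summary = toy.groupby("eeg_id").agg(
    spec_id=("spectrogram_id", "first"),
    min=("spectrogram_label_offset_seconds", "min"),
    max=("spectrogram_label_offset_seconds", "max"),
    patient_id=("patient_id", "first"),
)
print(summary)
```

The result is indexed by `eeg_id` and has the same four columns, in the same order, as the `train` frame built step by step above.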

In [12]:
targets = train_data.columns[-6:]
print(f"Number of targets: {len(targets)}")
print(list(targets))

Number of targets: 6
['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']


In [13]:
ok = train_data.groupby('eeg_id')[targets].agg('sum')
for t in targets:
    train[t] = ok[t].values    

In [14]:
y_data = train[targets].values
y_data = y_data / y_data.sum(axis = 1, keepdims = True)
train[targets] = y_data
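
The normalization step turns the summed vote counts for each `eeg_id` into a probability distribution over the six classes. A small worked example is sketched below with hypothetical counts. It also shows one possible guard for a row whose votes sum to zero, which would otherwise become NaN; whether such rows exist in the real data is not established here.

```python
import numpy as np

# Hypothetical summed votes for three eeg_ids (columns follow the `targets` order).
votes = np.array(
    [
        [0, 0, 3, 0, 2, 7],   # 12 votes in total
        [0, 12, 0, 1, 0, 1],  # 14 votes in total
        [0, 0, 0, 0, 0, 0],   # degenerate case: no votes at all
    ],
    dtype=float,
)

totals = votes.sum(axis=1, keepdims=True)

# Divide only where the total is positive; rows with no votes stay all-zero
# instead of becoming NaN.
probs = np.divide(votes, totals, out=np.zeros_like(votes), where=totals > 0)

print(probs.round(6))
```

In the first row, 3 of 12 votes for GPD become 0.25 and 7 of 12 for Other become about 0.583333, so each row with at least one vote sums to 1.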

In [15]:
tmp = train_data.groupby('eeg_id')[['expert_consensus']].agg('first') 
train['target'] = tmp

In [16]:
train = train.reset_index()
print('Train non overlapping eeg_id shape:', train.shape)
train.head()

Train non overlapping eeg_id shape: (17089, 12)


| | eeg_id | spec_id | min | max | patient_id | seizure_vote | lpd_vote | gpd_vote | lrda_vote | grda_vote | other_vote | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 568657 | 789577333 | 0.0 | 16.0 | 20654 | 0.0 | 0.0 | 0.25 | 0.0 | 0.166667 | 0.583333 | Other |
| 1 | 582999 | 1552638400 | 0.0 | 38.0 | 20230 | 0.0 | 0.857143 | 0.0 | 0.071429 | 0.0 | 0.071429 | LPD |
| 2 | 642382 | 14960202 | 1008.0 | 1032.0 | 5955 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | Other |
| 3 | 751790 | 618728447 | 908.0 | 908.0 | 38549 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | GPD |
| 4 | 778705 | 52296320 | 0.0 | 0.0 | 40955 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | Other |


In [17]:
train.to_csv('Cleaned_Train.csv', index = False)