# HMS - harmful brain activity

The aim of this project is to create and train a deep learning model that detects harmful brain activity in EEG signals.

Libraries we're going to use:

- `pandas` - a library for reading and processing data frames and input data files
- `numpy` - a mathematical library for efficient multidimensional array algebra and raw data processing
- `pytorch` - a deep learning framework for efficient data processing and neural network training on CPU or GPU
- `scikit-learn` (`sklearn`) - a feature-rich machine learning and data science library, widely used in industry
- `scipy` - a scientific library with many mathematical tools, for example for signal processing
- `os` - a standard-library module for navigating the file system from Python

In [1]:
import pandas as pd
import numpy as np
import torch
import sklearn
import scipy.signal  # load the signal submodule explicitly; this also binds the name `scipy`
import os

We want to fully utilise our resources. If a GPU is available we use `cuda`, the backend `pytorch` uses to run computations on NVIDIA GPUs, so that neural networks train much faster. Otherwise we fall back to the CPU.

In [2]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEVICE

'cuda'
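
As a quick check, a tensor can be created and moved to the selected device. This is only a minimal sketch: the tensor `x` is an illustration and not part of the pipeline.

```python
import torch

# same device selection as in the cell above
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# create a small tensor on the CPU and move it to the selected device
x = torch.zeros(2, 3).to(DEVICE)

print(x.device.type)  # "cuda" when a GPU is available, otherwise "cpu"
```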

# Loading the data

The data comes from the [kaggle](https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification) competition.

It consists of `.parquet` files with EEG signal data as well as spectrograms of the signals.

In [3]:
# directory that contains data from kaggle hms
INPUT_DATA_DIR = "data"

# directory in which our processed tensors (.pt files) are/will be stored - this will allow for faster loading of the data later
PROCESSED_DATA_DIR = "processed_data"

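The metadata is stored in the `train.csv` and `test.csv` files. A recording can appear in several rows, one per labelled sub-window (`eeg_sub_id`); we keep only the first one of each recording.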

In [4]:
train_meta_full = pd.read_csv(INPUT_DATA_DIR + "/train.csv")
train_meta = train_meta_full.loc[train_meta_full["eeg_sub_id"] == 0]

test_meta = pd.read_csv(INPUT_DATA_DIR + "/test.csv")

Signal data tends to be very noisy. That's why it's important to filter the noise before processing the data further. For this we'll use a Butterworth band-pass filter.

A band-pass filter only lets frequencies between a lower and an upper cutoff pass through. It removes slow baseline drift below the lower cutoff as well as high-frequency noise above the upper cutoff.

The data is sampled at 200 Hz. We keep the 0.5-20 Hz band, so noise above 20 Hz and drift below 0.5 Hz are cut off.

In [5]:
cutoff_freq = 20
sampling_rate = 200
order = 4
lowcut = 0.5
highcut = 20

b, a = scipy.signal.butter(
    order, (lowcut, highcut), btype="bandpass", analog=False, fs=sampling_rate
)
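
The effect of the filter defined above can be checked on a synthetic signal. The sketch below rebuilds the same design and measures how much of a 5 Hz tone and of a 60 Hz tone survives; both test tones and the helper `rms` are made up for this illustration.

```python
import numpy as np
from scipy import signal

fs = 200                      # sampling rate in Hz, as in the notebook
t = np.arange(0, 10, 1 / fs)  # 10 seconds of samples

# two hypothetical test tones: one inside the pass band, one outside
in_band = np.sin(2 * np.pi * 5 * t)    # 5 Hz
out_band = np.sin(2 * np.pi * 60 * t)  # 60 Hz

# same design as above: 4th-order Butterworth, 0.5-20 Hz band-pass
b, a = signal.butter(4, (0.5, 20), btype="bandpass", analog=False, fs=fs)


def rms(x):
    """Root-mean-square amplitude of a signal."""
    return float(np.sqrt(np.mean(x ** 2)))


# skip the first 2 seconds so that the filter's start-up transient is ignored
steady = slice(2 * fs, None)

in_gain = rms(signal.lfilter(b, a, in_band)[steady]) / rms(in_band[steady])
out_gain = rms(signal.lfilter(b, a, out_band)[steady]) / rms(out_band[steady])

print(f"gain at 5 Hz:  {in_gain:.3f}")
print(f"gain at 60 Hz: {out_gain:.3f}")
```

The in-band tone should pass almost unchanged while the 60 Hz tone is strongly attenuated.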

Extract `.parquet` data.

The training data is stored in `.parquet` files as the raw readings of the individual electrodes placed on the patient's head during an EEG examination.

Specialists normally don't analyze the separate signals but rather the differences between neighbouring electrodes.

The `extract_parquet` function computes those differences and builds time series from them instead of from the raw signals.

The differences are then processed with the previously initialized band-pass filter, clipped so that their values are not too high, and converted to pytorch tensors.

We use pytorch tensors because that is the object class required for neural network training.
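
The idea of working with differences between neighbouring electrodes can be sketched compactly with a list of electrode pairs. The helper `bipolar_montage`, the shortened `PAIRS` list and the tiny recording below are hypothetical and only illustrate the computation; the function in the next cell spells out every pair explicitly.

```python
import pandas as pd

# a few (left, right) electrode pairs; the notebook uses a longer list
PAIRS = [("Fp1", "F7"), ("F7", "T3"), ("Fz", "Cz")]


def bipolar_montage(raw: pd.DataFrame, pairs) -> pd.DataFrame:
    """Return a new frame with one 'A-B' column per electrode pair."""
    return pd.DataFrame({f"{a}-{b}": raw[a] - raw[b] for a, b in pairs})


# tiny made-up recording with three samples per electrode
raw = pd.DataFrame(
    {
        "Fp1": [10.0, 12.0, 11.0],
        "F7": [4.0, 5.0, 6.0],
        "T3": [1.0, 1.0, 1.0],
        "Fz": [0.0, 2.0, 4.0],
        "Cz": [1.0, 1.0, 1.0],
    }
)

montage = bipolar_montage(raw, PAIRS)
print(montage)
```

Unlike the in-place column assignments, this variant leaves the input frame untouched.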

In [6]:
# take a parquet dataframe and compute correct values for each column
# we want columns such as "Fp1-F7" as can be seen in /example_figures
def extract_parquet(parquet_data: pd.DataFrame) -> torch.Tensor:
    parquet_data["Fp1-F7"] = parquet_data["Fp1"] - parquet_data["F7"]
    parquet_data["F7-T3"] = parquet_data["F7"] - parquet_data["T3"]
    parquet_data["T3-T5"] = parquet_data["T3"] - parquet_data["T5"]
    parquet_data["T5-O1"] = parquet_data["T5"] - parquet_data["O1"]

    parquet_data["Fp2-F8"] = parquet_data["Fp2"] - parquet_data["F8"]
    parquet_data["F8-T4"] = parquet_data["F8"] - parquet_data["T4"]
    parquet_data["T4-T6"] = parquet_data["T4"] - parquet_data["T6"]
    parquet_data["T6-O2"] = parquet_data["T6"] - parquet_data["O2"]

    parquet_data["Fp1-F3"] = parquet_data["Fp1"] - parquet_data["F3"]
    parquet_data["F3-C3"] = parquet_data["F3"] - parquet_data["C3"]
    parquet_data["C3-P3"] = parquet_data["C3"] - parquet_data["P3"]
    parquet_data["P3-O1"] = parquet_data["P3"] - parquet_data["O1"]

    parquet_data["Fp2-F4"] = parquet_data["Fp2"] - parquet_data["F4"]
    parquet_data["F4-C4"] = parquet_data["F4"] - parquet_data["C4"]
    parquet_data["C4-P4"] = parquet_data["C4"] - parquet_data["P4"]
    parquet_data["P4-O2"] = parquet_data["P4"] - parquet_data["O2"]

    parquet_data["Fz-Cz"] = parquet_data["Fz"] - parquet_data["Cz"]
    parquet_data["Cz-Pz"] = parquet_data["Cz"] - parquet_data["Pz"]

    parquet_data = parquet_data.drop(
        [
            "Fp1",
            "F3",
            "C3",
            "P3",
            "F7",
            "T3",
            "T5",
            "O1",
            "Fz",
            "Cz",
            "Pz",
            "Fp2",
            "F4",
            "C4",
            "P4",
            "F8",
            "T4",
            "T6",
            "O2",
        ],
        axis=1,
    )

    # we want to reorder the columns so that the EKG signal is at the end.
    idx = parquet_data.columns[1:].to_list() + [parquet_data.columns[0]]

    # we want to transpose the values so that they're easier to handle later
    parquet_data = parquet_data[idx].values.T

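    # band-pass filter every channel along the time axis
    # (after the transpose the shape is (channels, time), so time is the last axis)
    parquet_data = scipy.signal.lfilter(b, a, parquet_data, axis=-1)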

    # convert to pytorch tensor
    parquet_data = torch.from_numpy(parquet_data).type(torch.float32)

    # clip the values to the range [-1024, 1024]
    parquet_data = torch.clip(parquet_data, -1024, 1024)

    return parquet_data

We need to extract the data from the `.parquet` files, so in a loop we'll iterate through the metadata and extract the data and the training labels.
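
Missing readings are forward-filled before a file is used. The small sketch below, on a made-up frame, shows what forward-filling can and cannot repair: gaps that follow a valid value are filled, while a gap at the very start stays missing, which is why a check for remaining missing values is still needed.

```python
import numpy as np
import pandas as pd

# made-up readings with gaps
frame = pd.DataFrame(
    {
        "Fp1": [1.0, np.nan, 3.0, np.nan],  # gaps in the middle and at the end
        "F7": [np.nan, 2.0, np.nan, 4.0],   # gap at the very start
    }
)

# propagate the last valid reading forward
filled = frame.ffill()
print(filled)

# True if any value is still missing after forward-filling
has_missing = bool(filled.isna().any().any())
print(has_missing)
```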

In [7]:
# A list for data entries from different files
eeg_data = []

# indices of the dataframes with too many missing values
faulty_eeg_id = []

if not os.path.exists(f"{PROCESSED_DATA_DIR}/eeg_labels.pt"):
    os.makedirs(PROCESSED_DATA_DIR,exist_ok=True)
    
    # iterate through IDs of eeg and extract each .parquet file
    for eeg_id in train_meta["eeg_id"]:
        # open the file using pandas
        parquet_data = pd.read_parquet(INPUT_DATA_DIR + f"/train_eegs/{eeg_id}.parquet")

        # forward-fill the missing values and extract the first 10000 measurements
        parquet_data = parquet_data.ffill().iloc[:10000]

        # If at this point there are any missing values, the file can't be used
        if parquet_data.isna().any().any():
            faulty_eeg_id.append(eeg_id)
            continue

        # we call the preprocessing function
        eeg = extract_parquet(parquet_data)

        # and add the data to the list of processed entries
        eeg_data.append(eeg)

    # convert the list of tensors to one tensor
    eeg_data = torch.stack(eeg_data)

    # save to PROCESSED_DATA_DIR
    torch.save(eeg_data, f"{PROCESSED_DATA_DIR}/eeg_data.pt")

    # extract labels for valid entries
    all_labels = train_meta.loc[~train_meta["eeg_id"].isin(faulty_eeg_id)]
    eeg_labels = all_labels[
        ["seizure_vote", "lpd_vote", "gpd_vote", "lrda_vote", "grda_vote", "other_vote"]
    ].values

    # convert labels to pytorch tensor
    eeg_labels = torch.tensor(np.array(eeg_labels), dtype=torch.float32)

    # The labels are probabilities - each row must sum to 1
    eeg_labels = eeg_labels / eeg_labels.sum(dim=1, keepdims=True)

    # save for easier training later
    torch.save(eeg_labels, f"{PROCESSED_DATA_DIR}/eeg_labels.pt")
else:
    eeg_data = torch.load(f"{PROCESSED_DATA_DIR}/eeg_data.pt")
    eeg_labels = torch.load(f"{PROCESSED_DATA_DIR}/eeg_labels.pt")

In [8]:
print(eeg_data.shape)
print(eeg_labels.shape)

torch.Size([17018, 19, 10000])
torch.Size([17018, 6])
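
The six label columns hold expert vote counts, which are turned into probabilities by dividing each row by its sum. A minimal sketch with made-up vote counts:

```python
import torch

# hypothetical vote counts for two recordings (six classes each)
votes = torch.tensor(
    [
        [0.0, 0.0, 3.0, 0.0, 1.0, 9.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 5.0],
    ]
)

# divide every row by its total number of votes
probs = votes / votes.sum(dim=1, keepdim=True)

print(probs)
print(probs.sum(dim=1))  # every row sums to 1
```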


---

# Creating a data loader, preprocessing

Set up global variables for the data loader and preprocessing

In [9]:
SAMPLING_FREQUENCY = 200
SAMPLES_IN_MEASUREMENT = 10000
FOLDS = 5
BATCH_SIZE = 128
NUM_WORKERS = 0

Using scikit-learn we'll split the data into training (80%), validation (10%) and test (10%) datasets

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    eeg_data, eeg_labels, test_size=0.2
)

X_test, X_valid, y_test, y_valid = train_test_split(
    X_test, y_test, test_size=0.5
)
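
The two-step split can be verified on toy data. The sketch below uses 100 made-up samples; the `random_state` argument is an addition for reproducibility and is not used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples with one feature each
X = np.arange(100).reshape(100, 1)
y = np.arange(100)

# first split: 80% training, 20% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# second split: the held-out part is halved into test and validation sets
X_test, X_valid, y_test, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_train), len(X_valid), len(X_test))  # 80 10 10
```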

To perform the computation on the GPU, we have to move the data to the `cuda` device

In [11]:
X_train, y_train = X_train.to(DEVICE), y_train.to(DEVICE)
X_valid, y_valid = X_valid.to(DEVICE), y_valid.to(DEVICE)
X_test, y_test = X_test.to(DEVICE), y_test.to(DEVICE)

Create an HMS dataset class that will help us load the data during model training

In [12]:
from torch.utils.data import DataLoader, Dataset


class CustomImageDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


train_loader = DataLoader(
    CustomImageDataset(X_train, y_train), batch_size=BATCH_SIZE, shuffle=True
)

test_loader = DataLoader(
    CustomImageDataset(X_test, y_test), batch_size=BATCH_SIZE, shuffle=True
)
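
How a loader cuts the data into batches can be seen on a small random example. The sketch uses `TensorDataset` from `torch.utils.data`, which for plain tensors behaves like the custom class above; the tensor sizes are made up.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 made-up samples: 19 channels x 100 time steps, uniform labels over 6 classes
X = torch.randn(10, 19, 100)
y = torch.full((10, 6), 1 / 6)

loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=False)

# collect the shape of every batch the loader yields
shapes = [(xb.shape, yb.shape) for xb, yb in loader]
for x_shape, y_shape in shapes:
    print(x_shape, y_shape)
```

With 10 samples and a batch size of 4 the loader yields batches of 4, 4 and 2 samples.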

---

# Creating a model

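We create a machine learning model - a python object that will be trained to predict the correct data labels. Two architectures are defined below, `ConvLSTM` and `ConvModel`; `ConvModel` is the one trained in this notebook.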

In [13]:
import torch.nn as nn

class ConvLSTM(nn.Module):
    def __init__(self):
        super(ConvLSTM, self).__init__()

        # additionally to the previous approach we use convolution.
        # It downsizes the input from shape (19, 10000) to (32, 2500)
        self.conv = nn.Conv1d(19, 32, kernel_size=4, stride=4)

        self.lstm = nn.LSTM(
            input_size=32, hidden_size=50, num_layers=2, batch_first=True, dropout=0.0
        )
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(50, 6)
        self.softmax = nn.Softmax(dim=1)

    # input in CHW / CW format
    def forward(self, x):
        x = self.conv(x)

        x = x.permute(0, 2, 1)  # (batch_size, seq_length, input_size)
        x, _ = self.lstm(x)
        x = x[:, -1, :]
        x = self.flatten(x)
        x = self.fc(x)
        out = self.softmax(x)
        return out
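
The shape comment in `ConvLSTM` can be checked by pushing a random batch through the same convolution. This is a minimal sketch with a made-up batch of two recordings.

```python
import torch
import torch.nn as nn

# same convolution as in ConvLSTM: 19 input channels, kernel 4, stride 4
conv = nn.Conv1d(19, 32, kernel_size=4, stride=4)

# made-up batch: 2 recordings, 19 channels, 10000 samples
x = torch.randn(2, 19, 10000)

with torch.no_grad():
    features = conv(x)                    # (batch, channels, length)
    sequence = features.permute(0, 2, 1)  # (batch, seq_length, input_size) for the LSTM

print(features.shape)
print(sequence.shape)
```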

In [14]:
class ConvModel(nn.Module):
    def __init__(self):
        super(ConvModel, self).__init__()

        self.conv1 = nn.Sequential(
            # (19, 10000)
            nn.Conv1d(19, 76, kernel_size=5,padding=2,groups=19),
            nn.MaxPool1d(2,ceil_mode=True),
            nn.BatchNorm1d(76),
        )
        
        self.conv2 = nn.Sequential(
            # (76, 5000)
            nn.Conv1d(76, 64, kernel_size=5,padding=2),
            nn.MaxPool1d(2,ceil_mode=True),
            nn.BatchNorm1d(64),
        )
        
        self.conv3 = nn.Sequential(
            # (64, 2500)
            nn.Conv1d(64, 64, kernel_size=5,padding=2),
            nn.MaxPool1d(2,ceil_mode=True),
            nn.BatchNorm1d(64),
        )
        
        self.conv4 = nn.Sequential(
            # (64, 1250)
            nn.Conv1d(64, 32, kernel_size=3,padding=1),
            nn.MaxPool1d(2,ceil_mode=True),
            nn.BatchNorm1d(32),
            # (32, 625)
        )

        
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            
            nn.Linear(20000,4096),
            nn.Dropout(0.5),
            nn.ReLU(inplace=True),
            
            nn.Linear(4096,4096),
            nn.Dropout(0.5),
            nn.ReLU(inplace=True),
            
            nn.Linear(4096, 6),
            
        )      
        
        self.softmax = nn.Softmax(1)
        self.relu = nn.ReLU()

    # input in CHW / CW format
    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.mlp(x)
        return x
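
The input size of the first linear layer (20000) follows from the shapes produced by the four convolutional blocks. The sketch below traces them with the standard output-length formulas; the helper names are hypothetical and the pooling formula is simplified to the case used here (stride equal to the kernel size, no padding).

```python
import math


def conv1d_len(length, kernel_size, padding=0, stride=1):
    """Output length of a 1D convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1


def maxpool1d_len(length, kernel_size, ceil_mode=False):
    """Output length of 1D max pooling with stride == kernel_size and no padding."""
    value = (length - kernel_size) / kernel_size + 1
    return math.ceil(value) if ceil_mode else math.floor(value)


length = 10000
# (out_channels, kernel_size, padding) of the four blocks in ConvModel
blocks = [(76, 5, 2), (64, 5, 2), (64, 5, 2), (32, 3, 1)]

shapes = []
for channels, kernel, padding in blocks:
    length = conv1d_len(length, kernel, padding)
    length = maxpool1d_len(length, 2, ceil_mode=True)
    shapes.append((channels, length))

flattened = shapes[-1][0] * shapes[-1][1]

print(shapes)     # [(76, 5000), (64, 2500), (64, 1250), (32, 625)]
print(flattened)  # 20000
```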

---

# Training

After preprocessing the data and building the model we need to train it.

We'll use the [Kullback–Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) as the loss function.
KL divergence is the evaluation metric of the HMS competition and is widely used as a training loss by the authors of its best solutions.
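
For a target distribution p and a prediction q the divergence is the sum of p * log(p / q) over the classes. In pytorch, `nn.KLDivLoss` expects the prediction as log-probabilities and, with `log_target=False`, the target as plain probabilities. The sketch below compares it with the formula on made-up values.

```python
import torch
import torch.nn as nn

# made-up target distributions and raw model outputs (logits)
target = torch.tensor([[0.5, 0.5, 0.0], [0.1, 0.2, 0.7]])
logits = torch.tensor([[1.0, 0.5, -1.0], [0.0, 0.3, 1.2]])

# the loss expects log-probabilities as its first argument
log_pred = torch.log_softmax(logits, dim=1)

kl = nn.KLDivLoss(reduction="batchmean", log_target=False)
loss = kl(log_pred, target)

# the same value from the formula: sum of p * (log p - log q), averaged over the batch
# torch.xlogy(p, p) computes p * log(p) and returns 0 where p == 0
manual = (torch.xlogy(target, target) - target * log_pred).sum() / target.shape[0]

# a prediction identical to the target has zero divergence
perfect = kl(torch.log(target[1:]), target[1:])

print(loss.item(), manual.item(), perfect.item())
```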

We also use the Adam optimizer - an adaptive learning rate optimization algorithm used in training deep learning models.

It combines the advantages of two other extensions of stochastic gradient descent (SGD):
- Adaptive Gradient Algorithm (AdaGrad)
- Root Mean Square Propagation (RMSProp)

It adjusts the learning rate for each parameter dynamically, making it efficient and well-suited for large datasets and complex models.
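
What "adjusting the learning rate for each parameter" means can be seen in a single update step. The sketch below performs one step with `torch.optim.Adam` on a made-up two-parameter problem and reproduces it by hand from the running averages of the gradient and the squared gradient; the values of `lr` and the toy loss are assumptions chosen for illustration.

```python
import torch

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

# made-up parameters and a toy loss: sum of squares
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.Adam([w], lr=lr, betas=(beta1, beta2), eps=eps)

loss = (w ** 2).sum()
loss.backward()

grad = w.grad.clone()
w_before = w.detach().clone()
optimizer.step()

# the same first step by hand
m = (1 - beta1) * grad       # running average of the gradient
v = (1 - beta2) * grad ** 2  # running average of the squared gradient
m_hat = m / (1 - beta1)      # bias correction for step 1
v_hat = v / (1 - beta2)
w_manual = w_before - lr * m_hat / (v_hat.sqrt() + eps)

print(w.detach())
print(w_manual)
```

Both parameters move by about `lr` even though their gradients differ by a factor of two, because each update is scaled by that parameter's own gradient magnitude.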

In [15]:
from torch.optim import Adam

# initialize the model and move it to gpu
model = ConvModel().to(DEVICE)

## Pretraining 1 - regression

In [25]:
# Create an optimizer that will perform the gradient descent algorithm
optimizer = Adam(model.parameters(), lr=1e-4)

loss_function = nn.MSELoss()

In [17]:
EPOCHS = 100

model.train()

# average loss of every epoch; created once, before the loop, so it is not reset each epoch
loss_values_pretraining = []

# we train the network for EPOCHS epochs
for epoch in range(EPOCHS):
    # variables that will let us compute average epoch loss
    iteration = 0
    total_loss = 0
    
    # iterate through data batches in the data loader
    for batch in train_loader:
        # extract data and labels
        batch_data, batch_labels = batch

        # perform the inference on the data
        prediction = model(batch_data)
        prediction = model.softmax(prediction) - 0.01 # if we don't do this the model will try to reach -inf
        
        # compare the inference output with real labels and calculate the value of loss function
        loss = loss_function(prediction, batch_labels)
        # print(loss.item())

        # accumulate loss to print the average at the end of the epoch
        total_loss += float(loss.item())

        # reset the optimizer
        optimizer.zero_grad()

        # compute the error backpropagation
        loss.backward()

        # update the weights of the model
        optimizer.step()

        iteration += 1
        
    loss_values_pretraining.append(total_loss / iteration)

    # scheduler.step()
    print(f"epoch : {epoch}")
    print(total_loss / iteration)
    print("\n")


epoch : 0
0.09436011955002759

epoch : 1
0.09019787581724541

epoch : 2
0.08584850622671787

epoch : 3
0.08334708213806152

epoch : 4
0.0803232243127912

epoch : 5
0.07839944499118306

epoch : 6
0.07725453199209453

epoch : 7
0.07429398453542005

epoch : 8
0.07142826281139784

epoch : 9
0.06966278193710006

epoch : 10
0.06745296542611078

epoch : 11
0.06437462439464632

epoch : 12
0.06312259610428989

epoch : 13
0.0598345992512235

epoch : 14
0.05747279827700597

epoch : 15
0.05428983821211574

epoch : 16
0.051708536344432385

epoch : 17
0.04822789522531991

epoch : 18
0.04640864577388095

epoch : 19
0.04418839578664748

epoch : 20
0.04208153348729432

epoch : 21
0.039213916286826134

epoch : 22
0.03659745765345119

epoch : 23
0.035120620557637976

epoch : 24
0.033166955790926364

epoch : 25
0.03238788318480844

epoch : 26
0.03020045262213065

epoch : 27
0.028964919290531462

epoch : 28
0.02705704288504948

epoch : 29
0.026324582059876384

epoch : 30
0.025258804627946604

epoch : 31
0.

---

In [26]:
model.eval()
with torch.no_grad():
    iteration = 0
    total_loss = 0
    for batch in test_loader:
        # extract data and labels
        batch_data, batch_labels = batch

        # perform the inference on the data
        prediction = model(batch_data)
        prediction = model.softmax(prediction) - 0.01

        # compare the inference output with real labels and calculate the value of loss function
        loss = loss_function(prediction, batch_labels)

        # accumulate loss
        total_loss += float(loss.item())

        iteration += 1

print(total_loss / iteration)

0.12271077292306083


## Final training

In [27]:
# Create an optimizer that will perform the gradient descent algorithm
optimizer = Adam(model.parameters(), lr=1e-4)

# As a loss function we'll be using KLDivergence
loss_function = nn.KLDivLoss(reduction="batchmean", log_target=False)

In [28]:
EPOCHS = 200

model.train()

# average loss of every epoch; created once, before the loop, so it is not reset each epoch
loss_values = []

# we train the network for EPOCHS epochs
for epoch in range(EPOCHS):
    # variables that will let us compute average epoch loss
    iteration = 0
    total_loss = 0

    # iterate through data batches in the data loader
    for batch in train_loader:
        # extract data and labels
        batch_data, batch_labels = batch

        # perform the inference on the data
        prediction = model(batch_data)
        prediction = model.softmax(prediction)
        prediction = prediction - 0.02
        prediction = model.relu(prediction)
        prediction = prediction / prediction.sum(axis=-1,keepdims=True)

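        # compare the inference output with real labels and calculate the value of loss function
        # KLDivLoss expects log-probabilities; `prediction` is already a probability
        # distribution, so we take its log (clamped to avoid log(0)) instead of a second softmax
        loss = loss_function(torch.log(torch.clamp(prediction, min=1e-8)), batch_labels)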
        # print(loss.item())

        # accumulate loss to print the average at the end of the epoch
        total_loss += float(loss.item())

        # reset the optimizer
        optimizer.zero_grad()

        # compute the error backpropagation
        loss.backward()

        # update the weights of the model
        optimizer.step()

        iteration += 1

    loss_values.append(total_loss / iteration)

    # scheduler.step()
    print(f"epoch : {epoch}")
    print(total_loss / iteration)
    print("\n")

epoch : 0
0.8953740162270092

epoch : 1
0.8950556917725322


KeyboardInterrupt: 
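
The post-processing applied to the model output in the training loop (softmax, subtract 0.02, ReLU, renormalise) can be followed on a single made-up row of logits. The second part of the sketch shows why the logarithm given to `nn.KLDivLoss` should be taken of these probabilities directly: passing a probability vector through a softmax again flattens it.

```python
import torch

# made-up logits for one sample with six classes
logits = torch.tensor([[4.0, 0.0, 0.0, 0.0, 0.0, 2.0]])

probs = torch.softmax(logits, dim=1)

# probabilities below 0.02 are pushed to exactly zero
shifted = torch.relu(probs - 0.02)

# renormalise so that the row sums to 1 again
cleaned = shifted / shifted.sum(dim=-1, keepdim=True)

print(probs)
print(cleaned)

# a softmax over probabilities flattens them: even a fully confident
# one-hot row ends up with a maximum of e / (e + 5), about 0.352
one_hot = torch.tensor([[0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])
flattened = torch.softmax(one_hot, dim=1)

print(flattened)
```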

In [None]:
model.eval()
with torch.no_grad():
    iteration = 0
    total_loss = 0
    for batch in test_loader:
        # extract data and labels
        batch_data, batch_labels = batch

        # perform the inference on the data
        prediction = model(batch_data)
        prediction = model.softmax(prediction)
        prediction = prediction - 0.02
        prediction = model.relu(prediction)
        prediction = prediction / prediction.sum(axis=-1,keepdims=True)

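        # compare the inference output with real labels and calculate the value of loss function
        # KLDivLoss expects log-probabilities; `prediction` is already a probability
        # distribution, so we take its log (clamped to avoid log(0)) instead of a second softmax
        loss = loss_function(torch.log(torch.clamp(prediction, min=1e-8)), batch_labels)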

        # accumulate loss
        total_loss += float(loss.item())

        iteration += 1

print(total_loss / iteration)

In [30]:
prediction

tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.7026, 0.0000, 0.2974],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.9479, 0.0000, 0.0000, 0.0000, 0.0000, 0.0521],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.000

In [31]:
batch_labels

tensor([[0.0000, 0.0000, 0.2308, 0.0000, 0.0769, 0.6923],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.0500, 0.0000, 0.0000, 0.0000, 0.9500],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.3333, 0.0000, 0.6667, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.6667, 0.0000, 0.3333],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0714, 0.0000, 0.0000, 0.0714, 0.1429, 0.7143],
        [0.0000, 0.0000, 0.0909, 0.0000, 0.0909, 0.8182],
        [0.0000, 0.0667, 0.0000, 0.0000, 0.0000, 0.9333],
        [1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000],
        [0.0000, 0.0000, 0.0870, 0.0000, 0.0000, 0.9130],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 0.9167, 0.0833, 0.0000, 0.0000, 0.0000],
        [0.000

In [39]:
torch.abs(batch_labels - prediction).sum(axis=1)

tensor([0.6154, 0.0000, 0.0000, 0.1000, 0.0000, 0.6667, 0.0718, 0.0000, 0.5714,
        0.3636, 0.1333, 0.1042, 0.0000, 0.0000, 0.1739, 0.0000, 0.1667, 0.6667,
        0.9843, 0.5892, 0.0000, 0.0000, 0.1000, 0.6533, 0.8000, 0.0000, 0.0000,
        0.6667, 0.0000, 0.0000, 0.1633, 0.0000, 0.0000, 0.0000, 0.0031, 0.7668,
        0.0000, 0.2667, 0.0000, 0.0000, 0.0000, 1.2457, 0.0000, 0.6667, 0.5000,
        0.2436, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.7919, 0.0000, 1.0000,
        0.0000, 0.0000, 0.0649, 0.0000, 0.0000, 0.1667, 0.0000, 1.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.5000, 0.0000, 0.1429, 0.6250, 0.2644, 0.0000,
        0.0000, 0.6154, 0.0000, 0.0000, 0.0000, 0.3896, 0.8015, 0.1538, 0.0000,
        0.0000, 0.0000, 0.0000, 0.1429, 0.0000, 0.0000, 0.5000, 0.0000, 0.6250,
        0.0000, 0.0000, 0.9237, 1.0000, 0.6667, 0.0000, 0.0000, 0.0000, 0.4781,
        0.1667, 0.0000, 0.9091, 0.9333, 0.8491, 0.6154, 0.0000, 0.0000, 0.0000,
        0.6667, 1.0446, 1.2132, 0.9966, 