# **Introduction**

  In this notebook, we will replicate an encoder-only transformer architecture with attention for a text classification task. Using a mental health dataset with 50,000 samples, we aim to explore the model's performance. Additionally, we will apply transfer learning to fine-tune a pre-trained model for enhanced results.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1. Get the data

In [2]:
from pathlib import Path
import zipfile

zipped_data_path = "/content/drive/MyDrive/mental_health.zip"

data_path = Path("data")

if data_path.is_dir():
  print(f"Data path {data_path} already exists!!!")

else:
  print(f"creating {data_path} on your demand sir 🫡")
  data_path.mkdir(parents=True, exist_ok=True)

  with zipfile.ZipFile(zipped_data_path, "r") as zip_file:
    zip_file.extractall(data_path)
    print(f"data extracted to {data_path}")

creating data on your demand sir 🫡
data extracted to data


In [1]:
import pandas as pd

In [2]:
raw_data = pd.read_csv("/content/data/mental_health/Data.csv")
raw_data

Unnamed: 0.1,Unnamed: 0,statement,status
0,0,oh my gosh,Anxiety
1,1,"trouble sleeping, confused mind, restless hear...",Anxiety
2,2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety
3,3,I've shifted my focus to something else but I'...,Anxiety
4,4,"I'm restless and restless, it's been a month n...",Anxiety
...,...,...,...
53038,53038,Nobody takes me seriously I’ve (24M) dealt wit...,Anxiety
53039,53039,"selfishness ""I don't feel very good, it's lik...",Anxiety
53040,53040,Is there any way to sleep better? I can't slee...,Anxiety
53041,53041,"Public speaking tips? Hi, all. I have to give ...",Anxiety


## 2. Preprocess the data

In [3]:
raw_data.columns

Index(['Unnamed: 0', 'statement', 'status'], dtype='object')

In [4]:
raw_data = raw_data.drop(columns=['Unnamed: 0'])
raw_data.head()

Unnamed: 0,statement,status
0,oh my gosh,Anxiety
1,"trouble sleeping, confused mind, restless hear...",Anxiety
2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety
3,I've shifted my focus to something else but I'...,Anxiety
4,"I'm restless and restless, it's been a month n...",Anxiety


In [5]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53043 entries, 0 to 53042
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   statement  52681 non-null  object
 1   status     53043 non-null  object
dtypes: object(2)
memory usage: 828.9+ KB


In [6]:
raw_data.isna().sum()

statement    362
status         0
dtype: int64

In [7]:
raw_data.dropna(inplace=True)

In [8]:
raw_data.isna().sum()

statement    0
status       0
dtype: int64

In [9]:
raw_data.status.value_counts()

status
Normal                  16343
Depression              15404
Suicidal                10652
Anxiety                  3841
Bipolar                  2777
Stress                   2587
Personality disorder     1077
Name: count, dtype: int64

In [10]:
main_categories = ["Normal", "Depression", "Suicidal"]

raw_data["status"] = raw_data["status"].apply(lambda x: x if x in main_categories else "Others")

In [11]:
raw_data["status"].value_counts()

status
Normal        16343
Depression    15404
Suicidal      10652
Others        10282
Name: count, dtype: int64

In [12]:
raw_data["status"] = raw_data["status"].map({"Normal": 0, "Depression": 1, "Suicidal": 2, "Others": 3})

In [13]:
label_names = ["Normal", "Depression", "Suicidal", "Others"]
label_names

['Normal', 'Depression', 'Suicidal', 'Others']

In [14]:
# Filter for rows where status is 'Depression'
depression_samples = raw_data[raw_data['status'] == 2]

# Print the first 5 text entries
for i in range(min(5, len(depression_samples))):
    print(f"{depression_samples.iloc[i]['statement']}\nlabel is {label_names[depression_samples.iloc[i]['status']]}\n")


I am so exhausted of this. Just when I think I can finally rest, just when I think maybe things are starting to settle, another hurdle comes flying at me. This month alone we found out my mum could be dying, my girlfriend left me, my parents revealed that they wanted a divorce, my grandad was hospitalised again and just now my little sister's been rushed to A&amp;E with possible brain damage. If there is a god up there they must fucking hate me. it is like life is trying to get me to kill myself and honestly I think I would be better off dead. I attempted when I was 12 but I was stupid and there was no way I could cut deep enough. Now I am 15 and everything is so much worse than it ever has been and I just cannot hold on much longer -- it is going to take a miracle to get me through this. I feel so alone. I feel like the world hates me and I have no idea what I did wrong to deserve this. I thought I was getting better. I was doing so well and now everything's just come crashing down ag

In [15]:
raw_data.rename(columns={"status": "label", "statement": "text"}, inplace=True)

In [16]:
raw_data

Unnamed: 0,text,label
0,oh my gosh,3
1,"trouble sleeping, confused mind, restless hear...",3
2,"All wrong, back off dear, forward doubt. Stay ...",3
3,I've shifted my focus to something else but I'...,3
4,"I'm restless and restless, it's been a month n...",3
...,...,...
53038,Nobody takes me seriously I’ve (24M) dealt wit...,3
53039,"selfishness ""I don't feel very good, it's lik...",3
53040,Is there any way to sleep better? I can't slee...,3
53041,"Public speaking tips? Hi, all. I have to give ...",3


### 2.1 Get small dataset to try our model on!!

In [17]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(raw_data, train_size=0.5, shuffle=True, random_state=42)

In [18]:
train_df

Unnamed: 0,text,label
39623,i realized the only reason i haven t killed my...,1
22633,The birds chirp so I will die.is not it funny ...,2
39969,last time i attempted i failed for like the th...,1
48004,Need advice I've done a lot of horrible things...,1
34331,Blood test anxiety I just had my blood taken f...,3
...,...,...
11294,Ppl no longer have to interact anymore. there ...,1
44744,hairpin haha well what make you think you don ...,0
38170,every time i think about suicide or search pai...,1
863,Your voice is so nice,0


In [19]:
train_df.label.value_counts()

label
0    8062
1    7770
2    5311
3    5197
Name: count, dtype: int64

In [20]:
test_df.label.value_counts()

label
0    8281
1    7634
2    5341
3    5085
Name: count, dtype: int64

In [103]:
train_small, _ = train_test_split(train_df, train_size=4992, shuffle=True, random_state=42)

In [104]:
test_small, _ = train_test_split(test_df, train_size=4992, random_state=42)

In [105]:
train_small.label.value_counts()

label
0    1533
1    1503
2     979
3     977
Name: count, dtype: int64

In [106]:
test_small.label.value_counts()

label
0    1546
1    1471
3     999
2     976
Name: count, dtype: int64

## 3. Tokenize the data

In [71]:
!pip install transformers



In [72]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [73]:
# !pip install datasets

In [107]:
from datasets import Dataset

# Convert DataFrame to Dataset
train_dataset = Dataset.from_pandas(train_small)
test_dataset = Dataset.from_pandas(test_small)


In [108]:
train_dataset

Dataset({
    features: ['text', 'label', '__index_level_0__'],
    num_rows: 4992
})

In [109]:
train_dataset = train_dataset.remove_columns(['__index_level_0__'])
test_dataset = test_dataset.remove_columns(['__index_level_0__'])

In [110]:
train_dataset.column_names

['text', 'label']

In [111]:
def tokenize(example):
  tokenized_inputs = tokenizer(example["text"], truncation=True, padding="max_length")
  label = example["label"]

  return {"input_ids": tokenized_inputs["input_ids"], "label": label}

In [112]:
train_tokenized = train_dataset.map(
    tokenize,
    batched=True,
    remove_columns=train_dataset.column_names
)


test_tokenized = test_dataset.map(
    tokenize,
    batched=True,
    remove_columns=train_dataset.column_names
)

Map:   0%|          | 0/4992 [00:00<?, ? examples/s]

Map:   0%|          | 0/4992 [00:00<?, ? examples/s]

In [113]:
input_id_0 = train_tokenized["input_ids"][99]
print(f"Label is: {label_names[train_tokenized['label'][99]]}")
tokenizer.decode(input_id_0)

Label is: Normal


'[CLS] nothing beat the cold damp feeling you get when pulling on a wet pair of knicks [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [

## 4. Data Collation

In [114]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## 5. Getting DataLoaders

In [115]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    train_tokenized,
    shuffle=True,
    collate_fn=data_collator,
    batch_size=16,
)

test_dataloader = DataLoader(
    test_tokenized,
    collate_fn=data_collator,
    batch_size=16,
)

In [116]:
len(train_dataloader), len(test_dataloader)

(312, 312)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [117]:
for batch in train_dataloader:
    input_ids = batch["input_ids"].to(device)
    labels = batch["labels"].to(device)

    print(f"Train batch size: input_ids={input_ids.size()}, labels={labels.size()}")

for test_batch in test_dataloader:
    test_input_ids = test_batch["input_ids"].to(device)
    test_labels = test_batch["labels"].to(device)

    print(f"Test batch size: test_input_ids={test_input_ids.size()}, test_labels={test_labels.size()}")


Train batch size: input_ids=torch.Size([16, 512]), labels=torch.Size([16])
Train batch size: input_ids=torch.Size([16, 512]), labels=torch.Size([16])
Train batch size: input_ids=torch.Size([16, 512]), labels=torch.Size([16])
Train batch size: input_ids=torch.Size([16, 512]), labels=torch.Size([16])
Train batch size: input_ids=torch.Size([16, 512]), labels=torch.Size([16])
Train batch size: input_ids=torch.Size([16, 512]), labels=torch.Size([16])
Train batch size: input_ids=torch.Size([16, 512]), labels=torch.Size([16])
Train batch size: input_ids=torch.Size([16, 512]), labels=torch.Size([16])
Train batch size: input_ids=torch.Size([16, 512]), labels=torch.Size([16])
Train batch size: input_ids=torch.Size([16, 512]), labels=torch.Size([16])
Train batch size: input_ids=torch.Size([16, 512]), labels=torch.Size([16])
Train batch size: input_ids=torch.Size([16, 512]), labels=torch.Size([16])
Train batch size: input_ids=torch.Size([16, 512]), labels=torch.Size([16])
Train batch size: input_i

## 6. Building our very own Encoder Model

In [118]:
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert self.head_dim * heads == embed_size, "Embed size needs to be div by heads"

        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Split embedding into self.heads pieces
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = query.reshape(N, query_len, self.heads, self.head_dim)

        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        energy = torch.einsum("nqhd, nkhd->nhqk", queries, keys)

        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        attention = torch.softmax(energy / (self.embed_size ** (1/2)), dim=3)

        out = torch.einsum("nhql, nlhd->nqhd", attention, values).reshape(
            N, query_len, self.heads * self.head_dim
        )

        out = self.fc_out(out)
        return out

class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)

        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))
        return out

class Encoder(nn.Module):
    def __init__(self, src_vocab_size, embed_size, num_layers, heads, device, forward_expansion, dropout, max_length):
        super(Encoder, self).__init__()
        self.embed_size = embed_size
        self.device = device
        self.word_embedding = nn.Embedding(src_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)

        self.layers = nn.ModuleList(
            [
                TransformerBlock(
                    embed_size,
                    heads,
                    dropout=dropout,
                    forward_expansion=forward_expansion,
                )
                for _ in range(num_layers)
            ]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)

        out = self.dropout(self.word_embedding(x) + self.position_embedding(positions))

        for layer in self.layers:
            out = layer(out, out, out, mask)

        return out

class TransformerEncoder(nn.Module):
    def __init__(self, src_vocab_size, embed_size, num_layers, heads, forward_expansion, dropout, device, max_length, num_classes):
        super(TransformerEncoder, self).__init__()
        self.encoder = Encoder(
            src_vocab_size,
            embed_size,
            num_layers,
            heads,
            device,
            forward_expansion,
            dropout,
            max_length
        )
        self.fc_out = nn.Linear(embed_size, num_classes)
        self.device = device

    def make_src_mask(self, src):
    # Ensure src is a tensor
      if not isinstance(src, torch.Tensor):
          raise TypeError("src must be a tensor")

      src_mask = (src != 0).unsqueeze(1).unsqueeze(2)  # Create mask
      return src_mask.to(self.device)


    def forward(self, src):
        src_mask = self.make_src_mask(src)
        enc_out = self.encoder(src, src_mask)
        # Use the output of the last token (CLS token) for classification
        out = self.fc_out(enc_out[:, 0, :])
        return out

In [119]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [120]:
label_names

['Normal', 'Depression', 'Suicidal', 'Others']

In [121]:
src_vocab_size = tokenizer.vocab_size
x = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0], [1, 8, 7, 3, 4, 5, 6, 7, 2]]).to(device)
embed_size = 512
num_layers = 6
heads = 8
forward_expansion = 4
dropout = 0.1
max_length = 512
num_classes = len(label_names)

model = TransformerEncoder(src_vocab_size, embed_size, num_layers, heads, forward_expansion, dropout, device, max_length, num_classes).to(device)

out = model(x)
out

tensor([[-0.2133,  0.3808, -0.6504,  0.3155],
        [-0.7158,  0.0651, -0.2108, -0.6312]], device='cuda:0',
       grad_fn=<AddmmBackward0>)

In [122]:
text = "hi i am priyanshu"
encoded = tokenizer.encode(text, return_tensors='pt')  # Ensure it returns tensor
encoded = encoded.to(device)  # Move to device if needed
predict = model(encoded)
label_names[torch.argmax(torch.softmax(predict, dim=1), dim=1)]

'Normal'

In [123]:
print(model)

TransformerEncoder(
  (encoder): Encoder(
    (word_embedding): Embedding(28996, 512)
    (position_embedding): Embedding(512, 512)
    (layers): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): SelfAttention(
          (values): Linear(in_features=64, out_features=64, bias=False)
          (keys): Linear(in_features=64, out_features=64, bias=False)
          (queries): Linear(in_features=64, out_features=64, bias=False)
          (fc_out): Linear(in_features=512, out_features=512, bias=True)
        )
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (feed_forward): Sequential(
          (0): Linear(in_features=512, out_features=2048, bias=True)
          (1): ReLU()
          (2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (fc_out): L

## 7. Training our Model 🥷

In [124]:
optimizer = torch.optim.Adam(model.parameters(), 0.01)
loss_fn = torch.nn.CrossEntropyLoss()

In [128]:

from tqdm.auto import tqdm

epoch = 3

for epoch in tqdm(range(epoch)):

  ### Training

  model.train()
  train_loss, train_acc = 0, 0

  for batch in train_dataloader:
    input_ids = batch["input_ids"].to(device)
    labels = batch["labels"].to(device)

    # Forward Pass
    y_pred = model(input_ids)

    # Calculating loss
    loss = loss_fn(y_pred, labels)
    train_loss += loss.item()

    # Optimizing
    optimizer.zero_grad()

    # Backpropogation
    loss.backward()

    # Optimizer step
    optimizer.step()

    # Accuracy
    y_pred_class = torch.argmax(torch.softmax(y_pred, dim=1), dim=1)
    train_acc += torch.mean((y_pred_class == labels).float()).item()

  train_loss = train_loss / len(train_dataloader)
  train_acc = train_acc / len(train_dataloader)

  print(f"train accuracy is: {train_acc}")
  print(f"train loss is: {train_loss}")

  ### Testing

  model.eval()

  test_loss, test_acc = 0, 0

  with torch.inference_mode():
    for test_batch in test_dataloader:
      test_input_ids = test_batch["input_ids"].to(device)
      test_labels = test_batch["labels"].to(device)

      # Forward pass
      test_pred = model(input_ids)



      # Loss
      loss = loss_fn(test_pred, test_labels)
      test_loss += loss.item()

      # Accuracy
      test_pred_class = torch.argmax(torch.softmax(test_pred, dim=1), dim=1)
      test_acc += torch.mean((test_pred_class == labels).float()).item()

  test_loss = test_loss / len(test_dataloader)
  test_acc = test_acc / len(test_dataloader)

  print(f"test accuracy is: {test_acc}")
  print(f"test loss is: {test_loss}")


  0%|          | 0/3 [00:00<?, ?it/s]

train accuracy is: 0.3098958333333333
train loss is: 1.3718209404211779
test accuracy is: 0.1875
test loss is: 1.3658131177608783
train accuracy is: 0.304286858974359
train loss is: 1.367537293678675
test accuracy is: 0.5
test loss is: 1.3660897314548492
train accuracy is: 0.2932692307692308
train loss is: 1.3730966983697352
test accuracy is: 0.1875
test loss is: 1.3654269247482984


In [137]:
# Save the model!!
torch.save(model, "model_mental_health.pth")

In [138]:
# Load the model
loaded_model = torch.load("model_mental_health.pth")

In [144]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

# Example usage
num_parameters = count_parameters(model)
print(f"Total number of parameters: {num_parameters}")

Total number of parameters: 29370372


## 8. Test the model!!!

In [142]:
text = "There are wounds that never show on the body that are deeper and more hurtful than anything that bleeds."
encoded = tokenizer.encode(text, return_tensors='pt')
encoded = encoded.to(device)

loaded_model.eval()
with torch.no_grad():
  predict = loaded_model(encoded)

label_names[torch.argmax(torch.softmax(predict, dim=1), dim=1)]

'Normal'

**Note:-** Our model is bad but this is how we replicate the Text classification using transformer encoder