<a href="https://colab.research.google.com/github/gupta24789/multilabel-classification/blob/main/multilabel_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q  pytorch-lightning

In [2]:
import random
import pandas as pd
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import pytorch_lightning as pl
import torchmetrics
from transformers import AutoTokenizer, AutoModel

## Set Seed

In [3]:
seed = 121
random.seed(seed)
torch.manual_seed(seed)
pl.seed_everything(seed)

INFO:lightning_fabric.utilities.seed:Seed set to 121


121

## Load Data

In [4]:
train_df = pd.read_csv("https://raw.githubusercontent.com/gupta24789/multilabel-classification/main/data/train.csv")
val_df = pd.read_csv("https://raw.githubusercontent.com/gupta24789/multilabel-classification/main/data/test.csv")

print(f"Train shape : {train_df.shape}")
print(f"Val shape : {val_df.shape}")

train_df.columns = train_df.columns.str.lower()
val_df.columns = val_df.columns.str.lower()

Train shape : (16777, 9)
Val shape : (4195, 9)


In [5]:
train_df.head(3)

|   | id | title | abstract | computer science | physics | mathematics | statistics | quantitative biology | quantitative finance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Reconstructing Subject-Specific Effect Maps | Predictive models allow subject-specific inf... | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | Rotation Invariance Neural Network | Rotation invariance and translation invarian... | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | Spherical polyharmonics and Poisson kernels fo... | We introduce and develop the notion of spher... | 0 | 0 | 1 | 0 | 0 | 0 |


## Data Prep

In [6]:
train_df['context'] = train_df.title + ". " + train_df.abstract
val_df['context'] = val_df.title + ". " + val_df.abstract

target_columns = ['computer science', 'physics', 'mathematics',
       'statistics', 'quantitative biology', 'quantitative finance']

## .copy() makes independent frames, so adding a column below does not
## trigger pandas' SettingWithCopyWarning
train_df = train_df[['context'] + target_columns].copy()
val_df = val_df[['context'] + target_columns].copy()

train_df['labels'] = train_df[target_columns].values.tolist()
val_df['labels'] = val_df[target_columns].values.tolist()


In [7]:
train_df.head(3)

|   | context | computer science | physics | mathematics | statistics | quantitative biology | quantitative finance | labels |
|---|---|---|---|---|---|---|---|---|
| 0 | Reconstructing Subject-Specific Effect Maps. ... | 1 | 0 | 0 | 0 | 0 | 0 | [1, 0, 0, 0, 0, 0] |
| 1 | Rotation Invariance Neural Network. Rotation... | 1 | 0 | 0 | 0 | 0 | 0 | [1, 0, 0, 0, 0, 0] |
| 2 | Spherical polyharmonics and Poisson kernels fo... | 0 | 0 | 1 | 0 | 0 | 0 | [0, 0, 1, 0, 0, 0] |
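
The `labels` column holds one multi-hot list per row, in the order of `target_columns`. A minimal sketch of converting between category names and that vector follows; the helper names `encode_labels` and `decode_labels` are hypothetical and not part of the notebook.

```python
target_columns = ['computer science', 'physics', 'mathematics',
                  'statistics', 'quantitative biology', 'quantitative finance']

def encode_labels(names, columns=target_columns):
    """Turn a collection of category names into a multi-hot list."""
    wanted = set(names)
    return [1 if col in wanted else 0 for col in columns]

def decode_labels(vector, columns=target_columns):
    """Turn a multi-hot list back into the category names it marks."""
    return [col for col, flag in zip(columns, vector) if flag == 1]

encoded = encode_labels(['computer science', 'statistics'])
decoded = decode_labels([0, 0, 1, 0, 0, 0])
print(encoded)
print(decoded)
```

The same list comprehension as in `decode_labels` is what the prediction cells use to print the true and predicted categories.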


## Transformer Model Exploration

In [8]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
transformer_model = AutoModel.from_pretrained(model_name)



## Data Loaders

In [9]:
train_df.context.str.split(" ").str.len().describe([.99])

count    16777.000000
mean       147.537760
std         56.085355
min          7.000000
50%        144.000000
99%        277.000000
max        394.000000
Name: context, dtype: float64

In [10]:
def custom_collate(batch):

  text = [item['context'] for item in batch]
  label = [item['labels'] for item in batch]

  ## padding='max_length' already pads to max_length; the deprecated pad_to_max_length flag is not needed
  inputs = tokenizer(text, max_length=400, add_special_tokens=True, truncation=True, padding='max_length', return_tensors='pt')
  label = torch.tensor(label, dtype = torch.float)

  batch = {"input_ids": inputs['input_ids'], "token_type_ids": inputs['token_type_ids'],"attention_mask": inputs['attention_mask'], "label": label}
  return batch
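
With `padding='max_length'`, every sequence in a batch is padded to the same length, and the attention mask marks real tokens with 1 and padding with 0. The toy sketch below illustrates that relationship in plain Python. `pad_batch` is a hypothetical helper, the token ids are invented, and a pad id of 0 is an assumption; the real tokenizer also adds special tokens and handles truncation more carefully.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Pad or truncate each id list to max_length and build matching masks."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]                      # naive truncation
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)    # right-pad with pad_id
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Invented token ids, for illustration only.
ids, mask = pad_batch([[11, 12, 13], [21, 22, 23, 24, 25]], max_length=4)
print(ids)
print(mask)
```

Because every row has the same length, the lists can be stacked into the `(batch_size, max_length)` tensors that the collate function returns.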

In [11]:
train_data = train_df[['context','labels']].to_dict('records')
val_data = val_df[['context','labels']].to_dict('records')

In [12]:
train_data[:1]

[{'context': "Reconstructing Subject-Specific Effect Maps.   Predictive models allow subject-specific inference when analyzing disease\nrelated alterations in neuroimaging data. Given a subject's data, inference can\nbe made at two levels: global, i.e. identifiying condition presence for the\nsubject, and local, i.e. detecting condition effect on each individual\nmeasurement extracted from the subject's data. While global inference is widely\nused, local inference, which can be used to form subject-specific effect maps,\nis rarely used because existing models often yield noisy detections composed of\ndispersed isolated islands. In this article, we propose a reconstruction\nmethod, named RSM, to improve subject-specific detections of predictive\nmodeling approaches and in particular, binary classifiers. RSM specifically\naims to reduce noise due to sampling error associated with using a finite\nsample of examples to train classifiers. The proposed method is a wrapper-type\nalgorithm tha

In [13]:
batch_size = 2
train_dl = DataLoader(train_data, batch_size = batch_size, shuffle = True, collate_fn= custom_collate)

In [14]:
example = next(iter(train_dl))
example['input_ids'].shape, example['token_type_ids'].shape, example['attention_mask'].shape, example['label'].shape

(torch.Size([2, 400]),
 torch.Size([2, 400]),
 torch.Size([2, 400]),
 torch.Size([2, 6]))

In [15]:
## dataloaders
batch_size = 8
train_dl = DataLoader(train_data, batch_size = batch_size, shuffle = True, collate_fn= custom_collate, num_workers = 2)
val_dl = DataLoader(val_data, batch_size = batch_size, shuffle = False, collate_fn= custom_collate, num_workers = 2)

## Build Model

In [16]:
class MultiLabelTransformer(pl.LightningModule):

  def __init__(self, output_dim, learning_rate, dropout, freeze = False):
    super().__init__()
    self.learning_rate = learning_rate

    ## define loss & metrics (F1 and Hamming distance)
    self.loss_fn = nn.BCEWithLogitsLoss()
    self.train_f1 = torchmetrics.F1Score(task="multilabel", num_labels=output_dim)
    self.val_f1 = torchmetrics.F1Score(task="multilabel", num_labels=output_dim)
    self.train_ham = torchmetrics.HammingDistance(task="multilabel", num_labels=output_dim)
    self.val_ham = torchmetrics.HammingDistance(task="multilabel", num_labels=output_dim)

    ## define layers
    self.transformer_model = AutoModel.from_pretrained(model_name)
    hidden_dim = self.transformer_model.config.hidden_size
    self.linear1 = nn.Linear(hidden_dim, 128)
    self.linear2 = nn.Linear(128, output_dim)
    self.dropout = nn.Dropout(dropout)
    self.relu = nn.ReLU()
    self.tanh = nn.Tanh()

    ## freeze layers
    if freeze:
      for name, params in self.transformer_model.named_parameters():
        params.requires_grad = False


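  def forward(self, inputs):
    """
    Return raw logits of shape (batch_size, output_dim).
    `inputs` is a mapping with input_ids, token_type_ids and attention_mask.
    """

    embeddings = self.transformer_model(**inputs)
    last_hidden_state, pooler_output = embeddings['last_hidden_state'], embeddings['pooler_output']

    ## average over the last 4 token positions of the final layer
    ## (this slices the sequence axis, not the last 4 hidden layers)
    hidden_state = torch.mean(last_hidden_state[:,-4:,:], dim = 1)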

    out  = self.dropout(hidden_state)
    out = self.tanh(self.linear1(out))
    out = self.linear2(out)
    return out

  def _shared_step(self, batch):
    label = batch.pop('label')
    logits = self(batch)
    loss = self.loss_fn(logits, label)
    return logits, loss, label

  def training_step(self, batch, batch_idx):
    logits, loss, label = self._shared_step(batch)
    self.train_f1(logits, label)
    self.train_ham(logits, label)
    self.log_dict({"train_loss": loss, "train_f1": self.train_f1,"train_ham" : self.train_ham}, on_step = False, on_epoch = True, prog_bar=True)
    return loss

  def validation_step(self,batch, batch_idx):
    logits, loss, label = self._shared_step(batch)
    self.val_f1(logits, label)
    self.val_ham(logits, label)
    self.log_dict({"val_loss": loss,  "val_f1": self.val_f1, "val_ham": self.val_ham}, on_step = False, on_epoch = True, prog_bar=True)
    return loss

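  ## No manual reset hook for the training metrics: Lightning resets metric
  ## objects passed to self.log / self.log_dict at the end of each epoch.
  ## (The previous hook was named on_training_epoch_end, which Lightning never calls.)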

  def on_validation_epoch_end(self):
    print(f"Epoch : {self.current_epoch} Val F1 : {self.val_f1.compute()}  val ham : {self.val_ham.compute()}")
    self.val_f1.reset()
    self.val_ham.reset()

  def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr = self.learning_rate)
    return optimizer
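
The slice `last_hidden_state[:, -4:, :]` in `forward` averages the final four token positions of the last layer. If the goal is instead to average the last four hidden layers, one possible approach is sketched below. It assumes the encoder is called with `output_hidden_states=True`, so that its output carries a `hidden_states` tuple with one `(batch, seq_len, hidden)` tensor per layer. Random tensors stand in for that tuple here, and `mean_of_last_four_layers` is a hypothetical helper, not the notebook's method. The sketch also uses the attention mask so that padding positions do not contribute.

```python
import torch

def mean_of_last_four_layers(hidden_states, attention_mask):
    """Average the last four layers, then mask-aware mean over tokens."""
    stacked = torch.stack(hidden_states[-4:], dim=0)           # (4, batch, seq, hidden)
    layer_mean = stacked.mean(dim=0)                           # (batch, seq, hidden)
    mask = attention_mask.unsqueeze(-1).to(layer_mean.dtype)   # (batch, seq, 1)
    summed = (layer_mean * mask).sum(dim=1)                    # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1.0)                    # real tokens per row
    return summed / counts                                     # (batch, hidden)

# Stand-in for a 12-layer encoder's hidden_states (embeddings + 12 layers).
torch.manual_seed(0)
hidden_states = tuple(torch.randn(2, 5, 8) for _ in range(13))
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])
pooled = mean_of_last_four_layers(hidden_states, attention_mask)
print(pooled.shape)
```

Swapping this in would change the model, so the scores reported in this notebook would no longer apply.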

In [17]:
# ## test model architecture
# model = MultiLabelTransformer(output_dim = len(target_columns), learning_rate = 1e-3, dropout= 0.5, freeze = False)
# inputs = {
#     "input_ids": example['input_ids'],
#     "token_type_ids": example['token_type_ids'],
#     "attention_mask": example['attention_mask']
# }
# logits = model(inputs)
# model.loss_fn(logits, example['label'])
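
`nn.BCEWithLogitsLoss` takes raw logits, applies a sigmoid internally, and by default averages the binary cross-entropy over all label slots. A small plain-Python sketch of that calculation for a single example follows; `bce_with_logits` is a hypothetical helper written for illustration, and the logits are invented.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw logits (numerically stable form)."""
    total = 0.0
    for x, y in zip(logits, targets):
        # max(x, 0) - x*y + log(1 + exp(-|x|)) equals
        # -[y*log(s) + (1 - y)*log(1 - s)] with s = sigmoid(x)
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

logits = [2.0, -1.0, 0.0, -3.0, -2.5, -4.0]   # invented scores, one per category
targets = [1, 0, 0, 0, 0, 0]                  # multi-hot label vector
loss = bce_with_logits(logits, targets)
print(loss)
```

Each of the six categories is treated as an independent yes/no decision, which is what allows a paper to carry several labels at once.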

In [18]:
## Model Training

model = MultiLabelTransformer(output_dim = len(target_columns), learning_rate = 3e-5, dropout = 0.5, freeze = False)

callbacks = pl.callbacks.ModelCheckpoint(dirpath = "checkpoints_logs",
                                         filename = '{epoch}-{val_loss:.2f}-{val_ham:.2f}',
                                         mode = "min",
                                         monitor = "val_ham",
                                         save_last = True,
                                         save_top_k=-1)

trainer = pl.Trainer(accelerator= "gpu",
           max_epochs=3,
           check_val_every_n_epoch = 1,
           callbacks = [callbacks])

trainer.fit(model, train_dl, val_dl)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:639: Checkpoint directory /content/checkpoints_logs exists and is not empty.
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
   | Name              | Type                      | Params
-----------------------------------------------------------------
0  | loss_fn           | BCEWithLogitsLoss         | 0     
1  | train_f1          | MultilabelF1Score         | 0     
2  | val_f1            | MultilabelF1Score         | 0     
3  | train_ham         | MultilabelHammingDistance | 0     
4  | val_ham  

Sanity Checking:
Epoch : 0 Val F1 : 0.20000000298023224  val ham : 0.5833333730697632

Training:
Epoch : 0 Val F1 : 0.8019343018531799  val ham : 0.07973778247833252
Epoch : 1 Val F1 : 0.8341268301010132  val ham : 0.06920939683914185
Epoch : 2 Val F1 : 0.8284925222396851  val ham : 0.07155340909957886

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.
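
For reading the numbers above: Hamming distance is the fraction of individual label slots predicted wrongly (lower is better), and F1 combines precision and recall over the positive labels (higher is better). The sketch below computes both on invented predictions in plain Python. It uses micro-averaged F1 for simplicity, which is an assumption; `torchmetrics` may aggregate per label depending on its `average` argument, so the values need not match its output exactly. The helper names are hypothetical.

```python
def hamming_distance(preds, targets):
    """Fraction of label slots where prediction and target disagree."""
    wrong = sum(p != t for pr, tr in zip(preds, targets) for p, t in zip(pr, tr))
    slots = sum(len(row) for row in targets)
    return wrong / slots

def micro_f1(preds, targets):
    """F1 over all label slots pooled together (micro average)."""
    pairs = [(p, t) for pr, tr in zip(preds, targets) for p, t in zip(pr, tr)]
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Invented predictions and targets for three papers and six categories.
preds   = [[1, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0]]
targets = [[1, 0, 0, 1, 0, 0], [0, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0]]
ham = hamming_distance(preds, targets)
f1 = micro_f1(preds, targets)
print(ham, f1)
```

Most slots are 0 in this dataset, so a low Hamming distance is easy to reach; F1 is the stricter signal because it ignores correctly predicted zeros.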


## Predict

In [19]:
model = model.eval()
model = model.to("cuda")

In [35]:
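def predict(text):
  ## truncate to the same max_length used in training
  inputs = tokenizer(text, max_length=400, truncation=True, return_tensors='pt')
  inputs = inputs.to(model.device)
  with torch.no_grad():
    logits = model(inputs)
  ## the model returns logits, so convert to probabilities before thresholding
  probs = torch.sigmoid(logits).cpu().numpy().flatten()
  preds = (probs>0.5).astype(int)
  print("Pred : ",[target_columns[i] for i, val in enumerate(preds) if val==1])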

In [47]:
random_sample = val_df.sample().to_dict('records')[0]
context = random_sample['context']
label = random_sample['labels']
print("True : ",[target_columns[i] for i, val in enumerate(label) if val==1])
predict(context)

True :  ['computer science', 'statistics']
Pred :  ['computer science', 'statistics']
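
Because the model outputs raw logits, a probability threshold of 0.5 corresponds to a logit threshold of 0, not 0.5. A plain-Python sketch with invented logits shows the difference; `logits_to_labels` is a hypothetical helper, not part of the notebook.

```python
import math

target_columns = ['computer science', 'physics', 'mathematics',
                  'statistics', 'quantitative biology', 'quantitative finance']

def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logits_to_labels(logits, threshold=0.5):
    """Keep the categories whose sigmoid probability exceeds the threshold."""
    return [col for col, x in zip(target_columns, logits) if sigmoid(x) > threshold]

logits = [1.2, -3.0, -2.1, 0.3, -4.0, -5.5]   # invented model outputs
predicted = logits_to_labels(logits)
# Thresholding the raw logits at 0.5 instead would drop the 0.3 entry.
raw_thresholded = [col for col, x in zip(target_columns, logits) if x > 0.5]
print(predicted)
print(raw_thresholded)
```

Here the logit 0.3 maps to a probability of about 0.57, so "statistics" is predicted only when the sigmoid is applied first.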
