In [1]:
pip install transformers datasets torch scikit-learn

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_c

In [2]:
import os
os.environ["DISABLE_FLASH_ATTN"] = "1"

In [3]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"


In [4]:
import torch
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments, DistilBertTokenizer, DistilBertForSequenceClassification
from datasets import load_dataset, Dataset
from sklearn.metrics import accuracy_score
import numpy as np

In [6]:

imdb_df = pd.read_csv('IMDB_Dataset_Cleaned_Featured.csv')
sst2_df = pd.read_csv('SST2_Dataset_Cleaned_Featured.csv')

# Subsample the data for faster experimentation
imdb_df = imdb_df.sample(n=500, random_state=42).reset_index(drop=True)
sst2_df = sst2_df.sample(n=500, random_state=42).reset_index(drop=True)

# Convert pandas DataFrame to Hugging Face Dataset
imdb = Dataset.from_pandas(imdb_df)
sst2 = Dataset.from_pandas(sst2_df)

In [8]:
import gc
def clear_gpu():
    globals_to_clear = ['model', 'optimizer', 'meta_model', 'meta_optimizer']
    for name in globals_to_clear:
        if name in globals():
            del globals()[name]
    gc.collect()
    torch.cuda.empty_cache()

# Call it before big steps
clear_gpu()

In [9]:
from sklearn.metrics import accuracy_score

#eval code

def evaluate(model, data_loader, device, task='sst2'):
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, task=task)
            preds = torch.argmax(outputs, dim=1)

            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    acc = accuracy_score(all_labels, all_preds)
    print(f"✅ Accuracy on {task.upper()} test set: {acc:.4f}")
    return acc


In [10]:
# Initialize BERT tokenizer
'''
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Function to preprocess data
def preprocess_data(examples):
    # Tokenize the reviews
    tokenized = tokenizer(examples['review'], padding=True, truncation=True, max_length=512)

    # Additional features: Example with `num_nouns`, `num_verbs`, etc. (just placeholders)
    tokenized['num_nouns'] = examples['num_nouns']
    tokenized['num_verbs'] = examples['num_verbs']
    tokenized['sentiment_shift'] = examples['sentiment_shift']

    return tokenized

# Apply preprocessing to both training and testing datasets
imdb = imdb.map(preprocess_data, batched=True)
sst2 = sst2.map(preprocess_data, batched=True)

# Set format for PyTorch
imdb.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label', 'num_nouns', 'num_verbs', 'sentiment_shift'])
sst2.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label', 'num_nouns', 'num_verbs', 'sentiment_shift'])
'''

# Initialize DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Function to preprocess data
def preprocess_data(examples):
    # Tokenize the reviews
    tokenized = tokenizer(examples['review'], padding="max_length", truncation=True, max_length=512)

    # Additional features: Example with `num_nouns`, `num_verbs`, etc. (just placeholders)
    tokenized['num_nouns'] = examples['num_nouns']
    tokenized['num_verbs'] = examples['num_verbs']
    tokenized['sentiment_shift'] = examples['sentiment_shift']

    return tokenized

# Apply preprocessing to both training and testing datasets
imdb = imdb.map(preprocess_data, batched=True)
sst2 = sst2.map(preprocess_data, batched=True)

# Set format for PyTorch
imdb.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label', 'num_nouns', 'num_verbs', 'sentiment_shift'])
sst2.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label', 'num_nouns', 'num_verbs', 'sentiment_shift'])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [11]:
'''
class BERTWithExtraFeatures(torch.nn.Module):
    def __init__(self):
        super(BERTWithExtraFeatures, self).__init__()
        self.bert = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
        self.dropout = torch.nn.Dropout(0.1)
        self.fc = torch.nn.Linear(3, 2)  # Assuming we add 3 extra features (num_nouns, num_verbs, sentiment_shift)

    def forward(self, input_ids, attention_mask, num_nouns, num_verbs, sentiment_shift):
        # Pass through BERT
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output

        # Concatenate additional features to BERT's output
        extra_features = torch.stack([num_nouns, num_verbs, sentiment_shift], dim=-1)
        combined_input = torch.cat((pooled_output, extra_features), dim=-1)

        # Pass through a dropout layer and the final classification layer
        combined_input = self.dropout(combined_input)
        logits = self.fc(combined_input)

        return logits
'''

from transformers import BertForSequenceClassification, BertModel, DistilBertForSequenceClassification, DistilBertModel
import torch
import torch.nn as nn

'''
class BERTWithExtraFeatures(BertForSequenceClassification):
    def __init__(self, config, num_extra_features):
        super().__init__(config)
        self.num_extra_features = num_extra_features
        self.extra_features_layer = nn.Linear(num_extra_features, 1)  # Convert extra features to scalar

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None,
                num_nouns=None, num_verbs=None, sentiment_shift=None):
        # Get the outputs from the original BERT model
        outputs = super().forward(input_ids=input_ids, attention_mask=attention_mask,
                                  token_type_ids=token_type_ids, labels=labels)

        # Process extra features if they exist
        if num_nouns is not None and num_verbs is not None and sentiment_shift is not None:
            # Concatenate extra features into a single tensor
            extra_features = torch.cat((num_nouns.unsqueeze(1), num_verbs.unsqueeze(1), sentiment_shift.unsqueeze(1)), dim=1)

            # Pass extra features through the extra_features_layer
            extra_features_output = self.extra_features_layer(extra_features)  # Shape: (batch_size, 1)

            # Expand extra features to match logits shape (batch_size, num_classes)
            extra_features_output = extra_features_output.expand(-1, outputs.logits.size(1))  # Shape: (batch_size, num_classes)

            # Add the extra features to the logits
            outputs.logits += extra_features_output

        return outputs



# Define the number of extra features (e.g., num_nouns, num_verbs, sentiment_shift)
num_extra_features = 3  # Adjust this based on the number of additional features you have

# Instantiate the model with the BERT configuration and the extra features count
model = BERTWithExtraFeatures.from_pretrained('bert-base-uncased', num_extra_features=num_extra_features)

# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Using device:", device)

# Move model to GPU
model.to(device)
'''
'''
class DistilBERTWithExtraFeatures(DistilBertForSequenceClassification):
    def __init__(self, config, num_extra_features):
        super().__init__(config)
        self.num_extra_features = num_extra_features
        self.extra_features_layer = nn.Linear(num_extra_features, 1)  # Convert extra features to scalar

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None,
                num_nouns=None, num_verbs=None, sentiment_shift=None):
        # Get the outputs from the original DistilBERT model
        outputs = super().forward(input_ids=input_ids, attention_mask=attention_mask,
                                  token_type_ids=token_type_ids, labels=labels)

        # Process extra features if they exist
        if num_nouns is not None and num_verbs is not None and sentiment_shift is not None:
            # Concatenate extra features into a single tensor
            extra_features = torch.cat((num_nouns.unsqueeze(1), num_verbs.unsqueeze(1), sentiment_shift.unsqueeze(1)), dim=1)

            # Pass extra features through the extra_features_layer
            extra_features_output = self.extra_features_layer(extra_features)  # Shape: (batch_size, 1)

            # Expand extra features to match logits shape (batch_size, num_classes)
            extra_features_output = extra_features_output.expand(-1, outputs.logits.size(1))  # Shape: (batch_size, num_classes)

            # Add the extra features to the logits
            outputs.logits += extra_features_output

        return outputs
'''

class DistilBERTWithExtraFeatures(DistilBertForSequenceClassification):
    def __init__(self, config, num_extra_features):
        super().__init__(config)
        self.num_extra_features = num_extra_features
        self.extra_features_layer = nn.Linear(num_extra_features, 1)  # Convert extra features to scalar

    def forward(self, input_ids=None, attention_mask=None, labels=None,
                num_nouns=None, num_verbs=None, sentiment_shift=None):
        # Get the outputs from the original DistilBERT model
        outputs = super().forward(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

        # Process extra features if they exist
        if num_nouns is not None and num_verbs is not None and sentiment_shift is not None:
            # Concatenate extra features into a single tensor
            extra_features = torch.cat((num_nouns.unsqueeze(1), num_verbs.unsqueeze(1), sentiment_shift.unsqueeze(1)), dim=1)

            # Pass extra features through the extra_features_layer
            extra_features_output = self.extra_features_layer(extra_features)  # Shape: (batch_size, 1)

            # Expand extra features to match logits shape (batch_size, num_classes)
            extra_features_output = extra_features_output.expand(-1, outputs.logits.size(1))  # Shape: (batch_size, num_classes)

            # Add the extra features to the logits
            outputs.logits += extra_features_output

        return outputs

# Define the number of extra features (e.g., num_nouns, num_verbs, sentiment_shift)
num_extra_features = 3  # Adjust this based on the number of additional features you have

# Instantiate the model with the DistilBERT configuration and the extra features count
model = DistilBERTWithExtraFeatures.from_pretrained('distilbert-base-uncased', num_extra_features=num_extra_features)

# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Using device:", device)

# Move model to GPU
model.to(device)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBERTWithExtraFeatures were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'extra_features_layer.bias', 'extra_features_layer.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Using device: cuda


DistilBERTWithExtraFeatures(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
        

In [12]:
from transformers import EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=5,              # number of training epochs
    per_device_train_batch_size=16,  # batch size for training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    evaluation_strategy="epoch",     # Evaluate after each epoch
    save_strategy="epoch",           # Save model after each epoch
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',      # <-- Required for early stopping
    greater_is_better=True                 # <-- True if higher = better (e.g., accuracy)
)

trainer = Trainer(
    model=model,                         # the model to be trained
    args=training_args,                  # training arguments
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)], # stop if no visible improvements after 2 cons. runs
    train_dataset=imdb,         # training dataset
    eval_dataset=sst2,     # evaluation dataset
    compute_metrics=lambda p: {'accuracy': accuracy_score(p.predictions.argmax(axis=-1), p.label_ids)}
)

trainer.train()

results = trainer.evaluate(sst2)
print("Test accuracy:", results['eval_accuracy'])

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mivansojivarghese[0m ([33mivansojivarghese-nanyang-technological-university-singapore[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6813,0.690225,0.504
2,0.6687,0.682203,0.568
3,0.5811,0.591129,0.818
4,0.2788,0.479217,0.83
5,0.2479,0.457897,0.804


Test accuracy: 0.83


In [13]:
from transformers import DistilBertModel
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

class DistilBERTMultiTask(nn.Module):
    def __init__(self, shared_model_name='distilbert-base-uncased', num_labels_task1=2, num_labels_task2=2):
        super(DistilBERTMultiTask, self).__init__()
        self.shared_encoder = DistilBertModel.from_pretrained(shared_model_name)
        self.dropout = nn.Dropout(0.3)
        self.classifier_sst2 = nn.Linear(self.shared_encoder.config.hidden_size, num_labels_task1)
        self.classifier_imdb = nn.Linear(self.shared_encoder.config.hidden_size, num_labels_task2)

    def forward(self, input_ids, attention_mask, task='sst2'):
        outputs = self.shared_encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.last_hidden_state[:, 0]
        pooled_output = self.dropout(pooled_output)

        if task == 'sst2':
            return self.classifier_sst2(pooled_output)
        elif task == 'imdb':
            return self.classifier_imdb(pooled_output)
        else:
            raise ValueError("Unknown task")


In [14]:

def train_alternating_batches(model, sst2_loader, imdb_loader, optimizer, criterion, device):
    model.train()
    sst2_iter = iter(sst2_loader)
    imdb_iter = iter(imdb_loader)
    total_batches = min(len(sst2_loader), len(imdb_loader))

    for _ in range(total_batches):
        for task_name, loader_iter in [("sst2", sst2_iter), ("imdb", imdb_iter)]:
            try:
                batch = next(loader_iter)
            except StopIteration:
                continue

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            optimizer.zero_grad()
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, task=task_name)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()


            # print(f"[{task_name.upper()}] Loss: {loss.item():.4f}")


def train_sequential_single_task(model, data_loader, optimizer, criterion, device, task='sst2'):
    model.train()
    for batch in data_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, task=task)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

def train_hybrid_multitask(model, sst2_loader, imdb_loader, optimizer, criterion, device, num_epochs=4):
    for epoch in range(num_epochs):
        print(f"Epoch {epoch+1}/{num_epochs}")

        if epoch % 2 == 0:
            print("🔁 Alternating batches (SST2 + IMDB)")
            train_alternating_batches(model, sst2_loader, imdb_loader, optimizer, criterion, device)
        else:
            task = "sst2" if (epoch // 2) % 2 == 0 else "imdb"
            print(f"📚 Training on one task only: {task.upper()}")
            if task == "sst2":
                train_sequential_single_task(model, sst2_loader, optimizer, criterion, device, task='sst2')
            else:
                train_sequential_single_task(model, imdb_loader, optimizer, criterion, device, task='imdb')


In [16]:
print(sst2_df.columns)
print(imdb_df.columns)


Index(['label', 'num_nouns', 'num_verbs', 'sentiment_shift', 'input_ids',
       'attention_mask'],
      dtype='object')
Index(['label', 'num_nouns', 'num_verbs', 'sentiment_shift', 'input_ids',
       'attention_mask'],
      dtype='object')


In [17]:
from transformers import AutoTokenizer
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import pandas as pd
import torch

# --- Tokenizer ---
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# --- Custom Collate ---
def custom_collate_fn(batch):
    input_ids = torch.stack([item['input_ids'] for item in batch])
    attention_mask = torch.stack([item['attention_mask'] for item in batch])
    labels = torch.tensor([item['labels'] for item in batch], dtype=torch.long)
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': labels
    }

# --- Dataset Class ---
"""
class TextDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len=128):
        self.texts = dataframe['review'].tolist()
        self.labels = dataframe['label'].tolist()
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }
"""

class PreTokenizedDataset(Dataset):
    def __init__(self, dataframe):
        self.input_ids = dataframe['input_ids'].tolist()
        self.attention_mask = dataframe['attention_mask'].tolist()
        self.labels = dataframe['label'].tolist()

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.input_ids[idx], dtype=torch.long),
            'attention_mask': torch.tensor(self.attention_mask[idx], dtype=torch.long),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }


# --- Split raw datasets ---
# Convert list of dicts to DataFrames
sst2_df = pd.DataFrame(sst2)
imdb_df = pd.DataFrame(imdb)

# Split into train/test
sst2_train_df, sst2_test_df = train_test_split(sst2_df, test_size=0.2, random_state=42)
imdb_train_df, imdb_test_df = train_test_split(imdb_df, test_size=0.2, random_state=42)

# Create datasets
'''
sst2_train_dataset = TextDataset(sst2_train_df, tokenizer)
imdb_train_dataset = TextDataset(imdb_train_df, tokenizer)
sst2_test_dataset = TextDataset(sst2_test_df, tokenizer)
imdb_test_dataset = TextDataset(imdb_test_df, tokenizer)
'''

sst2_train_dataset = PreTokenizedDataset(sst2_train_df)
imdb_train_dataset = PreTokenizedDataset(imdb_train_df)
sst2_test_dataset = PreTokenizedDataset(sst2_test_df)
imdb_test_dataset = PreTokenizedDataset(imdb_test_df)


# Create dataloaders
sst2_loader = DataLoader(sst2_train_dataset, batch_size=16, shuffle=True, collate_fn=custom_collate_fn)
imdb_loader = DataLoader(imdb_train_dataset, batch_size=16, shuffle=True, collate_fn=custom_collate_fn)

sst2_test_loader = DataLoader(sst2_test_dataset, batch_size=32, shuffle=False, collate_fn=custom_collate_fn)
imdb_test_loader = DataLoader(imdb_test_dataset, batch_size=32, shuffle=False, collate_fn=custom_collate_fn)

# --- Initialize and train model ---
model = DistilBERTMultiTask().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
train_hybrid_multitask(model, sst2_loader, imdb_loader, optimizer, criterion, device, num_epochs=10)

# --- Optional: Evaluate after training ---
evaluate(model, sst2_test_loader, device, task='sst2')
evaluate(model, imdb_test_loader, device, task='imdb')


Epoch 1/10
🔁 Alternating batches (SST2 + IMDB)


  'input_ids': torch.tensor(self.input_ids[idx], dtype=torch.long),
  'attention_mask': torch.tensor(self.attention_mask[idx], dtype=torch.long),
  'labels': torch.tensor(self.labels[idx], dtype=torch.long)


Epoch 2/10
📚 Training on one task only: SST2
Epoch 3/10
🔁 Alternating batches (SST2 + IMDB)
Epoch 4/10
📚 Training on one task only: IMDB
Epoch 5/10
🔁 Alternating batches (SST2 + IMDB)
Epoch 6/10
📚 Training on one task only: SST2
Epoch 7/10
🔁 Alternating batches (SST2 + IMDB)
Epoch 8/10
📚 Training on one task only: IMDB
Epoch 9/10
🔁 Alternating batches (SST2 + IMDB)
Epoch 10/10
📚 Training on one task only: SST2
✅ Accuracy on SST2 test set: 0.8400


  'input_ids': torch.tensor(self.input_ids[idx], dtype=torch.long),
  'attention_mask': torch.tensor(self.attention_mask[idx], dtype=torch.long),
  'labels': torch.tensor(self.labels[idx], dtype=torch.long)


✅ Accuracy on IMDB test set: 0.8500


0.85

In [18]:
# --- Initialize and train model ---
model = DistilBERTMultiTask().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

train_sequential_single_task(model, sst2_loader, optimizer, criterion, device, task='sst2')

# --- Optional: Evaluate after training ---
evaluate(model, sst2_test_loader, device, task='sst2')
evaluate(model, imdb_test_loader, device, task='imdb')  # should perform poorly


  'input_ids': torch.tensor(self.input_ids[idx], dtype=torch.long),
  'attention_mask': torch.tensor(self.attention_mask[idx], dtype=torch.long),
  'labels': torch.tensor(self.labels[idx], dtype=torch.long)


✅ Accuracy on SST2 test set: 0.6900


  'input_ids': torch.tensor(self.input_ids[idx], dtype=torch.long),
  'attention_mask': torch.tensor(self.attention_mask[idx], dtype=torch.long),
  'labels': torch.tensor(self.labels[idx], dtype=torch.long)


✅ Accuracy on IMDB test set: 0.4000


0.4

In [19]:
# --- Initialize and train model ---
model = DistilBERTMultiTask().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

train_sequential_single_task(model, imdb_loader, optimizer, criterion, device, task='imdb')

# --- Optional: Evaluate after training ---
evaluate(model, sst2_test_loader, device, task='sst2')
evaluate(model, imdb_test_loader, device, task='imdb')

  'input_ids': torch.tensor(self.input_ids[idx], dtype=torch.long),
  'attention_mask': torch.tensor(self.attention_mask[idx], dtype=torch.long),
  'labels': torch.tensor(self.labels[idx], dtype=torch.long)


✅ Accuracy on SST2 test set: 0.5600


  'input_ids': torch.tensor(self.input_ids[idx], dtype=torch.long),
  'attention_mask': torch.tensor(self.attention_mask[idx], dtype=torch.long),
  'labels': torch.tensor(self.labels[idx], dtype=torch.long)


✅ Accuracy on IMDB test set: 0.8300


0.83

In [20]:
# --- Initialize and train model ---
model = DistilBERTMultiTask().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for epoch in range(10):
    print(f"Epoch {epoch+1}")
    train_alternating_batches(model, sst2_loader, imdb_loader, optimizer, criterion, device)

# --- Optional: Evaluate after training ---
evaluate(model, sst2_test_loader, device, task='sst2')
evaluate(model, imdb_test_loader, device, task='imdb')


Epoch 1


  'input_ids': torch.tensor(self.input_ids[idx], dtype=torch.long),
  'attention_mask': torch.tensor(self.attention_mask[idx], dtype=torch.long),
  'labels': torch.tensor(self.labels[idx], dtype=torch.long)


Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
✅ Accuracy on SST2 test set: 0.8400


  'input_ids': torch.tensor(self.input_ids[idx], dtype=torch.long),
  'attention_mask': torch.tensor(self.attention_mask[idx], dtype=torch.long),
  'labels': torch.tensor(self.labels[idx], dtype=torch.long)


✅ Accuracy on IMDB test set: 0.8300


0.83

In [21]:
'''
!pip install higher

import torch
import torch.nn as nn
import torch.nn.functional as F
import higher
from transformers import DistilBertModel

# ✅ DistilBERT-based Few-Shot Learner

import gc
gc.collect()
torch.cuda.empty_cache()

class DistilBERTFewShot(nn.Module):
    def __init__(self, model_name='distilbert-base-uncased', hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained(model_name, torch_dtype=torch.float32)  # force float32
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.dropout = nn.Dropout(0.3)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.dropout(outputs.last_hidden_state[:, 0])
        return self.classifier(pooled)

# 🧪 Meta-Training Loop with higher
def meta_train_maml(model, meta_optimizer, task_loaders, device, num_iterations=100, inner_lr=1e-2, inner_steps=1):
    model.to(device)
    model.train()

    for iteration in range(num_iterations):
        meta_optimizer.zero_grad()
        meta_loss = 0.0

        for task_name, loader in task_loaders.items():
            support_set, query_set = next(iter(loader))

            support_input_ids = support_set['input_ids'].to(device)
            support_attention_mask = support_set['attention_mask'].to(device)
            support_labels = support_set['labels'].to(device)

            query_input_ids = query_set['input_ids'].to(device)
            query_attention_mask = query_set['attention_mask'].to(device)
            query_labels = query_set['labels'].to(device)

            # Inner loop: fast adaptation with 'higher'
            with higher.innerloop_ctx(model, meta_optimizer, copy_initial_weights=False) as (fmodel, diffopt):
                for _ in range(inner_steps):
                    support_preds = fmodel(support_input_ids, support_attention_mask)
                    support_loss = F.cross_entropy(support_preds, support_labels)
                    diffopt.step(support_loss)

                # Evaluate on query set with updated params
                query_preds = fmodel(query_input_ids, query_attention_mask)
                query_loss = F.cross_entropy(query_preds, query_labels)
                meta_loss += query_loss

        # Average and backpropagate meta-loss
        meta_loss /= len(task_loaders)
        meta_loss.backward()
        meta_optimizer.step()

        if iteration % 10 == 0:
            print(f"Iteration {iteration}: Meta Loss = {meta_loss.item():.4f}")
'''

'\n!pip install higher\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport higher\nfrom transformers import DistilBertModel\n\n# ✅ DistilBERT-based Few-Shot Learner\n\nimport gc\ngc.collect()\ntorch.cuda.empty_cache()\n\nclass DistilBERTFewShot(nn.Module):\n    def __init__(self, model_name=\'distilbert-base-uncased\', hidden_size=768, num_labels=2):\n        super().__init__()\n        self.encoder = DistilBertModel.from_pretrained(model_name, torch_dtype=torch.float32)  # force float32\n        self.classifier = nn.Linear(hidden_size, num_labels)\n        self.dropout = nn.Dropout(0.3)\n\n    def forward(self, input_ids, attention_mask):\n        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)\n        pooled = self.dropout(outputs.last_hidden_state[:, 0])\n        return self.classifier(pooled)\n\n# 🧪 Meta-Training Loop with higher\ndef meta_train_maml(model, meta_optimizer, task_loaders, device, num_iterations=100, inner_lr=1e-

In [22]:
'''
import random
from torch.utils.data import Dataset

class FewShotTaskLoader:
    def __init__(self, dataset, n_support=5, n_query=15, tokenizer=None, max_len=128):
        self.dataset = dataset
        self.n_support = n_support
        self.n_query = n_query
        self.tokenizer = tokenizer
        self.max_len = max_len

    def sample_task(self):
        # Randomly sample without replacement
        indices = random.sample(range(len(self.dataset)), self.n_support + self.n_query)
        support_indices = indices[:self.n_support]
        query_indices = indices[self.n_support:]

        support_samples = [self.dataset[i] for i in support_indices]
        query_samples = [self.dataset[i] for i in query_indices]

        return self._collate_fn(support_samples), self._collate_fn(query_samples)

    def _collate_fn(self, batch):
        input_ids = torch.stack([item['input_ids'] for item in batch])
        attention_mask = torch.stack([item['attention_mask'] for item in batch])
        labels = torch.tensor([item['labels'] for item in batch], dtype=torch.long)

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }

    def __iter__(self):
        while True:
            yield self.sample_task()


sst2_task_loader = FewShotTaskLoader(sst2_train_dataset, n_support=5, n_query=15)
imdb_task_loader = FewShotTaskLoader(imdb_train_dataset, n_support=5, n_query=15)

task_loaders = {
    'sst2': sst2_task_loader,
    'imdb': imdb_task_loader
}
'''


"\nimport random\nfrom torch.utils.data import Dataset\n\nclass FewShotTaskLoader:\n    def __init__(self, dataset, n_support=5, n_query=15, tokenizer=None, max_len=128):\n        self.dataset = dataset\n        self.n_support = n_support\n        self.n_query = n_query\n        self.tokenizer = tokenizer\n        self.max_len = max_len\n\n    def sample_task(self):\n        # Randomly sample without replacement\n        indices = random.sample(range(len(self.dataset)), self.n_support + self.n_query)\n        support_indices = indices[:self.n_support]\n        query_indices = indices[self.n_support:]\n\n        support_samples = [self.dataset[i] for i in support_indices]\n        query_samples = [self.dataset[i] for i in query_indices]\n\n        return self._collate_fn(support_samples), self._collate_fn(query_samples)\n\n    def _collate_fn(self, batch):\n        input_ids = torch.stack([item['input_ids'] for item in batch])\n        attention_mask = torch.stack([item['attention_mask'

In [23]:
'''
def meta_evaluate(model, task_loader, device, inner_lr=1e-2, inner_steps=5, num_tasks=20):
    model.eval()
    total = 0
    correct = 0

    for _ in range(num_tasks):
        support_set, query_set = task_loader.sample_task()

        support_input_ids = support_set['input_ids'].to(device)
        support_attention_mask = support_set['attention_mask'].to(device)
        support_labels = support_set['labels'].to(device)

        query_input_ids = query_set['input_ids'].to(device)
        query_attention_mask = query_set['attention_mask'].to(device)
        query_labels = query_set['labels'].to(device)

        # with higher.innerloop_ctx(model, torch.optim.SGD(model.parameters(), lr=inner_lr)) as (fmodel, diffopt):
        with higher.innerloop_ctx(model, meta_optimizer, copy_initial_weights=True) as (fmodel, diffopt):
            for _ in range(inner_steps):
                support_preds = fmodel(support_input_ids, support_attention_mask)
                loss = F.cross_entropy(support_preds, support_labels)
                diffopt.step(loss)

            # Evaluate on query
            query_preds = fmodel(query_input_ids, query_attention_mask)
            preds = torch.argmax(query_preds, dim=1)
            total += query_labels.size(0)
            correct += (preds == query_labels).sum().item()

    acc = correct / total
    print(f"✅ Few-Shot Accuracy: {acc:.4f}")
    return acc
'''

'\ndef meta_evaluate(model, task_loader, device, inner_lr=1e-2, inner_steps=5, num_tasks=20):\n    model.eval()\n    total = 0\n    correct = 0\n\n    for _ in range(num_tasks):\n        support_set, query_set = task_loader.sample_task()\n\n        support_input_ids = support_set[\'input_ids\'].to(device)\n        support_attention_mask = support_set[\'attention_mask\'].to(device)\n        support_labels = support_set[\'labels\'].to(device)\n\n        query_input_ids = query_set[\'input_ids\'].to(device)\n        query_attention_mask = query_set[\'attention_mask\'].to(device)\n        query_labels = query_set[\'labels\'].to(device)\n\n        # with higher.innerloop_ctx(model, torch.optim.SGD(model.parameters(), lr=inner_lr)) as (fmodel, diffopt):\n        with higher.innerloop_ctx(model, meta_optimizer, copy_initial_weights=True) as (fmodel, diffopt):\n            for _ in range(inner_steps):\n                support_preds = fmodel(support_input_ids, support_attention_mask)\n         

In [24]:
'''
import gc
gc.collect()
torch.cuda.empty_cache()

clear_gpu()

# Now test meta-learned model on few-shot settings
meta_model = DistilBERTFewShot().to(device)
meta_optimizer = torch.optim.Adam(meta_model.parameters(), lr=1e-4)
meta_train_maml(meta_model, meta_optimizer, task_loaders, device, num_iterations=10)

# Few-shot eval
meta_evaluate(meta_model, sst2_task_loader, device)
meta_evaluate(meta_model, imdb_task_loader, device)
'''

'\nimport gc\ngc.collect()\ntorch.cuda.empty_cache()\n\nclear_gpu()\n\n# Now test meta-learned model on few-shot settings\nmeta_model = DistilBERTFewShot().to(device)\nmeta_optimizer = torch.optim.Adam(meta_model.parameters(), lr=1e-4)\nmeta_train_maml(meta_model, meta_optimizer, task_loaders, device, num_iterations=10)\n\n# Few-shot eval\nmeta_evaluate(meta_model, sst2_task_loader, device)\nmeta_evaluate(meta_model, imdb_task_loader, device)\n'

In [25]:
# clear_gpu()

import torch
import torch.nn as nn
import torch.nn.functional as F
import random
import gc
from transformers import DistilBertModel

gc.collect()
torch.cuda.empty_cache()

# ✅ DistilBERT-based Few-Shot Learner
class DistilBERTFewShot(nn.Module):
    def __init__(self, model_name='distilbert-base-uncased', hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained(model_name, torch_dtype=torch.float32)
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.dropout = nn.Dropout(0.3)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.dropout(outputs.last_hidden_state[:, 0])
        return self.classifier(pooled)

# 🔁 Clone and load weights manually
def clone_model_weights(model):
    return {name: param.clone() for name, param in model.named_parameters() if param.requires_grad}

def load_weights(model, weights):
    for name, param in model.named_parameters():
        if name in weights:
            param.data.copy_(weights[name])

# 🧪 Meta-Training Loop (manual first-order MAML)
def meta_train_maml(model, meta_optimizer, task_loaders, device, num_iterations=100, inner_lr=1e-2, inner_steps=1):
    model.to(device)
    model.train()

    for iteration in range(num_iterations):
        meta_optimizer.zero_grad()
        meta_loss = 0.0

        for task_name, loader in task_loaders.items():
            support_set, query_set = next(iter(loader))

            support_input_ids = support_set['input_ids'].to(device)
            support_attention_mask = support_set['attention_mask'].to(device)
            support_labels = support_set['labels'].to(device)

            query_input_ids = query_set['input_ids'].to(device)
            query_attention_mask = query_set['attention_mask'].to(device)
            query_labels = query_set['labels'].to(device)

            # Save original weights
            orig_weights = clone_model_weights(model)

            # Inner loop: manual SGD on support set
            for _ in range(inner_steps):
                preds = model(support_input_ids, support_attention_mask)
                loss = F.cross_entropy(preds, support_labels)
                grads = torch.autograd.grad(loss, model.parameters(), create_graph=False)
                for (name, param), grad in zip(model.named_parameters(), grads):
                    if grad is not None:
                        param.data -= inner_lr * grad

            # Compute query loss with adapted model
            query_preds = model(query_input_ids, query_attention_mask)
            query_loss = F.cross_entropy(query_preds, query_labels)
            meta_loss += query_loss

            # Restore original weights
            load_weights(model, orig_weights)

        # Meta-update
        meta_loss /= len(task_loaders)
        meta_loss.backward()
        meta_optimizer.step()

        if iteration % 10 == 0:
            print(f"Iteration {iteration}: Meta Loss = {meta_loss.item():.4f}")

# 📦 FewShotTaskLoader
class FewShotTaskLoader:
    def __init__(self, dataset, n_support=5, n_query=15):
        self.dataset = dataset
        self.n_support = n_support
        self.n_query = n_query

    def sample_task(self):
        indices = random.sample(range(len(self.dataset)), self.n_support + self.n_query)
        support_indices = indices[:self.n_support]
        query_indices = indices[self.n_support:]

        support_samples = [self.dataset[i] for i in support_indices]
        query_samples = [self.dataset[i] for i in query_indices]

        return self._collate_fn(support_samples), self._collate_fn(query_samples)

    def _collate_fn(self, batch):
        input_ids = torch.stack([item['input_ids'] for item in batch])
        attention_mask = torch.stack([item['attention_mask'] for item in batch])
        labels = torch.tensor([item['labels'] for item in batch], dtype=torch.long)
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }

    def __iter__(self):
        while True:
            yield self.sample_task()

# 🧪 Meta Evaluation (manual inner-loop)
def meta_evaluate(model, task_loader, device, inner_lr=1e-2, inner_steps=5, num_tasks=20):
    model.eval()
    total = 0
    correct = 0

    for _ in range(num_tasks):
        support_set, query_set = task_loader.sample_task()

        support_input_ids = support_set['input_ids'].to(device)
        support_attention_mask = support_set['attention_mask'].to(device)
        support_labels = support_set['labels'].to(device)

        query_input_ids = query_set['input_ids'].to(device)
        query_attention_mask = query_set['attention_mask'].to(device)
        query_labels = query_set['labels'].to(device)

        orig_weights = clone_model_weights(model)

        for _ in range(inner_steps):
            support_preds = model(support_input_ids, support_attention_mask)
            loss = F.cross_entropy(support_preds, support_labels)
            grads = torch.autograd.grad(loss, model.parameters(), create_graph=False)
            for (name, param), grad in zip(model.named_parameters(), grads):
                if grad is not None:
                    param.data -= inner_lr * grad

        query_preds = model(query_input_ids, query_attention_mask)
        preds = torch.argmax(query_preds, dim=1)
        total += query_labels.size(0)
        correct += (preds == query_labels).sum().item()

        load_weights(model, orig_weights)

    acc = correct / total
    print(f"✅ Few-Shot Accuracy: {acc:.4f}")
    return acc

# 🏁 Training and evaluation setup
sst2_task_loader = FewShotTaskLoader(sst2_train_dataset, n_support=5, n_query=15)
imdb_task_loader = FewShotTaskLoader(imdb_train_dataset, n_support=5, n_query=15)

task_loaders = {
    'sst2': sst2_task_loader,
    'imdb': imdb_task_loader
}

meta_model = DistilBERTFewShot().to(device)
meta_optimizer = torch.optim.Adam(meta_model.parameters(), lr=1e-4)

meta_train_maml(meta_model, meta_optimizer, task_loaders, device, num_iterations=100)

meta_evaluate(meta_model, sst2_task_loader, device)
meta_evaluate(meta_model, imdb_task_loader, device)


  'input_ids': torch.tensor(self.input_ids[idx], dtype=torch.long),
  'attention_mask': torch.tensor(self.attention_mask[idx], dtype=torch.long),
  'labels': torch.tensor(self.labels[idx], dtype=torch.long)


Iteration 0: Meta Loss = 0.7416
Iteration 10: Meta Loss = 0.7834
Iteration 20: Meta Loss = 0.6611
Iteration 30: Meta Loss = 0.7050
Iteration 40: Meta Loss = 0.8117
Iteration 50: Meta Loss = 0.7478
Iteration 60: Meta Loss = 0.7572
Iteration 70: Meta Loss = 0.8500
Iteration 80: Meta Loss = 0.6889
Iteration 90: Meta Loss = 1.0144
✅ Few-Shot Accuracy: 0.6533
✅ Few-Shot Accuracy: 0.5067


0.5066666666666667

In [26]:

# 📦 Install SetFit for contrastive few-shot classification
!pip install -q setfit datasets

from setfit import SetFitModel, SetFitTrainer
from datasets import Dataset

import random

# 🧪 Example: IMDB 5-shot dataset
# Replace this with actual few-shot IMDB samples you extract
'''
train_data = {
    "text": [
        "An amazing and moving film!",
        "Terrible movie, I want my time back.",
        "Absolutely loved it, great acting.",
        "Boring plot, bad characters.",
        "Masterpiece! A must-watch.",
        "Worst movie I’ve ever seen.",
        "Heartwarming and beautiful.",
        "Not worth watching at all.",
        "Outstanding direction and storytelling.",
        "The most disappointing film of the year."
    ],
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
}
train_dataset = Dataset.from_dict(train_data)
'''

# Load IMDB dataset
imdb_dataset = load_dataset("imdb", split="train")

# Shuffle and sample 10 examples (5 positive, 5 negative)
positive_samples = [x for x in imdb_dataset if x['label'] == 1][:5]
negative_samples = [x for x in imdb_dataset if x['label'] == 0][:5]
few_shot_samples = positive_samples + negative_samples
random.shuffle(few_shot_samples)

# Create SetFit-compatible dataset
from datasets import Dataset
train_data = {
    "text": [sample['text'] for sample in few_shot_samples],
    "label": [sample['label'] for sample in few_shot_samples]
}
train_dataset = Dataset.from_dict(train_data)

# 🚀 Load SetFit model for few-shot training
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# 🏋️ Fine-tune with contrastive learning + classification head
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,
    metric="accuracy",
    batch_size=4,
    num_iterations=20,      # Contrastive epochs
    num_epochs=5,           # Classifier epochs
    column_mapping={"text": "text", "label": "label"},
)

trainer.train()

# 📈 Evaluate the trained model
metrics = trainer.evaluate()
print("SetFit (IMDB 5-shot) Evaluation:", metrics)


[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/75.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.5/75.5 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/84.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[?25h

README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/594 [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.52k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.19k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
  trainer = SetFitTrainer(
Applying column mapping to the training dataset
Applying column mapping to the evaluation dataset


Map:   0%|          | 0/10 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 400
  Batch size = 4
  Num epochs = 5


Step,Training Loss
1,0.0841
50,0.079
100,0.0013
150,0.0002
200,0.0002
250,0.0001
300,0.0001
350,0.0001
400,0.0001
450,0.0001


***** Running evaluation *****


Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

SetFit (IMDB 5-shot) Evaluation: {'accuracy': 1.0}


In [27]:

# 🧪 Example: SST-2 5-shot dataset
'''
sst2_data = {
    "text": [
        "A fantastic, feel-good story.",
        "I hated every moment of it.",
        "Great film with solid performances.",
        "An utter waste of time.",
        "Absolutely loved the message it conveyed.",
        "Worst acting I've seen in years.",
        "Beautiful, touching, and inspiring.",
        "Horrible plot and lazy writing.",
        "Such a joy to watch from start to end.",
        "Painful to sit through this disaster."
    ],
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
}
sst2_dataset = Dataset.from_dict(sst2_data)
'''

# 📥 Load SST-2 from GLUE benchmark
sst2_dataset = load_dataset("glue", "sst2", split="train")

# 🔍 Get 5 positive and 5 negative samples
positive_samples = [x for x in sst2_dataset if x['label'] == 1][:5]
negative_samples = [x for x in sst2_dataset if x['label'] == 0][:5]
few_shot_samples = positive_samples + negative_samples
random.shuffle(few_shot_samples)

# ✅ Convert to SetFit format
sst2_data = {
    "text": [sample["sentence"] for sample in few_shot_samples],
    "label": [sample["label"] for sample in few_shot_samples]
}
sst2_dataset = Dataset.from_dict(sst2_data)

# 🔁 Train SetFit model on SST-2 5-shot data
model_sst2 = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer_sst2 = SetFitTrainer(
    model=model_sst2,
    train_dataset=sst2_dataset,
    eval_dataset=sst2_dataset,
    metric="accuracy",
    batch_size=4,
    num_iterations=20,
    num_epochs=5,
    column_mapping={"text": "text", "label": "label"},
)

trainer_sst2.train()
sst2_metrics = trainer_sst2.evaluate()
print("SetFit (SST-2 5-shot) Evaluation:", sst2_metrics)


README.md:   0%|          | 0.00/35.3k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/3.11M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/72.8k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/148k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
  trainer_sst2 = SetFitTrainer(
Applying column mapping to the training dataset
Applying column mapping to the evaluation dataset


Map:   0%|          | 0/10 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 400
  Batch size = 4
  Num epochs = 5


Step,Training Loss
1,0.8125
50,0.1332
100,0.0014
150,0.0004
200,0.0004
250,0.0003
300,0.0003
350,0.0002
400,0.0003
450,0.0003


***** Running evaluation *****


SetFit (SST-2 5-shot) Evaluation: {'accuracy': 1.0}


In [28]:

# 📊 Performance Comparison: MAML vs SetFit

# print("🔁 MAML Performance:")
# These values should be printed during meta-training; alternatively, log them manually.
# e.g., print("MAML Final Meta Loss: ...")

print("\n🔍 SetFit Evaluation Results:")
print("IMDB (SetFit 5-shot):", metrics)
print("SST-2 (SetFit 5-shot):", sst2_metrics)



🔍 SetFit Evaluation Results:
IMDB (SetFit 5-shot): {'accuracy': 1.0}
SST-2 (SetFit 5-shot): {'accuracy': 1.0}


In [29]:

# 📦 Domain-Adversarial Training with Gradient Reversal Layer (DANN)
from torch.autograd import Function

class GradientReversalFunction(Function):
    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambda_ * grad_output, None

class GRL(nn.Module):
    def __init__(self, lambda_=1.0):
        super().__init__()
        self.lambda_ = lambda_

    def forward(self, x):
        return GradientReversalFunction.apply(x, self.lambda_)

class DistilBERTWithDANN(nn.Module):
    def __init__(self, model_name='distilbert-base-uncased', num_labels=2):
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)
        self.domain_classifier = nn.Sequential(
            GRL(lambda_=1.0),
            nn.Linear(self.encoder.config.hidden_size, 128),
            nn.ReLU(),
            nn.Linear(128, 2)  # Domain: IMDB=0, SST-2=1
        )

    def forward(self, input_ids, attention_mask):
        features = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        features = self.dropout(features)
        task_logits = self.classifier(features)
        domain_logits = self.domain_classifier(features)
        return task_logits, domain_logits


In [30]:

# 🔁 Training loop for DANN with both task and domain classification
def train_with_dann(model, dataloader, optimizer, criterion_task, criterion_domain, device, domain_label):
    model.train()
    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        domain = torch.full_like(labels, domain_label).to(device)

        optimizer.zero_grad()
        task_logits, domain_logits = model(input_ids, attention_mask)

        loss_task = criterion_task(task_logits, labels)
        loss_domain = criterion_domain(domain_logits, domain)

        loss = loss_task + loss_domain
        loss.backward()
        optimizer.step()

        print(f"Task Loss: {loss_task.item():.4f}, Domain Loss: {loss_domain.item():.4f}")



In [31]:

# ⚙️ Run DANN training for IMDB and SST-2 few-shot (simulate with same data as SetFit)
from torch.utils.data import DataLoader, TensorDataset

# Simulated tiny dataloaders using same 5-shot data from SetFit stage
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def create_dataloader(texts, labels, batch_size=4):
    encodings = tokenizer(texts, truncation=True, padding=True, return_tensors='pt')
    labels = torch.tensor(labels)
    dataset = TensorDataset(encodings['input_ids'], encodings['attention_mask'], labels)
    return DataLoader(dataset, batch_size=batch_size)

imdb_dl = create_dataloader(train_data["text"], train_data["label"])
sst2_dl = create_dataloader(sst2_data["text"], sst2_data["label"])

# Wrap batches for training loop compatibility
def wrap_dataloader(dl):
    for batch in dl:
        yield {'input_ids': batch[0], 'attention_mask': batch[1], 'labels': batch[2]}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dann_model = DistilBERTWithDANN().to(device)
optimizer = torch.optim.Adam(dann_model.parameters(), lr=2e-5)
criterion_task = nn.CrossEntropyLoss()
criterion_domain = nn.CrossEntropyLoss()

# Train with domain labels: IMDB=0, SST2=1
print("🔁 DANN Training on IMDB (domain=0)")
train_with_dann(dann_model, wrap_dataloader(imdb_dl), optimizer, criterion_task, criterion_domain, device, domain_label=0)
print("🔁 DANN Training on SST-2 (domain=1)")
train_with_dann(dann_model, wrap_dataloader(sst2_dl), optimizer, criterion_task, criterion_domain, device, domain_label=1)


🔁 DANN Training on IMDB (domain=0)
Task Loss: 0.6819, Domain Loss: 0.8305
Task Loss: 0.9070, Domain Loss: 0.9184
Task Loss: 0.6475, Domain Loss: 0.8966
🔁 DANN Training on SST-2 (domain=1)
Task Loss: 0.5431, Domain Loss: 0.5573
Task Loss: 0.8943, Domain Loss: 0.5516
Task Loss: 0.8314, Domain Loss: 0.5125


In [32]:
def evaluate(model, dataloader, device):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch[0].to(device)
            attention_mask = batch[1].to(device)
            labels = batch[2].to(device)

            task_logits, _ = model(input_ids, attention_mask)
            preds = torch.argmax(task_logits, dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    acc = correct / total
    print(f"✅ Task Accuracy: {acc:.4f}")
    return acc

evaluate(dann_model, imdb_dl, device)
evaluate(dann_model, sst2_dl, device)


✅ Task Accuracy: 0.5000
✅ Task Accuracy: 0.5000


0.5


the best performing model overall is:

🧲 SetFit (Contrastive Learning)
With 100% accuracy on both SST2 and IMDB, it outperforms all other models in your few-shot setup.


Why is SetFit the best in this scenario?
🔥 Top Accuracy in Few-Shot Regime:

It achieved perfect classification on both datasets using only 5 examples per class, which is highly impressive.

This shows exceptional generalization, especially when labeled data is extremely limited.

📉 No Need for Large-Scale Fine-Tuning:

Unlike standard transformers, SetFit doesn’t fine-tune the entire model.

Instead, it leverages contrastive learning at the embedding level and only trains a simple classifier, making it lightweight and efficient.

🧠 Pretrained Sentence Embeddings:

It uses Sentence Transformers (e.g., paraphrase-mpnet-base-v2) which are already well-aligned for semantic similarity.

That means even with a tiny few-shot dataset, the embeddings are rich enough to enable strong classification without overfitting.

⚖️ Stability Across Domains:

IMDB and SST2 differ stylistically (reviews vs. short sentiment phrases), but SetFit handles both equally well, showing strong domain generalization.



But is it always the best?
Not necessarily. Here's why:

Real-World Datasets are Noisy: In clean 5-shot datasets, SetFit shines. But in real-world noisy or highly imbalanced settings, MAML or DANN might offer better robustness.

Generalization to New Tasks: MAML, for example, is designed to adapt quickly to new tasks, which could be better in more dynamic multi-domain settings.

TL;DR
Use SetFit if:
- You want superior few-shot performance
- Tasks are in-distribution and semantically consistent
- You value simplicity and fast training

Use MAML or DANN if:
- You need robust meta-learning or domain adaptation
- You're facing high task/domain variability
- You can afford longer training loops

In [None]:
######################

Based on your results, SetFit (5-shot) is the best-performing model in terms of raw accuracy, achieving 1.0 on both SST-2 and IMDB test sets. However, choosing the best model depends on more than just test accuracy—let’s look at it from multiple angles:

Best Performing: SetFit
Accuracy: SST-2 = 1.0, IMDB = 1.0

Few-shot capable: Only needed 5 examples per class.

Why it stands out:

Uses contrastive learning + a lightweight classifier.

Leverages sentence embeddings effectively.

Generalizes extremely well from very little data.

⚠️ Caveat: Accuracy of 1.0 in few-shot can sometimes suggest overfitting or test leakage—make sure that evaluation was done on unseen data, and that no information from the test set was used during training.

Most Robust: Hybrid Multitask / Alternating Batches
Hybrid Multitask: SST-2 = 0.84, IMDB = 0.85

Alternating Batches: SST-2 = 0.84, IMDB = 0.83

Why they stand out:

Consistently strong on both datasets.

Likely learned generalizable features from multi-domain exposure.

No few-shot tricks — trained in full supervised setting.

Stable across runs, no extreme performance drops like the sequential model.

 Less Reliable: Sequential Single-Task
Run 1: IMDB drops to 0.40

Run 2: SST-2 drops to 0.56

Why it struggles:

Shows signs of catastrophic forgetting — training on one task causes degradation on the other.

Order-dependent performance (run 1 vs. run 2).

Moderate Generalization: Few-Shot (MAML)
SST-2 = 0.65, IMDB = 0.51

Impressive for few-shot, but not as strong as SetFit.

Good choice when you want adaptability with minimal data, but less performant than multi-task models with full supervision.



 Underperforming: DANN
Both = 0.50

Possibly failed to align the domains effectively.

Accuracy near random — could mean domain confusion harmed the feature extractor.

If everything is working correctly and you verified the SetFit results aren't overfitting, then SetFit wins. Otherwise, Hybrid Multitask is the most reliable supervised approach.

In [None]:
##########

🧠 How DANN Works (Quick Recap)
DANN tries to:

Classify data correctly (label prediction) — like a normal model.

Confuse a domain classifier — to force shared, domain-invariant features.

So it has:

A feature extractor (e.g., DistilBERT)

A label classifier (for sentiment)

A domain discriminator (to tell SST2 from IMDB)

And uses a gradient reversal layer to discourage domain discrimination.


Reasons for Underperformance
1. No Domain-Invariant Features Learned
If the model couldn’t find a shared representation across SST-2 and IMDB (due to major linguistic/style differences), it might have:

Made the feature extractor "too generic"

Lost task-specific information

Resulted in poor sentiment predictions

Effect: Accuracy = ~0.50 (random guess)

2. Gradient Reversal Weight Too High
If the domain confusion loss dominates the training too early:

The model forgets how to classify sentiment

Only learns to fool the domain classifier

🔧 Fix: Gradually increase the domain loss weight during training (e.g., using a scheduling function like lambda = 2 / (1 + exp(-γ * p)) - 1 from the original DANN paper)

3. Domain Discriminator Too Strong or Too Weak
If it’s too strong, it overpowers the main task.

If too weak, the feature extractor doesn’t learn to align domains.

🔍 Check discriminator accuracy — if it's 100%, that’s a red flag (features still fully domain-specific).

4. Class Imbalance or Data Mismatch
SST-2 and IMDB differ in:

Average review length

Writing style (formal vs. casual)

Sentiment distribution

If your domain adaptation didn't account for that, the model might have failed to generalize.

📊 Balance batch sampling across domains (important in DANN!).

5. Shallow Domain Classifier
If the domain classifier is just a linear layer, it may not give meaningful gradients to shape the feature extractor.

✅ Use a deeper MLP with dropout for better signal.