# Transformers - Part II

## Tasks

### Task 1

Create embeddings for examples from a subset of the `IMDB` dataset using a pre-trained model from Hugging Face.  
In this task, compute them with the `BERT` model (`bert-base-cased`) via the `get_embeddings_labels` function.  
Before submitting, check that the embeddings tensor has shape (200, 768).  

In [1]:
import torch
import numpy as np
from datasets import load_dataset
from torch.utils.data import DataLoader, Subset
from transformers import AutoTokenizer, BertModel, RobertaModel, DistilBertModel
from tqdm import tqdm

In [3]:
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [4]:
# Function to get model and tokenizer
def get_model(model_name):
    assert model_name in ['bert', 'roberta', 'distilbert']
    
    checkpoint_names = {
        'bert': 'bert-base-cased',
        'roberta': 'roberta-base',
        'distilbert': 'distilbert-base-cased'
    }
    
    model_classes = {
        'bert': BertModel,
        'roberta': RobertaModel,
        'distilbert': DistilBertModel
    }
    
    checkpoint = checkpoint_names[model_name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = model_classes[model_name].from_pretrained(checkpoint)

    return tokenizer, model

In [5]:
# Get BERT model and tokenizer
tokenizer, model = get_model('bert')
model = model.to(device)

In [6]:
# Function to get embeddings
@torch.inference_mode()
def get_embeddings_labels(model, loader):
    model.eval()
    
    total_embeddings = []
    labels = []
    
    for batch in tqdm(loader):
        labels.append(batch['labels'].unsqueeze(1))

        batch = {key: batch[key].to(device) for key in ['attention_mask', 'input_ids']}

        embeddings = model(**batch)['last_hidden_state'][:, 0, :]

        total_embeddings.append(embeddings.cpu())

    return torch.cat(total_embeddings, dim=0), torch.cat(labels, dim=0).to(torch.float32)
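The slice `[:, 0, :]` in `get_embeddings_labels` keeps only the hidden state at sequence position 0, which is the `[CLS]` token for BERT and DistilBERT and the `<s>` token for RoBERTa. A minimal sketch of the shape change, using a random stand-in tensor instead of real model output (the small batch and sequence sizes are illustrative assumptions):

```python
import torch

# Stand-in for model(**batch)['last_hidden_state'];
# the layout is (batch_size, sequence_length, hidden_size).
batch_size, seq_len, hidden_size = 4, 8, 768
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Keep only the first token's vector for every example in the batch.
first_token = last_hidden_state[:, 0, :]

print(first_token.shape)  # torch.Size([4, 768])
```

Concatenating these per-batch tensors along `dim=0` is what yields one 768-dimensional row per review.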

In [7]:
# Load the IMDB dataset
dataset = load_dataset('imdb', split='train')

# Generate 200 random indices
np.random.seed(100)
idx = np.random.randint(len(dataset), size=200).tolist()
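Note that `np.random.randint` draws with replacement, so the 200 indices are not guaranteed to be distinct. The sketch below counts how many distinct indices the seeded draw produces; the hard-coded split size of 25,000 is an assumption made so the snippet runs without downloading the dataset.

```python
import numpy as np

n_train = 25_000  # assumed size of the IMDB train split, hard-coded for an offline run

np.random.seed(100)
idx = np.random.randint(n_train, size=200)

# Count distinct values; fewer than 200 means some reviews appear more than once.
n_unique = len(np.unique(idx))
print(f"{n_unique} distinct indices out of {len(idx)}")
```

Duplicates do not affect the required (200, 768) shape, since the subset still contains 200 entries.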

In [8]:
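# Define a collate function for the DataLoader.
# Note: it reads the global `tokenizer` at call time, so reassigning
# `tokenizer` in a later cell changes how an existing loader tokenizes.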
def collate_fn(batch):
    texts = [item['text'] for item in batch]
    labels = [item['label'] for item in batch]
    
    encoded = tokenizer(
        texts, 
        padding='max_length', 
        truncation=True, 
        max_length=512, 
        return_tensors='pt'
    )
    
    encoded['labels'] = torch.tensor(labels)
    
    return encoded

In [9]:
# Create a subset of the dataset
subset = Subset(dataset, idx)

# Create a DataLoader with shuffle=False as required
loader = DataLoader(subset, batch_size=16, shuffle=False, collate_fn=collate_fn)
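As a quick sanity check on the loader, the number of batches follows from the subset size and batch size. This is a small arithmetic sketch, assuming the `DataLoader` default `drop_last=False`, under which the final partial batch is kept:

```python
import math

n_samples, batch_size = 200, 16

# With drop_last=False the last, smaller batch is still yielded.
n_batches = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (n_batches - 1) * batch_size

print(n_batches, last_batch_size)  # 13 8
```

This matches the `13/13` shown by the `tqdm` progress bar when the embeddings are computed.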

In [10]:
# Get embeddings and labels
embeddings, labels = get_embeddings_labels(model, loader)

# Check the shape of the embeddings
embedding_shape = embeddings.shape
print(f'Embeddings shape: {embedding_shape}')

# Save the embeddings to a file
torch.save(embeddings, 'bert_embeddings.pt')

100%|██████████| 13/13 [03:03<00:00, 14.15s/it]

Embeddings shape: torch.Size([200, 768])
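Before submitting, it may be worth reloading the saved file and checking its shape rather than relying on the in-memory tensor. The helper `check_saved_embeddings` below is a hypothetical name, not part of the assignment, and the sketch uses a zero-filled stand-in tensor in a temporary directory so it runs on its own.

```python
import os
import tempfile

import torch


def check_saved_embeddings(path, expected_shape=(200, 768)):
    """Reload a saved tensor and confirm its shape (hypothetical helper)."""
    loaded = torch.load(path)
    assert tuple(loaded.shape) == expected_shape, f"unexpected shape {tuple(loaded.shape)}"
    return loaded


# Stand-in tensor so the sketch runs without the real embeddings file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dummy_embeddings.pt")
    torch.save(torch.zeros(200, 768), path)
    reloaded = check_saved_embeddings(path)
```

In the notebook itself the call would be `check_saved_embeddings('bert_embeddings.pt')`, and likewise for the other saved files.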





### Task 2

Create embeddings for objects from a subset of the `IMDB` dataset using pre-trained models from Hugging Face.  
In this task, compute them with the `RoBERTa` model (`roberta-base`) via the `get_embeddings_labels` function.  
Before submitting, check that the embeddings tensor has shape (200, 768).  

In [11]:
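# Get RoBERTa model and tokenizer; the existing `loader` picks up the new
# global `tokenizer` through `collate_fn`
tokenizer, model = get_model('roberta')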
model = model.to(device)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
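The same `loader` is reused for every model. That works because `collate_fn` looks up the global name `tokenizer` each time it is called, not when it is defined. A minimal pure-Python sketch of that behavior, with strings standing in for the real tokenizers, alongside a hypothetical `make_collate_fn` that binds the tokenizer explicitly:

```python
from functools import partial

tokenizer = "bert-tokenizer"  # stand-in for the real tokenizer object


def collate_fn(batch):
    # `tokenizer` is resolved at call time from the module globals.
    return f"{tokenizer} <- {batch}"


before = collate_fn("texts")
tokenizer = "roberta-tokenizer"  # reassigning the global changes later calls
after = collate_fn("texts")


def _collate_with(tok, batch):
    return f"{tok} <- {batch}"


def make_collate_fn(tok):
    """Hypothetical alternative: fix the tokenizer when the loader is built."""
    return partial(_collate_with, tok)


bound = make_collate_fn("bert-tokenizer")
tokenizer = "distilbert-tokenizer"
still_bert = bound("texts")  # unaffected by the global reassignment
```

Relying on the global keeps the notebook short, but it also means a loader silently follows whatever `tokenizer` currently points to; explicit binding trades a little code for predictability.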


In [12]:
# Get embeddings and labels
embeddings, labels = get_embeddings_labels(model, loader)

# Check the shape of the embeddings
embedding_shape = embeddings.shape
print(f'Embeddings shape: {embedding_shape}')

# Save the embeddings to a file
torch.save(embeddings, 'roberta_embeddings.pt')

100%|██████████| 13/13 [03:00<00:00, 13.87s/it]

Embeddings shape: torch.Size([200, 768])





### Task 3

Create embeddings for objects from a subset of the `IMDB` dataset using pre-trained models from Hugging Face.  
In this task, compute them with the `DistilBERT` model (`distilbert-base-cased`) via the `get_embeddings_labels` function.  
Before submitting, check that the embeddings tensor has shape (200, 768).  

In [13]:
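# Get DistilBERT model and tokenizer; the existing `loader` picks up the new
# global `tokenizer` through `collate_fn`
tokenizer, model = get_model('distilbert')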
model = model.to(device)

In [14]:
# Get embeddings and labels
embeddings, labels = get_embeddings_labels(model, loader)

# Check the shape of the embeddings
embedding_shape = embeddings.shape
print(f'Embeddings shape: {embedding_shape}')

# Save the embeddings to a file
torch.save(embeddings, 'distilbert_embeddings.pt')

100%|██████████| 13/13 [01:29<00:00,  6.85s/it]

Embeddings shape: torch.Size([200, 768])



