<a href="https://colab.research.google.com/github/kla55/transformer/blob/main/text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
pip install transformers datasets torch

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m22.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
import numpy as np
import random

In [3]:
def set_seed(seed):
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

set_seed(42)

In [4]:
from datasets import load_dataset

# Load a dataset (e.g., for text generation or summarization)
dataset = load_dataset("wikitext", 'wikitext-103-raw-v1')

for i in range(5):
    print(f"Example {i + 1}: {dataset['train'][i]}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/733k [00:00<?, ?B/s]

train-00000-of-00002.parquet:   0%|          | 0.00/157M [00:00<?, ?B/s]

train-00001-of-00002.parquet:   0%|          | 0.00/157M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/657k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/1801350 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

Example 1: {'text': ''}
Example 2: {'text': ' = Valkyria Chronicles III = \n'}
Example 3: {'text': ''}
Example 4: {'text': ' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n'}
Example 5: {'text': " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard f

In [15]:
# train_ds = dataset['train']
# test_ds = dataset['test']
# validation_ds = dataset['validation']

# Subsample

train_subset_size = 10000
test_subset_size = 1000
validation_subset_size = 1000

train_subset = dataset['train'].shuffle(seed=42).select(range(train_subset_size))
test_subset = dataset['test'].shuffle(seed=42).select(range(test_subset_size))
validation_subset = dataset['validation'].shuffle(seed=42).select(range(validation_subset_size))

train_ds = train_subset
test_ds = test_subset
validation_ds = validation_subset

In [16]:
class WikiTextDataset(Dataset):
    def __init__(self, tokenizer, dataset, max_length=512):
        self.tokenizer = tokenizer
        self.dataset = dataset
        self.max_length = max_length

    def __len__(self):
        return len(self.dataset)

    # def __getitem__(self, idx):
    #     text = self.dataset[idx]['text']
    #     encoding = self.tokenizer(text, truncation=True, padding='max_length', max_length=self.max_length, return_tensors='pt')
    #     input_ids = encoding['input_ids'].squeeze()
    #     attention_mask = encoding['attention_mask'].squeeze()
    #     return {'input_ids': input_ids, 'attention_mask': attention_mask}
    def __getitem__(self, idx):
        text = self.dataset[idx]['text']
        encoding = self.tokenizer(text, truncation=True, padding='max_length', max_length=self.max_length, return_tensors='pt')
        input_ids = encoding['input_ids'].squeeze()
        attention_mask = encoding['attention_mask'].squeeze()
        # we are shifting the input_ids by 1 to create the labels.
        labels = input_ids.clone()
        labels[:-1] = input_ids[1:]
        labels[-1] = -100 # padding token
        return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}


In [17]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
if tokenizer.pad_token is None:
    print("Adding pad token")
    tokenizer.pad_token = tokenizer.eos_token
    # tokenizer.add_special_tokens({'pad_token': '[PAD]'})

In [18]:
train_dataset = WikiTextDataset(tokenizer, train_ds)
test_dataset = WikiTextDataset(tokenizer, test_ds)
validation_dataset = WikiTextDataset(tokenizer, validation_ds)

In [19]:
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False)
validation_dataloader = DataLoader(validation_dataset, batch_size=8, shuffle=False)

In [20]:
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AutoModelForCausalLM.from_pretrained('bert-base-uncased', num_labels=2)
model.to(device)

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


BertLMHeadModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

In [21]:
optimizer = optim.AdamW(model.parameters(), lr=5e-5)
total_steps = len(train_dataset) * 3
scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=5e-5, steps_per_epoch=len(train_dataset), epochs=3)

In [22]:
print(f'Training on device: {device}')

Training on device: cuda


In [None]:
for epoch in range(3):
    print(f"Epoch {epoch+1}/{3}")
    model.train()
    total_loss = 0
    for step, batch in enumerate(train_dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        if step % 100 == 0:
            print(f"Epoch {epoch+1}, Step {step+1}/{len(train_dataloader)}, Loss: {loss.item()}")

        # Calculate average training loss for the epoch
    avg_train_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1}, Average Training Loss: {avg_train_loss}")

    # Evaluation
    model.eval()
    total_val_loss = 0
    with torch.no_grad():
        for batch in validation_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            total_val_loss += loss.item()

    # Calculate average validation loss for the epoch
    avg_val_loss = total_val_loss / len(validation_dataloader)

Epoch 1/3
Epoch 1, Step 1/1250, Loss: 18.33817481994629
Epoch 1, Step 101/1250, Loss: 8.771114349365234
Epoch 1, Step 201/1250, Loss: 4.589552879333496
Epoch 1, Step 301/1250, Loss: 1.8719438314437866
Epoch 1, Step 401/1250, Loss: 2.0080080032348633
Epoch 1, Step 501/1250, Loss: 0.7822039723396301
Epoch 1, Step 601/1250, Loss: 1.173858642578125
Epoch 1, Step 701/1250, Loss: 1.1321109533309937
Epoch 1, Step 801/1250, Loss: 0.7939498424530029
Epoch 1, Step 901/1250, Loss: 1.2959167957305908
Epoch 1, Step 1001/1250, Loss: 0.8100189566612244
Epoch 1, Step 1101/1250, Loss: 1.0478272438049316
Epoch 1, Step 1201/1250, Loss: 1.1483689546585083
Epoch 1, Average Training Loss: 2.646349134218693
Epoch 2/3
Epoch 2, Step 1/1250, Loss: 0.8245216608047485
Epoch 2, Step 101/1250, Loss: 0.7987210154533386
Epoch 2, Step 201/1250, Loss: 0.7036279439926147
Epoch 2, Step 301/1250, Loss: 0.6524927616119385
Epoch 2, Step 401/1250, Loss: 0.520881175994873


## Notes
1. Choose a pretrained LLM
2. Load the WikiText dataset
3. Prep the data: Transform the raw text data into a format that the LLM understands, including tokenization and padding/truncation
4. Setup the fine-tuning environment: Choose an optimizer, loss function and configure the training device
5. Fine-tune Model: The core steps where you can iterate over the trainign data, calculate loss, back propagation and update model weights.
6. Evaluation: Check performance of the finetune model on the validation set.

## Key Explanation Points
Wikitextdata class: Cystom pytorch dataset\
- Tokenization: The tokenizer converts text into IDs that the model can understand. padding='max_length' adds padding token tot he end of each sequence to have the same length. truncation=True ensures sequences longer than max_length are truncated.

- Dataloader: creates batches of data to feed into the model during trainin.
- Model Loading: automodelforcasualLM.from_pretrained() laods the pre-trained language model and moves the model to the GPU.
- Optimizer and Scheduler: AdamW is a popular optimizer for fine-tining transformers. The linear_schedule with warmup provides a learnign rate schedule with a warm-up period to help the model converge better.


In [None]:
# Notes
1.