<a href="https://colab.research.google.com/github/marekrei/ml-examples/blob/main/ML_examples_03_language_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Examples - 03 - Language Models

## Text classification using a BERT encoder model

Based on https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/chapter3/section4.ipynb

This section fine-tunes a pretrained BERT model for binary sentiment classification on the SST-2 dataset.

In [17]:
# Text classification example with BERT
# Created by Marek Rei
# Based on https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/chapter3/section4.ipynb
# Training the model for binary sentiment detection, using the SST2 dataset.

# Some settings
# Which pre-trained model to use.
# See https://huggingface.co/models for options.
checkpoint = "bert-base-uncased"

# How much training data to use.
# 1.0 uses the whole training set but it can take a bit of time to train.
train_data_sample_ratio = 0.1

# Example sentence to use
# We print out predictions for this sentence before and after training
example_sentence = "this was by far the best movie of the year"

In [18]:
# Install the necessary libraries
!pip install datasets evaluate transformers[sentencepiece]



In [19]:
# Import the libraries
import torch
import evaluate

from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
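# AdamW in transformers is deprecated and missing from newer releases, so use the PyTorch implementation
from torch.optim import AdamW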
from transformers import AutoModelForSequenceClassification
from transformers import get_scheduler
from tqdm.auto import tqdm

In [20]:
# Checking whether you are running on CPU or GPU.
# If the output here says "cuda" then it's running on GPU. Otherwise it's probably CPU.
# In order to run your code in Colab on the GPU, go to Edit -> Notebook settings -> Hardware accelerator and set it to "GPU".
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)

cuda


In [21]:
# Loading the pretrained model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model = model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
# Load the data
raw_datasets = load_dataset("glue", "sst2")
raw_datasets.cleanup_cache_files()

# Optionally keep only part of the training data.
# Note that this takes the first N examples rather than a random sample.
if train_data_sample_ratio < 1.0:
    num_training_examples = int(train_data_sample_ratio * len(raw_datasets["train"]))
    raw_datasets["train"] = load_dataset("glue", "sst2", split=f"train[:{num_training_examples}]")

# Perform tokenization
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Remove the columns that the model does not accept as inputs,
# and rename "label" to "labels", which is the argument name the model expects
tokenized_datasets = tokenized_datasets.remove_columns(["sentence", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

# DataCollatorWithPadding constructs batches that are padded to the length of the longest sentence in the batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

Map:   0%|          | 0/6734 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Map:   0%|          | 0/1821 [00:00<?, ? examples/s]
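
The dynamic padding that `DataCollatorWithPadding` performs can be illustrated without any libraries. The sketch below is a simplified stand-in, not the library's implementation: the helper name `pad_batch`, the padding id `0` and the token ids are assumptions chosen for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids, for illustration only
batch = pad_batch([[101, 2023, 102], [101, 2023, 2001, 2011, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padding length is chosen per batch, a batch of short sentences produces small tensors, which is why the sequence dimension differs from one batch to the next.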

In [23]:
# Printing out the shapes in one batch
example_batch = None
for batch in train_dataloader:
    example_batch = batch
    break

print({k: v.shape for k, v in example_batch.items()})


# Then printing out the loss, output shape and output values from one batch.
outputs = model(**example_batch.to(device))
print("output.loss: ", outputs.loss)
print("output.logits.shape: ", outputs.logits.shape)
print("output.logits: ", outputs.logits)

# Generating predictions for an example sentence.
# The classification head is still randomly initialised at this point, so the prediction is essentially arbitrary.
def print_example_predictions(example_sentence, example_model):
    _e = tokenize_function({"sentence": example_sentence})
    _k = {k: torch.tensor([_e[k]]).to(device) for k in _e}
    # Use the model passed in as an argument rather than the global `model`
    example_model.eval()
    # Gradients are not needed for inference
    with torch.no_grad():
        example_outputs = example_model(**_k)
    example_logits = example_outputs.logits.cpu().numpy()
    example_probabilities = torch.nn.functional.softmax(example_outputs.logits, dim=1).cpu().numpy()
    print(example_probabilities)
    print("Example sentence: ", example_sentence)
    print("Predicted logits: ", example_logits)
    print("Predicted probabilities: ", example_probabilities)
    print("Prediction: ", "negative" if example_probabilities[0][0] > example_probabilities[0][1] else "positive")

print_example_predictions(example_sentence, model)


{'labels': torch.Size([8]), 'input_ids': torch.Size([8, 35]), 'token_type_ids': torch.Size([8, 35]), 'attention_mask': torch.Size([8, 35])}
output.loss:  tensor(0.7004, device='cuda:0', grad_fn=<NllLossBackward0>)
output.logits.shape:  torch.Size([8, 2])
output.logits:  tensor([[ 0.2484,  0.0826],
        [ 0.1475,  0.1194],
        [ 0.2153, -0.1583],
        [ 0.2415,  0.1186],
        [ 0.3401, -0.0177],
        [ 0.2702,  0.0553],
        [ 0.2532,  0.0840],
        [ 0.3793,  0.0121]], device='cuda:0', grad_fn=<AddmmBackward0>)
[[0.5697866  0.43021342]]
Example sentence:  this was by far the best movie of the year
Predicted logits:  [[ 0.21315879 -0.06782169]]
Predicted probabilities:  [[0.5697866  0.43021342]]
Prediction:  negative
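
The printed probabilities are the softmax of the two logits, and the initial loss of about 0.70 is close to ln 2 ≈ 0.693, the cross-entropy of a two-class model that is undecided. The pure-Python sketch below reproduces these numbers; the helper names `softmax` and `cross_entropy` are illustrative and not part of the notebook.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

logits = [0.21315879, -0.06782169]  # copied from the printed output above
probs = softmax(logits)
print(probs)
print(cross_entropy([0.0, 0.0], 0), math.log(2))
```

With equal logits the loss is exactly ln 2, so a starting loss near 0.69 is a quick sanity check that the untrained classifier head is not yet favouring either class strongly.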


In [24]:
# Setting up model training for fine-tuning
optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
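
With `num_warmup_steps=0`, the `"linear"` schedule decays the learning rate in a straight line from its initial value to zero over the training steps. The sketch below is a simplified reimplementation for illustration, not the library code; the function name `linear_schedule_lr` is an assumption, and the step count of 2526 assumes 3 epochs of 842 batches (6734 examples in batches of 8).

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` updates: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)

num_training_steps = 2526  # assumed: 3 epochs * 842 batches
for step in (0, 1263, 2526):
    print(step, linear_schedule_lr(step, 5e-5, num_training_steps))
```

Halfway through training the learning rate is half of its starting value, and it reaches zero at the final step.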

In [25]:
# Setting the model to training mode
model.train()

# Running the training
progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/2526 [00:00<?, ?it/s]

In [26]:
# Setting the model to evaluation mode
model.eval()

# Running evaluation
metric = evaluate.load("glue", "sst2")
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

print(metric.compute())

{'accuracy': 0.9048165137614679}
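
For SST-2 the GLUE metric reports accuracy: the fraction of examples where the argmax of the logits equals the reference label. The value above is consistent with 789 of the 872 validation sentences being classified correctly. The sketch below shows the same batch-wise accumulation in pure Python; the helper names and the toy logits are made up for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy_over_batches(batches):
    """Accumulate correct predictions across batches, then divide by the total."""
    correct = 0
    total = 0
    for logits_batch, labels_batch in batches:
        predictions = [argmax(row) for row in logits_batch]
        correct += sum(int(p == y) for p, y in zip(predictions, labels_batch))
        total += len(labels_batch)
    return correct / total

# Made-up logits and labels: two batches of two examples each
toy_batches = [
    ([[2.0, -1.0], [0.1, 0.3]], [0, 1]),
    ([[-0.5, 0.5], [1.2, 0.4]], [0, 0]),
]
print(accuracy_over_batches(toy_batches))  # 0.75
```

Accumulating counts and dividing once at the end gives the correct result even when the last batch is smaller than the others, which averaging per-batch accuracies would not.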


In [27]:
# Getting predictions for the example sentence again, now that we have trained the model
print_example_predictions(example_sentence, model)

[[4.4249822e-04 9.9955744e-01]]
Example sentence:  this was by far the best movie of the year
Predicted logits:  [[-4.057501   3.6651306]]
Predicted probabilities:  [[4.4249822e-04 9.9955744e-01]]
Prediction:  positive


## Generating output from a language model

In [28]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Define the model name
MODEL_NAME = "gpt2"

# Load the tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Ensure pad token is set (some models might not have one by default)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

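# Load the model
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",  # Places the model on the GPU if one is available (requires the accelerate package)
    # float16 saves GPU memory; fall back to float32 on CPU, where half precision is poorly supported
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)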

# Define the example input
example_input = "Explain the importance of renewable energy in mitigating climate change."

# Tokenize the input with padding and attention mask
print("Tokenizing input...")
inputs = tokenizer(
    example_input,
    return_tensors="pt",
    padding=True,  # Adds padding if necessary
    truncation=True,  # Truncates input if it exceeds model's max length
    max_length=512  # Adjust based on your model's max input length
)

# Move inputs to the same device as the model
device = model.device
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate output
print("Generating output...")
generation_output = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # Pass the attention mask
    max_length=200,  # Total length in tokens, counting the prompt as well as the generated text
    temperature=0.7,  # Control randomness in the output
    top_p=0.9,  # Use nucleus sampling
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id  # Explicitly set the pad token ID
)

# Decode the output
output_text = tokenizer.decode(generation_output[0], skip_special_tokens=True)

# Print the output
print("\nGenerated Output:")
print(output_text)


Loading tokenizer...
Loading model...
Tokenizing input...
Generating output...

Generated Output:
Explain the importance of renewable energy in mitigating climate change.

The U.N. Intergovernmental Panel on Climate Change (IPCC) recently published a report that estimates that greenhouse gas emissions could be as high as 6.7 percent of the total global economy in 2050. It's clear that it is a problem that must be addressed.

The U.N. Climate Change Working Group (UNGCHA) is the world's leading international group of climate scientists, which has helped develop the world's most comprehensive climate policy and to develop the scientific consensus on climate change.

The IPCC is responsible for establishing, regulating and assessing the climate, and is responsible for developing the consensus on policy responses to climate change.

The U.N. Working Group is the world's leading international group of climate scientists, which has helped develop the world's most comprehensive climate policy
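
The `temperature` and `top_p` settings used for generation can be sketched in pure Python. This is a simplified illustration of temperature scaling followed by nucleus (top-p) filtering, not the exact procedure inside `generate`; the function names and the toy logits are assumptions.

```python
import math
import random

def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} for the nucleus, renormalised to sum to 1."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in nucleus)
    return {i: probs[i] / mass for i in nucleus}

def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token index from the filtered distribution."""
    nucleus = top_p_filter(logits, temperature, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

toy_logits = [5.0, 1.0, 0.5, -2.0]  # made-up scores over a four-token vocabulary
print(top_p_filter(toy_logits))
print(sample_token(toy_logits, rng=random.Random(0)))
```

When one token dominates, as in the toy logits, the nucleus shrinks to that single token and sampling becomes deterministic; with flatter logits more candidates survive, and the output varies between runs.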