# Comparing three Transformer-based language models

In this notebook we compare how to interact with BERT, GPT and BART models and the kinds of output they produce.

First we make sure we have the transformers library installed so we can access all three models.

In [None]:
pip install transformers




# BERT model

This code loads a pre-trained BERT model and uses it to classify a sentence into one of two classes (positive or negative). BERT was originally trained for masked language modeling and next-sentence prediction; here we attach a two-class classification head to it. That head is newly initialized rather than fine-tuned, so its predictions are not meaningful until the model is trained on a labelled sentiment dataset.

"bert-base-uncased" is a BERT model that doesn't distinguish between uppercase and lowercase text.
AutoTokenizer loads the tokenizer that converts raw text into tokens for the BERT model.
AutoModelForSequenceClassification loads BERT with a classification head for binary classification (num_labels=2).

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Input Sentence: The weather today is great!
Predicted Class: 0


We define a sentence to classify and convert it into inputs the model can understand.
return_tensors="pt": returns the token IDs as PyTorch tensors, which the PyTorch version of the model expects.
padding=True: pads every sequence in a batch to the length of the longest one (with a single sentence, as here, it has no effect).
truncation=True: cuts off text that exceeds the maximum length.
max_length=512: sets the maximum input length to 512 tokens, which is BERT's limit.
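
To make the effect of truncation and padding concrete, here is a minimal pure-Python sketch that operates on lists of token IDs. It is not the tokenizer's actual implementation: the helper `pad_and_truncate` and the token IDs are hypothetical, and a real BERT tokenizer also keeps the closing `[SEP]` token when it truncates.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token IDs for two sentences of different lengths
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2936, 6251, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)
print(input_ids)       # [[101, 7592, 102, 0, 0], [101, 2023, 2003, 1037, 2936]]
print(attention_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The attention mask is what lets the model tell real tokens apart from padding.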

In [None]:
# Example sentence
sentence = "I love AI security!"

# Tokenize input
inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=512)

We pass the tokenized input through BERT and extract the logits from the output. Logits are raw, unnormalized prediction scores, one per class, which we convert into a class prediction in the next step.

In [None]:
# Forward pass through the model (gradients are not needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

We take the index of the highest-scoring class (0 or 1) and convert it to a Python integer. If class 0 means "Negative" and class 1 means "Positive", this tells us whether the model considers the sentence negative or positive.
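
As an illustration of how logits relate to a prediction, the following pure-Python sketch turns two logits into probabilities with a softmax and then picks the most likely class. The logit values and the label mapping are made up for the example; they are not output from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-0.3, 1.2]  # hypothetical scores for class 0 and class 1
probs = softmax(logits)

# argmax: index of the largest probability (same index as the largest logit)
predicted_class = max(range(len(probs)), key=probs.__getitem__)

labels = {0: "Negative", 1: "Positive"}  # assumed label mapping
print(probs, labels[predicted_class])
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits directly gives the same class.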

In [None]:
# Prediction
predicted_class = torch.argmax(logits, dim=1).item()

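We print the results. Keep in mind that the classification head was randomly initialized (see the warning printed when the model was loaded), so the predicted class is essentially arbitrary until the model is fine-tuned on a sentiment dataset.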

In [None]:
# Output
print(f"Input Sentence: {sentence}")
print(f"Predicted Class: {predicted_class}")

# GPT model

This code loads a pre-trained GPT-2 model and generates text based on a given prompt.

AutoTokenizer automatically loads the tokenizer for the GPT-2 model and AutoModelForCausalLM loads GPT-2 for text generation.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load pre-trained GPT-2 model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Input Prompt: Once upon a time in a magical forest,
Generated Text: Once upon a time in a magical forest, the wizard who was the greatest ruler of the land had just returned from the Great War. His name was Harry Potter (or J.K. Rowling? I still remember the one whose name is often mistaken


We define our input prompt and tokenize it into numerical input IDs for GPT-2.
return_tensors="pt" converts the tokens into a PyTorch tensor.
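
The idea of mapping text to input IDs and back can be sketched with a toy word-level tokenizer. This is only an illustration under simplifying assumptions: the vocabulary and the `encode`/`decode` helpers are hypothetical, whereas the real GPT-2 tokenizer uses byte-level byte-pair encoding over a vocabulary of roughly 50,000 subword tokens.

```python
# Hypothetical toy vocabulary; unknown words map to <unk>
vocab = {"<unk>": 0, "my": 1, "favourite": 2, "thing": 3,
         "about": 4, "ai": 5, "security": 6, "is": 7}
id_to_token = {i: t for t, i in vocab.items()}


def encode(text):
    """Lowercase, split on whitespace, and look up each word's ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(ids):
    """Map IDs back to tokens and join them into a string."""
    return " ".join(id_to_token[i] for i in ids)


ids = encode("My favourite thing about AI security is")
print(ids)          # [1, 2, 3, 4, 5, 6, 7]
print(decode(ids))  # my favourite thing about ai security is
```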

In [None]:
# Input prompt
prompt = "My favourite thing about AI security is"

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt")

max_length=50 caps the total sequence (prompt plus generated tokens) at 50 tokens.
num_return_sequences=1 tells the model to return 1 generated sequence.
do_sample=True enables random sampling from the model's predicted token distribution, so the output varies from run to run instead of being deterministic.
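
The difference between deterministic (greedy) decoding and sampling, and the way max_length counts the prompt, can be sketched with a toy generator. Everything here is an assumption made for illustration: the vocabulary, the fixed probabilities and the `generate` helper are hypothetical, and a real model recomputes the distribution at every step and can stop early at an end-of-sequence token.

```python
import random

# Hypothetical next-token distribution, kept the same at every step for simplicity
vocab = ["the", "model", "data", "attack"]
probs = [0.5, 0.3, 0.15, 0.05]


def generate(prompt_tokens, max_length, do_sample, rng=None):
    """Append tokens until the total length (prompt included) reaches max_length."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        if do_sample:
            # Draw a token at random, weighted by its probability
            next_token = rng.choices(vocab, weights=probs, k=1)[0]
        else:
            # Greedy decoding: always take the most probable token
            next_token = vocab[probs.index(max(probs))]
        tokens.append(next_token)
    return tokens


prompt = ["AI", "security", "is"]
greedy = generate(prompt, max_length=8, do_sample=False)
sampled = generate(prompt, max_length=8, do_sample=True, rng=random.Random(0))
print(greedy)   # the same output on every run
print(sampled)  # depends on the random seed
```

With a prompt of 3 tokens and max_length=8, only 5 new tokens are produced in either mode.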

In [None]:
# Generate text
output = model.generate(**inputs, max_length=50, num_return_sequences=1, do_sample=True)

We convert the tokenized output back into readable text, and print it.

In [None]:
# Decode generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

# Output
print(f"Input Prompt: {prompt}")
print(f"Generated Text: {generated_text}")

# BART model

This code loads a pre-trained BART model and summarizes a given text into a shorter version. BART is an encoder-decoder (Seq2Seq) model that is well suited to summarization and other text generation tasks. The "facebook/bart-large-cnn" checkpoint used here has been fine-tuned for summarization on the CNN/DailyMail news dataset.

AutoTokenizer loads the matching tokenizer for the pre-trained BART model.
AutoModelForSeq2SeqLM loads a sequence-to-sequence model such as BART, an architecture used for summarization, translation, and other text generation tasks.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load pre-trained BART model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")


Original Text: 
The field of artificial intelligence has been rapidly evolving over the past few decades, 
with significant advancements in machine learning and deep learning technologies. 
These advancements have enabled AI to outperform humans in specific tasks such as image recognition, 
natural language processing, and even complex strategy games.

Generated Summary: The field of artificial intelligence has been rapidly evolving over the past few decades. Machine learning and deep learning technologies have enabled AI to outperform humans in specific tasks.


In [None]:
# Example long text
text = """
The field of artificial intelligence has been rapidly evolving over the past few decades,
with significant advancements in machine learning and deep learning technologies.
AI security is the field devoted to securing these models from external threats.
"""

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=1024)

BART has a maximum input length of 1024 tokens; truncation=True in the tokenizer call above ensures that longer inputs are cut to fit.
max_length=50: limits the summary to at most 50 tokens.
min_length=20: ensures the summary is at least 20 tokens long.
length_penalty=2.0: biases beam search toward longer summaries (values above 0 favor longer outputs, values below 0 favor shorter ones).
num_beams=4: uses beam search with 4 beams, keeping the 4 most promising candidate summaries at each step and returning the highest-scoring one.
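
To see why a larger length_penalty favors longer summaries, here is a small sketch of length-normalized beam scoring. It assumes a candidate's score is its summed log-probability divided by length ** length_penalty, which is our understanding of how beam search in transformers ranks finished candidates; the two candidates and their numbers are made up.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score; log-probabilities are negative, so closer to 0 is better."""
    return sum_logprobs / (length ** length_penalty)


# Hypothetical finished candidates: (summed log-probability, length in tokens)
candidates = {"short": (-4.0, 4), "long": (-6.0, 8)}


def best(length_penalty):
    """Return the name of the highest-scoring candidate for a given penalty."""
    return max(candidates,
               key=lambda name: beam_score(*candidates[name], length_penalty))


for lp in (0.0, 1.0, 2.0):
    print(lp, best(lp))
# 0.0 short
# 1.0 long
# 2.0 long
```

With no normalization (0.0) the shorter candidate wins because it has accumulated fewer negative log-probabilities; dividing by a power of the length offsets that advantage.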


In [None]:
# Generate summary
summary_ids = model.generate(**inputs, max_length=50, min_length=20, length_penalty=2.0, num_beams=4)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Output
print(f"Original Text: {text}")
print(f"Generated Summary: {summary}")