### Transformer based LLMs
The transformer based LLMs are widely used in natural language processing tasks such as machine translation, question answering, and text generation.

In [1]:
import warnings
warnings.filterwarnings('ignore')

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### 1. Use a pre-trained BERT variant model
The following examples are shown to use a pre-trained BERT variant model for the mask word prediction (text generation) task. The pre-trained model is downloaded from the Hugging Face Transformers library. Note that the models are not perfect and may not work well in all cases. However, they can provide a good starting point for further research and development.

In [2]:
# Tokenize text data using the corresponding tokenizer for each model
from transformers import BertTokenizer, BertForMaskedLM
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertForMaskedLM.from_pretrained('bert-base-uncased')

from transformers import RobertaTokenizer, RobertaForMaskedLM
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
roberta_model = RobertaForMaskedLM.from_pretrained('roberta-base')

from transformers import DistilBertTokenizer, DistilBertForMaskedLM
distilbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
distilbert_model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')

sentence = "NTU's EE6405 course is an awesome [MASK]."

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another archite

In [3]:
for tokenizer, model in zip([bert_tokenizer, roberta_tokenizer, distilbert_tokenizer], 
                            [bert_model, roberta_model, distilbert_model]):
    print('Model:', model.config.name_or_path)
    model.to(device).eval()
    # A difference in the default setting of the mask token
    if model.config.name_or_path == 'roberta-base':
        inputs = tokenizer(sentence.replace('[MASK]', '<mask>'), return_tensors="pt")
    else:
        inputs = tokenizer(sentence, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs.to(device)).logits

        # Extract the mask token index
        mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

        # Predict the top 5 tokens for the masked position
        predicted_token_ids = logits[0, mask_token_index, :].topk(3, dim=1).indices.squeeze()

        # Convert token ids to words
        predicted_words = [tokenizer.decode([token_id]).strip() for token_id in predicted_token_ids]

        # Print the predictions
        for word in predicted_words:
            print(f"Sentence: {sentence.replace('[MASK]', word)}")

Model: bert-base-uncased
Sentence: NTU's EE6405 course is an awesome experience.
Sentence: NTU's EE6405 course is an awesome course.
Sentence: NTU's EE6405 course is an awesome one.
Model: roberta-base
Sentence: NTU's EE6405 course is an awesome workout.
Sentence: NTU's EE6405 course is an awesome experience.
Sentence: NTU's EE6405 course is an awesome resource.
Model: distilbert-base-uncased
Sentence: NTU's EE6405 course is an awesome experience.
Sentence: NTU's EE6405 course is an awesome feat.
Sentence: NTU's EE6405 course is an awesome attraction.


In [4]:
# alternatively, we can call the model with a pipeline
from transformers import pipeline
distilroberta_model = pipeline('fill-mask', model='distilroberta-base')
predictions = distilroberta_model(sentence.replace('[MASK]', '<mask>'))
print()
print(f'Sentence with the best prediction:\n{sentence.replace("[MASK]", predictions[0]["token_str"].strip())}')
for prediction in predictions:
    print(f"Predicted word: {prediction['token_str']}, Score: {prediction['score']:.4f}")

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0



Sentence with the best prediction:
NTU's EE6405 course is an awesome resource.
Predicted word:  resource, Score: 0.2637
Predicted word:  experience, Score: 0.0283
Predicted word:  example, Score: 0.0259
Predicted word:  tool, Score: 0.0250
Predicted word:  addition, Score: 0.0220


### 2. Use a pre-trained generative model
In this section, we will demonstrate how to use a pre-trained generative model for language translation, text completion, and multiple choice question answering. Note that the models are not perfect and may not work well in all cases. However, they can provide a good starting point for further research and development.

In [5]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')

from transformers import BartTokenizer, BartForConditionalGeneration
bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
bart_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large', 
                                                     forced_bos_token_id=0) # takes a while to load 

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [6]:
sentence1 = "translate English to German: Good morning. How are you?"
sentence2 = "We study NLP and learn to <mask> in EE6405."
for tokenizer, model, sent in zip([t5_tokenizer, bart_tokenizer], 
                                  [t5_model, bart_model],
                                  [sentence1, sentence2]):
    print('Model:', model.config.name_or_path)
    model.to(device).eval()
    with torch.no_grad():
        input_ids = tokenizer.encode(sent, return_tensors='pt')
        output = model.generate(input_ids=input_ids.to(device), max_length=50, num_beams=2, early_stopping=True)
        print('Input:', sent)
        print('Output:', tokenizer.decode(output[0], skip_special_tokens=True))

Model: t5-base
Input: translate English to German: Good morning. How are you?
Output: Guten Morgen, wie sind Sie?
Model: facebook/bart-large
Input: We study NLP and learn to <mask> in EE6405.
Output: We study NLP and learn to use it in EE6405.


In [7]:
from transformers import AutoTokenizer, XLNetForMultipleChoice
tokenizer = AutoTokenizer.from_pretrained("xlnet/xlnet-base-cased")
model = XLNetForMultipleChoice.from_pretrained('xlnet/xlnet-base-cased').to(device).eval()

Some weights of XLNetForMultipleChoice were not initialized from the model checkpoint at xlnet/xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
question = "What does NLP stands for"
answers = ["Computer Vision", "Natural Language Processing"]
encoding = tokenizer([question]*len(answers), answers, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()}) 
    logits = outputs.logits
    softmax = torch.nn.functional.softmax(logits, dim = -1)
    index = torch.argmax(softmax, dim = -1)
    print("Answer:", answers[index])

Answer: Natural Language Processing


### 3. Fine-tune a pre-trained transformer model for sequence classification
Fine-tuning a pre-trained transformer model for language modeling is a popular approach for language modeling. In this notebook, we will use the Hugging Face transformers library to fine-tune transformer-based language models on the newsgroups dataset

In [9]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

groups = ['alt.atheism', 'rec.sport.baseball', 'sci.space']

# Load dataset
# newsgroups_data = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)
newsgroups_data = fetch_20newsgroups(subset='all', shuffle=True, random_state=42, categories=groups)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
num_epochs = 1 # set this to a higher number in actual fine-tuning process

### 3.1 Fine-tune a pretrained BertForSequenceClassification model
Note: If your device does not have cuda-enable, it is recommended to use a smaller dataset and a smaller training epoch to have a taste of the fine-tune. The fine-tune layers are, by default, freeze except the the classification layer to save computational resources. The training epoch is set to 1 for demonstration purpose. 

In [10]:
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoded_data = tokenizer(newsgroups_data.data, padding=True, truncation=True, return_tensors='pt')

X = encoded_data['input_ids']
y = torch.tensor(newsgroups_data.target)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
# Define BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', torch_dtype = torch.float16, num_labels=len(np.unique(newsgroups_data.target)))

# Freeze all layers except the final classification layer
for param in model.bert.parameters():
    param.requires_grad = False

# Define optimizer, set a small lr for fine-tune
optimizer = AdamW(model.classifier.parameters(), lr=2e-5)

# DataLoader
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)

# Fine-tune the BERT model
model.to(device)
model.train()
for epoch in range(num_epochs):
    losses = []
    for data, target in train_loader:
        optimizer.zero_grad()
        outputs = model(input_ids=data.to(device), labels=target.to(device))
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    print(f'{epoch+1}/{num_epochs} epochs: {np.mean(losses)}')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


1/1 epochs: 1.1006632194244603


In [12]:
test_dataset = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False)

model.eval()
with torch.no_grad():
    predicts = []
    truth = []
    for data, target in test_loader:
        outputs = model(input_ids=data.to(device), labels=target.to(device))
        predicts.append(torch.argmax(outputs.logits, dim=1).detach().cpu().numpy())
        truth.append(target.numpy())
    accuracy = accuracy_score(np.concatenate(truth), np.concatenate(predicts))
    print(f'Test Set Accuracy: {accuracy}')

Test Set Accuracy: 0.37050359712230213


### 3.2 Fine-tune a pretrained GPT-2 model

In [13]:
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

In [14]:
# Load the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # Ensure the tokenizer uses a pad token

encoded_data = tokenizer(newsgroups_data.data, padding=True, truncation=True, return_tensors='pt')

X = encoded_data['input_ids']
y = torch.tensor(newsgroups_data.target)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [15]:
# Load the GPT-2 model with a classification head
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=len(np.unique(newsgroups_data.target)))
model.config.pad_token_id = model.config.eos_token_id

# Freeze all layers except the last transformer block and the language model head
for name, param in model.named_parameters():
    if 'h.11' not in name and 'ln_f' not in name and 'lm_head' not in name:
        param.requires_grad = False
        
# Setup an optimizer to only update the un-frozen parameters
optimizer = AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=2e-5)

# DataLoader
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)

# Fine-tune the BERT model
model.to(device)
model.train()
for epoch in range(num_epochs):
    losses = []
    for data, target in train_loader:
        optimizer.zero_grad()
        outputs = model(input_ids=data.to(device), labels=target.to(device))
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    print(f'{epoch+1}/{num_epochs} epochs: {np.mean(losses)}')

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


1/1 epochs: 1.5180492137404655


In [16]:
test_dataset = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False)

model.eval()
with torch.no_grad():
    predicts = []
    truth = []
    for data, target in test_loader:
        outputs = model(input_ids=data.to(device), labels=target.to(device))
        predicts.append(torch.argmax(outputs.logits, dim=1).detach().cpu().numpy())
        truth.append(target.numpy())
    accuracy = accuracy_score(np.concatenate(truth), np.concatenate(predicts))
    print(f'Test Set Accuracy: {accuracy}')

Test Set Accuracy: 0.5539568345323741


### Practice for the Week
You have learnt to fine-tune pre-trained models for your own use cases. However, there are many pre-trained models available on the internet that can be used for a variety of tasks. Let's practice fine-tuning more advanced pre-trained models. In this task, you are recommended to first implement a simple fine-tuning task using a pre-trained model, such as a Bert (without looking at the code above).

Here's the practice: Unless every input sequence is precisely the same length without any padding, it is also necessary to include attention mask which is missing from the demonstration of fine-tuning the BERT model in the notebook. The attention mask is used to mask out the padding tokens in the input sequence, so that the model does not pay attention to them. 

Then, you can try to fine-tune a more advanced pre-trained model, such as RoBERTa, BART, XLNet, for your own task. 

Alternatively, try to fine-tune a model to classify the newsgroup other than the two examples provided in the notebook.

In [17]:
encoded_data.keys()
# outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

dict_keys(['input_ids', 'attention_mask'])