# The Transformer's Two Faces: Fine-Tuning for Classification and Conversation

This notebook is a practical, hands-on guide to two of the most common applications of fine-tuning in modern NLP. We explore how to adapt pre-trained Transformer models to two very different goals: structured classification and open-ended text generation.

---

### **Project Overview**

We tackle two distinct fine-tuning challenges:

1.  **Sentiment Analysis with BERT:** We fine-tune `bert-base-cased` on the Yelp Reviews dataset to build a 5-star rating classifier. This demonstrates a classic supervised learning workflow using a custom PyTorch training loop.

2.  **Conversational AI with OPT:** We use the more advanced `SFTTrainer` from the TRL library to fine-tune a causal language model (`facebook/opt-350m`) on the OpenAssistant dataset. The goal is to transform a base generative model into a helpful conversational assistant, showing a clear "before and after" improvement in its ability to follow instructions.

---

### **Key Technologies Demonstrated**

* **Models:** BERT (Encoder), OPT (Decoder)
* **Libraries:** PyTorch, Hugging Face `transformers`, `datasets`, and `trl`
* **Tasks:** Multi-Class Classification, Causal Language Model Fine-Tuning
* **Techniques:** Custom PyTorch training loop, `SFTTrainer`, Data Collators for instruction tuning.

In [9]:
# !pip install torch==2.2.2
# !pip install torchtext==0.17.2
# !pip install portalocker==2.8.2
# !pip install torchdata==0.7.1
# !pip install pandas
# !pip install matplotlib==3.9.0 scikit-learn==1.5.0
# !pip install numpy==1.26.0
# !pip install transformers==4.42.1
# !pip install datasets
# !pip install torchmetrics==1.4.0.post0
# !pip install accelerate==0.31.0
# !pip install trl==0.9.4
# !pip install protobuf==3.20.*

In [45]:
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    pipeline
)

from transformers import (
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling
)

from datasets import load_dataset
from trl import (
    SFTConfig,                 
    SFTTrainer,                
    DataCollatorForCompletionOnlyLM
)
from torchmetrics import Accuracy

from tqdm.auto import tqdm
import math
import time
import matplotlib.pyplot as plt

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# Supervised Fine-tuning with PyTorch

## Dataset preparation

In [11]:
dataset = load_dataset("yelp_review_full")
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [12]:
dataset["train"][100]['text']

'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more t

In [13]:
dataset["train"][100]["label"]

0

In [14]:
# Work with a small subset: the first 1,000 training and 200 test examples.
dataset["train"] = dataset["train"].select(range(1000))
dataset["test"] = dataset["test"].select(range(200))

In [15]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 200
    })
})

### Tokenizing data

In [17]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.model_max_length)

512


In [18]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
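
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of what happens to a list of token IDs. The helper `pad_or_truncate` and the toy length of 8 are illustrative assumptions, not part of the `transformers` API. The real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = input_ids[:max_length]                    # truncation: drop tokens past the limit
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    ids = ids + [pad_id] * n_pad                    # padding: fill up to max_length
    return ids, attention_mask

# A short sequence gets padded, a long one gets cut (toy IDs, hypothetical values).
short_ids, short_mask = pad_or_truncate([101, 173, 1197, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 13)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With the real limit of 512, every review becomes exactly 512 tokens long, which keeps batching simple at the cost of extra computation on short reviews.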

In [19]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [20]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 200
    })
})

In [22]:
tokenized_datasets['train'][0].keys()

dict_keys(['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'])

In [23]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

In [25]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 200
    })
})

In [27]:
sample = tokenized_datasets["train"][0]
tokens = tokenizer.convert_ids_to_tokens(sample["input_ids"])

print(f"--- Sample 0 ---")
print(f"Label: {sample['labels']}")
print("-" * 20)

print(f"{'Token':<15} {'Input ID':<10} {'Attention Mask'}")
print("=" * 45)

for token, input_id, attention_mask in zip(tokens, sample["input_ids"], sample["attention_mask"]):
    if token == '[PAD]':
        print("\n... (padding tokens omitted)")
        break
    print(f"{token:<15} {input_id:<10} {attention_mask}")

--- Sample 0 ---
Label: 4
--------------------
Token           Input ID   Attention Mask
[CLS]           101        1
d               173        1
##r             1197       1
.               119        1
gold            2284       1
##berg          2953       1
offers          3272       1
everything      1917       1
i               178        1
look            1440       1
for             1111       1
in              1107       1
a               170        1
general         1704       1
practitioner    22351      1
.               119        1
he              1119       1
'               112        1
s               188        1
nice            3505       1
and             1105       1
easy            3123       1
to              1106       1
talk            2037       1
to              1106       1
without         1443       1
being           1217       1
patron          10063      1
##izing         4404       1
;               132        1
he              1119       1
'           

### DataLoader


In [30]:
tokenized_datasets["train"]

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})

In [31]:
train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, batch_size=2)
eval_dataloader = DataLoader(tokenized_datasets["test"], batch_size=2)
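
As a rough illustration of what these loaders yield, the sketch below batches a toy dictionary-style dataset. `ToyDataset` is a hypothetical stand-in for `tokenized_datasets["train"]`, with made-up sizes (6 examples of 8 tokens) so that it runs without downloading anything.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Hypothetical stand-in for the tokenized dataset: each item is a dict of tensors."""
    def __init__(self, n=6, seq_len=8):
        self.input_ids = torch.randint(1, 100, (n, seq_len))
        self.labels = torch.randint(0, 5, (n,))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {"input_ids": self.input_ids[idx], "labels": self.labels[idx]}

toy_loader = DataLoader(ToyDataset(), shuffle=True, batch_size=2)
toy_batch = next(iter(toy_loader))
print(toy_batch["input_ids"].shape)  # each key is stacked along a new batch dimension
print(toy_batch["labels"].shape)
print(len(toy_loader))               # number of batches per epoch
```

By the same arithmetic, 1,000 training examples with `batch_size=2` give 500 batches per epoch.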

## Train the model


### Load a pretrained model


Here, we load the pretrained `bert-base-cased` encoder and attach a new classification head with 5 output classes, one per star rating:


In [33]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
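
The warning above says that the encoder weights come from the checkpoint while `classifier.weight` and `classifier.bias` start out random. The sketch below imitates just that head with a plain linear layer, assuming a hidden size of 768 and a random tensor in place of BERT's pooled output. It is a simplified illustration, not the library's implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size, num_labels = 768, 5

# Random stand-in for the pooled [CLS] representation of a batch of 2 reviews.
pooled_output = torch.randn(2, hidden_size)

# Freshly initialized head, analogous to 'classifier.weight' / 'classifier.bias'.
toy_classifier = nn.Linear(hidden_size, num_labels)

toy_logits = toy_classifier(pooled_output)        # shape: (batch, num_labels)
toy_labels = torch.tensor([4, 0])                 # star ratings encoded as 0..4
toy_loss = nn.functional.cross_entropy(toy_logits, toy_labels)
toy_predictions = torch.argmax(toy_logits, dim=-1)

print(toy_logits.shape, toy_predictions.shape, toy_loss.item())
```

Because the head is untrained, its predictions are arbitrary until fine-tuning updates it together with the encoder.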


### Optimizer and learning rate schedule

In [34]:
# Note: 5e-4 is unusually high for fine-tuning BERT; values around 2e-5 to 5e-5 are more common.
optimizer = AdamW(model.parameters(), lr=5e-4)
num_epochs = 10
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = LambdaLR(optimizer, lr_lambda=lambda current_step: (1 - current_step / num_training_steps))
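
The `LambdaLR` above multiplies the base learning rate by `1 - current_step / num_training_steps`, which is a linear decay to zero. A minimal sketch with toy values (a single parameter, a base rate of 1.0, and 4 steps, all chosen for readability) shows the rate used at each step:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Toy setup (assumed values): one parameter, base lr of 1.0, 4 training steps.
toy_param = torch.nn.Parameter(torch.zeros(1))
toy_optimizer = AdamW([toy_param], lr=1.0)
toy_steps = 4
toy_scheduler = LambdaLR(toy_optimizer, lr_lambda=lambda step: 1 - step / toy_steps)

lrs = []
for _ in range(toy_steps):
    lrs.append(toy_optimizer.param_groups[0]["lr"])  # learning rate used for this step
    toy_optimizer.zero_grad()
    (toy_param ** 2).sum().backward()
    toy_optimizer.step()
    toy_scheduler.step()                             # decay after each optimizer step

final_lr = toy_optimizer.param_groups[0]["lr"]
print(lrs, final_lr)
```

In the notebook's configuration the same schedule runs over 10 epochs of 500 batches, so the rate falls from 5e-4 to 0 across 5,000 steps.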

In [35]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

### Training loop

In [36]:
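def train_model(model, tr_dataloader):
    # Relies on the globals defined above: optimizer, lr_scheduler,
    # num_epochs, num_training_steps, and device.
    progress_bar = tqdm(range(num_training_steps))
    model.train()
    tr_losses = []  # average training loss per epoch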

    # Training loop
    for epoch in range(num_epochs):
        total_loss = 0 

        for batch in tr_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            loss.backward()
            total_loss += loss.item()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)
            
        tr_losses.append(total_loss/len(tr_dataloader))
        
    plt.plot(tr_losses)
    plt.title("Training loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.show()

## Evaluate

In [37]:
def evaluate_model(model, evl_dataloader):
    metric = Accuracy(task="multiclass", num_classes=5).to(device)
    model.eval()

    with torch.no_grad():
        for batch in evl_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1)

            # Accumulate the predictions and labels for the metric
            metric(predictions, batch["labels"])

    # Compute the accuracy
    accuracy = metric.compute()
    print("Accuracy:", accuracy.item())
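
For intuition, the accumulation that the metric object performs across batches can be sketched with plain tensors. The logits and labels below are made-up values, and the sketch assumes plain micro-averaged accuracy (correct predictions divided by total examples), which is my understanding of what the `Accuracy` wrapper computes by default here.

```python
import torch

# Two toy batches of 2 examples each, 5 classes (hypothetical logits and labels).
toy_batches = [
    (torch.tensor([[0.1, 0.2, 0.9, 0.0, 0.3], [2.0, 0.1, 0.1, 0.1, 0.1]]), torch.tensor([2, 1])),
    (torch.tensor([[0.0, 0.0, 0.0, 0.0, 1.0], [0.0, 3.0, 0.0, 0.0, 0.0]]), torch.tensor([4, 1])),
]

correct, total = 0, 0
for batch_logits, batch_labels in toy_batches:
    batch_predictions = torch.argmax(batch_logits, dim=-1)  # most likely class per example
    correct += (batch_predictions == batch_labels).sum().item()
    total += batch_labels.numel()

toy_accuracy = correct / total
print(toy_accuracy)
```

For reference, guessing uniformly at random among 5 classes would score about 0.2 on a balanced test set.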

In [None]:
# train_model(model=model, tr_dataloader=train_dataloader)
# Save only the weights (state_dict), which is what load_state_dict expects below.
# torch.save(model.state_dict(), 'bert-classification-model.pt')

![loss_gpt.png](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/HausLW2F_w30s1UK0zj7mQ/training-loss-BERT-Classification.png)


## Loading the saved model

In [43]:
model.load_state_dict(torch.load('bert-classification-model.pt', map_location=device))

<All keys matched successfully>
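
`load_state_dict` expects a dictionary that maps parameter names to tensors, which is what `model.state_dict()` returns, rather than a pickled model object. A minimal round trip on a toy layer (the `nn.Linear` sizes and the temporary file name are arbitrary choices for illustration) looks like this:

```python
import os
import tempfile

import torch
import torch.nn as nn

source = nn.Linear(4, 2)
target = nn.Linear(4, 2)   # same architecture, different random weights

path = os.path.join(tempfile.mkdtemp(), "toy-model.pt")
torch.save(source.state_dict(), path)          # save the weights only

state = torch.load(path, map_location="cpu")   # dict of parameter name -> tensor
result = target.load_state_dict(state)
print(result)                                  # reports missing or unexpected keys, if any

weights_match = torch.equal(source.weight, target.weight)
print(weights_match)
```

The target model must be built with the same architecture before loading, which is why the notebook recreates the model with `from_pretrained` first.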

In [46]:
evaluate_model(model, eval_dataloader)

Accuracy: 0.26499998569488525


Next, we fine-tune a larger, generative model with `SFTTrainer` so that it can produce conversations between a human and an assistant.

# Training a conversational model using SFTTrainer

This section shows how fine-tuning a decoder-only Transformer on a specific dataset affects the quality of its generated responses in a question-answering task.

Load the train split of the "timdettmers/openassistant-guanaco" dataset from Hugging Face:


In [47]:
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
dataset

Dataset({
    features: ['text'],
    num_rows: 9846
})

Inspect one training example to see the conversation format, then load the pretrained causal model "facebook/opt-350m" along with its tokenizer from Hugging Face:


In [56]:
dataset[30]['text']

'### Human: What is the difference between open assistant and ChatGPT? Why should i use Open Assistant? And can you give me some examples of advanced chatbots that are similar to Open Assistant?### Assistant: First of all, a major difference is that Open Assistant is, as the name implies, open source. ChatGPT is closed source and might have more restrictions. \n\nOpen Assistant has less rules and restrictions than ChatGPT, and can be implemented into a multitude of applications. \n\nSome examples of similar chatbots are ChatGPT, Bing Chat, YouChat, Replit Ghostwriter, Character.ai, and Jasper, but there are many more.### Human: What is the difference in performance between the two models? Can you be more specific about the performance in conversational tasks, knowledge grounded question answering and in summarization?'

In [48]:
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

Define the instruction and response templates, matching the `### Human:` and `### Assistant:` markers used in the training data:

In [57]:
instruction_template = "### Human:"
response_template = "### Assistant:"

Create a collator with **"DataCollatorForCompletionOnlyLM"**. Given the two templates, it masks the labels of the human turns so that the training loss is computed only on the assistant's responses:

In [58]:
collator = DataCollatorForCompletionOnlyLM(
    instruction_template=instruction_template,
    response_template=response_template,
    tokenizer=tokenizer,
    mlm=False,
)
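
The idea behind completion-only training can be sketched in a few lines: every position outside an assistant turn gets the label `-100`, which PyTorch's cross-entropy loss ignores by default. The function below is a simplified word-level illustration with a hypothetical name, `mask_non_assistant`. The real collator works on token IDs and matches the templates as token subsequences.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_non_assistant(tokens, instruction="### Human:", response="### Assistant:"):
    """Toy word-level masking: keep labels only for words inside assistant turns."""
    labels, in_response = [], False
    for tok in tokens:
        if tok == response:
            in_response = True
            labels.append(IGNORE_INDEX)   # the template itself is not a training target
        elif tok == instruction:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(tok if in_response else IGNORE_INDEX)
    return labels

toy_tokens = ["### Human:", "Hi", "there", "### Assistant:", "Hello", "!", "### Human:", "Bye"]
toy_masked = mask_non_assistant(toy_tokens)
print(toy_masked)
```

Only "Hello" and "!" remain as targets, so the model is trained to produce assistant replies rather than to imitate the human prompts.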

Create an `SFTTrainer` object, passing it the model, the dataset, and the collator:


In [61]:
training_args = SFTConfig(
    output_dir="/tmp",
    num_train_epochs=10,
    learning_rate=2e-5,
    save_strategy="epoch",
    per_device_train_batch_size=2,  
    per_device_eval_batch_size=2,  
    max_seq_length=1024,
    do_eval=True
)

trainer = SFTTrainer(
    model,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    data_collator=collator,
)

Prompt the pretrained model with a specific question: 


In [62]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=70)

In [63]:
print(pipe('''Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.''')[0]["generated_text"])

Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.

The term "monopsony" is used in the context of the "mono" (mono-economy) model. The term "mono-economy" is used in the context of the "mono-economy" model. The term "mono-economy" is used in the context of


The base model clearly has little grasp of what "monopsony" means in economics: it loops on an invented "mono-economy" phrase instead of answering.


In [65]:
print(pipe('''What is the difference between open assistant and ChatGPT? Why should i use Open Assistant? And can you give me 
some examples of advanced chatbots that are similar to Open Assistant?''')[0]["generated_text"])

What is the difference between open assistant and ChatGPT? Why should i use Open Assistant? And can you give me 
some examples of advanced chatbots that are similar to Open Assistant?

I'm not sure if Open Assistant is the same as ChatGPT, but I've used Open Assistant for a while and it's pretty good.

Open Assistant is a chatbot that can be used to communicate with other chatbots. It's a chatbot that can be used to communicate with other chatbots.

Chat


Train the model:

In [67]:
# trainer.train()

Load the fine-tuned weights, which I trained on a GPU:


In [68]:
model.load_state_dict(torch.load('Assistant_model.pt', map_location=torch.device('cpu')))

<All keys matched successfully>

Check how the fine-tuned model answers the same two questions:


In [69]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=70)

In [70]:
print(pipe('''Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use 
examples related to potential monopsonies in the labour market and cite relevant research.''')[0]["generated_text"])

Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.

The term "monopsony" in economics refers to the practice of controlling the working class by imposing a price on them. This can be seen as a form of economic control, but it can also be seen as a form of political control, as the price of a product can be influenced by the interests of the state.




In [71]:
print(pipe('''What is the difference between open assistant and ChatGPT? Why should i use Open Assistant? And can you give me 
some examples of advanced chatbots that are similar to Open Assistant?''')[0]["generated_text"])

What is the difference between open assistant and ChatGPT? Why should i use Open Assistant? And can you give me 
some examples of advanced chatbots that are similar to Open Assistant?

Open Assistant is a chatbot that uses a combination of AI and machine learning to generate text and generate responses. It is designed to be easy to use and to be able to understand and respond to user input. It is designed to be able to understand and respond to user input, and it can be used to generate text, generate responses,
