# PragmaticLM

This notebook fine-tunes a T5 model on a custom dataset of indirect user requests. The fine-tuned model infers the user's intention and restructures the prompt. The result can be passed to any generative model to handle indirect requests, or simply to improve prompt quality.

In [28]:
import os
import numpy as np
import pandas as pd
import torch
import transformers
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForLanguageModeling, pipeline)
import matplotlib.pyplot as plt
import seaborn as sns
# concatenate_datasets is required by prepare_dataset() further down
from datasets import load_dataset, Dataset, concatenate_datasets
from sklearn.model_selection import train_test_split

In [6]:
torch.cuda.is_available()
!nvidia-smi
print(torch.cuda.current_device())
print(torch.cuda.get_device_name(0))

Tue Mar 18 13:38:08 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   52C    P8             10W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [7]:
# Load model from HF transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-base")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [9]:
# inference on T5 base model
input_text = "summarize: These sections should give you a comprehensive understanding of T5's internals, from architecture to fine-tuning, and how to build your PragmaticLM on top of it. Each section provides specific code to explore different aspects of the model, which you can run and modify in your Jupyter notebook. The modular approach allows you to understand each component before integrating them into your complete framework."

input_ids = tokenizer(input_text, return_tensors="pt").input_ids

output_ids = model.generate(input_ids, max_length=50)

output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("Generated Output:", output_text)

Generated Output: sections give you a comprehensive understanding of T5's internals . each section provides specific code to explore different aspects of the model .


In [10]:
# Dataset
ds = load_dataset("msamogh/indirect-requests")


In [11]:
print('Utterance:', ds['train']['utterance'][0], '\nUnderstanding:',ds['train']['situation'][0])

Utterance: I could really go for some biryani and samosa right now. Any recommendations? 
Understanding: User wants to find a restaurant of a particular cuisine in a city


In [12]:
print(ds['train'][0])

{'creation_date': '2023-07-26T00:55:08.699000', 'utterance': 'I could really go for some biryani and samosa right now. Any recommendations?', 'slot_description': 'Cuisine of food served in the restaurant', 'situation': 'User wants to find a restaurant of a particular cuisine in a city', 'service': 'Restaurants_1', 'possible_slot_values': "['Mexican', 'Chinese', 'Indian', 'American', 'Italian']", 'bool_rephrased_slot_values': "['Mexican', 'Chinese', 'Indian', 'American', 'Italian']", 'target_slot_value': '<ambiguous>', 'mean_world_understanding': 10.0}


In [13]:
ds

DatasetDict({
    train: Dataset({
        features: ['creation_date', 'utterance', 'slot_description', 'situation', 'service', 'possible_slot_values', 'bool_rephrased_slot_values', 'target_slot_value', 'mean_world_understanding'],
        num_rows: 246
    })
    validation: Dataset({
        features: ['creation_date', 'utterance', 'slot_description', 'situation', 'service', 'possible_slot_values', 'bool_rephrased_slot_values', 'target_slot_value', 'mean_world_understanding'],
        num_rows: 272
    })
    test: Dataset({
        features: ['creation_date', 'utterance', 'slot_description', 'situation', 'service', 'possible_slot_values', 'bool_rephrased_slot_values', 'target_slot_value', 'mean_world_understanding'],
        num_rows: 388
    })
})

In [44]:
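def prepare_dataset(ds):
    """Merge all splits, build (input_text, target_text) pairs, and re-split 90/10."""
    # Pool the original splits; a fresh train/validation split is made below
    new_ds = concatenate_datasets([ds["train"], ds["validation"], ds["test"]])

    # Keep only the two columns used for training
    new_ds = new_ds.remove_columns(
        [col for col in new_ds.column_names if col not in ["utterance", "situation"]]
    )

    # The prefixed utterance is the model input; the situation is the target
    data = {
        'input_text': ["Restructure Prompt: " + example['utterance'] for example in new_ds],
        'target_text': [example['situation'] for example in new_ds]
    }

    df = pd.DataFrame(data)

    train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)

    # preserve_index=False avoids carrying the shuffled pandas index into the dataset
    train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
    val_dataset = Dataset.from_pandas(val_df, preserve_index=False)

    return train_dataset, val_dataset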
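
To make the pair format concrete, the sketch below applies the same prefixing to the single training record printed earlier. `to_pair` and `PREFIX` are hypothetical names introduced only for illustration; they mirror the per-example logic of `prepare_dataset` and are not part of the notebook.

```python
# Hypothetical helper mirroring what prepare_dataset does for each example.
PREFIX = "Restructure Prompt: "

def to_pair(example):
    """Map a raw record to the (input_text, target_text) pair used for training."""
    return {
        "input_text": PREFIX + example["utterance"],
        "target_text": example["situation"],
    }

# First training record, with only the two fields that are kept
record = {
    "utterance": "I could really go for some biryani and samosa right now. Any recommendations?",
    "situation": "User wants to find a restaurant of a particular cuisine in a city",
}

pair = to_pair(record)
print(pair["input_text"])
print(pair["target_text"])
```

The same prefix has to be used at inference time, since the model only ever sees utterances in this prefixed form during fine-tuning.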

### Preprocessing

In [65]:
def preprocess_function(examples):
    inputs = tokenizer(examples['input_text'], padding="max_length", truncation=True, max_length=128)
    targets = tokenizer(examples['target_text'], padding="max_length", truncation=True, max_length=128)

    # Replace pad token ids in the labels with -100 so the loss ignores padded positions
    targets["input_ids"] = [
        [(t if t != tokenizer.pad_token_id else -100) for t in target]
        for target in targets["input_ids"]
    ]

    inputs["labels"] = targets["input_ids"]
    return inputs
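
The label masking step can be looked at in isolation. The sketch below uses made-up token ids and assumes a pad token id of 0 (the value used by the T5 tokenizer; in the notebook it is read from `tokenizer.pad_token_id`). The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions do not contribute to the training loss. `mask_padding` is a hypothetical name.

```python
PAD_TOKEN_ID = 0     # assumption: pad id of the T5 tokenizer
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_padding(label_ids, pad_id=PAD_TOKEN_ID):
    """Replace every pad id in a batch of label sequences with IGNORE_INDEX."""
    return [
        [(t if t != pad_id else IGNORE_INDEX) for t in seq]
        for seq in label_ids
    ]

# Toy batch of padded label ids (illustrative values, not real tokenizer output)
padded = [[8, 15, 1, 0, 0], [27, 1, 0, 0, 0]]
masked = mask_padding(padded)
print(masked)
```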

In [79]:
train_dataset, val_dataset = prepare_dataset(ds)
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_val = val_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/815 [00:00<?, ? examples/s]

Map:   0%|          | 0/91 [00:00<?, ? examples/s]
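
The 815 and 91 in the progress bars follow from the split sizes reported for the dataset. A quick check, assuming scikit-learn rounds the held-out fraction up and assigns the remainder to training (its behavior when only `test_size` is given):

```python
import math

# Split sizes as reported when the dataset was loaded
split_sizes = {"train": 246, "validation": 272, "test": 388}
total = sum(split_sizes.values())

test_size = 0.1
n_val = math.ceil(test_size * total)  # held-out examples, rounded up
n_train = total - n_val               # everything else goes to training

print(total, n_train, n_val)
```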

In [102]:
# training args
training_args = TrainingArguments(
    run_name = 'pragmaticLM',
    output_dir="./results",
    eval_strategy="epoch",
    report_to = 'none',
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    save_total_limit=3,
    #load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    optim="adamw_torch",
    max_steps=1000,  # a positive max_steps overrides num_train_epochs
)

### Trainer

In [103]:
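class CustomTrainer(Trainer):
    """Trainer that uses a lower learning rate for the encoder than for the decoder."""

    def create_optimizer(self):
        if self.optimizer is None:
            encoder_params = []
            decoder_params = []

            # Split parameters by name prefix. Parameters matching neither prefix
            # (for T5 this typically includes the shared embedding, `shared.weight`)
            # are not passed to the optimizer and therefore are not updated.
            for name, param in self.model.named_parameters():
                if name.startswith('encoder'):
                    encoder_params.append(param)
                elif name.startswith('decoder') or name.startswith('lm_head'):
                    decoder_params.append(param)

            optimizer_grouped_parameters = [
                {'params': encoder_params, 'lr': 1e-5},
                {'params': decoder_params, 'lr': 3e-5}
            ]

            # Built directly with torch, so the per-group learning rates above are
            # used instead of the learning_rate set in TrainingArguments.
            self.optimizer = torch.optim.AdamW(optimizer_grouped_parameters)

        return self.optimizer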
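
Because the grouping relies on name prefixes, it can help to check which parameter names fall outside both groups before training. The sketch below applies the same prefix rules to a list of names. `group_param_names` is a hypothetical helper, and the names in `example_names` are illustrative; in the notebook the real list could be obtained with `[n for n, _ in model.named_parameters()]`.

```python
def group_param_names(names):
    """Bucket parameter names with the same prefix rules as CustomTrainer."""
    groups = {"encoder": [], "decoder": [], "unassigned": []}
    for name in names:
        if name.startswith("encoder"):
            groups["encoder"].append(name)
        elif name.startswith("decoder") or name.startswith("lm_head"):
            groups["decoder"].append(name)
        else:
            # These would never reach the optimizer and so would stay frozen
            groups["unassigned"].append(name)
    return groups

# Illustrative names only; not a dump of the actual model
example_names = [
    "shared.weight",
    "encoder.block.0.layer.0.SelfAttention.q.weight",
    "decoder.block.0.layer.0.SelfAttention.q.weight",
    "lm_head.weight",
]

groups = group_param_names(example_names)
for key, members in groups.items():
    print(key, members)
```

If the unassigned bucket is not empty and those parameters should be trained, one option is to add a third parameter group for them.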

In [104]:
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
)

In [105]:
#fine-tuning
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


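| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 0.637744 |
| 2 | No log | 0.333011 |
| 3 | No log | 0.232076 |
| 4 | No log | 0.181206 |
| 5 | 0.815500 | 0.152063 |
| 6 | 0.815500 | 0.140264 |
| 7 | 0.815500 | 0.131465 |
| 8 | 0.815500 | 0.119708 |
| 9 | 0.193600 | 0.116908 |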


TrainOutput(global_step=1000, training_loss=0.5045081176757813, metrics={'train_runtime': 600.0661, 'train_samples_per_second': 13.332, 'train_steps_per_second': 1.666, 'total_flos': 1216545625866240.0, 'train_loss': 0.5045081176757813, 'epoch': 9.803921568627452})
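
The reported epoch count of about 9.8 follows from `max_steps=1000` rather than from `num_train_epochs=5`. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, and the default `logging_steps` of 500 (none of these are set explicitly in the training arguments above).

```python
import math

n_train = 815     # training examples after the 90/10 split
batch_size = 8    # per_device_train_batch_size
max_steps = 1000  # optimizer steps requested
logging_steps = 500  # assumption: Trainer default, not set in this notebook

steps_per_epoch = math.ceil(n_train / batch_size)
epochs_run = max_steps / steps_per_epoch
first_log_epoch = logging_steps / steps_per_epoch

print(steps_per_epoch)            # optimizer steps in one pass over the data
print(round(epochs_run, 3))       # epochs covered by max_steps
print(round(first_log_epoch, 2))  # epoch at which training loss is first logged
```

Under these assumptions the first training-loss value is logged shortly before the end of epoch 5, which is consistent with the "No log" entries for epochs 1 to 4.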

In [106]:
# save pragmaticLM_v1
model.save_pretrained("./prompt-restructuring-t5")
tokenizer.save_pretrained("./prompt-restructuring-t5")

('./prompt-restructuring-t5/tokenizer_config.json',
 './prompt-restructuring-t5/special_tokens_map.json',
 './prompt-restructuring-t5/spiece.model',
 './prompt-restructuring-t5/added_tokens.json',
 './prompt-restructuring-t5/tokenizer.json')

In [113]:
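# inference
def restructure_prompt(input_prompt):
    """Rewrite an indirect request as an explicit intent using the fine-tuned model."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Use the same task prefix as during fine-tuning
    input_text = f"Restructure Prompt: {input_prompt}"
    inputs = tokenizer(input_text, return_tensors="pt", padding=True).to(device)

    model.to(device)
    model.eval()  # disable dropout for inference

    output = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=64,
        num_beams=4,
        early_stopping=True
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)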

In [114]:
# example 1
test_prompt = "Could you help me find a medical professional who specializes in women's health issues?"
restructured = restructure_prompt(test_prompt)
print(f"Original: {test_prompt}")
print(f"Restructured: {restructured}")

Original: Could you help me find a medical professional who specializes in women's health issues?
Restructured: User wants to find a medical service provider based on their location and speciality
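
One way the restructured intent could be handed to a downstream generative model is to combine it with the original request in a single prompt. The sketch below is only an illustration: `build_downstream_prompt` and the template wording are hypothetical and are not part of this notebook or of any library.

```python
def build_downstream_prompt(original, restructured):
    """Combine the original request with the inferred intent (hypothetical template)."""
    return (
        f"Inferred intent: {restructured}\n"
        f"Original request: {original}\n"
        "Respond to the original request with the inferred intent in mind."
    )

# Strings taken from example 1
original = "Could you help me find a medical professional who specializes in women's health issues?"
restructured = "User wants to find a medical service provider based on their location and speciality"

prompt = build_downstream_prompt(original, restructured)
print(prompt)
```

Keeping the original request in the prompt means the downstream model still sees the user's wording when the inferred intent is too generic or wrong.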


In [121]:
# example 2
test_prompt = "He does not want to go to office today. Suggest him some time pass activity to do."
restructured = restructure_prompt(test_prompt)
print(f"Original: {test_prompt}")
print(f"Restructured: {restructured}")

Original: He does not want to go to office today. Suggest him some time pass activity to do.
Restructured: User wants to find a hotel at a given location
