## Overview
This notebook fine-tunes a large language model (LLM) on a **sequence-to-sequence** problem. The goal is a versatile tool that translates user queries into API calls.

The tool's primary function is to interpret user queries and perform one of three types of operations:

**Unit Conversion**: It can convert units from one measurement to another. For example, from feet to centimeters, or kilograms to ounces.

**Math Calculations**: It can handle various mathematical calculations, including basic arithmetic (e.g., addition and subtraction) and more advanced operations (e.g., logarithm).

**Search Queries**: For any other query, it falls back to a search call that takes no parameters (`Search()`).

Examples:

Input: “ft to cm”; output: “UnitConvert(SourceUnit:foot, TargetUnit:centimeter, SourceValue:1)”

Input: “how many ounces in 5.8 kilograms”; output: “UnitConvert(SourceUnit:kilogram, TargetUnit:ounce, SourceValue:5.8)”

Input: “two to the power of 10”, output: “Calculate(2^10)”

Input: “2001-1989”, output: “Calculate(2001-1989)”

Input: “what is chatgpt”, output: “Search()”

Input: “primary year 1 maths calculation checklist”, output: “Search()”

Input: “what are different length units”, output: “Search()”
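
The output strings above follow a small, regular grammar, so downstream code can split them into a call name and arguments. The sketch below is only an illustration of that format: the function name `parse_api_string` and the `"expression"` key are hypothetical, and values are kept as plain strings.

```python
import re

# Matches the three call shapes shown in the examples above.
_CALL = re.compile(r'^(UnitConvert|Calculate|Search)\((.*)\)$')

def parse_api_string(api_string):
    """Split an API string into (call_name, arguments_dict)."""
    match = _CALL.match(api_string.strip())
    if match is None:
        raise ValueError(f"Unrecognized API string: {api_string!r}")
    name, body = match.groups()
    if name == "Search":
        return name, {}
    if name == "Calculate":
        # Keep the math expression verbatim; evaluating it is out of scope here.
        return name, {"expression": body}
    # UnitConvert: comma-separated Key:Value pairs, values optionally quoted.
    args = {}
    for part in body.split(","):
        key, _, value = part.partition(":")
        args[key.strip()] = value.strip().strip('"')
    return name, args

print(parse_api_string("Calculate(2^10)"))
print(parse_api_string("UnitConvert(SourceUnit:kilogram, TargetUnit:ounce, SourceValue:5.8)"))
```

A parser like this also gives a cheap validity check: any generated string that raises `ValueError` is malformed regardless of its BLEU score.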

### Approach
Preprocessing: I preprocessed the generated data and transformed it into a format suitable for creating a Hugging Face dataset.

Model Selection: Recognizing the need for highly controlled text generation, I explored different models. CodeT5 models stood out due to their language understanding and text-to-code capabilities, making them suitable for this problem. I initially tried the "Salesforce/codet5-small" model but later opted for the "Salesforce/codet5p-220m" model for better performance. More details about the model can be found here https://github.com/salesforce/CodeT5.

Fine-Tuning: Using the sample dataset I created, I fine-tuned a Seq2Seq model to solve the problem. Fine-tuning is sensitive to hyperparameters such as learning rate and batch size; if the model fails to converge or performs poorly, these hyperparameters need to be tuned.

Evaluation Metrics: To assess model performance, I used BLEU and exact match. BLEU scores the n-gram overlap between the generated text and the reference, while exact match checks whether the generated output is identical to the expected output.

Held-Out Dataset: I maintained a held-out dataset for evaluating the model's performance on unseen data, which helps ensure generalization.

Experiment Tracking: I used Wandb (Weights and Biases) to keep track of my experiment process, allowing me to monitor training progress and results.

Resource: Google Colab-GPU (T4/V100).
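
To make the two metrics from the Evaluation Metrics item concrete, here is a minimal sketch on toy strings. It is not the `evaluate` BLEU implementation used later in the notebook: `simple_tokens` is an assumed tokenization, and `unigram_precision` is only the clipped 1-gram component that BLEU builds on. It shows why a near-miss can score well on overlap while still failing exact match.

```python
import re
from collections import Counter

def simple_tokens(text):
    # Assumption: split into word runs and single punctuation characters.
    return re.findall(r"\w+|[^\w\s]", text)

def exact_match(preds, refs):
    # Fraction of predictions that are identical to their reference.
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def unigram_precision(pred, ref):
    # Clipped unigram precision: each predicted token counts at most
    # as often as it appears in the reference.
    pred_counts = Counter(simple_tokens(pred))
    ref_counts = Counter(simple_tokens(ref))
    overlap = sum((pred_counts & ref_counts).values())
    return overlap / max(sum(pred_counts.values()), 1)

preds = ["Calculate(2^5)", "Search()"]
refs = ["Calculate(2^10)", "Search()"]
print(exact_match(preds, refs))                        # 0.5
print(round(unigram_precision(preds[0], refs[0]), 4))  # 0.8333
```

For an API-call generator, exact match is the stricter and more relevant number, since a single wrong argument makes the call wrong.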

In [1]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 13.6 gigabytes of available RAM

Not using a high-RAM runtime


In [2]:
# Upload the training examples generated earlier in the data_generate notebook to Google Drive so they are available for training
from google.colab import drive
drive.mount('/content/drive')
!cp  /content/drive/My\ Drive/Colab\ Notebooks/question_and_answer_list_1800.json /content/
import json
with open("question_and_answer_list_1800.json", "r") as json_file:
    loaded_data_list = json.load(json_file)

In [3]:
len(loaded_data_list)

1800

In [4]:
!pip install -q transformers[sentencepiece] datasets wandb

In [5]:
!pip install sentencepiece evaluate



In [6]:
!pip install accelerate -U



## Preprocess data
The dataset contains special characters, e.g. the power operator "^", so this has to be considered when choosing a tokenizer for preprocessing: these characters must not be mapped to the unknown token. I chose the CodeT5 family of tokenizers in this notebook. For details, refer to https://github.com/salesforce/CodeT5.
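
One way to check the concern above is to tokenize a few representative targets and look for the unknown token. The helper below is a hedged sketch: `find_unknown_tokens` is a hypothetical name, and the demo uses a toy character-level tokenizer (whose vocabulary deliberately lacks "^") so that it runs without downloading a model.

```python
def find_unknown_tokens(texts, tokenize, unk_token):
    """Return {text: tokens} for every text whose tokens include unk_token."""
    flagged = {}
    for text in texts:
        tokens = tokenize(text)
        if unk_token in tokens:
            flagged[text] = tokens
    return flagged

# Toy stand-in tokenizer: a tiny character vocabulary without "^".
TOY_VOCAB = set("0123456789()+-*/Calculate")

def toy_tokenize(text):
    return [ch if ch in TOY_VOCAB else "<unk>" for ch in text]

flagged = find_unknown_tokens(
    ["Calculate(2^10)", "Calculate(8+3)"], toy_tokenize, "<unk>"
)
print(list(flagged))  # ['Calculate(2^10)']
```

With a Hugging Face tokenizer, the same check would be `find_unknown_tokens(texts, tokenizer.tokenize, tokenizer.unk_token)`; an empty result means none of the sampled strings hit the unknown token.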

In [7]:
from datasets import Dataset

In [8]:
# Added some high quality manually picked examples
search_examples = [
    ("primary year 1 maths calculation checklist", 'Search()'),
    ("what are different length units", "Search()"),
    ("What is the capital of France?", 'Search()'),
    ("How to bake a chocolate cake from scratch", 'Search()'),
    ("Tell me about quantum physics", 'Search()'),
    ("How to play the guitar", 'Search()'),
    ("What are the ingredients for spaghetti carbonara?", 'Search()'),
    ("Find information about different length units", 'Search()'),
    ("Who won the World Series in 2020?", 'Search()'),
    ("Can you recommend a good book to read?", 'Search()'),
    ("What is the meaning of life?", 'Search()'),
    ("When was the Declaration of Independence signed?", 'Search()'),
    ("What is the population of New York City?", 'Search()'),
    ("What's the weather forecast for 5 days from now?", 'Search()'),
    ("Who is the 42nd President of the United States?", 'Search()'),
    ("What is the GDP of China in 2021?", 'Search()'),
    ("How to calculate the area of a circle with a radius of 10?", 'Search()'),
    ("Tell me about the Apollo 11 mission to the moon.", 'Search()'),
    ("What are the prime factors of 36?", 'Search()'),
    ("How much does a gallon of milk weigh?", 'Search()'),
    ("List of popular rock bands from the 70s", 'Search()'),
    ("Best travel destinations for a solo trip", 'Search()'),
    ("How to make a perfect omelette", 'Search()'),
    ("Famous inventions in the 20th century", 'Search()'),
    ("Upcoming science fiction movies in 2023", 'Search()'),
    ("Health benefits of meditation and mindfulness", 'Search()'),
    ("Top 10 historical landmarks in Europe", 'Search()'),
    ("Budget-friendly meal planning tips", 'Search()'),
    ("Overview of renewable energy sources", 'Search()'),
    ("DIY home improvement projects for beginners", 'Search()'),
    ("Why do leaves change color in the fall?", 'Search()'),
    ("Top 5 programming languages for web development", 'Search()'),
    ("Tips for growing a successful vegetable garden", 'Search()'),
    ("History of the Olympic Games", 'Search()'),
    ("Exploring the mysteries of black holes", 'Search()'),
    ("The impact of social media on society", 'Search()'),
    ("Different types of yoga and their benefits", 'Search()'),
    ("Understanding the concept of time dilation", 'Search()'),
    ("How to properly clean and organize your closet", 'Search()'),
    ("The art of making homemade pizza", 'Search()'),
]

cal_examples = [
    ("What is 2 to the power of 5?", "Calculate(2^5)"),
    ("How much is 100 divided by 5?", "Calculate(100/5)"),
    ("Calculate the square root of 25.", "Calculate(sqrt(25))"),
    ("Find the logarithm of 100 to the base 10.", "Calculate(log(100,10))"),
    ("Evaluate 5 factorial.", "Calculate(5!)"),
    ("What is the sine of 45 degrees?", "Calculate(sin(45))"),
    ("Calculate the cosine of 30 degrees.", "Calculate(cos(30))"),
    ("What is 8 plus 3?", "Calculate(8+3)"),
    ("How to subtract 15 from 30?", "Calculate(30-15)"),
    ("Evaluate 3 times 7.", "Calculate(3*7)"),
    ("What is the tangent of 60 degrees?", "Calculate(tan(60))"),
    ("Calculate the natural log of 50.", "Calculate(ln(50))"),
    ("Compute e to the power of 3.", "Calculate(exp(3))"),
    ("What is 10 minus 4?", "Calculate(10-4)"),
    ("two to the power of 10", "Calculate(2^10)"),
    ("2001-1989", "Calculate(2001-1989)"),
    ("9 times 12 minus 3 to the power of 10", "Calculate(9*12-3^10)"),
    ("100 divided by 5 plus 10", "Calculate(100/5+10)"),
    ("Square root of 25 minus 3", "Calculate(sqrt(25)-3)"),
    ("Logarithm of 100 to the base 10 times 2", "Calculate(log(100,10)*2)"),
    ("5 factorial divided by 2", "Calculate(5!/2)"),
    ("Sine of 45 degrees plus 1.5", "Calculate(sin(45)+1.5)"),
    ("Cosine of 30 degrees minus 0.5", "Calculate(cos(30)-0.5)"),
    ("8 plus 3 times 4", "Calculate(8+3*4)"),
    ("15 minus 30 divided by 5", "Calculate(15-30/5)"),
    ("3 times 7 minus tangent of 45 radians", "Calculate(3*7-tan(45))"),
    ("1 plus natural log of 50", "Calculate(1+ln(50))"),
    ("Exponential of 3 divided by 2", "Calculate(exp(3)/2)"),
    ("Square root of 64 plus 2 cubed", "Calculate(sqrt(64)+2^3)"),
    ("Logarithm of 100 to the base 2 times 3 squared", "Calculate(log(100,2)*3^2)"),
    ("10 factorial divided by 4 minus 7", "Calculate(10!/4-7)"),
    ("Sine of 30 degrees plus cosine of 60 degrees", "Calculate(sin(30)+cos(60))"),
    ("4 divided by the natural log of 10", "Calculate(4/ln(10))"),
    ("5 squared times 3 factorial", "Calculate(5^2*3!)"),
    ("Exponential of 4 plus 3 to the power of 2", "Calculate(exp(4)+3^2)"),
    ("Square root of 100 plus 10 cubed minus 2", "Calculate(sqrt(100)+10^3-2)"),
    ("Find the area of a rectangle with length 6 and width 4.", "Calculate(6*4)"),
    ("If a car is traveling at 60 miles per hour, how far will it travel in 3 hours?", "Calculate(60*3)"),
    ("What is 15% of 80?", "Calculate(0.15*80)"),
    ("A shop sells a product for $25 with a 20% discount. What is the discounted price?", "Calculate(25-0.20*25)"),
]

unit_conversion_examples = [
    ("ft to cm", "UnitConvert(SourceUnit:foot, TargetUnit:centimeter, SourceValue:1)"),
    ("how many ounces in 5.8 kilograms", "UnitConvert(SourceUnit:kilogram, TargetUnit:ounce, SourceValue:5.8)"),
    ("Convert 5 miles to kilometers", 'UnitConvert(SourceUnit:"mile", TargetUnit:"kilometer", SourceValue:5)'),
    ("Change 1000 grams into kilograms", 'UnitConvert(SourceUnit:"gram", TargetUnit:"kilogram", SourceValue:1000)'),
    ("How many feet are in 3 meters?", 'UnitConvert(SourceUnit:"meter", TargetUnit:"foot", SourceValue:3)'),
    ("What is 20 acres in square meters?", 'UnitConvert(SourceUnit:"acre", TargetUnit:"square meter", SourceValue:20)'),
    ("I need 2 liters in milliliters.", 'UnitConvert(SourceUnit:"liter", TargetUnit:"milliliter", SourceValue:2)'),
    ("Convert 30°C to Fahrenheit", 'UnitConvert(SourceUnit:"Celsius", TargetUnit:"Fahrenheit", SourceValue:30)'),
    ("What's 100 pounds in ounces?", 'UnitConvert(SourceUnit:"pound", TargetUnit:"ounce", SourceValue:100)'),
    ("Change 60 miles per hour to kilometers per hour", 'UnitConvert(SourceUnit:"mile per hour", TargetUnit:"kilometer per hour", SourceValue:60)'),
    ("Translate 25 kilometers per hour to meters per second", 'UnitConvert(SourceUnit:"kilometer per hour", TargetUnit:"meter per second", SourceValue:25)'),
    ("Tell me the equivalent of 5 gallons in quarts", 'UnitConvert(SourceUnit:"gallon", TargetUnit:"quart", SourceValue:5)'),
    ("How many ounces in 2.5 pounds?", 'UnitConvert(SourceUnit:"pound", TargetUnit:"ounce", SourceValue:2.5)'),
    ("Change 60°F to Celsius", 'UnitConvert(SourceUnit:"Fahrenheit", TargetUnit:"Celsius", SourceValue:60)'),
    ("Convert 80 kilometers to miles", 'UnitConvert(SourceUnit:"kilometer", TargetUnit:"mile", SourceValue:80)'),
    ("What is 1800 square feet in square meters?", 'UnitConvert(SourceUnit:"square foot", TargetUnit:"square meter", SourceValue:1800)'),
    ("Translate 40 kilometers per hour to miles per hour", 'UnitConvert(SourceUnit:"kilometer per hour", TargetUnit:"mile per hour", SourceValue:40)'),
    ("How many liters in 2.5 gallons?", 'UnitConvert(SourceUnit:"gallon", TargetUnit:"liter", SourceValue:2.5)'),
    ("Convert 1 day to hours", 'UnitConvert(SourceUnit:"day", TargetUnit:"hour", SourceValue:1)'),
    ("How many minutes are in 2 hours?", 'UnitConvert(SourceUnit:"hour", TargetUnit:"minute", SourceValue:2)'),
    ("Change 30 seconds into minutes", 'UnitConvert(SourceUnit:"second", TargetUnit:"minute", SourceValue:30)'),
    ("What is 7 days in hours?", 'UnitConvert(SourceUnit:"day", TargetUnit:"hour", SourceValue:7)'),
    ("I need 120 minutes in hours.", 'UnitConvert(SourceUnit:"minute", TargetUnit:"hour", SourceValue:120)'),
    ("Convert 7200 seconds to hours", 'UnitConvert(SourceUnit:"second", TargetUnit:"hour", SourceValue:7200)'),
    ("How many milligrams in 3 grams?", 'UnitConvert(SourceUnit:"gram", TargetUnit:"milligram", SourceValue:3)'),
    ("Convert 2.5 liters to gallons", 'UnitConvert(SourceUnit:"liter", TargetUnit:"gallon", SourceValue:2.5)'),
    ("Change 100 square feet to square meters", 'UnitConvert(SourceUnit:"square foot", TargetUnit:"square meter", SourceValue:100)'),
    ("What is 50 miles per hour in kilometers per hour?", 'UnitConvert(SourceUnit:"mile per hour", TargetUnit:"kilometer per hour", SourceValue:50)'),
    ("How many pounds in 1 ton?", 'UnitConvert(SourceUnit:"ton", TargetUnit:"pound", SourceValue:1)'),
    ("Convert 20°C to Kelvin", 'UnitConvert(SourceUnit:"Celsius", TargetUnit:"Kelvin", SourceValue:20)'),
    ("What's 250 grams in ounces?", 'UnitConvert(SourceUnit:"gram", TargetUnit:"ounce", SourceValue:250)'),
    ("Change 60 kilometers to miles", 'UnitConvert(SourceUnit:"kilometer", TargetUnit:"mile", SourceValue:60)'),
    ("Translate 45 kilometers per hour to meters per second", 'UnitConvert(SourceUnit:"kilometer per hour", TargetUnit:"meter per second", SourceValue:45)'),
    ("Tell me the equivalent of 8 quarts in gallons", 'UnitConvert(SourceUnit:"quart", TargetUnit:"gallon", SourceValue:8)'),
    ("How many ounces in 3.5 pounds?", 'UnitConvert(SourceUnit:"pound", TargetUnit:"ounce", SourceValue:3.5)'),
    ("Change 32°F to Celsius", 'UnitConvert(SourceUnit:"Fahrenheit", TargetUnit:"Celsius", SourceValue:32)'),
    ("What is 10 square meters in square feet?", 'UnitConvert(SourceUnit:"square meter", TargetUnit:"square foot", SourceValue:10)'),
    ("How many gallons in 4 barrels?", 'UnitConvert(SourceUnit:"barrel", TargetUnit:"gallon", SourceValue:4)'),
    ("Convert 2 days to hours", 'UnitConvert(SourceUnit:"day", TargetUnit:"hour", SourceValue:2)'),
    ("Change 90 seconds into minutes", 'UnitConvert(SourceUnit:"second", TargetUnit:"minute", SourceValue:90)'),
]

In [9]:
from datasets import Dataset, DatasetDict
import random

# Combine all examples into a single list
all_examples = search_examples + cal_examples + unit_conversion_examples
all_examples.extend(loaded_data_list)
# Shuffle the combined list

random.seed(42)
random.shuffle(all_examples)

# Calculate the number of examples for train, val, and test
total_examples = len(all_examples)
train_size = int(0.8 * total_examples)
val_size = int(0.1 * total_examples)

# Split the combined list into train, val, and test sets
train_examples = all_examples[:train_size]
val_examples = all_examples[train_size:train_size + val_size]
test_examples = all_examples[train_size + val_size:]

# Create a dataset for each set
train_dataset = Dataset.from_dict({
    'query': [example[0] for example in train_examples],
    'api_string': [example[1] for example in train_examples]
})

val_dataset = Dataset.from_dict({
    'query': [example[0] for example in val_examples],
    'api_string': [example[1] for example in val_examples]
})

test_dataset = Dataset.from_dict({
    'query': [example[0] for example in test_examples],
    'api_string': [example[1] for example in test_examples]
})

print("Number of train examples:", len(train_dataset))
print("Number of val examples:", len(val_dataset))
print("Number of test examples:", len(test_dataset))

init_dataset = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})

Number of train examples: 1536
Number of val examples: 192
Number of test examples: 192
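
The printed sizes can be reproduced by hand. The sketch below mirrors the slicing arithmetic in the cell above (`split_sizes` is a hypothetical helper): the three hand-written lists hold 40 examples each, and the loaded file adds 1800.

```python
def split_sizes(total, train_frac=0.8, val_frac=0.1):
    """Truncate the train and validation sizes; the remainder goes to test."""
    train_size = int(train_frac * total)
    val_size = int(val_frac * total)
    test_size = total - train_size - val_size
    return train_size, val_size, test_size

# 40 search + 40 calculation + 40 unit-conversion examples, plus 1800 loaded ones
total = 40 + 40 + 40 + 1800
print(total)               # 1920
print(split_sizes(total))  # (1536, 192, 192)
```

Because the test split takes whatever remains, it absorbs any rounding loss when the total is not a multiple of ten.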


In [10]:
validation_dataset = init_dataset['validation']

# Print specific examples from the validation dataset
sample_size = 6
for i in range(sample_size):
    example = validation_dataset[i]
    print(f"Example {i + 1}:")
    print("Query:", example['query'])
    print("API String:", example['api_string'])
    print()


Example 1:
Query: What is 10 square meters in square feet?
API String: UnitConvert(SourceUnit:"square meter", TargetUnit:"square foot", SourceValue:10)

Example 2:
Query: 10 reasons why electric cars are good?
API String: Search()

Example 3:
Query:  Robin was making baggies of cookies with 6 cookies in each bag. If she had 23 chocolate chip cookies and 25 oatmeal cookies, how many baggies could she make? 
API String: Calculate((23.0+25.0)/6.0)

Example 4:
Query:  Will was organizing his baseball cards in a binder with 3 on each page. If he had 8 new cards and 10 old cards to put in the binder, how many pages would he use? 
API String: Calculate((8.0+10.0)/3.0)

Example 5:
Query:  Gwen was organizing her book case making sure each of the shelves had exactly 4 books on it. If she had 5 shelves of mystery books and 3 shelves of picture books, how many books did she have total? 
API String: Calculate(4.0*(5.0+3.0))

Example 6:
Query:  April's discount flowers was having a sale where each 

In [11]:
tokenizer_name = "Salesforce/codet5p-220m"
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

prefix = "Convert query: "
max_input_length = 256
max_target_length = 256

def preprocess_examples(examples):
  queries = examples["query"]
  api_strings = examples["api_string"]

  inputs = [prefix + query for query in queries]
  model_inputs = tokenizer(inputs, max_length=max_input_length, padding="max_length", truncation=True)

  labels = tokenizer(api_strings, max_length=max_target_length, padding="max_length", truncation=True).input_ids

  # replace the padding token ids with -100
  # so that they are ignored by the CrossEntropyLoss
  labels_with_ignore_index = []
  for labels_example in labels:
    labels_example = [label if label != tokenizer.pad_token_id else -100 for label in labels_example]
    labels_with_ignore_index.append(labels_example)

  model_inputs["labels"] = labels_with_ignore_index

  return model_inputs

In [12]:
dataset = init_dataset.map(preprocess_examples, batched=True)
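
The label-masking step inside `preprocess_examples` is easy to verify in isolation. This is a minimal sketch on made-up token ids (not real tokenizer output), with `mask_padding` as a hypothetical helper name; `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Toy ids; 0 plays the role of the pad id here.
toy_labels = [[5, 6, 2, 0, 0], [7, 2, 0, 0, 0]]
masked = mask_padding(toy_labels, pad_token_id=0)
print(masked)  # [[5, 6, 2, -100, -100], [7, 2, -100, -100, -100]]
```

Without this masking, most of each 256-token label sequence would be padding, and the loss would be dominated by the trivial task of predicting pad tokens.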


## Training

In [13]:
import wandb

wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mlynn_xu[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [14]:
import evaluate
import numpy as np

# BLEU metric from the `evaluate` library
metric = evaluate.load("bleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    bleu_result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": bleu_result["bleu"]}

    # Calculate Exact Match (EM)
    exact_match = sum([1 if p == l else 0 for p, l in zip(decoded_preds, decoded_labels)]) / len(decoded_labels)
    result["exact_match"] = exact_match
    result = {k: round(v, 4) for k, v in result.items()}
    return result

In [18]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer
import os
model_name= "Salesforce/codet5p-220m"

lr=5e-5
num_train_epochs=15
warmup_steps=200
weight_decay=0.05
fp16=False
batch_size = 8
config = {"epochs": num_train_epochs, "learning_rate": lr, "batch_size": batch_size, "warmup_steps":warmup_steps,
                "tokenizer": tokenizer_name, "model_name":model_name, "max_target_length": max_target_length,
          "max_input_length": max_input_length}
run = wandb.init(project="Query_Convert", notes="", config=config)

In [19]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
training_args = Seq2SeqTrainingArguments(
        report_to="wandb",
        output_dir="/content/drive/MyDrive/CodeT5/Notebooks/Checkpoints",
        overwrite_output_dir=True,
        do_train=True,
        save_strategy='epoch',
        evaluation_strategy="epoch",
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=int(batch_size/2),
        gradient_accumulation_steps=4,
        learning_rate=lr,
        weight_decay=weight_decay,
        warmup_steps=warmup_steps,
        logging_first_step=True,
        logging_steps=1,
        save_total_limit=1,
        dataloader_drop_last=True,
        local_rank=-1,
        deepspeed=None,
        fp16=fp16,
        predict_with_generate=True,
        generation_max_length=max_target_length
    )

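# Pad labels with -100 so that any padding the collator adds is ignored by the loss
# (inputs are already padded to max_length in preprocess_examples)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)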

trainer = Seq2SeqTrainer(
    model,
    training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

Epoch,Training Loss,Validation Loss,Bleu,Exact Match
1,0.4494,0.338337,0.2895,0.4531
2,0.2576,0.116052,0.687,0.6042
3,0.0643,0.057022,0.8351,0.7969
4,0.0785,0.041875,0.8801,0.8906
5,0.0268,0.02828,0.9314,0.8958
6,0.0211,0.019646,0.9533,0.9427
7,0.0089,0.02283,0.9503,0.9427
8,0.0178,0.028482,0.9379,0.9479
9,0.0023,0.019612,0.954,0.9375
10,0.0043,0.016132,0.9611,0.9531


TrainOutput(global_step=720, training_loss=0.2372070283683974, metrics={'train_runtime': 1562.432, 'train_samples_per_second': 14.746, 'train_steps_per_second': 0.461, 'total_flos': 7015194899251200.0, 'train_loss': 0.2372070283683974, 'epoch': 15.0})
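
The `global_step=720` in the output above follows from the training arguments. The sketch below reproduces the arithmetic, assuming a single GPU (the `num_devices` parameter and the helper name `optimizer_steps` are hypothetical): 1536 training examples, a per-device batch size of 8, 4 gradient-accumulation steps, and 15 epochs.

```python
def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Number of optimizer updates when incomplete batches are dropped."""
    batches_per_epoch = num_examples // (per_device_batch * num_devices)
    steps_per_epoch = batches_per_epoch // grad_accum
    return steps_per_epoch * epochs

effective_batch = 8 * 4  # per-device batch size x gradient accumulation steps
total_steps = optimizer_steps(1536, per_device_batch=8, grad_accum=4, epochs=15)
print(effective_batch, total_steps)  # 32 720
```

With 48 updates per epoch, the 200 warmup steps cover a little over the first four epochs of the run.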

In [20]:
save_directory = "." # save in the current working directory, you can change this of course
model.save_pretrained(save_directory)

## Inference

### Sanity check

In [21]:
inference_examples = []
for data_type in ['train', 'validation', 'test']:
  for _ in range(3):
    idx = random.randint(0, len(init_dataset[data_type])-1)
    test_example = init_dataset[data_type][idx]
    inference_examples.append((data_type, test_example['query'], test_example['api_string']))

In [22]:
model = AutoModelForSeq2SeqLM.from_pretrained(save_directory)

Note that there are several ways to generate text (greedy decoding, beam search, top-k sampling, etc.). Here we use greedy decoding and beam search as a sanity check.
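
The difference between the two strategies can be seen on a toy example. The sketch below does not use the fine-tuned model: `TOY_MODEL` is a made-up table of next-token probabilities, chosen so that the locally best first token leads to a worse overall sequence.

```python
import math

# Hypothetical toy "model": next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def greedy_decode(model, steps):
    """Pick the single most probable token at every step."""
    seq, logp = (), 0.0
    for _ in range(steps):
        token, p = max(model[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (token,), logp + math.log(p)
    return seq, logp

def beam_search(model, steps, num_beams):
    """Keep the num_beams best partial sequences by cumulative log-probability."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (tok,), logp + math.log(p))
            for seq, logp in beams
            for tok, p in model[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]

print(greedy_decode(TOY_MODEL, steps=2)[0])             # ('A', 'x'), probability 0.33
print(beam_search(TOY_MODEL, steps=2, num_beams=2)[0])  # ('B', 'x'), probability 0.36
```

Greedy decoding commits to "A" and cannot recover, while beam search keeps "B" alive and finds the more probable sequence, at the cost of scoring more candidates per step.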

In [23]:
example_results = []
print("******************greedy decoding*****************************")
for data_type, query, api_string in inference_examples:
  print("***********************************************")
  print("Type: ", data_type)
  print("Query: ", query)
  print("Ground truth: ", api_string)
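  # Note: the raw query is tokenized here (and in the cells below), without the
  # "Convert query: " prefix that preprocess_examples added during training.
  input_ids = tokenizer(query, return_tensors='pt').input_ids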
  # generate
  outputs = model.generate(input_ids, max_new_tokens=max_target_length)
  res = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print("Generated api string:", res)
  example_results.append((data_type, query, api_string, res))

******************greedy decoding*****************************
***********************************************
Type:  train
Query:   Victor was working as a sacker at a grocery store where he made 6 dollars an hour. On Monday he worked 5 hours and on Tuesday he worked 5 hours. How much money did Victor make in those two days? 
Ground truth:  Calculate(6.0*(5.0+5.0))
Generated api string: Calculate(6.0*(5.0+5.0))
***********************************************
Type:  train
Query:  10 reasons why jogging is good for you?
Ground truth:  Search()
Generated api string: Search()
***********************************************
Type:  train
Query:  100 reasons why i love you for husband?
Ground truth:  Search()
Generated api string: Search()
***********************************************
Type:  validation
Query:  10 reasons why employees get fired?
Ground truth:  Search()
Generated api string: Search()
***********************************************
Type:  validation
Query:  1 acre equals how

In [24]:
# activate beam search and early_stopping
print("******************activate beam search (5) and early_stopping*****************************")
example_results = []
for data_type, query, api_string in inference_examples:
  print("***********************************************")
  print("Type: ", data_type)
  print("Query: ", query)
  print("Ground truth: ", api_string)
  input_ids = tokenizer(query, return_tensors='pt')
  # generate
  # outputs = model.generate(input_ids)
  outputs = model.generate(
    **input_ids,
    num_beams=5,
    early_stopping=True, max_new_tokens=max_target_length
)
  res = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print("Generated api string:", res)
  example_results.append((data_type, query, api_string, res))

******************activate beam search (5) and early_stopping*****************************
***********************************************
Type:  train
Query:   Victor was working as a sacker at a grocery store where he made 6 dollars an hour. On Monday he worked 5 hours and on Tuesday he worked 5 hours. How much money did Victor make in those two days? 
Ground truth:  Calculate(6.0*(5.0+5.0))
Generated api string: Calculate(6.0*(5.0+5.0))
***********************************************
Type:  train
Query:  10 reasons why jogging is good for you?
Ground truth:  Search()
Generated api string: Search()
***********************************************
Type:  train
Query:  100 reasons why i love you for husband?
Ground truth:  Search()
Generated api string: Search()
***********************************************
Type:  validation
Query:  10 reasons why employees get fired?
Ground truth:  Search()
Generated api string: Search()
***********************************************
Type:  validati

## Calculate test metrics

In [25]:
def compute_test_metrics(decoded_preds, decoded_labels):
  decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
  bleu_result = metric.compute(predictions=decoded_preds, references=decoded_labels)
  result = {"bleu": bleu_result['bleu']}
  # Calculate Exact Match (EM)
  exact_match = sum([1.0 if p == l else 0 for p, l in zip(decoded_preds, decoded_labels)]) / len(decoded_labels)
  result["exact_match"] = exact_match
  result = {k: round(v, 4) for k, v in result.items()}
  return result

In [26]:
test_examples = []
for data_type in ['test']:
  for idx in range(len(init_dataset[data_type])):
    test_example = init_dataset[data_type][idx]
    test_examples.append((data_type, test_example['query'], test_example['api_string']))

test_results = []
for data_type, query, api_string in test_examples:
  input_ids = tokenizer(query, return_tensors='pt').input_ids
  # generate
  outputs = model.generate(input_ids, max_new_tokens=max_target_length)
  res = tokenizer.decode(outputs[0], skip_special_tokens=True)
  test_results.append((data_type, query, api_string, res))

In [27]:
decoded_preds_test = [item[3] for item in test_results]
decoded_labels_test = [item[2] for item in test_results]
test_metric = compute_test_metrics(decoded_preds_test, decoded_labels_test)
test_metric

{'bleu': 0.9727, 'exact_match': 0.9427}

## Performance summary
Validation data: {'bleu': 0.9563, 'exact_match': 0.9479}

Hold-out test data: {'bleu': 0.9727, 'exact_match': 0.9427}

## Upload trained model to the hub

Next, we can log in with the credentials of our Hugging Face account (you can sign up on [hf.co](https://hf.co) if you haven't already!).

In [28]:
from huggingface_hub import notebook_login

notebook_login()


In [30]:

new_model = "codet5p-220m-finetuned-for-query-convert"
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)


CommitInfo(commit_url='https://huggingface.co/blueplanet2373/codet5p-220m-finetuned-for-query-convert/commit/5b20589f04b762c7b20974052efd713a6e14fce5', commit_message='Upload tokenizer', commit_description='', oid='5b20589f04b762c7b20974052efd713a6e14fce5', pr_url=None, pr_revision=None, pr_num=None)

In [31]:
# finish after post-training analysis, testing, other logged code
wandb.finish()


Run history:
eval/bleu,▁▅▇▇███████████
eval/exact_match,▁▃▆▇▇██████████
eval/loss,█▃▂▂▁▁▁▁▁▁▁▁▁▁▁
eval/runtime,▁▄▄▄▄▄▅█▄▃▃▅▃▃█
eval/samples_per_second,█▄▅▄▅▄▄▁▅▅▅▃▅▆▁
eval/steps_per_second,█▄▅▄▅▄▄▁▅▅▅▃▅▆▁
train/epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
train/learning_rate,▁▂▂▃▄▄▅▆▆▇████▇▇▇▇▆▆▆▆▅▅▅▄▄▄▄▄▃▃▃▃▂▂▂▂▁▁
train/loss,█▅▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
eval/bleu,0.9563
eval/exact_match,0.9479
eval/loss,0.02166
eval/runtime,29.2147
eval/samples_per_second,6.572
eval/steps_per_second,1.643
train/epoch,15.0
train/global_step,720.0
train/learning_rate,0.0
train/loss,0.0032


In [None]:
# Empty VRAM
del model
del trainer
import gc
gc.collect()
gc.collect()

0

## TODO:
Data Cleaning: The training data could be cleaned more thoroughly. Known issues include inconsistent unit representations (e.g., "meter^2" vs. "square meter") and improperly formatted examples in which the quotation marks around the "SourceUnit" and "TargetUnit" arguments are missing. Fixing these should further improve model performance.

More exhaustive parameter search: Hyperparameter tuning can be automated using techniques like grid search or Bayesian optimization.

Inference Latency Optimization: Run a more rigorous inference-time test, evaluate deployment strategies, and apply optimizations where needed so that the model's real-time response meets the requirements for production deployment.
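
As a starting point for the Data Cleaning item, the two issues it names (missing quotation marks and inconsistent unit spellings) could be handled with a small normalization pass over the target strings. This is a sketch under stated assumptions, not the cleaning that was applied to the data in this notebook: `normalize_api_string` and the `UNIT_ALIASES` table are hypothetical, and the alias table contains only the one variant mentioned above.

```python
import re

# Hypothetical alias table; extend with whatever variants the data contains.
UNIT_ALIASES = {"meter^2": "square meter"}

# Matches SourceUnit / TargetUnit values, quoted or not, up to the next "," or ")".
_UNIT_ARG = re.compile(r'(SourceUnit|TargetUnit):\s*"?([^",)]+?)"?\s*(?=[,)])')

def normalize_api_string(api_string):
    """Quote unit arguments and map known unit aliases to one spelling."""
    def _quote(match):
        name, unit = match.group(1), match.group(2).strip()
        unit = UNIT_ALIASES.get(unit, unit)
        return f'{name}:"{unit}"'
    return _UNIT_ARG.sub(_quote, api_string)

print(normalize_api_string(
    "UnitConvert(SourceUnit:foot, TargetUnit:centimeter, SourceValue:1)"
))
print(normalize_api_string(
    'UnitConvert(SourceUnit:"acre", TargetUnit:meter^2, SourceValue:20)'
))
```

Strings that are already well formed pass through unchanged, and `Calculate(...)` and `Search()` targets are untouched because they contain no unit arguments.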




