We start by installing all the packages that are needed for the fine-tuning process.

In [1]:
!pip install -qqq -U torch transformers datasets evaluate accelerate peft trl langchain bitsandbytes tensorboard python-dotenv wandb --progress-bar off

Next, we import all the packages and functions that are needed. We will do the fine-tuning with the Hugging Face packages, which greatly simplify the process.

In [2]:
import gc
import json
import wandb
from dotenv import load_dotenv

import torch
from datasets import load_dataset, Dataset
from peft import (
    LoraConfig, 
    PeftModel,
)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from trl import SFTTrainer
from evaluate import load
from accelerate import PartialState
import langchain
from langchain.cache import SQLiteCache
from statistics import mean

It is important to include the HF_token in the *.env* file. At the time this notebook was created, the model that we are going to fine-tune (Llama 3 8B) is only available after requesting access. The HF_token identifies the user and makes the model available to us.

In [3]:
load_dotenv(".env", override=True)

True
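
As a quick sanity check, the sketch below confirms that the token is visible to the process once `load_dotenv` has run. The variable name `HF_token` follows the text above and may differ from the key in your own *.env* file. The helper `get_hf_token` is hypothetical and not part of any library.

```python
import os


def get_hf_token(env=None, key="HF_token"):
    """Return the token stored under `key`, or raise if it is missing or empty.

    `key` is an assumption taken from the text above; adjust it to match
    the name actually used in your .env file.
    """
    env = os.environ if env is None else env
    token = env.get(key, "").strip()
    if not token:
        raise KeyError(f"{key} is not set; add it to the .env file")
    return token


# Example with an explicit mapping instead of the real environment
print(get_hf_token({"HF_token": "hf_example"}))
```

Calling `get_hf_token()` with no arguments reads the real environment, so a missing token fails early instead of surfacing later as an access error when the model is downloaded.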

In [4]:
wandb.login()

wandb: Currently logged in as: martinriosgarcia (prem-incar). Use `wandb login --relogin` to force relogin


True

In [5]:
# Model
base_model = "meta-llama/Meta-Llama-3-8B-Instruct"
# Name of the new model
new_model = "TunedLlama-3-8B"

# Dataset
dataset_path = "LLM_organic_synthesis/workplace_data/datasets/USPTO-n100k-t2048_exp1/train.json"

In [6]:
device_string = PartialState().process_index

compute_dtype = getattr(torch, "float16")

In [7]:
dataset = load_dataset("json", data_files=dataset_path, split="train")
dataset = dataset.shuffle(seed=42).select(range(100))  # Only use 100 samples for a quick demo

In [8]:
dataset = dataset.train_test_split(test_size=0.1, seed=42)
dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'output'],
        num_rows: 90
    })
    test: Dataset({
        features: ['instruction', 'output'],
        num_rows: 10
    })
})

In [9]:
dataset['train'][0]

{'instruction': 'Below is a description of an organic reaction. Extract information from it to an ORD JSON record.\n\n### Procedure:\n0.007 ml of propane-2-sulphonyl chloride is added to a solution of 0.026 g of tert-butyl {1-(2-amino -1-hydroxyethyl)-3-[3-(3-methoxypropyl)-1-methyl-1H-indol-5-ylmethyl]-4-methylpentyl}carbamate (Example 3Kb), and 0.007 ml of triethylamine in 1 ml of dichloromethane is added at 0° C. After 6 hours, the reaction mixture is concentrated by evaporation—the N-Boc intermediate is identified on the basis of the Rf value from the residue by means of flash chromatography (SiO2 60F). The N-Boc intermediate is dissolved in 0.82 ml of 4N HCl/dioxane—after 4 hours, the reaction mixture is concentrated by evaporation, and the residue is dissolved in 0.5 ml of tert-butanol, frozen in liquid nitrogen and lyophilized under high vacuum overnight. The title compound is identified on the basis of the Rf value from the residue.\n\n### ORD JSON:\n',
 'output': '{"inputs": {

In [10]:
# QLoRA config
compute_dtype = torch.float16  # Compute dtype for 4-bit base models
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Activate 4-bit precision base model loading
    bnb_4bit_quant_type="nf4",  # Quantization type (fp4 or nf4)
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,  # Activate nested quantization for 4-bit base models
)

# LoRA config
peft_config = LoraConfig(
    r=64, # Rank of the update matrices (int). A lower rank gives smaller update matrices with fewer trainable parameters.
    lora_alpha=16, # LoRA scaling factor. It controls how strongly the adapter weights affect the base model's weights.
    lora_dropout=0.1, # Dropout probability for the LoRA layers. Dropout is a regularization technique that randomly turns off a proportion of units during training to prevent overfitting.
    bias="none", # Whether the bias parameters are trained. Can be 'none', 'all' or 'lora_only'.
    task_type="CAUSAL_LM", # Task to perform: causal language modeling.
)
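
To make the effect of `r` and `lora_alpha` concrete, here is a minimal sketch of the standard LoRA arithmetic: the update is scaled by `lora_alpha / r`, and each adapted matrix gains `r * (d_in + d_out)` trainable parameters. The 4096 x 4096 matrix size is an assumption chosen for illustration, and both helper functions are hypothetical.

```python
def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update before it is added to the frozen weight."""
    return lora_alpha / r


def lora_trainable_params(d_in, d_out, r):
    """Parameters added by one adapted matrix: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Values from the LoraConfig above
print(lora_scaling(lora_alpha=16, r=64))  # 0.25

# Hypothetical 4096 x 4096 projection matrix (the size is an assumption)
print(lora_trainable_params(d_in=4096, d_out=4096, r=64))  # 524288
```

With these settings the adapter update is scaled down by a factor of four, and each adapted matrix of that size adds roughly half a million trainable parameters against about 16.8 million frozen ones.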

In [11]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Model config
# device_map="auto" decides where the model is placed and spreads it across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Note the special tokens that are introduced. They are the same ones that Meta used to train the model, and they tell the model which role is speaking and where each role's statement begins and ends. They are documented here: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/ (the page covers both Llama 3 and Llama 3 Instruct, so check which model each format applies to).

In [12]:
def formatting_prompts_func(prompt):
    return f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{prompt['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n{prompt['output']}<|eot_id|><|end_of_text|>"
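
The sketch below applies the template to a toy record so the resulting string can be inspected. The record contents are made up, and `build_inference_prompt` is a hypothetical helper that is not part of the original notebook. It stops right after the assistant header, which is where the model would be expected to start generating.

```python
def formatting_prompts_func(prompt):
    # Same template as the cell above
    return f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{prompt['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n{prompt['output']}<|eot_id|><|end_of_text|>"


def build_inference_prompt(instruction):
    # Hypothetical helper: the training template without the answer part
    return f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"


# Toy record with the same keys as the dataset ('instruction', 'output')
record = {
    "instruction": "Extract the solvent.",
    "output": '{"solvent": "dichloromethane"}',
}

text = formatting_prompts_func(record)
print(text)
```

Keeping the inference prompt as an exact prefix of the training text means the model sees the same token layout at generation time as it did during fine-tuning.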

In [13]:
training_arguments = TrainingArguments(
    learning_rate=2e-6,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    num_train_epochs=20,
    fp16=False,
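    bf16=True,  # Set bf16=True when training on an A100
    logging_steps=1,
    evaluation_strategy="steps",
    eval_steps=0.05,  # Values below 1 are read as a fraction of the total training steps
    max_grad_norm=0.3,
    warmup_steps=10,  # A positive warmup_steps takes precedence over warmup_ratio
    warmup_ratio=0.03,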
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="wandb",
    output_dir="./results/",
    save_strategy='no', # Only save the final model, not the checkpoints
)
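
To illustrate what `lr_scheduler_type="cosine"` with `warmup_steps=10` does to the learning rate, here is a minimal sketch of a linear warmup followed by a half-cosine decay to zero. It is a simplified re-implementation for illustration, not the library's code. The default `total_steps=840` is taken from the `global_step` this notebook reports after training, and the function name `lr_at_step` is hypothetical.

```python
import math


def lr_at_step(step, base_lr=2e-6, warmup_steps=10, total_steps=840):
    """Linear warmup followed by a half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Start of warmup, mid warmup, peak, halfway through the decay, final step
for step in (0, 5, 10, 425, 840):
    print(step, lr_at_step(step))
```

The rate climbs to its peak of 2e-6 at step 10 and then falls smoothly, reaching half the peak midway through the decay and zero at the last step.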

In [14]:
trainer = SFTTrainer(
    model=model,
    max_seq_length=None,
    args=training_arguments,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    tokenizer=tokenizer,
    packing=True,  # Concatenate several formatted examples into each training sequence
    formatting_func=formatting_prompts_func,
)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
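
The idea behind `packing=True` can be sketched in a few lines: tokenized examples are joined into one stream, separated by an end-of-sequence token, and the stream is cut into fixed-length chunks. This is a simplified illustration of the concept under that assumption, not TRL's actual implementation. The function name `pack_sequences` and the toy token ids are hypothetical.

```python
def pack_sequences(tokenized_examples, seq_len, eos_id):
    """Concatenate examples (each followed by eos_id) and cut into seq_len chunks.

    The incomplete tail, if any, is dropped.
    """
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
        stream.append(eos_id)
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Toy token ids; 0 plays the role of the EOS token
packed = pack_sequences([[1, 2, 3], [4, 5], [6]], seq_len=4, eos_id=0)
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6]]
```

Because examples are merged, the number of training sequences (and therefore steps per epoch) no longer equals the number of dataset rows.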


In [15]:
trainer.train()

Step  Training Loss  Validation Loss
42    1.31           1.242012
84    1.401          1.236067
126   1.2022         1.229693
168   0.9482         1.223773
210   1.1151         1.216029
252   1.2892         1.209604
294   1.2132         1.20536
336   1.112          1.203412
378   1.3029         1.202893
420   1.2925         1.202379


TrainOutput(global_step=840, training_loss=1.2616559531007494, metrics={'train_runtime': 637.4507, 'train_samples_per_second': 2.635, 'train_steps_per_second': 1.318, 'total_flos': 7.774663832764416e+16, 'train_loss': 1.2616559531007494, 'epoch': 20.0})
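
The evaluation rows above are 42 steps apart, which is consistent with `eval_steps=0.05` being read as a fraction of the 840 total steps. The sketch below reproduces that arithmetic. Rounding up with `math.ceil` reflects how the Trainer is understood to resolve fractional values and should be treated as an assumption. `resolve_eval_steps` is a hypothetical helper.

```python
import math


def resolve_eval_steps(eval_steps, max_steps):
    """Turn a fractional eval_steps (< 1) into an absolute step interval."""
    if eval_steps < 1:
        return math.ceil(max_steps * eval_steps)
    return int(eval_steps)


interval = resolve_eval_steps(0.05, max_steps=840)
print(interval)  # 42

# First few steps at which an evaluation is run
print(list(range(interval, 840 + 1, interval))[:3])  # [42, 84, 126]
```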

In [16]:
trainer.save_model('final_checkpoint')
tokenizer.save_pretrained('final_checkpoint')



('final_checkpoint/tokenizer_config.json',
 'final_checkpoint/special_tokens_map.json',
 'final_checkpoint/tokenizer.json')

In [17]:
# Flush memory
del trainer, model
gc.collect()
gc.collect()
torch.cuda.empty_cache()

In [18]:
# Reload tokenizer and model
llama_model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [19]:
# Merge adapter with base model
sft_model = PeftModel.from_pretrained(llama_model, 'final_checkpoint')
sft_model = sft_model.merge_and_unload()

# Save model and tokenizer
sft_model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)



('TunedLlama-3-8B/tokenizer_config.json',
 'TunedLlama-3-8B/special_tokens_map.json',
 'TunedLlama-3-8B/tokenizer.json')
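
Conceptually, merging folds each low-rank update back into its frozen weight matrix: W' = W + (lora_alpha / r) * B A. The sketch below shows that arithmetic on a toy 2 x 2 matrix with plain Python lists. It illustrates the formula only and is not what `merge_and_unload` executes internally. All numbers and the helper name `merge_lora` are made up.

```python
def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A) using plain nested lists.

    W is (d_out x d_in), B is (d_out x r), A is (r x d_in).
    """
    scale = lora_alpha / r
    d_out, d_in = len(W), len(W[0])
    merged = []
    for i in range(d_out):
        row = []
        for j in range(d_in):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            row.append(W[i][j] + scale * delta)
        merged.append(row)
    return merged


# Toy 2 x 2 weight with a rank-1 adapter (all numbers are made up)
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
print(merge_lora(W, A, B, lora_alpha=2, r=1))  # [[7.0, 8.0], [12.0, 17.0]]
```

After merging, the model has the same shape as the base model, so it can be loaded and served without the PEFT adapter files.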