We start by installing all the packages that are needed for the fine-tuning process.

In [1]:
!pip install -qqq -U torch transformers datasets evaluate accelerate peft trl langchain bitsandbytes tensorboard python-dotenv --progress-bar off

Next, we import the packages and functions we need. We fine-tune with the Hugging Face libraries, which greatly simplify the process.

In [2]:
import gc
import json
from dotenv import load_dotenv

import torch
from datasets import load_dataset
from peft import (
    LoraConfig, 
    PeftModel,
)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from trl import SFTTrainer
from evaluate import load
from accelerate import PartialState
import langchain
from langchain.cache import SQLiteCache
from statistics import mean

The Hugging Face access token (HF_token) must be set in the *.env* file. At the time this notebook was written, the model we are going to fine-tune (Llama 3 8B Instruct) is gated: it can only be downloaded after requesting access. The token identifies the user, which makes the model available to us.

In [3]:
load_dotenv(".env", override=True)

True
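
As a minimal sketch of what this step relies on: `load_dotenv` copies the entries of the *.env* file into the process environment, so a quick check that a token is actually present can save a failed download later. The helper `find_hf_token` is hypothetical, and the variable names it looks for (`HF_TOKEN`, `HF_token`) are assumptions, since the exact key used in the file is not shown here.

```python
import os


def find_hf_token(env=None, names=("HF_TOKEN", "HF_token")):
    """Return the first non-empty token found under the candidate names, else None."""
    env = os.environ if env is None else env
    for name in names:  # candidate names are assumptions, adjust to your .env file
        value = env.get(name, "").strip()
        if value:
            return value
    return None


if find_hf_token() is None:
    print("No Hugging Face token found; gated models will fail to download.")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.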

In [4]:
# Model
base_model = "meta-llama/Meta-Llama-3-8B-Instruct"
# Name of the new model
new_model = "TunedLlama-3-8B"

# Dataset
dataset_path = "LLM_organic_synthesis/workplace_data/datasets/USPTO-n100k-t2048_exp1/train.json"

In [5]:
device_string = PartialState().process_index

compute_dtype = getattr(torch, "float16")

In [6]:
dataset = load_dataset("json", data_files=dataset_path, split="train")
dataset = dataset.shuffle(seed=42).select(range(1000)) # Only use 1000 samples for a quick demo

In [7]:
dataset = dataset.train_test_split(test_size=0.01, seed=42)
dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'output'],
        num_rows: 990
    })
    test: Dataset({
        features: ['instruction', 'output'],
        num_rows: 10
    })
})
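
The split sizes above follow from simple arithmetic on the 1000 selected samples. The sketch below reproduces them; `split_sizes` is a hypothetical helper, and rounding the test split up is an assumption about how a fractional `test_size` is resolved.

```python
import math


def split_sizes(n_samples, test_size):
    """Sizes of the train and test splits for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # assumption: test split is rounded up
    n_train = n_samples - n_test
    return n_train, n_test


print(split_sizes(1000, 0.01))
```

With only 10 evaluation examples the validation loss is a noisy estimate, which is acceptable for a quick demo but worth keeping in mind when reading the results.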

In [8]:
dataset['train'][0]

{'instruction': "Below is a description of an organic reaction. Extract information from it to an ORD JSON record.\n\n### Procedure:\n110 mg of (R)-α-lipoic acid were dissolved in 2 ml of anhydrous dimethylformamide, and 97 mg of N,N'-carbonyldiimidazole were added to the solution, whilst ice-cooling. The mixture was then stirred at room temperature for 4 hours. At the end of this time, 57 mg of methanesulfonamide and 26 mg of sodium hydride (as a 55% w/w dispersion in mineral oil) were added to the reaction mixture, whilst ice-cooling, and the mixture was stirred at room temperature for 5 hours and then left to stand overnight. The solvent was then removed from the reaction mixture by evaporation under reduced pressure, and water was added to the residue this obtained. The resulting mixture was neutralized by the addition of 2 N aqueous hydrochloric acid, after which it was extracted with ethyl acetate. The extraction solution was washed with a saturated aqueous solution of sodium chl

In [9]:
# QLoRA config
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Load the base model in 4-bit precision
    bnb_4bit_quant_type="nf4",  # Quantization type (fp4 or nf4)
    bnb_4bit_compute_dtype=compute_dtype,  # Compute dtype for 4-bit base models
    bnb_4bit_use_double_quant=True,  # Nested quantization for 4-bit base models
)

# LoRA config
peft_config = LoraConfig(
    r=64,  # Rank of the update matrices (int). A lower rank gives smaller update matrices with fewer trainable parameters.
    lora_alpha=16,  # LoRA scaling factor. The adapter update is scaled by lora_alpha / r before it is added to the base model's weights.
    lora_dropout=0.1,  # Dropout probability applied in the LoRA layers during training to reduce overfitting.
    bias="none",  # Which bias parameters to train: 'none', 'all' or 'lora_only'.
    task_type="CAUSAL_LM",  # Task type: causal language modeling.
)
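
To make the effect of `r` and `lora_alpha` concrete, the sketch below computes the scaling factor applied to the adapter update and the number of trainable parameters LoRA adds to one weight matrix. The helper names are hypothetical, and the 4096 x 4096 matrix size is an assumption chosen for illustration.

```python
def lora_scaling(r, lora_alpha):
    """Factor that multiplies the low-rank update B @ A."""
    return lora_alpha / r


def lora_trainable_params(d_out, d_in, r):
    """Parameters in A (r x d_in) plus B (d_out x r) for one adapted matrix."""
    return r * (d_in + d_out)


r, lora_alpha = 64, 16
hidden = 4096  # assumed size of a square projection matrix, for illustration only

print("scaling:", lora_scaling(r, lora_alpha))
print("adapter parameters:", lora_trainable_params(hidden, hidden, r))
print("frozen parameters: ", hidden * hidden)
```

With these settings the adapter for such a matrix holds about 3% as many parameters as the frozen matrix it modifies.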

In [10]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Model config
# device_map="auto" places the model automatically, spreading it across multiple GPUs when they are available.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
model.config.use_cache = False
model.config.pretraining_tp = 1

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Note the special tokens used in the prompt template. They are the same ones Meta used when training the model, and they tell the model which role is speaking and where each role's message begins and ends. They are documented at https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/ (the page covers both Llama 3 and Llama 3 Instruct, so check which model each format applies to).

In [11]:
def formatting_prompts_func(prompt):
    return f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>{prompt['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>{prompt['output']}<|eot_id|>"
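
The published Llama 3 Instruct format places two newline characters between each role header and the message body, which the function above omits. A variant that follows the published layout could look like the sketch below. `format_llama3_prompt` is a hypothetical name, and the results reported in this notebook were obtained with the original function, not with this one.

```python
def format_llama3_prompt(example):
    """Build one training string from a record with 'instruction' and 'output' keys."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{example['instruction']}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{example['output']}<|eot_id|>"
    )


sample = {"instruction": "Extract the reaction.", "output": "{}"}  # made-up record
print(format_llama3_prompt(sample))
```

Whichever layout is used for training, the same layout should be used when prompting the fine-tuned model.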

In [12]:
training_arguments = TrainingArguments(
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    num_train_epochs=2,
    fp16=False,
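    bf16=True,  # Use bfloat16 on GPUs that support it, such as an A100
    logging_steps=1,
    evaluation_strategy="steps",  # Newer transformers releases name this argument eval_strategy
    eval_steps=0.5,  # A float below 1 is a fraction of the total training steps: evaluate at 50% and 100%
    max_grad_norm=0.3,
    warmup_steps=10,  # Takes precedence over warmup_ratio when greater than 0
    warmup_ratio=0.03,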
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    output_dir="./results/",
    save_strategy='no',  # Only save the final model, not intermediate checkpoints
)
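
Two of these arguments are easy to misread: `eval_steps=0.5` is a fraction rather than a step count, and `warmup_steps` and `warmup_ratio` are both set. The sketch below shows how such values resolve to concrete step numbers for the 882 optimizer steps this run ends up with. Both helpers are hypothetical and only approximate the trainer's behaviour under the stated assumptions.

```python
import math


def resolve_eval_steps(max_steps, eval_steps):
    """A value below 1 is read as a fraction of the total number of training steps."""
    return math.ceil(max_steps * eval_steps) if eval_steps < 1 else int(eval_steps)


def resolve_warmup_steps(max_steps, warmup_steps, warmup_ratio):
    """An explicit warmup_steps wins; otherwise the ratio is applied to the total."""
    return warmup_steps if warmup_steps > 0 else math.ceil(max_steps * warmup_ratio)


max_steps = 882  # total optimizer steps reported by trainer.train() in this notebook

print("evaluate every", resolve_eval_steps(max_steps, 0.5), "steps")
print("warmup steps used:", resolve_warmup_steps(max_steps, 10, 0.03))
print("warmup steps from the ratio alone:", resolve_warmup_steps(max_steps, 0, 0.03))
```

Setting only one of the two warmup arguments avoids any ambiguity about which one is in effect.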

In [13]:
trainer = SFTTrainer(
    model=model,
    max_seq_length=None,
    args=training_arguments,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    tokenizer=tokenizer,
    packing=True,  # Pack several short examples into each fixed-length training sequence
    formatting_func=formatting_prompts_func,
    # dataset_text_field='instruction',
)



Generating train split: 0 examples [00:00, ? examples/s]


Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
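
Packing is the reason the number of training sequences differs from the number of examples: tokenized examples are concatenated into one stream and cut into blocks of equal length. The function below is a simplified sketch of that idea, not the library's implementation; `pack_sequences` and the token ids are made up for illustration.

```python
def pack_sequences(sequences, eos_id, block_size):
    """Concatenate token-id lists, separated by EOS, and cut into fixed-size blocks."""
    stream = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)  # mark the boundary between two examples
    # Keep full blocks only; the incomplete tail is dropped in this sketch
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


examples = [[1, 2, 3], [4, 5], [6]]  # made-up token ids
print(pack_sequences(examples, eos_id=0, block_size=4))
```

Because every block is full, no compute is spent on padding tokens, at the cost of some blocks containing the end of one example and the start of the next.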


In [14]:
trainer.train()

Step,Training Loss,Validation Loss
441,0.4231,0.582815
882,0.4861,0.574735


TrainOutput(global_step=882, training_loss=0.5061882249757547, metrics={'train_runtime': 645.9727, 'train_samples_per_second': 2.731, 'train_steps_per_second': 1.365, 'total_flos': 8.163397024402637e+16, 'train_loss': 0.5061882249757547, 'epoch': 2.0})
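
The reported numbers can be cross-checked with a few lines of arithmetic. The sketch below converts the final validation loss into a perplexity and recomputes the throughput figures from the step count and runtime; it assumes the loss is a mean per-token cross-entropy in nats and that training ran on a single device.

```python
import math

eval_loss = 0.574735      # final validation loss from the table above
global_step = 882         # total optimizer steps
train_runtime = 645.9727  # seconds
batch_size = 2            # per_device_train_batch_size, assuming a single device

perplexity = math.exp(eval_loss)
steps_per_second = global_step / train_runtime
samples_per_second = steps_per_second * batch_size

print(f"perplexity: {perplexity:.2f}")
print(f"steps per second: {steps_per_second:.3f}")
print(f"samples per second: {samples_per_second:.3f}")
```

The recomputed throughput agrees with `train_steps_per_second` and `train_samples_per_second` in the output, which confirms that each step processed two packed sequences.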

In [15]:
trainer.save_model('final_checkpoint')
tokenizer.save_pretrained('final_checkpoint')



('final_checkpoint/tokenizer_config.json',
 'final_checkpoint/special_tokens_map.json',
 'final_checkpoint/tokenizer.json')

In [16]:
# Flush memory
del trainer, model
gc.collect()
gc.collect()
torch.cuda.empty_cache()
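
The explicit `gc.collect()` matters because `del` only removes names: objects caught in reference cycles stay alive until the garbage collector runs, and cached GPU memory can only be released once the tensors that own it are gone. The standard-library sketch below illustrates the first half of that statement, with a hypothetical `Holder` class standing in for a model.

```python
import gc
import weakref


class Holder:
    """Stand-in for an object that owns a large buffer."""

    def __init__(self):
        self.buffer = bytearray(1024)
        self.me = self  # reference cycle: the reference count never reaches zero


gc.disable()  # keep automatic collection from interfering with the demonstration
holder = Holder()
probe = weakref.ref(holder)  # lets us observe the object without keeping it alive

del holder
alive_after_del = probe() is not None

gc.collect()
alive_after_collect = probe() is not None
gc.enable()

print("alive after del:", alive_after_del)
print("alive after gc.collect():", alive_after_collect)
```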

In [17]:
# Reload tokenizer and model
llama_model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [18]:
# Merge adapter with base model
sft_model = PeftModel.from_pretrained(llama_model, 'final_checkpoint')
sft_model = sft_model.merge_and_unload()

# Save model and tokenizer
sft_model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)



('TunedLlama-3-8B/tokenizer_config.json',
 'TunedLlama-3-8B/special_tokens_map.json',
 'TunedLlama-3-8B/tokenizer.json')
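
Conceptually, merging folds the low-rank update into each adapted weight matrix, W' = W + (lora_alpha / r) * B A, so the saved model no longer needs the adapter at inference time. The pure-Python sketch below shows this on a tiny made-up example; `matmul` and `merge_lora` are hypothetical helpers, not the library's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, r, lora_alpha):
    """Return W + (lora_alpha / r) * (B @ A) as a new matrix."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Made-up 2 x 2 weight with a rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, -1.0]]

merged = merge_lora(W, B, A, r=1, lora_alpha=2)
print(merged)
```

Applying `merged` to an input gives the same result as applying `W` and the scaled adapter separately and summing the two outputs, which is why the merged model behaves like the base model with the adapter attached.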