We start by installing all the packages needed for the fine-tuning process.

In [1]:
!pip install -qqq -U torch transformers datasets evaluate accelerate peft trl langchain bitsandbytes tensorboard python-dotenv wandb --progress-bar off

Next, we import all the packages and functions we need. The fine-tuning relies on the Hugging Face libraries, which greatly simplify the process.

In [2]:
import warnings
import os
warnings.filterwarnings('ignore')
os.environ["WANDB_SILENT"]="true"

In [3]:
import gc
import json
import wandb
from dotenv import load_dotenv

import torch
from datasets import load_dataset
from peft import (
    LoraConfig, 
    PeftModel,
)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from trl import (
    SFTTrainer,
    DataCollatorForCompletionOnlyLM,
)
from evaluate import load
from accelerate import PartialState
import langchain
from langchain.cache import SQLiteCache
from statistics import mean

```{margin}
It is important to include the *HF_token* in the *.env* file. At the time this notebook was created, the model that we are going to fine-tune (Llama3-8B) was only available after requesting access.
```

In [4]:
load_dotenv(".env", override=True)

True
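
A missing token usually only shows up later, when the model download is refused. The sketch below is one way to fail early with a clear message. The helper `require_env` and the variable `EXAMPLE_TOKEN` are hypothetical names used for illustration; in the notebook you would pass the name under which the token is stored in *.env* (the Hugging Face libraries read `HF_TOKEN`, which is assumed here to be the intended name).

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to the .env file before continuing.")
    return value


# Demo with a placeholder variable so the sketch runs on its own.
os.environ["EXAMPLE_TOKEN"] = "hf_placeholder"  # hypothetical name and value
token = require_env("EXAMPLE_TOKEN")
print(f"EXAMPLE_TOKEN found ({len(token)} characters)")
```

Calling `require_env("HF_TOKEN")` right after `load_dotenv` would stop the notebook before any expensive step if the token is absent.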

```{margin}
This logs the training run to Weights & Biases, which makes it easy to inspect the loss curve and other training metrics.
```

In [5]:
wandb.login()

True

In [6]:
# Model
base_model = "meta-llama/Meta-Llama-3-8B-Instruct"
# Name of the new model
new_model = "TunedLlama-3-8B"

# Dataset
dataset_path = "LLM_organic_synthesis/workplace_data/datasets/USPTO-n100k-t2048_exp1/train.json"

In [7]:
device_string = PartialState().process_index

compute_dtype = getattr(torch, "float16")

In [8]:
dataset = load_dataset("json", data_files=dataset_path, split="train")
dataset = dataset.shuffle(seed=42).select(range(1000)) # Only use 1000 samples for a quick demo

In [9]:
dataset = dataset.train_test_split(test_size=0.1, seed=42)
dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'output'],
        num_rows: 900
    })
    test: Dataset({
        features: ['instruction', 'output'],
        num_rows: 100
    })
})

In [10]:
dataset['train'][0]

{'instruction': 'Below is a description of an organic reaction. Extract information from it to an ORD JSON record.\n\n### Procedure:\nTo a solution of 5-bromo-2-methyl-1H-indole (3.0 g, 14.35 mmol) in dry tetrahydrofuran (20 ml) was added sodium hydride (900 mg, 22.5 mmol) with ice-cooling. After stifling for about 30 min, a solution of t-BuLi (27.5 ml, 1.3 M solution in hexane) was added dropwise with stifling at −78° C. under an inert atmosphere of nitrogen. The reaction mixture was warmed slowly to −40° C. over 45 min and stirred at this temperature for another 30 min. The mixture was cooled again below −78° C., followed by the addition of 4,4,5,5-tetramethyl-2-(propan-2-yloxy)-1,3,2-dioxaborolane (5.3 g, 28.49 mmol) dropwise. After warming to room temperature, the mixture was quenched with NH4Cl solution (100 ml) and extracted with ethyl acetate (3×100 ml). The combined organic layers were dried over anhydrous sodium sulfate, filtered and concentrated under reduced pressure to give

In [11]:
# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Activate 4-bit precision base model loading
    bnb_4bit_quant_type="nf4", # Quantization type (fp4 or nf4)
    bnb_4bit_use_double_quant=True, # Activate nested quantization for 4-bit base models
    bnb_4bit_compute_dtype=torch.bfloat16 # Compute dtype for 4-bit base models
)

# LoRA config
peft_config = LoraConfig(
    r=32, # Rank of the update matrices (an integer). A lower rank gives smaller update matrices with fewer trainable parameters.
    lora_alpha=64, # LoRA scaling factor. It controls how strongly the adaptation layers' weights affect the base model's weights.
    lora_dropout=0.1, # Dropout is a regularization technique where a proportion of neurons (or parameters) are randomly “dropped out” or turned off during training to prevent overfitting.
    bias="none", # Specifies if the bias parameters should be trained. Can be 'none', 'all' or 'lora_only'.
    task_type="CAUSAL_LM", # Task to perform, Causal LM: Causal language modeling.
)
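
To get a feeling for what `r=32` and `lora_alpha=64` mean in practice, the following back-of-the-envelope sketch counts the parameters that one LoRA adapter adds to a single weight matrix and computes the scaling applied to the low-rank update. It assumes a square 4096 x 4096 projection (the hidden size of Llama-3-8B) and a round figure of 8 billion parameters for the memory estimate; both are illustrative assumptions, not numbers measured in this notebook.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumption: a square 4096 x 4096 projection matrix.
d_in = d_out = 4096
r, lora_alpha = 32, 64  # same values as in the LoraConfig

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
scaling = lora_alpha / r  # the low-rank update is multiplied by alpha / r

print(f"full matrix:  {full:,} parameters")
print(f"LoRA adapter: {added:,} parameters ({added / full:.2%} of the matrix)")
print(f"update scaling factor: {scaling}")

# Rough weight-memory estimate, assuming about 8e9 parameters.
n_params = 8e9
print(f"16-bit weights: ~{n_params * 2 / 1e9:.0f} GB")
print(f"4-bit weights:  ~{n_params * 0.5 / 1e9:.0f} GB (ignoring quantization constants)")
```

Under these assumptions the adapter trains well under 2% of the parameters of each adapted matrix, while 4-bit loading cuts the memory for the frozen weights to roughly a quarter of the 16-bit footprint.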

In [12]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Model config
# device_map="auto" places the model automatically, spreading it across the available GPUs if there are several.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [13]:
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

Note the special tokens (such as `<|start_header_id|>` and `<|end_header_id|>`) that appear in the evaluation prompts later in this notebook. They are the same ones that Meta used when training the model, and they tell the model which role is speaking and where each role's instructions begin. They are documented here: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

```{margin}
When consulting the link, note that it covers both models, Llama-3 and Llama-3 Instruct. This notebook uses the Instruct version.
```

In [14]:
training_arguments = TrainingArguments(
    learning_rate=3e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    num_train_epochs=10,
    fp16=False,
    bf16=True, # Use bf16 on GPUs that support it, such as an A100
    logging_steps=1,
    evaluation_strategy="steps",
    eval_steps=0.05,
    max_grad_norm=0.3,
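    warmup_steps=10, # A positive warmup_steps takes precedence over warmup_ratio
    warmup_ratio=0.03, # Ignored here because warmup_steps is set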
    group_by_length=True,
    lr_scheduler_type="cosine",
    output_dir="./results/",
    save_strategy='no', # Do not save checkpoints during training
)
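
The batch size, gradient accumulation and fractional `eval_steps` interact in a way that is easy to misread. The sketch below is a simplified approximation of the bookkeeping, assuming a single training process and the 900-row training split. The exact rounding may differ between `transformers` versions, but with these assumptions it reproduces the `global_step=560` and the evaluation every 28 steps reported further down by `trainer.train()`.

```python
import math

n_train = 900            # rows in dataset["train"]
per_device_batch = 4     # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
epochs = 10              # num_train_epochs
eval_steps_ratio = 0.05  # eval_steps < 1 is read as a fraction of all steps
n_devices = 1            # assumption: a single training process

effective_batch = per_device_batch * grad_accum * n_devices
batches_per_epoch = math.ceil(n_train / (per_device_batch * n_devices))
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_updates = updates_per_epoch * epochs
eval_every = math.ceil(total_updates * eval_steps_ratio)

print(f"effective batch size:       {effective_batch}")
print(f"optimizer steps per epoch:  {updates_per_epoch}")
print(f"total optimizer steps:      {total_updates}")
print(f"evaluation every N steps:   {eval_every}")
```

With only 560 optimizer steps in total, the 10 warmup steps cover less than 2% of the run.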

In [15]:
response_template = " ### Answer:"

In [16]:
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
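
The purpose of the completion-only collator is to compute the loss on the answer alone: every token up to and including the response template is given the label `-100`, which the loss ignores. The sketch below illustrates that idea with a toy whitespace tokenizer. It is a simplified illustration, not the TRL implementation, and the helper `mask_prompt_labels` is a hypothetical name.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_prompt_labels(tokens, template):
    """Copy tokens to labels, ignoring everything up to and including the template."""
    labels = list(tokens)
    n = len(template)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == template:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Template not found: ignore the whole example so it adds nothing to the loss.
    return [IGNORE_INDEX] * len(tokens)


# Toy whitespace "tokenizer"; the real collator works on tokenizer ids.
text = '### Question: Extract the solvent.\n ### Answer: {"solvent": "THF"}'
tokens = text.split()
template = "### Answer:".split()
labels = mask_prompt_labels(tokens, template)

for token, label in zip(tokens, labels):
    print(f"{token!r:>14} -> {label!r}")
```

Because the match is done on token sequences, the response template has to tokenize the same way in isolation as it does inside the full text; if it does not, the template is never found and the example contributes nothing to training.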

In [17]:
trainer = SFTTrainer(
    model=model,
    max_seq_length=None,
    args=training_arguments,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    tokenizer=tokenizer,
    packing=False,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [18]:
trainer.train()

Step,Training Loss,Validation Loss
28,0.9996,0.636476
56,0.8361,0.604613
84,0.8277,0.574966
112,0.753,0.549391
140,0.6837,0.527555
168,0.7008,0.507865
196,0.7132,0.489182
224,0.5992,0.471585
252,0.6436,0.460574
280,0.6036,0.455087


TrainOutput(global_step=560, training_loss=0.5340924625418015, metrics={'train_runtime': 2403.4626, 'train_samples_per_second': 3.745, 'train_steps_per_second': 0.233, 'total_flos': 3.342546165492941e+17, 'train_loss': 0.5340924625418015, 'epoch': 9.955555555555556})

In [19]:
test_ds_path = "test.json"
test_dataset = load_dataset("json", data_files=test_ds_path, split="train")
test_dataset = test_dataset.shuffle(seed=42)
test_dataset

Dataset({
    features: ['instruction', 'output'],
    num_rows: 100
})

In [20]:
sft_pipe = pipeline(
    "text-generation",
    temperature=0.01,
    model=trainer.model,
    tokenizer=tokenizer,
)

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'JambaForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MusicgenMelodyFo

In [21]:
PREFIX = """<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful scientific assistant. Your task is to extract information about organic reactions. {shot}"""
SUFFIX = """<|start_header_id|>user<|end_header_id|>\n\n{sample}<|start_header_id|>assistant<|end_header_id|>\n\n"""
SHOT = """
One example is provided to you to show how to perform the task:

### Procedure:\nA suspension of 8 g of the product of Example 7 and 0.4 g of DABCO in 90 ml of xylenes were heated under N2 at 130\u00b0-135\u00b0 C. while 1.8 ml of phosgene was added portionwise at a rate to maintain a reflux temperature of about 130\u00b0-135\u00b0 C. The mixture was refluxed an additional two hours, cooled under N2 to room temperature, filtered, and the filtrate was concentrated in vacuo to yield 6.9 g of the subject compound as a crude oil.\n\n
### ORD JSON:\n{\"inputs\": {\"m1_m2_m4\": {\"components\": [{\"identifiers\": [{\"type\": \"NAME\", \"value\": \"product\"}], \"amount\": {\"mass\": {\"value\": 8.0, \"units\": \"GRAM\"}}, \"reaction_role\": \"REACTANT\"}, {\"identifiers\": [{\"type\": \"NAME\", \"value\": \"DABCO\"}], \"amount\": {\"mass\": {\"value\": 0.4, \"units\": \"GRAM\"}}, \"reaction_role\": \"REACTANT\"}, {\"identifiers\": [{\"type\": \"NAME\", \"value\": \"xylenes\"}], \"amount\": {\"volume\": {\"value\": 90.0, \"units\": \"MILLILITER\"}}, \"reaction_role\": \"SOLVENT\"}]}, \"m3\": {\"components\": [{\"identifiers\": [{\"type\": \"NAME\", \"value\": \"phosgene\"}], \"amount\": {\"volume\": {\"value\": 1.8, \"units\": \"MILLILITER\"}}, \"reaction_role\": \"REACTANT\"}]}}, \"conditions\": {\"temperature\": {\"control\": {\"type\": \"AMBIENT\"}}, \"conditions_are_dynamic\": true}, \"workups\": [{\"type\": \"ADDITION\", \"details\": \"was added portionwise at a rate\"}, {\"type\": \"TEMPERATURE\", \"details\": \"to maintain a reflux temperature of about 130\\u00b0-135\\u00b0 C\"}, {\"type\": \"TEMPERATURE\", \"details\": \"The mixture was refluxed an additional two hours\", \"duration\": {\"value\": 2.0, \"units\": \"HOUR\"}}, {\"type\": \"FILTRATION\", \"details\": \"filtered\"}, {\"type\": \"CONCENTRATION\", \"details\": \"the filtrate was concentrated in vacuo\"}], \"outcomes\": [{\"products\": [{\"identifiers\": [{\"type\": \"NAME\", \"value\": \"subject compound\"}], \"measurements\": [{\"type\": \"AMOUNT\", \"details\": \"MASS\", \"amount\": {\"mass\": {\"value\": 6.9, \"units\": \"GRAM\"}}}], \"reaction_role\": \"PRODUCT\"}]}]}
"""

In [22]:
# Generate text for the 0-shot and 1-shot prompts
results = {}
for i in range(2):
    print(f"Working in the {i}-shot prompts")
    references = []
    predictions_sft = []
    prompts = []
    count = 0
    for t in test_dataset:
        count += 1
        print(f"Working in the prompt {count}")
        instruction = t['instruction']
        output = t['output']
        if i == 0:
            shot = ''
        else:
            shot = SHOT
        system = PREFIX.format(shot=shot)
        user = SUFFIX.format(sample=instruction)
        prompt = system + user
        references.append(output)
        
        with torch.autocast("cuda"):
            pred = sft_pipe(prompt)
        predictions_sft.append(pred[0]['generated_text'].replace(prompt, ''))
    
    results[f"{i}-shot"] = {
        "predictions": predictions_sft,
        "references": references,
    }

Working in the 0-shot prompts
Working in the prompt 1
Working in the prompt 2
Working in the prompt 3
Working in the prompt 4
Working in the prompt 5
Working in the prompt 6
Working in the prompt 7
Working in the prompt 8
Working in the prompt 9
Working in the prompt 10


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Working in the prompt 11
Working in the prompt 12
Working in the prompt 13
Working in the prompt 14
Working in the prompt 15
Working in the prompt 16
Working in the prompt 17
Working in the prompt 18
Working in the prompt 19
Working in the prompt 20
Working in the prompt 21
Working in the prompt 22
Working in the prompt 23
Working in the prompt 24
Working in the prompt 25
Working in the prompt 26
Working in the prompt 27
Working in the prompt 28
Working in the prompt 29
Working in the prompt 30
Working in the prompt 31
Working in the prompt 32
Working in the prompt 33
Working in the prompt 34
Working in the prompt 35
Working in the prompt 36
Working in the prompt 37
Working in the prompt 38
Working in the prompt 39
Working in the prompt 40
Working in the prompt 41
Working in the prompt 42
Working in the prompt 43
Working in the prompt 44
Working in the prompt 45
Working in the prompt 46
Working in the prompt 47
Working in the prompt 48
Working in the prompt 49
Working in the prompt 50
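
The loop above builds each prompt with `str.format`, and the one-shot example is full of JSON braces. This works because `format` only parses the template string; braces inside a substituted value are inserted verbatim. The short sketch below shows the difference, using shortened stand-ins (`PREFIX_DEMO`, `SHOT_DEMO`) for the real templates and a hypothetical helper `fails_to_format`.

```python
PREFIX_DEMO = "System message. {shot}"
SHOT_DEMO = 'Example output: {"inputs": {"m1": {"components": []}}}'

# Braces inside the *value* are inserted verbatim; only the template is parsed.
prompt = PREFIX_DEMO.format(shot=SHOT_DEMO)
print(prompt)


def fails_to_format(template: str) -> bool:
    """Return True if str.format rejects the template."""
    try:
        template.format(shot="")
    except (KeyError, ValueError, IndexError):
        return True
    return False


# JSON written directly into the template is parsed as a replacement field...
print(fails_to_format('{"inputs": 1} {shot}'))
# ...unless every literal brace is doubled.
print(fails_to_format('{{"inputs": 1}} {shot}'))
```

Keeping the example in its own string and passing it in as a value, as the notebook does, avoids having to double every brace in the JSON.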


In [23]:
bertscore = load("bertscore")
for i in range(2):
    predictions_sft = results[f'{i}-shot']["predictions"]
    references = results[f'{i}-shot']["references"]

    results_sft = bertscore.compute(predictions=predictions_sft, references=references, model_type="distilbert-base-uncased")

    results[f"{i}-shot"].update({
        "precision": mean(results_sft["precision"]),
        "recall": mean(results_sft["recall"]),
        "f1_scores": mean(results_sft["f1"]),
    })
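
To make the three reported numbers easier to interpret, here is a minimal sketch of the matching idea behind BERTScore: every candidate token is matched to its most similar reference token (precision), every reference token to its most similar candidate token (recall), and F1 is their harmonic mean. The 2-d vectors are toy stand-ins for contextual embeddings, and the sketch leaves out optional features such as IDF weighting and baseline rescaling, so it is an illustration rather than a replacement for the `evaluate` metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(cand, ref):
    """Precision, recall and F1 from greedy cosine matching of token vectors."""
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy 2-d "embeddings"; real BERTScore uses vectors from the chosen model.
candidate = [(1.0, 0.0), (0.0, 1.0)]
reference = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]

prec, rec, f1 = greedy_bertscore(candidate, reference)
print(f"precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

In this toy case the candidate covers two of the three reference tokens perfectly, so precision is high while recall is penalised for the missing one. The same reading applies to the averaged scores stored in `results`: low recall suggests that predictions omit parts of the reference record, and low precision suggests that they contain extra material.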

In [24]:
with open('sft_results.json', 'w') as f:
    json.dump(results, f, indent=4)