<a href="https://colab.research.google.com/github/nguforche/LLaMPS/blob/main/Fine_Tuning_Llama_2_for_KG_to_Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KG-to-Text with Llama 2
This notebook fine-tunes Llama 2 with QLoRA for the KG-to-Text task, that is, generating text from knowledge graph triples. The model is fine-tuned on the WebNLG (Constrained) training set.

# Data Preparation
This notebook uses version 2.1 of the constrained dataset, which consists of three JSON files, one per subset (train, dev, and test).
The following command clones the WebNLG repository and lists these files.

In [None]:
%%bash
git clone https://gitlab.com/shimorina/webnlg-dataset.git
ls webnlg-dataset/release_v2.1_constrained/json

# Prompting
The prompt template is defined according to the [Llama 2 chat prompt template standard](https://huggingface.co/blog/llama2).
The formatted triples will replace the `{triples}` placeholder.

In [None]:
prompt_text = """<s>[INST] Following is a set of knowledge graph triples delimited by triple backticks, each on a separate line, in the format: subject | predicate | object.

```
{triples}
```

Generate a coherent piece of text that contains all of the information in the triples. Only use information from the provided triples.
After you finish writing the piece of text, write triple dollar signs (i.e.: $$$).[/INST]"""
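
As an illustration of how the placeholder is filled, the sketch below applies `str.format` to a shortened stand-in for the template. The template text and the two triples are made up for the example; they are not taken from the dataset.

```python
# Shortened stand-in for prompt_text (hypothetical, for illustration only).
demo_template = (
    "<s>[INST] Triples, one per line:\n"
    "{triples}\n"
    "Write a text covering all of them.[/INST]"
)

# Hypothetical triples in the "subject | predicate | object" layout.
demo_triples = "\n".join([
    "Alan_Bean | occupation | Test_pilot",
    "Alan_Bean | birthPlace | Wheeler,_Texas",
])

# str.format substitutes the {triples} placeholder with the formatted triples.
demo_prompt = demo_template.format(triples=demo_triples)
print(demo_prompt)
```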

# Creating the Hugging Face Dataset
The Hugging Face `datasets` library can load JSON Lines files, so the data is first converted to that format. The following code writes one JSON Lines file per subset and then loads them as Hugging Face datasets.

In [None]:
!pip install -q datasets jsonlines

import json
import jsonlines
import os

from datasets import load_dataset
from tqdm import tqdm

def format_triplets(triplets):
    """Format triples as 'subject | property | object', one per line."""
    return '\n'.join(
        f"{triplet['subject']} | {triplet['property']} | {triplet['object']}"
        for triplet in triplets
    )

dataset_dir_path = "webnlg-dataset/release_v2.1_constrained/json"
data_subsets_file_names = {
    "train": "webnlg_release_v2.1_constrained_train.json",
    "dev": "webnlg_release_v2.1_constrained_dev.json",
    "test": "webnlg_release_v2.1_constrained_test.json"
}

data_subsets_file_paths = {k: os.path.join(dataset_dir_path, v) for k, v in data_subsets_file_names.items()}

# Reference texts of the test subset, used later to write references.txt
all_responses = []

for data_subset_name, data_subset_file_path in data_subsets_file_paths.items():
    with open(data_subset_file_path, 'r') as data_subset_file:
        data_subset_dict = json.load(data_subset_file)

    with jsonlines.open(f"{data_subset_name}.jsonl", mode='w') as writer:
        for i, entry in enumerate(tqdm(data_subset_dict["entries"])):
            entry_data = entry[str(i + 1)]  # each entry is keyed by its 1-based index
            triples = format_triplets(entry_data['modifiedtripleset'])
            lexicalizations = entry_data['lexicalisations']
            if data_subset_name == "test":
                # Keep every reference for evaluation, but write one row per test entry
                all_responses.append([l["lex"] for l in lexicalizations])
                selected_responses = [lexicalizations[0]["lex"]]
            else:
                # Train on lexicalisations marked "good" only
                selected_responses = [l["lex"] for l in lexicalizations if l["comment"] == "good"]
            prompt = prompt_text.format(triples=triples)
            for response in selected_responses:
                writer.write({"prompt": prompt, "response": response + "$$$ </s>"})

# Load the Hugging Face datasets
train_dataset = load_dataset('json', data_files='train.jsonl', split="train")

# Preprocess the Hugging Face datasets
train_dataset = train_dataset.map(
    lambda examples: {'text': [f"{prompt} {response}"
    for prompt, response in zip(examples['prompt'], examples['response'])]}, batched=True
    )

# A single data file is assigned to the "train" split by default, hence split="train" here as well
test_dataset = load_dataset('json', data_files='test.jsonl', split="train")
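
To make the file layout concrete, here is a minimal sketch of the JSON Lines conversion for a single toy entry. The entry mirrors only the fields that the code above reads (`modifiedtripleset`, `lexicalisations`, `lex`, `comment`); its values are invented, the prompt is abbreviated, and the standard `json` module stands in for `jsonlines` so the snippet has no extra dependencies.

```python
import io
import json

# Toy entry with invented values; only the field names come from the notebook code.
toy_entry = {
    "modifiedtripleset": [
        {"subject": "Alan_Bean", "property": "occupation", "object": "Test_pilot"},
    ],
    "lexicalisations": [
        {"lex": "Alan Bean was a test pilot.", "comment": "good"},
        {"lex": "Alan Bean test pilot.", "comment": "bad"},
    ],
}

def toy_format_triples(triples):
    """Render each triple as 'subject | property | object', one per line."""
    return "\n".join(f"{t['subject']} | {t['property']} | {t['object']}" for t in triples)

triples_text = toy_format_triples(toy_entry["modifiedtripleset"])

# Keep only the lexicalisations marked "good", as done for the train and dev subsets.
good = [item["lex"] for item in toy_entry["lexicalisations"] if item["comment"] == "good"]

# JSON Lines: one JSON object per line, with no enclosing list.
buffer = io.StringIO()
for lex in good:
    record = {"prompt": f"[INST] {triples_text} [/INST]", "response": lex + "$$$ </s>"}
    buffer.write(json.dumps(record) + "\n")

# Reading back: each line is parsed independently.
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(records)
```

Each `good` lexicalisation becomes its own training row, so an entry with several good references contributes several prompt/response pairs that share the same prompt.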

# Modeling

To use Llama 2, you need to request access to the model from Meta.
The instructions to obtain access can be found in the [Llama 2 organization card on Hugging Face](https://huggingface.co/meta-llama).

Once you obtain access to the model, follow these steps:
1. [Obtain your Hugging Face user access token](https://huggingface.co/docs/hub/security-tokens)
2. Create a `credentials.env` file in your current working directory. The content of the file should be as follows:
```
HUGGINGFACE_TOKEN="<your_hugging_face_token>"
```
3. Run the following code to log in to Hugging Face with your token.

In [None]:
!pip install python-dotenv

from dotenv import dotenv_values

config = dotenv_values("credentials.env")
HUGGINGFACE_CLI_TOKEN = config["HUGGINGFACE_TOKEN"]

!huggingface-cli login --token $HUGGINGFACE_CLI_TOKEN


The following packages are used in the fine-tuning process:
- [transformers](https://huggingface.co/docs/transformers/index): Hugging Face's model library, used to load the Llama 2 7B Chat model and its tokenizer.
- [peft](https://huggingface.co/docs/peft/index): Short for Parameter-Efficient Fine-Tuning, used to fine-tune the LLM by training a small set of adapter (LoRA) parameters instead of all the model parameters.
- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes): Quantization library used to load the model weights in 4-bit precision.
- [trl](https://huggingface.co/docs/trl/index): Short for Transformer Reinforcement Learning, which provides the `SFTTrainer` used for supervised fine-tuning.
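
A back-of-the-envelope calculation suggests why 4-bit loading and low-rank adapters make this feasible on a single GPU. The figures below are rough assumptions rather than measurements: a round 7 billion parameters, a single 4096 x 4096 projection matrix (4096 is the hidden size of Llama 2 7B), and the `r=64` and `lora_alpha=16` values from the LoRA configuration used later in this notebook.

```python
# Rough weight-memory estimate; ignores activations, optimizer state,
# and the small overhead of quantization constants.
n_params = 7_000_000_000            # assumed round figure for a "7B" model
fp16_gb = n_params * 2 / 1e9        # 2 bytes per parameter
nf4_gb = n_params * 0.5 / 1e9       # 4 bits = 0.5 bytes per parameter
print(f"16-bit weights: ~{fp16_gb:.1f} GB, 4-bit weights: ~{nf4_gb:.1f} GB")

# LoRA learns the update of a d_out x d_in matrix as two low-rank factors,
# B (d_out x r) and A (r x d_in), so it trains r * (d_in + d_out) parameters.
d_in = d_out = 4096
r, lora_alpha = 64, 16
full_params = d_in * d_out
lora_params = r * (d_in + d_out)
scaling = lora_alpha / r            # factor applied to the low-rank update
print(f"full matrix: {full_params:,}  LoRA factors: {lora_params:,}")
print(f"LoRA scaling (lora_alpha / r): {scaling}")
```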

The following code defines the model configurations and runs the fine-tuning process.

**WARNING**: The following code takes more than 24 hours to run on a T4 GPU.

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

import torch

from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

base_model_name = "meta-llama/Llama-2-7b-chat-hf"
output_dir = "./results_chat_7b_3_epoch/final_checkpoint"

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

device_map = {"": 0}

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False
)

# If loading fails at this call, restart the runtime and try again.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    trust_remote_code=True,
    use_auth_token=True
)
base_model.config.use_cache = False

# More info: https://github.com/huggingface/transformers/pull/24906
base_model.config.pretraining_tp = 1

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    save_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    logging_steps=10,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
)

trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_args,
)

trainer.train()
trainer.model.save_pretrained(output_dir)
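
The batch-related settings combine as sketched below. This is only a rough estimate of the schedule: the training-set size is a hypothetical placeholder (the real value is `len(train_dataset)`), a single GPU is assumed, and the exact step count depends on how the installed `transformers` version rounds partial accumulation windows.

```python
import math

num_examples = 10_000               # hypothetical; use len(train_dataset) in the notebook
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 5
save_steps = 25

# Gradients from 4 micro-batches of 4 examples are accumulated before each optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Approximate number of optimizer steps per epoch and over the whole run.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# A checkpoint is written every `save_steps` optimizer steps.
approx_checkpoints = total_steps // save_steps

print(effective_batch_size, steps_per_epoch, total_steps, approx_checkpoints)
```

With `save_steps=25`, long runs write many checkpoints, so it may be worth checking the available disk space before starting.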

# Results
To obtain the results, the following code creates a Hugging Face pipeline and provides the model and stopping criteria to stop the generation when the triple dollar signs are generated.

If the following code produces a "CUDA out of memory" error, restart the runtime, run all the cells **except for the previous (fine-tuning) cell**, and then run the following cells again.

In [None]:
from peft import AutoPeftModelForCausalLM
from transformers import pipeline, StoppingCriteria, StoppingCriteriaList

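class StopOnTripleDollarSigns(StoppingCriteria):
    """Stop generation once the generated tokens end with '$$$'."""

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor,
                 **kwargs) -> bool:
        # Join the last three tokens so the check works whether "$$$" is one token or several.
        last_tokens = tokenizer.convert_ids_to_tokens(input_ids[0][-3:].tolist())
        return ''.join(last_tokens).endswith("$$$")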

model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map=device_map, torch_dtype=torch.bfloat16, load_in_8bit=True)

stopping_criteria = StoppingCriteriaList([StopOnTripleDollarSigns()])
text_gen = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=500, stopping_criteria=stopping_criteria)

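
The stopping check can be exercised without loading the model. The sketch below reproduces only the string logic (join the last few tokens and test the suffix); the helper name `ends_with_delimiter` and the token lists are invented for illustration and do not reflect how the Llama 2 tokenizer actually splits `$$$`.

```python
def ends_with_delimiter(tokens, delimiter="$$$", window=3):
    """Return True if the concatenation of the last `window` tokens ends with `delimiter`."""
    return "".join(tokens[-window:]).endswith(delimiter)

# Invented token sequences for illustration.
split_case = ends_with_delimiter(["pilot", ".", "$", "$", "$"])   # delimiter spread over three tokens
single_case = ends_with_delimiter(["pilot", ".", "$$$"])          # delimiter as a single token
partial_case = ends_with_delimiter(["pilot", ".", "$$"])          # delimiter not yet complete

print(split_case, single_case, partial_case)
```

Looking at a window of three tokens covers the case where each `$` is emitted as its own token.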

Finally, the following code generates the text, cleans it by removing the original prompt and the triple dollar signs, and saves it along with the reference texts for evaluation.

In [None]:
from transformers.pipelines.pt_utils import KeyDataset

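def clean_result_text(result_text, test_prompt):
    """Strip the prompt prefix and everything from the '$$$' delimiter onward."""
    result_text_without_prompt = result_text[len(test_prompt):]
    # partition() keeps the whole text when the delimiter is missing,
    # whereas find() would return -1 and silently drop the last character.
    cleaned_result_text = result_text_without_prompt.partition('$$$')[0].strip()
    return cleaned_result_text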

results = []
for out in tqdm(text_gen(KeyDataset(test_dataset, "prompt"), top_k=1)):
    results.append(out)

results_texts = [result[0]["generated_text"] for result in results]

cleaned_results_texts = [clean_result_text(result_text, test_prompt)
    for result_text, test_prompt in zip(results_texts, test_dataset["prompt"])]

results_file_text = '\n'.join(cleaned_results_texts)
references_file_text = '\n\n'.join(['\n'.join(responses) for responses in all_responses])

with open('results.txt', 'w') as results_file:
    results_file.write(results_file_text)

with open('references.txt', 'w') as references_file:
    references_file.write(references_file_text)
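
Before running the evaluation, it can help to confirm that the two files line up. The sketch below assumes the layout produced above (one generated text per line in the results, reference groups separated by a blank line) and uses invented strings in place of the real file contents.

```python
# Invented stand-ins for the contents of results.txt and references.txt.
demo_results_text = "Alan Bean was a test pilot.\nAarhus Airport serves Aarhus."
demo_references_text = (
    "Alan Bean worked as a test pilot.\nAlan Bean was a test pilot.\n\n"
    "Aarhus Airport serves the city of Aarhus."
)

# One hypothesis per line; one group of references per blank-line-separated block.
hypotheses = demo_results_text.split("\n")
reference_groups = [group.split("\n") for group in demo_references_text.split("\n\n")]

# Every hypothesis should have exactly one group of references.
assert len(hypotheses) == len(reference_groups), "results and references are misaligned"
for hypothesis, references in zip(hypotheses, reference_groups):
    print(f"{hypothesis!r} <- {len(references)} reference(s)")
```

A generated text that itself contains a newline would shift the line count, so a mismatch here is worth investigating before computing any scores.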

# Evaluation
To evaluate the results, we use [Data-to-text-Evaluation-Metric](https://github.com/wenhuchen/Data-to-text-Evaluation-Metric/) (the library used by JointGT).
To run the evaluation process:
1. Clone this repo to your local machine and install the required software.
2. Download the `results.txt` and `references.txt` files produced by the previous cell to your local machine.
3. Change your working directory to the location of the repo on your local machine and run the `measure_scores.py` script.

```bash
cd Data-to-text-Evaluation-Metric/
python measure_scores.py <path_to_references_file> <path_to_results_file>
```

Replace `<path_to_references_file>` and `<path_to_results_file>` with the paths to the references and results files created earlier, respectively.