# Transformer-based NLP applications

#### Hugging Face example: Llama 2 7B fine-tuning
<br/><br/>
Jelen Jupyter notebook a Budapesti Műszaki és Gazdaságtudományi Egyetemen tartott "Mélytanulás" tantárgy segédanyagaként készült.
A tantárgy honlapja: https://portal.vik.bme.hu/kepzes/targyak/VITMMA19

A notebook bármely részének újra felhasználása, publikálása csak a szerzők írásos beleegyezése esetén megengedett.
***

This Jupyter notebook was created as part of the "Deep learning / VITMMA19" class at Budapest University of Technology and Economics, Hungary, https://portal.vik.bme.hu/kepzes/targyak/VITMMA19.

Any re-use or publication of any part of the notebook is only allowed with the written consent of the authors.

2023 (c) Lívia Ónozó
<br/><br/>
## Remarks

0. Make sure you have access to the Llama-2-7b-chat-hf model (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and to OpenAI (https://openai.com/pricing). OpenAI provides $5 in free credit that can be used during your first 3 months
1. Go to Runtime -> change runtime type
2. Insert your OpenAI API key (if missing -> go to 6.)
3. In "prompt" you can describe your own model you want to build
4. Temperature can be between 0 and 1 (high=creative, low=precise)
5. Prepared database can be found in the repo, too
6. Modeling: You can also change the `model_name` if you want to fine-tune a different model. Go through the hyperparams!
7. Saving and merging the model
8. Inference


# 0. Setup: Install libraries

Check out the following packages:
- PEFT: https://huggingface.co/blog/peft
- AutoClasses: https://huggingface.co/docs/transformers/model_doc/auto
- LoRA: https://huggingface.co/docs/peft/conceptual_guides/lora
- SFTTrainer: https://huggingface.co/docs/trl/sft_trainer
- Bitsandbytes: https://huggingface.co/docs/transformers/main_classes/quantization

In [None]:
!pip install openai==0.28.1
import openai

In [None]:
# If "pip install trl" fails, uncomment the two locale lines below and run all three lines. Otherwise run only the pip install.
#import locale
#locale.getpreferredencoding = lambda: "UTF-8"
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

In [None]:
import json
import os
import random
import pandas as pd
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
#from peft.auto import AutoModelForSequenceClassification

In [None]:
openai.api_key = "your API key"

# 1. Data generation step

You can write your prompt here. Make it as descriptive as possible!

Then, choose the temperature (between 0 and 1) to use when generating data. Lower values are great for precise tasks, like writing code, whereas larger values are better for creative tasks, like writing stories.

Finally, choose how many examples you want to generate. The more you generate, a) the longer it takes and b) the more expensive data generation will be. But generally, more examples will lead to a higher-quality model. 100 is usually the minimum to start.
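
To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax, which is the usual way sampling temperature is described. The scores in `logits` are made up for illustration and the function name `softmax_with_temperature` is hypothetical; how the OpenAI API applies temperature internally is an assumption, not something this notebook specifies.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens
logits = [2.0, 1.0, 0.5]
for t in (0.2, 0.4, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate the probability on the top-scoring token, which matches the "precise" behaviour described above, while higher temperatures spread it across the alternatives.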

In [None]:
prompt = "A model that takes in a puzzle-like reasoning-heavy question in Hungarian, and responds with a well-reasoned, step-by-step thought out response in Hungarian."
temperature = .4
number_of_examples = 10

Run this to generate the dataset.

In [None]:
def generate_example(prompt, prev_examples, temperature=.5):
    messages=[
        {
            "role": "system",
            "content": f"You are generating data which will be used to train a machine learning model.\n\nYou will be given a \
            high-level description of the model we want to train, and from that, you will generate data samples, each with a \
            prompt/response pair.\n\nYou will do so in this format:\n```\nprompt\n-----------\n$prompt_goes_here\n-----------\n\nresponse\n-----------\n$response_goes_here\n-----------\n```\n\nOnly one prompt/response pair should be generated per turn.\n\nFor each turn, make the example slightly more complex than the last, while ensuring diversity.\n\nMake sure your samples are unique and diverse, yet high-quality and complex enough to train a well-performing model.\n\nHere is the type of model we want to train:\n`{prompt}`"
        }
    ]

    if len(prev_examples) > 0:
        if len(prev_examples) > 10:
            prev_examples = random.sample(prev_examples, 5)
        for example in prev_examples:
            messages.append({
                "role": "assistant",
                "content": example
            })

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=temperature,
        max_tokens=386,
    )

    return response.choices[0].message['content']


prev_examples = []
for i in range(number_of_examples):
    print(f'Generating example {i}')
    example = generate_example(prompt, prev_examples, temperature)
    prev_examples.append(example)
    print(example)

We also need to generate a system message. Feel free to modify it to suit your task!

In [None]:
system_message = 'Given a puzzle-like reasoning-heavy question in English, \
you will respond with a well-reasoned, step-by-step thought out response in Hungarian.'

Put examples into a dataframe and turn them into a final pair of datasets.

We initialize lists to store the prompts and responses, then parse them out of the generated examples.

We also remove duplicates, in case GPT repeated itself instead of generating new pairs.
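
As a small illustration of the parsing step, the sketch below splits one hand-written sample on the same `-----------` separator that the generation prompt asks for. The helper name `parse_example` and the sample text are hypothetical; real model output may deviate from the requested format, which is why malformed samples return `None`.

```python
SEPARATOR = "-----------"


def parse_example(raw):
    """Return (prompt, response) from one generated sample, or None if malformed."""
    parts = raw.split(SEPARATOR)
    # A well-formed sample splits into: 'prompt', <prompt>, 'response', <response>, ''
    if len(parts) < 4:
        return None
    prompt, response = parts[1].strip(), parts[3].strip()
    if not prompt or not response:
        return None
    return prompt, response


# Hand-written sample in the requested format (not real model output)
sample = (
    "prompt\n-----------\nMennyi 2 + 2?\n-----------\n\n"
    "response\n-----------\n2 + 2 = 4.\n-----------"
)
print(parse_example(sample))
print(parse_example("no separators here"))
```

Checking the number of parts explicitly avoids relying on an exception to detect samples that do not follow the format.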

In [None]:
prompts = []
responses = []

for example in prev_examples:
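  try:
    # Expected layout after splitting: 'prompt', <prompt text>, 'response', <response text>
    split_example = example.split('-----------')
    if split_example[1] and split_example[3]:
      prompts.append(split_example[1].strip())
      responses.append(split_example[3].strip())
  except IndexError:
    # Skip samples that do not follow the expected separator format
    pass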

df = pd.DataFrame({
    'prompt': prompts,
    'response': responses
})

df = df.drop_duplicates()

print('There are ' + str(len(df)) + ' successfully-generated examples. Here are the first few:')
df.head()

If you skipped the first part, you can also use the prepared dataset from the repository. Clone the repo, find llm_trainer_db.csv in the data folder, and upload it to Colab. You can then read it as a pandas DataFrame.

In [None]:
df = pd.read_csv('/content/llm_trainer_db.csv')
len(df)

Split into train and test sets.
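
The split is saved as `.jsonl` files, where every line is one JSON object. The sketch below shows that layout with the standard library only; the two records and the helper names `to_jsonl` and `from_jsonl` are made up for illustration and are not part of the notebook.

```python
import json

# Two hypothetical prompt/response records
records = [
    {"prompt": "Mennyi 2 + 2?", "response": "2 + 2 = 4."},
    {"prompt": "Mennyi 3 * 3?", "response": "3 * 3 = 9."},
]


def to_jsonl(rows):
    """Serialize a list of dicts as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping empty lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(records)
print(jsonl_text)
restored = from_jsonl(jsonl_text)
print(restored == records)
```

Because each line is independent, such files can be read record by record, which is convenient for dataset loaders.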

In [None]:
# Split the data into train and test sets, with 80% in the train set
train_df = df.sample(frac=0.8, random_state=42)
test_df = df.drop(train_df.index)

# Save the dataframes to .jsonl files
train_df.to_json('train.jsonl', orient='records', lines=True)
test_df.to_json('test.jsonl', orient='records', lines=True)

# 2. Modeling

## Load datasets

In [None]:
train_dataset = load_dataset('json', data_files='/content/train.jsonl', split="train")
valid_dataset = load_dataset('json', data_files='/content/test.jsonl', split="train")

train_dataset_mapped = train_dataset.map(lambda examples: {'text': [f'[INST] <<SYS>>\n{system_message.strip()}\n<</SYS>>\n\n' + prompt + ' [/INST] ' + response for prompt, response in zip(examples['prompt'], examples['response'])]}, batched=True)
valid_dataset_mapped = valid_dataset.map(lambda examples: {'text': [f'[INST] <<SYS>>\n{system_message.strip()}\n<</SYS>>\n\n' + prompt + ' [/INST] ' + response for prompt, response in zip(examples['prompt'], examples['response'])]}, batched=True)

## Define Hyperparameters

In [None]:
model_name = "NousResearch/llama-2-7b-chat-hf"
dataset_name = "/content/train.jsonl"
new_model = "llama-2-7b-custom"

lora_r = 16
lora_alpha = 16
lora_dropout = 0.1

use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False
output_dir = "./results"
num_train_epochs = 3
fp16 = False # test True
bf16 = False
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"
lr_scheduler_type = "constant"
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
save_steps = 25
logging_steps = 5
max_seq_length = None
packing = False
device_map = {"": 0}

## Training

Set training parameters and supervised fine-tuning parameters
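
Before training, it can help to estimate how many optimizer steps the chosen settings produce, because `eval_steps`, `logging_steps` and `save_steps` are all counted in steps. The sketch below is an approximation that assumes a single GPU and that the last partial batch is kept; the function name `optimizer_steps` is hypothetical, and the figure of 8 training examples assumes 10 generated examples with an 80% train split and no duplicates removed.

```python
import math


def optimizer_steps(num_examples, batch_size, grad_accum, epochs):
    """Approximate the total number of optimizer steps for a training run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # At least one update per epoch, even if accumulation exceeds the batch count
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


# Assumed values: 8 training examples, batch size 4, no accumulation, 3 epochs
total = optimizer_steps(num_examples=8, batch_size=4, grad_accum=1, epochs=3)
eval_points = [s for s in range(1, total + 1) if s % 5 == 0]    # evaluation every 5 steps
save_points = [s for s in range(1, total + 1) if s % 25 == 0]   # checkpoint every 25 steps

print("total steps:", total)
print("evaluations at steps:", eval_points)
print("checkpoints at steps:", save_points)
```

With a dataset this small, the step-based intervals are rarely or never reached, so it may be worth lowering them or generating more examples.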

In [None]:
%%time
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="all",
    evaluation_strategy="steps",
    eval_steps=5
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset_mapped,
    eval_dataset=valid_dataset_mapped,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

trainer.train()
trainer.model.save_pretrained(new_model)

## Test the model

In [None]:
logging.set_verbosity(logging.CRITICAL)
prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\nWrite a function which accepts a string and returns a new string with only capital letters. [/INST]" # replace the command here with something relevant to your task
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=150)
result = pipe(prompt)
print(result[0]['generated_text'])

In [None]:
logging.set_verbosity(logging.CRITICAL)
prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\nWrite the exact recipe for making the Hungarian horn cake, kürtöskalács, but also add cardamom to the ingredients. [/INST]" # replace the command here with something relevant to your task
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(prompt)
print(result[0]['generated_text'])

In [None]:
logging.set_verbosity(logging.CRITICAL)
prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\nAdd meg a kürtöskalács elkészítésének pontos receptjét és add hozzá a kardamomot az összetevőkhöz. [/INST]" # replace the command here with something relevant to your task
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(prompt)
print(result[0]['generated_text'])

In [None]:
logging.set_verbosity(logging.CRITICAL)
prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\nMi az oka annak, hogy Magyarországon még nem vezették be az eurót, mint fizetőeszközt? Csak egy okot adj meg!\n [/INST]" # replace the command here with something relevant to your task
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(prompt)
print(result[0]['generated_text'])

In [None]:
logging.set_verbosity(logging.CRITICAL)
prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\nWhy has the euro not yet been introduced as a currency in Hungary? Please name just one reason. \n [/INST]" # replace the command here with something relevant to your task
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(prompt)
print(result[0]['generated_text'])

# 3. Number of trainable parameters

LoRA (Low-Rank Adaptation) is a technique for reducing the number of trainable parameters when fine-tuning a neural network. Instead of updating the full weight matrices, it learns a low-rank update for them, which has far fewer parameters. Here is how LoRA decreases the number of trainable parameters:
1. Frozen base weights: The original weight matrices of the network, which contain a large number of parameters, are kept frozen during fine-tuning.
2. Low-rank update: For each adapted weight matrix of shape d × k, LoRA learns an update expressed as the product of two smaller matrices, B (d × r) and A (r × k), where the rank r is much smaller than d and k. Only these two matrices are trained, so the update needs r · (d + k) parameters instead of d · k.
3. Selected target modules: The adapters are attached only to selected layers (in this notebook they appear on the attention projections, for example `q_proj`), which limits the number of trainable parameters further.
4. Fine-tuning and merging: During fine-tuning, only the adapter matrices are adjusted to the new task. Afterwards, the learned update can be merged into the original weights.
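
The saving from a low-rank factorization can be checked with simple arithmetic. The sketch below compares the parameter count of a full d × k update with that of two factors of rank r; the helper names are hypothetical, and the 4096 × 4096 shape is an assumed size for a single projection matrix, used only for illustration.

```python
def full_update_params(d, k):
    """Parameters needed to update a d x k weight matrix directly."""
    return d * k


def lora_update_params(d, k, r):
    """Parameters in the two low-rank factors: B is d x r and A is r x k."""
    return r * (d + k)


# Assumed shape of one projection matrix
d = k = 4096
for r in (4, 16, 64):
    lora = lora_update_params(d, k, r)
    full = full_update_params(d, k)
    print(f"r={r}: {lora} trainable vs {full} full ({100 * lora / full:.3f}%)")
```

The count grows linearly with the rank, so lowering `r` or adapting fewer modules are the two direct ways to shrink the trainable share.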

In [None]:
def count_trainable_parameters(model):
  trainable_params = 0
  all_param = 0
  for _, param in model.named_parameters():
    all_param += param.numel()
    if param.requires_grad:
      trainable_params += param.numel()

  return trainable_params, all_param

trainable, all = count_trainable_parameters(model)
print('Trainable parameters as percentage of all parameters: ', 100*trainable / all)


with open("llm_task_04.txt", "w") as file:
    file.write(f"{trainable}\n")
    file.write(f"{all}\n")
    file.write(f"{100*trainable / all}\n")
    file.write(f"{model._modules['model'].layers[0].self_attn.q_proj.lora_A['default'].out_features}")

# 4. Run Inference

Replace the command here with something relevant to your task
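
The inference cells below repeat the same prompt template each time. As a sketch, the template can be wrapped in small helpers; the names `build_llama2_prompt` and `strip_prompt` are hypothetical and not part of the notebook, and the system message used here is a shortened placeholder.

```python
def build_llama2_prompt(system_message, user_message):
    """Wrap a system message and a user message in the [INST] / <<SYS>> template."""
    return f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n{user_message} [/INST]"


def strip_prompt(generated_text, prompt):
    """Return only the continuation if the generated text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


# Placeholder system message for the demonstration
demo_system_message = "You will respond in Hungarian."
demo_prompt = build_llama2_prompt(demo_system_message, "Mennyi 2 + 2?")
print(demo_prompt)

# Simulated pipeline output: the prompt followed by a made-up answer
print(strip_prompt(demo_prompt + " 2 + 2 = 4.", demo_prompt))
```

Building the prompt in one place keeps the training format and the inference format identical, which matters because the model was fine-tuned on this exact layout.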

In [None]:
prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\nWrite about the effect of the Habsburg occupation on Hungary. [/INST]" # replace the command here with something relevant to your task
num_new_tokens = 100
num_prompt_tokens = len(tokenizer(prompt)['input_ids'])
max_length = num_prompt_tokens + num_new_tokens

gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)
result = gen(prompt)
print(result[0]['generated_text'].replace(prompt, ''))

In [None]:
prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\nWrite your favourite fact about the Hungarian people. [/INST]"
num_prompt_tokens = len(tokenizer(prompt)['input_ids'])
max_length = num_prompt_tokens + num_new_tokens

gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)
result = gen(prompt)
print(result[0]['generated_text'].replace(prompt, ''))

In [None]:
prompt = f"[INST] <<SYS>> You are a helpful, respectful and honest sentiment analysis assistant. \
And you are supposed to classify the sentiment of the user's message into one of the following categories: \
'positive', 'negative' or 'neutral'. The answer should be only one word.<</SYS>>\
Sentence: Magyarország elvesztette területei nagyrészét a trianoni békeszerződés aláírásával. [/INST] Sentiment:"
num_new_tokens = 3
num_prompt_tokens = len(tokenizer(prompt)['input_ids'])
max_length = num_prompt_tokens + num_new_tokens

gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)
result1 = gen(prompt)
print(result1[0]['generated_text'].replace(prompt, ''))

In [None]:
prompt = f"[INST] <<SYS>> You are a helpful, respectful and honest sentiment analysis assistant. \
And you are supposed to classify the sentiment of the user's message into one of the following categories: \
'positive', 'negative' or 'neutral'. The answer should be only one word.<</SYS>>\
Sentence: A magyarok nem humorosak. [/INST] Sentiment:"
num_new_tokens = 3
num_prompt_tokens = len(tokenizer(prompt)['input_ids'])
max_length = num_prompt_tokens + num_new_tokens

gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)
result2 = gen(prompt)
print(result2[0]['generated_text'].replace(prompt, ''))

In [None]:
prompt = f"[INST] <<SYS>> You are a helpful, respectful and honest sentiment analysis assistant. \
And you are supposed to classify the sentiment of the user's message into one of the following categories: \
'positive', 'negative' or 'neutral'. The answer should be only one word.<</SYS>>\
Sentence: Világűröm vagy, s mélytengerem. [/INST] Sentiment:"
num_new_tokens = 3
num_prompt_tokens = len(tokenizer(prompt)['input_ids'])
max_length = num_prompt_tokens + num_new_tokens

gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)
result3 = gen(prompt)
print(result3[0]['generated_text'].replace(prompt, ''))

In [None]:
prompt = f"[INST] <<SYS>> You are a helpful, respectful and honest sentiment analysis assistant. \
And you are supposed to classify the sentiment of the user's message into one of the following categories: \
'positive', 'negative' or 'neutral'. The answer should be only one word.<</SYS>>\
Sentence: Leveleket fúj a viharos szél a komor utcán. [/INST] Sentiment:"
num_new_tokens = 3
num_prompt_tokens = len(tokenizer(prompt)['input_ids'])
max_length = num_prompt_tokens + num_new_tokens

gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)
result4 = gen(prompt)
print(result4[0]['generated_text'].replace(prompt, ''))

Your tasks will be the following:
---

1. Create your own dataset

Create a file named llm_task_01.txt. The first line should contain the length of the training dataset, and the second line should contain the length of the test dataset.
Before you save it, make sure that your dataset contains at least 10 question-answer pairs!

2 points.

---
2. Change LoRA rank

We have talked about low-rank adaptation as a way to make fine-tuning more efficient. LoRA represents the weight updates with two smaller matrices (called update matrices) through low-rank decomposition. These new matrices are trained to adapt to the new data while the original weights are kept untouched (until the updates are merged into them, of course).


Train a Llama 2 7B chat model on Hungarian data, but this time the LoRA rank should be 4. For this, you will have to change the LoRA configuration.
When the training has finished, dump the model object.

Hint: use `str(model)` to serialize the object
```
with open("llm_task_02.txt", "w") as file:
    file.write(str(model))
```
8 points.

---
3. Reduce validation loss

Modify the training setup, or even the training dataset, in order to decrease the validation loss.

Save the training history into the file llm_task_03.txt, in the following way:
```
training_history = trainer.state.log_history[-3:]

with open("llm_task_03.txt", "w") as file:
  file.write(f"{training_history[0]}\n")
  file.write(f"{training_history[1]}\n")
  file.write(f"{training_history[2]}")
```


15 points.

---
4. Change the configuration

Change the hyperparameters so that the number of trainable parameters (compared to all parameters of the original model) is less than 0.03 percent!

Save the corresponding parameters using the following code:

```
with open("llm_task_04.txt", "w") as file:
    file.write(f"{trainable}\n")
    file.write(f"{all}\n")
    file.write(f"{100*trainable / all}\n")
    file.write(f"{model._modules['model'].layers[0].self_attn.q_proj.lora_A['default'].out_features}")
```

25 points.

---

5. Train your model for sequence classification

Train a model with **AutoModelForSequenceClassification** on Hungarian data (you can use the one we have used previously, but you can also choose a different dataset).
<br/><br/>
To do this, you need to adapt both the model (to the new AutoModel class) and the hyperparameters. You will also need to change the prompt, but keep the `[INST] <<SYS>>\n{system_message}\n<</SYS>> [/INST]` structure!

This is a multi-class classification problem: each sentence must be classified into one of 5 categories.
After you have trained your model as described above, it will need to answer 3 questions, and each answer should be one of the following: "Általános", "Gazdaság", "Érzelem", "Sport", "Tudomány"
<br/><br/>

Please report 3 files:
- The _model_ object and the prompt in a txt file, named llm_task_05.txt,
- The 3 sentences and the corresponding labels in a csv file, named llm_task_05.csv. Include a header, where the columns are: sentences, labels. So this should be a 4 by 2 matrix:

| sentences | labels |
| --- | --- |
| sentence1 | label1 |
| sentence2 | label2 |
| sentence3 | label3 |

- Your ipynb file containing all the calculations, saved as llm_task_05.ipynb
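
As an illustration of the requested CSV layout, the sketch below writes a header and three rows with the standard `csv` module and reads them back. The rows are the placeholders from the table above, the helper name `to_csv_text` is hypothetical, and the text is kept in memory; writing it to the actual file is left to you.

```python
import csv
import io

# Placeholder rows taken from the table layout above
rows = [
    ("sentence1", "label1"),
    ("sentence2", "label2"),
    ("sentence3", "label3"),
]


def to_csv_text(data_rows):
    """Build CSV text with a 'sentences,labels' header followed by the rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["sentences", "labels"])
    writer.writerows(data_rows)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)

# Read the text back to confirm the header plus three rows
parsed = list(csv.reader(io.StringIO(csv_text)))
print(len(parsed), "rows,", len(parsed[0]), "columns")
```

Using the `csv` module instead of manual string joining handles quoting automatically when a sentence contains a comma.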


50 points.

---
<br/><br/>
All files must be saved directly into the git repository you submit. You may want to test your results before submitting, so here is the code you will need: download the test binaries from the main repo, change their permissions so that you can execute them, and run them. Alternatively, you can clone the repository and upload the binaries to Colab. In this case, check the file paths!

## Get and run the tests

In [None]:
!wget https://github.com/BME-SmartLab-VITMMA19/llm-llama2-training/raw/main/llm_task_01
!wget https://github.com/BME-SmartLab-VITMMA19/llm-llama2-training/raw/main/llm_task_02
!wget https://github.com/BME-SmartLab-VITMMA19/llm-llama2-training/raw/main/llm_task_03
!wget https://github.com/BME-SmartLab-VITMMA19/llm-llama2-training/raw/main/llm_task_04
!wget https://github.com/BME-SmartLab-VITMMA19/llm-llama2-training/raw/main/llm_task_05

!chmod u+x llm_task_01
!chmod u+x llm_task_02
!chmod u+x llm_task_03
!chmod u+x llm_task_04
!chmod u+x llm_task_05

In [None]:
!./llm_task_01
!./llm_task_02
!./llm_task_03
!./llm_task_04
!./llm_task_05

# 5. Merge the model and store in Google Drive

In [None]:
from google.colab import drive
from transformers import AutoModelForCausalLM, AutoTokenizer

# Merge and save the fine-tuned model
drive.mount('/content/drive')

model_path = "/content/drive/MyDrive/llama-2-7b-custom"

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Save the merged model
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

# 6. Load a fine-tuned model from Drive and run inference

In [None]:
drive.mount('/content/drive')
model_path = "/content/drive/MyDrive/llama-2-7b-custom"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [None]:
prompt = "Mennyi 2 + 2?"  # change to your desired prompt
gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_new_tokens=20)
result = gen(prompt)
print(result[0]['generated_text'])