**Project Description**

The objective of this project is to adapt a pre-trained LLM to a target domain (continued
pre-training), and perform instruction fine-tuning on question-answering.

💻 🧑 Developer: Milad Nourizade

📧 E-mail: milad.nouriezade@gmail.com




**Tool Selection**

We use **Unsloth** because our compute budget is limited and it is among the most efficient frameworks for fine-tuning on a single GPU.
[Unsloth](https://github.com/unslothai/unsloth?tab=readme-ov-file)
is a free and open-source framework for fine-tuning and inference with LLMs that focuses on single-GPU fine-tuning. According to the project, it is 2.2x faster, uses 70% less VRAM, and has 0% degradation in accuracy for QLoRA (4bit) and LoRA (16bit) fine-tuning. Its main drawback is that it does not support every available model. Here is a [link](https://wandb.ai/augmxnt/train-bench/reports/Trainer-performance-comparison-torchtune-vs-axolotl-vs-Unsloth---Vmlldzo4MzU3NTAx) to a detailed comparison of the popular fine-tuning tools `torchtune`, `axolotl`, and `Unsloth`.



**Approach Overview**

The final goal of this task is a domain-adapted model that is additionally fine-tuned for instruction following (instruction fine-tuning). This can be achieved with the following steps:


1.   Run continued pretraining on the base model. The output of this step is a set of LoRA adapters.
2.   Merge the LoRA adapters with the base model and save the new adapted model in `16bit` precision.
3.   Load the adapted model and quantize it to `4bit`.
4.   Fine-tune for instruction following and, as in steps `1` and `2`, merge the adapters and save the final model.

> Developed models are available at [huggingface.co/Dragonfluy](https://huggingface.co/Dragonfluy).


> The major difference between `fine-tuning` and `continued pretraining` is that for continued pretraining we also add `embed_tokens` and `lm_head` to the LoRA target modules, so that the model can learn out-of-distribution data. Additionally, we should select a higher rank `r` for the LoRA adapters: this trains more weights, which helps the model learn more complex structures.

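The difference between the two stages can be sketched as a choice of target modules. This is a minimal illustration only: `lora_target_modules` and its `stage` names are hypothetical helpers, not part of Unsloth, and the module names are the ones used later in this notebook.

```python
# Projection layers that receive LoRA adapters in both stages (names as used in this notebook).
ATTENTION_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP_MODULES = ["gate_proj", "up_proj", "down_proj"]


def lora_target_modules(stage, include_embed_tokens=False):
    """Return the module names to adapt for a training stage (hypothetical helper)."""
    modules = ATTENTION_MODULES + MLP_MODULES
    if stage == "continued_pretraining":
        # The output head is added so the model can adapt to the new domain's tokens.
        modules = modules + ["lm_head"]
        if include_embed_tokens:
            # Costs extra VRAM; this notebook leaves it out to avoid OOM on a T4.
            modules = modules + ["embed_tokens"]
    elif stage != "instruction_finetuning":
        raise ValueError(f"unknown stage: {stage!r}")
    return modules


print(lora_target_modules("continued_pretraining"))
print(lora_target_modules("instruction_finetuning"))
```

The returned list is what would be passed as `target_modules` when the adapters are created.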


**Model Selection**

We start experimenting with `Llama 3.1 8B` as a small base model because it performs well relative to other models, including proprietary ones, has a long context window (`128k`), and has a parameter count that fits our GPU memory limit (Tesla T4). You can see how LLMs rank on each benchmark on [Lmarena](https://lmarena.ai/?leaderboard) or on other leaderboards such as the [open_llm_leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) on Hugging Face.


**Model Quantization**

Model quantization is a great strategy for loading a large model into GPU memory when the GPU is limited, such as a T4, and for avoiding OOM errors. It also reduces VRAM usage and model download time, and inference can be faster. We applied `4bit quantization`, which converts higher-precision weights (usually `16bit` or `32bit`) to `4bit`.

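To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It counts only the weights, ignoring activations, optimizer state, and the small overhead of quantization constants, and it assumes a round figure of 8 billion parameters; `weight_memory_gib` is a hypothetical helper.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / GIB


NUM_PARAMS = 8e9  # assumption: round figure for an "8B" model

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the weights need roughly 29.8 GiB at 32 bits, 14.9 GiB at 16 bits, and 3.7 GiB at 4 bits, so only the 4-bit version leaves meaningful headroom for training on a 16 GB T4.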

In [None]:
!pip install unsloth==2024.11.7
# Also get the latest nightly Unsloth!
# !pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [None]:
from huggingface_hub import login

# Replace with your actual Hugging Face API token string
huggingface_token = "Your huggingface token"

login(huggingface_token)

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

**Continued Pretraining**

Using the LoRA (Low-Rank Adaptation) technique, we can fine-tune a language model while training far fewer parameters. As previously discussed, in addition to the `Attention` and `Feed forward` layers we should add both `embed_tokens` and `lm_head`. However, because our memory is limited, we leave out `embed_tokens`; otherwise we get an OOM error. The choice of `r` and `lora_alpha` depends on the model size and the complexity of the dataset.

> ⁉ "LoRA r doesn't significantly impact the final performance as long as LoRA adapters are used on all layers of the model." This is an interesting finding from the [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) paper.

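As a rough illustration of what `r` and `lora_alpha` control, the sketch below counts the parameters LoRA adds to a single linear layer and compares the standard scaling factor with the rank-stabilized one. The 4096 x 4096 layer shape is an assumption chosen for illustration, and both functions are hypothetical helpers rather than library calls.

```python
import math


def lora_extra_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    LoRA learns A (r x d_in) and B (d_out x r), so the count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)


def lora_scaling(r, lora_alpha, use_rslora=False):
    """Factor applied to the low-rank update B @ A."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r  # standard LoRA


# Assumed shape for illustration: a square 4096 x 4096 projection.
d_in = d_out = 4096
full = d_in * d_out
for r in (8, 32, 128):
    extra = lora_extra_params(d_in, d_out, r)
    print(f"r={r:>3}: {extra:>9,} LoRA params ({100 * extra / full:.2f}% of the full matrix)")

print(f"standard scaling, r=32, alpha=16: {lora_scaling(32, 16):.3f}")
print(f"rsLoRA scaling,   r=32, alpha=16: {lora_scaling(32, 16, use_rslora=True):.3f}")
```

The parameter count grows linearly with `r`, while the scaling factor shrinks as `r` grows, more slowly when rank stabilization is enabled.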


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj", "lm_head"], # Add for continual pretraining
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

### Data Preparation
We now use the astro abstracts dataset from https://huggingface.co/datasets/UniverseTBD/arxiv-astro-abstracts-all. To speed up the process we sample only the first `5k` rows and divide them into `train` and `validation` sets. We must also append `EOS_TOKEN` (`tokenizer.eos_token`) to each example, or else the model's generation will go on forever.


In [None]:
from datasets import load_dataset
dataset = load_dataset("UniverseTBD/arxiv-astro-abstracts-all", split = "train[:5000]")
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    return { "text" : [example + EOS_TOKEN for example in examples["text"]] }
dataset = dataset.map(formatting_prompts_func, batched = True,)

# Split the dataset into training and testing sets
dataset_dict = dataset.train_test_split(test_size=0.05)

train_dataset = dataset_dict['train']
eval_dataset = dataset_dict['test']

In [None]:
print(len(train_dataset))

### Continued Pretraining


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # num_train_epochs = 1,
        max_steps = 20,
        evaluation_strategy="steps",
        warmup_ratio=0.1,

        learning_rate = 2e-5,
        embedding_learning_rate = 1e-5, # Keep embedding_learning_rate 2x to 10x smaller
                                        # than learning_rate to make continual pretraining work!

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "linear", # We can experiment with "cosine" and "constant"
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

The training and validation losses fluctuate, which shows that we need to spend more time here to find an appropriate combination of hyperparameters. We can experiment with different values of `r`, `lora_alpha`, and the learning rate, or increase the dataset size.

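One reason the loss may be noisy is that a 20-step run sees very little data. The sketch below works this out from the trainer settings above (batch size 2, 8 accumulation steps, 20 steps) and the 4750 training rows. `training_budget` is a hypothetical helper and assumes one dataset row per sequence with no packing. With these values the effective batch size is 16, so the run sees 320 sequences, about 6.7% of the training set.

```python
def training_budget(num_train_rows, per_device_batch_size, grad_accum_steps, max_steps, num_devices=1):
    """Summarize how much data a fixed-step run sees (assumes one row per sequence, no packing)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    sequences_seen = effective_batch * max_steps
    return {
        "effective_batch_size": effective_batch,
        "sequences_seen": sequences_seen,
        "fraction_of_train_set": sequences_seen / num_train_rows,
    }


# 5000 sampled rows with test_size=0.05 leaves 4750 training rows.
budget = training_budget(num_train_rows=4750, per_device_batch_size=2,
                         grad_accum_steps=8, max_steps=20)
print(budget)
```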
In [None]:
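# Record baseline GPU memory before training; the stats cell below uses these values
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

trainer_stats = trainer.train()

In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)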
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

**Save Model**

Now it's time to merge the LoRA adapters with the base model to build the new model and upload it to the Hugging Face profile, where it is available at this [link](https://huggingface.co/Dragonfluy/astro_adapted_llama_3.1_8b). To keep higher precision and prevent a drop in performance, we saved the model's weights in `16bit` format.

In [None]:
model.push_to_hub_merged("Dragonfluy/astro_adapted_llama_3.1_8b", tokenizer, save_method = "merged_16bit", token = huggingface_token)
# model.save_pretrained_merged("Astro_Adapted_Model", tokenizer, save_method = "merged_16bit",)

### Instruction Fine-tuning

Now let's load the new model created in the first phase (continued pretraining) and continue with instruction fine-tuning. As discussed, we set `load_in_4bit` to True to quantize the model's weights from `16bit` to `4bit`.

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Dragonfluy/astro_adapted_llama_3.1_8b",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

As before, for the highest performance we inject LoRA adapters into all the layers. The difference is that for instruction fine-tuning we should not include `lm_head` and `embed_tokens`.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"], # No lm_head / embed_tokens for instruction fine-tuning
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

#### Data Preparation

In this step we prepare the [dataset](https://huggingface.co/datasets/daven3/geosignal) for instruction fine-tuning. We need to put the `Instruction`, `Input`, and `Output` fields into a proper prompt format to feed the model. We still need `EOS_TOKEN` to mark the end of each sequence. Finally, we split the dataset (first 5000 rows) into `train` and `validation` parts.

In [None]:
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("daven3/geosignal", split = "train[:5000]")
dataset = dataset.map(formatting_prompts_func, batched = True,)

# Split the dataset into training and testing sets
dataset_dict = dataset.train_test_split(test_size=0.005)

train_dataset = dataset_dict['train']
eval_dataset = dataset_dict['test']


In [None]:
print(len(train_dataset))

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = True, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        evaluation_strategy="steps",
        # eval_steps=5,
        warmup_ratio=0.1,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

As the training log indicates, the `Validation Loss` and `Training Loss` decrease smoothly, which shows that training is going well. However, loss alone cannot tell us how well a large language model will perform; we need an additional set of metrics to evaluate the model.

In [None]:
trainer_stats = trainer.train()

In [None]:
model.push_to_hub_merged("Dragonfluy/astro_instruct_llama_3.1_8b", tokenizer, save_method = "merged_16bit", token = huggingface_token)

### Inference
Let's run the model!

First we check whether the model follows the style and can write an abstract that is within the distribution of `UniverseTBD/arxiv-astro-abstracts-all`. We select "A detailed analysis of Reuven Ramaty High Energy Solar Spectroscopic Imager (RHESSI)," from the test set for the model to continue.

Next, we select an instruction from `daven3/geosignal` to see how the model performs on the instruction-tuning task.

In [None]:
from transformers import TextIteratorStreamer
from threading import Thread
text_streamer = TextIteratorStreamer(tokenizer)
import textwrap
max_print_width = 100

FastLanguageModel.for_inference(model)

inputs = tokenizer(
[
    "A detailed analysis of Reuven Ramaty High Energy Solar Spectroscopic Imager (RHESSI),"
]*1, return_tensors = "pt").to("cuda")

generation_kwargs = dict(
    inputs,
    streamer = text_streamer,
    max_new_tokens = 512,
    use_cache = True,
)
thread = Thread(target = model.generate, kwargs = generation_kwargs)
thread.start()

length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        wrapped_text = textwrap.wrap(new_text, width = max_print_width)
        length = len(wrapped_text[-1])
        wrapped_text = "\n".join(wrapped_text)
        print(wrapped_text, end = "")
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end = "")
    pass
pass

In [None]:
FastLanguageModel.for_inference(model)  # Enable faster inference
inputs = tokenizer(
    [
        prompt.format(
            "Give me a bulleted list of the past 10 Masters Tournament Champions.",  # instruction
            "",  # input
            "",  # output - leave this blank for generation!
        )
    ], return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
outputs = tokenizer.decode(outputs[0])
print(outputs)

### Further Improvement

**Evaluation**

One important missing part of the process of fine-tuning a large language model is evaluation. Given the time constraint, we had to skip this step, but it is vital to evaluate after each fine-tuning run.

[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main?tab=readme-ov-file) is an open-source framework for evaluating LLMs on different tasks and general benchmarks.

**Perplexity**

Perplexity is a critical metric for evaluating language models, especially in domain adaptation scenarios. It quantifies how well a probability distribution predicts a sample. A lower perplexity shows that the model assigned a higher probability to the observed sequence.

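As a minimal sketch of the definition, perplexity is the exponential of the mean per-token negative log-likelihood. The helper names below are hypothetical. With a Hugging Face trainer, the same formula could be applied to the `eval_loss` reported by `trainer.evaluate()`, assuming that loss is the mean token-level cross-entropy in nats.

```python
import math


def perplexity_from_loss(mean_nll):
    """Perplexity is exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(mean_nll)


def perplexity_from_token_probs(token_probs):
    """Perplexity of a sequence given the probability the model assigned to each true token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return perplexity_from_loss(mean_nll)


# A model that assigns probability 0.25 to every correct token is as uncertain
# as a uniform choice among 4 options, so its perplexity is 4 (up to float rounding).
print(perplexity_from_token_probs([0.25, 0.25, 0.25, 0.25]))
```

Comparing this value on held-out astronomy abstracts before and after continued pretraining would give a first, simple signal of domain adaptation.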

**Conclusion**

Considering the short training run (20-30 steps) and the limited input data, the model performed well on instruction fine-tuning. More compute would let us include `embed_tokens` among the LoRA target modules, experiment with a higher rank (`r`), and load more data to achieve better results in continued pretraining.

Future studies:

* [AstroLLaMA: Towards Specialized Foundation Models in Astronomy](https://arxiv.org/abs/2309.06126)
* [AstroMLab 3: Achieving GPT-4o Level Performance in Astronomy with a Specialized 8B-Parameter Large Language Model](https://arxiv.org/html/2411.09012v1)
* [QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning](https://arxiv.org/abs/2402.10462)
* [Can AI Understand Our Universe? Test of Fine-Tuning GPT by Astrophysical Data](https://arxiv.org/abs/2401.02981)
* [Designing an Evaluation Framework for Large Language Models in Astronomy Research](https://arxiv.org/html/2405.20389v1)
