# **Fine-Tuning using LoRA**

This notebook aims to give you the information you need to use state-of-the-art fine-tuning frameworks effectively for your specific use case, and to integrate them into your environment.
To this end, the guide covers Huggingface's `PEFT` implementation, `unsloth` and `torchtune`.

### **When to use fine-tuning**

Fine-tuning can be an effective way to customize an LLM to your specific needs. However, it is not a silver bullet, and for some use cases other techniques, such as RAG, may be better suited.
In particular, fine-tuning is best used to influence the "behaviour" of an LLM: its speech patterns, phrasing and the like. Getting a model to accurately reproduce facts from your internal knowledge base is much more challenging and requires a lot of data to work reliably.

### **General Dependencies**

There are a few general dependencies for this guide that should be met beforehand: a `Python` environment running at least version `3.9`, and a `PyTorch` installation of version `2.6.0` or later.
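
If you want to verify these requirements from within Python, a minimal sketch could look like the one below. The helper names (`parse_version`, `meets_minimum`, `check_environment`) are made up for this illustration, and the version parsing is deliberately simple: it only looks at the first three numeric components and ignores build tags such as `+cu124`.

```python
import re
import sys
from importlib import metadata


def parse_version(version_string):
    """Turn a string like '2.6.0+cu124' into the tuple (2, 6, 0)."""
    core = version_string.split("+")[0]  # drop local build tags
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep leading digits only
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version_string, minimum):
    """Compare a version string against a (major, minor, patch) tuple."""
    return parse_version(version_string) >= minimum


def check_environment():
    """Return (python_ok, torch_version, torch_ok) without importing torch."""
    python_ok = sys.version_info[:2] >= (3, 9)
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None  # PyTorch is not installed in this environment
    torch_ok = torch_version is not None and meets_minimum(
        torch_version, (2, 6, 0))
    return python_ok, torch_version, torch_ok


python_ok, torch_version, torch_ok = check_environment()
print(f"Python >= 3.9:    {python_ok}")
print(f"PyTorch version:  {torch_version}")
print(f"PyTorch >= 2.6.0: {torch_ok}")
```

Reading the installed version through `importlib.metadata` avoids importing PyTorch just for the check, which can take a few seconds.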

---

## **Setting up a remote environment**

This section goes into detail on how to set up your own remote working environment on a machine with a business-grade GPU. If your own machine has a powerful enough GPU, you can run everything locally, so feel free to skip this section if that is an option for you.

We will be focusing on a single-GPU machine. A multi-GPU machine or even a multi-node cluster requires some extra considerations, but for the purposes of this demonstration that is overkill. Most of it is handled by the frameworks themselves anyway.

There are many hosting services out there. From my personal experience and research I can recommend Paperspace. They provide very reliable machines and handle the networking fairly seamlessly for you. The pay-per-hour pricing is also very reasonable.

### **Basics**

For our purposes an NVIDIA A6000 GPU is sufficient. It is powerful and modern enough to support recent CUDA versions, which some of the frameworks used later require.

You can either use the ML-in-a-Box environment or start from a fresh Linux distro. The first option is convenient because it already has CUDA installed, but there is no guarantee that it is the right version. If it isn't, you'll need to install the required version yourself, just as you would on a fresh distro.

Once you've set up the machine, connect to it using `ssh paperspace@[HOST-IP]`.

From there check the CUDA version (if any) via 
```
nvcc --version
```
You'll be looking for any version higher than 12.0.
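
If you would rather run this check from a script, the sketch below parses the release number out of the `nvcc --version` output. The function names are made up for this example, the sample string only mimics the last line of typical `nvcc` output, and treating 12.0 itself as acceptable is an assumption you may want to adjust.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from text containing 'release <major>.<minor>'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release():
    """Return the installed CUDA release, or None if nvcc is unavailable."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False)
    return parse_nvcc_release(result.stdout)


# Sample line in the style nvcc prints; used here so the parsing can be tried
# on machines without CUDA as well
sample_output = "Cuda compilation tools, release 12.4, V12.4.131"
print(parse_nvcc_release(sample_output))

release = installed_cuda_release()
if release is None:
    print("nvcc was not found or its output could not be parsed")
else:
    print(f"CUDA {release[0]}.{release[1]} new enough: {release >= (12, 0)}")
```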

The following command refreshes the package lists. It can be rather slow and might not be necessary if your CUDA version does not need to be changed.

```
sudo apt-get update
```

To install the right CUDA version manually, check the [NVIDIA documentation](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network).

### **Installing Conda** 

Conda makes environment management easy. Even within a conda env you can still use pip, which then installs packages only into that env, so there's really no downside.

To install it, run the following four commands in order:
```
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
```

You will then have to configure the `PATH` environment variable so that the 
`conda` command is found in the terminal. To do this, run the following command,
which opens the nano editor so you can edit your shell configuration.

```
nano ~/.bashrc
```

Once opened, add the following line to the very end of the file. Then press 
`ctrl+x` to close nano, followed by `y` to save the changes and `enter` to 
confirm the file name.

```
export PATH="/home/paperspace/miniconda3/bin:$PATH"
```
 
After you have done this, reload the configuration with the command below. As a
sanity check you can then print the variable using `echo $PATH`.

```
source ~/.bashrc
```

Now, to be able to use conda effectively, run the following two commands. 
Afterwards you should see `(base)` at the start of your terminal prompt.

```
conda init
source ~/.bashrc
```

### **Installing all dependencies with conda**

First create your conda env using the following command.

```
conda create --name lora python=3.11
```

After you have created it, activate it with

```
conda activate lora 
```

The `(base)` prefix in your prompt should now read `(lora)`, or whatever name you gave your environment.

From there you can install all the dependencies with either `pip install` or `conda install`.

### **Cloning the Git Repo**

To clone your git repository, use the command below and authenticate with your
GitHub username and the access token for the project.

```
git clone [REPO LINK]
```

If you don't want to enter the credentials every time you perform a git action,
you can cache them for a certain amount of time. Simply run this command once
before a git action. You can set the timeout (in seconds) to whatever you want.

```
git config --global credential.helper 'cache --timeout=3600'
```

Note, however, that this only lasts for the current session, so for the first git action of any session you will have to enter your credentials once.

If you have all dependencies installed you should now be good to go to run your scripts as you normally would.

### **Mounting a Network Drive**

On-device storage can be quite expensive, so if you need more space, a network drive is often the better solution. To use a network drive, you first have to mount it. Once mounted, the external drive is treated like a normal directory in your file system, making all of the networking completely transparent!

To mount the network drive, you may first need to create the directory you want to mount it to. I recommend navigating to your repository's root directory first, for more convenient access, and then creating the new folder:
`mkdir ./data`

Afterwards, give your user ownership of the mount point using

```
sudo chown paperspace:paperspace /home/paperspace/PATH/TO/MOUNT
```

Then register the drive by editing `/etc/fstab` in nano

```
sudo nano /etc/fstab
```

Simply add this line at the bottom

```
//your-shared-drive-ip-address/your-shared-drive-name /home/paperspace/PATH/TO/MOUNT   cifs  username=your-username,password=your-password,uid=1000,gid=1000,rw,user  0  0
```

Lastly, you can now mount the drive with the following call. Simply use the
same mount path you wrote into `/etc/fstab`.

```
mount /home/paperspace/PATH/TO/MOUNT
```

You can verify whether this worked with  `df -h`

### **Accessing Remote Data**

To download data from the remote machine to your local one, run

```
scp paperspace@host-ip:complete-host-path path-to-copy-to
```

Conversely, to upload data:

```
scp path-to-upload paperspace@host-ip:complete-host-path 
```

To find the full path of a folder on the remote machine, run `pwd` inside it.

If you uploaded an archive, you can extract it via

```
tar -xzvf filename.tar.gz
```

### **Monitoring the System's Performance**

You can run `ls -l` to check when the files in the current directory were last 
changed.

You can watch a file update in real time, showing its last lines, using
`tail -f -n 20 ./` followed by the file's name.

Check which processes are running with `htop`.

---

## **Selecting a Base Model**

Selecting a base model, that is, the model we will train the adapter for, should be your first decision, as it affects which frameworks and data formats are compatible. The range of competitive models is ever expanding, which makes it hard to recommend a single one for all cases, but there are a few key considerations to keep in mind when picking one.

### **Foundation vs. Instruction Model**

The first and perhaps easiest choice is whether to go for an instruction model or not. For most normal use cases it is enough to go with the instruction model and move on, but let's have a look at what this actually means.
Instruction models have themselves already been fine-tuned to expect a specific message format, which influences their behaviour. As the name suggests, this format is structured around certain instructions, each of which is intended to handle a different aspect of the model's behaviour. In the vast majority of cases there are three kinds of instructions:

1. `System`: This instruction is directed at the model itself and influences the general behaviour of a model, rather than a specific question. 
2. `User`: This is the input the model should generate a response to. 
3. `Assistant`: This is the model's response.

As you may have noticed, these are essentially the same as the different types of prompts you use when interfacing with popular models such as the GPTs. This also gives you a good pointer for when to use these models: whenever you're aiming to have some sort of user interaction with the model.

Foundation models do not impose this condition on their inputs. They simply take in any text and generate likely continuations. This makes them essentially a blank canvas for you to do whatever you want with, which may be useful if you're looking to generate text without any user influence.

### **Open vs. Closed Source**

As with most other software, this is another important consideration. Closed-source models are often the more powerful alternative. However, open-source models have two key advantages. First, you have complete control over your internal data: there is no obligation to send any data to an external service provider. Second, because you can run the models on the hardware of your choice, you have much greater control over the environment the model runs in, which in turn affects the cost of your application.

Pay attention to the respective open-source license, however, as many of them place additional restrictions and conditions on the intended use. For research and educational purposes, most of them are fairly permissive.

For the purposes of this demonstration we will go with open-source models, as that allows us to show the implementation and integration of the complete fine-tuning process in your own environment.


### **Model Sizes**

Model size is measured in the number of parameters a model is composed of. This number can range from about one billion into the trillions. Generally speaking, the more parameters a model has, the more powerful it is. However, a higher parameter count also requires more resources, as the model lives in your GPU's VRAM at runtime. Your hardware therefore puts a hard limit on the parameter count you can choose. If you're running this on your local PC, the choice is usually limited to 1B to 8B parameters. That can already give you decent performance, but it's definitely on the low end of the spectrum. For higher-tier models you will need a business-grade GPU, as explored in the previous section (...or even a cluster).

One important note is that training requires a lot more VRAM than running a model for inference, so fine-tuning quickly becomes a very demanding task even for high-end consumer GPUs.
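
To get a feeling for these numbers, the following back-of-the-envelope sketch estimates how much memory the weights alone occupy. The function name and the dtype table are made up for this illustration. The estimate only covers the weights: activations, and during training also gradients and optimizer states, come on top of it.

```python
# Bytes needed to store a single parameter in common numeric formats
BYTES_PER_PARAMETER = {
    "float32":  4,
    "float16":  2,
    "bfloat16": 2,
    "int8":     1,
}


def weight_memory_gib(parameter_count, dtype="bfloat16"):
    """Memory in GiB needed just to hold the model weights."""
    return parameter_count * BYTES_PER_PARAMETER[dtype] / 1024**3


for name, count in [("1B", 1e9), ("8B", 8e9), ("70B", 70e9)]:
    print(f"{name}: {weight_memory_gib(count):6.1f} GiB in bfloat16, "
          f"{weight_memory_gib(count, 'float32'):6.1f} GiB in float32")
```

An 8B model in 16-bit precision already needs roughly 15 GiB for its weights alone, which explains why that size sits at the upper limit for consumer GPUs.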

### **Options**

As mentioned before, there are many viable open-source models, and the list is ever expanding. This guide focuses on the Llama 3 model family, as it offers a wide range of model sizes, all of which perform well for their respective parameter count, which makes quick prototyping very easy. For reference, a test using a MistralAI model is also included later on.
The implementations shown are fairly model-agnostic, however, so it should be straightforward to switch one model out for another in most cases.

### **Downloading the Models**

The de facto standard platform for downloading and using LLMs is `Huggingface`, which provides access to essentially all available open-source LLMs, as well as lots of ready-to-use datasets.

It provides libraries for downloading models both via a CLI and from within your own script.

To install the CLI simply run
```
pip install huggingface_hub
```

You will then need to provide a personal access token in order to download models from Huggingface, which you can do via `huggingface-cli login`.

You can then download your model of choice using the CLI with

```
huggingface-cli download <MODEL-ID>
```
Here the model ID usually consists of a group name and a model name, for example: `meta-llama/Llama-3.2-1B`

By default this downloads the model into the cache below the path specified by the `HF_HOME` environment variable. So if you want to change the folder, you will need to change this variable.
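
To see where the files would end up on your machine, the sketch below mirrors the documented default lookup order without importing any Huggingface library. The function name is made up, and the logic is a simplified reading of the documented defaults (`HF_HUB_CACHE`, then `HF_HOME`, then `~/.cache/huggingface`), not the library's actual code.

```python
import os
from pathlib import Path


def resolve_hub_cache(env):
    """Return the directory downloaded models are expected to be cached in."""
    # An explicit hub cache location takes precedence over everything else
    if "HF_HUB_CACHE" in env:
        return Path(env["HF_HUB_CACHE"])

    if "HF_HOME" in env:
        hf_home = Path(env["HF_HOME"])
    else:
        # Fall back to the user's general cache directory
        cache_root = Path(env.get("XDG_CACHE_HOME", Path.home() / ".cache"))
        hf_home = cache_root / "huggingface"

    return hf_home / "hub"


print("With your current environment:", resolve_hub_cache(os.environ))
print("With HF_HOME=/data/hf:        ", resolve_hub_cache({"HF_HOME": "/data/hf"}))
```

Setting `HF_HOME` to a folder on a mounted network drive is one way to keep large model files off the more expensive on-device storage.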

Alternatively, you can download the model from within your Python script using the Huggingface Python libraries, which you install using:

```
pip install transformers accelerate datasets
```

Then download the model with the following code snippet:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Foundation Model
# base_model_id   = "meta-llama/Llama-3.2-1B"
# Instruct Model
base_model_id   = "meta-llama/Llama-3.2-1B-Instruct"

base_model      = AutoModelForCausalLM.from_pretrained(base_model_id) 
# Each model comes with its own tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

print (f"Successfully downloaded Model: {base_model.config._name_or_path}")

`AutoModelForCausalLM` will automatically build the appropriate model for the specified ID and return an abstraction of it, so you never have to interface with the model's internals directly. Each model also comes with its own tokenizer. It is downloaded from the same repository, so you don't have to do anything extra to get it. Simply create an instance through the abstraction provided by `AutoTokenizer` and you're good to go.

### **Inference**

The abstractions of the transformers library make running inference on a model fairly simple. There are, however, some things you have to keep in mind in order to use them properly. \
The most important part is that the prompt gets transformed into the correct format. As discussed before, instruction models expect a specific format that varies depending on which model you're using. Luckily, the tokenizer for such models can automatically apply the format the model requires, provided the input comes in a commonly used structure. The most common structure, and the one we will be using here, is a list of individual prompt objects, each of which contains a role and content (i.e. the message itself). This structure is widely accepted, and the tokenizer automatically transforms it into whichever specific prompt format your model uses.

In [None]:
# This is only necessary (and possible) for instruct models
prompt = [
    # This is the role of the model
    {"role": "system",  "content": "You are a helpful AI Assistant."},
    # This is the prompt the model will answer
    {"role": "user",    "content": "What is the meaning of life?"}
    # We do not need an assistant entry, as it will be generated by the model
]

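# Convert the prompt from the list-of-messages structure to the format that is
# expected by the respective model. add_generation_prompt appends the header of
# an assistant turn, so the model answers instead of continuing the user turn.
# NOTE: foundation models do not have a chat template so this will fail
formatted_prompt = tokenizer.apply_chat_template(
    prompt, 
    tokenize                = False,
    add_generation_prompt   = True
)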

print ("Formatted Prompt:")
print (formatted_prompt)
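
To make it more tangible what such a chat template does, here is a toy version in plain Python. The `<|role|>` and `<|end|>` markers are invented for this sketch and are not the special tokens of Llama or any other real model. The actual template ships with the tokenizer and differs from model to model.

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten a list of {"role", "content"} dicts into one prompt string."""
    rendered = ""
    for message in messages:
        # Each message is wrapped in a role header and an end marker
        rendered += f"<|{message['role']}|>\n{message['content']}\n<|end|>\n"

    # Open an assistant turn so the model knows it is its turn to answer
    if add_generation_prompt:
        rendered += "<|assistant|>\n"
    return rendered


example_messages = [
    {"role": "system", "content": "You are a helpful AI Assistant."},
    {"role": "user",   "content": "What is the meaning of life?"},
]

print(render_chat(example_messages))
```

The important takeaway is that the model only ever sees one long string. The roles exist purely as markers inside that string, which is why using the wrong template for a model tends to degrade its answers.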

Once the prompt has the required format, the rest is pretty straightforward. Simply tokenize the prompt and move everything to your GPU. Then call the `generate` method. There are a few parameters to tweak its behaviour, which are explained in the example below, but it's not at all complicated. Finally, decode the returned tokens and there you go: you've successfully run your own local LLM.

In [None]:
# For foundation models you can simply provide any text
# formatted_prompt = "The meaning of life is"

def run_inference(model, tokenizer, prompt):
    # The tokenizer transforms the raw text into so called input ids, which we
    # specify to be PyTorch tensors
    tokenized_prompt    = tokenizer(prompt, return_tensors="pt")

    # The tokenizer returns two outputs: the tokenized input, the "input ids",
    # and the "attention mask", which tells the model which tokens are real 
    # input and which are just padding
    input_ids           = tokenized_prompt.input_ids
    attention_mask      = tokenized_prompt.attention_mask
    # Move the model as well as the inputs to your GPU
    device          = "cuda"
    model           = model.to(device) 
    input_ids       = input_ids.to(device)
    attention_mask  = attention_mask.to(device)

    # Autoregressively generate the output sequence(s). The result is always a
    # batch of sequences, even if you're just generating one output.
    output_sequences = model.generate(
        input_ids,
        attention_mask          = attention_mask,
        # Defines how many tokens will be generated on top of the already 
        # existing prompt
        max_new_tokens          = 512, 
        # Makes it so we don't always take the most likely next token, but 
        # sample from a distribution of likely next tokens. If this is disabled
        # a given prompt will always return the same result.
        do_sample               = True, 
        # Values below 1 sharpen the distribution towards the most likely
        # tokens, values above 1 flatten it towards the less likely ones
        temperature             = 0.6,
        # Consider only as many tokens as it takes to reach this cumulative 
        # probability
        top_p                   = 0.9,
        # How many tokens will at most be considered for the sampling
        top_k                   = 10,
        # How many sequences will be generated for the result
        num_return_sequences    = 1,
        pad_token_id            = tokenizer.eos_token_id,
        eos_token_id            = tokenizer.eos_token_id
    )

    # Move the output sequences, inputs and model back into RAM
    device              = "cpu"
    output_sequences    = output_sequences.to(device)
    model               = model.to(device) 
    input_ids           = input_ids.to(device)
    attention_mask      = attention_mask.to(device)

    # Remove the original prompt from the response to only get the generated 
    # answer
    output_sequences = output_sequences[:, input_ids.shape[1]:]

    # Decode the first answer from tokens back to plain text, ignore the 
    # special tokens to only get readable text
    response = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
    
    return response

print (run_inference(base_model, tokenizer, formatted_prompt))
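
To build some intuition for how `temperature`, `top_k` and `top_p` interact, the following sketch applies them to a handful of made-up logits in plain Python. It is a simplified illustration of the general idea, not the implementation used by the transformers library, and all function names are invented for this example.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_distribution(logits, temperature=0.6, top_k=10, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # 1. Temperature: dividing the logits rescales the whole distribution
    probabilities = softmax([value / temperature for value in logits])

    # 2. Top-k: keep only the k most likely tokens
    ranked = sorted(
        range(len(logits)), key=lambda i: probabilities[i], reverse=True)
    candidates = ranked[:top_k]
    mass = sum(probabilities[i] for i in candidates)

    # 3. Top-p: keep tokens until their cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in candidates:
        kept.append(i)
        cumulative += probabilities[i] / mass
        if cumulative >= top_p:
            break

    # Renormalize so the remaining probabilities sum to 1 again
    kept_mass = sum(probabilities[i] for i in kept)
    return {i: probabilities[i] / kept_mass for i in kept}


def sample(distribution, rng):
    """Draw one token index according to the filtered probabilities."""
    return rng.choices(
        list(distribution), weights=list(distribution.values()))[0]


example_logits = [4.0, 3.0, 2.0, 1.0, 0.0]
for temperature in (0.3, 0.6, 1.5):
    distribution = filter_distribution(example_logits, temperature=temperature)
    print(temperature, {i: round(p, 3) for i, p in distribution.items()})

print("Sampled token:", sample(filter_distribution(example_logits),
                               random.Random(0)))
```

Lower temperatures concentrate the probability mass on fewer tokens, so fewer candidates survive the `top_p` cut and the output becomes more predictable.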

---

## **Preparing the Data**

In order to fine-tune your model, you of course need training data. This section deals with exactly that. It has two separate parts: the first looks at how to prepare a pre-made dataset, the second shows you how to prepare your very own custom one. The second part is a lot more involved, so I recommend focusing on the first one for your initial experiments.

Our first snippet simply defines a helper function that we will use later to format the entire dataset. It essentially just formats and tokenizes the data, as we've already seen for inference. The exact procedure depends on the structure of the dataset used, so you might have to adjust it to your specific dataset.

In [None]:
# Helper function to format and tokenize the entire dataset. The exact steps 
# you will need to do here depend on the structure of your dataset.
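def tokenize_function(entry, tokenizer):
    # Padding requires a pad token, which some tokenizers do not define. 
    # Reusing the end-of-sequence token is a common workaround.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Apply the model's chat template to the conversation(s) of this entry
    formatted_text = tokenizer.apply_chat_template(
        entry["messages"], 
        tokenize = False
    )     
    
    tokenized_text = tokenizer(
        text            = formatted_text,
        # Changing this value can significantly change VRAM usage, so always
        # go with as little as possible
        max_length      = 256,
        # Pad shorter and cut off longer examples so all have the same length
        padding         = "max_length",
        truncation      = True,
        return_tensors  = "pt"
    )
    return tokenized_text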
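
What `max_length`, `padding` and `truncation` do to a single example can be illustrated without any library. The sketch below uses a made-up helper and made-up token IDs. It pads on the right, which is an assumption, as real tokenizers let you configure the padding side.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id):
    """Bring a list of token ids to exactly max_length entries."""
    # Truncation: cut off everything beyond max_length
    kept = token_ids[:max_length]

    # Padding: fill the remaining positions with the pad token
    padding = max_length - len(kept)
    input_ids = kept + [pad_token_id] * padding

    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask


short_example = [101, 7592, 2088]
long_example = [101, 7592, 2088, 2003, 1037, 3231, 102]

print(pad_or_truncate(short_example, max_length=5, pad_token_id=0))
print(pad_or_truncate(long_example, max_length=5, pad_token_id=0))
```

Because every example ends up with exactly `max_length` tokens, this value directly determines the size of each training batch in memory, which is why keeping it small saves VRAM.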

### **Using a Pre-Existing Dataset**

The Huggingface datasets library allows you to easily download any dataset available on the platform and seamlessly integrate it with the workflow we've prepared so far.

Similar to how `from_pretrained` automatically downloads the model specified by its ID, `load_dataset` does the same for a dataset. One small nuance is that there are two different splits of the dataset, one for training and one for evaluation. \
Once downloaded, we simply need to format the data for our model. We can do this via the `map()` function, to which we pass the `tokenize_function()` introduced above. `map()` also allows us to batch this process, which drastically reduces its runtime.

And that's already it, we've now done everything necessary to prepare the data for our model!

In [None]:
# Example for a pre-made dataset directly from huggingface
from datasets import load_dataset

# This is the id of the main dataset
dataset_id      = "HuggingFaceTB/smoltalk"
# If you have a subdataset, you can specify that as well
dataset_name    = "numina-cot-100k"

training_data = load_dataset(
    dataset_id, 
    dataset_name, 
    split="train")

validation_data = load_dataset(
    dataset_id, 
    dataset_name, 
    split="test")

# You will need to tokenize (and in the case of instruction models format) 
# the dataset before you can use it to train your adapter
train_data  = training_data.map(
    tokenize_function, 
    # This batches the data, increasing the performance of the mapping
    batched     = True,
    # This will be shown in the progress bar while the mapping is in progress
    desc        = "Formatting Training Data",
    # These arguments will be passed through to your mapping function
    fn_kwargs   = {
        "tokenizer" : tokenizer})
        
val_data    = validation_data.map(
    tokenize_function, 
    batched     = True,
    desc        = "Formatting Validation Data",
    fn_kwargs   = {
        "tokenizer" : tokenizer})


### **Creating your own Dataset**

Creating your own dataset is a lot more complex than taking a pre-made one. As this is a very individual task, what follows is merely one possible workflow. However, it demonstrates some common techniques you can use and adapt to your specific use case.

The overarching method is to leverage an LLM to generate training examples for us, based on some general structure, prompt or idea we provide. This saves a lot of time over writing many examples by hand. It may take some consideration and analysis beforehand, though, to make sure the generated data is actually useful.

The snippet below essentially just provides a logical structure that allows us to group useful information from one specific PDF file. As mentioned, it simply serves to illustrate that you will have to do some preparation of your data for optimal results. There's no silver bullet for this. There are libraries such as `augmentoolkit` or `distilabel` that may work for you if you're looking for a very general workflow. In most real cases, however, you will have to do some preparation of your own, especially if you're working with languages other than English or intend to use a local LLM for the generation step.

In [3]:
# A bunch of identifying keywords that represent whenever a new batch of
# information follows that is relevant for the fine-tuning dataset
table_section_keywords = [
    "Studiengang", 
    "Kürzel", 
    "Bezeichnung", 
    "Lehrveranstaltung(en)",
    "Verantwortliche(r)",
    "Zuordnung zum Curriculum",
    "Verwendbarkeit",
    "Semesterwochenstunden",
    "ECTS",
    "Voraussetzungen",
    "Dauer",
    # Submodule
    "Lehrveranstaltung",
    "Dozent(en)",
    "Hörtermin",
    "Häufigkeit",
    "Art",
    "Lehrform",
    "Semesterwochenstunden",
    "ECTS",
    "Prüfungsform",
    "Sprache",
    "Lehr- und Medienform(en)",
]

content_keywords = [
    "Lernziele",
    "Inhalt",
    "Literatur",
]

# Helper function to determine whether a line starts with a defined keyword
def starts_with_any(string, keywords):
    for k in keywords:
        if string.startswith(k):
            return k

    return ""

# Helper class that contains a page's content. This is highly individual and
# will depend greatly on whatever source data you are working with.
#
# The exact workings of this class are not really relevant. It simply 
# represents a page divided into logical chunks that contain information
# of some sort that we want to feed into the LLM
class Page():
    def __init__(self, raw_text):
        self.raw_text           = raw_text

        self.name               = ""
        self.is_submodule       = False

        self.table_paragraphs   = []
        self.content_paragraphs = []

        latest_access           = None

        for line in raw_text.splitlines():
            parsed_line = ""

            # New logical Page
            if line.startswith ("I."):
                self.name = " ".join(line.split(" ")[1:])
                self.is_submodule = line.split()[0].count(".") >= 3
                continue
            
            # Table section
            keyword = starts_with_any(line, table_section_keywords)
            if keyword:
                parsed_line = keyword + ":" + line[len(keyword):]
                self.table_paragraphs.append(parsed_line)
                latest_access = self.table_paragraphs
                continue
            
            # Content section
            keyword = starts_with_any(line, content_keywords)
            if keyword:
                parsed_line = keyword + ":" + line[len(keyword):]
                self.content_paragraphs.append(parsed_line)
                latest_access = self.content_paragraphs
                continue
            
            # Any other line continues the most recently started paragraph
            if latest_access:
                latest_access[-1] += "\n" + line


    def __str__(self):
        return (self.name 
                + "\n" 
                + str(self.table_paragraphs) 
                + "\n" 
                + str(self.content_paragraphs))
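
One detail of the keyword matching is worth a closer look: `starts_with_any` returns the first keyword that matches, so a keyword that is a prefix of another one has to come later in the list. The sketch below repeats the helper so that it runs on its own. The sample line is made up, and `label_line` is a hypothetical helper that mirrors the `keyword + ":"` formatting used in the class above.

```python
def starts_with_any(string, keywords):
    """Return the first keyword the string starts with, or "" if none does."""
    for k in keywords:
        if string.startswith(k):
            return k
    return ""


def label_line(line, keywords):
    """Format a line as 'keyword: rest', or return None if nothing matches."""
    keyword = starts_with_any(line, keywords)
    if not keyword:
        return None
    return keyword + ":" + line[len(keyword):]


# Made-up example line in the style of the source PDF
line = "Lehrveranstaltung(en) Analysis 1"

# Longer keyword first: the whole label is recognized
good_order = ["Lehrveranstaltung(en)", "Lehrveranstaltung"]
# Shorter keyword first: it matches too early and splits the label
bad_order = ["Lehrveranstaltung", "Lehrveranstaltung(en)"]

print(label_line(line, good_order))
print(label_line(line, bad_order))
```

This is why `"Lehrveranstaltung(en)"` appears before `"Lehrveranstaltung"` in `table_section_keywords`. If you adapt the list to your own documents, keep that ordering in mind.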


Next, we define the paths for the source file and the intermediate results we will get.

In [4]:
import os

save_path       = "custom_data"
source_path     = os.path.join(save_path, "source.pdf")
raw_data_path   = os.path.join(save_path, "raw_data.txt")
data_path       = os.path.join(save_path, "data.json")


This snippet does all the heavy lifting for the generation step. First we read our source data and convert it into our logical information representation. Then follows the important part: using the logical chunks of information, we build a prompt for our local LLM that details the result we want and the data format it should have, and that also provides an example of a question and its corresponding answer. Ideally this step uses a fairly powerful LLM, as the quality of the generated data matters immensely. Note that it does not need to be the same model as the one you're looking to fine-tune.

Generating the question and answer pairs may take quite a while, so in this snippet we only use a tiny slice of the prepared data for demonstration purposes. The interim results are saved in `raw_data.txt` so you can have a look at what such a result may look like.

In [5]:
import  json
from    pypdf  import PdfReader
from    tqdm   import tqdm

# Load our source Data and prepare it to a list of chapters
reader = PdfReader(source_path)

parsed_pdf          = []
finished_preamble   = False 

# Again the exact workings are irrelevant as they are specific to the data you
# are working with
for page in reader.pages:
    page_text = page.extract_text()

    if page_text.startswith("I."):
        finished_preamble = True
        parsed_pdf.append(page_text)
    
    elif finished_preamble:
        parsed_pdf[-1] = parsed_pdf[-1] + "\n" + page_text

    elif len(page_text):
        parsed_pdf.append(page_text)

# After we've created the list of chapter texts, we can convert them to our
# previously established logical representation, in this case the Page() class
prepared_pages = []
for page in parsed_pdf:
    prepared_pages.append(Page(page))

with open("prompts.json", "r", encoding="utf-8") as f:
    prompts = json.load(f)

def build_prompt (page, content, amount):
    return [
        {"role": "system", "content": prompts["sys_qa_prompt"]},
        {"role": "user", "content": 
            prompts["user_course_intro"] 
            + page.name
            + prompts["user_question_amount_pre"] 
            + str(amount)
            + prompts["user_question_amount_post"]
            + prompts["user_course_name_condition"]
            + page.name
            + prompts["user_one_shot_example"]
            + prompts["user_page_content_intro"]
            + content
        }   
    ]


# Once we've prepared the pages we can then generate prompts for our LLM to 
# answer. For this we first define a function that generates our prompts from
# the previously created logical chunks of information
def generate_question_pairs(page, 
                            model,
                            tokenizer):

    question_answer_pairs = []
    for table_content in page.table_paragraphs:
        prompt      = build_prompt(page, table_content, 4)
        prompt      = tokenizer.apply_chat_template(prompt, tokenize = False)
        response    = run_inference(model, tokenizer, prompt)
        
        question_answer_pairs.append(response)

    for content in page.content_paragraphs:
        prompt      = build_prompt(page, content, 10)
        prompt      = tokenizer.apply_chat_template(prompt, tokenize = False)
        response    = run_inference(model, tokenizer, prompt)

        question_answer_pairs.append(response)
    
    return question_answer_pairs

# Short example for a test run. A normal run would take a lot longer than this!
responses = []
for page in tqdm(prepared_pages[6:7]):
    responses.append(
        generate_question_pairs(
            page, 
            base_model,
            tokenizer)
    )

os.makedirs(save_path, exist_ok=True)
with open(raw_data_path, "w", encoding="utf-8") as f:
    for response in responses:
        for r in response: 
            f.write(r + "\n") 
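
The `prompts.json` file itself is not shown in this notebook. The sketch below illustrates what it could look like: the keys are the ones the code accesses, while every value is a made-up placeholder that you would replace with your own wording (and language). `build_user_message` is a hypothetical helper that only repeats the concatenation order of `build_prompt` so the result can be inspected.

```python
import json

# Keys as used by the code in this notebook, values are invented placeholders
example_prompts = {
    "sys_qa_prompt": "You write question and answer pairs about university courses.",
    "user_course_intro": "The following text describes the course ",
    "user_question_amount_pre": ". Write exactly ",
    "user_question_amount_post": " question and answer pairs about it. ",
    "user_course_name_condition": "Every question must mention the course name ",
    "user_one_shot_example": (
        '. Put each pair on its own line, like this: '
        '{"Q": "How many ECTS does the course award?", "A": "5 ECTS."} '
    ),
    "user_page_content_intro": "Here is the text: ",
    "sys_training_prompt": "You answer questions about university courses.",
}


def build_user_message(prompts, course_name, content, amount):
    """Assemble the user message in the same order as build_prompt()."""
    return (
        prompts["user_course_intro"]
        + course_name
        + prompts["user_question_amount_pre"]
        + str(amount)
        + prompts["user_question_amount_post"]
        + prompts["user_course_name_condition"]
        + course_name
        + prompts["user_one_shot_example"]
        + prompts["user_page_content_intro"]
        + content
    )


# This is what the content of prompts.json could look like
print(json.dumps(example_prompts, indent=4, ensure_ascii=False))

# And this is the resulting user message for a made-up course
print(build_user_message(example_prompts, "Example Course", "ECTS: 5", 4))
```

Asking for one JSON object per line with the keys `Q` and `A` keeps the answers easy to clean up and parse afterwards.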


The data generated this way will still contain some artifacts of your LLM's answers, such as "Sure, here are X results". There may also be incorrectly formatted or incomplete results. The quality of your source data, your prompt and your LLM all play a part in this, so you may have to adjust one or all of those components if the quality falls below expectations.

To fix some of the formatting issues, there's a simple script below that filters out most unusable lines. Some minor manual adjustments may still be needed afterwards. The final prepared data can be found in `data.json`.

In [None]:
# After we have prepared a raw data document we still need to properly format
# and clean it in such a way that no remnants or artifacts remain from the LLM's
# answers

with open(raw_data_path, "r", encoding="utf-8") as file:
    content = file.read() 

# Ensure that each line represents a valid JSON object
formatted_data = []
for line in content.splitlines():
    # Remove surrounding whitespace and a trailing comma, if there is one
    cleaned_line = line.strip().rstrip(",").rstrip()
    
    # Just some simple checks to see if the formatting is correct.
    if not cleaned_line.startswith("{") or not cleaned_line.endswith("}"):
        continue

    formatted_data.append(cleaned_line)

# Save the objects surrounded by a list to make it a valid JSON file. The 
# objects are joined with commas, as JSON does not allow a trailing comma after
# the last element.
with open(data_path, "w", encoding="utf-8") as f:
    f.write("[\n")
    f.write(",\n".join(formatted_data))
    f.write("\n]")

# NOTE: Depending on your model's strength you may have to remove some
# artifacts manually from the generated data.json as the auto formatting is
# fairly naive. This should at most be a few lines though.
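
The string checks above are deliberately naive. As a possible stricter variant, the sketch below parses every candidate line with `json.loads` and keeps only objects that contain both expected keys. The helper name `extract_valid_pairs` and the sample text are hypothetical and only serve as illustration; the `Q` and `A` keys are assumed to match the format produced by your prompt.

```python
import json


def extract_valid_pairs(raw_text, required_keys=("Q", "A")):
    """Return the parsed objects of all lines that are valid JSON objects
    containing every required key (hypothetical helper for illustration)."""
    pairs = []
    for line in raw_text.splitlines():
        # Remove surrounding whitespace and a possible trailing comma
        candidate = line.strip().rstrip(",").strip()
        if not candidate.startswith("{"):
            continue
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            # Incomplete or malformed object, skip it
            continue
        if isinstance(parsed, dict) and all(k in parsed for k in required_keys):
            pairs.append(parsed)
    return pairs


# Made-up raw output containing typical artifacts
raw = "\n".join([
    "Sure, here are 2 results:",
    '{"Q": "What is LoRA?", "A": "A low-rank adapter method."},',
    '{"Q": "Incomplete entry", "A": ',
    '  {"Q": "Second question?", "A": "Second answer."}',
])
pairs = extract_valid_pairs(raw)
print(json.dumps(pairs, indent=2, ensure_ascii=False))
```

Writing the result with `json.dump(pairs, f)` would then always produce a valid JSON file, at the cost of silently dropping every line that cannot be parsed.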

Now that we've prepared our dataset, we still need to make it usable for the LLM we want to fine-tune. This is achieved in the snippet below. Essentially, it loads our `data.json`, transforms it into the universally accepted structure we've seen before, and then formats and tokenizes it as we've already done for the pre-made dataset earlier.

In [None]:
from datasets import Dataset


# now to actually create the datasets from the prepared QA pairs
def data_mapper(entry):
    result_s = {}
    result_s["role"]    = "system"
    result_s["content"] = prompts["sys_training_prompt"]

    result_q = {}
    result_q["role"]    = "user"
    result_q["content"] = entry["Q"]

    result_a = {}
    result_a["role"]    = "assistant"
    result_a["content"] = entry["A"]

    return [result_s, result_q, result_a]

with open(data_path, "r", encoding="utf-8") as f:
    data = json.load(f)

training_data   = data[:50]
validation_data = data[50:60]

training_data   = {
    "messages": [data_mapper(entry) for entry in training_data]}
validation_data = {
    "messages": [data_mapper(entry) for entry in validation_data]}

training_data   = Dataset.from_dict(training_data)
validation_data = Dataset.from_dict(validation_data)

tokenizer.pad_token = tokenizer.eos_token

training_data = training_data.map(
    tokenize_function, 
    batched     = True,
    desc        = "Formatting Training Data",
    fn_kwargs   = {
        "tokenizer" : tokenizer})


validation_data = validation_data.map(
    tokenize_function, 
    batched     = True,
    desc        = "Formatting Validation Data",
    fn_kwargs   = {
        "tokenizer" : tokenizer})
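
The fixed slices `data[:50]` and `data[50:60]` take the pairs in file order, so neighbouring questions from the same page tend to end up in the same split. A minimal sketch of a shuffled split is shown below, assuming the same `Q`/`A` entries as above. The helper names `split_pairs` and `to_messages`, the validation fraction and the example system prompt are hypothetical.

```python
import random


def split_pairs(entries, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the entries reproducibly and split it into
    (training, validation) lists."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def to_messages(entry, system_prompt):
    """Build the system/user/assistant message list for one QA pair."""
    return [
        {"role": "system",    "content": system_prompt},
        {"role": "user",      "content": entry["Q"]},
        {"role": "assistant", "content": entry["A"]},
    ]


# Made-up QA pairs standing in for the content of data.json
example = [{"Q": f"Question {i}", "A": f"Answer {i}"} for i in range(10)]
train_entries, val_entries = split_pairs(example)

train_messages = {
    "messages": [to_messages(e, "You are a helpful assistant.")
                 for e in train_entries]}
print(len(train_entries), len(val_entries))
```

The resulting dictionaries have the same shape as the ones passed to `Dataset.from_dict` above, so they could be used as a drop-in replacement.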

And there you have it: your completely original dataset, made from scratch and ready to use for the fine-tuning process. Again, it's important to mention that all of this is just one example of a possible, albeit common and simple, workflow for preparing fine-tuning data. You will have to make many adjustments and refinements to make this viable for your project.

## **Fine-Tuning the Base-Model**

Finally, after all these steps, we can get to actually fine-tuning the base model using the previously prepared dataset. 


### **Selecting a Framework**

As you'll likely notice, the process for unsloth and PEFT is remarkably similar. Because unsloth provides some excellent runtime and memory optimizations while being extremely simple to use, I recommend using it if you have the choice. Its main downside, however, is that it only works on Linux machines. If that is not available to you, you'll have to go with one of the other options. 

The other two options we will be examining here are Hugging Face's PEFT and torchtune. Torchtune is provided directly by Meta and is a PyTorch-native library. Because of this it has some nice runtime and memory benefits; the main drawback is that it supports a rather limited number of models. It does, however, support the Llama model family, which provides excellent performance for open-source models and comes in a wide variety of model sizes.

Another con is that torchtune can be a bit tricky to integrate into your own workflow, as it is mainly designed to be used as a command-line tool. It does, however, provide a variety of standalone scripts that you can copy and adapt to fit into your own architecture. The needed config and adapted script are provided in the `torchtune_custom` folder. The adapted excerpts have been marked as such if you want to have a look at them. For more scripts and configs that cater to different use cases, such as full fine-tuning, different models or multi-GPU clusters, you can check out their [GitHub repository](https://github.com/pytorch/torchtune).

If torchtune isn't an option either, you can always go with PEFT, which supports basically every model and system you throw at it. It may be less efficient depending on your use case, but it's still a great catch-all fallback.
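
The decision logic described above can be sketched as a small helper. This is only an illustration of the reasoning, not part of any of the frameworks: the function name `pick_framework` is hypothetical, it merely checks the operating system and which packages are installed, and it does not verify whether torchtune actually supports your chosen model.

```python
import importlib.util
import platform


def pick_framework(system=None, available=None):
    """Suggest a framework name following the priorities of this guide:
    unsloth on Linux, then torchtune, then PEFT as the catch-all."""
    if system is None:
        system = platform.system()
    if available is None:
        # Only detect which of the packages are installed, without importing
        available = {
            name for name in ("unsloth", "torchtune", "peft")
            if importlib.util.find_spec(name) is not None
        }
    if system == "Linux" and "unsloth" in available:
        return "unsloth"
    if "torchtune" in available:
        return "torchtune"
    return "PEFT"


print(pick_framework())
```

The returned strings match the values expected by `fine_tuning_framework` in the next cell, so the result could be assigned to it directly.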

In [None]:
# Currently supported: HF PEFT | torchtune | unsloth (only linux)
fine_tuning_framework = "PEFT"

# NOTE: match statements require Python 3.10 or later
match fine_tuning_framework:
    # Using Hugging Face's PEFT implementation ================================
    case "PEFT":
        from transformers   import AutoModelForCausalLM, AutoTokenizer
        from peft           import get_peft_model, LoraConfig

        base_model  = AutoModelForCausalLM.from_pretrained(base_model_id)
        tokenizer   = AutoTokenizer.from_pretrained(base_model_id)
        
        tokenizer.pad_token = tokenizer.eos_token

        lora_config = LoraConfig (
            # LoRA rank, higher value = better performance but more resources
            r               = 16,
            # Weight of the adapter, usually 2x the rank
            lora_alpha      = 32, 
            lora_dropout    = 0,  
            bias            = "none",  
            task_type       = "CAUSAL_LM",
            # Matrices that we want to adapt  
            target_modules=["q_proj", 
                            "v_proj", 
                            "k_proj", 
                            "o_proj",
                            "gate_proj", 
                            "down_proj", 
                            "up_proj"],
        )
        # This is the frozen base model + the adapter which we want to train
        adapted_model = get_peft_model(base_model, lora_config)
        
        print ("Fine-Tuning with PEFT!")
    
    # Using unsloth ===========================================================
    case "unsloth":
        import platform
        # Merely importing the framework won't work if you're not on Linux, so
        # we bail out early
        if platform.system() != "Linux":
            raise RuntimeError("Unsloth is only supported on Linux!")
        from unsloth    import FastLanguageModel
        import torch
        
        base_model, tokenizer   = FastLanguageModel.from_pretrained(
            model_name      = base_model_id,
            max_seq_length  = 1024,
            # Not all GPUs support bfloat16. Unsloth expects a torch dtype
            # here (or None to auto-detect)
            dtype           = torch.bfloat16,
            # Only for quantized models
            load_in_4bit    = False, 
        )
        # As you can see it's a very similar process to the PEFT one
        adapted_model = FastLanguageModel.get_peft_model(
            base_model,
            r               = 16,
            lora_alpha      = 32,
            lora_dropout    = 0,
            bias            = "none",
        )

        print ("Fine-Tuning with unsloth!")

    # Using torchtune =========================================================
    case "torchtune":
        import yaml
        # The format torchtune expects
        from omegaconf                      import OmegaConf
        # Customized recipe, be sure to check out the customized sections
        # which make it possible to integrate it into our system
        from torchtune_custom.fine_tuning   import LoRAFinetuneRecipeSingleDevice
        from transformers                   import AutoTokenizer
        
        # Because we tokenize the dataset ourselves we also instantiate the
        # tokenizer ourselves
        tokenizer           = AutoTokenizer.from_pretrained(base_model_id)
        tokenizer.pad_token = tokenizer.eos_token

        # The config lives in the same folder as the customized recipe
        with open("torchtune_custom/fine-tune.yaml", "r") as file:
            cfg = yaml.safe_load(file)

        # The config handles all the details of the training behaviour, so
        # definitely have a look at it if you're choosing torchtune.        
        cfg     = OmegaConf.create(cfg)
        recipe  = LoRAFinetuneRecipeSingleDevice(cfg=cfg)

        print ("Fine-Tuning with Torchtune")
        
    case _:
        print ("This framework is currently not supported")
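
To get a feeling for what the rank `r` and `lora_alpha` mean in practice: LoRA adds two small matrices per adapted weight, `A` with shape `(r, d_in)` and `B` with shape `(d_out, r)`, and scales their product by `lora_alpha / r`. The sketch below only does this arithmetic. The hidden size of 4096 is a made-up example value and not read from any model config.

```python
def lora_parameter_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one weight matrix:
    A has shape (r, d_in) and B has shape (d_out, r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Factor applied to the adapter output B @ A @ x."""
    return lora_alpha / r


hidden_size = 4096  # hypothetical example value
full_params = hidden_size * hidden_size
adapter_params = lora_parameter_count(hidden_size, hidden_size, r=16)

print(f"full matrix:    {full_params} parameters")
print(f"LoRA adapter:   {adapter_params} parameters")
print(f"trainable part: {adapter_params / full_params:.4%}")
print(f"scaling factor: {lora_scaling(lora_alpha=32, r=16)}")
```

Doubling `r` doubles the number of trainable parameters per adapted matrix, which is why a higher rank costs more memory and compute.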



Once you've created the adapters and set up the configs, running the training procedure is very easy for each option: merely a couple of lines each. 
Torchtune does not natively support custom datasets, so there are some adjustments you will have to make in the script to be able to use them. All of the respective changes have been marked, so it should be easy to tell where they go.
One thing of note if you are using torchtune is that it expects a very specific format for the data. It's very similar to the one we are using right now; it just expects the fields to have different names, so we have to rename them before passing the data in.

The fine-tuning process may take a while depending on your dataset size. If you haven't changed anything so far, we are using a very small slice of the data, so it shouldn't take long at all, but the results will also not be of great quality.
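
A rough way to judge how long a run will be is to estimate the number of optimizer steps from the dataset size and batch settings. The helper below is a hypothetical back-of-the-envelope sketch. The exact count your trainer reports can differ slightly, for example in how a partial final batch is handled.

```python
import math


def estimate_optimizer_steps(num_examples, per_device_batch_size=1,
                             gradient_accumulation_steps=1, num_epochs=1,
                             num_devices=1):
    """Rough estimate of the number of optimizer steps of a training run."""
    batches_per_epoch = math.ceil(
        num_examples / (per_device_batch_size * num_devices))
    steps_per_epoch = max(
        1, math.ceil(batches_per_epoch / gradient_accumulation_steps))
    return steps_per_epoch * num_epochs


# The 50 training examples used in this notebook, batch size 1, one epoch
steps = estimate_optimizer_steps(num_examples=50)
print(steps)
```

With so few steps, any logging or checkpoint interval larger than the total step count will not trigger during the run, so consider lowering those intervals when experimenting with a small slice.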

In [None]:
fine_tuned_dir = "result"

match fine_tuning_framework:
    # Unsloth and PEFT use the exact same training routine!
    case "PEFT" | "unsloth":
        from transformers   import (DataCollatorForLanguageModeling, 
                                    TrainingArguments)
        from trl            import SFTTrainer
        
        # This collects the data for the training
        data_collator = DataCollatorForLanguageModeling(
            tokenizer, 
            # We don't want masked language modelling; instead we just want to
            # predict the next token
            mlm = False
        )
        # These are just the usual training arguments you would expect for a 
        # trainer
        training_args = TrainingArguments(
            output_dir                  = fine_tuned_dir,
            num_train_epochs            = 1,
            per_device_train_batch_size = 1,
            per_device_eval_batch_size  = 1,
            eval_strategy               = "epoch",
            logging_dir                 = "./logs",
            logging_steps               = 500,
            save_steps                  = 500,
            save_total_limit            = 2,
            bf16                        = True,  
        )
        # SFTTrainer is specifically for fine-tuning. (Supervised Fine-Tuning
        # Trainer)
        trainer = SFTTrainer(
            model           = adapted_model, 
            train_dataset   = training_data,
            eval_dataset    = validation_data,
            data_collator   = data_collator,
            args            = training_args,
        )

        # Depending on your machine this may take quite a while
        trainer.train()
        
        # This will only save the adapter weights
        adapted_model.save_pretrained(fine_tuned_dir)
        print (f"Saved adapter at {str(fine_tuned_dir)}")

    # Using torchtune =========================================================
    case "torchtune":
        from tqdm import tqdm
        # Torchtune expects a specific format for the data, so we need to
        # rename the fields of our tokenized training data first
        torchtune_data = [
            {
                "tokens":   entry["input_ids"], 
                "mask":     entry["attention_mask"], 
                "labels":   entry["input_ids"]
            } for entry in tqdm(training_data)
        ]

        # All of the actual configuration is done in the fine-tune.yaml
        recipe.setup (
            cfg        = cfg, 
            # The recipe has some custom adjustments to allow us to pass in a 
            # tokenizer and custom dataset. This would not be possible natively
            # so you'll have to take a look at that beforehand if you're
            # planning to use custom data. 
            dataset    = torchtune_data,
            tokenizer  = tokenizer
        )

        recipe.train()
        recipe.cleanup()

    case _:
        print ("This framework is currently not supported")


Now you can find the adapter weights in the `result` directory. As you can see a whole bunch of files have been generated, most of which are just to provide additional information and configuration for the adapter. The actual weights are stored in the `.safetensors` file.
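
If you want to inspect the output directory from code, a small sketch like the following lists all files with their sizes and picks out the weight files. The helper names are hypothetical, and the demo runs on a throwaway directory with stand-in files rather than on a real training result.

```python
import tempfile
from pathlib import Path


def list_adapter_files(adapter_dir):
    """Return (relative path, size in bytes) for every file below adapter_dir."""
    root = Path(adapter_dir)
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


def find_weight_files(adapter_dir):
    """Return the names of all .safetensors files below adapter_dir."""
    return sorted(p.name for p in Path(adapter_dir).rglob("*.safetensors"))


# Demo on a temporary directory with stand-in files (not real weights)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adapter_config.json").write_text("{}", encoding="utf-8")
    (Path(tmp) / "adapter_model.safetensors").write_bytes(b"\x00" * 16)
    demo_files = list_adapter_files(tmp)
    demo_weights = find_weight_files(tmp)

for name, size in demo_files:
    print(f"{name}: {size} bytes")
```

On a real run you would call `list_adapter_files(fine_tuned_dir)` instead of using the temporary directory.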

---

## **Loading your Fine-Tuned Model Adapter**

Now that you've created your LoRA adapter, running inference with it is extremely easy and very similar to how you've done it before. All you really need to do is load the adapter you've trained and combine it with the base model, which you load exactly the same way as before. After that it can be used exactly like the base model, and you can observe its new behaviour right away. 
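
What "combining" the adapter with the base model means can be illustrated with a tiny worked example. For a frozen weight `W` and adapter matrices `A` and `B`, the adapted layer computes `W x + (lora_alpha / r) * B (A x)`, which is the same as using a single merged weight `W + (lora_alpha / r) * B A`. The numbers below are made up, and the plain-Python matrix helpers are only for illustration; real frameworks do this with tensors.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(l * r for l, r in zip(row, col)) for col in columns]
            for row in left]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight, shape (2, 2)
A = [[1.0, 0.0]]               # adapter matrix A, shape (r, d_in)  = (1, 2)
B = [[2.0], [1.0]]             # adapter matrix B, shape (d_out, r) = (2, 1)
lora_alpha, r = 2, 1
scale = lora_alpha / r
x = [1.0, 1.0]

# Base model and adapter evaluated separately, then summed
adapter_out = matvec(B, matvec(A, x))
separate = [base + scale * delta
            for base, delta in zip(matvec(W, x), adapter_out)]

# Adapter merged into the base weight first
delta_W = matmul(B, A)
merged_W = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta_W)]
merged = matvec(merged_W, x)

print(separate, merged)
```

Both paths give the same output, which is why the adapter can be stored separately from the base model and attached at load time.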

In [None]:
# Loading the fine tuned model is almost the same for each method!
from peft           import PeftModel
from transformers   import AutoModelForCausalLM, AutoTokenizer

# Because we only save the adapted weights we still need to load the base model
# first
base_model          = AutoModelForCausalLM.from_pretrained(base_model_id)
tokenizer           = AutoTokenizer.from_pretrained(base_model_id)

tokenizer.pad_token = tokenizer.eos_token

# The base model gets overridden in this case, so don't use it afterwards!
fine_tuned_model    = PeftModel.from_pretrained(base_model, fine_tuned_dir)
base_model          = None

if fine_tuning_framework == "unsloth":
    # Imported here again in case the notebook was restarted after training
    from unsloth import FastLanguageModel
    fine_tuned_model = FastLanguageModel.for_inference(fine_tuned_model)

# You can now use the fine_tuned_model as if you were using the normal base 
# model. Note that the tokenizer does not change!
prompt = [
    {"role": "system",  "content": "Du bist ein KI Assistent der FH Wedel."},
    {"role": "user",    "content": "Was ist die Modulnummer von Algorithmics."}
]
formatted_prompt    = tokenizer.apply_chat_template(prompt, tokenize=False)

print (run_inference(fine_tuned_model, tokenizer, formatted_prompt))

### **You're Done!** 

Don't forget to restart the notebook to free up the resources used by the models.