# Fine-tuning Mistral-7b-Instruct On Reddit Comment and Reply Data

Most chatbots available today have been aligned by their creators to behave in a certain way. Chatbots such as ChatGPT and Gemini are aligned so that they only give answers deemed appropriate, and they have a specific tone (although prompt engineering can alter it somewhat). I want an AI chatbot that is fun and engaging: one that shares unfiltered thoughts and gives me its opinion when asked about something, instead of giving the usual response of, __"I'm an AI assistant. I don't have opinions, but I can guide you ..."__

To do this, I am going to fine-tune an open-source large language model. The model I'll use is the Mistral-7b-Instruct model developed by Mistral AI. It's freely available on Hugging Face ([model](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GPTQ)), and the paper outlining the details of the model can be found [here](https://arxiv.org/abs/2310.06825). For the data, I'll be using Reddit comments and the replies to those comments. They are scraped from the wallstreetbets subreddit, and the dataset is available on Hugging Face as well ([dataset](https://huggingface.co/datasets/Sentdex/wsb_reddit_v002)).

The dataset has 118k rows, but I'm not using all of them, for two reasons. First, using that much data requires a lot of computational power, which is a constraint for me. Second, the method I'm using to fine-tune the model works with very few examples, which is one of its strengths. The final version of the model is fine-tuned on only about 250 examples, and in earlier iterations I used even fewer and was still able to see results. So this is a practical way of achieving good results with little training data and few computational resources.

The dataset was originally in a JSON format, and I changed it to a CSV format that is compatible with how my pipeline is set up. The cleaned data is in the file `reddit-comments.csv`.

To start with, let's first load the base model that we'll be using and ask it some questions to see what kind of response we can get from it. The model is available on Hugging Face and we can use the `transformers` library to load it. Other necessary imports are also included here.

In [None]:
!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes
!pip install sacrebleu

In [None]:
# Import necessary libraries

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import transformers

from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

To get an idea of what kind of responses the base model gives, let's load it and try it out with some prompts.

### Load model

In [None]:
model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"

# Here, CausalLM is used because the model is for generating text (like GPT). It'd be MaskedLM for models like BERT.
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto", # automatically figures out how to best use CPU + GPU for loading model
                                             trust_remote_code=False,
                                             revision="main")



After loading the model, we also need a tokenizer that converts our prompt to tensors. The exact details of the tokenizer are not mentioned in the paper, but we can load the one that matches our pretrained model directly from Hugging Face using the `AutoTokenizer` class.

### Load tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

### Using Base Model

Let's first try the base model. I expect that when we ask it questions that require having an opinion, it will back off so as not to offend anyone. Most current LLMs, including ChatGPT, are aligned this way, and I just want something fun and lively to converse with. So let's start off with the base model and see what we get.

In [None]:
def generate_response(prompt, repetition_penalty=1.5, do_sample=True, max_new_tokens=140, specific_commands=None):
    """
    Generate a response based on the given prompt using the pre-trained model.

    Args:
        prompt (str): The input prompt for generating the response.
        repetition_penalty (float, optional): The repetition penalty to apply during generation. Defaults to 1.5.
        do_sample (bool, optional): Whether to use sampling during generation. Defaults to True.
        max_new_tokens (int, optional): The maximum number of new tokens to generate. Defaults to 140.
        specific_commands (str, optional): Specific commands to include in the prompt. Defaults to None.

    Returns:
        str: The generated response based on the given prompt.
    """
    model.eval() # Put the model in evaluation mode (dropout modules are deactivated)

    # Since the base model we're using is the Mistral-7b-Instruct, and it's an instruction-tuned model, it expects the prompt to be in a specific format.
    # It expects the [INST] and [/INST] start and end tokens. They are special tokens used by the model. So that's why we're adjusting the prompts that way.
    if specific_commands:
        prompt = f"[INST] \n{specific_commands} \n{prompt} \n[/INST]"
    else:
        prompt = f"[INST] \n{prompt} \n[/INST]"

    inputs = tokenizer(prompt, return_tensors='pt')
    outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=max_new_tokens, repetition_penalty=repetition_penalty, do_sample=do_sample)
    return tokenizer.batch_decode(outputs)[0]

In [None]:
comment = "What do you think about the US politics?"
response = generate_response(comment)
print(response)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] 
What do you think about the US politics? 
[/INST] I don't have personal experiences or opinions. However, I can provide information and analysis on U.S. politics based on data and facts. The political climate in the United States is complex and multifaceted. There are two major national parties: the Democratic Party and the Republican Party. Political discourse often centers around issues like economy, healthcare, immigration, foreign policy, education, environment, social justice, gun rights, among others. Currently, there is significant polarization between thesetwopartiesandtheirsupportersbasedonideologyaswellasthetraditionalred- vs.-blue divide. Recent yearshaveseenincreasingintensityinsocialissueslikeimm


As we can see from the output, the model says that it doesn't have any opinions and gives a very generic response overall. That's fine, but I want something fun for this chatbot. To achieve that, as mentioned earlier, we're going to fine-tune it using our Reddit data.
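One practical detail: the string returned by `generate_response` still contains the prompt and the special tokens. Here's a minimal sketch of how the reply alone could be pulled out. The helper name `extract_reply` is my own and isn't part of the notebook's pipeline, and the sample string is a shortened version of the output above.

```python
def extract_reply(decoded: str) -> str:
    """Return only the text that follows the closing [/INST] tag."""
    # Keep everything after the last [/INST]; if the tag is missing, keep the whole string.
    reply = decoded.rsplit("[/INST]", 1)[-1]
    # Drop the begin/end-of-sequence markers that decoding leaves in place.
    for token in ("<s>", "</s>"):
        reply = reply.replace(token, "")
    return reply.strip()


# Shortened version of the decoded output shown above
decoded = (
    "<s> [INST] \nWhat do you think about the US politics? \n[/INST] "
    "I don't have personal experiences or opinions.</s>"
)
print(extract_reply(decoded))
```

Passing `skip_special_tokens=True` to `tokenizer.batch_decode` would also remove `<s>` and `</s>`, but the prompt text would still need to be split off.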

There are a couple of different methods to fine-tune an LLM. One common method is full fine-tuning. The process results in a new version of the model with updated weights. One caveat with this process is that full fine-tuning requires enough memory and computing power to process all the gradients and other components being updated during training.

In order to work against this constraint, we have another method called parameter-efficient fine-tuning. In this method, we only update a small set of parameters, which saves us a lot of computational power and memory. One method of doing this is called **LoRA (Low-Rank Adaptation)**.

### LoRA

LoRA (Low-Rank Adaptation) is a method used for parameter-efficient fine-tuning of large language models. It is designed to update only a small set of parameters, reducing the computational power and memory requirements compared to full fine-tuning.

The idea behind LoRA is to freeze the pre-trained weights and train only a small number of new parameters attached to selected layers of the model. These selected layers are referred to as the "target modules." Because only the added parameters are trained, LoRA achieves parameter-efficient fine-tuning.

The key concept in LoRA is a low-rank approximation of the *update* to the weight matrices in the target modules. Instead of updating a full weight matrix, LoRA learns the change to it as the product of two much smaller low-rank matrices. This reduces the number of parameters that need to be trained, resulting in significant memory and computational savings.

During the fine-tuning process, LoRA updates the low-rank factors of the target modules using gradient descent. The gradients are computed using backpropagation through the model, similar to traditional fine-tuning methods. However, since LoRA only updates a small set of parameters, the computational cost is significantly reduced.

Let's discuss this using an example. In class, when discussing transformers, we talked about the three vectors that are generated in each head: $Q$, $K$, and $V$. To generate these vectors, we have a matrix associated with each of them: $W^Q$, $W^K$, and $W^V$. Since LLMs have multiple attention heads, there are many of these matrices. To illustrate what LoRA does, let's look at just one of them, $W^Q$.

For our discussion, let's assume this weight matrix is a $7$ x $7$ matrix. We have 49 elements. This is an example of what the matrix could look like:

$$
W^Q = \begin{pmatrix}
4 & 7 & 2 & 9 & 1 & 5 & 3 \\
6 & 3 & 8 & 2 & 7 & 4 & 1 \\
5 & 9 & 1 & 3 & 8 & 6 & 2 \\
7 & 2 & 4 & 6 & 9 & 1 & 5 \\
3 & 8 & 6 & 1 & 4 & 2 & 7 \\
2 & 5 & 9 & 7 & 3 & 8 & 6 \\
1 & 6 & 3 & 4 & 5 & 7 & 9 \\
\end{pmatrix}
$$

If we want to use full fine-tuning, we'd have to update all of these elements, and as the matrix gets bigger, we need to update more values as well. One idea to reduce the number of elements to update is this: instead of one big matrix, why don't we use two smaller matrices that, when multiplied, give us a matrix with the same dimensions as the original? Let's call these two matrices $A$ and $B$. In our scenario, their dimensions will be $7$ x $r$ and $r$ x $7$, respectively. Their product has rank at most $r$, which is lower than the rank the original matrix can have. So they could, for example, be a $7$ x $2$ and a $2$ x $7$ matrix. This is what they could look like:

$$
A = \begin{pmatrix}
a_{11} & a_{12} \\
a_{21} & a_{22} \\
a_{31} & a_{32} \\
a_{41} & a_{42} \\
a_{51} & a_{52} \\
a_{61} & a_{62} \\
a_{71} & a_{72} \\
\end{pmatrix}
$$

$$B = \begin{pmatrix}
b_{11} & b_{12} & b_{13} & b_{14} & b_{15} & b_{16} & b_{17} \\
b_{21} & b_{22} & b_{23} & b_{24} & b_{25} & b_{26} & b_{27} \\
\end{pmatrix}
$$

When we multiply them, we get a $7$ x $7$ matrix. This matrix is called the adapter. It is added element-wise to the original weight matrix $W^Q$, and the result is the fine-tuned version of the weight matrix that we can use in our model. This works as a fine-tuning method because the values in the low-rank matrices $A$ and $B$ are learned during fine-tuning, so they capture the information in the new data. When we add the adapter to the original matrix, we get both the pre-trained information and the new information extracted from the fine-tuning data.
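To make this concrete, here's a small NumPy sketch of the $7$ x $7$ example with $r = 2$. The values are random placeholders (in real training, $A$ and $B$ would be learned by gradient descent while $W^Q$ stays frozen), and the variable names are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 7, 2  # matrix size and adapter rank from the example above

# Stand-in for the frozen pre-trained weight matrix W^Q (random integers 1..9)
W_q = rng.integers(1, 10, size=(d, d)).astype(float)

# The two low-rank matrices; these are the only values that would be trained
A = rng.normal(size=(d, r))  # d x r
B = rng.normal(size=(r, d))  # r x d

adapter = A @ B                  # d x d matrix with rank at most r
W_q_finetuned = W_q + adapter    # element-wise addition to the original weights

print("adapter shape:", adapter.shape)
print("adapter rank:", np.linalg.matrix_rank(adapter))
print("trainable values:", A.size + B.size, "vs. full matrix:", W_q.size)
```

Library implementations add a few details on top of this. For example, `peft` scales the adapter by `lora_alpha / r` before adding it, which is why the config later in this notebook has both an `r` and a `lora_alpha` value.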

So this is great, but does it really save us that much computational power and memory? Let's run some numbers to figure that out. In the previous example, we had a $7$ x $7$ matrix, which had $49$ elements. When we used the two low-rank matrices, we had $7$ x $2 = 14$ elements each, and since we had two matrices, we had $28$ elements we needed to update. But what happens as we scale up?

Let's assume our original weight matrix is $d$ x $d$ elements, and our low-rank matrices are $d$ x $r$ and $r$ x $d$.

1. The total number of elements in the original matrix is $d^2$, and for the two low-rank matrices combined it's $2dr$. When $r$ is much smaller than $d$, which is usually the case, $2dr$ is significantly less than $d^2$, meaning fewer parameters need to be learned and updated during training.
2. The matrices $A$ and $B$ also need to be multiplied to get the adapter that is added to the original weight matrix. But even this is fast because of the matrices' sizes. When multiplying $A$ and $B$, we take the dot product of a row of $A$ and a column of $B$ to get a single entry of $AB$. Since the number of columns of $A$ equals the number of rows of $B$, which is $r$, each entry costs $O(r)$. Since $AB$ has $d^2$ entries, the cost of multiplying $A$ and $B$ is $O(d^2r)$. This is much cheaper than the $O(d^3)$ cost of multiplying two $d$ x $d$ matrices when $r$ is much smaller than $d$.
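The parameter counts from the first point are easy to check with a few lines of Python. This is just the $d^2$ versus $2dr$ arithmetic from above. The second test case ($d = 4096$, $r = 8$) is my assumption of what a realistic attention projection size looks like, not a number taken from the text.

```python
def full_params(d: int) -> int:
    """Number of values in a d x d weight matrix."""
    return d * d


def lora_params(d: int, r: int) -> int:
    """Number of values in a d x r matrix plus an r x d matrix."""
    return 2 * d * r


# (7, 2) is the toy example above; (4096, 8) is an assumed realistic size
for d, r in [(7, 2), (4096, 8)]:
    full, lora = full_params(d), lora_params(d, r)
    print(f"d={d}, r={r}: full={full:,}  lora={lora:,}  ratio={lora / full:.2%}")
```

For the toy matrix the saving is modest (28 versus 49 values), but the ratio shrinks quickly as $d$ grows while $r$ stays small.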

This method is great and saves us a lot of computational work and allows us to fine-tune our model efficiently. But lucky for us, researchers have also come up with a way to make this even more efficient by adding quantization to the mix to create **QLoRA (Quantized Low-Rank Adaptation)**.

Quantization is a process that reduces the precision of numerical values to save memory and computational resources. This could involve rounding, clustering values, and mapping to a set of representable values, while preserving as much of the original information as possible.
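As a rough illustration of the idea, here's a sketch of one of the simplest schemes: absmax quantization to 8-bit integers. This is not the 4-bit scheme that QLoRA uses, and it's not GPTQ either. It's only meant to show what "mapping to a set of representable values" looks like. The function names and the sample weights are made up for the example.

```python
import numpy as np


def quantize_absmax_int8(x: np.ndarray):
    """Map float values to int8 by scaling the largest magnitude to 127."""
    scale = np.abs(x).max() / 127.0          # assumes x is not all zeros
    q = np.round(x / scale).astype(np.int8)  # rounding is where precision is lost
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the int8 codes."""
    return q.astype(np.float32) * scale


weights = np.array([0.12, -0.53, 0.98, -1.27, 0.04], dtype=np.float32)
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

print("int8 codes:", q)
print("max absolute error:", np.abs(weights - restored).max())
```

Each value now takes 1 byte instead of 4, and the rounding error per value is at most half of `scale`.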

In this assignment, I'll be using QLoRA to fine-tune the base model with the `peft` library from Hugging Face. First, let's load our data. I uploaded the cleaned dataset to my Hugging Face account, so I'll load it from there.

In [None]:
from datasets import load_dataset

def get_dataset(username, dataset_name):
    # load dataset
    data = load_dataset(f"{username}/{dataset_name}")

    return data

username = 'hussenmi'
dataset_name = 'reddit_comments'
data = get_dataset(username, dataset_name)

data

DatasetDict({
    train: Dataset({
        features: ['example'],
        num_rows: 240
    })
    test: Dataset({
        features: ['example'],
        num_rows: 9
    })
})

In [None]:
# Let's look at some of the data
for i in range(5):
    print(data["train"][i]["example"])
    print()

<s>[INST] 
Yup. Insane payments. Not sustainable. But the banks are all in 
[/INST]
But what if they only give loans to unemployed people? That way, the banks wouldn't have any risk, since 6 x 0 is 0.</s>

<s>[INST] 
Did the run out of water there yet? 
[/INST]
IIRC they have in Cape Town</s>

<s>[INST] 
Thats the same thing?    \n  Edit: learned the difference 
[/INST]
^ This guy doesn't get the joke.</s>

<s>[INST] 
Originally it wasn't. I marked it using my secret mod powers. 
[/INST]
Nice</s>

<s>[INST] 
Pembrolizumab is the single defining factor of Merck right now. The entirety of stock movement is based around this single immunotherapy.    \n  If pembrolizumab can consistently continue to show that it is a better PD-1 checkpoint inhibitor over competitors, which so far, it generally has managed to do, Merck will continue to do well. 
[/INST]
Yea Keytruda is really important for them, but it's kind of disappointing that they don't have multiple irons in the fire currently.</s>
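Every row above follows the same template: the comment sits between `[INST]` and `[/INST]`, and the reply follows, closed by `</s>`. Here's a minimal sketch of how a comment/reply pair could be turned into that format. The function name `format_example` is hypothetical, and the exact whitespace is inferred from the printed rows rather than taken from the original preprocessing code.

```python
def format_example(comment: str, reply: str) -> str:
    """Wrap a comment/reply pair in the instruction template seen in the dataset."""
    return f"<s>[INST] \n{comment} \n[/INST]\n{reply}</s>"


# Rebuild one of the rows printed above
example = format_example(
    "Did the run out of water there yet?",
    "IIRC they have in Cape Town",
)
print(example)
```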



After we get the data, we need to tokenize it and also create a data collator that adds padding so that the sequences in a batch have matching shapes (similar to padding in convolutional neural nets).
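To show what the padding step does, here's a small NumPy sketch that pads a batch of token ID lists to the same length and builds the matching attention mask. The token IDs are made up, the `pad_batch` helper is hypothetical, and it pads on the right purely for illustration. The real collator from `transformers` handles all of this for us.

```python
import numpy as np


def pad_batch(sequences, pad_id):
    """Pad variable-length token ID lists into one rectangular array."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq       # real tokens go on the left
        attention_mask[row, :len(seq)] = 1    # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up token IDs; pad_id=2 mirrors using the EOS token as the pad token
batch = [[1, 733, 16289], [1, 5592], [1, 4791, 28723, 315, 2]]
ids, mask = pad_batch(batch, pad_id=2)
print(ids)
print(mask)
```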

In [None]:
def get_tokenized_data(examples):
    """Tokenizes the input text and returns the tokenized inputs.

    Args:
        examples (dict): A dictionary containing the input examples.

    Returns:
        dict: A dictionary containing the tokenized inputs.
    """
    # extract text
    text = examples["example"]

    # tokenize and truncate text
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs

def prepare_data(data):
    """Prepares the data for training.

    Args:
        data (DatasetDict): The input dataset to be prepared. Tokenization uses the global `tokenizer`.

    Returns:
        Tuple[Dataset, DataCollator]: A tuple containing the tokenized data and the data collator.
    """
    # tokenize training and validation datasets
    tokenized_data = data.map(get_tokenized_data, batched=True)

    # setting pad token
    tokenizer.pad_token = tokenizer.eos_token
    # data collator
    data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)

    return tokenized_data, data_collator


tokenized_data, data_collator = prepare_data(data)


Once we get the data and prepare it accordingly, we need to prepare our model for training. This means setting it to `train` mode, reducing the number of trainable parameters using the QLoRA method we just discussed, and other things like enabling gradient checkpointing in order to save memory at the cost of more computation.

In [None]:
def prepare_model_for_training(model):
    """Prepare the model for training.

    This function prepares the given model for training by performing the following steps:
    1. Puts the model in training mode (activates dropout modules).
    2. Enables gradient checkpointing.
    3. Enables quantized training.
    4. Configures LoRA with the specified parameters.
    5. Creates a trainable version of the model using the LoRA configuration.
    6. Prints the percentage of trainable parameters as compared to the original parameters.

    Args:
        model (nn.Module): The model to be prepared for training.

    Returns:
        nn.Module: The prepared model.

    """
    model.train() # Put the model in training mode (dropout modules are activated)

    # enable gradient check pointing
    model.gradient_checkpointing_enable()

    # enable quantized training
    model = prepare_model_for_kbit_training(model)

    # LoRA config
    config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules=["q_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )

    # LoRA trainable version of model
    model = get_peft_model(model, config)

    # trainable parameter count
    model.print_trainable_parameters()

    return model

model = prepare_model_for_training(model)

trainable params: 2,097,152 || all params: 264,507,392 || trainable%: 0.7928519441906561
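The trainable parameter count can be sanity-checked by hand using the $2dr$ formula from earlier. The sketch below assumes that `q_proj` is a $4096$ x $4096$ matrix and that the model has 32 decoder layers. Those two numbers are my assumptions about the Mistral-7B architecture and aren't printed anywhere in this notebook.

```python
hidden_size = 4096  # assumed dimension of q_proj (d x d)
num_layers = 32     # assumed number of decoder layers, each with one q_proj
r = 8               # LoRA rank from the config above
lora_alpha = 32     # LoRA alpha from the config above

per_layer = 2 * hidden_size * r   # values in one pair of low-rank matrices
total = per_layer * num_layers    # one pair per targeted q_proj
scaling = lora_alpha / r          # factor applied to the adapter before it is added

print(f"trainable params: {total:,}")
print(f"adapter scaling: {scaling}")
```

Under those assumptions the total matches the `2,097,152` reported by `print_trainable_parameters()`. The "all params" figure is far below 7 billion, most likely because the quantized weights are stored in a packed format and counted differently.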


Now that we have both our model and data ready, we can fine-tune it.

In [None]:
def finetune_model(model, tokenized_data, data_collator, training_args):
    """Fine-tunes the model on the given tokenized data.

    Args:
        model (PreTrainedModel): The model to be fine-tuned.
        tokenized_data (Dataset): The tokenized training and validation data.
        data_collator (DataCollator): The data collator to be used for training.
        training_args (TrainingArguments): The training arguments to be used for training.

    Returns:
        transformers.Trainer: The trainer used for fine-tuning the model.
    """
    # configure trainer
    trainer = transformers.Trainer(
        model=model,
        train_dataset=tokenized_data["train"],
        eval_dataset=tokenized_data["test"],
        args=training_args,
        data_collator=data_collator,
    )

    # train model
    model.config.use_cache = False  # the KV cache isn't used with gradient checkpointing; this silences the warnings
    trainer.train()

    # re-enable the cache for inference
    model.config.use_cache = True

    return trainer


We can now fine-tune the model by calling the function with the parameters we have gathered:

In [None]:
# hyperparameters
lr = 2e-4
batch_size = 4
num_epochs = 10

# define training arguments
training_args = transformers.TrainingArguments(
    output_dir= "fungpt-ft",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    fp16=False,
    optim="paged_adamw_8bit",

)


trainer = finetune_model(model, tokenized_data, data_collator, training_args)



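| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 5.2519 | 3.883362 |
| 2 | 3.6084 | 3.463136 |
| 3 | 3.1807 | 3.189507 |
| 4 | 2.921 | 3.116976 |
| 5 | 2.8287 | 3.093508 |
| 6 | 2.7328 | 3.080739 |
| 7 | 2.6611 | 3.084115 |
| 8 | 2.6403 | 3.096904 |
| 9 | 2.5925 | 3.101039 |
| 10 | 2.55 | 3.104319 |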
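A quick way to read these numbers: the training loss keeps dropping, but the validation loss bottoms out and then creeps back up, which hints at the model starting to overfit. The sketch below finds the best epoch from the logged values and converts that loss to perplexity (the exponential of the cross-entropy loss). The dictionary is just the validation column copied by hand.

```python
import math

# Validation loss per epoch, copied from the training log above
val_loss = {
    1: 3.883362, 2: 3.463136, 3: 3.189507, 4: 3.116976, 5: 3.093508,
    6: 3.080739, 7: 3.084115, 8: 3.096904, 9: 3.101039, 10: 3.104319,
}

best_epoch = min(val_loss, key=val_loss.get)   # epoch with the lowest validation loss
perplexity = math.exp(val_loss[best_epoch])    # perplexity = exp(cross-entropy loss)

print(f"best epoch: {best_epoch}")
print(f"validation perplexity at that epoch: {perplexity:.2f}")
```

Since the training arguments set `load_best_model_at_end=True` and no other metric is specified, I'd expect the checkpoint with the lowest validation loss to be the one that's kept at the end of training.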




Great! We have now trained our model. Next, we'll upload the adapters to Hugging Face so that we can easily load them later and add them to the base model to create our fine-tuned model. In another notebook, I'll show how to load and use the fine-tuned model, and I'll calculate some metrics there too.

In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
model_id = "hussenmi/fungpt-ft"

# push the LoRA adapter weights to the Hugging Face Hub
model.push_to_hub(model_id)