# Merge QLora and Quantize Model

In this tutorial, we will take the QLora model checkpoint you trained in the last notebook and perform the post-processing needed before it can actually be used. This usually consists of these steps:

- Merge the QLora adapter weights with the base model
- Quantize into these formats:
  - GPTQ (for GPU)
  - GGML (mainly for CPU)
- Push to Huggingface

You will need both a wandb API key and a Huggingface token.

## Download artifacts from wandb

First, let's log in to wandb as usual.

In [None]:
!pip install wandb huggingface_hub

In [None]:
import wandb

wandb.login()

Now find the artifact ID in your wandb dashboard and update the field accordingly:

In [None]:
# @title Download artifact
wandb_artifact_id = "" # @param {type:"string"}

run = wandb.init()
artifact = run.use_artifact(wandb_artifact_id, type='model')
artifact_dir = artifact.download()
print(artifact_dir)
!ls {artifact_dir}

We've downloaded the artifact/checkpoint to a local directory as shown above.
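
Before merging, it can help to confirm that the download really looks like a PEFT checkpoint. The sketch below is only a sanity check under one assumption: that the directory holds `adapter_config.json` plus the adapter weights as either `adapter_model.safetensors` or `adapter_model.bin`. The helper name `inspect_adapter_dir` is made up for this example.

```python
import json
from pathlib import Path

# Assumption: a PEFT checkpoint directory contains `adapter_config.json` and the
# adapter weights under one of these file names.
WEIGHT_FILES = ("adapter_model.safetensors", "adapter_model.bin")


def inspect_adapter_dir(path):
    """Return (base_model_name, weight_file), or raise if the layout looks wrong."""
    root = Path(path)
    config_path = root / "adapter_config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"missing {config_path}")
    weights = [name for name in WEIGHT_FILES if (root / name).is_file()]
    if not weights:
        raise FileNotFoundError(f"no adapter weights found in {root}")
    config = json.loads(config_path.read_text())
    # The config records which base model the adapter was trained on.
    return config.get("base_model_name_or_path"), weights[0]
```

In the notebook you would call `inspect_adapter_dir(artifact_dir)` and check that the reported base model is the one you expect.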

## Merge QLora with base model

Now, let's begin our work. First, log in to Huggingface. (You may skip this if you don't want to publish, but then you'd need to download the results manually or upload them to your own private storage, such as S3.)

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Install the libraries:

In [None]:
!pip install torch peft transformers

----

To perform the merge, we use the vanilla `transformers` library from Huggingface together with their `peft` library. (`peft` stands for "Parameter-Efficient Fine-Tuning"; it features various methods that share the general theme of requiring fewer trainable parameters than the full model.) The reason is that when we trained with the `axolotl` tool, it was actually calling `peft` under the hood, so the resulting weights are in `peft`'s format.

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer, LlamaForCausalLM

# Load the PEFT config in our checkpoint
config = PeftConfig.from_pretrained(artifact_dir)

# Load the base model
base_model = LlamaForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype='auto',
    device_map='cpu',
    use_safetensors=True
    # Hopefully you saved the checkpoint as safetensors already,
    # as conversion takes additional RAM
    # offload_folder="offload", offload_state_dict = True
    # (Warning: offloading to hard drive caused some issues when I tested it)
)
#tokenizer = LlamaTokenizer.from_pretrained(config.base_model_name_or_path)

In [None]:
config

In [None]:
# Load a combined PEFT model using the base model + our checkpoint

#model = PeftModel.from_pretrained(base_model, artifact_dir, offload_folder="offload", offload_state_dict = True)
# (Similar warning: I couldn't get this to work. It would have been nice, as it would lessen the main memory requirement.)
model = PeftModel.from_pretrained(base_model, artifact_dir)

**And here's where the actual merging occurs, using magic:**

In [None]:
merged_model = model.merge_and_unload()
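
The "magic" is less mysterious than it sounds. For a LoRA-style adapter, merging conceptually folds the low-rank update into each adapted weight matrix: `W_merged = W + (alpha / r) * B @ A`. The toy sketch below illustrates that formula in plain Python on a tiny matrix. It is not `peft`'s actual implementation, and the names `merge_lora` and `matmul` are made up for this example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) as a new matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # same shape as `weight`
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x3 weight matrix with a rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # (out x in)   = 2x3
A = [[0.5, -0.5, 1.0]]                   # (rank x in)  = 1x3
B = [[2.0], [4.0]]                       # (out x rank) = 2x1

merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

After merging, a single matrix multiply with `merged` gives the same result as running the base weight and the adapter side by side, which is why the merged model no longer needs `peft` at inference time.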

In [None]:
merged_model

Let's save the results to disk. The interface they provide does the local save and the publish to Huggingface Hub in one step, though it is also possible to do them separately if you wish.

In [None]:
# @title Save model and push
fp16_model_save_dir = "merged_model" # @param {type:"string"}
push_to_hub = True # @param {type:"boolean"}
fp16_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-merged" # @param {type:"string"}

merged_model.save_pretrained(fp16_model_save_dir,
                             safe_serialization=True,
                             push_to_hub=push_to_hub,
                             repo_id=fp16_model_repo_id)

#merged_model.push_to_hub(fp16_model_repo_id)

**Now you need to restart the kernel, as there seems to be no clean way to reclaim the memory.**

After that, run the cell below to restore the model repo id, which we'll need later.

If you want to push the models and you have reset the VM instead of just stopping it, you should also log in to Huggingface Hub again. (The difference is whether the hard disk, which stores your token, is wiped.)


In [None]:
fp16_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-merged" # @param {type:"string"}
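
Re-entering values by hand after each restart is easy to get wrong. As an optional alternative, here is a minimal sketch that keeps notebook parameters in a small JSON file on disk. The file name `notebook_state.json` and the helper names are made up for this example. Like the stored token, the file survives a kernel restart but not a VM reset that wipes the disk.

```python
import json
from pathlib import Path

STATE_FILE = Path("notebook_state.json")  # hypothetical file name


def load_state(path=STATE_FILE):
    """Read previously saved values, or return an empty dict if none exist."""
    path = Path(path)
    return json.loads(path.read_text()) if path.is_file() else {}


def save_state(path=STATE_FILE, **values):
    """Merge the given values into the saved state and write it back to disk."""
    state = load_state(path)
    state.update(values)
    Path(path).write_text(json.dumps(state, indent=2))
    return state
```

Before restarting you would call `save_state(fp16_model_repo_id=fp16_model_repo_id)`, and afterwards `fp16_model_repo_id = load_state()["fp16_model_repo_id"]`.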

## Optional sections

The procedures in this section are not strictly necessary, but they are nice to have.

### Push also the QLora delta weights

We will use the API client provided by Huggingface to upload the whole folder directly.

In [None]:
from huggingface_hub import HfApi

qlora_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-peft" # @param{type:"string"}

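api = HfApi()
# upload_folder expects the target repo to exist, so create it first if needed
api.create_repo(qlora_model_repo_id, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path=artifact_dir,
    repo_id=qlora_model_repo_id,
    repo_type="model"
)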

### Test the merged model

Let's try running inference on the merged model with the original `transformers` library. It will most likely run in fp16 or even fp32, which needs about 2 or 4 bytes per parameter, i.e. roughly 6 GB or 12 GB for the weights of a 3B-parameter model. You can try the `load_in_4bit` option with the `bitsandbytes` library, though that requires a GPU.
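
The memory figures come from simple arithmetic, sketched below. This is a back-of-the-envelope estimate that counts the weights only and ignores activations, the KV cache, and framework overhead. The function name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 3e9  # a 3B-parameter model

for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Real usage will be somewhat higher than these numbers, so leave headroom when judging whether the model fits in RAM or VRAM.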



In [None]:
!pip install sentencepiece bitsandbytes

In [None]:
import torch

# Use a pipeline as a high-level helper
from transformers import pipeline

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# Notice that this time we do enable offload_folder to avoid OOM
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b", use_fast=False)  # the fast tokenizer causes problems
model = AutoModelForCausalLM.from_pretrained(fp16_model_repo_id,
                                             #torch_dtype=torch.float16,
                                             torch_dtype='auto',
                                             device_map="auto",
                                             offload_folder="offload", offload_state_dict = True)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

In [None]:
prompt = """Below is a conversation between user and assistent. The assistent is helpful and skillful.
User: Hi! How are you today?
Assistent: I'm feeling good! Anything I may help you with?
User: Write a short essay that analyze ritual from a cultural and anthropology framework.
Assistent: Sure! """

previous_answer = """ Let's start with the definition of ritual. According to anthropologists, ritual is a set of actions that are performed in a prescribed manner, usually with the intention of achieving a desired outcome. Rituals are often associated with religious or spiritual practices, but they can also be used in non-religious contexts, such as weddings, funerals, or celebrations.
In a cultural context, rituals are often seen as a way for people to connect with their traditions, beliefs, and values. They can help to reinforce social norms, transmit knowledge, and transmit values. Rituals can also serve as a way for people to express their identity and their place in the community.
Anthropologists believe that rituals are a way for people to make sense of the world around them. They can help to make sense of the past, present, and future, and to make sense of the relationships between people and the natural world. Rituals can also help to create a sense of community and belonging, and to reinforce social norms and values.
In conclusion, rituals are a powerful tool for people to connect with their traditions, beliefs, and values, and to make sense of the world around them. They can help to create a sense of community and belonging, and to reinforce social norms and values."""

followup_question = """User: Thanks. I heard that primitive tribes throughout the world have diversity in their rituals, and that some tribes are harsher than others in terms of the rite of passage for their members. Postulate some possible factors explaining this variance.
Assistent: """

response = pipe(prompt, max_new_tokens = 512)
print(response[0]['generated_text'])
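
Note that the cell above only sends `prompt`; `previous_answer` and `followup_question` are defined but not used. One way to continue the conversation, assuming plain concatenation in the same User/Assistent format, is sketched below with short stand-in strings. The helper name `continue_conversation` is made up for this example.

```python
def continue_conversation(prompt, previous_answer, followup_question):
    """Append the model's earlier answer and the next user turn to the prompt."""
    # `prompt` already ends with the start of the assistant turn, so the answer
    # is stripped of surrounding whitespace before being appended.
    return prompt + previous_answer.strip() + "\n" + followup_question


# Short stand-ins for the long strings defined in the notebook cell.
demo_prompt = "User: Hi!\nAssistent: Sure! "
demo_answer = " Let's start with a definition."
demo_followup = "User: Thanks. Tell me more.\nAssistent: "

print(continue_conversation(demo_prompt, demo_answer, demo_followup))
```

In the notebook, the second turn would then be `pipe(continue_conversation(prompt, previous_answer, followup_question), max_new_tokens=512)`.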

## GPTQ Quantization

We will use the `auto-gptq` library for this task. It is also one of the options for GPU-only inference (until `exllama`, which features even faster inference, is released).

This library's author just returned from a break, so expect things to be in flux.

During testing, I found several traps to avoid or pre-empt:

- The locale may get messed up, which may cause errors when running cells in the notebook (use the quick fix below).
- A raw install from the git repo may be needed, and we need to make sure the CUDA extension is actually compiled and installed. The check below verifies this.
- Set a memory limit to avoid OOM errors.


In [None]:
# Check your locale, should have UTF-8
import locale
print(locale.getpreferredencoding())

In [None]:
# If not, use this to force override

#locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!git clone https://github.com/PanQiWei/AutoGPTQ.git
!cd AutoGPTQ && BUILD_CUDA_EXT=1 pip install .
!pip install sentencepiece

In [None]:
%cd AutoGPTQ

# Check whether CUDA extension is installed correctly, will throw exception if not
import autogptq_cuda

With these out of the way, let's actually do it.

In [None]:
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b", use_fast=False)

# GPTQ quantization requires sample text as calibration data: the samples are run
# through the model so the quantized weights can keep each layer's output close to the original
test1 = """Below is a conversation between user and assistent. The assistent is helpful and skillful.
User: Hi! How are you today?
Assistent: I'm feeling good! Anything I may help you with?
User: Write a short essay that analyze ritual from a cultural and anthropology framework.
Assistent:  Sure! Let's start with the definition of ritual. According to anthropologists, ritual is a set of actions that are performed in a prescribed manner, usually with the intention of achieving a desired outcome. Rituals are often associated with religious or spiritual practices, but they can also be used in non-religious contexts, such as weddings, funerals, or celebrations.
In a cultural context, rituals are often seen as a way for people to connect with their traditions, beliefs, and values. They can help to reinforce social norms, transmit knowledge, and transmit values. Rituals can also serve as a way for people to express their identity and their place in the community.
Anthropologists believe that rituals are a way for people to make sense of the world around them. They can help to make sense of the past, present, and future, and to make sense of the relationships between people and the natural world. Rituals can also help to create a sense of community and belonging, and to reinforce social norms and values.
In conclusion, rituals are a powerful tool for people to connect with their traditions, beliefs, and values, and to make sense of the world around them. They can help to create a sense of community and belonging, and to reinforce social norms and values.
User: Thanks. I heard that primitive tribes throughout the world have diversity in their rituals, and that some tribes are harsher than others in terms of the rite of passage for their members. Postulate some possible factors explaining this variance.
Assistent:  Yes, primitive tribes throughout the world have a wide range of rituals, and some tribes are harsher than others in terms of the rite of passage for their members. Some of the possible factors that may explain this variance include the availability of resources, the level of social organization, and the level of technology.
In societies where resources are scarce, it may be more important for people to have a clear and defined set of rituals that help to reinforce social norms and values. In societies with a high level of social organization, it may be more important for people to have a clear and defined set of rituals that help to reinforce social norms and values. In societies with a high level of technology, it may be more important for people to have a clear and defined set of rituals that help to reinforce social norms and values.
In conclusion, the factors that may explain the variance in the rite of passage for members of primitive tribes throughout the world include the availability of resources, the level of social organization, and the level of technology."""

examples = [
    tokenizer(test1)
]

# This config controls *how* the quantization is done.
# Check out the r/LocalLLaMA community etc. for tips on the best values to use
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

# Load model with memory limit
model = AutoGPTQForCausalLM.from_pretrained(fp16_model_repo_id,
                                            quantize_config,
                                            max_memory={0:'14GiB', 'cpu': '10GiB'})
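
To make `bits=4` and `group_size=128` more concrete, here is a simplified sketch of group-wise quantization: each group of 128 weights gets its own scale and offset, and every weight is stored as a 4-bit code. This uses plain round-to-nearest, which is not the GPTQ algorithm itself (GPTQ additionally uses the calibration samples to compensate for rounding error). All function names are made up for this example.

```python
import math


def quantize_group(values, bits=4):
    """Round-to-nearest quantization of one group with its own scale and offset."""
    levels = (1 << bits) - 1  # 15 for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Clamp so floating-point noise can never push a code out of range.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate weight values."""
    return [lo + c * scale for c in codes]


def quantize_row(row, bits=4, group_size=128):
    """Split a row of weights into groups and quantize each group separately."""
    return [quantize_group(row[i:i + group_size], bits) for i in range(0, len(row), group_size)]


row = [math.sin(i) for i in range(256)]  # stand-in for one row of weights
groups = quantize_row(row, bits=4, group_size=128)
restored = [v for group in groups for v in dequantize_group(*group)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), "groups, max abs error:", max_error)
```

A smaller `group_size` means more scales to store but a tighter fit per group, which is the trade-off behind the commonly used value of 128.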


And here's the actual quantization:

In [None]:
model.quantize(examples)

After that, let's save the model, similar to what we did with the `transformers` library.

In [None]:
gptq_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-GPTQ" # @param{type:"string"}
gptq_model_save_dir = "gptq-model" # @param{type:"string"}

model.push_to_hub(gptq_model_repo_id,
                  save_dir=gptq_model_save_dir,
                  use_safetensors=True)

**Restart kernel again**

In [None]:
fp16_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-merged" # @param {type:"string"}

## GGML Quantization

Finally, let's perform GGML quantization. The tool for this, `llama.cpp`, is a relatively standalone program with some helper Python scripts that are needed only for the quantization task, so this should be more straightforward.

First let's download *both* the merged model and the tokenizer. Note that the tokenizer is not specified in the model config and it is up to us to know what it is.

Usually it'd be the tokenizer of the base model.

In [None]:
base_dir = "/content/my_model/" # @param{type:"string"}
ggml_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-GGML" # @param{type:"string"}
ggml_upload_filename = "openllama-3b-qlora-axolotl-ck200.ggml.q4_0.bin" # @param{type:"string"}


In [None]:
from huggingface_hub import snapshot_download

path = snapshot_download(fp16_model_repo_id, local_dir=base_dir) #local_dir=IMAGE_MODEL_DIR

In [None]:
from huggingface_hub import hf_hub_download
tok_path = hf_hub_download(repo_id="openlm-research/open_llama_3b",
                           filename="tokenizer.model",
                           local_dir=base_dir)

!ls {base_dir}

Now let's install `llama.cpp` again (notice the last line is needed as we're using their python scripts):

In [None]:
!git clone https://github.com/ggerganov/llama.cpp
%cd llama.cpp
!make
!pip install -r requirements.txt

Time for the quantization. First convert to the `ggml` format:

In [None]:
!python3 convert.py {base_dir}

Then quantize:

In [None]:
!./quantize {base_dir}/ggml-model-f16.bin {base_dir}/ggml-model-q4_0.bin q4_0

In [None]:
!ls -lha {base_dir}

Last but not least, we can use the Huggingface hub SDK to upload files manually to the repo:

In [None]:
from huggingface_hub import create_repo

create_repo(ggml_model_repo_id, repo_type="model", exist_ok=True)  # exist_ok avoids an error on re-runs

In [None]:
from huggingface_hub import HfApi
import os

api = HfApi()
api.upload_file(
    path_or_fileobj = os.path.join(base_dir, "ggml-model-q4_0.bin"),
    path_in_repo = ggml_upload_filename,
    repo_id = ggml_model_repo_id,
    repo_type = "model"
)

## (Optional) Tidying up - Upload Model Card


In [None]:
!pip install Jinja2

In [None]:
model_id = "" # @param{type:"string"}
repo_id = "" # @param{type:"string"}

desc_text = """
This is just the resulting sample artifact of an exercise running a tutorial colab/jupyter notebook.

*Source:* [hkitsmallpotato/llm-collections](https://github.com/hkitsmallpotato/llm-collections)

The following artifacts are possible:

- Original QLora weights
- Merged fp16/32 model
  - Run using the `transformers` library
- GPTQ 4bit quantized model
  - Use `auto-gptq`, `exllama`, etc
- GGML q4_0 quantized model
  - Use `llama.cpp`

"""

from huggingface_hub import ModelCard, ModelCardData

card_data = ModelCardData(language='en', license='mit')
card = ModelCard.from_template(
    card_data,
    model_id = model_id,
    model_description = desc_text,
    finetuned_from = "openlm-research/open_llama_3b"
)

print(card)
#card.push_to_hub(repo_id, create_pr=True)