# Merge QLora and Quantize Model

In this tutorial, we will take the QLora model checkpoint you trained in the previous notebook and perform the post-processing needed for it to actually be used. This usually consists of these steps:

- Merge the QLora adapter weights with the base model
- Quantize into these formats:
  - GPTQ (for GPU)
  - GGML (mainly for CPU)
- Push to Huggingface

You'd need both a wandb and huggingface API key/token.

> **Additional Note:** It is possible (and recommended!) to run most of this tutorial with the **CPU-only runtime** of Colab. The only exception is the section on *GPTQ quantization*, for which a GPU seems to be required. As the disk content may be wiped when switching runtime type (but not when just restarting), you will need to log in to Huggingface Hub again after a switch. Note down the location of the cell where that login occurs and revisit it as necessary. You'd also need to re-initialize some key Python variables, but this notebook provides redundant cells for that to cut down on scrolling back and forth.


## Download artifacts from wandb

First let's login to wandb as usual.

In [1]:
!pip install wandb huggingface_hub

Collecting wandb
  Downloading wandb-0.15.8-py3-none-any.whl (2.1 MB)
Collecting huggingface_hub
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting GitPython!=3.1.29,>=1.0.0 (from wandb)
  Downloading GitPython-3.1.32-py3-none-any.whl (188 kB)
Collecting sentry-sdk>=1.0.0 (from wandb)
  Downloading sentry_sdk-1.29.2-py2.py3-none-any.whl (215 kB)
Collecting docker-pycreds>=0.4.0 (from wandb)
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting pathtools (fro

In [2]:
import wandb

wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

Now find the artifact ID in your wandb dashboard and update the field accordingly:

In [3]:
# @title Download artifact
wandb_artifact_id = "lemontea-tom/qlora-retest/checkpoint-5z6l7or2:v1" # @param {type:"string"}
download_dir = "/content/qlora_model" # @param {type:"string"}

run = wandb.init()
artifact = run.use_artifact(wandb_artifact_id, type='model')
artifact_dir = artifact.download(root = download_dir)
print(artifact_dir)
!ls {artifact_dir}

wandb: Currently logged in as: lemontea-tom. Use `wandb login --relogin` to force relogin
wandb: Downloading large artifact checkpoint-5z6l7or2:v1, 145.78MB. 8 files...
wandb:   8 of 8 files downloaded.
Done. 0:0:4.1


/content/qlora_model
adapter_config.json	   optimizer.pt  rng_state.pth	trainer_state.json
adapter_model.safetensors  README.md	 scheduler.pt	training_args.bin


We've downloaded the artifact/checkpoint to a local directory as shown above.

## Merge QLora with base model

Now, let's begin our work. First log in to Huggingface. You may skip this if you don't want to publish, but then you'd need to download the results manually or upload them to your own private storage such as S3.

In [4]:
from huggingface_hub import notebook_login

notebook_login()


Install the libraries:

In [5]:
!pip install torch peft transformers

Collecting peft
  Downloading peft-0.4.0-py3-none-any.whl (72 kB)
Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting accelerate (from peft)
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Collecting safetensors (from peft)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.m

----

To perform the merge, we use the vanilla `transformers` library from Huggingface together with their `peft` library. (`peft` stands for "Parameter-Efficient Fine-Tuning" and offers various methods that share the general theme of training far fewer parameters than the full model.) The reason is that the `axolotl` tool we trained with actually calls `peft` under the hood, so the resulting weights are in `peft`'s format.

The code below is a bit tricky so let us explain.

- `torch_dtype` needs to be set; otherwise it seems to default to full 32-bit floating point instead of 16-bit, which will balloon our memory requirement to over 12GB. (The code below passes `'auto'`, which picks up the dtype recorded with the model.)
- `use_safetensors` refers to **reading** the model. It must match the actual format of the model you're loading or it'll error out. That is, if the model is in safetensors format set this to true, otherwise set it to false.
- Unfortunately, the model we used in the last tutorial, `openllama-3b`, is NOT in safetensors format. You have two options:
  - Use the PR branch created by a bot that auto-converts models into safetensors format (this is the default in the code below, but it makes the code specific to this model rather than generic)
  - Read in the non-safetensors model (with the security risk that entails), then either rewrite it as safetensors, save it locally, and load that again, OR merge as usual and save as safetensors at the end.
- The issue with the hard disk offloading option (which would have allowed this to work with limited memory) seems to be that offloaded sections of the model are not saved when exporting.
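To make the `torch_dtype` point concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone take at each precision. The parameter count (about 3.4 billion) is an assumed rough figure for this model, and the helper name `estimate_weight_memory_gb` is purely illustrative, not part of any library.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def estimate_weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate memory (1 GB = 1e9 bytes) needed to hold the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: roughly 3.4 billion parameters for a "3B" model like this one.
NUM_PARAMS = 3_400_000_000

for dtype in ("float16", "float32"):
    print(f"{dtype}: ~{estimate_weight_memory_gb(NUM_PARAMS, dtype):.1f} GB")
```

This ignores activations and any temporary copies made while loading or saving, so treat it as a lower bound.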

In [6]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer, LlamaForCausalLM

# Load the PEFT config in our checkpoint
config = PeftConfig.from_pretrained(artifact_dir)

# Load the base model
base_model = LlamaForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    revision="refs/pr/6", # Auto PR from safetensor convert bot
    torch_dtype='auto',
    device_map='cpu',
    use_safetensors=True, # False if you use the original model
    low_cpu_mem_usage=True # Experimental, seem optional?
    # Hopefully you saved the checkpoint as safetensor already,
    # as conversion takes additional RAM
    # offload_folder="offload", offload_state_dict = True
    # (Warning: offloading to harddrive results in some issue when I test)
)
#tokenizer = LlamaTokenizer.from_pretrained(config.base_model_name_or_path)

Downloading (…)2Fpr%2F6/config.json:   0%|          | 0.00/506 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/6.85G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

In [None]:
# Optional code for the second option

#base_model.save_pretrained("/content/stage", safe_serialization=True, push_to_hub=False)
# Restart then load model again with local path like the code above

In [7]:
config

PeftConfig(peft_type='LORA', auto_mapping=None, base_model_name_or_path='openlm-research/open_llama_3b', revision=None, task_type='CAUSAL_LM', inference_mode=True)
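The `base_model_name_or_path` value shown above is stored in the `adapter_config.json` file that was part of the downloaded checkpoint. As a minimal sketch, you could inspect it with only the standard library before loading anything heavy; the helper name `read_base_model_name` is hypothetical.

```python
import json
from pathlib import Path


def read_base_model_name(adapter_dir: str) -> str:
    """Return the base model recorded in a PEFT adapter directory."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    with config_path.open(encoding="utf-8") as f:
        adapter_config = json.load(f)
    return adapter_config["base_model_name_or_path"]


# Example usage (path taken from the download step earlier in this notebook):
# print(read_base_model_name("/content/qlora_model"))
```

This is handy as a sanity check that the checkpoint you downloaded was trained against the base model you expect.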

In [8]:
# Load a combined PEFT model using the base model + our checkpoint

#model = PeftModel.from_pretrained(base_model, artifact_dir, offload_folder="offload", offload_state_dict = True)
# (similar warning, I can't get it to work, would have been nice as it would lessen main memory requirement)
model = PeftModel.from_pretrained(base_model, artifact_dir)

**And here's where the actual merging occurs, using magic:**

In [9]:
merged_model = model.merge_and_unload()
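For the curious, the "magic" is conceptually simple for a standard LoRA layer: the low-rank update `B @ A`, scaled by `lora_alpha / r`, is added into the original weight matrix, after which the adapter is no longer needed. Below is a toy pure-Python sketch of that idea on a 2x2 matrix. It illustrates the arithmetic only and is not `peft`'s actual implementation; the function names are made up for this example.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Fold a LoRA update into the base weight: W + (lora_alpha / r) * (B @ A)."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 base weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]        # shape (r, d_in)  = (1, 2)
lora_b = [[0.5], [0.25]]     # shape (d_out, r) = (2, 1)

merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)
```

After merging, a single matrix multiply with `merged` gives the same result as running the base weight and the adapter side by side.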

In [10]:
merged_model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 3200, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=3200, out_features=3200, bias=False)
          (k_proj): Linear(in_features=3200, out_features=3200, bias=False)
          (v_proj): Linear(in_features=3200, out_features=3200, bias=False)
          (o_proj): Linear(in_features=3200, out_features=3200, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=3200, out_features=8640, bias=False)
          (up_proj): Linear(in_features=3200, out_features=8640, bias=False)
          (down_proj): Linear(in_features=8640, out_features=3200, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm(

Let's save the results to disk. The function interface they provide can do the local save and the publish to Huggingface Hub in one step, though it is also possible to do them separately if you wish.

Regardless of which workflow you use, the library will save the model locally first, then sync. (A temporary directory is used by the library even if you choose to just push without saving.)

**And here's the critical point:** The library would normally need twice the memory: one copy holding the model loaded so far (it does not seem possible to release it), and another used while saving the model locally. To avoid OOM, we add the `max_shard_size="3GB"` argument so that it saves in chunks, limiting the memory overhead to the chunk size.

A small detail is that with this argument the model you push to the hub will also be chunked into 3GB blocks instead of the default 10GB. If you don't like that, an alternative (which we didn't do as it seems too complex and distracting) is to first save locally (chunked) without pushing, restart the runtime, load just the merged model, save it again without chunking, and finally sync.
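To give a feel for what chunked saving means, here is a simplified sketch of greedy sharding: tensors are added to the current shard until the next one would push it over the limit, at which point a new shard is started. This is an illustration under that assumption, not the library's actual code, and `plan_shards` is a hypothetical name.

```python
def plan_shards(tensor_sizes, max_shard_size):
    """Greedily group (name, size) pairs into shards of at most max_shard_size."""
    shards, current, current_size = [], [], 0
    for name, size in tensor_sizes:
        # Start a new shard if adding this tensor would exceed the limit.
        if current and current_size + size > max_shard_size:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards


# Toy sizes in arbitrary units, with a limit of 3 units per shard.
toy_tensors = [("a", 2), ("b", 2), ("c", 1), ("d", 3)]
print(plan_shards(toy_tensors, max_shard_size=3))
```

Only one shard's worth of data needs to be assembled at a time, which is why a smaller shard size keeps the extra memory overhead low.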


In [11]:
# @title Save model and push
fp16_model_save_dir = "merged_model" # @param {type:"string"}
push_to_hub = True # @param {type:"boolean"}
fp16_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-merged" # @param {type:"string"}

merged_model.save_pretrained(fp16_model_save_dir,
                             safe_serialization=True,
                             max_shard_size="3GB",
                             push_to_hub=push_to_hub,
                             repo_id=fp16_model_repo_id)

#merged_model.push_to_hub(fp16_model_repo_id)

model-00001-of-00003.safetensors:   0%|          | 0.00/2.99G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/866M [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/2.99G [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

**Now you need to restart the kernel as there seems to be no clean way to reclaim the memory.** (Stay on the CPU runtime if you want to upload the QLora weights; otherwise restart with a GPU runtime.)

After that, run the cell below to restore the model repo id, which we'll need later.

You should also log in to Huggingface Hub again if you want to push models and you have reset the VM instead of just restarting it. (The difference is whether the hard disk, which stores your token, is wiped.)


In [1]:
fp16_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-merged" # @param {type:"string"}

## Optional sections

The procedures in this section are not strictly necessary, but they're nice to do.

### Push also the QLora delta weights

We will use the API client provided by Huggingface to upload the whole folder directly.

In [4]:
from huggingface_hub import HfApi

qlora_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-peft" # @param{type:"string"}
# @markdown Enter the path where you originally downloaded the artifact below:
artifact_dir = "/content/qlora_model" # @param {type:"string"}

api = HfApi()
api.create_repo(qlora_model_repo_id, repo_type="model")
api.upload_folder(
    folder_path=artifact_dir,
    repo_id=qlora_model_repo_id,
    repo_type="model"
)

adapter_model.safetensors:   0%|          | 0.00/50.9M [00:00<?, ?B/s]

rng_state.pth:   0%|          | 0.00/14.6k [00:00<?, ?B/s]

optimizer.pt:   0%|          | 0.00/102M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/3.96k [00:00<?, ?B/s]

scheduler.pt:   0%|          | 0.00/627 [00:00<?, ?B/s]

Upload 5 LFS files:   0%|          | 0/5 [00:00<?, ?it/s]

'https://huggingface.co/lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-peft/tree/main/'

### Test the merged model

Let's try to run inference on the merged model with the original `transformers` library. This most likely runs in fp16 or even fp32 mode, i.e. 2 or 4 bytes per parameter, so roughly 6GB or 12GB of memory for a 3B-parameter model. You can try the `load_in_4bit` option with the `bitsandbytes` library, though that requires a GPU.



In [1]:
fp16_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-merged" # @param {type:"string"}

In [2]:
!pip install sentencepiece bitsandbytes

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.41.0-py3-none-any.whl (92.6 MB)
Installing collected packages: sentencepiece, bitsandbytes
Successfully installed bitsandbytes-0.41.0 sentencepiece-0.1.99


In [3]:
!pip install torch transformers accelerate

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting accelerate
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Down

In [4]:
import torch

# Use a pipeline as a high-level helper
from transformers import pipeline

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# Notice this time we do enable offload_folder to avoid OOM
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b", use_fast=False)  # the fast tokenizer causes problems
model = AutoModelForCausalLM.from_pretrained(fp16_model_repo_id,
                                             #torch_dtype=torch.float16,
                                             torch_dtype='auto',
                                             device_map="auto",
                                             offload_folder="offload", offload_state_dict = True)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Downloading (…)okenizer_config.json:   0%|          | 0.00/593 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/534k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/330 [00:00<?, ?B/s]

You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565


Downloading (…)lve/main/config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading (…)fetensors.index.json:   0%|          | 0.00/21.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)of-00003.safetensors:   0%|          | 0.00/2.99G [00:00<?, ?B/s]

Downloading (…)of-00003.safetensors:   0%|          | 0.00/2.99G [00:00<?, ?B/s]

Downloading (…)of-00003.safetensors:   0%|          | 0.00/866M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [5]:
prompt = """Below is a conversation between user and assistent. The assistent is helpful and skillful.
User: Hi! How are you today?
Assistent: I'm feeling good! Anything I may help you with?
User: Write a short essay that analyze ritual from a cultural and anthropology framework.
Assistent: Sure! """

previous_answer = """ Let's start with the definition of ritual. According to anthropologists, ritual is a set of actions that are performed in a prescribed manner, usually with the intention of achieving a desired outcome. Rituals are often associated with religious or spiritual practices, but they can also be used in non-religious contexts, such as weddings, funerals, or celebrations.
In a cultural context, rituals are often seen as a way for people to connect with their traditions, beliefs, and values. They can help to reinforce social norms, transmit knowledge, and transmit values. Rituals can also serve as a way for people to express their identity and their place in the community.
Anthropologists believe that rituals are a way for people to make sense of the world around them. They can help to make sense of the past, present, and future, and to make sense of the relationships between people and the natural world. Rituals can also help to create a sense of community and belonging, and to reinforce social norms and values.
In conclusion, rituals are a powerful tool for people to connect with their traditions, beliefs, and values, and to make sense of the world around them. They can help to create a sense of community and belonging, and to reinforce social norms and values."""

followup_question = """User: Thanks. I heard that primitive tribes throughout the world have diversity in their rituals, and that some tribes are harsher than others in terms of the rite of passage for their members. Postulate some possible factors explaining this variance.
Assistent: """

response = pipe(prompt, max_new_tokens = 512)
print(response[0]['generated_text'])

Below is a conversation between user and assistent. The assistent is helpful and skillful.
User: Hi! How are you today?
Assistent: I'm feeling good! Anything I may help you with?
User: Write a short essay that analyze ritual from a cultural and anthropology framework.
Assistent: Sure!  I'll be happy to help.
User: Thank you! What are some of the common rituals in your culture?
Assistent: Sure! There are many rituals in my culture. Some of the most common ones are weddings, funerals, and birthdays.
User: What are the cultural and anthropological reasons behind these rituals?
Assistent: Well, weddings are a very important part of our culture. They are a celebration of love and commitment between two people. They are also a way for families to come together and share in the joy of the newlyweds.
Funerals are also very important in our culture. They are a way to honor the memory of a loved one and to help the family and friends cope with their loss.
Birthdays are also a very important part

## GPTQ Quantization

We will use the `auto-gptq` library for this task. It is also one of the options for GPU-only inference (until `exllama`, which features even faster inference, is released).

This library's author just returned from a break, so expect things to be in more flux.

During testing, I found several traps to avoid or pre-empt:

- The locale may get messed up, which may cause errors when attempting to run cells in the notebook (use the quick fix below).
- A raw install from the git repo may be needed, and we need to ensure the CUDA extension is actually compiled and installed. The check below verifies this.
- Set a memory limit to avoid OOM errors.


In [1]:
fp16_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-merged" # @param {type:"string"}

In [2]:
# Check your locale, should have UTF-8
import locale
print(locale.getpreferredencoding())

UTF-8


In [None]:
# If not, use this to force override

#locale.getpreferredencoding = lambda: "UTF-8"
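If you prefer not to decide by hand, the check and the override can be combined so that the override is only applied when the reported encoding is not UTF-8. This is just a convenience sketch of the same quick fix; it monkey-patches `locale.getpreferredencoding` for the current Python session only.

```python
import locale

# Only override when the reported encoding is not already UTF-8.
if locale.getpreferredencoding().upper() != "UTF-8":
    # Accept (and ignore) the optional do_setlocale argument of the original.
    locale.getpreferredencoding = lambda do_setlocale=True: "UTF-8"

print(locale.getpreferredencoding())
```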

In [3]:
!git clone https://github.com/PanQiWei/AutoGPTQ.git
!cd AutoGPTQ && BUILD_CUDA_EXT=1 pip install .
!pip install sentencepiece

Cloning into 'AutoGPTQ'...
remote: Enumerating objects: 2487, done.
remote: Counting objects: 100% (865/865), done.
remote: Compressing objects: 100% (407/407), done.
remote: Total 2487 (delta 542), reused 624 (delta 442), pack-reused 1622
Receiving objects: 100% (2487/2487), 7.47 MiB | 15.61 MiB/s, done.
Resolving deltas: 100% (1629/1629), done.
Processing /content/AutoGPTQ
  Preparing metadata (setup.py) ... done
Collecting accelerate>=0.19.0 (from auto-gptq==0.3.2+cu118)
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Collecting datasets (from auto-gptq==0.3.2+cu118)
  Downloading datasets-2.14.3-py3-none-any.whl (519 kB)
Collecting rouge (from auto-gptq==0.3.2+cu118)
  Downloading rouge-1.0.1-py3-none-any

In [4]:
%cd AutoGPTQ

# Check whether CUDA extension is installed correctly, will throw exception if not
import autogptq_cuda

/content/AutoGPTQ


With these out of the way, let's actually do it.

In [5]:
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b", use_fast=False)

# GPTQ quantization requires sample text for calibration: the samples are run through
# the model so the quantizer can see typical layer inputs and reduce the quantization error
test1 = """Below is a conversation between user and assistent. The assistent is helpful and skillful.
User: Hi! How are you today?
Assistent: I'm feeling good! Anything I may help you with?
User: Write a short essay that analyze ritual from a cultural and anthropology framework.
Assistent:  Sure! Let's start with the definition of ritual. According to anthropologists, ritual is a set of actions that are performed in a prescribed manner, usually with the intention of achieving a desired outcome. Rituals are often associated with religious or spiritual practices, but they can also be used in non-religious contexts, such as weddings, funerals, or celebrations.
In a cultural context, rituals are often seen as a way for people to connect with their traditions, beliefs, and values. They can help to reinforce social norms, transmit knowledge, and transmit values. Rituals can also serve as a way for people to express their identity and their place in the community.
Anthropologists believe that rituals are a way for people to make sense of the world around them. They can help to make sense of the past, present, and future, and to make sense of the relationships between people and the natural world. Rituals can also help to create a sense of community and belonging, and to reinforce social norms and values.
In conclusion, rituals are a powerful tool for people to connect with their traditions, beliefs, and values, and to make sense of the world around them. They can help to create a sense of community and belonging, and to reinforce social norms and values.
User: Thanks. I heard that primitive tribes throughout the world have diversity in their rituals, and that some tribes are harsher than others in terms of the rite of passage for their members. Postulate some possible factors explaining this variance.
Assistent:  Yes, primitive tribes throughout the world have a wide range of rituals, and some tribes are harsher than others in terms of the rite of passage for their members. Some of the possible factors that may explain this variance include the availability of resources, the level of social organization, and the level of technology.
In societies where resources are scarce, it may be more important for people to have a clear and defined set of rituals that help to reinforce social norms and values. In societies with a high level of social organization, it may be more important for people to have a clear and defined set of rituals that help to reinforce social norms and values. In societies with a high level of technology, it may be more important for people to have a clear and defined set of rituals that help to reinforce social norms and values.
In conclusion, the factors that may explain the variance in the rite of passage for members of primitive tribes throughout the world include the availability of resources, the level of social organization, and the level of technology."""

examples = [
    tokenizer(test1)
]

# This configures *how* the quantization is done.
# Check out the r/LocalLlama community etc. for tips on the best values to use
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

# Load model with memory limit
model = AutoGPTQForCausalLM.from_pretrained(fp16_model_repo_id,
                                            quantize_config,
                                            max_memory={0:'14GiB', 'cpu': '10GiB'})


Downloading (…)okenizer_config.json:   0%|          | 0.00/593 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/534k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/330 [00:00<?, ?B/s]

You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565


Downloading (…)lve/main/config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading (…)fetensors.index.json:   0%|          | 0.00/21.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)of-00003.safetensors:   0%|          | 0.00/2.99G [00:00<?, ?B/s]

KeyboardInterrupt: ignored

And here's the actual quantization:

In [None]:
model.quantize(examples)
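To build some intuition for the `bits=4` and `group_size=128` settings used above, here is a toy sketch of plain round-to-nearest group-wise quantization: each group of 128 weights shares one scale and one offset, and every weight is stored as an integer code from 0 to 15. Note that this is deliberately *not* the GPTQ algorithm, which additionally uses the calibration samples to compensate for rounding error; the function names are made up for illustration.

```python
def quantize_group(values, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float values."""
    return [c * scale + zero for c in codes]


def quantize_row(row, bits=4, group_size=128):
    """Split a row of weights into groups and quantize each independently."""
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


# Toy row of 256 "weights" -> two groups of 128.
row = [i / 255 for i in range(256)]
groups = quantize_row(row, bits=4, group_size=128)
print(len(groups), "groups; first codes:", groups[0][0][:8])
```

A smaller group size means more scales to store but a tighter fit to the weights in each group, which is the trade-off behind the `group_size` setting.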

After that, let's save the model, similar to how we did with the `transformers` library.

In [None]:
gptq_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-GPTQ" # @param{type:"string"}
gptq_model_save_dir = "gptq-model" # @param{type:"string"}

model.push_to_hub(gptq_model_repo_id,
                  save_dir=gptq_model_save_dir,
                  use_safetensors=True)

**Restart kernel again** (You may switch back to CPU runtime now)

In [None]:
fp16_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-merged" # @param {type:"string"}

## GGML Quantization

Finally, let's perform GGML quantization. It is a relatively standalone program, with some helper Python scripts that are needed only for the quantization task, so this should be more straightforward.

First let's download *both* the merged model and the tokenizer. Note that the tokenizer is not specified in the model config and it is up to us to know what it is.

Usually it'd be the tokenizer of the base model.

In [None]:
base_dir = "/content/my_model/" # @param{type:"string"}
ggml_model_repo_id = "lemonteaa/exercise-openllama-3b-qlora-axolotl-checkpoint200-GGML" # @param{type:"string"}
ggml_upload_filename = "openllama-3b-qlora-axolotl-ck200.ggml.q4_0.bin" # @param{type:"string"}


In [None]:
from huggingface_hub import snapshot_download

path = snapshot_download(fp16_model_repo_id, local_dir=base_dir) #local_dir=IMAGE_MODEL_DIR

In [None]:
from huggingface_hub import hf_hub_download
tok_path = hf_hub_download(repo_id="openlm-research/open_llama_3b",
                           filename="tokenizer.model",
                           local_dir=base_dir)

!ls {base_dir}

Now let's install `llama.cpp` again (notice the last line is needed as we're using their python scripts):

In [None]:
!git clone https://github.com/ggerganov/llama.cpp
%cd llama.cpp
!make
!pip install -r requirements.txt

Time for the quantization. First convert to the `ggml` format:

In [None]:
!python3 convert.py {base_dir}

Then quantize:

In [None]:
!./quantize {base_dir}/ggml-model-f16.bin {base_dir}/ggml-model-q4_0.bin q4_0
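As a rough guide to the file size you should expect, the sketch below estimates the storage cost of `q4_0`. It assumes the block layout as I understand it for `llama.cpp` at the time of writing: 32 weights per block, each stored in 4 bits, plus one 16-bit scale per block. Both that layout and the parameter count (about 3.4 billion) are assumptions, and some tensors are kept at higher precision, so the real file will differ somewhat.

```python
# Assumed q4_0 block layout (see the note above): 32 weights share one fp16 scale.
Q4_0_BLOCK_SIZE = 32   # weights per block
Q4_0_WEIGHT_BITS = 4   # bits per quantized weight
Q4_0_SCALE_BITS = 16   # one fp16 scale per block


def q4_0_bits_per_weight() -> float:
    """Average number of bits used per weight, including the shared scale."""
    block_bits = Q4_0_BLOCK_SIZE * Q4_0_WEIGHT_BITS + Q4_0_SCALE_BITS
    return block_bits / Q4_0_BLOCK_SIZE


def estimate_q4_0_size_gb(num_params: int) -> float:
    """Approximate file size (1 GB = 1e9 bytes) if every weight were q4_0."""
    return num_params * q4_0_bits_per_weight() / 8 / 1e9


# Assumption: roughly 3.4 billion parameters for this model.
print(f"{q4_0_bits_per_weight()} bits/weight, "
      f"~{estimate_q4_0_size_gb(3_400_000_000):.2f} GB")
```

Comparing this estimate with the output of the `ls -lha` cell is a quick way to confirm that the quantization actually ran.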

In [None]:
!ls -lha {base_dir}

Last but not least, we can use the Huggingface hub SDK to upload files manually to the repo:

In [None]:
from huggingface_hub import create_repo

create_repo(ggml_model_repo_id, repo_type="model")

In [None]:
from huggingface_hub import HfApi
import os

api = HfApi()
api.upload_file(
    path_or_fileobj = os.path.join(base_dir, "ggml-model-q4_0.bin"),
    path_in_repo = ggml_upload_filename,
    repo_id = ggml_model_repo_id,
    repo_type = "model"
)

## (Optional) Tidying up - Upload Model Card


In [None]:
!pip install Jinja2

In [None]:
model_id = "" # @param{type:"string"}
repo_id = "" # @param{type:"string"}

desc_text = """
This is just the resulting sample artifact of an exercise running a tutorial colab/jupyter notebook.

*Source:* [hkitsmallpotato/llm-collections](https://github.com/hkitsmallpotato/llm-collections)

The following artifacts are possible:

- Original QLora weights
- Merged fp16/32 model
  - Run using the `transformers` library
- GPTQ 4bit quantized model
  - Use `auto-gptq`, `exllama`, etc
- GGML q4_0 quantized model
  - Use `llama.cpp`

"""

from huggingface_hub import ModelCard, ModelCardData

card_data = ModelCardData(language='en', license='mit')
card = ModelCard.from_template(
    card_data,
    model_id = model_id,
    model_description = desc_text,
    finetuned_from = "openlm-research/open_llama_3b"
)

print(card)
#card.push_to_hub(repo_id, create_pr=True)