### Quantizing a fine-tuned LLM and running inference with the quantized model

LLMs are computationally expensive to run. This is where quantization comes in. Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
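
To make the idea concrete, here is a minimal sketch of symmetric int8 quantization in plain Python. It is a generic illustration only, not necessarily the scheme ctranslate2 uses internally. The function names `quantize_int8` and `dequantize` and the sample weights are made up for this example.

```python
def quantize_int8(weights):
    """Map a list of floats to integers in [-127, 127] with one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # The scale is chosen so that the largest magnitude maps to 127.
    # Fall back to 1.0 when every weight is zero to avoid dividing by zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.02, -1.27, 0.64, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)  # small integers that fit in one byte each
print(restored)   # close to the original weights, up to rounding error
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half the scale per weight.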

For more information regarding quantization, please read this: https://huggingface.co/docs/optimum/concept_guides/quantization


In this notebook, we quantize the model with the ctranslate2 package. ctranslate2 implements its own runtime, which lets an 8-bit quantized LLM run on a CPU. The base Hugging Face model object, by contrast, relies on the bitsandbytes package for 8-bit loading, which was not supported on Windows 10 and 11 when this notebook was written.

Models supported by ctranslate2 are listed here: https://github.com/OpenNMT/CTranslate2

Packages used in this notebook are as follows:
1. adapter-transformers
2. ctranslate2

Quantization is run from the command prompt. To run a command prompt command from a notebook cell, prefix the command with an "!" symbol.

The syntax for the command is as follows:<br>
`!ct2-transformers-converter --model "model_path" --output_dir "output_path" --quantization int8`<br>
Supported quantization options are listed here: https://opennmt.net/CTranslate2/quantization.html
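
If quoting long Windows paths with spaces gets fiddly, the same command can be assembled as an argument list in Python. This is a small sketch, assuming `ct2-transformers-converter` is on the PATH. The helper name `build_converter_command` and the placeholder paths are hypothetical, and only the three flags shown above are used.

```python
import subprocess


def build_converter_command(model_path, output_dir, quantization="int8"):
    """Return the converter call as an argument list (no shell quoting needed)."""
    return [
        "ct2-transformers-converter",
        "--model", model_path,
        "--output_dir", output_dir,
        "--quantization", quantization,
    ]


# Placeholder paths: replace them with your own model and output folders.
command = build_converter_command("path/to/tuned_model", "path/to/quantized_model")
print(command)

# Uncomment to actually run the conversion; check=True raises on failure.
# subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` hands each path to the converter as a single argument, so spaces in folder names need no extra quotes.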

In [1]:
!ct2-transformers-converter --model "C:\Users\JkReddy\Desktop\WCM - Volunteer Work\Homeless\tuned_model\flan-t5-xl" --output_dir "C:\Users\JkReddy\Desktop\WCM - Volunteer Work\Homeless\tuned_model\quantized_tuned_flan_t5_xl" --quantization int8


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|#####     | 1/2 [01:42<01:42, 102.29s/it]
Loading checkpoint shards: 100%|##########| 2/2 [01:48<00:00, 45.61s/it] 
Loading checkpoint shards: 100%|##########| 2/2 [01:48<00:00, 54.11s/it]


Loading the model using the ctranslate2 package
The Translator class can be used for our Q&A task. The name is odd for question answering, but FLAN-T5 is a Seq2Seq (encoder-decoder) model, and ctranslate2 runs such models through the Translator class and its methods. The Generator class is meant for causal (decoder-only) language models, so if Translator does not work for your model, use Generator.

In [2]:
import ctranslate2
from transformers import AutoTokenizer

# Load the int8 model directory written by ct2-transformers-converter above (runs on the CPU by default).
translator = ctranslate2.Translator(r"C:\Users\JkReddy\Desktop\WCM - Volunteer Work\Homeless\tuned_model\quantized_tuned_flan_t5_xl")
# The tokenizer is loaded from the original base checkpoint.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")

Executing a sample question to see how the model performs

In [3]:
input_text =  """Instruction: Read what the Clinician said in the Context below and answer the Question by choosing from the below provided Choices.
Context: The Clinician said, "He is sympathetic to the homeless"
Question: "In the clinician's opinon, was the person themself homeless?"
Choices: Yes; No
Answer:
"""
print(input_text)


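# ctranslate2 expects token strings, so encode to ids and map the ids back to tokens.
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text))
# translate_batch takes a list of token lists; here the batch holds a single prompt.
results = translator.translate_batch([input_tokens])
# Take the best hypothesis for the first (and only) prompt.
output_tokens = results[0].hypotheses[0]
# Convert the output tokens back to ids and decode them into text.
output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))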

print(output_text)

Instruction: Read what the Clinician said in the Context below and answer the Question by choosing from the below provided Choices.
Context: The Clinician said, "He is sympathetic to the homeless"
Question: "In the clinician's opinon, was the person themself homeless?"
Choices: Yes; No
Answer:

No
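
To ask several questions in one call, the steps above can be wrapped in a small helper. This is a sketch rather than part of ctranslate2: the name `answer_batch` is made up here, and it only reuses the tokenizer and `translate_batch` calls from the cell above.

```python
def answer_batch(translator, tokenizer, prompts):
    """Run a list of prompt strings through the model and return the decoded answers."""
    # Tokenize every prompt into the token strings that ctranslate2 expects.
    batch_tokens = [
        tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
        for prompt in prompts
    ]
    # One call handles the whole batch; results come back in input order.
    results = translator.translate_batch(batch_tokens)
    # Keep the best hypothesis for each prompt and decode it to text.
    return [
        tokenizer.decode(tokenizer.convert_tokens_to_ids(result.hypotheses[0]))
        for result in results
    ]
```

With the objects loaded earlier, `answer_batch(translator, tokenizer, [input_text])` should return a one-element list holding the same answer as the cell above.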


For more information about the ctranslate2 package and its functionalities, please have a look at this: https://opennmt.net/CTranslate2/index.html