# Run Quantized LLMs on CPUs

In this tutorial, we show how to build a retrieval-augmented generation (RAG) pipeline that runs a quantized LLM on CPU, using [optimum-intel](https://github.com/huggingface/optimum-intel).

The tutorial covers the following steps:

- Quantize an LLM.
- Fetch relevant documents for our question.
- Rerank the documents for better performance.
- Run the LLM on CPU to answer the question.

For more information about optimum-intel, see the [original repository](https://github.com/huggingface/optimum-intel).

First, we export the model to an ONNX file:


In [2]:
model_name = 'facebook/opt-iml-max-1.3b'

converted_model_path = f"/tmp/{model_name.replace('/','_')}"

In [3]:
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(model_name, export=True, trust_remote_code=True)
model.save_pretrained(converted_model_path)


Then, we load the exported model back, and quantize it:

In [4]:
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer
import os

model = ORTModelForCausalLM.from_pretrained(converted_model_path)

qconfig = AutoQuantizationConfig.avx2(is_static=False)
quantizer = ORTQuantizer.from_pretrained(model)

quantizer.quantize(save_dir=os.path.join(converted_model_path, 'quantized'), quantization_config=qconfig)

Creating dynamic quantizer: QOperator (mode: IntegerOps, schema: u8/u8, channel-wise: True)
Quantizing model...
Saving quantized model at: /tmp/facebook_opt-iml-max-1.3b/quantized (external data format: False)
Configuration saved in /tmp/facebook_opt-iml-max-1.3b/quantized/ort_config.json


PosixPath('/tmp/facebook_opt-iml-max-1.3b/quantized')
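
The log line above reports a `u8/u8` schema with channel-wise quantization. As a rough intuition for what that means, the sketch below applies affine 8-bit quantization to each row of a tiny weight matrix in plain Python. It is an illustration only, not ONNX Runtime's implementation; the function names `quantize_u8` and `dequantize_u8` and the toy weights are hypothetical.

```python
def quantize_u8(values):
    """Map a list of floats to integers in [0, 255] with an affine transform."""
    # The representable range must contain 0.0 so that zero maps to an exact integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for an all-zero row
    zero_point = round(-lo / scale)
    quantized = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize_u8(quantized, scale, zero_point):
    """Recover approximate floats from the 8-bit representation."""
    return [(q - zero_point) * scale for q in quantized]


# Hypothetical 2x4 weight matrix; "channel-wise" is simplified here to one scale per row.
toy_weights = [[-1.0, 0.0, 0.5, 1.0], [0.0, 0.1, 0.2, 0.4]]

for row in toy_weights:
    q, scale, zp = quantize_u8(row)
    restored = dequantize_u8(q, scale, zp)
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print(f"scale={scale:.5f} zero_point={zp} max_error={worst:.5f}")
```

Each row gets its own scale, so a row with a small value range keeps finer resolution than it would under a single scale shared by the whole matrix.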

Now that our model is quantized, we can create a RAG pipeline with it:

In [5]:
from haystack import Pipeline
from haystack.nodes.prompt import PromptNode
import torch
from haystack.nodes import PromptModel
from haystack.nodes.prompt.prompt_template import PromptTemplate
from haystack.nodes import AnswerParser
from haystack.nodes.ranker import SentenceTransformersRanker
from haystack.nodes.retriever import BM25Retriever
from haystack.document_stores import InMemoryDocumentStore
from haystack import Document



For the retrieval phase, we start from a small collection of Wikipedia paragraphs:

In [6]:
document_collection = [{'id': '11457596',
  'text': 'Quest", the "Ultima" series, "EverQuest", the "Warcraft" series, and the "Elder Scrolls" series of games as well as video games set in Middle-earth itself. Research also suggests that some consumers of fantasy games derive their motivation from trying to create an epic fantasy narrative which is influenced by "The Lord of the Rings". In 1965, songwriter Donald Swann, who was best known for his collaboration with Michael Flanders as Flanders & Swann, set six poems from "The Lord of the Rings" and one from "The Adventures of Tom Bombadil" ("Errantry") to music. When Swann met with Tolkien to play the',
  'title': 'The Lord of the Rings'},
 {'id': '11457582',
  'text': 'helped "The Lord of the Rings" become immensely popular in the United States in the 1960s. The book has remained so ever since, ranking as one of the most popular works of fiction of the twentieth century, judged by both sales and reader surveys. In the 2003 "Big Read" survey conducted in Britain by the BBC, "The Lord of the Rings" was found to be the "Nation\'s best-loved book". In similar 2004 polls both Germany and Australia also found "The Lord of the Rings" to be their favourite book. In a 1999 poll of Amazon.com customers, "The Lord of the',
  'title': 'The Lord of the Rings'},
 {'id': '11457540',
  'text': 'of Tolkien\'s works is such that the use of the words "Tolkienian" and "Tolkienesque" has been recorded in the "Oxford English Dictionary". The enduring popularity of "The Lord of the Rings" has led to numerous references in popular culture, the founding of many societies by fans of Tolkien\'s works, and the publication of many books about Tolkien and his works. "The Lord of the Rings" has inspired, and continues to inspire, artwork, music, films and television, video games, board games, and subsequent literature. Award-winning adaptations of "The Lord of the Rings" have been made for radio, theatre, and film. In',
  'title': 'The Lord of the Rings'},
 {'id': '11457587',
  'text': 'has been read as fitting the model of Joseph Campbell\'s "monomyth". "The Lord of the Rings" has been adapted for film, radio and stage. The book has been adapted for radio four times. In 1955 and 1956, the BBC broadcast "The Lord of the Rings", a 13-part radio adaptation of the story. In the 1960s radio station WBAI produced a short radio adaptation. A 1979 dramatization of "The Lord of the Rings" was broadcast in the United States and subsequently issued on tape and CD. In 1981, the BBC broadcast "The Lord of the Rings", a new dramatization in 26',
  'title': 'The Lord of the Rings'},
 {'id': '11457592',
  'text': '"The Lord of the Rings", was released on the internet in May 2009 and has been covered in major media. "Born of Hope", written by Paula DiSante, directed by Kate Madison, and released in December 2009, is a fan film based upon the appendices of "The Lord of the Rings". In November 2017, Amazon acquired the global television rights to "The Lord of the Rings", committing to a multi-season television series. The series will not be a direct adaptation of the books, but will instead introduce new stories that are set before "The Fellowship of the Ring". Amazon said the',
  'title': 'The Lord of the Rings'},
 {'id': '7733817',
  'text': 'The Lord of the Rings Online The Lord of the Rings Online: Shadows of Angmar is a massive multiplayer online role-playing game (MMORPG) for Microsoft Windows and OS X set in a fantasy universe based upon J. R. R. Tolkien\'s Middle-earth writings, taking place during the time period of "The Lord of the Rings". It launched in North America, Australia, Japan, and Europe in 2007. Originally subscription-based, it is free-to-play, with a paid VIP subscription available that provides players various perks.  The game\'s environment is based on "The Lord of the Rings" and "The Hobbit". However, Turbine does not',
  'title': 'The Lord of the Rings Online'},
 {'id': '22198847',
  'text': 'of "The Lord of the Rings", including Ian McKellen, Andy Serkis, Hugo Weaving, Elijah Wood, Ian Holm, Christopher Lee, Cate Blanchett and Orlando Bloom who reprised their roles. Although the "Hobbit" films were even more commercially successful than "The Lord of the Rings", they received mixed reviews from critics. Numerous video games were released to supplement the film series. They include: "," Pinball, "", "", , "", "", "", "", "The Lord of the Rings Online", "", "", "", "Lego The Lord of the Rings", "Guardians of Middle-earth", "", and "".',
  'title': 'The Lord of the Rings (film series)'},
 {'id': '24071573',
  'text': 'Lord of the Rings (musical) The Lord of the Rings is the most prominent of several theatre adaptations of J. R. R. Tolkien\'s epic high fantasy novel of the same name, with music by A. R. Rahman, Christopher Nightingale and the band Värttinä, and book and lyrics by Matthew Warchus and Shaun McKenna. Set in the world of Middle-earth, "The Lord of the Rings" tells the tale of a humble hobbit who is asked to play the hero and undertake a treacherous mission to destroy an evil, magic ring without being seduced by its power. The show was first performed',
  'title': 'Lord of the Rings (musical)'},
 {'id': '11457536',
  'text': 'The Lord of the Rings The Lord of the Rings is an epic high fantasy novel written by English author and scholar J. R. R. Tolkien. The story began as a sequel to Tolkien\'s 1937 fantasy novel "The Hobbit", but eventually developed into a much larger work. Written in stages between 1937 and 1949, "The Lord of the Rings" is one of the best-selling novels ever written, with over 150 million copies sold. The title of the novel refers to the story\'s main antagonist, the Dark Lord Sauron, who had in an earlier age created the One Ring to rule',
  'title': 'The Lord of the Rings'},
 {'id': '13304003',
  'text': "The Lord of the Rings (disambiguation) The Lord of the Rings is a fantasy novel by J. R. R. Tolkien. The title refers to Sauron, the story's main antagonist. The Lord of the Rings may also refer to:",
  'title': 'The Lord of the Rings (disambiguation)'}]

We then create an `InMemoryDocumentStore` with BM25 enabled to hold all the documents:

In [7]:
store = InMemoryDocumentStore(use_bm25=True)

In [8]:
documents = [Document(id=item["id"], content=item["text"], meta={"title": item["title"]}) for item in document_collection]

In [9]:
store.write_documents(documents)

Updating BM25 representation...: 100%|██████████| 10/10 [00:00<00:00, 8977.53 docs/s]


Next, we create a simple BM25 retriever on top of our store, and an additional reranker component to improve the ranking of the documents used for answering the question:

In [10]:

retriever = BM25Retriever(
    document_store=store,
    top_k=10,
)

reranker = SentenceTransformersRanker(
    batch_size=32,
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_k=1,
    use_gpu=False,
)
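
To make the first retrieval stage more concrete, here is a simplified Okapi BM25 scorer in plain Python. It is a sketch under stated assumptions, not Haystack's implementation: whitespace tokenization, the `k1` and `b` defaults, the function name `bm25_scores`, and the toy documents are all chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization penalizes documents longer than average.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, unrelated to the document store above.
toy_docs = [
    "sauron is the dark lord of mordor",
    "the hobbit is a fantasy novel",
    "the musical opened in toronto",
]
toy_scores = bm25_scores("dark lord sauron", toy_docs)

# Keep only the best-scoring document, similar in spirit to top_k=1.
best_index = max(range(len(toy_docs)), key=lambda i: toy_scores[i])
best_doc = toy_docs[best_index]
print(best_doc)
```

BM25 only matches exact terms, which is why the pipeline adds a cross-encoder reranker on top: the reranker reads the query and each candidate together and can reorder them by meaning rather than by word overlap.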

Now that we have created the retrieval components, we move to the LLM usage.

We create a prompt template that contains placeholders for the retrieved documents and the query:

In [11]:
AParser = AnswerParser()
LFQA = PromptTemplate(
    prompt="""Answer the Question below using only the Document provided.
Do not use any prior knowledge to answer the question.
Your answer can only be an entity name or a short phrase.

Document:
{join(documents)}

Question: {query}
Answer: """,
    output_parser= AParser
)
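
To see roughly what text the LLM receives, the sketch below fills the same prompt with plain Python string formatting. It does not use Haystack's template engine; the helper `render_prompt` is hypothetical, and it assumes that `join(documents)` concatenates the document contents with a single space.

```python
TEMPLATE = (
    "Answer the Question below using only the Document provided.\n"
    "Do not use any prior knowledge to answer the question.\n"
    "Your answer can only be an entity name or a short phrase.\n"
    "\n"
    "Document:\n"
    "{documents}\n"
    "\n"
    "Question: {query}\n"
    "Answer: "
)


def render_prompt(document_texts, query):
    """Fill the template with the retrieved texts and the user query."""
    # Assumption: documents are joined with a single space.
    return TEMPLATE.format(documents=" ".join(document_texts), query=query)


example_prompt = render_prompt(
    ["The title refers to Sauron, the story's main antagonist."],
    "Who is the main villain in Lord of the Rings?",
)
print(example_prompt)
```

Because the prompt ends with `Answer: `, the model's continuation is the answer itself, which the `AnswerParser` then wraps into an answer object.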

We now create a `PromptModel` with an LLM, using the `ORTInvocationLayer` class to load the LLM as a quantized ONNX model on our CPU.

In [12]:
from fastrag.prompters.invocation_layers.ort import ORTInvocationLayer



In [13]:
import os

In [14]:
prompter_model = PromptModel(
    model_name_or_path= os.path.join(converted_model_path, 'quantized'),
    invocation_layer_class=ORTInvocationLayer,
    model_kwargs= dict(
        task_name="text-generation",
        max_new_tokens=20,
    )
)

With the model and the prompt template ready, we create a `PromptNode` that combines them:

In [15]:
Prompter = PromptNode(
    model_name_or_path= prompter_model,
    default_prompt_template= LFQA
)

With all components ready, we connect them in a pipeline:

In [16]:
from haystack import Pipeline

In [17]:
pipe = Pipeline()

pipe.add_node(component=retriever, name='Retriever', inputs=["Query"])
pipe.add_node(component=reranker, name='Reranker', inputs=["Retriever"])
pipe.add_node(component=Prompter, name='Prompter', inputs=["Reranker"])
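
The `inputs` argument names the upstream node whose output each node consumes, with `"Query"` as the entry point. The toy class below mimics that wiring for a purely linear chain. It is a hypothetical illustration, not Haystack's `Pipeline`, which is more general; `ToyPipeline` and the three stand-in components are invented for this sketch.

```python
class ToyPipeline:
    """Runs nodes in insertion order, feeding each node the output of its input node."""

    def __init__(self):
        self.nodes = {}  # node name -> (callable, name of the node it reads from)

    def add_node(self, component, name, inputs):
        self.nodes[name] = (component, inputs[0])

    def run(self, query):
        outputs = {"Query": {"query": query}}
        last = "Query"
        for name, (component, source) in self.nodes.items():
            outputs[name] = component(outputs[source])
            last = name
        return outputs[last]


def toy_retriever(state):
    docs = ["Sauron is the main antagonist.", "The Hobbit was published in 1937."]
    return {**state, "documents": docs}


def toy_reranker(state):
    # Word overlap with the query stands in for a real cross-encoder score.
    words = set(state["query"].lower().split())
    best = max(state["documents"], key=lambda d: len(words & set(d.lower().split())))
    return {**state, "documents": [best]}


def toy_prompter(state):
    prompt = f"Document: {state['documents'][0]} Question: {state['query']}"
    return {**state, "prompt": prompt}


toy_pipe = ToyPipeline()
toy_pipe.add_node(component=toy_retriever, name="Retriever", inputs=["Query"])
toy_pipe.add_node(component=toy_reranker, name="Reranker", inputs=["Retriever"])
toy_pipe.add_node(component=toy_prompter, name="Prompter", inputs=["Reranker"])

toy_result = toy_pipe.run("who is the main antagonist")
print(toy_result["prompt"])
```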

Finally, let's ask it a question.

In [19]:
answer_result = pipe.run("Who is the main villain in Lord of the Rings?", params={
    "Retriever": {
        "top_k": 10
    },
    "Reranker": {
        "top_k": 1
    },
    "generation_kwargs":{
        "max_length": 10,
        "do_sample": False,
    }
})

print(f"Answer: {answer_result['answers'][0].answer}")

Answer:  Sauron


In [20]:
answer_result = pipe.run("What crimes did Sauron commit?", params={
    "Retriever": {
        "top_k": 10
    },
    "Reranker": {
        "top_k": 1
    },
    "generation_kwargs":{
        "max_length": 10,
        "do_sample": False,
    }
})

print(f"Answer: {answer_result['answers'][0].answer}")

Answer: 
