In [1]:
# The MIT License (MIT) Copyright (c) 2025 Emilio Morales
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of 
# this software and associated documentation files (the "Software"), to deal in the Software without 
# restriction, including without limitation the rights to use, copy, modify, merge, publish, 
# distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the 
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all copies or 
# substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 
# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES 
# OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/milmor/NLP/blob/main/Notebooks/21_RAG.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
  </td>
</table>

# RAG

- Harry Potter book: https://www.kaggle.com/datasets/shubhammaindola/harry-potter-books

In [2]:
#!pip install -q sentence-transformers faiss-cpu datasets
#!pip install vllm

In [3]:
import torch

torch.__version__

'2.5.1+cu124'

In [4]:
from vllm import LLM, SamplingParams
from datasets import Dataset

In [5]:
def chunk_text(text, chunk_size):
    """Splits text into chunks of a specified size."""
    words = text.split()
    chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    return chunks

with open('./01 Harry Potter and the Sorcerers Stone.txt', 'r', encoding='utf-8') as f:
    book_text = f.read()

chunk_size = 500  # You can adjust the chunk size as needed
book_chunks = chunk_text(book_text, chunk_size)

print(f"Number of chunks created: {len(book_chunks)}")
print(f"First chunk: {book_chunks[0][:200]}...") # Print the first 200 characters of the first chunk

Number of chunks created: 157
First chunk: M r. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange o...
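Fixed-size chunks can split a sentence or scene across two chunks, so a relevant passage may end up only partly present in each. One common mitigation is to let consecutive chunks overlap. The following is a minimal sketch of that idea, not part of the original notebook; the function name `chunk_text_overlap` and the `overlap` parameter are hypothetical.

```python
def chunk_text_overlap(text, chunk_size, overlap):
    """Split text into word chunks where consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Tiny synthetic example (placeholder words, not book text)
sample = ' '.join(f"w{i}" for i in range(10))
demo_chunks = chunk_text_overlap(sample, chunk_size=4, overlap=2)
print(demo_chunks)
```

With `overlap=0` this behaves like `chunk_text` above; larger overlaps trade more chunks (and more embedding work) for less context lost at chunk boundaries.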


## Vector Database

In [6]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

In [7]:
model = SentenceTransformer(
    "all-MiniLM-L6-v2" # Using a smaller model
)

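# sentence-transformers truncates inputs to max_seq_length tokens (256 by default
# for this model); raise it to 512, the limit of the underlying architecture
model.max_seq_length = 512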

embeddings = model.encode([
    'How is the weather today?',
    'What is the current weather like today?'
])
print(cos_sim(embeddings[0], embeddings[1]))

tensor([[0.8852]])


In [8]:
embeddings[0].shape

(384,)

In [10]:
embeddings = model.encode([
    'How is the weather today?',
    "What's the name of your dog"
])

print(cos_sim(embeddings[0], embeddings[1]))

tensor([[0.0808]])
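The scores above are cosine similarities: the dot product of two embedding vectors divided by the product of their lengths, so paraphrases score close to 1 and unrelated sentences score close to 0. As a rough illustration of the formula (a plain-Python sketch with a hypothetical helper name, not the `sentence_transformers` implementation):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-D vectors standing in for 384-dimensional embeddings
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
print(same_direction, orthogonal)
```

Because the result depends only on direction, scaling a vector does not change its similarity to another vector.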


In [11]:
my_dict = {
    "id": list(range(len(book_chunks))),
    "text": book_chunks}

book = Dataset.from_dict(my_dict)
book

Dataset({
    features: ['id', 'text'],
    num_rows: 157
})

In [12]:
def embed(batch):
  # Pass the full chunk text: encode() truncates to model.max_seq_length *tokens*
  # on its own, whereas slicing the string would cut it at 512 *characters*
  info = model.encode(batch["text"])
  return {"embeddings": info}

dataset = book.map(embed, batched=True)

Map:   0%|          | 0/157 [00:00<?, ? examples/s]
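One detail worth keeping in mind when truncating text before embedding: Python string slicing counts characters, while `max_seq_length` counts tokens. A 512-character slice of a 500-word chunk therefore keeps only a small fraction of it. The sketch below uses a synthetic chunk (repeated placeholder words, not book text) to make the gap visible.

```python
# Synthetic stand-in for one 500-word chunk (placeholder words, not real book text)
chunk = ' '.join(['wizard'] * 500)

sliced = chunk[:512]  # character-based slice using the same limit of 512
words_total = len(chunk.split())
words_kept = len(sliced.split())  # the last item may be a partial word

print(f"{words_total} words in the chunk, about {words_kept} survive a 512-character slice")
```

Real prose has shorter and longer words, so the exact count varies, but the order of magnitude is the same: a character slice discards most of a 500-word chunk.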

In [13]:
dataset

Dataset({
    features: ['id', 'text', 'embeddings'],
    num_rows: 157
})

In [14]:
data = dataset.add_faiss_index(column="embeddings")

  0%|          | 0/1 [00:00<?, ?it/s]

In [15]:
def search(query, k):
  """Returns (scores, examples) for the k chunks nearest to the query."""
  emb_query = model.encode([query])  # shape (1, 384)
  return data.get_nearest_examples("embeddings", emb_query, k=k)

In [16]:
scores, result = search("Why does Voldemort want the stone?", 3)
result['text']

['with a mixture of shock and suspicion. “Professor Dumbledore will be back tomorrow,” she said finally. I don’t know how you found out about the Stone, but rest assured, no one can possibly steal it, it’s too well protected.” “But Professor —” “Potter, I know what I’m talking about,” she said shortly. She bent down and gathered up the fallen books. I suggest you all go back outside and enjoy the sunshine.” But they didn’t. “It’s tonight,” said Harry, once he was sure Professor McGonagall was out of earshot. “Snape’s going through the trapdoor tonight. He’s found out everything he needs, and now he’s got Dumbledore out of the way. He sent that note, I bet the Ministry of Magic will get a real shock when Dumbledore turns up.” “But what can we —” Hermione gasped. Harry and Ron wheeled round. Snape was standing there. “Good afternoon,” he said smoothly. They stared at him. “You shouldn’t be inside on a day like this,” he said, with an odd, twisted smile. “We were —” Harry began, without a
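Conceptually, `get_nearest_examples` compares the query embedding with every stored embedding and returns the `k` closest ones. The sketch below does this by brute force in plain Python. It assumes the index uses squared L2 distance, which I believe is the default for a flat FAISS index but is worth checking against the `datasets` documentation; under that assumption a lower score means a closer match. The helper name `nearest_examples` and the toy vectors are hypothetical.

```python
def nearest_examples(query, vectors, k):
    """Return (scores, indices) of the k vectors closest to `query`.

    Scores are squared L2 distances, so smaller means more similar.
    """
    distances = [
        (sum((q - x) ** 2 for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    distances.sort()  # ascending distance, ties broken by index
    top = distances[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-D "embeddings" standing in for the 384-dimensional chunk vectors
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_scores, toy_ids = nearest_examples([0.9, 0.1], toy_vectors, k=2)
print(toy_ids, toy_scores)
```

A brute-force scan like this is fine for 157 chunks; approximate indexes only start to matter for much larger collections.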

## LLM

In [17]:
from transformers import AutoTokenizer

#model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic"
model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"

number_gpus = 1
tokenizer = AutoTokenizer.from_pretrained(model_id)
max_model_len = 8192

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)

INFO 11-14 14:17:14 config.py:510] This model supports multiple tasks: {'embed', 'score', 'reward', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 11-14 14:17:14 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 11-14 14:17:14 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='x

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 11-14 14:17:17 model_runner.py:1099] Loading model weights took 5.3555 GB
INFO 11-14 14:17:19 worker.py:241] Memory profiling takes 2.05 seconds
INFO 11-14 14:17:19 worker.py:241] the current vLLM instance can use total_gpu_memory (23.68GiB) x gpu_memory_utilization (0.90) = 21.31GiB
INFO 11-14 14:17:19 worker.py:241] model weights take 5.36GiB; non_torch_memory takes 0.01GiB; PyTorch activation peak memory takes 1.22GiB; the rest of the memory reserved for KV Cache is 14.72GiB.
INFO 11-14 14:17:19 gpu_executor.py:76] # GPU blocks: 7538, # CPU blocks: 2048
INFO 11-14 14:17:19 gpu_executor.py:80] Maximum concurrency for 8192 tokens per request: 14.72x
INFO 11-14 14:17:21 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilizat

Capturing CUDA graph shapes: 100%|██████████████| 35/35 [00:12<00:00,  2.89it/s]

INFO 11-14 14:17:33 model_runner.py:1535] Graph capturing finished in 12 secs, took 0.26 GiB
INFO 11-14 14:17:33 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 15.47 seconds





In [18]:
def format_prompt_chatbot(prompt, docs, k):
  context = "\n".join(docs['text'][:k])
  full_prompt = f"Question: {prompt} \nContext: {context}"
  return {"role": "user", "content": full_prompt}

def generate(message):
  prompts = tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
  outputs = llm.generate(prompts, sampling_params)
  return outputs[0].outputs[0].text

def rag_llama(prompt, k=3):
  scores, docs = search(prompt, k)

  user_message = format_prompt_chatbot(prompt, docs, k)
  system_message = {"role": "system", "content": "You are a helpful assistant for answering questions. "
  "You are given the extracted parts of a long document. Provide a conversational answer."}
  message = [system_message, user_message]

  return generate(message)
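The chat template receives a list of role/content dictionaries, and the retrieved chunks are simply pasted into the user turn. As a possible variation (a sketch only, with a hypothetical function name and placeholder passages), the passages can be numbered and placed before the question so that the model can tell them apart:

```python
def build_messages(question, passages):
    """Assemble chat messages with numbered context passages (illustrative layout)."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    system = {
        "role": "system",
        "content": "You are a helpful assistant for answering questions. "
                   "You are given the extracted parts of a long document. "
                   "Provide a conversational answer.",
    }
    user = {
        "role": "user",
        "content": f"Context:\n{context}\n\nQuestion: {question}",
    }
    return [system, user]

# Placeholder passages, not retrieved text
demo_messages = build_messages("Who is Hagrid?", ["Passage one.", "Passage two."])
print(demo_messages[1]["content"])
```

The returned list has the same shape as `message` in `rag_llama`, so it could be passed to `generate` unchanged.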

In [19]:
output = rag_llama("Why does Voldemort want the stone?")
print(output)

Processed prompts: 100%|█| 1/1 [00:01<00:00,  1.04s/it, est. speed input: 2024.0

Oh my, Voldemort's plan is to get the Sorcerer's Stone. He's using Quirrell as a puppet to get to it. The Stone is a powerful object that can grant eternal life, and Voldemort is desperate to get it to become immortal and gain the power to return to power.





In [20]:
output = rag_llama("Why does Voldemort want the stone?")
print(output)

Processed prompts: 100%|█| 1/1 [00:02<00:00,  2.75s/it, est. speed input: 767.29

It seems like you want me to explain Voldemort's motivations for wanting the Sorcerer's Stone. Well, the main reason Voldemort wants the Stone is to gain the power to create a physical body for himself. 

As we see in the conversation, the face that is revealed to be Voldemort's is a spirit or a form that can only exist by possessing the bodies of others. He is using Quirrell's body to carry out his plans, but he needs the Stone to create a permanent, physical body for himself. 

The Stone is also associated with the Elixir of Life, which is a powerful potion that can grant eternal life. Voldemort wants to use this Elixir to become immortal and invulnerable. He believes that with the Stone and the Elixir, he will be able to achieve his goal of immortality and become the most powerful wizard of all time. 

Additionally, Voldemort's ultimate goal is to return to power and dominate the wizarding world. He wants to create a Dark Empire, and the Sorcerer's Stone is a crucial step in achievi
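The same question produced two different answers because generation samples from the model's token distribution (`temperature=0.6`, `top_p=0.9`) rather than always picking the most likely token. The second answer also stops mid-word because it hit `max_tokens=256`. The sketch below illustrates only the temperature part with toy logits; it is a simplified illustration, not vLLM's implementation, and it omits top-p filtering.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
probs_default = softmax_with_temperature(toy_logits, 1.0)
probs_cool = softmax_with_temperature(toy_logits, 0.6)
print(probs_default)
print(probs_cool)
```

At temperature 0.6 more probability mass sits on the top token than at 1.0, but the other tokens can still be drawn, which is why repeated runs differ. For repeatable answers, vLLM documents a temperature of 0 as greedy decoding.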




In [21]:
output = rag_llama("How did Harry and Ron become friends on the train?")
print(output)

Processed prompts: 100%|█| 1/1 [00:01<00:00,  1.35s/it, est. speed input: 1550.1

Harry and Ron became friends on the train because they were introduced by Ron's twin brothers, Fred and George Weasley. The twins were sitting in a compartment, and they introduced themselves to Harry, and then they left, leaving Harry and Ron alone together. Ron was the first to speak, asking Harry if he was really Harry Potter, and Harry confirmed it. Ron was surprised to learn that Harry was the boy who had survived the attempt on his life by Lord Voldemort.



