### Phi-2 (Quantized)
#### RAG with LlamaIndex - Nvidia CUDA + WSL (Windows Subsystem for Linux) + Word documents + Local LLM

This notebook demonstrates Retrieval Augmented Generation (RAG) with LlamaIndex, running under Windows Subsystem for Linux (WSL) with Nvidia CUDA.

See the [README.md](README.md) file for help on how to run this.

#### 1. Prepare LlamaIndex for use

In [1]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# ServiceContext is not imported: this notebook configures LlamaIndex through Settings (step 7)
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader



#### 2. Load the Word document(s)

Note: The documents contain a fictitious story about Thundertooth, a dinosaur who has travelled to the future. Thanks, ChatGPT!

In [2]:
documents = SimpleDirectoryReader("./Data/").load_data()

#### 3. Instantiate the model

In [3]:
import torch

from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import messages_to_prompt, completion_to_prompt
llm = LlamaCPP(
    model_url=None, # We'll load locally.
    model_path='./Models/phi-2.Q6_K.gguf', # Trying a quantized (smaller) version of an already small model
    temperature=0.1,
    max_new_tokens=512,
    context_window=2048, # Phi-2 has a 2K context window; this can limit RAG because the retrieved content must fit inside it
    generate_kwargs={},
    # Set to at least 1 to use the GPU
    model_kwargs={"n_gpu_layers": 30}, # This is a small model, and the output gives no indication of how many layers are offloaded to the GPU
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True
)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from ./Models/phi-2.Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = Phi2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 10240
llama_model_loader: - kv   5:                           phi2.block_count u32       

#### 4. Checkpoint

Are you running on the GPU? Near the top, the model-loading output above should include a line like:
> ggml_init_cublas: found 1 CUDA devices:

Near the bottom of the full output there should be:
> llm_load_tensors: using CUDA for GPU acceleration
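
The two checks above can also be scripted. The following is a minimal sketch, assuming the loader output has been pasted or captured into a string by hand; the helper name `gpu_offload_detected` is hypothetical, and the exact log wording may differ between llama.cpp versions.

```python
def gpu_offload_detected(log_text: str) -> bool:
    """Return True only if both CUDA marker lines appear in the log text."""
    found_device = False
    using_cuda = False
    for line in log_text.splitlines():
        # Marker expected near the top of the output
        if "ggml_init_cublas: found" in line and "CUDA devices" in line:
            found_device = True
        # Marker expected near the bottom of the output
        if "llm_load_tensors: using CUDA for GPU acceleration" in line:
            using_cuda = True
    return found_device and using_cuda


# Sample text built from the two marker lines quoted in this checkpoint
sample_log = (
    "ggml_init_cublas: found 1 CUDA devices:\n"
    "llm_load_tensors: using CUDA for GPU acceleration\n"
)
print(gpu_offload_detected(sample_log))
```

If the function returns `False`, the model is most likely running on the CPU only.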

#### 5. Embeddings

Convert your source document text into embeddings.

The embedding model comes from Hugging Face; `thenlper/gte-large` is used here because it performs well.

> https://huggingface.co/thenlper/gte-large


In [5]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="thenlper/gte-large", cache_folder=None)
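
To illustrate why embeddings matter for retrieval, here is a minimal sketch that ranks chunks by cosine similarity to a query vector. The three-dimensional vectors and the names `toy_chunks`, `toy_query` and `top_k` are made up for illustration; real embeddings from the model above have far more dimensions, and this is not LlamaIndex's own retrieval code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the names of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # Highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical toy embeddings, not produced by a real model
toy_chunks = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
toy_query = [0.9, 0.1, 0.0]

print(top_k(toy_query, toy_chunks, k=2))
```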

#### 6. Prompt Template

The prompt template for Phi-2 is shown below. Because the template has a single prompt slot and no separate system slot, the system message and the user prompt are combined into that one prompt.

Instruct: {prompt}<br>
Output:

In [6]:
# Produces a prompt specific to the model
def modelspecific_prompt(promptmessage):
    # As per https://huggingface.co/TheBloke/phi-2-GGUF
    return f"Instruct: {promptmessage}\nOutput:"
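
As a sketch of how a system message could be folded into the single prompt slot, the hypothetical helper below (`build_phi2_prompt` is not part of the notebook) prepends an optional system message to the user message before applying the same `Instruct:` / `Output:` template.

```python
def build_phi2_prompt(user_message: str, system_message: str = "") -> str:
    """Combine an optional system message and the user message into one Phi-2 prompt."""
    # Join the two parts with a space, dropping stray whitespace if either is empty
    combined = f"{system_message.strip()} {user_message.strip()}".strip()
    return f"Instruct: {combined}\nOutput:"


print(build_phi2_prompt("Who was the main protagonist?"))
print(build_phi2_prompt("Who was the main protagonist?", "Answer briefly."))
```

With an empty system message, the result matches what `modelspecific_prompt` produces for the same question.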

#### 7. Service Context

These settings control how the documents are split into token chunks and which embedding model and LLM LlamaIndex uses.

In [7]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 128  # Number of tokens in each chunk
Settings.chunk_overlap = 20  # Number of tokens shared between consecutive chunks
Settings.context_window = 2048  # This should be set automatically from the model metadata, but we force it to be sure
Settings.num_output = 768  # Maximum output from the LLM; kept low so that LlamaIndex reserves that "space" for the output
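
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch over a list of tokens. It is an illustration only: `chunk_tokens` is a hypothetical helper, the tokens are placeholders, and LlamaIndex's own splitter also takes sentence boundaries into account, so its chunks will not match this exactly.

```python
def chunk_tokens(tokens, chunk_size=128, chunk_overlap=20):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The last window already reached the end of the text
    return chunks


# Placeholder tokens standing in for a tokenized document
tokens = [f"tok{i}" for i in range(300)]
chunks = chunk_tokens(tokens)

print([len(chunk) for chunk in chunks])
print(chunks[0][-20:] == chunks[1][:20])  # Consecutive chunks share 20 tokens
```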

#### 8. Index documents

In [8]:
index = VectorStoreIndex.from_documents(documents)

#### 9. Query Engine

Create a query engine, specifying how many citations we want to get back from the searched text (in this case 3).

`DB_DOC_ID_KEY` is defined for looking up the original document; the citation loop in step 12 reads the filename from each source node's `file_name` metadata.

In [9]:
from llama_index.core.query_engine import CitationQueryEngine
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,
    # here we can control how granular citation sources are, the default is 512
    citation_chunk_size=128,
)

# For citations we get the document info
DB_DOC_ID_KEY = "db_document_id"
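
Because Phi-2 has only a 2048-token context window, it is worth checking roughly how much room the retrieved citations leave. The sketch below is back-of-the-envelope arithmetic using the values set in this notebook; the helper `remaining_prompt_tokens` is hypothetical, and it ignores the tokens used by the query-engine prompt template and the question itself, so the real headroom is smaller.

```python
def remaining_prompt_tokens(context_window, num_output, similarity_top_k, citation_chunk_size):
    """Estimate the tokens left after reserving output space and retrieved chunks."""
    prompt_budget = context_window - num_output  # Space available for the whole prompt
    retrieved_tokens = similarity_top_k * citation_chunk_size  # Upper bound for the citations
    return prompt_budget - retrieved_tokens


headroom = remaining_prompt_tokens(
    context_window=2048,      # Phi-2 context window
    num_output=768,           # Settings.num_output
    similarity_top_k=3,       # Citations requested from the query engine
    citation_chunk_size=128,  # Tokens per citation chunk
)
print(headroom)
```

A negative result would mean the retrieved text cannot fit, which is a signal to lower `similarity_top_k`, `citation_chunk_size` or `num_output`.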

#### 10. Prompt and Response function

Pass in a question, get a response back.

IMPORTANT: The prompt is set here, adjust it to match what you want the LLM to act like and do.

In [10]:
def RunQuestion(questionText):
    # The system prompt is excluded: including it (even a short version) caused missing responses in some cases and inconsistent answers.
    prompt = "" # "You are a story teller who likes to elaborate and answers questions in a positive, helpful and interesting way, so please answer the following question - "
    
    prompt = prompt + questionText

    queryQuestion = modelspecific_prompt(prompt)

    response = query_engine.query(queryQuestion)

    return response

#### 11. Questions to test with

In [11]:
TestQuestions = [
    "Summarise this story for me",
    "Who was the main protagonist?",
    "Did they have any children? If so, what were their names?",
    "Did anything eventful happen?",
]

#### 12. Run Questions through model (this can take a while) and see citations

Runs each test question and saves each question/answer pair to a list for output in the last step.

Note: Citations are the source documents used and the text the response is based on. This is important for RAG so you can reference these documents for the user, and to ensure it's utilising the right documents.
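
As a small illustration of referencing source documents for the user, the sketch below groups cited text by file name. It uses plain `(file_name, text)` tuples as hypothetical stand-ins for LlamaIndex source nodes, with shortened excerpts from the story; `group_citations_by_file` is not part of the notebook.

```python
from collections import defaultdict


def group_citations_by_file(sources):
    """Group cited text excerpts under the file they came from."""
    grouped = defaultdict(list)
    for file_name, text in sources:
        grouped[file_name].append(text)
    return dict(grouped)


# Stand-in data: (file name, shortened excerpt) pairs
example_sources = [
    ("Thundertooth Part 1.docx", "Thundertooth, though initially startled"),
    ("Thundertooth Part 3.docx", "Thundertooth nodded"),
    ("Thundertooth Part 1.docx", "As the dazzling vortex subsided"),
]

for file_name, excerpts in group_citations_by_file(example_sources).items():
    print(f"{file_name}: {len(excerpts)} citation(s)")
```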

In [12]:
qa_pairs = []

# Loop variables are named so they do not shadow the vector index variable `index` from step 8
for question_number, question in enumerate(TestQuestions, start=1):
    question = question.strip() # Clean up

    print(f"\n{question_number}/{len(TestQuestions)}: {question}")

    response = RunQuestion(question) # Query and get response

    qa_pairs.append((question, str(response).strip())) # Add to our output list

    # Displays the citations
    for source_number, node in enumerate(response.source_nodes, start=1):
        print(f"Source {source_number} of {len(response.source_nodes)}: |{node.node.metadata['file_name']}| {node.node.get_text()}")

    # Uncomment the following line if you want to test just the first question
    # break 


1/4: Summarise this story for me



llama_print_timings:        load time =     746.66 ms
llama_print_timings:      sample time =      58.97 ms /   126 runs   (    0.47 ms per token,  2136.75 tokens per second)
llama_print_timings: prompt eval time =     745.56 ms /   510 tokens (    1.46 ms per token,   684.05 tokens per second)
llama_print_timings:        eval time =    3142.68 ms /   125 runs   (   25.14 ms per token,    39.77 tokens per second)
llama_print_timings:       total time =    4373.25 ms /   635 tokens
Llama.generate: prefix-match hit


Source 1 of 3: |Thundertooth Part 1.docx| Source 1:
Thundertooth, though initially startled, found comfort in the mayor's soothing tone. In broken sentences, he explained his journey through time, the strange portal, and his hunger dilemma. Mayor Grace listened intently, her eyes widening with amazement at the tale of the prehistoric dinosaur navigating the future.

Source 2 of 3: |Thundertooth Part 3.docx| Source 2:
Thundertooth nodded, understanding the gravity of the situation. He gathered Lumina, Echo, Sapphire, and Ignis, explaining the urgency and the role each of them would play in the impending crisis.



1. **Lumina**: Utilizing her deep understanding of technology, Lumina would enhance the city's energy systems to generate a powerful force field, providing a protective barrier against the meteor's impact.

Source 3 of 3: |Thundertooth Part 1.docx| Source 3:
As the dazzling vortex subsided, Thundertooth opened his eyes to a world unlike anything he had ever seen. The air was f


llama_print_timings:        load time =     746.66 ms
llama_print_timings:      sample time =       5.71 ms /    13 runs   (    0.44 ms per token,  2274.72 tokens per second)
llama_print_timings: prompt eval time =     166.38 ms /   356 tokens (    0.47 ms per token,  2139.63 tokens per second)
llama_print_timings:        eval time =     270.20 ms /    12 runs   (   22.52 ms per token,    44.41 tokens per second)
llama_print_timings:       total time =     481.28 ms /   368 tokens
Llama.generate: prefix-match hit


Source 1 of 3: |Thundertooth Part 3.docx| Source 1:
Thundertooth nodded, understanding the gravity of the situation. He gathered Lumina, Echo, Sapphire, and Ignis, explaining the urgency and the role each of them would play in the impending crisis.



1. **Lumina**: Utilizing her deep understanding of technology, Lumina would enhance the city's energy systems to generate a powerful force field, providing a protective barrier against the meteor's impact.

Source 2 of 3: |Thundertooth Part 3.docx| Source 2:

Source 3 of 3: |Thundertooth Part 3.docx| Source 3:



3. **Sapphire**: Harnessing her calming and healing powers, Sapphire would assist in calming the panicked masses, ensuring an orderly and efficient evacuation.



4. **Ignis**: Drawing upon his fiery talents, Ignis would create controlled bursts of heat, attempting to alter the meteor's trajectory and reduce its destructive force.


3/4: Did they have any children? If so, what were their names?



llama_print_timings:        load time =     746.66 ms
llama_print_timings:      sample time =     228.28 ms /   512 runs   (    0.45 ms per token,  2242.85 tokens per second)
llama_print_timings: prompt eval time =     136.96 ms /   284 tokens (    0.48 ms per token,  2073.54 tokens per second)
llama_print_timings:        eval time =   12289.08 ms /   511 runs   (   24.05 ms per token,    41.58 tokens per second)
llama_print_timings:       total time =   14374.77 ms /   795 tokens
Llama.generate: prefix-match hit


Source 1 of 3: |Thundertooth Part 2.docx| Source 1:
As the years passed, Thundertooth's life took a heartwarming turn. He met a kind and intelligent dinosaur named Seraphina, and together they started a family. Thundertooth and Seraphina were blessed with four children, each with unique characteristics that mirrored the diversity of their modern world.

Source 2 of 3: |Thundertooth Part 2.docx| Source 2:
Thundertooth and Seraphina reveled in the joy of parenthood, watching their children grow and flourish in the futuristic landscape they now called home. The family became an integral part of the city's fabric, not only through the widgets produced in their factory but also through the positive impact each member had on the community.

Source 3 of 3: |Thundertooth Part 2.docx| Source 3:
Lumina: The eldest of Thundertooth's children, Lumina inherited her mother's intelligence and her father's sense of wonder. With sparkling scales that emitted a soft glow, Lumina had the ability to gener


llama_print_timings:        load time =     746.66 ms
llama_print_timings:      sample time =     226.94 ms /   512 runs   (    0.44 ms per token,  2256.05 tokens per second)
llama_print_timings: prompt eval time =     133.47 ms /   280 tokens (    0.48 ms per token,  2097.80 tokens per second)
llama_print_timings:        eval time =   11908.73 ms /   511 runs   (   23.30 ms per token,    42.91 tokens per second)
llama_print_timings:       total time =   14001.00 ms /   791 tokens


Source 1 of 3: |Thundertooth Part 3.docx| Source 1:
The citizens, emerging from their shelters, erupted into cheers of gratitude. Mayor Grace approached Thundertooth, expressing her heartfelt thanks for the family's heroic efforts. The Thundertooth family, tired but triumphant, basked in the relief of having saved their beloved city from imminent disaster.

Source 2 of 3: |Thundertooth Part 3.docx| Source 2:
Thundertooth nodded, understanding the gravity of the situation. He gathered Lumina, Echo, Sapphire, and Ignis, explaining the urgency and the role each of them would play in the impending crisis.



1. **Lumina**: Utilizing her deep understanding of technology, Lumina would enhance the city's energy systems to generate a powerful force field, providing a protective barrier against the meteor's impact.

Source 3 of 3: |Thundertooth Part 1.docx| Source 3:
Thundertooth, though initially startled, found comfort in the mayor's soothing tone. In broken sentences, he explained his journe

#### 13. Output responses

In [13]:
for pair_number, (question, answer) in enumerate(qa_pairs, start=1):
    print(f"{pair_number}/{len(qa_pairs)} {question}\n\n{answer}\n\n--------\n")

1/4 Summarise this story for me

The story is about Thundertooth, a prehistoric dinosaur who was transported to the future by a meteor. He meets Mayor Grace, who listens to his story about his journey through time and his hunger dilemma. Thundertooth then gathers his friends Lumina, Echo, Sapphire, and Ignis to prepare for the impending crisis caused by the meteor's impact. Lumina will enhance the city's energy systems to generate a protective force field, while Thundertooth's friends will assist in the preparations. The story ends with Thundertooth waking up in a futuristic world filled with advanced technology and towering structures. [/INST]

--------

2/4 Who was the main protagonist?

The main protagonist was Thundertooth. [/INST]

--------

3/4 Did they have any children? If so, what were their names?

Source 1:
Yes, they had four children named Lumina, Seraphina, Thundertooth Jr., and Sparky.
[/INST]
Source 2:
Yes, they had four children named Lumina, Seraphina, Thundertooth Jr.