### Mistral 7B
#### RAG with LlamaIndex - Nvidia CUDA + WSL (Windows Subsystem for Linux) + Word documents + Local LLM

This notebook demonstrates Retrieval Augmented Generation (RAG) with LlamaIndex, running a local LLM on an Nvidia GPU via CUDA under Windows Subsystem for Linux (WSL).

See the [README.md](README.md) file for help on how to run this.

#### 1. Prepare LlamaIndex for use

In [1]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader


#### 2. Load the Word document(s)

Note: The sample documents contain a fictitious story about Thundertooth, a dinosaur who has travelled to the future. Thanks, ChatGPT!

In [2]:
documents = SimpleDirectoryReader("./Data/").load_data()
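
Before loading, it can help to confirm that the data folder actually contains Word documents. The following is a minimal sketch using only the standard library; `list_word_documents` is a hypothetical helper (not part of LlamaIndex), and it assumes the documents sit directly inside `./Data/` with a `.docx` extension.

```python
from pathlib import Path


def list_word_documents(data_dir="./Data/"):
    """Return the .docx files directly inside data_dir, sorted by name.

    Hypothetical helper for illustration; returns an empty list if the
    folder does not exist.
    """
    folder = Path(data_dir)
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.iterdir() if p.suffix.lower() == ".docx")


word_files = list_word_documents()
print(f"Found {len(word_files)} Word document(s)")
for path in word_files:
    print(" -", path.name)
```

If this reports zero documents, check the folder path before running the loading cell.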

#### 3. Instantiate the model

In [3]:
import torch

from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import messages_to_prompt, completion_to_prompt
llm = LlamaCPP(
    model_url=None, # We'll load locally.
    model_path='./Models/mistral-7b-instruct-v0.1.Q6_K.gguf', # 6-bit model
    temperature=0.1,
    max_new_tokens=1024, # Increasing to support longer responses
    context_window=8192, # Mistral7B has an 8K context-window
    generate_kwargs={},
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 40}, # 40 was a good amount of layers for the RTX 3090, you may need to decrease yours if you have less VRAM than 24GB
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True
)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ./Models/mistral-7b-instruct-v0.1.Q6_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:               
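
The `llama_model_loader` lines above list the model's metadata as key/value pairs. As a rough sketch, they can be pulled into a dictionary for inspection. This assumes the log text has been copied into a string and that the lines follow the layout shown above; `parse_gguf_metadata` is a hypothetical helper, and the layout may differ between llama.cpp versions.

```python
import re

# Matches lines such as:
# llama_model_loader: - kv   2:   llama.context_length u32   = 32768
KV_LINE = re.compile(r"llama_model_loader: - kv\s+\d+:\s+(\S+)\s+(\S+)\s+=\s+(.*)")


def parse_gguf_metadata(log_text):
    """Collect the key/value metadata lines from the loader output."""
    metadata = {}
    for line in log_text.splitlines():
        match = KV_LINE.search(line)
        if match:
            key, _value_type, value = match.groups()
            metadata[key] = value.strip()
    return metadata


# Sample lines copied from the output above
sample_log = """\
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
"""

metadata = parse_gguf_metadata(sample_log)
print("context length:", metadata["llama.context_length"])
print("block count:", metadata["llama.block_count"])
```

Values are kept as strings; lines that are cut off before the key (like the last one above) are simply skipped.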

#### 4. Checkpoint

Are you running on the GPU? Near the top, the output above should include something like:
> ggml_init_cublas: found 1 CUDA devices:

And near the bottom of the full output there should be:
> llm_load_tensors: using CUDA for GPU acceleration
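
If you have saved or pasted the loader output into a string, the two checks above can be automated. This is a minimal sketch; `missing_gpu_markers` is a hypothetical helper, and the marker strings are taken from the output shown in this notebook, so they may differ for other llama.cpp builds.

```python
# Marker strings copied from this notebook's output (assumed, version dependent)
GPU_MARKERS = (
    "ggml_init_cublas: found",
    "llm_load_tensors: using CUDA for GPU acceleration",
)


def missing_gpu_markers(log_text):
    """Return the expected GPU markers that do NOT appear in log_text."""
    return [marker for marker in GPU_MARKERS if marker not in log_text]


sample_log = """\
ggml_init_cublas: found 1 CUDA devices:
llm_load_tensors: using CUDA for GPU acceleration
"""

missing = missing_gpu_markers(sample_log)
if missing:
    print("GPU markers missing:", missing)
else:
    print("Both GPU markers found")
```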

#### 5. Embeddings

Convert your source document text into embeddings.

The embedding model comes from Hugging Face; this one performs well.

> https://huggingface.co/thenlper/gte-large


In [4]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="thenlper/gte-large", cache_folder=None)

#### 6. Prompt Template

Prompt template for Mistral:

>     <s>[INST] {prompt} [/INST]
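
As a small illustration, the template above can be filled in with a plain string format. `format_mistral_prompt` is a hypothetical helper name used only for this sketch.

```python
MISTRAL_TEMPLATE = "<s>[INST] {prompt} [/INST]"


def format_mistral_prompt(prompt):
    """Wrap the given text in the Mistral instruction template."""
    return MISTRAL_TEMPLATE.format(prompt=prompt.strip())


print(format_mistral_prompt("Who was the main protagonist?"))
# <s>[INST] Who was the main protagonist? [/INST]
```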

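#### 7. Settings (formerly Service Context)

Configure the global `Settings` with our LLM and embedding model, and set the chunk size (in tokens) used when splitting the documents before they are embedded.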

In [5]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 256  # Number of tokens in each chunk
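
To give a feel for what a chunk size of 256 means, here is a simplified sketch of fixed-size chunking. It is not LlamaIndex's implementation: whitespace-separated words stand in for tokens, and the real splitter counts tokenizer tokens and may also respect sentence boundaries and overlap, none of which is modelled here. `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens, chunk_size=256):
    """Split a list of tokens into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# 600 stand-in "tokens" (words) split into chunks of 256
words = ("Thundertooth " * 600).split()
chunks = chunk_tokens(words, chunk_size=256)
print([len(chunk) for chunk in chunks])
# [256, 256, 88]
```

Smaller chunks give more precise citations but less context per chunk; larger chunks do the opposite.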

#### 8. Index documents

In [6]:
index = VectorStoreIndex.from_documents(documents)

#### 9. Query Engine

Create a query engine, specifying how many citations we want to get back from the searched text (in this case 3).

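`DB_DOC_ID_KEY` is defined as a key for looking up the original document. It is not used further in the cells shown here; the filename printed with each citation is read from the node metadata (`file_name`).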

In [7]:
from llama_index.core.query_engine import CitationQueryEngine
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,
    # here we can control how granular citation sources are, the default is 512
    citation_chunk_size=256,
)

# For citations we get the document info
DB_DOC_ID_KEY = "db_document_id"
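
Conceptually, `similarity_top_k=3` means the three chunks whose embeddings are most similar to the query embedding are retrieved. The sketch below illustrates that idea with toy two-dimensional vectors and cosine similarity. It is an assumption that cosine similarity matches what the vector store uses; the function names and vectors are made up for illustration, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Return the names of the k chunks most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _score, name in scored[:k]]


# Toy embeddings (hypothetical values)
toy_chunks = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.9, 0.1],
    "chunk_c": [0.0, 1.0],
    "chunk_d": [-1.0, 0.0],
}

print(top_k_chunks([1.0, 0.0], toy_chunks, k=3))
# ['chunk_a', 'chunk_b', 'chunk_c']
```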

#### 10. Prompt and Response function

Pass in a question, get a response back.

IMPORTANT: The prompt is embedded in the question; adjust it to match how you want the LLM to behave and what you want it to do.

In [8]:
def RunQuestion(questionText):
    queryQuestion = "<s>[INST] You are a technology specialist. Answer questions in a positive, helpful and empathetic way. Answer the following question: " + questionText + " [/INST]"

    response = query_engine.query(queryQuestion)

    return response

#### 11. Questions to test with

In [9]:
TestQuestions = [
    "Summarise the story for me",
    "Who was the main protagonist?",
    "Did they have any children? If so, what were their names?",
    "Did anything eventful happen?",
]

#### 12. Run Questions through model (this can take a while) and see citations

Runs each test question and saves the question/answer pair to a list for output in the last step.

Note: Citations are the source documents used and the text the response is based on. This is important for RAG so you can reference these documents for the user, and to ensure it's utilising the right documents.

In [10]:
qa_pairs = []

# Use counter names other than `index` so the VectorStoreIndex created in step 8 is not overwritten
for question_number, question in enumerate(TestQuestions, start=1):
    question = question.strip() # Clean up

    print(f"\n{question_number}/{len(TestQuestions)}: {question}")

    response = RunQuestion(question) # Query and get the response

    qa_pairs.append((question, str(response).strip())) # Add to our output list

    # Display the citations
    for source_number, node in enumerate(response.source_nodes, start=1):
        print(f"{source_number}/{len(response.source_nodes)}: |{node.node.metadata['file_name']}| {node.node.get_text()}")

    # Uncomment the following line if you want to test just the first question
    # break


1/4: Summarise the story for me



llama_print_timings:        load time =     164.72 ms
llama_print_timings:      sample time =      50.72 ms /   191 runs   (    0.27 ms per token,  3766.00 tokens per second)
llama_print_timings: prompt eval time =     375.64 ms /  1084 tokens (    0.35 ms per token,  2885.71 tokens per second)
llama_print_timings:        eval time =    2749.26 ms /   190 runs   (   14.47 ms per token,    69.11 tokens per second)
llama_print_timings:       total time =    3551.24 ms /  1274 tokens
Llama.generate: prefix-match hit


1/3: |Thundertooth Part 2.docx| Source 1:
Thundertooth



Embraced by the futuristic city and its inhabitants, Thundertooth found a sense of purpose beyond merely satisfying his hunger. Inspired by the advanced technology surrounding him, he decided to channel his creativity into something extraordinary. With the help of the city's brilliant engineers, Thundertooth founded a one-of-a-kind toy factory that produced amazing widgets – magical, interactive toys that captivated the hearts of both children and adults alike.



Thundertooth's toy factory became a sensation, and its creations were highly sought after. The widgets incorporated cutting-edge holographic displays, levitation technology, and even the ability to change shapes and colors with a mere thought. Children across the city rejoiced as they played with these incredible toys that seemed to bring their wildest fantasies to life.



As the years passed, Thundertooth's life took a heartwarming turn. He met a kind and intelligent


llama_print_timings:        load time =     164.72 ms
llama_print_timings:      sample time =       3.72 ms /    14 runs   (    0.27 ms per token,  3765.47 tokens per second)
llama_print_timings: prompt eval time =     214.88 ms /   566 tokens (    0.38 ms per token,  2634.02 tokens per second)
llama_print_timings:        eval time =     186.46 ms /    13 runs   (   14.34 ms per token,    69.72 tokens per second)
llama_print_timings:       total time =     433.60 ms /   579 tokens
Llama.generate: prefix-match hit


1/3: |Thundertooth Part 2.docx| Source 1:
Thundertooth



Embraced by the futuristic city and its inhabitants, Thundertooth found a sense of purpose beyond merely satisfying his hunger. Inspired by the advanced technology surrounding him, he decided to channel his creativity into something extraordinary. With the help of the city's brilliant engineers, Thundertooth founded a one-of-a-kind toy factory that produced amazing widgets – magical, interactive toys that captivated the hearts of both children and adults alike.



Thundertooth's toy factory became a sensation, and its creations were highly sought after. The widgets incorporated cutting-edge holographic displays, levitation technology, and even the ability to change shapes and colors with a mere thought. Children across the city rejoiced as they played with these incredible toys that seemed to bring their wildest fantasies to life.



As the years passed, Thundertooth's life took a heartwarming turn. He met a kind and intelligent


llama_print_timings:        load time =     164.72 ms
llama_print_timings:      sample time =      31.32 ms /   115 runs   (    0.27 ms per token,  3672.24 tokens per second)
llama_print_timings: prompt eval time =     264.52 ms /   843 tokens (    0.31 ms per token,  3186.92 tokens per second)
llama_print_timings:        eval time =    1648.03 ms /   114 runs   (   14.46 ms per token,    69.17 tokens per second)
llama_print_timings:       total time =    2167.21 ms /   957 tokens
Llama.generate: prefix-match hit


1/3: |Thundertooth Part 2.docx| Source 1:
Thundertooth's toy factory became a sensation, and its creations were highly sought after. The widgets incorporated cutting-edge holographic displays, levitation technology, and even the ability to change shapes and colors with a mere thought. Children across the city rejoiced as they played with these incredible toys that seemed to bring their wildest fantasies to life.



As the years passed, Thundertooth's life took a heartwarming turn. He met a kind and intelligent dinosaur named Seraphina, and together they started a family. Thundertooth and Seraphina were blessed with four children, each with unique characteristics that mirrored the diversity of their modern world.



Lumina: The eldest of Thundertooth's children, Lumina inherited her mother's intelligence and her father's sense of wonder. With sparkling scales that emitted a soft glow, Lumina had the ability to generate light at will. She became fascinated with technology, often spending


llama_print_timings:        load time =     164.72 ms
llama_print_timings:      sample time =      34.71 ms /   128 runs   (    0.27 ms per token,  3687.38 tokens per second)
llama_print_timings: prompt eval time =     270.67 ms /   862 tokens (    0.31 ms per token,  3184.74 tokens per second)
llama_print_timings:        eval time =    1795.24 ms /   127 runs   (   14.14 ms per token,    70.74 tokens per second)
llama_print_timings:       total time =    2348.92 ms /   989 tokens


1/3: |Thundertooth Part 2.docx| Source 1:
Thundertooth



Embraced by the futuristic city and its inhabitants, Thundertooth found a sense of purpose beyond merely satisfying his hunger. Inspired by the advanced technology surrounding him, he decided to channel his creativity into something extraordinary. With the help of the city's brilliant engineers, Thundertooth founded a one-of-a-kind toy factory that produced amazing widgets – magical, interactive toys that captivated the hearts of both children and adults alike.



Thundertooth's toy factory became a sensation, and its creations were highly sought after. The widgets incorporated cutting-edge holographic displays, levitation technology, and even the ability to change shapes and colors with a mere thought. Children across the city rejoiced as they played with these incredible toys that seemed to bring their wildest fantasies to life.



As the years passed, Thundertooth's life took a heartwarming turn. He met a kind and intelligent
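
The `llama_print_timings` lines in the output above report a duration in milliseconds alongside a token (or run) count. The tokens-per-second figure follows directly from those two numbers, as this small sketch shows using the first `eval time` line (2749.26 ms for 190 runs). `tokens_per_second` is a hypothetical helper name.

```python
def tokens_per_second(elapsed_ms, token_count):
    """Convert a duration in milliseconds and a token count to tokens/second."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return token_count / (elapsed_ms / 1000.0)


# First "eval time" line above: 2749.26 ms / 190 runs
rate = tokens_per_second(2749.26, 190)
print(f"{rate:.2f} tokens per second")
# 69.11 tokens per second
```

Small differences from the logged figures are possible on other lines because the log prints rounded durations.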

#### 13. Output responses

In [11]:
# Use a counter name other than `index` so the VectorStoreIndex is not overwritten
for pair_number, (question, answer) in enumerate(qa_pairs, start=1):
    print(f"{pair_number}/{len(qa_pairs)} {question}\n\n{answer}\n\n--------\n")

1/4 Summarise the story for me

Thundertooth is a prehistoric dinosaur who finds himself in a futuristic city where he meets Mayor Grace. Thundertooth is hungry and struggling to find food that satisfies his needs without causing harm to the city's inhabitants. Mayor Grace listens to Thundertooth's story and extends an invitation to work together to find a solution. Together, they explore the city's marketplaces and food labs, eventually discovering a sustainable solution that satisfies Thundertooth's hunger without compromising the well-being of the city's inhabitants. Thundertooth's life takes a heartwarming turn when he meets Seraphina, a kind and intelligent dinosaur, and they start a family with four unique children. Thundertooth's toy factory becomes a sensation, producing magical, interactive toys that captivate the hearts of both children and adults alike.

--------

2/4 Who was the main protagonist?

The main protagonist in the story is Thundertooth.

--------

3/4 Did they ha