In [None]:
!pip install chromadb
!pip install -U bitsandbytes
!pip install llama-cpp-python
!pip install rank_bm25 nltk

# Objectives

Build a simple RAG pipeline for question answering based on a lightweight quantized Llama 3.2 1B model. The goal is to accurately answer questions about the Warhammer 40K rules.

To simplify the workflow, I have already prepared the textual data in a separate notebook. As always, the parsing is not perfect and some artifacts remain.

## Evaluation

The evaluation will be broken down by component:
- the LLM and prompting engine
- the vector DB/approximate kNN
- the overall quality of the full RAG function
- the quality and readability of the code.

## Overview

First, we will import Llama 3.2, try some templating, and chat with the model.

Secondly, we will experiment with [ChromaDB](https://docs.trychroma.com/getting-started) and build a first RAG.

Finally, we will be using the headers and BM25 to try and improve the retriever.

# Imports

In [None]:
import chromadb  # store and query embedding data (numeric vectors)
import json
import uuid  # generate universally unique identifiers (UUIDs), to guarantee that each record in the database is unique

from llama_cpp import Llama
from transformers import AutoModelForCausalLM, AutoTokenizer
from jinja2 import Template  # render dynamic text (HTML files, for example) from templates containing variables and expressions

from rank_bm25 import BM25Okapi
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
nltk.download('stopwords')


# Large Language Model

## CPU implementation

Use the llama-cpp `from_pretrained` function to load an LLM from the ["bartowski/Llama-3.2-1B-Instruct-GGUF"](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF) collection on the CPU.

It can be used as a fallback if your GPU credit is exhausted.

**Clean the output and remove special tokens and the input prompt from the answer**
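
The cleaning step requested here could look like the sketch below. It is only an illustration: `clean_completion` and `SPECIAL_TOKEN_PATTERN` are hypothetical names, not part of llama-cpp, and the regular expression assumes that special tokens follow the `<|...|>` shape used by Llama 3.

```python
import re

# Assumption: special tokens look like <|eot_id|> or <|start_header_id|>.
SPECIAL_TOKEN_PATTERN = re.compile(r"<\|[^|>]+\|>")


def clean_completion(text, prompt=""):
    """Strip an echoed prompt and any special tokens from raw model output."""
    # Drop the prompt if the model output starts by repeating it.
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    # Remove every remaining special token, then trim surrounding whitespace.
    text = SPECIAL_TOKEN_PATTERN.sub("", text)
    return text.strip()


raw_output = " An impasta!<|eot_id|>"
print(clean_completion(raw_output))
```

The helper would typically be applied to `response['choices'][0]['text']` before displaying the answer.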

In [None]:

llm = Llama.from_pretrained(
    repo_id="bartowski/Llama-3.2-1B-Instruct-GGUF",
    filename="*Q8_0.gguf",
    verbose=False,
    n_ctx=4096,
)

## Initial LLM experiments

### Prompt Template
Create a Jinja2 ``Template`` that adds the Llama 3.2 special tokens (the same as Llama 3.1) in order to optionally set a role and wrap a user prompt.

In [None]:
prompt_template = Template(
    """
    <|begin_of_text|>
    {% if role %}
    <|start_header_id|>system<|end_header_id|> {{ role }} <|eot_id|>
    {% endif %}
    <|start_header_id|>user<|end_header_id|>
    {% block task %}{{ input }}{% endblock %}
    <|eot_id|>
    <|start_header_id|>assistant<|end_header_id|>
    """
)

### Play with the LLM

Try to trigger different model behaviours by changing the role for the same question. For example, use the prompt 'tell me a joke !' with different roles.

In [None]:
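# Named input_1 rather than input, so that the built-in input() used later in the notebook is not shadowed
input_1 = {'role':'you are a depressed clown' ,'input': 'tell me a joke !'}
input_2 = {'role':'you are a clown for children' ,'input': 'tell me a joke !'}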


In [None]:
prompt = prompt_template.render(input_2)
print(f"Generated prompt:\n{prompt}\n")

response = llm.create_completion(prompt,max_tokens=256) # (help(llm.create_completion) to know which arguments can be expected)
response

Generated prompt:

    <|begin_of_text|>
    
    <|start_header_id|>system<|end_header_id|> you are a clown for children <|eot_id|>
    <|start_header_id>user<|end_header_id>
    
    tell me a joke !
    <|eot_id|>
    <|start_header_id>assistant<|end_header_id|>
    



{'id': 'cmpl-f148288f-d4d9-43dc-a4af-621083819819',
 'object': 'text_completion',
 'created': 1733665743,
 'model': '/root/.cache/huggingface/hub/models--bartowski--Llama-3.2-1B-Instruct-GGUF/snapshots/067b946cf014b7c697f3654f621d577a3e3afd1c/./Llama-3.2-1B-Instruct-Q8_0.gguf',
 'choices': [{'text': " hi there little buddy, want to hear a joke?\n\nokay! here's one: what do you call a fake noodle?\n(wait for it...)\n\nan impasta!\n\nhehehe, did I make you laugh?",
   'index': 0,
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 51, 'completion_tokens': 44, 'total_tokens': 95}}

What do you think ?




**Answer**: The model does produce a recognizable joke, and its tone matches the role given in the system prompt (here, a clown for children). The `max_tokens` argument caps the length of the completion, so it can be used to keep the answer short.

# Retriever

In [None]:
import chromadb.utils.embedding_functions as embedding_functions

In [None]:
client = chromadb.Client()
stf_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

In [None]:
collection = client.create_collection(name="warhammer_40k",
                                      metadata={"hnsw:space": "cosine"},
                                      embedding_function=stf_function)
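
The `"hnsw:space": "cosine"` setting means chunks are ranked by cosine distance between embeddings, that is, one minus the cosine similarity. The sketch below shows that quantity on toy vectors; `cosine_distance` is a hypothetical helper written for illustration, not a ChromaDB function.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Same direction (distance near 0), orthogonal (near 1), opposite (near 2).
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))
```

Because only the direction of the vectors matters, the length of an embedding does not influence the ranking.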

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/MyDrive/processed')
print(os.getcwd())

In [None]:
with open('chunks.json', 'r') as f:
    chunks = json.load(f)

# Inspect the fields of the JSON. What do you think?

chunks[0]

{'id': None,
 'metadata': {'Header 1': 'Core Rules'},
 'page_content': "'We are beset on all sides by vile predatory aliens and sedition gnaws at us from within; in this dark hour the best we can do is look to our wargear and pray to our gods.'\n- Skolak a'Trellar IV, Imperial Commander  \n200 STORE",
 'type': 'Document'}

**Answer**: Each chunk has the fields "id", "metadata", "page_content" and "type". This maps directly onto what ChromaDB expects (documents, metadatas and ids), so the chunks can be added to the collection once ids have been generated.

In [None]:
# Add the documents to the collection

# Since every id is None, we generate universally unique identifiers (UUIDs) to guarantee that each record in the db is unique
ids = [str(uuid.uuid4()) for _ in chunks]
print(ids)

page_content = [chunk["page_content"] for chunk in chunks]
metadatas = [chunk["metadata"] for chunk in chunks]

collection.add(ids=ids, documents=page_content, metadatas=metadatas)
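
Random `uuid4` identifiers change on every run, so the same chunk receives a different id each time the cell is executed. If stable ids are preferred, one possible alternative is to derive them from the chunk text with `uuid5`. This is only a sketch: `stable_chunk_id` is a hypothetical helper, and the choice of `uuid.NAMESPACE_URL` as namespace is arbitrary.

```python
import uuid


def stable_chunk_id(text, position, namespace=uuid.NAMESPACE_URL):
    """Derive a deterministic id from a chunk's position and text."""
    # The position is included so that two chunks with identical text
    # still receive distinct ids.
    return str(uuid.uuid5(namespace, f"{position}:{text}"))


toy_chunks = ["Core Rules", "Introduction", "Introduction"]
toy_ids = [stable_chunk_id(text, i) for i, text in enumerate(toy_chunks)]
print(toy_ids)
```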

In [None]:
collection.get()["metadatas"][:5]

[{'Header 1': 'Core Rules'},
 {'Header 1': 'Introduction'},
 {'Header 1': 'Introduction'},
 {'Header 1': 'Introduction', 'Header 2': 'Core Concepts (Pg 5-9)'},
 {'Header 1': 'Introduction', 'Header 2': 'The Battle Round (Pg 10-36)'}]

In [None]:
question = "What is a visible unit ?"

# Perform a query

top_k = 5  # Number of most relevant documents to retrieve

# Query the retriever
results = collection.query(
    query_texts=[question],
    n_results=top_k  # Return the top_k most relevant documents
)

# Results
# Iterate over each sub-list in results['documents'] (one per query text)
for i, document_group in enumerate(results['documents']):
    print(f"Document Group {i+1}:\n")
    # Iterate over each document in the sub-list
    for j, document in enumerate(document_group):
        print(f"  Document {j+1}:\n{document}\n")

Document Group 1:

  Document 1:
- **Model Visible:** If any part of a model can be seen, it is visible.  
- **Unit Visible:** If any model in a unit is visible, that model's unit is visible.  
- **Model Fully Visible:** If every Warhammer 40,000 battles are fought across all manner of grim and perilous landscapes, often strewn with ruins, wreckage and other obstacles your forces must navigate while they fight.  
#### - Unit Fully Visible: If Every Model In A Unit Is Fully Visible, That Unit Is

  Document 2:
- **Unit Fully Visible:** If every model in a unit is fully visible, that unit is fully visible.  
#### Hints And Tips  
Dice Rolling

  Document 3:
MODEL FULLY VISIBLE
If every part of another model that is facing the observing model can be seen from any part of the observing model, then that other model is said Every Warhammer 40,000 unit has a datasheet, reflecting the characteristics and abilities they can draw upon in battle.  
Unit Visible  
#### Has Line Of Sight To All Par

Create a retrieval function wrapping the ChromaDB query and returning an adapted format.

In [None]:
def retrieve(question, n_results=5):
  results = collection.query(
    query_texts=[question],
    n_results=n_results  # Return the n_results most relevant documents
  )
  # Extract documents and ids
  documents = results['documents'][0]  # First item, as query_texts has only one element
  #ids = results['ids'][0]             # Same for ids
  headers = results['metadatas'][0]

  # Format the output
  formatted_results = []
  for doc_header, doc_text in zip(headers, documents):
    formatted_results.append({
            "header": doc_header,
            "text": doc_text
        })

  return formatted_results # return texts and associated headers

In [None]:
retrieve('What is the goal of the game?') # another example

## RAG template

Create a RAG template in Jinja

In [None]:
# Fill it, use if and for blocks

rag_template = Template(
    """
    <|begin_of_text|>

    <|start_header_id|>system<|end_header_id|>
    {% if role %}
    {{ role }}
    {% endif %}
    {% if chunks %}
    Context:
    {% for chunk in chunks %}
    {{ ' '.join(chunk['header'].values()) }}
    Text: {{ chunk['text'] }}
    {% endfor %}
    {% endif %}
    <|eot_id|>

    <|start_header_id|>user<|end_header_id|>
    Question: {{ question }}
    <|eot_id|>

    <|start_header_id|>assistant<|end_header_id|>
    """
)

In [None]:
print(rag_template.render(**{
    'role': 'you are an experienced wargame player',
    'question': "What is a visible unit ?",
    'chunks':[{'header': {'header1':'toto'},'text':'ctx1'},{'header': {'header1':'tato', 'header2':'tato'},'text':'ctx2'},{'header': {'header1':'tato'}, 'text':'ctx3'}]
}))

Create a function that builds the prompt from the question and the chunks.

In [None]:
def prompt_generation(question, chunks):
    return rag_template.render(question=question, chunks=chunks) # return the prompt

In [None]:
retrieve(question)

In [None]:
print(prompt_generation(question, retrieve(question)))

# Full RAG

Create functions to perform the full RAG pipeline, you may create a function for the CPU and another one for the GPU.

In [None]:
question_0 = "What is a visible unit ?"
question_1 = 'What are the limitations associated to the advance movement rule?'
question_2 = 'Is there a stratagem that can be used to reroll a failed dice role?'
question_3 = 'Explain the Comand Re-roll stratagem'

In [None]:
# Return the generated answer to the question: the answer text is the first element of the returned tuple
def full_rag_cpu(question, n_results=4):
  chunks = retrieve(question, n_results)
  prompt = prompt_generation(question, chunks)
  # stop=None disables the stop sequences (so the model does not halt on a line break, for example)
  result = llm.create_completion(prompt, max_tokens=500, stop=None)
  return result['choices'][0]['text'], result['choices'][0]

This is what we obtain:

In [None]:
# question_0
print(full_rag_cpu(question_0, n_results=4)[0])

 A visible unit is a model that is fully visible from any part of another model that is facing it, without any other models or terrain features blocking visibility to any of those parts.


In [None]:
# question_1
print(full_rag_cpu(question_1, n_results=4)[0])

 The limitations associated with the Advance movement rule in the Movement Phase of the Warhammer 40,000 rules are:

1.  Models can make an Advance move, but they cannot move within Engagement Range of any enemy models.
2.  A unit cannot shoot or declare a charge in the same turn that it Advances.
3.  Models that Advance cannot shoot or charge this turn.
4.  Models that Advance can move up to M+D6, where M is the model's Movement characteristic and D6 is the result of the roll.
5.  Models that Advanced cannot move within Engagement Range of any enemy models.
6.  A unit cannot Consolidate, which means it cannot move towards the closest enemy model, in base-to-base contact with an enemy model if possible.


In [None]:
# question_2
print(full_rag_cpu(question_2, n_results=4)[0])




    There isn't a specific stratagem that is universally accepted for rerolling a failed dice roll. The core concept in most games is that a failed dice roll is a loss and is usually not re-rolled. However, some games have house rules or modifications to this rule, allowing for the possibility of re-rolling a failed dice roll, as mentioned in the Core Concepts Determining Visibility section of the Warhammer 40k rules.


In [None]:
# question_3
print(full_rag_cpu(question_3, n_results=4)[0])

 The Command Re-roll stratagem is a rule in wargames and skirmish games that allows a player to re-roll dice rolls that were made by their unit, especially those that were made by adding several dice together. Here's a more detailed explanation:

**General Rules:**

* A re-roll is only applicable if the original roll was made by adding multiple dice together (e.g. 2D6).
* Re-rolls happen before any modifiers are applied.
* A re-roll can be used as many times as the number of dice were added together.

**When to use a Command Re-roll:**

* After a successful roll (Hit, Wound, Damage, saving throw, etc.).
* After a critical hit or critical failure (e.g. 1D6 for a critical hit, or 1D6 + 1 for a critical failure).
* After a successful initiative roll (for some games).

**How to use a Command Re-roll:**

* Pay the required number of CP (Command Points) to use the stratagem.
* The CP cost is usually listed on the stratagem.
* The same stratagem cannot be used more than once in the same phase

# Adding a reranker
The results might not be satisfactory for some questions.

In order to better use the header extraction, we will rerank the chunks using BM25 over the headers.
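
As a rough picture of what BM25 computes, here is a minimal pure-Python scoring sketch. It uses a common Okapi BM25 variant with a smoothed IDF; the `rank_bm25` implementation used in this notebook may differ in details (notably in how it treats IDF), so the numbers are not expected to match. `bm25_scores` and the toy headers are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length with b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


toy_headers = [["core", "concept", "dice"], ["movement", "phase"], ["stratagem", "command"]]
print(bm25_scores(["stratagem", "dice"], toy_headers))
```

A header that shares no token with the query scores zero, which is why the preprocessing (lowercasing, stop-word removal, stemming) matters so much when the headers are only a few words long.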

In [None]:
stopwords_en = stopwords.words('english')
stemmer = SnowballStemmer("english")
def metadata_preprocessing(metadata):
    preprocessed_metadata = []
    for key, value in metadata.items():
        text = f"{key} {value}"
        tokens = text.lower().split()
        tokens = [stemmer.stem(token) for token in tokens if token not in stopwords_en]
        preprocessed_metadata.append(" ".join(tokens))
    return preprocessed_metadata # preprocessed headers

def query_preprocessing(query):
    return " ".join([stemmer.stem(token) for token in query.lower().split() if token not in stopwords_en]) # preprocessed queries

In [None]:
from rank_bm25 import BM25Okapi

def rerank_chunks(question, chunks, n_results=5):
    # Extract headers from chunks
    headers = [chunk['header'] for chunk in chunks]
    # Preprocess the headers and the query
    preprocessed_headers = [metadata_preprocessing(header) for header in headers]
    preprocessed_query = query_preprocessing(question)

    # Flatten preprocessed headers for BM25
    tokenized_headers = [' '.join(header).split() for header in preprocessed_headers]
    tokenized_query = preprocessed_query.split()

    # Initialize BM25
    bm25 = BM25Okapi(tokenized_headers)

    # Rank scores
    scores = bm25.get_scores(tokenized_query)

    # Sort chunks based on BM25 scores, include scores with chunks
    sorted_chunks = [(score, chunk) for score, chunk in sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)]

    return sorted_chunks[:n_results]  # Return top-n results with scores


In [None]:
retrieved_chunks = retrieve(question_2, n_results=100)
reranked_chunks = rerank_chunks(question_2, retrieved_chunks, n_results=10)
for rr_chk in reranked_chunks:
    print('score: {}'.format(rr_chk[0]))
    print('header: {}'.format('#'.join(rr_chk[1]['header'].values())))
    print('text: {}'.format(rr_chk[1]['text']))
    print('-'*100)

score: 3.107509213049174
header: Core Concepts#Dice
text: #### Re-Rolls  
Some rules allow you to re-roll a dice roll, which means you get to roll some or all of the dice again. If a rule allows you to re-roll a dice roll that was made by adding several dice together (e.g. 2D6, 3D6, etc.) then, unless otherwise stated, you must re-roll all of those dice again. You can never re-roll a dice more than once, and re-rolls happen before modifiers (if any) are applied. Rules that refer to the value of an 'unmodified' dice roll are referring to the dice result after any re-rolls, but before any modifiers are applied.  
- **Unmodified Dice**: the result after re-rolls, but before any modifiers.  
- A dice can never be re-rolled more than once. - You must re-roll all dice if several need adding together
(e.g. 2D6).  
- Re-rolls are applied before any modifiers.  
#### Roll-Offs
----------------------------------------------------------------------------------------------------
score: 3.107509213

In [None]:
question_3 = 'Explain the Comand Re-roll stratagem'
retrieved_chunks = retrieve(question_3, n_results=100)
reranked_chunks = rerank_chunks(question_3, retrieved_chunks, n_results=10)
for rr_chk in reranked_chunks:
    print('score: {}'.format(rr_chk[0]))
    print('header: {}'.format('#'.join(rr_chk[1]['header'].values())))
    print('text: {}'.format(rr_chk[1]['text']))
    print('-'*100)

## RAG with reranker

In [None]:
def full_rag_reranker(question, n_results=5):
    # Retrieve chunks
    chunks = retrieve(question, n_results)
    # Rerank chunks using BM25
    ranked_chunks = rerank_chunks(question, chunks)
    # Extract only the chunk dictionaries from ranked_chunks
    ranked_chunks = [chunk[1] for chunk in ranked_chunks]  # Selecting the dictionary part of the tuple
    # Generate prompt using top-ranked chunks
    prompt = prompt_generation(question, ranked_chunks[:n_results])
    # Query the LLM
    results = llm.create_completion(prompt, max_tokens=700, stop = None)
    return results['choices'][0]['text'] # full pipeline with reranker
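
In the demonstration cells above, 100 candidates were retrieved before keeping the 10 best after reranking. A reranker can only promote chunks that are in the candidate pool, so retrieving a wider pool than the final number of chunks is a common pattern. The sketch below illustrates it with toy stand-ins: `retrieve_then_rerank`, `toy_retrieve` and `toy_score` are hypothetical and replace `retrieve` and the BM25 scoring only for the purpose of the example.

```python
def retrieve_then_rerank(question, retrieve_fn, score_fn, n_candidates=50, n_results=5):
    """Fetch a wide candidate pool, then keep the best chunks after rescoring."""
    candidates = retrieve_fn(question, n_candidates)
    ranked = sorted(candidates, key=lambda chunk: score_fn(question, chunk), reverse=True)
    return ranked[:n_results]


# Toy data: the most relevant header is ranked last by the first-stage retriever.
toy_chunks = [
    {"header": "Movement Phase", "text": "..."},
    {"header": "Core Concepts Dice", "text": "..."},
    {"header": "Stratagems Command Re-roll", "text": "..."},
]


def toy_retrieve(question, n):
    # Stand-in for the vector search: returns the first n chunks.
    return toy_chunks[:n]


def toy_score(question, chunk):
    # Stand-in for BM25: number of words shared by the question and the header.
    return len(set(question.lower().split()) & set(chunk["header"].lower().split()))


question = "explain the command re-roll stratagem"
wide = retrieve_then_rerank(question, toy_retrieve, toy_score, n_candidates=3, n_results=1)
narrow = retrieve_then_rerank(question, toy_retrieve, toy_score, n_candidates=2, n_results=1)
print(wide[0]["header"], "|", narrow[0]["header"])
```

With the narrow pool, the relevant chunk is never seen by the reranker, whatever its score would have been.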

Compare the generations from both pipelines (with and without reranker).

What do you think ?

In [None]:
# with reranker
print(full_rag_reranker(question_2, n_results=4))




    


    The answer is:

    Yes, there are several stratagems that can be used to reroll a failed dice roll. Here are a few examples:


    1. **Re-roll with a re-roll of a dice**: Many dice rules include a clause that allows you to re-roll a dice roll using the same dice, which means you get to roll another dice.

    2. **Re-roll all dice with a re-roll of a dice**: Some rules include a clause that allows you to re-roll all dice made with a re-roll of a dice, such as 2D6 or 3D6.

    3. **Re-roll a specific dice with a re-roll of a dice**: Some rules include a clause that allows you to re-roll a specific dice with a re-roll of a dice, such as 1D6.

    4. **Re-roll a dice with a re-roll of a dice**: Some rules include a clause that allows you to re-roll a dice with a re-roll of a dice, such as 2D6 or 3D6.

    5. **Use a different dice for re-rolling**: Some rules allow you to use a different dice for re-rolling, such as a D4 instead of a D6.

    **Example**:

    Let's say yo

In [None]:
# without reranker
print(full_rag_cpu(question_2, n_results=4)[0])




Yes, there are several stratagems that can be used to reroll a failed dice roll in Warhammer 40,000. Here are a few examples:


*   **Re-roll a failed die**: In some cases, the stratagem allows you to re-roll a failed die without using a special rule. This can be done by using the "Roll for Initiate" stratagem or the "Re-roll a failed dice" stratagem. These stratagems can be used to reroll a failed die in various situations, such as when a character is hit by an enemy's attack and fails to roll a critical hit or when a character attempts a skill check and fails.

    *   **Re-roll a failed dice in combat**: If a character fails to roll a critical hit in combat, they can re-roll the dice to try again. If they succeed, they can use the re-roll to re-roll the dice again. If they fail to roll the dice again, they must attempt to re-roll it again.
*   **Re-roll a failed skill check or ability check**: Some stratagems, such as the "Roll for Initiate" stratagem, can re-roll a failed skill 

Here we observe that the response with the reranker is more detailed and even includes examples to illustrate the concept.

In [None]:
# question_3 with reranker
print(full_rag_reranker(question_3, n_results=4))




The Command Re-roll Stratagem is a powerful ability that allows a player to re-roll a dice roll, but with a twist. Here's how it works:


**When:** The Command Re-roll Stratagem is used during the Battle phase, just after a player has made a Hit roll, Wound roll, Damage roll, saving throw, Advance roll, Charge roll, Desperate Escape test, or Hazardous test.
**Categories:** It belongs to the "Stratagems" category.


**Effect:** The player re-rolls the affected roll, test, or saving throw.

**Key Points:**

* The re-roll is applied immediately after the test or roll, not before any modifiers are applied.
* The re-roll is done before any modifiers are applied to the result.
* The re-roll is not done before re-rolls of dice are made (e.g. re-rolling 2D6 is allowed, but re-rolling a dice made by adding 2D6 together is not allowed).
* The re-roll cannot be used after a saving throw that has already been made.
* The same Stratagem cannot be used more than once in the same phase.

In essenc

In [None]:
# question_3 without reranker
print(full_rag_cpu(question_3, n_results=4)[0])

 The Command Re-roll Stratagem is a combat ability found in the core rules of Warhammer 40,000.

**When:** It can be used during the fight phase.

**Effect:** When you use a Command Re-roll Stratagem, you must re-roll the dice, test, or saving throw as required. The result of the re-roll is added to your total.

**Restrictions:**

*   You must use the Command Re-roll Stratagem at the same phase (just after an enemy unit has fought).
*   The Command Re-roll Stratagem cannot be used if the re-roll is to test or a saving throw, unless the test or saving throw is also a Command Re-roll.
*   You cannot re-roll a dice roll more than once.

This allows you to re-roll a dice roll that was made by adding several dice together, or any re-roll that is part of a larger dice pool, as long as the re-roll is made in the same phase.


Comparison: we see that we have a more precise first sentence with the reranker (we understand the core of the stratagem immediately with this first sentence, unlike the first sentence without the reranker), and there are more points detailed.

Overall, the description is more detailed and clearer.

So the reranking with BM25 helped to obtain results that are more relevant.

# Further application

At this point, we have built a single-turn question/answer RAG. For some applications it can be useful to allow conversations with documents. Modify the generation using [this example](https://huggingface.co/docs/transformers/conversations).
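
One way to picture a multi-turn prompt is sketched below, reusing the Llama 3 special tokens from the templates earlier in this notebook. It is an assumption-laden illustration rather than the method of the linked example: `build_chat_prompt` is a hypothetical helper, the `role`/`content` message keys are a convention chosen here, and `max_turns` is a simple way to bound the prompt length.

```python
def build_chat_prompt(messages, context_chunks=None, max_turns=6):
    """Render a multi-turn conversation in the Llama 3 chat layout.

    messages: list of {"role": "user" | "assistant", "content": str}, oldest first.
    context_chunks: optional list of retrieved texts, placed in the system turn.
    max_turns: number of most recent messages to keep.
    """
    parts = ["<|begin_of_text|>"]
    if context_chunks:
        context = "\n".join(context_chunks)
        parts.append(
            "<|start_header_id|>system<|end_header_id|>\n\n"
            f"Context:\n{context}<|eot_id|>"
        )
    # Keep only the most recent messages so the prompt stays within the context window.
    for message in messages[-max_turns:]:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Leave the assistant turn open so the model writes the next answer.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


history = [
    {"role": "user", "content": "What is a visible unit ?"},
    {"role": "assistant", "content": "A unit is visible if any of its models is visible."},
    {"role": "user", "content": "And a fully visible unit ?"},
]
print(build_chat_prompt(history, context_chunks=["ctx1", "ctx2"]))
```

After each answer, the question and the answer would be appended to `history` before building the next prompt.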

In [None]:
!pip install transformers

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from rank_bm25 import BM25Okapi

# Load the DialoGPT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

In [None]:
def prompt_generation(question, chunks, conversation_history):
    """Generate the prompt for DialoGPT using conversation and chunks"""
    context = "\n".join([msg["text"] for msg in conversation_history]) + "\n"
    # Limit the number of retrieved texts to prevent exceeding model's input size
    retrieved_texts = "\n".join([chunk[1]["text"] for chunk in chunks[:2]])  # Reduced to 2 chunks
    prompt = f"Context:\n{context}\nDocument Chunks:\n{retrieved_texts}\nUser's Query: {question}\nAnswer:"
    return prompt

def full_rag_conversation(question, conversation_history, n_results=5):
    """Complete pipeline for a RAG conversation with DialoGPT"""
    retrieved_chunks = retrieve(question, n_results)  # Retrieve relevant chunks
    reranked_chunks = rerank_chunks(question, retrieved_chunks, n_results)  # Re-rank chunks

    prompt = prompt_generation(question, reranked_chunks, conversation_history)  # Generate the prompt

    # Encode the prompt; cap its length so that prompt plus generated tokens fit in DialoGPT's 1024-token window
    inputs = tokenizer.encode(prompt + tokenizer.eos_token, return_tensors='pt', truncation=True, max_length=768)
    response_ids = model.generate(inputs, max_new_tokens=256, pad_token_id=tokenizer.eos_token_id)

    # Decode only the newly generated tokens so that the prompt is not echoed back in the answer
    response = tokenizer.decode(response_ids[0][inputs.shape[-1]:], skip_special_tokens=True)
    return response.strip()

def start_conversation():
    """Start an interactive conversation with DialoGPT and the retrieved chunks"""
    conversation_history = []

    # Conversation loop: type "quit" or "exit" to stop
    while True:
        question = input("User: ")
        if question.strip().lower() in {"quit", "exit"}:
            break

        # Get the response and update conversation history
        answer = full_rag_conversation(question, conversation_history)

        # Add the question and answer to the conversation history
        conversation_history.append({"role": "user", "text": question})
        conversation_history.append({"role": "assistant", "text": answer})

        print(f"Assistant: {answer}")

# Start the conversation
start_conversation()

User: What is the goal of the game?
Assistant: Context:


Document Chunks:
#### End The Battle  
Each mission will tell you when the battle ends. This will typically be after a set number of battle rounds have been completed, or when one player has achieved a certain victory condition.  
DETERMINE VICTOR
Each mission will tell you what you need to do in order to win the game. If neither player manages to achieve a victory then the game is considered to be a draw.  
57
Your units manoeuvre across the battlefield and reinforcements enter the fray.  
SHOOTING PHASE
Your units fire their ranged weapons at the foe.  
CHARGE PHASE
Your units charge forward to battle at close quarters.  
Both players' units pile in and attack with melee weapons.  
Once a player's turn has ended, their opponent then starts their turn. Once both players have completed a turn, the battle round has been completed and the next one begins, and so on, until the battle ends.  
3 4 5
User's Query: What is the goal of 

KeyboardInterrupt: Interrupted by user