# Lab: RAG

This notebook implements a basic RAG (retrieval-augmented generation) pipeline:
- The documents are chunked by paragraph.
- An embedding is computed for each chunk.
- Given a question, we embed the question and retrieve the N most relevant chunks.
- A text generation model (SmolLM2) turns the question plus the most relevant chunks into an answer.

Download [this archive](https://drive.google.com/file/d/1TnfKs7bTwmpbXklbgiIBpdw7I_wJ5y9Y/view?usp=sharing).

In this lab, you will experiment with different ways of chunking and embedding the documents and the questions so that the RAG retrieves the most relevant documents.<br/>
You will also experiment with the prompt given to the text generator to get the best answers.

Here is the [list of questions](https://drive.google.com/file/d/14hZ0hTx5dM1WgJYewZsn9BkHzEReq-pj/view?usp=sharing) that I will ask the RAG.<br/>
To hand in:
- The notebook of your RAG
- A CSV with the columns question,embedding,rag_reply
- A CSV with the columns chunk,embedding

The embedding must be the JSON encoding of a list of floats.<br/>
When I call `json.loads(embedding)`, I must get back a list of floats.
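
As a minimal sketch of the embedding format required above, the snippet below serializes each vector with `json.dumps` and checks the round trip through a CSV. The toy chunks, the toy vectors, and the helper name `embedding_to_json` are assumptions made for illustration; with the real data you would pass your own chunk list and embedding matrix.

```python
import io
import json

import numpy as np
import pandas as pd


def embedding_to_json(vector):
    """Serialize one embedding as a JSON list of plain Python floats."""
    # float() converts NumPy scalars such as float16/float32, which json cannot serialize directly.
    return json.dumps([float(x) for x in np.asarray(vector).ravel()])


# Toy data standing in for the real chunks and their embeddings (assumption).
toy_chunks = ["first chunk", "second chunk"]
toy_embeddings = np.array([[0.25, 0.5], [0.75, 1.0]], dtype=np.float16)

chunk_df = pd.DataFrame({
    "chunk": toy_chunks,
    "embedding": [embedding_to_json(v) for v in toy_embeddings],
})

# Write to an in-memory CSV and read it back, the way a grader would.
csv_text = chunk_df.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))
first_vector = json.loads(restored["embedding"][0])
print(first_vector)
```

To produce an actual file, the same DataFrame can be written with `chunk_df.to_csv(..., index=False)` under a file name of your choice.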

In [None]:
import numpy as np

import pandas as pd
from pathlib import Path

# Data loading

In [None]:
path = Path("../data/raw/rag/")

In [None]:
texts = []
# Sort the paths so the chunk order is reproducible across runs.
for filename in sorted(path.glob("*.md")):
    with open(filename, encoding="utf-8") as f:
        texts.append(f.read())

texts[0]

# Chunk
## Basic

In [None]:
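def parse_class(text):
    """Split one document into paragraph chunks (separated by blank lines)."""
    chunks = text.split("\n\n")
    # The first paragraph is expected to be the "# Title: ..." line.
    title = chunks[0].replace("# Title: ", "")
    return {"title": title, "chunks": chunks}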

In [None]:
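def parse_class_add_title(text):
    """Same as parse_class, but prefix every chunk with the document title."""
    chunks = text.split("\n\n")
    title = chunks[0].replace("# Title: ", "")
    # The title gives each chunk some context once it is separated from its document.
    return {"title": title, "chunks": [f"{title}: {chunk}" for chunk in chunks]}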

In [None]:
# Flatten the per-document chunk lists into a single list.
chunks = [chunk for txt in texts for chunk in parse_class_add_title(txt)["chunks"]]
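
Splitting on blank lines is only one way to chunk. As a hedged sketch of an alternative to experiment with, the function below cuts a text into overlapping windows of words. The name `sliding_window_chunks` and the `window` and `overlap` parameters are assumptions introduced here, not part of the original notebook.

```python
def sliding_window_chunks(text, window=50, overlap=10):
    """Split text into chunks of `window` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    words = text.split()
    step = window - overlap
    result = []
    for start in range(0, len(words), step):
        result.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # this window already reaches the end of the text
    return result


example = "one two three four five six seven"
windows = sliding_window_chunks(example, window=4, overlap=2)
print(windows)
# ['one two three four', 'three four five six', 'five six seven']
```

Overlap reduces the chance that an answer is cut in half at a chunk boundary, at the cost of more chunks to embed.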

# Embedding

## BAAI's embedding

In [None]:
from FlagEmbedding import FlagModel

In [None]:
model = FlagModel(
    'BAAI/bge-base-en-v1.5',
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    use_fp16=True,
)

In [None]:
corpus_embedding = model.encode(chunks)

In [None]:
queries = [
    "Who is the reinforcement learning teacher?",
    "In what class will I learn game AI?",
]

In [None]:
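# encode_queries prepends query_instruction_for_retrieval to each query; plain encode() does not.
query_embedding = model.encode_queries(queries)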

In [None]:
sim_scores = query_embedding @ corpus_embedding.T

In [None]:
for query, score in zip(queries, sim_scores):
    print(" ---- ")
    print("Query: ", query)
    indexes = np.argsort(score)[-5:]  # indices of the 5 highest-scoring chunks, in ascending order
    print("Sources:")
    for i, idx in enumerate(reversed(indexes)):
        if score[idx] > .5:  # only show chunks above a similarity threshold
            print(f'{i+1} -- similarity {score[idx]:.2f} -- "{chunks[idx]}"')
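
The dot product used for `sim_scores` above equals cosine similarity only when the embeddings have unit length. If you try an embedding model whose outputs are not normalized, a sketch like the following makes the normalization explicit. The helper name `cosine_scores` and the toy vectors are assumptions for illustration.

```python
import numpy as np


def cosine_scores(query_vectors, corpus_vectors):
    """Cosine similarity matrix of shape (n_queries, n_chunks)."""
    q = np.asarray(query_vectors, dtype=np.float32)
    c = np.asarray(corpus_vectors, dtype=np.float32)
    # Divide each row by its L2 norm so that the dot product becomes a cosine.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return q @ c.T


toy_queries = np.array([[3.0, 4.0]])
toy_corpus = np.array([[6.0, 8.0], [4.0, -3.0]])
scores = cosine_scores(toy_queries, toy_corpus)
print(scores)  # first chunk is parallel to the query, second is orthogonal
```

The sketch assumes no embedding is the zero vector, since that would cause a division by zero.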
            
    

# Eval retrieval: Mean Reciprocal Rank
The file [question_answer_short.csv](https://drive.google.com/file/d/1EB8IwGlqvpNy3oq7xyR2IzdqJDX8C_fr/view?usp=drive_link) contains a list of questions and, for each one, the text to find in the documents.<br/>
I consider any chunk containing the "text to find" to be a correct chunk.
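
For reference, the score computed in the cells below can be written as a mean reciprocal rank truncated at rank 5. The notation is introduced here for illustration only: $Q$ is the set of questions and $\mathrm{rank}_q$ is the position of the first correct chunk for question $q$, with 1 being the most similar chunk.

$$
\mathrm{MRR@5} = \frac{1}{|Q|} \sum_{q \in Q}
\begin{cases}
\dfrac{1}{\mathrm{rank}_q} & \text{if } \mathrm{rank}_q \le 5 \\
0 & \text{otherwise}
\end{cases}
$$

For example, three questions whose first correct chunk is ranked 1, 2 and 7 give $(1 + 1/2 + 0) / 3 = 0.5$.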

In [None]:
df = pd.read_csv(path / "question_answer_short.csv")

In [None]:
# encode_queries applies the retrieval instruction configured on the model.
query_embedding = model.encode_queries(list(df["question"]))

In [None]:
acceptable_chunks = []
for answer in df["answer"]:
    chunks_ok = set(i for i, chunk in enumerate(chunks) if answer in chunk)
    acceptable_chunks.append(chunks_ok)

In [None]:
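def compute_mrr(sim_score, acceptable_chunks, cutoff=5):
    """Mean reciprocal rank; a first correct chunk ranked below `cutoff` scores 0."""
    ranks = []
    for this_score, this_acceptable_chunks in zip(sim_score, acceptable_chunks):
        indexes = np.argsort(this_score)[::-1]  # chunk indices, most similar first
        # 1-based rank of the first acceptable chunk, or None if no chunk contains the answer.
        rank = next((i + 1 for i, idx in enumerate(indexes) if idx in this_acceptable_chunks), None)
        ranks.append(rank)

    return {
        "score": sum(1 / r if r is not None and r <= cutoff else 0 for r in ranks) / len(ranks),
        "ranks": ranks,
    }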

In [None]:
sim_scores = query_embedding @ corpus_embedding.T

In [None]:
res = compute_mrr(sim_scores, acceptable_chunks)
res["score"]

# Text generation

In [None]:
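def get_context(query, corpus, corpus_embeddings):
    # Embed the query with the retrieval instruction, then score it against every chunk.
    query_embedding = model.encode_queries([query])
    # Use the corpus_embeddings argument rather than the global variable.
    sim_scores = query_embedding @ corpus_embeddings.T
    # Keep the 5 most similar chunks, ordered from least to most similar.
    indexes = list(np.argsort(sim_scores[0]))[-5:]
    return [corpus[i] for i in indexes]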

In [None]:
get_context("Which class will teach me to build a chatbot?", chunks, corpus_embedding)

## SmolLM2

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-360M-Instruct"
# checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
# checkpoint = "amd/Instella-3B"

device = "cpu"  # use "cuda" to run on a GPU, "cpu" to run on the CPU

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model_generator = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

In [None]:
def build_smoll_prompt(query, corpus, corpus_embedding):
    # Use the corpus argument rather than the global `chunks`.
    context_str = "\n\n".join(get_context(query, corpus, corpus_embedding))

    prompt = f"""<|im_start|>system
You reply to the user's request using only context information.
Context information to answer "{query}" is below
------
Context:
{context_str}
------
You are a helpful assistant for a Computer Science university. You reply to students' questions about the courses that they can attend.
<|im_end|>
<|im_start|>user
{query}
<|im_end|>
<|im_start|>assistant
"""
    return prompt


In [None]:
def build_smoll_messages(query, chunks, corpus_embedding):
    context_str = "\n\n".join(get_context(query, chunks, corpus_embedding))

    messages = [
        {"role": "system", "content": f"""You reply to the user's request using only context information.
Context information to answer "{query}" is below
------
Context:
{context_str}
------
You are a helpful assistant for a Computer Science university. You reply to students' questions about the courses that they can attend.
"""},
        {"role": "user", "content": query},
    ]

    return messages


In [None]:
messages = build_smoll_messages("Who is the NLP teacher?", chunks, corpus_embedding)

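# add_generation_prompt=True appends the assistant header so the model writes an answer.
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)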
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model_generator.generate(inputs, max_new_tokens=100, temperature=0.01, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0]))

# Groq generator

In [None]:
groq_api_key = "YOUR-API-KEY"
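
Hard-coding a key in a notebook that will be handed in is easy to leak. As a sketch under assumptions, the key could instead be read from an environment variable. The variable names `GROQ_API_KEY` and `DEMO_API_KEY` and the helper `load_api_key` are choices made here for illustration.

```python
import os


def load_api_key(env_var="GROQ_API_KEY"):
    """Return the API key stored in `env_var`, or raise a clear error if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell.")
    return key


# Demonstration with a dummy variable; a real key would be set outside the notebook.
os.environ["DEMO_API_KEY"] = "dummy-key-for-illustration"
demo_key = load_api_key("DEMO_API_KEY")
print(demo_key)
```

With a real key exported in the shell, `groq_api_key = load_api_key()` would replace the placeholder string.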

In [None]:
import openai

In [None]:
client = openai.OpenAI(
    api_key=groq_api_key,
    base_url="https://api.groq.com/openai/v1"
)

In [None]:
query = "What must I do to pass the NLP class?"

context_str = "\n\n".join(get_context(query, chunks, corpus_embedding))

prompt = f"""Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
If the answer is not in the context information, reply "I cannot answer that question".
Query: {query}
Answer:"""

In [None]:
res = client.chat.completions.create(
    messages=[{"role": "user", "content": prompt}],
    model="openai/gpt-oss-20b",
)

In [None]:
res.choices[0].message.content