# Introduction 
This notebook demonstrates how to build a retrieval-augmented generation (RAG) pipeline using Gemma, LangChain, and ChromaDB.
 
1. [Introduction](#introduction)

2. [Installation](#installation)

3. [Imports](#importation)

4. [Data](#data)
    1. [Splitting the PDF into pages](#spliting_pages)
    2. [Splitting the pages into chunks (chunking)](#chunking)

5. [Building the vector database](#db)

6. [Model](#model)

7. [Retrieval](#retrieval)

8. [Augmentation](#augmentation)

9. [Query](#query)

10. [Generation](#generation)
 

# Installation
<a id="installation"></a>

In [1]:

#!pip install --upgrade jupyter     # upgrade these three packages if loading the model fails
#!pip install --upgrade ipywidgets
#!pip install --upgrade tqdm
#!pip install bitsandbytes accelerate
#!pip install tiktoken
#!pip install pypdf
#!pip install chromadb langchain sentence-transformers transformers torch  # also imported below; install if missing



# Imports
<a id="importation"></a>

In [3]:
import pandas as pd
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer
from pypdf import PdfReader
import ipywidgets as widgets
import tqdm


# Data
<a id ="data"></a>
For this pipeline, we use the institutional regulations of Polytech Angers.

The data comes as a PDF file.


### Splitting the PDF into pages
<a id="spliting_pages"></a>

In [4]:
from pypdf import PdfReader

pdf_reader = PdfReader("./data/Reglement_etudes_Polytech_Angers_2020-2021.pdf")
pdf_texts = [p.extract_text().strip() for p in pdf_reader.pages]
pdf_texts[0]

"Page 4 sur 17  \n 1. PRÉAMBULE \nLa formation d’ingénieur comporte 5 années d’études  post baccalauréat. Il débute par un parcours intégré de \ndeux années appelé « Parcours des écoles d’ingénieu rs Polytech (PeiP) » et un cycle d’ingénieurs sur l es trois \ndernières années. Le présent règlement concerne les  deux années de PeiP de Polytech Angers.   \n \nCe règlement des études est révisable chaque année et validé par le conseil d'école. \n2. ORGANISATION DES ÉTUDES \n2.1. RÉPARTITION TEMPORELLE ET UNITÉS D’ENSEIGNEMEN T \nLe volume horaire total d’enseignement encadré est compris entre 1400 h et 1600 h, Les enseignements s ont \norganisés en 4 semestres. \nLes enseignements (matières, modules, éléments cons titutifs pédagogiques) sont groupés en Unités \nd’Enseignement (UE) au sein de chaque semestre. Cha que UE assure une cohérence pédagogique entre \ndiverses matières et contribue à l’acquisition de c ompétences identifiées. A chaque UE est associé un nombre \nfixé de crédits ECT

### Splitting the texts into chunks 
<a id="chunking"></a>

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split on paragraph breaks first, then on lines, sentences, words, and finally characters,
# so that each chunk holds at most 1000 characters.
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(f"\nTotal chunks: {len(character_split_texts)}")


Total chunks: 35
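
The splitter above is configured with `chunk_size=1000` and `chunk_overlap=0`. As a rough illustration of what these two parameters mean, here is a simplified sketch using a plain fixed-width sliding window over characters. This is not LangChain's recursive, separator-aware algorithm, and the helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


print(sliding_window_chunks("abcdefghij", chunk_size=4))
# ['abcd', 'efgh', 'ij']
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With an overlap of 0, as in this notebook, no text is duplicated across chunks, but a sentence cut at a chunk boundary is only partly visible in each of the two chunks.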


# Building the vector database (Chroma)
<a id="db"></a>
The vector database is responsible for storing and organizing the embeddings of the documents that our model will retrieve and use to generate responses.

For the embeddings, we use the default embedding model provided by Chroma (the Sentence Transformers model `all-MiniLM-L6-v2`, according to the Chroma documentation).

In [6]:
# Create a new chroma collection
client = chromadb.Client()
chroma_collection = client.get_or_create_collection(name="Reglement_etudes_Polytech_Angers")
ids = [str(i) for i in range(len(character_split_texts))]

chroma_collection.add(ids=ids, documents=character_split_texts)
chroma_collection.count()

35

# Loading the Model
<a id="model"></a>
For the model, we use Gemma 7B Instruct (`gemma-7b-it`), the instruction-tuned 7B variant of the open LLM released by Google.

In [7]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers.utils import is_flash_attn_2_available

# 1. Specify the model to be used (here, a path relative to the notebook)
model_name = "./google/gemma-7b-it"
print(f"[INFO] Using model: {model_name}")

use_quantization = False

# 2. (Optional) Create a quantization configuration to load a smaller, 4-bit model
# Requires the bitsandbytes and accelerate packages and a CUDA-capable GPU
quantization_config = None
if use_quantization:
    quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

# 3. (Optional) Specify the attention implementation for faster inference
# FlashAttention 2 requires the flash-attn package and an NVIDIA GPU with compute capability 8.0 or above;
# otherwise fall back to PyTorch's scaled dot-product attention ("sdpa")
attn_type = "flash_attention_2" if is_flash_attn_2_available() and torch.cuda.get_device_capability(0)[0] >= 8 else "sdpa"
print(f"[INFO] Using attention type: {attn_type}")

# 4. Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 5. Load the model
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.float16,
                                             quantization_config=quantization_config,
                                             low_cpu_mem_usage=False,
                                             attn_implementation=attn_type)

# 6. Move the model to the GPU (quantized models are placed on the GPU when they are loaded)
if not use_quantization:
    model.to("cuda")

[INFO] Using model: ./google/gemma-7b-it
[INFO] Using attention type: sdpa


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [8]:
print(model)

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 3072, padding_idx=0)
    (layers): ModuleList(
      (0-27): 28 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (k_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (v_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=3072, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=3072, out_features=24576, bias=False)
          (up_proj): Linear(in_features=3072, out_features=24576, bias=False)
          (down_proj): Linear(in_features=24576, out_features=3072, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
      )
    )
    (norm): Gemma
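
The layer shapes printed above are enough for a back-of-the-envelope estimate of the parameter count, and therefore of the memory needed for the weights in `float16` versus 4-bit. The sketch below is only an approximation: it ignores the small RMSNorm weights, assumes the output head shares the embedding matrix, and ignores the extra bookkeeping that quantization adds. The helper name `weight_memory_gib` is hypothetical.

```python
# Shapes copied from the printed architecture above.
VOCAB, HIDDEN, ATTN, MLP, LAYERS = 256_000, 3072, 4096, 24_576, 28

embedding = VOCAB * HIDDEN
attention = 4 * HIDDEN * ATTN   # q_proj, k_proj, v_proj, o_proj (no biases)
mlp = 3 * HIDDEN * MLP          # gate_proj, up_proj, down_proj (no biases)
n_params = embedding + LAYERS * (attention + mlp)


def weight_memory_gib(n_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


print(f"~{n_params / 1e9:.2f}B parameters")
for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(n_params, bits):.1f} GiB for the weights")
```

This gives roughly 8.5 billion parameters, which is why the unquantized `float16` model needs a GPU with well over 16 GB of memory once activations are included, and why the 4-bit option is offered for smaller GPUs.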

# Retrieval
<a id="retrieval"></a>
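The `generate_context` function implements the retrieval step: it returns the 5 chunks most relevant to the query.

Note that Chroma's default distance is the squared L2 distance (`l2`), not the cosine distance. Since the collection above is created without any distance setting, retrieval here ranks chunks by squared L2; to rank by cosine distance instead, the collection can be created with `metadata={"hnsw:space": "cosine"}` (the exact option may differ between Chroma versions).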
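
The choice of distance matters because different metrics can rank the same chunks differently. As a small, self-contained illustration (plain Python on made-up 2-D vectors, not Chroma's implementation), the sketch below compares squared L2 distance with cosine distance. The function names and the vectors are hypothetical.

```python
import math


def squared_l2(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b (both non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
docs = {
    "same_direction_longer": [3.0, 0.0],  # same angle as the query, larger norm
    "slightly_rotated": [0.9, 0.3],       # similar norm, different angle
}

for name, vec in docs.items():
    print(name, round(squared_l2(query, vec), 3), round(cosine_distance(query, vec), 3))
```

Squared L2 ranks `slightly_rotated` first, because vector length counts, while cosine distance ranks `same_direction_longer` first, because only the direction counts. For embeddings normalized to unit length, the two metrics produce the same ranking.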

In [9]:
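def generate_context(query):
    # Retrieve the 5 chunks closest to the query in embedding space
    results = chroma_collection.query(
        query_texts=[query],
        n_results=5,
        include=['documents']
    )

    # results['documents'] holds one list of chunks per query text; we sent a single query
    res = [{"sentence_chunk": str(item)} for item in results['documents'][0]]
    return res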

# Augmentation
<a id="augmentation"></a>
In this step, we craft the prompt. It defines the model's role as a student assistant and includes the context returned by the `generate_context` function, so that the language model has the information it needs to answer the query accurately.

In [10]:
def prompt_formatter(query: str, 
                     context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
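    # Join context items into a single bulleted list, one "- " item per chunk
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # The prompt is written in French to match the documents and the queries. In English:
    # "You are an assistant for students who want to learn about the internal regulations
    # of Polytech Angers. Try to answer the user's requests; if you do not know the answer,
    # do not make it up. Now use the following context elements to answer the user's request:"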
    base_prompt = """  Vous etes un assitant aux etudiants qui veulent se renseigner sur le reglement interieur de polytech Angers. 
    Essayez de repondre aux requetes d'utilisateur , si tu ne connais pas la reponse ne l'invente pas 
\nMaintenant utiliser les elements de contexte suivant pour repondre a la requetes de l'utilisateur:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    # Update base prompt with context items and query   
    base_prompt = base_prompt.format(context=context, query=query)


    # Create prompt template for instruction-tuned model
    dialogue_template = [
        {"role": "user",
        "content": base_prompt}
    ]
    # Apply the chat template
    prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                          tokenize=False,
                                          add_generation_prompt=True)
    return prompt
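
To see what the context block handed to the model looks like without loading the tokenizer, the formatting step can be tried on its own. The sketch below uses made-up chunks and a shortened English template. The helper name `format_context`, the example chunks, and the template are hypothetical, and the chat template applied by `tokenizer.apply_chat_template` is left out.

```python
def format_context(context_items: list[dict]) -> str:
    """Turn retrieved chunks into a bulleted block, one '- ' item per chunk."""
    return "- " + "\n- ".join(item["sentence_chunk"] for item in context_items)


# Made-up chunks, in the same shape as the output of generate_context
example_items = [
    {"sentence_chunk": "First retrieved chunk."},
    {"sentence_chunk": "Second retrieved chunk."},
]

example_prompt = "Context:\n{context}\nUser query: {query}\nAnswer:".format(
    context=format_context(example_items),
    query="An example question?",
)
print(example_prompt)
```

Because the chunks are inserted with `str.format`, the template itself must not contain stray `{` or `}` characters; braces inside the retrieved chunks are safe because they are substituted values, not part of the template.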

# Query
<a id="query"></a>

In [11]:
# In English: "What is the maximum number of justified absences allowed for a student?"
query = "Quelle est la limite maximale d'absences justifiées pour un élève ?"

context = generate_context(query)

prompt = prompt_formatter(query=query,
                          context_items=context)

# Generation: the model's answer
<a id="generation"></a>
Finally, we reach the last step: answer generation.

We perform the following steps:
1. Tokenize the formatted prompt,
2. Generate a response using the model, and
3. Decode the output to present the final answer to the user's query.

In [12]:


input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate an output of tokens
outputs = model.generate(**input_ids,
                             #temperature=0.5, # lower temperature = more deterministic outputs, higher temperature = more creative outputs
                             do_sample=False, # whether or not to use sampling, see https://huyenchip.com/2024/01/16/sampling.html for more
                             max_new_tokens=1024) # how many new tokens to generate from prompt 

# Turn the output tokens into text
output_text = tokenizer.decode(outputs[0])

print(f"Query: {query}")
print(f"RAG answer:\n{output_text.replace(prompt, '')}")

Query: Quelle est la limite maximale d'absences justifiées pour un élève ?
RAG answer:
<bos>La limite maximale d'absences justifiées pour un élève est de 30% du volume horaire du semestre. Si l'absence d'un élève dépasse cette limite, le semestre ne peut pas être validé et l'élève peut être proposé de répéter son année à titre exceptionnel.<eos>
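
The printed answer still contains the special tokens `<bos>` and `<eos>`, because the whole sequence is decoded and the prompt is then removed by string replacement. One possible alternative, sketched below, is to slice off the prompt tokens before decoding and to pass `skip_special_tokens=True` to `decode`. The helper name `decode_new_tokens` is hypothetical; the sketch assumes a causal model, for which `generate` returns the prompt tokens followed by the new tokens.

```python
def decode_new_tokens(tokenizer, output_ids, prompt_length):
    """Decode only the tokens generated after the prompt.

    tokenizer:     object with a Hugging Face style decode(ids, skip_special_tokens=...)
    output_ids:    1-D sequence of token ids for one example (prompt + answer)
    prompt_length: number of prompt tokens at the start of output_ids
    """
    new_ids = output_ids[prompt_length:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)


# Intended use with the objects from the cell above (not executed here):
# prompt_length = input_ids["input_ids"].shape[1]
# answer = decode_new_tokens(tokenizer, outputs[0], prompt_length)
```

Slicing by token count avoids relying on the decoded text matching the prompt string exactly, which can fail when the tokenizer adds or normalizes tokens.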
