## Procedure
We create two classes:

#### **AIAgent** : 
An AI agent that queries the Gemma LLM with a custom prompt. The prompt instructs Gemma to answer the query by referring to the context supplied alongside it. The agent's `generate` function returns the prompt together with the generated answer.

#### **RAGSystem** : 
Initialized with an `AIAgent` object and a vector database collection (here, an existing ChromaDB collection into which the documents have already been ingested). The class also has a `query` member function. In this function, we first run a similarity search for the query against the vector database. Then we call the `generate` function of the AI agent object. Before returning, we use a predefined template to compose the overall response from the question, the prompt, the answer and the retrieved context.
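To make the division of work between the two classes concrete, here is a minimal sketch of the same flow with stand-ins. `StubCollection`, `StubAgent` and `answer_question` are hypothetical placeholders (they are not part of this notebook), and word overlap replaces real embedding similarity. The sketch only illustrates how retrieval, generation and the response template fit together.

```python
class StubCollection:
    """Hypothetical stand-in for a vector database collection."""

    def __init__(self, documents):
        self.documents = documents

    def query(self, query_texts, n_results):
        # Rank documents by the number of words shared with the query
        # (a crude replacement for embedding similarity).
        words = set(query_texts[0].lower().split())
        ranked = sorted(
            self.documents,
            key=lambda doc: len(words & set(doc.lower().split())),
            reverse=True,
        )
        # Mimic the "one list of documents per query text" result layout.
        return {"documents": [ranked[:n_results]]}


class StubAgent:
    """Hypothetical stand-in for the LLM-backed agent."""

    def generate(self, query, context):
        prompt = f"Question: {query}\nContext: {context}\nAnswer:"
        answer = f"(model output for: {query})"
        return prompt, answer


def answer_question(agent, collection, query, num_docs=2):
    # Step 1: similarity search against the collection
    docs = collection.query(query_texts=[query], n_results=num_docs)["documents"][0]
    context = " | ".join(docs)
    # Step 2: generation with the retrieved context
    _prompt, answer = agent.generate(query, context)
    # Step 3: compose the overall response from a template
    return f"Question:\n{query}\n\nAnswer:\n{answer}\n\nContext:\n{context}"


collection = StubCollection([
    "GÉNIE INFORMATIQUE: Chef : Pr. EL BRAK Mohamed",
    "The library opens at nine",
])
print(answer_question(StubAgent(), collection,
                      "Who is the chef of GÉNIE INFORMATIQUE?", num_docs=1))
```

Keeping the agent and the retriever behind two small interfaces like this makes it possible to swap either side (another model, another vector store) without touching the other.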

## Package installation and configuration :

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM

from langchain.document_loaders import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

from IPython.display import display, Markdown
import pandas as pd



## AI Agent class :

In [2]:
import os
# Read the Hugging Face API token from the environment instead of hardcoding it.
# Set HUGGINGFACE_HUB_TOKEN in your shell before starting the notebook.
HUGGINGFACE_HUB_TOKEN = os.environ.get('HUGGINGFACE_HUB_TOKEN')

In [9]:
# model_name="google/gemma-2b-it"
# model_name="aymanboufarhi/gemma-fstt"
class AIAgent:
    """
    Gemma 2B assistant.
    Loads a Gemma 2B causal language model and its tokenizer with transformers
    (default checkpoint: aymanboufarhi/gemma2B-chat-bot-fstt).
    """
    def __init__(self, model_name="aymanboufarhi/gemma2B-chat-bot-fstt", max_length=256):
        self.max_length = max_length
        try:
            self.tokenizer = AutoTokenizer.from_pretrained(model_name, token=HUGGINGFACE_HUB_TOKEN)
            self.gemma_llm = AutoModelForCausalLM.from_pretrained(model_name, token=HUGGINGFACE_HUB_TOKEN)
        except Exception as e:
            # Chain the original exception so the full traceback is preserved
            raise ValueError(f"Error loading model: {e}") from e

    # Do not include other information.
    def create_prompt(self, query, context):
        # Prompt template
        prompt = f"""
        You are an AI Agent specialized to answer questions about FSTT (faculty of science and technology in Tanger).
        Explain the concept or answer the question about FSTT.
        In order to create the answer,you can use the
        context provided (Context). 
        Answer with simple words.
        If needed, include also explanations.
        it's importent to answer with french languge.
        Question: {query}
        Context: {context}
        Answer:
        """
        return prompt
    
    def generate(self, query, retrieved_info):
        prompt = self.create_prompt(query, retrieved_info)
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        
        # Answer generation
        answer_ids = self.gemma_llm.generate(input_ids, max_new_tokens=self.max_length)
        
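        # Decode and return the answer.
        # Note: answer_ids[0] holds the prompt tokens followed by the new tokens,
        # so the decoded answer starts with the prompt text.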
        answer = self.tokenizer.decode(answer_ids[0], skip_special_tokens=True)
        return prompt, answer
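
With a decoder-only model, the generated sequence contains the prompt followed by the newly generated tokens, which is why the answer returned by `generate` can repeat the prompt. Below is a minimal sketch of one way to keep only the new tokens, assuming the ids are an indexable sequence. The helper name `strip_prompt_ids` is hypothetical and not part of this notebook, and plain lists stand in for token ids.

```python
def strip_prompt_ids(output_ids, prompt_length):
    """Return only the ids generated after the prompt.

    output_ids: the full sequence (prompt ids followed by new ids).
    prompt_length: number of ids in the prompt.
    """
    return output_ids[prompt_length:]


# Toy example with plain lists standing in for token ids.
prompt_ids = [2, 17, 45, 9]
full_output = prompt_ids + [101, 102, 103]
new_ids = strip_prompt_ids(full_output, len(prompt_ids))
print(new_ids)  # [101, 102, 103]
```

Inside `AIAgent.generate`, the same idea would amount to decoding `answer_ids[0][input_ids.shape[-1]:]` instead of `answer_ids[0]`.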


### Test the AIAgent :

In [10]:
# Initialize the AI Agent
ai_agent = AIAgent()

Loading checkpoint shards: 100%|██████████| 2/2 [03:13<00:00, 96.76s/it] 


In [11]:
# Example usage with a sample query and context
query = "What is the contact information for the head of the department of GÉNIE INFORMATIQUE?"
context = "GÉNIE INFORMATIQUE: Chef : Pr. EL BRAK Mohamed\nEmail : melbrak@uae.ac.ma"

# Generate the response
prompt, answer = ai_agent.generate(query, context)

# Print the prompt and answer
print(f"Prompt: {prompt}\n")
print(f"Answer: {answer}")

Prompt: 
        You are an AI Agent specialized to answer questions about FSTT (faculty of science and technology in Tanger).
        Explain the concept or answer the question about FSTT.
        In order to create the answer,you can use the
        context provided (Context). 
        Answer with simple words.
        If needed, include also explanations.
        it's importent to answer with french languge.
        Question: What is the contact information for the head of the department of GÉNIE INFORMATIQUE?
        Context: GÉNIE INFORMATIQUE: Chef : Pr. EL BRAK Mohamed
Email : melbrak@uae.ac.ma
        Answer:
        

Answer: 
        You are an AI Agent specialized to answer questions about FSTT (faculty of science and technology in Tanger).
        Explain the concept or answer the question about FSTT.
        In order to create the answer,you can use the
        context provided (Context). 
        Answer with simple words.
        If needed, include also explanations.
    

In [12]:
class RAGSystem:
    """Retrieval-augmented generation (RAG) based on sentence embeddings.
       Given a ChromaDB collection, the retriever finds the num_retrieved_docs most relevant documents."""
    
    def __init__(self, ai_agent, collection, num_retrieved_docs=2):
        self.num_docs = num_retrieved_docs
        self.collection = collection
        self.ai_agent = ai_agent
        self.template = "\n\nQuestion:\n{question}\n\nPrompt:\n{prompt}\n\nAnswer:\n{answer}\n\nContext:\n{context}"
    
    def retrieve(self, query):
        # Retrieve the top-k documents most similar to the query.
        # results['documents'] holds one list of documents per query text.
        results = self.collection.query(query_texts=[query], n_results=self.num_docs)
        docs = list(results['documents'])
        return docs
    
    def query(self, query):
        # Retrieve the documents; only one query text was sent,
        # so take the first (and only) list of documents.
        context_docs = self.retrieve(query)
        context_docs = context_docs[0]
        print(context_docs)
        # Join the retrieved documents into a single context string
        context = " | ".join(context_docs[:self.num_docs])
        print(context)
        
        # prompt = f"""
        # You are an AI Agent specialized to answer questions about FSTT (faculty of science and technology in Tanger).
        # Explain the concept or answer the question about FSTT.
        # In order to create the answer, please only use the information from the
        # context provided (Context). Do not include other information.
        # Answer with simple words.
        # It's important to answer with french languge.
        # If needed, include also explanations.
        # Question: {query}
        # Context: {context}
        # Answer:
        # """
        
        # input_ids = self.collection.embedding_function.tokenizer(prompt, return_tensors="pt").input_ids
        # answer_ids = self.collection.embedding_function.model.generate(input_ids, max_new_tokens=256)
        # answer = self.collection.embedding_function.tokenizer.decode(answer_ids[0], skip_special_tokens=True)

        prompt, answer = self.ai_agent.generate(query, context)
        
        return self.template.format(question=query, prompt=prompt, answer=answer, context=context)
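
`retrieve` returns one list of documents per query text, which is why `query` takes `context_docs[0]`. The sketch below builds the context string from a result of that shape and drops repeated passages first. `build_context` and the sample `results` dictionary are illustrative assumptions, not part of this notebook.

```python
def build_context(results, num_docs, separator=" | "):
    """Join the top documents of the first query into one context string."""
    docs = results["documents"][0]  # documents for the first (only) query
    # dict.fromkeys keeps the first occurrence of each document, in order.
    unique_docs = list(dict.fromkeys(docs))
    return separator.join(unique_docs[:num_docs])


# Hand-written result in the {"documents": [[...]]} layout used above.
results = {"documents": [["doc A", "doc B", "doc A", "doc C"]]}
print(build_context(results, num_docs=3))  # doc A | doc B | doc C
```

Deduplicating before truncating to `num_docs` means repeated passages do not use up slots in the context.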

In [13]:
def colorize_text(text):
    for word, color in zip(["Question", "Prompt", "Answer", "Context"], ["blue", "magenta", "red", "green"]):
        text = text.replace(f"\n\n{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

In [14]:
import chromadb
def connect(): 
    # Connect to a ChromaDB server running on localhost:8000
    # and fetch the existing "text_embeddings" collection
    client = chromadb.HttpClient(host='localhost', port=8000)
    collection = client.get_collection(name="text_embeddings")
    return collection
# Initialize the RAGSystem with the existing collection
rag_system = RAGSystem(ai_agent=ai_agent, collection=connect(), num_retrieved_docs=4)

In [15]:
# Define the query ("Who is the dean?")
query = "quel est le doyen ?"

# Get the answer from the RAG system
response = rag_system.query(query)

# Print the response
display(Markdown(colorize_text(response)))

In [34]:
# query = '''<|system|>FSTT c'est la Faculté des Sciences et Techniques de Tanger 
# <|user|> Donne le nombre de départements avec les noms et informations de chaque departement
# <|assistant|>'''

# English: "Give the number of departments with the information for each department"
query = '''Donne le nombre de départements avec les informations de chaque departement'''

# Get the answer from the RAG system
response = rag_system.query(query)
# response = rag_system.query('''Donne le nombre de départements avec les noms et informations de chaque departement''')

# Print the response
display(Markdown(colorize_text(response)))

['suit lappel candidatur concour dinnov objet diffus récent centr yabd collabor centr régional dinvest organis vendred 29 mar session dinform prof professeur étudi', 'centr régional dinvest critt lanc deuxiem édit concour international dopen innov territory development challeng tdc nouveau concour ouvert tout porteur projet innov apport solut problémat propr région tang tetouan al hoceim critt invit tous intéress répond criter exig fair part cet nouvel aventur soumet projet plateform tdc1710496281', 'mettr létudi cœur problem technolog réel quil devr résoudr futur lauréat doit convaincr secteur industriel laboratoir capac scientif multidisciplinair décisionnel organisationnel', 'cet format permettr étudi dacquer spécialis environ répond exigent recherch fondamental appliqu apport ouvertur appliqu recherch secteur socioéconom']
suit lappel candidatur concour dinnov objet diffus récent centr yabd collabor centr régional dinvest organis vendred 29 mar session dinform prof professeur étudi



**<font color='blue'>Question:</font>**
Donne le nombre de départements avec les informations de chaque departement

**<font color='magenta'>Prompt:</font>**

        You are an AI Agent specialized to answer questions about FSTT (faculty of science and technology in Tanger).
        Explain the concept or answer the question about FSTT.
        In order to create the answer, please only use the information from the
        context provided (Context). 
        Answer with simple words.
        If needed, include also explanations.
        it's importent to answer with french languge.
        Question: Donne le nombre de départements avec les informations de chaque departement
        Context: suit lappel candidatur concour dinnov objet diffus récent centr yabd collabor centr régional dinvest organis vendred 29 mar session dinform prof professeur étudi | centr régional dinvest critt lanc deuxiem édit concour international dopen innov territory development challeng tdc nouveau concour ouvert tout porteur projet innov apport solut problémat propr région tang tetouan al hoceim critt invit tous intéress répond criter exig fair part cet nouvel aventur soumet projet plateform tdc1710496281 | mettr létudi cœur problem technolog réel quil devr résoudr futur lauréat doit convaincr secteur industriel laboratoir capac scientif multidisciplinair décisionnel organisationnel | cet format permettr étudi dacquer spécialis environ répond exigent recherch fondamental appliqu apport ouvertur appliqu recherch secteur socioéconom
        Answer:
        

**<font color='red'>Answer:</font>**

        You are an AI Agent specialized to answer questions about FSTT (faculty of science and technology in Tanger).
        Explain the concept or answer the question about FSTT.
        In order to create the answer, please only use the information from the
        context provided (Context). 
        Answer with simple words.
        If needed, include also explanations.
        it's importent to answer with french languge.
        Question: Donne le nombre de départements avec les informations de chaque departement
        Context: suit lappel candidatur concour dinnov objet diffus récent centr yabd collabor centr régional dinvest organis vendred 29 mar session dinform prof professeur étudi | centr régional dinvest critt lanc deuxiem édit concour international dopen innov territory development challeng tdc nouveau concour ouvert tout porteur projet innov apport solut problémat propr région tang tetouan al hoceim critt invit tous intéress répond criter exig fair part cet nouvel aventur soumet projet plateform tdc1710496281 | mettr létudi cœur problem technolog réel quil devr résoudr futur lauréat doit convaincr secteur industriel laboratoir capac scientif multidisciplinair décisionnel organisationnel | cet format permettr étudi dacquer spécialis environ répond exigent recherch fondamental appliqu apport ouvertur appliqu recherch secteur socioéconom
        Answer:
        FSTT compte 12 départements.
        Pour chaque département, on peut accéder à la page de celui-ci.
        Lien de la page de chaque département est fourni dans le contexte.

**<font color='green'>Context:</font>**
suit lappel candidatur concour dinnov objet diffus récent centr yabd collabor centr régional dinvest organis vendred 29 mar session dinform prof professeur étudi | centr régional dinvest critt lanc deuxiem édit concour international dopen innov territory development challeng tdc nouveau concour ouvert tout porteur projet innov apport solut problémat propr région tang tetouan al hoceim critt invit tous intéress répond criter exig fair part cet nouvel aventur soumet projet plateform tdc1710496281 | mettr létudi cœur problem technolog réel quil devr résoudr futur lauréat doit convaincr secteur industriel laboratoir capac scientif multidisciplinair décisionnel organisationnel | cet format permettr étudi dacquer spécialis environ répond exigent recherch fondamental appliqu apport ouvertur appliqu recherch secteur socioéconom