# THIS NOTEBOOK IS A VARIATION

Original is here:
https://colab.research.google.com/drive/1Z7DSkB7uZJNomPTjbHSwfr8Rb0BLg-Tj#scrollTo=d7Kw_t3mpBDx  

We duplicated the notebook to compare the performance of the large embedding model.  

Spoiler: large performs way better in French.

## Define Tokens and API keys needed for this notebook

We set up env vars using Colab Secrets.  

We need Huggingface to download and upload the dataset, OpenAI for the embeddings, and Groq for LLM inference on RAG context.  

In [None]:
# load secrets for all api keys using Colab Secrets
from google.colab import userdata

In [None]:
# login to HF HUB
from huggingface_hub import login

login(userdata.get('HF_TOKEN'))

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
# Set OPENAI and GROQ env vars
import os

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')

## install libs  

We use tiktoken to count tokens before embedding, as there is a limit on model input length.  
FAISS-CPU is needed to retrieve close embeddings.  
We use groq as it is free (for now!)

In [None]:
%%capture
!pip install datasets
!pip install langchain
!pip install langchain-groq
!pip install openai
!pip install tiktoken
!pip install faiss-cpu

## dataset creation (adding embeddings)

Skip this part if the dataset is already made.

Set this variable to `True` to run the dataset creation process

In [None]:
dataset_creation = False

### Download base dataset

In [None]:
from datasets import load_dataset

if dataset_creation:
  # declaration_ds = load_dataset("the-french-artist/hatvp_declarations_xml_plus_json_plus_index", split='train')
  declaration_ds = load_dataset("the-french-artist/hatvp_declarations_text_embeds", split='train')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/484 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/137M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10944 [00:00<?, ? examples/s]

### Check dataset content

Check features and content of a text record before embedding.

In [None]:
# declaration_ds

Dataset({
    features: ['xml_sha1', 'declaration_xml', 'declaration_json', 'extracted_text', 'text_embedding'],
    num_rows: 10944
})

In [None]:
# print(declaration_ds.select(range(1)).to_pandas().extracted_text.to_list()[0][:200])

Fiche de damien abad - député/ain(01) 
 ------------ 
11/07/2022 15:40:13
4344aaa1-874d-4e6d-9b1a-45f7725b710c
adel
true
vue_pdf_du_recepisse_du_depot_xml

20171221
true
true
creation
[données non pub


In [None]:
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

def truncate_text_to_stay_under_openai_embedding_limit(input_text):

  backup_input_text = input_text

  openai_embed_limit = 8192
  delta = num_tokens_from_string(input_text, "cl100k_base") - openai_embed_limit
  while delta > 0:
    input_text = input_text[:-int(delta*2)] #we add factor 2 to speed up the process
    delta = num_tokens_from_string(input_text, "cl100k_base") - openai_embed_limit

  if len(input_text) > 0:
    return input_text
  else:
    return backup_input_text[:8000]
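
The loop above trims characters and re-counts tokens until the text fits. As a hedged alternative, the cut can be made once at the token level: encode, slice, decode. The sketch below takes the encoder and decoder as parameters; in this notebook they would presumably be the `encode` and `decode` methods of `tiktoken.get_encoding("cl100k_base")`. The whitespace tokenizer is only a stand-in so the snippet runs without extra dependencies, and `truncate_to_token_limit` is a hypothetical name, not part of the notebook.

```python
from typing import Callable, List, Sequence


def truncate_to_token_limit(
    text: str,
    limit: int,
    encode: Callable[[str], List],
    decode: Callable[[Sequence], str],
) -> str:
    """Return `text` cut down to at most `limit` tokens."""
    tokens = encode(text)
    if len(tokens) <= limit:
        return text
    return decode(tokens[:limit])


# Stand-in tokenizer (assumption): one token per whitespace-separated word.
def toy_encode(text: str) -> List[str]:
    return text.split()


def toy_decode(tokens: Sequence[str]) -> str:
    return " ".join(tokens)


short = truncate_to_token_limit("fiche de damien abad", 3, toy_encode, toy_decode)
print(short)  # fiche de damien
```

With a real byte-level tokenizer, a token slice may end in the middle of a multi-byte character, so the decoded tail can need a small cleanup.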

In [None]:
from openai import OpenAI
client = OpenAI()
from tqdm.auto import tqdm
tqdm.pandas()


def get_embedding(text, model="text-embedding-3-large"):  # this is where we switch to the large embedding model - WARNING: 10x more expensive
   text = text.replace("\n", " ")
   text = truncate_text_to_stay_under_openai_embedding_limit(text)
   return client.embeddings.create(input = [text], model=model).data[0].embedding

def get_multiple_embeddings(text_list, model="text-embedding-3-large"):  # this is where we switch to the large embedding model - WARNING: 10x more expensive
   clean_text_list = []
   for text in text_list:
     clean_text_list.append(truncate_text_to_stay_under_openai_embedding_limit(text))

   response = client.embeddings.create(input = clean_text_list, model=model)
   embedding_list = []
   for curr_data in response.data:
     embedding_list.append(curr_data.embedding)

   return embedding_list

len(get_multiple_embeddings(["first sentence", "second sentence", "third sentence"]))

3

In [None]:
def get_embedding_for_map(row):
  row['text_embedding'] = get_embedding(row['extracted_text'])
  return row


def get_embedding_for_map_batch(row):
  row['text_embedding'] = get_multiple_embeddings(row['extracted_text'])
  return row

In [None]:
if dataset_creation:
  # declaration_ds = declaration_ds.map(get_embedding_for_map, num_proc=2)
  declaration_ds = declaration_ds.map(get_embedding_for_map_batch, num_proc=2, batched=True, batch_size=100)

Map (num_proc=2):   0%|          | 0/10944 [00:00<?, ? examples/s]

### Dataset checks before upload

Uncomment to check features and content

In [None]:
# declaration_ds

Dataset({
    features: ['xml_sha1', 'declaration_xml', 'declaration_json', 'extracted_text', 'text_embedding'],
    num_rows: 10944
})

In [None]:
# declaration_ds.select(range(100)).to_pandas()

Unnamed: 0,xml_sha1,declaration_xml,declaration_json,extracted_text,text_embedding
0,0a0a9f2a6772942557ab5355d76af442f8f65e01,<declaration><dateDepot>11/07/2022 15:40:13</d...,"{""declaration"": {""dateDepot"": ""11/07/2022 15:4...",Fiche de damien abad - député/ain(01) \n -----...,"[-0.009466919116675854, -0.008166174404323101,..."
1,0a0a9f2a6772942557ab5355d76af442f8f65e01,<declaration><dateDepot>27/11/2022 18:18:23</d...,"{""declaration"": {""dateDepot"": ""27/11/2022 18:1...",Fiche de damien abad - député/ain(01) \n -----...,"[-0.010201191529631615, 0.0008864270057529211,..."
2,0a0a9f2a6772942557ab5355d76af442f8f65e01,<declaration><dateDepot>19/08/2022 10:08:23</d...,"{""declaration"": {""dateDepot"": ""19/08/2022 10:0...",Fiche de caroline abadie - député/isère(38) \n...,"[0.0018713556928560138, 0.00048493893700651824..."
3,0a0a9f2a6772942557ab5355d76af442f8f65e01,<declaration><dateDepot>04/10/2022 17:22:07</d...,"{""declaration"": {""dateDepot"": ""04/10/2022 17:2...",Fiche de caroline abadie - député/isère(38) \n...,"[0.0008672993280924857, -0.003786470042541623,..."
4,0a0a9f2a6772942557ab5355d76af442f8f65e01,<declaration><dateDepot>20/09/2021 13:41:36</d...,"{""declaration"": {""dateDepot"": ""20/09/2021 13:4...",Fiche de joelle abadie - elu départemental/hau...,"[-0.0008490124018862844, -0.01616096682846546,..."
...,...,...,...,...,...
95,0a0a9f2a6772942557ab5355d76af442f8f65e01,<declaration><dateDepot>10/09/2020 12:31:53</d...,"{""declaration"": {""dateDepot"": ""10/09/2020 12:3...",Fiche de claude alemagna - membre d’epci/dracé...,"[-0.02202477678656578, -0.018816333264112473, ..."
96,0a0a9f2a6772942557ab5355d76af442f8f65e01,<declaration><dateDepot>20/09/2021 22:53:49</d...,"{""declaration"": {""dateDepot"": ""20/09/2021 22:5...",Fiche de claude alemagna - membre d’epci/dracé...,"[-0.026265963912010193, -0.014384419657289982,..."
97,0a0a9f2a6772942557ab5355d76af442f8f65e01,<declaration><dateDepot>26/09/2021 20:40:19</d...,"{""declaration"": {""dateDepot"": ""26/09/2021 20:4...",Fiche de claude alemagna - membre d’epci/dracé...,"[-0.025565914809703827, -0.012604921124875546,..."
98,0a0a9f2a6772942557ab5355d76af442f8f65e01,<declaration><dateDepot>26/09/2021 21:01:41</d...,"{""declaration"": {""dateDepot"": ""26/09/2021 21:0...",Fiche de claude alemagna - membre d’epci/dracé...,"[-0.02375740557909012, -0.01477577444165945, -..."


### Backup dataset to HF HUB

In [None]:
if dataset_creation:
  declaration_ds.push_to_hub("the-french-artist/hatvp_declarations_text_embeds")
  # declaration_ds.push_to_hub("the-french-artist/hatvp_declarations_text_index_embeds")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/11 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/484 [00:00<?, ?B/s]

## RAG inference (no GPU needed)

Once the dataset is created and saved to the HF Hub, we can load it and perform inference from there.  

### Download dataset w/ embeddings

In [None]:
from datasets import load_dataset
embed_ds = load_dataset("the-french-artist/hatvp_declarations_text_embeds", split='train')

Downloading readme:   0%|          | 0.00/484 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/231M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10944 [00:00<?, ? examples/s]

### Create vector index feature using FAISS

In [None]:
embed_ds.add_faiss_index(column='text_embedding')

  0%|          | 0/11 [00:00<?, ?it/s]

Dataset({
    features: ['xml_sha1', 'declaration_xml', 'declaration_json', 'extracted_text', 'text_embedding'],
    num_rows: 10944
})

### Define Query functions

We embed the user query and retrieve the closest record(s) that match it.  

In [None]:
import numpy as np

def perform_query(query, n_samples=1):
  # query_embed = model.encode([query])
  query_embed = np.array(get_embedding(query.lower()))
  scores, retrieved_examples = embed_ds.get_nearest_examples('text_embedding', query_embed, k=n_samples)
  return retrieved_examples['declaration_json']
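
To make the retrieval step less of a black box, here is a minimal pure-Python sketch of what a flat nearest-neighbour search does conceptually. It assumes vectors are compared by L2 distance; the metric actually used by `add_faiss_index` depends on its `metric_type` and `string_factory` arguments. The function names and toy vectors are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_examples(
    query: Sequence[float],
    embeddings: List[Sequence[float]],
    k: int = 1,
) -> List[Tuple[float, int]]:
    """Return the k (distance, row index) pairs closest to the query."""
    scored = [(l2_distance(query, emb), idx) for idx, emb in enumerate(embeddings)]
    return sorted(scored)[:k]


# Toy 2-dimensional "embeddings" (assumption: real ones have thousands of dimensions).
toy_embeddings = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
top = nearest_examples([1.0, 0.0], toy_embeddings, k=2)
print([idx for _, idx in top])  # [1, 2]
```

FAISS performs the same comparison far more efficiently, which is why the notebook delegates to it instead of looping in Python.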

We also define a function that extracts the declarant's surname and first name (`nom`, `prenom`) from a retrieved declaration.  

In [None]:
import json

def get_name_surname_from_str_declaration(input_str_json):
  parsed_json = json.loads(input_str_json)
  return parsed_json['declaration']['general']['declarant']['nom'], parsed_json['declaration']['general']['declarant']['prenom']

### Perform some tests on embeddings

We simply ask questions and check whose record shows up.  
We show the 10 best matches.  
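
Reading ten names per query by eye gets tedious, so a small helper can report the rank at which the expected person first appears. This is a sketch rather than part of the original notebook: `rank_of` is a hypothetical name, and the comparison ignores case because the declarations mix `ABAD` and `Abad` styles.

```python
from typing import List, Optional, Tuple

# (nom, prenom), as returned by get_name_surname_from_str_declaration
Person = Tuple[str, str]


def rank_of(expected: Person, retrieved: List[Person]) -> Optional[int]:
    """1-based rank of the first case-insensitive match, or None if absent."""
    target = tuple(part.lower() for part in expected)
    for position, person in enumerate(retrieved, start=1):
        if tuple(part.lower() for part in person) == target:
            return position
    return None


# A few names copied from the first test output in this notebook.
retrieved = [
    ("ABAD", "DAMIEN"),
    ("ABAD", "DAMIEN"),
    ("ADAM", "Damien"),
    ("Maudet", "Damien"),
]
print(rank_of(("Abad", "Damien"), retrieved))    # 1
print(rank_of(("Maudet", "Damien"), retrieved))  # 4
print(rank_of(("Hurmic", "Pierre"), retrieved))  # None
```

Averaging "rank is 1" over a list of test questions gives a success rate that can be compared between embedding models.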

In [None]:
results = perform_query("Qui est Damien Abad ?", 10)
for result in results:
  print(get_name_surname_from_str_declaration(result))

('ABAD', 'DAMIEN')
('ABAD', 'DAMIEN')
('ADAM', 'Damien')
('ADAM', 'Damien')
('Maudet', 'Damien')
('Maudet', 'Damien')
('Delavoie', 'Damien')
('Charlet', 'Damien')
('Charlet', 'Damien')
('Allouch', 'Damien')


In [None]:
results = perform_query("Quel est le salaire de Damien Abad en 2019 ?", 10)
for result in results:
  print(get_name_surname_from_str_declaration(result))

('ABAD', 'DAMIEN')
('ABAD', 'DAMIEN')
('ADAM', 'Damien')
('ADAM', 'Damien')
('Delavoie', 'Damien')
('Maudet', 'Damien')
('Maudet', 'Damien')
('Allouch', 'Damien')
('HUGUET', 'Damien')
('ABADIE', 'Caroline')


In [None]:
results = perform_query("Qui est Damien ABAD? Et quel est son salaire en 2019?", 10)
for result in results:
  print(get_name_surname_from_str_declaration(result))

('ABAD', 'DAMIEN')
('ABAD', 'DAMIEN')
('ADAM', 'Damien')
('ADAM', 'Damien')
('Delavoie', 'Damien')
('Maudet', 'Damien')
('Maudet', 'Damien')
('Allouch', 'Damien')
('HUGUET', 'Damien')
('Charlet', 'Damien')


In [None]:
results = perform_query("Qui est Antoine ARMAND?", 10)
for result in results:
  print(get_name_surname_from_str_declaration(result))

('Armand', 'Antoine')
('Armand', 'Antoine')
('Armand', 'Antoine')
('HOAREAU', 'Antoine')
('HOAREAU', 'Antoine')
('MADELIN', 'Antoine')
('AUDEGOND', 'Armand')
('MADELIN', 'Antoine')
('Chereau', 'Antoine')
('HOAREAU', 'Antoine')


In [None]:
results = perform_query("Qui est un Député de Haute-Savoie?", 10)
for result in results:
  print(get_name_surname_from_str_declaration(result))

('Duby-Muller', 'Virginie')
('COULOMME', 'Jean-François')
('Violland', 'Anne-Cécile')
('Violland', 'Anne-Cécile')
('Duby-Muller', 'Virginie')
('Petex', 'Christelle')
('Duby-Muller', 'Virginie')
('NOEL', 'Sylviane')
('ROSEREN', 'xavier')
('rolland', 'vincent')


In [None]:
results = perform_query("Qui est une Députée de Haute-Savoie?", 10)
for result in results:
  print(get_name_surname_from_str_declaration(result))

('Violland', 'Anne-Cécile')
('Violland', 'Anne-Cécile')
('Duby-Muller', 'Virginie')
('NOEL', 'Sylviane')
('Petex', 'Christelle')
('NOEL', 'Sylviane')
('Petex', 'Christelle')
('Duby-Muller', 'Virginie')
('Bonnivard', 'Emilie')
('Duby-Muller', 'Virginie')


In [None]:
results = perform_query("Le conjoint de Antoine ARMAND est il un homme ou une femme?", 10)
for result in results:
  print(get_name_surname_from_str_declaration(result))

('Armand', 'Antoine')
('Armand', 'Antoine')
('Armand', 'Antoine')
('HOAREAU', 'Antoine')
('HOAREAU', 'Antoine')
('JEAN', 'Antoine')
('MADELIN', 'Antoine')
('JEAN', 'Antoine')
('MADELIN', 'Antoine')
('AUDEGOND', 'Armand')


In [None]:
results = perform_query("Quel est l'ensemble des revenus perçus par Antoine ARMAND?", 10)
for result in results:
  print(get_name_surname_from_str_declaration(result))

('Armand', 'Antoine')
('Armand', 'Antoine')
('Armand', 'Antoine')
('ardouin', 'jean philippe')
('ardouin', 'jean philippe')
('de Bourrousse', 'Arnaud')
('Arrighi de Casanova', 'Jacques')
('VERAN', 'ANTOINE')
('MADELIN', 'Antoine')
('Arciero', 'Anthony')


In [None]:
results = perform_query("Le nom du déclarant est Antoine ARMAND?", 10)
for result in results:
  print(get_name_surname_from_str_declaration(result))

('Armand', 'Antoine')
('Armand', 'Antoine')
('Armand', 'Antoine')
('AUDEGOND', 'Armand')
('ARMAND', 'Jean-Luc')
('MADELIN', 'Antoine')
('Quenette', 'Marc-Antoine')
('MADELIN', 'Antoine')
('HOAREAU', 'Antoine')
('ardouin', 'jean philippe')


In [None]:
results = perform_query("Qui est le maire de Bordeaux?", 10)
for result in results:
  print(get_name_surname_from_str_declaration(result))

('HURMIC', 'Pierre')
('HURMIC', 'Pierre')
('Jacotot', 'Sandrine')
('JEANJEAN', 'Didier')
('HURMIC', 'Pierre')
('BLANC', 'BERNARD')
('Bouisson', 'Dominique')
('maurin', 'vincent')
('CAZAUX', 'OLIVIER')
('JEANJEAN', 'Didier')


In [None]:
results = perform_query("Qui est maire de Bordeaux?", 10)
for result in results:
  print(get_name_surname_from_str_declaration(result))

('Jacotot', 'Sandrine')
('HURMIC', 'Pierre')
('JEANJEAN', 'Didier')
('HURMIC', 'Pierre')
('BLANC', 'BERNARD')
('Bouisson', 'Dominique')
('HURMIC', 'Pierre')
('maurin', 'vincent')
('CAZAUX', 'OLIVIER')
('JEANJEAN', 'Didier')


In [None]:
results = perform_query("Qui est pierre hurmic?", 10)
for result in results:
  print(get_name_surname_from_str_declaration(result))

('HURMIC', 'Pierre')
('HURMIC', 'Pierre')
('HURMIC', 'Pierre')
('HURMIC', 'Pierre')
('Huguet', 'Pierre')
('MICHEL', 'Pierre')
('Giran', 'Jean-Pierre')
('meriaux', 'pierre')
('Giran', 'Jean-Pierre')
('MICHEL', 'Pierre')


### Define RAG functions

We use the query functions defined previously to build a context, then pass this context with a prompt to an LLM.  

In [None]:
from langchain.schema.output_parser import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

In [None]:
def get_answer_to_question_no_context(question, llm_to_use):

  system = """You are an assistant for question-answering tasks.
  Use three sentences maximum and keep the answer concise.
  """
  human = "{text}"
  prompt = ChatPromptTemplate.from_messages([("system", system), ("human", human)])
  actual_prompt = f"""
  Question: {question}
  Answer:
  """

  chain = prompt | llm_to_use | StrOutputParser()
  return chain.invoke({"text": actual_prompt})

In [None]:
def get_RAG_request(question, llm_to_use):

  system = """You are an assistant for document retrieval tasks.
  You have access to a database of entries related to french politicians.
  Each entry is identified by a name and a surname.
  When provided with a question, return the name and surname of the most probable data entry.
  Only respond with the name and surname.

  """
  human = "{text}"
  prompt = ChatPromptTemplate.from_messages([("system", system), ("human", human)])
  actual_prompt = f"""
  Question: {question}
  Answer:
  """

  chain = prompt | llm_to_use | StrOutputParser()
  return chain.invoke({"text": actual_prompt})

In [None]:
def get_answer_to_question(question, llm_to_use):

  system = """You are an assistant for question-answering tasks.
  Use the following pieces of retrieved context to answer the question.
  If you don't know the answer, just say that you don't know.
  Use three sentences maximum and keep the answer concise.
  """
  human = "{text}"
  prompt = ChatPromptTemplate.from_messages([("system", system), ("human", human)])

  results = perform_query(question, 1)
  context = ''.join(results)  # concatenate the retrieved results (a single one here) into a context
  actual_prompt = f"""
  Question: {question}
  Context: {context}
  Answer:
  """

  # print(context)
  for result in results:
    print(get_name_surname_from_str_declaration(result))
  chain = prompt | llm_to_use | StrOutputParser()
  return chain.invoke({"text": actual_prompt})

In [None]:
def get_answer_to_question_improved(question, llm_to_use):

  system = """You are an assistant for question-answering tasks.
  Use the following pieces of retrieved context to answer the question.
  If you don't know the answer, just say that you don't know.
  Use three sentences maximum and keep the answer concise.
  """
  human = "{text}"
  prompt = ChatPromptTemplate.from_messages([("system", system), ("human", human)])

  retrieval_question = get_RAG_request(question, llm_to_use)
  print(f"Retrieval question : {retrieval_question}")
  print()
  results = perform_query(retrieval_question, 1)
  context = ''.join(results)  # concatenate the retrieved results (a single one here) into a context
  actual_prompt = f"""
  Question: {question}
  Context: {context}
  Answer:
  """

  print("Most relevant context(s) : ")
  for result in results:
    print(get_name_surname_from_str_declaration(result))
  print()
  chain = prompt | llm_to_use | StrOutputParser()
  return chain.invoke({"text": actual_prompt})

### Perform some RAG tests

In [None]:
question = "Qui est Damien ABAD? Et quel est son salaire en 2019?"

In [None]:
mixtral_llm = ChatGroq(temperature=0, model_name="mixtral-8x7b-32768")
llm_llama3_70B = ChatGroq(temperature=0, model_name="llama3-70b-8192")
llm_llama3_8B = ChatGroq(temperature=0, model_name="llama3-8b-8192")

get_answer_to_question(question, llm_llama3_70B)

('ABAD', 'DAMIEN')


'Damien Abad is a French politician, specifically a Député (Member of Parliament). His salary in 2019 was €71,105.'

In [None]:
get_answer_to_question(question, llm_llama3_8B)

('ABAD', 'DAMIEN')


"According to the provided context, Damien Abad's salary in 2019 was 71,105 euros."

In [None]:
get_answer_to_question("Combien est payé Damien Abad en 2019?", llm_llama3_8B)

('ABAD', 'DAMIEN')


'According to the provided context, Damien Abad was paid 71,105 euros in 2019.'

In [None]:
print(get_answer_to_question("Quel est l'ensemble des revenus perçus par Damien Abad?", llm_llama3_8B))

('ABAD', 'DAMIEN')
The ensemble des revenus perçus par Damien Abad includes:

* Rémunération de député : 67 047 € en 2017, 71 042 € en 2018, 71 105 € en 2019, 70 773 € en 2020, 70 676 € en 2021 et 27 289 € en 2022.
* Rémunération de conseiller départemental : 28 007 € en 2017, 24 201 € en 2018, 16 386 € en 2019, 16 386 € en 2020, 16 384 € en 2021 et 6 827 € en 2022.
* Rémunération de président du groupe LR à l'Assemblée nationale : 0 € en 2019, 0 € en 2020, 0 € en 2021 et 0 € en 2022.
* Rémunération de président du SDIS 01 : 0 € en 2015, 0 € en 2016 et 0 € en 2017.
* Rémunération de président d'Aintourisme : 0 € en 2017, 0 € en 2018, 0 € en 2019, 0 € en 2020, 0 € en 2021 et 0 € en 2022.
* Rémunération de président du groupe Les Républicains à l'Assemblée nationale : 0 € en 2019, 0 € en 2020, 0 € en 2021 et 0 € en 2022.
* Rémunération de président de l'association Saveurs de l'Ain : 0 € en 2019, 0 € en 2020, 0 € en 2021 et 0 € en 2022.
* Rémunération de président de l'association Les Am

In [None]:
print(get_answer_to_question("Où travaille le conjoint de Damien Abad?", llm_llama3_8B))

('ABAD', 'DAMIEN')
Le conjoint de Damien Abad travaille au Centre Hospitalier du Haut-Bugey.


In [None]:
print(get_answer_to_question("Où travaille le conjoint de Damien Abad?", llm_llama3_70B))

('ABAD', 'DAMIEN')
Le conjoint de Damien Abad travaille au Centre Hospitalier du Haut-Bugey en tant qu'infirmière.


In [None]:
print(get_answer_to_question("Où travaille le conjoint de Damien Abad?", mixtral_llm))

('ABAD', 'DAMIEN')
The spouse of Damien Abad works as an "Infirmière" (nurse) at the "CENTRE HOSPITALIER DU HAUT-BUGEY".


In [None]:
print(get_answer_to_question("Qui est une infirmière?", llm_llama3_8B))

# we do indeed find a nurse
# but the only LLM that has a high enough token limit
# is too weak to respond correctly

('Hartmann', 'Delphine')
An infirmière (nurse) is a healthcare professional who provides medical care to patients.


In [None]:
print(get_answer_to_question("Le conjoint de Damien Abad est il un homme ou une femme?", llm_llama3_8B))

('ABAD', 'DAMIEN')
Le conjoint de Damien Abad est une femme.


In [None]:
print(get_answer_to_question("Le conjoint de Damien Abad est il un homme ou une femme?", llm_llama3_70B))

('ABAD', 'DAMIEN')
Le conjoint de Damien Abad est une femme, car il est mentionné que son conjoint est infirmière.


In [None]:
print(get_answer_to_question("Le conjoint de Damien Abad est il un homme ou une femme?", mixtral_llm))

('ABAD', 'DAMIEN')
The context does not provide information on Damien Abad's spouse's gender. The provided document focuses on Damien Abad's professional activities and declarations of interests.


In [None]:
print(get_answer_to_question("Où travaille le conjoint de Loïc HERVÉ le sénateur de Haute Savoie?", mixtral_llm))

('HERVE', 'loic')
According to the provided context, Loïc HERVÉ's spouse is an Attachée principale (FPT, catégorie A) and works for Collectivités locales.


In [None]:
print(get_answer_to_question("Où travaille le conjoint de Antoine Armand?", mixtral_llm))

('Armand', 'Antoine')
The spouse of Antoine Armand works for the "Ministère de l'Economie et des Finances" as an "Inspecteur des finances".


In [None]:
print(get_answer_to_question("Où travaille le conjoint de Eric PARRA", llm_llama3_70B))

('PARRA', 'ERIC')
Le conjoint d'Eric PARRA travaille comme assistante de vie.


In [None]:
print(get_answer_to_question("Quel est le plus récent emploi d'Eric PARRA? Et sa rémunération?", mixtral_llm))

('PARRA', 'ERIC')
Eric PARRA's most recent job is as an assistant of life. His remuneration for this position in 2020 was 7,000. However, if you're asking about his remuneration for his mandates, he received 6,650 as Vice President of the Communauté d'Agglomération du Grand Narbonne and 1,545 as Conseller Municipal of the Ville de Narbonne in 2020.


In [None]:
print(get_answer_to_question("Combien gagne le maire de Bordeaux?", llm_llama3_70B))

('BLANC', 'BERNARD')
The mayor of Bordeaux's salary is not explicitly stated in the provided context. However, the remuneration for the "RESPONSABLE DU SERVICE DES IMPOTS DES PARTICULIERS" position is mentioned, with annual amounts ranging from €34,134 in 2020 to €68,317 in 2017.


In [None]:
get_answer_to_question_no_context("Qui est le maire de Bordeaux", mixtral_llm)

'Pierre Hurmic is the current mayor of Bordeaux, France. He took office on July 4, 2020. He is a member of the Ecologist party.'

In [None]:
get_answer_to_question_no_context("Combien gagne le maire de Bordeaux", mixtral_llm)

"The mayor of Bordeaux, Pierre Hurmic, earns a salary of around 5,257 euros gross per month as of 2022. This amount can vary depending on the mayor's additional responsibilities and the decisions of the municipal council. It's important to note that this figure is subject to change."

In [None]:
get_RAG_request("Qui est le maire de Bordeaux?", mixtral_llm)

'Pierre Hurmic'

In [None]:
get_RAG_request("Combien gagne le maire de Bordeaux", mixtral_llm)

'Pierre Hurmic'

In [None]:
get_answer_to_question_improved("Combien d'argent est payé le maire de Bordeaux?", llm_llama3_8B)

Retrieval question : Gilles Savary

Most relevant context(s) : 
('Savry', 'Gilles')



'According to the provided context, the mayor of Bordeaux is not mentioned. The context appears to be a declaration of interests for a person named Gilles Savry, who is the mayor of Argenteuil, not Bordeaux.'

# Conclusions!

All test conclusions go here.

## CONCLUSION OF THE MARKDOWN TEST


- the OpenAI embedding system is very slow (1 hour for 10k embeddings) but at least it works, unlike the lighter open-source models (in French, that is).  

- we cannot get interesting information because the retrieval query is very different from the actual query. E.g. "What is XXX's salary in 2019?": the retrieval query will be "get me XXX's data" while the generative query is the input one.  

- we are missing information from the Markdown files: after 3 days of intensive attempts to build an SQL database, then a simple Markdown file, from the input XML files, we still cannot guarantee that entire sections are not missing, due to the fickle nature of the context


### What to do next:  

- find a way to get embeddings faster and cheaper: we need to run the system in a dedicated notebook with enough VRAM, using a quantized version I suppose (given that we do not have enough VRAM, either locally or online)  
- Make a new dataset from the XML base declarations that simply converts XML to JSON => this will make the context lighter and remove the need for massive OO classes that still fail to parse the files, even after writing more than 250 if statements on "None".  
- find a way to execute online: this system will still need a local FAISS index and/or a local embedding model for the dataset creation, as well as access to an LLM for retrieval => can we get free (or near free) online services to perform these operations? (Groq is already a good thing).  

## CONCLUSION OF THE JSON TEST

All categories are complete, since the conversion cannot drop sections (we use the `xmltodict` library to convert XML to JSON).  
However, this doesn't seem to prevent problems:  
- we cannot match a given person's declaration! Even when setting up the query perfectly, the right record is in 5th position, with 4 irrelevant matches before it. Even after analyzing all semantically related text, we cannot explain this weird occurrence...
- this use of JSON instead of XML decreases char count by an average of 25%; which is nice but not enough to use several declarations in a single context request (only Mixtral was able to process it).  

````
RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama3-70b-8192` in organization `org_01hw5634epebfvhswdt88pdsm3` on tokens per minute (TPM): Limit 7000, Used 0, Requested ~7485. Please try again in 4.157142857s. Visit https://console.groq.com/docs/rate-limits for more information.', 'type': 'tokens', 'code': 'rate_limit_exceeded'}}
````
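
In the error above, the request alone (~7485 tokens) is larger than the 7000 tokens-per-minute limit, so the prompt itself has to shrink. One hedged way to handle this is to keep retrieved documents, in ranked order, only while they fit a token budget. The sketch below is an assumption about how this could be done, not what the notebook does: `pack_context` is a hypothetical helper, and the word-based counter stands in for a real tokenizer.

```python
from typing import Callable, List


def pack_context(
    documents: List[str],
    token_budget: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Keep documents in ranked order until the next one would exceed the budget."""
    kept: List[str] = []
    used = 0
    for doc in documents:
        cost = count_tokens(doc)
        if used + cost > token_budget:
            break
        kept.append(doc)
        used += cost
    return kept


def rough_token_count(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


docs = ["a b c", "d e", "f g h i"]
print(pack_context(docs, 5, rough_token_count))  # ['a b c', 'd e']
```

The budget should leave room for the system prompt, the question, and the model's answer, not only the retrieved context.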

### What to do next

- extract text from JSON so the embeddings are less confused by keys, maybe?  
- we noticed a change in performance between ALLCAPS and all-lowercase text, so we will preventively lowercase all text and all queries to deal with this problem.
- recompute embeddings and test results.  

## CONCLUSION OF THE TEXT TEST

We have better retrieval rates: Damien Abad is perfectly recognized in our tests. However, Antoine Armand is not, at all. It seems that we can also retrieve people by their title, like "Député de la Haute-Savoie", but not perfectly, as many will in fact be "Senateurs" or "Député de Savoie" instead.  

We still rely on full JSON inclusion in the context as the text only version is too cryptic for an LLM to analyze its contents.  

### What to do next

- CHUNKING! We will chunk each declaration into large subsections (the main keys under "declaration") and prefix them with the name and work title of the person making the declaration. This way, we will be able to ask precise questions about something related to someone.  
- Embedding: it seems that we can pass up to 2048 embedding requests at once (see https://cookbook.openai.com/examples/embedding_wikipedia_articles_for_search ) so we should do that to handle the increased needs due to chunking.  
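
The two points above can be sketched as follows, under stated assumptions. Only the `declaration > general > declarant > nom / prenom` path comes from this notebook; the section name `sectionA`, the `job_title` argument, and the helper names are hypothetical. The 2048 batch size is the figure quoted above from the OpenAI cookbook.

```python
import json
from typing import Dict, List


def chunk_declaration(declaration_json: str, job_title: str) -> List[Dict[str, str]]:
    """Split a declaration into one chunk per top-level key, each prefixed with name and title."""
    declaration = json.loads(declaration_json)["declaration"]
    declarant = declaration["general"]["declarant"]
    prefix = f"{declarant['prenom']} {declarant['nom']} - {job_title}".lower()
    chunks = []
    for section, content in declaration.items():
        body = json.dumps(content, ensure_ascii=False)
        chunks.append({"section": section, "text": f"{prefix}\n{section}: {body}"})
    return chunks


def batched(items: List[str], batch_size: int = 2048) -> List[List[str]]:
    """Group texts so that each embedding request carries at most batch_size inputs."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Placeholder record: only the declarant path mirrors the notebook, the rest is invented.
sample = json.dumps({
    "declaration": {
        "general": {"declarant": {"nom": "DUPONT", "prenom": "Jeanne"}},
        "sectionA": {"items": ["x"]},
    }
})
chunks = chunk_declaration(sample, "députée")
print(len(chunks))        # 2
print(chunks[1]["text"])
```

Each chunk's `text` would then be embedded in place of the full declaration, with `batched` limiting how many chunks go into one request.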

This "chunking with person name+title prefix" technique will improve questions about someone, but still not solve the problem of aggregate questions like:  
- who is making more than X in year Y ?  
- list people working at place X ?
- list people affiliated with party X ?
etc...

These will only get solved by using SQL and a structured database.  
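
As a rough illustration of why SQL suits these aggregate questions, here is a minimal sketch using an in-memory SQLite database. The `income` table, its columns, and the rows are all hypothetical placeholders; the real schema would have to be derived from the declarations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per person, year and declared amount.
conn.execute("CREATE TABLE income (person TEXT, year INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO income VALUES (?, ?, ?)",
    [("person_a", 2019, 70000), ("person_b", 2019, 30000), ("person_a", 2020, 72000)],
)


def earners_above(connection: sqlite3.Connection, threshold: int, year: int) -> list:
    """Answer 'who is making more than X in year Y?' with a single query."""
    rows = connection.execute(
        "SELECT person FROM income WHERE year = ? AND amount > ? ORDER BY person",
        (year, threshold),
    )
    return [person for (person,) in rows]


print(earners_above(conn, 50000, 2019))  # ['person_a']
```

An embedding search can only return the nearest few documents, whereas a query like this scans every row, which is what aggregate questions need.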

## CONCLUSION OF THE TEXT WITH METADATA TEST

It works even better! Now we get 90% success when naming someone in a query, and we get consistent results.  
The only shortcoming is when several people share a name, either the same name+surname, or one person's first name used as another's surname. We can disambiguate using the current job title, although that is only possible by doing some research.  
Special characters are also a problem: their usage is inconsistent in the declarations, e.g. Loic vs. Loïc.  
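
One possible way to reduce the accent and case inconsistencies is to normalize text before embedding, on both the indexed documents and the queries. This is a sketch of an option, not something the notebook implements beyond lowercasing queries: `normalize_text` is a hypothetical helper built on the standard `unicodedata` module.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase the text and strip combining accents (e.g. 'Loïc' -> 'loic')."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.lower()


print(normalize_text("Loïc HERVÉ"))                      # loic herve
print(normalize_text("Loic") == normalize_text("Loïc"))  # True
```

Applying it to the dataset would mean recomputing the embeddings, so it is worth checking on a small sample first whether it actually helps retrieval.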

- adding a name+surname+job title prefix to the context is a great success
- we should test the job titles more deeply to see how to exploit them  
- the advanced questions are also a great success due to the complete declaration being passed in the context  

### What to do next  
- find a way to embed a link to the current declaration so our answers come with a URL to fact check the answers  
- convert the simple RAG into a discussion by passing the chat history

## CONCLUSION OF THE TEXT-INDEX TEST

Best results so far, we have a perfect match when asking for a given person.  
Now, the problem seems to be the quality of the job titles: the XML files contain hazy information, which is not great for creating the index, so we will have to rely on a secondary data source to improve this.  

- the person's name and surname are a perfect match
- we can ask complex questions and still match the correct context

### What to do next  

- load the extended information (.csv file with actual job titles) and re-create the index from there  
- do a second test  

## CONCLUSION OF THE SECOND TEXT-INDEX TEST  

After retrieving the correct job titles, we still cannot get a positive match for the mayor of Bordeaux.  
Sadly, the second-in-command is always recognized first.  
We then tried a two-step approach: first have an LLM produce a retrieval question, then ask the final question using the (hopefully higher-quality) retrieved context.  

It turns out that the dumb LLM (Mixtral) gets the correct answer to the mayor question and returns the correct context, but is then incapable of answering correctly.  
The "smart" LLM (Llama 3) cannot produce the correct retrieval query: it asks about the previous mayor. The change dates back to 2019, the cutoff date is supposed to be 2023, and the model gives the correct answer on Hugging Face, so Groq might be quantizing a lot here. The test therefore fails.  

- we have correct identification results, although the exact syntax (e.g. a `d'` vs. `de` before the name) seems to have a big impact.  
- we cannot ask general questions about job titles, as a job title containing part of another job title will weigh more (no job title hierarchy knowledge in the embedding). E.g. `mayor helper` will be ranked higher than `mayor` when looking for the mayor.  
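
One possible workaround for the job title problem, offered only as a sketch, is to re-rank the retrieved candidates so that exact title matches come before titles that merely contain the requested one. It assumes each candidate carries a clean job title string, which the notes above say is not yet guaranteed; `prefer_exact_title` and the sample rows are hypothetical.

```python
from typing import List, Tuple

Candidate = Tuple[str, str]  # (person, job_title)


def prefer_exact_title(candidates: List[Candidate], wanted_title: str) -> List[Candidate]:
    """Move exact (case-insensitive) title matches to the front, keeping retrieval order otherwise."""
    wanted = wanted_title.strip().lower()
    # sorted() is stable, so ties keep their original retrieval order.
    return sorted(candidates, key=lambda c: c[1].strip().lower() != wanted)


ranked = [
    ("person_a", "adjoint au maire"),
    ("person_b", "maire"),
    ("person_c", "adjoint au maire"),
]
print(prefer_exact_title(ranked, "Maire"))
# [('person_b', 'maire'), ('person_a', 'adjoint au maire'), ('person_c', 'adjoint au maire')]
```

This only helps when the right person is already somewhere in the retrieved list, so it complements a larger `n_samples` rather than replacing better embeddings.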

### What to do next  

- we should use better quality LLMs to ask retrieval questions, as we do not pass documents in the context => the requests will be quite cheap.  
- we can add more information in the index? No ideas for now.  

## CONCLUSION OF THE THIRD TEXT-INDEX TEST

This time we used the "large" embedding model instead of the "small".  
The results are spectacular : 10/10 !  

````
This was a triumph.
I'm making a note here:
huge success.
````