## RAG Estatuto de los Trabajadores

This notebook builds a Retrieval-Augmented Generation (RAG) system for question answering (QA) over the latest amended version of the Spanish Workers' Statute (Estatuto de los Trabajadores), published on April 30, 2025. The publication is available in PDF format at https://www.boe.es/biblioteca_juridica/abrir_pdf.php?id=PUB-DT-2025-139.

### Environment

In [4]:
!pip install langchain pypdf langchain_experimental faiss-cpu



In [5]:
import os
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from huggingface_hub import whoami

### Read PDF

In [57]:
# read pdf
loader = PyPDFLoader("data/estatuto_trabajadores.pdf")
document = loader.load()
# keep the pages at 0-based indices 10..188 (179 pages)
document = document[10:189]
# TODO: join all pages into a single document

In [58]:
# check data
print(f'Number of pages {len(document)}')
print(f'Author: {document[21].metadata["author"]}')
print(f'Title: {document[21].metadata["title"]}')
print(f'Page: {document[21].metadata["page"]}')
document[178].page_content[:200]

Number of pages 179
Author: Agencia Estatal Boletín Oficial del Estado (AEBOE)
Title: Estatuto de los Trabajadores. Última modificación: 30 de abril de 2025
Page: 31


'192\nd) En los supuestos de adopción, de guarda con fines de \nadopción y de acogimiento, de acuerdo con el artículo \n45.1.d), en caso de que ambos progenitores trabajen, \nel periodo de suspensión se di'

### Chunk text

In [97]:
# chunk pdf
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)
docs = text_splitter.split_documents(document)

In [99]:
# check chunking
print(f'Number of chunks: {len(docs)}')
n = 104
print(f'Number characters for chunk {n}: {len(docs[n].page_content)}')
docs[n].page_content[:200]

Number of chunks: 518
Number characters for chunk 104: 392


'cidad.\n4. El empresario podrá verificar el estado de salud del trabaja-\ndor que sea alegado por este para justificar sus faltas de \nasistencia al trabajo, mediante reconocimiento a cargo de \npersonal '
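As a rough illustration of how `chunk_size` and `chunk_overlap` interact, the sketch below packs separator-delimited pieces greedily into chunks and carries trailing pieces forward as overlap. It is a simplified stand-in, not LangChain's actual implementation; the function name `split_with_overlap` and the toy text are hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, repeating trailing pieces as overlap."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # close the current chunk
            chunks.append(separator.join(current))
            # drop leading pieces until the carried-over tail fits the overlap
            # budget and still leaves room for the new piece
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


toy_text = "aaaa\nbbbb\ncccc\ndddd"  # hypothetical input, four 4-character lines
toy_chunks = split_with_overlap(toy_text, chunk_size=9, chunk_overlap=4)
for chunk in toy_chunks:
    print(repr(chunk))
```

With these toy settings each chunk holds two lines, and the last line of one chunk reappears as the first line of the next.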

In [100]:
# clean text: rejoin words hyphenated across line breaks, then drop the remaining '\n'
for d in docs:
    # "trabaja-\ndor" -> "trabajador"; this also matches " -\n" and keeps its leading space
    d.page_content = d.page_content.replace("-\n", "")
    d.page_content = d.page_content.replace("\n", "")


In [117]:
# check cleaning
n = 104
print(f'Number characters for chunk {n}: {len(docs[n].page_content)}')
docs[n].page_content[:200]

Number characters for chunk 104: 382


'cidad.4. El empresario podrá verificar el estado de salud del trabajador que sea alegado por este para justificar sus faltas de asistencia al trabajo, mediante reconocimiento a cargo de personal médic'
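The cleaning cell above deletes every line break outright, which is why `cidad.` and `4.` end up glued together in the output. One possible variant, sketched below, rejoins hyphenated words first and then turns the remaining line breaks into single spaces. The helper name `clean_chunk` is hypothetical, and the sketch assumes that a hyphen at the end of a line always marks a word split, which would also merge genuine compound words that happen to break there.

```python
import re


def clean_chunk(text):
    # join words hyphenated across a line break: "trabaja-\ndor" -> "trabajador"
    text = re.sub(r"-\n", "", text)
    # replace each remaining line break (plus surrounding whitespace) with one space
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()


sample = (
    "cidad.\n4. El empresario podrá verificar el estado de salud "
    "del trabaja-\ndor que sea alegado"
)
cleaned_sample = clean_chunk(sample)
print(cleaned_sample)
```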

### Embeddings and vector store

An open-source embedding model converts the text chunks into vectors. It is a small multilingual model whose training languages include Spanish. More details about the model can be found at https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. The vectors are stored in a FAISS vector store.

In [None]:
# embeddings model
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

In [101]:
# vector store
vectorstore = FAISS.from_documents(docs, embedding=embedding_model)

In [102]:
# check search by similarity
query = "Qué se tendrá en cuenta en los ascensos dentro de una empresa"

docs_and_scores = vectorstore.similarity_search_with_score(query, k = 5)

for ds in docs_and_scores:
    print(f'Length: {len(ds[0].page_content)}\nScore: {ds[1]}\n')
    text_without_n = ds[0].page_content.replace('\n','')
    print(f"{text_without_n}\n")
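When reading the scores printed by the cell above, it helps to know that, assuming the index was built with LangChain's default FAISS settings, the score is an L2 (Euclidean) distance, so a lower score means a closer match. The toy sketch below ranks hypothetical 2-dimensional vectors by squared L2 distance in plain Python. It illustrates the ranking idea only and is not how FAISS is implemented; `squared_l2`, `top_k`, and `toy_vectors` are names made up for the example.

```python
def squared_l2(a, b):
    # squared Euclidean distance; the ranking is the same as for the plain distance
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, vectors, k):
    # score every stored vector, then keep the k smallest distances
    scored = [(i, squared_l2(query_vec, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]  # hypothetical embeddings
ranking = top_k([0.9, 0.1], toy_vectors, k=2)
for index, distance in ranking:
    print(f"vector {index}: distance {distance:.2f}")
```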

### Augmented prompt

The augmented prompt is built with a retriever that returns the five chunks most similar to the query. A compact LLM, google/gemma-2-2b-it, generates the response.
This model is the 2B-size, instruction-tuned variant of Gemma 2, Google's family of lightweight, open-weight large language models (LLMs). Gemma is derived from the same research and technology used to create the Gemini models, and gemma-2-2b-it is optimized for instruction-following tasks.

In [104]:
# retriever for augmentation
retriever = vectorstore.as_retriever(search_kwargs = {"k": 5})

In [15]:
# login huggingface
from huggingface_hub import notebook_login
notebook_login()


In [17]:
# LLM model
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-2-2b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")



In [18]:
# wrap as langchain LLM
from langchain.llms import HuggingFacePipeline
from transformers import pipeline

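# do_sample=True is needed for temperature to take effect; without it decoding is greedy
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512, do_sample=True, temperature=0.3)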

llm = HuggingFacePipeline(pipeline=pipe)


Device set to use cuda:0


The prompt instructs the model to answer strictly from the context supplied by the retriever and, if the answer cannot be found there, to avoid inventing information and simply respond 'I don't know' ('No lo sé').

In [111]:
# build RAG chain
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Contesta a la siguiente pregunta usando solo el contexto proporcionado. "
        "Si no puedes responder, di 'No lo sé'.\n\n"
        # end the instruction with punctuation and a blank line so it does not run into "Contexto:"
        "Devuélveme solo la respuesta.\n\n"
        "Contexto:\n{context}\n\nPregunta:\n{question}"
    )
)

rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
    chain_type_kwargs={"prompt": prompt}
)
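To make the `"stuff"` chain type less opaque: conceptually, the retrieved chunks are concatenated into one context string and substituted into the prompt template together with the question. The sketch below reproduces that idea in plain Python. It assumes a blank line as the separator between chunks, uses a shortened copy of the template, and works on hypothetical placeholder chunks rather than real retriever output; `build_stuffed_prompt` is a name invented for the example.

```python
TEMPLATE = (
    "Contesta a la siguiente pregunta usando solo el contexto proporcionado. "
    "Si no puedes responder, di 'No lo sé'.\n\n"
    "Contexto:\n{context}\n\nPregunta:\n{question}"
)


def build_stuffed_prompt(chunks, question, separator="\n\n"):
    # "stuff" all retrieved chunks into a single context block
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


# hypothetical stand-ins for the five chunks a retriever would return
toy_chunks = ["Texto recuperado A.", "Texto recuperado B."]
stuffed = build_stuffed_prompt(toy_chunks, "Qué es el salario?")
print(stuffed)
```

Because every retrieved chunk lands in a single prompt, the chunk size and the number of retrieved chunks together determine how much of the model's context window the prompt consumes.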


In [114]:
# question 1
query = "Qué se tendrá en cuenta en los ascensos dentro de una empresa"
result = rag_chain.run(query)
#print(result)
clean_answer = result.split("Respuesta:")[-1].strip()
print(clean_answer)

En todo caso los ascensos se producirán teniendo en cuenta la formación, méritos, antigüedad del trabajador, así como las facultades organizativas del empresario.
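The post-processing used in the question cells can be wrapped in a small helper, sketched here under the assumption that the generated text may or may not contain a `Respuesta:` marker. The name `extract_answer` is hypothetical. When the marker is absent, `split` returns the whole string, so the helper falls back to the stripped full text.

```python
def extract_answer(generated, marker="Respuesta:"):
    # keep only the text after the last marker; without a marker, keep everything
    return generated.split(marker)[-1].strip()


with_marker = extract_answer("Pregunta:\nQué es el salario?\nRespuesta: No lo sé.")
without_marker = extract_answer("  No lo sé.  ")
print(with_marker)
print(without_marker)
```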


In [115]:
# question 2
result = rag_chain.run("Qué sabes de Leo Messi?")
#print(result)
clean_answer = result.split("Respuesta:")[-1].strip()
print(clean_answer)

No lo sé.


In [116]:
# question 3
result = rag_chain.run("Qué es el salario?")
#print(result)
clean_answer = result.split("Respuesta:")[-1].strip()
print(clean_answer)

El salario es la retribución que se le paga a un trabajador por su trabajo efectivo.
