
In [1]:
import getpass
import os

os.environ["MISTRAL_API_KEY"] = getpass.getpass()

 ········


# Installation

In [2]:
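!pip install --quiet --upgrade langchain langchain-community langchain-chroma
!pip install --quiet langchain-openai
!pip install --quiet langchain-mistralai langsmith python-dotenv beautifulsoup4  # imported in later cells
!pip install --quiet pypdf
!pip install --quiet sentence-transformers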

# Necessary Packages

In [3]:
import bs4
from langchain import hub
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader  # PDF and web loaders
from langchain_community.embeddings import HuggingFaceEmbeddings  # Local sentence-transformers embeddings
from langchain_core.output_parsers import StrOutputParser  # For the chains
from langchain_core.runnables import RunnablePassthrough   # For the chains
#from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_mistralai import ChatMistralAI  # Mistral chat model

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [4]:
# Setting up LangSmith tracing:

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
# Read the key at runtime instead of hard-coding a secret in the notebook
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API key: ")
os.environ["LANGCHAIN_PROJECT"] = "taac-rag"
#os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Uncomment to silence the tokenizers fork warning

In [5]:
from langsmith import Client
client = Client()

In [6]:
from dotenv import load_dotenv
load_dotenv()
os.environ["USER_AGENT"] = "RAG-Chatbot"
mistral_api_key = os.getenv("MISTRAL_API_KEY")

# Selected Model: Mistral 7B

We have selected the **Mistral model** with **7 billion parameters**, and we are accessing it remotely using a Mistral API Key.

## Reasons for Selection
1. **Open-Source**: Being open-source allows for customization and adaptability.
2. **Relatively Small Size**: With 7B parameters, it offers a good balance between performance and computational cost, making it feasible to use locally with proper quantization.
3. **Proven Performance**: The model has shown excellent results across various tasks, proving effective in multiple contexts.

*Mistral 7B has a context window of about 32.8k tokens, which is approximately 20,000–25,000 words.*
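
As a rough check of that estimate, the sketch below converts a token budget into a word range. Two assumptions are ours and not from Mistral's documentation: that "32.8k" stands for 32,768 tokens, and that English text averages roughly 0.6 to 0.75 words per token (a common rule of thumb).

```python
# Rough conversion from a token budget to an English word count.
# Assumptions: 32.8k means 32,768 tokens; 0.6-0.75 words per token is a rule of thumb.
CONTEXT_TOKENS = 32_768

def estimate_words(tokens: int, words_per_token: float) -> int:
    """Return the approximate number of words that fit in `tokens`."""
    return int(tokens * words_per_token)

low = estimate_words(CONTEXT_TOKENS, 0.60)
high = estimate_words(CONTEXT_TOKENS, 0.75)
print(f"{low:,} to {high:,} words")
```

The real ratio depends on the tokenizer and on the text, so the range should be read as an order of magnitude only.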

In [7]:
llm = ChatMistralAI(model="open-mistral-7b",api_key=mistral_api_key)

# 1-RAG PIPELINE 

## Indexing: Load (PDF)
We use PyPDFLoader to load a local PDF, but we might later switch to **WebBaseLoader** to load web documents.

In [8]:
from langchain_community.document_loaders import PyPDFLoader
# Replace this with the path to your local PDF file
pdf_file_path = "ArtificialIntelligenceAct-1-50.pdf"
# Create a loader for the local PDF file
loader = PyPDFLoader(pdf_file_path)
# Load the document: one LangChain Document per page
docs = loader.load()

len(docs[0].page_content)  # Characters on the first page; docs is a list of LangChain Document objects

1882

## Indexing: Split

We use Chroma as our vector store.
We use all-MiniLM-L6-v2, a sentence-transformers model based on Microsoft's MiniLM, to create the embeddings (OpenAI embeddings are pay-to-use).

The chunk size is a balance between the context window sizes of Mistral and the embedding model, and it seems to work well in practice.


In [9]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
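
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which the hypothetical `sliding_chunks` helper below does not do.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text
    return chunks

sample = "x" * 2500  # Stand-in for one long page of text
chunks = sliding_chunks(sample, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])
```

With an overlap of 200, the last 200 characters of one chunk are repeated at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk.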

## Indexing: Store Embeddings
- **Chroma** is an open-source **vector database** that’s designed for scalable, high-performance **similarity search**.
- The model **all-MiniLM-L6-v2** is a sentence-transformers model built on the MiniLM (Mini Language Models) family developed by **Microsoft**. It is especially suited for **semantic similarity** tasks, **sentence embedding**, and **question-answer retrieval**.
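
To illustrate what a similarity search over embeddings does, the sketch below ranks a few texts by cosine similarity to a query vector. The 3-dimensional vectors and the `top_k` helper are hypothetical (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and cosine similarity is our illustrative choice; the distance metric Chroma actually uses depends on its configuration.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list, store: list, k: int = 2) -> list:
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical toy embeddings, not produced by a real model
toy_store = [
    ("AI systems and fundamental rights", [0.9, 0.1, 0.0]),
    ("Internal market harmonisation", [0.1, 0.9, 0.0]),
    ("Cooking recipes", [0.0, 0.1, 0.9]),
]
query = [0.8, 0.3, 0.0]
print(top_k(query, toy_store, k=2))
```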

In [10]:
vectorstore = Chroma.from_documents(documents=splits, embedding=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"))



## Retrieval and Generation: Retrieve
## Retriever:
We're using the most common type of retriever, which is the VectorStoreRetriever.
## Prompt:
The prompt is pulled from *https://smith.langchain.com/hub/rlm/rag-prompt*:

"
`HUMAN`

`You are an assistant for question-answering tasks. Use the following retrieved context to answer the question. If you don't know the answer, state that clearly. Limit your response to three sentences, keeping the answer concise.`


`Question: {question}`

`Context: {context}`

`Answer:`
"

In [11]:
# Retrieve and generate using the relevant snippets of the PDF.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

## Retrieval and Generation: Generate

In [12]:
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()  # The model returns a message with metadata; keep only the text
)
rag_chain.invoke("What are the main values enshrined in the AI Act, according to Union law?")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


'The AI Act, according to Union law, aims to improve the internal market by laying down a uniform framework for AI systems. It promotes the uptake of human-centric and trustworthy AI while ensuring a high level of protection of fundamental rights, including democracy, the rule of law, and environmental protection. The Act also fosters the development and use of AI, ensuring free movement of AI-based goods and services, but only under harmonized rules to protect public interests and fundamental rights.'
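
The data flow of the chain can be sketched with plain Python functions. All names below (`fake_retriever`, `build_prompt`, `fake_llm`, `parse_output`, `run_chain`) are hypothetical stand-ins and not LangChain APIs; the point is only that the question goes both to the retriever and, unchanged, into the prompt.

```python
def fake_retriever(question: str) -> list:
    # Stand-in for the vector store retriever: returns fixed snippets
    return ["Snippet one about the AI Act.", "Snippet two about Union law."]

def format_docs(docs: list) -> str:
    # Same joining rule as in the notebook, applied to plain strings
    return "\n\n".join(docs)

def build_prompt(inputs: dict) -> str:
    # Fills the two placeholders of the RAG prompt
    return f"Question: {inputs['question']}\nContext: {inputs['context']}\nAnswer:"

def fake_llm(prompt_text: str) -> dict:
    # Stand-in for the chat model: returns text plus metadata
    return {"content": f"(answer based on {len(prompt_text)} prompt characters)",
            "metadata": {"model": "stub"}}

def parse_output(message: dict) -> str:
    # Plays the role of StrOutputParser: keep only the text
    return message["content"]

def run_chain(question: str) -> str:
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    return parse_output(fake_llm(build_prompt(inputs)))

print(run_chain("What does the AI Act protect?"))
```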

# 2-RAG EVALUATION

## RAG Evaluation Types

In Retrieval-Augmented Generation (RAG) systems, there are four primary evaluation types that users are commonly interested in:

### 1. Response vs. Reference Answer
- **Goal**: Assess the accuracy of the RAG chain response in relation to a known, ground-truth answer.
- **Evaluation Mode**: Requires a reference answer, typically provided via a dataset, to compare against.
- **Judge**: Uses a language model as the evaluator to determine answer correctness by comparing the generated response to the ground-truth answer.

### 2. Response vs. Input Question
- **Goal**: Evaluate how effectively the generated response addresses the user's initial question or input.
- **Evaluation Mode**: Does not require a reference answer; instead, it compares the answer directly with the input question.
- **Judge**: Uses a language model as the evaluator to assess relevance, helpfulness, and whether the response meets the user's intent.

### 3. Response vs. Retrieved Documents
- **Goal**: Determine the consistency between the generated response and the retrieved documents, focusing on factual accuracy and faithfulness.
- **Evaluation Mode**: Does not require a reference answer; compares the answer to the context retrieved during the RAG process.
- **Judge**: Uses a language model as the evaluator to check for faithfulness, detect hallucinations, and ensure alignment with the provided context.

> **Note**: We will not be using this evaluation type in this project, as it is more useful for evaluating the model itself and assessing the prompt used for RAG.

### 4. Retrieved Documents vs. Input Question
- **Goal**: Measure the relevance and quality of the retrieved documents with respect to the user’s original query.
- **Evaluation Mode**: Reference-free; evaluates the retrieved documents based on their relevance to the input question.
- **Judge**: Uses a language model as the evaluator to judge relevance, ensuring that the retrieved information is pertinent to the query.

---

##### In this project, we will concentrate on evaluation types **1** and **2**, as they are the most pertinent to our objectives. We’ve chosen to exclude evaluation type **3** since it is primarily useful for assessing the model's performance and the effectiveness of the prompt within the RAG process. We are also not implementing evaluation type **4**, as the behavior of the retriever can be monitored through LangSmith tracing, making a separate evaluation unnecessary.


In [13]:
from langsmith.evaluation import evaluate
import httpx
#os.environ["TOKENIZERS_PARALLELISM"] = "false"

## The Data-set used for Evaluation

#### Dataset with some LCEL (LangChain Expression Language) questions as input and the expected output/answer.

In [14]:
dataset = client.clone_public_dataset("https://smith.langchain.com/public/730d833b-74da-43e2-a614-4e2ca2502606/d")
dataset_name = "LCEL-QA"

# Function for RAG invocation
We are implementing time delays and staggered execution to prevent API calls to Mistral from being throttled due to rate limits. By incorporating time.sleep and exponential backoff, we aim to manage the timing of requests, reducing the likelihood of hitting the rate limit and ensuring smoother interaction with the Mistral API.

In [18]:
import time
def generate_rag_answer(example: dict):
    # Retry with exponential backoff when the Mistral API answers with HTTP 429
    max_retries = 7
    delay = 7  # Start with a 7-second delay

    for attempt in range(max_retries):
        try:
            return rag_chain.invoke(example["input_question"])
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429 and attempt < max_retries - 1:
                print(f"Rate limit exceeded. Retrying in {delay} seconds...")
                time.sleep(delay)
                delay *= 2  # Exponential backoff
            else:
                # Out of retries (or another HTTP error): give up on this example
                return "Error: Rate limit exceeded"

def generate_rag_answer_with_context(example: dict):
    # Use this for evaluation of retrieved documents and hallucinations, with exponential backoff for rate limiting.
    max_retries = 7
    delay = 7  # Start with a 7-second delay

    for attempt in range(max_retries):
        try:
            question = example["input_question"]
            # rag_chain returns a plain string, so fetch the contexts from the retriever directly
            contexts = [doc.page_content for doc in retriever.invoke(question)]
            answer = rag_chain.invoke(question)
            return {"answer": answer, "contexts": contexts}
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429 and attempt < max_retries - 1:
                print(f"Rate limit exceeded. Retrying in {delay} seconds...")
                time.sleep(delay)
                delay *= 2  # Exponential backoff
            else:
                # Out of retries (or another HTTP error): give up on this example
                return {"answer": "Error: Rate limit exceeded", "contexts": []}
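
The retry logic is repeated in each function of this notebook. As a possible refactoring, here is a minimal sketch of a reusable helper. The name `call_with_backoff` and its parameters are hypothetical and not part of LangChain or httpx; the jitter is an addition that the notebook code does not have, and the sleep function is injectable so the example runs without waiting.

```python
import random
import time

def call_with_backoff(func, is_retryable, max_retries=5, base_delay=1.0,
                      factor=2.0, sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * factor**attempt, then retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:
            last_attempt = attempt == max_retries - 1
            if not is_retryable(exc) or last_attempt:
                raise  # Surface the real error instead of hiding it
            delay = base_delay * factor ** attempt
            delay += random.uniform(0, 0.1 * delay)  # Small jitter to spread retries
            sleep(delay)

# Usage with a stand-in function that fails twice before succeeding
class RateLimited(Exception):
    pass

calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return "ok"

waits = []
result = call_with_backoff(flaky, lambda e: isinstance(e, RateLimited),
                           sleep=waits.append)  # Record delays instead of sleeping
print(result, len(waits))
```

In the notebook, `is_retryable` could check for an `httpx.HTTPStatusError` whose `response.status_code` is 429, which is the condition used in the cells above.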


# 1 eval : Response vs reference answer
Here, we are comparing the response generated by the RAG invocation with the ground truth or desired response provided in the dataset. The approach involves specifying a prompt that instructs the model to answer a question. After generating its response, the model then compares its answer with the labeled (ground truth) response, assigning a score of 1 if the answer is close to the desired response, and 0 otherwise. Finally, we calculate the average score across 20 examples to assess overall performance.
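
The averaging step can be sketched as follows. The scores are made-up illustrative data, not results from this project, and the `mean_score` helper is hypothetical; it leaves out entries that could not be graded (for example, calls that failed because of rate limits) so that they do not count as wrong answers.

```python
from typing import Optional, Tuple

def mean_score(scores: list) -> Tuple[Optional[float], int]:
    """Average the numeric scores and count the entries that could not be graded."""
    valid = [s for s in scores
             if isinstance(s, (int, float)) and not isinstance(s, bool)]
    skipped = len(scores) - len(valid)
    average = sum(valid) / len(valid) if valid else None
    return average, skipped

# Hypothetical grades: 1 = matches the reference, 0 = does not, string = failed call
example_scores = [1, 0, 1, 1, "Error: Rate limit exceeded", 0]
print(mean_score(example_scores))
```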

In [17]:
grade_prompt_answer_accuracy = hub.pull("langchain-ai/rag-answer-vs-reference")

def answer_evaluator(run, example) -> dict:
    """
    A simple evaluator for RAG answer accuracy with exponential backoff for rate limiting.
    """
    input_question = example.inputs["input_question"]
    reference = example.outputs["output_answer"]
    prediction = run.outputs

    llm = ChatMistralAI(model="open-mistral-7b",api_key=mistral_api_key)

    # Structured grading prompt
    answer_grader = grade_prompt_answer_accuracy | llm

    # Set up exponential backoff parameters
    max_retries = 7
    delay = 7  # Start with a 7-second delay

    for attempt in range(max_retries):
        try:
            score = answer_grader.invoke({
                "question": input_question,
                "correct_answer": reference,
                "student_answer": prediction
            })
            return {"key": "answer_v_reference_score", "score": score["Score"]}
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429 and attempt < max_retries - 1:
                print(f"Rate limit exceeded. Retrying in {delay} seconds...")
                time.sleep(delay)
                delay *= 3  # Exponential backoff
            else:
                return {"key": "answer_v_reference_score", "score": "Error: Rate limit exceeded"}

In [18]:
# Run the evaluation on the 20 examples
experiment_results = evaluate(
    generate_rag_answer,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix="rag-answer-v-reference",
    metadata={"version": "LCEL context, mistral7B"}
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


View the evaluation results for experiment: 'rag-answer-v-reference-a89baf6b' at:
https://smith.langchain.com/o/2556513b-85f5-4f8a-a6cc-ef8f17fb2ac6/datasets/c421821d-9bf4-4706-904b-359aad37fb34/compare?selectedSessions=cbba98a0-d6ae-41e5-bfe2-0f4bc9e5f03e




0it [00:00, ?it/s]

Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 21 seconds...
Rate limit exceeded. Retrying in 21 seconds...
Rate limit exceeded. Retrying in 21 seconds...
Rate limit exceeded. Retrying in 21 seconds...
Rate limit exceeded. Retrying in 20 seconds...
Rate limit exceed

# 2 eval: Response vs input

The same approach is used for the Response vs. Input evaluation. Here, the model is responsible for generating answers and then evaluating its own responses based on the criteria provided in the prompt.

In this case, the prompt instructs the model to assign a score of 1 (good) or 0 (bad) according to these guidelines:
1. The STUDENT ANSWER should be concise and directly relevant to the QUESTION.
2. The STUDENT ANSWER should contribute toward answering the QUESTION effectively. 

This process lets the model grade its own responses against explicit criteria for relevance and helpfulness.

In [21]:

# Grade prompt
grade_prompt_answer_helpfulness = hub.pull("langchain-ai/rag-answer-helpfulness")

def answer_helpfulness_evaluator(run, example) -> dict:
    """
    A simple evaluator for RAG answer helpfulness with exponential backoff for rate limiting.
    """

    # Get question and RAG chain answer
    input_question = example.inputs["input_question"]
    prediction = run.outputs

    # LLM grader
    llm = ChatMistralAI(model="open-mistral-7b", api_key=mistral_api_key, temperature=0)
    # Structured prompt
    answer_grader = grade_prompt_answer_helpfulness | llm

    # Set up exponential backoff parameters
    max_retries = 10
    delay = 10  # Start with a 10-second delay

    for attempt in range(max_retries):
        try:
            score = answer_grader.invoke({
                "question": input_question,
                "student_answer": prediction
            })
            return {"key": "answer_helpfulness_score", "score": score["Score"]}
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429 and attempt < max_retries - 1:
                print(f"Rate limit exceeded. Retrying in {delay} seconds...")
                time.sleep(delay)
                delay *= 3  # Exponential backoff
            else:
                return {"key": "answer_helpfulness_score", "score": "Error: Rate limit exceeded"}


In [22]:
experiment_results = evaluate(
    generate_rag_answer,
    data=dataset_name,
    evaluators=[answer_helpfulness_evaluator],
    experiment_prefix="rag-answer-helpfulness",
    metadata={"version": "LCEL context, mistral7B"},
)

View the evaluation results for experiment: 'rag-answer-helpfulness-967a4e62' at:
https://smith.langchain.com/o/2556513b-85f5-4f8a-a6cc-ef8f17fb2ac6/datasets/c421821d-9bf4-4706-904b-359aad37fb34/compare?selectedSessions=a0563944-e838-436c-9ae3-e9a988fb81e9




0it [00:00, ?it/s]

Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 10 seconds...
Rate limit exceeded. Retrying in 7 seconds...
Rate limit exceeded. Retrying in 14 seconds...
Rate limit exceeded. Retrying in 14 seconds...
Rate limit exceeded. Retrying in 14 seconds...
Rate limit exceeded. Retrying in 14 seconds...
Rate limit exceeded. Retrying in 14 seconds...
Rate limit exceeded. Retr

# Conclusion

As shown in the results, we achieved:
- A score of 0.85 for answer helpfulness
- A score of 0.60 for response vs. reference answer accuracy
  
Considering that the evaluation dataset was limited in size and some instances were affected by rate limits on Mistral API calls, we can conclude that the benchmark positively reflects the effectiveness of the RAG pipeline we developed. These scores suggest that the pipeline performs well in terms of relevance and helpfulness, despite the constraints encountered.





