# Correctness Evaluation

This page walks through testing the correctness of the RAG system with LangSmith, a framework focused on testing, monitoring and deploying LLM apps. LangSmith provides QA [evaluators](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations#correctness-qa-evaluation) that measure the accuracy of the system's answers against a set of reference answers.

To test the correctness of the RAG system, the data was split into train, validation and test sets, and the *test* split is used for the evaluation on this page.

First, we extract the question/answer pairs that will serve as references into a list of tuples:

In [2]:
import dotenv
import pandas as pd

dotenv.load_dotenv()

df_test = pd.read_csv("../data/split_files/test_counsel_chat.csv")
df_test = df_test[["questionTitle", "answerText"]]

list_examples = list(df_test.itertuples(index=False, name=None))
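Before uploading the pairs, it can help to drop rows that would make poor references. The sketch below is an optional step that is not part of the original notebook: `clean_pairs` is a hypothetical helper, and the sample rows are made up to stand in for `list_examples`. It assumes that missing CSV cells arrive as non-string values (pandas reads them as `NaN`).

```python
from typing import Iterable, List, Tuple


def clean_pairs(pairs: Iterable[Tuple[object, object]]) -> List[Tuple[str, str]]:
    """Keep only usable (question, answer) pairs, without exact duplicates."""
    seen = set()
    cleaned = []
    for question, answer in pairs:
        # Missing cells are not strings (pandas yields float NaN), so skip them
        if not isinstance(question, str) or not isinstance(answer, str):
            continue
        question, answer = question.strip(), answer.strip()
        # Skip pairs where either side is empty after trimming whitespace
        if not question or not answer:
            continue
        # Skip exact repeats of a pair that was already kept
        if (question, answer) in seen:
            continue
        seen.add((question, answer))
        cleaned.append((question, answer))
    return cleaned


# Made-up rows standing in for `list_examples`
sample_pairs = [
    ("Is it normal to cry during therapy?", " Yes, it is common. "),
    ("Is it normal to cry during therapy?", "Yes, it is common."),
    ("How do I find a counselor?", float("nan")),
    ("", "An answer without a question"),
]
cleaned_pairs = clean_pairs(sample_pairs)
print(cleaned_pairs)
```

Duplicates are detected on the whole pair rather than on the question alone, so a question with several distinct reference answers keeps all of them.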

After that, the LangSmith client is instantiated to interact with the LangSmith platform:

In [44]:
from langsmith import Client

client = Client()

The dataset is given a name so that it can be identified across test runs. The reference question/answer pairs are then stored as examples in that dataset on the LangSmith platform:

In [46]:
dataset_name = "Mental Health Retrieval QA Questions"
dataset = client.create_dataset(dataset_name=dataset_name)

for q, a in list_examples:
    client.create_example(
        inputs={"question": q}, outputs={"answer": a}, dataset_id=dataset.id
    )
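Re-running the cell above may fail or duplicate data if a dataset with the same name already exists. A minimal sketch of a "get or create" guard follows. It assumes the `langsmith` client exposes `has_dataset` and `read_dataset` alongside `create_dataset`; `get_or_create_dataset` is a hypothetical helper, and `_OfflineClient` is only a stand-in so the sketch runs without network access.

```python
def get_or_create_dataset(client, dataset_name: str):
    """Return (dataset, created) for the given name.

    Assumes `client` behaves like a langsmith Client with
    `has_dataset`, `read_dataset` and `create_dataset`.
    """
    if client.has_dataset(dataset_name=dataset_name):
        return client.read_dataset(dataset_name=dataset_name), False
    return client.create_dataset(dataset_name=dataset_name), True


class _OfflineClient:
    """Stand-in for the real client, used only to exercise the helper."""

    def __init__(self):
        self._datasets = {}

    def has_dataset(self, *, dataset_name):
        return dataset_name in self._datasets

    def read_dataset(self, *, dataset_name):
        return self._datasets[dataset_name]

    def create_dataset(self, *, dataset_name):
        self._datasets[dataset_name] = {"name": dataset_name}
        return self._datasets[dataset_name]


offline_client = _OfflineClient()
first, created_first = get_or_create_dataset(offline_client, "demo dataset")
second, created_second = get_or_create_dataset(offline_client, "demo dataset")
print(created_first, created_second)
```

With the real client, the examples would be uploaded only when `created` is `True`, so repeated runs do not add the same pairs twice.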

In this case, we use two evaluators. The chain-of-thought Q&A correctness evaluator `cot_qa` asks the grading LLM to reason step by step about the system's answer against the reference before it assigns a grade. The `qa` evaluator grades a response directly as "correct" or "incorrect" based on the reference answer.

In [None]:
from langchain.smith import RunEvalConfig

eval_config = RunEvalConfig(
    # Chain-of-thought Q&A correctness ("cot_qa") plus the direct "qa" evaluator
    evaluators=["cot_qa", "qa"],
)
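To make the grading step more concrete: the chain-of-thought grader writes its reasoning and ends with a verdict line such as `GRADE: CORRECT`, which appears in the evaluation output on this page as a score of 1. The sketch below illustrates how such a verdict could be turned into a score. `parse_grade` is a hypothetical helper written for illustration, not LangChain's own parser.

```python
import re
from typing import Optional

# Matches a final verdict line such as "GRADE: CORRECT" or "GRADE: INCORRECT"
_GRADE_RE = re.compile(r"GRADE:\s*(CORRECT|INCORRECT)\s*$", re.IGNORECASE)


def parse_grade(grader_text: str) -> Optional[int]:
    """Return 1 for CORRECT, 0 for INCORRECT, None if no verdict is found."""
    match = _GRADE_RE.search(grader_text.strip())
    if match is None:
        return None
    return 1 if match.group(1).upper() == "CORRECT" else 0


example_comment = (
    "The student's answer aligns with the context provided.\n"
    "GRADE: CORRECT"
)
print(parse_grade(example_comment))
```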

Next, we configure the prompt, the retriever and the LLM (in this case `gpt-3.5-turbo-16k`) that make up the chain to be executed on the LangSmith platform:

In [3]:
from langchain_community.vectorstores.chroma import Chroma
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate

FAQ_CHROMA_PATH = "../data/vector"

faqs_retriever = Chroma(
    embedding_function=OpenAIEmbeddings(), persist_directory=FAQ_CHROMA_PATH
).as_retriever(search_kwargs={"k": 10})  # retrieve the 10 most similar entries

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful documentation Q&A assistant, trained to answer"
            " questions from Mental health. "
            "Answer only based on the following context:",
        ),
        ("system", "{context}"),
        ("human", "{question}"),
    ]
)

model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)

response_generator = prompt | model | StrOutputParser()

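faq_rag_chain = (
    {
        # Dataset inputs are dicts, so pass only the question text to the retriever
        "context": itemgetter("question") | faqs_retriever,
        "question": itemgetter("question"),
    }
    | response_generator
)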
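The data flow of this chain can be pictured with plain Python functions: each dataset input is a dict with a `question` key, the retriever is queried to fill `context`, the question is forwarded unchanged, and the result is handed to the prompt and model. The sketch below is only an analogy under those assumptions. `run_rag_step`, `fake_retrieve` and `fake_generate` are hypothetical stand-ins, not LangChain components.

```python
from typing import Callable, Dict, List


def run_rag_step(
    inputs: Dict[str, str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[Dict[str, object]], str],
) -> str:
    """Mimic the chain: build {context, question}, then generate an answer."""
    question = inputs["question"]
    prompt_inputs = {
        "context": retrieve(question),  # role of the retriever
        "question": question,           # role of itemgetter("question")
    }
    return generate(prompt_inputs)      # role of prompt | model | parser


def fake_retrieve(question: str) -> List[str]:
    # Stand-in for the vector store lookup
    return [f"FAQ entry related to: {question}"]


def fake_generate(prompt_inputs: Dict[str, object]) -> str:
    # Stand-in for the prompt + LLM + output parser
    n_docs = len(prompt_inputs["context"])
    return f"Answer to '{prompt_inputs['question']}' using {n_docs} document(s)"


answer = run_rag_step(
    {"question": "Is it normal to cry during therapy?"},
    fake_retrieve,
    fake_generate,
)
print(answer)
```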

Once the dataset has been created and the chain assembled, we can run the evaluation on the dataset with the `run_on_dataset` method of the LangSmith client. It takes the name of the dataset, the LLM or chain factory function, and the evaluation configuration.

In [54]:
client.run_on_dataset(
    dataset_name=dataset_name, llm_or_chain_factory=faq_rag_chain, evaluation=eval_config
)

View the evaluation results for project 'best-discovery-95' at:
https://smith.langchain.com/o/adc22f44-a515-576d-9880-914687081682/datasets/97aea489-e162-4ed9-a5a0-06c51f5ebff0/compare?selectedSessions=fded5dcf-0eaf-45a9-8e4c-b5f34e2e0173

View all tests for Dataset Mental Health Retrieval QA Questions at:
https://smith.langchain.com/o/adc22f44-a515-576d-9880-914687081682/datasets/97aea489-e162-4ed9-a5a0-06c51f5ebff0
[------------------------------------------------->] 117/117

{'project_name': 'best-discovery-95',
 'results': {'c7f84acb-dddd-47a2-a59b-a783a565182d': {'input': {'question': 'Is it normal to cry during therapy?'},
   'feedback': [EvaluationResult(key='COT Contextual Accuracy', score=1, value='CORRECT', comment="The student's answer aligns with the context provided. The context states that it is normal to express various emotions, including crying, during therapy sessions. The student's answer also states that it is normal to cry during therapy and provides additional information on why this might occur and how it can be beneficial. Therefore, the student's answer is factually accurate and does not conflict with the context.\nGRADE: CORRECT", correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('ef35e11a-4a84-438a-be95-a7ab334a133e'))}, source_run_id=None, target_run_id=None),
    EvaluationResult(key='correctness', score=1, value='CORRECT', comment='CORRECT', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('1ac28a92-6a06-
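The returned dictionary can also be summarised locally. The sketch below computes the mean score per evaluator key. It assumes the structure visible in the output above (a `results` mapping whose entries carry a `feedback` list of objects with `key` and `score` attributes). `mean_scores` is a hypothetical helper, and the demo data is made up using `SimpleNamespace` in place of real `EvaluationResult` objects.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict


def mean_scores(run_output: dict) -> Dict[str, float]:
    """Average the feedback scores per evaluator key across all examples."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for example in run_output["results"].values():
        for feedback in example.get("feedback", []):
            if feedback.score is None:  # skip feedback without a numeric score
                continue
            totals[feedback.key] += feedback.score
            counts[feedback.key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up output with the same shape as the dictionary printed above
demo_output = {
    "project_name": "demo-project",
    "results": {
        "example-1": {
            "feedback": [
                SimpleNamespace(key="COT Contextual Accuracy", score=1),
                SimpleNamespace(key="correctness", score=1),
            ]
        },
        "example-2": {
            "feedback": [
                SimpleNamespace(key="COT Contextual Accuracy", score=0),
                SimpleNamespace(key="correctness", score=1),
            ]
        },
    },
}
summary = mean_scores(demo_output)
print(summary)
```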

Finally, the results can be viewed on the LangSmith website, together with an explanation of each grade.

![title](correctness_tests_results.png "Correctness tests in LangSmith")

Additionally, you can inspect the contextual feedback for any example that was tested.

![title](correctness_tests_details.png "Correctness tests explanation in LangSmith")

## Next steps

The next steps would be to benchmark different [LLMs](https://python.langchain.com/docs/integrations/llms/) (LangChain supports plenty of them), different chain configurations, and additional fine-tuning. As shown below, we can edit parameters, prompts, or messages in the interaction with the LLM and check how it performs after the changes.

![title](correctness_test_playground_1.png "Correctness tests Playground")

![title](correctness_test_playground_2.png "Correctness tests Playground ChatOpenAI")
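The same benchmarking could also be scripted by running the evaluation once per candidate model. The sketch below assumes `run_on_dataset` is called with the same three arguments used earlier on this page. `benchmark_models` and `build_chain` are hypothetical names, the second model name is a placeholder, and `_RecordingClient` is only a stand-in so the sketch runs offline.

```python
from typing import Callable, Dict, Iterable


def benchmark_models(
    client,
    dataset_name: str,
    build_chain: Callable[[str], object],
    model_names: Iterable[str],
    eval_config,
) -> Dict[str, object]:
    """Run the same evaluation once per model and collect the outputs."""
    outputs = {}
    for model_name in model_names:
        outputs[model_name] = client.run_on_dataset(
            dataset_name=dataset_name,
            llm_or_chain_factory=build_chain(model_name),
            evaluation=eval_config,
        )
    return outputs


class _RecordingClient:
    """Stand-in client that records what it was asked to evaluate."""

    def __init__(self):
        self.calls = []

    def run_on_dataset(self, *, dataset_name, llm_or_chain_factory, evaluation):
        self.calls.append(llm_or_chain_factory)
        return {"project_name": f"run-for-{llm_or_chain_factory}"}


recording_client = _RecordingClient()
benchmark = benchmark_models(
    recording_client,
    "Mental Health Retrieval QA Questions",
    build_chain=lambda name: f"chain({name})",  # placeholder for a real chain builder
    model_names=["gpt-3.5-turbo-16k", "another-model"],  # second name is a placeholder
    eval_config=None,
)
print(list(benchmark))
```

In a real run, `build_chain` would rebuild `faq_rag_chain` with `ChatOpenAI(model=model_name, temperature=0)` or another LangChain LLM, and each run would appear as a separate test project in LangSmith.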