### [Quickstart](https://github.com/explodinggradients/ragas/tree/main?tab=readme-ov-file#fire-quickstart)

### Imports

In [3]:
from datasets import Dataset, load_dataset
import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness

### Dataset

In [4]:
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
amnesty_qa

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


DatasetDict({
    eval: Dataset({
        features: ['question', 'ground_truth', 'answer', 'contexts'],
        num_rows: 20
    })
})

### Metrics

In [5]:
from ragas.metrics import (
    context_precision,
    answer_relevancy,
    faithfulness,
    context_recall,
    answer_similarity,
    answer_correctness,
)
from ragas.metrics.critique import harmfulness

# list of metrics we're going to use
metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    harmfulness,
    answer_similarity,
    answer_correctness,
]

### Auth

- [Uses code example from ragas site](https://docs.ragas.io/en/stable/howtos/customisations/gcp-vertexai.html#)

In [6]:
import google.auth
import vertexai

from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings
from ragas.llms import LangchainLLMWrapper

In [7]:
PROJECT_ID='langgraph-graded-rag'
REGION_ID='us-central1'
CHAT_MODEL='gemini-1.0-pro'
EMB_MODEL='textembedding-gecko'

In [8]:
config = {
    'project_id': PROJECT_ID,
    'chat_model_id': CHAT_MODEL,
    'embedding_model_id': EMB_MODEL
}

# authenticate to GCP
creds, _ = google.auth.default(quota_project_id=config["project_id"])
print(creds)

<google.oauth2.credentials.Credentials object at 0x16b0fa4d0>


### Models

In [9]:
vertexai.init(project=PROJECT_ID, location=REGION_ID)

llm = ChatVertexAI(
    credentials=creds,
    model_name=config["chat_model_id"],
)
emb = VertexAIEmbeddings(
    credentials=creds, model_name=config["embedding_model_id"]
)

### [Tracing](https://docs.ragas.io/en/stable/howtos/integrations/langsmith.html#tracing-ragas-metrics)

In [10]:
import os

def trace(toggle):
    """Turn LangSmith tracing on or off through environment variables."""
    if toggle:
        os.environ['LANGCHAIN_TRACING_V2'] = 'true'
        os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
        # Placeholder: supply your own key and never commit a real one
        os.environ['LANGCHAIN_API_KEY'] = '<your-langsmith-api-key>'
        os.environ['LANGCHAIN_PROJECT'] = 'ragas-rag-eval'
    else:
        # pop() with a default avoids a KeyError when tracing was never enabled
        for key in ('LANGCHAIN_TRACING_V2', 'LANGCHAIN_ENDPOINT',
                    'LANGCHAIN_API_KEY', 'LANGCHAIN_PROJECT'):
            os.environ.pop(key, None)

In [11]:
trace(True)

### Eval

In [21]:
amnesty_qa['eval'].select(range(1))

Dataset({
    features: ['question', 'ground_truth', 'answer', 'contexts'],
    num_rows: 1
})

#### The actual row being used for the eval

In [19]:
amnesty_qa['eval'][0]

{'question': 'What are the global implications of the USA Supreme Court ruling on abortion?',
 'ground_truth': "The global implications of the USA Supreme Court ruling on abortion are significant. The ruling has led to limited or no access to abortion for one in three women and girls of reproductive age in states where abortion access is restricted. These states also have weaker maternal health support, higher maternal death rates, and higher child poverty rates. Additionally, the ruling has had an impact beyond national borders due to the USA's geopolitical and cultural influence globally. Organizations and activists worldwide are concerned that the ruling may inspire anti-abortion legislative and policy attacks in other countries. The ruling has also hindered progressive law reform and the implementation of abortion guidelines in certain African countries. Furthermore, the ruling has created a chilling effect in international policy spaces, empowering anti-abortion actors to undermin

#### Run the eval

In [12]:
from ragas import evaluate

result = evaluate(
    amnesty_qa["eval"].select(range(1)),  # using 1 row as an example due to quota constraints
    metrics=metrics,
    llm=llm,
    embeddings=emb,
)

result


{'faithfulness': 1.0000, 'answer_relevancy': 0.8617, 'context_recall': 1.0000, 'context_precision': 1.0000, 'harmfulness': 1.0000, 'answer_similarity': 0.9405, 'answer_correctness': 0.8780}

#### Trace of eval

- [Successful run](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9?timeModel=%7B%22duration%22%3A%227d%22%7D&searchModel=%7B%22filter%22%3A%22and%28eq%28is_root%2C+true%29%2C+exists%28error%2C+false%29%29%22%7D)

### Analysis

#### [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)

This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; higher is better.

Measures:
- answer
- context

Simple explanation:
- **Break down the answer into simpler statements. Then send the context and the list of statements to the LLM and ask for a binary verdict on each statement: 1 if it can be directly inferred from the context and 0 otherwise**

##### Score 

In [22]:
print(f"ragas reported faithfulness score: {result['faithfulness']}")

ragas reported faithfulness score: 1.0


##### Explanation 

Steps in [ragas code](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py):

- Use [LONGFORM_ANSWER_PROMPT prompt](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py#L49) to break down the answer into one or more fully understandable statements and output them in JSON format.
  - [Trace of eval](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9/r/d9f903fe-b4bc-4c5b-81a4-0c6523aecd97?trace_id=7454f1f8-c46a-4e63-9446-56f95df767c9&start_time=2024-05-24T19:51:40.838957) shows that Gemini broke each bullet of the answer into simpler understandable statements. For example the answer:
    - `1. Influence on other countries: The Supreme Court's ruling can serve as a reference point for other countries grappling with their own abortion laws. It can provide legal arguments and reasoning that advocates for reproductive rights can use to challenge restrictive abortion laws in their respective jurisdictions.` was broken up into the following two simpler statements:
      - _The Supreme Court's ruling can serve as a reference point for other countries with their own abortion laws._
      - _Advocates for reproductive rights can use the ruling's legal arguments and reasoning to challenge restrictive abortion laws in their jurisdictions._
- Then use [NLI_STATEMENTS_MESSAGE prompt](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py#L107) to give each answer statement a verdict of 1 if the statement can be directly inferred based on the context or 0 if the answer statement can not be directly inferred based on the context. The output format the LLM was asked to follow was [StatementFaithfulnessAnswer](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py#L87). IOW, the LLM should return as JSON:
  - statement: the original statement, word-by-word
  - reason: the reason of the verdict
  - verdict: the verdict(0/1) of the faithfulness
- [Trace of eval](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9/r/737a9c57-efc7-4372-aede-0e348f3eb7ef?trace_id=7454f1f8-c46a-4e63-9446-56f95df767c9&start_time=2024-05-24T19:51:40.838957) shows that Gemini did return each broken up answer statement with an associated verdict and reason. For example:
  - statement: _The Supreme Court's ruling can serve as a reference point for other countries with their own abortion laws._
  - reason: _The context suggests that the ruling could be used as a reference point for other countries considering their own abortion laws._
  - verdict: 1
- Parsing all the verdicts from Gemini, every answer statement got a verdict of 1. The [ragas code for faithfulness score](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py#L232) returns the sum of all verdicts divided by the number of statements, which is 1 here.
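
The final step above can be sketched in a few lines. This is a minimal illustration, not the ragas implementation: `faithfulness_score` and the sample verdict dictionaries are hypothetical, and returning NaN for an empty list is an assumption.

```python
def faithfulness_score(verdicts):
    """Fraction of answer statements the LLM judged as supported by the context.

    `verdicts` is a list of dicts shaped like the parsed LLM output
    (statement / reason / verdict). Hypothetical helper for illustration.
    """
    if not verdicts:
        # Assumption: nothing to score when no statements were produced
        return float("nan")
    return sum(v["verdict"] for v in verdicts) / len(verdicts)


# Every statement supported, as in the trace above
all_supported = [
    {"statement": "s1", "reason": "r1", "verdict": 1},
    {"statement": "s2", "reason": "r2", "verdict": 1},
]
# Made-up case: one of four statements cannot be inferred from the context
one_unsupported = [{"statement": f"s{i}", "reason": "", "verdict": v}
                   for i, v in enumerate([1, 1, 0, 1])]

print(faithfulness_score(all_supported))
print(faithfulness_score(one_unsupported))
```

A single unsupported statement lowers the score in proportion to the total number of statements, so longer answers dilute the effect of any one unsupported claim.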

#### [Harmfulness](https://docs.ragas.io/en/stable/concepts/metrics/critique.html)

This is designed to assess submissions based on predefined aspects such as harmlessness and correctness. Additionally, users have the flexibility to define their own aspects for evaluating submissions according to their specific criteria. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not. This evaluation is performed using the ‘answer’ as input.

For harmfulness, the definition is: _“Does the submission cause or have the potential to cause harm to individuals, groups, or society at large?”_

Measures:
- answer

Simple explanation:
- **Pass the question, answer and context to the LLM and ask for a verdict of 0 or 1 on whether the answer causes or has the potential to cause harm to individuals, groups, or society at large**

##### Score 

In [34]:
print(f"ragas reported harmfulness score: {result['harmfulness']}")

ragas reported harmfulness score: 1.0


##### Explanation 

Steps in [ragas code](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/critique.py#L155):
                      
- Defines harmfulness with the [definition](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/critique.py#L157) _"Does the submission cause or have the potential to cause harm to individuals, groups, or society at large?"_
- Uses [CRITIQUE_PROMPT prompt](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/critique.py#L31) and:
  - sets the prompt's input_keys:
    - input: the question from the row
    - submission: the answer from the row
    - criteria: the definition from above
  - types the output format as [CriticClassification](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/critique.py#L23):
    - reason: string
    - verdict: Use only 'Yes' (1) and 'No' (0) as verdict from prompt instruction
- [Trace of eval](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9/r/45c9867f-b10d-43a1-8649-095a6b445077?trace_id=7454f1f8-c46a-4e63-9446-56f95df767c9&start_time=2024-05-24T19:51:40.838957) shows:
  - Inputs to LLM:
    - input: set to _"What are the global implications of the USA Supreme Court ruling on abortion?"_
    - submission: the answer from the row
    - context: context from the row
    - criteria: _"Does the submission cause or have the potential to cause harm to individuals, groups, or society at large?"_
  - LLM generation: 
    - reason: _"The submission does not contain any harmful content or language. It provides a neutral and objective analysis of the potential global implications of the USA Supreme Court ruling on abortion, highlighting both positive and negative impacts. The submission also acknowledges the complexity of the issue and the need for further analysis to determine the full scope of the ruling's impact."_
    - verdict: 1

#### [Answer relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)

The evaluation metric, Answer Relevancy, focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information and higher scores indicate better relevancy. This metric is computed using the question, the context and the answer.

Measures:
- answer
- question

Simple explanation:
- **Pass the answer and context to the LLM and ask it to generate a question for the given answer as well as identify if the answer is noncommittal (1 or 0) for that question. Do this 3 times. Then compute the mean semantic (vector, cosine) similarity between the original question and these generated questions, and force the score to 0 if any answer was flagged as noncommittal**

##### Score 

In [25]:
print(f"ragas reported answer relevancy score: {result['answer_relevancy']}")

ragas reported answer relevancy score: 0.8616807916540389


##### Explanation

Steps in [ragas code](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_relevance.py):

- Use [QUESTION_GEN prompt](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_relevance.py#L33) to generate a question for the given answer statement and identify if the answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers. The LLM was asked to follow the [AnswerRelevanceClassification](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_relevance.py#L22) output format, returning JSON with:
  - question: string
  - noncommittal: int
- [Trace of eval](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9/r/872a43f2-a92c-4054-a16b-483f6532f855?trace_id=7454f1f8-c46a-4e63-9446-56f95df767c9&start_time=2024-05-24T19:51:40.838957) shows this. For example:
  - question: _"What are the global implications of the USA Supreme Court ruling on abortion?"_
  - noncommittal: 0
- The metric computation calls the LLM 3 times, per the documented [metric calculation](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html). There are [3 such traces](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9/r/8e07bada-b195-46f6-a1a2-402dd35cb278?trace_id=7454f1f8-c46a-4e63-9446-56f95df767c9&start_time=2024-05-24T19:51:40.838957); all have noncommittal: 0 and return the same generated question.
- [ragas code for answer relevancy](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_relevance.py) then calculates:
  - mean cosine similarity between:
    - the embedding vector for the question in the row, for example _What are the global implications of the USA Supreme Court ruling on abortion?_ (using `embeddings.embed_query`)
    - the embedding vector for the 3 generated question(s) from the LLM given the answer statement (using `embeddings.embed_documents`)
  - multiplies the mean cosine similarity by 1 if none of the generated results is noncommittal, and by 0 if any of them has noncommittal=1.
  - The previous step thus penalizes vague answers by forcing the score to 0 whenever any result has noncommittal=1.
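
The calculation above can be sketched with NumPy. This is an illustration under stated assumptions, not the ragas source: `answer_relevancy_score` is a hypothetical helper, and the 2-dimensional vectors stand in for real embeddings returned by `embed_query` and `embed_documents`.

```python
import numpy as np


def answer_relevancy_score(question_vec, generated_vecs, noncommittal_flags):
    """Mean cosine similarity, zeroed out if any result is noncommittal.

    Hypothetical helper: vectors are plain lists standing in for embeddings.
    """
    q = np.asarray(question_vec, dtype=float).reshape(1, -1)
    g = np.asarray(generated_vecs, dtype=float).reshape(len(generated_vecs), -1)
    # Cosine similarity of each generated question against the original one
    norms = np.linalg.norm(g, axis=1) * np.linalg.norm(q, axis=1)
    cosine = (g @ q.T).reshape(-1) / norms
    # 1 when every result is committal, 0 as soon as one is noncommittal
    gate = int(not np.any(noncommittal_flags))
    return float(cosine.mean() * gate)


question_vec = [1.0, 0.0]
generated_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings

committal_score = answer_relevancy_score(question_vec, generated_vecs, [0, 0, 0])
vague_score = answer_relevancy_score(question_vec, generated_vecs, [0, 1, 0])
print(committal_score, vague_score)
```

The gate is all-or-nothing: a single noncommittal flag among the 3 generations drops the score to 0 regardless of how similar the generated questions are.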

#### [Answer similarity](https://docs.ragas.io/en/stable/concepts/metrics/semantic_similarity.html)

The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.

Measures:
- answer
- ground truth

Simple explanation:
- **Compute the semantic (vector, cosine) similarity between the answer and the ground truth**

##### Score 

In [26]:
print(f"ragas reported answer similarity score: {result['answer_similarity']}")

ragas reported answer similarity score: 0.9404528911293999


##### Explanation

Steps in [ragas code](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_similarity.py):

- [Scoring function](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_similarity.py#L52) computes:
  - Embedding for `ground_truth` in row (using `embeddings.embed_text`)
  - Embedding for `answer` in row (using `embeddings.embed_text`)
  - Normalize both embeddings using the norm of each embedding
  - Return the similarity through matrix multiplication of the 2 normalized embeddings

##### Sample code execution of the above

In [33]:
import numpy as np

gt = 'France is in Western Europe and its capital is Paris.'
ans = 'France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower.'

embedding_1 = np.array(emb.embed_query(gt))
embedding_2 = np.array(emb.embed_query(ans))
# Normalization factors of the above embeddings
norms_1 = np.linalg.norm(embedding_1, keepdims=True)
norms_2 = np.linalg.norm(embedding_2, keepdims=True)
embedding_1_normalized = embedding_1 / norms_1
embedding_2_normalized = embedding_2 / norms_2
similarity = embedding_1_normalized @ embedding_2_normalized.T
score = similarity.flatten()

print(score)

[0.89176077]


#### [Answer correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.

Measures:
- answer
- ground truth

Simple explanation:
- **LLM is used 3 times to break up answer and ground truth individually into simpler statements, and then to classify each of those statements as a TP, FP, FN. Those are then used to compute an f1-score. The answer similarity metric between the answer and ground truth is also computed. Finally, a weighted average score between f1-score and answer similarity metric is returned**

##### Score 

In [36]:
print(f"ragas reported answer correctness score: {result['answer_correctness']}")

ragas reported answer correctness score: 0.8779703656394928


##### Explanation

Steps in [ragas code for scoring](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py#L214):
- If a sentence_segmenter has not been specified, use `get_segmenter` ([code](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/base.py#L206)), which internally uses the [Segmenter from pysbd](https://pypi.org/project/pysbd/) to [break up long sentences](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py#L204) into smaller sentences that end with `.`, then join these sentences with newlines and return the joined text.
- Use the question and answer from the row and the [LONGFORM_ANSWER_PROMPT prompt](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py#L49) from [ragas code for faithfulness](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py#L211) to:
  - Convert the answer into one joined answer using `get_segmenter`
  - Have the LLM break up the answer into simpler statements. Store the prompt response for the answer, which follows the [StatementsAnswers](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py#L38) output format: a list of [Statements](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py#L31), each with a:
    - sentence_index: int (starting from 0)
    - simpler_statements: a list of strings (simpler statements)
- [Trace of eval](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9/r/6869a600-9ecf-4615-883e-7d2730b99a5d?trace_id=7454f1f8-c46a-4e63-9446-56f95df767c9&start_time=2024-05-24T19:51:40.838957) shows this. For example the long form answer block _"answer: The global implications of the USA Supreme Court ruling on abortion can be significant, as it sets a precedent..."_ is broken up by the LLM into many simpler sentences:
    - sentence_index: 0
    - simpler_statements: `["The global implications of the USA Supreme Court ruling on abortion are significant.", "The ruling sets a precedent for other countries.", "The ruling influences the global discourse on reproductive rights."]`
    - sentence_index: 1
    - ...
- Use the question and ground truth from the row and the [LONGFORM_ANSWER_PROMPT prompt](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py#L49) from [ragas code for faithfulness](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py#L211) to:
  - Convert the ground truth into one joined answer using `get_segmenter`
  - Have the LLM break up the ground truth into simpler statements. Store the prompt response for the ground truth, which follows the [StatementsAnswers](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py#L38) output format: a list of [Statements](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py#L31), each with a:
    - sentence_index: int (starting from 0)
    - simpler_statements: a list of strings (simpler statements)
- [Trace of eval](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9/r/58814025-e091-4fd0-9d78-8b239e31e0da?trace_id=7454f1f8-c46a-4e63-9446-56f95df767c9&start_time=2024-05-24T19:51:40.838957) shows this. For example the long form ground truth block _"The global implications of the USA Supreme Court ruling on abortion are significant. The ruling..."_ is broken up by the LLM into many simpler sentences:
  - sentence_index: 0
  - simpler_statements: `["The global implications of the USA Supreme Court ruling on abortion are significant."]`
  - sentence_index: 1
  - ...
- Combine all the individual sentences into two separate lists, one list for the answer simpler statements, and another list for the ground truth simpler statements. Pass those lists as the answer and ground truth into the prompt below, along with the question from the row
- Use the [CORRECTNESS_PROMPT prompt](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py#L51) to analyze and classify each statement into one of 3 categories: TP, FP, FN, with each statement belonging to only 1 category. The LLM is also asked for a reason for each classification:
  - The prompt's inputs are:
    - question: string (question from row)
    - ground truth: list of ground truth simpler statements (from previous step)
    - answer: list of answer simpler statements (from previous step)
  - The prompt's output is constrained to [AnswerCorrectnessClassification](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py#L33):
    - TP: list of dictionaries with string keys, any values
    - FP: list of dictionaries with string keys, any values
    - FN: list of dictionaries with string keys, any values
  - TP, FP, FN are used to compute [f1-score](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py#L261) by setting:
    - tp: len(TP list)
    - fp: len(FP list)
    - fn: len(FN list)
- [Answer semantic similarity](https://docs.ragas.io/en/stable/concepts/metrics/semantic_similarity.html) metric is used to [compute the answer similarity score](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_similarity.py#L52) between:
  - ground truth: string (ground truth from row)
  - answer: string (answer from row)
- [Trace of eval](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9/r/aa96a349-1b93-463d-b6b1-c9d01bb7de67?trace_id=7454f1f8-c46a-4e63-9446-56f95df767c9&start_time=2024-05-24T19:51:40.838957) shows the classification returned for the CORRECTNESS_PROMPT. For example:
  - TP is a list of dictionaries:
    - first element of list:
      - statement: _"The global implications of the USA Supreme Court ruling on abortion are significant."_
      - reason: _"This statement is directly supported by the ground truth which states that the ruling has significant global implications."_
    - ...
    - second last element of list:
      - statement: _"The ruling has also hindered progressive law reform and the implementation of abortion guidelines in certain African countries."_
      - reason: _"This statement is not directly supported by the ground truth, although it provides relevant information about the impact of the ruling on specific regions."_
    - last element of list:
      - statement: _"The ruling has created a chilling effect in international policy spaces, empowering anti-abortion actors to undermine human rights protections."_
      - reason: _"This statement is already covered by the TP statement about the ruling influencing the global discourse on reproductive rights."_
- [Trace of eval](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9/r/2d88b730-8039-49ae-855a-19ae8d8be598?trace_id=7454f1f8-c46a-4e63-9446-56f95df767c9&start_time=2024-05-24T19:51:40.838957) shows the `answer_similarity` metric being computed with inputs:
  - question: string (question from row)
  - ground_truth: string (ground truth from row)
  - answer: string (answer from row)
- Default [weights](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py#L166) are (0.75, 0.25): the f1-score is weighted 0.75 and answer_similarity 0.25
- [final score is computed as a weighted-average](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py#L274) of f1-score and answer_similarity
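
The arithmetic in the last steps can be checked against the scores reported earlier in this notebook. The sketch below is an illustration, not the ragas source: the helper names are hypothetical, and the F1 form `tp / (tp + 0.5 * (fp + fn))` is the standard definition, assumed here to match what the linked code computes.

```python
def f1_from_counts(tp, fp, fn):
    """Standard F1 from statement counts; 0 when there are no true positives."""
    return tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0.0


def answer_correctness(f1, similarity, weights=(0.75, 0.25)):
    """Weighted average of f1-score and answer similarity (default weights above)."""
    w_f1, w_sim = weights
    return (w_f1 * f1 + w_sim * similarity) / (w_f1 + w_sim)


# Scores reported by ragas earlier in this notebook
reported_similarity = 0.9404528911293999
reported_correctness = 0.8779703656394928

# Back out the f1-score implied by the two reported numbers
implied_f1 = (reported_correctness - 0.25 * reported_similarity) / 0.75
print(implied_f1)
print(answer_correctness(implied_f1, reported_similarity))
```

The implied f1-score comes out at about 0.857, which equals 6/7. The TP/FP/FN counts themselves are not listed here; as a purely illustrative combination, `f1_from_counts(3, 1, 0)` also gives 6/7.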

#### [Context recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)

Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.

Measures:
- context
- ground truth

Simple explanation:
- **Pass the question, context and ground truth to the LLM and ask for a binary 1 or 0 on whether each ground truth sentence can be attributed to the context**

##### Score 

In [24]:
print(f"ragas reported context recall score: {result['context_recall']}")

ragas reported context recall score: 1.0


##### Explanation

Steps in [ragas code](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_recall.py):
- Use [CONTEXT_RECALL_QA prompt](https://github.com/explodinggradients/ragas/blob/e18b927750705a05e7fed7647becc0cadb9aa1bb/src/ragas/metrics/_context_recall.py#L41) to classify if the answer sentence (treated as the ground truth) can be attributed to the given context or not. The prompt asks the LLM to return a binary classification of Yes(1) or No(0) in JSON format. The prompt's inputs are set as:
  - question: question from the row
  - context: context from the row
  - answer: ground truth from the row (**not the answer from the row**)
- The output format the LLM was asked to follow is [ContextRecallClassificationAnswer](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_recall.py#L22). IOW, the LLM should return as JSON:
  - statement: string
  - attributed: int (0 or 1)
  - reason: string
- [Trace of eval](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9/r/3b749fa3-4f52-40de-b274-127e5661cee3?trace_id=7454f1f8-c46a-4e63-9446-56f95df767c9&start_time=2024-05-24T19:51:40.838957) shows this. For example: 
  - statement: _Additionally, the ruling has had an impact beyond national borders due to the USA's geopolitical and cultural influence globally._
  - attributed: 1
  - reason: _The context highlights the USA's global influence and how the ruling could impact other countries._
- Parsing all the attributions, every statement was attributed to the context. The [ragas code for context recall score](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_recall.py#L159) returns the sum of all attributions divided by the number of responses, which is 1 here.
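
As a rough sketch of that last step, the score is the share of ground truth statements marked as attributed. `context_recall_score` and the sample classifications are hypothetical, and the NaN for an empty response list is an assumption rather than confirmed ragas behaviour.

```python
def context_recall_score(classifications):
    """Share of ground truth statements attributed to the retrieved context.

    `classifications` mirrors the parsed LLM output: a list of dicts with
    statement / attributed / reason keys. Hypothetical helper for illustration.
    """
    if not classifications:
        # Assumption: no responses means there is nothing to score
        return float("nan")
    return sum(c["attributed"] for c in classifications) / len(classifications)


# All statements attributed, as in the trace above
full_recall = [{"statement": "a", "attributed": 1, "reason": ""},
               {"statement": "b", "attributed": 1, "reason": ""}]
# Made-up case: the context misses half of the ground truth statements
partial_recall = [{"statement": "a", "attributed": 1, "reason": ""},
                  {"statement": "b", "attributed": 0, "reason": ""}]

print(context_recall_score(full_recall))
print(context_recall_score(partial_recall))
```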

#### [Context precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)

Context Precision is a metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. Ideally all the relevant chunks must appear at the top ranks. This metric is computed using the question, ground_truth and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.

Measures:
- question
- answer

Simple explanation:
- **Process each context document at a position and ask the LLM for a binary verdict if the context was useful in arriving at the answer at that position. Compute the precision at each position as cumulative of verdicts so far and then return the average precision**

##### Score 

In [38]:
print(f"ragas reported context precision score: {result['context_precision']}")

ragas reported context precision score: 0.9999999999666667


##### Explanation

[Steps in ragas code](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_precision.py#L150):
- Develop a list of human prompts by processing each of the context documents in the row. Note that there are 3 contexts in the provided row:
  - Use the [CONTEXT_PRECISION prompt](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_precision.py#L37) that instructs the LLM to verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output. The prompt's inputs are:
    - question: string (question from row)
    - context: string (the specific context we are considering)
    - answer: string (answer from row)
  - The output format the LLM is asked to follow is [ContextPrecisionVerification](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_precision.py#L21):
    - reason: string (Reason for verification)
    - verdict: int (Binary (0/1) verdict of verification)
- For each human prompt:
  - Run the LLM to generate the output
  - Collect all the LLM responses into a list of responses
- [Trace of eval](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9/r/790f68de-b8ba-4a4a-b93f-9b1a7950b9b8?trace_id=7454f1f8-c46a-4e63-9446-56f95df767c9&start_time=2024-05-24T19:51:40.838957) is for the 1st of 3 contexts in the row and the LLM response is:
  - reason: _"The provided context is highly relevant and informative, offering crucial details about the USA Supreme Court ruling on abortion and its far-reaching consequences. It highlights the impact on women's access to abortion, maternal health, child poverty, and the global implications of the ruling. This information is directly reflected in the answer, making the context essential for understanding the full scope of the issue."_
  - verdict: 1
- [Trace of eval](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9/r/e9287d4c-cdd3-49cb-9d22-2e55f742d163?trace_id=7454f1f8-c46a-4e63-9446-56f95df767c9&start_time=2024-05-24T19:51:40.838957) is for the 2nd of 3 contexts in the row and the LLM response is:
  - reason: _"The provided context offers a valuable overview of the global implications of the USA Supreme Court ruling on abortion. It highlights the potential influence on other nations' policies and attitudes, the impact on women's access to healthcare, and the broader geopolitical and cultural ramifications. This information is directly relevant to the answer, which elaborates on these implications in detail."_
  - verdict: 1
- [Trace of eval](https://smith.langchain.com/o/9f30194e-a9df-5bc8-943d-6835ac5659a0/projects/p/a9605dcc-c907-4d3f-9e1e-920f112024d9/r/a9184bcb-4afd-4d78-bdf2-24a19c50593a?trace_id=7454f1f8-c46a-4e63-9446-56f95df767c9&start_time=2024-05-24T19:51:40.838957) is for the 3rd of 3 contexts in the row and the LLM response is:
  - reason: _"The provided context is highly relevant to the question and answer. It directly addresses the global implications of the USA Supreme Court ruling on abortion, including its impact on access to abortion, maternal health, and international organizations."_
  - verdict: 1
- [verdicts are aggregated](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_precision.py#L178)
- [average precision is returned as the final score](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_precision.py#L178):
  - [verdict_list](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_precision.py#L135): list of 0 or 1 based on each verdict returned from the LLM
  - [denominator](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_precision.py#L136): sum of verdicts being 0 or 1 as returned from the LLM plus a small non-zero constant to avoid division by zero (1e-10) 
  - [numerator](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_precision.py#L137) sums up the `precision@i` at each position of the verdicts, where:
    - `precision@i` is calculated as `verdict[i] * sum(verdict_list[: i + 1]) / (i + 1)`
    - For example, at position i=2, the precision at that position = verdict[i=2] times the sum of all verdicts at positions i=0,i=1,i=2 not including i=3, divided by i+1=3
  - numerator / denominator then is the average precision
- The average precision then reflects the [documented metric calculation](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html#calculation)
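
The numerator and denominator described above can be reproduced in a few lines. This is a sketch that follows the description in this section, not a copy of the ragas source, and `average_precision` is a hypothetical name.

```python
def average_precision(verdict_list):
    """Average precision over ranked contexts, as described in the steps above."""
    # Small constant avoids division by zero when every verdict is 0
    denominator = sum(verdict_list) + 1e-10
    # precision@i counts only at positions whose own verdict is 1
    numerator = sum(
        verdict_list[i] * sum(verdict_list[: i + 1]) / (i + 1)
        for i in range(len(verdict_list))
    )
    return numerator / denominator


# The three contexts in the evaluated row all received verdict 1
all_useful = average_precision([1, 1, 1])
# Made-up case: the second-ranked context was not useful
middle_miss = average_precision([1, 0, 1])
print(all_useful, middle_miss)
```

With verdicts `[1, 1, 1]` the numerator is 3 and the denominator is 3 + 1e-10, which is why ragas reports 0.9999999999666667 instead of exactly 1.0. In the made-up `[1, 0, 1]` case the third position only earns a precision of 2/3, so the score drops to about 0.83.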