# Bring your own LLMs

Ragas uses langchain under the hood to connect to LLMs for the metrics that require them. This means you can swap out the default LLM (`gpt-3.5-turbo-16k`) for any of the hundreds of LLM APIs that langchain supports out of the box.

- [Completion LLMs Supported](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.llms)
- [Chat based LLMs Supported](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.chat_models)

This guide shows you how to use another LLM API for evaluation.

## Evaluating with GPT4

Ragas uses GPT-3.5 by default, but evaluating with GPT-4 can improve the results, so let's use it for the `Faithfulness` metric.

To start off, we initialise the GPT-4 `chat_model` from langchain.

In [52]:
%pip show ragas

Name: ragas
Version: 0.0.15.dev2+gd590b10.d20230924
Summary: 
Home-page: 
Author: 
Author-email: 
License: 
Location: /Users/inflaton/miniconda3/lib/python3.10/site-packages
Requires: datasets, langchain, numpy, openai, pydantic, pysbd, sentence-transformers, transformers
Required-by: 
Note: you may need to restart the kernel to use updated packages.


In [30]:
import os
from dotenv import load_dotenv
load_dotenv()

os.environ["RAGAS_OPENAI_MODEL_NAME"] = 'gpt-3.5-turbo' 

True

Now initialise `Faithfulness` and `AnswerRelevancy` with both `gpt-3.5-turbo-instruct` and `gpt-4`.

In [53]:
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from ragas.metrics import Faithfulness, AnswerRelevancy

gpt_instruct = OpenAI(model_name="gpt-3.5-turbo-instruct")
faithfulness_instruct = Faithfulness(name="faithfulness", llm=gpt_instruct)
answer_relevancy_instruct = AnswerRelevancy(name="answer_relevancy", llm=gpt_instruct)

gpt4 = ChatOpenAI(model_name="gpt-4")
faithfulness_gpt4 = Faithfulness(name="faithfulness", llm=gpt4)
answer_relevancy_gpt4 = AnswerRelevancy(name="answer_relevancy", llm=gpt4)


That's it!

Now let's run the evaluations using the example from [quickstart](../quickstart.ipnb).

In [54]:
# data
from datasets import load_dataset

fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
fiqa_eval

DatasetDict({
    baseline: Dataset({
        features: ['question', 'ground_truths', 'answer', 'contexts'],
        num_rows: 30
    })
})

In [67]:
"""
Official evaluation script for QAConv, modified from SQuAD 2.0.

 * Copyright (c) 2021, salesforce.com, inc.
 * All rights reserved.
 * SPDX-License-Identifier: BSD-3-Clause
 * For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause

"""

import collections
import re
import string


def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def get_tokens(s):
    if not s:
        return []
    return normalize_answer(s).split()


def compute_exact(a_gold, a_pred):
    return int(normalize_answer(a_gold) == normalize_answer(a_pred))


def compute_f1(a_gold, a_pred):
    gold_toks = get_tokens(a_gold)
    pred_toks = get_tokens(a_pred)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if len(gold_toks) == 0 or len(pred_toks) == 0:
        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
        return int(gold_toks == pred_toks)
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1
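As a quick check of how these two scores behave, here is a minimal, self-contained sketch of the same token-overlap logic. The helper `token_f1` and the two strings are illustrative assumptions (they are not rows from the dataset); the normalisation follows the same steps as `normalize_answer` above.

```python
import collections
import re
import string


def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    exclude = set(string.punctuation)
    s = "".join(ch for ch in s.lower() if ch not in exclude)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def token_f1(gold, pred):
    """Token-overlap F1 between a gold answer and a predicted answer."""
    gold_toks = normalize_answer(gold).split()
    pred_toks = normalize_answer(pred).split()
    if not gold_toks or not pred_toks:
        # Both empty counts as agreement, otherwise no overlap is possible.
        return float(gold_toks == pred_toks)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


# Hypothetical strings, chosen only to illustrate the arithmetic.
gold = "Have the check reissued to the proper payee."
pred = "The check should be reissued."

exact = int(normalize_answer(gold) == normalize_answer(pred))
f1 = token_f1(gold, pred)
print(exact, round(f1, 3))
```

After normalisation the gold answer has 6 tokens and the prediction has 4, with 2 in common, so precision is 2/4, recall is 2/6, and F1 is 0.4, while exact match is 0.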


In [68]:
dataset = fiqa_eval["baseline"]


def add_em_f1(record, idx):
    # For the first 5 records, replace the generated answer with the first
    # ground truth, so those rows score EM = 1 and F1 = 1.0.
    gold = record["ground_truths"][0]
    answer = gold if idx < 5 else record["answer"]
    return {
        "answer": answer,
        "EM": compute_exact(gold, answer),
        "F1": compute_f1(gold, answer),
    }


new_ds = dataset.map(add_em_f1, batched=False, with_indices=True)
new_ds

Dataset({
    features: ['question', 'ground_truths', 'answer', 'contexts', 'EM', 'F1'],
    num_rows: 30
})

In [69]:
new_ds.to_pandas()

Unnamed: 0,question,ground_truths,answer,contexts,EM,F1
0,How to deposit a cheque issued to an associate...,[Have the check reissued to the proper payee.J...,Have the check reissued to the proper payee.Ju...,[Just have the associate sign the back and the...,1,1.0
1,Can I send a money order from USPS as a business?,[Sure you can. You can fill in whatever you w...,Sure you can. You can fill in whatever you wa...,[Sure you can. You can fill in whatever you w...,1,1.0
2,1 EIN doing business under multiple business n...,[You're confusing a lot of things here. Compan...,You're confusing a lot of things here. Company...,[You're confusing a lot of things here. Compan...,1,1.0
3,Applying for and receiving business credit,"[""I'm afraid the great myth of limited liabili...","""I'm afraid the great myth of limited liabilit...",[Set up a meeting with the bank that handles y...,1,1.0
4,401k Transfer After Business Closure,[You should probably consult an attorney. Howe...,You should probably consult an attorney. Howev...,[The time horizon for your 401K/IRA is essenti...,1,1.0
5,What are the ins/outs of writing equipment pur...,[Most items used in business have to be deprec...,\nWriting equipment purchases off as business ...,[You would report it as business income on Sch...,0,0.424742
6,Can a entrepreneur hire a self-employed busine...,[Yes. I can by all means start my own company ...,"\nYes, an entrepreneur can hire a self-employe...",[Yes. I can by all means start my own company ...,0,0.226087
7,Intentions of Deductible Amount for Small Busi...,"[""If your sole proprietorship losses exceed al...",\nThe intention of deductible amounts for smal...,"[""Short answer, yes. But this is not done thro...",0,0.16185
8,How can I deposit a check made out to my busin...,[You should have a separate business account. ...,\nYou can deposit a check made out to your bus...,"[""I have checked with Bank of America, and the...",0,0.186528
9,Filing personal with 1099s versus business s-c...,[Depends whom the 1099 was issued to. If it wa...,\nFiling personal taxes with 1099s versus fili...,[Depends whom the 1099 was issued to. If it wa...,0,0.214286


In [70]:
%%time
# evaluate
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
)

result = evaluate(
    new_ds,
    metrics=[
        faithfulness,
        answer_relevancy,
    ],
)

result

evaluating with [faithfulness]


100%|██████████| 2/2 [05:44<00:00, 172.45s/it]


evaluating with [answer_relevancy]


100%|██████████| 2/2 [01:15<00:00, 37.98s/it]


CPU times: user 391 ms, sys: 343 ms, total: 734 ms
Wall time: 7min 1s


{'ragas_score': 0.8333, 'faithfulness': 0.7698, 'answer_relevancy': 0.9083}
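The aggregate `ragas_score` printed here is numerically consistent with the harmonic mean of the two metric scores. The sketch below checks that arithmetic on the reported values; `harmonic_mean` is a hypothetical helper written for illustration, not a function from the ragas API.

```python
def harmonic_mean(values):
    """Harmonic mean of non-negative scores; 0.0 if any score is 0."""
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


# Aggregate scores copied from the result printed above.
scores = {"faithfulness": 0.7698, "answer_relevancy": 0.9083}

combined = harmonic_mean(list(scores.values()))
print(round(combined, 4))
```

Because the harmonic mean is pulled towards the lower score, it comes out at roughly 0.8333 here rather than the arithmetic mean of about 0.839.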

In [74]:
df = result.to_pandas()
df

Unnamed: 0,question,contexts,answer,ground_truths,faithfulness,answer_relevancy
0,How to deposit a cheque issued to an associate...,[Just have the associate sign the back and the...,Have the check reissued to the proper payee.Ju...,[Have the check reissued to the proper payee.J...,0.666667,0.821788
1,Can I send a money order from USPS as a business?,[Sure you can. You can fill in whatever you w...,Sure you can. You can fill in whatever you wa...,[Sure you can. You can fill in whatever you w...,1.0,0.844772
2,1 EIN doing business under multiple business n...,[You're confusing a lot of things here. Compan...,You're confusing a lot of things here. Company...,[You're confusing a lot of things here. Compan...,0.857143,0.777502
3,Applying for and receiving business credit,[Set up a meeting with the bank that handles y...,"""I'm afraid the great myth of limited liabilit...","[""I'm afraid the great myth of limited liabili...",1.0,0.813215
4,401k Transfer After Business Closure,[The time horizon for your 401K/IRA is essenti...,You should probably consult an attorney. Howev...,[You should probably consult an attorney. Howe...,0.666667,0.769605
5,What are the ins/outs of writing equipment pur...,[You would report it as business income on Sch...,\nWriting equipment purchases off as business ...,[Most items used in business have to be deprec...,1.0,0.949131
6,Can a entrepreneur hire a self-employed busine...,[Yes. I can by all means start my own company ...,"\nYes, an entrepreneur can hire a self-employe...",[Yes. I can by all means start my own company ...,1.0,0.916488
7,Intentions of Deductible Amount for Small Busi...,"[""Short answer, yes. But this is not done thro...",\nThe intention of deductible amounts for smal...,"[""If your sole proprietorship losses exceed al...",0.8,0.904527
8,How can I deposit a check made out to my busin...,"[""I have checked with Bank of America, and the...",\nYou can deposit a check made out to your bus...,[You should have a separate business account. ...,0.6,0.976848
9,Filing personal with 1099s versus business s-c...,[Depends whom the 1099 was issued to. If it wa...,\nFiling personal taxes with 1099s versus fili...,[Depends whom the 1099 was issued to. If it wa...,0.5,0.962691


In [83]:
def add_ragas_scores(record, idx):
    f = df["faithfulness"][idx]
    a = df["answer_relevancy"][idx]
    return {
        "faithfulness": f,
        "answer_relevancy": a,
        # Note: f * a / (f + a) is half the harmonic mean of the two metrics,
        # so this column is on a different scale from the aggregate ragas_score.
        "ragas_score": f * a / (f + a),
    }


result2 = new_ds.map(
    add_ragas_scores,
    batched=False,
    with_indices=True,
    remove_columns=dataset.column_names,
)
result2

Dataset({
    features: ['EM', 'F1', 'faithfulness', 'answer_relevancy', 'ragas_score'],
    num_rows: 30
})

In [84]:
df2 = result2.to_pandas()
df2

Unnamed: 0,EM,F1,faithfulness,answer_relevancy,ragas_score
0,1,1.0,0.666667,0.821788,0.368072
1,1,1.0,1.0,0.844772,0.457928
2,1,1.0,0.857143,0.777502,0.407691
3,1,1.0,1.0,0.813215,0.448493
4,1,1.0,0.666667,0.769605,0.357223
5,0,0.424742,1.0,0.949131,0.486951
6,0,0.226087,1.0,0.916488,0.478212
7,0,0.16185,0.8,0.904527,0.424529
8,0,0.186528,0.6,0.976848,0.371697
9,0,0.214286,0.5,0.962691,0.329082
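Note that the per-row `ragas_score` column in this table is computed as `f * a / (f + a)`, which is half the harmonic mean, so its values top out at 0.5 rather than 1.0. A minimal sketch, using the first three rows of the table and two hypothetical helper names, shows how the column relates to a per-row harmonic mean:

```python
# (faithfulness, answer_relevancy) for rows 0-2 of the table above.
rows = [
    (0.666667, 0.821788),
    (1.0, 0.844772),
    (0.857143, 0.777502),
]


def half_harmonic(f, a):
    """The per-row formula used in the notebook: f * a / (f + a)."""
    return f * a / (f + a) if (f + a) else 0.0


def harmonic(f, a):
    """Harmonic mean of two scores: 2 * f * a / (f + a)."""
    return 2 * half_harmonic(f, a)


for f, a in rows:
    print(f"{half_harmonic(f, a):.6f}  {harmonic(f, a):.6f}")
```

Doubling the per-row column therefore puts it on the same 0 to 1 scale as the aggregate `ragas_score` reported by `evaluate`.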


In [85]:
result2.to_csv("gpt-3.5-turbo.csv.log")

2023

In [86]:
%%time
# evaluate

result_gpt4 = evaluate(
    new_ds,
    metrics=[
        faithfulness_gpt4,
        answer_relevancy_gpt4,
    ],
)

result_gpt4

evaluating with [faithfulness]


100%|██████████| 2/2 [17:08<00:00, 514.13s/it]


evaluating with [answer_relevancy]


100%|██████████| 2/2 [01:47<00:00, 53.54s/it]


CPU times: user 806 ms, sys: 416 ms, total: 1.22 s
Wall time: 18min 56s


{'ragas_score': 0.8068, 'faithfulness': 0.7170, 'answer_relevancy': 0.9224}

In [87]:
result, result_gpt4

({'ragas_score': 0.8333, 'faithfulness': 0.7698, 'answer_relevancy': 0.9083},
 {'ragas_score': 0.8068, 'faithfulness': 0.7170, 'answer_relevancy': 0.9224})

In [88]:
df = result_gpt4.to_pandas()
result3 = result2.map(
    lambda record, idx: {
        "faithfulness_gpt4": df["faithfulness"][idx], 
        "answer_relevancy_gpt4": df["answer_relevancy"][idx], 
        "ragas_score_gpt4": df["faithfulness"][idx] * df["answer_relevancy"][idx] / (df["faithfulness"][idx] + df["answer_relevancy"][idx])
    },
    batched=False,
    with_indices=True,
)
result3

Dataset({
    features: ['EM', 'F1', 'faithfulness', 'answer_relevancy', 'ragas_score', 'faithfulness_gpt4', 'answer_relevancy_gpt4', 'ragas_score_gpt4'],
    num_rows: 30
})

In [89]:
result3.to_csv("gpt-3-4.csv.log")

3504

In [90]:
%%time

result_instruct = evaluate(
    new_ds,
    metrics=[
        faithfulness_instruct,
        answer_relevancy_instruct,
    ],
)

result, result_gpt4, result_instruct

evaluating with [faithfulness]


100%|██████████| 2/2 [00:11<00:00,  5.81s/it]


evaluating with [answer_relevancy]


100%|██████████| 2/2 [00:44<00:00, 22.08s/it]


CPU times: user 238 ms, sys: 82.4 ms, total: 321 ms
Wall time: 56.3 s


({'ragas_score': 0.8333, 'faithfulness': 0.7698, 'answer_relevancy': 0.9083},
 {'ragas_score': 0.8068, 'faithfulness': 0.7170, 'answer_relevancy': 0.9224},
 {'ragas_score': 0.8158, 'faithfulness': 0.7474, 'answer_relevancy': 0.8980})
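To put the three runs side by side, the sketch below tabulates the wall times and aggregate scores reported in the cells above. It assumes the first run used `gpt-3.5-turbo`, as set through `RAGAS_OPENAI_MODEL_NAME` earlier; the dictionary layout and labels are illustrative only.

```python
# Wall times (seconds) and aggregate ragas_score copied from the outputs above.
runs = {
    "gpt-3.5-turbo": {"wall_s": 7 * 60 + 1, "ragas_score": 0.8333},
    "gpt-4": {"wall_s": 18 * 60 + 56, "ragas_score": 0.8068},
    "gpt-3.5-turbo-instruct": {"wall_s": 56.3, "ragas_score": 0.8158},
}

baseline = runs["gpt-3.5-turbo"]["wall_s"]

# Wall time of each run relative to the gpt-3.5-turbo run.
relative = {name: run["wall_s"] / baseline for name, run in runs.items()}

for name, run in runs.items():
    print(
        f"{name:<24} {run['wall_s']:>7.1f}s "
        f"x{relative[name]:.2f} ragas_score={run['ragas_score']}"
    )
```

On this single 30-row run, the GPT-4 evaluation took about 2.7 times as long as the `gpt-3.5-turbo` one, while `gpt-3.5-turbo-instruct` finished in roughly an eighth of the time. These are one-off timings, so treat them as indicative only.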

In [91]:
df = result_instruct.to_pandas()
result_all = result3.map(
    lambda record, idx: {
        "faithfulness_gpt3_instruct": df["faithfulness"][idx], 
        "answer_relevancy_gpt3_instruct": df["answer_relevancy"][idx], 
        "ragas_score_gpt3_instruct": df["faithfulness"][idx] * df["answer_relevancy"][idx] / (df["faithfulness"][idx] + df["answer_relevancy"][idx])
    },
    batched=False,
    with_indices=True,
)
result_all

Dataset({
    features: ['EM', 'F1', 'faithfulness', 'answer_relevancy', 'ragas_score', 'faithfulness_gpt4', 'answer_relevancy_gpt4', 'ragas_score_gpt4', 'faithfulness_gpt3_instruct', 'answer_relevancy_gpt3_instruct', 'ragas_score_gpt3_instruct'],
    num_rows: 30
})

In [92]:
result_all.to_csv("result_all.csv.log")

4885

In [93]:
df = result_all.to_pandas()
df

Unnamed: 0,EM,F1,faithfulness,answer_relevancy,ragas_score,faithfulness_gpt4,answer_relevancy_gpt4,ragas_score_gpt4,faithfulness_gpt3_instruct,answer_relevancy_gpt3_instruct,ragas_score_gpt3_instruct
0,1,1.0,0.666667,0.821788,0.368072,0.9,0.82723,0.431041,0.8,0.866893,0.416052
1,1,1.0,1.0,0.844772,0.457928,0.888889,0.876646,0.441362,1.0,0.833523,0.454602
2,1,1.0,0.857143,0.777502,0.407691,1.0,0.776124,0.436976,0.888889,0.776325,0.414401
3,1,1.0,1.0,0.813215,0.448493,1.0,0.852873,0.460297,1.0,0.832488,0.454294
4,1,1.0,0.666667,0.769605,0.357223,0.0,0.8466,0.0,0.0,0.841671,0.0
5,0,0.424742,1.0,0.949131,0.486951,1.0,0.95701,0.489016,0.333333,0.930778,0.245437
6,0,0.226087,1.0,0.916488,0.478212,0.666667,0.965547,0.394371,0.5,0.900254,0.321461
7,0,0.16185,0.8,0.904527,0.424529,0.75,0.907683,0.410671,0.75,0.852995,0.399094
8,0,0.186528,0.6,0.976848,0.371697,1.0,0.986572,0.49662,0.5,0.923212,0.324341
9,0,0.214286,0.5,0.962691,0.329082,0.666667,0.931343,0.388543,0.5,0.902026,0.321687
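One way to quantify how much the evaluators disagree is the mean absolute difference between their per-row faithfulness scores. The sketch below does this for the `gpt-3.5-turbo` and `gpt-4` columns, using only the 10 rows displayed above (the full dataset has 30); `mean_abs_diff` is a hypothetical helper.

```python
# faithfulness and faithfulness_gpt4 for rows 0-9 of the table above.
faithfulness_gpt35 = [0.666667, 1.0, 0.857143, 1.0, 0.666667, 1.0, 1.0, 0.8, 0.6, 0.5]
faithfulness_gpt4 = [0.9, 0.888889, 1.0, 1.0, 0.0, 1.0, 0.666667, 0.75, 1.0, 0.666667]


def mean_abs_diff(xs, ys):
    """Mean absolute difference between two equal-length score lists."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


mad = mean_abs_diff(faithfulness_gpt35, faithfulness_gpt4)

# Row index where the two evaluators disagree the most.
largest = max(
    range(len(faithfulness_gpt35)),
    key=lambda i: abs(faithfulness_gpt35[i] - faithfulness_gpt4[i]),
)
print(round(mad, 4), largest)
```

On these rows the largest gap is at row 4, where `gpt-3.5-turbo` scores faithfulness at 0.666667 and `gpt-4` scores it at 0.0.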


In [102]:
best = df[df.faithfulness + df.faithfulness_gpt4 + df.faithfulness_gpt3_instruct  > 2.9]
best

Unnamed: 0,EM,F1,faithfulness,answer_relevancy,ragas_score,faithfulness_gpt4,answer_relevancy_gpt4,ragas_score_gpt4,faithfulness_gpt3_instruct,answer_relevancy_gpt3_instruct,ragas_score_gpt3_instruct
3,1,1.0,1.0,0.813215,0.448493,1.0,0.852873,0.460297,1.0,0.832488,0.454294
14,0,0.671642,1.0,0.919864,0.47913,1.0,0.945617,0.486024,1.0,0.930464,0.48199
18,0,0.596273,1.0,0.856767,0.461429,1.0,0.861429,0.462778,1.0,0.852309,0.460133
24,0,0.443114,1.0,0.927224,0.481119,1.0,0.940389,0.484639,1.0,0.918513,0.478763
26,0,0.214,1.0,0.938227,0.484065,1.0,0.944147,0.485636,1.0,0.917522,0.478494


In [103]:
len(best)

5