# LLMComparison

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDoxJudge/blob/main/examples/llm_comparison.ipynb)

**Explanation:**

- **Defining Model Data:** The `models` list contains one dictionary per language model (here `llama3`, `OpenAi`, and `Gemini`). Each dictionary has a `name`, an overall evaluation `score`, and a `metrics` dictionary with the keys `Faithfulness`, `AnswerRelevancy`, `Bias`, `Hallucination`, `KnowledgeRetention`, `Toxicity`, `precision`, `recall`, `f1_score`, and `BLEU`.

- **Importing the Analyzer:** The `EvaluationAnalyzer` class is imported from the `indoxJudge.pipelines` module. This class compares the performance of the different language models based on their metrics.

- **Initializing the Comparison:** An instance of `EvaluationAnalyzer` is created by passing the `models` list to it. This instance, `llm_comparison`, handles the comparison of the models.

- **Plotting the Comparison:** The `plot` method is called with `interpreter=judge`, where `judge` is an `IndoxApi` model created from the `INDOX_API_KEY` environment variable. The call generates a comparative visualization of the models' performance and, as the output at the end of the notebook shows, returns a short text interpretation for each chart (for example `radar_chart`, `bar_chart`, and `scatter_plot`).

This notebook compares multiple language models visually, so that their respective strengths and weaknesses can be analyzed metric by metric.
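To make the data layout described above concrete, here is a minimal sketch in plain Python that finds the leading model for a few metrics. It does not use indoxJudge at all; the helper `best_per_metric` is a hypothetical name introduced for illustration, and the sample values are copied from the `models` cell below, trimmed to three metrics. It assumes higher is better, so metrics such as `Bias`, `Hallucination`, and `Toxicity`, where a lower value is presumably preferable, are left out.

```python
# Illustrative data in the same shape as the notebook's `models` list,
# trimmed to three higher-is-better metrics.
models = [
    {"name": "llama3", "score": 0.50,
     "metrics": {"Faithfulness": 0.55, "recall": 0.77, "BLEU": 0.11}},
    {"name": "OpenAi", "score": 0.61,
     "metrics": {"Faithfulness": 1.0, "recall": 0.77, "BLEU": 0.14}},
    {"name": "Gemini", "score": 0.050,
     "metrics": {"Faithfulness": 1.0, "recall": 0.76, "BLEU": 0.10}},
]


def best_per_metric(models):
    """Return {metric: name of the model with the highest value}.

    Hypothetical helper, not part of indoxJudge. On a tie, the model
    that appears first in the list is kept.
    """
    best = {}
    for model in models:
        for metric, value in model["metrics"].items():
            if metric not in best or value > best[metric][1]:
                best[metric] = (model["name"], value)
    return {metric: name for metric, (name, _) in best.items()}


leaders = best_per_metric(models)
print(leaders)
# {'Faithfulness': 'OpenAi', 'recall': 'llama3', 'BLEU': 'OpenAi'}
```

A flat summary like this is only a quick cross-check; the charts produced later in the notebook show the full picture across all metrics.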


In [None]:
!pip install indoxJudge -U
!pip install transformers

In [1]:
# check indoxJudge version
import indoxJudge
indoxJudge.__version__

'0.0.2'

In [1]:
models = [{'name': 'llama3',
  'score': 0.50,
  'metrics': {'Faithfulness': 0.55,
   'AnswerRelevancy': 1.0,
   'Bias': 0.45,
   'Hallucination': 0.8,
   'KnowledgeRetention': 0.0,
   'Toxicity': 0.0,
   'precision': 0.64,
   'recall': 0.77,
   'f1_score': 0.70,
   'BLEU': 0.11}},
 {'name': 'OpenAi',
  'score': 0.61,
  'metrics': {'Faithfulness': 1.0,
   'AnswerRelevancy': 1.0,
   'Bias': 0.0,
   'Hallucination': 0.8,
   'KnowledgeRetention': 1.0,
   'Toxicity': 0.0,
   'precision': 0.667,
   'recall': 0.77,
   'f1_score': 0.71,
   'BLEU': 0.14}},
 {'name': 'Gemini',
  'score': 0.050,
  'metrics': {'Faithfulness': 1.0,
   'AnswerRelevancy': 1.0,
   'Bias': 0.0,
   'Hallucination': 0.83,
   'KnowledgeRetention': 0.0,
   'Toxicity': 0.0,
   'precision': 0.64,
   'recall': 0.76,
   'f1_score': 0.70,
   'BLEU': 0.10}},
]
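Because the list above is typed by hand, a quick structural check before plotting can catch a missing key or an out-of-range value. The sketch below is one possible way to do that; `validate_models` is a hypothetical helper, not part of indoxJudge, and the assumption that every score lies in the range 0 to 1 is inferred from the values in this notebook rather than from the library's documentation.

```python
def validate_models(models):
    """Return a list of problems found in a `models` list (empty if none).

    Hypothetical helper. Assumes each entry needs `name`, `score`, and
    `metrics`, and that all numeric values should lie in [0, 1].
    """
    problems = []
    for i, model in enumerate(models):
        missing = {"name", "score", "metrics"} - model.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue
        # Check the overall score together with the individual metrics.
        values = {"score": model["score"], **model["metrics"]}
        for key, value in values.items():
            if not 0.0 <= value <= 1.0:
                problems.append(f"{model['name']}: {key}={value} outside [0, 1]")
    return problems


# Small made-up sample: one valid entry and two faulty ones.
sample = [
    {"name": "ok", "score": 0.5, "metrics": {"BLEU": 0.1}},
    {"name": "bad", "score": 1.5, "metrics": {"BLEU": 0.1}},
    {"name": "incomplete", "score": 0.5},
]
print(validate_models(sample))
# ['bad: score=1.5 outside [0, 1]', "entry 2: missing keys ['metrics']"]
```

In the notebook, calling `validate_models(models)` on the list defined above would be expected to return an empty list.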

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()

INDOX_API_KEY = os.environ['INDOX_API_KEY']
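Indexing `os.environ` directly raises a bare `KeyError` when the variable is missing, which can be confusing in a fresh Colab session. As an optional alternative, the following sketch wraps the lookup in a small helper that fails with a clearer message. `require_env` and the `DEMO_KEY` variable are hypothetical names used only for this illustration.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`.

    Hypothetical helper: raises RuntimeError with a readable message
    if the variable is unset or empty.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or environment"
        )
    return value


# Demonstration with a throwaway variable so the block runs anywhere.
os.environ["DEMO_KEY"] = "abc123"
print(require_env("DEMO_KEY"))  # abc123
```

In this notebook the call would be `INDOX_API_KEY = require_env("INDOX_API_KEY")`, placed after `load_dotenv()`.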

In [3]:
from indoxJudge.models import IndoxApi
judge = IndoxApi(api_key=INDOX_API_KEY)

In [5]:
from indoxJudge.pipelines import EvaluationAnalyzer
llm_comparison = EvaluationAnalyzer(models=models)
llm_comparison.plot(interpreter=judge)

{
    "radar_chart": "OpenAi demonstrates the strongest overall performance, achieving perfect scores in Faithfulness, AnswerRelevancy, and KnowledgeRetention. In contrast, llama3 shows significant weaknesses in KnowledgeRetention and has a high Hallucination score. Gemini matches OpenAi in Faithfulness and AnswerRelevancy but has a slightly higher Hallucination score, indicating it also has areas needing improvement.",

    "bar_chart": "OpenAi leads in precision (0.667) and f1_score (0.71), while both llama3 and Gemini lag slightly behind with f1_scores of 0.70. In recall, all models are closely matched, with llama3 and OpenAi both at 0.77. OpenAi clearly excels in precision and f1_score, marking it as the top performer in these metrics.",

    "scatter_plot": "The scatter plot reveals that precision and recall are closely clustered for all models, indicating a balanced performance. OpenAi has a slight advantage in precision, while llama3 and Gemini maintain similar positions, sugges