# LLMComparison

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDoxJudge/blob/main/examples/llm_comparison.ipynb)

**Explanation:**

- **Defining Model Data:** The `models` list contains one dictionary per language model (three in this example). Each dictionary has a `name`, an overall `score`, and a `metrics` dictionary with the detailed evaluation results: Faithfulness, Answer Relevancy, Bias, Hallucination, Knowledge Retention, Toxicity, Precision, Recall, F1 Score, and BLEU.

- **Importing LLMComparison:** The comparison class is imported from the `indoxJudge.pipelines` module. This page refers to it as `LLMComparison`; the code cell below imports it as `EvaluationAnalyzer`. The class compares the performance of the language models based on their metrics.

- **Initializing the Comparison:** An instance of `LLMComparison` is created by passing the `models` list to it. This instance, `llm_comparison`, will handle the comparison of the models.

- **Plotting the Comparison:** The `plot` method generates a comparative visualization of the models' performance. With `mode="inline"`, the visualization is displayed inside the notebook, which is convenient in environments such as Google Colab.

This notebook compares multiple language models visually, so that their strengths and weaknesses can be analyzed metric by metric.
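The structure described above (a `name`, a `score`, and a `metrics` dictionary per model) can be checked before the list is handed to the comparison class. The following is a minimal sketch in plain Python, not part of indoxJudge; `validate_models` and `REQUIRED_KEYS` are hypothetical names introduced only for illustration.

```python
from numbers import Real

# Keys every model entry is expected to carry, per the description above.
REQUIRED_KEYS = {"name", "score", "metrics"}


def validate_models(models):
    """Return a list of problems found; an empty list means the input looks well-formed."""
    problems = []
    for index, model in enumerate(models):
        missing = REQUIRED_KEYS - set(model)
        if missing:
            problems.append(f"entry {index}: missing keys {sorted(missing)}")
            continue
        if not isinstance(model["score"], Real):
            problems.append(f"{model['name']}: score is not a number")
        for metric, value in model["metrics"].items():
            if not isinstance(value, Real):
                problems.append(f"{model['name']}: metric {metric!r} is not a number")
    return problems


sample = [
    {"name": "Model_1", "score": 0.50, "metrics": {"Faithfulness": 0.55, "BLEU": 0.11}},
    {"name": "Model_2", "metrics": {"Faithfulness": 1.0}},  # deliberately missing "score"
]
print(validate_models(sample))
```

Running a check like this first makes a malformed entry easier to find than an error raised from inside the plotting code.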


In [6]:
!pip install indoxJudge -U
!pip install transformers



In [1]:
# check indoxJudge version
import indoxJudge
indoxJudge.__version__

'0.0.2'

In [2]:
models = [{'name': 'Model_1',
  'score': 0.50,
  'metrics': {'Faithfulness': 0.55,
   'AnswerRelevancy': 1.0,
   'Bias': 0.45,
   'Hallucination': 0.8,
   'KnowledgeRetention': 0.0,
   'Toxicity': 0.0,
   'precision': 0.64,
   'recall': 0.77,
   'f1_score': 0.70,
   'BLEU': 0.11}},
 {'name': 'Model_2',
  'score': 0.61,
  'metrics': {'Faithfulness': 1.0,
   'AnswerRelevancy': 1.0,
   'Bias': 0.0,
   'Hallucination': 0.8,
   'KnowledgeRetention': 1.0,
   'Toxicity': 0.0,
   'precision': 0.667,
   'recall': 0.77,
   'f1_score': 0.71,
   'BLEU': 0.14}},
 {'name': 'Model_3',
  'score': 0.050,
  'metrics': {'Faithfulness': 1.0,
   'AnswerRelevancy': 1.0,
   'Bias': 0.0,
   'Hallucination': 0.83,
   'KnowledgeRetention': 0.0,
   'Toxicity': 0.0,
   'precision': 0.64,
   'recall': 0.76,
   'f1_score': 0.70,
   'BLEU': 0.10}},
]
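Before plotting, the overall scores can be ranked directly. The sketch below is plain Python and independent of indoxJudge; `rank_by_score` is a hypothetical helper, the scores are copied from the cell above, and it assumes that a higher `score` means a better model.

```python
# Overall scores copied from the cell above; the metrics are omitted for brevity.
score_sample = [
    {"name": "Model_1", "score": 0.50},
    {"name": "Model_2", "score": 0.61},
    {"name": "Model_3", "score": 0.050},
]


def rank_by_score(models):
    """Return (name, score) pairs sorted from highest to lowest overall score."""
    pairs = ((m["name"], m["score"]) for m in models)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Assumption: a higher score is better.
for position, (name, score) in enumerate(rank_by_score(score_sample), start=1):
    print(f"{position}. {name}: {score:.2f}")
```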

In [1]:
models = [
    {'name': 'Model_1',
     'score': 0.88,
     'metrics': {
         'Harmfulness': 1.0,
         'Misinformation': 1.0,
         'Fairness': 1.0,
         'Privacy': 1.0,
         'Out Of Distribution Robustness': 0.9,
         'Stereotype Bias': 0.9,
         'Adversarial Robustness': 0.85,
         'Robustness To Adversarial Demonstrations': 0.85,
         'Toxicity': 0.85,
         'Machine Ethics': 0.85
     }},
    {'name': 'Model_2',
     'score': 0.91,
     'metrics': {
         'Harmfulness': 1.0,
         'Misinformation': 1.0,
         'Fairness': 1.0,
         'Privacy': 1.0,
         'Out Of Distribution Robustness': 0.9,
         'Stereotype Bias': 0.9,
         'Adversarial Robustness': 0.85,
         'Robustness To Adversarial Demonstrations': 0.85,
         'Toxicity': 0.85,
         'Machine Ethics': 0.85
     }},
    {'name': 'Model_3',
     'score': 0.89,
     'metrics': {
         'Harmfulness': 1.0,
         'Misinformation': 1.0,
         'Fairness': 1.0,
         'Privacy': 1.0,
         'Out Of Distribution Robustness': 0.9,
         'Stereotype Bias': 0.9,
         'Adversarial Robustness': 0.85,
         'Robustness To Adversarial Demonstrations': 0.85,
         'Toxicity': 0.85,
         'Machine Ethics': 0.85
     }}
]


In [2]:
from indoxJudge.pipelines import EvaluationAnalyzer
llm_comparison = EvaluationAnalyzer(models=models)
llm_comparison.plot()



Dash app running on http://127.0.0.1:8050/


In [1]:
models = [{'name': 'Model_1',
  'score': 0.78,
  'metrics': {'Faithfulness': 0.80,
   'AnswerRelevancy': 0.85,
   'Bias': 0.76,
   'Hallucination': 0.78,
   'KnowledgeRetention': 0.77,
   'Toxicity': 0.75,
   'precision': 0.79,
   'recall': 0.80,
   'f1_score': 0.78,
   'BLEU': 0.77}},
 {'name': 'Model_2',
  'score': 0.82,
  'metrics': {'Faithfulness': 0.85,
   'AnswerRelevancy': 0.88,
   'Bias': 0.80,
   'Hallucination': 0.79,
   'KnowledgeRetention': 0.83,
   'Toxicity': 0.76,
   'precision': 0.81,
   'recall': 0.82,
   'f1_score': 0.80,
   'BLEU': 0.78}},
 {'name': 'Model_3',
  'score': 0.79,
  'metrics': {'Faithfulness': 0.82,
   'AnswerRelevancy': 0.87,
   'Bias': 0.78,
   'Hallucination': 0.81,
   'KnowledgeRetention': 0.80,
   'Toxicity': 0.77,
   'precision': 0.80,
   'recall': 0.79,
   'f1_score': 0.78,
   'BLEU': 0.76}},
]
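To read a list like this without the plot, one option is to look up which model has the highest value for each metric. The sketch below uses a subset of the values above and plain Python only; `highest_by_metric` is a hypothetical helper. It reports the largest value per metric and makes no claim about whether a higher value is better, which for metrics such as Hallucination depends on how the score is oriented.

```python
def highest_by_metric(models):
    """Map each metric name to the name of the model with the highest value for it."""
    result = {}
    # Assumption: every model reports the same set of metrics as the first one.
    for metric in models[0]["metrics"]:
        top = max(models, key=lambda m: m["metrics"][metric])
        result[metric] = top["name"]
    return result


# Subset of the metric values from the cell above.
subset = [
    {"name": "Model_1", "metrics": {"Faithfulness": 0.80, "Hallucination": 0.78, "recall": 0.80}},
    {"name": "Model_2", "metrics": {"Faithfulness": 0.85, "Hallucination": 0.79, "recall": 0.82}},
    {"name": "Model_3", "metrics": {"Faithfulness": 0.82, "Hallucination": 0.81, "recall": 0.79}},
]
print(highest_by_metric(subset))
```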
