# LLM Evaluator

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDoxJudge/blob/main/examples/llm_evaluation.ipynb)

In [None]:
!pip install indoxJudge -U
!pip install transformers
!pip install torch
!pip install openai
!pip install python-dotenv

## Setting Up the Python Environment

If you are running this project in your local IDE, please create a Python environment to ensure all dependencies are correctly managed. You can follow the steps below to set up a virtual environment named `indoxJudge`:

### Windows

1. **Create the virtual environment:**
```bash
python -m venv indoxJudge
```
2. **Activate the virtual environment:**
```bash
indoxJudge\Scripts\activate
```

### macOS/Linux

1. **Create the virtual environment:**
```bash
python3 -m venv indoxJudge
```
2. **Activate the virtual environment:**
```bash
source indoxJudge/bin/activate
```

### Install Dependencies

Once the virtual environment is activated, install the required dependencies by running:

```bash
pip install -r requirements.txt
```
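
To confirm that the environment is actually active before installing anything, a quick check from Python can help. The sketch below is not part of indoxJudge; `in_virtualenv` is a hypothetical helper name, and it relies only on the standard library's `sys.prefix` and `sys.base_prefix`, which differ inside an environment created with `python -m venv`.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside an environment created by `python -m venv`, sys.prefix points at
    # the environment, while sys.base_prefix points at the base installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

If the first line prints `False`, the activation step above was probably skipped in the current shell.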


In [2]:
import os
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
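
`os.getenv` returns `None` when the variable is missing, and the problem then only surfaces later as an authentication error. As a minimal sketch (the helper name `require_env` is an assumption for illustration, not part of indoxJudge or python-dotenv), the key can be validated as soon as it is read:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Fail early with a message that names the missing variable.
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value
```

After `load_dotenv()` has run, `OPENAI_API_KEY = require_env("OPENAI_API_KEY")` could then replace the plain `os.getenv` call.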


In [3]:
query = "How does photosynthesis work in plants?"

retrieval_context = [
    "Photosynthesis is a biochemical process by which green plants, algae, and some bacteria convert light energy into chemical energy. "
    "It primarily occurs in the chloroplasts of plant cells, which contain the green pigment chlorophyll that captures sunlight.",
    
    "The process begins with light-dependent reactions in the thylakoid membranes of the chloroplasts. These reactions use solar energy to split water molecules "
    "(photolysis), producing oxygen gas (O2) as a byproduct and generating energy-rich molecules, ATP and NADPH.",
    
    "The second stage of photosynthesis, known as the Calvin cycle or light-independent reactions, takes place in the stroma of the chloroplast. "
    "During this stage, the plant uses the ATP and NADPH from the light-dependent reactions to fix carbon dioxide (CO2) into glucose molecules through a series of enzymatic steps.",
    
    "Photosynthesis is influenced by several factors, including light intensity, carbon dioxide concentration, temperature, and availability of water. "
    "Plants optimize these factors through adaptations such as leaf structure, stomatal control, and pigment types for efficient energy capture.",
    
    "This process is crucial for life on Earth because it not only provides the primary energy source for most ecosystems but also contributes to the oxygenation of the atmosphere, "
    "maintaining a balance in the global carbon cycle and supporting the existence of aerobic life forms."
]

response = (
    "Photosynthesis is a fundamental process that allows plants to convert light energy into chemical energy, which is stored as glucose. "
    "The process occurs in two stages: light-dependent reactions in the thylakoid membranes and the Calvin cycle in the stroma of the chloroplasts. "
    "In the light-dependent reactions, sunlight splits water molecules to release oxygen and generate energy carriers like ATP and NADPH. "
    "The Calvin cycle then uses these energy carriers to fix carbon dioxide into glucose. This process is critical for sustaining life on Earth by providing food and oxygen."
)

## Importing Required Modules
This cell imports the necessary classes from the indoxJudge library. `LLMEvaluator` is the class used for evaluating language models based on various metrics, and `OpenAi` is the class used to interact with OpenAI models, such as `gpt-4o-mini`.

In [4]:
from indoxJudge.pipelines import LLMEvaluator
from indoxJudge.models import OpenAi

## Initializing the OpenAI Model
Here, the `OpenAi` class is instantiated to create a model object that interacts with OpenAI's `gpt-4o-mini` model, with `max_tokens` set to 1024. The `api_key` argument authenticates the API requests. `OPENAI_API_KEY` is the variable loaded from the environment earlier, so make sure it holds your actual API key.

In [5]:

# Initialize the language model.
# Any OpenAI model can be used here (for example, GPT-4o); see the
# OpenAI Models documentation: https://platform.openai.com/docs/models
model = OpenAi(api_key=OPENAI_API_KEY, model="gpt-4o-mini", max_tokens=1024)

INFO: Initializing OpenAi with model: gpt-4o-mini and max_tokens: 1024


Next, the `LLMEvaluator` class is instantiated to create an evaluator object. The `llm_as_judge` parameter is set to the model created in the previous cell, which acts as the judge. The `llm_response`, `retrieval_context`, and `query` parameters supply the response under evaluation, the retrieved context passages, and the original question defined earlier.

In [6]:
evaluator = LLMEvaluator(
    llm_as_judge=model,
    llm_response=response,
    retrieval_context=retrieval_context,
    query=query,
)

INFO: Evaluator initialized with model and metrics.
INFO: Model set for all metrics.


**Explanation:**

- **Running the Evaluation:** The cell below calls the `judge` method on the `evaluator` object. The `judge` method runs through all the specified metrics (e.g., Faithfulness, Answer Relevancy, Bias, Hallucination, Knowledge Retention, Toxicity, BERTScore, BLEU, Gruen) to evaluate the language model's performance.
  
- **Logging the Process:** As the evaluation runs, the process logs the start and completion of each metric evaluation, providing feedback on the progress. Each log entry is tagged with an INFO level, indicating routine operational messages.
  
- **Handling Warnings:** You may notice a warning regarding model initialization and a future deprecation notice from the Hugging Face Transformers library. These warnings inform you about potential issues related to model compatibility and upcoming changes in the library.
  
- **Completion of Metrics:** After all metrics have been evaluated, the `judge` method completes, and the evaluation results are stored in the `eval_result` dictionary for further analysis or reporting.



In [7]:
eval_result = evaluator.judge()


INFO: Evaluating metric: Faithfulness
INFO: Token Counts - Input: 292 | Output: 153
INFO: Token Counts - Input: 257 | Output: 154
INFO: Token Counts - Input: 975 | Output: 120
INFO: Token Counts - Input: 235 | Output: 33
INFO: Token Usage Summary:
 Total Input: 1759 | Total Output: 460 | Total: 2219
INFO: Completed evaluation for metric: Faithfulness, score: 1.0

INFO: Evaluating metric: AnswerRelevancy
INFO: Token Counts - Input: 233 | Output: 155
INFO: Token Counts - Input: 529 | Output: 155
INFO: Token Counts - Input: 197 | Output: 33
INFO: Token Usage Summary:
 Total Input: 959 | Total Output: 343 | Total: 1302
INFO: Completed evaluation for metric: AnswerRelevancy, score: 1.0

INFO: Evaluating metric: Bias
INFO: Token Counts - Input:

100%|██████████| 1/1 [00:00<00:00,  1.02it/s]
Evaluating: 100%|██████████| 5/5 [00:00<00:00,  7.87it/s]


INFO: Completed evaluation for metric: Gruen, score: 0.73

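The "Token Usage Summary" entries in the log are the sums of the per-call "Token Counts" entries for a metric. As a rough sketch, the totals for the Faithfulness run can be reproduced by parsing those lines. The log format is taken from the output shown here and may differ in other indoxJudge versions; `sum_token_counts` is a hypothetical helper, not a library function.

```python
import re

# Log lines copied from the Faithfulness run in the output (color codes removed).
log_lines = [
    "INFO: Token Counts - Input: 292 | Output: 153",
    "INFO: Token Counts - Input: 257 | Output: 154",
    "INFO: Token Counts - Input: 975 | Output: 120",
    "INFO: Token Counts - Input: 235 | Output: 33",
]

TOKEN_PATTERN = re.compile(r"Token Counts - Input: (\d+) \| Output: (\d+)")


def sum_token_counts(lines):
    """Add up input and output tokens across per-call log lines."""
    total_in = total_out = 0
    for line in lines:
        match = TOKEN_PATTERN.search(line)
        if match:  # skip lines that are not per-call token counts
            total_in += int(match.group(1))
            total_out += int(match.group(2))
    return total_in, total_out, total_in + total_out


print(sum_token_counts(log_lines))  # (1759, 460, 2219)
```

The result matches the logged summary for Faithfulness: 1759 input tokens, 460 output tokens, 2219 in total.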

**Explanation:**

- **Retrieving Metric Scores:** The first line assigns the per-metric scores from the evaluation to the variable `evaluator_metrics_score`. These scores reflect the performance of the language model across individual metrics such as Faithfulness, Answer Relevancy, Bias, Hallucination, Knowledge Retention, Toxicity, Precision, Recall, F1 Score, BLEU, and Gruen.

- **Retrieving Overall Evaluation Score:** The overall evaluation score is part of the same dictionary, under the `evaluation_score` key (`0.93` in this run). It is a cumulative measure that reflects the model's performance across all evaluated metrics.

- **Example Output:** The dictionary shown below the cell is the output of `evaluator_metrics_score` for this run. Each key corresponds to a specific metric, and the associated value is the model's score for that metric. For instance, a `1.0` score in `AnswerRelevancy` and `KnowledgeRetention` indicates perfect performance in those areas, while other metrics like `BLEU` (`0.17`) and `Gruen` (`0.73`) show more moderate performance.


In [8]:
evaluator_metrics_score = evaluator.metrics_score
evaluator_metrics_score

{'Faithfulness': 1.0,
 'AnswerRelevancy': 1.0,
 'Bias': 0.0,
 'Hallucination': 0.0,
 'KnowledgeRetention': 1.0,
 'Toxicity': 0.0,
 'precision': 0.67,
 'recall': 0.74,
 'f1_score': 0.7,
 'BLEU': 0.17,
 'Gruen': 0.73,
 'evaluation_score': 0.93}
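
Because the scores are a plain dictionary, they are easy to post-process. The sketch below flags metrics that fall under a cutoff, using the values from the output above. Two things here are assumptions for illustration only: that a lower score is better for `Bias`, `Hallucination`, and `Toxicity`, and the `0.5` threshold. `flag_weak_metrics` is a hypothetical helper, not part of indoxJudge.

```python
# Scores copied from the output above.
metrics_score = {
    "Faithfulness": 1.0,
    "AnswerRelevancy": 1.0,
    "Bias": 0.0,
    "Hallucination": 0.0,
    "KnowledgeRetention": 1.0,
    "Toxicity": 0.0,
    "precision": 0.67,
    "recall": 0.74,
    "f1_score": 0.7,
    "BLEU": 0.17,
    "Gruen": 0.73,
    "evaluation_score": 0.93,
}

# Assumption: for these metrics a lower score is the better outcome.
LOWER_IS_BETTER = {"Bias", "Hallucination", "Toxicity"}


def flag_weak_metrics(scores, threshold=0.5):
    """Return the metrics whose direction-adjusted score is below threshold."""
    weak = {}
    for name, value in scores.items():
        if name == "evaluation_score":
            continue  # the aggregate is not an individual metric
        goodness = 1.0 - value if name in LOWER_IS_BETTER else value
        if goodness < threshold:
            weak[name] = value
    return weak


print(flag_weak_metrics(metrics_score))  # {'BLEU': 0.17}
```

In practice, `evaluator.metrics_score` would be passed in place of the copied dictionary.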

In [9]:
evaluator.results

{'Faithfulness': {'claims': 'Photosynthesis is a fundamental process that allows plants to convert light energy into chemical energy., The chemical energy produced in photosynthesis is stored as glucose., Photosynthesis occurs in two stages: light-dependent reactions and the Calvin cycle., Light-dependent reactions occur in the thylakoid membranes., The Calvin cycle occurs in the stroma of the chloroplasts., In the light-dependent reactions, sunlight splits water molecules to release oxygen., The light-dependent reactions generate energy carriers like ATP and NADPH., The Calvin cycle uses energy carriers to fix carbon dioxide into glucose., Photosynthesis is critical for sustaining life on Earth by providing food and oxygen.',
  'truths': 'Photosynthesis is a process that allows plants to convert light energy into chemical energy., The chemical energy produced in photosynthesis is stored as glucose., Photosynthesis occurs in two stages: light-dependent reactions and the Calvin cycle., 

**Explanation:**

- **Plotting Evaluation Metrics:** This line generates visual plots of the evaluation metrics using the `plot` method of the `evaluator` object. The plots provide a graphical representation of the model's performance across different metrics, making it easier to analyze and compare the results.

- **Dash UI Interface:** When `mode="external"` is used, this will open a Dash UI in a new browser window or tab to display the evaluation metrics plots interactively.

- **Colab Users:** If you are using Google Colab, it is recommended to set `mode="inline"` instead. This will render the plots directly within the notebook, making it more convenient for users working in an online environment like Colab.


In [9]:
evaluator.plot(mode="external")

Dash app running on http://127.0.0.1:8050/
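
The Colab recommendation above can be automated so the same notebook works in both settings. This is a sketch under one assumption: that the presence of the `google.colab` package in `sys.modules` is a reliable sign of running in Colab, which is a common detection idiom. `choose_plot_mode` is a hypothetical helper name.

```python
import sys


def choose_plot_mode() -> str:
    """Pick "inline" in Google Colab and "external" everywhere else."""
    # Colab preloads the google.colab package, so its presence in
    # sys.modules is used here as the detection signal.
    return "inline" if "google.colab" in sys.modules else "external"


mode = choose_plot_mode()
print(mode)
# evaluator.plot(mode=mode)  # `evaluator` is the LLMEvaluator created earlier
```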
