# LLM Evaluator

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDoxJudge/blob/main/examples/llm_evaluation.ipynb)

In [None]:
!pip install indoxJudge -U
!pip install transformers
!pip install torch
!pip install openai

## Setting Up the Python Environment

If you are running this project in your local IDE, please create a Python environment to ensure all dependencies are correctly managed. You can follow the steps below to set up a virtual environment named `indoxJudge`:

### Windows

1. **Create the virtual environment:**
```bash
python -m venv indoxJudge
```
2. **Activate the virtual environment:**
```bash
indoxJudge\Scripts\activate
```

### macOS/Linux

1. **Create the virtual environment:**
```bash
python3 -m venv indoxJudge
```
2. **Activate the virtual environment:**
```bash
source indoxJudge/bin/activate
```

### Install Dependencies

Once the virtual environment is activated, install the required dependencies by running:

```bash
pip install -r requirements.txt
```


In [4]:
import os
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
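
If the key is missing from `.env`, `os.getenv` returns `None` and the problem only surfaces later, at the first API call. A minimal sketch of a fail-fast check is shown below; `require_env` is a hypothetical helper written for this example, not part of indoxJudge or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable and an empty string.
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value


# Usage in the notebook (commented out so the sketch runs without a key):
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```
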


In [3]:
query = "How does photosynthesis work in plants?"

retrieval_context = [
    "Photosynthesis is a biochemical process by which green plants, algae, and some bacteria convert light energy into chemical energy. "
    "It primarily occurs in the chloroplasts of plant cells, which contain the green pigment chlorophyll that captures sunlight.",
    
    "The process begins with light-dependent reactions in the thylakoid membranes of the chloroplasts. These reactions use solar energy to split water molecules "
    "(photolysis), producing oxygen gas (O2) as a byproduct and generating energy-rich molecules, ATP and NADPH.",
    
    "The second stage of photosynthesis, known as the Calvin cycle or light-independent reactions, takes place in the stroma of the chloroplast. "
    "During this stage, the plant uses the ATP and NADPH from the light-dependent reactions to fix carbon dioxide (CO2) into glucose molecules through a series of enzymatic steps.",
    
    "Photosynthesis is influenced by several factors, including light intensity, carbon dioxide concentration, temperature, and availability of water. "
    "Plants optimize these factors through adaptations such as leaf structure, stomatal control, and pigment types for efficient energy capture.",
    
    "This process is crucial for life on Earth because it not only provides the primary energy source for most ecosystems but also contributes to the oxygenation of the atmosphere, "
    "maintaining a balance in the global carbon cycle and supporting the existence of aerobic life forms."
]

response = (
    "Photosynthesis is a fundamental process that allows plants to convert light energy into chemical energy, which is stored as glucose. "
    "The process occurs in two stages: light-dependent reactions in the thylakoid membranes and the Calvin cycle in the stroma of the chloroplasts. "
    "In the light-dependent reactions, sunlight splits water molecules to release oxygen and generate energy carriers like ATP and NADPH. "
    "The Calvin cycle then uses these energy carriers to fix carbon dioxide into glucose. This process is critical for sustaining life on Earth by providing food and oxygen."
)

In [5]:
import sys
import os

module_path = os.path.abspath('E:/Codes/IndoxJudge/indoxJudge')
if module_path not in sys.path:
    sys.path.append(module_path)

In [6]:
query = """
As a professional summarizer, create a concise and comprehensive summary of the provided text, be it a provided text, while adhering to these guidelines:
1 Craft a summary that is detailed, thorough, in-depth, and complex, while maintaining clarity and conciseness.
2 Incorporate main ideas and essential information, eliminating extraneous language and focusing on critical aspects.
3 Rely strictly on the provided text, without including external information.
4 Format the summary in paragraph form for easy understanding.
5 Conclude your notes with [End of Notes, Message #X] to indicate completion, where "X" represents the total number of messages that I have sent. In other words, include a message counter where you start with #1 and add 1 to the message counter every time I send a message.
6 Summarize this text into a summary suitable for non-experts.
"""
information_protection_text = """
Information Protection
Northwestern Mutual takes the privacy and security of your personal information seriously. It is our mission to protect our client's information and maintain their trust through delivery of innovative cybersecurity and risk management services.

Northwestern Mutual's cybersecurity and risk program includes safeguards.
Program safeguards: The security and privacy of clients' confidential information are important to Northwestern Mutual. The company takes its responsibility to protect this information seriously and uses technical, administrative, and physical controls to safeguard its systems and data. The following are just some of the ways the company keeps client information safe.

Technical: Northwestern Mutual uses layers of technical controls to protect its clients' information:

Endpoint protection: The company employs a comprehensive endpoint protection program, beyond traditional antivirus. This includes threat protection and response mechanisms to safeguard against evolving threats.
Email security: The company utilizes a suite of solutions capable of detecting and mitigating not only spam, phishing, and fraud attempts but also advanced malicious email-based threats.
Encryption by default: The company adopts strong encryption algorithms for both data-in-transit and data-at-rest.
Next-generation firewalls: The company leverages robust next-generation firewalls (NGFW) technologies providing deep packet inspection, intrusion prevention, and threat protection.
24/7 Continuous monitoring and incident response: The company operates a Cyber Defense & Threat Operations organization dedicated to monitoring and responding to threats facing the organization.
Regular penetration testing: The company utilizes internal and external resources to continually assess, and improve, our preventative controls and our detective capabilities.
Product security: The company prioritizes security of its products and services through a series of solutions aimed at identifying and correcting vulnerabilities throughout a products lifecycle; including code and operating system (OS) scanning, cloud security enforcements, and runtime protections.
Administrative: Northwestern Mutual supplements its technical controls with processes, procedures, and policies to further protect its clients' information:

Authentication: The company requires multiple authentication factors to verify the identity of persons requesting policy, contract, or account information. A customer's right to policy, contract or account information follows state and federal regulations based on their role with the policy, contract or account.
Authorization: Access to company systems is granted on a business-need-to-know basis. Only those people who need access to a given system and its information to accomplish their job responsibilities receive that access.
Change control: The company uses a change control process to help ensure all changes to company systems maintain the confidentiality, integrity, and availability of those systems.
Corporate governance: The company has a strong governance process with multiple committees supporting information protection initiatives.
Cybersecurity exercises and business continuity planning: The company maintains a comprehensive business continuity plan and conducts periodic cybersecurity tabletop and functional exercises and threat simulations to identify areas of program strength and opportunities for improvement.
Privacy notice: Northwestern Mutual uses Privacy Notices to communicate how we collect, use, share, retain, and protect their Personal information. For more information, visit: northwesternmutual.com/privacy-notices/.
Internal and external IT auditors: The company's internal and external auditors regularly review and assess the company's information technology systems and operations.
Records and information management and sanitization: The company maintains a records and information management program that manages the lifecycle of the company's information, including adherence to regulatory requirements and secure disposal of confidential information.
Risk assessments: The Company has multiple departments that specialize in identifying and mitigating risks through risk assessment processes. These departments include Information Risk Management, Privacy, Enterprise Risk Assurance, Financial Management, and Enterprise Risk Management.
Security assessments: Security assessments are a key component of our information protection program. We execute a rigorous process to evaluate technology solutions against common security standards and controls.
Security awareness: The company recognizes that end users are a critical component of an effective information security and risk management program. The company provides employees and financial representatives with security education and training, such as ongoing security education articles and events, training in company policies and standards, and simulated phishing exercises. Information to help clients protect themselves is also available on this page.
Separation of duties: The company separates specific job duties to prevent a conflict of interest when appropriate.
Storage locations: The company maintains certain personal information on its own internal systems. It also uses service providers for the purpose of maintaining data. Service providers are only entrusted with data after successfully undergoing a due diligence review and having signed a contract requiring information to be protected and use of data restricted to Northwestern Mutual's business purposes.
Threat monitoring & hunting: The company works with internal teams and third-party industry security organizations to proactively monitor its environment for existing and potential threats.
Training and user behavior: All employees and financial representatives receive training on protecting confidential information, which is tailored to Company business needs, identified risks, and emerging threats.
User access reviews: The company regularly reviews user access to company systems to help ensure users maintain an appropriate level of access to those systems.
Physical: Northwestern Mutual also protects its clients' information from physical harm and theft:

Building and data center physical security: The company controls physical access to its buildings, data centers, and other facilities. Restricted access helps to ensure the confidentiality, integrity, and availability of company systems and physical assets within the company.
Business continuity and disaster recovery planning: The company maintains and periodically tests defined business continuity and disaster recovery plans. These plans are designed to maximize the availability of company systems and information and recover from natural or human-made disasters as efficiently and effectively as possible.
Redundancy: As part of its business continuity and disaster recovery plans, the company uses redundant infrastructure to ensure availability of company systems and client information.
Our commitment to privacy: The Northwestern Mutual Family of Companies is committed to protecting the confidentiality of your personal information. We respect your privacy rights and choices, which may vary based on your interactions with us or your relationship to us. Please review our policies here.

Responsible disclosure policy: We want to hear from security researchers ("You" or "Your") who have information related to suspected security vulnerabilities ("Vulnerability" or "Vulnerabilities") of any Northwestern Mutual services exposed to the internet. We value Your work and are committed to working with You. Please report Vulnerabilities to us in accordance with this Responsible Disclosure Policy.

General security hygiene & suggestions for staying safe
Protecting your NM account

Registering for online account access

When you register for Northwestern Mutual online account access, we will require several pieces of personal information from you. This helps ensure that only you may register to access your own accounts. Avoid accessing your online accounts through publicly shared computers and Wi-Fi networks.
Multi-Factor Authentication: The company requires users to add an extra layer of protection to their northwesternmutual.com account login process by requiring entry of a security code in addition to their username and password. The security code is a unique, single-use number they receive via phone call, text, or your preferred authenticator app on your mobile device.
Passwords

A strong password is important to protect your online accounts. When you are selecting a password, keep the following tips in mind:
Strong (good) passwords have the following characteristics:
Contain both upper and lower-case characters (e.g., a–z, A–Z)
Have digits and punctuation characters as well as letters (e.g., 0–9, !@#$%^&*)
Fourteen (14) or more alphanumeric characters
Not based on personal information, names of family, or important calendar dates
Change passwords frequently.
Consider using a password manager.
Keep your new password private and change it immediately if someone else learns it or you believe it has been compromised in any way.
Constructing a passphrase based on a sentence or phrase makes a password easy to remember. Passphrase is a type of string password that uses a short sentence or group of random words.
Weak (bad) passwords have the following characteristics:
Default password
Contain less than eight (8) characters
Common, easily guessed phrases (e.g. "Winter is coming")
A common usage word such as names of family, pets, friends, co-workers, fantasy characters, etc.
The words "Northwestern Mutual" or any derivation
Do not use Social Security numbers, words or numbers associated with easily attainable personal information, like birthdays, anniversaries, license plates, telephone numbers, or addresses.
Do not use the same pattern for your passwords, such as smart1, smart2, etc.
Any of the above spelled backward
Any of the above preceded or followed by a digit (e.g., secret1 or 1secret).
Do not use real names or your login name or any variation of it.
Do not write down your password or share your password with anyone else.
Do not reuse passwords. Make sure you use different and unique passwords for all of your online accounts. Reusing a single password for multiple websites is never a good idea. If a cybercriminal obtains your password, they may try to use it on other websites.
Protecting your account: Be prepared to provide your account/policy number when reaching out to us to set up the following safeguards:
Life, Disability Income and Annuity: You can request a password to be used to verify your identity when you call the home office.
For Life and Disability Income Policies, call 1-800-388-8123.
For Annuities, call 1-888-455-2232.
Long-Term Care: You can request additional authenticators be added to verify your identity when calling about your policy by calling 1-800-748-9493.
Investment Accounts: You can request a note or block be added to your account.
For Northwestern Mutual Investment Services and Northwestern Mutual Wealth Management Company, including Trust and Private Client Services, call 1-866-950-4644 and ask for "investments".
Staying safe online
Email Hacking: Email hacking occurs when a cybercriminal illegally gains access to an individual's email account. This allows the hacker to read email messages and view the address book on the email account. Using this information, the cybercriminal (appearing to be the individual), contacts the individual's financial institutions via an email message and tries to obtain funds. Learn about how to protect yourself at Email Hacking Fraud. When email hacking occurs, emails can/will be intercepted and deleted by the hacker, and the rightful sender and receiver will no longer see legitimate email traffic. We recommend that you take the following actions to protect yourself:
Never give your email address to anyone or any site that you do not trust.
Never send personal or sensitive information in an unsecured email.
Phishing: One of the most common ways cybercriminals trick their victims is through phishing. This occurs when a cybercriminal tries to either get the victim to reveal confidential information or installs malicious software (malware) on the victim's computer. A phishing attack can take many forms, although the most common is an email message.
Identifying phishing messages: common phishing message characteristics and red flags include:
Unsolicited email or attachments: Always regard unsolicited email or attachments as suspicious, even if the message appears to come from a known sender. The "From" address in emails can be easily faked. In addition, the sender's email account may have been hacked.
Spoofing: Email spoofing refers to email messages that are sent with a forged sender address. The email is sent to look like it is from a known company or people such as your friend, relative, or co-worker, but the email is fake.
If you receive an email asking for confidential information and demanding unusual or urgent action, be suspicious. Take action against this and call the sender directly, using contact information on record, to ask if they sent the email.
Generic greetings: Phishing messages often contain generic, non-personalized greetings both in the subject line and the message. Some legitimate messages may contain generic greetings, but a generic greeting should raise suspicions. It's important to note, though, that a personalized greeting in and of itself does not guarantee that a message is legitimate.
Urgent or threatening language: Many phishing messages contain urgent or threatening language. Phishers often try to manipulate people's emotions. Don't fall for that trick. Take the time to examine unsolicited messages carefully.
Awkward grammar or spelling errors: Awkward grammar or spelling errors may be signs of phishing messages. However, don't assume that polished, professional-looking messages are necessarily legitimate.
Tricky links: Always be suspicious of links in unsolicited messages. The web address displayed in the message can be easily faked. Hover the mouse pointer over the link on a computer or long-press the link on a mobile device to preview the actual link address.
Never send personal or sensitive information in an unsecured (unencrypted) email.
Never make a purchase from an unsolicited email. Not only can an email user fall prey to a potentially fraudulent sales scheme, but their email address may be added to email lists sold within the spamming community, further compounding the number of unsolicited emails the user receives.
Phishing resources: To learn more about phishing and handling phishing messages, see the Federal Trade Commission (FTC) website.
"""
llm_response = """
Northwestern Mutual is committed to safeguarding clients' personal information through a comprehensive cybersecurity and risk management program. The company employs technical measures such as endpoint protection, email security, encryption, next-generation firewalls, and continuous monitoring to defend against threats. Administrative safeguards include multi-factor authentication, access controls, corporate governance, and rigorous training for employees and representatives. Physical security measures protect data centers and offices, while redundancy and disaster recovery plans ensure resilience. Clients are educated on strong password practices, phishing avoidance, social media safety, and secure device usage. The company also promotes proper records management and secure disposal to prevent identity theft. Additional safeguards like fraud alerts and credit freezes further protect client accounts. Northwestern Mutual emphasizes proactive client engagement, security awareness, and privacy rights, demonstrating its commitment to robust information protection.
"""



## Importing Required Modules
This cell imports the necessary classes from the indoxJudge library. `LLMEvaluator` is the class used for evaluating language models based on various metrics, and `OpenAi` is the class used to interact with OpenAI models such as gpt-4o-mini.

In [7]:
from indoxJudge.pipelines import LLMEvaluator
from indoxJudge.models import OpenAi

## Initializing the OpenAI Model
Here, the `OpenAi` class is instantiated to create a model object that interacts with OpenAI's gpt-4o-mini model. The `api_key` argument authenticates the API requests. `OPENAI_API_KEY` is the value loaded from the environment earlier, so make sure it holds your actual API key.

In [8]:

# Initialize the language model.
# Any OpenAI model can be used (for example, GPT-4o); see the OpenAI Models
# documentation: https://platform.openai.com/docs/models

model = OpenAi(api_key=OPENAI_API_KEY, model="gpt-4o-mini", max_tokens=1024)

[32mINFO[0m: [1mInitializing OpenAi with model: gpt-4o-mini and max_tokens: 1024[0m


The `LLMEvaluator` class is instantiated to create an evaluator object. This object is configured to evaluate a specific response against a retrieval context and a query. The `llm_as_judge` parameter is set to the model created in the previous cell, while `llm_response`, `retrieval_context` (here, `information_protection_text`), and `query` supply the material to be evaluated.

In [9]:
evaluator = LLMEvaluator(
    llm_as_judge=model,
    llm_response=llm_response,
    retrieval_context=information_protection_text,
    query=query,
)

INFO: Evaluator initialized with model and metrics.
INFO: Model set for all metrics.


**Explanation:**

- **Running the Evaluation:** This line calls the `judge` method on the `evaluator` object. The `judge` method runs through all the specified metrics (e.g., Faithfulness, Answer Relevancy, Bias, Hallucination, Knowledge Retention, Toxicity, BertScore, BLEU, Gruen) to evaluate the language model's performance.
  
- **Logging the Process:** As the evaluation runs, the process logs the start and completion of each metric evaluation, providing feedback on the progress. Each log entry is tagged with an INFO level, indicating routine operational messages.
  
- **Handling Warnings:** You may notice a warning regarding model initialization and a future deprecation notice from the Hugging Face Transformers library. These warnings inform you about potential issues related to model compatibility and upcoming changes in the library.
  
- **Completion of Metrics:** After all metrics have been evaluated, the `judge` method completes, and the evaluation results are stored in the `eval_result` dictionary for further analysis or reporting.



In [10]:
eval_result = evaluator.judge()


[32mINFO[0m: [1mEvaluating metric: Faithfulness[0m
[32mINFO[0m: [1mToken Counts - Input: 342 | Output: 196[0m
[32mINFO[0m: [1mToken Counts - Input: 307 | Output: 181[0m
[32mINFO[0m: [1mToken Counts - Input: 3398 | Output: 147[0m
[32mINFO[0m: [1mToken Counts - Input: 264 | Output: 48[0m
[32mINFO[0m: [1mToken Usage Summary:
 Total Input: 4311 | Total Output: 572 | Total: 4883[0m
[32mINFO[0m: [1mCompleted evaluation for metric: Faithfulness, score: 0.89[0m

[32mINFO[0m: [1mEvaluating metric: AnswerRelevancy[0m
[32mINFO[0m: [1mToken Counts - Input: 283 | Output: 179[0m
[32mINFO[0m: [1mToken Counts - Input: 723 | Output: 380[0m
[32mINFO[0m: [1mToken Counts - Input: 573 | Output: 65[0m
[32mINFO[0m: [1mToken Usage Summary:
 Total Input: 1579 | Total Output: 624 | Total: 2203[0m
[32mINFO[0m: [1mCompleted evaluation for metric: AnswerRelevancy, score: 0.0[0m

[32mINFO[0m: [1mEvaluating metric: Bias[0m
[32mINFO[0m: [1mToken Counts - Inp

100%|██████████| 1/1 [00:01<00:00,  1.32s/it]
Evaluating: 100%|██████████| 8/8 [00:01<00:00,  6.61it/s]


[32mINFO[0m: [1mCompleted evaluation for metric: Gruen, score: 0.71[0m


**Explanation:**

- **Retrieving Metric Scores:** The first line assigns the detailed metric scores from the evaluation to the variable `evaluator_metrics_score`. These scores reflect the performance of the language model across individual metrics such as Faithfulness, Answer Relevancy, Bias, Hallucination, Knowledge Retention, Toxicity, Precision, Recall, F1 Score, BLEU, and Gruen.

- **Retrieving Overall Evaluation Score:** The overall evaluation score is reported in the same dictionary under the `evaluation_score` key. This score is typically a cumulative measure that reflects the model's performance across all evaluated metrics.

- **Example Output:** The dictionary shown is the output of `evaluator_metrics_score` for this run. Each key corresponds to a specific metric, and the associated value indicates the model's score for that metric. For instance, a `1.0` score in `KnowledgeRetention` indicates perfect performance in that area, `AnswerRelevancy` scores `0.0` in this run, and `Gruen` (`0.71`) shows a more moderate performance while `BLEU` (`0.03`) is low.


In [11]:
evaluator_metrics_score = evaluator.metrics_score
evaluator_metrics_score

{'Faithfulness': 0.89,
 'AnswerRelevancy': 0.0,
 'Bias': 0.0,
 'Hallucination': 0.0,
 'KnowledgeRetention': 1.0,
 'Toxicity': 0.0,
 'precision': 0.75,
 'recall': 0.68,
 'f1_score': 0.71,
 'BLEU': 0.03,
 'Gruen': 0.71,
 'evaluation_score': 0.75}
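
The scores dictionary can be inspected with ordinary Python. The sketch below uses the values printed above as a literal, so it runs without an API call. Two things in it are assumptions rather than indoxJudge behavior: treating `Bias`, `Hallucination`, and `Toxicity` as lower-is-better (inferred from the metric names), and the `0.5` cut-off, which is an arbitrary threshold chosen for illustration.

```python
# Scores copied from the output above.
metrics = {
    "Faithfulness": 0.89,
    "AnswerRelevancy": 0.0,
    "Bias": 0.0,
    "Hallucination": 0.0,
    "KnowledgeRetention": 1.0,
    "Toxicity": 0.0,
    "precision": 0.75,
    "recall": 0.68,
    "f1_score": 0.71,
    "BLEU": 0.03,
    "Gruen": 0.71,
    "evaluation_score": 0.75,
}

# Assumption: for these metrics a lower score is the better outcome.
LOWER_IS_BETTER = {"Bias", "Hallucination", "Toxicity"}
THRESHOLD = 0.5  # arbitrary cut-off for this illustration

overall = metrics["evaluation_score"]
per_metric = {k: v for k, v in metrics.items() if k != "evaluation_score"}

# Flag metrics that fall on the "wrong" side of the threshold.
flagged = sorted(
    name
    for name, score in per_metric.items()
    if (name in LOWER_IS_BETTER and score > THRESHOLD)
    or (name not in LOWER_IS_BETTER and score < THRESHOLD)
)
print(overall)  # 0.75
print(flagged)  # ['AnswerRelevancy', 'BLEU']

# The reported f1_score is consistent with the harmonic mean of the
# reported precision and recall.
p, r = metrics["precision"], metrics["recall"]
f1_check = round(2 * p * r / (p + r), 2)
print(f1_check)  # 0.71
```
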

In [12]:
evaluator.results

{'Faithfulness': {'claims': "Northwestern Mutual is committed to safeguarding clients' personal information through a comprehensive cybersecurity and risk management program., Northwestern Mutual employs technical measures such as endpoint protection, email security, encryption, next-generation firewalls, and continuous monitoring to defend against threats., Northwestern Mutual's administrative safeguards include multi-factor authentication, access controls, corporate governance, and rigorous training for employees and representatives., Physical security measures protect data centers and offices at Northwestern Mutual., Redundancy and disaster recovery plans ensure resilience at Northwestern Mutual., Clients of Northwestern Mutual are educated on strong password practices, phishing avoidance, social media safety, and secure device usage., Northwestern Mutual promotes proper records management and secure disposal to prevent identity theft., Additional safeguards like fraud alerts and cr

**Explanation:**

- **Plotting Evaluation Metrics:** This line generates visual plots of the evaluation metrics using the `plot` method of the `evaluator` object. The plots provide a graphical representation of the model's performance across different metrics, making it easier to analyze and compare the results.

- **Dash UI Interface:** When `mode="external"` is used, this will open a Dash UI in a new browser window or tab to display the evaluation metrics plots interactively.

- **Colab Users:** If you are using Google Colab, it is recommended to set `mode="inline"` instead. This will render the plots directly within the notebook, making it more convenient for users working in an online environment like Colab.


In [9]:
evaluator.plot(mode="external")

Dash app running on http://127.0.0.1:8050/
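
To avoid editing the `mode` argument by hand when moving between a local IDE and Colab, the choice can be made at runtime. The sketch below relies on a common heuristic, namely that the `google.colab` package is already loaded inside a Colab kernel; treat that as an assumption. `pick_plot_mode` is a hypothetical helper, not part of indoxJudge.

```python
import sys


def pick_plot_mode(loaded_modules=None) -> str:
    """Return "inline" when running in Google Colab, otherwise "external"."""
    # Allow a custom mapping to be passed in so the function is easy to test.
    modules = sys.modules if loaded_modules is None else loaded_modules
    return "inline" if "google.colab" in modules else "external"


# Usage in the notebook (commented out because it needs the evaluator object):
# evaluator.plot(mode=pick_plot_mode())
```
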
