![Alt Text](./APP.png)

# Eval
There are broadly 2 types of evaluations:
1. Offline
2. Online


## Offline Evaluation
This is where you build and evaluate the models (embedding models, LLMs, etc.) and compute everything in a controlled environment. Evaluation is done against a fixed benchmark: ground-truth (GT) answers, reference contexts, etc. Metrics such as Hit Rate, NDCG, MRR, Precision@K and Recall@K are used here to check the effectiveness of the retrieval stack: Embedding Model + Vector DB + Rerankers + Chunking Strategy.
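
As a rough illustration of the retrieval metrics named above, here is a minimal sketch for a single query with binary relevance. The function names and the toy document ids are assumptions made for this example and are not taken from this repo. MRR is the mean of `reciprocal_rank` over all queries in the benchmark.

```python
import math
from typing import Sequence, Set


def hit_rate_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant document is in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant (assumes k >= 1)."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: the retriever returned four chunks, two of which are relevant.
ranked_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}
scores = {
    "hit_rate@3": hit_rate_at_k(ranked_ids, relevant_ids, 3),
    "precision@3": precision_at_k(ranked_ids, relevant_ids, 3),
    "recall@3": recall_at_k(ranked_ids, relevant_ids, 3),
    "reciprocal_rank": reciprocal_rank(ranked_ids, relevant_ids),
    "ndcg@3": ndcg_at_k(ranked_ids, relevant_ids, 3),
}
```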


On the other hand, Fluency, Complexity, Perplexity, BERTScore, BLEU, ROUGE, METEOR, LLMasJudge, Groundedness, Hallucination Rate, Toxicity, Context Adherence, Faithfulness, etc. are used to judge the quality of the LLM and its response. You can also fine-tune models on your own rubric and use them as LLMasJudge.

## Online Evaluation
This is where we do live evaluation. There is no ground truth; instead, there are `pipelines` in place covering all the components, to figure out the shortcomings of the system. There are multiple `Choke / Failure Points`. For example:

1. The LLM itself may be slow, low quality, toxic, biased, etc. It may also be prone to prompt injections and to divulging sensitive info.
2. Context retrieval models are slow or not good enough, so you need to try a different chunk size, a different chunking strategy, etc.
3. The VectorDB uses ANN (approximate nearest neighbour) search instead of exact cosine similarity, which has its own issues, so you end up using Re-Rankers. On the other hand, not all tasks need semantic search, so you may need syntactic (keyword-based) search instead.
4. API failure rate, latency, time to response, throughput, load resistance, etc. are checked here.



## Types of Evaluation metrics:
In this repo, you'll find 6 different classes, of which 4 are actively working and 2 are abstract classes. The 4 working classes cover more than 50 metrics. These are as follows:

### `IOGuards`
Guards to protect the model from taking in prompt injections and from divulging sensitive data, plus checks for bias, toxicity, polarity, harmful output, sentiment, etc. on the Query, Context and Response.
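
To make the sensitive-data part concrete, here is a minimal sketch of a regex-based check. It is not the repo's `detect_pattern` implementation: the function name `detect_pattern_sketch` and the two patterns are assumptions chosen for illustration. It returns a dict of pattern name to matches, and an empty dict when nothing is found.

```python
import re
from typing import Dict, List

# Hypothetical patterns for illustration only; a real guard would need a
# broader, carefully tested set (phone numbers, card numbers, API keys, ...).
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def detect_pattern_sketch(text: str) -> Dict[str, List[str]]:
    """Return {pattern_name: [matches]} for every pattern found in the text."""
    found: Dict[str, List[str]] = {}
    for name, pattern in SENSITIVE_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found
```

The same check can be run separately on the Query, the Context and the Response, so that you know at which stage the sensitive string entered the pipeline.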

### `TextStat`
These are mostly for the Response, answering questions like: how complex is the output, how understandable is it, how fluent is it, which class (grade level) of student can understand it, etc.
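
One common readability statistic of this kind is the Flesch Reading Ease score (higher means easier to read). The sketch below is only an illustration: the vowel-group syllable counter is a crude assumption, and the function names are not taken from this repo. Dedicated libraries use better syllable rules.

```python
import re


def count_syllables(word: str) -> int:
    """Very rough syllable estimate: number of vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(word) for word in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


easy_score = flesch_reading_ease("The cat sat on the mat.")
hard_score = flesch_reading_ease(
    "Comprehensive evaluation necessitates multidimensional instrumentation."
)
```

With this heuristic, the short sentence scores far higher than the long-worded one, which is the behaviour you want from a "how understandable is the response" metric.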

### `ComparisonMetrics`
Mostly used for Summarisation

These are Query-Context, Query-Response and Context-Response based metrics. They tell you the quality of the answer with respect to the query and the context. Metrics include Hallucination, Contradiction, BERTScore, ROUGE, METEOR, etc.

Within it, there are also string-matching based metrics using BM25, Levenshtein Distance, Fuzzy Score, etc. (LSH and other hashing methods still need to be added.)
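
For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a plain dynamic-programming version written for illustration; the names `levenshtein` and `levenshtein_similarity` are assumptions, and the repo may well rely on a library instead.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only one previous DP row."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalise the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Normalising by the longer string makes scores comparable between a short query and a long context or response.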

### `AppMetrics`
Checks the overall app usage: failure rate, latency, time to response, time to fetch the context, time for the LLM to answer, etc. Tools like New Relic and Prometheus can be directly integrated with Flask or FastAPI. Streamlit is not supported by them, but I have written my own decorators for timing.
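
A timing decorator of this kind can be sketched as below. This is not the repo's actual decorator: the names `timed`, `TIMINGS` and `fetch_context` are hypothetical, and the in-memory dict stands in for whatever store or dashboard you would really report to.

```python
import time
from functools import wraps
from typing import Callable, Dict, List

# Hypothetical in-memory store: stage name -> list of durations in seconds.
TIMINGS: Dict[str, List[float]] = {}


def timed(stage: str) -> Callable:
    """Record the wall-clock duration of each call under the given stage name."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs even if func raises, so failed calls are timed too.
                TIMINGS.setdefault(stage, []).append(time.perf_counter() - start)

        return wrapper

    return decorator


@timed("fetch_context")
def fetch_context(query: str) -> List[str]:
    """Stand-in for a retrieval call; sleeps briefly to simulate latency."""
    time.sleep(0.01)
    return ["placeholder context"]


contexts = fetch_context("what is RAG?")
```

Decorating the context-fetching and LLM-answering functions separately gives you per-stage latency, which is what you need to tell a slow retriever from a slow LLM.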

### `LLMasJudge`
These are nothing but a wrapper. You use a prompt and send the question, answer and context to the LLM to get a response. For offline evaluation, you can ask for the reasoning steps behind the answer and compare them against GT or human judgments. This can be used for any task. Many models such as Prometheus-2, PHUDGE and JudgeLM are fine-tuned for specific judging tasks.
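
The wrapper idea can be sketched as follows. Everything here is an assumption for illustration: the prompt wording, the 1 to 5 scale, and the names `build_judge_prompt`, `judge` and `fake_llm` are not this repo's API. The `llm` argument is any callable that takes a prompt string and returns the model's text, so you can plug in whichever client you use.

```python
from typing import Callable

# Hypothetical prompt template; adapt the rubric and scale to your task.
JUDGE_TEMPLATE = (
    "You are an impartial evaluator.\n"
    "Rubric: {rubric}\n\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer: {answer}\n\n"
    "Explain your reasoning step by step, then give a score from 1 to 5."
)


def build_judge_prompt(question: str, context: str, answer: str, rubric: str) -> str:
    """Fill the template with the pieces the judge needs to see."""
    return JUDGE_TEMPLATE.format(
        rubric=rubric, question=question, context=context, answer=answer
    )


def judge(
    llm: Callable[[str], str], question: str, context: str, answer: str, rubric: str
) -> str:
    """Send the filled prompt to the judge model and return its raw verdict."""
    return llm(build_judge_prompt(question, context, answer, rubric))


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call, so the sketch runs offline."""
    return "Reasoning: (stub) Score: 3"


verdict = judge(fake_llm, "Q?", "Some context", "Some answer", "Is it grounded?")
```

In offline evaluation, the returned reasoning text is the part you would compare against GT or human annotations.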

### `TraditionalPipelines`
You use traditional NLP pipelines to extract and store the topics from the Query, Context and Response, then evaluate and compare whether they all talk about the same thing or not. You can then use POS tagging and other classification tasks for your use case.

# Requirements
Tested with: `Python 3.9`

Steps:
1. `pip install -r requirements.txt`
2. `pip install -U evaluate` (without it, some old metrics won't work)
3. `streamlit run eval_rag_app.py`

Running the code below will trigger `st.spinner` while the models are loading.

In [5]:
from eval_metrics import *

guard =  IOGuards()
stat =  TextStat()
comp = ComparisonMetrics()


def evaluate_all(query, context_lis, response):
    context = "\n".join(context_lis)

    RESULT = {}

    RESULT["guards"] = {
        "query_injection": guard.prompt_injection_classif(query),
        "context_injection": guard.prompt_injection_classif(context),
        "query_bias": guard.bias(query),
        "context_bias": guard.bias(context),
        "response_bias": guard.bias(response),
        "query_regex": guard.detect_pattern(query),
        "context_regex": guard.detect_pattern(context),
        "response_regex": guard.detect_pattern(response),
        "query_toxicity": guard.toxicity(query),
        "context_toxicity": guard.toxicity(context),
        "response_toxicity":  guard.toxicity(response),
        "query_sentiment": guard.sentiment(query),
        "query_polarity": guard.polarity(query),
        "context_polarity": guard.polarity(context),
        "response_polarity": guard.polarity(response),
        "query_response_hallucination": comp.hallucinations(query, response),
        "context_response_hallucination": comp.hallucinations(context, response),
        # Distinct keys, so contradiction scores do not overwrite hallucination scores
        "query_response_contradiction": comp.contradiction(query, response),
        "context_response_contradiction": comp.contradiction(context, response),
    }

    RESULT["guards"].update(guard.harmful_refusal_guards(query, context, response))

    tmp = {}
    for key, val in comp.ref_focussed_metrics(query, response).items():
        tmp[f"query_response_{key}"] = val

    for key, val in comp.ref_focussed_metrics(context, response).items():
        tmp[f"context_response_{key}"] = val
    
    RESULT["reference_based_metrics"] = tmp
    
    
    tmp = {}
    for key, val in comp.string_similarity(query, response).items():
        tmp[f"query_response_{key}"] = val

    for key, val in comp.string_similarity(context, response).items():
        tmp[f"context_response_{key}"] = val
    
    RESULT["string_similarities"] = tmp

    tmp = {}
    for key, val in stat.calculate_text_stat(response).items():
        tmp[f"result_{key}"] = val
    RESULT["response_text_stats"] = tmp
    
    return RESULT





In [6]:
evaluate_all("Everyone is a terrorist", 
             ["Eminem is the white legend", "Trump's a bitch"], 
            "There is no answer to that. These questions and context are bad")



{'guards': {'query_injection': [{'label': 'SAFE',
    'score': 0.9999986886978149}],
  'context_injection': [{'label': 'SAFE', 'score': 0.9999991655349731}],
  'query_bias': [{'label': 'Biased', 'score': 0.6330747604370117}],
  'context_bias': [{'label': 'Non-biased', 'score': 0.5858706831932068}],
  'response_bias': [{'label': 'Biased', 'score': 0.5588837265968323}],
  'query_regex': {},
  'context_regex': {},
  'response_regex': {},
  'query_toxicity': [{'label': 'toxic', 'score': 0.9225953817367554}],
  'context_toxicity': [{'label': 'toxic', 'score': 0.9640267491340637}],
  'response_toxicity': [{'label': 'non-toxic', 'score': 0.9988303780555725}],
  'query_sentiment': {'neg': 0.701,
   'neu': 0.299,
   'pos': 0.0,
   'compound': -0.6908},
  'query_polarity': [{'negative': 0.98,
    'other': 0.01,
    'neutral': 0.01,
    'positive': 0.0}],
  'context_polarity': [{'negative': 0.96,
    'other': 0.03,
    'neutral': 0.01,
    'positive': 0.0}],
  'response_polarity': [{'negative': 0