# Comprehensive Guide to LLM/GenAI Evaluation Metrics



## Table of Contents

- Text Generation Metrics
- Semantic Similarity Metrics
- Language Quality Metrics
- Task-Specific Metrics
- Human Evaluation Metrics


## 1. Text Generation Metrics

### BLEU (Bilingual Evaluation Understudy) Score

**Definition**  
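BLEU measures how closely AI-generated text matches one or more human-written reference texts by counting overlapping n-grams (contiguous sequences of n words). It is precision-oriented: it asks what fraction of the generated n-grams also appear in the reference.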

**Formula**  
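$$\text{BLEU} = \text{BP} \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

Where:
- **BP** = Brevity Penalty: 1 if the output is longer than the reference, otherwise \( e^{1 - r/c} \) for reference length r and output length c
- **w_n** = Weight for n-grams of order n (typically 1/N, with N = 4)
- **p_n** = Modified (clipped) precision for n-grams of order n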

**Example**  
Reference: "The cat is on the mat"  
AI Output: "The cat sits on the mat"  
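BLEU-1 Score: 0.83 (5 of the 6 output unigrams appear in the reference)  
Standard BLEU-4 without smoothing is 0 for this pair, because the two sentences share no 4-gram.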
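
The formula can be sketched in a few lines of Python. This is a minimal, unsmoothed, single-reference version that assumes lowercase whitespace tokenization and uniform weights; the function names (`ngrams`, `modified_precision`, `bleu`) are illustrative and not taken from any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision p_n against a single reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights w_n = 1 / max_n."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined, so any zero precision gives BLEU = 0
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "the cat is on the mat".split()
candidate = "the cat sits on the mat".split()

bleu_1 = bleu(candidate, reference, max_n=1)
bleu_2 = bleu(candidate, reference, max_n=2)
bleu_4 = bleu(candidate, reference, max_n=4)
print(round(bleu_1, 2), round(bleu_2, 2), round(bleu_4, 2))  # 0.83 0.71 0.0
```

On this pair the sketch gives about 0.83 for unigrams only and about 0.71 for unigrams plus bigrams, while the four-gram score collapses to 0 because no 4-gram matches. Library implementations usually offer smoothing for short sentences like these.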

**Use Cases**
- Machine translation evaluation
- Text summarization quality
- Content generation assessment

### Perplexity

**Definition**  
Perplexity evaluates how well a language model predicts a sample of text. It measures the model's uncertainty when predicting each next word in a sequence, and can be read as the effective number of equally likely words the model is choosing among at each step. Lower perplexity indicates better performance, meaning the model assigns higher probability to the text that actually occurs.

**Formula**  
The formula for perplexity is:  
$$\text{Perplexity} = 2^{\left(-\frac{1}{N} \sum \log_2 P(x_i)\right)}$$

Where:
- **N** = Number of words (tokens) in the sequence.
- **P(x_i)** = Probability the model assigns to the i-th word in the sequence, given the words that precede it.

**Example**  
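Perplexity is computed from the probability the model assigned to each word that actually occurs, one value per position, rather than from the alternative candidates at a single position.  
As an illustration, suppose three words occur in a text and the model assigned them these probabilities: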
- “eat” with a probability of 60% (0.60)
- “sleep” with a probability of 30% (0.30)
- “dance” with a probability of 10% (0.10)

To calculate perplexity, we follow these steps:

1. **Calculate the log probabilities**:
   - For “eat”: \( \log_2(0.60) \)
   - For “sleep”: \( \log_2(0.30) \)
   - For “dance”: \( \log_2(0.10) \)

2. **Sum the log probabilities**:
   $$ \sum \log_2 P(x_i) = \log_2(0.60) + \log_2(0.30) + \log_2(0.10) $$

3. **Average the log probabilities**:
   $$ \frac{1}{N} \sum \log_2 P(x_i) $$

4. **Calculate the perplexity**:
   $$ \text{Perplexity} = 2^{\left(-\frac{1}{N} \sum \log_2 P(x_i)\right)} $$

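Applying these steps to the three probabilities gives a sum of about -5.80, an average of about -1.93, and a perplexity of \( 2^{1.93} \approx \) **3.82**. The model is therefore about as uncertain as if it were choosing uniformly among roughly four words at each step; lower perplexity values signify better model performance.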
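
The four steps above can be reproduced with a short Python sketch. It assumes the per-word probabilities are already available as a plain list; the helper name `perplexity` is illustrative.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed word."""
    if not token_probs:
        raise ValueError("need at least one probability")
    # Steps 1-3: take log2 of each probability, sum, and average.
    avg_log2 = sum(math.log2(p) for p in token_probs) / len(token_probs)
    # Step 4: raise 2 to the negated average.
    return 2 ** (-avg_log2)


probs = [0.60, 0.30, 0.10]  # the three probabilities from the example
ppl = perplexity(probs)
print(f"Perplexity: {ppl:.2f}")
```

A model that assigned probability 1 to every observed word would reach the minimum perplexity of 1.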



**Use Cases**
- Language model evaluation
- Text prediction quality
- Model comparison

## 2. Semantic Similarity Metrics

### Cosine Similarity

**Definition**  
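Measures the cosine of the angle between two text vectors (such as embeddings) in a multi-dimensional space. Values range from -1 to 1, and values near 1 mean the vectors point in nearly the same direction.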

**Formula**  
$$\text{Cosine Similarity} = \frac{A \cdot B}{||A|| \, ||B||}$$

Where:
- **A · B** = Dot product of vectors A and B
- **||A||** = Euclidean length (norm) of vector A
- **||B||** = Euclidean length (norm) of vector B

**Example**  
Text 1: "I love dogs"  
Text 2: "I like canines"  
Vector representation:
- Text 1: [0.8, 0.6, 0.9]
- Text 2: [0.7, 0.5, 0.8]  
Cosine Similarity: ≈ 0.9997 (very similar)
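
The calculation can be checked with a small pure-Python sketch. It takes the two example vectors as given; how such vectors are produced from text (for instance, by an embedding model) is outside its scope, and the function name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


text_1 = [0.8, 0.6, 0.9]  # example vector for "I love dogs"
text_2 = [0.7, 0.5, 0.8]  # example vector for "I like canines"
similarity = cosine_similarity(text_1, text_2)
print(f"Cosine similarity: {similarity:.4f}")
```

Because the measure depends only on direction, scaling either vector by a positive constant leaves the result unchanged.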

**Use Cases**
- Semantic search
- Document similarity
- Plagiarism detection

### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

**Definition**  
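ROUGE measures the quality of text summaries by comparing them with reference summaries. It is recall-oriented: it asks how much of the reference content also appears in the generated summary.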

**Types**
- **ROUGE-N**: N-gram overlap
- **ROUGE-L**: Longest common subsequence
- **ROUGE-W**: Weighted longest common subsequence

**Formula (ROUGE-N)**  
$$\text{ROUGE-N} = \frac{\sum \text{common n-grams}}{\sum \text{reference n-grams}}$$

**Example**  
Reference: "The quick brown fox jumps"  
Summary: "The brown fox jumps quickly"  
ROUGE-1: 0.80 (4 of 5 reference words matched; "quickly" does not match "quick" without stemming)  
ROUGE-2: 0.50 (2 of 4 reference 2-grams matched)
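
The ROUGE-N formula can be sketched as follows. This is a minimal recall-only version that assumes lowercase whitespace tokenization and no stemming; the helper names `ngram_counts` and `rouge_n_recall` are illustrative.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(summary, reference, n):
    """ROUGE-N recall: clipped n-gram matches divided by reference n-grams."""
    summary_counts = ngram_counts(summary.lower().split(), n)
    reference_counts = ngram_counts(reference.lower().split(), n)
    total = sum(reference_counts.values())
    if total == 0:
        return 0.0
    # A reference n-gram counts as matched at most as often as the summary contains it.
    matches = sum(min(count, summary_counts[gram])
                  for gram, count in reference_counts.items())
    return matches / total


reference = "The quick brown fox jumps"
summary = "The brown fox jumps quickly"

rouge_1 = rouge_n_recall(summary, reference, 1)
rouge_2 = rouge_n_recall(summary, reference, 2)
print(f"ROUGE-1: {rouge_1:.2f}, ROUGE-2: {rouge_2:.2f}")
```

Toolkits often report precision and an F-measure alongside recall, and may apply stemming, so their numbers can differ from this sketch.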

**Use Cases**
- Text summarization
- Content generation
- Translation evaluation

## 3. Language Quality Metrics

### Grammar and Fluency Score

**Definition**  
Measures grammatical correctness and natural flow of text.

**Components**

**Grammar Score**  
$$\text{Grammar Score} = \frac{\text{Correct sentences}}{\text{Total sentences}}$$

**Fluency Score**  
$$\text{Fluency Score} = \frac{\text{Natural transitions}}{\text{Total transitions}}$$

**Example**  
Text: "I am going to store. The weather nice."  
Grammar Score: 0.0 (0 correct / 2 sentences; the first is missing an article, the second a verb)  
Fluency Score: 0.0 (unnatural transition)
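
Both scores are simple ratios once each sentence and each transition has been judged. The sketch below assumes those judgments already exist as booleans, for example from a human annotator or a grammar checker; the labels shown are hypothetical and describe an unrelated four-sentence passage.

```python
def fraction_true(judgments):
    """Share of True values in a list of boolean judgments (0.0 if empty)."""
    return sum(judgments) / len(judgments) if judgments else 0.0


# Hypothetical labels for a four-sentence passage.
sentence_is_correct = [True, True, False, True]
# One label per boundary between adjacent sentences (three boundaries here).
transition_is_natural = [True, False, True]

grammar_score = fraction_true(sentence_is_correct)
fluency_score = fraction_true(transition_is_natural)
print(f"Grammar: {grammar_score:.2f}, Fluency: {fluency_score:.2f}")
```

A passage of k sentences has k - 1 transitions, so the two scores generally have different denominators.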

**Use Cases**
- Content quality assessment
- Educational applications
- Professional writing tools

## 4. Task-Specific Metrics

### Question-Answering Metrics

**Exact Match (EM)**  
$$\text{EM} = \frac{\text{Exact matches}}{\text{Total questions}}$$

**F1 Score**  
$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Where:
- **Precision** = Words shared by the prediction and the reference / Words in the prediction
- **Recall** = Words shared by the prediction and the reference / Words in the reference

**Example**  
Question: "What is the capital of France?"  
Correct: "Paris"  
AI Answer: "The capital is Paris"  
EM: 0 (not exact match)  
F1: 0.40 (partial match: precision 1/4, recall 1/1)
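
Both metrics can be computed per question with a short sketch. It assumes lowercase whitespace tokenization with no further normalization; benchmark scripts often also strip punctuation and articles, which changes the scores. The function names are illustrative.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1 if the tokenized answers are identical, otherwise 0."""
    return int(tokenize(prediction) == tokenize(reference))


def token_f1(prediction, reference):
    """Word-level F1 between a predicted answer and a reference answer."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    # Multiset intersection counts each shared word at most as often as it appears in both.
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The capital is Paris", "Paris")
f1 = token_f1("The capital is Paris", "Paris")
print(f"EM: {em}, F1: {f1:.2f}")
```

Averaging the per-question values over a dataset gives the dataset-level EM and F1.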

## 5. Human Evaluation Metrics

### Expert Rating System

**Components**
- **Clarity** (1-5)
- **Accuracy** (1-5)
- **Usefulness** (1-5)

**Formula**  
$$\text{Overall Score} = \frac{\text{Clarity} + \text{Accuracy} + \text{Usefulness}}{15} \times 100$$

**Example**  
AI Text Evaluation:
- Clarity: 4/5
- Accuracy: 5/5
- Usefulness: 4/5  
Overall Score: 86.7%
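
The overall score can be computed with a small helper. This sketch assumes three equally weighted ratings on the same 1-5 scale; the function name `overall_score` and the range check are illustrative additions.

```python
def overall_score(clarity, accuracy, usefulness, max_rating=5):
    """Sum of the three ratings as a percentage of the maximum possible total."""
    ratings = (clarity, accuracy, usefulness)
    for rating in ratings:
        if not 1 <= rating <= max_rating:
            raise ValueError(f"rating {rating} is outside 1-{max_rating}")
    return sum(ratings) / (len(ratings) * max_rating) * 100


score = overall_score(clarity=4, accuracy=5, usefulness=4)
print(f"Overall Score: {score:.1f}%")  # Overall Score: 86.7%
```

Because every rating is at least 1, the lowest possible overall score on this scale is 20%, not 0%.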
