# Evaluation Metrics in Language Models

Perplexity 
---
- Provides a measure of how well a probability distribution or probability model predicts a sample
- The exponential of the cross-entropy (CE), i.e. the average negative log-likelihood of the $N$ tokens $x_i$ in the sample
$$ Perplexity = e^{CE} = e^{-\frac{1}{N}\sum_{i=1}^N \log P(x_i)} $$
- Generally, the lower the perplexity, the better the model. However, in open-ended text generation, decoding for the most likely (lowest-perplexity) text tends to make the model repeat itself.
- **Pros**
    - Easy to compute (one pass through the data)
    - Easy to interpret (lower is better)
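    - Fairly stable on large samples because it averages over all tokens, although a single token with near-zero probability can still inflate it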
- **Cons**
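  -  Not always accurate: a model with a low perplexity is not necessarily a good model.
  -  Not a task metric: it measures how well the model predicts existing text, not how good the text it generates is.
  -  Does not generalize across data: perplexity measured on one dataset says little about how the model behaves on unseen data from a different domain.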
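
The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming the per-token natural-log probabilities $\log P(x_i)$ have already been obtained from some model; the function name `perplexity` and the toy inputs are hypothetical and not tied to any particular library.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities log P(x_i)."""
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("need at least one token")
    # Cross-entropy is the average negative log-likelihood per token.
    cross_entropy = -sum(token_log_probs) / n
    return math.exp(cross_entropy)

# Toy input: a model that assigns probability 0.25 to every token.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))  # approximately 4.0
```

A model that is uniformly unsure among four choices at every step has a perplexity of about 4, which is why perplexity is often read as an effective branching factor.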

ROUGE
---
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating the quality of summaries.

- **ROUGE-N**
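  - Based on the number of n-grams in the human reference summary that also appear in the machine-generated summary.
  - ROUGE-N Recall: that overlap count divided by the total number of n-grams in the reference summary.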
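
As a rough sketch of ROUGE-N recall, the overlap can be counted with `collections.Counter`. Whitespace tokenization, the single reference, and the helper names `ngrams` and `rouge_n_recall` are assumptions made for illustration; established ROUGE implementations add stemming and other preprocessing.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 reference unigrams match
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 reference bigrams match
```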

- **ROUGE-L**
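  - Based on the length of the longest common subsequence (LCS) of words shared by the candidate summary and the reference summary.
  - ROUGE-L Recall: the LCS length divided by the total number of words in the reference summary.
  - ROUGE-L Precision: the LCS length divided by the total number of words in the candidate summary.

The F1 combination of the two, where $C$ is the candidate summary and $R$ is the reference summary:

$$ ROUGE_L = \frac{2 \cdot LCS(C, R)}{len(C) + len(R)} $$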

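***ROUGE-L addresses a limitation of ROUGE-N, which only matches fixed-length n-grams: the longest common subsequence rewards words that appear in the same order in the candidate summary and the reference summary, even when they are not adjacent.***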
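
The ROUGE-L formula can be sketched with a standard dynamic-programming LCS. This is an illustrative implementation under simple assumptions (whitespace tokenization, one reference); the names `lcs_length` and `rouge_l` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """2 * LCS / (len(C) + len(R)), as in the formula above."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    return 2 * lcs_length(c, r) / (len(c) + len(r))

reference = "the cat sat on the mat"
print(rouge_l("the cat lay on the mat", reference))  # LCS is 5 words
print(rouge_l("on the mat the cat sat", reference))  # same words, different order: 0.5
```

The second candidate contains exactly the same words as the reference, so a unigram overlap score would be perfect, but its LCS is only three words long, which is how word order enters the score.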


BLEU
---
BiLingual Evaluation Understudy (BLEU) is a metric for evaluating the quality of text which has been machine-translated from one natural language to another.

- **BLEU-N**
  - Based on the number of n-grams in the candidate translation that also appear in the reference translation.
  - BLEU-N Precision: that count, with each n-gram clipped to the number of times it occurs in the reference, divided by the total number of n-grams in the candidate translation.
  - BLEU-N Score: the geometric mean of BLEU-N Precision for n = 1, 2, 3, 4, multiplied by a brevity penalty that penalizes candidates shorter than the reference.
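
A simplified sketch of the BLEU computation follows. It assumes a single reference, whitespace tokenization, sentence-level scoring, and no smoothing, so any n-gram order with zero matches gives a score of 0. The clipping of counts and the brevity penalty follow the standard BLEU definition; the function names are illustrative rather than any library's API.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Counts of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate token list against one reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count to its count in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(candidate, reference, max_n=4):
    c, r = candidate.split(), reference.split()
    precisions = [modified_precision(c, r, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed: one zero precision zeroes the geometric mean
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: 1 if the candidate is longer than the reference.
    brevity_penalty = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / len(c))
    return brevity_penalty * math.exp(log_mean)

print(bleu("the cat sat on a mat", "the cat sat on the mat"))  # about 0.537
```

In the example the precisions for n = 1 to 4 are 5/6, 3/5, 2/4 and 1/3, whose product is 1/12, and the lengths are equal, so the score is the fourth root of 1/12.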


----

# Open-Ended Text Generation Metrics

Self-BLEU
---

- Self-BLEU is a metric based on BLEU (BiLingual Evaluation Understudy) that measures how similar a model's generated texts are to one another, and is therefore used as a measure of diversity.
- BLEU(N) is the fraction of n-grams in the candidate text that are also in the reference text.
  $$ BLEU_N = \frac{|N_{gram}^{candidate} \cap N_{gram}^{reference}|}{|N_{gram}^{candidate}|} $$
- Self-BLEU(N) treats each generated text as the candidate and the other generated texts as the references, then averages BLEU(N) over all generated texts.

Intuitively, Self-BLEU(N) is the fraction of n-grams in a generated text that also appear in the other generated texts. If Self-BLEU(N) = 1, every n-gram in each generated text also occurs in the others, so the outputs have very little diversity; lower values indicate more diverse outputs. However, Self-BLEU is not sufficient on its own, because it measures diversity and says nothing about quality: a model with a seemingly good Self-BLEU(N) can still be a poor model, for example one that produces varied but incoherent text.
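
A minimal sketch of Self-BLEU(N), as it is commonly defined: each generated sample is scored against all the other generated samples, and the scores are averaged. To stay close to the $BLEU_N$ formula above, this sketch uses only the clipped n-gram precision for a single n and omits the brevity penalty and the geometric mean over several n; that simplification, the whitespace tokenization, and the function names are assumptions for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Counts of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_n_precision(candidate, references, n):
    """Clipped n-gram precision of one token list against several references."""
    cand = ngram_counts(candidate, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # For clipping, use the largest count of each n-gram in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    return sum(min(count, max_ref[gram]) for gram, count in cand.items()) / total

def self_bleu_n(samples, n=2):
    """Average BLEU_N of each sample against all the other samples."""
    tokenized = [s.split() for s in samples]
    if len(tokenized) < 2:
        raise ValueError("need at least two samples")
    scores = []
    for i, cand in enumerate(tokenized):
        others = tokenized[:i] + tokenized[i + 1:]
        scores.append(bleu_n_precision(cand, others, n))
    return sum(scores) / len(scores)

repetitive = ["the cat sat on the mat"] * 3
diverse = ["the cat sat on the mat", "a dog ran in the park", "birds fly over quiet hills"]
print(self_bleu_n(repetitive))  # 1.0: every bigram recurs in the other samples
print(self_bleu_n(diverse))     # 0.0: no bigram is shared between samples
```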


Zipf’s law is an empirical law, formulated using mathematical statistics, that describes the frequency distribution of words in natural languages. It states that, given a corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table [].

A rank-frequency plot comparing the n-gram frequency distributions of human and machine text can be used to assess the quality of the text generated by the model. However, this approach is limited because it considers only the frequency of the n-grams, not their order. In addition, some decoding algorithms work by truncating the tail of the word distribution, which can lower the frequency of rare n-grams yet still yield higher-quality text.
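
The rank-frequency comparison can be sketched as follows. The code builds the rank-frequency table for a list of tokens and estimates a Zipf exponent as the negated least-squares slope of log-frequency against log-rank. Fitting a straight line in log-log space is one simple way to summarize the plot, chosen here for illustration; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent token first."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def zipf_exponent(tokens):
    """Negated least-squares slope of log(frequency) against log(rank)."""
    pairs = rank_frequency(tokens)
    if len(pairs) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var

# Toy corpus whose frequencies are exactly 12 / rank.
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
print(rank_frequency(tokens))  # [(1, 12), (2, 6), (3, 4), (4, 3)]
print(zipf_exponent(tokens))   # approximately 1.0
```

Computing the same table for human text and for model samples, over words or over n-grams, gives the two curves that such a plot compares.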


Future Work
---
Having surveyed the metrics used to evaluate the quality of text generated by a language model, a natural next step is to explore ways of improving that quality.


- A unified framework which evaluates both diversity and quality [3]. The authors show that the error rate can be estimated by combining human and statistical evaluation, using an evaluation metric named HUSE.
