# Uncertainty Quantification Pipeline - Version 1
This notebook walks through the complete use of version 1 of the Uncertainty Quantification Pipeline for EXPERT 2.0.

The Uncertainty Quantification (UQ) Pipeline aims to measure the uncertainty of decoder-style (autoregressive) large language models through various entropy and confidence scores.

In this version of our UQ Pipeline, we experiment with unsupervised methods for uncertainty estimation, that is, methods that do not involve any calibration or training. The models used in this version are frozen, which means that their weights are not updated in any way.

This version presents the following four uncertainty estimation algorithms:

1. Entropy
2. Normalized Entropy
3. Lexical Similarity
4. Semantic Entropy

These four scores represent the extent of uncertainty shown by a given model for a given input prompt (question).
They are four widely used uncertainty measures in the community, with Semantic Entropy being the state of the art for unsupervised uncertainty estimation.

## Structure

The UQ Pipeline consists of 3 separate classes:

1. [Generation](#1-generation): contains the functions required to generate the output(s) for a given prompt (question).
2. [Entropy](#2-measuring-uncertainty): contains the functions required to compute the entropy scores for a given set of generations.
3. [Pipeline](#3-pipeline): combines the Generation and Entropy classes so that a given model and set of prompts can be run and saved easily.

### 1. Generation

In this section we explore the Generation class of the UQ Pipeline.

In [1]:
# Import the Generation Class
from uq_pipeline import UQ_Generation



In [2]:
# Using a GPT2-XL Model
gen_pipeline = UQ_Generation("gpt2-xl")

In [3]:
# Setting up the input
question = "Which knowledge of the aqueous solubility of TBP during Purex process conditions is important?"

Generation methods<sup>[[1]](#references)</sup>, also known as decoding strategies, determine how a language model chooses which tokens to output once it has computed the next-token probabilities.

In this version we experiment with 3 different types of generation methods:
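1. Sampling with temperature: This method randomly samples the next token from the model's probability distribution after rescaling it by a temperature; lower temperatures concentrate the probability mass on high-probability tokens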
2. Nucleus (Top-p) Sampling: This method chooses from the smallest possible set of tokens whose cumulative probability exceeds the probability ```p```
3. Beam Search: This method keeps the most likely ```num_beams``` of hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability.
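
The three strategies above can be illustrated on a toy next-token distribution. This is a minimal sketch using only the Python standard library; the function names, the toy logits, and the `toy_next_log_probs` model are all hypothetical and are not part of `uq_pipeline` or of any decoding library.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    exps = [math.exp(value - top) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_with_temperature(logits, temperature, rng):
    """Sample one token id from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def nucleus_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token in ranked:
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[token] for token in kept)
    return {token: probs[token] / total for token in kept}


def sample_top_p(logits, p, rng):
    """Sample one token id from the renormalized nucleus."""
    nucleus = nucleus_filter(softmax(logits), p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]


def beam_search(next_log_probs, num_beams, num_tokens):
    """Keep the num_beams highest-scoring hypotheses at every step."""
    beams = [((), 0.0)]
    for _ in range(num_tokens):
        candidates = []
        for prefix, score in beams:
            for token, log_prob in enumerate(next_log_probs(prefix)):
                candidates.append((prefix + (token,), score + log_prob))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


def toy_next_log_probs(prefix):
    """Hypothetical 3-token model whose distribution depends on the last token."""
    if not prefix:
        probs = [0.5, 0.4, 0.1]
    elif prefix[-1] == 0:
        probs = [0.4, 0.3, 0.3]
    elif prefix[-1] == 1:
        probs = [0.9, 0.05, 0.05]
    else:
        probs = [1 / 3, 1 / 3, 1 / 3]
    return [math.log(value) for value in probs]


rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.0]
print("temperature sample:", sample_with_temperature(toy_logits, 0.7, rng))
print("top-p sample:", sample_top_p(toy_logits, 0.9, rng))
print("best beam:", beam_search(toy_next_log_probs, num_beams=2, num_tokens=2)[0])
```

In the toy model, a single beam commits to token 0 at the first step, whereas two beams recover the sequence with the higher overall probability, which is the behavior the beam search description refers to.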

In [4]:
# Generate output for the given prompt (question).

# Using beam search, we generate 10 separate sequences (answers)
# for the given input prompt (question), with 15 tokens generated for each sequence (answer).

# Note: generating more than 15 tokens has not been tested.
outputs = gen_pipeline.gen_beam(question, num_tokens=15)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [5]:
# Get the probabilities for each generated sequence
gen_probs = gen_pipeline.get_probab(outputs)

In [6]:
# Decoding generated sequences into text
gen_text = gen_pipeline.get_gen_text()
gen_text

['\n\nTBP is insoluble in water, but it is soluble in',
 '\n\nTBP is insoluble in water at room temperature. However,',
 '\n\nTBP is insoluble in water at room temperature. It is',
 '\n\nTBP is insoluble in water at room temperature. The sol',
 '\n\nTBP is insoluble in water at pH 7.4 and',
 '\n\nTBP is insoluble in water at pH 7.4.',
 '\n\nTBP is insoluble in water at room temperature, but soluble',
 '\n\nTBP is insoluble in water at pH 7.4,',
 '\n\nTBP is insoluble in water, but soluble in ethanol,',
 '\n\nTBP is insoluble in water at pH 7.0 and']

### 2. Measuring Uncertainty

In this section we explore various uncertainty estimation methods proposed for autoregressive generative LLMs.

In [7]:
# Importing Entropy Class
from uq_pipeline import UQ_Entropy


In [8]:
# Use the generation probabilities from the Generation class
entropy_pipeline = UQ_Entropy(gen_probs)

#### 2.1 Predictive Entropy<sup>[[2]](#references)</sup>

The entropy is a statistical parameter which measures, in a certain sense, how much information is produced on the average for each letter of a text in the language. 

For a generated sequence, the predictive entropy is the (negated) sum, over all of its tokens, of the product of each token's conditional probability and the logarithm of that probability.

![image](img/pred_entropy.jpeg)

To calculate the final predictive entropy for a given model, we average the predictive entropy over a set S of generated sequences for a given prompt x.

![image](img/pred_entropy_final.jpeg)
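
The two steps described above can be sketched in plain Python. This is a minimal sketch, assuming the token-level reading of the text (sum `-p * log(p)` over the conditional token probabilities of each sequence, then average over the sequences); the function names and the toy probabilities are hypothetical, and this is not the `UQ_Entropy` implementation.

```python
import math


def sequence_entropy(token_probs):
    """Sum -p * log(p) over the conditional probabilities of one sequence."""
    return -sum(p * math.log(p) for p in token_probs if p > 0.0)


def predictive_entropy(sequences_token_probs):
    """Average the per-sequence entropy over all generated sequences."""
    entropies = [sequence_entropy(probs) for probs in sequences_token_probs]
    return sum(entropies) / len(entropies)


# Hypothetical conditional token probabilities for two generated answers.
toy_token_probs = [
    [0.9, 0.8, 0.95],
    [0.6, 0.7, 0.5],
]
print(predictive_entropy(toy_token_probs))
```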

In [9]:
# 1. Predictive Entropy

entropy = entropy_pipeline.get_entropy()

#### 2.2 Normalized Predictive Entropy<sup>[[3]](#references)</sup>

It is similar to the predictive entropy, except that we normalize each sequence entropy by dividing it by the total number of tokens generated (N).

![image](img/norm_entropy.jpeg)
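
As an illustration of the length normalization, the sketch below divides a token-level sequence entropy by the number of generated tokens N. It is a minimal sketch under the same assumed reading of the entropy as `-sum(p * log(p))`; the function names are hypothetical and do not come from `uq_pipeline`.

```python
import math


def normalized_sequence_entropy(token_probs):
    """Divide the summed token entropy of one sequence by its length N."""
    total = -sum(p * math.log(p) for p in token_probs if p > 0.0)
    return total / len(token_probs)


def normalized_predictive_entropy(sequences_token_probs):
    """Average the length-normalized entropy over all generated sequences."""
    values = [normalized_sequence_entropy(probs) for probs in sequences_token_probs]
    return sum(values) / len(values)


# With identical per-token probabilities, the score no longer grows with length.
short_answer = [0.5] * 2
long_answer = [0.5] * 10
print(normalized_sequence_entropy(short_answer))
print(normalized_sequence_entropy(long_answer))
```

The normalization makes sequences of different lengths comparable, since the unnormalized sum grows with every additional token.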

In [10]:
# 2. Normalized Entropy

norm_entropy = entropy_pipeline.normalized_entropy()

#### 2.3 Lexical Similarity<sup>[[4]](#references)</sup>

Lexical similarity uses the average similarity of the answers in the answer set S

![image](img/lex_sim.jpeg)

where sim is the ROUGE-L score, and

![image](img/lex_sim_C.jpeg)

We invert the final Lexical Similarity score to estimate the entropy in the generated sequences.
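
The averaging step can be sketched as follows. This is a minimal sketch, assuming that sim is the ROUGE-L F-measure computed on whitespace-separated words and that the normalizer is the number of unordered answer pairs; the function names are hypothetical, and the pipeline itself works on token ids with its tokenizer. Because the exact inversion used by the pipeline is not specified here, the sketch stops at the average similarity.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure between two strings, using whitespace tokens."""
    cand_tokens, ref_tokens = candidate.split(), reference.split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def lexical_similarity(answers):
    """Average pairwise ROUGE-L over all unordered pairs of answers."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two answers")
    return sum(rouge_l_f1(a, b) for a, b in pairs) / len(pairs)


toy_answers = [
    "TBP is insoluble in water at room temperature",
    "TBP is insoluble in water at pH 7.4",
    "TBP is insoluble in water, but soluble in ethanol",
]
print(lexical_similarity(toy_answers))
```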

In [11]:
# 3. Lexical Similarity

gen_sequences = gen_pipeline.gen_sequences
gen_tokenizer = gen_pipeline.tokenizer

lex_sim = entropy_pipeline.lexical_similarity(gen_sequences, gen_tokenizer)

#### 2.4 Semantic Entropy<sup>[[5]](#references)</sup>

Semantic Entropy is a measure for estimating entropy over open-ended generations. In this method, we group the generated sequences (answers) for the same prompt (question) into multiple ```meaning sets, C```, each of which consists of semantically similar sequences.

We then sum the probabilities of the sequences within each ```meaning set, C``` and calculate the final Semantic Entropy over these sets, similar to the way we calculate predictive entropy for a sequence.

![image](img/sem_entropy.jpeg)
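
The grouping and summing steps can be sketched with a toy equivalence test. This is a minimal sketch: the function names, the `toy_equivalent` predicate, and the renormalization of the sequence probabilities over the sampled set are assumptions made for illustration. The log output of the cell below indicates that `microsoft/deberta-large-mnli` is loaded at this stage, so the pipeline presumably relies on a model-based judgment rather than string matching.

```python
import math


def cluster_by_meaning(answers, equivalent):
    """Greedily group answer indices into meaning sets using `equivalent`."""
    clusters = []
    for index, answer in enumerate(answers):
        for cluster in clusters:
            # Compare against the first member of each existing meaning set.
            if equivalent(answers[cluster[0]], answer):
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


def semantic_entropy(sequence_probs, clusters):
    """Entropy over meaning sets, each weighted by the summed sequence probability."""
    total = sum(sequence_probs)
    cluster_probs = [sum(sequence_probs[i] for i in cluster) / total for cluster in clusters]
    return -sum(p * math.log(p) for p in cluster_probs if p > 0.0)


def toy_equivalent(a, b):
    """Hypothetical stand-in for a semantic equivalence check."""
    return a.lower().rstrip(".") == b.lower().rstrip(".")


toy_answers = ["Paris", "paris.", "Lyon"]
toy_sequence_probs = [0.5, 0.3, 0.2]
toy_clusters = cluster_by_meaning(toy_answers, toy_equivalent)
print(toy_clusters)
print(semantic_entropy(toy_sequence_probs, toy_clusters))
```

Answers that differ only in surface form fall into the same meaning set, so they no longer inflate the entropy the way they would if every distinct string were counted separately.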

In [12]:
# 4. Semantic Entropy

sem_uncertainty = entropy_pipeline.semantic_uncertainty(question, gen_sequences, gen_tokenizer)


Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [13]:
print(f"Entropy: {entropy}")
print(f"Normalized Entropy: {norm_entropy}")
print(f"Lexical Similarity: {lex_sim}")
print(f"Semantic Entropy: {sem_uncertainty}")

Entropy: 0.12755484166352649
Normalized Entropy: 0.1622507373491923
Lexical Similarity: 0.04075235109717865
Semantic Entropy: 0.13229776088533252


### 3. Pipeline

In this section we explore the Pipeline class of the UQ Pipeline. The Pipeline class combines the Generation and Entropy classes to provide a one-line way to generate text for a given prompt and to calculate the various entropy values for the provided model.

In [14]:
# Import the Pipeline class
from uq_pipeline import UQ_Pipeline

In [15]:
# Initialize the pipeline
uq_pipeline = UQ_Pipeline(prompt=question, model_name="gpt2-xl",
                          gen_method="sampling", outpath='./output')

In [16]:
# Save output as JSON
out_json = uq_pipeline.save_json()

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generation Done.....
	Basic Entropy Done.....
	Normalized Entropy Done.....
	Lexical Similarity Done.....


Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


	Semantic Uncertainity Done.....
Entropy Done.....
Output JSON saved at:
./output/23_03_2023_18_29_20.json
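
When several runs are saved to the same folder, the timestamped file names can be used to pick the most recent one. This is a minimal sketch, assuming that every file follows the day_month_year_hour_minute_second pattern seen in the path above; the helper names are hypothetical, and the structure of the saved JSON is not assumed.

```python
from datetime import datetime
from pathlib import Path


def run_timestamp(path):
    """Parse the timestamp encoded in an output file name (assumed pattern)."""
    return datetime.strptime(Path(path).stem, "%d_%m_%Y_%H_%M_%S")


def latest_run(paths):
    """Return the path whose file name carries the most recent timestamp."""
    return max(paths, key=run_timestamp)


example_paths = [
    "./output/23_03_2023_18_29_20.json",
    "./output/22_03_2023_09_15_00.json",  # hypothetical earlier run
]
print(run_timestamp(example_paths[0]))
print(latest_run(example_paths))
```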


## Demo 

Once the widget is loaded and the inputs (Generator Algorithm, Uncertainty Estimator, and Input Question) are selected, the widget displays a series of answers to the input question, with each token shown on a colored background that corresponds to the token's rank.

Hover over each token to view more related tokens with their corresponding probabilities. The widget also displays the uncertainty estimates in terms of Entropy, Normalized Entropy, Lexical Similarity, and Semantic Uncertainty.

Note: You can also use the input files provided as supplementary materials to test the Jupyter widget functionality. Point ``data_path`` to the respective folder.

In [18]:
import logging, sys
import warnings
warnings.filterwarnings('ignore')
logging.disable(sys.maxsize)

from uq_widget import uqWidget
uqWidget.LoadWidget(data_path = './output')

(User Message: If running this widget on a virtual machine, port forward 38327 to your local machine and then run the widget)


<uq_widget.uqWidget.LoadWidget at 0x7f3eee906460>

## Example demos
![image](img/uq_widget_overall.jpeg)
![image](img/uq_widget_hover.jpeg)

# References

[1] How to generate text: using different decoding methods for language generation with Transformers [[Link](https://huggingface.co/blog/how-to-generate)]

[2] Shannon, C. E. (2001). A mathematical theory of communication. ACM SIGMOBILE mobile computing and communications review, 5(1), 3-55. [[Link](https://dl.acm.org/doi/abs/10.1145/584091.584093)]

[3] Malinin, A., & Gales, M. (2020). Uncertainty estimation in autoregressive structured prediction. arXiv preprint arXiv:2002.07650. [[Link](https://arxiv.org/abs/2002.07650)]

[4] Fomicheva, M., Sun, S., Yankovskaya, L., Blain, F., Guzmán, F., Fishel, M., ... & Specia, L. (2020). Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics, 8, 539-555. [[Link](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00330/96475/Unsupervised-Quality-Estimation-for-Neural-Machine)]

[5] Kuhn, L., Gal, Y., & Farquhar, S. (2023). Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. arXiv preprint arXiv:2302.09664. [[Link](https://arxiv.org/abs/2302.09664)] 