# Evaluate uncertainty measures
- To compare different uncertainty measures, use the AUROC.
- An uncertainty measure is expected to be higher for incorrect answers and lower for correct answers. Because we want to show the relationship incorrect answer -> high uncertainty, incorrect answers are encoded with label 1 and correct answers with label 0.
- Compute the AUROC score for each uncertainty measure with `sklearn.metrics.roc_auc_score([0, 0, 0, 1, 1, 0, ...], [semantic entropy scores of the answers])`, where the first argument holds the labels and the second the uncertainty scores.
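
To make the label convention concrete, the sketch below computes the AUROC by hand as the fraction of (incorrect, correct) answer pairs in which the incorrect answer has the higher uncertainty score, counting ties as one half. This pairwise quantity is what the area under the ROC curve measures. The function name `pairwise_auroc` and the toy labels and scores are illustrative assumptions, not values from the experiments.

```python
def pairwise_auroc(labels, scores):
    """AUROC as the probability that a label-1 example outscores a label-0 example."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # incorrect answer is more uncertain: desired ordering
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data (made up): 1 = incorrect answer, 0 = correct answer
labels = [0, 0, 0, 1, 1, 0]
uncertainty = [0.1, 0.4, 0.35, 0.8, 0.3, 0.2]

auroc = pairwise_auroc(labels, uncertainty)
print(auroc)  # 0.75
```

A value of 0.5 means the measure cannot separate correct from incorrect answers, while 1.0 means every incorrect answer is more uncertain than every correct one.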


Page 17: the authors compare four correctness criteria: Rouge-L > 0.3, Rouge-L > 0.5, exact matching, and semantic matching.
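
As a rough illustration of how the two Rouge-L thresholds can disagree, here is a minimal sketch that scores an answer with a Rouge-L F1 based on the longest common subsequence and turns it into the 0/1 label used for the AUROC. Lowercasing, whitespace tokenisation, the F1 variant, and the helper names `lcs_length`, `rouge_l_f1`, and `incorrect_label` are all assumptions made for this example; a dedicated Rouge package may tokenise and stem differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """Rouge-L F1 over lowercased whitespace tokens (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def incorrect_label(prediction, reference, threshold):
    """Return 1 if the answer counts as incorrect (score not above threshold), else 0."""
    return int(rouge_l_f1(prediction, reference) <= threshold)


score = rouge_l_f1("paris", "the city of paris")  # precision 1.0, recall 0.25
print(round(score, 2))                                        # 0.4
print(incorrect_label("paris", "the city of paris", 0.3))    # 0 (correct)
print(incorrect_label("paris", "the city of paris", 0.5))    # 1 (incorrect)
```

The same answer is labelled correct under the 0.3 threshold and incorrect under the 0.5 threshold, which is why the choice of criterion can change the resulting AUROC.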


In [None]:
import numpy as np
import yaml
import glob
import pickle
import pandas as pd
import os

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

In [2]:
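def load_pickle_files(folder):
    """Load every group*.pkl file in `folder` and return the unpickled objects as a list."""
    data_groups = []
    # glob returns files in arbitrary order; sort (lexicographically) so that
    # the group numbers assigned later are reproducible across runs
    pickle_files = sorted(glob.glob(f"{folder}/group*.pkl"))
    for pickle_file in pickle_files:
        with open(pickle_file, "rb") as f:
            data_groups.append(pickle.load(f))

    return data_groups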

In [3]:
save_path = config["path_to_saved_generations"]
data_groups = load_pickle_files(save_path)
keys_data = [f"temperature_{temp}" for temp in config["temperatures"]] + [f"beam_{beam}" for beam in config["n_beams"]]

## Save results

In [4]:
general_columns = ["question_id", "group"]
correct_columns = ["rougel_0.3", "rougel_0.5", "rouge1_0.5", "entailment", "label_by_hand"]
predictive_entropy_columns = [f"pred_entropy_temperature_{s}" for s in config["temperatures"]] + [
    f"pred_entropy_beam_{s}" for s in config["n_beams"]]
length_normalized_entropy_columns = [f"length_normalized_pred_entropy_temperature_{s}" for s in
                                     config["temperatures"]] + [
                                        f"length_normalized_pred_entropy_beam_{s}" for s in config["n_beams"]]
n_semantically_distinct_columns = [f"n_semantically_distinct_temperature_{s}" for s in config["temperatures"]] + [
    f"n_semantically_distinct_beam_{s}" for s in config["n_beams"]]
semantic_entropy_columns = [f"sem_entropy_temperature_{s}" for s in config["temperatures"]] + [
    f"sem_entropy_beam_{s}" for s in config["n_beams"]]

results = pd.DataFrame(
    columns=general_columns + correct_columns + predictive_entropy_columns + length_normalized_entropy_columns + n_semantically_distinct_columns + semantic_entropy_columns)

In [None]:
data_groups = load_pickle_files(save_path)

# Collect one row per question, then append all rows in a single concat
rows = [
    {"question_id": question_idx, "group": group_nr}
    for group_nr, group_info in enumerate(data_groups)
    for question_idx in group_info.keys()
]
results = pd.concat([results, pd.DataFrame(rows)], ignore_index=True)

In [None]:
# Store it for now
results.to_csv(os.path.join(save_path, "results.csv"), index=False)

## Predictive Entropy

The predictive entropy is defined as:
$$PE(x) = H[p(y \mid x)] = - \sum_{y} p(y \mid x) \ln p(y \mid x)$$

Since this is the negative expected value of $\ln p(y \mid x)$ under $p(y \mid x)$, Monte Carlo sampling approximates it by averaging the log-probabilities of sampled answers and negating the result. The Monte Carlo estimator is therefore:
$$PE(x) \approx - \frac{1}{n}  \sum_{i = 1}^{n} \ln p(y_i \mid x)$$
where $n$ is the number of answers $y_i$ sampled for the question $x$.
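
The following sketch compares the exact entropy with its Monte Carlo estimate on a toy distribution, to show that the sample average converges to the sum in the definition. The three answers, their probabilities, the seed, and the sample size are made-up values chosen for illustration only.

```python
import math
import random

# Toy answer distribution p(y | x) (made-up numbers)
answers = ["Paris", "Paris, France", "Lyon"]
probs = [0.5, 0.25, 0.25]

# Exact predictive entropy: -sum_y p(y | x) ln p(y | x)
exact_pe = -sum(p * math.log(p) for p in probs)

# Monte Carlo estimate: draw y_i ~ p(y | x), then average -ln p(y_i | x)
rng = random.Random(0)
n = 20000
sampled_indices = rng.choices(range(len(answers)), weights=probs, k=n)
mc_pe = -sum(math.log(probs[i]) for i in sampled_indices) / n

print(f"exact: {exact_pe:.4f}, Monte Carlo: {mc_pe:.4f}")
```

With only a handful of samples per question, as is typical when sampling from a language model, the estimate is considerably noisier than in this toy setting.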

In [4]:
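def predictive_entropy(probabilities, lexical_eq_classes):
    """Estimate the predictive entropy from sampled answers.

    probabilities[j] is the probability of answer j and lexical_eq_classes[j]
    is the id of its lexical equivalence class. Returns the negative mean of
    the log class probabilities.
    """
    equivalence_classes = set(lexical_eq_classes)

    probability_eq_class = np.zeros(len(equivalence_classes))

    # Sum the probabilities within each equivalence class and store the log of the sum
    for i, eq_class in enumerate(equivalence_classes):
        prob_sum = np.sum([probabilities[j] for j in range(len(probabilities)) if lexical_eq_classes[j] == eq_class])
        probability_eq_class[i] = np.log(prob_sum)

    return -1 / len(equivalence_classes) * np.sum(probability_eq_class)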

In [10]:
predictive_entropy_scores = {k: [] for k in keys_data}

## Length-normalised predictive entropy
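
This section is still empty in the notebook. As a placeholder, here is a minimal sketch of one common definition: divide each sampled answer's log-probability by its number of tokens, then take the negative mean over the samples. The input format (one list of per-token log-probabilities per answer) and the function name are assumptions, not the stored data layout.

```python
def length_normalised_predictive_entropy(token_log_probs):
    """token_log_probs: one list of per-token log-probabilities per sampled answer."""
    # Average log-probability per token for each answer
    per_answer = [sum(lp) / len(lp) for lp in token_log_probs]
    # Negative mean over the sampled answers
    return -sum(per_answer) / len(per_answer)


# Toy per-token log-probabilities for three sampled answers (made up)
samples = [[-0.1, -0.3], [-0.2, -0.2, -0.2, -0.2], [-1.0]]
ln_pe = length_normalised_predictive_entropy(samples)
print(round(ln_pe, 4))  # 0.4667
```

The normalisation keeps long answers from looking more uncertain merely because their joint probability is a product of more factors.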

## Number of semantically distinct answers
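
This section is also still empty. Assuming each sampled answer carries a semantic equivalence class id, in the same format that `semantic_entropy` expects, the measure reduces to counting the distinct ids. The function name is hypothetical.

```python
def n_semantically_distinct(semantic_eq_classes):
    """Number of distinct semantic equivalence classes among the sampled answers."""
    return len(set(semantic_eq_classes))


# Five sampled answers falling into three meaning clusters (made up)
classes = [0, 0, 1, 2, 1]
n_distinct = n_semantically_distinct(classes)
print(n_distinct)  # 3
```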

## Semantic Entropy

In [11]:
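def semantic_entropy(probabilities, semantic_eq_classes):
    """Estimate the semantic entropy from sampled answers.

    probabilities[j] is the probability of answer j and semantic_eq_classes[j]
    is the id of its semantic equivalence class. Returns the negative mean of
    the log class probabilities.
    """
    equivalence_classes = set(semantic_eq_classes)

    probability_eq_class = np.zeros(len(equivalence_classes))

    # Sum the probabilities within each equivalence class and store the log of the sum
    for i, eq_class in enumerate(equivalence_classes):
        prob_sum = np.sum([probabilities[j] for j in range(len(probabilities)) if semantic_eq_classes[j] == eq_class])
        probability_eq_class[i] = np.log(prob_sum)

    return -1 / len(equivalence_classes) * np.sum(probability_eq_class)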
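
As a sanity check, the computation in `semantic_entropy` can be traced by hand on a toy input. The sketch below is a dependency-free restatement of the same steps (sum the probabilities per class, take logs, negate the mean); the name `semantic_entropy_pure` and the toy numbers are assumptions for illustration.

```python
import math
from collections import defaultdict


def semantic_entropy_pure(probabilities, semantic_eq_classes):
    """Negative mean of the log probability mass of each semantic class."""
    class_probability = defaultdict(float)
    for p, c in zip(probabilities, semantic_eq_classes):
        class_probability[c] += p  # total probability mass of class c
    log_probs = [math.log(p) for p in class_probability.values()]
    return -sum(log_probs) / len(log_probs)


# Four sampled answers; the first two share a meaning (made-up numbers)
probabilities = [0.4, 0.3, 0.2, 0.1]
semantic_eq_classes = [0, 0, 1, 2]

# Class masses are 0.7, 0.2, 0.1, so the result is -(ln 0.7 + ln 0.2 + ln 0.1) / 3
se = semantic_entropy_pure(probabilities, semantic_eq_classes)
print(round(se, 4))  # 1.4229
```

If all four answers were assigned to a single class, the class mass would be 1 and the estimate would drop to approximately 0, which matches the intuition that agreeing answers signal low uncertainty.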

## Plots
- Bar plot: x-axis: semantic entropy, normalised entropy, entropy; y-axis: AUROC (page 2)
- Table 2
- Line plot: x-axis: temperature; y-axis: AUROC; 2 lines: semantic entropy, length-normalised entropy
- Line plot: x-axis: number of samples used to estimate the entropy; y-axis: AUROC; lines: semantic entropy, length-normalised entropy, entropy