# Homework 2 Part 3

## Course Name: Large Language Models
#### Lecturers: Dr. Soleimani, Dr. Rohban, Dr. Asgari

---

#### Notebooks Supervised By: Omid Ghahroodi, MohammadAli SadraeiJavaheri
#### Notebook Prepared By: Omid Ghahroodi, MohammadAli SadraeiJavaheri

**Contact**: Ask your questions in Quera

---

### Instructions:
- Complete all exercises presented in this notebook.
- Ensure you run each cell after you've entered your solution.
- After completing the exercises, save the notebook and <font color='red'>follow the submission guidelines provided in the PDF.</font>


---

**Note**: Replace the placeholders (between <font color="green">`## Your code begins ##`</font> and <font color="green">`## Your code ends ##`</font>) with the appropriate details.


# 1. Introduction

This notebook serves as a practical exercise in understanding prompt engineering and calibration within large language models. We will apply these concepts using `phi1.5`, a variant of advanced language models. Our task involves utilizing the `IMDB sentiment dataset`, a popular choice for training and testing language processing capabilities. This dataset, known for its collection of movie reviews, offers a diverse range of emotions and sentiments, making it an ideal tool for this exercise. The goal is to explore how different prompts influence the model's performance in accurately identifying and analyzing sentiments in text, thereby enhancing our comprehension of the nuances in language model calibration and prompt design.

In this exercise, you will explore different prompt choices and examine their effects on the model's performance. Your task is to calculate the calibration of the model for each of the given prompts and then compare these results. To achieve this, you should first implement the Expected Calibration Error (ECE) metric. This metric is crucial for understanding how closely the confidence of the model's predictions aligns with its accuracy. After implementing the ECE metric, calculate and report it for the results obtained from each of the prompts. This will provide valuable insights into the effectiveness of prompt engineering and its impact on model calibration, helping you understand the intricacies of large language model behavior in sentiment analysis tasks

In [None]:
%%capture

!pip install datasets
!pip install transformers
!pip install einops

In [None]:
# Note: Do NOT make changes to this block.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import classification_report
from tqdm import tqdm
import itertools
import torch
import random
import numpy as np
import pandas as pd


SEED=21

np.random.seed(SEED)
random.seed(SEED)

## 1.1 Load Dataset

Because `IMDB sentiment dataset` is large we only evalute using only 1000 samples of it. Important varibles from the cell below are:
- `test_set` the 1000 samples from `IMDB sentiment dataset`
- `pos_samples`, `neg_samples` 3 samples from each class that we will use in section `2.2 Few-shot`
- `calibration_context` samples used for calibration in section `3. Calibration`

In [None]:
# Note: Do NOT make changes to this block.

dataset = load_dataset("imdb")

num_of_test_data = 100

test_set = list(dataset['test'])

data = np.array(test_set[:num_of_test_data]+test_set[-num_of_test_data:])
data = [i for i in data if len(i['text'])<2000]
data = np.array(test_set[:num_of_test_data//2]+test_set[-num_of_test_data//2:])

np.random.shuffle(data)


pos_samples = []
neg_samples = []

for i in range(12400, 12600, 1):
    if len(test_set[i]['text'])<1000:
        if test_set[i]['label'] == 0:
            neg_samples.append(test_set[i]['text'])
        elif test_set[i]['label'] == 1:
            pos_samples.append(test_set[i]['text'])
pos_samples = pos_samples[:3]
neg_samples = neg_samples[:3]

calibration_context = []

for i in range(13000, 16000, 1):
    if len(test_set[i]['text'])<=4000:
        calibration_context.append(test_set[i]['text'])

data[0]

Downloading builder script:   0%|          | 0.00/4.31k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.59k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

{'text': 'This movie had a very unique effect on me: it stalled my realization that this movie REALLY sucks! It is disguised as a "thinker\'s film" in the likes of Memento and other jewels like that, but at the end, and even after a few minutes, you come to realize that this is nothing but utter pretentious cr4p. Probably written by some collage student with friends to compassionate to tell him that his writing sucks. The whole idea is \x85 I don\'t even know if it tried to scratch on the supernatural, or they want us to believe that because someone fills your mind (a very weak one, btw) with stupid "riddles", the kind you learn on elementary school recess, you suddenly come to the "one truth" about everything, then you have to kill someone and confess\x85. !!! What? How, what, why, WHY? Is just like saying that to make a cake, just throw a bunch of ingredients, and add water\x85 forgot about cooking it? I guess these guys forgot to, not explain, but present the mechanism of WHY was th

## 1.2 Load Model and Tokenizer

In [None]:
# Note: Do NOT make changes to this block.

torch.set_default_device("cuda")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)

config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]

configuration_phi.py:   0%|          | 0.00/2.03k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-1_5:
- configuration_phi.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_phi.py:   0%|          | 0.00/33.8k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-1_5:
- modeling_phi.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


pytorch_model.bin:   0%|          | 0.00/2.84G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/69.0 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

## 2. Classification (30 Points)



In the next cell you must complete `classify` implementation. This method can be used to classify a text using language model generation!

In [None]:
from typing import List

def classify(texts: List[str], pos_token: str, neg_token: str) -> List[int]:
    predicted_labels = []
    pos_token_id = tokenizer.encode(pos_token, add_special_tokens=False)[0]
    neg_token_id = tokenizer.encode(neg_token, add_special_tokens=False)[0]
    decoding_tokens = [pos_token_id, neg_token_id]
    for text in texts:
        ## Your code begins ##
        input_ids = tokenizer.encode(text, return_tensors='pt').to(model.device)

        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=1,
            prefix_allowed_tokens_fn=lambda batch_id, context: decoding_tokens  # we force the model to generate between these two tokens
        )
        last_output_id = outputs[0, -1].item()
        ## Your code ends ##
        if last_output_id == pos_token_id:
            predicted_labels.append(1)
        elif last_output_id == neg_token_id:
            predicted_labels.append(0)
        else:
            if not isinstance(last_output_id, int):
                raise ValueError("Convert last_output_id to normal python type (use item method in torch)!")
            raise ValueError(f"A not supported label ({last_output_id}) occured!!!")

    return predicted_labels

## 2.1 Zero-shot settings (effect of label names)

In this section you will classify `data` by just using prompts without any examples. In the next two cel the performance is tested using two different prompts!

In [None]:
from sklearn.metrics import classification_report

pos_token = 'positive'
neg_token = 'negative'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{text}
The sentiment of the above text is: '''

texts = [
    prompt_template.format(
        text=row['text'],
        pos_token=pos_token,
        neg_token=neg_token
    )
    for row in data
]
true_labels = [
    row['label']
    for row in data
]
## Your code begins ##
predicted_labels = classify(texts, pos_token, neg_token)
## Your code ends ##
print(classification_report(y_true=true_labels, y_pred=predicted_labels))

              precision    recall  f1-score   support

           0       1.00      0.04      0.08        50
           1       0.51      1.00      0.68        50

    accuracy                           0.52       100
   macro avg       0.76      0.52      0.38       100
weighted avg       0.76      0.52      0.38       100



In [None]:
pos_token = '1'
neg_token = '0'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} for positive or {neg_token} for negative.
{text}
The sentiment of the above text is: '''

## Your code begins ##
texts = [
    prompt_template.format(
        text=row['text'],
        pos_token=pos_token,
        neg_token=neg_token
    )
    for row in data
]
true_labels = [
    row['label']
    for row in data
]
predicted_labels = classify(texts, pos_token, neg_token)
## Your code ends ##
print(classification_report(y_true=true_labels, y_pred=predicted_labels))

              precision    recall  f1-score   support

           0       0.75      0.06      0.11        50
           1       0.51      0.98      0.67        50

    accuracy                           0.52       100
   macro avg       0.63      0.52      0.39       100
weighted avg       0.63      0.52      0.39       100



## 2.2 Few-shot settings
### 2.2.1 Effect of different few-shot examples

In this section you will add an example for positive and negative label into your prompt. You must compare all 9 results in your report!

In [None]:
pos_token = 'positive'
neg_token = 'negative'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{pos_sample}
The sentiment of the above text is: {pos_token}
{neg_sample}
The sentiment of the above text is: {neg_token}
{text}
The sentiment of the above text is: '''

for pos_sample in pos_samples:
    for neg_sample in neg_samples:
        print(f'Results with:\n{pos_sample=}\n{neg_sample=}')
        ## Your code begins ##
        texts = [
            prompt_template.format(
                pos_sample=pos_sample,
                neg_sample=neg_sample,
                text=row['text'],
                pos_token=pos_token,
                neg_token=neg_token
            )
            for row in data
        ]
        true_labels = [
            row['label']
            for row in data
        ]
        predicted_labels = classify(texts, pos_token, neg_token)
        ## Your code ends ##
        print(classification_report(y_true=true_labels, y_pred=predicted_labels))
        print("=====================================")

Results with:
pos_sample="Previous reviewer Claudio Carvalho gave a much better recap of the film's plot details than I could. What I recall mostly is that it was just so beautiful, in every sense - emotionally, visually, editorially - just gorgeous.<br /><br />If you like movies that are wonderful to look at, and also have emotional content to which that beauty is relevant, I think you will be glad to have seen this extraordinary and unusual work of art.<br /><br />On a scale of 1 to 10, I'd give it about an 8.75. The only reason I shy away from 9 is that it is a mood piece. If you are in the mood for a really artistic, very romantic film, then it's a 10. I definitely think it's a must-see, but none of us can be in that mood all the time, so, overall, 8.75."
neg_sample='Shame Shame Shame on UA/DW for what you do! <br /><br />I was appalled. <br /><br />Do NOT take kids to see this movie. The humor is totally inappropriate for children - plus they\'ll be bored and disappointed. Certain

### 2.2.2 Effect of the order of few-shot examples

The sequence order is critical in in-context few-shot learning for Large Language Models (LLMs). In the upcoming section, we will delve into this by conducting tests with three distinct samples. Using these samples, we have the potential to examine six different permutations to understand this learning approach better.

In [None]:

import itertools
from sklearn.metrics import classification_report

pos_token = 'positive'
neg_token = 'negative'

sample_template = '''
{text}
The sentiment of the above text is: {label}'''

prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{samples}
{text}
The sentiment of the above text is: '''

samples_list = [
    sample_template.format(text=pos_samples[0], label=pos_token),
    sample_template.format(text=pos_samples[1], label=pos_token),
    sample_template.format(text=neg_samples[0], label=neg_token)
]

for permutation_indexes in itertools.permutations(range(len(samples_list))):
    print(f'Results with Permutation {permutation_indexes}')
    samples_permuted = [samples_list[idx] for idx in permutation_indexes]
    samples = ''.join(samples_permuted)
    ## Your code begins ##
    texts = [
        prompt_template.format(
            samples=samples,
            text=row["text"],
            pos_token=pos_token,
            neg_token=neg_token
        )
        for row in data
    ]
    true_labels = [row['label'] for row in data]
    predicted_labels = classify(texts, pos_token, neg_token)
    ## Your code ends ##
    print(classification_report(y_true=true_labels, y_pred=predicted_labels))
    print("=====================================")


Results with Permutation (0, 1, 2)
              precision    recall  f1-score   support

           0       0.75      0.92      0.83        50
           1       0.90      0.70      0.79        50

    accuracy                           0.81       100
   macro avg       0.83      0.81      0.81       100
weighted avg       0.83      0.81      0.81       100

Results with Permutation (0, 2, 1)
              precision    recall  f1-score   support

           0       0.91      0.58      0.71        50
           1       0.69      0.94      0.80        50

    accuracy                           0.76       100
   macro avg       0.80      0.76      0.75       100
weighted avg       0.80      0.76      0.75       100

Results with Permutation (1, 0, 2)
              precision    recall  f1-score   support

           0       0.90      0.86      0.88        50
           1       0.87      0.90      0.88        50

    accuracy                           0.88       100
   macro avg       0.88

# 3. Calibration (50 Points)

In this section, you will calibrate the large language model using the methods that reviewed in class.

For prompt use the zero-shot setting with positive and negative labels.

### Calibrate before Use

In this part, you should use the method of "the Calibrate before Use" paper which was discussed in class, and get the calibration coefficients of the positive and negative labels, then combine it with your model and report metrics. You can read this paper in [this link](https://arxiv.org/abs/2102.09690).

In [None]:
pos_prob_calibration = 0
neg_prob_calibration = 0

from scipy.optimize import minimize
from torch.nn.functional import softmax
from typing import Tuple
import numpy as np
from sklearn.calibration import CalibratedClassifierCV

prompt_cfi = '''
What is the sentiment of the following text? Choose between positive or negative.
N/A
The sentiment of the above text is: '''


def calibrate_model(pos_token, neg_token):
    pos_token_id = tokenizer.encode(pos_token, add_special_tokens=False)[0]
    neg_token_id = tokenizer.encode(neg_token, add_special_tokens=False)[0]
    decoding_tokens = [pos_token_id, neg_token_id]

    # Initialize the calibration coefficients
    pos_prob_calibration = 0
    neg_prob_calibration = 0

    input_ids = tokenizer.encode(prompt_cfi, return_tensors='pt')

    # Feed the content-free input into the model
    with torch.no_grad():
        logits = model(input_ids).logits

    output_distribution = softmax(logits, dim=-1)

    pos_prob = output_distribution[0, -1, decoding_tokens.index(pos_token_id)].item()
    neg_prob = output_distribution[0, -1, decoding_tokens.index(neg_token_id)].item()

    # Calculate the calibration coefficients
    prob_sum = pos_prob + neg_prob
    pos_prob /= prob_sum
    neg_prob /= prob_sum

    pos_prob_calibration = 0.5 / pos_prob
    neg_prob_calibration = 0.5 / neg_prob

    return pos_prob_calibration, neg_prob_calibration

# Get the calibration coefficients
pos_prob_calibration, neg_prob_calibration = calibrate_model(pos_token, neg_token)

print(f'Positive prob: {pos_prob_calibration}')
print(f'Negative prob: {neg_prob_calibration}')


Positive prob: 0.5053957446393953
Negative prob: 46.832807926955624


In [None]:
pos_token = 'positive'
neg_token = 'negative'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{text}
The sentiment of the above text is: '''

texts = [
    prompt_template.format(
        text=row['text'],
        pos_token=pos_token,
        neg_token=neg_token
    )
    for row in data
]
true_labels = [
    row['label']
    for row in data
]

## Your code begins ##

from typing import List, Tuple

def classify_with_calibration(texts: List[str], pos_token: str, neg_token: str, pos_prob_calibration: float, neg_prob_calibration: float) -> Tuple[List[int], List[float]]:
    predicted_labels = []
    certainties = []
    pos_token_id = tokenizer.encode(pos_token, add_special_tokens=False)[0]
    neg_token_id = tokenizer.encode(neg_token, add_special_tokens=False)[0]
    decoding_tokens = [pos_token_id, neg_token_id]
    for text in texts:
        input_ids = tokenizer.encode(text, return_tensors='pt')

        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=1,
            prefix_allowed_tokens_fn=lambda batch_id, context: decoding_tokens  # we force the model to generate between these two tokens
        )
        last_output_id = outputs[0, -1].item()

        with torch.no_grad():
            logits = model(input_ids).logits

        output_distribution = softmax(logits, dim=-1)

        pos_prob = output_distribution[0, -1, decoding_tokens.index(pos_token_id)].item()
        neg_prob = output_distribution[0, -1, decoding_tokens.index(neg_token_id)].item()

        # Apply calibration
        if last_output_id == pos_token_id:
            predicted_label = 1 if pos_prob_calibration * pos_prob >= neg_prob_calibration * neg_prob else 0
            certainty = pos_prob_calibration * pos_prob
        elif last_output_id == neg_token_id:
            predicted_label = 0 if neg_prob_calibration * neg_prob >= pos_prob_calibration * pos_prob else 1
            certainty = neg_prob_calibration * neg_prob
        else:
            raise ValueError(f"A not supported label ({last_output_id}) occured!!!")

        predicted_labels.append(predicted_label)
        certainties.append(certainty)

    return predicted_labels, certainties


predicted_labels_calibrated, certainties_calibrated = classify_with_calibration(texts, pos_token, neg_token, pos_prob_calibration, neg_prob_calibration)

predicted_labels = predicted_labels_calibrated.copy()
## Your code ends ##

print(classification_report(y_true=true_labels, y_pred=predicted_labels))

              precision    recall  f1-score   support

           0       1.00      0.04      0.08        50
           1       0.51      1.00      0.68        50

    accuracy                           0.52       100
   macro avg       0.76      0.52      0.38       100
weighted avg       0.76      0.52      0.38       100



### Mitigating label biases for in-context learning

In this part, you should use the method of "Mitigating label biases for in-context learning" paper which was discussed in class, and get the calibration coefficients of the positive and negative labels, then combine it with your model and report metrics.

Use `calibration_context` list for context and consider `T = 1000`

In [None]:
T = 1000

## Your code begins ##

from scipy.special import log_softmax
from sklearn.preprocessing import normalize

def estimate_label_bias(pos_token, neg_token):
    pos_prob_calibration = 0
    neg_prob_calibration = 0

    pos_token_id = tokenizer.encode(pos_token, add_special_tokens=False)[0]
    neg_token_id = tokenizer.encode(neg_token, add_special_tokens=False)[0]
    decoding_tokens = [pos_token_id, neg_token_id]

    for i in range(T):
        # Select a random context from the calibration_context list
        context = random.choice(calibration_context)

        input_ids = tokenizer.encode(context, return_tensors='pt')

        with torch.no_grad():
            logits = model(input_ids).logits

        output_distribution = softmax(logits, dim=-1)

        pos_prob = output_distribution[0, -1, decoding_tokens.index(pos_token_id)].item()
        neg_prob = output_distribution[0, -1, decoding_tokens.index(neg_token_id)].item()

        # Update the calibration coefficients
        pos_prob_calibration += pos_prob
        neg_prob_calibration += neg_prob

        return pos_prob_calibration, neg_prob_calibration

pos_prob_calibration, neg_prob_calibration = estimate_label_bias(pos_token, neg_token)

## Your code ends ##

pos_prob_calibration /= T
neg_prob_calibration /= T

print(f'Positive prob: {pos_prob_calibration}')
print(f'Negative prob: {neg_prob_calibration}')


Positive prob: 1.6298395348712803e-07
Negative prob: 5.2711278840433803e-08


In [None]:
pos_token = 'positive'
neg_token = 'negative'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{text}
The sentiment of the above text is: '''

texts = [
    prompt_template.format(
        text=row['text'],
        pos_token=pos_token,
        neg_token=neg_token
    )
    for row in data
]
true_labels = [
    row['label']
    for row in data
]

## Your code begins ##
from typing import List, Tuple

def classify_with_domain_context_calibration(texts: List[str], pos_token: str, neg_token: str, pos_prob_calibration: float, neg_prob_calibration: float) -> Tuple[List[int], List[float]]:
    predicted_labels = []
    certainties = []
    pos_token_id = tokenizer.encode(pos_token, add_special_tokens=False)[0]
    neg_token_id = tokenizer.encode(neg_token, add_special_tokens=False)[0]
    decoding_tokens = [pos_token_id, neg_token_id]
    for text in texts:
        input_ids = tokenizer.encode(text, return_tensors='pt')

        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=1,
            prefix_allowed_tokens_fn=lambda batch_id, context: decoding_tokens  # we force the model to generate between these two tokens
        )
        last_output_id = outputs[0, -1].item()

        with torch.no_grad():
            logits = model(input_ids).logits

        output_distribution = softmax(logits, dim=-1)

        pos_prob = output_distribution[0, -1, decoding_tokens.index(pos_token_id)].item()
        neg_prob = output_distribution[0, -1, decoding_tokens.index(neg_token_id)].item()

        # Apply calibration
        if last_output_id == pos_token_id:
            predicted_label = 1 if pos_prob_calibration * pos_prob >= neg_prob_calibration * neg_prob else 0
            certainty = pos_prob_calibration * pos_prob
        elif last_output_id == neg_token_id:
            predicted_label = 0 if neg_prob_calibration * neg_prob >= pos_prob_calibration * pos_prob else 1
            certainty = neg_prob_calibration * neg_prob
        else:
            raise ValueError(f"A not supported label ({last_output_id}) occured!!!")

        predicted_labels.append(predicted_label)
        certainties.append(certainty)

    return predicted_labels, certainties


predicted_labels_dc, certainties_dc = classify_with_domain_context_calibration(texts, pos_token, neg_token, pos_prob_calibration, neg_prob_calibration)

predicted_labels = predicted_labels_calibrated.copy()
## Your code ends ##

print(classification_report(y_true=true_labels, y_pred=predicted_labels))

              precision    recall  f1-score   support

           0       1.00      0.04      0.08        50
           1       0.51      1.00      0.68        50

    accuracy                           0.52       100
   macro avg       0.76      0.52      0.38       100
weighted avg       0.76      0.52      0.38       100



## ECE (20 Points)

ECE stands for Expected Calibration Error. It is a metric used to evaluate the calibration of probabilistic predictions made by a machine learning model.

The Expected Calibration Error measures the average difference between the predicted confidence (probability) and the true accuracy across different confidence levels.
ECE is calculated by dividing the confidence interval into smaller bins and computing the average difference between the predicted accuracy and the true accuracy within each bin. It provides a quantitative measure of how well a model's predicted probabilities align with the actual outcomes. Lower values of ECE indicate better calibration, while higher values indicate greater miscalibration.

To calculate the ECE follow these steps:

1- Divide the predictions into different confidence bins.

2- Calculate the average confidence and accuracy for each bin. Confidence can be defined as the mean predicted probability within each bin, and accuracy can be calculated as the proportion of correct predictions within each bin.

3- Compute the difference between the average confidence and accuracy for each bin.

4- Weight the differences by the fraction of examples in each bin to obtain the weighted difference for each bin and sum up the weighted differences across all bins to get the ECE.

Here is a general formula to calculate ECE:
$$
\text{ECE} = \sum \left( \left| \text{Accuracy}_i - \text{Confidence}_i \right| \times \frac{N_i}{N} \right)
$$
You should implement this metric in the following cell.

In [None]:
import numpy as np

def ECE(output, ground_truth, bins=15):
    # Ensure the inputs are numpy arrays
    output = np.asarray(output)
    ground_truth = np.asarray(ground_truth)

    # Initialize ECE score
    ece = 0.0

    # Define bin edges and bin labels
    bin_edges = np.linspace(0, 1, bins + 1)
    bin_lowers = bin_edges[:-1]
    bin_uppers = bin_edges[1:]

    # Iterate through each bin to calculate the ECE
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        # Find the indices of probabilities within this bin
        in_bin = np.where((output > bin_lower) & (output <= bin_upper))[0]

        if len(in_bin) > 0:
            # Calculate average confidence and accuracy in the bin
            bin_confidence = np.mean(output[in_bin])
            bin_accuracy = np.mean(ground_truth[in_bin])

            # Calculate the weighted difference
            bin_weight = len(in_bin) / len(output)
            ece += np.abs(bin_confidence - bin_accuracy) * bin_weight

    return ece

In the following cell, calculate the ECE for the two calibration methods you implemented.

In [None]:
## Your code begins ##

pos_prob_cc, neg_prob_cc = calibrate_model(pos_token, neg_token)

predicted_labels_calibrated, certainties_calibrated = classify_with_calibration(
    texts,
    pos_token,
    neg_token,
    pos_prob_cc,
    neg_prob_cc
)


pos_prob_dc, neg_prob_dc = estimate_label_bias(pos_token, neg_token)

predicted_labels_bias, certainties_bias = classify_with_domain_context_calibration(
    texts,
    pos_token,
    neg_token,
    pos_prob_dc,
    neg_prob_dc
)

# Calculate ECE for the 'Calibrate before Use' method using the unmodified calibration
ece_calibrated = ECE(certainties_calibrated, true_labels)
print(f'ECE for Calibrate before Use: {ece_calibrated}')

# Calculate ECE for the 'Mitigating label biases for in-context learning' method using the bias mitigating calibration
ece_bias = ECE(certainties_bias, true_labels)
print(f'ECE for Mitigating label biases for in-context learning: {ece_bias}')

## Your code ends ##


ECE for Calibrate before Use: 0.4999996043545509
ECE for Mitigating label biases for in-context learning: 0.49999999943896356
