# Dependencies

Installing FULL also installs `transformers` and `torch` if they are not already installed.

In [1]:
!pip install full tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting full
  Downloading full-0.0.2.9-py3-none-any.whl (3.9 kB)
Collecting transformers
  Downloading transformers-4.21.3-py3-none-any.whl (4.7 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers, full
Successfully installed full-0.0.2.9 huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.21.3


# Data & Model
To replicate the results of our paper, we first need to download the data.

In [2]:
!wget https://raw.githubusercontent.com/maximedb/full/master/data/fed_data.json

--2022-09-09 08:08:30--  https://raw.githubusercontent.com/maximedb/full/master/data/fed_data.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 832080 (813K) [text/plain]
Saving to: ‘fed_data.json’


2022-09-09 08:08:30 (14.7 MB/s) - ‘fed_data.json’ saved [832080/832080]



In [3]:
import json

with open("fed_data.json") as f:
    fed_data = json.load(f)
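
The file mixes turn-level and dialog-level records, and the cells further down tell them apart by whether a `response` field is present. The sketch below shows that check on two made-up records. Only the field names (`context`, `response`, `annotations`, `Overall`) come from the code in this notebook; the values and the `level` helper are assumptions for illustration.

```python
from collections import Counter

# Hypothetical stand-in records: the field names mirror what this notebook
# reads, the values are invented.
sample = [
    {
        "context": "User: Hi!\nSystem: Hello, how are you?",
        "response": "System: Fine, thanks.",
        "annotations": {"Overall": [4, 3, 4]},
    },
    {
        "context": "User: Hi!\nSystem: Hello.\nUser: Bye.",
        "annotations": {"Overall": [2, 3]},
    },
]

def level(example):
    """Label a record as turn-level (has a response) or dialog-level."""
    return "turn" if example.get("response") is not None else "dialog"

counts = Counter(level(example) for example in sample)
print(counts)
```

Running the same `Counter` expression over `fed_data` should show how many records of each kind the file contains.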

## Model

Loading `FULL` without any parameters will default to the paper configuration. It will use the GPU if it is available.

In [4]:
from full import FULL

eval_model = FULL()

Downloading config.json:   0%|          | 0.00/1.54k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/696M [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/1.13k [00:00<?, ?B/s]

Downloading vocab.json:   0%|          | 0.00/124k [00:00<?, ?B/s]

Downloading merges.txt:   0%|          | 0.00/61.4k [00:00<?, ?B/s]

Downloading added_tokens.json:   0%|          | 0.00/16.0 [00:00<?, ?B/s]

Downloading special_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

# Turn Evaluations

Compute the FULL evaluation score for each turn-level example in the dataset.

In [5]:
import tqdm

annotations = []
evaluations = []

for example in tqdm.tqdm(fed_data):
    conversation = example["context"]
    response = example.get("response")
    conversation = conversation.split("\n")
    conversation = [s.replace("User: ","").replace("System: ", "").strip() for s in conversation]
    if response is None:
        # this is a conversation data point, not a turn data point
        continue
    response = response.replace("User: ","").replace("System: ", "").strip()
    model_evaluation = eval_model.evaluate_turn(conversation, response)
    mean_annotation = sum(example["annotations"]["Overall"]) / len(example["annotations"]["Overall"])
    annotations.append(mean_annotation)
    evaluations.append(model_evaluation)

100%|██████████| 500/500 [00:33<00:00, 15.02it/s]
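
The preprocessing in the loop above can be read in isolation: split the context on newlines, strip the `User: ` and `System: ` speaker labels, and average the `Overall` annotations. A minimal sketch of those steps follows. The helper names `clean` and `parse_context` and the example record are assumptions introduced here, not part of the `full` package.

```python
from statistics import mean

def clean(utterance):
    """Drop the speaker labels and surrounding whitespace, as the loop above does."""
    return utterance.replace("User: ", "").replace("System: ", "").strip()

def parse_context(context):
    """Split a newline-separated context into a list of cleaned utterances."""
    return [clean(line) for line in context.split("\n")]

# Hypothetical record with the same fields the loop reads.
example = {
    "context": "User: Hi!\nSystem: Hello, how are you?",
    "response": "User: Fine, thanks.",
    "annotations": {"Overall": [4, 3, 5]},
}

history = parse_context(example["context"])
response = clean(example["response"])
target = mean(example["annotations"]["Overall"])
print(history, response, target)
```

Like the original loop, `clean` removes the labels wherever they occur in the string, not only at the start of a line.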


Compute the Spearman correlation between the annotations and the FULL evaluations.

In [6]:
import scipy.stats

scipy.stats.spearmanr(annotations, evaluations)

SpearmanrResult(correlation=0.5057430648107026, pvalue=9.578268322476446e-26)
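
Spearman correlation is the Pearson correlation of the ranks of the two lists, so it measures how well the model orders the examples rather than how close its scores are to the annotations. The pure-Python sketch below illustrates that definition, with tied values sharing their average rank. It is not SciPy's implementation and does not compute a p-value; the function names are assumptions chosen for this example.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

print(spearman([1, 2, 3], [10, 100, 1000]))  # perfectly monotonic
print(spearman([1, 2, 3], [1, 3, 2]))        # one swapped pair
```

Because only the ordering matters, any monotonic rescaling of the FULL scores would leave the reported correlation unchanged.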

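# Dialog Evaluations

Compute the FULL evaluation score for each dialog-level example, i.e. each record without a `response`, and correlate it with the annotations as before.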

In [7]:
import tqdm

annotations = []
evaluations = []

for example in tqdm.tqdm(fed_data):
    conversation = example["context"]
    response = example.get("response")
    conversation = conversation.split("\n")
    conversation = [s.replace("User: ","").replace("System: ", "").strip() for s in conversation]
    if response is not None:
        continue
    model_evaluation = eval_model.evaluate_conversation(conversation)
    mean_annotation = sum(example["annotations"]["Overall"]) / len(example["annotations"]["Overall"])
    annotations.append(mean_annotation)
    evaluations.append(model_evaluation)

100%|██████████| 500/500 [00:08<00:00, 57.08it/s]


In [8]:
import scipy.stats

scipy.stats.spearmanr(annotations, evaluations)

SpearmanrResult(correlation=0.6947114632271221, pvalue=2.5688360539178383e-19)