### **Toward Consistent, Verifiable, and Coherent Commonsense Reasoning in Large LMs**

This notebook provides source code for our two papers in Findings of EMNLP 2021:


1.  Shane Storks, Qiaozi Gao, Yichi Zhang, and Joyce Y. Chai (2021). *Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding.* Findings of EMNLP 2021.
2.   Shane Storks and Joyce Y. Chai (2021). *Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers.* Findings of EMNLP 2021.

*If you have any questions or problems, please open an issue on our [GitHub repo](https://github.com/sled-group/Verifiable-Coherent-NLU) or email Shane Storks.*

***First, configure the execution mode by selecting a few settings (expand cell if needed):***




   0. (Colab only) Insert the path in your Google Drive to the folder where this notebook is located.

In [1]:
DRIVE_PATH = '.'

1.   Model type (choose from BERT large, RoBERTa large, RoBERTa large + MNLI, DeBERTa base, and DeBERTa large).






In [2]:
# mode = 'bert' # BERT large
# mode = 'roberta' # RoBERTa large
# mode = 'roberta_mnli' # RoBERTa large pre-trained on MNLI
# mode = 'deberta' # DeBERTa base for training on TRIP
# mode = 'deberta_large' # DeBERTa large for training on CE and ART

mode = 'electra'

2.   Name of the task we want to train or evaluate on. Set `debug` to `True` to run quick training/evaluation jobs on only a small amount of data.

In [3]:
task_name = 'trip'
# task_name = 'ce'
# task_name = 'art'

debug = False

3.   (If training models) Training batch size, learning rate, and maximum number of epochs. Settings for results in the paper are provided as examples.

In [4]:
config_batch_size = 1
config_lr = 1e-5 # Selected learning rate for best RoBERTa-based model in TRIP paper
config_epochs = 10

4.   (For training TRIP models only) Configure the loss weighting scheme for training models here. We provide the 4 modes from the paper as examples.


In [5]:
# Loss weights for (attributes, preconditions, effects, conflicts, story choices)
if task_name != 'trip':
  print("We do not need a loss weighting scheme for %s dataset. Ignoring this cell." % task_name)
# loss_weights = [0.0, 0.4, 0.4, 0.1, 0.1] # "All losses"
loss_weights = [0.0, 0.4, 0.4, 0.2, 0.0] # "Omit story choice loss"
# loss_weights = [0.0, 0.4, 0.4, 0.0, 0.2] # "Omit conflict detection loss"
# loss_weights = [0.0, 0.0, 0.0, 0.5, 0.5] # "Omit state classification losses"

   5. (If evaluating models) Provide the name of the pre-trained model directory here. This should be the name of a directory within the *saved_models* directory, which should be located where this notebook is. Names of provided pre-trained model directories are listed.

In [6]:
# TRIP, all losses
# eval_model_dir = 'bert-large-uncased_cloze_1_5e-06_4_0.0-0.4-0.4-0.1-0.1_tiered_pipeline_ablate_attributes_states-logits'
# eval_model_dir = 'roberta-large_cloze_1_1e-05_7_0.0-0.4-0.4-0.1-0.1_tiered_pipeline_ablate_attributes_states-logits'
# eval_model_dir = 'microsoft-deberta-base_cloze_1_5e-06_5_0.0-0.4-0.4-0.1-0.1_tiered_pipeline_ablate_attributes_states-logits'

# TRIP, no story classification loss
# eval_model_dir = 'bert-large-uncased_cloze_1_5e-05_8_0.0-0.4-0.4-0.2-0.0_tiered_pipeline_ablate_attributes_states-logits'
# eval_model_dir = 'roberta-large_cloze_1_1e-05_5_0.0-0.4-0.4-0.2-0.0_tiered_pipeline_lc_ablate_attributes_states-logits' # Best model trained in the TRIP paper
# eval_model_dir = 'microsoft-deberta-base_cloze_1_5e-05_5_0.0-0.4-0.4-0.2-0.0_tiered_pipeline_ablate_attributes_states-logits'

# eval_model_dir = 'google-electra-large-discriminator_cloze_1_1e-05_6_0.0-0.4-0.4-0.2-0.0_tiered_pipeline_lc_ablate_attributes_states-logits'


# TRIP, no conflict detection loss
# eval_model_dir = 'bert-large-uncased_cloze_1_1e-06_1_0.0-0.4-0.4-0.0-0.2_tiered_pipeline_ablate_attributes_states-logits'
# eval_model_dir = 'roberta-large_cloze_1_5e-06_8_0.0-0.4-0.4-0.0-0.2_tiered_pipeline_ablate_attributes_states-logits'
# eval_model_dir = 'microsoft-deberta-base_cloze_1_1e-06_3_0.0-0.4-0.4-0.0-0.2_tiered_pipeline_ablate_attributes_states-logits'

# TRIP, no physical state classification loss
# eval_model_dir = 'bert-large-uncased_cloze_1_1e-05_3_0.0-0.0-0.0-0.5-0.5_tiered_pipeline_ablate_attributes_states-logits'
# eval_model_dir = 'roberta-large_cloze_1_1e-06_7_0.0-0.0-0.0-0.5-0.5_tiered_pipeline_ablate_attributes_states-logits'
# eval_model_dir = 'microsoft-deberta-base_cloze_1_5e-06_9_0.0-0.0-0.0-0.5-0.5_tiered_pipeline_ablate_attributes_states-logits'

# CE
# eval_model_dir = 'bert-large-uncased_ConvEnt_32_7.5e-06_7_xval'
# eval_model_dir = 'roberta-large_ConvEnt_32_7.5e-06_9_xval'
# eval_model_dir = 'roberta-large-mnli_ConvEnt_32_7.5e-06_7_xval'
# eval_model_dir = 'microsoft-deberta-large_ConvEnt_16_1e-05_9_xval'

# ART
# eval_model_dir = 'bert-large-uncased_art_64_5e-06_8'
# eval_model_dir = 'roberta-large_art_64_2.5e-06_4'
# eval_model_dir = 'DeBERTa-deberta-large_art_32_1e-06_8'

**For more configuration options, scroll down to the Train Models > Configure Hyperparameters cell for the task you're working on.**

# Setup
Run this block every time when starting up the notebook. It will get Colab ready, preprocess the data, and load model packages and classes we'll need later. May take several minutes to run for the first time.

**If you get a `ModuleNotFoundError` for the `www` code base, try the following:**


1.   Ensure the DRIVE_PATH is set properly above.
2.   (Colab only) Verify that this notebook has access to your Google Drive (click the folder icon on the left and then the Google Drive icon).
2.   Try to restart the runtime and refresh your browser window.
2.   (Colab only) If the problem persists, revoke access to Google Drive and re-enable it.





## Colab Setup

Enable auto reloading of code libraries from Google Drive, set up connection to Google Drive, and import some packages. 🔌

In [7]:
%load_ext autoreload
%autoreload 2

In [8]:
import os
import json
import sys
import torch
import random
import numpy as np
import spacy
!pip install jsonlines

sys.path.append(DRIVE_PATH)

You should consider upgrading via the '/home/panqp/595/project/env/bin/python3 -m pip install --upgrade pip' command.[0m[33m
[0m

## Model Setup

Next, we'll load up the transformer model, tokenizer, etc. ⏳

### Install HuggingFace transformers and other dependencies

In [9]:
!pip install 'transformers==4.2.2'
!pip install sentencepiece
!pip3 install torch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2
!pip install deberta

You should consider upgrading via the '/home/panqp/595/project/env/bin/python3 -m pip install --upgrade pip' command.[0m[33m
You should consider upgrading via the '/home/panqp/595/project/env/bin/python3 -m pip install --upgrade pip' command.[0m[33m
You should consider upgrading via the '/home/panqp/595/project/env/bin/python3 -m pip install --upgrade pip' command.[0m[33m
You should consider upgrading via the '/home/panqp/595/project/env/bin/python3 -m pip install --upgrade pip' command.[0m[33m
[0m

### Get Model Components

Specify which model parameters from transformers we want to use:

In [10]:
if task_name in ['trip', 'ce']:
  multiple_choice = False
elif task_name == 'art':
  multiple_choice = True
else:
  raise ValueError("Task name should be set to 'trip', 'ce', or 'art' in the first cell of the notebook!")

if mode == 'bert':
  model_name = 'bert-large-uncased'
elif mode == 'roberta':
  model_name = 'roberta-large'
elif mode == 'roberta_mnli':
  model_name = 'roberta-large-mnli'
elif mode == 'deberta':
  model_name = 'microsoft/deberta-base'
elif mode == 'deberta_large':
  model_name = 'microsoft/deberta-large'
elif mode == 'electra':
  model_name = 'google/electra-large-discriminator'

Load the tokenizer:

In [11]:
from transformers import BertTokenizer, RobertaTokenizer, DebertaTokenizer, AlbertTokenizer, T5Tokenizer, GPT2Tokenizer
from transformers import AutoTokenizer
from DeBERTa import deberta

if mode in ['bert']:
  tokenizer_class = BertTokenizer
elif mode in ['roberta', 'roberta_mnli']:
  tokenizer_class = RobertaTokenizer
elif mode in ['deberta', 'deberta_large']:
  tokenizer_class = DebertaTokenizer


if mode not in ['electra']:
  tokenizer = tokenizer_class.from_pretrained(model_name, 
                                                do_lower_case = False, 
                                                cache_dir=os.path.join(DRIVE_PATH, 'cache'))
else:
  tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case = False, 
                                                cache_dir=os.path.join(DRIVE_PATH, 'cache'))

Load the model and optimizer:



In [12]:
from transformers import BertForSequenceClassification, RobertaForSequenceClassification, DebertaForSequenceClassification, AlbertForSequenceClassification, AdamW
from transformers import BertForMultipleChoice, RobertaForMultipleChoice, AlbertForMultipleChoice, DebertaModel
from transformers import BertModel, RobertaModel, AlbertModel, DebertaModel, T5Model, T5EncoderModel, GPT2Model
from transformers import RobertaForMaskedLM
from transformers import BertConfig, RobertaConfig, DebertaConfig, AlbertConfig, T5Config, GPT2Config
from transformers import ElectraForSequenceClassification, ElectraConfig, ElectraModel
from www.model.transformers_ext import DebertaForMultipleChoice
from torch.optim import Adam
if not multiple_choice:
  if mode == 'bert':
    model_class = BertForSequenceClassification
    config_class = BertConfig
    emb_class = BertModel
  elif mode in ['roberta', 'roberta_mnli']:
    model_class = RobertaForSequenceClassification
    config_class = RobertaConfig
    emb_class = RobertaModel
    lm_class = RobertaForMaskedLM
  elif mode in ['deberta', 'deberta_large']:
    model_class = DebertaForSequenceClassification
    config_class = DebertaConfig
    emb_class = DebertaModel
  elif mode in ['electra']:
    model_class = ElectraForSequenceClassification
    config_class = ElectraConfig
    emb_class = ElectraModel
else:
  if mode == 'bert':
    model_class = BertForMultipleChoice
    config_class = BertConfig
    emb_class = BertModel    
  elif mode in ['roberta', 'roberta_mnli']:
    model_class = RobertaForMultipleChoice
    config_class = RobertaConfig
    emb_class = RobertaModel
    lm_class = RobertaForMaskedLM
  elif mode in ['deberta', 'deberta_large']:
    model_class = DebertaForMultipleChoice
    config_class = DebertaConfig
    emb_class = DebertaModel

## Data Setup

Preprocess the dataset.

### Preprocessing

Construct the dataset from the .txt files collected from AMT. Save a backup copy in Drive.

In [13]:
from www.utils import print_dict

partitions = ['train', 'dev', 'test']
subtasks = ['cloze', 'order']

# We can split the data into multiple json files later
data_file = os.path.join(DRIVE_PATH, 'all_data/www.json')
with open(data_file, 'r') as f:
  dataset = json.load(f)

print('Preprocessed examples:')
for ex_idx in [0,1,5,10]:
  ex = dataset['dev'][list(dataset['dev'].keys())[ex_idx]]
  print_dict(ex)

Preprocessed examples:
{
  story_id: 
    13,
  worker_id: 
    A32W24TWSWXW,
  type: 
    None,
  idx: 
    None,
  aug: 
    False,
  actor: 
    John,
  location: 
    kitchen,
  objects: 
    cabinet, counter, knife, pan, potato, pizza,
  sentences: 
    [
      John was getting the snacks ready for the party.
      John opened the cabinet, took out a pan and put it on the counter.
      John opened the fridge and got out the pizza.
      John put the pizza on the pan and put them into the oven.
      John took a knife and cut the hot pizza in eight slices.
    ],
  length: 
    5,
  example_id: 
    13,
  plausible: 
    True,
  breakpoint: 
    -1,
  confl_sents: 
    [],
  confl_pairs: 
    [],
  states: 
    [
      {'h_location': [['John', 0]], 'conscious': [['John', 2]], 'wearing': [['John', 0]], 'h_wet': [['John', 0]], 'hygiene': [['John', 0]], 'location': [['snacks', 0], ['party', 0]], 'exist': [['snacks', 4], ['party', 2]], 'clean': [['snacks', 0], ['party', 0]], 'power': 

### Data Filtering and Sampling
Since there is a big imbalance between plausible/implausible class labels, we will upsample the plausible stories.

For now, we will also break the dataset into two sub-datasets: cloze and ordering.



In [14]:
cloze_dataset = {p: [] for p in dataset}
order_dataset = {p: [] for p in dataset}

for p in dataset:
  for exid in dataset[p]:
    ex = dataset[p][exid]

    if ex['type'] == None:
      continue
    
    ex_plaus = dataset[p][str(ex['story_id'])]

    if ex['type'] == 'cloze':
      cloze_dataset[p].append(ex)
      cloze_dataset[p].append(ex_plaus) # For every implausible story, add a copy of its corresponding plausible story

    # Exclude augmented ordering examples from dev and test, since the breakpoints aren't always accurate in those
    elif ex['type'] == 'order' and not (p != 'train' and ex['aug']): 
      order_dataset[p].append(ex)
      order_dataset[p].append(ex_plaus)



### Convert TRIP to Two-Story Classification Task

Ready the TRIP dataset for two-story classification.

In [15]:
from www.utils import print_dict
import json
from collections import Counter

data_file = os.path.join(DRIVE_PATH, 'all_data/www_2s_new.json')
with open(data_file, 'r') as f:
  cloze_dataset_2s, order_dataset_2s = json.load(f)  

for p in cloze_dataset_2s:
  label_dist = Counter([ex['label'] for ex in cloze_dataset_2s[p]])
  print('Cloze label distribution (%s):' % p)
  print(label_dist.most_common())
print_dict(cloze_dataset_2s['train'][0])

Cloze label distribution (train):
[(1, 400), (0, 399)]
Cloze label distribution (dev):
[(0, 161), (1, 161)]
Cloze label distribution (test):
[(1, 176), (0, 175)]
{
  example_id: 
    0-C0,
  stories: 
    [
      {'story_id': 0, 'worker_id': 'A1F01FVEPYCPHO', 'type': 'cloze', 'idx': 0, 'aug': False, 'actor': 'Tom', 'location': 'kitchen', 'objects': 'dustbin, microwave, pan, plate, cereal, soup', 'sentences': ['Tom bought a new dustbin for the kitchen.', 'Tom threw a broken plate in the dustbin.', 'Tom got some soup from the fridge.', 'Tom put the soup in the microwave.', 'Tom ate the cold soup.'], 'length': 5, 'example_id': '0-C0', 'plausible': False, 'breakpoint': 4, 'confl_sents': [3], 'confl_pairs': [[3, 4]], 'states': [{'h_location': [['Tom', 0]], 'conscious': [['Tom', 2]], 'wearing': [['Tom', 0]], 'h_wet': [['Tom', 0]], 'hygiene': [['Tom', 0]], 'location': [['dustbin', 6]], 'exist': [['dustbin', 4]], 'clean': [['dustbin', 0]], 'power': [['dustbin', 0]], 'functional': [['dustbin', 

---

# TRIP Results

Contains code for the tiered and random TRIP baselines.

In [16]:
if task_name != 'trip':
  raise ValueError('Please configure task_name in first cell to "trip" to run TRIP results!')

## Random Tiered Classifier for TRIP

For the random baseline, we average the results of 10 runs. Running the below will report (mean, variance) for each evaluation partition.

In [17]:
from www.dataset.prepro import get_tiered_data
from www.dataset.featurize import add_bert_features_tiered, get_tensor_dataset_tiered
from collections import Counter
import numpy as np
from www.dataset.ann import att_to_num_classes, idx_to_att
from sklearn.metrics import accuracy_score, f1_score
from www.utils import print_dict

tiered_dataset = cloze_dataset_2s

seq_length = 16 # Max sequence length to pad to

tiered_dataset = get_tiered_data(tiered_dataset)
tiered_dataset = add_bert_features_tiered(tiered_dataset, tokenizer, seq_length, add_segment_ids=True)



In [18]:
from www.dataset.prepro import get_tiered_data, balance_labels
from www.dataset.featurize import add_bert_features_tiered, get_tensor_dataset_tiered
from collections import Counter
import numpy as np
from www.dataset.ann import att_to_num_classes, idx_to_att, att_default_values
from sklearn.metrics import accuracy_score, f1_score
from www.utils import print_dict
import numpy as np

# Have to add BERT input IDs and tensorize again
num_runs = 10
stories = []
pred_stories = []
conflicts = []
pred_conflicts = []
preconditions = []
pred_preconditions = []
effects = []
pred_effects = []
verifiability = []
consistency = []
for p in tiered_dataset:
  if p == 'train':
    continue
  metr_avg = {}
  print('starting %s...' % p)
  for r in range(num_runs):
    print('starting run %s...' % str(r))
    for ex in tiered_dataset[p]:
      verifiable = True
      consistent = True

      stories.append(ex['label'])
      pred_stories.append(np.random.randint(2))

      if stories[-1] != pred_stories[-1]:
        verifiable = False

      labels_ex_p = []
      preds_ex_p = []

      labels_ex_e = []
      preds_ex_e = []

      labels_ex_c = []
      preds_ex_c = []

      for si, story in enumerate(ex['stories']):
        labels_story_p = []
        preds_story_p = []

        labels_story_e = []
        preds_story_e = []      

        for ent_ann in story['entities']:
          entity = ent_ann['entity']

          if si == 1 - ex['label']:
            labels_ex_c.append(ent_ann['conflict_span_onehot'])
            pred = np.zeros(ent_ann['conflict_span_onehot'].shape)
            for cs in np.random.choice(len(pred), size=2, replace=False):
              pred[cs] = 1
            preds_ex_c.append(pred)

          labels_ent = []
          preds_ent = []
          for s, sent_ann in enumerate(ent_ann['preconditions']):
            if s < len(story['sentences']):
              if entity in story['sentences'][s]:

                labels_ent.append(sent_ann)
                sent_ann_pred = []
                for i, l in enumerate(sent_ann):
                  pl = np.random.randint(att_to_num_classes[idx_to_att[i]])
                  if pl > 0 and pl != att_default_values[idx_to_att[i]]:
                    if pl != l:
                      verifiable = False
                  sent_ann_pred.append(pl)
                preds_ent.append(sent_ann_pred)

          labels_story_p.append(labels_ent)
          preds_story_p.append(preds_ent)

          labels_ent = []
          preds_ent = []
          for s, sent_ann in enumerate(ent_ann['effects']):
            if s < len(story['sentences']):
              if entity in story['sentences'][s]:
    
                labels_ent.append(sent_ann)
                sent_ann_pred = []
                for i, l in enumerate(sent_ann):
                  pl = np.random.randint(att_to_num_classes[idx_to_att[i]])
                  if pl > 0 and pl != att_default_values[idx_to_att[i]]:
                    if pl != l:
                      verifiable = False
                  sent_ann_pred.append(pl)
                preds_ent.append(sent_ann_pred)

          labels_story_e.append(labels_ent)
          preds_story_e.append(preds_ent)

        labels_ex_p.append(labels_story_p)
        preds_ex_p.append(preds_story_p)

        labels_ex_e.append(labels_story_e)
        preds_ex_e.append(preds_story_e)

      conflicts.append(labels_ex_c)
      pred_conflicts.append(preds_ex_c)

      preconditions.append(labels_ex_p)
      pred_preconditions.append(preds_ex_p)

      effects.append(labels_ex_e)
      pred_effects.append(preds_ex_e)

      p_confl = np.nonzero(np.sum(np.array(preds_ex_c), axis=0))[0]
      l_confl = np.nonzero(np.sum(np.array(labels_ex_c), axis=0))[0]
      assert len(l_confl) == 2, str(labels_ex_c)
      if not (p_confl[0] == l_confl[0] and p_confl[1] == l_confl[1]):
        verifiable = False    
        consistent = False

      verifiability.append(1 if verifiable else 0)
      consistency.append(1 if consistent else 0)

    # Compute metrics
    metr = {}
    metr['story_accuracy'] = accuracy_score(stories, pred_stories)

    conflicts_flat = [c for c_ex in conflicts for c_ent in c_ex for c in c_ent]
    pred_conflicts_flat = [c for c_ex in pred_conflicts for c_ent in c_ex for c in c_ent]
    metr['confl_f1'] = f1_score(conflicts_flat, pred_conflicts_flat, average='macro')

    preconditions_flat = [p for p_ex in preconditions for p_story in p_ex for p_sent in p_story for p_ent in p_sent for p in p_ent]
    pred_preconditions_flat = [p for p_ex in pred_preconditions for p_story in p_ex for p_sent in p_story for p_ent in p_sent for p in p_ent]
    metr['precondition_f1'] = f1_score(preconditions_flat, pred_preconditions_flat, average='macro')

    effects_flat = [p for p_ex in effects for p_story in p_ex for p_sent in p_story for p_ent in p_sent for p in p_ent]
    pred_effects_flat = [p for p_ex in pred_effects for p_story in p_ex for p_sent in p_story for p_ent in p_sent for p in p_ent]
    metr['effect_f1'] = f1_score(effects_flat, pred_effects_flat, average='macro')

    metr['verifiability'] = np.mean(verifiability)
    metr['consistency'] = np.mean(consistency)

    for k in metr:
      if k not in metr_avg:
        metr_avg[k] = []
      metr_avg[k].append(metr[k])

  for k in metr_avg:
    metr_avg[k] = (np.mean(metr_avg[k]), np.var(metr_avg[k]) ** 0.5)
  print('RANDOM BASELINE (%s, %s runs)' % (str(p), str(num_runs)))
  print_dict(metr_avg)

starting dev...
starting run 0...
starting run 1...
starting run 2...
starting run 3...
starting run 4...
starting run 5...
starting run 6...
starting run 7...
starting run 8...
starting run 9...
RANDOM BASELINE (dev, 10 runs)
{
  story_accuracy: 
    (0.5085576259489303, 0.008681601994888405),
  confl_f1: 
    (0.4848957915481747, 0.0009149443365594638),
  precondition_f1: 
    (0.04029214647737987, 7.775770756129488e-05),
  effect_f1: 
    (0.04006568355624193, 0.00012448865250977147),
  verifiability: 
    (0.0, 0.0),
  consistency: 
    (0.11757825594005718, 0.0015168649772256044),
}


starting test...
starting run 0...
starting run 1...
starting run 2...
starting run 3...
starting run 4...
starting run 5...
starting run 6...
starting run 7...
starting run 8...
starting run 9...
RANDOM BASELINE (test, 10 runs)
{
  story_accuracy: 
    (0.4978718433902965, 0.0018625710010761359),
  confl_f1: 
    (0.48435372220950573, 0.00021763150464339153),
  precondition_f1: 
    (0.0401058336544

## Transformer-Based Tiered Classifier for TRIP

This is the baseline model presented in the paper. Based on the settings above, the below cells can be used for training and evaluating models.


### Featurization for Tiered Classification

Get the data ready for input to the model.

In [19]:
from www.dataset.prepro import get_tiered_data, balance_labels
from www.dataset.featurize import add_bert_features_tiered, get_tensor_dataset_tiered
from collections import Counter

tiered_dataset = cloze_dataset_2s

# Debug the code on a small amount of data
if debug:
  for k in tiered_dataset:
    tiered_dataset[k] = tiered_dataset[k][:20]

# train_spans = True
train_spans = False
if train_spans:
  tiered_dataset = get_story_spans_2s(tiered_dataset, train_only=True)
  tiered_dataset['train'] = [ex for ex in tiered_dataset['train'] if ex['label'] != -1] # For now, ignore examples where both stories are plausible :(

seq_length = 16 # Max sequence length to pad to

tiered_dataset = get_tiered_data(tiered_dataset)
tiered_dataset = add_bert_features_tiered(tiered_dataset, tokenizer, seq_length, add_segment_ids=True)

tiered_tensor_dataset = {}
max_story_length = max([len(ex['stories'][0]['sentences']) for p in tiered_dataset for ex in tiered_dataset[p]])
for p in tiered_dataset:
  tiered_tensor_dataset[p] = get_tensor_dataset_tiered(tiered_dataset[p], max_story_length, add_segment_ids=True)



### Train Models

#### Configure Hyperparameters
We will perform grid search over (batch size, learning rate). Configure the training sub-task, search space and set the maximum number of training epochs here. Currently configured for re-training the best RoBERTa-based model instance. Read code comments for more information.

**Additional configuration options:**
* Change the `generate_learning_curve` variable to `True` to generate data for training curves in the style presented in the paper.
* You may ablate the input to the Conflict Detector based on a few pre-defined ablation modes. To do so, change the `ablation` variable based on the comments in the code.

In [20]:
from www.dataset.ann import att_to_idx, att_to_num_classes, att_types

subtask = 'cloze'
batch_sizes = [config_batch_size]
# learning_rates = [config_lr]
learning_rates = [1e-3, 1e-4, 1e-5, 1e-6]
epochs = config_epochs
eval_batch_size = 16
generate_learning_curve = False # Generate data for training curve figure in TRIP paper

num_state_labels = {}
for att in att_to_idx:
  if att_types[att] == 'default':
    num_state_labels[att_to_idx[att]] = 3
  else:
    num_state_labels[att_to_idx[att]] = att_to_num_classes[att] # Location attributes fall into this since they don't have well-define pre- and post-condition yet

# Ablation options:
# - attributes: skip attribute prediction phase
# - embeddings: DON'T input contextual embeddings to conflict detector
# - states: DON'T input states to conflict detector
# - states-labels: in states input to conflict detector, include predicted labels
# - states-logits: in states input to conflict detector, include state logits (preferred)
# - states-teacher-forcing: train conflict detector on ground truth state labels (not predictions)
# - states-attention: re-weight input to conflict detector with weights conditioned on states representation
ablation = ['attributes', 'states-logits'] # This is the default mode presented in the paper

#### Perform Grid Search

Perform hyperparameter tuning to find the best story classification model.


In [21]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import get_linear_schedule_with_warmup
from www.model.train import train_epoch_tiered
from www.model.eval import evaluate_tiered, save_results, save_preds, add_entity_attribute_labels
from sklearn.metrics import accuracy_score, f1_score
from www.utils import print_dict, get_model_dir
from www.model.transformers_ext import TieredModelPipeline
from www.dataset.ann import att_to_num_classes
import shutil
import pandas as pd

seed_val = 22 # Save random seed for reproducibility
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# We'll keep the validation data here with a constant eval batch size
dev_sampler = SequentialSampler(tiered_tensor_dataset['dev'])
dev_dataloader = DataLoader(tiered_tensor_dataset['dev'], sampler=dev_sampler, batch_size=eval_batch_size)
dev_dataset_name = subtask + '_%s_dev'
dev_ids = [ex['example_id'] for ex in tiered_dataset['dev']]

all_losses = []
param_combos = []
combo_names = []
all_val_objs = []
output_dirs = []
best_obj = 0.0
best_model = '<none>'
best_dir = ''
best_obj2 = 0.0
best_model2 = '<none>'
best_dir2 = ''

print('Beginning grid search for the %s sub-task over %s parameter combination(s)!' % (subtask, str(len(batch_sizes) * len(learning_rates))))
for bs in batch_sizes:
  for lr in learning_rates:
    print('\nTRAINING MODEL: bs=%s, lr=%s' % (str(bs), str(lr)))

    loss_values = []
    obj_values = []

    # Set up training dataset with new batch size
    train_sampler = RandomSampler(tiered_tensor_dataset['train'])
    train_dataloader = DataLoader(tiered_tensor_dataset['train'], sampler=train_sampler, batch_size=bs)

    # Set up model
    config = config_class.from_pretrained(model_name,
                                          cache_dir=os.path.join(DRIVE_PATH, 'cache'))    
    emb = emb_class.from_pretrained(model_name,
                                          config=config,
                                          cache_dir=os.path.join(DRIVE_PATH, 'cache'))    
    if torch.cuda.is_available():
      emb.cuda()
    device = emb.device
    max_story_length = max([len(ex['stories'][0]['sentences']) for p in tiered_dataset for ex in tiered_dataset[p]])
    model = TieredModelPipeline(emb, max_story_length, len(att_to_num_classes), num_state_labels,
                                config_class, model_name, device, 
                                ablation=ablation, loss_weights=loss_weights).to(device)

    # Set up optimizer
    optimizer = AdamW(model.parameters(), lr=lr)
    total_steps = len(train_dataloader) * epochs
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps = total_steps)

    train_lc_data = []
    val_lc_data = []
    for epoch in range(epochs):
      # Train the model for one epoch
      print('[%s] Beginning epoch...' % str(epoch))

      epoch_loss, _ = train_epoch_tiered(model, optimizer, train_dataloader, device, seg_mode=False, 
                                         build_learning_curves=generate_learning_curve, val_dataloader=dev_dataloader, 
                                         train_lc_data=train_lc_data, val_lc_data=val_lc_data)
      
      # Save loss
      loss_values.append(epoch_loss)

      # Validate on dev set
      validation_results = evaluate_tiered(model, dev_dataloader, device, [(accuracy_score, 'accuracy'), (f1_score, 'f1')], seg_mode=False, return_explanations=True)
      metr_attr, all_pred_atts, all_atts, \
      metr_prec, all_pred_prec, all_prec, \
      metr_eff, all_pred_eff, all_eff, \
      metr_conflicts, all_pred_conflicts, all_conflicts, \
      metr_stories, all_pred_stories, all_stories, explanations = validation_results[:16]
      explanations = add_entity_attribute_labels(explanations, tiered_dataset['dev'], list(att_to_num_classes.keys()))

      print('[%s] Validation results:' % str(epoch))
      print('[%s] Preconditions:' % str(epoch))
      print_dict(metr_prec)
      print('[%s] Effects:' % str(epoch))
      print_dict(metr_eff)
      print('[%s] Conflicts:' % str(epoch))
      print_dict(metr_conflicts)
      print('[%s] Stories:' % str(epoch))
      print_dict(metr_stories)

      # Save accuracy - want to maximize verifiability of tiered predictions
      ver = metr_stories['verifiability']
      acc = metr_stories['accuracy']
      obj_values.append(ver)
      
      # Save model checkpoint
      print('[%s] Saving model checkpoint...' % str(epoch))
      model_param_str = get_model_dir(model_name.replace('/', '-'), subtask, bs, lr, epoch) + '_' +  '-'.join([str(lw) for lw in loss_weights]) +  '_tiered_pipeline_lc'
      if train_spans:
        model_param_str += 'spans'
      if len(model.ablation) > 0:
        model_param_str += '_ablate_'
        model_param_str += '_'.join(model.ablation)
      output_dir = os.path.join(DRIVE_PATH, 'saved_models', model_param_str)
      output_dirs.append(output_dir)
      if not os.path.exists(output_dir):
        os.makedirs(output_dir)

      save_results(metr_attr, output_dir, dev_dataset_name % 'attributes')
      save_results(metr_prec, output_dir, dev_dataset_name % 'preconditions')
      save_results(metr_eff, output_dir, dev_dataset_name % 'effects')
      save_results(metr_conflicts, output_dir, dev_dataset_name % 'conflicts')
      save_results(metr_stories, output_dir, dev_dataset_name % 'stories')
      save_results(explanations, output_dir, dev_dataset_name % 'explanations')

      # Just save story preds
      save_preds(dev_ids, all_stories, all_pred_stories, output_dir, dev_dataset_name % 'stories')

      emb = emb.module if hasattr(emb, 'module') else emb
      emb.save_pretrained(output_dir)
      torch.save(model, os.path.join(output_dir, 'classifiers.pth'))
      tokenizer.save_vocabulary(output_dir)

      if ver > best_obj:
        best_obj = ver
        best_model = model_param_str
        best_dir = output_dir
      if acc > best_obj2:
        best_obj2 = acc
        best_model2 = model_param_str
        best_dir2 = output_dir        

      # for od in output_dirs:
      #   if od != best_dir and od != best_dir2 and os.path.exists(od):
      #     shutil.rmtree(od)

      print('[%s] Finished epoch.' % str(epoch))

    all_losses.append(loss_values)
    all_val_objs.append(obj_values)
    param_combos.append((bs, lr))
    combo_names.append('bs=%s, lr=%s' % (str(bs), str(lr)))

print('Finished grid search! :)')
print('Best validation *verifiability* %s from model %s.' % (str(best_obj), best_model))
print('Best validation *accuracy* %s from model %s.' % (str(best_obj2), best_model2))

if generate_learning_curve:
  print('Saving learning curve data...')
  train_lc_data = [subrecord for record in train_lc_data for subrecord in record] # flatten
  val_lc_data = [subrecord for record in val_lc_data for subrecord in record] # flatten

  train_lc_data = pd.DataFrame(train_lc_data)
  print(os.path.join(best_dir if best_dir != '<none>' else best_dir2, 'learning_curve_data_train.csv'))
  train_lc_data.to_csv(os.path.join(best_dir if best_dir != '' else best_dir2, 'learning_curve_data_train.csv'), index=False)
  val_lc_data = pd.DataFrame(val_lc_data)
  val_lc_data.to_csv(os.path.join(best_dir if best_dir != '' else best_dir2, 'learning_curve_data_val.csv'), index=False)
  print('Learning curve data saved. %s rows saved for training, %s rows saved for validation.' % (str(len(train_lc_data.index)), str(len(val_lc_data.index))))

Beginning grid search for the cloze sub-task over 1 parameter combination(s)!

TRAINING MODEL: bs=1, lr=1e-05


[                                                                        ] N/A%

[0] Beginning epoch...


[########################################################################] 100%
[                                                                        ] N/A%

	Beginning evaluation...
		Running prediction...


[########################################################################] 100%


		Computing metrics...
	Finished evaluation in 0:00:36s.
[0] Validation results:
[0] Preconditions:
{
  accuracy: 
    0.9939943025265026,
  f1: 
    0.2616516347935549,
  accuracy_0: 
    0.9951781627983001,
  f1_0: 
    0.509809974143638,
  accuracy_1: 
    0.9997548218372017,
  f1_1: 
    0.665554530479591,
  accuracy_2: 
    0.999077663101854,
  f1_2: 
    0.5109572814336666,
  accuracy_3: 
    0.9985989819268668,
  f1_3: 
    0.3330996666355111,
  accuracy_4: 
    0.9996964460841545,
  f1_4: 
    0.3332827333341118,
  accuracy_5: 
    0.9825690001401018,
  f1_5: 
    0.20799714556546464,
  accuracy_6: 
    0.9842852472796899,
  f1_6: 
    0.6268270483605228,
  accuracy_7: 
    0.9982370522579741,
  f1_7: 
    0.3330392494824319,
  accuracy_8: 
    0.9955867930696306,
  f1_8: 
    0.576982515962123,
  accuracy_9: 
    0.9856745902022136,
  f1_9: 
    0.6298056155120746,
  accuracy_10: 
    0.9950614112922057,
  f1_10: 
    0.33250819771263823,
  accuracy_11: 
    0.997151263251296,

[                                                                        ] N/A%

[0] Finished epoch.
[1] Beginning epoch...


[########################################################################] 100%
[                                                                        ] N/A%

	Beginning evaluation...
		Running prediction...


[########################################################################] 100%


		Computing metrics...
	Finished evaluation in 0:00:36s.
[1] Validation results:
[1] Preconditions:
{
  accuracy: 
    0.9954391024144211,
  f1: 
    0.48870630691098327,
  accuracy_0: 
    0.9978284219866437,
  f1_0: 
    0.6120252879560054,
  accuracy_1: 
    0.9997081212347639,
  f1_1: 
    0.6653395737657836,
  accuracy_2: 
    0.9992761406622146,
  f1_2: 
    0.6088536114302082,
  accuracy_3: 
    0.999077663101854,
  f1_3: 
    0.5998538840540746,
  accuracy_4: 
    0.9996964460841545,
  f1_4: 
    0.3332827333341118,
  accuracy_5: 
    0.9863984495399991,
  f1_5: 
    0.40139084360299465,
  accuracy_6: 
    0.987530939149115,
  f1_6: 
    0.6786932939047166,
  accuracy_7: 
    0.9982370522579741,
  f1_7: 
    0.3330392494824319,
  accuracy_8: 
    0.997256339606781,
  f1_8: 
    0.6072882216561611,
  accuracy_9: 
    0.9891421099332182,
  f1_9: 
    0.6380304036574763,
  accuracy_10: 
    0.9972096390043431,
  f1_10: 
    0.5730701699631492,
  accuracy_11: 
    0.996987811142763

[                                                                        ] N/A%

[1] Finished epoch.
[2] Beginning epoch...


[########################################################################] 100%
[                                                                        ] N/A%

	Beginning evaluation...
		Running prediction...


[########################################################################] 100%


		Computing metrics...
	Finished evaluation in 0:00:36s.
[2] Validation results:
[2] Preconditions:
{
  accuracy: 
    0.9953772241161911,
  f1: 
    0.5868970870599055,
  accuracy_0: 
    0.9977116704805492,
  f1_0: 
    0.5975249035019231,
  accuracy_1: 
    0.9997314715359829,
  f1_1: 
    0.665446024374751,
  accuracy_2: 
    0.9994512679213562,
  f1_2: 
    0.8042179173725681,
  accuracy_3: 
    0.9992761406622146,
  f1_3: 
    0.7104133135138416,
  accuracy_4: 
    0.99978984728903,
  f1_4: 
    0.5110760787061365,
  accuracy_5: 
    0.9863634240881708,
  f1_5: 
    0.5121494716867426,
  accuracy_6: 
    0.9869355064680334,
  f1_6: 
    0.7561127742210103,
  accuracy_7: 
    0.9984121795171158,
  f1_7: 
    0.516799305438202,
  accuracy_8: 
    0.9970111614439826,
  f1_8: 
    0.7243248696924341,
  accuracy_9: 
    0.9880563209265399,
  f1_9: 
    0.6894129449951465,
  accuracy_10: 
    0.9968944099378882,
  f1_10: 
    0.5467616902559445,
  accuracy_11: 
    0.996941110540326,
 

[                                                                        ] N/A%

[2] Finished epoch.
[3] Beginning epoch...


[########################################################################] 100%
[                                                                        ] N/A%

	Beginning evaluation...
		Running prediction...


[########################################################################] 100%


		Computing metrics...
	Finished evaluation in 0:00:36s.
[3] Validation results:
[3] Preconditions:
{
  accuracy: 
    0.9954513613225611,
  f1: 
    0.5884682388914594,
  accuracy_0: 
    0.997863447438472,
  f1_0: 
    0.608060375283833,
  accuracy_1: 
    0.9997197963853733,
  f1_1: 
    0.6653924112269226,
  accuracy_2: 
    0.9994746182225751,
  f1_2: 
    0.8371814555685853,
  accuracy_3: 
    0.9992761406622146,
  f1_3: 
    0.7104133135138416,
  accuracy_4: 
    0.99978984728903,
  f1_4: 
    0.5110760787061365,
  accuracy_5: 
    0.9856512399009947,
  f1_5: 
    0.5139680445611577,
  accuracy_6: 
    0.9878111427637416,
  f1_6: 
    0.791757683333025,
  accuracy_7: 
    0.9982954280110213,
  f1_7: 
    0.512363104387553,
  accuracy_8: 
    0.9968944099378882,
  f1_8: 
    0.7609470875556715,
  accuracy_9: 
    0.988663428758231,
  f1_9: 
    0.7067780631203743,
  accuracy_10: 
    0.9969294353897166,
  f1_10: 
    0.5956182401170395,
  accuracy_11: 
    0.9974548171671415,
  f

[                                                                        ] N/A%

[3] Finished epoch.
[4] Beginning epoch...


[########################################################################] 100%
[                                                                        ] N/A%

	Beginning evaluation...
		Running prediction...


[########################################################################] 100%


		Computing metrics...
	Finished evaluation in 0:00:36s.
[4] Validation results:
[4] Preconditions:
{
  accuracy: 
    0.9953556250875636,
  f1: 
    0.5673035508240342,
  accuracy_0: 
    0.9974081165647037,
  f1_0: 
    0.6657778748598229,
  accuracy_1: 
    0.9997197963853733,
  f1_1: 
    0.6653927911882174,
  accuracy_2: 
    0.9995213188250128,
  f1_2: 
    0.840141990994307,
  accuracy_3: 
    0.9991477140055107,
  f1_3: 
    0.6972099608411636,
  accuracy_4: 
    0.9999532993975623,
  f1_4: 
    0.6666588810513695,
  accuracy_5: 
    0.9857796665576986,
  f1_5: 
    0.5000789706764179,
  accuracy_6: 
    0.987239060383879,
  f1_6: 
    0.7489007936439483,
  accuracy_7: 
    0.998470555270163,
  f1_7: 
    0.5195227036228621,
  accuracy_8: 
    0.9970345117452015,
  f1_8: 
    0.7643300764439362,
  accuracy_9: 
    0.9883949002942138,
  f1_9: 
    0.6844001133299349,
  accuracy_10: 
    0.9966492317750899,
  f1_10: 
    0.6169833735731273,
  accuracy_11: 
    0.9968126838836221,

[                                                                        ] N/A%

[4] Finished epoch.
[5] Beginning epoch...


[########################################################################] 100%
[                                                                        ] N/A%

	Beginning evaluation...
		Running prediction...


[########################################################################] 100%


		Computing metrics...
	Finished evaluation in 0:00:36s.
[5] Validation results:
[5] Preconditions:
{
  accuracy: 
    0.9953941530845748,
  f1: 
    0.5872802318101912,
  accuracy_0: 
    0.9973730911128753,
  f1_0: 
    0.6849025451835281,
  accuracy_1: 
    0.9997081212347639,
  f1_1: 
    0.6653395737657836,
  accuracy_2: 
    0.9996147200298884,
  f1_2: 
    0.8766445802627807,
  accuracy_3: 
    0.9991593891561201,
  f1_3: 
    0.703243175039925,
  accuracy_4: 
    0.9998598981926867,
  f1_4: 
    0.5925692368377345,
  accuracy_5: 
    0.9854410871900248,
  f1_5: 
    0.5117659235778577,
  accuracy_6: 
    0.9867020034558446,
  f1_6: 
    0.776754216236518,
  accuracy_7: 
    0.9982020268061458,
  f1_7: 
    0.512996241168683,
  accuracy_8: 
    0.997151263251296,
  f1_8: 
    0.7245414466212083,
  accuracy_9: 
    0.9875659646009434,
  f1_9: 
    0.6915299670235733,
  accuracy_10: 
    0.9971395881006865,
  f1_10: 
    0.704253999435056,
  accuracy_11: 
    0.9974548171671415,
 

[                                                                        ] N/A%

[5] Finished epoch.
[6] Beginning epoch...


[########################################################################] 100%
[                                                                        ] N/A%

	Beginning evaluation...
		Running prediction...


[########################################################################] 100%


		Computing metrics...
	Finished evaluation in 0:00:36s.
[6] Validation results:
[6] Preconditions:
{
  accuracy: 
    0.9954951431373464,
  f1: 
    0.600033336207623,
  accuracy_0: 
    0.9975598935226264,
  f1_0: 
    0.695085666700162,
  accuracy_1: 
    0.9997197963853733,
  f1_1: 
    0.6653927911882174,
  accuracy_2: 
    0.9996380703311073,
  f1_2: 
    0.8777354327039216,
  accuracy_3: 
    0.9991243637042918,
  f1_3: 
    0.6940144089766326,
  accuracy_4: 
    0.9999065987951244,
  f1_4: 
    0.6333177624664253,
  accuracy_5: 
    0.9863634240881708,
  f1_5: 
    0.5312665308201495,
  accuracy_6: 
    0.9872974361369262,
  f1_6: 
    0.8017945280565208,
  accuracy_7: 
    0.9985055807219914,
  f1_7: 
    0.5336298682682911,
  accuracy_8: 
    0.9975248680707981,
  f1_8: 
    0.7399846859603961,
  accuracy_9: 
    0.9879395694204455,
  f1_9: 
    0.6606836130969719,
  accuracy_10: 
    0.9966142063232616,
  f1_10: 
    0.6483700353886576,
  accuracy_11: 
    0.9973730911128753

[                                                                        ] N/A%

[6] Finished epoch.
[7] Beginning epoch...


[########################################################################] 100%
[                                                                        ] N/A%

	Beginning evaluation...
		Running prediction...


[########################################################################] 100%


		Computing metrics...
	Finished evaluation in 0:00:36s.
[7] Validation results:
[7] Preconditions:
{
  accuracy: 
    0.9953807266613739,
  f1: 
    0.5901291890177669,
  accuracy_0: 
    0.9972796899079999,
  f1_0: 
    0.6525122044153372,
  accuracy_1: 
    0.9997548218372017,
  f1_1: 
    0.6655525380693446,
  accuracy_2: 
    0.9995913697286695,
  f1_2: 
    0.8562898354859456,
  accuracy_3: 
    0.9992761406622146,
  f1_3: 
    0.6350221434263889,
  accuracy_4: 
    0.9998131975902489,
  f1_4: 
    0.541635526387494,
  accuracy_5: 
    0.9860598701723252,
  f1_5: 
    0.5168090829638969,
  accuracy_6: 
    0.986923831317424,
  f1_6: 
    0.7896672889942736,
  accuracy_7: 
    0.9984939055713818,
  f1_7: 
    0.5202204306075,
  accuracy_8: 
    0.9971979638537337,
  f1_8: 
    0.734282086640229,
  accuracy_9: 
    0.9875426142997245,
  f1_9: 
    0.6749006404833867,
  accuracy_10: 
    0.996730957829356,
  f1_10: 
    0.677181070759714,
  accuracy_11: 
    0.9976065941250642,
  f1

[                                                                        ] N/A%

[7] Finished epoch.
[8] Beginning epoch...


[########################################################################] 100%
[                                                                        ] N/A%

	Beginning evaluation...
		Running prediction...


[########################################################################] 100%


		Computing metrics...
	Finished evaluation in 0:00:36s.
[8] Validation results:
[8] Preconditions:
{
  accuracy: 
    0.995485803016859,
  f1: 
    0.5890650861099322,
  accuracy_0: 
    0.9974898426189698,
  f1_0: 
    0.6643341246525051,
  accuracy_1: 
    0.9997548218372017,
  f1_1: 
    0.6655525380693446,
  accuracy_2: 
    0.9996497454817167,
  f1_2: 
    0.8880376344091534,
  accuracy_3: 
    0.9990543128006352,
  f1_3: 
    0.6106160226555505,
  accuracy_4: 
    0.9999065987951244,
  f1_4: 
    0.6333177624664253,
  accuracy_5: 
    0.9858263671601364,
  f1_5: 
    0.5118482630528113,
  accuracy_6: 
    0.9876360155046,
  f1_6: 
    0.8053385258356004,
  accuracy_7: 
    0.9984822304207724,
  f1_7: 
    0.5159748365220392,
  accuracy_8: 
    0.9969761359921543,
  f1_8: 
    0.7315194795152937,
  accuracy_9: 
    0.9882547984869005,
  f1_9: 
    0.6700156835284165,
  accuracy_10: 
    0.9971629384019054,
  f1_10: 
    0.6920602576929832,
  accuracy_11: 
    0.9973263905104376,


[                                                                        ] N/A%

[8] Finished epoch.
[9] Beginning epoch...


[########################################################################] 100%
[                                                                        ] N/A%

	Beginning evaluation...
		Running prediction...


[########################################################################] 100%


		Computing metrics...
	Finished evaluation in 0:00:36s.
[9] Validation results:
[9] Preconditions:
{
  accuracy: 
    0.9954992294400598,
  f1: 
    0.6052710183148939,
  accuracy_0: 
    0.9973847662634848,
  f1_0: 
    0.666472689336703,
  accuracy_1: 
    0.9997314715359829,
  f1_1: 
    0.665446024374751,
  accuracy_2: 
    0.9996730957829356,
  f1_2: 
    0.8816459292579842,
  accuracy_3: 
    0.999077663101854,
  f1_3: 
    0.686181699118069,
  accuracy_4: 
    0.9998598981926867,
  f1_4: 
    0.5925692368377345,
  accuracy_5: 
    0.9866553028534069,
  f1_5: 
    0.5270077491657076,
  accuracy_6: 
    0.9874025124924112,
  f1_6: 
    0.800945588780028,
  accuracy_7: 
    0.998575631625648,
  f1_7: 
    0.5535719873983349,
  accuracy_8: 
    0.9974197917153131,
  f1_8: 
    0.7396278512178339,
  accuracy_9: 
    0.9881030215289778,
  f1_9: 
    0.6720434428882802,
  accuracy_10: 
    0.9970228365945921,
  f1_10: 
    0.6993955663125995,
  accuracy_11: 
    0.997256339606781,
  f

Delete all non-best model checkpoints:


In [22]:
# import shutil

# Delete non-best model checkpoints
for od in output_dirs:
  if od != best_dir and od != best_dir2 and os.path.exists(od):
    shutil.rmtree(od)

### Test Models

Evaluate accuracy, consistency, and verifiability on the test set.

#### Load the Trained Model

Load the trained model we want to probe and select the appropriate dataset. Paths to the pre-trained models presented in the paper are already provided (download links are found in GitHub repo).

In [23]:
from www.model.transformers_ext import TieredModelPipeline
from www.dataset.ann import att_to_num_classes, att_to_idx, att_types
eval_model_dir = best_dir
probe_model = eval_model_dir
probe_model = os.path.join(DRIVE_PATH, 'saved_models', probe_model)

ablation = ['attributes', 'states-logits']

if 'cloze' in probe_model:
  subtask = 'cloze'
elif 'order' in probe_model:
  subtask = 'order'
  
if subtask == 'cloze':
  subtask_dataset = cloze_dataset_2s
elif subtask == 'order':
  subtask_dataset = order_dataset_2s

# Load the model
model = None
# model = torch.load(os.path.join(probe_model, 'classifiers.pth'), map_location=torch.device('cpu'))
model = torch.load(os.path.join(probe_model, 'classifiers.pth'))
if torch.cuda.is_available():
  model.cuda()
device = model.embedding.device

for layer in model.precondition_classifiers:
  layer.eval()
for layer in model.effect_classifiers:
  layer.eval()

#### Test the Model

Run inference on the testing set of TRIP. Can simply edit the top-level `for` loop if you want to run inference on other partitions.

In [24]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from www.model.eval import evaluate_tiered, save_results, save_preds, list_comparison, add_entity_attribute_labels
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
metrics = [(accuracy_score, 'accuracy'), (precision_score, 'precision'), (recall_score, 'recall'), (f1_score, 'f1')]
import numpy as np
from www.utils import print_dict

print('Testing model: %s.' % probe_model)

# May alter this depending on which partition(s) you want to run inference on
for p in tiered_dataset:
  if p != 'test':
    continue

  p_dataset = tiered_dataset[p]
  p_tensor_dataset = tiered_tensor_dataset[p]
  p_sampler = SequentialSampler(p_tensor_dataset)
  p_dataloader = DataLoader(p_tensor_dataset, sampler=p_sampler, batch_size=16)
  dev_dataset_name = subtask + '_%s_' + p
  p_ids = [ex['example_id'] for ex in tiered_dataset[p]]

  # Get preds and metrics on this partition
  metr_attr, all_pred_atts, all_atts, \
  metr_prec, all_pred_prec, all_prec, \
  metr_eff, all_pred_eff, all_eff, \
  metr_conflicts, all_pred_conflicts, all_conflicts, \
  metr_stories, all_pred_stories, all_stories, explanations = evaluate_tiered(model, p_dataloader, device, [(accuracy_score, 'accuracy'), (f1_score, 'f1')], seg_mode=False, return_explanations=True)
  explanations = add_entity_attribute_labels(explanations, tiered_dataset[p], list(att_to_num_classes.keys()))

  save_results(metr_attr, probe_model, dev_dataset_name % 'attributes')
  save_results(metr_prec, probe_model, dev_dataset_name % 'preconditions')
  save_results(metr_eff, probe_model, dev_dataset_name % 'effects')
  save_results(metr_conflicts, probe_model, dev_dataset_name % 'conflicts')
  save_results(metr_stories, probe_model, dev_dataset_name % 'stories')
  save_results(explanations, probe_model, dev_dataset_name % 'explanations')

  print('\nPARTITION: %s' % p)
  print('Stories:')
  print_dict(metr_stories)
  print('Conflicts:')
  print_dict(metr_conflicts)
  print('Preconditions:')
  print_dict(metr_prec)
  print('Effects:')
  print_dict(metr_eff)

[                                                                        ] N/A%

Testing model: ./saved_models/roberta-large_cloze_1_1e-05_5_0.0-0.4-0.4-0.2-0.0_tiered_pipeline_lc_ablate_attributes_states-logits.
	Beginning evaluation...
		Running prediction...


[########################################################################] 100%


		Computing metrics...
	Finished evaluation in 0:00:56s.

PARTITION: test
Stories:
{
  accuracy: 
    0.7606837606837606,
  f1: 
    0.7606351886731182,
  verifiability: 
    0.07407407407407407,
}


Conflicts:
{
  accuracy: 
    0.9781953300471818,
  f1: 
    0.6754904345449386,
}


Preconditions:
{
  accuracy: 
    0.9959982061833914,
  f1: 
    0.5343470744036458,
  accuracy_0: 
    0.9982740167925354,
  f1_0: 
    0.6823113467049654,
  accuracy_1: 
    0.9994121105232217,
  f1_1: 
    0.6629145633285994,
  accuracy_2: 
    0.9994724068798143,
  f1_2: 
    0.7977061085142254,
  accuracy_3: 
    0.999449795746092,
  f1_3: 
    0.6170565329700258,
  accuracy_4: 
    0.9998944813759628,
  f1_4: 
    0.493315745172678,
  accuracy_5: 
    0.9886642849605812,
  f1_5: 
    0.4679341963061971,
  accuracy_6: 
    0.9864559309003753,
  f1_6: 
    0.7903805274825567,
  accuracy_7: 
    0.9981760352130723,
  f1_7: 
    0.4389126090979818,
  accuracy_8: 
    0.9971057748835527,
  f1_8: 
    0.77

#### Add Consistency Metric to Model Results
The intermediate conistency metric isn't included in the originally calculated metrics. This block adds the consistency metric to pre-existing model directory based on the tiered predictions. Generates a new `results_cloze_stories_final_[partition].json` file that includes the consistency metric.



In [25]:
import json
import os

model_directories = [eval_model_dir]

partitions = ['dev', 'test']
expl_fname = 'results_cloze_explanations_%s.json'
endtask_fname = 'results_cloze_stories_%s.json'
endtask_fname_new = 'results_cloze_stories_final_%s.json'
for md in model_directories:
  for p in partitions:
    explanations = json.load(open(os.path.join(DRIVE_PATH, 'saved_models', md, expl_fname % p), 'r'))
    endtask_results = json.load(open(os.path.join(DRIVE_PATH, 'saved_models', md, endtask_fname % p), 'r'))

    consistent_preds = 0
    verifiable_preds = 0
    total = 0
    for expl in explanations:
      if expl['valid_explanation']:
        verifiable_preds += 1
      if expl['story_pred'] == expl['story_label']:
        if len(expl['conflict_pred']) == len(expl['conflict_label']) and expl['conflict_pred'][0] == expl['conflict_label'][0] and expl['conflict_pred'][1] == expl['conflict_label'][1]:
          expl['consistent'] = True
          consistent_preds += 1
        else:
          expl['consistent'] = False
      total += 1

    endtask_results['consistency'] = float(consistent_preds) / total
    print('Found %s consistent preds in %s (versus %s verifiable)' % (str(consistent_preds), p, str(verifiable_preds)))
    json.dump(explanations, open(os.path.join(DRIVE_PATH, 'saved_models', md, (expl_fname % p).replace('explanations', 'explanations_consistency')), 'w'))
    json.dump(endtask_results, open(os.path.join(DRIVE_PATH, 'saved_models', md, endtask_fname_new % p), 'w'))

Found 93 consistent preds in dev (versus 27 verifiable)
Found 85 consistent preds in test (versus 26 verifiable)



# Conversational Entailment (CE) Results

Code for the coherence experiments on CE.

In [26]:
task_name = 'ce'
if task_name != 'ce':
  raise ValueError('Please configure task_name in first cell to "ce" to run CE results!')

## Load Conversational Entailment Dataset

In [27]:
import xml.etree.ElementTree as ET
import pickle
cache_train = os.path.join(DRIVE_PATH, 'all_data/ConvEnt/ConvEnt_train_resplit.json')
cache_dev = os.path.join(DRIVE_PATH,'all_data/ConvEnt/ConvEnt_dev_resplit.json')
cache_test = os.path.join(DRIVE_PATH,'all_data/ConvEnt/ConvEnt_test_resplit.json')
ConvEnt_train = json.load(open(cache_train))
ConvEnt_dev = json.load(open(cache_dev))
ConvEnt_test = json.load(open(cache_test))

# Combine train and dev and do cross-validation
cache_folds = os.path.join(DRIVE_PATH,'all_data/ConvEnt/ConvEnt_folds.pkl') # Folds used for results presented in paper
ConvEnt_train = ConvEnt_train + ConvEnt_dev
train_sources = list(set([ex['dialog_source'] for ex in ConvEnt_train]))
print("Reserved %s dialog sources for training and validation." % len(train_sources))

no_folds = 8
if not os.path.exists(cache_folds):
  folds = []
  for k in range(no_folds):
    folds.append(np.random.choice(train_sources, size=5, replace=False))
    train_sources = [s for s in train_sources if s not in folds[-1]]
  assert len(train_sources) == 0
  print(folds)
  pickle.dump(folds, open(cache_folds, 'wb'))
else:
  folds = pickle.load(open(cache_folds, 'rb'))

Reserved 40 dialog sources for training and validation.


In [28]:
print('train examples:', len(ConvEnt_train))
print('dev examples:', len(ConvEnt_dev))
print('test examples:', len(ConvEnt_test))

train examples: 703
dev examples: 110
test examples: 172


## Featurize Conversational Entailment

In [29]:
from www.dataset.featurize import add_bert_features_ConvEnt, get_tensor_dataset
import pickle
seq_length = 128

ConvEnt_train = add_bert_features_ConvEnt(ConvEnt_train, tokenizer, seq_length, add_segment_ids=True)
ConvEnt_dev = add_bert_features_ConvEnt(ConvEnt_dev, tokenizer, seq_length, add_segment_ids=True)
ConvEnt_test = add_bert_features_ConvEnt(ConvEnt_test, tokenizer, seq_length, add_segment_ids=True)

ConvEnt_train_folds = [[] for _ in range(no_folds)]
ConvEnt_dev_folds = [[] for _ in range(no_folds)]
for k in range(no_folds):
  ConvEnt_train_folds[k] = [ex for ex in ConvEnt_train if ex['dialog_source'] not in folds[k]]
  ConvEnt_dev_folds[k] = [ex for ex in ConvEnt_train if ex['dialog_source'] in folds[k]]

  if debug:
    ConvEnt_train_folds[k] = ConvEnt_train_folds[k][:10]
    ConvEnt_dev_folds[k] = ConvEnt_dev_folds[k][:10]

if debug:
  ConvEnt_train = ConvEnt_train[:10]
  ConvEnt_dev = ConvEnt_dev[:10]
  ConvEnt_test = ConvEnt_test[:10]

ConvEnt_train_tensor = get_tensor_dataset(ConvEnt_train, label_key='label', add_segment_ids=True)
ConvEnt_test_tensor = get_tensor_dataset(ConvEnt_test, label_key='label', add_segment_ids=True)

# Training sets for each validation fold
ConvEnt_train_folds_tensor = [get_tensor_dataset(ConvEnt_train_folds[k], label_key='label', add_segment_ids=True) for k in range(no_folds)]
ConvEnt_dev_folds_tensor = [get_tensor_dataset(ConvEnt_dev_folds[k], label_key='label', add_segment_ids=True) for k in range(no_folds)]

In [30]:
print('train examples:', len(ConvEnt_train))
print('dev examples:', len(ConvEnt_dev))
print('test examples:', len(ConvEnt_test))

train examples: 703
dev examples: 110
test examples: 172


## Train Models on Conversational Entailment

### Train Models

#### Configure Hyperparameters

In [31]:
batch_sizes = [config_batch_size]
learning_rates = [config_lr]
epochs = config_epochs
eval_batch_size = 128

#### Grid Search and Cross-Validation

In [32]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import get_linear_schedule_with_warmup
from www.model.train import train_epoch
from www.model.eval import evaluate, save_results, save_preds
from sklearn.metrics import accuracy_score
from www.utils import print_dict, get_model_dir
from collections import Counter

seed_val = 22 # Save random seed for reproducibility
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

assert len(batch_sizes) == 1
train_fold_sampler = [RandomSampler(f) for f in ConvEnt_train_folds_tensor]
train_fold_dataloader = [DataLoader(f, sampler=train_fold_sampler[i], batch_size=batch_sizes[0]) for i, f in enumerate(ConvEnt_train_folds_tensor)]

dev_fold_sampler = [SequentialSampler(f) for f in ConvEnt_dev_folds_tensor]
dev_fold_dataloader = [DataLoader(f, sampler=dev_fold_sampler[i], batch_size=eval_batch_size) for i, f in enumerate(ConvEnt_dev_folds_tensor)]

all_val_accs = Counter()
print('Beginning grid search for ConvEnt over %s parameter combination(s)!' % (str(len(batch_sizes) * len(learning_rates))))
for bs in batch_sizes:
  for lr in learning_rates:
    print('\nTRAINING MODEL: bs=%s, lr=%s' % (str(bs), str(lr)))

    for k in range(no_folds):
      print('Beginning fold %s/%s...' % (str(k+1), str(no_folds)))

      # Set up model
      if 'mnli' not in mode:
        model = model_class.from_pretrained(model_name, 
                                            cache_dir=os.path.join(DRIVE_PATH, 'cache'))
      else:
        config = config_class.from_pretrained(model_name.replace('-mnli',''),
                                        num_labels=3,
                                        cache_dir=os.path.join(DRIVE_PATH, 'cache'))
        model = model_class.from_pretrained(model_name, 
                                            config=config,
                                            cache_dir=os.path.join(DRIVE_PATH, 'cache'))
        config.num_labels = 2
        model.num_labels = 2
        model.classifier = cls_head_class(config=config) # Need to bring in a classification head for only 2 labels
    
      model.cuda()
      device = model.device 

      # Set up optimizer
      optimizer = AdamW(model.parameters(), lr=lr)
      total_steps = len(train_fold_dataloader[k]) * epochs
      scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps = total_steps)

      for epoch in range(epochs):
        # Train the model for one epoch
        print('[%s] Beginning epoch...' % str(epoch))

        epoch_loss, _ = train_epoch(model, optimizer, train_fold_dataloader[k], device, seg_mode=True if 'roberta' not in mode else False)
        
        # Validate on dev set
        results, _, _ = evaluate(model, dev_fold_dataloader[k], device, [(accuracy_score, 'accuracy')], seg_mode=True if 'roberta' not in mode else False)
        print('[%s] Validation results:' % str(epoch))
        print_dict(results)

        # Save accuracy
        acc = results['accuracy']
        if (bs, lr, epoch) in all_val_accs:
          all_val_accs[(bs, lr, epoch)] += acc
        else:
          all_val_accs[(bs, lr, epoch)] = acc
        
      model.cpu()
      del model
      del optimizer
      del results
      del scheduler
      del total_steps

      print('[%s] Finished epoch.' % str(epoch))

for k in all_val_accs:
  all_val_accs[k] /= no_folds

print('Top performing param combos:')
print(all_val_accs.most_common(5))

save_fname = os.path.join(DRIVE_PATH, 'saved_models/%s_ConvEnt_xval_%s.pkl' % (model_name.replace('/','-'), '_'.join([str(lr) for lr in learning_rates])))
pickle.dump(all_val_accs, open(save_fname, 'wb'))

Beginning grid search for ConvEnt over 1 parameter combination(s)!

TRAINING MODEL: bs=1, lr=1e-05
Beginning fold 1/8...


Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.weight', 'classif

[0] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[0] Validation results:
{
  accuracy: 
    0.5416666666666666,
}


[1] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[1] Validation results:
{
  accuracy: 
    0.5416666666666666,
}


[2] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[2] Validation results:
{
  accuracy: 
    0.5416666666666666,
}


[3] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[3] Validation results:
{
  accuracy: 
    0.5416666666666666,
}


[4] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[4] Validation results:
{
  accuracy: 
    0.5416666666666666,
}


[5] Beginning epoch.

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.weight', 'classif

[0] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[0] Validation results:
{
  accuracy: 
    0.5396825396825397,
}


[1] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[1] Validation results:
{
  accuracy: 
    0.4603174603174603,
}


[2] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[2] Validation results:
{
  accuracy: 
    0.4603174603174603,
}


[3] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[3] Validation results:
{
  accuracy: 
    0.4603174603174603,
}


[4] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[4] Validation results:
{
  accuracy: 
    0.4603174603174603,
}


[5] Beginning epoch.

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.weight', 'classif

[0] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[0] Validation results:
{
  accuracy: 
    0.49504950495049505,
}


[1] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[1] Validation results:
{
  accuracy: 
    0.49504950495049505,
}


[2] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[2] Validation results:
{
  accuracy: 
    0.49504950495049505,
}


[3] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[3] Validation results:
{
  accuracy: 
    0.49504950495049505,
}


[4] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[4] Validation results:
{
  accuracy: 
    0.49504950495049505,
}


[5] Beginning e

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.weight', 'classif

[0] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[0] Validation results:
{
  accuracy: 
    0.4358974358974359,
}


[1] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[1] Validation results:
{
  accuracy: 
    0.5641025641025641,
}


[2] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[2] Validation results:
{
  accuracy: 
    0.5641025641025641,
}


[3] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[3] Validation results:
{
  accuracy: 
    0.5641025641025641,
}


[4] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[4] Validation results:
{
  accuracy: 
    0.5641025641025641,
}


[5] Beginning epoch.

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.weight', 'classif

[0] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[0] Validation results:
{
  accuracy: 
    0.5180722891566265,
}


[1] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[1] Validation results:
{
  accuracy: 
    0.4819277108433735,
}


[2] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[2] Validation results:
{
  accuracy: 
    0.4819277108433735,
}


[3] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[3] Validation results:
{
  accuracy: 
    0.4819277108433735,
}


[4] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[4] Validation results:
{
  accuracy: 
    0.4819277108433735,
}


[5] Beginning epoch.

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.weight', 'classif

[0] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[0] Validation results:
{
  accuracy: 
    0.6262626262626263,
}


[1] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[1] Validation results:
{
  accuracy: 
    0.6262626262626263,
}


[2] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[2] Validation results:
{
  accuracy: 
    0.37373737373737376,
}


[3] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[3] Validation results:
{
  accuracy: 
    0.6262626262626263,
}


[4] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[4] Validation results:
{
  accuracy: 
    0.6262626262626263,
}


[5] Beginning epoch

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.weight', 'classif

[0] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[0] Validation results:
{
  accuracy: 
    0.5068493150684932,
}


[1] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[1] Validation results:
{
  accuracy: 
    0.5068493150684932,
}


[2] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[2] Validation results:
{
  accuracy: 
    0.5068493150684932,
}


[3] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[3] Validation results:
{
  accuracy: 
    0.5068493150684932,
}


[4] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[4] Validation results:
{
  accuracy: 
    0.5068493150684932,
}


[5] Beginning epoch.

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.weight', 'classif

[0] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[0] Validation results:
{
  accuracy: 
    0.5545454545454546,
}


[1] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[1] Validation results:
{
  accuracy: 
    0.5545454545454546,
}


[2] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[2] Validation results:
{
  accuracy: 
    0.5545454545454546,
}


[3] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[3] Validation results:
{
  accuracy: 
    0.5545454545454546,
}


[4] Beginning epoch...
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:00s.
[4] Validation results:
{
  accuracy: 
    0.5545454545454546,
}


[5] Beginning epoch.

#### Re-Train Best Model from Cross-Validation

Re-train a model with the best parameters from the search above. If this isn't run directly after the above cell, replace `save_fname.split('/'[-1])` in `xval_fnames` with the name of the `pkl` file previously generated in the `saved_models` directory.

In [33]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import get_linear_schedule_with_warmup
from www.model.train import train_epoch
from www.model.eval import evaluate, save_results, save_preds
from sklearn.metrics import accuracy_score
from www.utils import print_dict, get_model_dir
from collections import Counter

# Re-train the model with the best parameters from the grid search/cross-validation (with all folds)
xval_fnames = []
xval_fnames.append(save_fname.split('/')[-1])

xval_results = Counter()
for fname in xval_fnames:
  xval_results += pickle.load(open(os.path.join(DRIVE_PATH, 'saved_models/', fname), 'rb'))

batch_size, learning_rate, epochs = xval_results.most_common(1)[0][0]
epochs += 1

# Set up model
if 'mnli' not in mode:
  model = model_class.from_pretrained(model_name, 
                                      cache_dir=os.path.join(DRIVE_PATH, 'cache'))
else:
  config = config_class.from_pretrained(model_name.replace('-mnli',''),
                                  num_labels=3,
                                  cache_dir=os.path.join(DRIVE_PATH, 'cache'))
  model = model_class.from_pretrained(model_name, 
                                      config=config,
                                      cache_dir=os.path.join(DRIVE_PATH, 'cache'))
  config.num_labels = 2
  model.num_labels = 2
  model.classifier = cls_head_class(config=config) # Need to bring in a classification head for only 2 labels

model.cuda()
device = model.device 

train_sampler = RandomSampler(ConvEnt_train_tensor)
train_dataloader = DataLoader(ConvEnt_train_tensor, sampler=train_sampler, batch_size=batch_size)

# Set up optimizer
optimizer = AdamW(model.parameters(), lr=learning_rate)
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps = total_steps)

for epoch in range(epochs):
  print('[%s] Beginning epoch...' % str(epoch))
  epoch_loss, _ = train_epoch(model, optimizer, train_dataloader, device, seg_mode=True if 'roberta' not in mode else False)

print('[%s] Saving model checkpoint...' % str(epoch))
model_param_str = get_model_dir(model_name.replace('/','-'), 'ConvEnt', batch_size, learning_rate, epoch) + '_xval'
output_dir = os.path.join(DRIVE_PATH, 'saved_models', model_param_str)
if not os.path.exists(output_dir):
  os.makedirs(output_dir)
model = model.module if hasattr(model, 'module') else model
model.save_pretrained(output_dir)
tokenizer.save_vocabulary(output_dir)

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.weight', 'classif

[0] Beginning epoch...
[1] Beginning epoch...
[1] Saving model checkpoint...


('./saved_models/roberta-large_ConvEnt_1_1e-05_1_xval/vocab.json',
 './saved_models/roberta-large_ConvEnt_1_1e-05_1_xval/merges.txt')

## Test Models on Conversational Entailment

In [34]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import get_linear_schedule_with_warmup
from www.model.eval import evaluate, save_results, save_preds
from sklearn.metrics import accuracy_score
from www.utils import print_dict, get_model_dir

best_model = eval_model_dir


best_model = os.path.join(DRIVE_PATH, 'saved_models', best_model)

# Load the model
model = model_class.from_pretrained(best_model)
model.cuda()
device = model.device

# Select appropriate dataset
if 'cloze' in best_model:
  subtask = 'cloze'
elif 'order' in best_model:
  subtask = 'order'

test_sampler = SequentialSampler(ConvEnt_test_tensor)
test_dataloader = DataLoader(ConvEnt_test_tensor, sampler=test_sampler, batch_size=128)
test_dataset_name = '%s_%s' % ('ConvEnt', 'test')
test_ids = [str(ex['example_id']) for ex in ConvEnt_test]

print('Testing model: %s.' % best_model.split('/')[-1])

results, preds, labels = evaluate(model, test_dataloader, device, [(accuracy_score, 'accuracy')], seg_mode=True if 'roberta' not in mode else False)
save_results(results, best_model, test_dataset_name)
save_preds(test_ids, labels, preds, best_model, test_dataset_name)

print('Results (%s):' % p)
print_dict(results)

Some weights of the model checkpoint at ./saved_models/roberta-large_cloze_1_1e-05_5_0.0-0.4-0.4-0.2-0.0_tiered_pipeline_lc_ablate_attributes_states-logits were not used when initializing RobertaForSequenceClassification: ['pooler.dense.weight', 'pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./saved_models/roberta-large_cloze_1_1e-05_5_0.0-0.4-0.4-0.2-0.0_tiered_pipeline_lc_ablate_attributes_states-logits and are newly initi

Testing model: roberta-large_cloze_1_1e-05_5_0.0-0.4-0.4-0.2-0.0_tiered_pipeline_lc_ablate_attributes_states-logits.
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:01s.
Results (test):
{
  accuracy: 
    0.5290697674418605,
}




## Coherence Checks on Conversational Entailment

### Load and Featurize Span Data

In [35]:
from www.dataset.featurize import add_bert_features_ConvEnt, get_tensor_dataset
from www.dataset.prepro import get_ConvEnt_spans
import pickle
seq_length = 128

merged_file = os.path.join(DRIVE_PATH, 'all_data/ConvEnt/ConvEnt_test_annotation_merged2.json')
ConvEnt_test = json.load(open(merged_file))

ConvEnt_test = add_bert_features_ConvEnt(ConvEnt_test, tokenizer, seq_length, add_segment_ids=True)

if debug:
  ConvEnt_test = ConvEnt_test[:10]

# Some of the annotated examples are no longer in the test set :(
# ConvEnt_test = [ex for ex in ConvEnt_test if ex['id'] in test_ids]

# Make span versions of the datasets
ConvEnt_test_spans = get_ConvEnt_spans(ConvEnt_test)

# Add BERT features
ConvEnt_test_tensor = get_tensor_dataset(ConvEnt_test, label_key='label', add_segment_ids=True)
ConvEnt_test_spans_tensor = get_tensor_dataset(ConvEnt_test_spans, label_key='label', add_segment_ids=True)

### Load the Trained Model

Load the trained model we want to probe and select the appropriate dataset.

In [36]:
probe_model = eval_model_dir
probe_model = os.path.join(DRIVE_PATH, 'saved_models', probe_model)

# Load the model
model = model_class.from_pretrained(probe_model)
if torch.cuda.is_available():
  model.cuda()
device = model.device 

Some weights of the model checkpoint at ./saved_models/roberta-large_cloze_1_1e-05_5_0.0-0.4-0.4-0.2-0.0_tiered_pipeline_lc_ablate_attributes_states-logits were not used when initializing RobertaForSequenceClassification: ['pooler.dense.weight', 'pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./saved_models/roberta-large_cloze_1_1e-05_5_0.0-0.4-0.4-0.2-0.0_tiered_pipeline_lc_ablate_attributes_states-logits and are newly initi

#### Load Trained Model's Base Predictions

For comparison, we also want the preds and labels for the previous level.

In [37]:
from www.model.eval import load_preds
from www.utils import print_dict

preds_base = {}
preds_base['test'] = load_preds(os.path.join(probe_model, 'preds_ConvEnt_test.tsv'))
print(preds_base['test'].keys())

dict_keys(['73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '272', '273', '274', '275', '285', '286', '287', '288', '289', '290', '291', '292', '293', '294', '295', '296', '297', '298', '299', '300', '301', '302', '303', '304', '312', '313', '314', '315', '316', '317', '318', '319', '320', '321', '322', '323', '324', '325', '326', '327', '328', '403', '404', '405', '406', '407', '408', '409', '410', '411', '412', '413', '414', '415', '416', '417', '418', '419', '420', '421', '422', '593', '594', '595', '596', '597', '598', '599', '600', '601', '602', '603', '604', '605', '606', '607', '608', '609', '610', '611', '612', '613', '614', '615', '616', '617', '618', '619', '620', '730', '731', '732', '733', '734', '735', '736', '737', '738', '739', '740', '741', '742', '743', '744', '745', '746', '747', '748', '749', '750', '751', '752', '753', '754', '755', '808', '809', '810', '811', 

### Check a Model

Will print out strict and lenient coherence metrics.

In [38]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from www.model.eval import evaluate, save_results, save_preds, list_comparison
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
metrics = [(accuracy_score, 'accuracy'), (precision_score, 'precision'), (recall_score, 'recall'), (f1_score, 'f1')]
import numpy as np
from www.utils import print_dict

def is_polarized(smax, thres):
  return (abs(smax[0] - smax[1]) >= thres)

print('Testing model: %s.' % probe_model)

all_results = {}
p = 'test'

p_dataset = ConvEnt_test_spans
p_tensor_dataset = ConvEnt_test_spans_tensor
p_sampler = SequentialSampler(p_tensor_dataset)
p_dataloader = DataLoader(p_tensor_dataset, sampler=p_sampler, batch_size=512)
p_dataset_name = '%s_spans_%s' % ('ConvEnt', p)
p_dataset_name_co = '%s_consistent_%s' % ('ConvEnt', p)
p_dataset_name_bp = '%s_breakpoints_%s' % ('ConvEnt', p)
p_dataset_name_ev = '%s_evidence_%s' % ('ConvEnt', p)
p_dataset_name_coh = '%s_coherent_%s' % ('ConvEnt', p)
p_ids = [str(ex['example_id']) for ex in ConvEnt_test_spans]
p_labels = [ex['label'] for ex in ConvEnt_test_spans]

# Get span preds and save metrics
results, preds, labels = evaluate(model, p_dataloader, device, metrics, seg_mode=True if 'roberta' not in mode else False)
save_results(results, probe_model, p_dataset_name)
save_preds(p_ids, labels, preds, probe_model, p_dataset_name)

# Convert substory preds into breakpoint preds for each example
ids_base = [str(ex['example_id']) for ex in ConvEnt_test]

id_to_pred = {k: v for k,v in zip(p_ids, preds)}
id_to_label = {k: v for k,v in zip(p_ids, p_labels)}

preds_entailment = []
labels_entailment = []
preds_consistent = []
preds_breakpoint = []
labels_breakpoint = []
preds_evidence = []
labels_evidence = []    
span_accuracies = []
span_accuracies_strict = []
preds_coherent = []

for i, exid in enumerate(ids_base):
  ex = ConvEnt_test[i]
  ex['length'] = len(ex['turns'])

  label_entailment = preds_base[p][exid]['label']
  pred_entailment = preds_base[p][exid]['pred']
  labels_entailment.append(label_entailment)
  preds_entailment.append(pred_entailment)

  # Get ground truth breakpoint and evidence
  label_breakpoint = ex['conflict_pair'][1] if ex['conflict_pair'] is not None and len(ex['conflict_pair']) > 0 else 0
  labels_breakpoint.append(label_breakpoint)
  if label_breakpoint > 0:
    label_ev = ex['conflict_pair'][0]
  else:
    label_ev = -1
  labels_evidence.append(label_ev)

  # Check consistency - any span that entails the hypothesis' superspans should also entail
  pred_consistent = True
  span_accuracy = 0.0
  span_accuracy_strict = 0.0
  pred_coherent = True
  
  no_spans = 0
  for sp1 in range(ex['length']):
    if not pred_consistent:
      break

    for sp2 in range(sp1, ex['length']):
      if not pred_consistent:
        break

      span_pred = id_to_pred[exid + '-sp%s:%s' % (str(sp1), str(sp2))]
      span_label = id_to_label[exid + '-sp%s:%s' % (str(sp1), str(sp2))]

      if span_pred == span_label:
        span_accuracy += 1.0
        if label_entailment == pred_entailment:
            span_accuracy_strict += 1.0
      else:
        pred_coherent = False
      no_spans += 1
      # print('%s:%s\t%s\t(%s, %s)' % (str(sp1), str(sp2), str(span_pred), str(span_prob[0]), str(span_prob[1])))      

      if span_pred == 1:
        if pred_entailment == 1:
          for sp3 in range(sp1+1):
            if not pred_consistent:
              break

            for sp4 in range(sp2, ex['length']):
              if not pred_consistent:
                break

              sspan_pred = id_to_pred[exid + '-sp%s:%s' % (str(sp3), str(sp4))]

              if sspan_pred == 0:
                pred_consistent = False
                break
        elif pred_entailment == 0:
          pred_consistent = False

  preds_consistent.append(1 if pred_consistent else 0)
  span_accuracies.append(span_accuracy / no_spans)
  span_accuracies_strict.append(span_accuracy_strict / no_spans)
  preds_coherent.append(1 if pred_coherent else 0)

  # Check pred. breakpoint (verifiability) - will be first sentence where the model prediction becomes polarized, i.e., confidence > threshold
  pred_breakpoint = 0 # For now, 0 means -1, i.e., stories are entirely plausible - this shouldn't happen but it will (inconsistent?)
  for ss in range(1, ex['length']):
    if id_to_pred[exid + '-sp%s:%s' % (str(0), str(ss))] == 1:
      pred_breakpoint = ss
      break
  preds_breakpoint.append(pred_breakpoint)

  # Check pred. evidence (verifiability)
  if pred_breakpoint > 0:
    pred_evidence = -1 
    for ss in range(0, pred_breakpoint+1):
      if id_to_pred[exid + '-sp%s:%s' % (str(0), str(ss))] == 1:
        pred_evidence = ss
  else:
    pred_evidence = -1 # This should never happen - it would be inconsistent if it did
  preds_evidence.append(pred_evidence)

# Calculate tiered accuracy for model
acc = 0
acc_con = 0
acc_con_vbp = 0
acc_con_vbp_vev = 0
no_ex = len(ids_base)
for p_plaus, l_plaus, con, p_bp, l_bp, p_ev, l_ev in zip(preds_entailment, labels_entailment, preds_consistent, preds_breakpoint, labels_breakpoint, preds_evidence, labels_evidence):
  # Accuracy
  if p_plaus == l_plaus:
    acc += 1
    
    # Consistency
    if con == 1:
      acc_con += 1
    
      # Verifiability (breakpoint)
      if p_bp == l_bp:
        acc_con_vbp += 1

        # Verifiability (evidence)
        if p_ev == l_ev:
          acc_con_vbp_vev += 1

acc /= no_ex
acc_con /= no_ex
acc_con_vbp /= no_ex
acc_con_vbp_vev /= no_ex

# all_results['acc'] = acc
# all_results['acc_con'] = acc_con
# all_results['acc_con_vbp'] = acc_con_vbp
# all_results['acc_con_vbp_vev'] = acc_con_vbp_vev
# all_results['span_accuracy'] = np.mean(span_accuracies)

all_results['lenient_coherence'] = np.mean(span_accuracies_strict)
all_results['strict_coherence'] = np.mean(preds_coherent)

best_preds_entailment = preds_entailment
best_preds_consistent = preds_consistent
best_preds_breakpoint = preds_breakpoint
best_preds_evidence = preds_evidence
best_preds_coherent = preds_coherent
    
print('\nPARTITION: %s' % p)
print_dict(all_results)

# Save preds for breakpoint and evidence
save_preds(ids_base, np.array(labels_breakpoint), best_preds_breakpoint, probe_model, p_dataset_name_bp)
save_preds(ids_base, np.array(labels_evidence), best_preds_evidence, probe_model, p_dataset_name_ev)
save_preds(ids_base, np.array([1 for p in best_preds_coherent]), best_preds_coherent, probe_model, p_dataset_name_coh)

p_dataset_name_agg = '%s_tiers_agg_nostates_lenient_%s' % ('ConvEnt', p)
save_results(all_results, probe_model, p_dataset_name_agg)

Testing model: ./saved_models/roberta-large_cloze_1_1e-05_5_0.0-0.4-0.4-0.2-0.0_tiered_pipeline_lc_ablate_attributes_states-logits.
	Beginning evaluation...
		Running prediction...
		Computing metrics...
	Finished evaluation in 0:00:04s.

PARTITION: test
{
  lenient_coherence: 
    0.3009151511203495,
  strict_coherence: 
    0.4127906976744186,
}




# ART Results

Code for the coherence experiments on ART.

In [39]:
task_name = 'art'
if task_name != 'art':
  raise ValueError('Please configure task_name in first cell to "art" to run ART results!')

## Load ART dataset

ART is originally gathered from [HuggingFace datasets](https://huggingface.co/docs/datasets/), but we added some of our own annotations for the coherence evaluation.

In [40]:
import os
fname = os.path.join(DRIVE_PATH, 'all_data/ART/art.json')
with open(fname, 'r') as f:
  art = json.load(f)

## Train Models on ART

### Featurize ART

### Train Models

Train models on ART. Note that ART's test set is not public, so we cannot test the model (unless we submit to their [leaderboard](https://leaderboard.allenai.org/anli/submissions/public)).

#### Configure Hyperparameters

#### Grid Search

Delete non-best model checkpoints:

: 

## Coherence Checks on ART

### Load and Featurize Span Data

### Load the Trained Model

Load the trained model we want to probe and select the appropriate dataset.

#### Load Trained Model's Two-Story Classification Predictions

For comparison, we also want the preds and labels for the previous level.

### Calculate Coherence Metrics

As ART is a multiple-choice task, we will need to tune the confidence threshold $\rho$. This code will print out the strict and lenient coherence metrics, as well as the chosen $\rho$ (`best_threshold`).