# Exploring the ImpPres Dataset

The ImpPres dataset (https://huggingface.co/datasets/facebook/imppres) was introduced in *"Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition"* (Jeretic et al., ACL 2020, https://www.aclweb.org/anthology/2020.acl-main.768) to investigate the pragmatic inference capabilities of NLI models.

It was created by synthesizing (premise, hypothesis) pairs from templates derived from pragmatic analysis, covering presuppositions triggered by different linguistic forms and implicatures of different types. Samples are grouped into "paradigms" (groups of related pairs) that test the predicted relation between premise and hypothesis under linguistic transformations. For example, given a pair (premise, presupposition), the paradigm also includes (negated-premise, presupposition), (question-premise, presupposition), (conditional-premise, presupposition), (premise, negated-presupposition), etc. If a model detects that the relation (premise, presupposition) is a form of "presupposition entailment", it should label the other members of the group consistently with the linguistic predictions.
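To make paradigm-level consistency concrete, here is a minimal sketch. The three rows are copied from the first `change_of_state` paradigm displayed later in this notebook. `paradigm_is_consistent` and the `predict` callable are hypothetical names introduced only for illustration, and reading "consistent" as "matches the gold label on every member of the paradigm" is a simplifying assumption.

```python
# Toy paradigm: three (premise, hypothesis) pairs that share one premise.
# Rows copied from paradigm 0 of presupposition_change_of_state.
paradigm = [
    {"premise": "The guest had found John.",
     "hypothesis": "John used to be in an unknown location.",
     "gold_label": 0},
    {"premise": "The guest had found John.",
     "hypothesis": "John didn't used to be in an unknown location.",
     "gold_label": 2},
    {"premise": "The guest had found John.",
     "hypothesis": "Peter used to be in an unknown location.",
     "gold_label": 1},
]


def paradigm_is_consistent(rows, predict):
    """Return True if predict(premise, hypothesis) equals gold_label on every row.

    `predict` is a hypothetical stand-in for an NLI model.
    """
    return all(predict(r["premise"], r["hypothesis"]) == r["gold_label"] for r in rows)


def always_zero(premise, hypothesis):
    """Degenerate 'model' that predicts label 0 for every pair."""
    return 0


# Label 0 is right for the first pair only, so the paradigm as a whole fails.
print(paradigm_is_consistent(paradigm, always_zero))  # False
```

Scoring whole paradigms rather than single pairs is stricter: a model gets credit only when it handles every transformation of the same premise.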





In [None]:
from datasets import load_dataset
sections = ['implicature_connectives', 'implicature_gradable_adjective', 'implicature_gradable_verb', 'implicature_modals', 'implicature_numerals_10_100', 'implicature_numerals_2_3', 'implicature_quantifiers', 'presupposition_all_n_presupposition', 'presupposition_both_presupposition', 'presupposition_change_of_state', 'presupposition_cleft_existence', 'presupposition_cleft_uniqueness', 'presupposition_only_presupposition', 'presupposition_possessed_definites_existence', 'presupposition_possessed_definites_uniqueness', 'presupposition_question_presupposition']


imp_connectives = load_dataset("facebook/imppres", sections[0])


In [18]:
imp_connectives

DatasetDict({
    connectives: Dataset({
        features: ['premise', 'hypothesis', 'gold_label_log', 'gold_label_prag', 'spec_relation', 'item_type', 'trigger', 'lexemes'],
        num_rows: 1200
    })
})

In [4]:
imp_connectives['connectives'][0]

{'premise': 'These computers or dresses would irritate Veronica.',
 'hypothesis': "These computers and dresses wouldn't both irritate Veronica.",
 'gold_label_log': 1,
 'gold_label_prag': 0,
 'spec_relation': 'implicature_PtoN',
 'item_type': 'target',
 'trigger': 'connective',
 'lexemes': 'or - and'}

In [19]:
pcos = load_dataset("facebook/imppres", "presupposition_change_of_state")

In [20]:
pcos

DatasetDict({
    change_of_state: Dataset({
        features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
        num_rows: 1900
    })
})

In [21]:
pcos['change_of_state'][0]

{'premise': 'The guest had found John.',
 'hypothesis': 'John used to be in an unknown location.',
 'trigger': 'unembedded',
 'trigger1': 'Not_In_Example',
 'trigger2': 'Not_In_Example',
 'presupposition': 'positive',
 'gold_label': 0,
 'UID': 'change_of_state',
 'pairID': '0e',
 'paradigmID': 0}
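
The integer in `gold_label` is easier to read once decoded. The sketch below assumes the common NLI convention 0 = entailment, 1 = neutral, 2 = contradiction. This mapping is an assumption (it is not printed anywhere in this notebook), although it agrees with the paradigm rows shown further down, where the `positive` presupposition has label 0, the `neutral` one has label 1, and the `negated` one has label 2. `LABEL_NAMES` and `decode_label` are hypothetical helper names.

```python
# Assumed label convention; verify against the dataset card before relying on it.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def decode_label(sample, key="gold_label"):
    """Return a copy of `sample` with the integer label under `key` replaced by its name."""
    decoded = dict(sample)
    decoded[key] = LABEL_NAMES[sample[key]]
    return decoded


# First row of the change_of_state section, reduced to the relevant fields.
sample = {"premise": "The guest had found John.",
          "hypothesis": "John used to be in an unknown location.",
          "gold_label": 0}
print(decode_label(sample)["gold_label"])  # entailment
```

For the implicature sections, the same helper can be called with `key='gold_label_log'` or `key='gold_label_prag'`.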

In [22]:
print(sorted(set(pcos['change_of_state']['paradigmID'])))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]


In [23]:
def get_paradigm(dataset, paradigm_id):
    """Return all rows of `dataset` whose paradigmID equals `paradigm_id`."""
    return [s for s in dataset if s['paradigmID'] == paradigm_id]
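
`get_paradigm` rescans the whole split on every call. When many paradigms are needed, a single pass that groups rows by `paradigmID` may be more convenient. The following is a sketch, with `group_by_paradigm` a hypothetical helper name; it accepts any iterable of dict-like rows, and a small hand-made list with made-up values stands in for the real split.

```python
from collections import defaultdict


def group_by_paradigm(dataset):
    """Map each paradigmID to the list of rows that carry it, in one pass."""
    groups = defaultdict(list)
    for row in dataset:
        groups[row["paradigmID"]].append(row)
    return dict(groups)


# Stand-in rows (made-up values); with the real data, pass pcos['change_of_state'].
rows = [
    {"pairID": "a", "paradigmID": 0},
    {"pairID": "b", "paradigmID": 0},
    {"pairID": "c", "paradigmID": 1},
]
paradigms = group_by_paradigm(rows)
print({pid: len(members) for pid, members in paradigms.items()})  # {0: 2, 1: 1}
```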

In [24]:
get_paradigm(pcos['change_of_state'], 0)

[{'premise': 'The guest had found John.',
  'hypothesis': 'John used to be in an unknown location.',
  'trigger': 'unembedded',
  'trigger1': 'Not_In_Example',
  'trigger2': 'Not_In_Example',
  'presupposition': 'positive',
  'gold_label': 0,
  'UID': 'change_of_state',
  'pairID': '0e',
  'paradigmID': 0},
 {'premise': 'The guest had found John.',
  'hypothesis': "John didn't used to be in an unknown location.",
  'trigger': 'unembedded',
  'trigger1': 'Not_In_Example',
  'trigger2': 'Not_In_Example',
  'presupposition': 'negated',
  'gold_label': 2,
  'UID': 'change_of_state',
  'pairID': '1c',
  'paradigmID': 0},
 {'premise': 'The guest had found John.',
  'hypothesis': 'Peter used to be in an unknown location.',
  'trigger': 'unembedded',
  'trigger1': 'Not_In_Example',
  'trigger2': 'Not_In_Example',
  'presupposition': 'neutral',
  'gold_label': 1,
  'UID': 'change_of_state',
  'pairID': '2n',
  'paradigmID': 0},
 {'premise': "The guest hadn't found John.",
  'hypothesis': 'John 

In [25]:
pop = load_dataset("facebook/imppres", "presupposition_only_presupposition")

In [26]:
pop

DatasetDict({
    only_presupposition: Dataset({
        features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
        num_rows: 1900
    })
})

## Unify the Datasets

Your task is to create a new dataset that:
* Has all the rows from the presupposition sections of ImpPres:
    * ['presupposition_all_n_presupposition', 'presupposition_both_presupposition', 'presupposition_change_of_state', 'presupposition_cleft_existence', 'presupposition_cleft_uniqueness', 'presupposition_only_presupposition', 'presupposition_possessed_definites_existence', 'presupposition_possessed_definites_uniqueness', 'presupposition_question_presupposition']
* Has one additional column, `section`, holding the name of the section each row comes from:
    * ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID', 'section']
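
One possible way to approach this task is sketched below. It assumes that every presupposition section loads as a `DatasetDict` with exactly one split (as `pcos` and `pop` do above) and that all nine sections share the same features, so they can be concatenated. `PRESUPPOSITION_SECTIONS` and `load_unified_presuppositions` are hypothetical names. The function is only defined here, not called, because calling it downloads the data.

```python
PRESUPPOSITION_SECTIONS = [
    "presupposition_all_n_presupposition",
    "presupposition_both_presupposition",
    "presupposition_change_of_state",
    "presupposition_cleft_existence",
    "presupposition_cleft_uniqueness",
    "presupposition_only_presupposition",
    "presupposition_possessed_definites_existence",
    "presupposition_possessed_definites_uniqueness",
    "presupposition_question_presupposition",
]


def load_unified_presuppositions(sections=PRESUPPOSITION_SECTIONS):
    """Load each section, tag its rows with the section name, and concatenate."""
    # Imported inside the function so that defining it has no side effects.
    from datasets import concatenate_datasets, load_dataset

    parts = []
    for name in sections:
        dataset_dict = load_dataset("facebook/imppres", name)
        # Assumption: exactly one split per section, e.g. 'change_of_state'.
        (split,) = dataset_dict.values()
        # Add a constant 'section' column with one entry per row.
        parts.append(split.add_column("section", [name] * len(split)))
    # Concatenation requires identical features across all parts.
    return concatenate_datasets(parts)
```

If the assumptions hold, `unified = load_unified_presuppositions()` should return a single `Dataset` whose columns are the ten original ones plus `section`.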