# Exploring the ImpPres Dataset

The [ImpPres](https://huggingface.co/datasets/facebook/imppres) dataset was introduced in [*"Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition"*](https://www.aclweb.org/anthology/2020.acl-main.768), Jeretic et al., ACL 2020, to investigate the pragmatic inference capabilities of NLI models.

It was created by synthesizing (premise, hypothesis) pairs from templates derived from pragmatic analysis, covering presuppositions triggered by different linguistic forms and implicatures of different types. Samples are grouped into "paradigms" (groups of related pairs) that test the predicted relation between premise and hypothesis under linguistic transformations. For example, given a pair (premise, presupposition), the paradigm also includes (negated-premise, presupposition), (question-premise, presupposition), (conditional-premise, presupposition), (premise, negated-presupposition), etc. If a model detects that the relation (premise, presupposition) is a form of "presupposition entailment", then it should label the other members of the group consistently with the linguistic predictions.
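To make the paradigm idea concrete, the following is a minimal sketch of a per-paradigm consistency check. The helper `paradigm_consistency`, the toy classifier `always_entailment`, and the third record are hypothetical; the field names mirror the presupposition sections, and the reading of label `0` as "entailment" is an assumption.

```python
from collections import defaultdict

# Toy records that only mimic the shape of ImpPres rows.
# The paradigm 1 record is invented for illustration.
samples = [
    {"paradigmID": 0, "premise": "The guest had found John.",
     "hypothesis": "John used to be in an unknown location.", "gold_label": 0},
    {"paradigmID": 0, "premise": "The guest had found John.",
     "hypothesis": "John didn't used to be in an unknown location.", "gold_label": 2},
    {"paradigmID": 1, "premise": "The host had lost Mary.",
     "hypothesis": "Mary used to be in a known location.", "gold_label": 0},
]

def paradigm_consistency(samples, predict):
    """Map each paradigmID to True when every pair in it is predicted correctly."""
    by_paradigm = defaultdict(list)
    for s in samples:
        by_paradigm[s["paradigmID"]].append(s)
    return {
        pid: all(predict(s["premise"], s["hypothesis"]) == s["gold_label"] for s in group)
        for pid, group in by_paradigm.items()
    }

def always_entailment(premise, hypothesis):
    # Hypothetical stand-in for a real NLI model: always answers 0 (assumed "entailment").
    return 0

result = paradigm_consistency(samples, always_entailment)
print(result)
```

A model that gets the unembedded pair right but fails on the negated hypothesis makes the whole paradigm inconsistent, which is the behaviour the dataset is designed to expose.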





In [45]:
from datasets import load_dataset
sections = ['implicature_connectives', 'implicature_gradable_adjective', 'implicature_gradable_verb', 'implicature_modals', 'implicature_numerals_10_100', 'implicature_numerals_2_3', 'implicature_quantifiers', 'presupposition_all_n_presupposition', 'presupposition_both_presupposition', 'presupposition_change_of_state', 'presupposition_cleft_existence', 'presupposition_cleft_uniqueness', 'presupposition_only_presupposition', 'presupposition_possessed_definites_existence', 'presupposition_possessed_definites_uniqueness', 'presupposition_question_presupposition']


imp_connectives = load_dataset("facebook/imppres", sections[0])


In [39]:
imp_connectives

DatasetDict({
    connectives: Dataset({
        features: ['premise', 'hypothesis', 'gold_label_log', 'gold_label_prag', 'spec_relation', 'item_type', 'trigger', 'lexemes'],
        num_rows: 1200
    })
})

In [40]:
imp_connectives['connectives'][0]

{'premise': 'These computers or dresses would irritate Veronica.',
 'hypothesis': "These computers and dresses wouldn't both irritate Veronica.",
 'gold_label_log': 1,
 'gold_label_prag': 0,
 'spec_relation': 'implicature_PtoN',
 'item_type': 'target',
 'trigger': 'connective',
 'lexemes': 'or - and'}
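The sample above carries two labels, `gold_label_log` and `gold_label_prag`, and they differ. A minimal sketch for isolating such rows follows. It assumes, from the column names only, that the first is the label under a purely logical reading and the second the label under a pragmatic reading; the helper `label_disagreements` and the second row are hypothetical.

```python
# Rows in the shape of the implicature sections.
# The first row copies the sample above; the second is invented for illustration.
rows = [
    {"premise": "These computers or dresses would irritate Veronica.",
     "hypothesis": "These computers and dresses wouldn't both irritate Veronica.",
     "gold_label_log": 1, "gold_label_prag": 0},
    {"premise": "These computers or dresses would irritate Veronica.",
     "hypothesis": "These computers or dresses would irritate Veronica.",
     "gold_label_log": 0, "gold_label_prag": 0},
]

def label_disagreements(rows):
    """Return the rows whose logical and pragmatic labels differ."""
    return [r for r in rows if r["gold_label_log"] != r["gold_label_prag"]]

diverging = label_disagreements(rows)
print(len(diverging), "of", len(rows), "rows have diverging labels")
```

Because a Hugging Face `Dataset` yields one dictionary per row when iterated, the same function should also accept `imp_connectives['connectives']` directly.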

In [41]:
pcos = load_dataset("facebook/imppres", "presupposition_change_of_state")

In [42]:
pcos

DatasetDict({
    change_of_state: Dataset({
        features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
        num_rows: 1900
    })
})

In [25]:
pcos['change_of_state'][0]

{'premise': 'The guest had found John.',
 'hypothesis': 'John used to be in an unknown location.',
 'trigger': 'unembedded',
 'trigger1': 'Not_In_Example',
 'trigger2': 'Not_In_Example',
 'presupposition': 'positive',
 'gold_label': 0,
 'UID': 'change_of_state',
 'pairID': '0e',
 'paradigmID': 0}

In [26]:
# Sort explicitly: the iteration order of a set is not guaranteed
print(sorted(set(s['paradigmID'] for s in pcos['change_of_state'])))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]


In [27]:
def get_paradigm(dataset, paradigm_id):
    return [s for s in dataset if s['paradigmID'] == paradigm_id]

In [28]:
get_paradigm(pcos['change_of_state'], 0)

[{'premise': 'The guest had found John.',
  'hypothesis': 'John used to be in an unknown location.',
  'trigger': 'unembedded',
  'trigger1': 'Not_In_Example',
  'trigger2': 'Not_In_Example',
  'presupposition': 'positive',
  'gold_label': 0,
  'UID': 'change_of_state',
  'pairID': '0e',
  'paradigmID': 0},
 {'premise': 'The guest had found John.',
  'hypothesis': "John didn't used to be in an unknown location.",
  'trigger': 'unembedded',
  'trigger1': 'Not_In_Example',
  'trigger2': 'Not_In_Example',
  'presupposition': 'negated',
  'gold_label': 2,
  'UID': 'change_of_state',
  'pairID': '1c',
  'paradigmID': 0},
 {'premise': 'The guest had found John.',
  'hypothesis': 'Peter used to be in an unknown location.',
  'trigger': 'unembedded',
  'trigger1': 'Not_In_Example',
  'trigger2': 'Not_In_Example',
  'presupposition': 'neutral',
  'gold_label': 1,
  'UID': 'change_of_state',
  'pairID': '2n',
  'paradigmID': 0},
 {'premise': "The guest hadn't found John.",
  'hypothesis': 'John 
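In the three complete rows of paradigm 0 shown above, the last character of `pairID` (`e`, `c`, `n`) lines up with `gold_label` (0, 2, 1). The sketch below checks that pattern on a list of rows. The mapping and the label names are assumptions inferred from those three rows only, and `suffix_mismatches` is a hypothetical helper.

```python
# Assumed mapping, inferred only from the rows '0e' -> 0, '1c' -> 2, '2n' -> 1.
SUFFIX_TO_LABEL = {"e": 0, "n": 1, "c": 2}
# Assumed human-readable names for the integer labels.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

def suffix_mismatches(rows):
    """Return the pairIDs whose suffix does not match gold_label under the assumed mapping."""
    return [
        r["pairID"] for r in rows
        if SUFFIX_TO_LABEL.get(r["pairID"][-1]) != r["gold_label"]
    ]

# The three complete rows printed above, reduced to the two relevant fields.
rows = [
    {"pairID": "0e", "gold_label": 0},
    {"pairID": "1c", "gold_label": 2},
    {"pairID": "2n", "gold_label": 1},
]

mismatches = suffix_mismatches(rows)
print(mismatches)
```

An empty list here only confirms the pattern for these rows; running the helper over a full section would show whether it holds more generally.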

In [29]:
pop = load_dataset("facebook/imppres", "presupposition_only_presupposition")

In [30]:
pop

DatasetDict({
    only_presupposition: Dataset({
        features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
        num_rows: 1900
    })
})

## Unify the Datasets

Your task is to create a new dataset that:
* Has all the rows from the presupposition sections of ImpPres:
    * ['presupposition_all_n_presupposition', 'presupposition_both_presupposition', 'presupposition_change_of_state', 'presupposition_cleft_existence', 'presupposition_cleft_uniqueness', 'presupposition_only_presupposition', 'presupposition_possessed_definites_existence', 'presupposition_possessed_definites_uniqueness', 'presupposition_question_presupposition']
* Has one more column, which holds the name of the section:
    * ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID', 'section']
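One way to sanity-check a unified dataset against the two requirements above is sketched here. `check_unified` and `make_row` are hypothetical helpers, and the toy rows stand in for real data. The function works on any iterable of row dictionaries, which is how a Hugging Face `Dataset` presents its rows when iterated.

```python
from collections import Counter

# Column list copied from the task description.
EXPECTED_COLUMNS = ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2',
                    'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID', 'section']

def check_unified(rows, expected_sections):
    """Return per-section row counts; raise ValueError on a schema or section mismatch."""
    counts = Counter()
    for row in rows:
        missing = [c for c in EXPECTED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row is missing columns: {missing}")
        if row['section'] not in expected_sections:
            raise ValueError(f"unexpected section: {row['section']}")
        counts[row['section']] += 1
    absent = [s for s in expected_sections if counts[s] == 0]
    if absent:
        raise ValueError(f"sections with no rows: {absent}")
    return dict(counts)

def make_row(section, i):
    # Hypothetical filler row: placeholder values, correct column set.
    row = {c: "" for c in EXPECTED_COLUMNS}
    row.update({'section': section, 'paradigmID': i, 'gold_label': 0})
    return row

toy_sections = ['presupposition_both_presupposition', 'presupposition_change_of_state']
toy_rows = [make_row(toy_sections[0], 0), make_row(toy_sections[0], 1),
            make_row(toy_sections[1], 0)]

counts = check_unified(toy_rows, toy_sections)
print(counts)
```

On the real data, the returned counts can be compared with the `num_rows` values reported when each section is loaded.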

## Task 2.1

In [46]:
from datasets import concatenate_datasets

# Load all presupposition sections
presupposition_sections = [
    'presupposition_all_n_presupposition', 
    'presupposition_both_presupposition', 
    'presupposition_change_of_state', 
    'presupposition_cleft_existence', 
    'presupposition_cleft_uniqueness', 
    'presupposition_only_presupposition', 
    'presupposition_possessed_definites_existence', 
    'presupposition_possessed_definites_uniqueness', 
    'presupposition_question_presupposition'
]

# Load and prepare each dataset with section information
dataset_list = []

for section_name in presupposition_sections:
    print(f"Loading {section_name}...")
    dataset = load_dataset("facebook/imppres", section_name)
    
    # Get the actual split name (it varies by section)
    split_name = list(dataset.keys())[0]
    data = dataset[split_name]
    
    # Add the section column; map() merges the returned dict into each example,
    # so only the new field needs to be returned
    data_with_section = data.map(lambda x: {'section': section_name})
    dataset_list.append(data_with_section)

# Concatenate all datasets into one unified dataset
unified_dataset = concatenate_datasets(dataset_list)

print(f"\nTotal samples in unified dataset: {len(unified_dataset)}")
print(f"Columns in unified dataset: {unified_dataset.column_names}")

# Show first sample from each section
print("\nFirst sample from each section:")
for section_name in presupposition_sections:
    section_data = unified_dataset.filter(lambda x: x['section'] == section_name)
    if len(section_data) > 0:
        print(f"\n{section_name}: {len(section_data)} samples")
        first_sample = section_data[0]
        print(f"Example: {first_sample['premise']} -> {first_sample['hypothesis']}")
        print(f"Label: {first_sample['gold_label']}")


Loading presupposition_all_n_presupposition...
Loading presupposition_both_presupposition...
Loading presupposition_change_of_state...
Loading presupposition_cleft_existence...
Loading presupposition_cleft_uniqueness...
Loading presupposition_only_presupposition...
Loading presupposition_possessed_definites_existence...
Loading presupposition_possessed_definites_uniqueness...
Loading presupposition_question_presupposition...

Total samples in unified dataset: 17100
Columns in unified dataset: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID', 'section']

First sample from each section:

presupposition_all_n_presupposition: 1900 samples
Example: All ten guys that proved to boast were divorcing. -> There are exactly ten guys that proved to boast.
Label: 0

presupposition_both_presupposition: 1900 samples
Example: Both gloves that aren't loosening do fray. -> There are exactly two gloves that aren't loosening
Label: 