# Exploring the ImpPres Dataset

The ImpPres dataset (https://huggingface.co/datasets/facebook/imppres) was introduced in *"Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition"* (Jeretic et al., ACL 2020, https://www.aclweb.org/anthology/2020.acl-main.768) to investigate the pragmatic inference capabilities of NLI models.

It was created by synthesizing (premise, hypothesis) pairs from templates derived from pragmatic analysis, covering presuppositions triggered by different linguistic forms and implicatures of different forms. Samples are grouped into "paradigms" (groups of related pairs) that test the predicted relation between premise and hypothesis under linguistic transformations. For example, given a pair (premise, presupposition), the paradigm also includes (negated-premise, presupposition), (question-premise, presupposition), (conditional-premise, presupposition), (premise, negated-presupposition), etc. If a model detects that the relation (premise, presupposition) is a form of "presupposition entailment", it should label the other members of the group consistently with the linguistic predictions.
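
The consistency requirement can be made concrete with a small sketch. It assumes a paradigm is available as a list of dicts with a `pairID` and a `gold_label`, and that model predictions are a mapping from `pairID` to a predicted label. The helper name `paradigm_is_consistent` and the toy rows are hypothetical and only illustrate the idea.

```python
def paradigm_is_consistent(paradigm, predictions):
    """Return True only if every pair in the paradigm gets its gold label."""
    return all(
        predictions.get(pair["pairID"]) == pair["gold_label"]
        for pair in paradigm
    )

# Toy paradigm (hypothetical rows, reduced to the two fields used here)
toy_paradigm = [
    {"pairID": "0e", "gold_label": 0},
    {"pairID": "1c", "gold_label": 2},
    {"pairID": "2n", "gold_label": 1},
]
all_correct = {"0e": 0, "1c": 2, "2n": 1}
one_wrong = {"0e": 0, "1c": 1, "2n": 1}

print(paradigm_is_consistent(toy_paradigm, all_correct))  # True
print(paradigm_is_consistent(toy_paradigm, one_wrong))    # False
```

Scoring at the paradigm level is stricter than per-pair accuracy: a single wrong member makes the whole group count as inconsistent.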





In [1]:
from datasets import load_dataset
sections = ['implicature_connectives', 'implicature_gradable_adjective', 'implicature_gradable_verb', 'implicature_modals', 'implicature_numerals_10_100', 'implicature_numerals_2_3', 'implicature_quantifiers', 'presupposition_all_n_presupposition', 'presupposition_both_presupposition', 'presupposition_change_of_state', 'presupposition_cleft_existence', 'presupposition_cleft_uniqueness', 'presupposition_only_presupposition', 'presupposition_possessed_definites_existence', 'presupposition_possessed_definites_uniqueness', 'presupposition_question_presupposition']


imp_connectives = load_dataset("facebook/imppres", sections[0])


In [2]:
imp_connectives

DatasetDict({
    connectives: Dataset({
        features: ['premise', 'hypothesis', 'gold_label_log', 'gold_label_prag', 'spec_relation', 'item_type', 'trigger', 'lexemes'],
        num_rows: 1200
    })
})

In [3]:
imp_connectives['connectives'][0]

{'premise': 'These computers or dresses would irritate Veronica.',
 'hypothesis': "These computers and dresses wouldn't both irritate Veronica.",
 'gold_label_log': 1,
 'gold_label_prag': 0,
 'spec_relation': 'implicature_PtoN',
 'item_type': 'target',
 'trigger': 'connective',
 'lexemes': 'or - and'}
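
The two label columns are integers. Below is a minimal sketch for making them readable, assuming the common NLI encoding 0 = entailment, 1 = neutral, 2 = contradiction. This mapping is an assumption that should be checked against `imp_connectives['connectives'].features`; the names `LABEL_NAMES` and `describe_labels` are hypothetical.

```python
# Assumed encoding; not confirmed by the notebook output above.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

def describe_labels(sample):
    """Return the logical and pragmatic gold labels of an implicature sample as strings."""
    return {
        "logical": LABEL_NAMES[sample["gold_label_log"]],
        "pragmatic": LABEL_NAMES[sample["gold_label_prag"]],
    }

# Fields copied from the sample displayed above
sample = {
    "premise": "These computers or dresses would irritate Veronica.",
    "hypothesis": "These computers and dresses wouldn't both irritate Veronica.",
    "gold_label_log": 1,
    "gold_label_prag": 0,
}
print(describe_labels(sample))
```

Under this assumed encoding, the hypothesis would be logically neutral with respect to the "or" premise but pragmatically entailed by it.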

In [4]:
pcos = load_dataset("facebook/imppres", "presupposition_change_of_state")


In [5]:
pcos

DatasetDict({
    change_of_state: Dataset({
        features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
        num_rows: 1900
    })
})

In [12]:
pcos['change_of_state'][5]

{'premise': "The guest hadn't found John.",
 'hypothesis': 'Peter used to be in an unknown location.',
 'trigger': 'negated',
 'trigger1': 'Not_In_Example',
 'trigger2': 'Not_In_Example',
 'presupposition': 'neutral',
 'gold_label': 1,
 'UID': 'change_of_state',
 'pairID': '5n',
 'paradigmID': 0}

In [13]:
# Sort explicitly: the iteration order of a set is not guaranteed
print(sorted({s['paradigmID'] for s in pcos['change_of_state']}))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]


In [14]:
def get_paradigm(dataset, paradigm_id):
    return [s for s in dataset if s['paradigmID'] == paradigm_id]
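
`get_paradigm` scans the whole split on every call. When many paradigms are needed, one alternative is to index the rows once. The sketch below uses a hypothetical helper `group_by_paradigm` and toy rows; it works on any iterable of dicts that have a `paradigmID` key.

```python
from collections import defaultdict

def group_by_paradigm(dataset):
    """Map each paradigmID to the list of samples that share it, in one pass."""
    groups = defaultdict(list)
    for sample in dataset:
        groups[sample["paradigmID"]].append(sample)
    return dict(groups)

# Toy rows (hypothetical, reduced to two fields)
toy_rows = [
    {"pairID": "0e", "paradigmID": 0},
    {"pairID": "1c", "paradigmID": 0},
    {"pairID": "0e", "paradigmID": 1},
]
paradigms = group_by_paradigm(toy_rows)
print(sorted(paradigms))   # [0, 1]
print(len(paradigms[0]))   # 2
```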

In [15]:
get_paradigm(pcos['change_of_state'], 0)

[{'premise': 'The guest had found John.',
  'hypothesis': 'John used to be in an unknown location.',
  'trigger': 'unembedded',
  'trigger1': 'Not_In_Example',
  'trigger2': 'Not_In_Example',
  'presupposition': 'positive',
  'gold_label': 0,
  'UID': 'change_of_state',
  'pairID': '0e',
  'paradigmID': 0},
 {'premise': 'The guest had found John.',
  'hypothesis': "John didn't used to be in an unknown location.",
  'trigger': 'unembedded',
  'trigger1': 'Not_In_Example',
  'trigger2': 'Not_In_Example',
  'presupposition': 'negated',
  'gold_label': 2,
  'UID': 'change_of_state',
  'pairID': '1c',
  'paradigmID': 0},
 {'premise': 'The guest had found John.',
  'hypothesis': 'Peter used to be in an unknown location.',
  'trigger': 'unembedded',
  'trigger1': 'Not_In_Example',
  'trigger2': 'Not_In_Example',
  'presupposition': 'neutral',
  'gold_label': 1,
  'UID': 'change_of_state',
  'pairID': '2n',
  'paradigmID': 0},
 {'premise': "The guest hadn't found John.",
  'hypothesis': 'John 

In [16]:
pop = load_dataset("facebook/imppres", "presupposition_only_presupposition")


In [17]:
pop

DatasetDict({
    only_presupposition: Dataset({
        features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
        num_rows: 1900
    })
})


## Unify the Datasets

Your task is to create a new dataset that:
* Has all the rows from the presupposition sections of ImpPres:
    * ['presupposition_all_n_presupposition', 'presupposition_both_presupposition', 'presupposition_change_of_state', 'presupposition_cleft_existence', 'presupposition_cleft_uniqueness', 'presupposition_only_presupposition', 'presupposition_possessed_definites_existence', 'presupposition_possessed_definites_uniqueness', 'presupposition_question_presupposition']
* Has one more column, `section`, which holds the name of the section each row came from:
    * ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID', 'section']

In [3]:
from datasets import load_dataset
import pandas as pd

# Presupposition sections
presupposition_sections = [
    'all_n_presupposition',
    'both_presupposition',
    'change_of_state',
    'cleft_existence',
    'cleft_uniqueness',
    'only_presupposition',
    'possessed_definites_existence',
    'possessed_definites_uniqueness',
    'question_presupposition'
]



# Load and merge all presupposition sections
dfs = []
for section in presupposition_sections:
    ds = load_dataset("facebook/imppres", 'presupposition_' + section)
    df = pd.DataFrame(ds[section])
    df["section"] = section
    dfs.append(df)

# Combine into one dataset
combined_df = pd.concat(dfs, ignore_index=True)

# Show shape and preview
print("Combined dataset shape:", combined_df.shape)
combined_df.head()


Combined dataset shape: (17100, 11)


| | premise | hypothesis | trigger | trigger1 | trigger2 | presupposition | gold_label | UID | pairID | paradigmID | section |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | All ten guys that proved to boast were divorcing. | There are exactly ten guys that proved to boast. | unembedded | Not_In_Example | Not_In_Example | positive | 0 | all_n_presupposition | 0e | 0 | all_n_presupposition |
| 1 | All ten guys that proved to boast were divorcing. | There are exactly eleven guys that proved to b... | unembedded | Not_In_Example | Not_In_Example | negated | 2 | all_n_presupposition | 1c | 0 | all_n_presupposition |
| 2 | All ten guys that proved to boast were divorcing. | There are exactly ten senators that proved to ... | unembedded | Not_In_Example | Not_In_Example | neutral | 1 | all_n_presupposition | 2n | 0 | all_n_presupposition |
| 3 | All ten guys that proved to boast weren't divo... | There are exactly ten guys that proved to boast. | negated | Not_In_Example | Not_In_Example | positive | 0 | all_n_presupposition | 3e | 0 | all_n_presupposition |
| 4 | All ten guys that proved to boast weren't divo... | There are exactly eleven guys that proved to b... | negated | Not_In_Example | Not_In_Example | negated | 2 | all_n_presupposition | 4c | 0 | all_n_presupposition |
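
The reported shape can be cross-checked by arithmetic. The two presupposition splits displayed earlier have 1900 rows and 10 columns each; if all nine sections have that size, adding the `section` column gives 9 × 1900 = 17100 rows and 11 columns. A per-section count is a quick way to confirm that no section was dropped or duplicated. The sketch below assumes the rows are available as a list of dicts (for example via `combined_df.to_dict('records')`); `rows_per_section` and the toy rows are hypothetical.

```python
from collections import Counter

def rows_per_section(rows):
    """Count how many rows each section contributed."""
    return Counter(row["section"] for row in rows)

# Toy rows (hypothetical, reduced to the 'section' field)
toy_rows = (
    [{"section": "change_of_state"}] * 3
    + [{"section": "only_presupposition"}] * 2
)
counts = rows_per_section(toy_rows)
print(counts["change_of_state"], counts["only_presupposition"])  # 3 2

# Arithmetic behind the reported shape, assuming 1900 rows per section
n_sections, rows_each, original_columns = 9, 1900, 10
print(n_sections * rows_each, original_columns + 1)  # 17100 11
```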


## For future tasks, use a third of the data

In [None]:
from datasets import load_dataset
import pandas as pd
import os
dfs = []
for section in presupposition_sections:
    ds = load_dataset("facebook/imppres", 'presupposition_' + section)
    df = pd.DataFrame(ds[section])
    
    # Use only one third of the rows
    subset_size = len(df) // 3
    df = df.iloc[:subset_size].copy()
    
    df["section"] = section
    dfs.append(df)

output_path="combined_presuppositions.csv"
combined_df = pd.concat(dfs, ignore_index=True)
combined_df.to_csv(output_path, index=False)

print(f"Combined dataset saved to {os.path.abspath(output_path)}")
print("Shape:", combined_df.shape)

Combined dataset saved to c:\Users\idol\Desktop\University\Simester F\LLM\ass2\nlp-with-llms-2025-hw2\combined_presuppositions.csv
Shape: (5697, 11)


In [5]:
from datasets import load_dataset
import pandas as pd
import os
dfs = []
for section in presupposition_sections:
    ds = load_dataset("facebook/imppres", 'presupposition_' + section)
    df = pd.DataFrame(ds[section])
    
    # Use only one thirtieth of the rows
    subset_size = len(df) // 30
    df = df.iloc[:subset_size].copy()
    
    df["section"] = section
    dfs.append(df)

output_path="small_combined_presuppositions.csv"
combined_df = pd.concat(dfs, ignore_index=True)
combined_df.to_csv(output_path, index=False)

print(f"Combined dataset saved to {os.path.abspath(output_path)}")
print("Shape:", combined_df.shape)

Combined dataset saved to c:\Users\idol\Desktop\University\Simester F\LLM\ass2\nlp-with-llms-2025-hw2\small_combined_presuppositions.csv
Shape: (567, 11)
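
Slicing with `iloc[:subset_size]` keeps the first rows of each section. The `change_of_state` split shown above has 1900 rows and 100 paradigm IDs, which suggests 19 pairs per paradigm. If that holds and rows are ordered by paradigm, slices of 633 or 63 rows are not multiples of 19, so the last paradigm in each slice would be cut short. Should whole paradigms matter for a later task, one option is to sample paradigm IDs instead of rows. The sketch below is an alternative under those assumptions, not the approach used in this notebook; `sample_paradigms`, its `fraction` and `seed` parameters, and the toy rows are hypothetical.

```python
import random

def sample_paradigms(rows, fraction, seed=0):
    """Keep every row of a random subset of paradigms (about `fraction` of them)."""
    ids = sorted({row["paradigmID"] for row in rows})
    n_keep = max(1, round(len(ids) * fraction))
    # A seeded generator keeps the subset reproducible across runs
    keep = set(random.Random(seed).sample(ids, n_keep))
    return [row for row in rows if row["paradigmID"] in keep]

# Toy data: 6 paradigms with 3 pairs each (hypothetical)
toy_rows = [{"paradigmID": p, "pairID": str(i)} for p in range(6) for i in range(3)]
subset = sample_paradigms(toy_rows, fraction=1 / 3)
print(len(subset))  # 6
```

With real data, the same function could be applied to `df.to_dict('records')` for each section before adding the `section` column.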
