# Natural language inference: task and datasets

In [None]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"

## Contents

1. [Overview](#Overview)
1. [Our version of the task](#Our-version-of-the-task)
1. [Primary resources](#Primary-resources)
1. [NLI model landscape](#NLI-model-landscape)
1. [Set-up](#Set-up)
1. [Properties of the corpora](#Properties-of-the-corpora)
  1. [SNLI properties](#SNLI-properties)
  1. [MultiNLI properties](#MultiNLI-properties)
1. [Working with SNLI and MultiNLI](#Working-with-SNLI-and-MultiNLI)
  1. [Readers](#Readers)
  1. [The NLIExample class](#The-NLIExample-class)
  1. [Labels](#Labels)
  1. [Tree representations](#Tree-representations)
1. [Annotated MultiNLI subsets](#Annotated-MultiNLI-subsets)
1. [Other NLI datasets](#Other-NLI-datasets)

## Overview

Natural Language Inference (NLI) is the task of predicting the logical relationships between words, phrases, and sentences, and potentially between longer units such as paragraphs and documents. Such relationships are crucial for all kinds of reasoning in natural language: arguing, debating, problem solving, summarization, and so forth.

[Dagan et al. (2006)](https://link.springer.com/chapter/10.1007%2F11736790_9), one of the foundational papers on NLI (also called Recognizing Textual Entailment; RTE), make a case for the generality of this task in NLU:

> It seems that major inferences, as needed by multiple applications, can indeed be cast in terms of textual entailment. For example, __a QA system__ has to identify texts that entail a hypothesized answer. [...] Similarly, for certain __Information Retrieval__ queries the combination of semantic concepts and relations denoted by the query should be entailed from relevant retrieved documents. [...] In __multi-document summarization__ a redundant sentence, to be omitted from the summary, should be entailed from other sentences in the summary. And in __MT evaluation__ a correct translation should be semantically equivalent to the gold standard translation, and thus both translations should entail each other. Consequently, we hypothesize that textual entailment recognition is a suitable generic task for evaluating and comparing applied semantic inference models. Eventually, such efforts can promote the development of entailment recognition "engines" which may provide useful generic modules across applications.

## Our version of the task

Our NLI data will look like this:

| Premise | Relation      | Hypothesis |
|---------|---------------|------------|
| turtle  | contradicts   | linguist   |
| A turtle danced | entails | A turtle moved |
| Every reptile danced | entails | Every turtle moved |
| Some turtles walk | contradicts | No turtles move |
| James Byron Dean refused to move without blue jeans | entails | James Dean didn't dance without pants |

In the [word-entailment bakeoff](nli_wordentail_bakeoff.ipynb), we looked at a special case of this where the premise and hypothesis are single words. This notebook begins to introduce the problem of NLI more fully.
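As a rough illustration of this format, each example can be treated as a premise, a hypothesis, and one of three labels. The sketch below uses the label names that appear in the corpora's `gold_label` field later in this notebook. `NLIPair` and `LABELS` are hypothetical names used only here; they are not part of the course's `nli` module.

```python
from dataclasses import dataclass

# Label names as they appear in the SNLI/MultiNLI gold_label field.
LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class NLIPair:
    """Hypothetical container for one premise/hypothesis pair."""
    premise: str
    hypothesis: str
    label: str

    def __post_init__(self):
        # Reject anything outside the three-way label set.
        if self.label not in LABELS:
            raise ValueError(f"Unknown label: {self.label!r}")


pairs = [
    NLIPair("A turtle danced", "A turtle moved", "entailment"),
    NLIPair("Some turtles walk", "No turtles move", "contradiction"),
]

for pair in pairs:
    print(f"{pair.premise!r} -> {pair.label} -> {pair.hypothesis!r}")
```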

## Primary resources


We're going to focus on two large, human-labeled, relatively naturalistic entailment corpora:

* [The Stanford Natural Language Inference corpus (SNLI)](https://nlp.stanford.edu/projects/snli/)
* [The Multi-Genre NLI Corpus (MultiNLI)](https://www.nyu.edu/projects/bowman/multinli/)

The first was collected by a group at Stanford, led by [Sam Bowman](https://www.nyu.edu/projects/bowman/), and the second was collected by a group at NYU, also led by [Sam Bowman](https://www.nyu.edu/projects/bowman/). They have the same format and were crowdsourced using the same basic methods. However, SNLI is entirely focused on image captions, whereas MultiNLI includes a greater range of contexts.

This notebook presents tools for working with these corpora. The [second notebook in the unit](nli_02_models.ipynb) concerns models of NLI.

## NLI model landscape

<img src="fig/nli-model-landscape.png" width=800 />

## Set-up

* As usual, you need to be fully set up to work with [the CS224u repository](https://github.com/cgpotts/cs224u/).

* If you haven't already, download [the course data](http://web.stanford.edu/class/cs224u/data/data.zip), unpack it, and place it in the directory containing the course repository – the same directory as this notebook. (If you want to put it somewhere else, change `DATA_HOME` below.)

In [1]:
import nli
import os
import pandas as pd
import random

In [2]:
DATA_HOME = os.path.join("data", "nlidata")

SNLI_HOME = os.path.join(DATA_HOME, "snli_1.0")

MULTINLI_HOME = os.path.join(DATA_HOME, "multinli_1.0")

ANNOTATIONS_HOME = os.path.join(DATA_HOME, "multinli_1.0_annotations")

## Properties of the corpora

For both SNLI and MultiNLI, MTurk annotators were presented with premise sentences and asked to write new sentences (hypotheses) that were entailed by, contradicted, or were neutral with respect to the premise. A subset of the examples was then validated by four additional MTurk annotators.
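For the validated subset, five labels are available per example: the author's and the four validators'. As described in the SNLI paper, the gold label is the one chosen by at least three of the five annotators, and examples with no such majority receive the placeholder label `-`. Here is a minimal sketch of that rule; `gold_label_from` is a hypothetical helper, not a function from the `nli` module, and it is meant only for examples that have all five labels.

```python
from collections import Counter


def gold_label_from(annotator_labels, min_votes=3):
    """Return the label with at least `min_votes` votes, else '-'.

    Intended for validated examples with five annotator labels.
    """
    if not annotator_labels:
        return "-"
    label, votes = Counter(annotator_labels).most_common(1)[0]
    return label if votes >= min_votes else "-"


# Three of five annotators agree, so the majority label wins.
print(gold_label_from(
    ["neutral", "neutral", "entailment", "neutral", "contradiction"]))

# No label reaches three votes, so the example is left unlabeled.
print(gold_label_from(
    ["neutral", "neutral", "entailment", "entailment", "contradiction"]))
```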

### SNLI properties

* All the premises are captions from the [Flickr30K corpus](http://shannon.cs.illinois.edu/DenotationGraph/).

* Some of the sentences rather depressingly reflect stereotypes ([Rudinger et al. 2017](https://aclanthology.coli.uni-saarland.de/papers/W17-1609/w17-1609)).

* 550,152 train examples; 10K dev; 10K test

* Mean length in tokens:
  * Premise: 14.1
  * Hypothesis: 8.3

* Clause-types
  * Premise S-rooted: 74%
  * Hypothesis S-rooted: 88.9%

* Vocab size: 37,026

* 56,951 examples validated by four additional annotators
  * 58.3% examples with unanimous gold label
  * 91.2% of gold labels match the author's label
  * 0.70 overall Fleiss kappa
  
* Top scores currently around 89%.  
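
The Fleiss kappa figure above measures agreement among a fixed number of annotators per example, corrected for chance agreement. As a reminder of how the statistic is defined, here is a small pure-Python sketch. `fleiss_kappa` is a hypothetical helper, and the toy ratings are invented for illustration; they are not drawn from SNLI.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a list of per-item category counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must have the same total number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("Every item needs the same number of ratings.")

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement, computed from the overall category proportions.
    n_categories = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    return (p_observed - p_expected) / (1 - p_expected)


# Toy data: columns are (entailment, contradiction, neutral), five raters each.
toy_counts = [
    [5, 0, 0],
    [0, 5, 0],
    [0, 0, 5],
    [3, 1, 1],
]
kappa = fleiss_kappa(toy_counts)
print(round(kappa, 3))
```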

### MultiNLI properties


* Train premises drawn from five genres: 
  1. Fiction: works from 1912–2010 spanning many genres
  1. Government: reports, letters, speeches, etc., from government websites
  1. The _Slate_ website
  1. Telephone: the Switchboard corpus
  1. Travel: Berlitz travel guides
  
* Additional genres just for dev and test (the __mismatched__ condition): 
  1. The 9/11 report
  1. Face-to-face: The Charlotte Narrative and Conversation Collection
  1. Fundraising letters
  1. Non-fiction from Oxford University Press
  1. _Verbatim_ articles about linguistics

* 392,702 train examples; 20K dev; 20K test

* 19,647 examples validated by four additional annotators
  * 58.2% examples with unanimous gold label
  * 92.6% of gold labels match the author's label
  
* Test-set evaluation is available through a Kaggle competition.
  * Top matched scores currently around 0.81.
  * Top mismatched scores currently around 0.83.

## Working with SNLI and MultiNLI

### Readers

The following readers should make it easy to work with these corpora:
    
* `nli.SNLITrainReader`
* `nli.SNLIDevReader`
* `nli.MultiNLITrainReader`
* `nli.MultiNLIMatchedDevReader`
* `nli.MultiNLIMismatchedDevReader`

The base class is `nli.NLIReader`, which should make it easy to define additional readers.

If you changed `DATA_HOME`, `SNLI_HOME`, or `MULTINLI_HOME` above, then you'll need to call these readers with `dirname` as an argument, where `dirname` is your `SNLI_HOME` or `MULTINLI_HOME`, as appropriate.

Because the datasets are so large, it is often useful to be able to randomly sample from them. All of the reader classes allow this with their keyword argument `samp_percentage`. For example, the following samples approximately 10% of the examples from the SNLI training set:

In [3]:
nli.SNLITrainReader(SNLI_HOME, samp_percentage=0.10, random_state=42)

"NLIReader({'src_filename': 'data/nlidata/snli_1.0/snli_1.0_train.jsonl', 'filter_unlabeled': True, 'samp_percentage': 0.1, 'random_state': 42})

The precise number of examples will vary somewhat because of the way the sampling is done. (Here, we give up precision in the number of cases returned in exchange for efficiency; see the implementation for details.)
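
One way to see why the count varies: if a reader decides independently for each line whether to keep it, the number of kept lines is a random quantity centered on the requested fraction, and the file never has to be counted or loaded into memory first. The sketch below illustrates that general strategy on an in-memory stand-in for a JSONL file. It is an assumption about the approach rather than a copy of the `nli` implementation, and `sample_jsonl` is a hypothetical name.

```python
import io
import json
import random


def sample_jsonl(lines, samp_percentage, random_state=None):
    """Yield parsed records, keeping each line with probability `samp_percentage`."""
    rng = random.Random(random_state)
    for line in lines:
        # Independent keep/skip decision for every line.
        if rng.random() <= samp_percentage:
            yield json.loads(line)


# Stand-in for a corpus file: 1,000 tiny JSON records, one per line.
fake_file = io.StringIO(
    "\n".join(json.dumps({"pairID": str(i)}) for i in range(1000)))

sample = list(sample_jsonl(fake_file, samp_percentage=0.10, random_state=42))
print(len(sample))  # Roughly 100; the exact count depends on the random draws.
```

Fixing `random_state` makes the sample reproducible, but a different seed will generally give a slightly different number of examples.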

### The NLIExample class

All of the readers have a `read` method that yields `NLIExample` instances, which have the following attributes:

* __annotator_labels__: `list of str`
* __captionID__: `str`
* __gold_label__: `str`
* __pairID__: `str`
* __sentence1__: `str`
* __sentence1_binary_parse__: `nltk.tree.Tree`
* __sentence1_parse__: `nltk.tree.Tree`
* __sentence2__: `str`
* __sentence2_binary_parse__: `nltk.tree.Tree`
* __sentence2_parse__: `nltk.tree.Tree`

In [4]:
snli_iterator = iter(nli.SNLITrainReader(SNLI_HOME).read())

In [5]:
snli_ex = next(snli_iterator)

In [6]:
print(snli_ex)

A person on a horse jumps over a broken down airplane.
neutral
A person is training his horse for a competition.


In [7]:
snli_ex

"NLIExample({'annotator_labels': ['neutral'], 'captionID': '3416050480.jpg#4', 'gold_label': 'neutral', 'pairID': '3416050480.jpg#4r1n', 'sentence1': 'A person on a horse jumps over a broken down airplane.', 'sentence1_binary_parse': Tree('X', [Tree('X', [Tree('X', ['A', 'person']), Tree('X', ['on', Tree('X', ['a', 'horse'])])]), Tree('X', [Tree('X', ['jumps', Tree('X', ['over', Tree('X', ['a', Tree('X', ['broken', Tree('X', ['down', 'airplane'])])])])]), '.'])]), 'sentence1_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('PP', [Tree('IN', ['on']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['horse'])])])]), Tree('VP', [Tree('VBZ', ['jumps']), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['broken']), Tree('JJ', ['down']), Tree('NN', ['airplane'])])])]), Tree('.', ['.'])])]), 'sentence2': 'A person is training his horse for a competition.', 'sentence2_binary_parse': Tree('X', [Tree('X', ['A', 'perso

### Labels

In [8]:
snli_labels = pd.Series(
    [ex.gold_label for ex in nli.SNLITrainReader(SNLI_HOME, filter_unlabeled=False).read()])

In [9]:
snli_labels.value_counts()

entailment       183416
contradiction    183187
neutral          182764
-                   785
dtype: int64

In [10]:
multinli_labels = pd.Series(
    [ex.gold_label for ex in nli.MultiNLITrainReader(MULTINLI_HOME, filter_unlabeled=False).read()])

In [11]:
multinli_labels.value_counts()

contradiction    130903
neutral          130900
entailment       130899
dtype: int64
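
These counts can be checked against the corpus sizes listed under "Properties of the corpora": the SNLI counts, including the 785 unlabeled `-` cases, should sum to 550,152, and the MultiNLI counts to 392,702. The snippet below redoes that arithmetic and computes each label's share, using only the numbers printed above. `label_shares` is a hypothetical helper.

```python
snli_counts = {
    "entailment": 183416,
    "contradiction": 183187,
    "neutral": 182764,
    "-": 785,
}

multinli_counts = {
    "contradiction": 130903,
    "neutral": 130900,
    "entailment": 130899,
}


def label_shares(counts):
    """Map each label to its fraction of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


print(sum(snli_counts.values()))      # 550152
print(sum(multinli_counts.values()))  # 392702

for label, share in label_shares(snli_counts).items():
    print(f"{label:>13}: {share:.2%}")
```

Both training sets are close to balanced across the three labels, which is worth keeping in mind when interpreting accuracy figures.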

### Tree representations

Both corpora contain __three versions__ of the premise and hypothesis sentences:

1. Regular string representations of the data
1. Unlabeled binary parses 
1. Labeled parses

In [12]:
snli_ex.sentence1

'A person on a horse jumps over a broken down airplane.'

The binary parses lack node labels. So that `nltk.tree.Tree` can represent them, every node is given the label `X`:

In [None]:
snli_ex.sentence1_binary_parse
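
To make the relabeling concrete, here is a standalone sketch that reads an unlabeled bracketed string into nested lists and gives every internal node the label `X`. It uses only the standard library rather than `nltk`, and the input string is typed by hand to mirror the binary parse of the example above. `parse_binary` and `leaves` are hypothetical helpers, not functions from the `nli` module.

```python
def parse_binary(s, label="X"):
    """Parse a space-separated unlabeled bracketing into [label, child, ...] lists."""
    tokens = s.split()
    pos = 0

    def parse_node():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # A leaf word.
        node = [label]    # Internal nodes get the dummy label.
        while tokens[pos] != ")":
            node.append(parse_node())
        pos += 1          # Consume the closing ")".
        return node

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("Trailing input after the first complete tree.")
    return tree


def leaves(tree):
    """Return the words at the leaves, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


binary_parse = (
    "( ( ( A person ) ( on ( a horse ) ) ) "
    "( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )")

tree = parse_binary(binary_parse)
print(tree[1])       # The subtree for "A person on a horse".
print(leaves(tree))  # The tokenized sentence.
```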

Here's the full parse tree with syntactic categories:

In [None]:
import matplotlib

snli_ex.sentence1_parse

The leaves of either tree are a tokenized version of the example:

In [13]:
snli_ex.sentence1_parse.leaves()

['A',
 'person',
 'on',
 'a',
 'horse',
 'jumps',
 'over',
 'a',
 'broken',
 'down',
 'airplane',
 '.']

## Annotated MultiNLI subsets

MultiNLI includes additional annotations for a subset of the dev examples. The goal is to help people understand how well their models are doing on crucial NLI-related linguistic phenomena.

In [14]:
matched_ann_filename = os.path.join(
    ANNOTATIONS_HOME,
    "multinli_1.0_matched_annotations.txt")

mismatched_ann_filename = os.path.join(
    ANNOTATIONS_HOME, 
    "multinli_1.0_mismatched_annotations.txt")

In [15]:
def view_random_example(annotations, random_state=42):
    """Print one randomly chosen annotated example."""
    # Seed the global RNG so that the same example is shown on every run.
    random.seed(random_state)
    pairid, ann_ex = random.choice(list(annotations.items()))
    ex = ann_ex['example']
    print("pairID: {}".format(pairid))
    print(ann_ex['annotations'])
    print(ex.sentence1)
    print(ex.gold_label)
    print(ex.sentence2)

In [16]:
matched_ann = nli.read_annotated_subset(matched_ann_filename, MULTINLI_HOME)

In [17]:
view_random_example(matched_ann)

pairID: 63218c
[]
Recently, however, I have settled down and become decidedly less experimental.
contradiction
I am still as experimental as ever, and I am always on the move.


## Other NLI datasets

* [The FraCaS textual inference test suite](http://www-nlp.stanford.edu/~wcmac/downloads/) is a smaller, hand-built dataset that is great for evaluating a model's ability to handle complex logical patterns.

* [SemEval 2013](https://www.cs.york.ac.uk/semeval-2013/) had a wide range of interesting data sets for NLI and related tasks.

* [The SemEval 2014 semantic relatedness shared task](http://alt.qcri.org/semeval2014/task1/) used an NLI dataset called [Sentences Involving Compositional Knowledge (SICK)](http://alt.qcri.org/semeval2014/task1/index.php?id=data-and-tools). `data/nlidata` contains a parsed version of SICK created by [Sam Bowman](https://www.nyu.edu/projects/bowman/).

* [MedNLI](https://physionet.org/physiotools/mimic-code/mednli/) is specialized to the medical domain, using data derived from [MIMIC III](https://mimic.physionet.org).

* [XNLI](https://github.com/facebookresearch/XNLI) is a multilingual NLI dataset derived from MultiNLI.

* [Diverse Natural Language Inference Collection (DNC)](http://decomp.io/projects/diverse-natural-language-inference/) transforms existing annotations from other tasks into NLI problems for a diverse range of reasoning challenges.

* [SciTail](http://data.allenai.org/scitail/) is an NLI dataset derived from multiple-choice science exam questions and Web text.

* Models for NLI might be adapted for use with [the 30M Factoid Question-Answer Corpus](http://agarciaduran.org/).

* Models for NLI might be adapted for use with [the Penn Paraphrase Database](http://paraphrase.org/).