SNLI-VE: Visual Entailment Dataset

Ning Xie, Farley Lai, Derek Doran, Asim Kadav

NEC Laboratories America, Inc.

SNLI-VE is the dataset proposed for the Visual Entailment (VE) task investigated in Visual Entailment Task for Visually-Grounded Language Learning (accepted to the NeurIPS 2018 ViGIL workshop). Refer to our full paper for detailed analysis and evaluations.



Rank      Test Accuracy           Source         Date
1         78.98%                  UNITER         09/25/2019
2         73.02%, 73.18%, 72.52%  e-SNLI-VE-2.0  04/07/2020
Baseline  71.16%                  EVE-Image      11/26/2018


e-SNLI-VE-2.0 relabels the neutral class in the dev and test splits and evaluates the resulting performance. Its three accuracies are listed in the order of the original, val-correction, and val/test-correction configurations.


SNLI-VE is built on top of SNLI and Flickr30K. The problem that VE is trying to solve is to reason about the relationship between an image premise Pimage and a text hypothesis Htext.

Specifically, given an image as premise and a natural language sentence as hypothesis, one of three labels (entailment, neutral, or contradiction) is assigned based on the relationship conveyed by the (Pimage, Htext) pair:

  • entailment holds if there is enough evidence in Pimage to conclude that Htext is true.
  • contradiction holds if there is enough evidence in Pimage to conclude that Htext is false.
  • Otherwise, the relationship is neutral, implying the evidence in Pimage is insufficient to draw a conclusion about Htext.
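The three-way rule above can be sketched in a few lines of Python. This is only an illustration of the label definitions: the two boolean flags are hypothetical stand-ins for a human or model judgment about the image, and neither the `Label` enum nor `ve_label` is part of the released code.

```python
from enum import Enum


class Label(Enum):
    ENTAILMENT = "entailment"
    NEUTRAL = "neutral"
    CONTRADICTION = "contradiction"


def ve_label(evidence_true: bool, evidence_false: bool) -> Label:
    """Map two evidence judgments about (Pimage, Htext) to a VE label.

    evidence_true:  the image gives enough evidence that the hypothesis is true.
    evidence_false: the image gives enough evidence that the hypothesis is false.
    Both flags are hypothetical inputs used only for illustration.
    """
    if evidence_true and evidence_false:
        raise ValueError("an image cannot both support and refute a hypothesis")
    if evidence_true:
        return Label.ENTAILMENT
    if evidence_false:
        return Label.CONTRADICTION
    # Insufficient evidence either way.
    return Label.NEUTRAL
```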

Examples from SNLI-VE


SNLI-VE Statistics

Below are some highlighted dataset statistics; details can be found in our paper.

Distribution by Split

The details of the train, dev, and test splits are shown below. Instances of the three labels (entailment, neutral, and contradiction) are approximately evenly distributed within each split.

Statistic        Train   Dev   Test
#Image           29783   1000  1000
#Entailment      176932  5959  5973
#Neutral         176045  5960  5964
#Contradiction   176550  5939  5964
Vocabulary Size  29550   6576  6592
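As a quick sanity check on the table, the per-split totals and label shares can be recomputed from the counts. The sketch below copies the numbers from the table above; the `split_summary` helper is just an illustrative name, not part of the repository.

```python
# Label counts copied from the "Distribution by Split" table.
split_counts = {
    "train": {"entailment": 176932, "neutral": 176045, "contradiction": 176550},
    "dev": {"entailment": 5959, "neutral": 5960, "contradiction": 5939},
    "test": {"entailment": 5973, "neutral": 5964, "contradiction": 5964},
}


def split_summary(counts):
    """Return (total pairs, share of each label) for one split."""
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    return total, shares


for split, counts in split_counts.items():
    total, shares = split_summary(counts)
    pretty = ", ".join(f"{label} {share:.1%}" for label, share in shares.items())
    print(f"{split}: {total} pairs ({pretty})")
```

The three totals add up to the 565286 pairs quoted in the quality check, and each label accounts for roughly one third of every split.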

Dataset Comparison

Below is a dataset comparison among SNLI-VE, VQA-v2.0 and CLEVR.

Statistic         SNLI-VE  VQA-v2.0  CLEVR
Partition Size:
Training          529527   443757    699989
Validation        17858    214354    149991
Test              17901    555187    149988
Question Length:
Mean              7.4      6.1       18.4
Median            7        6         17
Mode              6        5         14
Max               56       23        43
Vocabulary Size   32191    19174     87

Question Length Distribution

For the SNLI-VE dataset, the "question" is the hypothesis. As shown in the figure, hypothesis lengths in SNLI-VE follow a distribution with a long tail.

Question length distribution
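The length statistics in the comparison table (mean, median, mode, max) can be reproduced in spirit with a short script. The sketch below assumes whitespace tokenization, which may differ from the tokenization used in the paper, and `length_stats` is a hypothetical helper name; the sample sentences are made up for illustration.

```python
from collections import Counter
from statistics import mean, median


def length_stats(hypotheses):
    """Summarize hypothesis lengths, counting whitespace-separated tokens."""
    lengths = [len(h.split()) for h in hypotheses]
    # most_common(1) returns the single most frequent length and its count.
    mode = Counter(lengths).most_common(1)[0][0]
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "mode": mode,
        "max": max(lengths),
    }


# Illustrative sentences only; not taken from the dataset.
sample = [
    "A man is outside.",
    "Two dogs are running.",
    "A woman in a red coat is waiting for the bus in the rain.",
]
print(length_stats(sample))
```

A long tail shows up as a maximum far above the median, as in the SNLI-VE column of the table (median 7, max 56).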


To check the quality of the SNLI-VE dataset, we randomly sampled 217 pairs from all three splits (565286 pairs in total). Among the sampled pairs, 20 (about 9.2%) are incorrectly labeled, the majority of them in the neutral class. This is consistent with the analysis reported by GTE in its Table 2.
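Because the estimate comes from a sample of only 217 pairs, it carries some uncertainty. The sketch below recomputes the 9.2% figure and adds an approximate 95% Wilson score interval. The interval is our own illustration of the sampling uncertainty and is not a result reported in the paper.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Approximate Wilson score interval for a binomial proportion."""
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


rate = 20 / 217  # mislabeled pairs / sampled pairs
low, high = wilson_interval(20, 217)
print(f"error rate {rate:.1%}, approx. 95% interval {low:.1%} to {high:.1%}")
```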

It is worth noting that the original SNLI dataset is not perfectly labeled: 8.8% of the sampled data was not assigned a gold label, which indicates disagreement among human labelers. SNLI-VE is no exception, but we believe this is common in other large-scale datasets. However, if dataset quality is a major concern for you, we suggest dropping the neutral class and using only the entailment and contradiction examples.
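Dropping the neutral class amounts to a simple filter over the JSON Lines files. The following is a minimal sketch. It assumes each line is a JSON object whose label is stored under `gold_label` (the field name used by SNLI); check that key against the generated files. `drop_neutral` is a hypothetical helper, not part of the repository.

```python
import json


def drop_neutral(lines, label_field="gold_label"):
    """Yield only entailment and contradiction entries from JSON Lines input.

    label_field is an assumption borrowed from SNLI; adjust it if the
    generated SNLI-VE files use a different key.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get(label_field) in ("entailment", "contradiction"):
            yield entry


# In-memory example; with real data, pass an open file handle instead.
sample_lines = [
    '{"gold_label": "entailment", "sentence2": "A man is outside."}',
    '{"gold_label": "neutral", "sentence2": "A man is waiting for a friend."}',
    '{"gold_label": "contradiction", "sentence2": "A man is asleep indoors."}',
]
kept = list(drop_neutral(sample_lines))
print(len(kept), "of", len(sample_lines), "entries kept")
```

Note that this filter also drops any entry whose label is missing or is not one of the two kept classes.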

SNLI-VE Creation

The creation script generates the SNLI-VE dataset in train, dev, and test splits with disjoint image sets. Each entry contains a Flickr30kID field that associates it with the original Flickr30K image id. A companion parser script parses entries in SNLI-VE for applications, and you are free to revise it.

Follow the instructions below to set up the environment and generate SNLI-VE:

  1. Set the conda environment and dependencies

    conda create -n vet37 python=3.7
    conda activate vet37
    conda install jsonlines
    # conda install -c NECLA-ML ml
  2. Clone the repo

    git clone
  3. Generate SNLI-VE in data/

    cd SNLI-VE
    python -m
  4. Download dependent datasets: Flickr30K, Entities, SNLI, and RoI features

    cd data
    ./download # y to all if necessary

SNLI-VE Extensions

The Flickr30k Entities dataset is an extension of Flickr30k that contains grounded RoI and entity annotations.

It is easy to extend our SNLI-VE dataset with Flickr30k Entities if fine-grained annotations are required in your experiments.
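One way to make that extension is to group SNLI-VE entries by their Flickr30K image id and attach the matching annotations. The sketch below is an assumption-laden illustration. The key name `Flickr30kID` follows the field mentioned earlier in this README, and its exact spelling in the generated files should be checked. The `annotations` mapping is a hypothetical stand-in for whatever structure you load from Flickr30k Entities, and both helper functions are illustrative names.

```python
from collections import defaultdict


def group_by_image(entries, id_field="Flickr30kID"):
    """Group SNLI-VE entries by image id (id_field spelling is an assumption)."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[str(entry[id_field])].append(entry)
    return dict(grouped)


def attach_annotations(grouped, annotations):
    """Pair each image's entries with its (hypothetical) Entities annotations."""
    return {
        image_id: {"pairs": pairs, "entities": annotations.get(image_id)}
        for image_id, pairs in grouped.items()
    }


# Made-up entries and annotations, for illustration only.
entries = [
    {"Flickr30kID": "1001", "gold_label": "entailment"},
    {"Flickr30kID": "1001", "gold_label": "neutral"},
    {"Flickr30kID": "1002", "gold_label": "contradiction"},
]
annotations = {"1001": {"boxes": [[0, 0, 10, 10]]}}
merged = attach_annotations(group_by_image(entries), annotations)
```

Images without a matching annotation get `None` under `entities`, so missing annotations are easy to spot.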


The first is our full paper, while the second is the ViGIL workshop version.

  • Ning Xie, Farley Lai, Derek Doran, Asim Kadav. Visual Entailment: A Novel Task for Fine-grained Image Understanding. arXiv preprint arXiv:1901.06706.
  • Ning Xie, Farley Lai, Derek Doran, Asim Kadav. Visual Entailment Task for Visually-Grounded Language Learning. arXiv preprint arXiv:1811.10582.

Thank you for your interest in our dataset!
Please contact us for any questions, comments, or suggestions!





