InferSent

InferSent is a sentence embedding method that provides semantic sentence representations. It is trained on natural language inference data and generalizes well to many different tasks.

We provide our pre-trained sentence encoder for reproducing the results from our paper. See also SentEval for automatic evaluation of the quality of sentence embeddings.

Dependencies

This code is written in Python. The dependencies are:

Python 2.7 (with recent versions of NumPy/SciPy)
PyTorch >= 0.12
NLTK >= 3

Download datasets

To get GloVe, SNLI and MultiNLI [2GB, 90MB, 216MB], run (in dataset/):

./get_data.bash

This will download GloVe and preprocess SNLI/MultiNLI data/senteval_data.

Use our sentence encoder

See encoder/play.ipynb for an example.

0) Download our model trained on AllNLI (SNLI and MultiNLI) [147MB]:

curl -Lo encoder/infersent.allnli.pickle https://s3.amazonaws.com/senteval/infersent/infersent.allnli.pickle

1) Load our pre-trained model (in encoder/):

import torch
infersent = torch.load('infersent.allnli.pickle')

Note: to load it, you need the file "models.py" (in encoder/) that provides the definition of the model.

2) Set GloVe path for the model:

infersent.set_glove_path(glove_path)

where glove_path is the path to 'glove.840B.300d.txt', which contains the GloVe vectors our model was trained with. Note that using GloVe vectors gives a coverage of more than 2 million English words.

3) Build the vocabulary of word vectors (i.e., keep only those needed):

infersent.build_vocab(sentences, tokenize=True)

where sentences is your list of n sentences. You can update your vocabulary using infersent.update_vocab(sentences), or directly load the K most common words with infersent.build_vocab_k_words(K=100000). If tokenize is True (the default), sentences will be tokenized using NLTK. Run nltk.download('punkt') once to download the NLTK tokenizer.
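To illustrate what "keep only those needed" means, here is a minimal sketch of filtering a GloVe-style text file down to the words that occur in your sentences. It is not the code behind build_vocab: the helper names collect_words and load_needed_vectors are hypothetical, tokenization is a plain whitespace split rather than NLTK, and a tiny in-memory file with 3-dimensional vectors stands in for glove.840B.300d.txt.

```python
import io


def collect_words(sentences):
    """Return the set of distinct whitespace-separated tokens (hypothetical helper)."""
    words = set()
    for sentence in sentences:
        words.update(sentence.split())
    return words


def load_needed_vectors(glove_file, needed_words):
    """Read 'word v1 v2 ...' lines and keep only the vectors of needed words."""
    vectors = {}
    for line in glove_file:
        word, _, rest = line.rstrip("\n").partition(" ")
        if word in needed_words:
            vectors[word] = [float(x) for x in rest.split()]
    return vectors


# Tiny stand-in for the real GloVe file (assumption: same line format).
fake_glove = io.StringIO(
    "a 0.1 0.2 0.3\n"
    "man 0.4 0.5 0.6\n"
    "plays 0.7 0.8 0.9\n"
    "zebra 1.0 1.1 1.2\n"
)

sentences = ["a man plays", "a man"]
word_vec = load_needed_vectors(fake_glove, collect_words(sentences))
print(sorted(word_vec))  # ['a', 'man', 'plays']
```

Only the three words that appear in the sentences are kept; "zebra" is skipped, which is why building the vocabulary from your own sentences avoids holding all of GloVe in memory.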

4) Encode your sentences (list of n sentences):

infersent.encode(sentences, tokenize=True)

This will output a NumPy array with n vectors of dimension 4096 (the dimension of the sentence embeddings). Speed is around 1000 sentences per second with batch size 128 on a single GPU.
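As a rough planning aid, the sketch below splits a list of sentences into batches of 128 and estimates encoding time from the throughput figure quoted above. The helpers batches and estimated_seconds are hypothetical and not part of this repository, and the estimate assumes the quoted speed holds for your hardware and sentence lengths.

```python
def batches(items, batch_size=128):
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def estimated_seconds(n_sentences, sentences_per_second=1000.0):
    """Estimate encoding time, assuming the quoted throughput applies."""
    return n_sentences / sentences_per_second


sentences = ["sentence %d" % i for i in range(300)]
sizes = [len(batch) for batch in batches(sentences)]
print(sizes)                       # [128, 128, 44]
print(estimated_seconds(1000000))  # 1000.0
```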

5) Visualize the importance that our model attributes to each word:

Our representations were trained to focus on semantic information such that a classifier can easily tell the difference between contradictory, neutral or entailed sentences. We provide a function to visualize the importance of each word in the encoding of a sentence:

infersent.visualize('A man plays an instrument.', tokenize=True)
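One plausible way to define such an importance score is sketched below. It assumes, and this README does not state it, that the sentence vector is obtained by max-pooling over per-word hidden states; a word's importance is then the fraction of dimensions in which that word supplies the maximum. The function word_importance and the toy 4-dimensional hidden states are hypothetical, and the sketch is not the implementation of infersent.visualize.

```python
def word_importance(hidden_states):
    """For each word, return the fraction of dimensions where it holds the max.

    hidden_states: one list of floats per word, all of the same length
    (assumed to be the per-word states fed to a max-pooling layer).
    """
    n_words = len(hidden_states)
    n_dims = len(hidden_states[0])
    counts = [0] * n_words
    for d in range(n_dims):
        # Index of the word whose state is largest in dimension d.
        best = max(range(n_words), key=lambda w: hidden_states[w][d])
        counts[best] += 1
    return [float(c) / n_dims for c in counts]


words = ["A", "man", "plays"]
hidden = [
    [0.1, 0.9, 0.2, 0.0],  # "A"
    [0.8, 0.3, 0.1, 0.7],  # "man"
    [0.2, 0.4, 0.6, 0.9],  # "plays"
]
scores = word_importance(hidden)
print(list(zip(words, scores)))  # [('A', 0.25), ('man', 0.25), ('plays', 0.5)]
```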

Train model on Natural Language Inference (SNLI)

To reproduce our results and train our models on SNLI, set GLOVE_PATH in train_nli.py, then run:

python train_nli.py

You should obtain a dev accuracy of 85 and a test accuracy of 84.5 with the default setting.

Reproduce our results on transfer tasks

To reproduce our results on transfer tasks, clone SentEval and set PATH_SENTEVAL, PATH_TRANSFER_TASKS in evaluate_model.py, then run:

python evaluate_model.py

Using our best model infersent.allnli.pickle, you should obtain the following test results:

Model	MR	CR	SUBJ	MPQA	STS14	STS Benchmark	SICK Relatedness	SICK Entailment	SST	TREC	MRPC
`InferSent`	81.1	86.3	92.4	90.2	.68/.65	75.8/75.5	0.884	86.1	84.6	88.2	76.2/83.1
`SkipThought`	79.4	83.1	93.7	89.3	.44/.45	72.1/70.2	0.858	79.5	82.9	88.4	-

Note that while InferSent provides good features for many different tasks, our approach also obtains strong results on STS tasks, which evaluate the quality of cosine similarity in the embedding space.
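For readers unfamiliar with the metric, cosine similarity between two sentence embeddings can be computed as in the sketch below. The 3-dimensional vectors are made-up stand-ins for the 4096-dimensional embeddings, and cosine_similarity is a hypothetical helper written with the standard library only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embeddings (assumption: real ones would come from infersent.encode).
emb_a = [1.0, 2.0, 0.0]
emb_b = [2.0, 4.0, 0.0]  # same direction as emb_a
emb_c = [0.0, 0.0, 3.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1.0
print(cosine_similarity(emb_a, emb_c))  # 0.0
```

Because the score depends only on direction, not on vector length, it can be compared across sentence pairs of different lengths.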

Reference

Please cite [1] if you found this code useful.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

@article{conneau2017supervised,
  title={Supervised Learning of Universal Sentence Representations from Natural Language Inference Data},
  author={Conneau, Alexis and Kiela, Douwe and Schwenk, Holger and Barrault, Loic and Bordes, Antoine},
  journal={arXiv preprint arXiv:1705.02364},
  year={2017}
}

Contact: aconneau@fb.com
