# fse - Tutorial

Welcome to fse - fast sentence embeddings. The library is intended to compute sentence embeddings as fast as possible.
It offers a simple and easy-to-understand syntax for you to use in your own projects. Before we start with any model, let's have a look at the input types.
All fse models require an iterable/generator which produces IndexedSentence objects. An IndexedSentence is a named tuple with two fields: words and index. The index is required for multi-core processing, as sentences might not be processed sequentially. The index dictates which row of the corresponding sentence vector matrix the sentence belongs to.

## Input handling

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s', level=logging.DEBUG)

In [2]:
from fse import IndexedSentence
from fse import IndexedList
s = IndexedSentence(["Hello", "world"], 0)
print(s.words)
print(s.index)

2019-09-04 13:29:34,356 : MainThread : DEBUG : {'uri': '/Users/oliverborchers/anaconda3/envs/fsedev/lib/python3.7/site-packages/smart_open/VERSION', 'mode': 'r', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}
2019-09-04 13:29:34,731 : MainThread : INFO : 'pattern' package not found; tag filters are not available for English


['Hello', 'world']
0


The words of an IndexedSentence must always be a list of strings. Otherwise, the train method will raise an error. However, most input data is available as a list of untokenized strings, with one string per sentence.
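
The role of the index can be illustrated without fse. The sketch below uses a stand-in named tuple (`Sentence`, a hypothetical name, not the library's class) and processes the sentences in shuffled order, as a pool of workers might. Each result still lands in the row its index names. The per-sentence "embedding" is just the word count, chosen only to keep the example short.

```python
import random
from typing import List, NamedTuple


class Sentence(NamedTuple):
    """Stand-in for an indexed sentence: a list of tokens plus a target row."""
    words: List[str]
    index: int


def fill_rows(sentences: List[Sentence], seed: int = 0) -> List[int]:
    """Process sentences in arbitrary order; write each result to row `index`."""
    rows = [0] * len(sentences)
    order = list(sentences)
    random.Random(seed).shuffle(order)  # simulate out-of-order workers
    for sent in order:
        rows[sent.index] = len(sent.words)  # placeholder for a real embedding
    return rows


corpus = [
    Sentence(["Hello", "world"], 0),
    Sentence(["how", "are", "you"], 1),
    Sentence(["fine"], 2),
]
print(fill_rows(corpus))  # [2, 3, 1], regardless of the processing order
```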

In [None]:
sentences_a = ["Hello there", "how are you?"]
sentences_b = ["today is a good day", "Lorem ipsum"]

To deal with this common input format, fse provides the IndexedList, which handles all required data operations for you. You can provide multiple lists (or sets), which will all be merged into a single list. This makes it easier to work with the STS datasets. IndexedList will perform an automatic split if you don't provide a specific function for the model to split on.

In [None]:
s = IndexedList(sentences_a, sentences_b)
print(len(s))
s[0]

To save memory, we do not convert the original lists in place. The conversion only takes place when you access an item by index, as in s[0]. To access the original data, call:

In [None]:
s.items
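
The idea of merging several lists and converting items only on access can be sketched in plain Python. `LazyIndexedList` below is a hypothetical illustration written for this tutorial, not the library's implementation: it stores the original strings untouched and splits a string only when that item is requested.

```python
from typing import Callable, List, Optional, Tuple


class LazyIndexedList:
    """Hypothetical sketch: merge several lists and split items only on access."""

    def __init__(self, *lists: List[str],
                 split_func: Optional[Callable[[str], List[str]]] = None):
        self.items: List[str] = []
        for lst in lists:
            self.items.extend(lst)  # original strings are stored unchanged
        self.split_func = split_func or str.split

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, i: int) -> Tuple[List[str], int]:
        # The conversion happens here, one item at a time.
        return self.split_func(self.items[i]), i


s_demo = LazyIndexedList(["Hello there", "how are you?"],
                         ["today is a good day", "Lorem ipsum"])
print(len(s_demo))
print(s_demo[2])
print(s_demo.items[2])
```

Because nothing is converted up front, the memory cost stays close to that of the original lists, at the price of repeating the split each time an item is read.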

If the data is already preprocessed as a list of lists you can provide the argument pre_splitted=True.

In [None]:
sentences_splitted = ["Hello there".split(), "how are you?".split()]
s = IndexedList(sentences_splitted, pre_splitted=True)
print(len(s))
s[0]

In case you want to provide your own splitting function, you can pass a callable to the split_func argument.

In [None]:
def split_func(string):
    return string.split()

s = IndexedList(sentences_a, split=False, split_func=split_func)
print(len(s))
s[0]
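
A plain whitespace split keeps punctuation attached to tokens ("you?"), and such tokens may not be found in the vocabulary of a pre-trained word embedding. As a minimal sketch, a custom split function could lowercase the text and keep only runs of word characters. The name `lowercase_word_split` is made up for this example, and whether lowercasing is appropriate depends on the embedding you use.

```python
import re
from typing import List

# One or more word characters (letters, digits, underscore).
TOKEN = re.compile(r"\w+")


def lowercase_word_split(text: str) -> List[str]:
    """Lowercase the text and return runs of word characters as tokens."""
    return TOKEN.findall(text.lower())


print(lowercase_word_split("How are you?"))  # ['how', 'are', 'you']
```

A function like this can be passed as the split_func argument in the same way as the split_func defined in the cell above.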

If you want to stream a file from disk (where each line corresponds to a sentence) you can use the IndexedLineDocument.

In [None]:
from fse import IndexedLineDocument
doc = IndexedLineDocument("../fse/test/test_data/test_sentences.txt")

In [None]:
i = 0
for s in doc:
    print(f"{s.index}\t{s.words}")
    i += 1
    if i == 4:
        break

If you later work with sentence similarities, the IndexedLineDocument lets you access each line by its corresponding index. This helps you inspect similar sentences, as the most_similar method would otherwise just return indices.

In [None]:
doc[20]
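
To make the two access patterns concrete, here is a small stand-alone sketch of a line-based document. `LineDocument` is a hypothetical class written for illustration, not fse's IndexedLineDocument: iterating streams (words, index) pairs from disk, while indexing returns the raw line. The lookup is a simple linear scan, which is an assumption of this sketch and would be slow for large files.

```python
import os
import tempfile
from typing import Iterator, List, Tuple


class LineDocument:
    """Hypothetical sketch: stream (words, index) pairs from a text file."""

    def __init__(self, path: str):
        self.path = path

    def __iter__(self) -> Iterator[Tuple[List[str], int]]:
        with open(self.path, encoding="utf-8") as handle:
            for index, line in enumerate(handle):
                yield line.split(), index

    def __getitem__(self, i: int) -> str:
        # Linear scan: simple, but slow for large files.
        with open(self.path, encoding="utf-8") as handle:
            for index, line in enumerate(handle):
                if index == i:
                    return line.rstrip("\n")
        raise IndexError(i)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sentences.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Good stuff\nLooks good\nNot so great\n")
    doc_demo = LineDocument(path)
    streamed = list(doc_demo)   # one pass over the file
    second_line = doc_demo[1]   # random access by line index

print(streamed)
print(second_line)
```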

## Training a model / Performing inference

Training an fse model is simple. You only need a pre-trained word embedding model, which you pass in when initializing the fse model you want to use.

In [3]:
import gensim.downloader as api
data = api.load("quora-duplicate-questions")
glove = api.load("glove-wiki-gigaword-100")

2019-09-04 13:29:40,491 : MainThread : INFO : loading projection weights from /Users/oliverborchers/gensim-data/glove-wiki-gigaword-100/glove-wiki-gigaword-100.gz
2019-09-04 13:29:40,492 : MainThread : DEBUG : {'uri': '/Users/oliverborchers/gensim-data/glove-wiki-gigaword-100/glove-wiki-gigaword-100.gz', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}
2019-09-04 13:30:23,523 : MainThread : INFO : loaded (400000, 100) matrix from /Users/oliverborchers/gensim-data/glove-wiki-gigaword-100/glove-wiki-gigaword-100.gz


In [4]:
sentences = []
for d in data:
    # Let's blow up the data a bit by replicating each sentence.
    for i in range(8):
        sentences.append(d["question1"].split())
        sentences.append(d["question2"].split())
s = IndexedList(sentences, pre_splitted=True)
print(len(s))

2019-09-04 13:30:23,534 : MainThread : DEBUG : {'uri': '/Users/oliverborchers/gensim-data/quora-duplicate-questions/quora-duplicate-questions.gz', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': {}}


6468640


So we have 6,468,640 sentences we want to compute embeddings for. If you import the FAST_VERSION variable as follows, you can check that the compilation of the Cython routines worked correctly:

In [5]:
from fse.models.average import FAST_VERSION, MAX_WORDS_IN_BATCH
print(MAX_WORDS_IN_BATCH)
print(FAST_VERSION)
# 1 -> The fast version works

10000
1


In [8]:
from fse.models import Average
model = Average(glove, workers=2)

In [9]:
model.train(s)

2019-09-04 13:31:33,740 : MainThread : INFO : scanning all indexed sentences and their word counts
2019-09-04 13:31:38,741 : MainThread : INFO : SCANNING : finished 3450090 sentences with 38128985 words
2019-09-04 13:31:42,880 : MainThread : INFO : finished scanning 6468640 sentences with an average length of 11 and 71556728 total words
2019-09-04 13:31:43,020 : MainThread : INFO : estimated memory for 6468640 sentences with 100 dimensions and 400000 vocabulary: 2621 MB (2 GB)
2019-09-04 13:31:43,021 : MainThread : INFO : initializing sentence vectors for 6468640 sentences
2019-09-04 13:31:53,581 : MainThread : INFO : begin training
2019-09-04 13:31:58,591 : MainThread : INFO : PROGRESS : finished 41.60% with 2691182 sentences and 20473640 words, 538236 sentences/s
2019-09-04 13:32:03,591 : MainThread : INFO : PROGRESS : finished 82.79% with 5355357 sentences and 40758152 words, 532835 sentences/s
2019-09-04 13:32:05,656 : Thread-8 : DEBUG : job loop exiting, total 7161 jobs
2019-09-04

(6468624, 49255184)

In the run logged above, the model processes roughly 530,000 sentences per second, so the training pass over all 6,468,640 sentences finishes in about 12 seconds.
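
Conceptually, an averaging model represents each sentence by the mean of its word vectors. The following pure-Python sketch shows that computation on toy three-dimensional vectors. The names `toy_vectors` and `average_embedding` are made up for this example, and both skipping unknown words and returning zeros for a sentence without known words are assumptions of the sketch rather than documented library behavior.

```python
from typing import Dict, List

# Toy 3-dimensional "word vectors"; real embeddings such as GloVe are far larger.
toy_vectors: Dict[str, List[float]] = {
    "hello": [1.0, 0.0, 2.0],
    "world": [3.0, 2.0, 0.0],
}


def average_embedding(words: List[str], vectors: Dict[str, List[float]],
                      size: int = 3) -> List[float]:
    """Mean of the vectors of all known words; zeros if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) iterates over the dimensions (columns) of the known vectors.
    return [sum(column) / len(known) for column in zip(*known)]


print(average_embedding(["hello", "world", "unknown"], toy_vectors))  # [2.0, 1.0, 1.0]
```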

Once the model is trained, you can perform additional inference for unknown sentences. This two-step process for new data is required, as computing the principal components for models like SIF and uSIF requires a fair number of sentences. If you want the vector for a single sentence that was not part of the training data, just use:

In [None]:
tmp = IndexedSentence("Hello my friends".split(), 0)
model.infer([tmp])

## Querying the model

In order to query the model or perform similarity computations, we can just access the model.sv (sentence vectors) object and use its methods. To get the vector for an index, just call:

In [None]:
model.sv[0]

To compute the similarity or distance between two sentences from the training set, you can call:

In [None]:
print(model.sv.similarity(0,1).round(3))
print(model.sv.distance(0,1).round(3))
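
As a rough sketch of what these two numbers mean, assume the similarity is the cosine of the angle between two sentence vectors and the distance is one minus that value, as is conventional for gensim-style vectors. The helper names below are hypothetical. Note that the cosine is undefined for a zero vector, which is what an empty sentence would produce.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)


u = [1.0, 0.0]
v = [1.0, 1.0]
print(round(cosine_similarity(u, v), 3))  # 0.707
print(round(cosine_distance(u, v), 3))    # 0.293
```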

We can also query for the most similar sentences given an index. For example, to find the sentences most similar to the sentence at index 100:

In [None]:
print(s[100])

In [None]:
model.sv.most_similar(100)
# Division by zero can happen if you encounter empty sentences

However, the preceding function only returns the indices of the most similar sentences. You can circumvent this problem by passing an indexable object to the most_similar call:

In [None]:
model.sv.most_similar(100, indexable=sentences)

There we go. This is a lot more understandable than the initial list of indices.
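
The mechanics behind such a query can be sketched in a few lines: score every other vector against the query, sort by score, and use the indexable only to translate indices back into readable items. The function below is a hypothetical brute-force illustration, not the library's code. The layout of the returned tuples and the choice to score zero vectors as 0.0 are assumptions.

```python
import math
from typing import List, Optional, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0.0


def most_similar(query_index: int, vectors: List[Sequence[float]],
                 indexable: Optional[Sequence] = None, topn: int = 2) -> list:
    """Rank all other rows by cosine similarity to the query row."""
    query = vectors[query_index]
    scored = [(i, _cosine(query, vec))
              for i, vec in enumerate(vectors) if i != query_index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:topn]
    if indexable is None:
        return top  # indices and scores only
    # Look up the readable item for each index.
    return [(indexable[i], i, score) for i, score in top]


toy_rows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_texts = ["a b", "a b c", "x y", "not a b"]
print(most_similar(0, toy_rows))
print(most_similar(0, toy_rows, indexable=toy_texts))
```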

To search for sentences that are similar to a given word vector, you can call:

In [None]:
model.sv.similar_by_word("easy", wv=glove, indexable=sentences)

Furthermore, you can query for unknown (or new) sentences by calling:

In [None]:
model.sv.similar_by_sentence("Is this really easy to learn".split(), model=model, indexable=sentences)

Feel free to browse through the library and get to know the functions a little better!