# fse - Tutorial

Welcome to fse - fast sentence embeddings. The library is intended to compute sentence embeddings as fast as possible.
It offers a simple and easy-to-understand syntax for you to use in your own projects. Before we start with any model, let's have a look at the input types.
All fse models require an iterable/generator which produces IndexedSentence objects. An IndexedSentence is a named tuple with two fields: words and index. The index is required for multi-core processing, as sentences might not be processed sequentially. The index dictates which row of the corresponding sentence vector matrix the sentence belongs to.
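
To see why the index matters, here is a minimal sketch that does not use fse at all. `Sentence` is a hypothetical stand-in with the same two fields, and the "vector" is reduced to a word count. Even if sentences are processed out of order, each result ends up in the row given by its index.

```python
from collections import namedtuple

# Hypothetical stand-in for illustration only; not imported from fse.
Sentence = namedtuple("Sentence", ["words", "index"])

sentences = [
    Sentence(["hello", "world"], 0),
    Sentence(["how", "are", "you"], 1),
    Sentence(["good", "day"], 2),
]

# One output slot per sentence; here each "vector" is just the word count.
rows = [None] * len(sentences)

# Simulate workers finishing in a non-sequential order.
for s in reversed(sentences):
    rows[s.index] = len(s.words)

print(rows)  # [2, 3, 2]
```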

## Input handling

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
from fse import IndexedSentence
s = IndexedSentence(["Hello", "world"], 0)
print(s.words)
print(s.index)

2019-08-25 16:47:39,738 : MainThread : INFO : 'pattern' package not found; tag filters are not available for English


['Hello', 'world']
0


The words of an IndexedSentence must always be a list of strings, one string per token. Otherwise, the train method will raise an error. However, most input data is available as a list of unsplit sentence strings.

In [3]:
sentences_a = ["Hello there", "how are you?"]
sentences_b = ["today is a good day", "Lorem ipsum"]

In order to deal with this common input format, fse provides the IndexedList, which handles all required data operations for you. You can provide multiple lists (or sets), which will all be merged into a single list. This eases working with the STS datasets. If you don't provide a specific split function, IndexedList splits each string automatically.

In [4]:
from fse import IndexedList
s = IndexedList(sentences_a, sentences_b)
print(len(s))
s[0]

4


IndexedSentence(words=['Hello', 'there'], index=0)

To save memory, we do not convert the original lists in place. The conversion only takes place when you access an item via indexing, e.g. s[0]. To access the original data, call:

In [5]:
s.items

['Hello there', 'how are you?', 'today is a good day', 'Lorem ipsum']
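
The lazy conversion described above can be sketched as follows. This is an illustration of the idea, not fse's actual implementation: `LazyIndexedList` and `Sentence` are hypothetical names. The raw strings are stored untouched and only split when an item is requested.

```python
from collections import namedtuple

# Hypothetical stand-ins for illustration; fse's real classes may differ.
Sentence = namedtuple("Sentence", ["words", "index"])


class LazyIndexedList:
    """Stores raw strings and splits them only when an item is requested."""

    def __init__(self, *lists):
        # Merge all inputs into one flat list of unsplit strings.
        self.items = [text for lst in lists for text in lst]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        # The split happens here, on access, not at construction time.
        return Sentence(self.items[i].split(), i)


lazy = LazyIndexedList(["Hello there", "how are you?"], ["Lorem ipsum"])
print(len(lazy))  # 3
print(lazy[2])    # Sentence(words=['Lorem', 'ipsum'], index=2)
```

Because only the original strings are kept, memory use stays close to that of the input lists, at the cost of repeating the split on every access.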

If the data is already preprocessed as a list of lists, you can provide the argument pre_splitted=True.

In [6]:
sentences_splitted = ["Hello there".split(), "how are you?".split()]
s = IndexedList(sentences_splitted, pre_splitted=True)
print(len(s))
s[0]

2


IndexedSentence(words=['Hello', 'there'], index=0)

In case you want to provide your own splitting function, you can pass a callable to the split_func argument.

In [7]:
def split_func(string):
    return string.split()

s = IndexedList(sentences_a, split=False, split_func=split_func)
print(len(s))
s[0]

2


IndexedSentence(words=['Hello', 'there'], index=0)
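
A custom splitting function becomes useful when plain whitespace splitting leaves punctuation attached to words. The following is a sketch of one possible callable; `regex_split` is a hypothetical name, and whether lowercasing is appropriate depends on whether your word vectors are cased.

```python
import re


def regex_split(string):
    """Lowercase the text and keep only runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", string.lower())


print(regex_split("How are you?"))  # ['how', 'are', 'you']
```

A function like this takes one string and returns a list of strings, which is the same shape as the split_func shown above.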

If you want to stream a file from disk (where each line corresponds to a sentence), you can use the IndexedLineDocument.

In [8]:
from fse import IndexedLineDocument
doc = IndexedLineDocument("../test/test_data/test_sentences.txt")

In [9]:
i = 0
for s in doc:
    print(f"{s.index}\t{s.words}")
    i += 1
    if i == 4:
        break

0	['Good', 'stuff', 'i', 'just', 'wish', 'it', 'lasted', 'longer']
1	['Hp', 'makes', 'qualilty', 'stuff']
2	['I', 'like', 'it']
3	['Try', 'it', 'you', 'will', 'like', 'it']


If you are later working with the similarity of sentences, the IndexedLineDocument provides you the option to access each line by its corresponding index. This helps you in determining the similarity of sentences, as the most_similar method would otherwise just return indices.

In [10]:
doc[20]

'I feel like i just got screwed'
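
The two behaviours shown above (streaming indexed sentences, and looking up a raw line by index) can be sketched in plain Python. This is not how fse implements IndexedLineDocument; `LineDocument` and `Sentence` are hypothetical names, and the lookup uses a simple linear scan.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in for illustration only.
Sentence = namedtuple("Sentence", ["words", "index"])


class LineDocument:
    """Streams one sentence per line and supports lookup of a raw line by index."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is read lazily, so it never has to fit into memory at once.
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                yield Sentence(line.split(), i)

    def __getitem__(self, i):
        # Linear scan for simplicity; caching line offsets would be faster.
        with open(self.path, encoding="utf-8") as f:
            for j, line in enumerate(f):
                if j == i:
                    return line.rstrip("\n")
        raise IndexError(i)


# Write a small temporary file to demonstrate.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp_file:
    tmp_file.write("I like it\nTry it you will like it\n")

demo_doc = LineDocument(tmp_file.name)
streamed = list(demo_doc)
first_line = demo_doc[0]
os.remove(tmp_file.name)

print(streamed[1].index, streamed[1].words)
print(first_line)
```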

## Training a model / Performing inference

Training an fse model is simple. You only need a pre-trained word embedding model, which you pass in when initializing the fse model you want to use.

In [11]:
import gensim.downloader as api
data = api.load("quora-duplicate-questions")
glove = api.load("glove-wiki-gigaword-100")

2019-08-25 16:47:40,935 : MainThread : INFO : loading projection weights from /Users/oliverborchers/gensim-data/glove-wiki-gigaword-100/glove-wiki-gigaword-100.gz
2019-08-25 16:48:25,069 : MainThread : INFO : loaded (400000, 100) matrix from /Users/oliverborchers/gensim-data/glove-wiki-gigaword-100/glove-wiki-gigaword-100.gz


In [12]:
sentences = []
for d in data:
    # Let's blow up the data a bit by replicating each sentence.
    for i in range(4):
        sentences.append(d["question1"])
        sentences.append(d["question2"])
s = IndexedList(sentences)
print(len(s))


3234320


So we have 3,234,320 sentences we want to compute the embeddings for. If you import the FAST_VERSION variable as follows, you can check that the compilation of the Cython routines worked correctly:

In [13]:
from fse.models.average import FAST_VERSION
FAST_VERSION
# 1 -> The fast version works

1

In [14]:
from fse.models import SIF
model = SIF(glove, lang_freq="en")

2019-08-25 16:48:29,876 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


In [15]:
model.train(s)

2019-08-25 16:48:30,446 : MainThread : INFO : scanning all indexed sentences and their word counts
2019-08-25 16:48:35,446 : MainThread : INFO : SCANNING : finished 1053801 sentences with 11639639 words
2019-08-25 16:48:40,446 : MainThread : INFO : SCANNING : finished 1922223 sentences with 21245025 words
2019-08-25 16:48:45,446 : MainThread : INFO : SCANNING : finished 2921544 sentences with 32296684 words
2019-08-25 16:48:46,877 : MainThread : INFO : finished scanning 3234320 sentences with an average length of 11 and 35778364 total words
2019-08-25 16:48:47,000 : MainThread : INFO : estimated memory for 3234320 sentences with 100 dimensions and 400000 vocabulary: 1387 MB (1 GB)
2019-08-25 16:48:47,001 : MainThread : INFO : initializing sentence vectors for 3234320 sentences
2019-08-25 16:48:52,168 : MainThread : INFO : pre-computing SIF weights for 400000 words
2019-08-25 16:48:52,410 : MainThread : INFO : begin training
2019-08-25 16:48:57,429 : MainThread : INFO : PROGRESS : finis

(3234312, 24627592)

The model runs at around 140,000 sentences per second, so embedding all 3,234,320 sentences takes roughly 23 seconds.

Once the SIF model is trained, you can perform additional inference for unknown sentences. This two-step process for new data is required because computing the principal components for models like SIF and uSIF requires a fair number of sentences. If you want the vector for a single sentence that was not part of the training data, just use:

In [16]:
tmp = IndexedSentence("Hello my friends".split(), 0)
model.infer([tmp])

2019-08-25 16:49:23,574 : MainThread : INFO : scanning all indexed sentences and their word counts
2019-08-25 16:49:23,575 : MainThread : INFO : finished scanning 1 sentences with an average length of 3 and 3 total words
2019-08-25 16:49:23,576 : MainThread : INFO : removing 1 principal components took 0s


array([[ 0.21351576, -0.09288862, -0.0119403 , -0.1889258 , -0.06775595,
         0.35332185, -0.05645356,  0.0137386 , -0.05183255, -0.23596533,
        -0.00128163, -0.02317927,  0.12560129,  0.12072462, -0.11241709,
        -0.19177678,  0.01603449, -0.1464626 , -0.23061372,  0.2182611 ,
        -0.12705329,  0.22893776, -0.0711069 , -0.11446346,  0.09433743,
         0.12357411, -0.1769016 ,  0.00240368,  0.37974662,  0.11471796,
         0.22353241,  0.15717354,  0.35247564,  0.11246662, -0.05959256,
         0.1634959 ,  0.12703045,  0.03458374,  0.23784837, -0.09175231,
        -0.02836828,  0.13040201,  0.21949698, -0.04487759, -0.09548192,
        -0.07434461, -0.35468528,  0.2915059 ,  0.33256316,  0.02439231,
        -0.10745892, -0.14252488,  0.05496336, -0.05148613,  0.07837424,
        -0.00869656,  0.02335814,  0.06826252, -0.3131165 ,  0.06759262,
         0.11155459,  0.41700476,  0.06062685, -0.12836376, -0.08194584,
        -0.06413196,  0.14149769,  0.15394527,  0.1
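
To illustrate why training and inference are separate steps, here is a minimal sketch of the SIF idea as described by Arora et al.: a weighted average of word vectors with weights a / (a + p(w)), followed by removing the projection onto a principal component. It is a pure-Python illustration under those assumptions, not fse's implementation; all names, the toy vectors, and the toy probabilities are hypothetical.

```python
def sif_embedding(words, vectors, word_prob, a=1e-3):
    """Weighted average of word vectors with weights a / (a + p(w))."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    count = 0
    for w in words:
        if w not in vectors:
            continue  # skip out-of-vocabulary words
        weight = a / (a + word_prob.get(w, 0.0))
        for k, x in enumerate(vectors[w]):
            total[k] += weight * x
        count += 1
    return [x / count for x in total] if count else total


def remove_component(vec, component):
    """Subtract the projection of vec onto a unit-length component."""
    dot = sum(v * c for v, c in zip(vec, component))
    return [v - dot * c for v, c in zip(vec, component)]


# Toy data: a frequent word gets a much smaller weight than a rare one.
toy_vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
toy_prob = {"the": 0.05, "cat": 0.0001}

emb = sif_embedding(["the", "cat", "unknown"], toy_vectors, toy_prob)
print(emb)

# The component would be estimated once from many training sentences
# and then reused for every new sentence at inference time.
print(remove_component([3.0, 4.0], [1.0, 0.0]))  # [0.0, 4.0]
```

The component is a property of the whole corpus, so it cannot be estimated reliably from a single new sentence. That is why the model is trained first and the stored component is applied during inference.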

## Querying the model

In order to query the model or perform similarity computations, we can access the model.sv (sentence vectors) object and use its methods. To get the vector for an index, just call:

In [17]:
model.sv[0]

array([ 0.05602818, -0.05382837, -0.02521911,  0.21895382,  0.12569104,
       -0.23286618, -0.1197442 , -0.09862348, -0.00718871, -0.05476034,
       -0.04207896,  0.06721796, -0.07115522, -0.17658317, -0.11463246,
        0.01588205, -0.04609087,  0.00192668,  0.06665533,  0.01994948,
        0.03874935,  0.01709431, -0.13803758,  0.11546692, -0.08431192,
        0.01444703,  0.05757499,  0.10997273, -0.06485826,  0.03536717,
       -0.12626679,  0.04387207, -0.05193089, -0.04863958,  0.06036283,
        0.04478271,  0.15353824, -0.09030947, -0.05488757, -0.11358316,
        0.01325982, -0.03299464,  0.02760698,  0.00774282, -0.11196408,
        0.08852603, -0.15662912,  0.07745412, -0.02901547, -0.1876617 ,
       -0.14219847,  0.00209911, -0.01960241,  0.05305627, -0.10361422,
       -0.05813611, -0.11573686, -0.00658958,  0.21034062, -0.10365039,
        0.04388143, -0.23539278, -0.04639078,  0.02537311, -0.08185716,
        0.06672949, -0.18605871,  0.08167145,  0.07045798, -0.00

To compute the similarity or distance between two sentences from the training set, you can call:

In [18]:
print(model.sv.similarity(0,1).round(3))
print(model.sv.distance(0,1).round(3))

0.939
0.061
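
The two numbers above add up to 1, which is consistent with the distance being one minus the cosine similarity. Under that assumption, the computation can be sketched in plain Python as follows; the function names are hypothetical and the zero-norm guard is a choice made for this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # avoids a division by zero for empty sentences
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Distance defined as one minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 3))  # 0.707
print(round(cosine_distance([1.0, 0.0], [1.0, 1.0]), 3))    # 0.293
```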


We can further call for the most similar sentences given an index. For example, we want to know the most similar sentences for sentence index 100:

In [19]:
print(s[100])

IndexedSentence(['What', 'can', 'make', 'Physics', 'easy', 'to', 'learn?'], 100)


In [20]:
model.sv.most_similar(100)
# Division by zero can happen if you encounter empty sentences

2019-08-25 16:49:23,609 : MainThread : INFO : precomputing L2-norms of sentence vectors


[(102, 0.9999999403953552),
 (1920653, 0.9999999403953552),
 (1920655, 0.9999999403953552),
 (1920651, 0.9999999403953552),
 (1920649, 0.9999999403953552),
 (96, 0.9999999403953552),
 (98, 0.9999999403953552),
 (2752780, 0.969744086265564),
 (2752778, 0.969744086265564),
 (2752776, 0.969744086265564)]

However, the preceding function only returns the indices of the most similar sentences. You can get the sentences themselves by passing an indexable object (such as the original list of sentences) to the most_similar call:

In [21]:
model.sv.most_similar(100, indexable=sentences)

[('What can make Physics easy to learn?', 102, 0.9999999403953552),
 ('What can make Physics easy to learn?', 1920653, 0.9999999403953552),
 ('What can make Physics easy to learn?', 1920655, 0.9999999403953552),
 ('What can make Physics easy to learn?', 1920651, 0.9999999403953552),
 ('What can make Physics easy to learn?', 1920649, 0.9999999403953552),
 ('What can make Physics easy to learn?', 96, 0.9999999403953552),
 ('What can make Physics easy to learn?', 98, 0.9999999403953552),
 ('How can I make an easy Mrs. Peacock costume?', 2752780, 0.969744086265564),
 ('How can I make an easy Mrs. Peacock costume?', 2752778, 0.969744086265564),
 ('How can I make an easy Mrs. Peacock costume?', 2752776, 0.969744086265564)]

There we go. This is a lot more understandable than the initial list of indices.
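
Conceptually, a query like this ranks all stored vectors by their similarity to the query vector and, if an indexable is given, looks up each index in it. The following brute-force sketch shows that idea on toy data. It assumes cosine similarity and is not fse's implementation; all names and data are hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_index, vectors, topn=3, indexable=None):
    """Return the topn entries most similar to vectors[query_index]."""
    query = vectors[query_index]
    scored = (
        (i, cosine(query, v)) for i, v in enumerate(vectors) if i != query_index
    )
    best = heapq.nlargest(topn, scored, key=lambda pair: pair[1])
    if indexable is None:
        return best
    # Attach the readable item for each index.
    return [(indexable[i], i, score) for i, score in best]


toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
toy_texts = ["a", "b", "c", "d"]

print(most_similar(0, toy_vectors, topn=2))
print(most_similar(0, toy_vectors, topn=2, indexable=toy_texts))
```

Anything that supports lookup by integer index works as the indexable here, for example a list of the original sentences.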

To search for sentences that are similar to a given word vector, you can call:

In [22]:
model.sv.similar_by_word("easy", wv=glove, indexable=sentences)

[('Is Hadoop easy to learn?', 2607199, 0.6739339828491211),
 ('Is Hadoop easy to learn?', 2607195, 0.6739339828491211),
 ('Is Python easy to learn?', 58913, 0.6739339828491211),
 ('Is Python easy to learn?', 58915, 0.6739339828491211),
 ('Is Python easy to learn?', 58917, 0.6739339828491211),
 ('Is Python easy to learn?', 58919, 0.6739339828491211),
 ('Is Java easy to learn?', 2495013, 0.6739339828491211),
 ('Is Hadoop easy to learn?', 2607197, 0.6739339828491211),
 ('Is Adobe Premiere Pro easy to learn?', 712780, 0.6739339828491211),
 ('Is Adobe Premiere Pro easy to learn?', 712776, 0.6739339828491211)]

Furthermore, you can query for unknown (or new) sentences by calling:

In [23]:
model.sv.similar_by_sentence("Is this really easy to learn".split(), model=model, indexable=sentences)

2019-08-25 16:49:25,864 : MainThread : INFO : scanning all indexed sentences and their word counts
2019-08-25 16:49:25,865 : MainThread : INFO : finished scanning 1 sentences with an average length of 6 and 6 total words
2019-08-25 16:49:25,866 : MainThread : INFO : removing 1 principal components took 0s


[('How do I learn C# the easy way?', 1120173, 0.959309458732605),
 ('How do I learn C# the easy way?', 1120169, 0.959309458732605),
 ('How do I learn C# the easy way?', 1120171, 0.959309458732605),
 ('How do I learn C# the easy way?', 1120175, 0.959309458732605),
 ('How easy is it to learn Java?', 2894414, 0.9592858552932739),
 ('How easy is it to learn Java?', 176146, 0.9592858552932739),
 ('How easy is it to learn Java?', 176150, 0.9592858552932739),
 ('How easy is it to learn Java?', 121921, 0.9592858552932739),
 ('How easy is it to learn C?', 1552117, 0.9592858552932739),
 ('How easy is it to learn Java?', 176144, 0.9592858552932739)]

Feel free to browse through the library and get to know the functions a little better!