# fse - Tutorial

Welcome to fse (fast sentence embeddings). The library is designed to compute sentence embeddings as fast as possible.
It offers a simple, easy-to-understand syntax for use in your own projects. Before we start with any model, let's have a look at the input types.
All fse models require an iterable or generator that produces tuples. Each tuple has two fields: words and index. The index is required for multi-threaded processing, because sentences might not be processed sequentially. It determines which row of the sentence vector matrix the sentence belongs to.
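
As a rough illustration of why the index matters, the sketch below builds (words, index) tuples with a generator and then fills a result list out of order, the way parallel workers might. The helper name `indexed_sentences` and the placeholder "result" (simply the word count) are made up for this example and are not part of fse.

```python
def indexed_sentences(raw_sentences):
    """Yield (words, index) tuples, the input shape described above."""
    for index, sentence in enumerate(raw_sentences):
        yield (sentence.split(), index)


raw = ["Hello world", "how are you", "fine thanks"]
items = list(indexed_sentences(raw))

# One result slot per sentence; this stands in for the sentence vector matrix.
rows = [None] * len(items)

# Process the sentences in reverse to mimic workers finishing out of order.
for words, index in reversed(items):
    rows[index] = len(words)  # placeholder "result": the word count

print(rows)  # [2, 3, 2]
```

Because every tuple carries its own index, each result lands in the correct row regardless of processing order.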

## Input handling

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
s = (["Hello", "world"], 0)
print(s[0])
print(s[1])

['Hello', 'world']
0


The words of the tuple must always be a list of strings; otherwise the train method raises an error. In practice, however, most input data comes as a list of unsplit sentence strings.

To deal with this common input format, fse provides the IndexedList and some variants, which handle all required data operations for you. You can provide multiple lists (or sets), which are all merged into a single list. This makes it easier to work with datasets such as STS.

There are multiple types of indexed lists. Let's go through them one by one:
- IndexedList: for sentences that are already split
- **C**IndexedList: for already-split sentences with a custom index for each sentence
- SplitIndexedList: for sentences that have not been split yet; it splits the strings for you
- **C**SplitIndexedList: for unsplit sentences; it splits the strings, and you can provide a custom split function
- **C**Split**C**IndexedList: for sentences where you want to provide both a custom index and a custom split function

*Note*: These are ordered by speed: IndexedList is the fastest, while **C**Split**C**IndexedList is the slowest variant.

In [3]:
from fse import SplitIndexedList

sentences_a = ["Hello there", "how are you?"]
sentences_b = ["today is a good day", "Lorem ipsum"]

s = SplitIndexedList(sentences_a, sentences_b)
print(len(s))
s[0]

2019-09-09 22:38:15,759 : MainThread : INFO : 'pattern' package not found; tag filters are not available for English


4


(['Hello', 'there'], 0)

To save memory, the original lists are not converted in place. The conversion only takes place when you call the getitem method. To access the original data, call:

In [4]:
s.items

['Hello there', 'how are you?', 'today is a good day', 'Lorem ipsum']
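
The merge-then-split-on-access behaviour can be approximated in a few lines. This is only a sketch under the assumptions stated in the text above; the class name `LazySplitList` is hypothetical and the code is not taken from fse.

```python
class LazySplitList:
    """Hypothetical stand-in for a split-on-access indexed list."""

    def __init__(self, *lists):
        # Merge all inputs into one flat list of raw strings.
        self.items = [sentence for lst in lists for sentence in lst]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        # Split only when an element is requested; self.items stays untouched.
        return (self.items[i].split(), i)


lazy = LazySplitList(["Hello there", "how are you?"], ["today is a good day"])
print(len(lazy))      # 3
print(lazy[2])        # (['today', 'is', 'a', 'good', 'day'], 2)
print(lazy.items[2])  # today is a good day
```

Keeping only the raw strings means the split lists are created on demand and never stored a second time.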

If the data is already preprocessed as a list of lists, you can pass it directly to the IndexedList.

In [5]:
from fse import IndexedList

sentences_splitted = ["Hello there".split(), "how are you?".split()]
s = IndexedList(sentences_splitted)
print(len(s))
s[0]

2


(['Hello', 'there'], 0)

In case you want to provide your own splitting function, you can pass a callable to the **C**SplitIndexedList class.

In [6]:
from fse import CSplitIndexedList

def split_func(string):
    return string.lower().split()

s = CSplitIndexedList(sentences_a, custom_split=split_func)
print(len(s))
s[0]

2


(['hello', 'there'], 0)
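
A plain whitespace split keeps punctuation attached to the words (for example 'you?'). The sketch below shows one possible alternative split function based on a regular expression. The name `regex_split` is hypothetical; the assumption is only that any callable mapping a string to a list of strings can be passed as custom_split, as in the cell above.

```python
import re


def regex_split(string):
    """Lowercase the text and keep only runs of word characters."""
    return re.findall(r"\w+", string.lower())


print(regex_split("how are you?"))  # ['how', 'are', 'you']
```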

If you want to stream a file from disk (where each line corresponds to a sentence), you can use the IndexedLineDocument.

In [7]:
from fse import IndexedLineDocument
doc = IndexedLineDocument("../fse/test/test_data/test_sentences.txt")

In [8]:
i = 0
for s in doc:
    print(f"{s[1]}\t{s[0]}")
    i += 1
    if i == 4:
        break

0	['Good', 'stuff', 'i', 'just', 'wish', 'it', 'lasted', 'longer']
1	['Hp', 'makes', 'qualilty', 'stuff']
2	['I', 'like', 'it']
3	['Try', 'it', 'you', 'will', 'like', 'it']


If you later work with sentence similarities, the IndexedLineDocument lets you access each line by its index. This helps when interpreting results, as the most_similar method would otherwise return only indices.

In [9]:
doc[20]

'I feel like i just got screwed'
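
To make the two access patterns concrete (streaming tuples versus looking up a raw line by index), here is a minimal, file-backed sketch. The class `LineDocumentSketch` is hypothetical and deliberately naive: it rescans the file on every lookup, which is an assumption made for brevity and not a description of how fse implements it.

```python
import os
import tempfile


class LineDocumentSketch:
    """Hypothetical file-backed container: one sentence per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Stream the file, yielding (words, index) without loading it fully.
        with open(self.path, encoding="utf-8") as handle:
            for index, line in enumerate(handle):
                yield (line.split(), index)

    def __getitem__(self, i):
        # Return the raw line at position i (a linear scan, for simplicity).
        with open(self.path, encoding="utf-8") as handle:
            for index, line in enumerate(handle):
                if index == i:
                    return line.rstrip("\n")
        raise IndexError(i)


# Write a tiny two-line file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp_file:
    tmp_file.write("I like it\nTry it you will like it\n")

doc_sketch = LineDocumentSketch(tmp_file.name)
streamed = list(doc_sketch)
first_line = doc_sketch[0]
os.remove(tmp_file.name)  # clean up the temporary file

print(streamed[1])  # (['Try', 'it', 'you', 'will', 'like', 'it'], 1)
print(first_line)   # I like it
```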

## Training a model / Performing inference

Training an fse model is simple. You only need a pre-trained word embedding model, which you pass in when initializing the fse model you want to use.

In [10]:
import gensim.downloader as api
data = api.load("quora-duplicate-questions")
glove = api.load("glove-wiki-gigaword-100")

2019-09-09 22:38:16,681 : MainThread : INFO : loading projection weights from /Users/oliverborchers/gensim-data/glove-wiki-gigaword-100/glove-wiki-gigaword-100.gz
2019-09-09 22:38:58,016 : MainThread : INFO : loaded (400000, 100) matrix from /Users/oliverborchers/gensim-data/glove-wiki-gigaword-100/glove-wiki-gigaword-100.gz


In [11]:
sentences = []
for d in data:
    # Let's blow up the data a bit by replicating each sentence.
    for i in range(8):
        sentences.append(d["question1"].split())
        sentences.append(d["question2"].split())
s = IndexedList(sentences)
print(len(s))


6468640


So we have 6,468,640 sentences (about 6.5 million) for which we want to compute embeddings. To verify that the Cython routines compiled correctly, import the FAST_VERSION variable as follows:

In [12]:
from fse.models.average import FAST_VERSION, MAX_WORDS_IN_BATCH
print(MAX_WORDS_IN_BATCH)
print(FAST_VERSION)
# 1 -> The fast version works

10000
1


In [13]:
from fse.models import SIF
model = SIF(glove, workers=2)

In [14]:
model.train(s)

2019-09-09 22:39:22,954 : MainThread : INFO : scanning all indexed sentences and their word counts
2019-09-09 22:39:27,960 : MainThread : INFO : SCANNING : finished 3603151 sentences with 39821529 words
2019-09-09 22:39:32,287 : MainThread : INFO : finished scanning 6468640 sentences with an average length of 11 and 71556728 total words
2019-09-09 22:39:32,494 : MainThread : INFO : estimated memory for 6468640 sentences with 100 dimensions and 400000 vocabulary: 2621 MB (2 GB)
2019-09-09 22:39:32,494 : MainThread : INFO : initializing sentence vectors for 6468640 sentences
2019-09-09 22:39:45,235 : MainThread : INFO : pre-computing SIF weights for 400000 words
2019-09-09 22:39:45,517 : MainThread : INFO : begin training
2019-09-09 22:39:50,528 : MainThread : INFO : PROGRESS : finished 28.59% with 1849612 sentences and 14048992 words, 369922 sentences/s
2019-09-09 22:39:55,532 : MainThread : INFO : PROGRESS : finished 61.40% with 3971644 sentences and 30216912 words, 424406 sentences/s


(6468624, 49255184)

The log above reports roughly 370,000 to 425,000 sentences per second, so the roughly 6.5 million sentences are processed in about 15 seconds.

Once the SIF model is trained, you can run inference on additional, previously unseen sentences. This two-step process for new data is required because computing the principal components for models like SIF and uSIF requires a fair number of sentences. If you want the vector for a single sentence that was not part of the training data, just use:

In [15]:
tmp = ("Hello my friends".split(), 0)
model.infer([tmp])

2019-09-09 22:40:27,964 : MainThread : INFO : scanning all indexed sentences and their word counts
2019-09-09 22:40:27,966 : MainThread : INFO : finished scanning 1 sentences with an average length of 3 and 3 total words
2019-09-09 22:40:27,968 : MainThread : INFO : removing 1 principal components took 0s


array([[ 0.38196447, -0.1988384 ,  0.0571419 , -0.2608053 , -0.39127567,
         0.56997204, -0.08829699,  0.04303335,  0.02415937, -0.18261634,
         0.12377772,  0.16709268,  0.22155543,  0.19370788, -0.26391292,
        -0.2801168 , -0.26631495, -0.04989915, -0.2803055 ,  0.35429263,
        -0.33491358,  0.4502716 , -0.35931408, -0.15993637,  0.19240806,
         0.5006909 , -0.45003742, -0.2531514 ,  0.7854812 ,  0.16518082,
         0.48821324,  0.38489473,  0.71547   ,  0.1450263 , -0.2963434 ,
         0.28262872, -0.11362517,  0.04488435,  0.41105908, -0.2494816 ,
         0.1678418 ,  0.18146755,  0.5637449 , -0.16483429, -0.34314314,
        -0.11086482, -0.71529424,  0.37857854,  0.5497621 , -0.21101826,
        -0.1545457 , -0.07437938, -0.00441471, -0.08365119,  0.20791844,
         0.01463819,  0.27646348,  0.39419666, -0.30494595, -0.09648007,
         0.47651446,  0.8765762 , -0.01992229, -0.23987037, -0.14416808,
         0.04672491,  0.27598125,  0.0977577 ,  0.1
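
The training log mentions pre-computed SIF weights. As a hedged sketch of the idea, the code below follows the usual SIF formulation, where each word is weighted by a / (a + p(w)) and p(w) is the word's relative frequency. The value a=1e-3, the toy corpus, the toy vectors, and the function names are assumptions chosen for illustration. This is not fse's Cython implementation, and the principal component removal step is left out.

```python
from collections import Counter


def sif_weights(corpus, a=1e-3):
    """Map each word to a / (a + p(w)), with p(w) its relative frequency."""
    counts = Counter(word for sentence in corpus for word in sentence)
    total = sum(counts.values())
    return {word: a / (a + count / total) for word, count in counts.items()}


def weighted_average(sentence, vectors, weights):
    """Average the word vectors of a sentence, scaled by their weights."""
    dim = len(next(iter(vectors.values())))
    result = [0.0] * dim
    known = [w for w in sentence if w in vectors and w in weights]
    for word in known:
        for k in range(dim):
            result[k] += weights[word] * vectors[word][k]
    # Divide by the number of known words; empty input stays a zero vector.
    return [x / len(known) for x in result] if known else result


corpus = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
toy_vectors = {
    "the": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "dog": [0.0, 2.0],
    "bird": [0.0, 3.0],
}
weights = sif_weights(corpus)

# Frequent words receive a smaller weight than rare ones.
print(weights["the"] < weights["cat"])  # True
print(weighted_average(["the", "cat"], toy_vectors, weights))
```

The frequent word "the" contributes far less to the sentence vector than the rarer "cat", which is the intended effect of the weighting.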

## Querying the model

To query the model or perform similarity computations, we can access the model.sv (sentence vectors) object and use its methods. To get the vector for an index, just call:

In [16]:
model.sv[0]

array([ 0.10094505, -0.17338434, -0.06120607,  0.34027654,  0.22177696,
       -0.24794772, -0.11074717, -0.14664018, -0.05048759, -0.00544698,
       -0.10035248, -0.00855931, -0.12817514, -0.23004776, -0.17999992,
        0.07368515, -0.03067316, -0.05431816,  0.08335009, -0.10299528,
        0.06848799, -0.00242306, -0.0400511 ,  0.29239285, -0.0816397 ,
       -0.08835362,  0.12184662,  0.14001876, -0.07441948,  0.14240132,
       -0.15721211, -0.02932569, -0.06760302, -0.07329973, -0.03554936,
        0.08278874,  0.19171   , -0.05927336, -0.14163393, -0.13755181,
       -0.02026197, -0.04314215, -0.05729384,  0.0653533 , -0.06144977,
        0.06533504, -0.09953416,  0.04081041, -0.03480543, -0.19186741,
       -0.21262896,  0.09202444,  0.10283869,  0.12685204, -0.16222197,
       -0.21388721, -0.24895634, -0.16497523,  0.29118574, -0.19552863,
       -0.03108095, -0.3014589 , -0.00456905,  0.11545967, -0.11587113,
        0.0689742 , -0.2582472 ,  0.07914913,  0.16394164,  0.06

To compute the similarity or distance between two sentences from the training set, you can call:

In [17]:
print(model.sv.similarity(0,1).round(3))
print(model.sv.distance(0,1).round(3))

0.929
0.071
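
The two values above add up to 1, which is what you would expect if the similarity is a cosine similarity and the distance is defined as 1 minus that similarity. The following sketch works under that assumption with made-up toy vectors; `cosine_similarity` is a hypothetical helper, not an fse function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.5]

sim = cosine_similarity(u, v)
dist = 1.0 - sim  # assumed definition of the distance

print(round(sim + dist, 3))  # 1.0
```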


We can further call for the most similar sentences given an index. For example, we want to know the most similar sentences for sentence index 100:

In [18]:
print(s[100])

(['Should', 'I', 'buy', 'tiago?'], 100)


In [19]:
model.sv.most_similar(100)
# Division by zero can happen if you encounter empty sentences

2019-09-09 22:40:28,004 : MainThread : INFO : precomputing L2-norms of sentence vectors


[(3949083, 1.0),
 (897678, 1.0),
 (4229890, 1.0),
 (3949079, 1.0),
 (3949081, 1.0),
 (2934317, 1.0),
 (4093542, 1.0),
 (3949075, 1.0),
 (4229889, 1.0),
 (2934319, 1.0)]

However, the preceding call only returns the indices of the most similar sentences. You can work around this by passing an indexable object (such as the list of sentences) to the most_similar call:

In [20]:
model.sv.most_similar(100, indexable=sentences)

[(['Should', 'I', 'buy', 'Asus', 'Zenfone', '5?'], 3949083, 1.0),
 (['Why', "doesn't", 'Google', 'buy', 'Quora?'], 897678, 1.0),
 (['Should', 'Google', 'buy', 'Quora?'], 4229890, 1.0),
 (['Should', 'I', 'buy', 'Asus', 'Zenfone', '5?'], 3949079, 1.0),
 (['Should', 'I', 'buy', 'Asus', 'Zenfone', '5?'], 3949081, 1.0),
 (['Why', "didn't", 'Facebook', 'buy', 'Twitter?'], 2934317, 1.0),
 (['Should', 'I', 'buy', 'Xiaomi', 'Redmi', 'Note', '3?', 'Why?'],
  4093542,
  1.0),
 (['Should', 'I', 'buy', 'Asus', 'Zenfone', '5?'], 3949075, 1.0),
 (['Will', 'Google', 'buy', 'Quora?'], 4229889, 1.0),
 (['Why', "didn't", 'Facebook', 'buy', 'Twitter?'], 2934319, 1.0)]

There we go. This is a lot more understandable than the initial list of indices.
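
For readers curious how such a lookup can work in principle, here is a brute-force sketch: normalize all vectors, score them against the query, and optionally resolve indices through an indexable. The function `most_similar_sketch`, the toy data, and the guard for zero vectors (empty sentences) are assumptions for illustration and do not mirror fse's actual implementation.

```python
import math


def most_similar_sketch(query_index, vectors, topn=2, indexable=None):
    """Return the topn most similar rows as (index, score) tuples.

    If an indexable is given, return (item, index, score) tuples instead.
    """

    def unit(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        # Empty sentences give zero vectors; avoid dividing by zero.
        return [x / norm for x in vec] if norm > 0 else list(vec)

    normed = [unit(vec) for vec in vectors]
    query = normed[query_index]
    scores = [
        (i, sum(a * b for a, b in zip(query, vec)))
        for i, vec in enumerate(normed)
        if i != query_index  # skip the query sentence itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    top = scores[:topn]
    if indexable is None:
        return top
    return [(indexable[i], i, score) for i, score in top]


toy_sentences = [["buy", "phone"], ["buy", "laptop"], ["eat", "pizza"], []]
toy_matrix = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0], [0.0, 0.0]]

top_plain = most_similar_sketch(0, toy_matrix)
top_named = most_similar_sketch(0, toy_matrix, indexable=toy_sentences)

print([i for i, _ in top_plain])           # [1, 2]
print([item for item, _, _ in top_named])  # [['buy', 'laptop'], ['eat', 'pizza']]
```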

To search for sentences that are similar to a given word vector, you can call:

In [21]:
model.sv.similar_by_word("easy", wv=glove, indexable=sentences)

[(['How', 'do', 'I', 'make', 'easy', 'money?'], 6383261, 0.5409939289093018),
 (['How', 'do', 'I', 'make', 'easy', 'money?'], 842112, 0.5409939289093018),
 (['How', 'do', 'I', 'make', 'easy', 'money?'], 6383255, 0.5409939289093018),
 (['How', 'do', 'I', 'make', 'easy', 'money?'], 6383257, 0.5409939289093018),
 (['How', 'do', 'I', 'make', 'easy', 'money?'], 6383259, 0.5409939289093018),
 (['How', 'do', 'I', 'make', 'easy', 'money?'], 4405391, 0.5409939289093018),
 (['How', 'do', 'I', 'make', 'easy', 'money?'], 6383263, 0.5409939289093018),
 (['How', 'do', 'I', 'make', 'easy', 'money?'], 842116, 0.5409939289093018),
 (['How', 'do', 'I', 'make', 'easy', 'money?'], 842114, 0.5409939289093018),
 (['How', 'do', 'I', 'make', 'easy', 'money?'], 6383251, 0.5409939289093018)]

Furthermore, you can query for unknown (or new) sentences by calling:

In [22]:
model.sv.similar_by_sentence("Is this really easy to learn".split(), model=model, indexable=sentences)

2019-09-09 22:40:49,606 : MainThread : INFO : scanning all indexed sentences and their word counts
2019-09-09 22:40:49,608 : MainThread : INFO : finished scanning 1 sentences with an average length of 6 and 6 total words
2019-09-09 22:40:49,622 : MainThread : INFO : removing 1 principal components took 0s


[(['Is',
   'it',
   'easy',
   'to',
   'learn',
   'Hebrew',
   'if',
   'you',
   'learn',
   'Arabic?'],
  4666689,
  0.9029382467269897),
 (['Is',
   'it',
   'easy',
   'to',
   'learn',
   'Arabic',
   'if',
   'you',
   'learn',
   'Hebrew',
   'first?'],
  4666700,
  0.9029382467269897),
 (['Is',
   'it',
   'easy',
   'to',
   'learn',
   'Arabic',
   'if',
   'you',
   'learn',
   'Hebrew',
   'first?'],
  4666698,
  0.9029382467269897),
 (['Is',
   'it',
   'easy',
   'to',
   'learn',
   'Arabic',
   'if',
   'you',
   'learn',
   'Hebrew',
   'first?'],
  4666690,
  0.9029382467269897),
 (['Is',
   'it',
   'easy',
   'to',
   'learn',
   'Hebrew',
   'if',
   'you',
   'learn',
   'Arabic?'],
  4666691,
  0.9029382467269897),
 (['Is',
   'it',
   'easy',
   'to',
   'learn',
   'Hebrew',
   'if',
   'you',
   'learn',
   'Arabic?'],
  4666699,
  0.9029382467269897),
 (['Is',
   'it',
   'easy',
   'to',
   'learn',
   'Arabic',
   'if',
   'you',
   'learn',
   'Hebrew',

Feel free to browse through the library and get to know the functions a little better!