# fse - Tutorial

Welcome to fse - fast sentence embeddings. The library is intended to compute sentence embeddings as fast as possible. 
It offers a simple and easy-to-understand syntax for you to use in your own projects. Before we start with any model, let's have a look at the input types.
All fse models require an iterable/generator which produces tuples. Each tuple has two fields: words and index. The index is required for multi-threaded processing, as sentences might not be processed sequentially. The index dictates which row of the corresponding sentence vector matrix the sentence belongs to.
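
To illustrate why the index matters, here is a minimal sketch that uses only the Python standard library and no fse code. The `embed` function is a hypothetical placeholder, not a real embedding. Results arrive in completion order, yet each one lands in the correct row because the index travels with the sentence.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

sentences = [
    (["Hello", "world"], 0),
    (["how", "are", "you"], 1),
    (["today", "is", "a", "good", "day"], 2),
]

def embed(words):
    # Hypothetical placeholder "embedding": the word count as a 1-d vector.
    return [float(len(words))]

# Pre-allocate one row per sentence, similar in spirit to a sentence vector matrix.
matrix = [None] * len(sentences)

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(embed, words): index for words, index in sentences}
    # as_completed yields futures in completion order, not submission order.
    for future in as_completed(futures):
        matrix[futures[future]] = future.result()

print(matrix)
```

Without the index, the order of the rows would depend on which worker happened to finish first.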

## Input handling

In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
s = (["Hello", "world"], 0)
print(s[0])
print(s[1])

The words of the tuple must always be a list of strings. Otherwise the train method will raise an error. However, most input data is available as a list of sentence strings rather than as pre-split word lists.
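
To make the expected shape concrete, the following sketch converts plain strings into (words, index) tuples by hand and checks the result. Both `indexed_sentences` and `is_valid_input` are hypothetical helpers written for this illustration. They are not part of fse, and the check only approximates whatever validation the library performs.

```python
def indexed_sentences(raw_sentences):
    """Yield (words, index) tuples from plain sentence strings."""
    for index, text in enumerate(raw_sentences):
        yield text.split(), index

def is_valid_input(item):
    """Check the (list of str, int) shape described above."""
    words, index = item
    return (
        isinstance(words, list)
        and all(isinstance(w, str) for w in words)
        and isinstance(index, int)
    )

pairs = list(indexed_sentences(["Hello world", "how are you?"]))
print(pairs)
print(all(is_valid_input(p) for p in pairs))

# A bare string in the words field does not have the required shape.
print(is_valid_input(("Hello world", 0)))
```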

In order to deal with this common input format, fse provides the IndexedList and some variants, which handle all required data operations for you. You can provide multiple lists (or sets), which will all be merged into a single list. This is convenient when working with the STS datasets.

There are multiple types of indexed lists. Let's go through them one by one:
- IndexedList: for sentences that are already split
- **C**IndexedList: for sentences that are already split, with a custom index for each sentence
- SplitIndexedList: for sentences that have not been split. Will split the strings
- Split**C**IndexedList: for sentences that have not been split, with a custom index for each sentence
- **C**SplitIndexedList: for sentences that have not been split. Will split the strings. You can provide a custom split function
- **C**Split**C**IndexedList: for sentences where you want to provide a custom index and a custom split function

*Note*: These are ordered by speed, meaning that IndexedList is the fastest, while **C**Split**C**IndexedList is the slowest variant.
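
The two customization points in the list above, a custom index and a custom split function, can be sketched conceptually in plain Python. The helpers `with_custom_index` and `with_custom_split` are hypothetical and only mimic the resulting (words, index) tuples. They make no claim about how the fse classes are implemented or which arguments they accept.

```python
def with_custom_index(sentences, custom_index):
    """Pair each pre-split sentence with a caller-supplied index."""
    if len(sentences) != len(custom_index):
        raise ValueError("need exactly one index per sentence")
    return list(zip(sentences, custom_index))

def with_custom_split(texts, custom_split):
    """Split each string with a caller-supplied function; index by position."""
    return [(custom_split(text), i) for i, text in enumerate(texts)]

custom = with_custom_index([["Hello", "there"], ["how", "are", "you?"]], [7, 3])
lowered = with_custom_split(["Hello there"], lambda t: t.lower().split())

print(custom)
print(lowered)
```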

In [3]:
from fse import SplitIndexedList

sentences_a = ["Hello there", "how are you?"]
sentences_b = ["today is a good day", "Lorem ipsum"]

s = SplitIndexedList(sentences_a, sentences_b)
print(len(s))
s[0]

4


(['Hello', 'there'], 0)

To save memory, we do not convert the original lists in place. The conversion only takes place once you call the getitem method, for example via s[0]. To access the original data, call:

In [4]:
s.items

['Hello there', 'how are you?', 'today is a good day', 'Lorem ipsum']
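
The lazy conversion described above can be sketched with a small class. `LazySplitList` is a hypothetical stand-in written for this tutorial, not the actual SplitIndexedList. It merges the input lists once, keeps the raw strings in `items`, and only splits a string when that item is accessed.

```python
class LazySplitList:
    """Hypothetical stand-in illustrating lazy, on-access splitting."""

    def __init__(self, *lists, split=str.split):
        self.items = []
        for lst in lists:
            self.items.extend(lst)  # merge all inputs into a single list
        self.split = split

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        # The conversion happens here, on access, not at construction time.
        return self.split(self.items[i]), i

lazy = LazySplitList(["Hello there", "how are you?"], ["today is a good day"])
print(len(lazy))
print(lazy[0])
print(lazy.items[0])  # the original string is still stored unchanged

# A custom split function can be swapped in, here lower-casing first.
lowered = LazySplitList(["Hello there"], split=lambda t: t.lower().split())
print(lowered[0])
```

Storing only the raw strings avoids keeping a second, split copy of the whole corpus in memory, at the cost of splitting again on every access.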

If the data is already preprocessed as a list of lists, you can just use the IndexedList.

In [5]:
from fse import IndexedList

sentences_splitted = ["Hello there".split(), "how are you?".split()]
s = IndexedList(sentences_splitted)
print(len(s))
s[0]

2


(['Hello', 'there'], 0)

In case you want to provide your own splitting function, you can pass a callable to the **C**SplitIndexedList class.

In [6]:
from fse import CSplitIndexedList

def split_func(string):
    return string.lower().split()

s = CSplitIndexedList(sentences_a, custom_split=split_func)
print(len(s))
s[0]

2


(['hello', 'there'], 0)

If you want to stream a file from disk (where each line corresponds to a sentence) you can use the IndexedLineDocument.

In [7]:
from fse import IndexedLineDocument
doc = IndexedLineDocument("../test/test_data/test_sentences.txt")

In [8]:
from itertools import islice

# Print the index and words of the first four sentences in the document.
for s in islice(doc, 4):
    print(f"{s[1]}\t{s[0]}")

0	['Good', 'stuff', 'i', 'just', 'wish', 'it', 'lasted', 'longer']
1	['Hp', 'makes', 'qualilty', 'stuff']
2	['I', 'like', 'it']
3	['Try', 'it', 'you', 'will', 'like', 'it']


If you later work with sentence similarities, the IndexedLineDocument lets you access each line by its corresponding index. This helps you interpret the results, as the most_similar method would otherwise just return indices.

In [9]:
doc[20]

'I feel like i just got screwed'
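
One way to support both streaming and access by line number is to record the byte offset of every line once and seek to it on demand. The sketch below shows that idea with a hypothetical `LineIndex` class and a temporary file. It is an assumption about a possible design, not a description of how IndexedLineDocument works internally.

```python
import os
import tempfile

class LineIndex:
    """Hypothetical helper: stream a text file or fetch single lines by number."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                self.offsets.append(offset)  # byte position where the line starts
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, i):
        # Jump straight to the requested line instead of re-reading the file.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[i])
            return f.readline().decode("utf-8").rstrip("\r\n")

    def __iter__(self):
        # Streaming mode: yield (words, index) tuples line by line.
        with open(self.path, "r", encoding="utf-8") as f:
            for index, line in enumerate(f):
                yield line.split(), index

lines = ["Good stuff i just wish it lasted longer", "I like it", "Try it you will like it"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp_file:
    tmp_file.write("\n".join(lines) + "\n")

line_doc = LineIndex(tmp_file.name)
second_line = line_doc[1]
streamed = list(line_doc)
os.remove(tmp_file.name)  # clean up the temporary file

print(second_line)
print(streamed[2])
```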

# Training a model / Performing inference

Training an fse model is simple. You only need a pre-trained word embedding model, which you pass during the initialization of the fse model you want to use.

In [10]:
from fse import Vectors
import gensim.downloader as api
data = api.load("quora-duplicate-questions")
glove = Vectors.from_pretrained("glove-wiki-gigaword-100")

2021-12-02 20:28:02,197 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse__glove-wiki-gigaword-100.3282d5e7c5e979c2411ba9703d63a46243a2047e/glove-wiki-gigaword-100.model
2021-12-02 20:28:03,181 : MainThread : INFO : loading vectors from /home/oborchers/.cache/huggingface/hub/fse__glove-wiki-gigaword-100.3282d5e7c5e979c2411ba9703d63a46243a2047e/glove-wiki-gigaword-100.model.vectors.npy with mmap=None
2021-12-02 20:28:03,249 : MainThread : INFO : setting ignored attribute vectors_norm to None
2021-12-02 20:28:03,250 : MainThread : INFO : loaded /home/oborchers/.cache/huggingface/hub/fse__glove-wiki-gigaword-100.3282d5e7c5e979c2411ba9703d63a46243a2047e/glove-wiki-gigaword-100.model


In [11]:
sentences = []
for d in data:
    # Let's blow up the data a bit by replicating each sentence.
    for i in range(8):
        sentences.append(d["question1"].split())
        sentences.append(d["question2"].split())
s = IndexedList(sentences)
print(len(s))

6468640


So we have about 6.5 million sentences (6,468,640) we want to compute the embeddings for. If you import the FAST_VERSION variable as follows, you can verify that the compilation of the Cython routines worked correctly:

In [12]:
from fse.models.average import FAST_VERSION, MAX_WORDS_IN_BATCH
print(MAX_WORDS_IN_BATCH)
print(FAST_VERSION)
# 1 -> The fast version works

10000
1


In [23]:
from fse.models import uSIF
model = uSIF(glove, workers=1, lang_freq="en")

2021-12-02 20:29:33,501 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


In [24]:
model.train(s)

2021-12-02 20:29:33,926 : MainThread : INFO : scanning all indexed sentences and their word counts
2021-12-02 20:29:38,928 : MainThread : INFO : SCANNING : finished 3898235 sentences with 43084793 words
2021-12-02 20:29:42,337 : MainThread : INFO : finished scanning 6468640 sentences with an average length of 11 and 71556728 total words
2021-12-02 20:29:42,467 : MainThread : INFO : estimated memory for 6468640 sentences with 100 dimensions and 400000 vocabulary: 2621 MB (2 GB)
2021-12-02 20:29:42,468 : MainThread : INFO : initializing sentence vectors for 6468640 sentences
2021-12-02 20:30:01,833 : MainThread : INFO : pre-computing uSIF weights for 400000 words
2021-12-02 20:30:02,752 : MainThread : INFO : begin training
2021-12-02 20:30:07,761 : MainThread : INFO : PROGRESS : finished 25.71% with 1663049 sentences and 12641690 words, 332609 sentences/s
2021-12-02 20:30:12,763 : MainThread : INFO : PROGRESS : finished 49.99% with 3233385 sentences and 24604424 words, 314067 sentences/s

(6468624, 49255184)

The model's training speed is around 300,000 - 500,000 sentences per second. At that rate, the 6,468,640 sentences take roughly 13 to 22 seconds; in the log above, half of them are finished about 10 seconds after training begins. This is **heavily dependent** on the input processing. If you train RAM-to-RAM, it is naturally faster than any RAM-to-disk or disk-to-disk variant. Similarly, the speed depends on the number of workers.
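
As a back-of-the-envelope check, the expected duration follows from dividing the sentence count by the throughput. The snippet below does this for the two ends of the quoted range. The numbers are taken from this tutorial, and actual timings will vary with hardware and input handling.

```python
n_sentences = 6_468_640  # number of sentences in the IndexedList above

# Estimated training time in seconds for the lower and upper throughput bound.
estimates = {rate: n_sentences / rate for rate in (300_000, 500_000)}

for rate, seconds in estimates.items():
    print(f"{rate} sentences/s -> {seconds:.1f} s")
```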

Once the uSIF model is trained, you can perform additional inferences for unknown sentences. This two-step process for new data is required, as computing the principal components for models like SIF and uSIF requires a fair amount of sentences. If you want the vector for a single sentence that was not part of the training data, just use:

In [15]:
tmp = ("Hello my friends".split(), 0)
model.infer([tmp])

2021-12-02 20:29:27,731 : MainThread : INFO : scanning all indexed sentences and their word counts
2021-12-02 20:29:27,732 : MainThread : INFO : finished scanning 1 sentences with an average length of 3 and 3 total words
2021-12-02 20:29:27,733 : MainThread : INFO : removing 5 principal components took 0s


array([[ 2.52946198e-01, -2.80404240e-02,  2.69833803e-02,
        -2.78671950e-01, -7.44080096e-02,  4.57280308e-01,
        -1.05054319e-01,  2.72667259e-02, -6.48381487e-02,
        -3.40230405e-01, -2.04274803e-03, -7.25736842e-02,
         1.93554670e-01,  1.53935701e-01, -1.17377929e-01,
        -2.86470026e-01,  9.35275406e-02, -1.55883789e-01,
        -3.67838562e-01,  3.55114430e-01, -1.01716474e-01,
         2.67178684e-01, -3.58482040e-02, -1.73439160e-01,
         1.11153685e-01,  9.17388499e-02, -2.18827292e-01,
        -5.82419336e-02,  4.64093864e-01,  1.16017178e-01,
         2.43311703e-01,  2.93871671e-01,  3.83903325e-01,
         1.23666152e-01,  1.68591365e-03,  2.47326195e-01,
         1.76458687e-01,  6.19876608e-02,  2.72473156e-01,
        -1.29384965e-01, -1.28560305e-01,  1.32312194e-01,
         2.21162975e-01, -1.13845311e-01, -1.39296561e-01,
        -1.14041977e-01, -4.00316596e-01,  3.18139911e-01,
         3.94160390e-01, -1.03439599e-01, -1.09797075e-0
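
The two-step process of training and then inferring can be sketched as follows. This is a deliberately simplified illustration in plain Python: it estimates a single dominant direction by power iteration on toy 2-d vectors, whereas the log above shows five principal components being removed. The functions `fit_component` and `remove_component` are hypothetical and do not reflect the optimized routines fse uses.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_component(rows, iterations=100):
    """Estimate the dominant direction of `rows` by power iteration."""
    v = [1.0] * len(rows[0])
    for _ in range(iterations):
        # Multiply v by X^T X without forming the matrix explicitly.
        w = [0.0] * len(v)
        for row in rows:
            scale = dot(row, v)
            for j, value in enumerate(row):
                w[j] += scale * value
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v

def remove_component(vec, component):
    """Subtract the projection of `vec` onto a unit-length `component`."""
    scale = dot(vec, component)
    return [x - scale * c for x, c in zip(vec, component)]

# Step 1 ("train"): estimate the component from many sentence vectors.
train_vectors = [[2.0, 0.1], [1.9, -0.2], [2.1, 0.3], [2.0, -0.1]]
component = fit_component(train_vectors)

# Step 2 ("infer"): reuse the stored component for a single new vector.
new_vector = [1.5, 0.4]
cleaned = remove_component(new_vector, component)

print(component)
print(cleaned)
```

A single sentence gives no meaningful estimate of a shared direction, which is why the component is fitted on the full corpus and then reused.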

## Querying the model

In order to query the model or perform similarity computations, we can just access the model.sv (sentence vectors) object and use its methods. To get a vector for an index, just call:

In [16]:
model.sv[0]

array([ 0.06334692, -0.00278309,  0.02876258,  0.2938737 ,  0.16536492,
       -0.32892653, -0.24968779, -0.11547095, -0.00762739, -0.09775834,
       -0.02934675,  0.11205705, -0.06664   , -0.26486415, -0.1903032 ,
       -0.05020472, -0.00186126,  0.06867541,  0.02295774,  0.15203542,
        0.09067672,  0.04975739, -0.23175132,  0.14476334, -0.14295411,
        0.02923434,  0.04803507,  0.06715866, -0.07600797,  0.01031642,
       -0.2484782 ,  0.22390996, -0.09542373, -0.09283138,  0.13540202,
        0.15456603,  0.19957334, -0.10639023, -0.09370194, -0.21725996,
       -0.0491615 , -0.07300739,  0.03414775, -0.09599279, -0.24818763,
        0.1342045 , -0.23917073,  0.05558453, -0.06525436, -0.48910773,
       -0.22362332, -0.00779874, -0.03814342,  0.2980885 , -0.17636092,
       -0.5499361 , -0.14905512, -0.03137571,  0.67050046, -0.07416987,
        0.0496444 , -0.18189807, -0.14830717, -0.00139662,  0.05445424,
        0.14017463, -0.19543567,  0.214339  ,  0.12590402, -0.07

To compute the similarity or distance between two sentences from the training set, you can call:

In [17]:
print(model.sv.similarity(0,1).round(3))
print(model.sv.distance(0,1).round(3))

0.965
0.035
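
The two values above sum to 1, which is what you would expect if the similarity is the cosine of the angle between the two vectors and the distance is one minus that value. The sketch below works under that assumption, using plain Python and made-up vectors instead of the model's sentence vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance defined as one minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

u = [1.0, 2.0, 3.0]
v = [1.0, 2.0, 2.5]
sim = cosine_similarity(u, v)
dist = cosine_distance(u, v)

print(round(sim, 3))
print(round(dist, 3))
```

In this sketch, an all-zero vector would make the denominator zero and raise a ZeroDivisionError, so such vectors need to be handled separately.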


We can further call for the most similar sentences given an index. For example, we want to know the most similar sentences for sentence index 100:

In [18]:
print(s[100])

(['Should', 'I', 'buy', 'tiago?'], 100)


In [19]:
model.sv.most_similar(100)
# Division by zero can happen if you encounter empty sentences

2021-12-02 20:29:27,773 : MainThread : INFO : precomputing L2-norms of sentence vectors


[(2688924, 1.0),
 (3116047, 1.0),
 (2688918, 1.0),
 (2688920, 1.0),
 (2688922, 1.0),
 (2688914, 1.0),
 (2688926, 1.0),
 (3116041, 1.0),
 (3116043, 1.0),
 (1384926, 1.0)]

However, the preceding function only supplies the indices of the most similar sentences. You can circumvent this problem by passing an indexable object, such as the list of original sentences, to the most_similar call:

In [20]:
model.sv.most_similar(100, indexable=s.items)

[(['Will', 'Facebook', 'buy', 'Quora?'], 2688924, 1.0),
 (['Why', "doesn't", 'Apple', 'buy', 'Samsung?'], 3116047, 1.0),
 (['Will', 'Facebook', 'buy', 'Quora?'], 2688918, 1.0),
 (['Will', 'Facebook', 'buy', 'Quora?'], 2688920, 1.0),
 (['Will', 'Facebook', 'buy', 'Quora?'], 2688922, 1.0),
 (['Will', 'Facebook', 'buy', 'Quora?'], 2688914, 1.0),
 (['Will', 'Facebook', 'buy', 'Quora?'], 2688926, 1.0),
 (['Why', "doesn't", 'Apple', 'buy', 'Samsung?'], 3116041, 1.0),
 (['Why', "doesn't", 'Apple', 'buy', 'Samsung?'], 3116043, 1.0),
 (['Should', 'I', 'buy', 'CS:GO?'], 1384926, 1.0)]

There we go. This is a lot more understandable than the initial list of indices.
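
Conceptually, such a lookup is a brute-force nearest-neighbour search followed by an optional mapping from indices back to the original items. The following sketch shows this with a hypothetical `most_similar` function over toy vectors. It assumes cosine similarity and that the query itself is excluded from the results, and it is not the fse implementation.

```python
import heapq
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar(vectors, query_index, topn=2, indexable=None):
    """Return the topn entries most similar to vectors[query_index]."""
    query = unit(vectors[query_index])
    scored = []
    for i, v in enumerate(vectors):
        if i == query_index:
            continue  # assumption: the query itself is not returned
        score = sum(a * b for a, b in zip(query, unit(v)))
        scored.append((i, score))
    best = heapq.nlargest(topn, scored, key=lambda pair: pair[1])
    if indexable is None:
        return best
    # Prepend the original item so the result is readable without a lookup.
    return [(indexable[i], i, score) for i, score in best]

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
items = ["sentence a", "sentence b", "sentence c", "sentence d"]

indices_only = most_similar(vectors, 0)
with_items = most_similar(vectors, 0, indexable=items)

print(indices_only)
print(with_items)
```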

To search for sentences which are similar to a given word vector, you can call:

In [21]:
model.sv.similar_by_word("easy", wv=glove, indexable=s.items)

[(['Which',
   'is',
   'more',
   'easy',
   'to',
   'learn?',
   'Ruby',
   'on',
   'Rails',
   'or',
   'Python/Django?'],
  4717071,
  0.9476152658462524),
 (['Which',
   'is',
   'more',
   'easy',
   'to',
   'learn?',
   'Ruby',
   'on',
   'Rails',
   'or',
   'Python/Django?'],
  4717059,
  0.9476152658462524),
 (['Which',
   'is',
   'more',
   'easy',
   'to',
   'learn?',
   'Ruby',
   'on',
   'Rails',
   'or',
   'Python/Django?'],
  4717061,
  0.9476152658462524),
 (['Which',
   'is',
   'more',
   'easy',
   'to',
   'learn?',
   'Ruby',
   'on',
   'Rails',
   'or',
   'Python/Django?'],
  4717063,
  0.9476152658462524),
 (['Which',
   'is',
   'more',
   'easy',
   'to',
   'learn?',
   'Ruby',
   'on',
   'Rails',
   'or',
   'Python/Django?'],
  4717065,
  0.9476152658462524),
 (['Which',
   'is',
   'more',
   'easy',
   'to',
   'learn?',
   'Ruby',
   'on',
   'Rails',
   'or',
   'Python/Django?'],
  4717067,
  0.9476152658462524),
 (['Which',
   'is',
   'mor

Furthermore, you can query for unknown (or new) sentences by calling:

In [22]:
model.sv.similar_by_sentence("Is this really easy to learn".split(), model=model, indexable=s.items)

2021-12-02 20:29:31,107 : MainThread : INFO : scanning all indexed sentences and their word counts
2021-12-02 20:29:31,109 : MainThread : INFO : finished scanning 1 sentences with an average length of 6 and 6 total words
2021-12-02 20:29:31,110 : MainThread : INFO : removing 5 principal components took 0s


[(['How', 'do', 'I', 'learn', 'Python', 'in', 'easy', 'way?'],
  6255670,
  0.9872919321060181),
 (['How', 'do', 'I', 'learn', 'Python', 'in', 'easy', 'way?'],
  6255672,
  0.9872919321060181),
 (['How', 'do', 'I', 'learn', 'Python', 'in', 'easy', 'way?'],
  418226,
  0.9872919321060181),
 (['How', 'do', 'I', 'learn', 'Python', 'in', 'easy', 'way?'],
  418224,
  0.9872919321060181),
 (['How', 'do', 'I', 'learn', 'Python', 'in', 'easy', 'way?'],
  418232,
  0.9872919321060181),
 (['How', 'do', 'I', 'learn', 'Python', 'in', 'easy', 'way?'],
  6255678,
  0.9872919321060181),
 (['How', 'do', 'I', 'learn', 'Python', 'in', 'easy', 'way?'],
  6255676,
  0.9872919321060181),
 (['How', 'do', 'I', 'learn', 'Python', 'in', 'easy', 'way?'],
  6255674,
  0.9872919321060181),
 (['How', 'do', 'I', 'learn', 'Python', 'in', 'easy', 'way?'],
  418236,
  0.9872919321060181),
 (['How', 'do', 'I', 'learn', 'Python', 'in', 'easy', 'way?'],
  418228,
  0.9872919321060181)]

Feel free to browse through the library and get to know the functions a little better!