# fse - Tutorial

Welcome to fse - fast sentence embeddings. The library is intended to compute sentence embeddings as fast as possible.
It offers a simple and easy-to-understand syntax for you to use in your own projects. Before we start with any model, let's have a look at the input types fse works with.
All fse models require an iterable/generator that produces IndexedSentence objects. An IndexedSentence is a named tuple with two fields: words and index. The index is required for multi-core processing, as sentences might not be processed sequentially. The index dictates which row of the corresponding sentence vector matrix the sentence belongs to.
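
To make the role of the index concrete, here is a minimal sketch that uses a stand-in named tuple from the standard library. The names `Sentence` and `fill_rows` are illustrative and not part of fse. The sentences arrive out of order, as they might from several workers, yet each result lands in the row its index dictates:

```python
from collections import namedtuple

# Stand-in for fse's IndexedSentence: a named tuple with the same two fields.
Sentence = namedtuple("Sentence", ["words", "index"])

def fill_rows(sentences, n_rows):
    """Store a per-sentence result (here: the word count) in the row given by index."""
    rows = [None] * n_rows
    for sent in sentences:
        rows[sent.index] = len(sent.words)
    return rows

# The sentences are deliberately out of order.
batch = [
    Sentence(["how", "are", "you"], 1),
    Sentence(["Hello", "world"], 0),
]
print(fill_rows(batch, n_rows=2))  # [2, 3]
```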

## Input handling

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s', level=logging.DEBUG)

In [2]:
from fse import IndexedSentence
from fse import IndexedList
s = IndexedSentence(["Hello", "world"], 0)
print(s.words)
print(s.index)

2019-09-02 17:25:06,307 : MainThread : DEBUG : {'uri': '/Users/oliverborchers/anaconda3/envs/fsedev/lib/python3.7/site-packages/smart_open/VERSION', 'mode': 'r', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}
2019-09-02 17:25:06,741 : MainThread : INFO : 'pattern' package not found; tag filters are not available for English


['Hello', 'world']
0


The words of an IndexedSentence must always be a list of strings; otherwise the train method will raise an error. However, most input data is available as a list of strings, with one unsplit string per sentence.

In [3]:
sentences_a = ["Hello there", "how are you?"]
sentences_b = ["today is a good day", "Lorem ipsum"]

To deal with this common input format, fse provides the IndexedList, which handles all required data operations for you. You can provide multiple lists (or sets), which will all be merged into a single list. This makes it easier to work with the STS datasets. IndexedList will automatically split each string on whitespace if you don't provide a specific function to split on.

In [4]:
s = IndexedList(sentences_a, sentences_b)
print(len(s))
s[0]

4


IndexedSentence(words=['Hello', 'there'], index=0)

To save memory, the original lists are not converted in place. The conversion only takes place when you access an item (via the `__getitem__` method). To access the original data, call:

In [5]:
s.items

['Hello there', 'how are you?', 'today is a good day', 'Lorem ipsum']
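
The idea of converting on access can be sketched in a few lines. The class below is a toy illustration with hypothetical names (`LazyIndexedList`, `Sentence`), not fse's actual implementation: it merges the input lists, keeps the raw strings, and only splits a string when the item is requested.

```python
from collections import namedtuple

# Stand-in for fse's IndexedSentence.
Sentence = namedtuple("Sentence", ["words", "index"])

class LazyIndexedList:
    """Toy illustration: keep the raw strings, split only on access."""

    def __init__(self, *lists):
        # Merge all input lists into one flat list of raw strings.
        self.items = [text for lst in lists for text in lst]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        # The split happens here, on demand, so no second copy is stored.
        return Sentence(self.items[i].split(), i)

s = LazyIndexedList(["Hello there", "how are you?"], ["today is a good day"])
print(len(s))  # 3
print(s[1])    # Sentence(words=['how', 'are', 'you?'], index=1)
```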

If the data is already preprocessed as a list of lists, you can pass the argument pre_splitted=True.

In [6]:
sentences_splitted = ["Hello there".split(), "how are you?".split()]
s = IndexedList(sentences_splitted, pre_splitted=True)
print(len(s))
s[0]

2


IndexedSentence(words=['Hello', 'there'], index=0)

In case you want to provide your own splitting function, you can pass a callable to the split_func argument.

In [7]:
def split_func(string):
    return string.split()

s = IndexedList(sentences_a, split=False, split_func=split_func)
print(len(s))
s[0]

2


IndexedSentence(words=['Hello', 'there'], index=0)
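
A custom splitting function is mostly useful when plain whitespace splitting is too crude. As a hedged sketch, the hypothetical `normalize_split` below lowercases the text and keeps only runs of letters, digits and apostrophes. Any callable that maps a string to a list of strings should be usable in the same way:

```python
import re

# Matches runs of lowercase letters, digits and apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")

def normalize_split(string):
    """Lowercase the input and return its tokens without punctuation."""
    return TOKEN.findall(string.lower())

print(normalize_split("How are you?"))  # ['how', 'are', 'you']
```

It would be passed in the same way as split_func in the cell above.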

If you want to stream a file from disk (where each line corresponds to a sentence) you can use the IndexedLineDocument.

In [8]:
from fse import IndexedLineDocument
doc = IndexedLineDocument("../fse/test/test_data/test_sentences.txt")

2019-09-02 17:25:12,687 : MainThread : DEBUG : {'uri': PosixPath('../fse/test/test_data/test_sentences.txt'), 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}


In [9]:
i = 0
for s in doc:
    print(f"{s.index}\t{s.words}")
    i += 1
    if i == 4:
        break

2019-09-02 17:25:16,545 : MainThread : DEBUG : {'uri': PosixPath('../fse/test/test_data/test_sentences.txt'), 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}


0	['Good', 'stuff', 'i', 'just', 'wish', 'it', 'lasted', 'longer']
1	['Hp', 'makes', 'qualilty', 'stuff']
2	['I', 'like', 'it']
3	['Try', 'it', 'you', 'will', 'like', 'it']


If you later work with the similarity of sentences, the IndexedLineDocument lets you access each line by its index. This helps when inspecting similar sentences, as the most_similar method would otherwise only return indices.

In [10]:
doc[20]

2019-09-02 17:25:17,428 : MainThread : DEBUG : {'uri': PosixPath('../fse/test/test_data/test_sentences.txt'), 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}


'I feel like i just got screwed'
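
One plausible way to support this kind of access without loading the whole file is to record the byte offset of every line once and seek to it on demand. The sketch below illustrates that idea with a hypothetical `LineIndex` class; it is an assumption about the general technique, not a description of how IndexedLineDocument is implemented.

```python
import os
import tempfile

class LineIndex:
    """Toy line-addressable file: records the byte offset of every line once."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        with open(path, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __getitem__(self, i):
        # Jump straight to the requested line instead of scanning the file.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[i])
            return f.readline().decode("utf-8").rstrip("\r\n")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sentences.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Good stuff\nI like it\nTry it\n")
    lines = LineIndex(path)
    second = lines[1]

print(second)  # I like it
```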

## Training a model / Performing inference

Training an fse model is simple. You only need a pre-trained word embedding model, which you pass in when initializing the fse model you want to use.

In [11]:
import gensim.downloader as api
data = api.load("quora-duplicate-questions")
glove = api.load("glove-wiki-gigaword-100")

2019-09-02 17:25:20,478 : MainThread : INFO : loading projection weights from /Users/oliverborchers/gensim-data/glove-wiki-gigaword-100/glove-wiki-gigaword-100.gz
2019-09-02 17:25:20,480 : MainThread : DEBUG : {'uri': '/Users/oliverborchers/gensim-data/glove-wiki-gigaword-100/glove-wiki-gigaword-100.gz', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}
2019-09-02 17:26:08,015 : MainThread : INFO : loaded (400000, 100) matrix from /Users/oliverborchers/gensim-data/glove-wiki-gigaword-100/glove-wiki-gigaword-100.gz


In [12]:
sentences = []
for d in data:
    # Let's blow up the data a bit by replicating each sentence.
    for i in range(4):
        sentences.append(d["question1"])
        sentences.append(d["question2"])
s = IndexedList(sentences)
print(len(s))

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
2019-09-02 17:26:16,621 : MainThread : DEBUG : {'uri': '/Users/oliverborchers/gensim-data/quora-duplicate-questions/quora-duplicate-questions.gz', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': {}}


6468640


So we have 6468640 sentences we want to compute the embeddings for. If you import the FAST_VERSION variable as follows, you can verify that the compilation of the Cython routines worked correctly:

In [13]:
from fse.models.average import FAST_VERSION
FAST_VERSION
# 1 -> The fast version works

1

In [14]:
from fse.models import Average
model = Average(glove, workers=1)

In [15]:
model.train(s)

2019-09-02 17:26:43,709 : MainThread : INFO : scanning all indexed sentences and their word counts
2019-09-02 17:26:48,710 : MainThread : INFO : SCANNING : finished 1059350 sentences with 11700688 words
2019-09-02 17:26:53,710 : MainThread : INFO : SCANNING : finished 2160826 sentences with 23865035 words
2019-09-02 17:26:58,710 : MainThread : INFO : SCANNING : finished 3298528 sentences with 36461432 words
2019-09-02 17:27:03,710 : MainThread : INFO : SCANNING : finished 4407256 sentences with 48744308 words
2019-09-02 17:27:08,710 : MainThread : INFO : SCANNING : finished 5545789 sentences with 61305355 words
2019-09-02 17:27:12,738 : MainThread : INFO : finished scanning 6468640 sentences with an average length of 11 and 71556728 total words
2019-09-02 17:27:12,933 : MainThread : INFO : estimated memory for 6468640 sentences with 100 dimensions and 400000 vocabulary: 2621 MB (2 GB)
2019-09-02 17:27:12,934 : MainThread : INFO : initializing sentence vectors for 6468640 sentences
2019

(6468624, 49255184)

The model runs at around 160,000 sentences per second. At that rate, embedding all 6,468,640 sentences takes roughly 40 seconds.
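
For intuition, an averaging model of this kind represents a sentence as the mean of the vectors of its words. The following pure-Python sketch uses a toy two-dimensional vocabulary and assumes that out-of-vocabulary words are simply skipped. The function name `average_embedding` is illustrative; fse's own routines are compiled Cython code and will differ in detail.

```python
def average_embedding(words, vectors, dim):
    """Mean of the vectors of all in-vocabulary words; zeros if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over the vectors column by column.
    return [sum(col) / len(known) for col in zip(*known)]

# Toy word vectors standing in for a pre-trained embedding model.
toy_vectors = {
    "hello": [1.0, 0.0],
    "world": [0.0, 1.0],
}

print(average_embedding(["hello", "world", "unknownword"], toy_vectors, dim=2))  # [0.5, 0.5]
print(average_embedding([], toy_vectors, dim=2))  # [0.0, 0.0]
```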

Once the model is trained, you can perform additional inference for unseen sentences. This two-step process for new data is required because computing the principal components for models like SIF and uSIF requires a fair number of sentences. If you want the vector for a single sentence that was not part of the training data, just use:

In [16]:
tmp = IndexedSentence("Hello my friends".split(), 0)
model.infer([tmp])

2019-09-02 17:28:21,743 : MainThread : INFO : scanning all indexed sentences and their word counts
2019-09-02 17:28:21,745 : MainThread : INFO : finished scanning 1 sentences with an average length of 3 and 3 total words


array([[ 0.26518148,  0.026005  ,  0.418165  , -0.491575  , -0.4111695 ,
         0.75515497, -0.26521   ,  0.2275845 , -0.0826425 , -0.15927498,
         0.17639194,  0.132565  ,  0.46823   ,  0.26481998, -0.17778   ,
        -0.529775  , -0.04180501,  0.07661   , -0.59272003,  0.613505  ,
        -0.153675  ,  0.419775  , -0.16156301, -0.15667   ,  0.3594    ,
         0.4279275 , -0.524735  , -0.73454   ,  0.81944   ,  0.01679499,
         0.38288   ,  0.875635  ,  0.68189   ,  0.17048   , -0.19861001,
         0.562425  , -0.13079502,  0.300335  ,  0.42567   , -0.41034502,
        -0.168975  , -0.00744   ,  0.624575  , -0.54095   , -0.45428002,
        -0.082555  , -0.503395  ,  0.07380998,  0.41128498, -0.93290997,
        -0.12029   , -0.16773999,  0.113726  ,  0.938125  , -0.0205    ,
        -2.2649999 ,  0.3064845 ,  0.17881   ,  1.257695  ,  0.322062  ,
         0.3309515 ,  1.6305    , -0.26051632, -0.17654894,  0.63209003,
         0.11346   ,  0.72634   ,  0.35617   ,  0.2

## Querying the model

To query the model or perform similarity computations, we can access the model.sv (sentence vectors) object and use its methods. To get the vector for an index, call:

In [17]:
model.sv[0]

array([-1.9212671e-02,  5.5341166e-02,  3.0522481e-01,  1.0868758e-01,
        2.0464543e-01, -6.3812248e-02, -2.9070374e-01,  4.0029332e-02,
       -1.5944116e-01,  1.9179760e-02, -4.7995415e-02, -4.4555258e-02,
        1.2083842e-01, -1.5988165e-01, -9.1990761e-02, -1.7833400e-01,
        1.9876814e-01,  7.4273467e-02, -2.3242335e-01,  1.5821201e-01,
        2.5478417e-01, -3.5725668e-02,  1.6262001e-01,  2.9798460e-01,
        8.6748540e-02, -1.6531321e-01,  4.8800584e-02, -3.4739166e-01,
       -4.4252168e-02, -8.5781654e-03, -2.6755494e-01,  4.6747109e-01,
       -1.0570883e-01, -4.8522335e-02,  6.5135837e-02,  3.6621445e-01,
        1.7579168e-01,  1.9990325e-01, -1.2957168e-01, -3.0050886e-01,
       -3.6363500e-01, -2.3632818e-01,  1.4003392e-03, -3.1585187e-01,
       -1.7299536e-01,  9.5002741e-02,  1.1902635e-01, -2.7074313e-01,
       -1.7854732e-01, -9.2564583e-01, -1.7807779e-01, -2.0552608e-03,
        2.2348361e-01,  1.1666777e+00, -3.9630362e-01, -2.5325801e+00,
      

To compute the similarity or distance between two sentences from the training set, you can call:

In [18]:
print(model.sv.similarity(0,1).round(3))
print(model.sv.distance(0,1).round(3))

0.993
0.007
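
The two printed values add up to 1, which is what you would expect if the distance is defined as one minus the cosine similarity. Assuming that definition, the computation can be sketched in plain Python (the helper names are illustrative, not fse functions):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

u = [1.0, 0.0]
v = [1.0, 1.0]
print(round(cosine_similarity(u, v), 3))  # 0.707
print(round(cosine_distance(u, v), 3))    # 0.293
```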


We can also ask for the most similar sentences given an index. For example, to find the sentences most similar to the sentence at index 100:

In [19]:
print(s[100])

IndexedSentence(['Should', 'I', 'buy', 'tiago?'], 100)


In [20]:
model.sv.most_similar(100)
# Division by zero can happen if you encounter empty sentences

2019-09-02 17:28:25,437 : MainThread : INFO : precomputing L2-norms of sentence vectors
  return (m / dist).astype(REAL)
  return (m / dist).astype(REAL)


[(754564, 1.0),
 (754562, 1.0),
 (1836814, 1.0),
 (1836815, 1.0),
 (754574, 1.0),
 (754572, 1.0),
 (754570, 1.0),
 (754568, 1.0),
 (754566, 1.0),
 (1836812, 1.0)]

However, the preceding call only returns the indices of the most similar sentences. You can get the sentences themselves by passing an indexable object, such as the original list of sentences, to the indexable argument of most_similar:

In [21]:
model.sv.most_similar(100, indexable=sentences)

[('Should Google buy Twitter?', 754564, 1.0),
 ('Should Google buy Twitter?', 754562, 1.0),
 ("Why doesn't Google buy Quora?", 1836814, 1.0),
 ("Why doesn't Facebook buy Quora?", 1836815, 1.0),
 ('Should Google buy Twitter?', 754574, 1.0),
 ('Should Google buy Twitter?', 754572, 1.0),
 ('Should Google buy Twitter?', 754570, 1.0),
 ('Should Google buy Twitter?', 754568, 1.0),
 ('Should Google buy Twitter?', 754566, 1.0),
 ("Why doesn't Google buy Quora?", 1836812, 1.0)]

There we go. This is a lot more understandable than the initial list of indices.
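
Conceptually, such a lookup ranks all stored vectors by their similarity to the query and then uses each index to fetch the matching text. The toy function below sketches that idea under the assumption of cosine similarity. It also skips all-zero vectors, which is where a division by zero for empty sentences would otherwise come from. It is an illustrative stand-in, not fse's implementation.

```python
import math

def most_similar(query_idx, vectors, topn=2, indexable=None):
    """Rank vectors by cosine similarity to vectors[query_idx]."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    q = vectors[query_idx]
    q_norm = norm(q)
    scored = []
    for idx, vec in enumerate(vectors):
        denom = q_norm * norm(vec)
        if idx == query_idx or denom == 0.0:
            # Skip the query itself and empty (all-zero) vectors,
            # which would otherwise cause a division by zero.
            continue
        sim = sum(x * y for x, y in zip(q, vec)) / denom
        scored.append((idx, sim))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:topn]
    if indexable is None:
        return top
    # Use each index to look up the original text.
    return [(indexable[idx], idx, sim) for idx, sim in top]

texts = ["buy a car", "buy a bike", "", "sing a song"]
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 0.0], [0.0, 1.0]]

result = most_similar(0, vecs, topn=2, indexable=texts)
print([(text, idx) for text, idx, _ in result])
# [('buy a bike', 1), ('sing a song', 3)]
```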

To search for sentences that are similar to a given word vector, you can call:

In [22]:
model.sv.similar_by_word("easy", wv=glove, indexable=sentences)

[('How do I make easy money?', 842126, 0.8764350414276123),
 ('How do I make easy money?', 5387888, 0.8764350414276123),
 ('How do I make easy money?', 2113542, 0.8764350414276123),
 ('How do I make easy money?', 2113540, 0.8764350414276123),
 ('How do I make easy money?', 2113538, 0.8764350414276123),
 ('How do I make easy money?', 2113536, 0.8764350414276123),
 ('How do I make easy money?', 842124, 0.8764350414276123),
 ('How do I make easy money?', 2113546, 0.8764350414276123),
 ('How do I make easy money?', 5387894, 0.8764350414276123),
 ('How do I make easy money?', 5387896, 0.8764350414276123)]

Furthermore, you can query for unknown (or new) sentences by calling:

In [23]:
model.sv.similar_by_sentence("Is this really easy to learn".split(), model=model, indexable=sentences)

2019-09-02 17:28:46,689 : MainThread : INFO : scanning all indexed sentences and their word counts
2019-09-02 17:28:46,691 : MainThread : INFO : finished scanning 1 sentences with an average length of 6 and 6 total words


[('Is it easy to learn Hebrew if you learn Arabic?',
  4666697,
  0.9844098091125488),
 ('Is it easy to learn Arabic if you learn Hebrew first?',
  4666696,
  0.9844098091125488),
 ('Is it easy to learn Arabic if you learn Hebrew first?',
  4666702,
  0.9844098091125488),
 ('Is it easy to learn Hebrew if you learn Arabic?',
  4666701,
  0.9844098091125488),
 ('Is it easy to learn Arabic if you learn Hebrew first?',
  4666700,
  0.9844098091125488),
 ('Is it easy to learn Hebrew if you learn Arabic?',
  4666699,
  0.9844098091125488),
 ('Is it easy to learn Arabic if you learn Hebrew first?',
  4666698,
  0.9844098091125488),
 ('Is it easy to learn Hebrew if you learn Arabic?',
  4666691,
  0.9844098091125488),
 ('Is it easy to learn Arabic if you learn Hebrew first?',
  4666690,
  0.9844098091125488),
 ('Is it easy to learn Hebrew if you learn Arabic?',
  4666695,
  0.9844098091125488)]

Feel free to browse through the library and get to know the functions a little better!