# Faster co-occurrence vectorization

The goal of this notebook is to write a faster implementation of the co-occurrence matrix vectorizer.
To begin, let's write a toy tokenizer.

In [1]:
tokenizer = lambda txt: txt.split(' ')

tokenizer("Say hello to faster vectorisation !")

['Say', 'hello', 'to', 'faster', 'vectorisation', '!']

Now we will import a larger text...

In [2]:
from urllib.request import urlopen

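# For demonstration only: the text is not cleaned (it still contains the copyright notice...).
# Note: str() on bytes yields the b'...' repr, so escaped newlines stay inside the tokens.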
test_tokens = tokenizer(str(urlopen('http://www.gutenberg.org/cache/epub/1777/pg1777.txt').read()))

Let's run the new vectorizer under cProfile.

In [3]:
import cProfile

cProfile.run("""
from scipy import sparse as S
import numpy as np

vocabulary = sorted(set(test_tokens))
len_vocabulary = len(vocabulary)

voc_indices = [vocabulary.index(word) for word in test_tokens]
len_document = len(test_tokens)
cooc_matrix = S.lil_matrix((len_vocabulary, len_vocabulary))

window_size = 5

for word_pos, row_index in enumerate(voc_indices):
    window = voc_indices[max(0, word_pos - window_size) : word_pos] +\
             voc_indices[word_pos+1 : min(len_document+1, word_pos+1+window_size)]
    
    for col_index in window:
        cooc_matrix[row_index, col_index] += 1    
""")

         6458352 function calls (6456243 primitive calls) in 6.546 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        8    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1043(__import__)
      372    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:119(release)
      168    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:159(__init__)
      168    0.000    0.000    0.003    0.000 <frozen importlib._bootstrap>:163(__enter__)
      168    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:170(__exit__)
      372    0.001    0.000    0.002    0.000 <frozen importlib._bootstrap>:176(_get_module_lock)
      171    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:190(cb)
      204    0.001    0.000    0.002    0.000 <frozen importlib._bootstrap>:195(_lock_unlock_module)
    204/2    0.000    0.000    0.135    0.068 <frozen importlib._bootstrap>:214(_call_

This is 96% faster than the current implementation in Kadot!
It also has a very small memory footprint, so we can run it on a large corpus without exhausting the machine's RAM.
To take full advantage of this, we should implement a new VectorDict in Kadot that internally stores vectors as a SciPy LIL matrix.
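
As a minimal sketch, the loop from the profiled cell can be wrapped in a reusable function. The name `cooccurrence_matrix` and the `word_to_index` dict are assumptions made for illustration and are not part of Kadot. The dict lookup is a swapped-in technique: it replaces `vocabulary.index(word)`, which scans the vocabulary list for every token.

```python
from scipy import sparse as S


def cooccurrence_matrix(tokens, window_size=5):
    """Hypothetical helper (not part of Kadot): windowed co-occurrence counts."""
    vocabulary = sorted(set(tokens))
    # Dict lookup instead of vocabulary.index(word), which scans the list each time.
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    voc_indices = [word_to_index[word] for word in tokens]

    cooc = S.lil_matrix((len(vocabulary), len(vocabulary)))
    for word_pos, row_index in enumerate(voc_indices):
        # Neighbours up to window_size positions before and after the current word.
        window = (voc_indices[max(0, word_pos - window_size):word_pos]
                  + voc_indices[word_pos + 1:word_pos + 1 + window_size])
        for col_index in window:
            cooc[row_index, col_index] += 1
    return vocabulary, cooc


# Toy usage: with a window of 1, "a" and "b" are neighbours twice.
vocabulary, cooc = cooccurrence_matrix("a b a c".split(' '), window_size=1)
print(vocabulary)
print(cooc.toarray())
```

On the toy input the matrix is symmetric, because every word that sees a neighbour in its window is also seen by that neighbour.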

In [4]:
cooc_matrix.shape

(7768, 7768)
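
To put that shape in perspective, here is a rough sketch of what a dense `float64` array of the same size would cost. The number of non-zero entries in the notebook's matrix is not shown above, so the `toy` matrix below is only an assumption used to illustrate the `nnz` attribute.

```python
import numpy as np
from scipy import sparse as S

n = 7768  # vocabulary size reported by cooc_matrix.shape

# A dense array stores every cell, including the zeros.
dense_bytes = n * n * np.dtype(np.float64).itemsize
print(f"dense float64: {dense_bytes / 1e6:.0f} MB")  # dense float64: 483 MB

# A LIL matrix only stores the cells that were set; nnz counts them.
toy = S.lil_matrix((n, n))
toy[0, 1] = 1
print(toy.nnz)  # 1
```

Running `cooc_matrix.nnz` on the real matrix would give the number of stored entries to compare against the 7768 × 7768 cells of the dense layout.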