### Note: This tutorial is a far cry from what it will eventually be. For now, it is just a very quick overview.

The vectors used in this tutorial can be downloaded here: https://bit.ly/2XUcNWU

The vectors were trained on 1 million MEDLINE abstracts (abstracts of medical scientific articles). Each vector has 500 dimensions, and I used seeds=20 for training.

In [13]:
import sys
sys.path.append('../')

from random_indexing.query import QueryVectors
from utils import text_utils
from utils import vector_utils

In [2]:
path = "F:\\github_projects\\data\\embeddings\\medline_sentences\\ri_demo\\"

In [7]:
ri = QueryVectors(path + 'ri/term_index_ri')
drri = QueryVectors(path + 'drri/term_index_drri')
trri = QueryVectors(path + 'trri/term_index_trri')
window = QueryVectors(path + 'window/term_index_window')

# What is random projection?

I will be using the phrase 'random projection' a lot, so what is it? There is a lot of very interesting mathematical theory behind it, but the procedure itself is very simple. For our purposes, you select the desired dimensionality (dim) and the number of seeds. For a given term or document, you initialize a zero vector of length dim. You then randomly select N elements of the vector, where N is the number of seeds, and set each of them to +1 or -1 at random. Congrats! You have just randomly projected!

This works because vectors generated this way are nearly orthogonal to each other. In other words, each word or document gets a (nearly) unique identifier in the lower-dimensional space.
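The near-orthogonality claim can be checked empirically. Below is a minimal sketch using NumPy; `random_index_vector` and `cosine` are hypothetical helpers written for this illustration and are not the repository's `vector_utils` functions. It generates 200 sparse vectors with dim=500 and seeds=20 and measures how close their pairwise cosines are to zero.

```python
import numpy as np

def random_index_vector(dim, seeds, rng):
    """Hypothetical helper: zero vector with `seeds` random positions set to +1 or -1."""
    vec = np.zeros(dim)
    positions = rng.choice(dim, size=seeds, replace=False)  # distinct positions
    vec[positions] = rng.choice([-1.0, 1.0], size=seeds)    # random signs
    return vec

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
vectors = [random_index_vector(500, 20, rng) for _ in range(200)]

# Cosine similarity for every distinct pair of vectors
sims = [cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))]

# A mean absolute cosine close to 0 means the vectors are nearly orthogonal
mean_abs_cosine = float(np.mean(np.abs(sims)))
print(mean_abs_cosine)
```

Two such vectors only interact where their non-zero positions happen to coincide, which is rare when seeds is small relative to dim.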

For demonstration purposes, this is what a vector looks like when initialized with random projection.

In [15]:
vectors = vector_utils.initialize_vectors_random_projection(['fish'], dim=500, seeds=20)
print(vectors['fish'])

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  0.
  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  1.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  0.  0.  0.
  0.  1.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0

# Example queries with random indexing

Random Indexing is the vanilla version that was originally proposed as an alternative to LSA. It is an incredibly fast method for generating vectors and is highly parallelizable. Vanilla Random Indexing uses the document as context: random projection maps each document to a fixed-dimensional space (500 dimensions in this case), which more or less gives each document a unique ID. During training, for each term in a document, you add the document vector to the term vector.
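The training step described above can be sketched in a few lines. This is an illustration only, not the repository's implementation: `train_random_indexing` is a hypothetical name, tokenization is a plain lowercase whitespace split, and no term weighting is applied.

```python
import numpy as np

def train_random_indexing(documents, dim=500, seeds=20, rng_seed=0):
    """Hypothetical sketch of document-based Random Indexing."""
    rng = np.random.default_rng(rng_seed)
    term_vectors = {}
    for doc in documents:
        # Random projection: a sparse +1/-1 vector acts as the document's ID
        doc_vector = np.zeros(dim)
        positions = rng.choice(dim, size=seeds, replace=False)
        doc_vector[positions] = rng.choice([-1.0, 1.0], size=seeds)
        # Add the document vector to the vector of every term occurrence
        for term in doc.lower().split():
            if term not in term_vectors:
                term_vectors[term] = np.zeros(dim)
            term_vectors[term] += doc_vector
    return term_vectors

docs = ["mother and infant", "mother and father", "fish oil"]
term_vectors = train_random_indexing(docs)
print(term_vectors["mother"].shape)
```

Terms that occur in the same documents accumulate the same document IDs, so their vectors end up pointing in similar directions.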

The results are 'ok.' To be fair, 1 million documents is not a lot and I also have not implemented term weighting at this point.

In [4]:
ri.get_similar('mother')

[('mother', 1.0),
 ('father', 0.25328775133169823),
 ('maternal', 0.22938290392806948),
 ('mothers', 0.2206636706180367),
 ('infant', 0.19130721074871948),
 ('alternated', 0.18989466060845117),
 ('ppar-α', 0.18852798235854762),
 ('ln229', 0.1865784950289182),
 ('metapopulations', 0.18180235370409403),
 ('mgl', 0.178919612315935)]

In [5]:
ri.get_similar('ulcerative colitis')

[('ulcerative', 0.8587647045774129),
 ('colitis', 0.8587647045774129),
 ('uc', 0.2458510566665434),
 ('crohn', 0.23734590798881516),
 ('dss', 0.19987876825718698),
 ('interfere', 0.17927755192195305),
 ('brew', 0.1753638280767027),
 ('8-oh-dg', 0.17291678480842132),
 ('cbr', 0.1666171450700007),
 ('ibd', 0.16650605213170122)]
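The scores above look like cosine similarities (a term scores 1.0 against itself), and the two words of a multi-word query receive identical scores. One reading consistent with both observations is that the query vector is the sum of the unit-normalized term vectors. The sketch below makes that assumption explicit; `most_similar` and `unit` are hypothetical helpers, not the `QueryVectors.get_similar` implementation, and the toy vectors are made up for illustration.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def most_similar(query, term_vectors, topn=10):
    """Hypothetical sketch: rank terms by cosine similarity to the query."""
    known = [t for t in query.lower().split() if t in term_vectors]
    if not known:
        return []
    # Assumption: a multi-word query is the sum of its unit-length term vectors
    query_vector = unit(sum(unit(term_vectors[t]) for t in known))
    scores = [(term, float(unit(vec) @ query_vector))
              for term, vec in term_vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, invented for this example only
toy = {
    "ulcerative": np.array([1.0, 0.2, 0.0]),
    "colitis": np.array([0.5, 2.0, 0.0]),
    "fish": np.array([0.0, 0.0, 3.0]),
}
results = most_similar("ulcerative colitis", toy)
for term, score in results:
    print(term, score)
```

Under this assumption each query word contributes equally regardless of how often it occurred in training, which is why both words tie at the top of the list.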

# Example queries with random indexing with sliding window

It may not always be advantageous to treat the entire document as context, as we did previously. For example, a single document may cover several different topics. Here I use a variant of Random Indexing that uses a sliding window approach. I trained this particular model with a window of 5.

Let's look at the context for the following sentence: Fish oil has been shown to reduce blood pressure.

In [10]:
sent = 'Fish oil has been shown to reduce blood pressure.'

contexts = text_utils.create_context_training(sent, 5, set(sent.split()))

for context in contexts:
    print(context)

['Fish', ['oil', 'reduce', 'blood', 'pressure.']]
['oil', ['Fish', 'reduce', 'blood', 'pressure.']]
['reduce', ['Fish', 'oil', 'blood', 'pressure.']]
['blood', ['Fish', 'oil', 'reduce', 'pressure.']]
['pressure.', ['Fish', 'oil', 'reduce', 'blood']]


In training, each term is mapped to a lower-dimensional space using random projection. For each document, you break the document into a set of contexts as above. For ['Fish', ['oil', 'reduce', 'blood', 'pressure.']], 'Fish' is the target term. You obtain the vector for 'Fish', and then add the vectors for ['oil', 'reduce', 'blood', 'pressure.'] to it.
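The sliding-window training loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's code: `window_contexts` and `train_window` are hypothetical names, the window is taken as up to `window` tokens on each side of the target, the input is assumed to be already filtered and tokenized by whitespace, and the vectors added to the target are the neighbours' fixed random index vectors.

```python
import numpy as np

def window_contexts(tokens, window):
    """Yield (target, neighbours) with up to `window` tokens on each side."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def train_window(sentences, dim=500, seeds=20, window=5, rng_seed=0):
    """Hypothetical sketch of sliding-window Random Indexing."""
    rng = np.random.default_rng(rng_seed)
    tokenized = [s.lower().split() for s in sentences]
    index_vectors = {}  # fixed random projection per term
    term_vectors = {}   # accumulated context per term
    for tokens in tokenized:
        for term in tokens:
            if term not in index_vectors:
                vec = np.zeros(dim)
                positions = rng.choice(dim, size=seeds, replace=False)
                vec[positions] = rng.choice([-1.0, 1.0], size=seeds)
                index_vectors[term] = vec
                term_vectors[term] = np.zeros(dim)
    for tokens in tokenized:
        for target, neighbours in window_contexts(tokens, window):
            # Add each neighbour's index vector to the target's term vector
            for neighbour in neighbours:
                term_vectors[target] += index_vectors[neighbour]
    return index_vectors, term_vectors

index_vectors, term_vectors = train_window(["fish oil reduce blood pressure"])
print(np.count_nonzero(term_vectors["fish"]))
```

Keeping the index vectors fixed and accumulating into separate term vectors means the result does not depend on the order in which contexts are processed.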

Below are the results for sliding-window Random Indexing with the same queries. They look much better!

In [11]:
window.get_similar('mother')

[('mother', 1.0),
 ('child', 0.7288195466566987),
 ('dyads', 0.7258266563285336),
 ('parent', 0.7247515408407812),
 ('infant', 0.7116490350809898),
 ('maternal', 0.6991020961555843),
 ('baby', 0.6576233255853627),
 ('mothers', 0.6455671590380575),
 ('father', 0.6454699859477869),
 ('parental', 0.6447839830193891)]

In [12]:
window.get_similar('ulcerative colitis')

[('ulcerative', 0.8930706544287557),
 ('colitis', 0.8930706544287557),
 ('uc', 0.7707548551366463),
 ('crohn', 0.7365325448157829),
 ('ileitis', 0.7200394152702355),
 ('pancolitis', 0.6776322028485896),
 ('ibd', 0.6659753898580144),
 ('fistulizing', 0.6397353252313589),
 ('dss', 0.6173235312895747),
 ('pseudomembranous', 0.6033841113219148)]

# Examples of Reflective Random Indexing

Around 2010 it was discovered that you can do multiple iterations of training. This was dubbed Reflective Random Indexing. We will look at two examples: document-based Reflective Random Indexing (DRRI) and term-based Reflective Random Indexing (TRRI). It should be noted that these approaches generate document vectors, but I am currently not saving them. I will need to include some sort of approximate nearest neighbors search to make these usable.

TRRI and DRRI are very similar; the difference really boils down to the order of operations during training. In TRRI, you perform random projection on the terms and then add the term vectors to the document vector. Next you perform the reflective step, where you add the document vector to the vector of each term in the document. DRRI is essentially the same, but you begin by applying random projection to the documents.
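One TRRI cycle, as described above, can be sketched like this. It is a minimal illustration with hypothetical names (`train_trri`), plain whitespace tokenization, and no normalization or term weighting; the actual training code may differ in all of these.

```python
import numpy as np

def train_trri(documents, dim=500, seeds=20, rng_seed=0):
    """Hypothetical sketch of one cycle of term-based Reflective Random Indexing."""
    rng = np.random.default_rng(rng_seed)
    tokenized = [d.lower().split() for d in documents]

    # Step 1: random projection on the terms
    index_vectors = {}
    for tokens in tokenized:
        for term in tokens:
            if term not in index_vectors:
                vec = np.zeros(dim)
                positions = rng.choice(dim, size=seeds, replace=False)
                vec[positions] = rng.choice([-1.0, 1.0], size=seeds)
                index_vectors[term] = vec

    # Step 2: each document vector is the sum of its terms' index vectors
    doc_vectors = []
    for tokens in tokenized:
        doc_vector = np.zeros(dim)
        for term in tokens:
            doc_vector += index_vectors[term]
        doc_vectors.append(doc_vector)

    # Step 3 (reflection): add each document vector to the terms it contains
    term_vectors = {term: np.zeros(dim) for term in index_vectors}
    for tokens, doc_vector in zip(tokenized, doc_vectors):
        for term in tokens:
            term_vectors[term] += doc_vector
    return doc_vectors, term_vectors

doc_vectors, term_vectors = train_trri(["mother infant", "mother father"])
print(term_vectors["mother"].shape)
```

Under the same assumptions, a DRRI sketch would start from random vectors for the documents instead, build term vectors from them, and then continue with steps 2 and 3 using those term vectors.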

Reflective Random Indexing opened up the potential to apply these approaches in very creative ways. For example, you can associate text with labels, text with authors, authors with citations, and so forth. Unfortunately, you are limited in how many times you can reflect: eventually the vectors blur together and you end up with a mud puddle.

Below are the same queries performed using TRRI and DRRI.


In [20]:
trri.get_similar('mother')

[('mother', 1.0),
 ('mothers', 0.7366709382283277),
 ('father', 0.7327814148389312),
 ('maternal', 0.7209406519691841),
 ('infant', 0.7007631444239568),
 ('child', 0.6776361501810731),
 ('dyads', 0.6702282412682318),
 ('baby', 0.640874762678684),
 ('neurodevelopment', 0.6278331319647544),
 ('toddler', 0.6219275004574695)]

In [21]:
drri.get_similar('mother')

[('mother', 1.0),
 ('father', 0.9688162255507828),
 ('infant', 0.9677965946881427),
 ('maternal', 0.9667039576985549),
 ('mothers', 0.9658348463525763),
 ('baby', 0.9603432202002335),
 ('prenatal', 0.9565578854048025),
 ('infancy', 0.9558384543571197),
 ('born', 0.9554195311172776),
 ('dyads', 0.9521744465013378)]

In [22]:
trri.get_similar('ulcerative colitis')

[('ulcerative', 0.9720778576004048),
 ('colitis', 0.9720778576004047),
 ('crohn', 0.8131168553627316),
 ('pancolitis', 0.8089453682144533),
 ('ibd', 0.7333352071881086),
 ('uc', 0.7179428268740302),
 ('aminosalicylates', 0.7058131722863821),
 ('ibds', 0.695012002047562),
 ('disease', 0.6869420642449156),
 ('bowel', 0.6797990237865235)]

In [23]:
drri.get_similar('ulcerative colitis')

[('colitis', 0.9933332772974606),
 ('ulcerative', 0.9933332772974606),
 ('uc', 0.9604458048226887),
 ('crohn', 0.9581159673318018),
 ('ibds', 0.9504978128969214),
 ('ibd', 0.9492115569155353),
 ('pancolitis', 0.9480324919325203),
 ('dss', 0.9427591135066176),
 ('colonic', 0.9427114578695646),
 ('ileitis', 0.9390434791628233)]