# Latent Vector Exploration

The VDSH model learns a latent vector representation for each document. These vectors are ultimately used to hash documents, but they are also useful on their own for similarity search. Checking that nearest neighbors in latent space are topically related is also a good sanity check that the network is learning.
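
To make the hashing step concrete, here is a minimal sketch of one common way to turn real-valued latent vectors into binary codes: threshold each dimension at its median over the corpus. This is an assumption for illustration, not necessarily what this repository's hashing code does; `binarize_by_median`, `hamming_distance`, and `toy_vectors` are hypothetical names, and the random array stands in for real latent vectors.

```python
import numpy as np

def binarize_by_median(vectors):
    """Turn real-valued latent vectors into 0/1 codes, one bit per dimension."""
    # Assumption: each dimension is thresholded at its median over all documents,
    # so roughly half of the documents get a 1 in every bit position.
    thresholds = np.median(vectors, axis=0)
    return (vectors > thresholds).astype(np.uint8)

def hamming_distance(code_a, code_b):
    """Count the bit positions where two binary codes differ."""
    return int(np.count_nonzero(code_a != code_b))

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(6, 4)).astype(np.float32)  # stand-in for latent vectors
codes = binarize_by_median(toy_vectors)
```

Similar documents should then end up with a small Hamming distance between their codes, which is the binary counterpart of the L2 search used in the rest of this notebook.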

In this notebook, I pass each document through the trained encoder to get its latent vector, and then I use [FAISS](https://github.com/facebookresearch/faiss) for similarity search. FAISS lets us query for nearest neighbors among the nearly 300k documents almost instantaneously.

In [1]:
import faiss
import numpy as np
from gensim.corpora import Dictionary
from src.utils.corpus import Corpus
from src.utils.tfidf import generate_tfidf
from src.models.vdsh import VDSH

Using TensorFlow backend.


In [2]:
corpus = Corpus()
dictionary = Dictionary(corpus.debates.bag_of_words)
dictionary.filter_extremes(no_below=100)
dictionary.compactify()
X = generate_tfidf(corpus.debates, dictionary)
vdsh = VDSH()
vdsh.build_model(X.shape[1])
vdsh.load_weights('vdsh.hdf5')
latent_vectors = vdsh.encoder_predict(X)

In [3]:
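# FAISS expects C-contiguous float32 arrays, so cast defensively before indexing.
latent_vectors = np.ascontiguousarray(latent_vectors, dtype=np.float32)
index = faiss.IndexFlatL2(latent_vectors.shape[1])  # exact (brute-force) L2 search
index.add(latent_vectors)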
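
For intuition about what a flat L2 index computes, the following is a small NumPy-only sketch of exact nearest-neighbor search under squared L2 distance. It is not how FAISS is implemented internally (FAISS is far more optimized); `brute_force_l2_search`, `toy_db`, `toy_D`, and `toy_I` are hypothetical names, and the random array is toy data rather than the real latent vectors.

```python
import numpy as np

def brute_force_l2_search(database, queries, k):
    """Exact k-nearest-neighbor search under squared L2 distance."""
    # Pairwise differences, shape (n_queries, n_database, dim).
    diffs = queries[:, None, :] - database[None, :, :]
    # Squared L2 distances, shape (n_queries, n_database).
    sq_dists = np.einsum('qnd,qnd->qn', diffs, diffs)
    # Indices of the k smallest distances for each query, nearest first.
    nearest = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, nearest, axis=1), nearest

rng = np.random.default_rng(0)
toy_db = rng.normal(size=(100, 8)).astype(np.float32)  # toy stand-in data
# Query with row 42 of the database itself, keeping the (1, dim) shape.
toy_D, toy_I = brute_force_l2_search(toy_db, toy_db[[42]], k=5)
```

Because the query is itself a row of the database, its own index comes back first with distance zero. The same thing happens in the real query in the next section, which is why the target is skipped when printing neighbors.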

## Querying

In [4]:
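target = 286995  # row index of the query paragraph
k = 10  # number of results to return, including the query itself
# D holds squared L2 distances and I the matching row indices; both have shape (1, k).
D, I = index.search(latent_vectors[target].reshape((1, -1)), k)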
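
The search returns distances and indices as parallel arrays. As a hedged sketch, a small helper can pair them up and drop the query's own entry; `neighbors_excluding_query` and the `example_*` arrays are hypothetical and only mimic the shape of the `D` and `I` arrays above.

```python
import numpy as np

def neighbors_excluding_query(distances, indices, query_id):
    """Return (index, distance) pairs for one result row, dropping the query itself."""
    return [(int(i), float(d))
            for i, d in zip(indices, distances)
            if i != query_id and i != -1]  # FAISS pads missing results with -1

# Made-up results for a query with id 7: the query itself, then two neighbors.
example_D = np.array([[0.0, 0.5, 0.9]], dtype=np.float32)
example_I = np.array([[7, 3, 12]], dtype=np.int64)
pairs = neighbors_excluding_query(example_D[0], example_I[0], query_id=7)
```

With the real results this would be called as `neighbors_excluding_query(D[0], I[0], target)`, which keeps each neighbor's distance available alongside its text.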

In [7]:
def print_paragraph(i):
    doc = corpus.debates.iloc[i]
    print(doc.country_name, doc.year)
    print(doc.text)
    print('\n\n')

In [8]:
print_paragraph(target)
print('Nearest Neighbors:\n')
for i in I[0]:
    # The query is its own nearest neighbor (distance 0), so skip it.
    if i == target:
        continue
    print_paragraph(i)

United States Of America 2015
I have said before and I will repeat: there is no room to accommodate an apocalyptic cult like the Islamic State in Iraq and the Levant (ISIL), and the United States makes no apologies for using our military, as part of a broad coalition, to go after them. We do so with a determination to ensure that there will never be a safe haven for terrorists who carry out these crimes. We have demonstrated over more than a decade of relentless pursuit of Al-Qaida that we will not be outlasted by extremists.




Nearest Neighbors:

Czech Republic 2015
The second illusion is that we can reduce terrorist organizations to the so-called Islamic State only. But there are many other terrorist organizations — for instance, Al-Qaida, the Taliban, Al-Nusra, Boko Haram and others. Two outstanding politicians from the Arab world told me that the cover organization is the Muslim Brotherhood. If so, there is a terrorism network, and that network cannot be reduced simply to the Isl