# Latent Vector Exploration

The VDSH model learns a latent vector representation for each document. These vectors are later used to hash documents, but they are also useful on their own for similarity search. Inspecting a document's nearest neighbors is a good sanity check that the network is learning.
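
To make the link between latent vectors and hash codes concrete, here is a minimal sketch of one common way to binarize real-valued vectors: threshold each dimension at its median. This is an assumption for illustration and may differ from how this project's VDSH code produces its hashes; `binarize_median`, `hamming_distance`, and the toy data are hypothetical names that appear nowhere else in the notebook.

```python
import numpy as np

def binarize_median(latent):
    """Turn real-valued latent vectors into binary codes, one bit per dimension."""
    # Assumption: threshold each dimension at its median over all documents,
    # so about half of the documents get a 1 in every bit position.
    thresholds = np.median(latent, axis=0)
    return (latent > thresholds).astype(np.uint8)

def hamming_distance(code_a, code_b):
    """Count the bit positions where two binary codes differ."""
    return int(np.count_nonzero(code_a != code_b))

# Toy stand-in for the encoder output: 4 documents, 2 latent dimensions.
toy_latent = np.array(
    [[0.1, 2.0], [0.5, -1.0], [-0.3, 0.0], [0.9, 1.0]], dtype=np.float32
)
toy_codes = binarize_median(toy_latent)
```

Documents whose codes are close in Hamming distance are treated as similar, which is the discrete counterpart of the L2 search done on the raw vectors in the rest of this notebook.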

In this notebook, I pass each document through the trained encoder to get its latent vector, and then I use [FAISS](https://github.com/facebookresearch/faiss) for similarity search. FAISS lets us query for nearest neighbors among the nearly 300k documents almost instantaneously.

In [None]:
import faiss
import numpy as np
from gensim.corpora import Dictionary
from src.utils.corpus import Corpus
from src.utils.tfidf import generate_tfidf
from src.models.vdsh import VDSH

In [None]:
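# Build the TF-IDF matrix that is fed to the encoder.
corpus = Corpus()
dictionary = Dictionary(corpus.debates.bag_of_words)
dictionary.filter_extremes(no_below=100)  # drop tokens that appear in fewer than 100 documents
dictionary.compactify()
X = generate_tfidf(corpus.debates, dictionary)

# Restore the trained VDSH model and encode every document.
vdsh = VDSH()
vdsh.build_model(X.shape[1])
vdsh.load_weights('vdsh.hdf5')
# FAISS expects a C-contiguous float32 array.
latent_vectors = np.ascontiguousarray(vdsh.encoder_predict(X), dtype=np.float32)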

In [None]:
# Exact (brute-force) L2 index whose dimensionality matches the latent vectors.
index = faiss.IndexFlatL2(latent_vectors.shape[1])
index.add(latent_vectors)
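
`IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. As a rough illustration of what that computes, here is a NumPy-only sketch of the same search. The function name `brute_force_l2_search` and the toy data are made up for this example, and the sketch makes no attempt to match FAISS's speed or memory behavior.

```python
import numpy as np

def brute_force_l2_search(database, queries, k):
    """Exact k-nearest-neighbor search under squared L2 distance."""
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2, evaluated for all pairs at once.
    q_norms = (queries ** 2).sum(axis=1, keepdims=True)
    x_norms = (database ** 2).sum(axis=1)
    sq_dists = q_norms - 2.0 * queries @ database.T + x_norms
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by rounding
    # Sort each row and keep the k closest database rows.
    neighbors = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, neighbors, axis=1), neighbors

# Toy database of four 2-D points; query with the first point.
toy_db = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype=np.float32
)
toy_D, toy_I = brute_force_l2_search(toy_db, toy_db[:1], k=3)
```

The return order mirrors `index.search`: distances first, then indices, each with one row per query.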

## Querying

In [None]:
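target = 283322  # row index of the query document
k = 10           # number of neighbors to retrieve (normally includes the query itself)
# D holds squared L2 distances and I the matching row indices, both of shape (1, k).
D, I = index.search(latent_vectors[target].reshape(1, -1), k)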

In [None]:
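def print_paragraph(i):
    """Print the country, year, and text of the document at row i."""
    # Index into the debates DataFrame, whose rows line up with latent_vectors.
    doc = corpus.debates.iloc[i]
    print(doc.country_name, doc.year)
    print(doc.text)
    print('\n\n')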

In [None]:
print_paragraph(target)
print('Nearest Neighbors:\n')
for i in I[0]:
    # The query document is normally its own nearest neighbor, so skip it.
    if i == target:
        continue
    print_paragraph(i)
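
Because the query document is skipped, the loop above prints at most `k - 1` neighbors. If exactly `k` neighbors are wanted, one option is to request `k + 1` results and drop the query afterwards. The helper below is a sketch under that assumption: `neighbors_excluding_self` is a hypothetical name, and `ToyIndex` is a stand-in that only imitates the `search(queries, k)` call of a FAISS index so the example runs without FAISS installed.

```python
import numpy as np

def neighbors_excluding_self(index, vectors, target, k):
    """Return the k nearest neighbors of vectors[target], leaving out target itself."""
    # Ask for one extra result, since the query normally comes back at distance 0.
    query = np.ascontiguousarray(vectors[target:target + 1], dtype=np.float32)
    distances, indices = index.search(query, k + 1)
    keep = indices[0] != target
    # Trim to k in case the target was not among the returned results.
    return distances[0][keep][:k], indices[0][keep][:k]

class ToyIndex:
    """Hypothetical stand-in exposing the same search(queries, k) call as a FAISS index."""

    def __init__(self, vectors):
        self.vectors = vectors

    def search(self, queries, k):
        # Squared L2 distance between every query and every stored vector.
        sq = ((queries[:, None, :] - self.vectors[None, :, :]) ** 2).sum(axis=2)
        order = np.argsort(sq, axis=1)[:, :k]
        return np.take_along_axis(sq, order, axis=1), order

toy_vectors = np.array([[0.0], [1.0], [3.0], [6.0]], dtype=np.float32)
toy_dist, toy_idx = neighbors_excluding_self(
    ToyIndex(toy_vectors), toy_vectors, target=1, k=2
)
```

With the real index, the equivalent call would be `neighbors_excluding_self(index, latent_vectors, target, k)`.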