## NearPy

Locality-sensitive hashing (LSH), as implemented by NearPy: https://github.com/pixelogik/NearPy

"Fast (approximated) nearest neighbour search in high dimensional vector spaces using different locality-sensitive hashing methods."

In [1]:
from __future__ import division
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nearpy import Engine
from nearpy.distances import CosineDistance
from nearpy.filters import NearestFilter
from nearpy.hashes import RandomBinaryProjections, PCABinaryProjections
import wikipedia

In [2]:
df = pd.DataFrame([['Albert Einstein'],
                    ['Richard Feynman'],
                    ['Leonhard Euler'],
                    ['Stephen Hawking']])
df.columns = ['name']
df.head()

| | name |
|---|---|
| 0 | Albert Einstein |
| 1 | Richard Feynman |
| 2 | Leonhard Euler |
| 3 | Stephen Hawking |


In [3]:
df['wiki_content'] = df['name'].map(lambda x: wikipedia.summary(x, sentences=100))
df.head()

| | name | wiki_content |
|---|---|---|
| 0 | Albert Einstein | Albert Einstein (/ˈaɪnstaɪn/; German: [ˈalbɛɐ̯... |
| 1 | Richard Feynman | Richard Phillips Feynman (/ˈfaɪnmən/; May 11, ... |
| 2 | Leonhard Euler | Leonhard Euler (/ˈɔɪlər/ OY-lər; Swiss Standar... |
| 3 | Stephen Hawking | Stephen William Hawking, CH, CBE, FRS, FRSA (/... |


In [4]:
vectorizer = TfidfVectorizer(input='content', lowercase=True, tokenizer=None, 
                        stop_words='english', use_idf=True)
tfidf = vectorizer.fit_transform(df['wiki_content'])
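
Because LSH is approximate, it can help to have exact distances to compare against. The following is a minimal sketch, assuming dense row vectors; `cosine_distance_matrix`, `exact_nearest`, and the toy matrix are hypothetical names and data for illustration, not part of the notebook or of NearPy.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Exact pairwise cosine distances (1 - cosine similarity) between rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms              # assumes no all-zero rows
    return 1.0 - unit.dot(unit.T)

def exact_nearest(dist, i):
    """Index of the row closest to row i, excluding row i itself."""
    d = dist[i].copy()
    d[i] = np.inf                 # never return the query itself
    return int(np.argmin(d))

# Hypothetical toy vectors standing in for TF-IDF rows
toy = np.array([[1.0, 1.0, 0.0],
                [1.0, 0.9, 0.1],
                [0.0, 0.1, 1.0]])
toy_dist = cosine_distance_matrix(toy)
```

In this notebook the same function could be called as `cosine_distance_matrix(tfidf.toarray())`. With only four documents the brute-force matrix is cheap to compute. Self-distances may come out as tiny positive or negative values rather than exactly zero because of floating-point rounding.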

### LSH

In [5]:
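# PCA-based hash trained on the TF-IDF vectors.
# Note: this hash is not used by the engine below, which hashes with rbp only.
pca = PCABinaryProjections('pca', 10, tfidf.toarray())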

In [6]:
# Dimension of our vector space
dimension = tfidf.shape[1]

# Random binary projection hash with 4 bits (at most 2**4 = 16 buckets);
# use more bits for larger datasets
rbp = RandomBinaryProjections('rbp', 4)

# Create engine with pipeline configuration
engine = Engine(dimension, lshashes=[rbp],
               distance=CosineDistance(),
               vector_filters=[NearestFilter(2)])

In [7]:
# Store each TF-IDF row in the engine, tagged with the person's name
for index in range(tfidf.shape[0]):
    curr_item = tfidf[index].toarray().ravel()  # dense 1-D vector
    engine.store_vector(curr_item, df.loc[index, 'name'])

In [8]:
# Query the engine with each stored vector and print its nearest neighbours
for index in range(tfidf.shape[0]):
    curr_item = tfidf[index].toarray().ravel()  # dense 1-D vector
    # Each result is a tuple: (vector, data, distance)
    for neighbour in engine.neighbours(curr_item):
        print('''{} 
            neighbour_name: {}
            neighbour_dist: {} 
        '''.format(df.loc[index, 'name'], neighbour[1], neighbour[2]))


Albert Einstein 
            neighbour_name: Albert Einstein
            neighbour_dist: 0.0 
        
Richard Feynman 
            neighbour_name: Richard Feynman
            neighbour_dist: 0.0 
        
Leonhard Euler 
            neighbour_name: Leonhard Euler
            neighbour_dist: -2.22044604925e-16 
        
Stephen Hawking 
            neighbour_name: Stephen Hawking
            neighbour_dist: -2.22044604925e-16 
        


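In the run shown above, every query returns only itself (the tiny negative distances are floating-point rounding of zero), so no two articles shared a bucket. The projections are random, so results vary between runs: when two vectors, for example Euler and Einstein, end up in the same bucket, they are returned as neighbours.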
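
To illustrate what "the same bucket" means, here is a simplified sketch of sign-of-random-projection hashing, the general idea behind random binary projections. It is not NearPy's actual code; `make_normals`, `bucket_key`, the seed, and the example vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_normals(projection_count, dim, seed=0):
    """Draw one random hyperplane normal per hash bit (seed is an assumption)."""
    rng = np.random.RandomState(seed)
    return rng.randn(projection_count, dim)

def bucket_key(normals, v):
    """One bit per hyperplane: '1' if v lies on its positive side, else '0'."""
    projection = normals.dot(np.asarray(v, dtype=float))
    return ''.join('1' if p > 0 else '0' for p in projection)

normals = make_normals(4, 3)                 # 4 bits, 3-dimensional vectors
a = np.array([1.0, 2.0, 3.0])

key_a = bucket_key(normals, a)
key_scaled = bucket_key(normals, 2.5 * a)    # same direction as a
key_opposite = bucket_key(normals, -a)       # opposite direction
```

Vectors pointing in the same direction always receive the same key, and vectors pointing in opposite directions differ in every bit. Vectors separated by a small angle usually, but not always, share a key, which is why the search is approximate and why the result depends on the random draw.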