# Nearest Neighbors

## Import modules

In [5]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix

## Load Wikipedia dataset

In [3]:
wiki = pd.read_csv("people_wiki.csv")

Preview the first few rows:

In [4]:
wiki.head(5)

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


## Extract word count vectors

To load the word count vectors, define a helper that rebuilds a SciPy CSR matrix from the arrays stored in the `.npz` file:

In [6]:
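def load_sparse_csr(filename):
    # The .npz archive stores the three CSR arrays plus the matrix shape.
    loader = np.load(filename)
    data = loader['data']        # non-zero values
    indices = loader['indices']  # column index of each value
    indptr = loader['indptr']    # row boundaries into data/indices
    shape = loader['shape']

    return csr_matrix((data, indices, indptr), shape=shape)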

In [8]:
# word count vector
word_count = load_sparse_csr('people_wiki_word_count.npz')
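
To make the role of the three arrays concrete, here is a minimal sketch that builds a small CSR matrix by hand. The 3x4 matrix and the `toy` and `dense` names are hypothetical illustrations and are not taken from the Wikipedia files.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: 3 documents, 4 vocabulary words.
data = np.array([2, 1, 3, 1])      # non-zero counts, stored row by row
indices = np.array([0, 3, 1, 2])   # column (word) index of each count
indptr = np.array([0, 2, 3, 4])    # row i owns data[indptr[i]:indptr[i + 1]]

toy = csr_matrix((data, indices, indptr), shape=(3, 4))

# Expand to a dense array only because the example is tiny.
dense = toy.toarray()
print(dense)
```

Row 0 owns the first two stored values, so it has a count of 2 in column 0 and a count of 1 in column 3; every other entry in that row is zero.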

The word-to-index mapping is given by:

In [40]:
idx_to_word = pd.read_json("people_wiki_map_index_to_word.json", typ='series')
idx_to_word = idx_to_word.sort_values()

Let's look at some of the indices:

In [41]:
idx_to_word

bioarchaeologist               0
leaguehockey                   1
electionruss                   2
teramoto                       3
trumpeterpercussionist         4
spoofax                        5
mendelssohni                   6
crosswise                      7
yec                            8
asianthemed                    9
masheldon                     10
maywoods                      11
feduring                      12
seameo                        13
2012green                     14
wrighthassell                 15
lidda                         16
wfo                           17
ukfang                        18
outfitover                    19
pagbabago                     20
influences1                   21
stonier                       22
brbbarbosa                    23
ipishuna                      24
researchteuvo                 25
stephensens                   26
titheridge                    27
dunlapi                       28
specs                         29
          

## Find nearest neighbors

In [42]:
model = NearestNeighbors(metric='euclidean', algorithm='brute')
model.fit(word_count)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='euclidean',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)
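
With `algorithm='brute'` and `metric='euclidean'`, each query is answered by computing the Euclidean distance from the query row to every row of the fitted matrix and keeping the smallest ones. The sketch below checks this on a hypothetical 3-document toy matrix; `toy_counts`, `toy_model`, and `manual` are illustrative names, not part of the notebook's data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical word counts: 3 documents, 3 vocabulary words.
toy_counts = csr_matrix(np.array([[2, 0, 1],
                                  [2, 1, 1],
                                  [0, 5, 0]], dtype=float))

toy_model = NearestNeighbors(metric='euclidean', algorithm='brute')
toy_model.fit(toy_counts)

# Query with document 0; it is its own nearest neighbor at distance 0.
toy_dist, toy_idx = toy_model.kneighbors(toy_counts[0], n_neighbors=3)

# The same distances computed by hand: sqrt of the summed squared differences.
diff = toy_counts.toarray() - toy_counts[0].toarray()
manual = np.sqrt((diff ** 2).sum(axis=1))

print(toy_idx[0], toy_dist[0])
print(manual)
```

Because every count difference is squared and summed, words with large raw counts contribute the most to the distance.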

In [45]:
# row number of Obama's article
wiki[wiki['name']=='Barack Obama']

Unnamed: 0,URI,name,text
35817,<http://dbpedia.org/resource/Barack_Obama>,Barack Obama,barack hussein obama ii brk husen bm born augu...


The 10 nearest neighbors of Barack Obama:

In [55]:
distances, indices = model.kneighbors(word_count[35817], n_neighbors=10)
nn_obama = wiki.loc[indices[0]].copy()  # copy so adding a column does not trigger SettingWithCopyWarning
nn_obama['distance'] = distances[0]
nn_obama[['name','distance']]

Unnamed: 0,name,distance
35817,Barack Obama,0.0
24478,Joe Biden,33.075671
28447,George W. Bush,34.394767
35357,Lawrence Summers,36.152455
14754,Mitt Romney,36.166283
13229,Francisco Barrio,36.331804
31423,Walter Mondale,36.400549
22745,Wynn Normington Hugh-Jones,36.496575
36364,Don Bonker,36.633318
9210,Andy Anstett,36.959437


In [62]:
obama_words = pd.Series(word_count[35817].toarray()[0],index=idx_to_word.index)
obama_words = obama_words.sort_values(ascending=False)
obama_words.head(10)

the      40
in       30
and      21
of       18
to       14
his      11
obama     9
act       8
a         7
he        7
dtype: int64

All 10 people are politicians, but about half of them have only tenuous connections to Obama beyond that shared profession.

* Francisco Barrio is a Mexican politician, and a former governor of Chihuahua.
* Walter Mondale and Don Bonker are Democrats who made their careers in the late 1970s.
* Wynn Normington Hugh-Jones is a former British diplomat and Liberal Party official.
* Andy Anstett is a former politician in Manitoba, Canada.

Nearest neighbors with raw word counts got some things right, returning only politicians, but missed finer and more important details.

For instance, let's find out why Francisco Barrio was considered a close neighbor of Obama. To do this, let's look at the most frequently used words in the Barack Obama and Francisco Barrio pages:

In [56]:
# row number of Francisco Barrio's article
wiki[wiki['name']=='Francisco Barrio']

Unnamed: 0,URI,name,text
13229,<http://dbpedia.org/resource/Francisco_Barrio>,Francisco Barrio,francisco javier barrio terrazas born november...


In [57]:
distances, indices = model.kneighbors(word_count[13229], n_neighbors=10)
# these are Barrio's neighbors, so use a separate name and keep nn_obama intact
nn_barrio = wiki.loc[indices[0]].copy()
nn_barrio['distance'] = distances[0]
nn_barrio[['name','distance']]

Unnamed: 0,name,distance
13229,Francisco Barrio,0.0
7794,Sali Berisha,27.87472
26858,Ciril Zlobec,28.0
29243,Jay Naidoo,28.05352
23880,Richard Hu,28.319605
27737,Fernando Canales Clariond,28.337255
36197,Arnd Kr%C3%BCger,28.530685
58686,Ramakanta Rath,28.565714
53529,Chris Saxman,28.653098
33378,Jaros%C5%82aw Lasecki,28.670542


In [63]:
barrio_words = pd.Series(word_count[13229].toarray()[0],index=idx_to_word.index)
barrio_words = barrio_words.sort_values(ascending=False)
barrio_words.head(10)

the          36
of           24
and          18
in           17
he           10
to            9
chihuahua     7
a             6
governor      6
as            5
dtype: int64

Let's extract the most frequent words that appear in both Obama's and Barrio's articles. So far, we have sorted the words in each article by frequency. We will now combine the two lists with a dataframe operation known as a join.
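
As a minimal sketch of what such a join looks like, the example below uses only the two top-10 lists printed above rather than the full word count vectors, so its result is illustrative. The choice of an inner join and the names `obama_top`, `barrio_top`, and `combined` are assumptions made for this example.

```python
import pandas as pd

# Top-10 counts copied from the outputs shown earlier in this notebook.
obama_top = pd.Series(
    {'the': 40, 'in': 30, 'and': 21, 'of': 18, 'to': 14,
     'his': 11, 'obama': 9, 'act': 8, 'a': 7, 'he': 7},
    name='Obama')
barrio_top = pd.Series(
    {'the': 36, 'of': 24, 'and': 18, 'in': 17, 'he': 10,
     'to': 9, 'chihuahua': 7, 'a': 6, 'governor': 6, 'as': 5},
    name='Barrio')

# An inner join on the word index keeps only words present in both lists.
combined = obama_top.to_frame().join(barrio_top.to_frame(), how='inner')
combined = combined.sort_values('Obama', ascending=False)
print(combined)
```

In this small sample, every shared word is a common function word such as "the", "in", and "of", which hints at why raw counts place the two articles close together.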