#Document retrieval from Wikipedia data

#Import pandas, NumPy, and scikit-learn

In [12]:
import sklearn, pandas
import numpy as np

#Load some text data: Wikipedia pages about people

In [13]:
people = pandas.read_csv('people_wiki.csv')

The data contains three fields: a link to the Wikipedia article, the name of the person, and the text of the article.

In [14]:
people.head()

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


In [15]:
len(people)

59071

#Explore the dataset and check out the text it contains

##Exploring the entry for president Obama

In [16]:
obama = people[people['name'] == 'Barack Obama']

In [17]:
obama

Unnamed: 0,URI,name,text
35817,<http://dbpedia.org/resource/Barack_Obama>,Barack Obama,barack hussein obama ii brk husen bm born augu...


In [18]:
obama['text'].values[0]

'barack hussein obama ii brk husen bm born august 4 1961 is the 44th and current president of the united states and the first african american to hold the office born in honolulu hawaii obama is a graduate of columbia university and harvard law school where he served as president of the harvard law review he was a community organizer in chicago before earning his law degree he worked as a civil rights attorney and taught constitutional law at the university of chicago law school from 1992 to 2004 he served three terms representing the 13th district in the illinois senate from 1997 to 2004 running unsuccessfully for the united states house of representatives in 2000in 2004 obama received national attention during his campaign to represent illinois in the united states senate with his victory in the march democratic party primary his keynote address at the democratic national convention in july and his election to the senate in november he began his presidential campaign in 2007 and afte

##Exploring the entry for actor George Clooney

In [19]:
clooney = people[people['name'] == 'George Clooney']
clooney['text'].values[0]

'george timothy clooney born may 6 1961 is an american actor writer producer director and activist he has received three golden globe awards for his work as an actor and two academy awards one for acting and the other for producingclooney made his acting debut on television in 1978 and later gained wide recognition in his role as dr doug ross on the longrunning medical drama er from 1994 to 1999 for which he received two emmy award nominations while working on er he began attracting a variety of leading roles in films including the superhero film batman robin 1997 and the crime comedy out of sight 1998 in which he first worked with a director who would become a longtime collaborator steven soderbergh in 1999 clooney took the lead role in three kings a wellreceived war satire set during the gulf warin 2001 clooneys fame widened with the release of his biggest commercial success the heist comedy oceans eleven the first of the film trilogy a remake of the 1960 film with frank sinatra as d

#Get the word counts for Obama article

In [20]:
#from nltk.tokenize import word_tokenize
def word_count_function(string):
  count = {}
  for word in string.strip().split():
    count[word] = count.get(word, 0) + 1
  return count
obama_word_count = obama['text'].map(word_count_function)

In [21]:
print(obama_word_count.values[0])

{'barack': 1, 'hussein': 1, 'obama': 9, 'ii': 1, 'brk': 1, 'husen': 1, 'bm': 1, 'born': 2, 'august': 1, '4': 1, '1961': 1, 'is': 2, 'the': 40, '44th': 1, 'and': 21, 'current': 1, 'president': 4, 'of': 18, 'united': 3, 'states': 3, 'first': 3, 'african': 1, 'american': 3, 'to': 14, 'hold': 1, 'office': 2, 'in': 30, 'honolulu': 1, 'hawaii': 1, 'a': 7, 'graduate': 1, 'columbia': 1, 'university': 2, 'harvard': 2, 'law': 6, 'school': 3, 'where': 1, 'he': 7, 'served': 2, 'as': 6, 'review': 1, 'was': 5, 'community': 1, 'organizer': 1, 'chicago': 2, 'before': 1, 'earning': 1, 'his': 11, 'degree': 1, 'worked': 1, 'civil': 1, 'rights': 1, 'attorney': 1, 'taught': 1, 'constitutional': 1, 'at': 2, 'from': 3, '1992': 1, '2004': 3, 'three': 1, 'terms': 1, 'representing': 1, '13th': 1, 'district': 1, 'illinois': 2, 'senate': 3, '1997': 1, 'running': 1, 'unsuccessfully': 1, 'for': 4, 'house': 2, 'representatives': 2, '2000in': 1, 'received': 1, 'national': 2, 'attention': 1, 'during': 2, 'campaign': 3
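For comparison, the same whitespace-based counting can be sketched with `collections.Counter` from the standard library. The function name `word_count_counter` and the sample sentence are hypothetical, chosen only for illustration.

```python
from collections import Counter

def word_count_counter(text):
    # str.split() with no arguments splits on runs of whitespace and
    # ignores leading/trailing whitespace, like strip().split() above.
    return Counter(text.split())

sample = "the cat sat on the mat the end"
counts = word_count_counter(sample)

# most_common(n) returns the n most frequent (word, count) pairs.
print(counts.most_common(2))
```

A `Counter` is a `dict` subclass, so it can be passed to `pandas.DataFrame.from_dict` in the same way as the plain dictionary.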

##Sort the word counts for the Obama article

###Turning dictionary of word counts into a table

In [22]:
obama_word_count_table = pandas.DataFrame.from_dict(obama_word_count.values[0], orient="index", columns=["count"])

###Sorting the word counts to show most common words at the top

In [23]:
obama_word_count_table.head()

Unnamed: 0,count
barack,1
hussein,1
obama,9
ii,1
brk,1


In [24]:
obama_word_count_table.sort_values('count',ascending=False)

Unnamed: 0,count
the,40
in,30
and,21
of,18
to,14
...,...
laureateduring,1
two,1
years,1
into,1


The most common words are uninformative ones such as "the", "in", and "and".
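One simple way to see past these words is to drop stop words before sorting. The sketch below assumes scikit-learn's built-in `ENGLISH_STOP_WORDS` list and uses a small hand-made dictionary with a few of the counts shown above; `toy_counts` and `informative` are illustrative names.

```python
import pandas
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# A handful of counts taken from the Obama table above.
toy_counts = {"the": 40, "in": 30, "and": 21, "obama": 9, "law": 6}
table = pandas.DataFrame.from_dict(toy_counts, orient="index", columns=["count"])

# Keep only rows whose word is not in the stop-word list.
informative = table[~table.index.isin(list(ENGLISH_STOP_WORDS))]
print(informative.sort_values("count", ascending=False))
```

Stop-word filtering only removes a fixed list of words; TF-IDF goes further by down-weighting any word that is common across the corpus.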

#Compute TF-IDF for the corpus 

To give more weight to informative words, we weight them by their TF-IDF scores.

In [25]:
people['word_count'] = people['text'].map(word_count_function)
people.head()

Unnamed: 0,URI,name,text,word_count
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,"{'digby': 1, 'morrell': 5, 'born': 1, '10': 1,..."
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,"{'alfred': 1, 'j': 1, 'lewy': 3, 'aka': 1, 'sa..."
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...,"{'harpdog': 2, 'brown': 2, 'is': 7, 'a': 7, 's..."
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...,"{'franz': 1, 'rottensteiner': 3, 'born': 1, 'i..."
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'henry': 1, 'krvits': 1, 'born': 1, '30': 1, ..."


In [26]:
#tfidf = graphlab.text_analytics.tf_idf(people['word_count'])
#tfidf
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english")
people_tfidf = tfidf_vectorizer.fit_transform(people['text'])
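To make the weighting concrete, here is a minimal sketch that reproduces `TfidfVectorizer`'s default output by hand on a made-up three-document corpus. It assumes the library defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows); the documents and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "obama senate president law",
    "clinton president law law",
    "beckham football england",
]

# Term frequencies: raw counts per document.
counts = CountVectorizer().fit(docs)
tf = counts.transform(docs).toarray().astype(float)

# Smoothed inverse document frequency.
n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)            # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1

# TF-IDF, then scale each row to unit Euclidean length.
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Terms that occur in fewer documents receive a larger IDF, which is what pushes rare, topical words ahead of common ones.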

##Examine the TF-IDF for the Obama article

In [27]:
#obama = people[people['name'] == 'Barack Obama']
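# idf_ holds one IDF weight per vocabulary term; vocabulary_ maps terms to column indices.
idf_values = tfidf_vectorizer.idf_
vocab = tfidf_vectorizer.vocabulary_
# Raw count * IDF per word (not L2-normalized); words outside the vocabulary, such as stop words, are skipped.
obama_words_tfidf_values = {k: count * idf_values[vocab[k]] for k, count in obama_word_count.values[0].items() if k in vocab}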

In [28]:
#obama[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)
obama_word_tfidf_table = pandas.DataFrame.from_dict(obama_words_tfidf_values, orient="index", columns=["tfidf"])
obama_word_tfidf_table.sort_values('tfidf', ascending=False).head()

Unnamed: 0,tfidf
obama,52.277114
act,35.674051
iraq,21.741728
law,20.721856
control,18.88433


The words with the highest TF-IDF scores are much more informative.

#Manually compute distances between a few people

Let's manually compare the distances between the articles for a few famous people.  

In [29]:
clinton = people[people['name'] == 'Bill Clinton']
beckham = people[people['name'] == 'David Beckham']

In [30]:
obama_tfidf = tfidf_vectorizer.transform(obama['text'])[0]
clinton_tfidf = tfidf_vectorizer.transform(clinton['text'])[0]
beckham_tfidf = tfidf_vectorizer.transform(beckham['text'])[0]

##Is Obama closer to Clinton than to Beckham?

We will use cosine distance, which is given by

1 - cosine_similarity

and find that the article about President Obama is closer to the one about former President Clinton than to the one about footballer David Beckham.

In [31]:
from sklearn.metrics.pairwise import cosine_distances
cosine_distances(obama_tfidf, clinton_tfidf)
#graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])

array([[0.81103282]])

In [32]:
cosine_distances(obama_tfidf, beckham_tfidf)

array([[0.97443419]])
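As a sanity check on the definition, cosine distance can be computed by hand with NumPy and compared against `cosine_distances`. The two vectors below are arbitrary toy values, not TF-IDF rows from the dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 1.0, 1.0]])

# cosine similarity = dot product divided by the product of the vector lengths
cos_sim = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))
manual = 1.0 - cos_sim

print(manual, cosine_distances(a, b)[0, 0])
```

Because TF-IDF vectors have no negative entries, their cosine similarity lies between 0 and 1, and so does the distance: 0 means the same direction, 1 means no shared terms.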


#Build a nearest neighbor model for document retrieval

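We now create a nearest-neighbors model and apply it to document retrieval. Note that `NearestNeighbors()` defaults to Euclidean distance rather than the cosine distance used above, so the distances reported below are on a different scale and can exceed 1.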

In [33]:
#knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')
from sklearn.neighbors import NearestNeighbors
knn_model = NearestNeighbors().fit(people_tfidf)
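With its default arguments, `NearestNeighbors` uses the Minkowski metric with `p=2`, that is, Euclidean distance. Since `TfidfVectorizer` L2-normalizes each row by default, the squared Euclidean distance between two rows equals twice their cosine distance, so both metrics rank neighbors identically. The sketch below checks this on a hypothetical four-document corpus and shows how `metric="cosine"` could be passed instead.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

toy_docs = [
    "president senate law election",
    "president senate campaign election",
    "football midfielder england club",
    "singer album country music",
]
X = TfidfVectorizer().fit_transform(toy_docs)   # rows have unit length

euclid = NearestNeighbors(n_neighbors=2).fit(X)                   # default metric
cosine = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)  # explicit cosine

d_euc, i_euc = euclid.kneighbors(X[0])
d_cos, i_cos = cosine.kneighbors(X[0])

print(i_euc[0], i_cos[0])                          # same neighbors either way
print(np.allclose(d_euc ** 2, 2 * d_cos, atol=1e-6))
```

Which metric to report is a matter of preference here; cosine distances are easier to compare with the manual calculations earlier in the notebook.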

#Applying the nearest-neighbors model for retrieval

##Who is closest to Obama?

In [35]:
n_dist, n_ids = knn_model.kneighbors(obama_tfidf, return_distance=True)
n_ids, n_dist = n_ids[0], n_dist[0]
#print(n_dist, n_ids)
n_names = people['name'].iloc[n_ids]
neighbors = pandas.DataFrame({"name":n_names, "distance": n_dist, "index": n_ids}).sort_values("distance", ascending=True)
neighbors

Unnamed: 0,name,distance,index
35817,Barack Obama,0.0,35817
24478,Joe Biden,1.165145,24478
38376,Samantha Power,1.207369,38376
57108,Hillary Rodham Clinton,1.21964,57108
38714,Eric Stern (politician),1.222509,38714


As we can see, President Obama's article is closest to the one about his vice president, Joe Biden, followed by those of other politicians.

##Other examples of document retrieval

In [24]:
swift = people[people['name'] == 'Taylor Swift']
swift_tfidf = tfidf_vectorizer.transform(swift['text'])[0]

In [25]:
n_dist, n_ids = knn_model.kneighbors(swift_tfidf, return_distance=True)
n_ids, n_dist = n_ids[0], n_dist[0]
#print(n_dist, n_ids)
n_names = people['name'].iloc[n_ids]
neighbors = pandas.DataFrame({"name":n_names, "distance": n_dist, "index": n_ids}).sort_values("distance", ascending=True)
neighbors

Unnamed: 0,distance,index,name
54264,0.0,54264,Taylor Swift
317,1.183004,317,Carrie Underwood
9379,1.187754,9379,Al Swift
25403,1.193938,25403,Ed Sheeran
19943,1.197285,19943,Tim McGraw


In [26]:
jolie = people[people['name'] == 'Angelina Jolie']
jolie_tfidf = tfidf_vectorizer.transform(jolie['text'])[0]

In [27]:
n_dist, n_ids = knn_model.kneighbors(jolie_tfidf, return_distance=True)
n_ids, n_dist = n_ids[0], n_dist[0]
#print(n_dist, n_ids)
n_names = people['name'].iloc[n_ids]
neighbors = pandas.DataFrame({"name":n_names, "distance": n_dist, "index": n_ids}).sort_values("distance", ascending=True)
neighbors

Unnamed: 0,distance,index,name
39521,1.825012e-08,39521,Angelina Jolie
24426,1.173973,24426,Brad Pitt
16625,1.241878,16625,Keith Jolie
21644,1.25319,21644,Jodie Foster
34756,1.254573,34756,Maggie Smith


In [28]:
arnold = people[people['name'] == 'Arnold Schwarzenegger']
arnold_tfidf = tfidf_vectorizer.transform(arnold['text'])[0]

In [29]:
n_dist, n_ids = knn_model.kneighbors(arnold_tfidf, return_distance=True)
n_ids, n_dist = n_ids[0], n_dist[0]
#print(n_dist, n_ids)
n_names = people['name'].iloc[n_ids]
neighbors = pandas.DataFrame({"name":n_names, "distance": n_dist, "index": n_ids}).sort_values("distance", ascending=True)
neighbors

Unnamed: 0,distance,index,name
16018,0.0,16018,Arnold Schwarzenegger
58965,1.259683,58965,Bonnie Garcia
35293,1.263233,35293,Paul Grant (bodybuilder)
47709,1.283846,47709,Gray Davis
8050,1.284463,8050,James Tramel
