# Document retrieval from Wikipedia data

## Fire up GraphLab Create

In [1]:
import graphlab

# Load some text data: Wikipedia pages on people

In [None]:
people = graphlab.SFrame('people_wiki.gl/')

Each row contains a link to the Wikipedia article, the name of the person, and the text of the article.

In [3]:
people.head(5)

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...


In [4]:
len(people)

59071

# Explore the dataset and check out the text it contains

## Exploring the entry for President Obama

In [3]:
obama = people[people['name'] == 'Barack Obama']

In [6]:
obama

URI,name,text
<http://dbpedia.org/resource/Barack_Obama> ...,Barack Obama,barack hussein obama ii brk husen bm born august ...


In [17]:
obama['text'][0][:1000]

'barack hussein obama ii brk husen bm born august 4 1961 is the 44th and current president of the united states and the first african american to hold the office born in honolulu hawaii obama is a graduate of columbia university and harvard law school where he served as president of the harvard law review he was a community organizer in chicago before earning his law degree he worked as a civil rights attorney and taught constitutional law at the university of chicago law school from 1992 to 2004 he served three terms representing the 13th district in the illinois senate from 1997 to 2004 running unsuccessfully for the united states house of representatives in 2000in 2004 obama received national attention during his campaign to represent illinois in the united states senate with his victory in the march democratic party primary his keynote address at the democratic national convention in july and his election to the senate in november he began his presidential campaign in 2007 and afte

## Exploring the entry for actor George Clooney

In [21]:
clooney = people[people['name'] == 'George Clooney']
clooney['text']

dtype: str
Rows: ?
['george timothy clooney born may 6 1961 is an american actor writer producer director and activist he has received three golden globe awards for his work as an actor and two academy awards one for acting and the other for producingclooney made his acting debut on television in 1978 and later gained wide recognition in his role as dr doug ross on the longrunning medical drama er from 1994 to 1999 for which he received two emmy award nominations while working on er he began attracting a variety of leading roles in films including the superhero film batman robin 1997 and the crime comedy out of sight 1998 in which he first worked with a director who would become a longtime collaborator steven soderbergh in 1999 clooney took the lead role in three kings a wellreceived war satire set during the gulf warin 2001 clooneys fame widened with the release of his biggest commercial success the heist comedy oceans eleven the first of the film trilogy a remake of the 1960 film wit

# Get the word counts for Obama article

In [7]:
## count words using text_analytics.count_words
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])

In [12]:
## represent dict as a list and sort by count then key
obama_list = [(x,y) for x,y in obama['word_count'][0].iteritems()]
sorted(obama_list, key= lambda x: (-x[1], x[0]))[:10]

[('the', 40),
 ('in', 30),
 ('and', 21),
 ('of', 18),
 ('to', 14),
 ('his', 11),
 ('obama', 9),
 ('act', 8),
 ('a', 7),
 ('he', 7)]
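
The same count-then-sort step can be sketched in plain Python for readers without GraphLab Create. This is a minimal sketch, not GraphLab's implementation: it assumes whitespace tokenization of already-lowercased text, and `count_words` and `sample_text` are hypothetical names introduced only for illustration.

```python
from collections import Counter

def count_words(text):
    """Return a dict mapping each whitespace-separated token to its count."""
    return dict(Counter(text.split()))

# Hypothetical sample text, not taken from the dataset.
sample_text = "the president of the senate and the house"
word_count = count_words(sample_text)

# Sort by descending count, breaking ties alphabetically by word.
top_words = sorted(word_count.items(), key=lambda kv: (-kv[1], kv[0]))
print(top_words[:3])
```

Negating the count in the sort key gives descending counts while keeping ties in ascending alphabetical order, which mirrors the `(-x[1], x[0])` key used in the cell above.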

## Sort the word counts for the Obama article

### Turning dictionary of word counts into a table

In [48]:
obama[['word_count']]  ##  SFrame object

word_count
"{'operations': 1, 'represent': 1, 'offi ..."


In [49]:
## use stack method to create table w/ new cols 'word', 'count'
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name = ['word','count'])
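
Conceptually, `stack` turns each key-value pair of the dictionary into its own row. A rough plain-Python analogue is sketched below; it illustrates the idea only and does not describe GraphLab's internals. The `stack_dict` helper and the sample counts are hypothetical.

```python
def stack_dict(d, key_name, value_name):
    """Turn a dict into a list of records, one per key-value pair."""
    return [{key_name: k, value_name: v} for k, v in d.items()]

# Hypothetical word counts for illustration.
word_count = {"obama": 9, "act": 8, "the": 40}
table = stack_dict(word_count, "word", "count")

# Sort rows by the 'count' column in descending order.
table_sorted = sorted(table, key=lambda row: row["count"], reverse=True)
for row in table_sorted:
    print("%s %d" % (row["word"], row["count"]))
```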

### Sorting the word counts to show most common words at the top

In [50]:
obama_word_count_table.head(8)

word,count
cuba,1
relations,1
sought,1
combat,1
ending,1
withdrawal,1
state,1
islamic,1


In [17]:
## sort by 'count' column in descending order
obama_word_count_table.sort('count',ascending=False)

word,count
the,40
in,30
and,21
of,18
to,14
his,11
obama,9
act,8
he,7
a,7


The most common words are uninformative ones such as "the", "in", and "and".

# Compute TF-IDF for the corpus 

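To give more weight to informative words, we weight each word by its TF-IDF score: its term frequency in the article multiplied by its inverse document frequency, which discounts words that appear in many articles.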
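
As a rough illustration, the sketch below computes TF-IDF with one common formulation, tf × ln(N / df), where tf is the word's count in the document, N is the number of documents, and df is the number of documents containing the word. Whether GraphLab Create uses exactly this variant is an assumption here, and the `tf_idf` function and toy corpus are hypothetical.

```python
import math

def tf_idf(word_counts):
    """word_counts: list of {word: count} dicts, one per document.
    Returns a list of {word: tf * ln(N / df)} dicts."""
    n_docs = len(word_counts)
    # Document frequency: number of documents containing each word.
    doc_freq = {}
    for counts in word_counts:
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return [
        {w: c * math.log(n_docs / float(doc_freq[w])) for w, c in counts.items()}
        for counts in word_counts
    ]

# Toy corpus: 'the' appears in every document, 'senate' in only one.
corpus = [
    {"the": 3, "senate": 2},
    {"the": 5, "football": 1},
    {"the": 2, "guitar": 4},
]
scores = tf_idf(corpus)
print(scores[0])
```

In this toy corpus "the" scores zero because it occurs in every document, while "senate" keeps a positive weight because it is rare.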

In [15]:
## add 'word_count' col to entire people dataset

people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people.head(5)

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'selection': 1, 'carltons': 1, 'being': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1, 'thomas': 1, 'closely': 1, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1, 'issued': 1, 'mainly': 1, 'nominat ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1, 'bauforschung': 1, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'they': 1, 'gangstergenka': 1, ..."


In [24]:
## use built-in tf-idf method

tfidf = graphlab.text_analytics.tf_idf(people['word_count'])

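# Earlier versions of GraphLab Create returned an SFrame rather than a single SArray.
# This notebook was created using GraphLab Create version 1.7.1.
# Compare versions numerically: a plain string comparison would misorder e.g. '1.10' and '1.6.1'.
from distutils.version import LooseVersion
if LooseVersion(graphlab.version) <= LooseVersion('1.6.1'):
    tfidf = tfidf['docs']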

## show 10 random key-value pairs
import random
random.sample( tfidf[0].items(), 10 )

[('by', 0.37455341206197373),
 ('career', 1.3050270203415668),
 ('afl', 4.70049729471633),
 ('selection', 3.836578553093086),
 ('victorian', 4.564873121418676),
 ('parade', 5.510031837293684),
 ('before', 2.9935647453367427),
 ('corey', 6.486685718894929),
 ('along', 2.5088749729287803),
 ('its', 1.6875948402695313)]

In [57]:
people['tfidf'] = tfidf
people.head(3)

URI,name,text,word_count,tfidf
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'selection': 1, 'carltons': 1, 'being': ...","{'selection': 3.836578553093086, ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1, 'thomas': 1, 'closely': 1, ...","{'precise': 6.44320060695519, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1, 'issued': 1, 'mainly': 1, 'nominat ...","{'just': 2.7007299687108643, ..."


## Examine the TF-IDF for the Obama article

In [58]:
obama = people[people['name'] == 'Barack Obama']

In [59]:
## create sorted table of obama tf-idf values

obama[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)

word,tfidf
obama,43.2956530721
act,27.678222623
iraq,17.747378588
control,14.8870608452
law,14.7229357618
ordered,14.5333739509
military,13.1159327785
involvement,12.7843852412
response,12.7843852412
democratic,12.4106886973


The words with the highest TF-IDF scores, such as "obama", "act", and "iraq", are much more informative than the most frequent words.

# Manually compute distances between a few people

Let's manually compare the distances between the articles for a few famous people.  

In [60]:
clinton = people[people['name'] == 'Bill Clinton']
beckham = people[people['name'] == 'David Beckham']

## Is Obama closer to Clinton than to Beckham?

We will use cosine distance, which is defined as 1 - cosine_similarity, so a lower value means the two articles are more similar. The cells that follow show that the article about President Obama is closer to the one about former President Clinton than to the one about footballer David Beckham.
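
To make the definition concrete, here is a minimal sketch of cosine distance between two sparse vectors stored as dictionaries, the same shape as the `tfidf` column. It is not GraphLab's implementation: the `cosine_distance` function and the sample vectors are hypothetical, and the sketch assumes neither vector is empty or all zeros.

```python
import math

def cosine_distance(x, y):
    """Cosine distance between two sparse vectors stored as {key: weight} dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(w * y[k] for k, w in x.items() if k in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return 1.0 - dot / (norm_x * norm_y)

# Hypothetical TF-IDF vectors for illustration.
a = {"president": 2.0, "senate": 1.0}
b = {"president": 1.0, "law": 1.0}
c = {"football": 3.0}

print(cosine_distance(a, b))  # shares 'president' with a, so below 1
print(cosine_distance(a, c))  # no shared words, so exactly 1.0
```

Two articles with no words in common have distance 1, and an article compared with itself has distance 0 up to floating-point rounding.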

In [61]:
## cosine distance from the distances module (lower means more similar; use tab completion to see other measures)

print "Obama distance to Clinton:"
graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])

Obama distance to Clinton:


0.8339854936884277

In [62]:
print "Obama distance to Beckham:"
graphlab.distances.cosine(obama['tfidf'][0],beckham['tfidf'][0])

Obama distance to Beckham:


0.9791305844747478

# Build a nearest neighbor model for document retrieval

We now create a nearest-neighbors model and apply it to document retrieval.  

In [63]:
## graphlab nearest_neighbors module - create() method
## use 'name' col as label

knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')

PROGRESS: Starting brute force nearest neighbors model training.
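
A brute-force nearest-neighbors query computes the distance from the query to every reference document and keeps the k smallest. The sketch below illustrates that idea in plain Python using cosine distance. It is not GraphLab's implementation, and the distance GraphLab picks by default may differ. The `query` and `cosine_distance` functions and the toy `docs` are hypothetical.

```python
import heapq
import math

def cosine_distance(x, y):
    """Cosine distance between two sparse {key: weight} vectors."""
    dot = sum(w * y[k] for k, w in x.items() if k in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return 1.0 - dot / (norm_x * norm_y)

def query(docs, query_vec, k=5):
    """Return the k (distance, label) pairs closest to query_vec."""
    scored = ((cosine_distance(query_vec, vec), label) for label, vec in docs.items())
    return heapq.nsmallest(k, scored)

# Toy reference set with made-up weights.
docs = {
    "politician A": {"senate": 2.0, "law": 1.0},
    "politician B": {"senate": 1.0, "election": 1.0},
    "singer": {"album": 3.0, "tour": 1.0},
}
for dist, label in query(docs, docs["politician A"], k=2):
    print("%s %.3f" % (label, dist))
```

As in the query results in this notebook, the query document itself comes back first with a distance of (approximately) zero.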


# Applying the nearest-neighbors model for retrieval

## Who is closest to Obama?

In [64]:
## query the model to find closest to obama

knn_model.query(obama)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 28.284ms     |
PROGRESS: | Done         |         | 100         | 497.574ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Barack Obama,0.0,1
0,Joe Biden,0.794117647059,2
0,Joe Lieberman,0.794685990338,3
0,Kelly Ayotte,0.811989100817,4
0,Bill Clinton,0.813852813853,5


As we can see, President Obama's article is closest to the one about his vice president, Joe Biden, followed by those of other politicians.

## Other examples of document retrieval

In [29]:
swift = people[people['name'] == 'Taylor Swift']

In [30]:
## find who is closest to Taylor Swift

knn_model.query(swift)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 19.485ms     |
PROGRESS: | Done         |         | 100         | 654.766ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Taylor Swift,0.0,1
0,Carrie Underwood,0.76231884058,2
0,Alicia Keys,0.764705882353,3
0,Jordin Sparks,0.769633507853,4
0,Leona Lewis,0.776119402985,5


In [31]:
jolie = people[people['name'] == 'Angelina Jolie']

In [32]:
knn_model.query(jolie)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 16.811ms     |
PROGRESS: | Done         |         | 100         | 440.707ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Angelina Jolie,0.0,1
0,Brad Pitt,0.784023668639,2
0,Julianne Moore,0.795857988166,3
0,Billy Bob Thornton,0.803069053708,4
0,George Clooney,0.8046875,5


In [33]:
arnold = people[people['name'] == 'Arnold Schwarzenegger']

In [34]:
knn_model.query(arnold)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 16.376ms     |
PROGRESS: | Done         |         | 100         | 458.279ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Arnold Schwarzenegger,0.0,1
0,Jesse Ventura,0.818918918919,2
0,John Kitzhaber,0.824615384615,3
0,Lincoln Chafee,0.833876221498,4
0,Anthony Foxx,0.833910034602,5


## 1. Compare top words by word count and by TF-IDF

In [65]:
## look at Elton John
elton = people[people['name'] == 'Elton John']
elton

URI,name,text,word_count,tfidf
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,"{'all': 1, 'least': 1, 'producer': 1, 'heavi ...","{'all': 1.6431112434912472, ..."


In [75]:
## create tables of word_count, tfidf

print elton[['word_count']].stack('word_count',new_column_name=['word','count']).sort('count',ascending=False)
print elton[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)

+-------+-------+
|  word | count |
+-------+-------+
|  the  |   27  |
|   in  |   18  |
|  and  |   15  |
|   of  |   13  |
|   a   |   10  |
|  has  |   9   |
|  john |   7   |
|   he  |   7   |
|   on  |   6   |
| award |   5   |
+-------+-------+
[255 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
+---------------+---------------+
|      word     |     tfidf     |
+---------------+---------------+
|    furnish    |  18.38947184  |
|     elton     |  17.48232027  |
|   billboard   | 17.3036809575 |
|      john     | 13.9393127924 |
|  songwriters  |  11.250406447 |
| tonightcandle | 10.9864953892 |
|  overallelton | 10.9864953892 |
|    19702000   | 10.2933482087 |
|   fivedecade  | 10.2933482087 |
|      aids     |  10.262846934 |
+---------------+---------------+
[255 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

## 2. Measuring distance

In [91]:
def cosine_dist(name_x, name_y):
    ## look up both articles by name, then compare their TF-IDF vectors
    x = people[people['name'] == name_x]
    y = people[people['name'] == name_y]
    print x['name'][0], "distance to", y['name'][0], ":"
    return graphlab.distances.cosine(x['tfidf'][0], y['tfidf'][0])

### Elton John to Victoria Beckham

In [92]:
cosine_dist("Elton John", 'Victoria Beckham')

Elton John distance to Victoria Beckham :


0.9567006376655429

### Elton John to Paul McCartney

In [94]:
cosine_dist('Elton John', 'Paul McCartney')

Elton John distance to Paul McCartney :


0.8250310029221779

In [95]:
cosine_dist('Paul McCartney', 'Elton John')

Paul McCartney distance to Elton John :


0.825031002922178

## 3. Building nearest-neighbor models with different input features and an explicit distance metric

In [96]:
## build knn models using word_count, tfidf

knn_model_wordcount = graphlab.nearest_neighbors.create(people,features=['word_count'],label='name', distance='cosine')
knn_model_tfidf = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name', distance='cosine')


PROGRESS: Starting brute force nearest neighbors model training.
PROGRESS: Starting brute force nearest neighbors model training.
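
The choice of features matters because raw word counts are dominated by common words that almost every article shares, so cosine distances computed on them tend to be small for any pair of articles. The toy sketch below illustrates the effect. This is one plausible reading of why the word-count distances reported in this section are much smaller than the TF-IDF ones, not a verified explanation. The vectors and IDF weights are made up, and `cosine_distance` is a hypothetical helper.

```python
import math

def cosine_distance(x, y):
    """Cosine distance between two sparse {key: weight} vectors."""
    dot = sum(w * y[k] for k, w in x.items() if k in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return 1.0 - dot / (norm_x * norm_y)

# Two hypothetical articles that share only the common words 'the' and 'in'.
counts_a = {"the": 30, "in": 20, "piano": 3}
counts_b = {"the": 28, "in": 22, "football": 3}

# Made-up IDF weights: near zero for common words, large for rare ones.
idf = {"the": 0.01, "in": 0.02, "piano": 6.0, "football": 5.0}
tfidf_a = {w: c * idf[w] for w, c in counts_a.items()}
tfidf_b = {w: c * idf[w] for w, c in counts_b.items()}

dist_counts = cosine_distance(counts_a, counts_b)
dist_tfidf = cosine_distance(tfidf_a, tfidf_b)
print(dist_counts < dist_tfidf)
```

With raw counts the two unrelated articles look nearly identical, while TF-IDF weighting pushes them close to the maximum distance of 1.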


### Most similar to Elton John using word count

In [97]:
knn_model_wordcount.query(elton)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 26.978ms     |
PROGRESS: | Done         |         | 100         | 429.682ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Elton John,2.22044604925e-16,1
0,Cliff Richard,0.16142415259,2
0,Sandro Petrone,0.16822542751,3
0,Rod Stewart,0.168327165587,4
0,Malachi O'Doherty,0.177315545979,5


### Most similar to Elton John using TF-IDF

In [98]:
knn_model_tfidf.query(elton)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 47.167ms     |
PROGRESS: | Done         |         | 100         | 583.063ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Elton John,-2.22044604925e-16,1
0,Rod Stewart,0.717219667893,2
0,George Michael,0.747600998969,3
0,Sting (musician),0.747671954431,4
0,Phil Collins,0.75119324879,5


### Most similar to Victoria Beckham using word count

In [100]:
def knn_wordcount(person):
    return knn_model_wordcount.query(people[people['name'] == person])

knn_wordcount('Victoria Beckham')

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 16.102ms     |
PROGRESS: | Done         |         | 100         | 393.547ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Victoria Beckham,-2.22044604925e-16,1
0,Mary Fitzgerald (artist),0.207307036115,2
0,Adrienne Corri,0.214509782788,3
0,Beverly Jane Fry,0.217466468741,4
0,Raman Mundair,0.217695474992,5


### Most similar to Victoria Beckham using TF-IDF

In [101]:
def knn_tfidf(person):
    return knn_model_tfidf.query(people[people['name'] == person])

knn_tfidf('Victoria Beckham')

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 45.792ms     |
PROGRESS: | Done         |         | 100         | 569.21ms     |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Victoria Beckham,1.11022302463e-16,1
0,David Beckham,0.548169610263,2
0,Stephen Dow Beckham,0.784986706828,3
0,Mel B,0.809585523409,4
0,Caroline Rush,0.819826422919,5
