Assignment: Retrieving Documents
This assignment focuses on using nearest neighbors and clustering to retrieve documents based on an analysis of their text. You will explore two types of document representations: word counts and term frequency-inverse document frequency (TF-IDF). You will build a model for retrieving Wikipedia articles about famous people.

Learning outcomes
Load and transform real text data
Build a document retrieval model using nearest neighbor search
Compare results between using two approaches: word counts and TF-IDF
Set the distance function for the retrieval task
Instructions
There are three tasks in this assignment. There are several results you need to gather for the quiz that accompanies this module.

Task 1: Compare top words using word counts to those from TF-IDF
The video explored two document representations: word counts and TF-IDF. Let's apply these representations to the article on a famous person, Elton John.

What are the 3 words in his article with the highest word counts?
What are the 3 words in his article with the highest TF-IDF?
As you can see, TF-IDF is useful for finding important words.

Save these results to answer the quiz for this module.
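
The contrast between the two representations can be sketched in plain Python. This is a minimal illustration on a made-up three-document corpus, assuming TF-IDF has the form count × log(N / df); the helpers `word_counts` and `tf_idf` are hypothetical names and are not the turicreate functions.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration; not the Wikipedia data).
docs = [
    "the singer and the pianist",
    "the footballer and the coach",
    "the singer wrote the song",
]

def word_counts(text):
    """Count how often each whitespace-separated word occurs."""
    return Counter(text.split())

def tf_idf(counts_per_doc):
    """Weight each count by log(N / df); df = number of documents containing the word."""
    n_docs = len(counts_per_doc)
    doc_freq = Counter(word for counts in counts_per_doc for word in counts)
    return [
        {word: count * math.log(n_docs / doc_freq[word]) for word, count in counts.items()}
        for counts in counts_per_doc
    ]

counts = [word_counts(d) for d in docs]
weights = tf_idf(counts)

top_by_count = counts[0].most_common(1)[0][0]
top_by_tfidf = max(weights[0], key=weights[0].get)
print(top_by_count)  # the: frequent everywhere, so it dominates raw counts
print(top_by_tfidf)  # pianist: appears only in this document
```

A word that occurs in every document gets weight 0 under this formula, which is why common words drop out of the TF-IDF ranking.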

Task 2: Measure the distance between articles
Elton John is a famous singer. Compute the distance between his article and those of two other famous singers. Use cosine distance, which measures how dissimilar two vectors are (a smaller distance means more similar articles) and is similar to the measure discussed in the lectures. You can compute cosine distance using the turicreate.distances.cosine function.

What’s the cosine distance between the articles on Elton John and Victoria Beckham?
What’s the cosine distance between the articles on Elton John and Paul McCartney?
Which of the two is closer to Elton John? Does this result make sense to you?
Save these results to answer the quiz for this module.
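
For intuition, cosine distance between two sparse vectors can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the turicreate implementation; the function name `cosine_distance` and the example weights are made up, and both vectors are assumed to be non-zero.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle) between two sparse vectors stored as {word: weight} dicts.

    Assumes neither vector is all zeros (otherwise the division fails).
    """
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up weights for illustration only.
doc_a = {"singer": 2.0, "piano": 1.0}
doc_b = {"singer": 1.0, "piano": 0.5}   # same direction as doc_a, half the length
doc_c = {"football": 3.0}               # shares no words with doc_a

print(cosine_distance(doc_a, doc_b))  # approximately 0: same direction
print(cosine_distance(doc_a, doc_c))  # 1.0: no words in common
```

Because the vectors are normalized by their lengths, a long article and a short article with the same word proportions end up at distance close to 0.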

Task 3: Building nearest neighbors models with different input features and by setting the distance metric
In the video, Carlos built a nearest neighbors model for retrieving articles using TF-IDF as features and using the default setting in the construction of the nearest neighbors model.

You will build two nearest neighbors models:

Using word counts as features
Using TF-IDF as features
In both of these models, set the distance function to cosine distance by passing the parameter distance='cosine' to turicreate.nearest_neighbors.create.

Use your model to retrieve documents and collect the following results:

What’s the most similar article, other than itself, to the one on ‘Elton John’ using word count features?
What’s the most similar article, other than itself, to the one on ‘Elton John’ using TF-IDF features?
What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using word count features?
What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using TF-IDF features?
Save these results to answer the quiz for this module.
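
Conceptually, a nearest neighbors query ranks every document by its distance to the query document. The following is a brute-force sketch under that assumption, using a made-up corpus and hypothetical helper names (`cosine_distance`, `nearest_neighbors`); it does not reflect how turicreate implements the search internally.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity for sparse {word: weight} vectors (assumed non-zero)."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbors(query_label, corpus, k=3):
    """Return the k labels in `corpus` closest to the query document, nearest first."""
    query = corpus[query_label]
    ranked = sorted(corpus, key=lambda label: cosine_distance(query, corpus[label]))
    return ranked[:k]

# Made-up feature vectors for illustration only.
corpus = {
    "singer_a": {"singer": 2.0, "piano": 1.0, "album": 1.0},
    "singer_b": {"singer": 1.0, "album": 2.0},
    "footballer": {"football": 3.0, "club": 1.0},
}

print(nearest_neighbors("singer_a", corpus, k=2))  # ['singer_a', 'singer_b']
```

The query document always ranks first because its distance to itself is (up to rounding) zero, which is why the questions above ask for the most similar article other than itself.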

In [1]:
import turicreate

In [2]:
people = turicreate.SFrame('~/my-env/data/people_wiki.sframe')

In [3]:
people

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...


# Task 1: Compare top words using word counts to those from TF-IDF

In [4]:
john = people[people['name'] == 'Elton John']

In [5]:
john

URI,name,text
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...


In [6]:
john['word_count'] = turicreate.text_analytics.count_words(john['text'])

In [7]:
john['word_count']

dtype: dict
Rows: 1
[{'social': 1.0, 'champion': 1.0, 'be': 1.0, '2014': 1.0, 'legal': 1.0, 'became': 1.0, '2005': 1.0, 'december': 2.0, '21': 2.0, 'furnish': 2.0, 'david': 1.0, 'civil': 1.0, 'gay': 2.0, 'who': 1.0, '200': 1.0, 'raised': 1.0, 'industry': 1.0, 'film': 1.0, 'parties': 1.0, 'oscar': 1.0, 'partnership': 1.0, 'highestprofile': 1.0, 'which': 1.0, 'hosting': 1.0, 'later': 1.0, 'foundation': 2.0, 'established': 1.0, '1988': 1.0, '1992': 1.0, '1980s': 1.0, 'against': 1.0, 'fight': 1.0, 'heavily': 1.0, 'marriage': 2.0, '2012he': 1.0, 'buckingham': 1.0, 'outside': 1.0, 'and': 15.0, 'queens': 1.0, 'at': 4.0, '10': 1.0, '2002': 1.0, 'palace': 2.0, 'abbey': 1.0, 'hall': 2.0, 'royal': 1.0, 'services': 2.0, '1998': 1.0, 'charitable': 1.0, 'ii': 1.0, 'elizabeth': 1.0, 'empire': 1.0, 'year': 1.0, 'commander': 1.0, 'westminster': 1.0, '1996': 1.0, 'single': 2.0, 'named': 1.0, 'been': 3.0, 'songwriters': 2.0, '100': 3.0, '1994': 1.0, 'into': 3.0, 'overallelton': 1.0, 'wed': 1.0, 'male': 1

## Find the most common words in the Elton John article

In [8]:
john.stack('word_count',new_column_name=['word','count'])

URI,name,text,word,count
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,social,1.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,champion,1.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,be,1.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,2014,1.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,legal,1.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,became,1.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,2005,1.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,december,2.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,21,2.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,furnish,2.0


In [9]:
john_word_count_table = john[['word_count']].stack('word_count', new_column_name = ['word','count'])

In [10]:
john_word_count_table

word,count
social,1.0
champion,1.0
be,1.0
2014,1.0
legal,1.0
became,1.0
2005,1.0
december,2.0
21,2.0
furnish,2.0


In [11]:
john_word_count_table.sort('count',ascending=False)

word,count
the,27.0
in,18.0
and,15.0
of,13.0
a,10.0
has,9.0
john,7.0
he,7.0
on,6.0
award,5.0


## By word count, the most common words are the, in, and, of, a (top 3: the, in, and)
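
What `stack` followed by `sort` does here can be sketched in plain Python: turn the {word: count} dictionary into one (word, count) row per entry, then order the rows by count. The dictionary below is a small subset of the counts shown in the outputs above; the variable names are illustrative.

```python
# A few of the counts from the outputs above (subset, for illustration).
word_count = {"social": 1.0, "furnish": 2.0, "and": 15.0, "the": 27.0, "in": 18.0}

# "Stack": one (word, count) row per dictionary entry.
rows = list(word_count.items())

# "Sort": highest count first.
rows.sort(key=lambda row: row[1], reverse=True)

top3 = [word for word, _ in rows[:3]]
print(top3)  # ['the', 'in', 'and']
```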

# Compute TF-IDF for the entire corpus of articles

In [12]:
people['word_count'] = turicreate.text_analytics.count_words(people['text'])

In [13]:
people['tfidf'] = turicreate.text_analytics.tf_idf(people['text'])

In [14]:
people

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'melbourne': 1.0, 'college': 1.0, 'para ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'time': 1.0, 'each': 1.0, 'rhythms': 1.0, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'time': 1.0, 'honored': 1.0, 'maple': 1.0, ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'kurdlawitzpreis': 1.0, 'this': 1.0, 'occasion': ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'curtis': 1.0, 'promo': 1.0, '2007': 1.0, 'ce ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'journal': 1.0, 'niblit': 1.0, ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'including': 1.0, 'artists': 1.0, 'local': ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'concordia': 1.0, 'creative': 1.0, ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'knuckles': 1.0, 'simply': 1.0, 'brand': ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'n3': 1.0, '2002': 1.0, 'harvard': 1.0, 'tria ..."

tfidf
"{'melbourne': 3.8914310119380633, ..."
"{'time': 1.3253342074200498, ..."
"{'time': 1.3253342074200498, ..."
"{'kurdlawitzpreis': 10.986495389225194, ..."
"{'curtis': 5.299520032885375, ..."
"{'journal': 3.025473923341824, ..."
"{'including': 1.2272824458461182, ..."
"{'concordia': 6.250296940830698, ..."
"{'knuckles': 8.042056410058754, ..."
"{'n3': 10.293348208665249, ..."


## Examine the TF-IDF for the Elton John article

In [19]:
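# Re-select the row so that it includes the new word_count and tfidf columns.
john = people[people['name'] == 'Elton John']
# Stack the tfidf dictionary into (word, tfidf) rows and sort by score, highest first.
john[['tfidf']].stack('tfidf', new_column_name=['word', 'tfidf']).sort('tfidf', ascending=False)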

word,tfidf
furnish,18.38947183999428
elton,17.482320270031995
billboard,17.30368095754203
john,13.93931279239831
songwriters,11.25040644703154
tonightcandle,10.986495389225194
overallelton,10.986495389225194
19702000,10.293348208665249
fivedecade,10.293348208665249
aids,10.262846934045534


## The 3 words with the highest TF-IDF are furnish, elton, billboard
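
The scores above can be sanity-checked with a little arithmetic. This sketch assumes (the notebook does not state it) that each score has the form count × ln(N / df), where N is the number of articles and df is the number of articles containing the word, and that 'tonightcandle' and 'fivedecade' each occur once in the article. The count of 2.0 for 'furnish' comes from the word count output earlier.

```python
import math

# Values copied from the outputs above.
tfidf_tonightcandle = 10.986495389225194
tfidf_fivedecade = 10.293348208665249
tfidf_furnish = 18.38947183999428
count_furnish = 2.0

# Assumption: score = count * ln(N / df).
# If 'tonightcandle' occurs once and in this article only (df = 1), its score is ln(N).
n_docs = math.exp(tfidf_tonightcandle)

# Implied number of articles containing each of the other words.
df_fivedecade = n_docs / math.exp(tfidf_fivedecade)
df_furnish = n_docs / math.exp(tfidf_furnish / count_furnish)

print(round(n_docs), round(df_fivedecade), round(df_furnish))  # 59071 2 6
```

Under these assumptions, 'furnish' ranks first because it is both repeated in the article and found in only a handful of articles overall, whereas 'the' appears in nearly every article and so receives a weight near zero.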

# Task 2: Measure the distance between articles

In [16]:
beckham = people[ people['name']=='Victoria Beckham']

In [17]:
beckham['tfidf'][0]

{'new': 0.8871532656125274,
 'biannual': 7.552508184740048,
 'ticket': 5.142950972193835,
 'clamour': 9.887883100557085,
 'scathing': 7.990763115671204,
 'celebrity': 4.564873121418676,
 'significant': 3.7624705809393637,
 'saying': 4.093853748053105,
 'most': 1.4186204428983973,
 'show': 2.1689013529494012,
 'successful': 2.679282762596886,
 'wag': 8.789270811888976,
 'that': 0.6614069466714981,
 'belinda': 7.202305755306933,
 'won': 1.3836400683164753,
 'business': 2.3749015223874728,
 'familys': 5.421974981902501,
 'performer': 4.073752568732018,
 'daily': 3.536415819417696,
 'star': 2.9854754279015427,
 'assessed': 6.9975113426609195,
 '2012': 1.7938099524877322,
 'year': 1.3423616371539895,
 'brand': 8.994580915799753,
 'named': 2.0300155412252816,
 '2011': 5.107041270312875,
 'diffusion': 7.1363477875151355,
 'label': 9.90513433710723,
 'eponymous': 5.810345656651365,
 'other': 1.4424007566948476,
 'collaborations': 4.808551275174594,
 'highprofile': 5.38807343022682,
 'following

In [20]:
john['tfidf'][0]

{'social': 2.6226865047083137,
 'champion': 3.176548302748404,
 'be': 1.4062480045415613,
 '2014': 2.2073995783446634,
 'legal': 3.4243337579995425,
 'became': 1.3300599330549516,
 '2005': 1.6425861253275964,
 'december': 4.00285165915879,
 '21': 5.594501726978586,
 'furnish': 18.38947183999428,
 'david': 2.4512658353228582,
 'civil': 3.3244978303233013,
 'gay': 8.685411312155043,
 'who': 0.9098952189804214,
 '200': 3.8524016680323285,
 'raised': 3.059531844362216,
 'industry': 2.9570625486439512,
 'film': 2.033113917057952,
 'parties': 4.495771854722687,
 'oscar': 4.781937626656504,
 'partnership': 4.1599501656686,
 'highestprofile': 8.789270811888976,
 'which': 0.7674309670437692,
 'hosting': 4.806478735572622,
 'later': 1.4294496043477696,
 'foundation': 5.428871534029564,
 'established': 3.0759047769687164,
 '1988': 2.4491074905234376,
 '1992': 2.278351314316948,
 '1980s': 2.9688582293167167,
 'against': 2.0079609791418744,
 'fight': 4.274754994169014,
 'heavily': 4.584578192498009

In [21]:
mcCartney = people[people['name'] == 'Paul McCartney']

In [22]:
turicreate.distances.cosine(john['tfidf'][0],beckham['tfidf'][0])

0.9567006376655429

In [23]:
turicreate.distances.cosine(john['tfidf'][0],mcCartney['tfidf'][0])

0.8250310029221779

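## The distance to Paul McCartney (0.825) is shorter than the distance to Victoria Beckham (0.957), so Paul McCartney's article is closer to Elton John's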

# Task 3: Building nearest neighbors models with different input features and by setting the distance metric

# Apply nearest neighbors for retrieval of Wikipedia articles

## Build the NN models with word counts and with TF-IDF

In [25]:
# Nearest neighbors model on raw word counts, using cosine distance.
word_count_knn_model = turicreate.nearest_neighbors.create(people, features=['word_count'], label='name', distance='cosine')

In [26]:
# Nearest neighbors model on TF-IDF weights, using cosine distance.
tfidf_knn_model = turicreate.nearest_neighbors.create(people, features=['tfidf'], label='name', distance='cosine')

## What’s the most similar article, other than itself, to the one on ‘Elton John’ using word count features? Answer: Cliff Richard

In [28]:
word_count_knn_model.query(john)

query_label,reference_label,distance,rank
0,Elton John,2.220446049250313e-16,1
0,Cliff Richard,0.1614241525896703,2
0,Sandro Petrone,0.1682254275104111,3
0,Rod Stewart,0.168327165587061,4
0,Malachi O'Doherty,0.177315545978884,5


## What’s the most similar article, other than itself, to the one on ‘Elton John’ using TF-IDF features? Answer: Rod Stewart

In [29]:
tfidf_knn_model.query(john)

query_label,reference_label,distance,rank
0,Elton John,-2.220446049250313e-16,1
0,Rod Stewart,0.7172196678927374,2
0,George Michael,0.7476009989692847,3
0,Sting (musician),0.7476719544306141,4
0,Phil Collins,0.7511932487904706,5
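
To compare the two models on the same query, the neighbor lists can be lined up in plain Python. The names below are copied from the two query outputs above; the helper `drop_self` is a hypothetical name used only for this illustration.

```python
# Neighbor lists copied from the two Elton John query outputs above
# (rank 1 is the query article itself).
word_count_neighbors = ["Elton John", "Cliff Richard", "Sandro Petrone",
                        "Rod Stewart", "Malachi O'Doherty"]
tfidf_neighbors = ["Elton John", "Rod Stewart", "George Michael",
                   "Sting (musician)", "Phil Collins"]

def drop_self(neighbors, query_name):
    """Remove the query article so the first entry is the closest other article."""
    return [name for name in neighbors if name != query_name]

wc = drop_self(word_count_neighbors, "Elton John")
tf = drop_self(tfidf_neighbors, "Elton John")

print(wc[0])              # Cliff Richard
print(tf[0])              # Rod Stewart
print(set(wc) & set(tf))  # {'Rod Stewart'}
```

Only one article appears in both top-5 lists, which shows how strongly the choice of features changes the retrieved neighbors.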


## What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using word count features? Answer: Mary Fitzgerald (artist)

In [30]:
word_count_knn_model.query(beckham)

query_label,reference_label,distance,rank
0,Victoria Beckham,-2.220446049250313e-16,1
0,Mary Fitzgerald (artist),0.2073070361150499,2
0,Adrienne Corri,0.2145097827875479,3
0,Beverly Jane Fry,0.2174664687407927,4
0,Raman Mundair,0.2176954749915048,5


## What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using TF-IDF features? Answer: David Beckham

In [31]:
tfidf_knn_model.query(beckham)

query_label,reference_label,distance,rank
0,Victoria Beckham,1.1102230246251563e-16,1
0,David Beckham,0.5481696102632145,2
0,Stephen Dow Beckham,0.7849867068283364,3
0,Mel B,0.8095855234085036,4
0,Caroline Rush,0.81982642291868,5
