# Nearest Neighbors

In [33]:
import graphlab
import matplotlib.pyplot as plt

In [2]:
wiki = graphlab.SFrame("E:\\Machine Learning\\U.W\\Cluster and Retrieval\\people_wiki.gl/")



In [3]:
wiki.head()

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...


## Extract word count vectors

In [4]:
wiki["word_count"] = graphlab.text_analytics.count_words(wiki["text"])
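For readers without GraphLab Create, the word-count representation can be approximated with the standard library. The sketch below assumes whitespace tokenization of text that is already lower-cased and free of punctuation (as the `text` column appears to be). `count_words_sketch` is a hypothetical helper, not a GraphLab function, and GraphLab's own tokenizer may differ in details.

```python
from collections import Counter


def count_words_sketch(text):
    """Return a {word: count} dict for whitespace-separated text (hypothetical helper)."""
    return dict(Counter(text.split()))


counts = count_words_sketch("the senator met the president in the senate")
print(counts["the"])     # "the" occurs three times in the sample sentence
print(counts["senate"])  # "senate" occurs once
```

Each article then becomes a sparse vector: only words that actually occur are stored, and every other word in the vocabulary is implicitly zero.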

In [5]:
wiki

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'since': 1L, 'carltons': 1L, 'being': 1L, '2005': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1L, 'thomas': 1L, 'closely': 1L, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1L, 'issued': 1L, 'mainly': 1L, ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1L, 'bauforschung': 1L, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'legendary': 1L, 'gangstergenka': 1L, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'now': 1L, 'currently': 1L, 'less': 1L, 'being': ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2L, 'producer': 1L, 'tribe': ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1L, 'salon': 1L, 'gangs': 1L, 'being': ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1L, 'frankie': 1L, 'labels': ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1L, 'deborash': 1L, ..."


## Find nearest neighbors

Let's start by finding the nearest neighbors of the Barack Obama page using the word count vectors to represent the articles and Euclidean distance to measure distance.

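Euclidean distance between an article $\mathbf{x}_i$ and the query $\mathbf{x}_q$:
$$
d(\mathbf{x}_i, \mathbf{x}_q) = \|\mathbf{x}_i - \mathbf{x}_q\|_2 = \sqrt{\sum_j \left(x_{i,j} - x_{q,j}\right)^2}
$$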
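To make the formula concrete, here is a minimal sketch of Euclidean distance between two sparse vectors stored as `{word: value}` dictionaries, the same shape as the `word_count` column. `euclidean_sketch` is a hypothetical name used only for illustration; the notebook itself relies on GraphLab's built-in distance.

```python
import math


def euclidean_sketch(x, y):
    """Euclidean distance between two sparse vectors stored as {word: value} dicts."""
    words = set(x) | set(y)  # a word missing from one dict counts as 0 there
    return math.sqrt(sum((x.get(w, 0.0) - y.get(w, 0.0)) ** 2 for w in words))


a = {"the": 4, "obama": 3}
b = {"the": 4, "governor": 4}
print(euclidean_sketch(a, b))  # sqrt(0**2 + 3**2 + 4**2) = 5.0
```

Note that the shared word "the" contributes nothing here because both counts are equal, while words present in only one article contribute their full squared count.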

In [6]:
model = graphlab.nearest_neighbors.create(wiki, label = "name", features = ["word_count"],
                                         method = "brute_force", distance = "euclidean")

Let's look at the top 10 nearest neighbors by performing the following query:

In [7]:
model.query(wiki[wiki["name"] == "Barack Obama"], label = "name", k = 10)

query_label,reference_label,distance,rank
Barack Obama,Barack Obama,0.0,1
Barack Obama,Joe Biden,33.0756708171,2
Barack Obama,George W. Bush,34.3947670438,3
Barack Obama,Lawrence Summers,36.1524549651,4
Barack Obama,Mitt Romney,36.1662826401,5
Barack Obama,Francisco Barrio,36.3318042492,6
Barack Obama,Walter Mondale,36.4005494464,7
Barack Obama,Wynn Normington Hugh-Jones ...,36.4965751818,8
Barack Obama,Don Bonker,36.633318168,9
Barack Obama,Andy Anstett,36.9594372252,10


Nearest neighbors with raw word counts got some things right, showing all politicians in the query result, but missed finer and important details.

For instance, let's find out why Francisco Barrio was considered a close neighbor of Obama. To do this, let's look at the most frequently used words in Barack Obama's and Francisco Barrio's pages:

In [8]:
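def top_words(name):
    # Select the article's row, then unpack its word_count dict into (word, count) rows.
    row = wiki[wiki["name"] == name]
    word_count_table = row[["word_count"]].stack("word_count", new_column_name = ["word", "count"])

    # Most frequent words first.
    return word_count_table.sort("count", ascending = False)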

In [9]:
obama_words = top_words("Barack Obama")

obama_words

word,count
the,40
in,30
and,21
of,18
to,14
his,11
obama,9
act,8
a,7
he,7


In [10]:
barrio_words = top_words("Francisco Barrio")

barrio_words

word,count
the,36
of,24
and,18
in,17
he,10
to,9
chihuahua,7
a,6
governor,6
his,5


In [11]:
combined_words = obama_words.join(barrio_words, on = "word")

combined_words

word,count,count.1
the,40,36
in,30,17
and,21,18
of,18,24
to,14,9
his,11,5
a,7,6
he,7,10
as,6,5
was,5,4


In [12]:
combined_words = combined_words.rename({"count": "Obama", "count.1": "Barrio"})

combined_words

word,Obama,Barrio
the,40,36
in,30,17
and,21,18
of,18,24
to,14,9
his,11,5
a,7,6
he,7,10
as,6,5
was,5,4


In [13]:
combined_words.sort("Obama", ascending = False)

word,Obama,Barrio
the,40,36
in,30,17
and,21,18
of,18,24
to,14,9
his,11,5
a,7,6
he,7,10
as,6,5
was,5,4


**Q1: Among the words that appear in both Barack Obama and Francisco Barrio, take the 5 that appear most frequently in Obama. How many of the articles in the Wikipedia dataset contain all of those 5 words?**

In [14]:
common_words = set(combined_words.sort("Obama", ascending = False)["word"][:5])

def has_top_words(word_count_vector):
    unique_words = set(word_count_vector.keys())
    
    return common_words.issubset(unique_words)


wiki["has_top_words"] = wiki["word_count"].apply(has_top_words)

wiki["has_top_words"].sum()

56066L

**Q2: Measure the pairwise distance between the Wikipedia pages of Barack Obama, George W. Bush, and Joe Biden. Which of the three pairs has the smallest distance?**

In [15]:
Obama_word = wiki[wiki["name"] == "Barack Obama"]["word_count"][0]
Bush_word = wiki[wiki["name"] == "George W. Bush"]["word_count"][0]
Biden_word = wiki[wiki["name"] == "Joe Biden"]["word_count"][0]

Obama_Bush = graphlab.toolkits.distances.euclidean(Obama_word, Bush_word)
Obama_Biden = graphlab.toolkits.distances.euclidean(Obama_word, Biden_word)
Bush_Biden = graphlab.toolkits.distances.euclidean(Bush_word, Biden_word)

print Obama_Bush, Obama_Biden, Bush_Biden

34.3947670438 33.0756708171 32.7566787083


**Q3: Collect all words that appear both in Barack Obama and George W. Bush pages. Out of those words, find the 10 words that show up most often in Obama's page.**

In [16]:
obama_words = top_words("Barack Obama")
bush_words = top_words("George W. Bush")

combined_words = obama_words.join(bush_words, on = "word")
combined_words.rename({"count":"Obama", "count.1":"Bush"})

combined_words.sort("Obama", ascending = False)[:10]

word,Obama,Bush
the,40,39
in,30,22
and,21,14
of,18,14
to,14,11
his,11,6
act,8,3
he,7,8
a,7,6
as,6,6


## TF-IDF to the rescue

Many of the perceived commonalities between Obama and Barrio were due to occurrences of extremely frequent words, such as "the", "and", and "his". As a result, nearest neighbors sometimes recommends plausible results for the wrong reasons.

To retrieve articles that are more relevant, we should focus more on rare words that don't occur in every article. TF-IDF (term frequency–inverse document frequency) is a feature representation that penalizes words that are too common.
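As a rough illustration of how such a penalty can work, the sketch below assumes the common weighting `count * log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the word. This formula is an assumption for illustration; the notebook does not show the exact formula GraphLab uses, and `tf_idf_sketch` is a hypothetical helper.

```python
import math


def tf_idf_sketch(word_counts):
    """word_counts: list of {word: count} dicts, one per document.

    Returns a list of {word: weight} dicts using count * log(N / df).
    """
    n_docs = len(word_counts)
    doc_freq = {}
    for counts in word_counts:
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return [
        {w: c * math.log(n_docs / float(doc_freq[w])) for w, c in counts.items()}
        for counts in word_counts
    ]


docs = [
    {"the": 5, "obama": 2},
    {"the": 3, "governor": 1},
    {"the": 4, "senate": 1, "obama": 1},
]
weights = tf_idf_sketch(docs)
print(weights[0])  # "the" appears in every document, so its weight is 0.0
```

Under this weighting, a word that appears in every document gets weight zero regardless of how often it is used, while a word confined to a few documents keeps a large weight.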

In [17]:
wiki["tf_idf"] = graphlab.text_analytics.tf_idf(wiki["word_count"])

In [18]:
wiki.head()

URI,name,text,word_count,has_top_words
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'since': 1L, 'carltons': 1L, 'being': 1L, '2005': ...",1
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1L, 'thomas': 1L, 'closely': 1L, ...",1
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1L, 'issued': 1L, 'mainly': 1L, ...",1
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1L, 'bauforschung': 1L, ...",1
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'legendary': 1L, 'gangstergenka': 1L, ...",0
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'now': 1L, 'currently': 1L, 'less': 1L, 'being': ...",0
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2L, 'producer': 1L, 'tribe': ...",1
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1L, 'salon': 1L, 'gangs': 1L, 'being': ...",1
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1L, 'frankie': 1L, 'labels': ...",1
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1L, 'deborash': 1L, ...",1

tf_idf
"{'since': 1.455376717308041, ..."
"{'precise': 6.44320060695519, ..."
"{'just': 2.7007299687108643, ..."
"{'all': 1.6431112434912472, ..."
"{'legendary': 4.280856294365192, ..."
"{'now': 1.96695239252401, 'currently': ..."
"{'exclusive': 10.455187230695827, ..."
"{'taxi': 6.0520214560945025, ..."
"{'houston': 3.935505942157149, ..."
"{'phenomenon': 5.750053426395245, ..."


In [19]:
model_tf_idf = graphlab.nearest_neighbors.create(wiki, label = "name", features = ["tf_idf"],
                                                method = "brute_force", distance = "euclidean")

In [20]:
model_tf_idf

Class                          : NearestNeighborsModel

Attributes
----------
Method                         : brute_force
Number of distance components  : 1
Number of examples             : 59071
Number of feature columns      : 1
Number of unpacked features    : 547979
Total training time (seconds)  : 5.3773

In [21]:
model_tf_idf.query(wiki[wiki["name"] == "Barack Obama"], label = "name", k = 10)

query_label,reference_label,distance,rank
Barack Obama,Barack Obama,0.0,1
Barack Obama,Phil Schiliro,106.861013691,2
Barack Obama,Jeff Sessions,108.871674216,3
Barack Obama,Jesse Lee (politician),109.045697909,4
Barack Obama,Samantha Power,109.108106165,5
Barack Obama,Bob Menendez,109.781867105,6
Barack Obama,Eric Stern (politician),109.95778808,7
Barack Obama,James A. Guest,110.413888718,8
Barack Obama,Roland Grossenbacher,110.4706087,9
Barack Obama,Tulsi Gabbard,110.696997999,10


Let's determine whether this list makes sense.
* With a notable exception of Roland Grossenbacher, the other 8 are all American politicians who are contemporaries of Barack Obama.
* Phil Schiliro, Jesse Lee, Samantha Power, and Eric Stern worked for Obama.

Clearly, the results are more plausible with the use of TF-IDF.  Notice that TF-IDF representation assigns a weight to each word. This weight captures relative importance of that word in the document.

In [22]:
def top_words_tf_idf(name):
    row = wiki[wiki["name"] == name]
    word_count_table = row[["tf_idf"]].stack("tf_idf", new_column_name = ["word", "weight"])
    
    return word_count_table.sort("weight", ascending = False)

In [23]:
obama_tf_idf = top_words_tf_idf("Barack Obama")

obama_tf_idf

word,weight
obama,43.2956530721
act,27.678222623
iraq,17.747378588
control,14.8870608452
law,14.7229357618
ordered,14.5333739509
military,13.1159327785
involvement,12.7843852412
response,12.7843852412
democratic,12.4106886973


In [24]:
schiliro_tf_idf = top_words_tf_idf("Phil Schiliro")

schiliro_tf_idf

word,weight
schiliro,21.9729907785
staff,15.8564416352
congressional,13.5470876563
daschleschiliro,10.9864953892
obama,9.62125623824
waxman,9.04058524017
president,9.03358661416
2014from,8.68391029623
law,7.36146788088
consultant,6.91310403725


In [25]:
combined_words = obama_tf_idf.join(schiliro_tf_idf, on = "word")
combined_words.rename({"weight": "Obama", "weight.1": "Schiliro"})

combined_words.sort("Obama", ascending = False)

word,Obama,Schiliro
obama,43.2956530721,9.62125623824
law,14.7229357618,7.36146788088
democratic,12.4106886973,6.20534434867
senate,10.1642881797,3.3880960599
presidential,7.3869554189,3.69347770945
president,7.22686929133,9.03358661416
policy,6.09538628214,3.04769314107
states,5.47320098963,1.82440032988
office,5.24817282322,2.62408641161
2011,5.10704127031,3.40469418021


**Q4: Among the words that appear in both Barack Obama and Phil Schiliro, take the 5 that have largest weights in Obama. How many of the articles in the Wikipedia dataset contain all of those 5 words?**

In [26]:
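# Sort explicitly: the sort in the previous cell was displayed but never assigned,
# so combined_words itself is not guaranteed to be ordered by Obama's weights.
common_words = set(combined_words.sort("Obama", ascending = False)["word"][:5])

def has_top_words(word_count_vector):
    unique_words = set(word_count_vector.keys())
    
    return common_words.issubset(unique_words)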


wiki["has_top_words"] = wiki["word_count"].apply(has_top_words)

wiki["has_top_words"].sum()

14L

Notice the huge difference in this calculation using TF-IDF scores instead of raw word counts. We've eliminated noise arising from extremely common words.

## Choosing metrics

You may wonder why Joe Biden, Obama's running mate in two presidential elections, is missing from the query results of model_tf_idf. Let's find out why. First, compute the distance between TF-IDF features of Obama and Biden.

**Q5: Compute the Euclidean distance between TF-IDF features of Obama and Biden.**

In [27]:
obama_words = wiki[wiki["name"] == "Barack Obama"]["tf_idf"][0]
biden_words = wiki[wiki["name"] == "Joe Biden"]["tf_idf"][0]

graphlab.toolkits.distances.euclidean(obama_words, biden_words)

123.29745600964296

The distance is larger than the distances we found for the 10 nearest neighbors, which we repeat here for readability:

In [28]:
model_tf_idf.query(wiki[wiki['name'] == 'Barack Obama'], label='name', k=10)

query_label,reference_label,distance,rank
Barack Obama,Barack Obama,0.0,1
Barack Obama,Phil Schiliro,106.861013691,2
Barack Obama,Jeff Sessions,108.871674216,3
Barack Obama,Jesse Lee (politician),109.045697909,4
Barack Obama,Samantha Power,109.108106165,5
Barack Obama,Bob Menendez,109.781867105,6
Barack Obama,Eric Stern (politician),109.95778808,7
Barack Obama,James A. Guest,110.413888718,8
Barack Obama,Roland Grossenbacher,110.4706087,9
Barack Obama,Tulsi Gabbard,110.696997999,10


But one may wonder, is Biden's article that different from Obama's, more so than, say, Schiliro's? It turns out that, when we compute nearest neighbors using the Euclidean distances, we unwittingly favor short articles over long ones. Let us compute the length of each Wikipedia document, and examine the document lengths for the 100 nearest neighbors to Obama's page.

In [29]:
def compute_length(row):
    return len(row["text"].split(" "))

wiki["length"] = wiki.apply(compute_length)

In [30]:
wiki.head()

URI,name,text,word_count,has_top_words
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'since': 1L, 'carltons': 1L, 'being': 1L, '2005': ...",0
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1L, 'thomas': 1L, 'closely': 1L, ...",0
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1L, 'issued': 1L, 'mainly': 1L, ...",0
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1L, 'bauforschung': 1L, ...",0
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'legendary': 1L, 'gangstergenka': 1L, ...",0
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'now': 1L, 'currently': 1L, 'less': 1L, 'being': ...",0
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2L, 'producer': 1L, 'tribe': ...",0
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1L, 'salon': 1L, 'gangs': 1L, 'being': ...",0
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1L, 'frankie': 1L, 'labels': ...",0
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1L, 'deborash': 1L, ...",0

tf_idf,length
"{'since': 1.455376717308041, ...",251
"{'precise': 6.44320060695519, ...",223
"{'just': 2.7007299687108643, ...",226
"{'all': 1.6431112434912472, ...",377
"{'legendary': 4.280856294365192, ...",201
"{'now': 1.96695239252401, 'currently': ...",270
"{'exclusive': 10.455187230695827, ...",440
"{'taxi': 6.0520214560945025, ...",633
"{'houston': 3.935505942157149, ...",248
"{'phenomenon': 5.750053426395245, ...",210


In [55]:
nearest_neighbors_euclidean = model_tf_idf.query(wiki[wiki["name"] == "Barack Obama"], label = "name", k = 100)
nearest_neighbors_euclidean = nearest_neighbors_euclidean.join(wiki[["name", "length"]], on = {"reference_label": "name"})

In [56]:
nearest_neighbors_euclidean.sort("rank")

query_label,reference_label,distance,rank,length
Barack Obama,Barack Obama,0.0,1,540
Barack Obama,Phil Schiliro,106.861013691,2,208
Barack Obama,Jeff Sessions,108.871674216,3,230
Barack Obama,Jesse Lee (politician),109.045697909,4,216
Barack Obama,Samantha Power,109.108106165,5,310
Barack Obama,Bob Menendez,109.781867105,6,220
Barack Obama,Eric Stern (politician),109.95778808,7,255
Barack Obama,James A. Guest,110.413888718,8,215
Barack Obama,Roland Grossenbacher,110.4706087,9,201
Barack Obama,Tulsi Gabbard,110.696997999,10,228


Relative to the rest of Wikipedia, the nearest neighbors of Obama are overwhelmingly short, most of them shorter than 300 words. This bias towards short articles is not appropriate in this application, as there is no reason to favor short articles over long ones.

**Note:** Both word-count features and TF-IDF are proportional to word frequencies. While TF-IDF penalizes very common words, longer articles tend to have longer TF-IDF vectors simply because they have more words in them.
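A toy example may help show why long vectors hurt under Euclidean distance. The weights below are made up for illustration and are not taken from the Wikipedia data; `euclidean` is a small local helper, not GraphLab's function. The query is a long article, one candidate is a long article on the same topic, and the other is a very short article on an unrelated topic.

```python
import math


def euclidean(x, y):
    """Euclidean distance between sparse {word: weight} dicts (missing words count as 0)."""
    words = set(x) | set(y)
    return math.sqrt(sum((x.get(w, 0.0) - y.get(w, 0.0)) ** 2 for w in words))


query = {"obama": 6.0, "senate": 8.0}                         # long article
related_long = {"obama": 6.0, "senate": 8.0, "biden": 12.0}   # same topic, extra material
unrelated_short = {"football": 1.0}                           # different topic, tiny vector

print(euclidean(query, related_long))     # 12.0, driven entirely by the extra word
print(euclidean(query, unrelated_short))  # sqrt(36 + 64 + 1), roughly 10.05
```

The unrelated short article ends up closer than the related long one, simply because it has very little mass to disagree with the query.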

To remove this bias, we turn to **cosine distances**:
$$
d(\mathbf{x},\mathbf{y}) = 1 - \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}
$$
Cosine distances let us compare word distributions of two articles of varying lengths.
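The formula can be sketched directly on sparse `{word: weight}` dictionaries. `cosine_distance_sketch` is a hypothetical helper for illustration, it assumes neither vector is all zeros, and the weights are made up rather than taken from the dataset.

```python
import math


def cosine_distance_sketch(x, y):
    """1 - cos(angle) between sparse vectors stored as {word: weight} dicts.

    Assumes both vectors are non-zero; a zero vector would divide by zero.
    """
    dot = sum(v * y.get(w, 0.0) for w, v in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return 1.0 - dot / (norm_x * norm_y)


doc = {"obama": 3.0, "senate": 4.0}
doubled = {w: 2 * v for w, v in doc.items()}  # same proportions, twice the magnitude
other = {"obama": 4.0, "governor": 3.0}

print(cosine_distance_sketch(doc, doubled))  # 0.0: scaling a vector does not change the angle
print(cosine_distance_sketch(doc, other))    # 1 - 12/25, about 0.52
```

Because both norms appear in the denominator, multiplying either vector by a positive constant leaves the distance unchanged, which is what removes the length bias.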

Let us train a new nearest neighbor model, this time with cosine distances.  We then repeat the search for Obama's 100 nearest neighbors.

In [62]:
model2_tf_idf = graphlab.nearest_neighbors.create(wiki, label = "name", features = ["tf_idf"],
                                                 method = "brute_force", distance = "cosine")

In [64]:
nearest_neighbors_cosine = model2_tf_idf.query(wiki[wiki["name"] == "Barack Obama"], label = "name", k = 100)
nearest_neighbors_cosine = nearest_neighbors_cosine.join(wiki[["name", "length"]], on = {"reference_label": "name"})

In [65]:
nearest_neighbors_cosine.sort("rank")

query_label,reference_label,distance,rank,length
Barack Obama,Barack Obama,0.0,1,540
Barack Obama,Joe Biden,0.703138676734,2,414
Barack Obama,Samantha Power,0.742981902328,3,310
Barack Obama,Hillary Rodham Clinton,0.758358397887,4,580
Barack Obama,Eric Stern (politician),0.770561227601,5,255
Barack Obama,Robert Gibbs,0.784677504751,6,257
Barack Obama,Eric Holder,0.788039072943,7,232
Barack Obama,Jesse Lee (politician),0.790926415366,8,216
Barack Obama,Henry Waxman,0.798322602893,9,279
Barack Obama,Joe the Plumber,0.799466360042,10,217


From a glance at the above table, things look better. Indeed, the 100 nearest neighbors under cosine distance span a range of document lengths, rather than being dominated by short articles as they were under Euclidean distance.

## Problem with cosine distances: tweets vs. long articles

Cosine distances ignore all document lengths, which may be great in certain situations but not in others. For instance, consider the following (admittedly contrived) example.

```
+--------------------------------------------------------+
|                                             +--------+ |
|  One that shall not be named                | Follow | |
|  @username                                  +--------+ |
|                                                        |
|  Democratic governments control law in response to     |
|  popular act.                                          |
|                                                        |
|  8:05 AM - 16 May 2016                                 |
|                                                        |
|  Reply   Retweet (1,332)   Like (300)                  |
|                                                        |
+--------------------------------------------------------+
```

How similar is this tweet to Barack Obama's Wikipedia article? Let's transform the tweet into TF-IDF features, using an encoder fit to the Wikipedia dataset. (That is, let's treat this tweet as an article in our Wikipedia dataset and see what happens.)

In [74]:
sf = graphlab.SFrame({"text": ["democratic governments control law in response to popular act"]})
sf["word_count"] = graphlab.text_analytics.count_words(sf["text"])

encoder = graphlab.feature_engineering.TFIDF(features = ["word_count"], output_column_prefix = "tf_idf")
encoder.fit(wiki)
sf = encoder.transform(sf)
sf

text,word_count,tf_idf.word_count
democratic governments control law in response ...,"{'control': 1L, 'democratic': 1L, 'act': ...","{'control': 3.721765211295327, ..."


Let's look at the TF-IDF vectors for this tweet and for Barack Obama's Wikipedia entry, just to visually see their differences.

In [77]:
tweet_tf_idf = sf[0]["tf_idf.word_count"]

tweet_tf_idf

{'act': 3.4597778278724887,
 'control': 3.721765211295327,
 'democratic': 3.1026721743330414,
 'governments': 4.167571323949673,
 'in': 0.0009654063501214492,
 'law': 2.4538226269605703,
 'popular': 2.764478952022998,
 'response': 4.261461747058352,
 'to': 0.04694493768179923}

In [78]:
obama = wiki[wiki["name"] == "Barack Obama"]

obama

URI,name,text,word_count,has_top_words
<http://dbpedia.org/resource/Barack_Obama> ...,Barack Obama,barack hussein obama ii brk husen bm born august ...,"{'operations': 1L, 'represent': 1L, ...",1

tf_idf,length
"{'operations': 3.811771079388818, ...",540


Now, compute the cosine distance between the Barack Obama article and this tweet:

In [80]:
obama_tf_idf = obama[0]["tf_idf"]

graphlab.toolkits.distances.cosine(obama_tf_idf, tweet_tf_idf)

0.7059183777794327

Let's compare this distance to the distances between the Barack Obama article and its 10 nearest neighbors in Wikipedia:

In [81]:
model2_tf_idf.query(obama, label = "name", k = 10)

query_label,reference_label,distance,rank
Barack Obama,Barack Obama,0.0,1
Barack Obama,Joe Biden,0.703138676734,2
Barack Obama,Samantha Power,0.742981902328,3
Barack Obama,Hillary Rodham Clinton,0.758358397887,4
Barack Obama,Eric Stern (politician),0.770561227601,5
Barack Obama,Robert Gibbs,0.784677504751,6
Barack Obama,Eric Holder,0.788039072943,7
Barack Obama,Jesse Lee (politician),0.790926415366,8
Barack Obama,Henry Waxman,0.798322602893,9
Barack Obama,Joe the Plumber,0.799466360042,10


With cosine distances, the tweet is "nearer" to Barack Obama than every article except Joe Biden's! This is probably not what we want. If someone is reading the Barack Obama Wikipedia page, would you want to recommend that they read this tweet? Ignoring article lengths completely produced nonsensical results. In practice, it is common to enforce maximum or minimum document lengths. After all, when someone is reading a long article from The Atlantic, you wouldn't recommend them a tweet.
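One simple way to enforce such length bounds is to filter candidates by length before ranking them by cosine distance. The sketch below is only an illustration under stated assumptions: the corpus, the weights, the bounds of 100 and 1000 words, and the helper names `cosine_distance` and `query_with_length_bounds` are all hypothetical and not part of GraphLab.

```python
import math


def cosine_distance(x, y):
    """Cosine distance between non-zero sparse {word: weight} dicts."""
    dot = sum(v * y.get(w, 0.0) for w, v in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return 1.0 - dot / (norm_x * norm_y)


def query_with_length_bounds(query, corpus, min_len, max_len, k):
    """corpus: list of (name, tf_idf_dict, length_in_words) tuples.

    Drops documents whose length is outside [min_len, max_len], then
    returns the names of the k nearest remaining documents.
    """
    eligible = [(name, vec) for name, vec, length in corpus
                if min_len <= length <= max_len]
    ranked = sorted(eligible, key=lambda item: cosine_distance(query, item[1]))
    return [name for name, _ in ranked[:k]]


corpus = [
    ("tweet", {"law": 1.0, "act": 1.0}, 9),
    ("long article A", {"law": 2.0, "act": 1.0, "senate": 2.0}, 414),
    ("long article B", {"football": 3.0, "club": 1.0}, 580),
]
query = {"law": 3.0, "act": 2.0, "senate": 1.0}

# Without bounds the tweet ranks first; with a 100-word minimum it is excluded.
print(query_with_length_bounds(query, corpus, min_len=0, max_len=1000, k=1))
print(query_with_length_bounds(query, corpus, min_len=100, max_len=1000, k=2))
```

Filtering first keeps the ranking logic unchanged and makes the length policy an explicit, tunable choice rather than a side effect of the distance metric.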