## 0. Preparation

In [1]:
import turicreate

In [2]:
people = turicreate.SFrame('./data/people_wiki.sframe')
people

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...


In [4]:
## word count for entire corpus:
people['word_count'] = turicreate.text_analytics.count_words(people['text'])

## TF-IDF for the entire corpus:
people['tfidf'] = turicreate.text_analytics.tf_idf(people['text'])

## Q1. Compare top words according to word counts to TF-IDF

In the notebook we covered in the module, we explored two document representations: word counts and TF-IDF.
Now, take a particular famous person, 'Elton John'. 

  - What are the 3 words in his article with the highest word counts?
  - What are the 3 words in his article with the highest TF-IDF?

These results illustrate why TF-IDF is useful for finding important words.  

*Save these results to answer the quiz at the end.*
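
To make the contrast between the two representations concrete, here is a minimal pure-Python sketch on a toy corpus. It assumes the common weighting tf × log(N / df); the helper names `count_words` and `tf_idf` and the `toy_docs` sentences are illustrative stand-ins, not the Turi Create implementation, whose tokenization and weighting details may differ.

```python
import math
from collections import Counter

def count_words(text):
    """Bag-of-words counts for one lowercased, whitespace-tokenized document."""
    return Counter(text.lower().split())

def tf_idf(docs):
    """Return one {word: tf * log(N / df)} dict per document (assumed weighting)."""
    counts = [count_words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter(word for c in counts for word in c)
    return [
        {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in c.items()}
        for c in counts
    ]

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "the singer won the award",
    "the player scored in the final",
    "the singer released an album",
]
toy_counts = count_words(toy_docs[0])
toy_tfidf = tf_idf(toy_docs)[0]
top2_tfidf = sorted(toy_tfidf, key=toy_tfidf.get, reverse=True)[:2]

# By raw count the stop word "the" ranks first. Under TF-IDF it scores 0
# because it occurs in every document, so rarer words rise to the top.
print(toy_counts.most_common(1))
print(top2_tfidf)
```

The same effect shows up at scale in the Elton John tables: very common words dominate the raw counts, while TF-IDF promotes words that are rare across the corpus.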

In [8]:
## Top word count for elton
elton = people[people['name'] == 'Elton John']

elton_word_count_table = elton[['word_count']]\
  .stack('word_count', new_column_name = ['word', 'count'])\
  .sort('count', ascending=False)
  
elton_word_count_table

word,count
the,27.0
in,18.0
and,15.0
of,13.0
a,10.0
has,9.0
he,7.0
john,7.0
on,6.0
award,5.0


In [10]:
elton_word_count_table.topk('count', 3)

word,count
the,27.0
in,18.0
and,15.0


In [11]:
## Top tf-idf for elton

elton_tfidf_table = elton[['tfidf']]\
  .stack('tfidf', new_column_name=['word', 'tfidf'])\
  .sort('tfidf', ascending=False)

elton_tfidf_table

word,tfidf
furnish,18.38947183999428
elton,17.482320270031995
billboard,17.30368095754203
john,13.93931279239831
songwriters,11.25040644703154
overallelton,10.986495389225194
tonightcandle,10.986495389225194
fivedecade,10.293348208665249
19702000,10.293348208665249
aids,10.262846934045534


In [13]:
elton_tfidf_table.topk('tfidf', 3)

word,tfidf
furnish,18.38947183999428
elton,17.482320270031995
billboard,17.30368095754203


### Quiz / Q1. Top word count words for Elton John

  - [ ] (the, john, singer)
  - [ ] (england, awards, musician)
  - [x] (the, in, and)
  - [ ] (his, the, since)
  - [ ] (rock, artists, best)

### Quiz / Q2. Top TF-IDF words for Elton John

  - [x] (furnish, elton, billboard)
  - [ ] (john, elton, fivedecade)
  - [ ] (the, of, has)
  - [ ] (awards, rock, john)
  - [ ] (elton, john, singer)



## Q2. Measuring distance

Elton John is a famous singer; let’s compute the distance between his article and those of two other famous singers. In this assignment, we will use the **cosine distance**, which is one measure of (dis)similarity between vectors, similar to the one discussed in the lectures.  
We can compute this distance using the `turicreate.distances.cosine` function.
  - What’s the cosine distance between the articles on ‘Elton John’ and ‘Victoria Beckham’?
  - What’s the cosine distance between the articles on ‘Elton John’ and ‘Paul McCartney’?
  - Which of the two is closer to Elton John?
  - Does this result make sense to you?
  
*Save these results to answer the quiz at the end.*
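
As a rough illustration of what the cosine distance measures, the sketch below computes 1 − cos(θ) for sparse vectors stored as `{word: weight}` dictionaries, the same shape as the `tfidf` column entries. The function name `cosine_distance` and the toy vectors are assumptions made for this example; it is not the Turi Create implementation and it assumes both vectors are non-zero.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle) between two non-zero sparse vectors given as dicts."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy vectors.
a = {"piano": 2.0, "album": 1.0}
b = {"piano": 4.0, "album": 2.0}   # same direction as a, twice the length
c = {"football": 3.0}              # shares no words with a

print(cosine_distance(a, b))  # approximately 0: vector length does not matter
print(cosine_distance(a, c))  # 1.0: no shared words
```

Because the result depends only on direction, a long article and a short article about the same topic can still be close, and for non-negative weights the distance lies between 0 and 1.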

In [15]:
victoria = people[people['name'] == 'Victoria Beckham']
paul = people[people['name'] == 'Paul McCartney']

In [16]:
## cosine distance elton/victoria and elton/paul
cos_dist_elton_victoria = turicreate.distances.cosine(elton['tfidf'][0], victoria['tfidf'][0])
cos_dist_elton_paul = turicreate.distances.cosine(elton['tfidf'][0], paul['tfidf'][0])

print(f"cosine distance(elton, victoria) = {cos_dist_elton_victoria:1.7f} / cosine distance(elton, paul) = {cos_dist_elton_paul:1.7f}")

cosine distance(elton, victoria) = 0.9567006 / cosine distance(elton, paul) = 0.8250310


### Quiz / Q3. The cosine distance between 'Elton John's and 'Victoria Beckham's articles (represented with TF-IDF) falls within which range?

  - [ ] 0.1 to 0.29
  - [ ] 0.3 to 0.49
  - [ ] 0.5 to 0.69
  - [ ] 0.7 to 0.89
  - [x] 0.9 to 1.0

###  Quiz / Q4. The cosine distance between 'Elton John's and 'Paul McCartney's articles (represented with TF-IDF) falls within which range?

  - [ ] 0.1 to 0.29
  - [ ] 0.3 to 0.49
  - [ ] 0.5 to 0.69
  - [x] 0.7 to 0.89
  - [ ] 0.9 to 1

###  Quiz / Q5. Who is closer to 'Elton John', 'Victoria Beckham' or 'Paul McCartney'?

  - [ ] Victoria Beckham
  - [x] Paul McCartney



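This result makes sense: Paul McCartney's article is closer to Elton John's (cosine distance 0.825 versus 0.957 for Victoria Beckham), which is plausible since both are veteran English singer-songwriters whose articles share more distinctive vocabulary.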

## Q3. Building nearest neighbors models with different input features and setting the distance metric

 In the sample notebook, we built a nearest neighbors model for retrieving articles using TF-IDF as features and using the default setting in the construction of the nearest neighbors model.  Now, you will build two nearest neighbors models:

 - Using word counts as features
 - Using TF-IDF as features

In both of these models, we are going to set the distance function to cosine distance. Here is how, when we call the function:
```python
turicreate.nearest_neighbors.create(..., distance='cosine') 
```

Now we are ready to use our model to retrieve documents.  Use these two models to collect the following results:

  - What’s the most similar article, other than itself, to the one on ‘Elton John’ using word count features?
  - What’s the most similar article, other than itself, to the one on ‘Elton John’ using TF-IDF features?
  - What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using word count features?
  - What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using TF-IDF features?
  
*Save these results to answer the quiz at the end.*
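
Conceptually, a nearest neighbors query with cosine distance ranks every document by its distance to the query and returns the closest ones. The brute-force sketch below shows that idea on toy data; `nearest_neighbors`, `cosine_distance`, and the labels in `toy_vectors` are hypothetical names for this example, and Turi Create may use a different search strategy internally.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle) between two non-zero sparse vectors given as dicts."""
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_neighbors(query_label, vectors, k=5):
    """Return the k (distance, label) pairs closest to the query document."""
    query = vectors[query_label]
    scored = [(cosine_distance(query, vec), label) for label, vec in vectors.items()]
    # Sorting tuples orders them by distance first; ties fall back to label.
    return sorted(scored)[:k]

# Hypothetical feature vectors keyed by document label.
toy_vectors = {
    "singer_a": {"song": 3.0, "album": 2.0},
    "singer_b": {"song": 2.0, "album": 2.0, "tour": 1.0},
    "athlete": {"match": 4.0, "goal": 1.0, "tour": 1.0},
}

result = nearest_neighbors("singer_a", toy_vectors, k=3)
for distance, label in result:
    print(f"{label}: {distance:.4f}")
```

The query document itself always comes back first with a distance of about 0, so the "most similar article, other than itself" is the second row of each result.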

In [17]:
knn_model_wc = turicreate.nearest_neighbors.create(people, features=['word_count'], label='name', distance='cosine')

In [18]:
knn_model_tfidf = turicreate.nearest_neighbors.create(people, features=['tfidf'], label='name', distance='cosine')

In [22]:
## most similar article using word count features for elton
most_sim_elton_wc = knn_model_wc.query(elton)
most_sim_elton_wc.sort('distance', ascending=True)

query_label,reference_label,distance,rank
0,Elton John,2.220446049250313e-16,1
0,Cliff Richard,0.1614241525896703,2
0,Sandro Petrone,0.1682254275104111,3
0,Rod Stewart,0.168327165587061,4
0,Malachi O'Doherty,0.177315545978884,5
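
The self-match in the table above has a distance of about 2.2e-16 rather than exactly 0; that value is the float64 machine epsilon, so it reflects rounding error, not a real difference (the later tables show the same magnitude with either sign). A small sketch, using the first rows of the output above, of picking the closest *other* article by excluding the query by label instead of testing for a distance of exactly zero; the variable names are illustrative:

```python
import sys

# First rows of the word-count query result shown above: (reference_label, distance).
rows = [
    ("Elton John", 2.220446049250313e-16),
    ("Cliff Richard", 0.1614241525896703),
    ("Sandro Petrone", 0.1682254275104111),
]

# The self-distance is exactly one float64 machine epsilon.
print(rows[0][1] == sys.float_info.epsilon)

# Exclude the query by label rather than by checking distance == 0.
query_name = "Elton John"
closest_other = min((r for r in rows if r[0] != query_name), key=lambda r: r[1])
print(closest_other[0])
```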


In [23]:
## most similar article using tfidf features for elton
most_sim_elton_tfidf = knn_model_tfidf.query(elton)
most_sim_elton_tfidf.sort('distance', ascending=True)

query_label,reference_label,distance,rank
0,Elton John,-2.220446049250313e-16,1
0,Rod Stewart,0.7172196678927374,2
0,George Michael,0.7476009989692848,3
0,Sting (musician),0.7476719544306141,4
0,Phil Collins,0.7511932487904706,5


In [24]:
## most similar article using word count features for victoria
most_sim_victoria_wc = knn_model_wc.query(victoria)
most_sim_victoria_wc.sort('distance', ascending=True)

query_label,reference_label,distance,rank
0,Victoria Beckham,-2.220446049250313e-16,1
0,Mary Fitzgerald (artist),0.2073070361150499,2
0,Adrienne Corri,0.2145097827875479,3
0,Beverly Jane Fry,0.2174664687407927,4
0,Raman Mundair,0.2176954749915048,5


In [25]:
## most similar article using tfidf features for victoria
most_sim_victoria_tfidf = knn_model_tfidf.query(victoria)
most_sim_victoria_tfidf.sort('distance', ascending=True)

query_label,reference_label,distance,rank
0,Victoria Beckham,1.1102230246251563e-16,1
0,David Beckham,0.5481696102632145,2
0,Stephen Dow Beckham,0.7849867068283364,3
0,Mel B,0.8095855234085036,4
0,Caroline Rush,0.81982642291868,5


###  Quiz / Q6. Who is the nearest cosine-distance neighbor to 'Elton John' using raw word counts?

  - [ ] Billy Joel
  - [x] Cliff Richard
  - [ ] Roger Daltrey
  - [ ] George Bush

###  Quiz / Q7. Who is the nearest cosine-distance neighbor to 'Elton John' using TF-IDF?

  - [ ] Roger Daltrey
  - [x] Rod Stewart
  - [ ] Tommy Haas
  - [ ] Elvis Presley

###  Quiz / Q8. Who is the nearest cosine-distance neighbor to 'Victoria Beckham' using raw word counts?

  - [ ] Stephen Dow Beckham
  - [ ] Louis Molloy
  - [ ] Adrienne Corri
  - [x] Mary Fitzgerald (artist)


###  Quiz / Q9. Who is the nearest cosine-distance neighbor to 'Victoria Beckham' using TF-IDF?

  - [ ] Mel B
  - [ ] Caroline Rush
  - [x] David Beckham
  - [ ] Carrie Reichardt