# Semantic Recommendations

Today we will do a mini-project using `NearestNeighbors` and cosine similarity to find similar articles within our database. We will work with the UCI news aggregator dataset. In this dataset, news articles are grouped into clusters that represent pages discussing the same news story. 
The dataset also includes references to web pages that, at access time, pointed (had a link) to one of the news pages in the collection. 

The dataset contains 422937 news pages, divided into: 
    
    152746 news of entertainment category 
    108465 news of science and technology category 
    115920 news of business category 
    45615 news of health category 
    
    2076 clusters of similar news for entertainment category 
    1789 clusters of similar news for science and technology category 
    2019 clusters of similar news for business category 
    1347 clusters of similar news for health category 
    
Our goal for today will be to use **spaCy word vectors** to recommend similar articles by title, and use **cosine similarity** to describe the similarity of those titles. 

In [1]:
!pip install redis

import re

import numpy as np
import pandas as pd
import redis
# `spacy.en` is the spaCy 1.x import path; later releases moved it to `spacy.lang.en`
from spacy.en import English, STOP_WORDS



#### Load the data

In [2]:
csv_path = 'https://git.generalassemb.ly/raw/DSI-SM-4/curriculum/master/lessons/7.5-Intro_to_NLP/data/uci-news-aggregator.csv?token=AAAV3gRfJliIhYRk95S4tRzrqBktbNP1ks5ZIJX5wA%3D%3D'

In [3]:
news_data = pd.read_csv(csv_path, nrows=20000)

In [4]:
news_data.head(2)

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207


#### Clean the `TITLE` column

* lemmatize or stem
* remove stop words
* remove links/html 
  * I think this is rare or non-existent in the `TITLE` column, so we may omit this. However, this is always something worth checking.

In [5]:
nlp = English()

In [6]:
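def cleaner(text):
    # Strip very short HTML-like tags such as <b> or </i>
    text = re.sub(r'<.{0,2}>','',text)
    # Keep only lowercase letters and whitespace. The text is not lowercased
    # first, so capital letters are dropped too ("Fed" becomes "ed" in the outputs below).
    text = re.sub(r'[^a-z\s]','',text)
    text = nlp(text)
    # Lemmatize each token and drop stop words
    text = [str(i.lemma_) for i in text if str(i.orth_) not in STOP_WORDS]
    text = ' '.join(text)
    return text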

In [7]:
news_data['clean_title'] = news_data['TITLE'].apply(cleaner)
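
The `clean_title` values shown later in this notebook are missing their capital letters (for example "ossy campaign debate"), because the regex runs before any lowercasing. Below is a minimal sketch of one way to avoid that, assuming we lowercase first. It uses only the standard library: `clean_title_sketch` and `DEMO_STOP_WORDS` are hypothetical names for illustration, the stop-word set is a tiny stand-in for spaCy's `STOP_WORDS`, and lemmatization is left out.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook itself uses spaCy's STOP_WORDS.
DEMO_STOP_WORDS = {"the", "for", "up", "a", "of", "to", "in"}

def clean_title_sketch(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase first, then strip non-letters, then drop stop words."""
    text = text.lower()                     # keeps capitalised words intact
    text = re.sub(r"<[^>]*>", " ", text)    # drop anything that looks like an HTML tag
    text = re.sub(r"[^a-z\s]", " ", text)   # replace punctuation and digits with spaces
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

print(clean_title_sketch("'Ban Bossy' campaign up for debate"))
# ban bossy campaign debate
```

Re-running the notebook with a cleaner like this would change the vectors and therefore the neighbors, so the outputs recorded below still reflect the original `cleaner`.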

#### Make an array of document vectors from the `clean_title` column.

* What is the shape of each vector? Of the whole array?

In [8]:
news_vecs = np.array([nlp(i).vector for i in news_data['clean_title']])

In [9]:
news_vecs.shape

(20000, 300)
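
Each row of `news_vecs` is one 300-dimensional document vector. By default, spaCy builds `Doc.vector` by averaging the vectors of the tokens in the document. The sketch below illustrates that averaging with made-up 3-dimensional vectors; `mean_vector` and the `toy` dictionary are hypothetical and only stand in for the real word vectors.

```python
def mean_vector(token_vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(token_vectors)
    dims = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dims)]

# Made-up word vectors, for illustration only
toy = {"food": [1.0, 0.0, 2.0], "label": [3.0, 2.0, 0.0]}

doc_vec = mean_vector([toy[word] for word in "food label".split()])
print(doc_vec)  # [2.0, 1.0, 1.0]
```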

#### Use `sklearn.neighbors.NearestNeighbors` to get the closest neighbors to a given string/article. Evaluate your results qualitatively.

* Do the neighbors make sense?
* Given a search term, can we use this method to recommend articles based on title?

In [10]:
from sklearn.neighbors import NearestNeighbors

In [11]:
nn = NearestNeighbors()

In [12]:
nn.fit(news_vecs)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

In [13]:
distances, indices = nn.kneighbors(nlp('politics election').vector.reshape(1,-1))

distances, indices

(array([[ 4.33650954,  4.34322755,  4.5765176 ,  4.67730428,  4.83806862]]),
 array([[ 4114, 17853, 16192,  8694,  4429]]))

In [14]:
news_data.iloc[indices[0]]

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP,clean_title
4114,4115,'Ban Bossy' campaign up for debate,http://www.wjla.com/articles/2014/03/-ban-boss...,WJLA,e,dSg2r49OAZjxU0MKNazc_QdiwDkOM,www.wjla.com,1394545645708,ossy campaign debate
17853,17854,Ellis clearly pushing a political agenda,http://thegazette.com/2014/03/17/ellis-clearly...,The Gazette\: Eastern Iowa Breaking News and H...,t,dofAdteHHf1YiVMQLSATPLN9FbG4M,thegazette.com,1395161703219,llis clearly push political agenda
16192,16193,5 things to know about Illinois' primary election,http://www.postbulletin.com/news/politics/thin...,Post-Bulletin,b,dwZ1PHdiGZX4z-M0w5bU_YumWfnlM,www.postbulletin.com,1395155803457,thing know llinois primary election
8694,8695,Gold rally continues as political tensions per...,http://www.proactiveinvestors.co.uk/companies/...,Proactive Investors UK,b,dd-ja3A1HzLxsTMnYm_Zo9l7RZAHM,www.proactiveinvestors.co.uk,1394709553145,old rally continue political tension persist
4429,4430,Libya parliament votes to bring down Zeidan go...,http://www.middle-east-online.com/english/\?id...,Middle East Online,b,dZN_cm41wSN59aMjra4bdYw_-ofgM,www.middle-east-online.com,1394561572479,ibya parliament vote bring eidan government
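
The fitted model reports `metric='minkowski'` with `p=2`, which is the Euclidean distance. As a rough picture of what `kneighbors` returns, here is a brute-force sketch in plain Python. The function name `k_nearest` and the toy vectors are assumptions for illustration; scikit-learn may use a tree-based search internally rather than this exhaustive scan.

```python
import math

def k_nearest(query, vectors, k=5):
    """Return (distances, indices) of the k vectors closest to the query."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = math.sqrt(sum((q - v) ** 2 for q, v in zip(query, vec)))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
distances, indices = k_nearest([1.0, 0.2], toy_vectors, k=2)
print(indices)  # [3, 1]
```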


In [15]:
distances, indices = nn.kneighbors(nlp('sports').vector.reshape(1,-1))

distances, indices

(array([[ 5.00590646,  5.28275595,  5.45591982,  5.50783067,  5.62286695]]),
 array([[17783, 15929, 15926, 12088, 19640]]))

In [16]:
news_data.iloc[indices[0]]

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP,clean_title
17783,17784,How local sports could change all of television,http://www.pressrepublican.com/fyi/x1984790267...,Plattsburgh Press Republican,t,dCIGhX_vQYQIzjMJqGV3FXwvx5ZEM,www.pressrepublican.com,1395161366070,ow local sport change television
15929,15930,Obamacare highlights sports injuries to enroll...,http://wnax.com/news/030030-obamacare-highligh...,WNAX,b,dUGqnyCCIa_a8XM3oYqtaQwB6dXJM,wnax.com,1395155267551,bamacare highlight sport injury enroll young
15926,15927,Obamacare campaign highlights sports injuries ...,http://www.globalpost.com/dispatch/news/thomso...,GlobalPost,b,dUGqnyCCIa_a8XM3oYqtaQwB6dXJM,www.globalpost.com,1395155266829,bamacare campaign highlight sport injury enrol...
12088,12089,"Hospital, hockey leagues push cancer testing",http://www.thesudburystar.com/2014/03/10/hospi...,The Sudbury Star,m,dCAlr6DjfkFfL6MoIPGTT8vmavAyM,www.thesudburystar.com,1394722254498,ospital hockey league push cancer testing
19640,19641,Angelina Jolie sports wings in 'Maleficent' po...,http://www.nydailynews.com/entertainment/tv-mo...,New York Daily News,e,dpDq_fv6VCa_s_MOd1VlKirtLRw1M,www.nydailynews.com,1395166996832,ngelina olie sport wing aleficent poster


#### Build a function that takes a string as its argument and returns a `DataFrame` of the index and titles of the 5 most similar articles.

* Test this function on a few search terms of your own

In [17]:
def most_similar(search):
    distances, indices = nn.kneighbors(nlp(search).vector.reshape(1,-1))
    return news_data.iloc[indices[0]][['ID','TITLE','PUBLISHER','CATEGORY']]

In [18]:
most_similar('food')

Unnamed: 0,ID,TITLE,PUBLISHER,CATEGORY
12106,12107,Changes in food labels in the works,San Francisco Chronicle,m
12102,12103,Proposed change in food labeling would affect ...,keene-equinox,m
12114,12115,Proposed food labeling revisions needed now,TriCities.com,m
12107,12108,Proposed changes in nutrition labels align bet...,Medical Xpress,m
3411,3412,From Iron Man to food truck man,TODAYonline,e


## Cosine Similarity

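Remember SohCahToa from high school geometry? Here, we're using the **cosine** of the angle between two vectors to measure their similarity. Note that `scipy.spatial.distance.cosine` returns the cosine *distance*, which is 1 minus the cosine of the angle. Below are a few examples. 

Is each pair of vectors parallel? Orthogonal?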

Can we divine a few rules from the observations of the cosines of the angles between these vectors?

In [19]:
from scipy.spatial.distance import cosine

In [20]:
cosine([1.,1.],[1.,1.])

2.2204460492503131e-16

In [21]:
cosine([1.,1.],[1.,-1.])

1.0

It looks like `cosine` returns 1 (or very close to 1) for vectors that are orthogonal (i.e. 90 degrees between the vectors), and 0 (or very close to 0) when the angle between the vectors is small (or 0). That is because SciPy's `cosine` computes the cosine *distance*, $1 - \cos(\theta)$: the cosine of a 90 degree angle is 0, and the cosine of a 0 degree angle is 1. 

The quantity $\cos(\theta)$ itself is known as **cosine similarity**. In more detail, here is the formula for the cosine similarity:

$$
\text{similarity} = \cos(\theta) = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}, \text{     where } A_i \text{ and } B_i \text{ are the components of vectors } A \text{ and } B
$$

This comes from the formula:

$$
A\cdot B = \|A\|_2\|B\|_2\cos(\theta), \text{   where } \theta \text{ is the angle between } A \text{ and } B
$$


Thus, we can also write the similarity as the dot product of $A$ and $B$ divided by the product of their $\ell_2$ norms:

$$
\text{similarity} = \frac{A\cdot B}{\|A\|_2\|B\|_2}
$$
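
To make the formula concrete, here is a small sketch that computes it directly in plain Python, along with the corresponding distance (1 minus the similarity), which is what `scipy.spatial.distance.cosine` reports. The function names are my own and not part of SciPy.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

print(cosine_similarity([1., 1.], [1., -1.]))  # 0.0  (orthogonal vectors)
print(cosine_distance([1., 1.], [1., -1.]))    # 1.0
```

For two identical vectors the similarity comes out at 1 up to floating-point rounding, so the distance is a tiny number close to 0 rather than exactly 0.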

Now that we've established this, let's look at the cosine distance (1 minus the cosine similarity) between some of the title vectors. Smaller values mean more similar titles:

In [22]:
cosine(news_vecs[12106], news_vecs[12102])

0.27039807944158778

In [23]:
cosine(news_vecs[12106], news_vecs[5130])

0.54135747982579541
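
Note that the `NearestNeighbors` model fitted earlier ranks by Euclidean distance, not by cosine distance, and the two can disagree when vectors differ in length. The toy example below (made-up 2-dimensional vectors, hypothetical names) shows one such case. If ranking by angle is what we want, scikit-learn's `NearestNeighbors` also accepts `metric='cosine'`, though that was not done in this notebook.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

query = [1.0, 0.0]
candidates = {
    "same_direction_far": [5.0, 0.5],        # nearly the same angle, much longer
    "different_direction_near": [0.5, 0.8],  # close in space, different angle
}

closest_euclidean = min(candidates, key=lambda name: euclidean(query, candidates[name]))
closest_cosine = min(candidates, key=lambda name: cosine_distance(query, candidates[name]))

print(closest_euclidean)  # different_direction_near
print(closest_cosine)     # same_direction_far
```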

### Deploy

We're going to make a deployable infrastructure using Redis. We want 3 things in there:
* The data
* The model
* key:value pairs for index:title

In [24]:
!pip install redis



In [25]:
import redis

In [26]:
redis_ip = '34.210.47.177'

Model is db 0

In [27]:
r = redis.StrictRedis(redis_ip, db=0)

#### `set` the pickled nearest neighbors model

In [28]:
import pickle

In [29]:
model_pkl = pickle.dumps(nn)

In [31]:
r.set('model',model_pkl)

True
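
Redis stores values as bytes, which is why the model is pickled before `set` and unpickled after `get`. The round trip can be sketched without a Redis server by using a plain dictionary as a stand-in. Here `fake_redis` and `model_like` are hypothetical placeholders, not the real client or the fitted model.

```python
import pickle

fake_redis = {}  # hypothetical stand-in for the Redis key-value store

# Placeholder object; in the notebook this is the fitted NearestNeighbors model
model_like = {"n_neighbors": 5, "metric": "minkowski", "p": 2}

fake_redis["model"] = pickle.dumps(model_like)  # serialise to bytes, then "set"
restored = pickle.loads(fake_redis["model"])    # "get", then deserialise

print(type(fake_redis["model"]).__name__)  # bytes
print(restored == model_like)              # True
```

Only unpickle data from a store you control, since `pickle.loads` can execute arbitrary code embedded in the payload.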

#### `set` the pickled original DataFrame with cleaned data

The data is stored through the same connection `r`, so it also lands in db 0, under the key `data`.

In [32]:
data_pkl = pickle.dumps(news_data)

In [33]:
r.set('data', data_pkl)

True

Check that the data and model are in there:

In [34]:
r.keys()

[b'model', b'data']

In [35]:
pickle.loads(r.get('model'))

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

In [36]:
pickle.loads(r.get('data')).head(2)

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP,clean_title
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698,ed official say weak datum cause weather slow ...
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207,ed harl loss see high bar change pace taper


Now, let's use `redis pipeline` to load our lookup.

In [37]:
pipe = r.pipeline()

In [38]:
lookup = news_data[['TITLE']].to_dict(orient='index')

In [39]:
y = lookup[0].values()

In [40]:
list(y)[0]

'Fed official says weak data caused by weather, should not slow taper'
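
With `orient='index'`, each entry of `lookup` is itself a small dictionary of the form `{'TITLE': ...}`, which is why the cells above go through `.values()`. A flat index-to-title mapping can be sketched with a dictionary comprehension; the `lookup` literal below is a hand-written two-row stand-in for the real one, and `flat_lookup` is a name I introduce for illustration.

```python
# Hand-written stand-in for news_data[['TITLE']].to_dict(orient='index')
lookup = {
    0: {"TITLE": "Fed official says weak data caused by weather, should not slow taper"},
    2: {"TITLE": "US open: Stocks fall after Fed official hints at accelerated tapering"},
}

# Flatten {index: {'TITLE': title}} into {index: title}
flat_lookup = {index: row["TITLE"] for index, row in lookup.items()}

print(flat_lookup[0])
# Fed official says weak data caused by weather, should not slow taper
```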

In [41]:
for index, title in lookup.items():
    pipe.set(index, list(title.values())[0])

In [42]:
pipe.execute()

[True,
 True,
 True,
 ...]

Now let's define a function that:

1. Gets a text vector
2. Loads the nearest neighbors model from Redis
3. Gets the indices of the 5 nearest neighbors
4. Looks up and returns the titles of those neighbors

In [43]:
for i in range(2,5):
    pipe.get(i)
x = pipe.execute()

In [44]:
x[0]

b'US open: Stocks fall after Fed official hints at accelerated tapering'
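
The pipeline calls above queue commands on the client and only talk to the server when `execute()` is called, returning one result per queued command. The toy class below mimics that buffering behaviour against a plain dictionary. `ToyPipeline` is a hypothetical illustration and not how redis-py is implemented.

```python
class ToyPipeline:
    """Hypothetical stand-in for a Redis pipeline, backed by a dict."""

    def __init__(self, store):
        self.store = store
        self.commands = []  # queued (operation, key, value) tuples

    def set(self, key, value):
        self.commands.append(("set", key, value))
        return self

    def get(self, key):
        self.commands.append(("get", key, None))
        return self

    def execute(self):
        """Run every queued command in order and return their results."""
        results = []
        for op, key, value in self.commands:
            if op == "set":
                self.store[key] = value
                results.append(True)
            else:
                results.append(self.store.get(key))
        self.commands = []
        return results

store = {}
pipe = ToyPipeline(store)
pipe.set(0, "first title")
pipe.set(1, "second title")
queued_only = dict(store)        # still empty: nothing has been sent yet
set_results = pipe.execute()     # [True, True]
pipe.get(1)
get_results = pipe.execute()     # ['second title']
```

Batching like this matters most when the server is remote, because it replaces many network round trips with one.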

In [49]:
from time import time

In [50]:
a = time()

In [51]:
b= time()

In [52]:
b-a

7.620621204376221

In [53]:
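def get_nearest(title):
    title_vec = nlp(title).vector.reshape(1,-1)
    # The timer starts after the vector is built, so the first message reports ~0 seconds
    t = time()
    print('{} seconds elapsed. Vector created. Loading model...'.format(time()-t))
    # Fetch and unpickle the fitted model from Redis (the slow step in the output below)
    neighbors = pickle.loads(r.get('model'))
    print('{} seconds elapsed. Model loaded. Predicting...'.format(time()-t))
    _, indices = neighbors.kneighbors(title_vec)
    print('{} seconds elapsed. Querying database...'.format(time()-t))
    # Queue one GET per neighbor index and send them in a single round trip
    pipe = r.pipeline()
    for index in indices[0]:
        pipe.get(int(index))  # cast NumPy integers to plain int keys
    # Note: the total is printed before the queued GETs are executed
    print('Total time: {} seconds'.format(time()-t))
    return pipe.execute()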

In [54]:
get_nearest('food')

6.67572021484375e-06 seconds elapsed. Vector created. Loading model...
45.87578511238098 seconds elapsed. Model loaded. Predicting...
45.887019634246826 seconds elapsed. Querying database...
Total time: 45.88733887672424 seconds


[b'Changes in food labels in the works',
 b'Proposed change in food labeling would affect the American diet',
 b'Proposed food labeling revisions needed now',
 b'Proposed changes in nutrition labels align better with the way we really eat',
 b'From Iron Man to food truck man']
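
In the timing output above, nearly all of the roughly 45.9 seconds passes between "Loading model..." and "Model loaded", so fetching and unpickling the model dominates each call. One possible mitigation, sketched here under the assumption that the model does not change between calls, is to cache the loaded model in the process. `fake_redis`, `load_model`, and `load_calls` are hypothetical names, and the dictionary stands in for the Redis connection.

```python
import pickle
from functools import lru_cache

# Hypothetical stand-in for Redis holding a pickled placeholder "model"
fake_redis = {"model": pickle.dumps({"n_neighbors": 5})}
load_calls = []  # records how many times the expensive load actually runs

@lru_cache(maxsize=1)
def load_model():
    """Fetch and unpickle the model once; later calls reuse the cached object."""
    load_calls.append(1)
    return pickle.loads(fake_redis["model"])

first = load_model()
second = load_model()

print(len(load_calls))  # 1
print(first is second)  # True
```

The trade-off is that a cached model goes stale if the key in Redis is later overwritten, so a real deployment would need some way to invalidate it.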

In [None]:
r.flushall()