<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 10px">

# Content Based Recommendations Lab
Week 9 | Lesson 3.1


![](http://aerofarms.com/wp-content/uploads/2015/04/NYTimes-banner.jpg)


## Introduction 


In this lab you will create a content based recommender for New York Times articles. This recommender is an example of a very simple data product. You will follow the same proceedure outlined in this [Medium article](https://medium.com/data-lab/how-we-used-data-to-suggest-tags-for-your-story-a120076d0bb6#.4vu7uby9z) in order to build your very own content based recommender. ![](https://s-media-cache-ak0.pinimg.com/236x/1d/cd/4e/1dcd4e0152a3692314f65a1aafb53982.jpg) However, we will not be recommending tags. Instead we'll be recommending new articles that a user should read based on the article that they are currently reading.

### Explain your approach 

You are in a technical job interview. 

Explain in your own words to your interviewer how the content based recommendation approach works. Specifically, explianed how you will apply this approach to recommend readers new NYT articles to read based on the article they are currently reading. 

**Hint: ** Read the Medium article and adapt their approach for your purposes. 

>Write your answer here

## EDA

1. Inspect your data
2. Identify the time range in which these articles were published
3. Can you think of any major domestic events happening in this time range?
4. Count the number of articles from each section
5. Will the section name imbalance bias our recommendations? 
6. What do you think is the appropriate response to the section name imbalance?

In [1]:
# import relavent packages 
import numpy as np
import pandas as pd
import pickle
from sklearn.utils import shuffle
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

In [2]:
# import data
data_path = "./datasets/NYT-articles.pkl"
df = pd.read_pickle(data_path)

### 1. Inspect your data

In [27]:
len(df.body.unique()),  len(df.body)

(18356, 24005)

In [3]:
df.head(2)

Unnamed: 0,_id,abstract,blog,body,byline,document_type,headline,keywords,lead_paragraph,multimedia,...,print_page,pub_date,section_name,slideshow_credits,snippet,source,subsection_name,type_of_material,web_url,word_count
0,580ae247253f0a1d0316f71e,,[],TOKYO — State-backed Japan Bank for Internati...,"{u'person': [], u'original': u'By REUTERS', u'...",article,{u'main': u'Japan to Lend to Sanctioned Russia...,[],State-backed Japan Bank for International Coop...,[],...,,2016-10-21T23:51:28Z,Business Day,,State-backed Japan Bank for International Coop...,Reuters,,News,http://www.nytimes.com/reuters/2016/10/21/busi...,
1,580adf45253f0a1d0316f71d,,[],"INTERNATIONAL\nBecause of an editing error, an...",[],article,"{u'main': u'Corrections: October 22, 2016', u'...",[],"Corrections appearing in print on Saturday, Oc...",[],...,,2016-10-21T23:38:36Z,Corrections,,"Corrections appearing in print on Saturday, Oc...",The New York Times,,News,http://www.nytimes.com/2016/10/22/pageoneplus/...,


In [4]:
df.body[0]

u'TOKYO \u2014  State-backed Japan Bank for International Cooperation [JBIC.UL] will lend about 4 billion yen ($39 million) to Russia\'s Sberbank, which is subject to Western sanctions, in the hope of advancing talks on a territorial dispute, the Nikkei business daily said on Saturday.\nSberbank, Russia\'s biggest bank, will use the yen-denominated loan to help a company operating the port of Vostochny in the Russian Far East to buy coal-handling equipment.\nJBIC will issue the loan by the end of the year in a bid to encourage progress on a dispute over a string of Russia-controlled Pacific islands, called the Northern Territories in Japan and Southern Kuriles in Russia, at a December summit.\n"JBIC\'s move to provide financing to Russia comes because  the Japanese government aims to make progress in the negotiations," the Nikkei said.\nJBIC was not available for comment.\nJapanese foreign ministry and the prime minister\'s office were not available for comment.\nThe United States and 

In [5]:
df.web_url[0]

u'http://www.nytimes.com/reuters/2016/10/21/business/21reuters-japan-russia-loans.html'

### 2. Identify the time range in which these articles were published

In [6]:
min(df.pub_date).split("T")[0], max(df.pub_date).split("T")[0]

(u'2016-10-05', u'2016-11-27')

### 3. Can you think of any major domestic events happening in this time range?

The US 2017 presidential election took place in this time range. 

### 4. Count the number of articles in each section

**Hint:** Make your life easier and use Counter from collections 

In [48]:
Counter(sections)

Counter({u'Arts': 1037,
         u'Automobiles': 13,
         u'Books': 180,
         u'Briefing': 94,
         u'Business Day': 3615,
         u'Corrections': 35,
         u'Crosswords & Games': 54,
         u'Education': 17,
         u'Fashion & Style': 348,
         u'Food': 102,
         u'Giving': 13,
         u'Health': 56,
         u'Job Market': 14,
         u'Magazine': 76,
         u'Movies': 157,
         u'N.Y. / Region': 354,
         u'NYT Now': 9,
         u'Obituaries': 1,
         u'Opinion': 727,
         u'Podcasts': 17,
         u'Public Editor': 8,
         u'Real Estate': 90,
         u'Science': 129,
         u'Sports': 3719,
         u'Style': 35,
         u'Sunday Review': 3,
         u'T Magazine': 82,
         u'Technology': 416,
         u'The Learning Network': 116,
         u'The Upshot': 66,
         u'Theater': 134,
         u'Times Insider': 45,
         u'Today\u2019s Paper': 34,
         u'Travel': 85,
         u'U.S.': 6039,
         u'Universal': 2,

## 5. Feature Engineering 

Here you will split your data into a "train" and "test". As well as select a metric in which to measure the similarity between articles. Think of the "train" set as the corpus. Think of the "test" set as the NYT articles that users are currently reading. 

### 5.1 Split your data

In [11]:
# move articles to an array
articles = df.body.values

# move article section names to an array
sections = df.section_name.values

# move article web_urls to an array
web_url = df.web_url.values

# shuffle these three arrays 
articles, sections, web_ur = shuffle(articles, sections, web_url, random_state=4)

In [12]:
# split the shuffled articles into two arrays
n = 10

# one will have all but the last 10 articles -- think of this as your training set/corpus 
X_train = articles[:-n]
X_train_urls = web_url[:-n]
X_train_sections = sections[:-n]

# the other will have those last 10 articles -- think of this as your test set/corpus 
X_test = articles[-n:]
X_test_urls = web_url[-n:]
X_test_sections = sections[-n:]

### 5. 2 Text Vectorizers

You're still in a job interview. Your interviewer asks you to respond to the follow:

**Choose CountVectorizer or TFIDF as your text vectorizer. Which do you choose?**

> Write your anser here

**Justify your choice**

> Write your anser here

**Explain why you didn't choose the other option **

> Write your anser here

Your interviewer asks you to respond to the follow:

**Choose to use or not to use stop words. Which do you choose?**

> Write your anser here

**Justify your choice**

> Write your anser here

**Explain why you didn't choose the other option **

> Write your anser here

In [13]:
# instantiate your vectorizor 
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

In [17]:
# fit the vectorizer 
tfidf_vectorizer.fit(X_train)

In [15]:
# transform both article splits 
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

### 5.3 Similarity Metric

You're still in a job interview. 
Your logic and reasoning powers are still being evaluated. 
Your interviewer asks you to respond to the follow:

**Hint: ** You might find these resources helpful: [Jaccard](https://en.wikipedia.org/wiki/Jaccard_index) , [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) 


**Choose the Jaccard or Cosine similarity metric to find the similarity between your train and test articles. Which do you choose?**

> Write your anser here

**Justify your choice**

> Write your anser here

**Explain why you didn't choose the other option **

> Write your anser here



#### Jacard Metric
If you choose to use the Jacard metric, you can import that metric from sklearn. 

#### Cosine Metric
If you choose to use the Cosine metric, apply the same approach to calculating similarities as was done in the Medium article. 

## 6. Builidng a Content Based Recommder

This section is where the magic happends. Here you will build a function that outputs the top n articles to recommend to your user based on the similarity scores between the article they're currently reading and all other articles in the corpus (i.e. "train" data).

In [78]:
def get_top_n_rec_articles(X_train_tfidf, X_train, test_article, X_train_sections, X_train_urls, n = 5):
    '''This function calculates similarity scores bewteen a document and a corpus
    
       INPUT: vectorized document corpus, 2D array
              text document corpus, 1D array
              user article, 1D array
              article section names, 1D array
              article URLs, 1D array
              number of articles to recommend, int
              
       OUTPUT: top n recommendations, 1D array
               top n corresponding section names, 1D array
               top n corresponding URLs, 1D array
               similarity scores bewteen user article and entire corpus, 1D array
              '''
    # calculate similarity between the corpus (i.e. the "test" data) and the user's article
    similarity_scores = X_train_tfidf.dot(test_article.toarray().T)

    # get sorted similairty score indicies 
    sorted_indicies = np.argsort(similarity_scores, axis = 0)[::-1]

    # get sorted similarity socres
    sorted_sim_scores = similarity_scores[sorted_indicies]

    # get top n most similar documents
    top_n_recs = X_train[sorted_indicies[:n]]

    # get top n corresponding document section names
    rec_sections = X_train_sections[sorted_indicies[:n]]

    # get top n corresponding urls
    rec_urls = X_train_urls[sorted_indicies[:n]]
    
    # return recommendations and corresponding article meta-data
    return top_n_recs, rec_sections, rec_urls, sorted_sim_scores

In [56]:
# pick an article from the "test" set
# treat this as the article that the user is currently reading
k = 0
test_article = X_test_tfidf[k]

In [79]:
# return the top n most similar articles as recommendations 
top_n_recs, rec_sections, rec_urls, sorted_sim_scores = \
get_top_n_rec_articles(X_train_tfidf, X_train,  test_article,X_train_sections, X_train_urls, n = 5 )

## 7. Interrogate the results 

Now that you have recommended articles for the user to read (based on what they are currently reading) check to see if the results make sense. 

Compare the user's article and corresponding section name with the recommended articles and corresponding section names. 

Also take a look at the similarity scores. 

In [89]:
# similarity scores
sorted_sim_scores[:5]

array([[[ 0.56601716]],

       [[ 0.49837752]],

       [[ 0.4792004 ]],

       [[ 0.46857784]],

       [[ 0.46037552]]])

In [85]:
# user's article
X_test[k]

u'LOS ANGELES \u2014  The White House says President Barack Obama has told the Defense Department that it must ensure service members instructed to repay enlistment bonuses are being treated fairly and expeditiously.\nWhite House spokesman Josh Earnest says the president only recently become aware of Pentagon demands that some soldiers repay their enlistment bonuses after audits revealed overpayments by the California National Guard.  If soldiers refuse, they could face interest charges, wage garnishments and tax liens.\nEarnest says he did not believe the president was prepared to support a blanket waiver of those repayments, but he said "we\'re not going to nickel and dime" service members when they get back from serving the country. He says they should not be held responsible for fraud perpetrated by others.'

In [80]:
# user's article's section name
X_test_sections[k]

u'U.S.'

In [86]:
# corresponding section names for top n recs 
rec_sections

array([[u'World'],
       [u'U.S.'],
       [u'U.S.'],
       [u'World'],
       [u'U.S.']], dtype=object)

In [82]:
# top n article recs
top_n_recs

array([[ u'WASHINGTON \u2014  House Speaker Paul Ryan on Tuesday called for the Pentagon to immediately suspend efforts to recover enlistment bonuses paid to thousands of soldiers in California, even as the Pentagon said late Tuesday the number of soldiers affected was smaller than first believed.\n"When those Californians answered the call to duty" to serve in Iraq and Afghanistan, "they earned more from us than bureaucratic bungling and false promises," Ryan said. He urged the Pentagon to suspend collection efforts until "Congress has time ... to protect service members from lifelong liability for DOD\'s mistakes."\nRyan\'s comments came as the White House said President Barack Obama has warned the Defense Department not to "nickel and dime" service members who were victims of fraud by overzealous recruiters.\nWhite House spokesman Josh Earnest said Tuesday he did not believe Obama would support a blanket waiver of repayments, but said California National Guard members should not be 

In [72]:
# corresonding URLs for top n recs 
rec_urls

array([[ u'http://www.nytimes.com/2016/10/19/dining/wine-school-assignment-montsant.html'],
       [ u'http://www.nytimes.com/aponline/2016/10/26/us/ap-us-oil-pipeline-news-guide.html'],
       [ u'http://www.nytimes.com/reuters/2016/11/01/business/01reuters-britain-boe-may-welcome.html'],
       [ u'http://www.nytimes.com/reuters/2016/11/01/world/europe/01reuters-hongkong-china.html'],
       [ u'http://www.nytimes.com/reuters/2016/11/02/us/02reuters-colorado-bomb.html']], dtype=object)

### 8. Additoinal Resources 

http://infolab.stanford.edu/~ullman/mmds/ch9.pdf

http://benanne.github.io/2014/08/05/spotify-cnns.html

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.25.5743&rep=rep1&type=pdf