# Recommendations with IBM data

In this notebook, we will be using real data from the IBM Watson Studio platform to build recommendation systems. 

Let's get started by importing the necessary libraries and reading in the data.

In [1]:
import pandas as pd
import numpy as np

import re
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer

nltk.download(['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger'])

[nltk_data] Downloading package punkt to /home/nabanita/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/nabanita/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/nabanita/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/nabanita/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
df = pd.read_csv('data/user-item-interactions.csv')
df_content = pd.read_csv('data/articles_community.csv')

Let's look at each of the dataframes closely. 

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,article_id,title,email
0,0,1430.0,"using pixiedust for fast, flexible, and easier...",ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7
1,1,1314.0,healthcare python streaming application demo,083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b
2,2,1429.0,use deep learning for image classification,b96a4f2e92d8572034b1e9b28f9ac673765cd074
3,3,1338.0,ml optimization using cognitive assistant,06485706b34a5c9bf2a0ecdac41daf7e7654ceb7
4,4,1276.0,deploy your python model as a restful api,f01220c46fc92c6e6b161b1849de11faacd7ccb2


In [4]:
del df['Unnamed: 0']

In [5]:
df_content.head()

Unnamed: 0.1,Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
0,3,Skip navigation Sign in SearchLoading...\r\n\r...,Detect bad readings in real time using Python ...,Detect Malfunctioning IoT Sensors with Streami...,Live,0
1,5,No Free Hunch Navigation * kaggle.com\r\n\r\n ...,"See the forest, see the trees. Here lies the c...",Communicating data science: A guide to present...,Live,1
2,7,☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat...,Here’s this week’s news in Data Science and Bi...,"This Week in Data Science (April 18, 2017)",Live,2
3,8,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA...",Learn how distributed DBs solve the problem of...,DataLayer Conference: Boost the performance of...,Live,3
4,12,Skip navigation Sign in SearchLoading...\r\n\r...,This video demonstrates the power of IBM DataS...,Analyze NY Restaurant data using Spark in DSX,Live,4


In [6]:
del df_content['Unnamed: 0']

## Part I: Exploratory Data Analysis

In this section, we perform EDA on both of our dataframes. We start with the dataframe 'df'.

In [7]:
df.shape

(45993, 3)

Let's find out the dtype of each column and whether there are any null values.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45993 entries, 0 to 45992
Data columns (total 3 columns):
article_id    45993 non-null float64
title         45993 non-null object
email         45976 non-null object
dtypes: float64(1), object(2)
memory usage: 1.1+ MB


Only the 'email' column has null values: 45993 - 45976 = 17 of them. Next, we ask how many unique values there are in each column.

In [9]:
df.nunique()

article_id     714
title          714
email         5148
dtype: int64

The number of unique values in each column is therefore much smaller than the total number of samples. This is expected, because:

- An article can have multiple interactions, with different users as well as with the same user.
- A user can interact with different articles, and with the same article multiple times.

Note that the column 'email' identifies the user, but it is quite unwieldy to handle. Each unique user has a unique email, so we can map every email to a unique user id. 

There were a small number of null values, and it was found that they likely all belong to a single user. The function below stores them accordingly: the null value gets one user id of its own.

In [10]:
def email_to_id():
    
    """
    A function that maps unique user emails to unique user ids.
    
    Parameter
    ------------
    None
    
    Returns
    ------------
    coded_dict : dict
        A dictionary whose keys are user emails and values are corresponding user ids.
    
    """
    
    user_id = 1
    coded_dict = {}

    for email in df['email'].unique():
        coded_dict[email] = user_id
        user_id += 1
       
    return coded_dict

In [11]:
coded_dict = email_to_id()
df['user_id'] = df['email'].apply(lambda x:coded_dict[x])
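
As a hedged aside, the same email-to-id mapping can be sketched on toy data with `pd.factorize`. The toy emails and the placeholder string `"unknown"` are assumptions made for illustration; they are not part of the dataset. Filling the nulls first makes all missing emails share a single id, mirroring the behaviour described above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'email' column (hypothetical values)
toy = pd.DataFrame({"email": ["a@x.com", "b@x.com", np.nan, "a@x.com", np.nan]})

# Replace nulls with one placeholder so that they all map to the same user,
# then number the distinct values in order of first appearance, starting at 1.
codes, uniques = pd.factorize(toy["email"].fillna("unknown"))
toy["user_id"] = codes + 1

print(toy)
```

Both rows with a missing email end up with the same `user_id`, and repeated emails keep the id they were first given.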

A new column 'user_id' is introduced that holds user ids. Now, we can get rid of the 'email' column.

In [12]:
df.drop('email', axis=1, inplace=True)

In [13]:
df.isnull().sum()

article_id    0
title         0
user_id       0
dtype: int64

After this conversion, there are no null values in the dataframe.

Let's now move to the other dataframe 'df_content' and extract the same kind of information from it as we did for the dataframe 'df'.

In [14]:
df_content.shape

(1056, 5)

In [15]:
df_content.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1056 entries, 0 to 1055
Data columns (total 5 columns):
doc_body           1042 non-null object
doc_description    1053 non-null object
doc_full_name      1056 non-null object
doc_status         1056 non-null object
article_id         1056 non-null int64
dtypes: int64(1), object(4)
memory usage: 41.4+ KB


In [16]:
df_content.nunique()

doc_body           1036
doc_description    1022
doc_full_name      1051
doc_status            1
article_id         1051
dtype: int64

We note the following:

- Only two columns 'doc_body' and 'doc_description' have null values.
- Although 'article_id' and 'doc_full_name' have no null values, the number of unique entries in each of these columns (1051) is less than the total number of rows (1056). Since the dataframe only holds information about articles and their content, this indicates the presence of duplicate rows.
- The column 'doc_status' has only one unique value and hence can be dropped.

In [17]:
df_content.drop('doc_status', axis=1, inplace=True)

Let's now check for duplicate rows.

In [18]:
df_content[df_content.duplicated('article_id')]

Unnamed: 0,doc_body,doc_description,doc_full_name,article_id
365,Follow Sign in / Sign up Home About Insight Da...,During the seven-week Insight Data Engineering...,Graph-based machine learning,50
692,Homepage Follow Sign in / Sign up Homepage * H...,One of the earliest documented catalogs was co...,How smart catalogs can turn the big data flood...,221
761,Homepage Follow Sign in Get started Homepage *...,Today’s world of data science leverages data f...,Using Apache Spark as a parallel processing fr...,398
970,This video shows you how to construct queries ...,This video shows you how to construct queries ...,Use the Primary Index,577
971,Homepage Follow Sign in Get started * Home\r\n...,"If you are like most data scientists, you are ...",Self-service data preparation with IBM Data Re...,232


So there are 5 duplicate rows, and we drop them, keeping only the first occurrence of each article id.

In [19]:
df_content.drop_duplicates('article_id', inplace=True)
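
To make the "keep the first occurrence" behaviour concrete, here is a small sketch on made-up data. The frame `toy` and its values are hypothetical; only `duplicated` and `drop_duplicates` are the calls used in this notebook.

```python
import pandas as pd

# Hypothetical articles; ids 50 and 221 each appear twice
toy = pd.DataFrame({
    "article_id": [50, 221, 50, 398, 221],
    "doc_full_name": ["a", "b", "a (copy)", "c", "b (copy)"],
})

# duplicated() flags every repeat after the first occurrence of an id
dupes = toy[toy.duplicated("article_id")]

# drop_duplicates() keeps the first occurrence by default (keep='first')
deduped = toy.drop_duplicates("article_id")

print(dupes)
print(deduped)
```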

Let's do a quick check that everything is as expected.

In [20]:
df_content[df_content.duplicated('article_id')]

Unnamed: 0,doc_body,doc_description,doc_full_name,article_id


## Part II: Rank-Based Recommendations

In this type of recommendation technique, recommendations are based on the popularity of the items, *e.g.*, movies with the highest ratings or songs with the largest number of downloads.

Note that our dataset does not contain ratings telling us whether a user liked an article or not. We only know that a user has interacted with an article. In this situation, the popularity of an article can only be based on how often it was interacted with.

The following function recommends articles based on their popularity among users.

In [21]:
def get_top_articles(n, df=df):
    
    """
    This function determines the most popular articles based on the number of interactions and returns
    the corresponding ids and names.
    
    Parameters
    ------------
    n : int
     The number of top articles to return
     
    df : pandas dataframe
      dataframe as defined at the top of the notebook from the file user-item-interactions.csv 
    
    Returns
    -----------
    top_articles_id : list
       A list of the top 'n' article ids
       
    top_articles : list 
       A list of the top 'n' article names    
    
    """
    
    top_articles_id = list(df.groupby('article_id').count()['user_id'].sort_values(ascending=False).index[:n])
    # note: the titles below are listed in order of first appearance in df, not in order of popularity
    top_articles = list(df[df['article_id'].isin(top_articles_id)]['title'].unique())
    
    return top_articles_id, top_articles 
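
If the titles should follow the same popularity order as the ids, one possible variant is sketched below. The function name `top_articles_ranked` and the toy frame are assumptions made for illustration; this is not the function used in the rest of the notebook.

```python
import pandas as pd

# Hypothetical interactions: article 10.0 has 3, 20.0 has 2, 30.0 has 1
toy = pd.DataFrame({
    "article_id": [10.0, 20.0, 10.0, 30.0, 10.0, 20.0],
    "title": ["a", "b", "a", "c", "a", "b"],
})

def top_articles_ranked(n, interactions):
    # value_counts() returns ids sorted by number of interactions, descending
    top_ids = interactions["article_id"].value_counts().index[:n].tolist()
    # one title per article id, looked up in ranked order
    id_to_title = (interactions.drop_duplicates("article_id")
                               .set_index("article_id")["title"])
    top_titles = [id_to_title.loc[article_id] for article_id in top_ids]
    return top_ids, top_titles

ids, titles = top_articles_ranked(2, toy)
print(ids, titles)
```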

Let's make some recommendations for any arbitrary user (note that this method is independent of user profile).

In [22]:
print("top 5 articles using rank-based recommendation technique are :")
print()
print(get_top_articles(5)[1])

top 5 articles using rank-based recommendation technique are :

['use deep learning for image classification', 'predicting churn with the spss random tree algorithm', 'visualize car data with brunel', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'insights from new york car accident reports']


## Part III: User-User Based Collaborative Filtering

In this type of recommendation technique, we look for users who are closely related to each other in terms of their interactions with items, *e.g.*, users who have watched the same movies and have given them ratings in the same ballpark are `neighbors` in the movie ratings plane.

Let's start by constructing a new column in the dataframe 'df' that holds the value 1 for every user-article interaction. If a user has interacted with the same article multiple times, each of those interactions gets its own row with the value 1 in this column. We call this new column 'interaction'.

In [24]:
df['interaction'] = 1  # every row is one interaction

We now build our recommendation system based on user-user collaborative filtering step-by-step as follows:

### Step 1

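Let's define the following function to create the user-item matrix, whose rows are unique user ids and whose columns are unique article ids. Because every value in the 'interaction' column is 1 and the function aggregates with `min()`, each entry is 1 if the user has interacted with the article at least once and 0 otherwise; repeated interactions are not counted. 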

In [25]:
def create_user_item_matrix(df):
    
    """
    This function returns a matrix which holds the information about interactions between users and articles.
    
    Parameter
    ----------
    df : pandas dataframe 
       dataframe as defined at the top of the notebook from the file user-item-interactions.csv
    
    Returns
    ---------
    user_item : pandas dataframe
       a matrix whose rows are unique user ids and columns are unique article ids. An entry is 1 if the 
       corresponding user has interacted with the article at least once (repeated interactions are not counted)
       and 0 otherwise, i.e. null values are filled with 0 to denote no interaction.
  
    """
    
    user_item = df.groupby(['user_id', 'article_id'])['interaction'].min().unstack().fillna(0)
    
    return user_item 

In [26]:
user_item = create_user_item_matrix(df)
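
The effect of the aggregation can be checked on a toy example. The data below is made up; the `groupby(...).unstack().fillna(0)` chain is the one used in `create_user_item_matrix`. Swapping `min()` for `sum()` would give interaction counts instead of a 0/1 indicator.

```python
import pandas as pd

# Hypothetical interactions: user 1 reads article 10.0 twice
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3],
    "article_id": [10.0, 10.0, 20.0, 20.0, 30.0],
})
toy["interaction"] = 1

grouped = toy.groupby(["user_id", "article_id"])["interaction"]

# min() of a column of ones is 1, so this yields a binary matrix
matrix = grouped.min().unstack().fillna(0)

# sum() counts the repeated interactions instead
counts = grouped.sum().unstack().fillna(0)

print(matrix)
print(counts)
```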

### Step 2 

Next we construct a function which, given a user id, returns a dataframe of all other users sorted first by their similarity to the input user and then by their total number of interactions. The returned dataframe does not contain the input user, since every user is trivially most similar to themselves. We compute the similarity between two users as the dot product of their interaction vectors; because the user-item matrix is binary, this equals the number of articles both users have interacted with.

In [27]:
def get_top_sorted_users(user_id, df=df, user_item=user_item):
    
    """
    This function returns a dataframe which contains all the neighbors of the input user sorted according to their
    similarity scores first and then by the total number of interactions.
    
    Parameters
    ------------
    user_id : int
         an input user id
        
    df : pandas dataframe
         dataframe as defined at the top of the notebook from the file user-item-interactions.csv
    
    user_item : (pandas dataframe) matrix 
         a users by articles matrix where non-zero entries represents that a user has interacted with an article
         and 0 stands for no interaction.
    
    Returns
    ---------
    neighbors_df : pandas dataframe
         a dataframe with the following columns:
         1. neighbor_id : a neighbor user_id
         2. similarity_score : measure of the similarity between the input user and its neighbors
         3. total_interactions : the total number of interactions of a neighbor user
   
    """
    most_similar_users, similarity_score = [], []
    
    # loop through all other users    
    for other_id in range(len(user_item)):
        if other_id != user_id-1: # since the indexing start from zero in python
            # store the similarity score for every other user
            similarity_score.append(np.dot(user_item.iloc[user_id-1, :], user_item.iloc[other_id, :]))
            # store the id of every other user
            most_similar_users.append(other_id+1)

    # store the total number of interactions for each similar user (the groupby is computed only once)
    interaction_counts = df.groupby('user_id').count()['interaction'].values
    total_interactions = [interaction_counts[uid-1] for uid in most_similar_users]
    
    # construct the dataframe
    neighbors_df = pd.DataFrame([most_similar_users, similarity_score, total_interactions]).transpose() 
    neighbors_df.columns = ['neighbor_id', 'similarity_score', 'total_interactions']
    # sort first by similarity score and then by number of interactions
    neighbors_df.sort_values(by=['similarity_score','total_interactions'], ascending=False, inplace=True)
    
    return neighbors_df 
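
The explicit loop can also be expressed as a single matrix-vector product. The sketch below is an illustration under stated assumptions, not a replacement for the function above: `similar_users_vectorized` and the toy matrix are hypothetical, and `total_interactions` here is the number of distinct articles per user (row sum of the binary matrix) rather than the raw interaction count.

```python
import pandas as pd

# Hypothetical binary user-item matrix (rows: users, columns: articles)
user_item_toy = pd.DataFrame(
    [[1, 1, 0, 1],
     [1, 0, 0, 1],
     [0, 0, 1, 0],
     [1, 1, 1, 0]],
    index=pd.Index([1, 2, 3, 4], name="user_id"),
    columns=[10, 20, 30, 40],
)

def similar_users_vectorized(user_id, user_item):
    # dot product of every user's row with the input user's row
    scores = user_item.dot(user_item.loc[user_id])
    # distinct articles per user, used to break ties
    totals = user_item.sum(axis=1)
    neighbors = pd.DataFrame({"similarity_score": scores,
                              "total_interactions": totals})
    # a user is not their own neighbor
    neighbors = neighbors.drop(index=user_id)
    return neighbors.sort_values(["similarity_score", "total_interactions"],
                                 ascending=False)

result = similar_users_vectorized(1, user_item_toy)
print(result)
```

Looking rows up by label with `.loc` also avoids relying on user ids being consecutive integers starting at 1.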

### Step 3

In this step, we construct the following three functions that will be helpful when making recommendations later.

In [28]:
def get_article_names(article_ids, df=df):
    
    """
    This function, given a list of article ids, returns a corresponding list containing the article titles.
    
    Parameters
    ------------
    article_ids : list
        a list of article ids
        
    df : pandas dataframe
        dataframe as defined at the top of the notebook from the file user-item-interactions.csv
    
    Returns
    ---------
    article_names : list
          a list of article names associated with the input list of article ids 
                    
    """
    
    article_names = df[df['article_id'].isin(article_ids)]['title'].unique()
    
    return list(article_names) 

In [29]:
def get_user_articles(user_id, df=df):
    
    """
    This function provides a list of article ids and corresponding titles that a user has interacted with.
    
    Parameters
    ------------
    user_id : int
        an input user id
        
    df : pandas dataframe
        dataframe as defined at the top of the notebook from the file user-item-interactions.csv
    
    Returns
    ---------
    article_ids : list
          a list of the article ids that the user has interacted with
          
    article_names : list
          a list of article names associated with the list article_ids 
    
    """
    
    article_ids = df.query('user_id==@user_id')['article_id'].unique()
    article_names = get_article_names(article_ids, df=df)  # use the dataframe passed to this function
    
    return article_ids, article_names 

In [30]:
def get_top_sorted_articles(article_ids, df=df):
    
    """
    Given a list of article ids, this function sorts them according to the number of interactions they have with
    users in a descending order.
    
    Parameters
    ------------
    article_ids : list
        a list of article ids
        
    df : pandas dataframe
        dataframe as defined at the top of the notebook from the file user-item-interactions.csv
    
    Returns
    ---------
    sorted_article_ids : array
         an array of article ids sorted according to the number of interactions
    
    """
    
    df_new = df.groupby('article_id').count().reset_index()[['article_id', 'interaction']].sort_values('interaction', ascending=False)
    sorted_article_ids = df_new[df_new['article_id'].isin(article_ids)]['article_id'].values
    
    return sorted_article_ids

### Step 4

In the final step, we write a function that makes recommendations for a given user based on collaborative filtering, using the following steps:

- find the articles that our input user has already read using the function 'get_user_articles'
- list all the users similar to the input user, as given by the dataframe returned by the function 'get_top_sorted_users'
- for every similar user 
    - find the articles that they have read but our input user has not
    - sort these articles in descending order of popularity using the function 'get_top_sorted_articles'
    - include them in the recommendation list
    
The method described above takes the following two things into consideration:

* When neighbor users have the same similarity score, we do not choose among them arbitrarily; users with more total article interactions are chosen before those with fewer.

* When picking articles from a neighbor user, we do not choose arbitrarily either; articles with more total interactions are chosen before those with fewer.

In [31]:
def user_user_recs(user_id, n=10):

    """
    A function to make recommendations based on user-user collaborative filtering. Loops through all the users 
    based on the similarity to the input user. For each user, we find articles the input user has not interacted 
    with before but its neighbor has and provide them as recommendations. Note that if users have the same 
    similarity score, users with more total interactions are chosen before those with fewer total interactions. 
    Also we choose articles with more number of interactions before those with fewer interactions.
    
    Parameters
    ------------
    user_id : int
         an input user id for whom recommendations are to be made
         
    n : int
        the number of recommendations we want to make for the input user
        
    Returns
    ---------
    recs : numpy array
         an array of recommended article ids for the input user
         
    rec_names : list
         a list of names of the recommended articles

    """
    
    # articles that our input user has interacted with    
    articles_read = get_user_articles(user_id)[0]
    
    # users similar to our input user sorted according to the similarity score and interaction numbers
    neighbors_df = get_top_sorted_users(user_id)
    similar_users = neighbors_df['neighbor_id'].values
    
    # list to store all the recommendations
    recs = []
    
    for user in similar_users:
        # articles that have no interaction with the input user but have interactions with similar users
        articles_not_read = np.setdiff1d(get_user_articles(user)[0], articles_read)
        
        # sort these articles according to the number of interactions in descending order
        articles_not_read_sorted = get_top_sorted_articles(articles_not_read)
        
        # now add these articles to the list 'recs'
        recs.extend(articles_not_read_sorted)
        
        # if the length of the array exceeds the max. number of recommendations we want to make, break the loop
        if len(set(recs)) > n:
            break
    
    # consider only the unique values in the recommendation list
    # (note: set() does not preserve the ranked order in which the articles were added)
    recs = list(set(recs))
    rec_names = get_article_names(recs)       
            
    # return the first 'n' elements of our recommendation list in case it has more than n entries
    return recs[:n], rec_names[:n]
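
Converting the list to a `set` removes duplicates but also discards the order in which the articles were collected. One possible way to deduplicate while keeping that order is sketched below; the helper `dedupe_keep_order` and the sample ids are hypothetical and are not used elsewhere in this notebook.

```python
def dedupe_keep_order(items, n):
    # dict keys are unique and remember insertion order,
    # so the first occurrence of every item survives in place
    return list(dict.fromkeys(items))[:n]

# Illustrative ranked list with repeats (made-up ordering)
ranked = [1429.0, 1330.0, 1429.0, 1314.0, 1330.0, 1271.0]
top = dedupe_keep_order(ranked, 3)
print(top)
```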

Let's use our recommendation system to make recommendations for an arbitrarily chosen user.

In [32]:
rec_ids, rec_names = user_user_recs(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)

The top 10 recommendations for user 20 are the following article ids:
[1024.0, 1409.0, 1410.0, 1411.0, 1152.0, 1157.0, 1154.0, 1153.0, 1160.0, 1162.0]

The top 10 recommendations for user 20 are the following article names:
['ml optimization using cognitive assistant', 'deploy your python model as a restful api', 'apache spark lab, part 1: basic concepts', 'timeseries data analysis of iot events by using jupyter notebook', 'dsx: hybrid mode', 'predicting churn with the spss random tree algorithm', 'analyze energy consumption in buildings', 'ibm watson facebook posts for 2015', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'apache spark lab, part 3: machine learning']


## Part IV: Content Based Recommendations

In content based recommendations, we use features of, or information about, the items and users themselves, *e.g.*, recommending movies on the basis of their genres.

Let's consider the columns 'article_id' and 'doc_full_name' from the dataframe df_content. We will use the column 'doc_full_name' to extract content-related information about the articles.  

In [33]:
article_content = df_content[['doc_full_name', 'article_id']]

We now build our recommendation system based on content step-by-step as follows:

### Step 1

First, we construct a function that converts a text into a list of tokens. We will use this function to extract the content, i.e. the important features, associated with an article. 

In [34]:
def text_to_word(text):
    
    """
    A function to clean an input text. The steps followed for the text cleaning are :
     
     1. Normalization i.e. conversion to lower case and punctuation removal
     2. Tokenization 
     3. Stop words removal
     4. Lemmatization
    
    Parameter 
    -----------
      text : str 
        the input text to be cleaned
      
    Returns 
    ----------
      lemm_token_list : list
             a list of tokens obtained after cleaning the text
        
    """    
    
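    token_list = word_tokenize(re.sub(r"[^a-z0-9]", " ", text.lower()))
    # build the stop-word set once instead of reloading the list for every token
    stop_words = set(stopwords.words("english"))
    token_nostop_list = [token for token in token_list if token not in stop_words]
    pos_dict = {"N":wordnet.NOUN, "J":wordnet.ADJ, "V":wordnet.VERB, "R":wordnet.ADV}
    
    lemmatizer = WordNetLemmatizer()
    lemm_token_list = set()
    for token,pos in nltk.pos_tag(token_nostop_list):
        # keep only nouns, adjectives, verbs and adverbs; tokens with any other tag are dropped
        if pos[0] in pos_dict:
            lemm_token_list.add(lemmatizer.lemmatize(token, pos_dict[pos[0]]))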
        
    return list(lemm_token_list)    
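
To see what the first steps do without downloading any NLTK data, here is a reduced sketch that covers only normalization, whitespace tokenization and stop-word removal. The stop-word set `TOY_STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list, and no lemmatization is performed.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (assumption, illustration only)
TOY_STOP_WORDS = {"this", "in", "the", "of", "a"}

def normalize_and_tokenize(text):
    # lower-case, then replace everything that is not a letter or digit with a space
    cleaned = re.sub(r"[^a-z0-9]", " ", text.lower())
    # split on whitespace and drop stop words
    return [tok for tok in cleaned.split() if tok not in TOY_STOP_WORDS]

tokens = normalize_and_tokenize("This Week in Data Science (April 18, 2017)")
print(tokens)
```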

### Step 2

We will now use sklearn's CountVectorizer on the 'doc_full_name' column to construct the document-term matrix where the rows are unique article names present in the dataset and the columns are tokens of the vocabulary constructed from the same dataset.

In [35]:
countvec = CountVectorizer(analyzer=text_to_word)
article_by_content = countvec.fit_transform(article_content['doc_full_name'])

If we take the dot product of the matrix article_by_content with its transpose, we get an (n_article x n_article) matrix, where n_article is the number of unique articles present in the dataset. Each cell of this matrix measures how similar two articles are, here as the number of tokens their titles share. Note that in each row the diagonal element is the maximum, since an article is always most similar to itself. 

In [36]:
# use the sparse matrix's own dot method rather than np.dot, which is not designed for sparse inputs
similarity_matrix = article_by_content.dot(article_by_content.transpose()).toarray()
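
The structure of this similarity matrix can be illustrated with plain NumPy on three made-up titles. The binary document-term matrix is built by hand here instead of with CountVectorizer, and the titles are hypothetical.

```python
import numpy as np

# Hypothetical, already-cleaned titles
titles = ["spark machine learning",
          "machine learning with python",
          "data science news"]

# unique tokens per title, as returned by a set-based tokenizer
token_sets = [set(title.split()) for title in titles]
vocab = sorted(set().union(*token_sets))

# binary document-term matrix: 1 if the token occurs in the title
doc_term = np.array([[int(tok in tokens) for tok in vocab]
                     for tokens in token_sets])

# entry (i, j) is the number of tokens shared by titles i and j
similarity = doc_term @ doc_term.T
print(similarity)
```

The diagonal holds each title's own token count, which is why it is the largest entry in its row.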

### Step 3

We will now use this similarity matrix to find articles which are similar in content to a given article.

In [37]:
def find_similar_articles(article_id, article_content=article_content, similarity_matrix=similarity_matrix):
    
    """
    Given an article id, this function returns a list of articles which are similar to that of the input article
    in terms of their content.
    
    Parameters
    ------------
    article_id : int or float
        id of the input article
        
    article_content : pandas dataframe
        a dataframe containing ids and full names of articles
        
    similarity_matrix : array
        an (n_article x n_article) numpy array where n_article is the number of unique articles present in the 
        dataset and each entry in this array represents how similar two articles are 
        
        
    Returns
    ---------
    similar_id : list
         ids of the articles similar to the input article
         
    similarity_score : list    
         similarity scores of the similar articles
         
    similar_dict : dict
         a dictionary whose keys are ids of similar articles and values are the corresponding similarity scores,
         sorted in descending order of similarity score
    
    """
    # find out which row of the dataframe does the input article id belong to
    article_row = np.where(article_content['article_id']==article_id)[0][0]
    
    # find out the row numbers of similar articles
    similar_row = list(np.where(similarity_matrix[article_row] > 2)[0])
    
    # store the corresponding similarity scores
    similarity_score = list(similarity_matrix[article_row, similar_row])
    
    # store the ids of similar articles
    similar_id = list(article_content.iloc[similar_row]['article_id'])
    
    # map each similar article id to its score (built without reusing the name 'similar_id' as a loop
    # variable, which would overwrite the list returned below)
    similar_dict = dict(zip(similar_id, similarity_score))
        
    # sort the dictionary according to the similarity scores        
    similar_dict = {k:v for k,v in sorted(similar_dict.items(), key=lambda x:x[1], reverse=True)}    
    
    return similar_id, similarity_score, similar_dict
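
The thresholding and sorting steps can be sketched on a small matrix. The helper `similar_rows` and the matrix values are hypothetical; unlike the function above, this sketch removes the input row explicitly instead of relying on it being the first entry after sorting.

```python
import numpy as np

def similar_rows(similarity_matrix, row, threshold=2):
    scores = similarity_matrix[row]
    # rows whose similarity to the input row exceeds the threshold
    candidates = np.where(scores > threshold)[0]
    # drop the input row itself
    candidates = candidates[candidates != row]
    # stable sort by descending score
    order = np.argsort(-scores[candidates], kind="stable")
    return [(int(i), int(scores[i])) for i in candidates[order]]

# Made-up symmetric similarity matrix for four articles
sim = np.array([[5, 3, 1, 4],
                [3, 4, 0, 2],
                [1, 0, 3, 0],
                [4, 2, 0, 6]])

print(similar_rows(sim, 0))
```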

### Step 4

Finally we build the function 'make_content_recs', which makes recommendations for a given user based on article content. The method can be broken down into the following steps:

- find the articles that our input user has read
- for each article that our user has read 
     - find other articles that are similar to it 
     - select only those similar articles which our user has not read yet
     - add them to our recommendation list (the implementation below keeps unique ids in ascending order of article id rather than sorting by number of interactions)

In [38]:
def make_content_recs(user_id, article_content=article_content, n=10):
    
    """
    A function to make recommendations based on content of the articles. Loops through all the articles that the 
    input user has interacted with, find articles similar in content for each of them, choose those with which our
    user has not interacted yet and recommend them. One important point to note here is that not all articles of 
    the dataframe 'df' have information about content in the dataframe 'article_content'. Also not all articles
    that have a description, have interactions with users (i.e. present in the dataset 'df').
    
    Parameters
    ------------
    user_id : int
        id of the input user for whom recommendations are to be made
         
    article_content : pandas dataframe
        a dataframe containing ids and full names of articles
        
    n : int
        the number of recommendations we want to make for the input user
        
    Returns
    ---------
    rec_ids : list
        a list of ids of the recommended articles
        
    rec_names : list
        a list of names of the recommended articles
    
    """

    # find out the articles that our user has interacted with and also have information about content
    articles_read = np.intersect1d(get_user_articles(user_id)[0], article_content['article_id'].unique())
    
    # array to store our recommendations
    recs = np.array([])
    
    for article in articles_read:
        # find out the articles similar in content
        similar_articles = list(find_similar_articles(article)[2].keys())[1:]
        
        # choose articles that have similar content and also no interaction with the user
        articles_not_read = np.setdiff1d(similar_articles, articles_read, assume_unique=True)
        
        # store them in the array
        recs = np.unique(np.concatenate([recs, articles_not_read]))
       
        # if the length of the array exceeds the max. number of recommendations we want to make, break the loop  
        if len(recs) > n:
            break
            
    #  return the first 'n' elements of our recommendation array in case it has more than n entries   
    rec_ids = list(recs[:n])   
    
    # return the names of the recommended articles
    rec_names = list(article_content[article_content['article_id'].isin(rec_ids)]['doc_full_name'].values)
            
    return rec_ids, rec_names

Let's make recommendations for a user based on the content of the articles that the user has already interacted with.

In [39]:
rec_ids, rec_names = make_content_recs(90)

In [40]:
print("The following articles are recommended for user {} on the basis of article content :".format(90))
print()
print(rec_names)

The following articles are recommended for user 90 on the basis of article content :

['This Week in Data Science (September 27, 2016)', 'This Week in Data Science (May 2, 2017)', 'This Week in Data Science (July 26, 2016)', 'This Week in Data Science (December 20, 2016)', 'This Week in Data Science (November 08, 2016)', 'This Week in Data Science (November 15, 2016)', 'This Week in Data Science (February 28, 2017)', 'This Week in Data Science (November 01, 2016)', 'This Week in Data Science (February 7, 2017)', 'This Week in Data Science (July 12, 2016)']


**Comments**
--------------

Using this technique, however, we might not be able to give every user the same number of recommendations, and we might even fail to provide any recommendations to certain users. This is because not all the articles that users have interacted with (as seen in the dataframe 'df') have content information available in the other dataframe 'df_content', and the content based recommender built above only works for articles with content information. In this situation, we can try to combine user-based collaborative filtering with content based recommendations to provide recommendations to all existing users.

One can also try other features, *e.g.* 'doc_description', to see whether they improve the present recommendation engine.
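
One simple way to combine two recommenders, as suggested in the comments above, is to fill up a short primary list from a fallback list. The sketch below is only one possible design; the helper `fill_recommendations` and the sample ids are hypothetical and are not part of this notebook.

```python
def fill_recommendations(primary, fallback, already_seen, n=10):
    # ids the user has read, plus ids that were already recommended
    seen = set(already_seen)
    recs = []
    # primary recommendations come first, the fallback only fills the gap
    for article_id in list(primary) + list(fallback):
        if article_id not in seen:
            recs.append(article_id)
            seen.add(article_id)
        if len(recs) == n:
            break
    return recs

# Content-based recs as primary, collaborative-filtering recs as fallback (made-up ids)
combined = fill_recommendations(primary=[3, 5], fallback=[5, 7, 1, 9],
                                already_seen=[1], n=3)
print(combined)
```

In this notebook the two input lists could come from `make_content_recs` and `user_user_recs`; which of the two acts as the primary source is a design choice.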