![Alt text](./watson_image.png)




# <a id="asso-title"> Recommendations for Articles with IBM </a>


### Building a recommendation system for data science articles on IBM Watson Studio.

<br/><br/>



## <a id="toc"> Table of Contents </a>

- Notebook I: Data Collection, Cleaning, and EDA.


- Notebook II: Rank-Based Recommendations and Neighbor Collaborative Filtering.


- Notebook III: Matrix Factorization Collaborative Filtering.


- Notebook IV: More EDA, Preprocessing, Content-based Recommendation.






## <a id="notebook1"> This is Notebook II: Rank-Based Recommendations and Neighbor Collaborative Filtering.</a>

0. [Data Loading.](#Loading)<br>


Part A: Rank-based Recommendation ("Popular/Trending...")

1. [Ranking and Recommendation](#Rank)<br>

        
Part B: Neighbor-based Collaborative Filtering

("People who watch SpongeBob SquarePants also watch...")
     
2. [Data Preprocessing: user-item matrix.](#user-item)<br>
    

3. [Similar users and Collaborative Filtering.](#user-user)<br>

    
4. [Data Saving.](#save)<br>

In [2]:
# This is a Python 3 environment with the Anaconda distribution of standard libraries.
# One additional module is used: pickle (part of the Python standard library).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

%matplotlib inline
# Note: Matplotlib 3.6+ renamed this style to 'seaborn-v0_8'
plt.style.use('seaborn')

## <a class="anchor" id="Loading"> 0. Data Loading and Overview.</a>

In [3]:
# Read a prepared dataset to start with
# We shall collect more data via the IBM Watson API later
with open('1_data_df.pkl', 'rb') as pickle1:
    df = pickle.load(pickle1)
    
with open('1_data_df_content.pkl', 'rb') as pickle2:
    df_content = pickle.load(pickle2)

# overview of df
df.head()

| | article_id | title | user_id |
|---|---|---|---|
| 0 | 1430 | using pixiedust for fast, flexible, and easier... | 1 |
| 1 | 1314 | healthcare python streaming application demo | 2 |
| 2 | 1429 | use deep learning for image classification | 3 |
| 3 | 1338 | ml optimization using cognitive assistant | 4 |
| 4 | 1276 | deploy your python model as a restful api | 5 |


In [4]:
# overview of df_content 
df_content.head()

| | doc_body | doc_description | doc_full_name | doc_status | article_id |
|---|---|---|---|---|---|
| 0 | Skip navigation Sign in SearchLoading...\r\n\r... | Detect bad readings in real time using Python ... | Detect Malfunctioning IoT Sensors with Streami... | Live | 0 |
| 1 | No Free Hunch Navigation * kaggle.com\r\n\r\n ... | See the forest, see the trees. Here lies the c... | Communicating data science: A guide to present... | Live | 1 |
| 2 | ☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat... | Here’s this week’s news in Data Science and Bi... | This Week in Data Science (April 18, 2017) | Live | 2 |
| 3 | DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA... | Learn how distributed DBs solve the problem of... | DataLayer Conference: Boost the performance of... | Live | 3 |
| 4 | Skip navigation Sign in SearchLoading...\r\n\r... | This video demonstrates the power of IBM DataS... | Analyze NY Restaurant data using Spark in DSX | Live | 4 |


## <a class="anchor" id="Rank">1. Ranking and Recommendation by Popularity .</a>
 

In our data, we can count how many times a user interacted with an article. However, it is not clear that these counts can be interpreted as ratings, that is, as evidence of whether a user liked an article.

For this reason, we treat interactions between users and articles as binary, namely whether or not someone read an article, and disregard the number of times they read it.

We first define a function that ranks articles by popularity, then use it to find the most popular items.

In [10]:
# Rank articles by number of interactions.
# Note: count() tallies every row, so repeat reads by the same user are included.
def get_top_article_ids(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article id's
    
    '''
    top_articles=df.groupby('article_id').count().sort_values('title', ascending=False).index[:n]
 
    return list(top_articles) # Return the top article ids

In [11]:
get_top_article_ids(10)

[1429, 1330, 1431, 1427, 1364, 1314, 1293, 1170, 1162, 1304]
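
The `groupby(...).count()` call above tallies every interaction row, so an article read many times by one user can outrank an article read once by many users. The following is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the dataset) comparing row counts with distinct-user counts, which is closer to the binary view described earlier.

```python
import pandas as pd

# Hypothetical interaction log: user 1 read article 10 four times,
# while article 20 was read once each by three different users.
toy = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 2, 3, 4],
    "article_id": [10, 10, 10, 10, 20, 20, 20],
})

# Counting rows: every interaction counts, including repeat reads.
by_rows = toy.groupby("article_id").size().sort_values(ascending=False)

# Counting distinct users: each user counts once per article (binary view).
by_users = (
    toy.groupby("article_id")["user_id"]
       .nunique()
       .sort_values(ascending=False)
)

print("top article by rows: ", by_rows.index[0])
print("top article by users:", by_users.index[0])
```

Which of the two counts is the better popularity signal depends on the goal; the notebook's function uses the row count.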

In [12]:
# return article titles given article ids
def get_article_names(article_ids, df=df):
    '''
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column)
    '''
    
    article_names = [df[df.article_id==x].title.unique()[0] for x in article_ids if sum(df.article_id==x)>0]
    
    return article_names # Return the article names associated with list of article ids

In [13]:
get_article_names(get_top_article_ids(10))

['use deep learning for image classification',
 'insights from new york car accident reports',
 'visualize car data with brunel',
 'use xgboost, scikit-learn & ibm watson machine learning apis',
 'predicting churn with the spss random tree algorithm',
 'healthcare python streaming application demo',
 'finding optimal locations of new store using decision optimization',
 'apache spark lab, part 1: basic concepts',
 'analyze energy consumption in buildings',
 'gosales transactions for logistic regression model']

In [16]:
def get_top_articles(n, df=df):
    '''
    
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article titles 
    
    '''
    top_articles = get_article_names(get_top_article_ids(n,df=df))
    
    return top_articles # Return the top article titles from df (not df_content)

In [17]:
get_top_articles(5)

['use deep learning for image classification',
 'insights from new york car accident reports',
 'visualize car data with brunel',
 'use xgboost, scikit-learn & ibm watson machine learning apis',
 'predicting churn with the spss random tree algorithm']

## Part B: Neighbor Based Collaborative Filtering.

## <a class="anchor" id="user-item">2. Data-Preprocessing: user-item matrix.</a>

We reformat the **df** dataframe to be shaped with users as the rows and articles as the columns.  

* Each **user** appears in exactly one **row**.


* Each **article** appears in exactly one **column**.


* **If a user has interacted with an article, place a 1 where that user's row meets that article's column.** It does not matter how many times the user has interacted with the article; every such entry is a 1.


* **If a user has not interacted with an article, place a 0 where that user's row meets that article's column.**
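
The rules above can be sketched on a tiny, made-up interaction log. The `toy` frame below is hypothetical; it only mirrors the column names of **df**. Unlike the notebook's function, this sketch selects the `title` column before unstacking, so the resulting columns are a single-level index of article ids.

```python
import pandas as pd

# Hypothetical interactions: user 1 read article 10 twice and article 20 once.
toy = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 3],
    "article_id": [10, 10, 20, 20, 30],
    "title":      ["a", "a", "b", "b", "c"],
})

# Count interactions per (user, article), then pivot articles into columns.
counts = toy.groupby(["user_id", "article_id"])["title"].count().unstack()

# Any recorded interaction becomes 1; pairs with no interaction (NaN) become 0.
toy_user_item = counts.notna().astype(int)

print(toy_user_item)
```

User 1's repeated read of article 10 still yields a single 1, which is the binary treatment described above.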



In [18]:
# sanity check
df[df.user_id==1]

| | article_id | title | user_id |
|---|---|---|---|
| 0 | 1430 | using pixiedust for fast, flexible, and easier... | 1 |
| 268 | 1430 | using pixiedust for fast, flexible, and easier... | 1 |
| 1143 | 732 | rapidly build machine learning flows with dsx | 1 |
| 1562 | 1429 | use deep learning for image classification | 1 |
| 1710 | 43 | deep learning with tensorflow course by big da... | 1 |
| 1712 | 109 | tensorflow quick tips | 1 |
| 2047 | 1232 | country statistics: life expectancy at birth | 1 |
| 3839 | 310 | time series prediction using recurrent neural ... | 1 |
| 4042 | 1293 | finding optimal locations of new store using d... | 1 |
| 4664 | 1406 | uci: iris | 1 |


In [26]:
# number of articles that user 1 read
df[df.user_id==1].article_id.unique().shape[0]

36

In [20]:
# create the user-article matrix with 1's and 0's

def create_user_item_matrix(df):
    '''
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns
    
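    OUTPUT:
    user_item - (numpy array) user-item matrix
    user_item_df - (pandas dataframe) the same matrix, with user ids as the index and article ids as columns
    
    Description:
    Return a matrix with user ids as rows and article ids as columns, with 1 where a user interacted with 
    an article and 0 otherwise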
    '''
    user_item_df = df.groupby(['user_id','article_id']).count().astype('int').unstack()
    user_item_df = (~user_item_df.isna())*1
    user_item = user_item_df.values
    
    return user_item, user_item_df

user_item, user_item_df = create_user_item_matrix(df)

In [21]:
# array version
user_item.shape

(5149, 714)

In [22]:
# dataframe version
user_item_df

Columns form a two-level index (`title` over `article_id`); rows are indexed by `user_id`.

| user_id | 0 | 2 | 4 | 8 | 9 | 12 | 14 | 15 | 16 | 18 | ... | 1434 | 1435 | 1436 | 1437 | 1439 | 1440 | 1441 | 1442 | 1443 | 1444 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5145 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5146 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5147 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5148 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [27]:
# double check our function with user 1 data
user_item.sum(axis=1)[0] == 36

True

## <a class="anchor" id="user-user">3. Similar Users and Collaborative Filtering.</a>

We define the function below, which takes a user_id and returns a list of the other users, ordered from most similar to least similar.

The key question here is which similarity measure to use.

Given the type of our data and our goal, the following two similarity measures make the most sense:

- **cosine similarity**: normalizes by the length of each user's vector, so it emphasizes how closely two users' reading patterns match, regardless of how active they are.


- **dot product**: for binary vectors, this counts the articles both users have read, so it favors more active users and more popular items.


We go with the dot product first.
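
To make the difference between the two measures concrete, here is a small sketch with made-up binary vectors (`target`, `light`, and `heavy` are hypothetical users, and `cosine_similarity` is a helper defined only for this illustration).

```python
import numpy as np

# Hypothetical binary reading vectors over nine articles.
target = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0])  # read articles 0, 1, 2
light  = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0])  # read articles 0, 1 only
heavy  = np.ones(9, dtype=int)                  # read every article

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product: number of articles shared with the target user.
dot_light = target @ light
dot_heavy = target @ heavy

# Cosine similarity: overlap relative to how much each user reads.
cos_light = cosine_similarity(target, light)
cos_heavy = cosine_similarity(target, heavy)

print("dot:   ", dot_light, dot_heavy)
print("cosine:", round(cos_light, 3), round(cos_heavy, 3))
```

In this toy case the dot product ranks the heavy reader first (3 shared articles against 2), while cosine similarity ranks the light reader first because almost everything that user read overlaps with the target.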

In [29]:
# similar user with respect to dot product

def find_similar_users(user_id, user_item=user_item_df):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first
    
    Description:
    Computes the similarity of every other user to the given user based on the dot product
    Returns an ordered list of user ids
    
    '''
    # fillna for user_item
    user_item.fillna(0, inplace=True)
    # compute similarity of each user to the provided user
    similarities = np.matmul(user_item.drop(user_id, axis=0).values, user_item.loc[user_id].values.reshape(-1,1))
    similarities = pd.DataFrame(data=similarities, index=user_item.drop(user_id, axis=0).index, columns=['score'])
    # sort by similarity
    similarities = similarities.sort_values('score', ascending=False)
    # create list of just the ids
    most_similar_users = list(similarities.index)
    
       
    return most_similar_users # return a list of the users in order from most to least similar
        

In [30]:
# sanity check
print("The 10 most similar users to user 1 are: {}".format(find_similar_users(1)[:10]))
print("The 5 most similar users to user 3933 are: {}".format(find_similar_users(3933)[:5]))


The 10 most similar users to user 1 are: [3933, 23, 3782, 203, 4459, 3870, 131, 46, 4201, 395]
The 5 most similar users to user 3933 are: [1, 23, 3782, 4459, 203]


In [35]:
user_item_df.loc[3782].sum()

135

In [36]:
user_item_df.loc[23].sum()

135

In [37]:
user_item_df.loc[4459].sum()

96

Next, we use these similar users to find articles to recommend.

In [31]:
# first, a function that returns the ids and names of the articles a user has read.
def get_user_articles(user_id, user_item=user_item_df):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column in df)
    
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user
    '''
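    # select the articles this user has interacted with
    # (use the user_item argument rather than the global user_item_df)
    mask = (user_item.loc[user_id] != 0).values
    
    article_ids = list(user_item.columns.get_level_values(1)[mask])
    article_names = get_article_names(article_ids)
    
    return article_ids, article_names # return the ids and names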

In [32]:
# neighbor-based recommendation
def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    
    For the user where the number of recommended articles starts below m 
    and ends exceeding m, the last items are chosen arbitrarily
    
    '''
    similar_users = find_similar_users(user_id)
    seen_articles = get_user_articles(user_id)[0]
    recs=[]
    
    for user in similar_users:
        user_articles = get_user_articles(user)[0]
        
        if user_articles:
            rec = [x for x in user_articles if x not in seen_articles+recs]
            recs.extend(rec)
            
        if len(recs)>=m:
            break
    
    recs = recs[:m]  
    
    return recs # return your recommendations for this user_id    

In [33]:
# Check Results
get_article_names(user_user_recs(1, 10)) # Return 10 recommendations for user 1

['this week in data science (april 18, 2017)',
 'timeseries data analysis of iot events by using jupyter notebook',
 'got zip code data? prep it for analytics. – ibm watson data lab – medium',
 'higher-order logistic regression for large datasets',
 'using machine learning to predict parking difficulty',
 'deep forest: towards an alternative to deep neural networks',
 'experience iot with coursera',
 'using brunel in ipython/jupyter notebooks',
 'graph-based machine learning',
 'the 3 kinds of context: machine learning and the art of the frame']
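
The neighbor loop can be illustrated without the full dataset. The sketch below is a simplified stand-in, not the notebook's function: `recommend_from_neighbors` is a hypothetical helper that takes plain lists, and it tracks excluded ids in a set rather than concatenating lists on every check.

```python
def recommend_from_neighbors(seen, neighbor_articles, m):
    """Collect up to m unseen article ids, walking neighbors in order.

    seen              - article ids the target user has already read
    neighbor_articles - one list of article ids per neighbor, most similar first
    m                 - number of recommendations wanted
    """
    if m <= 0:
        return []
    excluded = set(seen)   # ids that must not be recommended (again)
    recs = []
    for articles in neighbor_articles:
        for article_id in articles:
            if article_id not in excluded:
                recs.append(article_id)
                excluded.add(article_id)
                if len(recs) == m:
                    return recs
    return recs

# Hypothetical example: the target user has read articles 1 and 2.
seen = [1, 2]
neighbor_articles = [[1, 3, 4], [2, 4, 5], [6]]
recs = recommend_from_neighbors(seen, neighbor_articles, m=3)
print(recs)
```

As in the notebook's version, the order of neighbors decides which articles surface first, which is why the ordering of equally similar users matters.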

### Improving our model:

* In the above, users who are equally close to a given user are ordered arbitrarily.


* Next, we improve the model by ranking users that are at the same distance from our target user: those with more article interactions come first.
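
A tie-break of this kind can be expressed as a two-key sort in pandas. The sketch below uses a made-up `neighbors` frame (user ids and values are hypothetical) to show the effect.

```python
import pandas as pd

# Hypothetical neighbors: two pairs of users tied on similarity.
neighbors = pd.DataFrame(
    {
        "similarity":       [17, 15, 17, 15],
        "num_interactions": [40, 96, 135, 20],
    },
    index=pd.Index([101, 102, 103, 104], name="user_id"),
)

# Sort by similarity first; among ties, the more active user comes first.
ranked = neighbors.sort_values(
    ["similarity", "num_interactions"], ascending=False
)

print(list(ranked.index))
```

Users 103 and 101 share the top similarity, and 103 is placed first because of its higher interaction count.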



In [34]:
# sort users by similarity, then by number of articles read
def get_top_sorted_users(user_id, df=df, user_item=user_item_df):
    '''
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook 
    user_item - (pandas dataframe) matrix of users by articles: 
            1's when a user has interacted with an article, 0 otherwise
    
            
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                    neighbor_id - is a neighbor user_id
                    similarity - measure of the similarity of each user to the provided user_id
                    num_interactions - the number of articles viewed by the user - if a u
                    
    Other Details - sort the neighbors_df by the similarity and then by number of interactions where 
                    highest of each is higher in the dataframe
     
    '''
    # fillna for user_item
    user_item.fillna(0, inplace=True)
    # reshape the user_id row to column
    user_column = user_item.loc[user_id].values.reshape(-1,1)
    # column of ones
    ones_column = np.ones_like(user_column)
    # concatenate the two columns
    columns = np.concatenate([user_column, ones_column], axis=1)
    # compute similarity of each user to the provided user, as well as number of interaction of each user
    similarities = np.matmul(user_item.drop(user_id, axis=0).values, columns)
    # transform into dataframe
    similarities = pd.DataFrame(data=similarities, index=user_item.drop(user_id, axis=0).index, columns=['similarity','num_interactions']) 
    # sort values
    neighbors_df = similarities.sort_values(['similarity','num_interactions'], ascending=False)
    
    
    return neighbors_df # Return the dataframe specified in the doc_string

In [35]:
get_top_sorted_users(1)

| user_id | similarity | num_interactions |
|---|---|---|
| 3933 | 35 | 35 |
| 23 | 17 | 135 |
| 3782 | 17 | 135 |
| 203 | 15 | 96 |
| 4459 | 15 | 96 |
| ... | ... | ... |
| 5141 | 0 | 1 |
| 5144 | 0 | 1 |
| 5147 | 0 | 1 |
| 5148 | 0 | 1 |
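
`get_top_sorted_users` obtains both columns from a single matrix product: multiplying the user-item matrix by the target user's vector gives the overlap with each user, and multiplying by a column of ones gives each user's row sum. A minimal sketch with made-up arrays (`others` and `target` are hypothetical):

```python
import numpy as np

# Hypothetical binary rows for three other users over four articles.
others = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
])
target = np.array([1, 1, 0, 1])  # the target user's row

# Stack the target vector next to a vector of ones: shape (4, 2).
columns = np.column_stack([target, np.ones_like(target)])

# Column 0: articles shared with the target; column 1: articles read in total.
result = others @ columns

print(result)
```

Each row of `result` is a `(similarity, num_interactions)` pair for one of the other users.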


In [36]:
def user_user_recs_part2(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user by article id
    rec_names - (list) a list of recommendations for the user by article title
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    * Among equally similar users, those with the most total article interactions 
    are chosen before those with fewer article interactions.

    * Articles from each neighbor are taken in article-id order; they are not 
    re-ranked by total interactions.
   
    '''
    similar_users = get_top_sorted_users(user_id).index.values
    seen_articles = get_user_articles(user_id)[0]
    recs=[]
    
    for user in similar_users:
        user_articles = get_user_articles(user)[0]
        
        if user_articles:
            rec = [x for x in user_articles if x not in seen_articles+recs]
            recs.extend(rec)
            
        if len(recs)>=m:
            break
    
    recs = recs[:m]  
    
    rec_names = get_article_names(recs)
    
    
    return recs, rec_names

In [37]:
# Quick spot check of the recommendations for user 20
rec_ids, rec_names = user_user_recs_part2(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)

The top 10 recommendations for user 20 are the following article ids:
[12, 14, 29, 33, 43, 51, 109, 111, 130, 142]

The top 10 recommendations for user 20 are the following article names:
['timeseries data analysis of iot events by using jupyter notebook', 'got zip code data? prep it for analytics. – ibm watson data lab – medium', 'experience iot with coursera', 'using brunel in ipython/jupyter notebooks', 'deep learning with tensorflow course by big data university', 'modern machine learning algorithms', 'tensorflow quick tips', 'tidy up your jupyter notebooks with scripts', "feature importance and why it's important", 'neural networks for beginners: popular types and applications']
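
One possible refinement is to order candidate articles by popularity before truncating to `m`. The sketch below is an illustration under assumptions, not part of the notebook: the `toy` frame, the `candidates` list, and the use of distinct-user counts as the popularity score are all hypothetical choices.

```python
import pandas as pd

# Hypothetical interaction log: article 30 has three readers,
# article 20 has two, and article 10 has one.
toy = pd.DataFrame({
    "user_id":    [1, 2, 3, 1, 2, 1],
    "article_id": [30, 30, 30, 20, 20, 10],
})

# Candidate article ids collected from one neighbor, in article-id order.
candidates = [10, 20, 30]

# Popularity score: number of distinct users per article.
popularity = toy.groupby("article_id")["user_id"].nunique()

# Most popular candidates first; articles missing from the log score 0.
ranked = sorted(candidates, key=lambda a: popularity.get(a, 0), reverse=True)

print(ranked)
```

Applied inside the neighbor loop, a step like this would let the most widely read unseen articles fill the recommendation list first.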
