# Recommendations with IBM

In this notebook, we will be putting recommendation skills to use on real data from the IBM Watson Studio platform. 

By following the table of contents, we will build out several methods for making recommendations, each suited to a different situation. 


## Table of Contents

I. [Exploratory Data Analysis](#Exploratory-Data-Analysis)<br>
II. [Rank Based Recommendations](#Rank)<br>
III. [User-User Based Collaborative Filtering](#User-User)<br>
IV. [Content Based Recommendations (EXTRA - NOT REQUIRED)](#Content-Recs)<br>
V. [Matrix Factorization](#Matrix-Fact)<br>
VI. [Extras & Concluding](#conclusions)

Let's get started by importing the necessary libraries and reading in the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import os 

%matplotlib inline

os.chdir(r'C:\Users\ogzpython\Desktop\ml\ibm_rec')
df = pd.read_csv(r'.\data\user-item-interactions.csv')
df_content = pd.read_csv(r'.\data\articles_community.csv')
del df['Unnamed: 0']
del df_content['Unnamed: 0']

In [2]:
# Show df_content to get an idea of the data
df_content.head()

Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
0,Skip navigation Sign in SearchLoading...\r\n\r...,Detect bad readings in real time using Python ...,Detect Malfunctioning IoT Sensors with Streami...,Live,0
1,No Free Hunch Navigation * kaggle.com\r\n\r\n ...,"See the forest, see the trees. Here lies the c...",Communicating data science: A guide to present...,Live,1
2,☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat...,Here’s this week’s news in Data Science and Bi...,"This Week in Data Science (April 18, 2017)",Live,2
3,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA...",Learn how distributed DBs solve the problem of...,DataLayer Conference: Boost the performance of...,Live,3
4,Skip navigation Sign in SearchLoading...\r\n\r...,This video demonstrates the power of IBM DataS...,Analyze NY Restaurant data using Spark in DSX,Live,4


In [3]:
df.head()

Unnamed: 0,article_id,title,email
0,1430.0,"using pixiedust for fast, flexible, and easier...",ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7
1,1314.0,healthcare python streaming application demo,083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b
2,1429.0,use deep learning for image classification,b96a4f2e92d8572034b1e9b28f9ac673765cd074
3,1338.0,ml optimization using cognitive assistant,06485706b34a5c9bf2a0ecdac41daf7e7654ceb7
4,1276.0,deploy your python model as a restful api,f01220c46fc92c6e6b161b1849de11faacd7ccb2


### <a class="anchor" id="Exploratory-Data-Analysis">Part I : Exploratory Data Analysis</a>

Use the dictionary and cells below to provide some insight into the descriptive statistics of the data.

`1.` What is the distribution of the number of articles a user interacts with in the dataset?  Provide a visual and descriptive statistics that summarize how many times each user interacts with an article.  

In [4]:
df['email'].value_counts().describe()

count    5148.000000
mean        8.930847
std        16.802267
min         1.000000
25%         1.000000
50%         3.000000
75%         9.000000
max       364.000000
Name: email, dtype: float64

In [5]:
median_val = df['email'].value_counts().describe()['50%'] 
# 50% of individuals interact with 3 articles or fewer.
median_val

3.0

In [6]:
max_views_by_user = df['email'].value_counts().describe()['max'] 
max_views_by_user
# The maximum number of user-article interactions by any 1 user is 364.

364.0
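
The question above also asks for a visual. A minimal sketch of one way to draw it, assuming a small toy series in place of `df['email']` (the names `toy_emails` and `interactions_per_user` are illustrative and not part of the notebook):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for df['email']; the real column holds hashed email strings.
toy_emails = pd.Series(["a", "a", "a", "b", "b", "c", "d", "d", "d", "d"])

# One row per user, value = number of interactions by that user.
interactions_per_user = toy_emails.value_counts()

# One bin per integer count so the skew towards low counts stays visible.
max_count = int(interactions_per_user.max())
fig, ax = plt.subplots()
ax.hist(interactions_per_user, bins=range(1, max_count + 2))
ax.set_xlabel("Interactions per user")
ax.set_ylabel("Number of users")
ax.set_title("Distribution of user-article interactions")
```

On the real data the distribution is heavily right-skewed (median 3, maximum 364), so a logarithmic y-axis via `ax.set_yscale("log")` may make the tail easier to read.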

`2.` Remove duplicate articles from the **df_content** dataframe, keeping the first row for each `article_id`.  

In [7]:
df_content.drop_duplicates(subset= ['article_id'],inplace= True)
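
Before dropping rows in place, it can help to count how many duplicates exist. A small sketch on a hypothetical toy frame (`toy_content` is illustrative only and mirrors just two columns of `df_content`):

```python
import pandas as pd

# Toy stand-in for df_content with one repeated article_id.
toy_content = pd.DataFrame({
    "article_id": [0, 1, 1, 2],
    "doc_full_name": ["A", "B", "B (copy)", "C"],
})

# duplicated() flags every repeat after the first occurrence of an article_id.
n_duplicates = int(toy_content.duplicated(subset=["article_id"]).sum())

# keep="first" is the default and matches the in-place call used in the notebook.
deduped = toy_content.drop_duplicates(subset=["article_id"], keep="first")
```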

`3.` Use the cells below to find:

**a.** The number of unique articles that have an interaction with a user. 

In [8]:
unique_articles = df['article_id'].value_counts().count()
unique_articles

714

##### There are 714 unique articles in the user dataset that have at least one interaction.

**b.** The number of unique articles in the dataset (whether they have any interactions or not).<br>

In [9]:
total_articles = df_content['article_id'].value_counts().count()
total_articles

1051

##### There are 1051 unique articles in the content dataset.

**c.** The number of unique users in the dataset. (excluding null values) <br>

In [10]:
unique_users = df['email'].value_counts().describe()['count']
unique_users

5148.0

##### There are 5148 unique users in the dataset.

**d.** The number of user-article interactions in the dataset.

In [11]:
user_article_interactions = df.shape[0]
user_article_interactions

45993

#### There are 45993 user-article interactions in the dataset.

`4.` Find the most viewed **article_id**, as well as how often it was viewed.  After talking to the company leaders, the `email_mapper` function was deemed a reasonable way to map users to ids.  There were a small number of null values, and it was found that all of these null values likely belonged to a single user (which is how they are stored using the function below).

In [12]:
df.groupby(['article_id']).agg('count')['title'].sort_values(ascending= False).reset_index()[0:1]

Unnamed: 0,article_id,title
0,1429.0,937


##### Article id 1429 is the most viewed, with 937 interactions.
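
The same lookup can be sketched with `value_counts`, which avoids the group-by and the slice. The toy frame below is an assumption for illustration; on the real `df` the two variables would hold the id and count reported above.

```python
import pandas as pd

# Toy stand-in for df['article_id'].
toy_df = pd.DataFrame(
    {"article_id": [1429.0, 1330.0, 1429.0, 1431.0, 1429.0, 1330.0]}
)

# value_counts() returns interactions per article id.
views = toy_df["article_id"].value_counts()

most_viewed_article_id = str(views.idxmax())  # id with the highest count
max_views = int(views.max())                  # how often it was viewed
```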

In [13]:
# Run this cell to map the user email to a user_id column and remove the email column

def email_mapper():
    coded_dict = dict()
    cter = 1
    email_encoded = []
    
    for val in df['email']:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter+=1
        
        email_encoded.append(coded_dict[val])
    return email_encoded

email_encoded = email_mapper()
del df['email']
df['user_id'] = email_encoded
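
For comparison, a vectorized sketch of the same idea using `pandas.factorize`. It assumes, as described above, that all null emails should share a single id; the placeholder string `"__missing__"` and the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for df['email'], including two null values.
toy_emails = pd.Series(["a", "b", "a", None, "c", None])

# Replace nulls with one shared placeholder so they map to a single user.
filled = toy_emails.fillna("__missing__")

# factorize() numbers values by first appearance starting at 0,
# so adding 1 reproduces the 1-based ids that email_mapper assigns.
codes, uniques = pd.factorize(filled)
toy_user_ids = (codes + 1).tolist()
```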



In [14]:
df.head()

Unnamed: 0,article_id,title,user_id
0,1430.0,"using pixiedust for fast, flexible, and easier...",1
1,1314.0,healthcare python streaming application demo,2
2,1429.0,use deep learning for image classification,3
3,1338.0,ml optimization using cognitive assistant,4
4,1276.0,deploy your python model as a restful api,5


### <a class="anchor" id="Rank">Part II: Rank-Based Recommendations</a>

We don't actually have ratings for whether a user liked an article or not; we only know that a user has interacted with an article.  In this case, the popularity of an article can only be based on how often it was interacted with.

`1.` The functions below return the **n** top articles, ordered from most to fewest interactions. The cell after them prints the top 10 titles and ids as a quick check.

In [15]:
def get_top_articles(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article titles 
    
    '''
    top_articles = list(df.groupby(['title'])['article_id'].count().nlargest(n).reset_index()['title'])
    return top_articles # Returns the top article titles from df (not df_content)

def get_top_article_ids(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article ids 
    
    '''
    # Count interactions per article id and keep the n most frequent
    top_articles = list(df.groupby(['article_id'])['title'].count().nlargest(n).reset_index()['article_id'])
 
    return top_articles # Return the top article ids

In [16]:
print(get_top_articles(10))
print(get_top_article_ids(10))

['use deep learning for image classification', 'insights from new york car accident reports', 'visualize car data with brunel', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'predicting churn with the spss random tree algorithm', 'healthcare python streaming application demo', 'finding optimal locations of new store using decision optimization', 'apache spark lab, part 1: basic concepts', 'analyze energy consumption in buildings', 'gosales transactions for logistic regression model']
[1429.0, 1330.0, 1431.0, 1427.0, 1364.0, 1314.0, 1293.0, 1170.0, 1162.0, 1304.0]


### <a class="anchor" id="User-User">Part III: User-User Based Collaborative Filtering</a>


`1.` The function below reformats the **df** dataframe so that users are the rows and articles are the columns.  

* Each **user** should only appear in each **row** once.


* Each **article** should only show up in one **column**.  


* **If a user has interacted with an article, place a 1 where that user's row meets that article's column**.  It does not matter how many times a user has interacted with the article; every such entry should be a 1.  


* **If a user has not interacted with an article, place a 0 where that user's row meets that article's column**. 
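
The rules above can be sketched on a toy frame. This is one possible construction under the stated rules, not necessarily the one used in the notebook; `toy_df` and `toy_user_item` are illustrative names.

```python
import pandas as pd

# Toy interactions: user 1 reads article 10.0 twice, which must still yield a 1.
toy_df = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 3, 3],
    "article_id": [10.0, 10.0, 20.0, 20.0, 10.0, 30.0],
})

toy_user_item = (
    toy_df.drop_duplicates(["user_id", "article_id"])  # repeats collapse to one row
          .assign(seen=1)                               # mark each remaining pair
          .pivot(index="user_id", columns="article_id", values="seen")
          .fillna(0)                                    # never-seen pairs become 0
          .astype(int)
)
```

Each user appears in exactly one row and each article in exactly one column, and every entry is either 0 or 1.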


In [17]:
# create the user-article matrix with 1's and 0's

def create_user_item_matrix(df):
    '''
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns
    
    OUTPUT:
    user_item - user item matrix 
    
    Description:
    Return a matrix with user ids as rows and article ids on the columns with 1 values where a user interacted with 
    an article and a 0 otherwise
    '''
    # Count interactions per (user, article) pair, then binarize: 1 if any interaction, else 0
    user_item = (df.groupby(['user_id', 'article_id']).size().unstack(fill_value=0) > 0).astype(int)
    
    return user_item # return the user_item matrix 

`2.` The function below takes a user_id and returns an ordered list of the most similar users to that user (from most similar to least similar).  The returned result should not contain the provided user_id, since every user is trivially similar to themselves. Because each entry in the matrix is binary, the dot product of two users' rows counts the articles both have interacted with, which makes it a natural similarity measure. 
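
A small worked example of the dot-product similarity, assuming a toy binary matrix with three users (rows) and four articles (columns). All names here are illustrative.

```python
import numpy as np

# Rows are users 1..3, columns are articles; 1 means the user interacted.
toy_matrix = np.array([
    [1, 1, 0, 1],   # user 1
    [1, 0, 0, 1],   # user 2
    [0, 0, 1, 0],   # user 3
])

# Entry (i, j) counts the articles that users i+1 and j+1 have in common.
sims = toy_matrix @ toy_matrix.T

# Rank everyone by similarity to user 1 (row 0), most similar first.
# A stable sort keeps the original user order among ties.
order = np.argsort(-sims[0], kind="stable")

# Convert positions to 1-based user ids and drop user 1 itself.
neighbours_of_user_1 = [int(i) + 1 for i in order if i != 0]
```

User 1 shares two articles with user 2 and none with user 3, so user 2 is ranked first.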


In [18]:
user_item = create_user_item_matrix(df)
test = np.dot(np.array(user_item[0]),user_item.to_numpy())

In [68]:
def find_similar_users(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first
    
    Description:
    Computes the similarity of the given user to every user based on the dot product
    Returns an ordered list of user ids, most similar first, excluding the given user
    
    '''
    # compute similarity of each user to the provided user
    sim = np.dot(user_item[user_id-1:user_id],user_item.transpose())
    # sort by similarity
    sim = pd.DataFrame(sim)
    sim = sim.transpose()
    sim.columns = ['sim']
    #sim = sim[sim['sim']>0]
    sim = sim.sort_values(['sim'],ascending = False)
    # create list of just the ids
    sim['id'] = sim.index+1
    most_similar_users = list(sim.id)
    # remove the own user's id
    if user_id in most_similar_users:
        most_similar_users.remove(user_id)
       
    return most_similar_users # return a list of the users in order from most to least similar
        

`3.` Now that we have a function that provides the most similar users to each user, we can use these users to find articles to recommend. The functions below return the articles we would recommend to each user.
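
The idea can be sketched independently of the dataframes: walk the neighbours in order of similarity and collect articles the target user has not seen until `m` are found. The dictionary `seen_by_user` and the helper `collect_recs` are hypothetical and exist only for this illustration.

```python
# Toy data: which articles each user has interacted with.
seen_by_user = {1: {10, 20}, 2: {10, 20, 30}, 3: {20, 40, 50}}

# Assumed output of a similarity ranking for user 1, most similar first.
neighbours_of_1 = [2, 3]


def collect_recs(user_id, neighbours, seen, m):
    """Return up to m unseen articles, taken from the closest neighbours first."""
    already_seen = seen[user_id]
    recs = []
    for other in neighbours:
        # sorted() only makes the toy output deterministic; order within a
        # neighbour is otherwise arbitrary.
        for article in sorted(seen[other]):
            if article not in already_seen and article not in recs:
                recs.append(article)
                if len(recs) == m:
                    return recs
    return recs  # fewer than m if the neighbours run out of new articles


toy_recs = collect_recs(1, neighbours_of_1, seen_by_user, m=2)
```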

In [248]:
df.head(5)

Unnamed: 0,article_id,title,user_id
0,1430.0,"using pixiedust for fast, flexible, and easier...",1
1,1314.0,healthcare python streaming application demo,2
2,1429.0,use deep learning for image classification,3
3,1338.0,ml optimization using cognitive assistant,4
4,1276.0,deploy your python model as a restful api,5


In [249]:
df_content.head(5)

Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
0,Skip navigation Sign in SearchLoading...\r\n\r...,Detect bad readings in real time using Python ...,Detect Malfunctioning IoT Sensors with Streami...,Live,0
1,No Free Hunch Navigation * kaggle.com\r\n\r\n ...,"See the forest, see the trees. Here lies the c...",Communicating data science: A guide to present...,Live,1
2,☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat...,Here’s this week’s news in Data Science and Bi...,"This Week in Data Science (April 18, 2017)",Live,2
3,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA...",Learn how distributed DBs solve the problem of...,DataLayer Conference: Boost the performance of...,Live,3
4,Skip navigation Sign in SearchLoading...\r\n\r...,This video demonstrates the power of IBM DataS...,Analyze NY Restaurant data using Spark in DSX,Live,4


In [250]:
def get_article_names(article_ids, df=df):
    '''
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column)
    '''
    mask = df['article_id'].isin(article_ids)
    article_names = list(set(df[mask]['title']))
    
    return article_names # Return the article names associated with list of article ids


def get_user_articles(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the doc_full_name column in df_content)
    
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user
    '''
    
    mask = df['user_id']==user_id
    article_ids = list(set(df[mask]['article_id']))
    
    
    mask_full_name = df_content['article_id'].isin(article_ids)
    article_names = list(set(df_content[mask_full_name]['doc_full_name']))  # list, as the docstring states
    
    return article_ids, article_names # return the ids and names




In [239]:
def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    
    For the user where the number of recommended articles starts below m 
    and ends exceeding m, the last items are chosen arbitrarily
    
    '''
    recs = []
    article_ids_chosen, _ = get_user_articles(user_id)
    # Compute the neighbour ranking once instead of on every loop iteration
    for neighbour in find_similar_users(user_id):
        article_ids, _ = get_user_articles(neighbour)
        # Keep only articles the target user has not seen and that are not already recommended
        new_recs = [a for a in article_ids if a not in article_ids_chosen and a not in recs]
        recs.extend(new_recs[:m - len(recs)])  # cap at m rather than a hard-coded 10
        if len(recs) >= m:
            break
        
    return recs
    

In [337]:
user_id = 170
article_ids_chosen, article_names_chosen = get_user_articles(user_id)

In [338]:
m = 0
iteration = 1
similars = find_similar_users(user_id)[iteration]

In [339]:
article_ids, article_names = get_user_articles(similars)

In [340]:
recs = []

In [341]:
temp_recs = [elem for elem in article_ids if not elem in article_ids_chosen] 

In [342]:
temp_recs[0:11]

[1025.0, 2.0, 517.0, 524.0, 14.0, 16.0, 26.0, 1051.0, 28.0, 29.0, 1052.0]

In [343]:
recs.extend(temp_recs[0:10])

In [345]:
m = len(recs)

In [347]:
recs

[1025.0, 2.0, 517.0, 524.0, 14.0, 16.0, 26.0, 1051.0, 28.0, 29.0]

In [327]:
recs = []

In [381]:
m = 10
n = 0
recs = []
iteration = 0
user_id = 950
article_ids_chosen, article_names_chosen = get_user_articles(user_id)
while n<m:
    similars = find_similar_users(user_id)[iteration]
    article_ids, article_names = get_user_articles(similars)
    temp_recs = [elem for elem in article_ids if not elem in article_ids_chosen] 
    temp_recs = [elem for elem in temp_recs if not elem in recs]
    recs.extend(temp_recs[0:10-n])
    iteration = iteration+1
    print(len(recs))
    n = len(recs)

0
10


In [376]:
recs

[1025.0, 2.0, 517.0, 12.0, 524.0, 14.0, 16.0, 26.0, 1051.0, 1052.0]

In [359]:
user_id = 170
iteration = 1
similars = find_similar_users(user_id)[iteration]
similars

3782

In [353]:
iteration

5148

In [362]:
m = 10
n = 0
recs = []
iteration = 1
user_id = 170
article_ids_chosen, article_names_chosen = get_user_articles(user_id)
similars = find_similar_users(user_id)[iteration]
article_ids, article_names = get_user_articles(similars)
temp_recs = [elem for elem in article_ids if not elem in article_ids_chosen] 
temp_recs = [elem for elem in temp_recs if not elem in recs]
recs.extend(temp_recs[0:10-n])
#iteration = iteration+1
n = len(recs)

In [364]:
n

10

In [288]:
## test user 170