# Recommendations with IBM
## 3. User-User Based Collaborative Filtering
This is the third notebook in the Recommendations with IBM project. Here I build a recommendation function based on similarities between users, in three steps:

1. Creating user-article matrix
2. Finding similar users
3. Recommending articles

I used the IBM data set for this project. Detailed information about the data can be found in the first notebook: Exploratory Analysis.   
A new user has no interaction history, so user-user similarity cannot be computed for them. To make a recommendation to a new user, use the rank-based recommendation explained in the second notebook in this repo.

In [2]:
#import necessary libraries
import pandas as pd
import numpy as np

#read the data
df = pd.read_csv('user-item-interactions.csv')
del df['Unnamed: 0']

# map each email address to an integer user id so the data is easier to read
def email_mapper():
    coded_dict = dict()
    cter = 1
    email_encoded = []
    
    for val in df['email']:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter+=1
        
        email_encoded.append(coded_dict[val])
    return email_encoded

email_encoded = email_mapper()
del df['email']
df['user_id'] = email_encoded

# show header
df.head()


,article_id,title,user_id
0,1430.0,"using pixiedust for fast, flexible, and easier...",1
1,1314.0,healthcare python streaming application demo,2
2,1429.0,use deep learning for image classification,3
3,1338.0,ml optimization using cognitive assistant,4
4,1276.0,deploy your python model as a restful api,5
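
As a side note, the email-to-id mapping above can also be written with `pd.factorize`, which numbers values in order of first appearance. The following is a minimal sketch on made-up toy data (the email strings and the `toy` frame are assumptions, not part of the IBM data set). Rows with a missing email would get code -1 from `factorize`, so they would need separate handling.

```python
import pandas as pd

# Toy data; the email strings are made up for illustration.
toy = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com", "b@x.com"],
    "article_id": [10.0, 11.0, 12.0, 10.0, 10.0],
})

# factorize assigns codes 0, 1, 2, ... in order of first appearance,
# so adding 1 gives user ids that start at 1.
codes, uniques = pd.factorize(toy["email"])
toy["user_id"] = codes + 1
toy = toy.drop(columns="email")

print(toy)
```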


### 1. Creating user item matrix
The user-item matrix has users in rows and items (in this case, articles) in columns. It is generally a sparse matrix with many NaN values, and there are different ways to handle this. In this project I ignored how many times a user interacted with an article: a cell is 1 if the user interacted with the article at least once and 0 otherwise.
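
To make the idea concrete, here is a minimal sketch on toy data (the `toy` frame and its values are assumptions for illustration). It uses `pd.crosstab` followed by `clip`, which is an alternative to the groupby/unstack approach used in this notebook, and it produces plain article ids as columns rather than a two-level column index.

```python
import pandas as pd

# Toy interactions: user 1 viewed article 10.0 twice.
toy = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 3, 3],
    "article_id": [10.0, 10.0, 12.0, 10.0, 11.0, 12.0],
})

# crosstab counts interactions per (user, article) pair; missing pairs become 0.
counts = pd.crosstab(toy["user_id"], toy["article_id"])

# clip caps every count at 1, which gives the 0/1 matrix described above.
toy_user_item = counts.clip(upper=1)

print(toy_user_item)
```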

In [3]:
def create_user_item_matrix(df):
    '''
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns
    
    OUTPUT:
    user_item - user item matrix 
    
    Description:
    Return a matrix with user ids as rows and article ids on the columns with 1 values where a user interacted with 
    an article and a 0 otherwise
    '''
    df_grouped=df.groupby(['user_id','article_id']).count() # count interactions per (user, article) pair
    df_grouped.loc[df_grouped['title']>1]=1 # cap counts higher than 1 at 1
    user_item=df_grouped.unstack() # place users in rows and articles in columns
    user_item=user_item.fillna(0) # fill NaN values (no interaction) with 0

    return user_item # return the user_item matrix 

In [4]:
user_item = create_user_item_matrix(df)
user_item.head()

user_id \ (title, article_id),0.0,2.0,4.0,8.0,9.0,12.0,14.0,15.0,16.0,18.0,...,1434.0,1435.0,1436.0,1437.0,1439.0,1440.0,1441.0,1442.0,1443.0,1444.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 2. Finding similar users
There are different ways to measure how similar two users are. Because our user-item matrix consists of 0s and 1s, I used the dot product: for two users it equals the number of articles both have interacted with. When similarity values are equal, instead of choosing arbitrarily, we pick the users with the most total article interactions before those with fewer.
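
The dot product and the tie-breaking rule can be illustrated on a small example. This is a minimal sketch with a made-up 0/1 matrix (`toy_user_item` and `target` are assumptions for illustration), not the notebook's own function.

```python
import pandas as pd

# Toy 0/1 matrix: rows are users, columns are article ids.
toy_user_item = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 1],
     [1, 1, 0, 0],
     [0, 1, 0, 0]],
    index=pd.Index([1, 2, 3, 4], name="user_id"),
    columns=[10.0, 11.0, 12.0, 13.0],
)

target = 1

# Dot product of every user's row with the target's row:
# the number of articles the two users have in common.
similarity = toy_user_item.dot(toy_user_item.loc[target])

# Number of distinct articles each user interacted with.
num_interactions = toy_user_item.sum(axis=1)

neighbors = (
    pd.DataFrame({"similarity": similarity, "num_interactions": num_interactions})
    .drop(index=target)  # remove the target user by label
    .sort_values(["similarity", "num_interactions"], ascending=False)
)

print(neighbors)
```

Users 2 and 3 both share two articles with user 1, and user 2 is listed first because it has more interactions overall.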

In [10]:
def get_top_sorted_users(user_id, df=df, user_item=user_item):
    '''
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook 
    user_item - (pandas dataframe) matrix of users by articles: 
            1's when a user has interacted with an article, 0 otherwise
    
            
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                    ids - is a neighbor user_id
                    similarity - measure of the similarity of each user to the provided user_id
                    num_interactions - the number of articles viewed by the user - if a u
                    
    Other Details - sort the neighbors_df by the similarity and then by number of interactions where 
                    highest of each is higher in the dataframe
     
    '''
    user1=user_item.loc[user_id,]
    dot_prod=np.dot(user_item,user1)
    user_ids=user_item.index
    num_interactions=user_item.sum(axis=1)
    

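    # build the neighbors dataframe, remove the input user, then sort by similarity and interactions
    neighbors_df=pd.DataFrame({'ids':user_ids,'similarity':dot_prod,'num_interactions':num_interactions})
    neighbors_df=neighbors_df[neighbors_df['ids']!=user_id] # drop the input user by id, not by position after sorting
    neighbors_df=neighbors_df.sort_values(by=['similarity','num_interactions'],ascending=False)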

    
    return neighbors_df # Return the dataframe specified in the doc_string


In [11]:
# Let's have a look at the users most similar to user 1
get_top_sorted_users(1).head()

user_id,ids,similarity,num_interactions
3933,3933,35.0,35.0
23,23,17.0,135.0
3782,3782,17.0,135.0
203,203,15.0,96.0
4459,4459,15.0,96.0


As shown in the table above, the get_top_sorted_users function returns a data frame sorted from most similar to least similar, with each neighbor's similarity and number of interactions.

### 3. Creating Recommendations
In this part, I first created two helper functions that the recommendation function relies on.

In [12]:
def get_article_names(article_ids, df=df):
    '''
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column)
    '''
    article_names=[]
    for article in article_ids:
        title=df[df['article_id']==float(article)].iloc[0,].title
        article_names.append(title)
    
    return article_names # Return the article names associated with list of article ids


def get_user_articles(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column in df)
    
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user
    '''
    article_ids=list(user_item.loc[user_id,user_item.loc[user_id,]==1].title.index)
    article_names=get_article_names(article_ids)
    

    
    return article_ids, article_names # return the ids and names

In [13]:
def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user by article id
    rec_names - (list) a list of recommendations for the user by article title
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    * Users with the most total article interactions are chosen 
    before those with fewer article interactions (handled by get_top_sorted_users).

    * Articles with the most total interactions could be chosen 
    before those with fewer total interactions; this function does not apply 
    that ranking and takes each neighbor's articles in user_item column order.
   
    '''
    similar_users=list(get_top_sorted_users(user_id).ids)
    user_articles=get_user_articles(user_id)[0]
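    recs=[]
    i=0
    while len(recs)< m and i<len(similar_users):
        ids=get_user_articles(similar_users[i])[0]
        for d in ids:
            # skip articles the user has already seen and ones already recommended
            if d not in user_articles and d not in recs:
                recs.append(d)
        i+=1
        
    recs=recs[0:m] # keep at most m recommendations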
    rec_names=get_article_names(recs)
    
    return recs, rec_names

In [14]:
user_user_recs(1)[0]

[2.0, 12.0, 14.0, 16.0, 26.0, 28.0, 29.0, 33.0, 50.0, 74.0]

In [15]:
user_user_recs(1)[1]

['this week in data science (april 18, 2017)',
 'timeseries data analysis of iot events by using jupyter notebook',
 'got zip code data? prep it for analytics. – ibm watson data lab – medium',
 'higher-order logistic regression for large datasets',
 'using machine learning to predict parking difficulty',
 'deep forest: towards an alternative to deep neural networks',
 'experience iot with coursera',
 'using brunel in ipython/jupyter notebooks',
 'graph-based machine learning',
 'the 3 kinds of context: machine learning and the art of the frame']
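
The notes in `user_user_recs` mention choosing articles with the most total interactions first. One possible way to order a neighbor's candidate articles by popularity is sketched below. The toy data and the helper `rank_by_popularity` are hypothetical and are not part of the notebook's functions.

```python
import pandas as pd

# Toy interactions; values are made up for illustration.
toy = pd.DataFrame({
    "user_id":    [1, 2, 2, 2, 3, 3, 4],
    "article_id": [10.0, 10.0, 11.0, 12.0, 12.0, 11.0, 12.0],
})

# Total number of interactions per article across all users.
popularity = toy["article_id"].value_counts()

def rank_by_popularity(candidate_ids, popularity=popularity):
    """Hypothetical helper: sort article ids from most to fewest total interactions."""
    return sorted(candidate_ids, key=lambda a: popularity.get(a, 0), reverse=True)

# Articles seen by neighbor 2 that user 1 has not seen yet.
seen_by_user = {10.0}
candidates = [a for a in [10.0, 11.0, 12.0] if a not in seen_by_user]

ranked = rank_by_popularity(candidates)
print(ranked)
```

Inside the recommendation loop, the same ordering could be applied to each neighbor's unseen articles before they are appended to the list of recommendations.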