In [90]:
import pandas as pd
import numpy as np
import math
from sklearn.preprocessing import MinMaxScaler

## Content Based Recommender
A content based recommender develops content recommendations based on a given item taxonomy and the respective user preferences usign this taxonomy. Therefore key questions that need to be answered are:<br>
1. How to model Items
2. How to model user preferences


In [91]:
data_path = ''

In [92]:
#### Data Import ####
a_votes = pd.read_csv(f'{data_path}/InputData/AnswerUpvotes.csv') # votes represent proactively signalled positive and negative interest in the question by the user
q_feedback = pd.read_csv(f'{data_path}/InputData/QuestionsFeedback.csv') # Feedback to users on given answer based on up- and downvotes
topics = pd.read_csv(f'{data_path}/InputData/TopicsXQuestions.csv') # taxonomy in form of topics and questions that are each assigned to the topics

#since topic names will be encoded I created an item & taxonomy(topcic) dictionary to access the original names
item_dict = {}
for row in topics.iterrows(): item_dict[row[1][0]] = row[0]
topic_dict ={}
for count, topic in enumerate(topics.columns): topic_dict[topic]=count 

# droping question names since they are string
a_votes.drop(['Unnamed: 0'], inplace=True,axis=1) 
q_feedback.drop(['Unnamed: 0'], inplace=True,axis=1)
topics.drop(['Unnamed: 0'], inplace=True,axis=1)

# replacing topic names with number for matrix multiplaction later
a_votes.rename(columns=topic_dict, inplace=True)
q_feedback.rename(columns=topic_dict, inplace=True)
topics.rename(columns=topic_dict, inplace=True)

## Modelling User Profiles
The following cell models the user profiles as matrix product from taxonomy of the items and and given user preferences.<br>

Firstly, I created a target dataframe. Secondly, I left the option open to run the function for Simple Unary, Unit Weight as well as TFIDF model. For the latter two the taxonomy weights of the items need to be adapted. This represents the uniqueness of the category (1/number of items in category).<br>

Secondy, I iterated over the users in the database, replacing nans by 0, and finally again giving the choice between Simple Unary or Unit Weights or TFIDF. For the former two taxonomy and user user ratings can simply be multiplied in order to retrieve the respective user preference for each User. For the TFIDF model however, I initated the computation of IDF. After that the mutliplication of taxonomy and user ratings is processed and finally per category multiplied by the recpective IDF figure.

In [93]:
def user_pref(user_ratings, taxonomy, pref_mode='SimpleUnary'):
    '''
    Modelling user Profiles as Product of Items' taxonomy attribution and user taxonomy preferences.
    '''
    user_pref_df = pd.DataFrame(columns=user_ratings.columns, index=taxonomy.index) #creating result df
    
    if pref_mode == 'UnitWeights' or pref_mode == 'TFIDF': # If we are modelling Unit Weights or TFIDF  
            taxonomy = taxonomy.div(taxonomy.sum(axis=1), axis=0) #Taxonomy per item needs to be aggregated - representing uniqueness of the taxonomy cateory
    
    for u in user_ratings.columns: #iterating all users in dataset     
        
        rated = user_ratings.loc[:, u].fillna(0).values #fill nan values with 0 instead (representing neutral rating)
        
        if pref_mode == 'SimpleUnary' or pref_mode == 'UnitWeights': #if simple or unit weights
            user_pref_df[u] = np.dot(taxonomy.values, rated) # compute simple matrix multiplacation of taxonomy and ratings
        
        elif pref_mode == 'TFIDF': #if TFIDF 
            idf = np.log(len(taxonomy) / taxonomy.sum(axis=1)) #compute IDF
            user_pref_df_vec = np.dot(taxonomy.values, rated) #multiply taxonomy and rating matrix
            user_pref_df[u] = user_pref_df_vec * idf # multiply the result for each category by the respective IDF

    return user_pref_df

### Creating User Profiles
The following Cell calls the modelling function of the user profiles in a way that a dictionary is created containign all the necessary user profiles, for all the desired model types. Based on this we can keep the algorithm lean and neat.<br>

Hereby, it creates profiles for all users based on the three model types (Simple Unary, Unit Weights and TFDIF). Furthermore all available types of user feedback (feedback, votes and both combined). Feedback represents the feeback that each individual user has given per question (up or down vote). That given clearly indicates interest or missing interest. Votes represent the upvotes and downvotes that users have received for the answers they have given to others questions. This information is hereby used as a "repuation" score for the topic. The combined profiles combine the standardized values of interest and reputation by adding the values up and again standadizing the result. <br>

The standadization is necessary since for the combined profiles we assume that interest and reputation are equally important (Depending on the use case of the recommender). For example do we intent using the recommender in order to recommend questions to give anserws, targeting active users it might make sense to weight the reputation higher. For users that tend to read but not answer we should do the opposite. The standardization is necessary to weigh the different scales of rating and evaluation equally. Concerning the standadization another tradeoff was necessary between standardizing across all users or per user. Here I chose to standadize per user, not to falsely compare different rating scales (these are normalized later in the recommendation function).<br>

Furthermore, I am computing the mean profile for each profile type computed (represented by the mean relevance for users per topic). The mean is later used in order solve the cold start problem by combining personalized with unpersonalized content based recommendations. 

In [94]:
#%%timeit
def stand(df): 
    """
    Standardizing a given array.
    """
    for col in df.columns:
        array = df.loc[:, col].values
        df[col] = (array - np.mean(array)) / np.std(array)
    return df

In [95]:
"""
Computing user profiles and creating a profiles dict to index them. (profiles for all model and profiles types)
"""
iteration_dict = {'interest': q_feedback, 'reputation': a_votes}
model_types = ['SimpleUnary', 'UnitWeights', 'TFIDF']
profiles_dict = {}

for model in model_types: #iterate all model types
    for i in iteration_dict: #iterate all profile types
        profiles_dict[f'{i}_{model}'] = user_pref(iteration_dict[i], topics.transpose(), pref_mode=model) #insert every combination of desired models
        profiles_dict[f'{i}_{model}']['mean_profile'] = profiles_dict[f'{i}_{model}'].mean(axis=1) #insert mean profile for every combination of models

In [96]:
"""
Computing the combined profiles per model type and adding them to the profile dict.
"""
for model in model_types:
    profiles_dict[f'combined_profile_{model}'] = stand(stand(profiles_dict[f'interest_{model}']) + stand(profiles_dict[f'interest_{model}']))

  


## Modelling Recommendations
The following two Cells compute the relevance of each question for each user according to the defined models. The reults are fitted into a dict of arrays. Each of the arrays represent the question relevance for each question for one user.<br>

To do this, I am firstly iterating all model types in order to call all user types per model type. The model type is hereby defined outside of this function in the final function call. Therefore this function solely passes on the attained model type.<br>

Before the chosen profiles and the taxonomy are finally combined in a matrix multiplication I made use of an if statement in order to incorporate the possibility of activating a hybrid recommendation approach. Therefore, the if statement checks whether the profiles have preferences defined. Should this not be the case the mean value for the respective model type is used and passed on into the matrix multiplication.<br>

As mentioned above result of this function is a dict indexing the relevance of each question for each user. This relevance is represented by the cosine distance between the user profile and the question in the multi-dimensional model space. Therefore, the value for each question lies between -1 and 1. The values between -1 and 0 are hereby interpreted as dislike probability while the values between 0 and 1 are interpreted as like probability. 0 is interpreted as neutral.
#### Possbile Customization Choices:
profile_type: interest, reputation, combined_profiles<br>
model_type: SimpleUnary, UnitWeights, TFIDF

In [97]:
def q_relevance(taxonomy, profile_type, switched_hybrid, model_types=model_types, profiles_dict=profiles_dict):
    """
    Computing the relevance score for each question and profile. (multiplying profiles and taxonomy)
    """
    relevance_d = {} #defining the output dict
    for model in model_types: #iterate all model types
        u_profiles = profiles_dict[f'{profile_type}_{model}']
                
        for u in u_profiles.columns: #iterating all user profiles
            df = u_profiles[u].values # 1st component of multiplocation (profiles/preferences)
            s = taxonomy.values  #2nd component of multiplication (taxonomy)
   
            if switched_hybrid=='YES' and u_profiles[u].isnull().values.all(axis=0): #if statement to check if values are NAN 
                df = u_profiles['mean_profile'].values #-> if yes choose mean profile in order to compute unpersonalized recommendation
                    
            relevance_d[f'{u}_{model}'] = np.cos(np.dot(s, df)) #index the result as cosine of the mutliplication result
    return relevance_d

The following Cell interprets the question relevance per user that has been computed earlier. It therefore rounds the estimations and translates every estimation < -.5 as dislike prediction, > .5 as like prediction and everything in between as neutral prediction.

In [98]:
def predictions(user, model, profile_type, switched_hybrid):
    """
    Interpreting the estimated relevance scores into real life relevant actions (probability of Like, Dislike & Neutral)
    """
    predictions_v = q_relevance(topics, profile_type=profile_type, switched_hybrid=switched_hybrid)[f'{user}_{model}'] #calling the relevance function
    predictions_df = pd.DataFrame(predictions_v, index=item_dict.keys(), columns=[user]) #create df as destination of the preditcions
    
    predictions_df['like_est'] = round(predictions_df[user]) #rounding the value in order to have a clear action
    predictions_df.loc[predictions_df['like_est'] == -1, 'like_est'] = 'DISLIKE' #dislike when prediction < -.5
    predictions_df.loc[predictions_df['like_est'] == 0, 'like_est'] = 'NEUTRAL' #neutral when -.5 < estimation <.5
    predictions_df.loc[predictions_df['like_est'] == 1, 'like_est'] = 'LIKE' #like when estimation > .5
    
    return predictions_df.sort_values(by=user, axis=0, ascending=False) #return sorted version of the outpur dataframe

## Run Recommendation Engine
The following function runs the recommendation engine by taking the desired model, profile type as well as an indication wheather the hybrid recommendation should be activated or not as input.

In [99]:
def run_recommender(model, profile_type, switched_hybrid):
    """
    Running the recommender workflow & printing the evaluation of the results.
    """
    users_l = ['User 1', 'User 2', 'User 3', 'User 4'] #defining a list of users to iterate
    results_df = pd.DataFrame(index=item_dict.keys()) #defining a result df

    for user in users_l: #iterating the defined users
        pred_df = predictions(user=user, model=model, profile_type=profile_type, switched_hybrid=switched_hybrid) #calling the prediction function per user
        results_df[user] = pred_df[user] #appending results to result df for processing
        results_df[f'{user}_result'] = pred_df['like_est'] #appending results to result df for processing
        
        like_count = len(results_df.loc[results_df[f'{user}_result'] == 'LIKE']) #interpret estimated figures for LIKES
        dislike_count = len(results_df.loc[results_df[f'{user}_result'] == 'DISLIKE']) #interpret estimated figures for DISLIKES
        neutral_count = len(results_df.loc[results_df[f'{user}_result'] == 'NEUTRAL']) #interpret estimated figures for NEUTRALS

        ##print recommended results per user
        print(f'####---------{user}---------####')
        print(f'Number of Likes: {like_count}')
        print(f'Number of Neutral: {dislike_count}')
        print(f'Number of Dislikes: {neutral_count}')
        
        #if no estimations dont print top questions (cold start problem)
        if pred_df[user].isnull().values.all(axis=0) ==False:
            print(f'##---Top 5 Questions---##')
            print(pred_df[user].sort_values(axis=0, ascending=False).index.values[:5])
        

## Challenge 1: Simple Unary Prediction

In [100]:
run_recommender(model='SimpleUnary', profile_type='interest', switched_hybrid='NO')

####---------User 1---------####
Number of Likes: 6
Number of Neutral: 8
Number of Dislikes: 6
##---Top 5 Questions---##
['question11' 'question15' 'question18' 'question7' 'question5']
####---------User 2---------####
Number of Likes: 7
Number of Neutral: 6
Number of Dislikes: 7
##---Top 5 Questions---##
['question19' 'question11' 'question5' 'question6' 'question10']
####---------User 3---------####
Number of Likes: 6
Number of Neutral: 7
Number of Dislikes: 7
##---Top 5 Questions---##
['question4' 'question7' 'question1' 'question18' 'question6']
####---------User 4---------####
Number of Likes: 0
Number of Neutral: 0
Number of Dislikes: 0


## Challenge 2: Unit Weight Prediction

In [101]:
run_recommender('UnitWeights', profile_type='interest', switched_hybrid='NO')

####---------User 1---------####
Number of Likes: 6
Number of Neutral: 8
Number of Dislikes: 6
##---Top 5 Questions---##
['question18' 'question11' 'question15' 'question3' 'question7']
####---------User 2---------####
Number of Likes: 9
Number of Neutral: 6
Number of Dislikes: 5
##---Top 5 Questions---##
['question11' 'question5' 'question19' 'question10' 'question8']
####---------User 3---------####
Number of Likes: 6
Number of Neutral: 7
Number of Dislikes: 7
##---Top 5 Questions---##
['question4' 'question7' 'question13' 'question1' 'question3']
####---------User 4---------####
Number of Likes: 0
Number of Neutral: 0
Number of Dislikes: 0


## Challenge 3: TFIDF Prediction

In [102]:
run_recommender('TFIDF', profile_type='interest', switched_hybrid='NO')

####---------User 1---------####
Number of Likes: 6
Number of Neutral: 8
Number of Dislikes: 6
##---Top 5 Questions---##
['question18' 'question11' 'question15' 'question3' 'question7']
####---------User 2---------####
Number of Likes: 9
Number of Neutral: 6
Number of Dislikes: 5
##---Top 5 Questions---##
['question11' 'question5' 'question19' 'question10' 'question8']
####---------User 3---------####
Number of Likes: 6
Number of Neutral: 7
Number of Dislikes: 7
##---Top 5 Questions---##
['question4' 'question7' 'question13' 'question1' 'question3']
####---------User 4---------####
Number of Likes: 0
Number of Neutral: 0
Number of Dislikes: 0


## Challenge 4: Synthetic Prediction

In [103]:
run_recommender('TFIDF', profile_type='interest', switched_hybrid='YES')

####---------User 1---------####
Number of Likes: 6
Number of Neutral: 8
Number of Dislikes: 6
##---Top 5 Questions---##
['question18' 'question11' 'question15' 'question3' 'question7']
####---------User 2---------####
Number of Likes: 9
Number of Neutral: 6
Number of Dislikes: 5
##---Top 5 Questions---##
['question11' 'question5' 'question19' 'question10' 'question8']
####---------User 3---------####
Number of Likes: 6
Number of Neutral: 7
Number of Dislikes: 7
##---Top 5 Questions---##
['question4' 'question7' 'question13' 'question1' 'question3']
####---------User 4---------####
Number of Likes: 4
Number of Neutral: 6
Number of Dislikes: 10
##---Top 5 Questions---##
['question7' 'question6' 'question19' 'question10' 'question3']


## Challenge 5: Trust Aware Collaborative Filtering
For the final part of the practice I firstly thought about developing a trust aware user-user collaborative filtering model. However, the first testing showed that the limited ammount of data is clearly not enough to work out sufficient similarities between the users. The following cell demontrates the correlation matrix between the given user dataset:

In [104]:
#%%timeit
def pearsonMatrix(prefs):
    """
    Computes the complete pearson correlation Matrix.
    """
    matrix = prefs.corr(method='pearson') #.corr computes pearson correlation of all inputted pairs and outputs in matrix
    return matrix #returns complete correlation matrix

def pearsonSimilarity(person1, person2, corrM):
    """
    Calls the desired pair of persons from the complete correlation Matrix.
    """
    similarity = corrM.loc[person1, person2] #.loc calls the desired pair of persons from the complete correlation Matrix
    return similarity #returns single correlation value

In [105]:
pearsonMatrix(q_feedback.fillna(0))

Unnamed: 0,User 1,User 2,User 3,User 4
User 1,1.0,-0.414141,-0.167968,
User 2,-0.414141,1.0,0.205294,
User 3,-0.167968,0.205294,1.0,
User 4,,,,


As the Matrix is way too limited and next to its small size also does not show any significant similarities I decided to change my approach and shift to a trust aware item-item collaborative filtering model. The following cell displays the first 5 rows of the respective correlation matrix. The quality difference is clearly significant.

In [106]:
pearsonMatrix(q_feedback.transpose().fillna(0)).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,1.0,-1.0,,-0.816497,0.0,0.816497,0.0,0.0,,,,0.707107,,,0.0,0.5,-0.816497,,-0.816497,
1,-1.0,1.0,,0.816497,0.0,-0.816497,0.0,0.0,,,,-0.707107,,,0.0,-0.5,0.816497,,0.816497,
2,,,,,,,,,,,,,,,,,,,,
3,-0.816497,0.816497,,1.0,-0.333333,-0.333333,0.333333,-0.333333,,,,-0.57735,,,0.333333,0.0,1.0,,0.333333,
4,0.0,0.0,,-0.333333,1.0,-0.333333,-1.0,1.0,,,,-0.57735,,,-1.0,-0.816497,-0.333333,,0.333333,


Since it is the intention to develop a trust aware recommender system it is crucial to determine initially, how the trust in a platform's user can be determined and where it can best be inserted into the estimation formula. In this case I decided to use the up-and downvotes that have been submitted for the answers that our users posted. These represent the expertise and quality of interaction with the quora platorm.<br>

In order to get started the reputation values therefore first need to be computed per user. In this case the reputation is computed per user and across taxonomy topics. This is relevant as applying the reputation per taxonomy category might lead to even better recommendations.<br>

The reputation needs to elevate or weaken the rating given by the user. Therefore the mean number of up/downvotes is simply multiplied by the interest matrix.

In [107]:
## Creating a reputation dict ## 
reputation_dict = {}
for count, user in enumerate(a_votes.columns): #iterate all users in the dataset
    reputation_dict[user] = a_votes.mean(axis=0)[count] #for every user the reputation = mean of up&downvotes on answers submitted by the same user

In [108]:
## Create the Input Matrix for the recommender ## 
trust_aware_rating = q_feedback * np.array(list(reputation_dict.values())) # Incorporate the reputaiton in the ratings
input_m = trust_aware_rating.transpose().fillna(0) #transpose the matrix in order to focus on item-item similarity

In [109]:
corrM = pearsonMatrix(input_m) #define correlation matrix between questions

The function in the following cell computes the actual relevance scores for each items and one given user. Hereby the solution is trust aware, as the input matrix of the weightings has been multiplied with the reputation scores per user. Therefore the ratings of more trustworthy users count more in the consideration of the individual votes as well as the average. (Numerator of the score estimation formula)<br>

The if statement checks for users that have not given any ratings yet and therefore identifies cases of the cold start problem. In this case the function replaces the user ratings with the mean user rating that has been given by all the other users in the database. We therefore assume that the best recommendation for a user that we dont know at all is the average recommendation we give to all our other users. The average recommendation is hereby also trust-aware!<br>

Since we are talking about probabilities of liking and diskliking a certain item the computed values need to be normalized between -1 & 1

In [110]:
def getRecommendations(prefs,person, similarity_matrix):  
    """
    Compute relevance score per item for a given user. 
    """
    ## Compute mean movie rating given by V in order to normalise later
    prefs.loc['mean_rating_j'] = prefs.mean(axis=0) #Compute j rating mean in new row
    
    #incorporate usage of non-personalized recommendation for the cold-start problem
    if np.count_nonzero(prefs.loc[person]) == 0 : #if statement to check if values are all 0 (representing no rating given = cold start) 
        prefs.loc[person] = prefs.loc['mean_rating_j'].values # if yes choose mean item rating in order to compute unpersonalized recommendat

    results = []
    for q in prefs.columns: #iterate all items in the database
        prefs.loc['similarity_j'] = similarity_matrix.loc[q, :] #append the item-item similarity of the given item
        sum_of_j_weights = prefs.loc['similarity_j'].sum() # sum the similarity weights between items
        if sum_of_j_weights == 0: sum_of_j_weights = 1 #if all the similarity weights balance each other assume 1
        
        prefs.loc['normalised_rating'] = prefs.loc[person, :].values - prefs.loc['mean_rating_j'].values #normalise the rating by substracting the mean rating for the item
        prefs.loc['weighted_rating'] = prefs.loc['normalised_rating'].values * prefs.loc['similarity_j', :].values #item rating of j * Wij (create new row / overwrite row with all results for item in current   iteration)           
        
        results.append(np.nansum(prefs.loc['weighted_rating'].values) / sum_of_j_weights) #compute relevance score by dividing the components and append result to goven result list 
    
    scaler = MinMaxScaler((-1, 1)) #define scaler (sklearn package)
    results = scaler.fit_transform(np.array(results).reshape(-1, 1)) #insert U column (containing estimations into scaler)
        
    result_df = pd.DataFrame(columns=[person], index=prefs.columns) #define result df to pass to interpretation function
    result_df[person] = results #append estimated score to result df to pass on to interpretation
    
    return result_df.round(0) #return sorted list of tuples (reverse = True for descending order)


In [111]:
def predictions(user):
    """
    Interpret the relevance ratings per item and make final predictions on that basis. (like, neutral, dislike for question domain)
    """
    predictions_df = getRecommendations(input_m,user, similarity_matrix=corrM) #call recommendation function
    predictions_df['like_est'] = round(predictions_df[user]) #round values to 0, 1, -1
    predictions_df.loc[predictions_df['like_est'] == -1, 'like_est'] = 'DISLIKE' #dislike when prediction < -.5
    predictions_df.loc[predictions_df['like_est'] == 0, 'like_est'] = 'NEUTRAL' #neutral when -.5 < estimation <.5
    predictions_df.loc[predictions_df['like_est'] == 1, 'like_est'] = 'LIKE' #like when estimation > .5
    
    return predictions_df.sort_values(by=user, axis=0, ascending=False) #return sorted version of the outpur dataframe

In [112]:
def run_recommender():
    """
    Run the recommender and print results.
    """
    users_l = ['User 1', 'User 2', 'User 3', 'User 4'] #specify the users that are to be iterated

    for user in users_l: #iterate the specified users
        pred_df = predictions(user=user) #extract prediction figures
        pred_df.rename({'like_est': f'{user}_result'}, axis=1, inplace=True) #rename column to specify for user
        
        like_count = len(pred_df.loc[pred_df[f'{user}_result'] == 'LIKE']) #interpret estimated figures for LIKES
        dislike_count = len(pred_df.loc[pred_df[f'{user}_result'] == 'DISLIKE']) #interpret estimated figures for DISLIKES
        neutral_count = len(pred_df.loc[pred_df[f'{user}_result'] == 'NEUTRAL']) #interpret estimated figures for NEUTRALS
            
        ##print recommended results per user
        print(f'####---------{user}---------####')
        print(f'Number of Likes: {like_count}')
        print(f'Number of Neutral: {dislike_count}')
        print(f'Number of Dislikes: {neutral_count}')
        
        #if no estimations dont print top questions (cold start problem)
        if pred_df[user].isnull().values.all(axis=0) ==False:
            print(f'##---Top 5 Questions---##')
            print(pred_df[user].sort_values(axis=0, ascending=False).index.values[:5])

In [113]:
run_recommender()

####---------User 1---------####
Number of Likes: 6
Number of Neutral: 10
Number of Dislikes: 4
##---Top 5 Questions---##
[ 4  6  7 15 11]
####---------User 2---------####
Number of Likes: 5
Number of Neutral: 4
Number of Dislikes: 11
##---Top 5 Questions---##
[11  4 14  6  7]
####---------User 3---------####
Number of Likes: 10
Number of Neutral: 6
Number of Dislikes: 4
##---Top 5 Questions---##
[ 2  3  8 17 16]
####---------User 4---------####
Number of Likes: 0
Number of Neutral: 20
Number of Dislikes: 0
##---Top 5 Questions---##
[19  2  1 18 17]
