# Assignment 2 Instructions: Content-Based Recommenders
## Overview
In this assignment, you will hand-create and use some content-based profiles. You’ll go through a set of variations to see how certain features of the computation can introduce (or reduce) biases.


In [1]:
# import packages
import pandas as pd

## The Data Set

In [187]:
# import data
data = pd.read_excel('Assignment 2.xls', index_col=0, nrows=20)
data = data.drop(temp.filter(regex=r'^Unnamed').columns, axis=1)

user_rating = data[['User 1', 'User 2']].fillna(0)
docs_attr = data.loc[:, 'baseball':'family']
num_att = data[['num-attr']]
data.head()

Unnamed: 0,baseball,economics,politics,Europe,Asia,soccer,war,security,shopping,family,Unnamed: 11,num-attr,Unnamed: 13,User 1,User 2,Unnamed: 16,Pred1,Pred2
doc1,1,0,1,0,1,1,0,0,0,1,,5,,1.0,-1.0,,,
doc2,0,1,1,1,0,0,0,1,0,0,,4,,-1.0,1.0,,,
doc3,0,0,0,1,1,1,0,0,0,0,,3,,,,,,
doc4,0,0,1,1,0,0,1,1,0,0,,4,,,1.0,,,
doc5,0,1,0,0,0,0,0,0,1,1,,3,,,,,,


In [81]:
data[~data['User 1'].isna()]

Unnamed: 0,baseball,economics,politics,Europe,Asia,soccer,war,security,shopping,family,num-attr,User 1,User 2,Pred1,Pred2
doc1,1,0,1,0,1,1,0,0,0,1,5,1.0,-1.0,,
doc2,0,1,1,1,0,0,0,1,0,0,4,-1.0,1.0,,
doc6,1,0,0,1,0,0,0,0,0,0,2,1.0,,,
doc16,1,0,0,0,0,1,0,0,1,0,3,1.0,,,
doc19,0,1,1,0,1,0,1,0,0,1,5,-1.0,,,


In [82]:
data[~data['User 2'].isna()]

Unnamed: 0,baseball,economics,politics,Europe,Asia,soccer,war,security,shopping,family,num-attr,User 1,User 2,Pred1,Pred2
doc1,1,0,1,0,1,1,0,0,0,1,5,1.0,-1.0,,
doc2,0,1,1,1,0,0,0,1,0,0,4,-1.0,1.0,,
doc4,0,0,1,1,0,0,1,1,0,0,4,,1.0,,
doc12,1,0,0,0,0,1,1,0,0,0,3,,-1.0,,
doc17,0,1,1,1,0,0,0,1,0,0,4,,1.0,,


## Part 1. Build and use a very basic profile
First, you will build a very simple profile of user preferences for attributes.

In [45]:
# user profile from their rating 1 or -1 on documents
# just sum up their rating
user_profile = pd.DataFrame()

for user in user_rating:
    temp =docs_attr.apply(lambda col: col* user_rating[user])
    user_profile[user] = temp.sum()
    
user_profile  = user_profile.transpose()
user_profile

Unnamed: 0,baseball,economics,politics,Europe,Asia,soccer,war,security,shopping,family
User 1,3.0,-2.0,-1.0,0.0,0.0,2.0,-1.0,-1.0,1.0,0.0
User 2,-2.0,2.0,2.0,3.0,-1.0,-2.0,0.0,3.0,0.0,-1.0


In [124]:
def pred_liking(profile_df, attribute_df):
    """ Dataframe dot product of user profile and docs attribute"""
    prediction = profile_df.dot(attribute_df.transpose()).transpose()
    return prediction

In [125]:
# predict scores for each user for each document
# using simple dot product
prediction = pred_liking(user_profile, docs_attr)
prediction

Unnamed: 0,User 1,User 2
doc1,4.0,-4.0
doc2,-4.0,10.0
doc3,2.0,0.0
doc4,-3.0,8.0
doc5,-1.0,1.0
doc6,3.0,1.0
doc7,-1.0,2.0
doc8,-2.0,4.0
doc9,3.0,-2.0
doc10,-3.0,1.0


In [184]:
#Which document does the simple profile predict user 1 will like best?
#What score does that prediction get?
# top docs for user 1
temp = prediction[['User 1']]
temp.nlargest(n=3, columns='User 1')

Unnamed: 0,User 1
doc16,3.464102
doc12,2.309401
doc6,2.12132


In [259]:
#How many documents does the model predict user 2 will dislike 
#(prediction score that is negative)?
# top docs for user 1
temp = prediction[['User 2']]
print(temp.nlargest(n=3, columns='User 2'))
print('\n')
print(temp[temp<0].count())

       User 2
doc2      5.0
doc17     5.0
doc4      4.0


User 2    4
dtype: int64


## Part 2. Next, let’s treat all articles as having unit weight ...
There could be a bias in preference scores from unbalnaced number of attribute values of the docs. If a user assign possitive on a doc which have very small number of attr like Doc6, it could mean the user has stronger preference on the attributes.

NOTE: User profile should be recalculated with the normalized docs attribute values!!!!

In [175]:
# Count the total number of items in the row (you can do this via SUM or COUNT function).
# Normalize each item value / (sum(items)^0.5)
# doc1’s values will all change from 1 to 0.447214 (approx)
temp =docs_attr.copy()
n_items = (docs_attr.sum(axis=1))**0.5
docs_attr_norm = temp.div(n_items, axis=0)
docs_attr_norm

Unnamed: 0,baseball,economics,politics,Europe,Asia,soccer,war,security,shopping,family
doc1,0.447214,0.0,0.447214,0.0,0.447214,0.447214,0.0,0.0,0.0,0.447214
doc2,0.0,0.5,0.5,0.5,0.0,0.0,0.0,0.5,0.0,0.0
doc3,0.0,0.0,0.0,0.57735,0.57735,0.57735,0.0,0.0,0.0,0.0
doc4,0.0,0.0,0.5,0.5,0.0,0.0,0.5,0.5,0.0,0.0
doc5,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.57735
doc6,0.707107,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0
doc7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.707107
doc8,0.0,0.0,0.5,0.5,0.0,0.0,0.5,0.0,0.0,0.5
doc9,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107,0.0
doc10,0.0,0.57735,0.0,0.0,0.57735,0.0,0.57735,0.0,0.0,0.0


In [179]:
# user profile with the normalized docs attributes
user_profile_norm = docs_attr_norm.transpose().dot(user_rating)
user_profile_norm

Unnamed: 0,User 1,User 2
baseball,1.731671,-1.024564
economics,-0.947214,1.0
politics,-0.5,1.052786
Europe,0.207107,1.5
Asia,0.0,-0.447214
soccer,1.024564,-1.024564
war,-0.447214,-0.07735
security,-0.5,1.5
shopping,0.57735,0.0
family,0.0,-0.447214


In [181]:
# compute your second set of user profiles and new predictions. 
# If you did this right, you’ll see a prediction of 1.0090 (approx) for user1/doc1.
prediction_norm = docs_attr_norm.dot(user_profile_norm)
prediction_norm

Unnamed: 0,User 1,User 2
doc1,1.009019,-0.845577
doc2,-0.870053,2.526393
doc3,0.711105,0.016294
doc4,-0.620053,1.987718
doc5,-0.213541,0.319151
doc6,1.370923,0.336184
doc7,-0.353553,0.744432
doc8,-0.370053,1.014111
doc9,1.132724,-0.724476
doc10,-0.805073,0.274493


In [182]:
# top docs for user 1
temp = prediction_norm[['User 1']]
temp.nlargest(n=3, columns='User 1')

Unnamed: 0,User 1
doc16,1.924646
doc6,1.370923
doc12,1.333114


In [183]:
# top docs for user 2
temp = prediction_norm[['User 2']]
temp.nlargest(n=3, columns='User 2')

Unnamed: 0,User 2
doc2,2.526393
doc17,2.526393
doc4,1.987718


## Part 3. Finally, let’s consider how common different terms are among our documents …
We’re going to do one more model -- one that accounts for the fact the the content attributes have vastly different frequencies.
- start from Part 2: normalized doc attributes and their profile

In [148]:
# inverse document frequency
# originally log(#doc/#doc including the word) or 1/log(#doc including the word)
def idf(df):
    total = df.sum()
    return df / total
    
docs_attr_idf = idf(docs_attr)
docs_attr_idf

Unnamed: 0,baseball,economics,politics,Europe,Asia,soccer,war,security,shopping,family
doc1,0.25,0.0,0.1,0.0,0.166667,0.166667,0.0,0.0,0.0,0.2
doc2,0.0,0.166667,0.1,0.090909,0.0,0.0,0.0,0.166667,0.0,0.0
doc3,0.0,0.0,0.0,0.090909,0.166667,0.166667,0.0,0.0,0.0,0.0
doc4,0.0,0.0,0.1,0.090909,0.0,0.0,0.142857,0.166667,0.0,0.0
doc5,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.2
doc6,0.25,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0
doc7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.2
doc8,0.0,0.0,0.1,0.090909,0.0,0.0,0.142857,0.0,0.0,0.2
doc9,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.142857,0.0
doc10,0.0,0.166667,0.0,0.0,0.166667,0.0,0.142857,0.0,0.0,0.0


In [268]:
prediction_norm_idf = (docs_attr_norm*docs_attr_idf).dot(user_profile_norm)
prediction_norm_idf

Unnamed: 0,User 1,User 2
doc1,0.247612,-0.217167
doc2,-0.136187,0.329154
doc3,0.109459,-0.062892
doc4,-0.089197,0.240296
doc5,-0.043527,0.044585
doc6,0.319432,-0.084695
doc7,-0.058926,0.113531
doc8,-0.04753,0.070575
doc9,0.179067,-0.120746
doc10,-0.128031,0.046812


In [269]:
# top docs for user 1
temp = prediction_norm_idf[['User 1']]
temp.nlargest(n=3, columns='User 1')

Unnamed: 0,User 1
doc16,0.396153
doc6,0.319432
doc12,0.311648


In [270]:
# top docs for user 2
temp = prediction_norm_idf[['User 2']]
temp.nlargest(n=3, columns='User 2')

Unnamed: 0,User 2
doc2,0.329154
doc17,0.329154
doc4,0.240296


## Comparing the results
- doc12 and doc6

When docs attributes are normalized by number of attributes in a document, doc6 got higer scores than doc 12. Because, doc6 got higher attribute noramlized scores as it has only two while doc12 has 3. The attribute Europe becomes positive 0.12 from zero. Previously, the user 1 has profile -1 and 1 for Europe from two documents, which resulted zero profile value.

In [211]:
docs_interest = ['doc12', 'doc6']

In [197]:
temp = prediction[['User 1']]
temp.nlargest(n=3, columns='User 1')

Unnamed: 0,User 1
doc16,3.464102
doc12,2.309401
doc6,2.12132


In [198]:
temp = prediction_norm[['User 1']]
temp.nlargest(n=3, columns='User 1')

Unnamed: 0,User 1
doc16,1.924646
doc6,1.370923
doc12,1.333114


In [209]:
temp = docs_attr.loc[['doc12', 'doc6']]
cols_user1 = temp.loc[:, (temp !=0).any(axis=0)].columns
temp.loc[:, cols_user1]

Unnamed: 0,baseball,Europe,soccer,war
doc12,1,0,1,1
doc6,1,1,0,0


In [255]:
temp = user_profile.loc['User 1', cols_user1]
temp / temp.max()

baseball    1.000000
Europe      0.000000
soccer      0.666667
war        -0.333333
Name: User 1, dtype: float64

In [256]:
temp = user_profile_norm.transpose().loc['User 1', cols_user1]
temp / temp.max()

baseball    1.000000
Europe      0.119599
soccer      0.591662
war        -0.258256
Name: User 1, dtype: float64

In [213]:
docs_attr_norm.loc[docs_interest, cols_user1]

Unnamed: 0,baseball,Europe,soccer,war
doc12,0.57735,0.0,0.57735,0.57735
doc6,0.707107,0.707107,0.0,0.0
