# Assignment 2: Content-based Filtering

## Reading Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
tags = pd.read_csv('data/movie-tags.csv', header=None, names=['movie_id', 'tag'])
movies = pd.read_csv('data/movie-titles.csv', header=None, names=['movie_id', 'movie', 'genres'])
ratings = pd.read_csv('data/ratings.csv', header=None, names=['user_id', 'movie_id', 'rating'])

## Part 1: TF-IDF Recommender with Unweighted Profiles

### Compute item-tag vectors (the model)

Step 1: Iterate through the items, building the term vector 𝒒𝑖 for each item and a global document frequency vector 𝒅. At this stage, these are unnormalized term and document frequency vectors: 𝒒𝑖 stores the number of times each tag was applied to item 𝑖 (the "document"), and 𝒅 stores the number of items on which each tag appears.

* first initializing item_tag_dict

In [3]:
item_tag_dict = {}
for movie in movies['movie_id'].tolist():
    item_tag_dict[movie] = {}

* then, filling item_tag_dict

In [4]:
item_tags = tags.groupby('movie_id')
for item in item_tags:
    item_i_tags = item[1]['tag'].value_counts().to_dict()
    item_tag_dict[item[0]] = item_i_tags
Q = pd.DataFrame.from_dict(item_tag_dict, orient='index')

* also, we need to add the items that have no tags to Q (as all-NaN rows), and sort Q by movie_id

In [5]:
item_with_no_tags = [i for i in item_tag_dict.keys() if len(item_tag_dict[i]) == 0]
for item in item_with_no_tags:
    Q.loc[item] = np.nan
Q = Q.sort_index()
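
As a cross-check on Step 1, the same count matrix can be sketched without explicit loops. This is an alternative formulation, not the notebook's method, and the names `tags_toy`, `movie_ids_toy`, and `counts_toy` are hypothetical toy data introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy data: movie 1 is tagged three times, movie 2 never, movie 3 once.
tags_toy = pd.DataFrame(
    {"movie_id": [1, 1, 1, 3], "tag": ["funny", "funny", "dark", "dark"]}
)
movie_ids_toy = [1, 2, 3]

counts_toy = (
    tags_toy.groupby("movie_id")["tag"]
    .value_counts()            # term frequency per (movie, tag) pair
    .unstack()                 # one column per tag, NaN where a tag is absent
    .reindex(movie_ids_toy)    # add untagged movies as all-NaN rows, in id order
)
print(counts_toy)
```

Leaving absent tags as NaN rather than 0 matters at this stage, because the document frequency is later obtained by counting the non-null entries in each column.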

Step 2: Iterate through each item again, performing the following:<br />
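&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a. Divide each term value 𝑞̂𝑖𝑡 by the log of the document frequency (𝑙𝑛 𝑑𝑡). The resulting
vector 𝒒𝑖 is the TF-IDF vector. Note that the cell below implements the weighting as a multiplication by the inverse document frequency, 𝑙𝑛 𝑁 − 𝑙𝑛 𝑑𝑡 = 𝑙𝑛(𝑁/𝑑𝑡), where 𝑁 is the number of items, rather than as a literal division by 𝑙𝑛 𝑑𝑡.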

In [6]:
df = Q.count()
lnn = np.log(Q.shape[0])
idf = lnn - np.log(df)
Q = Q.fillna(0)
Q = Q*idf

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b. After applying the document frequency weighting, compute the length (Euclidean norm) of the TF-IDF vector 𝒒𝑖, and divide each element of it by the length to yield a unit vector 𝒒𝑖. Items with no tags have length 0, so the division produces NaN; those entries are replaced by 0.

In [9]:
Q = Q.div(np.sqrt(np.square(Q).sum(axis=1)), axis=0)
Q = Q.fillna(0)
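
To make Step 2 concrete, here is a minimal sketch that repeats the same operations on a tiny matrix small enough to check by hand. The data and the `*_toy` names are hypothetical, and the weighting follows the code cells above (multiplication by ln(N/d_t)).

```python
import numpy as np
import pandas as pd

# Hypothetical toy counts: 3 movies x 2 tags; NaN means the tag was never applied.
Q_toy = pd.DataFrame(
    {"funny": [2, np.nan, 1], "dark": [np.nan, np.nan, 3]},
    index=[1, 2, 3],
)

# Step 2a: weight raw counts by IDF = ln(N) - ln(d_t).
df_toy = Q_toy.count()                              # number of movies carrying each tag
idf_toy = np.log(Q_toy.shape[0]) - np.log(df_toy)   # funny: ln(3/2), dark: ln(3)
tfidf_toy = Q_toy.fillna(0) * idf_toy               # the Series aligns with the columns

# Step 2b: scale every row to unit length.
norms_toy = np.sqrt(np.square(tfidf_toy).sum(axis=1))
unit_toy = tfidf_toy.div(norms_toy, axis=0).fillna(0)  # untagged movie 2: 0/0 = NaN -> 0
print(unit_toy)
```

One consequence of this weighting is that a tag applied to every movie gets an IDF of ln(1) = 0 and drops out of all vectors.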

### Build user profile for each query user

The profile is the sum of the item-tag vectors of all items the user has rated positively (>= 3.5 stars)

$$ p_{ut} = \sum_{i\in{I},\, r_{ui} \geq{3.5}}{q_{it}} $$

* first building the user rating matrix (assembled as movies * users, then transposed so that rows are users and columns are movies)

In [11]:
item_rating_dict = {}
item_ratings = ratings.groupby('movie_id')
for item in item_ratings:
    item_i_ratings = item[1][['user_id','rating']].set_index('user_id').to_dict()['rating']
    item_rating_dict[item[0]] = item_i_ratings
R = pd.DataFrame.from_dict(item_rating_dict, orient='index')
R = R.sort_index()
R = R.T

* then, converting ratings to 1 if rating >= 3.5 and to 0 otherwise (unrated movies also become 0)

In [12]:
R35 = R.copy()
R35[R35<3.5]=0
R35[R35>=3.5]=1
R35 = R35.fillna(0)

* finally, doing the dot product

In [13]:
P = R35.dot(Q)
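
The unweighted profile can be checked by hand on a small example. The sketch below uses hypothetical toy matrices (`R_toy`, `Q_toy`) and builds the indicator matrix with a single comparison, which should be equivalent to the thresholding cells above because a comparison against NaN is False.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings (users x movies); NaN means "not rated".
R_toy = pd.DataFrame(
    {1: [4.0, 2.0], 2: [np.nan, 5.0], 3: [3.5, 1.0]},
    index=["u1", "u2"],
)
# Hypothetical unit-length item-tag vectors (movies x tags).
Q_toy = pd.DataFrame(
    {"funny": [1.0, 0.0, 0.6], "dark": [0.0, 1.0, 0.8]},
    index=[1, 2, 3],
)

liked_toy = (R_toy >= 3.5).astype(float)   # 1 for liked movies, 0 for disliked or unrated
P_toy = liked_toy.dot(Q_toy)               # sum of the vectors of the liked movies
print(P_toy)
```

Here `u1` likes movies 1 and 3, so the profile is (1.0 + 0.6, 0.0 + 0.8) = (1.6, 0.8). `DataFrame.dot` requires the columns of the left operand to match the index of the right one, so both must cover the same set of movie ids.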

### Generate item scores for each user

The score for an item is the cosine between that item’s tag vector and the user’s profile vector.<br />
Cosine similarity is defined as follows:

$$ \cos{(p_{u},\,q_{i})} = \frac{p_{u} \cdot q_{i}}{\parallel p_{u} \parallel_{2} \parallel q_{i} \parallel_{2}} = \frac{\sum_{t}{p_{ut}q_{it}}}{\sqrt{\sum_{t}{p_{ut}^2}}\sqrt{\sum_{t}{q_{it}^2}}} $$

In [14]:
from scipy.spatial.distance import cdist

* pairwise 'cosine' distance

In [15]:
c = cdist(P.values, Q.values, 'cosine')

* cosine similarity score matrix (user * item)

In [16]:
S = pd.DataFrame(1 - c, index=P.index, columns=Q.index)
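
`cdist(..., 'cosine')` returns the cosine distance, which is why the score matrix is computed as 1 minus the result. As a sanity check, the similarity formula can also be written out directly with NumPy. The helper `cosine_scores` and the toy arrays below are hypothetical and only illustrate the formula.

```python
import numpy as np

def cosine_scores(P, Q):
    """Cosine similarity between every row of P (users) and every row of Q (items)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    dots = P @ Q.T                                   # p_u . q_i for all pairs
    norms = (np.linalg.norm(P, axis=1)[:, None]
             * np.linalg.norm(Q, axis=1)[None, :])   # ||p_u|| * ||q_i||
    with np.errstate(divide="ignore", invalid="ignore"):
        return dots / norms                          # 0/0 yields NaN for empty profiles

# Hypothetical toy data: the last user has an empty profile.
P_toy = np.array([[1.6, 0.8], [0.0, 1.0], [0.0, 0.0]])
Q_toy = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
S_toy = cosine_scores(P_toy, Q_toy)
print(S_toy)
```

A user whose profile is the zero vector has an undefined cosine with every item, which shows up here as a row of NaN.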

In [49]:
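def GetTopTenForUser(user_id):
    s = S.loc[user_id]
    # Parenthesize the comparison: `&` binds more tightly than `!=`
    s = s[s.notnull() & (s != 0)]
    # nlargest already returns the scores in descending order
    return s.nlargest(10)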

User 106512, recommendations: 32587, 5878, 34405, 1219, 51662, 96610, 68358, 1035, 1748, 55820

In [91]:
GetTopTenForUser(106512)

3000     0.319965
5618     0.319893
260      0.315742
912      0.314120
31658    0.301880
1206     0.299537
7099     0.298290
48394    0.287041
2959     0.282227
6350     0.278408
Name: 106512, dtype: float64

## Part 2: Weighted User Profile

In this variant, rather than just summing the vectors for all positively-rated items, compute a weighted sum of the item vectors for all rated items, with weights being based on the user’s rating:

$$ p_{u} = \sum_{i\in{I(u)}}{ (r_{ui} - \mu_{u}) \cdot q_{i} } $$

In [17]:
RW = R.copy()
mu = RW.mean(axis=1)
W = RW.sub(mu, axis=0)
W = W.fillna(0)

In [18]:
PW = W.dot(Q)
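
A small worked example may help show how the mean-centered weights behave. The matrices below are hypothetical toy data; the steps mirror the two cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings (users x movies); NaN means "not rated".
R_toy = pd.DataFrame(
    {1: [5.0, 2.0], 2: [np.nan, 4.0], 3: [3.0, 3.0]},
    index=["u1", "u2"],
)
# Hypothetical unit-length item-tag vectors (movies x tags).
Q_toy = pd.DataFrame(
    {"funny": [1.0, 0.0, 0.6], "dark": [0.0, 1.0, 0.8]},
    index=[1, 2, 3],
)

mu_toy = R_toy.mean(axis=1)                    # per-user mean over rated movies only
W_toy = R_toy.sub(mu_toy, axis=0).fillna(0)    # r_ui - mu_u; unrated movies get weight 0
PW_toy = W_toy.dot(Q_toy)
print(PW_toy)
```

For `u1` the mean is 4.0, so movie 1 gets weight +1 and movie 3 gets weight -1, giving the profile (1.0 - 0.6, 0.0 - 0.8) = (0.4, -0.8). Unlike the unweighted profile, entries can be negative: tags on movies rated below the user's average push the profile away from those tags.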

In [44]:
from scipy.spatial.distance import cdist
cw = cdist(PW.values, Q.values, 'cosine')
SW = pd.DataFrame(1 - cw, index=PW.index, columns=Q.index)

In [45]:
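def GetTopTenForUser_Weighted(user_id):
    sw = SW.loc[user_id]
    # Parenthesize the comparison: `&` binds more tightly than `!=`
    sw = sw[sw.notnull() & (sw != 0)]
    # nlargest already returns the scores in descending order
    return sw.nlargest(10)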

In [52]:
GetTopTenForUser_Weighted(320)

2959     0.351385
50       0.263043
2858     0.261798
923      0.238596
260      0.222713
6711     0.217404
4878     0.213215
47       0.208188
48394    0.203743
296      0.203630
Name: 320, dtype: float64