# Assignment 2: Content-based Filtering

## Reading Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
tags = pd.read_csv('data/movie-tags.csv', header=None, names=['movie_id', 'tag'])
movies = pd.read_csv('data/movie-titles.csv', header=None, names=['movie_id', 'movie', 'genres'])
ratings = pd.read_csv('data/ratings.csv', header=None, names=['user_id', 'movie_id', 'rating'])

## Part 1: TF-IDF Recommender with Unweighted Profiles

### Compute item-tag vectors (the model)

Step 1: Iterate through items, building the term vector 𝒒𝑖 for each item and a global document frequency vector 𝒅. At this stage, these are unnormalized term and document frequency vectors, storing the number of times the term appears on each document or the number of documents in which it appears.

* first initializing item_tag_dict

In [3]:
item_tag_dict = {}
for movie in movies['movie_id'].tolist():
    item_tag_dict[movie] = {}

* then, filling item_tag_dict

In [4]:
item_tags = tags.groupby('movie_id')
for item in item_tags:
    item_i_tags = item[1]['tag'].value_counts().to_dict()
    item_tag_dict[item[0]] = item_i_tags
Q = pd.DataFrame.from_dict(item_tag_dict, orient='index')

* also, we need to put items with no tag count in Q, and sort it based on item_id

In [5]:
item_with_no_tags = [i for i in item_tag_dict.keys() if len(item_tag_dict[i]) == 0]
for item in item_with_no_tags:
    Q.loc[item] = np.nan
Q = Q.sort_index()

Step 2: Iterate through each item again, performing the following:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a. Divide each term value 𝑞̂𝑖𝑡 by the log of the document frequency (𝑙𝑛 𝑑𝑡). The resulting
vector 𝒒𝑖 is the TF-IDF vector.

In [6]:
Q = Q.fillna(0)
df = Q.count()
idf = np.log(df)
Q = Q/idf

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b. After dividing each term value by the log of the DF, compute the length (Euclidean norm) of the TF-IDF vector 𝒒𝑖, and divide each element of it by the length to yield a unit vector 𝒒𝑖. 

In [7]:
Q = Q.div(np.sqrt(np.square(Q).sum(axis=1)), axis=0)
Q = Q.fillna(0)

### Build user profile for each query user

The profile is the sum of the item-tag vectors of all items the user has rated positively (>= 3.5 stars)

$$ p_{ut} = \sum_{i\in{I},\, r_{ui} \geq{3.5}}{q_{it}} $$

* first building the user rating matrix (movies * users)

In [8]:
item_rating_dict = {}
item_ratings = ratings.groupby('movie_id')
for item in item_ratings:
    item_i_ratings = item[1][['user_id','rating']].set_index('user_id').to_dict()['rating']
    item_rating_dict[item[0]] = item_i_ratings
R = pd.DataFrame.from_dict(item_rating_dict, orient='index')
R = R.sort_index()
R = R.T

* then, converting ratings to 1 if ratings>=3.5 otherwise 0

In [9]:
R35 = R.copy()
R35[R35<3.5]=0
R35[R35>=3.5]=1
R35 = R35.fillna(0)

* finally, doing the dot product

In [10]:
P = R35.dot(Q)

### Generate item scores for each user

The score for an item is the cosine between that item’s tag vector and the user’s profile vector.<br />
Cosine similarity is defined as follows:

$$ \cos{(p_{u},\,q_{i})} = \frac{p_{u} \cdot q_{i}}{\parallel p_{u} \parallel_{2} \parallel q_{i} \parallel_{2}} = \frac{\sum_{t}{q_{ut}p_{ut}}}{\sqrt{\sum_{t}{q_{ut}^2}}\sqrt{\sum_{t}{p_{ut}^2}}} $$

In [11]:
from scipy.spatial.distance import cdist

* pairwise 'cosine' distance

In [12]:
c = cdist(P.values, Q.values, 'cosine')

* cosine similarity score matrix (user * item)

In [13]:
S = pd.DataFrame(1 - c, index=P.index, columns=Q.index)

## Part 2: Weighted User Profile

In this variant, rather than just summing the vectors for all positively-rated items, compute a weighted sum of the item vectors for all rated items, with weights being based on the user’s rating:

$$ p_{u} = \sum_{i\in{I(u)}}{ (r_{ui} - \mu_{u}) \cdot q_{i} } $$

In [72]:
RW = R.copy()
mu = RW.mean(axis=1)
W = RW.sub(s, axis=0)
W = W.fillna(0)

In [74]:
PW = W.dot(Q)

In [75]:
PW

Unnamed: 0,family,classic,humorous,fantasy,bright,fanciful,toys,computer animation,children,Disney,...,abomination,Evangeline Lilly,not enough Bilbo,hobbit,lousy camerawork/cinematography,alcoholic,messy,apprenticeship/training of an adult,political metaphor,derailed by twist
12288,0.680782,2.239962,0.889096,5.076983,-0.190889,0.401534,0.104936,0.236688,-0.478862,-0.107620,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
71680,0.915897,2.721329,-0.317350,-2.257227,0.046179,0.009585,0.024708,-0.094119,-0.000065,-0.564032,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
32780,-2.389675,11.561317,1.349053,2.374417,-0.552643,-0.052796,-0.008220,0.057267,-0.843788,-2.976817,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
106512,0.320252,5.650538,0.767279,0.084404,0.065380,0.352630,-0.070522,-0.026553,-0.174474,0.657810,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
32785,0.335856,2.114862,-0.751138,-3.016836,0.024910,0.192757,0.099642,0.049821,0.234333,0.971619,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
67615,0.412273,1.047495,0.273987,0.534646,0.067169,0.029938,0.360142,0.030958,0.528448,1.848851,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
59424,0.793354,0.113296,-0.710635,1.758995,0.034288,0.144044,0.124346,0.021605,0.216026,-2.261039,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
96296,0.263022,-0.311171,0.347092,0.041191,0.031587,-0.106285,-0.126373,-0.238287,-0.114265,-0.455689,...,0.000000,0.000000,0.000000,0.000000,-0.016512,-0.066048,-0.016512,-0.016512,-0.016512,-0.016512
106537,-0.385471,-0.259102,0.125696,1.947083,0.051618,0.271443,0.112651,0.073034,-0.015353,-3.210425,...,-0.081757,-0.163514,-0.081757,-0.081757,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
57387,0.032051,1.277970,0.211503,-0.845837,0.037970,0.025043,0.033007,0.304464,0.055075,-0.369609,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
