# Assignment 2: Content-based Filtering

## Reading Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
tags = pd.read_csv('data/movie-tags.csv', header=None, names=['movie_id', 'tag'])
movies = pd.read_csv('data/movie-titles.csv', header=None, names=['movie_id', 'movie', 'genres'])
ratings = pd.read_csv('data/ratings.csv', header=None, names=['user_id', 'movie_id', 'rating'])

## Part 1: TF-IDF Recommender with Unweighted Profiles

### Compute item-tag vectors (the model)

Step 1: Iterate through items, building the term vector 𝒒𝑖 for each item and a global document frequency vector 𝒅. At this stage, these are unnormalized term and document frequency vectors, storing the number of times the term appears on each document or the number of documents in which it appears.

* first initializing item_tag_dict

In [3]:
item_tag_dict = {}
for movie in movies['movie_id'].tolist():
    item_tag_dict[movie] = {}

* then, filling item_tag_dict

In [10]:
item_tags = tags.groupby('movie_id')
for item in item_tags:
    item_i_tags = item[1]['tag'].value_counts().to_dict()
    item_tag_dict[item[0]] = item_i_tags
Q = pd.DataFrame.from_dict(item_tag_dict, orient='index')

* also, we need to put items with no tag count in Q, and sort it based on item_id

In [11]:
item_with_no_tags = [i for i in item_tag_dict.keys() if len(item_tag_dict[i]) == 0]
for item in item_with_no_tags:
    Q.loc[item] = np.nan
Q = Q.sort_index()

Step 2: Iterate through each item again, performing the following:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a. Divide each term value 𝑞̂𝑖𝑡 by the log of the document frequency (𝑙𝑛 𝑑𝑡). The resulting
vector 𝒒𝑖 is the TF-IDF vector.

In [12]:
Q = Q.fillna(0)
df = Q.count()
idf = np.log(df)
Q = Q/idf

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b. After dividing each term value by the log of the DF, compute the length (Euclidean norm) of the TF-IDF vector 𝒒𝑖, and divide each element of it by the length to yield a unit vector 𝒒𝑖. 

In [16]:
Q = Q.div(np.sqrt(np.square(Q).sum(axis=1)), axis=0)
Q = Q.fillna(0)

### Build user profile for each query user

The profile is the sum of the item-tag vectors of all items the user has rated positively (>= 3.5 stars)

$$ p_{ut} = \sum_{i\in{I},\, r_{ui} \geq{3.5}}{q_{it}} $$

* first building the user rating matrix (movies * users)

In [19]:
item_rating_dict = {}
item_ratings = ratings.groupby('movie_id')
for item in item_ratings:
    item_i_ratings = item[1][['user_id','rating']].set_index('user_id').to_dict()['rating']
    item_rating_dict[item[0]] = item_i_ratings
R = pd.DataFrame.from_dict(item_rating_dict, orient='index')
R = R.sort_index()
R = R.T

* then, converting ratings to 1 if ratings>=3.5 otherwise 0

In [27]:
R35 = R.copy()
R35[R35<3.5]=0
R35[R35>=3.5]=1
R35 = R35.fillna(0)

* finally, doing the dot product

In [29]:
P = R35.dot(Q)

### Generate item scores for each user

The score for an item is the cosine between that item’s tag vector and the user’s profile vector.<br />
Cosine similarity is defined as follows:

$$ \cos{(p_{u},\,q_{i})} = \frac{p_{u} \cdot q_{i}}{\parallel p_{u} \parallel_{2} \parallel q_{i} \parallel_{2}} = \frac{\sum_{t}{q_{ut}p_{ut}}}{\sqrt{\sum_{t}{q_{ut}^2}}\sqrt{\sum_{t}{p_{ut}^2}}} $$

In [31]:
from scipy.spatial.distance import cdist

* pairwise 'cosine' distance

In [32]:
c = cdist(P.values, Q.values, 'cosine')

* cosine similarity score matrix (user * item)

In [40]:
S = pd.DataFrame(1 - c, index=P.index, columns=Q.index)

In [41]:
S

Unnamed: 0,1,2,3,4,5,6,7,9,10,11,...,106487,106489,106782,106920,109374,109487,111362,111759,112556,112852
12288,0.260589,0.141920,0.042414,0.052112,0.063574,0.161210,0.073750,0.035058,0.113963,0.043102,...,0.125419,0.129688,0.125685,0.203442,0.147962,0.155350,0.135663,0.156640,0.058075,0.189139
71680,0.217191,0.087919,0.043311,0.066537,0.074322,0.231493,0.082732,0.047575,0.106936,0.068978,...,0.163084,0.079942,0.177133,0.201442,0.230809,0.096824,0.070999,0.104080,0.092342,0.114584
32780,0.136928,0.140557,0.128905,0.149486,0.144210,0.261189,0.132700,0.132702,0.152971,0.100496,...,0.110923,0.096986,0.175154,0.138010,0.157874,0.097127,0.093729,0.110862,0.069912,0.107594
106512,0.203233,0.113704,0.038010,0.047018,0.047420,0.250273,0.047422,0.036053,0.090024,0.034287,...,0.100919,0.091430,0.095310,0.156518,0.097070,0.099891,0.082026,0.105504,0.039469,0.121156
32785,0.322097,0.105030,0.052207,0.047264,0.056981,0.179906,0.044876,0.039495,0.084634,0.031025,...,0.111009,0.062421,0.084356,0.203655,0.258966,0.130127,0.100948,0.110773,0.034890,0.136928
67615,0.443800,0.177275,0.026224,0.054388,0.066465,0.137222,0.038928,0.015401,0.093238,0.022663,...,0.172558,0.118836,0.092794,0.203190,0.168481,0.092661,0.167408,0.132886,0.055780,0.161714
59424,0.231022,0.109597,0.080034,0.106593,0.119779,0.208124,0.107999,0.080227,0.103043,0.079916,...,0.110275,0.093967,0.163868,0.173235,0.134575,0.089657,0.091069,0.090941,0.051389,0.094882
96296,0.098815,0.098964,0.066325,0.120173,0.084473,0.153292,0.093446,0.072817,0.322692,0.077215,...,0.144505,0.102195,0.086531,0.193996,0.100217,0.200703,0.146417,0.186894,0.067649,0.248461
106537,0.223140,0.182378,0.104567,0.121662,0.099160,0.139044,0.074228,0.084599,0.111801,0.068341,...,0.199116,0.111509,0.131928,0.249520,0.187976,0.156250,0.116643,0.140940,0.055386,0.187350
57387,0.141627,0.099670,0.027188,0.051587,0.043716,0.198715,0.046773,0.026696,0.145802,0.026683,...,0.124839,0.064734,0.164878,0.279430,0.189789,0.174803,0.119544,0.157608,0.066603,0.188620
