# Assignment 2: Content-based Filtering

## Reading Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
tags = pd.read_csv('data/movie-tags.csv', header=None, names=['movie_id', 'tag'])
movies = pd.read_csv('data/movie-titles.csv', header=None, names=['movie_id', 'movie', 'genres'])
ratings = pd.read_csv('data/ratings.csv', header=None, names=['user_id', 'movie_id', 'rating'])

## Part 1: TF-IDF Recommender with Unweighted Profiles

### Compute item-tag vectors (the model)

Step 1: Iterate through items, building the term vector 𝒒𝑖 for each item and a global document frequency vector 𝒅. At this stage, these are unnormalized term and document frequency vectors: the first stores the number of times each term appears on an item, and the second stores the number of items (documents) on which each term appears.

In [3]:
# Term-frequency matrix: one row per movie, one column per tag.
item_tag_dict = {}
for movie_id, group in tags.groupby('movie_id'):
    # Count how many times each tag was applied to this movie.
    item_tag_dict[movie_id] = group['tag'].value_counts().to_dict()
Q = pd.DataFrame.from_dict(item_tag_dict, orient='index')
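To make the two counts concrete, here is a minimal sketch on made-up data. The `toy_tags` frame and its values are hypothetical and not taken from the assignment files. It uses `pd.crosstab` as an alternative to the loop above, which fills missing tags with 0 rather than NaN.

```python
import pandas as pd

# Hypothetical toy data (not from the assignment files).
toy_tags = pd.DataFrame({
    'movie_id': [1, 1, 1, 2, 2, 3],
    'tag': ['pixar', 'pixar', 'toys', 'toys', 'fantasy', 'fantasy'],
})

# Term frequency: rows are movies, columns are tags, cells are raw counts.
tf = pd.crosstab(toy_tags['movie_id'], toy_tags['tag'])

# Document frequency: number of movies on which each tag appears at least once.
doc_freq = (tf > 0).sum(axis=0)

print(tf)
print(doc_freq)
```

Here movie 1 carries the tag `pixar` twice, so its term frequency is 2, while the document frequency of `pixar` is 1 because only one movie has it.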

Step 2: Iterate through each item again, performing the following:

a. Divide each term value 𝑞̂𝑖𝑡 by the log of the document frequency (𝑙𝑛 𝑑𝑡). The resulting vector 𝒒𝑖 is the TF-IDF vector.

In [5]:
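# Document frequency: number of movies with a non-NaN count for each tag.
df = Q.count()
# Natural log of the document frequency; tags on a single movie give ln(1) = 0.
idf = np.log(df)
# Column-wise division; dividing by 0 yields inf for single-movie tags.
Q = Q/idf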
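One edge case is worth seeing in isolation: a tag that appears on exactly one movie has a document frequency of 1, and ln 1 = 0, so the division produces an infinite weight. The sketch below uses a hypothetical toy matrix (the names `tf`, `doc_freq`, and `weighted` are illustrative only) and treats those infinite values as missing, which is one possible way to handle them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy term-frequency matrix; NaN means the tag was never applied.
tf = pd.DataFrame(
    {
        'fantasy': [np.nan, 1.0, 1.0],
        'pixar': [2.0, np.nan, np.nan],
        'toys': [1.0, 1.0, np.nan],
    },
    index=[1, 2, 3],
)

doc_freq = tf.count()      # non-NaN cells per column
log_df = np.log(doc_freq)  # ln(1) = 0 for 'pixar'
weighted = tf / log_df     # 2 / 0 gives inf for movie 1, tag 'pixar'

# Treat infinite weights as missing so they do not dominate later sums.
weighted = weighted.replace([np.inf, -np.inf], np.nan)
print(weighted)
```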

b. After dividing each term value by the log of the DF, compute the length (Euclidean norm) of the TF-IDF vector 𝒒𝑖, and divide each element of it by the length to yield a unit vector 𝒒𝑖.

In [10]:
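# Tags with a document frequency of 1 were divided by ln(1) = 0; treat them as missing.
Q = Q.replace([np.inf, -np.inf], np.nan)

# Divide each row by its Euclidean norm (NaN cells are skipped by sum).
Q = Q.div(np.sqrt(np.square(Q).sum(axis=1)), axis=0)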

movie_id,family,classic,humorous,fantasy,bright,fanciful,toys,computer animation,children,Disney,...,abomination,Evangeline Lilly,not enough Bilbo,hobbit,lousy camerawork/cinematography,alcoholic,messy,apprenticeship/training of an adult,political metaphor,derailed by twist
1,0.009899,0.005097,0.005819,0.002891,0.005588,0.004273,0.029911,0.012886,0.011217,0.019258,...,,,,,,,,,,
2,,,,0.015049,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,0.081206,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,
11,,,,,,,,,,,...,,,,,,,,,,
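As a sanity check on the normalization step, the following sketch normalizes a small hypothetical matrix and confirms that every row ends up with length 1. The frame `weighted` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical weighted vectors; NaN marks a tag the movie does not have.
weighted = pd.DataFrame(
    {'fantasy': [3.0, np.nan], 'toys': [4.0, 2.0]},
    index=[1, 2],
)

# Euclidean length per row; pandas skips NaN when summing.
lengths = np.sqrt((weighted ** 2).sum(axis=1))
unit = weighted.div(lengths, axis=0)

# Recompute the row lengths of the normalized matrix.
row_norms = np.sqrt((unit ** 2).sum(axis=1))
print(unit)
print(row_norms)
```

Movie 1 has the vector (3, 4) with length 5, so its unit vector is (0.6, 0.8).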


### Build user profile for each query user

The profile is the sum of the item-tag vectors of all items the user has rated positively (>= 3.5 stars):

$$ p_{ut} = \sum_{i\in{I},\, r_{ui} \geq{3.5}}{q_{it}} $$

* First, build the rating matrix (rows are movies, columns are users).

In [43]:
# Rating matrix: one row per movie, one column per user.
item_rating_dict = {}
for movie_id, group in ratings.groupby('movie_id'):
    # Map each user who rated this movie to the rating they gave.
    item_rating_dict[movie_id] = group.set_index('user_id')['rating'].to_dict()
R = pd.DataFrame.from_dict(item_rating_dict, orient='index')

In [44]:
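# Binary "liked" matrix: 1 where the rating is at least 3.5, else 0 (unrated cells become 0).
R35 = (R >= 3.5).astype(int)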

movie_id,12288,71680,32780,106512,32785,67615,59424,96296,106537,57387,...,49806,128079,53819,72633,43116,26008,81574,86309,84598,44194
1,4.0,4.0,3.5,4.0,4.5,4.5,4.5,3.0,4.0,4.0,...,,,,,,,,,,
2,3.0,,,3.0,,4.0,,,,,...,,,,,,,,,,
3,,,3.5,,,,,,3.5,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,3.0,,,,4.5,,,,...,,,,,,,,,,
6,,,5.0,4.0,,,,,,4.5,...,,,,,,,,,,
7,,,3.5,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
10,3.0,3.5,4.0,3.0,,,,5.0,,3.5,...,,,,,,,,,,
11,,,3.0,,,,,3.5,3.0,,...,,,,,,,,,,
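The profile formula above can be sketched as a single matrix product: a binary movies-by-users "liked" matrix, transposed, times the movies-by-tags matrix. The frames `Q_toy` and `R_toy` below are hypothetical toy data, and missing tag weights are assumed to have been filled with 0 beforehand.

```python
import numpy as np
import pandas as pd

# Hypothetical unit item-tag vectors (movies x tags), with NaN already filled by 0.
Q_toy = pd.DataFrame(
    {'fantasy': [0.6, 0.0, 1.0], 'toys': [0.8, 1.0, 0.0]},
    index=[1, 2, 3],
)

# Hypothetical ratings (movies x users); NaN means unrated.
R_toy = pd.DataFrame(
    {100: [4.0, 3.0, np.nan], 200: [np.nan, 3.5, 5.0]},
    index=[1, 2, 3],
)

# 1 where the user liked the movie (rating >= 3.5), else 0.
liked = (R_toy >= 3.5).astype(int)

# Profile matrix (users x tags): for each user, sum the vectors of liked movies.
P_toy = liked.T.dot(Q_toy)
print(P_toy)
```

User 100 liked only movie 1, so that profile equals movie 1's vector. User 200 liked movies 2 and 3, so that profile is the sum of those two vectors.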


### Generate item scores for each user

The score for an item is the cosine between that item's tag vector and the user's profile vector. Cosine similarity is defined as follows:

$$ \cos{(p_{u},\,q_{i})} = \frac{p_{u} \cdot q_{i}}{\parallel p_{u} \parallel_{2} \parallel q_{i} \parallel_{2}} = \frac{\sum_{t}{q_{it}p_{ut}}}{\sqrt{\sum_{t}{q_{it}^2}}\sqrt{\sum_{t}{p_{ut}^2}}} $$
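A minimal sketch of the scoring step, assuming a users-by-tags profile matrix and a movies-by-tags item matrix with no missing values. The names `P_toy` and `Q_toy` and their contents are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical profile (users x tags) and item (movies x tags) matrices.
P_toy = pd.DataFrame(
    {'fantasy': [0.6, 1.0], 'toys': [0.8, 1.0]},
    index=[100, 200],
)
Q_toy = pd.DataFrame(
    {'fantasy': [0.6, 0.0, 1.0], 'toys': [0.8, 1.0, 0.0]},
    index=[1, 2, 3],
)

# Numerator: dot product of every profile with every item vector.
dots = P_toy.dot(Q_toy.T)

# Denominator: profile length per user and item vector length per movie.
p_len = np.sqrt((P_toy ** 2).sum(axis=1))
q_len = np.sqrt((Q_toy ** 2).sum(axis=1))
scores = dots.div(p_len, axis=0).div(q_len, axis=1)

# Movies ranked by score for user 100, best first.
top_for_100 = scores.loc[100].sort_values(ascending=False)
print(scores)
print(top_for_100)
```

When the item vectors are already unit vectors, as in Step 2b, the item lengths are all 1 and only the profile length changes the result.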
