# Content-Based Filtering using TF-IDF & movie ratings
This dataset contains a smaller set of movies, but it also includes the ratings that users gave to these movies. The aim is to create personalised recommendations for each user in the test set and to evaluate the top 10 recommendations using mean average precision (MAP@10).

## Data Description

**user_id** - ID of the user

**movie_id** - ID of the movie followed by its title (for example, `25: Jarhead`)

**rating** - Rating provided by the user

**keywords** - Predefined keywords for each movie

**cast** - Entire cast of the movie

**genres** - List of genres corresponding to the movie

**director** - Director of the movie

## Table of Contents

[1. Reading Dataset & Basic Exploration](#Reading-Dataset)

[2. Selecting Movie Features](#movie-features)

[3. Creating Test & Train Data](#train-test)

[4. Building Item & User Profiles](#item-user-profiles)

[5. Generating Content Based Recommendations](#generate)

[6. Checking recommendations for a random user](#check)

[7. Evaluation using Mean Average Precision @10](#evaluation)

## 1. Reading Data & Basic Exploration <a class="anchor" id="Reading-Dataset"></a>

In [1]:
import pandas as pd
import numpy as np

from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

from sklearn.preprocessing import MinMaxScaler, normalize

In [2]:
#Reading data from csv to dataframe
movie_data_with_ratings = pd.read_csv('movie_ratings_with_info.csv')

In [3]:
#Checking the number of movies and number of users in the ratings data
movie_data_with_ratings.user_id.nunique(),movie_data_with_ratings.movie_id.nunique() 

(943, 555)

In [4]:
#Looking at sample
movie_data_with_ratings.sample(2).T

| | 21668 | 37643 |
|---|---|---|
| user_id | 500 | 932 |
| movie_id | 25: Jarhead | 679: Aliens |
| rating | 3 | 2 |
| keywords | sniper marine corps saudi arabia petrol golf war | android extraterrestrial technology space mari... |
| cast | Jamie Foxx Jake Gyllenhaal Scott MacDonald Luc... | Sigourney Weaver Michael Biehn James Remar Pau... |
| genres | Drama War | Horror Action Thriller Science Fiction |
| director | Sam Mendes | James Cameron |


## 2. Selecting movie features <a class="anchor" id="movie-features"></a>

In [5]:
# Choosing features for extracting content information
features = ['keywords','cast','genres','director']

In [6]:
# Filling missing values with an empty string
for feature in features:
    movie_data_with_ratings[feature] = movie_data_with_ratings[feature].fillna('')

In [7]:
# Single Combined Feature including all the content features
def combine_features(row):
    return row['keywords'] +" "+row['cast']+" "+row["genres"]+" "+row["director"]
        
movie_data_with_ratings["combined_features"] = movie_data_with_ratings.apply(combine_features,axis=1)

## 3. Creating train and test set from the ratings data <a class="anchor" id="train-test"></a>

In [8]:
# Creating train and test data for testing performance later
train_data, test_data = train_test_split(movie_data_with_ratings, test_size = 0.2, random_state = 42)

In [9]:
# Checking shape for train & test
train_data.shape, test_data.shape

((33492, 8), (8373, 8))
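
A random row-level split does not guarantee that every user in the test set also has ratings in the train set, and user profiles are later built from train rows only. The following is a minimal sketch of how such users could be detected; the toy ratings frame and all names in it are invented for illustration and are not part of the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy ratings; the notebook itself splits movie_data_with_ratings.
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "movie_id": ["a", "b", "c", "a", "d", "b", "c", "d", "a", "e"],
    "rating":   [5, 3, 4, 2, 5, 4, 1, 3, 5, 2],
})

toy_train, toy_test = train_test_split(toy_ratings, test_size=0.2, random_state=42)

# Users present in the test split but absent from the train split
# would have no profile to recommend from.
cold_start_users = set(toy_test["user_id"]) - set(toy_train["user_id"])
print(f"test users without training ratings: {sorted(cold_start_users)}")
```

If the set is not empty, those users could be skipped during evaluation or the split could be made per user; which option fits best depends on the use case.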

## 4. Building Item & User Profiles <a class="anchor" id="item-user-profiles"></a>

### Steps for building the content based recommender system
1. Calculate the item profile of each movie from its TF-IDF values.
2. Calculate the user profile from the item profiles of the movies the user has rated. For each user, I look up the movies/items rated (1-5) in the train set, weight each movie's item profile by its rating, and combine the weighted TF-IDF vectors into a weighted average.
3. Find the movies most similar to each user profile.
4. Create the top n recommendations from these most similar movies.
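
The four steps above can be sketched end to end on a tiny example. This is only an illustration under stated assumptions: the catalogue, the ratings and every `toy_*` name are hypothetical, and the vectorizer uses default settings rather than the ones chosen later in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Hypothetical toy catalogue: movie id -> combined text features.
toy_movies = {
    "m1": "space alien action science fiction",
    "m2": "space robot science fiction",
    "m3": "romance wedding comedy",
    "m4": "romance drama wedding",
}
toy_ids = list(toy_movies)

# Step 1: item profiles are the TF-IDF rows, one per movie.
toy_item_profiles = TfidfVectorizer().fit_transform(list(toy_movies.values()))

# Step 2: the user profile is the rating-weighted average of the rated item profiles.
toy_user_ratings = {"m1": 5, "m3": 1}  # hypothetical ratings of a single user
rated_rows = [toy_ids.index(m) for m in toy_user_ratings]
weights = np.array(list(toy_user_ratings.values()), dtype=float).reshape(-1, 1)
weighted_sum = toy_item_profiles[rated_rows].multiply(weights).sum(axis=0)
toy_user_profile = normalize(np.asarray(weighted_sum).reshape(1, -1) / weights.sum())

# Step 3: cosine similarity between the user profile and every item profile.
scores = cosine_similarity(toy_user_profile, toy_item_profiles).ravel()

# Step 4: rank the movies the user has not rated yet and keep the top n.
ranked = [toy_ids[i] for i in np.argsort(-scores) if toy_ids[i] not in toy_user_ratings]
print(ranked[:2])
```

On this toy data the science-fiction title `m2` ranks ahead of `m4`, because the user's highest rating went to a movie that shares several tokens with `m2`.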

In [10]:
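# Extracting TF-IDF vectors for combined features
vectorizer = TfidfVectorizer(analyzer='word',
                     ngram_range=(1, 2),
                     min_df=0.003,
                     max_df=0.5,
                     max_features=500)

# Keep one row per movie so that item_ids and the rows of tfidf_matrix stay aligned,
# even if two movies happen to share the same combined feature text
train_movies = train_data.drop_duplicates(subset='movie_id')
item_ids = train_movies['movie_id'].tolist()
tfidf_matrix = vectorizer.fit_transform(train_movies['combined_features'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_feature_names = vectorizer.get_feature_names_out()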

In [11]:
# Function to extract movie profile or tfidf for a given movie
def get_item_profile(item_id):
    idx = item_ids.index(item_id)
    item_profile = tfidf_matrix[idx:idx+1]
    return item_profile


# Function to extract movie profile or item profile for items with given ids
def get_item_profiles(ids):
    item_profiles_list = [get_item_profile(x) for x in ids]
    item_profiles = vstack(item_profiles_list)
    return item_profiles


# Function to create user profile using interaction strength or rating
def build_users_profile(person_id, interactions_indexed_df):
    # Double brackets keep a DataFrame even when the user has only one rating
    interactions_person_df = interactions_indexed_df.loc[[person_id]]
    
    # Get item profiles for all items rated by the given user
    user_item_profiles = get_item_profiles(interactions_person_df['movie_id'])
    
    # Ratings (interaction strengths) as a column vector, one row per rated item
    user_item_strengths = np.array(interactions_person_df['rating']).reshape(-1,1)
    
    # Weighted average of the item profiles, using the ratings as weights
    weighted_sum = user_item_profiles.multiply(user_item_strengths).sum(axis=0)
    # Convert to a plain 2-D ndarray, since newer scikit-learn versions reject np.matrix input
    user_item_strengths_weighted_avg = np.asarray(weighted_sum).reshape(1, -1) / np.sum(user_item_strengths)
    user_profile_norm = normalize(user_item_strengths_weighted_avg)
    return user_profile_norm

# Function to build user profiles for all users using the train data
def build_users_profiles(): 
    interactions_indexed_df = train_data.set_index('user_id')
    user_profiles = {}
    
    #Loop over each user to build user profiles for each
    for person_id in interactions_indexed_df.index.unique():
        user_profiles[person_id] = build_users_profile(person_id, interactions_indexed_df)
    return user_profiles
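
To make the arithmetic in `build_users_profile` concrete, here is a small worked example with dense arrays. The two item profiles and the ratings are hypothetical values chosen so the numbers are easy to follow; the notebook itself works on sparse TF-IDF rows.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical item profiles for two movies over three tokens.
item_profiles = np.array([
    [1.0, 0.0, 0.0],   # movie A
    [0.0, 1.0, 0.0],   # movie B
])
ratings = np.array([[4.0], [2.0]])  # the user rated A with 4 and B with 2

# Rating-weighted average: (4 * A + 2 * B) / (4 + 2)
weighted_avg = (item_profiles * ratings).sum(axis=0, keepdims=True) / ratings.sum()

# L2 normalisation (the default of sklearn's normalize) rescales the row to unit length
user_profile = normalize(weighted_avg)

print(weighted_avg)   # approximately [[0.667 0.333 0.   ]]
print(user_profile)   # approximately [[0.894 0.447 0.   ]]
```

Because cosine similarity ignores vector length, dividing by the sum of ratings does not change the later ranking; it only keeps the intermediate values on a comparable scale across users.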

In [12]:
user_profiles = build_users_profiles()

In [13]:
# User profile for a sample user (ID 392): tokens sorted by relevance
myprofile = user_profiles[392]
pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        myprofile.flatten().tolist()), key=lambda x: -x[1])[:20],
             columns=['token', 'relevance'])

| | token | relevance |
|---|---|---|
| 0 | action | 0.219409 |
| 1 | fiction | 0.198732 |
| 2 | science | 0.198732 |
| 3 | science fiction | 0.198732 |
| 4 | thriller | 0.185941 |
| 5 | michael | 0.168775 |
| 6 | adventure | 0.15942 |
| 7 | crime | 0.155538 |
| 8 | drama | 0.144268 |
| 9 | action drama | 0.138754 |


## 5. Generating Content Based Recommendations <a class="anchor" id="generate"></a>

In [14]:
#Function to extract all movies that the user has already rated
def get_items_interacted(person_id, train_data):
    
    # Select the movie ids of all rows belonging to this user
    interacted_items = train_data.loc[train_data['user_id'] == person_id, 'movie_id']
    return set(interacted_items)

#Extract 100 most similar items to the user profile
def get_similar_items_to_user_profile(person_id, topn=100):
    #Computes the cosine similarity between the user profile and all item profiles
    cosine_similarities = cosine_similarity(user_profiles[person_id], tfidf_matrix)
    
    #Gets the top similar items
    similar_indices = cosine_similarities.argsort().flatten()[-topn:]
    
    #Sort the similar items by similarity
    similar_items = sorted([(item_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
    return similar_items
    
#Generate top 10 recommendations from the movies that the user has not watched yet    
def cb_recommend_items(user_id, items_to_ignore=None, topn=10):
    # None instead of a mutable default list; a set makes the membership test fast
    items_to_ignore = set(items_to_ignore or [])
    similar_items = get_similar_items_to_user_profile(user_id)
        
    #Ignores items the user has already interacted with
    similar_items_filtered = [x for x in similar_items if x[0] not in items_to_ignore]
        
    #The second column holds the cosine similarity to the user profile, not a rating
    recommendations_df = pd.DataFrame(similar_items_filtered, columns=['movie_id', 'similarity']) \
                                    .head(topn)
    return list(recommendations_df['movie_id'])

## 6. Checking recommendations for a random user <a class="anchor" id="check"></a>

In [16]:
#Extracting recommended items for user 933
userr = 933
iti = list(train_data[train_data['user_id'] == userr]['movie_id'])
cb_recommend_items(userr, items_to_ignore=iti, topn=10)

['152: Star Trek: The Motion Picture',
 '201: Star Trek: Nemesis',
 '85: Raiders of the Lost Ark',
 '817: Austin Powers: The Spy Who Shagged Me',
 '818: Austin Powers in Goldmember',
 '395: AVP: Alien vs. Predator',
 '331: Jurassic Park III',
 '608: Men in Black II',
 '330: The Lost World: Jurassic Park',
 '440: Aliens vs Predator: Requiem']

In [17]:
# 933's rated items from train set
train_data[train_data['user_id'] == userr].sort_values(by = 'rating',ascending = False)['movie_id'].values

array(['179: The Interpreter', '98: Gladiator',
       '22: Pirates of the Caribbean: The Curse of the Black Pearl',
       '186: Lucky Number Slevin', '193: Star Trek: Generations',
       '174: Star Trek VI: The Undiscovered Country', '187: Sin City',
       '73: American History X', '435: The Day After Tomorrow',
       '654: On the Waterfront', '470: 21 Grams', '11: Star Wars',
       '87: Indiana Jones and the Temple of Doom',
       '196: Back to the Future Part III',
       '89: Indiana Jones and the Last Crusade', '182: The Good German',
       '157: Star Trek III: The Search for Spock',
       '746: The Last Emperor', '194: Amélie', '79: Hero', '215: Saw II',
       '173: 20,000 Leagues Under the Sea', '392: Chocolat',
       '508: Love Actually', '218: The Terminator',
       '153: Lost in Translation', '652: Troy', '214: Saw III',
       '168: Star Trek IV: The Voyage Home', '95: Armageddon',
       '239: Some Like It Hot', '403: Driving Miss Daisy',
       '121: The Lord of

In [18]:
#Profile for the user 933
user_profile = user_profiles[userr]
print(user_profile.shape)
pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        user_profiles[userr].flatten().tolist()), key=lambda x: -x[1])[:20],
             columns=['token', 'relevance'])

(1, 500)


| | token | relevance |
|---|---|---|
| 0 | adventure | 0.227359 |
| 1 | drama | 0.201216 |
| 2 | action | 0.194974 |
| 3 | fiction | 0.180199 |
| 4 | science | 0.180199 |
| 5 | science fiction | 0.180199 |
| 6 | thriller | 0.16334 |
| 7 | crime | 0.159915 |
| 8 | george | 0.129635 |
| 9 | james | 0.120917 |


## 7. Evaluation using Mean Average Precision at 10 <a class="anchor" id="evaluation"></a>

In [19]:
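#Implementation for average precision@k
#Note: the score is divided by the number of hits; the more common definition
#divides by min(len(actual), k), which gives lower or equal values
def apk(actual, predicted, k=10):
    actual = list(actual)
    predicted = list(predicted)[:k]

    #No relevant items for this user
    if not actual:
        return 0.0

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        #Count a hit only once, even if an item is recommended twice
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            #Precision at position i+1
            score += num_hits / (i+1.0)
            
    if num_hits == 0:
        return 0.0
    return score / num_hits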
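
The `apk` function above divides the accumulated precision by the number of hits, whereas a widely used definition of AP@k divides by `min(len(actual), k)`. The sketch below compares both normalisations on one hypothetical user; the helper name `average_precision_variants` and the example lists are assumptions made for illustration only.

```python
def average_precision_variants(actual, predicted, k):
    """Return (AP@k divided by hits, AP@k divided by min(len(actual), k))."""
    score, num_hits = 0.0, 0
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each relevant item once, at the first rank where it appears
        if item in actual and item not in predicted[:rank - 1]:
            num_hits += 1
            score += num_hits / rank
    if num_hits == 0:
        return 0.0, 0.0
    return score / num_hits, score / min(len(actual), k)


relevant = ["a", "c", "d", "e"]   # hypothetical held-out movies of one user
recommended = ["a", "b", "c"]     # hypothetical ranked recommendations
by_hits, by_relevant = average_precision_variants(relevant, recommended, k=3)
print(round(by_hits, 4), round(by_relevant, 4))  # 0.8333 0.5556
```

Hits occur at ranks 1 and 3, so the accumulated precision is 1 + 2/3. Dividing by the two hits gives about 0.83, while dividing by min(4, 3) = 3 gives about 0.56, so scores from the two variants are not directly comparable.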

**Testing performance with MAP@k**

In [20]:
#List of unique users from test set
user_ids = list(test_data['user_id'].unique())
sum_ap = 0
ap_user = []
for user in user_ids:
    #Ignoring movies that the user has already rated
    iti = list(train_data[train_data['user_id'] == user]['movie_id'])
    
    #Creating recommendations for each user
    rec = cb_recommend_items(user, items_to_ignore=iti, topn=10)
    
    #Actual movies rated by the user
    act = list(test_data[test_data['user_id'] == user]['movie_id'])
    
    #Calculating Average Precision@K for each user
    ap = apk(act, rec, k=10)
    ap_user.append(ap)
    
    #Running sum of the average precision over all users
    sum_ap += ap

#Mean average precision@10
map_at_10 = sum_ap/len(user_ids)
print(map_at_10)


0.11881219559199693
