
# Assignment 3

Welcome to the third assignment! Here you will implement a simple Content-Based Recommender. We will use part of the MovieLens 20M dataset.

You will write and execute your code in Python using this Jupyter Notebook.

**PREREQUISITE:** Download the Movie Dataset from: <https://www.kaggle.com/rounakbanik/the-movies-dataset/data>. Extract the contents of the zip file into `data_directory`.
So you should have, for example, a file `data_directory/movies_metadata.csv`.

Also download the MovieLens 20M dataset from <https://grouplens.org/datasets/movielens/20m/>. Extract its `ratings.csv` file into `data_directory`, overwriting the `ratings.csv` from the Kaggle archive. You should then have a file `data_directory/ratings.csv` with about 20M ratings, instead of the roughly 26M ratings in the Kaggle version. The code below also reads `data_directory/movies.csv`; the MovieLens 20M archive contains a file with that name.

**TASK:** Your job is to *fill in the missing code* only. The place to enter your code is clearly marked with comments.

**SUBMISSION:** You will submit this Notebook via TUWEL.

**GRADING:** We will test whether your code produces the expected output that is provided. We will also run additional tests, not shown here.

## Preparation
Importing necessary modules.

In [1]:
import csv
import pandas as pd
import numpy as np
from scipy import sparse as sp
import sklearn.preprocessing as pp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

Make sure to enter the correct location of your data.

In [2]:
data_directory = '../../data/'
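
Before running the remaining cells, it can help to confirm that the files read later in this notebook are actually present. The check below is a minimal sketch: the helper name `missing_files` is hypothetical, and the file list is simply collected from the `read_csv` calls in this notebook.

```python
import os

# File names read by the cells of this notebook.
REQUIRED_FILES = [
    'links.csv', 'movies.csv', 'movies_metadata.csv',
    'keywords.csv', 'credits.csv', 'ratings.csv',
]

def missing_files(directory, names=REQUIRED_FILES):
    """Return the names that do not exist as files inside `directory`."""
    return [name for name in names
            if not os.path.isfile(os.path.join(directory, name))]
```

Calling `missing_files(data_directory)` should return an empty list once all downloads are in place.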

## Create the items DataFrame

In [3]:
links = pd.read_csv(data_directory + 'links.csv')
movies_plain = pd.read_csv(data_directory + 'movies.csv')
metadata = pd.read_csv(data_directory + 'movies_metadata.csv', low_memory=False)
metadata.drop(metadata.columns[[0,1,2,4,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23]], axis=1, inplace=True)
keywords = pd.read_csv(data_directory + 'keywords.csv', low_memory=False)
credits = pd.read_csv(data_directory + 'credits.csv', low_memory=False)

keywords['id'] = keywords['id'].astype('int')
links = links[links['tmdbId'].notnull()].copy()  # copy() avoids a SettingWithCopyWarning on the next line
links['tmdbId'] = links['tmdbId'].astype('int')
metadata = metadata.drop([19730, 29503, 35587])
metadata['id'] = metadata['id'].astype('int')
credits['id'] = credits['id'].astype('int')

movies = metadata.merge(links, how='inner', left_on='id', right_on='tmdbId')
movies = movies.merge(movies_plain, how='inner', left_on='movieId', right_on='movieId')
movies = movies.merge(keywords, how='inner', left_on='id', right_on='id')
movies = movies.merge(credits, how='inner', left_on='id', right_on='id')
movies = movies.drop(columns=['tmdbId','genres_y'])
movies.rename(columns={'genres_x': 'genres'}, inplace=True)

movies=movies[movies['overview'].isnull()==False]

movies = movies[movies['movieId'] <= 1000]

from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    movies[feature] = movies[feature].apply(literal_eval)
    

# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

# Returns the first 3 elements of the list, or the entire list if it has fewer than 3 elements.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

# Define new director, cast, genres and keywords features that are in a suitable form.
movies['director'] = movies['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    movies[feature] = movies[feature].apply(get_list)

    
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        # Check if string exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    movies[feature] = movies[feature].apply(clean_data)

    
# Drop duplicate movies   
import collections
movie_ids = movies['movieId'].tolist()
movie_ids_dup = [x for  x, y in collections.Counter(movie_ids).items() if y > 1]

for movie_id in movie_ids_dup:
    to_drop = movies.index[movies.movieId == movie_id].tolist()[1:]
    movies.drop(to_drop, inplace=True)

movies.drop(columns='crew', inplace=True)


movies.rename(columns={'overview':'plot'}, inplace=True)

def create_metadata(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

# Create a new metadata feature
movies['metadata'] = movies.apply(create_metadata, axis=1)

movies.head()

| | genres | id | plot | movieId | imdbId | title | keywords | cast | director | metadata |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [animation, comedy, family] | 862 | Led by Woody, Andy's toys live happily in his ... | 1 | 114709 | Toy Story (1995) | [jealousy, toy, boy] | [tomhanks, timallen, donrickles] | johnlasseter | jealousy toy boy tomhanks timallen donrickles ... |
| 1 | [adventure, fantasy, family] | 8844 | When siblings Judy and Peter discover an encha... | 2 | 113497 | Jumanji (1995) | [boardgame, disappearance, basedonchildren'sbook] | [robinwilliams, jonathanhyde, kirstendunst] | joejohnston | boardgame disappearance basedonchildren'sbook ... |
| 2 | [romance, comedy] | 15602 | A family wedding reignites the ancient feud be... | 3 | 113228 | Grumpier Old Men (1995) | [fishing, bestfriend, duringcreditsstinger] | [waltermatthau, jacklemmon, ann-margret] | howarddeutch | fishing bestfriend duringcreditsstinger walter... |
| 3 | [comedy, drama, romance] | 31357 | Cheated on, mistreated and stepped on, the wom... | 4 | 114885 | Waiting to Exhale (1995) | [basedonnovel, interracialrelationship, single... | [whitneyhouston, angelabassett, lorettadevine] | forestwhitaker | basedonnovel interracialrelationship singlemot... |
| 4 | [comedy] | 11862 | Just when George Banks has recovered from his ... | 5 | 113041 | Father of the Bride Part II (1995) | [baby, midlifecrisis, confidence] | [stevemartin, dianekeaton, martinshort] | charlesshyer | baby midlifecrisis confidence stevemartin dian... |


## Create the ratings DataFrame

In [4]:
ratings = pd.read_csv(data_directory + 'ratings.csv')
ratings = ratings.drop(columns=['timestamp'])
ratings = ratings[(ratings['userId'] < 1000) & (ratings['movieId'] < 100) ]

ratings = ratings[ratings['movieId'].isin(movies['movieId'])]

# Keep users with at least 2 ratings
ratings_count = ratings.groupby(['userId', 'movieId']).size().groupby('userId').size()
ratings_ok = ratings_count[ratings_count >= 2].reset_index()[['userId']]
ratings = ratings.merge(ratings_ok, 
               how = 'right',
               left_on = 'userId',
               right_on = 'userId')



userIds = ratings.userId.unique()
userIds.sort()
userId_to_userIDX = dict(zip(userIds, range(0, userIds.size)))

ratings = pd.concat([ratings['userId'].map(userId_to_userIDX), ratings['movieId'], ratings['rating']], axis=1)
ratings.columns = ['user', 'item', 'rating']

ratings.head()

| | user | item | rating |
|---|---|---|---|
| 0 | 0 | 2 | 3.5 |
| 1 | 0 | 29 | 3.5 |
| 2 | 0 | 32 | 3.5 |
| 3 | 0 | 47 | 3.5 |
| 4 | 0 | 50 | 3.5 |


## The `ContentBasedRecommender` class

In the following, we will build functionality into the `ContentBasedRecommender` class. The initialization stores the various data sources. A helper function returns the titles of movies.

In [5]:
class ContentBasedRecommender:
    
    def __init__(self, ratings, items):
        self.ratings= ratings
        self.items = items
        self.itemsIds = self.ratings.item.unique()
        self.itemsIds.sort()
        self.userIds = self.ratings.user.unique()
        self.userIds.sort()
        self.item_ids = self.items['movieId'].tolist()
        
    def get_movie_titles(self, ids):
        return [ self.items[self.items['movieId'] == id]['title'].item() for id in ids] 

## Build the content of Items --- TO EDIT


For the purpose of this assignment we consider two types of content.

The first one, called *plot* content, is based on the movie's plot description, taken from the `overview` attribute (renamed to `plot` in the items DataFrame above).

The second one, called *meta* content, is based on the director, the actors, genres, and keywords for the movies.

We will build TF-IDF vectors for the movies based on the two types of content. For this purpose, we will use the `TfidfVectorizer` class from `scikit-learn`.

Steps to implement:

1. Apply the `vectorizer` on the `plot` column of `self.items` to retrieve the TF-IDF vector for each movie. Store the result into `self.plot_tfidf`.
2. Get the feature names from the `vectorizer` and store them in `self.plot_tfidf_tokens`.
3. Apply the `vectorizer` on the `metadata` column of `self.items` to retrieve tf-idf vector for each movie. Store the result into `self.meta_tfidf`
4. Get the feature names from the `vectorizer` and store them in `self.meta_tfidf_tokens`.
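
To see what the vectorizer produces before applying it to the movie data, here is a minimal sketch on a toy corpus. The texts and the `toy_*` names are made up for illustration and are not part of the assignment data. The sketch reads the token-to-column mapping from the fitted `vocabulary_` attribute instead of calling a feature-name method.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the plot column (hypothetical texts).
toy_plots = [
    "a boy and his toys",
    "a board game comes alive",
    "toys come alive when the boy leaves",
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_tfidf = toy_vectorizer.fit_transform(toy_plots)  # sparse matrix, one row per document

# vocabulary_ maps each kept token to its column index in toy_tfidf.
toy_vocab = toy_vectorizer.vocabulary_

# Rows are L2-normalized by default (norm='l2'), so each row norm is 1.
row_norms = np.sqrt(toy_tfidf.multiply(toy_tfidf).sum(axis=1))
print(toy_tfidf.shape, sorted(toy_vocab))
```

Because the rows already have unit length, the inner product of two rows is their cosine similarity, which is what the recommendation step later relies on.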


In [6]:
def build_item_contents(self):
    
    vectorizer = TfidfVectorizer(stop_words='english') # Define a TF-IDF Vectorizer that removes all english stop words (e.g., 'the', 'a')
    
    # Apply Vectorizer on self.items['plot'], and store in self.plot_tfidf
    self.plot_tfidf = vectorizer.fit_transform(self.items['plot'])
    # get feature names from vectorizer and store them in self.plot_tfidf_tokens
    # (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there)
    self.plot_tfidf_tokens = vectorizer.get_feature_names()
    # Apply vectorizer on self.items['metadata'], and store in self.meta_tfidf
    self.meta_tfidf = vectorizer.fit_transform(self.items['metadata'])
    # get feature names from vectorizer and store them in self.meta_tfidf_tokens
    self.meta_tfidf_tokens = vectorizer.get_feature_names()
    

def set_content_type(self, profile_type='plot'):
    if profile_type == 'plot':
        self.tfidf = self.plot_tfidf
        self.tfidf_tokens = self.plot_tfidf_tokens
    else:
        self.tfidf = self.meta_tfidf
        self.tfidf_tokens = self.meta_tfidf_tokens

Add the functions to the class.

In [7]:
ContentBasedRecommender.build_item_contents = build_item_contents
ContentBasedRecommender.set_content_type = set_content_type
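
This notebook repeatedly attaches functions to the class after the class has been defined. The following sketch shows the mechanism in isolation; the `Greeter` class and `greet` function are hypothetical and unrelated to the recommender.

```python
class Greeter:
    def __init__(self, name):
        self.name = name

g = Greeter('Ada')  # instance created before the method exists

# A plain function whose first parameter plays the role of `self`.
def greet(self):
    return 'Hello, ' + self.name

# Assigning the function to a class attribute makes it a method
# on all instances, including ones created earlier.
Greeter.greet = greet

print(g.greet())  # Hello, Ada
```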

Test the function. Show only the nonzero coordinates.

In [8]:
cbr = ContentBasedRecommender(ratings, movies)

cbr.build_item_contents()
cbr.set_content_type()

print(cbr.plot_tfidf.shape)
print(cbr.plot_tfidf.data)

print(cbr.meta_tfidf.shape)
print(cbr.meta_tfidf.data)


(959, 9247)
[0.12730582 0.48323367 0.39884626 ... 0.19143247 0.19143247 0.19143247]
(959, 3796)
[0.26831389 0.38677839 0.36491753 ... 0.38397174 0.38397174 0.38397174]


EXPECTED OUTPUT: 

```
(959, 9247)
[0.12730582 0.48323367 0.39884626 ... 0.19143247 0.19143247 0.19143247]
(959, 3796)
[0.26831389 0.38677839 0.36491753 ... 0.38397174 0.38397174 0.38397174]
```

The following function returns the vector representations for specified items. Vectors are stacked vertically.

In [9]:
def get_item_vectors(self, item_ids):
    item_idx = [self.item_ids.index(item_id) for item_id in item_ids]
    item_vector = self.tfidf[item_idx]
    return item_vector
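
The same id-to-row lookup can be tried on a small example. This is only a sketch with made-up ids and values; `rows_for_ids` is a hypothetical stand-alone counterpart of `get_item_vectors`.

```python
import numpy as np
from scipy import sparse as sp

# Hypothetical ids and a small matrix with one row per id.
toy_item_ids = [10, 20, 30, 40]
toy_matrix = sp.csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [4.0, 0.0, 4.0],
]))

def rows_for_ids(matrix, all_ids, wanted_ids):
    """Select the rows of `matrix` that belong to `wanted_ids`, in that order."""
    positions = [all_ids.index(item_id) for item_id in wanted_ids]
    return matrix[positions]  # indexing with a list stacks the selected rows

selected = rows_for_ids(toy_matrix, toy_item_ids, [40, 20])
print(selected.toarray())
```

Note that `list.index` scans the list on every call; for large catalogs a precomputed dictionary from id to row position would likely be faster.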

Add the function to the class.

In [10]:
ContentBasedRecommender.get_item_vectors = get_item_vectors

## Build the profiles of Users --- TO EDIT

The following function computes the user profile as a vector that averages the tf-idf vectors of all items the user has rated, weighted by the ratings of the user. 

Steps to implement:

1. Get the TF-IDF vectors corresponding to the items rated by the user.
2. Compute a weighted average of these vectors, where each vector is weighted by the rating the user gave to the corresponding item. Store the output into the `user_profile` vector. Tips: You may want to use `scipy.sparse.csr_matrix.multiply` to multiply the sparse TF-IDF vectors with the user ratings.
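
The two steps above can be sketched on toy data as follows. This is one possible approach under stated assumptions, not the reference solution: the vectors and ratings are invented, and `weighted_profile` is a hypothetical helper name.

```python
import numpy as np
from scipy import sparse as sp
import sklearn.preprocessing as pp

# Hypothetical TF-IDF rows for three rated items (4 vocabulary terms).
toy_vectors = sp.csr_matrix(np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.8],
]))
toy_ratings = np.array([5.0, 3.0, 2.0])

def weighted_profile(vectors, ratings):
    """Rating-weighted average of sparse row vectors, L2-normalized."""
    weights = ratings / ratings.sum()                     # weights sum to 1
    weighted = vectors.multiply(weights.reshape(-1, 1))   # scale row i by weights[i]
    profile = sp.csr_matrix(weighted.sum(axis=0))         # shape (1, n_terms)
    return pp.normalize(profile)                          # unit L2 norm

profile = weighted_profile(toy_vectors, toy_ratings)
print(profile.toarray())
```

Reshaping the weights into a column lets `multiply` broadcast one weight per row, so higher-rated items contribute more to the profile.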


In [11]:
def get_user_profile(self, user_id, ratings):
    user_rated_item_ids = np.array( ratings.loc[ ratings['user'] == user_id ]['item'] )
    user_ratings = np.array( ratings.loc[ ratings['user'] == user_id ]['rating'] )
    
    # Get TF-IDF vectors corresponding to items rated by the user
    user_tfidf = self.get_item_vectors( user_rated_item_ids )
    # Calculate weights for ratings
    user_ratings_sum = user_ratings.sum()
    if user_ratings_sum != 0:
        user_ratings /= user_ratings_sum
    user_profile = user_tfidf.T.dot( user_ratings )
    
    # transform into CSR matrix
    user_profile = sp.csr_matrix( user_profile, [1,user_profile.size] )
    
    # Normalize the vectors
    user_profile = pp.normalize( user_profile )
    
    return user_profile

Add the function to the class.

In [12]:
ContentBasedRecommender.get_user_profile = get_user_profile

Test the function. Show only the nonzero coordinates.

In [13]:
user_profile = cbr.get_user_profile(0, ratings)
print(user_profile[user_profile.nonzero()])


[[0.05198875 0.05737849 0.05737849 0.06578048 0.07887771 0.07887771
  0.14400008 0.13359635 0.05014868 0.05570668 0.05378515 0.05463073
  0.06679817 0.05966707 0.10715418 0.12412508 0.06578048 0.05737849
  0.06606548 0.07125637 0.09501268 0.07417306 0.06196849 0.10845538
  0.06379571 0.07681808 0.07417306 0.1027266  0.05942462 0.0757395
  0.06118151 0.04935083 0.07145868 0.11319656 0.04671291 0.07417306
  0.07417306 0.07125637 0.06606548 0.05110664 0.0757395  0.05198875
  0.06606548 0.0442194  0.07145868 0.04101556 0.07125637 0.04338523
  0.06510325 0.1775305  0.06880285 0.06578048 0.03980115 0.07145868
  0.07417306 0.06110326 0.05874736 0.05570668 0.07681808 0.06013355
  0.12859832 0.05429319 0.07417306 0.06606548 0.13881009 0.05014868
  0.0757395  0.0757395  0.07887771 0.05102264 0.06578048 0.07441951
  0.06578048 0.05737849 0.0757395  0.13904182 0.15415399 0.0757395
  0.06414056 0.04986057 0.05942462 0.11749471 0.0757395  0.06363503
  0.03786354 0.05102264 0.05874736 0.08503428 0.07

EXPECTED OUTPUT: 

```
[0.05198875 0.05737849 0.05737849 0.06578048 0.07887771 0.07887771
 0.14400008 0.13359635 0.05014868 0.05570668 0.05378515 0.05463073
 0.06679817 0.05966707 0.10715418 0.12412508 0.06578048 0.05737849
 0.06606548 0.07125637 0.09501268 0.07417306 0.06196849 0.10845538
 0.06379571 0.07681808 0.07417306 0.1027266  0.05942462 0.0757395
 0.06118151 0.04935083 0.07145868 0.11319656 0.04671291 0.07417306
 0.07417306 0.07125637 0.06606548 0.05110664 0.0757395  0.05198875
 0.06606548 0.0442194  0.07145868 0.04101556 0.07125637 0.04338523
 0.06510325 0.1775305  0.06880285 0.06578048 0.03980115 0.07145868
 0.07417306 0.06110326 0.05874736 0.05570668 0.07681808 0.06013355
 0.12859832 0.05429319 0.07417306 0.06606548 0.13881009 0.05014868
 0.0757395  0.0757395  0.07887771 0.05102264 0.06578048 0.07441951
 0.06578048 0.05737849 0.0757395  0.13904182 0.15415399 0.0757395
 0.06414056 0.04986057 0.05942462 0.11749471 0.0757395  0.06363503
 0.03786354 0.05102264 0.05874736 0.08503428 0.07417306 0.05570668
 0.13648206 0.06842138 0.04125971 0.0757395  0.05306877 0.08503428
 0.06880285 0.06578048 0.06115068 0.06842138 0.06578048 0.06842138
 0.05829674 0.06206254 0.13156539 0.06679817 0.07145868 0.05429319
 0.06578048 0.05570668 0.07201191 0.08503428 0.11680625 0.07887771
 0.07441951 0.05942462 0.11427411 0.06118151 0.05874736 0.11958887
 0.08503428 0.08022811 0.08933356 0.06595685 0.06860188 0.07441951
 0.08503428 0.08503428 0.14400008 0.11781031 0.08503428 0.0362648
 0.0757395  0.12364006 0.03615293 0.06606548 0.15940184 0.12784601
 0.04596717 0.07887771 0.05737849 0.03444142 0.08022811 0.06880285
 0.05306877 0.06578048 0.07681808 0.06880285 0.15147901 0.06860188
 0.07681808 0.11884924 0.06206254 0.05942462 0.03947596 0.03490385
 0.04077101]
```

Build the profiles of all users. Use only the positive ratings (positive feedback) to determine weights.

In [14]:
def build_user_profiles(self):
    # Use the ratings stored on the instance rather than the global `ratings` variable
    positive_ratings = self.ratings[self.ratings.rating > 3]
    self.user_profiles = {}
    for user_id in positive_ratings['user'].unique():
        self.user_profiles[user_id] = self.get_user_profile(user_id, positive_ratings)
    

In [15]:
ContentBasedRecommender.build_user_profiles = build_user_profiles

In [16]:
cbr.build_user_profiles()

## Make Recommendations --- TO EDIT


The following function recommends topN items to the user based on her/his profile. The recommendations should exclude items already rated by the user.

Steps to implement:
1. Retrieve the user profile
2. Compute the cosine similarity between the user profile and each TF-IDF vector, and store the result in the array `sims`. Tips: Since all vectors are normalized, the inner product equals the cosine similarity, so you can use `linear_kernel` from scikit-learn. Also, flatten the output at the end.
3. Find the indices in `sims` of the `topN` + `num_rated` largest similarities. Taking `num_rated` extra candidates guarantees that at least `topN` unrated items remain after filtering. Tips: `a[::-1]` returns the reverse of list `a`. You may want to use the `numpy.argsort` method.
4. Retrieve the item_ids from `self.item_ids` that correspond to the indices found.
5. Exclude those item_ids that are in `user_rated_item_ids`. Return the remaining ones as the recommendations. Note that the recommended items should be sorted from most to least similar to the user profile.
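
The ranking and filtering logic of these steps can be illustrated on a tiny example. The sketch below assumes unit-length vectors and uses invented ids and values; `top_n_unseen` is a hypothetical stand-alone function, not the method to be submitted.

```python
import numpy as np
from scipy import sparse as sp
from sklearn.metrics.pairwise import linear_kernel

toy_item_ids = [10, 20, 30, 40]
# Rows are unit-length item vectors (hypothetical values).
toy_items = sp.csr_matrix(np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.8, 0.6],
]))
toy_profile = sp.csr_matrix(np.array([[1.0, 0.0]]))  # unit-length user profile
already_rated = [10]

def top_n_unseen(items, item_ids, profile, rated_ids, top_n):
    # Dot products equal cosine similarities because all vectors have unit norm.
    sims = linear_kernel(items, profile).flatten()
    # Indices of the largest similarities, most similar first.
    order = np.argsort(sims)[::-1][:top_n + len(rated_ids)]
    ranked = [item_ids[i] for i in order if item_ids[i] not in rated_ids]
    return ranked[:top_n]

print(top_n_unseen(toy_items, toy_item_ids, toy_profile, already_rated, top_n=2))  # [40, 30]
```

Item 10 has the highest similarity but is dropped because the user already rated it, so the next two most similar items are returned.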


In [17]:
def recommend(self, user_id, topN=20):
    user_rated_item_ids = np.array( self.ratings.loc[ self.ratings['user'] == user_id ]['item'] )
    num_rated = len(user_rated_item_ids)
    
    # get user profile
    user_profile = self.user_profiles[user_id]

    # calculate similarities (linear kernel)
    sims = linear_kernel( self.tfidf, user_profile ).flatten()
    
    # get indices by order
    a = np.argsort( sims )[::-1]
    
    # shorten to topN + num_rated
    a = a[:topN+num_rated]
    
    recommendations = [ self.item_ids[i] for i in a if self.item_ids[i] not in user_rated_item_ids ]
    recommendations = recommendations[0:topN]
    return recommendations
    

Add the function to the class.

In [18]:
ContentBasedRecommender.recommend = recommend

Test the function.

In [19]:
recs = cbr.recommend(0)
print(recs)


[481, 598, 126, 634, 22, 292, 9, 616, 628, 636, 664, 315, 170, 996, 263, 165, 880, 958, 671, 648]


EXPECTED OUTPUT: 

```
<class 'numpy.ndarray'>
[481, 598, 126, 634, 22, 292, 9, 616, 628, 636, 664, 315, 170, 996, 263, 165, 880, 958, 671, 648]
```

Show the movie titles of the recommendations.

In [20]:
display(cbr.get_movie_titles(recs))

['Kalifornia (1993)',
 'Window to Paris (Okno v Parizh) (1994)',
 'NeverEnding Story III, The (1994)',
 'Theodore Rex (1995)',
 'Copycat (1995)',
 'Outbreak (1995)',
 'Sudden Death (1995)',
 'Aristocats, The (1970)',
 'Primal Fear (1996)',
 'Frisk (1995)',
 'Faithful (1996)',
 'Specialist, The (1994)',
 'Hackers (1995)',
 'Last Man Standing (1996)',
 'Ladybird Ladybird (1994)',
 'Die Hard: With a Vengeance (1995)',
 'Island of Dr. Moreau, The (1996)',
 'Lady of Burlesque (1943)',
 'Mystery Science Theater 3000: The Movie (1996)',
 'Mission: Impossible (1996)']

In [21]:
# feel free to use this field for additional tests
