# BLU11 - Exercise Notebook

## Create your own movie recommender system

This exercise notebook walks you through building a recommender system with collaborative and content-based filtering. At the end, it will pick some movies according to your preferences: we will add you as a new user and check which suggestions the system makes for you! 😀

## Overall Strategy

1. **Setup:** Import and preprocess the data
1. **Collaborative Filtering:** usually gives better recommendations, but suffers from the cold-start problem
1. **Content-based Filtering:** works from the item descriptions, so more items can receive a score

In [1]:
# Define your setup
import os
import hashlib # for grading purposes
import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix, triu, coo_matrix

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Q0 - Setup (non-graded)

The first step is to import and analyze the data (in `data/ml-latest-small`).

For these exercises, we'll be using a standard dataset for recommendations, called the [MovieLens](https://grouplens.org/datasets/movielens/) dataset. We'll be using the smallest version of the dataset.

Small: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

We have three data files:
* `ratings.csv`: has 100k ratings of 9k movies by 600 users
* `movies.csv`: has the movieId, title and genre for 9k movies
* `tags.csv`: has 3.6k tags applied to 9k movies by 600 users

In [2]:
def import_ratings_list(path):
    """
    Parameters
    ----------
    path : filepath of the ratings file to import
    
    
    Returns
    -------
    all_ratings : DataFrame of the ratings.
    """
    all_ratings = pd.read_csv(path)
    return all_ratings

all_ratings = import_ratings_list(os.path.join("data", "ml-latest-small", "ratings.csv"))

In [3]:
print(f"In total we have {len(all_ratings)} ratings available.")

In total we have 100836 ratings available.


In [4]:
all_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
userId       100836 non-null int64
movieId      100836 non-null int64
rating       100836 non-null float64
timestamp    100836 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [5]:
# Get a glimpse of the all_ratings DataFrame
all_ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
def import_movies_details(path):
    """
    Parameters
    ----------
    path : filepath of the movies file to import
    
    
    Returns
    -------
    movies_df : DataFrame of the movies details
    """
    return pd.read_csv(path).set_index("movieId")

movies_df = import_movies_details(os.path.join("data", "ml-latest-small", "movies.csv"))

In [7]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9742 entries, 1 to 193609
Data columns (total 2 columns):
title     9742 non-null object
genres    9742 non-null object
dtypes: object(2)
memory usage: 228.3+ KB


In [8]:
movies_df.head()

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


# Q1 - Add new Preferences

So, imagine you've just created a Netflix account.
In the setup process, you're asked to rate some movies so that you have some data associated with your profile.

To simulate this, we will add your movie preferences to the existing `all_ratings` DataFrame.

Note: we will use predefined ratings (to evaluate the exercises), but in the end you can play with these initial ratings to test out the recommender system.
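The extension step can be sketched on a toy frame. This is a minimal illustration, not the graded solution: the `toy_*` names and all values are made up, and it uses `pd.concat`, since `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Toy stand-in for all_ratings (values are made up for illustration).
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 1],
    "rating": [4.0, 3.5, 5.0],
    "timestamp": [964982703, 964981247, 964982224],
})

# New user id: one past the current maximum.
toy_new_user_id = toy_ratings["userId"].max() + 1

toy_new_preferences = [
    {"userId": toy_new_user_id, "movieId": 3, "rating": 4.5, "timestamp": 964982931},
]

# pd.concat aligns on column names; ignore_index=True rebuilds a clean 0..n-1 index.
toy_extended = pd.concat(
    [toy_ratings, pd.DataFrame(toy_new_preferences)],
    ignore_index=True,
)
print(toy_extended.tail(2))
```

Because the new rows carry the same four keys as the existing columns, the result keeps the original 4-column structure.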

In [9]:
# new_user_id is the maximum user id available plus one (because science).
new_user_id = max(all_ratings["userId"]) + 1


preference_1 = {"userId": new_user_id,
                "movieId": 1,          #Toy Story
                "rating": 4.0,
                "timestamp": 964982703}

preference_2 = {"userId": new_user_id,
                "movieId": 32,         #Twelve Monkeys
                "rating": 4.5,
                "timestamp": 964982931}

preference_3 = {"userId": new_user_id,
                "movieId": 33672,      #Lords of Dogtown
                "rating": 2.5,
                "timestamp": 964982224}

new_preferences = [preference_1, preference_2, preference_3]

In [10]:
new_preferences

[{'userId': 611, 'movieId': 1, 'rating': 4.0, 'timestamp': 964982703},
 {'userId': 611, 'movieId': 32, 'rating': 4.5, 'timestamp': 964982931},
 {'userId': 611, 'movieId': 33672, 'rating': 2.5, 'timestamp': 964982224}]

In [11]:
pd.DataFrame(new_preferences)

,movieId,rating,timestamp,userId
0,1,4.0,964982703,611
1,32,4.5,964982931,611
2,33672,2.5,964982224,611


In [12]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
pd.concat([all_ratings, pd.DataFrame(new_preferences)], ignore_index=True).tail(6)


,movieId,rating,timestamp,userId
100833,168250,5.0,1494273047,610
100834,168252,5.0,1493846352,610
100835,170875,3.0,1493846415,610
100836,1,4.0,964982703,611
100837,32,4.5,964982931,611
100838,33672,2.5,964982224,611


In [13]:
def add_new_preferences(new_preferences, all_ratings):
    """
    Add the new preferences to the existing all_ratings DataFrame.
    
    Parameters
    ----------
    new_preferences : list
                      A list of dicts containing the new preferences.
                      
    all_ratings : pd.DataFrame
                  DataFrame containing all the ratings available.
    
    Returns
    -------
    all_ratings_extended : pd.DataFrame
                           Existing all_ratings DataFrame extended with the new_preferences.
                           Keep the original structure of 4 columns and guarantee the index is correctly ordered.
    """
    # YOUR CODE HERE
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
    all_ratings_extended = pd.concat([all_ratings, pd.DataFrame(new_preferences)], ignore_index=True)
    return all_ratings_extended
    
all_ratings_extended = add_new_preferences(new_preferences, all_ratings)

In [14]:
expected_hash = 'ba4b0c26c8e4a2dd240461d4891c265af433569d06cd61f18fdd2d429615e0d5'
assert hashlib.sha256(str(all_ratings_extended.shape).encode()).hexdigest() == expected_hash

expected_hash = 'b7a56873cd771f2c446d369b649430b65a756ba278ff97ec81bb6f55b2e73569'
test_value = str(int(all_ratings_extended.tail().iloc[-1]["rating"] * 10))
assert hashlib.sha256(test_value.encode()).hexdigest() == expected_hash

# Q2 - Build the Ratings Matrix

Based on the ratings data, create the ratings matrix. This should follow the format adopted in the Learning Notebooks: rows are users, columns are items, and ratings are the matrix values.

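Note that `userId=1` is represented by the row with index `0`. Items work the same way: columns follow the sorted movieIds, so `movieId=1` is column `0`. Because movieIds are not contiguous, the column index of a movie generally differs from `movieId - 1`.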
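As a rough sketch of this layout, here is the same pivot-then-sparsify idea on a toy ratings list. The `toy_*`, `user_pos` and `movie_pos` names are hypothetical helpers introduced only for this illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix

toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 30, 10, 20],
    "rating": [4.0, 5.0, 3.0, 2.5],
})

# Wide table: one row per user, one column per movie, NaN where unrated.
toy_wide = toy_ratings.pivot(index="userId", columns="movieId", values="rating").sort_index()

# Zeros mark "no rating"; csr_matrix only stores the non-zero entries.
toy_matrix = csr_matrix(toy_wide.fillna(0).to_numpy())

# Lookup tables from ids to 0-based positions (ids need not be contiguous).
user_pos = {uid: i for i, uid in enumerate(toy_wide.index)}
movie_pos = {mid: j for j, mid in enumerate(toy_wide.columns)}

print(toy_matrix.toarray())
print(movie_pos)
```

Keeping the id-to-position lookups around makes it easy to translate between matrix coordinates and the original ids later on.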

In [15]:
all_ratings_extended.pivot(index='userId', columns='movieId', values='rating').sort_index().fillna(0).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 611 entries, 1 to 611
Columns: 9724 entries, 1 to 193609
dtypes: float64(9724)
memory usage: 45.3 MB


In [16]:
def create_ratings_matrix(all_ratings_extended):
    """
    Parameters
    ----------
    all_ratings_extended : pd.DataFrame
                           DataFrame with the original ratings list plus the new preferences defined above.
                   
    Returns
    -------
    ratings : csr_matrix
              Sparse ratings matrix. Rows (users) and columns (items) are sorted by id.
    """
    # YOUR CODE HERE
    return csr_matrix(all_ratings_extended.pivot(index='userId', columns='movieId', values='rating')
                      # Good practice when setting the index.
                      .sort_index()
                      # csr_matrix would store NaN as explicit values, so mark missing ratings with 0.
                      .fillna(0)) 

    
ratings = create_ratings_matrix(all_ratings_extended)

In [17]:
expected_hash = '1b25bb37050096f8d9533f5c295f77a2705db79def84cccabbdc97db9dd34ef5'
assert hashlib.sha256(str(ratings.shape).encode()).hexdigest() == expected_hash

In [18]:
availability = ratings.nnz / (ratings.shape[0] * ratings.shape[1])
np.testing.assert_approx_equal(availability, 0.01697, significant=3)

In [19]:
print(f"Our ratings matrix has now {ratings.shape[0]} users and {ratings.shape[1]} items.")
print(f"Our availability of ratings is {round(availability*100, 2)}% of the total!")

Our ratings matrix has now 611 users and 9724 items.
Our availability of ratings is 1.7% of the total!



# Q3 - User Similarities

Now that we have the ratings matrix already built, let's check some similarities.


## Q3.1 - Overall User Similarities

Get the similarities between users using the cosine similarity.
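To build some intuition first, here is a small sketch on a toy ratings matrix (the values and the `toy_*` names are made up). Users who rated the same movies in the same proportions get a similarity of 1, and users with no movie in common get 0.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

toy_R = csr_matrix(np.array([
    [4.0, 0.0, 5.0],   # user 0
    [4.0, 0.0, 5.0],   # user 1: same taste as user 0
    [0.0, 3.0, 0.0],   # user 2: no movie in common with the others
]))

# dense_output=False keeps the result sparse, which matters for large user bases.
toy_sims = cosine_similarity(toy_R, dense_output=False)
print(toy_sims.toarray().round(2))
```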

In [20]:
def get_users_cosine_similarity(ratings):
    """
    Get the cosine similarity between users.
    
    Parameters
    ----------
    ratings : csr_matrix
              Ratings matrix.

    Returns
    -------
    users_similarities : csr_matrix
                        Sparse representation of the cosine similarity between users.
    """
    # YOUR CODE HERE
    return cosine_similarity(ratings, dense_output=False)
    
users_similarities = get_users_cosine_similarity(ratings)

In [21]:
expected_hash = '8cde0a5542e6a6e1551ffbd9a804edabe20b546a4c3dbdd870ee4a478eed3efe'
assert hashlib.sha256(str(users_similarities.shape).encode()).hexdigest() == expected_hash

expected_hash = '46e0bd98d1e6c6519f99e2a20f74425f2ecc53e654bdb68182663dbf19271f63'
assert hashlib.sha256(str(users_similarities.nnz).encode()).hexdigest() == expected_hash

## Q3.2 - What is the nearest neighbor of our newly added user?

Here we want to find the existing user who is the most similar to the newly added user.
Return the index of the nearest neighbor user.
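The idea can be sketched on a toy dense similarity matrix (made-up values; the last row plays the role of the new user). The only subtlety is that every user is perfectly similar to themselves, so that entry has to be masked before taking the maximum.

```python
import numpy as np

# Toy similarity matrix for 4 users; the last row is the "new" user.
toy_sims = np.array([
    [1.0, 0.2, 0.0, 0.3],
    [0.2, 1.0, 0.5, 0.9],
    [0.0, 0.5, 1.0, 0.1],
    [0.3, 0.9, 0.1, 1.0],
])

new_user_idx = toy_sims.shape[0] - 1
row = toy_sims[new_user_idx].copy()

# Mask the self-similarity so the user cannot be their own nearest neighbor.
row[new_user_idx] = -np.inf

toy_closest = int(np.argmax(row))
print(toy_closest)  # 1
```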

In [22]:
def get_closest_user_index(users_similarities):
    """
    Return the index of the closest user to the newly added user.
    Hint: since the newly added user had the highest userId, they will be at the last row/column of
    the similarities matrix.
    
    Parameters
    ----------
    users_similarities : csr_matrix
                         Cosine similarity between users matrix.
    
    Returns
    -------
    closest_user_index : int
    """
    
    # YOUR CODE HERE
    # last user
    user_id = users_similarities.shape[0]-1
    
    # Similarities for the last user (a single row of the matrix)
    S = users_similarities[user_id].toarray().flatten()
    
    # Exclude the user themselves by giving them the lowest possible similarity
    S[user_id] = -1
    
    # Select the maximizer (last index)
    closest_user_index = np.argsort(S)[-1]
    
    return closest_user_index
    
closest_user_index = get_closest_user_index(users_similarities)

In [23]:
expected_hash = 'bd3a797ba948938978965781bd341bc0fc7711ed00e513b9c63a61cf3d916562'
assert hashlib.sha256(str(closest_user_index).encode()).hexdigest() == expected_hash

## Q3.3 - Compare the preferences of these two users (non-graded)

Now that we have the index of our nearest neighbor, let's check whether the preferences of these two users are in fact similar.

In [24]:
def compare_with_new_user(all_ratings, userId, new_preferences):
    """
    Compare a userId ratings with the ones we have in the new preferences.
    Remember that the userId must be the index of the user in the matrix plus 1, since Python indices are 0-based
    and our ids start at 1.
    """
    moviesIds = [x["movieId"] for x in new_preferences]
    comparable_ratings = all_ratings[all_ratings["userId"] == userId]
    comparable_ratings = comparable_ratings[comparable_ratings["movieId"].isin(moviesIds)]
    return comparable_ratings

In [25]:
print("New user's preferences")
pd.DataFrame(new_preferences)

New user's preferences


,movieId,rating,timestamp,userId
0,1,4.0,964982703,611
1,32,4.5,964982931,611
2,33672,2.5,964982224,611


In [26]:
print("Nearest neighbor's preferences")
compare_with_new_user(all_ratings, closest_user_index+1, new_preferences)

Nearest neighbor's preferences


,userId,movieId,rating,timestamp
83411,529,1,3.0,855583216
83413,529,32,5.0,855583215


## Q3.4 - Get Users Predictions
To make user-based predictions, you can combine what you know about the true ratings (`ratings` matrix) with the similarities between users calculated previously (`users_similarities`): each prediction is a similarity-weighted average of the users' ratings.
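A worked toy example of that weighted average, using small dense arrays with made-up values. It follows the same recipe as the exercise (weighted sum divided by the sum of absolute similarities, then masking already-rated items), so unrated entries count as zeros and pull the predictions down.

```python
import numpy as np

toy_S = np.array([      # user-user similarities
    [1.0, 0.5],
    [0.5, 1.0],
])
toy_R = np.array([      # ratings, 0 means "not rated"
    [4.0, 0.0],
    [2.0, 5.0],
])

weighted_sum = toy_S @ toy_R                                  # (n_users, n_items)
sum_of_weights = np.abs(toy_S).sum(axis=1, keepdims=True)     # (n_users, 1)
toy_preds = weighted_sum / sum_of_weights

# Items a user already rated are not candidates for recommendation.
toy_preds[toy_R.nonzero()] = 0
print(toy_preds.round(4))
```

User 0 has not rated item 1, so the prediction is `(1.0 * 0 + 0.5 * 5) / 1.5`, roughly 1.67. Every other entry was already rated and is therefore set to 0.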

In [27]:
def make_user_predictions(user_sims, R):
    """
    Parameters
    ----------
    user_sims : csr_matrix, shape: (n_users, n_users)
                Matrix with the similarities between users.
    
    R : csr_matrix, shape: (n_users, n_items)
        Matrix with the available ratings.
    
    Returns
    -------
    users_predictions : csr_matrix, shape: (n_users, n_items)
                        Ratings predictions.
    """
    # YOUR CODE HERE
    # Use the sparse matrix's own dot method rather than np.dot for sparse operands.
    weighted_sum = user_sims.dot(R)
    
    # We use the absolute value to support negative similarities.
    # In this particular example there are none.
    sum_of_weights = np.abs(user_sims).sum(axis=1)
    
    preds = weighted_sum / sum_of_weights
    
    # Exclude previously rated items.
    preds[R.nonzero()] = 0
    
    return csr_matrix(preds)

users_predictions = make_user_predictions(users_similarities, ratings)

In [28]:
expected_hash = '1b25bb37050096f8d9533f5c295f77a2705db79def84cccabbdc97db9dd34ef5'
assert hashlib.sha256(str(users_predictions.shape).encode()).hexdigest() == expected_hash

In [29]:
np.testing.assert_almost_equal(users_predictions[-1, 1], 0.7765, decimal=4)
np.testing.assert_almost_equal(users_predictions[-1].toarray()[0, 3000], 0.0526, decimal=4)

# Q4 - Content-based Filtering
Now let's move on to predictions based on the characteristics of the items themselves.

We already imported the `movies_df` DataFrame; we just need to preprocess it a bit.

We also have the tags file `tags.csv` that we can use. Since we have multiple tags for the same movie, we will join them.
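The join can be sketched on toy data before tackling the real files. This is only an illustration under made-up values: the `toy_*` names are hypothetical, and it uses `DataFrame.join` for brevity.

```python
import pandas as pd

toy_tags = pd.DataFrame(
    {"tag": ["pixar", "fun", "remake"]},
    index=pd.Index([1, 1, 7], name="movieId"),
)

# One row per movie, with all of its tags joined by pipes.
toy_joined = toy_tags.groupby("movieId")["tag"].apply(lambda tags: "|".join(tags))

toy_movies = pd.DataFrame(
    {"title": ["Toy Story (1995)", "Jumanji (1995)"]},
    index=pd.Index([1, 2], name="movieId"),
)

# A left join keeps every movie; movies without tags get an empty string.
toy_movies_with_tags = toy_movies.join(toy_joined, how="left").fillna("")
print(toy_movies_with_tags)
```

The left join matters here: a movie without tags still needs a row, and a tagged movie that is missing from the movies table (movieId 7 above) is dropped.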

In [30]:
def import_tags_file(path):
    df = pd.read_csv(path)                             # Import the DataFrame
    df = df [["movieId", "tag"]]                       # Select relevant features
    df = df.set_index("movieId")                       # Set movieId as the index
    df["tag"] = df["tag"].str.replace(" ", "")         # Remove the whitespaces for multi-word tags
    df["tag"] = df["tag"].str.lower()                  # Lowercase all the tags for the sake of uniformity

    return df

tags_df = import_tags_file(os.path.join("data", "ml-latest-small", "tags.csv"))
tags_df.head()

movieId,tag
60756,funny
60756,highlyquotable
60756,willferrell
89774,boxingstory
89774,mma


## Q4.1  - Preprocessing the contents

Since preprocessing involves several steps, we will do what every programmer should do when facing a complex task: *divide it into multiple digestible, understandable chunks of work*.

In [31]:
tags_df.index.nunique()

1572

### Q4.1.1 - Preprocessing the Tags

In [32]:
def preprocess_tags(tags_df):
    """
    1st step of preprocessing the movies contents.
    Join the multiple tags in a single row for each movie, with the tags separated by pipes and their whitespaces removed.
    
    Parameters
    ----------
    tags_df : pd.DataFrame
              Original dataframe for the tags
              
    Returns
    -------
    preprocessed_tags : pd.Series
                        Series with movieId as Index and the multiple tags without whitespaces pipe-separated.
    """
    # YOUR CODE HERE
    return tags_df.groupby(by='movieId')['tag'].apply(lambda x: '|'.join(x)).sort_index().str.replace(' ', '')

preprocessed_tags = preprocess_tags(tags_df)

In [33]:
expected_hash = '9a7537a814317708c2e672111fd17f6a117a932ebdff5584bbedd67152396a7b'
assert hashlib.sha256(str(preprocessed_tags.shape).encode()).hexdigest() == expected_hash

assert isinstance(preprocessed_tags, pd.Series)

### Q4.1.2 - Join the Movies with the Tags

In [43]:
def join_movies_with_tags(movies_df, tags_df):
    """
    2nd step of preprocessing the movie contents.
    Join the new tags obtained onto the movies_df.
    CAUTION: Since we will not have tags for all the movies, just fill the missing ones with empty strings.
    
    Parameters
    ----------
    movies_df : pd.DataFrame
    
    tags_df : pd.DataFrame
    
    
    Returns
    -------
    movies_with_tags : pd.DataFrame
    """
    # YOUR CODE HERE
    
    # The input was actually a series! It must be a DataFrame in order to be able to use it in pd.merge()
    tags_df = pd.DataFrame(tags_df)
         
    
    return pd.merge(movies_df, tags_df, how='left', on='movieId').fillna(value="")

    
movies_with_tags = join_movies_with_tags(movies_df, preprocessed_tags)

In [44]:
expected_hash = '4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce'
assert hashlib.sha256(str(len(movies_with_tags.columns)).encode()).hexdigest() == expected_hash

assert isinstance(movies_with_tags, pd.DataFrame)

In [49]:
preprocess_tags(tags_df).head()

movieId
1                              pixar|pixar|fun
2    fantasy|magicboardgame|robinwilliams|game
3                                    moldy|old
5                             pregnancy|remake
7                                       remake
Name: tag, dtype: object

In [47]:
movies_df.head()

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


In [50]:
join_movies_with_tags(movies_df, preprocessed_tags).head()

movieId,title,genres,tag
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar|pixar|fun
2,Jumanji (1995),Adventure|Children|Fantasy,fantasy|magicboardgame|robinwilliams|game
3,Grumpier Old Men (1995),Comedy|Romance,moldy|old
4,Waiting to Exhale (1995),Comedy|Drama|Romance,
5,Father of the Bride Part II (1995),Comedy,pregnancy|remake


### Q4.1.3 - Preprocessing the Movies Contents

In [84]:
def preprocess_movies(movies_df, tags_df):
    """
    Get the "Description" column by joining the genres, the tags and the year (separated by pipes "|").
    Remove the year from the title and add it as a feature.
    This "Description" column will be used for the TF-IDF.
    
    Steps needed:
    1. Get the tags. Preprocess their content to have multiple pipe-separated tags for each movie.
    2. Join the movies_df with the new preprocessed_tags.
    3. Get the year as a feature. Clean the text in the "title" feature and add the year as a new "year" feature.
    4. Create a new "Description" feature with "genres", "tag" and "year" features. Join them with a pipe "|".
    
    You can use the functions "preprocess_tags" and "join_movies_with_tags" as defined above.
    
    Parameters
    ----------
    movies_df : pd.DataFrame
                original movies dataframe
                
    tags_df : pd.DataFrame
              contains the tags as imported above
    
    Returns
    -------
    preprocessed_movies : pd.DataFrame, shape: (n_items, 1)
                          Movies DataFrame preprocessed with both tags and genres ready to be TF-IDF Vectorized.
    """
    preprocessed_tags = preprocess_tags(tags_df)
    
    preprocessed_movies = join_movies_with_tags(movies_df, preprocessed_tags)
    
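    # Take the last token of the title, e.g. "(1995)", and strip the parentheses.
    # regex=True is explicit because newer pandas versions default to literal replacement.
    preprocessed_movies["year"] = preprocessed_movies["title"].str.split().apply(lambda x: x[-1]).str.replace(r"[()]", "", regex=True)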
    
    preprocessed_movies["Description"] = preprocessed_movies.genres + "|" + preprocessed_movies.tag + "|" + preprocessed_movies.year
    preprocessed_movies.Description = preprocessed_movies.Description.str.replace("||", "|", regex=False)
    
    
    # YOUR CODE HERE
    
    return preprocessed_movies[["Description"]]


preprocessed_movies = preprocess_movies(movies_df, tags_df)
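As an alternative sketch (not the graded solution above), the year can be captured with a regular expression and the description assembled from the non-empty parts only. The toy frame, the `toy_movies` name and the title without a year are assumptions made for this illustration.

```python
import pandas as pd

toy_movies = pd.DataFrame(
    {
        "title": ["Toy Story (1995)", "Jumanji (1995)", "Untitled Project"],
        "genres": ["Adventure|Animation", "Adventure|Children", "Drama"],
        "tag": ["pixar|fun", "", ""],
    },
    index=pd.Index([1, 2, 3], name="movieId"),
)

# Capture four digits inside the trailing parentheses; titles without a year give NaN.
toy_movies["year"] = (
    toy_movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False).fillna("")
)

# Join only the non-empty parts, so no doubled or trailing pipes need cleaning up.
toy_movies["Description"] = toy_movies[["genres", "tag", "year"]].apply(
    lambda row: "|".join(part for part in row if part), axis=1
)
print(toy_movies["Description"].tolist())
```

Taking the last whitespace-separated token, as the solution above does, assumes every title ends with a year. The regex variant returns an empty year when that assumption does not hold.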

In [85]:
expected_hash = '4c59be6c9b0a96e2ead911ccca3567ac875d14038d9a420c100cb9fe6a0f17d3'
assert hashlib.sha256(str(preprocessed_movies.shape).encode()).hexdigest() == expected_hash

expected_hash = '2ce22ef15aca0e77d083cb5a74a14297279231df2016013f18971f1f598e7fdf'
assert hashlib.sha256(str(preprocessed_movies.loc[1]["Description"].lower()).encode()).hexdigest() == expected_hash

In [86]:
# Get a glimpse of the preprocessed movies
preprocessed_movies.head()

movieId,Description
1,Adventure|Animation|Children|Comedy|Fantasy|pi...
2,Adventure|Children|Fantasy|fantasy|magicboardg...
3,Comedy|Romance|moldy|old|1995
4,Comedy|Drama|Romance|1995
5,Comedy|pregnancy|remake|1995


In [102]:
all_ratings_extended.head(10)

,movieId,rating,timestamp,userId
0,1,4.0,964982703,1
1,3,4.0,964981247,1
2,6,4.0,964982224,1
3,47,5.0,964983815,1
4,50,5.0,964982931,1
5,70,3.0,964982400,1
6,101,5.0,964980868,1
7,110,4.0,964982176,1
8,151,5.0,964984041,1
9,157,5.0,964984100,1


## Q4.2 - Matching the Contents
### Matching variables (non-graded)

After preprocessing the content of the movies (items), we need to align them with the ratings we already have. To do this, we compare the index of our `preprocessed_movies` DataFrame (already unique) with the unique movieIds in our ratings listing, `all_ratings_extended`.

In [95]:
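# np.array_equal also handles arrays of different lengths, returning False instead of warning.
np.array_equal(np.array(preprocessed_movies.index), np.unique(all_ratings_extended["movieId"]))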


False


... and VOILÀ! They don't match! 😡 Let's check their sizes:

In [99]:
print(f"Preprocessed Movies index has {preprocessed_movies.index.shape[0]} elements")
print(f"Original Ratings listing has {ratings.shape[1]} elements")

Preprocessed Movies index has 9742 elements
Original Ratings listing has 9724 elements


Since `preprocessed_movies` (built from `movies.csv`) contains more movieIds than the ratings matrix, we need to subset it to include only the movies for which we have ratings.
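A minimal sketch of that subsetting on toy data, assuming the descriptions are indexed by movieId as above. The `toy_*` names and values are made up.

```python
import numpy as np
import pandas as pd

toy_descriptions = pd.DataFrame(
    {"Description": ["Comedy|1995", "Drama|1996", "Action|1997"]},
    index=pd.Index([1, 2, 3], name="movieId"),
)
toy_ratings = pd.DataFrame({"userId": [1, 2], "movieId": [3, 1], "rating": [4.0, 5.0]})

# Boolean mask: True for movies that appear at least once in the ratings.
keep = toy_descriptions.index.isin(toy_ratings["movieId"])
toy_filtered = toy_descriptions[keep]

# After filtering, the index lines up with the sorted unique movieIds of the ratings.
print(np.array_equal(toy_filtered.index.to_numpy(), np.unique(toy_ratings["movieId"])))
```

Filtering with a mask preserves the original (sorted) order of the index, which is what keeps the rows aligned with the columns of the ratings matrix.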

In [108]:
def reduce_preprocessed_movies_scope(preprocessed_movies, all_ratings_extended):
    """
    Filter out the movies that have information but that are not present in the original ratings listing.
    
    Tips and tricks:
    Get the unique movieIds for the original listing (the index for preprocessed_movies should already be unique).
    Get the movies for which we have a description but no ratings.
    Select the movies in the preprocessed_movies which are not in the above.
    
    Parameters
    ----------
    preprocessed_movies : pd.DataFrame
                          DataFrame with the Description for all the movies in the movies file.
                          
    all_ratings_extended : pd.DataFrame
                           Listing of all the ratings available.
                          
    Returns
    -------
    
    filtered_preprocessed_movies : pd.DataFrame
                                   
    """
    # YOUR CODE HERE
    
    #unique_movieId_in_all_ratings_extended = all_ratings_extended.movieId.drop_duplicates()
    
    idx_to_keep = preprocessed_movies.index.isin(all_ratings_extended.movieId)
    
    return preprocessed_movies[idx_to_keep]
    
filtered_preprocessed_movies = reduce_preprocessed_movies_scope(preprocessed_movies, all_ratings_extended)

In [109]:
expected_hash = 'f46df9a8f5e1c4c1f0e17649a4214e19503569f77991533f1db1747b28523fe5'
assert hashlib.sha256(str(len(filtered_preprocessed_movies.index)).encode()).hexdigest() == expected_hash

assert np.all(np.array(filtered_preprocessed_movies.index) == np.unique(all_ratings_extended["movieId"]))

In [111]:
preprocessed_movies.head()

movieId,Description
1,Adventure|Animation|Children|Comedy|Fantasy|pi...
2,Adventure|Children|Fantasy|fantasy|magicboardg...
3,Comedy|Romance|moldy|old|1995
4,Comedy|Drama|Romance|1995
5,Comedy|pregnancy|remake|1995


In [112]:
vec = TfidfVectorizer()
vec.fit(preprocessed_movies.Description)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [114]:
vec.transform(preprocessed_movies.Description)

<9742x1625 sparse matrix of type '<class 'numpy.float64'>'
	with 36510 stored elements in Compressed Sparse Row format>

In [118]:
vec.get_feature_names()

['06oscarnominatedbestmovie',
 '1900s',
 '1902',
 '1903',
 '1908',
 '1915',
 '1916',
 '1917',
 '1919',
 '1920',
 '1920s',
 '1921',
 '1922',
 '1923',
 '1924',
 '1925',
 '1926',
 '1927',
 '1928',
 '1929',
 '1930',
 '1931',
 '1932',
 '1933',
 '1934',
 '1935',
 '1936',
 '1937',
 '1938',
 '1939',
 '1940',
 '1941',
 '1942',
 '1943',
 '1944',
 '1945',
 '1946',
 '1947',
 '1948',
 '1949',
 '1950',
 '1950s',
 '1951',
 '1952',
 '1953',
 '1954',
 '1955',
 '1956',
 '1957',
 '1958',
 '1959',
 '1960',
 '1960s',
 '1961',
 '1962',
 '1963',
 '1964',
 '1965',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1970s',
 '1971',
 '1972',
 '1973',
 '1974',
 '1975',
 '1976',
 '1977',
 '1978',
 '1979',
 '1980',
 '1980s',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1990s',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',


## Q4.3 - Calculate the Profiles for the Items
Now that we have the attributes for each item (pipe-separated items present in the `filtered_preprocessed_movies`), we can calculate the item profiles for each item.
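A small sketch of what the vectorizer does with pipe-separated descriptions (toy strings, hypothetical `toy_*` names). With the default settings shown in the `TfidfVectorizer` output above, text is lowercased and tokens are runs of two or more word characters, so the pipes act as separators and a hyphenated genre such as `Sci-Fi` is split into two tokens.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_descriptions = [
    "Comedy|Romance|1995",
    "Comedy|Drama|1995",
    "Sci-Fi|timetravel|1995",
]

toy_vectorizer = TfidfVectorizer()
toy_profiles = toy_vectorizer.fit_transform(toy_descriptions)

# The learned vocabulary: one TF-IDF feature per distinct token.
toy_vocabulary = sorted(toy_vectorizer.vocabulary_)
print(toy_vocabulary)

# With the default norm='l2', every item profile has unit length.
toy_row_norms = np.sqrt(toy_profiles.multiply(toy_profiles).sum(axis=1))
print(toy_row_norms)
```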

In [115]:
def get_item_profiles(preprocessed_movies: pd.DataFrame):
    """
    Call a fit_transform on a TF-IDF Vectorizer.
    Returns the profiles for the items.
    
    Returns
    -------
    item_profiles : csr_matrix, shape: (n_items, n_tfidf_features)
    """
    # YOUR CODE HERE
    vectorizer = TfidfVectorizer()
    
    return vectorizer.fit_transform(preprocessed_movies.Description)    
    
    
    
item_profiles = get_item_profiles(filtered_preprocessed_movies)

In [116]:
expected_hash = 'a05c961fb42c746751e14ecf45ba31de109eb9bb60189057420399648931e823'
assert hashlib.sha256(str(item_profiles.shape).encode()).hexdigest() == expected_hash

## Q4.4 - User Profiles
The next step is to notice that we have a *(n_users, n_items)* ratings matrix and a *(n_items, n_tfidf_features)* item profiles matrix. If we multiply them, we get a *(n_users, n_tfidf_features)* matrix.

Did we just associate the users with the item features?
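A toy illustration of that product, with made-up ratings and two hypothetical features. Each user profile is the sum of the profiles of the items they rated, weighted by the rating.

```python
import numpy as np
from scipy.sparse import csr_matrix

toy_R = csr_matrix(np.array([
    [5.0, 0.0, 1.0],   # user 0 rated items 0 and 2
    [0.0, 4.0, 0.0],   # user 1 rated item 1 only
]))
toy_items = csr_matrix(np.array([
    [1.0, 0.0],        # item 0: only feature A
    [0.0, 1.0],        # item 1: only feature B
    [0.6, 0.8],        # item 2: a mix of both
]))

# (n_users, n_items) x (n_items, n_features) -> (n_users, n_features)
toy_user_profiles = toy_R.dot(toy_items)
print(toy_user_profiles.toarray())
```

User 0 ends up with `5 * [1, 0] + 1 * [0.6, 0.8] = [5.6, 0.8]`, a profile dominated by feature A because that is what they rated highest.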

In [120]:
ratings

<611x9724 sparse matrix of type '<class 'numpy.float64'>'
	with 100839 stored elements in Compressed Sparse Row format>

In [121]:
item_profiles

<9724x1622 sparse matrix of type '<class 'numpy.float64'>'
	with 36434 stored elements in Compressed Sparse Row format>

In [122]:
# YOUR CODE HERE
# Use the sparse matrix's own dot method rather than np.dot for sparse operands.
user_profiles = ratings.dot(item_profiles)

In [123]:
user_profiles

<611x1622 sparse matrix of type '<class 'numpy.float64'>'
	with 183737 stored elements in Compressed Sparse Row format>

In [124]:
expected_hash = '7bbb7fa14ce2f1a245bb1ce66f76292520aff0a9c043d054256c4b31706e366b'
assert hashlib.sha256(str(user_profiles.shape).encode()).hexdigest() == expected_hash

expected_hash = '228d8eceeba4d7d8764a521b0c9fb0d50d5010f74375960af5ca62203a6dcb17'
assert hashlib.sha256(str(user_profiles.nnz).encode()).hexdigest() == expected_hash

## Q4.5 - The Moment of Truth
We will do what recommender systems should do: recommend. (mind-blowing, right?!)
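One way to read the prediction step is as the cosine similarity between each user profile and each item profile. The sketch below writes this out with plain NumPy on made-up toy arrays so that the normalisation is visible; it is an illustration, not the graded solution.

```python
import numpy as np

toy_user_profiles = np.array([[5.6, 0.8], [0.0, 4.0]])
toy_items = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
toy_R = np.array([[5.0, 0.0, 1.0], [0.0, 4.0, 0.0]])

def normalize_rows(matrix):
    """Scale every row to unit length (rows are assumed to be non-zero)."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

# Cosine similarity is the dot product of unit-length vectors.
toy_preds = normalize_rows(toy_user_profiles) @ normalize_rows(toy_items).T

# Suppress items the user has already rated.
toy_preds[toy_R.nonzero()] = 0
print(toy_preds.round(4))
```

For user 1, whose profile points entirely at the second feature, the unrated item 2 scores 0.8, which is exactly its weight on that feature.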

In [125]:
def make_predictions(R, item_profiles, user_profiles):
    """
    Make predictions based on the ratings matrix, the item profiles and the user profiles we calculated previously.
    
    
    Parameters
    ----------
    R : csr_matrix. shape: (n_users, n_items)
        Matrix containing the ratings initially assigned.
        
    item_profiles : csr_matrix. shape: (n_items, n_tfidf_features)
                    Matrix containing the TF-IDF features calculated for the items.
                    
    user_profiles : csr_matrix. shape: (n_users, n_tfidf_features)
                    Matrix containing the user profiles as the product of the ratings with the item profiles.
                    
                    
    Returns
    -------
    predictions : csr_matrix. shape: (n_users, n_items)
                  Matrix with the predictions. Already rated content is suppressed to 0.
    """
    # YOUR CODE HERE
    preds = cosine_similarity(user_profiles, item_profiles)
    
    # Exclude previously rated items.
    preds[R.nonzero()] = 0
    
    return csr_matrix(preds)


pred = make_predictions(ratings, item_profiles, user_profiles)

In [126]:
assert isinstance(pred, csr_matrix)
assert pred.shape == ratings.shape # Your predictions shape should match the original ratings matrix
np.testing.assert_almost_equal(pred[-1, 200], 0.06885, decimal=4)

In [127]:
pred

<611x9724 sparse matrix of type '<class 'numpy.float64'>'
	with 5718842 stored elements in Compressed Sparse Row format>

## Q4.6 - Get Top-7 items for each user
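The ranking trick can be sketched on a toy score matrix (made-up values; `toy_top_n` is a hypothetical helper). `argsort` sorts in ascending order, so negating the scores puts the best items first.

```python
import numpy as np

toy_pred = np.array([
    [0.1, 0.9, 0.0, 0.5],
    [0.7, 0.0, 0.2, 0.3],
])

def toy_top_n(scores, n=3):
    """Return the column indices of the n highest scores in each row, best first."""
    order = np.argsort(-scores, axis=1)
    return order[:, :n]

toy_best = toy_top_n(toy_pred, n=2)
print(toy_best)
```

The result holds column indices of the predictions matrix, not movieIds, so it still has to be mapped back to the movies.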

In [None]:
def get_top_n(pred, n=3):
    """
    Calculates the sorted best n items for each user.
    
    Parameters
    ----------
    
    pred : csr_matrix, shape: (n_users, n_items)
           Matrix with predictions for the items.
           
    n : int
        Top-n items for each user. Defaults to 3.
        
        
    Returns
    -------
    
    best_preds : np.ndarray, shape: (n_users, n)
                 Sorted n-best items for each user.
    """
    # YOUR CODE HERE
    # Negate the scores so that an ascending argsort lists the best items first.
    pred_ = (-pred).toarray()
    
    return pred_.argsort()[:, :n]


best_preds = get_top_n(pred, 7)

In [None]:
expected_hash = 'c103df72e15cf510fd1acdd2fa2e71fdb7fb3ebf72441fd79d8aa1bee87169fd'
assert hashlib.sha256(str(best_preds[50, 5]).encode()).hexdigest() == expected_hash

expected_hash = 'd6420a4ee44bc345c7bf3a2efbab98e08a4727016df8e8d6bb8375d0a23a8c72'
assert hashlib.sha256(str(best_preds[-1, 6]).encode()).hexdigest() == expected_hash

# Q5 - New User's Best Predictions (non-graded)

In [None]:
f"The Top-7 items for the user are indexed as {list(best_preds[-1])}"
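These values are column indices of the predictions matrix, so a last step is needed to turn them into titles. A hedged sketch on toy data: it assumes the columns follow the sorted movieIds that have ratings (as in the filtered movies index above), and every `toy_*` name and value is made up.

```python
import numpy as np
import pandas as pd

toy_movies = pd.DataFrame(
    {"title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"]},
    index=pd.Index([1, 2, 6], name="movieId"),
)

# Sorted movieIds that have ratings: position j is column j of the matrix.
toy_column_movie_ids = np.array([1, 6])

# Best columns for one user, best first (as returned by a top-n step).
toy_best_for_user = np.array([1, 0])

# Column index -> movieId -> title.
toy_titles = toy_movies.loc[toy_column_movie_ids[toy_best_for_user], "title"].tolist()
print(toy_titles)
```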