

## Sketch
1. Load data
2. Exploratory data analysis
3. Train KNN model for item-based collaborative filtering
4. Use this trained model to make movie recommendations to myself

In [1]:
import os

# data science imports
import math
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

In [2]:
# utils import
from fuzzywuzzy import fuzz

In [3]:
# path config
data_path = 'C:/Users/Леся'
movies_filename = 'movies.csv'
ratings_filename = 'ratings.csv'

## 1. Load Data

In [4]:
df_movies = pd.read_csv(
    os.path.join(data_path, movies_filename),
    usecols=['movieId', 'title'],
    dtype={'movieId': 'int32', 'title': 'str'})

df_ratings = pd.read_csv(
    os.path.join(data_path, ratings_filename),
    usecols=['userId', 'movieId', 'rating'],
    dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})

In [5]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  27278 non-null  int32 
 1   title    27278 non-null  object
dtypes: int32(1), object(1)
memory usage: 319.8+ KB


In [6]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 3 columns):
 #   Column   Dtype  
---  ------   -----  
 0   userId   int32  
 1   movieId  int32  
 2   rating   float32
dtypes: float32(1), int32(2)
memory usage: 228.9 MB


In [7]:
df_movies.head()

,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [8]:
df_ratings.head()

,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


In [9]:
num_users = len(df_ratings.userId.unique())
num_items = len(df_ratings.movieId.unique())
print('There are {} unique users and {} unique movies in this data set'.format(num_users, num_items))

There are 138493 unique users and 26744 unique movies in this data set


## 2. Exploratory data analysis
 - Count of each rating
 - Rating frequency of each movie

#### 1. Counts of each rating

We first count how many times each rating value appears in the ratings data.

In [10]:
# get count
df_ratings_cnt_tmp = pd.DataFrame(df_ratings.groupby('rating').size(), columns=['count'])
df_ratings_cnt_tmp

rating,count
0.5,239125
1.0,680732
1.5,279252
2.0,1430997
2.5,883398
3.0,4291193
3.5,2200156
4.0,5561926
4.5,1534824
5.0,2898660


The table above does not include a count for a rating of zero, that is, the (user, movie) pairs that have no rating. We add that count to the rating-count dataframe as well.
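
To make the arithmetic explicit: every (user, movie) pair without a rating is treated as a zero, so the zero count is the total number of cells minus the number of observed ratings. The sketch below hardcodes the figures printed earlier in this notebook; the variable names are only illustrative.

```python
# figures taken from the outputs above
num_users = 138493        # unique users
num_items = 26744         # unique movies
num_ratings = 20000263    # rows in ratings.csv

# every (user, movie) cell of the full matrix
total_cells = num_users * num_items
# cells with no rating, which are counted as rating 0
zero_cells = total_cells - num_ratings
# share of the matrix that is empty
sparsity = zero_cells / total_cells

print(zero_cells)
print('share of empty cells: {:.4f}'.format(sparsity))
```

Almost the whole matrix is empty, which is why a sparse representation is used later on.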

In [11]:
# most (user, movie) pairs have no rating; count them as rating 0
total_cnt = num_users * num_items
rating_zero_cnt = total_cnt - df_ratings.shape[0]
# add the zero-rating count to the rating counts
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_ratings_cnt = pd.concat(
    [df_ratings_cnt_tmp, pd.DataFrame({'count': rating_zero_cnt}, index=[0.0])],
    verify_integrity=True,
).sort_index()
df_ratings_cnt

rating,count
0.0,3683856529
0.5,239125
1.0,680732
1.5,279252
2.0,1430997
2.5,883398
3.0,4291193
3.5,2200156
4.0,5561926
4.5,1534824
5.0,2898660


#### 2. Plot rating frequency of all movies

In [12]:
# get rating frequency
df_movies_cnt = pd.DataFrame(df_ratings.groupby('movieId').size(), columns=['count'])
df_movies_cnt.head()

movieId,count
1,49695
2,22243
3,12735
4,2756
5,12161


We set an arbitrary popularity threshold, for two reasons:
 - Avoid memory problems during model training. On a more powerful machine these steps can be skipped and the code run on the full data set.
 - Improve KNN performance: lesser-known movies have ratings from fewer viewers, which makes the pattern noisier. Dropping them can improve recommendation quality.
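
Since the threshold is arbitrary, it can help to check how much data a candidate value would keep before committing to it. The sketch below does this on a tiny made-up ratings table; `toy_ratings` and the helper `retention` are hypothetical names and not part of this notebook.

```python
import pandas as pd

# made-up ratings: movie 1 has 4 ratings, movie 2 has 2, movie 3 has 1
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 4, 1, 2, 1],
    'movieId': [1, 1, 1, 1, 2, 2, 3],
    'rating':  [4.0, 3.5, 5.0, 4.0, 2.0, 3.0, 4.5],
})

def retention(ratings, threshold):
    """Return (movies kept, ratings kept) for a given popularity threshold."""
    counts = ratings.groupby('movieId').size()
    popular = counts[counts >= threshold].index
    kept = ratings[ratings['movieId'].isin(popular)]
    return len(popular), len(kept)

for thres in (1, 2, 3):
    n_movies, n_ratings = retention(toy_ratings, thres)
    print('threshold {}: {} movies, {} ratings kept'.format(thres, n_movies, n_ratings))
```

The same check could be run on `df_ratings` with a few candidate thresholds to see the trade-off between matrix size and coverage.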

In [13]:
# filter data
popularity_thres = 50
popular_movies = list(set(df_movies_cnt.query('count >= @popularity_thres').index))
df_ratings_drop_movies = df_ratings[df_ratings.movieId.isin(popular_movies)]
print('shape of original ratings data: ', df_ratings.shape)
print('shape of ratings data after dropping unpopular movies: ', df_ratings_drop_movies.shape)

shape of original ratings data:  (20000263, 3)
shape of ratings data after dropping unpopular movies:  (19847947, 3)


In [14]:
# get number of ratings given by every user
df_users_cnt = pd.DataFrame(df_ratings_drop_movies.groupby('userId').size(), columns=['count'])
df_users_cnt.head()

userId,count
1,174
2,61
3,187
4,28
5,66


In [15]:
# filter data
ratings_thres = 80
active_users = list(set(df_users_cnt.query('count >= @ratings_thres').index))
df_ratings_drop_users = df_ratings_drop_movies[df_ratings_drop_movies.userId.isin(active_users)]
print('shape of original ratings data: ', df_ratings.shape)
print('shape of ratings data after dropping both unpopular movies and inactive users: ', df_ratings_drop_users.shape)

shape of original ratings data:  (20000263, 3)
shape of ratings data after dropping both unpopular movies and inactive users:  (16709109, 3)


## 3. Train KNN model for item-based collaborative filtering
 - Reshaping the Data
 - Fitting the Model

#### 1. Reshaping the Data
For K-Nearest Neighbors, we want the data in a (movie, user) array, where each row is a movie and each column is a user (unlike ALS). To reshape the dataframe, we pivot it to wide format with movies as rows and users as columns. We then fill the missing observations with 0s, since we are going to perform linear algebra operations (calculating distances between vectors). Finally, we convert the values of the dataframe into a scipy sparse matrix for more efficient calculations.
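
The reshaping can be seen on a small made-up example. The sketch below first builds the dense movie-user table with `pivot`, then shows an alternative that is not used in this notebook: building the sparse matrix directly from the (movie, user, rating) triples. That alternative never materializes the dense table, which may matter when memory is tight. All names here (`toy`, `dense`, `sparse`) are illustrative.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# made-up ratings in long format
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# dense route: movies as rows, users as columns, missing ratings filled with 0
dense = toy.pivot(index='movieId', columns='userId', values='rating').fillna(0)

# sparse route (alternative): map ids to consecutive row/column positions
movie_cat = toy['movieId'].astype('category')
user_cat = toy['userId'].astype('category')
sparse = csr_matrix(
    (toy['rating'].to_numpy(),
     (movie_cat.cat.codes.to_numpy(), user_cat.cat.codes.to_numpy())),
    shape=(len(movie_cat.cat.categories), len(user_cat.cat.categories)),
)

print(dense)
print(sparse.toarray())
```

Both routes order movies and users by ascending id here, so the two matrices hold the same values.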

In [16]:
# pivot and create movie-user matrix
movie_user_mat = df_ratings_drop_users.pivot(index='movieId', columns='userId', values='rating').fillna(0)
# create mapper from movie title to index
movie_to_idx = {
    movie: i for i, movie in 
    enumerate(list(df_movies.set_index('movieId').loc[movie_user_mat.index].title))
}
# transform matrix to scipy sparse matrix
movie_user_mat_sparse = csr_matrix(movie_user_mat.values)

#### 2. Fitting the Model
We initialize the NearestNeighbors class as model_knn and fit it on our sparse matrix. With metric='cosine', the model compares movie vectors by cosine distance (1 minus cosine similarity), so smaller values mean more similar movies.
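
As a small worked example of that metric, the sketch below computes cosine distance by hand with NumPy on made-up rating vectors. The helper `cosine_distance` is a hypothetical name written for illustration; it is not how scikit-learn computes the metric internally.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity of two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    similarity = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - similarity

# toy rating vectors: one entry per user, 0 means "not rated" (values are made up)
movie_a = [4.0, 5.0, 0.0]
movie_b = [2.0, 2.5, 0.0]   # rated by the same users, in the same proportions
movie_c = [0.0, 0.0, 4.5]   # no raters in common with movie_a

print(cosine_distance(movie_a, movie_b))  # close to 0: same direction
print(cosine_distance(movie_a, movie_c))  # 1.0: no overlap at all
```

Cosine distance depends only on the direction of the vectors, so two movies rated by the same users in the same proportions count as near-identical even if one received lower scores overall.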

In [17]:
# define model
# 'brute' uses a brute-force search (computing the distance between all pairs of vectors)
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
# fit
model_knn.fit(movie_user_mat_sparse)

NearestNeighbors(algorithm='brute', metric='cosine', n_jobs=-1, n_neighbors=20)

## 4. Use this trained model to make movie recommendations to myself
And we're finally ready to make some recommendations!

In [18]:
def fuzzy_matching(mapper, fav_movie, verbose=True):
    """
    return the closest match via fuzzy ratio. If no match found, return None
    
    Parameters
    ----------    
    mapper: dict, map movie title name to index of the movie in data

    fav_movie: str, name of user input movie
    
    verbose: bool, print log if True

    Return
    ------
    index of the closest match
    """
    match_tuple = []
    # get match
    for title, idx in mapper.items():
        ratio = fuzz.ratio(title.lower(), fav_movie.lower())
        if ratio >= 60:
            match_tuple.append((title, idx, ratio))
    # sort
    match_tuple = sorted(match_tuple, key=lambda x: x[2])[::-1]
    if not match_tuple:
        print('Oops! No match is found')
        return
    if verbose:
        print('Found possible matches in our database: {0}\n'.format([x[0] for x in match_tuple]))
    return match_tuple[0][1]



def make_recommendation(model_knn, data, mapper, fav_movie, n_recommendations):
    """
    return top n similar movie recommendations based on user's input movie


    Parameters
    ----------
    model_knn: sklearn model, knn model

    data: movie-user matrix

    mapper: dict, map movie title name to index of the movie in data

    fav_movie: str, name of user input movie

    n_recommendations: int, top n recommendations

    Return
    ------
    None; the top n similar movies are printed
    """
    # fit
    model_knn.fit(data)
    # get input movie index
    print('You have input movie:', fav_movie)
    idx = fuzzy_matching(mapper, fav_movie, verbose=True)
    # stop early when no title matched: data[None] is not a valid movie row
    if idx is None:
        return
    # inference
    print('Recommendation system start to make inference')
    print('......\n')
    distances, indices = model_knn.kneighbors(data[idx], n_neighbors=n_recommendations+1)
    print(distances, indices)
    # pair each neighbor index with its distance and keep the n closest, nearest first;
    # the nearest neighbor is normally the input movie itself (distance close to 0)
    raw_recommends = sorted(
        zip(indices.squeeze().tolist(), distances.squeeze().tolist()),
        key=lambda x: x[1])[:n_recommendations]
    # get reverse mapper
    reverse_mapper = {v: k for k, v in mapper.items()}
    # print recommendations
    print('Recommendations for {}:'.format(fav_movie))
    for i, (idx, dist) in enumerate(raw_recommends):
        print('{0}: {1}, with distance of {2}'.format(i+1, reverse_mapper[idx], dist))
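
The title lookup in `fuzzy_matching` can be tried without the model. The sketch below is a stand-in that uses the standard library's `difflib.SequenceMatcher` instead of `fuzz.ratio`; the two scores are similar in spirit but are not guaranteed to be identical, and `simple_ratio`, `closest_titles` and `toy_mapper` are hypothetical names.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Similarity score from 0 to 100 (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def closest_titles(mapper, query, threshold=60):
    """Return (title, index, score) tuples scoring at least threshold, best first."""
    matches = []
    for title, idx in mapper.items():
        score = simple_ratio(title.lower(), query.lower())
        if score >= threshold:
            matches.append((title, idx, score))
    return sorted(matches, key=lambda m: m[2], reverse=True)

# made-up mapper from title to row index
toy_mapper = {'Sex and the City (2008)': 0, 'Toy Story (1995)': 1}

matches = closest_titles(toy_mapper, 'Sex and the City')
print(matches)
```

An empty list here corresponds to the case where `fuzzy_matching` returns None, which is why the caller needs to handle a missing match.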

In [21]:
my_favorite = 'Sex and the City'

make_recommendation(
    model_knn=model_knn,
    data=movie_user_mat_sparse,
    fav_movie=my_favorite,
    mapper=movie_to_idx,
    n_recommendations=20)

You have input movie: Sex and the City
Found possible matches in our database: ['Sex and the City (2008)', 'Sex and the City 2 (2010)', 'Night and the City (1950)']

Recommendation system start to make inference
......

[[0.         0.62896717 0.69190425 0.6960783  0.7111819  0.71603847
  0.71903324 0.7424477  0.7566228  0.75702155 0.75815475 0.76376736
  0.7707211  0.77177024 0.77259904 0.77376723 0.7752431  0.77743477
  0.7801356  0.78268915 0.7830843 ]] [[8770 9517 8047 8809 9040 8215 8633 8958 8631 6907 8302 9096 9206 8396
  9167 7794 8902 9011 7962 8503 8755]]
Recommendations for Sex and the City:
1: Sex and the City (2008), with distance of 0.0
2: Sex and the City 2 (2010), with distance of 0.6289671659469604
3: Devil Wears Prada, The (2006), with distance of 0.6919042468070984
4: Mamma Mia! (2008), with distance of 0.6960783004760742
5: He's Just Not That Into You (2009), with distance of 0.7111818790435791
6: Holiday, The (2006), with distance of 0.7160384654998779
7: 27 Dresse

In [20]:
my_favorite = 'In Time'

make_recommendation(
    model_knn=model_knn,
    data=movie_user_mat_sparse,
    fav_movie=my_favorite,
    mapper=movie_to_idx,
    n_recommendations=20)

You have input movie: In Time
Found possible matches in our database: ['In Time (2011)']

Recommendation system start to make inference
......

[[1.6689301e-06 5.8680868e-01 6.1143446e-01 6.1153245e-01 6.1782539e-01
  6.2003028e-01 6.2975186e-01 6.4289349e-01 6.5019882e-01 6.5241933e-01
  6.5779757e-01 6.5845382e-01 6.6007352e-01 6.6476405e-01 6.6788268e-01
  6.6821015e-01 6.6894627e-01 6.7417336e-01 6.7469037e-01 6.7592168e-01
  6.7624104e-01]] [[ 9908  9713  9763 10241  9741 10092 10112  9957 10055  9899  9847 10280
   9682  9816  9726  9876  9275  9747 10050  9706  9994]]
Recommendations for In Time:
1: In Time (2011), with distance of 1.6689300537109375e-06
2: Limitless (2011), with distance of 0.5868086814880371
3: Source Code (2011), with distance of 0.6114344596862793
4: Oblivion (2013), with distance of 0.61153244972229
5: Adjustment Bureau, The (2011), with distance of 0.6178253889083862
6: Total Recall (2012), with distance of 0.6200302839279175
7: Looper (2012), with distanc