## Item Based Collaborative Filtering Recommender System

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Loading u.info -> The number of users, items, and ratings in the MovieLens dataset

In [2]:
data_info = pd.read_csv('u.info', header=None)
list(data_info[0])

['943 users', '1682 items', '100000 ratings']

Loading u.data -> A dataset comprising user id, movie id, rating and timestamp

In [3]:
column_names = ['user id','movie id','rating','timestamp']
u_data = pd.read_csv('u.data', sep='\t',header=None,names=column_names)
print(len(u_data))
u_data.head()

100000


Unnamed: 0,user id,movie id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


Loading u.item -> Dataset comprising movie id, movie title, release date, IMDb URL and 19 fields of genre (1 indicates the movie is of that genre, a 0 indicates it is not)

In [4]:
c = 'movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western'
column_names2 = c.split(' | ')
column_names2

['movie id',
 'movie title',
 'release date',
 'video release date',
 'IMDb URL',
 'unknown',
 'Action',
 'Adventure',
 'Animation',
 'Children',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western']

In [5]:
data_items = pd.read_csv('u.item', sep='|',header=None,names=column_names2,encoding='latin-1')
data_items

Unnamed: 0,movie id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,Mat' i syn (1997),06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1678,1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
1679,1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1680,1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Merging u.data and u.items

In [6]:
merged_data = pd.merge(u_data, data_items[['movie id', 'movie title']], how='left', left_on='movie id', right_on='movie id')
print(len(merged_data))
print(merged_data)

100000
       user id  movie id  rating  timestamp                   movie title
0          196       242       3  881250949                  Kolya (1996)
1          186       302       3  891717742      L.A. Confidential (1997)
2           22       377       1  878887116           Heavyweights (1994)
3          244        51       2  880606923    Legends of the Fall (1994)
4          166       346       1  886397596           Jackie Brown (1997)
...        ...       ...     ...        ...                           ...
99995      880       476       3  880175444  First Wives Club, The (1996)
99996      716       204       5  879795543     Back to the Future (1985)
99997      276      1090       1  874795795                 Sliver (1993)
99998       13       225       2  882399156         101 Dalmatians (1996)
99999       12       203       3  879959583             Unforgiven (1992)

[100000 rows x 5 columns]


There is an issue with this dataset that for the same set of user id and movie id, ratings can be different at different timestamps. Example of such duplicates are shown below:-

In [7]:
duplicates = merged_data[merged_data.duplicated(['user id', 'movie title', 'rating'], keep=False)]
duplicates

Unnamed: 0,user id,movie id,rating,timestamp,movie title
157,99,268,3,885678247,Chasing Amy (1997)
493,269,246,5,891457067,Chasing Amy (1997)
501,299,303,3,877618584,Ulee's Gold (1997)
553,230,680,4,880484286,Kull the Conqueror (1997)
776,49,1003,2,888068651,That Darn Cat! (1997)
...,...,...,...,...,...
99179,880,268,5,892958128,Chasing Amy (1997)
99292,919,297,4,875288749,Ulee's Gold (1997)
99418,655,305,4,887523909,"Ice Storm, The (1997)"
99721,451,876,4,879012431,Money Talks (1997)


Therefore a dataset is created from the existing merged dataset by grouping the unique user id and movie title combination and the ratings by a user to the same movie in different instances (timestamps) are averaged and stored in the new dataset.

In [8]:
dataset = merged_data.groupby(by=['user id','movie title'], as_index=False).agg({"rating":"mean"})
print(len(dataset))
dataset.head()

99693


Unnamed: 0,user id,movie title,rating
0,1,101 Dalmatians (1996),2.0
1,1,12 Angry Men (1957),5.0
2,1,"20,000 Leagues Under the Sea (1954)",3.0
3,1,2001: A Space Odyssey (1968),4.0
4,1,"Abyss, The (1989)",3.0


 # Training KNN model to build item-based collaborative Recommender System.
 
 Reshaping model in such a way that each user has n-dimensional rating space where n is total number of movies

We will train the KNN model in order to find the closely matching similar users to the user we give as input and we recommend the top movies which would interest the input user.

In [9]:
user_to_movie_dataset = dataset.pivot(
    index='user id',
     columns='movie title',
      values='rating').fillna(0)

user_to_movie_dataset

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,2.0,5.0,0.0,0.0,3.0,4.0,0.0,0.0,...,0.0,0.0,0.0,5.0,3.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,2.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,4.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
940,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
941,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
942,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


This user to movie dataset is converted to a scipy sparse matrix using the csr_matrix of the submodule scipy.sparse within the SciPy library. This is done as here we are dealing with a large and sparse matrix and a csr_matrix can offer significant memory and computational efficiency. Sparse matrices are a more memory-efficient representation for datasets where the majority of the elements are zero.

In a sparse matrix, only non-zero elements are stored along with their indices, while zero elements are implicitly assumed and not stored. This can lead to substantial savings in terms of memory usage, which is crucial when working with large datasets, such as user-item interaction matrices in collaborative filtering for recommendation systems.

In [10]:
from scipy.sparse import csr_matrix
user_to_movie_sparse_df = csr_matrix(user_to_movie_dataset.values)
user_to_movie_sparse_df

<943x1664 sparse matrix of type '<class 'numpy.float64'>'
	with 99693 stored elements in Compressed Sparse Row format>

Fitting K-Nearest Neighbours model to the scipy sparse matrix:

In [11]:
from sklearn.neighbors import NearestNeighbors
knn_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_model.fit(user_to_movie_sparse_df)

 Function to find top n similar users of the given input user:-

In [12]:
def get_neighbours(user, n = 5):
  knn_input = np.asarray([user_to_movie_dataset.values[user-1]]) 
  distances, indices = knn_model.kneighbors(knn_input, n_neighbors=n+1)
  distances= distances.reshape(n+1,)
  indices= indices.reshape(n+1,)
  print("Top",n,"users who are similar to the User-",user, "are: ")
  print(" ")
  for i in range(1,len(distances)):
    print(i,". User:", indices[i]+1, "separated by distance of",distances[i])
  return indices[1:] + 1, distances[1:]

In [13]:
user_id = 324
print(" Few of the movies seen by the User:")
print(list(dataset[dataset['user id'] == user_id]['movie title'])[:10])
print("")
similar_users, distance_of_similar_users = get_neighbours(user_id,5)

 Few of the movies seen by the User:
['Air Force One (1997)', 'Anastasia (1997)', 'Boogie Nights (1997)', 'Chasing Amy (1997)', 'Conspiracy Theory (1997)', 'Contact (1997)', 'Cop Land (1997)', 'Courage Under Fire (1996)', "Dante's Peak (1997)", 'Daylight (1996)']

Top 5 users who are similar to the User- 324 are: 
 
1 . User: 624 separated by distance of 0.5012734201207962
2 . User: 526 separated by distance of 0.5334790084066022
3 . User: 294 separated by distance of 0.5335733371605309
4 . User: 529 separated by distance of 0.5489347605442338
5 . User: 634 separated by distance of 0.5744718801676113


In [14]:
similar_users, distance_of_similar_users

(array([624, 526, 294, 529, 634], dtype=int64),
 array([0.50127342, 0.53347901, 0.53357334, 0.54893476, 0.57447188]))

In [15]:
weights = distance_of_similar_users/np.sum(distance_of_similar_users)
print(weights)

[0.18622706 0.1981917  0.19822674 0.20393363 0.21342087]


In [16]:
rating_of_similar_users= user_to_movie_dataset.values[similar_users]
print(rating_of_similar_users)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [2. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [17]:
all_movies = user_to_movie_dataset.columns
all_movies

Index([''Til There Was You (1997)', '1-900 (1994)', '101 Dalmatians (1996)',
       '12 Angry Men (1957)', '187 (1997)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '3 Ninjas: High Noon At Mega Mountain (1998)', '39 Steps, The (1935)',
       ...
       'Yankee Zulu (1994)', 'Year of the Horse (1997)', 'You So Crazy (1994)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Poisoner's Handbook, The (1995)',
       'Zeus and Roxanne (1997)', 'unknown',
       'Á köldum klaka (Cold Fever) (1994)'],
      dtype='object', name='movie title', length=1664)

In [18]:
weights = weights[:,np.newaxis] + np.zeros(len(all_movies))
weights=weights.reshape(5,1664)
weights

array([[0.18622706, 0.18622706, 0.18622706, ..., 0.18622706, 0.18622706,
        0.18622706],
       [0.1981917 , 0.1981917 , 0.1981917 , ..., 0.1981917 , 0.1981917 ,
        0.1981917 ],
       [0.19822674, 0.19822674, 0.19822674, ..., 0.19822674, 0.19822674,
        0.19822674],
       [0.20393363, 0.20393363, 0.20393363, ..., 0.20393363, 0.20393363,
        0.20393363],
       [0.21342087, 0.21342087, 0.21342087, ..., 0.21342087, 0.21342087,
        0.21342087]])

To built an efficient model, we take weighted ratings of movies which is used for recommendation of movies to the input user

In [19]:
weighted_ratings = weights*rating_of_similar_users
print(weighted_ratings)
mean_weighted_ratings_of_all_movies = weighted_ratings.sum(axis =0)
print(mean_weighted_ratings_of_all_movies)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.40786726 0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
[0.40786726 0.         0.         ... 0.         0.         0.        ]


The mean weighted rating matrix can be used for recommendation but these set of recommended movies might contain the movies which the user has already seen so we are required to remove such movies from recommender list

In [21]:
def movie_recommendations(n):

  last_zero_index = np.where(mean_weighted_ratings_of_all_movies == 0)[0][-1] # last index of zero rating in the weighted rating list
  sorted_indices = np.argsort(mean_weighted_ratings_of_all_movies)[::-1] # list of the sorted indices in descending order of the ratings 
  sorted_indices = sorted_indices[:list(sorted_indices).index(last_zero_index)] # extracting indices for non-zero mean ratings
  n = min(len(sorted_indices),n)
  all_movies_watched_by_user = list(dataset[dataset['user id'] == user_id]['movie title']) # all movies watched by the user
  filtered_movie_list = list(all_movies[sorted_indices]) # list of all the movies whose mean ratings are non-zero
  count = 0
  recommended_movies = []
  for i in filtered_movie_list:
    if i not in all_movies_watched_by_user:
      count+=1
      recommended_movies.append(i)
    if count == n:
      break
  if count == 0:
    print("There are no movies left which are not seen by the input users and seen by similar users. May be increasing the number of similar users who are to be considered may give a chance of suggesting an unseen good movie.")
  else:
    print(recommended_movies)

In [22]:
movie_recommendations(10)

['Amadeus (1984)', "One Flew Over the Cuckoo's Nest (1975)", 'Fargo (1996)', 'Empire Strikes Back, The (1980)', 'Four Weddings and a Funeral (1994)', 'Back to the Future (1985)', 'Raiders of the Lost Ark (1981)', 'Return of the Jedi (1983)', 'Three Colors: Blue (1993)', 'Citizen Kane (1941)']
