# Contents

[Imports](#Imports)<br>
[Model with raw data](#Model_raw)<br>
[Model with pre-processed data](#Model_pre)<br>
[Model_pre: user-based](#Model_pre_user-based)<br>
[Model_pre: item-based](#Model_pre_item-based)

## General concept
Central Ideas: 
- user-based filtering: a user is likely to have the same preferences as a user with the same rating behaviour as them. <br>
- item-based filtering: a user is likely to enjoy movies similar to the ones he/she rated highly. Two films which have received the same or similar ratings from users are likely to be similar. 

- One idea was to compare the recommendation results based on the preprocessed data (~ 5M rows) with results from raw data (df_ratings ~25M rows) <br>
 The raw data is too big to be handled, e.g. by the cosine_similarity function => idea was discarded

<div class="alert alert-block alert-info"><b>Info:</b> the sections "Model with raw data" and "Model with raw data" do not work currently (performance issues due to big size). 
The code is kept for possible further investigation.</div>

# Imports

In [1]:
import math
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix, vstack, hstack
import sklearn.metrics.pairwise as dist
import sys

In [2]:
df_pre = pd.read_csv('../data/processed/preprocessed_data_movielens.csv')
df_pre.head()

Unnamed: 0,movieId,title,genres,relevance,tag,userId,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[0.99925, 0.99875, 0.99575, 0.98575, 0.98425, ...","['toys', 'computer animation', 'pixar animatio...",74244,4.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[0.99925, 0.99875, 0.99575, 0.98575, 0.98425, ...","['toys', 'computer animation', 'pixar animatio...",54322,4.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[0.99925, 0.99875, 0.99575, 0.98575, 0.98425, ...","['toys', 'computer animation', 'pixar animatio...",106130,4.5
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[0.99925, 0.99875, 0.99575, 0.98575, 0.98425, ...","['toys', 'computer animation', 'pixar animatio...",43484,3.5
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[0.99925, 0.99875, 0.99575, 0.98575, 0.98425, ...","['toys', 'computer animation', 'pixar animatio...",16874,4.0


In [None]:
## export movies for own rating

# id = df_pre.movieId.unique()
# df = pd.DataFrame(id, columns=['movieId'])
# df_mov = df.merge(right=df_pre[['movieId','title']], on='movieId', how='inner')
# df_mov.drop_duplicates(inplace=True)
# df_mov['year'] = None
# df_mov['year'].loc[df_mov.title.str.slice(-1) == ')'] = df_mov['title'].str.slice(-5,-1)
# df_mov['year'].loc[df_mov.title.str.slice(-2) == ') '] = df_mov['title'].str.slice(-6,-2)
# # df_mov
# df_mov.to_csv('movie_ratings_limited.csv',sep=';')


In [3]:
df_raw = pd.read_csv('../data/raw/ml-25m/ratings.csv')
df_raw.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [None]:
print('df_pre:')
df_pre.info()
display(df_pre.describe())
print()
print('df_raw:')
df_raw.info()
display(df_raw.describe())

# => df_raw has about 5 times as many entries as df_pre, the mean rating in df_raw is 0.06 higher, the std is almost the same

# Model_raw
Does not work currently

## Preparation: creation of rating matrix

In [3]:
df_raw.duplicated(subset=['userId','movieId']).sum()
# => no user rated a movie twice

0

In [4]:
# drop unnecessary columns
df_raw.drop('timestamp', axis=1, inplace=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 3 columns):
 #   Column   Dtype  
---  ------   -----  
 0   userId   int64  
 1   movieId  int64  
 2   rating   float64
dtypes: float64(1), int64(2)
memory usage: 572.2 MB


In [38]:
# Splitting df_raw in smaller blocks to avoid index limit in creation of sparse matrix mat_ratings
raw = []
raw.append(df_raw.loc[df_raw.userId <= 25000])
raw.append(df_raw.loc[(df_raw.userId > 25000) & (df_raw.userId <= 50000)])
raw.append(df_raw.loc[(df_raw.userId > 50000) & (df_raw.userId <= 75000)])
raw.append(df_raw.loc[(df_raw.userId > 75000) & (df_raw.userId <= 100000)])
raw.append(df_raw.loc[(df_raw.userId > 100000) & (df_raw.userId <= 125000)])
raw.append(df_raw.loc[(df_raw.userId > 125000) & (df_raw.userId <= 150000)])
raw.append(df_raw.loc[(df_raw.userId > 150000) & (df_raw.userId <= 175000)])


<div class="alert alert-block alert-info"><b>Info:</b> the following cell takes about 6 minutes to execute.</div>

In [48]:
# initiate empty DataFrame, which will be iteratively filled and extended
mat_ratings = pd.DataFrame()

for i in range(len(raw)):
    # expand mat_ratings by concatenation and reduce size of new part by using .astype("Sparse[float]")
    mat_ratings = pd.concat([mat_ratings,raw[i].pivot_table(index='userId', columns='movieId', values='rating').astype("Sparse[float]")], sort=True)
    #print(f'mat_ratings {i} sparse:',mat_ratings.info())


<class 'pandas.core.frame.DataFrame'>
Index: 25000 entries, 1 to 25000
Columns: 34655 entries, 1 to 209163
dtypes: Sparse[float64, nan](34655)
memory usage: 43.6 MB
mat_ratings 0 sparse: None
<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, 1 to 50000
Columns: 44073 entries, 1 to 209163
dtypes: Sparse[float64, nan](44073)
memory usage: 88.6 MB
mat_ratings 1 sparse: None
<class 'pandas.core.frame.DataFrame'>
Index: 75000 entries, 1 to 75000
Columns: 52776 entries, 1 to 209163
dtypes: Sparse[float64, nan](52776)
memory usage: 132.9 MB
mat_ratings 2 sparse: None
<class 'pandas.core.frame.DataFrame'>
Index: 100000 entries, 1 to 100000
Columns: 55592 entries, 1 to 209163
dtypes: Sparse[float64, nan](55592)
memory usage: 177.4 MB
mat_ratings 3 sparse: None
<class 'pandas.core.frame.DataFrame'>
Index: 125000 entries, 1 to 125000
Columns: 57309 entries, 1 to 209171
dtypes: Sparse[float64, nan](57309)
memory usage: 221.4 MB
mat_ratings 4 sparse: None
<class 'pandas.core.frame.DataFra

In [52]:
# replacie NaNs with 0
mat_ratings.fillna(0, inplace=True)

# Extract user IDs and book titles from the ratings matrix.
user_ids = mat_ratings.index.tolist()
movie_ids = mat_ratings.columns.tolist()


## Model_raw: user-based

In [None]:
l = 1000
# Calculate the cosine similarity between users.
user_similarity = dist.cosine_similarity(mat_ratings.ilc[:l]) 


In [None]:

# Creation of a pandas DataFrame from the similarity matrix between users.
# The indexes and columns of the DataFrame are the user identifiers.
user_similarity = pd.DataFrame(user_similarity, index=user_ids[:l], columns=user_ids[:l])
user_similarity.max().max()

## Model_raw: item-based

<div class="alert alert-block alert-info"><b>Info:</b> the following cell takes about 12 minutes to execute.</div>

In [65]:
# mat_ratings is too big (row and col-wise) for simple transpose (.T)
# thus we separate it in small chunks, transpose those and concatenate in the end

# define number of chunks
n = 5
size = math.ceil(len(mat_ratings)/n)
trans_mat_ratings = pd.DataFrame()

for i in range(n):
    # concat trans_mat_ratings with new part along axis=1
    chunk = mat_ratings.iloc[i*size:(i+1)*size]
    trans_mat_ratings = pd.concat([trans_mat_ratings, chunk.T.astype("Sparse[float]")], axis=1, sort=True)
    #print(trans_mat_ratings.info())

# replacie NaNs with 0
trans_mat_ratings.fillna(0, inplace=True)

<class 'pandas.core.frame.DataFrame'>
Index: 59047 entries, 1 to 209171
Columns: 32509 entries, 1 to 32509
dtypes: Sparse[float64, nan](32509)
memory usage: 59.6 MB
None
<class 'pandas.core.frame.DataFrame'>
Index: 59047 entries, 1 to 209171
Columns: 65018 entries, 1 to 65018
dtypes: Sparse[float64, nan](65018)
memory usage: 117.1 MB
None
<class 'pandas.core.frame.DataFrame'>
Index: 59047 entries, 1 to 209171
Columns: 97527 entries, 1 to 97527
dtypes: Sparse[float64, nan](97527)
memory usage: 174.8 MB
None
<class 'pandas.core.frame.DataFrame'>
Index: 59047 entries, 1 to 209171
Columns: 130036 entries, 1 to 130036
dtypes: Sparse[float64, nan](130036)
memory usage: 231.3 MB
None
<class 'pandas.core.frame.DataFrame'>
Index: 59047 entries, 1 to 209171
Columns: 162541 entries, 1 to 162541
dtypes: Sparse[float64, nan](162541)
memory usage: 288.6 MB
None


In [70]:
trans_mat_ratings.fillna(0, inplace=True)
trans_mat_ratings

userId,1,2,3,4,5,6,7,8,9,10,...,162532,162533,162534,162535,162536,162537,162538,162539,162540,162541
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0,3.5,4.0,3.0,4.0,0,0,4.0,0,3.5,...,0,4.5,4.0,0,0,0,2.0,0,0,0
2,0,0,0,0,0,0,0,0,5.0,0,...,0,4.0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,4.0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209157,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
209159,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
209163,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
209169,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [92]:
# create similarity matrix for items from mat_ratings
l = 10
item_similarity = dist.cosine_similarity(trans_mat_ratings.iloc[:l])#.astype('Sparse[float]')
item_similarity_sparse = dist.cosine_similarity(trans_mat_ratings.iloc[:l], dense_output=False)

In [93]:
item_similarity

array([[1.        , 0.39902562, 0.27891837, 0.10370932, 0.2782566 ,
        0.35963833, 0.28099063, 0.09700156, 0.1547813 , 0.35893208],
       [0.39902562, 1.        , 0.22095601, 0.14585652, 0.24466817,
        0.2705228 , 0.21467052, 0.17202942, 0.1351864 , 0.40213427],
       [0.27891837, 0.22095601, 1.        , 0.15860121, 0.44053243,
        0.26427931, 0.37945037, 0.13349758, 0.25095663, 0.20055471],
       [0.10370932, 0.14585652, 0.15860121, 1.        , 0.18025054,
        0.10699686, 0.17611951, 0.12591843, 0.10124978, 0.11769976],
       [0.2782566 , 0.24466817, 0.44053243, 0.18025054, 1.        ,
        0.23105026, 0.39470768, 0.15137207, 0.23053015, 0.20105056],
       [0.35963833, 0.2705228 , 0.26427931, 0.10699686, 0.23105026,
        1.        , 0.24884271, 0.05923356, 0.23136461, 0.36749906],
       [0.28099063, 0.21467052, 0.37945037, 0.17611951, 0.39470768,
        0.24884271, 1.        , 0.11278525, 0.19754212, 0.206819  ],
       [0.09700156, 0.17202942, 0.1334975

In [90]:
print(sys.getsizeof(item_similarity))
print(sys.getsizeof(item_similarity_sparse))

80128
80128


In [86]:

item_similarity = pd.DataFrame(item_similarity, index=movie_ids[:l], columns=movie_ids[:l])
item_similarity_sp = pd.DataFrame(item_similarity, index=movie_ids[:l], columns=movie_ids[:l]).astype('Sparse[float]')

In [88]:
print(sys.getsizeof(item_similarity))#.info())
print(sys.getsizeof(item_similarity_sp))#.info())
#item_similarity

80832
120832


Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,92,93,94,95,96,97,98,99,100,101
1,1.000000,0.399026,0.278918,0.103709,0.278257,0.359638,0.280991,0.097002,0.154781,0.358932,...,0.119817,0.098157,0.174297,0.352931,0.058642,0.108397,0.022288,0.057649,0.150874,0.177440
2,0.399026,1.000000,0.220956,0.145857,0.244668,0.270523,0.214671,0.172029,0.135186,0.402134,...,0.095608,0.168336,0.095529,0.327256,0.034759,0.078847,0.022734,0.029698,0.095387,0.111472
3,0.278918,0.220956,1.000000,0.158601,0.440532,0.264279,0.379450,0.133498,0.250957,0.200555,...,0.175733,0.160829,0.178315,0.370088,0.045601,0.029453,0.015565,0.070636,0.237225,0.096131
4,0.103709,0.145857,0.158601,1.000000,0.180251,0.106997,0.176120,0.125918,0.101250,0.117700,...,0.098647,0.192280,0.115934,0.132438,0.043007,0.019857,0.016502,0.056206,0.110779,0.052178
5,0.278257,0.244668,0.440532,0.180251,1.000000,0.231050,0.394708,0.151372,0.230530,0.201051,...,0.162483,0.141494,0.177312,0.343979,0.047633,0.029352,0.015610,0.063657,0.207903,0.077471
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,0.108397,0.078847,0.029453,0.019857,0.029352,0.142583,0.039662,0.010536,0.034432,0.079610,...,0.039764,0.025477,0.059929,0.060223,0.062263,1.000000,0.041872,0.057542,0.052473,0.126326
98,0.022288,0.022734,0.015565,0.016502,0.015610,0.039142,0.020362,0.011039,0.017063,0.027303,...,0.031425,0.033538,0.037613,0.030211,0.025320,0.041872,1.000000,0.022874,0.025046,0.047298
99,0.057649,0.029698,0.070636,0.056206,0.063657,0.073737,0.072126,0.046678,0.073464,0.036498,...,0.104141,0.047316,0.108366,0.071364,0.050319,0.057542,0.022874,1.000000,0.097878,0.073159
100,0.150874,0.095387,0.237225,0.110779,0.207903,0.268265,0.242368,0.058802,0.226863,0.124724,...,0.224471,0.094195,0.218074,0.268660,0.074410,0.052473,0.025046,0.097878,1.000000,0.093728


In [29]:
import sys
sys.getsizeof(sparse_ratings)
print(sparse_ratings[0])

  (0, 292)	5.0
  (0, 302)	3.5
  (0, 303)	5.0
  (0, 654)	5.0
  (0, 877)	3.5
  (0, 1060)	4.0
  (0, 1146)	3.5
  (0, 1185)	3.5
  (0, 1204)	5.0
  (0, 1216)	4.0
  (0, 1226)	3.5
  (0, 1590)	4.0
  (0, 1918)	2.5
  (0, 1919)	2.5
  (0, 1975)	2.5
  (0, 2067)	3.5
  (0, 2253)	4.5
  (0, 2474)	4.0
  (0, 2532)	5.0
  (0, 2591)	5.0
  (0, 2742)	4.5
  (0, 3335)	4.0
  (0, 3452)	5.0
  (0, 3821)	5.0
  (0, 4014)	5.0
  :	:
  (24999, 2094)	4.0
  (24999, 2271)	2.0
  (24999, 2330)	3.0
  (24999, 2335)	2.0
  (24999, 2442)	4.0
  (24999, 2473)	4.0
  (24999, 2593)	4.0
  (24999, 2598)	4.0
  (24999, 2605)	5.0
  (24999, 2609)	2.0
  (24999, 2617)	3.0
  (24999, 2658)	4.0
  (24999, 2669)	3.0
  (24999, 2670)	4.0
  (24999, 2726)	5.0
  (24999, 2757)	5.0
  (24999, 2780)	4.0
  (24999, 2807)	4.0
  (24999, 2858)	5.0
  (24999, 2875)	3.0
  (24999, 2896)	4.0
  (24999, 2904)	4.0
  (24999, 3132)	4.0
  (24999, 3177)	4.0
  (24999, 3190)	5.0


In [10]:
mat_ratings[0].info()

<class 'pandas.core.frame.DataFrame'>
Index: 25000 entries, 1 to 25000
Columns: 34655 entries, 1 to 209163
dtypes: float64(34655)
memory usage: 6.5 GB


In [32]:
n_users = len(df_raw.userId.unique())
n_movies = len(df_raw.movieId.unique())
print('Number of users:', n_users, 'Number of movies:', n_movies)

mat_ratings = df_raw.pivot_table(index='userId', columns='movieId', values='rating') # values='rating' seems to be optional
mat_ratings.head(10)

Number of users: 162541 Number of movies: 59047


  num_cells = num_rows * num_columns


IndexError: index 1007637055 is out of bounds for axis 0 with size 1007623835

# Model_pre

In [3]:
n_users = len(df_pre.userId.unique())
n_movies = len(df_pre.movieId.unique())
print('Number of users:', n_users, 'Number of movies:', n_movies)

mat_ratings = df_pre.pivot_table(index='userId', columns='movieId', values='rating') # values='rating' seems to be optional
display(mat_ratings.head())

print('Number of non-NaN cells:', mat_ratings.count().sum())
print('Percentage of non-NaN cells:', np.round(100*mat_ratings.count().sum()/(mat_ratings.shape[0]*mat_ratings.shape[1]),2),'%')

Number of users: 161393 Number of movies: 2428


movieId,1,2,3,4,5,6,7,9,10,11,...,177765,179819,180031,182715,183897,185029,187541,187593,192803,195159
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,3.5,,,,,,4.5,,
4,,,,,,,,,,,...,,2.5,,4.5,,,,,,5.0
5,,,,,,,,,,,...,,,,,,,,,,


Number of non-NaN cells: 5273559
Percentage of non-NaN cells: 1.35 %


In [4]:
mat_ratings.min().min()
# => 0.5 is the minimum value, we can replace NaNs with 0 without loosing information (after scaling)

0.5

In [5]:
# Scaling of user ratings to make them comparable: subtract user's mean rating fromm all ratings given
# => all ratings center around 0 (mean=0)

user_mean = mat_ratings.mean(axis=1)

# creation of new DataFrame mat_ratings_scaled
mat_ratings_scaled = pd.DataFrame()
for col in mat_ratings.columns:
    mat_ratings_scaled[col] = mat_ratings[col] - user_mean

# replacing NaNs with zero, which, after scaling, represents the mean
mat_ratings_scaled.fillna(0, inplace=True)

  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat_ratings[col] - user_mean
  mat_ratings_scaled[col] = mat

In [7]:
# transformation in Compressed Sparse Row (CSR) format for reducing memory usage
sparse_ratings = csr_matrix(mat_ratings_scaled)

# Extract user IDs and book titles from the ratings matrix.
user_ids = mat_ratings_scaled.index.tolist()  
movie_ids = mat_ratings_scaled.columns.tolist()  

print(sparse_ratings)
# (0,166) -0.46 (formerly 3.5) means in row 0 (equal to usreId 1) and column 166 there ist a rating of -0.46 (formerly 3.5)

  (0, 166)	-0.4615384615384617
  (0, 521)	-0.4615384615384617
  (0, 551)	0.038461538461538325
  (0, 561)	-0.4615384615384617
  (0, 1449)	1.0384615384615383
  (0, 1541)	0.5384615384615383
  (0, 1566)	0.038461538461538325
  (0, 1627)	0.038461538461538325
  (0, 1648)	1.0384615384615383
  (0, 1758)	1.0384615384615383
  (0, 1774)	-0.4615384615384617
  (0, 1808)	-0.9615384615384617
  (0, 1834)	-0.9615384615384617
  (1, 45)	-2.8255813953488373
  (1, 180)	1.6744186046511627
  (1, 291)	-1.3255813953488373
  (1, 320)	-0.32558139534883734
  (1, 414)	1.1744186046511627
  (1, 576)	1.1744186046511627
  (1, 584)	0.6744186046511627
  (1, 622)	1.6744186046511627
  (1, 716)	1.1744186046511627
  (1, 768)	0.6744186046511627
  (1, 859)	1.1744186046511627
  (1, 890)	1.1744186046511627
  :	:
  (161392, 882)	-0.7976190476190474
  (161392, 902)	1.7023809523809526
  (161392, 930)	-0.7976190476190474
  (161392, 1107)	1.2023809523809526
  (161392, 1118)	0.20238095238095255
  (161392, 1145)	-1.7976190476190474
  (

## Model_pre_user-based
Does not work currently due to memory needed.

In [9]:
print(sparse_ratings.get_shape())
sparse1 = sparse_ratings[:1000,:]
sparse1

(161393, 2428)


<1000x2428 sparse matrix of type '<class 'numpy.float64'>'
	with 30179 stored elements in Compressed Sparse Row format>

In [10]:
# Calculate the cosine similarity between users.
user_similarity = dist.cosine_similarity(sparse1) 

# Creation of a pandas DataFrame from the similarity matrix between users.
# The indexes and columns of the DataFrame are the user identifiers.
user_similarity = pd.DataFrame(user_similarity, index=user_ids[:1000], columns=user_ids[:1000])
user_similarity.max().max()

1.0000000000000042

In [None]:
print((mat_ratings.loc[3]==0))
print(sparse_ratings.getrow(2).toarray()[0]==0)
print(mat_ratings.loc[3][mat_ratings.loc[3]==0])
print()
print(sparse_ratings.getrow(2).toarray()[0][sparse_ratings.getrow(2).toarray()[0]==0])

In [None]:
user_id = 4
k = 3
print(user_similarity.loc[user_id].sort_values(ascending=False)[1:k+1])
block = user_similarity.loc[user_id]<0.99
print(user_similarity.loc[user_id][block].sort_values(ascending=False)[:k])
print(user_similarity.loc[user_id][1])
print(user_similarity.loc[user_id][4])

In [None]:
mindex = np.array(range(len(movie_ids)))
mids = np.array(movie_ids)
df = pd.Series(data=mindex, index=mids)
display(df)
zero_cells = sparse_ratings.getrow(2).toarray()[0]==0
df[zero_cells]

In [None]:
i,j = 187593,3
movie_index = movie_ids.to_frame()
print(movie_index)

In [11]:
mat_ratings

movieId,1,2,3,4,5,6,7,9,10,11,...,177765,179819,180031,182715,183897,185029,187541,187593,192803,195159
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.5,0.0,0.0,0.0,0.0,0.0,4.5,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.5,0.0,4.5,0.0,0.0,0.0,0.0,0.0,5.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162537,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162538,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162539,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162540,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
user_id=3
k=3
i = 1
similar_users = user_similarity.loc[user_id][user_similarity.loc[user_id]<0.99].sort_values(ascending=False)[:k]
print('similar_users:', similar_users)
similar_users.index
sparse_ratings.getcol(i).toarray()[similar_users.index]
#mat_ratings[i]#.loc[similar_users.index]

similar_users: 997    0.263094
709    0.237195
606    0.225071
Name: 3, dtype: float64


array([0.])

In [None]:
# Define a function to predict based on user_similarity,, for a given user, the ratings for all movies they have not rated.
def pred_user(sp_ratings, user_similarity, user_ids, k, user_id):
    user_index = user_ids.index(user_id)

    # Select in mat_ratings the books that have not yet been read by the user
    to_predict = mat_ratings.loc[user_id][mat_ratings.loc[user_id]==0]

    zero_cells = sp_ratings.getrow(user_index).toarray()[0]==0
    #to_predict = sp_ratings.getrow(user_index).toarray()[0][zero_cells]
    index_to_predict = movie_ids.index(zero_cells)
    # mat_ratings.loc[user_id] returns row/column for user with id user_id
    # [mat_ratings.loc[user_id]==0] selects only entries with 0

    # Select the k most similar users excluding the user itself and users with identical ratings (since they add no value)
    exclude_identical = user_similarity.loc[user_id]<0.99
    similar_users = user_similarity.loc[user_id][exclude_identical].sort_values(ascending=False)[:k]
    # user_similarity.loc[user_id] returns row/column for user with id user_id => has similarity to all other users
    print('similar_users:', similar_users)
    # Calculation of the denominator (=Nenner)
    norm = np.sum(np.abs(similar_users))
    print('norm:', norm)

    #print('cool', to_predict)
    #print('index', to_predict.index)

    for i in index_to_predict:
        # if i != 109487:
        #     continue
        
        # Retrieve similar user ratings associated with the movie i
        ratings = mat_ratings[i].loc[similar_users.index]
        ratings = sp_ratings.getcol(i).toarray()[similar_users.index][0]
        print('ratings:', ratings)
        #TODO: if a user did not rate the movie yet, use global avg (3) - but only if at least one user did rate it
        #TODO: maybe use better avg rating instead of 3 (eg. the avg rating of the movie)
        #TODO: replace mat_ratings with sp_matrix
        

        # Calculate the dot product between ratings and similar_users
        scalar_prod = np.dot(ratings, similar_users)
        print('scalar_prod:', scalar_prod)

        # Calculate predicted rating for movie i
        pred = scalar_prod / norm
        print('pred:', pred)
        # Replace with prediction
        to_predict[i] = pred

    # for i in to_predict.index:

    #     # Retrieve similar user ratings associated with the movie i
    #     ratings = mat_ratings[i].loc[similar_users.index]
        
    #     # Calculate the dot product between ratings and similar_users
    #     scalar_prod = np.dot(ratings, similar_users)
        
    #     # Calculate predicted rating for movie i
    #     pred = scalar_prod / norm

    #     # Replace with prediction
    #     to_predict[i] = pred

    return to_predict

pred_user(sparse_ratings, user_similarity, user_ids, k=3, user_id=4).sort_values()

similar_users: 195    0.218377
647    0.192369
740    0.178808
Name: 4, dtype: float64
norm: 0.5895540636953082
ratings: 195    5.0
647    0.0
740    0.0
Name: 109487, dtype: float64
scalar_prod: 1.0918871168291546
pred: 1.8520559590162726


movieId
1         0.000000
5283      0.000000
5291      0.000000
5293      0.000000
5294      0.000000
            ...   
1944      0.000000
1945      0.000000
1947      0.000000
192803    0.000000
109487    1.852056
Name: 4, Length: 2363, dtype: float64

In [158]:
# Define a function to predict based on user_similarity,, for a given user, the ratings for all movies they have not rated.
def pred_user(sp_ratings, user_similarity, user_ids, k, user_id):
    user_index = user_ids.index(user_id)

    # Select in mat_ratings the books that have not yet been read by the user
    to_predict = mat_ratings.loc[user_id][mat_ratings.loc[user_id]==0]

    ###zero_cells = sp_ratings.getrow(user_index).toarray()[0]==0
    #to_predict = sp_ratings.getrow(user_index).toarray()[0][zero_cells]
    ###index_to_predict = movie_ids.index(zero_cells)
    # mat_ratings.loc[user_id] returns row/column for user with id user_id
    # [mat_ratings.loc[user_id]==0] selects only entries with 0

    # Select the k most similar users excluding the user itself and users with identical ratings (since they add no value)
    exclude_identical = user_similarity.loc[user_id]<0.99
    similar_users = user_similarity.loc[user_id][exclude_identical].sort_values(ascending=False)[:k]
    # user_similarity.loc[user_id] returns row/column for user with id user_id => has similarity to all other users
    print('similar_users:', similar_users)
    # Calculation of the denominator (=Nenner)
    norm = np.sum(np.abs(similar_users))
    print('norm:', norm)

    #print('cool', to_predict)
    #print('index', to_predict.index)

    for i in to_predict.index:
        # if i != 109487:
        #     continue
        
        # Retrieve similar user ratings associated with the movie i
        ratings = mat_ratings[i].loc[similar_users.index]
        print('ratings:', ratings)
        #TODO: if a user did not rate the movie yet, use global avg (3) - but only if at least one user did rate it
        #TODO: maybe use better avg rating instead of 3 (eg. the avg rating of the movie)
        #TODO: replace mat_ratings with sp_matrix
        

        # Calculate the dot product between ratings and similar_users
        scalar_prod = np.dot(ratings, similar_users)
        print('scalar_prod:', scalar_prod)

        # Calculate predicted rating for movie i
        pred = scalar_prod / norm
        print('pred:', pred)
        # Replace with prediction
        to_predict[i] = pred

    # for i in to_predict.index:

    #     # Retrieve similar user ratings associated with the movie i
    #     ratings = mat_ratings[i].loc[similar_users.index]
        
    #     # Calculate the dot product between ratings and similar_users
    #     scalar_prod = np.dot(ratings, similar_users)
        
    #     # Calculate predicted rating for movie i
    #     pred = scalar_prod / norm

    #     # Replace with prediction
    #     to_predict[i] = pred

    return to_predict

pred_user(sparse_ratings, user_similarity, user_ids, k=3, user_id=4).sort_values()

similar_users: 195    0.218377
647    0.192369
740    0.178808
Name: 4, dtype: float64
norm: 0.5895540636953082
ratings: 195    5.0
647    0.0
740    0.0
Name: 109487, dtype: float64
scalar_prod: 1.0918871168291546
pred: 1.8520559590162726


movieId
1         0.000000
5283      0.000000
5291      0.000000
5293      0.000000
5294      0.000000
            ...   
1944      0.000000
1945      0.000000
1947      0.000000
192803    0.000000
109487    1.852056
Name: 4, Length: 2363, dtype: float64

In [155]:
# Example: top 10 predictions for userId 1
preds = pred_user(mat_ratings, user_similarity, 3, user_id = 1).sort_values(ascending=False).head(10)

# Series has userId as name; rename for better display:
preds.name = 'predicted rating for user ' + str(preds.name)

df_pred = preds.to_frame().reset_index().rename(columns={'title':'Title'})
# df to link movieId to title 
df_mov = df_pre[['movieId','title']].rename(columns={'title':'Title'}).drop_duplicates()
df_pred = df_pred.merge(right=df_mov, on='movieId', how='left')
#df_pred = df_pred[['Title',1]]
df_pred

TypeError: pred_user() missing 1 required positional argument: 'k'

## Model_pre_item-based

In [8]:
item_similarity = dist.cosine_similarity(sparse_ratings.T)
item_similarity = pd.DataFrame(item_similarity, index=movie_ids, columns=movie_ids)

In [9]:
# Define a function to predict based on item_similarity, for a given user, the ratings for all movies they have not rated.
def pred_item(mat_ratings, item_similarity, k, user_id):

    # Select in mat_ratings the books that have not yet been read by the user
    to_predict = mat_ratings.loc[user_id][mat_ratings.loc[user_id]==0]
    
    # Iterate over all these books
    for i in to_predict.index:

        #Find the k most similar books excluding the book itself
        similar_items = item_similarity.loc[i].sort_values(ascending=False)[1:k+1]

        # Calculation of the norm of the similar_items vector
        norm = np.sum(np.abs(similar_items))

        # Retrieve the ratings given by the user to the k nearest neighbors
        ratings = mat_ratings[similar_items.index].loc[user_id]


        # Calculate the dot product between ratings and similar_items
        scalar_prod = np.dot(ratings, similar_items)
        
        #Calculate predicted rating for movie i
        pred = scalar_prod / norm

        # Replace with prediction
        to_predict[i] = pred


    return to_predict

In [10]:
u = 1
# Example: top 10 predictions for userId u
preds = pred_item(mat_ratings_scaled, item_similarity, 3, user_id = u).sort_values(ascending=False).head(10)

# Series has userId as name; rename for better display:
preds.name = 'predicted rating for user ' + str(preds.name)

df_pred = preds.to_frame().reset_index().rename(columns={'index':'movieId'})

# df to link movieId to title 
df_mov = df_pre[['movieId','title']].rename(columns={'title':'Title'}).drop_duplicates()
df_pred = df_pred.merge(right=df_mov, on='movieId', how='left')
# add users mean rating to prediction to obtain original scale
df_pred[f'predicted rating for user {u}'] = df_pred[f'predicted rating for user {u}'] + user_mean.loc[u]
df_pred

Unnamed: 0,movieId,predicted rating for user 1,Title
0,123,4.475785,Chungking Express (Chung Hing sam lam) (1994)
1,7981,4.314406,Infernal Affairs (Mou gaan dou) (2002)
2,2858,4.307267,American Beauty (1999)
3,1237,3.979008,"Seventh Seal, The (Sjunde inseglet, Det) (1957)"
4,44694,3.977549,Volver (2006)
5,3083,3.976201,All About My Mother (Todo sobre mi madre) (1999)
6,1251,3.975511,8 1/2 (8½) (1963)
7,3089,3.975464,Bicycle Thieves (a.k.a. The Bicycle Thief) (a....
8,2313,3.974869,"Elephant Man, The (1980)"
9,4914,3.97477,Breathless (À bout de souffle) (1960)


### Top users (1000+ ratings)

In [11]:
u = 72315
# Example: top 10 predictions for userId u
preds = pred_item(mat_ratings_scaled, item_similarity, 3, user_id = u).sort_values(ascending=False).head(10)

# Series has userId as name; rename for better display:
preds.name = 'predicted rating for user ' + str(preds.name)

df_pred = preds.to_frame().reset_index().rename(columns={'index':'movieId'})

# df to link movieId to title 
df_mov = df_pre[['movieId','title']].rename(columns={'title':'Title'}).drop_duplicates()
df_pred = df_pred.merge(right=df_mov, on='movieId', how='left')
# add users mean rating to prediction to obtain original scale
df_pred[f'predicted rating for user {u}'] = df_pred[f'predicted rating for user {u}'] + user_mean.loc[u]
df_pred

Unnamed: 0,movieId,predicted rating for user 72315,Title
0,1104,4.650524,"Streetcar Named Desire, A (1951)"
1,2303,4.511378,Nashville (1975)
2,4008,4.5,Born on the Fourth of July (1989)
3,1295,4.497002,"Unbearable Lightness of Being, The (1988)"
4,1952,4.491257,Midnight Cowboy (1969)
5,5013,4.346674,Gosford Park (2001)
6,608,4.339833,Fargo (1996)
7,2160,4.337089,Rosemary's Baby (1968)
8,2020,4.336568,Dangerous Liaisons (1988)
9,4326,4.336109,Mississippi Burning (1988)


In [12]:
u = 80974
# Example: top 10 predictions for userId u
preds = pred_item(mat_ratings_scaled, item_similarity, 3, user_id = u).sort_values(ascending=False).head(10)

# Series has userId as name; rename for better display:
preds.name = 'predicted rating for user ' + str(preds.name)

df_pred = preds.to_frame().reset_index().rename(columns={'index':'movieId'})

# df to link movieId to title 
df_mov = df_pre[['movieId','title']].rename(columns={'title':'Title'}).drop_duplicates()
df_pred = df_pred.merge(right=df_mov, on='movieId', how='left')
# add users mean rating to prediction to obtain original scale
df_pred[f'predicted rating for user {u}'] = df_pred[f'predicted rating for user {u}'] + user_mean.loc[u]
df_pred

Unnamed: 0,movieId,predicted rating for user 80974,Title
0,3095,4.67748,"Grapes of Wrath, The (1940)"
1,2398,4.432207,Miracle on 34th Street (1947)
2,5995,4.340409,"Pianist, The (2002)"
3,934,4.331132,Father of the Bride (1950)
4,6016,4.180702,City of God (Cidade de Deus) (2002)
5,1204,4.167497,Lawrence of Arabia (1962)
6,919,4.163476,"Wizard of Oz, The (1939)"
7,48738,4.16181,"Last King of Scotland, The (2006)"
8,969,4.160201,"African Queen, The (1951)"
9,593,4.150331,"Silence of the Lambs, The (1991)"


In [13]:
u = 137293
# Example: top 10 predictions for userId u
preds = pred_item(mat_ratings_scaled, item_similarity, 3, user_id = u).sort_values(ascending=False).head(10)

# Series has userId as name; rename for better display:
preds.name = 'predicted rating for user ' + str(preds.name)

df_pred = preds.to_frame().reset_index().rename(columns={'index':'movieId'})

# df to link movieId to title 
df_mov = df_pre[['movieId','title']].rename(columns={'title':'Title'}).drop_duplicates()
df_pred = df_pred.merge(right=df_mov, on='movieId', how='left')
# add users mean rating to prediction to obtain original scale
df_pred[f'predicted rating for user {u}'] = df_pred[f'predicted rating for user {u}'] + user_mean.loc[u]
df_pred

Unnamed: 0,movieId,predicted rating for user 137293,Title
0,3788,4.6955,Blow-Up (Blowup) (1966)
1,154,4.5,Beauty of the Day (Belle de jour) (1967)
2,111,4.264053,Taxi Driver (1976)
3,1237,4.231817,"Seventh Seal, The (Sjunde inseglet, Det) (1957)"
4,5291,4.049756,Rashomon (Rashômon) (1950)
5,8014,4.029337,"Spring, Summer, Fall, Winter... and Spring (Bo..."
6,6377,4.0148,Finding Nemo (2003)
7,152081,4.012226,Zootopia (2016)
8,2857,4.004374,Yellow Submarine (1968)
9,8961,4.003128,"Incredibles, The (2004)"


### Users with only 20 ratings

### Users with 20 ratings and low average (0.5*)

In [14]:
u = 63044
# Example: top 10 predictions for userId u
preds = pred_item(mat_ratings_scaled, item_similarity, 3, user_id = u).sort_values(ascending=False).head(10)

# Series has userId as name; rename for better display:
preds.name = 'predicted rating for user ' + str(preds.name)

df_pred = preds.to_frame().reset_index().rename(columns={'index':'movieId'})

# df to link movieId to title 
df_mov = df_pre[['movieId','title']].rename(columns={'title':'Title'}).drop_duplicates()
df_pred = df_pred.merge(right=df_mov, on='movieId', how='left')
# add users mean rating to prediction to obtain original scale
df_pred[f'predicted rating for user {u}'] = df_pred[f'predicted rating for user {u}'] + user_mean.loc[u]
df_pred

Unnamed: 0,movieId,predicted rating for user 63044,Title
0,1,0.5,Toy Story (1995)
1,5621,0.5,"Tuxedo, The (2002)"
2,5669,0.5,Bowling for Columbine (2002)
3,5673,0.5,Punch-Drunk Love (2002)
4,5679,0.5,"Ring, The (2002)"
5,5690,0.5,Grave of the Fireflies (Hotaru no haka) (1988)
6,5693,0.5,Saturday Night Fever (1977)
7,5782,0.5,"Professional, The (Le professionnel) (1981)"
8,5785,0.5,Jackass: The Movie (2002)
9,5791,0.5,Frida (2002)


In [16]:
u = 38998
# Example: top 10 predictions for userId u
preds = pred_item(mat_ratings_scaled, item_similarity, 3, user_id = u).sort_values(ascending=False).head(10)

# Series has userId as name; rename for better display:
preds.name = 'predicted rating for user ' + str(preds.name)

df_pred = preds.to_frame().reset_index().rename(columns={'index':'movieId'})

# df to link movieId to title 
df_mov = df_pre[['movieId','title']].rename(columns={'title':'Title'}).drop_duplicates()
df_pred = df_pred.merge(right=df_mov, on='movieId', how='left')
# add users mean rating to prediction to obtain original scale
df_pred[f'predicted rating for user {u}'] = df_pred[f'predicted rating for user {u}'] + user_mean.loc[u]
df_pred

Unnamed: 0,movieId,predicted rating for user 38998,Title
0,1,0.5,Toy Story (1995)
1,5621,0.5,"Tuxedo, The (2002)"
2,5669,0.5,Bowling for Columbine (2002)
3,5673,0.5,Punch-Drunk Love (2002)
4,5679,0.5,"Ring, The (2002)"
5,5690,0.5,Grave of the Fireflies (Hotaru no haka) (1988)
6,5693,0.5,Saturday Night Fever (1977)
7,5782,0.5,"Professional, The (Le professionnel) (1981)"
8,5785,0.5,Jackass: The Movie (2002)
9,5791,0.5,Frida (2002)


### Users with average rating of 5.0 stars

### 5* users with many ratings

In [17]:
u = 75309
# Example: top 10 predictions for userId u
preds = pred_item(mat_ratings_scaled, item_similarity, 3, user_id = u).sort_values(ascending=False).head(10)

# Series has userId as name; rename for better display:
preds.name = 'predicted rating for user ' + str(preds.name)

df_pred = preds.to_frame().reset_index().rename(columns={'index':'movieId'})

# df to link movieId to title 
df_mov = df_pre[['movieId','title']].rename(columns={'title':'Title'}).drop_duplicates()
df_pred = df_pred.merge(right=df_mov, on='movieId', how='left')
# add users mean rating to prediction to obtain original scale
df_pred[f'predicted rating for user {u}'] = df_pred[f'predicted rating for user {u}'] + user_mean.loc[u]
df_pred

Unnamed: 0,movieId,predicted rating for user 75309,Title
0,1,5.0,Toy Story (1995)
1,5621,5.0,"Tuxedo, The (2002)"
2,5669,5.0,Bowling for Columbine (2002)
3,5673,5.0,Punch-Drunk Love (2002)
4,5679,5.0,"Ring, The (2002)"
5,5690,5.0,Grave of the Fireflies (Hotaru no haka) (1988)
6,5693,5.0,Saturday Night Fever (1977)
7,5782,5.0,"Professional, The (Le professionnel) (1981)"
8,5785,5.0,Jackass: The Movie (2002)
9,5791,5.0,Frida (2002)


In [24]:
preds = pred_item(mat_ratings, item_similarity, 3, user_id = 12002).sort_values(ascending=False).head(10)

# Series has userId as name; rename for better display:
preds.name = 'predicted rating for user ' + str(preds.name)

df_pred = preds.to_frame().reset_index().rename(columns={'title':'Title'})
# df to link movieId to title 
df_mov = df_pre[['movieId','title']].rename(columns={'title':'Title'}).drop_duplicates()
df_pred = df_pred.merge(right=df_mov, on='movieId', how='left')
#df_pred = df_pred[['Title',1]]
df_pred

Unnamed: 0,movieId,predicted rating for user 12002,Title
0,112552,5.0,Whiplash (2014)
1,81591,5.0,Black Swan (2010)
2,82459,5.0,True Grit (2010)
3,164909,5.0,La La Land (2016)
4,174055,5.0,Dunkirk (2017)
5,152077,5.0,10 Cloverfield Lane (2016)
6,2648,3.854613,Frankenstein (1931)
7,1279,3.539312,Night on Earth (1991)
8,103688,3.512931,"Conjuring, The (2013)"
9,123,3.495122,Chungking Express (Chung Hing sam lam) (1994)


### 5* users with 20 ratings

In [25]:
preds = pred_item(mat_ratings, item_similarity, 3, user_id = 36868).sort_values(ascending=False).head(10)

# Series has userId as name; rename for better display:
preds.name = 'predicted rating for user ' + str(preds.name)

df_pred = preds.to_frame().reset_index().rename(columns={'title':'Title'})
# df to link movieId to title 
df_mov = df_pre[['movieId','title']].rename(columns={'title':'Title'}).drop_duplicates()
df_pred = df_pred.merge(right=df_mov, on='movieId', how='left')
#df_pred = df_pred[['Title',1]]
df_pred

Unnamed: 0,movieId,predicted rating for user 36868,Title
0,1,0.0,Toy Story (1995)
1,5620,0.0,Sweet Home Alabama (2002)
2,5630,0.0,Red Dragon (2002)
3,5669,0.0,Bowling for Columbine (2002)
4,5673,0.0,Punch-Drunk Love (2002)
5,5679,0.0,"Ring, The (2002)"
6,5690,0.0,Grave of the Fireflies (Hotaru no haka) (1988)
7,5693,0.0,Saturday Night Fever (1977)
8,5782,0.0,"Professional, The (Le professionnel) (1981)"
9,5785,0.0,Jackass: The Movie (2002)


In [26]:
preds = pred_item(mat_ratings, item_similarity, 3, user_id = 31747).sort_values(ascending=False).head(10)

# Series has userId as name; rename for better display:
preds.name = 'predicted rating for user ' + str(preds.name)

df_pred = preds.to_frame().reset_index().rename(columns={'title':'Title'})
# df to link movieId to title 
df_mov = df_pre[['movieId','title']].rename(columns={'title':'Title'}).drop_duplicates()
df_pred = df_pred.merge(right=df_mov, on='movieId', how='left')
#df_pred = df_pred[['Title',1]]
df_pred

Unnamed: 0,movieId,predicted rating for user 31747,Title
0,1379,1.892806,Young Guns II (1990)
1,1968,1.805313,"Breakfast Club, The (1985)"
2,1215,1.759249,Army of Darkness (1993)
3,4105,1.670292,"Evil Dead, The (1981)"
4,3273,1.662333,Scream 3 (2000)
5,1407,1.647415,Scream (1996)
6,2144,1.625387,Sixteen Candles (1984)
7,1101,1.593618,Top Gun (1986)
8,480,1.567363,Jurassic Park (1993)
9,1358,1.419986,Sling Blade (1996)
