## __Collaborative Filtering and Memory-Based Modeling__ #
Collaborative filtering recommends items to a user based on how similar users have rated them. This notebook builds a memory-based recommendation engine that computes user-user and item-item similarities directly from the ratings matrix.


## Step 1: Import Required Libraries and Load the Dataset

- Import the pandas and NumPy libraries
- Load the dataset using pandas


In [14]:
import pandas as pd
import numpy as np

In [15]:
header =['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('https://raw.githubusercontent.com/nachikethmurthy/Source-Code-Dataset-for-Machine-Learning-using-Python/main/Data/ratings.csv')
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


__Observations:__
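- Here, we have defined `header` with the column names we want to use; it is applied to the DataFrame in Step 2.
- The raw data contains the columns userId, movieId, rating, and timestamp, which correspond to user_id, item_id, rating, and timestamp.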

## Step 2: Count the Unique Users and Items

- Rename the columns, then count the unique users (`n_users`) and the unique items (`n_items`)




In [16]:
df.columns = header

In [17]:
df['user_id'].nunique()

610

In [18]:
n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]
print('number of users = ' + str(n_users) + ' | number of items = ' + str(n_items))

number of users = 610 | number of items = 9724


__Observation:__
- There are 610 users and 9724 items.
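
To put these counts in context, it can help to check how much of the user-item grid actually holds a rating. The following is a minimal sketch on a small made-up table that reuses the column names from `header`; the names `toy`, `n_users_toy`, `n_items_toy`, and `density` are hypothetical, and it assumes each user rates an item at most once.

```python
import pandas as pd

# Toy ratings table with the same column names as `header` (values are made up).
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [10, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.0, 2.5],
})

n_users_toy = toy["user_id"].nunique()   # number of distinct users
n_items_toy = toy["item_id"].nunique()   # number of distinct items

# Fraction of the user-item grid that holds a rating
# (assumes one row per user-item pair).
density = len(toy) / (n_users_toy * n_items_toy)
print(f"density = {density:.3f}")
```

The same expression applied to `df`, `n_users`, and `n_items` would give the density of the real ratings data.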

## Step 3: Split the Data into Train and Test Sets

- Import train_test_split from sklearn.model_selection
- Split the data into train and test sets


In [19]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.25)

## Step 4: Create a Matrix for Train and Test Sets

- Create user-item matrices

In [20]:
train_data

Unnamed: 0,user_id,item_id,rating,timestamp
87728,566,110,5.0,849005345
49933,321,225,4.0,843212509
99089,608,2762,4.5,1189471095
34394,232,33164,2.5,1218169753
40312,274,44191,3.5,1201902133
...,...,...,...,...
46722,306,161634,3.5,1518327218
53433,352,94864,4.5,1493674430
96982,603,4259,3.0,1002403751
7705,51,4876,3.0,1230930898


In [21]:
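# This first attempt is kept commented out because it raises the IndexError below:
# item_id values are movie IDs (for example 33164), not positions, so
# item_id - 1 can exceed the n_items columns of the matrix.
# train_data_mat = np.zeros((n_users, n_items))
# for line in train_data.itertuples():
#     train_data_mat[line[1]-1, line[2]-1] = line[3]

# test_data_mat = np.zeros((n_users, n_items))
# for line in test_data.itertuples():
#     test_data_mat[line[1]-1, line[2]-1] = line[3]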

IndexError: index 33163 is out of bounds for axis 1 with size 9724
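
The commented-out loop fails because item IDs are labels rather than contiguous positions. One possible way to keep the NumPy approach is to map the IDs to 0-based positions first. The sketch below does this with `pd.factorize` on made-up data; the names `toy`, `user_pos`, `item_pos`, and `mat` are hypothetical, and this is not the route the notebook takes (it uses `pivot_table` instead).

```python
import numpy as np
import pandas as pd

# Toy ratings with non-contiguous item IDs (values are made up).
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 5],
    "item_id": [3, 33164, 3, 2762],
    "rating":  [4.0, 2.5, 5.0, 4.5],
})

# factorize returns (codes, uniques): codes are 0-based positions
# assigned in order of first appearance.
user_pos, user_ids = pd.factorize(toy["user_id"])
item_pos, item_ids = pd.factorize(toy["item_id"])

# Fill a dense matrix using positions instead of raw IDs.
mat = np.zeros((len(user_ids), len(item_ids)))
mat[user_pos, item_pos] = toy["rating"].to_numpy()
print(mat)
```

`user_ids` and `item_ids` keep the original labels, so a row or column position can be translated back to its ID.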

In [27]:
train_pivot = train_data.pivot_table(values="rating",index="user_id",columns="item_id")

In [28]:
test_pivot = test_data.pivot_table(values="rating",index="user_id",columns="item_id")

In [47]:
test_pivot = test_pivot.fillna(0)

__Observation:__
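- Here, we have created user-item matrices for the train and test sets by pivoting the ratings so that each row is a user and each column is an item. Missing ratings are filled with 0 (for the train set, this happens in Step 5).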


## Step 5: Calculate Similarity Matrices for Users and Items

- Import pairwise_distances from sklearn.metrics.pairwise
- Calculate similarity matrices for users and items
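
`pairwise_distances` with `metric='cosine'` returns cosine distances, so subtracting the result from 1 gives cosine similarities. As a rough illustration of what that similarity measures, the sketch below computes it directly with NumPy on a made-up matrix. The names `R`, `norms`, and `user_sim_toy` are hypothetical, and the code assumes no row is all zeros (which would cause a division by zero here).

```python
import numpy as np

# Toy user-item matrix (values are made up): rows are users, columns are items.
R = np.array([
    [4.0, 0.0, 5.0],
    [4.0, 0.0, 5.0],
    [0.0, 3.0, 0.0],
])

# Cosine similarity: dot product of two rows divided by the product of their norms.
norms = np.linalg.norm(R, axis=1, keepdims=True)   # shape (3, 1)
user_sim_toy = (R @ R.T) / (norms @ norms.T)

print(np.round(user_sim_toy, 3))
```

Users 0 and 1 have identical ratings, so their similarity is 1; user 2 shares no rated item with them, so the similarity is 0. Applying the same computation to the transposed matrix gives item-item similarities.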


In [34]:
train_pivot = train_pivot.fillna(0)

In [36]:
train_pivot.isna().sum().sum()

0

In [46]:
train_pivot

user_id / item_id,1,2,3,4,5,6,7,8,9,10,...,191005,193565,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,2.5,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [62]:
from sklearn.metrics.pairwise import pairwise_distances
# Cosine similarity = 1 - cosine distance
user_sim = 1 - pairwise_distances(train_pivot.values, metric='cosine')
item_sim = 1 - pairwise_distances(train_pivot.T, metric='cosine')

In [63]:
user_sim

array([[1.        , 0.01730604, 0.04172931, ..., 0.22357913, 0.07302605,
        0.11437938],
       [0.01730604, 1.        , 0.        , ..., 0.0296589 , 0.03995441,
        0.08892494],
       [0.04172931, 0.        , 1.        , ..., 0.01321501, 0.        ,
        0.01311874],
       ...,
       [0.22357913, 0.0296589 , 0.01321501, ..., 1.        , 0.09674455,
        0.22647253],
       [0.07302605, 0.03995441, 0.        , ..., 0.09674455, 1.        ,
        0.04022964],
       [0.11437938, 0.08892494, 0.01311874, ..., 0.22647253, 0.04022964,
        1.        ]])

In [64]:
item_sim

array([[1.        , 0.30154378, 0.24880174, ..., 0.        , 0.        ,
        0.        ],
       [0.30154378, 1.        , 0.14837147, ..., 0.        , 0.        ,
        0.        ],
       [0.24880174, 0.14837147, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [82]:
np.fill_diagonal(user_sim,0)
np.fill_diagonal(item_sim,0)

In [83]:
item_sim

array([[0.        , 0.30154378, 0.24880174, ..., 0.        , 0.        ,
        0.        ],
       [0.30154378, 0.        , 0.14837147, ..., 0.        , 0.        ,
        0.        ],
       [0.24880174, 0.14837147, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [84]:
user_df = pd.DataFrame(user_sim, train_pivot.index, columns=train_pivot.index)

In [85]:
user_df.iloc[0].sort_values(ascending=False)

user_id
266    0.307581
313    0.304075
57     0.277151
368    0.273884
452    0.271424
         ...   
250    0.000000
175    0.000000
148    0.000000
12     0.000000
306    0.000000
Name: 1, Length: 610, dtype: float64

In [86]:
item_df = pd.DataFrame(item_sim, train_pivot.columns, columns=train_pivot.columns)
item_df.iloc[0].sort_values(ascending=False)

item_id
260       0.516123
780       0.457496
296       0.452833
4886      0.450295
648       0.449493
            ...   
32387     0.000000
32392     0.000000
32440     0.000000
32442     0.000000
193609    0.000000
Name: 1, Length: 8745, dtype: float64

In [68]:
user_sim.shape

(610, 610)

In [69]:
user_sim[:5]

array([[1.        , 0.01730604, 0.04172931, ..., 0.22357913, 0.07302605,
        0.11437938],
       [0.01730604, 1.        , 0.        , ..., 0.0296589 , 0.03995441,
        0.08892494],
       [0.04172931, 0.        , 1.        , ..., 0.01321501, 0.        ,
        0.01311874],
       [0.08826458, 0.00488667, 0.00308053, ..., 0.11150475, 0.        ,
        0.09464531],
       [0.07996931, 0.        , 0.        , ..., 0.08415442, 0.15833528,
        0.03548323]])

In [88]:
item_sim.shape

(8745, 8745)

## Step 6: Define the Prediction Function

- Define a `predict` function that takes the following parameters:
  - `ratings`: the user-item matrix, as a NumPy array
  - `similarity`: the user-user or item-item similarity matrix
  - `type` (default = `'user'`): the type of collaborative filtering (`'user'` or `'item'`)
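
For reference, the two branches of the `predict` function in this step can be written as the formulas below. The notation is an assumption introduced here, not taken from the notebook: $r_{u,i}$ is the entry of `ratings` for user $u$ and item $i$, $\bar{r}_u$ is the mean of user $u$'s row (zeros included), and $s$ is the similarity matrix passed in.

User-based prediction:

$$
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,\left(r_{v,i} - \bar{r}_v\right)}{\sum_{v} \lvert s_{u,v} \rvert}
$$

Item-based prediction:

$$
\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} \lvert s_{i,j} \rvert}
$$

In the user-based case, each user's ratings are centered on that user's mean before averaging, so a user who rates everything high does not dominate the weighted sum.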

In [87]:
train_pivot.mean(axis=1)

user_id
1      0.090452
2      0.010349
3      0.007890
4      0.065523
5      0.013608
         ...   
606    0.349571
607    0.065066
608    0.227844
609    0.008462
610    0.420926
Length: 610, dtype: float64

In [70]:
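def predict(ratings, similarity, type='user'):
    # ratings must be a NumPy array; unrated entries are 0 and count toward the means.
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        # Center each user's ratings on that user's mean rating.
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        # Add the mean back to the similarity-weighted average of the centered ratings.
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Similarity-weighted average of the user's own ratings across items.
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    else:
        raise ValueError("type must be 'user' or 'item'")
    return pred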

In [71]:
item_prediction = predict(train_pivot.values, item_sim, type='item')
user_prediction = predict(train_pivot.values, user_sim, type='user')

In [72]:
item_prediction.shape

(610, 8745)

In [73]:
user_prediction.shape

(610, 8745)

In [74]:
user_prediction

array([[ 1.4466002 ,  0.65739995,  0.27768228, ...,  0.01949481,
         0.01949481,  0.02094823],
       [ 1.03450997,  0.42121483,  0.04908048, ..., -0.04017817,
        -0.04017817, -0.03195433],
       [ 1.00064751,  0.48674656,  0.18630227, ..., -0.06330641,
        -0.06330641, -0.06330641],
       ...,
       [ 1.58626101,  0.87233896,  0.42411587, ...,  0.15005571,
         0.15005571,  0.15731749],
       [ 1.39152303,  0.62074115,  0.24274802, ..., -0.03488143,
        -0.03488143, -0.03342736],
       [ 1.78041266,  0.98694984,  0.53082621, ...,  0.33783327,
         0.33783327,  0.3479256 ]])

__Observations:__
- The item-based and user-based predictions are stored in `item_prediction` and `user_prediction`.
- Although memory-based algorithms are easy to implement, they have drawbacks: the similarity matrices grow quadratically with the number of users or items, so the approach does not scale well to real-world data, and it does not address the well-known cold start problem.
- The cold start problem is that a new user or a new item has no ratings yet, so no similarities can be computed and the system cannot produce recommendations for it.

## Step 7: Create a Function for RMSE

- Import mean_squared_error from sklearn.metrics
- Define the RMSE function
- Calculate RMSE for user-based and item-based predictions


In [75]:
from sklearn.metrics import mean_squared_error
from math import sqrt

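def rmse(prediction, ground_truth):
    # Score only the positions where the test matrix holds a rating (nonzero entries).
    # Both arrays are indexed by position, so they should share the same row and column order.
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(ground_truth, prediction))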

print('User-based CF RMSE: ' + str(rmse(user_prediction, test_pivot.values)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_pivot.values)))

User-based CF RMSE: 3.4574951424128675
Item-based CF RMSE: 3.4030416481791312


__Observation:__
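- As shown, we have calculated the RMSE for user-based and item-based predictions: about 3.46 and 3.40, respectively.
- These errors are large for ratings that go up to 5. One likely reason is that unrated entries were filled with 0 before the means were taken, which pulls the predictions toward 0. In addition, `train_pivot` and `test_pivot` are built from different rows, so their column labels may not match, and the positional indexing in `rmse` may then compare a prediction with the rating of a different item.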
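
Because `rmse` indexes both arrays by position, the test matrix needs the same row and column labels, in the same order, as the matrix the predictions were computed from. The sketch below shows one possible way to align them with `DataFrame.reindex` on made-up data. The names `train_toy`, `test_toy`, `pred_toy`, `test_aligned`, and `rmse_aligned` are hypothetical, and the prediction values are stand-ins, not output of `predict`.

```python
import numpy as np
import pandas as pd

# Toy train/test matrices with different item columns (values are made up).
train_toy = pd.DataFrame([[4.0, 0.0, 5.0], [0.0, 3.0, 0.0]],
                         index=[1, 2], columns=[10, 20, 30])
test_toy = pd.DataFrame([[2.0], [4.0]], index=[1, 2], columns=[30])

# Stand-in predictions with the same shape and labels as the training matrix.
pred_toy = pd.DataFrame([[4.0, 3.0, 3.0], [3.0, 3.0, 5.0]],
                        index=train_toy.index, columns=train_toy.columns)

# Give the test matrix the training labels; cells absent from the test set become 0 (unrated).
test_aligned = test_toy.reindex(index=train_toy.index,
                                columns=train_toy.columns, fill_value=0.0)

# Compare predictions and true ratings only where a test rating exists.
mask = test_aligned.to_numpy() != 0
errors = pred_toy.to_numpy()[mask] - test_aligned.to_numpy()[mask]
rmse_aligned = float(np.sqrt(np.mean(errors ** 2)))
print(rmse_aligned)
```

Without the `reindex` step, the single test column (item 30) would sit at position 0 and be compared with the predictions for item 10. Note that `reindex` drops any test items that never appear in the training matrix, so those ratings are not scored.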


This is how a memory-based collaborative filtering recommender is built and evaluated.