## Movie Recommender Assuming a Linear Dependency Between the User and Item Matrices

### Approach 1: use a linear equation to minimise the error between the observed ratings and the product of the assumed user and item weight matrices

$$\min_{X,\theta} f(X,\theta) = \lVert R - X\theta \rVert^2$$

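Let's assume there are 4 latent features. Each rating then gives one equation formed from a row of the user matrix and a row of the item/movie matrix:

User 1: $$|a_1|a_2|a_3|a_4|$$    
Movie 1: $$|b_1|b_2|b_3|b_4|$$

i.e. $$a_1b_1 + a_2b_2 + a_3b_3 + a_4b_4 = d_{11}$$

where $d_{11}$ is the rating given by user 1 to movie 1. More generally, with $n$ latent features and a rating $\gamma$: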

$$\gamma = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$


$$\gamma = \sum_{i=1}^n a_i b_i$$

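Solving for $a_1$:

$$a_1 = \frac{\gamma - (a_2 b_2 + a_3 b_3 + \cdots + a_n b_n)}{b_1}$$

Let $a_i'$ and $b_i'$ denote the current estimates and $\gamma' = \sum_{i=1}^n a_i' b_i'$ the rating they predict. Holding every other value fixed, the updated estimate $a_1''$ is

$$a_1'' = \frac{\gamma - (a_1' b_1' + a_2' b_2' + \cdots + a_n' b_n') + a_1' b_1'}{b_1'}$$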

$$a_1'' = \frac{\gamma - \gamma' + a_1' b_1'}{b_1'}$$

$$a_1'' = \frac{\varepsilon}{b_1'} + a_1'$$

where $\varepsilon = \gamma - \gamma'$.

$$a_1'' \approx a_1' + \frac{1}{(b_1')^2} \cdot \varepsilon \cdot b_1'$$

$$a_1'' \approx a_1' + \alpha \cdot \varepsilon \cdot b_1'$$

where $\alpha$ is a small step size that takes the place of $1/(b_1')^2$.

Similarly,

$$b_1'' \approx b_1' + \alpha \cdot\varepsilon \cdot a_1'$$
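As a quick sanity check of the two update rules, the sketch below applies one step to a single user/movie pair and compares the error before and after. The vectors, the rating and the step size are made-up values chosen only for illustration.

```python
import numpy as np

# Toy latent vectors for one user and one movie (illustrative values only)
a = np.array([0.5, 0.1, 0.3, 0.8])   # user features a_1..a_4
b = np.array([0.9, 0.4, 0.2, 0.6])   # movie features b_1..b_4
gamma = 4.0                          # observed rating
alpha = 0.02                         # step size

# epsilon = gamma - gamma', where gamma' is the current prediction a . b
eps_before = gamma - a @ b

# Apply the update to every component, using the old values on the right-hand side
a_new = a + alpha * eps_before * b
b_new = b + alpha * eps_before * a

eps_after = gamma - a_new @ b_new
print(eps_before, eps_after)
```

With a small enough step size the error shrinks a little on each step, and the training loop repeats this for every observed rating.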

In [46]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm


In [47]:
# Loading the data and processing it
address = 'D:\\Project Ideas\\Confusionlist\\Recommender\\ml-latest-small\\ml-latest-small\\'
M = pd.read_csv(address + 'movies.csv')

R = pd.read_csv(address + 'ratings.csv')
print('\nMovie',M.head())
print('\nRatings\n')
R


Movie    movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  

Ratings



Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


## Hiding some ratings during training so that we can check the accuracy

In [48]:
from sklearn.model_selection import train_test_split
X_ratings,Y_ratings = train_test_split(R,test_size=0.1, random_state=42)

In [49]:
print(X_ratings.shape)
print(Y_ratings.shape)

(90752, 4)
(10084, 4)


In [50]:
new = Y_ratings.copy()
new['rating'] = np.nan

In [51]:
train_ratings = pd.concat([X_ratings, new], ignore_index=True, axis=0)

In [52]:
train_ratings = train_ratings.drop(columns=["timestamp"])
train_ratings = train_ratings.pivot(index='userId', columns='movieId', values='rating')
train_ratings = train_ratings.fillna(0)
R_Main = train_ratings.copy()
R_Main

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,2.5,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
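The masking steps above can be sketched on a tiny made-up ratings table. The held-out row keeps its place in the pivot as `NaN`, so the matrix still has a column for every movie, and `fillna(0)` then marks it as unobserved. The data and the choice of held-out row are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the ratings table
ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating':  [4.0, 3.0, 5.0, 2.0, 1.0],
})

# Pretend row 3 (user 2, movie 30) is in the test set and hide its rating
test_rows = ratings.index.isin([3])
hidden = ratings.copy()
hidden.loc[test_rows, 'rating'] = np.nan

# Movie 30 keeps its column even though its only rating is hidden
matrix = hidden.pivot(index='userId', columns='movieId', values='rating').fillna(0)

# Zero means "not observed", so the training loop will skip these cells
observed = matrix.to_numpy() != 0
print(matrix)
```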


In [53]:
Rat_arr = R_Main.to_numpy()
Rat_arr.shape

(610, 9724)

In [54]:
# Sorted vector of unique movie IDs (same order as the pivot table's columns)
movies = R['movieId'].unique()
movies.sort()

# Get the number of users and movies
num_users = Rat_arr.shape[0]
num_movies = Rat_arr.shape[1]

# Set the number of latent features
num_features = 100

# Initialize the user and movie matrices with random values
user_matrix = np.random.rand(num_users, num_features)
movie_matrix = np.random.rand(num_movies, num_features)
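One thing worth checking with this initialisation is the size of the starting predictions. With 100 features drawn uniformly from [0, 1), each dot product starts out near 100 × 0.25 = 25, far above any rating, which may contribute to the large early errors. The sketch below measures this on small random matrices. The rescaling by `1 / sqrt(num_features)` is only one possible alternative and is an assumption, not what this notebook does.

```python
import numpy as np

rng = np.random.default_rng(0)       # seed chosen arbitrarily for the illustration
num_features = 100

# Same kind of initialisation as above: uniform on [0, 1)
u = rng.random((200, num_features))
m = rng.random((300, num_features))
mean_unscaled = (u @ m.T).mean()     # roughly num_features * 0.25

# Hypothetical alternative: shrink the initial values before training
scale = 1.0 / np.sqrt(num_features)
mean_scaled = ((u * scale) @ (m * scale).T).mean()

print(mean_unscaled, mean_scaled)
```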


In [55]:
# MINIMISING ERROR AND FINDING RATINGS


# Define a learning rate and number of iterations
alpha = 0.02
num_iterations = 100
tol = 0.0000001

# Loop through the number of iterations
for i in tqdm(range(num_iterations)):
    # Track the largest absolute error seen in this pass, for the tolerance check
    max_abs_error = 0.0
    # Loop through all the users
    for j in range(num_users):
        # Find the non-zero (i.e. observed) ratings for the current user
        rated_movies = np.nonzero(Rat_arr[j, :])[0]
        # Loop through the rated movies for the current user
        for movie_id in rated_movies:
            # movie_id is the column index in Rat_arr, not the MovieLens movieId
            rating = Rat_arr[j, movie_id]
            # Calculate the error between the actual rating and the predicted rating
            error = rating - np.dot(user_matrix[j, :], movie_matrix[movie_id, :])
            # Update the user and movie matrices
            # (the movie update uses the already-updated user vector)
            user_matrix[j, :] = user_matrix[j, :] + alpha * error * movie_matrix[movie_id, :]
            movie_matrix[movie_id, :] = movie_matrix[movie_id, :] + alpha * error * user_matrix[j, :]
            max_abs_error = max(max_abs_error, abs(error))
    # Stop early once every error in a full pass is within the tolerance
    if max_abs_error < tol:
        break

# Predicting the ratings and adding it up in Dataframe

predicted_ratings = np.dot(user_matrix, movie_matrix.T)


100%|██████████████████████████████████████████████| 100/100 [01:25<00:00,  1.17it/s]


In [56]:
# Pass the movie IDs directly; wrapping them in a list would create MultiIndex columns
df_pred = pd.DataFrame(predicted_ratings, columns=movies)

df_pred

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
0,4.062838,2.199149,4.048531,1.399093,-0.175799,4.043689,-0.330281,0.369577,2.119650,3.890022,...,3.826860,2.796378,3.465577,3.856849,2.851569,3.873122,3.607924,3.118713,6.550062,4.713048
1,2.976700,2.614055,2.065147,1.503311,0.139654,3.366724,1.172149,3.778130,2.163095,1.800271,...,3.490989,3.857785,3.758513,3.711777,2.688345,3.298057,4.713494,2.998711,4.551250,3.247063
2,1.959779,0.196767,1.765931,-0.954478,0.818215,2.669703,3.423777,1.690869,1.661871,3.789483,...,2.213138,2.281745,2.962908,1.467307,4.238475,3.366688,3.047904,2.958012,3.171918,3.636609
3,3.153460,6.049132,2.676181,1.797794,3.693108,3.454706,3.008033,0.692238,1.169128,2.830528,...,2.619225,2.274698,3.173154,3.666163,4.004631,4.099336,4.424074,1.081380,2.986189,3.861168
4,2.695427,1.737326,3.771748,2.166950,3.858710,5.740854,5.378762,1.956185,3.152999,4.157031,...,4.431071,3.730648,4.698807,5.143915,4.591678,4.298213,5.336024,5.327005,4.695185,4.244932
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,2.469425,2.192765,2.089815,2.120587,2.165427,2.969089,2.526346,1.499336,1.948524,4.983921,...,2.401878,4.611901,3.277352,4.082530,2.915040,3.033647,3.560831,3.406884,4.009622,1.616158
606,4.026327,3.936919,5.407281,2.934030,1.871094,6.042969,1.926669,2.144106,0.502520,6.366241,...,5.146578,3.819700,3.427068,1.291735,2.960827,4.416673,4.718236,4.305919,3.681296,3.109957
607,2.525177,2.011019,1.979824,0.096703,3.355866,5.407963,2.271573,-2.151326,0.542064,3.990787,...,1.553269,3.492152,4.059764,3.102581,4.122720,3.286038,3.678800,4.543598,6.378450,3.206499
608,3.002593,3.188217,3.074766,1.928906,1.546601,3.495923,2.608184,0.645349,1.286258,4.001053,...,4.800076,2.312177,3.741682,4.780070,4.298614,4.420778,6.254004,4.128904,5.440051,5.615392


In [57]:
userId = Y_ratings['userId']
movieId = Y_ratings['movieId']

In [58]:
y_true = Y_ratings['rating']
y_pred = []
        
for i,j in zip(userId,movieId):
    y_pred.append(df_pred.loc[i-1,j])

In [59]:
len(y_pred)

10084
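The row-by-row lookup above could also be written with NumPy index arrays. The sketch below uses a small stand-in matrix and made-up IDs. It assumes, as the `i-1` in the loop does, that user IDs run from 1 to the number of users without gaps, and that the movie IDs are sorted in column order.

```python
import numpy as np

# Sorted movie IDs, in the same order as the columns of the prediction matrix
movies = np.array([1, 3, 6, 47, 50])
# Stand-in for predicted_ratings: 3 users x 5 movies
predicted = np.arange(15, dtype=float).reshape(3, 5)

# Stand-ins for the userId / movieId columns of the held-out ratings
user_ids = np.array([1, 3, 2])
movie_ids = np.array([47, 1, 50])

rows = user_ids - 1                          # user ID -> row position
cols = np.searchsorted(movies, movie_ids)    # movie ID -> column position
y_pred = predicted[rows, cols]
print(y_pred)
```

On the full test set this replaces about ten thousand `.loc` calls with a single indexing operation.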

In [60]:
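from sklearn.metrics import mean_squared_error
# RMSE on the held-out ratings; taking np.sqrt avoids the `squared=False`
# argument, which is deprecated in recent scikit-learn releases
np.sqrt(mean_squared_error(y_true, y_pred))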

1.378544373893474
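Some of the predictions shown earlier fall outside the rating scale (for example -2.15 and 6.55). Since the true ratings in this dataset lie between 0.5 and 5.0, clipping predictions to that range cannot increase the error. The helper below is a hypothetical sketch with made-up numbers, not part of the original pipeline.

```python
import numpy as np

def clipped_rmse(y_true, y_pred, low=0.5, high=5.0):
    """RMSE after clipping predictions to the assumed rating range [low, high]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), low, high)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up example values
y_true_demo = [4.0, 0.5, 5.0]
y_pred_demo = [6.5, -2.0, 4.0]

raw = float(np.sqrt(np.mean((np.array(y_true_demo) - np.array(y_pred_demo)) ** 2)))
clipped = clipped_rmse(y_true_demo, y_pred_demo)
print(raw, clipped)
```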

### Sometimes the calculation above fails because of runtime warnings or errors. To handle that, we wrap the whole calculation in a function and call the function again recursively if the kernel throws an error.

In [61]:
import warnings


def PredictRatings(Rat_arr, numFeatures=100, alpha=0.02, iterations=100, tol=1e-7):
    try:
        # Raise NumPy runtime warnings (overflow, invalid value) as exceptions
        # inside this block so that a diverging run can be caught below
        with warnings.catch_warnings():
            warnings.simplefilter('error', RuntimeWarning)

            # Get the number of users and movies
            ra = Rat_arr.copy()
            num_users = ra.shape[0]
            num_movies = ra.shape[1]

            # Set the number of latent features
            num_features = numFeatures

            # Initialize the user and movie matrices with random values
            user_matrix = np.random.rand(num_users, num_features)
            movie_matrix = np.random.rand(num_movies, num_features)

            # Loop through the number of iterations (the function argument, not a global)
            for i in tqdm(range(iterations)):
                # Track the largest absolute error seen in this pass
                max_abs_error = 0.0
                # Loop through all the users
                for j in range(num_users):
                    # Find the non-zero (i.e. observed) ratings for the current user
                    rated_movies = np.nonzero(ra[j, :])[0]
                    # Loop through the rated movies for the current user
                    for movie_id in rated_movies:
                        rating = ra[j, movie_id]
                        # Calculate the error between the actual rating and the predicted rating
                        error = rating - np.dot(user_matrix[j, :], movie_matrix[movie_id, :])
                        # Update the user and movie matrices
                        user_matrix[j, :] = user_matrix[j, :] + alpha * error * movie_matrix[movie_id, :]
                        movie_matrix[movie_id, :] = movie_matrix[movie_id, :] + alpha * error * user_matrix[j, :]
                        max_abs_error = max(max_abs_error, abs(error))
                # Stop early once every error in a full pass is within the tolerance
                if max_abs_error < tol:
                    break

            # Predict the full ratings matrix
            pred_rate = np.dot(user_matrix, movie_matrix.T)
        return pred_rate

    except (RuntimeWarning, RuntimeError):
        print('Runtime Error/Warning encountered, going for re-execution!\n')
        # Retry with the same arguments and a fresh random initialisation
        return PredictRatings(Rat_arr, numFeatures=numFeatures, alpha=alpha, iterations=iterations, tol=tol)
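By default NumPy only prints a `RuntimeWarning` on overflow and keeps going, so an `except` clause never sees it unless warnings are turned into exceptions first. The sketch below shows one way to do that with the standard `warnings` module. The helper name `overflows` is hypothetical.

```python
import warnings
import numpy as np

def overflows(x):
    """Return True if squaring x triggers a NumPy RuntimeWarning (e.g. overflow)."""
    try:
        with warnings.catch_warnings():
            # Inside this block, RuntimeWarning is raised instead of printed
            warnings.simplefilter('error', RuntimeWarning)
            np.square(np.array([x], dtype=np.float64))
        return False
    except RuntimeWarning:
        return True

print(overflows(2.0), overflows(1e200))
```

`np.errstate(over='raise')` is an alternative that raises `FloatingPointError` for NumPy floating-point problems only.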
        
    

In [62]:
R_Main = train_ratings.copy()
Ratarr = R_Main.to_numpy()

predicted_ratings = PredictRatings(Ratarr,100,0.02,100,1e-7)
predicted_ratings

100%|██████████████████████████████████████████████| 100/100 [01:16<00:00,  1.30it/s]


array([[4.05077252, 3.77727   , 3.98548023, ..., 5.08461439, 3.57383812,
        4.48682273],
       [1.43648929, 2.1226544 , 3.76070944, ..., 4.15090453, 3.54443706,
        4.44432706],
       [2.88061475, 3.1111922 , 1.66350301, ..., 3.83274925, 3.53318012,
        2.34624371],
       ...,
       [2.53645827, 2.02818851, 2.02765542, ..., 3.72724239, 2.91568265,
        3.15408894],
       [2.9997103 , 3.73305547, 4.60036088, ..., 4.72884207, 4.56784619,
        3.727067  ],
       [5.06651485, 3.77024125, 2.53398673, ..., 4.62238095, 1.68033798,
        4.16252297]])

In [63]:
# Pass the movie IDs directly; wrapping them in a list would create MultiIndex columns
df_pred = pd.DataFrame(predicted_ratings, columns=movies)

df_pred

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
0,4.050773,3.777270,3.985480,3.093635,1.937857,4.010370,2.294091,2.431572,-0.566307,2.541807,...,3.219267,4.558107,3.310197,4.867995,3.080148,4.328760,4.615560,5.084614,3.573838,4.486823
1,1.436489,2.122654,3.760709,1.822395,2.292198,3.984614,1.716106,2.339441,0.362502,1.071866,...,3.126299,3.866378,2.484659,4.173326,3.293555,4.235572,3.124969,4.150905,3.544437,4.444327
2,2.880615,3.111192,1.663503,3.802663,3.585360,2.770864,2.902345,1.465479,-0.843412,-0.431968,...,4.042488,5.398089,3.133002,2.833701,1.708839,2.770151,2.609357,3.832749,3.533180,2.346244
3,3.377140,-1.000050,2.979964,0.479090,5.394309,4.075321,1.696440,0.402145,-0.486078,-0.101680,...,2.379524,3.585128,1.743409,2.261216,2.992987,4.120224,2.969121,3.808765,2.668189,3.694294
4,3.107482,4.643800,2.651022,0.960221,2.648354,3.820260,2.825310,3.207485,1.615867,3.600185,...,5.879659,5.133172,3.946527,4.944430,4.193024,5.242106,4.945025,4.474098,4.249117,4.023236
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,2.622125,4.498728,1.900010,1.298914,2.403680,3.506746,2.540201,2.425359,1.393772,2.454784,...,3.868096,4.066517,4.319109,5.219371,3.567536,4.937749,2.919242,4.910563,2.466216,3.104247
606,3.961049,0.379094,4.390479,3.781647,3.124707,3.962435,4.646624,2.325193,1.315180,3.815688,...,1.592205,3.303470,3.278171,3.847520,2.056608,4.641679,4.351292,2.608091,2.423967,2.781926
607,2.536458,2.028189,2.027655,1.925932,2.316220,5.374712,3.007674,3.953427,1.245552,4.019213,...,1.953108,5.321287,4.895812,5.768556,5.038065,5.286219,4.349468,3.727242,2.915683,3.154089
608,2.999710,3.733055,4.600361,0.989981,2.939363,4.526988,3.888254,3.435674,1.220642,4.004936,...,4.477555,4.055580,4.726917,5.316243,4.479485,3.620564,5.293807,4.728842,4.567846,3.727067


In [64]:
userId = Y_ratings['userId']
movieId = Y_ratings['movieId']

In [65]:
y_true = Y_ratings['rating']
y_pred = []
        
for i,j in zip(userId,movieId):
    y_pred.append(df_pred.loc[i-1,j])

In [66]:
len(y_pred)

10084

In [67]:
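from sklearn.metrics import mean_squared_error
# RMSE on the held-out ratings; taking np.sqrt avoids the `squared=False`
# argument, which is deprecated in recent scikit-learn releases
np.sqrt(mean_squared_error(y_true, y_pred))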

1.3759161307959689
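The two runs give slightly different RMSE values (1.3785 and 1.3759). The train/test split is fixed by `random_state=42`, so the difference most likely comes from the unseeded random initialisation of the factor matrices. A minimal sketch of a seeded initialisation follows. The helper `init_factors` and the seed value are assumptions for illustration.

```python
import numpy as np

def init_factors(num_users, num_movies, num_features, seed):
    """Draw both factor matrices from one seeded generator (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    user_matrix = rng.random((num_users, num_features))
    movie_matrix = rng.random((num_movies, num_features))
    return user_matrix, movie_matrix

# The same seed gives the same starting point, so repeated runs are comparable
u1, m1 = init_factors(3, 5, 4, seed=42)
u2, m2 = init_factors(3, 5, 4, seed=42)
print(np.array_equal(u1, u2), np.array_equal(m1, m2))
```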