# Recommender System using Collaborative Filtering


• Input 

    • User-Rating Matrix (Incomplete : Sparse)

• Output

    • For a particular user, complete the row
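
As a rough illustration of the input and the output, the toy matrix below stores missing ratings as `None`. The users, items and the helper `missing_items` are hypothetical and are not taken from the dataset used later. Completing a user's row means estimating the `None` entries.

```python
# Toy user-rating matrix: one row per user, one column per item.
# None marks a rating that has not been given, so the matrix is sparse.
items = ["item1", "item2", "item3", "item4"]
ratings_matrix = {
    "userA": [5.0, None, 3.0, None],
    "userB": [None, 2.0, None, 4.0],
    "userC": [4.0, None, None, 1.0],
}

def missing_items(user):
    """Return the items whose rating must be predicted to complete the user's row."""
    return [item for item, r in zip(items, ratings_matrix[user]) if r is None]

print(missing_items("userA"))  # ['item2', 'item4']
```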
 
<br>

User Based Collaborative Filtering:<br>
        
       • If user-u likes item-j, recommend item-j’ that was liked by other users similar to user-u

<img src="img/UBCF.png">

<br><br><br>
Item Based Collaborative Filtering:<br>


    • If user-u likes item-j, recommend item-j’ that is similar to item-j
    

<img src="img/IBCF.png">
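
Both rules can be sketched on a toy set of "likes". The example below is deliberately simplified: the data is hypothetical, similarity is taken as the number of shared likes, and the function names are made up for illustration. It is not the method Surprise uses.

```python
# Hypothetical data: the set of items each user likes.
likes = {
    "u1": {"j1", "j2"},
    "u2": {"j1", "j2", "j3"},
    "u3": {"j4"},
}

def recommend_user_based(user):
    """Recommend items liked by the user who shares the most likes with `user`."""
    others = [v for v in likes if v != user]
    nearest = max(others, key=lambda v: len(likes[user] & likes[v]))
    return sorted(likes[nearest] - likes[user])

def recommend_item_based(user):
    """Rank unseen items by how often they are liked together with the user's items."""
    scores = {}
    for liked in likes.values():
        shared = liked & likes[user]
        for item in liked - likes[user]:
            # Each shared item is one co-occurrence between `item` and an item the user likes.
            scores[item] = scores.get(item, 0) + len(shared)
    ranked = [item for item in scores if scores[item] > 0]
    return sorted(ranked, key=lambda item: scores[item], reverse=True)

print(recommend_user_based("u1"))  # ['j3']
print(recommend_item_based("u1"))  # ['j3']
```

The user-based version looks for a similar user first and then borrows their items. The item-based version never compares users directly and only relies on which items tend to be liked together.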

### Surprise is a Python package for building recommender systems

Surprise is an easy-to-use Python scikit for recommender systems.

Some frequently used modules in the Surprise package are listed below.<br>


<li>similarities module</li>
    
    The similarities module includes tools to compute similarity metrics between users or items.
    
    Popular ones are shown below
    
    cosine -- Compute the cosine similarity between all pairs of users (or items).
    msd -- Compute the Mean Squared Difference similarity between all pairs of users (or items).
    pearson -- Compute the Pearson correlation coefficient between all pairs of users (or items).
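
To make the three metrics concrete, here is a small from-scratch sketch that computes them for two users over the items both have rated. It follows the textbook definitions (with the MSD similarity taken as 1 / (msd + 1)). The data and function names are hypothetical, and Surprise's own implementation may differ in details such as edge-case handling.

```python
import math

def common_ratings(a, b):
    """Pair up the ratings two users gave to the items both of them rated."""
    shared = sorted(set(a) & set(b))
    return [a[i] for i in shared], [b[i] for i in shared]

def cosine(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y)))

def msd_similarity(x, y):
    msd = sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)
    return 1 / (msd + 1)  # the +1 avoids a division by zero when msd is 0

def pearson(x, y):
    # Pearson correlation is the cosine of the mean-centred vectors.
    # No guard here for constant vectors, which would divide by zero.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([p - mx for p in x], [q - my for q in y])

# Hypothetical ratings: item id -> rating.
user_u = {1: 4.0, 2: 2.0, 3: 5.0}
user_v = {1: 2.0, 2: 1.0, 4: 3.0}
x, y = common_ratings(user_u, user_v)  # items 1 and 2 are shared

print(round(cosine(x, y), 4), round(msd_similarity(x, y), 4), round(pearson(x, y), 4))
```

In this example user_v rates the shared items exactly half as high as user_u, so cosine and Pearson both report a perfect match while the MSD similarity penalises the difference in scale.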

<li>accuracy module</li>

    The accuracy module provides tools for computing accuracy metrics on a set of predictions.
    
    Popular choices are as shown below
    
    rmse -- Compute RMSE (Root Mean Squared Error).
    mae -- Compute MAE (Mean Absolute Error).
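
For reference, both metrics are easy to compute by hand. The sketch below uses hypothetical (true rating, estimated rating) pairs and plain Python rather than the accuracy module itself.

```python
import math

def rmse(pairs):
    """Root Mean Squared Error over (true, estimated) rating pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean Absolute Error over (true, estimated) rating pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# Hypothetical (true rating, estimate) pairs; the errors are 1, -2, 0 and 1.
pairs = [(3.0, 2.0), (-1.0, 1.0), (4.0, 4.0), (0.0, -1.0)]

print(round(rmse(pairs), 4))  # 1.2247
print(mae(pairs))             # 1.0
```

RMSE squares the errors before averaging, so the single error of size 2 weighs more heavily in it than in MAE.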

<li> dataset module</li>

    The dataset module defines the Dataset class and other subclasses which are used for managing datasets.
    
    Popular choices are shown below
    
    Dataset.load_builtin -- Load a built-in dataset.
    Dataset.load_from_file -- Load a dataset from a (custom) file.
    Dataset.load_from_folds -- Load a dataset where folds (for cross-validation) are predefined by some files.
    Dataset.load_from_df -- Load a dataset from a pandas dataframe.
    
<li>reader module</li>

    The Reader class is used to parse a file containing ratings. Such a file is assumed to specify only one 
    rating per line, and each line needs to respect the following structure:
    
    user ; item ; rating ; [timestamp]
    
    where the order of the fields and the separator (here ‘;’) may be arbitrarily defined. 
    Brackets indicate that the timestamp field is optional.
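
The line structure above can be illustrated with a few lines of plain Python. This is only a sketch of the idea: `parse_line` is a hypothetical helper, not part of Surprise. With the library itself, the same layout would be described through the Reader's `line_format` and `sep` parameters.

```python
def parse_line(line, sep=";", line_format="user item rating timestamp"):
    """Split one ratings line into named fields, following `line_format`."""
    fields = line_format.split()
    values = [v.strip() for v in line.split(sep)]
    record = dict(zip(fields, values))  # a missing trailing timestamp is simply absent
    record["rating"] = float(record["rating"])
    return record

print(parse_line("196 ; 242 ; 3.0 ; 881250949"))
print(parse_line("196 ; 242 ; 3.0"))                                        # no timestamp
print(parse_line("242,196,3.0", sep=",", line_format="item user rating"))  # other order and separator
```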


http://surpriselib.com/
<br>https://github.com/NicolasHug/Surprise
<br>http://surprise.readthedocs.io/en/stable/index.html

In [None]:
# Install the Surprise package using Anaconda
# !conda install -c anaconda cython
# !conda install -c conda-forge scikit-surprise

In [20]:
from surprise import Dataset
from surprise import Reader, KNNWithMeans
from surprise.model_selection import cross_validate
from surprise import accuracy

import pandas as pd

#### Jokes Dataset
http://eigentaste.berkeley.edu/dataset/
<br>149 Jokes
<br>59132 Users
<br>Reading the jokes file
<br>Note: the data is tab-separated

In [21]:
jokes = pd.read_csv("jester_items.tsv", sep="\t", names=["ItemID", "Joke"])

In [22]:
jokes.shape

(149, 2)

In [23]:
jokes.head()

Unnamed: 0,ItemID,Joke
0,1:,"A man visits the doctor. The doctor says, ""I h..."
1,2:,This couple had an excellent relationship goin...
2,3:,Q. What's 200 feet long and has 4 teeth? A. Th...
3,4:,Q. What's the difference between a man and a t...
4,5:,Q. What's O. J. Simpson's web address? A. Slas...


#### Reading the ratings file

In [24]:
ratings = pd.read_csv("jester_ratings.csv", index_col=None)
ratings.head()

Unnamed: 0,UserID,ItemID,Rating
0,1,5,0.219
1,1,7,-9.281
2,1,8,-9.281
3,1,13,-6.781
4,1,15,0.875


In [25]:
ratings.shape

(1761439, 3)

#### Check the unique users and unique jokes that were rated

In [26]:
ratings.UserID.nunique()

59132

In [27]:
ratings.ItemID.nunique()

140
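
These counts give a quick measure of how sparse the user-rating matrix is. The short calculation below only reuses the numbers printed above (number of ratings, unique users, unique rated jokes).

```python
n_ratings = 1761439   # rows in the ratings file (ratings.shape[0])
n_users = 59132       # ratings.UserID.nunique()
n_items = 140         # ratings.ItemID.nunique()

# Share of all user-joke pairs that actually have a rating.
density = n_ratings / (n_users * n_items)

print(f"density  = {density:.1%}")      # density  = 21.3%
print(f"sparsity = {1 - density:.1%}")  # sparsity = 78.7%
```

So roughly four out of five entries of the matrix are missing, and these are the entries a recommender has to estimate.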

#### Get the summary of the dataset

In [28]:
#Observe the ratings 
ratings.describe()

Unnamed: 0,UserID,ItemID,Rating
count,1761439.0,1761439.0,1761439.0
mean,32723.22,70.71133,1.618602
std,18280.11,46.0079,5.302608
min,1.0,5.0,-10.0
25%,17202.0,21.0,-2.031
50%,34808.0,69.0,2.219
75%,47306.0,112.0,5.719
max,63978.0,150.0,10.0


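### Defining the parser to read data into a Surprise dataset
#### The parser requires the scale of the ratings (rating_scale) and the order of the columns (line_format)

Let's limit the data to users with a UserID below 1000 for the sake of convenience. Note that Dataset.load_from_df reads the dataframe columns in the order user, item, rating and only uses the reader's rating_scale.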

In [29]:
no_of_users = 1000
reader = Reader(line_format = 'user item rating', rating_scale=(-10, 10))
data = Dataset.load_from_df(ratings[ratings.UserID < no_of_users], reader)

In [30]:
data

<surprise.dataset.DatasetAutoFolds at 0x228ad717b70>

#### Simulation Parameters
-  Algorithm Type
-  User-Based vs Item-Based
-  Similarity Metric

In [31]:
sim_parameters = {'name': 'cosine', 'user_based': True}
algo = KNNWithMeans(k=5,sim_options=sim_parameters)

#### Cross Validation Accuracies

In [32]:
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    4.9924  5.0941  5.0170  4.9270  4.9286  4.9918  0.0621  
MAE (testset)     3.8459  3.9020  3.8751  3.7589  3.7837  3.8331  0.0541  
Fit time          2.63    2.43    2.92    2.88    2.50    2.67    0.20    
Test time         2.65    2.93    3.31    2.92    2.90    2.94    0.21    


{'test_rmse': array([4.99235576, 5.09405078, 5.0169957 , 4.92695254, 4.9286321 ]),
 'test_mae': array([3.84589609, 3.90197679, 3.87505627, 3.75891902, 3.78365651]),
 'fit_time': (2.6319620609283447,
  2.431499481201172,
  2.9221479892730713,
  2.8762857913970947,
  2.5044827461242676),
 'test_time': (2.653397560119629,
  2.934189558029175,
  3.31172513961792,
  2.9181978702545166,
  2.9032366275787354)}
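
As a rough sketch of what cv=5 means: the ratings are split into five folds, and each fold serves once as the test set while the other four are used for training. The helper below is hypothetical and stdlib-only, not Surprise's splitter. The last lines recompute the Mean and Std columns from the per-fold RMSE values printed above.

```python
import random
import statistics

def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(10, n_splits=5):
    print(len(train_idx), "train /", len(test_idx), "test")

# Per-fold RMSE values copied from the table above.
fold_rmse = [4.9924, 5.0941, 5.0170, 4.9270, 4.9286]
print(round(statistics.mean(fold_rmse), 4))    # 4.9918
print(round(statistics.pstdev(fold_rmse), 4))  # 0.0621
```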

#### Training the model on complete data

In [33]:
trainset = data.build_full_trainset()
print(trainset)

<surprise.trainset.Trainset object at 0x00000228AD717780>


In [34]:
# Train the algorithm on the trainset
algo.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x228ad717cf8>

In [35]:
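# Build the anti-testset: all pairs (uid, iid) that are NOT in the training set.
# Their true rating is unknown, so r_ui is set to the placeholder value given by fill.
# Note: the outputs shown below have r_ui=-11.0, which suggests they were produced with fill=-11.
testset = trainset.build_anti_testset(fill=0)
# print(testset)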
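
Conceptually, the anti-testset is every (user, item) combination of known users and known items that has no rating in the training set, with the fill value standing in for the unknown rating. A minimal stdlib sketch of that idea, using a hypothetical helper name and toy data, could look like this:

```python
def anti_testset_sketch(rated, fill):
    """Return (user, item, fill) for every known user-item pair absent from `rated`."""
    users = sorted({u for u, _ in rated})
    items = sorted({i for _, i in rated})
    return [(u, i, fill) for u in users for i in items if (u, i) not in rated]

# Toy training data: user 1 rated jokes 5 and 7, user 2 rated only joke 5.
rated = {(1, 5), (1, 7), (2, 5)}
print(anti_testset_sketch(rated, fill=0))  # [(2, 7, 0)]
```

The number of pairs grows with users × items, which is one reason to restrict the data to a subset of users before building it.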

In [36]:
predictions = algo.test(testset)

In [37]:
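# Compute RMSE against the placeholder ratings of the anti-testset.
# r_ui here is the fill value, not a real rating, so this number is not a measure of accuracy.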
accuracy.rmse(predictions)

RMSE: 12.2496


12.249567692363518

#### Inspecting the first two predictions

In [38]:
predictions[0:2]

[Prediction(uid=1, iid=28, r_ui=-11.0, est=3.936034447386426, details={'actual_k': 5, 'was_impossible': False}),
 Prediction(uid=1, iid=30, r_ui=-11.0, est=-1.1820314615519951, details={'actual_k': 5, 'was_impossible': False})]

#### Function to calculate top 10 predictions for each user

In [39]:
from collections import defaultdict

In [52]:
top_n = defaultdict(list)

In [53]:
top_n

defaultdict(list, {})

In [64]:
for uid, iid, true_r, est, _ in predictions:
    top_n[uid].append((iid, est))

In [65]:
top_n

defaultdict(list,
            {1: [(96, 6.581633382724611),
              (114, 5.787077860630429),
              (48, 5.414480069133184),
              (129, 5.300475457734191),
              (45, 5.16185421176872),
              (28, 3.936034447386426),
              (30, -1.1820314615519951),
              (48, 5.414480069133184),
              (33, -2.2265257486034553),
              (37, -0.8300469668290784),
              (38, 3.9159422191919884),
              (39, 1.7636803903389726),
              (40, 0.4645460308931799),
              (41, -0.4251391460586529),
              (43, -0.32717903748461197),
              (44, -5.755900968517227),
              (56, 4.905193348704925),
              (78, 3.872917512602017),
              (97, -1.1703709608475403),
              (96, 6.581633382724611),
              (88, 3.595798606003404),
              (95, 1.8161748537824352),
              (47, 4.028335026189696),
              (94, 0.7659911206282253),
              (46, 2.41

In [59]:
len(top_n[3])

122

In [61]:
n = 5

# Then sort the predictions for each user and retrieve the n highest ones.
for uid, user_ratings in top_n.items():
    user_ratings.sort(key=lambda x: x[1], reverse=True)
    top_n[uid] = user_ratings[:n]

In [66]:
top_n[1]

[(96, 6.581633382724611),
 (114, 5.787077860630429),
 (48, 5.414480069133184),
 (129, 5.300475457734191),
 (45, 5.16185421176872),
 (28, 3.936034447386426),
 (30, -1.1820314615519951),
 (48, 5.414480069133184),
 (33, -2.2265257486034553),
 (37, -0.8300469668290784),
 (38, 3.9159422191919884),
 (39, 1.7636803903389726),
 (40, 0.4645460308931799),
 (41, -0.4251391460586529),
 (43, -0.32717903748461197),
 (44, -5.755900968517227),
 (56, 4.905193348704925),
 (78, 3.872917512602017),
 (97, -1.1703709608475403),
 (96, 6.581633382724611),
 (88, 3.595798606003404),
 (95, 1.8161748537824352),
 (47, 4.028335026189696),
 (94, 0.7659911206282253),
 (46, 2.4176356631094205),
 (82, 2.5312783305966384),
 (45, 5.16185421176872),
 (73, 1.641055379813162),
 (84, -2.8220127484727797),
 (77, 2.4275823233251277),
 (70, 5.155726987648176),
 (90, -0.8877486859712072),
 (63, 1.6483446343526023),
 (85, -0.7892164307001224),
 (101, -1.134590850954032),
 (100, -0.37757659874712246),
 (99, 0.8757996346849257),
 (

In [22]:
# Fetching top 10 predictions for each user
from collections import defaultdict

def get_top_n(predictions, n=10):
    '''Return the top-N recommendations for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
        A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

top_n = get_top_n(predictions, n=10)
take(5, top_n.items())

[(1,
  [(96, 6.581633382724611),
   (114, 5.787077860630429),
   (48, 5.414480069133184),
   (129, 5.300475457734191),
   (45, 5.16185421176872),
   (70, 5.155726987648176),
   (56, 4.905193348704925),
   (133, 4.728843142830067),
   (130, 4.520294557797246),
   (111, 4.295786505826236)]),
 (2,
  [(121, 6.744331362728712),
   (129, 6.611773942929187),
   (114, 6.203979987827466),
   (137, 6.1347226686587),
   (98, 6.0581417924467615),
   (139, 6.01817464700671),
   (105, 6.017571123160547),
   (42, 5.992143877836309),
   (130, 5.987976203824399),
   (126, 5.827191241435635)]),
 (3,
  [(76, -0.581330873625749),
   (110, -1.3310616072438286),
   (138, -1.736834257121135),
   (120, -2.100732353796996),
   (139, -2.1640607360183672),
   (113, -2.2746186329465257),
   (126, -2.4638551596404206),
   (114, -2.6979488093427886),
   (87, -2.726843735242719),
   (36, -2.727156673729244)]),
 (4,
  [(143, 1.5451944558720605),
   (120, 0.3878504616573828),
   (138, -0.007322023035616354),
   (142, 

#### Top Predictions Matrix

In [23]:
# Printing top predictions
for uid, user_ratings in take(10,top_n.items()):
    print(uid, [iid for (iid, _) in user_ratings])

1 [96, 114, 48, 129, 45, 70, 56, 133, 130, 111]
2 [121, 129, 114, 137, 98, 139, 105, 42, 130, 126]
3 [76, 110, 138, 120, 139, 113, 126, 114, 87, 36]
4 [143, 120, 138, 142, 25, 97, 139, 36, 87, 145]
5 [127, 143, 148, 129, 105, 128, 131, 149, 137, 71]
6 [66, 127, 135, 112, 32, 138, 88, 102, 76, 148]
7 [127, 117, 148, 110, 105, 111, 104, 116, 141, 28]
8 [145, 138, 148, 134, 137, 117, 130, 129, 120, 111]
9 [139, 29, 27, 130, 91, 60, 68, 66, 93, 148]
10 [138, 114, 143, 149, 129, 148, 125, 140, 113, 126]
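
As a possible variation on get_top_n, the sort-and-slice step can be replaced by heapq.nlargest, which avoids fully sorting each user's list. The function name `get_top_n_heap` and the toy predictions are hypothetical. The sketch relies only on each prediction unpacking into five fields (uid, iid, r_ui, est, details), as in the loop used above.

```python
import heapq
from collections import defaultdict

def get_top_n_heap(predictions, n=10):
    """Variant of get_top_n that keeps the n best (item, estimate) pairs per user."""
    by_user = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        by_user[uid].append((iid, est))
    # Build a fresh dict, so calling the function again never accumulates duplicates.
    return {uid: heapq.nlargest(n, pairs, key=lambda p: p[1])
            for uid, pairs in by_user.items()}

# Hypothetical predictions as plain 5-tuples (uid, iid, r_ui, est, details).
toy = [(1, 96, 0, 6.5, {}), (1, 30, 0, -1.1, {}), (1, 48, 0, 5.4, {}), (2, 121, 0, 6.7, {})]
print(get_top_n_heap(toy, n=2))  # {1: [(96, 6.5), (48, 5.4)], 2: [(121, 6.7)]}
```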


#### Top Jokes for each User

In [24]:
# Printing the text of the top jokes for each user
for uid, user_ratings in take(10, top_n.items()):
    print("For User", uid)
    for (iid, _) in user_ratings:
        print(iid)
        # iid - 1 because the dataframe's row index starts at 0, not 1
        # (this assumes row k of jokes holds the joke with ItemID k + 1)
        print(jokes.loc[int(iid) - 1, "Joke"])

For User 1
96
Two attorneys went into a diner and ordered two drinks. Then they produced sandwiches from their briefcases and started to eat. The owner became quite concerned and marched over and told them, "You can't eat your own sandwiches in here!" The attorneys looked at each other, shrugged their shoulders and then exchanged sandwiches.
114
Sherlock Holmes and Dr. Watson go on a camping trip, set up their tent, and fall asleep. Some hours later, Holmes wakes his faithful friend. "Watson, look up at the sky and tell me what you see." Watson replies, "I see millions of stars." "What does that tell you?" Watson ponders for a minute. "Astronomically speaking, it tells me that there are millions of galaxies and potentially billions of planets. Astrologically, it tells me that Saturn is in Leo. Timewise, it appears to be approximately a quarter past three. Theologically, it's evident the Lord is all-powerful and we are small and insignificant. Meteorologically, it seems we will have a b