In [1]:
import os
os.chdir('../movies')
from movieLens import MovieLens

In [2]:
ml = MovieLens()

# Algorithm

In [3]:
from surprise import Dataset
from surprise import Reader
from surprise import KNNWithMeans
import heapq
from collections import defaultdict
from operator import itemgetter

In [4]:
# Load the rating dataset
ratings = ml.ratings

# Method from the Surprise library to load the DataFrame 
# Define the Reader object to parse the dataframe
reader = Reader(rating_scale=(ratings['rating'].min(), ratings['rating'].max()))

# Load the dataframe as a ratings dataset
ratingsDataset = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Build the full trainset
trainSet = ratingsDataset.build_full_trainset()

## Item user rating matrix

Matriz en la que encontramos los ratings por película para cada uno de los usuarios existentes. Una columna para las películas y una fila en la que se encuentran todos los usuarios disponibles. El valor de la celda corresponde con el rating otorgado a cada una de las películas por el usuario correspondiente

In [5]:
# Cosine similarity function
sim_options = {'name': 'cosine',   # alternative: pearson
               'user_based': False, # compute  similarities between films
               'min_support':5      # minimum number of common items between users
               }

model = KNNWithMeans(sim_options=sim_options)
model.fit(trainSet)
simsMatrix = model.compute_similarities()

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.


## Look up similar items

Buscamos las k películas que el usuario de referencia haya valorado mejor

In [6]:
def getNeighbors(referenceUser,k,trainSet):
     
    referenceUserInnerID = trainSet.to_inner_uid(referenceUser) 

    # Get top N items rated
    # Sort the elements in decreasing order by score and select top N
    referenceUserRatings = trainSet.ur[referenceUserInnerID]
    #print(referenceUserRatings)
    kNeighbors = heapq.nlargest(k, referenceUserRatings, key=lambda t: t[1])
    
    return kNeighbors

In [7]:
# Reference user = user to recommend to
referenceUser = 1 

# Set the number of desired similar users
k = 10

In [8]:
# Get neighbours
kNeighbors = getNeighbors(referenceUser,10,trainSet)
kNeighbors

[(3, 5.0),
 (4, 5.0),
 (6, 5.0),
 (8, 5.0),
 (9, 5.0),
 (10, 5.0),
 (11, 5.0),
 (13, 5.0),
 (15, 5.0),
 (18, 5.0)]

De esta forma se seleccionan las k primeras películas con mayor puntuación, en este caso se seleccionan las 10 primeras, pero, no deberíamos seleccionar un número variable de k de tal forma que se seleccionen todas aquellas que tengan un rating máximo (5)?

## Candidate generation and scoring

Selecionamos las películas que podríamos recomendar en primera instancia. Para ello normalizamos los ratings y multiplicamos por el coeficiente de semejanza entre las películas

In [9]:
# Get similar items to stuff we liked (weighted by rating)
candidates = defaultdict(float)
for itemID, rating in kNeighbors:
    similarityRow = simsMatrix[itemID]
    for innerID, score in enumerate(similarityRow):
#         print(score)
#         print(rating)
        candidates[innerID] += score*(rating/5.0)

# Sort the candidates by score
candidates = sorted(candidates.items(), key=itemgetter(1), reverse=True)
#candidates

El score corresponde con la medida cosine similarity entre las películas calculada en la matriz de similaridad, mientras que rating corresponde con la valoración del usuario

## Candidate filtering and recommendations

Filtramos aquellas recomendaciones con un score pequeño y que ya haya visto el usuario. Para ello utilizamos un set porque únicamente nos interesa saber los items que el reference user ya ha visto, plus es un objeto eficiente para datasets largos

In [13]:
def filterRec(referenceUser,trainSet,k,candidates):
    
    # Get top N similar users to our reference user
    referenceUserInnerID = trainSet.to_inner_uid(referenceUser)
    
    # Build a set of movies the user has already seen
    watched = set(trainSet.ur[referenceUserInnerID])
    
    # Initialize a list to store the recommendations
    recommendations = []

    # Get top-rated items from similar users:
    pos = 0
    for itemID, ratingSum in candidates:
        if not itemID in watched:
            movieID = trainSet.to_raw_iid(itemID)
            recommendation = ml.getMovieName(int(movieID)), ratingSum
            recommendations.append(recommendation)
            pos += 1
            if (pos >= k):
                break            

    rec_movies = [rec[0] for rec in recommendations]
    return rec_movies

ratingSum represents the total similarity of the reference user to all other users who rated that item

In [14]:
# Results for the first approach
rec_movies = filterRec(referenceUser,trainSet,k,candidates)
rec_movies

['Basic Instinct',
 'Dr. No',
 'Good Morning, Vietnam',
 'Femme Nikita, La (Nikita)',
 'Green Mile, The',
 'From Russia with Love',
 'Seven Samurai (Shichinin no samurai)',
 'Office Space',
 'American History X',
 'Goodfellas']

# Metrics