# Recommender Systems Using Pre-Trained Transformer

## User-User collaborative filtering 

A User-User model tries to identify users with similar "interaciton profiles" and recommends the top rated items among them. For e.g. to recommend a movie to user A, we find all the other users who have similar likings and then recommend their top rated movies to user A. We decided to implement this mechanism by exploiting SBERT which is a Sentence-Transformer. The key idea is to use the encodings from a transformer as it takes into account the positional encodings which are a crucial for any recommender system. 


### Import Libraries

In [None]:
pip install -U sentence-transformers

In [14]:
from sentence_transformers import SentenceTransformer, util
from csv import reader
import csv
from collections import defaultdict
import random
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import numpy as np
from torch.nn.functional import softmax
import ast, sys
import torch
from re import search

### Data Structure to store movie, its rating and its timestamp

In [22]:
class Movie:
    def __init__(self, mov_id, rating, timestamp, mov_name = "UKN"):
        self.mov_id = mov_id
        self.rating = float(rating)
        self.timestamp = timestamp
        self.mov_name = mov_name
  
    def __lt__(self, other):
        if self.rating != other.rating:
            return self.rating > other.rating
        return int(self.mov_id) > int(other.mov_id)

### Load Dataset and for each user sort the movies by their ratings (lexicographically to break ties)

In [15]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [23]:
FIRST_N_USERS = 10000  #only want to use the data for a limited number of users
SKIP_FIRST_N = 0
movName = defaultdict(str)
with open('/content/drive/MyDrive/movies.csv', 'r', encoding='utf-8') as read_obj:
    csv_reader = reader(read_obj)
    header = next(csv_reader)
    if header != None:
        for mov_id, name, genre in csv_reader:
            movName[mov_id] = name

ratings = defaultdict(dict)

avg_rating = 0
num_raitings = 0

with open('/content/drive/MyDrive/ratings.csv', 'r') as read_obj:
    csv_reader = reader(read_obj)
    header = next(csv_reader)
    if header != None:
        for usr_id, mov_id, rating, ts in csv_reader:
            if int(usr_id) < SKIP_FIRST_N:
              continue
            if int(usr_id) >  FIRST_N_USERS + SKIP_FIRST_N:
                break
            if mov_id not in ratings[usr_id]:
              ratings[usr_id][mov_id] = []
            ratings[usr_id][mov_id] = Movie(mov_id, rating, ts, movName[mov_id])
            avg_rating += float(rating)
            num_raitings += 1

avg_rating /= num_raitings
print(f"Average Rating: {avg_rating}")

user_bias = defaultdict(float)
for usr in ratings:
    all_usr_ratings = [mov.rating for mov in ratings[usr].values()]
    avg = sum(all_usr_ratings) / len(all_usr_ratings)
    user_bias[usr] = avg - avg_rating
#print(user_bias["1"])

#print(len(ratings["10"]))

Average Rating: 3.544125665169062


### Create sequence of movies for each user to feed to the Transformer

In [24]:
sequences = defaultdict(str) # to prevent computing sequences every time
seq_to_usr = defaultdict(str) 
RCAP = 15
for usr in ratings:
    movs = list(ratings[usr].values())
    movs.sort()
    seq = []
    cap, range_cap = RCAP, 5 # maximum 15 movies from each rating range (1->(0,1] ..5->(4,5])
 
    i = 0
    while i < len(ratings[usr]) and range_cap > 0:
        while i < len(ratings[usr]) and range_cap - 1 < movs[i].rating <= range_cap and cap > 0:
            seq.append(f"\"{movs[i].mov_id}\":{str(movs[i].rating)}")
            cap -= 1
            i += 1
        #print("No[e ================================")
        cap = RCAP
        range_cap -= 1
        while i < len(ratings[usr]) and movs[i].rating > range_cap:
            i += 1

    seq = ", ".join(seq)
    #print(seq)
    sequences[usr] = seq
    seq_to_usr[seq] = usr

### Helper methods to find top N similar users based on movie ratings and get recommended movies

In [25]:
model = SentenceTransformer('stsb-distilbert-base')

def getNSimilarUsers(user, user_sequence, n):
    save_seq = sequences[user]
    del sequences[user]

    #print(user_sequence)
    
    s2 = list(sequences.values())
    #print(s2)
    embeddings1 = model.encode(user_sequence, convert_to_tensor=True)
    embeddings2 = model.encode(s2, convert_to_tensor=True)
    #print(embeddings1, embeddings1.size())
    sequences[user] = save_seq

    cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)
    l = cosine_scores[0]
  
    ret = []
    for i, seq in enumerate(s2):
        score = l[i].item()
        # if score < 0.6:
        #     continue
        ret.append((seq_to_usr[seq], score))
  
    ret.sort(key = lambda x: -x[1])
    n = min(n, len(ret))
    return ret[:n]

In [26]:
def getRecommendations(user, neighs):
  movs = []
  for n, _ in neighs:
    nmov = sequences[n].split(', ')
    for m in nmov:
      mov, rating = m.split(":")
      mov = mov[1:len(mov) - 2]
      #print(mov, rating)
      if float(rating) < 4 or mov not in movName:
        break
      
      movs.append((movName[mov], n))
      if len(movs) >= 5:
        break
    if len(movs) >= 5:
      break 
      #print(movName[mov], rating)
  print("The top 5 recommended movies are: \n")
  for i, m in enumerate(movs):
    print(f"{i + 1}. {m[0]} rated highly by {m[1]}")

### Recommendations

In [29]:
NUM_NEIGH = 5
uid = ""
uid = input("Please Enter user id or 0 for a random user selection ")

if uid == "0":
  uid = str(random.sample(range(SKIP_FIRST_N, SKIP_FIRST_N + FIRST_N_USERS + 1), 1)[0])

print(f"The user selected: {uid}")
seq = sequences[uid]
neighProb = getNSimilarUsers(uid, seq, NUM_NEIGH) #neighbors with scores
print(f"Closest Users to {uid} with scores: {neighProb}")
getRecommendations(uid, neighProb)

Please Enter user id or 0 for a random user selection 0
The user selected: 9239
Closest Users to 9239 with scores: [('5089', 0.9673824310302734), ('5281', 0.9608524441719055), ('2811', 0.959283709526062), ('9546', 0.9591524004936218), ('5440', 0.9579577445983887)]
The top 5 recommended movies are: 

1. Shock Treatment (1981) rated highly by 5089
2. Last Wave, The (1977) rated highly by 5089
3. Back to the Beach (1987) rated highly by 5089
4. On the Beach (1959) rated highly by 5089
5. Bound (1996) rated highly by 5089


### Validation

For validation we decided to choose random users from the dataset. For each of the test user, we do a random 70:30 split on the movies rated by them. We use 70 percent of movies and their ratings to find similar users. Then we use the 30 percent to see if the similar users have rated those movies. If they have, we compare the ratings done by the similar users with that of test user. This gives us an estimate how well our method works.

In [30]:
# helper method to get the split
def get_test_train(user):
    movs = [x.mov_id for x in ratings[user].values()]
    rand_inds = set(random.sample(range(0, len(movs)), int(0.7 * len(movs))))
    #print(rand_inds, len(movs))
    train, test = [], [] #train_test_split(movs, train_size = 0.8, test_size=0.2)
    for i, mov in enumerate(movs):
        if i in rand_inds:
            train.append(f"\"{mov}\":{ratings[user][mov].rating}")
        else:
            test.append(mov)
        user_seq = ", ".join(train)
    return user_seq, test

def get_weighted_avg(l):
    weights = [((len(l) - i)**2) for i in range(0, len(l))]
    sum_w = sum(weights)
    return sum([(val * w) / sum_w for val, w in zip(l, weights)])

def error(user, test, neighs):
    mov_to_rat = defaultdict(list)
    for mov_id in test:
        for neigh in neighs:
            if mov_id not in ratings[neigh]:
                continue
            mov_to_rat[mov_id].append(ratings[neigh][mov_id].rating - user_bias[neigh]) # subtract neighbor bias

    if not mov_to_rat:
        print("NO COMMON MOVIES FOUND")
        return 1.0
    #print(f"Movies: {mov_to_rat}")
    e = []
    for mov in mov_to_rat:
        pred = min(5.0, get_weighted_avg(mov_to_rat[mov]) + user_bias[user]) # adding the user bias
        actual = ratings[user][mov].rating
        #print(f"Mov: {mov}, Pred: {pred}, Actual: {actual}")
        e.append((pred - actual)**2)

    rmse = (sum(e) / len(e))**0.5
    return rmse

In [31]:
NUM_NEIGH = 5
test_users = random.sample(range(SKIP_FIRST_N, SKIP_FIRST_N + FIRST_N_USERS + 1), 5)
print(f"Test Users : {test_users}")
errors = []
for user in test_users:
    user = str(user)
    user_sequence, test = get_test_train(user)
    neighProb = getNSimilarUsers(user, user_sequence, NUM_NEIGH) #neighbors with scores
    print(f"Closest Users to {user} with scores: {neighProb}")
    neighs = [x[0] for x in neighProb]
    err = error(user, test, neighs)
    errors.append(err)
    print(f"RMSE: {err}")

sme = sum(errors)
print(f"Avg. RMSE: {sme/len(errors)}")

Test Users : [6864, 4775, 3302, 9859, 2011]
Closest Users to 6864 with scores: [('669', 0.9322478175163269), ('8691', 0.9301897287368774), ('3638', 0.9275006055831909), ('8663', 0.9233279228210449), ('3881', 0.9200374484062195)]
NO COMMON MOVIES FOUND
RMSE: 1.0
Closest Users to 4775 with scores: [('728', 0.9188240170478821), ('1565', 0.9157223701477051), ('3331', 0.9130793213844299), ('6560', 0.9024902582168579), ('6402', 0.899236261844635)]
RMSE: 0.9818452264634624
Closest Users to 3302 with scores: [('9106', 0.9255680441856384), ('1573', 0.9188029766082764), ('2905', 0.9187506437301636), ('7026', 0.9171526432037354), ('7859', 0.9170346260070801)]
RMSE: 1.2944710070503838
Closest Users to 9859 with scores: [('2574', 0.9513771533966064), ('1267', 0.9461947679519653), ('9990', 0.940888524055481), ('7562', 0.9384541511535645), ('5186', 0.9380144476890564)]
RMSE: 0.3250000000000002
Closest Users to 2011 with scores: [('8568', 0.943993330001831), ('6388', 0.9409623146057129), ('4283', 0.94

## Item-Item collaborative filtering 

Use movie plots to find movies with similar plots.


### Data structure to store info about movies

In [16]:
class Movie_Info:
    def __init__(self, mov_id, mov_name, genre):
        self.mov_id = mov_id
        self.mov_name = mov_name
        self.genre = genre

### Load data and extract information

In [17]:
mov_to_info = defaultdict(Movie_Info)

tsv_file = open("/content/drive/MyDrive/movie.metadata.tsv", encoding='utf-8')
read_tsv = csv.reader(tsv_file, delimiter="\t")

mov_to_summary = defaultdict(str)
f = open("/content/drive/MyDrive/plot_summaries.txt", "r", encoding='utf-8')
for line in f:
    tokens = line.strip().split("\t")
    mov_to_summary[tokens[0]] = tokens[1].split(". ")
#print(list(mov_to_summary.values())[5])

#print(len(mov_to_summary.values()), len(mov_to_summary.keys()), len(mov_to_summary))

#genres_to_mov = defaultdict(set)
for row in read_tsv:
    if row[0] not in mov_to_summary:
        continue
    gnrs = set(ast.literal_eval(row[8]).values())
    
#     for gnr in gnrs:
#         genres_to_mov[gnr].add(row[0])
    mov = Movie_Info(row[0], row[2], gnrs)
    mov_to_info[row[0]] = mov
    
# make sure to only have movies with summaries    
print(f"numMovies: {len(mov_to_info)}, numSummaries: {len(mov_to_summary)}")

numMovies: 42207, numSummaries: 42306


### Using Sentence Transformer to make Recommendations

In [18]:
#print(genres_to_mov['Thriller'])
#print(len(mov_to_info))

model = SentenceTransformer('stsb-distilbert-base')

#helper methods
def print_movie_info(mov_id):
    mov = mov_to_info[mov_id]
    print(f"\nMovie info, Id: {mov_id}, Name: {mov.mov_name}, Genre: {mov.genre}\nSummary: \n{getSummary(mov_to_summary[mov_id])}\n")
    
def getSummary(seq):
    return ". ".join(seq)

def getFinalScore(cosine_scores):
    #print(cosine_scores)
    max_from_rows, _ = torch.max(cosine_scores, dim=1)
    #print(max_from_rows)
    return torch.mean(max_from_rows)

# parsing all the movies doesn't make sense so only taking movies with similar genres
def filterSimilarGenreMov(mov_id):
    ret = defaultdict(str)
    gen = mov_to_info[mov_id].genre
    #print(mov_to_info[mov_id].genre)
    for mid, mov in mov_to_info.items():
        if mov_id == mid:
            continue
        curgen = mov_to_info[mid].genre
        s_intersection = gen & curgen
        s_union = gen | curgen
        if len(s_intersection) / len(s_union) > 0.5: #len(s)/len(gen) > 0.5 # and len(s)/len(curgen) > 0.2:
            ret[mid] = mov_to_summary[mid]
    return ret
        

HBox(children=(FloatProgress(value=0.0, max=244715968.0), HTML(value='')))




In [19]:
def recommendMovies(mov_id, top_k):
    print_movie_info(mov_id)
    save_summary = mov_to_summary[mov_id]
    del mov_to_summary[mov_id]
    
    mov_with_score = []
    
    seq1 = save_summary
    
    sim_genre_movs = filterSimilarGenreMov(mov_id)
    #print(len(sim_genre_movs),len(mov_to_summary))
    for mov, seq2 in sim_genre_movs.items():
        embeddings1 = model.encode(seq1, convert_to_tensor=True)
        embeddings2 = model.encode(seq2, convert_to_tensor=True)

        cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)
        
        final_score = getFinalScore(cosine_scores).item()
        #print(final_score)
        mov_with_score.append((mov, final_score))
    
    mov_to_summary[mov_id] = save_summary
    
    if top_k == 1:
        return max(mov_with_score, key = lambda x: x[1])
    
    mov_with_score.sort(key = lambda x: -x[1])
    
    return mov_with_score[:top_k]
#recommendMovies("9991132")

### Recommendations

In [20]:
TOP_K = 5
mov_id = ""

mov_id = input("Please Enter Movie id or 0 for a random movie selection or # for name search ")

if mov_id == "#":
    name = input("Please enter keyword for the movie ") 
    found = []
    ind = 0
    for mov in mov_to_info:
        #print(mov)
        if search(name, mov_to_info[mov].mov_name):
            found.append((ind, mov))
            ind += 1
    print(f"Enter your choice")
    for ind, mov in found:
        print(f"Name: {mov_to_info[mov].mov_name}, Id: {mov}, choice: {ind}")
    
    index = input()
    mov_id = found[int(index)][1]
    
elif mov_id == "0":
    #print("hehe")
    mov_id = random.choice(list(mov_to_info.keys()))

if mov_id not in mov_to_info:
    print("OOPS Movie not found")

movies = recommendMovies(mov_id, TOP_K)

Please Enter Movie id or 0 for a random movie selection or # for name search 0

Movie info, Id: 883446, Name: The Angry Red Planet, Genre: {'Horror', 'Science Fiction', 'Creature Film', 'Adventure', 'Cult'}
Summary: 
The rocketship MR-1 , returns to Earth after the first manned flight to Mars. Thought first lost in space, when the rocket reappeared, mission control couldn't raise the crew by radio. Its ground-crew land the rocket successfully by remote control. Two survivors are found aboard: Dr. Iris Ryan  and Colonel Tom O'Bannion , his arm covered by a strange alien growth. The mission report is recounted by Dr. Ryan as she attempts to find a cure for Col. O'Bannion's arm. While exploring Mars, Ryan was attacked by a carnivorous plant, which was killed by O'Bannion; They also discover, after mistaking its legs for trees, an immense bat/rat/spider creature, who is later repelled by a freeze ray fired by Weapons Officer Jacobs. When they return to their ship, the crew finds that their

In [21]:
print(f"Recommended Movies for {mov_to_info[mov_id].mov_name} are: \n")
for mov, score in movies:
    print(f"Movie Name: {mov_to_info[mov].mov_name}, Similarity score: {score}, Genre: {mov_to_info[mov].genre}")
    print("\nSummary:")
    print(". ".join(mov_to_summary[mov]))
    print("\n---------------------------------------\n")    
  

Recommended Movies for The Angry Red Planet are: 

Movie Name: Alien, Similarity score: 0.5250837206840515, Genre: {'Horror', 'Science Fiction', 'Sci-Fi Horror', 'Creature Film', 'Adventure', 'New Hollywood'}

Summary:

---------------------------------------

Movie Name: Dr. Who and the Daleks, Similarity score: 0.48248225450515747, Genre: {'Science Fiction', 'Cult', 'Adventure'}

Summary:

---------------------------------------

Movie Name: Leprechaun 4: In Space, Similarity score: 0.4564892649650574, Genre: {'Horror', 'Science Fiction', 'Supernatural', 'Monster movie', 'Creature Film', 'Adventure', 'Cult'}

Summary:
{{Plot}} The Leprechaun is on a desolate planet attempting to court a narcissistic princess named Zarina, whom he has kidnapped in a plot to marry her, then murder her father in order to become king of her home planet, Dominia. A group of space marines attack and the Leprechaun kills one of the marines, Lucky, with a lightsaber-like weapon. When another marine, Kowalski