# Mini project: simple movie recommendation

> We will create a simple movie recommendation system based on the ratings that users gave to movies. We will use a well-known dataset that contains movie ratings from many users, each of whom has rated more than one movie.

In [1]:
# Import pandas to play with the dataset
import pandas as pd

# We will use the MovieLens dataset which contains user ratings for movies.
# We have downloaded the dataset locally in the data folder.
ratings_filename = './data/u.data'

In [2]:
# The data is provided in a TSV form.
# This is like CSV, but the columns are tab-separated instead of comma-separated.
# Use the read_csv() function of pandas and specify the delimiter
all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None, 
                          names = ["UserID", "MovieID", "Rating", "Datetime"])

# The full dataset: 100000 ratings by 943 users on 1682 movies.
# Each user has rated many movies (at least 20)
# Users and items are numbered consecutively from 1.  
# The data is randomly ordered. 
# This is a tab separated list of: 
# user id | item id | rating | timestamp. 

print(f'Shape: {all_ratings.shape}')

# Sample:
all_ratings.head()

Shape: (100000, 4)


Unnamed: 0,UserID,MovieID,Rating,Datetime
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [3]:
# The last attribute appears to be a Unix timestamp (seconds since the epoch).
# Although we will not need it, let's convert it to a datetime object.
all_ratings["Datetime"] = pd.to_datetime(all_ratings['Datetime'], unit='s')
all_ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Datetime
0,196,242,3,1997-12-04 15:55:49
1,186,302,3,1998-04-04 19:22:22
2,22,377,1,1997-11-07 07:18:36
3,244,51,2,1997-11-27 05:02:03
4,166,346,1,1998-02-02 05:33:16


### We need to transform the dataset into the transaction-like format that we have seen in class.

> Also: this is a movie recommendation project based on a dataset with movie reviews, so we need to incorporate the reviews somehow. The review is the "Rating" attribute, which takes ordinal values from 1 to 5. How are we going to handle it?

In [4]:
# Attribute construction: 
# Create an additional attribute that we will call "Favorable"
# This will be true if the user gave a rating above 3

# Select all rows with rating > 3 and add the attribute Favorable = True. 
# Otherwise, Favorable = False
all_ratings["Favorable"] = all_ratings["Rating"] > 3
all_ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Datetime,Favorable
0,196,242,3,1997-12-04 15:55:49,False
1,186,302,3,1998-04-04 19:22:22,False
2,22,377,1,1997-11-07 07:18:36,False
3,244,51,2,1997-11-27 05:02:03,False
4,166,346,1,1998-02-02 05:33:16,False


In [5]:
# Let's use only a small sample: the users with IDs below 200
# (199 users, since user IDs start at 1). We keep all the ratings of those users, not 200 rating rows.

ratings = all_ratings[all_ratings['UserID'].isin(range(200))]
ratings

Unnamed: 0,UserID,MovieID,Rating,Datetime,Favorable
0,196,242,3,1997-12-04 15:55:49,False
1,186,302,3,1998-04-04 19:22:22,False
2,22,377,1,1997-11-07 07:18:36,False
4,166,346,1,1998-02-02 05:33:16,False
6,115,265,2,1997-12-03 17:51:28,False
...,...,...,...,...,...
99951,130,121,5,1997-10-07 18:59:06,True
99959,193,690,4,1998-03-05 18:40:21,True
99978,113,975,5,1997-10-04 03:40:24,True
99998,13,225,2,1997-12-17 22:52:36,False
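
Since user IDs start at 1, `range(200)` matches IDs 1 to 199. The sketch below uses a made-up `toy` frame (not part of the dataset) to show one way to check how many distinct users a filter keeps, and how an inclusive range could be written if exactly the first 200 users were wanted:

```python
import pandas as pd

# Hypothetical toy ratings; only the UserID column matters for this check.
toy = pd.DataFrame({"UserID": [1, 1, 199, 200, 201],
                    "MovieID": [10, 11, 10, 12, 10]})

# Same filter as in the notebook: matches IDs 0..199, i.e. users 1 and 199 here.
below_200 = toy[toy["UserID"].isin(range(200))]

# Alternative: IDs 1..200, both ends included.
first_200 = toy[toy["UserID"].between(1, 200)]

n_below_200 = below_200["UserID"].nunique()
n_first_200 = first_200["UserID"].nunique()
print(n_below_200, n_first_200)
```

Counting distinct users with `nunique()` is a cheap sanity check before mining, because the number of transactions is the denominator of every support value.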


In [6]:
# Our movie recommendation system will rely only on favorable movies.
# Out of the above sample, get the Favorable ratings.

# Filter the dataframe and get the rows that have Favorable = True
favorable_ratings = ratings[ratings["Favorable"]]
favorable_ratings
# about half (more or less..)

Unnamed: 0,UserID,MovieID,Rating,Datetime,Favorable
16,122,387,5,1997-11-11 17:47:39,True
20,119,392,4,1998-01-30 16:13:34,True
21,167,486,4,1998-04-16 14:54:12,True
26,38,95,5,1998-04-13 01:14:54,True
28,63,277,4,1997-10-01 23:10:01,True
...,...,...,...,...,...
99848,5,174,5,1997-09-30 16:15:30,True
99950,130,93,5,1997-09-22 18:41:05,True
99951,130,121,5,1997-10-07 18:59:06,True
99959,193,690,4,1998-03-05 18:40:21,True


### Transactions-like data
> Let's create one transaction per user. Keep in mind that a user did not watch all the movies they have rated at once!

In [7]:
# Now, for each user create a list with their favorite movies
# And put those lists in a bigger list in order to create an "itemset list"
# e.g. [
#        [id1, id2, id3, ...],               # Each row is a list of movies for a specific user
#        [id10, id1, id4, id3, ...],
#        ...
#      ]

# Does the above remind you of the HORIZONTAL DATA FORMAT?

# Group the favorable ratings by user ID and collect the list of movie IDs for each user
favorable_movies_by_user = [list(v.values) for k,v in favorable_ratings.groupby("UserID")["MovieID"]]
# print a couple of examples
print('Movies from the 1st user:')
print(favorable_movies_by_user[0])
print('')
print('Movies from the 2nd user:')
print(favorable_movies_by_user[1])


Movies from the 1st user:
[61, 33, 160, 20, 202, 171, 265, 47, 222, 253, 113, 227, 90, 64, 228, 121, 114, 132, 134, 98, 186, 221, 84, 60, 177, 174, 82, 56, 80, 229, 235, 6, 206, 76, 72, 185, 96, 258, 81, 212, 151, 51, 175, 107, 209, 108, 12, 14, 44, 163, 210, 184, 157, 150, 183, 248, 208, 128, 242, 193, 236, 250, 91, 129, 241, 267, 86, 196, 39, 230, 23, 224, 65, 190, 100, 154, 214, 161, 170, 9, 246, 22, 187, 135, 68, 146, 176, 166, 89, 249, 269, 32, 270, 133, 239, 194, 256, 93, 234, 1, 197, 173, 75, 268, 144, 119, 181, 257, 109, 182, 223, 46, 169, 162, 66, 77, 199, 57, 50, 192, 178, 87, 238, 156, 106, 115, 137, 127, 16, 79, 45, 48, 25, 251, 195, 168, 123, 191, 203, 55, 42, 7, 43, 165, 198, 124, 95, 58, 216, 204, 3, 207, 19, 18, 59, 15, 111, 52, 88, 13, 28, 172, 152]

Movies from the 2nd user:
[292, 251, 50, 297, 13, 303, 257, 316, 301, 313, 279, 299, 277, 282, 111, 295, 242, 283, 276, 1, 14, 293, 310, 306, 25, 273, 311, 269, 255, 284, 237, 300, 100, 127, 285, 304, 272, 286, 275, 302]
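
The same horizontal format can also be built with a group-wise aggregation. This is a minimal sketch on a made-up `toy_favorable` frame (hypothetical data, same column names as above), not a replacement for the cell above:

```python
import pandas as pd

# Hypothetical favorable ratings: one row per (user, movie) pair.
toy_favorable = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 2, 3],
    "MovieID": [50, 174, 50, 181, 172, 174],
})

# One entry per user, holding the list of that user's favorable movies.
movies_per_user = toy_favorable.groupby("UserID")["MovieID"].agg(list)

# Horizontal data format: a list of transactions, one per user.
transactions = movies_per_user.tolist()
print(transactions)
```

Keeping the intermediate `movies_per_user` Series preserves the user ID as the index, which helps if a transaction has to be traced back to its user later.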


In [8]:
# We have an itemset list that contains the favorite movies for each user
# Run the apriori algorithm to get the frequent itemsets and the association rules
from apriori_algorithm import apriori

freqItemSet, rules = apriori(favorable_movies_by_user, minSup=0.2, minConf=0.9)
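
The `apriori_algorithm` module is not shown here, so the following is only a sketch of the two measures that `minSup` and `minConf` presumably threshold (assumption: relative support and rule confidence, as in the usual definitions). The helper names and the toy transactions are hypothetical:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(premise, conclusion, transactions):
    """support(premise ∪ conclusion) / support(premise)."""
    premise, conclusion = frozenset(premise), frozenset(conclusion)
    return support(premise | conclusion, transactions) / support(premise, transactions)


# Hypothetical transactions: each inner list is one user's favorable movies.
toy_transactions = [[50, 174, 195], [50, 174], [50, 181], [174, 195], [181]]

sup_50_174 = support({50, 174}, toy_transactions)              # 2 of 5 transactions
conf_195_to_174 = confidence({195}, {174}, toy_transactions)   # every user with 195 also has 174
print(sup_50_174, conf_195_to_174)
```

Under this reading, `minSup=0.2` keeps itemsets liked by at least 20% of the users, and `minConf=0.9` keeps rules whose conclusion holds for at least 90% of the users who match the premise.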

In [9]:
# Let's print the frequent k-itemsets
for key in freqItemSet:
    values = [list(x) for x in freqItemSet[key]]
    print(f'{key}: {values}\n')

1: [[127], [56], [238], [89], [294], [28], [25], [181], [1], [258], [204], [79], [237], [117], [96], [300], [121], [216], [269], [69], [172], [135], [22], [195], [423], [9], [276], [183], [318], [286], [196], [15], [12], [100], [50], [302], [210], [191], [176], [64], [174], [288], [7], [98], [357], [168], [222], [173], [313]]

2: [[98, 100], [96, 174], [56, 100], [50, 127], [64, 50], [89, 50], [50, 181], [174, 79], [1, 174], [50, 98], [50, 79], [181, 174], [89, 174], [56, 174], [64, 98], [50, 173], [50, 7], [50, 195], [258, 50], [1, 50], [98, 174], [64, 174], [56, 98], [50, 174], [100, 174], [100, 7], [1, 100], [195, 174], [100, 127], [50, 172], [50, 100], [56, 50], [172, 174], [172, 181]]

3: [[50, 172, 181], [50, 181, 174], [50, 174, 98], [172, 181, 174], [50, 195, 174], [56, 50, 174], [50, 172, 174], [50, 174, 79]]

4: [[50, 181, 172, 174]]
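
To illustrate how such a level-by-level result can be produced, here is a minimal sketch of a level-wise frequent itemset search. It is an assumption about the general technique, not the code of the imported module; `frequent_itemsets` and the toy data are hypothetical:

```python
from itertools import combinations


def frequent_itemsets(transactions, min_sup):
    """Level-wise search: returns {k: set of frozensets} with relative support >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def is_frequent(candidate):
        return sum(candidate <= t for t in transactions) / n >= min_sup

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if is_frequent(frozenset([i]))}
    result, k = {}, 1
    while current:
        result[k] = current
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent (Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c for c in candidates if is_frequent(c)}
    return result


toy_transactions = [[50, 174, 181], [50, 174], [50, 181], [174, 181, 50], [172]]
levels = frequent_itemsets(toy_transactions, min_sup=0.4)
for k, itemsets in levels.items():
    print(k, sorted(sorted(s) for s in itemsets))
```

The prune step is what keeps the search tractable: a candidate is only counted against the data if all of its smaller subsets were already found frequent.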



In [10]:
# Sort the rules by confidence (descending) and print them

ordered_rules = sorted(rules, key=lambda rule: rule[2], reverse=True)

for rule in ordered_rules:
    print(f'{rule[0]} --> {rule[1]} [conf: {rule[2]}]')

{50, 195} --> {174} [conf: 1.0]
{172, 181} --> {50} [conf: 0.9782608695652174]
{172, 181, 174} --> {50} [conf: 0.9761904761904762]
{56, 50} --> {174} [conf: 0.9565217391304348]
{195, 174} --> {50} [conf: 0.9545454545454546]
{50, 79} --> {174} [conf: 0.9523809523809523]
{172, 174} --> {50} [conf: 0.9423076923076923]
{195} --> {174} [conf: 0.9361702127659575]
{181, 174} --> {50} [conf: 0.9230769230769231]
{172} --> {50} [conf: 0.9152542372881356]
{172, 181} --> {174} [conf: 0.9130434782608695]
{50, 172, 181} --> {174} [conf: 0.9111111111111111]
{50, 172} --> {174} [conf: 0.9074074074074074]


In [11]:
# Let's print a few top rules in a more friendly way
for index in range(5):
    print(f'Rule #{index+1}')
    print(f'If a person recommends {ordered_rules[index][0]} they will also recommend {ordered_rules[index][1]}')
    print(f' - Confidence: {ordered_rules[index][2]}')
    print('')

Rule #1
If a person recommends {50, 195} they will also recommend {174}
 - Confidence: 1.0

Rule #2
If a person recommends {172, 181} they will also recommend {50}
 - Confidence: 0.9782608695652174

Rule #3
If a person recommends {172, 181, 174} they will also recommend {50}
 - Confidence: 0.9761904761904762

Rule #4
If a person recommends {56, 50} they will also recommend {174}
 - Confidence: 0.9565217391304348

Rule #5
If a person recommends {195, 174} they will also recommend {50}
 - Confidence: 0.9545454545454546



### Let's just make our recommendation system a bit better by providing the movie title

In [12]:
# There is an additional file in the dataset that provides the movie info

# -- Information about the (movies): "|" separated list of:
#    movie id | movie title | release date | video release date | IMDb URL | unknown |
#    Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy |
#    Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
#
# -- The last 19 fields are the genres, a 1 indicates the movie
#    is of that genre, a 0 indicates it is not; movies can be in
#    several genres at once. The movie ids are the ones used in the u.data dataset that we used above.

movie_name_filename = './data/u.item'
movie_name_data = pd.read_csv(movie_name_filename, delimiter="|", header=None, encoding = "mac-roman")
movie_name_data.columns = ["MovieID", "Title", "Release Date", "Video Release", "IMDB", "<UNK>", "Action", "Adventure",
                           "Animation", "Children's", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir",
                           "Horror", "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]

# Sample
movie_name_data.head()

# Note how the genre of the movie is encoded in the dataset.
# Instead of having a single feature of nominal values, it is one-hot encoded.
# We have a separate feature for each genre that takes the value 0 or 1.

Unnamed: 0,MovieID,Title,Release Date,Video Release,IMDB,<UNK>,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
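
As a small illustration of the one-hot encoding noted above, the sketch below turns the 0/1 genre columns back into a list of genre names per movie. The `toy_movies` frame and its three genre columns are made up for the example:

```python
import pandas as pd

# Hypothetical movie table with three one-hot encoded genre columns.
toy_movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Action": [0, 1],
    "Comedy": [1, 1],
    "Drama": [0, 0],
})
genre_columns = ["Action", "Comedy", "Drama"]

# For each row, keep the names of the genre columns whose value is 1.
genres = [
    [genre for genre in genre_columns if row[genre] == 1]
    for _, row in toy_movies.iterrows()
]
toy_movies["Genres"] = genres
print(toy_movies[["Title", "Genres"]])
```

A movie can end up with several genres or none at all, which is exactly why a single nominal column would not be enough.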


In [13]:
# Let's print the same rules by using the movie name
import copy
rules = copy.deepcopy(ordered_rules)

# Define a function to get movie name
def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data["MovieID"] == movie_id]["Title"]
    title = title_object.values[0]
    return title

# Print the rules again using the movie title this time
for index in range(10):
    print(f'Rule #{index+1}')
    premise = rules[index][0]
    conclusion = rules[index][1].pop()
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print(f'If a person recommends {premise_names} they will also recommend {conclusion_name}')
    print(f' - Confidence: {rules[index][2]}')
    print('')

Rule #1
If a person recommends Star Wars (1977), Terminator, The (1984) they will also recommend Raiders of the Lost Ark (1981)
 - Confidence: 1.0

Rule #2
If a person recommends Empire Strikes Back, The (1980), Return of the Jedi (1983) they will also recommend Star Wars (1977)
 - Confidence: 0.9782608695652174

Rule #3
If a person recommends Empire Strikes Back, The (1980), Return of the Jedi (1983), Raiders of the Lost Ark (1981) they will also recommend Star Wars (1977)
 - Confidence: 0.9761904761904762

Rule #4
If a person recommends Pulp Fiction (1994), Star Wars (1977) they will also recommend Raiders of the Lost Ark (1981)
 - Confidence: 0.9565217391304348

Rule #5
If a person recommends Terminator, The (1984), Raiders of the Lost Ark (1981) they will also recommend Star Wars (1977)
 - Confidence: 0.9545454545454546

Rule #6
If a person recommends Star Wars (1977), Fugitive, The (1993) they will also recommend Raiders of the Lost Ark (1981)
 - Confidence: 0.9523809523809523

...

## How would you use such a system?
> Example:
You browse movies on your streaming platform and choose to watch "The Terminator".
Given rule 8 ({195} --> {174}, confidence ≈ 0.94), a recommendation could be "Raiders of the Lost Ark", phrased for example as:
"Users also liked Raiders of the Lost Ark!"

*Note:* This is a simple recommendation engine that is based on "market basket analysis", and frequent itemsets. It does not consider meta-data and other characteristics or features that are based on user preferences. It relies on items that appear in the same "transaction". The transaction here (if we were to simulate the market basket analysis) is the list of favorable movies of each user.
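
The usage scenario above could be sketched as a small lookup over the mined rules. This assumes each rule is a `(premise, conclusion, confidence)` triple, as the indexing `rule[0]`, `rule[1]`, `rule[2]` suggests; the `recommend` function and its `top_n` parameter are hypothetical:

```python
def recommend(liked_movies, rules, top_n=3):
    """Return (movie_id, confidence) pairs from rules whose premise is fully liked."""
    liked = set(liked_movies)
    best = {}
    for premise, conclusion, conf in rules:
        if set(premise) <= liked:
            # Suggest only movies the user has not already liked.
            for movie in set(conclusion) - liked:
                best[movie] = max(conf, best.get(movie, 0.0))
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# A few of the rules printed earlier, as (premise, conclusion, confidence) triples.
example_rules = [
    ({50, 195}, {174}, 1.0),
    ({172, 181}, {50}, 0.9782608695652174),
    ({195}, {174}, 0.9361702127659575),
    ({172}, {50}, 0.9152542372881356),
]

# The user liked "The Terminator" (ID 195): rule {195} --> {174} applies.
suggestions = recommend({195}, example_rules)
print(suggestions)
```

When several rules point to the same movie, this sketch keeps the highest confidence; other choices, such as averaging or weighting by support, are equally plausible.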