## Frequent items - A-Priori algorithm

In this notebook, a recommender system based on the A-Priori algorithm is developed. Since the `ratings.csv` file lists users together with the movies they have rated, each user's movies can be treated as a basket from which frequent itemsets are mined.

The idea is to create baskets of movies that have been rated highly. From these baskets, the frequent itemsets can be found and association rules created. Given an input movie from the user, the rules whose antecedent contains that movie are looked up, and the movies in their consequents are printed, sorted by a metric. This metric can be either `confidence` or [`lift`](https://en.wikipedia.org/wiki/Lift_(data_mining)).
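To make the two metrics concrete, here is a minimal sketch that computes support, confidence and lift by hand. The baskets and the helper names (`support`, `confidence`, `lift`) are hypothetical and are not part of the notebook's pipeline.

```python
# Toy baskets of liked movies (hypothetical data, for illustration only)
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "D"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence divided by the consequent's support; 1.0 means independence."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

conf_ab = confidence({"A"}, {"B"}, baskets)
lift_ab = lift({"A"}, {"B"}, baskets)
print(f"A -> B  confidence: {conf_ab:.3f}  lift: {lift_ab:.3f}")
```

A lift below 1, as in this toy case, indicates that liking A makes liking B slightly less likely than it is overall, even though the confidence looks high.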

In [19]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

import re

# We tried implementing the A-Priori algorithm ourselves, but our PCs could not complete the counting.
# As a result, we used the mlxtend library
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

We are going to use the [`ratings.csv`](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=ratings.csv) dataset and the [`movies_metadata.csv`](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv) dataset from The Movies Dataset:

In [246]:
# Load the ratings dataframe
ratings_df = pd.read_csv('../Data/movies_dataset/ratings.csv')

# Keep only the columns of interest
columns_to_keep = ['userId', 'movieId', 'rating']
ratings_df = ratings_df[columns_to_keep]

# Also load the movies dataframe to link the movieIds to titles
movies_df = pd.read_csv('../Data/movies_dataset/movies_metadata.csv', low_memory=False)
movies_df = movies_df.dropna(subset=['vote_count', 'id']) # remove na

# Since the id is a column of interest, it is better to have it as an int for locating movies
movies_df[['id']] = movies_df[['id']].astype(int)

# Remove duplicates, since several movies can share a title but differ in release year
movies_df = movies_df.drop_duplicates(subset=['title', 'original_title'], keep='last')

# Keep only movies with english as original language
movies_df = movies_df.loc[movies_df.original_language=='en']

Now, we want to have only the movies in the `ratings` dataframe that have a matching id and title in the `movies` dataframe:

In [21]:
# Create a list of all the movies with titles from the movies dataframe
movies_df_ids = movies_df.id.to_list()

# Keep only the ids in the ratings dataframe that exist in the list of the ids from the movies df
ratings_df =  ratings_df.loc[ratings_df.movieId.isin(movies_df_ids)]
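
One thing worth checking at this step: according to the dataset's description on Kaggle, `movieId` in `ratings.csv` is a MovieLens id, whereas `id` in `movies_metadata.csv` is a TMDB id, and a separate `links.csv` file maps one to the other. If that holds, matching the two columns directly pairs ratings with unrelated titles. The sketch below shows the mapping on toy frames; `links.csv`, its `movieId`/`tmdbId` columns, and all `*_toy` names are assumptions that do not appear in this notebook.

```python
import pandas as pd

# Toy stand-ins for the real files; the values are made up for illustration.
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],    # assumed to be MovieLens ids
    "rating": [4.0, 2.5, 5.0],
})
links_toy = pd.DataFrame({
    "movieId": [10, 20],
    "tmdbId": [100.0, 200.0],   # assumed to be TMDB ids, read as floats
})
movies_toy = pd.DataFrame({"id": [100], "title": ["Some Movie"]})

# Translate MovieLens ids to TMDB ids before matching against the metadata
links_clean = links_toy.dropna(subset=["tmdbId"]).astype({"tmdbId": int})
ratings_mapped = ratings_toy.merge(links_clean, on="movieId", how="inner")

# Keep only the ratings whose TMDB id exists in the metadata frame
ratings_matched = ratings_mapped.loc[ratings_mapped.tmdbId.isin(movies_toy.id)]
print(ratings_matched)
```

With such a mapping, the baskets would hold `tmdbId` values, so the later title lookups against `movies_df.id` would refer to the movie that was actually rated.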

Now, to apply the A-Priori algorithm, we need the list of items for each user. So for each `userId`, we need to get the movies they have rated. A rating threshold is also set so that movies the user did not rate well are excluded, and users with only a few ratings are dropped.

In the following cell, the cutoff for the number of rated movies is found. We keep the users who have rated at least as many movies as a chosen percentile of the per-movie rating counts (`vote_count`):

In [22]:
movie_rating_counts =  movies_df.vote_count.to_numpy()

q = 70 # percentile over which to keep
counts_perc = np.percentile(movie_rating_counts, q=q).astype(int)
print(f"The {q}'th percentile of the movie counts rating is {counts_perc} movies rated")

The 70'th percentile of the movie counts rating is 28 movies rated


The next step in creating the baskets is to keep only the movies that each user has rated at or above a threshold. This ensures that the movies that end up in the baskets are liked by the users.

In [23]:
rating_threshold = 3.5 # Min movie rating to be considered liked

In [24]:
# Collect the liked movies of each sufficiently active user in a list
movies_per_user = []
for _, user_ratings in tqdm(list(ratings_df.groupby('userId'))):

    # Skip users that have rated fewer movies than the cutoff
    if len(user_ratings) >= counts_perc:
        # Keep only the movies rated at or above the threshold
        good_movies_by_user = user_ratings.loc[user_ratings.rating >= rating_threshold, 'movieId']

        movies_per_user.append(good_movies_by_user.to_numpy().astype(int))

  0%|          | 0/260554 [00:00<?, ?it/s]
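
The loop above visits every user group in Python, which is slow for hundreds of thousands of users. As a rough sketch of a vectorized alternative, the same filtering can be expressed with `groupby` and boolean masks. The toy frame and the names `min_ratings`, `active`, `liked` and `baskets_toy` are hypothetical.

```python
import pandas as pd

# Toy ratings; user 3 has rated too few movies to be kept (made-up values)
ratings_toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 40, 10],
    "rating":  [4.0, 3.0, 5.0, 3.5, 4.5, 1.0, 5.0],
})
min_ratings = 3          # plays the role of `counts_perc`
rating_threshold = 3.5

# Count every rating of a user, liked or not, as the loop above does
n_rated = ratings_toy.groupby("userId")["movieId"].transform("count")
active = ratings_toy.loc[n_rated >= min_ratings]

# Keep the liked movies and collect them into one list per user
liked = active.loc[active.rating >= rating_threshold]
baskets_toy = liked.groupby("userId")["movieId"].apply(list)
print(baskets_toy.to_dict())
```

One difference from the loop: an active user with no liked movie produces an empty basket in the loop but is simply absent here.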

In [25]:
# Find unique movies id's list, to use for one-hot-encoding
all_movies_ids = np.concatenate(movies_per_user)
all_movies_ids = sorted(set(all_movies_ids.tolist())) # ascending order, not essential

In [26]:
# We can create a dataframe with the following columns: movie_id, movie_title
movie_titles = []
movie_original_titles = []
movie_ids = []

for id_ in tqdm(all_movies_ids):
    # Look the movie up once and reuse the matching row
    match = movies_df.loc[movies_df['id']==id_]
    if not match.empty:
        movie_titles.append(match.title.item())
        movie_original_titles.append(match.original_title.item())
        movie_ids.append(id_)

  0%|          | 0/4507 [00:00<?, ?it/s]

Create a dataframe that holds the movies that will go into the basket analysis:

In [27]:
movie_titles_id_df = pd.DataFrame.from_dict({'title':movie_titles, 'original_title':movie_original_titles, 'id':movie_ids})
movie_titles_id_df

| | title | original_title | id |
|---|---|---|---|
| 0 | Four Rooms | Four Rooms | 5 |
| 1 | Judgment Night | Judgment Night | 6 |
| 2 | Star Wars | Star Wars | 11 |
| 3 | Finding Nemo | Finding Nemo | 12 |
| 4 | Forrest Gump | Forrest Gump | 13 |
| ... | ... | ... | ... |
| 4502 | Cheap Thrills | Cheap Thrills | 175291 |
| 4503 | Fratricide | Brudermord | 175331 |
| 4504 | These Birds Walk | These Birds Walk | 175427 |
| 4505 | Enter the Dangerous Mind | Enter the Dangerous Mind | 176077 |


Now, all the unique ids are in the column `id`, and the index can be used as a hashcode. But first, a list of baskets needs to be created. The starting point is the list `movies_per_user`; however, not every movie in it has an entry in the `movie_titles_id_df` dataframe. As a result, the baskets are created by keeping, from each user's list, only the movies that exist in `movie_titles_id_df`:

In [28]:
print('Generating baskets...')
baskets = []
# A set gives constant-time membership tests
unique_movies_id = set(movie_titles_id_df.id.to_list())
for user in tqdm(movies_per_user):
    # Keep only the movies that have a title entry
    basket = [movie for movie in user if movie in unique_movies_id]
    baskets.append(basket)

baskets_df = pd.DataFrame.from_dict({'basket_no':np.arange(1, len(baskets)+1), 'baskets':baskets})
baskets_df.to_csv('baskets.csv', index=False)

Generating baskets...


  0%|          | 0/69837 [00:00<?, ?it/s]

In [29]:
def frozen_set_to_list(frznset):
    '''
    Simple function to parse the string form of an itemset from the frequent items
    dataframe, e.g. "(4993, 4226)" or "frozenset({4993, 4226})", into a list of movie ids
    '''
    return [int(i) for i in re.findall(r'\d+', frznset)]

We can now use the A-Priori algorithm to find the itemsets whose support is above a given threshold. The `mlxtend` library is used, which first requires one-hot encoding the transactions (baskets):

In [34]:
te = TransactionEncoder()
te_ary = te.fit_transform(baskets)

In [35]:
df_one_hot = pd.DataFrame(te_ary, columns=te.columns_)
df_one_hot

| | 5 | 6 | 11 | 12 | 13 | 14 | 15 | 16 | 18 | 20 | ... | 174645 | 174671 | 174675 | 175245 | 175287 | 175291 | 175331 | 175427 | 176077 | 176143 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | True | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69832 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 69833 | False | True | True | False | False | False | False | True | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 69834 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 69835 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
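
To illustrate what the one-hot encoding step produces, here is a small pure-Python sketch that builds the same kind of boolean matrix. It is only an illustration of the idea, not how `TransactionEncoder` is implemented, and the names `baskets_toy`, `columns` and `one_hot` are hypothetical.

```python
# Toy baskets of movie ids (made-up values)
baskets_toy = [[6, 16], [11], [6, 11, 16]]

# Sorted list of every distinct item; these become the columns
columns = sorted({movie for basket in baskets_toy for movie in basket})

# One row per basket, one boolean per column
one_hot = []
for basket in baskets_toy:
    members = set(basket)
    one_hot.append([movie in members for movie in columns])

print(columns)
for row in one_hot:
    print(row)
```

Each row has as many entries as there are distinct movies, which is why the real matrix above is very wide and mostly `False`.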


At this point, the minimum support is decided. The choice is constrained by the RAM the algorithm needs: the lower the support, the more candidate itemsets have to be held in memory:

In [36]:
min_support = 0.08
nm = str(min_support).replace('.', '_')

frq_items = apriori(df_one_hot, min_support=min_support, use_colnames=True)
frq_items['length'] = frq_items['itemsets'].apply(len)
frq_items.to_csv(f'freq_items_{nm}.csv', index=False)

In [37]:
frq_items

| | support | itemsets | length |
|---|---|---|---|
| 0 | 0.206724 | (6) | 1 |
| 1 | 0.132165 | (11) | 1 |
| 2 | 0.167862 | (16) | 1 |
| 3 | 0.163824 | (21) | 1 |
| 4 | 0.153357 | (25) | 1 |
| ... | ... | ... | ... |
| 6056 | 0.084855 | (4993, 4226, 296, 2762, 2959, 858) | 6 |
| 6057 | 0.086659 | (4993, 4226, 296, 2762, 2028, 2959) | 6 |
| 6058 | 0.080287 | (4993, 4226, 2762, 527, 2959, 318) | 6 |
| 6059 | 0.081905 | (4993, 4226, 2762, 2959, 858, 318) | 6 |
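
For readers who want to see the level-wise idea behind the table above, the following is a minimal pure-Python sketch of A-Priori on toy data. It is neither the implementation we originally attempted nor the one inside `mlxtend`; the function name `apriori_sketch` and the toy baskets are hypothetical, and the code is far too slow for the real dataset.

```python
from itertools import combinations

def apriori_sketch(baskets, min_support):
    """Return {frozenset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def frequent_among(candidates):
        # One pass over the baskets to count each candidate
        counts = {c: 0 for c in candidates}
        for basket in baskets:
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: single items
    current = frequent_among({frozenset([item]) for b in baskets for item in b})
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = frequent_among(candidates)
        frequent.update(current)
        k += 1
    return frequent

toy = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 4]]
freq_toy = apriori_sketch(toy, min_support=0.4)
for itemset, sup in sorted(freq_toy.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(sup, 2))
```

The prune step is what keeps the candidate set manageable: an itemset is only counted if all of its smaller subsets were already found frequent.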


In [38]:
assoc_rules = association_rules(frq_items, metric="confidence", min_threshold=0.6)
assoc_rules.to_csv('association_rules.csv', index=False)
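
As a rough illustration of how rules can be derived from frequent itemsets with a confidence threshold, consider the sketch below. It is not the `mlxtend` implementation; the `supports` dictionary holds made-up values and `rules_sketch` is a hypothetical helper.

```python
from itertools import combinations

# Supports of the frequent itemsets of a toy example (made-up values)
supports = {
    frozenset([1]): 0.8,
    frozenset([2]): 0.8,
    frozenset([3]): 0.6,
    frozenset([1, 2]): 0.6,
    frozenset([1, 3]): 0.4,
    frozenset([2, 3]): 0.4,
}

def rules_sketch(supports, min_confidence):
    """List (antecedent, consequent, confidence, lift) for rules above the threshold."""
    rules = []
    for itemset, sup in supports.items():
        # Every non-empty proper subset can act as an antecedent; its support is
        # available because all subsets of a frequent itemset are frequent too
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                consequent = itemset - antecedent
                conf = sup / supports[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, consequent, conf, conf / supports[consequent]))
    return rules

rules_toy = rules_sketch(supports, min_confidence=0.6)
for antecedent, consequent, conf, lift in rules_toy:
    print(sorted(antecedent), "->", sorted(consequent), round(conf, 3), round(lift, 3))
```

Note that a rule and its reverse can have different confidence: here `3 -> 1` passes the threshold while `1 -> 3` does not.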

Define the following helper function and the recommender that uses it:

In [43]:
def get_title_from_index(movies_df_in, idx):
    '''
    Helper function to return the title of a movie from its id
    '''
    return movies_df_in.loc[movies_df_in.id==idx].title.item()

In [241]:
def recommend_movies(movie_title_str, movies_df, association_rules_df):
    '''
    Print and return up to 5 movie titles recommended for the input title,
    taken from the rules whose antecedent contains the input movie
    '''
    # Rank the rules by lift, strongest association first
    assoc_rules = association_rules_df.sort_values(by=['lift'], ascending=False)
    try:
        user_movies_id = movies_df.loc[movies_df.title==movie_title_str].id.item()
    except ValueError:
        print(f"Movie: '{movie_title_str}' not found.")
        return []
    # Keep only the rules whose antecedent contains the input movie
    a = assoc_rules.loc[assoc_rules.antecedents.apply(lambda x: user_movies_id in x)]
    recomm = []
    for _, rule in a.iterrows():
        # Only the first movie of each consequent is recommended
        mv_title = get_title_from_index(movies_df, list(rule['consequents'])[0])
        if mv_title in recomm:
            continue
        recomm.append(mv_title)

        confidence = np.round(rule['confidence'], 3)
        lift = np.round(rule['lift'], 3)
        print(f"{mv_title}\t Confidence: {confidence}\t Lift: {lift}")
        if len(recomm) >= 5:
            break
    return recomm
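
The function above always ranks by `lift`. As a hedged sketch of how the ranking metric could be made selectable between `lift` and `confidence`, consider the following. The toy rules, the function `rank_consequents` and its `metric` and `top_n` parameters are hypothetical. Unlike `recommend_movies`, it returns ids rather than titles and considers every movie in a consequent.

```python
import pandas as pd

# Toy rules in the shape returned by `association_rules` (made-up values)
rules_toy = pd.DataFrame({
    "antecedents": [frozenset([1]), frozenset([1]), frozenset([1, 2]), frozenset([3])],
    "consequents": [frozenset([2]), frozenset([3]), frozenset([3]), frozenset([1])],
    "confidence":  [0.95, 0.65, 0.90, 0.80],
    "lift":        [1.2, 2.5, 1.9, 3.0],
})

def rank_consequents(rules_df, movie_id, metric="lift", top_n=5):
    """Return consequent ids of rules containing `movie_id`, best `metric` first."""
    if metric not in ("lift", "confidence"):
        raise ValueError("metric must be 'lift' or 'confidence'")
    # Rules whose antecedent contains the input movie
    matching = rules_df.loc[rules_df.antecedents.apply(lambda s: movie_id in s)]
    ranked = matching.sort_values(by=metric, ascending=False)
    seen = []
    for consequent in ranked.consequents:
        for item in sorted(consequent):
            if item not in seen:
                seen.append(item)
    return seen[:top_n]

by_lift = rank_consequents(rules_toy, 1, metric="lift")
by_confidence = rank_consequents(rules_toy, 1, metric="confidence")
print(by_lift, by_confidence)
```

The two metrics can order the same candidates differently, as the toy rules show.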


In [235]:
movies_to_try = ["Die Hard 2", "Terminator 3: Rise of the Machines", "Young and Innocent", "Reservoir Dogs", "Rocky Balboa"]

In [245]:
recommended_movies = recommend_movies('Cars', movies_df, assoc_rules)

The Thomas Crown Affair	 Confidence: 0.643	 Lift: 2.494
Once Were Warriors	 Confidence: 0.63	 Lift: 1.307
The Million Dollar Hotel	 Confidence: 0.643	 Lift: 1.067


In [242]:
for movie in movies_to_try:
    print(f"Input: {movie}\n\n Recommendations:\n")
    recommend_movies(movie, movies_df, assoc_rules)
    print('-'*50)

Input: Die Hard 2

 Recommendations:

Rope	 Confidence: 0.609	 Lift: 1.949
Say Anything...	 Confidence: 0.619	 Lift: 1.608
Young and Innocent	 Confidence: 0.629	 Lift: 1.536
Terminator 3: Rise of the Machines	 Confidence: 0.692	 Lift: 1.155
The Million Dollar Hotel	 Confidence: 0.68	 Lift: 1.128
--------------------------------------------------
Input: Terminator 3: Rise of the Machines

 Recommendations:

The Talented Mr. Ripley	 Confidence: 0.65	 Lift: 3.372
Loose Screws	 Confidence: 0.635	 Lift: 3.311
Sleepless in Seattle	 Confidence: 0.624	 Lift: 3.242
Shriek If You Know What I Did Last Friday the Thirteenth	 Confidence: 0.656	 Lift: 3.13
Beetlejuice	 Confidence: 0.6	 Lift: 3.121
--------------------------------------------------
Input: Young and Innocent

 Recommendations:

Shriek If You Know What I Did Last Friday the Thirteenth	 Confidence: 0.656	 Lift: 3.13
Terminator 3: Rise of the Machines	 Confidence: 0.689	 Lift: 3.065
5 Card Stud	 Confidence: 0.642	 Lift: 3.025
Beetlejuice