## Movie recommender - Using association rules

I will describe all the steps for deriving positive recommendations (negative recommendations are obtained in the same way).

In [399]:
# imports
import numpy as np
import pandas as pd
import mlxtend
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules




A utility function that finds a movie's id from its title

In [400]:
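def title_to_id(df, title):
    # case-insensitive substring match; note that `title` is treated as a regex
    matched = df.loc[df.title.str.contains(title, case=False)]
    print(matched)
    n_rows, _ = matched.shape
    if n_rows == 1:
        # take the single value explicitly; int() on a Series is deprecated
        return int(matched.movieId.iloc[0])
    print('Specify the title better')
    return None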

Loading datasets
> I drop the columns that are irrelevant for the research

In [401]:
movies = pd.read_csv('data/movies.csv', sep=',')
movies.drop('genres', inplace=True, axis=1)
# print(movies.shape)
# print(movies.dtypes)
# movies.head()

In [402]:
ratings = pd.read_csv('data/ratings.csv' )
ratings.drop('timestamp', inplace=True, axis=1)
# print(len(ratings.userId.unique()))
# print(ratings.describe())
# print(ratings.dtypes)
# print(ratings.shape)
# ratings.head()

Filtering the ratings that are treated as positive

In [403]:
positive_ratings = ratings[ratings.rating >=4.0].drop('rating', axis=1)
print(positive_ratings.shape)
positive_ratings.head()

(51568, 2)


    userId  movieId
4        1     1172
12       1     1953
13       1     2105
20       2       10
21       2       17


In [404]:
negative_ratings = ratings[ratings.rating <=2.0].drop('rating', axis= 1)
print(negative_ratings.shape)
negative_ratings.head()

(13385, 2)


   userId  movieId
3       1     1129
5       1     1263
6       1     1287
7       1     1293
9       1     1343


In [405]:
users = ratings.userId.unique()

I transform the ratings into a list of transactions

- user --> transaction index
- all ratings of a given user --> itemset 
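
As an illustration of this mapping, here is a minimal sketch on a toy table. It uses `groupby` instead of a loop over users; `toy_ratings` and `toy_transactions` are made-up names and data, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for `positive_ratings` (same two columns, invented values).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3],
    "movieId": [1172, 1953, 10, 17, 39, 296],
})

# One list of movie ids per user; users with no rows simply do not appear.
toy_transactions = (
    toy_ratings.groupby("userId")["movieId"]
    .apply(list)
    .tolist()
)
print(toy_transactions)
```

Each inner list is one transaction, and the user id is only used for grouping, so it does not appear in the result.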

In [406]:

# dict.fromkeys(users, list()) would make every user share ONE list,
# so each user gets their own list here
positive_transactions = {u: [] for u in users}
negative_transactions = {u: [] for u in users}

for i, row in positive_ratings.iterrows():
    positive_transactions[row.userId].append(row.movieId)


for i, row in negative_ratings.iterrows():
    negative_transactions[row.userId].append(row.movieId)


In [407]:
positive_transactions = []
negative_transactions = []

# one transaction (list of movie ids) per user; users without ratings of a given kind are skipped
for u in users:
    p_list = list(positive_ratings[positive_ratings.userId == u].movieId)
    if len(p_list) > 0:
        positive_transactions.append(p_list)
    n_list = list(negative_ratings[negative_ratings.userId == u].movieId)
    if len(n_list) > 0:
        negative_transactions.append(n_list)



### Creating positive recommender

I look up the ids of the two movies the user liked

In [408]:
pulp_id = title_to_id(movies, 'Pulp Fiction')
re_dogs_id = title_to_id(movies, 'Reservoir Dog')

     movieId                title
266      296  Pulp Fiction (1994)
     movieId                  title
880     1089  Reservoir Dogs (1992)


I transform the list of transactions into a boolean matrix (one row per transaction, one column per movie)

In [409]:
te = TransactionEncoder()

te_array = te.fit(positive_transactions).transform(positive_transactions)

# te_array
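
To make the shape of that boolean matrix concrete, here is a plain-Python sketch of the same one-hot idea on three toy transactions. It only mimics the encoding and is not how mlxtend implements it; all names and values are invented for the example.

```python
# Toy transactions: each inner list holds the movie ids one user rated positively.
toy_transactions = [[296, 1089], [296, 593], [296, 593, 1089]]

# Columns are the sorted distinct items found in any transaction.
columns = sorted({item for t in toy_transactions for item in t})

# One row per transaction; True where the transaction contains the column's item.
rows = [[item in set(t) for item in columns] for t in toy_transactions]

print(columns)
for row in rows:
    print(row)
```

Every row has one flag per movie, so the matrix is as wide as the number of distinct movies and mostly `False` on real data.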


The `apriori` function takes a `pd.DataFrame` as input

In [410]:
df = pd.DataFrame(te_array, columns=te.columns_)

I look for the support of the itemset (Pulp Fiction, Reservoir Dogs)

In [411]:
frequent_itemsets = apriori(df, min_support=0.05, use_colnames=True)
watched_idx = frequent_itemsets['itemsets'].apply(
            lambda x: x == frozenset({pulp_id, re_dogs_id}))

frequent_itemsets[watched_idx]

       support     itemsets
1027  0.135618  (296, 1089)


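I know that `min_support` can't be greater than about 0.136, because 0.135618 is the support of the itemset (Pulp Fiction, Reservoir Dogs), i.e. the fraction of transactions that contain both movies. With a higher threshold this itemset would not be frequent and no rule could start from it.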
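
The bound follows from how support behaves: adding items to an itemset can only keep its support the same or lower it. A small sketch with invented transactions (the `support` helper and the data are assumptions made for the example):

```python
# Toy transactions as sets of movie ids (invented values).
toy_transactions = [
    {296, 1089, 593},
    {296, 1089},
    {296, 593},
    {318},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

pair = support({296, 1089}, toy_transactions)         # 2 of 4 transactions
triple = support({296, 1089, 593}, toy_transactions)  # 1 of 4 transactions
print(pair, triple)
```

A rule from the pair to a third movie needs the three-movie itemset to be frequent, so in practice `min_support` has to sit below the pair's support, not just at it.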

I apply `apriori` to the transactions
- I set `min_support` = 0.07 (< 0.136), a value found by trial and error

In [412]:
frequent_itemsets = apriori(df, min_support=0.07, use_colnames=True)

#frequent_itemsets

Rules generation
- I set the `min_threshold` high enough not to generate too many rules

In [413]:
rules = association_rules(frequent_itemsets, metric= 'confidence', min_threshold=0.5)

In [414]:
watched_idx = rules['antecedents'].apply(
            lambda x: x == frozenset({pulp_id, re_dogs_id}))

propositions = rules[watched_idx].sort_values(by=['confidence'], axis=0, ascending=False)
propositions

      antecedents  consequents  antecedent support  consequent support   support  confidence      lift  leverage  conviction
2124  (296, 1089)        (593)            0.135618            0.356185   0.09389    0.692308  1.943676  0.045584    2.092399
1971  (296, 1089)        (318)            0.135618            0.408346  0.090909     0.67033  1.641574   0.03553    1.794685
2201  (296, 1089)        (858)            0.135618            0.265276  0.089419    0.659341  2.485492  0.053442    2.156771
2260  (296, 1089)       (2858)            0.135618            0.268256  0.087928    0.648352  2.416911  0.051548    2.080896
2180  (296, 1089)        (608)            0.135618            0.274218  0.083458    0.615385  2.244147  0.046269    1.887034
2253  (296, 1089)       (1213)            0.135618            0.149031  0.083458    0.615385  4.129231  0.063246    2.212519
2264  (296, 1089)       (2959)            0.135618            0.235469  0.081967    0.604396  2.566769  0.050033    1.932563
1037  (296, 1089)         (50)            0.135618            0.257824  0.080477    0.593407  2.301594  0.045511    1.825351
 893  (296, 1089)         (47)            0.135618            0.217586  0.078987    0.582418  2.676727  0.049478    1.873676
1379  (296, 1089)        (260)            0.135618            0.345753  0.078987    0.582418  1.684492  0.032096     1.56675


I take the 5 rules with the highest confidence. The consequents of these rules are the recommended movies.
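
The metric columns can be recomputed from the three support values of a rule. As a check, this sketch uses the standard definitions of confidence, lift, leverage and conviction on the numbers of the first rule in the table, (296, 1089) -> (593); the variable names are mine.

```python
# Values copied from the first row of the rules table: (296, 1089) -> (593).
antecedent_support = 0.135618
consequent_support = 0.356185
rule_support = 0.09389

# Share of transactions with the antecedent that also contain the consequent.
confidence = rule_support / antecedent_support
# How much more often the consequent appears given the antecedent than overall.
lift = confidence / consequent_support
# Observed joint support minus the support expected under independence.
leverage = rule_support - antecedent_support * consequent_support
# Ratio of expected to observed frequency of the rule failing.
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 3), round(lift, 3), round(leverage, 3), round(conviction, 3))
```

The results agree with the table up to the rounding of the printed inputs. Sorting by confidence favours movies that are popular overall, while lift corrects for that popularity.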

In [415]:
# consequents of the 5 most confident rules, merged into one set of movie ids
proposition_ids = propositions['consequents'].head(5)
proposition_ids = frozenset().union(*proposition_ids)
# proposition_ids

If you like **Pulp Fiction** and **Reservoir Dogs** then you will enjoy these 5 movies:

In [416]:
f = movies.movieId.isin(proposition_ids)
movies[f]

      movieId                             title
284       318  Shawshank Redemption, The (1994)
525       593  Silence of the Lambs, The (1991)
535       608                      Fargo (1996)
695       858             Godfather, The (1972)
2288     2858            American Beauty (1999)
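
The filtering, sorting and merging steps above could be bundled into one helper. The following is only a sketch: `recommend` is a hypothetical function, and `toy_rules` is an invented table that carries just the three columns the helper reads.

```python
import pandas as pd

def recommend(rules, watched, k=5):
    """Union of the consequents of the k most confident rules
    whose antecedent is exactly `watched`."""
    watched = frozenset(watched)
    matching = rules[rules["antecedents"].apply(lambda a: a == watched)]
    top = matching.sort_values(by="confidence", ascending=False).head(k)
    return frozenset().union(*top["consequents"])

# Toy rules table (invented values), shaped like the output of `association_rules`.
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({296, 1089}), frozenset({296, 1089}), frozenset({296})],
    "consequents": [frozenset({593}), frozenset({318}), frozenset({50})],
    "confidence": [0.69, 0.67, 0.40],
})

print(recommend(toy_rules, {296, 1089}, k=1))
```

Because the result is a union of consequents, it can hold fewer or more than `k` movies when consequents overlap or contain several items.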


### Creating negative recommender

Finding the movies to avoid if one disliked "The Mask" works the same way as the positive recommender, so I do not describe it in detail.

In [417]:
# transforming the data into boolean values

te = TransactionEncoder()

te_array = te.fit(negative_transactions).transform(negative_transactions)

df = pd.DataFrame(te_array, columns=te.columns_)

In [418]:
# applying `apriori` to the negative transactions
# "The Mask" alone has a support of 0.052, so min_support has to stay below that
frequent_itemsets = apriori(df, min_support=0.01, use_colnames=True)
# frequent_itemsets

In [419]:
# rules generation

rules = association_rules(frequent_itemsets, metric= 'confidence', min_threshold=0.25)
#rules


In [420]:
# getting the id of the movie
mask_id = title_to_id(movies, 'mask')
mask_id = 367  # the search is ambiguous, so I set the id of the right movie by hand

      movieId                                              title
331       367                                   Mask, The (1994)
1404     1801                   Man in the Iron Mask, The (1998)
1568     2006                          Mask of Zorro, The (1998)
2087     2609              King of Masks, The (Bian Lian) (1996)
2100     2625                        Black Mask (Hak hap) (1996)
2583     3213                Batman: Mask of the Phantasm (1993)
5586     8880                                        Mask (1985)
6246    37857                                  MirrorMask (2005)
6767    54910  Behind the Mask: The Rise of Leslie Vernon (2006)
6972    59846                              Iron Mask, The (1929)
Specify the title better
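
Instead of hard-coding the id, the search could be narrowed to the full title. A minimal sketch on a toy table follows; `toy_movies` and `mask_id_lookup` are made-up names, and `re.escape` is needed because the title search treats its pattern as a regular expression.

```python
import re
import pandas as pd

# Toy stand-in for `movies` with three of the titles matched above.
toy_movies = pd.DataFrame({
    "movieId": [367, 2006, 8880],
    "title": ["Mask, The (1994)", "Mask of Zorro, The (1998)", "Mask (1985)"],
})

# re.escape keeps the parentheses literal; unescaped, "(1994)" would be a regex group.
pattern = re.escape("Mask, The (1994)")
matched = toy_movies[toy_movies.title.str.contains(pattern, case=False)]

# Accept the result only when exactly one title matches.
mask_id_lookup = int(matched.movieId.iloc[0]) if len(matched) == 1 else None
print(mask_id_lookup)
```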


In [421]:
# filtering rules
watched_idx = rules['antecedents'].apply(
            lambda x: x == frozenset({mask_id}))

rejections = rules[watched_idx].sort_values(by=['confidence'], axis=0, ascending=False)
# rejections

In [422]:
# consequents of the 5 most confident rules, merged into one set of movie ids
rejection_ids = rejections['consequents'].head(5)
rejection_ids = frozenset().union(*rejection_ids)
#rejection_ids

These are the 4 movies you should skip if you did not enjoy **The Mask**:

In [423]:
f = movies.movieId.isin(rejection_ids)
movies[f]

     movieId                                  title
18        19  Ace Ventura: When Nature Calls (1995)
184      208                      Waterworld (1995)
309      344     Ace Ventura: Pet Detective (1994)
519      586                      Home Alone (1990)
