# Apriori Algorithm

In this notebook, we use the built in apriori algorithm from the mlxtend library

In [1]:
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen
from mlxtend.frequent_patterns import apriori, association_rules
from sklearn.model_selection import train_test_split

We can edit these parameters to compare the accuracy of the Apriori algorithm, simply change the values here and run the entire notebook

In [2]:
metric_to_use = 'lift'
min_threshold_val = 48.161422 #this is the mean value calculated from various iterations
num_products = 500
min_support_val = 0.02
# dataset = 'AMAZON_FASHION_5.json.gz'
dataset = 'Sports_and_Outdoors_5.json.gz'

## Loading and proprocessing data

We first load data to be used for this analysis and sort data by review times

In [3]:
data = []
with gzip.open(dataset) as f:
    for l in f:
        data.append(json.loads(l.strip()))

df = pd.DataFrame.from_dict(data)
df.sort_values("reviewTime")
df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,style,vote,image
0,5.0,True,"06 3, 2015",A180LQZBUWVOLF,32034,Michelle A,What a spectacular tutu! Very slimming.,Five Stars,1433289600,,,
1,1.0,True,"04 1, 2015",ATMFGKU5SVEYY,32034,Crystal R,What the heck? Is this a tutu for nuns? I know...,Is this a tutu for nuns?!,1427846400,,,
2,5.0,True,"01 13, 2015",A1QE70QBJ8U6ZG,32034,darla Landreth,Exactly what we were looking for!,Five Stars,1421107200,,,
3,5.0,True,"12 23, 2014",A22CP6Z73MZTYU,32034,L. Huynh,I used this skirt for a Halloween costume and ...,I liked that the elastic waist didn't dig in (...,1419292800,,,
4,4.0,True,"12 15, 2014",A22L28G8NRNLLN,32034,McKenna,This is thick enough that you can't see throug...,This is thick enough that you can't see throug...,1418601600,,,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2839940 entries, 0 to 2839939
Data columns (total 12 columns):
 #   Column          Dtype  
---  ------          -----  
 0   overall         float64
 1   verified        bool   
 2   reviewTime      object 
 3   reviewerID      object 
 4   asin            object 
 5   reviewerName    object 
 6   reviewText      object 
 7   summary         object 
 8   unixReviewTime  int64  
 9   style           object 
 10  vote            object 
 11  image           object 
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 241.0+ MB


The key columns that we focus on are:
- `reviewTime`: We assume this to be the time the item is bought
- `asin`: Serial number of the product being bought
- `reviewerName`: Name of the customer who made the purchase

Upon analysing the data, we notice that we have too many unique products which would significantly reduce computational efficiency and take too much time, hence, we narrow down our dataset to the top 500 most commonly purchased product

In [5]:
print('Total number of products before filtering= ',len(list(df['asin'].unique())))
product_counts = df['asin'].value_counts()
top_500_products = product_counts.head(num_products).index
df = df[df['asin'].isin(top_500_products)]
print('Number of products after filtering= ',len(list(df['asin'].unique())))

Total number of products before filtering=  104687
Number of products after filtering=  500


We then perform a train test split to calculate the accuracy of our algorithm. Ensure `shuffle` is set to False as this is required by our accuracy metric.

In [6]:
df, testdf = train_test_split(df, test_size=0.7, shuffle=False)

We then perform one-hot encoding on the dataset with each row being a unique user and each column being a product. The values are the counts or number of times each item has been bought by the user.

In [7]:
transactions = pd.crosstab(df['reviewerName'],df['asin'])
transactions.sample(10)

asin,7245456313,B00004TBLW,B00005BAIB,B00005OU9D,B00008BFYG,B0000WR6W8,B00012M5MS,B00014ZY0Q,B000276CZS,B00029PJZ0,...,B0016LJWEW,B0016SRA4Y,B0016WVQJU,B00177BQF8,B00177BQJE,B00178AI8S,B001796T2Q,B0017IFSIS,B0017IHRNC,B0017IHRNM
reviewerName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
NCwoodsman,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Jonny J.,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
G. L.,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
bobo labonski jr,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
stephen,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
S. E. Smith,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Dave Bejarano,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Birano,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
German,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Gary Q,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


`df_support` here is the adding an attribute `Support` to the df which is the items' frequency of occurrence. This allows us to analyze how frequently an item was bought and which items are more popular.

In [8]:
result = transactions.sum()/len(transactions)
numberOfObjects = transactions.sum()
df_support = pd.DataFrame({'asin': result.index, "Quantity": numberOfObjects, 'Support': result.values})
df_support.sort_values(by="Quantity",ascending=False, inplace=True)
df_support.sample(10)

Unnamed: 0_level_0,asin,Quantity,Support
asin,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
B00005BAIB,B00005BAIB,404,0.006271
B000ZKATZG,B000ZKATZG,203,0.003151
B0002YTO7E,B0002YTO7E,733,0.011379
B000FF05L4,B000FF05L4,946,0.014685
B0002ECYRQ,B0002ECYRQ,1193,0.018519
B0010HFAKC,B0010HFAKC,209,0.003244
B00168PI4S,B00168PI4S,473,0.007343
B000MF63M2,B000MF63M2,1478,0.022944
B00125M48I,B00125M48I,247,0.003834
B000LJUS4I,B000LJUS4I,544,0.008445


We then encode each value to a 1 or 0 if the user has bought or not bought the product. This completes our one-hot encoding.

In [9]:
def encode(item_freq):
    res = 0
    if item_freq > 0:
        res = 1
    return res
    
transactions = transactions.applymap(encode)

In [10]:
print("There are",len(transactions),"unique users")

There are 64419 unique users


## Running the Apriori algorithm

We then feed in the one-hot encoded dataset to the apriori algorithm which then filters out items with a support less than 0.02, in order to weave out the less frequently bought items which may skew results of the algorithm.

In [11]:
frequent_itemset = apriori(transactions, use_colnames=True,min_support=min_support_val)
frequent_itemset



Unnamed: 0,support,itemsets
0,0.020196,(B00079ULA8)
1,0.020258,(B000GCRWCG)
2,0.030193,(B000VAPCU2)
3,0.029556,(B0010O748Q)
4,0.023735,(B0012Q2S4W)
5,0.020211,(B00136X6VU)
6,0.021531,(B0014VX2M2)
7,0.020615,(B0015LT03G)
8,0.020631,(B0015LY0DG)
9,0.02001,"(B0015LT03G, B00136X6VU)"


Generate the association rules, gives the frequent itemsets. The following are important columns we look at:
- `antecedents` here refers to items that a user has bought 
- `consequents` are items that a user is likely to buy given they have bought the antecedent product
- All other columns are metrics to measure the accuracy of this "theory"

In [12]:
rules = association_rules(frequent_itemset, metric=metric_to_use, min_threshold=min_threshold_val)
res = rules.sort_values(by=[metric_to_use],ascending=False)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(B0015LT03G),(B0015LY0DG),0.020615,0.020631,0.0206,0.999247,48.435283,0.020174,1300.602617,0.999968
1,(B0015LY0DG),(B0015LT03G),0.020631,0.020615,0.0206,0.998495,48.435283,0.020174,650.801309,0.999984


Here we calculate the confidence score for each pair of items, as we can see in the `Rules` column, confidence simply is a metric that tells you, if you bought item A, how likely should you buy item B.

In [13]:
df_confidence = rules[["antecedents","consequents","confidence"]].copy()
df_confidence["antecedents"] = df_confidence["antecedents"].apply(lambda x: ', '.join(list(x))).astype("unicode")
df_confidence["consequents"] = df_confidence["consequents"].apply(lambda x: ', '.join(list(x))).astype("unicode")
df_confidence['description'] = "If you buy "+df_confidence['antecedents']+" you should also buy "+df_confidence['consequents']
df_confidence

Unnamed: 0,antecedents,consequents,confidence,description
0,B0015LT03G,B0015LY0DG,0.999247,If you buy B0015LT03G you should also buy B001...
1,B0015LY0DG,B0015LT03G,0.998495,If you buy B0015LY0DG you should also buy B001...


We do the same to calculate lift. This metric tells us how many times more likely a user is to buy item A and item B together instead of just item B.

In [14]:
df_lift = rules[["antecedents","consequents","lift"]].copy()
df_lift["antecedents"] = df_lift["antecedents"].apply(lambda x: ', '.join(list(x))).astype("unicode")
df_lift["consequents"] = df_lift["consequents"].apply(lambda x: ', '.join(list(x))).astype("unicode")
df_lift['description'] = "If you buy "+df_lift['antecedents']+ " you are " + df_lift['lift'].astype(str) + " times more likely to buy "+df_confidence['consequents'] + " too, instead of buying it on its own."
df_lift

Unnamed: 0,antecedents,consequents,lift,description
0,B0015LT03G,B0015LY0DG,48.435283,If you buy B0015LT03G you are 48.4352834588919...
1,B0015LY0DG,B0015LT03G,48.435283,If you buy B0015LY0DG you are 48.435283458892 ...


## Prediction and testing accuracy

We then generate a prediction function. If a user has bought the input item, output all items that the user might be interested in together with the probability that the user will buy it

In [15]:
def getUniqueReccsPerAntecedent(res_df, antecedent):
    answer = []
    setOfConsequents = set()
    listOfConsequents = res_df[res_df["antecedents"].apply(lambda x: len(x) == 1 and next(iter(x)) == antecedent)][["consequents","confidence"]]
    for _,listOfConsequent in listOfConsequents.iterrows():
        for x in listOfConsequent["consequents"]:
            if x not in setOfConsequents:
                answer.append((x,listOfConsequent[metric_to_use]))
                setOfConsequents.add(x)
    return answer

If a user has bought all items within the item array input, output all items that the user might be interested in together with its probabilities

In [16]:
def getAllUniqueReccs(res_df,antecedents):
    res = []
    for antecedent in antecedents:
        res += getUniqueReccsPerAntecedent(res_df,antecedent)
    res.sort(key = lambda x: x[1],reverse=True)
    return res[:5]

A bit more specific to our use case of a recommender system for targetted advertising, if we want to market an item, return the top N users that we should target together with the probabilities that they will purchase this product.

In [17]:
# get users who have bought the exact antecedent (be it 1 or multiple products)
def getUsersFromAntecedents(df_transactions,antecedents):
    listOfProducts = list(df_transactions.columns.values)
    antecedentsSet = set(antecedents)
    encoding = [0]*len(listOfProducts)
    res = set()
    for i,x in enumerate(listOfProducts):
        if x in antecedentsSet:
            encoding[i] = 1
    for i,row in df_transactions.iterrows():
        values = row[:].tolist()
        if values == encoding:
            res.add(i)
    return res

In [18]:
# get users who have bought an item that is part of an antecedent and not necessarily the entire anticendent
def getSecondaryUsersFromAntecedents(df_original,antecedents):
    res = set()
    for antecedent in antecedents:
        for i,row in df_original.iterrows():
            if antecedent in row["asin"] and row['reviewerName'] not in res:
                res.add(row["reviewerName"])
    return list(res)

In [19]:
def getTopNUsers(res_df,item,N):    
    answer = []
    users = set()
    for i,row in res_df.iterrows():
        if item in row['consequents']:
            recommended = getUsersFromAntecedents(transactions,row["antecedents"])
            if recommended:
                for temp in recommended:
                    if temp not in users:
                        answer.append((temp,row[metric_to_use]))
                        users.add(temp)
            if len(answer) > N:
                break 
    for i,row in res_df.iterrows():
        if item in row['consequents']:
            secondary_recommended = getSecondaryUsersFromAntecedents(df,row["antecedents"])
            if secondary_recommended:
                for temp in secondary_recommended:
                    if temp not in users:
                        answer += [(temp,row[metric_to_use])]
                        users.add(temp)
            if len(answer) > N:
                break
    return answer[:N]

In order to test the accuracy of our algorithm, for all products, we find the top N users we should reccommend the product to, and then check against our testdf if the user actually bought the product

In [20]:
allProducts = set()
for x in rules["antecedents"]:
    for a in x:
        allProducts.add(a)
print("All products that we are testing against:",allProducts)
print()
ans = []
for i,product in enumerate(allProducts):
    print("Testing product:",product)
    match = 0
    count = 15
    recc_users = getTopNUsers(res,product,count)
    for user in recc_users:
        productsUserBought = testdf[testdf['reviewerName']==user[0]]
        itemCount = productsUserBought[productsUserBought['asin']==product]
        if len(itemCount)!=0:
            match += 1
    print("Accuracy of predictions:",(match/count)*100,"%")
    print()
    ans.append((match/count)*100)
    
print("Overall accuracy:",(sum(ans)/len(ans)),"%") 

All products that we are testing against: {'B0015LT03G', 'B0015LY0DG'}

Testing product: B0015LT03G
Accuracy of predictions: 100.0 %

Testing product: B0015LY0DG
Accuracy of predictions: 100.0 %

Overall accuracy: 100.0 %
