# SENG 474
# Assignment 3 - Problem 3 Retail
# Nolan Kurylo
# V00893175
To execute the notebook, run ALL cells from top to bottom (the imports and DataFrame creation are only run once).

References:

1) https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python

2) https://www.w3schools.com/python/ref_func_sorted.asp


In [1]:
import pandas as pd
import numpy as np

df = pd.read_excel('Online Retail.xlsx', sheet_name="Online Retail")
# run this cell once, takes ~3 minutes to load in

In [3]:
from apyori import apriori
pd.options.mode.chained_assignment = None


countries = np.unique(df['Country']) # find all unique countries in the dataset

minsups = {'Austria':0.2, 'Bahrain': 0.5,'Canada': 0.3,'Channel Islands':0.19, 'Cyprus':0.2, 'Denmark':0.3,'European Community':0.5,'Greece':0.4, 'Hong Kong':0.4,
'Iceland':0.5,'Israel':0.3,'Japan':0.2,'Lithuania':0.5, 'Malta':0.4,'Poland':0.2, 'Singapore':0.4,'USA':0.4, 'United Kingdom':0.05} # manually tested minimum support values needed for Apriori to run effectively for each country (described in the Markdown cell below)

for country in countries:
    print("Country: " + country)
    # Preprocessing below
    df1 = df[(df['Country'] == country) & (df['StockCode'] != 'POST') & (df['Quantity'] > 0)] # keep only rows for the current country that have a positive quantity and are not postage entries
    df1 = df1.dropna() # remove rows with NAN values
    df1['Description'] = df1['Description'].str.strip() # remove any additional whitespace

    numInvoices = len(np.unique(df1['InvoiceNo']))
    if(country == 'Unspecified'): # explained in md below
        print("Not a country, skipping...")
        print() 
        continue
    if(numInvoices <= 2 ): # explained in md below
        print("Not enough invoices to perform Apriori algorithm, moving on to next country...")
        print() 
        continue


    if(country in minsups): # if country was found to need a different minsup value in manual testing
        minsup = minsups[country]

    else: # manual testing showed minsup=0.1 is the best for this country
        minsup = 0.1

    transactions = list(df1.groupby('InvoiceNo')['Description'].apply(list)) # build one list of product descriptions per invoice (transaction)

    associations = list(apriori(transactions,min_support=minsup)) # run apriori


    confidence_values = {} # for storing all records in a dictionary
    i = 0 # unique identifier for dictionary

    for association in associations: # find the support, confidence, lift and association rules for each record generated by Apriori

        support = association.support
        for rule in association.ordered_statistics:
            
            confidence = rule.confidence
            lift = rule.lift
            left_asc = list(rule.items_base)
            right_asc = list(rule.items_add)
            confidence_values['record_' + str(i)] = {'confidence': confidence, 'lift': lift, 'left_asc': left_asc, 'right_asc':right_asc, 'support': support} # store record for later sorting
            i+=1 

    top5AssociationRecords = sorted(confidence_values.items(), key=lambda record: record[1]['confidence'], reverse=True)[:5] # sort each record to find the top 5 associations according to confidence value
    top5AssociationRecords = dict(top5AssociationRecords) # convert to dictionary for easy looping

    for record in top5AssociationRecords: # loop through the top 5 associations and print their details
        confidence = top5AssociationRecords[record]['confidence']
        support = top5AssociationRecords[record]['support']
        lift = top5AssociationRecords[record]['lift']
        left_asc = top5AssociationRecords[record]['left_asc']
        right_asc = top5AssociationRecords[record]['right_asc']
        print("Association: " + str(left_asc) +" --> " + str(right_asc) + " Support = "+str(support)+" Confidence = "+str(confidence) + " Lift = "+str(lift))
    print()

Country: Australia
Association: ['ALARM CLOCK BAKELIKE GREEN'] --> ['ALARM CLOCK BAKELIKE RED'] Support = 0.10714285714285714 Confidence = 1.0 Lift = 9.333333333333334
Association: ['ALARM CLOCK BAKELIKE RED'] --> ['ALARM CLOCK BAKELIKE GREEN'] Support = 0.10714285714285714 Confidence = 1.0 Lift = 9.333333333333334
Association: ['DOLLY GIRL LUNCH BOX'] --> ['SPACEBOY LUNCH BOX'] Support = 0.10714285714285714 Confidence = 1.0 Lift = 9.333333333333334
Association: ['SPACEBOY LUNCH BOX'] --> ['DOLLY GIRL LUNCH BOX'] Support = 0.10714285714285714 Confidence = 1.0 Lift = 9.333333333333334
Association: [] --> ['RED TOADSTOOL LED NIGHT LIGHT'] Support = 0.16071428571428573 Confidence = 0.16071428571428573 Lift = 1.0

Country: Austria
Association: ['ROUND SNACK BOXES SET OF 4 FRUITS'] --> ['ROUND SNACK BOXES SET OF4 WOODLAND'] Support = 0.23529411764705882 Confidence = 1.0 Lift = 4.25
Association: ['ROUND SNACK BOXES SET OF4 WOODLAND'] --> ['ROUND SNACK BOXES SET OF 4 FRUITS'] Support = 0.2352

3 Retail

The code above runs the Apriori algorithm on each qualifying country in the dataset, grouping together all products that were part of the same transaction (invoice). "Qualifying" reflects some assumptions that had to be made about the assignment problem. Testing the Apriori algorithm on each country showed that, because the complexity of the algorithm grows quickly, association rule generation can take too long (> 10 minutes) for some countries. This was the case for countries with a very low number of total transactions (2 or fewer). For these countries the Apriori algorithm generated a large number of rules with identical statistics (support, confidence and lift), so specifying different hyperparameters did not speed up the generation. For simplicity, these countries were skipped. The same problem occurred for the 'Unspecified' country, which is also not an actual country, so it was filtered out of the output as well.
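
One way to see why so few invoices cause trouble: with only two invoices, every itemset that occurs at all has a support of 0.5 or 1.0, so a minimum support of 0.5 or lower prunes nothing and every subset of an invoice counts as frequent. The sketch below is an illustration only, not a measurement on the dataset. The item names are made up and the helper `count_itemsets` is hypothetical.

```python
from itertools import combinations

def count_itemsets(invoice):
    """Count every non-empty subset of the distinct items on one invoice."""
    items = sorted(set(invoice))
    return sum(1 for k in range(1, len(items) + 1)
                 for _ in combinations(items, k))

# Hypothetical invoices (assumption): item names are placeholders.
small_invoice = ['ITEM_' + str(n) for n in range(5)]
large_invoice = ['ITEM_' + str(n) for n in range(15)]

print(count_itemsets(small_invoice))  # prints 31    (2**5 - 1)
print(count_itemsets(large_invoice))  # prints 32767 (2**15 - 1)
```

The count doubles with every additional item on the invoice, which is consistent with the long run times observed for these countries.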

Because the Apriori algorithm had to be run on 30+ countries in this dataset, each country was tested manually to find the minimum support threshold that worked best. Countries that needed a value other than the default of 0.1 are recorded in the "minsups" dictionary. This improved the overall efficiency of the program while still finding the top 5 associations for each country.
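
The per-country lookup can be written compactly with `dict.get`, which returns a fallback when a key is missing. This is a minimal sketch using only a few of the entries from "minsups"; the names `DEFAULT_MINSUP` and `minsup_for` are hypothetical and do not appear in the notebook.

```python
DEFAULT_MINSUP = 0.1  # default used in the notebook for countries without an override

# A subset of the manually tested overrides from the notebook
minsups = {'Austria': 0.2, 'Bahrain': 0.5, 'United Kingdom': 0.05}

def minsup_for(country):
    """Return the override for a country, or the default if none was recorded."""
    return minsups.get(country, DEFAULT_MINSUP)

print(minsup_for('Austria'))  # prints 0.2
print(minsup_for('France'))   # prints 0.1
```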

Some of the associations found have an empty set ([]) as the left hand side and a product description as the right hand side. Such a rule means that the product on the right hand side is commonly found in transactions in general; its confidence equals its support and its lift is 1.0, as in the last Australia rule in the output. Other associations have a list of product descriptions on both the left and right hand sides. These mean that when the left hand side products are present in a transaction, the right hand side products are commonly present too (to the degree indicated by the support, confidence and lift).
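
The three statistics can be checked by hand on a toy example. The sketch below uses the standard definitions (support of the combined itemset, confidence = support(LHS and RHS) / support(LHS), lift = confidence / support(RHS)) on made-up transactions. The helpers `support` and `rule_stats` are hypothetical and are not part of apyori.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def rule_stats(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs --> rhs."""
    sup_both = support(transactions, set(lhs) | set(rhs))
    sup_lhs = support(transactions, lhs)  # an empty lhs is in every transaction: 1.0
    sup_rhs = support(transactions, rhs)
    confidence = sup_both / sup_lhs
    lift = confidence / sup_rhs
    return sup_both, confidence, lift

# Made-up transactions (assumption), not taken from the dataset
transactions = [
    ['A', 'B'],
    ['A', 'B', 'C'],
    ['C'],
    ['A', 'B'],
]

print(rule_stats(transactions, ['A'], ['B']))  # confidence is 1.0, lift is above 1
print(rule_stats(transactions, [], ['C']))     # prints (0.5, 0.5, 1.0)
```

With an empty left hand side the confidence collapses to the support of the right hand side and the lift to 1.0, so these rules describe item popularity rather than a relationship between items.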

To find the "Top 5 Associations" for each country, every rule output by the Apriori algorithm was stored and then sorted by its confidence value. The 5 associations reported for each country, with their corresponding statistics, are therefore those with the 5 highest confidence values.
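
The ranking step can be sketched on its own with a plain list of records. The records below are made up, and the `skip_empty_lhs` option is an assumption added for illustration: the notebook itself keeps rules with an empty left hand side. The helper name `top_n` is hypothetical.

```python
def top_n(records, n=5, skip_empty_lhs=False):
    """Return the n records with the highest confidence (ties keep input order)."""
    if skip_empty_lhs:
        # Optional variant (not used in the notebook): drop rules like [] --> [X]
        records = [r for r in records if r['left_asc']]
    return sorted(records, key=lambda r: r['confidence'], reverse=True)[:n]

# Made-up records (assumption) using the same keys as the notebook
records = [
    {'left_asc': [], 'right_asc': ['C'], 'confidence': 0.5},
    {'left_asc': ['A'], 'right_asc': ['B'], 'confidence': 1.0},
    {'left_asc': ['C'], 'right_asc': ['A'], 'confidence': 0.5},
    {'left_asc': ['B'], 'right_asc': ['A'], 'confidence': 1.0},
]

for r in top_n(records, n=3):
    print(r['left_asc'], '-->', r['right_asc'], r['confidence'])
```

Because `sorted` is stable, rules with equal confidence stay in the order Apriori produced them, which is why symmetric pairs of rules appear next to each other in the output.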