**Rule generation** is a common task in the mining of frequent patterns. An association rule is an implication expression of the form $X\to Y$ where $X$ and $Y$ are *disjoint itemsets* (Tan et al. 2014). A more concrete example based on consumer behaviour would be $\{Diapers\}\to\{Beer\}$, suggesting that people who buy diapers are also likely to buy beer. To evaluate how interesting such an association rule is, different metrics have been developed. The current implementation makes use of the confidence and lift metrics.

<img src="images/Association_Rule_Learning.png" alt="drawing" width="500"/>

Main association rule metrics:

- *Support:* It measures how often an itemset $X$ is purchased, i.e. the fraction of all transactions that contain $X$: $\text{support}(X) = \frac{\text{number of transactions containing } X}{\text{total number of transactions}}$

- *Confidence:* It measures how often items in $Y$ appear in transactions that contain $X$: $\text{confidence}(X\to Y) = \frac{\text{support}(X\cup Y)}{\text{support}(X)}$

- *Lift:* It tells us how much more likely item $Y$ is to be bought when item $X$ is in the basket than it is in general: $\text{lift}(X\to Y) = \frac{\text{confidence}(X\to Y)}{\text{support}(Y)}$. In other words, it tells us how much better the rule is at predicting $Y$ than simply assuming $Y$ in the first place. When lift > 1, the items are purchased together more often than expected and the rule predicts better than guessing. When lift = 1, $X$ and $Y$ are independent. When lift < 1, the rule does worse than informed guessing.

<br />

*For example*: Suppose the shop sells 10 different products and we inspect invoices in order to recommend products. If 4 of the 8 customer invoices contain milk, then Support(Milk) = 4/8 = 0.5. In other words, on average one in every two carts contains milk.
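
The three metrics can be checked by hand on a small example. The following is a minimal sketch in plain Python; the transactions and the helper names `support`, `confidence` and `lift` are made up for illustration and are not part of mlxtend. The data are chosen so that milk appears in 4 of 8 baskets, matching the example above.

```python
# Toy transactions (hypothetical data, not taken from the Online Retail dataset)
transactions = [
    {"milk", "bread"},
    {"milk", "diapers", "beer"},
    {"bread", "diapers", "beer"},
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"bread"},
    {"diapers", "beer"},
    {"bread", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(X and Y together) / support(X)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """confidence(X -> Y) / support(Y)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


print(support({"milk"}, transactions))                          # 0.5
print(round(confidence({"diapers"}, {"beer"}, transactions), 3))  # 0.8
print(round(lift({"diapers"}, {"beer"}, transactions), 3))        # 1.6
```

In this toy data the lift of $\{Diapers\}\to\{Beer\}$ is above 1, so beer shows up in diaper baskets more often than in baskets overall.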

More information can be found at https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/


**References**
- Tan, Steinbach, Kumar. Introduction to Data Mining. Pearson New International Edition. Harlow: Pearson Education Ltd., 2014. (pp. 327-414).


# Import libraries

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# # Basic libraries
# 
import pandas   as pd
import numpy    as np
import datetime as dt

# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# Visualization libraries
#
import matplotlib.pyplot as plt
import seaborn           as sns


# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# mlxtend library
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Import data

In [3]:
# Load Dataset
#
df = pd.read_csv('Data/OnlineRetail.csv', encoding="ISO-8859-1")

print('[INFO] Number of instances: ', df.shape[0])
print('[INFO] Number of features:  ', df.shape[1])

# Visualize DataFrame
#
df.head( 3 )



[INFO] Number of instances:  541909
[INFO] Number of features:   8


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 1/12/10 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 1/12/10 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 1/12/10 8:26 | 2.75 | 17850.0 | United Kingdom |


## Data Pre-processing

**Observations/Findings:**

- The minimum and maximum values of Quantity have the same magnitude (80995); the negative quantities could represent cancelled or returned orders.
- UnitPrice also has a few negative values, which is unusual. These transactions could represent orders cancelled by customers or bad debt incurred by the business.
- Bad-debt adjustments will be dropped from the dataset because they do not represent actual sales.

We need to clean the dataset by removing these records.


- Almost 25% of the records have a missing CustomerID. We remove them because the missing IDs cannot be recovered.
- As customer clusters may vary by geography, we restrict the data to customers from Germany. Note that the United Kingdom accounts for about 90% of the customers, and applying this analysis to those customers causes memory errors.
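
Observations like these can be reproduced with a few pandas calls. The sketch below runs on a tiny hand-made frame (the values are hypothetical; only the column names follow the dataset), so the numbers it prints are not those of the real data.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset
sample = pd.DataFrame({
    "Quantity":   [6, -6, 8, 2],
    "UnitPrice":  [2.55, 2.55, -1.0, 3.39],
    "CustomerID": [17850.0, None, 12345.0, 12346.0],
    "Country":    ["United Kingdom", "United Kingdom", "Germany", "United Kingdom"],
})

# Range of the numeric columns: negative values stand out immediately
value_range = sample[["Quantity", "UnitPrice"]].agg(["min", "max"])
print(value_range)

# Share of rows without a CustomerID
missing_share = sample["CustomerID"].isna().mean()
print(missing_share)

# Share of rows per country
country_share = sample["Country"].value_counts(normalize=True)
print(country_share)
```

Running the same three checks on `df` before filtering is one way to confirm the figures quoted above.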

In [4]:
# Keep only customers from Germany (about 90% of the customers are from the United Kingdom)
#
df = df[df.Country == 'Germany']



In [5]:
# Keep only rows with strictly positive Quantity and UnitPrice
#
df = df[df[ 'Quantity'  ] > 0]
df = df[df[ 'UnitPrice' ] > 0]


# Remove cancelled invoices (InvoiceNo contains the letter "C")
#
df = df[~df["InvoiceNo"].str.contains("C", na = False)]


# Remove rows with a missing CustomerID
#
df = df[ pd.notnull(df['CustomerID']) ]



In [6]:
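# Rename 'InvoiceDate' to 'Date'
#
df = df.rename(columns = {'InvoiceDate': 'Date'})


# Convert 'Date' to datetime64. Unit-less astype('datetime64') is rejected by
# pandas >= 2.0, so pd.to_datetime is used instead.
# Assumption: dates are in day/month/year order (1/12/10 = 1 December 2010).
#
df[ 'Date' ] = pd.to_datetime(df[ 'Date' ], dayfirst = True)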



In [7]:
# Convert 'CustomerID' to int
#
df[ 'CustomerID' ] = df[ 'CustomerID' ].astype( 'int' )


# Convert 'InvoiceNo' to int
#
df[ 'InvoiceNo' ] = df[ 'InvoiceNo' ].astype( 'int' )



# Build association rules for customers in Germany

In [8]:
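# Build a binary invoice-product matrix: True if the product appears
# on the invoice, False otherwise.
#
def create_invoice_product_df(dataframe, id = False):
    # Use StockCode as column labels when id is True, otherwise Description
    column = 'StockCode' if id else 'Description'
    # Sum quantities per (invoice, product), pivot products into columns,
    # and keep only whether the total is positive. Comparing with 0 replaces
    # applymap, which is deprecated in recent pandas versions, and yields the
    # boolean matrix that mlxtend's apriori prefers.
    return dataframe.groupby(['InvoiceNo', column])['Quantity'].sum().\
        unstack().fillna(0) > 0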



In [9]:
# Invoice-product matrix:
# invoices in the rows, items (StockCode) in the columns
#
inv_df = create_invoice_product_df(df, id = True)
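
To see what the invoice-product matrix looks like, the same groupby/unstack steps can be run on a few hand-made invoice lines. This is only an illustrative sketch: the frame `toy` and its values are hypothetical, and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical invoice lines; invoice 2 lists product "C" twice
toy = pd.DataFrame({
    "InvoiceNo": [1, 1, 2, 2, 2, 3],
    "StockCode": ["A", "B", "A", "C", "C", "B"],
    "Quantity":  [2, 1, 5, 1, 3, 4],
})

# Sum quantities per (invoice, product), pivot products into columns,
# then keep only whether the product appears on the invoice
basket = (
    toy.groupby(["InvoiceNo", "StockCode"])["Quantity"]
       .sum()
       .unstack(fill_value=0)
       .gt(0)
)
print(basket)

# The column means of a boolean basket matrix are the single-item supports
item_support = basket.mean()
print(item_support)
```

Each row is one invoice and each column one product, so repeated lines for the same product on an invoice collapse into a single `True`.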



## Application of Apriori algorithm

In [10]:
# Frequent itemsets (support >= 1%) and their support values
#
frequent_itemsets = apriori(inv_df, min_support = 0.01, use_colnames = True)
frequent_itemsets.sort_values("support", ascending = False).head( 5 )
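
To make the role of `min_support` concrete, here is a simplified sketch of the level-wise search that Apriori is based on, written in plain Python. It is not mlxtend's implementation; the function name `frequent_itemsets_sketch` and the toy transactions are hypothetical.

```python
from itertools import combinations


def frequent_itemsets_sketch(transactions, min_support):
    """Level-wise search: return {itemset: support} for itemsets at or above min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items
    items = sorted({item for t in transactions for item in t})
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    result = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidates of size k
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        next_level = {}
        for cand in candidates:
            # Prune step: every (k-1)-subset of a frequent itemset must be frequent
            if all(frozenset(sub) in current for sub in combinations(cand, k - 1)):
                s = support(cand)
                if s >= min_support:
                    next_level[cand] = s
        result.update(next_level)
        current = next_level
        k += 1
    return result


# Toy transactions (hypothetical data)
toy_transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

found = frequent_itemsets_sketch(toy_transactions, min_support=0.6)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a threshold of 0.6, all single items and all pairs survive, while the triple {A, B, C} (support 0.4) is discarded. Lowering `min_support` keeps more itemsets at the cost of more candidates to count.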



In [None]:
# Generate association rules from the frequent itemsets
# 
rules = association_rules(frequent_itemsets, metric = "support", min_threshold = 0.01)

# Top 5 rules according to the "lift" measure
rules.sort_values("lift", ascending=False).head( 5 )

# Case studies

- Recommend products for an item that a user has in the cart.


In [None]:
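def arl_recommender(df_rules, product_id, rec_count=1):
    # Rank the rules by lift so that the strongest associations come first
    sorted_rules = df_rules.sort_values("lift", ascending=False)

    recommendation_list = []

    # Iterate over antecedents and consequents together. Indexing with
    # iloc[i] would be wrong here because, after sorting, the index labels
    # no longer match the row positions.
    for antecedents, consequents in zip(sorted_rules["antecedents"], sorted_rules["consequents"]):
        if str(product_id) in antecedents:
            for item in consequents:
                # Skip duplicates while preserving the lift order
                if item not in recommendation_list:
                    recommendation_list.append(item)

    return recommendation_list[:rec_count]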


def id_finder(dataframe, stock_code):
    product_name = dataframe[dataframe["StockCode"] == stock_code][["Description"]].values[0][0]
    
    return product_name

In [None]:
# Top 3 recommendations
#
L = arl_recommender(rules, 21086, 3)

for x in L:
    print('ID: {} - Description: {}'.format(x, id_finder( df, x )))

In [None]:
# Top 5 recommendations
#
L = arl_recommender(rules, 21086, 5)

for x in L:
    print('ID: {} - Description: {}'.format(x, id_finder( df, x )))
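
The ranking logic of such a recommender can be checked on a small hand-made rules table. The sketch below is self-contained: the table `toy_rules`, its lift values and the function name `recommend_by_lift` are hypothetical, and only the column names `antecedents`, `consequents` and `lift` follow the output of `association_rules`.

```python
import pandas as pd

# Hypothetical rules table (values invented for illustration)
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"21086"}),
        frozenset({"21086"}),
        frozenset({"22326"}),
        frozenset({"21086", "22326"}),
    ],
    "consequents": [
        frozenset({"21094"}),
        frozenset({"21080"}),
        frozenset({"22328"}),
        frozenset({"21094"}),
    ],
    "lift": [2.0, 5.0, 3.0, 4.0],
})


def recommend_by_lift(df_rules, product_id, rec_count=1):
    """Return up to rec_count consequents of rules containing product_id, best lift first."""
    ranked = df_rules.sort_values("lift", ascending=False)
    recommendations = []
    for antecedents, consequents in zip(ranked["antecedents"], ranked["consequents"]):
        if str(product_id) in antecedents:
            for item in sorted(consequents):
                # Keep the first (highest-lift) occurrence of each product
                if item not in recommendations:
                    recommendations.append(item)
    return recommendations[:rec_count]


print(recommend_by_lift(toy_rules, 21086, 3))  # ['21080', '21094']
```

Only two products are returned even though three were requested, because just two distinct consequents are associated with product 21086 in this toy table.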