# Association Rule Mining

The problem of association rule mining is defined as follows.

Let ${\displaystyle I=\{i_{1},i_{2},\ldots ,i_{n}\}}$ be a set of ${\displaystyle n}$ binary attributes (each taking the value 0 or 1) called **items**.

Let ${\displaystyle D=\{t_{1},t_{2},\ldots ,t_{m}\}}$ be a set of transactions called the **database**.

Each transaction in ${\displaystyle D}$ has a unique transaction ID and contains a subset of the items in ${\displaystyle I}$.

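A rule is defined as an implication of the form:

${\displaystyle X\Rightarrow Y}$, where ${\displaystyle X,Y\subseteq I}$ and ${\displaystyle X\cap Y=\emptyset }$. ${\displaystyle X}$ is called the antecedent and ${\displaystyle Y}$ the consequent of the rule.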

<table style="float: left; margin-left: 1em; text-align:center;">
<caption>Example database with 5 transactions and 5 items</caption>
<tr>
<th>transaction ID</th>
<th>milk</th>
<th>bread</th>
<th>butter</th>
<th>beer</th>
<th>diapers</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>5</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</table>
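
As a small illustration, the example table above can be written down directly in Python. This is only a sketch: the names `items`, `rows` and `example_transactions` are made up for this illustration and are not used by the cells that follow.

```python
# the item set I, in the column order of the example table
items = ["milk", "bread", "butter", "beer", "diapers"]

# the database D as binary rows, one row per transaction ID 1..5
rows = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
]

# each transaction is the subset of I whose attribute equals 1
example_transactions = [
    {item for item, flag in zip(items, row) if flag == 1}
    for row in rows
]

for transaction_id, transaction in enumerate(example_transactions, start=1):
    print(transaction_id, sorted(transaction))
```

Storing each transaction as a set makes the subset test used in the definitions below a one-liner.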

In [None]:
import pandas
import numpy


# see 'https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/' for more information
# note: newer mlxtend releases replace OnehotTransactions with TransactionEncoder
from mlxtend.preprocessing import OnehotTransactions
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules


pandas.set_option('display.max_rows', 10)
pandas.set_option('display.max_columns', 10)

# set a fixed seed for numpy pseudo random generator
numpy.random.seed(100)

In [None]:
data = pandas.read_excel("./datasets/Online Retail.xlsx", 
                         parse_dates=['InvoiceDate'])

In [None]:
%timeit data.head()
data.head()

In [None]:
data.columns, data.shape

In [None]:
# group the rows by customer to see how many unique customers we have
groupby_result = data.groupby(by=["CustomerID"])

In [None]:
# per-customer count of non-null values in each column
groupby_result.count().reset_index()

In [None]:
# row labels that belong to one particular CustomerID
groupby_result.groups[12347.0]

In [None]:
data[data.CustomerID == 12347.0]

In [None]:
groupby_result = data.groupby(by=["StockCode"])
product_id = groupby_result.count().reset_index()['StockCode'].astype("str")

# Intended filter: keep only products whose 'StockCode' is a 5-digit integer.
# The check is currently disabled (commented out), so every product is kept.
def is_valid_productid(x):
    # def is_int(x):
    #     try:
    #         int(x)
    #     except ValueError:
    #         return False
    #     return True
    #
    # if is_int(x):
    #     return len(x) == 5
    # else:
    #     return False
    return True

# is_valid_productid('84625A')
selcted_products = product_id[product_id.apply(is_valid_productid)]

# the item set I
selcted_products

In [None]:
data['StockCode'] = data['StockCode'].astype("str")
# reset_index() returns a new frame instead of modifying a filtered slice in place
raw_transactions = data[data['StockCode'].isin(selcted_products)].reset_index()
# raw_transactions

In [None]:
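# transaction set (D): one list of unique StockCodes per invoice
transaction_set = []
counter = 0
# group by a single column name so that each key is the plain invoice number
for key, value in raw_transactions.groupby(by="InvoiceNo"):
    # print the invoice number and its row count for the first 10 invoices only
    if counter < 10:
        print(key, len(value['StockCode']))
        counter += 1

    # add the invoice's distinct items to the transaction set (D)
    transaction_set.append(list(pandas.unique(value['StockCode'])))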

In [None]:
# preview the first five transactions
for transaction_id, transaction in enumerate(transaction_set[0:5]):
    print("transaction_id = %s, items = %s" % (transaction_id, transaction))

In [None]:
# first two entries of the item set I and of the transaction set D
selcted_products[0:2], transaction_set[0:2]

# Additional Terminology

Let ${\displaystyle X}$ be an itemset, ${\displaystyle X\Rightarrow Y}$ an association rule and ${\displaystyle T}$ a set of transactions of a given database.

### Support
Support is an indication of how frequently the itemset appears in the dataset.

The support of ${\displaystyle X}$ with respect to ${\displaystyle T}$ is defined as the proportion of transactions ${\displaystyle t}$ in the dataset that contain the itemset ${\displaystyle X}$.

${\displaystyle \mathrm {supp} (X)={\frac {|\{t\in T;X\subseteq t\}|}{|T|}}}$
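
A minimal sketch of this definition, applied to the five-transaction example table from the start of the notebook. The helper name `support` and the variable `transactions` are assumptions introduced only for this illustration.

```python
# the example database, one set of items per transaction
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)

print(support({"milk", "bread"}, transactions))  # 2 of 5 transactions -> 0.4
print(support({"bread"}, transactions))          # 3 of 5 transactions -> 0.6
```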


### Confidence
Confidence is an indication of how often the rule has been found to be true.

The confidence value of a rule, ${\displaystyle X\Rightarrow Y}$, with respect to a set of transactions ${\displaystyle T}$, is the proportion of the transactions containing ${\displaystyle X}$ that also contain ${\displaystyle Y}$.

Confidence is defined as:

${\displaystyle \mathrm {conf} (X\Rightarrow Y)=\mathrm {supp} (X\cup Y)/\mathrm {supp} (X)}$
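
The formula can be checked by hand on the example table. The sketch below uses hypothetical helpers `support` and `confidence`, written only for this illustration.

```python
# the example database, one set of items per transaction
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset.issubset(t)) / len(transactions)

def confidence(x, y, transactions):
    """conf(X => Y) = supp(X u Y) / supp(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# every transaction with milk also has bread
print(confidence({"milk"}, {"bread"}, transactions))            # 1.0
# only 2 of the 3 transactions with bread also have milk
print(round(confidence({"bread"}, {"milk"}, transactions), 3))  # 0.667
```

The two results differ, which shows that confidence is not symmetric in ${\displaystyle X}$ and ${\displaystyle Y}$.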

### Lift (Interest)
The lift of a rule is defined as:

${\displaystyle \mathrm {lift} (X\Rightarrow Y)={\frac {\mathrm {supp} (X\cup Y)}{\mathrm {supp} (X)\times \mathrm {supp} (Y)}}}$

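that is, the ratio of the observed support to the support expected if X and Y were independent. A lift of 1 is what independence would produce; a lift above 1 means X and Y occur together more often than expected, and a lift below 1 means they occur together less often.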
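
A short worked example on the example table, again with hypothetical helper names (`support`, `lift`) that exist only in this sketch.

```python
# the example database, one set of items per transaction
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset.issubset(t)) / len(transactions)

def lift(x, y, transactions):
    """lift(X => Y) = supp(X u Y) / (supp(X) * supp(Y))."""
    joint = support(set(x) | set(y), transactions)
    return joint / (support(x, transactions) * support(y, transactions))

# milk and bread appear together more often than independence would predict
print(round(lift({"milk"}, {"bread"}, transactions), 3))    # 1.667
# bread and butter appear together less often than independence would predict
print(round(lift({"bread"}, {"butter"}, transactions), 3))  # 0.833
```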

### Conviction
The conviction of a rule is defined as ${\displaystyle \mathrm {conv} (X\Rightarrow Y)={\frac {1-\mathrm {supp} (Y)}{1-\mathrm {conf} (X\Rightarrow Y)}}}$.

Conviction can be interpreted as the ratio of the expected frequency with which X occurs without Y if X and Y were independent to the observed frequency with which X occurs without Y, that is, the frequency with which the rule makes an incorrect prediction.
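
The sketch below evaluates conviction on the example table. The helper names are hypothetical, and returning infinity for a rule whose confidence is exactly 1 is a convention chosen for this illustration, since the formula's denominator is zero in that case.

```python
# the example database, one set of items per transaction
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset.issubset(t)) / len(transactions)

def conviction(x, y, transactions):
    """conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y))."""
    conf = support(set(x) | set(y), transactions) / support(x, transactions)
    if conf == 1.0:
        # the rule is never wrong on this data, so the ratio is unbounded
        return float("inf")
    return (1 - support(y, transactions)) / (1 - conf)

print(round(conviction({"bread"}, {"milk"}, transactions), 3))  # 1.8
print(conviction({"milk"}, {"bread"}, transactions))            # inf
```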

### There are more!

# Association Rule Mining Process
Association rules are usually required to satisfy a user-specified **minimum support** and a user-specified **minimum confidence** at the same time. Association rule generation is usually split up into two separate steps:

* A minimum support threshold is applied to find all frequent itemsets in a database.
* A minimum confidence constraint is applied to these frequent itemsets in order to form rules.

While the second step is straightforward, the first step needs more attention.
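
The two steps can be sketched end to end on the example table. This sketch enumerates every non-empty subset of the items, which is feasible here only because the table has five items. The thresholds (0.4 and 0.6) and all names are assumptions chosen for this illustration.

```python
from itertools import combinations

# the example database, one set of items per transaction
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset.issubset(t)) / len(transactions)

all_items = sorted(set().union(*transactions))

# step 1: keep every itemset whose support reaches the minimum support
min_support = 0.4
frequent = {}
for size in range(1, len(all_items) + 1):
    for combo in combinations(all_items, size):
        s = support(combo, transactions)
        if s >= min_support:
            frequent[frozenset(combo)] = s

# step 2: split each frequent itemset into antecedent X and consequent Y
# and keep the rules whose confidence reaches the minimum confidence
min_confidence = 0.6
rules = []
for itemset, supp_xy in frequent.items():
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            x = frozenset(antecedent)
            y = itemset - x
            # every subset of a frequent itemset is frequent, so x is in `frequent`
            conf = supp_xy / frequent[x]
            if conf >= min_confidence:
                rules.append((set(x), set(y), conf))

for x, y, conf in rules:
    print(sorted(x), "=>", sorted(y), "confidence =", round(conf, 3))
```

With these thresholds only the itemset {milk, bread} has more than one item, so the two rules printed are milk ⇒ bread and bread ⇒ milk.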

**A brute-force search over all candidate itemsets is rarely computationally feasible: the power set of ${\displaystyle I}$ has ${\displaystyle 2^{|I|}-1}$ members (excluding the empty set).**
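
One common way around this is a level-wise search in the style of Apriori, which relies on the fact that every subset of a frequent itemset must itself be frequent. The sketch below is a simplified pure-Python illustration of that idea on the example table. It is not the mlxtend implementation, and the function name `frequent_itemsets_levelwise` is made up for this example.

```python
from itertools import combinations

# the example database, one set of items per transaction
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def frequent_itemsets_levelwise(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)

    def supp(itemset):
        return sum(1 for t in transactions if itemset.issubset(t)) / n

    # level 1: frequent single items
    singles = {frozenset([item]) for t in transactions for item in t}
    current = {s: supp(s) for s in singles}
    current = {s: v for s, v in current.items() if v >= min_support}

    result = {}
    k = 1
    while current:
        result.update(current)
        k += 1
        # join: merge pairs of frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # prune: discard candidates that have an infrequent (k-1)-subset
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # count support only for the surviving candidates
        current = {}
        for c in candidates:
            s = supp(c)
            if s >= min_support:
                current[c] = s
    return result

found = frequent_itemsets_levelwise(transactions, min_support=0.2)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

On this toy database the search counts support for 16 candidates (5 single items, 10 pairs, 1 triple) instead of all 31 non-empty subsets. The saving grows quickly as the number of items increases.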

In [None]:
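# let's mine!
# keep only the transactions that contain more than two distinct items
new_transaction_set = [i for i in transaction_set if len(i) > 2]
# number of transactions to encode (here: all of them)
margin = len(new_transaction_set)
# one-hot encode: one row per transaction, one 0/1 column per item
oht = OnehotTransactions()
oht_ary = oht.fit_transform(new_transaction_set[0:margin])
df = pandas.DataFrame(oht_ary, columns=oht.columns_)
# frequent itemsets: those appearing in at least 4% of the transactions
frequent_itemsets = apriori(df, min_support=0.04, use_colnames=True)

# show the frequent itemsets and their support
frequent_itemsets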

In [None]:
# keep the rules whose 'confidence' is at least min_threshold
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.001)

In [None]:
# keep the rules whose 'lift' is at least min_threshold
association_rules(frequent_itemsets, metric="lift", min_threshold=1.1)

In [None]:
new_transaction_set