This document introduces association rule mining with the Apriori algorithm, as implemented in [mlxtend](https://rasbt.github.io/mlxtend/). We use the provided grocery dataset as an example.

# 1. Data preparation

We first import the packages that will be used in this document.

1. [Pandas](https://pandas.pydata.org/): Pandas is an open-source Python library widely used for data manipulation, analysis, and cleaning. Its central data structure is the [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), which provides methods for inspecting basic properties of the data, computing statistical summaries, and viewing a few rows for a quick first look.

2. [NumPy](https://numpy.org/): NumPy is a powerful Python library for numerical and array-based computing. It provides support for large, multi-dimensional arrays and matrices, along with a wide range of mathematical functions to operate on these arrays efficiently.

3. [mlxtend.preprocessing.TransactionEncoder](https://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/): TransactionEncoder is a class in the mlxtend library that encodes transaction data given as Python lists.

4. [mlxtend.frequent_patterns.apriori](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/): apriori is the mlxtend implementation of the Apriori algorithm, which extracts frequent itemsets for association rule mining.

**Note: To import [mlxtend.preprocessing.TransactionEncoder](https://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/) and [mlxtend.frequent_patterns.apriori](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/), you should first install the [mlxtend](https://rasbt.github.io/mlxtend/) library:**

Please run **`conda install -c conda-forge mlxtend`** in your *Anaconda command prompt*.

*OR*

Please run **`pip install mlxtend`** in your *terminal*.


In [1]:
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

First, we load the data.

In [2]:
df = pd.read_csv('groceries.csv')

Let's look at the first 5 rows of the dataset with [head()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), which shows the first 5 transactions.

In [3]:
df.head()

Unnamed: 0,item1,item2,item3,item4,item5,item6,item7,item8,item9,item10,...,item21,item22,item23,item24,item25,item26,item27,item28,item29,item30
0,citrus fruit,semi-finished bread,margarine,ready soups,,,,,,,...,,,,,,,,,,
1,tropical fruit,yogurt,coffee,,,,,,,,...,,,,,,,,,,
2,whole milk,,,,,,,,,,...,,,,,,,,,,
3,pip fruit,yogurt,cream cheese,meat spreads,,,,,,,...,,,,,,,,,,
4,other vegetables,whole milk,condensed milk,long life bakery product,,,,,,,...,,,,,,,,,,


# 2. Data format conversion

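To apply the Apriori algorithm, we need the dataset in the following format, where each row is a transaction, each column is an item type, and each cell indicates whether the transaction contains that item: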

|     | Type 1 | Type 2 | ...    | Type N |
| --- | ------ | ----   | ----   |  ----  |
| 0   | True   | False  | ...    | True   |
| 1   | False  | True   | ...    | True   |
| 2   | True   | True   | ...    | False  |
| 3   | True   | True   | ...    | True   |

We first convert each transaction into a [list](https://docs.python.org/3/tutorial/introduction.html#lists) with the `NaN` values removed, and put these lists into a NumPy array.

In [4]:
dataset = df.values.tolist()
cleanList = []

for trans in dataset: # for each transaction
    cleanTrans = []
    for x in trans: # for each element in the transaction
        if pd.notna(x): # keep the item only if it is not a missing value (NaN)
            cleanTrans.append(x)
    cleanList.append(cleanTrans)
dataset = np.asarray(cleanList, dtype=object)
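
The nested loop above could also be written as a list comprehension. The following is a minimal sketch on hypothetical toy rows (the names `rows`, `is_missing` and `clean_rows` are not part of the notebook), assuming that empty cells appear as float `NaN` values, as they do after `df.values.tolist()`:

```python
import math

# Toy rows mimicking df.values.tolist(): short transactions are padded with NaN.
rows = [
    ["citrus fruit", "margarine", float("nan")],
    ["whole milk", float("nan"), float("nan")],
    ["yogurt", "coffee", "soda"],
]

def is_missing(x):
    """Return True only for float NaN values (the padding used for empty cells)."""
    return isinstance(x, float) and math.isnan(x)

# Keep every non-missing item, producing one list per transaction.
clean_rows = [[x for x in row if not is_missing(x)] for row in rows]

print(clean_rows)
```

Each inner list keeps only the items that were actually bought, so transactions of different lengths are preserved.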

Take a look at the `dataset`.

In [5]:
dataset

array([list(['citrus fruit', 'semi-finished bread', 'margarine', 'ready soups']),
       list(['tropical fruit', 'yogurt', 'coffee']), list(['whole milk']),
       ...,
       list(['chicken', 'citrus fruit', 'other vegetables', 'butter', 'yogurt', 'frozen dessert', 'domestic eggs', 'rolls/buns', 'rum', 'cling film/bags']),
       list(['semi-finished bread', 'bottled water', 'soda', 'bottled beer']),
       list(['chicken', 'tropical fruit', 'other vegetables', 'vinegar', 'shopping bags'])],
      dtype=object)

We then convert the lists into the format required by mlxtend using [TransactionEncoder](https://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/).

Using the `TransactionEncoder` object, we can transform this dataset into an array format suitable for typical machine learning APIs. Via the `fit` method, the `TransactionEncoder` learns the unique labels in the dataset, and via the `transform` method, it transforms the input dataset (a Python list of lists) into a one-hot encoded NumPy boolean array:

In [55]:
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset) 
df = pd.DataFrame(te_ary, columns=te.columns_) # put the transformed data back into a pandas DataFrame
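
To make the `fit` and `transform` steps more concrete, here is a pure-Python sketch of the same idea on a hypothetical toy dataset. It does not use mlxtend and is not the library's implementation; the names `toy_transactions`, `toy_columns` and `toy_onehot` are introduced only for this illustration, and sorting the labels is an assumption chosen to resemble the alphabetical column order seen in the notebook.

```python
# Toy transactions (hypothetical data, not taken from groceries.csv).
toy_transactions = [
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
]

# "fit": learn the unique item labels (sorted here, by assumption).
toy_columns = sorted({item for trans in toy_transactions for item in trans})

# "transform": one Boolean row per transaction, one entry per label.
toy_onehot = [[label in trans for label in toy_columns] for trans in toy_transactions]

print(toy_columns)
for row in toy_onehot:
    print(row)
```

Each row has the same length as `toy_columns`, and an entry is `True` exactly when the transaction contains the corresponding item.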

Let's look at the first 5 rows of the transformed dataset with [head()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), which shows the first 5 transactions.

In [19]:
df.head()

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,baby food,bags,baking powder,bathroom cleaner,beef,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


Since we have the data organized as required, we can apply the apriori algorithm.

# 3. Association Rule Mining with the Apriori algorithm

First, we define the minimum support threshold `MIN_SUPP` and apply the [apriori](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/) function.

In [8]:
MIN_SUPP = 0.02
freq_set = apriori(df, min_support=MIN_SUPP, use_colnames=True)

Let's see our result.

In [9]:
freq_set

Unnamed: 0,support,itemsets
0,0.033452,(UHT-milk)
1,0.052466,(beef)
2,0.033249,(berries)
3,0.026029,(beverages)
4,0.080529,(bottled beer)
...,...,...
117,0.032232,"(whole milk, whipped/sour cream)"
118,0.020742,"(yogurt, whipped/sour cream)"
119,0.056024,"(whole milk, yogurt)"
120,0.023183,"(whole milk, root vegetables, other vegetables)"


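We can see that we have 122 frequent itemsets. Note that they are not sorted by support: in the output above, the single-item itemsets come first, followed by the larger itemsets.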
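
To illustrate what the `support` column means and how a level-wise search finds frequent itemsets, here is a minimal pure-Python sketch on a hypothetical toy dataset. It is a simplified illustration of the Apriori idea, not mlxtend's implementation; `support`, `toy_apriori`, `toy_transactions` and `toy_min_supp` are names made up for this example.

```python
from itertools import combinations

# Toy transactions (hypothetical data).
toy_transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]
toy_min_supp = 0.5

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def toy_apriori(transactions, min_supp):
    """Level-wise search: size-k candidates are built only from frequent size-(k-1) itemsets."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Keep the candidates that reach the support threshold.
        level = {c: support(c, transactions) for c in current}
        level = {c: s for c, s in level.items() if s >= min_supp}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune every
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

toy_frequent = toy_apriori(toy_transactions, toy_min_supp)
for itemset, supp in toy_frequent.items():
    print(sorted(itemset), supp)
```

In this toy run, `{milk, butter}` appears in only one of four transactions, so it is dropped, and `{milk, bread, butter}` is pruned without being counted because one of its subsets is infrequent.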

## 3.1 Check the i-th frequent itemset

Check the frequent itemset at location 10 with [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html), and try other locations yourself.

In [44]:
freq_set.loc[[10]]

Unnamed: 0,support,itemsets
10,0.077682,(canned beer)


## 3.2 Check whether an itemset is frequent

If it is frequent, provide the location of the itemset in **freq_set**; otherwise provide "Not frequent". 

**Check whether 'beef' is frequent:**

First, we specify the itemset we want to check.

In [52]:
check_set = ['beef']

We select the index of the matching row in the frequent set with [index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.index.html), which returns the index labels as a [pandas.Index](https://pandas.pydata.org/docs/reference/api/pandas.Index.html). Calling [tolist()](https://pandas.pydata.org/docs/reference/api/pandas.Series.tolist.html) converts them into a plain Python list, which is easier to work with. Because the itemsets are compared as a `frozenset`, the order of the items in `check_set` does not matter.

In [53]:
itemset_idx = freq_set.index[freq_set['itemsets'] == frozenset(check_set)].tolist()

If the resulting list is empty, the itemset is not in the frequent set and we print "Not frequent!"; otherwise, we print its location.

In [51]:
if itemset_idx==[]: # given check_set does not exist in the frequent set
    print('Not frequent!')
else:
    print('Found at location ' + str(itemset_idx[0]))

Found at location 1


**Check whether 'whole milk, yogurt' is frequent:**

In [54]:
# specify the itemset you want to check
check_set = ['yogurt','whole milk']

# Select the idx from the frequent set based on the given check_set
itemset_idx = freq_set.index[freq_set['itemsets'] == frozenset(check_set)].tolist()

if itemset_idx==[]: # given check_set does not exist in the frequent set
    print('Not frequent!')
else:
    print('Found at location ' + str(itemset_idx[0]))

Found at location 119


**Check whether 'university, queensland' is frequent:**

Obviously, 'university, queensland' should not exist in the frequent set. Let's check the output.

In [56]:
# specify the itemset you want to check
check_set = ['university','queensland']

# Select the idx from the frequent set based on the given check_set
itemset_idx = freq_set.index[freq_set['itemsets'] == frozenset(check_set)].tolist()

if itemset_idx==[]:
    print('Not frequent!') # given check_set does not exist in the frequent set
else:
    print('Found at location ' + str(itemset_idx[0]))

Not frequent!
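
The three checks above repeat the same lookup. As an alternative sketch, the frequent itemsets could be stored in a dictionary keyed by `frozenset`, which also shows why the order of the items in `check_set` does not matter. The names `support_by_itemset` and `lookup_support` are hypothetical, and the two supports are copied from the `freq_set` output shown earlier.

```python
# Supports copied from the freq_set output shown earlier.
# In the notebook, an equivalent dictionary could be built with
# dict(zip(freq_set['itemsets'], freq_set['support'])).
support_by_itemset = {
    frozenset(["beef"]): 0.052466,
    frozenset(["whole milk", "yogurt"]): 0.056024,
}

def lookup_support(check_set):
    """Return the support of check_set, or None if it is not in the dictionary."""
    return support_by_itemset.get(frozenset(check_set))

print(lookup_support(["yogurt", "whole milk"]))      # item order does not matter
print(lookup_support(["university", "queensland"]))  # not frequent, so None
```

A dictionary lookup returns the support directly, whereas the index-based approach in the notebook returns the row location in `freq_set`.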


## 3.3 Calculation of the confidence

First, we define a function `get_itemset_support`, which returns the support of the given itemset X, or `None` if X is not among the frequent itemsets.

In [None]:
def get_itemset_support(freq_set, X):
    # Select the index from the frequent set based on the given itemset X
    itemset_idx = freq_set.index[freq_set['itemsets'] == frozenset(X)].tolist()

    if not itemset_idx:
        return None # The requested itemset X is not among the frequent itemsets
    else:
        return freq_set.loc[itemset_idx[0], 'support'] # Return the corresponding support as a scalar

Then we define a function `get_rule_confidence`, which returns the confidence of the rule $\{X\} \rightarrow \{Y\}$, or a message if either of the required supports is `None`.

In [60]:
def get_rule_confidence(freq_set, X, Y):

    itemset = X + Y # join itemset X and itemset Y
    x_support = get_itemset_support(freq_set, X) # get the support of X
    joint_support = get_itemset_support(freq_set, itemset) # get the support of X and Y together

    # Both supports must be available in the frequent set to compute the ratio
    if joint_support is None or x_support is None:
        return "Make sure the X, Y and X+Y are in the frequent list."

    # Confidence = support of (X and Y together) / support of X, printed with 6 decimals
    return "The confidence of rule {%s} -> {%s} is: %.6f" % (X, Y, joint_support / x_support)
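
For reference, the ratio computed in the last line of the function corresponds to the usual definition of confidence. The notation $\mathrm{supp}$ and $\mathrm{conf}$ is introduced only for this note:

$$
\mathrm{conf}(X \rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}
$$

Because the function looks up both supports in `freq_set`, it can only evaluate rules for which $X$ and $X \cup Y$ both reach `MIN_SUPP`.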

Let's calculate the confidence of the rule $\{X\} \rightarrow \{Y\}$.

In [61]:
# Specify the content of X and Y
X = ['yogurt', 'whole milk']
Y = ['other vegetables']

# Get the confidence
get_rule_confidence(freq_set, X, Y)

"The confidence of rule {['yogurt', 'whole milk']} -> {['other vegetables']} is: 0.397459"

In [16]:
# Specify the content of X and Y
X = ['queensland']
Y = ['university']

# Get the confidence
get_rule_confidence(freq_set, X, Y)

'Make sure the X, Y and X+Y are in the frequent list.'
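
As a small worked example of the same calculation, the sketch below computes confidence directly from raw transactions on a hypothetical toy dataset, without `freq_set`. The names `toy_support` and `toy_confidence` are made up for this illustration. It also shows that confidence is not symmetric: swapping X and Y can give a different value.

```python
# Toy transactions (hypothetical data).
toy_transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]

def toy_support(items):
    """Fraction of transactions that contain all of the given items."""
    target = set(items)
    return sum(target <= t for t in toy_transactions) / len(toy_transactions)

def toy_confidence(X, Y):
    """Support of X and Y together divided by the support of X.

    Assumes X occurs in at least one transaction; otherwise this divides by zero.
    """
    return toy_support(X + Y) / toy_support(X)

conf_bread_butter = toy_confidence(["bread"], ["butter"])  # (2/4) / (3/4)
conf_butter_bread = toy_confidence(["butter"], ["bread"])  # (2/4) / (2/4)

print(conf_bread_butter, conf_butter_bread)
```

Here two of the three transactions containing bread also contain butter, while every transaction containing butter also contains bread.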

Author: *Kaki Zhou* 12/9/2024 