# Practical 8: Association Mining

### In this practical
1. [Performing association mining](#apriori)
2. [Running and understanding results of association mining](#assoc)

---
**Written by Richi Nayak (r.nayak@qut.edu.au). All rights reserved.**

This practical introduces association rule mining using Python. You will learn how to preprocess data for association mining, build an Apriori model, and evaluate and interpret the results. Unlike the algorithms and models introduced in the first half of the semester, association rule mining works on **transactional** data. This kind of dataset does not have the label information that is mandatory in predictive data mining.

## 1. Performing Association Mining<a name = "apriori"></a>

A bank’s Marketing department is interested in examining associations between
various retail banking services used by its customers. Marketing would like to
determine both typical and atypical service combinations. The dataset is
provided in the file `bank.csv`.

These requirements suggest association mining, also known as market basket analysis. The
data for this problem usually consists of two variables: a transaction ID and an item. For each transaction, there is a list of items. For the banking
dataset, a transaction is an individual customer account, and items are the products bought by the customer. An association rule is a statement of the form (item set A) => (item set B).

Recall from the lecture that the most common association rule mining algorithm is the **Apriori algorithm**. Unfortunately, `sklearn` does not provide an
implementation of the Apriori algorithm. Therefore, we will install and use a
library called `apyori` for this task.

Open your Anaconda prompt and install the library using this command:

```bash
pip install apyori
```

Once the library is installed, we need to perform some data preprocessing on the `bank` dataset. First, load the dataset using pandas.

In [1]:
import pandas as pd

# load the bank transaction dataset
df = pd.read_csv('bank.csv')

# info and the first 10 transactions
print(df.info())
print(df.head(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32367 entries, 0 to 32366
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ACCOUNT  32367 non-null  int64 
 1   SERVICE  32367 non-null  object
 2   VISIT    32367 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 758.7+ KB
None
   ACCOUNT SERVICE  VISIT
0   500026   CKING      1
1   500026     SVG      2
2   500026     ATM      3
3   500026     ATM      4
4   500075   CKING      1
5   500075    MMDA      2
6   500075     SVG      3
7   500075     ATM      4
8   500075   TRUST      5
9   500075   TRUST      6


The BANK data set contains service information for nearly 8,000 customers. There are three variables in the data set:
1. ACCOUNT: Account number, nominal
2. SERVICE: Type of service, nominal
3. VISIT: Order of product purchased, ordinal

The BANK data set has over 32,000 rows. Each row of the data set represents a customer-service combination. Therefore, a single customer can have multiple rows in the data set, and each row represents one of the products he or she owns. The median number of products per customer is three. The 13 products are represented in the data set using the following abbreviations:
* ATM - automated teller machine debit card
* AUTO - automobile installment loan
* CCRD - credit card
* CD - certificate of deposit
* CKCRD - check/debit card
* CKING - checking account
* HMEQLC - home equity line of credit
* IRA - individual retirement account
* MMDA - money market deposit account
* MTG - mortgage
* PLOAN - personal/consumer installment loan
* SVG - saving account
* TRUST - personal trust account

As we want to generate association rules from the items purchased by each account holder, we need to group the rows by account and then build a list of all services purchased by each account.

In [2]:
# group by account, then list all services
transactions = df.groupby(['ACCOUNT'])['SERVICE'].apply(list)

print(transactions.head(5))

ACCOUNT
500026                   [CKING, SVG, ATM, ATM]
500075    [CKING, MMDA, SVG, ATM, TRUST, TRUST]
500129              [CKING, SVG, IRA, ATM, ATM]
500256               [CKING, SVG, CKCRD, CKCRD]
500341               [CKING, SVG, CKCRD, CKCRD]
Name: SERVICE, dtype: object
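
Several accounts in the output above list the same service more than once (for example, ATM appears twice for account 500026). Association mining only asks whether an item is present in a transaction, so you may optionally drop the repeats before mining. The following is a minimal sketch on a small hand-made frame; the names `toy` and `unique_transactions` are illustrative and not part of the practical. Whether duplicates affect the result of a particular Apriori implementation depends on how it counts items, so treat this as an optional clean-up step.

```python
import pandas as pd

# Toy rows mirroring the first accounts shown above (ATM appears twice).
toy = pd.DataFrame({
    'ACCOUNT': [500026, 500026, 500026, 500026, 500075, 500075],
    'SERVICE': ['CKING', 'SVG', 'ATM', 'ATM', 'CKING', 'MMDA'],
    'VISIT':   [1, 2, 3, 4, 1, 2],
})

# Keep each service once per account (first occurrence wins),
# then collect the remaining services into one list per account.
unique_transactions = (
    toy.drop_duplicates(subset=['ACCOUNT', 'SERVICE'])
       .groupby('ACCOUNT')['SERVICE']
       .apply(list)
)

print(unique_transactions)
```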


## 2. Running and understanding results of association mining <a name = "assoc"></a>

Once the `transactions` table contains all services purchased by each account number, we are ready to generate association rules. The `apriori` function in `apyori` accepts a number of parameters, mainly:
1. `transactions`: A list of transactions, each of which is a list of items (e.g. [['A', 'B'], ['B', 'C']]).
2. `min_support`: Minimum support of relations, as a float between 0 and 1. It specifies a minimum level of support to claim that items are associated (i.e. they occur together in the dataset). Default 0.1.
3. `min_confidence`: Minimum confidence of relations, as a float between 0 and 1. Default 0.0.
4. `min_lift`: Minimum lift of relations, as a float. Lift is a ratio rather than a proportion, so it can exceed 1. Default 0.0.
5. `max_length`: Maximum length of the relations. Default None.

Note: `min_support` and `min_confidence` control the number and types of rules generated. If you are interested in associations that involve fairly rare products, you should consider reducing `min_support`. If you obtain too many rules to be practically useful, you should consider raising `min_support` and `min_confidence` as a possible solution.
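
To get a feel for how the support threshold changes the number of item sets that survive, here is a small pure-Python sketch. It uses brute-force enumeration rather than the Apriori pruning strategy, and the `toy_transactions` data and the `frequent_itemsets` helper are hypothetical names made up for illustration; they are not part of `apyori`.

```python
from itertools import combinations

# Hypothetical toy transactions (not taken from bank.csv).
toy_transactions = [
    {'CKING', 'SVG', 'ATM'},
    {'CKING', 'SVG'},
    {'CKING', 'ATM'},
    {'CKING', 'CD'},
    {'SVG', 'CD'},
]

def frequent_itemsets(transactions, min_support, max_length=2):
    """Return {itemset: support} for item sets with support >= min_support.

    Brute-force enumeration for illustration only; Apriori reaches the
    same item sets more efficiently by pruning candidates.
    """
    items = sorted(set().union(*transactions))
    n = len(transactions)
    found = {}
    for size in range(1, max_length + 1):
        for candidate in combinations(items, size):
            # Fraction of transactions that contain every item in the candidate.
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / n
            if support >= min_support:
                found[candidate] = support
    return found

low = frequent_itemsets(toy_transactions, min_support=0.3)
high = frequent_itemsets(toy_transactions, min_support=0.5)

# A lower threshold keeps more item sets than a higher one.
print(len(low), len(high))
```

Raising the threshold from 0.3 to 0.5 removes every pair and the rarer single items in this toy data, which is the same effect you should expect, at a larger scale, when tuning `min_support` on the bank data.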

We will run the `apyori` model with the pre-processed transactions and min_support of 0.05.

In [3]:
from apyori import apriori

# type cast the transactions from pandas into normal list format and run apriori
transaction_list = list(transactions)
results = list(apriori(transaction_list, min_support=0.05))

# print first 5 rules
print(results[:5])

[RelationRecord(items=frozenset({'ATM'}), support=0.3845576273307471, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'ATM'}), confidence=0.3845576273307471, lift=1.0)]), RelationRecord(items=frozenset({'AUTO'}), support=0.09285446126892755, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'AUTO'}), confidence=0.09285446126892755, lift=1.0)]), RelationRecord(items=frozenset({'CCRD'}), support=0.154799149042673, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'CCRD'}), confidence=0.154799149042673, lift=1.0)]), RelationRecord(items=frozenset({'CD'}), support=0.24527593542735576, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'CD'}), confidence=0.24527593542735576, lift=1.0)]), RelationRecord(items=frozenset({'CKCRD'}), support=0.11300212739331748, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'CKCRD'}), confidence=0.1

The output looks very cluttered. The following function can be used to print the rules neatly. We will not explain in depth how it works, as it is not essential for your learning objectives, but we have included some comments to help you out.

In [4]:
def convert_apriori_results_to_pandas_df(results):
    rules = []
    
    for rule_set in results:
        for rule in rule_set.ordered_statistics:
            # items_base = left side of rules, items_add = right side
            # support, confidence and lift for respective rules
            rules.append([','.join(rule.items_base), ','.join(rule.items_add),
                         rule_set.support, rule.confidence, rule.lift]) 
    
    # typecast it to pandas df
    return pd.DataFrame(rules, columns=['Left_side', 'Right_side', 'Support', 'Confidence', 'Lift']) 

result_df = convert_apriori_results_to_pandas_df(results)

print(result_df.head(20))

   Left_side  Right_side   Support  Confidence      Lift
0                    ATM  0.384558    0.384558  1.000000
1                   AUTO  0.092854    0.092854  1.000000
2                   CCRD  0.154799    0.154799  1.000000
3                     CD  0.245276    0.245276  1.000000
4                  CKCRD  0.113002    0.113002  1.000000
5                  CKING  0.857840    0.857840  1.000000
6                 HMEQLC  0.164685    0.164685  1.000000
7                    IRA  0.108372    0.108372  1.000000
8                   MMDA  0.174446    0.174446  1.000000
9                    MTG  0.074334    0.074334  1.000000
10                   SVG  0.618696    0.618696  1.000000
11                ATM,CD  0.071581    0.071581  1.000000
12       ATM          CD  0.071581    0.186137  0.758889
13        CD         ATM  0.071581    0.291837  0.758889
14             ATM,CKING  0.361907    0.361907  1.000000
15       ATM       CKING  0.361907    0.941100  1.097058
16     CKING         ATM  0.361

The table contains statistics of support, confidence and lift for each of the rules.

Consider the rule A ⇒ B. Recall the following concepts from lecture:
* Support of A ⇒ B is the probability that a customer has both A and B.
* Confidence of A ⇒ B is the probability that a customer has B given that the customer has A.
* Expected confidence (not shown here) of A ⇒ B is the probability that a customer has B.
* Lift of A ⇒ B is a measure of the strength of the association, computed as confidence divided by expected confidence. A lift of 2 for the rule A ⇒ B indicates that a customer having A is twice as likely to have B as a customer chosen at random.
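
The four definitions above can be computed directly from a list of transactions. The sketch below does so on a hand-made example; `rule_metrics` and the `toy` data are illustrative names, not functions or data from `apyori` or the bank dataset, and the function assumes the left-hand side occurs in at least one transaction.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, expected confidence and lift of antecedent => consequent."""
    a, b = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= set(t))          # has A
    n_b = sum(1 for t in transactions if b <= set(t))          # has B
    n_ab = sum(1 for t in transactions if (a | b) <= set(t))   # has A and B

    support = n_ab / n
    confidence = n_ab / n_a            # assumes A occurs at least once
    expected_confidence = n_b / n
    lift = confidence / expected_confidence
    return {
        'support': support,
        'confidence': confidence,
        'expected_confidence': expected_confidence,
        'lift': lift,
    }

# Hypothetical toy transactions for the rule A => B.
toy = [['A', 'B'], ['A', 'B'], ['A', 'C'], ['B'], ['C']]
metrics = rule_metrics(toy, ['A'], ['B'])
print(metrics)
```

You can check the same relationship against the table above: for ATM ⇒ CKING, the confidence of 0.941100 divided by the support of CKING alone (0.857840, its expected confidence) gives the reported lift of about 1.097.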

In a typical setting, we would like to view the rules ordered by lift. Sort the rules using the code below:

In [5]:
# sort all acquired rules descending by lift
result_df = result_df.sort_values(by='Lift', ascending=False)
print(result_df.head(10))

         Left_side     Right_side   Support  Confidence      Lift
131          CKCRD     CKING,CCRD  0.055813    0.493909  3.325045
134     CKING,CCRD          CKCRD  0.055813    0.375737  3.325045
33            CCRD          CKCRD  0.055813    0.360550  3.190645
130           CCRD    CKING,CKCRD  0.055813    0.360550  3.190645
135    CKING,CKCRD           CCRD  0.055813    0.493909  3.190645
34           CKCRD           CCRD  0.055813    0.493909  3.190645
203     HMEQLC,SVG      ATM,CKING  0.060944    0.546577  1.510268
198      ATM,CKING     HMEQLC,SVG  0.060944    0.168396  1.510268
196         HMEQLC  ATM,CKING,SVG  0.060944    0.370061  1.489001
205  ATM,CKING,SVG         HMEQLC  0.060944    0.245217  1.489001


The highest-lift rule is *checking* and *credit card* implies *check card* (lift of about 3.33), tied with its reverse, *check card* implies *checking* and *credit card*. This is not surprising given that many check cards include credit card logos.
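
The rule table also contains rows with an empty left side, which simply report the support of an item set and always have a lift of 1, as well as rules with a lift below 1, where having the left side makes the right side less likely than average. If you only want rules with a positive association, one possible approach is to filter the DataFrame before sorting. The sketch below uses a few rows copied from the outputs above; `sample`, `proper_rules` and `positive` are illustrative names.

```python
import pandas as pd

# A few rows copied from the outputs above; the full result has many more.
sample = pd.DataFrame(
    [
        ['', 'ATM', 0.384558, 0.384558, 1.000000],
        ['ATM', 'CKING', 0.361907, 0.941100, 1.097058],
        ['ATM', 'CD', 0.071581, 0.186137, 0.758889],
        ['CKCRD', 'CCRD', 0.055813, 0.493909, 3.190645],
    ],
    columns=['Left_side', 'Right_side', 'Support', 'Confidence', 'Lift'],
)

# Drop rows whose left side is empty (item-set support only, not a real rule).
proper_rules = sample[sample['Left_side'] != '']

# Keep rules with lift above 1 and show the strongest first.
positive = proper_rules[proper_rules['Lift'] > 1.0].sort_values(
    by='Lift', ascending=False
)

print(positive)
```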

## End Notes

In these notes, we learned to build an association mining model using the `apyori` library and explored the rules it produced. Association rule mining, also known as market basket analysis, infers rules from a transactional dataset. The value of the generated rules is gauged by lift, confidence and support.