<a href="https://colab.research.google.com/github/gulabpatel/Pycaret/blob/main/Association_Rule_Mining_Tutorial_ARUL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  <span style="color:orange">Association Rule Mining Tutorial (ARUL101)</span>

**Date Updated: Feb 25, 2020**

# 1.0 Tutorial Objective
Welcome to the Association Rule Mining Tutorial (#ARUL101). This tutorial assumes that you are new to PyCaret and want to get started with association rule mining using the `pycaret.arules` module.

In this tutorial we will learn:


* **Getting Data:**  How to import data from the PyCaret repository
* **Setting up Environment:**  How to set up an experiment in PyCaret and get started with association rule mining
* **Create Model:**  How to create a model and evaluate the results
* **Plot Model:**  How to analyze the resulting rules using various plots

Read Time : Approx. 15 Minutes


## 1.1 Installing PyCaret
The first step to get started with PyCaret is to install pycaret. Installation is easy and will only take a few minutes. Follow the instructions below:

#### Installing PyCaret in Local Jupyter Notebook
`pip install pycaret`  <br />

#### Installing PyCaret on Google Colab or Azure Notebooks
`!pip install pycaret`


## 1.2 Pre-Requisites
- Python 3.x
- Latest version of pycaret
- Internet connection to load data from pycaret's repository
- Basic Knowledge of Association Rule Mining

## 1.3 For Google Colab users:
If you are running this notebook on Google Colab, run the following code at the top of your notebook to display interactive visuals.<br/>
<br/>
`from pycaret.utils import enable_colab` <br/>
`enable_colab()`

In [1]:
!pip install pycaret


In [2]:
from pycaret.utils import enable_colab
enable_colab()

Colab mode enabled.


# 2.0 What is Association Rule Mining?
Association rule learning is a rule-based machine learning method for discovering interesting relationships between variables in large databases. It is intended to identify strong rules using measures of interestingness. For example, the rule {onions, potatoes} --> {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy a burger. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements.
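To make the {onions, potatoes} --> {burger} example concrete, here is a minimal sketch that computes the support and confidence of that rule. The five baskets are invented purely for illustration, and `support` is a helper defined here, not a PyCaret function.

```python
# Toy baskets, invented for illustration only (not taken from any real dataset).
transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "burger"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


antecedent = {"onions", "potatoes"}
consequent = {"burger"}

# Support of the rule: how often antecedent and consequent occur together.
rule_support = support(antecedent | consequent, transactions)
# Confidence: of the baskets containing the antecedent, the share that
# also contain the consequent.
rule_confidence = rule_support / support(antecedent, transactions)

print(f"support    = {rule_support:.2f}")
print(f"confidence = {rule_confidence:.2f}")
```

In this toy data the rule holds in two of the three baskets that contain both onions and potatoes, which is what the confidence value expresses.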

__[Learn More about Association Rule Mining](https://en.wikipedia.org/wiki/Association_rule_learning)__

# 3.0 Overview of Association Rule Module in PyCaret
PyCaret's association rule module (`pycaret.arules`) is an unsupervised machine learning module used for discovering interesting relationships between variables in a dataset. It requires no target labels. The module automatically transforms a transactional database into a shape that is acceptable for the apriori algorithm, which is used for frequent item set mining and association rule learning over relational databases.
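For readers who have not met the apriori algorithm, the following is a rough, simplified sketch of its level-wise idea: an itemset can only be frequent if all of its subsets are frequent. The function `frequent_itemsets` and the toy baskets are assumptions made for illustration; this is not the implementation PyCaret uses internally.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Level-wise (Apriori-style) search for itemsets with support >= min_support."""
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    items = sorted({item for basket in baskets for item in basket})
    current = [frozenset([item]) for item in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = [
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent


# Toy baskets, invented for illustration only.
baskets = [
    {"cups", "plates", "napkins"},
    {"cups", "plates"},
    {"cups", "napkins"},
    {"plates", "napkins", "postage"},
    {"cups", "plates", "napkins"},
]

result = frequent_itemsets(baskets, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

Rules are then derived from these frequent itemsets by splitting each one into an antecedent and a consequent.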

# 4.0 Dataset for the Tutorial

For this tutorial we will use a small sample from the UCI dataset called **Online Retail Dataset**. This is a transactional dataset which contains records occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered online retailer. The company mainly sells unique all-occasion gifts with many customers being wholesalers. Short descriptions of each column are as follows:

- **InvoiceNo:** Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- **StockCode:** Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- **Description:** Product (item) name. Nominal.
- **Quantity:** The quantity of each product (item) per transaction. Numeric.
- **InvoiceDate:** Invoice date and time. Numeric, the day and time when each transaction was generated.
- **UnitPrice:** Unit price. Numeric, Product price per unit in sterling.
- **CustomerID:** Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- **Country:** Country name. Nominal, the name of the country where each customer resides.

#### Dataset Acknowledgement:
Dr Daqing Chen, Director: Public Analytics group. chend@lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.


The original dataset and data dictionary can be __[found here.](http://archive.ics.uci.edu/ml/datasets/online+retail)__ 

# 5.0 Getting the Data

You can download the data from the original source __[found here](http://archive.ics.uci.edu/ml/datasets/online+retail)__ and load it using pandas __[(Learn How)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)__, or you can use PyCaret's data repository to load the data using the `get_data()` function (this requires an internet connection).

In [3]:
from pycaret.datasets import get_data
data = get_data('france')

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536370,22728,ALARM CLOCK BAKELIKE PINK,24,12/1/2010 8:45,3.75,12583.0,France
1,536370,22727,ALARM CLOCK BAKELIKE RED,24,12/1/2010 8:45,3.75,12583.0,France
2,536370,22726,ALARM CLOCK BAKELIKE GREEN,12,12/1/2010 8:45,3.75,12583.0,France
3,536370,21724,PANDA AND BUNNIES STICKER SHEET,12,12/1/2010 8:45,0.85,12583.0,France
4,536370,21883,STARS GIFT TAPE,24,12/1/2010 8:45,0.65,12583.0,France


**Note:** If you are downloading the data from original source, you will need to filter the `Country` for 'France' if you wish to reproduce the results in this experiment.

In [4]:
#check the shape of data
data.shape

(8557, 8)

# 6.0 Setting up Environment in PyCaret

The `setup()` function initializes the environment in pycaret and transforms the transactional dataset into a shape that is acceptable to the apriori algorithm. It requires two mandatory parameters: `transaction_id` which is the name of the column representing the transaction id and will be used to pivot the matrix, and `item_id` which is the name of the column used for the creation of rules. Normally, this will be the variable of interest. You can also pass the optional parameter `ignore_items` to ignore certain values when creating rules.
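As a rough picture of what "a shape that is acceptable to the apriori algorithm" means, the sketch below pivots long-format rows of (transaction id, item) into one boolean row per transaction. The helper `to_basket_matrix`, the transaction ids, and the item names are all hypothetical; PyCaret's internal transformation may differ in its details.

```python
from collections import defaultdict

# Long-format rows of (transaction id, item), invented for illustration only.
rows = [
    ("T1", "clock"),
    ("T1", "tape"),
    ("T2", "tape"),
    ("T2", "stickers"),
    ("T2", "tape"),  # a repeated line item within the same transaction
    ("T3", "clock"),
]


def to_basket_matrix(rows):
    """Pivot (transaction_id, item) rows into a boolean transaction-by-item matrix."""
    baskets = defaultdict(set)
    for transaction_id, item in rows:
        baskets[transaction_id].add(item)  # sets collapse repeated items
    items = sorted({item for _, item in rows})
    matrix = {
        transaction_id: [item in basket for item in items]
        for transaction_id, basket in baskets.items()
    }
    return items, matrix


items, matrix = to_basket_matrix(rows)
print(items)
for transaction_id, flags in matrix.items():
    print(transaction_id, flags)
```

Here the first element of each tuple plays the role of `transaction_id` and the second plays the role of `item_id`.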

In [5]:
from pycaret.arules import *

In [6]:
exp_arul101 = setup(data = data, 
                    transaction_id = 'InvoiceNo',
                    item_id = 'Description') 

Description,Value
session_id,5685.0
# Transactions,461.0
# Items,1565.0
Ignore Items,


Once the setup has been successfully executed, it prints an information grid which contains several important pieces of information.

- **# Transactions :**  Unique number of transactions in the dataset. In this case unique `InvoiceNo`. <br/>
<br/>
- **# Items :** Unique number of items in the dataset. In this case `Description`. <br/>
<br/>
- **Ignore Items :** The items to be ignored in rule mining. Some relationships are so obvious that they add nothing to the analysis. For example, many transactional datasets contain a shipping cost line item whose association with other items is uninformative; such items can be excluded in `setup()` using the `ignore_items` parameter. In this tutorial we will run `setup()` twice, first without ignoring any items and later with ignored items. <br/>
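Conceptually, ignoring an item amounts to dropping it from every basket before any rules are mined. The sketch below illustrates that idea with a hypothetical helper `build_baskets` and invented rows (the transaction ids are made up); it is an assumption about the effect of `ignore_items`, not PyCaret's actual code.

```python
def build_baskets(rows, ignore_items=()):
    """Group (transaction_id, item) rows into baskets, skipping ignored items."""
    ignored = set(ignore_items)
    baskets = {}
    for transaction_id, item in rows:
        # Register the transaction even if all of its items end up ignored
        # (a choice made for this sketch).
        basket = baskets.setdefault(transaction_id, set())
        if item not in ignored:
            basket.add(item)
    return baskets


# Invented rows for illustration; the transaction ids are hypothetical.
rows = [
    ("T1", "JUMBO BAG WOODLAND ANIMALS"),
    ("T1", "POSTAGE"),
    ("T2", "POSTAGE"),
    ("T2", "STARS GIFT TAPE"),
]

with_postage = build_baskets(rows)
without_postage = build_baskets(rows, ignore_items=["POSTAGE"])

print(with_postage)
print(without_postage)
```

Because the ignored item never enters a basket, it cannot appear in any antecedent or consequent.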

# 7.0 Create a Model

Creating an association rule model is simple. `create_model()` requires no mandatory parameters but accepts four optional parameters:

- **metric:** Metric used to evaluate whether a rule is of interest. Default is `'confidence'`. Other available metrics are `'support'`, `'lift'`, `'leverage'`, and `'conviction'`. <br/>
<br/>
- **threshold:** Minimal threshold for the evaluation metric, via the `metric` parameter, to decide whether a candidate rule is of interest. Default is set to `0.5`. <br/>
<br/>
- **min_support:** A float between 0 and 1 for the minimum support of the itemsets returned. The support is computed as the fraction `transactions_where_item(s)_occur / total_transactions`. Default is set to `0.05`. <br/>
<br/>
- **round:** Number of decimal places to which the metrics in the score grid are rounded. <br/>
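The sketch below shows how the five metrics relate to one another and how `metric`, `threshold`, and `min_support` could act together as a filter. It uses the standard textbook definitions; the helpers `rule_metrics` and `is_of_interest` and the counts passed to them are hypothetical, and PyCaret's internals are not reproduced here.

```python
import math


def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Standard association rule metrics computed from raw transaction counts."""
    antecedent_support = n_antecedent / n_total
    consequent_support = n_consequent / n_total
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = math.inf
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


def is_of_interest(metrics, metric="confidence", threshold=0.5, min_support=0.05):
    """Keep a rule if it is frequent enough and its chosen metric meets the threshold."""
    return metrics["support"] >= min_support and metrics[metric] >= threshold


# Hypothetical counts: 100 transactions, 20 with the antecedent,
# 40 with the consequent, and 15 with both.
m = rule_metrics(n_total=100, n_antecedent=20, n_consequent=40, n_both=15)
print({name: round(value, 4) for name, value in m.items()})
print(is_of_interest(m))
print(is_of_interest(m, metric="lift", threshold=2.0))
```

The same rule can pass under one metric and fail under another, so the choice of `metric` and `threshold` directly shapes which rules are returned.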

Let's create an association rule model with all default values.

In [7]:
model1 = create_model() #model created and stored in model1 variable.

In [8]:
print(model1.shape) #141 rules created.

(141, 9)


In [9]:
model1.head() #see the rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(JUMBO BAG WOODLAND ANIMALS),(POSTAGE),0.0651,0.6746,0.0651,1.0,1.4823,0.0212,inf
1,"(SET/6 RED SPOTTY PAPER PLATES, SET/20 RED RET...",(SET/6 RED SPOTTY PAPER CUPS),0.0868,0.1171,0.0846,0.975,8.3236,0.0744,35.3145
2,"(SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE...",(SET/6 RED SPOTTY PAPER PLATES),0.0868,0.1085,0.0846,0.975,8.9895,0.0752,35.6616
3,"(SET/6 RED SPOTTY PAPER PLATES, SET/20 RED RET...",(SET/6 RED SPOTTY PAPER CUPS),0.0716,0.1171,0.0694,0.9697,8.2783,0.061,29.1345
4,"(SET/20 RED RETROSPOT PAPER NAPKINS , POSTAGE,...",(SET/6 RED SPOTTY PAPER PLATES),0.0716,0.1085,0.0694,0.9697,8.9406,0.0617,29.4208
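The metric columns of row 0 in the grid above can be re-derived by hand. In the sketch below, the transaction counts (30, 311, 30) are not read from the data; they are inferred by multiplying the rounded supports by the 461 transactions reported by `setup()`, so treat them as an assumption. The formulas are the standard definitions of each metric.

```python
import math

n_transactions = 461  # "# Transactions" from the setup grid
n_antecedent = 30     # inferred: 0.0651 * 461 is about 30 (JUMBO BAG WOODLAND ANIMALS)
n_consequent = 311    # inferred: 0.6746 * 461 is about 311 (POSTAGE)
n_both = 30           # inferred: 0.0651 * 461 is about 30

antecedent_support = n_antecedent / n_transactions
consequent_support = n_consequent / n_transactions
support = n_both / n_transactions

confidence = n_both / n_antecedent
lift = confidence / consequent_support
leverage = support - antecedent_support * consequent_support
# Every basket with the antecedent also has the consequent, so the
# denominator (1 - confidence) is zero and conviction is infinite.
if confidence == 1:
    conviction = math.inf
else:
    conviction = (1 - consequent_support) / (1 - confidence)

print("antecedent support:", round(antecedent_support, 4))
print("consequent support:", round(consequent_support, 4))
print("confidence:", round(confidence, 4))
print("lift:", round(lift, 4))
print("leverage:", round(leverage, 4))
print("conviction:", conviction)
```

Although the confidence is perfect, the lift is modest because the consequent is already present in about two-thirds of all transactions.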


# 8.0 Setup with `ignore_items`

In `model1` created above, notice that the top rule, `JUMBO BAG WOODLAND ANIMALS` --> `POSTAGE`, is very obvious: `POSTAGE` has a consequent support of 0.6746, meaning it appears in roughly two-thirds of all transactions. In the example below, we will use the `ignore_items` parameter in `setup()` to ignore `POSTAGE` and re-create the association rule model.

In [10]:
exp_arul101 = setup(data = data, 
                    transaction_id = 'InvoiceNo',
                    item_id = 'Description',
                    ignore_items = ['POSTAGE']) 

Description,Value
session_id,6507
# Transactions,461
# Items,1565
Ignore Items,['POSTAGE']


In [11]:
model2 = create_model()

In [12]:
print(model2.shape) #notice how only 45 rules are created vs. 141 above.

(45, 9)


In [13]:
model2.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,"(SET/6 RED SPOTTY PAPER PLATES, SET/20 RED RET...",(SET/6 RED SPOTTY PAPER CUPS),0.0868,0.1171,0.0846,0.975,8.3236,0.0744,35.3145
1,"(SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE...",(SET/6 RED SPOTTY PAPER PLATES),0.0868,0.1085,0.0846,0.975,8.9895,0.0752,35.6616
2,(SET/6 RED SPOTTY PAPER PLATES),(SET/6 RED SPOTTY PAPER CUPS),0.1085,0.1171,0.1041,0.96,8.1956,0.0914,22.0716
3,(CHILDRENS CUTLERY SPACEBOY ),(CHILDRENS CUTLERY DOLLY GIRL ),0.0586,0.0629,0.0542,0.9259,14.719,0.0505,12.6508
4,(SET/6 RED SPOTTY PAPER CUPS),(SET/6 RED SPOTTY PAPER PLATES),0.1171,0.1085,0.1041,0.8889,8.1956,0.0914,8.0239


# 9.0 Plot Model

In [14]:
plot_model(model2)

In [15]:
plot_model(model2, plot = '3d')