# Association Rule Mining with mlxtend

## Introduction

Association rule mining is a powerful technique for discovering interesting relationships, patterns, and associations among a set of items in large datasets. It is widely used in market basket analysis, recommendation systems, and various fields where understanding the co-occurrence of items is valuable.

### What is mlxtend?

`mlxtend` (Machine Learning Extensions) is a Python library that provides a range of tools and extensions for data science and machine learning. Among its many features, `mlxtend` includes robust implementations for frequent itemset generation and association rule mining.

### Key Concepts in Association Rule Mining:

- **Frequent Itemsets**: Sets of items that appear together in a dataset with a frequency above a specified threshold.
- **Support**: The proportion of transactions in the dataset that contain a particular itemset.
- **Confidence**: The likelihood that a rule's consequent is present in transactions that contain the rule's antecedent.
- **Lift**: The ratio of the observed support to the expected support if the items were independent.
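
To make these definitions concrete, here is a minimal sketch that computes support, confidence, and lift by hand. The five toy transactions and the helper names (`support`, `confidence`, `lift`) are assumptions made for illustration; they are not part of `mlxtend` or of the dataset used later in this notebook.

```python
# Toy transactions (illustrative only, not the notebook's dataset).
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Bread"},
    {"Milk"},
    {"Butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """confidence(antecedent -> consequent) / support(consequent)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

s = support({"Milk", "Bread"}, transactions)       # 2 of 5 transactions
c = confidence({"Milk"}, {"Bread"}, transactions)  # 2 of the 3 Milk transactions
l = lift({"Milk"}, {"Bread"}, transactions)        # confidence divided by support(Bread) = 3/5
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

Here the lift is slightly above 1, so in this toy data Milk and Bread co-occur a little more often than independence would predict.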

### Why Use Association Rule Mining?

- **Market Basket Analysis**: Understand which products are often bought together to optimize product placement, promotions, and inventory management.
- **Recommendation Systems**: Suggest items to users based on their past behaviors and the behaviors of similar users.
- **Anomaly Detection**: Identify unusual patterns or associations that could indicate errors, fraud, or other significant events.


---

# 1. Setup and Installation
First, make sure the required libraries are installed. You'll need `mlxtend` and `openpyxl`.


In [12]:
!pip install mlxtend
!pip install openpyxl


## Import libraries

In [37]:
import os
import pandas as pd
import json
import warnings
import sys
import site
import requests
import zipfile
import random
sys.path.append(site.getusersitepackages())

import ipywidgets as widgets
from IPython.display import display, clear_output, Markdown
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules
from mlxtend.preprocessing import TransactionEncoder
from io import BytesIO

warnings.filterwarnings("ignore", category=DeprecationWarning)


---
# Load dataset

Generate a small dataset:
1. Creates 20 unique order IDs.
2. Defines a list of 10 items.
3. Defines a list of counts (quantities).
4. Randomly selects an order ID, item, and count 100 times to create the dataset.

Create a DataFrame: Converts the generated data into a pandas DataFrame and displays the first few rows.

In [38]:
# Generate a small dataset with 100 rows
order_ids = [f'Order{str(i).zfill(3)}' for i in range(1, 21)]  # 20 unique orders
items = ['Milk', 'Bread', 'Butter', 'Eggs', 'Cheese', 'Jam', 'Juice', 'Apples', 'Bananas', 'Chicken']
counts = [1, 2, 3, 4, 5]

data = []

for _ in range(100):
    order_id = random.choice(order_ids)
    item = random.choice(items)
    count = random.choice(counts)
    data.append([order_id, item, count])

# Create a DataFrame
df = pd.DataFrame(data, columns=['OrderID', 'Item', 'Count'])

# Display the first few rows of the dataset
df.head()


|   | OrderID  | Item    | Count |
|---|----------|---------|-------|
| 0 | Order008 | Apples  | 4     |
| 1 | Order003 | Milk    | 3     |
| 2 | Order010 | Bananas | 5     |
| 3 | Order007 | Eggs    | 4     |
| 4 | Order008 | Milk    | 2     |
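
Because the rows are drawn at random, the preview above changes on every run. If reproducible output is wanted, one option is to draw from a seeded generator. The sketch below is an assumption layered on the notebook's approach: the `generate_rows` helper and the seed value are hypothetical, and the rows it returns can be passed to `pd.DataFrame` exactly as before.

```python
import random

def generate_rows(seed, n_rows=100):
    """Build [order_id, item, count] rows using a private, seeded generator."""
    rng = random.Random(seed)  # local generator; the global `random` state is untouched
    order_ids = [f"Order{i:03d}" for i in range(1, 21)]  # same IDs as str(i).zfill(3)
    items = ['Milk', 'Bread', 'Butter', 'Eggs', 'Cheese',
             'Jam', 'Juice', 'Apples', 'Bananas', 'Chicken']
    counts = [1, 2, 3, 4, 5]
    return [
        [rng.choice(order_ids), rng.choice(items), rng.choice(counts)]
        for _ in range(n_rows)
    ]

rows_a = generate_rows(seed=42)
rows_b = generate_rows(seed=42)
print(rows_a == rows_b)  # the same seed yields the same rows
```

Note that the same order and item can be drawn more than once, so the dataset may contain repeated (order, item) pairs; the transformation step later collapses these into a single presence flag.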


---
# Association rules using Apriori and FP-Growth algorithms

**Steps to Calculate and Display Results**:
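1. **Transform the Dataset**: The dataset is converted into a one-hot table with one row per transaction and one boolean column per item, the format expected for association rule mining.
2. **Run Apriori Algorithm**: The algorithm identifies frequent itemsets that meet the specified minimum support, and association rules are then generated from those itemsets.
3. **Run FP-Growth Algorithm**: This algorithm finds the same frequent itemsets as Apriori but builds a compact prefix tree (FP-tree) instead of generating candidate itemsets, which is usually faster on larger datasets. Rules are generated from its itemsets in the same way.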
4. **Display Results**:
   - **Frequent Itemsets**: A table showing the itemsets and their support values.
   - **Association Rules**: A table showing the antecedents, consequents, and various metrics such as support, confidence, and lift.

After clicking the "Calculate" button, the notebook will:
- Transform the dataset.
- Run both Apriori and FP-Growth algorithms.
- Display the frequent itemsets and association rules along with their metrics.
- Provide a summary of the number of frequent itemsets and association rules found by each algorithm.
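
The first two steps can be sketched without any library to show what is being computed. The code below groups rows into baskets and then runs a simplified level-wise search for frequent itemsets. It is an illustrative sketch only, not how `mlxtend` implements `apriori` or `fpgrowth`; the sample rows and the `frequent_itemsets` helper are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (order_id, item, count) rows in the same shape as the notebook's dataset.
rows = [
    ("Order001", "Milk", 2), ("Order001", "Bread", 1),
    ("Order002", "Milk", 1), ("Order002", "Bread", 3), ("Order002", "Eggs", 1),
    ("Order003", "Eggs", 4), ("Order003", "Milk", 1),
    ("Order004", "Bread", 2),
]

# Step 1: one set of items per transaction; quantities are ignored.
baskets = defaultdict(set)
for order_id, item, _count in rows:
    baskets[order_id].add(item)

def frequent_itemsets(baskets, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(baskets)
    items = sorted({item for basket in baskets.values() for item in basket})
    result = {}
    k = 1
    while len(items) >= k:
        level = {}
        for candidate in combinations(items, k):
            count = sum(set(candidate) <= basket for basket in baskets.values())
            if count / n >= min_support:
                level[frozenset(candidate)] = count / n
        if not level:
            break
        result.update(level)
        # Only items that survived this level can appear in a larger frequent itemset.
        items = sorted(set().union(*level))
        k += 1
    return result

found = frequent_itemsets(baskets, min_support=0.5)
for itemset, supp in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

The pruning step relies on the fact that every subset of a frequent itemset is itself frequent, which is the property Apriori exploits to avoid testing all possible combinations.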

In [39]:
# Explanation markdowns
display(Markdown("""
### Step 1: Select Columns for Grouping and Item Generation

- **Transaction Column**: This is typically a unique identifier for each transaction or entity. It could be an order ID, transaction ID, or customer ID.
- **Item Column**: The feature you want to analyze for association rules. It could be the product name, item code, etc.

Please select the appropriate columns from the dropdowns below:
"""))

# User selects columns for grouping and item generation
transaction_col = widgets.Dropdown(
    options=df.columns,
    description='Transaction:',
    value=None
)
item_col = widgets.Dropdown(
    options=df.columns,
    description='Item:',
    value=None
)
display(transaction_col, item_col)

# Explanation markdowns for minimum support
display(Markdown("""
### Step 2: Enter Minimum Support Value

- **Minimum Support**: This value determines the threshold for an itemset to be considered frequent. It is the proportion of transactions that contain the itemset.
- Standard values range from 0.1 (10%) to 0.5 (50%) depending on the dataset and the desired granularity of the rules.

Please enter the minimum support value:
"""))

min_support = widgets.FloatText(
    value=0.1,
    description='Min Support:'
)
display(min_support)

# Button to trigger the transformation and calculation
calculate_button = widgets.Button(description="Calculate", button_style='info')
busy_indicator = widgets.Output()

# Placeholder for transformed dataset and results
transformed_output = widgets.Output()
apriori_output = widgets.Output()
fpgrowth_output = widgets.Output()

def transform_dataset(df, transaction_col, item_col):
    df_copy = df[[transaction_col, item_col]].dropna().copy()
    df_copy.columns = ['Transaction', 'Item']
    df_trans = df_copy.groupby(['Transaction', 'Item']).size().unstack(fill_value=0)
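    # One-hot booleans: True if the item appears in the transaction at least once.
    # A vectorized comparison replaces DataFrame.applymap, which newer pandas versions deprecate.
    df_trans = df_trans > 0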
    return df_trans

def on_calculate_button_clicked(b):
    with busy_indicator:
        busy_indicator.clear_output()
        print("Calculating...")

        transformed_output.clear_output()
        apriori_output.clear_output()
        fpgrowth_output.clear_output()

        with transformed_output:
            df_transformed = transform_dataset(df, transaction_col.value, item_col.value)
            display(Markdown("### Transformed Dataset Preview"))
            display(df_transformed.head())
            display(Markdown("### Step 3: Run Algorithms and Display Results"))

            # Run Apriori Algorithm
            try:
                frequent_itemsets_apriori, rules_apriori = apriori_algorithm(df_transformed, min_support.value)
                with apriori_output:
                    apriori_output.clear_output()
                    display(Markdown("#### Frequent Itemsets using Apriori Algorithm"))
                    display(frequent_itemsets_apriori)
                    display(Markdown("#### Association Rules using Apriori Algorithm"))
                    display(rules_apriori)
            except ValueError as e:
                with apriori_output:
                    apriori_output.clear_output()
                    display(Markdown(f"**Error running Apriori algorithm: {str(e)}**"))

            # Run FP-Growth Algorithm
            try:
                frequent_itemsets_fpgrowth, rules_fpgrowth = fpgrowth_algorithm(df_transformed, min_support.value)
                with fpgrowth_output:
                    fpgrowth_output.clear_output()
                    display(Markdown("#### Frequent Itemsets using FP-Growth Algorithm"))
                    display(frequent_itemsets_fpgrowth)
                    display(Markdown("#### Association Rules using FP-Growth Algorithm"))
                    display(rules_fpgrowth)
            except ValueError as e:
                with fpgrowth_output:
                    fpgrowth_output.clear_output()
                    display(Markdown(f"**Error running FP-Growth algorithm: {str(e)}**"))

            # Summary: counts fall back to 0 if an algorithm raised before assigning its results.
            # The lines are joined without leading spaces so Markdown renders a list, not a code block.
            display(Markdown("### Summary of the Association Rule Mining"))
            summary = "\n".join([
                f"- Total frequent itemsets found using Apriori: {len(frequent_itemsets_apriori) if 'frequent_itemsets_apriori' in locals() else 0}",
                f"- Total association rules found using Apriori: {len(rules_apriori) if 'rules_apriori' in locals() else 0}",
                f"- Total frequent itemsets found using FP-Growth: {len(frequent_itemsets_fpgrowth) if 'frequent_itemsets_fpgrowth' in locals() else 0}",
                f"- Total association rules found using FP-Growth: {len(rules_fpgrowth) if 'rules_fpgrowth' in locals() else 0}",
            ])
            display(Markdown(summary))
        
        busy_indicator.clear_output()

calculate_button.on_click(on_calculate_button_clicked)
display(calculate_button, busy_indicator)

display(transformed_output)
display(apriori_output)
display(fpgrowth_output)

# Function to perform Apriori algorithm
def apriori_algorithm(dataframe, min_support=0.1):
    frequent_itemsets = apriori(dataframe, min_support=min_support, use_colnames=True)
    if frequent_itemsets.empty:
        raise ValueError("No frequent itemsets found for the given minimum support.")
    rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
    return frequent_itemsets, rules

# Function to perform FP-Growth algorithm
def fpgrowth_algorithm(dataframe, min_support=0.1):
    frequent_itemsets = fpgrowth(dataframe, min_support=min_support, use_colnames=True)
    if frequent_itemsets.empty:
        raise ValueError("No frequent itemsets found for the given minimum support.")
    rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
    return frequent_itemsets, rules



---
## Understanding Association Rules

__Note that your results will differ from run to run because the input dataset is generated randomly. The explanation below is general and does not refer to specific values.__

**Frequent Itemsets**:
- **Support**: Indicates how frequently the itemset appears in the dataset. Higher support means the itemset is more common.
- **Itemsets**: Combination of items being considered.

For example:
- `support: 0.4, itemsets: (Apples)` means that the item 'Apples' appears in 40% of all transactions.
- `support: 0.1, itemsets: (Chicken, Eggs, Milk)` means that 'Chicken', 'Eggs', and 'Milk' appear together in 10% of all transactions.

**Association Rules**:
- **Antecedents**: Items on the left-hand side of the rule.
- **Consequents**: Items on the right-hand side of the rule.
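- **Support**: Proportion of transactions that contain both the antecedents and the consequents.
- **Confidence**: Proportion of transactions containing the antecedents that also contain the consequents, i.e. support(antecedents and consequents) / support(antecedents). Higher confidence means the consequents follow the antecedents more often, but it does not account for how common the consequents are overall.
- **Lift**: Ratio of observed support to that expected if the antecedents and consequents were independent. Lift > 1 indicates a positive association, lift = 1 indicates independence, and lift < 1 indicates a negative association.
- **Leverage**: Difference between the observed support and the support expected under independence. A value of 0 indicates independence.
- **Conviction**: (1 - support(consequents)) / (1 - confidence). It compares how often the rule would be wrong if the antecedents and consequents were independent with how often it is actually wrong. Values above 1 mean the rule is wrong less often than independence would predict.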
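
Leverage and conviction are less familiar than lift, so a small numeric sketch may help. The support values below are hypothetical, and the helper functions are written for illustration rather than taken from `mlxtend`. Conviction is treated as infinite here when confidence equals 1, since the denominator would otherwise be zero.

```python
import math

def leverage(support_both, support_antecedent, support_consequent):
    """Observed joint support minus the joint support expected under independence."""
    return support_both - support_antecedent * support_consequent

def conviction(support_consequent, confidence):
    """(1 - support(consequent)) / (1 - confidence); infinite for a rule that is never wrong."""
    if confidence >= 1.0:
        return math.inf
    return (1 - support_consequent) / (1 - confidence)

# Hypothetical supports for a rule A -> C.
support_a = 0.4      # antecedent appears in 40% of transactions
support_c = 0.5      # consequent appears in 50% of transactions
support_both = 0.3   # both appear together in 30% of transactions

rule_confidence = support_both / support_a
rule_leverage = leverage(support_both, support_a, support_c)
rule_conviction = conviction(support_c, rule_confidence)

print(f"confidence={rule_confidence:.2f}")
print(f"leverage={rule_leverage:.2f}")
print(f"conviction={rule_conviction:.2f}")
```

With these numbers the rule holds in 75% of the transactions containing the antecedent, the joint support is 0.1 above what independence predicts, and the rule is wrong half as often as it would be under independence (conviction of about 2).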

**Example**:
- A rule `(Apples) -> (Bananas, Juice, Cheese, Jam, Bread)` with confidence 25% and lift 2.5 means:
  - 25% of transactions containing item `Apples` also contain items `Bananas, Juice, Cheese, Jam, Bread`.
  - `Apples` and `Bananas, Juice, Cheese, Jam, Bread` appear together 2.5 times as often as would be expected if they were independent. Any `lift` above 1 indicates a positive association; the further above 1, the stronger it is.
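
The two figures in this example also determine how common the consequent is overall, because lift is confidence divided by the consequent's support. The short check below uses only the numbers quoted in the example; the variable names are illustrative.

```python
# Figures quoted in the example rule above.
confidence = 0.25  # share of Apples transactions that also contain the consequent items
lift = 2.5

# lift = confidence / support(consequent), so the consequent's overall support follows directly.
consequent_support = confidence / lift

print(f"Consequent support across all transactions: {consequent_support:.0%}")
print(f"Consequent support among Apples transactions: {confidence:.0%}")
```

In other words, the five consequent items appear together in a tenth of all transactions but in a quarter of those containing Apples, which is where the factor of 2.5 comes from.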
