# Association Rule Mining


In the previous notebooks, we completed the following tasks:
* Data cleaning and EDA

* Leakage-free temporal split

* Saved parquet files

In this notebook, we mine association rules and discuss the main results of each algorithm, so that we can use these algorithms to make recommendations.

## 1️⃣ Load Parquet Files
First, we load the train and test datasets.


In [15]:
import pandas as pd
import os
# You are inside: instacart-retail-project/notebooks
PROJECT_ROOT = os.path.abspath("..")

RAW_DIR = os.path.join(PROJECT_ROOT, "data", "raw")
INTERIM_DIR = os.path.join(PROJECT_ROOT, "data", "interim")
PROCESSED_DIR = os.path.join(PROJECT_ROOT, "data", "processed")
# Only evaluate_recommender_proper is used below, so evaluate_recommender is not imported.
from modules.association_rules import (
    load_temporal_parquets, build_transactions, transactions_to_onehot,
    run_apriori, run_fpgrowth, run_eclat, derive_rules_from_itemsets,
    rules_to_recommender, evaluate_recommender_proper,
    load_product_lookup, attach_product_names, top_rules, enhance_transaction
)

In [2]:
op_train, op_test = load_temporal_parquets(
    os.path.join(PROCESSED_DIR,"op_train_temporal.parquet"),
    os.path.join(PROCESSED_DIR,"op_test_temporal.parquet")
)

train_tx = build_transactions(op_train)   # Series: order_id -> list(product_id)
test_tx  = build_transactions(op_test)
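
The comment above describes each transaction set as a Series mapping `order_id` to a list of `product_id` values. As a minimal sketch of that data structure (not the actual `build_transactions` implementation, which lives in the project module), a pandas group-by on a small made-up frame produces the same shape. The names `toy` and `toy_tx` are hypothetical.

```python
import pandas as pd

# Toy order-product rows; the real data comes from the parquet files.
toy = pd.DataFrame({
    "order_id":   [1, 1, 2, 3, 3, 3],
    "product_id": [10, 20, 10, 20, 30, 40],
})

# One list of product ids per order, indexed by order_id.
toy_tx = toy.groupby("order_id")["product_id"].apply(list)
print(toy_tx)
```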

In [None]:
df_onehot = transactions_to_onehot(train_tx.tolist(), sparse=True)

ap = run_apriori(df_onehot, min_support=0.01, metric="confidence", min_threshold=0.3)
fp = run_fpgrowth(df_onehot, min_support=0.005, metric="confidence", min_threshold=0.3)

On my machine, the kernel crashes at this step.

### ⚠️ Computational Constraint & Product Filtering Strategy
#### 1️⃣ Problem: High-Dimensional Transaction Space

During the implementation of Apriori and FP-Growth, we encountered computational instability when mining rules at full product granularity.

The root cause is structural:

The Instacart dataset contains tens of thousands of unique products.

One-hot encoding creates a matrix of shape:

$\text{Number of Orders} \times \text{Number of Unique Products}$


This results in:

- Extremely high dimensionality

- Large memory consumption

- Candidate itemset explosion (especially for Apriori)

- Kernel crashes or excessive runtime

This is not a modeling issue — it is a combinatorial scaling issue inherent to association rule mining in high-dimensional retail data.
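
To give a rough sense of scale, the sketch below estimates the size of a fully dense boolean one-hot matrix. It uses the order and product counts printed by the filtering step in this notebook, and assumes that the reported transaction counts correspond to matrix rows and that each cell takes one byte. The notebook requests `sparse=True`, so real memory use should be far lower; this is only an upper bound that illustrates why the full product space is hard to handle. The helper `dense_onehot_gib` is hypothetical.

```python
def dense_onehot_gib(n_orders, n_products, bytes_per_cell=1):
    """Size in GiB of a dense n_orders x n_products matrix."""
    return n_orders * n_products * bytes_per_cell / 1024**3

# Counts taken from the stats printed by the filtering step in this notebook.
full_gib = dense_onehot_gib(3_214_874, 49_677)      # all products
filtered_gib = dense_onehot_gib(3_151_193, 4_537)   # after 80% coverage filter

print(f"full: {full_gib:.1f} GiB, filtered: {filtered_gib:.1f} GiB")
```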

#### 2️⃣ Structural Insight from Pareto (Long-Tail) Analysis

From the Product Distribution & Long Tail analysis, we observed:

* A small fraction of products accounts for the majority of purchases.

* A large proportion of products are rarely purchased.

* The purchase distribution follows a strong Pareto-like pattern.

The cumulative coverage curve showed that:

* A limited subset of products explains most transaction volume.

* The long tail contains thousands of low-frequency items.

This means:

> Rare products contribute little to global co-occurrence structure but dramatically increase computational complexity.

#### 3️⃣ Data-Driven Dimensionality Reduction

Instead of arbitrarily reducing the dataset, we applied a Pareto-based filtering strategy:

**Strategy:**

Keep only products that:

* Appear at least 200 times, **or**
* Belong to the top 80% cumulative purchase coverage

This ensures that:

* We preserve the core transactional structure.

* We eliminate extremely sparse dimensions.

* We reduce noise.

* We prevent combinatorial explosion.

We choose the top 80% cumulative purchase coverage criterion and apply it with the `enhance_transaction` function.

In [3]:
# Keep 80% coverage
op_train_filtered, stats = enhance_transaction(op_train, coverage=0.80)

print(stats)

{'coverage_target': 0.8, 'original_products': 49677, 'retained_products': 4537, 'original_transactions': 3214874, 'retained_transactions': 3151193, 'purchase_volume_retained': 0.8, 'dimensionality_reduction_ratio': 0.9087}
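
For readers without access to the project module, here is a minimal sketch of how a coverage-based filter of this kind could work. It is not the code of `enhance_transaction`: the function name `filter_by_coverage`, its parameters, and the toy data are assumptions made for illustration. The idea is to rank products by purchase count, keep the smallest set of top products whose cumulative share reaches the target, and drop all rows for the remaining products.

```python
import pandas as pd

def filter_by_coverage(op, coverage=0.80, item_col="product_id"):
    """Keep rows whose product is among the top sellers covering `coverage` of purchases."""
    counts = op[item_col].value_counts()            # sorted by count, descending
    cum_share = counts.cumsum() / counts.sum()      # cumulative purchase share
    # Smallest prefix of products whose cumulative share reaches the target.
    n_keep = min(int((cum_share < coverage).sum()) + 1, len(counts))
    keep = set(counts.index[:n_keep])
    return op[op[item_col].isin(keep)], keep

# Toy data: A is bought 5 times, B 3 times, C and D once each (10 purchases).
toy_op = pd.DataFrame({
    "order_id":   [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "product_id": ["A", "A", "A", "A", "A", "B", "B", "B", "C", "D"],
})
toy_filtered, kept = filter_by_coverage(toy_op, coverage=0.80)
print(sorted(kept), len(toy_filtered))
```

In the toy data, A and B together account for 8 of the 10 purchases, so they are the products retained at an 80% target.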


Now rebuild transactions:

In [4]:
train_tx = build_transactions(op_train_filtered)
df_onehot = transactions_to_onehot(train_tx.tolist(), sparse=True)

Now we can run the Apriori and FP-Growth algorithms.

In [5]:
ap = run_apriori(df_onehot, min_support=0.005, metric="confidence", min_threshold=0.3)


ap.meta

{'algorithm': 'apriori',
 'min_support': 0.005,
 'metric': 'confidence',
 'min_threshold': 0.3,
 'n_itemsets': 338,
 'n_rules': 5,
 'runtime_sec': 1652.6259}

In [6]:
fp = run_fpgrowth(df_onehot, min_support=0.005, metric="confidence", min_threshold=0.3)


fp.meta

{'algorithm': 'fp-growth',
 'min_support': 0.005,
 'metric': 'confidence',
 'min_threshold': 0.3,
 'n_itemsets': 338,
 'n_rules': 5,
 'runtime_sec': 137.7508}

In [7]:
ec = run_eclat(train_tx, min_support=0.005, max_len=3)
ec_rules = derive_rules_from_itemsets(
    ec.frequent_itemsets,
    metric="confidence",
    min_threshold=0.3
)
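
Eclat returns only frequent itemsets, so rules have to be derived from them in a second step. The sketch below shows the standard definitions that such a step relies on: confidence(A → C) = support(A ∪ C) / support(A) and lift = confidence / support(C). It is an illustration with made-up supports, not the implementation of `derive_rules_from_itemsets`. The function `rules_from_supports` and the item names are hypothetical.

```python
from itertools import combinations

def rules_from_supports(supports, min_confidence=0.3):
    """supports: dict mapping frozenset(items) -> support in [0, 1]."""
    rules = []
    for itemset, supp in supports.items():
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset as the antecedent.
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                c = itemset - a
                if a not in supports or c not in supports:
                    continue  # needed supports are missing, skip this split
                confidence = supp / supports[a]
                if confidence >= min_confidence:
                    rules.append({
                        "antecedent": a,
                        "consequent": c,
                        "support": supp,
                        "confidence": confidence,
                        "lift": confidence / supports[c],
                    })
    return rules

# Made-up supports for two items and their pair.
toy_supports = {
    frozenset({"banana"}): 0.4,
    frozenset({"avocado"}): 0.2,
    frozenset({"banana", "avocado"}): 0.1,
}
toy_rules = rules_from_supports(toy_supports, min_confidence=0.3)
for rule in toy_rules:
    print(rule)
```

With these numbers only avocado → banana passes the 0.3 confidence threshold; the reverse rule has confidence 0.1 / 0.4 = 0.25 and is dropped.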

In [18]:
ap_rules_ranked = rules_to_recommender(ap.rules, sort_by="lift")
fp_rules_ranked = rules_to_recommender(fp.rules, sort_by="lift")
ec_rules_ranked = rules_to_recommender(ec_rules, sort_by="lift")

print("Apriori:", evaluate_recommender_proper(ap_rules_ranked, test_tx, k=10))
print("FP-Growth:", evaluate_recommender_proper(fp_rules_ranked, test_tx, k=10))
print("Eclat:", evaluate_recommender_proper(ec_rules_ranked, test_tx, k=10))

Apriori: {'HitRate@K': 0.012447332025344955, 'Precision@K': 0.0012447332025344593, 'Recall@K': 0.001903592585106979}
FP-Growth: {'HitRate@K': 0.012447332025344955, 'Precision@K': 0.0012447332025344593, 'Recall@K': 0.001903592585106979}
Eclat: {'HitRate@K': 0.012447332025344955, 'Precision@K': 0.0012447332025344593, 'Recall@K': 0.001903592585106979}
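
The three algorithms report identical scores. This is expected if they produce the same rule set, and Apriori and FP-Growth both report 338 itemsets and 5 rules at the same thresholds. For reference, the sketch below shows one common per-order definition of the three metrics. It is an assumption about how such metrics are usually computed, not the code of `evaluate_recommender_proper`. The helper `metrics_at_k` and its inputs are hypothetical.

```python
def metrics_at_k(recommended, relevant, k=10):
    """Per-order hit, precision and recall for a ranked recommendation list.

    recommended: ranked list of product ids produced by the rules.
    relevant: set of product ids the customer actually bought (held out).
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    return {
        "hit": 1.0 if hits > 0 else 0.0,          # averaged over orders -> HitRate@K
        "precision": hits / k,                    # divides by k even if fewer items are recommended
        "recall": hits / len(relevant) if relevant else 0.0,
    }

# One recommended item out of three is among the two held-out purchases.
toy_metrics = metrics_at_k(recommended=[1, 2, 3], relevant={2, 9}, k=10)
print(toy_metrics)
```

Averaging these per-order values over all test orders gives dataset-level figures of the same kind as those printed above.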


In [None]:
products_lookup = load_product_lookup("products.csv")
ap_named = attach_product_names(ap_rules_ranked, products_lookup)

top_rules(ap_named, sort_by="lift", n=15)[
    ["antecedents_names","consequents_names","support","confidence","lift"]
]

In [None]:
from modules.association_rules import build_spmf_utility_file, run_uptree_spmf

# Example: utility = price (or price * quantity if you had quantities)
prices = pd.read_csv("api_prices.csv")  # must contain product_id, price
op_train_u = op_train.merge(prices, on="product_id", how="left")
op_train_u["price"] = op_train_u["price"].fillna(0.0)
op_train_u["utility"] = op_train_u["price"]

build_spmf_utility_file(op_train_u, "instacart_utility.txt")

res_upt = run_uptree_spmf(
    spmf_jar_path="spmf.jar",
    input_utility_file="instacart_utility.txt",
    output_file="uptree_output.txt",
    min_utility=10000,
    item_separator=" "
)

res_upt.meta