# Association Rule Mining


In the previous notebooks, we completed the following tasks:

* Data cleaning and EDA
* Leakage-free temporal split
* Saving the parquet files

In this notebook, we mine association rules and discuss the main results of each algorithm, so that we can use these algorithms to make recommendations.

## 1️⃣ Load Parquet Files
First, we need to load the train and test datasets.


In [4]:
import pandas as pd
import os
# You are inside: instacart-retail-project/notebooks
PROJECT_ROOT = os.path.abspath("..")

RAW_DIR = os.path.join(PROJECT_ROOT, "data", "raw")
INTERIM_DIR = os.path.join(PROJECT_ROOT, "data", "interim")
PROCESSED_DIR = os.path.join(PROJECT_ROOT, "data", "processed")
# evaluate_recommender is intentionally not imported; only the "proper" evaluator is used
from modules.association_rules import (
    load_temporal_parquets, build_transactions, transactions_to_onehot,
    run_apriori, run_fpgrowth, run_eclat, derive_rules_from_itemsets,
    rules_to_recommender, evaluate_recommender_proper,
    load_product_lookup, attach_product_names, top_rules, enhance_transaction
)

In [5]:
op_train, op_test = load_temporal_parquets(
    os.path.join(PROCESSED_DIR,"op_train_temporal.parquet"),
    os.path.join(PROCESSED_DIR,"op_test_temporal.parquet")
)

train_tx = build_transactions(op_train)   # Series: order_id -> list(product_id)
test_tx  = build_transactions(op_test)

In [None]:
df_onehot = transactions_to_onehot(train_tx.tolist(), sparse=True)

ap = run_apriori(df_onehot, min_support=0.01, metric="confidence", min_threshold=0.3)
fp = run_fpgrowth(df_onehot, min_support=0.005, metric="confidence", min_threshold=0.3)

On my machine, the kernel crashes at this step.

### ⚠️ Computational Constraint & Product Filtering Strategy
#### 1️⃣ Problem: High-Dimensional Transaction Space

During the implementation of Apriori and FP-Growth, we encountered computational instability when mining rules at full product granularity.

The root cause is structural:

The Instacart dataset contains tens of thousands of unique products.

One-hot encoding creates a matrix of shape:

$\text{Number of Orders} \times \text{Number of Unique Products}$


This results in:

- Extremely high dimensionality

- Large memory consumption

- Candidate itemset explosion (especially for Apriori)

- Kernel crashes or excessive runtime

This is not a modeling issue; it is a combinatorial scaling issue inherent to association rule mining in high-dimensional retail data.
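
To get a feel for the scale, the arithmetic below uses the product and transaction counts reported by the filtering step in this notebook (49,677 products, 3,214,874 transactions). It is a rough back-of-the-envelope sketch under two assumptions: that the transaction count equals the number of one-hot rows, and that a dense boolean matrix uses 1 byte per cell. A sparse matrix is far smaller in memory, but the number of columns still bounds how many candidate itemsets an algorithm may have to consider.

```python
# Sizes taken from the stats printed by the filtering step in this notebook.
n_orders = 3_214_874        # 'original_transactions' (assumed to equal one-hot rows)
n_products = 49_677         # 'original_products'
n_products_kept = 4_537     # 'retained_products' after the 80% coverage filter

# Dense one-hot matrix: one cell per (order, product) pair.
n_cells = n_orders * n_products
dense_gb = n_cells / 1e9    # assumption: 1 byte per boolean cell

# Upper bound on candidate 2-itemsets if every product were frequent.
max_pairs_full = n_products * (n_products - 1) // 2
max_pairs_kept = n_products_kept * (n_products_kept - 1) // 2

print(f"cells: {n_cells:,} (~{dense_gb:.0f} GB dense at 1 byte/cell)")
print(f"possible pairs, all products:  {max_pairs_full:,}")
print(f"possible pairs, kept products: {max_pairs_kept:,}")
```

The pair counts are upper bounds only: in practice only pairs of frequent single items become candidates, so the real numbers depend on `min_support`.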

#### 2️⃣ Structural Insight from Pareto (Long-Tail) Analysis

From the Product Distribution & Long Tail analysis, we observed:

* A small fraction of products accounts for the majority of purchases.

* A large proportion of products are rarely purchased.

* The purchase distribution follows a strong Pareto-like pattern.

The cumulative coverage curve showed that:

* A limited subset of products explains most transaction volume.

* The long tail contains thousands of low-frequency items.

This means:

> Rare products contribute little to global co-occurrence structure but dramatically increase computational complexity.

#### 3️⃣ Data-Driven Dimensionality Reduction

Instead of arbitrarily reducing the dataset, we applied a Pareto-based filtering strategy:

**Strategy:**

Keep only products that:

* Appear at least 200 times, OR
* Belong to the top 80% cumulative purchase coverage

This ensures that:

* We preserve the core transactional structure.

* We eliminate extremely sparse dimensions.

* We reduce noise.

* We prevent combinatorial explosion.

We choose the top 80% cumulative purchase coverage, following the steps implemented in the function `enhance_transaction`.
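
As an illustration of the coverage criterion, here is a minimal pandas sketch. The helper `pareto_filter` and the toy data are hypothetical and are not the project's `enhance_transaction`, whose internals may differ. The sketch assumes one row per purchased (order, product) pair.

```python
import pandas as pd

def pareto_filter(op: pd.DataFrame, coverage: float = 0.80) -> tuple[pd.DataFrame, dict]:
    """Keep the most-purchased products that together reach `coverage` of all purchases.

    Hypothetical helper for illustration only.
    """
    counts = op["product_id"].value_counts()       # sorted by frequency, descending
    cum_share = counts.cumsum() / counts.sum()     # cumulative purchase share
    # Number of top products needed to reach the target coverage.
    n_keep = int((cum_share < coverage).sum()) + 1
    keep_ids = counts.index[:n_keep]
    filtered = op[op["product_id"].isin(keep_ids)]
    stats = {
        "original_products": int(counts.size),
        "retained_products": int(len(keep_ids)),
        "purchase_volume_retained": round(len(filtered) / len(op), 4),
    }
    return filtered, stats

# Toy data: product 10 is bought 5 times, 20 three times, 30 and 40 once each.
toy = pd.DataFrame({
    "order_id":   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "product_id": [10, 20, 10, 20, 10, 30, 10, 20, 10, 40],
})
filtered, stats = pareto_filter(toy, coverage=0.80)
print(stats)
```

On the toy data, the two most frequent products already cover 80% of purchases, so the two rare products are dropped.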

In [7]:
# Keep 80% coverage
op_train_filtered, stats = enhance_transaction(op_train, coverage=0.80)

print(stats)

{'coverage_target': 0.8, 'original_products': 49677, 'retained_products': 4537, 'original_transactions': 3214874, 'retained_transactions': 3151193, 'purchase_volume_retained': 0.8, 'dimensionality_reduction_ratio': 0.9087}


Now rebuild transactions:

In [8]:
train_tx = build_transactions(op_train_filtered)
df_onehot = transactions_to_onehot(train_tx.tolist(), sparse=True)

Now we can run the Apriori and FP-Growth algorithms.

In [5]:
ap = run_apriori(df_onehot, min_support=0.005, metric="confidence", min_threshold=0.3)


ap.meta

{'algorithm': 'apriori',
 'min_support': 0.005,
 'metric': 'confidence',
 'min_threshold': 0.3,
 'n_itemsets': 338,
 'n_rules': 5,
 'runtime_sec': 1652.6259}

In [6]:
fp = run_fpgrowth(df_onehot, min_support=0.005, metric="confidence", min_threshold=0.3)


fp.meta

{'algorithm': 'fp-growth',
 'min_support': 0.005,
 'metric': 'confidence',
 'min_threshold': 0.3,
 'n_itemsets': 338,
 'n_rules': 5,
 'runtime_sec': 137.7508}

In [7]:
ec = run_eclat(train_tx, min_support=0.005, max_len=3)
ec_rules = derive_rules_from_itemsets(
    ec.frequent_itemsets,
    metric="confidence",
    min_threshold=0.3
)

In [18]:
ap_rules_ranked = rules_to_recommender(ap.rules, sort_by="lift")
fp_rules_ranked = rules_to_recommender(fp.rules, sort_by="lift")
ec_rules_ranked = rules_to_recommender(ec_rules, sort_by="lift")

print("Apriori:", evaluate_recommender_proper(ap_rules_ranked, test_tx, k=10))
print("FP-Growth:", evaluate_recommender_proper(fp_rules_ranked, test_tx, k=10))
print("Eclat:", evaluate_recommender_proper(ec_rules_ranked, test_tx, k=10))

Apriori: {'HitRate@K': 0.012447332025344955, 'Precision@K': 0.0012447332025344593, 'Recall@K': 0.001903592585106979}
FP-Growth: {'HitRate@K': 0.012447332025344955, 'Precision@K': 0.0012447332025344593, 'Recall@K': 0.001903592585106979}
Eclat: {'HitRate@K': 0.012447332025344955, 'Precision@K': 0.0012447332025344593, 'Recall@K': 0.001903592585106979}
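
For reference, the three reported metrics are commonly defined per order and then averaged. The sketch below shows those textbook definitions on toy data. The function names are hypothetical, and the exact protocol inside `evaluate_recommender_proper` (for example, how each test basket is split into observed and hidden items) is an assumption and is not shown in this notebook.

```python
def topk_metrics(recommended: list, hidden: set, k: int = 10) -> dict:
    """Hit, precision and recall for one order, given its ranked recommendations."""
    top_k = recommended[:k]
    n_hits = sum(1 for item in top_k if item in hidden)
    return {
        "hit": 1.0 if n_hits > 0 else 0.0,               # at least one hidden item recovered
        "precision": n_hits / k,                          # share of the k slots that were correct
        "recall": n_hits / len(hidden) if hidden else 0.0,  # share of hidden items recovered
    }

def average_metrics(per_order: list) -> dict:
    """Average the per-order values into HitRate@K, Precision@K and Recall@K."""
    n = len(per_order)
    return {
        "HitRate@K": sum(m["hit"] for m in per_order) / n,
        "Precision@K": sum(m["precision"] for m in per_order) / n,
        "Recall@K": sum(m["recall"] for m in per_order) / n,
    }

# Two toy orders: (ranked recommendations, hidden items of the basket).
orders = [
    (["banana", "milk", "eggs"], {"milk", "bread"}),
    (["banana", "milk", "eggs"], {"cheese"}),
]
results = average_metrics([topk_metrics(rec, hid, k=3) for rec, hid in orders])
print(results)
```
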


As we can see, all three algorithms produce the same scores:

* HitRate@K ≈ 1.2%
* Precision@K ≈ 0.12%
* Recall@K ≈ 0.19%

These scores are very low and identical across **Apriori, FP-Growth and Eclat**. This is expected: given the same data and thresholds, all three algorithms find the same frequent itemsets and therefore the same rules. Improving performance therefore means improving:

* Data preparation
* Rule filtering
* Recommendation strategy
* Evaluation setup

Let's go step by step.

#### 🔥 1️⃣ Why Are The Scores Low?

Association rules for next-basket prediction are naturally weak because:

* They are global (not personalized)
* They don't use the depth of each user's history
* They don't rank candidate items intelligently
* They don't optimize for prediction directly

Still, a 1.2% hit rate is very low, and we can improve it.

#### 🚀 STRATEGY TO IMPROVE PERFORMANCE

To improve the results of our algorithms, we need to work on five layers.

__✅ STEP 1: Tune min_support & min_confidence__

Right now we are using:
<code>
min_support = 0.005
min_threshold = 0.3
</code>
That might be too strict.

Try:
<code>
min_support = 0.003
min_threshold = 0.2
</code>
Once these values are changed, we can re-evaluate the algorithms.

__✅ STEP 2: Filter Rules Smartly__

Instead of using all rules, filter them.

For example, keep only rules with:

* lift > 1.1
* confidence > 0.25
* support > 0.002

<code>
filtered_rules = ap.rules[
    (ap.rules["lift"] > 1.1) &
    (ap.rules["confidence"] > 0.25) &
    (ap.rules["support"] > 0.002)
]
</code>

Then rank the remaining rules by lift; this often improves the hit rate.
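
The three thresholds refer to the standard rule metrics: support is the share of transactions containing both sides of the rule, confidence is the share of antecedent transactions that also contain the consequent, and lift is confidence divided by the consequent's overall frequency. A minimal sketch on toy baskets follows; `rule_metrics` is a hypothetical helper and is not part of the project module.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    n_a = sum(1 for t in transactions if a <= t)          # baskets containing the antecedent
    n_c = sum(1 for t in transactions if c <= t)          # baskets containing the consequent
    n_ac = sum(1 for t in transactions if (a | c) <= t)   # baskets containing both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}

# Five toy baskets (item names are placeholders).
baskets = [
    {"banana", "milk"},
    {"banana", "milk", "bread"},
    {"banana", "bread"},
    {"milk"},
    {"bread"},
]
m = rule_metrics(baskets, {"banana"}, {"milk"})
print(m)
```

A lift above 1 means the consequent appears more often alongside the antecedent than it does on average, which is why a `lift > 1` style cut-off removes rules that only reflect overall popularity.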

__✅ STEP 3: Use Multi-Item Antecedents__

Right now, many of the rules we use have a single-item antecedent.

Better predictive power often comes from:

* 2-item antecedents
* 3-item antecedents

We can therefore filter using:
<code>
ap.rules["ante_len"] = ap.rules["antecedents"].apply(len)

rules_multi = ap.rules[ap.rules["ante_len"] >= 2]
</code>
Single-item rules are often too generic.

__✅ STEP 4: Better Recommendation Strategy__

Right now our recommender likely:

* Iterates over the rules
* Stops at the first matches

This is a weak strategy.

Instead, score each consequent with a weighted score. For each rule that matches:

$\text{Score} = \text{confidence} \times \text{lift}$

Then aggregate the scores per recommended product.

A better recommender is implemented in the function `recommend_weighted`.
This alone can significantly improve performance.
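
The scoring idea can be sketched as follows. This is only an illustration, assuming a rules table with `antecedents`, `consequents`, `confidence` and `lift` columns (item sets stored as frozensets); the function name `recommend_weighted_sketch` is hypothetical, and the project's `recommend_weighted` may be implemented differently.

```python
from collections import defaultdict

import pandas as pd

def recommend_weighted_sketch(rules: pd.DataFrame, observed: set, k: int = 10) -> list:
    """Rank consequent items by the summed confidence * lift of all matching rules."""
    scores = defaultdict(float)
    for row in rules.itertuples(index=False):
        if set(row.antecedents) <= observed:        # the rule fires on this basket
            for item in row.consequents:
                if item not in observed:            # skip items already in the basket
                    scores[item] += row.confidence * row.lift
    # Highest score first; ties broken by item name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return [item for item, _ in ranked[:k]]

# Toy rules (placeholder item names).
rules = pd.DataFrame({
    "antecedents": [frozenset({"A"}), frozenset({"A", "B"}), frozenset({"C"})],
    "consequents": [frozenset({"X"}), frozenset({"Y"}), frozenset({"X"})],
    "confidence":  [0.30, 0.50, 0.40],
    "lift":        [1.5, 2.0, 1.2],
})
print(recommend_weighted_sketch(rules, observed={"A", "B"}, k=2))
```

Summing over all matching rules lets several weak rules that point to the same product outrank a single stronger rule, which a first-match strategy cannot do.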

__✅ STEP 5: Personalization (Major Boost)__

Association rules are global, but users have history.

Instead of using only the observed basket, use:

* All previous purchases of that user (from train)

We can build a user profile:

`user_history = op_train.groupby("user_id")["product_id"].apply(set)`

Then we can use:

`observed = user_history[user_id]`

This can substantially increase the hit rate, because more rules have their antecedents satisfied.
Let's apply these modifications and evaluate our algorithms.
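
A small sketch of the user-profile idea, on toy data. It assumes the training frame has `user_id` and `product_id` columns, as in the snippet above. The helper `observed_items` is hypothetical and adds one detail the snippet leaves open: users without any training history fall back to the visible basket alone.

```python
import pandas as pd

# Toy stand-in for op_train; column names follow the snippet above.
op_train_toy = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2],
    "order_id":   [100, 100, 101, 200, 200],
    "product_id": [10, 20, 30, 10, 40],
})

# One set of previously purchased products per user (train data only).
user_history = op_train_toy.groupby("user_id")["product_id"].apply(set)

def observed_items(user_id, basket, history: pd.Series) -> set:
    """Union of the visible basket and the user's training history."""
    past = history.get(user_id, set())   # empty set for users unseen in train
    return set(basket) | set(past)

print(observed_items(1, [50], user_history))    # user with history
print(observed_items(99, [50], user_history))   # user unseen in train
```
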


### Run the improved pipeline (Apriori / FP / Eclat)

In [9]:
# 2️⃣ Lower support slightly (example: 0.003 instead of 0.005)
MIN_SUPPORT = 0.003
MIN_CONF    = 0.2

In [10]:
# Build one-hot on TRAIN (we use sparse=True to reduce memory)
# Use the Pareto data
# df_onehot = transactions_to_onehot(train_tx.tolist(), sparse=True)

# ---- Apriori ----
ap = run_apriori(
    df_onehot,
    min_support=MIN_SUPPORT,
    metric="confidence",
    min_threshold=MIN_CONF,
    max_len=3  # We keep it small for scalability; we can try 4 later
)

In [11]:
# ---- FP-Growth ----
fp = run_fpgrowth(
    df_onehot,
    min_support=MIN_SUPPORT,
    metric="confidence",
    min_threshold=MIN_CONF,
    max_len=3
)

In [12]:
# ---- Eclat ---- (itemsets then derive rules)
ec = run_eclat(
    train_tx,
    min_support=MIN_SUPPORT,
    max_len=3
)
ec_rules = derive_rules_from_itemsets(
    ec.frequent_itemsets,
    metric="confidence",
    min_threshold=MIN_CONF
)

In [13]:
from modules.association_rules import filter_rules_for_prediction, evaluate_weighted_recommender
# 3️⃣ Filter rules with lift > 1
# 4️⃣ Use multi-item antecedents (len >= 2)
ap_f = filter_rules_for_prediction(ap.rules, min_lift=1.0, min_confidence=MIN_CONF, min_antecedent_len=2)
fp_f = filter_rules_for_prediction(fp.rules, min_lift=1.0, min_confidence=MIN_CONF, min_antecedent_len=2)
ec_f = filter_rules_for_prediction(ec_rules,   min_lift=1.0, min_confidence=MIN_CONF, min_antecedent_len=2)

print("Rules after filtering:")
print("Apriori:", len(ap_f), "| FP-Growth:", len(fp_f), "| Eclat:", len(ec_f))


Rules after filtering:
Apriori: 14 | FP-Growth: 14 | Eclat: 14


In [14]:
# 1️⃣ Weighted recommender + 5️⃣ Evaluate again
K = 10
HIDE_RATIO = 0.5

ap_eval = evaluate_weighted_recommender(ap_f, test_tx, k=K, hide_ratio=HIDE_RATIO, score_mode="conf_lift")
fp_eval = evaluate_weighted_recommender(fp_f, test_tx, k=K, hide_ratio=HIDE_RATIO, score_mode="conf_lift")
ec_eval = evaluate_weighted_recommender(ec_f, test_tx, k=K, hide_ratio=HIDE_RATIO, score_mode="conf_lift")

print("Apriori:", ap_eval)
print("FP-Growth:", fp_eval)
print("Eclat:", ec_eval)

# Optional: Try scoring modes
# print(evaluate_weighted_recommender(ap_f, test_tx, k=K, hide_ratio=HIDE_RATIO, score_mode="conf"))
# print(evaluate_weighted_recommender(ap_f, test_tx, k=K, hide_ratio=HIDE_RATIO, score_mode="lift"))

Apriori: {'HitRate@K': 0.007735357498954681, 'Precision@K': 0.0007984625775947925, 'Recall@K': 0.0009941800009024239, 'n_eval_orders': 124364}
FP-Growth: {'HitRate@K': 0.007735357498954681, 'Precision@K': 0.0007984625775947925, 'Recall@K': 0.0009941800009024239, 'n_eval_orders': 124364}
Eclat: {'HitRate@K': 0.007735357498954681, 'Precision@K': 0.0007984625775947925, 'Recall@K': 0.0009941800009024239, 'n_eval_orders': 124364}


In [24]:
from modules.association_rules import evaluate_hybrid, evaluate_popularity_only
# Build popularity ranking from TRAIN set only
global_popularity = (
    op_train["product_id"]
    .value_counts()
    .index
    .tolist()
)

# Filter rules (allow single antecedents for better firing)
ec_rules = filter_rules_for_prediction(
    ec.rules,
    min_lift=1.0,
    min_confidence=0.2,
    min_antecedent_len=1
)

# Note: the call below passes ec_f (multi-item antecedents only),
# not the single-antecedent ec_rules built above.
hybrid_results = evaluate_hybrid(
    ec_f,
    test_tx,
    global_popularity,
    k=10,
    hide_ratio=0.5
)

print("Hybrid (Eclat rules):", hybrid_results)

Hybrid (Eclat rules): {'HitRate@K': 0.2998697372229906, 'Precision@K': 0.03817503457591679, 'Recall@K': 0.0731707490826234, 'n_eval_orders': 124364}


In [25]:

pop_results = evaluate_popularity_only(
    test_tx,
    global_popularity,
    k=10,
    hide_ratio=0.5
)

print("Popularity Only:", pop_results)

print("Hybrid:", hybrid_results)

Popularity Only: {'HitRate@K': 0.2998697372229906, 'Precision@K': 0.03817503457591679, 'Recall@K': 0.0731707490826234, 'n_eval_orders': 124364}
Hybrid: {'HitRate@K': 0.2998697372229906, 'Precision@K': 0.03817503457591679, 'Recall@K': 0.0731707490826234, 'n_eval_orders': 124364}
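
The hybrid and popularity-only scores are identical. The sketch below shows one way this can happen with a fill-style hybrid that places rule-based items first and then pads with the most popular products. `hybrid_top_k` and the item names are hypothetical; how `evaluate_hybrid` actually combines the two sources is not shown in this notebook, so this is an illustration and not a diagnosis.

```python
def hybrid_top_k(rule_items: list, popularity: list, observed: set, k: int = 10) -> list:
    """Rule-based items first, then most popular items until k slots are filled."""
    out = []
    for item in list(rule_items) + list(popularity):
        if len(out) >= k:
            break
        if item not in observed and item not in out:
            out.append(item)
    return out

# Toy popularity ranking (placeholder names, most popular first).
popularity = ["banana", "strawberries", "spinach", "avocado"]

# Case 1: no rule fires, so the hybrid list is exactly the popularity list.
no_rules = hybrid_top_k([], popularity, observed={"milk"}, k=3)

# Case 2: a rule fires but proposes an already-popular item,
# so the order changes while the set of recommended items does not.
popular_rule = hybrid_top_k(["strawberries"], popularity, observed={"milk"}, k=3)

print(no_rules)
print(popular_rule)
```

Hit rate, precision and recall at K depend only on which items are in the top K, not on their order, so both cases would score exactly like the popularity baseline.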


In [None]:
products_lookup = load_product_lookup("products.csv")
ap_named = attach_product_names(ap_rules_ranked, products_lookup)

top_rules(ap_named, sort_by="lift", n=15)[
    ["antecedents_names","consequents_names","support","confidence","lift"]
]


In [None]:
from modules.association_rules import build_spmf_utility_file, run_uptree_spmf

# Example: utility = price (or price * quantity if you had quantities)
prices = pd.read_csv("api_prices.csv")  # must contain product_id, price
op_train_u = op_train.merge(prices, on="product_id", how="left")
op_train_u["price"] = op_train_u["price"].fillna(0.0)
op_train_u["utility"] = op_train_u["price"]

build_spmf_utility_file(op_train_u, "instacart_utility.txt")

res_upt = run_uptree_spmf(
    spmf_jar_path="spmf.jar",
    input_utility_file="instacart_utility.txt",
    output_file="uptree_output.txt",
    min_utility=10000,
    item_separator=" "
)

res_upt.meta