<div style="font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; color: #2c3e50; line-height: 1.6; max-width: 900px; margin: auto; border: 1px solid #e1e4e8; border-radius: 15px; background-color: #ffffff; overflow: hidden; box-shadow: 0 10px 30px rgba(0,0,0,0.1);">

<div style="background: linear-gradient(135deg, #1e3c72 0%, #2a5298 100%); padding: 40px 30px; color: white; text-align: center;">
    <h1 style="margin: 0; font-size: 2.8em; font-weight: 800; letter-spacing: -1px;">🛒 Association Rule Mining</h1>
    <div style="width: 60px; height: 4px; background: #ffcc00; margin: 20px auto; border-radius: 2px;"></div>
    <p style="font-size: 1.1em; opacity: 0.9; max-width: 600px; margin: auto;">
        Strategic Basket Analysis to Drive Revenue and Optimize Customer Experience
    </p>
</div>

<div style="padding: 30px;">
    
<h3 style="color: #1e3c72; border-bottom: 2px solid #f0f2f5; padding-bottom: 10px; margin-top: 0;">🎯 Analysis Objectives</h3>
    <p>In this phase of the project, we leverage transaction data to decode consumer behavior with two primary goals:</p>
    
<div style="display: flex; gap: 20px; margin: 25px 0;">
        <div style="flex: 1; background: #fff9db; padding: 20px; border-radius: 10px; border-left: 5px solid #fab005;">
            <strong style="color: #862e1b; font-size: 1.1em;">💰 Profit Maximization</strong><br>
            <span style="font-size: 0.95em;">Identifying high-value product bundles and cross-selling opportunities to increase Average Order Value (AOV).</span>
        </div>
        <div style="flex: 1; background: #e7f5ff; padding: 20px; border-radius: 10px; border-left: 5px solid #228be6;">
            <strong style="color: #1864ab; font-size: 1.1em;">✨ User Experience</strong><br>
            <span style="font-size: 0.95em;">Streamlining the customer journey through intuitive recommendations and intelligent store layouts.</span>
        </div>
    </div>

<h3 style="color: #1e3c72; border-bottom: 2px solid #f0f2f5; padding-bottom: 10px;">🛠️ Methodological Framework</h3>
    <p>We will implement and compare four established algorithms for extracting frequent itemsets (and, in the case of UP-Tree, high-utility itemsets):</p>
    
    

<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px; margin-top: 15px;">
        <div style="padding: 10px 15px; background: #f8f9fa; border-radius: 6px; font-family: monospace; border: 1px solid #e9ecef;">• Apriori Algorithm</div>
        <div style="padding: 10px 15px; background: #f8f9fa; border-radius: 6px; font-family: monospace; border: 1px solid #e9ecef;">• Eclat (Equivalence Class Transformation)</div>
        <div style="padding: 10px 15px; background: #f8f9fa; border-radius: 6px; font-family: monospace; border: 1px solid #e9ecef;">• FP-Growth (Frequent Pattern)</div>
        <div style="padding: 10px 15px; background: #f8f9fa; border-radius: 6px; font-family: monospace; border: 1px solid #e9ecef;">• UP-Tree (Utility Pattern)</div>
    </div>

<div style="margin-top: 35px; padding: 15px; background: #f1f3f5; border-radius: 8px; text-align: center; font-style: italic; color: #495057;">
        "Following each model, we will perform a deep-dive into the extracted rules to translate data patterns into actionable business insights."
    </div>

</div>
</div>

We start by preparing the data for the association mining algorithms.

<h2>Apriori Algorithm</h2>

Apriori generates frequent itemsets level by level: frequent itemsets of size k are combined into candidates of size k+1, and any combination that does not satisfy the minimum support threshold is pruned.
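
The level-wise approach can be sketched on a toy dataset. This is a minimal illustration under stated assumptions, not the implementation used in the cells below, which count single items and pairs directly. The function name `apriori_frequent_itemsets` and the toy baskets are hypothetical.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets.

    Hypothetical helper for illustration; `min_support` is a fraction in [0, 1].
    """
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of `itemset`.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates.
        prev = list(current)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Keep only the candidates that meet the support threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


toy_baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "butter"],
]
toy_frequent = apriori_frequent_itemsets(toy_baskets, min_support=0.5)
```

On these four baskets, each single item and each pair meets the 50% threshold, while the triple appears in only one basket and is dropped at level 3.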





In [4]:
import os
import pandas as pd
from collections import Counter
from itertools import combinations

PROJECT_ROOT = os.path.abspath("..")
PROCESSED_DIR = os.path.join(PROJECT_ROOT, "data", "processed")

transactions_df = pd.read_parquet(
    os.path.join(PROCESSED_DIR, "transactions.parquet")
)

print("Loaded transactions_df:", transactions_df.shape)
transactions_df.head()


Loaded transactions_df: (3346083, 2)


Unnamed: 0,order_id,product_id
0,1,"[49302, 11109, 10246, 49683, 43633, 13176, 472..."
1,2,"[33120, 28985, 9327, 45918, 30035, 17794, 4014..."
2,3,"[33754, 24838, 17704, 21903, 17668, 46667, 174..."
3,4,"[46842, 26434, 39758, 27761, 10054, 21351, 225..."
4,5,"[13176, 15005, 47329, 27966, 23909, 48370, 132..."


In [6]:
#basket_sizes = transactions_df["items"].apply(len)
#basket_sizes.describe(percentiles=[0.5, 0.75, 0.9, 0.95])

In [3]:
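# Count how many orders contain each item.
# NOTE: assumes an "items" column holding each order's list of product names;
# the preview above shows only order_id and product_id.
item_counter = Counter()

for items in transactions_df["items"]:
    item_counter.update(set(items))  # set() so an item counts once per order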

item_freq_df = (
    pd.DataFrame(item_counter.items(), columns=["item", "count"])
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)

item_freq_df.head(15)


Unnamed: 0,item,count
0,Banana,473023
1,Bag of Organic Bananas,376928
2,Organic Strawberries,267943
3,Organic Baby Spinach,244707
4,Organic Hass Avocado,216041
5,Organic Avocado,180406
6,Large Lemon,157288
7,Limes,143889
8,Strawberries,142586
9,Organic Raspberries,139544


In [4]:
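# Count how many orders contain each unordered pair of items.
pair_counter = Counter()

for items in transactions_df["items"]:
    unique_items = sorted(set(items))  # sorted so each pair has one canonical order
    for pair in combinations(unique_items, 2):
        pair_counter[pair] += 1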

pair_freq_df = (
    pd.DataFrame(pair_counter.items(), columns=["item_pair", "count"])
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)

pair_freq_df.head(10)


Unnamed: 0,item_pair,count
0,"(Bag of Organic Bananas, Organic Hass Avocado)",64524
1,"(Bag of Organic Bananas, Organic Strawberries)",64401
2,"(Banana, Organic Strawberries)",58147
3,"(Banana, Organic Avocado)",55442
4,"(Banana, Organic Baby Spinach)",53271
5,"(Bag of Organic Bananas, Organic Baby Spinach)",52343
6,"(Banana, Strawberries)",43010
7,"(Banana, Large Lemon)",42915
8,"(Organic Hass Avocado, Organic Strawberries)",42207
9,"(Bag of Organic Bananas, Organic Raspberries)",42127
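
Counting every pair can be expensive on millions of orders. One possible refinement, in the spirit of Apriori's pruning, is to count pairs only among items that already meet a minimum count, since a pair cannot appear in more orders than either of its items. The sketch below assumes a hypothetical helper `count_frequent_pairs` and a hypothetical `min_count` parameter; neither is part of the notebook.

```python
from collections import Counter
from itertools import combinations


def count_frequent_pairs(baskets, min_count):
    """Count item pairs, skipping items seen in fewer than `min_count` baskets."""
    # First pass: per-item basket counts.
    item_counts = Counter()
    for items in baskets:
        item_counts.update(set(items))
    frequent_items = {item for item, c in item_counts.items() if c >= min_count}

    # Second pass: pairs are formed only from the surviving items.
    pair_counts = Counter()
    for items in baskets:
        kept = sorted(set(items) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return pair_counts


toy_pairs = count_frequent_pairs(
    [["a", "b", "c"], ["a", "b"], ["a", "c", "d"], ["b", "a"]],
    min_count=2,
)
```

Here item `d` occurs in a single basket, so no pair involving it is ever generated.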


In [5]:
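num_transactions = len(transactions_df)

# Map item -> number of orders containing it
item_count_map = dict(item_freq_df[["item", "count"]].values)

rules = []

for (item_a, item_b), pair_count in pair_counter.items():
    # Fraction of all orders that contain both items
    support = pair_count / num_transactions

    # P(B | A). Pair keys are sorted, so only the A -> B direction is recorded;
    # the reverse rule B -> A would use pair_count / item_count_map[item_b].
    confidence_a_to_b = pair_count / item_count_map[item_a]

    # Observed co-occurrence relative to what independence would predict
    lift = support / (
        (item_count_map[item_a] / num_transactions) *
        (item_count_map[item_b] / num_transactions)
    )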

    rules.append({
        "antecedent": item_a,
        "consequent": item_b,
        "pair_count": pair_count,
        "support": support,
        "confidence": confidence_a_to_b,
        "lift": lift
    })

rules_df = pd.DataFrame(rules)
rules_df.sort_values("lift", ascending=False).head(10)


Unnamed: 0,antecedent,consequent,pair_count,support,confidence,lift
260458,Unsweetened Whole Milk Peach Greek Yogurt,Unsweetened Whole Milk Strawberry Yogurt,985,0.000371,0.496472,642.461856
41981,Oh My Yog! Organic Wild Quebec Blueberry Cream...,Oh My Yog! Pacific Coast Strawberry Trilayer Y...,1521,0.000574,0.661592,608.682772
205226,"Mighty 4 Kale, Strawberry, Amaranth & Greek Yo...","Mighty 4 Sweet Potato, Blueberry, Millet & Gre...",1188,0.000448,0.529176,543.003126
31938,Mighty 4 Purple Carrot Blackberry Quinoa & Gre...,"Mighty 4 Sweet Potato, Blueberry, Millet & Gre...",948,0.000358,0.480974,493.541807
96751,"Fiber & Protein Organic Pears, Raspberries, Bu...",Organic Fiber & Protein Pear Blueberry & Spina...,1099,0.000414,0.493046,458.70926
184102,Organic Stage 2 Carrots Baby Food,Sweet Potatoes Stage 2,835,0.000315,0.359914,452.927509
12989,Raspberry Essence Water,Unsweetened Blackberry Water,1064,0.000401,0.522337,452.461898
110404,"Mighty 4 Kale, Strawberry, Amaranth & Greek Yo...",Mighty 4 Purple Carrot Blackberry Quinoa & Gre...,755,0.000285,0.336303,452.416802
65503,Oh My Yog! Madagascar Vanilla Trilayer Yogyurt,Oh My Yog! Organic Wild Quebec Blueberry Cream...,1008,0.00038,0.385763,444.914843
46875,Oh My Yog! Madagascar Vanilla Trilayer Yogyurt,Oh My Yog! Pacific Coast Strawberry Trilayer Y...,1252,0.000472,0.479143,440.824462
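
For reference, the three metrics computed in the cell above follow the standard definitions. The symbols are chosen here for illustration: N is the number of orders, count(A) the number of orders containing A, and count(A, B) the number containing both A and B.

```latex
\begin{aligned}
\mathrm{support}(A \Rightarrow B) &= \frac{\mathrm{count}(A, B)}{N} \\[4pt]
\mathrm{confidence}(A \Rightarrow B) &= \frac{\mathrm{count}(A, B)}{\mathrm{count}(A)} \\[4pt]
\mathrm{lift}(A \Rightarrow B) &= \frac{\mathrm{support}(A \Rightarrow B)}{\dfrac{\mathrm{count}(A)}{N} \cdot \dfrac{\mathrm{count}(B)}{N}}
  = \frac{\mathrm{confidence}(A \Rightarrow B)}{\mathrm{count}(B) / N}
\end{aligned}
```

Support and lift are symmetric in A and B, while confidence depends on the direction of the rule. Because lift divides confidence by the consequent's overall frequency, rare items that mostly appear together can produce very large lift values even at very low support.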


In [6]:
filtered_rules_df = rules_df[
    (rules_df["support"] >= 0.001) &      # appears in at least 0.1% of orders
    (rules_df["confidence"] >= 0.3) &     # decent implication strength
    (rules_df["lift"] >= 1.5)             # real positive association
].sort_values("lift", ascending=False)

print("Filtered rules:", filtered_rules_df.shape)
filtered_rules_df.head(15)


Filtered rules: (20, 6)


Unnamed: 0,antecedent,consequent,pair_count,support,confidence,lift
33030,Almond Milk Blueberry Yogurt,Almond Milk Strawberry Yogurt,2762,0.001042,0.571961,260.131328
71329,Organic Whole Milk Strawberry Beet Berry Yogur...,Yotoddler Organic Pear Spinach Mango Yogurt,2875,0.001084,0.444771,191.199556
12392,Blueberry on the Bottom Nonfat Greek Yogurt,Strawberry on the Bottom Nonfat Greek Yogurt,2847,0.001074,0.437394,149.261197
12426,Peach on the Bottom Nonfat Greek Yogurt,Strawberry on the Bottom Nonfat Greek Yogurt,2751,0.001038,0.346168,118.130244
61967,Fat Free Blueberry Yogurt,Total 0% Raspberry Yogurt,2703,0.001019,0.369464,78.876112
69396,Fat Free Strawberry Yogurt,Total 0% Raspberry Yogurt,3108,0.001172,0.310211,66.226192
42278,Icelandic Style Skyr Blueberry Non-fat Yogurt,Non Fat Raspberry Yogurt,7413,0.002796,0.375513,59.376204
37181,Non Fat Acai & Mixed Berries Yogurt,Non Fat Raspberry Yogurt,3257,0.001228,0.364685,57.664064
18826,Nonfat Icelandic Style Strawberry Yogurt,Vanilla Skyr Nonfat Yogurt,3972,0.001498,0.364437,51.863026
44047,Icelandic Style Skyr Blueberry Non-fat Yogurt,Vanilla Skyr Nonfat Yogurt,6803,0.002566,0.344613,49.041821
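
The rules above are recorded in one direction only: pairs are built from sorted items, so the antecedent always precedes the consequent alphabetically. If both directions are of interest, one possible sketch follows. The helper `rules_both_directions` and the toy counts are hypothetical and not part of the notebook's pipeline.

```python
def rules_both_directions(pair_counts, item_counts, num_transactions):
    """Build A -> B and B -> A rules from pair and item counts (hypothetical helper)."""
    rows = []
    for (a, b), pair_count in pair_counts.items():
        support = pair_count / num_transactions
        # Lift is symmetric, so both directions share the same value.
        lift = support / (
            (item_counts[a] / num_transactions) * (item_counts[b] / num_transactions)
        )
        for antecedent, consequent in ((a, b), (b, a)):
            rows.append({
                "antecedent": antecedent,
                "consequent": consequent,
                "pair_count": pair_count,
                "support": support,
                # Confidence is the only metric that depends on direction.
                "confidence": pair_count / item_counts[antecedent],
                "lift": lift,
            })
    return rows


toy_rules = rules_both_directions(
    pair_counts={("bread", "butter"): 20},
    item_counts={"bread": 80, "butter": 25},
    num_transactions=100,
)
```

The returned list of dictionaries can be passed to `pd.DataFrame` and filtered with the same thresholds. Because confidence differs by direction, a reverse rule may pass the 0.3 confidence threshold where the forward rule did not, so the filtered count could differ from the 20 rules shown above.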


In [7]:
OUTPUT_DIR = os.path.join(PROJECT_ROOT, "outputs")
os.makedirs(OUTPUT_DIR, exist_ok=True)

filtered_rules_df.to_csv(
    os.path.join(OUTPUT_DIR, "association_rules_named.csv"),
    index=False
)

print("✅ Saved association_rules_named.csv")


✅ Saved association_rules_named.csv
