# 03 - Apriori, Association Rule Generation, and Results

This notebook loads the transaction matrix, runs Apriori to generate frequent itemsets,
derives association rules, filters them by confidence and lift, visualizes the results,
and saves the top rules. It also contains guidance on parameter tuning and interpretation.

### Load Transaction Matrix

In [None]:
import pickle
from pathlib import Path
import pandas as pd

TRAN_PKL = Path("../data/processed/transactions.pkl")
with open(TRAN_PKL, "rb") as f:
    obj = pickle.load(f)
transaction_df = obj['transaction_df']
vocab = obj.get('vectorizer_vocab', transaction_df.columns.tolist())
print("Loaded transaction matrix:", transaction_df.shape)

### Run Apriori (Tunable)

In [None]:
from mlxtend.frequent_patterns import apriori, association_rules

# Parameters: adjust as needed
min_support = 0.02   # start at 2% (0.02). Lower to 0.01 or 0.005 if too few results.
min_confidence = 0.6
min_lift = 1.2

print("Running apriori with min_support =", min_support)
frequent_itemsets = apriori(transaction_df, min_support=min_support, use_colnames=True)
print("Found frequent itemsets:", frequent_itemsets.shape[0])
display(frequent_itemsets.sort_values('support', ascending=False).head(20))
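
To make the effect of `min_support` concrete, here is a minimal sketch on made-up data. The transactions and the helpers `support` and `brute_force_frequent` are hypothetical and are not part of mlxtend. The sketch enumerates every itemset up to size 2 by brute force, which Apriori avoids by pruning candidates, and shows that lowering the threshold yields more frequent itemsets.

```python
from itertools import combinations

# Toy data (hypothetical): each transaction is the set of tokens in one email.
transactions = [
    {"bank", "verify", "account"},
    {"bank", "account"},
    {"verify", "account", "click"},
    {"meeting", "agenda"},
    {"bank", "verify", "account", "click"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def brute_force_frequent(transactions, min_support, max_len=2):
    """Keep every itemset of size <= max_len whose support meets min_support.

    Illustration only: this checks all combinations instead of pruning.
    """
    items = sorted(set().union(*transactions))
    found = {}
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            s = support(combo, transactions)
            if s >= min_support:
                found[frozenset(combo)] = s
    return found


# Lower thresholds admit more itemsets.
for threshold in (0.6, 0.4, 0.2):
    n_found = len(brute_force_frequent(transactions, threshold))
    print(f"min_support={threshold}: {n_found} frequent itemsets")
```

On real data the same trade-off applies: a lower `min_support` surfaces rarer token combinations at the cost of more candidates and longer run time.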

### Generate Association Rules and Filter

In [None]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)
print("Total rules (conf>=%.2f):" % min_confidence, rules.shape[0])

# Filter by lift, then rank by lift, confidence, and support (all descending)
rules = rules[(rules['lift'] >= min_lift)]
rules = rules.sort_values(['lift','confidence','support'], ascending=False)
print("Rules after lift>=%.2f filter:" % min_lift, rules.shape[0])
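
The three metrics used for filtering and ranking can be checked by hand on a small example. The sketch below uses made-up transactions and a hypothetical helper `rule_metrics`. It is not mlxtend's implementation, only the standard definitions: support is the share of transactions containing both sides of the rule, confidence is that count divided by the count of transactions containing the antecedent, and lift is confidence divided by the consequent's support.

```python
# Toy data (hypothetical): each transaction is the set of tokens in one email.
transactions = [
    {"bank", "verify", "account"},
    {"bank", "account"},
    {"verify", "account", "click"},
    {"meeting", "agenda"},
    {"bank", "verify", "account", "click"},
]


def rule_metrics(antecedent, consequent, transactions):
    """Compute support, confidence, and lift for antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_ante = sum(antecedent <= t for t in transactions)
    n_cons = sum(consequent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)

    rule_support = n_both / n
    confidence = n_both / n_ante if n_ante else 0.0
    # Lift compares the rule's confidence with the consequent's base rate.
    lift = confidence / (n_cons / n) if n_cons else 0.0
    return {"support": rule_support, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"bank", "verify"}, {"account"}, transactions)
print(metrics)
```

A lift above 1 means the consequent appears more often alongside the antecedent than its overall frequency would predict, which is why the cell above discards rules below `min_lift`.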

### Show Top Rules and Save

In [None]:
# Show top 20 rules (format antecedents -> consequents)
def fmt_itemset(s):
    """Render a frozenset of items as a sorted, comma-separated string."""
    return ', '.join(sorted(s))

top_rules = rules.head(20).copy()
top_rules['antecedents'] = top_rules['antecedents'].apply(fmt_itemset)
top_rules['consequents'] = top_rules['consequents'].apply(fmt_itemset)
display(top_rules[['antecedents','consequents','support','confidence','lift']])

# Save rules to CSV
OUT_RULES = Path("../data/processed/top_rules.csv")
top_rules.to_csv(OUT_RULES, index=False)
print("Saved top rules to:", OUT_RULES)

### Visualization (Support vs Confidence Scatter Sized by Lift)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,6))
sns.scatterplot(data=rules, x='support', y='confidence', size='lift', sizes=(20,200), alpha=0.7)
plt.title('Association Rules: support vs confidence (size=lift)')
plt.xlabel('support')
plt.ylabel('confidence')
plt.legend(title='lift', bbox_to_anchor=(1.05,1), loc='upper left')
plt.tight_layout()
plt.show()

### Quick Network of Top-K Rules

In [None]:
# Optional: network visualization of the top-K rules (requires networkx; reuses plt from the previous cell)
try:
    import networkx as nx
except ImportError as e:
    nx = None
    print("Network plot skipped (install networkx for this):", e)

if nx is not None:
    G = nx.DiGraph()
    topk = rules.head(30)
    for _, r in topk.iterrows():
        # Antecedents and consequents are frozensets of item names,
        # so each item is hashable and can be used as a node directly.
        for a in r['antecedents']:
            for b in r['consequents']:
                # If several rules share an edge, the last rule's lift is kept.
                G.add_edge(a, b, weight=r['lift'])
    plt.figure(figsize=(10,8))
    pos = nx.spring_layout(G, k=0.5, seed=42)
    nx.draw(G, pos, with_labels=True, node_size=800, font_size=8, arrowsize=15)
    plt.title('Rule network (top rules)')
    plt.show()

## Interpretation & Next Steps

- Focus on rules with both high lift and high confidence. Lift above 1 means the tokens co-occur more often than their individual frequencies would predict. A rule such as `{bank, verify} -> {account}` suggests a common phishing tactic (impersonation of banking/account verification).
- If Apriori produces too few itemsets, reduce `min_support` (try 0.01 or 0.005).
- To speed up or focus the analysis, mine only the phishing subset (an option offered in the transaction conversion step).
- Next: present the top rules in your report, explain why each rule is suspicious, and suggest how a rule-based filter could use combinations of tokens (rather than single keywords) to flag likely phishing emails.
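
As a starting point for the rule-based filter mentioned above, here is a minimal sketch. The helper `matching_rules` and the example rules are hypothetical and are not taken from this notebook's output. It assumes an email should be flagged only when every token from a rule's antecedent and consequent is present, which is one possible design choice among several.

```python
def matching_rules(email_tokens, rules):
    """Return the rules whose antecedent and consequent tokens all occur in the email."""
    tokens = set(email_tokens)
    return [
        (antecedent, consequent)
        for antecedent, consequent in rules
        if (set(antecedent) | set(consequent)) <= tokens
    ]


# Hypothetical rules; in practice these would come from top_rules.csv.
example_rules = [
    ({"bank", "verify"}, {"account"}),
    ({"click", "urgent"}, {"password"}),
]

suspicious = "please verify your bank account today".split()
benign = "bank holiday schedule for next week".split()

print(matching_rules(suspicious, example_rules))  # one rule matches
print(matching_rules(benign, example_rules))      # no rule matches
```

The second email contains the single keyword `bank` but is not flagged, which illustrates why token combinations can produce fewer false positives than single-keyword matching.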

Save the notebook with outputs and include `data/processed/top_rules.csv` with your submission.