# Association rule mining

You will use:
* orders.csv
* order_products__prior.csv
* products.csv
* aisles.csv (optional, but very useful later on)

### Step 1: Merge the tables

In [1]:
import pandas as pd

# Forward slashes avoid invalid escape sequences such as "\d" and also work on Windows
orders = pd.read_csv("../data_raw/orders.csv")
order_products = pd.read_csv("../data_raw/order_products__prior.csv")
products = pd.read_csv("../data_raw/products.csv")

# Merge product names
df = order_products.merge(products, on="product_id")

df.head()




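|   | order_id | product_id | add_to_cart_order | reordered | product_name          | aisle_id | department_id |
| - | -------- | ---------- | ----------------- | --------- | --------------------- | -------- | ------------- |
| 0 | 2        | 33120      | 1                 | 1         | Organic Egg Whites    | 86       | 16            |
| 1 | 2        | 28985      | 2                 | 1         | Michigan Organic Kale | 83       | 4             |
| 2 | 2        | 9327       | 3                 | 0         | Garlic Powder         | 104      | 13            |
| 3 | 2        | 45918      | 4                 | 1         | Coconut Butter        | 19       | 13            |
| 4 | 2        | 30035      | 5                 | 0         | Natural Sweetener     | 17       | 13            |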


### Step 2: Create baskets

We group products per order:

In [2]:
transactions = df.groupby('order_id')['product_name'].apply(list)
transactions.head()

order_id
2    [Organic Egg Whites, Michigan Organic Kale, Ga...
3    [Total 2% with Strawberry Lowfat Greek Straine...
4    [Plain Pre-Sliced Bagels, Honey/Lemon Cough Dr...
5    [Bag of Organic Bananas, Just Crisp, Parmesan,...
6    [Cleanse, Dryer Sheets Geranium Scent, Clean D...
Name: product_name, dtype: object

### Step 3: One-Hot Encoding (Basket Matrix)

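Association algorithms need a one-hot basket matrix: one row per order, one column per product, and a 1 wherever the order contains the product: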

| order_id | Banana | Milk | Yogurt | Bread |
| -------- | ------ | ---- | ------ | ----- |
| 1        | 1      | 1    | 1      | 0     |
| 2        | 0      | 0    | 1      | 1     |
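
As a small illustration of that layout, the sketch below rebuilds the toy matrix from the table with plain pandas. The `toy` frame and its values are hypothetical and only mirror the table; `pd.crosstab` is used here for brevity and is not the encoder used in the cells that follow.

```python
import pandas as pd

# Hypothetical order lines mirroring the toy table above.
toy = pd.DataFrame({
    "order_id":     [1, 1, 1, 2, 2],
    "product_name": ["Banana", "Milk", "Yogurt", "Yogurt", "Bread"],
})

# crosstab counts (order, product) pairs; "> 0" turns the counts into booleans.
toy_basket = pd.crosstab(toy["order_id"], toy["product_name"]) > 0

# Support of a product = share of orders that contain it.
toy_support = toy_basket.mean()

print(toy_basket.astype(int))
print(toy_support)
```

Each column mean is the support of a single product, which is the quantity the mining step later thresholds with `min_support`.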

We create it:


In [5]:
# Uncomment to install mlxtend into the interpreter this notebook is running on:
# import sys
# print(sys.executable)
# !{sys.executable} -m pip install mlxtend

In [10]:
# Building the full basket matrix caused a RAM error, so drop rare products first
# (about 90% of products are almost never bought)
product_counts = df['product_name'].value_counts()

popular_products = product_counts[product_counts > 5000].index

df_filtered = df[df['product_name'].isin(popular_products)]
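
To get a feel for why the unfiltered matrix does not fit in memory, here is a rough back-of-the-envelope sketch. It assumes a dense boolean matrix at one byte per cell (how NumPy stores `bool`); the helper name `dense_basket_gib` and the row and column counts are illustrative assumptions, not figures measured from this dataset.

```python
def dense_basket_gib(n_orders: int, n_products: int, bytes_per_cell: int = 1) -> float:
    """Approximate size in GiB of a dense n_orders x n_products matrix."""
    return n_orders * n_products * bytes_per_cell / 2**30

# Illustrative sizes only: millions of orders by tens of thousands of products,
# versus a sampled and filtered matrix.
full = dense_basket_gib(3_000_000, 50_000)
reduced = dense_basket_gib(200_000, 2_000)

print(f"full: {full:.1f} GiB, reduced: {reduced:.2f} GiB")
```

The size grows with the product of rows and columns, so cutting both the number of orders and the number of products shrinks it much faster than cutting either one alone.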


In [11]:
# Limit the number of orders:
# the Instacart data contains each client's full order history, so we keep only a sample
sample_orders = df_filtered['order_id'].drop_duplicates().sample(200000, random_state=42)

df_filtered = df_filtered[df_filtered['order_id'].isin(sample_orders)]

# Rebuild the transactions Series from the filtered sample
transactions = df_filtered.groupby('order_id')['product_name'].apply(list)


In [14]:
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)

basket = pd.DataFrame(te_ary, columns=te.columns_)
basket.head()

|   | 0% Greek Strained Yogurt | 1% Low Fat Milk | 1% Lowfat Milk | 100 Calorie Per Bag Popcorn | 100% Apple Juice | 100% Grated Parmesan Cheese | 100% Lactose Free Fat Free Milk | 100% Natural Spring Water | 100% Pure Apple Juice | 100% Pure Pumpkin | ... | YoKids Squeeze! Organic Strawberry Flavor Yogurt | YoKids Squeezers Organic Low-Fat Yogurt, Strawberry | YoKids Strawberry Banana/Strawberry Yogurt | Yobaby Organic Plain Yogurt | Yogurt, Lowfat, Strawberry | Yogurt, Strained Low-Fat, Coconut | Yotoddler Organic Pear Spinach Mango Yogurt | Yukon Gold Potatoes 5lb Bag | ZBar Organic Chocolate Brownie Energy Snack | Zero Calorie Cola |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |


In [17]:

from mlxtend.frequent_patterns import apriori, association_rules

# frequent itemsets
frequent_itemsets = apriori(
    basket,
    min_support=0.01,
    use_colnames=True
)

# association rules
rules = association_rules(
    frequent_itemsets,
    metric="lift",
    min_threshold=1.2
)

rules.head()


|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
| - | ----------- | ----------- | ------------------ | ------------------ | ------- | ---------- | ---- | ---------------- | -------- | ---------- | ------------- | ------- | --------- | ---------- |
| 0 | (Organic Baby Spinach) | (Bag of Organic Bananas) | 0.08074 | 0.126995 | 0.016865 | 0.20888 | 1.644792 | 1.0 | 0.006611 | 1.103506 | 0.426452 | 0.088359 | 0.093797 | 0.17084 |
| 1 | (Bag of Organic Bananas) | (Organic Baby Spinach) | 0.126995 | 0.08074 | 0.016865 | 0.132801 | 1.644792 | 1.0 | 0.006611 | 1.060033 | 0.449047 | 0.088359 | 0.056633 | 0.17084 |
| 2 | (Organic Hass Avocado) | (Bag of Organic Bananas) | 0.072465 | 0.126995 | 0.02113 | 0.291589 | 2.296067 | 1.0 | 0.011927 | 1.232343 | 0.608573 | 0.118488 | 0.188537 | 0.228987 |
| 3 | (Bag of Organic Bananas) | (Organic Hass Avocado) | 0.126995 | 0.072465 | 0.02113 | 0.166385 | 2.296067 | 1.0 | 0.011927 | 1.112665 | 0.646586 | 0.118488 | 0.101257 | 0.228987 |
| 4 | (Organic Raspberries) | (Bag of Organic Bananas) | 0.04569 | 0.126995 | 0.013265 | 0.290326 | 2.286122 | 1.0 | 0.007463 | 1.23015 | 0.589513 | 0.083208 | 0.187091 | 0.19739 |
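
To make the main columns concrete, the sketch below recomputes confidence, lift and leverage for rule 0 from the three support values in its row. It uses the standard definitions (confidence = support / antecedent support, lift = confidence / consequent support, leverage = support minus the product of the two individual supports); the variable names are only for illustration.

```python
# Values copied from rule 0 in the output above:
# (Organic Baby Spinach) -> (Bag of Organic Bananas)
antecedent_support = 0.08074   # share of orders containing the antecedent
consequent_support = 0.126995  # share of orders containing the consequent
support = 0.016865             # share of orders containing both

# How often the consequent appears among orders that contain the antecedent.
confidence = support / antecedent_support

# How much more often than if the two products were bought independently.
lift = confidence / consequent_support

# Absolute gap between observed and "independent" co-occurrence.
leverage = support - antecedent_support * consequent_support

print(round(confidence, 5), round(lift, 4), round(leverage, 6))
```

The results agree with the `confidence`, `lift` and `leverage` columns of that row up to the rounding of the displayed inputs. A lift above 1 means the pair co-occurs more often than independence would predict.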


In [None]:
strong_rules = rules[
    (rules['confidence'] > 0.4) &
    (rules['lift'] > 1.8) &
    (rules['support'] > 0.01)
].sort_values(by='lift', ascending=False)

strong_rules.head(20)

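|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
| - | ----------- | ----------- | ------------------ | ------------------ | ------- | ---------- | ---- | ---------------- | -------- | ---------- | ------------- | ------- | --------- | ---------- |

This filter returns no rows. The confidence cutoff of 0.4 is well above the rules printed earlier, where the highest confidence is about 0.29.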
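
One way to pick workable thresholds is to look at the spread of each metric before filtering. The sketch below rebuilds only the five rules printed by `rules.head()` (as plain strings instead of the frozensets mlxtend returns) and applies the same filter with two confidence cutoffs. The helper `filter_rules` and the relaxed value 0.25 are assumptions for illustration, not a recommended setting.

```python
import pandas as pd

# The five rules printed by rules.head() earlier (subset of columns).
shown = pd.DataFrame({
    "antecedents": ["Organic Baby Spinach", "Bag of Organic Bananas",
                    "Organic Hass Avocado", "Bag of Organic Bananas",
                    "Organic Raspberries"],
    "consequents": ["Bag of Organic Bananas", "Organic Baby Spinach",
                    "Bag of Organic Bananas", "Organic Hass Avocado",
                    "Bag of Organic Bananas"],
    "support":    [0.016865, 0.016865, 0.02113, 0.02113, 0.013265],
    "confidence": [0.20888, 0.132801, 0.291589, 0.166385, 0.290326],
    "lift":       [1.644792, 1.644792, 2.296067, 2.296067, 2.286122],
})

def filter_rules(df, min_confidence, min_lift=1.8, min_support=0.01):
    """Keep rules above all three thresholds, strongest lift first."""
    mask = ((df["confidence"] > min_confidence)
            & (df["lift"] > min_lift)
            & (df["support"] > min_support))
    return df[mask].sort_values(by="lift", ascending=False)

original = filter_rules(shown, min_confidence=0.4)   # cutoff used in the cell above
relaxed = filter_rules(shown, min_confidence=0.25)   # hypothetical, looser cutoff

print(shown["confidence"].describe())
print(len(original), len(relaxed))
```

On the full `rules` frame, the same `describe()` call on `confidence`, `lift` and `support` shows which cutoffs leave a usable number of rules.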


Now you are ready for mining 🔥

## 3) The Algorithms (What each one really does)

### A) Apriori — The Foundational Algorithm

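Idea: find sets of items that frequently appear together in the same order.

It relies on support pruning (the downward-closure property): every subset of a frequent itemset must itself be frequent.
So if {Milk, Bread} is not frequent, {Milk, Bread, Eggs} can NEVER be frequent, and Apriori skips it without counting it.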
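
The pruning idea can be sketched in a few lines of plain Python. This is a minimal teaching version on hypothetical toy baskets, not the mlxtend implementation; the function name `apriori_sketch` is made up for this example.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Share of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Candidate k-itemsets are unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Support pruning: drop a candidate if any (k-1)-subset is not frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy baskets.
toy = [
    ["Milk", "Bread", "Eggs"],
    ["Milk", "Eggs"],
    ["Bread", "Eggs"],
    ["Milk", "Eggs"],
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With these baskets, {Milk, Bread} appears in only one of four orders, so it falls below the 0.5 threshold. The candidate {Milk, Bread, Eggs} is then discarded in the pruning step and its support is never counted.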

Run Apriori