Explore the apyori library **(https://pypi.org/project/apyori)** to mine association rules from the frequent 
itemsets found in supermarket purchase data. Reference tutorial: 
https://www.kaggle.com/code/rockystats/apriori-algorithm-or-market-basket-analysis  
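
The three measures reported for each rule (support, confidence, lift) can be computed by hand on a toy dataset. The sketch below is illustrative only: the five baskets and the helper names `support`, `confidence` and `lift` are made up for this example and are not part of apyori.

```python
# Toy baskets (hypothetical data, not taken from the supermarket file).
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "jam"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Share of baskets containing the antecedent that also contain the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the consequent's own support; 1.0 means independence."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"milk", "bread"}))       # 3 of the 5 baskets
print(confidence({"milk"}, {"bread"}))  # 3 of the 4 baskets that contain milk
print(lift({"milk"}, {"bread"}))        # below 1: milk makes bread slightly less likely here
```

A lift above 1 indicates that the two sides of a rule appear together more often than they would if they were independent, which is why the notebook later ranks rules by lift.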


#### Apriori (apyori) for Market_Basket_Optimisation.csv

1. Import libraries

In [1]:
# import all required packages
import pandas as pd
import numpy as np
from apyori import apriori

2. Prepare the data

In [5]:
# read the data (each row is one transaction with at most 20 items)
df = pd.read_csv("Data/Market_Basket_Optimisation.csv", header=None)

# convert to a list of transactions (list of lists)
transactions = []
for i in range(df.shape[0]):
    items = df.iloc[i].dropna().astype(str).str.strip().tolist()
    items = [x for x in items if x and x.lower() != "nan"]
    transactions.append(items)
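
Because the rows have different lengths, pandas fills the missing cells of shorter rows with NaN, which is why the loop above drops missing values. As an alternative sketch, the same list of lists can be built with the standard-library `csv` module, which never creates the padding. The inline sample and the name `sample_transactions` are hypothetical stand-ins for the real file.

```python
import csv
import io

# Hypothetical three-line sample in the same ragged layout:
# one basket per line, a different number of items on each line.
sample = "shrimp,almonds,avocado\nburgers, eggs\nchutney\n"

sample_transactions = []
for row in csv.reader(io.StringIO(sample)):
    # strip stray whitespace and drop empty cells
    items = [cell.strip() for cell in row if cell.strip()]
    if items:  # skip blank lines
        sample_transactions.append(items)

print(sample_transactions)
```

For the real data, `io.StringIO(sample)` would be replaced by an open file handle on the CSV.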

3. Run Apriori (apyori) to obtain the association rules

In [6]:
rules = apriori(
    transactions,
    min_support=0.003,      # ~0.3%
    min_confidence=0.2,
    min_lift=3,
    min_length=2            # not a documented apyori parameter; it appears to be ignored
)

results = list(rules)
print("Number of RelationRecords found:", len(results))


Number of RelationRecords found: 80
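
To make the role of `min_support` concrete, here is a minimal pure-Python sketch of the level-wise frequent-itemset search that the Apriori algorithm is based on. It is not apyori's implementation: the function name `frequent_itemsets` and the toy data are assumptions made for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-k candidates are built only from frequent size-(k-1) sets."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    found = {}
    k = 1
    while level:
        for itemset in level:
            found[itemset] = support(itemset)
        k += 1
        # Join step: unions of two frequent sets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent (the Apriori property).
        level = {
            c for c in candidates
            if all(frozenset(s) in found for s in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return found

# Hypothetical toy baskets.
toy = [["milk", "bread"], ["milk", "bread", "butter"], ["bread", "butter"], ["milk", "butter"]]
found = frequent_itemsets(toy, min_support=0.5)
for itemset in sorted(found, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), found[itemset])
```

Raising `min_support` shrinks every level, so fewer candidates survive to the next one; that is the main lever for keeping the search tractable.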


4. Extract the rules into a table (support / confidence / lift)

In [7]:
import pandas as pd

rows = []
for r in results:
    for stat in r.ordered_statistics:
        if len(stat.items_base) == 0 or len(stat.items_add) == 0:
            continue
        rows.append({
            "antecedent": ", ".join(sorted(stat.items_base)),
            "consequent": ", ".join(sorted(stat.items_add)),
            "support": r.support,
            "confidence": stat.confidence,
            "lift": stat.lift
        })

rules_df = pd.DataFrame(rows)

# top 10 rules by lift, then confidence, then support
top10 = rules_df.sort_values(["lift","confidence","support"], ascending=False).head(10)
top10


| | antecedent | consequent | support | confidence | lift |
|---|---|---|---|---|---|
| 107 | frozen vegetables, soup | milk, mineral water | 0.003066 | 0.383333 | 7.987176 |
| 103 | frozen vegetables, olive oil | milk, mineral water | 0.003333 | 0.294118 | 6.128268 |
| 69 | mineral water, whole wheat pasta | olive oil | 0.003866 | 0.402778 | 6.115863 |
| 108 | milk, soup | frozen vegetables, mineral water | 0.003066 | 0.201754 | 5.646864 |
| 56 | tomato sauce | ground beef, spaghetti | 0.003066 | 0.216981 | 5.535971 |
| 109 | frozen vegetables, milk, mineral water | soup | 0.003066 | 0.277108 | 5.484407 |
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 58 | spaghetti, tomato sauce | ground beef | 0.003066 | 0.489362 | 4.9806 |
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
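
A single row of this table carries more information than its three numbers suggest. Using the standard definitions (confidence = support(rule) / support(antecedent), lift = confidence / support(consequent)), the sketch below recovers the individual supports for the top rule. The variable names are illustrative, and the results are approximate because the table values are rounded.

```python
# Values copied from the first row of the table above
# (frozen vegetables, soup -> milk, mineral water).
support_rule = 0.003066
confidence = 0.383333
lift = 7.987176

# support(antecedent) = support(rule) / confidence
support_antecedent = support_rule / confidence
# support(consequent) = confidence / lift
support_consequent = confidence / lift
# Joint support that would be expected if both sides were independent
expected_if_independent = support_antecedent * support_consequent

print(f"support(antecedent)      ~ {support_antecedent:.4f}")
print(f"support(consequent)      ~ {support_consequent:.4f}")
print(f"expected if independent  ~ {expected_if_independent:.6f}")
print(f"observed support of rule = {support_rule:.6f}")
```

The observed support is about eight times the independence baseline, which is exactly what the lift value expresses.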


5. Save the results for Market_Basket_Optimisation.csv

In [17]:
# Save all rules
rules_df.to_csv(
    "apyori_rules_marketbasket_all.csv",
    index=False,
    encoding="utf-8-sig"
)

# Save the top 10 rules
top10.to_csv(
    "apyori_rules_marketbasket_top10.csv",
    index=False,
    encoding="utf-8-sig"
)

print("Saved Market Basket rules to CSV files")


Saved Market Basket rules to CSV files


#### Apriori (apyori) for data-2.csv

1. Read the data

In [10]:
# read the data-2.csv file
df = pd.read_csv("Data/data-2.csv/data-2.csv")
df.head()


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |


2. Clean the data

In [11]:
# convert InvoiceNo to string
df["InvoiceNo"] = df["InvoiceNo"].astype(str)

# drop cancelled invoices (InvoiceNo contains 'C')
df = df[~df["InvoiceNo"].str.contains("C", na=False)]

# drop rows with quantity <= 0
df = df[df["Quantity"] > 0]

# drop rows with missing values
df = df.dropna(subset=["InvoiceNo", "Description"])

# clean the product names
df["Description"] = df["Description"].astype(str).str.strip()
df = df[df["Description"] != ""]
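
The cleaning rules above can be restated on plain Python dictionaries, which makes each rule easy to check in isolation. This is only a sketch: the four rows and the function name `clean` are hypothetical, and `"C" in invoice` mirrors the `str.contains("C")` test used in the cell.

```python
# Hypothetical rows shaped like the columns used above.
rows = [
    {"InvoiceNo": "536365", "Description": " WHITE METAL LANTERN ", "Quantity": 6},
    {"InvoiceNo": "C536379", "Description": "Discount", "Quantity": -1},  # cancelled
    {"InvoiceNo": "536400", "Description": None, "Quantity": 3},          # missing name
    {"InvoiceNo": "536401", "Description": "   ", "Quantity": 2},         # blank name
]

def clean(rows):
    """Apply the same four filters as the pandas cell, one row at a time."""
    cleaned = []
    for row in rows:
        invoice = str(row["InvoiceNo"])
        description = (row["Description"] or "").strip()
        if "C" in invoice:        # cancelled invoice
            continue
        if row["Quantity"] <= 0:  # non-positive quantity
            continue
        if not description:       # missing or blank product name
            continue
        cleaned.append({**row, "InvoiceNo": invoice, "Description": description})
    return cleaned

cleaned_rows = clean(rows)
print(cleaned_rows)
```

Only the first row survives, with its description stripped of surrounding whitespace.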


3. Build the list of transactions
- Each InvoiceNo is one transaction, and each transaction is the list of distinct products on that invoice.

In [12]:
transactions = (
    df.groupby("InvoiceNo")["Description"]
      .apply(lambda x: list(pd.unique(x)))
      .tolist()
)

print("Number of transactions:", len(transactions))
print("Average basket size:", np.mean([len(t) for t in transactions]))


Number of transactions: 20136
Average basket size: 25.81768970997219


4. (Recommended) Remove oversized baskets
- data-2 often contains very large invoices; dropping those with more than 200 distinct products keeps the rules meaningful.

In [13]:
basket_size = df.groupby("InvoiceNo")["Description"].nunique()
valid_invoices = basket_size[basket_size <= 200].index

df = df[df["InvoiceNo"].isin(valid_invoices)]

transactions = (
    df.groupby("InvoiceNo")["Description"]
      .apply(lambda x: list(pd.unique(x)))
      .tolist()
)

print("Transactions after filtering:", len(transactions))


Transactions after filtering: 19900
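
The 200-item cutoff is the notebook's choice. One way to sanity-check such a threshold is to look at the distribution of basket sizes first. The sketch below does this with the standard-library `statistics` module on hypothetical sizes; in the notebook the values would come from `basket_size`.

```python
import statistics

# Hypothetical basket sizes (distinct products per invoice), sorted for readability.
sizes = [1, 2, 2, 3, 5, 8, 12, 20, 35, 60, 150, 540, 1100]

# 99 cut points; index i - 1 holds the i-th percentile.
percentiles = statistics.quantiles(sizes, n=100, method="inclusive")
p50, p90, p99 = percentiles[49], percentiles[89], percentiles[98]
print(f"median={p50}, p90={p90}, p99={p99}")

cutoff = 200
kept = [s for s in sizes if s <= cutoff]
print(f"kept {len(kept)} of {len(sizes)} baskets with cutoff {cutoff}")
```

If the cutoff sits well above the bulk of the distribution, as here, it removes only the extreme invoices.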


5. Run Apriori with apyori

In [14]:
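rules = apriori(
    transactions,
    min_support=0.01,      # 1%
    min_confidence=0.3,
    min_lift=3,
    min_length=2,          # not a documented apyori parameter; it appears to be ignored
    max_length=3
)

# each RelationRecord is one itemset and can yield several directed rules
results = list(rules)
print("Number of RelationRecords:", len(results))


Number of RelationRecords: 423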
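
Both the `max_length=3` cap and the basket-size filter limit combinatorial growth: a basket with n distinct items contains C(n, 2) pairs and C(n, 3) triples. The quick calculation below uses the average basket size (about 25) and the 200-item cutoff from the earlier steps, plus a hypothetical 1,000-item invoice. These are upper bounds per basket, since Apriori prunes candidates that fall below `min_support`.

```python
from math import comb

for size in (25, 200, 1000):
    pairs = comb(size, 2)
    triples = comb(size, 3)
    print(f"{size:>5} items: {pairs:>9,} pairs, {triples:>13,} triples")
```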


6. Convert the results to a DataFrame

In [15]:
rows = []
for r in results:
    for stat in r.ordered_statistics:
        if len(stat.items_base) == 0 or len(stat.items_add) == 0:
            continue
        rows.append({
            # frozenset iteration order is arbitrary; wrap in sorted() for a stable item order
            "Antecedent": ", ".join(stat.items_base),
            "Consequent": ", ".join(stat.items_add),
            "Support": r.support,
            "Confidence": stat.confidence,
            "Lift": stat.lift
        })

rules_df = pd.DataFrame(rows)
rules_df.head()


| | Antecedent | Consequent | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | 60 TEATIME FAIRY CAKE CASES | 72 SWEETHEART FAIRY CAKE CASES | 0.011809 | 0.31886 | 11.838281 |
| 1 | 72 SWEETHEART FAIRY CAKE CASES | 60 TEATIME FAIRY CAKE CASES | 0.011809 | 0.438433 | 11.838281 |
| 2 | 60 TEATIME FAIRY CAKE CASES | PACK OF 60 DINOSAUR CAKE CASES | 0.011457 | 0.309362 | 11.337586 |
| 3 | PACK OF 60 DINOSAUR CAKE CASES | 60 TEATIME FAIRY CAKE CASES | 0.011457 | 0.41989 | 11.337586 |
| 4 | 60 TEATIME FAIRY CAKE CASES | PACK OF 60 PINK PAISLEY CAKE CASES | 0.015276 | 0.412483 | 10.87207 |


7. Top 10 strongest association rules (by Lift)

In [16]:
top10_rules = rules_df.sort_values(
    ["Lift", "Confidence", "Support"],
    ascending=False
).head(10)

top10_rules


| | Antecedent | Consequent | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 154 | HERB MARKER ROSEMARY | HERB MARKER THYME | 0.010101 | 0.934884 | 86.531098 |
| 155 | HERB MARKER THYME | HERB MARKER ROSEMARY | 0.010101 | 0.934884 | 86.531098 |
| 867 | REGENCY TEA PLATE PINK | REGENCY TEA PLATE GREEN, REGENCY TEA PLATE ROSES | 0.010352 | 0.830645 | 64.569682 |
| 870 | REGENCY TEA PLATE ROSES, REGENCY TEA PLATE GREEN | REGENCY TEA PLATE PINK | 0.010352 | 0.804688 | 64.569682 |
| 871 | REGENCY TEA PLATE PINK, REGENCY TEA PLATE ROSES | REGENCY TEA PLATE GREEN | 0.010352 | 0.944954 | 61.252727 |
| 866 | REGENCY TEA PLATE GREEN | REGENCY TEA PLATE PINK, REGENCY TEA PLATE ROSES | 0.010352 | 0.67101 | 61.252727 |
| 441 | REGENCY TEA PLATE PINK | REGENCY TEA PLATE GREEN | 0.011307 | 0.907258 | 58.809236 |
| 440 | REGENCY TEA PLATE GREEN | REGENCY TEA PLATE PINK | 0.011307 | 0.732899 | 58.809236 |
| 435 | REGENCY SUGAR BOWL GREEN | REGENCY MILK JUG PINK | 0.011106 | 0.775439 | 53.395253 |
| 434 | REGENCY MILK JUG PINK | REGENCY SUGAR BOWL GREEN | 0.011106 | 0.764706 | 53.395253 |
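
The rules in this table come in mirrored pairs with identical lift but different confidence. That follows from the definitions: lift(A -> B) = support(A and B) / (support(A) * support(B)) is symmetric in A and B, while confidence depends on which side is the antecedent. The sketch below checks this with the values from rows 441 and 440; the variable names are illustrative, and small deviations from the table are due to rounding.

```python
# Values copied from rows 441 and 440 of the table above.
support_pair = 0.011307
conf_pink_to_green = 0.907258   # REGENCY TEA PLATE PINK -> GREEN
conf_green_to_pink = 0.732899   # REGENCY TEA PLATE GREEN -> PINK

# support(antecedent) = support(pair) / confidence
support_pink = support_pair / conf_pink_to_green
support_green = support_pair / conf_green_to_pink

# lift = confidence / support(consequent)
lift_pink_to_green = conf_pink_to_green / support_green
lift_green_to_pink = conf_green_to_pink / support_pink

print(f"support(PINK)  ~ {support_pink:.6f}")
print(f"support(GREEN) ~ {support_green:.6f}")
print(f"lift PINK->GREEN ~ {lift_pink_to_green:.3f}")
print(f"lift GREEN->PINK ~ {lift_green_to_pink:.3f}")
```

Because lift cannot distinguish direction, confidence is the measure to consult when deciding which product to recommend given the other.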


8. Save the results for data-2.csv

In [19]:
top10_rules.to_csv("apyori_rules_data2_top10.csv", index=False)
rules_df.to_csv("apyori_rules_data2_all.csv", index=False)
