# Association rules

### Dataset for this data mining method

This dataset comes from "The Bread Basket" bakery in Edinburgh. It has 20,507 entries, 9,465 transactions, and 5 columns.
It contains the transactions of customers who ordered various items from the bakery online.

 https://www.kaggle.com/datasets/mittalvasu95/the-bread-basket?resource=download

The dataset is loaded with the Pandas function *read_csv()*.

In [2]:
import os
import pandas as pd

data_folder = os.path.join("datasets")
filename = os.path.join(data_folder, "bread-basket.csv")
df = pd.read_csv(filename)
df.head()

Unnamed: 0,Transaction,Item,date_time,period_day,weekday_weekend
0,1,Bread,30-10-2016 09:58,morning,weekend
1,2,Scandinavian,30-10-2016 10:05,morning,weekend
2,2,Scandinavian,30-10-2016 10:05,morning,weekend
3,3,Hot chocolate,30-10-2016 10:07,morning,weekend
4,3,Jam,30-10-2016 10:07,morning,weekend


The pandas library offers a number of useful functions that help explore the dataset.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20507 entries, 0 to 20506
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Transaction      20507 non-null  int64 
 1   Item             20507 non-null  object
 2   date_time        20507 non-null  object
 3   period_day       20507 non-null  object
 4   weekday_weekend  20507 non-null  object
dtypes: int64(1), object(4)
memory usage: 801.2+ KB


In [4]:
df.nunique().sort_values()

weekday_weekend       2
period_day            4
Item                 94
date_time          9182
Transaction        9465
dtype: int64

In [5]:
print("The total number of transactions is {} and the number of distinct items from which the association rules will be built is {}.".format(df['Transaction'].nunique(), df['Item'].nunique()))

The total number of transactions is 9465 and the number of distinct items from which the association rules will be built is 94.


In [6]:
df.isna().sum()

Transaction        0
Item               0
date_time          0
period_day         0
weekday_weekend    0
dtype: int64

In [7]:
df.describe(include=object)

Unnamed: 0,Item,date_time,period_day,weekday_weekend
count,20507,20507,20507,20507
unique,94,9182,4,2
top,Coffee,05-02-2017 11:58,afternoon,weekday
freq,5471,12,11569,12807


### Apriori implementation

The dataset has to be preprocessed into a form suitable for the Apriori algorithm. Using another Pandas function, *groupby()*, it is transformed into a transaction table with 2 columns: the transaction number and the items bought in that transaction. The items of each transaction are stored as a *frozenset*, which makes the later set operations simpler.

In [8]:
transaction_df = df.groupby(['Transaction'])['Item'].apply(frozenset).reset_index()
transaction_df.columns = ['Transaction', 'Item']
transaction_df

Unnamed: 0,Transaction,Item
0,1,(Bread)
1,2,(Scandinavian)
2,3,"(Cookies, Jam, Hot chocolate)"
3,4,(Muffin)
4,5,"(Bread, Pastry, Coffee)"
...,...,...
9460,9680,(Bread)
9461,9681,"(Tea, Christmas common, Truffles, Spanish Brunch)"
9462,9682,"(Tea, Tacos/Fajita, Coffee, Muffin)"
9463,9683,"(Pastry, Coffee)"


In [10]:
transactions = transaction_df['Item']
transaction_dict = transactions.to_dict()

Here the set of frequent single-item patterns, that is, the items that meet the minimum support, is created. The minimum support was set to 1% so that some interesting association rules are found.
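As a small illustration of what "support" means for single items, the sketch below computes it on a handful of made-up transactions. The data and the helper name `single_item_support` are hypothetical and serve only to show the calculation; the threshold of 0.5 is chosen just so that the toy example filters something out.

```python
from collections import Counter

# Hypothetical toy transactions, not taken from the bakery dataset.
toy_transactions = [
    frozenset({"Bread", "Coffee"}),
    frozenset({"Coffee", "Cake"}),
    frozenset({"Bread"}),
    frozenset({"Coffee", "Tea", "Cake"}),
]

def single_item_support(transactions):
    """Return {item: share of transactions that contain the item}."""
    counts = Counter(item for items in transactions for item in items)
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

support = single_item_support(toy_transactions)
# Keep only the items whose support reaches the (toy) minimum of 0.5.
frequent = {item: s for item, s in support.items() if s >= 0.5}
print(frequent)
```

Tea occurs in only one of the four transactions (support 0.25), so it is dropped, while Bread, Coffee, and Cake remain.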

In [11]:
# build the frequent 1-itemsets (single items that meet the minimum support);
# the result is stored in place under frequent_itemsets[1]
def get_k_1_itemset(min_support, frequent_itemsets, transactions):
    # count in how many transactions each item occurs
    count = {}
    for items in transactions:
        for item in items:
            count[item] = count.get(item, 0) + 1
    # convert the counts to relative support and keep only items that meet
    # the minimum (>= matches the test used for the larger itemsets)
    n_transactions = len(transactions)
    l1 = {}
    for item, item_count in count.items():
        support = item_count / n_transactions
        if support >= min_support:
            l1[frozenset([item])] = support

    frequent_itemsets[1] = l1

In [12]:

frequent_itemsets = {}
min_support = 0.01 # 1%

get_k_1_itemset(min_support, frequent_itemsets, transactions)


In [13]:
from collections import defaultdict
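# find the frequent itemsets of size k by extending the frequent (k-1)-itemsets
def find_frequent_itemsets(transactions_dict, k_1_itemsets, min_support):
    counter = defaultdict(int)
    for transaction_id, items in transactions_dict.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(items):
                # extend the itemset by every other item of the transaction;
                # note: the same superset is reached from each of its frequent
                # (k-1)-subsets, so it can be counted several times per transaction
                for other_item in items - itemset:
                    current_superset = itemset | frozenset((other_item,))
                    counter[current_superset] += 1
    # convert the counts to relative support and keep itemsets that meet the minimum
    n_transactions = len(transactions_dict)
    new_dict = {}
    for itemset, count in counter.items():
        support = count / n_transactions
        if support >= min_support:
            new_dict[itemset] = support
    return new_dict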
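The function above increments the counter once for every (k-1)-subset from which a superset is reached, so one transaction can contribute more than once to the same itemset. This may explain why the hand-written version reports more itemsets at 1% than the library runs further below. A minimal sketch of one possible variant that counts each candidate at most once per transaction follows; the name `count_supersets_once` and the toy data are hypothetical, and the function takes a plain list of transactions (for a dict, pass its `.values()`).

```python
from collections import defaultdict

def count_supersets_once(transactions, k_1_itemsets, min_support):
    """Extend frequent (k-1)-itemsets by one item, counting each
    resulting k-itemset at most once per transaction."""
    counter = defaultdict(int)
    for items in transactions:
        seen = set()  # candidates found in this transaction
        for itemset in k_1_itemsets:
            if itemset.issubset(items):
                for other in items - itemset:
                    seen.add(itemset | frozenset((other,)))
        for candidate in seen:
            counter[candidate] += 1
    n = len(transactions)
    return {c: cnt / n for c, cnt in counter.items() if cnt / n >= min_support}

# Hypothetical toy data.
toy = [
    frozenset({"Bread", "Coffee"}),
    frozenset({"Coffee", "Cake"}),
    frozenset({"Bread", "Coffee", "Cake"}),
    frozenset({"Tea"}),
]
singles = [frozenset({"Bread"}), frozenset({"Coffee"}), frozenset({"Cake"})]
pairs = count_supersets_once(toy, singles, min_support=0.5)
print(pairs)
```

Collecting the candidates of a transaction in a set before counting is what removes the duplicates; the rest mirrors the original function.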

In [14]:
import sys
# loop that builds the frequent itemsets of size k; it stops when no frequent itemset of the current size is found
print("There are {} items with more than {} % min_support".format(len(frequent_itemsets[1]), (min_support*100)))
sys.stdout.flush()
k = 2 
while True:
    cur_frequent_itemsets = find_frequent_itemsets(transaction_dict, frequent_itemsets[k-1], min_support)
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("Found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k))
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets
    k += 1
# remove the set L1 (a single item cannot form a rule A -> B)
del frequent_itemsets[1]

There are 30 items with more than 1.0 % min_support
Found 66 frequent itemsets of length 2
Found 27 frequent itemsets of length 3
Did not find any frequent itemsets of length 4


In [15]:
candidate_rules = []
# build candidate association rules of the form A -> B from the frequent itemsets
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        for B in itemset:
            A = itemset - set((B,))
            candidate_rules.append((A,B))
print("There are {} candidate rules".format(len(candidate_rules)))

There are 213 candidate rules


In [16]:
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
# for every transaction that contains the premise A, count whether the conclusion B is present (correct) or missing (incorrect)
for transactions_id, items in transaction_dict.items():
    for candidate_rule in candidate_rules:
        A,B = candidate_rule
        if A.issubset(items):
            if B in items:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
# the confidence of each rule is computed as s(A U B)/s(A), where s(X) is the number of transactions that contain X
rule_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule]) for candidate_rule in candidate_rules}
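To make the confidence formula concrete, here is a small self-contained sketch on made-up transactions. The helper `confidence` and the toy data are hypothetical and only mirror the counting done in the cell above.

```python
def confidence(transactions, premise, conclusion):
    """Share of the transactions containing `premise` that also contain `conclusion`."""
    premise = frozenset(premise)
    with_premise = [items for items in transactions if premise.issubset(items)]
    if not with_premise:
        return None  # confidence is undefined when the premise never occurs
    hits = sum(1 for items in with_premise if conclusion in items)
    return hits / len(with_premise)

# Hypothetical toy data: Toast occurs in three transactions, two of them with Coffee.
toy = [
    frozenset({"Toast", "Coffee"}),
    frozenset({"Toast", "Coffee", "Cake"}),
    frozenset({"Toast"}),
    frozenset({"Coffee"}),
]
toast_to_coffee = confidence(toy, {"Toast"}, "Coffee")
print(round(toast_to_coffee, 3))
```

Here s(A U B) = 2 and s(A) = 3, so the confidence of Toast -> Coffee is 2/3.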

In [17]:
from operator import itemgetter
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)


Listing of the 10 rules with the highest confidence at a minimum support of 1%

In [18]:
for index in range(10):
    (A,B) = sorted_confidence[index][0]
    print("Rule {0}: {1} -> {2} | confidence : {3:.3f}".format(index+1, [x for x in A], B, rule_confidence[(A,B)]))
    print("")

Rule 1: ['Toast'] -> Coffee | confidence : 0.704

Rule 2: ['Cake', 'Sandwich'] -> Coffee | confidence : 0.677

Rule 3: ['Pastry', 'Hot chocolate'] -> Coffee | confidence : 0.667

Rule 4: ['Soup', 'Sandwich'] -> Coffee | confidence : 0.654

Rule 5: ['Salad'] -> Coffee | confidence : 0.626

Rule 6: ['Cookies', 'Hot chocolate'] -> Coffee | confidence : 0.614

Rule 7: ['Cookies', 'Juice'] -> Coffee | confidence : 0.603

Rule 8: ['Hot chocolate', 'Cake'] -> Coffee | confidence : 0.602

Rule 9: ['Spanish Brunch'] -> Coffee | confidence : 0.599

Rule 10: ['Cookies', 'Cake'] -> Coffee | confidence : 0.580



## Using the mlxtend library

In [19]:
from mlxtend.frequent_patterns import apriori, association_rules

The dataset has to be preprocessed into the form expected by the _apriori()_ function from mlxtend: a _pandas.DataFrame_ of shape MxN, where M is the number of transactions and N is the number of available items, and each cell is True if the item is part of the transaction.
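As an aside, the same boolean MxN table can be sketched in one step with `pandas.crosstab`. This is an alternative to the pivot-based cells that follow, not the approach used in this notebook, and the toy DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data in the same long format (one row per purchased item).
toy_df = pd.DataFrame({
    "Transaction": [1, 2, 2, 3, 3, 3],
    "Item": ["Bread", "Coffee", "Coffee", "Bread", "Coffee", "Cake"],
})

# crosstab counts how often each item occurs per transaction;
# comparing with 0 turns the counts into True/False flags.
basket = pd.crosstab(toy_df["Transaction"], toy_df["Item"]) > 0
print(basket)
```

Transaction 2 contains Coffee twice, yet its cell is simply True, which is the form a boolean basket table needs.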

In [20]:
xtend_basket_data = df.groupby(['Transaction', 'Item'])['Item'].count().reset_index(name='Count')
xtend_data = xtend_basket_data.pivot_table(index='Transaction', columns='Item', values='Count', aggfunc='sum').fillna(0)
xtend_data.head()

Transaction,Adjustment,Afternoon with the baker,Alfajores,Argentina Night,Art Tray,Bacon,Baguette,Bakewell,Bare Popcorn,Basket,...,The BART,The Nomad,Tiffin,Toast,Truffles,Tshirt,Valentine's card,Vegan Feast,Vegan mincepie,Victorian Sponge
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:

def one_hot_encode(x):
    # True if the item occurs at least once in the transaction
    return x >= 1

In [22]:
xtend_data = xtend_data.applymap(one_hot_encode)
xtend_data.head()

Transaction,Adjustment,Afternoon with the baker,Alfajores,Argentina Night,Art Tray,Bacon,Baguette,Bakewell,Bare Popcorn,Basket,...,The BART,The Nomad,Tiffin,Toast,Truffles,Tshirt,Valentine's card,Vegan Feast,Vegan mincepie,Victorian Sponge
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


### mlxtend implementation of the Apriori algorithm

In [23]:
# build the frequent itemsets with Apriori
frequent_items = apriori(xtend_data, min_support = 0.02, use_colnames = True)
frequent_items['length'] = frequent_items['itemsets'].apply(len)
frequent_items

Unnamed: 0,support,itemsets,length
0,0.036344,(Alfajores),1
1,0.327205,(Bread),1
2,0.040042,(Brownie),1
3,0.103856,(Cake),1
4,0.478394,(Coffee),1
5,0.054411,(Cookies),1
6,0.039197,(Farm House),1
7,0.05832,(Hot chocolate),1
8,0.038563,(Juice),1
9,0.061807,(Medialuna),1


In [24]:
# generate rules from the frequent itemsets found by the Apriori algorithm
rules = association_rules(frequent_items, metric='lift')
# sort the DataFrame so that the rules are shown in descending order of confidence
rules = rules.sort_values(by=['confidence'], ascending=False)
rules = rules.reset_index(drop=True)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Toast),(Coffee),0.033597,0.478394,0.023666,0.704403,1.472431,0.007593,1.764582
1,(Medialuna),(Coffee),0.061807,0.478394,0.035182,0.569231,1.189878,0.005614,1.210871
2,(Pastry),(Coffee),0.086107,0.478394,0.047544,0.552147,1.154168,0.006351,1.164682
3,(Juice),(Coffee),0.038563,0.478394,0.020602,0.534247,1.11675,0.002154,1.119919
4,(Sandwich),(Coffee),0.071844,0.478394,0.038246,0.532353,1.112792,0.003877,1.115384
5,(Cake),(Coffee),0.103856,0.478394,0.054728,0.526958,1.101515,0.005044,1.102664
6,(Cookies),(Coffee),0.054411,0.478394,0.028209,0.518447,1.083723,0.002179,1.083174
7,(Hot chocolate),(Coffee),0.05832,0.478394,0.029583,0.507246,1.060311,0.001683,1.058553
8,(Pastry),(Bread),0.086107,0.327205,0.02916,0.33865,1.034977,0.000985,1.017305
9,(Cake),(Tea),0.103856,0.142631,0.023772,0.228891,1.604781,0.008959,1.111865
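The metric columns in the table can be recomputed from the three support values alone. The sketch below uses the standard definitions of confidence, lift, leverage, and conviction and plugs in the rounded supports from the first row (Toast -> Coffee); the helper `rule_metrics` is hypothetical, and small differences in the last digits come from the rounded inputs.

```python
def rule_metrics(support_a, support_b, support_ab):
    """Rule metrics for A -> B from the supports of A, B, and A U B."""
    confidence = support_ab / support_a
    lift = confidence / support_b
    leverage = support_ab - support_a * support_b
    # conviction grows without bound as confidence approaches 1
    conviction = float("inf") if confidence == 1 else (1 - support_b) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Supports of Toast, Coffee, and {Toast, Coffee} as printed in row 0 above.
toast_coffee = rule_metrics(0.033597, 0.478394, 0.023666)
for name, value in toast_coffee.items():
    print(f"{name}: {value:.4f}")
```

A lift above 1 and a positive leverage both say that Toast and Coffee appear together more often than they would if they were bought independently.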


### mlxtend implementation of FP-growth

In [25]:
from mlxtend.frequent_patterns import fpgrowth

# build the frequent itemsets with FP-growth
fitems_fpg = fpgrowth(xtend_data, min_support = 0.01, use_colnames=True)
fitems_fpg

Unnamed: 0,support,itemsets
0,0.327205,(Bread)
1,0.029054,(Scandinavian)
2,0.058320,(Hot chocolate)
3,0.054411,(Cookies)
4,0.015003,(Jam)
...,...,...
56,0.019651,"(Brownie, Coffee)"
57,0.010777,"(Bread, Brownie)"
58,0.023666,"(Coffee, Toast)"
59,0.018067,"(Coffee, Scone)"


In [26]:
# generate rules from the frequent itemsets found by the FP-growth algorithm
rules_fpg = association_rules(fitems_fpg, metric='lift')
# sort the DataFrame so that the rules are shown in descending order of confidence
rules_fpg = rules_fpg.sort_values(by=['confidence'], ascending=False)
rules_fpg = rules_fpg.reset_index(drop=True)
rules_fpg

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Toast),(Coffee),0.033597,0.478394,0.023666,0.704403,1.472431,0.007593,1.764582
1,(Spanish Brunch),(Coffee),0.018172,0.478394,0.010882,0.598837,1.251766,0.002189,1.300235
2,(Medialuna),(Coffee),0.061807,0.478394,0.035182,0.569231,1.189878,0.005614,1.210871
3,(Pastry),(Coffee),0.086107,0.478394,0.047544,0.552147,1.154168,0.006351,1.164682
4,(Alfajores),(Coffee),0.036344,0.478394,0.019651,0.540698,1.130235,0.002264,1.135648
5,(Juice),(Coffee),0.038563,0.478394,0.020602,0.534247,1.11675,0.002154,1.119919
6,(Sandwich),(Coffee),0.071844,0.478394,0.038246,0.532353,1.112792,0.003877,1.115384
7,(Cake),(Coffee),0.103856,0.478394,0.054728,0.526958,1.101515,0.005044,1.102664
8,(Scone),(Coffee),0.034548,0.478394,0.018067,0.522936,1.093107,0.001539,1.093366
9,(Cookies),(Coffee),0.054411,0.478394,0.028209,0.518447,1.083723,0.002179,1.083174


### Timing comparison of the two algorithms for mining frequent itemsets

In [27]:
%timeit -n 100 -r 10 apriori(xtend_data, min_support = 0.5)


1.16 ms ± 67.2 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)


In [28]:
%timeit -n 100 -r 10 fpgrowth(xtend_data, min_support = 0.5)

20.2 ms ± 757 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)


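* FP-growth should be an order of magnitude faster, and the documentation even says so, but here the opposite holds and it is an order of magnitude slower. A likely reason is that at min_support = 0.5 no item is frequent (the most common item, Coffee, has a support of 0.478), so both calls return an empty result and the timing reflects fixed overhead rather than the search itself.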

## Using the PyCaret library

In [29]:
from pycaret.arules import *

In [30]:
# process the dataset; set the transaction attribute and the item attribute
exp = setup(data=df, transaction_id='Transaction', item_id='Item')

Description,Value
session_id,3822.0
# Transactions,9465.0
# Items,94.0
Ignore Items,


In [31]:
# create the rules
model1 = create_model(metric="confidence", min_support=0.01)
model1

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Toast),(Coffee),0.0336,0.4784,0.0237,0.7044,1.4724,0.0076,1.7646
1,(Spanish Brunch),(Coffee),0.0182,0.4784,0.0109,0.5988,1.2518,0.0022,1.3002
2,(Medialuna),(Coffee),0.0618,0.4784,0.0352,0.5692,1.1899,0.0056,1.2109
3,(Pastry),(Coffee),0.0861,0.4784,0.0475,0.5521,1.1542,0.0064,1.1647
4,(Alfajores),(Coffee),0.0363,0.4784,0.0197,0.5407,1.1302,0.0023,1.1356
5,(Juice),(Coffee),0.0386,0.4784,0.0206,0.5342,1.1167,0.0022,1.1199
6,(Sandwich),(Coffee),0.0718,0.4784,0.0382,0.5324,1.1128,0.0039,1.1154
7,(Cake),(Coffee),0.1039,0.4784,0.0547,0.527,1.1015,0.005,1.1027
8,(Scone),(Coffee),0.0345,0.4784,0.0181,0.5229,1.0931,0.0015,1.0934
9,(Cookies),(Coffee),0.0544,0.4784,0.0282,0.5184,1.0837,0.0022,1.0832


In [32]:
# plot the rules
plot_model(model1, plot='2d', scale=0.55)

The PyCaret library offers the function _get_rules()_, which combines the previous two functions into one.

In [33]:
rules = get_rules(df, transaction_id='Transaction', item_id='Item', min_support=0.01, metric="lift")
rules

Description,Value
session_id,7588.0
# Transactions,9465.0
# Items,94.0
Ignore Items,


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Cake),"(Tea, Coffee)",0.1039,0.0499,0.0100,0.0966,1.9380,0.0049,1.0518
1,"(Tea, Coffee)",(Cake),0.0499,0.1039,0.0100,0.2013,1.9380,0.0049,1.1220
2,(Cake),(Hot chocolate),0.1039,0.0583,0.0114,0.1099,1.8839,0.0054,1.0579
3,(Hot chocolate),(Cake),0.0583,0.1039,0.0114,0.1957,1.8839,0.0054,1.1141
4,(Tea),(Cake),0.1426,0.1039,0.0238,0.1667,1.6048,0.0090,1.0754
...,...,...,...,...,...,...,...,...,...
69,(Bread),(Tea),0.3272,0.1426,0.0281,0.0859,0.6022,-0.0186,0.9379
70,(Bread),(Coffee),0.3272,0.4784,0.0900,0.2751,0.5751,-0.0665,0.7196
71,(Coffee),(Bread),0.4784,0.3272,0.0900,0.1882,0.5751,-0.0665,0.8287
72,(Bread),"(Coffee, Cake)",0.3272,0.0547,0.0100,0.0307,0.5605,-0.0079,0.9752


In [34]:
plot_model(rules, scale=0.75, plot='3d')