# Discovery of Frequent Itemsets and Association Rules

The problem of discovering association rules between itemsets in a sales transaction database (a set of baskets) includes the following two sub-problems:

1. Finding frequent itemsets with support at least s;
2. Generating association rules with confidence at least c from the itemsets found in the first step.
Remind that an association rule is an implication X → Y, where X and Y are itemsets such that X∩Y=∅. Support of the rule X → Y is the number of transactions that contain X⋃Y. Confidence of the rule X → Y the fraction of transactions containing X⋃Y in all transactions that contain X.

You are to solve the first sub-problem: to implement the A-Priori algorithm for finding frequent itemsets with support at least s in a dataset of sales transactions. Remind that support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses your A-Priori algorithm implementation to discover frequent itemsets with support at least s in a given dataset of sales transactions.

The sale transaction dataset includes generated transactions (baskets) of hashed items (see Canvas).

In [1]:
baskets = [i.strip().split() for i in open("T10I4D100K.dat").readlines()]
len(baskets)

100000

In [2]:
transactions = {} # Dictionary with transaction ID as key, and basket as value
count = 0
for basket in baskets:
    count += 1
    transactions[count] = basket


In [3]:
items = set() # Set of items from all baskets
for i in transactions.values():
    for j in i:
        items.add(j) 

In [4]:
# Count the frequency of each item
def freq(k,items, transactions):
    items_counts = dict() # Dictionary of item and its frequency
    for i in items:
        if k == 1:
            temp_i = {i}
        else:
            temp_i = set(i)
            
        for j in transactions.items(): # and basket
            if temp_i.issubset(set(j[1])): # if item is in basket
                if i in items_counts:
                    items_counts[i] += 1 # If already spotted/already in item-freq dict, add 1 to count
                else:
                    items_counts[i] = 1 # If not spotted yet, set count to 1
    return items_counts

In [184]:
items_counts = freq(1,items, transactions)

In [5]:
def support(items_counts, transactions):
    support = dict()
    for i in items_counts:
        support[i] = items_counts[i]/len(transactions) # Support = #transactions in which item appears/#total transactions
    return support   #Support for itemset I is the number of baskets containing all items in I-->i thought the same as you

In [7]:
min_support = 0.05
items_atleast_min_support = [{j[0]:j[1] for j in support(items_counts, transactions).items() if j[1]>=min_support}]

In [8]:
items_atleast_min_support

[{'368': 0.07828,
  '766': 0.06265,
  '419': 0.05057,
  '529': 0.07057,
  '722': 0.05845,
  '217': 0.05375,
  '354': 0.05835,
  '684': 0.05408,
  '494': 0.05102,
  '829': 0.0681}]

In [6]:
#different support
fr = []
s_min = 5000
L1 = [{j[0]:j[1] for j in freq(1,items, transactions).items() if j[1]>=s_min}]
list(L1[0].keys())
for x in list(L1[0].keys()):
    fr.append(tuple({x}))
    

KeyboardInterrupt: 

In [10]:
print(fr)

[('368',), ('766',), ('419',), ('529',), ('722',), ('217',), ('354',), ('684',), ('494',), ('829',)]


In [185]:
from itertools import combinations

# Candidates of len-k which are generated by combining itemsets from L_k-1 and L_1
def C_k(k, prev_freq):
    cand = []
    if k-1 == 1:
        #combs = combinations(list(temp.union(set(prev_freq.keys()))), k)
        combs = combinations(list(L1[0].keys()), k)
        cand = list(combs)
        #print(cand)
    else:
        combs = combinations(list(L1[0].keys()), k)
        cand = list(combs)
        #print(cand)
        
        for i in prev_freq[0].keys():
                temp = set(i)
                for j in L1[0].keys():
                    if len(temp.union({j}))==k and temp.union({j}) not in cand:
                        cand.append(tuple(temp.union({j})))
                #print(cand)
        
    return cand
#cand2 = C_k(2,L1)

In [33]:
def L_k(k, candidates, threshold):
    print(f"Calculating frequent items of size {k}")
    #for i in candidates: # Check for every item
    
    Lk = [{j[0]:j[1] for j in freq(k,candidates, transactions).items() if j[1]>=threshold}]
    
    return Lk
L2 = L_k(2, cand2, 100)
L2

Calculating frequent items of size 2


[{('684', '368'): 387,
  ('684', '829'): 349,
  ('684', '529'): 334,
  ('684', '766'): 613,
  ('684', '494'): 208,
  ('684', '217'): 198,
  ('684', '419'): 155,
  ('684', '722'): 443,
  ('684', '354'): 219,
  ('368', '829'): 1194,
  ('368', '529'): 640,
  ('368', '766'): 504,
  ('368', '494'): 860,
  ('368', '217'): 303,
  ('368', '419'): 355,
  ('368', '722'): 392,
  ('368', '354'): 319,
  ('829', '529'): 584,
  ('829', '766'): 321,
  ('829', '494'): 267,
  ('829', '217'): 275,
  ('829', '419'): 259,
  ('829', '722'): 294,
  ('829', '354'): 259,
  ('529', '766'): 317,
  ('529', '494'): 225,
  ('529', '217'): 403,
  ('529', '419'): 252,
  ('529', '722'): 283,
  ('529', '354'): 301,
  ('766', '494'): 227,
  ('766', '217'): 276,
  ('766', '419'): 238,
  ('766', '722'): 328,
  ('766', '354'): 329,
  ('494', '217'): 183,
  ('494', '419'): 176,
  ('494', '722'): 226,
  ('494', '354'): 189,
  ('217', '419'): 344,
  ('217', '722'): 498,
  ('217', '354'): 280,
  ('419', '722'): 366,
  ('419', 

Optional task for extra bonus: Solve the second sub-problem, i.e., develop and implement an algorithm for generating association rules between frequent itemsets discovered by using the A-Priori algorithm in a dataset of sales transactions. The rules must have support at least s and confidence at least c, where s and c are given as input parameters.

In [186]:
cand3 = C_k(3,L2)
cand3

[('368', '766', '419'),
 ('368', '766', '529'),
 ('368', '766', '722'),
 ('368', '766', '217'),
 ('368', '766', '354'),
 ('368', '766', '684'),
 ('368', '766', '494'),
 ('368', '766', '829'),
 ('368', '419', '529'),
 ('368', '419', '722'),
 ('368', '419', '217'),
 ('368', '419', '354'),
 ('368', '419', '684'),
 ('368', '419', '494'),
 ('368', '419', '829'),
 ('368', '529', '722'),
 ('368', '529', '217'),
 ('368', '529', '354'),
 ('368', '529', '684'),
 ('368', '529', '494'),
 ('368', '529', '829'),
 ('368', '722', '217'),
 ('368', '722', '354'),
 ('368', '722', '684'),
 ('368', '722', '494'),
 ('368', '722', '829'),
 ('368', '217', '354'),
 ('368', '217', '684'),
 ('368', '217', '494'),
 ('368', '217', '829'),
 ('368', '354', '684'),
 ('368', '354', '494'),
 ('368', '354', '829'),
 ('368', '684', '494'),
 ('368', '684', '829'),
 ('368', '494', '829'),
 ('766', '419', '529'),
 ('766', '419', '722'),
 ('766', '419', '217'),
 ('766', '419', '354'),
 ('766', '419', '684'),
 ('766', '419', 

In [64]:
L3 = L_k(3, cand3, 100)

Calculating frequent items of size 3


In [65]:
L3

[{('494', '684', '368'): 105,
  ('829', '684', '368'): 348,
  ('684', '529', '766'): 120,
  ('829', '766', '368'): 136,
  ('829', '529', '368'): 225,
  ('829', '217', '368'): 141,
  ('829', '354', '368'): 141,
  ('494', '829', '368'): 270,
  ('494', '529', '368'): 123,
  ('494', '722', '368'): 108,
  ('494', '217', '368'): 108,
  ('494', '354', '368'): 102,
  ('722', '354', '368'): 105}]

In [71]:
#Look for frequent items until there is no one

size = 1
frequent_items = []
s_min = 10
L1 = [{j[0]:j[1] for j in freq(1,items, transactions).items() if j[1]>=5000}]
#frequent_items.extend(list(L1[0].keys()))
for x in list(L1[0].keys()):
    frequent_items.append(tuple({x}))
prev_freq = L1
while True: 
    size+=1
    candidates = C_k(size,prev_freq)
    frequents = L_k(size,candidates,s_min)
    prev_freq = frequents
    if len(frequents[0])!=0:
        frequent_items.extend(list(frequents[0].keys()))
    else:
        break

Calculating frequent items of size 2
Calculating frequent items of size 3
Calculating frequent items of size 4
Calculating frequent items of size 5
Calculating frequent items of size 6


In [187]:
frequent_items

[('368',),
 ('766',),
 ('419',),
 ('529',),
 ('722',),
 ('217',),
 ('354',),
 ('684',),
 ('494',),
 ('829',),
 ('368', '766'),
 ('368', '419'),
 ('368', '529'),
 ('368', '722'),
 ('368', '217'),
 ('368', '354'),
 ('368', '684'),
 ('368', '494'),
 ('368', '829'),
 ('766', '419'),
 ('766', '529'),
 ('766', '722'),
 ('766', '217'),
 ('766', '354'),
 ('766', '684'),
 ('766', '494'),
 ('766', '829'),
 ('419', '529'),
 ('419', '722'),
 ('419', '217'),
 ('419', '354'),
 ('419', '684'),
 ('419', '494'),
 ('419', '829'),
 ('529', '722'),
 ('529', '217'),
 ('529', '354'),
 ('529', '684'),
 ('529', '494'),
 ('529', '829'),
 ('722', '217'),
 ('722', '354'),
 ('722', '684'),
 ('722', '494'),
 ('722', '829'),
 ('217', '354'),
 ('217', '684'),
 ('217', '494'),
 ('217', '829'),
 ('354', '684'),
 ('354', '494'),
 ('354', '829'),
 ('684', '494'),
 ('684', '829'),
 ('494', '829'),
 ('368', '766', '419'),
 ('368', '766', '529'),
 ('368', '766', '722'),
 ('368', '766', '217'),
 ('368', '766', '354'),
 ('36

## get all potential association rules

In [188]:
from itertools import chain, combinations
from copy import deepcopy


# For every subset A of frequent itemset I, rule is A -> I\A
def association_rules(frequents):
    lhs_rhs = []

    for itemset in frequents: # all subsets of itemset
        r = chain.from_iterable(combinations(itemset, r) for r in range(len(itemset)+1))
        final_r = []
        
        for com in list(r):
            if len(com)!=0 and len(com)!=len(itemset):
                final_r.append(com)   #all subsets of all frequent itemsets
            #print(final_r)
        
        for A in final_r:
            remaining = set(final_r)-{A}
            temp = deepcopy(remaining)
            for rem in remaining:
                for a in A:
                    if {a}.issubset(rem):
                        if rem in temp:
                            temp.remove(rem)  #so that i won't have eg a->a,b
                            
                            
                            
            for rhs in temp:
                if [A,rhs] not in lhs_rhs:
                    lhs_rhs.append([A,rhs]) #pairs lhs,rhs-->if not already present so that we won't take same
                                                            #association rule twice (set was ruining order)

            #print('A=',A,'remaining=',temp)
    print(lhs_rhs)


    pass

In [189]:
#association rule for itemsets of size >=2
fr = []
for f in frequent_items:
    if len(f)>1:
        fr.append(f)
        
#print(fr)  
association_rules(fr)

[[('368',), ('766',)], [('766',), ('368',)], [('368',), ('419',)], [('419',), ('368',)], [('368',), ('529',)], [('529',), ('368',)], [('368',), ('722',)], [('722',), ('368',)], [('368',), ('217',)], [('217',), ('368',)], [('368',), ('354',)], [('354',), ('368',)], [('368',), ('684',)], [('684',), ('368',)], [('368',), ('494',)], [('494',), ('368',)], [('368',), ('829',)], [('829',), ('368',)], [('766',), ('419',)], [('419',), ('766',)], [('766',), ('529',)], [('529',), ('766',)], [('766',), ('722',)], [('722',), ('766',)], [('766',), ('217',)], [('217',), ('766',)], [('766',), ('354',)], [('354',), ('766',)], [('766',), ('684',)], [('684',), ('766',)], [('766',), ('494',)], [('494',), ('766',)], [('766',), ('829',)], [('829',), ('766',)], [('419',), ('529',)], [('529',), ('419',)], [('419',), ('722',)], [('722',), ('419',)], [('419',), ('217',)], [('217',), ('419',)], [('419',), ('354',)], [('354',), ('419',)], [('419',), ('684',)], [('684',), ('419',)], [('419',), ('494',)], [('494',)

## Brute Force

In [16]:
from itertools import chain, combinations
from copy import deepcopy


# For every subset A of frequent itemset I, rule is A -> I\A
def association_rules(frequents):
    lhs_rhs = []
    assoc = []
    
    for itemset in frequents: # all subsets of itemset
        r = chain.from_iterable(combinations(itemset, r) for r in range(len(itemset)+1))
        final_r = []
        
        for com in reversed(list(r)):  #so that we can start from a,b,c-->d and then a,b-->c,d
            if len(com)!=0 and len(com)!=len(itemset):  #excract {} and the whole itemset
                final_r.append(com)   #all subsets of all frequent itemsets
            #print(final_r)
        
        for A in final_r:
            remaining = set(final_r)-{A}
            temp = deepcopy(remaining)
            for rem in remaining:
                for a in A:
                    if {a}.issubset(rem):
                        if rem in temp:
                            temp.remove(rem)  #so that i won't have eg a->a,b
            
            #support A
            sup = {}
            if sup.get(A, "empty")=="empty":
                supa = 0
                for j in transactions.items(): # and basket                    
                    if (set(A)).issubset(set(j[1])): # if item is in basket
                        supa +=1              
                sup[A] = supa
            
            
            
            
            for rhs in temp:
                if [A,rhs] not in lhs_rhs:
                    lhs_rhs.append([A,rhs]) #pairs lhs,rhs-->if not already present so that we won't take same
                                                            #association rule twice (set was ruining order)
                
                #print(set(A),set(rhs))
                
                #support of union
                sup_union = {}
                if sup_union.get((A,rhs), "empty")=="empty":
                    supb = 0
                    for j in transactions.items(): # and basket
                        if (set(A).union(set(rhs))).issubset(set(j[1])): # if item is in basket
                            supb +=1
                    sup_union[(A,rhs)] = supb
                
                if sup[A]>0:
                    conf = sup_union[(A,rhs)] / sup[A]
                    #print(A,rhs,conf)
                    if conf>0.6:
                        assoc.append((A,rhs))
                else:
                    print("no support")
                
            #print('A=',A,'remaining=',temp)
    #print(lhs_rhs)

    print(assoc)
    pass




#conf(I→j) = supp(I,j)/supp(I)


In [93]:
association_rules(fr)


[(('368', '722', '766', '684'), ('829',)), (('829', '722', '766', '684'), ('368',))]


## optimized

In [237]:
from itertools import chain, combinations
from copy import deepcopy
'''
def iteration(assoc, flag):
    #print(assoc)
    #print(flag)
    if assoc[0] and flag:
        print(flag)
        supa = 0
        supb = 0
        for j in transactions.items(): # and basket                    
            if (set(assoc[0])).issubset(set(j[1])): # if item is in basket
                supa +=1
            if (set(assoc[0]).union(set(assoc[1]))).issubset(set(j[1])): # if item is in basket
                supb +=1
            
            if supa>0:
                conf = supb / supa
            else:
                conf=0
            
            if conf < 0.6:
                flag = False

                #print(conf)
            else:
                print(conf)
                temp = assoc
                 
                for item in temp[0]:
                    assoc = (tuple(set(temp[0])-{item}),tuple(set(temp[1]).union({item})))
                    print(assoc)
                    iteration(assoc,True)
        pass            

'''
# For every subset A of frequent itemset I, rule is A -> I\A
def association_rules(frequents):

    for itemset in reversed(frequents): # all subsets of itemset
        #print(itemset)
        for i in itemset:
            new_list = [(tuple(set(itemset)-{i}), tuple({i}))]
            #print(new)
            flag = True
            while new_list and flag:
                item = new_list[0]
                while item[0] and flag:
                    supa = 0
                    supb = 0
                    for j in transactions.items(): # and basket
                        if (set(item[0])).issubset(set(j[1])): # if item is in basket
                            supa +=1
                        if (set(item[0]).union(set(item[1]))).issubset(set(j[1])): # if item is in basket
                            supb +=1
                    if supa>0:
                        conf = supb / supa
                    else:
                        conf=0
                    if conf < 0.6:
                        flag = False
                        del new_list[0]
                    else:
                        print(item,conf)
                        temp = item
                        new_list = []
                        for k in temp[0]:
                            new_list.append((tuple(set(temp[0])-{k}),tuple(set(temp[1]).union({k}))))
                        print("list=",new_list)

    pass




#conf(I→j) = supp(I,j)/supp(I)


In [238]:
association_rules(fr)

(('722', '829', '684', '766'), ('368',)) 0.6666666666666666
list= [(('829', '684', '766'), ('722', '368')), (('722', '684', '766'), ('829', '368')), (('722', '829', '766'), ('684', '368')), (('722', '829', '684'), ('766', '368'))]
(('722', '829', '684', '766'), ('368',)) 0.6666666666666666
list= [(('829', '684', '766'), ('722', '368')), (('722', '684', '766'), ('829', '368')), (('722', '829', '766'), ('684', '368')), (('722', '829', '684'), ('766', '368'))]
(('722', '829', '684', '766'), ('368',)) 0.6666666666666666
list= [(('829', '684', '766'), ('722', '368')), (('722', '684', '766'), ('829', '368')), (('722', '829', '766'), ('684', '368')), (('722', '829', '684'), ('766', '368'))]
(('722', '829', '684', '766'), ('368',)) 0.6666666666666666
list= [(('829', '684', '766'), ('722', '368')), (('722', '684', '766'), ('829', '368')), (('722', '829', '766'), ('684', '368')), (('722', '829', '684'), ('766', '368'))]
(('722', '829', '684', '766'), ('368',)) 0.6666666666666666
list= [(('829', 

(('722', '829', '684', '766'), ('368',)) 0.6666666666666666
list= [(('829', '684', '766'), ('722', '368')), (('722', '684', '766'), ('829', '368')), (('722', '829', '766'), ('684', '368')), (('722', '829', '684'), ('766', '368'))]
(('722', '829', '684', '766'), ('368',)) 0.6666666666666666
list= [(('829', '684', '766'), ('722', '368')), (('722', '684', '766'), ('829', '368')), (('722', '829', '766'), ('684', '368')), (('722', '829', '684'), ('766', '368'))]
(('722', '829', '684', '766'), ('368',)) 0.6666666666666666
list= [(('829', '684', '766'), ('722', '368')), (('722', '684', '766'), ('829', '368')), (('722', '829', '766'), ('684', '368')), (('722', '829', '684'), ('766', '368'))]
(('722', '829', '684', '766'), ('368',)) 0.6666666666666666
list= [(('829', '684', '766'), ('722', '368')), (('722', '684', '766'), ('829', '368')), (('722', '829', '766'), ('684', '368')), (('722', '829', '684'), ('766', '368'))]
(('722', '829', '684', '766'), ('368',)) 0.6666666666666666
list= [(('829', 

KeyboardInterrupt: 

In [94]:
assoc = (('722', '829', '766', '368'), ('684',))
supa = 0
supb = 0
for j in transactions.items(): # and basket                    
    if (set(assoc[0])).issubset(set(j[1])): # if item is in basket
        supa +=1
    if (set(assoc[0]).union(set(assoc[1]))).issubset(set(j[1])): # if item is in basket
        supb +=1
conf = supb / supa
print(conf)

0.3333333333333333


In [245]:
def association_rules(frequents):

    for itemset in reversed(frequents): # all subsets of itemset
        #print(itemset)
        for i in itemset:
            new = (tuple(set(itemset)-{i}), tuple({i}))
            print(new)
            flag = True
            while new[0] and flag:
                supa = 0
                supb = 0
                for j in transactions.items(): # and basket
                    if (set(new[0])).issubset(set(j[1])): # if item is in basket
                        supa +=1
                    if (set(new[0]).union(set(new[1]))).issubset(set(j[1])): # if item is in basket
                        supb +=1

                conf = supb / supa
                if conf < 0.6:
                    flag = False
                else:
                    print(conf)
                    temp = new
                    for item in temp[0]:
                        new = (tuple(set(temp[0])-{item}),tuple(set(temp[1]).union({item})))
                        print(new)

    pass

In [None]:
association_rules(fr)

(('722', '829', '766', '368'), ('684',))
(('722', '829', '684', '766'), ('368',))
0.6666666666666666
(('829', '684', '766'), ('722', '368'))
(('722', '684', '766'), ('829', '368'))
(('722', '829', '766'), ('684', '368'))
(('722', '829', '684'), ('766', '368'))
(('829', '684', '766', '368'), ('722',))
(('722', '684', '766', '368'), ('829',))
0.6666666666666666
(('684', '766', '368'), ('722', '829'))
(('722', '766', '368'), ('829', '684'))
(('722', '684', '368'), ('829', '766'))
(('722', '684', '766'), ('829', '368'))
(('722', '829', '684', '368'), ('766',))
(('722', '829', '766', '368'), ('684',))
(('722', '829', '684', '766'), ('368',))
0.6666666666666666
(('829', '684', '766'), ('722', '368'))
(('722', '684', '766'), ('829', '368'))
(('722', '829', '766'), ('684', '368'))
(('722', '829', '684'), ('766', '368'))
(('722', '684', '766', '368'), ('829',))
0.6666666666666666
(('684', '766', '368'), ('722', '829'))
(('722', '766', '368'), ('829', '684'))
(('722', '684', '368'), ('829', '766

(('829', '766', '368'), ('722',))
(('722', '766', '368'), ('829',))
(('722', '829', '368'), ('766',))
(('722', '829', '766'), ('368',))
(('684', '766', '368'), ('829',))
(('829', '766', '368'), ('684',))
(('829', '684', '368'), ('766',))
(('829', '684', '766'), ('368',))
(('684', '766', '368'), ('494',))
(('494', '766', '368'), ('684',))
(('494', '684', '368'), ('766',))
(('494', '684', '766'), ('368',))
(('766', '354', '368'), ('684',))
(('684', '354', '368'), ('766',))
(('684', '766', '368'), ('354',))
(('684', '354', '766'), ('368',))
(('217', '766', '368'), ('829',))
(('829', '766', '368'), ('217',))
(('829', '217', '368'), ('766',))
(('829', '217', '766'), ('368',))
(('766', '354', '368'), ('217',))
(('217', '354', '368'), ('766',))
(('217', '766', '368'), ('354',))
(('217', '354', '766'), ('368',))
(('722', '766', '368'), ('829',))
(('829', '766', '368'), ('722',))
(('829', '722', '368'), ('766',))
(('829', '722', '766'), ('368',))
(('684', '766', '368'), ('722',))
(('722', '766'

(('494', '766'), ('354',))
(('494', '354'), ('766',))
(('354', '766'), ('684',))
(('684', '766'), ('354',))
(('684', '354'), ('766',))
(('354', '766'), ('529',))
(('529', '766'), ('354',))
(('529', '354'), ('766',))
(('354', '766'), ('419',))
(('419', '766'), ('354',))
(('354', '419'), ('766',))
(('354', '766'), ('368',))
(('368', '766'), ('354',))
(('354', '368'), ('766',))
(('217', '766'), ('829',))
(('829', '766'), ('217',))
(('829', '217'), ('766',))
(('217', '766'), ('494',))
(('494', '766'), ('217',))
(('494', '217'), ('766',))
(('217', '766'), ('684',))
(('684', '766'), ('217',))
(('684', '217'), ('766',))
(('354', '766'), ('217',))
(('217', '766'), ('354',))
(('217', '354'), ('766',))
(('368', '766'), ('217',))
(('217', '766'), ('368',))
(('217', '368'), ('766',))
(('722', '766'), ('829',))
(('829', '766'), ('722',))
(('829', '722'), ('766',))
(('722', '766'), ('494',))
(('494', '766'), ('722',))
(('494', '722'), ('766',))
(('684', '766'), ('722',))
(('722', '766'), ('684',))
(