# Discovery of Frequent Itemsets and Association Rules

The problem of discovering association rules between itemsets in a sales transaction database (a set of baskets) includes the following two sub-problems:

1. Finding frequent itemsets with support at least s;
2. Generating association rules with confidence at least c from the itemsets found in the first step.

Recall that an association rule is an implication X → Y, where X and Y are itemsets such that X ∩ Y = ∅. The support of the rule X → Y is the number of transactions that contain X ∪ Y. The confidence of the rule X → Y is the fraction of the transactions containing X that also contain X ∪ Y.
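To make the two definitions concrete, here is a minimal sketch on a handful of made-up baskets. The data and the helper names `support_count` and `confidence` are illustrative assumptions and are not part of the assignment code.

```python
# Toy baskets (hypothetical data, not taken from the assignment dataset)
toy_baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support_count(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(lhs, rhs, baskets):
    """Fraction of the baskets containing `lhs` that also contain `rhs`."""
    return support_count(lhs | rhs, baskets) / support_count(lhs, baskets)

# Rule {bread} -> {milk}
rule_support = support_count({"bread", "milk"}, toy_baskets)
rule_confidence = confidence({"bread"}, {"milk"}, toy_baskets)
print(rule_support, rule_confidence)
```

Two of the four baskets contain both items, and three contain bread, so the rule has support 2 and confidence 2/3.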

You are to solve the first sub-problem: implement the A-Priori algorithm for finding frequent itemsets with support at least s in a dataset of sales transactions. Recall that the support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses it to discover frequent itemsets with support at least s in a given dataset of sales transactions.

The sales transaction dataset consists of generated transactions (baskets) of hashed items (see Canvas).

In [1]:
# One basket per line, items separated by whitespace
with open("T10I4D100K.dat") as f:
    baskets = [line.strip().split() for line in f]
len(baskets)

100000

In [2]:
# Dictionary with transaction ID (1-based) as key and basket as value
transactions = dict(enumerate(baskets, start=1))


In [3]:
items = set() # Set of items from all baskets
for i in transactions.values():
    for j in i:
        items.add(j) 

In [4]:
# Count, for each item (k == 1) or k-itemset (k > 1), the number of baskets that contain it
def freq(k, items, transactions):
    items_counts = dict()  # item or itemset -> number of baskets containing it
    basket_sets = [set(basket) for basket in transactions.values()]  # build each set only once
    for i in items:
        # A single item is a string; wrap it so that set() does not split it into characters
        temp_i = {i} if k == 1 else set(i)
        for basket in basket_sets:
            if temp_i.issubset(basket):  # every item of the candidate is in the basket
                items_counts[i] = items_counts.get(i, 0) + 1
    return items_counts

In [5]:
items_counts = freq(1,items, transactions)
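For single items, the same counts can be obtained in one pass over the baskets instead of one pass per item. The sketch below uses `collections.Counter`; the function name `count_items` and the toy baskets are assumptions made for illustration.

```python
from collections import Counter

def count_items(baskets):
    """Return a Counter mapping each item to the number of baskets containing it."""
    counts = Counter()
    for basket in baskets:
        # set() so that an item repeated inside one basket is counted once
        counts.update(set(basket))
    return counts

# Hypothetical toy baskets in the same format as the dataset (lists of string IDs)
toy_baskets = [["1", "2", "3"], ["2", "3"], ["2", "4"], ["2", "2"]]
toy_counts = count_items(toy_baskets)
print(toy_counts)
```

This touches every basket once, so the cost grows with the total number of items in the data rather than with the number of distinct items times the number of baskets.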

In [6]:
def support(items_counts, transactions):
    # Relative support: number of baskets containing the item / total number of baskets
    n = len(transactions)
    return {i: c / n for i, c in items_counts.items()}

In [7]:
min_support = 0.05
items_atleast_min_support = [{j[0]:j[1] for j in support(items_counts, transactions).items() if j[1]>=min_support}]

In [8]:
items_atleast_min_support

[{'494': 0.05102,
  '419': 0.05057,
  '722': 0.05845,
  '354': 0.05835,
  '684': 0.05408,
  '368': 0.07828,
  '217': 0.05375,
  '829': 0.0681,
  '766': 0.06265,
  '529': 0.07057}]

In [9]:
# Same threshold expressed as an absolute count: 5000 of 100000 baskets = 0.05
fr = []
s_min = 5000
L1 = [{j[0]:j[1] for j in freq(1,items, transactions).items() if j[1]>=s_min}]
for x in L1[0]:
    fr.append((x,))  # store each frequent item as a 1-tuple
    

In [10]:
print(fr)

[('494',), ('419',), ('722',), ('354',), ('684',), ('368',), ('217',), ('829',), ('766',), ('529',)]


In [11]:
from itertools import combinations

# Candidates of size k, generated by extending each itemset in L_{k-1} with one item from L_1
# (the result may contain the same itemset more than once)
def C_k(k, prev_freq):
    cand = []
    print(f"Calculating candidates of size {k}...")
    for i in prev_freq[0].keys():
        if k-1 == 1:
            temp = {i}
            combs = combinations(list(temp.union(set(L1[0].keys()))), k) 
            cand = list(combs)

        else:
            temp = set(i)
            for j in L1[0].keys():
                if len(temp.union({j}))==k:
                    cand.append(tuple(temp.union({j})))
    return cand
cand2 = C_k(2,L1)


Calculating candidates of size 2...
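The textbook A-Priori candidate step adds a prune: a k-itemset is kept only if all of its (k-1)-subsets are frequent. The sketch below shows that join-and-prune idea on toy data. It is not the `C_k` function above; the name `apriori_gen` and the convention of storing itemsets as sorted tuples are assumptions.

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Build size-k candidates from frequent (k-1)-itemsets given as sorted tuples."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            # Join: same first k-2 items, and a's last item sorts before b's last item
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune: every (k-1)-subset of the candidate must itself be frequent
                if all(sub in prev for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

# Hypothetical frequent pairs; items are strings, so ordering is lexicographic
toy_L2 = {("1", "2"), ("1", "3"), ("2", "3"), ("2", "4")}
toy_C3 = apriori_gen(toy_L2, 3)
print(toy_C3)
```

Here `("2", "3", "4")` is joined but then pruned because `("3", "4")` is not frequent, so only `("1", "2", "3")` remains. Keeping tuples sorted also avoids generating the same itemset in several orders.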


In [12]:
def L_k(k, candidates, threshold):
    print(f"Calculating frequent items of size {k}")
    # Keep only the candidates whose support count reaches the threshold
    Lk = [{j[0]:j[1] for j in freq(k,candidates, transactions).items() if j[1]>=threshold}]
    return Lk
L2 = L_k(2, cand2, 100)
L2

Calculating frequent items of size 2


[{('722', '829'): 294,
  ('722', '529'): 283,
  ('722', '766'): 328,
  ('722', '368'): 392,
  ('722', '419'): 366,
  ('722', '354'): 566,
  ('722', '494'): 226,
  ('722', '217'): 498,
  ('722', '684'): 443,
  ('829', '529'): 584,
  ('829', '766'): 321,
  ('829', '368'): 1194,
  ('829', '419'): 259,
  ('829', '354'): 259,
  ('829', '494'): 267,
  ('829', '217'): 275,
  ('829', '684'): 349,
  ('529', '766'): 317,
  ('529', '368'): 640,
  ('529', '419'): 252,
  ('529', '354'): 301,
  ('529', '494'): 225,
  ('529', '217'): 403,
  ('529', '684'): 334,
  ('766', '368'): 504,
  ('766', '419'): 238,
  ('766', '354'): 329,
  ('766', '494'): 227,
  ('766', '217'): 276,
  ('766', '684'): 613,
  ('368', '419'): 355,
  ('368', '354'): 319,
  ('368', '494'): 860,
  ('368', '217'): 303,
  ('368', '684'): 387,
  ('419', '354'): 263,
  ('419', '494'): 176,
  ('419', '217'): 344,
  ('419', '684'): 155,
  ('354', '494'): 189,
  ('354', '217'): 280,
  ('354', '684'): 219,
  ('494', '217'): 183,
  ('494', 

Optional task for extra bonus: Solve the second sub-problem, i.e., develop and implement an algorithm for generating association rules between frequent itemsets discovered by using the A-Priori algorithm in a dataset of sales transactions. The rules must have support at least s and confidence at least c, where s and c are given as input parameters.

In [13]:
cand3 = C_k(3,L2)
cand3

Calculating candidates of size 3...


[('722', '829', '494'),
 ('722', '419', '829'),
 ('722', '354', '829'),
 ('722', '684', '829'),
 ('722', '829', '368'),
 ('722', '217', '829'),
 ('722', '766', '829'),
 ('722', '529', '829'),
 ('722', '529', '494'),
 ('722', '419', '529'),
 ('722', '354', '529'),
 ('722', '684', '529'),
 ('722', '529', '368'),
 ('722', '217', '529'),
 ('722', '529', '829'),
 ('722', '766', '529'),
 ('722', '766', '494'),
 ('722', '766', '419'),
 ('722', '766', '354'),
 ('722', '766', '684'),
 ('722', '766', '368'),
 ('722', '766', '217'),
 ('722', '766', '829'),
 ('722', '766', '529'),
 ('722', '494', '368'),
 ('722', '419', '368'),
 ('722', '354', '368'),
 ('722', '684', '368'),
 ('722', '217', '368'),
 ('722', '829', '368'),
 ('722', '766', '368'),
 ('722', '529', '368'),
 ('722', '419', '494'),
 ('722', '419', '354'),
 ('722', '419', '684'),
 ('722', '419', '368'),
 ('722', '419', '217'),
 ('722', '419', '829'),
 ('722', '419', '766'),
 ('722', '419', '529'),
 ('722', '354', '494'),
 ('722', '419', 

In [14]:
L3 = L_k(3, cand3, 100)

Calculating frequent items of size 3


In [15]:
L3

[{('722', '829', '368'): 138,
  ('722', '354', '368'): 105,
  ('529', '829', '368'): 225,
  ('766', '829', '368'): 204,
  ('419', '829', '368'): 132,
  ('354', '829', '368'): 141,
  ('684', '829', '368'): 348,
  ('217', '829', '368'): 141,
  ('529', '766', '684'): 120,
  ('766', '684', '368'): 117}]
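Counting can also be organised basket by basket: for each basket, enumerate its k-subsets and look them up in the candidate set. The following sketch assumes candidates are stored as sorted tuples; `count_candidates` and the toy data are hypothetical names introduced here, not part of the notebook.

```python
from collections import Counter
from itertools import combinations

def count_candidates(baskets, candidates, k):
    """Count how many baskets contain each candidate k-itemset (sorted tuples)."""
    candidates = set(candidates)
    # Only items that occur in some candidate can contribute to a match
    useful = {item for cand in candidates for item in cand}
    counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & useful)
        for subset in combinations(kept, k):
            if subset in candidates:
                counts[subset] += 1
    return counts

toy_baskets = [["1", "2", "3"], ["1", "2"], ["2", "3"], ["1", "2", "3", "4"]]
toy_candidates = {("1", "2"), ("2", "3"), ("1", "3")}
toy_counts = count_candidates(toy_baskets, toy_candidates, 2)
print(toy_counts)
```

Whether this beats scanning all baskets once per candidate depends on basket sizes and on how many candidates there are, so it is worth timing both on the real data.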

In [16]:
# Grow frequent itemsets level by level until a level comes back empty

size = 1
frequent_items = []
s_min = 10  # support-count threshold for itemsets of size >= 2
# Note: single items are still filtered with the stricter threshold of 5000
L1 = [{j[0]:j[1] for j in freq(1,items, transactions).items() if j[1]>=5000}]
for x in L1[0]:
    frequent_items.append((x,))
prev_freq = L1
while True: 
    size+=1
    candidates = C_k(size,prev_freq)
    frequents = L_k(size,candidates,s_min)
    prev_freq = frequents
    if len(frequents[0])!=0:
        frequent_items.extend(list(frequents[0].keys()))
    else:
        break


Calculating candidates of size 2...
Calculating frequent items of size 2
Calculating candidates of size 3...
Calculating frequent items of size 3
Calculating candidates of size 4...
Calculating frequent items of size 4
Calculating candidates of size 5...
Calculating frequent items of size 5
Calculating candidates of size 6...
Calculating frequent items of size 6


In [17]:
frequent_items

[('494',),
 ('419',),
 ('722',),
 ('354',),
 ('684',),
 ('368',),
 ('217',),
 ('829',),
 ('766',),
 ('529',),
 ('722', '829'),
 ('722', '529'),
 ('722', '766'),
 ('722', '368'),
 ('722', '419'),
 ('722', '354'),
 ('722', '494'),
 ('722', '217'),
 ('722', '684'),
 ('829', '529'),
 ('829', '766'),
 ('829', '368'),
 ('829', '419'),
 ('829', '354'),
 ('829', '494'),
 ('829', '217'),
 ('829', '684'),
 ('529', '766'),
 ('529', '368'),
 ('529', '419'),
 ('529', '354'),
 ('529', '494'),
 ('529', '217'),
 ('529', '684'),
 ('766', '368'),
 ('766', '419'),
 ('766', '354'),
 ('766', '494'),
 ('766', '217'),
 ('766', '684'),
 ('368', '419'),
 ('368', '354'),
 ('368', '494'),
 ('368', '217'),
 ('368', '684'),
 ('419', '354'),
 ('419', '494'),
 ('419', '217'),
 ('419', '684'),
 ('354', '494'),
 ('354', '217'),
 ('354', '684'),
 ('494', '217'),
 ('494', '684'),
 ('217', '684'),
 ('722', '829', '494'),
 ('722', '419', '829'),
 ('722', '354', '829'),
 ('722', '684', '829'),
 ('722', '829', '368'),
 ('72

## Get all potential association rules

In [19]:
from itertools import chain, combinations
from copy import deepcopy


# Pair every non-empty proper subset A of a frequent itemset I with every subset of I that is disjoint from A
def association_rules(frequents):
    lhs_rhs = []

    for itemset in frequents:
        # All subsets of the itemset, from size 0 up to the full itemset
        r = chain.from_iterable(combinations(itemset, r) for r in range(len(itemset)+1))
        final_r = []
        
        for com in list(r):
            if len(com)!=0 and len(com)!=len(itemset):
                final_r.append(com)   # keep non-empty proper subsets only
        
        for A in final_r:
            remaining = set(final_r)-{A}
            temp = deepcopy(remaining)
            for rem in remaining:
                for a in A:
                    if {a}.issubset(rem):
                        if rem in temp:
                            temp.remove(rem)  # drop subsets sharing an item with A, e.g. no a -> a,b

            for rhs in temp:
                if [A,rhs] not in lhs_rhs:
                    lhs_rhs.append([A,rhs])  # store each (lhs, rhs) pair once; a list keeps the order

    print(lhs_rhs)

In [20]:
#association rule for itemsets of size >=2
fr = []
for f in frequent_items:
    if len(f)>1:
        fr.append(f)
        
#print(fr)  
association_rules(fr)

[[('722',), ('829',)], [('829',), ('722',)], [('722',), ('529',)], [('529',), ('722',)], [('722',), ('766',)], [('766',), ('722',)], [('722',), ('368',)], [('368',), ('722',)], [('722',), ('419',)], [('419',), ('722',)], [('722',), ('354',)], [('354',), ('722',)], [('722',), ('494',)], [('494',), ('722',)], [('722',), ('217',)], [('217',), ('722',)], [('722',), ('684',)], [('684',), ('722',)], [('829',), ('529',)], [('529',), ('829',)], [('829',), ('766',)], [('766',), ('829',)], [('829',), ('368',)], [('368',), ('829',)], [('829',), ('419',)], [('419',), ('829',)], [('829',), ('354',)], [('354',), ('829',)], [('829',), ('494',)], [('494',), ('829',)], [('829',), ('217',)], [('217',), ('829',)], [('829',), ('684',)], [('684',), ('829',)], [('529',), ('766',)], [('766',), ('529',)], [('529',), ('368',)], [('368',), ('529',)], [('529',), ('419',)], [('419',), ('529',)], [('529',), ('354',)], [('354',), ('529',)], [('529',), ('494',)], [('494',), ('529',)], [('529',), ('217',)], [('217',)

## Brute Force

In [92]:
from itertools import chain, combinations
from copy import deepcopy


# Pair every non-empty proper subset A of a frequent itemset I with every subset of I
# disjoint from A, and keep the rule A -> rhs when conf(A -> rhs) = supp(A ∪ rhs) / supp(A) > 0.6
def association_rules(frequents):
    lhs_rhs = []
    assoc = []
    sup = {}        # cache: A -> number of baskets containing A
    sup_union = {}  # cache: (A, rhs) -> number of baskets containing A ∪ rhs
    basket_sets = [set(basket) for basket in transactions.values()]

    for itemset in frequents:
        # All subsets of the itemset, largest first, so a,b,c -> d comes before a,b -> c,d
        r = chain.from_iterable(combinations(itemset, r) for r in range(len(itemset)+1))
        final_r = []
        for com in reversed(list(r)):
            if len(com)!=0 and len(com)!=len(itemset):  # exclude {} and the whole itemset
                final_r.append(com)

        for A in final_r:
            remaining = set(final_r)-{A}
            temp = deepcopy(remaining)
            for rem in remaining:
                for a in A:
                    if {a}.issubset(rem):
                        if rem in temp:
                            temp.remove(rem)  # drop subsets sharing an item with A, e.g. no a -> a,b

            # Support of A, computed once per distinct A
            if A not in sup:
                sup[A] = sum(1 for basket in basket_sets if set(A).issubset(basket))

            for rhs in temp:
                if [A,rhs] not in lhs_rhs:
                    lhs_rhs.append([A,rhs])  # record each (lhs, rhs) pair only once

                # Support of A ∪ rhs, computed once per distinct pair
                if (A,rhs) not in sup_union:
                    union = set(A).union(rhs)
                    sup_union[(A,rhs)] = sum(1 for basket in basket_sets if union.issubset(basket))

                if sup[A]>0:
                    conf = sup_union[(A,rhs)] / sup[A]
                    if conf>0.6:
                        assoc.append((A,rhs))
                else:
                    print("no support")

    print(assoc)



In [93]:
association_rules(fr)


[(('368', '722', '766', '684'), ('829',)), (('829', '722', '766', '684'), ('368',))]
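The brute-force version rescans every basket for each rule. Since A-Priori already produces a support count for each frequent itemset, and every subset of a frequent itemset is itself frequent, confidence can be read from a lookup table instead. A minimal sketch, assuming a dictionary keyed by `frozenset` that contains all frequent itemsets of every size; the name `rules_from_supports` and the toy counts are illustrative.

```python
from itertools import combinations

def rules_from_supports(supports, min_conf):
    """Generate rules A -> I \\ A from a dict {frozenset: support count}."""
    rules = []
    for itemset, count in supports.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                # conf(A -> I \ A) = supp(I) / supp(A), both read from the table
                conf = count / supports[lhs]
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

# Hypothetical support counts
toy_supports = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 3,
    frozenset({"a", "b"}): 3,
}
toy_rules = rules_from_supports(toy_supports, min_conf=0.8)
print(toy_rules)
```

With these counts, a -> b has confidence 3/4 and is dropped, while b -> a has confidence 1.0 and is kept. Unlike the cell above, this sketch only produces rules whose two sides together make up the whole itemset.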


## Optimized

In [55]:
from itertools import chain, combinations
from copy import deepcopy

# Given a rule x = (lhs, rhs), collect the rules obtained by moving one item from lhs to rhs;
# these need not be checked when x itself fails the confidence threshold
def sub(x, exclude):
    for j in x[0]:
        exclude.add((tuple(set(x[0])-{j}),tuple(set(x[1]).union({j}))))
    print(exclude)


# For every subset A of frequent itemset I, rule is A -> I\A
def association_rules(frequents):
    lhs_rhs = []
    assoc2 = []
    exclude = set()
    
    
    for itemset in reversed(frequents): # all subsets of itemset
        r = chain.from_iterable(combinations(itemset, r) for r in range(len(itemset)+1))
        final_r = []
        
        for com in reversed(list(r)):  # largest subsets first, so a,b,c -> d comes before a,b -> c,d
            if len(com)!=0 and len(com)!=len(itemset):  # exclude {} and the whole itemset
                final_r.append(com)   # non-empty proper subsets of the frequent itemset
            #print(final_r)
        
        for A in final_r:
            remaining = set(final_r)-{A}
            temp = deepcopy(remaining)
            for rem in remaining:
                for a in A:
                    if {a}.issubset(rem):
                        if rem in temp:
                            temp.remove(rem)  #so that i won't have eg a->a,b
            
            '''
            #support A
            sup = {}
            if sup.get(A, "empty")=="empty":
                supa = 0
                for j in transactions.items(): # and basket                    
                    if (set(A)).issubset(set(j[1])): # if item is in basket
                        supa +=1              
                sup[A] = supa
            '''
            
            
            
            for rhs in temp:
                if [A,rhs] not in lhs_rhs:
                    lhs_rhs.append([A,rhs]) #pairs lhs,rhs-->if not already present so that we won't take same
                                                            #association rule twice (set was ruining order)
                
                #print(set(A),set(rhs))
                
                #print(A,rhs)
                
                '''
                #support of union
                sup_union = {}
                if sup_union.get((A,rhs), "empty")=="empty" and (A,rhs) not in exclude:
                    exclude.add(sub(a,rhs))
                    #supb = 0
                    #for j in transactions.items(): # and basket
                    #    if (set(A).union(set(rhs))).issubset(set(j[1])): # if item is in basket
                     #       supb +=1
                    #sup_union[(A,rhs)] = supb
               
                if sup[A]>0:
                    conf = sup_union[(A,rhs)] / sup[A]
                    #print(A,rhs,conf)
                    if conf>0.6:
                        assoc2.append((A,rhs))
                        exclude.add(sub((a,rhs),exclude))
                else:
                    print("no support")
                '''
            #print('A=',A,'remaining=',temp)
    #print(lhs_rhs)
    sub((("a","b","c"),("d")),exclude)
    print(exclude)
    pass




#conf(I→j) = supp(I,j)/supp(I)


In [56]:
association_rules(fr)

{(('a', 'b'), ('d', 'c')), (('b', 'c'), ('a', 'd')), (('a', 'c'), ('d', 'b'))}
{(('a', 'b'), ('d', 'c')), (('b', 'c'), ('a', 'd')), (('a', 'c'), ('d', 'b'))}
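The exclusion idea rests on a monotonicity property: for a fixed itemset I, conf(A -> I \ A) = supp(I) / supp(A), and shrinking A can only increase supp(A), so a rule that fails the threshold cannot be rescued by moving items from the left side to the right. One possible way to use this is sketched below, processing left-hand sides from largest to smallest and shrinking only the ones that passed. The function name `rules_with_pruning` and the support counts are assumptions, and this is not a completion of the cell above.

```python
def rules_with_pruning(itemset, supports, min_conf):
    """Rules A -> I \\ A for one itemset I, shrinking only left-hand sides that passed."""
    itemset = frozenset(itemset)
    total = supports[itemset]
    rules = []
    # Start with the largest left-hand sides: I minus one item
    level = {itemset - {item} for item in itemset}
    while level:
        survivors = set()
        for lhs in level:
            conf = total / supports[lhs]
            if conf >= min_conf:
                rules.append((lhs, itemset - lhs, conf))
                survivors.add(lhs)
        # A failed lhs is never shrunk: its subsets have support >= its own,
        # so their confidence cannot be higher
        level = {lhs - {item} for lhs in survivors if len(lhs) > 1 for item in lhs}
    return rules

# Hypothetical support counts for I = {a, b, c} and all of its non-empty subsets
toy_supports = {
    frozenset("abc"): 2,
    frozenset("ab"): 2, frozenset("ac"): 3, frozenset("bc"): 4,
    frozenset("a"): 4, frozenset("b"): 5, frozenset("c"): 5,
}
toy_rules = rules_with_pruning("abc", toy_supports, min_conf=0.6)
print(toy_rules)
```

In this toy case b,c -> a fails with confidence 0.5, so it is never used as a starting point for smaller left-hand sides; the single-item left-hand sides are still reached through the two surviving pairs and are rejected on their own confidence.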
