# Discovery of Frequent Itemsets and Association Rules

The problem of discovering association rules between itemsets in a sales transaction database (a set of baskets) includes the following two sub-problems:

1. Finding frequent itemsets with support at least s;
2. Generating association rules with confidence at least c from the itemsets found in the first step.
Remind that an association rule is an implication X → Y, where X and Y are itemsets such that X∩Y=∅. Support of the rule X → Y is the number of transactions that contain X⋃Y. Confidence of the rule X → Y the fraction of transactions containing X⋃Y in all transactions that contain X.

#### Exercise 1
You are to solve the first sub-problem: to implement the A-Priori algorithm for finding frequent itemsets with support at least s in a dataset of sales transactions. Remind that support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses your A-Priori algorithm implementation to discover frequent itemsets with support at least s in a given dataset of sales transactions which includes generated transactions (baskets) of hashed items (see Canvas).

In [1]:
baskets = [i.strip().split() for i in open("T10I4D100K.dat").readlines()]
len(baskets)

100000

In [2]:
transactions = {} # Dictionary with transaction ID as key, and basket as value
count = 0
for basket in baskets:
    count += 1
    transactions[count] = basket

In [3]:
items = set() # Set of items from all baskets
for i in transactions.values():
    for j in i:
        items.add(j) 

In [4]:
# Count the frequency of each item
def freq(k,items, transactions):
    items_counts = dict() # Dictionary of item and its frequency
    for i in items:
        if k == 1:
            temp_i = {i}
        else:
            temp_i = set(i)
            
        for j in transactions.items(): # and basket
            if temp_i.issubset(set(j[1])): # if item is in basket
                if i in items_counts:
                    items_counts[i] += 1 # If already spotted/already in item-freq dict, add 1 to count
                else:
                    items_counts[i] = 1 # If not spotted yet, set count to 1
    return items_counts

In [7]:
# Support
s_min = 5000
L1 = [{j[0]:j[1] for j in freq(1,items, transactions).items() if j[1]>=s_min}]

In [9]:
from itertools import combinations

#candidates of len-k which are generated by combining itemsets from L_k-1 and L_1
def C_k(k, prev_freq):
    cand = []
    print(f"Calculating candidates of size {k}...")
    for i in prev_freq[0].keys():
        if k-1 == 1:
            temp = {i}
            combs = combinations(list(temp.union(set(L1[0].keys()))), k) 
            cand = list(combs)

        else:
            temp = set(i)
            for j in L1[0].keys():
                if len(temp.union({j}))==k:
                    cand.append(tuple(temp.union({j})))
    return cand
cand2 = C_k(2,L1)


Calculating candidates of size 2...


In [10]:
def L_k(k, candidates, threshold):
    print(f"Calculating frequent items of size {k}")
    
    Lk = [{j[0]:j[1] for j in freq(k,candidates, transactions).items() if j[1]>=threshold}]
    
    return Lk
L2 = L_k(2, cand2, 100)

Calculating frequent items of size 2


In [11]:
cand3 = C_k(3,L2)

Calculating candidates of size 3...


In [13]:
L3 = L_k(3, cand3, 100)

Calculating frequent items of size 3


In [15]:
#Look for frequent items until there is no one
lookup = [] # acts as lookup dictionary later on incl frozensets so order doesn't matter
size = 1
frequent_items = []
s_min = 10
L1 = [{j[0]:j[1] for j in freq(1,items, transactions).items() if j[1]>=5000}]
lookup.append({frozenset([k]): v for k, v in L1[0].items()})
#frequent_items.extend(list(L1[0].keys()))
for x in list(L1[0].keys()):
    frequent_items.append(tuple({x}))
prev_freq = L1
while True: 
    size+=1
    candidates = C_k(size,prev_freq)
    frequents = L_k(size,candidates,s_min)
    prev_freq = frequents
    if len(frequents[0])!=0:
        frequent_items.extend(list(frequents[0].keys()))
        lookup.append({frozenset(k): v for k, v in prev_freq[0].items()})
    else:
        break


Calculating candidates of size 2...
Calculating frequent items of size 2
Calculating candidates of size 3...
Calculating frequent items of size 3
Calculating candidates of size 4...
Calculating frequent items of size 4
Calculating candidates of size 5...
Calculating frequent items of size 5
Calculating candidates of size 6...
Calculating frequent items of size 6


#### Exercise 2 
Optional task for extra bonus: Solve the second sub-problem, i.e., develop and implement an algorithm for generating association rules between frequent itemsets discovered by using the A-Priori algorithm in a dataset of sales transactions. The rules must have support at least s and confidence at least c, where s and c are given as input parameters.

## get all potential association rules

In [16]:
from itertools import chain, combinations
from copy import deepcopy


# For every subset A of frequent itemset I, rule is A -> I\A
def association_rules(frequents):
    lhs_rhs = []

    for itemset in frequents: # all subsets of itemset
        r = chain.from_iterable(combinations(itemset, r) for r in range(len(itemset)+1))
        final_r = []
        
        for com in list(r):
            if len(com)!=0 and len(com)!=len(itemset):
                final_r.append(com)   #all subsets of all frequent itemsets
            #print(final_r)
        
        for A in final_r:
            remaining = set(final_r)-{A}
            temp = deepcopy(remaining)
            for rem in remaining:
                for a in A:
                    if {a}.issubset(rem):
                        if rem in temp:
                            temp.remove(rem)  #so that i won't have eg a->a,b
                            
                            
                            
            for rhs in temp:
                if [A,rhs] not in lhs_rhs:
                    lhs_rhs.append([A,rhs]) #pairs lhs,rhs-->if not already present so that we won't take same
                                                            #association rule twice (set was ruining order)

            #print('A=',A,'remaining=',temp)
    return lhs_rhs


    pass

In [17]:
#association rule for itemsets of size >=2
fr = []
for f in frequent_items:
    if len(f)>1:
        fr.append(f)
        
#print(fr)  
rules = association_rules(fr)

In [25]:
# Calculate confidence for each rule
min_c = 0.6
total_dict = {k: v for d in lookup for k, v in d.items()} # Create one dictionary as look-up
confidences = {}
for rule in rules:
    lhs = rule[0]
    rhs = rule[1]
    if frozenset(lhs) in total_dict:
        support = total_dict[frozenset(lhs)]
    union_lhs_rhs = frozenset(tuple(set(lhs+rhs)))
    if union_lhs_rhs in total_dict:
        support_union = total_dict[union_lhs_rhs]
        confidence = support_union/support
        confidences[(lhs,rhs)] = round(confidence,3)
association_rules_at_least_c = {j[0]:j[1] for j in confidences.items() if j[1]>=min_c}

In [26]:
association_rules_at_least_c

{(('494', '217', '722'), ('368',)): 0.714,
 (('368', '684', '722'), ('494',)): 0.75,
 (('494', '722', '829'), ('368',)): 1.818,
 (('684', '722', '368'), ('494',)): 0.75,
 (('494', '684', '829'), ('368',)): 0.667,
 (('217', '494', '722'), ('368',)): 0.714,
 (('217', '684', '829'), ('368',)): 0.667,
 (('217', '684', '368'), ('829',)): 0.727,
 (('217', '766', '829'), ('368',)): 0.667,
 (('368', '217', '684'), ('829',)): 0.727,
 (('829', '722', '766'), ('368',)): 0.8,
 (('722', '766', '368'), ('829',)): 0.632,
 (('829', '684', '722'), ('368',)): 0.737,
 (('684', '722', '829'), ('368',)): 0.737,
 (('368', '722', '766'), ('829',)): 0.632,
 (('722', '766', '829'), ('368',)): 0.8,
 (('368', '766', '722'), ('829',)): 0.632,
 (('829', '766', '722'), ('368',)): 0.8,
 (('722', '684', '829'), ('368',)): 0.737,
 (('722', '766', '829'), ('368', '684')): 0.667,
 (('368', '722', '684', '829'), ('766',)): 0.714,
 (('368', '722', '766', '829'), ('684',)): 0.833}

## Brute Force

In [None]:
from itertools import chain, combinations
from copy import deepcopy


# For every subset A of frequent itemset I, rule is A -> I\A
def association_rules(frequents):
    lhs_rhs = []
    assoc = []
    
    for itemset in frequents: # all subsets of itemset
        r = chain.from_iterable(combinations(itemset, r) for r in range(len(itemset)+1))
        final_r = []
        
        for com in reversed(list(r)):  #so that we can start from a,b,c-->d and then a,b-->c,d
            if len(com)!=0 and len(com)!=len(itemset):  #excract {} and the whole itemset
                final_r.append(com)   #all subsets of all frequent itemsets
            #print(final_r)
        
        for A in final_r:
            remaining = set(final_r)-{A}
            temp = deepcopy(remaining)
            for rem in remaining:
                for a in A:
                    if {a}.issubset(rem):
                        if rem in temp:
                            temp.remove(rem)  #so that i won't have eg a->a,b
            
            #support A
            sup = {}
            if sup.get(A, "empty")=="empty":
                supa = 0
                for j in transactions.items(): # and basket                    
                    if (set(A)).issubset(set(j[1])): # if item is in basket
                        supa +=1              
                sup[A] = supa
        
            
            
            
            for rhs in temp:
                if [A,rhs] not in lhs_rhs:
                    lhs_rhs.append([A,rhs]) #pairs lhs,rhs-->if not already present so that we won't take same
                                                            #association rule twice (set was ruining order)
                
                #print(set(A),set(rhs))
                
                #support of union
                sup_union = {}
                if sup_union.get((A,rhs), "empty")=="empty":
                    supb = 0
                    for j in transactions.items(): # and basket
                        if (set(A).union(set(rhs))).issubset(set(j[1])): # if item is in basket
                            supb +=1
                    sup_union[(A,rhs)] = supb
                
                if sup[A]>0:
                    conf = sup_union[(A,rhs)] / sup[A]
                    #print(A,rhs,conf)
                    if conf>0.6:
                        assoc.append((A,rhs))
                else:
                    print("no support")
                
            #print('A=',A,'remaining=',temp)
    #print(lhs_rhs)

    print(assoc)
    pass




#conf(I→j) = supp(I,j)/supp(I)


In [None]:
association_rules(fr)


## optimized

In [None]:
from itertools import chain, combinations
from copy import deepcopy

#get subsets of combinations in itemset that should not be checked
def sub(x, exclude):
    for i in x:
        for j in x[0]:
            #print("######",x,j)
            exclude.add((tuple(set(x[0])-{j}),tuple(set(x[1]).union({j}))))
            #print("$$$$$$",(tuple(set(x[0])-{j}),tuple(set(x[1]).union({j}))))
    print(exclude)


# For every subset A of frequent itemset I, rule is A -> I\A
def association_rules(frequents):
    lhs_rhs = []
    assoc2 = []
    exclude = set()
    
    
    for itemset in reversed(frequents): # all subsets of itemset
        r = chain.from_iterable(combinations(itemset, r) for r in range(len(itemset)+1))
        final_r = []
        
        for com in reversed(list(r)):  #so that we can start from a,b,c-->d and then a,b-->c,d
            if len(com)!=0 and len(com)!=len(itemset):  #excract {} and the whole itemset
                final_r.append(com)   #all subsets of all frequent itemsets
            #print(final_r)
        
        for A in final_r:
            remaining = set(final_r)-{A}
            temp = deepcopy(remaining)
            for rem in remaining:
                for a in A:
                    if {a}.issubset(rem):
                        if rem in temp:
                            temp.remove(rem)  #so that i won't have eg a->a,b
            
            '''
            #support A
            sup = {}
            if sup.get(A, "empty")=="empty":
                supa = 0
                for j in transactions.items(): # and basket                    
                    if (set(A)).issubset(set(j[1])): # if item is in basket
                        supa +=1              
                sup[A] = supa
            '''
            
            
            
            for rhs in temp:
                if [A,rhs] not in lhs_rhs:
                    lhs_rhs.append([A,rhs]) #pairs lhs,rhs-->if not already present so that we won't take same
                                                            #association rule twice (set was ruining order)
                
                #print(set(A),set(rhs))
                
                #print(A,rhs)
                
                '''
                #support of union
                sup_union = {}
                if sup_union.get((A,rhs), "empty")=="empty" and (A,rhs) not in exclude:
                    exclude.add(sub(a,rhs))
                    #supb = 0
                    #for j in transactions.items(): # and basket
                    #    if (set(A).union(set(rhs))).issubset(set(j[1])): # if item is in basket
                     #       supb +=1
                    #sup_union[(A,rhs)] = supb
               
                if sup[A]>0:
                    conf = sup_union[(A,rhs)] / sup[A]
                    #print(A,rhs,conf)
                    if conf>0.6:
                        assoc2.append((A,rhs))
                        exclude.add(sub((a,rhs),exclude))
                else:
                    print("no support")
                '''
            #print('A=',A,'remaining=',temp)
    #print(lhs_rhs)
    sub((("a","b","c"),("d")),exclude)
    print(exclude)
    pass




#conf(I→j) = supp(I,j)/supp(I)


In [None]:
association_rules(fr)