# Discovery of Frequent Itemsets and Association Rules

#### By group 16: Antonios Mantzaris & Ya Ting Hu

In this assignment we tackle the problem of discovering association rules between itemsets in a sales transaction database (a set of baskets), which consists of the following two sub-problems:

1. Finding frequent itemsets with support at least s;
2. Generating association rules with confidence at least c from the itemsets found in the first step.

## Exercise 1
You are to solve the first sub-problem: to implement the A-Priori algorithm for finding frequent itemsets with support at least s in a dataset of sales transactions. Recall that the support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses your A-Priori algorithm implementation to discover frequent itemsets with support at least s in a given dataset of sales transactions which includes generated transactions (baskets) of hashed items (see Canvas).
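
As a minimal sketch of the support definition above, the following counts how many baskets contain a given itemset. The toy baskets and the helper name `support` are made up for illustration and are not part of the assignment data.

```python
# Toy baskets (hypothetical data, not taken from T10I4D100K.dat)
toy_baskets = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "d"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    target = set(itemset)
    return sum(1 for basket in baskets if target <= basket)

print(support({"a"}, toy_baskets))       # 3
print(support({"a", "b"}, toy_baskets))  # 2
```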

In [1]:
with open("T10I4D100K.dat") as f: # the file is closed once the baskets are read
    baskets = [line.strip().split() for line in f]
#baskets = baskets[:5] # used as test subset

In [2]:
transactions = {} # Dictionary with transaction ID as key, and basket as value
count = 0
for basket in baskets:
    count += 1
    transactions[count] = basket

In [3]:
items = set() # Set of items from all baskets
for i in transactions.values():
    for j in i:
        items.add(j) 

In [4]:
# Count the support of each candidate: the number of baskets that contain it
def freq(k,items, transactions):
    items_counts = dict() # Maps each candidate (item or tuple of items) to its support
    for i in items:
        if k == 1:
            temp_i = {i} # a single item is wrapped in a set
        else:
            temp_i = set(i) # a tuple of items becomes a set
            
        for j in transactions.items(): # j is a (transaction ID, basket) pair
            if temp_i.issubset(set(j[1])): # if the whole candidate is in the basket
                if i in items_counts:
                    items_counts[i] += 1 # If already spotted/already in the dict, add 1 to count
                else:
                    items_counts[i] = 1 # If not spotted yet, set count to 1
    return items_counts # candidates that occur in no basket are absent from the dict

We set s_min for itemsets of size 1 to 5000. This is relatively high, but it keeps the running time of the functions short enough for a demonstration.
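
Much of the running time comes from `freq` scanning every basket once per candidate. For single items, one pass over the baskets is enough. The sketch below shows that idea on toy data; `count_single_items` and the toy baskets are illustrative names and values, not part of the notebook.

```python
from collections import Counter

def count_single_items(baskets):
    """Support of every single item, computed in one pass over the baskets."""
    counts = Counter()
    for basket in baskets:
        counts.update(set(basket))  # set(): an item repeated in a basket counts once
    return counts

# Hypothetical toy data; the last basket repeats "d" on purpose
toy_baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d", "d"]]
counts = count_single_items(toy_baskets)

# Keep only the items that reach the (toy) support threshold of 2
frequent = {item: n for item, n in counts.items() if n >= 2}
print(frequent)
```

The result has the same shape as the dictionary inside `L1`, so a helper like this could in principle stand in for `freq(1, items, transactions)`.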

In [5]:
# Support
s_min = 5000
base = freq(1,items, transactions)
L1 = [{j[0]:j[1] for j in base.items() if j[1]>=s_min}]  #frequent items 

### Candidate generation function
As input, the user passes k, the size of the itemsets to generate, and the frequent itemsets of the previous size (k-1).

In [6]:
from itertools import combinations

# Candidates of size k: all k-combinations of the items in L_1 and, for k > 2, also the unions of each itemset in L_k-1 with an item from L_1
def C_k(k, prev_freq):
    cand = []
    if k-1 == 1:
        combs = combinations(list(L1[0].keys()), k)
        cand = list(combs)
        
    else:
        combs = combinations(list(L1[0].keys()), k)
        cand = list(combs)

        for i in prev_freq[0].keys():
                temp = set(i)
                for j in L1[0].keys():
                    if len(temp.union({j}))==k:
                        cand.append(tuple(temp.union({j})))
    # Remove exact duplicate tuples (tuples holding the same items in a different order are kept)
    cand = [t for t in (set(tuple(i) for i in cand))]
    return cand
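
For comparison, here is a sketch of the textbook join-and-prune candidate step, which is not the function used in this notebook. It stores every candidate as a sorted tuple, so the same itemset cannot appear twice in different orders, and it drops candidates that have an infrequent (k-1)-subset. The name `generate_candidates` and the toy input are assumptions made for the example.

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Join frequent (k-1)-itemsets into k-itemsets and prune by the A-Priori property.

    prev_frequent: iterable of (k-1)-itemsets, each given as a tuple of items.
    Returns a sorted list of sorted tuples, so each itemset appears exactly once.
    """
    prev = {frozenset(s) for s in prev_frequent}
    candidates = set()
    for a, b in combinations(prev, 2):
        union = a | b
        if len(union) != k:
            continue
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
            candidates.add(tuple(sorted(union)))
    return sorted(candidates)

# Hypothetical frequent pairs
frequent_pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
print(generate_candidates(frequent_pairs, 3))  # [('a', 'b', 'c')]
```

In this toy input, `('a', 'b', 'd')` and `('b', 'c', 'd')` are pruned because `('a', 'd')` and `('c', 'd')` are not frequent.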

#### Example of the candidates function and its output (run the following cells only if of interest):

In [7]:
cand2 = C_k(2,L1)
print(cand2)

[('354', '684'), ('217', '722'), ('368', '494'), ('722', '829'), ('684', '419'), ('529', '766'), ('354', '368'), ('529', '494'), ('529', '722'), ('766', '368'), ('217', '419'), ('529', '217'), ('368', '419'), ('684', '368'), ('354', '829'), ('829', '494'), ('722', '766'), ('766', '829'), ('722', '494'), ('529', '419'), ('684', '829'), ('217', '368'), ('354', '766'), ('829', '419'), ('354', '494'), ('217', '829'), ('354', '722'), ('529', '368'), ('722', '419'), ('684', '766'), ('766', '494'), ('354', '217'), ('354', '529'), ('494', '419'), ('684', '494'), ('684', '722'), ('829', '368'), ('529', '829'), ('684', '529'), ('684', '217'), ('722', '368'), ('217', '766'), ('354', '419'), ('217', '494'), ('766', '419')]


In [8]:
def L_k(k, candidates, threshold):
    Lk = [{j[0]:j[1] for j in freq(k,candidates, transactions).items() if j[1]>=threshold}]
    return Lk

In [9]:
L2 = L_k(2, cand2, 1)
print(L2)

[{('354', '684'): 219, ('217', '722'): 498, ('368', '494'): 860, ('722', '829'): 294, ('684', '419'): 155, ('529', '766'): 317, ('354', '368'): 319, ('529', '494'): 225, ('529', '722'): 283, ('766', '368'): 504, ('217', '419'): 344, ('529', '217'): 403, ('368', '419'): 355, ('684', '368'): 387, ('354', '829'): 259, ('829', '494'): 267, ('722', '766'): 328, ('766', '829'): 321, ('722', '494'): 226, ('529', '419'): 252, ('684', '829'): 349, ('217', '368'): 303, ('354', '766'): 329, ('829', '419'): 259, ('354', '494'): 189, ('217', '829'): 275, ('354', '722'): 566, ('529', '368'): 640, ('722', '419'): 366, ('684', '766'): 613, ('766', '494'): 227, ('354', '217'): 280, ('354', '529'): 301, ('494', '419'): 176, ('684', '494'): 208, ('684', '722'): 443, ('829', '368'): 1194, ('529', '829'): 584, ('684', '529'): 334, ('684', '217'): 198, ('722', '368'): 392, ('217', '766'): 276, ('354', '419'): 263, ('217', '494'): 183, ('766', '419'): 238}]


In [10]:
cand3 = C_k(3,L2)
print(cand3)

[('529', '766', '419'), ('354', '368', '419'), ('829', '419', '684'), ('354', '829', '494'), ('217', '722', '829'), ('766', '829', '684'), ('529', '722', '829'), ('766', '722', '368'), ('217', '529', '494'), ('354', '684', '829'), ('766', '217', '722'), ('684', '217', '494'), ('529', '368', '494'), ('494', '368', '829'), ('529', '217', '684'), ('494', '684', '354'), ('354', '217', '829'), ('766', '217', '354'), ('217', '368', '419'), ('494', '419', '368'), ('217', '829', '494'), ('722', '368', '354'), ('529', '419', '829'), ('217', '354', '829'), ('829', '766', '684'), ('217', '684', '829'), ('684', '722', '829'), ('217', '722', '529'), ('766', '354', '829'), ('529', '217', '829'), ('217', '529', '684'), ('684', '529', '494'), ('419', '722', '368'), ('354', '529', '829'), ('684', '368', '494'), ('766', '368', '354'), ('529', '368', '684'), ('722', '829', '419'), ('684', '529', '217'), ('217', '722', '419'), ('494', '368', '529'), ('419', '354', '684'), ('766', '684', '829'), ('494', '5

In [11]:
L3 = L_k(3, cand3, 2)
print(L3)

[{('529', '766', '419'): 7, ('354', '368', '419'): 13, ('829', '419', '684'): 7, ('354', '829', '494'): 5, ('217', '722', '829'): 22, ('766', '829', '684'): 27, ('529', '722', '829'): 11, ('766', '722', '368'): 19, ('217', '529', '494'): 5, ('354', '684', '829'): 8, ('766', '217', '722'): 22, ('684', '217', '494'): 6, ('529', '368', '494'): 41, ('494', '368', '829'): 90, ('529', '217', '684'): 18, ('494', '684', '354'): 7, ('354', '217', '829'): 13, ('766', '217', '354'): 18, ('217', '368', '419'): 17, ('494', '419', '368'): 33, ('217', '829', '494'): 5, ('722', '368', '354'): 35, ('529', '419', '829'): 20, ('217', '354', '829'): 13, ('829', '766', '684'): 27, ('217', '684', '829'): 12, ('684', '722', '829'): 19, ('217', '722', '529'): 33, ('766', '354', '829'): 11, ('529', '217', '829'): 15, ('217', '529', '684'): 18, ('684', '529', '494'): 8, ('419', '722', '368'): 27, ('354', '529', '829'): 18, ('684', '368', '494'): 35, ('766', '368', '354'): 25, ('529', '368', '684'): 33, ('722', 

## Get frequent items
L1 has already been found and is used to build the candidates C_k as a combination of L1 and L_(k-1). For itemsets of size 2 and larger, the support threshold is set to 10 in order to get some interesting results; L1 itself is not recomputed and keeps the threshold of 5000 used above. The generation of candidates and frequent itemsets stops at the first size for which no frequent itemsets are found; otherwise it continues with the candidates and frequent itemsets of the next size.
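
The level-wise procedure described above can be sketched on toy data as follows. This is a simplified, self-contained illustration with a single threshold for all sizes, not the notebook's exact code; `apriori` and the toy baskets are hypothetical.

```python
from itertools import combinations

def apriori(baskets, s_min):
    """Return {frozenset: support} for every itemset with support >= s_min."""
    baskets = [frozenset(b) for b in baskets]

    # Level 1: count single items
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= s_min}
    frequent = dict(current)

    k = 2
    while current:  # stop at the first size with no frequent itemsets
        # Join: unions of two frequent (k-1)-itemsets that contain exactly k items
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        counts = {c: sum(1 for basket in baskets if c <= basket) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= s_min}
        frequent.update(current)
        k += 1
    return frequent

toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
result = apriori(toy, 2)
print(result)
```

Keying the result by `frozenset` means that item order does not matter when an itemset is looked up later.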

In [12]:
#Look for frequent itemsets until none are found
lookup = [] # one dict per itemset size, keyed by frozenset so item order doesn't matter; merged into a single look-up dictionary later on
size = 1
frequent_items = []
s_min = 10
#base = freq(1,items, transactions)
#L1 = [{j[0]:j[1] for j in base.items() if j[1]>=1}]
print(f"Checking for k={size}...")
lookup.append({frozenset([k]): v for k, v in base.items()}) # Calculated in previous cell

for x in list(L1[0].keys()):
    frequent_items.append(tuple({x}))
prev_freq = L1
while True: 
    size+=1
    print(f"Checking for k={size}...")
    candidates = C_k(size,prev_freq)
    lvl = freq(size,candidates,transactions)
    frequents = [{j[0]:j[1] for j in lvl.items() if j[1]>=s_min}]
    prev_freq = frequents
    if len(frequents[0])!=0:
        frequent_items.extend(list(frequents[0].keys()))
        lookup.append({frozenset(k): v for k, v in lvl.items()})
    else:
        break


Checking for k=1...
Checking for k=2...
Checking for k=3...
Checking for k=4...


In [13]:
print(frequent_items)

[('354',), ('684',), ('529',), ('217',), ('722',), ('766',), ('829',), ('368',), ('494',), ('419',), ('354', '684'), ('217', '722'), ('368', '494'), ('722', '829'), ('684', '419'), ('529', '766'), ('354', '368'), ('529', '494'), ('529', '722'), ('766', '368'), ('217', '419'), ('529', '217'), ('368', '419'), ('684', '368'), ('354', '829'), ('829', '494'), ('722', '766'), ('766', '829'), ('722', '494'), ('529', '419'), ('684', '829'), ('217', '368'), ('354', '766'), ('829', '419'), ('354', '494'), ('217', '829'), ('354', '722'), ('529', '368'), ('722', '419'), ('684', '766'), ('766', '494'), ('354', '217'), ('354', '529'), ('494', '419'), ('684', '494'), ('684', '722'), ('829', '368'), ('529', '829'), ('684', '529'), ('684', '217'), ('722', '368'), ('217', '766'), ('354', '419'), ('217', '494'), ('766', '419'), ('354', '368', '419'), ('217', '722', '829'), ('766', '829', '684'), ('529', '722', '829'), ('766', '722', '368'), ('766', '217', '722'), ('529', '368', '494'), ('494', '368', '82

## Exercise 2 
Optional task for extra bonus: Solve the second sub-problem, i.e., develop and implement an algorithm for generating association rules between frequent itemsets discovered by using the A-Priori algorithm in a dataset of sales transactions. The rules must have support at least s and confidence at least c, where s and c are given as input parameters.

### Generation of all possible combinations of frequent itemsets that we got from Q1
To generate the association rules between the frequent itemsets from the previous exercise, we first create a list of all frequent itemsets of size larger than one. Next, for each such itemset I, we enumerate its subsets, excluding the empty set and I itself, so that each remaining subset A can serve as the left-hand side of a rule A → I\A.
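
A minimal sketch of the strict A → I\A reading of this step is given below. The helper `split_rules` is hypothetical; the notebook's own function is written differently and pairs each left-hand side with every disjoint proper subset of I, not only with I\A.

```python
from itertools import combinations

def split_rules(itemset):
    """Yield (lhs, rhs) pairs where lhs is a non-empty proper subset of
    `itemset` and rhs holds the remaining items."""
    items = sorted(itemset)
    for r in range(1, len(items)):  # sizes 1 .. len-1: skips the empty set and the itemset itself
        for lhs in combinations(items, r):
            rhs = tuple(i for i in items if i not in lhs)
            yield lhs, rhs

example_rules = list(split_rules(("a", "b", "c")))
for lhs, rhs in example_rules:
    print(lhs, "->", rhs)
```

An itemset of size n yields 2^n - 2 rules this way, so the three-item example prints six rules.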

In [14]:
#association rule for itemsets of size >=2
fr = []
for f in frequent_items:
    if len(f)>1:
        fr.append(f)     

In [15]:
from itertools import chain, combinations
from copy import deepcopy

# For every non-empty proper subset A of a frequent itemset I, pair A with each proper subset of I that shares no item with A (this includes I\A)
def association_rules(frequents):
    lhs_rhs = []

    for itemset in frequents: # all subsets of itemset
        r = chain.from_iterable(combinations(itemset, r) for r in range(len(itemset)+1))
        final_r = []
        
        for com in list(r):
            if len(com)!=0 and len(com)!=len(itemset):
                final_r.append(com)   #all subsets of all frequent itemsets
        
        for A in final_r:
            remaining = set(final_r)-{A}
            temp = deepcopy(remaining)
            for rem in remaining:
                for a in A:
                    if {a}.issubset(rem):
                        if rem in temp:
                            temp.remove(rem)  #so that i won't have eg a->a,b
                            
            for rhs in temp:
                if [A,rhs] not in lhs_rhs:
                    lhs_rhs.append([A,rhs]) #pairs lhs,rhs-->if not already present so that we won't take same
                                                            #association rule twice (set was ruining order)
    return lhs_rhs



In [16]:
rules = association_rules(fr)

In [17]:
rules

[[('354',), ('684',)],
 [('684',), ('354',)],
 [('217',), ('722',)],
 [('722',), ('217',)],
 [('368',), ('494',)],
 [('494',), ('368',)],
 [('722',), ('829',)],
 [('829',), ('722',)],
 [('684',), ('419',)],
 [('419',), ('684',)],
 [('529',), ('766',)],
 [('766',), ('529',)],
 [('354',), ('368',)],
 [('368',), ('354',)],
 [('529',), ('494',)],
 [('494',), ('529',)],
 [('529',), ('722',)],
 [('722',), ('529',)],
 [('766',), ('368',)],
 [('368',), ('766',)],
 [('217',), ('419',)],
 [('419',), ('217',)],
 [('529',), ('217',)],
 [('217',), ('529',)],
 [('368',), ('419',)],
 [('419',), ('368',)],
 [('684',), ('368',)],
 [('368',), ('684',)],
 [('354',), ('829',)],
 [('829',), ('354',)],
 [('829',), ('494',)],
 [('494',), ('829',)],
 [('722',), ('766',)],
 [('766',), ('722',)],
 [('766',), ('829',)],
 [('829',), ('766',)],
 [('722',), ('494',)],
 [('494',), ('722',)],
 [('529',), ('419',)],
 [('419',), ('529',)],
 [('684',), ('829',)],
 [('829',), ('684',)],
 [('217',), ('368',)],
 [('368',),

The next step is to calculate the support and confidence of each association rule. Recall that an association rule is an implication X → Y, where X and Y are itemsets such that X∩Y=∅. The support of the rule X → Y is the number of transactions that contain X∪Y. The confidence of the rule X → Y is the fraction of transactions containing X∪Y among all transactions that contain X (source: Canvas).
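
Written as formulas, with D denoting the set of all transactions (a symbol introduced here only for illustration), the two definitions read:

$$
\operatorname{support}(X \to Y) = \left|\{\, T \in D : X \cup Y \subseteq T \,\}\right|,
\qquad
\operatorname{confidence}(X \to Y) = \frac{\operatorname{support}(X \cup Y)}{\operatorname{support}(X)}
$$

For instance, the outputs recorded further down report support 1194 and confidence 0.17533 for the rule ('829',) → ('368',); the confidence is 1194 divided by the support of ('829',).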

The user supplies two parameters, min_c and min_support, where min_c is the threshold for the confidence and min_support is the threshold for the support of the association rule. 
The total_dict acts as our look-up dictionary; it is built by merging the lookup list generated in the previous exercise. It has the itemset as key and the frequency of the itemset as value. 

So, each association rule is of the format lhs → rhs. We check the support of lhs∪rhs and the support of lhs, because the confidence of lhs → rhs is support(lhs∪rhs)/support(lhs). 
In the end, two dictionaries are generated:
1. association_rules_at_least_c contains all the association rules and confidences where confidence is at least min_c.
2. association_rules_at_least_s contains all the association rules and supports where support is at least min_support.

Lastly, association_rules_at_least_c_s is a set containing all the association rules which have confidence at least min_c and support at least min_support. 
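
The computation for a single rule can be sketched as follows. The function `rule_metrics` and the toy look-up values are assumptions made for the example; the notebook itself performs the same steps inline in the next cell.

```python
def rule_metrics(lhs, rhs, support_lookup):
    """Return (support, confidence) of the rule lhs -> rhs.

    support_lookup maps frozenset itemsets to their support counts.
    """
    union = frozenset(lhs) | frozenset(rhs)
    support_union = support_lookup[union]
    confidence = support_union / support_lookup[frozenset(lhs)]
    return support_union, confidence

# Hypothetical look-up dictionary
toy_lookup = {
    frozenset({"a"}): 3,
    frozenset({"b"}): 3,
    frozenset({"a", "b"}): 2,
}

s, c = rule_metrics(("a",), ("b",), toy_lookup)
print(s, round(c, 5))  # 2 0.66667

# A rule is kept only if it reaches both thresholds (toy values)
keep = s >= 2 and c >= 0.5
```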

In [18]:
# Calculate support and confidence for each rule
min_c = 0.15
min_support = 500

total_dict = {k: v for d in lookup for k, v in d.items()} # Create one dictionary as look-up
confidences = {}
supports = {}
for rule in rules:
    lhs = rule[0]
    rhs = rule[1]
    union_lhs_rhs = frozenset(lhs+rhs) # frozenset already ignores order and duplicates
    support_lhs = total_dict[frozenset(lhs)]
    support_union = total_dict[union_lhs_rhs]
    confidence = support_union/support_lhs
    confidences[(lhs,rhs)] = round(confidence,5)
    supports[(lhs,rhs)] = round(support_union,5)
    
association_rules_at_least_c = {j[0]:j[1] for j in confidences.items() if j[1]>=min_c}
association_rules_at_least_s = {j[0]:j[1] for j in supports.items() if j[1]>=min_support}
association_rules_at_least_c_s = association_rules_at_least_c.keys() & association_rules_at_least_s.keys()

In [19]:
association_rules_at_least_c

{(('494',), ('368',)): 0.16856,
 (('829',), ('368',)): 0.17533,
 (('368',), ('829',)): 0.15253,
 (('529', '494'), ('368',)): 0.18222,
 (('494', '829'), ('368',)): 0.33708,
 (('494', '419'), ('368',)): 0.1875,
 (('684', '494'), ('368',)): 0.16827,
 (('494', '529'), ('368',)): 0.18222,
 (('722', '494'), ('368',)): 0.15929,
 (('766', '829'), ('368',)): 0.21184,
 (('684', '368'), ('829',)): 0.29974,
 (('684', '829'), ('368',)): 0.33238,
 (('829', '494'), ('368',)): 0.33708,
 (('354', '829'), ('368',)): 0.18147,
 (('494', '354'), ('368',)): 0.17989,
 (('829', '684'), ('368',)): 0.33238,
 (('368', '684'), ('829',)): 0.29974,
 (('354', '494'), ('368',)): 0.17989,
 (('217', '829'), ('368',)): 0.17091,
 (('217', '368'), ('829',)): 0.15512,
 (('494', '217'), ('368',)): 0.19672,
 (('217', '494'), ('368',)): 0.19672,
 (('419', '829'), ('368',)): 0.16988,
 (('494', '722'), ('368',)): 0.15929,
 (('829', '419'), ('368',)): 0.16988,
 (('722', '829'), ('368',)): 0.15646,
 (('829', '354'), ('368',)): 0.

In [20]:
association_rules_at_least_s

{(('368',), ('494',)): 860,
 (('494',), ('368',)): 860,
 (('766',), ('368',)): 504,
 (('368',), ('766',)): 504,
 (('354',), ('722',)): 566,
 (('722',), ('354',)): 566,
 (('529',), ('368',)): 640,
 (('368',), ('529',)): 640,
 (('684',), ('766',)): 613,
 (('766',), ('684',)): 613,
 (('829',), ('368',)): 1194,
 (('368',), ('829',)): 1194,
 (('529',), ('829',)): 584,
 (('829',), ('529',)): 584}

In [21]:
association_rules_at_least_c_s

{(('368',), ('829',)), (('494',), ('368',)), (('829',), ('368',))}