# Discovery of Frequent Itemsets and Association Rules

The problem of discovering association rules between itemsets in a sales transaction database (a set of baskets) includes the following two sub-problems:

1. Finding frequent itemsets with support at least s;
2. Generating association rules with confidence at least c from the itemsets found in the first step.

Recall that an association rule is an implication X → Y, where X and Y are itemsets such that X ∩ Y = ∅. The support of the rule X → Y is the number of transactions that contain X ∪ Y. The confidence of the rule X → Y is the fraction of transactions containing X ∪ Y among all transactions that contain X.
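
The two definitions can be checked by hand on a tiny example. The sketch below uses made-up baskets and two hypothetical helper names (`support`, `confidence`) that are not part of the assignment code.

```python
# Toy baskets, invented for illustration only
toy_baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(lhs, rhs, baskets):
    """Fraction of the baskets containing `lhs` that also contain `rhs`."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)

# Support of the rule {bread} -> {milk} is the support of {bread, milk}
rule_support = support({"bread", "milk"}, toy_baskets)
# Confidence is support({bread, milk}) / support({bread})
rule_confidence = confidence({"bread"}, {"milk"}, toy_baskets)
print(rule_support, rule_confidence)
```

Here two baskets contain both bread and milk and three contain bread, so the rule has support 2 and confidence 2/3.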

#### Exercise 1
You are to solve the first sub-problem: implement the A-Priori algorithm for finding frequent itemsets with support at least s in a dataset of sales transactions. Recall that the support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses it to discover frequent itemsets with support at least s in a given dataset of sales transactions, which consists of generated transactions (baskets) of hashed items (see Canvas).
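
As a reference point, the level-wise search could be sketched as follows. This is a minimal textbook-style variant (join, prune, count), not the implementation used in the cells below; the function name `apriori` and the toy data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def apriori(baskets, s):
    """Return {frozenset(itemset): support} for all itemsets with support >= s."""
    baskets = [frozenset(b) for b in baskets]
    # Pass 1: count single items
    counts = Counter(item for b in baskets for item in b)
    frequent = {frozenset([i]): c for i, c in counts.items() if c >= s}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join: unite pairs of frequent (k-1)-itemsets whose union has k items
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}
        # Count: one scan over the baskets for the surviving candidates
        counts = Counter(c for b in baskets for c in candidates if c <= b)
        frequent = {c: n for c, n in counts.items() if n >= s}
        result.update(frequent)
        k += 1
    return result

# Toy baskets, invented for illustration only
toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
frequent_itemsets = apriori(toy, s=3)
```

Storing itemsets as `frozenset` keys means each itemset appears once regardless of item order, and the prune step keeps the candidate set small before the expensive counting pass.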

In [1]:
with open("T10I4D100K.dat") as f:  # One basket of whitespace-separated item IDs per line
    baskets = [line.split() for line in f]

In [2]:
# Dictionary with transaction ID (starting at 1) as key, and basket as value
transactions = dict(enumerate(baskets, start=1))

In [3]:
items = set() # Set of items from all baskets
for i in transactions.values():
    for j in i:
        items.add(j) 

In [4]:
# Count the support (number of baskets containing it) of each candidate itemset
def freq(k, items, transactions):
    items_counts = dict()  # Maps each candidate to its support
    for i in items:
        # A 1-itemset is a bare item ID (string); larger itemsets are tuples of IDs
        temp_i = {i} if k == 1 else set(i)
        for basket in transactions.values():
            if temp_i.issubset(basket):  # Every item of the candidate is in the basket
                items_counts[i] = items_counts.get(i, 0) + 1
    return items_counts

In [5]:
# Minimum support threshold for single items
s_min = 5000
L1 = [{j[0]:j[1] for j in freq(1,items, transactions).items() if j[1]>=s_min}]

In [7]:
from itertools import combinations

# Candidates of size k, built by extending each frequent (k-1)-itemset with one frequent item from L1
def C_k(k, prev_freq):
    print(f"Calculating candidates of size {k}...")
    if k == 2:
        # Every pair of frequent single items is a candidate
        return list(combinations(prev_freq[0].keys(), 2))
    cand = []
    for i in prev_freq[0].keys():
        temp = set(i)
        for j in L1[0].keys():  # L1 is the global list holding the frequent single items
            if j not in temp:
                # Note: the same itemset can be produced several times, in different item orders
                cand.append(tuple(temp.union({j})))
    return cand
cand2 = C_k(2,L1)


Calculating candidates of size 2...


In [8]:
def L_k(k, candidates, threshold):
    print(f"Calculating frequent items of size {k}")
    Lk = [{j[0]:j[1] for j in freq(k,candidates, transactions).items() if j[1]>=threshold}]
    return Lk
L2 = L_k(2, cand2, 100)

Calculating frequent items of size 2


In [9]:
cand3 = C_k(3,L2)

Calculating candidates of size 3...


In [10]:
L3 = L_k(3, cand3, 100)

Calculating frequent items of size 3


In [11]:
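# Look for frequent itemsets of increasing size until none are found

result = [] # One dict per itemset size, mapping itemset -> support
lookup = [] # Same data keyed by frozenset, so that item order does not matter in later look-ups
size = 1
frequent_items = [] # Frequent itemsets as tuples, without their support
s_min = 10 # Support threshold for itemsets of size 2 and larger
L1 = [{j[0]:j[1] for j in freq(1,items, transactions).items() if j[1]>=5000}] # Single items need support of at least 5000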
result.append(L1[0])
lookup.append({frozenset([k]): v for k, v in L1[0].items()})
#frequent_items.extend(list(L1[0].keys()))
for x in list(L1[0].keys()):
    frequent_items.append(tuple({x}))
prev_freq = L1
while True: 
    size+=1
    candidates = C_k(size,prev_freq)
    frequents = L_k(size,candidates,s_min)
    prev_freq = frequents
    if len(frequents[0])!=0:
        frequent_items.extend(list(frequents[0].keys()))
        result.append(frequents[0])
        lookup.append({frozenset(k): v for k, v in prev_freq[0].items()})
    else:
        break


Calculating candidates of size 2...
Calculating frequent items of size 2
Calculating candidates of size 3...
Calculating frequent items of size 3
Calculating candidates of size 4...
Calculating frequent items of size 4
Calculating candidates of size 5...
Calculating frequent items of size 5


Optional task for extra bonus: Solve the second sub-problem, i.e., develop and implement an algorithm for generating association rules between frequent itemsets discovered by using the A-Priori algorithm in a dataset of sales transactions. The rules must have support at least s and confidence at least c, where s and c are given as input parameters.
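
One straightforward way to approach this sub-problem is to try every non-empty proper subset of each frequent itemset as a left-hand side. The sketch below does exactly that, without any confidence-based pruning; the function name `generate_rules` and the toy support table are assumptions made for illustration. Because only frequent itemsets appear in the table, every rule it returns already has support at least s.

```python
from itertools import combinations

def generate_rules(supports, c):
    """supports: {frozenset(itemset): support} holding every frequent itemset
    (and therefore every subset of one). Returns [(lhs, rhs, confidence), ...]."""
    rules = []
    for itemset, supp in supports.items():
        if len(itemset) < 2:
            continue  # A rule needs at least one item on each side
        # Try every non-empty proper subset of the itemset as a left-hand side
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = supp / supports[lhs]
                if conf >= c:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

# Hypothetical support counts, invented for illustration only
toy_supports = {
    frozenset("a"): 4,
    frozenset("b"): 4,
    frozenset("ab"): 3,
}
toy_rules = generate_rules(toy_supports, c=0.7)
```

With these counts both a → b and b → a have confidence 3/4 and pass the 0.7 threshold.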

In [12]:
# For every non-empty proper subset A of a frequent itemset I, a candidate rule is A -> I\A
# Since conf(ABC → D) ≥ conf(AB → CD) ≥ conf(A → BCD), start with the rules that have the largest lhs

# Suppose the largest frequent itemset found is of size k=4
# Then, we start with all rules with 3 items on lhs and 1 item on rhs i.e. (3)-->(1) [size=4] 
# From (3)-->(1), we go to (2)-->(1) [size 3] we go to (1)-->(1) [size 2]

# Then from (3)-->(1), we go to (2)-->(2) [size=4] and (1)-->(3) [size=4]
# From (2)-->(2) we go to (1)-->(2) [size 3] 
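
The confidence inequality used in the comments above follows from the fact that adding items to an itemset can never increase its support. Writing supp for the support count, all three rules share the same numerator, and only the denominator grows as the left-hand side shrinks:

```latex
\[
\mathrm{conf}(ABC \to D) = \frac{\mathrm{supp}(ABCD)}{\mathrm{supp}(ABC)}
\;\ge\; \frac{\mathrm{supp}(ABCD)}{\mathrm{supp}(AB)} = \mathrm{conf}(AB \to CD)
\;\ge\; \frac{\mathrm{supp}(ABCD)}{\mathrm{supp}(A)} = \mathrm{conf}(A \to BCD),
\qquad \text{because } \mathrm{supp}(ABC) \le \mathrm{supp}(AB) \le \mathrm{supp}(A).
\]
```

So if a rule with a large left-hand side already fails the confidence threshold, every rule obtained by moving items from its left-hand side to its right-hand side fails as well.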

In [13]:
result

[{'684': 5408,
  '217': 5375,
  '368': 7828,
  '529': 7057,
  '829': 6810,
  '419': 5057,
  '722': 5845,
  '766': 6265,
  '494': 5102,
  '354': 5835},
 {('419', '368'): 355,
  ('419', '684'): 155,
  ('419', '829'): 259,
  ('419', '529'): 252,
  ('419', '354'): 263,
  ('419', '217'): 344,
  ('419', '766'): 238,
  ('419', '722'): 366,
  ('419', '494'): 176,
  ('368', '684'): 387,
  ('368', '829'): 1194,
  ('368', '529'): 640,
  ('368', '354'): 319,
  ('368', '217'): 303,
  ('368', '766'): 504,
  ('368', '722'): 392,
  ('368', '494'): 860,
  ('684', '829'): 349,
  ('684', '529'): 334,
  ('684', '354'): 219,
  ('684', '217'): 198,
  ('684', '766'): 613,
  ('684', '722'): 443,
  ('684', '494'): 208,
  ('829', '529'): 584,
  ('829', '354'): 259,
  ('829', '217'): 275,
  ('829', '766'): 321,
  ('829', '722'): 294,
  ('829', '494'): 267,
  ('529', '354'): 301,
  ('529', '217'): 403,
  ('529', '766'): 317,
  ('529', '722'): 283,
  ('529', '494'): 225,
  ('354', '217'): 280,
  ('354', '766'): 32

In [14]:
def association_rules(itemsets):
    rules = []
    for itemset in itemsets: # First generate rules for the largest itemsets
        rule = {}
        for i in range(len(itemset)):
            rhs = itemset[i] # For 4-itemset, we check (3) --> (1)
            lhs = set(itemset) - {rhs}
            rule[tuple(lhs)] = rhs
        rules.append(rule)
    return rules

In [15]:
def calculate_confidence(min_c, rules):
    # Relies on the global total_dict, which maps frozenset(itemset) -> support
    confidences = {}
    for rule in rules:
        for lhs, rhs in rule.items():
            support = total_dict[frozenset(lhs)]  # Support of the left-hand side
            union_lhs_rhs = frozenset(lhs) | {rhs}
            if union_lhs_rhs in total_dict:
                # conf(lhs -> rhs) = support(lhs ∪ rhs) / support(lhs)
                confidence = total_dict[union_lhs_rhs] / support
                confidences[(lhs, rhs)] = round(confidence, 3)
            else:
                print("Not in dictionary")
    # Keep only the rules whose confidence reaches the threshold
    return {rule: conf for rule, conf in confidences.items() if conf >= min_c}

In [16]:
# Suppose the largest frequent itemset found is of size k=4
# Then, we start with all rules with 3 items on lhs and 1 item on rhs i.e. (3)-->(1) [size=4] 
# From (3)-->(1), we go to (2)-->(1) [size 3] we go to (1)-->(1) [size 2]

all_filtered_rules = []
min_c = 0.3

total_dict = {k: v for d in lookup for k, v in d.items()} # Create one dictionary as look-up
# Start with the largest lhs
largest_lhs = list(result[-1].keys())
# Generate association rules for the largest lhs
sub_rules = association_rules(largest_lhs) # For 4-itemset, we check (3)-->(1)
# Calculate confidence and filter out
filtered_rules = calculate_confidence(min_c,sub_rules)
all_filtered_rules.append(filtered_rules)

# Generate new rules based on the non-filtered-out rules
sub = [rule[0] for rule in filtered_rules.keys()]
while len(sub)>0 and len(sub[0])>1:
    rules_sub = association_rules(sub) # For 4-itemset, we checked (3)-->(1), now (2)-->(1), and (1)-->(1)
    filtered_rules = calculate_confidence(min_c,rules_sub)
    all_filtered_rules.append(filtered_rules)
    sub = [rule[0] for rule in filtered_rules.keys()]

association_rules_filtered = {k: v for d in all_filtered_rules for k, v in d.items()}

In [17]:
all_filtered_rules

[{(('419', '494', '217'), '368'): 0.667,
  (('419', '494', '722'), '368'): 0.333,
  (('684', '368', '217'), '829'): 0.545,
  (('684', '829', '217'), '368'): 1.5,
  (('829', '368', '529'), '684'): 0.32,
  (('684', '368', '529'), '829'): 0.727,
  (('684', '829', '529'), '368'): 1.091,
  (('684', '529', '766'), '368'): 0.3,
  (('684', '368', '529'), '766'): 0.364,
  (('684', '829', '722'), '368'): 1.105,
  (('684', '829', '766'), '368'): 0.778,
  (('684', '829', '494'), '368'): 1.5,
  (('684', '494', '722'), '368'): 0.381,
  (('684', '368', '722'), '494'): 0.333,
  (('829', '684', '217'), '368'): 1.5,
  (('829', '684', '529'), '368'): 1.091,
  (('829', '684', '722'), '368'): 1.105,
  (('829', '684', '766'), '368'): 0.778,
  (('829', '494', '684'), '368'): 1.5,
  (('829', '217', '766'), '368'): 0.667,
  (('829', '529', '766'), '368'): 0.476,
  (('368', '722', '766'), '829'): 0.421,
  (('829', '722', '766'), '368'): 0.533,
  (('829', '494', '722'), '368'): 0.485,
  (('529', '684', '766'), '

In [18]:
association_rules_filtered

{(('419', '494', '217'), '368'): 0.667,
 (('419', '494', '722'), '368'): 0.333,
 (('684', '368', '217'), '829'): 0.545,
 (('684', '829', '217'), '368'): 1.5,
 (('829', '368', '529'), '684'): 0.32,
 (('684', '368', '529'), '829'): 0.727,
 (('684', '829', '529'), '368'): 1.091,
 (('684', '529', '766'), '368'): 0.3,
 (('684', '368', '529'), '766'): 0.364,
 (('684', '829', '722'), '368'): 1.105,
 (('684', '829', '766'), '368'): 0.778,
 (('684', '829', '494'), '368'): 1.5,
 (('684', '494', '722'), '368'): 0.381,
 (('684', '368', '722'), '494'): 0.333,
 (('829', '684', '217'), '368'): 1.5,
 (('829', '684', '529'), '368'): 1.091,
 (('829', '684', '722'), '368'): 1.105,
 (('829', '684', '766'), '368'): 0.778,
 (('829', '494', '684'), '368'): 1.5,
 (('829', '217', '766'), '368'): 0.667,
 (('829', '529', '766'), '368'): 0.476,
 (('368', '722', '766'), '829'): 0.421,
 (('829', '722', '766'), '368'): 0.533,
 (('829', '494', '722'), '368'): 0.485,
 (('529', '684', '766'), '368'): 0.3,
 (('529', '36

In [19]:
total_dict[frozenset(('684', '829', '722'))]

19

In [20]:
total_dict[frozenset(('684', '829', '722', '368'))]

21
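
An itemset can be contained in at most as many baskets as any of its subsets, so a confidence above 1 can only arise when the support table breaks this property. A small check along the following lines could be run on a table shaped like `total_dict`; the function name and the toy table are assumptions made for illustration.

```python
from itertools import combinations

def monotonicity_violations(supports):
    """Return (subset, superset) pairs where the superset has the larger support.
    supports: {frozenset(itemset): support}."""
    violations = []
    for itemset, supp in supports.items():
        # Compare against every subset obtained by dropping one item
        for sub in combinations(itemset, len(itemset) - 1):
            sub = frozenset(sub)
            if sub in supports and supports[sub] < supp:
                violations.append((sub, itemset))
    return violations

# Hypothetical table, invented for illustration only
toy_table = {frozenset("ab"): 5, frozenset("abc"): 7, frozenset("bc"): 9}
toy_violations = monotonicity_violations(toy_table)
print(toy_violations)
```

In the toy table only the pair ({a, b}, {a, b, c}) is reported, since 5 < 7. Calling `monotonicity_violations(total_dict)` would list the corresponding pairs for the notebook's own table.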

In [21]:
# Then from (3)-->(1), we go to (2)-->(2) [size=4] and (1)-->(3) [size=4]
# From (2)-->(2) we go to (1)-->(2) [size 3] 

all_filtered_rules = []
min_c = 0.8

total_dict = {k: v for d in lookup for k, v in d.items()} # Create one dictionary as look-up
# Start with the largest lhs
largest_lhs = list(result[-1].keys())
# Generate association rules for the largest lhs
sub_rules = association_rules(largest_lhs) # For 4-itemset, we check (3)-->(1)
# Calculate confidence and filter out
filtered_rules = calculate_confidence(min_c,sub_rules)
print(filtered_rules)
all_filtered_rules.append(filtered_rules)

{(('684', '829', '217'), '368'): 1.5, (('684', '829', '529'), '368'): 1.091, (('684', '829', '722'), '368'): 1.105, (('684', '829', '494'), '368'): 1.5, (('829', '684', '217'), '368'): 1.5, (('829', '684', '529'), '368'): 1.091, (('829', '684', '722'), '368'): 1.105, (('829', '494', '684'), '368'): 1.5, (('494', '722', '354'), '368'): 1.2}


In [22]:
rules_2_to_2 = {}
for rule in filtered_rules.keys():
    lhs = rule[0]
    rhs = rule[1]
    for i in range(len(lhs)):
        new_lhs = set(lhs) - {lhs[i]}
        new_rhs = set([rhs]).union({lhs[i]})
        rules_2_to_2[tuple(new_lhs)] = tuple(new_rhs)
rules_2_to_2

{('829', '217'): ('684', '368'),
 ('684', '217'): ('829', '368'),
 ('684', '829'): ('368', '494'),
 ('829', '529'): ('684', '368'),
 ('684', '529'): ('829', '368'),
 ('829', '722'): ('684', '368'),
 ('684', '722'): ('829', '368'),
 ('829', '494'): ('684', '368'),
 ('684', '494'): ('829', '368'),
 ('829', '684'): ('368', '494'),
 ('722', '354'): ('368', '494'),
 ('494', '354'): ('368', '722'),
 ('494', '722'): ('368', '354')}

In [23]:
min_c = 0.05
confidences_2_to_2 = {}
for lhs in rules_2_to_2:
    support = total_dict[frozenset(lhs)]
    rhs = rules_2_to_2[lhs]
    union_lhs_rhs = frozenset(tuple(lhs+rhs))
    if union_lhs_rhs in total_dict:
        support_union = total_dict[union_lhs_rhs]
        confidence = support_union/support
        confidences_2_to_2[(lhs,rhs)] = round(confidence,3)
    else: 
        print("Not in dictionary")
association_rules_at_least_c = {j[0]:j[1] for j in confidences_2_to_2.items() if j[1]>=min_c}
association_rules_at_least_c

{(('829', '217'), ('684', '368')): 0.065,
 (('684', '217'), ('829', '368')): 0.091,
 (('684', '829'), ('368', '494')): 0.06,
 (('684', '529'), ('829', '368')): 0.072,
 (('829', '722'), ('684', '368')): 0.071,
 (('829', '494'), ('684', '368')): 0.079,
 (('684', '494'), ('829', '368')): 0.101,
 (('829', '684'), ('368', '494')): 0.06,
 (('494', '354'), ('368', '722')): 0.063,
 (('494', '722'), ('368', '354')): 0.053}

In [24]:
# Then from (3)-->(1), we go to (2)-->(2) [size=4] and (1)-->(3) [size=4]
rules_1_to_3 = {}
for rule in association_rules_at_least_c.keys():
    lhs = rule[0]
    rhs = rule[1]
    for i in range(len(lhs)):
        new_lhs = set(lhs) - {lhs[i]}
        new_rhs = set(rhs).union({lhs[i]})
        rules_1_to_3[tuple(new_lhs)] = tuple(new_rhs)
rules_1_to_3

{('217',): ('829', '368', '684'),
 ('829',): ('684', '368', '494'),
 ('684',): ('829', '368', '494'),
 ('529',): ('829', '368', '684'),
 ('722',): ('368', '494', '354'),
 ('494',): ('368', '722', '354'),
 ('354',): ('368', '494', '722')}

In [25]:
min_c = 0
confidences_1_to_3 = {}
for lhs in rules_1_to_3:
    support = total_dict[frozenset(lhs)]
    rhs = rules_1_to_3[lhs]
    union_lhs_rhs = frozenset(tuple(lhs+rhs))
    if union_lhs_rhs in total_dict:
        support_union = total_dict[union_lhs_rhs]
        confidence = support_union/support
        confidences_1_to_3[(lhs,rhs)] = round(confidence,3)
    else: 
        print("Not in dictionary")
association_rules_at_least_c = {j[0]:j[1] for j in confidences_1_to_3.items() if j[1]>=min_c}
association_rules_at_least_c

{(('217',), ('829', '368', '684')): 0.003,
 (('829',), ('684', '368', '494')): 0.003,
 (('684',), ('829', '368', '494')): 0.004,
 (('529',), ('829', '368', '684')): 0.003,
 (('722',), ('368', '494', '354')): 0.002,
 (('494',), ('368', '722', '354')): 0.002,
 (('354',), ('368', '494', '722')): 0.002}

In [26]:
min_c = 0.05
confidences_2_to_2 = {}
for lhs in rules_2_to_2:
    support = total_dict[frozenset(lhs)]
    rhs = rules_2_to_2[lhs]
    union_lhs_rhs = frozenset(tuple(lhs+rhs))
    if union_lhs_rhs in total_dict:
        support_union = total_dict[union_lhs_rhs]
        confidence = support_union/support
        confidences_2_to_2[(lhs,rhs)] = round(confidence,3)
    else: 
        print("Not in dictionary")
association_rules_at_least_c = {j[0]:j[1] for j in confidences_2_to_2.items() if j[1]>=min_c}
association_rules_at_least_c

{(('829', '217'), ('684', '368')): 0.065,
 (('684', '217'), ('829', '368')): 0.091,
 (('684', '829'), ('368', '494')): 0.06,
 (('684', '529'), ('829', '368')): 0.072,
 (('829', '722'), ('684', '368')): 0.071,
 (('829', '494'), ('684', '368')): 0.079,
 (('684', '494'), ('829', '368')): 0.101,
 (('829', '684'), ('368', '494')): 0.06,
 (('494', '354'), ('368', '722')): 0.063,
 (('494', '722'), ('368', '354')): 0.053}

In [27]:
# From (2)-->(2) we go to (1)-->(2) [size 3] 
rules_1_to_2 = {}
for rule in association_rules_at_least_c.keys():
    lhs = rule[0]
    rhs = rule[1]
    for i in range(len(lhs)):
        new_lhs = set(lhs) - {lhs[i]}
        new_rhs = set(rhs)
        rules_1_to_2[tuple(new_lhs)] = tuple(new_rhs)
rules_1_to_2

{('217',): ('829', '368'),
 ('829',): ('368', '494'),
 ('684',): ('368', '494'),
 ('529',): ('829', '368'),
 ('722',): ('368', '354'),
 ('494',): ('368', '354'),
 ('354',): ('368', '722')}

In [28]:
min_c = 0
confidences_1_to_2 = {}
for lhs in rules_1_to_2:
    support = total_dict[frozenset(lhs)]
    rhs = rules_1_to_2[lhs]
    union_lhs_rhs = frozenset(tuple(lhs+rhs))
    if union_lhs_rhs in total_dict:
        support_union = total_dict[union_lhs_rhs]
        confidence = support_union/support
        confidences_1_to_2[(lhs,rhs)] = round(confidence,3)
    else: 
        print("Not in dictionary")
association_rules_at_least_c = {j[0]:j[1] for j in confidences_1_to_2.items() if j[1]>=min_c}
association_rules_at_least_c

{(('217',), ('829', '368')): 0.026,
 (('829',), ('368', '494')): 0.04,
 (('684',), ('368', '494')): 0.019,
 (('529',), ('829', '368')): 0.011,
 (('722',), ('368', '354')): 0.018,
 (('494',), ('368', '354')): 0.02,
 (('354',), ('368', '722')): 0.018}