# Discovery of Frequent Itemsets and Association Rules

The problem of discovering association rules between itemsets in a sales transaction database (a set of baskets) includes the following two sub-problems:

1. Finding frequent itemsets with support at least s;
2. Generating association rules with confidence at least c from the itemsets found in the first step.
Remind that an association rule is an implication X → Y, where X and Y are itemsets such that X∩Y=∅. Support of the rule X → Y is the number of transactions that contain X⋃Y. Confidence of the rule X → Y the fraction of transactions containing X⋃Y in all transactions that contain X.

### Question 1
You are to solve the first sub-problem: to implement the A-Priori algorithm for finding frequent itemsets with support at least s in a dataset of sales transactions. Remind that support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses your A-Priori algorithm implementation to discover frequent itemsets with support at least s in a given dataset of sales transactions including generated transactions (baskets) of hashed items (see Canvas).

In [235]:
baskets = [i.strip().split() for i in open("T10I4D100K.dat").readlines()]

In [236]:
transactions = {} # Dictionary with basket ID as key, and basket as value
count = 0
for basket in baskets:
    count += 1
    transactions[count] = basket

In [237]:
items = set() # Set of items from all baskets
for i in transactions.values():
    for j in i:
        items.add(j) 

In [238]:
# Count the frequency of each item
def freq(k, items, transactions):
    items_counts = dict() # Dictionary of item and its frequency
    for item in items:
        if k == 1:
            temp_i = {item}
        else:
            temp_i = set(item)
            
        for transaction in transactions.items(): # and basket
            if temp_i.issubset(set(transaction[1])): # if item is in basket
                if item in items_counts:
                    items_counts[item] += 1 # if already spotted/already in item-freq dict, add 1 to count
                else:
                    items_counts[item] = 1 # if not spotted yet, set count to 1
    return items_counts

In [239]:
from itertools import combinations

# Candidates of len-k which are generated by combining itemsets from L_k-1 and L_1
def C_k(k, d):
    cand = []
    for i in d.keys():
        if k-1 == 1:
            temp = {i}
            combs = combinations(list(temp.union(set(d.keys()))), k) 
            cand = list(combs)

        else:
            temp = set(i)
            for j in d.keys():
                if len(temp.union({j}))==k:
                    cand.append(tuple(temp.union({j})))
    return cand

In [285]:
import time
def generate_freq(s_min):
    results = []
    start_time = time.time()
    
    size = 1
    print(f"Checking for k={size}...")
    support_dict = freq(1, items, transactions) #1-itemset
    L = {j[0]:j[1] for j in support_dict.items() if j[1]>=s_min}  #change constant to s_min
    results.append(L)
    prev_freq = L
    
    while True: 
        size+=1
        print(f"Checking for k={size}...")
        candidates = C_k(size, prev_freq)
        support_dict_k = freq(size, candidates, transactions)
        prev_freq = {j[0]:j[1] for j in support_dict_k.items() if j[1]>=s_min} #change constant to s_min
        if len(prev_freq)!=0:
            results.append(prev_freq)
        else: # empty k-itemset found
            break
            
    print("--- %s seconds ---" % (time.time() - start_time))
    
    return results

In [286]:
results = generate_freq(100)

Checking for k=1...
Checking for k=2...


KeyboardInterrupt: 

In [None]:
# Combine into one dictionary
total_dict = {k: v for d in results for k, v in d.items()}
total_dict

In [249]:
freq_itemsets = list(total_dict.keys())
freq_itemsets

['368',
 '829',
 '419',
 '217',
 '766',
 '684',
 '529',
 '354',
 '494',
 '722',
 ('529', '368'),
 ('529', '722'),
 ('529', '766'),
 ('529', '217'),
 ('529', '494'),
 ('529', '684'),
 ('529', '419'),
 ('529', '829'),
 ('529', '354'),
 ('368', '722'),
 ('368', '766'),
 ('368', '217'),
 ('368', '494'),
 ('368', '684'),
 ('368', '419'),
 ('368', '829'),
 ('368', '354'),
 ('722', '766'),
 ('722', '217'),
 ('722', '494'),
 ('722', '684'),
 ('722', '419'),
 ('722', '829'),
 ('722', '354'),
 ('766', '217'),
 ('766', '494'),
 ('766', '684'),
 ('766', '419'),
 ('766', '829'),
 ('766', '354'),
 ('217', '494'),
 ('217', '684'),
 ('217', '419'),
 ('217', '829'),
 ('217', '354'),
 ('494', '684'),
 ('494', '419'),
 ('494', '829'),
 ('494', '354'),
 ('684', '419'),
 ('684', '829'),
 ('684', '354'),
 ('419', '829'),
 ('419', '354'),
 ('829', '354')]

### Question 2
Optional task for extra bonus: Solve the second sub-problem, i.e., develop and implement an algorithm for generating association rules between frequent itemsets discovered by using the A-Priori algorithm in a dataset of sales transactions. The rules must have support at least s and confidence at least c, where s and c are given as input parameters.

In [253]:
# For every subset A of frequent itemset I, rule is A -> I\A
def association_rules():
    rules = []
    for itemset in list(results[-1].keys()): # First generate rules for the largest itemsets
        rule = {}
        for i in range(len(itemset)):
            lhs = itemset[i]
            rhs = set(itemset) - {lhs}
            rule[lhs] = list(rhs)
        rules.append(rule)
    return rules

In [287]:
association_rules()

[{'529': ['368'], '368': ['529']},
 {'529': ['722'], '722': ['529']},
 {'529': ['766'], '766': ['529']},
 {'529': ['217'], '217': ['529']},
 {'529': ['494'], '494': ['529']},
 {'529': ['684'], '684': ['529']},
 {'529': ['419'], '419': ['529']},
 {'529': ['829'], '829': ['529']},
 {'529': ['354'], '354': ['529']},
 {'368': ['722'], '722': ['368']},
 {'368': ['766'], '766': ['368']},
 {'368': ['217'], '217': ['368']},
 {'368': ['494'], '494': ['368']},
 {'368': ['684'], '684': ['368']},
 {'368': ['419'], '419': ['368']},
 {'368': ['829'], '829': ['368']},
 {'368': ['354'], '354': ['368']},
 {'722': ['766'], '766': ['722']},
 {'722': ['217'], '217': ['722']},
 {'722': ['494'], '494': ['722']},
 {'722': ['684'], '684': ['722']},
 {'722': ['419'], '419': ['722']},
 {'722': ['829'], '829': ['722']},
 {'722': ['354'], '354': ['722']},
 {'766': ['217'], '217': ['766']},
 {'766': ['494'], '494': ['766']},
 {'766': ['684'], '684': ['766']},
 {'766': ['419'], '419': ['766']},
 {'766': ['829'], '8

In [256]:
# Combine into one dictionary
total_dict = {k: v for d in results for k, v in d.items()}

In [283]:
# To check support of the rule, check support of lhs
# To check confidence of the rule, divide {support of lhs} by {support of lhs and rhs combined}

def calculate_confidence(min_c):
    confidences = {}
    for i in range(len(rules)):
        rule = rules[i]
        for lhs,rhs in zip(list(rule.keys()),list(rule.values())):
            support = total_dict[lhs]
            if tuple([lhs])+tuple(rhs) in total_dict:
                confidence = total_dict[lhs]/total_dict[tuple([lhs])+tuple(rhs)]
            if tuple(rhs)+tuple([lhs]) in total_dict:
                confidence = total_dict[lhs]/total_dict[tuple(rhs)+tuple([lhs])]
            confidences[str(lhs)+"->"+str(rhs)] = round(confidence,3)
    association_rules_at_least_c = {j[0]:j[1] for j in confidences.items() if j[1]>=min_c}
    return association_rules_at_least_c
    

In [284]:
calculate_confidence(20)

{"529->['722']": 24.936,
 "722->['529']": 20.654,
 "529->['766']": 22.262,
 "529->['494']": 31.364,
 "494->['529']": 22.676,
 "529->['684']": 21.129,
 "529->['419']": 28.004,
 "419->['529']": 20.067,
 "529->['354']": 23.445,
 "368->['217']": 25.835,
 "368->['684']": 20.227,
 "368->['419']": 22.051,
 "368->['354']": 24.539,
 "722->['494']": 25.863,
 "494->['722']": 22.575,
 "829->['722']": 23.163,
 "766->['217']": 22.699,
 "766->['494']": 27.599,
 "494->['766']": 22.476,
 "766->['419']": 26.324,
 "419->['766']": 21.248,
 "829->['766']": 21.215,
 "217->['494']": 29.372,
 "494->['217']": 27.88,
 "217->['684']": 27.146,
 "684->['217']": 27.313,
 "829->['217']": 24.764,
 "354->['217']": 20.839,
 "494->['684']": 24.529,
 "684->['494']": 26.0,
 "494->['419']": 28.989,
 "419->['494']": 28.733,
 "829->['494']": 25.506,
 "494->['354']": 26.995,
 "354->['494']": 30.873,
 "684->['419']": 34.89,
 "419->['684']": 32.626,
 "684->['354']": 24.694,
 "354->['684']": 26.644,
 "829->['419']": 26.293,
 "35

### Try ouy for larger itemsets

In [288]:
freq_items_4 = list(combinations(['368','829','419','567','899'],4))
freq_items_4

[('368', '829', '419', '567'),
 ('368', '829', '419', '899'),
 ('368', '829', '567', '899'),
 ('368', '419', '567', '899'),
 ('829', '419', '567', '899')]

In [289]:
# Since conf(ABC → D) ≥ conf(AB →CD) ≥ conf(A → BCD), it's more efficient to find association rules with large lhs
def association_rules(itemsets):
    rules = []
    for itemset in itemsets: # First generate rules for the largest itemsets
        rule = {}
        for i in range(len(itemset)):
            rhs = itemset[i]
            lhs = set(itemset) - {rhs}
            rule[tuple(lhs)] = rhs
        rules.append(rule)
    return rules

In [290]:
association_rules(freq_items_4)

[{('829', '419', '567'): '368',
  ('419', '368', '567'): '829',
  ('829', '368', '567'): '419',
  ('829', '368', '419'): '567'},
 {('829', '899', '419'): '368',
  ('368', '899', '419'): '829',
  ('368', '829', '899'): '419',
  ('829', '368', '419'): '899'},
 {('829', '899', '567'): '368',
  ('368', '899', '567'): '829',
  ('368', '829', '899'): '567',
  ('829', '368', '567'): '899'},
 {('419', '899', '567'): '368',
  ('368', '899', '567'): '419',
  ('368', '899', '419'): '567',
  ('419', '368', '567'): '899'},
 {('419', '899', '567'): '829',
  ('829', '899', '567'): '419',
  ('829', '899', '419'): '567',
  ('829', '419', '567'): '899'}]