# Discovery of Frequent Itemsets and Association Rules

The problem of discovering association rules between itemsets in a sales transaction database (a set of baskets) includes the following two sub-problems:

1. Finding frequent itemsets with support at least s;
2. Generating association rules with confidence at least c from the itemsets found in the first step.

Recall that an association rule is an implication X → Y, where X and Y are itemsets such that X ∩ Y = ∅. The support of the rule X → Y is the number of transactions that contain X ∪ Y. The confidence of the rule X → Y is the fraction of the transactions containing X that also contain X ∪ Y.
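
As a small illustration of these two definitions, the sketch below computes a support count and a confidence on a handful of made-up baskets. The data and the helper names `support_count` and `confidence` are assumptions for illustration only; they are not part of the assignment or of the dataset.

```python
# Toy baskets (hypothetical data, not taken from T10I4D100K.dat)
toy_baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support_count(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(x, y, baskets):
    """support(X ∪ Y) / support(X) for the rule X -> Y."""
    return support_count(x | y, baskets) / support_count(x, baskets)

print(support_count({"bread", "milk"}, toy_baskets))  # 2
print(confidence({"bread"}, {"milk"}, toy_baskets))   # 2 of the 3 bread baskets
```

Here {bread, milk} appears in two baskets, and bread appears in three, so the rule bread → milk has support 2 and confidence 2/3.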

You are to solve the first sub-problem: implement the A-Priori algorithm for finding frequent itemsets with support at least s in a dataset of sales transactions. Recall that the support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses your A-Priori implementation to discover frequent itemsets with support at least s in a given dataset of sales transactions.

The sales transaction dataset includes generated transactions (baskets) of hashed items (see Canvas).

In [1]:
with open("T10I4D100K.dat") as f:  # Close the file once it has been read
    baskets = [line.split() for line in f]  # One basket (list of item IDs) per line
len(baskets)

100000

In [2]:
# Dictionary with transaction ID (1-based) as key and basket as value
transactions = dict(enumerate(baskets, start=1))


In [3]:
items = set() # Set of items from all baskets
for i in transactions.values():
    for j in i:
        items.add(j) 

In [4]:
# Count the number of baskets that contain each itemset
def freq(k, items, transactions):
    items_counts = dict() # Maps each itemset (a single item when k == 1) to its count
    for i in items:
        if k == 1:
            temp_i = {i} # A single item is a string, so wrap it in a set
        else:
            temp_i = set(i) # A candidate of size k is a tuple of items

        for basket in transactions.values():
            if temp_i.issubset(basket): # All items of the itemset are in the basket
                items_counts[i] = items_counts.get(i, 0) + 1
    return items_counts

In [5]:
items_counts = freq(1,items, transactions)
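
The call above compares every item with every basket. For single items, the same counts can be obtained in one pass over the baskets. The following is a minimal sketch of that alternative using `collections.Counter`; the function name `count_single_items` and the demo data are assumptions, not part of the notebook.

```python
from collections import Counter

def count_single_items(baskets):
    """Count, for each item, the number of baskets that contain it."""
    counts = Counter()
    for basket in baskets:
        counts.update(set(basket))  # set(): count an item once per basket
    return counts

# Hypothetical demo baskets in the same list-of-strings shape as `baskets`
demo = [["1", "2"], ["2", "3"], ["2", "2", "4"]]
demo_counts = count_single_items(demo)
print(demo_counts["2"])  # 3
```

Converting each basket to a set keeps the result a basket count even if an item were repeated within one basket.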

In [6]:
def support(items_counts, transactions):
    # Relative support = (#transactions containing the item) / (#transactions in total)
    rel_support = dict()
    for i in items_counts:
        rel_support[i] = items_counts[i]/len(transactions)
    return rel_support

In [21]:
min_support = 0.05
items_atleast_min_support = [{j[0]:j[1] for j in support(items_counts, transactions).items() if j[1]>=min_support}]

In [22]:
items_atleast_min_support

[{'829': 0.0681,
  '684': 0.05408,
  '354': 0.05835,
  '722': 0.05845,
  '217': 0.05375,
  '529': 0.07057,
  '419': 0.05057,
  '766': 0.06265,
  '368': 0.07828,
  '494': 0.05102}]

In [119]:
# Same threshold expressed as an absolute count: 5000 of the 100000 baskets = 0.05
s_min = 5000
L1 = [{j[0]:j[1] for j in freq(1,items, transactions).items() if j[1]>=s_min}]
list(L1[0].keys())

['684', '419', '217', '354', '368', '829', '722', '529', '494', '766']

In [120]:
from itertools import combinations

# Candidates of size k, built by extending each frequent (k-1)-itemset from L_{k-1} with one frequent item from L_1
def C_k(k, prev_freq):
    cand = []
    print(f"Calculating candidates of size {k}...")
    for i in prev_freq[0].keys():
        if k-1 == 1:
            # Pairs: all 2-combinations of the frequent single items
            temp = {i}
            combs = combinations(list(temp.union(set(L1[0].keys()))), k) 
            cand = list(combs)

        else:
            temp = set(i)
            for j in L1[0].keys():
                if len(temp.union({j}))==k: # j is not already in the itemset
                    # Note: the same itemset can be generated more than once
                    cand.append(tuple(temp.union({j})))
    return cand
cand2 = C_k(2,L1)


Calculating candidates of size 2...
[('722', '829'), ('722', '419'), ('722', '684'), ('722', '494'), ('722', '217'), ('722', '354'), ('722', '529'), ('722', '766'), ('722', '368'), ('829', '419'), ('829', '684'), ('829', '494'), ('829', '217'), ('829', '354'), ('829', '529'), ('829', '766'), ('829', '368'), ('419', '684'), ('419', '494'), ('419', '217'), ('419', '354'), ('419', '529'), ('419', '766'), ('419', '368'), ('684', '494'), ('684', '217'), ('684', '354'), ('684', '529'), ('684', '766'), ('684', '368'), ('494', '217'), ('494', '354'), ('494', '529'), ('494', '766'), ('494', '368'), ('217', '354'), ('217', '529'), ('217', '766'), ('217', '368'), ('354', '529'), ('354', '766'), ('354', '368'), ('529', '766'), ('529', '368'), ('766', '368')]
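
For comparison, the textbook A-Priori candidate step also prunes any candidate that has an infrequent (k-1)-subset, and storing itemsets as sorted tuples avoids generating the same itemset twice. The sketch below illustrates that idea under stated assumptions: the name `apriori_candidates` is hypothetical, and the frequent (k-1)-itemsets are assumed to be given as sorted tuples. It is not the notebook's `C_k`.

```python
from itertools import combinations

def apriori_candidates(prev_frequent, k):
    """Candidates of size k from frequent (k-1)-itemsets (sorted tuples)."""
    prev = set(prev_frequent)
    items = sorted({item for itemset in prev for item in itemset})
    candidates = set()
    for itemset in prev:
        for item in items:
            if item in itemset:
                continue
            cand = tuple(sorted(itemset + (item,)))
            # Prune: every (k-1)-subset of the candidate must be frequent
            if all(sub in prev for sub in combinations(cand, k - 1)):
                candidates.add(cand)
    return sorted(candidates)

# Hypothetical frequent pairs
frequent_pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
print(apriori_candidates(frequent_pairs, 3))  # [('a', 'b', 'c')]
```

In this example ('a', 'b', 'd') and ('b', 'c', 'd') are dropped because ('a', 'd') and ('c', 'd') are not frequent, so they never need to be counted against the transactions.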


In [121]:
def L_k(k, candidates, threshold):
    print(f"Calculating frequent items of size {k}")
    # Keep only the candidates that occur in at least `threshold` transactions
    Lk = [{j[0]:j[1] for j in freq(k,candidates, transactions).items() if j[1]>=threshold}]
    
    return Lk
L2 = L_k(2, cand2, 100)
L2

Calculating frequent items of size 2


[{('722', '829'): 294,
  ('722', '419'): 366,
  ('722', '684'): 443,
  ('722', '494'): 226,
  ('722', '217'): 498,
  ('722', '354'): 566,
  ('722', '529'): 283,
  ('722', '766'): 328,
  ('722', '368'): 392,
  ('829', '419'): 259,
  ('829', '684'): 349,
  ('829', '494'): 267,
  ('829', '217'): 275,
  ('829', '354'): 259,
  ('829', '529'): 584,
  ('829', '766'): 321,
  ('829', '368'): 1194,
  ('419', '684'): 155,
  ('419', '494'): 176,
  ('419', '217'): 344,
  ('419', '354'): 263,
  ('419', '529'): 252,
  ('419', '766'): 238,
  ('419', '368'): 355,
  ('684', '494'): 208,
  ('684', '217'): 198,
  ('684', '354'): 219,
  ('684', '529'): 334,
  ('684', '766'): 613,
  ('684', '368'): 387,
  ('494', '217'): 183,
  ('494', '354'): 189,
  ('494', '529'): 225,
  ('494', '766'): 227,
  ('494', '368'): 860,
  ('217', '354'): 280,
  ('217', '529'): 403,
  ('217', '766'): 276,
  ('217', '368'): 303,
  ('354', '529'): 301,
  ('354', '766'): 329,
  ('354', '368'): 319,
  ('529', '766'): 317,
  ('529', 

Optional task for extra bonus: Solve the second sub-problem, i.e., develop and implement an algorithm for generating association rules between frequent itemsets discovered by using the A-Priori algorithm in a dataset of sales transactions. The rules must have support at least s and confidence at least c, where s and c are given as input parameters.
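
One possible approach to this optional task is sketched below, assuming the support counts of all frequent itemsets (and therefore of all their subsets) are already available in a dictionary keyed by `frozenset`. The function name `generate_rules` and the demo counts are hypothetical. Because only frequent itemsets are split into X and Y, every generated rule already has support at least s, so only the confidence needs to be checked.

```python
from itertools import combinations

def generate_rules(support_counts, min_conf):
    """Return (X, Y, confidence) for rules X -> Y with confidence >= min_conf.

    support_counts maps frozenset itemsets to their support count and is
    assumed to contain every frequent itemset together with all its subsets.
    """
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue  # A rule needs a non-empty X and a non-empty Y
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                x = frozenset(lhs)
                conf = count / support_counts[x]  # support(X ∪ Y) / support(X)
                if conf >= min_conf:
                    rules.append((x, itemset - x, conf))
    return rules

# Hypothetical support counts
rule_demo = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 2,
    frozenset({"a", "b"}): 2,
}
for x, y, conf in generate_rules(rule_demo, 0.6):
    print(set(x), "->", set(y), conf)  # {'b'} -> {'a'} 1.0
```

With these counts the rule a → b has confidence 2/4 = 0.5 and is rejected, while b → a has confidence 2/2 = 1.0 and is kept.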

In [122]:
cand3 = C_k(3,L2)
cand3

Calculating candidates of size 3...


[('722', '684', '829'),
 ('722', '419', '829'),
 ('722', '217', '829'),
 ('722', '354', '829'),
 ('722', '829', '368'),
 ('722', '829', '529'),
 ('722', '829', '494'),
 ('722', '766', '829'),
 ('722', '419', '684'),
 ('722', '419', '217'),
 ('722', '419', '354'),
 ('722', '419', '368'),
 ('722', '419', '829'),
 ('722', '419', '529'),
 ('722', '419', '494'),
 ('722', '419', '766'),
 ('722', '419', '684'),
 ('722', '217', '684'),
 ('722', '354', '684'),
 ('722', '684', '368'),
 ('722', '684', '829'),
 ('722', '684', '529'),
 ('722', '684', '494'),
 ('722', '766', '684'),
 ('722', '684', '494'),
 ('722', '419', '494'),
 ('722', '217', '494'),
 ('722', '354', '494'),
 ('722', '368', '494'),
 ('722', '829', '494'),
 ('722', '494', '529'),
 ('722', '766', '494'),
 ('722', '217', '684'),
 ('722', '217', '419'),
 ('722', '217', '354'),
 ('722', '217', '368'),
 ('722', '217', '829'),
 ('722', '217', '529'),
 ('722', '217', '494'),
 ('722', '217', '766'),
 ('722', '354', '684'),
 ('722', '419', 

In [123]:
L3 = L_k(3, cand3, 100)

Calculating frequent items of size 3


In [111]:
L3

[{('722', '829', '368'): 138,
  ('722', '354', '368'): 105,
  ('419', '829', '368'): 132,
  ('684', '829', '368'): 348,
  ('368', '829', '494'): 180,
  ('217', '829', '368'): 141,
  ('354', '829', '368'): 141,
  ('829', '368', '529'): 225,
  ('766', '829', '368'): 204,
  ('684', '766', '529'): 120,
  ('684', '766', '368'): 117}]

In [138]:
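# Keep generating larger frequent itemsets until a level yields none

size = 1
frequent_items = []
s_min = 10 # Support threshold (count) used for itemsets of size 2 and larger
L1 = [{j[0]:j[1] for j in freq(1,items, transactions).items() if j[1]>=5000}] # Single items keep the threshold of 5000
frequent_items.extend(list(L1[0].keys()))

prev_freq = L1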
while True: 
    size+=1
    candidates = C_k(size,prev_freq)
    frequents = L_k(size,candidates,s_min)
    prev_freq = frequents
    if len(frequents[0])!=0:
        frequent_items.extend(list(frequents[0].keys()))
    else:
        break


Calculating candidates of size 2...
Calculating frequent items of size 2
45
Calculating candidates of size 3...
Calculating frequent items of size 3
153
Calculating candidates of size 4...
Calculating frequent items of size 4
23
Calculating candidates of size 5...
Calculating frequent items of size 5
1
Calculating candidates of size 6...
Calculating frequent items of size 6


In [139]:
frequent_items

['684',
 '419',
 '217',
 '354',
 '368',
 '829',
 '722',
 '529',
 '494',
 '766',
 ('722', '829'),
 ('722', '419'),
 ('722', '684'),
 ('722', '494'),
 ('722', '217'),
 ('722', '354'),
 ('722', '529'),
 ('722', '766'),
 ('722', '368'),
 ('829', '419'),
 ('829', '684'),
 ('829', '494'),
 ('829', '217'),
 ('829', '354'),
 ('829', '529'),
 ('829', '766'),
 ('829', '368'),
 ('419', '684'),
 ('419', '494'),
 ('419', '217'),
 ('419', '354'),
 ('419', '529'),
 ('419', '766'),
 ('419', '368'),
 ('684', '494'),
 ('684', '217'),
 ('684', '354'),
 ('684', '529'),
 ('684', '766'),
 ('684', '368'),
 ('494', '217'),
 ('494', '354'),
 ('494', '529'),
 ('494', '766'),
 ('494', '368'),
 ('217', '354'),
 ('217', '529'),
 ('217', '766'),
 ('217', '368'),
 ('354', '529'),
 ('354', '766'),
 ('354', '368'),
 ('529', '766'),
 ('529', '368'),
 ('766', '368'),
 ('722', '684', '829'),
 ('722', '419', '829'),
 ('722', '217', '829'),
 ('722', '354', '829'),
 ('722', '829', '368'),
 ('722', '829', '529'),
 ('722', '8