# Discovery of Frequent Itemsets and Association Rules

The problem of discovering association rules between itemsets in a sales transaction database (a set of baskets) includes the following two sub-problems:

1. Finding frequent itemsets with support at least s;
2. Generating association rules with confidence at least c from the itemsets found in the first step.

Recall that an association rule is an implication X → Y, where X and Y are itemsets such that X ∩ Y = ∅. The support of the rule X → Y is the number of transactions that contain X ∪ Y. The confidence of the rule X → Y is the fraction of the transactions containing X that also contain Y.
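
To make the two definitions concrete, here is a minimal sketch on a handful of made-up baskets. The data and the helper names `support_count` and `confidence` are assumptions for illustration only; they are not part of the assignment or the dataset.

```python
# Toy baskets (hypothetical data, not taken from the assignment dataset)
toy_baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support_count(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(X, Y, baskets):
    """Support of X ∪ Y divided by support of X, i.e. confidence of X → Y."""
    return support_count(X | Y, baskets) / support_count(X, baskets)

X, Y = {"bread"}, {"milk"}
rule_support = support_count(X | Y, toy_baskets)  # baskets containing bread and milk
rule_confidence = confidence(X, Y, toy_baskets)   # share of bread baskets that also have milk
print(rule_support, rule_confidence)
```

Here two baskets contain both bread and milk, and three baskets contain bread, so the rule {bread} → {milk} has support 2 and confidence 2/3.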

You are to solve the first sub-problem: implement the A-Priori algorithm for finding frequent itemsets with support at least s in a dataset of sales transactions. Recall that the support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses it to discover frequent itemsets with support at least s in a given dataset of sales transactions.
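
As a point of reference, the level-wise idea behind A-Priori can be sketched as follows. This is one possible compact formulation under stated assumptions, not the notebook's implementation: the function name `apriori`, the use of `frozenset` keys, and the toy baskets are all chosen here for illustration.

```python
from itertools import combinations

def apriori(baskets, s):
    """Return {frozenset(itemset): support count} for all itemsets with count >= s."""
    basket_sets = [frozenset(b) for b in baskets]

    # Pass 1: count single items
    counts = {}
    for b in basket_sets:
        for item in b:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {iset: n for iset, n in counts.items() if n >= s}
    frequent = dict(current)

    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets, then prune any
        # candidate that has an infrequent (k-1)-subset (monotonicity of support)
        prev = set(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}

        # Counting pass: one scan over the baskets for all size-k candidates
        counts = {c: 0 for c in candidates}
        for b in basket_sets:
            for c in candidates:
                if c <= b:
                    counts[c] += 1
        current = {iset: n for iset, n in counts.items() if n >= s}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy data: every pair occurs 3 times, the triple only twice
toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
result = apriori(toy, s=3)
```

With `s=3`, the three single items and the three pairs are frequent, while the triple {a, b, c} is generated as a candidate but rejected in the counting pass.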

The sales transaction dataset consists of generated transactions (baskets) of hashed items (see Canvas).

In [26]:
with open("T10I4D100K.dat") as f:  # Close the file once all lines are read
    baskets = [line.strip().split() for line in f]
len(baskets)

100000

In [6]:
# Dictionary with transaction ID (starting at 1) as key, and basket as value
transactions = {tid: basket for tid, basket in enumerate(baskets, start=1)}


In [7]:
items = set() # Set of items from all baskets
for i in transactions.values():
    for j in i:
        items.add(j) 

In [133]:
# Count the number of transactions that contain each candidate (its support count)
def freq(k, items, transactions):
    items_counts = dict()  # Maps each item (k == 1) or tuple of items (k > 1) to its count
    basket_sets = [set(basket) for basket in transactions.values()]  # Convert each basket to a set once
    for i in items:
        # A single item is a string, so wrap it; a k-itemset is already a tuple of items
        temp_i = {i} if k == 1 else set(i)
        for basket in basket_sets:
            if temp_i.issubset(basket):  # All items of the candidate are in the basket
                items_counts[i] = items_counts.get(i, 0) + 1
    return items_counts  # Candidates that occur in no basket are not included

In [134]:
items_counts = freq(1,items, transactions)

In [106]:
def support(items_counts, transactions):
    rel_support = dict()
    for i in items_counts:
        # Relative support = number of transactions containing the item / total number of transactions
        rel_support[i] = items_counts[i]/len(transactions)
    return rel_support

In [21]:
min_support = 0.05 # Minimum relative support (fraction of all transactions)
items_atleast_min_support = [{j[0]:j[1] for j in support(items_counts, transactions).items() if j[1]>=min_support}]

In [22]:
items_atleast_min_support

[{'829': 0.0681,
  '684': 0.05408,
  '354': 0.05835,
  '722': 0.05845,
  '217': 0.05375,
  '529': 0.07057,
  '419': 0.05057,
  '766': 0.06265,
  '368': 0.07828,
  '494': 0.05102}]

In [137]:
# Same threshold expressed as an absolute count: 5000 of 100000 transactions = 0.05
s_min = 5000
L1 = [{j[0]:j[1] for j in freq(1,items, transactions).items() if j[1]>=s_min}]
list(L1[0].keys())

['829', '684', '354', '722', '217', '529', '419', '766', '368', '494']

In [138]:
from itertools import combinations

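# Candidates of length k, generated by extending each itemset in L_k-1 with one item from L_1
def C_k(k, prev_freq):
    singles = set(L1[0].keys())
    # Keys of L_1 are single item strings; keys of later levels are tuples of items
    prev = {frozenset([key]) if isinstance(key, str) else frozenset(key) for key in prev_freq[0]}
    # Building a set of frozensets keeps each itemset once, however it was reached
    cand = {p | {item} for p in prev for item in singles if item not in p}
    # A-Priori pruning: every (k-1)-subset of a candidate must itself be frequent
    cand = [tuple(c) for c in cand
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))]
    return cand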
cand2 = C_k(2,L1)

In [148]:
def L_k(k, candidates, threshold):
    # Keep only the candidates whose support count is at least the threshold
    Lk = [{j[0]:j[1] for j in freq(k,candidates, transactions).items() if j[1]>=threshold}]
    return Lk
L2 = L_k(2, cand2, 100)
L2

[{('684', '217'): 198,
  ('684', '529'): 334,
  ('684', '354'): 219,
  ('684', '419'): 155,
  ('684', '494'): 208,
  ('684', '766'): 613,
  ('684', '829'): 349,
  ('684', '368'): 387,
  ('684', '722'): 443,
  ('217', '529'): 403,
  ('217', '354'): 280,
  ('217', '419'): 344,
  ('217', '494'): 183,
  ('217', '766'): 276,
  ('217', '829'): 275,
  ('217', '368'): 303,
  ('217', '722'): 498,
  ('529', '354'): 301,
  ('529', '419'): 252,
  ('529', '494'): 225,
  ('529', '766'): 317,
  ('529', '829'): 584,
  ('529', '368'): 640,
  ('529', '722'): 283,
  ('354', '419'): 263,
  ('354', '494'): 189,
  ('354', '766'): 329,
  ('354', '829'): 259,
  ('354', '368'): 319,
  ('354', '722'): 566,
  ('419', '494'): 176,
  ('419', '766'): 238,
  ('419', '829'): 259,
  ('419', '368'): 355,
  ('419', '722'): 366,
  ('494', '766'): 227,
  ('494', '829'): 267,
  ('494', '368'): 860,
  ('494', '722'): 226,
  ('766', '829'): 321,
  ('766', '368'): 504,
  ('766', '722'): 328,
  ('829', '368'): 1194,
  ('829', 

Optional task for extra bonus: Solve the second sub-problem, i.e., develop and implement an algorithm for generating association rules between frequent itemsets discovered by using the A-Priori algorithm in a dataset of sales transactions. The rules must have support at least s and confidence at least c, where s and c are given as input parameters.
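
One way to approach the optional task is sketched below. It assumes the frequent itemsets are available as a dictionary that maps each itemset (as a `frozenset`) to its support count and that contains every subset of each listed itemset, which holds for the output of A-Priori because every subset of a frequent itemset is frequent. The function name `generate_rules` and the example counts are hypothetical.

```python
from itertools import combinations

def generate_rules(frequent, c):
    """Return (X, Y, support, confidence) for every rule X → Y with confidence >= c.

    `frequent` maps frozenset itemsets to support counts and is assumed to
    contain all non-empty subsets of every itemset it lists.
    """
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue  # A rule needs a non-empty left and right side
        for r in range(1, len(itemset)):
            for left in combinations(itemset, r):
                X = frozenset(left)
                Y = itemset - X
                conf = supp / frequent[X]  # support(X ∪ Y) / support(X)
                if conf >= c:
                    rules.append((X, Y, supp, conf))
    return rules

# Hypothetical support counts, already filtered by the support threshold s
frequent = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 4,
    frozenset({"a", "b"}): 3,
}
rules = generate_rules(frequent, c=0.7)
for X, Y, supp, conf in rules:
    print(set(X), "->", set(Y), "support:", supp, "confidence:", conf)
```

Because the input already contains only itemsets with support at least s, every generated rule meets the support requirement, and only the confidence check remains.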