## **Assignment 2** - Group 50

Lütfi Altin (lutfia@kth.se) |
Jakob Heyder (heyder@kth.se)

### Task:
You are to solve the first sub-problem: implement the Apriori algorithm for finding frequent itemsets with support at least **_s_** in a dataset of sales transactions. Recall that the support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses your Apriori implementation to discover frequent itemsets with support at least **_s_** in a given dataset of sales transactions.

Solve the second sub-problem, i.e., develop and implement an algorithm for generating association rules between frequent itemsets discovered by using the Apriori algorithm in a dataset of sales transactions. The rules must have support at least **_s_** and confidence at least **_c_**, where **_s_** and **_c_** are given as input parameters.
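
To make the two thresholds concrete, here is a minimal sketch of how support and confidence could be computed by brute force. The data in `toy_baskets` and the helper names `support` and `rule_confidence` are assumptions made for illustration; they are not part of the assignment code.

```python
# Toy transactions (hypothetical data, not taken from T10I4D100K.dat)
toy_baskets = [
    {1, 2, 3},
    {1, 2},
    {1, 3},
    {2, 3},
    {1, 2, 3},
]

def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b)

def rule_confidence(antecedent, consequent, baskets):
    """confidence(A -> B) = support(A union B) / support(A)."""
    union = set(antecedent) | set(consequent)
    return support(union, baskets) / support(antecedent, baskets)

print(support({1, 2}, toy_baskets))                # 3
print(rule_confidence({1, 2}, {3}, toy_baskets))   # 2 / 3
```

With **_s_** = 2 and **_c_** = 0.6, the rule {1, 2} → 3 would be accepted in this toy example, because the support of {1, 2, 3} is 2 and the confidence is 2/3.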


### Dataset and Tools

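The dataset is `T10I4D100K.dat`, which contains one basket per line as space-separated integer item IDs. Each component of the solution is implemented as a Python function.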

In [0]:
# Load dependencies (pandas, csv etc.)
import csv
import numpy as np
import re
import hashlib
import itertools
from collections import Counter
from pprint import pprint
import pandas as pd


In [3]:
# Import sales data
baskets = []

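with open('T10I4D100K.dat', 'r') as f: # close the file once reading is done
    for l in f:
        items = l.split() # split on any whitespace
        if not items: # skip blank lines
            continue
        items = list(map(int, items)) # convert items to integers
        baskets.append(items)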

print("First 3 baskets:")
pprint(baskets[0:3])

print()
print("Number of baskets: ", len(baskets))


First 3 baskets:
[[25, 52, 164, 240, 274, 328, 368, 448, 538, 561, 630, 687, 730, 775, 825, 834],
 [39, 120, 124, 205, 401, 581, 704, 814, 825, 834],
 [35, 249, 674, 712, 733, 759, 854, 950]]

Number of baskets:  100000


The `construct` function generates the candidate itemsets of size k+1 from the frequent itemsets of size k.

**_*1_** Consider the case where there are 3 frequent items: a, b, c. The possible pairs of frequent items are (a, b), (a, c) and (b, c). Candidate triplets are constructed by looping through the frequent items and adding each one to each pair. This produces the same triplet three times, in different orders:
* (a, b, c)
* (a, c, b)
* (b, c, a)

To avoid the duplicates, we sort each candidate, build a signature from it (the sorted items joined into a string), and skip any candidate whose signature has already been generated.
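
The duplicate problem from note *1 can be reproduced on the a, b, c example. The sketch below is a simplified variant and not the notebook's `construct`: it assumes itemsets are tuples and uses the sorted tuple itself as the identity instead of a joined string. The name `extend_itemsets` is hypothetical.

```python
def extend_itemsets(freq_itemsets, freq_items):
    """Extend each k-itemset (a tuple) by one frequent item, dropping duplicates."""
    seen = set()       # identities of candidates generated so far
    candidates = []
    for itemset in freq_itemsets:
        for item in freq_items:
            if item in itemset:
                continue  # the item is already part of this itemset
            candidate = tuple(sorted(itemset + (item,)))
            if candidate not in seen:  # (a, c, b) and (b, c, a) collapse to (a, b, c)
                seen.add(candidate)
                candidates.append(candidate)
    return candidates

pairs = [("a", "b"), ("a", "c"), ("b", "c")]
print(extend_itemsets(pairs, ["a", "b", "c"]))  # [('a', 'b', 'c')]
```

Sorting before comparing is what makes the three orderings identical, so only one triplet survives.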

**_*2_** Assume that, with 3 frequent items a, b, c, the following frequent pairs are found: (a, b) and (a, c). In that case (a, b, c) is not a candidate for a frequent triplet, because its subset (b, c) is not a frequent itemset. `construct` does not perform this check. However, this is not so important, because k=2 requires the most computational power for typical baskets.
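
The subset check described in note *2 could look roughly like the sketch below. It assumes frequent k-itemsets are stored as a set of sorted tuples, which differs from the list-of-lists representation used in this notebook, and the name `all_subsets_frequent` is hypothetical.

```python
from itertools import combinations

def all_subsets_frequent(candidate, freq_itemsets):
    """True if every (k-1)-subset of the sorted tuple `candidate` is frequent.

    `freq_itemsets` is assumed to be a set of sorted tuples.
    """
    k = len(candidate)
    # combinations() keeps the input order, so subsets of a sorted tuple are sorted too
    return all(subset in freq_itemsets for subset in combinations(candidate, k - 1))

frequent_pairs = {("a", "b"), ("a", "c")}
print(all_subsets_frequent(("a", "b", "c"), frequent_pairs))  # False, (b, c) is missing

frequent_pairs.add(("b", "c"))
print(all_subsets_frequent(("a", "b", "c"), frequent_pairs))  # True
```

A candidate that fails this check can be discarded before any basket is scanned, which is where the saving would come from.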

In [4]:
# Apriori algorithm implementation

# constructs possible set of frequent itemsets with k+1 items where `freqItemsets` is itemsets with k items 
def construct(freqItemsets, freqItems):
    itemsets = []
    signatures = set()

    for s in freqItemsets:
        for i in freqItems:
            if type(s) == list:
                possibleSet = list(s)
            else:
                possibleSet = [s]

            if i not in possibleSet: # add to possible set if not already in there
                possibleSet.append(i)
                possibleSet.sort()
                signature = ",".join(map(str, possibleSet))

                if signature not in signatures: # make sure new possible set is not already generated. see notice *1
                    signatures.add(signature)
                    itemsets.append(possibleSet)

                    # possible performance increase: check that all subsets of possibleSet are in freqItemsets. see *2

    return itemsets
    
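def filter(baskets, possibleSets, threshold): # note: this name shadows the built-in filter()
    if not possibleSets: # no candidates were generated, so possibleSets[0] below would raise IndexError
        return {}

    print('filtering %s candidate sets with k=%s' % (len(possibleSets), len(possibleSets[0])))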

    # count occurences of itemsets in baskets
    occurences = [0 for s in possibleSets] # initialize list with 0 occurences
    for b in baskets:
        for index, s in enumerate(possibleSets):

            in_basket = True
            for item in s:
                if item not in b:
                    in_basket = False
                    break
            
            if in_basket:
                occurences[index] += 1
    
    # filter and return frequent itemsets with occurence counts
    return { ",".join(map(str, possibleSets[i])):v for i, v in enumerate(occurences) if v >= threshold }

def apriori(baskets, threshold):
    occurenceData = []

    # 1. count occurences of each item in baskets
    occurences = {} # a dictionary containing occurence of items in baskets
    for b in baskets:
        for item in b:
            if item not in occurences: # this item didn't exist in previous baskets, initialize with 0
                occurences[item] = 0
            occurences[item] += 1
    
    # 2. filter frequent items, itemsets can only be made of frequent items
    occurences = {k:v for k, v in occurences.items() if v >= threshold}
    freqItems = [k for k, v in occurences.items()]

    occurenceData.append(occurences)
    
    # 3. continue with apriori algorithm pipeline, construct candidate k-tuples & filter
    freqItemsets = freqItems
    while len(freqItemsets) > 0:
        print(freqItemsets)

        occurences = filter(baskets, construct(freqItemsets, freqItems), threshold)
        freqItemsets = [list(map(int, k.split(","))) for k, v in occurences.items()]

        occurenceData.append(occurences)

    return occurenceData

#occurenceData = apriori(baskets, 1000) # 1%

occurenceData = apriori(baskets[0:1000], 12) # test with this during development, tests against a subset of data

print('done')


[25, 52, 240, 274, 368, 448, 538, 561, 775, 825, 39, 120, 205, 401, 581, 704, 814, 35, 674, 733, 854, 950, 449, 895, 937, 964, 229, 283, 294, 381, 738, 766, 853, 883, 966, 978, 143, 569, 620, 798, 214, 350, 529, 658, 682, 782, 809, 947, 970, 227, 390, 71, 192, 279, 280, 496, 530, 597, 675, 720, 914, 932, 183, 193, 217, 256, 276, 653, 706, 878, 161, 175, 177, 424, 571, 623, 795, 910, 960, 125, 130, 392, 461, 801, 862, 27, 78, 921, 147, 411, 572, 579, 778, 803, 903, 266, 523, 614, 888, 944, 43, 70, 204, 334, 480, 874, 151, 830, 890, 73, 118, 310, 419, 484, 722, 810, 844, 846, 918, 967, 326, 403, 526, 774, 788, 789, 975, 116, 198, 201, 395, 171, 541, 701, 805, 946, 471, 487, 631, 638, 735, 780, 935, 17, 242, 758, 763, 956, 145, 385, 676, 790, 792, 885, 522, 617, 12, 296, 354, 548, 684, 740, 841, 210, 346, 477, 605, 829, 884, 355, 460, 746, 600, 28, 742, 5, 115, 517, 736, 744, 919, 196, 489, 494, 673, 362, 591, 31, 58, 181, 472, 573, 628, 651, 154, 168, 580, 832, 871, 988, 72, 981, 10, 132

In [5]:
def associationRules(occurenceData, confidence=0.8):
    freqItems = occurenceData[0]

    for k in range(len(occurenceData) - 1): # occurenceData contains support values for each k-itemsets
        for itemset in occurenceData[k]:
            supportItemset = occurenceData[k][itemset] # extract support value for current itemset, will be divided to calculate confidence

            if type(itemset) == int:
                itemset = [itemset]
            else:
                itemset = list(map(int, itemset.split(",")))

            for newItem in freqItems: # loop through `freqItems` and add each one of them to current itemset. same as construct
                newItemset = list(itemset)
                newItemset.append(newItem)
                newItemset.sort()
                newItemset = ",".join(map(str, newItemset)) # generate new itemset with `newItem` added to itemset

                if newItemset not in occurenceData[k+1]:
                    continue

                conf = occurenceData[k+1][newItemset] / supportItemset # calculate confidence: support of `newItemset` divided by support of itemset
                if conf >= confidence: # the task asks for confidence of at least c
                    print(itemset, ' ==> ', newItem) 

associationRules(occurenceData)

[801]  ==>  862
[515]  ==>  217
[217, 283]  ==>  346
[217, 283]  ==>  33
[217, 283]  ==>  515
[283, 346]  ==>  217
[283, 346]  ==>  33
[283, 346]  ==>  515
[33, 283]  ==>  217
[33, 283]  ==>  346
[33, 283]  ==>  515
[283, 515]  ==>  217
[283, 515]  ==>  346
[283, 515]  ==>  33
[33, 217]  ==>  283
[33, 217]  ==>  346
[33, 217]  ==>  515
[217, 515]  ==>  283
[217, 515]  ==>  346
[217, 515]  ==>  33
[33, 346]  ==>  283
[33, 346]  ==>  217
[33, 346]  ==>  515
[346, 515]  ==>  283
[346, 515]  ==>  217
[346, 515]  ==>  33
[33, 515]  ==>  283
[33, 515]  ==>  217
[33, 515]  ==>  346
[217, 283, 346]  ==>  33
[217, 283, 346]  ==>  515
[33, 217, 283]  ==>  346
[33, 217, 283]  ==>  515
[217, 283, 515]  ==>  346
[217, 283, 515]  ==>  33
[33, 283, 346]  ==>  217
[33, 283, 346]  ==>  515
[283, 346, 515]  ==>  217
[283, 346, 515]  ==>  33
[33, 283, 515]  ==>  217
[33, 283, 515]  ==>  346
[33, 217, 346]  ==>  283
[33, 217, 346]  ==>  515
[217, 346, 515]  ==>  283
[217, 346, 515]  ==>  33
[33, 217, 515]

NOTES:

`construct` is quite fast and needs no optimization. `filter` is slow, especially for k=2, because it tests every candidate itemset against every basket. I don't think it can be optimized much further, because the number of baskets is high.
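
One direction that might still be worth trying for k=2 is to enumerate the pairs that actually occur in each basket, instead of testing every candidate pair against every basket. The sketch below has not been benchmarked on `T10I4D100K.dat`; the function name `count_pairs` and the toy data are assumptions, and it returns tuple keys rather than the comma-joined string keys used by `filter`.

```python
from collections import Counter
from itertools import combinations

def count_pairs(baskets, freq_items, threshold):
    """Count co-occurring pairs of frequent items by enumerating pairs per basket."""
    freq_items = set(freq_items)
    counts = Counter()
    for basket in baskets:
        # keep only frequent items; sorting gives each pair one canonical order
        kept = sorted(set(basket) & freq_items)
        counts.update(combinations(kept, 2))
    return {pair: n for pair, n in counts.items() if n >= threshold}

toy = [[1, 2, 3], [1, 2], [2, 3, 4], [1, 2, 3]]
print(count_pairs(toy, [1, 2, 3], threshold=2))
# {(1, 2): 3, (1, 3): 2, (2, 3): 3}
```

The work per basket then depends on the basket's length rather than on the total number of candidate pairs, which may help when baskets are short and the number of frequent items is large.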