In [1]:
%reload_ext watermark
%watermark -a 'Antonio Javier González Ferrer & Hamed Mohammadpour' -v -d -r

Antonio Javier González Ferrer & Hamed Mohammadpour 2017-11-15 

CPython 3.5.2
IPython 6.2.1
Git repo: https://github.com/jgonzalezferrer/apriori.git


# Introduction

In [2]:
from apriori.algorithm import find_frequent_itemsets, generate_candidates, prune_candidates
from apriori.utility import create_items_catalog
from tqdm import tqdm_notebook

import json

# Print maps and dictionaries as a compact JSON string
def printify(my_dict):
    print(json.dumps(my_dict))

# Small Example

Let us explain how the frequent itemsets and association rules are calculated with a small example. Our shopping list consists of three different baskets, where each one contains different elements. We will use letters to represent the elements and numbers to represent the baskets.

In [3]:
baskets = [['a', 'b', 'c'], ['a', 'b', 'e'], ['c', 'd']]
n = len(baskets)
baskets

[['a', 'b', 'c'], ['a', 'b', 'e'], ['c', 'd']]

In [4]:
baskets[0]

['a', 'b', 'c']

Basket $0$ contains the elements $a$, $b$ and $c$. An element $i$ is said to be "frequent" if it appears in enough baskets. We measure this with the support $s(i)$: the number of baskets that contain $i$ divided by the total number of baskets. An element is frequent when its support meets a chosen threshold. For instance, $s(a) = 2/3$ since $a$ appears in two of the three baskets.
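
The definition of support can be sketched by counting directly over the baskets. The helper `support()` below is hypothetical (it is not part of the `apriori` package) and simply rescans every basket, which is the costly approach that the rest of this notebook avoids.

```python
# Toy baskets from the example above.
baskets = [['a', 'b', 'c'], ['a', 'b', 'e'], ['c', 'd']]

def support(item, baskets):
    """Fraction of baskets that contain `item` (hypothetical helper)."""
    occurrences = sum(1 for basket in baskets if item in basket)
    return occurrences / len(baskets)

support_a = support('a', baskets)  # 'a' is in baskets 0 and 1
support_d = support('d', baskets)  # 'd' is only in basket 2
print(support_a, support_d)
```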

We can also calculate frequent itemsets of any size $k$. Let us focus on calculating frequent itemsets of size $k=2$ for the sake of simplicity. A naive algorithm would first count the frequent itemsets of size $1$, then combine them into all possible pairs and pass over the baskets again to count the occurrences of each pair. Considering that checking the number of appearances of each pair is the most costly operation, this approach has two main drawbacks in terms of efficiency:

1. We do not need to iterate over the whole dataset again to calculate the support of the pairs, since we can store the occurrences of each element in a clever way during the first pass. As explained below, the method `create_items_catalog()` saves, for each item, the list of baskets where it appears (do you already realise how to use this information for calculating the support of 2-itemsets?)
2. The <i>apriori</i> property: every subset of a frequent itemset must also be a frequent itemset. Imagine we generate the tuple $(a,b,c)$ given that the frequent itemsets of size $2$ are $\{(a,b), (b,c)\}$. The tuple $(a,b,c)$ cannot belong to the frequent itemsets of size $3$ since its subset $(a,c)$ is not a frequent itemset of size $2$. Therefore, a brute force generation of candidate frequent itemsets is not efficient. We will handle this generation with the `generate_candidates()` method.
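
The <i>apriori</i> property in the second point can be checked mechanically. The following is a minimal sketch, assuming itemsets are stored as `frozenset` objects; `satisfies_apriori()` is a hypothetical helper and not necessarily how `generate_candidates()` works internally.

```python
from itertools import combinations

def satisfies_apriori(candidate, frequent_smaller):
    """True if every (k-1)-subset of `candidate` is in `frequent_smaller`."""
    k = len(candidate)
    return all(frozenset(subset) in frequent_smaller
               for subset in combinations(candidate, k - 1))

# Frequent itemsets of size 2 from the example: (a, c) is missing.
frequent_pairs = {frozenset({'a', 'b'}), frozenset({'b', 'c'})}
abc_ok = satisfies_apriori(frozenset({'a', 'b', 'c'}), frequent_pairs)
print(abc_ok)  # False, because (a, c) is not a frequent pair
```

It is enough to test the subsets of size $k-1$: if all of them are frequent, their own subsets were already validated at the previous levels.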

## Efficient Storage of Support

The key idea is to pass over the dataset only once and store, for each item, the list of baskets where it appears. The support of an item is then the length of its list divided by the total number of baskets. The same information gives the support of any itemset of size $k$: we intersect the $k$ lists of its items, and the size of the intersection divided by the total number of baskets is the support.

In [5]:
items_catalog = create_items_catalog(baskets, 'str')
printify(items_catalog)

{"b": [0, 1], "d": [2], "a": [0, 1], "e": [1], "c": [0, 2]}


We can see that the element $a$ appears in baskets $0$ and $1$ and the element $c$ in baskets $0$ and $2$, so $s(a)=s(c)=2/3$. Hence, if we want to calculate the support of the itemset $(a, c)$, we just need to intersect both lists and take the length of the result:

In [6]:
ac_intersection = set(items_catalog['a']).intersection(set(items_catalog['c']))
ac_occurrences = len(ac_intersection)
print("The intersection of (a,c) is {} and the support is {}/{}.".format(ac_intersection, ac_occurrences, n))

The intersection of (a,c) is {0} and the support is 1/3.
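
The same intersection extends to itemsets of any size. The sketch below is self-contained: it builds its own item-to-baskets mapping (a hypothetical stand-in for the catalog, whose exact structure inside the package is not shown here) and `itemset_support()` is an illustrative helper, not a function of the `apriori` package.

```python
from functools import reduce

baskets = [['a', 'b', 'c'], ['a', 'b', 'e'], ['c', 'd']]

# Hypothetical stand-in for the catalog: item -> set of basket indices.
catalog = {}
for index, basket in enumerate(baskets):
    for item in basket:
        catalog.setdefault(item, set()).add(index)

def itemset_support(itemset, catalog, n_baskets):
    """Support of a non-empty itemset via intersection of basket sets."""
    basket_sets = [catalog.get(item, set()) for item in itemset]
    common = reduce(set.intersection, basket_sets)
    return len(common) / n_baskets

support_ab = itemset_support(('a', 'b'), catalog, len(baskets))
support_abc = itemset_support(('a', 'b', 'c'), catalog, len(baskets))
print(support_ab, support_abc)
```

Storing sets of basket indices rather than lists avoids converting on every lookup, at the cost of some extra memory.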


## Generation of Candidate Itemsets

The candidate itemsets of size $k$ are generated by combining the frequent itemsets of size $k-1$ with the singletons (frequent itemsets of size $1$). However, each candidate itemset must fulfil the <i>apriori</i> property. Let us compare the brute force method against the implemented method:


In [7]:
L1 = ['a', 'b', 'c', 'd']

# after pruning itemsets, these meet the support threshold...
L2 = [{'a', 'b'}, {'a', 'c'}, {'b', 'c'}, {'b', 'd'}]

brute_force_candidates = set()
for candidate in L2:
    for single in L1:
        k_candidate = frozenset(candidate.union(single))
        if len(k_candidate) == 3:
            brute_force_candidates.add(k_candidate)

apriori_candidates = generate_candidates(L2, L1)

In [8]:
print('The brute force candidates are: \n{}\n'.format(brute_force_candidates))
print('The candidates generated by the apriori algorithm are: \n{}'.format(apriori_candidates))

The brute force candidates are: 
{frozenset({'b', 'd', 'c'}), frozenset({'b', 'd', 'a'}), frozenset({'d', 'a', 'c'}), frozenset({'b', 'a', 'c'})}

The candidates generated by the apriori algorithm are: 
{frozenset({'b', 'a', 'c'})}


As expected, the only valid itemset of size $k=3$ is $(a,b,c)$, since all of its subsets of size $2$, namely $(a,b)$, $(a,c)$ and $(b,c)$, are frequent itemsets. By contrast, $(a, d, b)$ contains $(a,d)$, which is not a frequent itemset.
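
To make the pruning step concrete, here is one possible way to write such a generator. This is a sketch under our own assumptions and is not the code of `generate_candidates()`; the name `generate_candidates_sketch()` is hypothetical.

```python
from itertools import combinations

def generate_candidates_sketch(frequent_prev, singletons):
    """Extend each (k-1)-itemset by one item and keep only candidates
    whose (k-1)-subsets are all frequent. Illustrative only."""
    frequent_prev = {frozenset(s) for s in frequent_prev}
    candidates = set()
    for itemset in frequent_prev:
        for item in singletons:
            candidate = itemset | {item}
            if len(candidate) != len(itemset) + 1:
                continue  # the item was already in the itemset
            subsets = combinations(candidate, len(candidate) - 1)
            if all(frozenset(s) in frequent_prev for s in subsets):
                candidates.add(candidate)
    return candidates

L1 = ['a', 'b', 'c', 'd']
L2 = [{'a', 'b'}, {'a', 'c'}, {'b', 'c'}, {'b', 'd'}]
sketch_candidates = generate_candidates_sketch(L2, L1)
print(sketch_candidates)
```

On this input the sketch keeps only $(a,b,c)$, in agreement with the output above.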

# Real Example

Now let us test it with a real example.

## Load the data

In [9]:
%%time
data_file = "./data/T10I4D100K.dat"

with open(data_file, 'r') as f:
    content = f.read()
    baskets = []

    for line in content.splitlines():
        baskets.append(line.split())

support = 0.01
n = len(baskets)
print("The dataset contains {:,} baskets.".format(n))

items_catalog = create_items_catalog(baskets)
items_length = len(list(items_catalog.keys()))
print("There are {} different items.".format(items_length))

The dataset contains 100,000 baskets.
There are 870 different items.
CPU times: user 604 ms, sys: 8 ms, total: 612 ms
Wall time: 611 ms


Let us calculate which elements of size 1 are actually frequent itemsets, i.e. those that meet the specified support:

In [10]:
%%time
frequent_itemsets = set()
c1 = {frozenset({x}) for x in set(items_catalog.keys())}
l1, _ = prune_candidates(n, c1, support, items_catalog)
print("There are {} frequent itemsets of size k=1.".format(len(l1)))

There are 375 frequent itemsets of size k=1.
CPU times: user 80 ms, sys: 4 ms, total: 84 ms
Wall time: 83.7 ms


From those singletons, let us see how many candidate itemsets we find:

In [11]:
%%time
c2 = generate_candidates(l1, l1)
print("There are {} possible candidate itemsets of size k=2.".format(len(c2)))

There are 70125 possible candidate itemsets of size k=2.
CPU times: user 260 ms, sys: 20 ms, total: 280 ms
Wall time: 277 ms
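
As a quick sanity check, the number of candidates equals the number of unordered pairs that can be formed from the 375 frequent singletons. This is consistent with no pair being discarded at this stage, since both 1-subsets of such a pair are frequent by construction.

```python
n_singletons = 375
# Number of unordered pairs: n * (n - 1) / 2
n_pairs = n_singletons * (n_singletons - 1) // 2
print(n_pairs)  # 70125
```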


Now we have to find the support of those candidates and filter out the ones that do not meet the support threshold:

In [12]:
%%time 
l2, _ = prune_candidates(n, c2, support, items_catalog)
print("There are {} frequent itemsets of size k=2.".format(len(l2)))

There are 9 frequent itemsets of size k=2.
CPU times: user 16.2 s, sys: 12 ms, total: 16.2 s
Wall time: 16.2 s


This is the operation that takes the most time, since it has to check the support of many candidate itemsets. For larger $k$ there are far fewer candidates, so the remaining steps are fast:

In [13]:
%%time
c3 = generate_candidates(l2, l1)
print("There is {} possible candidate itemset of size k=3.".format(len(c3)))

There is 1 possible candidate itemset of size k=3.
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 6.08 ms


In [14]:
%%time
l3, _ = prune_candidates(n, c3, support, items_catalog)
print("There is {} frequent itemset of size k=3 and it is -> {}".format(len(l3), l3))

There is 1 frequent itemset of size k=3 and it is -> {frozenset({frozenset({'39'}), frozenset({'825'}), frozenset({'704'})})}
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 4.73 ms
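
The generate-then-prune steps above repeat level by level until no candidates remain. The following self-contained sketch puts the whole loop together on the toy baskets from the first section. It is an illustration under our own assumptions (for instance, an itemset is kept when its support is greater than or equal to the threshold) and is not the implementation of `find_frequent_itemsets()`.

```python
from itertools import combinations

def apriori_sketch(baskets, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(baskets)
    # Single pass over the data: item -> set of basket indices.
    catalog = {}
    for index, basket in enumerate(baskets):
        for item in basket:
            catalog.setdefault(item, set()).add(index)

    def support(itemset):
        common = set.intersection(*(catalog[item] for item in itemset))
        return len(common) / n

    frequent = {}
    level = {frozenset({item}) for item in catalog}  # candidates, k = 1
    while level:
        # Prune: keep the candidates that meet the threshold.
        kept = {c: support(c) for c in level}
        kept = {c: s for c, s in kept.items() if s >= min_support}
        frequent.update(kept)
        # Generate: extend by one frequent singleton, check the subsets.
        singletons = [next(iter(c)) for c in frequent if len(c) == 1]
        level = set()
        for itemset in kept:
            for item in singletons:
                candidate = itemset | {item}
                if len(candidate) == len(itemset) + 1 and all(
                        frozenset(s) in kept
                        for s in combinations(candidate, len(itemset))):
                    level.add(candidate)
    return frequent

toy = [['a', 'b', 'c'], ['a', 'b', 'e'], ['c', 'd']]
result = apriori_sketch(toy, min_support=0.5)
print(sorted(''.join(sorted(itemset)) for itemset in result))
```

With a threshold of 0.5, only $a$, $b$, $c$ and the pair $(a,b)$ survive, so no candidate of size 3 is ever generated.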
