# Frequent Itemset Mining
**Prepared by Christian Alis**

Frequent Itemset Mining is also known as Frequent Pattern Mining. In practice, we usually don't stop at frequent itemset mining but continue on to association pattern mining.

Frequent Itemset Mining and Association Pattern Mining (aka Association Rule Mining) were born out of market basket analysis. Hence, many of the terms used are relevant to that original problem.

Consider a _database_ $\mathscr{T}$, which is an unordered set of $N$ transactions $\{T_1, T_2, ..., T_N\}$. Each transaction $T_i$ is a set of $n_i = \left|T_i\right|$ items drawn from a universe $\mathscr{U}$ of items. Instead of using a sequence of sets, we may represent $\mathscr{T}$ as an $N \times d$ matrix, where $d=\left|\mathscr{U}\right|$, in which 0 marks the absence of an item from a transaction and a positive number marks its presence.

In this notebook, we will only consider the presence of an item in a transaction and will ignore the actual amount (as long as it's nonzero, of course). Furthermore, in the real world, $d$ is large.

The following is an example of a database:

| tid |           items             |
|-----|-----------------------------|
|  1  | {bread, butter, milk}       |
|  2  | {eggs, milk, yogurt}        |
|  3  | {bread, cheese, eggs, milk} |
|  4  | {eggs, milk, yogurt}        |
|  5  | {cheese, milk, yogurt}      |

In Python, we may represent the database as a list of sets, making use of the fact that the `tid` is sequential.

**Exercise 1**

Give three examples of business problems that can be solved using Frequent Itemset Mining or Association Pattern Mining.

1. Choosing what to bundle in a supermarket
2. Recommending a course to a student, assuming students choose which courses to take and courses do not have precedence.
3. Detecting crimes from unusual patterns (least frequent "items")

In [1]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from numpy.testing import assert_equal, assert_array_equal

In [2]:
db = [{"bread", "butter", "milk"},
      {"eggs", "milk", "yogurt"},
      {"bread", "cheese", "eggs", "milk"},
      {"eggs", "milk", "yogurt"},
      {"cheese", "milk", "yogurt"}]

We may also represent the database as a sparse matrix.

**Exercise 2**

Create a function `to_sparse` that reads `db` and returns a SciPy sparse matrix (`csr_matrix`) representation of the database. Sort the columns in alphabetical order.

In [3]:
def to_sparse(db):
    flat_list = [item for sublist in [list(x) for x in db] for item in sublist]
    db_list = [list(x) for x in db]
    sorted_items = sorted(list(set(flat_list)))
    zeros = np.zeros((len(db), len(sorted_items)))  
    for t, t_elem in enumerate(db_list):
        for elem in t_elem:
            zeros[t,sorted_items.index(elem)] = 1 
    return csr_matrix(zeros)

In [4]:
db_sparse = to_sparse(db)
assert isinstance(db_sparse, csr_matrix)
assert_equal(db_sparse.shape, (5, 6))
assert_array_equal(db_sparse[0,:].toarray().squeeze(), 
                   [1., 1., 0., 0., 1., 0.])

An **itemset** $X$ is a subset of items, $X \subseteq \mathscr{U}$. A length $k = |X|$ itemset is also known as a **$k$-itemset**. For example, the nonempty subsets of {bread, butter, milk} are {bread}, {butter}, {milk}, {bread, butter}, {bread, milk}, {butter, milk}, {bread, butter, milk}. These consist of three 1-itemsets, three 2-itemsets and one 3-itemset.

For our course, we define the **support** $sup(X)$ of an itemset $X$ as the number of transactions having $X$ as a subset,
$$
sup(X) = \left|\{T_i \mid X \subseteq T_i\}\right|.
$$
This is known as the **absolute support**. We define **relative support** $relSup(X)$ as the fraction of transactions having $X$ as a subset,
$$
relSup(X) = \frac{sup(X)}{N}.
$$
For many references and libraries, support is defined as the _relative_ support so be careful when reading or using other materials. It's quite easy to convert between absolute support and relative support, though. To illustrate, the (absolute) support for {bread, milk} is 2 and its relative support is $2/5=0.4$.
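The two definitions can be checked directly on the example database. The following is a minimal sketch; the names `support`, `rel_support` and `example_db` are illustrative and are not used elsewhere in this notebook.

```python
def support(itemset, db):
    """Absolute support: number of transactions that contain every item."""
    itemset = set(itemset)
    return sum(1 for transaction in db if itemset <= transaction)


def rel_support(itemset, db):
    """Relative support: fraction of transactions that contain the itemset."""
    return support(itemset, db) / len(db)


# Copy of the example database so that this cell runs on its own.
example_db = [
    {"bread", "butter", "milk"},
    {"eggs", "milk", "yogurt"},
    {"bread", "cheese", "eggs", "milk"},
    {"eggs", "milk", "yogurt"},
    {"cheese", "milk", "yogurt"},
]

print(support({"bread", "milk"}, example_db))      # 2
print(rel_support({"bread", "milk"}, example_db))  # 0.4
```

Converting between the two conventions is just a multiplication or division by the number of transactions.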

We define **frequent itemsets** (or frequent patterns) as those itemsets that have a support at least equal to a given minimum support ($minsup$). The task of frequent itemset mining is to find all frequent itemsets in the database.

In this notebook, we look at four methods for finding the frequent itemsets:
* Brute force
* Apriori
* ECLAT
* FP-growth

## Brute force algorithm

The brute force approach to frequent itemset mining is to compute the support of each possible itemset of $\mathscr{U}$ then return only those itemsets whose support passes the threshold $minsup$.

**Exercise 3**

Create a function `brute` that accepts `db` and `minsup` then returns the frequent itemsets as a list of tuples with their support using the brute force algorithm. Sort the items in each tuple alphabetically. Sort the results, first by decreasing support, then by decreasing number of items then alphabetically. Use only the Python Standard Library.

In [5]:
def brute(db, minsup):
    from itertools import combinations
    flat_list = [item for sublist in [list(x) for x in db] for item in sublist]
    sorted_items = sorted(list(set(flat_list)))
    li = []
    length = list(range(len(sorted_items),0,-1))
    for l in length:
        for i in combinations(sorted_items, l):
            count = 0
            for t in db:
                if set(i).issubset(t):
                    count += 1
            if count > 0:
                li.append((i, count))
    sorted_items_list = [(sorted(x), sup) for x,sup in li]
    nonzero_itemmsets = [(tuple(elem[0]), elem[1]) for elem in sorted(sorted_items_list, key=lambda x: (-x[1], -len(x[0]), x[0]))]
    return [itemmset for itemmset in nonzero_itemmsets if itemmset[1]>=minsup]

In [6]:
fi_brute = brute(db, 2)
assert_equal(len(fi_brute), 11)
assert_equal(fi_brute[0], (('milk',), 5))

Although it took you a few loops, the brute force algorithm is relatively simple. It doesn't scale well though.

**Exercise 4**

Estimate the runtime for `brute`, relative to the runtime for the current case, when the number of items is 1000 but the number of transactions remains the same, and when the number of transactions is 1000 but the number of items remains the same.


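With $d$ items there are $2^d - 1$ nonempty candidate itemsets, and `brute` checks each one against all $N$ transactions, so the runtime scales roughly as $N(2^d - 1)$. Going from 6 to 1000 items raises the number of candidates from $2^6 - 1 = 63$ to $2^{1000} - 1 \approx 10^{301}$, which is infeasible. Going from 5 to 1000 transactions with the same 6 items only multiplies the runtime by about $1000/5 = 200$.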
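A rough way to quantify this is to count the subset checks that a brute force search performs. The sketch below assumes the cost is dominated by one subset check per candidate itemset per transaction; the helper names are hypothetical.

```python
def n_candidate_itemsets(d):
    """Number of nonempty itemsets over a universe of d items."""
    return 2 ** d - 1


def brute_subset_checks(d, n):
    """Subset checks done by a brute force search: one per candidate per transaction."""
    return n_candidate_itemsets(d) * n


# Current case: 6 items and 5 transactions.
base = brute_subset_checks(6, 5)

print(n_candidate_itemsets(6))              # 63
print(brute_subset_checks(6, 1000) / base)  # 200.0

# With 1000 items the ratio is an integer with hundreds of digits,
# so we only look at its order of magnitude.
ratio_items = brute_subset_checks(1000, 5) // base
print(ratio_items > 10 ** 299)              # True
```

The growth is linear in the number of transactions but exponential in the number of items.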

## Apriori algorithm (better for memory)

- Not totally parallelizable
- Raw data can be split across different "workers", where each split needs a new, adjusted absolute support (or just use relative support, which needs no adjustment)

Clearly, generating all the possible itemsets from a universe of items could take a while. Coupled with scanning through the entire dataset for each possible itemset, the brute force algorithm is simply untenable for large databases. To make progress in making frequent itemset mining scalable, one should realize that generating the itemsets can be thought of as doing a search. Figure 1 shows the search space for a 5-item database.


<img src="search-space.png" style="width: 25em"/>
<strong>Figure 1</strong>. The $k$-th layer corresponds to the $k$-itemsets for the 5-item database. A line connects a $k$-itemset to a $(k+1)$-itemset generated by adding a unique item to the former. The tree can be partitioned into frequent and infrequent itemsets using a border (blue broken line).


We now introduce the following important properties of itemsets:

* **Support monotonicity property**: The support of every subset $Y$ of itemset $X$ is at least equal to that of the support for $X$, $sup(Y) \geq sup(X), \forall Y \subseteq X$.
* **Downward closure property**: Every subset of a frequent itemset is also frequent.
* **Maximal frequent itemsets**: A frequent itemset is maximal at a given minimum support $minsup$ if it is frequent and no superset of it is frequent.
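These properties can be verified on the frequent itemsets of the example database at $minsup = 2$. The sketch below is only an illustration: `check_monotonicity` and `maximal_itemsets` are hypothetical helpers, and the list `frequent` is typed in by hand from the example database.

```python
def check_monotonicity(frequent):
    """True if every itemset has support >= that of each of its listed supersets."""
    sup = {frozenset(items): s for items, s in frequent}
    return all(sup[x] >= sup[y] for x in sup for y in sup if x <= y)


def maximal_itemsets(frequent):
    """Keep only the itemsets that have no frequent proper superset."""
    sets = [(frozenset(items), s) for items, s in frequent]
    maximal = []
    for items, s in sets:
        # `items < other` is a proper-subset test on frozensets.
        if not any(items < other for other, _ in sets):
            maximal.append((tuple(sorted(items)), s))
    return maximal


# Frequent itemsets of the example database with minsup = 2.
frequent = [
    (("milk",), 5),
    (("eggs", "milk"), 3),
    (("milk", "yogurt"), 3),
    (("eggs",), 3),
    (("yogurt",), 3),
    (("eggs", "milk", "yogurt"), 2),
    (("bread", "milk"), 2),
    (("cheese", "milk"), 2),
    (("eggs", "yogurt"), 2),
    (("bread",), 2),
    (("cheese",), 2),
]

print(check_monotonicity(frequent))  # True
print(maximal_itemsets(frequent))
# [(('eggs', 'milk', 'yogurt'), 2), (('bread', 'milk'), 2), (('cheese', 'milk'), 2)]
```

Because of downward closure, the three maximal itemsets are enough to recover every frequent itemset (though not their supports).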

Using the properties above, what we can do is to perform a breadth-first search on the tree starting from the 1-itemsets, **we check whether the itemset is frequent and we only go deeper if it is**. This is the gist of the Apriori algorithm which we formally write down below.

<img src="apriori.png" style="width: 40em" />

An implication of the Apriori algorithm is that we would be able to partition the search space by a border (Figure 1). The frequent itemsets along the border are the maximal frequent itemsets.
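The "only go deeper if it is frequent" idea is commonly implemented as a join-and-prune candidate generation step. The sketch below shows that textbook variant under the assumption that itemsets are stored as alphabetically sorted tuples; the function name `generate_candidates` is hypothetical, and the step may be worded differently in the algorithm figure.

```python
from itertools import combinations


def generate_candidates(frequent_k):
    """Build (k+1)-item candidates from frequent k-itemsets (sorted tuples).

    Join: merge two k-itemsets that share their first k-1 items.
    Prune: drop a candidate if any of its k-item subsets is not frequent
    (downward closure property).
    """
    frequent_k = sorted(frequent_k)
    frequent_set = set(frequent_k)
    k = len(frequent_k[0]) if frequent_k else 0
    candidates = []
    for a, b in combinations(frequent_k, 2):
        if a[:-1] == b[:-1]:
            candidate = a + (b[-1],)
            if all(sub in frequent_set for sub in combinations(candidate, k)):
                candidates.append(candidate)
    return candidates


# Frequent 2-itemsets of the example database with minsup = 2.
frequent_2 = [
    ("bread", "milk"),
    ("cheese", "milk"),
    ("eggs", "milk"),
    ("eggs", "yogurt"),
    ("milk", "yogurt"),
]
print(generate_candidates(frequent_2))  # [('eggs', 'milk', 'yogurt')]
```

Only the surviving candidates need their support counted against the database, which is where the savings over brute force come from.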

**Exercise 5**

Create a function `apriori` that accepts `db` and `minsup` then returns the frequent itemsets as a list of tuples with their support using the Apriori algorithm. Sort the items in each tuple alphabetically. Sort the results, first by decreasing support, then by decreasing number of items then alphabetically. Use only the Python Standard Library.

In [7]:
def apriori(db, minsup):
    from itertools import combinations
    li = []
    flat_list = [item for sublist in [list(x) for x in db] for item in sublist]
    sorted_items = sorted(list(set(flat_list)))
    for l in range(1, len(sorted_items)+1):
        scope=[]
        for i in combinations(sorted_items, l):
            count=0
            for t in db:
                if set(i).issubset(t):
                    count+=1
            if count >= minsup:
                li.append((tuple(sorted(i)), count))
                scope.append(i)

        sorted_items = set([item for sublist in [list(x) for x in scope] for item in sublist])
    return sorted(li, key=lambda x: (-x[1], -len(x[0]), x[0]))

In [8]:
db

[{'bread', 'butter', 'milk'},
 {'eggs', 'milk', 'yogurt'},
 {'bread', 'cheese', 'eggs', 'milk'},
 {'eggs', 'milk', 'yogurt'},
 {'cheese', 'milk', 'yogurt'}]

In [9]:
fi_apriori = apriori(db, 2)
assert_equal(len(fi_apriori), 11)
assert_equal(fi_apriori[0], (('milk',), 5))

**Exercise 6**

Create a function `apriori2` that accepts `db` and `minsup` then returns the maximal frequent itemsets as a list of tuples with their support using the Apriori algorithm. Sort the items in each tuple alphabetically. Sort the results, first by decreasing support, then by decreasing number of items then alphabetically. Use only the Python Standard Library.

In [10]:
def apriori2(db, minsup):
    from itertools import combinations
    li = []
    flat_list = [item for sublist in [list(x) for x in db] for item in sublist]
    
    sorted_items = sorted(list(set(flat_list)))
    maximal = []
    checker = []
    for l in range(1, len(sorted_items)+1):
        scope=[]
        for i in combinations(sorted_items, l):
            count=0
            for t in db:
                if set(i).issubset(t):
                    count+=1
            if count >= minsup:
                li.append((tuple(sorted(i)), count))
                scope.append(i)
        iter_set = [set(x) for x in scope]
        sorted_items = set([item for sublist in [list(x) for x in scope] for item in sublist])
#         print('-----------------------------------------')
#         print(f'list at iteration {l}\n:',iter_set)
#         print('-----')
#         print('checker:\n', checker)

        for set_ in checker:

            if not any([set_.issubset(i_iter) for i_iter in iter_set]):
                maximal.append(set_)
        checker = iter_set

#         print('new sorted items\n', sorted_items)

    apriori = sorted(li, key=lambda x: -x[1])
#     display(apriori)

    sorted_maximal = [tuple(sorted(m)) for m in maximal]
    return sorted([(x, dict(li)[x]) for x in sorted_maximal], key=lambda i: (-i[1], -len(i[0]), i[0]))

In [11]:
maxfi_apriori = apriori2(db, 2)
assert_equal(len(maxfi_apriori), 3)
assert_equal(maxfi_apriori[0], (('eggs', 'milk', 'yogurt'), 2))

In [12]:
maxfi_apriori

[(('eggs', 'milk', 'yogurt'), 2),
 (('bread', 'milk'), 2),
 (('cheese', 'milk'), 2)]

## ECLAT (better for speed)

 - not ideal to be parallelized since it will eat up a lot of resources

A problem of the Apriori algorithm is that one has to scan the entire database to compute the support of each candidate itemset. Equivalence Class Clustering and Bottom-up Lattice Traversal (ECLAT) avoids this problem by using a vertical database, which is a database where the item is the key and the values are the set of `tid`s where it is found. To illustrate, the vertical database representation of the example horizontal database above is

|  item  |       tids      |
|--------|-----------------|
| bread  | {1,3}           |
| butter | {1}             |
| milk   | {1, 2, 3, 4, 5} |
| eggs   | {2, 3, 4}       |
| yogurt | {2, 4, 5}       |
| cheese | {3, 5}          |

From the vertical database, the `tid` list of the itemset $X \cup Y$ is $tid(X \cup Y) = tid(X) \cap tid(Y)$. The support is simply $sup(X \cup Y) = |tid(X \cup Y)|$. Thus, the original horizontal database need not be scanned anymore as long as the vertical database is given.
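As a small illustration, the vertical database can be built in one pass and supports then come from set intersections. This is a sketch only: `to_vertical` and `example_db` are illustrative names, and `tid`s are numbered from 1 to match the table above.

```python
from collections import defaultdict


def to_vertical(db):
    """Map each item to the set of tids (1-based) of the transactions containing it."""
    vertical = defaultdict(set)
    for tid, transaction in enumerate(db, start=1):
        for item in transaction:
            vertical[item].add(tid)
    return dict(vertical)


# Copy of the example database so that this cell runs on its own.
example_db = [
    {"bread", "butter", "milk"},
    {"eggs", "milk", "yogurt"},
    {"bread", "cheese", "eggs", "milk"},
    {"eggs", "milk", "yogurt"},
    {"cheese", "milk", "yogurt"},
]

vertical = to_vertical(example_db)

# tid list and support of {eggs, yogurt} without rescanning example_db.
tids = vertical["eggs"] & vertical["yogurt"]
print(sorted(tids))  # [2, 4]
print(len(tids))     # 2
```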

ECLAT performs one pass over the database to create the vertical database then performs a depth-first search using the vertical database. The algorithm is shown below.

<img src="eclat.png" style="width: 40em" />
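One possible depth-first formulation is sketched below. It is a hedged illustration rather than a transcription of the figure: the name `eclat_dfs` is hypothetical, `tid`s are 0-based, and each recursion level only carries the tid lists that are still frequent.

```python
def eclat_dfs(db, minsup):
    """Depth-first frequent itemset search over a vertical database."""
    # One pass over the horizontal database to build item -> set of tids.
    vertical = {}
    for tid, transaction in enumerate(db):
        for item in transaction:
            vertical.setdefault(item, set()).add(tid)

    frequent_items = sorted(
        ((item, tids) for item, tids in vertical.items() if len(tids) >= minsup),
        key=lambda pair: pair[0],
    )
    result = []

    def extend(prefix, candidates):
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            result.append((itemset, len(tids)))
            # Next level: later items whose intersection is still frequent.
            narrower = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if len(common) >= minsup:
                    narrower.append((other, common))
            extend(itemset, narrower)

    extend((), frequent_items)
    return sorted(result, key=lambda x: (-x[1], -len(x[0]), x[0]))


example_db = [
    {"bread", "butter", "milk"},
    {"eggs", "milk", "yogurt"},
    {"bread", "cheese", "eggs", "milk"},
    {"eggs", "milk", "yogurt"},
    {"cheese", "milk", "yogurt"},
]
fi = eclat_dfs(example_db, 2)
print(len(fi))  # 11
print(fi[0])    # (('milk',), 5)
```

Extending only with later items keeps each itemset from being generated more than once.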

**Exercise 7**

Create a function `eclat` that accepts `db` and `minsup` then returns the frequent itemsets as a list of tuples with their support using the ECLAT algorithm. Sort the items in each tuple alphabetically. Sort the results, first by decreasing support, then by decreasing number of items then alphabetically. Use only the Python Standard Library.

In [13]:
def eclat(db, minsup):
    from itertools import combinations
    flat_list = [item for sublist in [list(x) for x in db] for item in sublist]
    sorted_items = sorted(list(set(flat_list)))
    # print(sorted_items)
    dict_ = {}
    for item in sorted_items:
        li = []
        for index, t in enumerate(db):
    #         print(t)
            if item in t:
                li.append(index)
        if len(set(li)) >= minsup:
            dict_[item] = set(li)
    final_dict = {}
    final_dict.update(dict_)
    li_to_comb = dict_.keys()
    for l in range(2, len(dict_) + 1):  # itemset size is bounded by the number of frequent items
        tempo_dict = {}
        for i in combinations(li_to_comb, l):
            if len(set.intersection(*[dict_[x] for x in i])) >= minsup:
    #             print(i, len(set.intersection(*[dict_[x] for x in i])))
                tempo_dict[i] = set.intersection(*[dict_[x] for x in i])
            final_dict.update(tempo_dict)
        li_to_comb = set([item for sublist in [list(x) for x in tempo_dict.keys()] for item in sublist])
        
#         print('li to comb', li_to_comb)

    return_dict = {k if isinstance(k, tuple) else (k,) : len(v) for k,v in final_dict.items()}
    return_tuple = sorted([(tuple(sorted(k)),v) for k,v in return_dict.items()], key=lambda x: (-x[1], -len(x[0]), x[0]))
    return return_tuple

In [14]:
fi_eclat = eclat(db, 2)
assert_equal(len(fi_eclat), 11)
assert_equal(fi_eclat[0], (('milk',), 5))

In [15]:
fi_eclat

[(('milk',), 5),
 (('eggs', 'milk'), 3),
 (('milk', 'yogurt'), 3),
 (('eggs',), 3),
 (('yogurt',), 3),
 (('eggs', 'milk', 'yogurt'), 2),
 (('bread', 'milk'), 2),
 (('cheese', 'milk'), 2),
 (('eggs', 'yogurt'), 2),
 (('bread',), 2),
 (('cheese',), 2)]

## FP-Growth Algorithm

 - slightly more efficient in candidate generation (faster)
 - parallelizable
 - YouTube videos that helped me: https://www.youtube.com/watch?v=VB8KWm8MXss&t=634s , https://www.youtube.com/watch?v=7oGz4PCp9jI

Although ECLAT improves the Apriori algorithm by avoiding multiple database scans, it still suffers from generating candidate itemsets that may not be in the database at all. FP-Growth, a pattern growth algorithm, avoids this problem by using projected databases.

FP-Growth, like any other pattern growth algorithm, uses an enumeration tree. This algorithm employs a particular kind of enumeration known as a prefix tree. Assume that there's a defined lexicographical order among items i.e., items can be sorted. For example, if we define the lexicographical order as alphabetical, then {milk, cheese} will be transformed to {cheese, milk} after sorting. A prefix tree is a tree where each node is an item and a higher-level node can only be connected to a lower-level node of later order (Figure 2).

<div style="width: 40em; margin: 0 auto">
    <img src="prefix-tree.png" style="width: 25em" />
    <strong>Figure 2</strong>. A higher-level prefix tree node can only be connected to a lower-level prefix tree node of later lexicographical order.
</div>

FP-Growth recursively scans the database to find the frequent items (not itemsets). Instead of scanning the entire database at each step, it scans only the transactions that contain the current prefix, with the prefix items removed. This subset is known as the projected database.

To illustrate, with $minsup=2$, FP-Growth will scan the database and find that {bread} is a frequent itemset. The resulting projected database would then be:

| tid |           items      |
|-----|----------------------|
|  1  | {butter, milk}       |
|  3  | {cheese, eggs, milk} |

The algorithm would repeat the process but scanning the projected database instead. It would find that butter has only a support of 1 hence {bread, butter} is not a frequent itemset. It would then look at the other items (cheese, eggs, milk, in order) but it would find that only milk has a support passing the $minsup$ threshold. Thus, it would add {bread, milk} as a frequent itemset. However, the resulting projected database is empty so it won't go deeper. Instead it would go up, back to the original dataset where it would find that butter is not a frequent itemset but cheese is so it would repeat the same process for the projected database of cheese and so on.
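The walk-through above can be sketched as a recursion over projected databases. Note that this is plain pattern growth on sets, not the compressed FP-tree structure that a full FP-Growth implementation would use. The names `project` and `pattern_growth` are hypothetical, and the sketch assumes alphabetical order as the lexicographical order, so a projection keeps only the items that come after the prefix item.

```python
from collections import Counter


def project(db, item):
    """Transactions containing `item`, restricted to the items ordered after it."""
    return [{other for other in t if other > item} for t in db if item in t]


def pattern_growth(db, minsup, prefix=()):
    """Return (itemset, support) pairs for all frequent itemsets extending `prefix`."""
    counts = Counter(item for t in db for item in t)
    result = []
    for item in sorted(counts):
        if counts[item] < minsup:
            continue  # infrequent item: do not go deeper
        itemset = prefix + (item,)
        result.append((itemset, counts[item]))
        # Recurse on the projected database of the extended prefix.
        result.extend(pattern_growth(project(db, item), minsup, itemset))
    return result


example_db = [
    {"bread", "butter", "milk"},
    {"eggs", "milk", "yogurt"},
    {"bread", "cheese", "eggs", "milk"},
    {"eggs", "milk", "yogurt"},
    {"cheese", "milk", "yogurt"},
]

bread_db = project(example_db, "bread")
print([sorted(t) for t in bread_db])
# [['butter', 'milk'], ['cheese', 'eggs', 'milk']]

fi = pattern_growth(example_db, 2)
print(len(fi))  # 11
print(fi[:2])   # [(('bread',), 2), (('bread', 'milk'), 2)]
```

Every itemset reported this way occurs in the database, since its support is counted directly from the transactions that contain its prefix.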

The FP-growth algorithm is summarized below.

<img src="fp-growth.png" style="width: 40em" />

**Exercise 8**

Create a function `fpgrowth` that accepts `db` and `minsup` then returns the frequent itemsets as a list of tuples with their support using the FP-Growth algorithm. Sort the items in each tuple alphabetically. Sort the results, first by decreasing support, then by decreasing number of items then alphabetically. Use only the Python Standard Library.

In [16]:
def fpgrowth(db, minsup):
    flat_list = [item for sublist in [list(x) for x in db] for item in sublist]

    from collections import Counter

    ### add to generated frequent itemsets
    to_add = sorted([(k,v) for k,v in Counter(flat_list).items() if v>=minsup], key=lambda x: (-x[1], x[0]))
    print('Frequent Items that met MinSup\n################')
    display(to_add)

    ### sort and filter transaction table using relevant items
    sorted_filtered=[sorted(itemset, key=lambda a: -dict(to_add)[a]) for itemset in [set([b[0] for b in to_add]).intersection(x) for x in db]]
    print('Sorted and Filtered Transaction table\n################')
    display(sorted_filtered)


    item_list = [x[0] for x in to_add][::-1]
    dict_ = {k:[] for k in item_list}

    for t in sorted_filtered:
        counter = Counter(t)

        for k in dict_:
            try:
                t.index(k)
            except:
                continue
            else:
                index_of_k = t.index(k)
                dict_[k].append({tuple(t[:index_of_k]):counter[k]})

    dict_ = {k:v for k,v in sorted(dict_.items(), key=lambda x: dict(to_add)[x[0]])}
    display(dict_)

    conditional_base ={}
    for k,v in dict_.items():
        final_dict = {}
        for elem in v:
            for key in elem.keys():

                if key in final_dict:
                    final_dict[key] = final_dict[key] + elem[key]
                else:
                    final_dict[key] = elem[key]

        conditional_base[k] = final_dict
    display(conditional_base)

    conditional_FP = {k:{} for k in conditional_base.keys()}
    for k,v in conditional_base.items():
        li = []
        for key in v.keys():
            for item in key:
                li.append(item)
        set_li = set(li)
        for x in set_li:
            count = 0
            for key in v.keys():
                if x in key:
                    count += v[key]
            if count >= minsup:
                conditional_FP[k][x] = count
    display(conditional_FP)    

    freq_itemset = []
    for k,v in conditional_FP.items():
        li = []
        for elem in v.keys():
            li.append(elem)
    #     print(li)

        from itertools import combinations
        for i in range(1, len(li)+1):
            for c in combinations(li,i):
    #             print(c)
                if len(c) > 1:
                    supp = min([v[a] for a in c])
                elif len(c) == 1:
                    supp = v[c[0]]

                list_ = list(c)
                list_.append(k)
                freq_itemset.append((tuple(list_), supp))
    print(freq_itemset,'\n')
    
   
    
    output = freq_itemset + [((x[0],), x[1]) for x in to_add]
    print('Final\n###################\n', sorted(output, key=lambda x: (-x[1], -len(x[0]), x[0])))
    return sorted(output, key=lambda x: (-x[1], -len(x[0]), x[0]))

In [17]:
db

[{'bread', 'butter', 'milk'},
 {'eggs', 'milk', 'yogurt'},
 {'bread', 'cheese', 'eggs', 'milk'},
 {'eggs', 'milk', 'yogurt'},
 {'cheese', 'milk', 'yogurt'}]

In [18]:
fi_fpgrowth = fpgrowth(db, 2)
assert_equal(len(fi_fpgrowth), 11)
assert_equal(fi_fpgrowth[0], (('milk',), 5))

Frequent Items that met MinSup
################


[('milk', 5), ('eggs', 3), ('yogurt', 3), ('bread', 2), ('cheese', 2)]

Sorted and Filtered Transaction table
################


[['milk', 'bread'],
 ['milk', 'yogurt', 'eggs'],
 ['milk', 'eggs', 'bread', 'cheese'],
 ['milk', 'yogurt', 'eggs'],
 ['milk', 'yogurt', 'cheese']]

{'cheese': [{('milk', 'eggs', 'bread'): 1}, {('milk', 'yogurt'): 1}],
 'bread': [{('milk',): 1}, {('milk', 'eggs'): 1}],
 'yogurt': [{('milk',): 1}, {('milk',): 1}, {('milk',): 1}],
 'eggs': [{('milk', 'yogurt'): 1}, {('milk',): 1}, {('milk', 'yogurt'): 1}],
 'milk': [{(): 1}, {(): 1}, {(): 1}, {(): 1}, {(): 1}]}

{'cheese': {('milk', 'eggs', 'bread'): 1, ('milk', 'yogurt'): 1},
 'bread': {('milk',): 1, ('milk', 'eggs'): 1},
 'yogurt': {('milk',): 3},
 'eggs': {('milk', 'yogurt'): 2, ('milk',): 1},
 'milk': {(): 5}}

{'cheese': {'milk': 2},
 'bread': {'milk': 2},
 'yogurt': {'milk': 3},
 'eggs': {'milk': 3, 'yogurt': 2},
 'milk': {}}

[(('milk', 'cheese'), 2), (('milk', 'bread'), 2), (('milk', 'yogurt'), 3), (('milk', 'eggs'), 3), (('yogurt', 'eggs'), 2), (('milk', 'yogurt', 'eggs'), 2)] 

Final
###################
 [(('milk',), 5), (('milk', 'eggs'), 3), (('milk', 'yogurt'), 3), (('eggs',), 3), (('yogurt',), 3), (('milk', 'yogurt', 'eggs'), 2), (('milk', 'bread'), 2), (('milk', 'cheese'), 2), (('yogurt', 'eggs'), 2), (('bread',), 2), (('cheese',), 2)]


# Association Pattern Mining

For many applications, frequent itemset analysis is not the final analysis performed but rather continues on to association pattern mining. Let us define a few terms first.

The **confidence** of an **association rule** $A \rightarrow B$ is the conditional probability that $B$ is in a transaction given that it contains $A$,
$$
conf(A \rightarrow B) = \Pr(B \subseteq T_i \mid A \subseteq T_i) = \frac{sup(A \cup B)}{sup(A)}.
$$
We say that $A$ is the **antecedent** and $B$ is the **consequent** of the rule. The objective of association pattern mining is to find the rules that have a confidence at least equal to a given minimum confidence $minconf$. **Lift** is the increase in the likelihood of $B$ in a transaction given that $A$ is already included,
$$
lift(A \rightarrow B) = \frac{conf(A \rightarrow B)}{relSup(B)}.
$$
A lift greater than 1 means $A$ and $B$ occur together more often than would be expected if they were independent (positive association), a value of 1 means there is no association between $A$ and $B$, and a lift of less than 1 means that $A$ and $B$ occur together less often than expected (negative association).
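As a worked example on the database from the first section, the sketch below computes confidence and lift for two rules. The helper names and `example_db` are illustrative only.

```python
def support(itemset, db):
    """Absolute support of an itemset."""
    itemset = set(itemset)
    return sum(1 for t in db if itemset <= t)


def confidence(antecedent, consequent, db):
    """conf(A -> B) = sup(A u B) / sup(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)


def lift(antecedent, consequent, db):
    """lift(A -> B) = conf(A -> B) / relSup(B)."""
    rel_sup = support(consequent, db) / len(db)
    return confidence(antecedent, consequent, db) / rel_sup


example_db = [
    {"bread", "butter", "milk"},
    {"eggs", "milk", "yogurt"},
    {"bread", "cheese", "eggs", "milk"},
    {"eggs", "milk", "yogurt"},
    {"cheese", "milk", "yogurt"},
]

# eggs -> yogurt: sup({eggs, yogurt}) = 2, sup({eggs}) = 3, relSup({yogurt}) = 3/5
print(round(confidence({"eggs"}, {"yogurt"}, example_db), 3))  # 0.667
print(round(lift({"eggs"}, {"yogurt"}, example_db), 3))        # 1.111

# bread -> eggs: sup({bread, eggs}) = 1, sup({bread}) = 2, relSup({eggs}) = 3/5
print(round(lift({"bread"}, {"eggs"}, example_db), 3))         # 0.833
```

The first rule has a lift above 1 (positive association) while the second has a lift below 1 (negative association).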

We find the association rules for a database of transactions by first looking for all frequent itemsets in the database with the minimum support. For all frequent itemsets, we generate the candidate association rules by setting one of the items as the consequent then all possible subsets of the remaining items as antecedent. The confidence of each candidate association rule is computed and only those that reach the minimum confidence are returned.

**Exercise 9**

Create a function `assoc` that accepts the list of frequent itemsets returned by the functions above, minimum confidence $minconf$ and number of transactions in the database then returns the list of association rules with $minconf$ minimum confidence as a list of dicts. Each dict corresponds to a rule and it should have the following keys: `antecedent`, `consequent`, `support`, `confidence` and `lift`. Sort the rules by decreasing lift.

In [19]:
### LT7
# def assoc(fi, minconf, db_size):
#     fi = sorted(fi, key=lambda x:len(x[0]))   
#     sup_dict = {fi_pair[0]: fi_pair[1] for fi_pair in fi}

#     # Generate candidate rules
#     cand_rules = set()
#     for fi_pair in fi:
#         if len(fi_pair[0]) > 1:
#             itemset = list(fi_pair[0])
#             for item in itemset:
#                 cons_fi = item
#                 ante_list = itemset.copy()
#                 ante_list.remove(item)
#                 r = len(ante_list)
#                 ante_list = (
#                     [ante for ante in itertools.combinations(ante_list, r)]
#                 )
#                 for ante_fi in ante_list:
#                     cand_rules.add((ante_fi, cons_fi, fi_pair[1]))
    
#     # Compute for the confidence and lift
#     rules = []
#     for cand_rule in cand_rules:
#         confidence = cand_rule[2] / sup_dict[cand_rule[0]]
#         if confidence < minconf:
#             continue
#         lift = confidence / (sup_dict[tuple(cand_rule[1])] / db_size)
#         rule_dict = {
#             'antecedent': cand_rule[0],
#             'consequent': cand_rule[1],
#             'support': cand_rule[2],
#             'confidence': confidence,
#             'lift': lift
#         }
#         rules.append(rule_dict)
                
#     return sorted(rules, key=lambda x: -x['lift'])

In [20]:
# ## LT Pat
# def assoc(fi, minconf, db_size):
#     revised_fi = [p[0] for p in fi if len(p[0]) >= 2]

#     antecedents = []
#     consequents = []

#     for itemset in revised_fi:
#         for item in itemset:
#             antecedent_tmp = list(itemset)
#             antecedent_tmp.remove(item)

#             antecedent_comb = []
#             for r in range(1, len(antecedent_tmp)+1):
#                 antecedent_comb += [c for c in combinations(antecedent_tmp, r)]

#             for subset in antecedent_comb:
#                 consequents += [item]
#                 antecedents += [subset]

#     candidates = list(set(tuple(zip(antecedents, consequents))))
#     dict_fi = dict(fi)

#     strong_rules = []
#     for candidate in candidates:
#         antecedent = candidate[0]
#         consequent = candidate[1]
#         support = dict_fi[tuple(sorted(antecedent + tuple(consequent)))]
#         confidence = support/dict_fi[tuple(sorted(antecedent))]
#         lift = confidence/(dict_fi[tuple(sorted(consequent))]/db_size)

#         measures = dict(zip(['antecedent','consequent', 'support', 'confidence', 'lift'],
#                             [antecedent, consequent, support, confidence, lift]))

#         if confidence >= minconf:
#             strong_rules += [measures]

#     return sorted(strong_rules, key=lambda x:
#                  (-x['lift'], x['consequent'], x['antecedent']))

In [21]:
def assoc(fi, minconf, db_size):
    from itertools import combinations
    
    fi = [(tuple(sorted(k)),v) for k,v in fi]
    
    dict_fi = {k:v for k,v in fi}

    return_dict = {}
    for key in dict_fi.keys():

        if len(key) > 1:
            print('\nKEY: ', key)
            for length in range(1, len(key)+1):
                for ante in combinations(key, length):
                    ante=list(ante)
                    cons_li = list(key)
                    for a in ante:
                        cons_li.remove(a)
                    cons = cons_li
                    if len(cons) > 0:
                        for con in combinations(cons, 1):
                            print(f'Ante:{ante}, cons:{list(con)}')
                            conf = dict_fi[tuple(sorted(ante + list(con)))] / dict_fi[tuple(ante)]
                            lift = conf/(dict_fi[con] / db_size)

                            if conf >= minconf:

                                print(f'\t CONF: {conf}')
                                print(f'\t LIFT: {lift}')
                                supp = dict_fi[key]
                                return_dict[(tuple(ante), con)] = (supp, conf, lift)
                            else:
                                print('DID NOT REACH MINCONF')

    output_list = []
    for k,v in return_dict.items():
        output_list.append(
        {'antecedent': tuple(sorted(k[0])), 
         'consequent': k[1][0], 
         'support': v[0], 
         'confidence': v[1], 
         'lift': v[2]}
        )
    return sorted(output_list, key=lambda x: (-x['lift'], x['consequent'], x['antecedent']))

In [22]:
rules = assoc([
    (('a',), 5),
    (('a', 'b'), 3),
    (('a', 'c'), 3),
    (('b',), 3),
    (('c',), 3),
    (('a', 'b', 'c'), 2),
    (('a', 'd'), 2),
    (('a', 'e'), 2),
    (('b', 'c'), 2),
    (('d',), 2),
    (('e',), 2)], 
    0.2, 5)
assert_equal(len(rules), 13)
assert_equal(rules[0],
             {'antecedent': ('a', 'c'),
              'consequent': 'b',
              'support': 2,
              'confidence': 0.6666666666666666,
              'lift': 1.1111111111111112})


KEY:  ('a', 'b')
Ante:['a'], cons:['b']
	 CONF: 0.6
	 LIFT: 1.0
Ante:['b'], cons:['a']
	 CONF: 1.0
	 LIFT: 1.0

KEY:  ('a', 'c')
Ante:['a'], cons:['c']
	 CONF: 0.6
	 LIFT: 1.0
Ante:['c'], cons:['a']
	 CONF: 1.0
	 LIFT: 1.0

KEY:  ('a', 'b', 'c')
Ante:['a'], cons:['b']
	 CONF: 0.6
	 LIFT: 1.0
Ante:['a'], cons:['c']
	 CONF: 0.6
	 LIFT: 1.0
Ante:['b'], cons:['a']
	 CONF: 1.0
	 LIFT: 1.0
Ante:['b'], cons:['c']
	 CONF: 0.6666666666666666
	 LIFT: 1.1111111111111112
Ante:['c'], cons:['a']
	 CONF: 1.0
	 LIFT: 1.0
Ante:['c'], cons:['b']
	 CONF: 0.6666666666666666
	 LIFT: 1.1111111111111112
Ante:['a', 'b'], cons:['c']
	 CONF: 0.6666666666666666
	 LIFT: 1.1111111111111112
Ante:['a', 'c'], cons:['b']
	 CONF: 0.6666666666666666
	 LIFT: 1.1111111111111112
Ante:['b', 'c'], cons:['a']
	 CONF: 1.0
	 LIFT: 1.0

KEY:  ('a', 'd')
Ante:['a'], cons:['d']
	 CONF: 0.4
	 LIFT: 1.0
Ante:['d'], cons:['a']
	 CONF: 1.0
	 LIFT: 1.0

KEY:  ('a', 'e')
Ante:['a'], cons:['e']
	 CONF: 0.4
	 LIFT: 1.0
Ante:['e'], cons:[

# pyFIM

Pandas and scikit-learn do not have a frequent itemset analysis module. We could instead use [pyFIM](http://www.borgelt.net/pyfim.html), which is a rather comprehensive FIM library.

Here is an example of its usage:

In [23]:
import fim
fim.apriori(db)

[(('milk',), 5),
 (('butter', 'milk', 'bread'), 1),
 (('butter', 'milk'), 1),
 (('butter', 'bread'), 1),
 (('butter',), 1),
 (('bread', 'milk'), 2),
 (('bread',), 2),
 (('bread', 'cheese', 'milk', 'eggs'), 1),
 (('bread', 'cheese', 'milk'), 1),
 (('bread', 'cheese', 'eggs'), 1),
 (('bread', 'cheese'), 1),
 (('bread', 'eggs', 'milk'), 1),
 (('bread', 'eggs'), 1),
 (('cheese', 'milk'), 2),
 (('cheese',), 2),
 (('cheese', 'yogurt', 'milk'), 1),
 (('cheese', 'yogurt'), 1),
 (('cheese', 'eggs', 'milk'), 1),
 (('cheese', 'eggs'), 1),
 (('yogurt', 'milk'), 3),
 (('yogurt',), 3),
 (('yogurt', 'eggs', 'milk'), 2),
 (('yogurt', 'eggs'), 2),
 (('eggs', 'milk'), 3),
 (('eggs',), 3)]

Note that in `fim`, a positive argument to `supp` (support) is treated as a percentage and a negative value is considered as absolute support.

In [24]:
fim.eclat(db, supp=-2)

[(('yogurt', 'milk'), 3),
 (('yogurt',), 3),
 (('eggs', 'yogurt', 'milk'), 2),
 (('eggs', 'yogurt'), 2),
 (('eggs', 'milk'), 3),
 (('eggs',), 3),
 (('bread', 'milk'), 2),
 (('bread',), 2),
 (('cheese', 'milk'), 2),
 (('cheese',), 2),
 (('milk',), 5)]

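The minimum length of an itemset can be specified with `zmin`. In the call below, `supp=20` on five transactions corresponds to an absolute support of 1, and `zmin=2` drops the single-item itemsets.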

In [25]:
fim.fpgrowth(db, supp=20, zmin=2)

[(('eggs', 'milk'), 3),
 (('yogurt', 'eggs', 'milk'), 2),
 (('yogurt', 'eggs'), 2),
 (('yogurt', 'milk'), 3),
 (('bread', 'eggs', 'milk'), 1),
 (('bread', 'eggs'), 1),
 (('bread', 'milk'), 2),
 (('cheese', 'eggs', 'milk'), 1),
 (('cheese', 'eggs'), 1),
 (('cheese', 'yogurt', 'milk'), 1),
 (('cheese', 'yogurt'), 1),
 (('cheese', 'bread', 'milk', 'eggs'), 1),
 (('cheese', 'bread', 'milk'), 1),
 (('cheese', 'bread', 'eggs'), 1),
 (('cheese', 'bread'), 1),
 (('cheese', 'milk'), 2),
 (('butter', 'milk', 'bread'), 1),
 (('butter', 'milk'), 1),
 (('butter', 'bread'), 1)]
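
The `supp` and `zmin` semantics described above can be cross-checked without pyFIM on a database this small. The following is a minimal sketch that counts every candidate itemset directly; `brute_force_frequent`, `min_count`, and `min_len` are hypothetical names introduced here for illustration and are not part of pyFIM. It assumes `db` is the five-transaction example from the start of the notebook.

```python
from itertools import combinations

# The example database from the start of the notebook, as a list of sets.
db = [
    {'bread', 'butter', 'milk'},
    {'eggs', 'milk', 'yogurt'},
    {'bread', 'cheese', 'eggs', 'milk'},
    {'eggs', 'milk', 'yogurt'},
    {'cheese', 'milk', 'yogurt'},
]


def brute_force_frequent(db, min_count, min_len=1):
    """Count every candidate itemset directly (hypothetical helper).

    Exponential in the number of items, so only suitable for toy data.
    """
    items = sorted(set().union(*db))
    frequent = {}
    for k in range(max(min_len, 1), len(items) + 1):
        for candidate in combinations(items, k):
            # Support count: number of transactions containing the candidate
            count = sum(1 for t in db if set(candidate) <= t)
            if count >= min_count:
                frequent[candidate] = count
    return frequent


# Absolute support of 2, as in fim.eclat(db, supp=-2)
frequent = brute_force_frequent(db, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (-kv[1], kv[0])):
    print(itemset, count)
```

Calling `brute_force_frequent(db, min_count=1, min_len=2)` mirrors the `supp=20, zmin=2` call, since 20% of five transactions is one transaction. The itemsets and counts should agree with the pyFIM outputs above, although the ordering of items and results may differ.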

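The `report` parameter controls which values are returned with each result. For example, with `target='r'` (rules) and `report='l'`, each rule is returned along with its lift as a fraction. In the output below, each tuple reads as (consequent, antecedent, lift), so `('eggs', ('yogurt',), 1.11...)` is the rule {yogurt} $\Rightarrow$ {eggs}. Here `conf=30` sets the minimum confidence to 30%.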

In [26]:
fim.fpgrowth(db, target='r', supp=20, conf=30, report='l')

[('milk', (), 1.0),
 ('milk', ('eggs',), 1.0),
 ('eggs', ('milk',), 1.0),
 ('eggs', (), 1.0),
 ('milk', ('yogurt',), 1.0),
 ('yogurt', ('milk',), 1.0),
 ('milk', ('yogurt', 'eggs'), 1.0),
 ('eggs', ('yogurt', 'milk'), 1.1111111111111112),
 ('yogurt', ('eggs', 'milk'), 1.1111111111111112),
 ('eggs', ('yogurt',), 1.1111111111111112),
 ('yogurt', ('eggs',), 1.1111111111111112),
 ('yogurt', (), 1.0),
 ('milk', ('bread',), 1.0),
 ('bread', ('milk',), 1.0),
 ('milk', ('bread', 'eggs'), 1.0),
 ('eggs', ('bread', 'milk'), 0.8333333333333334),
 ('bread', ('eggs', 'milk'), 0.8333333333333334),
 ('eggs', ('bread',), 0.8333333333333334),
 ('bread', ('eggs',), 0.8333333333333334),
 ('bread', (), 1.0),
 ('milk', ('cheese',), 1.0),
 ('cheese', ('milk',), 1.0),
 ('milk', ('cheese', 'eggs'), 1.0),
 ('eggs', ('cheese', 'milk'), 0.8333333333333334),
 ('cheese', ('eggs', 'milk'), 0.8333333333333334),
 ('eggs', ('cheese',), 0.8333333333333334),
 ('cheese', ('eggs',), 0.8333333333333334),
 ('milk', ('cheese
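
The lift values reported above can be reproduced by hand from the support counts. The sketch below assumes the usual definitions, confidence = supp(antecedent ∪ consequent) / supp(antecedent) and lift = confidence / relative support of the consequent; the helper names `support_count`, `confidence`, and `lift` are hypothetical and not part of pyFIM.

```python
# The example database from the start of the notebook, as a list of sets.
db = [
    {'bread', 'butter', 'milk'},
    {'eggs', 'milk', 'yogurt'},
    {'bread', 'cheese', 'eggs', 'milk'},
    {'eggs', 'milk', 'yogurt'},
    {'cheese', 'milk', 'yogurt'},
]
N = len(db)


def support_count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in db if set(itemset) <= t)


def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also have the consequent."""
    both = set(antecedent) | {consequent}
    return support_count(both) / support_count(antecedent)


def lift(antecedent, consequent):
    """Confidence divided by the relative support of the consequent."""
    return confidence(antecedent, consequent) / (support_count({consequent}) / N)


# Rules written as (consequent, antecedent), matching the output above
for cons, ante in [('eggs', ('yogurt',)), ('eggs', ('bread', 'milk')), ('milk', ('eggs',))]:
    print(ante, '=>', cons, 'lift =', lift(ante, cons))
```

For {yogurt} $\Rightarrow$ {eggs}, the confidence is 2/3 and the relative support of eggs is 3/5, giving a lift of 10/9. For {bread, milk} $\Rightarrow$ {eggs}, the confidence is 1/2, giving a lift of 5/6. Both agree with the reported values up to floating-point rounding.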

**Exercise 10**

Compare the runtime of Apriori, ECLAT and FP-growth when the number of items is large, and when the number of transactions is large.

In terms of runtime, FP-growth is generally the fastest, ECLAT comes second, and Apriori comes last. FP-growth avoids the explicit candidate generation that both ECLAT and Apriori require. ECLAT is in turn faster than Apriori because it only scans the projected database, whereas Apriori rescans the original database.
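
The comparison above can be checked empirically. The following is a minimal sketch of a timing harness, not a rigorous benchmark: `random_db` and `time_miners` are hypothetical helpers introduced here, the synthetic data is uniformly random (real transaction data is far more skewed), and the miners are passed in as callables so the harness itself does not depend on pyFIM.

```python
import random
import time


def random_db(n_transactions, n_items, max_len=10, seed=0):
    """Generate a synthetic database as a list of sets of integer item ids."""
    rng = random.Random(seed)
    return [
        set(rng.sample(range(n_items), rng.randint(1, min(max_len, n_items))))
        for _ in range(n_transactions)
    ]


def time_miners(miners, db, repeats=3, **kwargs):
    """Return the best wall-clock time (in seconds) of each miner on db.

    miners maps a label to a callable taking (db, **kwargs).
    """
    timings = {}
    for name, miner in miners.items():
        best = float('inf')
        for _ in range(repeats):
            start = time.perf_counter()
            miner(db, **kwargs)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings
```

Assuming pyFIM is installed and accepts integer items, the three algorithms could be compared with `time_miners({'apriori': fim.apriori, 'eclat': fim.eclat, 'fpgrowth': fim.fpgrowth}, random_db(10000, 50), supp=-5)`, once with a large `n_items` and once with a large `n_transactions`, as the exercise asks.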

**Exercise 11**

Suppose the owner of a store provided you their POS data (`/mnt/data/public/retaildata/Online Retail.csv`). Provide three suggestions to the owner of the store based on the results of FIM. More information about the dataset is available [here](https://archive.ics.uci.edu/ml/datasets/online+retail).

In [27]:
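import fim
# Load the POS data and drop rows with any missing value
df = pd.read_csv('/mnt/data/public/retaildata/Online Retail.csv').dropna()
# One transaction per invoice: the set of item descriptions on that invoice
db = df.groupby('InvoiceNo')['Description'].agg(lambda x: set(x))
# Mine itemsets that appear in at least 1% of the invoices
results_eclat = fim.eclat(db, supp=1, zmin=1)
# Sort by descending support, then descending itemset size, then item names
fim_ = sorted(results_eclat, key=lambda x: (-x[1], -len(x[0]), x[0]))
fim_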

[(('WHITE HANGING HEART T-LIGHT HOLDER',), 2013),
 (('REGENCY CAKESTAND 3 TIER',), 1884),
 (('JUMBO BAG RED RETROSPOT',), 1643),
 (('PARTY BUNTING',), 1399),
 (('ASSORTED COLOUR BIRD ORNAMENT',), 1385),
 (('LUNCH BAG RED RETROSPOT',), 1329),
 (('SET OF 3 CAKE TINS PANTRY DESIGN ',), 1218),
 (('POSTAGE',), 1194),
 (('LUNCH BAG  BLACK SKULL.',), 1073),
 (('PACK OF 72 RETROSPOT CAKE CASES',), 1041),
 (('SPOTTY BUNTING',), 1015),
 (('LUNCH BAG SPACEBOY DESIGN ',), 1001),
 (("PAPER CHAIN KIT 50'S CHRISTMAS ",), 990),
 (('LUNCH BAG CARS BLUE',), 989),
 (('NATURAL SLATE HEART CHALKBOARD ',), 984),
 (('HEART OF WICKER SMALL',), 972),
 (('JAM MAKING SET WITH JARS',), 965),
 (('LUNCH BAG PINK POLKADOT',), 951),
 (('LUNCH BAG SUKI DESIGN ',), 916),
 (('ALARM CLOCK BAKELIKE RED ',), 907),
 (('WOODEN PICTURE FRAME WHITE FINISH',), 894),
 (('JUMBO BAG PINK POLKADOT',), 884),
 (('BAKING SET 9 PIECE RETROSPOT ',), 881),
 (('JAM MAKING SET PRINTED',), 880),
 (('LUNCH BAG APPLE DESIGN',), 878),
 (('RECI

In [28]:
assoc_ex11 = assoc(fim_, 0.2, len(db))


KEY:  ('GREEN REGENCY TEACUP AND SAUCER', 'ROSES REGENCY TEACUP AND SAUCER ')
Ante:['GREEN REGENCY TEACUP AND SAUCER'], cons:['ROSES REGENCY TEACUP AND SAUCER ']
	 CONF: 0.7598908594815825
	 LIFT: 20.16983034915827
Ante:['ROSES REGENCY TEACUP AND SAUCER '], cons:['GREEN REGENCY TEACUP AND SAUCER']
	 CONF: 0.666267942583732
	 LIFT: 20.16983034915827

KEY:  ('JUMBO BAG PINK POLKADOT', 'JUMBO BAG RED RETROSPOT')
Ante:['JUMBO BAG PINK POLKADOT'], cons:['JUMBO BAG RED RETROSPOT']
	 CONF: 0.6266968325791855
	 LIFT: 8.464030867274575
Ante:['JUMBO BAG RED RETROSPOT'], cons:['JUMBO BAG PINK POLKADOT']
	 CONF: 0.3371880706025563
	 LIFT: 8.464030867274575

KEY:  ('ALARM CLOCK BAKELIKE GREEN', 'ALARM CLOCK BAKELIKE RED ')
Ante:['ALARM CLOCK BAKELIKE GREEN'], cons:['ALARM CLOCK BAKELIKE RED ']
	 CONF: 0.6625463535228677
	 LIFT: 16.20937550680533
Ante:['ALARM CLOCK BAKELIKE RED '], cons:['ALARM CLOCK BAKELIKE GREEN']
	 CONF: 0.5909592061742006
	 LIFT: 16.20937550680533

KEY:  ('LUNCH BAG PINK POLKA

Ante:['LUNCH BAG SPACEBOY DESIGN '], cons:['LUNCH BAG ALPHABET DESIGN']
	 CONF: 0.24775224775224775
	 LIFT: 8.32973087518542

KEY:  ('PACK OF 72 RETROSPOT CAKE CASES', 'PACK OF 72 SKULL CAKE CASES')
Ante:['PACK OF 72 RETROSPOT CAKE CASES'], cons:['PACK OF 72 SKULL CAKE CASES']
	 CONF: 0.23823246878001922
	 LIFT: 10.572756964457252
Ante:['PACK OF 72 SKULL CAKE CASES'], cons:['PACK OF 72 RETROSPOT CAKE CASES']
	 CONF: 0.496
	 LIFT: 10.572756964457252

KEY:  ('CHOCOLATE HOT WATER BOTTLE', 'HOT WATER BOTTLE KEEP CALM')
Ante:['CHOCOLATE HOT WATER BOTTLE'], cons:['HOT WATER BOTTLE KEEP CALM']
	 CONF: 0.3493635077793494
	 LIFT: 10.678204184054769
Ante:['HOT WATER BOTTLE KEEP CALM'], cons:['CHOCOLATE HOT WATER BOTTLE']
	 CONF: 0.3402203856749311
	 LIFT: 10.67820418405477

KEY:  ('JAM MAKING SET WITH JARS', 'SET OF 3 CAKE TINS PANTRY DESIGN ')
Ante:['JAM MAKING SET WITH JARS'], cons:['SET OF 3 CAKE TINS PANTRY DESIGN ']
	 CONF: 0.2559585492227979
	 LIFT: 4.66315287951879
Ante:['SET OF 3 CAKE TI

In [29]:
assoc_ex11

[{'antecedent': ('SET/6 RED SPOTTY PAPER PLATES',),
  'consequent': 'SET/6 RED SPOTTY PAPER CUPS',
  'support': 236,
  'confidence': 0.7261538461538461,
  'lift': 56.53808367071525},
 {'antecedent': ('SET/6 RED SPOTTY PAPER CUPS',),
  'consequent': 'SET/6 RED SPOTTY PAPER PLATES',
  'support': 236,
  'confidence': 0.8280701754385965,
  'lift': 56.53808367071525},
 {'antecedent': ('REGENCY TEA PLATE ROSES ',),
  'consequent': 'REGENCY TEA PLATE GREEN ',
  'support': 232,
  'confidence': 0.6823529411764706,
  'lift': 55.05967914438503},
 {'antecedent': ('REGENCY TEA PLATE GREEN ',),
  'consequent': 'REGENCY TEA PLATE ROSES ',
  'support': 232,
  'confidence': 0.8436363636363636,
  'lift': 55.05967914438502},
 {'antecedent': ("POPPY'S PLAYHOUSE KITCHEN",),
  'consequent': "POPPY'S PLAYHOUSE BEDROOM ",
  'support': 255,
  'confidence': 0.7306590257879656,
  'lift': 50.82546640199046},
 {'antecedent': ("POPPY'S PLAYHOUSE BEDROOM ",),
  'consequent': "POPPY'S PLAYHOUSE KITCHEN",
  'support':

Recommendations:

1. When purchasing from suppliers, negotiate discounts on items that are frequently bought together. Examples include:
   - `SET/6 RED SPOTTY PAPER PLATES` and `SET/6 RED SPOTTY PAPER CUPS`
   - `REGENCY TEA PLATE ROSES` and `REGENCY TEA PLATE GREEN`

2. Place items that are frequently bought together close to each other, ideally on the same shelf or in the same aisle. This makes it more likely that a customer adds the companion items to their cart along with the item they originally intended to buy.

3. Create bundles for consumers. The store can bundle frequent itemsets to increase sales: people who only intended to buy item A may also buy item B if the two are sold together. Using the previous example, the store could offer a "Red Spotty Bundle" consisting of `SET/6 RED SPOTTY PAPER PLATES` and `SET/6 RED SPOTTY PAPER CUPS`.
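
Bundle candidates like these can be shortlisted programmatically from the rule list. The sketch below assumes rules are dictionaries with the same keys as the `assoc_ex11` entries shown earlier (`antecedent`, `consequent`, `support`, `confidence`, `lift`); the four sample entries are copied from that output, while `bundle_candidates` and its thresholds are hypothetical and chosen only for illustration.

```python
# Sample entries copied from the assoc_ex11 output; the full list is longer.
rules = [
    {'antecedent': ('SET/6 RED SPOTTY PAPER PLATES',),
     'consequent': 'SET/6 RED SPOTTY PAPER CUPS',
     'support': 236, 'confidence': 0.7261538461538461, 'lift': 56.53808367071525},
    {'antecedent': ('SET/6 RED SPOTTY PAPER CUPS',),
     'consequent': 'SET/6 RED SPOTTY PAPER PLATES',
     'support': 236, 'confidence': 0.8280701754385965, 'lift': 56.53808367071525},
    {'antecedent': ('REGENCY TEA PLATE ROSES ',),
     'consequent': 'REGENCY TEA PLATE GREEN ',
     'support': 232, 'confidence': 0.6823529411764706, 'lift': 55.05967914438503},
    {'antecedent': ('REGENCY TEA PLATE GREEN ',),
     'consequent': 'REGENCY TEA PLATE ROSES ',
     'support': 232, 'confidence': 0.8436363636363636, 'lift': 55.05967914438502},
]


def bundle_candidates(rules, min_conf=0.5, min_lift=1.0):
    """Shortlist item groups whose strongest rule passes both thresholds.

    Rules over the same set of items (A => B and B => A) are merged, keeping
    the direction with the higher confidence. Results are sorted by lift.
    """
    best = {}
    for rule in rules:
        if rule['confidence'] < min_conf or rule['lift'] < min_lift:
            continue
        items = frozenset(rule['antecedent']) | {rule['consequent']}
        if items not in best or rule['confidence'] > best[items]['confidence']:
            best[items] = rule
    return sorted(best.items(), key=lambda kv: -kv[1]['lift'])


for items, rule in bundle_candidates(rules, min_conf=0.8):
    print(sorted(items), rule['support'], round(rule['lift'], 1))
```

Filtering on confidence and lift together avoids bundling items that are merely popular on their own: a high lift indicates that the items co-occur far more often than their individual frequencies would suggest.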
