# Lab: Frequent Itemsets
Data Mining 2022/2023   
Danny Plenge and Gosia Migut  
Revised by Aleksander Buszydlik

**WHAT** This *optional* lab consists of several programming exercises and insight questions. These exercises are meant to let you practice with the theory covered in: [Chapter 6][1] from "Mining of Massive Datasets" by J. Leskovec, A. Rajaraman, J. D. Ullman. <br>

**WHY** Practicing, both through programming and answering the insight questions, aims at deepening your knowledge and preparing you for the exam.  

**HOW** Follow the exercises in this notebook either on your own or with a friend. Use [StackOverflow][2]
to discuss the questions with your peers. For additional questions and feedback please consult the TAs during the assigned lab session. The answers to these exercises will not be provided.

[1]: http://infolab.stanford.edu/~ullman/mmds/ch6.pdf
[2]: https://stackoverflow.com/c/tud-cs/questions

#### Summary

In the following exercises you will work on implementing algorithms to detect frequent itemsets.
* Exercise 1: A-Priori algorithm
* Exercise 2 (optional): PCY algorithm

In addition, we will be comparing the efficiency of the two algorithms. 

In [1]:
# This installs the required library.
!pip install sortedcontainers



## Exercise 1: A-Priori algorithm

The A-Priori algorithm was introduced as a way to efficiently find association rules between items. During the lecture you learned how this knowledge may be very important to, for example, a store chain which would like to know about its customers' buying habits. A-Priori works well even with very large datasets that do not fit in memory because it finds the frequent itemsets iteratively, applying a bottom-up approach. This offers a significant improvement over naive approaches which try to verify for each candidate itemset whether it is frequent. As an example, if we know that watermelons aren't bought frequently, then it is also impossible that watermelons and pineapples are frequently bought together. In this exercise you will implement the A-Priori algorithm.
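
The watermelon argument is the monotonicity property: a basket that contains a pair also contains each item of that pair, so the support of a pair can never exceed the support of either item. A minimal sketch on made-up baskets (the basket contents and the `support` helper are illustrative assumptions, not part of the lab code):

```python
# Hypothetical toy baskets, used only for this illustration.
toy_baskets = [
    {"bread", "milk"},
    {"bread", "milk", "watermelon"},
    {"bread", "pineapple"},
    {"milk", "pineapple", "watermelon"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

pair = {"watermelon", "pineapple"}
pair_support = support(pair, toy_baskets)
item_supports = {item: support({item}, toy_baskets) for item in pair}

# The support of the pair is bounded by the support of each of its items.
print(pair_support, item_supports)
```

If either item falls below the support threshold, the pair can be discarded without ever being counted.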

The A-Priori algorithm consists of three phases (constructing candidates, counting them, and filtering them) that are repeated for increasing itemset sizes until the frequent itemsets of the chosen size are found. The steps of the A-Priori algorithm are given below:

1. Construct a set of candidate itemsets $C_k$
2. Go through the data and for each basket construct subsets of size $k$. For each of these subsets, increment the support value if a subset exists in $C_k$.
3. Filter the set of candidate itemsets to get the set of truly frequent itemsets. That is, verify if the support value of an itemset is equal to or larger than the support threshold.
4. Go to step 1 for $k = k + 1$. Repeat until you found frequent itemsets of the required size.

Below we define some helper functions that will be used throughout the assignment. Please do not modify them.

In [3]:
from sortedcontainers import SortedSet

def get_subsets(set1, k):
    result = SortedSet()
    
    set_list = set(set1)
    subset = set()
    get_subsets_(set_list, subset, k, result)
    return result

# This is a recursive helper function for get_subsets
def get_subsets_(set1, subset, subset_size, candidates):
    if subset_size == len(subset):
        candidates.add(frozenset(x for x in subset))
    else:
        for s in set1:
            subset.add(s)
            clone = set(set1)
            clone.remove(s)
            get_subsets_(clone, subset, subset_size, candidates)
            subset.remove(s)

# The Support Threshold
support_threshold = 3

baskets = list(set())
baskets.append(set("Cat and dog bites".lower().split(" ")))
baskets.append(set("Yahoo news claims a cat mated with a dog and produced viable offspring".lower().split(" ")))
baskets.append(set("Cat killer likely is a big dog".lower().split(" ")))
baskets.append(set("Professional free advice on dog training puppy training".lower().split(" ")))
baskets.append(set("Cat and kitten training and behavior".lower().split(" ")))
baskets.append(set("Dog & Cat provides dog training in Eugene Oregon".lower().split(" ")))
baskets.append(set("Dog and cat is a slang term used by police officers for a male female relationship".lower().split(" ")))
baskets.append(set("Shop for your show dog grooming and pet supplies".lower().split(" ")))

### Step 1: Implement `construct_candidates`

Implement the functionality of the `construct_candidates` function. It performs the first step of the process, constructing the set $C_k$ containing all candidate itemsets of size $k$ given the set $L_{k-1}$ of filtered candidate itemsets of size $k - 1$. For the initial case $k = 1$, where no filtered candidate set is present yet, it returns all sets of size 1. For larger $k$, it should check the union of every possible pair of itemsets in $L_{k-1}$. If the size of a union is $k$, then this union is a candidate itemset. Note that the size of the union may also be larger than $k$, in which case it is not a candidate.  
**Note:** This approach often creates more candidate itemsets than necessary but for the purpose of this exercise it will suffice.  
**Hint:** Remember that at this stage you should simply create a list of potential candidates (so you should not filter them yet).
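
To illustrate the note above, the sketch below builds candidates from unions and then applies the stricter check that every subset of size $k - 1$ is itself frequent. The names `union_candidates` and `prune` are hypothetical and the input is made up; the exercise itself only asks for the union step.

```python
from itertools import combinations

def union_candidates(frequent, k):
    """Unions of two frequent (k-1)-itemsets that have exactly k items."""
    return {a | b for a in frequent for b in frequent if len(a | b) == k}

def prune(candidates, frequent, k):
    """Keep a candidate only if all of its (k-1)-subsets are frequent."""
    return {
        c for c in candidates
        if all(frozenset(s) in frequent for s in combinations(c, k - 1))
    }

# Made-up frequent pairs: {b, c} and {a, d} are missing on purpose.
frequent_pairs = {frozenset("ab"), frozenset("ac"), frozenset("bd")}

candidates = union_candidates(frequent_pairs, 3)
pruned = prune(candidates, frequent_pairs, 3)
print(candidates, pruned)
```

Here both {a, b, c} and {a, b, d} pass the union step although each contains an infrequent pair, so neither can be frequent. The counting pass would reject them anyway, which is why the simpler union approach is sufficient for this exercise.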

In [3]:
def construct_candidates(baskets, filtered, k):
    """
    This function will create candidates for the A-Priori algorithm.
    :param baskets: A list of baskets containing the strings
    :param filtered: The set of filtered candidates from the last iteration
    :param k: The size of the required itemsets
    :return: A list of candidates (sets of strings)
    """
    candidates = list()
    
    # First iteration: every distinct item forms a candidate of size 1
    if filtered is None:
        for basket in baskets:
            for string in basket:
                s = {string}
                
                if s not in candidates:
                    candidates.append(s)
    
    else:
        # Create k-item combinations of itemsets from the filtered set
        for st1 in filtered:
            for st2 in filtered:
                un = st1.union(st2)
                if len(un) == k and un not in candidates:
                    candidates.append(un)
    
    return candidates

In [4]:
k = 1
filtered = None
test_baskets =  [set(), set("abc"), set("bcd"), set("aaa")]

test_candidates = construct_candidates(test_baskets, filtered, k)
expected_candidates = [{'b'}, {'c'}, {'a'}, {'d'}]

assert sorted(test_candidates, key=lambda x: str(x)) == sorted(expected_candidates, key=lambda x: str(x)), f"{test_candidates} != {expected_candidates}"

for candidate in test_candidates:
    assert candidate in expected_candidates, f"{candidate} is not in {expected_candidates}"
    
assert len(test_candidates) == len(expected_candidates), f"{test_candidates} != {expected_candidates}"

In [5]:
k = 2
filtered = {frozenset({"c"}), frozenset("b"), frozenset("a")}

test_candidates = construct_candidates(test_baskets, filtered, k)
expected_candidates = [{'c', 'a'}, {'c', 'b'}, {'a', 'b'}]

assert sorted(test_candidates, key=lambda x: str(x)) == sorted(expected_candidates, key=lambda x: str(x)), f"{test_candidates} != {expected_candidates}"

for candidate in test_candidates:
    assert candidate in expected_candidates, f"{candidate} is not in {expected_candidates}"
    
assert len(test_candidates) == len(expected_candidates), f"{test_candidates} != {expected_candidates}"

### Step 2: Implement `count_candidates`

Implement the functionality of the `count_candidates` function which performs the second step of the process.  
**Hint:** You can use the `get_subsets` function to create subsets of size $k$.
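
For intuition about what the size-$k$ subsets of a basket look like, here is a sketch that uses `itertools.combinations` from the standard library as a stand-in for `get_subsets`. The name `size_k_subsets` is hypothetical, and the lab's helper returns a `SortedSet` rather than a plain set.

```python
from itertools import combinations

def size_k_subsets(basket, k):
    """All subsets of `basket` with exactly k items, as frozensets."""
    return {frozenset(combo) for combo in combinations(basket, k)}

basket = {"cat", "and", "dog", "bites"}
pairs = size_k_subsets(basket, 2)

# A basket with n items has n * (n - 1) / 2 pairs, so 4 items give 6 pairs.
print(len(pairs))
print(frozenset({"cat", "dog"}) in pairs)
```

A candidate is supported by a basket exactly when it appears among these subsets.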

In [6]:
def count_candidates(baskets, candidates, k):
    """
    This function will count the candidates for the A-Priori algorithm.
    It will return a dictionary with the candidates as keys and corresponding amounts as values.
    :param baskets: A list of baskets containing the strings
    :param candidates: The list of candidates (sets of strings)
    :param k: The size of the required itemsets
    :return: A dictionary storing the amount for each unique candidate
    """
    candidates_count = dict()
    
    for b in baskets:
        # All subsets of size k that occur in this basket
        subsets = get_subsets(b, k)
        
        for cand in candidates:
            if cand in subsets:
                # Sets are not hashable, so use a frozenset as the dictionary key
                key = frozenset(cand)
                candidates_count[key] = candidates_count.get(key, 0) + 1

    return candidates_count

In [7]:
k = 1
filtered = None
test_baskets = [set(), set("abc"), set("bcd"), set("aaa")]

candidates = construct_candidates(test_baskets, filtered, k)
counted_candidates = count_candidates(test_baskets, candidates, k)
expected_counted_candidates = {frozenset({"b"}): 2, frozenset({"a"}): 2, frozenset({"c"}): 2, frozenset({"d"}): 1}

assert counted_candidates == expected_counted_candidates, f"{counted_candidates} != {expected_counted_candidates}"

In [8]:
k = 2
filtered = {frozenset({"c"}), frozenset("b"), frozenset("a")}
test_baskets = [set(), set("abc"), set("bcd")]
candidates = construct_candidates(test_baskets, filtered, k)
counted_candidates = count_candidates(test_baskets, candidates, k)
expected_counted_candidates = {frozenset({'b', 'c'}): 2, frozenset({'a', 'b'}): 1, frozenset({'a', 'c'}): 1}

assert counted_candidates == expected_counted_candidates, f"{counted_candidates} != {expected_counted_candidates}"

### Step 3: Implement `filter_candidates`

This next function should verify whether the number of occurrences stored for a candidate is equal to or larger than the support threshold.

In [9]:
def filter_candidates(candidates_count, support_threshold):
    """
    This function will filter the candidates for the A-Priori algorithm.
    :param candidates_count: A dictionary with the candidates as keys and corresponding amounts as values
    :param support_threshold: The chosen support threshold
    :return: A set representing the filtered candidate itemsets.
    """
    
    filtered_candidates = set()
    
    print(len(candidates_count))
    for cand,count in candidates_count.items():
        if count >= support_threshold:
            filtered_candidates.add(cand)
    
    return filtered_candidates

In [10]:
k = 1
filtered = None
test_baskets = [set(), set("abc"), set("bcd"), set("aaa")]
test_support_threshold = 2

candidates = construct_candidates(test_baskets, filtered, k)
counted_candidates = count_candidates(test_baskets, candidates, k)
filtered_candidates = filter_candidates(counted_candidates, test_support_threshold)
expected_filtered_candidates = {frozenset({"c"}), frozenset("a"), frozenset("b")}

assert filtered_candidates == expected_filtered_candidates, f"{filtered_candidates} != {expected_filtered_candidates}"

4


In [11]:
k = 2
test_baskets = [set(), set("abc"), set("bcd"), set("aaa")]
filtered = {frozenset({"c"}), frozenset("b"), frozenset("a")}

candidates = construct_candidates(test_baskets, filtered, k)
counted_candidates = count_candidates(test_baskets, candidates, k)
filtered_candidates = filter_candidates(counted_candidates, test_support_threshold)
expected_filtered_candidates = {frozenset({'b', 'c'})}

assert filtered_candidates == expected_filtered_candidates, f"{filtered_candidates} != {expected_filtered_candidates}"

3


### Step 4: Implement `get_frequent_sets`

Our last function implements the entire A-Priori algorithm by combining the methods we have created in previous steps. For each size from $1$ to $k$, it should:
1. construct candidate itemsets
2. count the occurrences of these itemsets
3. filter them and return truly frequent itemsets.  

In [12]:
def get_frequent_sets(baskets, support_threshold, k):
    """
    This function will create a set of frequent item sets by performing the entire A-Priori algorithm.
    :param baskets: A list of baskets containing the strings
    :param support_threshold: The chosen support threshold
    :param k: The size of the required itemsets
    :return: A set containing the frozensets of all the 'frequent items'
    """
    filtered_candidates = None
    
    for i in range(1, k+1):
        candidates = construct_candidates(baskets, filtered_candidates, i)
        candidates_count = count_candidates(baskets, candidates, i)

        filtered_candidates = filter_candidates(candidates_count, support_threshold)

    return filtered_candidates

In [13]:
k = 1
test_baskets = [set(), set("abc"), set("bcd")]
test_support_threshold = 2

filtered_candidates = get_frequent_sets(test_baskets, test_support_threshold, k)
expected_filtered_candidates = {frozenset({"c"}), frozenset("b")}

assert filtered_candidates == expected_filtered_candidates, f"{filtered_candidates} != {expected_filtered_candidates}"

4


In [14]:
k = 2
test_baskets = [set(), set("abc"), set("bcd")]

filtered_candidates = get_frequent_sets(test_baskets, test_support_threshold, k)
expected_filtered_candidates = {frozenset({"b", "c"})}

assert filtered_candidates == expected_filtered_candidates, f"{filtered_candidates} != {expected_filtered_candidates}"

4
1


### Step 5: Performing the A-Priori algorithm

Run the A-Priori algorithm using the function `get_frequent_sets` to verify whether your code works as expected.

In [15]:
print(sorted([sorted(list(x)) for x in get_frequent_sets(baskets, support_threshold, 1)]) )
print(sorted([sorted(list(x)) for x in get_frequent_sets(baskets, support_threshold, 2)]) )
print(sorted([sorted(list(x)) for x in get_frequent_sets(baskets, support_threshold, 3)]) )
assert sorted([sorted(list(x)) for x in get_frequent_sets(baskets, support_threshold, 1)]) == [['a'], ['and'], ['cat'], ['dog'], ['training']], "Incorrect A-Priori itemsets of size 1"
assert sorted([sorted(list(x)) for x in get_frequent_sets(baskets, support_threshold, 2)]) == [['a', 'cat'], ['a', 'dog'], ['and', 'cat'], ['and', 'dog'], ['cat', 'dog']], "Incorrect A-Priori itemsets of size 2"
assert sorted([sorted(list(x)) for x in get_frequent_sets(baskets, support_threshold, 3)]) == [['a', 'cat', 'dog'], ['and', 'cat', 'dog']], "Incorrect A-Priori itemsets of size 3"

46
[['a'], ['and'], ['cat'], ['dog'], ['training']]
46
9
[['a', 'cat'], ['a', 'dog'], ['and', 'cat'], ['and', 'dog'], ['cat', 'dog']]
46
9
4
[['a', 'cat', 'dog'], ['and', 'cat', 'dog']]
46
46
9
46
9
4


$\textbf{Question 1}$: What are the frequent doubletons in our case? If we want to compute frequent itemsets of size k, how many passes through the data do we need to do using the A-Priori algorithm?

Frequent doubletons are itemsets of size 2 whose support count is at least the support threshold. With `support_threshold = 3`, the run in Step 5 finds five of them: {a, cat}, {a, dog}, {and, cat}, {and, dog} and {cat, dog}.

To find frequent doubletons using the A-Priori algorithm, we need to perform two passes through the data. In the first pass, we count the occurrences of each item in the dataset and identify the frequent items. In the second pass, we use the frequent items from the first pass to generate candidate doubletons and count their occurrences. We then identify the frequent doubletons based on their support count.

To compute frequent itemsets of size k using the A-Priori algorithm, we need to perform k passes through the data. In each pass, we generate candidate itemsets of size k based on the frequent itemsets from the previous pass and count their occurrences. We then identify the frequent itemsets based on their support count. This approach is known as the "level-wise" search strategy, where we generate candidate itemsets of size k from frequent itemsets of size k-1.
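
The level-wise strategy can be sketched compactly while counting passes. This is a simplified stand-in that uses `itertools.combinations` rather than the lab's helper functions; `apriori_passes` and the toy baskets are assumptions made for illustration.

```python
from itertools import combinations

def apriori_passes(baskets, support_threshold, k):
    """Return (frequent itemsets of size k, number of passes over the baskets)."""
    frequent, passes = None, 0
    for size in range(1, k + 1):
        counts = {}
        passes += 1  # one full scan of the baskets per itemset size
        for basket in baskets:
            for combo in combinations(sorted(basket), size):
                itemset = frozenset(combo)
                # Count only itemsets whose (size - 1)-subsets are all frequent.
                if frequent is None or all(
                    itemset - {item} in frequent for item in itemset
                ):
                    counts[itemset] = counts.get(itemset, 0) + 1
        frequent = {s for s, c in counts.items() if c >= support_threshold}
    return frequent, passes

toy_baskets = [set("abc"), set("bcd"), set("abcd")]
frequent_pairs, passes = apriori_passes(toy_baskets, 2, 2)
print(sorted(sorted(p) for p in frequent_pairs), passes)
```

Each iteration of the outer loop reads all baskets once, so itemsets of size $k$ require $k$ passes.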

$\textbf{Question 2}$: An alternative would be to read through the baskets and immediately construct subsets of size $k$ and count how many times each of them occurred, thereby avoiding the calculation of frequent itemsets of size $1$ to $k - 1$. Why is this not feasible for larger datasets?

## Exercise 2 (extra practice): PCY algorithm

Now we will make a small improvement to the A-Priori algorithm and turn it into the Park-Chen-Yu (PCY) algorithm. PCY puts the memory that A-Priori leaves unused during the first pass to work: while counting single items it also hashes pairs into buckets, and the bucket counts are later used to restrict which pairs become candidates, that is, they affect the set $C_2$.

### Step 1: Implement `count_PCY_candidates`

Complete the implementation of the `count_PCY_candidates` function. It is very similar to the corresponding step of A-Priori. However, when iterating over the data with $k = 1$, you should also generate the subsets of size $k + 1 = 2$, hash them, and increment the value stored in a bucket where they are hashed.
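
A sketch of the first PCY pass: single items are counted exactly, while every pair is only hashed into a bucket counter. Python's built-in `hash` of strings is randomized per process unless `PYTHONHASHSEED` is fixed, so this sketch uses `zlib.crc32` to get reproducible bucket indices. That choice, and the names `pair_bucket` and `pcy_first_pass`, are assumptions and not part of the lab.

```python
import zlib
from itertools import combinations

def pair_bucket(pair, n_buckets):
    """Deterministic bucket index for a pair of strings."""
    key = "|".join(sorted(pair)).encode("utf-8")
    return zlib.crc32(key) % n_buckets

def pcy_first_pass(baskets, n_buckets):
    """Count single items exactly and pairs per bucket in one scan."""
    item_counts = {}
    buckets = [0] * n_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(basket), 2):
            buckets[pair_bucket(pair, n_buckets)] += 1
    return item_counts, buckets

toy_baskets = [set(), set("abc"), set("bcd"), set("aaa")]
item_counts, buckets = pcy_first_pass(toy_baskets, 4)
print(item_counts, sum(buckets))
```

The bucket counts sum to the total number of pairs over all baskets, because every pair increments exactly one bucket.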

In [1]:
def count_PCY_candidates(baskets, candidates, k, bucket_size, buckets):
    """
    This function will count the candidates for the PCY algorithm 
    :param baskets: A list of baskets containing the strings
    :param candidates: A list of candidates (strings)
    :param k: The size of the required itemsets
    :param bucket_size: The chosen bucket size
    :param buckets: The list of buckets
    :return: A dictionary showing the amount for each unique candidate
    """
        
    if k != 1:
        return count_candidates(baskets, candidates, k)
 
    for i in range(bucket_size):
        buckets.append(0)
    
    candidates_count = dict()
    
    for b in baskets:
        subsets = get_subsets(b, 1)
        for c in subsets:
            if frozenset(c) in candidates_count:
                candidates_count[frozenset(c)] = candidates_count[frozenset(c)] + 1
            else:
                candidates_count[frozenset(c)] = 1
                
    for b in baskets:
        subsets = get_subsets(b, 2)
        for c in subsets:
            # Hash each pair to a bucket and increment that bucket's count
            index = hash(frozenset(c)) % bucket_size
            buckets[index] += 1
                        
            
    print(sum(buckets))       

    return candidates_count

In [4]:
k = 1
bucket_size = 4
buckets = []
test_baskets = [set(), set("abc"), set("bcd"), set("aaa")]
candidates = [{'b'}, {'c'}, {'a'}, {'b'}, {'c'}, {'d'}, {'a'}]

counted_PCY_candidates = count_PCY_candidates(test_baskets, candidates, k, bucket_size, buckets)
expected_counted_PCY_candidates = {frozenset({'a'}): 2, frozenset({'c'}): 2, frozenset({'b'}): 2, frozenset({'d'}): 1}

assert counted_PCY_candidates == expected_counted_PCY_candidates, f"{counted_PCY_candidates} != {expected_counted_PCY_candidates}"
assert sum(buckets) == 6, "Not all of the subsets have been counted. 6 subsets should have been counted"

6


### Step 2: Implement `construct_PCY_candidates`
Next we will implement the `construct_PCY_candidates` function. Again, this implementation is very similar to that of A-Priori. However, for $k = 2$, before adding an itemset to the set of candidates, also test that the itemset hashes to a frequent bucket (i.e. a bucket with a count of at least `support_threshold`). If this is not the case, the itemset should be skipped.  
  
**Hint:** Only frozensets can be hashed. You can convert a set to a frozenset in the following way:  

```python
s = set()  
s = frozenset(s)
```
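
Between the passes, the bucket counts can be summarized as a bitmap of frequent buckets, which is all the second pass needs. A minimal sketch with made-up counts (the names `to_bitmap` and `is_candidate_pair` and the `bucket_of` callback are hypothetical):

```python
def to_bitmap(buckets, support_threshold):
    """One boolean per bucket: True if the bucket count reaches the threshold."""
    return [count >= support_threshold for count in buckets]

def is_candidate_pair(pair, frequent_items, bitmap, bucket_of):
    """PCY condition: both items frequent and the pair hashes to a frequent bucket."""
    return all(item in frequent_items for item in pair) and bitmap[bucket_of(pair)]

# Made-up bucket counts and a toy bucket function, for illustration only.
buckets = [5, 1, 3, 0]
bitmap = to_bitmap(buckets, support_threshold=3)

def bucket_of(pair):
    # Hypothetical hash: total number of characters in the pair, modulo 4.
    return sum(len(item) for item in pair) % len(buckets)

frequent_items = {"cat", "dog", "and", "a", "training"}

print(is_candidate_pair({"cat", "dog"}, frequent_items, bitmap, bucket_of))
print(is_candidate_pair({"cat", "training"}, frequent_items, bitmap, bucket_of))
```

A pair of frequent items is skipped as soon as its bucket is marked infrequent, without its exact count ever being stored.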

In [154]:
support_threshold = 3

def construct_PCY_candidates(baskets, filtered, k, bucket_size, buckets):
    """
    This function will create candidates for the PCY algorithm.
    :param baskets: A list of baskets containing the strings
    :param filtered: The filtered candidates from the last iteration
    :param k: The chosen size k
    :param bucket_size: The chosen bucket size
    :param buckets: The list of buckets
    :return: A list of candidates (sets of strings)
    """
    
    # First iteration: every single item is a candidate. The buckets are
    # filled by count_PCY_candidates during the first pass, so they are
    # left untouched here.
    if k == 1:
        return construct_candidates(baskets, None, k)
    
    # Only pairs get the extra bucket check; larger itemsets are built as in A-Priori.
    if k != 2:
        return construct_candidates(baskets, filtered, k)
    
    candidates = list()
    
    for st1 in filtered:
        for st2 in filtered:
            un = st1.union(st2)
            # Keep a pair only if it hashes to a frequent bucket.
            # Note: this relies on the global support_threshold defined above.
            if len(un) == k and un not in candidates and buckets[hash(frozenset(un)) % bucket_size] >= support_threshold:
                candidates.append(un)
    
    return candidates

### Step 3: Implement `get_PCY_frequent_sets`

Combine the two functions implemented previously (`construct_PCY_candidates` and `count_PCY_candidates`) into `get_PCY_frequent_sets` which will calculate the frequent itemsets of the PCY algorithm. You can use the `filter_candidates` function implemented for the A-Priori algorithm. Set the `bucket_size` to 256.

In [155]:
def get_PCY_frequent_sets(baskets, support_threshold, k, bucket_size):
    """
    This function will get the frequent item sets by performing the whole PCY algorithm
    :param baskets: A list of strings containg the baskets
    :param support_threshold: The chosen support threshold
    :param k: The chosen size k
    :param bucket_size: The chosen bucket size
    :return: A set containing the frozensets of all the 'frequent items'
    """
    filtered_candidates = []
    buckets = list()
    
    for i in range (1, k+1):
        candidate_items = construct_PCY_candidates(baskets, filtered_candidates, i, bucket_size, buckets)
        counted_candidates = count_PCY_candidates(baskets, candidate_items, i, bucket_size, buckets)
        filtered_candidates = filter_candidates(counted_candidates, support_threshold)
    
    
    return filtered_candidates

In [156]:
assert sorted([sorted(list(x)) for x in get_PCY_frequent_sets(baskets, support_threshold, 1, 256)]) == [['a'], ['and'], ['cat'], ['dog'], ['training']], "Incorrect PCY itemsets of size 1"
assert sorted([sorted(list(x)) for x in get_PCY_frequent_sets(baskets, support_threshold, 2, 256)]) == [['a', 'cat'], ['a', 'dog'], ['and', 'cat'], ['and', 'dog'], ['cat', 'dog']], "Incorrect PCY itemsets of size 2"
assert sorted([sorted(list(x)) for x in get_PCY_frequent_sets(baskets, support_threshold, 3, 256)]) == [['a', 'cat', 'dog'], ['and', 'cat', 'dog']], "Incorrect PCY itemsets of size 3"

293
46
293
46
7
293
46
7
4


$\textbf{Question 3}$: Compared to the A-Priori algorithm, what is the difference in the number of candidate sets checked by the PCY algorithm?

The two algorithms differ only in how they build $C_2$, the set of candidate pairs counted in the second pass.

In A-Priori, every pair of frequent items is a candidate, so with $m$ frequent items the second pass has to consider $\binom{m}{2}$ pairs, many of which turn out to be infrequent.

PCY adds a second condition. During the first pass, while counting single items, it also hashes every pair in every basket to a bucket and increments that bucket's count. A bucket whose count is below the support threshold cannot contain a frequent pair, so in the second pass a pair is a candidate only if both of its items are frequent and it hashes to a frequent bucket.

Therefore, PCY never checks more candidate pairs than A-Priori and usually checks fewer. In the recorded runs above, the number of counted candidate pairs printed for $k = 2$ dropped from 9 (A-Priori) to 7 (PCY). For $k > 2$ the candidates are the same for both algorithms, because they are built from the same frequent pairs.
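
The effect on $C_2$ can be checked on a toy example. The sketch below compares the candidate pairs of A-Priori with those of PCY for a given number of buckets. It uses `zlib.crc32` as a reproducible hash and made-up baskets, both of which are assumptions rather than the lab's setup.

```python
import zlib
from itertools import combinations

def bucket_index(pair, n_buckets):
    """Deterministic bucket index for a pair of strings."""
    key = "|".join(sorted(pair)).encode("utf-8")
    return zlib.crc32(key) % n_buckets

def candidate_pairs(baskets, support_threshold, n_buckets):
    """Return (A-Priori candidate pairs, PCY candidate pairs)."""
    item_counts, buckets = {}, [0] * n_buckets
    for basket in baskets:  # first pass: count items, hash pairs
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(basket), 2):
            buckets[bucket_index(pair, n_buckets)] += 1
    frequent = sorted(i for i, c in item_counts.items() if c >= support_threshold)
    # A-Priori: every pair of frequent items is a candidate.
    apriori = {frozenset(p) for p in combinations(frequent, 2)}
    # PCY: additionally require that the pair hashes to a frequent bucket.
    pcy = {p for p in apriori
           if buckets[bucket_index(p, n_buckets)] >= support_threshold}
    return apriori, pcy

toy_baskets = [set("abc"), set("abd"), set("acd"), set("bcd"), set("ab")]
apriori, pcy = candidate_pairs(toy_baskets, 3, 8)
print(len(apriori), len(pcy))
```

Whatever the hash function does, the PCY candidates are a subset of the A-Priori candidates, and a truly frequent pair such as {a, b} is never lost, because its own occurrences already make its bucket frequent.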

$\textbf{Question 4}$: What is the advantage of the PCY algorithm over the A-Priori algorithm?

The main advantage of the PCY (Park-Chen-Yu) algorithm over A-Priori is that it makes better use of main memory, which is the bottleneck when counting pairs:

Use of idle memory in the first pass: A-Priori only needs memory for the item counts during the first pass and leaves the rest unused. PCY fills that space with a hash table of bucket counts for pairs.

Fewer candidate pairs in the second pass: Between the passes, the bucket counts are reduced to a bitmap that marks the frequent buckets. A pair is counted in the second pass only if both of its items are frequent and it hashes to a frequent bucket, so fewer pair counts have to be kept in memory.

Better scalability: The pairs of frequent items are usually the largest set that has to be counted. Because PCY discards some of these pairs before counting them, it can handle datasets, or lower support thresholds, for which A-Priori would run out of memory in the second pass.

In summary, PCY needs the same number of passes as A-Priori but counts fewer candidate pairs, at the cost of hashing every pair during the first pass.


$\textbf{Question 5}$: What is the influence of the bucket size on the algorithm? For example, what would happen if the bucket size were too low?