# Lab: Frequent Itemsets
Data Mining 2018/2019 <br> 
Danny Plenge and Gosia Migut

**WHAT** This *optional* lab consists of several programming and insight exercises/questions. 
These exercises are meant to let you practice with the theory covered in [Chapter 6][2] of "Mining of Massive Datasets" by J. Leskovec, A. Rajaraman, J. D. Ullman. <br>

**WHY** Practicing, both through programming and answering the insight questions, aims at deepening your knowledge and preparing you for the exam. <br>

**HOW** Follow the exercises in this notebook either on your own or with a friend. Use [Mattermost][1]
to discuss questions with your peers. For additional questions and feedback, please consult the TAs at the assigned lab session. The answers to these exercises will not be provided.

[1]: https://mattermost.ewi.tudelft.nl/signup_user_complete/?id=ccffzw3cdjrkxkksq79qbxww7a
[2]: http://infolab.stanford.edu/~ullman/mmds/ch6.pdf

**SUMMARY**

In the following exercises you will implement algorithms that detect frequent itemsets
in a set of baskets, starting with the A-Priori algorithm. We will then make the
A-Priori algorithm more efficient using the PCY algorithm. Finally, we will be using the MapReduce framework
to parallelize the algorithm.

## Exercise 1: A-Priori algorithm

The A-Priori algorithm consists of three phases (steps 1 to 3 below) that are repeated for increasing itemset size $k$ until the frequent itemsets of the requested size have been found. The steps are described below:

1. Construct a set of candidate itemsets $C_k$
2. Go through the data and, for each basket, construct its subsets of size $k$. For each of these subsets, increment its support count if the subset exists in $C_k$.
3. Filter the set of candidate itemsets to get the set of truly frequent itemsets $L_k$. That is, keep only the candidates whose support count is equal to or larger than the support threshold.
4. Go to step 1 with $k = k + 1$, until you have found the frequent itemsets of the size that you requested.
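
The four steps above can be sketched end to end on a toy dataset. This is a minimal, illustrative sketch and not the reference solution for the functions below: the name `apriori_sketch`, the toy baskets, and the use of `itertools.combinations` (in place of `getSubsets`) are assumptions made for this example only.

```python
from itertools import combinations

def apriori_sketch(baskets, threshold, k):
    """Return the frequent itemsets of size k (illustrative sketch only)."""
    frequent = None
    for size in range(1, k + 1):
        # Step 1: build the candidate set C_size
        if frequent is None:
            candidates = {frozenset([item]) for basket in baskets for item in basket}
        else:
            candidates = {a | b for a in frequent for b in frequent
                          if len(a | b) == size}
        # Step 2: count the support of every candidate
        counts = {}
        for basket in baskets:
            for subset in combinations(sorted(basket), size):
                key = frozenset(subset)
                if key in candidates:
                    counts[key] = counts.get(key, 0) + 1
        # Step 3: keep only candidates that reach the support threshold
        frequent = {c for c, n in counts.items() if n >= threshold}
    return frequent

toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori_sketch(toy, 3, 2))  # the three pairs, each with support 3
print(apriori_sketch(toy, 3, 3))  # empty: {a, b, c} only has support 2
```

Each pass of the outer loop reads all baskets once, which is worth keeping in mind for Question 1.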

In [None]:
from sortedcontainers import SortedSet

# This function can be used to get the subsets from your baskets
# PLEASE DO NOT MODIFY THESE FUNCTIONS
def getSubsets(set1, k):
    # Is SortedSet even necessary here?
    result = SortedSet()
    
    setList = set(set1)
    subset = set()
    getSubsets_(setList, subset, k, result)
    return result

# This is a helper function for getSubsets
def getSubsets_(set1, subset, subsetSize, candidates):
    if subsetSize == len(subset):
        candidates.add(frozenset(x for x in subset))
    else:
        for s in set1:
            subset.add(s)
            clone = set(set1)
            clone.remove(s)
            getSubsets_(clone, subset, subsetSize, candidates)
            subset.remove(s)

# The Support Threshold
supportThreshold = 3


# Each basket is a set of lowercase words
baskets = []
baskets.append(set("Cat and dog bites".lower().split(" ")))
baskets.append(set("Yahoo news claims a cat mated with a dog and produced viable offspring".lower().split(" ")))
baskets.append(set("Cat killer likely is a big dog".lower().split(" ")))
baskets.append(set("Professional free advice on dog training puppy training".lower().split(" ")))
baskets.append(set("Cat and kitten training and behavior".lower().split(" ")))
baskets.append(set("Dog & Cat provides dog training in Eugene Oregon".lower().split(" ")))
baskets.append(set("Dog and cat is a slang term used by police officers for a male female relationship".lower().split(" ")))
baskets.append(set("Shop for your show dog grooming and pet supplies".lower().split(" ")))

### Step 1

Implement the functionality of the `constructCandidates` function. This function performs the first step of the process: constructing $C_k$, the candidate itemsets of size $k$, given the set $L_{k-1}$ of filtered candidate itemsets of size $k - 1$. For the initial case $k = 1$, where no filtered candidate set is present yet, it returns all sets of size 1. For larger $k$, it should check the union of every possible pair of itemsets in $L_{k-1}$. If the size of a union is $k$, then this union is a candidate itemset. Note that the size of the union could also be larger than $k$, in which case it is not a candidate.
  
  
**Note:** This very often creates more candidate itemsets than necessary, but for the purpose of this exercise, it will suffice.
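
To make the pairwise-union idea concrete, here is a small sketch on made-up data. The helper name `union_candidates` and the toy itemsets are hypothetical and are not part of the lab code. It shows both effects mentioned above: unions that grow larger than $k$ are rejected, and the same candidate can be produced several times.

```python
def union_candidates(frequent_prev, k):
    """Pairwise-union candidate generation (hypothetical helper for illustration)."""
    prev = list(frequent_prev)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            union = prev[i] | prev[j]
            if len(union) == k:  # unions with more than k items are not candidates
                candidates.append(union)
    return candidates

# Toy L_2: three overlapping pairs and one unrelated pair
L2 = [frozenset({"a", "b"}), frozenset({"a", "c"}),
      frozenset({"b", "c"}), frozenset({"d", "e"})]

C3 = union_candidates(L2, 3)
print(len(C3))   # 3: {a, b, c} is generated by three different pairs
print(set(C3))   # a single distinct candidate remains after deduplication
```

Any union involving `{"d", "e"}` has four items and is dropped, while `{"a", "b", "c"}` appears three times in the list.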

In [None]:
def constructCandidates(baskets, filtered, k):
    """
    This function will create candidates for the A-Priori Algorithm
    :param baskets: A list of baskets (sets of strings)
    :param filtered: The filtered candidates from the last iteration
    :param k: The size k
    :return: A list of candidates (sets of strings)
    """
    candidates = list()
    
    # First iteration
    if filtered is None:
        for b in baskets:
            for s in b:
                s1 = set()
                s1.add(s)
                candidates.append(s1)
    
    else:
        # Insert code here!
        pass
    
    return candidates  
 

### Step 2. 

Implement the functionality of the `countCandidates` function. This function
performs the second step of the process.  

**Hint:** For creating subsets of size k, you may use the `getSubsets` function.
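
As a rough illustration of what "subsets of size k" means for one basket, the sketch below uses `itertools.combinations` as a stand-in for `getSubsets` (an assumption for this example; the two should yield the same subsets, but the lab helper is the one to use in your solution). The toy `candidates` list is made up.

```python
from itertools import combinations
from math import comb

basket = {"cat", "and", "dog", "bites"}
k = 2

# All size-k subsets of the basket, as hashable frozensets
subsets = [frozenset(c) for c in combinations(sorted(basket), k)]
print(len(subsets), comb(len(basket), k))  # both equal C(4, 2) = 6

# A frozenset compares equal to a plain set with the same elements,
# so membership tests against a list of plain sets behave as expected
candidates = [{"cat", "dog"}, {"cat", "kitten"}]
matches = [s for s in subsets if s in candidates]
print(matches)
```

Only the subset containing `cat` and `dog` matches a candidate here, so only that candidate's count would be incremented for this basket.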

In [None]:
def countCandidates(baskets, candidates, k):
    """
    This function will count the candidates for the A-Priori Algorithm
    It will return a dictionary with the candidate as key and its count as value
    :param baskets: A list of baskets (sets of strings)
    :param candidates: The list of candidates (sets of strings)
    :param k: The chosen size k
    :return: A dictionary mapping each unique candidate to its count
    """
    candidatesCount = dict()
    
    for b in baskets:
        occurences = getSubsets(b, k)
        
        # Insert code here!
   
    return candidatesCount


### Step 3. 

Implement the `filterCandidates` function. It keeps only the candidates whose count is equal to or larger than the support threshold.

In [None]:
def filterCandidates(candidatesCount, supportThreshold):
    """
    This function will filter the candidates for the A-Priori Algorithm
    :param candidatesCount: A dictionary with the candidate as key and its count as value
    :param supportThreshold: The chosen support threshold
    :return: A set representing the filtered candidates
    """
    
    filteredCandidates = set()
    
    # Insert code here!

    return filteredCandidates


### Step 4. 

Implement the `getFrequentSets` function. This function implements the full process by combining the previously created functions. For each size from 1 to $k$, it should construct candidate itemsets, count these itemsets and filter them.  
  
  
**Note:** After the last iteration (size $k$), no further candidate sets need to be constructed.

In [None]:
def getFrequentSets(baskets, supportThreshold, k):
    """
    This function will get the frequent item sets by performing the whole A-Priori algorithm
    :param baskets: A list of baskets (sets of strings)
    :param supportThreshold: The chosen support threshold
    :param k: The chosen size k
    :return: A set containing the frozensets of all the 'frequent items'
    """
    filteredCandidates = None
    
    # Start with 1 as k has a minimum of 1
    for i in range(1,(k+1)):
        # Step 1
        candidates = None
        # Step 2
        countedCandidates = None
        # Step 3
        filteredCandidates = None
        
    return filteredCandidates


### Step 5.

Run the A-Priori algorithm by using the function `getFrequentSets`.

In [None]:
# Uncomment to try out!
#print(getFrequentSets(baskets, supportThreshold, 1))
#print(getFrequentSets(baskets, supportThreshold, 2))
#print(getFrequentSets(baskets, supportThreshold, 3))

### Expected output

Expected output for candidates with k=3:  
`[{'a', 'and', 'cat'}, {'a', 'dog', 'cat'}, {'a', 'dog', 'cat'}, {'dog', 'and', 'cat'}, {'dog', 'and', 'cat'}, {'a', 'dog', 'and'}, {'a', 'dog', 'cat'}, {'dog', 'and', 'cat'}]`  
  
Expected output for counted candidates with k=3:  
`{frozenset({'dog', 'and', 'cat'}): 3, frozenset({'dog', 'a', 'cat'}): 3, frozenset({'dog', 'and', 'a'}): 2, frozenset({'and', 'a', 'cat'}): 2}`  
  
Expected output for filtered candidates with k=3:  
`{frozenset({'and', 'cat', 'dog'}), frozenset({'a', 'cat', 'dog'})}`

### Question 1

What are the frequent doubletons? If we want to compute frequent itemsets of size k, how many passes through the data do we need to do using the A-Priori algorithm?

### Question 2

An alternative would be to read through the baskets and immediately construct subsets of size k and count how many times each occurred, thereby avoiding calculating the frequent itemsets of size 1 to k − 1. Why is this not feasible for larger datasets?

## Exercise 2: PCY algorithm

Next we will make a small adjustment to the A-Priori algorithm, which leads to the PCY
algorithm. The PCY algorithm affects which pairs are chosen as candidate itemsets, that is, it affects $C_2$.

### Step 1  

Complete the implementation of the `countPCYCandidates` function. The implementation is very similar to `countCandidates` in the A-Priori algorithm. However, when iterating over the data during $k = 1$, you should also generate subsets of size $k + 1 = 2$, hash these subsets and increment the value in the bucket array to which they hash.
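
The first PCY pass can be sketched as follows. This is only an illustration under stated assumptions: the function name `pcy_first_pass`, the parameter `num_buckets`, the toy baskets, and the choice of Python's built-in `hash` modulo the number of buckets as the hash function are all made up for this example and need not match your solution.

```python
from itertools import combinations

def pcy_first_pass(baskets, num_buckets):
    """Count single items and hash every pair into a bucket (illustrative sketch)."""
    item_counts = {}
    buckets = [0] * num_buckets
    for basket in baskets:
        # Ordinary A-Priori counting for k = 1
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        # PCY addition: hash each pair and increment its bucket
        for pair in combinations(sorted(basket), 2):
            buckets[hash(frozenset(pair)) % num_buckets] += 1
    return item_counts, buckets

toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
item_counts, buckets = pcy_first_pass(toy, 8)
print(item_counts)
print(sum(buckets))  # 5 pairs were hashed in total (3 + 1 + 1)
```

Which bucket a given pair lands in depends on the hash function, so only the total is predictable here. Several pairs may share a bucket, which is why a bucket count is an upper bound on the support of each pair that hashes to it.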

In [None]:
def countPCYCandidates(baskets, candidates, k, bucketSize, buckets):
    """
    This function will count the candidates for the PCY algorithm
    :param baskets: A list of baskets (sets of strings)
    :param candidates: The list of candidates (sets of strings)
    :param k: The chosen size k
    :param bucketSize: The chosen bucket size (number of buckets)
    :param buckets: The list of buckets
    :return: A dictionary mapping each unique candidate to its count
    """
        
    if k != 1:
        return countCandidates(baskets, candidates, k)
 
    for i in range(bucketSize):
        buckets.append(0)
    
    candidatesCount = dict()

    # Insert code here!
    
    return candidatesCount

### Step 2. 
Next we will be implementing the `constructPCYCandidates` function. Again, this implementation is very similar to `constructCandidates` in the A-Priori algorithm. However, for $k = 2$, before adding an itemset to the set of candidates, also test that the itemset hashes to a frequent bucket (i.e. a bucket with a count of at least `supportThreshold`). If this is not the case, the itemset should be skipped.  
  
**Note**: a mutable `set` cannot be hashed, but a `frozenset` can. You can convert a set to a frozenset the following way:  
```
s = set()  
s = frozenset(s)
```
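
The frequent-bucket test for a pair might look like the sketch below. The helper name `hashes_to_frequent_bucket` and the toy bucket array are hypothetical, and the example assumes the same hash function (built-in `hash` modulo the number of buckets) was used when the buckets were filled.

```python
def hashes_to_frequent_bucket(pair, buckets, threshold):
    """True if the pair's bucket count reaches the threshold (hypothetical helper)."""
    index = hash(frozenset(pair)) % len(buckets)
    return buckets[index] >= threshold

# Simulate a first pass in which the pair {cat, dog} was seen three times
buckets = [0] * 8
pair = {"cat", "dog"}
for _ in range(3):
    buckets[hash(frozenset(pair)) % len(buckets)] += 1

print(hashes_to_frequent_bucket(pair, buckets, 3))  # True
print(hashes_to_frequent_bucket(pair, buckets, 4))  # False
```

Python randomizes string hashes per process, so the bucket index of a pair can differ between runs. Within a single run it is stable, which is all the algorithm needs.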

In [None]:
supportThreshold = 3

def constructPCYCandidates(baskets, filtered, k, bucketSize, buckets):
    """
    This function will create candidates for the PCY algorithm
    :param baskets: A list of baskets (sets of strings)
    :param filtered: The filtered candidates from the last iteration
    :param k: The chosen size k
    :param bucketSize: The chosen bucket size (number of buckets)
    :param buckets: The list of buckets
    :return: A list of candidates (sets of strings)
    """
    
    candidates = list()
    
    # On first iteration (k=1) just append sets to candidates list
    if filtered is None:
        for b in baskets:
            for s in b:
                s1 = set()
                s1.add(s)
                candidates.append(s1)
    
    # On second iteration (k=2) check if itemset hashes to a frequent bucket
    # This will not work anymore if k > 2
    else:
        # Insert code here!
        pass
        
    return candidates

### Step 3. 

Use the `constructPCYCandidates` and `countPCYCandidates` functions to calculate the frequent itemsets by implementing the `getPCYFrequenSets` function, reusing the `filterCandidates` function of the A-Priori algorithm. You can set `bucketSize` to 256.

In [None]:
def getPCYFrequenSets(baskets, supportThreshold, k, bucketSize):
    """
    This function will get the frequent item sets by performing the whole PCY algorithm
    :param baskets: A list of baskets (sets of strings)
    :param supportThreshold: The chosen support threshold
    :param k: The chosen size k
    :param bucketSize: The chosen bucket size
    :return: A set containing the frozensets of all the 'frequent items'
    """
    filteredCandidates = None
    buckets = list()
    
    # Start with 1 as k has a minimum of 1
    for i in range(1,(k+1)):
        # Step 1
        candidates = None
        # Step 2
        countedCandidates = None
        # Step 3
        filteredCandidates = None
    
    return filteredCandidates

getPCYFrequenSets(baskets, supportThreshold, 2, 256)

### Expected output

Expected output for filteredCandidates with k=2 and bucketSize=256:  
`{frozenset({'and', 'cat'}),
 frozenset({'a', 'cat'}),
 frozenset({'cat', 'dog'}),
 frozenset({'and', 'dog'}),
 frozenset({'a', 'dog'})}`

### Question 3

Compared to the A-Priori algorithm, what is the difference in the number of candidate
sets that the PCY algorithm tests?

### Question 4

What is the advantage of the PCY algorithm over the A-Priori algorithm?

### Question 5

What is the influence of the number of buckets (`bucketSize`)? For example, what would happen if the
number of buckets were too low?