# Foundations of Data Science (GDW) 2023



# Exercise IV: Sets & Frequent Item Set Mining

In this week's exercise, we will take a look at mining frequent itemsets with the Apriori algorithm.

## Part 1: (Frozen) Sets

### Basic introduction
Sets in Python (`set` objects) follow the mathematical definition: a set is an unordered collection of objects, each of which may occur at most once. Thus, we cannot refer to a particular object in a set by an index, as in a list, or by a key, as in a dictionary.

Python distinguishes two set types:
- `set` and
- `frozenset`

Objects of type `set` are mutable, whereas `frozenset` objects are immutable and hashable. A `frozenset` may therefore act as a key in a
dictionary, or become an element of another set.


In [None]:
a = { 1, 2, 3, 2, 3 }
print(a)
type(a)

In [None]:
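capitals = {"Germany": "Berlin", "France": "Paris", "England": "London"}
# iterating over a dict yields its keys, so the frozenset holds the country names
frozencs = frozenset(capitals)
print(frozencs)
# frozensets are immutable: uncommenting the next line raises an AttributeError
#frozencs.add("Barcelona")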
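
Because a `frozenset` is hashable, it can be used where a plain `set` cannot. The following minimal sketch illustrates both uses mentioned above; the names `support_counts` and `family` are purely illustrative and not part of the exercise.

```python
# A frozenset is hashable, so it can be a dictionary key ...
support_counts = {frozenset("ab"): 2, frozenset("c"): 3}
print(support_counts[frozenset("ba")])  # element order does not matter

# ... or an element of another set.
family = {frozenset("ab"), frozenset("ba"), frozenset("c")}
print(len(family))  # the two equal frozensets collapse into one element

# A mutable set is not hashable, so using it as a set element fails.
try:
    bad = {set("ab")}
except TypeError as err:
    print("TypeError:", err)
```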

A common application of sets is to count how often each unique value occurs.

In [None]:
pangram = "A wizard's job is to vex chumps quickly in fog."
for c in sorted(set(pangram)):
    print('{}:{}x'.format(c,pangram.count(c)))
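
The loop above rescans the whole string once per unique character. As an alternative sketch, `collections.Counter` from the standard library collects the same counts in a single pass; the variable name `counts` is an assumption made for this example.

```python
from collections import Counter

pangram = "A wizard's job is to vex chumps quickly in fog."
counts = Counter(pangram)  # maps each character to its number of occurrences
for c in sorted(counts):
    print('{}:{}x'.format(c, counts[c]))
```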

The next code block lists some important set operations.

In [None]:
s1 = set(range(5))
s2 = set(range(3, 10))
s3 = set(range(10))

print("union" , end=" "); print(s1 | s2)
print("union" , end=" "); print(s1.union(s2))
print("intersection", end = " ") ; print(s1 & s2)
print("intersection", end = " ") ; print(s1.intersection(s2))
print("difference", end=" ") ; print(s1 - s2)
print("difference", end=" ") ; print(s1.difference(s2))

# comparison
print(s2.issubset(s3))
print(s3.issuperset(s1))

# clear a set
s3.clear()
print(s3)

# add an element
s1.add(9)
print(s1)

## Part 2: Frequent Itemset Mining & Apriori

With these functions, we can implement Apriori frequent itemset mining, starting from the source code below.

### Task 2.1
Extend the source code below into an Apriori algorithm on `D`, using a threshold of 2.

*Hint: Use `frozenset` objects to include sets as elements of other sets.*

In [None]:
def frequency(itemset, D):
    return sum([1 for T in D if itemset.issubset(T)])

I = 'abcdef'
D = [set(x) for x in ['abc', 'acf', 'abce', 'de']]

In [None]:
# add your code here

A lattice is a partially ordered set in which every two elements have both
- a supremum (least upper bound) and
- an infimum (greatest lower bound).

Here, we make use of two properties of the itemset *lattice*:

- $X \subseteq Y \land \mathrm{supp}(Y) = t \Rightarrow \mathrm{supp}(X) \geq t$
- $X \supseteq Y \land \mathrm{supp}(Y) < \mathit{min}_t \Rightarrow \mathrm{supp}(X) < \mathit{min}_t$
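
Both properties say that support can only stay the same or shrink when items are added to an itemset. The following sketch checks this on the transactions `D` from Task 2.1, reusing the `frequency` function from above; the itemsets `X`, `Y`, and `Z` are chosen only for illustration.

```python
def frequency(itemset, D):
    # number of transactions that contain every item of `itemset`
    return sum(1 for T in D if itemset.issubset(T))

D = [set(x) for x in ['abc', 'acf', 'abce', 'de']]

# A subset is at least as frequent as any of its supersets.
X = frozenset('ac')   # X is a subset of Y
Y = frozenset('abc')
print(frequency(X, D), frequency(Y, D))
assert frequency(X, D) >= frequency(Y, D)

# An infrequent itemset cannot have a frequent superset.
min_t = 2
Z = frozenset('d')    # contained in only one transaction
assert frequency(Z, D) < min_t
assert frequency(Z | {'e'}, D) < min_t
```

The second property is what allows Apriori to prune: once an itemset falls below the threshold, none of its supersets needs to be counted.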

For data mining in Python, the package `pandas` provides a convenient data structure called a *data frame*.

The Apriori algorithm, however, is part of the `mlxtend` package. You can install it in
JupyterLab by running

`!pip install mlxtend`

In [None]:
!pip install mlxtend

With these two packages, you could perform frequent set mining as follows.

In [None]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules

D = [set(x) for x in ['abc', 'acf', 'abce', 'de']]

df = pd.DataFrame(D)

items = ['a', 'b', 'c', 'd', 'e', 'f']
# We convert the data to one-hot encoding
def onehot_encode(df, items):
    itemset = set(items)
    encoded_vals = []
    for _, row in df.iterrows():
        rowset = set(row)
        labels = {}
        uncommons = list(itemset - rowset)
        commons = list(itemset.intersection(rowset))
        for uc in uncommons:
            labels[uc] = 0
        for com in commons:
            labels[com] = 1
        encoded_vals.append(labels)
    return encoded_vals

ohe_items = onehot_encode(df, items)
ohe_df = pd.DataFrame(ohe_items)
print(ohe_df)

# We know the data is boolean, so we can explicitly declare it as such
freq_items = apriori(ohe_df.astype('bool'), min_support=0.4, use_colnames=True, verbose=1)
freq_items

Now for the fun part!

We are able to extract association rules from these sets.

In [None]:
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)
rules.sort_values(by='support', ascending=False).head()
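
To see what the `confidence` metric measures, the following sketch recomputes it by hand with plain Python sets, assuming the standard definition $\mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)$ with relative support. The helper names `support` and `confidence` are hypothetical and not part of `mlxtend`.

```python
D = [set(x) for x in ['abc', 'acf', 'abce', 'de']]

def support(itemset, D):
    # fraction of transactions that contain every item of `itemset`
    return sum(1 for T in D if itemset.issubset(T)) / len(D)

def confidence(antecedent, consequent, D):
    # conf(X -> Y) = supp(X ∪ Y) / supp(X)
    return support(antecedent | consequent, D) / support(antecedent, D)

# every transaction that contains 'a' also contains 'c'
print(confidence(frozenset('a'), frozenset('c'), D))
# only two of the three transactions with 'c' also contain 'b'
print(confidence(frozenset('c'), frozenset('b'), D))
```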

### Task 2.2
Read the CSV file at the following URL and print the set of its unique items:

https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [None]:
# add your code here

### Task 2.3
Perform frequent itemset mining on the data from **Task 2.2**. You can freely (within reason) choose a value for `min_support`.

In [None]:
# add your code here

## Part 3: Maximal/Closed Frequent Itemsets

In the lecture, you have seen algorithms to discover frequent itemsets. In practice, the number of frequent
itemsets is often very large, so the maximal frequent itemsets are of the greatest interest.

By definition:
- An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.
- An itemset is closed if none of its immediate supersets has the same support as the itemset.
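
The two definitions can be checked mechanically. The sketch below does so for the transactions from Part 2 (not those of Task 3.1) with a threshold of 2. It enumerates all itemsets by brute force rather than with Apriori, which is only feasible for a handful of items; the helper `immediate_supersets` is a hypothetical name introduced for this example.

```python
from itertools import combinations

items = 'abcdef'
D = [set(x) for x in ['abc', 'acf', 'abce', 'de']]
min_t = 2

def frequency(itemset, D):
    # number of transactions that contain every item of `itemset`
    return sum(1 for T in D if itemset.issubset(T))

# Brute force (not Apriori): test every non-empty subset of the items.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        count = frequency(frozenset(combo), D)
        if count >= min_t:
            frequent[frozenset(combo)] = count

def immediate_supersets(itemset):
    # all itemsets that contain exactly one additional item
    return [itemset | {i} for i in items if i not in itemset]

# maximal frequent: no immediate superset is frequent
maximal = [s for s in frequent
           if not any(sup in frequent for sup in immediate_supersets(s))]
# closed frequent: every immediate superset has strictly lower support
closed = [s for s in frequent
          if all(frequency(sup, D) < frequent[s] for sup in immediate_supersets(s))]

print("maximal:", [''.join(sorted(s)) for s in maximal])
print("closed: ", [''.join(sorted(s)) for s in closed])
```

On this data, every maximal frequent itemset is also closed, but not every closed itemset is maximal.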

### Task 3.1
Given
- items "a", "b", "c", "d", "e", "f"
- a threshold of 2
- transactions $\{abc, acdef, abc, df\}$

compute all closed and maximal frequent itemsets of the transactions.

You can do so programmatically or by hand.

In [None]:
# write your code here or

*add your notes here*

### Task 3.2
Given the results above, which of the itemsets are maximal? Which of them are closed?

*Hint: It might help to visualize your findings.*

Maximal sets: 

Closed sets: