# Foundations of Data Science (GDW) 2023



# Exercise V: Association Rules & FP-Trees

This week's exercise covers how to evaluate association rules and a tree-based algorithm for mining them.

## Part 1: Metrics
Given a rule $X \rightarrow Y$, we know three metrics to measure its quality:
- Support: $supp(X, Y) = \frac{freq(X, Y)}{N}$
- Confidence: $conf(X, Y) = \frac{freq(X, Y)}{freq(X)}$
- Lift: $lift(X, Y) = \frac{supp(X, Y)}{supp(X) \cdot supp(Y)}$
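
As a small worked example, the three formulas can be evaluated by hand-written counting. The sketch below copies the transactions from the Apriori cell further down and evaluates the rule $a \rightarrow b$; the helper names `freq` and `supp` are chosen here for illustration and are not part of any library.

```python
# Transactions copied from the exercise's Apriori cell; each string is one basket.
transactions = [set(t) for t in ['abc', 'acf', 'abce', 'de', 'cfg', 'abfg',
                                 'acdeg', 'abdfg', 'afg', 'abdefg', 'abdf']]
N = len(transactions)

def freq(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def supp(itemset):
    """Relative frequency of `itemset` in the database."""
    return freq(itemset) / N

# Rule X -> Y with X = {a}, Y = {b}
X, Y = {'a'}, {'b'}
support = supp(X | Y)                      # freq(X, Y) / N
confidence = freq(X | Y) / freq(X)         # freq(X, Y) / freq(X)
lift = supp(X | Y) / (supp(X) * supp(Y))   # supp(X, Y) / (supp(X) * supp(Y))

print(f"support={support:.3f}, confidence={confidence:.3f}, lift={lift:.3f}")
# prints: support=0.545, confidence=0.667, lift=1.222
```

Here $freq(a) = 9$, $freq(b) = 6$ and $freq(a, b) = 6$ out of $N = 11$ transactions, which gives the values printed above.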

### Task 1.1
Explain the intuitive meaning behind these three metrics. You may use a real-life example such as shopping, etc.

*Write your notes here:*

...


Now, execute the Apriori code below to mine association rules.

In [1]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules

D = [set(x) for x in ['abc', 'acf', 'abce', 'de', 'cfg', 'abfg', 'acdeg', 'abdfg', 'afg', 'abdefg', 'abdf']]

df = pd.DataFrame(D)

items = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
# We convert the data to one-hot encoding
def onehot_encode(df, items):
    itemset = set(items)
    encoded_vals = []
    for _, row in df.iterrows():
        rowset = set(row)
        labels = {}
        uncommons = list(itemset - rowset)
        commons = list(itemset.intersection(rowset))
        for uc in uncommons:
            labels[uc] = 0
        for com in commons:
            labels[com] = 1
        encoded_vals.append(labels)
    return encoded_vals

ohe_items = onehot_encode(df, items)
ohe_df = pd.DataFrame(ohe_items)

# We know the data is boolean, so we can explicitly declare it as such
freq_items = apriori(ohe_df.astype('bool'), min_support=0.4, use_colnames=True, verbose=1)
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)
rules.sort_values(by='support', ascending=False)


### Task 1.2
Experiment with the value for minimum support and with the lift threshold. You may choose threshold values for lift as you see fit.

Note your findings and try to explain them intuitively.

In [None]:
freq_items = apriori(ohe_df.astype('bool'), min_support=0.4, use_colnames=True, verbose=1)
rules = association_rules(freq_items, metric="lift", min_threshold=1.2)
rules.sort_values(by='support', ascending=True)

*Write your notes here*

...

## Part 2: FP-Growth

We know FP-Growth to be an alternative to the Apriori algorithm that does not need to generate candidates.
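
To make "generating candidates" concrete, here is a minimal sketch of the join-and-prune step that Apriori repeats at every level. The function name `apriori_candidates` and the toy input are assumptions made for illustration; this is not how mlxtend implements it internally.

```python
from itertools import combinations

def apriori_candidates(frequent_k, k):
    """Join frequent k-itemsets into (k+1)-candidates and prune them
    using the Apriori property (all k-subsets must be frequent)."""
    frequent_k = set(frequent_k)
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) != k + 1:
                continue
            # Keep the candidate only if every k-subset is itself frequent.
            if all(frozenset(s) in frequent_k for s in combinations(union, k)):
                candidates.add(union)
    return candidates

# Toy input: assume these 2-itemsets were found to be frequent.
frequent_2 = {frozenset('ab'), frozenset('ac'), frozenset('bc'), frozenset('bd')}
print(sorted(''.join(sorted(c)) for c in apriori_candidates(frequent_2, 2)))
# prints: ['abc']
```

Every surviving candidate still has to be counted against the whole database before it is known to be frequent. FP-Growth avoids this generate-and-count loop by compressing the database into a tree and mining that tree directly.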

The code below encodes the given dataset as a one-hot (boolean) transaction database.

In [None]:
dataset = [["beer","nuts","diaper"],
            ["beer","coffee","diaper"],
            ["beer","diaper","eggs"],
            ["nuts","eggs","milk"],
            ["nuts","coffee","diaper","eggs","milk"],
            ["beer", "coffee"]]

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

You can visit https://planktonfun.github.io/FPTreeSimulator/ to see a visualization of the tree.

*Note: To get the result for the transaction database above, add {beer,coffee} to the end of the list.*
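
For readers who prefer code to a visualization, the following is a minimal sketch of FP-Tree construction on the same dataset. The names `FPNode`, `build_fp_tree` and `show` are hypothetical, the minimum count of 3 corresponds to a minimum support of 0.5 on six transactions, and ties between equally frequent items are broken alphabetically, which is an assumption; other tools may order ties differently and produce a differently shaped tree.

```python
from collections import Counter

class FPNode:
    """One node of the FP-Tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count each item once per transaction and keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = FPNode(None)
    # Pass 2: insert each transaction, items sorted by descending count
    # (ties broken alphabetically, an assumption of this sketch).
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, frequent

def show(node, depth=0):
    """Print the tree with one 'item:count' entry per line."""
    for child in node.children.values():
        print('  ' * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

dataset = [["beer", "nuts", "diaper"],
           ["beer", "coffee", "diaper"],
           ["beer", "diaper", "eggs"],
           ["nuts", "eggs", "milk"],
           ["nuts", "coffee", "diaper", "eggs", "milk"],
           ["beer", "coffee"]]

root, frequent = build_fp_tree(dataset, min_count=3)
show(root)
```

Transactions that share a prefix of frequent items share a path in the tree, which is where the compression comes from.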

Now, we will execute the FP-Growth algorithm that internally uses FP-Trees.

In [None]:
from mlxtend.frequent_patterns import fpgrowth

fp_items = fpgrowth(df, min_support=0.5, use_colnames=True)
fp_items

In [None]:
fp_rules = association_rules(fp_items, metric="lift", min_threshold=1.0)
fp_rules

### Task 2.1
Given transactions $\{bce, bcd, abde, abce, abde, abcde\}$, perform the necessary steps to generate an FP-Tree by hand.

*Write your notes here*

...

### Task 2.2
Given the *FP-Tree* above, convert it to a *Conditional FP-Tree*.

*Write your notes here*

...

### Task 2.3
Compare the runtime performance of FP-Growth with that of Apriori.

*Hint: You can measure how long a call `f(...)` takes with*

`%timeit -n 100 -r 10 f(...)`
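
`%timeit` is an IPython magic and only works inside a notebook or IPython shell. Outside of one, a roughly equivalent measurement can be sketched with the standard-library `timeit` module. The function `f` below is a hypothetical stand-in for the algorithm being timed.

```python
import timeit

def f(n):
    """Hypothetical stand-in for the algorithm being timed."""
    return sum(i * i for i in range(n))

# 10 repetitions of 100 calls each, mirroring `-r 10 -n 100`.
durations = timeit.repeat(lambda: f(1000), number=100, repeat=10)

# Each entry is the total time for 100 calls; the minimum is a common summary.
best_per_call = min(durations) / 100
print(f"best of 10 runs: {best_per_call * 1e6:.1f} microseconds per call")
```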


In [None]:
# add your code here

### Task 2.4
Can you name one (or more) advantage(s) that Apriori has over FP-Growth?

*Write your notes here*

...