# Exercise 4

We will implement two techniques from so-called shopping basket analysis, which will help us better understand how customer data is processed to extract insights about shopping habits.


#### Notes about external libraries
You can check your implementation of the Apriori algorithm and the Association Rules using the data mining library MLxtend (`pip install mlxtend`). 


## Exercise 4.1

Use the Apriori algorithm to extract frequent itemsets from a list of grocery store transactions. 

In [1]:
import operator
import numpy as np

def preprocess(dataset):
    """Formats the transaction dataset.
    Expects an array of transactions,
    with each transaction being an array of items
    
    Example:
        [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5]]
    """
    # Get frozensets of unique items to use itemsets as dict key
    unique_items = list(set([frozenset([item]) for
        trans in dataset for item in trans]))
    return unique_items, list(map(set,dataset))

In [2]:
def generate_candidates(Lk):
    """Generates candidates of length k+1 from a list Lk
    of items, each item of length k

    Example:
        [{1}, {2}, {5}]          -> [{1, 2}, {1, 5}, {2, 5}]
        [{2, 3}, {2, 5}, {3, 5}] -> [{2, 3, 5}]
    """

    output = []

    # Lk holds itemsets of size k; the candidates we build have size k+1
    k = len(Lk[0])
    
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            # Sort before slicing so the prefix does not depend on set order
            L1 = sorted(Lk[i])[:k - 1]
            L2 = sorted(Lk[j])[:k - 1]

            # Merge sets if first k-1 elements are equal
            # If k=1 generates all possible combinations
            if L1 == L2:
                output.append(Lk[i] | Lk[j])

    return output

In [3]:
def print_support(support, max_display=10, min_items=1):
    """Prints the results of the apriori algorithm"""
    print('support\t itemset')
    print('-' * 30)
    filt_support = {k:v for k, v in support.items() if len(k) >= min_items}
    for s, sup in sorted(filt_support.items(), key=operator.itemgetter(1),
        reverse=True)[:max_display]:
        print("%.2f" % sup, '\t', set(s))
        
def print_support_mx(df, max_display=10, min_items=1):
    """Prints the results of the apriori algorithm"""
    print('support\t itemset')
    print('-' * 30)
    lenrow = df['itemsets'].apply(lambda x: len(x))
    df = df[lenrow >= min_items]
    df = df.sort_values('support', ascending=False).iloc[:max_display]
    for i, row in df.iterrows():
        print("%.2f" % float(row['support']), '\t', set(row['itemsets']))

### Apriori Algorithm

The [Apriori Algorithm](https://en.wikipedia.org/wiki/Apriori_algorithm) identifies frequent combinations of items by extending them to larger and larger itemsets (see the generate_candidates function) as long as they appear sufficiently often in a list of transactions.

Compute the support of every candidate itemset in `Ck`, given the full list of transactions. The functions that generate the candidate itemsets are already provided. The support of an itemset $X$ with respect to the list of transactions $T$ is defined as the proportion of transactions $t$ in the dataset that contain $X$. It can be computed with the following formula:

$$\mathrm{supp}(X) = \frac{|\{t \in T; X \subseteq t\}|}{|T|}$$  

After computing the support of each itemset, prune those that do not reach the specified minimum support.
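As a small illustration of the formula above, the sketch below counts in how many transactions an itemset is fully contained and divides by the number of transactions. The helper name `support` and the toy dataset are assumptions made for this example only; they are not the solution expected in `get_support`.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    # `itemset <= set(t)` is the subset test X ⊆ t from the formula
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

# Toy dataset (same shape as the example in preprocess())
toy = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5]]

print(support({3}, toy))     # contained in 3 of 3 transactions
print(support({2, 5}, toy))  # contained in 2 of 3 transactions
print(support({1, 2}, toy))  # contained in 1 of 3 transactions
```

With a minimum support of 0.5, `{3}` and `{2, 5}` would be kept and `{1, 2}` would be pruned.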

In [4]:
def get_support(dataset, Ck, min_support):
    """
    Compute support for each provided itemset by counting the number of
    occurrences in the original transactions dataset.

    dataset      : list of transactions, preprocessed using 'preprocess()'
    Ck           : list of itemsets to compute support for. 
    min_support  : minimum support. Itemsets with support below this threshold
                   will be pruned.

    output       : list of remaining itemsets, after the pruning step.
    support_dict : dictionary containing the support value for each itemset.
    """

    def all_contained(subset, set_):
        # 1 if every item of subset is in set_, 0 otherwise
        return int(all(i in set_ for i in subset))

    # Compute the supports
    supports = {}
    entries = len(dataset)
    for u in Ck:
        count = sum([all_contained(u, s) for s in dataset])
        supports[u] = count / entries

    output = [k for k, v in supports.items() if v >= min_support]

    return output, supports

In [5]:
def apriori(dataset, min_support=0.5):
    """Runs the apriori algorithm

    dataset     : list of transactions
    min_support : minimum support. Itemsets with support below this threshold
                  will be pruned.
    """
    unique_items, dataset = preprocess(dataset)
    L1, supportData = get_support(dataset, unique_items, min_support)
    
    L = [L1]
    k = 0
    while True:
        Ck = generate_candidates(L[k])
        Lk, supK = get_support(dataset, Ck, min_support)
        
        # Check for itemsets of length k with minimum support
        if len(Lk):
            supportData.update(supK)
            L.append(Lk) 
            k += 1
        else:
            break
            
    return L, supportData
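
On very small inputs, the level-wise search in `apriori` can be cross-checked against an exhaustive enumeration of all itemsets. The sketch below is one possible way to do that; the function name `brute_force_frequent_itemsets` and the toy data are assumptions for illustration, and the approach is exponential in the number of distinct items, so it is not a replacement for Apriori.

```python
from itertools import combinations

def brute_force_frequent_itemsets(transactions, min_support):
    """Enumerate every possible itemset and keep those meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            # Count the transactions that contain the whole candidate
            count = sum(1 for t in transactions if candidate <= t)
            if count / n >= min_support:
                frequent[candidate] = count / n
    return frequent

toy = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5]]
result = brute_force_frequent_itemsets(toy, min_support=0.6)
for itemset, sup in sorted(result.items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("%.2f" % sup, '\t', sorted(itemset))
```

The itemsets returned here should match the union of all levels in the `L` returned by `apriori` for the same data and threshold.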

### Run

In [11]:
# Each line of the file is one transaction, with items separated by commas
with open('groceries.csv') as f:
    dataset = [line.strip().split(',') for line in f]

L, support = apriori(dataset, min_support=0.01)
print_support(support, 10, min_items=2)

support	 itemset
------------------------------
0.07 	 {'whole milk', 'other vegetables'}
0.06 	 {'rolls/buns', 'whole milk'}
0.06 	 {'whole milk', 'yogurt'}
0.05 	 {'root vegetables', 'whole milk'}
0.05 	 {'root vegetables', 'other vegetables'}
0.04 	 {'yogurt', 'other vegetables'}
0.04 	 {'rolls/buns', 'other vegetables'}
0.04 	 {'tropical fruit', 'whole milk'}
0.04 	 {'soda', 'whole milk'}
0.04 	 {'rolls/buns', 'soda'}


### Check

You can check the results of your implementation using MLxtend. Just run the cell below.

In [7]:
import pandas as pd
from mlxtend.frequent_patterns import apriori as mx_apriori

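# One-hot encode the transactions: one row per transaction, one boolean column per item
df_dummy = pd.get_dummies(pd.Series(dataset).apply(pd.Series).stack()).groupby(level=0).max().astype(bool)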
frequent_itemsets = mx_apriori(df_dummy, min_support=0.01, use_colnames=True)
print_support_mx(frequent_itemsets, 10, min_items=2)

support	 itemset
------------------------------
0.07 	 {'whole milk', 'other vegetables'}
0.06 	 {'rolls/buns', 'whole milk'}
0.06 	 {'yogurt', 'whole milk'}
0.05 	 {'root vegetables', 'whole milk'}
0.05 	 {'root vegetables', 'other vegetables'}
0.04 	 {'yogurt', 'other vegetables'}
0.04 	 {'rolls/buns', 'other vegetables'}
0.04 	 {'tropical fruit', 'whole milk'}
0.04 	 {'soda', 'whole milk'}
0.04 	 {'rolls/buns', 'soda'}


## Exercise 4.2

The associations captured by frequent itemsets are not necessarily symmetric. In this second part, we therefore use [association rule learning](https://en.wikipedia.org/wiki/Association_rule_learning) to better understand their directionality.

In [9]:
def generate_rules(L, supportData, min_confidence=0.7):  
    """Generates association rules given an array of frequent itemsets
    and a level of confidence
    
    Inputs:
        L: itemsets
        supportData: dictionary storing itemsets support
        min_confidence: rules with confidence under threshold are pruned
    """

    # Rules computed
    rules = []
    
    # Iterate over itemsets of length >= 2
    for i in range(1, len(L)):
        # Iterate over each frequent itemset
        for freqSet in L[i]:          
            # Check if freqSet has more than 2 elements
            if (i+1 > 2):
                # recursively generate candidates 
                rules_from_consequent(freqSet, supportData, rules, min_confidence)
                compute_confidence(freqSet, supportData, rules, min_confidence)

            # Otherwise
            else:
                # compute rule confidence
                compute_confidence(freqSet, supportData, rules, min_confidence)

    return rules   


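def rules_from_consequent(freqSet, supportData, rules, min_confidence, H=None):
    """Recursively generates rules with consequents of more than one item

    Inputs:
        freqSet: frequent itemset
        supportData: dictionary storing itemsets support
        rules: array to store rules
        min_confidence: rules with confidence under threshold are pruned
        H: consequents of size m kept so far (defaults to the single items)
    """
    if H is None:
        H = [frozenset([item]) for item in freqSet]
    m = len(H[0])
    if (len(freqSet) > (m + 1)): 
        # create new candidate consequents of size m+1
        Hmp1 = generate_candidates(H)
        # keep the consequents Y whose rule (freqSet - Y) -> Y is confident enough
        kept = []
        for Y in Hmp1:
            X = freqSet - Y
            conf = supportData[freqSet] / supportData[X]
            if conf >= min_confidence:
                rules.append((X, Y, conf))
                kept.append(Y)
        
        # need at least two sets to merge
        if (len(kept) > 1):
            rules_from_consequent(freqSet, supportData, rules, min_confidence, kept)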
            
def print_rules(rules, max_display=10):
    """Prints the resulting rules"""
    print('confidence\t rule')
    print('-' * 30)
    for a, b, sup in sorted(rules, key=lambda x: x[2],
        reverse = True)[:max_display]:
        print("%.2f" % sup, '\t', set(a), '->', set(b))

def print_rules_mx(df,max_display=10):
    """Prints the resulting rules"""
    print('confidence\t rule')
    print('-' * 30)
    df  = df.sort_values('confidence', ascending=False).iloc[:max_display]
    for i, row in df.iterrows():
        print("%.2f" % float(row['confidence']), '\t',
            set(row['antecedents']), '->',set(row['consequents']))

### `compute_confidence`

`compute_confidence(...)` computes the confidence for a set of candidate rules H and prunes the rules with a confidence below the threshold. The confidence is given by:

$$\mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)$$
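
To make the formula concrete, the sketch below evaluates it in both directions for one pair of items. The helper `confidence` and the support values are hypothetical and chosen only for illustration; they do not come from the groceries dataset.

```python
def confidence(antecedent, consequent, support):
    """conf(X => Y) = supp(X ∪ Y) / supp(X), read from a support dictionary."""
    antecedent = frozenset(antecedent)
    union = antecedent | frozenset(consequent)
    return support[union] / support[antecedent]

# Hypothetical support values, keyed by frozenset like supportData
toy_support = {
    frozenset({"bread"}): 0.5,
    frozenset({"butter"}): 0.2,
    frozenset({"bread", "butter"}): 0.15,
}

conf_ab = confidence({"bread"}, {"butter"}, toy_support)  # 0.15 / 0.5
conf_ba = confidence({"butter"}, {"bread"}, toy_support)  # 0.15 / 0.2
print("%.2f" % conf_ab, '\t', "{'bread'} -> {'butter'}")
print("%.2f" % conf_ba, '\t', "{'butter'} -> {'bread'}")
```

Both rules come from the same itemset with the same support, yet their confidences differ because the denominators differ. With `min_confidence=0.7`, only the second rule would be kept.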


In [14]:
def compute_confidence(freqSet, supportData, rules, min_confidence=0.7):
    """Computes the confidence for a set of rules and their supports
    
    Inputs:
        freqSet: a frequent itemset of N elements
        supportData: dictionary storing itemsets support
        rules: array to store rules
        min_confidence: rules with less confidence are pruned
    """
    H = [frozenset([item]) for item in freqSet]
    prunedH = [] 
    
    for Y in H:
        # Compute the antecedent X of the rule X -> Y
        X = 
        
        # Compute support for both terms
        support_XuY = 
        support_X = 
        # Compute confidence
        conf = 
        
        if conf >= min_confidence: 
            rules.append((X, Y, conf))
            prunedH.append(Y)
    return prunedH

### Run

In [None]:
rules=generate_rules(L,support, min_confidence=0.1)
print_rules(rules,10)

### Check

You can check the results of your implementation using MLxtend. Just run the cell below (you will have to run the checking code of Exercise 4.1 first).

In [10]:
from mlxtend.frequent_patterns import association_rules as mx_association_rules

rules_mx = mx_association_rules(frequent_itemsets, metric="confidence", min_threshold=0.1)
print_rules_mx(rules_mx,max_display=10)

confidence	 rule
------------------------------
0.59 	 {'root vegetables', 'citrus fruit'} -> {'other vegetables'}
0.58 	 {'tropical fruit', 'root vegetables'} -> {'other vegetables'}
0.58 	 {'yogurt', 'curd'} -> {'whole milk'}
0.57 	 {'butter', 'other vegetables'} -> {'whole milk'}
0.57 	 {'tropical fruit', 'root vegetables'} -> {'whole milk'}
0.56 	 {'root vegetables', 'yogurt'} -> {'whole milk'}
0.55 	 {'domestic eggs', 'other vegetables'} -> {'whole milk'}
0.52 	 {'yogurt', 'whipped/sour cream'} -> {'whole milk'}
0.52 	 {'rolls/buns', 'root vegetables'} -> {'whole milk'}
0.52 	 {'pip fruit', 'other vegetables'} -> {'whole milk'}


## EPFL Twitter Data

Now that we have a working implementation, we will apply the Apriori algorithm to a dataset that you should know pretty well by now: EPFL Twitter data. In this scenario, tweets are treated as transactions and words as items. Let's see what kind of frequent associations we can discover.

The code below cleans the tweets and converts them to the same format as the transactions of the previous exercise. Run the cells and generate the results for both algorithms. What do you observe in the association rule results? Briefly explain.
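
To show what "tweets as transactions" means, here is a simplified sketch that turns raw text into item lists using only the standard library. The helper `tweet_to_transaction`, the sample tweets, and the tiny stop-word set are assumptions for illustration; unlike the cell below, it does no stemming and does not use NLTK.

```python
import string

def tweet_to_transaction(tweet, stop_words=frozenset()):
    """Lowercase, strip punctuation, drop stop words, keep unique words."""
    table = str.maketrans("", "", string.punctuation)
    words = tweet.lower().translate(table).split()
    # A transaction is a set of items, so repeated words collapse to one
    return sorted({w for w in words if w not in stop_words})

# Made-up tweets and stop words, for illustration only
tweets = ["EPFL opens a new lab!", "New lab, new research at EPFL."]
stop = {"a", "at"}

transactions = [tweet_to_transaction(t, stop) for t in tweets]
for trans in transactions:
    print(trans)
```

Each resulting list plays the role of one shopping basket, so it can be passed to `apriori` in the same way as the grocery transactions.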

In [None]:
# Loading of libraries and documents

from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
import string
from nltk.corpus import stopwords
import math
from collections import Counter
nltk.download('stopwords')
nltk.download('punkt')

# Tokenize, stem a document
stemmer = PorterStemmer()
def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    return " ".join([stemmer.stem(word.lower()) for word in tokens])

# Remove stop words
# Build the stop-word set once instead of reloading the lists for every word
STOP_WORDS = set(stopwords.words('english') +
                 stopwords.words('german') +
                 stopwords.words('french'))

def clean_voc(documents):
    cleaned = []
    for tweet in documents:
        new_tweet = []
        for word in tokenize(tweet).split():
            if word not in STOP_WORDS:
                # Fold the "epflen" token into "epfl"
                if word == "epflen":
                    word = "epfl"
                new_tweet.append(word)
        if len(new_tweet) > 0:
            cleaned.append(new_tweet)
    return cleaned

# Read a list of documents from a file. Each line in a file is a document
with open("epfldocs.txt") as f:
    content = f.readlines()
original_documents = [x.strip() for x in content] 
documents = clean_voc(original_documents)

In [None]:
L,support = apriori(documents,min_support = 0.01)
print_support(support,20,min_items=2)

In [None]:
rules=generate_rules(L,support, min_confidence=0.1)
print_rules(rules,20)

## Exercise 4.3: Pen and Paper

You are given the following accident and weather data. Each line corresponds to one event:

1. car_accident rain lightning wind clouds fire
2. fire clouds rain lightning wind
3. car_accident fire wind
4. clouds rain wind
5. lightning fire rain clouds  
6. clouds wind car_accident  
7. rain lightning clouds fire  
8. lightning fire car_accident

(a) You would like to know the likely cause of the car accidents. Which association rules do you need to look for? Compute the confidence and support values for these rules. Looking at these values, which is the most likely cause of the car accidents?

(b) Find all the association rules for a minimum support of 0.6 and a minimum confidence of 1.0 (certainty). Follow the Apriori algorithm.