# Association Rules

Today, we will implement two techniques from so-called shopping basket analysis: the Apriori algorithm for finding frequent itemsets, and association rule learning. Together they show how customer data is processed to extract insights about shopping habits.


#### Notes about external libraries
You can check your implementation of the Apriori algorithm and the Association Rules using MLxtend, a data mining library. Unfortunately, the library is not directly shipped with Anaconda. To install MLxtend, just execute  

```bash
pip install mlxtend  
```

Or directly using Anaconda

```bash
conda install -c conda-forge mlxtend 
```

Note that the installation of MLxtend is not mandatory, as we will provide the expected results in pre-rendered cells.


## 📚 Exercise 1: Apriori algorithm
In the first exercise, we will put the Apriori algorithm into practice. In particular, we will extract frequent itemsets from a list of transactions coming from a grocery store. You will have to complete the function `get_support(...)`.

In [1]:
import operator
import numpy as np

"""
Format the transaction dataset.
Expect a list of transaction in the format:
[[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], ...]
"""
def preprocess(dataset):
    unique_items = set()
    for transaction in dataset:
        for item in transaction:
            unique_items.add(item)
       
    # Converting to frozensets to use itemsets as dict key
    unique_items = [frozenset([i]) for i in list(unique_items)]
    
    return unique_items,list(map(set,dataset))


"""
Generate candidates of length n+1 from a list of items, each of length n.

Example:
[{1}, {2}, {5}]          -> [{1, 2}, {1, 5}, {2, 5}]
[{2, 3}, {2, 5}, {3, 5}] -> [{2, 3, 5}]
"""
def generate_candidates(Lk):
    output = []

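    # Target size of the generated candidates
    k=len(Lk[0])+1
    
    for i in range(len(Lk)):
        for j in range(i+1, len(Lk)):
            # Sort first, then keep the first k-2 elements, so that the
            # prefix comparison does not depend on set iteration order
            L1 = sorted(Lk[i])[:k-2]
            L2 = sorted(Lk[j])[:k-2]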

            # Merge sets if first k-2 elements are equal
            # For the case of k<2, generate all possible combinations
            if L1==L2: 
                output.append(Lk[i] | Lk[j])

    return output


"""
Print the results of the apriori algorithm
"""
def print_support(support,max_display=10,min_items=1):
    print('support\t itemset')
    print('-'*30)
    filt_support = {k:v for k,v in support.items() if len(k)>=min_items}
    for s,sup in sorted(filt_support.items(), key=operator.itemgetter(1),reverse=True)[:max_display]:
        print("%.2f" % sup,'\t',set(s))
        
def print_support_mx(df,max_display=10,min_items=1):
    print('support\t itemset')
    print('-'*30)
    lenrow = df['itemsets'].apply(lambda x: len(x))
    df  = df[lenrow>=min_items]
    df  = df.sort_values('support',ascending=False).iloc[:max_display]
    for i,row in df.iterrows():
        print("%.2f" % float(row['support']),'\t',set(row['itemsets']))
        

"""
Run the apriori algorithm

dataset     : list of transactions
min_support : minimum support. Itemsets with support below this threshold
              will be pruned.
"""
def apriori(dataset, min_support = 0.5):
    unique_items,dataset = preprocess(dataset)
    L1, supportData      = get_support(dataset, unique_items, min_support)
    
    L = [L1]
    k = 0
    while True:
        Ck       = generate_candidates(L[k])
        Lk, supK = get_support(dataset, Ck, min_support)
        
        # Are there candidate itemsets of this length that reach the minimum support?
        if len(Lk)>0:
            supportData.update(supK)
            L.append(Lk) 
            k += 1
        else:
            break
            
    return L, supportData

### TODO

The [Apriori Algorithm](https://en.wikipedia.org/wiki/Apriori_algorithm) identifies frequent combinations of items by extending them to larger and larger itemsets (see the generate_candidates function) as long as they appear sufficiently often in a list of transactions.
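To make the "extending to larger itemsets" step concrete, here is a minimal, self-contained sketch of the join: two itemsets of size k are merged when their first k-1 sorted items agree. The helper name `join_step` is hypothetical and only mirrors the idea behind `generate_candidates`; it assumes all input itemsets have the same size and contain mutually comparable items.

```python
def join_step(itemsets):
    """Merge size-k itemsets that share their first k-1 sorted items.

    Hypothetical helper for illustration; assumes all itemsets have the same size.
    """
    itemsets = [frozenset(s) for s in itemsets]
    candidates = []
    for i in range(len(itemsets)):
        for j in range(i + 1, len(itemsets)):
            a, b = sorted(itemsets[i]), sorted(itemsets[j])
            # Same prefix (everything but the last item) -> the union has size k+1
            if a[:-1] == b[:-1]:
                candidates.append(itemsets[i] | itemsets[j])
    return candidates


print(join_step([{1}, {2}, {5}]))           # three candidate pairs
print(join_step([{2, 3}, {2, 5}, {3, 5}]))  # one candidate triple
```

For single items the shared prefix is empty, so every pair is generated; for pairs, only those starting with the same item are merged.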

Compute the support of every candidate itemset in `Ck`, given the full list of transactions. The functions that generate candidate itemsets are already provided. The support of an itemset $X$ with respect to the list of transactions $T$ is the proportion of transactions $t$ in the dataset that contain $X$. It can be computed with the following formula:

$$\mathrm{supp}(X) = \frac{|\{t \in T; X \subseteq t\}|}{|T|}$$  

After computing the support of each itemset, prune the ones that do not reach the specified minimum support.
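As a quick illustration of the definition, the sketch below evaluates the support of a single itemset on a toy list of transactions. The function name `itemset_support` and the transactions are made up for this example and are not part of the exercise code.

```python
def itemset_support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Toy transactions (made up for illustration)
T = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

print(itemset_support({2, 5}, T))  # 3 of the 4 transactions contain both 2 and 5
print(itemset_support({1, 3}, T))  # 2 of the 4 transactions contain both 1 and 3
```

With a minimum support of 0.6, `{2, 5}` would be kept and `{1, 3}` would be pruned.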

In [2]:
"""
Compute support for each provided itemset by counting the number of
its occurrences in the original dataset of transactions.

dataset      : list of transactions, preprocessed using 'preprocess()'
Ck           : list of itemsets to compute support for. 
min_support  : minimum support. Itemsets with support below this threshold
               will be pruned.
              
output       : list of remaining itemsets, after the pruning step.
support_dict : dictionary containing the support value for each itemset.
"""
def get_support(dataset, Ck, min_support):
    
    # This dictionary should contain the number of appearances of each itemset in the dataset.
    # Itemsets in Ck are represented as frozensets and can directly be used as dictionary keys.
    support_count = {}
    
    for transaction in dataset:
        for candidate in Ck:
            if candidate.issubset(transaction):
                if candidate not in support_count: support_count[candidate]=1
                else: support_count[candidate] += 1
    
    output = []
    num_transactions = float(len(dataset))
    support_dict = {}
    for key in support_count:
        
        support = support_count[key]/num_transactions
        
        if support >= min_support:
            output.insert(0,key)
            support_dict[key] = support
    return output, support_dict

### Run

In [3]:
# Each line of the file is one transaction, with items separated by commas
with open('groceries.csv') as f:
    dataset = [l.strip().split(',') for l in f]

L,support = apriori(dataset,min_support=0.01)
print_support(support,10,min_items=2)

support	 itemset
------------------------------
0.07 	 {'whole milk', 'other vegetables'}
0.06 	 {'whole milk', 'rolls/buns'}
0.06 	 {'whole milk', 'yogurt'}
0.05 	 {'whole milk', 'root vegetables'}
0.05 	 {'other vegetables', 'root vegetables'}
0.04 	 {'other vegetables', 'yogurt'}
0.04 	 {'other vegetables', 'rolls/buns'}
0.04 	 {'whole milk', 'tropical fruit'}
0.04 	 {'whole milk', 'soda'}
0.04 	 {'rolls/buns', 'soda'}


### Check

You can check the results of your implementation using MLxtend. Just run the cells below.

In [5]:
pip install mlxtend


In [8]:
import pandas as pd
from mlxtend.frequent_patterns import apriori as mx_apriori

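# One-hot encode the transactions: one row per transaction, one boolean column per item.
# Note: DataFrame.sum(level=0) was removed in pandas 2.0; group by the index level instead.
df_dummy = pd.get_dummies(pd.Series(dataset).apply(pd.Series).stack()).groupby(level=0).sum().astype(bool)
frequent_itemsets = mx_apriori(df_dummy, min_support=0.01, use_colnames=True)
print_support_mx(frequent_itemsets,10,min_items=2)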

## 📚 Exercise 2: Association Rule Learning
The associations found by Apriori are not necessarily symmetric. Therefore, in the second part, we will use [association rule learning](https://en.wikipedia.org/wiki/Association_rule_learning) to better understand the directionality of our computed frequent itemsets. In other words, we will infer whether the purchase of one item generally implies the purchase of another.

In [9]:
"""
L              : itemsets
supportData    : dictionary storing itemsets support
min_confidence : rules with a confidence under this threshold should be pruned
"""
def generate_rules(L, supportData, min_confidence=0.7):  
    # Rules to be computed
    rules = []
    
    # Iterate over itemsets of length 2..N
    for i in range(1, len(L)):
        
        # Iterate over each frequent itemset
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            
            # If the itemset contains more than 2 elements
            # recursively generate candidates 
            if (i+1 > 2):
                rules_from_consequent(freqSet, H1, supportData, rules, min_confidence)
                compute_confidence(freqSet, H1, supportData, rules, min_confidence)
            # If the itemset contains 2 or fewer elements
            # compute rule confidence
            else:
                compute_confidence(freqSet, H1, supportData, rules, min_confidence)

    return rules   

"""
freqSet        : frequent itemset
H              : candidate elements to create a rule
supportData    : dictionary storing itemsets support
rules          : array to store rules
min_confidence : rules with a confidence under this threshold should be pruned
"""
def rules_from_consequent(freqSet, H, supportData, rules, min_confidence=0.7):
    m = len(H[0])
    if (len(freqSet) > (m + 1)): 

        # create new candidate consequents of size m+1
        Hmp1 = generate_candidates(H)
        Hmp1 = compute_confidence(freqSet, Hmp1, supportData, rules, min_confidence)
        
        if (len(Hmp1) > 1):    #need at least two sets to merge
            rules_from_consequent(freqSet, Hmp1, supportData, rules, min_confidence)
            
"""
Print the resulting rules
"""
def print_rules(rules,max_display=10):
    print('confidence\t rule')
    print('-'*30)
    for a,b,conf in sorted(rules, key=lambda x: x[2],reverse=True)[:max_display]:
        print("%.2f" % conf,'\t',set(a),'->',set(b))

def print_rules_mx(df,max_display=10):
    print('confidence\t rule')
    print('-'*30)
    df  = df.sort_values('confidence',ascending=False).iloc[:max_display]
    for i,row in df.iterrows():
        print("%.2f" % float(row['confidence']),'\t',set(row['antecedents']),'->',set(row['consequents']))

### TODO:

You will have to complete the function `compute_confidence(...)`, which computes the confidence of a set of candidate rules `H` and prunes the rules whose confidence is below the specified threshold. Compute each rule's confidence using the following formula:

$$\mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)$$
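The formula also shows why rules are directional: the numerator is the same for $X \Rightarrow Y$ and $Y \Rightarrow X$, but the denominator is not. The following sketch illustrates this with made-up support values; the function name `confidence` and the dictionary `supp` are assumptions for this example only.

```python
def confidence(X, Y, support_of):
    """conf(X => Y) = supp(X u Y) / supp(X), with supports looked up in a dict."""
    X, Y = frozenset(X), frozenset(Y)
    return support_of[X | Y] / support_of[X]


# Made-up supports for illustration
supp = {
    frozenset({'a'}): 0.5,
    frozenset({'b'}): 0.8,
    frozenset({'a', 'b'}): 0.4,
}

print(confidence({'a'}, {'b'}, supp))  # 0.4 / 0.5
print(confidence({'b'}, {'a'}, supp))  # 0.4 / 0.8
```

Here `{'a'} -> {'b'}` is a much stronger rule than `{'b'} -> {'a'}`, although both come from the same frequent itemset.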


In [10]:
"""
Compute confidence for a given set of rules and their respective support

freqSet        : frequent itemset of N-element
H              : list of candidate elements Y1, Y2... that are part of the frequent itemset
supportData    : dictionary storing itemsets support
rules          : array to store rules
min_confidence : rules with a confidence under this threshold should be pruned
"""
def compute_confidence(freqSet, H, supportData, rules, min_confidence=0.7):
    prunedH = [] 
    
    for Y in H:
        # Compute X which is the frequent itemset minus the considered Y
        X           = freqSet -  Y
        
        # Compute support for both terms
        support_XuY = supportData[freqSet]
        support_X   = supportData[X]
        
        # Compute confidence
        conf        = support_XuY / support_X
        
        if conf >= min_confidence: 
            rules.append((X, Y, conf))
            prunedH.append(Y)
    return prunedH

### Run

In [11]:
rules=generate_rules(L,support, min_confidence=0.1)
print_rules(rules,10)

confidence	 rule
------------------------------
0.59 	 {'citrus fruit', 'root vegetables'} -> {'other vegetables'}
0.58 	 {'root vegetables', 'tropical fruit'} -> {'other vegetables'}
0.58 	 {'curd', 'yogurt'} -> {'whole milk'}
0.57 	 {'other vegetables', 'butter'} -> {'whole milk'}
0.57 	 {'root vegetables', 'tropical fruit'} -> {'whole milk'}
0.56 	 {'root vegetables', 'yogurt'} -> {'whole milk'}
0.55 	 {'domestic eggs', 'other vegetables'} -> {'whole milk'}
0.52 	 {'whipped/sour cream', 'yogurt'} -> {'whole milk'}
0.52 	 {'rolls/buns', 'root vegetables'} -> {'whole milk'}
0.52 	 {'other vegetables', 'pip fruit'} -> {'whole milk'}


### Check

You can check the results of your implementation using MLxtend. Just run the cell below (you will have to run the checking code of Exercise 1 first, since it defines `frequent_itemsets`).

In [12]:
from mlxtend.frequent_patterns import association_rules as mx_association_rules

rules_mx = mx_association_rules(frequent_itemsets, metric="confidence", min_threshold=0.1)
print_rules_mx(rules_mx,max_display=10)


## EPFL Twitter Data

Now that we have a working implementation, we will apply the Apriori algorithm on a dataset that you should know pretty well by now: EPFL Twitter data. In that scenario, tweets will be considered as transactions and words will be items. Let's see what kind of frequent associations we can discover.

The method below cleans the tweets and formats them in the same format as the transactions of the previous exercise. Run the cells and generate the results for both algorithms. What can you observe from the association rules results? Briefly explain.

In [13]:
# Loading of libraries and documents

from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
import string
from nltk.corpus import stopwords
import math
from collections import Counter
nltk.download('stopwords')
nltk.download('punkt')

# Tokenize, stem a document
stemmer = PorterStemmer()
def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    return " ".join([stemmer.stem(word.lower()) for word in tokens])

# Remove stop words
# Build the stop word set once instead of reloading the lists for every word
stop_words = (set(stopwords.words('english')) |
              set(stopwords.words('german')) |
              set(stopwords.words('french')))

def clean_voc(documents):
    cleaned = []
    for tweet in documents:
        new_tweet = []
        tweet = tokenize(tweet).split()
        for word in tweet:
            if word not in stop_words:
                # Normalize the account name "epflen" to "epfl"
                if word=="epflen":
                    word = "epfl"
                new_tweet.append(word)
        if len(new_tweet)>0:
            cleaned.append(new_tweet)
    return cleaned

# Read a list of documents from a file. Each line in a file is a document
with open("epfldocs.txt") as f:
    content = f.readlines()
original_documents = [x.strip() for x in content] 
documents = clean_voc(original_documents)



In [14]:
L,support = apriori(documents,min_support = 0.01)
print_support(support,20,min_items=2)

support	 itemset
------------------------------
0.08 	 {'via', 'epfl'}
0.06 	 {'epfl', '’'}
0.05 	 {'new', 'epfl'}
0.05 	 {'epfl', 'amp'}
0.05 	 {'research', 'epfl'}
0.04 	 {'lausann', 'epfl'}
0.04 	 {'vdtech', 'epfl'}
0.04 	 {'epfl', 'switzerland'}
0.04 	 {'epfl', 'robot'}
0.03 	 {'day', 'epfl'}
0.03 	 {'epfl', 'swiss'}
0.03 	 {'vdtech', 'via'}
0.03 	 {'vdtech', 'via', 'epfl'}
0.03 	 {'scienc', 'epfl'}
0.03 	 {'epfl', 'innov'}
0.03 	 {'epfl', 'student'}
0.03 	 {'epfl', 'first'}
0.03 	 {'work', 'epfl'}
0.02 	 {'technolog', 'epfl'}
0.02 	 {'epfl', '2018'}


In [15]:
rules=generate_rules(L,support, min_confidence=0.1)
print_rules(rules,20)

confidence	 rule
------------------------------
1.00 	 {'»'} -> {'epfl'}
1.00 	 {'«'} -> {'epfl'}
1.00 	 {'«'} -> {'»'}
1.00 	 {'»'} -> {'«'}
1.00 	 {'model'} -> {'epfl'}
1.00 	 {'perovskit'} -> {'epfl'}
1.00 	 {'next'} -> {'epfl'}
1.00 	 {'improv'} -> {'epfl'}
1.00 	 {'particip'} -> {'epfl'}
1.00 	 {'technolog'} -> {'epfl'}
1.00 	 {'drone'} -> {'epfl'}
1.00 	 {'epflcampu'} -> {'epfl'}
1.00 	 {'learn'} -> {'epfl'}
1.00 	 {'present'} -> {'epfl'}
1.00 	 {'mooc'} -> {'epfl'}
1.00 	 {'show'} -> {'epfl'}
1.00 	 {'scientist'} -> {'epfl'}
1.00 	 {'brain'} -> {'epfl'}
1.00 	 {'eth'} -> {'epfl'}
1.00 	 {'«'} -> {'»', 'epfl'}


## 📚 Exercise 3: Pen and Paper!

You are given the following accident and weather data. Each line corresponds to one event:

1. car_accident rain lightning wind clouds fire
2. fire clouds rain lightning wind
3. car_accident fire wind
4. clouds rain wind
5. lightning fire rain clouds  
6. clouds wind car_accident  
7. rain lightning clouds fire  
8. lightning fire car_accident

(a) You would like to know what is the likely cause of all the car accidents. What association rules do you need to look for? Compute the confidence and support values for these rules. Looking at these values, which is the most likely cause of the car accidents?

We need to look for the association rules of the form: {cause} → {car accident}
i.e. in which the left-hand side represents the cause of the accident. 

The possible association rules are: 

{lightning} → {car accident} support: 0.25 confidence: 0.4
    Support: lightning and car accident occur together in 2 of the 8 transactions; 2/8 = 0.25
    Confidence: number of transactions with lightning + car accident / number of transactions with lightning = 2/5 = 0.4, i.e. 40% of the transactions involving lightning also involve a car accident

{wind} → {car accident} support: 0.375 confidence: 0.6
    Support: there are 3 transactions with wind + car accident; 3/8 = 0.375
    Confidence: number of transactions with wind + car accident / number of transactions with wind = 3/5 = 0.6

{fire} → {car accident} support: 0.375 confidence: 0.5
    Support: there are 3 transactions with fire + car accident; 3/8 = 0.375
    Confidence: number of transactions with fire + car accident / number of transactions with fire = 3/6 = 0.5

{clouds} → {car accident} support: 0.25 confidence: 0.33
    Support: there are 2 transactions with clouds + car accident; 2/8 = 0.25
    Confidence: number of transactions with clouds + car accident / number of transactions with clouds = 2/6 = 0.33
    
{rain} → {car accident} support: 0.125 confidence: 0.2
    Support: there is 1 transaction with rain + car accident; 1/8 = 0.125
    Confidence: number of transactions with rain + car accident / number of transactions with rain = 1/5 = 0.2


{wind} has the highest confidence (0.6) and shares the highest support (0.375) with {fire}, so wind is the most likely cause of the car accidents.
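The hand computation for part (a) can be double-checked with a few lines of Python. This is only a verification sketch on the eight events listed above; the helper `count` and the variable names are chosen for this example.

```python
events = [
    "car_accident rain lightning wind clouds fire",
    "fire clouds rain lightning wind",
    "car_accident fire wind",
    "clouds rain wind",
    "lightning fire rain clouds",
    "clouds wind car_accident",
    "rain lightning clouds fire",
    "lightning fire car_accident",
]
transactions = [set(e.split()) for e in events]
n = len(transactions)


def count(items):
    """Number of transactions containing all the given items."""
    return sum(1 for t in transactions if set(items) <= t)


results = {}
for cause in ["lightning", "wind", "fire", "clouds", "rain"]:
    both = count([cause, "car_accident"])
    supp = both / n             # supp({cause, car_accident})
    conf = both / count([cause])  # conf({cause} -> {car_accident})
    results[cause] = (supp, conf)
    print(f"{{{cause}}} -> {{car_accident}}  support: {supp:.3f}  confidence: {conf:.2f}")
```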

(b) Find all the association rules for minimal support 0.6 and minimal confidence of 1.0 (certainty). Follow the apriori algorithm.

We first find all the frequent itemsets of size one. The minimal support requirement is 0.6,
which means that to be frequent an itemset must occur in at least 5 out of the 8 transactions (4/8 = 0.5 < 0.6, while 5/8 = 0.625 ≥ 0.6). There are five frequent itemsets:

{clouds} support: 0.75
{wind} support: 0.625
{lightning} support: 0.625
{rain} support: 0.625
{fire} support: 0.75

From the above itemsets we next generate all possible itemsets of size 2 and prune the itemsets with support below 0.6. Only two itemsets remain:

{lightning, fire} support: 0.625
{clouds, rain} support: 0.625


It is not possible to generate itemsets of size 3 from these 2 itemsets, because they share no item (their intersection is empty).

Based on the itemsets of size 2 we generate all possible association rules and compute their confidence:

{lightning} →{fire} support: 0.625 confidence: 1.0
{fire} → {lightning} support: 0.625 confidence: 0.833
{clouds} → {rain} support: 0.625 confidence: 0.833
{rain} → {clouds} support: 0.625 confidence: 1.0

Only two association rules have confidence equal to 1, {lightning} → {fire} and {rain} → {clouds}, and they form the final solution.
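The result of part (b) can be verified with a brute-force sketch over all pairs of items. Note that this is a plain enumeration for checking purposes, not the level-wise Apriori procedure followed above, and it only looks at pairs because the text has already established that no frequent itemset of size 3 exists. All names are chosen for this example.

```python
from itertools import combinations

events = [
    "car_accident rain lightning wind clouds fire",
    "fire clouds rain lightning wind",
    "car_accident fire wind",
    "clouds rain wind",
    "lightning fire rain clouds",
    "clouds wind car_accident",
    "rain lightning clouds fire",
    "lightning fire car_accident",
]
transactions = [set(e.split()) for e in events]
n = len(transactions)
items = sorted(set().union(*transactions))


def supp(itemset):
    """Support of an itemset: fraction of transactions containing all its items."""
    return sum(1 for t in transactions if set(itemset) <= t) / n


min_support, min_confidence = 0.6, 1.0

# Frequent itemsets of size 2, found by checking every pair of items
frequent_pairs = [frozenset(p) for p in combinations(items, 2) if supp(p) >= min_support]

# Both directions of each frequent pair are candidate rules
rules = []
for pair in frequent_pairs:
    for x in pair:
        (y,) = pair - {x}
        if supp(pair) / supp([x]) >= min_confidence:
            rules.append((x, y))

print(sorted(rules))
```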


Note: the direction of a rule matters. {rain} → {clouds} is not the same rule as {clouds} → {rain}.