# Association Rules



## Objectives

- Learn to use Python and _mlxtend_ (Machine Learning Extensions) to generate frequent itemsets from a set of transaction data.
- Use the mlxtend library to mine association rules from frequent itemsets.
- Apply association rule mining to unstructured text data.


## Introduction

When people buy cigarettes, do they tend to also buy chocolate or beer? If people have high cholesterol, do they also tend to have high blood pressure? If people buy car insurance, do they also buy house insurance? 
Answers to such questions can form the basis of brand positioning, advertising and even direct marketing. But how do we find out if such associations exist? And how can we search for them when our databases have tens of thousands of records and many fields? 

Association detection algorithms provide rules describing the values of fields that typically occur together. They can therefore be used as an approach to this area of data understanding. 

An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item found in the data. A consequent is an item that is found in combination with the antecedent.

Below are some common terms used when working on association rules:
- *Instances* indicates the number of records in the data set that match the antecedents. 
- *Support* refers to the percentage of records that match the antecedents. (Same as “Instances” but in percentage)
- *Confidence* is the percentage of all records matching the antecedents that also match the consequent. 
- *Rule Support* is the percentage of records that match the entire rule (both the antecedents and consequent). 
- *Lift* is the ratio of the rule confidence to the overall percentage occurrence of the consequent in the data. Think of it as a measure of how much better the rule predicts the consequent than a random-choice model would.
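To make these definitions concrete, the sketch below computes each term by hand for a single rule. The five transactions and the rule {milk} ⇒ {bread} are made up for illustration and are not taken from the data used in this practical.

```python
# Five hypothetical transactions; each is the set of items bought together.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"butter"},
]

antecedent = {"milk"}
consequent = {"bread"}
n = len(transactions)

# Instances: number of records that contain the antecedent.
instances = sum(antecedent <= t for t in transactions)
# Support (of the antecedent), as a fraction of all records.
support = instances / n
# Rule support: records that contain both the antecedent and the consequent.
rule_support = sum((antecedent | consequent) <= t for t in transactions) / n
# Confidence: of the records with the antecedent, the share that also have the consequent.
confidence = rule_support / support
# Lift: confidence relative to how often the consequent occurs overall.
consequent_support = sum(consequent <= t for t in transactions) / n
lift = confidence / consequent_support

print(instances, support, rule_support, confidence, lift)
```

Here milk appears in 3 of 5 records and milk with bread in 2 of 5, so the confidence is 2/3; bread appears in 3 of 5 records, which gives a lift slightly above 1.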

We have been using Scikit-Learn for our other machine learning tasks so far. Unfortunately, Scikit-Learn does not offer any support for frequent itemset generation or association rule mining, so in this practical we will be using the mlxtend (Machine Learning Extensions) package. The package is available at http://rasbt.github.io/mlxtend/. This Python extension offers tools that help in “day-to-day data science tasks”.

To install mlxtend, issue the following command in a conda command prompt:

```
conda install mlxtend --channel conda-forge
```


## Generating Frequent Itemsets

Let us now see how we can generate frequent itemsets given a set of transactions records.

### Step 1 Loading the Data File

Obtain a copy of the ```Shopping.csv``` file and save it in the same directory as your Jupyter notebook file.

Add code to read in the ```Shopping.csv``` file using *Pandas*. Print out the data to see the dataset.

<details>
    <summary><strong>Click here to view codes</strong></summary>


```
import pandas as pd

df = pd.read_csv("Shopping.csv")
df.head()
```
</details>


In [None]:
#Enter Codes here to import pandas and load Shopping.csv file



The following shows the printout of the shopping data:

```
Ready made Frozen foods Alcohol Fresh Vegetables Milk Bakery goods Fresh meat Toiletries Snacks Tinned Goods
...

0   1   0   0   0   0   0   0   0   1   0
1   1   0   0   0   0   0   0   1   0   0
2   1   0   0   0   0   0   0   1   1   0
3   1   0   0   0   1   1   0   1   1   0
4   1   0   0   0   0   0   0   1   1   0
```
The data consists of rows of transactions. Each column indicates whether an item was bought in the transaction: a value of 0 means _no_ while 1 means _yes_. However, this is not the format expected by the *mlxtend* library; we need ```True``` and ```False``` values instead of 0 and 1.

Modify your code to add a ```dtype=bool``` parameter to the ```read_csv()``` function call as follows:

```python
df = pd.read_csv("Shopping.csv", dtype=bool)
```

Run the code again with the ```dtype=bool``` parameter.

You should now see the following output:

```
Ready made Frozen foods Alcohol Fresh Vegetables  Milk Bakery goods Fresh meat Toiletries Snacks Tinned Goods
...
0   True   False   False   False   False   False   False   False   False   False
1   True   False   False   False   False   False   False   True    True    False
2   True   False   False   False   False   False   False   True    True    False
3   True   False   False   False   False   True    False   True    True    False
4   True   False   False   False   False   False   False   True    True    False
```

The values of 0 and 1 have been changed to ```False``` and ```True```.
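If the data has already been loaded as 0/1 integers, converting it in place is an alternative to re-reading the file. The sketch below uses a small made-up DataFrame (the name `basket` and its values are illustrative, not the contents of `Shopping.csv`) together with the pandas `astype(bool)` method.

```python
import pandas as pd

# Hypothetical 0/1 basket data standing in for the real file.
basket = pd.DataFrame({
    "Milk": [1, 0, 1],
    "Snacks": [0, 0, 1],
})

# astype(bool) maps 0 to False and any non-zero value to True.
basket_bool = basket.astype(bool)
print(basket_bool.dtypes)
print(basket_bool)
```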

We are now ready to generate the frequent itemsets.

Add the following lines to use the _Apriori_ algorithm to generate the frequent itemsets with a minimum support level of 10%:

```python
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
print(frequent_itemsets)
```

Recall that frequent itemsets are determined by the specified _support_ value. We stated a value of 0.1 (10%), so itemsets with a support of less than 10% will be discarded.
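To illustrate how a minimum support prunes itemsets level by level, here is a simplified pure-Python sketch of the idea behind Apriori. It is not the mlxtend implementation and omits the usual subset-pruning optimisation; the transactions and the 0.4 threshold are made up for the example.

```python
from itertools import combinations

# Hypothetical transactions, each a set of items.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "butter"},
    {"bread"},
    {"milk", "bread"},
]
min_support = 0.4
n = len(transactions)


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / n


# Level 1: single items that meet the minimum support.
items = sorted(set().union(*transactions))
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
frequent = {}

while level:
    for itemset in level:
        frequent[itemset] = support(itemset)
    # Build candidates one item larger by joining frequent itemsets,
    # then keep only those that still meet the minimum support.
    k = len(level[0]) + 1
    candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
    level = [c for c in candidates if support(c) >= min_support]

for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In this toy data, {bread, butter} occurs in only one of five transactions, so it is discarded, and no three-item set survives.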


In [None]:
#Enter your codes here to generate the frequent itemset using mlxtend


If you run the code, you will see the frequent itemsets generated by the ```apriori()``` function as follows:

         support                                           itemsets
    0   0.492366                                       (Ready made)
    1   0.402036                                     (Frozen foods)
    2   0.394402                                          (Alcohol)
    3   0.188295                                             (Milk)
    4   0.428753                                     (Bakery goods)
    5   0.474555                                           (Snacks)
    6   0.455471                                     (Tinned Goods)
    7   0.211196                         (Ready made, Frozen foods)
    8   0.212468                              (Alcohol, Ready made)
    9   0.133588                                 (Ready made, Milk)
    10  0.255725                         (Ready made, Bakery goods)
    ...

The listing shows the frequent itemsets together with the support of each itemset. We requested a minimum support of 10%; from the list, we see that there are a total of 52 itemsets with a support of 10% or more.
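Counting the itemsets, or keeping only those of a given size, can be done with ordinary pandas operations. The sketch below rebuilds a few rows of the listing above by hand under the hypothetical name `sample_itemsets`; in the notebook you would apply the same steps to the DataFrame returned by `apriori()`.

```python
import pandas as pd

# A few rows copied from the listing above, re-entered by hand for illustration.
sample_itemsets = pd.DataFrame({
    "support": [0.492366, 0.402036, 0.188295, 0.211196, 0.133588],
    "itemsets": [
        frozenset({"Ready made"}),
        frozenset({"Frozen foods"}),
        frozenset({"Milk"}),
        frozenset({"Ready made", "Frozen foods"}),
        frozenset({"Ready made", "Milk"}),
    ],
})

# Number of itemsets in the table.
print(len(sample_itemsets))

# Add the itemset size, then keep only the pairs.
sample_itemsets["length"] = sample_itemsets["itemsets"].apply(len)
pairs = sample_itemsets[sample_itemsets["length"] == 2]
print(pairs)
```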

## Generating Association Rules from Frequent Itemsets

Once we have the frequent itemsets, we can proceed to generate the association rules. Whereas frequent itemsets are selected by the desired __support__ value, association rules are generated from those frequent itemsets (which ensures wide applicability) and selected by their __confidence__ value.
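The following pure-Python sketch shows the idea of deriving rules from itemset supports: each frequent itemset is split into an antecedent and a consequent, and the split is kept if its confidence reaches the threshold. The supports below are made up, and this is an illustration of the principle rather than how mlxtend implements it.

```python
from itertools import combinations

# Hypothetical supports for a frequent itemset and all of its subsets.
supports = {
    frozenset({"milk"}): 0.8,
    frozenset({"bread"}): 0.8,
    frozenset({"milk", "bread"}): 0.6,
}
min_confidence = 0.7

rules = []
for itemset, itemset_support in supports.items():
    if len(itemset) < 2:
        continue  # a rule needs at least one item on each side
    # Try every non-empty proper subset as the antecedent.
    for size in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), size)):
            consequent = itemset - antecedent
            # Confidence = support of the whole itemset / support of the antecedent.
            confidence = itemset_support / supports[antecedent]
            if confidence >= min_confidence:
                rules.append((sorted(antecedent), sorted(consequent), confidence))

for rule in rules:
    print(rule)
```

Both directions of the rule pass here because each has a confidence of about 0.75.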

Let us now generate rules based on a confidence of 0.7 (70%).

Add and run the following code:

```python
from mlxtend.frequent_patterns import association_rules
print(association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7))
```




In [None]:
#Enter your codes here to generate rules from the frequent itemsets



You should see the following output:

                                     antecedents     consequents  
    0                                     (Milk)    (Ready made)   
    1                                     (Milk)  (Bakery goods)   
    2                         (Ready made, Milk)  (Bakery goods)   
    3                       (Milk, Bakery goods)    (Ready made)   
    4               (Frozen foods, Tinned Goods)  (Bakery goods)   
    5                    (Alcohol, Tinned Goods)  (Bakery goods)   
    6                       (Milk, Tinned Goods)  (Bakery goods)
    ...
    
    
    antecedent support  consequent support   support  confidence      lift  
    0             0.188295            0.492366  0.133588    0.709459  1.440918   
    1             0.188295            0.428753  0.139949    0.743243  1.733499   
    2             0.133588            0.428753  0.105598    0.790476  1.843663   
    3             0.139949            0.492366  0.105598    0.754545  1.532488   
    4             0.207379            0.428753  0.148855    0.717791  1.674137   
    5             0.173028            0.428753  0.123410    0.713235  1.663510   
    6             0.127226            0.428753  0.100509    0.790000  1.842552   
    
    
        leverage  conviction  
    0   0.040878    1.747204  
    1   0.059217    2.224856  
    2   0.048322    2.726405  
    3   0.036692    2.068137  
    4   0.059940    2.024201  
    5   0.049223    1.992040  
    6   0.045960    2.720223 
    
    
For the first rule, with ID 0, we get ```(Milk)``` &#8658; ```(Ready made)```.

**Support and Confidence**

The output also shows a support of ```0.133588``` (the fraction of transactions that contain both Milk and Ready made) and a confidence of ```0.709459```.

### Other Measures

Besides the usual support and confidence, we can also use other measures like *lift*, *leverage* and *conviction* to help us better understand the rules that were generated. These values are generated for us by default.

**Lift**

The lift value is ```1.440918```, which is greater than 1. The *lift* value indicates how the consequents and antecedents depend on each other.
- If lift < 1, the two are negatively dependent: if one occurs, the other is less likely to occur (substitution effect).
- If lift = 1, they are independent of each other, which means that the rule is useless.
- If lift > 1, they are positively dependent: the occurrence of one means a higher chance of the other occurring (complementary effect).

The higher the lift, the better the rule. For a useful association rule, the lift should be > 1.
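As a quick check, the lift of rule 0 can be recomputed from the confidence and the consequent support shown in the output above. The variable names are only for illustration.

```python
# Values for rule 0, (Milk) => (Ready made), copied from the output above.
confidence = 0.709459
consequent_support = 0.492366  # share of all transactions containing Ready made

# Lift compares the rule's confidence with the consequent's baseline frequency.
lift = confidence / consequent_support
print(round(lift, 4))
```

The result agrees with the ```lift``` column up to rounding of the inputs.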


**Leverage**

We also see another measure, *leverage*. *Leverage* is similar to *lift* in that it compares the rule to the case where the antecedents and consequents are independent: it is the difference between the observed rule support and the support that would be expected under independence. A leverage of 0 means that the antecedent and consequent itemsets are independent. For example, ```(Milk)``` &#8658; ```(Ready made)``` has a leverage value of 0.04 (> 0), which means that the two products sell together more often than would be expected if they were sold independently.
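The leverage of rule 0 can be reproduced from the three support values in the output above: subtract the support expected under independence (the product of the two individual supports) from the observed rule support. This is a hand calculation for illustration.

```python
# Values for rule 0, (Milk) => (Ready made), copied from the output above.
antecedent_support = 0.188295  # Milk
consequent_support = 0.492366  # Ready made
rule_support = 0.133588        # Milk and Ready made together

# Support we would expect if the two items were bought independently.
expected_if_independent = antecedent_support * consequent_support
leverage = rule_support - expected_if_independent
print(round(leverage, 4))
```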


**Conviction**

Conviction combines the _support_ of the consequent and the _confidence_ of the rule in a single measure. The value ranges from 0 to infinity, with a value of 1 indicating that the items are independent. The higher the value, the more the consequent depends on the antecedent. In short, the larger the value, the better.

The value for the rule ```(Milk)``` &#8658; ```(Ready made)``` is ```1.747204```.
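The sketch below recomputes this value, assuming the definition given in the mlxtend user guide: conviction = (1 - consequent support) / (1 - confidence). The helper function `conviction` is written here for illustration and is not part of mlxtend.

```python
import math


def conviction(consequent_support, confidence):
    """Conviction = (1 - consequent support) / (1 - confidence)."""
    if confidence == 1:
        return math.inf  # the rule is never violated in the data
    return (1 - consequent_support) / (1 - confidence)


# Rule 0, (Milk) => (Ready made), using the values from the output above.
print(round(conviction(0.492366, 0.709459), 4))
```

A rule with 100% confidence has a denominator of zero, which is why such rules are reported with a conviction of ```inf```.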

We will not discuss lift, leverage and conviction in detail. If you are interested, please refer to [http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules].

## Exercise

1. Run the apriori function again with 20% support. How many frequent itemsets are generated?

<details>
    <summary>
        <strong>Click for answer</strong>
    </summary>

```
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)
print(frequent_itemsets)
```
    
You should see that the number of frequent itemsets is 19.
  
</details>
    
2. With these itemsets for a support of 20%, generate rules based on a confidence threshold of 0.7. How many rules are generated?
    
<details>
    <summary>
        <strong>Click for answer</strong>
    </summary>
    
```
print(association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7))
```    
    
No rules can be generated from these frequent itemsets. In other words, the current transactional data yields no rules with a support of at least 20% and a confidence of at least 0.7.
    
</details>


In [None]:
#Enter your exercise codes here


## Word Associations

Besides analysing the things people buy together, we can also apply association rules to other cases, such as finding words that frequently appear together. From such associations, we can gain more insight into a corpus (a collection of text).

Let us try to apply association rule mining to a set of text and see what we can find.

### Step 1 Loading the Attractions.txt File

Obtain the file *Attractions.txt* and load the file using Pandas.

<details>
    <summary><strong>Click to see codes</strong></summary>
    
    
```
import pandas as pd
df = pd.read_csv("Attractions.txt")
df.head()
```
</details>

Print out using the ```head()``` function to see the data.


In [None]:
#Enter codes here to read in the Attractions.txt file using Pandas


You should see the following:

     	text
    0 	garden bay place wish aside entire cheap worth...
    1 	garden bay beautiful walk hours place guided e...
    2 	highlights visit singapore flower singapore cl...
    3 	loved garden door waterfall garden bay flower ...
    4 	beautiful recommend flower dome cloud dome gar...


As can be seen from the snapshot, it contains reviews about an attraction (Gardens by the Bay to be specific) in Singapore.

Each line is a single review, and text pre-processing (such as the removal of stop words) has already been performed on the text.
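For readers unfamiliar with stop-word removal, here is a minimal sketch of the idea. The stop-word list and the review sentence are made up; how the reviews in *Attractions.txt* were actually pre-processed is not documented here.

```python
# A tiny, made-up stop-word list; real lists (such as NLTK's) are much longer.
stop_words = {"the", "is", "a", "and", "to", "by"}

review = "The garden by the bay is a beautiful place to walk"

# Lower-case the text, split on whitespace and drop the stop words.
tokens = [word for word in review.lower().split() if word not in stop_words]
cleaned = " ".join(tokens)
print(cleaned)
```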

### Step 2 Configuring Natural Language Toolkit (NLTK)

We will need the ```nltk``` package to process the text data. 

"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum."- https://www.nltk.org/

You will first need to install it if it is not already installed on your machine. The conda command to install the toolkit is:

```conda install -c anaconda nltk ```

Once installed, NLTK still needs additional data files before it can be used. The following code downloads the required files.

Execute the following code:

```python
import nltk
nltk.download("popular")
```

In [None]:
#Execute your codes here to download additional files for NLTK



You should see that files are being downloaded (or updated):

```
    [nltk_data] Downloading collection 'popular'
    [nltk_data]    | 
    [nltk_data]    | Downloading package cmudict to /root/nltk_data...
    [nltk_data]    | Downloading package gazetteers to /root/nltk_data...
    [nltk_data]    | Downloading package genesis to /root/nltk_data...
    [nltk_data]    | Downloading package gutenberg to /root/nltk_data...
    ...
    [nltk_data]  Done downloading collection popular
```
It might take some time to download the required files depending on your network connection speed.

### Step 3 Apply Word Tokenization

We need to read in each review and break down the text into individual words. Each word is now an "item" in our market basket.

```python
import nltk
df = df.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)
print(df)
```
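Because the reviews appear to be already cleaned (lower-case words separated by spaces), a plain whitespace split is a lightweight alternative when the NLTK data files are not available. This is a sketch under that assumption, using two made-up reviews; it is not equivalent to `nltk.word_tokenize` on raw text with punctuation.

```python
import pandas as pd

# Two hypothetical pre-cleaned reviews in the same single-column layout.
reviews = pd.DataFrame({"text": ["garden bay place", "flower dome cloud dome"]})

# str.split() with no arguments splits each string on runs of whitespace.
tokens = reviews["text"].str.split()
print(tokens)
```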

In [None]:
#Enter your codes here to tokenize text into words



You should see that the text is now converted into lists of words as shown below:

    0     [garden, bay, place, wish, aside, entire, chea...
    1     [garden, bay, beautiful, walk, hours, place, g...
    2     [highlights, visit, singapore, flower, singapo...
    3     [loved, garden, door, waterfall, garden, bay, ...
    4     [beautiful, recommend, flower, dome, cloud, do...
    5     [singaporean, garden, garden, bay, walk, visit...
    6     [garden, showpiece, horticulture, garden, arti...
    7     [garden, spectacular, night, take, walk, great...
    ...
    
### Step 4 Convert to Transactional Format

The ```apriori()``` function works on data in transactional format, not on lists of words, so let us now convert it to the required format with the following code:

```python
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(df).transform(df)
df = pd.DataFrame(te_ary, columns=te.columns_)
print(df)
```


In [None]:
#Enter your codes here to encode transaction data


You should see the following output:

        activity  afternoon  allows  amaze  amount  ample  areas  arranged  \
    0      False      False   False  False   False  False  False     False   
    1      False      False   False   True   False  False  False     False   
    2      False      False   False   True   False  False  False     False   
    3      False      False   False  False   False  False  False     False   
    4      False      False   False  False   False  False  False     False   
    5      False      False   False  False   False  False  False     False 


Recall that in the market basket analysis we performed earlier, the transaction table was organised as follows:

- rows = transactions
- columns = items in the market
- True = item present in the transaction
- False = item not present in the transaction.

In this case, we have

- rows = reviews
- columns = a word in the corpus
- True = word occurs in the document
- False = word does not occur in the document
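The mapping above can be reproduced by hand on a tiny example. The pure-Python sketch below builds the same kind of True/False table for three made-up reviews; it illustrates the idea and is not how `TransactionEncoder` is implemented.

```python
# Hypothetical tokenised reviews (lists of words), as produced in Step 3.
reviews = [
    ["garden", "bay", "garden"],
    ["flower", "dome"],
    ["bay", "flower"],
]

# Columns: every distinct word in the corpus, in sorted order.
vocabulary = sorted({word for review in reviews for word in review})

# One row per review: True where the word occurs at least once.
rows = [[word in set(review) for word in vocabulary] for review in reviews]

print(vocabulary)
for row in rows:
    print(row)
```

Note that a word repeated within a review ("garden" in the first one) still yields a single True: the table records presence, not counts.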

### Step 5 Frequent Itemsets and Association Rule Mining

As we have done before, we will now apply the _Apriori_ algorithm to generate the frequent itemsets as well as mine the association rules. We use a minimum support of 0.4 and confidence threshold of 0.7.

Enter and run the following codes:

```python
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

print(df.columns.tolist())
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
print(frequent_itemsets.sort_values("support", ascending=False))
print(association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)) 
```

The code mines the rules and prints out the results.


In [None]:
#Enter your codes here to generate frequent itemsets and association rules


**Words in Corpus**

If you run the code, you will first see the list of columns, which are the unique words in the corpus.

    ['activity', 'afternoon', 'allows', 'amaze', 'amount', 'ample', 'areas', 'arranged', 'arrangements', 'artificial', 'artistry', 'aside', 'attractions', 'audio', 'barely', 'barrage', 'bay', 'beautiful', 'bite', 'blossom', 'break', 'breathtaking', 'bridge', 'buds', 'built', 'busy', 'cafes', 'calm', 'canalized', 'captions', 'chance', 'cheap', 'cheers', 'cherry', 'christmas', 'city', 'climate', 'close', 'cloud', 'come', 'compared', 'completing', 'complex',
    ...

This is the result of the statement
```python
print(df.columns.tolist())
```


**Frequent Itemsets**

You will next see the frequent itemsets from the statement:

```python
print(frequent_itemsets.sort_values("support", ascending=False))
```

```
     support                  itemsets
5   0.933333                  (garden)
1   0.733333                     (bay)
12  0.733333             (bay, garden)
4   0.533333                  (flower)
11  0.466667             (bay, flower)
14  0.466667       (beautiful, garden)
2   0.466667               (beautiful)
20  0.466667     (bay, garden, flower)
19  0.466667  (beautiful, bay, garden)
10  0.466667          (beautiful, bay)
15  0.466667          (flower, garden)
18  0.400000            (garden, walk)
17  0.400000           (visit, garden)
16  0.400000       (singapore, garden)
0   0.400000                   (amaze)
13  0.400000              (bay, visit)
9   0.400000           (amaze, garden)
8   0.400000                    (walk)
7   0.400000                   (visit)
6   0.400000               (singapore)
3   0.400000                    (dome)
21  0.400000      (bay, visit, garden)
```
    
The frequent itemsets tell us which words frequently occur together. For example, the words "amaze" and "garden" appear together with a support of 0.4 (40%). The itemset {"bay", "garden"} has a support of 73.33%, which is not surprising as the reviews are about Gardens by the Bay.

**Association Rules**


You should also see a set of rules:

                antecedents    consequents  antecedent support  
    0               (amaze)       (garden)            0.400000   
    1           (beautiful)          (bay)            0.466667   
    2              (flower)          (bay)            0.533333   
    3                 (bay)       (garden)            0.733333   
    4              (garden)          (bay)            0.933333   
    5               (visit)          (bay)            0.400000

The rules suggest associations between sets of words.

        consequent support   support  confidence      lift  leverage  conviction  
    0             0.933333  0.400000    1.000000  1.071429  0.026667         inf  
    1             0.733333  0.466667    1.000000  1.363636  0.124444         inf  
    2             0.733333  0.466667    0.875000  1.193182  0.075556    2.133333  
    3             0.933333  0.733333    1.000000  1.071429  0.048889         inf  
    4             0.733333  0.733333    0.785714  1.071429  0.048889    1.244444  
    5             0.733333  0.400000    1.000000  1.363636  0.106667         inf 

As expected, the rule ```(bay)``` &#8658; ```(garden)``` has a 100% confidence.

Other rules such as {"amaze"} &#8658; {"garden"} also have a high confidence of 100%, with a support of 40%. This tells us that many reviews use the word "amaze" together with "garden", which leads us to suspect that the reviews are largely positive.
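Confidence alone can flatter rules whose consequent is very common, so it can help to rank the rules by lift as well. The sketch below re-enters the six rules shown above by hand under the hypothetical name `sample_rules`; in the notebook you would sort the DataFrame returned by `association_rules()` in the same way.

```python
import pandas as pd

# The six rules shown above, re-entered by hand for illustration.
sample_rules = pd.DataFrame({
    "antecedents": ["amaze", "beautiful", "flower", "bay", "garden", "visit"],
    "consequents": ["garden", "bay", "bay", "garden", "bay", "bay"],
    "confidence": [1.0, 1.0, 0.875, 1.0, 0.785714, 1.0],
    "lift": [1.071429, 1.363636, 1.193182, 1.071429, 1.071429, 1.363636],
})

# Rank by lift so that rules driven by a very common consequent sink to the bottom.
ranked = sample_rules.sort_values("lift", ascending=False)
print(ranked)
```

For instance, (amaze) &#8658; (garden) has 100% confidence but a lift of only about 1.07, because "garden" already appears in roughly 93% of the reviews.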


## Summary

In this practical, we looked at how to generate frequent itemsets and mine association rules from them. We also looked at measures such as support, confidence, lift, leverage and conviction. Lastly, we did an exercise on a set of review text for a tourist attraction, which gave us a good idea of the words that are associated with each other.