## Finding Association Rules on Grocery Dataset
In this Notebook, we will implement an algorithm for mining association rules from a dataset. We will first test the algorithm on a small synthetic (artificial) dataset, then use it to find association rules in a larger dataset: the [grocery dataset](https://www.kaggle.com/irfanasrullah/groceries).

Our Notebooks in CSMODEL are designed to be guided learning activities. To use them, simply go through the cells from top to bottom, following the directions along the way. If you find any unclear parts or mistakes in the Notebooks, email me at arren.antioquia@dlsu.edu.ph

## Import
Import **numpy**, **pandas**, and **matplotlib**.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

## Synthetic Dataset
Before we use a more complicated dataset, we will first test our algorithm using a synthetic (artificial) dataset created using random numbers. The dataset contains 20 distinct items. There are 300 observations representing the baskets in the market-basket model. Each observation (basket) contains at most 8 items.

Let's first create the synthetic dataset using the [`choice`](https://docs.scipy.org/doc//numpy-1.10.4/reference/generated/numpy.random.choice.html) function of `numpy`. You may check the documentation of the function for further information. We set a fixed seed so that the synthetic dataset contains the same values on every run.
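
As a small illustration of why the seed matters, the sketch below draws a few baskets twice with the same seed and checks that both runs agree. The helper `make_baskets` and its parameters are hypothetical names used only for this example, and it uses a local `RandomState` instead of the global seed used in the notebook.

```python
import numpy as np


def make_baskets(n_baskets, n_items, max_size, seed):
    """Draw n_baskets random baskets, each with 1..max_size distinct items."""
    rng = np.random.RandomState(seed)  # local generator; global state is untouched
    baskets = []
    for _ in range(n_baskets):
        size = rng.randint(1, max_size + 1)  # the upper bound is exclusive
        # replace=False guarantees that an item appears at most once per basket
        baskets.append(np.sort(rng.choice(n_items, size=size, replace=False)))
    return baskets


first = make_baskets(5, 20, 8, seed=1)
second = make_baskets(5, 20, 8, seed=1)

# With the same seed, both runs produce identical baskets.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)
```

Changing the seed (or omitting it) would generally give different baskets on each run, which would make the answers to the questions below impossible to compare.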

In [3]:
np.random.seed(1)
baskets = [np.sort(np.random.choice(20, size=(np.random.randint(1, 9)), replace=False)) for i in range(300)]

Let's display the contents of the synthetic dataset. It should list 300 baskets with their contents.

In [4]:
for i, basket in enumerate(baskets):
    print('Basket', i, basket)

Basket 0 [ 3 10 14 15 17 18]
Basket 1 [ 2  3  5  8 17]
Basket 2 [ 2 11 12 16 17 18]
Basket 3 [16]
Basket 4 [ 4  5 12 16]
Basket 5 [ 0  1  3  9 17 18]
Basket 6 [11]
Basket 7 [ 3  8 13]
Basket 8 [ 2  5 12 13 19]
Basket 9 [ 1  9 12 14 19]
Basket 10 [6 7]
Basket 11 [ 1  6  8  9 13 15 17 18]
Basket 12 [ 3  6  9 11 12 14 18 19]
Basket 13 [ 1  3  4  5  7  8 11]
Basket 14 [ 1  2  9 11 15 16 18 19]
Basket 15 [ 3  4  8  9 10 14]
Basket 16 [ 8  9 12 13 18]
Basket 17 [1 5 8]
Basket 18 [6 7]
Basket 19 [ 3 19]
Basket 20 [ 5  8 13 15]
Basket 21 [ 1  3  4 15 19]
Basket 22 [ 0  3 12]
Basket 23 [ 1  5  7  9 14 17]
Basket 24 [10 14 17]
Basket 25 [0 3 4]
Basket 26 [ 4  7 10 18]
Basket 27 [ 2  5 11 17 18]
Basket 28 [ 1  3  7  8 11 17 18]
Basket 29 [0 2 4 9]
Basket 30 [ 3 15 17 19]
Basket 31 [8]
Basket 32 [ 3  9 11]
Basket 33 [ 4  5  7  8  9 16 17]
Basket 34 [ 2  6  7 19]
Basket 35 [5]
Basket 36 [ 1  3  7 10 12 15 16]
Basket 37 [ 1 14]
Basket 38 [ 2  3  6 13 18 19]
Basket 39 [ 2  6 11 14 16]
Basket 40 [3]
...

As of now, our dataset is represented as a list of lists. Instead of using this representation, we will convert our dataset to a matrix represented as a `pandas` `DataFrame`. The `DataFrame` will contain 300 rows, one per observation in the dataset, and 20 columns, one per distinct item in the dataset. The value in the cell in row `x` and column `y` is 1 if item `y` is in observation (basket) `x`; otherwise, it is 0.
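
To make the target layout concrete, here is a minimal sketch of the same conversion on a toy dataset of 3 baskets and 4 items. The names `toy_baskets`, `n_items`, and `toy_df` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Three toy baskets over 4 distinct items (0..3).
toy_baskets = [np.array([0, 2]), np.array([1]), np.array([0, 1, 3])]
n_items = 4

# Start from an all-zero matrix: one row per basket, one column per item.
toy_df = pd.DataFrame(0, index=range(len(toy_baskets)), columns=range(n_items))

# Mark the items that appear in each basket.
for row, basket in enumerate(toy_baskets):
    toy_df.iloc[row, basket] = 1

print(toy_df)
```

Row `0` of `toy_df` has a 1 in columns `0` and `2` only, mirroring basket `[0, 2]`.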

In [5]:
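# Start from a 300 x 20 matrix of zeros: one row per basket, one column per item.
syn_df = pd.DataFrame(0, index=range(300), columns=range(20))

# Set the cell to 1 for every item that appears in the basket.
for i, basket in enumerate(baskets):
    syn_df.iloc[i, basket] = 1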

Let's check the `DataFrame` representing the synthetic dataset here. In row `0`, the `DataFrame` should contain the value `1` in columns `3`, `10`, `14`, `15`, `17`, and `18`. All other columns in row `0` should contain the value `0`. You may check the other values against the list-of-lists representation that we displayed earlier.

In [6]:
print(syn_df)

     0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  \
0     0   0   0   1   0   0   0   0   0   0   1   0   0   0   1   1   0   1   
1     0   0   1   1   0   1   0   0   1   0   0   0   0   0   0   0   0   1   
2     0   0   1   0   0   0   0   0   0   0   0   1   1   0   0   0   1   1   
3     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   
4     0   0   0   0   1   1   0   0   0   0   0   0   1   0   0   0   1   0   
..   ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..   
295   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   
296   0   1   0   0   0   1   0   0   0   0   1   1   0   1   0   0   0   1   
297   0   1   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   
298   1   0   0   1   0   0   0   0   0   0   0   1   0   0   0   0   0   1   
299   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   

     18  19  
0     1   0  
1     0   0  
2     1  

## Rule Miner
Open the `rule_miner.py` file. Some of the functions in the `RuleMiner` class are not yet implemented. We will implement the missing parts of this class.

Import the `RuleMiner` class.

In [7]:
from rule_miner import RuleMiner

Instantiate a `RuleMiner` object with `support_t` equal to `10` and `confidence_t` equal to `0.6`. The field `support_t` represents the support threshold, while the field `confidence_t` represents the confidence threshold.

In [8]:
rule_miner = RuleMiner(10, 0.6)

Open the `rule_miner.py` file and complete the `get_support()` function. This function returns the support of an itemset, that is, the number of baskets that contain every item in the itemset.
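
The definition can be checked by hand on a tiny example. The sketch below is one possible way to count support on a 0/1 `DataFrame`; `support_count` is a hypothetical standalone helper, not the `get_support()` method of `RuleMiner`, whose structure is dictated by the inline comments in `rule_miner.py`.

```python
import pandas as pd


def support_count(df, itemset):
    """Count the rows (baskets) that contain every item in `itemset`."""
    # Select the item columns, keep rows where all of them are 1, count them.
    return int(df[list(itemset)].all(axis=1).sum())


# Four toy baskets over three items (columns 0, 1, 2).
toy_df = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1],
     [0, 1, 0]],
    columns=[0, 1, 2],
)

print(support_count(toy_df, [0]))        # 3
print(support_count(toy_df, [0, 1]))     # 2
print(support_count(toy_df, [0, 1, 2]))  # 1
```

Note that support can only stay the same or shrink as items are added to an itemset, which is the same pattern visible in the notebook's own results for `{0}`, `{0, 1}`, and `{0, 1, 2}`.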

Implement the `get_support()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

In [9]:
print(rule_miner.get_support(syn_df, [0]))
print(rule_miner.get_support(syn_df, [0, 1]))
print(rule_miner.get_support(syn_df, [0, 1, 2]))

64
16
4


**Question #1:** What is the support of the itemset `{0}`? 
- 64

**Question #2:** What is the support of the itemset `{0, 1}`? 
- 16

**Question #3:** What is the support of the itemset `{0, 1, 2}`? 
- 4

Open the `rule_miner.py` file again and complete the `get_frequent_itemsets()` function. This function returns a list of frequent itemsets in the dataset. The support of each frequent itemset should be greater than or equal to the support threshold.
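
For intuition, the sketch below shows a level-wise (Apriori-style) search on a toy dataset. It is an illustration under assumptions, not the implementation expected in `rule_miner.py`: the function names are hypothetical, and it returns every frequent itemset of every size as a dictionary, which may differ from the output format of `get_frequent_itemsets()`.

```python
from itertools import combinations

import pandas as pd


def support_count(df, itemset):
    """Count the rows (baskets) that contain every item in `itemset`."""
    return int(df[list(itemset)].all(axis=1).sum())


def frequent_itemsets_levelwise(df, support_t):
    """Map each frequent itemset (as a frozenset) to its support count."""
    frequent = {}
    # Level 1: single items that meet the threshold.
    current = [frozenset([col]) for col in df.columns
               if support_count(df, [col]) >= support_t]
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support_count(df, itemset)
        known = set(current)
        # Join step: merge pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step: keep a candidate only if all of its k-subsets are
        # frequent and its own support meets the threshold.
        current = [c for c in candidates
                   if all(frozenset(s) in known for s in combinations(c, k))
                   and support_count(df, c) >= support_t]
        k += 1
    return frequent


toy_df = pd.DataFrame(
    [[1, 1, 1, 0],
     [1, 1, 1, 0],
     [1, 1, 0, 1],
     [0, 1, 1, 0],
     [1, 0, 1, 0]],
    columns=["a", "b", "c", "d"],
)

result = frequent_itemsets_levelwise(toy_df, support_t=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Item `d` appears in only one basket, so it is dropped at the first level and never takes part in any larger candidate.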

Implement the `get_frequent_itemsets()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

In [10]:
frequent_itemsets = rule_miner.get_frequent_itemsets(syn_df)
print(frequent_itemsets)

[[1, 12, 7], [16, 1, 12], [1, 19, 12], [8, 17, 2]]


**Question #4:** List all the frequent itemsets in the dataset, given the support threshold `10`.
- \[1, 12, 7\]
- \[16, 1, 12\]
- \[1, 19, 12\]
- \[8, 17, 2\]

Using the `get_rules()` function in `rule_miner.py`, let us list all the possible rules for all frequent itemsets in our dataset. The `get_rules()` function returns a list of rules produced from an itemset.
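
As a rough picture of what listing all the possible rules means, the sketch below enumerates every split of an itemset into a non-empty left-hand side and a non-empty right-hand side. `rules_from_itemset` is a hypothetical helper written for this illustration; the order of its output is not meant to match that of `get_rules()`.

```python
from itertools import combinations


def rules_from_itemset(itemset):
    """List every rule [lhs, rhs] that splits `itemset` into two non-empty parts."""
    items = list(itemset)
    rules = []
    # The left-hand side may have 1 up to len(items) - 1 elements.
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            rhs = [item for item in items if item not in lhs]
            rules.append([list(lhs), rhs])
    return rules


rules = rules_from_itemset([1, 12, 7])
for lhs, rhs in rules:
    print(lhs, '->', rhs)
```

An itemset with `n` items yields `2**n - 2` rules, so a 3-item itemset produces 6 rules.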

In [11]:
for itemset in frequent_itemsets:
    print(rule_miner.get_rules(itemset))

[[[1, 12], [7]], [[7], [1, 12]], [[1, 7], [12]], [[12], [1, 7]], [[12, 7], [1]], [[1], [12, 7]]]
[[[16, 1], [12]], [[12], [16, 1]], [[16, 12], [1]], [[1], [16, 12]], [[1, 12], [16]], [[16], [1, 12]]]
[[[1, 19], [12]], [[12], [1, 19]], [[1, 12], [19]], [[19], [1, 12]], [[19, 12], [1]], [[1], [19, 12]]]
[[[8, 17], [2]], [[2], [8, 17]], [[8, 2], [17]], [[17], [8, 2]], [[17, 2], [8]], [[8], [17, 2]]]


Now that we have all the possible rules from our frequent itemsets, we should check whether the confidence of each rule is greater than or equal to the confidence threshold that we set.

To do this, open the `rule_miner.py` file again and complete the `get_confidence()` function. This function returns the confidence of a rule. Suppose the rule is `{1, 2} -> {3}`. The confidence of this rule is the support of `{1, 2, 3}` divided by the support of `{1, 2}`. In this code, we represent a rule as a list containing 2 lists: the first list holds the left-hand side of the rule (`{1, 2}` in our example), and the second list holds the right-hand side (`{3}` in our example).
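
The ratio can be worked out on a toy example. The sketch below is a standalone illustration, not the `get_confidence()` method itself; `support_count` and `confidence` are hypothetical helpers, and returning `0.0` when the left-hand side never occurs is an assumption made here to avoid a division by zero.

```python
import pandas as pd


def support_count(df, itemset):
    """Count the rows (baskets) that contain every item in `itemset`."""
    return int(df[list(itemset)].all(axis=1).sum())


def confidence(df, rule):
    """Confidence of rule [lhs, rhs]: support(lhs and rhs) / support(lhs)."""
    lhs, rhs = rule
    lhs_support = support_count(df, lhs)
    if lhs_support == 0:
        return 0.0  # assumption: treat an unseen left-hand side as confidence 0
    return support_count(df, lhs + rhs) / lhs_support


# Four toy baskets over three items (columns 0, 1, 2).
toy_df = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1],
     [0, 1, 0]],
    columns=[0, 1, 2],
)

# {0, 1} occurs in 2 baskets and {0, 1, 2} in 1 basket, so the confidence is 1 / 2.
print('{:.2f}'.format(confidence(toy_df, [[0, 1], [2]])))  # 0.50
```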

Implement the `get_confidence()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

In [12]:
print('{:.2f}'.format(rule_miner.get_confidence(syn_df, [[1, 2], [3]])))
print('{:.2f}'.format(rule_miner.get_confidence(syn_df, [[4, 5], [6]])))
print('{:.2f}'.format(rule_miner.get_confidence(syn_df, [[7, 8], [9]])))
print('{:.2f}'.format(rule_miner.get_confidence(syn_df, [[10, 11], [12]])))

0.21
0.07
0.09
0.13


**Question #5:** What is the confidence of the rule `{1, 2} -> {3}`? Limit to 2 decimal places.
- 0.21

**Question #6:** What is the confidence of the rule `{4, 5} -> {6}`? Limit to 2 decimal places.
- 0.07

**Question #7:** What is the confidence of the rule `{7, 8} -> {9}`? Limit to 2 decimal places.
- 0.09

**Question #8:** What is the confidence of the rule `{10, 11} -> {12}`? Limit to 2 decimal places.
- 0.13

We have now completed all the building blocks of our rule miner. The only function left to implement is the `get_association_rules()` function, which integrates all of these functions together.

Open the `rule_miner.py` file again and complete the `get_association_rules()` function. This function returns a list of association rules with support greater than or equal to the support threshold `support_t` and confidence greater than or equal to the confidence threshold `confidence_t`.
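
The final filtering step can be sketched as follows: given itemsets already known to be frequent, generate their rules and keep those that meet the confidence threshold. This is only an illustration with hypothetical names (`support_count`, `filter_rules`); it assumes the frequent itemsets are supplied by the caller and that every left-hand side has non-zero support, which holds for subsets of a frequent itemset.

```python
from itertools import combinations

import pandas as pd


def support_count(df, itemset):
    """Count the rows (baskets) that contain every item in `itemset`."""
    return int(df[list(itemset)].all(axis=1).sum())


def filter_rules(df, itemsets, confidence_t):
    """Keep the rules of each itemset whose confidence is >= confidence_t."""
    kept = []
    for itemset in itemsets:
        whole = support_count(df, itemset)
        for size in range(1, len(itemset)):
            for lhs in combinations(itemset, size):
                rhs = [item for item in itemset if item not in lhs]
                # Confidence = support of the whole itemset / support of the lhs.
                if whole / support_count(df, lhs) >= confidence_t:
                    kept.append([list(lhs), rhs])
    return kept


toy_df = pd.DataFrame(
    [[1, 1, 1, 0],
     [1, 1, 1, 0],
     [1, 1, 0, 1],
     [0, 1, 1, 0],
     [1, 0, 1, 0]],
    columns=["a", "b", "c", "d"],
)

# {a, b, c} has support 2 in toy_df; it is passed in as a known frequent itemset.
kept = filter_rules(toy_df, [["a", "b", "c"]], confidence_t=0.6)
for lhs, rhs in kept:
    print(lhs, '->', rhs)
```

In this toy data, each two-item left-hand side has support 3 (confidence 2/3) and each single item has support 4 (confidence 1/2), so only the rules with two items on the left survive the `0.6` threshold.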

Implement the `get_association_rules()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

With `support_t` equal to `10`, and `confidence_t` equal to `0.6`, let's get the association rules from this dataset.

In [14]:
rules = rule_miner.get_association_rules(syn_df)
print(rules)

[[[12, 7], [1]]]


**Question #9:** What is/are the association rules that we derived from the dataset?
- {12, 7} -> {1}

## Grocery Dataset
For this notebook, we will work on a dataset called the `grocery dataset`. This dataset contains 9835 rows, which represent transactions by customers shopping for groceries. The dataset contains 169 unique items.

The dataset is provided to you as a `.csv` file. `.csv` means comma-separated values. You can open the file in Notepad to see how it is exactly formatted.

If you view the `.csv` file in Excel, you can see that our dataset contains a list of items bought by a customer for each single transaction, represented in rows.

In [15]:
temp_df = pd.read_csv("groceries.csv", header=None)

Let's convert the items, represented as strings, to integers. To do this, let's create a dictionary that will contain the mapping for each item string to its corresponding integer. The dictionary should contain 169 unique strings, with integer mapping from 0 to 168.
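
The same idea on a toy table of three transactions, as a minimal sketch. The names `toy` and `mapping` are made up for this example; the missing cells are written as `None`, standing in for the empty cells that `read_csv` produces for shorter transactions.

```python
import pandas as pd

# Three toy transactions, padded with missing values to the same width.
toy = pd.DataFrame([
    ["milk", "bread", None],
    ["bread", "eggs", "milk"],
    ["eggs", None, None],
])

# Unique labels in order of first appearance, with missing values dropped.
labels = [v for v in pd.unique(toy.values.ravel()) if not pd.isnull(v)]

# Map each label to a consecutive integer ID starting at 0.
mapping = {label: idx for idx, label in enumerate(labels)}
print(mapping)
```

Because IDs are assigned in order of first appearance, the mapping depends on the row order of the file.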

In [16]:
values = temp_df.values.ravel()
values = [value for value in pd.unique(values) if not pd.isnull(value)]

value_dict = {}
for i, value in enumerate(values):
    value_dict[value] = i
    
print(value_dict)

{'citrus fruit': 0, 'semi-finished bread': 1, 'margarine': 2, 'ready soups': 3, 'tropical fruit': 4, 'yogurt': 5, 'coffee': 6, 'whole milk': 7, 'pip fruit': 8, 'cream cheese': 9, 'meat spreads': 10, 'other vegetables': 11, 'condensed milk': 12, 'long life bakery product': 13, 'butter': 14, 'rice': 15, 'abrasive cleaner': 16, 'rolls/buns': 17, 'UHT-milk': 18, 'bottled beer': 19, 'liquor (appetizer)': 20, 'potted plants': 21, 'cereals': 22, 'white bread': 23, 'bottled water': 24, 'chocolate': 25, 'curd': 26, 'flour': 27, 'dishes': 28, 'beef': 29, 'frankfurter': 30, 'soda': 31, 'chicken': 32, 'sugar': 33, 'fruit/vegetable juice': 34, 'newspapers': 35, 'packaged fruit/vegetables': 36, 'specialty bar': 37, 'butter milk': 38, 'pastry': 39, 'processed cheese': 40, 'detergent': 41, 'root vegetables': 42, 'frozen dessert': 43, 'sweet spreads': 44, 'salty snack': 45, 'waffles': 46, 'candy': 47, 'bathroom cleaner': 48, 'canned beer': 49, 'sausage': 50, 'brown bread': 51, 'shopping bags': 52, 'bev

As of now, the `DataFrame` representation of the transactions contains 9835 rows, wherein each row contains a list of strings representing the items bought in that transaction. We want to convert this representation to a list of lists, with the corresponding integers as values instead of the strings.
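
On a toy table, the target representation looks like this. The sketch uses a plain list comprehension instead of the `stack`/`map`/`unstack` approach in the cell below, purely to show the intended result; `toy`, `mapping`, and `toy_baskets` are hypothetical names.

```python
import pandas as pd

# Three toy transactions, padded with missing values to the same width.
toy = pd.DataFrame([
    ["milk", "bread", None],
    ["bread", "eggs", "milk"],
    ["eggs", None, None],
])
mapping = {"milk": 0, "bread": 1, "eggs": 2}

# For each row, translate the present items to integer IDs and sort them.
toy_baskets = [
    sorted(mapping[item] for item in row if not pd.isnull(item))
    for row in toy.values
]
print(toy_baskets)  # [[0, 1], [0, 1, 2], [2]]
```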

In [17]:
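# Replace every item string with its integer ID; empty cells stay as NaN.
temp_df = temp_df.stack().map(value_dict).unstack()

# Collect the sorted item IDs of each transaction, skipping the NaN padding.
baskets = []
for i in range(temp_df.shape[0]):
    basket = np.sort([int(x) for x in temp_df.iloc[i].values.tolist() if pd.notnull(x)])
    baskets.append(basket)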

Let's display the contents of the dataset. It should list 9835 baskets with their contents.

In [18]:
for i, basket in enumerate(baskets):
    print('Basket', i, basket)

 [  0   4   5  11  38  39  46  47 107]
Basket 9139 [ 5 29 42 60]
Basket 9140 [ 11  17  31  50  72 100 126]
Basket 9141 [19 31]
Basket 9142 [ 52 126 129]
Basket 9143 [17 35 39 57 58]
Basket 9144 [ 13  17  31  32  54  60 121]
Basket 9145 [ 7  8 11 25 27 33 53 78]
Basket 9146 [ 7 11 19 29]
Basket 9147 [  4   7  30  41  45  77 123]
Basket 9148 [24 57]
Basket 9149 [49]
Basket 9150 [8]
Basket 9151 [8]
Basket 9152 [35 50 54 57 60 62 65 74 80 83]
Basket 9153 [  7  24  54  78  92 132]
Basket 9154 [ 13  50  62  72  94 158]
Basket 9155 [19 29 33 63]
Basket 9156 [ 7 11 31 52]
Basket 9157 [39]
Basket 9158 [32]
Basket 9159 [ 19  24  78  85  93 115]
Basket 9160 [  0   2   9  29 135]
Basket 9161 [  5  14  17  31  35  46  97 160 161]
Basket 9162 [42 60 65 93]
Basket 9163 [ 11  29  35  40  60  74  89 115]
Basket 9164 [ 34  52  60 115 127]
Basket 9165 [ 14  24  41 136]
Basket 9166 [17 23 33 42 92]
Basket 9167 [ 4  5 11 19 34 35]
Basket 9168 [ 7 11 21 46 54 57 65 92]
Basket 9169 [  5  14  19  23  24  29  

As of now, our dataset is represented as a list of lists. Instead of using this representation, we will convert our dataset to a matrix represented as a `pandas` `DataFrame`. The `DataFrame` will contain 9835 rows, one per observation in the dataset, and 169 columns, one per distinct item in the dataset. The value in the cell in row `x` and column `y` is 1 if item `y` is in observation (basket) `x`; otherwise, it is 0.

In [19]:
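# Start from a 9835 x 169 matrix of zeros, with the item names as column labels.
grocery_df = pd.DataFrame(0, index=range(9835), columns=values)

# Set the cell to 1 for every item that appears in the basket.
for i, basket in enumerate(baskets):
    grocery_df.iloc[i, basket] = 1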

Let's check the `DataFrame` representing the dataset here. In row `0`, the `DataFrame` should contain the value `1` in columns `citrus fruit`, `semi-finished bread`, `margarine`, and `ready soups`. All other columns in row `0` should contain the value `0`. You may check the other values against the list-of-lists representation that we displayed earlier.

In [20]:
print(grocery_df)

      citrus fruit  semi-finished bread  margarine  ready soups  \
0                1                    1          1            1   
1                0                    0          0            0   
2                0                    0          0            0   
3                0                    0          0            0   
4                0                    0          0            0   
...            ...                  ...        ...          ...   
9830             1                    0          0            0   
9831             0                    0          0            0   
9832             1                    0          0            0   
9833             0                    1          0            0   
9834             0                    0          0            0   

      tropical fruit  yogurt  coffee  whole milk  pip fruit  cream cheese  \
0                  0       0       0           0          0             0   
1                  1       1       1     

Instantiate a `RuleMiner` object with `support_t` equal to `85` and `confidence_t` equal to `0.6`. The field `support_t` represents the support threshold, while the field `confidence_t` represents the confidence threshold.

In [21]:
rule_miner = RuleMiner(85, 0.6)

With `support_t` equal to `85`, and `confidence_t` equal to `0.6`, let's get the association rules from this dataset.

In [22]:
rules = rule_miner.get_association_rules(grocery_df)
print(rules)

[[['yogurt', 'butter'], ['whole milk']]]


**Question #10:** What is/are the association rules that we derived from the dataset?
- {yogurt, butter} -> {whole milk}