<a href="https://colab.research.google.com/github/moktan456/Data-Mining/blob/main/Answers_04_AssociationPatternMining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supermarket basket association pattern mining

In this question, we perform association pattern mining using the supermarket dataset `supermarket.arff` from the [Weka MOOC](https://www.cs.waikato.ac.nz/ml/weka/courses.html).


1. Load the data file `supermarket.arff` into a pandas data frame

2. Remove the following attributes
  - `department*`
  - `non host support`
  - `total`

3.  Select the Apriori algorithm and perform frequent itemset mining with minsup = 0.2 and minconf = 0.8 and find out:
  - The numbers of frequent 2-itemsets, and 3-itemsets.
  - The best three rules with the largest confidence. Examine these rules and describe them in your own words.

4. The supermarket manager wishes to boost the sale of fruit and therefore needs to know which other itemsets are most likely to be purchased with fruit in order to make promotion decisions.
  - Using the same minimum support and minimum confidence value.
  - List the top three itemsets to report to the supermarket manager.

5. Repeat task 3, but using the FP Growth algorithm instead.  
  - Compare the rules found.
  - Are they consistent?

## 0 Upgrade mlxtend
The default version of `mlxtend` on Google Colaboratory is too old for this prac,
so we must upgrade it. We want at least version 0.18.
Note that code statements beginning with `!` are not Python code but shell commands. If you are running this in a personal JupyterLab you might have to update this module a different way.

In [None]:
! pip install --upgrade 'mlxtend>=0.18'



In [None]:
# Check we have the right version
import mlxtend
print(mlxtend.__version__)

0.22.0


If you ran the two cells above in reverse order then you'll have to restart the kernel before you can load the newer version of the `mlxtend` module.

To do this: choose "Runtime" -> "Restart runtime".
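
The "at least version 0.18" requirement can also be checked in code rather than by eye. The following is a minimal sketch, assuming the version string has the plain `major.minor.patch` form shown above; the helper name `major_minor` is hypothetical and not part of `mlxtend`.

```python
def major_minor(version):
    """Return the first two numeric components of a version string."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

# In the notebook this string would come from mlxtend.__version__
installed = "0.22.0"

# Compare as integer tuples: as plain strings, "0.9" would sort after "0.18"
meets_requirement = major_minor(installed) >= (0, 18)
print(meets_requirement)
```

Comparing tuples of integers avoids the pitfall of comparing version strings character by character.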

In [None]:
import pandas as pd
from scipy.io import arff
import urllib
import urllib.request
import numpy as np

## 1 Load the data file `supermarket.arff` into a pandas data frame

We did this in a previous prac: download the file into your working directory using `urllib`, load it using `scipy`, and then convert it to a `pandas` data frame. The file on the Weka website has a few problems that we need to work around, so I've provided a cleaned version of the data on [GitHub](https://raw.githubusercontent.com/PaulHancock/COMP5009_pracs/main/data/supermarket.arff).

In [None]:
data_url = 'https://raw.githubusercontent.com/PaulHancock/COMP5009_pracs/main/data/supermarket.arff'
file_name = 'supermarket.arff'
urllib.request.urlretrieve(data_url, file_name)

('supermarket.arff', <http.client.HTTPMessage at 0x7a8075074850>)

In [None]:
# load the data from arff format
data = arff.loadarff('supermarket.arff')
raw_df = pd.DataFrame(data[0])
# The data table is 1 and 0, but we want it to be boolean (true/false) so we
# need to convert from int -> bool
df = raw_df.astype(bool)

In [None]:
df.describe()

Unnamed: 0,department1,department2,department3,department4,department5,department6,department7,department8,department9,grocery misc,...,department208,department209,department210,department211,department212,department213,department214,department215,department216,total
count,4627,4627,4627,4627,4627,4627,4627,4627,4627,4627,...,4627,4627,4627,4627,4627,4627,4627,4627,4627,4627
unique,2,2,2,2,2,2,2,1,2,2,...,1,1,2,2,2,2,1,1,1,1
top,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
freq,3580,4496,4537,4543,4452,4625,4560,4627,4545,4449,...,4627,4627,4436,4420,4589,4605,4627,4627,4627,4627


## 2 Remove attributes
Remove the following attributes as they have been deemed not useful:
  - `department*`
  - `non host support`
  - `total`


In [None]:
cols_to_drop = ['non host support', 'total']
# Instead of hand writing all the names that start with department, use a loop
for col in df.columns:
  if col.startswith('department'): # choose all the columns which start with the word 'department'
    cols_to_drop.append(col)
print("The following columns will be dropped:")
print(cols_to_drop)

The following columns will be dropped:
['non host support', 'total', 'department1', 'department2', 'department3', 'department4', 'department5', 'department6', 'department7', 'department8', 'department9', 'department11', 'department57', 'department70', 'department79', 'department80', 'department81', 'department88', 'department89', 'department98', 'department100', 'department101', 'department102', 'department107', 'department108', 'department109', 'department110', 'department111', 'department112', 'department113', 'department114', 'department116', 'department117', 'department118', 'department119', 'department120', 'department122', 'department123', 'department124', 'department125', 'department126', 'department127', 'department128', 'department129', 'department130', 'department137', 'department138', 'department139', 'department140', 'department141', 'department142', 'department143', 'department144', 'department145', 'department146', 'department147', 'department148', 'department149', 'depa

In [None]:
df = df.drop(columns=cols_to_drop)

In [None]:
# Confirm we have dropped the columns by showing a summary; we should have 104 columns left, all with descriptive names.
df.describe()

Unnamed: 0,grocery misc,baby needs,bread and cake,baking needs,coupons,juice-sat-cord-ms,tea,biscuits,canned fish-meat,canned fruit,...,casks red wine,750ml white nz,750ml red nz,750ml white imp,750ml red imp,sparkling nz,sparkling imp,brew kits/accesry,port and sherry,ctrled label wine
count,4627,4627,4627,4627,4627,4627,4627,4627,4627,4627,...,4627,4627,4627,4627,4627,4627,4627,4627,4627,4627
unique,2,2,2,2,1,2,2,2,2,2,...,2,2,2,2,2,2,2,1,2,1
top,False,False,True,True,False,True,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
freq,4449,4008,3330,2795,4627,2463,3731,2605,3686,3344,...,4576,4346,4536,4528,4530,4498,4604,4627,4602,4627


## 3 Select the Apriori algorithm

Select the Apriori algorithm and perform frequent itemset mining with `minsup = 0.2` and `minconf = 0.8` and find out:

- The numbers of frequent 2-itemsets, and 3-itemsets.
- The best three rules with largest confidence. Examine these rules and describe them in your own words.
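
To make `minsup` and `minconf` concrete, here is a minimal sketch that computes support and confidence by hand. The baskets are hypothetical toy data, not taken from `supermarket.arff`, and the helper functions `support` and `confidence` are illustrative names rather than `mlxtend` functions.

```python
# Hypothetical toy baskets (each set is one transaction)
baskets = [
    {"bread", "milk"},
    {"bread", "fruit"},
    {"bread", "milk", "fruit"},
    {"milk"},
    {"bread", "milk", "fruit"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# {bread, milk} appears in 3 of the 5 baskets
sup_bread_milk = support({"bread", "milk"}, baskets)

# Of the 4 baskets containing milk, 3 also contain bread
conf_milk_to_bread = confidence({"milk"}, {"bread"}, baskets)

print(sup_bread_milk, conf_milk_to_bread)
```

With `minsup = 0.2` the itemset `{bread, milk}` would count as frequent, but the rule `{milk} -> {bread}` would fall below `minconf = 0.8`.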

The `apriori` algorithm is found in the `mlxtend` package, so we import it along with the `association_rules` function.

In [None]:
from mlxtend.frequent_patterns import apriori, association_rules

In [None]:
ap_itemsets = apriori(df,
                      min_support=0.2,  # choose the (relative) minsup
                      use_colnames=True)



In [None]:
ap_itemsets



Unnamed: 0,support,itemsets
0,0.719689,(bread and cake)
1,0.604063,(baking needs)
2,0.532310,(juice-sat-cord-ms)
3,0.563000,(biscuits)
4,0.203372,(canned fish-meat)
...,...,...
541,0.224552,"(fruit, biscuits, vegetables, frozen foods)"
542,0.219365,"(fruit, biscuits, milk-cream, vegetables)"
543,0.228442,"(fruit, vegetables, milk-cream, frozen foods)"
544,0.202939,"(milk-cream, baking needs, bread and cake, veg..."


Now that we have our itemsets we want to choose those with `2<=k<=3`.
This isn't explicitly stored within our dataframe so we'll make a new column which is just the value of `len(itemsets)`.

In [None]:
def find_k(row):
  """Return the number of items in the itemset"""
  return len(row['itemsets'])

# Create a new column which counts the number of items in the itemset
ap_itemsets['k'] = ap_itemsets.apply(find_k, # Apply the function `find_k`
                                     axis=1) # apply the function to each row



In [None]:
ap_itemsets



Unnamed: 0,support,itemsets,k
0,0.719689,(bread and cake),1
1,0.604063,(baking needs),1
2,0.532310,(juice-sat-cord-ms),1
3,0.563000,(biscuits),1
4,0.203372,(canned fish-meat),1
...,...,...,...
541,0.224552,"(fruit, biscuits, vegetables, frozen foods)",4
542,0.219365,"(fruit, biscuits, milk-cream, vegetables)",4
543,0.228442,"(fruit, vegetables, milk-cream, frozen foods)",4
544,0.202939,"(milk-cream, baking needs, bread and cake, veg...",5


In [None]:
k2_itemsets = np.sum(ap_itemsets['k'] == 2) # count the number of rows where k=2
k3_itemsets = np.sum(ap_itemsets['k'] == 3) # count the number of rows where k=3
print(f"There are {k2_itemsets} itemsets with k=2")
print(f"There are {k3_itemsets} itemsets with k=3")

There are 182 itemsets with k=2
There are 252 itemsets with k=3
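
As an alternative to counting one value of `k` at a time, the sizes can be tallied in a single pass. This is a sketch on a hypothetical stand-in frame (`toy_itemsets`) with the same `support` and `itemsets` columns that `apriori` returns; it uses only standard `pandas` calls.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by apriori(...)
toy_itemsets = pd.DataFrame({
    "support": [0.7, 0.6, 0.5, 0.3],
    "itemsets": [
        frozenset({"bread"}),
        frozenset({"milk"}),
        frozenset({"bread", "milk"}),
        frozenset({"bread", "milk", "fruit"}),
    ],
})

# len() works directly on each frozenset, so no row-wise helper is needed
toy_itemsets["k"] = toy_itemsets["itemsets"].apply(len)

# One call tallies every itemset size at once
counts_by_k = toy_itemsets["k"].value_counts().sort_index()
print(counts_by_k)
```

Applied to `ap_itemsets`, the same two lines would report the number of itemsets for every `k` in one table.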




In [None]:
# Now let's see the top 10 itemsets by support
# try either .head() or .nlargest(10,'support')
ap_itemsets.nlargest(10, 'support')



Unnamed: 0,support,itemsets,k
0,0.719689,(bread and cake),1
28,0.640156,(fruit),1
29,0.639939,(vegetables),1
23,0.635185,(milk-cream),1
1,0.604063,(baking needs),1
12,0.587206,(frozen foods),1
3,0.563,(biscuits),1
2,0.53231,(juice-sat-cord-ms),1
51,0.505079,"(bread and cake, milk-cream)",2
16,0.503566,(party snack foods),1


Note that nine of the top 10 itemsets are 1-itemsets; the only exception is the 2-itemset (bread and cake, milk-cream). Is this surprising to you?
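
Support can only stay the same or fall as items are added to an itemset, which is why single items dominate a ranking by support. A small sketch of this property, using a hypothetical one-hot table laid out like `df` (one boolean column per item):

```python
import pandas as pd

# Hypothetical one-hot basket table: one row per transaction
toy = pd.DataFrame({
    "bread": [True, True, True, False, True],
    "milk":  [True, False, True, True, True],
    "fruit": [False, True, True, False, True],
})

def support(items):
    """Fraction of rows in which every listed item is True."""
    return toy[list(items)].all(axis=1).mean()

sup_1 = support(["bread"])
sup_2 = support(["bread", "milk"])
sup_3 = support(["bread", "milk", "fruit"])

# Each added item can only remove rows from the match, never add them
print(sup_1, sup_2, sup_3)
```

This is the same property that Apriori relies on to prune candidates: an itemset cannot be frequent if any of its subsets is infrequent.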

We use these itemsets to generate association rules with a minimum confidence of 0.8.

In [None]:
ap_rules = association_rules(ap_itemsets,
                             metric='confidence',
                             min_threshold=0.8) # choose the minimum confidence value



In [None]:
ap_rules.head()



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(canned fruit),(bread and cake),0.277285,0.719689,0.224768,0.8106,1.12632,0.025208,1.479997,0.155183
1,(jams-spreads),(bread and cake),0.276205,0.719689,0.221958,0.803599,1.116593,0.023177,1.427242,0.144265
2,(margarine),(bread and cake),0.494489,0.719689,0.395721,0.800262,1.111956,0.039843,1.403396,0.199172
3,(small goods),(bread and cake),0.241193,0.719689,0.201426,0.835125,1.160398,0.027843,1.700148,0.182163
4,"(biscuits, baking needs)",(bread and cake),0.381241,0.719689,0.314675,0.825397,1.14688,0.0403,1.605419,0.206978


Note that the rules above are not sorted by confidence. We should do that ourselves by using the `sort_values` function.

In [None]:
ap_rules.sort_values('confidence', ascending=False)



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
179,"(frozen foods, biscuits, fruit, vegetables)",(bread and cake),0.224552,0.719689,0.200778,0.894129,1.242383,0.039171,2.647667,0.251590
139,"(fruit, biscuits, margarine)",(bread and cake),0.231900,0.719689,0.202723,0.874185,1.214670,0.035828,2.227955,0.230089
132,"(fruit, biscuits, frozen foods)",(bread and cake),0.282905,0.719689,0.247028,0.873186,1.213282,0.043425,2.210406,0.245141
138,"(biscuits, milk-cream, vegetables)",(bread and cake),0.267128,0.719689,0.232332,0.869741,1.208496,0.040083,2.151954,0.235410
117,"(fruit, margarine, baking needs)",(bread and cake),0.244003,0.719689,0.212016,0.868911,1.207342,0.036410,2.138320,0.227163
...,...,...,...,...,...,...,...,...,...,...
153,"(bread and cake, frozen foods, vegetables)",(fruit),0.334558,0.640156,0.268424,0.802326,1.253329,0.054255,1.820389,0.303745
152,"(bread and cake, fruit, frozen foods)",(vegetables),0.334558,0.639939,0.268424,0.802326,1.253752,0.054328,1.821483,0.304150
91,"(breakfast food, vegetables)",(fruit),0.275989,0.640156,0.221310,0.801879,1.252632,0.044634,1.816290,0.278561
173,"(milk-cream, vegetables, frozen foods)",(fruit),0.285066,0.640156,0.228442,0.801365,1.251828,0.045955,1.811583,0.281380


Now describe the first three that you see above in your own words.
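
When describing a rule it can help to see how the metric columns relate to one another. The sketch below recomputes confidence, lift, leverage and conviction for the top rule from the three support values printed in the table, using the standard definitions of these measures. Because the inputs are rounded to six decimals, the results only agree with the table to within rounding.

```python
# Values copied from the top rule in the sorted table:
# (frozen foods, biscuits, fruit, vegetables) -> (bread and cake)
antecedent_support = 0.224552
consequent_support = 0.719689
rule_support = 0.200778  # support of antecedent and consequent together

# How often the consequent appears in baskets that contain the antecedent
confidence = rule_support / antecedent_support

# Confidence relative to the consequent's base rate; above 1 means positive association
lift = confidence / consequent_support

# Observed joint support minus the support expected under independence
leverage = rule_support - antecedent_support * consequent_support

# Ratio of expected to observed frequency of the rule being wrong
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.4f} "
      f"leverage={leverage:.4f} conviction={conviction:.3f}")
```

In words: about 89% of baskets containing frozen foods, biscuits, fruit and vegetables also contain bread and cake, which is roughly 1.24 times the 72% of all baskets that contain bread and cake.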

## 4 Boost fruit sales
The supermarket manager wishes to boost the sale of fruit and therefore needs to know which other itemsets are most likely to be purchased with fruit in order to make promotion decisions.
  - Using the same minimum support and minimum confidence value.
  - List the top three itemsets to report to the supermarket manager.

In [None]:
# choose all the rules which have "fruit" (not canned fruit) as the consequent
fruit_rules = ap_rules[ap_rules.consequents == frozenset(['fruit'])]

fruit_rules.sort_values('confidence',  # sort based on 'confidence'
                        ascending=False).head(3) # choose the top 3 only



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
177,"(bread and cake, frozen foods, biscuits, veget...",(fruit),0.242057,0.640156,0.200778,0.829464,1.295723,0.045824,2.110082,0.301118
172,"(biscuits, milk-cream, vegetables)",(fruit),0.267128,0.640156,0.219365,0.821197,1.282809,0.048361,2.012523,0.300817
140,"(bread and cake, biscuits, vegetables)",(fruit),0.321375,0.640156,0.262805,0.817754,1.27743,0.057076,1.974497,0.320027
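
The filter above keeps only rules whose consequent is exactly `(fruit)`. If rules with larger consequents that include fruit should also be reported, a membership test can be used instead. The sketch below runs on a hypothetical `toy_rules` frame with made-up confidences; whether `ap_rules` actually contains any multi-item consequents with fruit is not checked here.

```python
import pandas as pd

# Hypothetical rules frame with the same column names as association_rules output
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"biscuits"}),
        frozenset({"milk-cream"}),
        frozenset({"bread and cake"}),
        frozenset({"tea"}),
    ],
    "consequents": [
        frozenset({"fruit"}),
        frozenset({"fruit", "vegetables"}),
        frozenset({"canned fruit"}),
        frozenset({"fruit"}),
    ],
    "confidence": [0.82, 0.85, 0.90, 0.81],
})

# Membership is tested on whole item names, so "canned fruit" does not match "fruit"
has_fruit = toy_rules["consequents"].apply(lambda items: "fruit" in items)

with_fruit = toy_rules[has_fruit].sort_values("confidence", ascending=False)
print(with_fruit)
```

The exact-match filter is the stricter choice and is the easier one to explain to the manager, since every reported rule predicts fruit alone.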


## 5 FP-Growth
Repeat task 3, but using the FP Growth algorithm instead.  
  - Compare the rules found.
  - Are they consistent?

Import the `fpgrowth` function from the `mlxtend` package.

In [None]:
from mlxtend.frequent_patterns import fpgrowth




In [None]:
fp_itemsets = fpgrowth(df,
                       min_support=0.2, # choose the minimum support
                       use_colnames=True)



In [None]:
fp_rules = association_rules(fp_itemsets,
                             metric='confidence',
                             min_threshold=0.8) # choose the minimum confidence



There are a lot of rules, so let's compare just the 10 most confident rules from each algorithm.

In [None]:
# Select the top 10 confident rules from each of our algorithms
fp_top_10 = fp_rules.sort_values('confidence', ascending=False).head(10)
ap_top_10 = ap_rules.sort_values('confidence', ascending=False).head(10)



In [None]:
print("FP-Growth rules")
fp_top_10

FP-Growth rules




Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
46,"(frozen foods, biscuits, fruit, vegetables)",(bread and cake),0.224552,0.719689,0.200778,0.894129,1.242383,0.039171,2.647667,0.25159
101,"(fruit, biscuits, margarine)",(bread and cake),0.2319,0.719689,0.202723,0.874185,1.21467,0.035828,2.227955,0.230089
31,"(fruit, biscuits, frozen foods)",(bread and cake),0.282905,0.719689,0.247028,0.873186,1.213282,0.043425,2.210406,0.245141
49,"(biscuits, milk-cream, vegetables)",(bread and cake),0.267128,0.719689,0.232332,0.869741,1.208496,0.040083,2.151954,0.23541
93,"(fruit, margarine, baking needs)",(bread and cake),0.244003,0.719689,0.212016,0.868911,1.207342,0.03641,2.13832,0.227163
42,"(frozen foods, biscuits, vegetables)",(bread and cake),0.278798,0.719689,0.242057,0.868217,1.206378,0.041409,2.127067,0.237205
94,"(fruit, margarine, milk-cream)",(bread and cake),0.237087,0.719689,0.205749,0.867821,1.205829,0.03512,2.120699,0.223741
34,"(frozen foods, biscuits, milk-cream)",(bread and cake),0.271234,0.719689,0.235358,0.867729,1.2057,0.040154,2.11922,0.234103
41,"(fruit, biscuits, vegetables)",(bread and cake),0.303436,0.719689,0.262805,0.866097,1.203432,0.044426,2.093388,0.242682
88,"(margarine, milk-cream, baking needs)",(bread and cake),0.246812,0.719689,0.213313,0.864273,1.200899,0.035685,2.065261,0.22211


In [None]:
print("Apriori rules")
ap_top_10

Apriori rules




Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
179,"(frozen foods, biscuits, fruit, vegetables)",(bread and cake),0.224552,0.719689,0.200778,0.894129,1.242383,0.039171,2.647667,0.25159
139,"(fruit, biscuits, margarine)",(bread and cake),0.2319,0.719689,0.202723,0.874185,1.21467,0.035828,2.227955,0.230089
132,"(fruit, biscuits, frozen foods)",(bread and cake),0.282905,0.719689,0.247028,0.873186,1.213282,0.043425,2.210406,0.245141
138,"(biscuits, milk-cream, vegetables)",(bread and cake),0.267128,0.719689,0.232332,0.869741,1.208496,0.040083,2.151954,0.23541
117,"(fruit, margarine, baking needs)",(bread and cake),0.244003,0.719689,0.212016,0.868911,1.207342,0.03641,2.13832,0.227163
133,"(frozen foods, biscuits, vegetables)",(bread and cake),0.278798,0.719689,0.242057,0.868217,1.206378,0.041409,2.127067,0.237205
163,"(fruit, margarine, milk-cream)",(bread and cake),0.237087,0.719689,0.205749,0.867821,1.205829,0.03512,2.120699,0.223741
130,"(frozen foods, biscuits, milk-cream)",(bread and cake),0.271234,0.719689,0.235358,0.867729,1.2057,0.040154,2.11922,0.234103
141,"(fruit, biscuits, vegetables)",(bread and cake),0.303436,0.719689,0.262805,0.866097,1.203432,0.044426,2.093388,0.242682
114,"(margarine, milk-cream, baking needs)",(bread and cake),0.246812,0.719689,0.213313,0.864273,1.200899,0.035685,2.065261,0.22211


Do the above tables agree?
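
Comparing tables by eye only covers the rows that are displayed, and the two algorithms number their rows differently. One way to compare complete rule sets is to key each rule by its antecedent and consequent and ignore the row index. This is a minimal sketch on hypothetical toy frames; `rule_table` is an illustrative helper name, not an `mlxtend` function.

```python
import pandas as pd

def rule_table(rules, ndigits=6):
    """Map (antecedent, consequent) -> rounded confidence, ignoring row order and index."""
    return {
        (a, c): round(conf, ndigits)
        for a, c, conf in zip(rules["antecedents"], rules["consequents"], rules["confidence"])
    }

# Hypothetical frames standing in for ap_rules and fp_rules:
# the same two rules, listed in a different order with different index labels
rules_a = pd.DataFrame({
    "antecedents": [frozenset({"biscuits", "fruit"}), frozenset({"margarine"})],
    "consequents": [frozenset({"bread and cake"}), frozenset({"bread and cake"})],
    "confidence": [0.87, 0.80],
}, index=[139, 2])

rules_b = pd.DataFrame({
    "antecedents": [frozenset({"margarine"}), frozenset({"fruit", "biscuits"})],
    "consequents": [frozenset({"bread and cake"}), frozenset({"bread and cake"})],
    "confidence": [0.80, 0.87],
}, index=[7, 101])

same_rules = rule_table(rules_a) == rule_table(rules_b)
print(same_rules)
```

In the notebook, the corresponding check would be `rule_table(ap_rules) == rule_table(fp_rules)`. Because frozensets compare by content, the order in which items are printed within an itemset does not affect the result.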