
# Data mining

## EXERCISE: Association analysis from scratch



### Generate frequent itemsets

Let's find all itemsets whose support is at least some threshold.

We define 4 functions for generating frequent itemsets:
* createC1 - Create first candidate itemsets for k=1
* scanD - Identify itemsets that meet the support threshold
* aprioriGen - Generate the next list of candidates
* apriori - Generate all frequent itemsets

See slides for explanation of functions.
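
As a quick illustration of what "support" means here, the following minimal sketch computes the fraction of transactions that contain a given itemset. The helper name `support` and the toy transactions are assumptions made for illustration; they are not part of the exercise code.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)


# Toy transactions (hypothetical, for illustration only)
transactions = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]

print(support({'C'}, transactions))       # 0.75 (3 of 4 transactions)
print(support({'B', 'E'}, transactions))  # 0.75 (3 of 4 transactions)
print(support({'D'}, transactions))       # 0.25 (1 of 4 transactions)
```

An itemset is called frequent when this value is at least the chosen minimum support.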

In [None]:
def createC1(dataset):
    "Create a list of candidate item sets of size one."
    c1 = []
    for transaction in dataset:
        for item in transaction:
            if [item] not in c1:
                c1.append([item])
    c1.sort()
    # Use frozenset so that each itemset can serve as a dictionary key.
    return list(map(frozenset, c1))



def scanD(dataset, candidates, min_support):
    "Return all candidates that meet a minimum support level, with their support values"
    sscnt = {}
    for tid in dataset:
        for can in candidates:
            if can.issubset(tid):
                sscnt.setdefault(can, 0)
                sscnt[can] += 1

    num_items = float(len(dataset))
    retlist = []
    support_data = {}
    for key in sscnt:
        support = sscnt[key] / num_items
        if support >= min_support:
            retlist.insert(0, key)
            support_data[key] = support
    return retlist, support_data


def aprioriGen(freq_sets, k):
    "Join frequent (k-1)-itemsets that share their first k-2 items into candidate k-itemsets"
    retList = []
    lenLk = len(freq_sets)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            # Sort before slicing so the compared prefixes do not depend on set iteration order
            L1 = sorted(freq_sets[i])[:k - 2]
            L2 = sorted(freq_sets[j])[:k - 2]
            if L1 == L2:
                retList.append(freq_sets[i] | freq_sets[j])  # | is set union
    return retList


def apriori(dataset, min_support=0.5):
    "Generate all frequent itemsets (grouped by size) and their support values"
    C1 = createC1(dataset)
    D = list(map(set, dataset))
    L1, support_data = scanD(D, C1, min_support)
    L = [L1]
    k = 2
    while (len(L[k - 2]) > 0):
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, min_support)
        support_data.update(supK)
        L.append(Lk)
        k += 1

    return L, support_data
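
The join step can be easier to follow in isolation. The sketch below is a standalone illustration, assuming the usual Apriori convention that two (k-1)-itemsets are merged when their first k-2 items agree in sorted order. The name `join_candidates` and the input list are hypothetical.

```python
def join_candidates(freq_sets, k):
    """Join (k-1)-itemsets that share their first k-2 items in sorted order."""
    candidates = []
    for i in range(len(freq_sets)):
        for j in range(i + 1, len(freq_sets)):
            prefix_i = sorted(freq_sets[i])[:k - 2]
            prefix_j = sorted(freq_sets[j])[:k - 2]
            if prefix_i == prefix_j:
                candidates.append(freq_sets[i] | freq_sets[j])
    return candidates


# Frequent 2-itemsets (hypothetical input)
L2 = [frozenset({'A', 'C'}), frozenset({'B', 'C'}),
      frozenset({'B', 'E'}), frozenset({'C', 'E'})]

# Only {'B', 'C'} and {'B', 'E'} share the sorted prefix ['B'],
# so a single 3-item candidate is produced: their union.
C3 = join_candidates(L2, 3)
print(C3)
```

Comparing sorted prefixes avoids generating the same candidate several times from different pairs.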

In [None]:
f_ad = 'https://github.com/henryliangt/usyd/blob/main/Groceries.csv'

!pip install pyfpgrowth



In [None]:
!pip install fpgrowth_py



In [None]:
!pip install mlxtend



In [None]:
!pip install apriori_python



### Itemset generation on sample data

In [None]:
MIN_SUPPORT = 0.5

# Sample data (an alternative dataset is kept commented out for experimentation)
# DATASET = [['Mango', 'Onion', 'Apple'], ['Corn', 'Onion', 'Eggs'], ['Mango', 'Corn', 'Onion', 'Eggs'], ['Mango', 'Eggs']]
DATASET = [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C', 'E'], ['B', 'E']]
print('Dataset in list-of-lists format:\n', DATASET, '\n')

# Generate the first candidate itemsets for k=1
C1 = createC1(DATASET)
print('Initial 1-itemset candidates:\n', C1, '\n')

# Convert data to a list of sets
D = list(map(set, DATASET))
print('Dataset in list-of-sets format:\n', D, '\n')

# Identify items that meet the support threshold (0.5)
# Note that {'D'} isn't here as it only occurs in one transaction.
# Removing it means we don't generate any further candidate itemsets containing 'D'.
L1, support_data = scanD(D, C1, MIN_SUPPORT)
print('1-itemsets that appear in at least 50% of transactions:\n', L1, '\n')

# Generate the next list of candidates
print('Next set of candidates:\n', aprioriGen(L1,2), '\n')

# Generate all frequent itemsets
L, support_data = apriori(DATASET, min_support=MIN_SUPPORT)
print('Full list of frequent itemsets (grouped by size):\n', L, '\n')
print('Support values for frequent itemsets:\n', support_data, '\n')

### TODO Exploring support thresholds

* Generate frequent itemsets with a support threshold of 0.7
* How many frequent itemsets do we get at 0.7?
* How many do we get at 0.3?
* Do you have datasets that resemble transactions?
* What about the apps/websites you use?
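
One way to sanity-check answers to the questions above is to compare against a brute-force enumeration. The sketch below is an assumption-laden helper (the name `frequent_itemsets_bruteforce` is hypothetical) that tests every possible itemset, so it is only practical for tiny datasets like the sample one.

```python
from itertools import combinations


def frequent_itemsets_bruteforce(transactions, min_support):
    """Enumerate every itemset and keep those meeting min_support.

    Exponential in the number of distinct items, so only use on tiny examples.
    """
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = sum(1 for t in transactions if candidate.issubset(t))
            if count / n >= min_support:
                result[candidate] = count / n
    return result


transactions = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
for threshold in (0.3, 0.5, 0.7):
    found = frequent_itemsets_bruteforce(transactions, threshold)
    print('min_support =', threshold, '->', len(found), 'frequent itemsets')
```

The counts printed here should match the total number of itemsets returned by the Apriori implementation at the same thresholds.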

In [None]:
# TODO: replace the content of this cell with your Python solution
L7, support_data07 = apriori(DATASET, min_support=0.7)
print('Frequent itemsets at min_support=0.7:', L7)






## *STOP PLEASE. THE FOLLOWING IS FOR THE NEXT EXERCISE. THANKS.*

## Mine association rules

Given frequent itemsets, we can create association rules.

We add three more functions:
* calc_confidence - Identify rules that meet the confidence threshold
* rules_from_conseq - Recursively generate and evaluate candidate rules
* generateRules - Mine all confident association rules

See slides for explanation of functions.
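
For reference, the confidence of a rule A -> B is support(A ∪ B) / support(A). A minimal sketch of that calculation follows; the helper name `confidence` and the support values are assumptions for illustration only.

```python
def confidence(antecedent, consequent, support_data):
    """conf(A -> B) = support(A | B) / support(A)."""
    antecedent = frozenset(antecedent)
    consequent = frozenset(consequent)
    return support_data[antecedent | consequent] / support_data[antecedent]


# Hypothetical support values for a few itemsets
support_data = {
    frozenset({'A'}): 0.5,
    frozenset({'C'}): 0.75,
    frozenset({'A', 'C'}): 0.5,
}

print(confidence({'A'}, {'C'}, support_data))  # 1.0
print(confidence({'C'}, {'A'}, support_data))  # about 0.667
```

Note that confidence is not symmetric: swapping antecedent and consequent changes the denominator.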

In [None]:
def calc_confidence(freqSet, H, support_data, rules, min_confidence=0.7):
    "Keep the consequents in H whose rule (freqSet - conseq) -> conseq meets min_confidence"
    pruned_H = []
    for conseq in H:
        conf = support_data[freqSet] / support_data[freqSet - conseq]
        if conf >= min_confidence:
            #print(freqSet - conseq, '--->', conseq, 'conf:', conf)
            rules.append((freqSet - conseq, conseq, conf))
            pruned_H.append(conseq)
    return pruned_H


def rules_from_conseq(freqSet, H, support_data, rules, min_confidence=0.7):
    "Recursively generate and evaluate candidate rules with progressively larger consequents"
    m = len(H[0])
    Hmp1 = createC1(H)
    Hmp1 = calc_confidence(freqSet, Hmp1,  support_data, rules, min_confidence)
    if len(Hmp1) <= len(freqSet):
        if (len(freqSet) > (m + 1)):
            Hmp1 = aprioriGen(H, m + 1)
            Hmp1 = calc_confidence(freqSet, Hmp1,  support_data, rules, min_confidence)
            if len(Hmp1) > 1:
                rules_from_conseq(freqSet, Hmp1, support_data, rules, min_confidence)

def generateRules(L, support_data, min_confidence=0.7):
    """Create the association rules
    L: list of frequent item sets
    support_data: support data for those itemsets
    min_confidence: minimum confidence threshold
    """
    rules = []
    for i in range(1, len(L)):
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]           
            if (i > 1):
                rules_from_conseq(freqSet, H1, support_data, rules, min_confidence)
            else:
                calc_confidence(freqSet, H1, support_data, rules, min_confidence)
    return rules

def print_rules(rules):
    for r in rules:
        print('{} ==> {} (c={})'.format(*r))

### Rule mining on sample data

In [None]:

MIN_CONFIDENCE = 0.7
# Mine association rules
association_rules = generateRules(L, support_data, min_confidence=MIN_CONFIDENCE)
print_rules(association_rules)

### TODO Exploring confidence thresholds

* Mine rules with a confidence threshold of 0.7
* How many rules do we get at 0.7?
* How many do we get at 0.3?
* Can we use this for recommendation (e.g., Amazon, Netflix)?

In [None]:
# TODO: replace the content of this cell with your Python solution
raise NotImplementedError

## EXERCISE: mlxtend library and apriori_python

## Association analysis using mlxtend library

In [None]:
#Install the library if it is not available
#!pip install mlxtend
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import pandas as pd
dataset = [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C', 'E'], ['B', 'E']]

# One-hot encode the transactions into a boolean DataFrame
oht = TransactionEncoder()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
print(df)

frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
print('Support values for frequent itemsets:\n', frequent_itemsets, '\n')

 
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
#rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
# Select columns with a list; recent pandas versions do not accept a set as an indexer
print(rules[['antecedents', 'consequents', 'confidence']])
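
The mlxtend `apriori` function works on a one-hot encoded table rather than on lists of items. To make that encoding concrete, here is a rough pure-Python sketch of the idea. It is not mlxtend's implementation, and the helper name `one_hot_encode` is hypothetical.

```python
def one_hot_encode(transactions):
    """Return (columns, rows); rows[i][j] is True if columns[j] is in transaction i."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[col in t for col in columns] for t in transactions]
    return columns, rows


dataset = [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C', 'E'], ['B', 'E']]
columns, rows = one_hot_encode(dataset)

print(columns)  # ['A', 'B', 'C', 'D', 'E']
for row in rows:
    print(row)
```

Each row corresponds to one transaction and each column to one item, which is the shape of the DataFrame built from `TransactionEncoder` in the cell above.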

## Association analysis using apriori_python library


In [None]:
#!pip install apriori_python
# Note: this import rebinds the name `apriori`, shadowing the earlier definitions in this notebook
from apriori_python import apriori
dataset = [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C', 'E'], ['B', 'E']]
freqItemSet, rules = apriori(dataset, minSup=0.7, minConf=0.0)

for r in rules:
    print('{} ==> {} (c={})'.format(*r)) 

# Load the supermarket transaction dataset
### Now let's work on a real grocery dataset

In [None]:
import csv
import pprint

file_name = 'Groceries.csv'
data_list = []
with open(file_name, 'r', newline='') as f:  # open the groceries CSV file
    reader = csv.reader(f)
    # Collect the non-empty values of each row as one transaction
    for row in reader:
        row_list = []
        for value in row:
            value = value.strip()
            if value:
                row_list.append(value)
        data_list.append(row_list)
pprint.pprint(data_list)

## TODO Mining association rules on the Groceries dataset
* Apply the apriori and association_rules functions from the mlxtend library
* Apply the apriori function from the apriori_python library (it returns both frequent itemsets and rules)
* What would be a reasonable value of min-support for this supermarket transaction data?
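
One possible starting point for the min-support question is to look at how often individual items occur, since no itemset can be more frequent than its rarest member. The sketch below assumes transactions in the same list-of-lists shape as `data_list`; the helper name `item_supports` and the sample baskets are made up for illustration.

```python
from collections import Counter


def item_supports(transactions):
    """Support of each individual item, from most to least frequent."""
    n = len(transactions)
    # set(t) ensures an item is counted once per transaction
    counts = Counter(item for t in transactions for item in set(t))
    return [(item, count / n) for item, count in counts.most_common()]


# Hypothetical stand-in for data_list
sample = [
    ['whole milk', 'rolls/buns'],
    ['whole milk', 'yogurt', 'soda'],
    ['soda'],
    ['whole milk', 'rolls/buns', 'yogurt'],
    ['other vegetables'],
]

for item, sup in item_supports(sample):
    print(f'{item}: {sup:.2f}')
```

If even the most common single item appears in only a small fraction of baskets, a threshold such as 0.5 will return nothing, so real transaction data usually calls for a much lower min-support than the toy example.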

In [None]:
# TODO: replace the content of this cell with your Python solution
raise NotImplementedError

## *STOP PLEASE. THE FOLLOWING IS FOR THE NEXT EXERCISE. THANKS.*

# EXERCISE: FP-Growth

## Rule generation using the pyfpgrowth library


In [None]:
#Install the library if it is not available
#!pip install pyfpgrowth
import pprint

import pyfpgrowth

MIN_SUPPORT = 2  # pyfpgrowth takes an absolute transaction count, not a ratio
MIN_CONFIDENCE = 0.7
DATASET = [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C', 'E'], ['B', 'E']]

frequent_itemsets = pyfpgrowth.find_frequent_patterns(DATASET, MIN_SUPPORT)
print('Support counts for frequent itemsets:\n', frequent_itemsets, '\n')
rules = pyfpgrowth.generate_association_rules(frequent_itemsets, MIN_CONFIDENCE)
print('Resultant association rules:\n')
pprint.pprint(rules)



## Rule generation using the fpgrowth_py library


In [None]:
#!pip install fpgrowth_py
from fpgrowth_py import fpgrowth
freqItemSet, rules = fpgrowth(DATASET, minSupRatio=0.5, minConf=0.7)
for r in rules:
    print('{} ==> {} (c={})'.format(*r)) 
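
The two FP-growth cells express minimum support differently: the pyfpgrowth cell passes a transaction count (`MIN_SUPPORT = 2`), while the fpgrowth_py cell passes a ratio (`minSupRatio=0.5`). To compare results fairly, the thresholds need to be equivalent. A minimal conversion sketch follows, with a hypothetical helper name `support_ratio_to_count`.

```python
import math


def support_ratio_to_count(ratio, n_transactions):
    """Smallest transaction count whose share of the data is at least `ratio`."""
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(ratio * n_transactions, 9))


print(support_ratio_to_count(0.5, 4))  # 2
print(support_ratio_to_count(0.7, 4))  # 3
```

With the four sample transactions, a ratio of 0.5 corresponds to a count of 2, which is the value used in the pyfpgrowth cell.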

### TODO Mining association rules using pyfpgrowth and fpgrowth_py on the Groceries dataset
* Try different confidence thresholds
* What’s a reasonable value for real data?



In [None]:
# TODO: replace the content of this cell with your Python solution
raise NotImplementedError

# End of Tutorial. Many Thanks.