# Data mining

## EXERCISE: Association analysis from scratch

[Adapted from http://aimotion.blogspot.com.au/2013/01/machine-learning-and-data-mining.html.]

[For more on efficient approaches, see http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf.]

Refer to the slides for definitions (itemset, support, frequent itemset, confidence, etc.).
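
As a quick refresher, the two central measures can be computed directly. This is a minimal sketch assuming the usual definitions: the support of an itemset is the fraction of transactions that contain it, and the confidence of X => Y is support(X ∪ Y) / support(X). The helper names `support` and `confidence` are illustrative only and do not come from the slides.

```python
# Toy transactions (the same sample data used later in this exercise).
transactions = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent U consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({'B', 'E'}, transactions))       # 0.75
print(confidence({'A'}, {'C'}, transactions))  # 1.0
print(confidence({'C'}, {'A'}, transactions))  # roughly 0.667
```

Note that confidence is not symmetric: every transaction with A also has C, but only two of the three transactions with C also have A.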

### Generate frequent itemsets

Let's find all sets of items whose support meets or exceeds a chosen threshold.

We define 4 functions for generating frequent itemsets:
* createC1 - Create first candidate itemsets for k=1
* scanD - Identify itemsets that meet the support threshold
* aprioriGen - Generate the next list of candidates
* apriori - Generate all frequent itemsets

See slides for explanation of functions.
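
Apriori relies on the fact that every subset of a frequent itemset must itself be frequent, so a candidate with any infrequent (k-1)-subset can be discarded before the data is scanned. The `aprioriGen` function in this exercise only joins itemsets and leaves the filtering to `scanD`; the helper below, `has_infrequent_subset`, is a hypothetical add-on that sketches the pruning check.

```python
from itertools import combinations


def has_infrequent_subset(candidate, prev_frequent):
    """True if some (k-1)-subset of `candidate` is missing from `prev_frequent`."""
    prev = set(prev_frequent)
    k = len(candidate)
    return any(frozenset(sub) not in prev
               for sub in combinations(candidate, k - 1))


# Frequent 2-itemsets of the sample data at min_support=0.5.
L2 = [frozenset('BC'), frozenset('CE'), frozenset('BE'), frozenset('AC')]

print(has_infrequent_subset(frozenset('BCE'), L2))  # False: BC, CE and BE are all frequent
print(has_infrequent_subset(frozenset('ABC'), L2))  # True: AB is not frequent
```

Candidates that fail this check can never meet the support threshold, so skipping them saves a pass over the transactions.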

In [30]:
def createC1(dataset):
    "Create a list of candidate item sets of size one."
    c1 = []
    for transaction in dataset:
        for item in transaction:
            if [item] not in c1:
                c1.append([item])
    c1.sort()
    # Use frozenset so each itemset can serve as a dictionary key.
    return list(map(frozenset, c1))



def scanD(dataset, candidates, min_support):
    "Return all candidates that meet a minimum support level"
    sscnt = {}
    for tid in dataset:
        for can in candidates:
            if can.issubset(tid):
                sscnt.setdefault(can, 0)
                sscnt[can] += 1

    num_items = float(len(dataset))
    retlist = []
    support_data = {}
    for key in sscnt:
        support = sscnt[key] / num_items
        if support >= min_support:
            retlist.insert(0, key)
            support_data[key] = support
    return retlist, support_data


def aprioriGen(freq_sets, k):
    "Join pairs of frequent (k-1)-itemsets that share their first k-2 sorted items into k-item candidates"
    retList = []
    lenLk = len(freq_sets)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            # Sort before slicing: set iteration order is arbitrary, so
            # slicing first could compare the wrong prefixes.
            L1 = sorted(freq_sets[i])[:k - 2]
            L2 = sorted(freq_sets[j])[:k - 2]
            if L1 == L2:
                retList.append(freq_sets[i] | freq_sets[j]) # | is set union
    return retList


def apriori(dataset, min_support=0.5):
    "Return all frequent itemsets (grouped by size) and their support values"
    C1 = createC1(dataset)
    D = list(map(set, dataset))
    L1, support_data = scanD(D, C1, min_support)
    L = [L1]
    k = 2
    while (len(L[k - 2]) > 0):
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, min_support)
        support_data.update(supK)
        L.append(Lk)
        k += 1

    return L, support_data

### Itemset generation on sample data

In [31]:
MIN_SUPPORT=0.5

# Sample data
#DATASET = [['Mango', 'Onion', 'Apple'], ['Corn', 'Onion', 'Eggs'], ['Mango', 'Corn', 'Onion', 'Eggs'], ['Mango', 'Eggs']]
DATASET = [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C','E'],['B', 'E']]
print('Dataset in list-of-lists format:\n', DATASET, '\n')

# Generate the first candidate itemsets (k=1)
C1 = createC1(DATASET)
print('Initial 1-itemset candidates:\n', C1, '\n')

# Convert data to a list of sets
D = list(map(set, DATASET))
print('Dataset in list-of-sets format:\n', D, '\n')

# Identify items that meet the support threshold (0.5)
# Note that {'D'} isn't here as it only occurs in one transaction.
# Remove it so we don't generate any further candidate itemsets containing {'D'}.
L1, support_data = scanD(D, C1, MIN_SUPPORT)
print('1-itemsets that appear in at least 50% of transactions:\n', L1, '\n')

# Generate the next list of candidates
print('Next set of candidates:\n', aprioriGen(L1,2), '\n')

# Generate all frequent itemsets
L, support_data = apriori(DATASET, min_support=MIN_SUPPORT)
print('Full list of candidate itemsets:\n', L, '\n')
print('Support values for candidate itemsets:\n', support_data, '\n')

Dataset in list-of-lists format:
 [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C', 'E'], ['B', 'E']] 

Initial 1-itemset candidates:
 [frozenset({'A'}), frozenset({'B'}), frozenset({'C'}), frozenset({'D'}), frozenset({'E'})] 

Dataset in list-of-sets format:
 [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}] 

1-itemsets that appear in at least 50% of transactions:
 [frozenset({'E'}), frozenset({'B'}), frozenset({'C'}), frozenset({'A'})] 

Next set of candidates:
 [frozenset({'B', 'E'}), frozenset({'C', 'E'}), frozenset({'A', 'E'}), frozenset({'B', 'C'}), frozenset({'A', 'B'}), frozenset({'A', 'C'})] 

Full list of candidate itemsets:
 [[frozenset({'E'}), frozenset({'B'}), frozenset({'C'}), frozenset({'A'})], [frozenset({'B', 'C'}), frozenset({'C', 'E'}), frozenset({'B', 'E'}), frozenset({'A', 'C'})], [frozenset({'B', 'C', 'E'})], []] 

Support values for candidate itemsets:
 {frozenset({'A'}): 0.5, frozenset({'C'}): 0.75, frozenset({'B'}): 0.75, frozenset({'E'}): 

### TODO Exploring support thresholds

* Generate frequent itemsets with a support threshold of 0.7
* How many frequent itemsets do we get at 0.7?
* How many do we get at 0.3?
* What would be a reasonable value for supermarket transaction data?
* Do you have datasets that resemble transactions?
* What about the apps/websites you use?
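
To build intuition for how the threshold changes the result, one option is to cross-check against exhaustive enumeration. The sketch below is not Apriori: it tests every possible itemset, which is exponential in the number of items and only sensible for tiny data. The names `brute_force_frequent` and `sample` are illustrative.

```python
from itertools import combinations

sample = [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C', 'E'], ['B', 'E']]


def brute_force_frequent(dataset, min_support):
    """Enumerate every itemset and keep those meeting min_support (tiny data only)."""
    transactions = [set(t) for t in dataset]
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(1 for t in transactions if set(combo) <= t)
            support = count / len(transactions)
            if support >= min_support:
                frequent[frozenset(combo)] = support
    return frequent


for threshold in (0.3, 0.5, 0.7):
    print(threshold, len(brute_force_frequent(sample, threshold)))
```

With only four transactions, support can only take the values 0.25, 0.5, 0.75 and 1.0, so thresholds of 0.3 and 0.5 select exactly the same itemsets.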

In [33]:
# 1 - 
l0_7, sd0_7 = apriori(DATASET, min_support=0.7)
print('Full list of candidate itemsets:\n', l0_7, '\n')
print('Support values for candidate itemsets:\n', sd0_7, '\n')

# 2 - 
temp = []
for ksets in l0_7:
    for i in ksets:
        temp.append(i)      
print('Number of frequent itemsets at 0.7:', len(temp))
##alternatively 
print('Number of frequent itemsets at 0.7:', len([i for ksets in l0_7 for i in ksets]))

# 3 - 
l0_3, sd0_3 = apriori(DATASET, min_support=0.3)

print('Number of frequent itemsets at 0.3:', len([i for ksets in l0_3 for i in ksets]))

# 4 - Much lower (e.g., 5%); at higher thresholds real data may yield no frequent itemsets at all

# 5 - For example, file-access logs, to learn which files tend to be open at the same time.

# 6 - Many, many! E.g., Amazon, Netflix.

Full list of candidate itemsets:
 [[frozenset({'E'}), frozenset({'B'}), frozenset({'C'})], [frozenset({'B', 'E'})], []] 

Support values for candidate itemsets:
 {frozenset({'C'}): 0.75, frozenset({'B'}): 0.75, frozenset({'E'}): 0.75, frozenset({'B', 'E'}): 0.75} 

Number of frequent itemsets at 0.7: 4
Number of frequent itemsets at 0.7: 4
Number of frequent itemsets at 0.3: 9


## *STOP PLEASE. THE FOLLOWING IS FOR THE NEXT EXERCISE. THANKS.*

## Mine association rules

Given frequent itemsets, we can create association rules.

We add three more functions:
* calc_confidence - Identify rules that meet the confidence threshold
* rules_from_conseq - Recursively generate and evaluate candidate rules
* generateRules - Mine all confident association rules

See slides for explanation of functions.
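
For a single frequent itemset, rule mining amounts to trying each split into antecedent and consequent and keeping the splits whose confidence is high enough. The sketch below does this exhaustively for {B, C, E} from the sample data. It is a simplified illustration rather than the recursive scheme used by `rules_from_conseq`, and `rules_from_itemset` is a hypothetical name.

```python
from itertools import combinations

# Supports of {B, C, E} and its non-empty proper subsets in the sample data.
support = {
    frozenset('B'): 0.75, frozenset('C'): 0.75, frozenset('E'): 0.75,
    frozenset('BC'): 0.5, frozenset('BE'): 0.75, frozenset('CE'): 0.5,
    frozenset('BCE'): 0.5,
}


def rules_from_itemset(itemset, support, min_confidence):
    """Try every split of `itemset` into antecedent => consequent."""
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), r)):
            consequent = itemset - antecedent
            conf = support[itemset] / support[antecedent]
            if conf >= min_confidence:
                rules.append((antecedent, consequent, conf))
    return rules


for a, c, conf in rules_from_itemset(frozenset('BCE'), support, 0.7):
    print(sorted(a), '==>', sorted(c), round(conf, 2))
```

At a confidence threshold of 0.7 only the two rules with two-item antecedents ({B, C} => {E} and {C, E} => {B}) survive; lowering the threshold to 0.5 admits all six splits.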

In [36]:
def calc_confidence(freqSet, H, support_data, rules, min_confidence=0.7):
    "Evaluate the rule generated"
    pruned_H = []
    for conseq in H:
        conf = support_data[freqSet] / support_data[freqSet - conseq]
        if conf >= min_confidence:
            #print(freqSet - conseq, '--->', conseq, 'conf:', conf)
            rules.append((freqSet - conseq, conseq, conf))
            pruned_H.append(conseq)
    return pruned_H


def rules_from_conseq(freqSet, H, support_data, rules, min_confidence=0.7):
    "Generate a set of candidate rules"
    m = len(H[0])
    Hmp1 = createC1(H)
    Hmp1 = calc_confidence(freqSet, Hmp1,  support_data, rules, min_confidence)
    if len(Hmp1) <= len(freqSet):
        if (len(freqSet) > (m + 1)):
            Hmp1 = aprioriGen(H, m + 1)
            Hmp1 = calc_confidence(freqSet, Hmp1,  support_data, rules, min_confidence)
            if len(Hmp1) > 1:
                rules_from_conseq(freqSet, Hmp1, support_data, rules, min_confidence)

def generateRules(L, support_data, min_confidence=0.7):
    """Create the association rules
    L: list of frequent item sets
    support_data: support data for those itemsets
    min_confidence: minimum confidence threshold
    """
    rules = []
    for i in range(1, len(L)):
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]           
            if (i > 1):
                rules_from_conseq(freqSet, H1, support_data, rules, min_confidence)
            else:
                calc_confidence(freqSet, H1, support_data, rules, min_confidence)
    return rules

def print_rules(rules):
    for r in rules:
        print('{} ==> {} (c={})'.format(*r))

### Rule mining on sample data

In [37]:

MIN_CONFIDENCE=0.7
# Mine association rules
association_rules = generateRules(L, support_data, min_confidence=MIN_CONFIDENCE)
print_rules(association_rules)

frozenset({'E'}) ==> frozenset({'B'}) (c=1.0)
frozenset({'B'}) ==> frozenset({'E'}) (c=1.0)
frozenset({'A'}) ==> frozenset({'C'}) (c=1.0)
frozenset({'C', 'E'}) ==> frozenset({'B'}) (c=1.0)
frozenset({'B', 'C'}) ==> frozenset({'E'}) (c=1.0)


### TODO Exploring confidence thresholds

* Mine rules with a confidence threshold of 0.9
* How many rules do we get at 0.9?
* How many do we get at 0.5?
* What would be a reasonable value for supermarket transaction data?
* Can we use this for recommendation (e.g., Amazon, Netflix)?
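
On the recommendation question, one simple way to use mined rules is to fire every rule whose antecedent is contained in the current basket and suggest the consequent items not already present. The sketch below hard-codes the five rules found at a confidence of 0.7 on the sample data; the `recommend` helper and its scoring (maximum confidence per item) are assumptions made for illustration, not a description of how any real recommender works.

```python
# Rules mined from the sample data at min_confidence=0.7:
# (antecedent, consequent, confidence).
RULES = [
    (frozenset('E'), frozenset('B'), 1.0),
    (frozenset('B'), frozenset('E'), 1.0),
    (frozenset('A'), frozenset('C'), 1.0),
    (frozenset('CE'), frozenset('B'), 1.0),
    (frozenset('BC'), frozenset('E'), 1.0),
]


def recommend(basket, rules):
    """Suggest items implied by rules whose antecedent is contained in the basket."""
    basket = set(basket)
    scores = {}
    for antecedent, consequent, conf in rules:
        if antecedent <= basket:
            for item in consequent - basket:
                scores[item] = max(scores.get(item, 0.0), conf)
    # Highest-confidence suggestions first; ties broken alphabetically.
    return sorted(scores, key=lambda item: (-scores[item], item))


print(recommend({'A'}, RULES))       # ['C']
print(recommend({'B', 'C'}, RULES))  # ['E']
print(recommend({'D'}, RULES))       # []
```

A basket containing only infrequent items (such as D here) matches no rule, which is one practical limit of purely rule-based suggestions.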

In [38]:
# 1 - 
r0_9 =  generateRules(L, support_data, min_confidence=0.9)
print('Rules for confidence threshold of 0.9:')
print_rules(r0_9)

# 2 - 
print('Number of rules at 0.9:', len(r0_9))

# 3 - 
r0_5 =  generateRules(L, support_data, min_confidence=0.5)
print('Rules for confidence threshold of 0.5:')
print_rules(r0_5)
print('Number of rules at 0.5:', len(r0_5))

# 4 - 70% might be reasonable; it will depend on the data and how many rules the business can use

# 5 - Absolutely, especially in session-focused recommendation ignoring user profile and history.
#     [https://en.wikipedia.org/wiki/Recommender_system]
#     [https://www.quora.com/How-does-Amazons-collaborative-filtering-recommendation-engine-work]

Rules for confidence threshold of 0.9:
frozenset({'E'}) ==> frozenset({'B'}) (c=1.0)
frozenset({'B'}) ==> frozenset({'E'}) (c=1.0)
frozenset({'A'}) ==> frozenset({'C'}) (c=1.0)
frozenset({'C', 'E'}) ==> frozenset({'B'}) (c=1.0)
frozenset({'B', 'C'}) ==> frozenset({'E'}) (c=1.0)
Number of rules at 0.9: 5
Rules for confidence threshold of 0.5:
frozenset({'C'}) ==> frozenset({'B'}) (c=0.6666666666666666)
frozenset({'B'}) ==> frozenset({'C'}) (c=0.6666666666666666)
frozenset({'E'}) ==> frozenset({'C'}) (c=0.6666666666666666)
frozenset({'C'}) ==> frozenset({'E'}) (c=0.6666666666666666)
frozenset({'E'}) ==> frozenset({'B'}) (c=1.0)
frozenset({'B'}) ==> frozenset({'E'}) (c=1.0)
frozenset({'C'}) ==> frozenset({'A'}) (c=0.6666666666666666)
frozenset({'A'}) ==> frozenset({'C'}) (c=1.0)
frozenset({'C', 'E'}) ==> frozenset({'B'}) (c=1.0)
frozenset({'B', 'E'}) ==> frozenset({'C'}) (c=0.6666666666666666)
frozenset({'B', 'C'}) ==> frozenset({'E'}) (c=1.0)
frozenset({'E'}) ==> frozenset({'B', 'C'}) (c

## EXERCISE: mlxtend library

## Association analysis using mlxtend library
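
The mlxtend workflow in the next cell starts by one-hot encoding the transactions: one boolean column per item, one row per transaction. The plain-Python sketch below shows the idea behind that encoding; it is an illustration only and makes no claim about how `TransactionEncoder` is implemented.

```python
dataset = [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C', 'E'], ['B', 'E']]

# Column order: every distinct item, sorted.
columns = sorted({item for transaction in dataset for item in transaction})

# One row per transaction, one boolean per column.
rows = [[item in transaction for item in columns] for transaction in dataset]

print(columns)
for row in rows:
    print(row)
```

The support of a single item is then just the mean of its column, which is why this layout suits frequent-itemset mining.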

In [39]:
#!pip install mlxtend
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import pandas as pd
dataset =   [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C','E'],['B', 'E']]
            
             
oht = TransactionEncoder()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
print (df)           
 
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
print('Support values for candidate itemsets:\n', frequent_itemsets, '\n')

 
rules= association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
#rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
# DataFrame.as_matrix() was removed in pandas 1.0; select the columns and call to_numpy() instead.
print(rules[['antecedents', 'consequents', 'confidence']].to_numpy())
#print(rules[['antecedents', 'consequents', 'confidence']])

       A      B      C      D      E
0   True  False   True   True  False
1  False   True   True  False   True
2   True   True   True  False   True
3  False   True  False  False   True
Support values for candidate itemsets:
    support   itemsets
0     0.50        (A)
1     0.75        (B)
2     0.75        (C)
3     0.75        (E)
4     0.50     (A, C)
5     0.50     (B, C)
6     0.75     (B, E)
7     0.50     (C, E)
8     0.50  (B, C, E) 

[[frozenset({'A'}) frozenset({'C'}) 1.0]
 [frozenset({'B'}) frozenset({'E'}) 1.0]
 [frozenset({'E'}) frozenset({'B'}) 1.0]
 [frozenset({'B', 'C'}) frozenset({'E'}) 1.0]
 [frozenset({'C', 'E'}) frozenset({'B'}) 1.0]]


### Load the supermarket transaction dataset


In [40]:
import csv 
import pprint
file_name = 'Groceries.csv'
data_list = []
with open(file_name, 'r', newline='') as f:  # one transaction per CSV row
    reader = csv.reader(f)
    # Collect the non-empty values of every row as one transaction.
    for row in reader:
        row_list = []
        for value in row:
            if value.strip():
                row_list.append(value.strip())
        data_list.append(row_list)
pprint.pprint(data_list)

[['pork',
  'sandwich bags',
  'lunch meat',
  'all- purpose',
  'flour',
  'soda',
  'butter',
  'vegetables',
  'beef',
  'aluminum foil',
  'all- purpose',
  'dinner rolls',
  'shampoo',
  'all- purpose'],
 ['shampoo',
  'hand soap',
  'waffles',
  'vegetables',
  'cheeses',
  'mixes',
  'milk',
  'sandwich bags',
  'laundry detergent',
  'dishwashing liquid/detergent',
  'waffles',
  'individual meals',
  'hand soap',
  'vegetables'],
 ['pork',
  'soap',
  'ice cream',
  'toilet paper',
  'dinner rolls',
  'hand soap',
  'spaghetti sauce',
  'milk',
  'ketchup',
  'sandwich loaves',
  'poultry',
  'toilet paper',
  'ice cream',
  'ketchup'],
 ['juice', 'lunch meat', 'soda', 'toilet paper', 'all- purpose'],
 ['pasta',
  'tortillas',
  'mixes',
  'hand soap',
  'toilet paper',
  'vegetables',
  'vegetables',
  'paper towels',
  'vegetables',
  'flour',
  'vegetables',
  'pork',
  'poultry',
  'eggs'],
 ['toilet paper',
  'eggs',
  'toilet paper',
  'vegetables',
  'bagels',
  'dishwa

  'toilet paper',
  'shampoo',
  'sandwich loaves',
  'ketchup',
  'sandwich loaves',
  'paper towels',
  'vegetables',
  'vegetables',
  'pork'],
 ['paper towels',
  'sugar',
  'vegetables',
  'sandwich loaves',
  'spaghetti sauce',
  'shampoo'],
 ['beef',
  'beef',
  'sugar',
  'toilet paper',
  'aluminum foil',
  'aluminum foil',
  'hand soap',
  'all- purpose',
  'yogurt',
  'eggs',
  'flour',
  'fruits',
  'waffles',
  'waffles'],
 ['juice',
  'waffles',
  'vegetables',
  'waffles',
  'lunch meat',
  'spaghetti sauce',
  'juice',
  'all- purpose',
  'flour',
  'soap',
  'waffles',
  'juice',
  'spaghetti sauce',
  'vegetables'],
 ['vegetables',
  'soap',
  'soda',
  'cheeses',
  'toilet paper',
  'shampoo',
  'eggs',
  'paper towels',
  'mixes',
  'individual meals',
  'coffee/tea',
  'all- purpose',
  'sugar',
  'waffles'],
 ['coffee/tea',
  'sugar',
  'cheeses',
  'tortillas',
  'all- purpose',
  'milk',
  'pasta',
  'poultry',
  'paper towels',
  'yogurt',
  'eggs',
  'toilet p

  'eggs',
  'soap',
  'cheeses',
  'sugar',
  'cheeses'],
 ['sugar',
  'shampoo',
  'yogurt',
  'vegetables',
  'flour',
  'hand soap',
  'juice',
  'yogurt',
  'pasta',
  'poultry',
  'vegetables',
  'shampoo',
  'sugar',
  'dinner rolls'],
 ['juice',
  'aluminum foil',
  'paper towels',
  'coffee/tea',
  'individual meals',
  'cereals',
  'shampoo',
  'vegetables',
  'beef',
  'cheeses',
  'individual meals',
  'bagels',
  'sandwich loaves',
  'aluminum foil'],
 ['pork',
  'individual meals',
  'aluminum foil',
  'individual meals',
  'cereals',
  'aluminum foil',
  'ice cream',
  'ice cream',
  'dinner rolls',
  'fruits',
  'poultry',
  'laundry detergent',
  'spaghetti sauce',
  'poultry'],
 ['pork',
  'shampoo',
  'waffles',
  'vegetables',
  'vegetables',
  'dinner rolls',
  'spaghetti sauce',
  'fruits',
  'juice',
  'sandwich loaves',
  'butter',
  'butter',
  'lunch meat',
  'beef'],
 ['milk',
  'ice cream',
  'flour',
  'sandwich loaves',
  'shampoo',
  'sandwich bags',
  'fl

  'butter',
  'fruits',
  'lunch meat',
  'aluminum foil',
  'eggs'],
 ['beef',
  'toilet paper',
  'soda',
  'dinner rolls',
  'beef',
  'fruits',
  'eggs',
  'sandwich bags',
  'cereals',
  'waffles',
  'soda',
  'eggs',
  'beef',
  'lunch meat'],
 ['shampoo',
  'pasta',
  'milk',
  'all- purpose',
  'individual meals',
  'sugar',
  'mixes',
  'cheeses',
  'bagels',
  'poultry',
  'butter',
  'coffee/tea',
  'milk',
  'lunch meat'],
 ['sandwich bags',
  'cheeses',
  'juice',
  'laundry detergent',
  'individual meals',
  'lunch meat',
  'waffles',
  'vegetables',
  'yogurt',
  'waffles'],
 ['toilet paper',
  'dinner rolls',
  'juice',
  'vegetables',
  'eggs',
  'poultry',
  'dishwashing liquid/detergent',
  'sandwich loaves',
  'all- purpose',
  'waffles',
  'sandwich loaves',
  'soda',
  'toilet paper',
  'shampoo'],
 ['dinner rolls', 'vegetables', 'milk', 'ice cream', 'vegetables', 'ice cream'],
 ['milk',
  'eggs',
  'poultry',
  'poultry',
  'hand soap',
  'bagels',
  'all- purpo

  'juice',
  'vegetables',
  'all- purpose',
  'dinner rolls',
  'all- purpose',
  'flour',
  'eggs',
  'ketchup',
  'yogurt',
  'flour',
  'laundry detergent'],
 ['vegetables',
  'pasta',
  'beef',
  'lunch meat',
  'dishwashing liquid/detergent',
  'ice cream',
  'lunch meat',
  'mixes',
  'laundry detergent',
  'vegetables',
  'lunch meat',
  'mixes',
  'bagels',
  'bagels'],
 ['pork',
  'laundry detergent',
  'ketchup',
  'tortillas',
  'mixes',
  'eggs',
  'flour',
  'eggs',
  'beef',
  'pasta',
  'flour',
  'ice cream',
  'poultry',
  'dishwashing liquid/detergent'],
 ['vegetables',
  'soap',
  'shampoo',
  'sugar',
  'eggs',
  'cereals',
  'tortillas',
  'butter',
  'spaghetti sauce'],
 ['poultry',
  'cheeses',
  'sandwich loaves',
  'ketchup',
  'sandwich bags',
  'coffee/tea',
  'hand soap',
  'vegetables',
  'lunch meat',
  'fruits',
  'vegetables',
  'beef',
  'paper towels',
  'shampoo'],
 ['vegetables',
  'pork',
  'pork',
  'cereals',
  'vegetables',
  'bagels',
  'waffle

 ['soap',
  'sandwich bags',
  'soap',
  'vegetables',
  'eggs',
  'paper towels',
  'pork',
  'ketchup',
  'tortillas',
  'waffles',
  'vegetables',
  'pork',
  'coffee/tea',
  'ketchup'],
 ['poultry',
  'pork',
  'ketchup',
  'all- purpose',
  'lunch meat',
  'ice cream',
  'all- purpose',
  'laundry detergent'],
 ['spaghetti sauce',
  'lunch meat',
  'vegetables',
  'waffles',
  'butter',
  'tortillas',
  'mixes',
  'ketchup',
  'eggs',
  'dinner rolls',
  'juice',
  'cheeses',
  'toilet paper',
  'toilet paper'],
 ['spaghetti sauce',
  'laundry detergent',
  'vegetables',
  'fruits',
  'vegetables',
  'ketchup',
  'spaghetti sauce',
  'aluminum foil',
  'coffee/tea',
  'shampoo',
  'dinner rolls',
  'vegetables',
  'pasta',
  'vegetables'],
 ['vegetables',
  'eggs',
  'spaghetti sauce',
  'bagels',
  'mixes',
  'butter',
  'poultry',
  'bagels',
  'vegetables',
  'waffles',
  'sandwich bags',
  'vegetables',
  'aluminum foil',
  'pork'],
 ['all- purpose',
  'eggs',
  'juice',
  'sp

 ['hand soap',
  'individual meals',
  'beef',
  'yogurt',
  'tortillas',
  'mixes',
  'vegetables',
  'cereals',
  'toilet paper',
  'lunch meat',
  'aluminum foil',
  'vegetables',
  'yogurt',
  'vegetables'],
 ['tortillas',
  'beef',
  'cereals',
  'ketchup',
  'poultry',
  'sandwich bags',
  'poultry',
  'all- purpose',
  'vegetables',
  'pasta',
  'pasta',
  'individual meals',
  'paper towels',
  'juice'],
 ['fruits',
  'lunch meat',
  'paper towels',
  'paper towels',
  'eggs',
  'yogurt',
  'sandwich bags',
  'butter',
  'all- purpose',
  'eggs',
  'dishwashing liquid/detergent',
  'shampoo',
  'bagels',
  'ice cream'],
 ['shampoo',
  'all- purpose',
  'dinner rolls',
  'coffee/tea',
  'hand soap',
  'aluminum foil'],
 ['vegetables',
  'soap',
  'vegetables',
  'ketchup',
  'mixes',
  'individual meals',
  'laundry detergent',
  'eggs',
  'bagels',
  'vegetables',
  'pasta',
  'juice',
  'soap',
  'mixes'],
 ['aluminum foil',
  'laundry detergent',
  'soap',
  'sugar',
  'eggs'

  'ketchup',
  'dishwashing liquid/detergent'],
 ['pasta',
  'beef',
  'poultry',
  'vegetables',
  'dinner rolls',
  'pork',
  'coffee/tea',
  'pork',
  'aluminum foil',
  'fruits',
  'vegetables',
  'individual meals',
  'paper towels',
  'sugar'],
 ['milk',
  'vegetables',
  'vegetables',
  'soap',
  'mixes',
  'toilet paper',
  'toilet paper',
  'dinner rolls',
  'milk',
  'vegetables',
  'yogurt',
  'tortillas'],
 ['flour',
  'tortillas',
  'sandwich loaves',
  'bagels',
  'hand soap',
  'toilet paper',
  'fruits',
  'pasta',
  'soap',
  'sandwich loaves',
  'milk',
  'aluminum foil',
  'cheeses',
  'bagels'],
 ['flour',
  'milk',
  'coffee/tea',
  'waffles',
  'pasta',
  'cereals',
  'individual meals',
  'toilet paper'],
 ['sandwich loaves',
  'pork',
  'dinner rolls',
  'cheeses',
  'tortillas',
  'vegetables',
  'all- purpose',
  'ice cream',
  'paper towels',
  'eggs',
  'spaghetti sauce',
  'soda',
  'tortillas'],
 ['bagels', 'juice', 'coffee/tea', 'paper towels', 'beef'],
 

## TODO Mining association rules on the Groceries dataset
* Apply the apriori and association_rules functions from the mlxtend library
* What would be a reasonable value of min_support for this supermarket transaction data?

In [41]:
oht = TransactionEncoder()
oht_ary = oht.fit(data_list).transform(data_list)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
 
MIN_SUPPORT = 0.05
frequent_itemsets = apriori(df, min_support=MIN_SUPPORT, use_colnames=True)
print (frequent_itemsets)
 
rules= association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
# DataFrame.as_matrix() was removed in pandas 1.0; select the columns and call to_numpy() instead.
print(rules[['antecedents', 'consequents', 'confidence']].to_numpy())
#print(rules[['antecedents', 'consequents', 'confidence']])



      support                                           itemsets
0    0.263509                                     (all- purpose)
1    0.264176                                    (aluminum foil)
2    0.278185                                           (bagels)
3    0.262842                                             (beef)
4    0.261508                                           (butter)
5    0.273516                                          (cereals)
6    0.260173                                          (cheeses)
7    0.262842                                       (coffee/tea)
8    0.258839                                     (dinner rolls)
9    0.268179                     (dishwashing liquid/detergent)
10   0.268846                                             (eggs)
11   0.257505                                            (flour)
12   0.263509                                           (fruits)
13   0.237492                                        (hand soap)
14   0.274850            

## *STOP PLEASE. THE FOLLOWING IS FOR THE NEXT EXERCISE. THANKS.*

## EXERCISE: FP-Growth

## Rule generation


In [42]:
#Install the library if it is not available
#!pip install pyfpgrowth
import pyfpgrowth
MIN_SUPPORT = 2  # absolute count: an itemset must occur in at least 2 transactions
MIN_CONFIDENCE = 0.7
DATASET = [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C','E'],['B', 'E']]

frequent_itemsets = pyfpgrowth.find_frequent_patterns(DATASET, MIN_SUPPORT)
print('Support values for candidate itemsets:\n', frequent_itemsets, '\n')
rules = pyfpgrowth.generate_association_rules(frequent_itemsets, MIN_CONFIDENCE)
print('Resultant association rules:\n')
pprint.pprint(rules) 



Support values for candidate itemsets:
 {('A',): 2, ('A', 'C'): 2, ('C',): 3, ('B', 'C'): 2, ('B',): 3, ('E',): 3, ('C', 'E'): 2, ('B', 'E'): 3, ('B', 'C', 'E'): 2} 

Resultant association rules:

{('A',): (('C',), 1.0),
 ('B',): (('E',), 1.0),
 ('B', 'C'): (('E',), 1.0),
 ('C', 'E'): (('B',), 1.0),
 ('E',): (('B',), 1.0)}


### TODO Mining association rules using FP-growth on the Groceries dataset
* Try different confidence thresholds
* What’s a reasonable value for real data?
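
The `pyfpgrowth` results above report counts rather than fractions, and its support threshold is likewise given as a number of transactions. When comparing against the fractional thresholds used with the earlier code, a small conversion helper can be convenient. The function `absolute_support` below is a hypothetical helper, not part of `pyfpgrowth`.

```python
import math


def absolute_support(fraction, n_transactions):
    """Smallest transaction count whose relative support is >= `fraction`."""
    # Round first so floating-point noise (e.g. 0.7 * 10 = 7.000000000000001)
    # does not push the ceiling up by one.
    return math.ceil(round(fraction * n_transactions, 9))


print(absolute_support(0.5, 4))   # 2
print(absolute_support(0.3, 4))   # 2
print(absolute_support(0.7, 10))  # 7
```

For the four-transaction sample data, fractional thresholds of 0.3 and 0.5 both translate to a count of 2, matching the `MIN_SUPPORT = 2` used in the cell above.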



In [43]:
MIN_SUPPORT = 0.05 * len(data_list)
frequent_itemsets = pyfpgrowth.find_frequent_patterns(data_list, MIN_SUPPORT)
print('Support values for candidate itemsets:\n')
pprint.pprint(frequent_itemsets) 
rules = pyfpgrowth.generate_association_rules(frequent_itemsets, MIN_CONFIDENCE)
print('Resultant association rules:\n')
pprint.pprint(rules) 

Support values for candidate itemsets:

{('all- purpose', 'aluminum foil'): 120,
 ('all- purpose', 'aluminum foil', 'vegetables'): 100,
 ('all- purpose', 'bagels'): 149,
 ('all- purpose', 'bagels', 'vegetables'): 114,
 ('all- purpose', 'beef'): 135,
 ('all- purpose', 'beef', 'vegetables'): 110,
 ('all- purpose', 'butter'): 127,
 ('all- purpose', 'butter', 'vegetables'): 101,
 ('all- purpose', 'cereals'): 119,
 ('all- purpose', 'cereals', 'vegetables'): 114,
 ('all- purpose', 'cheeses'): 122,
 ('all- purpose', 'cheeses', 'vegetables'): 109,
 ('all- purpose', 'coffee/tea'): 131,
 ('all- purpose', 'coffee/tea', 'vegetables'): 118,
 ('all- purpose', 'dinner rolls'): 130,
 ('all- purpose', 'dinner rolls', 'vegetables'): 113,
 ('all- purpose', 'dishwashing liquid/detergent'): 140,
 ('all- purpose', 'dishwashing liquid/detergent', 'vegetables'): 115,
 ('all- purpose', 'eggs'): 144,
 ('all- purpose', 'eggs', 'vegetables'): 107,
 ('all- purpose', 'flour'): 149,
 ('all- purpose', 'flour', 'veget

 ('ice cream', 'ketchup', 'vegetables'): 124,
 ('ice cream', 'laundry detergent'): 132,
 ('ice cream', 'laundry detergent', 'vegetables'): 128,
 ('ice cream', 'lunch meat'): 140,
 ('ice cream', 'lunch meat', 'vegetables'): 117,
 ('ice cream', 'milk'): 159,
 ('ice cream', 'milk', 'vegetables'): 117,
 ('ice cream', 'mixes'): 130,
 ('ice cream', 'mixes', 'vegetables'): 110,
 ('ice cream', 'paper towels'): 148,
 ('ice cream', 'paper towels', 'vegetables'): 123,
 ('ice cream', 'pasta'): 166,
 ('ice cream', 'pasta', 'vegetables'): 117,
 ('ice cream', 'pork'): 127,
 ('ice cream', 'pork', 'vegetables'): 87,
 ('ice cream', 'poultry'): 167,
 ('ice cream', 'poultry', 'vegetables'): 139,
 ('ice cream', 'sandwich bags'): 137,
 ('ice cream', 'sandwich bags', 'vegetables'): 99,
 ('ice cream', 'sandwich loaves'): 129,
 ('ice cream', 'sandwich loaves', 'vegetables'): 113,
 ('ice cream', 'shampoo'): 115,
 ('ice cream', 'shampoo', 'vegetables'): 102,
 ('ice cream', 'soap'): 149,
 ('ice cream', 'soap', 'v

# End of Tutorial. Many Thanks.