# Discovery of Frequent Itemsets and Association Rules

The problem of discovering association rules between itemsets in a sales transaction database (a set of baskets) includes the following two sub-problems:

1. Finding frequent itemsets with support at least s;
2. Generating association rules with confidence at least c from the itemsets found in the first step.

Recall that an association rule is an implication X → Y, where X and Y are itemsets such that X ∩ Y = ∅. The support of the rule X → Y is the number of transactions that contain X ∪ Y. The confidence of the rule X → Y is the fraction of the transactions containing X that also contain X ∪ Y.
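
As a small worked example of these two definitions, the sketch below computes the support and confidence of one rule. The baskets and the helper names `support_count` and `rule_confidence` are made up for illustration and are not part of the assignment data or code.

```python
# Toy baskets, invented for illustration only.
toy_baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support_count(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

def rule_confidence(x, y, baskets):
    """Confidence of X -> Y: support(X ∪ Y) divided by support(X)."""
    return support_count(x | y, baskets) / support_count(x, baskets)

x, y = {"bread"}, {"milk"}
print(support_count(x | y, toy_baskets))    # 2
print(rule_confidence(x, y, toy_baskets))   # 2/3, about 0.667
```

Here {bread, milk} appears in two baskets and bread appears in three, so the rule bread → milk has support 2 and confidence 2/3.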

You are to solve the first sub-problem: implement the A-Priori algorithm for finding frequent itemsets with support at least s in a dataset of sales transactions. Recall that the support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses your A-Priori implementation to discover frequent itemsets with support at least s in a given dataset of sales transactions.
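
One possible shape for the level-wise A-Priori search is sketched below. It is a minimal illustration, not the notebook's implementation: the function name `apriori` is hypothetical, s is taken as an absolute transaction count (as in the definition above), and candidates of size k are built only from frequent itemsets of size k-1.

```python
from itertools import combinations

def apriori(baskets, s):
    """Return {frozenset: count} for all itemsets with support count >= s."""
    baskets = [frozenset(b) for b in baskets]
    # Pass 1: count single items.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {c: n for c, n in counts.items() if n >= s}
    frequent = dict(current)
    k = 2
    while current:
        # Candidate generation: unions of two frequent (k-1)-itemsets that have
        # size k and whose every (k-1)-subset is frequent (support is monotone).
        prev = set(current)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Pass k: count only the surviving candidates.
        counts = {c: 0 for c in candidates}
        for basket in baskets:
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= s}
        frequent.update(current)
        k += 1
    return frequent

# Toy data, invented for illustration only.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(toy, s=3)
print(len(result))  # 6: three single items and three pairs; {a, b, c} has support 2
```

The pruning step is what separates this from enumerating every combination of items: an itemset is counted only if all of its subsets were already found frequent.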

Optional task for extra bonus: Solve the second sub-problem, i.e., develop and implement an algorithm for generating association rules between frequent itemsets discovered by using the A-Priori algorithm in a dataset of sales transactions. The rules must have support at least s and confidence at least c, where s and c are given as input parameters.
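
A rough sketch of the rule-generation step follows. It assumes the frequent itemsets are available as a dictionary from `frozenset` to support count that also contains every subset of each itemset, which holds for A-Priori output. The name `generate_rules` and the toy counts are hypothetical.

```python
from itertools import combinations

def generate_rules(frequent, c):
    """Return a list of (X, Y, confidence) for rules X -> Y with confidence >= c.

    `frequent` maps frozenset itemsets to support counts and must contain
    every non-empty subset of each itemset.
    """
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty X and a non-empty Y
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                x = frozenset(lhs)
                conf = count / frequent[x]  # support(X ∪ Y) / support(X)
                if conf >= c:
                    rules.append((x, itemset - x, conf))
    return rules

# Toy support counts, invented for illustration only.
toy_frequent = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 4,
    frozenset({"a", "b"}): 3,
}
for x, y, conf in generate_rules(toy_frequent, c=0.7):
    print(sorted(x), "->", sorted(y), conf)
```

Because every rule is built from an itemset that is already frequent, each rule automatically has support at least s; only the confidence threshold needs to be checked here.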

The sales transaction dataset consists of generated transactions (baskets) of hashed items.

In [None]:
import csv
from itertools import combinations

In [None]:
# Each line of the file is one basket: whitespace-separated item IDs.
with open("T10I4D100K.dat") as f:
    baskets = [line.strip().split() for line in f]

In [None]:
transactions = {}  # Dictionary with a 1-based basket number as key and the basket as value
for count, basket in enumerate(baskets, start=1):
    transactions[count] = basket

In [None]:
def read_data(file_loc='GroceryStoreDataSet.csv'):
    trans = dict()
    with open(file_loc) as f:
        filedata = csv.reader(f, delimiter=',')
        count = 0
        for line in filedata:
            count += 1
            trans[count] = list(set(line))
    return trans

In [None]:
read_data("GroceryStoreDataSet.csv")

In [None]:
# Count how many baskets contain each candidate (a single item or a tuple of items).
def freq(items_lst, trans):
    items_counts = dict()  # Dictionary of candidate and its frequency
    basket_sets = [set(b) for b in trans.values()]  # convert each basket once
    for i in items_lst:  # Check every candidate
        # A string is one item; a tuple (from combinations) is an itemset.
        temp_i = {i} if isinstance(i, str) else set(i)
        for basket in basket_sets:  # against every basket
            if temp_i.issubset(basket):  # if the candidate is in the basket
                items_counts[i] = items_counts.get(i, 0) + 1
    return items_counts

In [None]:
# Form association rules from the itemsets that meet the minimum support.

def association_rules(items_grater_then_min_support):
    rules = []
    dict_rules = {}
    for i in items_grater_then_min_support:
        dict_rules = {}
        if not isinstance(i, str):
            i = list(i)
            temp_i = i[:]
            for j in range(len(i)):
                k = temp_i[j]
                del temp_i[j]
                dict_rules[k] = temp_i
                temp_i = i[:]
        rules.append(dict_rules)
    temp = []
    for i in rules:
        for j in i.items():
            if not isinstance(j[1], str):
                temp.append({tuple(j[1])[0]: j[0]})
            else:
                temp.append({j[1]: j[0]})
    rules.extend(temp)
    return rules

In [None]:
# Compute the confidence of those association rules and keep only the rules whose confidence is at least the minimum confidence.

def confidence(associations, d, min_confidence):
    ans = {}
    # Index the counts by item set so that tuple order and str-vs-tuple keys do not matter.
    counts = {frozenset([k]) if isinstance(k, str) else frozenset(k): v for k, v in d.items()}
    for i in associations:
        for j in i.items():
            left = {j[0]} if isinstance(j[0], str) else set(j[0])
            right = {j[1]} if isinstance(j[1], str) else set(j[1])
            up = counts.get(frozenset(left | right))  # baskets containing all items of the rule
            down = counts.get(frozenset(right))  # baskets containing the items in `right`
            if not up or not down:
                continue  # no count recorded for this itemset, so skip the rule
            if up/down >= min_confidence:
                ans[tuple(left)[0]] = right, up/down, up, down
    return ans

In [None]:
def support(items_counts, trans):
    support = dict()
    total_trans = len(trans)
    for i in items_counts:
        support[i] = items_counts[i]/total_trans
    return support

In [None]:
# Main function: runs the steps above end to end. The minimum support, minimum confidence, and data file are passed as arguments.

def main(min_support, min_confidence, file_loc):
    trans = read_data(file_loc)
    number_of_trans = [len(i) for i in trans.values()]  # size of each basket
    items_lst = set()

    itemcount_track = list()

    for i in trans.values():
        for j in i:
            items_lst.add(j)  # set of items from all baskets

    store_item_lst = list(items_lst)[:]
    items_greater_than_min_support = list()
    items_counts = freq(items_lst, trans)
    itemcount_track.append(items_counts)
    items_greater_than_min_support.append({j[0]:j[1] for j in support(items_counts, trans).items() if j[1]>min_support})

    # Note: this enumerates every i-item combination rather than pruning candidates.
    for i in range(2, max(number_of_trans)+1):
        item_list = combinations(items_lst, i)
        items_counts = freq(item_list, trans)
        itemcount_track.append(items_counts)
        frequent_i = {j[0]:j[1] for j in support(items_counts, trans).items() if j[1]>min_support}
        if frequent_i:
            items_greater_than_min_support.append(frequent_i)

    d = {}
    for counts in itemcount_track:
        d.update(counts)  # merge the counts from every itemset size
    associations = association_rules(items_greater_than_min_support[-1])
    associations_greater_than_confidence = confidence(associations, d, min_confidence)

    print(associations_greater_than_confidence)

main(0.01, 0.7, 'GroceryStoreDataSet.csv')

In [None]:
trans = transactions # dictionary of transactions where key is count and value is basket
number_of_trans = [len(i) for i in trans.values()] # size of each basket
items_lst = set()

In [None]:
itemcount_track = list()

for i in trans.values():
    for j in i:
        items_lst.add(j) # set of items from all baskets

In [None]:
store_item_lst = list(items_lst) # list of unique items from all baskets
items_greater_than_min_support = list()

In [None]:
items_counts = freq(items_lst, trans)
itemcount_track.append(items_counts)

In [None]:
def support(items_counts, trans):
    support = dict()
    total_trans = len(trans)
    for i in items_counts:
        support[i] = items_counts[i]/total_trans # Calculate support for each unique item
    return support

In [None]:
for j in support(items_counts, trans).items():
    print(j)

In [None]:
min_support = 0.01 # Set support threshold
{j[0]:j[1] for j in support(items_counts, trans).items() if j[1]>min_support} # filter

In [None]:
items_greater_than_min_support = [{j[0]:j[1] for j in support(items_counts, trans).items() if j[1]>min_support}]
items_greater_than_min_support

In [None]:
items_lst

In [None]:
for i in combinations(items_lst, 3):
    print(i)

In [None]:
for i in range(2, max(number_of_trans)+1):
    item_list = combinations(items_lst, i) # make all possible combinations where i is length of tuple
    items_counts = freq(item_list, trans)
    itemcount_track.append(items_counts)
    frequent_i = {j[0]:j[1] for j in support(items_counts, trans).items() if j[1]>min_support}
    if frequent_i:
        items_greater_than_min_support.append(frequent_i)

In [None]:
min_confidence = 0.7 # Set confidence threshold (same value as passed to main above)
d = {}
for counts in itemcount_track:
    d.update(counts) # merge the counts from every itemset size
associations = association_rules(items_greater_than_min_support[-1])
associations_greater_than_confidence = confidence(associations, d, min_confidence)

print(associations_greater_than_confidence)