# Homework 2: Discovery of Frequent Itemsets and Association Rules

The minimum support (MIN_SUPPORT = 800) and minimum confidence (MIN_CONFIDENCE = 0.5) are the two thresholds that govern the outcome of the analysis: only itemsets that appear in at least 800 transactions (0.8% of the 100,000 transactions in the dataset) are considered frequent, and only rules with a confidence of at least 0.5 are kept.
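To make the two thresholds concrete, here is a minimal sketch on a toy dataset. The transactions and the helper names `support_count` and `confidence` are hypothetical and serve only as an illustration; they are not part of the notebook's code.

```python
# Toy transactions (hypothetical data, not taken from T10I4D100K.dat)
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def confidence(antecedent, consequent, transactions):
    """Support(X u Y) / Support(X) for the rule X -> Y."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    return both / support_count(antecedent, transactions)


ab_support = support_count({"a", "b"}, transactions)  # {a, b} occurs in 3 transactions
ab_confidence = confidence({"a"}, {"b"}, transactions)  # 3 / 4
print(ab_support, ab_confidence)
```

With a minimum support count of 3 and a minimum confidence of 0.5, the itemset {a, b} would be frequent and the rule {a} -> {b} would be kept.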

In [17]:
import os
from collections import defaultdict
from itertools import combinations
import matplotlib.pyplot as plt
import pandas as pd
# --- Algo parameters ---
MIN_SUPPORT = 800
MIN_CONFIDENCE = 0.5
current_dir = os.getcwd() 
file_path = os.path.join(current_dir, 'T10I4D100K.dat')
output_file_path = os.path.join(current_dir, 'apriori_results.txt')


The code below implements the classic Apriori algorithm, which prunes the candidate space level by level using the Apriori principle.

**1. The Apriori Principle**
The efficiency of the code rests on the Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent. Equivalently, if an itemset is infrequent, none of its supersets can be frequent, so they never need to be counted.

**2. Methods:**

load_transactions(self, file_path)
Justification: The transactional data must be loaded and represented efficiently. Each transaction is converted to a set ({item1, item2, ...}).

Technical Choice: Using sets makes the subset test (set(itemset) <= t) in the get_itemset_support method fast, because membership in a set is checked in constant time on average.
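The subset test and the Apriori principle can be checked on a small example. This is a sketch with hypothetical data and helper names (`support_count`, `subsets_at_least_as_frequent`); it only illustrates that the support of an itemset never exceeds the support of any of its subsets.

```python
from itertools import combinations

# Hypothetical toy transactions, each stored as a set
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]


def support_count(itemset, transactions):
    """Count transactions t for which set(itemset) <= t holds."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def subsets_at_least_as_frequent(itemset, transactions):
    """True if every non-empty proper subset has support >= support(itemset)."""
    full = support_count(itemset, transactions)
    return all(
        support_count(sub, transactions) >= full
        for r in range(1, len(itemset))
        for sub in combinations(itemset, r)
    )


holds = subsets_at_least_as_frequent(("a", "b", "c"), transactions)
print(holds)
```

Because every transaction containing {a, b, c} also contains {a, b}, the count for the larger itemset can only be equal or smaller, which is why an infrequent itemset rules out all of its supersets.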

get_itemset_support(self, itemsets): 
This method performs the counting phase, the most computationally intensive step in each iteration. It must count the occurrences of every candidate itemset in $C_k$ across all transactions.

generate_candidates(self, frequent_itemsets, k):
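This method implements the join step of the Apriori-Gen function. It generates candidates $C_k$ (of size $k$) by joining pairs of frequent itemsets from $L_{k-1}$ (of size $k-1$) only if they share the first $k-2$ items. The code performs only this join step: candidates that contain an infrequent $(k-1)$-subset are not pruned beforehand and are instead eliminated when their support is counted.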
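The join condition can be sketched as follows. The function name `apriori_gen` and the toy input are hypothetical. The subset-pruning check is an addition for illustration and is not part of the notebook's generate_candidates.

```python
from itertools import combinations


def apriori_gen(frequent_prev):
    """Build size-k candidates from sorted (k-1)-tuples.

    Join: two itemsets are merged only if they agree on all but their last item.
    Prune (illustrative addition): a candidate is kept only if all of its
    (k-1)-subsets are themselves in `frequent_prev`.
    """
    prev = sorted(frequent_prev)
    prev_set = set(prev)
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] != b[:-1]:
                # prev is sorted, so no later tuple shares this prefix either
                break
            candidate = a + (b[-1],)
            if all(sub in prev_set for sub in combinations(candidate, len(a))):
                candidates.append(candidate)
    return candidates


# Hypothetical frequent 2-itemsets
l2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
c3 = apriori_gen(l2)
print(c3)
```

Here ("a", "b") and ("a", "c") join into ("a", "b", "c"), whose 2-subsets are all frequent. The join of ("b", "c") and ("b", "d") is discarded because ("c", "d") is not in the input.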

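find_frequent_itemsets(self): This method runs the main level-wise loop. After finding $L_{k-1}$ and generating $C_k$, we only need to keep the transactions that contain at least one candidate from $C_k$, because any other transaction cannot contribute to the support of an itemset of size $k$ or larger.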
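The transaction-reduction idea can be illustrated in isolation. This is a sketch with a hypothetical helper `reduce_transactions` and toy data, not the notebook's exact code.

```python
def reduce_transactions(transactions, candidates):
    """Keep only the transactions that contain at least one candidate itemset."""
    candidate_sets = [frozenset(c) for c in candidates]
    return [t for t in transactions if any(c <= t for c in candidate_sets)]


# Hypothetical data: four transactions and two size-2 candidates
transactions = [{"a", "b", "c"}, {"a"}, {"b", "c"}, {"d"}]
candidates = [("a", "b"), ("b", "c")]

kept = reduce_transactions(transactions, candidates)
print(kept)
```

One consequence of this design is that the length of the transaction list no longer equals the size of the dataset after reduction. This is harmless when the threshold is an absolute count, as in MIN_SUPPORT = 800, but a relative threshold would have to be computed against the original number of transactions.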

generate_rules(self): This method generates all possible implication rules ($X \to Y$) from every frequent itemset in $L_k$ where $k \geq 2$.

Confidence Calculation: The core justification here is the formula for confidence:

$$\text{Confidence}(X \to Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$$

In the code, this is support_itemset / support_sub. Since $X \cup Y$ is the original frequent itemset being processed, its support (support_itemset) is already known.
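As a worked instance of the confidence formula, the sketch below enumerates every split of one frequent itemset into antecedent and consequent and looks up the supports in a dictionary keyed by sorted tuples. The support counts and the function name `rules_from_itemset` are hypothetical.

```python
from itertools import combinations

# Hypothetical support counts, keyed by sorted tuples
supports = {("a",): 4, ("b",): 4, ("a", "b"): 3}


def rules_from_itemset(itemset, supports, min_confidence):
    """Return (antecedent, consequent, confidence) for every split X -> Y of `itemset`."""
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            consequent = tuple(i for i in itemset if i not in antecedent)
            # Confidence(X -> Y) = Support(X u Y) / Support(X)
            conf = supports[itemset] / supports[antecedent]
            if conf >= min_confidence:
                rules.append((antecedent, consequent, conf))
    return rules


rules = rules_from_itemset(("a", "b"), supports, min_confidence=0.5)
print(rules)
```

Both {a} -> {b} and {b} -> {a} have confidence 3 / 4 = 0.75 here. The lookup of the antecedent's support always succeeds for a frequent itemset, since by the Apriori principle every subset of a frequent itemset is also frequent and therefore stored.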

In [1]:
class Apriori:
    def __init__(self, min_support, min_confidence):
        # Minimum support count required for an itemset to be considered frequent
        self.min_support = min_support
        # List to store all transactions (each transaction is a set of items)
        self.transactions = []
        self.frequent_itemsets = {} #storing all found itemsets with the format {itemset_tuple: support_count}
        # Minimum confidence required for an association rule to be considered valid
        self.min_confidence = min_confidence
        #storing generated association rules 
        self.rules =[] # Format [(sub_set, res, confidence), ...] e.g: {704} -> {825}, Confidence = 0.6143

    def load_transactions(self, file_path):
        """Read dataset and store as list of sets."""
        with open(file_path, 'r') as f:
            self.transactions = [set(line.strip().split()) for line in f]
        print(f"Loaded {len(self.transactions)} transactions.")

    def get_itemset_support(self, itemsets):
        """returns how many transactions contain each itemset """
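        support_counts = defaultdict(int)
        # Convert each candidate to a set once, instead of once per transaction
        candidate_sets = [(itemset, set(itemset)) for itemset in itemsets]
        for t in self.transactions:
            for itemset, items in candidate_sets:
                if items <= t:  # itemset is a subset of the transaction
                    support_counts[itemset] += 1
        return support_counts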

    def generate_candidates(self, frequent_itemsets, k):
        """Generate candidate itemsets of size k+1 from previous itemsets"""
        candidates = set()
        frequent_list = list(frequent_itemsets.keys())
        for i in range(len(frequent_list)):
            for j in range(i+1, len(frequent_list)):
                itemset1, itemset2 = frequent_list[i], frequent_list[j]
                if itemset1[:k-1] == itemset2[:k-1]:
                    candidate = tuple(sorted(set(itemset1) | set(itemset2)))
                    candidates.add(candidate)
        return list(candidates)

    def filter_itemsets_by_support(self, itemset_support):
        """Filters the candidate itemsets based on the minimum support count."""
        return {itemset: count for itemset, count in itemset_support.items() if count >= self.min_support}

    def find_frequent_itemsets(self):
        """Main Apriori algorithm."""
        print(f"Running Apriori with min_support = {self.min_support}")
        # Single-item candidates
        items = {item for t in self.transactions for item in t}
        # (C1)
        current_itemsets = [tuple([item]) for item in sorted(items)]
        k = 1

        while current_itemsets:
            #1. (Ck)
            support_counts = self.get_itemset_support(current_itemsets)
            #2. Filtering : (Lk)
            frequent_itemsets_k = self.filter_itemsets_by_support(support_counts)
            print(f"Found {len(frequent_itemsets_k)} frequent itemsets of size {k}")
            # 3. Storing the frequent itemsets
            self.frequent_itemsets.update(frequent_itemsets_k)

            # next iteration (k+1)
            k += 1
            # 4. Generate candidates for the next size (Ck+1)
            current_itemsets = self.generate_candidates(frequent_itemsets_k, k-1)

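            # 5. Drop transactions that contain none of the next candidates:
            #    they cannot contribute to any later support count.
            if k > 2 and current_itemsets:
                candidate_sets = [set(c) for c in current_itemsets]
                self.transactions = [t for t in self.transactions if any(c <= t for c in candidate_sets)]
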
    def write_results_to_file(self, output_file_path, rule_count):
        """Writes frequent itemsets (with support) and association rules to a file."""
        print(f"\nWriting results to {output_file_path}")
        
        # Grouping frequent itemsets by size (k)
        itemsets_by_k = defaultdict(list)
        for itemset, support in self.frequent_itemsets.items():
            itemsets_by_k[len(itemset)].append((itemset, support))

        with open(output_file_path, 'w') as f:
            f.write("Frequent Itemsets (by size k)\n\n")

            # Write Frequent Itemsets
            sorted_ks = sorted(itemsets_by_k.keys())
            for k in sorted_ks:
                f.write(f"--- k = {k} ---\n")
                sorted_itemsets = sorted(itemsets_by_k[k], key=lambda x: x[0])
                for itemset, support in sorted_itemsets:
                    itemset_str = "{" + ", ".join(itemset) + "}"
                    f.write(f"{itemset_str}: Support Count = {support}\n")
                f.write("\n")

            f.write("=====================================\n\n")
            f.write("=== Association Rules ===\n\n")

            # Writing Association Rules
            f.write(f"Total rules generated (Confidence >= {self.min_confidence}): {rule_count}\n\n")
            if not self.rules:
                f.write("No association rules found that satisfy min_confidence.\n")
            else:
                # Iterate through stored rules (antecedent, consequent, confidence)
                for antecedent, consequent, confidence in self.rules:
                    ante_str = "{" + ", ".join(sorted(list(antecedent))) + "}"
                    cons_str = "{" + ", ".join(sorted(list(consequent))) + "}"
                    f.write(f"{ante_str} -> {cons_str}\n")
            
            f.write("\n=========================\n")
        print("Results successfully written to file.")
    
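    def generate_rules(self):
        """Generate rules X -> Y from the frequent itemsets and return the number of rules kept."""
        self.rules = []  # reset so that calling the method twice does not duplicate rules
        cmp = 0 # Counter for rules generated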
        for itemset in self.frequent_itemsets:
            if len(itemset)<2:
                continue #rule made of 2 items at least
            # Generate all non-empty proper subsets of the current itemset
            subsets = [set(x) for i in range(1, len(itemset)) for x in combinations(itemset, i)]
            for sub in subsets:
                # Consequent is the rest of the itemset
                res = set(itemset) - sub
                if not res:
                    continue
                support_itemset = self.frequent_itemsets[itemset]
                # Look up support for the antecedent (sub). 0 if not found.
                support_sub = self.frequent_itemsets.get(tuple(sorted(sub)), 0)#because sub is a set and the keys of the dictionary are tuples e.g ('32',)
                if support_sub == 0:
                    continue
                confidence = support_itemset / support_sub

                if confidence >= self.min_confidence:
                    self.rules.append((sub,res, confidence)) #each rule is a tuple + its associated confidence
                    cmp +=1
        return cmp

We instantiate the Apriori object and run the three steps of association rule discovery. First, apriori.load_transactions(file_path) reads and parses the input file, loading the data into memory as a list of sets. Second, apriori.find_frequent_itemsets() executes the iterative Apriori process, reporting the number of frequent itemsets found at each size $k$. Finally, rule_count = apriori.generate_rules() finds all valid association rules and returns how many met the confidence criterion. The console output reproduced below is an excerpt: it shows the per-size counts and part of the list of frequent itemsets ($L_k$); the association rules ($X \to Y$) are printed after them.

The method apriori.write_results_to_file(output_file_path, rule_count) writes all frequent itemsets (grouped by size $k$) and all valid association rules to the specified file, so that the results are preserved for external analysis.

In [4]:
# Initializing and running Apriori
apriori = Apriori(MIN_SUPPORT, MIN_CONFIDENCE)
apriori.load_transactions(file_path)
apriori.find_frequent_itemsets()

# Generate rules and capture the count (called only once)
rule_count = apriori.generate_rules()

# Console Output: Frequent Itemsets
print("\nFrequent Itemsets:")
for itemset, support in apriori.frequent_itemsets.items():
    print(f"{itemset}: {support}")

# Console Output: Association Rules
print("\nGenerating associated rules to the frequent Itemsets found:")
print(f"Total rules found (confidence >= {MIN_CONFIDENCE}): {rule_count}")

# Iterate and print rules (unpacking all 3, but only displaying 2 elements as requested)
for antecedent, consequent, confidence in apriori.rules:
    print(f"{antecedent} -> {consequent}")

# Saving results to a file
apriori.write_results_to_file(output_file_path, rule_count)

Loaded 100000 transactions.
Running Apriori with min_support = 800
Found 443 frequent itemsets of size 1
Found 44 frequent itemsets of size 2
Found 7 frequent itemsets of size 3

Frequent Itemsets:
('240',): 1399
('25',): 1395
('274',): 2628
('368',): 7828
('448',): 1370
('52',): 1983
('538',): 3982
('561',): 2783
('630',): 1523
('687',): 1762
('775',): 3771
('825',): 3085
('834',): 1373
('120',): 4973
('205',): 3605
('39',): 4258
('401',): 3667
('581',): 2943
('704',): 1794
('814',): 1672
('35',): 1984
('674',): 2527
('712',): 845
('733',): 1141
('854',): 2847
('950',): 1463
('422',): 1255
('449',): 1890
('857',): 1588
('895',): 3385
('937',): 4681
('964',): 1518
('229',): 2281
('283',): 4082
('294',): 1445
('352',): 902
('381',): 2959
('708',): 1090
('738',): 2129
('766',): 6265
('853',): 1804
('883',): 4902
('966',): 3921
('978',): 1141
('104',): 1158
('143',): 1417
('569',): 2835
('620',): 2100
('798',): 3103
('185',): 1529
('214',): 1893
('350',): 3069
('529',): 7057
('658',): 188

Following this, the plotting functions analyze the structure of the results. Specifically, plot_itemset_count_vs_k() visualizes the rapid drop in the number of frequent itemsets as $k$ increases (443, 44, and 7 for $k = 1, 2, 3$ in the run above), which is typical of sparse transactional data: few large item combinations reach the support threshold.

In [15]:
def plot_itemset_count_vs_k(frequent_itemsets):
    """Plot 1: Bar chart showing the number of frequent itemsets for each size k."""
    
    # Prepare data: Count itemsets by length
    k_counts = defaultdict(int)
    for itemset in frequent_itemsets.keys():
        k_counts[len(itemset)] += 1
    
    df = pd.Series(k_counts).sort_index().rename_axis('Itemset Size (k)').reset_index(name='Count')
    
    plt.figure(figsize=(8, 5))
    plt.bar(df['Itemset Size (k)'], df['Count'], color='skyblue')
    plt.title('Number of Frequent Itemsets vs. Size (k)')
    plt.xlabel('Itemset Size (k)')
    plt.ylabel('Count of Frequent Itemsets')
    plt.xticks(df['Itemset Size (k)'])
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.savefig('plot_itemset_count_vs_k.png')
    plt.close()
    print("Plot 1: plot_itemset_count_vs_k.png saved.")

Similarly, plot_rules_by_length() summarizes the complexity of the rules found (e.g., $1 \to 1$ rules are likely the most common). These plots are vital for understanding the underlying distribution and complexity of the discovered patterns within the transactional data.

In [14]:
def plot_rules_by_length(rules):
    """Plot : Bar chart showing the count of rules grouped by format (len(A) -> len(B))."""
    
    # Prepare data: Format as "len(A) -> len(B)"
    rule_formats = []
    for antecedent, consequent, _ in rules:
        format_str = f"{len(antecedent)} -> {len(consequent)}"
        rule_formats.append(format_str)
        
    df = pd.Series(rule_formats).value_counts().sort_index().rename_axis('Rule Format').reset_index(name='Count')
    
    plt.figure(figsize=(10, 6))
    plt.bar(df['Rule Format'], df['Count'], color='mediumseagreen')
    plt.title('Count of Association Rules by Format (Antecedent $\\to$ Consequent Size)')
    plt.xlabel('Rule Format (e.g., "1 $\\to$ 1")')
    plt.ylabel('Count of Rules')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.savefig('plot_rules_by_length.png')
    plt.close()
    print("Plot 3: plot_rules_by_length.png saved.")

In [None]:
# --- Generate Plots ---
print("\nGenerating visualizations...")
if apriori.frequent_itemsets:
    plot_itemset_count_vs_k(apriori.frequent_itemsets)
       # plot_top_k1_items(apriori.frequent_itemsets, n=10)
else:
    print("No frequent itemsets found.")
        
if apriori.rules:
    plot_rules_by_length(apriori.rules)


Generating visualizations...
Plot 1: plot_itemset_count_vs_k.png saved.
Plot 3: plot_rules_by_length.png saved.


The following code represents an alternative implementation of the Apriori algorithm developed by another member of the group. In this section, we focus on analyzing the execution time of the algorithm and evaluating how it is affected by different parameters, including the support threshold, the confidence level, and the number of data lines used during mining.
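The timing measurements in this section follow a simple pattern: record a start time, run the step, and convert the elapsed time to milliseconds. A minimal sketch of that pattern is shown here with a hypothetical helper `time_ms`. It uses `time.perf_counter()`, a monotonic clock intended for measuring short durations, whereas the code below uses `time.time()`.

```python
import time


def time_ms(func, *args, **kwargs):
    """Run `func` once and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


# Usage with a stand-in workload (hypothetical; any callable works)
result, elapsed_ms = time_ms(sum, range(1000))
print(f"result={result}, elapsed={elapsed_ms:.3f} ms")
```

A single run per parameter value, as measured here, is sensitive to noise from other processes; repeating each measurement and taking the minimum or median would give more stable curves.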

In [None]:
import numpy as np
import time as t
import matplotlib.pyplot as plt
import copy


class FrequentItem:
    itemFound: np.matrix

    def __init__(self) -> None:
        print("Frequent Item Finder")
    
    def findSingletons(self, set: np.matrix,threshold: int) -> list[any]:
        temp = []
        for subset in set:
            temp = np.hstack((temp,np.unique(subset)))
        uniqueItems = np.unique(temp)
        items = []
        for i,item in enumerate(uniqueItems):
            sum = 0
            for j,subset in enumerate(set):
                idx = (subset == item)
                sum += np.sum(idx)
                if(sum/len(set)>=threshold):
                        items.append(item)
                        break
        return items
    
    def buildCandidates(self, singletons: list[any], multiplons: np.matrix) -> list[any]:
        candidates = []
        for set in multiplons:
            for single in singletons:
                if(np.isin(single,set,invert=True)):
                    if isinstance(set, np.integer):
                        temp = [set, single]
                        temp.sort()
                        candidates.append(temp)
                    else:
                        temp = set
                        temp = np.hstack((temp,single))
                        temp.sort()
                        candidates.append(temp)
      
        return np.unique(candidates,axis=0)
    
    #initialize frequentItem with singletons
    #each subset of the initial working set should be a set and not a bag meaning that every subset contain an element exactly once
    def getFrequentItemFinder(self,threshold:int,set: np.matrix,frequentItem : list , singletons : list[any], depth: int = 0) -> list[any]:
        candidates = self.buildCandidates(singletons,frequentItem[depth])
        #print(f"candidates= {candidates}")
        depth +=1
        items = []
        for i,item in enumerate(candidates):
            sum = 0
            for j,subset in enumerate(set):
                mask = np.isin(subset,item)
                if(np.sum(mask)==depth+1):
                    sum += 1
            if(sum/len(set)>=threshold):
                items.append(item)
        #print(f"items that are frequent= {items}")
        if(len(items)==0):
            #print("exting recursive loop")
            #print(frequentItem)
            self.itemFound = frequentItem
            return frequentItem
        else:
            frequentItem.append(items)
            #print(f"resulting matrix= {frequentItem}")
            # Return the recursive result so the caller receives the final list instead of None
            return self.getFrequentItemFinder(threshold,set,frequentItem,singletons,depth)

    def computeSupport(self, set: np.matrix , kFrequent: list[float]) -> float:
        support = 0
        for j,subset in enumerate(set):
            mask = np.isin(subset,kFrequent)
            if(np.sum(mask)==len(kFrequent)):
                    support += 1
        return support

    """
    X->Y is stored as followed
    [[[X1],[Y1]],
     [[X2],[Y2]],
        ...
     [[Xn],[Yn]]]
    
    """

    def findSimpleAssociationRules(self, set : np.matrix ,confidence : float) -> np.matrix:
        simpleAssociationRules = []
        for kFrequents in self.itemFound[1:]: #no need to investigate singletons
            for i,kFrequent in enumerate(kFrequents):
                supportN = self.computeSupport(set,kFrequent)
                for j in range(len(kFrequent)-1,-1,-1):
                    temp = np.delete(kFrequent,j)
                    #print(temp)
                    supportD = self.computeSupport(set,temp)
                    if(supportN/supportD >= confidence):
                        simpleAssociationRules.append([temp,[kFrequent[j]]])
                        #print([temp,[kFrequent[j]]])
                        self.mineAssociationRules(set,temp,[kFrequent[j]],simpleAssociationRules,supportN,confidence)
                        #print(f"building association rules = {simpleAssociationRules}")
        return simpleAssociationRules
                    
    def mineAssociationRules(self,set: np.matrix , X : list[float], Y : list[float], rules : list[any], supportN : int, confidence : float)-> np.matrix:
        if(len(X)<=1):
            return rules
        else:
            for i in range(len(X)):
                temp = np.delete(X,i)
                #print(Y)
                supportD = self.computeSupport(set,temp)
                if(supportN/supportD >= confidence):
                    rules.append([temp,Y+[X[i]]])
                    #print([temp,Y+[X[i]]])
                    self.mineAssociationRules(set,temp,Y+[X[i]],rules,supportN,confidence)

In [None]:
matrix =[]

with open("T10I4D100K.dat", "r") as f:
    for line in f:
        row = list(map(int, line.split()))
        matrix.append(row)

timekFrequent = []
timeAssociation = []
itemFinder = FrequentItem()

thresholds = [4,5,7,10]
for i in thresholds:
    print(f'finding frequent items t={i/100}')
    e = t.time()
    singletons = itemFinder.findSingletons(matrix,i/100)
    kFrequents = itemFinder.getFrequentItemFinder(i/100,matrix,[singletons],singletons)
    timekFrequent.append((t.time()-e)*1000)
    print(f'finding associations rules')
    e = t.time()
    simpleAsso = itemFinder.findSimpleAssociationRules(matrix,0.5)
    timeAssociation.append((t.time()-e)*1000)

plt.scatter(thresholds,timekFrequent,label='kFrequent')
plt.xlabel("threshold (%)")
plt.ylabel("time in ms")
plt.show()

plt.scatter(thresholds,timeAssociation,label='Association')
plt.xlabel("threshold (%)")
plt.ylabel("time in ms")
plt.show()

timeAssociation = []

confidence = [40,50,70,90]
singletons = itemFinder.findSingletons(matrix,0.03)
kFrequents = itemFinder.getFrequentItemFinder(0.03,matrix,[singletons],singletons)

for i in confidence:
    print(f'finding associations rules c={i/100}')
    e = t.time()
    simpleAsso = itemFinder.findSimpleAssociationRules(matrix,i/100)
    timeAssociation.append((t.time()-e)*1000)

plt.scatter(confidence,timeAssociation,label='Association')
plt.xlabel("confidence (%)")
plt.ylabel("time in ms")
plt.show()

All datasets used throughout this lab are publicly available in the following GitHub repository: https://github.com/maximesf/lab_ID2222/ To access them, navigate to the lab1/data/ directory within the repository, where all the files used in our experiments are stored.