# Apriori Algorithm Implementation using JupyterLab

This notebook demonstrates an implementation of the **Apriori algorithm** for association rule mining in **JupyterLab**.

## Objective:
- This demonstration uses a **grocery basket** market basket dataset of 9825 transactions covering 169 unique items.
- *Source*: https://www.kaggle.com/datasets/irfanasrullah/groceries


In [12]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
import time
import sys

**Calculate runtime and memory usage**

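The `measure_performance` decorator wraps a function, records its wall-clock runtime, and prints a rough memory figure after each call. The memory figure is the `sys.getsizeof` of the result minus the `sys.getsizeof` of the arguments. Because `sys.getsizeof` is shallow (it does not follow references to nested objects), treat this number as a coarse indicator rather than the memory actually consumed.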


In [13]:
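from functools import wraps


# Decorator that reports runtime and a rough memory figure for each call
def measure_performance(func):
    @wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        # perf_counter is a monotonic clock intended for timing intervals
        start_time = time.perf_counter()
        # Shallow size of the argument containers only, not of their contents
        start_memory = sys.getsizeof(args) + sys.getsizeof(kwargs)

        result = func(*args, **kwargs)

        end_time = time.perf_counter()
        # Shallow size of the returned object
        end_memory = sys.getsizeof(result)

        runtime = end_time - start_time
        memory_used = end_memory - start_memory

        print(f"Function {func.__name__} took {runtime:.4f} seconds and used {memory_used} bytes of memory.")
        return result
    return wrapper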
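
Since the size difference above is only a shallow estimate, a peak-memory measurement can be sketched with the standard library's `tracemalloc` module. This is a minimal illustration, not part of the original notebook: the names `measure_peak_memory` and `build_list` are hypothetical, and the sketch does not handle nested decorated calls (the inner call would stop tracing for the outer one).

```python
import time
import tracemalloc
from functools import wraps


def measure_peak_memory(func):
    """Hypothetical alternative decorator: runtime plus peak traced memory."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        start_time = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        finally:
            runtime = time.perf_counter() - start_time
            # get_traced_memory() returns (current, peak) in bytes
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
        print(f"Function {func.__name__} took {runtime:.4f} seconds; "
              f"peak traced memory was {peak} bytes.")
        return result
    return wrapper


@measure_peak_memory
def build_list(n):
    # Toy workload standing in for a real preprocessing step
    return list(range(n))


values = build_list(100_000)
```

`tracemalloc` only reports allocations it can trace while it is running, so the figure is still an approximation, but it reflects memory allocated during the call rather than the size of a single container object.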


**Function to pre-process the dataset on which the apriori algorithm will be performed**
- The dataset has three columns: `Member_number`, `Date`, and `itemDescription`. We group the rows by `Member_number` and collect each member's items into a list, so every member's purchases form one transaction.
- We then one-hot encode the transactions. The Apriori implementation used here operates on binary transaction data, where each item is marked as present (1) or absent (0) in a transaction. As a frequent itemset mining algorithm, it needs this structured format to count how often items appear together.
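
To make the encoding concrete, here is a minimal pure-Python sketch of one-hot encoding on three toy baskets. The data is made up for illustration and the code does not use `TransactionEncoder`; it only shows the shape of the result conceptually (one row per transaction, one boolean column per item).

```python
# Toy transactions (hypothetical data, not taken from the groceries dataset)
transactions = [
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
]

# Every distinct item becomes one column
items = sorted({item for basket in transactions for item in basket})

# One row per transaction: True if the item is in the basket, else False
encoded = [[item in basket for item in items] for basket in transactions]

print(items)
for row in encoded:
    print(row)
```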

In [14]:
from mlxtend.preprocessing import TransactionEncoder


@measure_performance
def load_and_preprocess_data(csv_file):
    """Load the CSV and return a one-hot encoded DataFrame (None if the file is missing)."""
    try:
        df = pd.read_csv(csv_file)
    except FileNotFoundError:
        print(f"Error: File '{csv_file}' not found.")
        return None

    # Group by member number: each member's purchased items form one transaction
    transactions = df.groupby('Member_number')['itemDescription'].apply(list).tolist()

    # One-hot encode: one boolean column per item, one row per transaction
    te = TransactionEncoder()
    te_ary = te.fit(transactions).transform(transactions)
    df_encoded = pd.DataFrame(te_ary, columns=te.columns_)

    return df_encoded

**Function to perform Apriori and association rule mining on the dataset**

In [15]:
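@measure_performance
def apply_apriori(df_encoded, min_support=0.01):
    """Return the itemsets whose support is at least min_support."""
    # use_colnames=True reports item names instead of column indices
    frequent_itemsets = apriori(df_encoded, min_support=min_support, use_colnames=True, low_memory=True)
    return frequent_itemsets


@measure_performance
def generate_association_rules(frequent_itemsets, min_confidence=0.01, min_lift=1.0):
    """Return rules meeting min_confidence whose lift is strictly above min_lift."""
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)
    # Keep only rules where the antecedent raises the likelihood of the consequent
    rules_with_lift = rules[rules['lift'] > min_lift]
    return rules_with_lift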
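
To show what a level-wise frequent itemset search does conceptually, here is a minimal pure-Python sketch on toy data. It is not mlxtend's implementation: the function name `toy_apriori` and the baskets are hypothetical, and the code favors readability over speed.

```python
from itertools import combinations


def toy_apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items
    all_items = {item for basket in baskets for item in basket}
    current = {frozenset([item]) for item in all_items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


toy = [
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
    ["milk"],
]
found = toy_apriori(toy, min_support=0.5)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what gives the approach its name: an itemset can only be frequent if all of its subsets are frequent, so candidates with an infrequent subset are discarded before their support is counted.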

**Load the dataset**

In [16]:
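csv_file = "groceries_dataset.csv"

df_encoded = load_and_preprocess_data(csv_file)

if df_encoded is None:
    # Stop this cell without shutting down the notebook kernel
    raise SystemExit("Dataset could not be loaded; stopping.")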

Function load_and_preprocess_data took 0.4720 seconds and used 651018 bytes of memory.


**Set the minimum thresholds for association rule mining**
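- This section tries progressively lower support and confidence thresholds and stops at the first combination that yields rules with lift above `min_lift`. For the rules found, it prints the support, confidence, and lift.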
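
As a reminder of what the three metrics mean, here is a small worked example in pure Python on toy data (the baskets and the rule `butter -> bread` are made up for illustration). Support is the fraction of transactions containing an itemset, confidence is the support of the whole rule divided by the support of the antecedent, and lift is the confidence divided by the support of the consequent.

```python
# Toy transactions (hypothetical data, not taken from the groceries dataset)
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]


def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)


antecedent, consequent = {"butter"}, {"bread"}

rule_support = support(antecedent | consequent)   # share of baskets with both
confidence = rule_support / support(antecedent)   # P(consequent | antecedent)
lift = confidence / support(consequent)           # confidence relative to baseline

print(f"support={rule_support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 means the consequent appears more often in baskets containing the antecedent than it does overall, which is why the notebook keeps only rules with lift greater than `min_lift = 1.0`.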


In [19]:
min_supports_to_try = [0.005, 0.002, 0.001]  # Lower support for large datasets
min_confidences_to_try = [0.01, 0.005, 0.002]
min_lift = 1.0

found_rules = False

for min_support in min_supports_to_try:
    print(f"Trying min_support = {min_support}")
    frequent_itemsets = apply_apriori(df_encoded, min_support)

    print("\nFrequent Itemsets (Top 50 or all if fewer):")
    print(frequent_itemsets.head(50))  # head() already returns fewer rows if fewer exist

    for min_confidence in min_confidences_to_try:
        print(f"Trying min_confidence = {min_confidence}")
        rules = generate_association_rules(frequent_itemsets, min_confidence, min_lift)

        if not rules.empty:
            found_rules = True
            print(f"\nAssociation Rules (support={min_support}, confidence={min_confidence}, lift > {min_lift}):")
            print(rules[['antecedents', 'consequents', 'confidence', 'lift', 'support']].head(50))
            break

    if found_rules:
        break

if not found_rules:
    print("No association rules found. Check your data, format, or if association rule mining is appropriate.")

Trying min_support = 0.005
Function apply_apriori took 0.2954 seconds and used 2408524 bytes of memory.

Frequent Itemsets (Top 50 or all if fewer):
     support                    itemsets
0   0.015393     (Instant food products)
1   0.078502                  (UHT-milk)
2   0.005644          (abrasive cleaner)
3   0.007440          (artif. sweetener)
4   0.031042             (baking powder)
5   0.119548                      (beef)
6   0.079785                   (berries)
7   0.062083                 (beverages)
8   0.158799              (bottled beer)
9   0.213699             (bottled water)
10  0.009749                    (brandy)
11  0.135967               (brown bread)
12  0.126475                    (butter)
13  0.064905               (butter milk)
14  0.022832                  (cake bar)
15  0.016932                   (candles)
16  0.053874                     (candy)
17  0.165213               (canned beer)
18  0.029502               (canned fish)
19  0.005387              (cann