# Frequent Itemset Mining & Association Rules with CSV Dataset


In [9]:
import pandas as pd
from itertools import combinations

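def load_transactions(csv_file):
    """Load transactions from a CSV file, one transaction (a set of items) per row"""
    # Note: pd.read_csv treats the first line of the file as a header row by default
    df = pd.read_csv(csv_file)
    transactions = []
    for _, row in df.iterrows():
        transaction = set(row.dropna().values)  # Drop NaN cells left by shorter rows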
        transactions.append(transaction)
    return transactions
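
Because `pd.read_csv` uses the first line as column names by default, a file without a header row would lose its first basket. Whether `transactions.csv` has a header is not stated here, so the following is only a sketch under that assumption: a standard-library loader (the name `load_transactions_no_header` and the toy data are hypothetical) that treats every line as a transaction.

```python
import csv
import io

def load_transactions_no_header(lines):
    """Read one basket per line, treating every line (including the first) as data."""
    transactions = []
    for row in csv.reader(lines):
        items = {cell.strip() for cell in row if cell.strip()}  # drop empty cells
        if items:  # skip completely blank lines
            transactions.append(items)
    return transactions

# Hypothetical three-basket file with no header row
toy_csv = io.StringIO("Bread,Milk,\nEggs,,\nBread,Eggs,Milk\n")
print(load_transactions_no_header(toy_csv))
```

For a file on disk, the same function can be called with an open file handle, for example `with open(csv_file, newline="") as f: load_transactions_no_header(f)`.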


In [10]:

def generate_candidates(frequent_itemsets, k):
    """Generate candidate itemsets of size k from frequent itemsets of size k-1"""
    candidates = set()
    frequent_items = list(frequent_itemsets.keys())
    for i in range(len(frequent_items)):
        for j in range(i+1, len(frequent_items)):
            union_set = frequent_items[i] | frequent_items[j]
            if len(union_set) == k:
                candidates.add(frozenset(union_set))
    return candidates

def get_frequent_itemsets(transactions, candidates, min_support):
    """Filter candidates to get frequent itemsets based on support threshold"""
    itemset_counts = {}
    for transaction in transactions:
        for candidate in candidates:
            if candidate.issubset(transaction):
                itemset_counts[candidate] = itemset_counts.get(candidate, 0) + 1
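    total_transactions = len(transactions)
    # Convert raw counts to relative support, then keep itemsets that meet the threshold
    supports = {itemset: count / total_transactions for itemset, count in itemset_counts.items()}
    return {itemset: support for itemset, support in supports.items() if support >= min_support}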

def apriori(transactions, min_support):
    """Apriori Algorithm to mine frequent itemsets"""
    single_items = {frozenset([item]) for transaction in transactions for item in transaction}
    frequent_itemsets = get_frequent_itemsets(transactions, single_items, min_support)
    all_frequent_itemsets = frequent_itemsets.copy()
    k = 2
    while frequent_itemsets:
        candidates = generate_candidates(frequent_itemsets, k)
        frequent_itemsets = get_frequent_itemsets(transactions, candidates, min_support)
        all_frequent_itemsets.update(frequent_itemsets)
        k += 1
    return all_frequent_itemsets
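
`generate_candidates` above joins frequent itemsets but does not apply the classic Apriori prune step, which discards any candidate that has an infrequent (k-1)-subset before support is counted. The sketch below is one possible variant, not part of the original script; the name `generate_candidates_pruned` and the toy itemsets are hypothetical.

```python
from itertools import combinations

def generate_candidates_pruned(frequent_itemsets, k):
    """Join frequent (k-1)-itemsets and keep only candidates whose
    (k-1)-subsets are all frequent (the Apriori prune step)."""
    frequent = set(frequent_itemsets)  # dict keys, or any iterable of frozensets
    candidates = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        if len(union) == k and all(
            frozenset(subset) in frequent for subset in combinations(union, k - 1)
        ):
            candidates.add(union)
    return candidates

# Hypothetical frequent 2-itemsets: {A,B}, {A,C}, {B,C}, {B,D}
frequent_2 = {frozenset(p): 0.2 for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
print(generate_candidates_pruned(frequent_2, 3))
```

On this toy input only `{A, B, C}` survives: `{A, B, D}` is dropped because `{A, D}` is not frequent, and `{B, C, D}` because `{C, D}` is not. The final result of `apriori` would be the same either way; pruning only reduces how many candidates have to be counted against the transactions.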

In [11]:
def generate_association_rules(frequent_itemsets, min_confidence):
    """Generate association rules from frequent itemsets"""
    rules = []
    for itemset in frequent_itemsets:
        if len(itemset) > 1:
            for i in range(1, len(itemset)):
                for antecedent in combinations(itemset, i):
                    antecedent = frozenset(antecedent)
                    consequent = itemset - antecedent
                    support_itemset = frequent_itemsets[itemset]
                    support_antecedent = frequent_itemsets[antecedent]
                    confidence = support_itemset / support_antecedent
                    if confidence >= min_confidence:
                        rules.append((antecedent, consequent, confidence))
    return rules
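
For reference, the code above follows the usual definitions of support and confidence. The symbols $T$ (the list of transactions) and $N$ (its length) are introduced here only for illustration:

```latex
\mathrm{supp}(X) = \frac{\left|\{\, t \in T : X \subseteq t \,\}\right|}{N},
\qquad
\mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)}
```

Here $\mathrm{supp}(A \cup B)$ corresponds to `support_itemset` and $\mathrm{supp}(A)$ to `support_antecedent` in `generate_association_rules`.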

In [68]:
# load dataset from CSV file
csv_file = "transactions.csv"
transactions = load_transactions(csv_file)
print(f"Loaded {len(transactions)} transactions from {csv_file}")

# set support and confidence thresholds
min_support = 0.09
min_confidence = 0.30

Loaded 499 transactions from transactions.csv


In [69]:
# Step 1: Mine frequent itemsets
frequent_itemsets = apriori(transactions, min_support)
print("Frequent Itemsets:")
for itemset, support in frequent_itemsets.items():
    print(f"{set(itemset)}: {support:.2f}")


Frequent Itemsets:
{'Bread'}: 0.26
{'Fish'}: 0.27
{'Juice'}: 0.30
{'Butter'}: 0.28
{'Eggs'}: 0.30
{'Cereal'}: 0.31
{'Tomato'}: 0.30
{'Chicken'}: 0.23
{'Milk'}: 0.27
{'Beer'}: 0.23
{'Pasta'}: 0.29
{'Cheese'}: 0.26
{'Yogurt'}: 0.29
{'Diapers'}: 0.26
{'Rice'}: 0.27
{'Juice', 'Eggs'}: 0.09
{'Bread', 'Eggs'}: 0.09
{'Cereal', 'Eggs'}: 0.10
{'Bread', 'Tomato'}: 0.09
{'Tomato', 'Juice'}: 0.10
{'Tomato', 'Yogurt'}: 0.09
{'Fish', 'Yogurt'}: 0.09
{'Butter', 'Tomato'}: 0.10
{'Yogurt', 'Cereal'}: 0.09
{'Yogurt', 'Pasta'}: 0.10


In [70]:
# Step 2: Generate Association Rules
association_rules = generate_association_rules(frequent_itemsets, min_confidence)
print("\nAssociation Rules:")
for antecedent, consequent, confidence in association_rules:
    print(f"{set(antecedent)} -> {set(consequent)} (Confidence: {confidence:.2f})")



Association Rules:
{'Juice'} -> {'Eggs'} (Confidence: 0.30)
{'Eggs'} -> {'Juice'} (Confidence: 0.31)
{'Bread'} -> {'Eggs'} (Confidence: 0.34)
{'Eggs'} -> {'Bread'} (Confidence: 0.30)
{'Cereal'} -> {'Eggs'} (Confidence: 0.31)
{'Eggs'} -> {'Cereal'} (Confidence: 0.33)
{'Bread'} -> {'Tomato'} (Confidence: 0.36)
{'Tomato'} -> {'Bread'} (Confidence: 0.31)
{'Tomato'} -> {'Juice'} (Confidence: 0.32)
{'Juice'} -> {'Tomato'} (Confidence: 0.32)
{'Yogurt'} -> {'Tomato'} (Confidence: 0.31)
{'Fish'} -> {'Yogurt'} (Confidence: 0.33)
{'Yogurt'} -> {'Fish'} (Confidence: 0.31)
{'Butter'} -> {'Tomato'} (Confidence: 0.35)
{'Tomato'} -> {'Butter'} (Confidence: 0.32)
{'Yogurt'} -> {'Cereal'} (Confidence: 0.31)
{'Yogurt'} -> {'Pasta'} (Confidence: 0.34)
{'Pasta'} -> {'Yogurt'} (Confidence: 0.34)



# **Class Activity Questions:**
 1. Run the script with a large dataset. How do frequent itemsets change as dataset size increases?

  With a support threshold of 0.4 and a confidence threshold of 0.75, the script found no frequent itemsets and no association rules. Lowering the support threshold to 0.3 produced three single-item frequent itemsets:
- {'Juice'}: 0.30
- {'Cereal'}: 0.31
- {'Tomato'}: 0.30

Lowering the threshold to 0.2 identified further single-item frequent itemsets, but no multi-item ones. I first got 2-item frequent itemsets after dropping the threshold to 0.05; the run shown above, at 0.09, also yields ten of them, each with a support of roughly 0.09 to 0.10.

Overall, a larger dataset tends to contain a more diverse set of items, which can lower the support of any single item. Multi-item itemsets need lower support thresholds because a combination can never occur more often than its least frequent member, and it usually occurs far less often. In real-world applications, support thresholds need careful tuning to balance relevance against noise.
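
The threshold effect described above can be reproduced on a tiny example. This is a minimal sketch on hypothetical toy baskets (not `transactions.csv`), using brute-force counting rather than the notebook's `apriori` function; `support` and `count_frequent` are illustrative helper names.

```python
from itertools import combinations

# Hypothetical toy baskets, not the notebook's transactions.csv
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Eggs"},
    {"Milk", "Eggs"},
    {"Bread", "Milk", "Eggs"},
    {"Bread"},
    {"Milk"},
    {"Juice"},
    {"Bread", "Juice"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def count_frequent(transactions, min_support, max_size=2):
    """Count frequent itemsets of each size by brute force (toy data only)."""
    items = sorted(set().union(*transactions))
    return {
        size: sum(
            support(set(combo), transactions) >= min_support
            for combo in combinations(items, size)
        )
        for size in range(1, max_size + 1)
    }

for threshold in (0.5, 0.3, 0.2):
    print(threshold, count_frequent(baskets, threshold))
```

On these baskets, single items qualify at every threshold tried (2, 3, and 4 of them), while pairs only appear once the threshold drops to 0.2, which mirrors the pattern seen with the real data.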


 2. Modify support and confidence thresholds. What changes do you observe in the output?
 For the support threshold, see the answer to question 1. With a confidence threshold of 0.75, no association rules were produced. After experimenting with both thresholds, I obtained the following association rules with a support threshold of 0.09 and a confidence threshold of 0.3:

- {'Juice'} -> {'Eggs'} (Confidence: 0.30)
- {'Eggs'} -> {'Juice'} (Confidence: 0.31)
- {'Bread'} -> {'Eggs'} (Confidence: 0.34)
- {'Eggs'} -> {'Bread'} (Confidence: 0.30)
- {'Cereal'} -> {'Eggs'} (Confidence: 0.31)
- {'Eggs'} -> {'Cereal'} (Confidence: 0.33)
- {'Bread'} -> {'Tomato'} (Confidence: 0.36)
- {'Tomato'} -> {'Bread'} (Confidence: 0.31)
- {'Tomato'} -> {'Juice'} (Confidence: 0.32)
- {'Juice'} -> {'Tomato'} (Confidence: 0.32)
- {'Yogurt'} -> {'Tomato'} (Confidence: 0.31)
- {'Fish'} -> {'Yogurt'} (Confidence: 0.33)
- {'Yogurt'} -> {'Fish'} (Confidence: 0.31)
- {'Butter'} -> {'Tomato'} (Confidence: 0.35)
- {'Tomato'} -> {'Butter'} (Confidence: 0.32)
- {'Yogurt'} -> {'Cereal'} (Confidence: 0.31)
- {'Yogurt'} -> {'Pasta'} (Confidence: 0.34)
- {'Pasta'} -> {'Yogurt'} (Confidence: 0.34)

**How Frequent Itemsets Change with Larger Datasets**

- Diversity effect: larger datasets tend to contain a wider variety of items, making it harder for any single item to reach high support.
- Multi-item patterns are rare: because customers buy different combinations of products, multi-item frequent itemsets are much less common than single items. That is why 2-item itemsets only emerge here at low support levels (below about 0.1).
- Threshold sensitivity: a high support threshold (e.g., 0.4) can be too restrictive and filter out potential patterns. Lowering the threshold reveals more patterns but also increases the risk of noise.

**Impact of Modifying Support & Confidence Thresholds**

Some rules may be weak, but they highlight possible relationships that a strict threshold would miss. Choosing support and confidence thresholds is a trade-off:
- High thresholds → Fewer but stronger rules.
- Low thresholds → More rules, but some may be weak or random.

 Certain items may not have strong one-to-one relationships but could be part of a larger pattern (e.g., clusters of products bought together).

 Practical Use Case: Retailers can use these insights to adjust promotions, product placements, and recommendations based on frequently co-purchased items.


 3. Analyze which rules have the highest confidence and explain why they might be useful in business applications.

The rules with the highest confidence are:

- {'Bread'} → {'Tomato'} (Confidence: 0.36)
- {'Butter'} → {'Tomato'} (Confidence: 0.35)
- {'Yogurt'} → {'Pasta'} (Confidence: 0.34)
- {'Pasta'} → {'Yogurt'} (Confidence: 0.34)

Interpretation: in this dataset, 36% of the transactions that contain bread also contain tomatoes.
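
Confidence alone does not show whether a rule beats the baseline popularity of its consequent. A common complementary measure is lift, the confidence divided by the consequent's support. The sketch below is an addition, not part of the original script; it uses the rounded figures printed in the notebook output, so the results are approximate, and `lift` is an illustrative helper name.

```python
def lift(confidence, consequent_support):
    """Lift = confidence(A -> B) / support(B); values near 1 suggest A and B
    occur together about as often as expected if they were independent."""
    return confidence / consequent_support

# (confidence of the rule, support of the consequent), rounded values from the output
top_rules = {
    "Bread -> Tomato": (0.36, 0.30),
    "Butter -> Tomato": (0.35, 0.30),
    "Yogurt -> Pasta": (0.34, 0.29),
    "Pasta -> Yogurt": (0.34, 0.29),
}

for rule, (confidence, consequent_support) in top_rules.items():
    print(f"{rule}: lift is about {lift(confidence, consequent_support):.2f}")
```

All four values come out around 1.2, so these rules point to a modest rather than a strong association. That is worth keeping in mind when weighing the business ideas below.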

Cross-Selling: Grocery stores can place these items near each other to encourage impulse purchases.

Recipe-Based Bundling / Targeted Promotions & Discounts: Bread and tomatoes are common ingredients for sandwiches, bruschetta, or toast. Promotions like "Buy Bread & Get Tomatoes 10% Off" could increase sales. Stores could also offer discounts on tomatoes to customers purchasing bread or butter.

Better Customer Understanding: Recognizing hidden consumption patterns helps businesses tailor recommendations and stock products efficiently.


 4. Discuss real-world applications of association rule mining beyond market basket analysis.
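- Health care diagnosis (e.g., finding symptoms or conditions that tend to occur together)
- Fraud detection (e.g., flagging combinations of transaction attributes that often accompany fraud)
- Recommender systems (Netflix, Amazon)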