# **CS634-Data Mining Midterm Project**

**Project Part 1**


**Data Preparation and Pattern Generation**
In this part, the items for each retailer are defined and deterministic transactions are generated from fixed patterns. The patterns are not random: each retailer has its own pattern, which maps a transaction's zero-based index to a number of items, and the transaction then contains that many items taken from the start of the retailer's list. Here’s a brief explanation of the patterns used:

  **•	Amazon:** The pattern for Amazon is simply the index of the transaction modulo 10 plus 1. This means that the first transaction will contain the first item, the second transaction will contain the first two items, and so on, until the tenth transaction which will contain all ten items. After that, the pattern repeats.

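  **•	Wayfair:** The pattern for Wayfair is twice the index of the transaction plus 1, as long as that value does not exceed the total number of items; otherwise the transaction falls back to a single item. This means that the first transaction will contain the first item, the second transaction will contain the first three items, and so on up to nine items in the fifth transaction, after which every transaction contains only the first item.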

  **•	Walmart:** The pattern for Walmart is twice the index of the transaction plus 2. This means that the first transaction will contain the first two items, the second transaction will contain the first four items, and so on. Because each retailer has ten items, the fifth and all later transactions contain all ten items.

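  **•	Best Buy:** The pattern for Best Buy is a bit more complex. It takes the index of the transaction modulo the total number of items, plus 1, and finds the smallest prime number at or above that value that does not exceed the total number of items; that prime is the number of items in the transaction. If no such prime exists (starting values 8, 9, and 10), the transaction contains only the first item. With ten items this gives the repeating sizes 2, 2, 3, 5, 5, 7, 7, 1, 1, 1.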
  
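  **• Nike:** The pattern for Nike involves generating the Fibonacci sequence. The index of the transaction plus 1 is used to find the corresponding number in the Fibonacci sequence (1, 1, 2, 3, 5, 8, ...), and this number determines the number of items in the transaction. Once the Fibonacci number exceeds the total number of items (from the seventh transaction on), the transaction contains only the first item.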

The transactions are then saved in CSV files for further processing. This stage is crucial as it sets up the data that will be used for the rest of the project. The deterministic nature of these transactions, as specified in the project details, allows for consistent and meaningful analysis in the subsequent stages.
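As a quick check of the patterns described above, the following minimal sketch computes the number of items in the first ten transactions for each retailer. It mirrors the pattern rules from this notebook, but `basket_size` and `NUM_ITEMS` are names introduced here for illustration only, and the helper functions are simplified rewrites rather than the notebook's own definitions.

```python
NUM_ITEMS = 10  # assumption: every retailer lists exactly ten items


def is_prime(n):
    """Return True if n is prime (simple trial division)."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


def fibonacci(n):
    """Return the n-th Fibonacci number, with F(1) = F(2) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def basket_size(retailer, i):
    """Number of items in the transaction with zero-based index i."""
    if retailer == "amazon":
        size = i % 10 + 1
    elif retailer == "wayfair":
        size = 2 * i + 1 if 2 * i + 1 <= NUM_ITEMS else 1
    elif retailer == "walmart":
        size = 2 * (i + 1)
    elif retailer == "bestbuy":
        start = i % NUM_ITEMS + 1
        size = next((x for x in range(start, NUM_ITEMS + 1) if is_prime(x)), 1)
    elif retailer == "nike":
        size = fibonacci(i + 1) if fibonacci(i + 1) <= NUM_ITEMS else 1
    else:
        raise ValueError(f"unknown retailer: {retailer}")
    # Slicing a ten-item list never yields more than ten items.
    return min(size, NUM_ITEMS)


retailers = ("amazon", "wayfair", "walmart", "bestbuy", "nike")
sizes = {r: [basket_size(r, i) for i in range(10)] for r in retailers}
for retailer, counts in sizes.items():
    print(f"{retailer:8s} {counts}")
```

Printing the sizes side by side makes it easy to see which items will end up frequent: an item at position k in a retailer's list appears in every transaction whose size is at least k.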




**Install pyfpgrowth using the code below**


In [1]:
!pip install pyfpgrowth



In [2]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
import pyfpgrowth
import itertools
import time
from itertools import combinations

In [6]:
# List of items for each retailer
amazon_items = ['Kindle E-reader', 'Echo Dot (4th Gen)', 'Fire TV Stick 4K', 'Microfiber Sheet Set', 'Portable Wireless Bluetooth Speaker', 'Stainless Steel Electric Kettle', 'High-Speed HDMI Cable', 'Silicone Baking Mat Sheet', 'AAA Performance Alkaline Batteries', 'Quick-Dry Bath Towels']
wayfair_items = ['Wayfair Basics 1800 Series Sheet Set', 'Amherst Upholstered Platform Bed', 'Landen Hand-Tufted Silver/Ivory Area Rug', 'Kearney Sectional', 'Gold Flamingo Anna Maria Desk', 'Three Posts Teen Northampt Reversible Quilt Set', 'Mercury Row Wiersma End Table', 'Andover Mills Liesel Comforter Set', 'Greyleigh Dorset Ivory/Fuchsia Indoor Area Rug', 'Zipcode Design Folkston Desk']
walmart_items = ['Mainstays Microfiber Sheet Set', 'Great Value LED Light Bulb', 'Ozark Trail Camping Chair', 'Equate Hand Sanitizer', 'Parents Choice Diapers', 'Onn. Full-Motion Articulating TV Wall Mount', 'Athletic Works Womens Active Core Legging', 'George Mens Regular Fit Jean', 'Mainstays 71" 5-Shelf Bookcase', 'Hyper Tough 20V Max Cordless Drill']
bestbuy_items = ['Insignia™ - 50" Class F30 Series LED 4K UHD Smart Fire TV', 'Apple - AirPods Pro', 'Samsung - Galaxy S21 5G', 'HP - ENVY x360 2-in-1 15.6" Touch-Screen Laptop', 'Sony - PlayStation 5 Console', 'Ring - Video Doorbell 3', 'JBL - Flip 5 Portable Bluetooth Speaker', 'Canon - EOS Rebel T7 DSLR Video Camera', 'WD - Easystore 5TB External USB 3.0 Portable Hard Drive', 'Keurig - K-Classic K50 Single Serve K-Cup Pod Coffee Maker']
nike_items = ['Nike Air Force 1 07', 'Nike Sportswear Club Fleece', 'Nike Dri-FIT Academy', 'Nike Air Zoom Pegasus 38', 'Nike Sportswear Essential', 'Nike Pro', 'Nike Air Max 270', 'Nike Mercurial Superfly 8 Elite FG', 'Nike Yoga Dri-FIT', 'Nike SB Zoom Stefan Janoski RM']


In [7]:
# Dictionary of items for each retailer
retailers_items = {
    'amazon': amazon_items,
    'wayfair': wayfair_items,
    'walmart': walmart_items,
    'bestbuy': bestbuy_items,
    'nike': nike_items
}
def is_prime(n):
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

def fibonacci(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        a, b = 0, 1
        for _ in range(n - 1):
            a, b = b, a + b
        return b

# Patterns for each retailer
patterns = {
    'amazon': lambda i, _: i % 10 + 1,
    'wayfair': lambda i, _: 2 * i + 1 if 2 * i + 1 <= len(retailers_items['wayfair']) else 1,
    'walmart': lambda i, _: 2 * (i + 1),
    'bestbuy': lambda i, num_items: next((x for x in range(i % num_items + 1, num_items + 1) if is_prime(x)), 1),
    'nike': lambda i, _: fibonacci(i + 1) if fibonacci(i + 1) <= len(retailers_items['nike']) else 1
}

def generate_deterministic_transactions(retailer, num_transactions):
    if retailer not in retailers_items or retailer not in patterns:
        print(f"No items or pattern found for retailer: {retailer}")
        return

    transactions = []
    for i in range(num_transactions):
        num_items = patterns[retailer](i, len(retailers_items[retailer]))  # Apply the retailer's pattern to determine the number of items
        transaction = retailers_items[retailer][:num_items]  # Select the first 'num_items' from the list
        transactions.append(", ".join(transaction))

    return pd.DataFrame({'Transaction ID': [f'Trans{i+1}' for i in range(num_transactions)],
                         f'{retailer.capitalize()} Transaction': transactions})

# Generate sample transactions for each retailer
for retailer in retailers_items.keys():
    transactions = generate_deterministic_transactions(retailer, 20)
    transactions.to_csv(f'{retailer}_transactions.csv', index=False)
    print(f"Generated transactions for {retailer} and saved to {retailer}_transactions.csv")

Generated transactions for amazon and saved to amazon_transactions.csv
Generated transactions for wayfair and saved to wayfair_transactions.csv
Generated transactions for walmart and saved to walmart_transactions.csv
Generated transactions for bestbuy and saved to bestbuy_transactions.csv
Generated transactions for nike and saved to nike_transactions.csv


**Project Part 2**

**Loading Transactions**

In this stage, the transactions from the CSV files are loaded into a DataFrame. This is important because the subsequent analysis requires the data to be in a specific format (a DataFrame).
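The CSV files written in Part 1 have two columns: a transaction ID and a single string in which the items are joined by `", "`. The sketch below shows how such a file can be turned back into a list of item lists. It uses the standard-library `csv` module instead of pandas so that it runs on its own; the two sample rows and the `parse_transactions` helper are made up for illustration.

```python
import csv
import io

# Made-up sample in the same layout as the generated CSV files.
csv_text = (
    "Transaction ID,Bestbuy Transaction\n"
    'Trans1,"Apple - AirPods Pro, Sony - PlayStation 5 Console"\n'
    "Trans2,Apple - AirPods Pro\n"
)


def parse_transactions(handle):
    """Read (ID, items-string) rows and return a list of item lists."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    # The second column holds the items, separated by ", ".
    return [row[1].split(", ") for row in reader]


transactions = parse_transactions(io.StringIO(csv_text))
print(transactions)
```

Splitting on `", "` only works because none of the item names in this project contain that separator.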


**Frequent Itemset Generation**

This is the core of the project. In this stage, three different methods (brute force, Apriori, and FP-Growth) are implemented to generate frequent itemsets from the transactions. Frequent itemsets are sets of items that appear together in at least a specified fraction of the transactions (the minimum support level). This stage is crucial for understanding the relationships between different items.
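To make the definition of support concrete, here is a minimal brute-force sketch on a made-up four-transaction dataset. The item names and the `support` and `frequent_itemsets` helpers are illustrative assumptions, not the notebook's own functions.

```python
from itertools import combinations

# Toy data: each transaction is a set of items.
transactions = [
    {"TV", "AirPods", "Phone"},
    {"TV", "AirPods"},
    {"TV"},
    {"TV", "AirPods", "Phone"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Enumerate all itemsets and keep those with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for r in range(1, len(items) + 1):
        for candidate in combinations(items, r):
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


frequent = frequent_itemsets(transactions, 0.5)
for itemset, s in frequent.items():
    print(itemset, s)
```

Enumerating every combination grows exponentially with the number of items, which is why Apriori and FP-Growth prune or compress the search instead.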

**Association Rule Mining**

After generating the frequent itemsets, they are used to generate association rules. A rule states that when certain items are bought together, another item is likely to be bought as well; the confidence of the rule measures how often that holds. Rules that meet the minimum confidence level are useful for making recommendations to customers.
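The confidence of a rule A → B is the support of A and B together divided by the support of A, and lift divides that confidence by the support of B. The sketch below works this out on a made-up dataset; the data and helper names are assumptions for illustration.

```python
# Toy data: each transaction is a set of items.
transactions = [
    {"TV", "AirPods", "Phone"},
    {"TV", "AirPods"},
    {"TV"},
    {"TV", "AirPods", "Phone"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and B) / support(A); assumes A occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """confidence(A -> B) / support(B); values above 1 suggest a positive association."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


conf_tv_airpods = confidence({"TV"}, {"AirPods"}, transactions)
conf_phone_airpods = confidence({"Phone"}, {"AirPods"}, transactions)
lift_phone_airpods = lift({"Phone"}, {"AirPods"}, transactions)
print(conf_tv_airpods, conf_phone_airpods, lift_phone_airpods)
```

Here every transaction with a Phone also has AirPods, so Phone → AirPods has confidence 1.0 even though its support is only 0.5.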

**Performance Comparison**
In the final stage, the execution times of the three methods used for generating frequent itemsets are compared to see which method is the most efficient for the data.
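With only 20 transactions each method finishes in well under a millisecond, so a single timing can be dominated by noise. One possible way to get steadier numbers, sketched below, is to repeat the call and keep the best time using `time.perf_counter`. The `best_time` helper and the `count_pairs` stand-in workload are hypothetical and not part of this notebook.

```python
import time


def best_time(func, *args, repeats=5):
    """Run func(*args) several times; return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


def count_pairs(n):
    """Stand-in workload: count the unordered pairs among n items."""
    return sum(1 for i in range(n) for j in range(i + 1, n))


elapsed, pairs = best_time(count_pairs, 200)
print(f"best of 5 runs: {elapsed:.6f} s, pairs counted: {pairs}")
```

The same wrapper could be applied to each of the three itemset generators in place of `count_pairs`.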

#### **Prompt the user for the minimum support and confidence levels**

#### **Prompt the user to choose a Database**

In [8]:
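# Prompt the user for the minimum support level
while True:
    try:
        min_support = float(input("Enter the minimum support level (between 0 and 1): "))
    except ValueError:
        # Non-numeric input: ask again instead of crashing
        print("Invalid input. Please enter a number between 0 and 1.")
        continue
    if 0 <= min_support <= 1:
        break
    print("Invalid input. Please enter a number between 0 and 1.")

# Prompt the user for the minimum confidence level
while True:
    try:
        min_confidence = float(input("Enter the minimum confidence level (between 0 and 1): "))
    except ValueError:
        # Non-numeric input: ask again instead of crashing
        print("Invalid input. Please enter a number between 0 and 1.")
        continue
    if 0 <= min_confidence <= 1:
        break
    print("Invalid input. Please enter a number between 0 and 1.")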

# Prompt the user to choose a database
database_choice = input("Which database would you like to analyze? Enter '1' for Amazon, '2' for Best Buy, '3' for WayFair, '4' for WalMart, or '5' for Nike: ")


Enter the minimum support level (between 0 and 1): 0.5
Enter the minimum confidence level (between 0 and 1): 0.6
Which database would you like to analyze? Enter '1' for Amazon, '2' for Best Buy, '3' for WayFair, '4' for WalMart, or '5' for Nike: 2


**Define Brute-Force method**

In [9]:
# Load the transactions from the CSV files
def load_transactions_from_csv(database_choice):
    if database_choice == '1':
        transactions = pd.read_csv('amazon_transactions.csv')
    elif database_choice == '2':
        transactions = pd.read_csv('bestbuy_transactions.csv')
    elif database_choice == '3':
        transactions = pd.read_csv('wayfair_transactions.csv')
    elif database_choice == '4':
        transactions = pd.read_csv('walmart_transactions.csv')
    elif database_choice == '5':
        transactions = pd.read_csv('nike_transactions.csv')
    else:
        print("Invalid choice. Please enter a number between 1 and 5.")
        return None
    return transactions

In [10]:
# Function to generate frequent itemsets using brute force
def generate_frequent_itemsets_brute_force(transactions, min_support):
    # Flatten the list of transactions and find unique items
    items = list(set(x for sublist in transactions for x in sublist))

    # Define a list to hold the frequent itemsets
    frequent_itemsets = []

    # Generate all possible itemsets up to the length of items
    for r in range(1, len(items) + 1):
        # Use combinations to generate itemsets of length r
        for itemset in combinations(items, r):
            # Count the support for the itemset
            support = sum(1 for transaction in transactions if set(itemset).issubset(transaction))
            # If the support is greater than or equal to the min_support, add it to the list of frequent itemsets
            if support / len(transactions) >= min_support:
                frequent_itemsets.append(itemset)

    return frequent_itemsets

**Step 5: Implement the brute force method and the Apriori and FP-Growth algorithms with the user-specified support level to generate frequent itemsets and association rules**

In [16]:
# Function to calculate the support of an itemset
def calculate_support(itemset, transactions):
    count = sum(1 for transaction in transactions if set(itemset).issubset(transaction))
    return count / len(transactions)

# Load the chosen dataset
df_transactions = load_transactions_from_csv(database_choice)

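# Convert transactions into a list of item lists.
# The second column holds the items for whichever retailer was chosen,
# so select it by position instead of hard-coding 'Bestbuy Transaction'.
transactions = [transaction.split(', ') for transaction in df_transactions.iloc[:, 1]]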

# Generate frequent itemsets using the brute force method
start_time = time.time()  # Record the start time
brute_force_itemsets = generate_frequent_itemsets_brute_force(transactions, min_support)
brute_force_time = time.time() - start_time  # Elapsed execution time in seconds

if not brute_force_itemsets:
    print("No frequent itemsets met the minimum support level using the brute force method.")
else:
    # Convert frequent itemsets to itemsets and support values
    itemsets = [tuple(itemset) for itemset in brute_force_itemsets]
    supports = [calculate_support(itemset, transactions) for itemset in brute_force_itemsets]

    # Create a new DataFrame with itemsets and support columns
    rules_df = pd.DataFrame({'itemsets': itemsets, 'support': supports})

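    # Generate association rules from the new DataFrame if it's not empty.
    # Note: with support_only=True, mlxtend computes only the rule support, so the
    # other metric columns are NaN and min_threshold is compared against support
    # rather than confidence (see the output below).
    if not rules_df.empty:
        brute_force_rules = association_rules(rules_df, metric="confidence", min_threshold=min_confidence, support_only=True)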

        # Print the association rules
        if not brute_force_rules.empty:
            print("\nBrute Force Association Rules:")
            print(brute_force_rules)
        else:
            print("No association rules met the minimum confidence level using the brute force method.")
    else:
        print("No frequent itemsets found. Unable to generate association rules using the brute force method.")


Brute Force Association Rules:
                                         antecedents  \
0                              (Apple - AirPods Pro)   
1  (Insignia™ - 50" Class F30 Series LED 4K UHD S...   

                                         consequents  antecedent support  \
0  (Insignia™ - 50" Class F30 Series LED 4K UHD S...                 NaN   
1                              (Apple - AirPods Pro)                 NaN   

   consequent support  support  confidence  lift  leverage  conviction  \
0                 NaN      0.7         NaN   NaN       NaN         NaN   
1                 NaN      0.7         NaN   NaN       NaN         NaN   

   zhangs_metric  
0            NaN  
1            NaN  


In [27]:
from pyfpgrowth import find_frequent_patterns, generate_association_rules
from mlxtend.frequent_patterns import apriori

##For Apriori
# Convert transactions into a list of item lists
# (second column = items of the chosen retailer)
transactions_list = [transaction.split(', ') for transaction in df_transactions.iloc[:, 1]]

# Generate the list of unique items
items = sorted(set(item for transaction in transactions_list for item in transaction))

# Convert transactions into a one-hot encoded DataFrame
df = pd.DataFrame([[item in transaction for item in items] for transaction in transactions_list], columns=items)

# Generate frequent itemsets using the Apriori method
start_time = time.time()
apriori_itemsets = apriori(df, min_support=min_support, use_colnames=True)
apriori_time = time.time() - start_time

##For FP_Growth
# Convert transactions into a list of item lists
# (second column = items of the chosen retailer)
transactions_list = [transaction.split(', ') for transaction in df_transactions.iloc[:, 1]]

# Convert min_support from a fraction to a count
min_support_count = int(min_support * len(transactions_list))

# Find frequent patterns using FP-Growth
start_time = time.time()
patterns = find_frequent_patterns(transactions_list, min_support_count)
fpgrowth_time = time.time() - start_time

if not apriori_itemsets.empty:
    # Generate association rules from the frequent itemsets
    apriori_rules = association_rules(apriori_itemsets, metric="confidence", min_threshold=min_confidence)
    if not apriori_rules.empty:
        print("\nApriori Association Rules:")
        print(apriori_rules)
    else:
        print("No association rules met the minimum confidence level using the Apriori method.")
else:
    print("No frequent itemsets met the minimum support level using the Apriori method.")

# For FP-Growth
if patterns:
    fpgrowth_rules = generate_association_rules(patterns, min_confidence)
    if fpgrowth_rules:
        print("\nFP-Growth Association Rules:")
        print(fpgrowth_rules)  # Print the fpgrowth_rules to inspect its structure
    else:
        print("No association rules met the minimum confidence level using the FP-Growth method.")
else:
    print("No frequent patterns met the minimum support level using the FP-Growth method.")


Apriori Association Rules:
                                         antecedents  \
0                              (Apple - AirPods Pro)   
1  (Insignia™ - 50" Class F30 Series LED 4K UHD S...   
2                          (Samsung - Galaxy S21 5G)   
3                              (Apple - AirPods Pro)   
4                          (Samsung - Galaxy S21 5G)   
5     (Samsung - Galaxy S21 5G, Apple - AirPods Pro)   
6  (Samsung - Galaxy S21 5G, Insignia™ - 50" Clas...   
7  (Apple - AirPods Pro, Insignia™ - 50" Class F3...   
8                          (Samsung - Galaxy S21 5G)   
9                              (Apple - AirPods Pro)   

                                         consequents  antecedent support  \
0  (Insignia™ - 50" Class F30 Series LED 4K UHD S...                 0.7   
1                              (Apple - AirPods Pro)                 1.0   
2                              (Apple - AirPods Pro)                 0.5   
3                          (Samsung - Galaxy S21 5G

**Finally: Performance Comparison**

In [28]:
# Compare the results from the three methods
def compare_results(brute_force_rules, apriori_rules, fpgrowth_rules):
    print("Brute Force Rules:")
    if not brute_force_rules.empty:
        print(brute_force_rules)
    else:
        print("No association rules found using the brute force method.")

    print("\nApriori Rules:")
    if not apriori_rules.empty:
        print(apriori_rules)
    else:
        print("No association rules found using the Apriori method.")

    print("\nFP-Growth Rules:")
    if fpgrowth_rules:
        print(fpgrowth_rules)
    else:
        print("No association rules found using the FP-Growth method.")

# Print the time taken by each method
print("\nTime taken by brute force method: ", brute_force_time)
print("Time taken by Apriori: ", apriori_time)
print("Time taken by FP-Growth: ", fpgrowth_time)


Time taken by brute force method:  0.002124309539794922
Time taken by Apriori:  0.010610103607177734
Time taken by FP-Growth:  0.0003476142883300781


In [5]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)