# CSC440-Data Mining Homework 3 (Written by Haoshu Qin)

Due: 02/26/2023

Description:
Market Basket Analysis (Chapter 6) is commonly used in "recommender" systems.
The basic idea is to discover interesting rules of the form {If someone likes these} -> {then they may also like these}. Download and get to know the Anonymous Microsoft Web Data Data Set:
https://archive.ics.uci.edu/ml/datasets/Anonymous+Microsoft+Web+Data

All students must perform MBA using the Apriori algorithm.
Students enrolled in 440 (grad students) must also perform the analysis using the FP-Growth algorithm and compare the results of the two algorithms.

You may use "library" functions. The Kaggle tutorial is a good resource, but you must apply it to the dataset at hand (https://www.kaggle.com/code/rockystats/apriori-algorithm-or-market-basket-analysis).
Submit a pdf "paper"/"report" describing the problem, your approach to pre-processing, the tools you used, the problems you encountered and your results. Briefly discuss why your results are "interesting" and not "trivial". Discuss your choice of min_support and confidence. 
(The target size of the paper is 5 pages, but your mileage may vary.)
Only include essential code in the body of your paper. Include the full code and the sample runs you used as an appendix to the paper.

Proper formatting of the paper is essential and counts toward the grade. Get familiar with the ACM (Association for Computing Machinery) formatting standards.
Please understand: results are of course important, but proper presentation is also important.
For extra credit (30%), implement the algorithms yourself without libraries.

## Q1: What is the Apriori algorithm?

The Apriori algorithm is a classic algorithm for frequent itemset mining and association rule learning over transactional databases. It is an unsupervised method that discovers relationships among the items in a large dataset by identifying combinations of items that frequently occur together. It works "bottom up": it first finds every individual item whose support is at least a specified threshold, called the minimum support, then combines those items into larger candidate itemsets, and repeats the process until no further frequent itemsets can be found.

Two parameters, support and confidence, determine which itemsets count as frequent and which association rules count as strong. The support of an itemset is the proportion of transactions in the dataset that contain it. The confidence of an association rule is the proportion of the transactions containing the antecedent (left-hand side) of the rule that also contain the consequent (right-hand side). The Apriori algorithm is widely used in market basket analysis, recommendation systems, and many other data mining tasks.
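The two definitions can be checked by hand on a few toy transactions. The sketch below is only an illustration: the page names and the helper names `support` and `confidence` are made up here and are not part of the assignment data.

```python
# Toy transactions; the page names are invented for illustration.
transactions = [
    {"home", "search", "downloads"},
    {"home", "search"},
    {"home", "support"},
    {"search", "downloads"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# {home, search} appears in 2 of the 4 transactions.
print(support({"home", "search"}, transactions))
# 2 of the 3 transactions containing "home" also contain "search".
print(confidence({"home"}, {"search"}, transactions))
```

With a minimum support of 0.5, {home, search} would just qualify as frequent, and the rule {home} -> {search} would pass any confidence threshold up to 2/3.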

## Q2: How to implement the algorithm in Python?

In [3]:
import itertools
from collections import defaultdict

def apriori(transactions, min_support=0.5, min_confidence=0.5):
    n = len(transactions)
    # Count every single item, stored as a 1-item frozenset so it can be a dict key
    item_frequency = defaultdict(int)
    for transaction in transactions:
        for item in transaction:
            item_frequency[frozenset([item])] += 1
    # Filter items that meet the minimum support
    frequent_items = {item for item, count in item_frequency.items() if count / n >= min_support}
    frequent_item_sets = []
    k = 1
    while frequent_items:
        frequent_item_sets.append(frequent_items)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in frequent_items for b in frequent_items if len(a | b) == k}
        # Apriori Property: Any subset of a frequent itemset must be frequent,
        # so drop candidates that have an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent_items for s in itertools.combinations(c, k - 1))}
        # Count how many transactions contain each candidate
        for transaction in transactions:
            for candidate in candidates:
                if candidate.issubset(transaction):
                    item_frequency[candidate] += 1
        frequent_items = {c for c in candidates if item_frequency[c] / n >= min_support}
    # Generate the association rules (single-item consequents only)
    association_rules = []
    for frequent_item_set in frequent_item_sets:
        for item in frequent_item_set:
            if len(item) > 1:
                for sub_item in itertools.combinations(item, len(item) - 1):
                    sub_item = frozenset(sub_item)
                    confidence = item_frequency[item] / item_frequency[sub_item]
                    if confidence >= min_confidence:
                        association_rules.append((sub_item, item - sub_item, confidence))
    return frequent_item_sets, association_rules


In this implementation, the transactions argument is a list of sets, where each set represents a transaction and its items. The min_support argument is the minimum support threshold, which determines the minimum frequency of an itemset to be considered frequent. The min_confidence argument is the minimum confidence threshold, which determines the minimum confidence of an association rule.

The function returns the frequent item sets, grouped by size, and a list of association rules, each stored as an (antecedent, consequent, confidence) tuple.
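The candidate-generation step that Apriori depends on can be looked at in isolation. The following is a minimal sketch, assuming itemsets are stored as frozensets; the helper name `generate_candidates` and the toy itemsets are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Build k-item candidates from the frequent (k-1)-itemsets."""
    frequent_prev = set(frequent_prev)
    # Join: union pairs of frequent (k-1)-itemsets that together hold exactly k items.
    joined = {a | b for a in frequent_prev for b in frequent_prev if len(a | b) == k}
    # Prune: a candidate survives only if all of its (k-1)-subsets are frequent.
    return {c for c in joined
            if all(frozenset(s) in frequent_prev for s in combinations(c, k - 1))}

# Toy frequent 2-itemsets: {a,b}, {a,c}, {b,c}, {b,d}.
frequent_2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
print(generate_candidates(frequent_2, 3))
```

Here the join step proposes {a,b,c}, {a,b,d}, and {b,c,d}, but only {a,b,c} survives pruning, because {a,d} and {c,d} are not frequent. Pruning before counting is what keeps the number of support scans over the transactions manageable.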

## Q3: How to implement the algorithm without any libraries in Python?

In [4]:
def apriori(transactions, min_support=0.5, min_confidence=0.5):
    n = len(transactions)
    # Count single items, keyed as 1-item frozensets so every key has the same type
    item_count = {}
    for transaction in transactions:
        for item in set(transaction):
            key = frozenset([item])
            item_count[key] = item_count.get(key, 0) + 1
    frequent_items = set([item_set for item_set, count in item_count.items() if count / n >= min_support])
    frequent_item_sets = []
    k = 1
    while frequent_items:
        frequent_item_sets.append(frequent_items)
        k += 1
        # Build k-item candidates by joining pairs of frequent (k-1)-itemsets,
        # keeping a candidate only if all of its (k-1)-subsets are frequent
        candidate_item_sets = set()
        for a in frequent_items:
            for b in frequent_items:
                union = a | b
                if len(union) == k and all(union - frozenset([x]) in frequent_items for x in union):
                    candidate_item_sets.add(union)
        # Count how many transactions contain each candidate
        for transaction in transactions:
            for candidate_item_set in candidate_item_sets:
                if candidate_item_set.issubset(transaction):
                    item_count[candidate_item_set] = item_count.get(candidate_item_set, 0) + 1
        # Keep only the candidates of this level that meet the minimum support
        frequent_items = set([c for c in candidate_item_sets if item_count.get(c, 0) / n >= min_support])
    association_rules = []
    for frequent_item_set in frequent_item_sets:
        for item_set in frequent_item_set:
            if len(item_set) > 1:
                for item in item_set:
                    sub_item_set = item_set - frozenset([item])
                    confidence = item_count[item_set] / item_count[sub_item_set]
                    if confidence >= min_confidence:
                        association_rules.append((sub_item_set, item_set - sub_item_set, confidence))
    return frequent_item_sets, association_rules


This implementation follows the same logic as the previous one, but it needs no imports: it relies only on built-in Python data structures (dict, set, and frozenset). The function takes a list of transactions, where each transaction is a list of items, and returns a list of frequent item sets and a list of association rules. The min_support and min_confidence arguments set the minimum support and the minimum confidence of the output, respectively.
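One way to sanity-check an Apriori implementation is to compare its frequent item sets against an exhaustive enumeration on a tiny input. The sketch below is such a cross-check under stated assumptions: the name `brute_force_frequent` and the toy data are hypothetical, and the approach is exponential in the number of items, so it is only suitable for toy data, not for the full dataset.

```python
from itertools import combinations

def brute_force_frequent(transactions, min_support):
    """Enumerate every possible itemset and keep those meeting min_support."""
    transactions = [set(t) for t in transactions]
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            # Count the transactions that contain every item of this combination.
            count = sum(set(combo) <= t for t in transactions)
            if count / n >= min_support:
                frequent[frozenset(combo)] = count / n
    return frequent

toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
result = brute_force_frequent(toy, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

On this toy input every single item and every pair is frequent, while {a, b, c} appears in only one of four transactions and is excluded.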

In [1]:
import pandas as pd

In [6]:
data1 = pd.read_csv('/Users/haydee_mac/Desktop/CSC440-Data Mining/DM HW3/anonymous-msweb.data')



In [7]:
data1

Unnamed: 0,Unnamed: 1,I,4,www.microsoft.com,created by getlog.pl
T,1,VRoot,0,0,VRoot
N,0,0,,,
N,1,1,,,
T,2,Hide1,0,0,Hide
N,0,0,,,
...,...,...,...,...,...
V,1035,1,,,
V,1001,1,,,
V,1018,1,,,
C,42711,42711,,,
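The rows above mix several record types, so the frame is not yet a list of transactions. The sketch below shows one possible way to build them, assuming the record layout described on the UCI page: `A` lines describe a page, `C` lines start a new user (case), and `V` lines list a page that user visited. The function name `parse_msweb`, the page title, and the second case id in the sample are hypothetical.

```python
import csv

def parse_msweb(lines):
    """Return (attributes, transactions) parsed from msweb-style lines."""
    attributes = {}      # page id -> page title
    transactions = []    # one set of page ids per case (user)
    current = None
    for row in csv.reader(lines):
        if not row:
            continue
        if row[0] == "A":                              # assumed layout: A,id,ignored,title,url
            attributes[int(row[1])] = row[3]
        elif row[0] == "C":                            # a case line starts a new user
            current = set()
            transactions.append(current)
        elif row[0] == "V" and current is not None:    # assumed layout: V,id,ignored
            current.add(int(row[1]))
    # Drop cases that have no visited pages.
    return attributes, [t for t in transactions if t]

# Illustrative lines in the assumed layout; the title and the id 42712 are made up.
sample = [
    'A,1001,1,"Example page","/example"',
    'C,"42711",42711',
    'V,1035,1',
    'V,1001,1',
    'C,"42712",42712',
    'V,1018,1',
]
attributes, transactions = parse_msweb(sample)
print(transactions)
```

Because the lines have different numbers of fields, reading the file line by line (for example by passing an open file object to `parse_msweb`) may be easier than a single `pd.read_csv` call. The resulting list of sets has the shape that the `apriori` functions above expect.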

