# Association Rules Analysis

**Association Rules Analysis** is a data mining technique that finds interesting relationships or patterns in large datasets. It is commonly used in **market basket analysis** to discover product associations, but it also applies to many fields, such as recommendation systems, fraud detection, and healthcare.

---

## Key Concepts in Association Rules Analysis

### 1. **Frequent Itemsets**

A **frequent itemset** is a set of items that appear together in many transactions. To find frequent itemsets, a user specifies a **minimum support threshold**: the smallest fraction (or count) of transactions in which an itemset must appear to be considered "frequent."

For example, in a grocery store, if 10% of transactions contain both bread and butter, and the minimum support threshold is set to 5%, then $\{ \text{bread, butter} \}$ would be considered a **frequent itemset**.
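
The threshold check can be sketched in a few lines of plain Python. The transaction list and the helper name `support` below are made up for illustration; the data is sized so that 2 of 20 baskets (10%) contain both bread and butter, matching the example above.

```python
# Hypothetical data: 20 baskets, 2 of which contain both bread and butter.
transactions = (
    [{"bread", "butter"}] * 2
    + [{"bread"}] * 8
    + [{"milk"}] * 10
)

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

min_support = 0.05
s = support({"bread", "butter"}, transactions)
is_frequent = s >= min_support
print(f"support = {s:.2f}, frequent = {is_frequent}")
```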

Frequent itemsets are essential for generating **association rules**, which help identify relationships between items.

---

## 2. **Apriori Algorithm**

The **Apriori Algorithm** is a classic algorithm used to find frequent itemsets and generate association rules. It uses a **bottom-up** approach, where frequent itemsets are generated level by level (starting with single items and then extending to larger itemsets).

### Steps in the Apriori Algorithm:
1. **Find frequent itemsets**: 
   - Start by identifying all frequent 1-itemsets (itemsets containing a single item that meets the minimum support threshold).
   - Extend to 2-itemsets, 3-itemsets, and so on by combining smaller frequent itemsets.
2. **Generate association rules**: 
   - Once frequent itemsets are identified, generate rules of the form $X \Rightarrow Y$, where $X$ and $Y$ are itemsets.
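
As a rough sketch of the rule-generation step, the snippet below enumerates every way to split one frequent itemset into a non-empty antecedent $X$ and a non-empty consequent $Y$. It assumes the common convention that $X$ and $Y$ are disjoint and together make up the frequent itemset; the function name `candidate_rules` is hypothetical. In practice, each candidate is then kept or discarded based on a metric such as confidence.

```python
from itertools import combinations

def candidate_rules(itemset):
    """Yield (antecedent, consequent) pairs that split `itemset` into two non-empty parts."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            yield frozenset(antecedent), frozenset(consequent)

rules = list(candidate_rules({"bread", "butter", "milk"}))
for x, y in rules:
    print(sorted(x), "=>", sorted(y))
```

A frequent itemset with $k$ items produces $2^k - 2$ candidate rules this way, so a 3-itemset gives six.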

### Key Principle: **Apriori Property**
The Apriori Algorithm relies on the **Apriori Property**, which states that:
- **If an itemset is frequent, all its subsets must also be frequent**.
  
This property reduces the search space, as any itemset that contains an infrequent subset is pruned from consideration.
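
A minimal sketch of this pruning check follows. The helper name `has_infrequent_subset` and the set of frequent 2-itemsets are assumptions chosen for illustration, not part of any library.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if any (k-1)-subset of the k-item `candidate` is not in `frequent_prev`."""
    k = len(candidate)
    return any(
        frozenset(sub) not in frequent_prev
        for sub in combinations(candidate, k - 1)
    )

# Hypothetical frequent 2-itemsets from an earlier pass.
frequent_2 = {
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
}

candidate = frozenset({"bread", "butter", "milk"})
# {butter, milk} is missing from frequent_2, so the candidate can be pruned
# without ever counting its support in the data.
pruned = has_infrequent_subset(candidate, frequent_2)
print(pruned)
```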

### Example:
1. Consider a dataset where we have the following transactions:
   - Transaction 1: $\{ \text{bread, milk} \}$
   - Transaction 2: $\{ \text{bread, butter, milk} \}$
   - Transaction 3: $\{ \text{butter, eggs} \}$
   - Transaction 4: $\{ \text{bread, butter} \}$
   - Transaction 5: $\{ \text{bread, eggs, milk} \}$
   
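   With a minimum support threshold of 2 transactions (40%), the Apriori Algorithm first finds the frequent 1-itemsets: bread (4), milk (3), butter (3), and eggs (2). Among the 2-itemsets, only $\{ \text{bread, milk} \}$ (3) and $\{ \text{bread, butter} \}$ (2) meet the threshold. The only candidate 3-itemset, $\{ \text{bread, butter, milk} \}$, is pruned because its subset $\{ \text{butter, milk} \}$ appears in just one transaction.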
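
The level-wise search on these five transactions can be sketched in plain Python. This is a simplified illustration under stated assumptions (a naive join step, support recounted by scanning every transaction), not an optimized implementation; the names `count`, `level`, and `frequent` are hypothetical.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "eggs"},
    {"bread", "butter"},
    {"bread", "eggs", "milk"},
]
min_count = 2  # 2 of 5 transactions = 40%

def count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items.
items = sorted(set().union(*transactions))
level = {frozenset([i]) for i in items if count(frozenset([i])) >= min_count}

frequent = {}
k = 1
while level:
    for itemset in level:
        frequent[itemset] = count(itemset)
    k += 1
    # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # Prune step (Apriori property), then count support of the survivors.
    level = {
        c for c in candidates
        if all(frozenset(s) in level for s in combinations(c, k - 1))
        and count(c) >= min_count
    }

for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Printing `frequent` lists each frequent itemset together with the number of transactions that contain it.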

### Advantages of Apriori:
- Simple to implement.
- Intuitive approach for finding frequent itemsets.

### Limitations of Apriori:
- Computationally expensive for large datasets due to the need to scan the dataset multiple times.
- The number of candidate itemsets can grow exponentially with the number of distinct items, which makes low support thresholds especially costly.
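
To give a rough sense of this growth, the snippet below counts how many pairs and how many non-empty itemsets are possible for a given number of distinct items. These figures are an upper bound on the search space, not the number of candidates Apriori actually examines, since pruning removes many of them. The function name `num_possible_itemsets` is made up for this illustration.

```python
from math import comb

def num_possible_itemsets(n_items):
    """Number of non-empty itemsets over `n_items` distinct items: 2^n - 1."""
    return 2 ** n_items - 1

for n in (4, 10, 20, 30):
    pairs = comb(n, 2)  # possible 2-itemsets
    total = num_possible_itemsets(n)
    print(f"{n:>2} items: {pairs:>4} possible pairs, {total:,} possible itemsets")
```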

## Metrics for Association Rule Evaluation

### 1. **Support**
Support measures how frequently an itemset appears in the dataset:
$$
\text{Support}(X) = \frac{\text{Number of transactions containing } X}{\text{Total number of transactions}}
$$
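
As a worked instance, take the five transactions from the Apriori example above. The itemset $\{ \text{bread, milk} \}$ appears in transactions 1, 2, and 5, so:
$$
\text{Support}(\{ \text{bread, milk} \}) = \frac{3}{5} = 0.6
$$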

### 2. **Confidence**
Confidence measures how often the rule $X \Rightarrow Y$ holds, given that $X$ has occurred:
$$
\text{Confidence}(X \Rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}
$$
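
As a worked instance on the five transactions from the Apriori example above: bread appears in four transactions, so $\text{Support}(\{ \text{bread} \}) = 4/5 = 0.8$, and bread and milk appear together in three, so $\text{Support}(\{ \text{bread, milk} \}) = 3/5 = 0.6$. Therefore:
$$
\text{Confidence}(\text{bread} \Rightarrow \text{milk}) = \frac{0.6}{0.8} = 0.75
$$
That is, 75% of the transactions that contain bread also contain milk.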

### 3. **Lift**
Lift measures how much more often $X$ and $Y$ co-occur than would be expected if they were independent:
$$
\text{Lift}(X \Rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X) \times \text{Support}(Y)}
$$
- If $\text{Lift} > 1$, $X$ and $Y$ are positively correlated.
- If $\text{Lift} = 1$, $X$ and $Y$ are independent.
- If $\text{Lift} < 1$, $X$ and $Y$ are negatively correlated.
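
The three formulas can be checked with a short plain-Python sketch on the five transactions from the Apriori example above. The helper names `support`, `confidence`, and `lift` are chosen for illustration and simply mirror the definitions in this section.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "eggs"},
    {"bread", "butter"},
    {"bread", "eggs", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """Support of X and Y together, relative to the support of X."""
    return support(x | y) / support(x)

def lift(x, y):
    """Observed co-occurrence relative to what independence would predict."""
    return support(x | y) / (support(x) * support(y))

x, y = {"bread"}, {"milk"}
print(f"support    = {support(x | y):.2f}")
print(f"confidence = {confidence(x, y):.2f}")
print(f"lift       = {lift(x, y):.2f}")
```

For $\text{bread} \Rightarrow \text{milk}$ the lift is $0.6 / (0.8 \times 0.6) = 1.25$, so the two items co-occur more often than independence would predict.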

---

### Summary

- **Frequent Itemsets** are sets of items that appear frequently together in transactions.
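- **Apriori Algorithm** uses candidate generation and pruning to find frequent itemsets, but can be slow for large datasets.
- **Support, confidence, and lift** measure how often a rule's items occur, how reliably the consequent follows the antecedent, and how far the co-occurrence departs from independence.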


```python
!pip install mlxtend
```



## Load and Preprocess the Dataset
For this tutorial, we'll use a sample dataset representing transactions (e.g., purchases in a store). Each transaction is represented as a list of items.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# Sample dataset (list of transactions)
dataset = [
    ['Milk', 'Bread', 'Eggs'],
    ['Bread', 'Eggs', 'Butter'],
    ['Milk', 'Eggs'],
    ['Milk', 'Bread', 'Eggs', 'Butter'],
    ['Milk', 'Bread'],
    ['Bread', 'Eggs'],
    ['Milk', 'Bread', 'Butter'],
]

# Convert the dataset into a one-hot encoded format
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
print(df)
```

```
   Bread  Butter   Eggs   Milk
0   True   False   True   True
1   True    True   True  False
2  False   False   True   True
3   True    True   True   True
4   True   False  False   True
5   True   False   True  False
6   True    True  False   True
```
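
To make the encoding step less opaque, the following plain-Python sketch builds the same kind of boolean table by hand: one column per distinct item, one row per transaction. It only illustrates the idea and is not a description of how `TransactionEncoder` works internally; the names `columns` and `rows` are arbitrary.

```python
dataset = [
    ['Milk', 'Bread', 'Eggs'],
    ['Bread', 'Eggs', 'Butter'],
    ['Milk', 'Eggs'],
    ['Milk', 'Bread', 'Eggs', 'Butter'],
    ['Milk', 'Bread'],
    ['Bread', 'Eggs'],
    ['Milk', 'Bread', 'Butter'],
]

# One column per distinct item, in sorted order.
columns = sorted({item for transaction in dataset for item in transaction})

# One boolean row per transaction: True where the item is present.
rows = [[item in transaction for item in columns] for transaction in dataset]

print(columns)
for row in rows:
    print(row)
```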


## Perform Frequent Itemset Mining using Apriori
Now, we'll use the Apriori algorithm to mine frequent itemsets from the one-hot encoded dataset based on a minimum support threshold.

```python
from mlxtend.frequent_patterns import apriori

# Define the minimum support threshold (e.g., 0.4 means an itemset must appear in at least 40% of transactions)
min_support = 0.4

# Perform frequent itemset mining using Apriori
frequent_itemsets = apriori(df, min_support=min_support, use_colnames=True)

print(frequent_itemsets)
```

```
    support         itemsets
0  0.857143          (Bread)
1  0.428571         (Butter)
2  0.714286           (Eggs)
3  0.714286           (Milk)
4  0.428571  (Bread, Butter)
5  0.571429    (Bread, Eggs)
6  0.571429    (Bread, Milk)
7  0.428571     (Eggs, Milk)
```
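
The result contains no 3-itemsets and omits two of the six possible pairs. A plain-Python cross-check, independent of `mlxtend`, shows why: those itemsets fall below the 0.4 threshold. The helper `support` is a hypothetical name used only for this check.

```python
from itertools import combinations

dataset = [
    ['Milk', 'Bread', 'Eggs'],
    ['Bread', 'Eggs', 'Butter'],
    ['Milk', 'Eggs'],
    ['Milk', 'Bread', 'Eggs', 'Butter'],
    ['Milk', 'Bread'],
    ['Bread', 'Eggs'],
    ['Milk', 'Bread', 'Butter'],
]
min_support = 0.4

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in dataset if set(itemset) <= set(t)) / len(dataset)

items = sorted({i for t in dataset for i in t})
for size in (2, 3):
    for combo in combinations(items, size):
        s = support(combo)
        status = "frequent" if s >= min_support else "below threshold"
        print(f"{combo}: {s:.3f} ({status})")
```

Every 3-itemset appears in at most 2 of the 7 transactions (about 0.286), which is why the search stops at pairs.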


## Generate Association Rules
Next, we'll use the frequent itemsets to generate association rules and calculate various association metrics such as confidence and lift.

```python
from mlxtend.frequent_patterns import association_rules

# Generate association rules with minimum confidence threshold (e.g., 0.6)
min_confidence = 0.6
association_rules_df = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)

print("\nAssociation Rules:")
print(association_rules_df[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
```

```
Association Rules:
  antecedents consequents   support  confidence      lift
0    (Butter)     (Bread)  0.428571    1.000000  1.166667
1     (Bread)      (Eggs)  0.571429    0.666667  0.933333
2      (Eggs)     (Bread)  0.571429    0.800000  0.933333
3     (Bread)      (Milk)  0.571429    0.666667  0.933333
4      (Milk)     (Bread)  0.571429    0.800000  0.933333
5      (Eggs)      (Milk)  0.428571    0.600000  0.840000
6      (Milk)      (Eggs)  0.428571    0.600000  0.840000
```
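
These figures can be cross-checked without `mlxtend`. The sketch below recomputes single-item rules directly from the transactions, using the same thresholds (minimum support 0.4, minimum confidence 0.6). It considers only rules between single items, which is sufficient here because no larger itemset is frequent, and it uses `Fraction` so that the comparison at exactly 0.6 is not affected by floating-point rounding. The variable and function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

dataset = [
    ['Milk', 'Bread', 'Eggs'],
    ['Bread', 'Eggs', 'Butter'],
    ['Milk', 'Eggs'],
    ['Milk', 'Bread', 'Eggs', 'Butter'],
    ['Milk', 'Bread'],
    ['Bread', 'Eggs'],
    ['Milk', 'Bread', 'Butter'],
]
min_support = Fraction(4, 10)
min_confidence = Fraction(6, 10)

def support(items):
    """Exact fraction of transactions containing all of `items`."""
    hits = sum(1 for t in dataset if set(items) <= set(t))
    return Fraction(hits, len(dataset))

items = sorted({i for t in dataset for i in t})
rules = []
for a, b in permutations(items, 2):  # ordered pairs: a => b
    s = support({a, b})
    if s < min_support:
        continue  # {a, b} is not a frequent itemset
    conf = s / support({a})
    if conf >= min_confidence:
        rules.append((a, b, s, conf, conf / support({b})))

for a, b, s, conf, lift in rules:
    print(f"{a} => {b}: support={float(s):.6f} "
          f"confidence={float(conf):.6f} lift={float(lift):.6f}")
```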


## Analyze and Interpret the Results
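The `apriori` output is a DataFrame of frequent itemsets and their support values, where support is the fraction of transactions in which the itemset appears. The `association_rules` output has one row per rule, with its antecedents, consequents, support, confidence, and lift.

In this dataset, Bread is the most frequent item (support 0.857), and $\{ \text{Bread, Eggs} \}$ and $\{ \text{Bread, Milk} \}$ are the most frequent pairs (0.571 each). Only the rule $\text{Butter} \Rightarrow \text{Bread}$ has a lift above 1 (about 1.17, with confidence 1.0): every transaction that contains Butter also contains Bread. The remaining rules have lift below 1, so although they pass the confidence threshold, their items co-occur slightly less often than independence would predict.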

In this tutorial, we demonstrated how to mine frequent itemsets with the Apriori algorithm in Python and how to derive association rules from them. You can lower or raise `min_support` to obtain more or fewer frequent itemsets, and adjust `min_confidence` to control how many rules are kept.

Frequent itemset mining is a powerful technique for identifying interesting associations between items in transactional data and can be applied to various domains, such as market basket analysis, customer behavior analysis, and recommendation systems.