# Apriori Algorithm

### Apriori in 1 sentence

The Apriori algorithm is an association rule learning algorithm that iteratively identifies frequent item associations in transactional datasets, and is commonly used in recommendation/suggestion systems.
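
The "iteratively" in that sentence refers to a level-wise search: frequent single items are found first, then combined into candidate pairs, then triples, and so on, discarding any candidate that has an infrequent subset. The following is a minimal pure-Python sketch of that idea on made-up toy data. The function name `apriori_sketch` and the toy baskets are illustrative assumptions; this is not how `mlxtend` implements it.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c) >= min_support}

    frequent = {}
    k = 1
    while current:
        for c in current:
            frequent[c] = support(c)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical toy baskets, not taken from any dataset.
toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
result = apriori_sketch(toy, min_support=0.6)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With a threshold of 0.6, the three single items and the three pairs survive, while the triple `{a, b, c}` appears in only 2 of the 5 baskets and is dropped.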

## Introduction

The Apriori algorithm is used with transactional data. This could be a dataset of literal business transactions, used to examine customers' purchasing habits, but user activity in a web app can also be formatted as transactional data. The range of examples is broad, and the key point is that far more than literal financial or bartering transactions can be treated as *transactional data*.
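
As a rough illustration of how non-financial activity can be reshaped into transactions, the sketch below groups a hypothetical click log by session. The column names `session_id` and `page` and all the values are made up for this example.

```python
import pandas as pd

# Hypothetical click log: one row per (session, page) event.
events = pd.DataFrame(
    {
        "session_id": [1, 1, 1, 2, 2, 3],
        "page": ["home", "search", "cart", "home", "cart", "search"],
    }
)

# Each session becomes one transaction: the sorted set of distinct pages it touched.
transactions = (
    events.groupby("session_id")["page"]
    .apply(lambda pages: sorted(set(pages)))
    .tolist()
)
print(transactions)
```

Each inner list plays the same role as a shopping basket, so it can be passed to the same encoding and mining steps as any other transaction list.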

To experiment with the Apriori algorithm we'll use the *MovieLens* 100k dataset.
The first cell below loads the ratings file, and the second performs basic association rule mining on it.

In [2]:
import pandas as pd


ratings_url = "https://files.grouplens.org/datasets/movielens/ml-100k/u.data"
df = pd.read_csv(ratings_url, sep="\t", names=["user_id", "movie_id", "rating", "timestamp"])
df.head(10)

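|   | user_id | movie_id | rating | timestamp |
|---|---------|----------|--------|-----------|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
| 5 | 298 | 474 | 4 | 884182806 |
| 6 | 115 | 265 | 2 | 881171488 |
| 7 | 253 | 465 | 5 | 891628467 |
| 8 | 305 | 451 | 3 | 886324817 |
| 9 | 6 | 86 | 3 | 883603013 |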


In [2]:
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder


# Step 2: Keep only users who rated at least 5 movies.
# (In ml-100k every user has at least 20 ratings, so this is a no-op here;
# it matters for sparser datasets.)
user_movie_counts = df["user_id"].value_counts()
filtered_users = user_movie_counts[user_movie_counts >= 5].index
df_filtered = df[df["user_id"].isin(filtered_users)]

# Step 3: Group movies by user (each user's watched movies = a transaction)
transactions = df_filtered.groupby("user_id")["movie_id"].apply(list).tolist()

# Step 4: One-hot encode the transactions
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)

# Step 5: Apply the Apriori algorithm (min_support=0.4 keeps itemsets present in at least 40% of transactions)
frequent_itemsets = apriori(df_encoded, min_support=0.4, use_colnames=True)
print("Frequent Itemsets (Movies Co-Watched by Users):")
print(frequent_itemsets.head())

# Step 6: Generate association rules (keep rules with confidence >= 0.5)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print("\nAssociation Rules (Movies Likely Watched Together):")
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]].head())

Frequent Itemsets (Movies Co-Watched by Users):
    support itemsets
0  0.479321      (1)
1  0.415695      (7)
2  0.618240     (50)
3  0.417815     (56)
4  0.413574     (98)

Association Rules (Movies Likely Watched Together):
  antecedents consequents   support  confidence      lift
0         (1)        (50)  0.404030    0.842920  1.363420
1        (50)         (1)  0.404030    0.653516  1.363420
2        (50)       (100)  0.417815    0.675815  1.254514
3       (100)        (50)  0.417815    0.775591  1.254514
4        (50)       (174)  0.402969    0.651801  1.463449


### **Frequent Itemsets (Movies Co-Watched by Users):**

- **Support** = Percentage of users who watched the movie(s) in the itemset.  
  - Example: `(1)` with **support 0.479** means **47.9% of users** watched **movie 1**.  
  - `(50)` with **support 0.618** means **61.8% of users** watched **movie 50**.  
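
Support can also be computed by hand from a one-hot table, which may help make the definition concrete. The sketch below uses a tiny made-up table in the same shape as `df_encoded`; the values are invented and do not come from MovieLens.

```python
import pandas as pd

# Toy one-hot table: rows are users, columns are movie ids (values are made up).
toy = pd.DataFrame(
    {
        1: [True, True, False, True],
        50: [True, True, True, False],
        56: [False, True, False, False],
    }
)

# Support of a single item: fraction of rows where its column is True.
support_1 = toy[1].mean()

# Support of an itemset: fraction of rows where all of its columns are True.
support_1_50 = toy[[1, 50]].all(axis=1).mean()

print(support_1, support_1_50)
```

Here movie 1 appears in 3 of 4 rows (support 0.75), and movies 1 and 50 appear together in 2 of 4 rows (support 0.5).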

---

### **Association Rules (Movies Likely Watched Together):**

- **Antecedents** = The "if" part of the rule (e.g., "If a user watched movie 1").  
- **Consequents** = The "then" part of the rule (e.g., "Then they likely watched movie 50").  
- **Support** = Percentage of users who watched **both** the antecedent and consequent (e.g., 40.4% of users watched **both** movie 1 and movie 50).  
- **Confidence** = Of the users who watched the antecedent, the percentage who **also watched** the consequent (e.g., 84.3% of users who watched movie 1 also watched movie 50).  
- **Lift** = How much **more often** the antecedent and consequent occur together than would be expected **if they were independent** (confidence divided by the consequent's support).  
  - **Lift > 1** = They are **more likely** to be watched together than randomly.  
  - **Lift = 1** = The two are statistically independent (no relationship).  
  - **Lift < 1** = They are **less likely** to be watched together than randomly.  
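
For reference, the three measures in the list above can be written compactly. Here $A$ is the antecedent, $B$ the consequent, $N$ the number of transactions (users), and $\mathrm{supp}$ is shorthand introduced for this note rather than notation taken from `mlxtend`.

$$
\mathrm{supp}(X) = \frac{|\{\, t : X \subseteq t \,\}|}{N}
$$

$$
\mathrm{support}(A \Rightarrow B) = \mathrm{supp}(A \cup B), \qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)}
$$

$$
\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{confidence}(A \Rightarrow B)}{\mathrm{supp}(B)} = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)\,\mathrm{supp}(B)}
$$

The last form shows that lift is symmetric in $A$ and $B$, which is why the rules `(1)` → `(50)` and `(50)` → `(1)` report the same lift but different confidence.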

---

### **Example Rule:**
- **Rule:** `(1)` → `(50)`  
  - **Support** = 40.4% of users watched **both** movies 1 and 50.  
  - **Confidence** = 84.3% of users who watched movie 1 **also watched** movie 50.  
  - **Lift** = 1.36, meaning users who watched movie 1 are about **36% more likely** to have watched movie 50 than users overall.  
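
As a quick sanity check, the confidence and lift of this rule can be reproduced from the support values in the printed output. The variable names below are chosen for this sketch only.

```python
# Values copied from the printed output above.
support_1 = 0.479321      # support of (1)
support_50 = 0.618240     # support of (50)
support_1_50 = 0.404030   # support of the rule (1) -> (50)

# Confidence: share of movie-1 watchers who also watched movie 50.
confidence = support_1_50 / support_1

# Lift: confidence relative to movie 50's baseline popularity.
lift = confidence / support_50

print(round(confidence, 3), round(lift, 3))  # 0.843 1.363
```

Both values agree with the `confidence` and `lift` columns that `association_rules` reported for this rule.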

---

### **Key Takeaway:**
- The algorithm finds **patterns** (frequent itemsets) and **rules** (if-then relationships) in the data, using **support**, **confidence**, and **lift** to measure how strong and meaningful those patterns are.