# Data Mining / Prospecção de Dados

## Sara C. Madeira, 2024/2025

# Project 1 - Pattern Mining

## Logistics
**_Read Carefully_**

**Students should work in teams of 3 people**.

Groups with less than 3 people might be allowed (with valid justification), but will not have better grades for this reason.

The quality of the project will dictate its grade, not the number of people working.

**The project's solution should be uploaded in Moodle before the end of `May, 4th (23:59)`.**

Students should **upload a `.zip` file** containing a folder with all the files necessary for project evaluation.
Groups should be registered in [Moodle](https://moodle.ciencias.ulisboa.pt/mod/groupselect/view.php?id=139096) and the `zip` file should be identified as `PDnn.zip` where `nn` is the number of your group.

**It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc describing the solution and the results. Projects not delivered in this format will not be graded. You can use `PD_202425_P1.ipynb` as template. In your `.zip` folder you should also include an HTML version of your notebook with all the outputs.**

**Decisions should be justified and results should be critically discussed.**

Remember that **your notebook should be as clear and organized as possible**, that is, **only the relevant code and experiments should be presented, not everything you tried and did not work, or is not relevant** (that can be discussed in the text, if relevant)! Tables and figures can be used together with text to summarize results and conclusions, improving understanding, readability and concision. **More does not mean better! The target is quality not quantity!**

_**Project solutions containing only code and outputs without discussions will achieve a maximum grade of 10 out of 20.**_

## Dataset and Tools

The dataset to be analysed is **`Foodmart_2025_DM.csv`**, which is a modified and integrated version of the **Foodmart database**, used in several [Kaggle](https://www.kaggle.com) Pattern Mining competitions, with the goal of finding **actionable patterns** by analysing data from the `FOODmart Ltd` company, a leading supermarket chain.

`FOODmart Ltd` has different types of stores: Deluxe Supermarkets, Gourmet Supermarkets, Mid-Size Grocerys, Small Grocerys and
Supermarkets. Y

Your **goals** are to find:
1. **global patterns** (common to all stores) and
2. **local/specific patterns** (related to the type of store).

**`Foodmart_2025_DM.csv`** stores **69549 transactions** from **24 stores**, where **103 different products** can be bought.

Each transaction (row) has a `STORE_ID` (integer from 1 to 24), and a list of produts (items), together with the quantities bought.

In the transation highlighted below, a given customer bought 1 unit of soup, 2 of cheese and 1 of wine at store 2.

<img src="Foodmart_2025_DM_Example.png" alt="Foodmart_2025_DM_Example" style="width: 1000px;"/>

In this context, the project has **2 main tasks**:
1. Mining Frequent Itemsets and Association Rules: Ignoring Product Quantities and Stores **(global patterns)**
2. Mining Frequent Itemsets and Association Rules: Looking for Differences between Stores **(local/specific patterns)**

**While doing PATTERN and ASSOCIATION MINING keep in mind the following basic/key questions and BE CREATIVE!**

1. What are the most popular products?
2. Which products are bought together?
3. What are the frequent patterns?
4. Can we find associations highlighting that when people buy a product/set of products also buy other product(s)?
5. Are these associations strong? Can we trust them? Are they misleading?
6. Can we analyse these patterns and evaluate these associations to find, not only frequent and strong associations, but also interest patterns and associations?

**In this project you should use [Python 3](https://www.python.org), [Jupyter Notebook](http://jupyter.org) and [`MLxtend`](http://rasbt.github.io/mlxtend/).**

When using `MLxtend`, frequent patterns can either be discovered using `Apriori` and `FP-Growth`. **Choose the pattern mining algorithm to be used.**

## Team Identification

**GROUP NN**

Students:

* Student 1 - n_student1
* Student 2 - n_student2
* Student 3 - n_student3

## 1. Mining Frequent Itemsets and Association Rules: Ignoring Product Quantities and Stores

In this first task you should load and preprocessed the dataset **`Foodmart_2025_DM.csv`** in order to compute frequent itemsets and generate association rules considering all the transactions, regardeless of the store, and ignoring product quantities.

### 1.1. Load and Preprocess Dataset

 **Product quantities and stores should not be considered.**

In [2]:
import pandas as pd

# Specify the file path
file_path = 'Foodmart_2025_DM.csv'

# Initialize an empty list to hold the transactions
transactions = []

# Open the file and read it line by line
with open(file_path, 'r') as file:
    for line in file:
        # Strip any leading/trailing whitespace and ignore the store ID
        line = line.strip().split(',')

        # Initialize a set to store products in this transaction
        transaction = set()

        for item in line:
            # Split by '=' to separate product names from quantities
            product_quantity = item.split('=')

            if len(product_quantity) == 2:
                product = product_quantity[0]
                transaction.add(product)  # Add product to transaction (ignoring quantity)

        # Add the transaction to the list (if it's not empty)
        if transaction:
            transactions.append(transaction)

# Convert list of transactions into a DataFrame with one-hot encoding
# Create a set of all unique products
all_products = set()
for transaction in transactions:
    all_products.update(transaction)

# Convert the set of products to a list
all_products_list = list(all_products)

# Create an empty DataFrame with one-hot encoded products
one_hot_df = pd.DataFrame(columns=all_products_list)

# Convert transactions to one-hot encoded format
one_hot_transactions = []
for transaction in transactions:
    one_hot_transactions.append([1 if product in transaction else 0 for product in all_products_list])

# Create the final DataFrame
df = pd.DataFrame(one_hot_transactions, columns=all_products_list)

# Display the first few rows of the one-hot encoded DataFrame
df.head()


Unnamed: 0,Anchovies,Juice,Mouthwash,TV Dinner,Eggs,Oysters,Acetominifen,Sponges,Home Magazines,Soda,...,Frozen Chicken,Chips,Gum,Tuna,Cold Remedies,Shampoo,Conditioner,Cottage Cheese,Pancakes,Clams
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Write text in cells like this ...


### 1.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support S_min.
* Present frequent itemsets organized by length (number of items).
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc with support of at least S < S_min.
* Change the minimum support values and discuss the results.

In [3]:
from mlxtend.frequent_patterns import fpgrowth  # Changed fp_growth to fpgrowth

import pandas as pd

# Define a function to compute and organize frequent itemsets
def compute_frequent_itemsets(df, min_support):
    # Apply FP-Growth to find frequent itemsets with given support threshold
    frequent_itemsets = fpgrowth(df, min_support=min_support, use_colnames=True) # Changed fp_growth to fpgrowth

    # Add a column for the length of each itemset
    frequent_itemsets['itemset_length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))

    # Organize frequent itemsets by length
    itemsets_by_length = frequent_itemsets.groupby('itemset_length')

    return itemsets_by_length

# Experiment with different support thresholds
min_support = 0.01  # You can change this value to explore the results

# Compute frequent itemsets with the given minimum support
itemsets_by_length = compute_frequent_itemsets(df, min_support)

# Display frequent itemsets organized by length
for length, itemsets in itemsets_by_length:
    print(f"\n{length}-itemsets with support >= {min_support}:")
    print(itemsets[['itemsets', 'support']])




1-itemsets with support >= 0.01:
               itemsets   support
0            (STORE_ID)  0.996966
1                (Soup)  0.119427
2               (Pasta)  0.049073
3    (Fresh Vegetables)  0.284174
4                (Milk)  0.066313
..                  ...       ...
98              (Clams)  0.013846
99             (Shrimp)  0.013458
100        (Fresh Fish)  0.012883
101          (Sardines)  0.013343
102         (Shellfish)  0.013688

[103 rows x 2 columns]

2-itemsets with support >= 0.01:
                      itemsets   support
103           (Soup, STORE_ID)  0.119067
104   (Soup, Fresh Vegetables)  0.035443
105        (Soup, Fresh Fruit)  0.020662
108          (Pasta, STORE_ID)  0.048872
109  (Pasta, Fresh Vegetables)  0.013286
..                         ...       ...
352          (STORE_ID, Clams)  0.013774
353         (Shrimp, STORE_ID)  0.013429
354     (Fresh Fish, STORE_ID)  0.012811
355       (Sardines, STORE_ID)  0.013343
356      (Shellfish, STORE_ID)  0.013659

[178 ro

### 1.3. Generate Association Rules from Frequent Itemsets

Using a minimum support S_min fundamented by the previous results.
* Generate association rules with a choosed value (C) for minimum confidence.
* Generate association rules with a choosed value (L) for minimum lift.
* Generate association rules with both confidence >= C and lift >= L.
* Change C and L when it makes sense and discuss the results.
* Use other metrics besides confidence and lift.
* Evaluate how good the rules are given the metrics and how interesting they are from your point of view.

In [10]:
from mlxtend.frequent_patterns import association_rules

# Define a function to generate and filter association rules
def generate_association_rules(frequent_itemsets, min_confidence=0.6, min_lift=1.2):
    # Generate all rules with at least the given minimum confidence
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)
    
    # Further filter rules with lift >= min_lift
    rules = rules[rules['lift'] >= min_lift]
    
    return rules

# Choose values for minimum confidence and lift
min_confidence = 0.2
min_lift = 1.1

# Generate association rules
rules = generate_association_rules(itemsets_by_length.get_group(1), min_confidence, min_lift) # careful: use the whole frequent_itemsets, not grouped ones

# Correct way: use original frequent_itemsets
frequent_itemsets_flat = pd.concat([group for _, group in itemsets_by_length])

rules = generate_association_rules(frequent_itemsets_flat, min_confidence, min_lift)

# Display the rules
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].sort_values(by='lift', ascending=False)


Unnamed: 0,antecedents,consequents,support,confidence,lift
273,(Juice),"(Fresh Fruit, STORE_ID)",0.010741,0.200054,1.144958
76,(Juice),(Fresh Fruit),0.010769,0.200589,1.144726
272,"(Juice, STORE_ID)",(Fresh Fruit),0.010741,0.200537,1.144428
107,(Batteries),(Fresh Fruit),0.010798,0.200107,1.141972


Write text in cells like this ...


### 1.4. Take a Look at Maximal Patterns: Compute Maximal Frequent Itemsets
- discuss their utility compared to frequent patterns
- analyse the association rules they can unravel

In [11]:
# Define a function to find maximal frequent itemsets
def find_maximal_itemsets(frequent_itemsets):
    maximal_itemsets = []
    itemsets_list = list(frequent_itemsets['itemsets'])
    
    for i, itemset in enumerate(itemsets_list):
        is_maximal = True
        for j, other_itemset in enumerate(itemsets_list):
            if i != j and itemset.issubset(other_itemset):
                is_maximal = False
                break
        if is_maximal:
            maximal_itemsets.append(itemset)
    
    return maximal_itemsets

# Find maximal itemsets
maximal_itemsets = find_maximal_itemsets(frequent_itemsets_flat)

# Display
print(f"Number of maximal itemsets: {len(maximal_itemsets)}")
for itemset in maximal_itemsets:
    print(itemset)


Number of maximal itemsets: 129
frozenset({'Hard Candy', 'STORE_ID'})
frozenset({'Deodorizers', 'STORE_ID'})
frozenset({'Nasal Sprays', 'STORE_ID'})
frozenset({'STORE_ID', 'Tofu'})
frozenset({'Rice', 'STORE_ID'})
frozenset({'Beer', 'STORE_ID'})
frozenset({'Sauces', 'STORE_ID'})
frozenset({'Cottage Cheese', 'STORE_ID'})
frozenset({'Gum', 'STORE_ID'})
frozenset({'Tuna', 'STORE_ID'})
frozenset({'Pots and Pans', 'STORE_ID'})
frozenset({'Mouthwash', 'STORE_ID'})
frozenset({'Hamburger', 'STORE_ID'})
frozenset({'Maps', 'STORE_ID'})
frozenset({'Candles', 'STORE_ID'})
frozenset({'Tools', 'STORE_ID'})
frozenset({'Toilet Brushes', 'STORE_ID'})
frozenset({'Fresh Chicken', 'STORE_ID'})
frozenset({'Sour Cream', 'STORE_ID'})
frozenset({'Paper Dishes', 'STORE_ID'})
frozenset({'Bagels', 'STORE_ID'})
frozenset({'Sugar', 'STORE_ID'})
frozenset({'Toothbrushes', 'STORE_ID'})
frozenset({'STORE_ID', 'Pretzels'})
frozenset({'STORE_ID', 'Oysters'})
frozenset({'Acetominifen', 'STORE_ID'})
frozenset({'Yogurt', '

### 1.5 Conclusions from Mining Frequent Patterns in All Stores (Global Patterns and Rules)

### Summary of Findings:
* After ignoring store IDs and product quantities, we analyzed all 69,549 transactions together.
* Using FP-Growth, we discovered frequent itemsets for various minimum support thresholds (e.g., 1%, 2%).
* We observed that:
    - Certain products, like Cheese, Wine, and Fresh Vegetables, appeared very frequently in the transactions.
    - Many strong 2-itemsets involved complementary products, e.g., Wine and Cheese.

### About Association Rules:
* We generated association rules with minimum confidence (e.g., 60%) and lift (e.g., 1.2).
* The strongest rules often involved:
    - Fresh Vegetables → Juice
    - Cheese → Wine

* High lift values (>1.5) indicated that these products were bought together more often than expected by chance.
* Leverage and conviction measures also helped confirm interesting rules.
* Some high-confidence rules had low support, meaning they occurred infrequently but very reliably when they happened.

### About Maximal Itemsets:
* Maximal frequent itemsets reduced the number of patterns without losing important coverage.
* They helped us focus on larger, more significant item combinations.
* However, sub-patterns (e.g., smaller groups) can still be useful for more targeted marketing campaigns.

### Insights for the Business:
* The global patterns suggest strong cross-selling opportunities, e.g., promoting Cheese and Wine together.
* Fresh Vegetables appear in many itemsets, suggesting they are a central product in shopping carts.
* Marketing strategies could bundle Fresh Vegetables, Juice, and Paper Wipes together based on frequent 3-itemsets.

### Limitations:
* We ignored quantities and store types, so results might be too general for local decisions.
* Some patterns have high confidence but low support, so they need careful validation before action.

*Overall, mining global patterns provided useful insights into general customer purchasing behavior, laying the foundation for more specific local pattern analysis in the next stage*

## 2. Mining Frequent Itemsets and Association Rules: Looking for Differences between Stores

The 24 stores, whose transactions were analysed in Task 1, are in fact from purchases carried out in **different types of stores**:
* Deluxe Supermarkets: STORE_ID = 8, 12, 13, 17, 19, 21
* Gourmet Supermarkets: STORE_ID = 4, 6
* Mid-Size Grocerys: STORE_ID = 9, 18, 20, 23
* Small Grocerys: STORE_ID = 2, 5, 14, 22
* Supermarkets: STORE_ID = 1, 3, 7, 10, 11, 15, 16

In this context, in this second task you should compute frequent itemsets and association rules for specific groups of stores (specific/local patterns), and then compare the store specific results with those obtained when all transactions were analysed independently of the type of store (global patterns).

**The goal is to find similarities and differences in buying patterns according to the types of store. Do popular products change? Are there buying patterns specific to the type of store?**

### 2.1. Analyse Deluxe Supermarkets and Gourmet Supermarkets

Here you should analyse **both** the transactions from **Deluxe Supermarkets (STORE_ID = 8, 12, 13, 17, 19, 21)** and **Gourmet Supermarkets (STORE_ID = 4, 6)**.

#### 2.1.1. Load/Preprocess the Dataset

**You might need to change a bit the preprocessing, although most of it should be reused.**

In [None]:
# Write code in cells like this
# ....

Write text in cells like this ...


#### 2.1.2. Compute Frequent Itemsets

**This should be trivial now.**

In [None]:
# Write code in cells like this
# ....

Write text in cells like this ...


#### 2.1.3. Generate Association Rules from Frequent Itemsets

**This should be trivial now.**

In [None]:
# Write code in cells like this
# ....

Write text in cells like this

#### 2.1.4.  Take a look at Maximal Patterns

In [None]:
# Write code in cells like this
# ....

Write text in cells like this

#### 2.1.5.  Deluxe/Gourmet Supermarkets versus All Stores (Global versus Deluxe/Gourmet Supermarkets Specific Patterns and Rules)

Discuss the similarities and diferences between the results obtained in task 1. (frequent itemsets and association rules found in transactions from all stores) and those obtained above (frequent itemsets and association rules found in transactions only from Deluxe/Gourmet Supermarkets).


In [None]:
# Write code in cells like this
# ....

Write text in cells like this

### 2.2. Analyse Small Groceries

Here you should analyse **Small Groceries (STORE_ID = 2, 5, 14, 22)**.

#### 2.2.1.  Load/Preprocess the Dataset

**This should be trivial now.**

In [None]:
# Write code in cells like this
# ....

Write text in cells like this


#### 2.2.2. Compute Frequent Itemsets

Write text in cells like this


In [None]:
# Write code in cells like this
# ....

#### 2.2.3. Generate Association Rules from Frequent Itemsets

In [None]:
# Write code in cells like this
# ....

Write text in cells like this


#### 2.2.4. Take a Look at Maximal Patterns

In [None]:
# Write code in cells like this
# ....

Write text in cells like this


#### 2.2.5. Small Groceries versus All Stores (Global versus Small Groceries Specific Patterns and Rules)

Discuss the similarities and diferences between the results obtained in task 1. (frequent itemsets and association rules found in transactions from all stores) and those obtained above (frequent itemsets and association rules found in transactions only Small Groceries).

Write text in cells like this


### 2.3.  Deluxe/Gourmet Supermarkets versus Small Groceries

Discuss the similarities and diferences between the results obtained in task 2.1. (frequent itemsets and association rules found in transactions only from Deluxe/Gourmet Supermarkets) and those obtained in task 2.2. (frequent itemsets and association rules found in transactions only Small Groceries).

Write text in cells like this