## Task 2.2: Mining Cancer Feature Patterns

**Dataset:**  
Breast Cancer Wisconsin (Diagnostic)

**Source:**  
`sklearn.datasets.load_breast_cancer()`, which mirrors the UCI Wisconsin dataset (Wolberg et al., UCI Machine Learning Repository).

**Goal:**  
Derive **ordered feature sequences per patient** and **mine frequent sequential patterns** to uncover relationships and progression trends among diagnostic features.

**External tools used in this notebook:**
   - scikit-learn: data loading, preprocessing, discretization
   - PrefixSpan (Pei et al., 2001): sequential pattern mining algorithm; we use the Python implementation from the open-source `prefixspan` package (`pip install prefixspan`)



In [5]:
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# We will use PrefixSpan for sequential pattern mining.
# Reference:
# Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., & Hsu, M. (2001).
# PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth.
# Note: We'll use a Python implementation available as the `prefixspan` package.
try:
    from prefixspan import PrefixSpan
except ImportError:
    print("Please install prefixspan first: pip install prefixspan")

### 1. Load and Inspect Data
We load the Breast Cancer Wisconsin dataset with `sklearn.datasets.load_breast_cancer()` and attach the diagnosis labels.

In [6]:
# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)  # 30 numeric features
y = pd.Series(data.target)  # binary target: 0 = malignant, 1 = benign

# Map the numeric target to the strings 'M'/'B', like the Kaggle/original dataset format
diagnosis_map = {0: "M", 1: "B"}
diagnosis = y.map(diagnosis_map)

df = X.copy()
df["diagnosis"] = diagnosis

print(df.head())
print(df.shape)
print(df["diagnosis"].value_counts())

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0             

### 2. Scale and Discretize Features

#### Why scale?

We standardize each feature to a z-score so that we can later ask: “For this patient, which features are most abnormal compared to the full population?”

The Z-score is defined as:

$$
z = \frac{x - \mu}{\sigma}
$$

Where:

- $x$: the feature value for a specific patient  
- $\mu$: the mean of that feature across the population  
- $\sigma$: the standard deviation of that feature across the population  
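
As a quick numeric check of the formula, the sketch below standardizes a made-up feature column (the values are illustrative, not taken from the dataset). It uses the population standard deviation, which is also what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

# Illustrative values for one feature across five hypothetical patients.
values = [10.0, 12.0, 14.0, 16.0, 18.0]

mu = fmean(values)      # population mean
sigma = pstdev(values)  # population standard deviation (divides by N)

z_scores = [(x - mu) / sigma for x in values]

for x, z in zip(values, z_scores):
    print(f"x={x:5.1f}  z={z:+.3f}")
```

After standardization the column has mean 0 and standard deviation 1, so |z| is comparable across features with very different units.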


#### Why discretize?

Sequential pattern mining works on symbolic items, so we turn continuous features such as `radius`, `texture`, and `concavity` into the categories `low`, `mid`, and `high`. We use `KBinsDiscretizer`, which supports three strategies:

- `uniform`
- `quantile`
- `kmeans`

We start with `quantile` for mining and examine sensitivity to the binning strategy later.
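
To see why the strategy matters, the dependency-free sketch below computes three-bin cut points for a skewed, made-up sample in two ways: equal-width (the idea behind `uniform`) and equal-frequency (the idea behind `quantile`). It does not call `KBinsDiscretizer`; the helper names are hypothetical and the edge conventions are simplified.

```python
from bisect import bisect_right

LABELS = ["low", "mid", "high"]


def uniform_edges(values, n_bins=3):
    """Interior cut points that split [min, max] into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def quantile_edges(values, n_bins=3):
    """Interior cut points that put roughly equal counts in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // n_bins] for i in range(1, n_bins)]


def assign(values, edges):
    """Map each value to a label by locating it among the cut points."""
    return [LABELS[bisect_right(edges, v)] for v in values]


# Skewed toy sample: most values are small, one is very large.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]

uniform_labels = assign(sample, uniform_edges(sample))
quantile_labels = assign(sample, quantile_edges(sample))

print("uniform :", uniform_labels)
print("quantile:", quantile_labels)
```

With equal-width bins the outlier stretches the range, so eight of the nine values land in `low` and `mid` stays empty; equal-frequency bins give three values per label.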


In [7]:
# We standardize features using z-scores:
# This lets us rank "how extreme" each feature is for a given patient.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

# We can also measure how predictive each feature is of diagnosis
# using mutual information (MI). MI is a non-linear dependency measure.
# This uses scikit-learn's mutual_info_classif.
mi_scores = mutual_info_classif(X, y, discrete_features=False, random_state=42)
mi_series = pd.Series(mi_scores, index=X.columns).sort_values(ascending=False)

print("Top features by mutual information with diagnosis (higher = more predictive):")
print(mi_series.head(10))

Top features by mutual information with diagnosis (higher = more predictive):
worst perimeter         0.471842
worst area              0.464313
worst radius            0.451230
mean concave points     0.438806
worst concave points    0.436255
mean perimeter          0.402361
mean concavity          0.375447
mean radius             0.362276
mean area               0.360023
area error              0.340759
dtype: float64


In [8]:
# We convert each continuous feature column into categorical bins:
#   'low', 'mid', 'high'
# Using scikit-learn's KBinsDiscretizer.
# We will mainly use 'quantile' for mining, but we also generate 'uniform' and
# 'kmeans' for a sensitivity analysis later, as required by the task.


def discretize_features(X_in, strategy, n_bins=3):
    """
    Discretize continuous features using KBinsDiscretizer.
    strategy: 'uniform', 'quantile', or 'kmeans'
    Returns a dataframe of same shape but with categorical labels: 'low','mid','high'
    """
    kbd = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy=strategy)
    binned = kbd.fit_transform(X_in)

    # map 0->low, 1->mid, 2->high; only n_bins <= 3 is supported, because
    # higher bin indices have no label and would map to NaN
    bin_labels = dict(enumerate(["low", "mid", "high"][:n_bins]))

    df_cat = pd.DataFrame(binned, columns=X_in.columns, index=X_in.index)
    for col in df_cat.columns:
        df_cat[col] = df_cat[col].map(bin_labels)

    return df_cat


disc_uniform = discretize_features(X, strategy="uniform", n_bins=3)
disc_quantile = discretize_features(X, strategy="quantile", n_bins=3)
disc_kmeans = discretize_features(X, strategy="kmeans", n_bins=3)

print("Sample rows - uniform binning:")
print(disc_uniform.head(3))

print("\nSample rows - quantile binning:")
print(disc_quantile.head(3))

print("\nSample rows - kmeans binning:")
print(disc_kmeans.head(3))



Sample rows - uniform binning:
  mean radius mean texture mean perimeter mean area mean smoothness  \
0         mid          low            mid       mid             mid   
1         mid          low            mid       mid             low   
2         mid          mid            mid       mid             mid   

  mean compactness mean concavity mean concave points mean symmetry  \
0             high           high                high          high   
1              low            low                 mid           mid   
2              mid            mid                 mid           mid   

  mean fractal dimension  ... worst radius worst texture worst perimeter  \
0                    mid  ...          mid           low            high   
1                    low  ...          mid           low             mid   
2                    low  ...          mid           mid             mid   

  worst area worst smoothness worst compactness worst concavity  \
0        mid              m

### 3. Rank Features Per Patient and Build Sequences
We turn each row (patient) into an ordered sequence.

**For a given patient (row index = `idx`):**

1. Look at their z-scores for all features.  
2. Rank features by absolute z-score (|z|) in descending order → identifies which features are most abnormal for **this patient**.  
3. Select the top k features.  
4. Handle ties: features with the same |z| within `tie_eps` are grouped into the same itemset.  
5. Convert each feature into a symbolic token in the format:  
   `feature_name=low/mid/high`,  
   using the discretized DataFrame (`disc_df`).
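
Steps 2 to 4 can be illustrated on made-up z-scores for a single hypothetical patient; the feature names and values below are placeholders chosen so that two features tie.

```python
# Hypothetical z-scores for one patient (not real data).
z = {"feat_a": 2.5, "feat_b": -2.5, "feat_c": 1.2, "feat_d": -0.3}

k = 3
tie_eps = 1e-6

# Step 2: rank by |z|, largest first. Step 3: keep the top k.
ranked = sorted(z.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]

# Step 4: consecutive features whose |z| differ by at most tie_eps share an itemset.
itemsets = []
for feat, value in ranked:
    if itemsets and abs(abs(value) - itemsets[-1][-1][1]) <= tie_eps:
        itemsets[-1].append((feat, abs(value)))
    else:
        itemsets.append([(feat, abs(value))])

names_only = [[feat for feat, _ in group] for group in itemsets]
print(names_only)  # [['feat_a', 'feat_b'], ['feat_c']]
```

Because ranking uses |z|, `feat_a` and `feat_b` tie even though they deviate in opposite directions; the direction is recovered later from the discretized level.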

---

**Output format (nested itemsets):**
```python
[
    ['worst concavity=high', 'worst compactness=high'],   # most abnormal tie
    ['mean radius=high'],
    ['mean smoothness=low']
]
```


In [9]:
def build_patient_sequence(idx, X_scaled_df, disc_df, k=5, tie_eps=1e-6):
    """
    Returns a sequence (list of itemsets), where each itemset is a list of strings.
    Example: [ ['radius_mean=high'], ['texture_mean=high', 'area_mean=high'], ... ]
    """
    # 1. z-scores for this patient
    z_row = X_scaled_df.loc[idx]

    # 2. sort features by absolute deviation
    #    we'll keep (feature, abs_z, signed_z)
    feat_stats = []
    for feat in X_scaled_df.columns:
        feat_stats.append((feat, abs(z_row[feat]), z_row[feat]))
    feat_stats.sort(key=lambda x: x[1], reverse=True)

    # 3. take top-k
    top_feats = feat_stats[:k]

    # 4. group ties by abs_z within tie_eps to form itemsets
    sequence = []
    current_bucket = [top_feats[0]]
    for fstat in top_feats[1:]:
        if abs(fstat[1] - current_bucket[-1][1]) <= tie_eps:
            current_bucket.append(fstat)
        else:
            sequence.append(current_bucket)
            current_bucket = [fstat]
    sequence.append(current_bucket)

    # 5. convert buckets to itemsets of "feature=level"
    #    get discretized level from disc_df
    seq_itemsets = []
    for bucket in sequence:
        itemset = []
        for feat, _, _ in bucket:
            level = disc_df.loc[idx, feat]  # 'low'/'mid'/'high'
            itemset.append(f"{feat}={level}")
        seq_itemsets.append(itemset)

    return seq_itemsets


# We'll build sequences for all patients using quantile discretization
# (the one we'll mine first)
seqs_all = []
labels_all = []
for idx in X.index:
    seq = build_patient_sequence(idx, X_scaled_df, disc_quantile, k=5, tie_eps=1e-6)
    seqs_all.append(seq)
    labels_all.append(diagnosis.loc[idx])

# Sanity check a couple sequences
for i in range(3):
    print(f"Patient {i}, diagnosis={labels_all[i]}")
    print(seqs_all[i])
    print()

Patient 0, diagnosis=M
[['mean compactness=high'], ['perimeter error=high'], ['worst symmetry=high'], ['mean concavity=high'], ['worst compactness=high']]

Patient 1, diagnosis=M
[['mean area=high'], ['worst area=high'], ['mean radius=high'], ['worst radius=high'], ['mean perimeter=high']]

Patient 2, diagnosis=M
[['mean concave points=high'], ['worst concave points=high'], ['mean radius=high'], ['mean perimeter=high'], ['mean area=high']]



In [10]:
def sanitize_item(item):
    return item.replace(" ", "_").replace("(", "").replace(")", "")


seqs_all_clean = []
for seq in seqs_all:
    clean_seq = []
    for itemset in seq:
        clean_itemset = [sanitize_item(x) for x in itemset]
        clean_seq.append(clean_itemset)
    seqs_all_clean.append(clean_seq)

# overwrite the original with the clean version for mining
seqs_all = seqs_all_clean

In [11]:
seqs_M = [s for s, lab in zip(seqs_all, labels_all, strict=False) if lab == "M"]
seqs_B = [s for s, lab in zip(seqs_all, labels_all, strict=False) if lab == "B"]

print("Number malignant sequences:", len(seqs_M))
print("Number benign sequences:", len(seqs_B))

Number malignant sequences: 212
Number benign sequences: 357


### 4. Preparing data for `prefixspan`

The `prefixspan` library expects:

- A list of **sequences**  
- Each sequence to be a list of **hashable items** (e.g., strings)  
- **No nested lists** inside lists

---

However, our current data structure contains **nested lists** (itemsets).  
We'll “compress” each itemset into a single string by **joining the items with `|`**.

---

**Example:**

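```python
# before: nested itemsets
[['worst_concavity=high', 'worst_compactness=high'], ['mean_radius=high']]

# after: one string per itemset, items sorted and joined with '|'
['worst_compactness=high|worst_concavity=high', 'mean_radius=high']
```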


In [12]:
def collapse_itemsets_for_prefixspan(seq_with_itemsets):
    """
    Input example for ONE patient:
      [
        ['feat1=high', 'feat2=high'],
        ['feat3=low'],
        ...
      ]

    Output we want:
      [
        'feat1=high|feat2=high',
        'feat3=low',
        ...
      ]
    """
    collapsed = []
    for itemset in seq_with_itemsets:
        # itemset is a list like ['feat1=high', 'feat2=high']
        joined = "|".join(sorted(itemset))
        collapsed.append(joined)
    return collapsed


# build NEW collapsed versions from your current seqs_M / seqs_B
seqs_M_collapsed = [collapse_itemsets_for_prefixspan(s) for s in seqs_M]
seqs_B_collapsed = [collapse_itemsets_for_prefixspan(s) for s in seqs_B]

print("After collapsing:")
print("Type of seqs_M_collapsed[0]:", type(seqs_M_collapsed[0]))
print("seqs_M_collapsed[0]:", seqs_M_collapsed[0])
print("Type of seqs_M_collapsed[0][0]:", type(seqs_M_collapsed[0][0]))
print("seqs_M_collapsed[0][0]:", seqs_M_collapsed[0][0])

After collapsing:
Type of seqs_M_collapsed[0]: <class 'list'>
seqs_M_collapsed[0]: ['mean_compactness=high', 'perimeter_error=high', 'worst_symmetry=high', 'mean_concavity=high', 'worst_compactness=high']
Type of seqs_M_collapsed[0][0]: <class 'str'>
seqs_M_collapsed[0][0]: mean_compactness=high


### 5a. Mine Sequential Patterns with PrefixSpan

We now run **PrefixSpan** on:

- **Malignant sequences only**
- **Benign sequences only**

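Each discovered pattern is an ordered list of tokens reported together with its support count, for example `['mean_radius=high']` with support 60 out of 212 malignant sequences.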

**Citation Notes**

PrefixSpan is an algorithm introduced in:
Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., & Hsu, M. (2001).
PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth.

We are using the open-source Python package prefixspan that implements this algorithm.
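
The mining function turns a support ratio into an absolute count with a ceiling. As a worked example, assuming the ratio of 0.1 used in this notebook and the class sizes printed earlier (212 malignant, 357 benign); `min_support_count` is a hypothetical helper name:

```python
import math


def min_support_count(ratio, n_sequences):
    """Smallest number of sequences a pattern must appear in (at least 1)."""
    return max(1, math.ceil(ratio * n_sequences))


malignant_min = min_support_count(0.1, 212)  # ceil(21.2)
benign_min = min_support_count(0.1, 357)     # ceil(35.7)

print(malignant_min, benign_min)  # 22 36
```

The same ratio therefore demands more supporting patients in the larger benign class, which is worth keeping in mind when comparing raw support counts across classes.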



In [13]:
def mine_patterns(sequences, min_support_ratio=0.1, max_pattern_length=3, top_n=20):
    """
    Mine frequent sequential patterns from 'sequences' using PrefixSpan.

    sequences: list of sequences, where each sequence is a list of strings.
               Example: [
                   'worst_concavity=high|worst_compactness=high',
                   'radius_mean=high', ...
               ]
    min_support_ratio: minimum fraction of patients that must contain the pattern.
    max_pattern_length: only keep patterns up to this length for interpretability.
    top_n: return the top_n most supported patterns.

    This function wraps the PrefixSpan API from the open-source `prefixspan` package.
    """
    n_seq = len(sequences)
    min_support = max(1, int(np.ceil(min_support_ratio * n_seq)))

    ps = PrefixSpan(sequences)

    # prefixspan.PrefixSpan.frequent(min_support) returns a list of tuples:
    #   (support_count, pattern_sequence)
    # where pattern_sequence is also a list of strings like the ones in our sequences.
    raw_patterns = ps.frequent(min_support)

    # filter by length so we don't get huge unreadable patterns
    filtered = [
        (supp, pat) for supp, pat in raw_patterns if len(pat) <= max_pattern_length
    ]

    # sort by support, descending; ties are broken with longer patterns first
    filtered.sort(key=lambda x: (x[0], len(x[1])), reverse=True)

    return filtered[:top_n]


patterns_M = mine_patterns(
    seqs_M_collapsed, min_support_ratio=0.1, max_pattern_length=3, top_n=20
)
patterns_B = mine_patterns(
    seqs_B_collapsed, min_support_ratio=0.1, max_pattern_length=3, top_n=20
)

print("Top malignant patterns:")
for supp, pat in patterns_M:
    print(
        f"support={supp} / {len(seqs_M_collapsed)} "
        f"({supp / len(seqs_M_collapsed):.2f})  pattern={pat}"
    )

print("\nTop benign patterns:")
for supp, pat in patterns_B:
    print(
        f"support={supp} / {len(seqs_B_collapsed)} "
        f"({supp / len(seqs_B_collapsed):.2f})  pattern={pat}"
    )

Top malignant patterns:
support=60 / 212 (0.28)  pattern=['mean_radius=high']
support=49 / 212 (0.23)  pattern=['worst_radius=high']
support=47 / 212 (0.22)  pattern=['mean_area=high']
support=45 / 212 (0.21)  pattern=['worst_compactness=high']
support=43 / 212 (0.20)  pattern=['mean_perimeter=high']
support=42 / 212 (0.20)  pattern=['worst_fractal_dimension=high']
support=41 / 212 (0.19)  pattern=['worst_texture=high']
support=40 / 212 (0.19)  pattern=['worst_perimeter=high']
support=39 / 212 (0.18)  pattern=['mean_concave_points=high']
support=39 / 212 (0.18)  pattern=['worst_concavity=high']
support=38 / 212 (0.18)  pattern=['worst_symmetry=high']
support=38 / 212 (0.18)  pattern=['worst_area=high']
support=37 / 212 (0.17)  pattern=['radius_error=high']
support=36 / 212 (0.17)  pattern=['worst_concave_points=high']
support=35 / 212 (0.17)  pattern=['mean_concavity=high']
support=35 / 212 (0.17)  pattern=['worst_smoothness=high']
support=31 / 212 (0.15)  pattern=['perimeter_error=hig

### 5b. Mine Sequential Patterns with GSP Algorithm

**GSP (Generalized Sequential Pattern)** is another classic sequential pattern mining algorithm that:
- Uses an Apriori-like approach
- Generates candidate sequences
- Makes multiple passes over the data
- Has explicit support for time constraints and sliding windows

We'll implement GSP as an alternative to PrefixSpan for comparison.
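
In sequential pattern mining, a pattern is supported by a sequence only if its items appear in the same order, possibly with gaps. A minimal sketch of that order-aware check follows; `is_subsequence` and `support` are hypothetical helper names, and the toy sequences are illustrative.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so each match must come after the previous one.
    return all(item in it for item in pattern)


def support(pattern, sequences):
    """Number of sequences that contain `pattern` as an ordered subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in sequences)


toy = [
    ["a=high", "b=high", "c=low"],
    ["b=high", "a=high"],
    ["a=high", "c=low"],
]

print(support(["a=high", "b=high"], toy))  # 1: only the first sequence has a before b
print(support(["a=high", "c=low"], toy))   # 2
print(support(["a=high", "a=high"], toy))  # 0: no sequence repeats the item
```

A plain membership test such as `all(p in seq for p in pattern)` would instead count the last pattern as supported by every sequence containing `a=high`, because it ignores order and repetition.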

In [14]:
from collections import defaultdict


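def generate_candidates(prev_sequences, k):
    """Generate length-k candidates by joining pairs of length-(k-1) frequent sequences."""
    candidates = set()
    for seq1 in prev_sequences:
        for seq2 in prev_sequences:
            # Join when seq1 without its last item equals seq2 without its first item.
            # For k=2 both slices are empty, so every pair is joined, including an
            # item with itself (e.g. ('a',) and ('a',) give ('a', 'a')).
            if seq1[:-1] == seq2[1:]:
                candidates.add(seq1 + (seq2[-1],))
    return candidates


def gsp_mine(sequences, min_support_ratio=0.1, max_length=3):
    """
    Mine patterns with a simplified GSP-style (Apriori-like) level-wise search.

    Parameters:
    sequences: list of sequences (same format as used for PrefixSpan)
    min_support_ratio: minimum support threshold as a fraction
    max_length: upper bound on the level-wise search; as written, the loop stops
                before recording the last level, so reported patterns have length
                at most max_length - 1

    Note: support is counted with a membership test (`p in seq` for every item),
    which ignores order and repetition. A pattern such as ['a', 'a'] therefore
    gets the same support as ['a'].
    """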
    n_sequences = len(sequences)
    min_support = max(1, int(np.ceil(min_support_ratio * n_sequences)))

    # Find frequent 1-sequences
    item_counts = defaultdict(int)
    for seq in sequences:
        for item in set(seq):  # Count unique items per sequence
            item_counts[item] += 1

    L1 = {(item,) for item, count in item_counts.items() if count >= min_support}

    current_L = L1
    all_patterns = []
    k = 1

    while current_L and k < max_length:
        # Add current frequent sequences to results
        for pattern in current_L:
            support = sum(1 for seq in sequences if all(p in seq for p in pattern))
            all_patterns.append((support, list(pattern)))

        # Generate candidates
        candidates = generate_candidates(current_L, k + 1)

        # Count support
        current_L = set()
        for cand in candidates:
            support = sum(1 for seq in sequences if all(p in seq for p in cand))
            if support >= min_support:
                current_L.add(cand)
        k += 1

    # Sort by support
    all_patterns.sort(key=lambda x: (-x[0], len(x[1])))
    return all_patterns


# Run GSP on malignant and benign sequences
patterns_M_gsp = gsp_mine(seqs_M_collapsed, min_support_ratio=0.1, max_length=3)
patterns_B_gsp = gsp_mine(seqs_B_collapsed, min_support_ratio=0.1, max_length=3)

print("Top malignant patterns (GSP):")
for supp, pat in patterns_M_gsp[:10]:
    print(
        f"support={supp} / {len(seqs_M_collapsed)} ({supp / len(seqs_M_collapsed):.2f})  pattern={pat}"
    )

print("\nTop benign patterns (GSP):")
for supp, pat in patterns_B_gsp[:10]:
    print(
        f"support={supp} / {len(seqs_B_collapsed)} ({supp / len(seqs_B_collapsed):.2f})  pattern={pat}"
    )

Top malignant patterns (GSP):
support=60 / 212 (0.28)  pattern=['mean_radius=high']
support=60 / 212 (0.28)  pattern=['mean_radius=high', 'mean_radius=high']
support=49 / 212 (0.23)  pattern=['worst_radius=high']
support=49 / 212 (0.23)  pattern=['worst_radius=high', 'worst_radius=high']
support=47 / 212 (0.22)  pattern=['mean_area=high']
support=47 / 212 (0.22)  pattern=['mean_area=high', 'mean_area=high']
support=45 / 212 (0.21)  pattern=['worst_compactness=high']
support=45 / 212 (0.21)  pattern=['worst_compactness=high', 'worst_compactness=high']
support=43 / 212 (0.20)  pattern=['mean_perimeter=high']
support=43 / 212 (0.20)  pattern=['mean_perimeter=high', 'mean_perimeter=high']

Top benign patterns (GSP):
support=91 / 357 (0.25)  pattern=['worst_texture=low']
support=91 / 357 (0.25)  pattern=['worst_texture=low', 'worst_texture=low']
support=84 / 357 (0.24)  pattern=['worst_smoothness=low']
support=84 / 357 (0.24)  pattern=['mean_texture=low']
support=84 / 357 (0.24)  pattern=['

In [15]:
def compare_algorithms(prefixspan_patterns, gsp_patterns, label):
    """Compare patterns found by PrefixSpan and GSP"""
    print(f"\nComparison for {label} patterns:")
    print("\nPatterns found by both algorithms:")
    prefix_set = {tuple(pat) for _, pat in prefixspan_patterns}
    gsp_set = {tuple(pat) for _, pat in gsp_patterns}
    common = prefix_set.intersection(gsp_set)
    for pat in common:
        print(pat)

    print(f"\nUnique to PrefixSpan ({len(prefix_set - gsp_set)}):")
    for pat in prefix_set - gsp_set:
        print(pat)

    print(f"\nUnique to GSP ({len(gsp_set - prefix_set)}):")
    for pat in gsp_set - prefix_set:
        print(pat)


# Compare results
compare_algorithms(patterns_M, patterns_M_gsp, "Malignant")
compare_algorithms(patterns_B, patterns_B_gsp, "Benign")


Comparison for Malignant patterns:

Patterns found by both algorithms:
('mean_radius=high',)
('worst_concavity=high',)
('mean_concavity=high',)
('worst_fractal_dimension=high',)
('worst_texture=high',)
('radius_error=high',)
('perimeter_error=high',)
('worst_perimeter=high',)
('mean_concave_points=high',)
('mean_fractal_dimension=low',)
('worst_radius=high',)
('mean_texture=high',)
('worst_compactness=high',)
('mean_area=high',)
('worst_smoothness=high',)
('mean_perimeter=high',)
('worst_concave_points=high',)
('worst_symmetry=high',)
('mean_radius=high', 'mean_perimeter=high')
('worst_area=high',)

Unique to PrefixSpan (0):

Unique to GSP (48):
('worst_area=high', 'worst_area=high')
('worst_perimeter=high', 'worst_perimeter=high')
('mean_perimeter=high', 'mean_perimeter=high')
('worst_perimeter=high', 'worst_radius=high')
('worst_compactness=high', 'worst_concavity=high')
('area_error=high',)
('worst_compactness=high', 'worst_fractal_dimension=high')
('mean_compactness=high',)
('wors

### 6. Summarize Which Features Dominate Each Class

The `patterns_M` output is still a bit raw. Now we extract which symbolic features (like `worst_concavity=high`) show up most often in high-support malignant patterns vs benign patterns.


In [None]:
def summarize_top_attributes(patterns, top_k_attrs=10):
    """
    patterns: list of (support, pattern)
    pattern is a list of steps, where each step is now a *string*.
    Each step may look like:
        'featA=high|featB=high'
    We'll split by "|" to count features individually.
    """
    attr_counter = Counter()
    for supp, pat in patterns:
        for step in pat:
            for atom in step.split("|"):  # separate 'feat1=high|feat2=high'
                attr_counter[atom] += supp  # weight by pattern support
    return attr_counter.most_common(top_k_attrs)


print("Most characteristic items for Malignant (weighted):")
print(summarize_top_attributes(patterns_M))

print("\nMost characteristic items for Benign (weighted):")
print(summarize_top_attributes(patterns_B))

Most characteristic items for Malignant (weighted):
[('mean_radius=high', 90), ('mean_perimeter=high', 73), ('worst_radius=high', 49), ('mean_area=high', 47), ('worst_compactness=high', 45), ('worst_fractal_dimension=high', 42), ('worst_texture=high', 41), ('worst_perimeter=high', 40), ('mean_concave_points=high', 39), ('worst_concavity=high', 39)]

Most characteristic items for Benign (weighted):
[('worst_texture=low', 132), ('mean_texture=low', 125), ('worst_smoothness=low', 84), ('mean_smoothness=low', 81), ('mean_symmetry=low', 78), ('worst_concave_points=low', 73), ('concave_points_error=low', 68), ('texture_error=low', 66), ('mean_radius=low', 60), ('mean_compactness=low', 58)]


### 7. Sensitivity Check Across Binning Strategies

In [None]:
def rebuild_sequences_from_disc(disc_df, k=5):
    """
    Re-run the sequence construction using a different discretized dataframe
    (e.g. disc_uniform instead of disc_quantile).
    """
    seqs_tmp = []
    for idx in X.index:
        s = build_patient_sequence(idx, X_scaled_df, disc_df, k=k, tie_eps=1e-6)
        seqs_tmp.append(s)
    return seqs_tmp


# 1. sequences using UNIFORM instead of QUANTILE
seqs_all_uniform_raw = rebuild_sequences_from_disc(disc_uniform, k=5)

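# 2. collapse for PrefixSpan
#    (sanitize_item is not applied here, so these tokens keep the spaces
#    from the original feature names)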
seqs_all_uniform_collapsed = [
    collapse_itemsets_for_prefixspan(s) for s in seqs_all_uniform_raw
]

# 3. split by label again
seqs_M_uniform = [
    s
    for s, lab in zip(seqs_all_uniform_collapsed, labels_all, strict=False)
    if lab == "M"
]
seqs_B_uniform = [
    s
    for s, lab in zip(seqs_all_uniform_collapsed, labels_all, strict=False)
    if lab == "B"
]

# 4. mine again
patterns_M_uniform = mine_patterns(
    seqs_M_uniform, min_support_ratio=0.1, max_pattern_length=3, top_n=10
)
patterns_B_uniform = mine_patterns(
    seqs_B_uniform, min_support_ratio=0.1, max_pattern_length=3, top_n=10
)

print("Malignant patterns (quantile binning):")
for supp, pat in patterns_M[:5]:
    print(supp, pat)

print("\nMalignant patterns (uniform binning):")
for supp, pat in patterns_M_uniform[:5]:
    print(supp, pat)

print("\nBenign patterns (quantile binning):")
for supp, pat in patterns_B[:5]:
    print(supp, pat)

print("\nBenign patterns (uniform binning):")
for supp, pat in patterns_B_uniform[:5]:
    print(supp, pat)

Malignant patterns (quantile binning):
60 ['mean_radius=high']
49 ['worst_radius=high']
47 ['mean_area=high']
45 ['worst_compactness=high']
43 ['mean_perimeter=high']

Malignant patterns (uniform binning):
50 ['mean radius=mid']
40 ['mean area=mid']
38 ['worst radius=mid']
35 ['worst compactness=mid']
35 ['mean perimeter=mid']

Benign patterns (quantile binning):
91 ['worst_texture=low']
84 ['mean_texture=low']
84 ['worst_smoothness=low']
81 ['mean_smoothness=low']
78 ['mean_symmetry=low']

Benign patterns (uniform binning):
91 ['worst texture=low']
84 ['mean texture=low']
84 ['worst smoothness=low']
81 ['mean smoothness=low']
79 ['texture error=low']
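
One simple way to put a number on this sensitivity is the Jaccard overlap between the top single-item patterns under two binning strategies. The sketch below is an illustrative addition under that assumption; it uses the top-5 items printed above and normalizes spaces to underscores so the two token styles are comparable. The `jaccard` and `normalize` helpers are hypothetical names.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def normalize(tokens):
    """Replace spaces with underscores so both token styles match."""
    return [t.replace(" ", "_") for t in tokens]


# Top-5 single-item patterns copied from the printed output.
malignant_quantile = ["mean_radius=high", "worst_radius=high", "mean_area=high",
                      "worst_compactness=high", "mean_perimeter=high"]
malignant_uniform = ["mean radius=mid", "mean area=mid", "worst radius=mid",
                     "worst compactness=mid", "mean perimeter=mid"]
benign_quantile = ["worst_texture=low", "mean_texture=low", "worst_smoothness=low",
                   "mean_smoothness=low", "mean_symmetry=low"]
benign_uniform = ["worst texture=low", "mean texture=low", "worst smoothness=low",
                  "mean smoothness=low", "texture error=low"]

malignant_overlap = jaccard(normalize(malignant_quantile), normalize(malignant_uniform))
benign_overlap = jaccard(normalize(benign_quantile), normalize(benign_uniform))

print(f"malignant: {malignant_overlap:.2f}")  # 0.00
print(f"benign:    {benign_overlap:.2f}")     # 0.67
```

The same five malignant features lead both lists, but their level changes from `high` to `mid` under uniform binning, so the token-level overlap is zero; the benign lists share four of six distinct tokens.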
