# Dichotomic Pattern Mining of Dynamic Functional Network Connectivity (dFNC) Sequences using Seq2Pat 

Aaron S. Kemp, MBA, PhD Candidate<br>
Department of Biomedical Informatics,<br> 
University of Arkansas for Medical Sciences

**Related Publication:**
This notebook contains the code used for the Dichotomic Pattern Mining and machine learning classification analyses described in the manuscript entitled “Sequential Patterning of Dynamic Brain States Distinguish Parkinson’s Disease Patients with Mild Cognitive Impairments”, which is currently under consideration for publication in *NeuroImage: Clinical*.

**Licensing Note:**  
The following blocks of code were adapted from the [Seq2Pat GitHub repository](https://github.com/fidelity/seq2pat), which is available under [Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0).  

**Current Adaptation by Aaron S. Kemp:**  
This adaptation has been modified for the purposes of this project and is provided under the following terms:
- **Copyright (C) 2025 University of Arkansas for Medical Sciences**  
- **Author:** Aaron S. Kemp, askemp@uams.edu  
- **Licensed under the Apache License, Version 2.0**  
You may obtain a copy of the License at [Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0)

**Additional Notes:**  
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. You may not use this file except in compliance with the License. Please refer to the license of the original Seq2Pat repository (for the original work) and to the Apache License 2.0 (for the adaptations) for specific terms and conditions regarding the use, distribution, and modification of the code.

In [None]:
import pandas as pd
from ast import literal_eval
from time import time
from IPython.display import display
from matplotlib_venn import venn2
import matplotlib.pyplot as plt
from pycaret.classification import *
from sequential.seq2pat import Seq2Pat, Attribute
from sequential.pat2feat import Pat2Feat
from sequential.dpm import dichotomic_pattern_mining, DichotomicAggregation

## Arguments

In [None]:
args = {}
args['data'] = "./data/dfc_data.csv"
args['min_frequency_pos'] = 0.05
args['min_frequency_neg'] = 0.05
args['max_span'] = None

### Transform sequence from string to list

In [None]:
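# Load the data (assumption: a CSV at args['data'] with one row per subject)
sequence_df = pd.read_csv(args['data'])

# Input lists: one list of states and one list of event times per row
sequences = sequence_df[[col for col in sequence_df.columns if 'sequence' in col]].values.tolist()
times = sequence_df[[col for col in sequence_df.columns if 'time' in col]].values.tolist()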
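
The cell above reads sequences that are already spread across columns. If the data instead stored each sequence as a single text cell, which is what the heading and the `literal_eval` import suggest, the conversion could look like the following minimal sketch. The CSV content and the column names `event_sequence` and `event_time` are hypothetical, and only the standard library is used so that the example stands alone.

```python
import csv
import io
from ast import literal_eval

# Hypothetical CSV in which each cell stores a whole list as text.
raw_csv = io.StringIO(
    'event_sequence,event_time\n'
    '"[1, 3, 3, 2]","[0, 2000, 4000, 6000]"\n'
    '"[4, 1]","[0, 2000]"\n'
)

toy_sequences, toy_times = [], []
for row in csv.DictReader(raw_csv):
    # literal_eval parses Python literals only; it does not execute code.
    toy_sequences.append(literal_eval(row["event_sequence"]))
    toy_times.append(literal_eval(row["event_time"]))

print(toy_sequences)
print(toy_times)
```

With a pandas DataFrame, the same conversion is typically written as `df['event_sequence'].apply(literal_eval)`.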

## Data Exploration

In [None]:
# EDA: number of sequences, maximum and average length, and class counts
num_sequences = len(sequence_df)
max_len = max(len(x) for x in sequences)
avg_len = sum(len(x) for x in sequences) / len(sequences)
# In this notebook, the "positive" cohort is label_2class_PD == 0
num_pos = len(sequence_df[sequence_df['label_2class_PD']==0])

print(f'Number of sequences: {num_sequences}')
print(f'Maximum length: {max_len}')
print(f'Average length: {avg_len:.2f}')
print(f'Number of positives: {num_pos}; Number of negatives: {num_sequences - num_pos}')

## Seq2Pat for Positive Labels
- There is one attribute for each event: `event_time`.
- Constraint 1: The average event time of a pattern must be at least 20 sec (`20000 <= average`, assuming times are recorded in milliseconds). In this adaptation the constraint is commented out in the code cell, so it is not applied.
- Constraint 2: The built-in span constraint in Seq2Pat, configured by the `max_span` parameter, restricts mining to patterns that fall within a span of `max_span` items (`max_span=10` by default).
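
To make the two constraints concrete, here is a minimal pure-Python sketch of what they mean for a single sequence. It is not Seq2Pat's implementation: the helper names `occurrences` and `satisfies` are hypothetical, the toy times are assumed to be in milliseconds, and the span is assumed to count the positions covered from the first to the last matched item.

```python
from itertools import combinations


def occurrences(sequence, pattern):
    """Yield index tuples where `pattern` appears as an ordered subsequence."""
    for idx in combinations(range(len(sequence)), len(pattern)):
        if all(sequence[i] == p for i, p in zip(idx, pattern)):
            yield idx


def satisfies(sequence, times, pattern, min_avg_time=None, max_span=None):
    """Return True if at least one occurrence meets both optional constraints."""
    for idx in occurrences(sequence, pattern):
        avg_time = sum(times[i] for i in idx) / len(idx)
        avg_ok = min_avg_time is None or avg_time >= min_avg_time
        # Assumption: span = number of positions from first to last matched item.
        span_ok = max_span is None or (idx[-1] - idx[0] + 1) <= max_span
        if avg_ok and span_ok:
            return True
    return False


toy_sequence = [1, 2, 3, 1, 2]
toy_times = [10000, 15000, 30000, 25000, 40000]  # hypothetical times in ms

unconstrained = satisfies(toy_sequence, toy_times, (1, 2))
with_average = satisfies(toy_sequence, toy_times, (1, 2), min_avg_time=20000)
with_both = satisfies(toy_sequence, toy_times, (1, 2), min_avg_time=20000, max_span=2)
too_strict = satisfies(toy_sequence, toy_times, (1, 2), min_avg_time=35000)
print(unconstrained, with_average, with_both, too_strict)
```

The brute-force enumeration is only meant for toy inputs; it grows combinatorially with sequence length.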

In [None]:
# Get sequences having positive labels, and associated attributes.

# sequences_pos = sequence_df[sequence_df['label_2class_PD']==1]['event_sequence'].values.tolist()
sequences_pos = sequence_df[sequence_df['label_2class_PD']==0][[col for col in sequence_df.columns if 'sequence' in col]].values.tolist()

# times_pos = sequence_df[sequence_df['label_2class_PD']==0]['event_time'].values.tolist()
times_pos = sequence_df[sequence_df['label_2class_PD']==0][[col for col in sequence_df.columns if 'time' in col]].values.tolist()

seq2pat_pos = Seq2Pat(sequences_pos)

# Define a constraint on event time, average time >= 20 sec
# time_attr_pos = Attribute(times_pos)
# time_ct_pos = 20000 <= time_attr_pos.average()

# Add constraints to seq2pat
# cs_pos = seq2pat_pos.add_constraint(time_ct_pos)

## Seq2Pat for Negative Labels
- Here we apply the same constraint models to sequences with negative labels

In [None]:
# Get sequences having negative labels, and associated attributes.

# sequences_neg = sequence_df[sequence_df['label']==0]['event_sequence'].values.tolist()
sequences_neg = sequence_df[sequence_df['label_2class_PD']==1][[col for col in sequence_df.columns if 'sequence' in col]].values.tolist()

# times_neg = sequence_df[sequence_df['label']==0]['event_time'].values.tolist()
times_neg = sequence_df[sequence_df['label_2class_PD']==1][[col for col in sequence_df.columns if 'time' in col]].values.tolist()

seq2pat_neg = Seq2Pat(sequences_neg)

# Define a constraint on event time, average time >= 20 sec
# time_attr_neg = Attribute(times_neg)
# time_ct_neg = 20000 <= time_attr_neg.average()

# Add constraints to seq2pat
# cs_neg = seq2pat_neg.add_constraint(time_ct_neg)

## Dichotomic Pattern Mining: From Sequences to Patterns
- In DPM, we utilize the two Seq2Pat models for positive and negative sequences, mine the patterns that are frequent in each outcome and return different aggregations of mined patterns from the two cohorts.
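
The aggregation step can be pictured with plain set operations. The sketch below is illustrative only and does not call Seq2Pat: `aggregate_patterns` is a hypothetical helper, the toy patterns are made up, and the `union` key is included on the assumption that DPM also returns it alongside the three keys used later in this notebook.

```python
def aggregate_patterns(patterns_pos, patterns_neg):
    """Combine the frequent patterns of two cohorts into named aggregations."""
    pos = {tuple(p) for p in patterns_pos}
    neg = {tuple(p) for p in patterns_neg}
    return {
        "intersection": sorted(pos & neg),     # frequent in both cohorts
        "union": sorted(pos | neg),            # frequent in at least one cohort
        "unique_positive": sorted(pos - neg),  # frequent only in the positive cohort
        "unique_negative": sorted(neg - pos),  # frequent only in the negative cohort
    }


# Hypothetical frequent patterns mined from each cohort
toy_pos = [[1, 2], [2, 3], [1, 2, 3]]
toy_neg = [[2, 3], [3, 1]]

toy_aggregations = aggregate_patterns(toy_pos, toy_neg)
for name, patterns in toy_aggregations.items():
    print(name, patterns)
```

Patterns are converted to tuples because lists are not hashable and so cannot be stored in a set.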

In [None]:
t = time()

# Run DPM on positive and negative patterns and return a dict of pattern aggregations
aggregation_to_patterns = dichotomic_pattern_mining(seq2pat_pos, seq2pat_neg,
                                                    args['min_frequency_pos'],
                                                    args['min_frequency_neg'])

print(f'DPM finished! Runtime: {time()-t:.4f} sec')

for aggregation, patterns in aggregation_to_patterns.items():
    print("Aggregation: ", aggregation, " with number of patterns: ", len(patterns))
    if aggregation == 'unique_positive':
        unique_positive = len(patterns)
    elif aggregation == 'unique_negative':
        unique_negative = len(patterns)
    elif aggregation == 'intersection':
        intersection = len(patterns)

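# Create the Venn diagram; venn2 takes subset sizes as (only A, only B, A and B),
# with A = PD (negative cohort, label 1) and B = HC (positive cohort, label 0)
venn2(subsets=(unique_negative, unique_positive, intersection), set_labels=(r'$\mathbf{PD}$', r'$\mathbf{HC}$'))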
plt.title("Venn Diagram of Detected Sequential Patterns")
plt.show()

## From Patterns to Encodings
- Finally, we generate encodings of all sequences based on an aggregation of patterns found by DPM.
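
As a rough picture of what an encoding is, the sketch below builds a one-hot matrix in which entry (i, j) is 1 when pattern j occurs in sequence i as an ordered, not necessarily contiguous, subsequence. It ignores constraints and `max_span`, and the helpers `contains` and `one_hot_encode` are hypothetical stand-ins rather than the Pat2Feat implementation.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` as an ordered subsequence."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so matches must appear in order.
    return all(item in remaining for item in pattern)


def one_hot_encode(sequences, patterns):
    """Return one row per sequence and one 0/1 column per pattern."""
    return [[int(contains(seq, pat)) for pat in patterns] for seq in sequences]


# Hypothetical sequences of brain-state labels and candidate patterns
toy_sequences = [[1, 2, 3], [3, 2, 1], [2, 2, 3]]
toy_patterns = [(1, 3), (2, 3), (3, 1)]

toy_encodings = one_hot_encode(toy_sequences, toy_patterns)
for seq, row in zip(toy_sequences, toy_encodings):
    print(seq, row)
```

Each row of the resulting matrix can then serve as a feature vector for a classifier.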

In [None]:
# Constraints are optional when generating encodings.
# A lower bound on the average event time is defined here (threshold 200, in the units of the time columns),
# but it is not applied because `constraints` is None below.
# The Seq2Pat built-in span constraint can be enforced in encodings generation by setting `max_span=10`.
time_attr = Attribute(times)
time_ct = 200 <= time_attr.average()

# List of constraints (None means no constraints are applied)
constraints = None

for aggregation, patterns in aggregation_to_patterns.items():
    print("Aggregation: ", aggregation)
    
    t = time()
    # find one hot encoding of each sequence for each pattern subject to constraints
    pat2feat = Pat2Feat()
    encodings = pat2feat.get_features(sequences, patterns, constraints, args['max_span'],
                                      drop_pattern_frequency=False)
    
    print(f'Encoding finished! Runtime: {time()-t:.4f} sec')
    display(encodings.head())

In [None]:
# Save the encodings to a CSV file.
# Note: `encodings` holds the result for the last aggregation processed in the loop above.
encodings.to_csv('data/encodings_dfc_082324_minsup05.csv')

In [None]:
# Load the DataFrame from the CSV file just created.
# NB: stop here if you have already added the labels column.
data_dfc = pd.read_csv("./data/encodings_dfc_082324_minsup05.csv")

# Drop the 'Unnamed: 0' column if it exists
if 'Unnamed: 0' in data_dfc.columns:
    data_dfc.drop('Unnamed: 0', axis=1, inplace=True)

# Assuming sequence_df['label_2class_PD'] is the series you want to add as the first column
# Using DataFrame.insert() to add it as the first column
data_dfc.insert(0, 'label_2class_PD', sequence_df['label_2class_PD'])

# Display the first few rows to verify
data_dfc.head()

In [None]:
# ONLY use this cell if you need to load a previously saved encodings file.
# In this example, the file is loaded after the target column (containing all 3 dx labels) was inserted as the first column.
data_dfc = pd.read_csv("./data/encodings_dfc_082324_minsup05_label2class_PDonly.csv")
# Display the first few rows to verify
data_dfc.head()

In [None]:
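# Run a comparison of various classification models from PyCaret.
# Note: the target column must exist in data_dfc. The cell that builds data_dfc from sequence_df
# inserts 'label_2class_PD', so 'label_2class_MCI' must already be present in the loaded file.
clf_dfc = setup(data=data_dfc, target='label_2class_MCI', train_size=0.80, ignore_features=['sequence'], session_id=123, imputation_type=None, verbose=False)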
best = compare_models()

In [None]:
rf_dfc = create_model('rf')

In [None]:
tuned_rf_dfc = tune_model(rf_dfc)

In [None]:
evaluate_model(tuned_rf_dfc)

In [None]:
lda_dfc = create_model('lda')

In [None]:
tuned_lda_dfc = tune_model(lda_dfc)

In [None]:
evaluate_model(tuned_lda_dfc)

In [None]:
# List the patterns found only in the positive cohort
unique_positive_patterns = aggregation_to_patterns['unique_positive']
print(unique_positive_patterns)