# Transition Network Analysis (TNA) Tutorial

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mohsaqr/tnapy/blob/main/tutorial.ipynb)

## Introduction

Transition Network Analysis (TNA) represents a novel methodological approach that captures the temporal and relational dynamics of unfolding processes. The core principle involves representing transition matrices between events as graphs, enabling researchers to leverage graph theory and network analysis comprehensively.

TNA functions as a sophisticated combination of process mining and network analysis. Where process mining typically generates sequential maps, TNA represents these through network analysis — but with considerably greater analytical depth. The method applies network analysis to capture structure, time, and relationships holistically. Compared to traditional process mining models, TNA incorporates network measures at node, edge, and graph levels, revealing which events hold importance through centrality measures, which transitions prove central, and which processes demonstrate greater connectivity. The method extends beyond standard network analysis by clustering sub-networks into different network constellations representing typical temporal event patterns — often called tactics.

A distinctive innovation involves statistical validation techniques unavailable in conventional approaches. These include edge verification through bootstrapping, network comparison via permutation testing, and centrality verification through case-dropping methods. These statistical techniques introduce rigor and validation at each analytical step, enabling researchers to verify which edges demonstrate replicability and confirm that inferences remain valid rather than chance artifacts.

### Why TNA?

Learning operates as a complex dynamic system — a collection of interconnected components interacting across time where interactions can enhance, impede, amplify, or reinforce each other. These dynamic interactions generate emergent behaviors that resist full understanding through analyzing individual components in isolation. Such interactions frequently produce processes exceeding the simple sum of their parts, exhibiting non-linear dynamics.

For example, motivation catalyzes achievement, which subsequently catalyzes enhanced engagement, enjoyment, and motivation. These interdependencies, feedback loops, and non-linear dynamics create inherent complexity requiring modeling methods transcending traditional linear approaches. TNA, functioning as a dynamic probabilistic model, addresses these limitations by capturing uncertainties through directional probabilities between learning events. The method accommodates the non-linear, evolving character of learning processes while capturing the constellations and emergent patterns defining or shaping learning processes.

### The Building Blocks of TNA

The foundational elements of TNA are the transitions between events that make up a process. A transition is a conditional relationship between one occurrence and the next: from A to B (a contingency). TNA models the transitions in sequential data to compute transition probabilities between events. The resulting transition matrix is treated as a weighted directed network in which the weights are the transition probabilities and edge direction indicates the direction of the transition.

- **Nodes (V)** represent different learning events — watching videos, taking quizzes, submitting assignments — or alternatively, states, dialogue moves, collaborative roles, motivation states, or any event representable as sequence units.
- **Edges (E)** represent transitions between activities, displaying direction from one activity to the next.
- **Weights (W)** represent transitioning probabilities between events or states.
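
As a rough illustration of how nodes, edges, and weights arise from sequences, the sketch below counts transitions in a few invented sequences and row-normalizes the counts into probabilities. It uses only NumPy; the helper name `transition_matrix` and the toy data are assumptions made for this example and are not part of the `tna` package.

```python
import numpy as np

# Toy sequences of learning events (invented for illustration)
sequences = [
    ["video", "quiz", "assignment"],
    ["video", "video", "quiz"],
    ["quiz", "assignment", "video"],
]

def transition_matrix(seqs):
    """Count A -> B transitions and normalize each row to probabilities."""
    labels = sorted({state for seq in seqs for state in seq})
    index = {state: i for i, state in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for seq in seqs:
        # Pair each event with the event that follows it
        for src, dst in zip(seq[:-1], seq[1:]):
            counts[index[src], index[dst]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows without outgoing transitions stay all-zero instead of dividing by zero
    probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    return labels, probs

labels, weights = transition_matrix(sequences)
print(labels)            # node names (V)
print(weights.round(2))  # weights (W); non-zero cells are the edges (E)
```

Each non-zero cell `weights[i, j]` is a directed edge from `labels[i]` to `labels[j]`.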

This tutorial demonstrates the complete TNA workflow using the Python `tna` package — from data preparation through model building, visualization, pruning, pattern detection, centrality analysis, community detection, bootstrapping, and group comparison. This tutorial replicates the R TNA tutorial by Saqr & Lopez-Pernas (2025) using the Python implementation.

## 1. Installation & Setup

TNA can analyze any sequence-representable data with transitions or changes across time — learning event sequences, states, phases, roles, dialogue moves, or interactions. This data can originate from time-stamped learning management system data, coded interaction data, event-log data, or ordered event data.

Install the `tna` package and import the required libraries:

In [None]:
# Install tna package (uncomment for Google Colab)
# !pip install git+https://github.com/mohsaqr/tnapy.git -q

import tna
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.dpi'] = 150

print(f"TNA version: {tna.__version__}")

## 2. Getting Started with Long-Format Data

TNA works with sequential event data. The `tna` package accepts sequence data in several formats: a wide `DataFrame` where rows represent sequences and columns represent timepoints, a transition matrix, or long-format event data that gets reshaped using `prepare_data()`.

The built-in dataset contains coded collaborative regulation behaviors from learning sessions, with columns for action, actor, and time. Let's start by loading the long-format dataset:

In [None]:
# Load the built-in dataset of coded collaborative regulation behaviors
group_regulation_long = tna.load_group_regulation_long()
print(f"Shape: {group_regulation_long.shape}")
group_regulation_long.head(10)

Each row is a single event with columns:
- **Action**: The behavioral state (becomes a network node)
- **Actor**: Participant ID (one sequence per actor)
- **Time**: Timestamp (for ordering and session splitting)
- **Achiever**: Achievement group (High/Low, used later for group comparison)
- **Group**: Group identifier
- **Course**: Course identifier

## 3. Understanding `prepare_data()`

The `prepare_data()` function converts long-format event logs into sequences suitable for TNA. It handles session splitting (based on time gaps), ordering, and reshaping. To generate individual sequences for each actor, you must specify both the `actor` and `action` columns.

When timestamps are provided via the `time` column, events happening less than 15 minutes apart are grouped in the same sequence, while events occurring after a longer gap mark the start of a new sequence (session). You can customize this gap using the `time_threshold` argument (in minutes).
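
To make the session-splitting idea concrete, here is a minimal pandas sketch of the same logic on an invented event log. It is not the implementation used by `prepare_data()`; the column names mirror the dataset, while the toy data and the choice to start a new session only when the gap is strictly greater than the threshold are assumptions.

```python
import pandas as pd

# Toy event log (invented); two actors with timestamps
events = pd.DataFrame({
    "Actor": ["a1", "a1", "a1", "a1", "a2", "a2"],
    "Action": ["plan", "monitor", "plan", "discuss", "plan", "discuss"],
    "Time": pd.to_datetime([
        "2025-01-01 10:00", "2025-01-01 10:05", "2025-01-01 10:40",
        "2025-01-01 10:45", "2025-01-01 09:00", "2025-01-01 09:10",
    ]),
})

threshold = pd.Timedelta(minutes=15)  # assumed gap that starts a new session

events = events.sort_values(["Actor", "Time"]).reset_index(drop=True)
# Gap to the previous event of the same actor (NaT for each actor's first event)
gap = events.groupby("Actor")["Time"].diff()
# A new session starts at an actor's first event or after a gap above the threshold
new_session = gap.isna() | (gap > threshold)
events["Session"] = new_session.astype(int).groupby(events["Actor"]).cumsum()

# Reshape to wide format: one row per (actor, session), one column per position
events["Position"] = events.groupby(["Actor", "Session"]).cumcount()
wide = events.pivot(index=["Actor", "Session"], columns="Position", values="Action")
print(wide)
```

Actor `a1` ends up with two sessions because of the 35-minute pause between the second and third events.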

An important advantage of using `prepare_data()` prior to constructing the TNA model is that you get to keep other variables of the data (metadata) and use them in your analysis. For instance, you can use `group_tna()` to create a TNA model by achievement group by passing the result of `prepare_data()` and indicating the name of the grouping column.

In [None]:
# Convert long-format event log into sequences for TNA
prepared_data = tna.prepare_data(
    group_regulation_long,
    action="Action",   # column with behavioral states (become network nodes)
    actor="Actor",     # column with participant IDs (one sequence per actor)
    time="Time"        # column with timestamps (for ordering and session splitting)
)
prepared_data

In [None]:
# View the wide-format sequence data (rows = sequences, columns = positions)
print("Sequence data shape:", prepared_data.sequence_data.shape)
prepared_data.sequence_data.head()

In [None]:
# View the preserved metadata (e.g., Achiever group) for each sequence
prepared_data.meta_data.head()

### Alternative Input Formats

In addition to long-format data processed via `prepare_data()`, TNA models can be built directly from:

- **Wide-format data**: A DataFrame where each row is a sequence and each column represents a time step. This is the most straightforward format when sequences are already aligned.
- **Pre-computed transition matrices**: A square DataFrame or NumPy array where entry (i, j) represents the transition probability or frequency from state i to state j.

These alternative inputs provide flexibility for researchers who already have their data in processed formats:

In [None]:
# Wide-format data (rows = sequences, columns = time steps)
group_regulation = tna.load_group_regulation()
print("Wide-format shape:", group_regulation.shape)
group_regulation.head()

In [None]:
# Pre-computed transition matrix
mat = np.array([
    [0.1, 0.6, 0.3],
    [0.4, 0.2, 0.4],
    [0.3, 0.3, 0.4]
])
labels = ["A", "B", "C"]
model_from_matrix = tna.tna(pd.DataFrame(mat, index=labels, columns=labels))
print(model_from_matrix)

### Importing One-Hot Encoded Data

Some datasets encode states as binary (0/1) indicator columns rather than categorical labels — for example, coded observation data where each column indicates whether a particular behavior was present in each time interval. The `import_onehot()` function converts this one-hot format into wide-format sequence data suitable for `tna()`.

The function supports windowing to group multiple time intervals together:
- **`window_size`**: Number of rows per window (default: 1, each row becomes one time step)
- **`window_type`**: `'tumbling'` (non-overlapping chunks) or `'sliding'` (step-by-1 overlap)
- **`aggregate`**: If `True`, collapse each window to the first active state per column (reduces width)

When `actor` or `session` columns are provided, windowing is applied within each group, producing one row per actor/session with all windows concatenated.

In [None]:
# Create example one-hot encoded data (e.g., coded classroom observations)
onehot_data = pd.DataFrame({
    "actor": ["s1"] * 6 + ["s2"] * 6,
    "Reading":  [1, 0, 0, 1, 0, 0,  0, 1, 0, 0, 1, 0],
    "Writing":  [0, 1, 0, 0, 1, 0,  1, 0, 0, 1, 0, 0],
    "Discuss":  [0, 0, 1, 0, 0, 1,  0, 0, 1, 0, 0, 1],
})
print("One-hot input:")
print(onehot_data)

# Convert to wide-format sequences (one row per actor)
states = ["Reading", "Writing", "Discuss"]
wide_seq = tna.import_onehot(onehot_data, cols=states, actor="actor")
print("\nWide-format output:")
wide_seq

## 4. Building the TNA Model

TNA begins by building the primary TNA object (called `model`), which contains all the information needed for further analysis such as plotting, centrality estimation, or comparison. The model is estimated with the `tna()` function, which fits a Markov model to the data: initial probabilities are taken from the observed proportions of first states, and transition probabilities from the observed transition frequencies.

The resulting model contains:

- **Initial Probabilities (`inits`)**: Define the likelihood of starting in a particular state at the beginning of the process (the first time point, before transitions). In educational contexts, initial probability represents the probability that students begin in specific states (such as "engaged" or "motivated") before activities or interventions occur. These probabilities provide a process snapshot showing student starting positions.

- **Transition Probabilities (`weights`)**: Describe the likelihood of moving from one state to another at each step of the process. They capture how students move between different learning states or events. Each row of the transition matrix sums to 1, representing a complete probability distribution over next states.

- **Labels (`labels`)**: Provide descriptive network node names, enhancing analysis interpretability. Labels automatically derive from the data categories.

- **Data (`data`)**: The sequence data used to build the model, stored internally for further analysis (permutation testing, bootstrapping, etc.).
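
The initial probabilities can be illustrated with a few lines of plain Python and NumPy. This is a sketch on invented sequences, assuming initial probabilities are simply the share of sequences that start in each state; the variable names are hypothetical and independent of the `tna` object.

```python
from collections import Counter

import numpy as np

# Toy sequences (invented for illustration)
sequences = [
    ["plan", "monitor", "discuss"],
    ["plan", "discuss", "discuss"],
    ["discuss", "plan", "monitor"],
    ["plan", "monitor", "plan"],
]
labels = sorted({state for seq in sequences for state in seq})

# Share of sequences that begin in each state
first_states = Counter(seq[0] for seq in sequences)
inits = np.array([first_states[state] / len(sequences) for state in labels])

for state, prob in zip(labels, inits):
    print(f"{state}: {prob:.2f}")
```

Like each row of the transition matrix, the initial probabilities form a distribution and sum to 1.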

In [None]:
# Build the TNA model from the prepared sequence data
model = tna.tna(prepared_data)
print(model)

In [None]:
# Inspect the transition probability matrix
weights_df = model.to_dataframe()
weights_df.round(3)

In [None]:
# Inspect initial probabilities
init_df = pd.Series(model.inits, index=model.labels, name="Initial Probability")
init_df.round(3)

In [None]:
# Model summary
model.summary()

## 5. Visualizations

Visualizing a TNA model gives a bird's-eye view of the learning process: its overall structure, how events connect, which patterns stand out, and how events relate over time. TNA provides several visualization features for exploring and comparing networks.

### 5.1 Transition Network Plot

The network plot represents a directed weighted network where each node (state, event, or learning activity) appears as a colored circle. Node-to-node arrows represent weighted transition probabilities with direction showing transition routes. Loops represent identical state repetition probabilities. Edge width and opacity reflect transition probability — thicker, more opaque edges indicate stronger transitions.

The `plot_network()` function provides two key parameters for managing visual complexity:

- **`minimum`**: Hides edges below this weight entirely, removing visual clutter. Note that these small probabilities remain in the model for all subsequent computations — this is purely a visual filter.
- **`cut`**: Fades edges below this weight (reduced opacity) but still shows them, allowing researchers to see the full network while emphasizing stronger transitions.

In [None]:
# minimum: hide edges below 0.05; cut: fade edges below 0.1
tna.plot_network(model, minimum=0.05, cut=0.1)
plt.show()

### 5.2 Histogram of Edge Weights

Examining the distribution of transition probabilities helps researchers understand the overall structure of the network — whether transitions are uniformly distributed or concentrated among a few strong connections. This informs decisions about pruning thresholds and helps identify the natural "backbone" of the network:

In [None]:
tna.plot_histogram(model)
plt.show()

### 5.3 Frequency Distribution of States

The frequency distribution shows how often each state occurs across all sequences. This helps identify which states dominate the data and provides context for interpreting the transition network:

In [None]:
# Bar chart of how often each state appears across all sequences
tna.plot_frequencies(model)
plt.show()

### 5.4 Mosaic Plot

The mosaic (marimekko) plot visualizes the transition matrix as a contingency table. Tile widths are proportional to column totals (incoming transitions) and tile heights are proportional to row proportions (outgoing transitions). Colors represent adjusted standardized residuals from a chi-squared test — blue tiles indicate more transitions than expected, red tiles indicate fewer. This requires a frequency model built with `ftna()`:

In [None]:
# Build frequency model and plot mosaic
fmodel = tna.ftna(prepared_data)
tna.plot_mosaic(fmodel)
plt.show()
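
For readers who want to see where the tile colors come from, the following sketch computes adjusted standardized residuals for a small, invented frequency table using the textbook formula. The exact computation inside `plot_mosaic()` may differ; the table and variable names are assumptions.

```python
import numpy as np

# Toy transition frequency table (rows = from-state, columns = to-state)
counts = np.array([
    [30.0, 10.0, 5.0],
    [8.0, 25.0, 12.0],
    [6.0, 9.0, 20.0],
])

n = counts.sum()
row_tot = counts.sum(axis=1, keepdims=True)  # outgoing totals, shape (3, 1)
col_tot = counts.sum(axis=0, keepdims=True)  # incoming totals, shape (1, 3)

# Expected counts if source and target states were independent
expected = row_tot @ col_tot / n

# Adjusted standardized residuals: (O - E) / sqrt(E * (1 - row/n) * (1 - col/n))
adj_resid = (counts - expected) / np.sqrt(
    expected * (1 - row_tot / n) * (1 - col_tot / n)
)
print(adj_resid.round(2))
```

Positive residuals mark transitions that occur more often than expected under independence, and negative residuals mark those that occur less often.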

## 6. Pruning

Transition networks are commonly fully connected or saturated, with nearly every node connected to every other node with some probability. A mechanism is therefore needed to retrieve the core or backbone of the network and make it sparse. Sparsity improves interpretability by removing overly complex structure, which makes important components and relationships easier to identify. It also separates signal from noise: removing small, noisy edges that obscure meaningful patterns lets researchers focus on the important interactions.

While researchers can use the `minimum` argument in `plot_network()` to visually hide small edges, those small probabilities remain in the model for all subsequent computations. Researchers who want to actually remove negligible-weight edges from the model can use the `prune()` function, which retains only strong, meaningful connections.

The `prune()` function implements **threshold-based pruning**: edges below a specified threshold value are set to zero (default threshold is 0.05). This provides a clean model where only meaningful transitions remain for downstream analysis.
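
Threshold-based pruning is easy to picture on a bare matrix. The sketch below applies the same idea with NumPy to an invented weight matrix; it only illustrates the rule described above and is not the code behind `prune()`.

```python
import numpy as np

# Toy transition matrix (invented); each row sums to 1
weights = np.array([
    [0.02, 0.60, 0.38],
    [0.45, 0.04, 0.51],
    [0.03, 0.27, 0.70],
])
threshold = 0.05

# Set every edge whose weight falls below the threshold to zero
pruned_weights = np.where(weights < threshold, 0.0, weights)

print("Edges before:", np.count_nonzero(weights))
print("Edges after: ", np.count_nonzero(pruned_weights))
print(pruned_weights)
```

In this sketch the surviving weights are left unchanged, so the rows of the pruned matrix no longer sum exactly to 1.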

Pruning with TNA can also be accomplished through bootstrapping (demonstrated in the bootstrapping section below), which offers a statistically grounded approach to identifying and eliminating small and uncertain edges.

In [None]:
# Prune: remove edges with weight below 0.05
pruned = tna.prune(model, threshold=0.05)

print(f"Original edges: {model.summary()['n_edges']}")
print(f"Pruned edges:   {pruned.summary()['n_edges']}")

In [None]:
# Plot the pruned network
tna.plot_network(pruned, cut=0.1)
plt.show()

## 7. Patterns: Cliques

Patterns help researchers understand behavior, identify significant structures, and describe processes in detail. They are the fundamental building blocks of the structure and dynamics of learning processes, and they offer insight into learners' behavior and strategies while studying or interacting with learning materials. Furthermore, capturing repeated, consistent patterns enables theory building and generalizable inferences.

TNA supports identifying several types of n-clique patterns. A network clique is a subset of graph nodes in which every pair of nodes is directly connected by an edge. In network terms, cliques represent tightly knit communities, closely related entities, or interdependent nodes that shape how learning unfolds.

The `cliques()` function identifies n-cliques from TNA models. Its arguments include:
- **`size`**: The clique size to search for (size=2 finds dyads, size=3 finds triads, etc.)
- **`threshold`**: The minimum edge weight required for an edge to participate in a clique

**Dyads** represent TNA's simplest patterns — transitions between two nodes. Mutual dyads (bidirectional) with high edge weights indicate strong interdependence through recurrent occurrence. For instance, consistently moving from reading materials to quiz-taking indicates strong self-evaluative strategies.

**Triads** capture more complex three-node relationships. In TNA, three-node cliques where each connects to the others in either direction indicate strong interdependent node subgroups forming a process core. Triads represent higher-order learning behavior dependencies.
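
The sketch below shows one simple way to search for such patterns in a weight matrix. It assumes a strict reading in which every pair of nodes in the group must be connected in both directions with a weight at or above the threshold; the `cliques()` function may use a different rule. The helper `mutual_cliques` and the toy matrix are hypothetical.

```python
from itertools import combinations

import numpy as np

labels = ["plan", "monitor", "discuss", "adapt"]
# Toy transition matrix (invented); each row sums to 1
weights = np.array([
    [0.10, 0.40, 0.30, 0.20],
    [0.35, 0.05, 0.40, 0.20],
    [0.30, 0.45, 0.20, 0.05],
    [0.50, 0.02, 0.08, 0.40],
])

def mutual_cliques(weights, labels, size, threshold):
    """Return node groups in which every ordered pair has weight >= threshold."""
    found = []
    for group in combinations(range(len(labels)), size):
        if all(weights[i, j] >= threshold
               for i in group for j in group if i != j):
            found.append(tuple(labels[i] for i in group))
    return found

dyads = mutual_cliques(weights, labels, size=2, threshold=0.1)
triads = mutual_cliques(weights, labels, size=3, threshold=0.1)
print("Dyads: ", dyads)
print("Triads:", triads)
```

Checking every combination is only practical because transition networks usually have few nodes.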

We search for cliques of size 2, 3, and 4 with decreasing thresholds (larger cliques are rarer, so lower thresholds are needed):

In [None]:
# Find cliques of size 2, 3, and 4 with decreasing thresholds
cliques_of_two   = tna.cliques(model, size=2, threshold=0.1)   # dyads
cliques_of_three = tna.cliques(model, size=3, threshold=0.05)  # triads
cliques_of_four  = tna.cliques(model, size=4, threshold=0.03)  # quads

In [None]:
print(cliques_of_two)

In [None]:
print(cliques_of_three)

In [None]:
print(cliques_of_four)

## 8. Centralities

Centrality measures quantify the role or importance of states or events in processes. With centrality measures, researchers can rank events by their value in bridging interactions (betweenness centrality) or receiving the most transitions (in-strength centrality). Centrality measures reveal which behaviors or cognitive states prove central to learning processes — as frequent transition destinations, starting points for various actions, bridges between learning activities, or keys to spreading phenomena. Using centrality measures, researchers can identify important events to target for intervention or improvement.

Importantly, the raw (absolute) values of centrality measures have no inherent meaning in TNA. What matters is their relative values, which allow nodes to be ranked and their relative importance within the network to be identified.

### 8.1 Node-Level Centrality Measures

The `centralities()` function computes centrality measures using directed probabilistic process algorithms. By default, it removes loops from calculations (changeable via `loops=True`). Removing loops means all centrality computations proceed without considering self-transitioning or same-state repetition.

Available measures include:
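- **OutStrength / InStrength**: Sum of outgoing/incoming transition probabilities. In an unpruned model each row of the transition matrix sums to 1, so with self-loops removed out-strength equals one minus the self-transition probability. It therefore reflects state stability inversely: higher values indicate a greater likelihood of transitioning away from the state.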
- **Closeness / InCloseness**: How quickly a state can reach (or be reached from) all other states.
- **Betweenness**: How often a state lies on shortest paths between other states, measuring its bridging role.
- **BetweennessRSP**: Betweenness based on randomized shortest paths — more appropriate for probabilistic networks.
- **Diffusion**: Measures how efficiently information or influence spreads from a state.
- **Clustering**: Local clustering coefficient reflecting the interconnectedness of a state's neighbors.
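
The two strength measures are simple enough to compute by hand. The following NumPy sketch does so on an invented transition matrix with self-loops removed; it illustrates the definitions above rather than reproducing the internals of `centralities()`.

```python
import numpy as np

labels = ["plan", "monitor", "discuss"]
# Toy transition matrix (invented); each row sums to 1
weights = np.array([
    [0.50, 0.30, 0.20],
    [0.10, 0.20, 0.70],
    [0.25, 0.35, 0.40],
])

# Remove self-loops before computing centralities
no_loops = weights.copy()
np.fill_diagonal(no_loops, 0.0)

out_strength = no_loops.sum(axis=1)  # total probability of leaving each state
in_strength = no_loops.sum(axis=0)   # total probability arriving from other states

for name, out_s, in_s in zip(labels, out_strength, in_strength):
    print(f"{name}: OutStrength={out_s:.2f}, InStrength={in_s:.2f}")
```

Because the rows sum to 1, each out-strength here equals one minus the corresponding diagonal entry.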

In [None]:
# Compute all centrality measures for each state
centrality_df = tna.centralities(model)
centrality_df.round(4)

In [None]:
# Plot centralities as faceted bar charts
tna.plot_centralities(centrality_df)
plt.show()

### 8.2 Edge-Level Measures: Edge Betweenness

In TNA, edge centrality measures quantify the importance of transitions between events, rather than of the events themselves, and so indicate how critical particular transitions are to the flow of the process. Edge betweenness centrality reflects how often a transition lies on the shortest paths between other states in the network.

Edge centrality measures help researchers understand not only which nodes are important but which transitions guide learning processes. For instance, a transition from "planning" to "task execution" might have high edge betweenness, indicating it serves as a critical bridge in the learning process.

The `betweenness_network()` function creates a new TNA model where edge weights are replaced with their betweenness centrality values:

In [None]:
# Compute edge betweenness for all transitions
edge_betweenness = tna.betweenness_network(model)

# Show the betweenness values
edge_betweenness.to_dataframe().round(3)

In [None]:
# Plot edge betweenness network
tna.plot_network(edge_betweenness, cut=0.1, title="Edge Betweenness Network")
plt.show()

### 8.3 Centrality Stability

Centrality stability assessment determines whether centrality rankings remain consistent when cases are progressively dropped from the data. The `estimate_cs()` function implements the case-dropping bootstrap approach: it repeatedly drops increasing proportions of cases (10% to 90%) and recalculates centralities, measuring rank-order correlation with the original.

The CS coefficient represents the maximum proportion of cases that can be dropped while maintaining a correlation above 0.7 with at least 95% certainty. CS values above 0.5 indicate stable centrality rankings, while values below 0.25 suggest instability:

In [None]:
# Centrality stability: case-dropping bootstrap
cs_result = tna.estimate_cs(model, iter=200, seed=42)
print("CS coefficients:", cs_result.cs_coefficients)
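
The case-dropping idea can be sketched with NumPy alone. The example below simulates sequences from an invented Markov chain, drops a share of them at random, recomputes in-strength, and correlates the ranks with those from the full data. It is a simplified illustration: the simulated data, the helper names, the ordinal tie handling, and the use of the mean correlation (rather than the CS coefficient itself) are all assumptions and do not reproduce `estimate_cs()`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented transition matrix used only to simulate toy sequences
P = np.array([
    [0.1, 0.6, 0.2, 0.1],
    [0.3, 0.1, 0.5, 0.1],
    [0.4, 0.3, 0.1, 0.2],
    [0.5, 0.3, 0.1, 0.1],
])
n_states, n_seq, length = 4, 100, 10
sequences = np.zeros((n_seq, length), dtype=int)
sequences[:, 0] = rng.integers(0, n_states, size=n_seq)
for t in range(1, length):
    for i in range(n_seq):
        sequences[i, t] = rng.choice(n_states, p=P[sequences[i, t - 1]])

def in_strength(seqs):
    """In-strength (loops removed) of the row-normalized transition matrix."""
    counts = np.zeros((n_states, n_states))
    np.add.at(counts, (seqs[:, :-1].ravel(), seqs[:, 1:].ravel()), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    np.fill_diagonal(probs, 0.0)
    return probs.sum(axis=0)

def ranks(x):
    """Ordinal ranks (ties broken by position); sufficient for a sketch."""
    return np.argsort(np.argsort(x))

original = in_strength(sequences)
mean_cor = {}
for drop in (0.1, 0.5, 0.9):
    keep = int(round(n_seq * (1 - drop)))
    cors = []
    for _ in range(100):
        # Keep a random subset of cases without replacement
        subset = sequences[rng.choice(n_seq, size=keep, replace=False)]
        cors.append(np.corrcoef(ranks(original), ranks(in_strength(subset)))[0, 1])
    mean_cor[drop] = float(np.mean(cors))
    print(f"drop {drop:.0%}: mean rank correlation = {mean_cor[drop]:.2f}")
```

A centrality whose rank correlation stays high even when most cases are dropped is one whose ranking can be trusted more.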

## 9. Community Detection

Communities comprise nodes more closely related or densely interconnected together than to other network nodes. In TNA, communities group states or events that frequently transition between one another or share similar dynamics. Communities represent cohesive sequences or activity successions that are more likely to co-occur, revealing typical pathways or recurring behaviors.

Unlike cliques — which maintain fixed or predefined structures (2-cliques or 3-cliques) — communities are data-driven based on connectivity patterns, making them more descriptive of real-world structures. Community identification uncovers latent or hidden clusters of related interaction or behavior during learning. Identifying these clusters provides insight into collaboration and learning effectiveness, common regulatory practices, or interaction patterns.

Furthermore, identifying communities of behaviors or events can contribute to theory building and to the understanding of learning. Communities condense densely connected behaviors into simpler, meaningful structures, suggesting the presence of underlying constructs or behavioral mechanisms.

The `communities()` function supports several detection algorithms suited for transition networks (typically small, weighted, and directed):

- **Leading Eigenvector** (`leading_eigen`): Uses the leading eigenvector of the modularity matrix to partition nodes. This is the default method.
- **Fast Greedy** (`fast_greedy`): Optimizes modularity by iteratively merging communities.
- **Louvain** (`louvain`): A multi-level modularity optimization algorithm.
- **Label Propagation** (`label_prop`): Each node adopts the most common community among its neighbors.
- **Edge Betweenness** (`edge_betweenness`): Iteratively removes high-betweenness edges to reveal communities.

In [None]:
# Detect communities using the default algorithm (leading eigenvector)
comms = tna.communities(model)
print(comms)

In [None]:
# Plot communities: nodes colored by community assignment
tna.plot_communities(comms, cut=0.1)
plt.show()

In [None]:
# Try multiple community detection methods
comms_multi = tna.communities(model, methods=["leading_eigen", "louvain", "fast_greedy"])
print(comms_multi)

## 10. Bootstrapping

### 10.1 Why Bootstrap?

Bootstrapping is a robust validation technique for assessing the accuracy and stability of edge weights and, consequently, for validating the model as a whole. Through bootstrapping, researchers can verify each edge, determine its statistical significance, and obtain confidence intervals for the transition probabilities. Most network and process mining research relies on descriptive methods; model validation and significance testing remain largely absent from the literature. Validated models let researchers assess robustness and reproducibility, helping ensure that insights do not arise by chance and are therefore more likely to generalize.

Bootstrapping is a resampling technique that repeatedly draws samples from the original dataset **with replacement** and estimates a model for each sample (usually hundreds or thousands of times). It requires no strong assumptions about the distribution of the data, which makes it suitable for process data that often do not follow a specific distribution. Because sampling is done with replacement, each sample may include multiple copies of some observations while excluding others, which makes it possible to assess the variability of the parameter estimates. Edges that appear consistently across most of the estimated models are considered stable and significant.

Another key advantage of bootstrapping is that it can effectively prune dense networks. A common challenge in probabilistic networks such as TNA is that they tend to be completely connected, meaning that every possible connection between nodes exists to some degree. Bootstrapping mitigates this by identifying and eliminating small and uncertain edges, effectively retrieving the network backbone. The resulting simplified network is easier to interpret and more likely to generalize.
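
The resampling step itself can be sketched in a few lines of NumPy. The example below resamples whole sequences with replacement, re-estimates the transition matrix each time, and takes percentile confidence intervals per edge. The random toy data, the helper `transition_probs`, and the percentile interval are assumptions made for illustration; the p-value calculation used by the package is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states = 3
# Toy data: 80 random sequences of length 10, coded as integers 0..2
sequences = rng.integers(0, n_states, size=(80, 10))

def transition_probs(seqs, n_states):
    """Row-normalized transition matrix estimated from integer-coded sequences."""
    counts = np.zeros((n_states, n_states))
    np.add.at(counts, (seqs[:, :-1].ravel(), seqs[:, 1:].ravel()), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

observed = transition_probs(sequences, n_states)

n_boot = 500
boot = np.empty((n_boot, n_states, n_states))
for b in range(n_boot):
    # Resample whole sequences with replacement, then re-estimate the model
    idx = rng.integers(0, len(sequences), size=len(sequences))
    boot[b] = transition_probs(sequences[idx], n_states)

# 95% percentile confidence interval for every edge
ci_lower, ci_upper = np.percentile(boot, [2.5, 97.5], axis=0)
print("Observed:\n", observed.round(3))
print("CI lower:\n", ci_lower.round(3))
print("CI upper:\n", ci_upper.round(3))
```

Edges with narrow intervals are estimated consistently across resamples, while wide intervals signal uncertain edges.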

The `bootstrap_tna()` function calculates confidence intervals and p-values for each edge weight. By default it runs 1000 bootstrap iterations (set via the `iter` argument). The `level` argument sets the significance level (e.g., 0.05): edges whose bootstrap p-value falls below this level are deemed statistically significant.

In [None]:
# Resample sequences 1000 times and assess edge stability
np.random.seed(265)  # for reproducibility
boot = tna.bootstrap_tna(model, iter=1000, level=0.05, seed=265)

### 10.2 Results

The bootstrap result contains several elements:

- **`weights_sig`**: A matrix showing only statistically significant transitions (non-significant weights set to zero)
- **`weights_mean`**: Mean transition matrix across all bootstrap samples
- **`weights_sd`**: Standard deviation matrix across all bootstrap samples
- **`ci_lower` / `ci_upper`**: Bootstrap confidence interval bounds for each transition
- **`p_values`**: Bootstrap p-value matrix for each transition

The `summary()` method returns a convenient DataFrame with all of these statistics per edge:

In [None]:
# Extract the bootstrap summary table
boot_df = boot.summary()
boot_df.head(10)

In [None]:
# Keep only edges that survived the bootstrap and sort by weight
sig_edges = boot_df[boot_df["sig"] == True].sort_values("weight", ascending=False)
print(f"{len(sig_edges)} out of {len(boot_df)} edges are significant")
sig_edges.head(15)

### 10.3 Bootstrapped Network

The bootstrapped model (`boot.model`) contains only statistically significant edges — those that survived the bootstrap validation. Plotting this model shows the validated network backbone, which is more likely to generalize to new data:

In [None]:
# Plot the bootstrapped network (only significant edges)
tna.plot_network(boot.model, cut=0.1, title="Bootstrapped Network (significant edges)")
plt.show()

## 11. Sequence Plots

Sequence plots provide a direct visualization of the raw sequential data before it is aggregated into a transition network. These visualizations help researchers understand the variety and structure of individual sequences.

Two plot types are available:

- **Index plot**: Each row represents one sequence, with colors indicating the state at each position. This reveals the diversity and patterns in individual trajectories — whether sequences are highly varied or follow common templates.
- **Distribution plot**: Shows the proportion of each state at each sequence position, revealing how the state distribution evolves over time. This helps identify whether certain states dominate at the beginning or end of sequences.
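
The numbers behind a distribution plot are just per-position proportions. A small NumPy sketch on invented, equal-length sequences follows; the variable names are hypothetical and the sketch is independent of `plot_sequences()`.

```python
import numpy as np

states = ["plan", "monitor", "discuss"]
# Toy wide-format data: rows = sequences, columns = positions
sequences = np.array([
    ["plan", "monitor", "discuss"],
    ["plan", "plan", "discuss"],
    ["monitor", "plan", "discuss"],
    ["plan", "monitor", "monitor"],
])

# proportions[s, t] = share of sequences that are in state s at position t
proportions = np.array([(sequences == state).mean(axis=0) for state in states])

for state, row in zip(states, proportions):
    print(f"{state:>8}: {row}")
```

Each column of `proportions` sums to 1 and corresponds to one stacked bar in a distribution plot.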

In [None]:
# Each row is one sequence; colors represent states at each position
tna.plot_sequences(prepared_data, max_sequences=200)
plt.show()

In [None]:
# Proportion of each state at each sequence position
tna.plot_sequences(prepared_data, plot_type="distribution")
plt.show()

## 12. Group Models

Researchers frequently encounter predefined groups, such as high versus low achievers, different course types, or gender groups. Such groups have commonly been compared visually, by placing process models or sequence models side by side. While visual comparison may reveal differences, it cannot indicate whether they are statistically significant, or precisely where the significant differences lie.

TNA addresses this by enabling rigorous systematic group comparison. The `group_tna()` function builds separate TNA models for each level of a grouping variable. The metadata preserved by `prepare_data()` (e.g., the **Achiever** column) can be used directly as the grouping variable — no manual data splitting needed.

All standard TNA functions (`centralities()`, `prune()`, `communities()`, `cliques()`, `plot_network()`) work with group models: each is applied per group automatically and returns combined results. This enables researchers to examine how transition dynamics differ across subgroups without writing any group-splitting code.

In [None]:
# Build group models directly from the prepared data using the Achiever metadata column
group_model = tna.group_tna(prepared_data, group="Achiever")
print(group_model)
print()

# Summary statistics per group
group_model.summary()

In [None]:
# Access individual models using dict-style indexing
print(group_model["High"])
print()
print("Group names:", group_model.names())

In [None]:
# Plot all group networks side by side (automatic multi-panel)
tna.plot_network(group_model, minimum=0.05, cut=0.1)
plt.show()

In [None]:
# Prune all groups at once — returns a new GroupTNA with pruned models
pruned_group = tna.prune(group_model, threshold=0.05)
print(pruned_group)

# Compare edge counts
for name in group_model:
    orig = group_model[name].summary()["n_edges"]
    prun = pruned_group[name].summary()["n_edges"]
    print(f"  {name}: {orig} → {prun} edges")

In [None]:
# Centralities across groups — returns a single DataFrame with a 'group' column
group_cent = tna.centralities(group_model, measures=["OutStrength", "InStrength", "Betweenness"])
group_cent

In [None]:
# Communities per group
group_comms = tna.communities(group_model)
for name, result in group_comms.items():
    print(f"{name}: {result.counts}")
    print(result.assignments)
    print()

### 12.1 Permutation Testing Between Groups

To address the limitations of visual comparison, TNA uses permutation tests to determine whether observed differences between group models are statistically significant. A permutation test repeatedly shuffles the sequences between the groups and recomputes the edge-weight differences each time, which builds up the distribution of differences expected under the null hypothesis of no group difference. For each edge, the test yields a p-value that indicates how unusual the observed difference is under that null distribution. This helps researchers separate differences that are likely to be genuine from those that could plausibly arise by chance.

The `permutation_test()` function compares two TNA models by shuffling sequences between groups for a specified number of iterations (`iter`), creating a null distribution of edge-weight differences. Edges whose observed difference is extreme relative to this null distribution, that is, whose p-value falls below the significance threshold (`level`), are flagged as statistically significant.
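
The shuffling logic can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not `tnapy`'s implementation: it assumes row-normalized transition probabilities as edge weights, a two-sided test, and an add-one correction for the p-values. The helper names `edge_weights` and `permutation_test_edges` and the toy groups are hypothetical.

```python
import random


def edge_weights(sequences):
    """Row-normalized transition probabilities as a flat {(src, dst): prob} dict."""
    counts, totals = {}, {}
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[(src, dst)] = counts.get((src, dst), 0) + 1
            totals[src] = totals.get(src, 0) + 1
    return {edge: n / totals[edge[0]] for edge, n in counts.items()}


def permutation_test_edges(group_a, group_b, n_iter=500, seed=42):
    """Two-sided permutation p-value per edge for the difference in edge weights."""
    rng = random.Random(seed)
    w_a, w_b = edge_weights(group_a), edge_weights(group_b)
    edges = sorted(set(w_a) | set(w_b))
    observed = {e: w_a.get(e, 0.0) - w_b.get(e, 0.0) for e in edges}

    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = {e: 0 for e in edges}
    for _ in range(n_iter):
        # Reassign whole sequences to the two groups at random
        rng.shuffle(pooled)
        p_a, p_b = edge_weights(pooled[:n_a]), edge_weights(pooled[n_a:])
        for e in edges:
            diff = p_a.get(e, 0.0) - p_b.get(e, 0.0)
            if abs(diff) >= abs(observed[e]):
                extreme[e] += 1
    # Add-one correction keeps p-values strictly positive
    p_values = {e: (extreme[e] + 1) / (n_iter + 1) for e in edges}
    return observed, p_values


# Hypothetical toy groups with clearly different transition habits
group_a = [["plan", "execute", "plan", "execute"] for _ in range(10)]
group_b = [["plan", "plan", "plan", "execute"] for _ in range(10)]

observed, p_values = permutation_test_edges(group_a, group_b, n_iter=500, seed=42)
level = 0.05
for edge in observed:
    flag = "significant" if p_values[edge] < level else "not significant"
    print(edge, round(observed[edge], 3), round(p_values[edge], 4), flag)
```

Whole sequences are shuffled rather than individual transitions, so each permuted group remains a set of intact trajectories and only the group membership is randomized.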

Access individual group models with dict-style indexing to compare specific groups:

In [None]:
# Permutation test: compare High vs Low achievers
perm_result = tna.permutation_test(
    group_model["High"], group_model["Low"],
    iter=500, seed=42, level=0.05
)

# Show significant edge differences, smallest p-values first
edge_stats = perm_result.edges["stats"]
sig_perm = edge_stats[edge_stats["p_value"] < 0.05].sort_values("p_value")

print(f"{len(sig_perm)} significant edge differences found")
sig_perm

### 12.2 Difference Network

The `plot_compare()` function visualizes the difference between two TNA models as a network. Green edges indicate transitions that are stronger in the first model, red edges indicate transitions stronger in the second. Edge width is proportional to the absolute difference. Node colors reflect differences in initial probabilities:

In [None]:
# Difference network: High vs Low achievers
tna.plot_compare(group_model["High"], group_model["Low"])
plt.show()
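
The quantity behind a difference network is simply the signed difference between corresponding edge weights. The sketch below shows that computation in pure Python, assuming edge weights stored as `{(source, target): weight}` dictionaries; the helper `difference_edges` and the toy weights are hypothetical and are not part of `tnapy`.

```python
def difference_edges(weights_a, weights_b, minimum=0.0):
    """Signed edge differences (first minus second), noting which model is stronger."""
    edges = sorted(set(weights_a) | set(weights_b))
    rows = []
    for edge in edges:
        diff = weights_a.get(edge, 0.0) - weights_b.get(edge, 0.0)
        if abs(diff) <= minimum:
            continue  # skip edges with no (or negligible) difference
        stronger_in = "first" if diff > 0 else "second"
        rows.append((edge, diff, stronger_in))
    return rows


# Hypothetical edge weights for two models
high = {("plan", "execute"): 0.6, ("plan", "monitor"): 0.4, ("monitor", "plan"): 1.0}
low = {("plan", "execute"): 0.3, ("plan", "monitor"): 0.7, ("monitor", "plan"): 1.0}

diff_rows = difference_edges(high, low)
for edge, diff, stronger_in in diff_rows:
    print(edge, round(diff, 3), "stronger in", stronger_in, "model")
```

In the plot described above, a positive difference corresponds to a green edge, a negative difference to a red edge, and the absolute value to the edge width.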

## 13. Sequence Clustering (Tactics)

### 13.1 Why Cluster Sequences?

The analyses presented so far — transition networks, centralities, communities, bootstrapping, and group comparison — all operate at the **network level**, characterizing the aggregate dynamics of how states connect and transition. However, within any dataset, individual sequences often exhibit substantial heterogeneity. Not all learners follow the same behavioral patterns. Some may consistently cycle between planning and execution, while others predominantly engage in monitoring with occasional social interactions.

Sequence clustering addresses this heterogeneity by grouping **individual sequences** (learners, sessions, or actors) into clusters of similar behavioral trajectories — often called **tactics** or **strategies** in educational research. This complements network-level analysis by revealing the diversity of approaches within a population. Where the overall TNA model shows the average transition structure, tactics reveal the distinct behavioral patterns that compose that average.

This distinction from community detection (Section 9) is important. Communities group **states** that frequently co-transition within the network — identifying which behaviors tend to co-occur. Tactics group **entire sequences** — identifying which learners behave similarly across their full trajectory. A learner's tactic reflects their overall strategy, while communities reflect structural relationships among behaviors.

Identifying tactics serves several research purposes:

- **Typology development**: Discovering naturally occurring behavioral patterns supports theory building about learning strategies, self-regulation approaches, or collaborative styles.
- **Intervention targeting**: Different tactics may require different interventions. Learners who predominantly monitor but rarely plan may benefit from different support than those who cycle rapidly between all states.
- **Outcome prediction**: Tactics can predict learning outcomes — some behavioral patterns may consistently associate with higher or lower achievement.
- **Group comparison via TNA**: Perhaps most powerfully, discovered tactics can serve as grouping variables for further TNA analysis, building separate transition networks per tactic to examine how transition dynamics differ across behavioral strategies.

### 13.2 Distance Metrics

Sequence clustering requires measuring how similar or different two sequences are. The `cluster_sequences()` function supports four distance metrics, each capturing different aspects of sequence similarity:

- **Hamming distance** (`'hamming'`): Counts the number of positions where two sequences differ. Fast and intuitive, but requires sequences of equal length and treats all positions equally. Best suited for aligned sequences where positional correspondence matters — for example, comparing what students did at time step 1, time step 2, etc.

- **Levenshtein distance** (`'lv'`): The minimum number of insertions, deletions, and substitutions needed to transform one sequence into another. Handles unequal-length sequences naturally. Appropriate when the exact temporal alignment is less important than the overall ordering of events.

- **Optimal String Alignment** (`'osa'`): Extends Levenshtein distance by also allowing adjacent transpositions (swapping two neighboring elements). Useful when near-swaps should be considered minor differences — for example, if planning-then-executing versus executing-then-planning represents a small rather than large behavioral difference.

- **Longest Common Subsequence** (`'lcs'`): Measures distance by how much of the two sequences is left over once their longest shared subsequence (elements appearing in the same order in both, not necessarily contiguously) is removed. Focuses on what two sequences have in common regardless of exact position. Particularly appropriate when the presence of certain behavioral patterns matters more than their exact timing.

For most TNA applications with aligned sequences from `prepare_data()`, Hamming distance provides a good default. For sequences of varying length or when positional alignment is uncertain, Levenshtein or LCS distances are more appropriate.
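
The four metrics can be written out with textbook dynamic programming, as in the sketch below. These are reference implementations for illustration only; how `tnapy` computes or scales the distances internally is not specified here. In particular, the LCS distance is taken to be `len(a) + len(b) - 2 * LCS`, which is a common convention and an assumption on our part. All function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b, transpositions=False):
    """Levenshtein distance; with transpositions=True, optimal string alignment."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete everything from a
    for j in range(m + 1):
        d[0][j] = j  # insert everything from b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # OSA additionally allows swapping two adjacent elements
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


def lcs_length(a, b):
    """Length of the longest common (not necessarily contiguous) subsequence."""
    n, m = len(a), len(b)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t[n][m]


def lcs_distance(a, b):
    """Elements left over once the longest common subsequence is removed from both."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


# Two hypothetical sequences that differ only by swapping the first two states
a = ["plan", "execute", "monitor", "plan"]
b = ["execute", "plan", "monitor", "plan"]

print("hamming:", hamming(a, b))
print("lv:     ", edit_distance(a, b))
print("osa:    ", edit_distance(a, b, transpositions=True))
print("lcs:    ", lcs_distance(a, b))
```

For this pair, only OSA treats the adjacent swap as a single operation; the other three metrics count it as two differences.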

In [None]:
# Cluster sequences into 3 tactics using PAM with Hamming distance
clust = tna.cluster_sequences(prepared_data, k=3, dissimilarity="hamming", method="pam")
print(clust)
print(f"\nCluster sizes: {clust.sizes}")
print(f"Silhouette score: {clust.silhouette:.4f}")

### 13.3 Clustering Methods

Two families of clustering methods are available:

**Partitioning Around Medoids (PAM)** (`method='pam'`) is the default and generally recommended method. PAM identifies *medoid* sequences — actual sequences from the data that best represent each cluster center. Unlike k-means (which uses abstract centroids), PAM's medoids are real, interpretable sequences that researchers can examine directly. PAM is also more robust to outliers than centroid-based methods.

**Hierarchical clustering** builds a tree-like dendrogram by iteratively merging the most similar sequences or clusters. The tree is then cut at the desired number of clusters. Several linkage methods control how inter-cluster distances are computed:

- `'complete'` — maximum distance between any pair across clusters (produces compact, spherical clusters)
- `'average'` — mean distance between all pairs across clusters (balanced approach)
- `'ward.D'` / `'ward.D2'` — minimizes within-cluster variance (tends to produce equal-sized clusters)
- `'single'` — minimum distance between any pair across clusters (can produce elongated, chain-like clusters)

The choice of method and distance metric can substantially affect results. Comparing multiple configurations and evaluating silhouette scores helps identify the most meaningful clustering for your data.
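
To show what "medoids are actual sequences" means in practice, here is a compact PAM sketch operating on a precomputed distance matrix. It follows the classic greedy BUILD step followed by SWAP steps, but it is a simplified illustration and makes no claim to match the routine used inside `cluster_sequences()`. The function names and the toy sequences (strings of single-letter state codes) are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def total_cost(dist, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k):
    """Greedy BUILD, then SWAP until no single swap lowers the total cost."""
    n = len(dist)
    # BUILD: start from the most central point, then add whichever point helps most
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates, key=lambda c: total_cost(dist, medoids + [c])))
    # SWAP: try replacing each medoid with each non-medoid
    improved = True
    while improved:
        improved = False
        best = total_cost(dist, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    # Assign every point to its nearest medoid
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels


# Hypothetical toy sequences: P = plan, E = execute, M = monitor
seqs = ["PEPEPE", "PEPEPM", "PEPEEE", "MMMMMM", "MMMMMP", "MMMMEM"]
dist = [[hamming(a, b) for b in seqs] for a in seqs]

medoids, labels = pam(dist, k=2)
print("medoid sequences:", [seqs[m] for m in medoids])
print("cluster labels:  ", labels)
```

Because the algorithm only needs pairwise distances, any of the sequence distance metrics can be plugged in without changing the clustering code.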

In [None]:
# Compare distance metrics (using first 200 sequences for slower metrics)
subset = prepared_data.sequence_data.iloc[:200]
print("Distance metric comparison (PAM, k=3, n=200):")
for metric in ["hamming", "lcs", "osa"]:
    result = tna.cluster_sequences(subset, k=3, dissimilarity=metric)
    print(f"  {metric:>7s}: sizes={result.sizes}, silhouette={result.silhouette:.4f}")

print("\nClustering method comparison (Hamming, k=3, full data):")
for method in ["pam", "complete", "average", "ward.D2"]:
    result = tna.cluster_sequences(prepared_data, k=3, method=method)
    print(f"  {method:>8s}: sizes={result.sizes}, silhouette={result.silhouette:.4f}")

### 13.4 Choosing the Number of Clusters

Selecting the appropriate number of clusters is a critical decision. The **silhouette score** measures how well each sequence fits its assigned cluster compared to the nearest alternative cluster. Values range from -1 (poor fit) to +1 (excellent fit), with higher mean silhouette scores indicating better-separated, more cohesive clusters.

A practical approach is to compute silhouette scores for a range of *k* values and select the *k* that maximizes the score — or the *k* where the score plateaus, indicating diminishing returns from additional clusters. Domain knowledge should also inform the decision: the identified clusters should be interpretable and meaningful in the research context.
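
The silhouette computation itself is short enough to write out. The sketch below computes the mean silhouette width from a distance matrix and a list of cluster labels, using the standard definition `(b - a) / max(a, b)`, where `a` is the mean distance to the other members of the same cluster and `b` is the mean distance to the nearest other cluster. The function name `mean_silhouette` and the toy data are hypothetical; the sketch is not taken from `tnapy`.

```python
def mean_silhouette(dist, labels):
    """Average silhouette width from a distance matrix and cluster labels."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # singleton clusters score 0 by convention
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: mean distance to the closest other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical toy data: six points on a line forming two obvious groups
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]

good = mean_silhouette(dist, [0, 0, 0, 1, 1, 1])  # matches the two groups
bad = mean_silhouette(dist, [0, 1, 0, 1, 0, 1])   # mixes the two groups
print(f"well-separated labels: {good:.3f}")
print(f"mixed-up labels:       {bad:.3f}")
```

A labeling that respects the structure of the data scores close to 1, while a labeling that cuts across it scores much lower, which is the behavior the sweep over *k* relies on.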

In [None]:
# Sweep k values and compare silhouette scores
print("Silhouette scores for different k values:")
for k in range(2, 6):
    result = tna.cluster_sequences(prepared_data, k=k)
    print(f"  k={k}: silhouette={result.silhouette:.4f}, sizes={result.sizes}")

### 13.5 Building TNA Models per Tactic

The most powerful application of sequence clustering in TNA is using the discovered tactics as grouping variables to build separate transition networks per cluster. This reveals how transition dynamics differ across behavioral strategies — for example, whether learners in a "monitoring-heavy" tactic show different transition patterns than those in a "planning-heavy" tactic.

The workflow is straightforward: cluster the sequences, assign each sequence to its tactic, and use `group_tna()` with the tactic labels as the grouping variable. All standard TNA functions — centralities, pruning, communities, bootstrapping, permutation testing — then work seamlessly on the tactic-based group model.

In [None]:
# Step 1: Cluster sequences into tactics
clust = tna.cluster_sequences(prepared_data, k=3, dissimilarity="hamming", method="pam")

# Step 2: Add tactic labels to the sequence data
tactic_data = prepared_data.sequence_data.copy()
tactic_data["Tactic"] = [f"Tactic {c}" for c in clust.assignments]

# Step 3: Build a TNA model for each tactic
tactic_model = tna.group_tna(tactic_data, group="Tactic")
print(tactic_model)
print()
tactic_model.summary()

In [None]:
# Compare transition networks across tactics
tna.plot_network(tactic_model, minimum=0.05, cut=0.1)
plt.show()

In [None]:
# Centralities per tactic — which states are central in each behavioral strategy?
tactic_cent = tna.centralities(tactic_model, measures=["OutStrength", "InStrength", "Betweenness"])
tactic_cent

## 14. Complete Workflow at a Glance

The following code summarizes the full TNA analysis pipeline. This can serve as a template for your own analyses:

```python
import tna
import pandas as pd

# 1. Load and prepare data
my_data = pd.read_csv("your_data.csv")
prepared = tna.prepare_data(my_data, action="event", actor="user_id", time="timestamp")

# 2. Build model
model = tna.tna(prepared)

# 3. Visualize
tna.plot_network(model, minimum=0.05, cut=0.1)
tna.plot_histogram(model)
tna.plot_frequencies(model)

# 4. Prune
pruned = tna.prune(model, threshold=0.05)
tna.plot_network(pruned, cut=0.1)

# 5. Cliques
print(tna.cliques(model, size=2, threshold=0.1))
print(tna.cliques(model, size=3, threshold=0.05))

# 6. Centralities
tna.plot_centralities(tna.centralities(model))
tna.plot_network(tna.betweenness_network(model), cut=0.1)

# 7. Communities
tna.plot_communities(tna.communities(model), cut=0.1)

# 8. Bootstrap
boot = tna.bootstrap_tna(model, iter=1000, level=0.05, seed=265)
tna.plot_network(boot.model, cut=0.1)

# 9. Sequences
tna.plot_sequences(prepared)

# 10. Group models (from metadata column)
gm = tna.group_tna(prepared, group="achievement")
tna.plot_network(gm)                    # side-by-side networks
tna.centralities(gm)                    # centralities with group column
tna.prune(gm, threshold=0.05)           # prune each group
tna.permutation_test(gm["A"], gm["B"])  # compare two groups

# 11. Sequence clustering (tactics)
clust = tna.cluster_sequences(prepared, k=3, dissimilarity="hamming")
tactic_data = prepared.sequence_data.copy()
tactic_data["Tactic"] = [f"Tactic {c}" for c in clust.assignments]
tactic_model = tna.group_tna(tactic_data, group="Tactic")
tna.plot_network(tactic_model)          # per-tactic networks

# 12. Import one-hot encoded data
wide = tna.import_onehot(onehot_df, cols=["State_A", "State_B"], actor="id")
model_oh = tna.tna(wide)
```

For more information, see the [TNA package documentation](https://github.com/mohsaqr/tnapy) and the [R TNA tutorial](https://lamethods.org/book2/chapters/ch15-tna/ch15-tna.html) by Saqr & Lopez-Pernas.