# 🧪 Tutorial: Processing Crystal Simulations for Graph Kernels

**Goal:** Convert raw simulation graphs (from `.pkl` files) into **labeled subgraphs** ready for machine learning experiments.

### Why do we need this processing step?
1.  **Scaling:** We cannot run Graph Kernels on full 3,000-node simulation boxes (it would take terabytes of RAM). Instead, we extract **Ego Graphs** (local neighborhoods) to characterize the structure.
2.  **Compatibility:** The Weisfeiler-Lehman (WL) Kernel requires **discrete labels** (like "Color A", "Color B"). Our data has continuous Minkowski order parameters ($M_4, M_6, \dots$). This pipeline converts those floats into discrete **"Bins"**.
3.  **Data Hygiene:** We automatically filter out noisy data. If a simulation's metadata has `noise_level > 0.3`, we label it **'Disordered'** so the model does not learn a crystal type from a structure that no longer resembles it.

In [1]:
%load_ext autoreload
%autoreload 2

import sys
import pickle
import numpy as np
from pathlib import Path

# --- SETUP PATHS ---
# Add the project root to the python path so we can import from 'src'
current_dir = Path.cwd()
project_root = current_dir.parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import our custom library functions
from src.config import RAW_DATA_PATH
from src.processing import process_graphs

## 1. Load Raw Data
We load the dataset containing 132 large crystal simulations. Each graph represents one full simulation box.

In [2]:
print(f"Loading data from: {RAW_DATA_PATH}")
with open(RAW_DATA_PATH, 'rb') as f:
    data = pickle.load(f)

graphs = data['graphs']
metadata = data['metadata']

print(f"Loaded {len(graphs)} simulations.")
print(f"Sample Metadata: {metadata[0]}")

Loading data from: /home/npkamath/553Project/553ProjectGraphKernels/data/raw/crystal_graphs_dataset.pkl
Loaded 132 simulations.
Sample Metadata: {'crystal_type': 'sc', 'noise_level': 0.0, 'sample_idx': 0, 'n_nodes': 3375, 'n_edges': 10125, 'system_size': 15, 'scale_factor': 1.0, 'ls': (4, 5, 6, 8, 10, 12), 'weight_threshold': 0.05, 'cutoff': 1.8}
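
The `crystal_type` and `noise_level` keys in the sample metadata are the inputs to the 'Disordered' rule described in the introduction. Here is a minimal sketch of that rule, assuming the threshold is a strict `> 0.3` comparison. The names `relabel` and `NOISE_THRESHOLD` are hypothetical and not taken from `src.processing`, and the toy metadata values are invented for illustration.

```python
from collections import Counter

NOISE_THRESHOLD = 0.3  # assumed from the tutorial text; not read from src.config


def relabel(meta, threshold=NOISE_THRESHOLD):
    """Return the training label for one simulation's metadata dict."""
    if meta["noise_level"] > threshold:
        return "Disordered"
    return meta["crystal_type"]


# Toy metadata with the same keys as the sample printed above (values invented).
toy_metadata = [
    {"crystal_type": "sc", "noise_level": 0.0},
    {"crystal_type": "sc", "noise_level": 0.3},
    {"crystal_type": "sc", "noise_level": 0.5},
]
toy_labels = [relabel(m) for m in toy_metadata]
label_counts = Counter(toy_labels)
print(label_counts)
```

Under this reading, a simulation at exactly `0.3` keeps its crystal label. Counting labels this way is also a quick check of class balance before training.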


## 2. The Processing Pipeline (`process_graphs`)

We now run the core function `process_graphs`. This function performs three critical steps automatically:

1.  **Discretization:** It collects all Minkowski features ($M_4 \dots M_{12}$) from every node and trains a **K-Means model** to group them into **20 discrete bins**. 
    * *Example:* `[0.12, 0.88]` $\rightarrow$ `Bin 2` and `Bin 18` $\rightarrow$ Label `"2-18"`.
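2.  **Relabeling:** It checks the `noise_level` in the metadata. If `noise_level > 0.3`, the graph is labeled **'Disordered'**, regardless of its original type.
3.  **Subgraph Extraction:** It randomly samples nodes from each graph and extracts their **1-hop neighborhood** (Ego Graph), producing a dataset of small, manageable graphs. This description assumes **30 nodes** per graph, which for 132 simulations gives 3,960 subgraphs. The run logged below created 396 subgraphs, i.e. 3 per graph, so the count in effect depends on how `process_graphs` is configured.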
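
The discretization in step 1 can be sketched in plain Python. This is an illustration under stated assumptions, not the code inside `process_graphs`. It assumes each feature is binned independently, it uses a small hand-written 1-D k-means in place of whatever K-Means implementation the pipeline calls, and it uses 3 bins instead of 20 to keep the toy data readable. The helper names `kmeans_1d`, `nearest_bin`, and `node_label` are hypothetical.

```python
def kmeans_1d(values, n_bins, n_iter=50):
    """Lloyd's algorithm on scalars; returns the bin centers in ascending order."""
    lo, hi = min(values), max(values)
    # Deterministic start: spread the initial centers evenly over the data range.
    centers = [lo + (hi - lo) * (i + 0.5) / n_bins for i in range(n_bins)]
    for _ in range(n_iter):
        buckets = [[] for _ in range(n_bins)]
        for v in values:
            buckets[nearest_bin(v, centers)].append(v)
        # An empty bucket keeps its previous center.
        updated = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def nearest_bin(value, centers):
    """Index of the center closest to value (ties go to the lower index)."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))


def node_label(features, centers_per_feature):
    """Join one bin index per feature into a string such as '0-2'."""
    return "-".join(
        str(nearest_bin(v, c)) for v, c in zip(features, centers_per_feature)
    )


# Toy training values for a single feature, using 3 bins instead of 20.
training_values = [0.0, 0.1, 0.5, 0.6, 0.9, 1.0]
centers = kmeans_1d(training_values, n_bins=3)
label = node_label([0.12, 0.88], [centers, centers])
print(centers, label)
```

The bin centers are learned once from all nodes and then reused for every graph, so the same physical environment maps to the same label in every simulation.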

In [3]:
print("Running processing pipeline (this may take 1-2 minutes)...")

# Returns:
#   subgraphs: A list of grakel.Graph objects
#   labels: A list of strings: the crystal type from the metadata (e.g., 'sc') or 'Disordered'
subgraphs, labels = process_graphs(graphs, metadata)

print(f"✅ Success! Generated {len(subgraphs)} subgraphs ready for training.")

Running processing pipeline (this may take 1-2 minutes)...
Step 1: Collecting features from all graphs...
Step 2: Discretizing features into 20 bins...
Step 3: Relabeling and extracting subgraphs...
   > Extraction complete. Created 396 subgraphs.
✅ Success! Generated 396 subgraphs ready for training.
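
Step 3 of the log is the ego-graph extraction. A 1-hop ego graph is the sampled node, its direct neighbors, and (in this sketch) every edge among them. The code below uses a plain adjacency dict as a stand-in for whatever graph type the `.pkl` file stores. The helpers `ego_graph` and `sample_ego_graphs` are hypothetical names, and the fixed seed is an assumption added for reproducibility.

```python
import random


def ego_graph(adjacency, center):
    """1-hop ego graph: the center, its neighbors, and the edges among them."""
    nodes = {center, *adjacency[center]}
    return {n: [m for m in adjacency[n] if m in nodes] for n in sorted(nodes)}


def sample_ego_graphs(adjacency, n_samples, seed=0):
    """Pick n_samples distinct centers at random and extract each ego graph."""
    rng = random.Random(seed)
    centers = rng.sample(sorted(adjacency), n_samples)
    return [ego_graph(adjacency, c) for c in centers]


# Toy graph: a 5-node path 0-1-2-3-4 (a real simulation box has thousands of nodes).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
ego = ego_graph(path, 2)
samples = sample_ego_graphs(path, n_samples=3)
print(ego)
```

In the sample metadata, `n_edges / n_nodes = 10125 / 3375 = 3`, which is an average degree of 6. A 1-hop ego graph in that simulation would therefore typically hold a center plus six neighbors, i.e. 7 nodes.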


## 3. Inspecting the Data
Let's look at one of the subgraphs to understand what the Kernel will see.
Notice that the **node labels** are strings like `'18-0-8-19-11-19'`, with one bin index per Minkowski feature. These are the discretized "Barcodes" representing the local physics.

In [4]:
# Inspect the first sample
sample_idx = 0
sample_graph = subgraphs[sample_idx]
sample_label = labels[sample_idx]

print(f"--- Sample {sample_idx} ---")
print(f"Ground Truth Label: {sample_label}")

# Get internal GraKeL representation
gk_labels = sample_graph.get_labels(purpose='dictionary')

print(f"Number of nodes in subgraph: {len(gk_labels)}")
print("\nNode Labels (The 'Barcode' the kernel sees):")
for node_id, label in list(gk_labels.items())[:5]:
    print(f"  Node {node_id}: {label}")

--- Sample 0 ---
Ground Truth Label: sc
Number of nodes in subgraph: 7

Node Labels (The 'Barcode' the kernel sees):
  Node 2619: 18-0-8-19-11-19
  Node 2169: 18-0-8-19-11-19
  Node 2409: 18-0-8-19-11-19
  Node 2379: 18-0-8-19-11-19
  Node 2393: 18-0-8-19-11-19
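
Each label printed above has six components, which matches the six entries of `ls` in the sample metadata. Assuming the components follow the same order as `ls` (this is not confirmed by the pipeline output), a label can be decoded back into a per-order bin index. `decode_label` is a hypothetical helper for inspection only.

```python
def decode_label(label, ls):
    """Map each order l to its bin index, assuming components follow the order of ls."""
    bins = [int(part) for part in label.split("-")]
    if len(bins) != len(ls):
        raise ValueError(f"expected {len(ls)} components, got {len(bins)}")
    return dict(zip(ls, bins))


ls = (4, 5, 6, 8, 10, 12)  # copied from the sample metadata
decoded = decode_label("18-0-8-19-11-19", ls)
print(decoded)
```

All five nodes shown share one barcode, which is what one would expect for a noise-free `sc` lattice where every site has the same local environment.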


## 4. Ready for Experiments
You can now use `subgraphs` and `labels` in your own experiments. You do **not** need to use the default SVM pipeline.

### Example: Custom Train/Test Split

In [5]:
from sklearn.model_selection import train_test_split

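# Split the processed data; stratify keeps class proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    subgraphs,
    labels,
    test_size=0.3,
    stratify=labels,
    random_state=42,  # fixed seed so the split is reproducible
)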

print(f"Training Set: {len(X_train)} graphs")
print(f"Test Set:     {len(X_test)} graphs")

# NOW YOU CAN INSERT YOUR CUSTOM KERNEL/MODEL CODE HERE
# Example:
# from grakel.kernels import WeisfeilerLehman
# gk = WeisfeilerLehman(n_iter=3)
# K_train = gk.fit_transform(X_train)

Training Set: 277 graphs
Test Set:     119 graphs
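
To see why the kernel needs discrete node labels, it helps to look at what a Weisfeiler-Lehman round does with them. The sketch below is a plain-Python illustration of the WL subtree idea, not GraKeL's implementation. Each round replaces a node's label with the pair (own label, sorted neighbor labels), and the kernel value is the dot product of the label counts of two graphs. The functions `wl_features` and `wl_kernel` and the toy graphs are hypothetical.

```python
from collections import Counter


def wl_features(adjacency, labels, n_iter=1):
    """Count the original labels plus the refined labels from n_iter WL rounds."""
    counts = Counter(labels.values())
    current = dict(labels)
    for _ in range(n_iter):
        refined = {}
        for node, neighbors in adjacency.items():
            # New label = own label plus the sorted multiset of neighbor labels.
            neighborhood = tuple(sorted(current[n] for n in neighbors))
            refined[node] = (current[node], neighborhood)
        current = refined
        counts.update(current.values())
    return counts


def wl_kernel(features_a, features_b):
    """Dot product of two label-count vectors."""
    return sum(features_a[k] * features_b[k] for k in features_a)


# Two toy graphs whose nodes all carry the same barcode-style label.
node_labels = {0: "2-18", 1: "2-18", 2: "2-18"}
path3 = {0: [1, 2], 1: [0], 2: [0]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

f_path = wl_features(path3, node_labels)
f_tri = wl_features(triangle, node_labels)
print(wl_kernel(f_path, f_path), wl_kernel(f_path, f_tri), wl_kernel(f_tri, f_tri))
```

Before refinement the two graphs look identical, since both have three nodes with the same label. After one round, the path's end nodes get a different refined label than the triangle's nodes, so the cross-graph similarity drops below both self-similarities. Matching labels by exact equality is what makes continuous features unusable without binning.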
