# 3. High-Fidelity Synthetic Data Generation 🧪

This notebook demonstrates how to use the pre-trained CPT Foundation Model as a **CPT simulator** that generates new, realistic-looking cone penetration test (CPT) profiles.

## The Process: Autoregressive Generation

The core technique used here is **autoregressive generation**. It works much like a large language model writing text one token at a time:

1.  **Provide a Prompt**: We give the model a short, real sequence of CPT data points (e.g., the first 100 measurements) to set the initial geological context.
2.  **Predict the Next Step**: The model takes this prompt and predicts the very next data point in the sequence.
3.  **Append and Repeat**: The newly predicted data point is appended to the end of the sequence. This new, slightly longer sequence becomes the input for the next prediction step.
4.  **Generate Full Profile**: This process is repeated until a CPT profile of the desired length is generated.
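
The four steps above can be sketched in a few lines of plain Python. This is a toy illustration rather than the notebook's model: `predict_next` is a hypothetical stand-in for the transformer, and `autoregressive_generate` and `mean_of_last_two` are names invented for this example.

```python
from typing import Callable, List


def autoregressive_generate(
    prompt: List[float],
    predict_next: Callable[[List[float]], float],
    total_len: int,
) -> List[float]:
    """Extend `prompt` one value at a time until it holds `total_len` values.

    `predict_next` is a hypothetical stand-in for the trained model: it
    receives the whole sequence so far and returns the next value.
    """
    sequence = list(prompt)  # Step 1: start from the prompt (copied, not mutated)
    while len(sequence) < total_len:
        next_value = predict_next(sequence)  # Step 2: predict the next point
        sequence.append(next_value)          # Step 3: append and repeat
    return sequence                          # Step 4: full-length profile


def mean_of_last_two(seq: List[float]) -> float:
    """Toy predictor: average of the two most recent values."""
    return (seq[-1] + seq[-2]) / 2.0


profile = autoregressive_generate([1.0, 3.0], mean_of_last_two, total_len=5)
print(profile)  # [1.0, 3.0, 2.0, 2.5, 2.25]
```

Each prediction is fed back in as input, so any error made early on influences every later step. This is worth keeping in mind when judging long generated profiles.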

### Applications
- **Data Augmentation**: Create additional synthetic profiles to train other specialized machine learning models (e.g., for soil classification or liquefaction analysis), which can improve their robustness.
- **Scenario Modeling**: Generate profiles for specific "what-if" scenarios to test engineering designs under various hypothetical ground conditions.

## 1. Setup

First, we'll import the necessary libraries and load our trained model, configuration file, and the data scaler used during training.

In [20]:
import os
import yaml
import torch
import numpy as np
import joblib
from tqdm import tqdm
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Change working directory to project root
if os.path.basename(os.getcwd()) == 'notebooks':
    os.chdir('..')

# Add src to path
import sys
sys.path.append(os.path.abspath('src'))

from data_utils import CPTDataModule
from model import CPTFoundationModel

### Load Configuration, Model, and Scaler

In [21]:
CONFIG_PATH = 'configs/PG_dataset.yaml'

# Load YAML config
with open(CONFIG_PATH, 'r') as f:
    config = yaml.safe_load(f)

# Setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Initialize the model
model_params = config['model_params']
model = CPTFoundationModel(
    num_features=model_params['num_features'],
    model_dim=model_params['model_dim'],
    num_heads=model_params['num_heads'],
    num_layers=model_params['num_layers']
).to(device)

# Load the saved model checkpoint
model_path = config['data_paths']['model_save_path']
if os.path.exists(model_path):
    checkpoint = torch.load(model_path, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval() # Set to evaluation mode
    print(f"Model loaded successfully from '{model_path}'")
else:
    print(f"ERROR: Model not found at '{model_path}'. Please train the model first.")
    model = None

# Load the scaler
scaler_path = config['data_paths'].get('scaler_path')
if scaler_path and os.path.exists(scaler_path):
    scaler = joblib.load(scaler_path)
    print(f"Scaler loaded from '{scaler_path}'")
else:
    print("ERROR: Scaler not found. Please run the data preprocessing script.")
    scaler = None

Using device: cuda
Model loaded successfully from 'models/foundation_model_PG.pth'
Scaler loaded from 'data/processed/PG/scaler.joblib'
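
The scaler matters because the model only ever sees standardized values: the prompt is scaled before generation and the predictions are mapped back afterwards. Below is a minimal sketch of that round trip, assuming the saved scaler behaves like a per-feature standardizer, z = (x - mean) / std, as the `StandardScaler` named in the generation function's docstring would. The helper names and the sample readings are made up for illustration.

```python
import statistics
from typing import List, Tuple


def fit_standardizer(column: List[float]) -> Tuple[float, float]:
    """Return (mean, population std) for one feature column."""
    return statistics.fmean(column), statistics.pstdev(column)


def transform(column: List[float], mean: float, std: float) -> List[float]:
    """Map raw values to zero mean and unit variance."""
    return [(x - mean) / std for x in column]


def inverse_transform(column: List[float], mean: float, std: float) -> List[float]:
    """Map standardized values back to the original scale."""
    return [z * std + mean for z in column]


qc = [2.0, 4.0, 6.0, 8.0]  # made-up cone resistance readings
mean, std = fit_standardizer(qc)
scaled = transform(qc, mean, std)
restored = inverse_transform(scaled, mean, std)

print(mean)      # 5.0
print(scaled)    # values centred on 0
print(restored)  # the original readings, up to floating-point rounding
```

Using a different scaler from the one fitted during training would shift every prompt and every generated value, which is why the notebook loads the saved one instead of fitting a new one.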


## 2. Autoregressive Generation Function

This is the core function that performs the step-by-step generation.

In [22]:
def generate_synthetic_cpt(model, scaler, prompt, max_len, device, temperature=0.0):
    """
    Generates a synthetic CPT profile autoregressively from a prompt.
    
    Args:
        model (CPTFoundationModel): The trained transformer model.
        scaler (StandardScaler): The scaler fitted on the training data.
        prompt (np.ndarray): A short sequence of initial, unscaled CPT data (shape [prompt_len, num_features]).
        max_len (int): The total desired length of the generated CPT, including the prompt.
        device (torch.device): The device to run generation on.
        temperature (float): Standard deviation of the Gaussian noise added to each prediction
            (in scaled units). 0 for deterministic output, >0 for stochastic output.
        
    Returns:
        np.ndarray: The generated CPT data (prompt included), inverse-transformed to its original scale.
    """
    if model is None or scaler is None:
        print("Model or scaler not available. Aborting generation.")
        return None
    
    num_features = model.input_projection.in_features
    
    # Scale the prompt and convert to a tensor with a batch dimension
    prompt_scaled = scaler.transform(prompt[:, :num_features])
    generated_sequence = torch.tensor(prompt_scaled, dtype=torch.float32).unsqueeze(0).to(device)
    
    model.eval() # Ensure model is in evaluation mode
    with torch.no_grad():
        pbar = tqdm(range(len(prompt), max_len), desc="Generating Synthetic CPT")
        for _ in pbar:
            # Prepare the current sequence and attention mask
            input_sequence = generated_sequence
            attention_mask = torch.ones(input_sequence.shape[:2], device=device)
            
            # Get contextual embeddings from the model
            contextual_embeddings = model(input_sequence, attention_mask)
            
            # IMPORTANT: Select the embedding of the VERY LAST token
            last_token_embedding = contextual_embeddings[:, -1:, :] # Shape: [1, 1, model_dim]
            
            # Project this single embedding to predict the next data point
            next_token_prediction = model.output_projection(last_token_embedding) # Shape: [1, 1, num_features]
            
            # Apply temperature for stochastic sampling if temperature > 0
            if temperature > 0:
                # Add Gaussian noise scaled by temperature
                noise = torch.randn_like(next_token_prediction) * temperature
                next_token_prediction = next_token_prediction + noise
            
            # Append the prediction to our sequence
            generated_sequence = torch.cat([generated_sequence, next_token_prediction], dim=1)
            
    # Move sequence to CPU, remove batch dimension, and inverse transform
    final_sequence_scaled = generated_sequence.squeeze(0).cpu().numpy()
    final_sequence_unscaled = scaler.inverse_transform(final_sequence_scaled)
    
    return final_sequence_unscaled
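
Note that `temperature` in this function is not the softmax temperature used by language models. It is the standard deviation of zero-mean Gaussian noise added to each prediction, in scaled units. The sketch below reproduces that behaviour with the standard library only; `perturb` is a hypothetical helper and the prediction values are made up.

```python
import random
from typing import List


def perturb(prediction: List[float], temperature: float, rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian noise with std `temperature` to each feature."""
    if temperature <= 0:
        return list(prediction)  # deterministic path: prediction is unchanged
    return [value + rng.gauss(0.0, temperature) for value in prediction]


prediction = [0.25, -1.0]  # a made-up scaled (qc, fs) prediction

same = perturb(prediction, temperature=0.0, rng=random.Random(0))
run_a = perturb(prediction, temperature=0.1, rng=random.Random(42))
run_b = perturb(prediction, temperature=0.1, rng=random.Random(42))
run_c = perturb(prediction, temperature=0.1, rng=random.Random(7))

print(same == prediction)  # True: temperature 0 leaves the prediction as is
print(run_a == run_b)      # True: same seed, same noise
print(run_a == run_c)      # a different seed gives a different draw
```

Because each noisy prediction is appended to the sequence and fed back to the model, the perturbations compound over depth. For reproducible stochastic runs in the notebook itself, one option is to call `torch.manual_seed` with a fixed value before generation.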

## 3. Generate and Visualize

Now, let's put it all together. We'll load the test data to get a real CPT, use its beginning as a prompt, and generate a new profile.

In [23]:
# --- Get a real CPT to use for a prompt ---
print("Loading test data to find a prompt...")
data_module = CPTDataModule(config)
data_module.setup()
test_dataset = data_module.test_dataset

# Get the first CPT chunk from the test set
# Each item is used directly as a single tensor of scaled features, so no unpacking is needed
original_cpt_scaled = test_dataset[0] 
original_cpt_unscaled = scaler.inverse_transform(original_cpt_scaled.numpy())

# --- Set Generation Parameters ---
PROMPT_LENGTH = 100 # Use the first 100 data points as the context
GENERATION_LENGTH = 500 # Total length of the generated CPT, prompt included (400 new points)
TEMPERATURE = 0 # Set to 0 for deterministic output, >0 for random variations

# Create the prompt from the original CPT
cpt_prompt = original_cpt_unscaled[:PROMPT_LENGTH]

# --- Run Generation ---
synthetic_cpt = generate_synthetic_cpt(
    model=model,
    scaler=scaler,
    prompt=cpt_prompt,
    max_len=GENERATION_LENGTH,
    device=device,
    temperature=TEMPERATURE
)

Loading test data to find a prompt...
Found existing processed data in 'data/processed/PG'. Delete to reprocess.
Processing 1071 files with max_len=512 and overlap=128...


Loading and Chunking Data: 100%|██████████| 1071/1071 [00:23<00:00, 44.98it/s]


Processing 133 files with max_len=512 and overlap=128...


Loading and Chunking Data: 100%|██████████| 133/133 [00:02<00:00, 47.12it/s]


Processing 135 files with max_len=512 and overlap=128...


Loading and Chunking Data: 100%|██████████| 135/135 [00:02<00:00, 46.84it/s]


Train dataset size: 5338
Validation dataset size: 681
Test dataset size: 680


Generating Synthetic CPT: 100%|██████████| 400/400 [00:02<00:00, 145.04it/s]
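
Two quick sanity checks follow from how the generation function is written. The loop runs `GENERATION_LENGTH - PROMPT_LENGTH` times (500 - 100 = 400, matching the progress bar above), and the first `PROMPT_LENGTH` rows of the output should reproduce the prompt up to floating-point rounding, since those rows are only scaled and then inverse-scaled. Here is a sketch of both checks on made-up data; `check_generated_profile` is a hypothetical helper, not part of the project.

```python
import math
from typing import List, Sequence


def check_generated_profile(
    prompt: Sequence[Sequence[float]],
    generated: Sequence[Sequence[float]],
    total_len: int,
    rel_tol: float = 1e-4,
) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(generated) != total_len:
        problems.append(f"expected {total_len} rows, got {len(generated)}")
    for i, (p_row, g_row) in enumerate(zip(prompt, generated)):
        rows_match = all(
            math.isclose(p, g, rel_tol=rel_tol, abs_tol=1e-6)
            for p, g in zip(p_row, g_row)
        )
        if not rows_match:
            problems.append(f"row {i} of the prompt was not preserved")
    return problems


prompt_length, generation_length = 100, 500
steps = generation_length - prompt_length  # number of predictions the loop makes
print(steps)  # 400

prompt = [[1.0, 0.02], [1.5, 0.03]]   # made-up (qc, fs) rows
generated = prompt + [[1.7, 0.035]]   # prompt followed by one generated row
print(check_generated_profile(prompt, generated, total_len=3))  # []
```

With the notebook's arrays, the same check could be run as `check_generated_profile(cpt_prompt.tolist(), synthetic_cpt.tolist(), GENERATION_LENGTH)`. The loose default tolerance allows for the float32 round trip through the model's tensor format.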


### Visualize the Result

Let's plot our generated CPT against the original one to see how it compares. The generated profile should look statistically similar and geologically plausible, but not be an exact copy.

In [24]:
def plot_generated_cpt(original_cpt, synthetic_cpt, prompt_len, feature_names=("qc", "fs")):
    """Plots a comparison of an original and a synthetically generated CPT."""
    if synthetic_cpt is None:
        print("No synthetic data to plot.")
        return
        
    fig = make_subplots(rows=1, cols=len(feature_names), subplot_titles=[f'Feature: {name}' for name in feature_names])
    
    # Plot Original CPT
    original_depth = np.arange(len(original_cpt))
    for i, name in enumerate(feature_names):
        fig.add_trace(go.Scatter(x=original_cpt[:, i], y=original_depth, mode='lines', name='Original', line=dict(color='blue', width=2)), row=1, col=i+1)

    # Plot Synthetic CPT
    synthetic_depth = np.arange(len(synthetic_cpt))
    for i, name in enumerate(feature_names):
        fig.add_trace(go.Scatter(x=synthetic_cpt[:, i], y=synthetic_depth, mode='lines', name='Synthetic', line=dict(color='green', dash='dash')), row=1, col=i+1)
    
    # Highlight the prompt region
    for i, name in enumerate(feature_names):
        fig.add_shape(
            type="rect",
            x0=min(original_cpt[:, i].min(), synthetic_cpt[:, i].min()) * 0.9,
            x1=max(original_cpt[:, i].max(), synthetic_cpt[:, i].max()) * 1.1,
            y0=-0.5,
            y1=prompt_len - 0.5,
            fillcolor="red",
            opacity=0.1, layer="below", line_width=0,
            row=1, col=i+1
        )
        fig.add_annotation(
            text="Prompt",
            x=np.median(original_cpt[:prompt_len, i]), y=prompt_len / 2,
            showarrow=False, bgcolor="rgba(255, 255, 255, 0.5)",
            row=1, col=i+1
        )

    fig.update_yaxes(autorange="reversed", title_text="Depth Index")
    fig.update_layout(
        title_text='Synthetic CPT Generation vs. Original',
        height=700, width=1000,
        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
    )
    fig.show()

# Plot the results
plot_generated_cpt(original_cpt_unscaled, synthetic_cpt, PROMPT_LENGTH, feature_names=["qc", "fs"])
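
A plot shows whether the profile looks plausible; simple summary statistics give a rough numeric complement. The sketch below compares the per-feature mean and standard deviation of two profiles. The `summarize` helper and the sample values are illustrative only. Matching moments is a weak check and says nothing about layering or depth trends.

```python
import statistics
from typing import Dict, Sequence


def summarize(
    profile: Sequence[Sequence[float]],
    feature_names: Sequence[str],
) -> Dict[str, Dict[str, float]]:
    """Per-feature mean and population standard deviation of a profile."""
    summary = {}
    for i, name in enumerate(feature_names):
        column = [row[i] for row in profile]
        summary[name] = {
            "mean": statistics.fmean(column),
            "std": statistics.pstdev(column),
        }
    return summary


# Made-up (qc, fs) rows standing in for the original and generated profiles
original = [[2.0, 0.02], [4.0, 0.04], [6.0, 0.06]]
synthetic = [[3.0, 0.03], [4.0, 0.04], [5.0, 0.05]]

orig_stats = summarize(original, ["qc", "fs"])
synth_stats = summarize(synthetic, ["qc", "fs"])
for name in ["qc", "fs"]:
    print(name, "original:", orig_stats[name], "synthetic:", synth_stats[name])
```

With the notebook's arrays, it makes sense to compare only the generated region, for example `synthetic_cpt[PROMPT_LENGTH:].tolist()` against `original_cpt_unscaled[PROMPT_LENGTH:GENERATION_LENGTH].tolist()`, since the prompt rows are identical by construction.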