# CNN Training for DAS Data (Binary Classification)

This notebook demonstrates how to:
1. Load the DAS data (train/test) from `.npy` files.
2. Visualize random patches for noise/events.
3. Train a CNN model on imbalanced data.
4. Evaluate with a confusion matrix and classification metrics (Precision, Recall, F1, PR-AUC, ROC-AUC).
5. Generate embeddings and visualize with t-SNE.

---


In [9]:
# notebook cell: imports
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# Import our training script
from my_scripts.model_training import (
    visualize_examples,
    train_cnn_model,
    plot_history,
    evaluate_model,
    plot_pr_curve,
    plot_roc_curve,
    generate_embeddings,
    plot_tsne
)


## Step 1: Load the Data

We assume you've already processed the data into `.npy` arrays:  
- `X_train.npy`, `y_train.npy`  
- `X_test.npy`, `y_test.npy`  

The `X_*` arrays hold the input patches and the `y_*` arrays hold the matching binary labels, one label per patch.
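
Before training, it can help to confirm that the arrays are consistent. The sketch below is a minimal check, assuming labels are encoded as 0 and 1 (the notebook does not state the encoding). The helper name `check_binary_dataset` and the toy array shapes are hypothetical stand-ins, not part of `my_scripts.model_training`.

```python
import numpy as np

def check_binary_dataset(X, y):
    """Raise ValueError if X and y are inconsistent for binary classification."""
    # Every patch needs exactly one label
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"sample count mismatch: {X.shape[0]} vs {y.shape[0]}")
    # Assumption: labels are encoded as 0 and 1
    labels = np.unique(y)
    if not np.isin(labels, [0, 1]).all():
        raise ValueError(f"expected labels in {{0, 1}}, found {labels.tolist()}")
    return labels

# Toy stand-in arrays with hypothetical shapes, not the real DAS data
X_demo = np.zeros((5, 16, 16, 1), dtype=np.float32)
y_demo = np.array([0, 0, 1, 0, 1])
print(check_binary_dataset(X_demo, y_demo))
```

The same call can be applied to `X_train, y_train` and `X_test, y_test` once the files are loaded.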


In [2]:
# notebook cell: data loading
X_train = np.load('train_data_thres40/X_train.npy')
y_train = np.load('train_data_thres40/y_train.npy')
X_test  = np.load('test_data_thres40/X_test.npy')
y_test  = np.load('test_data_thres40/y_test.npy')

print("Train data shape:", X_train.shape)
print("Train labels shape:", y_train.shape)
print("Test data shape:", X_test.shape)
print("Test labels shape:", y_test.shape)


FileNotFoundError: [Errno 2] No such file or directory: 'train_data_thres40/X_train.npy'

## Step 2: Quick Visualization
Let's visualize some random *noise* and *event* patches from the training set.  

**Note on Imbalanced Data:** If the dataset is heavily skewed (e.g., much more noise than events), F1, precision, and recall may be more relevant than raw accuracy.
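
To see how skewed the labels are, one option is to count each class directly. This is a minimal sketch on toy labels, again assuming a 0/1 encoding; `class_balance` is a hypothetical helper, and the 95/5 split is made up for illustration.

```python
import numpy as np

def class_balance(y):
    """Return (counts, fractions) for labels 0 and 1."""
    counts = np.bincount(np.asarray(y).astype(int).ravel(), minlength=2)
    return counts, counts / counts.sum()

# Toy labels: 95 samples of class 0 and 5 of class 1 (illustrative only)
y_demo = np.array([0] * 95 + [1] * 5)
counts, fractions = class_balance(y_demo)
print("counts:", counts, "fractions:", fractions)

# A classifier that always predicts the majority class reaches this accuracy,
# which is why accuracy alone says little on skewed data
baseline_accuracy = fractions.max()
print("majority-class baseline accuracy:", baseline_accuracy)
```

Running `class_balance(y_train)` on the real labels gives the baseline that any reported accuracy should be compared against.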


In [3]:
# notebook cell: data visualization
# Show 6 example patches from the training set
visualize_examples(X_train, num_examples=6)


NameError: name 'visualize_examples' is not defined

## Step 3: Define and Train the CNN

We'll train a deeper CNN with L2 regularization, dropout, and an exponential-decay learning-rate schedule. If multiple GPUs are available, training is distributed across them via `tf.distribute.MirroredStrategy`.
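
For intuition about the learning-rate schedule, the sketch below evaluates a continuous exponential decay in plain Python. It uses the same formula as the non-staircase mode of `tf.keras.optimizers.schedules.ExponentialDecay`, but whether `train_cnn_model` uses that class is not shown here. The `decay_steps` and `decay_rate` values are assumptions chosen for illustration; only `initial_lr=1e-4` comes from this notebook.

```python
def exponential_decay(step, initial_lr=1e-4, decay_steps=1000, decay_rate=0.96):
    """Learning rate after `step` optimizer updates.

    decay_steps and decay_rate are illustrative assumptions,
    not values taken from the training script.
    """
    return initial_lr * decay_rate ** (step / decay_steps)

# The rate shrinks by a factor of decay_rate every decay_steps updates
for step in (0, 1000, 5000, 20000):
    print(step, exponential_decay(step))
```

A smaller `decay_rate` or smaller `decay_steps` makes the rate fall faster, which can help late-stage convergence but may stall training if the decay is too aggressive.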


In [4]:
# notebook cell: model training
model_save_path = 'saved_models/cnn_model_improved.h5'

model, history = train_cnn_model(
    X_train, y_train,
    save_path=model_save_path,
    batch_size=32,
    epochs=100,
    val_split=0.1,
    initial_lr=1e-4
)


NameError: name 'train_cnn_model' is not defined

## Step 4: Plot Training Curves
Let's visualize accuracy and loss for both training and validation.


In [5]:
# notebook cell: plot training curves
plot_history(history, out_path='imbtrain/Acc_Loss.png')


NameError: name 'plot_history' is not defined

## Step 5: Evaluation with Confusion Matrix & Metrics

### Why These Metrics?
- **Confusion Matrix**: Helps us see how many noise/events were misclassified.  
- **Accuracy**: Overall fraction of correct predictions.  
- **Precision**: Out of all predicted positives, how many were truly positive?  
- **Recall**: Out of all actual positives, how many did we predict correctly?  
- **F1 Score**: Harmonic mean of precision and recall, often more informative than accuracy for imbalanced data.  

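We also recommend **PR-AUC** (Precision-Recall AUC) for highly skewed classes. ROC-AUC can look overly optimistic under extreme imbalance, because the false-positive rate stays small when negatives vastly outnumber positives, even if false positives outnumber true positives.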
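
The definitions above can be sketched directly from the four confusion-matrix counts. The counts below are invented for illustration and are not results from this model; `binary_metrics` is a hypothetical helper, separate from `evaluate_model`.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Invented counts: 12 events among 1000 samples
m = binary_metrics(tp=8, fp=2, fn=4, tn=986)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

In this toy case accuracy is above 0.99 even though a third of the events are missed, which is the gap that recall and F1 expose.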


In [6]:
# notebook cell: evaluate on test set
y_pred = evaluate_model(model, X_test, y_test)


NameError: name 'evaluate_model' is not defined

## Step 6: Precision-Recall Curve & ROC Curve
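
As a rough illustration of what a PR-AUC summarizes, the sketch below computes average precision, one common step-wise estimate of the area under the precision-recall curve. It is not necessarily the calculation used inside `plot_pr_curve`, it ignores tied scores, and the labels and probabilities are toy values.

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive sample."""
    y_true = np.asarray(y_true).ravel()
    scores = np.asarray(scores).ravel()
    n_pos = int(y_true.sum())
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive samples")
    # Rank samples from highest to lowest predicted score
    order = np.argsort(-scores, kind="stable")
    hits = y_true[order]
    # Precision after each ranked sample: true positives so far / samples so far
    precision = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float((precision * hits).sum() / n_pos)

# Toy labels and predicted probabilities (illustrative only)
y_demo = np.array([0, 0, 1, 0, 1, 0])
p_demo = np.array([0.10, 0.40, 0.35, 0.20, 0.90, 0.05])
ap = average_precision(y_demo, p_demo)
print("average precision:", ap)
```

A ranking that places every event above every noise sample yields an average precision of 1.0, while a random ranking tends toward the fraction of positives in the data.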


In [7]:
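# notebook cell: PR and ROC curves
# Use the predicted probabilities (not hard labels) for these plots.
# batch_size=1 is slow on large test sets; raise it if memory allows.
y_pred_prob = model.predict(X_test, batch_size=1).ravel()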

pr_auc = plot_pr_curve(y_test, y_pred_prob, label_name="CNN Model")
roc_auc = plot_roc_curve(y_test, y_pred_prob, label_name="CNN Model")


NameError: name 'model' is not defined

## Step 7: t-SNE Visualization of Embeddings

We can extract the output of the "embedding_layer" (128-dim) to see how the network separates events from noise, then apply t-SNE to reduce the embeddings to 2D for plotting.
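
t-SNE can be slow on large sample counts, so one option is to subsample per class before plotting, which keeps the rare class visible. The sketch below uses random stand-in embeddings and a made-up 950/50 label split; `stratified_subset` and `max_per_class` are hypothetical names, not part of `generate_embeddings` or `plot_tsne`.

```python
import numpy as np

def stratified_subset(labels, max_per_class, seed=0):
    """Return sorted indices with at most max_per_class samples from each class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).ravel()
    picked = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        take = min(max_per_class, idx.size)
        # Sample without replacement so no index appears twice
        picked.append(rng.choice(idx, size=take, replace=False))
    return np.sort(np.concatenate(picked))

# Stand-in data: 1000 random 128-dim vectors with 5% positives (illustrative only)
emb_demo = np.random.default_rng(1).normal(size=(1000, 128))
y_demo = np.array([0] * 950 + [1] * 50)

keep = stratified_subset(y_demo, max_per_class=100)
emb_small, y_small = emb_demo[keep], y_demo[keep]
print(emb_small.shape, np.bincount(y_small))
```

The reduced arrays can then be passed on in place of the full embeddings and labels. Note that subsampling changes the apparent class ratio in the plot, so cluster sizes should not be read as class frequencies.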


In [8]:
# notebook cell: embed & visualize
embeddings_test = generate_embeddings(model, X_test)

# t-SNE can be computationally expensive; consider using a subset if needed
plot_tsne(embeddings_test, y_test, title="t-SNE on Test Embeddings")


NameError: name 'generate_embeddings' is not defined