
# Baseline Training Notebook  
## Controlled vs Uncontrolled Building Demolition Videos

This notebook demonstrates a **baseline training pipeline** for the controlled vs. uncontrolled building demolition video dataset, using:

- PyTorch
- ResNet18 + LSTM
- RAM-safe video loading
- Focal Loss
- Per-video aggregation for evaluation

This baseline is intentionally **simple and conservative**, designed for **small, imbalanced video datasets**.


## 1. Install Dependencies

In [None]:

!pip install torch torchvision torchaudio opencv-python pyyaml tqdm


## 2. Import Libraries

In [None]:

import torch
import yaml
from model import CNNLSTMVideoClassifier
from dataset_safe import SafeVideoDataset
from losses import BinaryFocalLoss
from utils import compute_metrics_from_preds


## 3. Load Configuration

In [None]:

with open("config.yaml", "r") as f:
    cfg = yaml.safe_load(f)

device = cfg.get("device", "cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
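
The cells in this notebook read a handful of keys from `config.yaml`. As a rough guide, the sketch below lists those key paths (taken from the notebook's own code) and checks that a loaded config provides each one. Every value in `example_cfg`, including the class names, is a placeholder assumption rather than the project's actual setting, and `missing_keys` is a hypothetical helper, not part of the repository. The optional `device` key is left out because the notebook falls back to a default for it.

```python
# Key paths this notebook reads from config.yaml (taken from its cells).
REQUIRED_KEYS = [
    ("data", "class_names"),
    ("video", "num_frames"),
    ("train", "lr"),
    ("train", "weight_decay"),
    ("eval", "clips_per_video"),
    ("eval", "threshold"),
]

# Placeholder values for illustration only (assumptions, not project settings).
example_cfg = {
    "data": {"class_names": ["controlled", "uncontrolled"]},
    "video": {"num_frames": 16},
    "train": {"lr": 1e-4, "weight_decay": 1e-2},
    "eval": {"clips_per_video": 5, "threshold": 0.5},
}


def missing_keys(cfg):
    """Return the dotted key paths that are absent from a loaded config dict."""
    missing = []
    for path in REQUIRED_KEYS:
        node = cfg
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


print(missing_keys(example_cfg))  # an empty list means every key is present
```

Running `missing_keys(cfg)` right after loading the real file gives an early, readable error report instead of a `KeyError` several cells later.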


## 4. Load Dataset

In [None]:

train_ds = SafeVideoDataset(
    root_split_dir="dataset/train",
    class_names=cfg["data"]["class_names"],
    num_frames=cfg["video"]["num_frames"],
    sampling="random",
)

val_ds = SafeVideoDataset(
    root_split_dir="dataset/val",
    class_names=cfg["data"]["class_names"],
    num_frames=cfg["video"]["num_frames"],
    sampling="random",
    fixed_seed=1234,
)

print("Train videos:", len(train_ds))
print("Val videos:", len(val_ds))
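
The validation set passes `fixed_seed=1234` so that "random" sampling stays reproducible from one evaluation run to the next. How `SafeVideoDataset` achieves this internally is not shown in this notebook. The sketch below is one possible scheme, with `sample_frame_indices` and all of its arguments as hypothetical names: derive a per-clip seed from the fixed seed, the video index, and a clip key, then draw sorted frame indices from it.

```python
import hashlib
import random


def sample_frame_indices(total_frames, num_frames, fixed_seed, video_idx, clip_key):
    """Pick `num_frames` sorted frame indices, reproducibly for a given clip.

    Hypothetical helper: the seed is derived from (fixed_seed, video_idx,
    clip_key) via SHA-256, so it is stable across Python processes.
    """
    tag = f"{fixed_seed}:{video_idx}:{clip_key}".encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(tag).digest()[:8], "big")
    rng = random.Random(seed)
    if total_frames >= num_frames:
        # Enough frames: sample without replacement and keep temporal order.
        return sorted(rng.sample(range(total_frames), num_frames))
    # Short video: sample with replacement so the clip length stays fixed.
    return sorted(rng.randrange(total_frames) for _ in range(num_frames))


a = sample_frame_indices(300, 16, 1234, video_idx=0, clip_key="eval_0")
b = sample_frame_indices(300, 16, 1234, video_idx=0, clip_key="eval_0")
print(a == b)  # same seed, video, and clip key give the same indices
```

Hashing the inputs rather than calling Python's built-in `hash` is a deliberate choice here: string hashes are randomized per process by default, which would defeat the reproducibility the fixed seed is meant to provide.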


## 5. Initialize Model

In [None]:

model = CNNLSTMVideoClassifier(
    encoder_name="resnet18",
    pretrained=True,
    lstm_hidden=64,
    lstm_layers=1,
    dropout=0.5,
    unfreeze_layer4=False,
).to(device)

print("Trainable parameters:", sum(p.numel() for p in model.parameters() if p.requires_grad))


## 6. Loss and Optimizer

In [None]:

criterion = BinaryFocalLoss(alpha=0.75, gamma=2.0)

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=cfg["train"]["lr"],
    weight_decay=cfg["train"]["weight_decay"],
)
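
For intuition about what `alpha` and `gamma` do, here is a dependency-free sketch of the standard binary focal loss for a single example, $-\alpha_t (1 - p_t)^\gamma \log p_t$. It is not the project's `BinaryFocalLoss` from `losses.py`, which is not shown here and may differ in details such as batching, reduction, or the `alpha` convention. This sketch assumes `alpha` weights the positive class and `1 - alpha` the negative class.

```python
import math


def binary_focal_loss(logit, target, alpha=0.75, gamma=2.0):
    """Focal loss for one example; `target` is 0 or 1, `logit` is a raw score."""
    # Numerically stable sigmoid: probability of the positive class.
    if logit >= 0:
        p = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        p = e / (1.0 + e)
    p_t = p if target == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha  # class weight (assumed convention)
    p_t = max(p_t, 1e-7)                             # avoid log(0)
    # The (1 - p_t) ** gamma factor shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(4.0, 1)   # confident and correct
hard = binary_focal_loss(-4.0, 1)  # confident and wrong
print(easy < hard)
```

With `gamma=0` the modulating factor disappears and the expression reduces to class-weighted binary cross-entropy, which makes a convenient reference point when checking an implementation.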


## 7. Single Training Step (Sanity Check)

In [None]:

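# Take one sample and add a batch dimension.
frames, labels, _ = train_ds[0]
frames = frames.unsqueeze(0).to(device)
labels = labels.unsqueeze(0).float().to(device)

model.train()
optimizer.zero_grad()  # clear gradients left over from earlier runs of this cell
logits = model(frames)
loss = criterion(logits, labels)

# One backward pass and one parameter update.
loss.backward()
optimizer.step()

print("Sanity loss:", loss.item())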


## 8. Per-Video Aggregated Evaluation (Example)

In [None]:

model.eval()

probs = []   # one averaged probability per video
labels = []  # one ground-truth label per video

with torch.no_grad():
    for idx in range(len(val_ds)):
        clip_probs = []
        for k in range(cfg["eval"]["clips_per_video"]):
            frames, label, _ = val_ds.get_clip(idx, clip_key=f"eval_{k}")
            frames = frames.unsqueeze(0).to(device)
            # Clamp logits to keep the sigmoid output away from exactly 0 or 1.
            logit = model(frames).clamp(-10, 10)
            clip_probs.append(torch.sigmoid(logit)[0].item())
        # Average the clip probabilities into a single score for this video.
        probs.append(sum(clip_probs) / len(clip_probs))
        labels.append(label.item())

# Threshold the per-video scores to get hard predictions.
preds = torch.tensor(probs) >= cfg["eval"]["threshold"]
metrics = compute_metrics_from_preds(preds.long(), torch.tensor(labels))

metrics
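
The aggregation logic above can be checked on hand-made numbers without a model or a dataset. The following is a minimal sketch in plain Python: `aggregate_video` mirrors the mean-then-threshold step, and `binary_metrics` is a hypothetical stand-in for `compute_metrics_from_preds`, whose actual implementation and returned metrics are not shown in this notebook.

```python
def aggregate_video(clip_probs, threshold=0.5):
    """Average clip probabilities for one video and threshold the mean."""
    if not clip_probs:
        raise ValueError("need at least one clip per video")
    mean_prob = sum(clip_probs) / len(clip_probs)
    return mean_prob, int(mean_prob >= threshold)


def binary_metrics(preds, labels):
    """Accuracy, precision, recall, and F1 for 0/1 predictions and labels."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Three toy videos with three clip probabilities each (made-up numbers).
videos = [[0.9, 0.8, 0.4], [0.2, 0.6, 0.1], [0.55, 0.45, 0.7]]
toy_labels = [1, 0, 0]
toy_preds = [aggregate_video(v)[1] for v in videos]
toy_metrics = binary_metrics(toy_preds, toy_labels)
print(toy_preds, toy_metrics)
```

Note how the second video has one clip above the threshold yet is still predicted negative: averaging before thresholding lets the other clips outvote a single noisy one, which is the point of per-video aggregation.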



## Notes

- This notebook is a **baseline demonstration**, not a full training loop.
- For full training, use `train.py`.
- This notebook is useful for:
  - sanity checks
  - debugging preprocessing
  - verifying per-video aggregation logic
