This notebook is a wrapper that helps us run a few types of models on a few splits of
data. Its outputs are (1) saved features and (2) a trained model,
selected to have the best dev set score. The notebook is exactly the same as the model training
for the simulation, except that some of the paths and configuration files have been changed.

The main input parameters are the path to the `train_yaml` (relative to the root directory) and the bootstrap index to use. Since we may want to run this notebook as a Python script (using `nbconvert`), we look up these arguments using environment variables.
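
As a minimal sketch of that lookup: the variable names `TRAIN_YAML` and `BOOTSTRAP` and the helper `read_run_args` are hypothetical (they do not come from this notebook), and the fallback values are the ones assigned directly in the first code cell.

```python
import os
from pathlib import Path


def read_run_args(env=os.environ):
    """Read the two run parameters from a mapping of environment variables.

    TRAIN_YAML and BOOTSTRAP are hypothetical names chosen for illustration;
    the defaults match the values assigned in the notebook's first code cell.
    """
    train_yaml = Path(env.get("TRAIN_YAML", "conf/tnbc_rcf-k256.yaml"))
    # Environment variables are always strings, so cast the index explicitly.
    bootstrap = int(env.get("BOOTSTRAP", "1"))
    return train_yaml, bootstrap


train_yaml, bootstrap = read_run_args()
```

Passing the mapping as an argument keeps the helper easy to exercise without touching the real environment, for example `read_run_args({"BOOTSTRAP": "7"})`.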

In [None]:
import os
import json
from addict import Dict
from pathlib import Path
import sys
sys.path.append("../../../../inst/python")
from data import initialize_loader
from models.vae import VAE, vae_loss
from models.cnn import CBRNet, cnn_loss
import models.random_features as rcf
import train as st
import train_rcf as srcf
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import pandas as pd
import torch
import torch.optim
import yaml

train_yaml = Path("conf/tnbc_rcf-k256.yaml")
bootstrap = 1
data_dir = Path("../../../data/raw_data/stability_data/")
save_dir = Path("../../../data/derived_data/tnbc_models") / train_yaml.name.replace(".yaml", "") / str(bootstrap)
save_dir.mkdir(parents=True, exist_ok=True)
(save_dir / "features" / "logs").mkdir(parents=True, exist_ok=True)
# Use a context manager so the configuration file handle is closed after reading
with open(train_yaml, "r") as f:
    opts = Dict(yaml.safe_load(f))
print(opts.train)

We assume that the data have already been preprocessed using the `prepare_mibi.Rmd` document, also in this `data_analysis` folder. A saved version of those outputs is provided in the `stability_data_tnbc.tar.gz` archive. The block below unzips the archive so that the data can be referred to during model training. Note that this deletes the existing `stability_data/` folder first, so it will overwrite any previously unzipped simulation data.

In [None]:
%%capture
%cd ../../data/raw_data/
!rm -rf stability_data/
!tar -zxvf stability_data_tnbc.tar.gz
%cd ../../data_analysis/learning/
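
The cell above relies on IPython magics (`%cd`, `!`), which only work when the code runs under IPython. A minimal sketch of the same delete-then-extract step in plain Python follows; the helper name `refresh_archive` is hypothetical, and it assumes the archive's top-level folder is `stability_data/`, as the shell commands above imply.

```python
import shutil
import tarfile
from pathlib import Path


def refresh_archive(archive, target_dir):
    """Delete target_dir if it exists, then extract archive into its own folder."""
    archive, target_dir = Path(archive), Path(target_dir)
    if target_dir.exists():
        shutil.rmtree(target_dir)  # counterpart of `rm -rf stability_data/`
    with tarfile.open(archive, "r:gz") as tar:
        # Counterpart of running `tar -zxf` from the archive's folder
        tar.extractall(path=archive.parent)


# Example call, using the paths from the cell above (not executed here):
# refresh_archive("../../data/raw_data/stability_data_tnbc.tar.gz",
#                 "../../data/raw_data/stability_data")
```

Unlike the `%cd` version, this leaves the working directory unchanged, so no second `%cd` is needed to return to the notebook folder.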

Next, we'll create directories for saving all the features. We'll also read in all paths for training / development / testing. This is a bit more involved than the usual training process, since we'll want loaders specifically for looking at changes in feature activations.

In [None]:
features_dir = data_dir / opts.organization.features_dir
os.makedirs(features_dir, exist_ok=True)

splits = pd.read_csv(data_dir / opts.organization.splits)
resample_ix = pd.read_csv(data_dir / opts.bootstrap.path)

paths = {
    "train": splits.loc[splits.split == "train", "path"].values[resample_ix.loc[bootstrap]],
    "dev": splits.loc[splits.split == "dev", "path"].values,
    "test": splits.loc[splits.split == "test", "path"].values,
    "all": splits["path"].values
}

np.random.seed(0)
save_ix = np.random.choice(len(splits), opts.train.save_subset, replace=False)
loaders = {
    "train_fixed": initialize_loader(paths["train"], data_dir, opts),
    "train": initialize_loader(paths["train"], data_dir, opts, shuffle=True),
    "dev": initialize_loader(paths["dev"], data_dir, opts),
    "test": initialize_loader(paths["test"], data_dir, opts),
    "features": initialize_loader(paths["all"][save_ix], data_dir, opts)
}
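
The bootstrap selection used for the `"train"` paths can be illustrated on toy data. This is only a sketch: the table layout is an assumption (row `b` of the bootstrap table holding integer positions, sampled with replacement, into the training paths), and the file names are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the splits table: one row per image with its split label.
splits = pd.DataFrame({
    "path": [f"tile_{i}.npy" for i in range(6)],
    "split": ["train", "train", "train", "train", "dev", "test"],
})

# Toy stand-in for the bootstrap table (assumed layout): row b holds the
# positions, drawn with replacement, of the training paths for replicate b.
rng = np.random.default_rng(0)
n_train = int((splits.split == "train").sum())
resample_ix = pd.DataFrame(rng.integers(0, n_train, size=(3, n_train)))

bootstrap = 1
train_paths = splits.loc[splits.split == "train", "path"].values
# Index the array of training paths with the resampled integer positions.
boot_paths = train_paths[resample_ix.loc[bootstrap].to_numpy()]
```

Because the positions are drawn with replacement, some training paths typically appear more than once in `boot_paths` and others not at all, while the dev and test paths are never resampled.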

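Let's define the model and the loss function. This is not super elegant: it is basically a long `if` / `elif` chain over the model type. The `rcf` branch defines no loss function, since that model is not trained by gradient descent.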

In [None]:
if opts.train.model == "cnn":
    model = CBRNet(nf=opts.train.nf, p_in=opts.train.p_in)
    loss_fn = cnn_loss
elif opts.train.model == "vae":
    model = VAE(z_dim=opts.train.z_dim, p_in=opts.train.p_in)
    loss_fn = vae_loss
elif opts.train.model == "rcf":
    patches = rcf.random_patches([data_dir / p for p in paths["train"]], k=opts.train.n_patches)
    model = rcf.WideNet(patches)
else:
    raise NotImplementedError(f"Unknown model type: {opts.train.model}")

Next, let's prepare a logger to save the training progress. We also save the indices of the samples for which we'll write activations. Writing activations for all the samples would be too much, and is not really necessary.

In [None]:
subset_path = data_dir / opts.organization.features_dir / "subset.csv"
splits.iloc[save_ix, :].to_csv(subset_path)
writer = SummaryWriter(features_dir / "logs")
writer.add_text("conf", json.dumps(opts))
out_paths = [
    save_dir / opts.organization.features_dir, # where features are saved
    save_dir / opts.organization.metadata, # metadata for features (e.g., layer name)
    save_dir / opts.organization.model # where model gets saved
]

Finally, we can train our model. Training for the random convolutional features model is just ridge regression, so no iterations are necessary. For the CNN and VAE, all the real logic is hidden away in the `st.train` function. The trained model and extracted features get saved into the `save_dir` folder. To save features across many runs, we rerun this notebook across many values of the `bootstrap` parameter. We find this step worth parallelizing on a computer cluster. The HTCondor submit scripts used in our paper are available [here](https://github.com/krisrs1128/learned_inference/blob/master/run_scripts/train.submit).

In [None]:
if opts.train.model == "rcf":
    srcf.train_rcf(model, loaders, out_paths, alpha=opts.train.alpha, l1_ratio=opts.train.l1_ratio, normalize=True)
else:
    optim = torch.optim.Adam(model.parameters(), lr=opts.train.lr)
    st.train(model, optim, loaders, opts, out_paths, writer, loss_fn)
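
To make the "no iterations" point concrete, here is a minimal NumPy sketch of ridge regression on random convolutional features. It is not the implementation in `srcf.train_rcf` or `rcf.WideNet`, which is not shown in this notebook: the helpers `featurize` and `ridge_fit`, the random filters (standing in for patches sampled from training images), and the toy data are all assumptions made for illustration.

```python
import numpy as np


def featurize(images, filters):
    """Convolve with fixed filters, apply ReLU, then average-pool each map."""
    n, h, w = images.shape
    k = filters.shape[1]
    feats = np.zeros((n, len(filters)))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            window = images[:, i:i + k, j:j + k]
            # Response of every filter at this location, for every image
            response = np.einsum("nab,fab->nf", window, filters)
            feats += np.maximum(response, 0)
    return feats / ((h - k + 1) * (w - k + 1))


def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: w = (X'X + alpha * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)


rng = np.random.default_rng(0)
images = rng.normal(size=(50, 8, 8))   # toy single-channel images
filters = rng.normal(size=(16, 3, 3))  # stand-in for sampled patches

X = featurize(images, filters)
y = X @ rng.normal(size=16) + 0.01 * rng.normal(size=50)
w_hat = ridge_fit(X, y, alpha=1e-3)
```

The filters are never updated, so the only fitted quantity is `w_hat`, obtained from a single linear solve rather than from an optimizer loop.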