## Overview

This notebook covers the basic model building, data generation, storage, and loading functions of the library. We start by demonstrating how to extract and save features from whole-slide images using `HistomicsStream` and the `mil.io` subpackage. Then we demonstrate how to build either set-based or structured models from these datasets using the `mil.models` subpackage.

Concepts:
   - Feature extraction (multi-GPU)
   - Wrapping models for HistomicsStream
   - Structured and flattened data formats
   - Convolutional and dense WS/MIL models

In [None]:
# install openslide, histomics_stream, pandas
!apt-get update
!apt-get install -y openslide-tools
!pip install openslide-python
!pip install histomics_stream 'large_image[openslide]' scikit_image --find-links https://girder.github.io/large_image_wheels
!pip install pandas

# install mil library with extra ray[tune]
user = '########' #git username
token = '################' #personal access token - see https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
branch = 'dev'
!python -m pip install git+https://{user}:{token}@github.com/PathologyDataScience/mil.git@{branch}#egg=mil[ray]
        
# imports
from mil.metrics import Balanced, F1, Mcc, Sensitivity, Specificity
from mil.models import convolutional_model, attention_flat, attention_flat_tune
from mil.io.reader import read_record, peek
from mil.io.transforms import parallel_dataset
from mil.io.utils import inference, study
from mil.io.writer import write_record
import numpy as np
import os
import pandas as pd
import tensorflow as tf
import ray

## Feature extraction parameters

Feature extraction parameters strongly affect both model accuracy and training time. Here, we specify the magnification at which features are extracted, the size of the tiles that features are extracted from, the overlap between adjacent tiles, and the size of each read (chunk), with all sizes given in pixels. We also specify a pre-trained model to use for feature extraction and the location of the whole-slide images. A large overlap makes it more likely that important structures appear whole in at least one tile, but it significantly increases the amount of data that is saved and subsequently used in training.

Parameters are also required for saving the features in .tfr files. We have to define the names of the subject labels stored in the .tfr, and the location to save these files.

In [None]:
# feature extraction parameters
t=224 # tile size (pixels)
overlap=0 # tile overlap (pixels)
chunk=1792 # chunk size (pixels)
magnification=20 # magnification
tile_batch=128 # the number of tiles to batch
tile_prefetch=2 # the number of batches to prefetch
model_name='EfficientNetV2L' # the pre-trained model for feature extraction
svspath = '/data/transplant/nwu/wsi/' # path for the whole-slide images

# .tfr saving parameters
csvfile = '/data/transplant/nwu/CTOT08_clinical_BiopsyImageKeys.csv' # path for the table containing subject data
column_mapping = {'SVS_FileName': 'name', 'G': 'g', 'PTC': 'ptc',
                  'V': 'v', 'TG': 'tg', 'CG': 'cg', 'MM': 'mm', 'CI': 'ci',
                  'CT': 'ct', 'CV': 'cv', 'I': 'i', 'T': 't', 'AH': 'ah'} # csv column name -> label name stored in the .tfr

outpath = f'/data/renal_allograft/nwu_features/{model_name}_{t}_{overlap}_{magnification}X/' # location to store structured .tfr files
os.makedirs(outpath, exist_ok=True) # also creates missing parent directories
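
To get a feel for how tile size, overlap, and chunk size interact, the following minimal sketch counts the whole tiles that fit in one chunk. The helper `tiles_per_side` is hypothetical and not part of HistomicsStream, and it assumes that only whole tiles are kept; the library's handling of partial tiles at chunk or slide edges may differ.

```python
def tiles_per_side(extent, tile, overlap):
    """Number of whole tiles that fit along one axis of length `extent` (pixels)."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    if extent < tile:
        return 0
    stride = tile - overlap  # distance between the origins of adjacent tiles
    return (extent - tile) // stride + 1


# tile and chunk sizes from the parameter cell above
t, chunk = 224, 1792

no_overlap = tiles_per_side(chunk, t, overlap=0) ** 2
half_overlap = tiles_per_side(chunk, t, overlap=t // 2) ** 2

print(no_overlap)                 # 64 tiles per chunk (8 x 8)
print(half_overlap)               # 225 tiles per chunk (15 x 15)
print(half_overlap / no_overlap)  # 3.515625
```

Under these assumptions, a 50% overlap yields roughly 3.5 times as many feature vectors per chunk as no overlap, which is the storage and training cost referred to above.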

## Loading and wrapping the feature extraction model

To use HistomicsStream for feature extraction, we have to wrap the feature extraction model so that the tile location information and other metadata can be passed through the model and captured at the output registered to the features. HistomicsStream is used to generate a tf.data.Dataset of tiles, and features are extracted from this using tf.keras.Model.predict. Since predict takes a single input, we combine (tiles, tile_metadata) for passing to the wrapped model. Inside the wrapper these are separated and inference is done on the tiles. Wrapping is necessary to avoid having the tile_metadata discarded. We also add a dummy 'y' variable 0. to be discarded by predict.

To enable multi-GPU feature extraction, the feature extraction model is loaded outside the parallel context, and is wrapped inside the parallel context.

In [None]:
# define the wrapped model class
class WrappedModel(tf.keras.Model):
    def __init__(self, extractor, *args, **kwargs):
        super(WrappedModel, self).__init__(*args, **kwargs)
        self.model = extractor
        
    def call(self, inputs, *args, **kwargs):
        return self.model(inputs[0]), inputs[1]
    

# create the feature extractor model to be wrapped
model = tf.keras.applications.efficientnet_v2.EfficientNetV2L(
        include_top=False, weights='imagenet', input_shape=(t, t, 3),
        pooling='avg')

# get dimensionality of extracted features
D = model.output_shape[-1]

# create a distributed wrapped model
with tf.distribute.MirroredStrategy().scope():
    
    # wrap the model
    wrapped_model = WrappedModel(model, name='wrapped_model')
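
The pass-through idea behind `WrappedModel` can be illustrated without TensorFlow. In this sketch, `PassThrough` and `toy_extractor` are hypothetical stand-ins (not library code): the wrapper splits the combined input, runs the extractor on the tiles only, and returns the metadata unchanged next to the features.

```python
class PassThrough:
    """Hypothetical stand-in for WrappedModel: applies `extractor` to the tiles
    and returns the tile metadata untouched alongside the result."""

    def __init__(self, extractor):
        self.extractor = extractor

    def __call__(self, inputs):
        tiles, tile_metadata = inputs
        return self.extractor(tiles), tile_metadata


def toy_extractor(tiles):
    # stand-in for a CNN: one "feature" per tile, the mean pixel value
    return [sum(tile) / len(tile) for tile in tiles]


batch = (
    [[0, 2, 4], [10, 20, 30]],         # two tiny fake "tiles"
    {"x": [0, 224], "y": [448, 448]},  # their pixel coordinates
)
features, metadata = PassThrough(toy_extractor)(batch)
print(features)  # [2.0, 20.0]
print(metadata)  # {'x': [0, 224], 'y': [448, 448]}
```

Because the metadata travels with the features through the call, each feature vector stays registered to the tile it came from.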

## Creating .tfr files

This cell reads in a subject .csv file to build a table of whole-slide image files and subject labels. Each row of the table defines one subject. In this example each subject has a single whole-slide image, although the library supports multiple images per subject. In that case, the features from multiple images are stored in a single .tfr, along with an array indexing each feature to each slide, and lists of slide names and properties.

We iterate over each row of the table, generating a HistomicsStream study, doing inference on the resulting tf.data.Dataset, and saving the features to a .tfr along with tile, slide, and subject metadata.

In this example, we write the features in a structured format, which places the feature vector obtained from each tile into a 3D tensor that preserves the spatial organization of the tiles in the slide. This enables us to build convolutional models that can leverage spatial information spanning multiple tiles. Structured format is not supported when a subject has multiple slides.

In [None]:
# extract labels from csv
table = pd.read_csv(csvfile)
table = table[list(column_mapping.keys())]
table = table.rename(columns=column_mapping)

# match table entries to existing files
files = [slide for slide in os.listdir(svspath) if os.path.splitext(slide)[1] == '.svs']
table = table[table.name.isin(files)].reset_index(drop=True) # drop=True keeps the old index from being written as a label

# write a tf record for each slide
for i, entry in table.iterrows():
    
    # slide
    slide = entry['name']
    
    # add the subject labels present in the table
    label = {l:float(entry[l]) for l in entry.keys() if l != 'name'}
    
    # we can also add custom metadata as scalars, lists of scalars, and np.ndarrays
    label['model_name'] = model_name 
    label['stain'] = 'periodic acid–schiff'
    label['encounters'] = ['2/16/19', '4/1/20']
    label['test_results'] = np.array([0.3, 1.1])
    label['age'] = 64
    
    # generate tf record filename
    filename = slide + '.' + model_name + '_' + str(t) + '_' + str(magnification) + 'X_2d.tfr'     
    
    # skip if file exists
    if os.path.exists(outpath + filename):
        continue
    else:
        print(slide)
    
    # create histomics stream study
    try:
        hs_study = study([svspath+slide], (t, t), (overlap, overlap), (chunk, chunk), magnification)
    except Exception:
        print('Skipping slide ' + slide + ', slide reading error.')
        continue
        
    # do inference
    features, tile_info = inference(wrapped_model, hs_study, batch=tile_batch, prefetch=tile_prefetch)

    # write record to disk
    write_record(outpath + filename, features, tile_info, label, structured=True)
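
The difference between the flattened and structured layouts can be sketched in plain Python. The helper `to_structured` below is hypothetical and only illustrates the idea; how `write_record` orders tiles or fills positions with no tile is not shown in this notebook, so the zero fill here is an assumption.

```python
def to_structured(features, tile_xy, stride):
    """Place each tile's feature vector at its (row, col) position in a grid."""
    cols = [x // stride for x, _ in tile_xy]
    rows = [y // stride for _, y in tile_xy]
    depth = len(features[0])
    # positions with no tile (e.g. background) stay as zero vectors
    grid = [[[0.0] * depth for _ in range(max(cols) + 1)]
            for _ in range(max(rows) + 1)]
    for vec, r, c in zip(features, rows, cols):
        grid[r][c] = list(vec)
    return grid


# flattened layout: three tiles of a 2 x 2 grid; the tile at (224, 224) is missing
features = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5]]
tile_xy = [(0, 0), (224, 0), (0, 224)]  # (x, y) of each tile's top-left corner

grid = to_structured(features, tile_xy, stride=224)
print(len(grid), len(grid[0]), len(grid[0][0]))  # 2 2 2  (rows, cols, D)
print(grid[0][1])  # [2.0, 2.5]
print(grid[1][1])  # [0.0, 0.0]
```

In the structured layout, neighboring entries of the grid correspond to neighboring tiles in the slide, which is what a convolution needs.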
    

## Inspect a .tfr file

`mil.io.reader.peek` inspects the contents of a .tfr file and returns a dictionary of the variable names and types. This can be helpful to inspect datasets and determine the user metadata embedded in the .tfr files.

Due to the way TensorFlow handles loading .tfr files, this information cannot be acquired in graph mode, so we capture it here in eager mode and provide it to `mil.io.reader.read_record` when training the networks in graph mode.

In [None]:
import json

# get list of created tf.records
files = [outpath + file for file in os.listdir(outpath) if os.path.splitext(file)[1] == '.tfr']

# inspect contents of one file
serialized = list(tf.data.TFRecordDataset(files[0]))[0]
variables = peek(serialized)

# display variables
print(json.dumps(variables, indent=2))

## Train a convolutional model from structured tensors

Here we use the `mil.models` subpackage to build and train a simple convolutional model for the structured tensor. This model uses weighted-average pooling (attention) to pool the convolutional feature maps over the entire image to make a prediction. This enables support for variable-sized images.

We use the `mil.io.reader.read_record` function with a `tf.data.Dataset` to read features in structured format. Interpreting the .tfr requires passing in the label names that were stored within the file. When we load the data, we pick a single label and threshold it to form a binary classification problem (the original labels range from 0 to 4).
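
As a plain-Python illustration of the thresholding step, the hypothetical helper `binarize` below mirrors what the `threshold` function in the following cell does with `tf.one_hot`: a label at or above the cutoff becomes the positive class (index 1), and anything below it becomes the negative class (index 0). The cutoff of 2.0 is the default used in that cell.

```python
def binarize(label, cutoff=2.0):
    """One-hot encode `label >= cutoff`: index 0 is negative, index 1 is positive."""
    positive = int(label >= cutoff)
    return [1.0 - positive, float(positive)]


for score in [0.0, 1.0, 2.0, 4.0]:
    print(score, binarize(score))
# 0.0 [1.0, 0.0]
# 1.0 [1.0, 0.0]
# 2.0 [0.0, 1.0]
# 4.0 [0.0, 1.0]
```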

We also use several metrics from the `mil.metrics` subpackage to monitor performance during training. These metrics were implemented to address the specific issues of validating pathology models, and are not available in the TensorFlow core package.

In [None]:
# create a list of metrics to monitor performance during training
metrics = [tf.keras.metrics.BinaryAccuracy(),
           tf.keras.metrics.AUC(curve='ROC'),
           Balanced(threshold=0.5),
           F1(threshold=0.5),
           Mcc(threshold=0.5),
           Sensitivity(threshold=0.5),
           Specificity(threshold=0.5)]

# create and compile model
model = convolutional_model(D)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), 
              loss={'softmax': tf.keras.losses.BinaryCrossentropy()},
              metrics={'softmax': metrics})

#define label function for training dataset
def threshold(value, key='t', cond=lambda x: x>=2.0):
    return tf.one_hot(tf.cast(cond(value[key]), tf.int32), depth=2)

# build dataset and train
train_ds = tf.data.TFRecordDataset(files, num_parallel_reads=4).shuffle(len(files))
train_ds = train_ds.map(lambda x: read_record(x, variables, structured=True))
train_ds = train_ds.map(lambda x, y, z, _: (x, threshold(y, 't')[0]))
train_ds = train_ds.batch(1).prefetch(2)

# train model
model.fit(train_ds, epochs=5) # the dataset is already batched, so batch_size is not passed
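
For reference, the quantities behind the additional metrics can be written in terms of confusion-matrix counts. This is a sketch of the standard definitions, not the `mil.metrics` implementation (which accumulates over batches and may differ in details); `binary_metrics` is a hypothetical helper and assumes that both classes are present.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    balanced = (sensitivity + specificity) / 2
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "balanced": balanced, "f1": f1, "mcc": mcc}


# an imbalanced example: 10 positive and 90 negative cases
m = binary_metrics(tp=8, fp=5, tn=85, fn=2)
accuracy = (8 + 85) / 100
print(accuracy)  # 0.93
for name, value in m.items():
    print(name, round(value, 3))
```

In this example the plain accuracy is 0.93 even though one in five positive cases is missed, which is why metrics that account for class imbalance are useful to monitor.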

## Train a set-based model from flattened tensors

Although the tensors are stored in a structured format, they can also be read in a flattened format using `mil.io.reader.read_record` with `structured=False`.

The `mil.models` subpackage also contains set-based models with dense layers that process each tile independently.

In [None]:
# build and compile model
model = attention_flat(D)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), 
              loss={'softmax': tf.keras.losses.BinaryCrossentropy(), "attention_weights": None},
              metrics={'softmax': metrics, "attention_weights": None})

# build dataset and train
train_ds = tf.data.TFRecordDataset(files, num_parallel_reads=4).shuffle(len(files))
train_ds = train_ds.map(lambda x: read_record(x, variables, structured=False))
train_ds = train_ds.map(lambda x, y, z, _: (x, threshold(y, 't')[0]))
train_ds = train_ds.batch(1).prefetch(2)

# train model
model.fit(train_ds, epochs=5) # the dataset is already batched, so batch_size is not passed
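
The weighted-average (attention) pooling that lets these models accept a variable number of tiles can be sketched as follows. This is a minimal illustration under simplifying assumptions: `attention_pool` is a hypothetical helper, and the per-tile scores are supplied by hand, whereas in a trained model they would come from learned layers.

```python
import math


def attention_pool(features, scores):
    """Softmax the per-tile scores, then average the feature vectors with those weights."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    depth = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features))
              for d in range(depth)]
    return pooled, weights


# bags with different numbers of tiles pool to vectors of the same length
small_bag = [[1.0, 0.0], [3.0, 2.0]]
large_bag = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
for bag in (small_bag, large_bag):
    pooled, weights = attention_pool(bag, scores=[0.0] * len(bag))
    print(len(weights), pooled)
# 2 [2.0, 1.0]
# 4 [4.0, 3.0]
```

With equal scores the pooling reduces to a plain average; unequal scores shift the result toward the tiles with the largest weights.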

### Multi-GPU training

Distributed training can be performed by transforming the dataset to address variable image sizes.

When creating a model, setting `ragged=True` tells the model to expect a ragged dataset, in which feature tensors of possibly different sizes are batched together.

The function `mil.io.transforms.parallel_dataset` performs the necessary transformation of the dataset.

In [None]:
# create a MirroredStrategy for multi-GPU training
strategy = tf.distribute.MirroredStrategy()  

# create and compile the model and metrics in the strategy scope
with strategy.scope():
    
    # create a model with ragged inputs
    model = attention_flat(D, ragged=True) 
    
    # metrics will be aggregated across gpus
    metrics = [tf.keras.metrics.BinaryAccuracy(),
            tf.keras.metrics.AUC(curve='ROC'),
            Balanced(threshold=0.5),
            F1(threshold=0.5),
            Mcc(threshold=0.5),
            Sensitivity(threshold=0.5),
            Specificity(threshold=0.5)]

    # compile the model
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), 
                loss={'softmax': tf.keras.losses.BinaryCrossentropy()},
                metrics={'softmax': metrics})


#define label function for training dataset
def threshold(value, key='t', cond=lambda x: x>=2.0):
    return tf.one_hot(tf.cast(cond(value[key]), tf.int32), depth=2)

# build dataset and train
train_ds = tf.data.TFRecordDataset(files, num_parallel_reads=strategy.num_replicas_in_sync).shuffle(len(files))
train_ds = train_ds.map(lambda x: read_record(x, variables, structured=False))
train_ds = train_ds.map(lambda x, y, z, _: (x, threshold(y, 't')[0]))
train_ds = parallel_dataset(train_ds, 
                            D, 
                            strategy.num_replicas_in_sync,
                            structured=False)
train_ds = train_ds.prefetch(2)

# train model
model.fit(train_ds, batch_size=strategy.num_replicas_in_sync, epochs=5)
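
To see why variable-sized inputs need a ragged representation, consider batching two slides with different numbers of tiles. The sketch below uses a flat list of values plus row boundaries, which is the same idea as the `row_splits` encoding of `tf.RaggedTensor`. The helpers `to_ragged` and `get_row` are hypothetical, and the internals of `parallel_dataset` are not shown in this notebook.

```python
def to_ragged(bags):
    """Concatenate variable-length bags and record where each one starts and ends."""
    values, row_splits = [], [0]
    for bag in bags:
        values.extend(bag)
        row_splits.append(len(values))
    return values, row_splits


def get_row(values, row_splits, i):
    """Recover bag `i` from the flat values."""
    return values[row_splits[i]:row_splits[i + 1]]


# two slides with 3 and 1 tiles; each tile has a 2-dimensional feature vector
bags = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [[0.7, 0.8]]]
values, row_splits = to_ragged(bags)
print(row_splits)                      # [0, 3, 4]
print(get_row(values, row_splits, 1))  # [[0.7, 0.8]]
```

No padding is needed in this encoding, so slides of any size can share a batch.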

### Hyperparameter tuning

Hyperparameter tuning for the `attention_flat` model can be performed by creating an object of class `attention_flat_tune` and setting the number of trials (`trial_num`) and the number of GPUs/CPUs allocated per trial (`resources_per_trial`).

In [None]:
#create an object of attention_flat_tune
tuner = attention_flat_tune(trial_num=20, resources_per_trial=1)
config = tuner.get_config()
print(config) # a bare expression is only displayed when it is the last line of a cell

# Modify and set tuning config
config["dataset_params"] = {"files": files, "variables": variables, "structured": False}
config["training_params"]["D"] = D
tuner.set_config(config)

# Run tuner to look for the best hyperparameters
attention_flat_best_params = tuner.tune()

In [None]:
# Build attention_flat using the best hyperparameters
model = attention_flat(D, config=attention_flat_best_params, ragged=False) 

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), 
              loss={'softmax': tf.keras.losses.BinaryCrossentropy()},
              metrics={'softmax': metrics})

# build dataset and train
train_ds = tf.data.TFRecordDataset(files, num_parallel_reads=4).shuffle(len(files))
train_ds = train_ds.map(lambda x: read_record(x, variables, structured=False))
train_ds = train_ds.map(lambda x, y, z, _: (x, threshold(y, 't')[0]))
train_ds = train_ds.batch(1).prefetch(2)

# train model
model.fit(train_ds, epochs=5) # the dataset is already batched, so batch_size is not passed