# Synthetic Staining and Binary Classification Pipeline

## Introduction

This notebook guides you through taking SVS files (or synthetically generated PNGs), synthetically staining a-synuclein (a-syn) proteins, and then classifying pathology based on their presence.


## Setup
The following cells cover the basic Python environment requirements. For each step of the process (colourisation, training and validation), the relevant Polygeist module is imported just before the example code that uses it.

In [None]:
# Real vs Fake Config
is_synthetic = True
if is_synthetic:
    datadir = "./Data/fake"  # Input examples supplied in the repo
else:
    datadir = "./Data/real"  # Not supplied

In [None]:
# System utilities
import os
import pathlib
import random
import shutil
import time
from glob import glob

# Numeric includes and plotting
import numpy as np
from pqdm.processes import pqdm

%matplotlib widget
# Image loading
import lycon
from matplotlib import pyplot as plt

# Move cwd to project root
os.chdir("..")

### Configuration (for Staining)
We will use a pretrained Caffe model, which ships with the iDeepColor repository, to stain our a-syn. The following configuration parameters are used during the staining procedure:

`state_path` : This is the location of our Caffe model

`dump_path_segmented`: This is the path where we will dump each portion (stained window segment) of our images that we stain.

`dump_path_full` : This is where we will dump all of our full scale stained images (stitched together from multiple segments)

## Synthetic Staining Procedure

First, we tumble over our slide using the `staining_window`. For each window we produce a conservative binary mask, which demarcates some a-syn as well as some unwanted material such as cell bodies. This binary mask and the monochrome slide are passed to the iDeepColor network, with the intention that it fills in more a-syn and does not fill weakly masked bodies (such as the neuromelanin pigmentation). Each window segment is then resized and dumped to disk if we detect some staining. All segments, regardless of staining, are also stitched together to produce a full-resolution stained slide.
![alt text](assets/im1.png "Colourisation")

We invert each slide (`I_invert = 255 - I`) to make the stains easier to see.
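
As a rough illustration of the windowing, inversion and masking steps described above, the following sketch tumbles a fixed-size window over a greyscale array and builds a conservative binary mask by thresholding the inverted tile. The names `iter_windows` and `conservative_mask`, the window size and the threshold value are assumptions made for this example; this is not the Polygeist implementation, and the iDeepColor network call is omitted.

```python
import numpy as np


def iter_windows(image, window):
    """Yield (row, col, tile) for non-overlapping windows over a 2D image."""
    height, width = image.shape[:2]
    for row in range(0, height, window):
        for col in range(0, width, window):
            yield row, col, image[row:row + window, col:col + window]


def conservative_mask(tile, threshold=200):
    """Mark only strongly dark pixels: invert the tile, then threshold high."""
    inverted = 255 - tile.astype(np.int16)  # I_invert = 255 - I
    return inverted >= threshold


# Hypothetical 4x4 monochrome "slide" with a single dark (stained) pixel
slide = np.full((4, 4), 250, dtype=np.uint8)
slide[1, 2] = 10

tiles = list(iter_windows(slide, window=2))
masks = {(row, col): conservative_mask(tile) for row, col, tile in tiles}

# Only windows with at least one masked pixel would be dumped as segments
stained_tiles = [origin for origin, mask in masks.items() if mask.any()]
```

A high threshold keeps the mask conservative: it misses faint staining on purpose, leaving the colourisation network to fill in the rest.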

In [None]:
from polygeist.preprocess import colourise_slide_and_segment

In [None]:
# Destination location for stained png images
# Expecting PD and Control subdirectories to exist
dump_path_full = f"{datadir}/full_stain_dump"
dump_path_segmented = f"{datadir}/segmented_dump"

### Example configuration for fake data
This first configuration is for fake data to allow cursory code testing and dissemination.

In [None]:
# Get the positive runs and negative runs - EXAMPLE CONFIG FOR FAKE DATA
if is_synthetic:
    positive_run = glob(f"{datadir}/input/PD/*.png")
    negative_run = glob(f"{datadir}/input/Control/*.png")

    # Use all fake slides

### Example configuration for real data - Only use DMNoV
For actual slides, we filter slide names by ID=17, which indicates slides of the Dorsal Motor Nucleus of the Vagus (DMNoV). This region has a-syn present for Braak 1+ PD cases, but should not have any present for the control cases.

In [None]:
# Get the positive runs and negative runs- EXAMPLE CONFIG FOR REAL DATA
if not is_synthetic:
    positive_run = [x for x in glob(f"{datadir}/input/PD/*.svs") if "-17_" in x]
    negative_run = [x for x in glob(f"{datadir}/input/Control/*.svs") if "-17_" in x]

    # Only look at Braak 1, slide 17 for this run (The Slide index is in the filename)
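
The filter above matches the substring `-17_` anywhere in the path, including directory names. A slightly stricter sketch matches against the base name only. The filenames below are hypothetical; the real naming scheme is only assumed to embed the slide index as `-<index>_`.

```python
import os

# Hypothetical filenames for illustration only
candidate_files = [
    "Data/real/input/PD/case01-17_a.svs",
    "Data/real/input/PD/case01-04_a.svs",
    "Data/real/input/PD/case02-17_b.svs",
]

slide_index = 17
token = f"-{slide_index}_"

# Match on the file name rather than the whole path
selected = [f for f in candidate_files if token in os.path.basename(f)]
```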

### Run preprocessing on slide data

In [None]:
# Prepare parameters for colourise_slide_and_segment jobs
positive_run_kwargs = [
    {
        "slide_file": x,
        "is_synthetic": is_synthetic,
        "dump_path_full": dump_path_full,
        "dump_path_segmented": dump_path_segmented,
        "subdirectory": "PD",
    }
    for x in positive_run
]

negative_run_kwargs = [
    {
        "slide_file": x,
        "is_synthetic": is_synthetic,
        "dump_path_full": dump_path_full,
        "dump_path_segmented": dump_path_segmented,
        "subdirectory": "Control",
    }
    for x in negative_run
]

In [None]:
# Produce the slide sections and full stains for the PD and Control groups for Slide 17 (real data).
# We set 1 worker here, which can be increased depending on available compute.
# For the real dataset, the running time on a 3090 (utilisation around 30%) was about 10 hours.

_ = pqdm(
    positive_run_kwargs, colourise_slide_and_segment, n_jobs=1, argument_type="kwargs"
)
_ = pqdm(
    negative_run_kwargs, colourise_slide_and_segment, n_jobs=1, argument_type="kwargs"
)
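
With `argument_type="kwargs"`, each dictionary in the job list is expanded into keyword arguments for the target function. The serial sketch below shows the equivalent calls and can be handy for debugging a single slide without worker processes. `run_jobs_serially` and `fake_stain` are hypothetical helpers, and collecting exceptions in the result list is a choice made for this sketch (pqdm's default behaviour is assumed to be similar, which is worth checking because the results above are discarded into `_`).

```python
def run_jobs_serially(jobs, function):
    """Call function(**job) for each job dict, collecting results or exceptions."""
    results = []
    for job in jobs:
        try:
            results.append(function(**job))
        except Exception as error:
            # Keep going so that one bad slide does not stop the whole run
            results.append(error)
    return results


def fake_stain(slide_file, subdirectory):
    # Hypothetical stand-in for colourise_slide_and_segment
    return f"{subdirectory}/{slide_file}"


jobs = [
    {"slide_file": "a.png", "subdirectory": "PD"},
    {"slide_file": "b.png", "subdirectory": "Control"},
]
results = run_jobs_serially(jobs, fake_stain)

# Surface any failures instead of silently discarding them
failures = [r for r in results if isinstance(r, Exception)]
```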

## Generating testing and training sets for our filtered patches

Our filtered patches will now contain a number of legitimate stains, as well as edge cases (from the edge of the slide) and foreign bodies (such as mould). We will split these into training and validation sets. Note that this only needs to be done once, so it can be skipped if you have already done it.

In [None]:
# Now we have our folders, we need to create a training and validation set.
# We will use a clean copy of the data for performance, repeatability and safety.
training_dump_path = f"{datadir}/training_dump"

In [None]:
# Should we copy files to create training dataset?
# Skip this cell if the data has already been prepared
# Do not run twice as it does not remove old datasets.
skip = len(glob(f"{training_dump_path}/train/*/*.png")) > 0

if not skip:
    # Splits
    prop_data_train = 0.75

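    # Copy and partition the files (train and val)
    for s in ["Control", "PD"]:
        # Create the destination folders if they do not exist yet
        for subset in ["train", "val"]:
            os.makedirs(f"{training_dump_path}/{subset}/{s}", exist_ok=True)
        for file in glob(f"{dump_path_segmented}/{s}/*.png"):
            # basename for dumping out
            base = os.path.basename(file)
            # Send roughly prop_data_train of the files to the training set
            if random.random() < prop_data_train:
                shutil.copyfile(file, f"{training_dump_path}/train/{s}/" + base)
            else:
                shutil.copyfile(file, f"{training_dump_path}/val/{s}/" + base)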
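
Because the split above draws from the global random state, rerunning it on a fresh folder gives a different partition each time. If repeatability matters, one option is a seeded local generator over a sorted file list, sketched below. `split_files`, the seed value and the temporary files are assumptions for this example, not part of Polygeist.

```python
import os
import random
import shutil
import tempfile


def split_files(files, out_dir, label, train_fraction=0.75, seed=0):
    """Copy files into out_dir/{train,val}/label using a seeded random split."""
    rng = random.Random(seed)  # local generator so the split is repeatable
    counts = {"train": 0, "val": 0}
    for path in sorted(files):  # sort so that glob ordering cannot change the split
        subset = "train" if rng.random() < train_fraction else "val"
        target_dir = os.path.join(out_dir, subset, label)
        os.makedirs(target_dir, exist_ok=True)
        shutil.copyfile(path, os.path.join(target_dir, os.path.basename(path)))
        counts[subset] += 1
    return counts


# Demonstration on throwaway empty files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.makedirs(src)
    files = []
    for i in range(20):
        path = os.path.join(src, f"patch_{i}.png")
        with open(path, "wb") as handle:
            handle.write(b"")
        files.append(path)
    counts = split_files(files, os.path.join(tmp, "training_dump"), "PD")
    counts_again = split_files(files, os.path.join(tmp, "training_dump_2"), "PD")
```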

## Training pipeline

We are now ready for training: we have chunked through our slides, stained and filtered them, and split the segments into training and validation sets. The next steps are to set up the runtime transformations for training, and then to train our PDNet model.
![alt text](assets/im0.png "Colourisation")

In [None]:
from polygeist.training import train_model

In [None]:
# Our dump path for our model training run, model checkpoints will be saved here
model_dump_dir = f"{datadir}/model_dump"

In [None]:
# Start a timer
start_time = time.time()

latest_model_name = train_model(training_dump_path, model_dump_dir)

time_elapsed = time.time() - start_time
print(f"Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s")

## Model Architecture

We have loaded our PDNet from our model file; the diagram below shows its architecture and the parameters we use to train it.
![alt text](assets/im2.png "Network")

## Validation
Here we load the model that we have just trained; its file name is stored in `latest_model_name`. A previously trained model can be specified manually instead. As before, patches are resized to the network input size. Rather than using `T > 0` as a boolean classifier, we run a sweep of thresholds, which lets us establish the best threshold for our validation set.

In [None]:
from polygeist.validation import plot_roc, validate

In [None]:
# Now we can run validation, on slide and case level
# latest_model_name will have our last model, or it may be specified manually.
# E.g. model_file = f"{model_dump_dir}/PDNET_checkpoint_490_16_18_48"
model_file = f"{model_dump_dir}/{latest_model_name}"

In [None]:
output_data_and_labels = validate(model_file, training_dump_path)

In [None]:
outputs = np.hstack(output_data_and_labels["outputs"])
labels = np.hstack(output_data_and_labels["labels"])

matched = outputs[labels == 1.0]
non_matched = outputs[labels == 0]

_, stats = plot_roc(
    plt, matched, non_matched, return_stats=True, verbose=False, steps=500000
)

specification_metric = "F1"
in_ = np.where(stats[specification_metric] == np.max(stats[specification_metric]))[0][0]
print(
    f"Best M({specification_metric}): gives {stats['H'][in_]} hits and {stats['F'][in_]} FAs, S={stats['S'][in_]}, "
    f"P={stats['P'][in_]},"
    f" F1={stats['F1'][in_]}, A={stats['A'][in_]}"
)

plt.show()
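
To make the threshold sweep concrete, here is a minimal NumPy sketch of how such statistics could be computed. It is not the `plot_roc` implementation. The interpretation of the keys (`H` hit rate, `F` false-alarm rate, `S` specificity, `P` precision, `A` accuracy) is an assumption, and the scores are made up.

```python
import numpy as np


def sweep_thresholds(matched, non_matched, steps=101):
    """Compute per-threshold statistics from positive and negative patch scores."""
    lo = min(matched.min(), non_matched.min())
    hi = max(matched.max(), non_matched.max())
    thresholds = np.linspace(lo, hi, steps)
    # Number of positives / negatives scoring above each threshold
    tp = (matched[None, :] > thresholds[:, None]).sum(axis=1)
    fp = (non_matched[None, :] > thresholds[:, None]).sum(axis=1)
    fn = matched.size - tp
    tn = non_matched.size - fp
    hit_rate = tp / matched.size
    fa_rate = fp / non_matched.size
    precision = tp / np.maximum(tp + fp, 1)  # guard against division by zero
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    accuracy = (tp + tn) / (matched.size + non_matched.size)
    return {"T": thresholds, "H": hit_rate, "F": fa_rate, "S": 1.0 - fa_rate,
            "P": precision, "F1": f1, "A": accuracy}


# Hypothetical scores for positive (matched) and negative (non_matched) patches
matched = np.array([0.9, 0.8, 0.7, 0.6])
non_matched = np.array([0.1, 0.2, 0.3, 0.65])

stats = sweep_thresholds(matched, non_matched)
best = int(np.argmax(stats["F1"]))  # index of the first threshold with maximal F1
```

The broadcast comparison uses memory proportional to the number of steps times the number of patches, so for a sweep as fine as the 500000 steps used above, a loop or a sorted cumulative count would be more appropriate.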

## Conclusions on Training and Validation

Below is the performance from our development run of PDNet:

|              | Hits                 | FAs                   | S                    | P                    | F1                    | A                    |
|--------------|----------------------|-----------------------|----------------------|----------------------|-----------------------|----------------------|
| Best M(F1)   | 0.9292604501607717   | 0.12290502793296089   | 0.8770949720670391   | 0.8831884998207366   | 0.9056389068818823    | 0.9031777111139054   |

Classification is on a per-patch basis, so some aggregation over those patches should yield good classification results per case.  A more conservative threshold should be selected for this task.
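
One simple form of aggregation is to call a case positive when a sufficient fraction of its patches exceed a patch-level threshold. The sketch below illustrates this. The function name, both threshold values and the scores are hypothetical, and this is not a method taken from the development run.

```python
import numpy as np


def classify_cases(patch_scores, patch_threshold, min_positive_fraction):
    """Label a case positive when enough of its patches exceed the threshold."""
    decisions = {}
    for case_id, scores in patch_scores.items():
        scores = np.asarray(scores, dtype=float)
        positive_fraction = float(np.mean(scores > patch_threshold))
        decisions[case_id] = positive_fraction >= min_positive_fraction
    return decisions


# Hypothetical patch scores grouped by case identifier
patch_scores = {
    "case_pd": [0.9, 0.8, 0.2, 0.95],
    "case_control": [0.1, 0.7, 0.2, 0.05],
}
decisions = classify_cases(
    patch_scores, patch_threshold=0.5, min_positive_fraction=0.5
)
```

Raising `patch_threshold` corresponds to the more conservative threshold mentioned above: fewer isolated false alarms count towards a positive case.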

This work shows that classification is possible, and that stains differ in both frequency and appearance between the PD and control groups.

## Example: Marking Regions of Positive Classification

As an additional example, we illustrate the utility of the model by iterating over a fully stained slide and classifying each window, marking any positives as we go.

## Weeding Results

The network was only trained on pre-filtered regions containing a-synuclein, which means that its results on regions that do not contain a-syn marking / highlighting are undefined.

We will go over the stained image window by window and check whether each window was passed to the network. If it was, we look at its score: if the confidence is greater than 95%, we mark the window in red. At the end, we re-encode the annotated image.
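
The marking step can be sketched with plain NumPy as follows. This is an illustration rather than the `label_image_with_confidence` implementation: the score mapping, the window size and the choice to draw a red border (instead of, say, a filled overlay) are all assumptions.

```python
import numpy as np


def mark_confident_windows(image, scores, window, confidence=0.95):
    """Draw a red border around each window whose score exceeds `confidence`.

    `scores` maps (row, col) window origins to a score in [0, 1]. Windows that
    were never passed to the network are simply absent from the mapping.
    """
    marked = image.copy()
    red = np.array([255, 0, 0], dtype=image.dtype)
    for (row, col), score in scores.items():
        if score <= confidence:
            continue
        last_row, last_col = row + window - 1, col + window - 1
        marked[row, col:last_col + 1] = red        # top edge
        marked[last_row, col:last_col + 1] = red   # bottom edge
        marked[row:last_row + 1, col] = red        # left edge
        marked[row:last_row + 1, last_col] = red   # right edge
    return marked


# Hypothetical 8x8 RGB image with two scored 4x4 windows
image = np.zeros((8, 8, 3), dtype=np.uint8)
scores = {(0, 0): 0.99, (4, 4): 0.50}
marked = mark_confident_windows(image, scores, window=4)
```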

In [None]:
from polygeist.example import label_image_with_confidence

In [None]:
# This is the file we will mark with binary results
file_to_stain = f"{datadir}/full_stain_dump/PD/slide_100.png_synthetic_stain.png"
marker_output_path = f"{datadir}/full_stain_PD_with_regions"

In [None]:
# Run the algo using our results that we have just gathered
label_image_with_confidence(model_file, file_to_stain, marker_output_path)

In [None]:
# Now let's load the image and view it.
annotated = lycon.load(f"{marker_output_path}/{os.path.basename(file_to_stain)}")
plt.figure()
plt.imshow(annotated, interpolation="nearest")
plt.axis("off")
plt.show()