This repository contains the segmentation pipeline described in
Benjamin Gallusser, Giorgio Maltese, Giuseppe Di Caprio et al.
Deep neural network automated segmentation of cellular structures in volume electron microscopy,
Journal of Cell Biology, 2022.
Please cite the publication if you are using this code in your research.
Our semi-automated annotation tool from the same publication is available at https://github.com/kirchhausenlab/gc_segment.
- Interactive Demo
- Datasets
- Installation
- Optional: Download our data
- Optional: Docker
- Prepare your own data for prediction
- Prediction
- Prepare your own ground truth annotations for fine-tuning or training
- Fine-Tuning
- Training
An interactive demo is provided via Google Colab. The notebook can be used to work with sample data and learn the basics of incasem. You can also work with your own data in the notebook, but this requires some modifications, as specified in the notebook.
Take a look at the lab's FIB-SEM datasets (raw, labels, predictions) directly in the browser with our simple-to-use cell viewing tool based on neuroglancer.
This package is written for machines with either a Linux or a MacOS operating system.
This README was written to work with the `bash` shell. If you want to use `zsh` (the default on newer versions of macOS) or any other shell, please make sure to adapt the commands accordingly.
Newer versions of macOS (Catalina or newer): the following commands work correctly if you run the Terminal under Rosetta. In Finder, go to `Applications/Utilities`, right-click on `Terminal`, select `Get Info`, and tick `Open using Rosetta`.
Open a terminal window and download Miniconda.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Optional: In case of permission issues, run
chmod +x Miniconda3-latest-Linux-x86_64.sh
Install Miniconda.
bash Miniconda3-latest-Linux-x86_64.sh
Type `conda` to check whether the installation worked; if not, try reloading your `.bashrc` file with `source ~/.bashrc`.
git clone https://github.com/kirchhausenlab/incasem.git ~/incasem
conda create -n incasem --no-default-packages python=3.9
Activate the new environment.
conda activate incasem
Install the incasem package.
pip install -e ./incasem
Install PyTorch as outlined in the official PyTorch installation instructions.
pip install git+https://github.com/kirchhausenlab/funlib.show.neuroglancer.git@incasem_scripts#egg=funlib.show.neuroglancer
- If MongoDB is not already installed on your system (check by running `mongod`), install MongoDB.
- Start up the MongoDB service (refer to the MongoDB documentation):
  - on Ubuntu: `sudo service mongod start`
  - on macOS (assuming that you have installed MongoDB via `homebrew`): `brew services start mongodb-community`
- Run `cd ~/incasem; python download_models.py`
Install Nodeenv.
pip install nodeenv
Create a new node.js environment (and thereby install node.js, if not already installed).
This may take a while.
cd ~; nodeenv omniboard_environment
Activate the environment.
source omniboard_environment/bin/activate
Install omniboard.
This may take a while.
npm install -g omniboard
The datasets in the publication are available in an AWS bucket and can be downloaded with the quilt3 API.
Navigate to `~/incasem/data`:
cd ~/incasem/data
Open a Python session and run the following lines.
It may take a while until the download starts. The expected download speed is >= 2 MB/s.
import quilt3
b = quilt3.Bucket("s3://asem-project")
# download
b.fetch("datasets/cell_6/cell_6_example.zarr/", "cell_6/cell_6.zarr/")
We provide all datasets as 2D `.tiff` images as well as in `.zarr` format, which is more suitable for deep learning on 3D images. Above we only downloaded the `.zarr` format.
Example: Cell 6 raw electron microscopy data, Endoplasmic Reticulum prediction and corresponding Endoplasmic Reticulum ground-truth annotation.
neuroglancer -f cell_6/cell_6.zarr -d volumes/raw_equalized_0.02 volumes/predictions/er/segmentation volumes/labels/er
Navigate to position `520, 1164, 2776` (z, y, x) to focus on the Endoplasmic Reticulum predictions. You can simply overwrite the coordinates on the top left to do so.
If you are not familiar with inspecting 3D data with neuroglancer, you might want to have a look at this video tutorial.
Note: `neuroglancer` might not work in Safari. In this case, simply copy the link given by `neuroglancer` to Chrome or Firefox.
If you are interested in a Docker installation, refer to the steps in `docker/README`.
We assume that the available 3D data is stored as a sequence of 2D `.tif` images in a directory.
cp -r old/data/location ~/incasem/data/my_new_data
cd ~/incasem/scripts/01_data_formatting
In case you have not installed the python environment yet, refer to the installation instructions.
Before running Python scripts, activate the `incasem` environment:
conda activate incasem
Convert the sequence of `.tif` images (3D stack) to `.zarr` format.
python 00_image_sequences_to_zarr.py -i ~/incasem/data/my_new_data -f ~/incasem/data/my_new_data.zarr
To obtain documentation on how to use a script, run `python <script_name>.py -h`.
If your dataset is hundreds of GB in size, try using the conversion script `01_image_sequences_to_zarr_with_dask.py`. You will need to install a different conda environment to work with `dask`; details are given directly in the script.
python 01_image_sequences_to_zarr_with_dask.py -i ~/incasem/data/my_new_data -f ~/incasem/data/my_new_data.zarr -d volumes/raw --resolution 5 5 5
Equalize the raw data with CLAHE (contrast-limited adaptive histogram equalization). The default clip limit is `0.02`.
python 40_equalize_histogram.py -f ~/incasem/data/my_new_data.zarr -d volumes/raw -o volumes/raw_equalized_0.02
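To preview the effect of CLAHE on a single slice before processing the whole volume, you can use scikit-image's `equalize_adapthist`, which implements the same kind of equalization. This is only a sketch (it assumes `zarr` and `scikit-image` are installed and uses the dataset paths from above); the pipeline script handles chunking, data types and metadata for the full 3D volume.

```python
import os

import numpy as np
import zarr
from skimage import exposure

# Open the raw volume created in the previous step and take one z-slice.
raw = zarr.open(os.path.expanduser("~/incasem/data/my_new_data.zarr"), mode="r")["volumes/raw"]
z_slice = raw[0]

# CLAHE with the same clip limit as the pipeline default (0.02).
# equalize_adapthist returns floats in [0, 1], so rescale to uint8 for viewing.
equalized = exposure.equalize_adapthist(z_slice, clip_limit=0.02)
equalized = (equalized * 255).astype(np.uint8)
print(z_slice.dtype, "->", equalized.dtype, equalized.shape)
```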
Inspect the converted data with `neuroglancer`:
neuroglancer -f ~/incasem/data/my_new_data.zarr -d volumes/raw
Refer to our instructions on how to use neuroglancer.
To run a prediction, you need to create a configuration file in JSON format that specifies which data should be used.
Here is an example, also available at `~/incasem/scripts/03_predict/data_configs/example_cell6.json`:
{
"Cell_6_example_roi_nickname" : {
"file": "cell_6/cell_6.zarr",
"offset": [400, 926, 2512],
"shape": [241, 476, 528],
"voxel_size": [5, 5, 5],
"raw": "volumes/raw_equalized_0.02"
}
}
`offset` and `shape` are specified in voxels and in z, y, x format. They have to outline a region of interest (ROI) that lies within the total available ROI of the dataset (as defined in the `.zarray` and `.zattrs` files of each zarr volume).
Note that the offset in each `.zattrs` file is defined in nanometers, while the shape in `.zarray` is defined in voxels.
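As a sanity check when filling in `offset` and `shape`, the following sketch reads a volume's `.zattrs` and `.zarray` files and converts the nanometer offset to voxels. It assumes the `.zattrs` file stores `offset` and `resolution` entries in z, y, x order (as in our datasets) and uses the cell 6 example paths from above.

```python
import json
from pathlib import Path

# Path to one zarr volume (the cell 6 raw data from the example above).
volume = Path.home() / "incasem/data/cell_6/cell_6.zarr/volumes/raw_equalized_0.02"

attrs = json.loads((volume / ".zattrs").read_text())   # offset (and resolution) in nanometers
array = json.loads((volume / ".zarray").read_text())   # shape in voxels

resolution = attrs["resolution"]                        # e.g. [5, 5, 5] nm, z, y, x
offset_voxels = [o // r for o, r in zip(attrs["offset"], resolution)]

print("dataset offset (voxels):", offset_voxels)
print("dataset shape  (voxels):", array["shape"])
# A valid ROI must satisfy, per axis and in voxels:
#   offset >= dataset offset   and   offset + shape <= dataset offset + dataset shape
```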
We assume the data to be in `~/incasem/data`, as described in the section Download our data.
We provide the following pre-trained models:
- For FIB-SEM data prepared by chemical fixation, 5x5x5 nm³ resolution:
  - Mitochondria (model ID `1847`)
  - Golgi Apparatus (model ID `1837`)
  - Endoplasmic Reticulum (model ID `1841`)
- For FIB-SEM data prepared by high-pressure freezing, 4x4x4 nm³ resolution:
  - Mitochondria (model ID `1675`)
  - Endoplasmic Reticulum (model ID `1669`)
- For FIB-SEM data prepared by high-pressure freezing, 5x5x5 nm³ resolution:
  - Clathrin-Coated Pits (model ID `1986`)
  - Nuclear Pores (model ID `2000`)
A checkpoint file for each of these models is stored in `~/incasem/models/pretrained_checkpoints/`.
Activate the omniboard environment.
source ~/omniboard_environment/bin/activate
Run `omniboard -m localhost:27017:incasem_trainings` and paste `localhost:9000` into your browser.
Cell 6 has been prepared by chemical fixation, and in this example we will generate Endoplasmic Reticulum predictions using model ID `1841`. Navigate to the prediction scripts folder:
cd ~/incasem/scripts/03_predict
Run
python predict.py --run_id 1841 --name example_prediction_cell6_ER with config_prediction.yaml 'prediction.data=data_configs/example_cell6.json' 'prediction.checkpoint=../../models/pretrained_checkpoints/model_checkpoint_1841_er_CF.pt'
Note that we need to specify which model to use twice:
- `--run_id 1841` to load the appropriate settings from the models database.
- `'prediction.checkpoint=../../models/pretrained_checkpoints/model_checkpoint_1841_er_CF.pt'` to pass the path to the checkpoint file.
You can check the status of the prediction in omniboard:
omniboard -m localhost:27017:incasem_predictions
If you have corresponding ground truth annotations, create a metric exclusion zone as described below. For the example of predicting Endoplasmic Reticulum in cell 6 from above, put the metric exclusion zone in `cell_6/cell_6.zarr/volumes/metric_masks/er` and adapt `data_configs/example_cell6.json` to:
{
"Cell_6_example_roi_nickname" : {
"file": "cell_6/cell_6.zarr",
"offset": [400, 926, 2512],
"shape": [241, 476, 528],
"voxel_size": [5, 5, 5],
"raw": "volumes/raw_equalized_0.02",
"metric_masks": [
"volumes/metric_masks/er"
],
"labels": {
"volumes/labels/er": 1,
}
}
}
Now run
python predict.py --run_id 1841 --name example_prediction_cell6_ER_with_GT with config_prediction.yaml 'prediction.log_metrics=True' 'prediction.data=data_configs/example_cell6.json' 'prediction.checkpoint=../../models/pretrained_checkpoints/model_checkpoint_1841_er_CF.pt'
This will print an F1 score for the generated prediction given the ground truth annotations (`labels`).
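For reference, the reported score is a voxel-wise F1 in which voxels excluded by the metric mask are ignored. The sketch below is not the pipeline's exact implementation, but it illustrates the computation for binary numpy arrays of identical shape.

```python
import numpy as np

def masked_f1(prediction: np.ndarray, labels: np.ndarray, metric_mask: np.ndarray) -> float:
    """Voxel-wise F1 score, ignoring voxels where the metric mask is 0."""
    keep = metric_mask.astype(bool)
    pred = prediction.astype(bool)[keep]
    truth = labels.astype(bool)[keep]

    tp = np.count_nonzero(pred & truth)
    fp = np.count_nonzero(pred & ~truth)
    fn = np.count_nonzero(~pred & truth)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator > 0 else 0.0
```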
Every prediction is stored with a unique identifier (increasing number). If the example above was your first prediction run, you will see a folder `~/incasem/data/cell_6/cell_6.zarr/volumes/predictions/train_1841/predict_0001/segmentation`. To inspect these predictions, together with the corresponding EM data and ground truth, use the following command:
neuroglancer -f ~/incasem/data/cell_6/cell_6.zarr -d volumes/raw_equalized_0.02 volumes/labels/er volumes/predictions/train_1841/predict_0001/segmentation
Run `cd ~/incasem/scripts/04_postprocessing` to access the postprocessing scripts.
Now adapt and execute the conversion command below. In this example command, we assume that we have used model ID `1841` to generate Endoplasmic Reticulum predictions for a subset of cell 6, and that the automatically assigned prediction ID is `0001`.
python 20_convert_zarr_to_image_sequence.py --filename ~/incasem/data/cell_6/cell_6.zarr --datasets volumes/predictions/train_1841/predict_0001/segmentation --out_directory ~/incasem/data/cell_6 --out_datasets example_er_prediction
You can open the resulting TIFF stack for example in ImageJ. Note that since we only made predictions on a subset of cell 6, the prediction TIFF stack is smaller than the raw data TIFF stack.
Example: Endoplasmic reticulum (ER) annotations.
We assume that the available 3D pixelwise annotations are stored as a sequence of 2D `.tif` images in a directory and that the size of each `.tif` annotation image matches the size of the corresponding electron microscopy `.tif` image.
Furthermore, we assume that you have already prepared the corresponding electron microscopy images as outlined above.
The minimal block size that our training pipeline is set up to process is `(204, 204, 204)` voxels.
cp -r old/annotations/location ~/incasem/data/my_new_er_annotations
cd ~/incasem/scripts/01_data_formatting
In case you have not installed the python environment yet, refer to the installation instructions.
Before running Python scripts, activate the `incasem` environment:
conda activate incasem
Convert the sequence of `.tif` annotations (3D stack) to `.zarr` format.
In this example, we use
python 00_image_sequences_to_zarr.py -i ~/incasem/data/my_new_er_annotations -f ~/incasem/data/my_new_data.zarr -d volumes/labels/er --dtype uint32
We assume the `.tif` file names are in the format `name_number.tif`, as captured by the default regular expression `.*_(\d+).*\.tif$`. If you want to change it, add `-r your_regular_expression` to the command above.
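If you are unsure whether your file names match the default pattern, you can test it quickly in a Python session (the file names below are made up):

```python
import re

# Default pattern; the captured group is the slice index used for ordering.
pattern = re.compile(r".*_(\d+).*\.tif$")

for name in ["er_0000.tif", "er_0001.tif", "slice-3.tif"]:  # example names
    match = pattern.match(name)
    print(name, "->", match.group(1) if match else "no match")
```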
If your dataset is hundreds of GB in size, try using the conversion script `01_image_sequences_to_zarr_with_dask.py`. You will need to install a different conda environment to work with `dask`; details are given directly in the script.
python 01_image_sequences_to_zarr_with_dask.py -i ~/incasem/data/my_new_er_annotations -f ~/incasem/data/my_new_data.zarr -d volumes/labels/er --resolution 5 5 5 --dtype uint32
Inspect the converted data with `neuroglancer`:
neuroglancer -f ~/incasem/data/my_new_data.zarr -d volumes/raw volumes/labels/er
Refer to our instructions on how to use neuroglancer.
If the position of the labels is wrong, you can correct the offset by directly editing the dataset attributes file on disk:
cd ~/incasem/data/my_new_data.zarr/volumes/labels/er
vim .zattrs
In this file the offset is expressed in nanometers instead of voxels. So if the voxel size is `(5, 5, 5)` nm, you need to multiply the desired voxel coordinates (z, y, x) by 5.
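Instead of editing the file by hand, a small script can do the multiplication and rewrite `.zattrs`. This is only a sketch: the voxel offset below is a made-up example, and we assume the script is run from inside the label dataset directory and that the offset attribute is stored under the key `offset`.

```python
import json

voxel_size = (5, 5, 5)            # z, y, x, in nanometers
offset_voxels = (400, 926, 2512)  # desired label offset in voxels (example values)

with open(".zattrs") as f:
    attrs = json.load(f)

# .zattrs expects the offset in nanometers, z, y, x.
attrs["offset"] = [o * s for o, s in zip(offset_voxels, voxel_size)]

with open(".zattrs", "w") as f:
    json.dump(attrs, f, indent=4)
```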
We create a mask that will be used to calculate the F1 score for predictions, e.g. in the periodic validation during training. This mask, which we refer to as an exclusion zone, simply sets the pixels at the object boundaries to 0, as we do not want small errors close to the object boundaries to affect the overall prediction score.
We suggest the following exclusion zones (in voxels):
- mito: `--exclude_voxels_inwards 4 --exclude_voxels_outwards 4`
- golgi: `--exclude_voxels_inwards 2 --exclude_voxels_outwards 2`
- ER: `--exclude_voxels_inwards 2 --exclude_voxels_outwards 2`
- NP (nuclear pores): `--exclude_voxels_inwards 1 --exclude_voxels_outwards 1`
- CCP (coated pits): `--exclude_voxels_inwards 2 --exclude_voxels_outwards 2`
For our example with Endoplasmic Reticulum annotations, we run
python 60_create_metric_mask.py -f ~/incasem/data/my_new_data.zarr -d volumes/labels/er --out_dataset volumes/metric_masks/er --exclude_voxels_inwards 2 --exclude_voxels_outwards 2
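Conceptually, the exclusion zone is a thin band around each object boundary that is set to 0 in the mask. The script above does this for you; the sketch below (using `scipy.ndimage`, not the script's actual code) shows the idea for a binary label array.

```python
import numpy as np
from scipy import ndimage

def exclusion_zone_mask(labels: np.ndarray,
                        exclude_inwards: int = 2,
                        exclude_outwards: int = 2) -> np.ndarray:
    """Metric mask: 0 in a band around object boundaries, 1 everywhere else."""
    foreground = labels.astype(bool)
    eroded = ndimage.binary_erosion(foreground, iterations=exclude_inwards)
    dilated = ndimage.binary_dilation(foreground, iterations=exclude_outwards)
    boundary_band = dilated & ~eroded
    return (~boundary_band).astype(np.uint8)
```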
If the prediction quality of one of our pre-trained models on a new target cell is not satisfactory, you can fine-tune the model with a very small amount of ground truth from that target cell.
This is an example based on our datasets, which are publicly available in `.zarr` format via Amazon Web Services.
We will fine-tune the mitochondria model (ID `1847`), which was trained on data from cells 1 and 2, with a small amount of additional data from cell 3.
If you haven't done so before, download `cell_3` from our published datasets as outlined in the section Download our data.
To fine-tune a model, you need to create a configuration file in JSON format that specifies which data should be used.
Here is an example, also available at `~/incasem/scripts/02_train/data_configs/example_finetune_mito.json`:
{
"cell_3_finetune_mito" : {
"file": "cell_3/cell_3.zarr",
"offset": [700, 2000, 6200],
"shape": [250, 250, 250],
"voxel_size": [5, 5, 5],
"raw": "volumes/raw_equalized_0.02",
"labels" : {
"volumes/labels/mito": 1
}
}
}
Refer to the section Training for a detailed walk-through of such a configuration file.
Navigate to the training scripts folder:
cd ~/incasem/scripts/02_train
and run:
python train.py --name example_finetune --start_from 1847 ~/incasem/models/pretrained_checkpoints/model_checkpoint_1847_mito_CF.pt with config_training.yaml training.data=data_configs/example_finetune_mito.json validation.data=data_configs/example_finetune_mito.json torch.device=0 training.iterations=15000
Note that since we do not have extra validation data on the target cell 3, we simply pass the training data configuration file to define a dummy validation dataset.
Each training run logs information to disk and to the training database, which can be inspected using Omniboard.
The log files on disk are stored in `~/incasem/training_runs`.
To monitor the training loss in detail, open tensorboard:
tensorboard --logdir=~/incasem/training_runs/tensorboard
To observe the training and validation F1 scores, as well as the chosen experiment configuration, we use Omniboard.
Activate the omniboard environment:
source ~/omniboard_environment/bin/activate
Run `omniboard -m localhost:27017:incasem_trainings` and paste `localhost:9000` into your browser.
Since we usually do not have any ground truth on the target cell that we fine-tuned for, we cannot rigorously pick the best model iteration.
We find that, for example with ground truth in a 2 µm³ region of interest, the fine-tuning has typically converged after 5,000 - 10,000 iterations. The training loss (visible in tensorboard) can serve as a proxy for picking a model iteration in that interval.
Now you can use the fine-tuned model to generate predictions on the new target cell, as described in the section Prediction.
This is an example based on our datasets, which are publicly available in `.zarr` format via Amazon Web Services.
Download `cell_1` and `cell_2` from our published datasets as outlined in the section Download our data.
We create a mask that will be used to calculate the F1 score for predictions, e.g. in the periodic validation during training. This mask, which we refer to as an exclusion zone, simply sets the pixels at the object boundaries to 0, as we do not want small errors close to the object boundaries to affect the overall prediction score.
For our example with Endoplasmic Reticulum annotations on `cell_1` and `cell_2`, we run (from the data formatting directory):
python 60_create_metric_mask.py -f ~/incasem/data/cell_1/cell_1.zarr -d volumes/labels/er --out_dataset volumes/metric_masks/er --exclude_voxels_inwards 2 --exclude_voxels_outwards 2
and
python 60_create_metric_mask.py -f ~/incasem/data/cell_2/cell_2.zarr -d volumes/labels/er --out_dataset volumes/metric_masks/er --exclude_voxels_inwards 2 --exclude_voxels_outwards 2
To run a training, you need to create a configuration file in JSON format that specifies which data should be used.
Here is an example, also available at `~/incasem/scripts/02_train/data_configs/example_train_er.json`:
We assume the data to be in `~/incasem/data`, as described in the section Download our data.
{
"cell_1_er" : {
"file": "cell_1/cell_1.zarr",
"offset": [150, 120, 1295],
"shape": [600, 590, 1350],
"voxel_size": [5, 5, 5],
"raw": "volumes/raw_equalized_0.02",
"metric_masks": [
"volumes/metric_masks/er"
],
"labels" : {
"volumes/labels/er": 1
}
},
"cell_2_er": {
"file": "cell_2/cell_2.zarr",
"offset": [100, 275, 700],
"shape": [500, 395, 600],
"voxel_size": [5, 5, 5],
"raw": "volumes/raw_equalized_0.02",
"metric_masks": [
"volumes/metric_masks/er"
],
"labels": {
"volumes/labels/er": 1
}
}
}
`offset` and `shape` are specified in voxels and in z, y, x format. They have to outline a region of interest (ROI) that lies within the total available ROI of the dataset (as defined in the `.zarray` and `.zattrs` files of each zarr volume).
Note that the offset in each `.zattrs` file is defined in nanometers, while the shape in `.zarray` is defined in voxels.
In such a data configuration file, all pixels inside the ROIs that belong to the structure of interest (e.g. endoplasmic reticulum above) have to be fully annotated. Additionally, our network architecture requires a context of 47 voxels of raw EM data around each ROI; a quick check is sketched below.
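The following sketch checks that a training ROI, padded by the 47-voxel context, stays inside the raw volume. It assumes the `.zattrs` file stores the volume offset in nanometers under an `offset` key; the paths and numbers are the cell 1 values from the configuration above.

```python
import json
from pathlib import Path

CONTEXT = 47  # voxels of raw EM context required around each training ROI

volume = Path.home() / "incasem/data/cell_1/cell_1.zarr/volumes/raw_equalized_0.02"
voxel_size = [5, 5, 5]          # nm, z, y, x
roi_offset = [150, 120, 1295]   # from the data configuration above, in voxels
roi_shape = [600, 590, 1350]

raw_offset_nm = json.loads((volume / ".zattrs").read_text())["offset"]
raw_shape = json.loads((volume / ".zarray").read_text())["shape"]
raw_offset = [o // v for o, v in zip(raw_offset_nm, voxel_size)]

# Per axis: ROI start minus context must not precede the volume start,
# and ROI end plus context must not exceed the volume end.
fits = all(
    ro + CONTEXT <= o and o + s + CONTEXT <= ro + rs
    for o, s, ro, rs in zip(roi_offset, roi_shape, raw_offset, raw_shape)
)
print("ROI plus context fits inside the raw volume:", fits)
```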
You also need to create a configuration file in JSON format that specifies which data should be used for periodic validation of the model during training.
Here is an example, also available at `~/incasem/scripts/02_train/data_configs/example_validation_er.json`:
{
"cell_1_er_validation" : {
"file": "cell_1/cell_1.zarr",
"offset": [150, 120, 2645],
"shape": [600, 590, 250],
"voxel_size": [5, 5, 5],
"raw": "volumes/raw_equalized_0.02",
"metric_masks": [
"volumes/metric_masks/er"
],
"labels" : {
"volumes/labels/er": 1
}
},
"cell_2_er_validation": {
"file": "cell_2/cell_2.zarr",
"offset": [300, 70, 700],
"shape": [300, 205, 600],
"voxel_size": [5, 5, 5],
"raw": "volumes/raw_equalized_0.02",
"metric_masks": [
"volumes/metric_masks/er"
],
"labels": {
"volumes/labels/er": 1
}
}
}
The file `config_training.yaml` exposes many parameters of the model training. Most importantly:
- If you would like to use data with a different resolution, apart from specifying it in the data configuration files as outlined above, you need to adapt `data.voxel_size` in `config_training.yaml`.
- We guide the random sampling of blocks by rejecting blocks that consist of less than a given percentage (`training.reject.min_masked`) of foreground voxels with a chosen probability (`training.reject.probability`). If your dataset contains a lot of background, or no background at all, you might want to adapt these parameters accordingly; a sketch of this rejection logic is shown after the list.
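The sketch below is not the pipeline's actual sampling code, but it illustrates how `training.reject.min_masked` and `training.reject.probability` interact (the default values shown are made up): blocks whose foreground fraction is below the threshold are re-drawn with the given probability.

```python
import random

import numpy as np

def sample_training_block(labels: np.ndarray,
                          block_shape=(204, 204, 204),   # minimal block size from above
                          min_masked=0.05,               # example threshold
                          reject_probability=0.95):      # example probability
    """Pick a random block corner; reject mostly-background blocks with the given probability."""
    while True:
        corner = tuple(random.randint(0, d - b) for d, b in zip(labels.shape, block_shape))
        block = labels[tuple(slice(c, c + b) for c, b in zip(corner, block_shape))]
        foreground_fraction = np.count_nonzero(block) / block.size
        if foreground_fraction >= min_masked or random.random() > reject_probability:
            return corner
```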
Navigate to the training scripts folder:
cd ~/incasem/scripts/02_train
and run:
python train.py --name example_training with config_training.yaml training.data=data_configs/example_train_er.json validation.data=data_configs/example_validation_er.json torch.device=0
Each training run logs information to disk and to the training database, which can be inspected using Omniboard.
The log files on disk are stored in `~/incasem/training_runs`.
To monitor the training loss in detail, open tensorboard:
tensorboard --logdir=~/incasem/training_runs/tensorboard
To observe the training and validation F1 scores, as well as the chosen experiment configuration, we use Omniboard.
Activate the omniboard environment:
source ~/omniboard_environment/bin/activate
Run `omniboard -m localhost:27017:incasem_trainings` and paste `localhost:9000` into your browser.
Using Omniboard, pick a model iteration where the validation loss and the validation F1 score have converged. Now use this model to generate predictions on new data, as described in the section Prediction.