# Optimization for medical image segmentation with 2D U-Net on Intel(R) Xeon CPUs

#### Agenda

1. Brain MRI scan
2. U-Net for brain image segmentation
3. Intel's optimizations
4. Let's do coding!

### 1. Brain MRI scan
Magnetic resonance imaging (MRI) of the brain is a safe and painless test that uses a magnetic field and radio waves to produce detailed images of the brain and the brain stem. An MRI differs from a CAT scan (also called a CT scan or a computed axial tomography scan) because it does not use radiation.

An MRI scanner consists of a large doughnut-shaped magnet that often has a tunnel in the center. Patients are placed on a table that slides into the tunnel. Some centers have open MRI machines that have larger openings and are helpful for patients with claustrophobia. MRI machines are located in hospitals and radiology centers.

During the exam, radio waves manipulate the magnetic position of the atoms of the body, which are picked up by a powerful antenna and sent to a computer. The computer performs millions of calculations, resulting in clear, cross-sectional black and white images of the body. These images can be converted into three-dimensional (3-D) pictures of the scanned area. This helps pinpoint problems in the brain and the brain stem when the scan focuses on those areas.

**Reference:** https://kidshealth.org/en/parents/mri-brain.html

<table><tr><td><img src='https://github.com/IntelAI/unet/raw/master/3D/images/BRATS_152_img3D.gif'></td><td><img src='https://github.com/IntelAI/unet/blob/master/3D/images/BRATS_195_img.gif?raw=true'></td></tr></table>

**Reference:** https://github.com/IntelAI/unet

### 2. U-Net for brain image segmentation

This is a U-Net implementation in TensorFlow for FLAIR abnormality segmentation in brain MRI, based on the deep learning segmentation algorithm used in [Association of genomic subtypes of lower-grade gliomas with shape features automatically extracted by a deep learning algorithm](https://doi.org/10.1016/j.compbiomed.2019.05.002).

```bibtex
@article{buda2019association,
  title={Association of genomic subtypes of lower-grade gliomas with shape features automatically extracted by a deep learning algorithm},
  author={Buda, Mateusz and Saha, Ashirbani and Mazurowski, Maciej A},
  journal={Computers in Biology and Medicine},
  volume={109},
  year={2019},
  publisher={Elsevier},
  doi={10.1016/j.compbiomed.2019.05.002}
}
```

The network topology is structured as follows:
<img src='https://github.com/mateuszbuda/brain-segmentation-pytorch/raw/master/assets/unet.png'>

**Reference:** https://github.com/mateuszbuda/brain-segmentation-pytorch

### 3. Intel's optimizations

#### Intel Optimization for TensorFlow

To take full advantage of Intel® architecture and extract maximum performance, the TensorFlow framework has been optimized with primitives from the Intel® oneAPI Deep Neural Network Library (oneDNN, formerly Intel® MKL-DNN), a popular performance library for deep learning applications.

For more information about the optimizations as well as performance data, see the blog post: [TensorFlow Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture).

The installation guide for Intel Optimization for TensorFlow is available at [Intel® Optimization for TensorFlow Installation Guide](https://software.intel.com/en-us/articles/intel-optimization-for-tensorflow-installation-guide).


#### 3.1 Optimization with TensorFlow switches
**intra_op_parallelism_threads**
- Number of threads used to parallelize execution within a single operation (such as matrix multiplication or reduction).
- Recommended: the number of physical cores, found on Linux with the `lscpu` command.

**inter_op_parallelism_threads**
- Maximum number of independent operations that can run in parallel.
- Recommended: the number of CPU sockets, found on Linux with the `lscpu` command.

Note: test with your own model and platform to find the best values.

#### 3.2 Optimization with Intel(R) oneDNN switches
Intel oneDNN uses OpenMP to take advantage of Intel architecture.
The following environment variables are used to tune vectorization and multi-threading.

**KMP_AFFINITY**
- Restricts execution of certain threads to a subset of the physical processing units in a multiprocessor computer.
- Recommended: ```export KMP_AFFINITY=granularity=fine,compact,1,0```

**KMP_BLOCKTIME**
- Sets the time, in milliseconds, that a thread waits after completing the execution of a parallel region before going to sleep.
- Recommended: ```export KMP_BLOCKTIME=0``` (or `1`)

**OMP_NUM_THREADS**
- Sets the maximum number of threads to use for OpenMP parallel regions.
- Recommended: ```export OMP_NUM_THREADS=<number of physical cores>```

Note: we recommend tuning these values for your specific neural network model and platform.
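
The same settings can also be applied from Python instead of the shell. The sketch below is a minimal example under two assumptions: the helper name `onednn_env` is hypothetical, and the core count of 24 is a placeholder for the value reported by `lscpu`. Environment variables like these typically need to be in place before TensorFlow (and with it the OpenMP runtime) is imported.

```python
import os


def onednn_env(num_physical_cores, blocktime_ms=0):
    """Return the OpenMP settings recommended above as environment strings."""
    return {
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
        "KMP_BLOCKTIME": str(blocktime_ms),
        "OMP_NUM_THREADS": str(num_physical_cores),
    }


# 24 is a placeholder; use the physical core count of your own machine.
env = onednn_env(num_physical_cores=24)

# Apply the settings to the current process before importing TensorFlow.
os.environ.update(env)

for name in sorted(env):
    print(name + "=" + os.environ[name])
```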

#### 3.3 Optimization with miscellaneous configurations/tools
**Numactl**
- Running on a NUMA-enabled machine brings with it special considerations. NUMA (non-uniform memory access) is a memory layout design used in data center machines to take advantage of memory locality in multi-socket machines with multiple memory controllers and blocks. In most cases, inference runs best when both execution and memory usage are confined to a single NUMA node.
- Recommended: ```numactl --cpunodebind=N --membind=N python <script>```
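
To launch a run pinned to one NUMA node from Python, the command above can be assembled as an argument list. This is a small sketch: the helper name `numactl_command` and the script name `inference.py` are hypothetical, and only the two `numactl` options already shown above are used.

```python
import shlex


def numactl_command(node, script, python="python"):
    """Build a numactl command that binds CPU and memory to one NUMA node."""
    return [
        "numactl",
        "--cpunodebind={}".format(node),
        "--membind={}".format(node),
        python,
        script,
    ]


# `inference.py` is a placeholder for your own benchmark or inference script.
cmd = numactl_command(node=0, script="inference.py")
print(shlex.join(cmd))

# To actually launch it (requires numactl to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```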

**Batch size**
- A larger batch size can increase the utilization and efficiency of hardware resources.
- Optional; choose it according to your requirements.
- Recommended: a power of two, $2^{n}, n \in N_+$

A more detailed introduction to maximizing performance with Intel Optimization for TensorFlow can be found [here](https://software.intel.com/en-us/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference).

### 4. Let's do coding!

#### 4.0 Dataset
We use the [brain tumor segmentation (BraTS) subset](https://drive.google.com/file/d/1A2IU8Sgea1h3fYLpYtFb2v7NYdMjvEhU/view?usp=sharing) of the [Medical Segmentation Decathlon](http://medicaldecathlon.com/) dataset. The dataset is released under the [Creative Commons Attribution-ShareAlike 4.0 International license](https://creativecommons.org/licenses/by-sa/4.0/).

Please follow the instructions [here](https://github.com/IntelAI/unet/blob/master/2D/00_Prepare-Data.ipynb) to prepare the dataset.

#### 4.1 Import required packages

In [None]:
%matplotlib inline
import os
import numpy as np
import tensorflow as tf
import keras as K
import h5py
import time
import matplotlib.pyplot as plt

from data import load_data
from model import unet

import sys; sys.argv=['']; del sys
from argparser import args

#### 4.2 Check the TensorFlow version and run a sanity check

In [None]:
print("We are using TensorFlow version", tf.__version__,
      "with Intel(R) oneDNN", "enabled" if tf.pywrap_tensorflow.IsMklEnabled() else "disabled")

#### 4.3 Define the Dice coefficient and loss function
The Sørensen–Dice coefficient is a statistic used for comparing the similarity of two samples. Given two sets, X and Y, it is defined as

\begin{equation}
Dice = \frac{2|X\cap Y|}{|X|+|Y|}
\end{equation}
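
As a small worked example of the set definition, the sketch below evaluates the coefficient for two four-element sets that share two elements. The function name `dice_sets` and the convention of returning 1.0 for two empty sets are assumptions made for this illustration.

```python
def dice_sets(x, y):
    """Sorensen-Dice coefficient of two collections, treated as sets."""
    x, y = set(x), set(y)
    if not x and not y:
        # Convention chosen here: two empty sets count as identical.
        return 1.0
    return 2.0 * len(x & y) / (len(x) + len(y))


X = {1, 2, 3, 4}
Y = {3, 4, 5, 6}

# |X ∩ Y| = 2 and |X| + |Y| = 8, so Dice = 2 * 2 / 8 = 0.5
print(dice_sets(X, Y))
```

For segmentation, X and Y are the sets of pixels labelled as tumor in the ground truth mask and in the prediction.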

In [None]:
def calc_dice(target, prediction, smooth=0.01):
    """
    Sorensen Dice coefficient
    """
    prediction = np.round(prediction)

    numerator = 2.0 * np.sum(target * prediction) + smooth
    denominator = np.sum(target) + np.sum(prediction) + smooth
    coef = numerator / denominator

    return coef

def calc_soft_dice(target, prediction, smooth=0.01):
    """
    Sorensen (Soft) Dice coefficient - Don't round predictions
    """
    numerator = 2.0 * np.sum(target * prediction) + smooth
    denominator = np.sum(target) + np.sum(prediction) + smooth
    coef = numerator / denominator

    return coef
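
Written out per pixel, the two functions above compute a smoothed form of the coefficient. Here $t_i$ and $p_i$ denote the target and predicted values at pixel $i$, and $s$ is the `smooth` argument (the symbol names are chosen for this illustration):

\begin{equation}
Dice_{s} = \frac{2\sum_i t_i p_i + s}{\sum_i t_i + \sum_i p_i + s}
\end{equation}

With $s > 0$ the ratio stays defined when both masks are empty, in which case it evaluates to $s/s = 1$. `calc_dice` rounds each $p_i$ to 0 or 1 first, whereas `calc_soft_dice` uses the raw predicted values.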

#### 4.4 Load images

In [None]:
data_path = os.path.join("../../data/decathlon/144x144/")
data_filename = "Task01_BrainTumour.h5"
hdf5_filename = os.path.join(data_path, data_filename)
imgs_train, msks_train, imgs_validation, msks_validation, imgs_testing, msks_testing = load_data(hdf5_filename)
imgs_warmup=imgs_testing[:500]
imgs_infere=imgs_testing[500:2500]
print("Number of imgs_warmup: {}".format(imgs_warmup.shape[0]))
print("Number of imgs_infere: {}".format(imgs_infere.shape[0]))

#### 4.5 Load model

In [None]:
unet_model = unet()
model = unet_model.load_model(os.path.join("./output/unet_model_for_decathlon.hdf5"))

#### 4.6 Define a function to run inference on input images and plot the results

In [None]:
def plot_results(model, imgs_validation, msks_validation, img_no):
    # Select one image/mask pair, keeping the leading batch dimension
    img = imgs_validation[img_no:img_no+1]
    msk = msks_validation[img_no:img_no+1]
    
    pred_mask = model.predict(img, verbose=1, steps=None)

    plt.figure(figsize=(15, 15))
    plt.subplot(1, 3, 1)
    plt.imshow(img[0, :, :, 0], cmap="bone", origin="lower")
    plt.axis("off")
    plt.title("MRI Input", fontsize=20)
    plt.subplot(1, 3, 2)
    plt.imshow(msk[0, :, :, 0], origin="lower")
    plt.axis("off")
    plt.title("Ground truth", fontsize=20)
    plt.subplot(1, 3, 3)
    plt.imshow(pred_mask[0, :, :, 0], origin="lower")
    plt.axis("off")
    # calc_dice expects (target, prediction) and rounds only the prediction
    plt.title("Prediction\nDice = {:.4f}".format(calc_dice(msk, pred_mask)), fontsize=20)
    plt.tight_layout()

#### 4.7 Run inference and plot

In [None]:
indices_validation = [40, 63, 43, 55, 99]
for idx in indices_validation:
    plot_results(model, imgs_validation, msks_validation, idx)

#### 4.8 Benchmark

See the demo in the console, which covers the following configurations.

- Single instance
  - Batch size 1
    - Without numactl
      - Default configuration
      - Configuration with optimization
    - With numactl
      - Default configuration
      - Configuration with optimization
  - Batch size 128
    - With numactl
      - Default configuration
      - Configuration with optimization
- Multiple instances
  - Batch size 128
    - Configuration with optimization
      - 2 instances
      - 4 instances
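
The console demo itself is not reproduced here. As a rough sketch of how a single-instance measurement could be structured, the snippet below runs untimed warm-up batches and then times the remaining batches to report throughput. The helper names `make_batches` and `benchmark` and the dummy predictor are assumptions for illustration only.

```python
import time


def make_batches(items, batch_size):
    """Split a sequence (list or NumPy array) into consecutive batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def benchmark(predict_fn, warmup_batches, timed_batches):
    """Run warm-up batches untimed, then time the remaining batches.

    Returns (elapsed_seconds, images_per_second).
    """
    for batch in warmup_batches:
        predict_fn(batch)

    n_images = 0
    start = time.perf_counter()
    for batch in timed_batches:
        predict_fn(batch)
        n_images += len(batch)
    elapsed = time.perf_counter() - start

    throughput = n_images / elapsed if elapsed > 0 else float("inf")
    return elapsed, throughput


def dummy_predict(batch):
    """Stand-in for a model's predict call."""
    return [value * 2 for value in batch]


images = list(range(1024))
elapsed, throughput = benchmark(
    dummy_predict,
    warmup_batches=make_batches(images[:256], 128),
    timed_batches=make_batches(images, 128),
)
print("elapsed: {:.6f} s, throughput: {:.1f} images/s".format(elapsed, throughput))
```

In the notebook, the dummy predictor would be replaced by `model.predict`, with `imgs_warmup` and `imgs_infere` from section 4.4 as the warm-up and timed inputs.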

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
   http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

SPDX-License-Identifier: EPL-2.0
