# Introduction

Cancer is a deadly disease that was responsible for over 9 million deaths in 2018 \cite{wth}. It is therefore crucial to find effective and efficient treatments. One of the most effective cancer treatments is radiotherapy, in which cancer cells are killed using doses of radiation. However, the irradiation also affects the healthy tissue surrounding the tumour. The accuracy of radiotherapy therefore has to be increased, so that the dose delivered to healthy cells is minimised and the dose delivered to the tumour is maximised. \citet{Njeh} describes the steps in radiotherapy as links in a chain, in which the weakest link limits the accuracy of the treatment. According to \citet{Njeh}, tumour delineation is the weakest link and therefore has a significant impact on radiotherapy treatment.

- waiting time
- interobserver

- Why is delineation of tumours important?
- How is delineation done today?
- What uncertainties are involved?
- CNNs for auto-delineation of tumours
- Propose solution - goal

This report defines a software requirement specification (SRS) and a design document for a Keras-based framework for automatic delineation of cancer tumours. It also covers the progress of the development against the SRS and the design, as well as the current results.

# Theory and Definition

## Deep Learning

## Convolutional Neural Network

- Define CNN here
- Remember to mention:
  - Model
  - Layers
  - Briefly describe CONV, how does it apply to image
  - Activation function
  - Loss
  - Optimizer
  - Scheduler?

## Sequential Model

## U-Net Architecture

## Machine Learning Steps

# The Framework

The framework, named *deoxys*, aims to let users run multiple experiments with different CNN models and then choose the best model for the final prediction. The framework should specialize in deep learning on medical images, especially auto-delineation of cancer tumours. It should therefore integrate the U-Net architecture and image preprocessing modules, as well as logging and performance visualization tools for running experiments. These are the minimum requirements of the framework. It can later be extended with other types of architectures, preprocessors, automation, interactive verbose configuration and visualization.

Development and maintenance are planned to run continuously from October 1st, 2019 until May 1st, 2020. The first milestone is January 6th, 2020, with the goal of a framework that satisfies the minimum requirements, which are defined in detail in the software requirement specification (see \ref{software-requirement-specification}).

## Software Requirement Specification

This part defines the requirement specification of the framework under development. For that reason, terms indicating the future tense, such as "should", "shall" and "will", and terms indicating ability, such as "can" and "may", are used when describing the framework.

To reach the goal of the development, the *deoxys* framework should satisfy all the requirements defined in the User Requirement Specification (see \ref{user-requirement-specification}) and the System Requirement Specification (see \ref{system-requirement-specification}).

### User Requirement Specification

Users are master students, PhD candidates, researchers and anyone else who wants to use deep learning for automatic delineation of cancer tumours.
The framework targets users with basic programming knowledge, including the use of the JSON data format, and with knowledge of deep learning, especially convolutional neural networks. Basic programming knowledge includes, but is not limited to, object-oriented programming in Python and Python libraries such as matplotlib, keras and h5py.

With the help of *deoxys*, users shall be able to perform multiple CNN experiments by creating JSON configuration files. Users can define their own sequential or U-Net model by choosing layers, loss functions, optimizers, metrics and other hyper-parameters. In addition, users can choose how to split the data for training, validation and testing. Each experiment should include training the model, logging its performance and evaluating the trained model on test data. All trained models can be saved to disk and loaded back to continue training or for any other purpose.
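
To make the idea of a configurable experiment more concrete, the following sketch shows what such a JSON description could look like and how it can be read with Python's standard `json` module. Every key name (`architecture`, `model_params`, `dataset_params` and their children) and the helper `load_config` are hypothetical placeholders chosen for illustration. The actual *deoxys* keywords are not defined in this section.

```python
import json

# Hypothetical experiment configuration. All key names are illustrative
# assumptions, not the actual deoxys schema.
CONFIG_TEXT = """
{
    "architecture": {
        "type": "Sequential",
        "layers": [
            {"class_name": "Conv2D",
             "config": {"filters": 8, "kernel_size": 3, "activation": "relu"}},
            {"class_name": "Conv2D",
             "config": {"filters": 1, "kernel_size": 1, "activation": "sigmoid"}}
        ]
    },
    "model_params": {
        "optimizer": {"class_name": "Adam", "config": {"learning_rate": 0.001}},
        "loss": "binary_crossentropy",
        "metrics": ["accuracy"]
    },
    "dataset_params": {
        "batch_size": 16,
        "train_groups": ["fold_0", "fold_1", "fold_2"],
        "val_groups": ["fold_3"],
        "test_groups": ["fold_4"]
    }
}
"""

# Top-level sections this sketch expects (an assumption for the example).
REQUIRED_KEYS = ("architecture", "model_params", "dataset_params")


def load_config(text):
    """Parse a JSON experiment description and check its top-level sections."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing configuration sections: {missing}")
    return config


config = load_config(CONFIG_TEXT)
print(config["architecture"]["type"])
print(len(config["architecture"]["layers"]))
```

Keeping the architecture, the hyper-parameters and the data split in separate sections means that one of them can be changed between experiments without touching the others.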

As a follow-up after running an experiment, users can also compare the predicted outputs, shown as delineated images, with the original images and view the performance graphs of the trained model.

Users with advanced programming knowledge can also create their own model architectures, layers, activation functions, loss functions, optimizers, metrics, etc.

### Use cases

From the user requirement specification, the *deoxys* framework should support the following 5 use cases:

1. Create a model
1. Train a model
1. Save a trained model
1. Load a model from file(s)
1. Create and apply customized model objects to the model


#### Use case diagram

Figure \ref{fig:usecase} shows all the use cases and their interactions inside the framework. There are three main flows of use cases:

- Set up an experiment from a configuration, then run and evaluate it. This starts with creating a model from the configuration, followed by training and evaluating the configured model.
- Load and save trained models from and to disk.
- Create customized objects / elements for the experiment. This includes layers, activation functions, loss functions, optimizers and callbacks.

\begin{figure*}[h!]
  \includegraphics[width=\textwidth]{img/use_case.png}
  \caption{Use Case Diagram}
  \label{fig:usecase}
\end{figure*}

#### Use case 1: Create a model

Every action of the user involves the model. In the *deoxys* framework, the term model refers to a group of three components.
The first component is a convolutional neural network. It can be a sequential CNN, a U-Net CNN, or a customized CNN defined by the user. This neural network contains the input shape, layers and activation functions. We call this component the architecture of the model.
The second component is the set of hyper-parameters of the neural network, which includes the optimizer, loss function and metrics.
To train and evaluate a neural network, data (medical images with delineation contours) must be fed into it.
The last component, called the Data Reader, acts as this data provider and feeds the data into the neural network for training and evaluation. This involves splitting the data. A typical way of splitting the data in machine learning is to divide it into training data, validation data and test data. Training data is used for training the model, while validation data is used for checking the quality of the model. Test data is the data on which the trained model is finally tested.

#### Use case 2: Train a model

#### Use case 03: Create customized objects / elements for the experiment

Step 1: Create new customized objects

Step 2: Register the created objects with the framework. The system checks the validity of each object and raises an exception if the object fails to register.

Step 3: Create an experiment using the newly created objects. The system performs the experiment without error using the new objects.

Step 4: View the output from files or diagram (depending on the configuration)
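
Steps 1 to 3 can be illustrated with a small registry. This is only a sketch under stated assumptions: the class `CustomObjectRegistry`, its methods `register` and `get`, and the toy activation `leaky_clip` are hypothetical names invented for this example, and the validity check is reduced to "the object must be callable and its name unused".

```python
class CustomObjectRegistry:
    """Hypothetical registry mapping names to user-defined components."""

    def __init__(self):
        self._objects = {}

    def register(self, name, obj):
        # Step 2: validate the object before accepting it.
        if not isinstance(name, str) or not name:
            raise ValueError("The registered name must be a non-empty string")
        if not callable(obj):
            raise ValueError(f"{name!r} is not callable and cannot be registered")
        if name in self._objects:
            raise ValueError(f"{name!r} is already registered")
        self._objects[name] = obj

    def get(self, name):
        # Step 3: look the object up when the experiment is built.
        try:
            return self._objects[name]
        except KeyError:
            raise KeyError(f"No custom object named {name!r}") from None


# Step 1: create a customized object (here, a toy scalar activation).
def leaky_clip(x, slope=0.1):
    """Identity for values in [0, 1], with a small slope outside that range."""
    if x < 0:
        return slope * x
    if x > 1:
        return 1 + slope * (x - 1)
    return x


registry = CustomObjectRegistry()
registry.register("leaky_clip", leaky_clip)
activation = registry.get("leaky_clip")
print(activation(0.5))
```

Raising the exception at registration time, rather than when the experiment runs, lets the user see an invalid object before any training time is spent.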

### System Requirement Specification

The *deoxys* framework should have the following attributes: usability, reliability, flexibility, maintainability and portability.

#### Usability
The *deoxys* framework should be easy to install, learn and use. A user should need no more than 40 hours of training and learning to use the framework effectively. For this reason, the framework should have detailed documentation, covering an installation guide and the usage of each class, function and property. It should also provide sample code snippets that can be applied to the defined use cases.

#### Reliability
The output generated when running code from the *deoxys* framework should behave as documented. In addition, the unexpected error rate should be under 5%, and at least 80% of the lines of code should be covered by tests before release.

#### Flexibility
Users should be able to customize and create new components and integrate them into the *deoxys* framework.

#### Maintainability
The *deoxys* framework should be easy to maintain. It should therefore be divided into separate modules. Moreover, all source code should follow the PEP8 coding convention. The framework should also keep a log of the changes in each version and of the issues reported by users.

Maintaining the framework includes fixing bugs, handling issues, updating and adding new features. Maintenance should last at least until May 2020.

#### Portability
The *deoxys* framework should work properly when the following hardware requirements and environment are satisfied:

- System memory: at least 8GB with GPU or 13GB without GPU
- Python version: at least 3.7

## Designs

### Overview

Before development, the design of the framework has to be considered.

The first concerns are the usability and maintainability of the framework. As stated in the previous sections, all source code shall follow the PEP8 coding convention. Sphinx will be used as the documentation tool. In addition, git is used for change logging and version management. All source code should be available at github.com/huynhngoc/deoxys.

Implementing all layers and other components of a convolutional neural network within three months is not feasible. Therefore, keras is used as the base library, as it already contains implemented layers, activation functions, optimizers and other CNN components. Keras is also compatible with tensorflow 1.x and 2.x, a very powerful backend for deep learning, as well as with other backends such as Theano.

Finally, this report suggests that the framework should have the following modules:

- Models: contains a wrapper of a keras model. Other keras objects, such as optimizers and activation functions, are also included.
- Infrastructure and configuration loader
- Data reader: Since the framework targets medical images, the input data is often large and usually cannot fit into the computer's memory. To avoid out-of-memory errors, this module should contain a data generator that splits the image data into smaller batches that fit into memory when training the model.
- Experiment:

- Using keras, why?
  - Implemented layers, activation function, optimizers ...
  - Compatibility with tensorflow 1.x, 2+, multiple backends
- Coding convention:
  - PEP8
  - pydoc
- Structure
  - Model:
  - Infrastructure / config loader
  - Data reader:
     - Generators: why need it (batching data)
     - Preprocessors
  - experiment:
     - Single: logging and performance and plot result
     - Multi: choose best model
     
     Explain why i use this structure

#### Structure diagram

![](img/project_structure.png)

### Model Objects

These modules are the components that make up a model: layers, loss functions, activation functions, metrics, optimizers and callbacks.

### Model

Firstly, this module should be a wrapper of a keras model. As a result, it should expose the methods of the keras model, such as:

- `load`: load a model
- `save`: save the model to file
- `fit`: fit the model to data
- `predict`: predict the target
- `evaluate`: evaluate the performance of the current state of the model

Secondly, it should have a Data Reader (see \ref{data-reader}) instance, which provides the proper inputs for actions on the model.

Finally, by calling the methods of the keras model with inputs from the data reader, the model should have the following methods:

- `fit_train`: fit the model to the training data
- `predict_val`: predict on the validation data
- `predict_test`: predict on the test data
- `evaluate_test`: evaluate the performance of the current state of the model on the test data

Training, validation and test data are explained in \ref{data-reader}.
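
The relation between the forwarded methods and the data-reader-based methods can be sketched as follows. This is a minimal illustration, not the *deoxys* implementation: the attribute names `train`, `val` and `test`, and the stand-in classes `ThresholdModel` and `ListReader`, are assumptions made so that the example runs without Keras. A real Keras model would be passed in place of `ThresholdModel`.

```python
class Model:
    """Hypothetical wrapper pairing a Keras-like model with a data reader."""

    def __init__(self, model, data_reader):
        self.model = model              # object with fit / predict / evaluate
        self.data_reader = data_reader  # object with train / val / test data

    # Methods forwarded unchanged to the wrapped model.
    def fit(self, data, **kwargs):
        return self.model.fit(data, **kwargs)

    def predict(self, data, **kwargs):
        return self.model.predict(data, **kwargs)

    def evaluate(self, data, **kwargs):
        return self.model.evaluate(data, **kwargs)

    # Convenience methods that take their inputs from the data reader.
    def fit_train(self, **kwargs):
        return self.fit(self.data_reader.train, **kwargs)

    def predict_val(self, **kwargs):
        return self.predict(self.data_reader.val, **kwargs)

    def predict_test(self, **kwargs):
        return self.predict(self.data_reader.test, **kwargs)

    def evaluate_test(self, **kwargs):
        return self.evaluate(self.data_reader.test, **kwargs)


class ThresholdModel:
    """Stand-in for a Keras model so that the sketch runs without Keras."""

    def __init__(self):
        self.threshold = 0.0

    def fit(self, data, **kwargs):
        # "Training": use the mean input value as the decision threshold.
        inputs = [x for x, _ in data]
        self.threshold = sum(inputs) / len(inputs)
        return self

    def predict(self, data, **kwargs):
        return [int(x > self.threshold) for x, _ in data]

    def evaluate(self, data, **kwargs):
        # Fraction of samples whose prediction equals the target.
        predictions = self.predict(data)
        hits = sum(p == y for p, (_, y) in zip(predictions, data))
        return hits / len(data)


class ListReader:
    """Stand-in data reader holding (input, target) pairs in plain lists."""

    def __init__(self, train, val, test):
        self.train, self.val, self.test = train, val, test


reader = ListReader(
    train=[(0.1, 0), (0.9, 1)],
    val=[(0.2, 0), (0.8, 1)],
    test=[(0.3, 0), (0.7, 1), (0.6, 0)],
)
model = Model(ThresholdModel(), reader)
model.fit_train()
print(model.predict_val())
print(model.evaluate_test())
```

Because the wrapper only delegates, any object offering `fit`, `predict` and `evaluate` can be wrapped, which also makes the convenience methods easy to test in isolation.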

### Infrastructure

This module should have a function to create a model from one of the predefined structures. 

#### Sequential

#### Unet

### Data Reader

The data reader module should provide the input data for training and evaluating the model. A typical way of splitting the data in machine learning is to divide it into training data, validation data and test data. Training data is used for training the model, while validation data is used for checking the quality of the model. Test data is the data on which the trained model is finally tested. To support this way of splitting, the data reader should provide three sets of data: training data, validation data and test data. Each of these sets should take the form of a Python generator, wrapped in a Data Generator. Using a Python generator is essential because medical image data is usually large and may not fit into the memory of the running environment. A generator feeds the model with a small part of the data at a time, which minimizes the chance of an out-of-memory error. The list of preprocessors to be applied to the data should be configurable.
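
The batching idea can be sketched with a plain Python generator. The function name `batch_generator` and its parameters are hypothetical, and the example uses short lists of strings in place of image arrays. A generator used for Keras training would typically also loop over the data repeatedly, which is left out here for brevity.

```python
def batch_generator(images, targets, batch_size):
    """Yield (images, targets) batches of at most batch_size samples each."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    if len(images) != len(targets):
        raise ValueError("images and targets must have the same length")
    for start in range(0, len(images), batch_size):
        stop = start + batch_size
        # Only this slice is materialised, which keeps memory use bounded
        # when the source is read lazily (for example from disk).
        yield images[start:stop], targets[start:stop]


# Placeholder data: ten "slices" with their matching "masks".
images = [f"slice_{i}" for i in range(10)]
targets = [f"mask_{i}" for i in range(10)]

batch_sizes = [len(x) for x, _ in batch_generator(images, targets, batch_size=4)]
print(batch_sizes)
```

The last batch is smaller than the others whenever the number of samples is not a multiple of the batch size.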

#### HDF5 Data Reader

`h5` or `hdf5` is a file format that can store large datasets with compression and hierarchy, as well as metadata. The main components of an HDF5 file are groups and datasets: datasets are the pieces of data stored in the file, while groups are containers of datasets.

The *deoxys* framework should have an HDF5 Data Reader, which is a Data Reader that processes data from an HDF5 file. As such, it should provide the three datasets: train, validation and test. In addition, since an HDF5 file can be split into groups, the HDF5 Data Reader should make it possible to configure which groups of data go into each of the three basic sets. It should be easy to assign different groups to different purposes for cross-validation.
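
One possible way to assign groups to the three sets for cross-validation is to rotate the roles of the groups, as in the sketch below. It works on group names only and does not open an HDF5 file. The function `cross_validation_splits`, the group names `fold_0` to `fold_3` and the rotation scheme itself are assumptions made for illustration, not a description of how *deoxys* configures its folds.

```python
def cross_validation_splits(group_names):
    """Yield one train/val/test assignment per rotation of the groups."""
    n = len(group_names)
    if n < 3:
        raise ValueError("At least three groups are needed")
    for i in range(n):
        test = group_names[i]
        val = group_names[(i + 1) % n]
        # Every group that is neither test nor validation is used for training.
        train = [g for g in group_names if g not in (test, val)]
        yield {"train": train, "val": [val], "test": [test]}


# Hypothetical HDF5 group names.
folds = ["fold_0", "fold_1", "fold_2", "fold_3"]
splits = list(cross_validation_splits(folds))
print(splits[0])
```

With this scheme each group serves as the test set exactly once, so no group is ever used for both training and testing within the same experiment.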

### Configuration


- Explain the JSON structure, note the required keywords
- Why it's easy to setup experiment
- Example of setting up experiment

# Results

## Implementation progress

### Completed modules

By the time this report is submitted, users can perform a single experiment, with saving, loading and visualization, using the *deoxys* framework. This means that all parts except "multiple experiments" in the design diagram in Figure~\ref{fig:design} have been implemented.

### In-progress modules
Modules related to running multiple experiments are still in development. There are open problems in combining multiple single experiments into a batch of experiments, as well as in the concurrent programming needed to run multiple experiments in parallel.

Besides, tests and documentation are still lacking and need to be completed.

## Run on test data

Data from Oslo University Hospital is used for running a test experiment. The model parameters are taken from Yngve Mardal Moe's master thesis (\cite{}), and the model is trained on a training set of only 3000 image slices.

The result was promising, with a Dice score of about 0.5 (graph, output log).
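
For reference, the Dice score of two binary masks $A$ and $B$ is $2|A \cap B| / (|A| + |B|)$, which ranges from 0 (no overlap) to 1 (identical masks). The sketch below computes it with NumPy on two tiny hand-made masks. The function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions for this example, and the report does not state that the framework computes its metric in exactly this way.

```python
import numpy as np


def dice_score(y_true, y_pred):
    """Dice score of two binary masks: 2 * |A and B| / (|A| + |B|)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    intersection = np.logical_and(y_true, y_pred).sum()
    total = y_true.sum() + y_pred.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * intersection / total)


# Toy 2x2 masks: two true pixels, two predicted pixels, one in common.
truth = [[1, 1],
         [0, 0]]
prediction = [[1, 0],
              [1, 0]]

print(dice_score(truth, prediction))  # 2 * 1 / (2 + 2) = 0.5
```

In this toy case a score of 0.5 corresponds to a prediction that covers half of the true region and places the other half of its pixels outside it.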

# Discussions

- Advantages:
- Disadvantages:
- Improvement:
  - What should be added more (Preprocessor, Callbacks, ....)
  - Auto-generate config tool (web-based)
  - Back-propagation implementation based on implemented model
  - Visualize progress of training / prediction
  - Datagenerator as sequential model for multiprocessing

**Reference List**

- M Jameson, L Holloway, P Vial, S Vinod, P Metcalfe. 2010. A review of methods of analysis in contouring studies for radiation oncology. Journal of Medical Imaging and Radiation Oncology 54:401-410. Note: Understand the basics for your introduction:

"There is no consistent or widely accepted method of
systematic contour comparison. A number of contouring
metrics exist; some of which are available in treatment
planning systems and others require specialised software. "

"Volume was the most frequently used metric
across all tumour sites. Shape/dimension was the next
most frequently used metric in all tumour sites except for
brain and head and neck, where COV and CI were the
next most frequent metrics, respectively"

"the choice of metrics in many studies is
somewhat arbitrary and not determined on any established clinical basis. Further studies are needed to assess
the advantages and disadvantages of each metric in
various situations"

- C Njeh. 2008. Tumor delineation: The weakest link in the search for accuracy in radiotherapy. Journal of Medical Physics 33:136-140. Note: Understand the basics for your introduction [link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2772050/)

Use Figure 1: "Some of the steps in radiotherapy that can be represented by links in a chain; treatment accuracy will be limited by the weakest link in the chain"

Use Figure 3
Illustration of the effect of high conformal radiation therapy and geometric miss due to delineation

"It is evident that tumor delineation is currently the weakest link in radiotherapy accuracy and will continue to have a significant impact until improvement in tumor delineation is achieved. With the advancement of computer programming and imaging technology, especially functional imaging using PET, there is a possibility of converging and making tumor identification and definition less subjective and less observer-dependent."


