Unsupervised medical feature learning


Project done in the context of the Deep Learning course at ETH Zürich.

Motivation

Machine learning is rapidly making its way into the field of medicine, most prominently into subfields such as oncology, pathology, radiology and genetics. So far, the vast majority of this work focuses on supervised detection of end diagnoses. Despite the usefulness of this approach, we believe that, for certain clinical problems, such as complex prognostic predictions and treatment decisions, it is important to learn meaningful representations of the data and to automatically identify and extract the features that characterize certain diseases. This view also aligns with the growing concern about humans' inability to understand the models in use; the need for "interpretability" is, as a result, frequently invoked in these discussions.

In this project, we attempt to tackle the issue of meaningful representation learning, without supervision, in a medical environment. In particular, we constrain our project to the neuroimaging domain. To do so, we take an adversarial representation learning approach, inspired by previous work on latent variable generative models, namely Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), as well as more recent approaches where the mutual information (MI) between the latent (or feature) space and the data is introduced as an extra building block.

How to run our code

Please consult the package requirements in the requirements.txt file.
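They can be installed with pip, e.g.:

pip install -r requirements.txt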

The main flags to run our code are contained within the main.py file. An example command to run a model would be:

python src/main.py --model_dir=gs://[BUCKET_NAME]/cifar10/outputs --data_dir=gs://[BUCKET_NAME]/cifar10/data  --tpu=[TPU_NAME] --dataset=[DATASET]

In order to run the project on a TPU, the data needs to be uploaded to a GCP bucket.
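For example, TFRecord files in a local data/ directory can be copied to a bucket with gsutil (the bucket name and paths below are placeholders):

gsutil -m cp -r data gs://[BUCKET_NAME]/cifar10/data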

The results can be visualized with TensorBoard. When running on a GCP virtual machine, they can be viewed on the local machine by forwarding the SSH port, e.g.

ssh -L 6006:localhost:6006 [HOST]

TensorBoard can then be used normally from the VM, even with a GCP bucket as the target directory.
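For example, pointing TensorBoard at the model directory used above (the bucket path is a placeholder):

tensorboard --logdir=gs://[BUCKET_NAME]/cifar10/outputs --port=6006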

How to download the datasets

To run the code, you will first have to download the datasets (MNIST, celebA, CIFAR-10 and CQ500) and convert them to the TFRecord format; the conversion scripts can be found in the scripts/ folder.
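For CIFAR-10, the repository includes scripts/generate_cifar10_tfrecords.py. Assuming it follows the standard TensorFlow TPU example script of the same name and accepts a --data-dir flag (an assumption; check the script's own flags), it would be run as:

python scripts/generate_cifar10_tfrecords.py --data-dir=./cifar10-data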

Project structure

Below is an overview of the structure of this project:

project
│   README.md
│   requirements.txt
│   to_run.sh
│   monitor.sh
│
├───src
│   │   main.py
│   │   main_search.py
│   │
│   ├───models
│   │   │   basic_model.py
│   │   │   resBlocks_model.py
│   │   │   resBlocks_ops.py
│   │
│   ├───datamanager
│   │   │   celebA_input_functions.py
│   │   │   CIFAR_input_functions.py
│   │   │   cq500_input_functions.py
│   │   │   cq500_256_input_functions.py
│   │
│   ├───core
│   │   │   core_model_estimator.py
│   │
│   └───util
│       │   cq500_folder_structure.py
│       │   dcm_manipulation.py
│       │   get_dcms_for_PCA.py
│       │   image_postprocessing.py
│       │   image_preprocessing.py
│       │   pca_mnist.py
│       │   tensorboard_logging.py
│       │   tpu_teraflops_measure.py
│       │   windowing.py
│
└───scripts
    │   compare_z-locs.py
    │   compare_z-locs_updated.py
    │   convert_dcm_to_unique_tfRecord.py
    │   ct_test.dcm
    │   fractures.csv
    │   fractures.txt
    │   generate_cifar10_tfrecords.py
    │   healthy.csv
    │   healthy.txt
    │   hemorrhage.txt
    │   hemorrhages.csv
    │   reads.csv
    │   restart_tpu.sh
    │   retrieve_all_dcms.py
    │   rewrite_normalize_dcm.py
    │   validation_dataset_creation.py

The src/ folder contains the most important scripts needed to run the models, including the models themselves, input functions for each dataset and utility functions. CIFAR is included because we ran early tests on it; however, we report our results on the MNIST, celebA and CQ500 datasets.

The scripts/ folder contains a series of scripts that were necessary to analyze and manage the data. The CQ500 dataset in particular required substantial analysis and special handling, given that the original data is in DICOM format.
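As an illustration of the kind of DICOM handling involved (the actual logic lives in files such as src/util/windowing.py and src/util/dcm_manipulation.py), here is a minimal sketch using pydicom; the window center/width are typical brain-window placeholder values, not necessarily the parameters used in our pipeline:

import numpy as np
import pydicom

# Read one CT slice (ct_test.dcm ships in scripts/) and convert the raw
# pixel values to Hounsfield units (HU) using the DICOM rescale tags
ds = pydicom.dcmread("scripts/ct_test.dcm")
hu = ds.pixel_array.astype(np.float32) * float(ds.RescaleSlope) + float(ds.RescaleIntercept)

# Apply a brain window (center=40 HU, width=80 HU are placeholder values)
center, width = 40.0, 80.0
low, high = center - width / 2, center + width / 2
windowed = np.clip(hu, low, high)

# Rescale the windowed slice to [0, 1] for use as network input
normalized = (windowed - low) / (high - low)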

Happy model running! 🏎

Spotted an error? Feedback and pull requests are welcome!

The authors

Nil Adell Mill
Inês Pereira
Patrick Haller
Lama Saouma

This project uses the Tensorflow Project Template. Check it for specifics about the structure and template files.
