Preprocessing stages for the PadChest dataset.
We assume we are working under the `/home/user/padchest` directory. Modify this as convenient.
- Download the dataset described by Bustos et al., 2019. Request access from the BIMCV website.
- Follow the preprocessing steps from Rx-thorax-automatic-captioning:
  - Resize each image to 1024x1024px:
    `mkdir 1024 ; find . -iname "*.png" | parallel convert -resize "1024x1024^" {} 1024/{}`
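The `^` in the geometry string makes ImageMagick fill the box: the image is scaled so its smaller side becomes exactly 1024 px, preserving the aspect ratio (the larger side may exceed 1024). A minimal Python sketch of that size computation; the function name and example dimensions are illustrative:

```python
def fill_geometry(width, height, target=1024):
    # ImageMagick's "1024x1024^" fill geometry: scale so the SMALLER
    # side equals `target`; the larger side may overflow the box.
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

# A 2500x2048 scan is scaled by 1024/2048 = 0.5:
print(fill_geometry(2500, 2048))  # (1250, 1024)
```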
General structure: we will organize the dataset in 3 folders:
- `Annotations`: contains all the non-image information, e.g. captions, labels, lists with the image/feature paths for each split, etc.
- `Images`: contains the raw images from the dataset (conveniently resized).
- `Features`: contains the features extracted from the images.
Following this structure, let's organize our folders:
- Create the directories:
  `mkdir Features Images Annotations`
- Put the PadChest CSV into the `Annotations` folder:
  `cp PADCHEST_chest_x_ray_images_labels_160K_01.02.19.csv Annotations`
- Move all (resized) images to the `Images` folder:
  `for folder in $(seq 54) ; do mv ${folder}/1024/* Images; done`
Generate the lists for the dataset. You need to create 3 files per split. Typically, for split in ['train', 'val', 'test']:
- `split_list_ids.txt`: contains the sample ids for the split.
- `split_list_images.txt`: contains the path to each image in the split.
- `split_list.txt`: `sample_id \t data_column1 \t ... \t data_columnN`, where each data_column contains labels (e.g. the reports).
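For concreteness, each line of `split_list.txt` can be parsed by splitting on tabs. The sample id and label values below are made up for illustration:

```python
# Hypothetical line from a split_list.txt file (tab-separated):
line = "sample_0001\tnormal\tPA\n"

fields = line.rstrip("\n").split("\t")
sample_id, data_columns = fields[0], fields[1:]
print(sample_id)      # sample_0001
print(data_columns)   # ['normal', 'PA']
```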
These lists can be generated with the `generate_lists.py` script. For example, to use 90% of the dataset for training (~142k samples), 2% for development (~3k samples) and 8% for testing (~13k samples), execute:
`python padchest_preprocessing/generate_lists.py --root-dir /home/user/DATASETS/padchest/ --labels Annotations/PADCHEST_chest_x_ray_images_labels_160K_01.02.19.csv --fraction 0.9 0.02 0.08 -v`
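The `--fraction 0.9 0.02 0.08` split could be sketched as below. This is only an assumption about how `generate_lists.py` partitions the sample ids, not its actual implementation:

```python
import random

def split_ids(ids, fractions=(0.9, 0.02, 0.08), seed=0):
    # Shuffle once, then cut the id list according to the fractions:
    # train gets the first 90%, val the next 2%, test the remainder.
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_ids(range(1000))
print(len(train), len(val), len(test))  # 900 20 80
```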
Extract features from the images:
- Make sure you correctly installed the Multimodal Keras Wrapper and Keras (or our version of Keras).
- Select the configuration of the extractor in `feature_extraction/config.py`.
- Extract the features!
  `python padchest_preprocessing/feature_extraction/keras/simple_extractor.py`
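`simple_extractor.py` does not yet run in batch (see the TODO at the bottom). A generic batching helper like the sketch below could feed the extractor's model a chunk of images at a time; the helper is illustrative and not part of the repository:

```python
def batched(paths, batch_size=32):
    # Yield successive chunks of at most `batch_size` image paths,
    # so the extractor can predict on a whole batch per forward pass.
    for i in range(0, len(paths), batch_size):
        yield paths[i:i + batch_size]

batches = list(batched([f"img_{i}.png" for i in range(70)], batch_size=32))
print([len(b) for b in batches])  # [32, 32, 6]
```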
- Generate a list pointing to the extracted features: `split_list_features.txt`. Note that we also want to remove the MIME extension from the features, so we call this script with the option `--replace-extension 4`:
  `python padchest_preprocessing/generate_feature_lists.py --root-dir /home/lvapeab/DATASETS/padchest --features-dir Features/padchest_NASNetLarge/ --features NASNetLarge --lists-dir Annotations --extension .npy --replace-extension 4`
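Presumably `--replace-extension 4` strips the last 4 characters of each image name (the `.png` MIME extension) before appending `.npy`. A sketch of the resulting mapping; the helper name and example paths are illustrative:

```python
def feature_path(image_name, features_dir="Features/padchest_NASNetLarge",
                 ext=".npy", replace=4):
    # Presumed behaviour of --replace-extension 4: drop the last 4
    # characters (".png") from the image name, then add the feature
    # extension.
    base = image_name[:-replace] if replace else image_name
    return f"{features_dir}/{base}{ext}"

print(feature_path("sample_0001.png"))
# Features/padchest_NASNetLarge/sample_0001.npy
```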
- Retrieve only the captions from `split_list.txt`:
  `bash padchest_preprocessing/process_captions.sh /home/lvapeab/DATASETS/padchest/Annotations _list.txt captions`
- Profit! This file structure can be used directly with interactive-keras-captioning.
* Make `feature_extraction/keras/simple_extractor.py` work in batch mode.
* Are there any pre-defined splits?