Preprocessing stages for the PadChest dataset.
We assume we are working under the `/home/user/padchest` directory. Modify this as convenient.
- Download the dataset described by Bustos et al., 2019. Request access from the BIMCV website.
- Follow the preprocessing steps from Rx-thorax-automatic-captioning:
  - Resize each image to 1024x1024px:
    `mkdir 1024 ; find . -iname "*.png" | parallel convert -resize "1024x1024^" {} 1024/{}`
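The `^` in the geometry string makes ImageMagick fill the box: the image is scaled so its smaller side becomes exactly 1024 px, preserving the aspect ratio (the larger side may exceed 1024). A minimal Python sketch of that size computation; the function name and example dimensions are illustrative:

```python
def fill_geometry(width, height, target=1024):
    # ImageMagick's "1024x1024^" fill geometry: scale so the SMALLER
    # side equals `target`; the larger side may overflow the box.
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

# A 2500x2048 scan is scaled by 1024/2048 = 0.5:
print(fill_geometry(2500, 2048))  # (1250, 1024)
```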
General structure: we will organize the dataset in 3 folders:
- `Annotations`: contains all the non-image information, e.g. captions, labels, lists with the image/feature paths for each split, etc.
- `Images`: contains the raw images from the dataset (conveniently resized).
- `Features`: contains the features extracted from the images.
Following this structure, let's organize our folders:
- Create the directories:
  `mkdir Features Images Annotations`
- Put the PadChest CSV into the `Annotations` folder:
  `cp PADCHEST_chest_x_ray_images_labels_160K_01.02.19.csv Annotations`
- Move all (resized) images to the `Images` folder:
  `for folder in $(seq 54) ; do mv ${folder}/1024/* Images; done`
Generate the lists for the dataset. You need to create 3 files per split. Typically, for split in ['train', 'val', 'test']:
- `split_list_ids.txt`: contains the sample ids for the split.
- `split_list_images.txt`: contains the path to each image in the split.
- `split_list.txt`: `sample_id \t data_column1 \t ... \t data_columnN`, where each data_column contains labels (e.g. the reports).
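For concreteness, each line of `split_list.txt` can be parsed by splitting on tabs. The sample id and label values below are made up for illustration:

```python
# Hypothetical line from a split_list.txt file (tab-separated):
line = "sample_0001\tnormal\tPA\n"

fields = line.rstrip("\n").split("\t")
sample_id, data_columns = fields[0], fields[1:]
print(sample_id)      # sample_0001
print(data_columns)   # ['normal', 'PA']
```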
These lists can be generated with the `generate_lists.py` script. For example, to use 90% of the dataset for training (~142k samples), 2% for development (~3k samples) and 8% for testing (~13k samples), execute:
`python padchest_preprocessing/generate_lists.py --root-dir /home/user/DATASETS/padchest/ --labels Annotations/PADCHEST_chest_x_ray_images_labels_160K_01.02.19.csv --fraction 0.9 0.02 0.08 -v`
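The `--fraction 0.9 0.02 0.08` split could be sketched as below. This is only an assumption about how `generate_lists.py` partitions the sample ids, not its actual implementation:

```python
import random

def split_ids(ids, fractions=(0.9, 0.02, 0.08), seed=0):
    # Shuffle once, then cut the id list according to the fractions:
    # train gets the first 90%, val the next 2%, test the remainder.
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_ids(range(1000))
print(len(train), len(val), len(test))  # 900 20 80
```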
Extract features from the images:
- Make sure you correctly installed the Multimodal Keras Wrapper and Keras (or our version of Keras).
- Select the configuration of the extractor in `feature_extraction/config.py`.
- Extract the features!
  `python padchest_preprocessing/feature_extraction/keras/simple_extractor.py`
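`simple_extractor.py` does not yet run in batch (see the TODO at the bottom). A generic batching helper like the sketch below could feed the extractor's model a chunk of images at a time; the helper is illustrative and not part of the repository:

```python
def batched(paths, batch_size=32):
    # Yield successive chunks of at most `batch_size` image paths,
    # so the extractor can predict on a whole batch per forward pass.
    for i in range(0, len(paths), batch_size):
        yield paths[i:i + batch_size]

batches = list(batched([f"img_{i}.png" for i in range(70)], batch_size=32))
print([len(b) for b in batches])  # [32, 32, 6]
```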
- Generate a list pointing to the extracted features: `split_list_features.txt`. Note that we also want to remove the MIME extension from the features, so we call this script with the option `--replace-extension 4`:
  `python padchest_preprocessing/generate_feature_lists.py --root-dir /home/lvapeab/DATASETS/padchest --features-dir Features/padchest_NASNetLarge/ --features NASNetLarge --lists-dir Annotations --extension .npy --replace-extension 4`
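Presumably `--replace-extension 4` strips the last 4 characters of each image name (the `.png` MIME extension) before appending `.npy`. A sketch of the resulting mapping; the helper name and example paths are illustrative:

```python
def feature_path(image_name, features_dir="Features/padchest_NASNetLarge",
                 ext=".npy", replace=4):
    # Presumed behaviour of --replace-extension 4: drop the last 4
    # characters (".png") from the image name, then add the feature
    # extension.
    base = image_name[:-replace] if replace else image_name
    return f"{features_dir}/{base}{ext}"

print(feature_path("sample_0001.png"))
# Features/padchest_NASNetLarge/sample_0001.npy
```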
- Retrieve only the captions from `split_list.txt`:
  `bash padchest_preprocessing/process_captions.sh /home/lvapeab/DATASETS/padchest/Annotations _list.txt captions`
- Profit! This file structure can be used directly with interactive-keras-captioning.
* Make `feature_extraction/keras/simple_extractor.py` work in batch mode.
* Are there any pre-defined splits?