Extracting data for triplet-based distance metric learning. Original datasets: THINGS, IHSJ, Yummly.

greenfieldvision/ditdml

Data Interfaces for Triplet-based Distance Metric Learning

Interfaces to the THINGS (1 2), IHSJ (3 4) and Yummly (5 6) datasets that produce data for triplet-based distance metric learning.

The code is close to production grade and provides an effective way to access triplet-labeled datasets for distance metric learning.

Requirements

  • Python 3.8+
  • more_itertools
  • numpy
  • scipy
  • scikit-learn
  • Pillow
  • Tkinter
  • psiz==0.5.1

Instructions for Use

THINGS

  1. Navigate to the main THINGS dataset page on OSF and download the Main folder as a zip archive.
  2. Unzip the archive and its subarchives to folder {THINGS_ROOT}/Main.
  3. Navigate to the "Revealing the multidimensional mental representations..." page on OSF and download both the "data" and the "variables" folder as zip archives.
  4. Unzip the two archives to folder {THINGS_ROOT}/Revealing.
  5. Ask the corresponding author of the THINGS dataset for the labeled triplet data.
  6. Place the files in {THINGS_ROOT}/Revealing/triplets.

IHSJC

  1. Download the ImageNet dataset to folder {IHSJ_ROOT}/imagenet.
  2. Navigate to the IHSJ dataset page on OSF and download two files: data/deprecated/psiz0.4.1/catalog.hdf5 to folder {IHSJ_ROOT}/val/catalogs/psiz0.4.1, and data/deprecated/psiz0.4.1/obs-195.hd5 to folder {IHSJ_ROOT}/val/obs/psiz0.4.1.

Yummly

  1. Download the zip archive http://vision.cornell.edu/se3/wp-content/uploads/2014/09/food100-dataset.zip.
  2. Unzip the archive to {YUMMLY_ROOT}.
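Before running any tool, it can help to confirm that the folders named in the steps above are in place. The following is a minimal sketch, not part of the repository: the function name missing_folders and the EXPECTED_SUBFOLDERS table are hypothetical, and the table lists only the folders mentioned in these instructions, not the files inside them.

```python
from pathlib import Path

# Sub-folders named in the setup steps, relative to each dataset root.
# (Hypothetical helper table; the repository may expect more than this.)
EXPECTED_SUBFOLDERS = {
    "things": ["Main", "Revealing", "Revealing/triplets"],
    "ihsjc": ["imagenet", "val/catalogs/psiz0.4.1", "val/obs/psiz0.4.1"],
    "yummly": [],  # the steps only name the root folder itself
}


def missing_folders(dataset_name, root):
    """Return the expected folders that do not exist under root."""
    root = Path(root)
    candidates = [root] + [root / sub for sub in EXPECTED_SUBFOLDERS[dataset_name]]
    return [str(path) for path in candidates if not path.is_dir()]
```

For example, missing_folders("things", "/data/things") returns an empty list once steps 1 to 6 of the THINGS instructions are complete (the path /data/things is a placeholder for {THINGS_ROOT}).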

Take a look at the scripts in the tools/ directory, e.g. report_data_statistics.py, and at the *DataInterface classes to understand how to implement a PyTorch Dataset or write TensorFlow records using the data interfaces.
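As a starting point for a PyTorch Dataset, here is a minimal sketch of a map-style dataset over triplets. It does not use the actual *DataInterface API, which is not documented in this README: the class name TripletDataset and the arguments triplets, image_paths and load_image are hypothetical stand-ins for whatever the data interface returns. The class follows the map-style protocol (__len__ and __getitem__); to use it with PyTorch, it could subclass torch.utils.data.Dataset.

```python
class TripletDataset:
    """Map-style dataset over (i, j, k) triplets of image indexes.

    Hypothetical sketch: `triplets` and `image_paths` stand in for the data
    returned by a data interface. The meaning of each position in a triplet
    depends on the dataset and is not assumed here.
    """

    def __init__(self, triplets, image_paths, load_image):
        self.triplets = list(triplets)
        self.image_paths = list(image_paths)
        self.load_image = load_image  # callable: path -> image or tensor

    def __len__(self):
        return len(self.triplets)

    def __getitem__(self, index):
        # Load the three images referenced by the triplet at this position.
        triplet = self.triplets[index]
        return tuple(self.load_image(self.image_paths[i]) for i in triplet)
```

Passing the loader as an argument keeps the sketch independent of any image library; with Pillow installed, PIL.Image.open is one possible choice for load_image.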

Dataset Splits

The code includes functionality to split the triplets into training, validation and test subsets, and the splits are covered by unit tests. If desired, new splits can be implemented in the ThingsDataInterface class.
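To illustrate what a class-based split can look like, here is a minimal sketch. It is not the repository's implementation: the function name split_triplets_by_class, the default fractions, and the choice to drop triplets whose classes fall into different subsets are all assumptions made for illustration.

```python
import random


def split_triplets_by_class(triplets, num_classes, fractions=(0.8, 0.1, 0.1), seed=13):
    """Partition classes, then keep triplets whose classes all share a subset."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    classes = list(range(num_classes))
    rng.shuffle(classes)

    # Assign each class to exactly one subset, in shuffled order.
    n_train = int(fractions[0] * num_classes)
    n_val = int(fractions[1] * num_classes)
    subset_of = {}
    for position, c in enumerate(classes):
        if position < n_train:
            subset_of[c] = "training"
        elif position < n_train + n_val:
            subset_of[c] = "validation"
        else:
            subset_of[c] = "test"

    subsets = {"training": [], "validation": [], "test": []}
    for triplet in triplets:
        names = {subset_of[c] for c in triplet}
        if len(names) == 1:  # assumption: triplets spanning subsets are dropped
            subsets[names.pop()].append(triplet)
    return subsets
```

Because no class appears in more than one subset, a model evaluated on the test triplets has seen none of their classes during training. The cost is that triplets mixing classes from different subsets are discarded.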

Tools

All tools must be run from the ditdml folder.

To report statistics such as the number of images:

python ditdml/tools/report_data_statistics.py --dataset-name things --data-directory-name {THINGS_ROOT} --split-type quasi_original --seed 13
python ditdml/tools/report_data_statistics.py --dataset-name things --data-directory-name {THINGS_ROOT} --split-type by_class --class-triplet-conversion-type all_instances --seed 13
python ditdml/tools/report_data_statistics.py --dataset-name things --data-directory-name {THINGS_ROOT} --split-type by_class --class-triplet-conversion-type prototypes --seed 13
python ditdml/tools/report_data_statistics.py --dataset-name things --data-directory-name {THINGS_ROOT} --split-type by_class_same_training_validation --class-triplet-conversion-type all_instances --seed 13
python ditdml/tools/report_data_statistics.py --dataset-name things --data-directory-name {THINGS_ROOT} --split-type by_class_same_training_validation --class-triplet-conversion-type prototypes --seed 13

python ditdml/tools/report_data_statistics.py --dataset-name ihsjc --data-directory-name {IHSJ_ROOT} --split-type by_class --class-triplet-conversion-type all_instances --seed 15
python ditdml/tools/report_data_statistics.py --dataset-name ihsjc --data-directory-name {IHSJ_ROOT} --split-type by_class --class-triplet-conversion-type prototypes --seed 15
python ditdml/tools/report_data_statistics.py --dataset-name ihsjc --data-directory-name {IHSJ_ROOT} --split-type by_class_same_training_validation --class-triplet-conversion-type all_instances --seed 15
python ditdml/tools/report_data_statistics.py --dataset-name ihsjc --data-directory-name {IHSJ_ROOT} --split-type by_class_same_training_validation --class-triplet-conversion-type prototypes --seed 15

python ditdml/tools/report_data_statistics.py --dataset-name yummly --data-directory-name {YUMMLY_ROOT} --split-type same_training_validation_test --seed 16
python ditdml/tools/report_data_statistics.py --dataset-name yummly --data-directory-name {YUMMLY_ROOT} --split-type by_instance --seed 16
python ditdml/tools/report_data_statistics.py --dataset-name yummly --data-directory-name {YUMMLY_ROOT} --split-type by_instance_same_training_validation --seed 16
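The THINGS commands above differ only in their split type and class-triplet conversion type, so they can be generated rather than typed. The sketch below builds the same argument lists using only the flags and values shown above; the helper names statistics_command and things_commands are hypothetical, and /data/things is a placeholder for {THINGS_ROOT}.

```python
import shlex

TOOL = "ditdml/tools/report_data_statistics.py"


def statistics_command(dataset_name, data_directory, split_type, seed, conversion_type=None):
    """Build the argument list for one report_data_statistics.py run."""
    args = [
        "python", TOOL,
        "--dataset-name", dataset_name,
        "--data-directory-name", data_directory,
        "--split-type", split_type,
    ]
    if conversion_type is not None:
        args += ["--class-triplet-conversion-type", conversion_type]
    args += ["--seed", str(seed)]
    return args


def things_commands(things_root, seed=13):
    """Return the five THINGS command variants listed above."""
    commands = [statistics_command("things", things_root, "quasi_original", seed)]
    for split_type in ("by_class", "by_class_same_training_validation"):
        for conversion_type in ("all_instances", "prototypes"):
            commands.append(
                statistics_command("things", things_root, split_type, seed, conversion_type)
            )
    return commands


# Print the commands; each list can also be passed to subprocess.run.
for command in things_commands("/data/things"):
    print(shlex.join(command))
```

To execute a command instead of printing it, pass the list to subprocess.run(command, check=True) with the ditdml folder as the working directory.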

To interactively visualize labeled triplets:

python ditdml/tools/visualize_triplets.py --dataset-name things --data-directory-name {THINGS_ROOT} --split-type quasi_original --seed 23 --subset-name test --initial-triplet-index 200
python ditdml/tools/visualize_triplets.py --dataset-name things --data-directory-name {THINGS_ROOT} --split-type by_class --class-triplet-conversion-type all_instances --seed 23 --subset-name training --initial-triplet-index 315715
python ditdml/tools/visualize_triplets.py --dataset-name things --data-directory-name {THINGS_ROOT} --split-type by_class --class-triplet-conversion-type prototypes --seed 23 --subset-name validation --initial-triplet-index 42
python ditdml/tools/visualize_triplets.py --dataset-name things --data-directory-name {THINGS_ROOT} --split-type by_class_same_training_validation --class-triplet-conversion-type all_instances --seed 23 --subset-name test --initial-triplet-index 101
python ditdml/tools/visualize_triplets.py --dataset-name things --data-directory-name {THINGS_ROOT} --split-type by_class_same_training_validation --class-triplet-conversion-type prototypes --seed 23 --subset-name training --initial-triplet-index 22

python ditdml/tools/visualize_triplets.py --dataset-name ihsjc --data-directory-name {IHSJ_ROOT} --split-type by_class --seed 25 --subset-name test --initial-triplet-index 200
python ditdml/tools/visualize_triplets.py --dataset-name ihsjc --data-directory-name {IHSJ_ROOT} --split-type by_class --class-triplet-conversion-type prototypes --seed 25 --subset-name validation --initial-triplet-index 300

python ditdml/tools/visualize_triplets.py --dataset-name yummly --data-directory-name {YUMMLY_ROOT} --split-type same_training_validation_test --seed 26 --subset-name training --initial-triplet-index 222
python ditdml/tools/visualize_triplets.py --dataset-name yummly --data-directory-name {YUMMLY_ROOT} --split-type by_instance --seed 26 --subset-name test --initial-triplet-index 333
python ditdml/tools/visualize_triplets.py --dataset-name yummly --data-directory-name {YUMMLY_ROOT} --split-type by_instance_same_training_validation --seed 26 --subset-name validation --initial-triplet-index 444

(press the left and right arrow keys)

To interactively visualize neighbors according to the provided embedding for THINGS:

python ditdml/tools/visualize_neighbors.py --data-directory-name {THINGS_ROOT} --num-neighbors 4 --initial-class-index 1854

(press the left and right arrow keys)

To interactively visualize the similarity matrix for THINGS:

python ditdml/tools/visualize_similarity_matrix.py --data-directory-name {THINGS_ROOT}

(click on matrix elements in the left pane to show image pairs)
