First we create a balanced set of images for adapting the weights of naphash 
In order 16384 frames from each of these datasets:
* CelebA https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
* COCO https://cocodataset.org/#home
* Alternative ImageNet datasets  "ImageNetV2"( https://github.com/modestyachts/ImageNetV2 ) and ImageNet-Sketch (https://github.com/HaohanWang/ImageNet-Sketch)
* Fashionpedia Dataset https://fashionpedia.github.io/home/Fashionpedia_download.html
* iNaturalist dataset 2019 https://www.kaggle.com/c/inaturalist-2019-fgvc6
* Places365-Standard http://places2.csail.mit.edu/download.html
* ImageNet fall11 Release https://www.image-net.org/

This will also load the images, calculate the dct, and store it on disk for faster weight adaption training and testing

In [1]:
%load_ext autoreload
%autoreload 2
#There are three hyper-parameters for weight adaption session:
dct_dim = 32
min_img_dims = 128
trg_num_samples = 16384 #largest power of two which can fit in the individual dataset sizes (Places365 is 36500 but with the potential amount of large/small images we stay on the cautious side)

#Experimental switches:
use_pil_rz = False #uses PIL LANCZOS for downsampling instead of opencv INTER_AREA  
num_threads = 8 #use multi-core procedures with this many cores where possible

Each dataset as a different number of samples and some contain a different degree of images too small for use (128x128 is used as a minimum)

In [2]:
import gzip
import json
import glob
seekall = '/workspace/data/data/imagenet_fall11/fall11_whole.seekhelper.txt.gz'
def load_from_gz(filepath, tar_root_folder="", as_list_style=True):
    if as_list_style:
        info = []
    else:
        info = {}
    with gzip.open(filepath, 'rb') as f_in:
        for line in f_in:
            if as_list_style:
                info.append(tar_root_folder+line.decode('ascii'))
            else:
                l_sp = line.decode('ascii').split(':')
                info[l_sp[0]] = [int(l_sp[1]),int(l_sp[2])]
    return info

def load_from_seekhelper(dir0):
    if not dir0[-1] == '/':
        dir0+='/'
    return [dir0+p for p in load_from_gz(dir0+'.seekhelper.txt.gz')]
imgnet_alt_paths = load_from_seekhelper('/workspace/data/data/imagenet_alt/')
fashionpedia_paths = load_from_seekhelper('/workspace/data/data/fashionpedia/')
inat2019_paths = load_from_seekhelper('/workspace/data/inat2019/')
places365_paths = load_from_seekhelper('/workspace/data/places365/')
celeba_paths = load_from_seekhelper('/workspace/data/celeba/jpg256')

imgnet_root='/workspace/data/data/imagenet_fall11/'
imgnet_paths = load_from_seekhelper(imgnet_root)
coco_inp_dir = '/workspace/data/data/coco/images/train2017'
coco_paths = sorted(glob.glob(coco_inp_dir+'/*.jpg'))

In [3]:
import random
train_paths = [celeba_paths, coco_paths, imgnet_alt_paths, fashionpedia_paths, inat2019_paths, places365_paths, imgnet_paths]

In [4]:
print([len(p) for p in train_paths])

[70000, 118287, 53542, 48823, 303593, 36500, 14197087]


In [5]:
#this step can be inserted for a second round to create a dedicated test set (all images not used during weight adaptation)
#import numpy as np
#precalc_dcts = np.load('ordered_dct_balanced.npz')
#skip_paths = set([s.strip() for s in precalc_dcts['paths'].tolist()])

In [6]:
#normalize sets and take smaller subsample (this is a random subset; use cell below instead to recreate the exact subsets from the paper)
trg_num_samples_use = int(trg_num_samples*1.25) #about 12% of imagenet frames have one dimension smaller than 128 -> add double for buffer; will be straightened later
all_paths_balanced = []
skipped_cnt = 0
for t in train_paths:
    t0 = t[:]
    random.shuffle(t0)
    t1 = []
    for t in t0:
        if t.strip() in skip_paths:
            skipped_cnt += 1
            continue
        t1.append(t)
        if len(t1) >= trg_num_samples_use:
            break
    all_paths_balanced += t1
print("Skipped ",skipped_cnt)

Skipped  46520


In [7]:
#Uncomment to recreate the original data paths 
#!gunzip ordered_paths_balanced.txt.gz
#with open('ordered_paths_balanced.txt', 'r') as f_out:
#    all_paths_balanced = [p for p in f_out]

In [176]:
from async_dct_loader import async_load_dct_paths, tqdm_nb
from naphash_py import naphash as nhcpp, rot_inv_type
nhcpp_objs = [nhcpp(dct_dim=dct_dim, rot_inv_mode=rot_inv_type.none, apply_center_crop=False, is_rgb=False) for _ in range(num_threads)] #no center crop
dcts = await async_load_dct_paths(nhcpp_objs, all_paths_balanced, num_threads, dct_dim, min_img_dims, tqdm_vers=tqdm_nb, pil_sz=32)

HBox(children=(FloatProgress(value=0.0, max=17875.0), HTML(value='')))




The next step balances the datasets (some files might not load or result in images smaller than 128x128)

Each subset is represented trg_num_samples times (=16385)

In [9]:
import numpy as np
def dataset_by_path(p):
    path_context = ['/celeba/','/coco/','/imagenet_alt/', '/fashionpedia/','/inat2019/','/places365/', '/imagenet_fall11/']
    for i,c in enumerate(path_context):
        if c in p: return i
    return -1
orig_dct, all_bu, count_ds, path_per_ds = [], [], {}, {i:[] for i in range(7)}
all_paths_balanced = paths
for i in range(len(all_paths_balanced)):
    if dcts[i] is None:
        continue
    idx_dataset = dataset_by_path(all_paths_balanced[i])
    if count_ds.get(idx_dataset,0) > trg_num_samples:
        continue
    path_per_ds[idx_dataset].append(all_paths_balanced[i])
    count_ds[idx_dataset] = count_ds.get(idx_dataset,0) + 1  
    all_bu.append(all_paths_balanced[i])
print(count_ds)

{0: 16385, 1: 16385, 2: 16385, 3: 16385, 4: 16385, 5: 16385, 6: 16385}


We save the exact list used for experiments for later evaluation

In [None]:
with open('ordered_paths_balanced.txt', 'wt') as f_out:
    for p in all_bu:
        f_out.write(p.replace('\n','')+'\n')
!gzip ordered_paths_balanced.txt

In [185]:
np.savez_compressed('ordered_dct_balanced.npz', dcts=orig_dct, paths=all_bu)  #round one: training set for weight adaptation
#np.savez_compressed('ordered_dct_pil_testing.npz', dcts=orig_dct, paths=all_bu) #round two: test set for checking robustness

The steps below pre-calculate NPHASH hashes for the CIFAR10 dataset (this is done after weight adaptation!)

In [2]:
from async_dct_loader import async_load_dct_paths
cifar10_train = list(sorted(files_in_subdirs('./cifar10/train')))
cifar10_val = list(sorted(files_in_subdirs('./cifar10/test')))
nhcpp_objs = [nhcpp(dct_dim=dct_dim, rot_inv_mode=rot_inv_type.none, apply_center_crop=False, is_rgb=False) for _ in range(num_threads)] #no center crop
hashes = await async_load_dct_paths(nhcpp_objs, cifar10_val+cifar10_train, num_threads, dct_dim, -1, tqdm_vers=tqdm_nb, ret_hash = True)

HBox(children=(FloatProgress(value=0.0, max=7500.0), HTML(value='')))




In [10]:
np.savez_compressed('cifar10_hashes.npz', hashes=np.vstack(hashes), paths=list(cifar10_val+cifar10_train))