# Test dataset generator
Here we prepare raw tensor binary files from the Imagenette validation (`valid`) dataset for use by a C runtime.

In [None]:
from fastai.vision.all import *
from torch.utils.data import Dataset
import struct
from torchvision.transforms.functional import to_pil_image

In [None]:
path = untar_data(URLs.IMAGENETTE_320,data=Path.cwd()/'data')

We will eventually generate raw tensor binary files, with the file extension `.bin`, as a `test` dataset for the C runtime. The names of their parent directories are the labels encoded as integers from `0` to `9`. To support this newly generated `test` dataset, we need to replace `ImageBlock` with a custom loader and customize `get_y`, and we set both up before the `test` dataset is generated.

`ImageBlock` normally takes care of image file conversion, but our `test` dataset will consist of raw tensor binary files. To handle both kinds of file in a unified way, we implement our own file loader below.

In [None]:
class ImageTensorLoader(Transform):
    "Load either a JPEG image or a raw float32 tensor file as a `PILImage`"
    def encodes(self, fn:Path):
        fn = str(fn)
        if fn.lower().endswith(('.jpg', '.jpeg')):
            return PILImage.create(fn)
        elif fn.lower().endswith('.bin'):
            # Raw file: 3*224*224 native float32 values, assumed normalized with imagenet_stats
            with open(fn, 'rb') as f:
                x = struct.unpack(f'{3*224*224}f', f.read())
            x = torch.tensor(x).view(3, 224, 224)
            # Undo the normalization so the tensor can be converted back to an image
            mean, std = [torch.tensor(o).view(3,1,1) for o in imagenet_stats]
            x = x * std + mean
            return PILImage.create(to_pil_image(x))
        else:
            raise ValueError(f'Unknown file type for {fn}')
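
The `.bin` layout that the loader expects can be illustrated without fastai or PyTorch. The following is a minimal sketch, assuming each file holds nothing but `3*224*224` native-endian `float32` values with no header; the helper names `write_raw` and `read_raw` are hypothetical and not part of this notebook.

```python
import os
import struct
import tempfile

C, H, W = 3, 224, 224          # tensor shape assumed by the loader
N = C * H * W                  # number of float32 values per file

def write_raw(path, values):
    """Write a flat sequence of floats as native float32 values, no header."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(values)}f", *values))

def read_raw(path, n):
    """Read exactly n float32 values; struct.unpack raises struct.error on a size mismatch."""
    with open(path, "rb") as f:
        return list(struct.unpack(f"{n}f", f.read()))

# Small integers are exactly representable in float32, so the round trip is lossless.
values = [float(i % 7) for i in range(N)]
with tempfile.TemporaryDirectory() as d:
    sample_path = os.path.join(d, "sample.bin")
    write_raw(sample_path, values)
    file_size = os.path.getsize(sample_path)
    restored = read_raw(sample_path, N)
```

Because the format carries no shape or dtype information, the reader (Python or C) has to agree with the writer on both in advance.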

In addition to the custom file loader, we use encoded labels for the raw tensor binary files. If the label is a number from `0` to `9`, it is decoded back to the original label so that it matches the labels of the original JPEG datasets.

In [None]:
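x = L(o for o in get_files(path) if str(o).lower().endswith(('.jpeg', '.jpg')))
# Sort so the index order is stable across runs and matches `dls.vocab`,
# which `CategoryBlock` sorts by default
vocab = sorted(set(o.parent.name for o in x))
i2o = {i:o for i, o in enumerate(vocab)}

def my_parent_label(fn:Path):
    pa = parent_label(fn)
    # Numeric folder names are encoded labels; decode them back to the original label
    return i2o[int(pa)] if pa.isdigit() else pa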
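
The decoding step can be sketched in plain Python. This is an illustration only: the class directory names and file paths are made up for the example, and it assumes the index-to-label mapping is built from a sorted vocabulary so that index `i` refers to the same class on every run.

```python
from pathlib import Path

# Illustrative class directory names (an assumption for this example).
class_dirs = {"n03000684", "n01440764", "n02102040"}

vocab = sorted(class_dirs)                # deterministic index order
index_to_label = dict(enumerate(vocab))

def decode_parent_label(fn: Path) -> str:
    """Return the label for a file, decoding numeric parent folder names."""
    parent = fn.parent.name
    return index_to_label[int(parent)] if parent.isdigit() else parent

# A JPEG keeps its folder name; a .bin file under "1" maps to the second sorted class.
label_from_jpeg = decode_parent_label(Path("val/n02102040/img_1.JPEG"))
label_from_bin = decode_parent_label(Path("test/1/0.bin"))
```

Both files end up with the same label, which is what lets the JPEG and `.bin` datasets share one vocabulary.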

Next we build our own dataloaders to handle both the original JPEG files and the raw tensor binary files, although the binary files have not been generated yet. The original JPEG files make up the `train` and `valid` datasets, and the newly generated raw tensor binary files make up the `test` dataset.

In [None]:
db = DataBlock(
    blocks=(TransformBlock(type_tfms=ImageTensorLoader), CategoryBlock),
    get_items=get_files,
    splitter=GrandparentSplitter(valid_name='val'),
    get_y=my_parent_label,
    item_tfms=Resize(224)
)
dls = db.dataloaders(path)
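
To see why files under `test` (or stray files such as a saved model) do not end up in the training or validation split, here is a plain-Python approximation of a grandparent-based split. It is a sketch of the idea, not fastai's implementation; the function name `grandparent_split` and the example paths are hypothetical.

```python
from pathlib import Path

def grandparent_split(files, train_name="train", valid_name="val"):
    """Split files by the name of the directory two levels above each file."""
    train = [f for f in files if f.parent.parent.name == train_name]
    valid = [f for f in files if f.parent.parent.name == valid_name]
    return train, valid

files = [
    Path("root/train/a/1.JPEG"),
    Path("root/val/a/2.JPEG"),
    Path("root/test/0/0.bin"),   # grandparent is "test": in neither split
    Path("root/model.pth"),      # no matching grandparent: in neither split
]
train_files, valid_files = grandparent_split(files)
```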

In [None]:
learn = vision_learner(dls, resnet18, metrics=accuracy, pretrained=True)

If model parameters were previously saved to a file, load them. Otherwise, run `fine_tune` and save the newly trained parameters to a file for future use.

In [None]:
fn = path/'model.pth'
if os.path.exists(fn):
    learn.model.load_state_dict(torch.load(fn))
else:
    learn.fine_tune(1)
    torch.save(learn.model.state_dict(), fn)    

We generate the `test` dataset while running inference on the `valid` dataset. The `test` dataset is derived from the `valid` dataset: input tensors passed to the model are stored in raw tensor format for the C runtime.

In [None]:
class SaveImageFilesCallback(Callback):
    "Save up to `nitems` validation inputs, split evenly over `ncat` categories, as raw float32 files"
    def __init__(self, save_dir, ncat=10, nitems=100):
        self.save_dir = save_dir
        self.ncat = ncat
        self.per_cat = nitems // ncat   # maximum number of files per category
        self.counts = [0] * ncat        # files saved so far, indexed by category
        for i in range(self.ncat):
            os.makedirs(save_dir/str(i), exist_ok=True)

    def after_batch(self):
        # Only save during validation/inference
        if self.learn.training or self.learn.model.training:
            return
        for X,y in zip(self.learn.xb[0], self.learn.y):
            y = int(y)
            if self.counts[y] >= self.per_cat:
                continue

            fn = f'{str(self.save_dir)}/{y}/{self.counts[y]}'
            with open(fn+'.bin', "wb") as f:
                # Native float32 values in C, H, W order
                f.write(struct.pack(f'{X.numel()}f', *X.flatten().tolist()))

            self.counts[y] += 1
            if sum(self.counts) == self.per_cat * self.ncat:
                break
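
The per-category quota used by the callback can be tried out in isolation. The following is a minimal sketch, assuming the intent is a fixed number of files per category; `save_with_quota` and the toy sample stream are hypothetical and use only the standard library.

```python
import os
import struct
import tempfile

def save_with_quota(samples, save_dir, ncat, per_cat):
    """Write up to per_cat samples for each of ncat categories as raw float32 files."""
    counts = [0] * ncat                      # files written so far, per category
    for values, label in samples:
        if counts[label] >= per_cat:
            continue                         # this category is already full
        cat_dir = os.path.join(save_dir, str(label))
        os.makedirs(cat_dir, exist_ok=True)
        with open(os.path.join(cat_dir, f"{counts[label]}.bin"), "wb") as f:
            f.write(struct.pack(f"{len(values)}f", *values))
        counts[label] += 1
        if sum(counts) == ncat * per_cat:
            break                            # every category is full
    return counts

# Toy stream: 15 samples of 4 floats each, cycling through 3 categories.
samples = [([float(i)] * 4, i % 3) for i in range(15)]
with tempfile.TemporaryDirectory() as d:
    counts = save_with_quota(samples, d, ncat=3, per_cat=2)
    written = sorted(
        os.path.relpath(os.path.join(root, name), d)
        for root, _, names in os.walk(d)
        for name in names
    )
```

Keeping the category count and the per-category limit as separate values avoids relying on the two numbers happening to be equal.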

In [None]:
%%time
learn.validate(cbs=SaveImageFilesCallback(path/'test'))

CPU times: user 7min, sys: 1min 45s, total: 8min 45s
Wall time: 1min 58s


(#2) [337.6174621582031,0.10140127688646317]

## Test dataset directory structure

In [None]:
!tree -d data/imagenette2-320/test
!ls -al data//imagenette2-320/test/[2,6]/[3,7].bin

data/imagenette2-320/test
├── 0
├── 1
├── 2
├── 3
├── 4
├── 5
├── 6
├── 7
├── 8
└── 9

10 directories
-rw-rw-r-- 1 doyu doyu 602112 Apr 23 11:37 data//imagenette2-320/test/2/3.bin
-rw-rw-r-- 1 doyu doyu 602112 Apr 23 11:37 data//imagenette2-320/test/2/7.bin
-rw-rw-r-- 1 doyu doyu 602112 Apr 23 11:37 data//imagenette2-320/test/6/3.bin
-rw-rw-r-- 1 doyu doyu 602112 Apr 23 11:37 data//imagenette2-320/test/6/7.bin
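
The listed size of 602112 bytes equals 3 × 224 × 224 four-byte floats. A quick sanity check before handing the files to another runtime could look like the sketch below; `find_bad_sizes` is a hypothetical helper, and the temporary directory stands in for the real `test` directory.

```python
import os
import struct
import tempfile

# 3 channels * 224 * 224 pixels * 4 bytes per native float32
EXPECTED_BYTES = 3 * 224 * 224 * struct.calcsize("f")

def find_bad_sizes(root, expected=EXPECTED_BYTES):
    """Return relative paths of .bin files under root whose size differs from expected."""
    bad = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name.endswith(".bin"):
                full = os.path.join(dirpath, name)
                if os.path.getsize(full) != expected:
                    bad.append(os.path.relpath(full, root))
    return sorted(bad)

with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "0"))
    with open(os.path.join(d, "0", "0.bin"), "wb") as f:
        f.write(bytes(EXPECTED_BYTES))       # correctly sized file
    with open(os.path.join(d, "0", "1.bin"), "wb") as f:
        f.write(bytes(10))                   # deliberately truncated file
    bad_files = find_bad_sizes(d)
```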


Save the `learner` for later use, then load it and run inference on the `test` dataset.

In [None]:
learn.export()
learn = load_learner(path/'export.pkl')

In [None]:
test_dl = dls.test_dl(get_files(path, extensions=['.bin']), with_labels=True)

In [None]:
learn.validate(dl=test_dl)

(#2) [382.3768310546875,0.10000000149011612]