# Test dataset generator
Here we prepare tensor binary files from the Imagenette validation (`valid`) dataset for the C runtime.

In [1]:
from fastai.vision.all import *
import struct
from torchvision.transforms.functional import to_pil_image
from export import serialize_fp32

In [2]:
path = untar_data(URLs.IMAGENETTE_320,data=Path.cwd()/'data')

We will eventually generate raw tensor binary files for the C runtime as a `test` dataset. These files have the extension `.bin`, and their parent directories are named `0` to `9` after the encoded class index. To support the generated `test` dataset we need to replace `ImageBlock` and customize `get_y`, and we set that up before the `test` dataset is generated.

## Generating test set

`ImageBlock` usually takes care of image file conversion, but our test dataset will consist of raw tensor binary files. To handle both kinds of files in a unified way, we implement our own file loader below.

In [3]:
class ImageTensorLoader(Transform):
    def encodes(self, fn:Path):
        fn = str(fn)
        if fn.lower().endswith('.jpg') or fn.lower().endswith('.jpeg'):
            return PILImage.create(fn)
        elif fn.lower().endswith('.bin'):
            with open(fn, 'rb') as f:
                x = struct.unpack(f'{3*224*224}f', f.read())
            x = torch.tensor(x).view(3, 224, 224)
            mean, std = [torch.tensor(o).view(3,1,1) for o in imagenet_stats]
            x = x * std + mean
            return PILImage.create(to_pil_image(x))
        else:
            raise Exception(f'Unknown file type for {fn}')
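
The `.bin` branch above can be illustrated with a small round trip. This is a minimal sketch that assumes the files hold nothing but native-endian float32 values of an ImageNet-normalized tensor, which is what the `struct.unpack` call in the loader implies; it is not the actual `serialize_fp32` from `export`. The helpers `write_fp32`, `read_fp32` and `denormalize` are hypothetical names, and a tiny 3x2x2 tensor stands in for the real 3x224x224 one.

```python
import io
import struct

# ImageNet channel statistics (the values behind fastai's `imagenet_stats`)
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def write_fp32(f, values):
    """Write a flat list of floats as raw native-endian float32 (assumed layout)."""
    f.write(struct.pack(f'{len(values)}f', *values))

def read_fp32(f, count):
    """Read `count` raw float32 values back into a Python list."""
    return list(struct.unpack(f'{count}f', f.read()))

def denormalize(values, channels, pixels_per_channel):
    """Undo (x - mean) / std per channel on a flat channel-major tensor."""
    return [values[c * pixels_per_channel + p] * STD[c] + MEAN[c]
            for c in range(channels) for p in range(pixels_per_channel)]

# A tiny 3x2x2 "image" whose pixels are all 0.5, normalized per channel
channels, pixels_per_channel = 3, 4
normalized = [(0.5 - MEAN[c]) / STD[c]
              for c in range(channels) for _ in range(pixels_per_channel)]

buf = io.BytesIO()
write_fp32(buf, normalized)
size_in_bytes = len(buf.getvalue())  # 4 bytes per float32 value

buf.seek(0)
restored = denormalize(read_fp32(buf, channels * pixels_per_channel),
                       channels, pixels_per_channel)
```

Under the same assumption, a full-size file would be `3 * 224 * 224 * 4` bytes, and `restored` matches the original pixel values up to float32 rounding.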

Besides the custom file loader, we need to handle the encoded labels of the raw tensor binary files. If the label is a number from `0` to `9`, it is decoded back to the original label so that it matches the original JPEG dataset.

In [4]:
x = L(o for o in get_files(path) if str(o).lower().endswith('jpeg') or str(o).lower().endswith('jpg'))
# Sort for a deterministic order across sessions (CategoryBlock also sorts its vocab by default)
vocab = sorted(set(o.parent.name for o in x))
i2o = {i:o for i, o in enumerate(vocab)}

def my_parent_label(fn:Path):
    pa = parent_label(fn)
    return i2o[int(pa)] if pa.isdigit() else pa    
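
The decoding step can be tried without fastai. The sketch below assumes the index-to-label mapping is built from a sorted vocabulary so that indices are stable between runs; `build_index_to_label` and `decode_parent_label` are hypothetical helpers, and the three directory names are a subset of the Imagenette class directories used only for illustration.

```python
from pathlib import Path

def build_index_to_label(label_dirs):
    """Map 0..n-1 to the sorted, de-duplicated label directory names."""
    return dict(enumerate(sorted(set(label_dirs))))

def decode_parent_label(fn, index_to_label):
    """Return the original label for both JPEG paths and encoded .bin paths."""
    name = Path(fn).parent.name
    return index_to_label[int(name)] if name.isdigit() else name

# Illustrative subset of label directories, deliberately unsorted
index_to_label = build_index_to_label(['n03000684', 'n01440764', 'n02102040'])

from_bin = decode_parent_label('test/0/3.bin', index_to_label)
from_jpeg = decode_parent_label('val/n03000684/example.JPEG', index_to_label)
print(from_bin, from_jpeg)  # n01440764 n03000684
```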

Here we build our own `DataBlock` that handles the original JPEG files and the raw tensor binary files together, even though the binary files have not been generated yet. The original JPEG files make up the `train` and `valid` datasets, and the newly generated raw tensor binary files make up the `test` dataset.

In [5]:
db = DataBlock(
    blocks=(TransformBlock(type_tfms=ImageTensorLoader, batch_tfms=IntToFloatTensor), CategoryBlock),
    get_items=get_files,
    splitter=GrandparentSplitter(valid_name='val'),
    get_y=my_parent_label,
    item_tfms=Resize(224),
    batch_tfms=Normalize.from_stats(*imagenet_stats)
)
dls = db.dataloaders(path)

In [6]:
learn = vision_learner(dls, resnet18, metrics=accuracy, pretrained=True)

If the model parameters were saved to a file previously, load them. Otherwise, run `fine_tune` and save the newly trained parameters to a file for future use.

In [7]:
fn = path/'model.pth'
if os.path.exists(fn):
    learn.model.load_state_dict(torch.load(fn))
else:
    learn.fine_tune(1)
    torch.save(learn.model.state_dict(), fn)    

We generate the `test` dataset while running inference on the `valid` dataset. The `test` dataset is derived from the preprocessed `valid` inputs and stored in raw tensor format for the C runtime.

In [8]:
class SaveImageFilesCallback(Callback):
    def __init__(self, save_dir: Path, num_classes=10, max_images_per_class=50):
        self.save_dir = save_dir
        self.num_classes = num_classes
        self.max_images_per_class = max_images_per_class
        self.counts = [0] * num_classes
        
        # Create subdirectories for each class
        for i in range(num_classes):
            os.makedirs(save_dir / str(i), exist_ok=True)
   
    def after_batch(self):
        if self.learn.training or self.learn.model.training:
            return
        
        for X, y in zip(self.learn.xb[0], self.learn.y):
            class_idx = y.item()
            if self.counts[class_idx] >= self.max_images_per_class:
                continue
            
            fn = f'{str(self.save_dir)}/{class_idx}/{self.counts[class_idx]}'
            with open(fn + '.bin', "wb") as f:
                # Serialize tensor as binary
                serialize_fp32(f, X)      
            
            self.counts[class_idx] += 1
            if sum(self.counts) == self.num_classes * self.max_images_per_class:
                break
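
The per-class cap used by the callback can be checked in isolation. The following is a minimal sketch of the same counting logic on plain integers, with a hypothetical helper `select_capped`; it only computes which samples would be written and under which relative file name.

```python
def select_capped(labels, num_classes, max_per_class):
    """Return (relative file name, sample index) pairs, at most max_per_class per class."""
    counts = [0] * num_classes
    selected = []
    for sample_idx, class_idx in enumerate(labels):
        if counts[class_idx] >= max_per_class:
            continue  # this class already has enough samples
        selected.append((f'{class_idx}/{counts[class_idx]}.bin', sample_idx))
        counts[class_idx] += 1
    return selected, counts

# Seven samples from two classes, keeping at most two per class
selected, counts = select_capped([0, 0, 1, 0, 1, 1, 1], num_classes=2, max_per_class=2)
print(selected)  # [('0/0.bin', 0), ('0/1.bin', 1), ('1/0.bin', 2), ('1/1.bin', 4)]
```

Note that in the callback the `break` only leaves the loop over the current batch. Later batches are still visited, but every sample is then skipped by the per-class check.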

In [9]:
%%time
learn.validate(cbs=SaveImageFilesCallback(path/'test', max_images_per_class=50))

CPU times: total: 2min 57s
Wall time: 5min 36s


(#2) [5.374879360198975,0.10165604948997498]

## Test dataset directory structure

In [21]:
!tree -d data/imagenette2-320/test
!ls -al data//imagenette2-320/test/[2,6]/[3,7].bin

Too many parameters - data/imagenette2-320/test


'ls' is not recognized as an internal or external command,
operable program or batch file.


Save the `learner` for later use, then load it and run inference on the `test` dataset.

In [22]:
learn.export(path/'export.pkl')
learn = load_learner(path/'export.pkl')

In [23]:
test_dl = dls.test_dl(get_files(path, extensions=['.bin']), with_labels=True)

In [24]:
learn.validate(dl=test_dl)

(#2) [5.341228485107422,0.10000000149011612]

## Upload updated dataset to HuggingFace

In [11]:
!git clone https://huggingface.co/datasets/ninjalabo/imagenette2-320

Cloning into 'imagenette2-320'...

In [10]:
import os
import shutil

def copy_files(src_folder, dest_folder):
    """
    Copy all files from src_folder to dest_folder.

    Args:
        src_folder (str or Path): Path to the source folder.
        dest_folder (str or Path): Path to the destination folder.
    """
    # Ensure paths are in string format
    src_folder = str(src_folder)
    dest_folder = str(dest_folder)
    
    # Create destination folder if it doesn't exist
    os.makedirs(dest_folder, exist_ok=True)

    # Iterate over the class subdirectories of the source folder
    for dir_folder in os.listdir(src_folder):
        # Build the source and destination paths of the subdirectory
        subdir_src_folder = os.path.join(src_folder, dir_folder)
        subdir_dest_folder = os.path.join(dest_folder, dir_folder)

        # Skip stray files at the top level; os.listdir below needs a directory
        if not os.path.isdir(subdir_src_folder):
            continue

        os.makedirs(subdir_dest_folder, exist_ok=True)

        for filename in os.listdir(subdir_src_folder):
            src_file = os.path.join(subdir_src_folder, filename)
            dest_file = os.path.join(subdir_dest_folder, filename)

            # Only copy files (not directories)
            if os.path.isfile(src_file):
                shutil.copy(src_file, dest_file)
                print(f'Copied: {src_file} to {dest_file}')
            else:
                print(f'Skipped (not a file): {src_file}')

# Define source and destination folders
source_folder = path/"test"    
destination_folder = "imagenette2-320/test"

# Copy files
copy_files(source_folder, destination_folder)

Copied: c:\Users\nghiv\Desktop\Projects\tinyMLaaS\tinyRuntime\data\imagenette2-320\test\0\0.bin to imagenette2-320/test\0\0.bin
Copied: c:\Users\nghiv\Desktop\Projects\tinyMLaaS\tinyRuntime\data\imagenette2-320\test\0\1.bin to imagenette2-320/test\0\1.bin
Copied: c:\Users\nghiv\Desktop\Projects\tinyMLaaS\tinyRuntime\data\imagenette2-320\test\0\10.bin to imagenette2-320/test\0\10.bin
Copied: c:\Users\nghiv\Desktop\Projects\tinyMLaaS\tinyRuntime\data\imagenette2-320\test\0\11.bin to imagenette2-320/test\0\11.bin
Copied: c:\Users\nghiv\Desktop\Projects\tinyMLaaS\tinyRuntime\data\imagenette2-320\test\0\12.bin to imagenette2-320/test\0\12.bin
Copied: c:\Users\nghiv\Desktop\Projects\tinyMLaaS\tinyRuntime\data\imagenette2-320\test\0\13.bin to imagenette2-320/test\0\13.bin
Copied: c:\Users\nghiv\Desktop\Projects\tinyMLaaS\tinyRuntime\data\imagenette2-320\test\0\14.bin to imagenette2-320/test\0\14.bin
Copied: c:\Users\nghiv\Desktop\Projects\tinyMLaaS\tinyRuntime\data\imagenette2-320\test\0\15.b
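
For reference, the standard library can perform a similar copy in one call. This is a sketch of an alternative, not a drop-in replacement for `copy_files`: `shutil.copytree` with `dirs_exist_ok=True` (Python 3.8+) copies recursively at every depth and prints nothing, whereas `copy_files` handles exactly one level of subdirectories and logs each file. The directory names below are placeholders inside a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'src_test'
    dst = Path(tmp) / 'dst_test'

    # Mimic the test/<class index>/<n>.bin layout
    (src / '0').mkdir(parents=True)
    (src / '0' / '0.bin').write_bytes(b'\x00' * 4)

    # The destination may already exist, as it does after `git clone`
    dst.mkdir()
    shutil.copytree(src, dst, dirs_exist_ok=True)

    copied = sorted(p.relative_to(dst).as_posix() for p in dst.rglob('*.bin'))

print(copied)  # ['0/0.bin']
```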

In [1]:
# Commit and push the updated test dataset
!cd imagenette2-320/ & git add test & git commit -m "update test set" & git push

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean


fatal: User cancelled dialog.
bash: line 1: /dev/tty: No such device or address
error: failed to execute prompt script (exit code 1)
fatal: could not read Username for 'https://huggingface.co': No such file or directory


In [5]:
import os
from huggingface_hub import HfApi

api = HfApi()

api.upload_folder(
    repo_id="ninjalabo/imagenette2-320",
    repo_type="dataset",
    # Read the access token from the environment instead of hardcoding it in the notebook
    token=os.environ.get("HF_TOKEN"),
    path_in_repo="test",
    folder_path=path/"test",
    commit_message="update test set"
)

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/datasets/nghivo/test-dataset/commit/fae7e526d6a2c4b00646fb1f0f0d2f7b97a3b037', commit_message='Test upload', commit_description='', oid='fae7e526d6a2c4b00646fb1f0f0d2f7b97a3b037', pr_url=None, pr_revision=None, pr_num=None)