### Merge Heterogeneous Datasets for Detection
Datumaro supports merging heterogeneous datasets into a unified data format.

In this document, we import two heterogeneous datasets and export the merged dataset in a unified data format.
First, we import two sample datasets whose formats are COCO detection and VOC detection, respectively.
Then, we export the merged dataset in the `yolo` format.

In [1]:
# Copyright (C) 2021 Intel Corporation
#
# SPDX-License-Identifier: MIT

import datumaro as dm
from datumaro.components.operations import IntersectMerge, compute_image_statistics

We import the sample COCO and VOC datasets separately.
Note that the `format` parameter can be omitted.
Without an explicit `format` parameter, Datumaro detects the proper dataset format automatically.

Since we are only interested in the COCO instances data, the other annotations (such as labels, stuff, ...) are ignored.
Datumaro prints a warning message during import.

In [2]:
dataset_coco = dm.Dataset.import_from("../tests/assets/coco_dataset/coco", format="coco_instances")
dataset_voc = dm.Dataset.import_from(
    "../tests/assets/voc_dataset/voc_dataset1", format="voc_detection"
)



The sample COCO dataset contains 2 images: one in the `train` subset and one in the `val` subset.
The sample VOC dataset also contains 2 images: one in the `train` subset and one in the `test` subset.

In [3]:
print("statistics for a sample COCO dataset")
compute_image_statistics(dataset_coco)

statistics for a sample COCO dataset


{'dataset': {'images count': 2,
  'unique images count': 1,
  'repeated images count': 1,
  'repeated images': [[('a', 'train'), ('b', 'val')]]},
 'subsets': {'val': {'images count': 1,
   'image mean': [0.9999999999999987, 0.9999999999999987, 0.9999999999999987],
   'image std': [6.361265799828938e-08,
    6.361265799828938e-08,
    6.361265799828938e-08]},
  'train': {'images count': 1,
   'image mean': [0.9999999999999987, 0.9999999999999987, 0.9999999999999987],
   'image std': [6.361265799828938e-08,
    6.361265799828938e-08,
    6.361265799828938e-08]}}}

In [4]:
print("statistics for a sample VOC dataset")
compute_image_statistics(dataset_voc)

statistics for a sample VOC dataset


{'dataset': {'images count': 2,
  'unique images count': 1,
  'repeated images count': 1,
  'repeated images': [[('2007_000001', 'train'), ('2007_000002', 'test')]]},
 'subsets': {'test': {'images count': 1,
   'image mean': [0.9999999999999971, 0.9999999999999971, 0.9999999999999971],
   'image std': [9.411065220006367e-08,
    9.411065220006367e-08,
    9.411065220006367e-08]},
  'train': {'images count': 1,
   'image mean': [0.9999999999999971, 0.9999999999999971, 0.9999999999999971],
   'image std': [9.411065220006367e-08,
    9.411065220006367e-08,
    9.411065220006367e-08]}}}
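The `unique images count` and `repeated images count` fields can be read as the result of grouping images by their pixel content. The following is a minimal sketch of that idea, not Datumaro's actual implementation: the helper `summarize_duplicates`, the use of SHA-1, and the stand-in pixel bytes are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def summarize_duplicates(items):
    """Group (item_id, subset, pixel_bytes) records by a hash of the pixel data.

    Hypothetical helper: it only mimics the shape of the 'dataset' part of
    the statistics shown above.
    """
    groups = defaultdict(list)
    for item_id, subset, pixels in items:
        digest = hashlib.sha1(pixels).hexdigest()
        groups[digest].append((item_id, subset))

    # A group with more than one member is a set of repeated images.
    repeated = [members for members in groups.values() if len(members) > 1]
    return {
        "images count": len(items),
        "unique images count": len(groups),
        "repeated images count": len(repeated),
        "repeated images": repeated,
    }


# Stand-in pixel data (assumption): both sample images have identical content.
voc_like = [
    ("2007_000001", "train", b"\xff" * 12),
    ("2007_000002", "test", b"\xff" * 12),
]
print(summarize_duplicates(voc_like))
```

Under this reading, two images with identical content form one unique group, and that group counts as one repeated entry, which matches the values reported for each sample dataset.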

Since the two datasets are heterogeneous, we call `IntersectMerge` to merge them.
The merged dataset contains a total of 4 images.
We can see that subsets are merged by name: both `train` subsets are combined into a single `train` subset with 2 images.

In [5]:
dataset = IntersectMerge()(datasets=[dataset_coco, dataset_voc])

print("statistics for the merged dataset")
compute_image_statistics(dataset)

statistics for the merged dataset


{'dataset': {'images count': 4,
  'unique images count': 2,
  'repeated images count': 2,
  'repeated images': [[('2007_000001', 'train'), ('2007_000002', 'test')],
   [('a', 'train'), ('b', 'val')]]},
 'subsets': {'train': {'images count': 2,
   'image mean': [0.9999999999999978, 0.9999999999999978, 0.9999999999999978],
   'image std': [8.873923033857324e-08,
    8.873923033857324e-08,
    8.873923033857324e-08]},
  'test': {'images count': 1,
   'image mean': [0.9999999999999971, 0.9999999999999971, 0.9999999999999971],
   'image std': [9.411065220006367e-08,
    9.411065220006367e-08,
    9.411065220006367e-08]},
  'val': {'images count': 1,
   'image mean': [0.9999999999999987, 0.9999999999999987, 0.9999999999999987],
   'image std': [6.361265799828938e-08,
    6.361265799828938e-08,
    6.361265799828938e-08]}}}
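The per-subset image counts in the merged statistics can be checked by hand. The sketch below uses only the item ids and subset names printed in the outputs above; `count_by_subset` is a hypothetical helper and does not call Datumaro.

```python
from collections import Counter

# (item id, subset name) pairs taken from the statistics printed above.
coco_items = [("a", "train"), ("b", "val")]
voc_items = [("2007_000001", "train"), ("2007_000002", "test")]


def count_by_subset(*datasets):
    """Count items per subset name across all given datasets."""
    return Counter(subset for items in datasets for _, subset in items)


merged_counts = count_by_subset(coco_items, voc_items)
print(dict(merged_counts))  # {'train': 2, 'val': 1, 'test': 1}
```

Because subsets are matched by name, the two `train` subsets collapse into one, while `val` and `test` each keep their single image.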

Finally, Datumaro converts the merged dataset to the `yolo` format and exports it to the `merged_dataset` folder.

In [6]:
dataset.export("merged_dataset", "yolo")
!ls merged_dataset

['obj.data',
 'obj.names',
 'obj_test_data',
 'obj_train_data',
 'obj_val_data',
 'test.txt',
 'train.txt',
 'val.txt']
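One reason a unified format matters for detection is that the three formats describe bounding boxes differently: COCO stores `[x_min, y_min, width, height]` in pixels, VOC stores corner coordinates, and YOLO stores a normalized box center and size. The following sketch illustrates these conventions only; the function names and the 100x50 image are assumptions, and this is not Datumaro's converter code.

```python
def coco_to_yolo(bbox, image_width, image_height):
    """Convert COCO [x_min, y_min, width, height] in pixels to
    YOLO (center_x, center_y, width, height), normalized to [0, 1]."""
    x_min, y_min, width, height = bbox
    return (
        (x_min + width / 2) / image_width,
        (y_min + height / 2) / image_height,
        width / image_width,
        height / image_height,
    )


def voc_to_yolo(bbox, image_width, image_height):
    """Convert VOC-style [x_min, y_min, x_max, y_max] corners in pixels to YOLO."""
    x_min, y_min, x_max, y_max = bbox
    coco_bbox = [x_min, y_min, x_max - x_min, y_max - y_min]
    return coco_to_yolo(coco_bbox, image_width, image_height)


# The same box on a hypothetical 100x50 image, written in both source conventions.
yolo_from_coco = coco_to_yolo([10, 10, 40, 20], 100, 50)
yolo_from_voc = voc_to_yolo([10, 10, 50, 30], 100, 50)
print(yolo_from_coco)
print(yolo_from_voc)
```

Both calls describe the same box, so they should produce the same normalized YOLO coordinates.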