# Merge datasets
version: 1

info:
- Merge the annotation JSON files of different datasets

author: nuno costa

## motionLAB Annotations Data Format

If you wish to combine multiple datasets, it is often useful to convert them into a unified data format. 

Objective: This script lets you merge annotations into the motionLab format, a COCO- and TAO-style annotation file, so that your general data.json file contains the image IDs of your data.

##### NOTE: The formats are described at the end of this script

MOLA format : motionLab annotations format

COCO format : https://cocodataset.org/#format-data ; https://www.immersivelimit.com/tutorials/create-coco-annotations-from-scratch

TAO format : https://github.com/TAO-Dataset/tao/blob/master/tao/toolkit/tao/tao.py




In [11]:
from annotate_v5 import *
import platform 

In [12]:
#Define root dir depending on the OS
rdir='D:/external_datasets/' #WARNING must be the datasets root directory
print('OS: {}'.format(platform.platform()))
if 'linux' in platform.platform().lower():
    rdir='/home/administrator/Z/Datasets/External Datasets/' #'/mnt/d/external_datasets/'
print('root dir: {}'.format(rdir))

OS: Linux-5.4.0-65-generic-x86_64-with-glibc2.10
root dir: /home/administrator/Z/Datasets/External Datasets/
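
The OS check above can also be written as a lookup on `platform.system()`. The following is a minimal sketch, not part of annotate_v5: the function name `dataset_root` and the `ROOTS` table are hypothetical, and the two paths are copied from the cell above.

```python
import platform

# Root directories copied from the cell above; adjust them to your machine.
ROOTS = {
    "Linux": "/home/administrator/Z/Datasets/External Datasets/",
    "Windows": "D:/external_datasets/",
}

def dataset_root(system=None):
    """Return the datasets root for an OS name (defaults to this machine)."""
    system = system or platform.system()
    # Unknown systems fall back to the Windows path, mirroring the default above.
    return ROOTS.get(system, ROOTS["Windows"])

rdir = dataset_root()
print("root dir: {}".format(rdir))
```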


## 1. INIT motionLAB JSON
- uses the annotate_v5.init_json() function

In [2]:
molafile=rdir+'mola.json'
init_json(file=molafile)

JSON INITIATED : D:/external_datasets/mlab.json


## 2. LOAD & ORGANIZE original datasets JSONs

### COCO

#### Organize original COCO & save fullcoco
- without changing ids - necessary if you are going to mix different types of annotations
- #NOTE the only top-level key that differs between instances, captions and person_keypoints is "annotations"
- #WARNING COCO captions annotations are different from instances and person_keypoints -> #SOLUTION move each caption into the matching "images" entry
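
The caption move described above can be sketched with a dictionary keyed by `image_id`. This is an illustrative alternative, not the notebook's own code: the helper name `attach_captions` and the toy records are made up, and it keeps the first caption per image, which matches the one-caption-per-image assumption used in this notebook.

```python
def attach_captions(images, caption_annotations, missing="missing caption!"):
    """Copy one caption per image onto each image record (first caption wins)."""
    caption_by_image = {}
    for ann in caption_annotations:
        # setdefault keeps the first caption seen for each image_id.
        caption_by_image.setdefault(ann["image_id"], ann["caption"])
    for img in images:
        img["caption"] = caption_by_image.get(img["id"], missing)
    return images

# Toy records in COCO layout (made up for illustration).
images = [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}]
captions = [
    {"image_id": 1, "id": 10, "caption": "a dog"},
    {"image_id": 1, "id": 11, "caption": "a brown dog"},
]
attach_captions(images, captions)
print(images)
```

A dictionary lookup costs roughly constant time per image, whereas `list.index` scans the caption list for every image, which is likely why the train loop takes minutes.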

In [3]:
# merge train

### 1.Captions -> #WARNING move caption annotations to images as new "caption" subkey ->below
newjson = json.load(open(rdir+'COCO/2017/annotations/captions_train2017.json'))
ann_caption=[]
ann_imgid=[]
for an in tqdm(newjson['annotations']):
    ann_caption.append(an['caption'])
    ann_imgid.append(an['image_id'])
### 2.Instances
newjson =  json.load(open(rdir+'COCO/2017/annotations/instances_train2017.json'))
key='images' # add images
root_dir='COCO/2017/images/train2017/' 
for ik, k in enumerate(tqdm(newjson[key], desc='add: {}'.format(key))):
    newjson[key][ik]['file_name'] = root_dir + newjson[key][ik]['file_name'] # change images: file_name
    imgid=newjson[key][ik]['id']
    try:
        imgidx=ann_imgid.index(imgid) #assuming one caption per imgid
        newjson[key][ik]['caption'] = ann_caption[imgidx] # add captions
    except ValueError: # list.index() found no caption for this imgid
        newjson[key][ik]['caption'] = 'missing caption!'         
fulljson = newjson # init fulljson
### 3. Person Keypoints
newjson = json.load(open(rdir+'COCO/2017/annotations/person_keypoints_train2017.json'))
key='annotations' # add annotations
fulljson[key] = fulljson[key] + newjson[key]        
fulljson['categories'][0]=newjson['categories'][0] # update person category based on person_keypoints


# merge val

### 1.Captions -> #WARNING move caption annotations to images as new "caption" subkey ->below
newjson = json.load(open(rdir+'COCO/2017/annotations/captions_val2017.json'))
ann_caption=[]
ann_imgid=[]
for an in tqdm(newjson['annotations']):
    ann_caption.append(an['caption'])
    ann_imgid.append(an['image_id'])
### 2.Instances
newjson = json.load(open(rdir+'COCO/2017/annotations/instances_val2017.json'))
key='images' # add images
root_dir='COCO/2017/images/val2017/' 
for ik, k in enumerate(tqdm(newjson[key], desc='add: {}'.format(key))):
    newjson[key][ik]['file_name'] = root_dir + newjson[key][ik]['file_name'] # change images: file_name
    imgid=newjson[key][ik]['id']
    try:
        imgidx=ann_imgid.index(imgid) #assuming one caption per imgid
        newjson[key][ik]['caption'] = ann_caption[imgidx] # add captions
    except ValueError: # list.index() found no caption for this imgid
        newjson[key][ik]['caption'] = 'missing caption!'    
fulljson[key] = fulljson[key] + newjson[key] 
key='annotations' # add annotations
fulljson[key] = fulljson[key] + newjson[key] 
### 3.Person Keypoints
newjson = json.load(open(rdir+'COCO/2017/annotations/person_keypoints_val2017.json'))
key='annotations' # add annotations
fulljson[key] = fulljson[key] + newjson[key]   

# save
print('\n >> SAVING...')
jsonfile=rdir+'COCO/2017/annotations/fullcoco2017.json'
with open(jsonfile, 'w') as f:
    json.dump(fulljson, f)
print("JSON SAVED : {} \n".format(jsonfile))

print(fulljson.keys())
print(fulljson['info'])
print(len(fulljson['licenses']))
print(len(fulljson['images']))
print(len(fulljson['annotations']))
print(len(fulljson['categories']))
print(fulljson['images'][10000])

100%|██████████████████████████████████████████████████████████████████| 591753/591753 [00:00<00:00, 1586466.52it/s]
add: images: 100%|█████████████████████████████████████████████████████████| 118287/118287 [10:35<00:00, 186.10it/s]
100%|████████████████████████████████████████████████████████████████████| 25014/25014 [00:00<00:00, 1567857.50it/s]
add: images: 100%|████████████████████████████████████████████████████████████| 5000/5000 [00:01<00:00, 4784.48it/s]



 >> SAVING...
JSON SAVED : D:/external_datasets/COCO/2017/annotations/fullcoco2017.json 

dict_keys(['info', 'licenses', 'images', 'annotations', 'categories'])
{'description': 'COCO 2017 Dataset', 'url': 'http://cocodataset.org', 'version': '1.0', 'year': 2017, 'contributor': 'COCO Consortium', 'date_created': '2017/09/01'}
8
123287
1170251
80


### TAO

#### Organize original TAO & save fulltao
- #WARNING TAO dataset has no annotations for some categories -> #SOLVED this was on purpose (see below on the section ANNOTATIONS FORMAT)
- #WARNING TAO dataset has no images for some categories - #SOLVE ??
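
One way to see which categories are affected is to compare the category ids against the ids actually used by the annotations. A minimal sketch, assuming the COCO/TAO layout described at the end of this notebook; `unused_categories` and the toy data are hypothetical.

```python
def unused_categories(dataset):
    """Return the ids of categories that no annotation refers to."""
    used = {ann["category_id"] for ann in dataset["annotations"]}
    return sorted(cat["id"] for cat in dataset["categories"] if cat["id"] not in used)

# Toy dataset (made up for illustration).
toy = {
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 2, "name": "dog"},
        {"id": 3, "name": "cat"},
    ],
    "annotations": [{"image_id": 7, "track_id": 0, "category_id": 1}],
}
print(unused_categories(toy))
```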

In [18]:
TAO_ROOT="TAO/TAO_DIR/"
# merge train
newjson =  json.load(open(rdir+TAO_ROOT+'annotations/train.json'))
key='images' #alter paths to datasets root
for ik, k in enumerate(tqdm(newjson[key], desc='rename file_name: {}'.format(key))):
    root_dir=TAO_ROOT+'frames/' # change images: file_name
    newjson[key][ik]['file_name'] = root_dir + newjson[key][ik]['file_name']
    root_dir=TAO_ROOT+'videos/' # change images: video
    newjson[key][ik]['video'] = root_dir + newjson[key][ik]['video']
key='videos' #alter paths to datasets root
for ik, k in enumerate(tqdm(newjson[key], desc='rename file_name: {}'.format(key))):
    root_dir=TAO_ROOT+'videos/' # change videos: name
    newjson[key][ik]['name'] = root_dir + newjson[key][ik]['name']
fulljson = newjson
# merge val
newjson =  json.load(open(rdir+TAO_ROOT+'annotations/validation.json'))
key='images' #alter paths to datasets root
for ik, k in enumerate(tqdm(newjson[key], desc='rename file_name: {}'.format(key))):
    root_dir=TAO_ROOT+'frames/' # change images: file_name
    newjson[key][ik]['file_name'] = root_dir + newjson[key][ik]['file_name']
    root_dir=TAO_ROOT+'videos/' # change images: video
    newjson[key][ik]['video'] = root_dir + newjson[key][ik]['video']
fulljson[key] = fulljson[key] + newjson[key]
key='videos' #alter paths to datasets root
for ik, k in enumerate(tqdm(newjson[key], desc='rename file_name: {}'.format(key))):
    root_dir=TAO_ROOT+'videos/' # change videos: name
    newjson[key][ik]['name'] = root_dir + newjson[key][ik]['name']
fulljson[key] = fulljson[key] + newjson[key] 
key='tracks'
fulljson[key] = fulljson[key] + newjson[key] 
key='annotations'
fulljson[key] = fulljson[key] + newjson[key] 

# save
print('\n >> SAVING...')
jsonfile=rdir+TAO_ROOT+'annotations/fulltao.json'
with open(jsonfile, 'w') as f:
    json.dump(fulljson, f)
print("JSON SAVED : {} \n".format(jsonfile))

print(fulljson.keys())
print(fulljson['info'])
print(len(fulljson['licenses']))
print(len(fulljson['images']))
print(len(fulljson['annotations']))
print(len(fulljson['categories']))

rename file_name: images: 100%|██████████| 18274/18274 [00:00<00:00, 1065692.16it/s]
rename file_name: videos: 100%|██████████| 500/500 [00:00<00:00, 1088863.97it/s]
rename file_name: images: 100%|██████████| 36375/36375 [00:00<00:00, 977603.97it/s]
rename file_name: videos: 100%|██████████| 988/988 [00:00<00:00, 1013196.17it/s]



 >> SAVING...
JSON SAVED : /home/administrator/Z/Datasets/External Datasets/TAO/TAO_DIR/annotations/fulltao.json 

dict_keys(['videos', 'annotations', 'tracks', 'images', 'info', 'categories', 'licenses'])
{'year': 2020, 'version': '0.1.20200120', 'description': 'Annotations imported from Scale', 'contributor': '', 'url': '', 'date_created': '2020-01-20 15:49:53.519740'}
1
54649
167751
1230
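
The renaming loops in the TAO cell all repeat one pattern: prepend a root directory to one field of every record. A small helper could express this once. The sketch below is illustrative only; `prefix_field` is a hypothetical name and the toy paths are made up.

```python
def prefix_field(records, field, prefix):
    """Prepend `prefix` to record[field] for every record, in place."""
    for record in records:
        record[field] = prefix + record[field]
    return records

TAO_ROOT = "TAO/TAO_DIR/"
# Toy records (made up for illustration).
images = [{"id": 1, "file_name": "train/x/frame0001.jpg", "video": "train/x"}]
videos = [{"id": 1, "name": "train/x"}]

prefix_field(images, "file_name", TAO_ROOT + "frames/")
prefix_field(images, "video", TAO_ROOT + "videos/")
prefix_field(videos, "name", TAO_ROOT + "videos/")
print(images[0]["file_name"])
```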


## 3. MERGE datasets
- #WARNING merge is slow -> #TODO #SOLUTION use same approach as fixclasses and mixclasses
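
How annotate_v5 performs the merge is not shown in this notebook. One common way to keep ids unique in a single pass is to shift the incoming ids past the largest id already present. The sketch below illustrates that idea under stated assumptions: `merge_into` and the `"dataset"` field are hypothetical, ids are assumed to be non-negative integers, and only images and annotations are handled.

```python
def merge_into(base, new, dataset_id):
    """Append images and annotations from `new` to `base` with shifted ids."""
    # Offsets start just past the largest id already in `base`.
    img_offset = max((img["id"] for img in base["images"]), default=-1) + 1
    ann_offset = max((ann["id"] for ann in base["annotations"]), default=-1) + 1
    for img in new["images"]:
        base["images"].append({**img, "id": img["id"] + img_offset, "dataset": dataset_id})
    for ann in new["annotations"]:
        base["annotations"].append({
            **ann,
            "id": ann["id"] + ann_offset,
            "image_id": ann["image_id"] + img_offset,  # keep pointing at the same image
            "dataset": dataset_id,
        })
    return base

# Toy datasets with clashing ids (made up for illustration).
base = {"images": [{"id": 1}], "annotations": [{"id": 1, "image_id": 1}]}
new = {"images": [{"id": 1}], "annotations": [{"id": 1, "image_id": 1}]}
merge_into(base, new, dataset_id=2)
print([img["id"] for img in base["images"]])
```

Categories, videos and tracks would need the same treatment, including remapping `category_id`, `video_id` and `track_id` wherever they are referenced.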

In [None]:
molafile=rdir+'mola.json'
mergecoco=rdir+'COCO/2017/annotations/fullcoco2017.json'
mergetao=rdir+TAO_ROOT+'annotations/fulltao.json'

In [None]:
#WARNING if you get a memory error, go to a terminal ipython shell and paste these commands
!python annotate_v5.py --molafile $molafile --mergefile $mergecoco --dataset_id 1
!python annotate_v5.py --molafile $molafile --mergefile $mergetao --dataset_id 2 

## 4. TEST MERGED JSON ANNOTATIONS DUPLICATES

In [2]:
molajson = json.load(open(rdir+'MOLA/annotations/mola.json'))

In [3]:
for k in molajson:
    print(k, len(molajson[k]))

info 5
licenses 9
categories 1310
videos 1488
images 177936
tracks 8132
segment_info 0
annotations 1338002
datasets 2


In [4]:
# collect annotation ids
ann_ids=[]
for an in tqdm(molajson['annotations']):
    ann_ids.append(an['id'])
print(len(ann_ids))

#TEST duplicates v3 -faster
u, c = np.unique(np.array(ann_ids), return_counts=True)
duplicates_l= u[c > 1].tolist()
print(len(duplicates_l))

100%|████████████████████████████████████████████████████████████████| 1338002/1338002 [00:00<00:00, 1387926.68it/s]


1338002
0
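
To see what the duplicate test reports when duplicates do exist, the same `np.unique` call can be run on a short list. The ids below are made up for illustration; the `collections.Counter` variant is an equivalent pure-Python alternative.

```python
from collections import Counter

import numpy as np

ann_ids = [1, 2, 2, 3, 5, 5, 5]  # toy ids, made up for illustration

# Same approach as the cell above: unique values with their counts.
unique_ids, counts = np.unique(np.array(ann_ids), return_counts=True)
duplicates = unique_ids[counts > 1].tolist()

# Pure-Python alternative that gives the same ids.
duplicates_counter = sorted(i for i, n in Counter(ann_ids).items() if n > 1)

print(duplicates)
print(duplicates_counter)
```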


# ANNOTATIONS FORMAT

## MOLA
Format in annotate_v5.init_json()

In [None]:
output = {
        "info": None,
        "licenses": [],
        "categories": [],
        "videos": [],
        "images": [],
        "tracks": [],
        "segment_info": [],
        "annotations": [],
        "datasets": [{'name': 'COCO', 'id': 1}, {'name': 'TAO', 'id': 2}]
    }
    
output['info'] = {
        "description": "Mixed Dataset",
        "url": "",
        "version": "1",
        "year": 2020,
        "date_created": datetime.datetime.utcnow().isoformat(' ')
    }
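
A loaded file can be checked against the top-level keys listed above. This is a minimal sketch rather than a function from annotate_v5: `MOLA_KEYS` and `missing_keys` are hypothetical names, and the key list is copied from the format above.

```python
# Top-level keys copied from the MOLA format above.
MOLA_KEYS = [
    "info", "licenses", "categories", "videos", "images",
    "tracks", "segment_info", "annotations", "datasets",
]

def missing_keys(data, required=MOLA_KEYS):
    """Return the required top-level keys that `data` does not contain."""
    return [key for key in required if key not in data]

# A COCO-style file lacks the video- and dataset-related keys.
coco_like = {"info": {}, "licenses": [], "images": [], "annotations": [], "categories": []}
print(missing_keys(coco_like))
```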

## COCO

Annotation file format: https://cocodataset.org/#format-data ; https://www.immersivelimit.com/tutorials/create-coco-annotations-from-scratch

In [None]:
{
    "info": {info},
    "licenses": [license],
    "images": [image],
    "annotations": [annotation],
    "categories": [category], <-- Not in Captions annotations
    "segment_info": [segment] <-- Only in Panoptic annotations
}

In [None]:
info{
    "year": int, 
    "version": str, 
    "description": str, 
    "contributor": str, 
    "url": str, 
    "date_created": datetime,
}
license{
    "id": int, 
    "name": str, 
    "url": str,
}
image{
    "id": int, 
    "width": int, 
    "height": int, 
    "file_name": str, 
    "license": int, 
    "flickr_url": str, 
    "coco_url": str, 
    "date_captured": datetime,
}
annotation{
    "id": int, 
    "image_id": int, 
    "category_id": int, 
    "segmentation": RLE or [polygon], 
    "area": float, 
    "bbox": [x,y,width,height], 
    "iscrowd": 0 or 1,
}

category{
    "id": int, 
    "name": str, 
    "supercategory": str,
}
segment{
    "id": int, 
    "category_id": int, 
    "area": int, 
    "bbox": [x,y,width,height], 
    "iscrowd": 0 or 1,
}
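
As a concrete instance of the annotation record above, the following sketch builds one rectangular annotation and converts its `bbox` from `[x,y,width,height]` to corner coordinates. The values are made up for illustration and `xywh_to_xyxy` is a hypothetical helper name.

```python
def xywh_to_xyxy(bbox):
    """Convert a COCO [x, y, width, height] box to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

# One annotation for a 50 x 80 rectangle (values made up for illustration).
annotation = {
    "id": 1,
    "image_id": 42,
    "category_id": 1,
    "segmentation": [[10, 20, 60, 20, 60, 100, 10, 100]],  # polygon as x, y pairs
    "area": 4000.0,
    "bbox": [10, 20, 50, 80],
    "iscrowd": 0,
}
print(xywh_to_xyxy(annotation["bbox"]))
```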


## TAO

Annotation file format: https://github.com/TAO-Dataset/tao/blob/master/tao/toolkit/tao/tao.py


#NOTE: https://github.com/TAO-Dataset/tao/blob/master/docs/faqs.md . Why does the training set only contain 216 LVIS categories?
- TAO contains a total of 482 LVIS categories. However, not all categories are present in the train, val, and test sets. Instead, we encourage researchers to train detectors on the LVIS v0.5 dataset, which contains a superset of the 482 categories, and trackers on existing single-object tracking datasets. TAO is primarily a benchmark dataset, but we provide a small set of training videos for tuning trackers.

In [None]:
{
    "info" : info,
    "images" : [image],
    "videos": [video],
    "tracks": [track],
    "annotations" : [annotation],
    "categories": [category],
    "licenses" : [license],
}

In [None]:
info: "like MS COCO"

license: {
    "id" : int,
    "name" : str,
    "url" : str,
}
category: {
    "id": int,
    "name": str,
    "synset": str,  # For non-LVIS objects, this is "unknown"
    ... [other fields copied from LVIS v0.5 and unused]
}

video: {
    "id": int,
    "name": str,
    "width" : int,
    "height" : int,
    "neg_category_ids": [int],
    "not_exhaustive_category_ids": [int],
    "metadata": dict,  # Metadata about the video
}
image: {
    "id" : int,
    "video_id": int,
    "file_name" : str,
    "license" : int,
    # Redundant fields for COCO-compatibility
    "width": int,
    "height": int,
    "frame_index": int
}    
track: {
    "id": int,
    "category_id": int,
    "video_id": int
}
annotation: {
    "image_id": int,
    "track_id": int,
    "bbox": [x,y,width,height],
    "area": float,
    # Redundant field for compatibility with COCO scripts
    "category_id": int
}
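
Because every TAO annotation carries a `track_id`, the boxes of one tracked object can be collected by grouping on that field. A minimal sketch with made-up records; `annotations_by_track` is a hypothetical helper, not part of the TAO toolkit.

```python
from collections import defaultdict

def annotations_by_track(annotations):
    """Group annotation records by their track_id."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann["track_id"]].append(ann)
    return dict(grouped)

# Toy annotations in the TAO layout above (made up for illustration).
toy_annotations = [
    {"image_id": 1, "track_id": 0, "bbox": [0, 0, 10, 10], "area": 100.0, "category_id": 5},
    {"image_id": 2, "track_id": 0, "bbox": [2, 0, 10, 10], "area": 100.0, "category_id": 5},
    {"image_id": 1, "track_id": 1, "bbox": [50, 50, 4, 4], "area": 16.0, "category_id": 9},
]
by_track = annotations_by_track(toy_annotations)
print({track_id: len(anns) for track_id, anns in by_track.items()})
```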
