# Conversion of COCO annotation JSON file to TFRecords

The COCO annotation file and its image dataset are converted to the TFRecord format, which TensorFlow input pipelines can read efficiently during training.

Given a COCO-annotated JSON file, the goal is to convert it into the TFRecord files needed to train a Mask R-CNN model.
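
Before converting, it can help to sanity-check the annotation file. The sketch below assumes the standard COCO layout (top-level `images`, `annotations`, and `categories` lists, with each annotation carrying `image_id` and `category_id`). The helper names `summarize_coco` and `summarize_coco_file` are hypothetical and are not part of the Model Garden.

```python
import json
from collections import Counter


def summarize_coco(coco: dict) -> dict:
    """Return basic counts for a COCO-style annotation dictionary."""
    # The three top-level lists used by the standard COCO layout (assumed here).
    for key in ("images", "annotations", "categories"):
        if key not in coco:
            raise KeyError(f"missing top-level key: {key}")

    image_ids = {img["id"] for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}
    per_image = Counter(ann["image_id"] for ann in coco["annotations"])

    return {
        "num_images": len(image_ids),
        "num_annotations": len(coco["annotations"]),
        "num_categories": len(category_ids),
        # Images that no annotation refers to.
        "images_without_annotations": len(image_ids - set(per_image)),
        # Annotations that point at an unknown image or category.
        "orphan_annotations": sum(
            1
            for ann in coco["annotations"]
            if ann["image_id"] not in image_ids
            or ann["category_id"] not in category_ids
        ),
    }


def summarize_coco_file(path: str) -> dict:
    """Load a COCO JSON file from disk and summarize it."""
    with open(path, "r", encoding="utf-8") as f:
        return summarize_coco(json.load(f))
```

A non-zero `orphan_annotations` count usually points to a mismatch between the image list and the annotation list, which is worth resolving before running the conversion.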

To accomplish this task, you will clone the TensorFlow Model Garden repo. The TensorFlow Model Garden is a repository with a number of different implementations of state-of-the-art (SOTA) models and modeling solutions for TensorFlow users.

This notebook is an end-to-end example. When you run it, it takes COCO-annotated JSON train and validation files as input and converts them into TFRecord files. You can also write sharded TFRecord files when your training and validation data are large; splitting the data across several files makes it easier for the input pipeline to read them in parallel.
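
To make the sharding concrete, the sketch below lists the file names one would expect for a given prefix and shard count, and shows how examples spread across shards. The `-%05d-of-%05d.tfrecord` suffix and the round-robin assignment are assumptions about how the Model Garden writer behaves; check the files actually produced in your output folder.

```python
from typing import List


def shard_paths(prefix: str, num_shards: int) -> List[str]:
    """Expected shard file names (the suffix pattern is an assumption)."""
    return [
        f"{prefix}-{i:05d}-of-{num_shards:05d}.tfrecord"
        for i in range(num_shards)
    ]


def examples_per_shard(num_examples: int, num_shards: int) -> List[int]:
    """Number of examples each shard receives under round-robin assignment."""
    counts = [0] * num_shards
    for idx in range(num_examples):
        counts[idx % num_shards] += 1
    return counts


# Example: three shards for a prefix named "train".
print(shard_paths("train", 3))
print(examples_per_shard(10, 3))
```

Under this assumption, a prefix that ends in `/` (a bare folder path) would produce file names that start with `-00000-of-...` inside that folder; appending a base name such as `train` to the prefix would give more readable names.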

**Note** - This example assumes that all data is stored on Google Drive, that outputs are written back to Google Drive, and that the script runs as a Google Colab notebook. None of this is required: if you work on a local workstation, a remote server, or another storage system, adjust the paths accordingly. The Colab notebook can also be run as a regular Jupyter notebook on a local machine; in this run the Drive cells are commented out and local relative paths are used.
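
One way to keep the same notebook working in both environments is to pick the data root at run time. This is a minimal sketch: the `/mydrive` alias and the `../..` local root mirror the paths used in this notebook, while the helper names and the `"google.colab" in sys.modules` check are assumptions (a common heuristic, not an official API).

```python
import sys
from pathlib import PurePosixPath


def running_in_colab() -> bool:
    """Heuristic: Colab kernels have the google.colab package loaded."""
    return "google.colab" in sys.modules


def data_root(local_root: str = "../..", colab_root: str = "/mydrive") -> str:
    """Pick the Drive alias on Colab and a relative path elsewhere."""
    return colab_root if running_in_colab() else local_root


def dataset_paths(split: str, root: str) -> dict:
    """Build image, annotation, and output paths for 'train' or 'val'."""
    base = PurePosixPath(root)
    dataset = base / "rc40cocodataset_splitted" / split
    return {
        "images": str(dataset / "images") + "/",
        "annotations": str(dataset / f"coco_{split}.json"),
        "output": str(base / "rc40tfrecords" / split) + "/",
    }


paths = dataset_paths("train", data_root())
print(paths["annotations"])
```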

## Install the dependencies and connect to your Google Drive

In [1]:
# !pip install -q tf-nightly
!pip install tensorflow
!pip install tf-keras
!pip install gin-config
!pip install -q tensorflow-addons

ERROR: Could not find a version that satisfies the requirement tensorflow-addons (from versions: none)
ERROR: No matching distribution found for tensorflow-addons

In [2]:
!pip install -q pycocotools

In [3]:
import sys

import tensorflow as tf
from tensorflow import keras

2025-04-30 18:26:52.612414: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [4]:
# # import libraries
# from google.colab import drive
# 

In [5]:
# # connect to google drive
# drive.mount('/content/gdrive')

# # making an alias for the root path
# try:
#   !ln -s /content/gdrive/My\ Drive/ /mydrive
#   print('Successful')
# except Exception as e:
#   print(e)
#   print('Not successful')

In [4]:
# "opencv-python-headless" must match the installed "opencv-python" version
from importlib.metadata import version

version_number = version("opencv-python")

!pip install -q opencv-python-headless==$version_number

## Clone TensorFlow Model Garden repository

In [5]:
# Clone the TensorFlow Model Garden repository, which holds the config files and scripts used in this project.
# The project folder name is 'waste_identification_ml'.
!git clone https://github.com/tensorflow/models.git

fatal: destination path 'models' already exists and is not an empty directory.
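
The `fatal: destination path 'models' already exists` message appears when the clone cell is re-run after a successful clone. A small guard avoids it. This sketch assumes `git` is available on the PATH; the helper name `clone_if_missing` is hypothetical.

```python
import subprocess
from pathlib import Path


def clone_if_missing(repo_url: str, dest: str = "models") -> bool:
    """Clone repo_url into dest unless dest already holds a checkout.

    Returns True if a clone was performed, False if it was skipped.
    """
    dest_path = Path(dest)
    if (dest_path / ".git").exists():
        # A previous run already cloned the repository; nothing to do.
        return False
    subprocess.run(["git", "clone", repo_url, str(dest_path)], check=True)
    return True


# Usage (performs a network clone only when 'models' is missing):
# clone_if_missing("https://github.com/tensorflow/models.git")
```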


In [6]:
# Go to the model folder
%cd models

/Users/oysterable/delete/recyclables-detector/pre_processing/models



## Create TFRecord for training data

In [7]:
# training_images_folder = '/mydrive/rc40cocodataset_splitted/train/images/'  #@param {type:"string"}
# training_annotation_file = '/mydrive/rc40cocodataset_splitted/train/coco_train.json'  #@param {type:"string"}
# output_folder = '/mydrive/rc40tfrecords/train/'  #@param {type:"string"}

training_images_folder = '../../rc40cocodataset_splitted/train/images/'  #@param {type:"string"}
training_annotation_file = '../../rc40cocodataset_splitted/train/coco_train.json'  #@param {type:"string"}
output_folder = '../../rc40tfrecords/train/'  #@param {type:"string"}

In [8]:
# Use the TensorFlow Model Garden script to convert the COCO annotations JSON file to TFRecord files.

# --num_shards: number of sharded TFRecord files to write
!python3 -m official.vision.data.create_coco_tf_record \
      --logtostderr \
      --image_dir=$training_images_folder \
      --object_annotations_file=$training_annotation_file \
      --output_file_prefix=$output_folder \
      --num_shards=100 \
      --include_masks=True \
      --num_processes=0

2025-04-30 18:28:53.339949: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
I0430 18:29:00.090511 140704509742016 create_coco_tf_record.py:502] writing to output path: ../../rc40tfrecords/train/
I0430 18:29:00.098459 140704509742016 create_coco_tf_record.py:374] Building bounding box index.
I0430 18:29:00.098671 140704509742016 create_coco_tf_record.py:385] 0 images are missing bboxes.
I0430 18:29:00.158688 140704509742016 tfrecord_lib.py:168] On image 0
I0430 18:29:03.573285 140704509742016 tfrecord_lib.py:168] On image 100
I0430 18:29:06.901711 140704509742016 tfrecord_lib.py:168] On image 200
I0430 18:29:10.281229 140704509742016 tfrecord_lib.py:168] On image 300
I0430 18:29:12.334167 140704509742016 tfrecord_lib.py:180] Finished writing, skipped 0 an

## Create TFRecord for validation data

In [9]:
# validation_images_folder = '/mydrive/rc40cocodataset_splitted/val/images/'  #@param {type:"string"}
# validation_annotation_file = '/mydrive/rc40cocodataset_splitted/val/coco_val.json'  #@param {type:"string"}
# output_folder = '/mydrive/rc40tfrecords/val/'  #@param {type:"string"}

validation_images_folder = '../../rc40cocodataset_splitted/val/images/'  #@param {type:"string"}
validation_annotation_file = '../../rc40cocodataset_splitted/val/coco_val.json'  #@param {type:"string"}
output_folder = '../../rc40tfrecords/val/'  #@param {type:"string"}


In [10]:
# Run the script to convert the validation JSON file to TFRecord files.
# --num_shards: number of sharded TFRecord files to write
!python3 -m official.vision.data.create_coco_tf_record --logtostderr \
      --image_dir=$validation_images_folder \
      --object_annotations_file=$validation_annotation_file \
      --output_file_prefix=$output_folder \
      --num_shards=100 \
      --include_masks=True \
      --num_processes=0

2025-04-30 18:29:27.998611: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
I0430 18:29:33.701936 140704509742016 create_coco_tf_record.py:502] writing to output path: ../../rc40tfrecords/val/
I0430 18:29:33.704122 140704509742016 create_coco_tf_record.py:374] Building bounding box index.
I0430 18:29:33.704246 140704509742016 create_coco_tf_record.py:385] 0 images are missing bboxes.
I0430 18:29:33.762863 140704509742016 tfrecord_lib.py:168] On image 0
I0430 18:29:35.235285 140704509742016 tfrecord_lib.py:180] Finished writing, skipped 0 annotations.
I0430 18:29:35.235805 140704509742016 create_coco_tf_record.py:537] Finished writing, skipped 0 annotations.
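
To double-check the result without loading TensorFlow, the number of records in each shard can be counted by walking the TFRecord framing directly. The sketch assumes the standard TFRecord layout (an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload) and does not verify the CRCs. `count_records` and `count_all` are hypothetical helpers, and the glob pattern in the usage comment is an assumption about the output file names.

```python
import glob
import struct


def count_records(path: str) -> int:
    """Count the records in one TFRecord file by reading its framing."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # 8-byte length + 4-byte CRC of the length
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError(f"truncated record header in {path}")
            (length,) = struct.unpack("<Q", header[:8])
            body = f.read(length + 4)  # payload + 4-byte CRC of the payload
            if len(body) < length + 4:
                raise ValueError(f"truncated record body in {path}")
            count += 1
    return count


def count_all(pattern: str) -> int:
    """Sum the record counts over every file matching a glob pattern."""
    return sum(count_records(p) for p in sorted(glob.glob(pattern)))


# Usage (assumed file pattern; adjust to the names in your output folder):
# print(count_all("../../rc40tfrecords/train/*.tfrecord"))
```

The total should match the number of images listed in the corresponding COCO JSON file, minus any images the converter skipped.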
