In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'  # default is 'last_expr'

%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append('/Users/siyuyang/Source/repos/GitHub_MSFT/CameraTraps')  # add this repo to sys.path so its modules can be imported

In [3]:
import json
import os
from collections import Counter, defaultdict
from random import sample
import math

from tqdm import tqdm
from unidecode import unidecode 

from data_management.megadb.schema import sequences_schema_check
from data_management.annotations.add_bounding_boxes_to_megadb import *
from data_management.megadb.converters.cct_to_megadb import make_cct_embedded, process_sequences, write_json

# Import Sul Ross 2018 sets
The class labels in this initial drop were extracted from EXIF fields.

Give the path to a JSON file where the output of this script will be written. You can then take this file to the .NET app for ingestion into the database.

In [4]:
path_to_output = '/Users/siyuyang/OneDrive - Microsoft/AI4Earth/CameraTrap/Databases/megadb_2020/sulross_2018_megadb.json'  

**Name of the dataset**

In [5]:
dataset_name = 'sulross_2018'

## Step 0 - Add an entry to the `datasets` table

Added.

## Step 1 - Prepare the `sequence` objects to insert into the database

### Step 1a - If you have metadata in COCO Camera Traps (CCT) format already...

For a dataset, you probably have one or two JSONs in the CCT format, one containing image-level species labels and another containing bounding box annotations. Here we combine them and embed any annotation items into the image items.
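
As a rough illustration of what "embedding" means here, the sketch below attaches each CCT annotation to the image it refers to. It is not the implementation of `make_cct_embedded`; the function name `embed_annotations` and the toy data are hypothetical, and it assumes the usual CCT layout with `images`, `annotations` and `categories` lists.

```python
from collections import defaultdict


def embed_annotations(cct):
    """Hypothetical helper: return image items with their annotations embedded."""
    category_names = {c['id']: c['name'] for c in cct['categories']}

    # group the annotation items by the image they point to
    anns_by_image = defaultdict(list)
    for ann in cct['annotations']:
        anns_by_image[ann['image_id']].append(ann)

    embedded = []
    for im in cct['images']:
        item = dict(im)  # shallow copy so the input is left untouched
        anns = anns_by_image.get(im['id'], [])
        # species labels become a list under 'class'
        item['class'] = sorted({category_names[a['category_id']] for a in anns})
        # bounding boxes are embedded only when the annotations carry them
        boxes = [a['bbox'] for a in anns if 'bbox' in a]
        if boxes:
            item['bbox'] = boxes
        embedded.append(item)
    return embedded


# toy CCT database (made-up values, for illustration only)
toy_cct = {
    'categories': [{'id': 0, 'name': 'empty'}, {'id': 1, 'name': 'deer'}],
    'images': [{'id': 'im1', 'file_name': 'im1.JPG'},
               {'id': 'im2', 'file_name': 'im2.JPG'}],
    'annotations': [
        {'id': 'a1', 'image_id': 'im1', 'category_id': 1, 'bbox': [10, 20, 30, 40]},
        {'id': 'a2', 'image_id': 'im2', 'category_id': 0},
    ],
}
toy_embedded = embed_annotations(toy_cct)
```

After this step each image item is self-contained, which is what makes it possible to group images into `sequence` objects later.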

In [6]:
# path to the CCT json, or a loaded json object
path_to_image_cct = '/Users/siyuyang/Source/temp_data/CameraTrap/engagements/SulRoss/20190522/Database/sulross_20190530.json'  # set to None if not available
path_to_bbox_cct = None  # set to None if not available
assert not (path_to_image_cct is None and path_to_bbox_cct is None)

In [7]:
embedded = make_cct_embedded(image_db=path_to_image_cct, bbox_db=path_to_bbox_cct)

Loading image DB...
Number of items from the image DB: 493627
Number of images with more than 1 species: 0 (0.0% of image DB)
No bbox DB provided.


In the following step, properties will be moved to the highest level that is still correct, i.e. if a property at the image level always has the same value for all images in a sequence, it will be moved to be a sequence-level property.

If a sequence-level property has the same value throughout this dataset (often 'rights holder'), it will be removed from the `sequence` objects. A message about this will be printed, and you should add that property and its (constant) value to this dataset's entry in the `datasets` table.
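
The promotion rule described above can be sketched as follows. This is a simplified illustration, not the code inside `process_sequences`: the helper names `find_sequence_level_properties` and `promote_properties` are hypothetical, the toy data is made up, and the sketch assumes every image in a sequence carries the same set of properties.

```python
def find_sequence_level_properties(sequences):
    """Hypothetical helper: image properties whose value never varies within any sequence."""
    all_props = set()
    varying = set()
    for seq in sequences:
        first_seen = {}
        for im in seq['images']:
            for prop, value in im.items():
                all_props.add(prop)
                if prop in first_seen and first_seen[prop] != value:
                    varying.add(prop)
                first_seen.setdefault(prop, value)
    return all_props - varying


def promote_properties(sequences, props):
    """Hypothetical helper: move the given image-level properties up to the sequence."""
    promoted = []
    for seq in sequences:
        new_seq = {k: v for k, v in seq.items() if k != 'images'}
        for prop in props:
            if prop in seq['images'][0]:
                new_seq[prop] = seq['images'][0][prop]
        new_seq['images'] = [
            {k: v for k, v in im.items() if k not in props} for im in seq['images']
        ]
        promoted.append(new_seq)
    return promoted


# toy sequences (made-up values, for illustration only)
toy_sequences = [
    {'seq_id': 'a', 'images': [
        {'id': 'a1', 'frame_num': 1, 'location': 'S1', 'class': ['empty']},
        {'id': 'a2', 'frame_num': 2, 'location': 'S1', 'class': ['deer']},
    ]},
    {'seq_id': 'b', 'images': [
        {'id': 'b1', 'frame_num': 1, 'location': 'D9', 'class': ['empty']},
    ]},
]
toy_seq_level = find_sequence_level_properties(toy_sequences)
toy_promoted = promote_properties(toy_sequences, toy_seq_level)
```

In the toy data only `location` is constant within every sequence, so it is the only property that moves up; `class` stays at the image level because it varies inside sequence `a`.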

In [8]:
sequences = process_sequences(embedded, dataset_name)

The dataset_name is set to sulross_2018. Please make sure this is correct!
Making a deep copy of docs...


  7%|▋         | 33442/493627 [00:00<00:01, 334339.81it/s]

Putting 493627 images into sequences...


100%|██████████| 493627/493627 [00:01<00:00, 324137.03it/s]


Number of sequences: 172769
Checking the location field...
Checking which fields in a CCT image entry are sequence-level...

all_img_properties
{'id', 'location', 'datetime', 'class', 'file', 'frame_num'}

img_level_properties
{'id', 'datetime', 'class', 'file', 'frame_num'}

image-level properties that really should be sequence-level
{'location'}

Finished processing sequences.
Example sequence items:

{'seq_id': 'Summer2018/S1/Summer2018__S1__2018-06-26__15-51', 'dataset': 'sulross_2018', 'images': [{'id': 'Summer2018/S1/Summer2018__S1__2018-06-26__15-51-19(1)', 'frame_num': 1, 'datetime': '2018-06-26 15:51:19', 'file': 'Summer2018/S1/Summer2018__S1__2018-06-26__15-51-19(1).JPG', 'class': ['empty']}, {'id': 'Summer2018/S1/Summer2018__S1__2018-06-26__15-51-36(2)', 'frame_num': 2, 'datetime': '2018-06-26 15:51:36', 'file': 'Summer2018/S1/Summer2018__S1__2018-06-26__15-51-36(2).JPG', 'class': ['empty']}, {'id': 'Summer2018/S1/Summer2018__S1__2018-06-26__15-51-56(3)', 'frame_num': 3, 'da

In [None]:
# sample some sequences to make sure they are what you expect
sample(sequences, 10)

### Some sequence frame numbers are not unique

In [13]:
problem_seqs = []

for seq in sequences:
    frame_nums = []
    for image in seq['images']:
        frame_nums.append(image['frame_num'])
    if len(set(frame_nums)) != len(seq['images']):
        problem_seqs.append(seq['seq_id'])

len(problem_seqs)

1

In [14]:
problem_seqs

['Summer2018/D9/Summer2018__D9__2018-10-08__17-48']

This is a domestic cattle sequence that has two images both labeled as frame 1. We drop this sequence.
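
To see which frame numbers repeat within a flagged sequence, a `Counter` over the `frame_num` values is enough. The helper name `duplicated_frame_nums` and the toy sequence below are hypothetical and only illustrate the check.

```python
from collections import Counter


def duplicated_frame_nums(seq):
    """Hypothetical helper: map each repeated frame number to how often it occurs."""
    counts = Counter(im['frame_num'] for im in seq['images'])
    return {num: count for num, count in counts.items() if count > 1}


# toy sequence with frame 1 appearing twice (made-up data)
toy_seq = {'seq_id': 'toy', 'images': [{'frame_num': 1}, {'frame_num': 1}, {'frame_num': 2}]}
toy_duplicates = duplicated_frame_nums(toy_seq)
```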

In [16]:
# keep every sequence except the one with duplicate frame numbers
good_seqs = [
    seq for seq in sequences
    if seq['seq_id'] != 'Summer2018/D9/Summer2018__D9__2018-10-08__17-48'
]
len(good_seqs)

172768

## Step 2 - Pass the schema check

In [17]:
sequences_schema_check.sequences_schema_check(good_seqs)

Verified that the sequence items meet requirements not captured by the schema.
Verified that the sequence items conform to the schema.


## Step 3 - Add any iMerit bbox annotations

This dataset has classification labels only, so there are no bounding box annotations to add.

## Step 4 - Save the `sequence` items to a file

You can now take the resulting JSON file to the .NET application for bulk insertion into the database:

In [20]:
with open(path_to_output, 'w') as f:
    json.dump(good_seqs, f)

To check that the bounding box annotations and image paths all survived, run `visualization/visualize_megadb.py` on the file exported above.
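
A lighter sanity check, before running the visualization, is to reload the exported file and count what is in it. The sketch below is an assumption about how one might do that; the helper `summarize_export` is hypothetical and is demonstrated on a toy file written to a temporary directory.

```python
import json
import os
import tempfile


def summarize_export(path):
    """Hypothetical helper: reload an exported file and count its sequences and images."""
    with open(path) as f:
        exported = json.load(f)
    num_images = sum(len(seq['images']) for seq in exported)
    return len(exported), num_images


# toy round trip (made-up data, for illustration only)
toy_seqs = [
    {'seq_id': 'a', 'dataset': 'toy', 'images': [{'frame_num': 1}, {'frame_num': 2}]},
    {'seq_id': 'b', 'dataset': 'toy', 'images': [{'frame_num': 1}]},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_megadb.json')
    with open(toy_path, 'w') as f:
        json.dump(toy_seqs, f)
    toy_summary = summarize_export(toy_path)
```

On the real export, `summarize_export(path_to_output)` should report 172768 sequences if the file matches `good_seqs`.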

Done