# Reading JSON metadata

#### This notebook will show how to use DataChain to parse JSON-based metadata. 

We will use public datasets as examples, including [Microsoft COCO](https://cocodataset.org/#home) and [Google OpenImages](https://storage.googleapis.com/openimages/web/index.html).
Here are the topics covered:

- [Simple JSON schemas](#simple)
- [JSON lines format](#jsonl)
- [Nested JSON schemas](#nested)
- [Parsing JSON files](#parsing)
    - [Dealing with validation errors](#errors)
- [Static schema definitions](#static)
- [Merging multi-layer annotations](#merging)
    - [COCO mini-tutorial](#coco)
  

## Preface

JSON is a popular format for capturing metadata. Many public and private datasets are annotated in JSON files.
For computer vision and multimodal applications, the metadata is typically stored alongside samples (e.g. images or audio files).
In natural language datasets, text samples (snippets) are often part of the JSON file itself.

The common theme here is that every sample corresponds to an entry in JSON annotations.
These annotations, in turn, can be organized in different ways:

- Every sample is stored alongside a matching JSON file ("json-pairs" format).
- Every sample corresponds to a line in a shared "JSON lines" file (.jsonl format).
- Every sample corresponds to an array member inside a common JSON file.
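
To make the three layouts concrete, here is a minimal sketch that uses only Python's standard `json` module. The records, file names and the `annotations` key are made up for illustration, and nothing is read from storage. It shows that each layout carries the same per-sample records:

```python
import json

# Hypothetical in-memory stand-ins for the three layouts (no real files are read).
records = [
    {"id": "1009", "class": "cat"},
    {"id": "1010", "class": "dog"},
]

# 1. "json-pairs": one JSON document per sample, keyed here by a made-up file name.
json_pairs = {f"{r['class']}.{r['id']}.json": json.dumps(r) for r in records}

# 2. JSON lines: one JSON object per line of a shared file.
jsonl_text = "\n".join(json.dumps(r) for r in records)

# 3. Array member: all samples live in an array inside one JSON document.
array_text = json.dumps({"info": {"version": "1.0"}, "annotations": records})

from_pairs = [json.loads(text) for text in json_pairs.values()]
from_jsonl = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
from_array = json.loads(array_text)["annotations"]

print(from_pairs == from_jsonl == from_array)
```

Only the third layout leaves room for a shared block (here `info`) that describes the whole dataset.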

JSON files can be large and hard to read on a computer screen. To help understand metadata models, DataChain provides the functions `print_json_schema()` and `print_jsonl_schema()`, which read the JSON layout and print it in Pydantic format. Once the data model is approved, data loading and validation are handled by the functions `from_json()` and `from_jsonl()`.

💡 DataChain supports lazy execution. No data is parsed until the results are requested by downstream chains. This means, for example, that validation errors (if any) will not occur immediately but may be triggered by a downstream `exec()`, `count()`, `collect()` or similar action. When many operations are chained together, it is common to intersperse them with `save` operations to cache the intermediate results.
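
The effect of lazy execution on error timing can be illustrated with a plain Python generator. This is only an analogy, not DataChain's implementation; the function `lazy_parse` and the sample lines are hypothetical:

```python
import json

def lazy_parse(lines):
    """Yield parsed records one at a time; nothing runs until iteration starts."""
    for line in lines:
        yield json.loads(line)  # a malformed line raises here, not when the chain is built

lines = ['{"id": 1}', '{"id": 2', '{"id": 3}']  # the second line is deliberately broken

chain = lazy_parse(lines)   # building the "chain": no parsing and no error yet
first = next(chain)         # parsing starts only when results are requested

try:
    next(chain)             # the broken line is reached only now
    error_seen = False
except json.JSONDecodeError:
    error_seen = True

print(first, error_seen)
```

The error surfaces at the point of consumption, which is why a failure may be reported far from the call that introduced the bad data.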

All operations in this tutorial depend on the imports below.
We will occasionally repeat them throughout examples so they can be launched independently.


In [1]:
from datachain.lib.dc import Column, DataChain

<a id='simple'></a>
## Simple JSON schemas

Plain JSON schemas assign all the information in a root-level JSON object to a single sample.
For instance, the "json-pairs" convention couples samples with identically named JSON files:


In [2]:
!datachain ls gs://datachain-demo/dogs-and-cats/*1009*

cat.1009.jpg
cat.1009.json
dog.1009.jpg
dog.1009.json


Metadata in `cat.1009.json` corresponds to image `cat.1009.jpg`. For such a simple annotation, we can inspect the pair by eye. Notice how the file type is specified as "image" for convenience; this avoids using the PIL library every time to open a binary file.

In [33]:
next(DataChain.from_storage('gs://datachain-demo/dogs-and-cats/cat.1009.jpg', type='image').collect("file")).read()

Processed: 1 rows [00:00, 703.15 rows/s]


<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=353x400>

In [4]:
next(DataChain.from_storage('gs://datachain-demo/dogs-and-cats/cat.1009.json', type='text').collect("file")).read()

Processed: 1 rows [00:00, 923.25 rows/s]


'{"class": "cat", "id": "1009", "num_annotators": 8, "inference": {"class": "dog", "confidence": 0.68}}'

Pydantic schema for this metadata layout goes as follows:

In [5]:
from datachain.lib.dc import Column, DataChain
DataChain.from_storage('gs://datachain-demo/dogs-and-cats/cat.1009.json', type='text').print_json_schema().exec()

Processed: 1 rows [00:00, 1036.40 rows/s]
Download: 0.00B [00:00, ?B/s]

# generated by datamodel-codegen:
#   filename:  <stdin>
#   timestamp: 2024-07-30T20:33:38+00:00

from __future__ import annotations

from pydantic import BaseModel, Field


class Inference(BaseModel):
    class_: str = Field(..., alias='class')
    confidence: float


class Modeljson970d6f6a76f94747a1252f1f68d7f251(BaseModel):
    class_: str = Field(..., alias='class')
    id: str
    num_annotators: int
    inference: Inference


from datachain.lib.data_model import DataModel


DataModel.register(Modeljson970d6f6a76f94747a1252f1f68d7f251)


spec=Modeljson970d6f6a76f94747a1252f1f68d7f251






<datachain.lib.dc.DataChain at 0x13b2a28b0>

At root level, we can see fields `class`, `id`, and `num_annotators`, and a nested field `inference` holding an object with fields `class` and `confidence`. As we can see, this schema fully describes our dataset sample.
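
The generated models store the JSON key `class` in a field called `class_` with `alias='class'`, because `class` is a reserved word in Python and cannot be used as an attribute name. The sketch below mimics that naming convention with the standard `keyword` module. The helper `safe_field_name` is hypothetical and is not the code generator's actual logic:

```python
import keyword

def safe_field_name(key: str) -> str:
    """Append an underscore when a JSON key is a reserved Python keyword."""
    return f"{key}_" if keyword.iskeyword(key) else key

json_keys = ["class", "id", "num_annotators", "inference"]
field_names = {key: safe_field_name(key) for key in json_keys}

print(field_names)
```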

💡 By default, the schema generator assigns a random root class name, such as _Modeljson9bb238bdb73940458dc4ee87445b112d_. It can be changed with the `model_name` argument.

<a id='jsonl'></a>
## JSON lines format

The JSON lines format is also relatively straightforward. It assumes that each sample corresponds to a JSON object stored on a separate line of a common `.jsonl` file.

For example, here is a Pydantic schema for [localized narratives](https://google.github.io/localized-narratives/) metadata from the [Google OpenImages](https://storage.googleapis.com/openimages/web/index.html) dataset. Note how we can assign an informative root model name while parsing these narratives:

In [7]:
from datachain.lib.dc import Column, DataChain
uri = "gs://datachain-demo/openimages-jsonl/open_images_validation_localized_narratives.jsonl"
DataChain.from_storage(uri).print_jsonl_schema(model_name="Narrative").exec()

Processed: 1 rows [00:00, 796.79 rows/s]
Download: 18.1kB [00:00, 185kB/s]

# generated by datamodel-codegen:
#   filename:  <stdin>
#   timestamp: 2024-07-30T20:33:54+00:00

from __future__ import annotations

from typing import List

from pydantic import BaseModel


class TimedCaptionItem(BaseModel):
    utterance: str
    start_time: float
    end_time: float


class Trace(BaseModel):
    x: float
    y: float
    t: float


class Narrative(BaseModel):
    dataset_id: str
    image_id: str
    annotator_id: int
    caption: str
    timed_caption: List[TimedCaptionItem]
    traces: List[List[Trace]]
    voice_recording: str


from datachain.lib.data_model import DataModel


DataModel.register(Narrative)


spec=Narrative






<datachain.lib.dc.DataChain at 0x13b4026a0>

<a id='nested'></a>
## Nested JSON schemas

Simple JSON schemas are easy to understand but are not flexible because they cannot offer the common ("shared") metadata block describing the entire dataset, and cannot be easily extended with new annotation types. 

For this reason, richly annotated datasets (like [Microsoft COCO](https://cocodataset.org/#home)) are often bundled with nested JSON files.

Nested annotations typically have a schema that collectively describes the dataset at root level, while individual samples are described in separate JSON arrays organized by application. For instance, the following JSON file for image captions features a root-level object `Info` describing the entire dataset, and a list of licenses. Samples are described in arrays under keys `images` and `annotations`:

In [8]:
from datachain.lib.dc import Column, DataChain
DataChain.from_storage("gs://datachain-demo/coco2017/annotations/captions_val2017.json").print_json_schema(model_name="COCO").exec()


Processed: 1 rows [00:00, 1071.89 rows/s]
Download: 0.00B [00:00, ?B/s]

# generated by datamodel-codegen:
#   filename:  <stdin>
#   timestamp: 2024-07-30T20:34:08+00:00

from __future__ import annotations

from typing import List

from pydantic import BaseModel


class Info(BaseModel):
    description: str
    url: str
    version: str
    year: int
    contributor: str
    date_created: str


class License(BaseModel):
    url: str
    id: int
    name: str


class Image(BaseModel):
    license: int
    file_name: str
    coco_url: str
    height: int
    width: int
    date_captured: str
    flickr_url: str
    id: int


class Annotation(BaseModel):
    image_id: int
    id: int
    caption: str


class COCO(BaseModel):
    info: Info
    licenses: List[License]
    images: List[Image]
    annotations: List[Annotation]


from datachain.lib.data_model import DataModel


DataModel.register(COCO)


spec=COCO






<datachain.lib.dc.DataChain at 0x13b531340>

To specify which part of the top-level JSON object contains the annotations we care about, the DataChain schema parser accepts a [JMESPath](https://jmespath.org) expression argument, which is expected to resolve to an object of interest or to a JSON array of such objects.

For example, here is how to reduce the MS COCO captions schema above to the `annotations` array:

In [10]:
DataChain.from_storage("gs://datachain-demo/coco2017/annotations/captions_val2017.json").print_json_schema(jmespath="annotations", model_name="Annotations").exec()

Processed: 1 rows [00:00, 1043.88 rows/s]
Download: 0.00B [00:00, ?B/s]

# generated by datamodel-codegen:
#   filename:  <stdin>
#   timestamp: 2024-07-30T20:34:29+00:00

from __future__ import annotations

from pydantic import BaseModel


class Annotations(BaseModel):
    image_id: int
    id: int
    caption: str


from datachain.lib.data_model import DataModel


DataModel.register(Annotations)


spec=Annotations






<datachain.lib.dc.DataChain at 0x13a65c160>

<a id='parsing'></a>
## Parsing JSON files

Once the schema is correctly identified, parsing the JSON files into the metadata becomes a one-liner call to `from_json()` or `from_jsonl()`. During parsing, the metadata is validated against the Pydantic class definitions and added onto the chain:

In [11]:
from datachain.lib.dc import Column, DataChain
DataChain.from_json('gs://datachain-demo/dogs-and-cats/cat.1009.json').show()



| | json.class_ | json.id | json.num_annotators | json.inference.class_ | json.inference.confidence |
|---|---|---|---|---|---|
| 0 | cat | 1009 | 8 | dog | 0.68 |


<a id='errors'></a>
### Dealing with validation errors

Sometimes data records fail to validate against the Pydantic class definition at parse time.
This may indicate one of two problems:

1. The raw data has errors
2. The automatically defined schema is incomplete and needs to be corrected (e.g. some fields must be made "optional")

In all cases, the first step is to find the data piece that causes the trouble.

To debug these parsing errors, it helps to limit the number of JSON files to process, the number of objects parsed, or the scope of the JSON metadata. In the following example, some json-pairs fail to validate, and the Pydantic error messages name the offending files:

In [12]:
DataChain.from_json("gs://datachain-demo/openimages-v6-test-jsonpairs/3*json", model_name="OpenImage").exec()

Processed: 10 rows [00:00, 4655.17 rows/s]
Download: 0.00B [00:00, ?B/s]
Download: 1.45kB [00:00, 2.69kB/s]][A
Processed: 1 rows [00:00, 362.58 rows/s]
Processed: 10 rows [00:00, 4265.97 rows/s]
Processed: 0 rows [00:00, ? rows/s]
Processed: 2 rows [00:00,  3.42 rows/s]

Validation error occurred in row 0 file 3122f16026310c3e.json: 2 validation errors for OpenImage
image_id.Thumbnail300KURL
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
image_id.Rotation
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type



Processed: 3 rows [00:00,  3.18 rows/s]

Validation error occurred in row 0 file 348824dd8c0c74e6.json: 1 validation error for OpenImage
image_id.Rotation
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type


Processed: 4 rows [00:01,  2.88 rows/s]

Validation error occurred in row 0 file 358315a151efa740.json: 1 validation error for OpenImage
image_id.Rotation
  Input should be a valid string [type=string_type, input_value=0.0, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type



Processed: 5 rows [00:01,  2.58 rows/s]

Validation error occurred in row 0 file 364d0be55f24616a.json: 1 validation error for OpenImage
image_id.Rotation
  Input should be a valid string [type=string_type, input_value=0.0, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type


Processed: 6 rows [00:02,  2.42 rows/s]

Validation error occurred in row 0 file 37b72b4e808bcd30.json: 1 validation error for OpenImage
image_id.Rotation
  Input should be a valid string [type=string_type, input_value=0.0, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type



Processed: 7 rows [00:02,  2.38 rows/s]

Validation error occurred in row 0 file 384e33c0ffdfe052.json: 1 validation error for OpenImage
image_id.Rotation
  Input should be a valid string [type=string_type, input_value=0.0, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type



Processed: 8 rows [00:03,  2.45 rows/s]

Validation error occurred in row 0 file 39224d0d713cb866.json: 1 validation error for OpenImage
image_id.Rotation
  Input should be a valid string [type=string_type, input_value=0.0, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type



Processed: 9 rows [00:03,  2.28 rows/s]

Validation error occurred in row 0 file 3b727441da9834b4.json: 1 validation error for OpenImage
image_id.Rotation
  Input should be a valid string [type=string_type, input_value=0.0, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type



Processed: 10 rows [00:04,  2.30 rows/s]

Validation error occurred in row 0 file 3cbec6265c443ea4.json: 1 validation error for OpenImage
image_id.Rotation
  Input should be a valid string [type=string_type, input_value=0.0, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type



Download: 25.6kB [00:03, 6.58kB/s][A
Processed: 10 rows [00:04,  2.19 rows/s]

Validation error occurred in row 0 file 3fa6819854b27685.json: 1 validation error for OpenImage
image_id.Rotation
  Input should be a valid string [type=string_type, input_value=0.0, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type





<datachain.lib.dc.DataChain at 0x13d916610>

As we can see from examining one of the offending files, the problem lies in the keys `Thumbnail300KURL` and `Rotation`, which hold the value `NaN` instead of a string (for example, an empty one):


In [13]:
next(DataChain.from_storage("gs://datachain-demo/openimages-v6-test-jsonpairs/3122f16026310c3e.json", type='text').collect("file")).read()


Processed: 1 rows [00:00, 780.63 rows/s]


'{\n  "id": "3122f16026310c3e",\n  "split": "test",\n  "image_id": {\n    "Subset": "test",\n    "OriginalURL": "https://c7.staticflickr.com/3/2275/2407197299_d4ea3fdab2_o.jpg",\n    "OriginalLandingURL": "https://www.flickr.com/photos/starshaped/2407197299",\n    "License": "https://creativecommons.org/licenses/by/2.0/",\n    "AuthorProfileURL": "https://www.flickr.com/people/starshaped/",\n    "Author": "Aubrey",\n    "Title": "My worst nightmare",\n    "OriginalSize": 85370,\n    "OriginalMD5": "O+50caXE4Ll4s3pmWcFa2w==",\n    "Thumbnail300KURL": NaN,\n    "Rotation": NaN\n  },\n  "classifications": [\n    {\n      "Source": "verification",\n      "LabelName": "/m/0k0pj",\n      "Confidence": 0\n    },\n    {\n      "Source": "verification",\n      "LabelName": "/m/03q69",\n      "Confidence": 0\n    },\n    {\n      "Source": "verification",\n      "LabelName": "/m/02dl1y",\n      "Confidence": 0\n    },\n    {\n      "Source": "verification",\n      "LabelName": "/m/014sv8",\n    

The parser has no better option than to skip the non-compliant JSON file; the better solution, of course, is to fix the errors in the data.
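
`NaN` is not a valid token in standard JSON, but some decoders accept it anyway. Python's built-in `json` module, for one, turns it into a float, which is consistent with the `input_value=nan, input_type=float` part of the messages above. The following sketch shows this behavior and one possible cleanup that maps such tokens to `null`. It uses only the standard library, and the trimmed-down record is made up for illustration:

```python
import json
import math

raw = '{"Thumbnail300KURL": NaN, "Rotation": NaN}'

# The standard-library decoder accepts the non-standard NaN token and returns a float.
parsed = json.loads(raw)
rotation_is_nan = math.isnan(parsed["Rotation"])

# One possible cleanup: map the NaN / Infinity / -Infinity tokens to None (JSON null).
cleaned = json.loads(raw, parse_constant=lambda token: None)
cleaned_text = json.dumps(cleaned)

print(rotation_is_nan, cleaned_text)
```

After such a cleanup the affected fields hold `null`, so the corresponding model fields would still need to be declared optional to pass validation.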

⚠️ Finding data errors in large nested JSON files can be trickier because the error messages can be overwhelming. Two techniques help with that:

- The `nrows` argument limits the number of JSON lines or array members processed.
- A [JMESPath](https://jmespath.org) expression can slice JSON arrays or narrow the input down to the part that potentially causes trouble:

In [14]:
DataChain.from_json("gs://datachain-demo/coco2017/annotations/captions_val2017.json", jmespath="annotations[1:3]").show()



| | annotations.image_id | annotations.id | annotations.caption |
|---|---|---|---|
| 0 | 179765 | 182 | A Honda motorcycle parked in a grass driveway |
| 1 | 190236 | 401 | An office cubicle with four different types of... |
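
JMESPath slices follow the same convention as Python slices: the start index is inclusive and the stop index is exclusive, which is why `annotations[1:3]` produced two rows. A plain-Python counterpart on a small made-up document (not the COCO file) looks like this:

```python
doc = {
    "info": {"year": 2017},
    "annotations": [
        {"image_id": 10, "id": 0, "caption": "first"},
        {"image_id": 11, "id": 1, "caption": "second"},
        {"image_id": 12, "id": 2, "caption": "third"},
        {"image_id": 13, "id": 3, "caption": "fourth"},
    ],
}

# Plain-Python counterpart of the JMESPath expression "annotations[1:3]":
# start index inclusive, stop index exclusive, so two entries are selected.
subset = doc["annotations"][1:3]
selected_ids = [entry["id"] for entry in subset]

print(selected_ids)
```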


<a id='static'></a>
## Static schema definitions

Sometimes we are not satisfied with the automatically generated and implicitly instantiated schema. Here are some cases where this may happen:

* We plan to reuse our dataset in another Python session, so we want to code an explicit data model.

* The input data object may have too many properties, and we want to trim it.

* The automatic schema generator may produce wrong field types, or remain unaware of fields that must be made optional or mandatory.

In all those cases, the simplest way forward is to start with an auto-generated schema and modify it to fit the bill. Here is an example of an auto-generated schema we intend to modify:

```
            # generated by datamodel-codegen:
            #   filename:  <stdin>
            #   timestamp: 2024-07-25T22:53:27+00:00
            
            from __future__ import annotations
            
            from pydantic import BaseModel, Field
            
            
            class Inference(BaseModel):
                class_: str = Field(..., alias='class')
                confidence: float
            
            
            class Modeljsone31f95f9eaa1404ab3277e455970e295(BaseModel):
                class_: str = Field(..., alias='class')
                id: str
                num_annotators: int
                inference: Inference
            
            
            from datachain.lib.data_model import DataModel
            
            
            DataModel.register(Modeljsone31f95f9eaa1404ab3277e455970e295)
            
            
            spec=Modeljsone31f95f9eaa1404ab3277e455970e295
```

 
Let us trim it for an abbreviated data model and feed it back to the JSON parser:

In [15]:
from datachain.lib.dc import Column, DataChain
from pydantic import BaseModel, Field
from typing import Optional

class Inference(BaseModel):
    class_: str = Field(..., alias='class')
    # removed "confidence" key


class AnimalModel(BaseModel):
    class_: str = Field(..., alias='class')
    id: str
    # made "num_annotators" optional; the None default lets the key be absent
    num_annotators: Optional[int] = None
    inference: Inference


from datachain.lib.data_model import DataModel
DataModel.register(AnimalModel)

DataChain.from_json('gs://datachain-demo/dogs-and-cats/cat.1009.json', spec=AnimalModel, object_name="animal").show()




| | animal.class_ | animal.id | animal.num_annotators | animal.inference.class_ |
|---|---|---|---|---|
| 0 | cat | 1009 | 8 | dog |


<a id='merging'></a>
## Merging annotations

DataChain represents files in datasets as pointers. This means that merging JSON annotations with files they describe is a routine workflow in annotated datasets.

Repeated merging may be required when working with richly annotated samples. For example, MS COCO images have meta properties, captions, and object detections assigned to them, each in its own file and JSON object. If you want to work across these annotation layers, stitching them together becomes mandatory.

The following mini-tutorial shows how to build the entire COCO dataset by merging. We will be using the validation subset of 2017 data to demonstrate some typical data problems and ways to deal with them. Our goal is to build a dataset where every entry is a detected object instance (multiple objects can be detected per single image).

<a id='coco'></a>
### COCO 2017 tutorial


In [15]:
from datachain.lib.dc import Column, DataChain 

images_uri="gs://datachain-demo/coco2017/images/val/"
captions_uri="gs://datachain-demo/coco2017/annotations/captions_val2017.json" 
detections_uri="gs://datachain-demo/coco2017/annotations/instances_val2017.json"

#### COCO images and files metadata

The first step is to create a dataset from the storage location where the 5,000 validation images live. This puts an object named "file" on the chain, with fields describing the cloud location:

In [16]:
images = DataChain.from_storage(images_uri, object_name="file", type="image")
images.print_schema()

 file: ImageFile
     source: str
     parent: str
     name: str
     size: int
     version: str
     etag: str
     is_latest: bool
     last_modified: datetime
     location: Union[dict, list[dict], NoneType]
     vtype: str


As we saw in the [Nested JSON schemas](#nested) section, the COCO annotation schema has two arrays at the root level: "images", carrying image metadata including image ids, and "annotations", which match annotations to image ids. So let us unpack the image metadata. By default, the object name ("images" here) matches the JMESPath expression, but it can be changed with the `object_name` argument:

In [17]:
meta = DataChain.from_json(captions_uri, jmespath = "images", object_name="image", model_name="Image")
meta.print_schema()


 image: Image
     license: int
     file_name: str
     coco_url: str
     height: int
     width: int
     date_captured: str
     flickr_url: str
     id: int






Now we have two datasets which contain files and their metadata. We can join them by matching fields referencing filenames:

In [18]:
images_meta = images.merge(meta, on="file.name", right_on="image.file_name")
images_meta.print_schema()

 file: ImageFile
     source: str
     parent: str
     name: str
     size: int
     version: str
     etag: str
     is_latest: bool
     last_modified: datetime
     location: Union[dict, list[dict], NoneType]
     vtype: str
 image: Image
     license: int
     file_name: str
     coco_url: str
     height: int
     width: int
     date_captured: str
     flickr_url: str
     id: int


This datachain has been composed but not executed. Let's run it to see if we still have 5,000 samples:

In [19]:
images_meta.count()



5000
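
Counting rows after a merge is a useful sanity check because the result depends on the join type. The sketch below shows the difference in plain Python with made-up records. It assumes that the default merge keeps unmatched left-side records, as the `inner=True` option used later in this tutorial suggests; it does not reproduce DataChain internals:

```python
files = [{"name": "a.jpg"}, {"name": "b.jpg"}, {"name": "c.jpg"}]
meta = [{"file_name": "a.jpg", "id": 1}, {"file_name": "c.jpg", "id": 3}]

# Index the right-hand side by the join key for constant-time lookups.
meta_by_name = {m["file_name"]: m for m in meta}

# Left join: every file is kept; files without a match get no metadata.
left = [(f, meta_by_name.get(f["name"])) for f in files]

# Inner join: only files with matching metadata survive.
inner = [(f, m) for f, m in left if m is not None]

print(len(left), len(inner))
```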

#### Image captions

Our next step is to bring captions to these images. However, COCO carries multiple caption versions for each image. If we are not interested in these versions, we can drop the redundant entries and merge the first-choice captions with the images on the _image_id_ field:

In [20]:
captions = DataChain.from_json(captions_uri, jmespath = "annotations", object_name="annotation", model_name="Annotation")
captions = captions.distinct("annotation.image_id")
captioned_images = images_meta.merge(captions, on="image.id", right_on="annotation.image_id")
captioned_images.print_schema()


 file: ImageFile
     source: str
     parent: str
     name: str
     size: int
     version: str
     etag: str
     is_latest: bool
     last_modified: datetime
     location: Union[dict, list[dict], NoneType]
     vtype: str
 image: Image
     license: int
     file_name: str
     coco_url: str
     height: int
     width: int
     date_captured: str
     flickr_url: str
     id: int
 annotation: Annotation
     image_id: int
     id: int
     caption: str
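
The deduplication step can be pictured as keeping one caption per `image_id`. The plain-Python sketch below keeps the first caption seen for each image. The records are made up, and which duplicate `distinct` actually retains is an assumption not confirmed by this tutorial:

```python
captions = [
    {"image_id": 1, "id": 10, "caption": "a cat on a sofa"},
    {"image_id": 1, "id": 11, "caption": "cat sleeping indoors"},
    {"image_id": 2, "id": 12, "caption": "a dog in a park"},
]

first_per_image = {}
for entry in captions:
    # setdefault keeps the first caption seen for each image_id and ignores later ones
    first_per_image.setdefault(entry["image_id"], entry)

deduplicated = list(first_per_image.values())

print([entry["id"] for entry in deduplicated])
```

With one caption per image, the subsequent merge on the image id cannot multiply the number of image rows.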





#### Image detections
##### The next (and the final) step is to bring in the instance detections. This is where it gets more interesting:

⚠️ The annotated instances data in the 2017 validation subset of COCO contains a schema error. If we naively read this data, we will get this Pydantic error:

```
Validation error occurred in row 36336 file instances_val2017.json: 1 validation error for Instance
segmentation
  Input should be a valid array [type=list_type, input_value={'counts': [272, 2, 4, 4,...50], 'size': [240, 320]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.8/v/list_type
```

##### Indeed, we can verify that the schema of the 2017 COCO detections changes around entry #36335 of the _instances_val2017.json_ file:

```
DataChain.from_storage(detections_uri).print_json_schema(model_name="Instance", jmespath="annotations[0]").exec()

                >>>
                class Instance(BaseModel):
                    segmentation: List[List[float]]
                    area: float
                    iscrowd: int
                    image_id: int
                    bbox: List[float]
                    category_id: int
                    id: int
```

```
DataChain.from_storage(detections_uri).print_json_schema(model_name="Instance", jmespath="annotations[36336]").exec()

                >>>
                class Segmentation(BaseModel):
                    counts: List[int]
                    size: List[int]
                
                
                class Instance(BaseModel):
                    segmentation: Segmentation
                    area: int
                    iscrowd: int
                    image_id: int
                    bbox: List[int]
                    category_id: int
                    id: int

```

To avoid this problem, we will simply skip the last 446 detected instances by limiting the number of parsed entries with the `nrows` argument.
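
A cutoff like this can be located by scanning the array for the first entry whose `segmentation` is not a list. The sketch below does so on a small made-up array in plain Python. The helper `first_non_list_segmentation` is hypothetical, and the sample values only imitate the two layouts shown above:

```python
annotations = [
    {"id": 1, "segmentation": [[10.0, 20.0, 30.0, 40.0]]},
    {"id": 2, "segmentation": [[5.0, 5.0, 9.0, 9.0]]},
    {"id": 3, "segmentation": {"counts": [272, 2, 4], "size": [240, 320]}},
]

def first_non_list_segmentation(entries):
    """Return the index of the first entry whose segmentation is not a list, or None."""
    for index, entry in enumerate(entries):
        if not isinstance(entry["segmentation"], list):
            return index
    return None

cutoff = first_non_list_segmentation(annotations)

print(cutoff)
```

Since the index is zero-based, it also equals the number of leading entries that still follow the list layout.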

##### Object name collisions

If we examine the schema for the JSON detections, we will see that it carries metadata in an array named 'annotations', the same name as the array with captions. To avoid namespace collisions, let us rename the detected object instances using the `object_name` argument:

In [None]:
detections = DataChain.from_json(detections_uri, jmespath = "annotations", object_name="instance", model_name="Instance", nrows=36335)
instances = captioned_images.merge(detections, on="image.id", right_on="instance.image_id", inner=True).save("coco-detections")

In [22]:
instances.count()

36335

In [23]:
instances.print_schema()

 file: ImageFile
     source: str
     parent: str
     name: str
     size: int
     version: str
     etag: str
     is_latest: bool
     last_modified: datetime
     location: Union[dict, list[dict], NoneType]
     vtype: str
 image: Image
     license: int
     file_name: str
     coco_url: str
     height: int
     width: int
     date_captured: str
     flickr_url: str
     id: int
 annotation: Annotation
     image_id: int
     id: int
     caption: str
 instance: Instance
     segmentation: list[list[float]]
     area: float
     iscrowd: int
     image_id: int
     bbox: list[float]
     category_id: int
     id: int


##### Great! 
As a last step, let us collect COCO data categories so we can use names instead of numbers:

In [24]:
categories = DataChain.from_json(detections_uri, jmespath = "categories", model_name="Category")
categories_iterator = categories.collect("categories")
coco_dict = {obj.name: obj.id for obj in categories_iterator}



#### Now we have built a dataset with all validation object instances for COCO 2017

It is time to do fun stuff!


##### For example, let us see images where cats and dogs are shown together:

In [25]:
dogs = instances.filter(Column("instance.category_id") == coco_dict["dog"])
cats = instances.filter(Column("instance.category_id") == coco_dict["cat"])

# drop all columns in "cats" except id and the filename we will be merging on, rename columns to avoid collision at merge
cat_ids = cats.mutate(cat_id=Column("instance.id")).mutate(cat_fname=Column("file.name")).select("cat_id", "cat_fname")

# inner = True, drop all records without a merging match:
cats_and_dogs = dogs.merge(cat_ids, on="file.name", right_on="cat_fname", inner=True)

In [36]:
animals = cats_and_dogs.collect("file")
next(animals).read()

<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x334>

##### We can also find the images where a dog is detected but not mentioned in a caption:

In [35]:
unmentioned_dogs = dogs.filter(~Column("annotation.caption").glob('*dog*')).distinct("file.name")
lost_dogs = list(unmentioned_dogs.collect("file", "annotation.caption", "instance.bbox"))
lost_dog = lost_dogs[3]
lost_dog_pic, lost_dog_caption, bbox = lost_dog
lost_dog_pic.read()

<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x485>

In [31]:
lost_dog_caption


'A lot of people that are looking at a pool.'

##### Let's crop this dog out using its bounding box

In [32]:
x_min = bbox[0]
y_min = bbox[1]
x_max = bbox[0] + bbox[2] 
y_max = bbox[1] + bbox[3]
bbox_converted = (x_min, y_min, x_max, y_max)
image = lost_dogs[3][0].read()
recovered_dog = image.crop(bbox_converted)
recovered_dog

<PIL.Image.Image image mode=RGB size=198x202>

##### Curiously, COCO also has images captioned as "dogs" but lacking the dog detections.

In the code below, we are filtering all instances which mention "dog" in captions. Since one image may have multiple detected instances, we are interested only in the images that lack the detected "dog" category. To find these, we aggregate by the filename, and return only file records without 'dog' category assigned to them. Note that our files in storage were originally configured as 'PIL images' for convenience; we want to maintain this distinction by specifying the ImageFile output type. 

In [30]:
from datachain.lib.file import ImageFile

instances = DataChain.from_dataset("coco-detections")
mentioned_dogs = instances.filter(Column("annotation.caption").glob("* dog *"))

def no_dog_category(file, instance):
    return_file = file[0]
    category_set = set([inst.category_id for inst in instance])
    if coco_dict["dog"] not in category_set:
        yield return_file

no_dogs_detected = mentioned_dogs.agg(no_dog_category, output={"file": ImageFile}, partition_by = Column("file.name"))
no_dogs_detected.count()

Processed: 0 rows [00:00, ? rows/s]
Processed: 646 rows [00:00, 10666.61 rows/s]
Generated: 17 rows [00:00, 325.61 rows/s]


17
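
The aggregation above follows a group-then-filter pattern: rows are partitioned by file name, and each partition yields at most one record. Here is a plain-Python sketch of the same idea on made-up rows. The category id stored in `DOG` and the file names are hypothetical, used only for this illustration:

```python
from collections import defaultdict

DOG = 18  # hypothetical category id, used only for this illustration

rows = [
    {"file": "hotdog.jpg", "category_id": 52},
    {"file": "hotdog.jpg", "category_id": 1},
    {"file": "park.jpg", "category_id": 18},
    {"file": "park.jpg", "category_id": 1},
]

# Partition rows by file name, collecting the category ids detected in each file.
categories_by_file = defaultdict(set)
for row in rows:
    categories_by_file[row["file"]].add(row["category_id"])

# Keep one record per file, and only for files with no dog among the detections.
files_without_dog = [name for name, cats in categories_by_file.items() if DOG not in cats]

print(files_without_dog)
```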

##### What are these images?
##### They turn out to be various hotdogs, and one picture of a dog on a bus:

In [34]:
mysterious_dogs = list(no_dogs_detected.collect("file"))
mysterious_dogs[11].read()

Processed: 0 rows [00:00, ? rows/s]
Processed: 646 rows [00:00, 10342.90 rows/s]
Generated: 17 rows [00:00, 313.85 rows/s]


<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x427>

### That's it, folks! Enjoy your JSONs.