Document, Executor, and Flow are the three fundamental concepts in Jina.

Document is the basic data type in Jina;
Executor is how Jina processes Documents;
Flow is how Jina streamlines and scales Executors.

Learn these three and you are good to go.

Cookbook on `Document`/`DocumentArray` 2.0 API

Document is the basic data type that Jina operates with. Text, image, video, audio or 3D mesh: they are all Documents in Jina.

DocumentArray is a sequence container of Documents. It is the first-class citizen of Executor, serving as the Executor's input and output.

You could say Document is to Jina what np.float is to NumPy, and DocumentArray is to Jina what np.ndarray is to NumPy.

Table of Contents

Minimum working example
Document API
DocumentArray API

Minimum working example

from jina import Document

d = Document()

`Document` API

`Document` Attributes

A Document object has the following attributes, which can be put into the following categories:


| Category | Attributes |
| --- | --- |
| Content attributes | `.buffer`, `.blob`, `.text`, `.uri`, `.content`, `.embedding` |
| Meta attributes | `.id`, `.weight`, `.mime_type`, `.location`, `.tags`, `.offset`, `.modality`, `.siblings` |
| Recursive attributes | `.chunks`, `.matches`, `.granularity`, `.adjacency` |
| Relevance attributes | `.score`, `.evaluations` |

Set & Unset Attributes

Set an attribute:

from jina import Document

d = Document()
d.text = 'hello world'

<jina.types.document.Document id=9badabb6-b9e9-11eb-993c-1e008a366d49 mime_type=text/plain text=hello world at 4444621648>

Unset an attribute:

d.pop('text')

<jina.types.document.Document id=cdf1dea8-b9e9-11eb-8fd8-1e008a366d49 mime_type=text/plain at 4490447504>

Unset multiple attributes:

d.pop('text', 'id', 'mime_type')

<jina.types.document.Document at 5668344144>

Construct `Document`

Content Attributes


| Attribute | Description |
| --- | --- |
| `doc.buffer` | The raw binary content of this Document |
| `doc.blob` | The `ndarray` of the image/audio/video Document |
| `doc.text` | The text info of the Document |
| `doc.uri` | The URI of the Document: a local file path, a remote URL starting with http or https, or a data URI |
| `doc.content` | Whichever of the fields above is non-empty |
| `doc.embedding` | The embedding `ndarray` of this Document |

You can assign str, ndarray, buffer or uri to a Document.

from jina import Document
import numpy as np

d1 = Document(content='hello')
d2 = Document(content=b'\f1')
d3 = Document(content=np.array([1, 2, 3]))
d4 = Document(content='https://static.jina.ai/logo/core/notext/light/logo.png')

<jina.types.document.Document id=2ca74b98-aed9-11eb-b791-1e008a366d48 mimeType=text/plain text=hello at 6247702096>
<jina.types.document.Document id=2ca74f1c-aed9-11eb-b791-1e008a366d48 buffer=DDE= mimeType=text/plain at 6247702160>
<jina.types.document.Document id=2caab594-aed9-11eb-b791-1e008a366d48 blob={'dense': {'buffer': 'AQAAAAAAAAACAAAAAAAAAAMAAAAAAAAA', 'shape': [3], 'dtype': '<i8'}} at 6247702416>
<jina.types.document.Document id=4c008c40-af9f-11eb-bb84-1e008a366d49 uri=https://static.jina.ai/logo/core/notext/light/logo.png mimeType=image/png at 6252395600>

The content will be automatically assigned to either the text, buffer, blob, or uri fields. id and mime_type are auto-generated when not given.

You can get a visualization of a Document object in Jupyter Notebook or by calling .plot().

Exclusivity of `doc.content`

Note that one Document can only contain one type of content: it is either text, buffer, blob or uri. Setting text first and then setting uri will clear the text field.

d = Document(text='hello world')
d.uri = 'https://jina.ai/'
assert not d.text  # True

d = Document(content='https://jina.ai')
assert d.uri == 'https://jina.ai'  # True
assert not d.text  # True
d.text = 'hello world'

assert d.content == 'hello world'  # True
assert not d.uri  # True
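The exclusivity rule can be pictured with a toy class in plain Python. This is only a sketch of the behaviour described in this section, not Jina's implementation. `ToyDocument`, `set_content` and `get` are hypothetical names invented for the illustration.

```python
class ToyDocument:
    """Hypothetical stand-in that holds at most one kind of content."""

    _CONTENT_FIELDS = ('text', 'buffer', 'blob', 'uri')

    def __init__(self):
        # Which content field is currently set, and its value.
        self._kind = None
        self._value = None

    def set_content(self, kind, value):
        if kind not in self._CONTENT_FIELDS:
            raise ValueError(f'unknown content field: {kind}')
        # Setting one field implicitly replaces whichever was set before.
        self._kind, self._value = kind, value

    def get(self, kind):
        # Only the active field returns a value; the others read as empty.
        return self._value if kind == self._kind else None

    @property
    def content(self):
        return self._value


d = ToyDocument()
d.set_content('text', 'hello world')
d.set_content('uri', 'https://jina.ai/')
print(d.get('text'))  # None: the text was cleared when the uri was set
print(d.content)      # https://jina.ai/
```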

Conversion between `doc.content`

You can use the following methods to convert between .uri, .text, .buffer and .blob:

doc.convert_buffer_to_blob()
doc.convert_blob_to_buffer()
doc.convert_uri_to_buffer()
doc.convert_buffer_to_uri()
doc.convert_text_to_uri()
doc.convert_uri_to_text()

You can convert a URI to a data URI (a data in-line URI scheme) using doc.convert_uri_to_datauri(). This will fetch the resource and make it inline.
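For readers unfamiliar with data URIs, the following sketch shows what an inlined resource looks like, using only the standard library. `to_data_uri` is a hypothetical helper written for this illustration. It skips the fetching step and is not how Jina implements the conversion.

```python
import base64


def to_data_uri(payload: bytes, mime_type: str) -> str:
    """Inline raw bytes as a base64-encoded `data:` URI."""
    encoded = base64.b64encode(payload).decode('ascii')
    return f'data:{mime_type};base64,{encoded}'


uri = to_data_uri(b'hello', 'text/plain')
print(uri)  # data:text/plain;base64,aGVsbG8=
```

The whole payload travels inside the URI string itself, so no further network or disk access is needed to read it.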

In particular, when you work with an image Document, there are some extra helpers that enable more conversion:

doc.convert_image_buffer_to_blob()
doc.convert_image_blob_to_uri()
doc.convert_image_uri_to_blob()
doc.convert_image_datauri_to_blob()

Set Embedding

An embedding is a high-dimensional representation of a Document. You can assign any NumPy ndarray as a Document's embedding.

import numpy as np
from jina import Document

d1 = Document(embedding=np.array([1, 2, 3]))
d2 = Document(embedding=np.array([[1, 2, 3], [4, 5, 6]]))

Construct with Multiple Attributes

Meta Attributes


| Attribute | Description |
| --- | --- |
| `doc.tags` | A structured data value, consisting of fields which map to dynamically typed values |
| `doc.id` | A hexdigest that represents a unique Document ID |
| `doc.weight` | The weight of the Document |
| `doc.mime_type` | The mime type of the Document |
| `doc.location` | The position of the Document. This could be start and end index of a string; x,y (top, left) coordinates of an image crop; timestamp of an audio clip, etc |
| `doc.offset` | The offset of the Document in the previous granularity Document |
| `doc.modality` | An identifier of the modality the Document belongs to |

You can assign multiple attributes in the constructor via:

from jina import Document

d = Document(uri='https://jina.ai',
             mime_type='text/plain',
             granularity=1,
             adjacency=3,
             tags={'foo': 'bar'})

<jina.types.document.Document id=e01a53bc-aedb-11eb-88e6-1e008a366d48 uri=https://jina.ai mimeType=text/plain tags={'foo': 'bar'} granularity=1 adjacency=3 at 6317309200>

Construct from Dict or JSON String

You can build a Document from a dict or JSON string:

from jina import Document
import json

d = {'id': 'hello123', 'content': 'world'}
d1 = Document(d)

d = json.dumps({'id': 'hello123', 'content': 'world'})
d2 = Document(d)

Parsing Unrecognized Fields

Unrecognized fields in a dict/JSON string are automatically put into the Document's .tags field:

from jina import Document

d1 = Document({'id': 'hello123', 'foo': 'bar'})

<jina.types.document.Document id=hello123 tags={'foo': 'bar'} at 6320791056>

You can use field_resolver to map external field names to Document attributes:

from jina import Document

d1 = Document({'id': 'hello123', 'foo': 'bar'}, field_resolver={'foo': 'content'})

<jina.types.document.Document id=hello123 mimeType=text/plain text=bar at 6246985488>
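The two behaviours described in this section (unknown keys fall into tags, and field_resolver renames keys first) can be sketched on plain dictionaries. This is an assumption-laden toy, not Jina's parsing code: `split_fields` is a hypothetical helper and `KNOWN_FIELDS` is a deliberately tiny, made-up subset of attribute names.

```python
# Assumed subset of recognized attribute names, for illustration only.
KNOWN_FIELDS = {'id', 'content', 'text', 'uri'}


def split_fields(raw, field_resolver=None):
    """Return (recognized, tags) after renaming keys via field_resolver."""
    field_resolver = field_resolver or {}
    recognized, tags = {}, {}
    for key, value in raw.items():
        # Rename external keys first, then decide where the value belongs.
        name = field_resolver.get(key, key)
        if name in KNOWN_FIELDS:
            recognized[name] = value
        else:
            tags[name] = value
    return recognized, tags


plain = split_fields({'id': 'hello123', 'foo': 'bar'})
print(plain)     # ({'id': 'hello123'}, {'foo': 'bar'})

resolved = split_fields({'id': 'hello123', 'foo': 'bar'}, {'foo': 'content'})
print(resolved)  # ({'id': 'hello123', 'content': 'bar'}, {})
```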

Construct from Another `Document`

Assigning a Document object to another variable does not copy it; both names refer to the same object:

from jina import Document

d = Document(content='hello, world!')
d1 = d

assert id(d) == id(d1)  # True

To make a deep copy, use copy=True:

d1 = Document(d, copy=True)

assert id(d) != id(d1)  # d1 is a separate object
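The difference between sharing an object and deep-copying it is a general Python matter. The sketch below shows it with a plain dictionary and the standard `copy` module, so it runs without Jina; the dictionary is only a stand-in for a Document.

```python
import copy

original = {'text': 'hello, world!', 'tags': {'a': 'b'}}

alias = original                 # same object under a second name
clone = copy.deepcopy(original)  # independent object, nested data included

# A change made through the alias is visible through the original...
alias['tags']['a'] = 'changed'
print(original['tags']['a'])  # changed

# ...but the deep copy keeps the value it had when it was created.
print(clone['tags']['a'])     # b
```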

You can partially update a Document according to another source Document:

from jina import Document

s = Document(
    id='🐲',
    content='hello-world',
    tags={'a': 'b'},
    chunks=[Document(id='🐢')],
)
d = Document(
    id='🐦',
    content='goodbye-world',
    tags={'c': 'd'},
    chunks=[Document(id='🐯')],
)

# only update `id` field
d.update(s, include_fields=('id',))

# only preserve `id` field
d.update(s, exclude_fields=('id',))
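The include/exclude semantics suggested by the two comments can be sketched on plain dictionaries. `partial_update` is a hypothetical helper that mirrors the idea only; it makes no claim about how Jina handles nested fields such as chunks or tags.

```python
def partial_update(dest, source, include_fields=None, exclude_fields=None):
    """Copy fields from source into dest, filtered by include/exclude."""
    for key, value in source.items():
        if include_fields is not None and key not in include_fields:
            continue  # not on the allow-list
        if exclude_fields is not None and key in exclude_fields:
            continue  # explicitly protected
        dest[key] = value
    return dest


s = {'id': 'source', 'content': 'hello-world'}

# Only `id` is taken from the source.
d1 = partial_update({'id': 'dest', 'content': 'goodbye-world'}, s, include_fields=('id',))
print(d1)  # {'id': 'source', 'content': 'goodbye-world'}

# Everything except `id` is taken from the source.
d2 = partial_update({'id': 'dest', 'content': 'goodbye-world'}, s, exclude_fields=('id',))
print(d2)  # {'id': 'dest', 'content': 'hello-world'}
```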

Construct from JSON, CSV, `ndarray` and Files

The jina.types.document.generators module lets you construct Documents from common file types such as JSON, CSV, ndarray and text files. The following functions return a generator of Documents, where each Document object corresponds to a line/row in the original format:


| Function | Description |
| --- | --- |
| `from_ndjson()` | Yield `Document` from a line-based JSON file. Each line is a `Document` object |
| `from_csv()` | Yield `Document` from a CSV file. Each line is a `Document` object |
| `from_files()` | Yield `Document` from files matching a glob pattern. Each file is a `Document` object |
| `from_ndarray()` | Yield `Document` from a `ndarray`. Each row (depending on `axis`) is a `Document` object |

Using a generator is sometimes less memory-demanding, as it does not load/build all Document objects in one shot.
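To see why a generator keeps memory use low, here is a minimal standard-library sketch that parses line-delimited JSON lazily. `iter_ndjson` is a hypothetical helper yielding plain dicts; it only illustrates the lazy pattern and is not the implementation of `from_ndjson()`.

```python
import io
import json


def iter_ndjson(lines):
    """Lazily yield one dict per non-empty line of line-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# An in-memory stand-in for a file opened in text mode.
fp = io.StringIO('{"id": "a", "text": "hello"}\n{"id": "b", "text": "world"}\n')

records = iter_ndjson(fp)
first = next(records)      # only the first line has been parsed so far
print(first['text'])       # hello
remaining = list(records)  # the rest is parsed on demand
print(len(remaining))      # 1
```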

To convert the generator to a DocumentArray, use:

from jina import DocumentArray
from jina.types.document.generators import from_files

DocumentArray(from_files('/*.png'))

Serialize `Document`

You can serialize a Document into a JSON string, a Python dict or a binary string:

from jina import Document

d = Document(content='hello world')
d.json()

{
  "id": "6a1c7f34-aef7-11eb-b075-1e008a366d48",
  "mimeType": "text/plain",
  "text": "hello world"
}

d.dict()

{'id': '6a1c7f34-aef7-11eb-b075-1e008a366d48', 'mimeType': 'text/plain', 'text': 'hello world'}

d.binary_str()

b'\n$6a1c7f34-aef7-11eb-b075-1e008a366d48R\ntext/plainj\x0bhello world'

Add Recursion to `Document`

Recursive Attributes

Document can be recursed both horizontally and vertically:


| Attribute | Description |
| --- | --- |
| `doc.chunks` | The list of sub-Documents of this Document. They have `granularity + 1` but same `adjacency` |
| `doc.matches` | The list of matched Documents of this Document. They have `adjacency + 1` but same `granularity` |
| `doc.granularity` | The recursion "depth" of the recursive chunks structure |
| `doc.adjacency` | The recursion "width" of the recursive match structure |
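The bookkeeping of granularity and adjacency can be pictured with a small dataclass. This is a hypothetical miniature (`Node`, `add_chunk` and `add_match` are made-up names) that follows the rules stated in the table; it is not Jina's Document class.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical miniature of a Document's recursive structure."""
    name: str
    granularity: int = 0
    adjacency: int = 0
    chunks: List['Node'] = field(default_factory=list)
    matches: List['Node'] = field(default_factory=list)

    def add_chunk(self, name):
        # A chunk sits one level deeper but keeps the same adjacency.
        child = Node(name, self.granularity + 1, self.adjacency)
        self.chunks.append(child)
        return child

    def add_match(self, name):
        # A match sits one step sideways but keeps the same granularity.
        match = Node(name, self.granularity, self.adjacency + 1)
        self.matches.append(match)
        return match


root = Node('root')
chunk = root.add_chunk('chunk')
sub_chunk = chunk.add_chunk('sub-chunk')
match = root.add_match('match')

print(sub_chunk.granularity, sub_chunk.adjacency)  # 2 0
print(match.granularity, match.adjacency)          # 0 1
```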

You can add chunks (sub-Document) and matches (neighbour-Document) to a Document:

Add in constructor:

d = Document(chunks=[Document(), Document()], matches=[Document(), Document()])

Add to existing Document:

d = Document()
d.chunks = [Document(), Document()]
d.matches = [Document(), Document()]

Add to existing doc.chunks or doc.matches:

d = Document()
d.chunks.append(Document())
d.matches.append(Document())

Note that both doc.chunks and doc.matches return DocumentArray, which we will introduce later.

Visualize `Document`

To better see the Document's recursive structure, you can use the .plot() function. If you are using JupyterLab/Notebook, all Document objects will be auto-rendered:

import numpy as np
from jina import Document

d0 = Document(id='🐲', embedding=np.array([0, 0]))
d1 = Document(id='🐦', embedding=np.array([1, 0]))
d2 = Document(id='🐢', embedding=np.array([0, 1]))
d3 = Document(id='🐯', embedding=np.array([1, 1]))

d0.chunks.append(d1)
d0.chunks[0].chunks.append(d2)
d0.matches.append(d3)

d0.plot()  # simply `d0` on JupyterLab

Add Relevancy to `Document`s

Relevance Attributes


| Attribute | Description |
| --- | --- |
| `doc.score` | The relevance information of this Document |
| `doc.evaluations` | The evaluation information of this Document |

You can add a relevance score to a Document object via:

from jina import Document

d = Document()
d.score.value = 0.96
d.score.description = 'cosine similarity'
d.score.op_name = 'cosine()'

<jina.types.document.Document id=0a986c50-aeff-11eb-84c1-1e008a366d48 score={'value': 0.96, 'opName': 'cosine()', 'description': 'cosine similarity'} at 6281686928>

Score information is often used jointly with matches. For example, you often see the indexer adding matches as follows:

from jina import Document

# some query Document
q = Document()
# get match Document `m`
m = Document()
m.score.value = 0.96
q.matches.append(m)

`DocumentArray` API

A DocumentArray is a list of Document objects. You can construct, delete, insert, sort and traverse a DocumentArray like a Python list.

Methods supported by DocumentArray:


| Category | Methods |
| --- | --- |
| Python `list`-like interface | `__getitem__`, `__setitem__`, `__delitem__`, `__len__`, `insert`, `append`, `reverse`, `extend`, `pop`, `remove`, `__iadd__`, `__add__`, `__iter__`, `clear`, `sort` |
| Persistence | `save`, `load` |
| Advanced getters | `get_attributes`, `get_attributes_with_docs` |

Construct `DocumentArray`

You can construct a DocumentArray from an iterable of Documents:

from jina import DocumentArray, Document

# from list
da1 = DocumentArray([Document(), Document()])

# from generator
da2 = DocumentArray((Document() for _ in range(10)))

# from another `DocumentArray`
da3 = DocumentArray(da2)

Persistence via `save()`/`load()`

To save all elements in a DocumentArray in a JSON line format:

from jina import DocumentArray, Document

da = DocumentArray([Document(), Document()])

da.save('data.json')
da1 = DocumentArray.load('data.json')

Access Element

You can access a Document in the DocumentArray via integer index, string id or slice indices:

from jina import DocumentArray, Document

da = DocumentArray([Document(id='hello'), Document(id='world'), Document(id='goodbye')])

da[0]
# <jina.types.document.Document id=hello at 5699749904>

da['world']
# <jina.types.document.Document id=world at 5736614992>

da[1:2]
# <jina.types.arrays.document.DocumentArray length=1 at 5705863632>

Sort Elements

DocumentArray is a subclass of MutableSequence, therefore you can use built-in Python sort to sort elements in a DocumentArray object, e.g.

from jina import DocumentArray, Document

da = DocumentArray(
    [
        Document(tags={'id': 1}),
        Document(tags={'id': 2}),
        Document(tags={'id': 3})
    ]
)

da.sort(key=lambda d: d.tags['id'], reverse=True)
print(da)

This sorts the elements of da in place by their tags['id'] value, in descending order:

<jina.types.arrays.document.DocumentArray length=3 at 5701440528>

{'id': '6a79982a-b6b0-11eb-8a66-1e008a366d49', 'tags': {'id': 3.0}},
{'id': '6a799744-b6b0-11eb-8a66-1e008a366d49', 'tags': {'id': 2.0}},
{'id': '6a799190-b6b0-11eb-8a66-1e008a366d49', 'tags': {'id': 1.0}}

Filter Elements

You can use Python's built-in filter() to filter elements in a DocumentArray object:

from jina import DocumentArray, Document

da = DocumentArray([Document() for _ in range(6)])

for j in range(6):
    da[j].score.value = j

for d in filter(lambda d: d.score.value > 2, da):
    print(d)

<jina.types.document.Document id=c5e588f4-b6b0-11eb-af83-1e008a366d49 score={'value': 3.0} at 5696708048>
<jina.types.document.Document id=c5e58958-b6b0-11eb-af83-1e008a366d49 score={'value': 4.0} at 5696705040>
<jina.types.document.Document id=c5e589b2-b6b0-11eb-af83-1e008a366d49 score={'value': 5.0} at 5696708048>

You can build a DocumentArray object from the filtered results:

from jina import DocumentArray, Document

da = DocumentArray([Document(weight=j) for j in range(6)])
da2 = DocumentArray(list(filter(lambda d: d.weight > 2, da)))

print(da2)

DocumentArray has 3 items:
{'id': '3bd0d298-b6da-11eb-b431-1e008a366d49', 'weight': 3.0},
{'id': '3bd0d324-b6da-11eb-b431-1e008a366d49', 'weight': 4.0},
{'id': '3bd0d392-b6da-11eb-b431-1e008a366d49', 'weight': 5.0}

Use `itertools` on `DocumentArray`

As DocumentArray is an Iterable, you can also use Python's built-in itertools module on it. This enables advanced "iterator algebra" on the DocumentArray.

For instance, you can group a DocumentArray by parent_id:

from jina import DocumentArray, Document
from itertools import groupby

da = DocumentArray([Document(parent_id=f'{i % 2}') for i in range(6)])
groups = groupby(sorted(da, key=lambda d: d.parent_id), lambda d: d.parent_id)
for key, group in groups:
    print((key, len(list(group))))

('0', 3)
('1', 3)
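Note that `groupby` only merges consecutive items with equal keys, which is why the example sorts first. The sketch below shows the difference on plain dictionaries, used here as stand-ins for Documents so that it runs without Jina; `by_parent` is a hypothetical key function.

```python
from itertools import groupby

items = [{'parent_id': str(i % 2)} for i in range(6)]


def by_parent(item):
    return item['parent_id']


# Without sorting, the keys alternate 0, 1, 0, 1, ... so every group has one item.
unsorted_sizes = [(k, len(list(g))) for k, g in groupby(items, by_parent)]
print(len(unsorted_sizes))  # 6

# Sorting by the same key first yields one group per distinct key.
sorted_sizes = [
    (k, len(list(g)))
    for k, g in groupby(sorted(items, key=by_parent), by_parent)
]
print(sorted_sizes)  # [('0', 3), ('1', 3)]
```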

Get Attributes in Bulk

DocumentArray implements getters that let you fetch multiple attributes from the Documents it contains in one shot:

import numpy as np

from jina import DocumentArray, Document

da = DocumentArray([Document(id=1, text='hello', embedding=np.array([1, 2, 3])),
                    Document(id=2, text='goodbye', embedding=np.array([4, 5, 6])),
                    Document(id=3, text='world', embedding=np.array([7, 8, 9]))])

da.get_attributes('id', 'text', 'embedding')

[('1', '2', '3'), ('hello', 'goodbye', 'world'), (array([1, 2, 3]), array([4, 5, 6]), array([7, 8, 9]))]

This can be very useful when extracting a batch of embeddings:

import numpy as np

np.stack(da.get_attributes('embedding'))

[[1 2 3]
 [4 5 6]
 [7 8 9]]
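Conceptually, a bulk getter turns a row-oriented collection into one column per attribute. The sketch below does this on plain dictionaries. `bulk_get` is a hypothetical helper, and returning a flat list for a single name is an assumption inferred from the `np.stack` example in this section, not a statement about Jina's internals.

```python
docs = [
    {'id': '1', 'text': 'hello'},
    {'id': '2', 'text': 'goodbye'},
    {'id': '3', 'text': 'world'},
]


def bulk_get(items, *names):
    """Return one list per requested name; a single name gives a flat list."""
    columns = [[item[name] for item in items] for name in names]
    return columns[0] if len(names) == 1 else columns


ids, texts = bulk_get(docs, 'id', 'text')
print(ids)    # ['1', '2', '3']
print(texts)  # ['hello', 'goodbye', 'world']
print(bulk_get(docs, 'text'))  # ['hello', 'goodbye', 'world']
```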

Access nested attributes from tags

A Document has a tags field that holds a map-like structure whose values can be of arbitrary type, including nested maps.

from jina import Document

doc = Document(tags={'dimensions': {'height': 5.0, 'weight': 10.0}})

doc.tags['dimensions']

{'weight': 10.0, 'height': 5.0}

To make nested fields easy to reach, a Document lets you access them by joining the parts of the qualified attribute name with double underscores (__):

from jina import Document

doc = Document(tags={'dimensions': {'height': 5.0, 'weight': 10.0}})

doc.tags__dimensions__weight

10.0

The same notation lets you fetch nested metadata attributes in bulk from a DocumentArray:

from jina import Document, DocumentArray

da = DocumentArray([Document(tags={'dimensions': {'height': 5.0, 'weight': 10.0}}) for _ in range(10)]) 

da.get_attributes('tags__dimensions__height', 'tags__dimensions__weight')

[[5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0], [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]]
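The double-underscore lookup amounts to walking a path through nested mappings. Here is a minimal sketch on plain dictionaries; `resolve` is a hypothetical helper for illustration and does not reflect how Jina resolves the names internally.

```python
from functools import reduce


def resolve(data, qualified_name, sep='__'):
    """Follow a `__`-separated path through nested dicts."""
    return reduce(lambda node, key: node[key], qualified_name.split(sep), data)


doc = {'tags': {'dimensions': {'height': 5.0, 'weight': 10.0}}}
print(resolve(doc, 'tags__dimensions__weight'))  # 10.0

# Applied to a collection, the same helper gives a bulk lookup.
docs = [doc for _ in range(3)]
heights = [resolve(d, 'tags__dimensions__height') for d in docs]
print(heights)  # [5.0, 5.0, 5.0]
```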
