Document, Executor, and Flow are the three fundamental concepts in Jina.

  • Document is the basic data type in Jina;
  • Executor is how Jina processes Documents;
  • Flow is how Jina streamlines and scales Executors.

Learn these three and you are good to go.


Cookbook on Document/DocumentArray 2.0 API

Document is the basic data type that Jina operates with. Text, image, video, audio or 3D mesh: they are all Documents in Jina.

DocumentArray is a sequence container of Documents. It is the first-class citizen of Executor, serving as the Executor's input and output.

You could say Document is to Jina what np.float is to NumPy, and DocumentArray is to Jina what np.ndarray is to NumPy.

Minimum working example

from jina import Document

d = Document() 

Document API

Document Attributes

A Document object's attributes fall into the following categories:

Content attributes: .buffer, .blob, .text, .uri, .content, .embedding
Meta attributes: .id, .weight, .mime_type, .location, .tags, .offset, .modality, .siblings
Recursive attributes: .chunks, .matches, .granularity, .adjacency
Relevance attributes: .score, .evaluations

Set & Unset Attributes

Set an attribute:

from jina import Document

d = Document()
d.text = 'hello world'
<jina.types.document.Document id=9badabb6-b9e9-11eb-993c-1e008a366d49 mime_type=text/plain text=hello world at 4444621648>

Unset an attribute:

d.pop('text')
<jina.types.document.Document id=cdf1dea8-b9e9-11eb-8fd8-1e008a366d49 mime_type=text/plain at 4490447504>

Unset multiple attributes:

d.pop('text', 'id', 'mime_type')
<jina.types.document.Document at 5668344144>

Construct Document

Content Attributes
doc.buffer: The raw binary content of this Document
doc.blob: The ndarray of the image/audio/video Document
doc.text: The text of the Document
doc.uri: The URI of the Document: a local file path, a remote URL starting with http or https, or a data URI
doc.content: Whichever of the above fields is non-empty
doc.embedding: The embedding ndarray of this Document

You can assign str, ndarray, buffer or uri to a Document.

from jina import Document
import numpy as np

d1 = Document(content='hello')
d2 = Document(content=b'\f1')
d3 = Document(content=np.array([1, 2, 3]))
d4 = Document(content='https://static.jina.ai/logo/core/notext/light/logo.png')
<jina.types.document.Document id=2ca74b98-aed9-11eb-b791-1e008a366d48 mimeType=text/plain text=hello at 6247702096>
<jina.types.document.Document id=2ca74f1c-aed9-11eb-b791-1e008a366d48 buffer=DDE= mimeType=text/plain at 6247702160>
<jina.types.document.Document id=2caab594-aed9-11eb-b791-1e008a366d48 blob={'dense': {'buffer': 'AQAAAAAAAAACAAAAAAAAAAMAAAAAAAAA', 'shape': [3], 'dtype': '<i8'}} at 6247702416>
<jina.types.document.Document id=4c008c40-af9f-11eb-bb84-1e008a366d49 uri=https://static.jina.ai/logo/core/notext/light/logo.png mimeType=image/png at 6252395600>

The content will be automatically assigned to either the text, buffer, blob, or uri fields. id and mime_type are auto-generated when not given.
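
The `buffer=DDE=` and `blob={'dense': {'buffer': ...}}` values in the output above can look opaque. Assuming the repr renders raw bytes as base64 (an assumption about how the output is produced, not something this cookbook states), they can be reproduced with the standard library alone. The sketch below does not use Jina at all:

```python
import base64
import struct

# Raw bytes given to d2: a form feed (0x0c) followed by the character '1'.
raw = b'\f1'
buffer_repr = base64.b64encode(raw).decode('ascii')
print(buffer_repr)  # DDE=

# d3 holds three little-endian int64 values ('<i8'), 8 bytes each.
packed = struct.pack('<3q', 1, 2, 3)
blob_repr = base64.b64encode(packed).decode('ascii')
print(blob_repr)

# Decoding reverses the process and recovers the original integers.
values = struct.unpack('<3q', base64.b64decode(blob_repr))
print(values)  # (1, 2, 3)
```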

You can get a visualization of a Document object in Jupyter Notebook or by calling .plot().

Exclusivity of doc.content

Note that one Document can only contain one type of content: it is either text, buffer, blob or uri. Setting text first and then setting uri will clear the text field.

d = Document(text='hello world')
d.uri = 'https://jina.ai/'
assert not d.text  # True

d = Document(content='https://jina.ai')
assert d.uri == 'https://jina.ai'  # True
assert not d.text  # True
d.text = 'hello world'

assert d.content == 'hello world'  # True
assert not d.uri  # True

Conversion between doc.content

You can use the following methods to convert between .uri, .text, .buffer and .blob:

doc.convert_buffer_to_blob()
doc.convert_blob_to_buffer()
doc.convert_uri_to_buffer()
doc.convert_buffer_to_uri()
doc.convert_text_to_uri()
doc.convert_uri_to_text()

You can convert a URI to a data URI (a scheme that embeds the data inline) using doc.convert_uri_to_datauri(). This fetches the resource and inlines it.
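
For reference, a data URI carries the content itself, base64-encoded, after a `data:<mime type>;base64,` prefix. The sketch below builds one by hand with the standard library. `to_datauri` is a hypothetical helper for illustration only; it is not part of Jina and it skips the fetching step that doc.convert_uri_to_datauri() performs.

```python
import base64

def to_datauri(data: bytes, mime_type: str = 'text/plain') -> str:
    """Hypothetical helper: inline `data` as a base64 data URI."""
    payload = base64.b64encode(data).decode('ascii')
    return f'data:{mime_type};base64,{payload}'

datauri = to_datauri(b'hello world')
print(datauri)  # data:text/plain;base64,aGVsbG8gd29ybGQ=

# The original bytes can be recovered from the part after the comma.
header, _, payload = datauri.partition(',')
recovered = base64.b64decode(payload)
print(recovered)  # b'hello world'
```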

In particular, when you work with an image Document, there are some extra helpers that enable more conversion:

doc.convert_image_buffer_to_blob()
doc.convert_image_blob_to_uri()
doc.convert_image_uri_to_blob()
doc.convert_image_datauri_to_blob()

Set Embedding

An embedding is a high-dimensional representation of a Document. You can assign any NumPy ndarray as a Document's embedding.

import numpy as np
from jina import Document

d1 = Document(embedding=np.array([1, 2, 3]))
d2 = Document(embedding=np.array([[1, 2, 3], [4, 5, 6]]))

Construct with Multiple Attributes

Meta Attributes
doc.tags: A structured data value, consisting of fields which map to dynamically typed values
doc.id: A hexdigest that represents a unique Document ID
doc.weight: The weight of the Document
doc.mime_type: The mime type of the Document
doc.location: The position of the Document. This could be start and end index of a string; x,y (top, left) coordinates of an image crop; timestamp of an audio clip, etc.
doc.offset: The offset of the Document in the previous granularity Document
doc.modality: An identifier of the modality the Document belongs to

You can assign multiple attributes in the constructor via:

from jina import Document

d = Document(uri='https://jina.ai',
             mime_type='text/plain',
             granularity=1,
             adjacency=3,
             tags={'foo': 'bar'})
<jina.types.document.Document id=e01a53bc-aedb-11eb-88e6-1e008a366d48 uri=https://jina.ai mimeType=text/plain tags={'foo': 'bar'} granularity=1 adjacency=3 at 6317309200>

Construct from Dict or JSON String

You can build a Document from a dict or JSON string:

from jina import Document
import json

d = {'id': 'hello123', 'content': 'world'}
d1 = Document(d)

d = json.dumps({'id': 'hello123', 'content': 'world'})
d2 = Document(d)

Parsing Unrecognized Fields

Unrecognized fields in a dict/JSON string are automatically put into the Document's .tags field:

from jina import Document

d1 = Document({'id': 'hello123', 'foo': 'bar'})
<jina.types.document.Document id=hello123 tags={'foo': 'bar'} at 6320791056>

You can use field_resolver to map external field names to Document attributes:

from jina import Document

d1 = Document({'id': 'hello123', 'foo': 'bar'}, field_resolver={'foo': 'content'})
<jina.types.document.Document id=hello123 mimeType=text/plain text=bar at 6246985488>
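
The routing described above (known fields become attributes, everything else lands in tags, and field_resolver renames keys first) can be sketched in plain Python. `KNOWN_FIELDS` and `split_fields` are hypothetical names used only for this illustration; this is not how Jina implements it, and the real Document recognizes many more fields.

```python
# Illustrative subset of attribute names; the real Document has more.
KNOWN_FIELDS = {'id', 'content', 'text', 'uri', 'weight'}

def split_fields(raw, field_resolver=None):
    """Hypothetical helper: separate known attributes from leftover tags."""
    field_resolver = field_resolver or {}
    attributes, tags = {}, {}
    for key, value in raw.items():
        # Rename external keys first, then decide where the value belongs.
        key = field_resolver.get(key, key)
        if key in KNOWN_FIELDS:
            attributes[key] = value
        else:
            tags[key] = value
    return attributes, tags

plain = split_fields({'id': 'hello123', 'foo': 'bar'})
print(plain)  # ({'id': 'hello123'}, {'foo': 'bar'})

resolved = split_fields({'id': 'hello123', 'foo': 'bar'},
                        field_resolver={'foo': 'content'})
print(resolved)  # ({'id': 'hello123', 'content': 'bar'}, {})
```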

Construct from Another Document

Assigning a Document object to another variable does not copy it; both names refer to the same object:

from jina import Document

d = Document(content='hello, world!')
d1 = d

assert id(d) == id(d1)  # True

To make a deep copy, use copy=True:

d1 = Document(d, copy=True)

assert id(d) != id(d1)  # True
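
The difference between sharing a reference and copying is ordinary Python behaviour. As a sketch, the example below uses a nested dict in place of a Document, so nothing here exercises Jina itself:

```python
import copy

original = {'text': 'hello, world!', 'tags': {'a': 'b'}}

alias = original                 # same object, no copy at all
shallow = copy.copy(original)    # new outer dict, shared inner objects
deep = copy.deepcopy(original)   # fully independent

# Mutate the nested dict through the original name.
original['tags']['a'] = 'changed'

print(alias is original)         # True
print(shallow['tags']['a'])      # changed (the inner dict is shared)
print(deep['tags']['a'])         # b
```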

You can partially update a Document according to another source Document:

from jina import Document

s = Document(
    id='🐲',
    content='hello-world',
    tags={'a': 'b'},
    chunks=[Document(id='🐢')],
)
d = Document(
    id='🐦',
    content='goodbye-world',
    tags={'c': 'd'},
    chunks=[Document(id='🐯')],
)

# only update `id` field
d.update(s, include_fields=('id',))

# update every field except `id`, which is preserved
d.update(s, exclude_fields=('id',))

Construct from JSON, CSV, ndarray and Files

The jina.types.document.generators module lets you construct Documents from common file types such as JSON, CSV, ndarray and text files. The following functions each return a generator of Documents, where each Document object corresponds to a line/row in the original format:

from_ndjson(): Yield Documents from a line-based JSON file. Each line is a Document object
from_csv(): Yield Documents from a CSV file. Each line is a Document object
from_files(): Yield Documents from files matching a glob pattern. Each file is a Document object
from_ndarray(): Yield Documents from an ndarray. Each row (depending on axis) is a Document object

Using a generator is sometimes less memory-demanding, as it does not load/build all Document objects in one shot.
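
To see why a generator helps with memory, here is a minimal standard-library sketch of lazy line-by-line JSON reading. `iter_ndjson` is a hypothetical stand-in that yields plain dicts; it is not Jina's from_ndjson(), and an in-memory file replaces a real one.

```python
import io
import json

def iter_ndjson(fp):
    """Hypothetical stand-in for a line-based JSON reader: one dict per line."""
    for line in fp:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)

# An in-memory file keeps the example self-contained.
fake_file = io.StringIO(
    '{"id": "a", "text": "hello"}\n'
    '{"id": "b", "text": "world"}\n'
)

records = iter_ndjson(fake_file)
first = next(records)   # only the first line has been parsed so far
rest = list(records)    # the remaining lines are parsed on demand
print(first, rest)
```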

To convert the generator to DocumentArray use:

from jina import DocumentArray
from jina.types.document.generators import from_files

DocumentArray(from_files('/*.png'))

Serialize Document

You can serialize a Document into a JSON string, a Python dict or a binary string:

from jina import Document

d = Document(content='hello world')
d.json()
{
  "id": "6a1c7f34-aef7-11eb-b075-1e008a366d48",
  "mimeType": "text/plain",
  "text": "hello world"
}
d.dict()
{'id': '6a1c7f34-aef7-11eb-b075-1e008a366d48', 'mimeType': 'text/plain', 'text': 'hello world'}
d.binary_str()
b'\n$6a1c7f34-aef7-11eb-b075-1e008a366d48R\ntext/plainj\x0bhello world'

Add Recursion to Document

Recursive Attributes

A Document can nest in two directions: vertically through chunks and horizontally through matches:

doc.chunks: The list of sub-Documents of this Document. They have granularity + 1 but same adjacency
doc.matches: The list of matched Documents of this Document. They have adjacency + 1 but same granularity
doc.granularity: The recursion "depth" of the recursive chunks structure
doc.adjacency: The recursion "width" of the recursive match structure

You can add chunks (sub-Document) and matches (neighbour-Document) to a Document:

  • Add in constructor:

    d = Document(chunks=[Document(), Document()], matches=[Document(), Document()])
  • Add to existing Document:

    d = Document()
    d.chunks = [Document(), Document()]
    d.matches = [Document(), Document()]
  • Add to existing doc.chunks or doc.matches:

    d = Document()
    d.chunks.append(Document())
    d.matches.append(Document())

Note that both doc.chunks and doc.matches return DocumentArray, which we will introduce later.
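
The granularity and adjacency bookkeeping described above can be pictured with a toy model. `Node` below is a hypothetical stand-in that models only the four recursive attributes; it is a sketch of the rule, not Jina's Document class.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Toy stand-in for a Document; only the recursive attributes are modelled."""
    granularity: int = 0
    adjacency: int = 0
    chunks: List['Node'] = field(default_factory=list)
    matches: List['Node'] = field(default_factory=list)

    def add_chunk(self) -> 'Node':
        # A chunk sits one level deeper: granularity + 1, same adjacency.
        child = Node(self.granularity + 1, self.adjacency)
        self.chunks.append(child)
        return child

    def add_match(self) -> 'Node':
        # A match sits one step sideways: adjacency + 1, same granularity.
        neighbour = Node(self.granularity, self.adjacency + 1)
        self.matches.append(neighbour)
        return neighbour

root = Node()
chunk = root.add_chunk()
chunk_of_chunk = chunk.add_chunk()
match = root.add_match()

print(chunk_of_chunk.granularity, chunk_of_chunk.adjacency)  # 2 0
print(match.granularity, match.adjacency)                    # 0 1
```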

Visualize Document

To better see a Document's recursive structure, you can use the .plot() function. If you are using JupyterLab/Notebook, all Document objects are rendered automatically:

import numpy as np
from jina import Document

d0 = Document(id='🐲', embedding=np.array([0, 0]))
d1 = Document(id='🐦', embedding=np.array([1, 0]))
d2 = Document(id='🐢', embedding=np.array([0, 1]))
d3 = Document(id='🐯', embedding=np.array([1, 1]))

d0.chunks.append(d1)
d0.chunks[0].chunks.append(d2)
d0.matches.append(d3)

d0.plot()  # simply `d0` on JupyterLab

Add Relevancy to Documents

Relevance Attributes

doc.score: The relevance information of this Document
doc.evaluations: The evaluation information of this Document

You can add a relevance score to a Document object via:

from jina import Document

d = Document()
d.score.value = 0.96
d.score.description = 'cosine similarity'
d.score.op_name = 'cosine()'
<jina.types.document.Document id=0a986c50-aeff-11eb-84c1-1e008a366d48 score={'value': 0.96, 'opName': 'cosine()', 'description': 'cosine similarity'} at 6281686928>

Score information is often used together with matches. For example, an indexer typically adds matches as follows:

from jina import Document

# some query Document
q = Document()
# get match Document `m`
m = Document()
m.score.value = 0.96
q.matches.append(m)

DocumentArray API

A DocumentArray is a list of Document objects. You can construct, delete, insert, sort and traverse a DocumentArray like a Python list.

Methods supported by DocumentArray:

Python list-like interface: __getitem__, __setitem__, __delitem__, __len__, insert, append, reverse, extend, pop, remove, __iadd__, __add__, __iter__, __clear__, sort
Persistence: save, load
Advanced getters: get_attributes, get_attributes_with_docs

Construct DocumentArray

You can construct a DocumentArray from an iterable of Documents:

from jina import DocumentArray, Document

# from list
da1 = DocumentArray([Document(), Document()])

# from generator
da2 = DocumentArray((Document() for _ in range(10)))

# from another `DocumentArray`
da3 = DocumentArray(da2)

Persistence via save()/load()

To save all elements in a DocumentArray in a JSON line format:

from jina import DocumentArray, Document

da = DocumentArray([Document(), Document()])

da.save('data.json')
da1 = DocumentArray.load('data.json')

Access Element

You can access a Document in the DocumentArray via integer index, string id or slice indices:

from jina import DocumentArray, Document

da = DocumentArray([Document(id='hello'), Document(id='world'), Document(id='goodbye')])

da[0]
# <jina.types.document.Document id=hello at 5699749904>

da['world']
# <jina.types.document.Document id=world at 5736614992>

da[1:2]
# <jina.types.arrays.document.DocumentArray length=1 at 5705863632>

Sort Elements

DocumentArray is a subclass of MutableSequence and provides a list-style sort() method for sorting its elements in place, e.g.

from jina import DocumentArray, Document

da = DocumentArray(
    [
        Document(tags={'id': 1}),
        Document(tags={'id': 2}),
        Document(tags={'id': 3})
    ]
)

da.sort(key=lambda d: d.tags['id'], reverse=True)
print(da)

This sorts the elements of da in place by their tags['id'] value, in descending order:

<jina.types.arrays.document.DocumentArray length=3 at 5701440528>

{'id': '6a79982a-b6b0-11eb-8a66-1e008a366d49', 'tags': {'id': 3.0}},
{'id': '6a799744-b6b0-11eb-8a66-1e008a366d49', 'tags': {'id': 2.0}},
{'id': '6a799190-b6b0-11eb-8a66-1e008a366d49', 'tags': {'id': 1.0}}

Filter Elements

You can use Python's built-in filter() to filter elements in a DocumentArray object:

from jina import DocumentArray, Document

da = DocumentArray([Document() for _ in range(6)])

for j in range(6):
    da[j].score.value = j

for d in filter(lambda d: d.score.value > 2, da):
    print(d)
<jina.types.document.Document id=c5e588f4-b6b0-11eb-af83-1e008a366d49 score={'value': 3.0} at 5696708048>
<jina.types.document.Document id=c5e58958-b6b0-11eb-af83-1e008a366d49 score={'value': 4.0} at 5696705040>
<jina.types.document.Document id=c5e589b2-b6b0-11eb-af83-1e008a366d49 score={'value': 5.0} at 5696708048>

You can build a DocumentArray object from the filtered results:

from jina import DocumentArray, Document

da = DocumentArray([Document(weight=j) for j in range(6)])
da2 = DocumentArray(list(filter(lambda d: d.weight > 2, da)))

print(da2)
DocumentArray has 3 items:
{'id': '3bd0d298-b6da-11eb-b431-1e008a366d49', 'weight': 3.0},
{'id': '3bd0d324-b6da-11eb-b431-1e008a366d49', 'weight': 4.0},
{'id': '3bd0d392-b6da-11eb-b431-1e008a366d49', 'weight': 5.0}

Use itertools on DocumentArray

As DocumentArray is an Iterable, you can also use Python's built-in itertools module on it. This enables advanced "iterator algebra" on the DocumentArray.

For instance, you can group a DocumentArray by parent_id:

from jina import DocumentArray, Document
from itertools import groupby

da = DocumentArray([Document(parent_id=f'{i % 2}') for i in range(6)])
groups = groupby(sorted(da, key=lambda d: d.parent_id), lambda d: d.parent_id)
for key, group in groups:
    print((key, len(list(group))))
('0', 3)
('1', 3)
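
Batching is another common piece of iterator algebra. The sketch below uses itertools.islice on a plain list of strings that stands in for a DocumentArray; `batched` is a hypothetical helper written for this example, and it relies only on the input being iterable.

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of at most `size` items; works on any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# A plain list of ids stands in for a DocumentArray here.
docs = [f'doc-{i}' for i in range(7)]
batches = list(batched(docs, 3))
print([len(b) for b in batches])  # [3, 3, 1]
```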

Get Attributes in Bulk

DocumentArray implements powerful getters that let you fetch multiple attributes from the Documents it contains in one shot:

import numpy as np

from jina import DocumentArray, Document

da = DocumentArray([Document(id=1, text='hello', embedding=np.array([1, 2, 3])),
                    Document(id=2, text='goodbye', embedding=np.array([4, 5, 6])),
                    Document(id=3, text='world', embedding=np.array([7, 8, 9]))])

da.get_attributes('id', 'text', 'embedding')
[('1', '2', '3'), ('hello', 'goodbye', 'world'), (array([1, 2, 3]), array([4, 5, 6]), array([7, 8, 9]))]

This can be very useful when extracting a batch of embeddings:

import numpy as np

np.stack(da.get_attributes('embedding'))
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Access nested attributes from tags

A Document's tags field holds a map-like structure whose keys can map to arbitrary values, including nested maps.

from jina import Document

doc = Document(tags={'dimensions': {'height': 5.0, 'weight': 10.0}})

doc.tags['dimensions']
{'weight': 10.0, 'height': 5.0}

To provide easy access to nested fields, a Document lets you access an attribute by joining the parts of its qualified name with __:

from jina import Document

doc = Document(tags={'dimensions': {'height': 5.0, 'weight': 10.0}})

doc.tags__dimensions__weight
10.0
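
Conceptually, a name such as tags__dimensions__weight is a path split on `__` and followed key by key. The sketch below shows that idea on a nested dict; `get_nested` is a hypothetical helper for illustration and makes no claim about how Jina resolves such names internally.

```python
from functools import reduce

def get_nested(mapping, qualified_name, sep='__'):
    """Hypothetical helper: walk a nested dict along a '__'-separated path."""
    return reduce(lambda node, key: node[key], qualified_name.split(sep), mapping)

doc_like = {'tags': {'dimensions': {'height': 5.0, 'weight': 10.0}}}

weight = get_nested(doc_like, 'tags__dimensions__weight')
print(weight)  # 10.0

# The same lookup applied across a collection, one attribute at a time.
many = [doc_like for _ in range(3)]
heights = [get_nested(d, 'tags__dimensions__height') for d in many]
print(heights)  # [5.0, 5.0, 5.0]
```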

The same syntax lets you access nested metadata attributes in bulk from a DocumentArray.

from jina import Document, DocumentArray

da = DocumentArray([Document(tags={'dimensions': {'height': 5.0, 'weight': 10.0}}) for _ in range(10)]) 

da.get_attributes('tags__dimensions__height', 'tags__dimensions__weight')
[[5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0], [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]]