docs(fundamentals): more polishing (#3924)
* docs(fundamentals): polish document api

* docs(fundamentals): polish documentarray api

* docs(fundamentals): polish documentmemmaparray api
alexcg1 committed Nov 13, 2021
1 parent 70643b7 commit 62f244f
Showing 3 changed files with 114 additions and 114 deletions.
76 changes: 38 additions & 38 deletions docs/fundamentals/document/document-api.md
# Document

`Document` is Jina's basic data type. Whether you're working with text, image, video, audio, or 3D meshes, they are
all `Document`s in Jina.

## Minimum working example
```python
import numpy
from jina import Document

d3 = Document(blob=numpy.array([1, 2, 3]))
```

### Content

`text`, `blob`, and `buffer` are the three content attributes of a Document. They correspond to string-like data (e.g. for natural language), `ndarray`-like data (e.g. for image/audio/video data), and binary data for general purpose, respectively. Each Document can contain only one type of content.

| Attribute | Accept type | Use case |
| --- | --- | --- |
| `doc.text` | Python string | Contain text |
| `doc.blob` | Numpy `ndarray`, SciPy sparse matrix (`spmatrix`), TensorFlow dense & sparse tensor, PyTorch dense & sparse tensor, PaddlePaddle dense tensor | Contain image/video/audio |
| `doc.buffer` | Binary string | Contain intermediate IO buffer |

````{admonition} Exclusivity of the content
:class: important
Note that one `Document` can only contain one type of `content`: either `text`, `buffer`, or `blob`. If you set one, the others will be cleared.
```python
import numpy as np
from jina import Document

d = Document(text='hello')
d.blob = np.array([1, 2, 3])  # setting `blob` clears `text`
d.text  # <- now it's empty
```
````

````{admonition} Why a Document contains only one data type
:class: question
What if you want to represent more than one kind of information? Say, to fully represent a PDF page you need to store both image and text. In this case, you can use {ref}`nested Document<recursive-nested-document>`s by putting image into one sub-Document, and text into another.
```python
d = Document(chunks=[Document(blob=...), Document(text=...)])
```
The principle is that each Document contains only one modality. This makes the whole logic clearer.
````

```{tip}
There is also a `doc.content` sugar getter/setter for the above non-empty field.
```

#### Loading content from URI

Often you need to load data from a URI instead of assigning it directly in your code. In that case, `.uri` is the attribute you need.

After setting `.uri`, you can load data into `.text`/`.buffer`/`.blob` as follows.

The value of `.uri` can point to a local URI, a remote URI, or a [data URI](https://en.wikipedia.org/wiki/Data_URI_scheme).

````{tab} Local image URI
```text
blob [[[255 255 255]
 ...]]]
```
````

There are more `.convert_uri_to_*` functions that allow you to read {ref}`text<text-type>`, {ref}`image<image-type>`, {ref}`video<video-type>`, {ref}`3D mesh<mesh-type>`, {ref}`audio<audio-type>` and {ref}`tabular<table-type>` data into Jina.

```{admonition} Write to data URI
:class: tip
Inline data URI is helpful when you need a quick visualization in HTML, as it embeds all resources directly into that HTML.
You can convert a URI to a data URI using `doc.convert_uri_to_datauri()`. This will fetch the resource and make it inline.
```
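For intuition, a data URI is just the resource's bytes, base64-encoded and prefixed with its MIME type. A minimal stdlib sketch of building one by hand (not the Jina API; `doc.convert_uri_to_datauri()` does this for you):

```python
import base64

# A data URI embeds the payload inline: scheme, MIME type, encoding, then base64 data.
payload = b'hello'
data_uri = 'data:text/plain;base64,' + base64.b64encode(payload).decode()
print(data_uri)  # data:text/plain;base64,aGVsbG8=
```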

Embedding is a multi-dimensional representation of a `Document` (often a `[1, D]` vector). It plays a central role in neural search.

Document has an attribute `.embedding` to contain the embedding information.

Like `.blob`, you can assign it with Numpy `ndarray`, SciPy sparse matrix (`spmatrix`), TensorFlow dense and sparse tensor, PyTorch dense and sparse tensor, or PaddlePaddle dense tensor.

```python
import numpy as np
import tensorflow as tf
from jina import Document

d5 = Document(embedding=tf.sparse.from_dense(np.array([[1, 2, 3], [4, 5, 6]])))
```
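For intuition, an embedding is just a vector. A minimal numpy sketch (independent of Jina's API) producing a `[1, D]` embedding by mean-pooling token vectors:

```python
import numpy as np

# Mean-pool three 4-dimensional "token" vectors into a single [1, 4] embedding.
tokens = np.array([[1.0, 0.0, 2.0, 0.0],
                   [3.0, 0.0, 0.0, 2.0],
                   [2.0, 0.0, 1.0, 1.0]])
embedding = tokens.mean(axis=0, keepdims=True)
print(embedding.shape)  # (1, 4)
```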

#### Finding nearest-neighbours

Once a Document has `.embedding` filled, it can be "matched". In this example, we build ten Documents and put them into a {ref}`DocumentArray<da-intro>`, and then use another Document to search against them.

```python
import numpy as np
from jina import DocumentArray, Document

# build ten Documents with random embeddings and index them in a DocumentArray
da = DocumentArray([Document(embedding=np.random.random(256)) for _ in range(10)])

# use another Document as the query
q = Document(embedding=np.random.random(256))
q.match(da)

print(q.matches[0])
```
```{admonition} Working on multiple queries
:class: tip
When you want to match a set of Documents (let's call it set `A`) against another set of Documents (set `B`), where you want to find for each element in `A` its nearest neighbours in `B`, then you need `DocumentArray.match()`. It is far more efficient than looping over each Document. You can {ref}`read more about it here<match-documentarray>`.
```
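Under the hood, matching is nearest-neighbour search over embeddings. A self-contained sketch in plain Python (not the Jina API) of what "match" computes, using cosine similarity:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

docs = {'d1': [1.0, 0.0], 'd2': [0.0, 1.0], 'd3': [1.0, 1.0]}
query = [2.0, 1.0]

# Rank documents by descending cosine similarity to the query.
matches = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
print(matches[0])  # d3 points in the direction closest to the query
```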

(recursive-nested-document)=
### Recursive Documents

`Document` can be recursive both horizontally and vertically. The following graphic illustrates the recursive `Document` structure. Each `Document` can have multiple "Chunks"
and "matches". Chunks and matches are `Document` objects as well.

<img src="https://hanxiao.io/2020/08/28/What-s-New-in-Jina-v0-5/blog-post-v050-protobuf-documents.jpg">

…of {ref}`DocumentArray<documentarray>`. We will introduce `DocumentArray` later.

#### Caveat: order matters

When adding sub-Documents to `Document.chunks`, avoid creating them in one line, otherwise the recursive Document structure will not be correct. This is because `chunks` use `ref_doc` to control their `granularity`. At `chunk` creation time the `chunk` doesn't know anything about its parent, and will get a wrong `granularity` value.

````{tab} ✅ Do
```python
root_document = Document(text='i am root')
# the parent exists first, so each chunk gets the right `granularity`
root_document.chunks = [
    Document(text='i am chunk 1'),
    Document(text='i am chunk 2'),
]
```
````

````{tab} ❌ Don't
```python
root_document = Document(
    text='i am root',
    # chunks created before the parent is fully constructed
    chunks=[
        Document(text='i am chunk 1'),
        Document(text='i am chunk 2'),
    ],
)
```
````
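Why order matters can be pictured with a toy parent-aware node (plain Python, not Jina's classes): a child derives its granularity from its parent at creation time, so a chunk created before its parent is known ends up with the wrong value.

```python
class Node:
    """Toy stand-in for a Document that derives granularity from its parent."""

    def __init__(self, parent=None):
        self.granularity = 0 if parent is None else parent.granularity + 1
        self.chunks = []

    def add_chunk(self):
        # parent is known at creation time -> granularity is correct
        child = Node(parent=self)
        self.chunks.append(child)
        return child

root = Node()
good = root.add_chunk()

orphan = Node()             # created with no parent ...
root.chunks.append(orphan)  # ... and attached afterwards: granularity stays wrong

print(good.granularity, orphan.granularity)  # 1 0
```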
### Tags

`Document` has a `tags` field that holds a map-like structure for arbitrary values.
In practice, you can store meta information in `tags`.

```python
from jina import Document

# tags can be set at construction time
doc = Document(tags={'dimensions': {'weight': 10.0, 'height': 5.0, 'last_modified': 'Monday'}})

doc.tags['dimensions']
```

```text
{'weight': 10.0, 'height': 5.0, 'last_modified': 'Monday'}
```

To provide easy access to nested fields, the `Document` allows you to access attributes by composing the attribute
qualified name with interlaced `__` symbols:

```python
from jina import DocumentArray

da = DocumentArray([doc])
da.get_attributes('tags__dimensions__height', 'tags__dimensions__weight')
```
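The `__` path syntax can be pictured as walking a nested mapping; a plain-Python sketch (not Jina's implementation):

```python
def resolve(mapping, path):
    """Walk a nested dict following a '__'-separated attribute path."""
    node = mapping
    for key in path.split('__'):
        node = node[key]
    return node

tags = {'dimensions': {'height': 5.0, 'weight': 10.0}}
print(resolve(tags, 'dimensions__height'))  # 5.0
```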
As `tags` does not have a fixed schema, it is declared with type `google.protobuf.Struct` in the `DocumentProto`
protobuf declaration. However, `google.protobuf.Struct` follows the JSON specification and does not
differentiate `int` from `float`. So, data of type `int` in `tags` will **always** be cast to `float` when a request is
sent to an Executor.
As a result, users need to be explicit and cast the data to the expected type as follows:
````{tab} ✅ Do
```{code-block} python
# cast explicitly back to the expected type (`some_int` is a hypothetical tag key)
int(doc.tags['some_int'])
```
````
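The effect can be reproduced with the stdlib `json` module by forcing Struct-like number semantics, where every number is parsed as a double (a sketch, not Jina code):

```python
import json

# Simulate google.protobuf.Struct semantics: every number is a double.
tags = json.loads(json.dumps({'an_int': 10}), parse_int=float)
print(type(tags['an_int']))  # the int arrived as the float 10.0

n = int(tags['an_int'])      # cast back to the expected type explicitly
```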

### Set/unset attributes

Set an attribute as you would with any Python object:

```python
from jina import Document

d = Document()
d.text = 'hello world'

# unset attributes by popping them
d.pop('text', 'id', 'mime_type')
```
### Add relevancy

```{tip}
You normally don't need to add those values by yourself. For example, the `.match()` function automatically fills in
`.scores` of each matched Document.
```

```python
# (iterating over evaluation scores; the surrounding example is collapsed in this diff)
for evaluation_key, evaluation_score in d.evaluations.items():
    ...
```
| Attribute | Description |
| --- | --- |
| `doc.id` | A hexdigest that represents a unique Document ID |
| `doc.parent_id` | A hexdigest that represents the Document's parent id |
| `doc.weight` | The weight of the Document |
| `doc.mime_type` | The mime type of the Document |
| `doc.content_type` | The content type of the Document |

#### Construct from JSON, CSV, files

The `jina.types.document.generators` module lets you construct `Document`s from common file types such as JSON,
CSV, `ndarray` and text files. The following functions return a generator of `Document`s, where each `Document` object
corresponds to a line/row in the original format:

| | |
| --- | --- |
| `from_ndjson()` | Yield `Document` from a line-based JSON file. Each line is a `Document` object |
| `from_csv()` | Yield `Document` from a CSV file. Each line is a `Document` object |
| `from_files()` | Yield `Document` from a glob of files. Each file is a `Document` object |
| `from_ndarray()` | Yield `Document` from an `ndarray`. Each row (depending on `axis`) is a `Document` object |
| `from_lines()` | Yield `Document` from lines, JSON and CSV |

Using a generator is sometimes less memory-demanding, as it does not load/build all Document objects in one shot.
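A `from_ndjson()`-style generator can be pictured in a few lines of plain Python (a sketch with dicts standing in for `Document`s, not the Jina implementation):

```python
import json

def from_ndjson(lines):
    """Yield one dict per line of newline-delimited JSON, lazily."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

ndjson = '{"text": "hello"}\n{"text": "world"}\n'
docs = list(from_ndjson(ndjson.splitlines()))
print(docs)  # [{'text': 'hello'}, {'text': 'world'}]
```

Because the function yields one item at a time, only the current line needs to be parsed and held in memory.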

