docs(readme): polish (#16)

alexcg1 committed Jan 13, 2022
1 parent 0a88beb commit 995dbd3

<!-- start elevator-pitch -->

DocArray is a library for nested, unstructured data such as text, image, audio, video, or 3D mesh. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer the data with a Pythonic API.

🌌 **All data types**: a super-expressive data structure for representing complicated, mixed, and nested text, image, video, audio, and 3D mesh data.

🐍 **Pythonic experience**: designed to be as easy as a Python list. If you know how to Python, you know how to DocArray. Intuitive idioms and type annotation simplify the code you write.

🧑‍🔬 **Data science powerhouse**: greatly accelerate data scientists' work on embedding, matching, visualizing, and evaluating via Torch/TensorFlow/ONNX/PaddlePaddle on CPU/GPU.

🚡 **Portable**: ready to wire at any time with efficient and compact serialization from/to Protobuf, bytes, base64, JSON, CSV, DataFrame.

<!-- end elevator-pitch -->

DocArray consists of two simple concepts:

- **Document**: a data structure for easily representing nested, unstructured data.
- **DocumentArray**: a container for efficiently accessing, manipulating, and understanding multiple Documents.
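The two concepts can be pictured with a toy sketch — `Doc` and `DocArray` here are hypothetical stand-ins to illustrate the idea, not the real API:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Doc:
    # Toy stand-in for a Document: typed fields plus nesting via chunks.
    text: Optional[str] = None
    embedding: Optional[List[float]] = None
    chunks: List["Doc"] = field(default_factory=list)  # nested sub-documents

class DocArray(list):
    # Toy stand-in for a DocumentArray: a list with bulk-access helpers.
    def texts(self):
        return [d.text for d in self]

da = DocArray([Doc(text="hello"), Doc(text="world")])
print(da.texts())  # ['hello', 'world']
```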

### A 10-liner for text matching

Let's search for the top-5 similar sentences to <kbd>she smiled too much</kbd> in "Pride and Prejudice".

```python
from docarray import Document, DocumentArray

d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').load_uri_to_text()
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())
da.apply(lambda d: d.embed_feature_hashing())

q = (
    Document(text='she smiled too much')
    .embed_feature_hashing()
    .match(da, metric='jaccard', use_scipy=True)
)

print(q.matches[:5, ('text', 'scores__jaccard__value')])
```

```text
[0.3333333333333333, 0.6666666666666666, 0.7, 0.7272727272727273, 0.75]]
```

Here the feature embedding is done by simple [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) and the distance metric is [Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index). You have better embeddings? Of course you do! We look forward to seeing your results!
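For intuition, both ideas fit in a few lines of plain Python — a minimal sketch, not DocArray's actual implementation (md5 replaces Python's per-run randomized `hash()` so results are stable):

```python
import hashlib

def hash_features(text: str, dim: int = 256) -> set:
    # Feature hashing: map each token into one of `dim` buckets via a stable hash.
    return {int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
            for tok in text.lower().split()}

def jaccard_distance(a: set, b: set) -> float:
    # 1 - |A ∩ B| / |A ∪ B|: 0 for identical feature sets, 1 for disjoint ones.
    return 1 - len(a & b) / len(a | b) if (a | b) else 0.0

q = hash_features('she smiled too much')
print(jaccard_distance(q, hash_features('but she smiled too much')))
```

Sentences sharing more tokens land in more of the same buckets, so their Jaccard distance is smaller.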

### A complete workflow of visual search

Let's use DocArray and the [Totally Looks Like](https://sites.google.com/view/totally-looks-like-dataset) dataset to build a simple meme image search. The dataset contains 6,016 image pairs stored in `/left` and `/right`. Images that share the same filename are perceptually similar.

Our problem: given an image from `/left`, can we find its most-similar image in `/right`? (Without looking at the filename, of course.)
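Under the hood, this is a nearest-neighbor search over image embeddings. A brute-force sketch with toy 3-dimensional vectors (real embeddings would come from a model):

```python
def cosine_distance(a, b):
    # 1 - cosine similarity; smaller means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return 1 - dot / (na * nb)

def best_match(query, candidates):
    # Index of the candidate embedding closest to the query embedding.
    return min(range(len(candidates)),
               key=lambda i: cosine_distance(query, candidates[i]))

left = [1.0, 0.0, 1.0]                                       # embedding of a /left image
right = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.9], [0.5, 0.5, 0.5]]  # embeddings of /right images
print(best_match(left, right))  # 1
```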

### Load images

First, load the images:

```python
from docarray import DocumentArray

left_da = DocumentArray.from_files('left/*.jpg')

left_da.plot_image_sprites()  # preview all images in one sprite
```

### Apply preprocessing

Let's do some standard computer vision preprocessing:

```python
from docarray import Document

def preproc(d: Document):
    return (d.load_uri_to_image_blob()              # load
             .set_image_blob_shape((200, 200))      # resize all to 200x200
             .set_image_blob_normalization()        # normalize color
             .set_image_blob_channel_axis(-1, 0))   # switch color axis for the PyTorch model later

left_da.apply(preproc)
```

Did I mention `apply` works in parallel?
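The idea behind a parallel apply can be sketched with the standard library alone — this is a toy illustration, not DocArray's actual machinery:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_apply(items, fn, workers=4):
    # Map fn over items concurrently while preserving input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

print(parallel_apply([1, 2, 3, 4], lambda x: x * x))  # [1, 4, 9, 16]
```

`ThreadPoolExecutor.map` guarantees results come back in input order, which is exactly what an element-wise apply needs.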

### Embed images

Now convert images into embeddings using a pretrained ResNet50:
```python
import torchvision

model = torchvision.models.resnet50(pretrained=True)  # load ResNet50
left_da.embed(model, device='cuda')                   # embed via GPU to speed up
```

This step takes ~30 seconds on a GPU. Besides PyTorch, you can also use TensorFlow, PaddlePaddle, or ONNX models in `.embed(...)`.

### Visualize embeddings

Better see it.
<a href="https://docarray.jina.ai"><img src="https://github.com/jina-ai/docarray/blob/main/.github/README-img/9nn.png?raw=true" alt="Visualizing top-9 matches using DocArray API" height="250px"></a>
</p>

What we did here is revert the preprocessing steps (i.e. switching the axis and normalizing) on the copied matches, so that you can visualize them using image sprites.
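Switching the channel axis back (channel-first to channel-last) is just a transpose. A toy sketch on nested lists — real code would of course use numpy or torch:

```python
def channels_last(img):
    # img is channel-first [C][H][W]; return channel-last [H][W][C].
    c, h, w = len(img), len(img[0]), len(img[0][0])
    return [[[img[ch][y][x] for ch in range(c)]
             for x in range(w)]
            for y in range(h)]

img = [[[1, 2]], [[3, 4]]]   # a 2-channel, 1x2 "image": shape (2, 1, 2)
print(channels_last(img))    # [[[1, 3], [2, 4]]] -> shape (1, 2, 2)
```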

### Quantitative evaluation

recall@5 0.0573470744680851

More metrics can be used, such as `precision_at_k`, `ndcg_at_k`, and `hit_at_k`.
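Such a metric is simple to compute by hand. A sketch of recall@k — `recall_at_k` here is a hypothetical helper, not the DocArray API:

```python
def recall_at_k(matches, relevant, k):
    # Fraction of queries whose relevant item appears among the top-k matches.
    hits = sum(1 for m, r in zip(matches, relevant) if r in m[:k])
    return hits / len(relevant)

# Three queries, each with exactly one relevant id.
matches = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
relevant = ["b", "w", "p"]
print(recall_at_k(matches, relevant, 1))  # only query 3 hits at k=1
print(recall_at_k(matches, relevant, 5))  # queries 1 and 3 hit at k=5
```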

If you think a pretrained ResNet50 is good enough, let me tell you: with [Finetuner](https://github.com/jina-ai/finetuner) you could do much better in just 10 extra lines of code. [Here is how](https://finetuner.jina.ai/get-started/totally-looks-like/).


### Save results

You can save a DocumentArray to binary, JSON, dict, DataFrame, CSV, or a Protobuf message, with or without compression. In its simplest form:

```python
left_da.save('left_da.bin')
```

To reuse it, do `left_da = DocumentArray.load('left_da.bin')`.
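The save/load round trip is plain serialize-then-deserialize. A toy sketch using JSON as a stand-in for the real compact binary format:

```python
import json
import os
import tempfile

docs = [{"id": "d1", "text": "hello"}, {"id": "d2", "text": "world"}]

path = os.path.join(tempfile.mkdtemp(), "da.json")
with open(path, "w") as f:   # save
    json.dump(docs, f)
with open(path) as f:        # load
    restored = json.load(f)

print(restored == docs)  # True
```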

If you want to transfer a DocumentArray from one machine to another or share it with your colleagues, you can do:
```python
left_da.push(token='my_shared_da')
```
Intrigued? That's only scratching the surface of what DocArray is capable of. [Read our docs](https://docarray.jina.ai) to learn more.

## Join Us

DocArray is backed by [Jina AI](https://jina.ai) and licensed under [Apache-2.0](./LICENSE). [We are actively hiring](https://jobs.jina.ai) AI engineers and solution engineers to build the next neural search ecosystem in open source.

<!-- end support-pitch -->
