docs: migrate from memmap to sqlite #4348

alaeddine-13 · 2022-02-15T16:43:58Z

docs: migrate from memmap to sqlite

github-actions · 2022-02-15T16:49:09Z

Latency summary

Current PR yields:

🐢🐢 index QPS at 860, delta to last 2 avg.: -28%
😶 query QPS at 36, delta to last 2 avg.: -1%
🐎🐎🐎🐎 avg flow time within 2.2014 seconds, delta to last 2 avg.: +46%
😶 import jina within 0.4382 seconds, delta to last 2 avg.: +5%

Breakdown

| Version | Index QPS | Query QPS | Avg Flow Time (s) | Import Time (s) |
| --- | --- | --- | --- | --- |
| current | 860 | 36 | 2.2014 | 0.4382 |
| `2.7.0` | 1209 | 51 | 1.4446 | 0.3509 |
| `2.6.4` | 1181 | 21 | 1.5656 | 0.4773 |

Backed by latency-tracking. Further commits will update this comment.

JohannesMessner · 2022-02-15T17:07:15Z

docs/get-started/migrate.md

+
+### DocumentArray: new storage options
+Jina 2 used to offer persistence of DocumentArray through `DocumentArrayMemmap`. In Jina 3, this data structure is 
+deprecated and we introduce different [Document Stores](https://docarray.jina.ai/advanced/document-store/) within the 
+`DocumentArray` API. Thus, you can enjoy a consistent `DocumentArray` API across different storage backends and leverage
+ modern databases.
+
+For example, you can use [SQLite backend](https://docarray.jina.ai/advanced/document-store/sqlite/) as a replacement 
+for `DocumentArrayMemmap`:
+
+```python
+from docarray import Document, DocumentArray
+das = DocumentArray(storage='sqlite', config={'connection': 'my_connection', 'table_name': 'my_table_name'})
+das.extend([Document() for _ in range(10)])
+```
+
+This persists the Documents to disk using SQLite, so you can load them again in another 
+session:
+
+```python
+from docarray import DocumentArray
+das = DocumentArray(storage='sqlite', config={'connection': 'my_connection', 'table_name': 'my_table_name'})
+das.summary()
+```
+
+
+```text
+        Documents Summary         
+
+  Length                 10       
+  Homogenous Documents   True     
+  Common Attributes      ('id',)  
+
+                     Attributes Summary                     
+
+  Attribute   Data type   #Unique values   Has empty value  
+ ────────────────────────────────────────────────────────── 
+  id          ('str',)    10               False            
+
+                      Storage Summary                       
+
+  Backend                  SQLite (https://www.sqlite.org)  
+  Connection               my_connection                    
+  Table Name               my_table_name                    
+  Serialization Protocol                                    
+  Class                    DocumentArraySqlite
+```
+
+The API is **almost the same** as the deprecated `DocumentArrayMemmap` and is consistent across storage backends and 
+in-memory storage. Furthermore, some Document Stores offer fast Nearest Neighbor algorithms and are more convenient in 
+production.
+
+````{admonition} See Also
+:class: seealso
+Read more about [Document Stores](https://docarray.jina.ai/advanced/document-store/) in DocArray
+````


I am worried that this does not really fit the purpose of the migration guide, which really just tells people what they need to do to get their code working with Jina 3, as succinctly as possible. For further explanations it should refer to external resources. As you can see, document stores would be by far the longest item here.

So I suggest the following compromise:

Suggested change


**New storage options**:

Jina 2 used to offer persistence of DocumentArray through `DocumentArrayMemmap`. In Jina 3, this data structure is
deprecated and we introduce different [Document Stores](https://docarray.jina.ai/advanced/document-store/) within the
`DocumentArray` API. Thus, you can enjoy a consistent `DocumentArray` API across different storage backends and leverage
modern databases, such as [SQLite backend](https://docarray.jina.ai/advanced/document-store/sqlite/), while using an API that is **almost the same** as the deprecated `DocumentArrayMemmap`.

For example, you can use [SQLite backend](https://docarray.jina.ai/advanced/document-store/sqlite/) as a replacement
for `DocumentArrayMemmap`, which lets you persist Documents to disk and load them in another session:

````{tab} Storing to disk
```python
from docarray import Document, DocumentArray

docs = DocumentArray(storage='sqlite', config={'connection': 'my_connection', 'table_name': 'my_table_name'})
docs.extend([Document() for _ in range(10)])
```
````

````{tab} Loading from disk
```python
from docarray import DocumentArray

docs = DocumentArray(storage='sqlite', config={'connection': 'my_connection', 'table_name': 'my_table_name'})
```
````

I removed the info box at the bottom because it just links to storage backends, but those are already linked at the top. I tried to cut as much as possible while still keeping the key information in place.

OK, so the code suggestion is messed up (the box with the second tab should be inside of it), but yeah.
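
For readers who want to see what "persist Documents to disk and load them in another session" amounts to, here is a minimal sketch that uses only Python's standard `sqlite3` module. It is not DocArray's storage code and makes no claim about how the SQLite backend lays out its tables. The file name, the `doc_id` column, and the helpers `store_ids` and `load_ids` are hypothetical, chosen only to mirror the `connection` and `table_name` values in the suggestion.

```python
import os
import sqlite3
import tempfile

# Hypothetical database file, created in a temp directory for this illustration.
db_path = os.path.join(tempfile.mkdtemp(), "my_connection.db")


def store_ids(path, doc_ids):
    """First 'session': open the file, write the rows, commit, and close."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS my_table_name (doc_id TEXT PRIMARY KEY)"
        )
        conn.executemany(
            "INSERT INTO my_table_name (doc_id) VALUES (?)",
            [(doc_id,) for doc_id in doc_ids],
        )
        conn.commit()
    finally:
        conn.close()


def load_ids(path):
    """Second 'session': a fresh connection reads the rows written earlier."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT doc_id FROM my_table_name ORDER BY doc_id"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


store_ids(db_path, [f"doc-{i}" for i in range(10)])
print(len(load_ids(db_path)))
```

Because the data lives in a file and not in process memory, the second connection finds the rows even though the first one has been closed. The two tabs in the suggestion rely on the same property.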

github-actions · 2022-02-16T07:44:00Z

📝 Docs are deployed on https://docs-sqlite-migration--jina-docs.netlify.app 🎉

JohannesMessner · 2022-02-16T07:49:19Z

concise enough for me now ;)

docs: migrate from memmap to sqlite

86c61d5

github-actions bot added size/S area/docs This issue/PR affects the docs labels Feb 15, 2022

JoanFM previously approved these changes Feb 15, 2022

JohannesMessner reviewed Feb 15, 2022

alaeddine-13 requested a review from alexcg1 February 15, 2022 17:19

docs: apply suggestions

9ddf121

alaeddine-13 dismissed JoanFM’s stale review via 9ddf121 February 16, 2022 07:39

JoanFM approved these changes Feb 16, 2022

JoanFM merged commit d566e5b into master Feb 16, 2022

JoanFM deleted the docs-sqlite-migration branch February 16, 2022 08:04
