feat: reload index from local disk #18

Merged
merged 26 commits into from
Nov 25, 2021

Conversation

numb3r3
Member

@numb3r3 numb3r3 commented Nov 22, 2021

A simple example showing how to use pqlite, including reloading the index from data_path:

import numpy as np
import random

from pqlite import PQLite
from jina import Document, DocumentArray

N = 1000  # number of data points
Nq = 5
Nt = 2000
D = 128  # dimensionality / number of features

# Step 1: create indexer
index = PQLite(dim=D, data_path='./test_data')

X = np.random.random((N, D)).astype(
    np.float32
)  # N = 1,000 128-dim vectors to be indexed

docs = DocumentArray(
    [
        Document(id=f'{i}', embedding=X[i], tags={'x': random.random()})
        for i in range(N)
    ]
)
index.index(docs)

# Step 2: reload pq index from data_path
pq = PQLite(dim=D, data_path='./test_data')


X = np.random.random((Nq, D)).astype(np.float32)  # Nq 128-dim query vectors
query = DocumentArray([Document(embedding=X[i]) for i in range(Nq)])

pq.search(query)

print(f'{[m.scores["euclidean"].value for m in query[0].matches]}')
for i in range(len(query[0].matches) - 1):
    assert (
        query[0].matches[i].scores['euclidean'].value
        <= query[0].matches[i + 1].scores['euclidean'].value
    )

# clear the data folder
pq.clear()

@numb3r3 numb3r3 marked this pull request as draft November 22, 2021 02:41
@numb3r3 numb3r3 linked an issue Nov 22, 2021 that may be closed by this pull request
6 tasks
@numb3r3 numb3r3 self-assigned this Nov 22, 2021
@numb3r3 numb3r3 marked this pull request as ready for review November 23, 2021 02:51
@davidbp
Contributor

davidbp commented Nov 23, 2021

After some iterations, python examples/hnsw_benchmark.py (included in this PR) seems to fail. Can you reproduce the following?

Xtr: (124980, 128) vs Xte: (20, 128)
2021-11-23 11:42:03.020 | WARNING  | pqlite.index:train:131 - The pqlite has been trained or is not trainable. Please use ``force_retrain=True`` to retrain.
2021-11-23 11:42:46.358 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:43:30.497 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.95, 'recall': 0.95, 'train_time': 0.000240325927734375, 'index_time': 87.68162178993225, 'query_time': 0.1407299041748047, 'query_qps': 142.1162056300232, 'index_qps': 1425.384219049087, 'indexer_hyperparams': {'n_cells': 1, 'n_subvectors': 64}}
2021-11-23 11:43:30.908 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:43:31.179 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=8)
2021-11-23 11:43:33.298 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=8) with 20480 data...
2021-11-23 11:43:34.021 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:43:34.021 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/c905ae006031e55b1d8d51e87803d278
2021-11-23 11:44:19.429 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:23.197 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:24.466 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:28.833 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:30.179 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:33.951 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:35.024 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:38.036 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.99, 'recall': 0.99, 'train_time': 0.7391390800476074, 'index_time': 64.20390892028809, 'query_time': 0.2022690773010254, 'query_qps': 98.8781887319096, 'index_qps': 1946.6104494536003, 'indexer_hyperparams': {'n_cells': 8, 'n_subvectors': 64}}
2021-11-23 11:44:38.510 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:44:38.736 | INFO     | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:44:38.951 | INFO     | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:44:39.172 | INFO     | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:44:39.610 | INFO     | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:44:39.836 | INFO     | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:44:40.064 | INFO     | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:44:40.290 | INFO     | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:44:40.518 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=16)
2021-11-23 11:44:46.918 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=16) with 20480 data...
2021-11-23 11:44:47.653 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:44:47.653 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/75115be8393181300ec49112b88b2445
2021-11-23 11:45:30.760 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:45:46.906 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.9850000000000001, 'recall': 0.9850000000000001, 'train_time': 0.7362098693847656, 'index_time': 59.48536229133606, 'query_time': 0.28374195098876953, 'query_qps': 70.48658095958322, 'index_qps': 2101.021077889663, 'indexer_hyperparams': {'n_cells': 16, 'n_subvectors': 64}}
2021-11-23 11:45:47.490 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:45:47.970 | INFO     | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:45:48.488 | INFO     | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:45:48.952 | INFO     | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:45:49.400 | INFO     | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:45:49.650 | INFO     | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:45:50.114 | INFO     | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:45:50.552 | INFO     | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:45:50.987 | INFO     | pqlite.index:clear:259 - Clear the index of cell-8
2021-11-23 11:45:51.448 | INFO     | pqlite.index:clear:259 - Clear the index of cell-9
2021-11-23 11:45:51.888 | INFO     | pqlite.index:clear:259 - Clear the index of cell-10
2021-11-23 11:45:52.329 | INFO     | pqlite.index:clear:259 - Clear the index of cell-11
2021-11-23 11:45:52.773 | INFO     | pqlite.index:clear:259 - Clear the index of cell-12
2021-11-23 11:45:53.211 | INFO     | pqlite.index:clear:259 - Clear the index of cell-13
2021-11-23 11:45:53.662 | INFO     | pqlite.index:clear:259 - Clear the index of cell-14
2021-11-23 11:45:54.112 | INFO     | pqlite.index:clear:259 - Clear the index of cell-15
2021-11-23 11:45:54.555 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=32)
2021-11-23 11:46:10.758 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=32) with 20480 data...
2021-11-23 11:46:11.500 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:46:11.500 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/8f37c3b2ffd1c67e4c81e81f64db0eea
2021-11-23 11:47:01.267 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 1.0, 'recall': 1.0, 'train_time': 0.7422680854797363, 'index_time': 49.93944001197815, 'query_time': 0.30017614364624023, 'query_qps': 66.62754660333749, 'index_qps': 2502.6311862933007, 'indexer_hyperparams': {'n_cells': 32, 'n_subvectors': 64}}
2021-11-23 11:47:01.804 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:47:02.336 | INFO     | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:47:02.945 | INFO     | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:47:03.533 | INFO     | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:47:04.493 | INFO     | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:47:05.305 | INFO     | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:47:06.152 | INFO     | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:47:06.923 | INFO     | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:47:07.710 | INFO     | pqlite.index:clear:259 - Clear the index of cell-8
2021-11-23 11:47:08.489 | INFO     | pqlite.index:clear:259 - Clear the index of cell-9
2021-11-23 11:47:09.025 | INFO     | pqlite.index:clear:259 - Clear the index of cell-10
2021-11-23 11:47:09.471 | INFO     | pqlite.index:clear:259 - Clear the index of cell-11
2021-11-23 11:47:09.901 | INFO     | pqlite.index:clear:259 - Clear the index of cell-12
2021-11-23 11:47:10.344 | INFO     | pqlite.index:clear:259 - Clear the index of cell-13
2021-11-23 11:47:10.775 | INFO     | pqlite.index:clear:259 - Clear the index of cell-14
2021-11-23 11:47:11.211 | INFO     | pqlite.index:clear:259 - Clear the index of cell-15
2021-11-23 11:47:11.639 | INFO     | pqlite.index:clear:259 - Clear the index of cell-16
2021-11-23 11:47:12.076 | INFO     | pqlite.index:clear:259 - Clear the index of cell-17
2021-11-23 11:47:12.503 | INFO     | pqlite.index:clear:259 - Clear the index of cell-18
2021-11-23 11:47:12.932 | INFO     | pqlite.index:clear:259 - Clear the index of cell-19
2021-11-23 11:47:13.367 | INFO     | pqlite.index:clear:259 - Clear the index of cell-20
2021-11-23 11:47:13.815 | INFO     | pqlite.index:clear:259 - Clear the index of cell-21
2021-11-23 11:47:14.261 | INFO     | pqlite.index:clear:259 - Clear the index of cell-22
2021-11-23 11:47:14.700 | INFO     | pqlite.index:clear:259 - Clear the index of cell-23
2021-11-23 11:47:15.132 | INFO     | pqlite.index:clear:259 - Clear the index of cell-24
2021-11-23 11:47:15.587 | INFO     | pqlite.index:clear:259 - Clear the index of cell-25
2021-11-23 11:47:16.032 | INFO     | pqlite.index:clear:259 - Clear the index of cell-26
2021-11-23 11:47:16.472 | INFO     | pqlite.index:clear:259 - Clear the index of cell-27
2021-11-23 11:47:16.903 | INFO     | pqlite.index:clear:259 - Clear the index of cell-28
2021-11-23 11:47:17.350 | INFO     | pqlite.index:clear:259 - Clear the index of cell-29
2021-11-23 11:47:17.787 | INFO     | pqlite.index:clear:259 - Clear the index of cell-30
2021-11-23 11:47:18.227 | INFO     | pqlite.index:clear:259 - Clear the index of cell-31
2021-11-23 11:47:18.661 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=64)
2021-11-23 11:47:56.044 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=64) with 20480 data...
2021-11-23 11:47:57.003 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:47:57.004 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/e01ce8063d859fe594084b33a10515e8
2021-11-23 11:48:55.512 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
Traceback (most recent call last):
  File "examples/hnsw_benchmark.py", line 95, in <module>
    pq.search(docs, limit=top_k)
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/index.py", line 238, in search
    match_dists, match_docs = self.search_cells(
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/container.py", line 144, in search_cells
    dists, doc_ids, cells = self.ivf_search(
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/container.py", line 107, in ivf_search
    _dists, _doc_idx = self.vec_index(cell_id).search(
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/core/index/hnsw/index.py", line 77, in search
    ids, dists = self._index.knn_query(query, k=limit)
RuntimeError: Cannot return the results in a contigious 2D array. Probably ef or M is too small

@numb3r3
Member Author

numb3r3 commented Nov 24, 2021

The failure occurs because the total number of items in the index is less than limit (see nmslib/hnswlib#244). I will fix it.
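
One possible guard against this kind of failure is to cap the requested number of neighbours at the number of items the index currently holds. The sketch below is only an illustration under stated assumptions, not the fix merged in this PR: clamp_limit and safe_knn_query are hypothetical helper names, and the index object is assumed to expose hnswlib-style get_current_count() and knn_query(data, k=...) methods.

```python
from typing import Any, Tuple


def clamp_limit(limit: int, n_items: int) -> int:
    """Return the largest k that can be requested from an index holding n_items."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return min(limit, n_items)


def safe_knn_query(index: Any, query: Any, limit: int) -> Tuple[Any, Any]:
    """Query `index` for at most `limit` neighbours without asking for more than it holds.

    Hypothetical helper: `index` is assumed to provide hnswlib-style
    `get_current_count()` and `knn_query(data, k=...)`.
    """
    k = clamp_limit(limit, index.get_current_count())
    if k == 0:
        # Empty index (for example an empty cell): skip the query entirely.
        return [], []
    return index.knn_query(query, k=k)
```

With a guard like this, a caller asking for more neighbours than the index contains would receive fewer results instead of an exception. Whether that is the right behaviour for pqlite is a design choice; elements marked as deleted and a too-small ef setting may need separate handling.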

Review comment on pqlite/core/index/base.py (outdated, resolved)
@numb3r3 numb3r3 merged commit b3a3859 into main Nov 25, 2021
@numb3r3 numb3r3 deleted the feat-reload-index branch December 1, 2021 05:12
Successfully merging this pull request may close these issues.

PQlite: restore index from local storage
3 participants