
Increase Lucene max dimension limit to 16,000 #1346

Merged: 2 commits merged into opensearch-project:main on Dec 13, 2023

Conversation

@junqiu-lei (Member) commented Dec 12, 2023

Description

Since Lucene moved the max dimension limit into the codec, we are able to change it in the k-NN Lucene codec. This PR updates the Lucene engine's max dimension limit to 16,000, which is consistent with the other k-NN engines.
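For context, here is a minimal sketch of the mechanism, assuming Lucene 9.8+ where `KnnVectorsFormat#getMaxDimensions` became overridable by codecs; the class name and the delegate format below are illustrative, not the plugin's actual code:

```java
import java.io.IOException;

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Illustrative sketch: a vectors format that raises Lucene's per-field
// dimension cap (default 1024) while delegating the actual reading and
// writing to a stock HNSW format.
public class HighDimKnnVectorsFormat extends KnnVectorsFormat {
    private static final int MAX_DIMENSIONS = 16_000;
    private final KnnVectorsFormat delegate = new Lucene99HnswVectorsFormat();

    public HighDimKnnVectorsFormat() {
        super("HighDimKnnVectorsFormat");
    }

    @Override
    public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
        return delegate.fieldsWriter(state); // on-disk format unchanged
    }

    @Override
    public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
        return delegate.fieldsReader(state);
    }

    @Override
    public int getMaxDimensions(String fieldName) {
        return MAX_DIMENSIONS; // only the validation cap is raised
    }
}
```

Because the reader and writer are delegated, only the dimension validation changes; the underlying index format stays the same.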

Issues Resolved

Closes #925

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@junqiu-lei added the Enhancements and v2.13.0 labels on Dec 12, 2023
@junqiu-lei self-assigned this on Dec 12, 2023
@junqiu-lei changed the title from "Increase Lucene max dimension limit" to "Increase Lucene max dimension limit to 16,000" on Dec 12, 2023
Signed-off-by: Junqiu Lei <junqiu@amazon.com>

codecov bot commented Dec 12, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (2e3ab95) 85.15% compared to head (782d703) 85.02%.
Report is 1 commit behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...opensearch/knn/index/mapper/LuceneFieldMapper.java | 0.00% | 1 Missing and 1 partial ⚠️ |
| .../knn/index/codec/BasePerFieldKnnVectorsFormat.java | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
```
@@             Coverage Diff              @@
##               main    #1346      +/-   ##
============================================
- Coverage     85.15%   85.02%   -0.13%
- Complexity     1216     1242      +26
============================================
  Files           160      161       +1
  Lines          4958     5069     +111
  Branches        457      473      +16
============================================
+ Hits           4222     4310      +88
- Misses          538      554      +16
- Partials        198      205       +7
```


ryanbogan (Member) previously approved these changes on Dec 12, 2023

@ryanbogan left a comment:

LGTM!

@martin-gaievski (Member)

Does it make sense to add a BWC test for this change? Have you tried to ingest data with this higher dimension value while doing a rolling upgrade?

@junqiu-lei (Member, Author) commented Dec 12, 2023

16k Dimension Sanity Test

Cluster Configuration

Used a single-node cluster with OpenSearch 3.0 and the plugin from https://github.com/junqiu-lei/k-NN/tree/higher_dimension installed. Nodes were run on AWS EC2 instances.

| Parameter | Value |
|---|---|
| node type | r6gd.4xl |
| node count | 1 |
| node disk | 500 GB |
| node JVM | 32 GB |
| Index thread qty | 1 |

Data set

Generated a random HDF5 data set with 100K train and 10K test vectors of dimension 16K, using a random uniform distribution between -10K and 10K. The script can be found here: https://github.com/jmazanec15/k-NN-1/blob/dimension-test-16k/benchmarks/osb/scripts/generate-dataset.py.
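The linked generator is a Python script; the Java sketch below is only meant to pin down the dataset shape described above and is not part of the test tooling:

```java
import java.util.Random;

// Illustrative only: build `count` random vectors of dimension `dim`
// with components uniform in [-10_000, 10_000), mirroring the dataset
// described above (100K train / 10K test vectors, dimension 16,000).
public final class RandomVectors {
    public static float[][] generate(int count, int dim, long seed) {
        Random rng = new Random(seed);
        float[][] vectors = new float[count][dim];
        for (int i = 0; i < count; i++) {
            for (int j = 0; j < dim; j++) {
                vectors[i][j] = rng.nextFloat() * 20_000f - 10_000f;
            }
        }
        return vectors;
    }
}
```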

Algorithm Configurations

Tested the following algorithm configurations:

| Algorithm | Parameters | Shard count | Replica count |
|---|---|---|---|
| lucene-hnsw | ef_construction=64, m=16 | 1 | 1 |
| nmslib-hnsw | ef_search=64, ef_construction=64, m=8 | 1 | 1 |
| faiss-ivf | nlist=64, nprobes=4 | 1 | 1 |
| faiss-ivfpq | nlist=64, nprobes=4, code_size=8, m=8 | 1 | 1 |
| faiss-hnsw | ef_search=64, ef_construction=64, m=8 | 1 | 1 |

Test Tool

Result

  • Indexing

| Exp No. | Algorithm | Min Throughput (docs/s) | Mean Throughput (docs/s) | Median Throughput (docs/s) | Max Throughput (docs/s) | p50 service time (ms) | p90 service time (ms) | p99 service time (ms) | p100 service time (ms) | Error rate |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | lucene-hnsw | 24.58 | 25.48 | 25.06 | 36.61 | 5611 | 5474 | 6642 | 7095 | 0.00% |
| 2 | nmslib-hnsw | 37.38 | 37.64 | 37.59 | 38.08 | 2194 | 2906 | 3316 | 3458 | 0.00% |
| 3 | faiss-ivf | 36.47 | 37.15 | 37.16 | 38.15 | 2189 | 3069 | 3767 | 4317 | 0.00% |
| 4 | faiss-ivfpq | 21.38 | 159.09 | 170.92 | 182.40 | 8237 | 10440 | 12577 | 12582 | 0.00% |
| 5 | faiss-hnsw | 12.25 | 27.22 | 36.29 | 37.80 | 2202 | 2291 | 2906 | 3013766 | 0.00% |

  • Querying

| Exp No. | Algorithm | Min Throughput (docs/s) | Mean Throughput (docs/s) | Median Throughput (docs/s) | Max Throughput (docs/s) | p50 service time (ms) | p90 service time (ms) | p99 service time (ms) | p100 service time (ms) | Error rate |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | lucene-hnsw | 69.87 | 80.21 | 80.59 | 80.75 | 201 | 235 | 270 | 326 | 0.00% |
| 2 | nmslib-hnsw | 0.22 | 66.38 | 72.42 | 78.71 | 100 | 104 | 160 | 4586 | 0.00% |
| 3 | faiss-ivf | 0.08 | 2.66 | 2.72 | 2.80 | 3624 | 4549 | 5011 | 12253 | 0.00% |
| 4 | faiss-ivfpq | 0.09 | 29.21 | 31.99 | 34.42 | 252 | 256 | 268 | 10852 | 0.00% |
| 5 | faiss-hnsw | 0.09 | 19.94 | 21.52 | 22.66 | 402 | 413 | 428 | 11485 | 0.00% |

@navneet1v (Collaborator)

> Sanity test
>
> Cluster Configuration
>
> Used a single-node cluster with OpenSearch 3.0 and the plugin from https://github.com/junqiu-lei/k-NN/tree/higher_dimension installed. Nodes were run on AWS EC2 instances.
>
> | Parameter | Value |
> |---|---|
> | node type | r5.16xlarge |
> | node count | 1 |
> | node disk | 500 GB |
>
> Data set
>
> Generated a random HDF5 data set with 100K train and 10K test vectors of dimension 16K, using a random uniform distribution between -10K and 10K. The script can be found here: https://github.com/jmazanec15/k-NN-1/blob/dimension-test-16k/benchmarks/osb/scripts/generate-dataset.py.
>
> Algorithm Configurations
>
> Tested the hnsw algorithm on lucene.
>
> | Algorithm | Parameters | Shard count | Replica count |
> |---|---|---|---|
> | hnsw (lucene) | ef_construction=64, m=16 | 1 | 1 |
>
> Result
>
> • Indexing
>
> | Min Throughput (docs/s) | Mean Throughput (docs/s) | Median Throughput (docs/s) | Max Throughput (docs/s) | p50 service time (ms) | p90 service time (ms) | p99 service time (ms) | p100 service time (ms) | Error rate |
> |---|---|---|---|---|---|---|---|---|
> | 32 | 33 | 32 | 45 | 3,938 | 5,424 | 6,048 | 6,575 | 0.00% |
>
> • Search
>
> | p50 service time (ms) | p90 service time (ms) | p99 service time (ms) | p99.9 service time (ms) | p99.99 service time (ms) | p100 service time (ms) | Error rate |
> |---|---|---|---|---|---|---|
> | 206 | 240 | 278 | 300 | 326 | 338 | 0.00% |

Can you post a comparison with the other engines here too, as was done last time?

@junqiu-lei (Member, Author) commented Dec 12, 2023

> Does it make sense to add a BWC test for this change? Have you tried to ingest data with this higher dimension value while doing a rolling upgrade?

I don't think we need a BWC test for this; lower-version indices have <=1024 dimension settings. I will have a look at the rolling upgrade scenario.

@junqiu-lei (Member, Author)

> Can you post a comparison with the other engines here too, as was done last time?

Sure, but last time was with 3 data nodes and some other differences.

Signed-off-by: Junqiu Lei <junqiu@amazon.com>
@heemin32 (Collaborator)

> > Does it make sense to add a BWC test for this change? Have you tried to ingest data with this higher dimension value while doing a rolling upgrade?
>
> I don't think we need a BWC test for this; lower-version indices have <=1024 dimension settings. I will have a look at the rolling upgrade scenario.

This is a limit change during new index creation. The existing BWC tests should be enough to catch any issue with an existing index.
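Since the cap applies only when a new index's mapping is parsed, a sketch of the check makes the BWC point concrete; the names below are illustrative, not the plugin's actual code, and indices created on older versions never hit this path:

```java
// Illustrative only (not the plugin's actual code): the dimension cap is
// enforced when a field mapping is parsed at index creation time, so
// existing indices are unaffected by raising the limit.
public final class DimensionValidator {
    public static final int LUCENE_ENGINE_MAX_DIMENSION = 16_000;

    public static void validate(int dimension) {
        if (dimension <= 0 || dimension > LUCENE_ENGINE_MAX_DIMENSION) {
            throw new IllegalArgumentException(
                "Dimension value must be in (0, " + LUCENE_ENGINE_MAX_DIMENSION
                    + "], but got " + dimension);
        }
    }
}
```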

@navneet1v (Collaborator)

> Sure, but last time was with 3 data nodes and some other differences.

Can we then rerun the tests for nmslib and faiss with your setup?

@navneet1v (Collaborator)

Overall the code looks good to me. Please just paste the comparison with nmslib and Faiss in this PR and also on the GitHub issue; once that is done I can approve the PR.

@junqiu-lei (Member, Author) commented Dec 12, 2023

> Overall the code looks good to me. Please just paste the comparison with nmslib and Faiss in this PR and also on the GitHub issue; once that is done I can approve the PR.

Thanks, I will re-run the nmslib and Faiss tests.

@junqiu-lei (Member, Author)

> Thanks, I will re-run the nmslib and Faiss tests.

I have updated the table with new tests from the different engines: #1346 (comment)

@junqiu-lei merged commit 083ea2b into opensearch-project:main on Dec 13, 2023. 47 of 48 checks passed.
opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 13, 2023
* Increase Lucene max dimension limit to 16,000

Signed-off-by: Junqiu Lei <junqiu@amazon.com>
(cherry picked from commit 083ea2b)
@vamshin (Member) commented Dec 13, 2023

@junqiu-lei can we add Faiss (HNSW)? Also, can we run it on a smaller instance (r6gd.4xl)?

@heemin32 (Collaborator)

@junqiu-lei We need a documentation update as well.

@junqiu-lei (Member, Author)

> @junqiu-lei can we add Faiss (HNSW)? Also, can we run it on a smaller instance (r6gd.4xl)?

Sure, will add.

junqiu-lei added a commit that referenced this pull request Dec 14, 2023
* Increase Lucene max dimension limit to 16,000

Signed-off-by: Junqiu Lei <junqiu@amazon.com>
(cherry picked from commit 083ea2b)

Co-authored-by: Junqiu Lei <junqiu@amazon.com>
@junqiu-lei (Member, Author)

> @junqiu-lei can we add Faiss (HNSW)? Also, can we run it on a smaller instance (r6gd.4xl)?

@vamshin Updated the test results to include Faiss (HNSW) on an r6gd.4xl instance.

Labels: backport 2.x, Enhancements, v2.12.0

Projects: None yet

Development: Successfully merging this pull request may close these issues: [FEATURE] Support higher vector dimension limit for lucene

6 participants