
Increase Lucene max dimension limit to 16,000 #1346

Merged: 2 commits merged into opensearch-project:main on Dec 13, 2023

Conversation

@junqiu-lei (Member) commented Dec 12, 2023

Description

Since Lucene moved the max dimension limit into the codec, we are able to change it in the k-NN Lucene codec. This PR updates the Lucene engine's max dimension limit to 16,000, which is consistent with the other k-NN engines.
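For context, here is a minimal sketch of the mechanism, assuming Lucene 9.8+ where `KnnVectorsFormat#getMaxDimensions` became overridable by codecs; the class name and the delegate format below are illustrative, not the plugin's actual code:

```java
import java.io.IOException;

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Illustrative sketch: a vectors format that raises Lucene's per-field
// dimension cap (default 1024) while delegating the actual reading and
// writing to a stock HNSW format.
public class HighDimKnnVectorsFormat extends KnnVectorsFormat {
    private static final int MAX_DIMENSIONS = 16_000;
    private final KnnVectorsFormat delegate = new Lucene99HnswVectorsFormat();

    public HighDimKnnVectorsFormat() {
        super("HighDimKnnVectorsFormat");
    }

    @Override
    public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
        return delegate.fieldsWriter(state); // on-disk format unchanged
    }

    @Override
    public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
        return delegate.fieldsReader(state);
    }

    @Override
    public int getMaxDimensions(String fieldName) {
        return MAX_DIMENSIONS; // only the validation cap is raised
    }
}
```

Because the reader and writer are delegated, only the dimension validation changes; the underlying index format stays the same.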

Issues Resolved

Closes #925

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@junqiu-lei added the Enhancements and v2.13.0 labels on Dec 12, 2023
@junqiu-lei self-assigned this on Dec 12, 2023
@junqiu-lei changed the title from "Increase Lucene max dimension limit" to "Increase Lucene max dimension limit to 16,000" on Dec 12, 2023
Signed-off-by: Junqiu Lei <junqiu@amazon.com>

codecov bot commented Dec 12, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (2e3ab95) 85.15% compared to head (782d703) 85.02%.
Report is 1 commit behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...opensearch/knn/index/mapper/LuceneFieldMapper.java | 0.00% | 1 Missing and 1 partial ⚠️ |
| .../knn/index/codec/BasePerFieldKnnVectorsFormat.java | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
```
@@             Coverage Diff              @@
##               main    #1346      +/-   ##
============================================
- Coverage     85.15%   85.02%   -0.13%
- Complexity     1216     1242      +26
============================================
  Files           160      161       +1
  Lines          4958     5069     +111
  Branches        457      473      +16
============================================
+ Hits           4222     4310      +88
- Misses          538      554      +16
- Partials        198      205       +7
```


ryanbogan (Member) previously approved these changes on Dec 12, 2023

@ryanbogan left a comment:

LGTM!

@martin-gaievski (Member)

Does it make sense to add a BWC test for this change? Have you tried to ingest data with this higher dimension value while doing a rolling upgrade?

@junqiu-lei (Member, Author) commented Dec 12, 2023

16k Dimension Sanity Test

Cluster Configuration

Used a single-node cluster with OpenSearch 3.0 and the plugin from https://github.com/junqiu-lei/k-NN/tree/higher_dimension installed. Nodes were run on AWS EC2 instances.

| Parameter | Value |
|---|---|
| node type | r6gd.4xl |
| node count | 1 |
| node disk | 500 GB |
| node JVM | 32 GB |
| Index thread qty | 1 |

Data set

Generated a random HDF5 data set with 100K train and 10K test vectors of dimension 16K, using a random uniform distribution between -10K and 10K. The script can be found here: https://github.com/jmazanec15/k-NN-1/blob/dimension-test-16k/benchmarks/osb/scripts/generate-dataset.py.
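The linked generator is a Python script; the Java sketch below is only meant to pin down the dataset shape described above and is not part of the test tooling:

```java
import java.util.Random;

// Illustrative only: build `count` random vectors of dimension `dim`
// with components uniform in [-10_000, 10_000), mirroring the dataset
// described above (100K train / 10K test vectors, dimension 16,000).
public final class RandomVectors {
    public static float[][] generate(int count, int dim, long seed) {
        Random rng = new Random(seed);
        float[][] vectors = new float[count][dim];
        for (int i = 0; i < count; i++) {
            for (int j = 0; j < dim; j++) {
                vectors[i][j] = rng.nextFloat() * 20_000f - 10_000f;
            }
        }
        return vectors;
    }
}
```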

Algorithm Configurations

Tested the following algorithm configurations:

| Algorithm | Parameters | Shard count | Replica count |
|---|---|---|---|
| lucene-hnsw | ef_construction=64, m=16 | 1 | 1 |
| nmslib-hnsw | ef_search=64, ef_construction=64, m=8 | 1 | 1 |
| faiss-ivf | nlist=64, nprobes=4 | 1 | 1 |
| faiss-ivfpq | nlist=64, nprobes=4, code_size=8, m=8 | 1 | 1 |
| faiss-hnsw | ef_search=64, ef_construction=64, m=8 | 1 | 1 |

Test Tool

Result

  • Indexing

| Exp No. | Algorithm | Min Throughput (docs/s) | Mean Throughput (docs/s) | Median Throughput (docs/s) | Max Throughput (docs/s) | p50 service time (ms) | p90 service time (ms) | p99 service time (ms) | p100 service time (ms) | Error rate |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | lucene-hnsw | 24.58 | 25.48 | 25.06 | 36.61 | 5611 | 5474 | 6642 | 7095 | 0.00% |
| 2 | nmslib-hnsw | 37.38 | 37.64 | 37.59 | 38.08 | 2194 | 2906 | 3316 | 3458 | 0.00% |
| 3 | faiss-ivf | 36.47 | 37.15 | 37.16 | 38.15 | 2189 | 3069 | 3767 | 4317 | 0.00% |
| 4 | faiss-ivfpq | 21.38 | 159.09 | 170.92 | 182.40 | 8237 | 10440 | 12577 | 12582 | 0.00% |
| 5 | faiss-hnsw | 12.25 | 27.22 | 36.29 | 37.80 | 2202 | 2291 | 2906 | 3013766 | 0.00% |

  • Querying

| Exp No. | Algorithm | Min Throughput (docs/s) | Mean Throughput (docs/s) | Median Throughput (docs/s) | Max Throughput (docs/s) | p50 service time (ms) | p90 service time (ms) | p99 service time (ms) | p100 service time (ms) | Error rate |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | lucene-hnsw | 69.87 | 80.21 | 80.59 | 80.75 | 201 | 235 | 270 | 326 | 0.00% |
| 2 | nmslib-hnsw | 0.22 | 66.38 | 72.42 | 78.71 | 100 | 104 | 160 | 4586 | 0.00% |
| 3 | faiss-ivf | 0.08 | 2.66 | 2.72 | 2.80 | 3624 | 4549 | 5011 | 12253 | 0.00% |
| 4 | faiss-ivfpq | 0.09 | 29.21 | 31.99 | 34.42 | 252 | 256 | 268 | 10852 | 0.00% |
| 5 | faiss-hnsw | 0.09 | 19.94 | 21.52 | 22.66 | 402 | 413 | 428 | 11485 | 0.00% |

@navneet1v (Collaborator)

> Sanity test
>
> Cluster Configuration
>
> Used a single-node cluster with OpenSearch 3.0 and the plugin from https://github.com/junqiu-lei/k-NN/tree/higher_dimension installed. Nodes were run on AWS EC2 instances.
>
> | Parameter | Value |
> |---|---|
> | node type | r5.16xlarge |
> | node count | 1 |
> | node disk | 500 GB |
>
> Data set
>
> Generated a random HDF5 data set with 100K train and 10K test vectors of dimension 16K, using a random uniform distribution between -10K and 10K. The script can be found here: https://github.com/jmazanec15/k-NN-1/blob/dimension-test-16k/benchmarks/osb/scripts/generate-dataset.py.
>
> Algorithm Configurations
>
> Tested the hnsw algorithm on lucene.
>
> | Algorithm | Parameters | Shard count | Replica count |
> |---|---|---|---|
> | hnsw (lucene) | ef_construction=64, m=16 | 1 | 1 |
>
> Result
>
> • Indexing
>
> | Min Throughput (docs/s) | Mean Throughput (docs/s) | Median Throughput (docs/s) | Max Throughput (docs/s) | p50 service time (ms) | p90 service time (ms) | p99 service time (ms) | p100 service time (ms) | Error rate |
> |---|---|---|---|---|---|---|---|---|
> | 32 | 33 | 32 | 45 | 3,938 | 5,424 | 6,048 | 6,575 | 0.00% |
>
> • Search
>
> | p50 service time (ms) | p90 service time (ms) | p99 service time (ms) | p99.9 service time (ms) | p99.99 service time (ms) | p100 service time (ms) | Error rate |
> |---|---|---|---|---|---|---|
> | 206 | 240 | 278 | 300 | 326 | 338 | 0.00% |

Can you post a comparison with the other engines here too, as was done last time?

@junqiu-lei (Member, Author) commented Dec 12, 2023

> Does it make sense to add a BWC test for this change? Have you tried to ingest data with this higher dimension value while doing a rolling upgrade?

I don't think we need a BWC test for this; lower-version indices have <=1024 dimension settings. I will have a look at the rolling upgrade scenario.

@junqiu-lei (Member, Author)

> Can you post a comparison with the other engines here too, as was done last time?

Sure, but last time was with 3 data nodes and some other differences.

Signed-off-by: Junqiu Lei <junqiu@amazon.com>
@heemin32 (Collaborator)

> > Does it make sense to add a BWC test for this change? Have you tried to ingest data with this higher dimension value while doing a rolling upgrade?
>
> I don't think we need a BWC test for this; lower-version indices have <=1024 dimension settings. I will have a look at the rolling upgrade scenario.

This is a limit change during new index creation. The existing BWC tests should be enough to catch any issue with an existing index.
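Since the cap applies only when a new index's mapping is parsed, a sketch of the check makes the BWC point concrete; the names below are illustrative, not the plugin's actual code, and indices created on older versions never hit this path:

```java
// Illustrative only (not the plugin's actual code): the dimension cap is
// enforced when a field mapping is parsed at index creation time, so
// existing indices are unaffected by raising the limit.
public final class DimensionValidator {
    public static final int LUCENE_ENGINE_MAX_DIMENSION = 16_000;

    public static void validate(int dimension) {
        if (dimension <= 0 || dimension > LUCENE_ENGINE_MAX_DIMENSION) {
            throw new IllegalArgumentException(
                "Dimension value must be in (0, " + LUCENE_ENGINE_MAX_DIMENSION
                    + "], but got " + dimension);
        }
    }
}
```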

@navneet1v (Collaborator)

> Sure, but last time was with 3 data nodes and some other differences.

Can we then rerun the tests for nmslib and faiss with your setup?

@navneet1v (Collaborator)

Overall the code looks good to me. Please just paste the comparison with nmslib and Faiss in this PR and also on the GitHub issue; once that is done I can approve the PR.

@junqiu-lei (Member, Author) commented Dec 12, 2023

> Overall the code looks good to me. Please just paste the comparison with nmslib and Faiss in this PR and also on the GitHub issue; once that is done I can approve the PR.

Thanks, I will re-run the nmslib and Faiss tests.

@junqiu-lei (Member, Author)

> Thanks, I will re-run the nmslib and Faiss tests.

I have updated the table with new tests from the different engines: #1346 (comment)

@junqiu-lei merged commit 083ea2b into opensearch-project:main on Dec 13, 2023. 47 of 48 checks passed.
opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 13, 2023
* Increase Lucene max dimension limit to 16,000

Signed-off-by: Junqiu Lei <junqiu@amazon.com>
(cherry picked from commit 083ea2b)
@vamshin (Member) commented Dec 13, 2023

@junqiu-lei can we add Faiss (HNSW)? Also, can we run it on a smaller instance (r6gd.4xl)?

@heemin32 (Collaborator)

@junqiu-lei We need a documentation update as well.

@junqiu-lei (Member, Author)

> @junqiu-lei can we add Faiss (HNSW)? Also, can we run it on a smaller instance (r6gd.4xl)?

Sure, will add.

junqiu-lei added a commit that referenced this pull request Dec 14, 2023
* Increase Lucene max dimension limit to 16,000

Signed-off-by: Junqiu Lei <junqiu@amazon.com>
(cherry picked from commit 083ea2b)

Co-authored-by: Junqiu Lei <junqiu@amazon.com>
@junqiu-lei (Member, Author)

> @junqiu-lei can we add Faiss (HNSW)? Also, can we run it on a smaller instance (r6gd.4xl)?

@vamshin Updated the test results to include Faiss (HNSW) on an r6gd.4xl instance.

Labels: backport 2.x, Enhancements, v2.12.0

Projects: None yet

Development: Successfully merging this pull request may close these issues: [FEATURE] Support higher vector dimension limit for lucene

6 participants