[Bug]: Getting error "index not supported" for GPU CAGRA index, please help #34220

Open
1 task done
kdabbir opened this issue Jun 26, 2024 · 12 comments
Labels
kind/bug: Issues or changes related to a bug
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@kdabbir

kdabbir commented Jun 26, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.5-gpu
- Deployment mode(standalone or cluster): Cluster
- MQ type(rocksmq, pulsar or kafka):   Kafka
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.4
- OS(Ubuntu or CentOS): Linux
- CPU/Memory: 
- GPU: 1 GPU each for query and index node
- Others:

Current Behavior

I explained the issue in Discord here: https://discord.com/channels/1160323594396635310/1255592726355775518

I am filing a bug here to get help resolving it quickly. We would appreciate any insight into a fix, since we plan to complete GPU CAGRA benchmarking and testing as soon as possible.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@kdabbir added the kind/bug and needs-triage labels on Jun 26, 2024
@yanliang567
Contributor

/assign @Presburger
/unassign

@yanliang567 added the triage/accepted label and removed the needs-triage label on Jun 27, 2024
@kdabbir
Author

kdabbir commented Jun 27, 2024

@yanliang567 @Presburger Is there another Milvus version or release that works with GPU_CAGRA? I think this is a regression, given that there is a similar ticket for DiskANN: #34222. We are OK to test with any version as long as it works.

@xiaocai2333
Contributor

@kdabbir I think you didn't enable GPU when building the image; by default, it's a CPU image. Please check it.

@Presburger
Contributor

@kdabbir Hi, could you provide more detailed log information?

@kdabbir
Author

kdabbir commented Jun 27, 2024

[Image: Screenshot 2024-06-27 at 9 57 02 AM]

Attaching a screenshot of the GPU image.
I tried to get the server version using the code below:

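    from pymilvus import connections, utility

    # Banner format; not defined in the original snippet, so this value is an assumption.
    fmt = "\n=== {:30} ===\n"

    print(fmt.format("start connecting to Milvus"))
    connections.connect(host='localhost', port='19530')

    # Check whether the test collection exists.
    has = utility.has_collection("hello_milvus")
    print(f"Does collection hello_milvus exist in Milvus: {has}")

    # Print the version string reported by the server.
    print("Checking milvus gpu version")
    print(utility.get_server_version())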

@kdabbir
Author

kdabbir commented Jun 27, 2024

> @kdabbir I think you didn't enable GPU when building the image; by default, it's a CPU image. Please check it.

All I did was run `docker pull milvusdb/milvus:v2.4.5-gpu` and use that image.

> Hi, could you provide more detailed log information?

Sure, let me export the logs and attach them.

@kdabbir
Author

kdabbir commented Jun 27, 2024

milvus-log.tar.gz
Logs attached. (You'll need to scroll up in each log to find the Milvus pod log, since the archive also contains logs from several metrics containers.)

@kdabbir
Author

kdabbir commented Jun 27, 2024

@xiaocai2333 @Presburger

@kdabbir
Author

kdabbir commented Jun 27, 2024

Logs from the proxy and data coord, taken from the archive above:

collection=hello_milvus]
[2024/06/27 04:43:53.409 +00:00] [INFO] [proxy/impl.go:1996] ["CreateIndex received"] [traceID=8573b31b96cc174537765c7c57fbc273] [role=proxy] [db=default] [collection=hello_milvus] [field=embeddings] [extra_params="[{\"key\":\"metric_type\",\"value\":\"L2\"},{\"key\":\"index_type\",\"value\":\"GPU_CAGRA\"},{\"key\":\"params\",\"value\":\"{\\\"intermediate_graph_degree\\\":64,\\\"graph_degree\\\":32}\"}]"]
[2024/06/27 04:43:53.409 +00:00] [INFO] [proxy/impl.go:2009] ["CreateIndex enqueued"] [traceID=8573b31b96cc174537765c7c57fbc273] [role=proxy] [db=default] [collection=hello_milvus] [field=embeddings] [extra_params="[{\"key\":\"metric_type\",\"value\":\"L2\"},{\"key\":\"index_type\",\"value\":\"GPU_CAGRA\"},{\"key\":\"params\",\"value\":\"{\\\"intermediate_graph_degree\\\":64,\\\"graph_degree\\\":32}\"}]"] [BeginTs=450747022282326017] [EndTs=450747022282326017]
[2024/06/27 04:43:53.409 +00:00] [INFO] [proxy/task_index.go:438] ["proxy create index"] [traceID=8573b31b96cc174537765c7c57fbc273] [collectionID=450733024940556810] [fieldID=102] [indexName=] [typeParams="[{\"key\":\"dim\",\"value\":\"8\"}]"] [indexParams="[{\"key\":\"metric_type\",\"value\":\"L2\"},{\"key\":\"index_type\",\"value\":\"GPU_CAGRA\"},{\"key\":\"params\",\"value\":\"{\\\"intermediate_graph_degree\\\":64,\\\"graph_degree\\\":32}\"}]"] [newExtraParams="[{\"key\":\"metric_type\",\"value\":\"L2\"},{\"key\":\"index_type\",\"value\":\"GPU_CAGRA\"},{\"key\":\"params\",\"value\":\"{\\\"intermediate_graph_degree\\\":64,\\\"graph_degree\\\":32}\"}]"]
[2024/06/27 04:43:53.467 +00:00] [INFO] [proxy/impl.go:2027] ["CreateIndex done"] [traceID=8573b31b96cc174537765c7c57fbc273] [role=proxy] [db=default] [collection=hello_milvus] [field=embeddings] [extra_params="[{\"key\":\"metric_type\",\"value\":\"L2\"},{\"key\":\"index_type\",\"value\":\"GPU_CAGRA\"},{\"key\":\"params\",\"value\":\"{\\\"intermediate_graph_degree\\\":64,\\\"graph_degree\\\":32}\"}]"] [BeginTs=450747022282326017] [EndTs=450747022282326017]
[2024/06/27 04:43:53.754 +00:00] [INFO] [proxy/impl.go:6009] ["AllocTimestamp request receive"]
[2024/06/27 04:43:53.754 +00:00] [INFO] [proxy/impl.go:6018] ["AllocTimestamp request success"] [timestamp=450747022374076418]
[2024/06/27 04:43:54.546 +00:00] [DEBUG] [proxy/impl.go:2140] ["DescribeIndex received"] [traceID=9f259891fd34c3579a26214baa1f488b] [role=proxy] [db=default] [collection=hello_milvus] [field=] ["index name"=]
[2024/06/27 04:43:54.547 +00:00] [DEBUG] [proxy/impl.go:2155] ["DescribeIndex enqueued"] [traceID=9f259891fd34c3579a26214baa1f488b] [role=proxy] [db=default] [collection=hello_milvus] [field=] ["index name"=] [BeginTs=450747022583791618] [EndTs=450747022583791618]
[2024/06/27 04:43:54.548 +00:00] [DEBUG] [proxy/impl.go:2175] ["DescribeIndex done"] [traceID=9f259891fd34c3579a26214baa1f488b] [role=proxy] [db=default] [collection=hello_milvus] [field=] ["index name"=] [BeginTs=450747022583791618] [EndTs=450747022583791618]
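
For readers without the original script, the CreateIndex request recorded in the proxy log above corresponds roughly to the pymilvus call sketched below. This is reconstructed from the logged parameters (field `embeddings`, metric `L2`, index type `GPU_CAGRA`, `intermediate_graph_degree` 64, `graph_degree` 32) and is not the reporter's actual code. The helper names `cagra_index_params` and `create_cagra_index` are hypothetical, and `collection` is assumed to be a `pymilvus.Collection` for `hello_milvus` on an already-open connection.

```python
def cagra_index_params(intermediate_graph_degree=64, graph_degree=32):
    """Build index params matching the values seen in the proxy log."""
    return {
        "index_type": "GPU_CAGRA",
        "metric_type": "L2",
        "params": {
            "intermediate_graph_degree": intermediate_graph_degree,
            "graph_degree": graph_degree,
        },
    }


def create_cagra_index(collection, field_name="embeddings"):
    """Send the index request.

    `collection` is assumed to be a pymilvus Collection (hypothetical wiring);
    only its create_index(field_name, index_params) method is used here.
    """
    params = cagra_index_params()
    collection.create_index(field_name=field_name, index_params=params)
    return params
```

Note that the proxy reporting "CreateIndex done" only means the request was accepted; the build itself can still fail later on the index node, which is what the datacoord log shows.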


[2024/06/27 02:52:57.828 +00:00] [DEBUG] [datacoord/import_scheduler.go:167] ["peek slots done"] [nodeSlots="{\"27\":16}"]
[2024/06/27 02:52:58.091 +00:00] [INFO] [datacoord/index_service.go:682] ["receive DescribeIndex request"] [traceID=6637014b369f196b4c0a9fc7f7fae072] [collectionID=450731453867370391] [indexName=] [timestamp=0]
[2024/06/27 02:52:58.091 +00:00] [INFO] [datacoord/index_service.go:603] ["completeIndexInfo success"] [collectionID=450731453867370391] [indexID=450731453867370404] [totalRows=33396] [indexRows=0] [pendingIndexRows=33396] [state=Failed] [failReason="450731453867570628: failed to create index, C Runtime Exception: index not supported\n: segcore unsupported error[segcoreCode=2003];"]
[2024/06/27 02:52:58.091 +00:00] [INFO] [datacoord/index_service.go:730] ["DescribeIndex success"] [traceID=6637014b369f196b4c0a9fc7f7fae072] [collectionID=450731453867370391] [indexName=]
[2024/06/27 02:52:58.091 +00:00] [INFO] [datacoord/index_service.go:682] ["receive DescribeIndex request"] [traceID=6637014b369f196b4c0a9fc7f7fae072] [collectionID=450733024940556810] [indexName=] [timestamp=0]
[2024/06/27 02:52:58.091 +00:00] [INFO] [datacoord/index_service.go:603] ["completeIndexInfo success"] [collectionID=450733024940556810] [indexID=450733024940556861] [totalRows=9003] [indexRows=6002] [pendingIndexRows=3001] [state=Failed] [failReason="450733024940556832: failed to create index, C Runtime Exception: index not supported\n: segcore unsupported error[segcoreCode=2003];"]
[2024/06/27 02:52:58.091 +00:00] [INFO] [datacoord/index_service.go:730] ["DescribeIndex success"] [traceID=6637014b369f196b4c0a9fc7f7fae072] [collectionID=450733024940556810] [indexName=]
[2024/06/27 02:52:58.176 +00:00] [DEBUG] [datacoord/services.go:660] ["DataCoord current state"] [StateCode=Healthy]
[2024/06/27 02:52:59.827 +00:00] [DEBUG] [datacoord/import_scheduler.go:167] ["peek slots done"] [nodeSlots="{\"27\":16}"]
[2024/06/27 02:53:01.090 +00:00] [INFO] [datacoord/index_service.go:682] ["receive DescribeIndex request"] [traceID=3a19b0373547be5a6d5e420fcf173ddf] [collectionID=450733024940556810] [indexName=] [timestamp=0]
[2024/06/27 02:53:01.090 +00:00] [INFO] [datacoord/index_service.go:603] ["completeIndexInfo success"] [collectionID=450733024940556810] [indexID=450733024940556861] [totalRows=9003] [indexRows=6002] [pendingIndexRows=3001] [state=Failed] [failReason="450733024940556832: failed to create index, C Runtime Exception: index not supported\n: segcore unsupported error[segcoreCode=2003];"]
[2024/06/27 02:53:01.090 +00:00] [INFO] [datacoord/index_service.go:730] ["DescribeIndex success"] [traceID=3a19b0373547be5a6d5e420fcf173ddf] [collectionID=450733024940556810] [indexName=]
[2024/06/27 02:53:01.090 +00:00] [INFO] [datacoord/index_service.go:682] ["receive DescribeIndex request"] [traceID=3a19b0373547be5a6d5e420fcf173ddf] [collectionID=450731453867370391] [indexName=] [timestamp=0]
[2024/06/27 02:53:01.090 +00:00] [INFO] [datacoord/index_service.go:603] ["completeIndexInfo success"] [collectionID=450731453867370391] [indexID=450731453867370404] [totalRows=33396] [indexRows=0] [pendingIndexRows=33396] [state=Failed] [failReason="450731453867570628: failed to create index, C Runtime Exception: index not supported\n: segcore unsupported error[segcoreCode=2003];"]
[2024/06/27 02:53:01.090 +00:00] [INFO] [datacoord/index_service.go:730] ["DescribeIndex success"] [traceID=3a19b0373547be5a6d5e420fcf173ddf] [collectionID=450731453867370391] [indexName=]
[2024/06/27 02:53:01.827 +00:00] [DEBUG] [datacoord/import_scheduler.go:167] ["peek slots done"] [nodeSlots="{\"27\":16}"]
[2024/06/27 02:53:02.953 +00:00] [DEBUG] [config/etcd_source.go:154] ["etcd refreshConfigurations"] [prefix=by-dev/config] [endpoints="[etcd-cagra-milvuspoctest2:2379]"]
[2024/06/27 02:53:03.714 +00:00] [INFO] [datacoord/index_service.go:924] ["List index success"] [traceID=b0d71425290cf428df9436e227173707] [collectionID=450731453867370391]
[2024/06/27 02:53:03.715 +00:00] [INFO] [datacoord/index_service.go:924] ["List index success"] [traceID=9acf55ea97cb2c9824b85fa94a410748] [collectionID=450731453867370391]
[2024/06/27 02:53:03.715 +00:00] [DEBUG] [datacoord/index_service.go:892] ["GetIndexInfos successfully"] [traceID=58db72095e502624cf2c92a323c6fbbc] [collectionID=450731453867370391] [indexName=]
[2024/06/27 02:53:03.717 +00:00] [INFO] [datacoord/services.go:820] ["get recovery info request received"] [traceID=18d08747b4b7f0fcaa639de7b907b31b] [collectionID=450731453867370391] [partitionIDs="[]"]
[2024/06/27 02:53:03.717 +00:00] [INFO] [datacoord/handler.go:117] [GetQueryVChanPositions] [collectionID=450731453867370391] [channel=by-dev-cagra-milvuspoctest2-dml_0_450731453867370391v0] [numOfSegments=1] ["indexed segment"=0]
[2024/06/27 02:53:03.717 +00:00] [INFO] [datacoord/handler.go:302] ["channel seek position set from channel checkpoint meta"] [channel=by-dev-cagra-milvuspoctest2-dml_0_450731453867370391v0] [posTs=450745265209737217] [posTime=2024/06/27 02:52:10.689 +00:00]
[2024/06/27 02:53:03.717 +00:00] [INFO] [datacoord/services.go:835] ["datacoord append channelInfo in GetRecoveryInfo"] [traceID=18d08747b4b7f0fcaa639de7b907b31b] [collectionID=450731453867370391] [partitionIDs="[]"] [channel=by-dev-cagra-milvuspoctest2-
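
The relevant signal in the datacoord log above is the `completeIndexInfo` entries with `state=Failed`. As a minimal sketch, assuming the bracketed `[key=value]` layout seen in these lines, the failures can be pulled out of a log file with the standard library alone. The function names here are hypothetical and are not part of any Milvus tooling.

```python
import re

# Matches bracketed key=value fields such as [state=Failed] or [failReason="..."].
# Quoted values may contain escaped characters and nested brackets.
FIELD_RE = re.compile(r'\[(\w+)=("(?:[^"\\]|\\.)*"|[^\]]*)\]')


def parse_fields(line):
    """Return the key=value fields of one log line as a dict of strings."""
    fields = {}
    for key, value in FIELD_RE.findall(line):
        # Strip the surrounding quotes from quoted values.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def failed_index_builds(lines):
    """Map collectionID -> failReason for completeIndexInfo entries in state Failed."""
    failures = {}
    for line in lines:
        if "completeIndexInfo" not in line:
            continue
        fields = parse_fields(line)
        if fields.get("state") == "Failed" and "collectionID" in fields:
            failures[fields["collectionID"]] = fields.get("failReason", "")
    return failures
```

Run against the excerpt above, this would report both collections (`450731453867370391` and `450733024940556810`) with the same `index not supported` reason.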

@kdabbir
Author

kdabbir commented Jun 27, 2024

I got it working, but the documentation needs to be updated. I had to add the following config:

    gpu:
      enable: true
      cache.enable: true
      initMemSize: 0 # Gpu Memory Pool init size
      maxMemSize: 0 # Gpu Memory Pool Max size

I believe `enable: true` under `gpu` is needed to enable the GPU. Please update the documentation to include this configuration: https://milvus.io/docs/index-with-gpu.md#Configure-Milvus-settings-for-GPU-memory-control

@liliu-z
Member

liliu-z commented Jul 1, 2024

> I got it working, the documentation needs to be updated, basically I had to add below config:
>
>     gpu:
>       enable: true
>       cache.enable: true
>       initMemSize: 0 # Gpu Memory Pool init size
>       maxMemSize: 0 # Gpu Memory Pool Max size
>
> gpu - enable:true is needed to enable GPU i believe, please update this documentation to add this configuration: https://milvus.io/docs/index-with-gpu.md#Configure-Milvus-settings-for-GPU-memory-control

The GPU and CPU builds are two different images, since the GPU build brings in many extra dependencies.

@Presburger
Contributor

@kdabbir Hello, under the gpu field, we do not have the enable and cache.enable fields. Please check all deployed nodes to ensure they are using the GPU version of the image.
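
One way to act on the suggestion above is to collect the image each pod runs and flag any that is not a GPU build. The sketch below assumes GPU images can be recognized by a `-gpu` tag suffix, as in `milvusdb/milvus:v2.4.5-gpu` from this thread. The pod names and the `non_gpu_images` helper are hypothetical, and how the pod-to-image mapping is gathered depends on your deployment.

```python
def non_gpu_images(pod_images):
    """Return the pods whose image tag does not end with '-gpu'.

    `pod_images` maps a pod name to its image reference. The '-gpu' suffix
    check is an assumption based on the image tag used in this thread.
    """
    offenders = {}
    for pod, image in pod_images.items():
        # Split off the tag; an image with no tag is treated as non-GPU.
        _, sep, tag = image.rpartition(":")
        if not sep or not tag.endswith("-gpu"):
            offenders[pod] = image
    return offenders


# Hypothetical deployment: the index node is still on the CPU image.
pods = {
    "querynode-0": "milvusdb/milvus:v2.4.5-gpu",
    "indexnode-0": "milvusdb/milvus:v2.4.5",
}
print(non_gpu_images(pods))
```

Any pod reported here would be a candidate explanation for the `index not supported` error, since the index is built on whichever node picks up the job.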

5 participants