
[Bug]: [benchmark][standalone][LRU] Build index raises error failed to upload index and milvus restarts multiple times #32487

Closed
wangting0128 opened this issue Apr 21, 2024 · 11 comments
Labels: kind/bug, test/benchmark, triage/accepted
Milestone: 2.4.lru

@wangting0128 (Contributor):

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: milvus-io-lru-dev-27cb486-20240419
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0rc66
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: lru-fouramf-nk4fk

server:

NAME                                                              READY   STATUS                            RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
lru-500g-etcd-0                                                   1/1     Running                           0               2d2h    10.104.20.193   4am-node22   <none>           <none>
lru-500g-milvus-standalone-844c594b58-fkpwr                       1/1     Running                           8 (17m ago)     2d2h    10.104.31.236   4am-node34   <none>           <none>
lru-500g-pulsar-bookie-0                                          1/1     Running                           0               2d2h    10.104.20.195   4am-node22   <none>           <none>
lru-500g-pulsar-bookie-1                                          1/1     Running                           0               2d2h    10.104.29.179   4am-node35   <none>           <none>
lru-500g-pulsar-bookie-2                                          1/1     Running                           0               2d2h    10.104.26.180   4am-node32   <none>           <none>
lru-500g-pulsar-bookie-init-h9xz8                                 0/1     Completed                         0               2d2h    10.104.1.4      4am-node10   <none>           <none>
lru-500g-pulsar-broker-0                                          1/1     Running                           1 (21h ago)     2d2h    10.104.1.8      4am-node10   <none>           <none>
lru-500g-pulsar-proxy-0                                           1/1     Running                           0               2d2h    10.104.6.221    4am-node13   <none>           <none>
lru-500g-pulsar-pulsar-init-drpwj                                 0/1     Completed                         0               2d2h    10.104.6.220    4am-node13   <none>           <none>
lru-500g-pulsar-recovery-0                                        1/1     Running                           0               2d2h    10.104.6.222    4am-node13   <none>           <none>
lru-500g-pulsar-zookeeper-0                                       1/1     Running                           0               2d2h    10.104.20.194   4am-node22   <none>           <none>
lru-500g-pulsar-zookeeper-1                                       1/1     Running                           0               2d2h    10.104.29.181   4am-node35   <none>           <none>
lru-500g-pulsar-zookeeper-2                                       1/1     Running                           0               2d2h    10.104.34.8     4am-node37   <none>           <none>

(two screenshots of the server attached, captured 2024-04-22 00:49 and 00:50)

client pod name: lru-fouramf-nk4fk-1027207446
client log:
(screenshot of the client log attached, captured 2024-04-22 00:45)

Expected Behavior

No response

Steps To Reproduce

1. create a collection with 2 fields: "id" (primary key) and "float_vector" (768 dim)
2. build an HNSW index
3. insert 270m rows of data
4. flush the collection
5. build the index again with the same params <- failed (see the pymilvus sketch below)
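
A rough pymilvus sketch of the steps above, for orientation only: the original run went through the fouram benchmark framework with the client config shown below, so the host/port, random data, and loop bounds here are illustrative placeholders, not the actual harness code.

import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="127.0.0.1", port="19530")

schema = CollectionSchema([
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="float_vector", dtype=DataType.FLOAT_VECTOR, dim=768),
])
collection = Collection("fouram_270m", schema, shards_num=1)

index_params = {
    "index_type": "HNSW",
    "metric_type": "L2",
    "params": {"M": 30, "efConstruction": 360},
}
collection.create_index("float_vector", index_params)      # step 2: build HNSW index

batch = 10_000                                              # ni_per in the client config
for start in range(0, 270_000_000, batch):                  # step 3: 270m rows
    ids = list(range(start, start + batch))
    vectors = np.random.random((batch, 768)).astype(np.float32).tolist()
    collection.insert([ids, vectors])

collection.flush()                                          # step 4
collection.create_index("float_vector", index_params)       # step 5: fails in this report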

Milvus Log

No response

Anything else?

client config:

{
     "dataset_params": {
          "dataset_name": "laion1b_nolang",
          "column_name": "float32_vector",
          "dim": 768,
          "dataset_size": "270m",
          "ni_per": 10000,
          "metric_type": "L2",
          "extra_partitions": {
               "partitions": 27,
               "data_repeated": false
          }
     },
     "collection_params": {
          "other_fields": [],
          "collection_name": "fouram_270m",
          "shards_num": 1
     },
     "index_params": {
          "index_type": "HNSW",
          "index_param": {
               "M": 30,
               "efConstruction": 360
          }
     },
     "concurrent_tasks": [
          {
               "type": "search",
               "weight": 1,
               "params": {
                    "top_k": 100,
                    "nq": 10,
                    "search_param": {
                         "ef": 64
                    },
                    "timeout": 3000,
                    "random_data": true
               }
          }
     ],
     "concurrent_params": {
          "interval": 20,
          "during_time": "1h",
          "concurrent_number": 1
     }
}

server config:

    standalone:
      resources:
        limits:
          cpu: '16'
          memory: 32Gi
          ephemeral-storage: 70Gi
        requests:
          cpu: '16'
          memory: 32Gi
      messageQueue: pulsar
      extraEnv:
        - name: LOCAL_STORAGE_SIZE
          value: '70'
      disk:
        size:
          enabled: true
    etcd:
      replicaCount: 1
      metrics:
        enabled: true
        podMonitor:
          enabled: true
    minio:
      enabled: false
    externalS3:
      enabled: true
      accessKey: miniolru500g
      secretKey: miniolru500g
      bucketName: lru-500g
      host: minio-1.minio
      port: 9000
      rootPath: lru500g
    pulsar:
      enabled: true
    metrics:
      serviceMonitor:
        enabled: true
    log:
      level: debug
    extraConfigFiles:
      user.yaml: |+
        queryNode:
          diskCacheCapacityLimit: 51539607552
          mmap:
            mmapEnabled: true
          lazyloadEnabled: true
          useStreamComputing: true
          cache:
            warmup: sync
          lazyloadWaitTimeout: 300000
@wangting0128 added the kind/bug, needs-triage, and test/benchmark labels on Apr 21, 2024
@yanliang567 added the triage/accepted label and removed the needs-triage label on Apr 22, 2024
@yanliang567 added this to the 2.4.lru milestone on Apr 22, 2024
@yanliang567 removed their assignment on Apr 22, 2024
@xiaofan-luan (Collaborator):

I think you are hitting MinIO's limits.
Let's focus this test on S3, since MinIO's QPS is very limited.

@jaime0815 (Contributor):

An upload failure might lead to a crash, which should not happen. Could you help take a look at this? @xiaocai2333

@jaime0815 (Contributor):

The Put operation has high latency; should we deploy MinIO on an NVMe disk?
(screenshot attached)

@wangting0128 (Contributor, Author) commented Apr 22, 2024:

> The Put operation has high latency; should we deploy MinIO on an NVMe disk?

minio:

NAME        READY   STATUS    RESTARTS        AGE   IP              NODE         NOMINATED NODE   READINESS GATES
minio-1-0   1/1     Running   2 (18d ago)     87d   10.104.25.186   4am-node30   <none>           <none>
minio-1-1   1/1     Running   1 (31d ago)     87d   10.104.28.126   4am-node33   <none>           <none>
minio-1-2   1/1     Running   1 (5d23h ago)   87d   10.104.26.226   4am-node32   <none>           <none>
minio-1-3   1/1     Running   1 (17d ago)     87d   10.104.27.213   4am-node31   <none>           <none>

It is already on NVMe disks.

@wangting0128 (Contributor, Author):

Correction: it is a SATA disk, not NVMe.

@xiaofan-luan (Collaborator):

Let's skip this issue. A small MinIO cluster is not designed for frequent save/delete operations.

@xiaocai2333 (Contributor):

This is a potential destruction (object lifetime) issue caused by the failure to upload objects to MinIO.

@xiaocai2333 (Contributor):

The problem is caused by incorrect use of a C++ thread pool. If one of the threads in the pool throws an exception (e.g. a timeout), the upper-layer code immediately catches it and runs its GC logic, which destructs the chunk manager that the pool is still using. Because the other threads in the pool have not finished yet, they then hit errors when they access the destructed chunk manager object.

@xiaocai2333 (Contributor):

The fix is to catch exceptions inside the worker threads when using the thread pool, wait for all threads to finish executing, and only then rethrow the exception that occurred.
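
The actual fix lives in Milvus's C++ index-build code, but the pattern it describes is general. Purely as an illustration of that pattern (not Milvus code; the function name and structure are hypothetical), a minimal Python sketch with concurrent.futures:

from concurrent.futures import ThreadPoolExecutor

def run_all_then_raise(tasks, max_workers=8):
    # Submit every task, record exceptions instead of failing fast, and only
    # rethrow after all workers have finished, so shared state (the chunk
    # manager, in the Milvus case) is not torn down while workers still use it.
    errors = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        for future in futures:
            try:
                future.result()           # blocks until this worker is done
            except Exception as exc:      # record it, keep waiting for the rest
                errors.append(exc)
    # All workers have finished here; it is now safe to release shared resources.
    if errors:
        raise errors[0]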

@jaime0815 (Contributor):

/unassign @xiaocai2333
/assign @wangting0128

@wangting0128 (Contributor, Author):

Verification passed.

image: milvus-io-lru-dev-2721816-20240507

sre-ci-robot pushed a commit that referenced this issue on May 23, 2024
xiaocai2333 added a commit to xiaocai2333/milvus that referenced this issue on May 23, 2024
sre-ci-robot pushed a commit that referenced this issue on May 23, 2024:
…ished (#32810) (#33314)
issue: #32487
master pr: #32810
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>