
[Bug]: [benchmark][standalone][LRU] Build index raises error failed to upload index and milvus restarts multiple times #32487

Closed
wangting0128 opened this issue Apr 21, 2024 · 11 comments
Labels: kind/bug, test/benchmark, triage/accepted
Milestone: 2.4.lru

@wangting0128 (Contributor):

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: milvus-io-lru-dev-27cb486-20240419
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0rc66
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: lru-fouramf-nk4fk

server:

NAME                                                              READY   STATUS                            RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
lru-500g-etcd-0                                                   1/1     Running                           0               2d2h    10.104.20.193   4am-node22   <none>           <none>
lru-500g-milvus-standalone-844c594b58-fkpwr                       1/1     Running                           8 (17m ago)     2d2h    10.104.31.236   4am-node34   <none>           <none>
lru-500g-pulsar-bookie-0                                          1/1     Running                           0               2d2h    10.104.20.195   4am-node22   <none>           <none>
lru-500g-pulsar-bookie-1                                          1/1     Running                           0               2d2h    10.104.29.179   4am-node35   <none>           <none>
lru-500g-pulsar-bookie-2                                          1/1     Running                           0               2d2h    10.104.26.180   4am-node32   <none>           <none>
lru-500g-pulsar-bookie-init-h9xz8                                 0/1     Completed                         0               2d2h    10.104.1.4      4am-node10   <none>           <none>
lru-500g-pulsar-broker-0                                          1/1     Running                           1 (21h ago)     2d2h    10.104.1.8      4am-node10   <none>           <none>
lru-500g-pulsar-proxy-0                                           1/1     Running                           0               2d2h    10.104.6.221    4am-node13   <none>           <none>
lru-500g-pulsar-pulsar-init-drpwj                                 0/1     Completed                         0               2d2h    10.104.6.220    4am-node13   <none>           <none>
lru-500g-pulsar-recovery-0                                        1/1     Running                           0               2d2h    10.104.6.222    4am-node13   <none>           <none>
lru-500g-pulsar-zookeeper-0                                       1/1     Running                           0               2d2h    10.104.20.194   4am-node22   <none>           <none>
lru-500g-pulsar-zookeeper-1                                       1/1     Running                           0               2d2h    10.104.29.181   4am-node35   <none>           <none>
lru-500g-pulsar-zookeeper-2                                       1/1     Running                           0               2d2h    10.104.34.8     4am-node37   <none>           <none>

(two screenshots of the server attached, captured 2024-04-22 00:49 and 00:50)

client pod name: lru-fouramf-nk4fk-1027207446
client log:
(screenshot of the client log attached, captured 2024-04-22 00:45)

Expected Behavior

No response

Steps To Reproduce

1. create a collection with 2 fields: "id" (primary key) and "float_vector" (768 dim)
2. build an HNSW index
3. insert 270m rows of data
4. flush the collection
5. build the index again with the same params <- failed (see the pymilvus sketch below)
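
A rough pymilvus sketch of the steps above, for orientation only: the original run went through the fouram benchmark framework with the client config shown below, so the host/port, random data, and loop bounds here are illustrative placeholders, not the actual harness code.

import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="127.0.0.1", port="19530")

schema = CollectionSchema([
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="float_vector", dtype=DataType.FLOAT_VECTOR, dim=768),
])
collection = Collection("fouram_270m", schema, shards_num=1)

index_params = {
    "index_type": "HNSW",
    "metric_type": "L2",
    "params": {"M": 30, "efConstruction": 360},
}
collection.create_index("float_vector", index_params)      # step 2: build HNSW index

batch = 10_000                                              # ni_per in the client config
for start in range(0, 270_000_000, batch):                  # step 3: 270m rows
    ids = list(range(start, start + batch))
    vectors = np.random.random((batch, 768)).astype(np.float32).tolist()
    collection.insert([ids, vectors])

collection.flush()                                          # step 4
collection.create_index("float_vector", index_params)       # step 5: fails in this report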

Milvus Log

No response

Anything else?

client config:

{
     "dataset_params": {
          "dataset_name": "laion1b_nolang",
          "column_name": "float32_vector",
          "dim": 768,
          "dataset_size": "270m",
          "ni_per": 10000,
          "metric_type": "L2",
          "extra_partitions": {
               "partitions": 27,
               "data_repeated": false
          }
     },
     "collection_params": {
          "other_fields": [],
          "collection_name": "fouram_270m",
          "shards_num": 1
     },
     "index_params": {
          "index_type": "HNSW",
          "index_param": {
               "M": 30,
               "efConstruction": 360
          }
     },
     "concurrent_tasks": [
          {
               "type": "search",
               "weight": 1,
               "params": {
                    "top_k": 100,
                    "nq": 10,
                    "search_param": {
                         "ef": 64
                    },
                    "timeout": 3000,
                    "random_data": true
               }
          }
     ],
     "concurrent_params": {
          "interval": 20,
          "during_time": "1h",
          "concurrent_number": 1
     }
}

server config:

    standalone:
      resources:
        limits:
          cpu: '16'
          memory: 32Gi
          ephemeral-storage: 70Gi
        requests:
          cpu: '16'
          memory: 32Gi
      messageQueue: pulsar
      extraEnv:
        - name: LOCAL_STORAGE_SIZE
          value: '70'
      disk:
        size:
          enabled: true
    etcd:
      replicaCount: 1
      metrics:
        enabled: true
        podMonitor:
          enabled: true
    minio:
      enabled: false
    externalS3:
      enabled: true
      accessKey: miniolru500g
      secretKey: miniolru500g
      bucketName: lru-500g
      host: minio-1.minio
      port: 9000
      rootPath: lru500g
    pulsar:
      enabled: true
    metrics:
      serviceMonitor:
        enabled: true
    log:
      level: debug
    extraConfigFiles:
      user.yaml: |+
        queryNode:
          diskCacheCapacityLimit: 51539607552
          mmap:
            mmapEnabled: true
          lazyloadEnabled: true
          useStreamComputing: true
          cache:
            warmup: sync
          lazyloadWaitTimeout: 300000
@wangting0128 added the kind/bug, needs-triage, and test/benchmark labels on Apr 21, 2024
@yanliang567 added the triage/accepted label and removed the needs-triage label on Apr 22, 2024
@yanliang567 added this to the 2.4.lru milestone on Apr 22, 2024
@yanliang567 removed their assignment on Apr 22, 2024
@xiaofan-luan (Collaborator):

I think you are hitting MinIO's limits.
Let's focus this test on S3, since MinIO's QPS is very limited.

@jaime0815 (Contributor):

An upload failure might lead to a crash, which should not happen. Could you help take a look at this? @xiaocai2333

@jaime0815 (Contributor):

The Put operation has high latency; should we deploy MinIO on an NVMe disk?
(screenshot attached)

@wangting0128 (Contributor, Author) commented Apr 22, 2024:

> The Put operation has high latency; should we deploy MinIO on an NVMe disk?

minio:

NAME        READY   STATUS    RESTARTS        AGE   IP              NODE         NOMINATED NODE   READINESS GATES
minio-1-0   1/1     Running   2 (18d ago)     87d   10.104.25.186   4am-node30   <none>           <none>
minio-1-1   1/1     Running   1 (31d ago)     87d   10.104.28.126   4am-node33   <none>           <none>
minio-1-2   1/1     Running   1 (5d23h ago)   87d   10.104.26.226   4am-node32   <none>           <none>
minio-1-3   1/1     Running   1 (17d ago)     87d   10.104.27.213   4am-node31   <none>           <none>

It is already on NVMe disks.

@wangting0128 (Contributor, Author):

Correction: it is a SATA disk, not NVMe.

@xiaofan-luan (Collaborator):

Let's skip this issue. A small MinIO cluster is not designed for frequent save/delete operations.

@xiaocai2333 (Contributor):

This is a potential destruction (object lifetime) issue caused by the failure to upload objects to MinIO.

@xiaocai2333 (Contributor):

The problem is caused by incorrect use of a C++ thread pool. If one of the threads in the pool throws an exception (e.g. a timeout), the upper-layer code immediately catches it and runs its GC logic, which destructs the chunk manager that the pool is still using. Because the other threads in the pool have not finished yet, they then hit errors when they access the destructed chunk manager object.

@xiaocai2333 (Contributor):

The fix is to catch exceptions inside the worker threads when using the thread pool, wait for all threads to finish executing, and only then rethrow the exception that occurred.
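
The actual fix lives in Milvus's C++ index-build code, but the pattern it describes is general. Purely as an illustration of that pattern (not Milvus code; the function name and structure are hypothetical), a minimal Python sketch with concurrent.futures:

from concurrent.futures import ThreadPoolExecutor

def run_all_then_raise(tasks, max_workers=8):
    # Submit every task, record exceptions instead of failing fast, and only
    # rethrow after all workers have finished, so shared state (the chunk
    # manager, in the Milvus case) is not torn down while workers still use it.
    errors = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        for future in futures:
            try:
                future.result()           # blocks until this worker is done
            except Exception as exc:      # record it, keep waiting for the rest
                errors.append(exc)
    # All workers have finished here; it is now safe to release shared resources.
    if errors:
        raise errors[0]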

@jaime0815 (Contributor):

/unassign @xiaocai2333
/assign @wangting0128

@wangting0128 (Contributor, Author):

Verification passed.

image: milvus-io-lru-dev-2721816-20240507

sre-ci-robot pushed a commit that referenced this issue on May 23, 2024
xiaocai2333 added a commit to xiaocai2333/milvus that referenced this issue on May 23, 2024
sre-ci-robot pushed a commit that referenced this issue on May 23, 2024:
…ished (#32810) (#33314)
issue: #32487
master pr: #32810
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>