use last instead of first bacalhau execution #913

thetechnocrat-dev · 2024-03-08T17:29:09Z

What type of PR is this?

🐛 Bug Fix

Description

I noticed

Error listing files in directory: ls: invalid path "": invalid ipfs pathunexpected error monitoring running jobs: ls: invalid path "": invalid ipfs path

in the logs, and jobs were not processing. I realized this was because some jobs had multiple Bacalhau executions, and the first was always a bid-rejected capacity error. For these cases we should always look at the most recent execution.

Example

State:
  CreateTime: "2024-03-08T15:46:41.301528202Z"
  Executions:
  - ComputeReference: e-46b17748-0ac3-4181-a252-1fe5e78bdc38
    CreateTime: "2024-03-08T15:46:41.308035083Z"
    DesiredState: 2
    JobID: 24df3836-3ad8-4402-91fa-b779595b7528
    NodeId: QmPui6hPRoktGhteDRUSNzrYceEc3R52nZqp82nbd4Kjiy
    PublishedResults: {}
    State: AskForBidRejected
    Status: 'this node does not have capacity to run the job ({CPU: 0.400000, Memory:
      2.5 GB, Disk: 1.7 TB, GPU: 0} requested but only {%!s(float64=3) %!s(uint64=12000000000)
      %!s(uint64=323702) %!s(uint64=1) []} is available). bid rejected'
    UpdateTime: "2024-03-08T15:46:41.430322833Z"
    Version: 3
  - ComputeReference: e-5cbdb18f-6390-49c6-8756-b7132759d9ba
    CreateTime: "2024-03-08T15:46:41.435599426Z"
    DesiredState: 2
    JobID: 24df3836-3ad8-4402-91fa-b779595b7528
    NodeId: QmVakTbjsKHKho6svUTw5Q5yqbojrhbrAAvcJyCscxyLwa
    PublishedResults:
      CID: QmUCD2RKAd8Q8CR8hmixFExGkshcFa6briXqveNeDw44Zu
      StorageSource: ipfs
    RunOutput:
      exitCode: 0
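
A minimal sketch of the idea, not the actual change in this PR: given a parsed `State` like the one above, read the published CID from the last execution rather than the first. The helper names (`first_cid`, `latest_cid`) and the trimmed-down dictionary are hypothetical, and the sketch assumes `Executions` is ordered oldest to newest, as it is in the example. Reading the first execution yields no CID here, which is presumably where the empty path in the `ls: invalid path ""` error comes from.

```python
from typing import Any, Dict, Optional

# Hypothetical, trimmed-down copy of the State shown above (only the fields used here).
state: Dict[str, Any] = {
    "Executions": [
        {
            "ComputeReference": "e-46b17748-0ac3-4181-a252-1fe5e78bdc38",
            "State": "AskForBidRejected",
            "PublishedResults": {},
        },
        {
            "ComputeReference": "e-5cbdb18f-6390-49c6-8756-b7132759d9ba",
            "PublishedResults": {
                "CID": "QmUCD2RKAd8Q8CR8hmixFExGkshcFa6briXqveNeDw44Zu",
                "StorageSource": "ipfs",
            },
        },
    ]
}


def _cid_of(execution: Dict[str, Any]) -> Optional[str]:
    """Return the published CID of one execution, or None if it is missing or empty."""
    published = execution.get("PublishedResults") or {}
    return published.get("CID") or None


def first_cid(job_state: Dict[str, Any]) -> Optional[str]:
    """Old behaviour (assumed): look only at the first execution."""
    executions = job_state.get("Executions") or []
    return _cid_of(executions[0]) if executions else None


def latest_cid(job_state: Dict[str, Any]) -> Optional[str]:
    """Proposed behaviour: look at the most recent (last) execution."""
    executions = job_state.get("Executions") or []
    return _cid_of(executions[-1]) if executions else None


print(first_cid(state))   # None: the rejected bid published nothing
print(latest_cid(state))  # QmUCD2RKAd8Q8CR8hmixFExGkshcFa6briXqveNeDw44Zu
```

Returning `None` instead of an empty string lets the caller skip the IPFS listing explicitly when no result has been published yet.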

acashmoney · 2024-03-11T23:16:25Z

Queued up 100 labsay jobs. All 100 Bacalhau jobs succeeded; however, only 94/100 initially showed as succeeded on the app frontend. The other 6 stayed in the Running state indefinitely.

For 2 of the 6 stalled jobs, I received the same error that @thetechnocrat-dev reported:

Error listing files in directory: ls: invalid path "": invalid ipfs pathunexpected error monitoring running jobs: ls: invalid path "": invalid ipfs path

Marking the 2 jobs as Failed allowed the other 4 jobs to process successfully, resulting in a final 98/100 success rate.

The 6 stalled jobs seem to have coincided with a scale-up from 1 CPU node to 3. Unexpected behavior of the jobs' NodeIds seems to contribute. Here is one of the 2 problematic jobs, which stalled despite a successful Bacalhau run:

 bacalhau describe 8e60e076-bc7a-4548-b4d5-93e943a171d7
Job:
  ...
State:
  CreateTime: "2024-03-11T21:55:57.362211708Z"
  Executions:
  - ComputeReference: e-be6e206b-9371-4fc0-833b-80183920a382
    CreateTime: "2024-03-11T21:55:57.368546836Z"
    DesiredState: 2
    JobID: 8e60e076-bc7a-4548-b4d5-93e943a171d7
    NodeId: QmQe4oJUqqCLfK2kbgT8omeufYcB837ryRTHcDpdtsDFrj
    PublishedResults:
      CID: QmWfw7axWtYSUk4XWBvFDvAG3fbcneFyYRWDavthkLWMz7
      StorageSource: ipfs
    RunOutput:
      exitCode: 0
      runnerError: ""
      stderr: ""
      stderrtruncated: false
      stdout: "Job Inputs: {'file_example': '/inputs/file_example/result.txt', 'number_example':
        54, 'speedup': True, 'string_example': '3hello world'}\n\n                                        @\n
        \                                @@@@@@@@@@@@@@@\n                               @@@@@@@@@@@@@@@@@@@\n
        \                             @@@@@@@@@@@@@@@@@@@@@\n             @@@@@@@@@@
        \     @@@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@\n           @@@@@@@@@@@@      @@@@@@@@@@@@@@@@@@@@@@@
        \     @@@@@@@@@@@@\n         @@@@@@@@@@@@@@      @@@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@@@@@\n
        \       *@@@@@@@@@@@@@      @@@@@@@@@@@@@@@@@@@@@@         @@@@@@@@@@@@@\n
        \        @@@@@@@@@@        @@@@@@@@@@@@@@@@@@@@@%            &@@@@@@@@@@\n
        \          @@@@           @@@@@@@@@@@@@@@@@@&                     @@@@\n                        @@@@@@@@\n
        \                  @@@@@@@@@\n      @@@@@@@@@@@@@@@@@@@@        ,@@@@@@@@@@@
        \                @@@@@@@@@@@@\n   @@@@@@@@@@@@@@@@@@@@@@       @@@@@@@@@@@@@@@@@
        \          @@@@@@@@@@@@@@@@@@\n  @@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@@@@@@@@@@@@
        \      @@@@@@@@@@@@@@@@@@@@@\n @@@@@@@@@@@@@@@@@@@@@@@     @@@@@@@@@@@@@@@@@@@@@@@
        \     @@@@@@@@@@@@@@@@@@@@@@\n@@@@@@@@@@@@@@@@@@@@@@@@     @@@@@@@@@@@@@@@@@@@@@@@
        \    @@@@@@@@@@@@@@@@@@@@@@@\n @@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@@@@@@@@@@@@@@
        \    @@@@@@@@@@@@@@@@@@@@@@@\n  @@@@@@@@@@@@@@@@@@@@@       @@@@@@@@@@@@@@@@@@@@@
        \     @@@@@@@@@@@@@@@@@@@@@@\n   @@@@@@@@@@@@@@@@@@           @@@@@@@@@@@@@@@@@
        \      @@@@@@@@@@@@@@@@@@@@@@\n      @@@@@@@@@@@@                 @@@@@@@@@@@
        \        @@@@@@@@@@@@@@@@@@@@\n                                                     @@@@@@@@@\n
        \                                                @@@@@@@@\n           @@@@
        \                    &@@@@@@@@@@@@@@@@@@           @@@@\n         @@@@@@@@@@
        \            @@@@@@@@@@@@@@@@@@@@@        &@@@@@@@@@@\n        *@@@@@@@@@@@@@
        \       @@@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@@@@\n         @@@@@@@@@@@@@@
        \     @@@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@@@@@\n           @@@@@@@@@@@@
        \     @@@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@@@\n             @@@@@@@@@@      "
      stdouttruncated: true
    State: Completed
    Status: . execution completed
    UpdateTime: "2024-03-11T21:56:10.569228513Z"
    Version: 6
  JobID: 8e60e076-bc7a-4548-b4d5-93e943a171d7
  State: Completed
  TimeoutAt: "2024-03-14T21:55:57.362211708Z"
  UpdateTime: "2024-03-11T21:56:10.576574822Z"
  Version: 3

The `bacalhau describe` output shows a completed job with published results that can be inspected successfully on IPFS. However, it reports NodeId: QmQe4oJUqqCLfK2kbgT8omeufYcB837ryRTHcDpdtsDFrj, and this NodeId does not appear to be valid in the compute cluster:

bacalhau node describe QmQe4oJUqqCLfK2kbgT8omeufYcB837ryRTHcDpdtsDFrj
could not get node QmQe4oJUqqCLfK2kbgT8omeufYcB837ryRTHcDpdtsDFrj: Unexpected response code: 500 ({
  "error": "nodeInfo not found for nodeID: QmQe4oJUqqCLfK2kbgT8omeufYcB837ryRTHcDpdtsDFrj",
  "message": "Internal Server Error"
})

bacalhau node list
 ID        TYPE       LABELS                                              CPU     MEMORY      DISK         GPU
 QmPUc2aE  Requester  Architecture=amd64 Operating-System=linux
                      git-lfs=False owner=labdao
 QmQnWc21  Compute    Architecture=amd64 Operating-System=linux           3.2 /   12.3 GB /   768.6 GB /   0 /
                      git-lfs=False instance-id=i-00b2a63d65e16f212       3.2     12.3 GB     768.6 GB     0
                      instance-type=m5.xlarge node-type=cpu owner=labdao
 QmSMhNDD  Compute    Architecture=amd64 Operating-System=linux           3.2 /   12.1 GB /   769.2 GB /   0 /
                      git-lfs=False instance-id=i-0effc9f20d6d54602       3.2     12.1 GB     769.2 GB     0
                      instance-type=m5.xlarge node-type=cpu owner=labdao
 QmXYPp65  Compute    Architecture=amd64 Operating-System=linux           3.2 /   12.3 GB /   769.4 GB /   0 /
                      git-lfs=False instance-id=i-0fb5abee9bdd5d7fb       3.2     12.3 GB     769.4 GB     0
                      instance-type=m5.xlarge node-type=cpu owner=labdao
 QmcwRQbD  Compute    Architecture=amd64 Operating-System=linux           3.2 /   12.3 GB /   769.3 GB /   0 /
                      git-lfs=False instance-id=i-03031c023c7ec89c8       3.2     12.3 GB     769.3 GB     0
                      instance-type=m5.xlarge node-type=cpu owner=labdao
 QmdoAGf9  Compute    Architecture=amd64 Operating-System=linux           3.2 /   12.1 GB /   768.7 GB /   0 /
                      git-lfs=False instance-id=i-0b2fe586a323789b0       3.2     12.1 GB     768.7 GB     0
                      instance-type=m5.xlarge node-type=cpu owner=labdao
 QmeGfoaw  Compute    Architecture=amd64 Operating-System=linux           3.2 /   12.3 GB /   771.2 GB /   0 /
                      git-lfs=False instance-id=i-0cf245a18f4ffa88b       3.2     12.3 GB     771.2 GB     0
                      instance-type=m5.xlarge node-type=cpu owner=labdao

This suggests that when autoscaling up, NodeId values sometimes change, which stalls the queue. Anecdotally, similar behavior seems to have occurred during previous scale-ups.
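
The check done by hand above can be sketched as follows. This is only an illustration: `is_known_node` is a hypothetical helper, and it assumes the short IDs printed by `bacalhau node list` are prefixes of the full NodeId, as they appear to be in the output above.

```python
from typing import Iterable

# Short node IDs copied from the `bacalhau node list` output above.
LISTED_NODE_IDS = [
    "QmPUc2aE",
    "QmQnWc21",
    "QmSMhNDD",
    "QmXYPp65",
    "QmcwRQbD",
    "QmdoAGf9",
    "QmeGfoaw",
]


def is_known_node(node_id: str, listed_ids: Iterable[str]) -> bool:
    """Return True if any listed (possibly truncated) ID is a prefix of node_id."""
    # Skip empty entries: an empty string is a prefix of everything.
    return any(short and node_id.startswith(short) for short in listed_ids)


# NodeId recorded on the stalled job's execution.
stalled_node = "QmQe4oJUqqCLfK2kbgT8omeufYcB837ryRTHcDpdtsDFrj"
print(is_known_node(stalled_node, LISTED_NODE_IDS))  # False
```

A monitor could log executions whose NodeId fails this check, which would make it easier to confirm whether the stalls line up with scale-up events.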

supraja-968 · 2024-08-01T13:39:13Z

closing this PR as it is not relevant anymore

use last instead of first bacalhau execution

c9b619d

thetechnocrat-dev temporarily deployed to ci March 8, 2024 17:29 — with GitHub Actions Inactive

thetechnocrat-dev requested a review from acashmoney March 11, 2024 15:30

supraja-968 closed this Aug 1, 2024
