Conversation

Could you share a toy user submission using rocSHMEM as well? Just wanna get a sense of what things will look like e2e.

Also @saienduri to sanity check.
Vibe coded this, but it's gonna look similar to HIP kernels in Python.

Looks good to me. Starting a test docker build here to check status: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17545534459.

ooo! Looks like there is some issue with UCX. I'll debug it today!

@saienduri I made some changes, but I'm not sure if it works. Is there a way to test the workflow without approval? I don't have an MI300X to test on 😅

Thanks, trying a build here now: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17701378282. You can also try building the docker locally just to see if the build passes.
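For the local check, something along these lines should work. Note the Dockerfile path and image tag below are placeholders I've assumed for illustration, not the repo's actual layout:

```shell
# Build the image locally to confirm the Dockerfile itself builds cleanly;
# the -f path and -t tag are placeholders, adjust them to the repo layout.
docker build -t discord-cluster-amd:test -f docker/amd/Dockerfile .
```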

Cool, the build passed and a sanity test passed here: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17702258708

@saienduri added one, lmk if it works!

Hmm, getting

You want the example working with load_inline in PyTorch.

done but idk if it works 😬

@saienduri can we test the provided payload example on the server directly? If it's fine, then we should be good to merge.

OK, running the payload in GitHub Actions yielded the following (https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17790562194): I think it will be the same error on the server itself as well.

Pushed a commit to fix the import issue.

Ok, I'll test this on RunPod and push a working version. Apologies for all the back and forth!

@chivatam Hi, I don't have permission to push commits directly to your repo, so I corrected your payload; you can refer to that. Just use extra_ldflags instead.
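For context, here is a minimal sketch of what a load_inline payload with that fix might look like. The C++ source, module name, and the -L/-l values below are hypothetical placeholders, not the actual payload from the PR; the point is just the extra_ldflags keyword (plural), which is the parameter load_inline actually accepts:

```python
# Hypothetical load_inline sketch; the source string, module name, and
# linker flag values below are placeholders, not the real payload.
cpp_source = """
#include <torch/extension.h>

torch::Tensor add_one(torch::Tensor x) {
    return x + 1;
}
"""

def build_module():
    # Deferred import so the sketch can be read without a torch/ROCm setup.
    from torch.utils.cpp_extension import load_inline

    return load_inline(
        name="rocshmem_example",        # placeholder module name
        cpp_sources=cpp_source,
        functions=["add_one"],
        # "extra_ldflags" (not "extra_ldflag") is the correct keyword;
        # the path and library name here are assumed for rocSHMEM.
        extra_ldflags=["-L/opt/rocm/lib", "-lrocshmem"],
    )

if __name__ == "__main__":
    # Requires a working HIP/ROCm toolchain with rocSHMEM installed.
    mod = build_module()
```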

@saienduri hi Sai, could you please replace the current one with mine above and trigger the test again? Thanks!

@danielhua23 just gave you write access as well.

Latest log @danielhua23:

You can always trigger a run as you make changes like this (make sure to select the same branch and runner name). After it runs, you can download the artifacts. Also, if you have access to an MI3x server, you can use this docker image. Just want to make sure I'm not slowing y'all down here :)

Big thanks for your tutorials, Sai, I'll give it a try!

Currently the new payload with the new docker works well on my local MI3x machines, but how do I trigger a job with a new docker image built from the new dockerfile? I already pinged Sai; if you guys have solutions, you can also help! Thanks!

@danielhua23 for the dockerfile, you can publish a new one here: https://github.com/gpu-mode/discord-cluster-manager/actions/workflows/publish_amd_docker.yml Just link it to your branch, and my understanding is @saienduri's infra should automatically pick it up.


Description
Added rocSHMEM dependencies to the dockerfile.
@msaroufim