Fixes for llama 3.1 training #823

sean-smith · 2025-08-24T23:25:43Z

Fixes #821, #820, #824

github-actions · 2025-08-24T23:25:54Z

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

sean-smith · 2025-08-26T00:01:19Z

recheck

ShriyaRishab · 2025-08-26T15:00:11Z

Added comments on the issues, can you please revert the git based changes - the other 2 look good.

sean-smith · 2025-08-26T16:41:44Z

> Added comments on the issues, can you please revert the git based changes - the other 2 look good.

Done ✅

* Remove subpath from pretrain_llama.py
* Install toml package
* Adjust --gres=gpu:8 to number of user specified devices

Signed-off-by: Sean Smith <seasmith@nvidia.com>
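The last item in the commit message (replacing the hardcoded `--gres=gpu:8` with the user-specified device count) could look roughly like the sketch below. This is only an illustration, not the code from this PR: `slurm_gres` and its `gpus_per_node` parameter are hypothetical names, and the only detail taken from the diff is the `gpu:8` format of the `gres` value.

```python
def slurm_gres(gpus_per_node: int) -> str:
    """Build a Slurm gres value such as "gpu:4" from a user-specified GPU count.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    if gpus_per_node < 1:
        raise ValueError(f"gpus_per_node must be >= 1, got {gpus_per_node}")
    return f"gpu:{gpus_per_node}"


if __name__ == "__main__":
    # The previously hardcoded value corresponds to 8 GPUs per node.
    print(slurm_gres(8))
    # A user requesting 4 GPUs per node would get a matching gres string.
    print(slurm_gres(4))
```

The returned string would then be passed wherever `gres="gpu:8"` used to be hardcoded, so that a node with fewer than 8 GPUs is not asked for more devices than it has.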

ShriyaRishab · 2025-08-26T17:27:40Z

large_language_model_pretraining/nemo/pretrain_llama31.py

        mem="0",
        exclusive=True,
-        gres="gpu:8",
-        packager=run.GitArchivePackager(subpath="large_language_model_pretraining/nemo", ref="HEAD"),


Can you please revert packager=run.GitArchivePackager(subpath="large_language_model_pretraining/nemo", ref="HEAD"), as well?

I can; however, I don't understand who is able to run this. If you follow the instructions in the README, this will fail because the path is wrong.

Maybe this will work if you move the Dockerfile to the root directory and build there, i.e.:

    cp Dockerfile ../..
    cd ../..
    docker build -t nemo .
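One possible reading of the failure described here is that `subpath="large_language_model_pretraining/nemo"` only makes sense when the job is launched from a checkout whose root actually contains that directory. The sketch below checks that condition with plain `pathlib`. It assumes the packager resolves `subpath` relative to the git repository root, which this thread does not confirm, and `find_repo_root` and `subpath_is_packageable` are hypothetical helpers, not part of NeMo-Run.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path) -> Optional[Path]:
    """Walk upward from start and return the first directory containing .git."""
    for candidate in (start, *start.parents):
        if (candidate / ".git").exists():
            return candidate
    return None


def subpath_is_packageable(start: Path, subpath: str) -> bool:
    """Return True if subpath exists as a directory under the enclosing repo root.

    Assumption: the packager interprets subpath relative to the repository root.
    """
    root = find_repo_root(start)
    return root is not None and (root / subpath).is_dir()


if __name__ == "__main__":
    ok = subpath_is_packageable(Path.cwd(), "large_language_model_pretraining/nemo")
    print("subpath found under repo root" if ok else "subpath not found under repo root")
```

Under that assumption, building and running from inside `large_language_model_pretraining/nemo` alone, without the surrounding repository, would make the check fail, which matches the suggestion to build from the root directory instead.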

sean-smith requested a review from a team as a code owner August 24, 2025 23:25

sean-smith changed the title ~~Setup git dir in /workspace/llama31~~ Fixes for llama 3.1 training Aug 24, 2025

sean-smith force-pushed the master branch from ea013a7 to 72e75a3 Compare August 26, 2025 00:11

sean-smith force-pushed the master branch from 72e75a3 to c1ae48e Compare August 26, 2025 16:40

If merged this commit does the following:

f0aaecc

* Remove subpath from pretrain_llama.py
* Install toml package
* Adjust --gres=gpu:8 to number of user specified devices

Signed-off-by: Sean Smith <seasmith@nvidia.com>

sean-smith force-pushed the master branch from c1ae48e to f0aaecc Compare August 26, 2025 16:47

ShriyaRishab reviewed Aug 26, 2025
