[WIP] Initial DeepSeek reference implementation #861
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
export GBS=1024
# Dataloader: Micro batch size
export MBS=1
export MAX_LR="2e-4"
grad_accumulation_steps = mini_batch_size // args.mbs
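For context, the relationship between the batch-size knobs above can be sketched as follows. This is a minimal illustration, assuming the usual MLPerf convention that the per-rank mini batch is the global batch divided by the data-parallel size; `data_parallel_size` and the `128`-rank example are assumptions, not values from the PR.

```python
# Sketch of how gradient-accumulation steps fall out of GBS and MBS.
# data_parallel_size is an assumed extra input, not set in the script above.
def grad_accumulation_steps(gbs: int, mbs: int, data_parallel_size: int) -> int:
    # Samples processed per data-parallel rank per optimizer step.
    mini_batch_size = gbs // data_parallel_size
    assert mini_batch_size % mbs == 0, "micro batch must divide per-rank batch"
    return mini_batch_size // mbs

# With the values above (GBS=1024, MBS=1) on e.g. 128 data-parallel ranks:
print(grad_accumulation_steps(1024, 1, 128))  # -> 8
```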
logging_configs = {
    mllogger.constants.SEED: args.seed,
We need more mllog events, e.g.:

self.mllogger.event(
    key=constants.SUBMISSION_BENCHMARK,
    value=self.submission_info["submission_benchmark"],
)

Detailed list: https://github.com/mlcommons/training/blob/master/llama2_70b_lora/scripts/mlperf_logging_utils.py
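For reference, the general shape of those events can be sketched with a stand-in logger. This is not the real `mlperf_logging` API (the real `mllogger.event()` lives in that package); the keys and values below are illustrative only — see the linked `mlperf_logging_utils.py` for the actual list.

```python
import json

# Hypothetical stand-in for mllogger.event(); illustrative only.
def log_event(key: str, value) -> str:
    record = {"event_type": "POINT_IN_TIME", "key": key, "value": value}
    return json.dumps(record)

# A few of the submission-metadata events the review asks for
# (values here are placeholders, not the benchmark's real metadata):
print(log_event("submission_benchmark", "deepseek_v3"))
print(log_event("submission_division", "closed"))
print(log_event("seed", 1234))
```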
Added the missing events (all of them, I think). SUBMISSION_BENCHMARK is logged above with mllogger.mlperf_submission_log(bmark).
@@ -0,0 +1,16 @@
git+https://github.com/denys-fridman/logging.git@dfridman/deepseek-v3  # TODO(dfridman): revert to main repo once merged
I think we need this reverted before merging. There is another TODO in the PR.
Right. I'll update after mlcommons/logging#445 is merged
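For context, the revert would presumably just swap the fork pin in requirements.txt back to the upstream repo; the exact ref is TBD until mlcommons/logging#445 merges, so this is only a sketch:

```
# Hypothetical post-merge pin; actual ref/tag to be decided once
# mlcommons/logging#445 lands.
git+https://github.com/mlcommons/logging.git@main
```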
pip install -e .

## 2. Megatron-bridge and megatron-core
ARG MBRIDGE_REVISION=main
Can we pin this like NEMORUN_REVISION?
…onfig learning rates
#### Run model conversion
Assuming the HuggingFace checkpoint has been downloaded to a `<SRC_PATH>` directory, it must be converted to Megatron-Bridge format before training. After conversion is done, set `MODEL_CKPT=<DST_PATH>` when launching the job.
Add a section on the typical expected runtime, and share the reference hardware used along with the number of nodes.
…0-iteration warm-up
…version (TODO: script)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -0,0 +1,606 @@
# Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.
Move everything to llm_moe_pretraining folder
Updated instructions for using the repository and downloading checkpoints.
Updated README to clarify GBS requirements and evaluation process.
No description provided.