Skip to content

[docs] update docs#89

Merged
Jintao-Huang merged 2 commits into
modelscope:mainfrom
Jintao-Huang:update_docs_0521
May 21, 2026
Merged

[docs] update docs#89
Jintao-Huang merged 2 commits into
modelscope:mainfrom
Jintao-Huang:update_docs_0521

Conversation

@Jintao-Huang
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the English and Chinese README files by removing environment-specific comments and adding a new 'Minimal forward example' section. This section provides a Python script demonstrating how to perform a forward pass and compute loss using Mcore-Bridge. The review feedback correctly identifies a bug in the example code: the get_mcore_model function returns a list of models, so attempting to call .cuda() directly on the list or passing it to the forward function will result in an AttributeError. The reviewer suggests extracting the first model from the list to resolve this issue.

Comment thread README.md Outdated
Comment thread README_zh.md Outdated
@Jintao-Huang
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a 'Minimal forward example' section to both the English and Chinese README files, providing a Python script that demonstrates how to perform a forward pass and compute loss using Mcore-Bridge. The review feedback identifies several necessary improvements to the example code: replacing torch.roll with a proper concatenation and padding approach for label shifting in causal language modeling, improving the robustness of the LOCAL_RANK environment variable retrieval, and correcting the parallel configuration parameters to ensure the script runs correctly on the specified two-GPU setup.

Comment thread README.md
mg_inputs.pop(key, None)
if template.padding_free:
mg_inputs['packed_seq_params'] = get_packed_seq_params(text_position_ids)
mg_inputs['labels'] = torch.roll(mg_inputs['labels'], -1, dims=-1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using torch.roll to shift labels for causal language modeling is problematic because it wraps the first token's label to the last position. This results in an incorrect loss calculation for the final token of the sequence. It is better to shift the labels and pad the last position with the ignore index (-100).

Suggested change
mg_inputs['labels'] = torch.roll(mg_inputs['labels'], -1, dims=-1)
mg_inputs['labels'] = torch.cat([mg_inputs['labels'][..., 1:], torch.full_like(mg_inputs['labels'][..., :1], -100)], dim=-1)

Comment thread README.md
return loss.sum() / loss_mask.sum()


torch.cuda.set_device(f"cuda:{os.getenv('LOCAL_RANK')}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

os.getenv('LOCAL_RANK') returns a string and can be None if the environment variable is missing. It is safer to use os.environ.get with a default value and cast it to an integer, which is the expected type for torch.cuda.set_device.

Suggested change
torch.cuda.set_device(f"cuda:{os.getenv('LOCAL_RANK')}")
torch.cuda.set_device(int(os.environ.get('LOCAL_RANK', 0)))

Comment thread README.md

torch.cuda.set_device(f"cuda:{os.getenv('LOCAL_RANK')}")
dist.init_process_group(backend='nccl')
TP, PP, EP, ETP = 2, 1, 2, 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The parallel configuration TP=2, PP=1, EP=2 requires at least 4 GPUs because the expert model parallel size (EP) must be less than or equal to the data parallel size (DP), and with 2 GPUs and TP=2, the DP size is 1. To match the execution command provided earlier (--nproc_per_node=2), EP should be set to 1.

Suggested change
TP, PP, EP, ETP = 2, 1, 2, 1
TP, PP, EP, ETP = 2, 1, 1, 1

Comment thread README_zh.md
mg_inputs.pop(key, None)
if template.padding_free:
mg_inputs['packed_seq_params'] = get_packed_seq_params(text_position_ids)
mg_inputs['labels'] = torch.roll(mg_inputs['labels'], -1, dims=-1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using torch.roll to shift labels for causal language modeling is problematic because it wraps the first token's label to the last position. This results in an incorrect loss calculation for the final token of the sequence. It is better to shift the labels and pad the last position with the ignore index (-100).

Suggested change
mg_inputs['labels'] = torch.roll(mg_inputs['labels'], -1, dims=-1)
mg_inputs['labels'] = torch.cat([mg_inputs['labels'][..., 1:], torch.full_like(mg_inputs['labels'][..., :1], -100)], dim=-1)

Comment thread README_zh.md
return loss.sum() / loss_mask.sum()


torch.cuda.set_device(f"cuda:{os.getenv('LOCAL_RANK')}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

os.getenv('LOCAL_RANK') returns a string and can be None if the environment variable is missing. It is safer to use os.environ.get with a default value and cast it to an integer, which is the expected type for torch.cuda.set_device.

Suggested change
torch.cuda.set_device(f"cuda:{os.getenv('LOCAL_RANK')}")
torch.cuda.set_device(int(os.environ.get('LOCAL_RANK', 0)))

Comment thread README_zh.md

torch.cuda.set_device(f"cuda:{os.getenv('LOCAL_RANK')}")
dist.init_process_group(backend='nccl')
TP, PP, EP, ETP = 2, 1, 2, 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The parallel configuration TP=2, PP=1, EP=2 requires at least 4 GPUs because the expert model parallel size (EP) must be less than or equal to the data parallel size (DP), and with 2 GPUs and TP=2, the DP size is 1. To match the execution command provided earlier (--nproc_per_node=2), EP should be set to 1.

Suggested change
TP, PP, EP, ETP = 2, 1, 2, 1
TP, PP, EP, ETP = 2, 1, 1, 1

@Jintao-Huang Jintao-Huang merged commit 2a96684 into modelscope:main May 21, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants