[docs] update docs#89
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the English and Chinese README files by removing environment-specific comments and adding a new 'Minimal forward example' section. This section provides a Python script demonstrating how to perform a forward pass and compute loss using Mcore-Bridge. The review feedback correctly identifies a bug in the example code: the get_mcore_model function returns a list of models, so attempting to call .cuda() directly on the list or passing it to the forward function will result in an AttributeError. The reviewer suggests extracting the first model from the list to resolve this issue.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds a 'Minimal forward example' section to both the English and Chinese README files, providing a Python script that demonstrates how to perform a forward pass and compute loss using Mcore-Bridge. The review feedback identifies several necessary improvements to the example code: replacing torch.roll with a proper concatenation and padding approach for label shifting in causal language modeling, improving the robustness of the LOCAL_RANK environment variable retrieval, and correcting the parallel configuration parameters to ensure the script runs correctly on the specified two-GPU setup.
| mg_inputs.pop(key, None) | ||
| if template.padding_free: | ||
| mg_inputs['packed_seq_params'] = get_packed_seq_params(text_position_ids) | ||
| mg_inputs['labels'] = torch.roll(mg_inputs['labels'], -1, dims=-1) |
There was a problem hiding this comment.
Using torch.roll to shift labels for causal language modeling is problematic because it wraps the first token's label to the last position. This results in an incorrect loss calculation for the final token of the sequence. It is better to shift the labels and pad the last position with the ignore index (-100).
| mg_inputs['labels'] = torch.roll(mg_inputs['labels'], -1, dims=-1) | |
| mg_inputs['labels'] = torch.cat([mg_inputs['labels'][..., 1:], torch.full_like(mg_inputs['labels'][..., :1], -100)], dim=-1) |
| return loss.sum() / loss_mask.sum() | ||
|
|
||
|
|
||
| torch.cuda.set_device(f"cuda:{os.getenv('LOCAL_RANK')}") |
There was a problem hiding this comment.
os.getenv('LOCAL_RANK') returns a string and can be None if the environment variable is missing. It is safer to use os.environ.get with a default value and cast it to an integer, which is the expected type for torch.cuda.set_device.
| torch.cuda.set_device(f"cuda:{os.getenv('LOCAL_RANK')}") | |
| torch.cuda.set_device(int(os.environ.get('LOCAL_RANK', 0))) |
|
|
||
| torch.cuda.set_device(f"cuda:{os.getenv('LOCAL_RANK')}") | ||
| dist.init_process_group(backend='nccl') | ||
| TP, PP, EP, ETP = 2, 1, 2, 1 |
There was a problem hiding this comment.
The parallel configuration TP=2, PP=1, EP=2 requires at least 4 GPUs because the expert model parallel size (EP) must be less than or equal to the data parallel size (DP), and with 2 GPUs and TP=2, the DP size is 1. To match the execution command provided earlier (--nproc_per_node=2), EP should be set to 1.
| TP, PP, EP, ETP = 2, 1, 2, 1 | |
| TP, PP, EP, ETP = 2, 1, 1, 1 |
| mg_inputs.pop(key, None) | ||
| if template.padding_free: | ||
| mg_inputs['packed_seq_params'] = get_packed_seq_params(text_position_ids) | ||
| mg_inputs['labels'] = torch.roll(mg_inputs['labels'], -1, dims=-1) |
There was a problem hiding this comment.
Using torch.roll to shift labels for causal language modeling is problematic because it wraps the first token's label to the last position. This results in an incorrect loss calculation for the final token of the sequence. It is better to shift the labels and pad the last position with the ignore index (-100).
| mg_inputs['labels'] = torch.roll(mg_inputs['labels'], -1, dims=-1) | |
| mg_inputs['labels'] = torch.cat([mg_inputs['labels'][..., 1:], torch.full_like(mg_inputs['labels'][..., :1], -100)], dim=-1) |
| return loss.sum() / loss_mask.sum() | ||
|
|
||
|
|
||
| torch.cuda.set_device(f"cuda:{os.getenv('LOCAL_RANK')}") |
There was a problem hiding this comment.
os.getenv('LOCAL_RANK') returns a string and can be None if the environment variable is missing. It is safer to use os.environ.get with a default value and cast it to an integer, which is the expected type for torch.cuda.set_device.
| torch.cuda.set_device(f"cuda:{os.getenv('LOCAL_RANK')}") | |
| torch.cuda.set_device(int(os.environ.get('LOCAL_RANK', 0))) |
|
|
||
| torch.cuda.set_device(f"cuda:{os.getenv('LOCAL_RANK')}") | ||
| dist.init_process_group(backend='nccl') | ||
| TP, PP, EP, ETP = 2, 1, 2, 1 |
There was a problem hiding this comment.
The parallel configuration TP=2, PP=1, EP=2 requires at least 4 GPUs because the expert model parallel size (EP) must be less than or equal to the data parallel size (DP), and with 2 GPUs and TP=2, the DP size is 1. To match the execution command provided earlier (--nproc_per_node=2), EP should be set to 1.
| TP, PP, EP, ETP = 2, 1, 2, 1 | |
| TP, PP, EP, ETP = 2, 1, 1, 1 |
No description provided.