[fx/profiler] debug the fx.profiler / add an example test script for fx.profiler #1730
Conversation
colossalai/fx/profiler/memory.py
Outdated
```diff
         return n.target in [torch.nn.functional.relu, torch.nn.functional.softmax]
     elif n.op == 'call_module':
-        return type(n.graph.owning_module.get_submodule(n.target)) in [torch.nn.ReLU]
+        return type(n.graph.owning_module.get_submodule(n.target)) in [torch.nn.ReLU, torch.nn.Softmax]
```
As I mentioned, softmax is another relu_like node.
I think this is going to confuse future maintainers: what is a relu_like node?
I will think about this later. Unfortunately, I am also confused lol. Maybe I should add this to constants.py?
At least you should define this category of ops so that its semantics are clear.
Okay, I will add some comments there.
```python
    return forward_mem, param_mem


@pytest.mark.skip("Test for performance, no need for CI")
```
Mark this with colossalai.testing.run_on_environment_flag so that we can still run the test in the future without code changes.
Should it be colossalai.testing.run_on_environment_flag('fx.profiler')?
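If the suggestion is adopted, the change might look roughly like this. A minimal sketch only: the flag name `FX_PROFILER` and the `name=` keyword are assumptions, since the decorator's exact signature isn't shown in this thread.

```python
from colossalai.testing import run_on_environment_flag

# Assumed usage: the decorator skips the test unless the named environment
# flag is set, so the test stays runnable without code changes.
# 'FX_PROFILER' and the `name=` keyword are hypothetical here.
@run_on_environment_flag(name='FX_PROFILER')
def test_fx_profiler_performance():
    ...  # the performance test body from this PR
```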
```python
    del model, gm


@pytest.mark.skip("Test for performance, no need for CI")
```
This is the same.
colossalai/fx/profiler/profiler.py
Outdated
```python
if target in RELU_LIKE_OPS:
    do_not_cache = True
```
Here I also made some modifications.
"""Check if a node is a ReLU-like node. | ||
ReLU-like nodes have the following properties: | ||
- They are either `call_function` or `call_module` | ||
- Their output tensors are directly saved for backward | ||
- Their input tensors are not saved for backward | ||
|
||
An example is `torch.nn.functional.softmax` which has (forward + backward): | ||
def forward(self, input_2): | ||
_softmax_default = torch.ops.aten._softmax.default(input_2, None, None); input_2 = None | ||
zeros_like_default = torch.ops.aten.zeros_like.default(_softmax_default, dtype = None, layout = None, device = None, pin_memory = None) | ||
detach_default = torch.ops.aten.detach.default(_softmax_default); _softmax_default = None | ||
_softmax_backward_data_default = torch.ops.aten._softmax_backward_data.default(zeros_like_default, detach_default, None, None); zeros_like_default = detach_default = None | ||
detach_default_1 = torch.ops.aten.detach.default(_softmax_backward_data_default); _softmax_backward_data_default = None | ||
detach_default_2 = torch.ops.aten.detach.default(detach_default_1); detach_default_1 = None | ||
|
||
Args: | ||
n (Node): A node from the graph | ||
|
||
Returns: | ||
bool: Whether the node is a ReLU-like node | ||
""" |
A docstring has been added.
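As an aside, the "output saved for backward, input not saved" property the docstring describes can be checked empirically with PyTorch's saved-tensor hooks (available since PyTorch 1.10). This sketch is illustrative only and not part of the PR:

```python
import torch

saved = []  # every tensor autograd saves for the backward pass lands here

def pack(t):
    saved.append(t)
    return t

x = torch.randn(4, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, lambda t: t):
    y = torch.nn.functional.softmax(x, dim=0)

# softmax saves its output, not its input, for backward:
print(any(s is x for s in saved))             # expected: False (input not saved)
print(any(torch.equal(s, y) for s in saved))  # expected: True  (output saved)
```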
colossalai/fx/profiler/constants.py
Outdated
```python
RELU_LIKE_OPS = [
    torch.nn.functional.relu,
    torch.nn.functional.softmax,
]
```
Still, you should define a proper term for this kind of op; relu_like does not convey any useful information.
Or call them something like OUTPUT_SAVED_OPS?
I put a docstring in the function that uses this constant to illustrate its semantics.
Yup, I see it. I just think it could be clearer. I can accept it for now, but I hope it can be improved.
Okay, I will come back to refactoring the profiler after @Cypher30 finishes the backward estimation. I already have some ideas for removing some of these constants without affecting performance.
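For reference, a hedged sketch of what the reviewer's OUTPUT_SAVED_OPS naming could look like, stitched together from the diffs in this thread (the module list `OUTPUT_SAVED_MOD` and the function name are illustrative, not the merged code):

```python
import torch
from torch.fx import Node

# Ops whose *output* (rather than input) is saved for the backward pass,
# per the docstring above. Name proposed in review; merged code may differ.
OUTPUT_SAVED_OPS = [
    torch.nn.functional.relu,
    torch.nn.functional.softmax,
]

# call_module counterparts (illustrative list).
OUTPUT_SAVED_MOD = [
    torch.nn.ReLU,
    torch.nn.Softmax,
]

def is_output_saved_node(n: Node) -> bool:
    """True if `n` saves its output, not its input, for backward."""
    if n.op == 'call_function':
        return n.target in OUTPUT_SAVED_OPS
    elif n.op == 'call_module':
        return type(n.graph.owning_module.get_submodule(n.target)) in OUTPUT_SAVED_MOD
    return False
```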
```diff
@@ -71,14 +74,35 @@ def calculate_fwd_tmp(n: Node) -> int:
         fwd_tmp (int): the result of `fwd_tmp`
     """

-def is_relu_node(n: Node) -> bool:
+def is_relu_like_node(n: Node) -> bool:
```
And the rename is here.
What's new?
I provide an example script for the `fx.profiler` performance test. It is marked as skip.
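The script itself is not reproduced in this thread. As a rough sketch of the skip-marked structure visible in the diffs (the model choice, timing logic, and function name are assumptions; the real script presumably also runs the colossalai.fx profiler passes):

```python
import time

import pytest
import torch
import torchvision.models as tm


@pytest.mark.skip("Test for performance, no need for CI")
def test_fx_profiler_performance():
    model = tm.resnet18()
    data = torch.rand(8, 3, 224, 224)

    # Trace with torch.fx and time one forward pass through the GraphModule.
    gm = torch.fx.symbolic_trace(model)
    start = time.time()
    gm(data)
    print(f'forward time: {time.time() - start:.4f}s')

    del model, gm
```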
Remarks

After I upgraded transformers to the newest version, Hugging Face's GPT-2 no longer uses `torch.nn.MultiheadAttention` for its transformer blocks, so `torch.nn.functional.softmax` caused some unexpected problems. They also use the `torch.finfo` object, which is weird. I managed to fix these issues.

Tests