
[fx/profiler] debug the fx.profiler / add an example test script for fx.profiler #1730

Merged: 11 commits merged into hpcaitech:main from debug/profiler on Oct 19, 2022

Conversation

super-dainiu (Contributor):

What's new?

I added an example script for the fx.profiler performance test. It is marked as skip so it does not run in CI.

Remarks

After I upgraded transformers to the newest version, the Hugging Face GPT-2 implementation no longer uses torch.nn.MultiheadAttention for its transformer blocks, so torch.nn.functional.softmax caused some unexpected problems. It also uses a torch.finfo object, which is odd to trace. I managed to fix these issues.
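
For context, here is a minimal sketch (my own illustration, not the actual transformers source) of the pattern that trips up tracing: the attention block calls the functional softmax directly and builds the mask fill value with torch.finfo, which is a plain Python object rather than a tensor op.

```python
import torch

def attention_scores(query, key, attention_mask=None):
    # Illustrative only; names and shapes are assumptions, not transformers code.
    scores = torch.matmul(query, key.transpose(-1, -2))
    if attention_mask is not None:
        # torch.finfo returns a Python object, not a tensor, so an FX-based
        # profiler has to handle it specially during tracing.
        mask_value = torch.finfo(scores.dtype).min
        scores = torch.where(attention_mask.bool(), scores,
                             torch.full_like(scores, mask_value))
    # The functional softmax is called directly instead of going through
    # torch.nn.MultiheadAttention.
    return torch.nn.functional.softmax(scores, dim=-1)
```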

Tests

(Screenshot of test output omitted.)

Comment on lines 76 to 78
        return n.target in [torch.nn.functional.relu, torch.nn.functional.softmax]
    elif n.op == 'call_module':
-       return type(n.graph.owning_module.get_submodule(n.target)) in [torch.nn.ReLU]
+       return type(n.graph.owning_module.get_submodule(n.target)) in [torch.nn.ReLU, torch.nn.Softmax]
Contributor Author:

As I mentioned, softmax is another relu_like node.

Contributor:

I think this is going to confuse future maintainers. What is a relu_like node?

Contributor Author:

I will think about this later. Unfortunately, I am also confused lol. Maybe I should add this to constants.py?

Contributor:

At least you should define this category of ops so that its semantics are clear.

Contributor Author:

Okay, I will add some comments there.

return forward_mem, param_mem


@pytest.mark.skip("Test for performance, no need for CI")
Contributor:

Mark this with colossalai.testing.run_on_environment_flag so that we can still run the test in the future without a code change.

Contributor Author:

Should it be colossalai.testing.run_on_environment_flag('fx.profiler')?
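
For reference, a minimal sketch of the suggested change, assuming run_on_environment_flag can be imported from colossalai.testing and accepts the flag name as an argument (the exact signature and flag name should be double-checked); the test function name is a placeholder:

```python
# Sketch only: replaces the unconditional @pytest.mark.skip with an
# environment-gated run, mirroring the suggestion above.
from colossalai.testing import run_on_environment_flag


@run_on_environment_flag('fx.profiler')  # runs only when the matching env flag is set
def test_fx_profiler_performance():      # placeholder name
    ...                                  # performance test body from this PR
```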

del model, gm


@pytest.mark.skip("Test for performance, no need for CI")
Contributor:

This is the same.

Comment on lines 320 to 321
if target in RELU_LIKE_OPS:
    do_not_cache = True
Contributor Author:

I also made a modification here.

Comment on lines +76 to +96
"""Check if a node is a ReLU-like node.
ReLU-like nodes have the following properties:
- They are either `call_function` or `call_module`
- Their output tensors are directly saved for backward
- Their input tensors are not saved for backward

An example is `torch.nn.functional.softmax` which has (forward + backward):
def forward(self, input_2):
_softmax_default = torch.ops.aten._softmax.default(input_2, None, None); input_2 = None
zeros_like_default = torch.ops.aten.zeros_like.default(_softmax_default, dtype = None, layout = None, device = None, pin_memory = None)
detach_default = torch.ops.aten.detach.default(_softmax_default); _softmax_default = None
_softmax_backward_data_default = torch.ops.aten._softmax_backward_data.default(zeros_like_default, detach_default, None, None); zeros_like_default = detach_default = None
detach_default_1 = torch.ops.aten.detach.default(_softmax_backward_data_default); _softmax_backward_data_default = None
detach_default_2 = torch.ops.aten.detach.default(detach_default_1); detach_default_1 = None

Args:
n (Node): A node from the graph

Returns:
bool: Whether the node is a ReLU-like node
"""
Contributor Author:

A docstring has been added.
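
Putting the quoted snippets together, here is a minimal sketch of the check that this docstring documents; it is assembled for illustration from the fragments shown in this review (the RELU_LIKE_MODULES name is mine), not copied verbatim from the PR:

```python
import torch
from torch.fx import Node

# Ops and modules whose outputs, rather than inputs, are saved for backward.
RELU_LIKE_OPS = [torch.nn.functional.relu, torch.nn.functional.softmax]
RELU_LIKE_MODULES = [torch.nn.ReLU, torch.nn.Softmax]  # hypothetical companion constant


def is_relu_like_node(n: Node) -> bool:
    if n.op == 'call_function':
        return n.target in RELU_LIKE_OPS
    elif n.op == 'call_module':
        return type(n.graph.owning_module.get_submodule(n.target)) in RELU_LIKE_MODULES
    return False
```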

Comment on lines 34 to 37
RELU_LIKE_OPS = [
    torch.nn.functional.relu,
    torch.nn.functional.softmax,
]
Contributor:

Still, you should define a term for this kind of op; relu_like does not convey any useful information.

Contributor:

Or call them something like OUTPUT_SAVED_OPS?
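
A sketch of what the suggested rename might look like, keeping the same members; the name OUTPUT_SAVED_OPS is only the proposal above, not code from the PR:

```python
import torch

# Ops whose outputs (rather than inputs) are saved for the backward pass.
OUTPUT_SAVED_OPS = [
    torch.nn.functional.relu,
    torch.nn.functional.softmax,
]
```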

Contributor Author:

I put a docstring in the function that uses this constant to illustrate it.

Contributor:

Yup, I see it. I just think it could be clearer. I can accept it for the moment, but I hope it can be improved.

Contributor Author (super-dainiu), Oct 19, 2022:

Okay, I will come back to refactor the profiler after @Cypher30 finishes the backward estimation. I already have some ideas for removing some of these constants without affecting performance.

@@ -71,14 +74,35 @@ def calculate_fwd_tmp(n: Node) -> int:
        fwd_tmp (int): the result of `fwd_tmp`
    """

-    def is_relu_node(n: Node) -> bool:
+    def is_relu_like_node(n: Node) -> bool:
Contributor Author:

And the rename is here.

super-dainiu merged commit 30874f1 into hpcaitech:main on Oct 19, 2022.
super-dainiu deleted the debug/profiler branch on October 19, 2022 at 06:26.