
[fx/profiler] assigned UUID to each unrecorded tensor/ improved performance on GPT-2 #1679

Merged
merged 33 commits into hpcaitech:main from the speedup branch on Oct 11, 2022

Conversation

super-dainiu
Contributor

@super-dainiu super-dainiu commented Oct 5, 2022

What's new?

  1. Provided a function to calculate `fwd_in`, `fwd_tmp` and `fwd_out`. Hopefully, this will avoid a lot of misunderstanding.
  2. Assigned a UUID to each unrecorded tensor.
  3. Improved performance on GPT-2.

Bugs in existing code

  1. The C version of the rotor solver is not correct. cc @Cypher30

C version results

| Model | mem_limit | real memory consumption | train step time | solver time |
|---|---|---|---|---|
| densenet121 | None | 15996.476 MB | 171.204 MS | — |
| densenet121 | 5300.0 MB | 5300.000 MB | inf MS | 0.000 MS |
| densenet121 | 6680.0 MB | 6680.000 MB | inf MS | 0.000 MS |
| densenet121 | 8060.0 MB | 8060.000 MB | inf MS | 0.000 MS |
| densenet121 | 9440.0 MB | 9440.000 MB | inf MS | 0.000 MS |
| densenet121 | 10820.0 MB | 10820.000 MB | inf MS | 0.000 MS |
| densenet121 | 12200.0 MB | 12200.000 MB | inf MS | 0.000 MS |
| densenet121 | 13580.0 MB | 13580.000 MB | inf MS | 0.000 MS |
| densenet121 | 14960.0 MB | 14960.000 MB | inf MS | 0.000 MS |
| densenet121 | 16340.0 MB | 15800.507 MB | 171.508 MS | 2038.141 MS |
| densenet121 | 17720.0 MB | 15800.476 MB | 171.513 MS | 2205.862 MS |
| densenet121 | 19100.0 MB | 15800.476 MB | 171.499 MS | 2239.843 MS |

Force Python results

| Model | mem_limit | real memory consumption | train step time | solver time |
|---|---|---|---|---|
| densenet121 | None | 16027.334 MB | 171.072 MS | — |
| densenet121 | 5300.0 MB | 5300.000 MB | inf MS | 0.000 MS |
| densenet121 | 6690.0 MB | 6313.646 MB | 208.779 MS | 1582.991 MS |
| densenet121 | 8080.0 MB | 7292.647 MB | 204.590 MS | 1605.730 MS |
| densenet121 | 9470.0 MB | 9500.286 MB | 195.120 MS | 1625.514 MS |
| densenet121 | 10860.0 MB | 10728.410 MB | 190.821 MS | 1571.043 MS |
| densenet121 | 12250.0 MB | 11897.869 MB | 186.108 MS | 1640.221 MS |
| densenet121 | 13640.0 MB | 12093.869 MB | 185.596 MS | 1590.731 MS |
| densenet121 | 15030.0 MB | 14851.113 MB | 175.736 MS | 1633.711 MS |
| densenet121 | 16420.0 MB | 16027.334 MB | 171.635 MS | 1585.685 MS |
| densenet121 | 17810.0 MB | 16027.334 MB | 171.443 MS | 1645.066 MS |
  2. What is wrong with this? ✔
     (screenshot omitted)

  3. The logger always requires a launch; some modifications should be made either to the logger or to test_linearize.py. cc @Cypher30

Tests

Passed all tests in tests/test_fx locally. (screenshot of passing tests omitted)

@super-dainiu super-dainiu changed the title Speedup [fx/profiler] assigned UUID to each unrecorded tensor/ improved performance on GPT-2 Oct 5, 2022
Comment on lines +51 to 63
# TODO(super-dainiu): remove redundant items; currently all of them are necessary for development

fwd_flop: int = 0
fwd_time: float = 0.0
bwd_flop: int = 0
bwd_time: float = 0.0
save_fwd_in: bool = False
fwd_in: List = field(default_factory=list)
fwd_tmp: List = field(default_factory=list)
fwd_out: List = field(default_factory=list)
fwd_mem_tmp: int = 0
fwd_mem_out: int = 0
bwd_mem_tmp: int = 0
Contributor Author

This is just for compatibility across different variants of InfoProp.

Contributor Author

Now all forward results of MetaInfoProp are saved in a List.

# If any argument indicates that this `call_function` is inplace, we should
# still run the profiling but discard some results regarding `target`
global do_not_cache
Contributor Author

This happens when we replace inplace=True with inplace=False: some tensors might be saved for backward when inplace=False, and that result would then be cached.
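A minimal sketch (not the profiler's actual code; the `profile` function, `'relu'` key, and result dict below are made up for illustration) of why a result cached with inplace=False must not be reused for an inplace op:

```python
# A stale cache entry from inplace=False carries tensors saved for backward
# that an inplace op would never keep, so caching must be skippable.
cache = {}

def profile(target, inplace, do_not_cache=False):
    # do_not_cache plays the role of the global flag in the PR's diff
    if not do_not_cache and target in cache:
        return cache[target]
    result = {'saved_for_backward': not inplace}  # stand-in for real profiling
    if not do_not_cache:
        cache[target] = result
    return result

first = profile('relu', inplace=False)                     # cached, saves tensors
wrong = profile('relu', inplace=True)                      # cache hit: stale result
right = profile('relu', inplace=True, do_not_cache=True)   # bypass the cache
assert wrong['saved_for_backward'] is True                 # demonstrates the bug
assert right['saved_for_backward'] is False                # bypassing fixes it
```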

Comment on lines +12 to +15
def set_uuid(x):
    if isinstance(x, torch.Tensor):
        if not hasattr(x, 'uuid'):
            setattr(x, 'uuid', uuid.uuid4())
Contributor Author

A uuid is assigned if and only if the tensor does not already have one.
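The guard can be sanity-checked without torch; `FakeTensor` below is a stand-in for torch.Tensor, since only the attribute check matters here:

```python
import uuid

class FakeTensor:
    """Stand-in for torch.Tensor in this sketch."""

def set_uuid(x):
    # assign a uuid only if none exists yet, mirroring the diff above
    if isinstance(x, FakeTensor):
        if not hasattr(x, 'uuid'):
            setattr(x, 'uuid', uuid.uuid4())

t = FakeTensor()
set_uuid(t)
first = t.uuid
set_uuid(t)  # a second call must not overwrite the existing uuid
assert t.uuid == first
```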

@@ -15,7 +19,7 @@
with_codegen = False


-@pytest.mark.skip(reason='TODO: modify calculations in rotor')
+@pytest.mark.skip(reason='TODO: modify the logger')
Contributor Author

See error 3

    codegen = ActivationCheckpointCodeGen()
    gm.graph.set_codegen(codegen)
    if solver == solver_rotor:
-       gm = solver(gm, data, mem_limit=500 * 1024 * 1024, mem_slots=500)
+       gm = solver(gm, data, mem_limit=500 * 1024 * 1024, mem_slots=500, force_python=True)
Contributor Author

See error 1

Contributor

@Cypher30 Cypher30 left a comment

I will check the logger problem. The build warning from CodeFactor might be because of shell=True; I will dive into it.

    codegen = ActivationCheckpointCodeGen()
    gm.graph.set_codegen(codegen)
    if solver == solver_rotor:
-       gm = solver(gm, data, mem_limit=500 * 1024 * 1024, mem_slots=500)
+       gm = solver(gm, data, mem_limit=500 * 1024 * 1024, mem_slots=500, force_python=True)
Contributor

Currently I cannot reproduce this problem.

Comment on lines +1 to +42
# for PyTorch 1.11 compatibility
import torch
from torch.fx import Node, GraphModule
from typing import Union, Dict, List, Tuple

__all__ = ["calculate_fwd_in", "calculate_fwd_tmp", "calculate_fwd_out"]


def calculate_fwd_in(n: Node) -> bool:
    """A helper function to calculate `fwd_in`

    Args:
        n (Node): a node from the graph

    Returns:
        save_fwd_in (bool): the result of `save_fwd_in`
    """
    return n.meta['save_fwd_in']


def calculate_fwd_tmp(n: Node) -> int:
    """A helper function to calculate `fwd_tmp`

    Args:
        n (Node): a node from the graph

    Returns:
        fwd_tmp (int): the result of `fwd_tmp`
    """
    return n.meta["fwd_mem_tmp"]


def calculate_fwd_out(n: Node) -> int:
    """A helper function to calculate `fwd_out`

    Args:
        n (Node): a node from the graph

    Returns:
        fwd_out (int): the result of `fwd_out`
    """
    return n.meta['fwd_mem_out']
Contributor Author

This is for compatibility, because MetaInfoProp for torch 1.11 does not save tensors with a uuid.
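A toy usage sketch of the helpers above; `FakeNode` and its meta values are made up for illustration, and the two one-line helpers repeat the lookups from the snippet so the sketch is self-contained:

```python
class FakeNode:
    """Hypothetical stand-in for a torch.fx Node carrying profiler metadata."""
    def __init__(self, meta):
        self.meta = meta

def calculate_fwd_tmp(n):
    # same lookup as the helper in the snippet above
    return n.meta['fwd_mem_tmp']

def calculate_fwd_out(n):
    # same lookup as the helper in the snippet above
    return n.meta['fwd_mem_out']

nodes = [FakeNode({'fwd_mem_tmp': 100, 'fwd_mem_out': 50}),
         FakeNode({'fwd_mem_tmp': 0, 'fwd_mem_out': 200})]
assert max(calculate_fwd_tmp(n) for n in nodes) == 100   # peak temp memory
assert sum(calculate_fwd_out(n) for n in nodes) == 250   # total saved outputs
```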

Comment on lines +168 to +180
def _get_fwd_mem_tmp(node: List[Node]) -> int:
    """Get the forward temp memory of a node.
    This can be computed by subtracting the saved activation from all outputs of the node.

    Args:
        node (List[Node]): list of torch.fx Nodes,
            indicating a node in the linearized graph

    Returns:
        int: forward temp memory, in bytes
    """
    n = node[-1]
    return activation_size(n.meta['fwd_out']) - calculate_fwd_out(n)
Contributor Author

Now fwd_out tensors that are not saved for the next node may be regarded as fwd_tmp.
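The subtraction can be illustrated with plain numbers; the `meta` dict and `activation_size` below are simplified stand-ins for the node's metadata and the profiler's real helper:

```python
def activation_size(tensors):
    # stand-in for the profiler's helper: total bytes of the given tensors,
    # each represented here as a (numel, element_size) pair
    return sum(numel * elem_size for numel, elem_size in tensors)

meta = {
    'fwd_out': [(1024, 4), (256, 4)],   # all forward outputs: 4096 + 1024 bytes
    'fwd_mem_out': 4096,                # bytes actually saved for the next node
}

# outputs not saved for the next node count as temporary forward memory
fwd_mem_tmp = activation_size(meta['fwd_out']) - meta['fwd_mem_out']
assert fwd_mem_tmp == 1024
```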

Contributor

@Cypher30 Cypher30 left a comment

Great work! I will approve this PR first and check the C version issue.

@super-dainiu super-dainiu merged commit 3dd6994 into hpcaitech:main Oct 11, 2022
@super-dainiu super-dainiu deleted the speedup branch October 11, 2022 06:41