[fx] Add offload codegen #1598
Conversation
Paste an example of the generated code here.
if node.name == "linear4":
    setattr(node, "activation_checkpoint", [0])

gm = ColoGraphModule(copy.deepcopy(model), graph)
I think the copy is not needed.
The copy is needed because we want to test the backward gradients.
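For context, here is a minimal sketch of the kind of gradient check the deepcopy enables (the stand-in transform and the shapes are illustrative, not the actual test from this PR):

import copy
import torch
import torch.nn as nn

model = nn.Linear(16, 16)
# Stand-in for building the transformed module from a deepcopy; in the
# test above this would be ColoGraphModule(copy.deepcopy(model), graph).
gm = copy.deepcopy(model)

data = torch.randn(2, 16)
model(data).sum().backward()
gm(data).sum().backward()

# Without the deepcopy, model and gm would share the same parameter
# tensors, so the two backward passes would accumulate into the same
# .grad buffers and the comparison below would be meaningless.
for p_ref, p_gm in zip(model.parameters(), gm.parameters()):
    assert torch.allclose(p_ref.grad, p_gm.grad)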
OK, I see.
Here is an example of the generated code:

def pack_hook(self, x):
    if getattr(x, "offload", None):
        return (x.device, x.cpu())
    else:
        return x

def unpack_hook(self, packed):
    if isinstance(packed, tuple):
        device, tensor = packed
        return tensor.to(device)
    else:
        return packed

def checkpoint_0(self, linear2):
    linear3 = self.linear3(linear2); linear2 = None
    linear4 = self.linear4(linear3); linear3 = None
    return linear4

def forward(self, x):
    linear1 = self.linear1(x); x = None
    setattr(linear1, 'offload', True)
    with torch.autograd.graph.saved_tensors_hooks(self.pack_hook, self.unpack_hook):
        linear2 = self.linear2(linear1); linear1 = None
    linear4 = colossalai.utils.activation_checkpoint.checkpoint(self.checkpoint_0, True, linear2, use_reentrant=False)
    linear5 = self.linear5(linear4); linear4 = None
    return linear5
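To summarize what the generated code does: setattr(linear1, 'offload', True) tags the tensor, and the saved_tensors_hooks context makes autograd route every tensor it saves for backward through pack_hook, which moves tagged tensors to CPU (remembering their original device) and passes untagged ones through unchanged. During backward, unpack_hook moves the offloaded tensor back to its original device. The checkpointed region is still emitted as a colossalai.utils.activation_checkpoint.checkpoint call, as before.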
@pytest.mark.skipif(with_codegen, reason='torch version is equal to or higher than 1.12.0')
@pytest.mark.skip(reason="currently torch11 ColoGraphModule is not implemented")
Why do we need two skips?
Sure, I could remove the first one~
What's New
Previously, we had an offload option in the activation checkpoint region to support offloading the region's inputs. However, the upcoming activation solver may offload the input of a node that is not inside any checkpoint region, so I use saved_tensors_hooks to implement this kind of offloading.
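For readers unfamiliar with the mechanism, here is a minimal standalone sketch of the saved_tensors_hooks offload pattern (simplified from, and not identical to, the generated code shown above):

import torch

def pack_hook(x):
    # Called when autograd saves a tensor for backward: move it to CPU
    # and remember its original device. (For CPU tensors this is a no-op;
    # for GPU tensors it frees device memory until backward.)
    return (x.device, x.cpu())

def unpack_hook(packed):
    # Called when backward needs the saved tensor: move it back.
    device, tensor = packed
    return tensor.to(device)

x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 8, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    y = (x @ w).relu()
y.sum().backward()  # saved activations are restored via unpack_hook here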
As we haven't implemented the torch 1.11 version of ColoGraphModule, I skip the unit test for this part and attach the results on torch 1.12 below.