[fx] Add nested checkpoint in activation checkpoint codegen #1585

Cypher30 · 2022-09-12T05:51:35Z

What's New?

Because we need to use the activation checkpoint solver with the settings in rotor and pofo, we may need to employ nested checkpoints, i.e. generated code of the following form:

def checkpoint_0(self, x):
    linear3 = colossalai.utils.activation_checkpoint.checkpoint(self.checkpoint_0_0, False, x, use_reentrant=False)
    linear4 = colossalai.utils.activation_checkpoint.checkpoint(self.checkpoint_0_1, False, linear3, use_reentrant=False)
    return linear4
def checkpoint_0_0(self, x):
    linear1 = colossalai.utils.activation_checkpoint.checkpoint(self.checkpoint_0_0_0, False, x, use_reentrant=False)
    linear2 = self.linear2(linear1);  linear1 = None
    linear3 = colossalai.utils.activation_checkpoint.checkpoint(self.checkpoint_0_0_1, False, linear2, use_reentrant=False)
    return linear3
def checkpoint_0_0_0(self, x):
    linear1 = self.linear1(x);  x = None
    return linear1
def checkpoint_0_0_1(self, linear2):
    linear3 = self.linear3(linear2);  linear2 = None
    return linear3
def checkpoint_0_1(self, linear3):
    linear4 = self.linear4(linear3);  linear3 = None
    return linear4
def checkpoint_1(self, linear4):
    linear5 = self.linear5(linear4);  linear4 = None
    return linear5
def forward(self, x):
    linear4 = colossalai.utils.activation_checkpoint.checkpoint(self.checkpoint_0, False, x, use_reentrant=False)
    linear5 = colossalai.utils.activation_checkpoint.checkpoint(self.checkpoint_1, False, linear4, use_reentrant=False)
    linear6 = self.linear6(linear5);  linear5 = None
    return linear6

In the upcoming solver update, the annotation process will be able to detect these structures. Each node.activation_checkpoint (if annotated) will be a list giving the checkpoint label at each level. For example, a node annotated with [0, 1, 1] belongs to checkpoint_0_1_1; this function is called by checkpoint_0_1, which is called by checkpoint_0, which in turn is called by forward.
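
The label-to-function mapping described above can be sketched as follows. This is only an illustration: `checkpoint_name` and `call_chain` are hypothetical helpers that are not part of this PR, and the sketch assumes the annotation list contains only integers.

```python
from typing import List


def checkpoint_name(labels: List[int]) -> str:
    """Hypothetical helper: build the generated function name from a label list."""
    return "checkpoint_" + "_".join(str(label) for label in labels)


def call_chain(labels: List[int]) -> List[str]:
    """Hypothetical helper: list the callers from forward down to the innermost checkpoint."""
    chain = ["forward"]
    # Each prefix of the label list names one enclosing checkpoint function.
    for depth in range(1, len(labels) + 1):
        chain.append(checkpoint_name(labels[:depth]))
    return chain


print(checkpoint_name([0, 1, 1]))  # checkpoint_0_1_1
print(call_chain([0, 1, 1]))
```

Reading the chain left to right gives the call order: each function invokes the next one through the checkpoint wrapper.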

The old version of the activation checkpoint codegen is preserved; the following check chooses which codegen to use:

if all(not isinstance(getattr(node, "activation_checkpoint", None), list) for node in nodes):
    emit_code_with_activation_checkpoint(body, ckpt_func, nodes, emit_node, delete_unused_values)
else:
    emit_code_with_nested_activation_checkpoint(body, ckpt_func, nodes, emit_node, delete_unused_values)
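
To make the selection rule concrete, here is a small standalone sketch of the same condition. `uses_nested_checkpoint` is a hypothetical name, and the nodes are plain `SimpleNamespace` stand-ins rather than real torch.fx nodes.

```python
from types import SimpleNamespace


def uses_nested_checkpoint(nodes) -> bool:
    """Return True if any node carries a list-valued activation_checkpoint annotation."""
    return any(isinstance(getattr(node, "activation_checkpoint", None), list) for node in nodes)


# Stand-in nodes: an integer label (old style), a list label (nested style), and no annotation.
flat_nodes = [SimpleNamespace(activation_checkpoint=0), SimpleNamespace()]
nested_nodes = [SimpleNamespace(activation_checkpoint=[0, 1]), SimpleNamespace()]

print(uses_nested_checkpoint(flat_nodes))    # False -> old codegen path
print(uses_nested_checkpoint(nested_nodes))  # True  -> nested codegen path
```

A single list-valued annotation is enough to route the whole graph to the nested codegen, which is the negation of the `all(not isinstance(...))` test above.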

Since ColoGraphModule is not yet implemented for torch11, I simply skip the test for it. The following is the test result on torch12.

Cypher30 and others added 30 commits July 14, 2022 16:07

Merge pull request #1 from hpcaitech/main

04e5272

Merge ColossalAI

Merge pull request #2 from hpcaitech/main

75618b3

Daily merge

Merge pull request #3 from hpcaitech/main

3e4620c

Merge

Merge remote-tracking branch 'upstream/main' into main

cf24049

Merge

Merge remote-tracking branch 'upstream/main' into main

3d223b6

Daily Merge

Merge branch 'hpcaitech:main' into main

644115c

Merge branch 'hpcaitech:main' into main

d995ade

Merge branch 'hpcaitech:main' into main

bba2dbe

Merge branch 'hpcaitech:main' into main

05ca628

Merge branch 'hpcaitech:main' into main

0a967da

Merge branch 'hpcaitech:main' into main

0637c0d

Merge branch 'hpcaitech:main' into main

74a6227

Merge branch 'hpcaitech:main' into main

e550490

Merge branch 'hpcaitech:main' into main

2d7f5d9

Merge branch 'hpcaitech:main' into main

b62e870

Merge branch 'hpcaitech:main' into main

b4b0974

Merge branch 'hpcaitech:main' into main

65c20de

Merge branch 'hpcaitech:main' into main

1660bfc

Merge branch 'hpcaitech:main' into main

6eb0ad0

Merge branch 'hpcaitech:main' into main

56df059

Merge branch 'hpcaitech:main' into main

480e932

Merge branch 'hpcaitech:main' into main

0fa66ee

Merge branch 'hpcaitech:main' into main

1d013b0

Merge branch 'hpcaitech:main' into main

5774db2

Merge branch 'hpcaitech:main' into main

e8ff699

Merge branch 'hpcaitech:main' into main

855c728

Merge branch 'main' of github.com:Cypher30/ColossalAI into main

2c113ea

[fx] add nested activation_checkpoint codegen

a987de0

Merge branch 'hpcaitech:main' into feature/add_nested_checkpoint_codegen

06da433

undo algorithms commits

bfcb7cd

Cypher30 added 2 commits September 12, 2022 13:36

solver

eef18c8

undo some commits

59de066

Cypher30 requested a review from super-dainiu September 12, 2022 05:51

Cypher30 added the Run Build and Test label Sep 12, 2022

Cypher30 added 4 commits September 12, 2022 13:54

[fx] torch11 add nested activation checkpoint codegen

456d844

remove some imports

f1f356b

[fx] add some comments in activation codegen

0b10758

[fx] codegen instance error fix

931e4e8

super-dainiu approved these changes Sep 12, 2022

Cypher30 merged commit f3687e4 into hpcaitech:main Sep 12, 2022

Cypher30 mentioned this pull request Sep 12, 2022

[fx] Improve linearize and rotor solver #1586

Merged
