
[fx/profiler] tuned the calculation of memory estimation #1619

Merged
merged 12 commits into hpcaitech:main from super-dainiu:tuning/meta_backward on Sep 23, 2022

Conversation

super-dainiu (Contributor)

What's modified?

To run MetaInfoProp on arbitrary physical devices, we wrap a torch.Tensor with MetaTensor, which carries a fake_device property. fake_device is used by __torch_dispatch__ to pick a torch.ops.aten overload for execution, while during the actual execution the tensor is moved back to device='meta'. So now you can run MetaInfoProp for a tm.resnet18 on CPU with the following script.

model = tm.resnet18()
gm = torch.fx.symbolic_trace(model)    # assumed tracing step: the original snippet uses `gm` without defining it
input = MetaTensor(torch.rand(10000, 3, 224, 224, device='meta'), fake_device='cpu')
MetaInfoProp(gm).run(input)

And on GPU with the following script.

model = tm.resnet18().cuda()    # don't forget to move your model to cuda as well!
gm = torch.fx.symbolic_trace(model)    # assumed tracing step, as above
input = MetaTensor(torch.rand(10000, 3, 224, 224, device='meta'), fake_device='cuda')
MetaInfoProp(gm).run(input)

Now you may observe different patterns in the estimated memory, because torch.autograd behaves differently on different devices.
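
For intuition, here is a minimal sketch (an assumption, not the PR's exact implementation) of how such a MetaTensor wrapper can use fake_device inside __torch_dispatch__: arguments are unwrapped to plain meta tensors, the aten op runs on them without allocating real memory, and outputs are re-wrapped with the remembered fake_device.

import torch
from torch.utils._pytree import tree_map

class MetaTensor(torch.Tensor):
    _tensor: torch.Tensor

    @staticmethod
    def __new__(cls, elem, fake_device=None):
        # pretend to live on `fake_device` while the payload stays on 'meta'
        r = torch.Tensor._make_wrapper_subclass(
            cls, elem.size(), strides=elem.stride(), dtype=elem.dtype,
            device=fake_device or elem.device, requires_grad=elem.requires_grad)
        r._tensor = elem.to('meta')
        return r

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        fake_device = None

        def unwrap(x):
            nonlocal fake_device
            if isinstance(x, MetaTensor):
                fake_device = x.device    # remember the pretended device
                return x._tensor
            return x

        args = tree_map(unwrap, args)
        kwargs = tree_map(unwrap, kwargs or {})
        out = func(*args, **kwargs)    # executes with device='meta', no real allocation
        wrap = lambda x: MetaTensor(x, fake_device=fake_device) if isinstance(x, torch.Tensor) else x
        return tree_map(wrap, out)

With a wrapper of this shape, the resnet18 scripts above run entirely on meta storage while dispatching as if on 'cpu' or 'cuda'.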

Improvements

  1. The computation graph can be completely different when executing on 'cpu' and on 'cuda'.

  2. Tensor dtype may change during execution.

  3. Not every tensor that passes through torch.autograd.graph.saved_tensors_hooks is actually saved on the device. tensor.data_ptr() marks its physical address, so a tensor with a duplicated tensor.data_ptr() should only be counted once (see the sketch after this list).

    1. One annoying thing is that we always have tensor.data_ptr() == 0 on device='meta'.
    2. Luckily, in the Python frontend we can still track the same tensor through the unique identifier of its tensor.data_ptr bound method (a torch._C function object).
  4. See the figures below for illustration. (Two memory-trace images were attached to the original PR.)

    1. Whether fwd_out should be saved is determined by the node's users: a node keeps fwd_out if and only if at least one of its users saves its fwd_in.
  5. nn.MultiheadAttention has (x, x, x) as input, but only one copy of x should be saved for backward.
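
A hedged sketch of the de-duplication idea in item 3 (the helper name and signature are mine, not the PR's): since data_ptr() is always 0 on the meta device, the bound method object tensor.data_ptr serves as a per-tensor-object key.

import torch

def deduplicated_saved_size(saved_tensors) -> int:
    # count each distinct tensor object once; `t.data_ptr` (the bound method,
    # not its 0 return value on 'meta') is unique per tensor object
    seen, total = set(), 0
    for t in saved_tensors:
        key = t.data_ptr
        if key not in seen:
            seen.add(key)
            total += t.numel() * t.element_size()
    return total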

Concerns

Backward memory remains very hard to estimate accurately.

Comment on lines +219 to 252
# https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cudnn/BatchNorm.cpp
@register_meta(aten.cudnn_batch_norm.default)
def meta_cudnn_bn(input: torch.Tensor, weight, bias, running_mean, running_var, training, momentum, eps):
    n_input = input.size(1)

    output = torch.empty_like(input)
    running_mean = torch.empty((n_input), device='meta')
    running_var = torch.empty((n_input), device='meta')
    reserve = torch.empty((0), dtype=torch.uint8, device='meta')
    return output, running_mean, running_var, reserve


# https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cudnn/BatchNorm.cpp
# NB: CuDNN only implements the backward algorithm for batchnorm
# in training mode (evaluation mode batchnorm has a different algorithm),
# which is why this doesn't accept a 'training' parameter.
@register_meta(aten.cudnn_batch_norm_backward.default)
def meta_cudnn_bn_backward(dY: torch.Tensor, input: torch.Tensor, weight: torch.Tensor, running_mean, running_var,
                           save_mean, save_invstd, eps, reserve):
    dX = torch.empty_like(input)
    dgamma = torch.empty_like(weight)
    dbeta = torch.empty_like(weight)
    return dX, dgamma, dbeta


@register_meta(aten.native_layer_norm.default)
def meta_ln(input: torch.Tensor, normalized_shape, weight, bias, eps):
    bs = input.size(0)
    n_input = input.size(1)

    output = torch.empty_like(input)
    running_mean = torch.empty((bs, n_input, 1), device='meta')
    running_var = torch.empty((bs, n_input, 1), device='meta')
    return output, running_mean, running_var
super-dainiu (Contributor Author):

A sad truth is that autograd on CUDA uses cudnn_batch_norm.
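
For context, a plausible shape for the register_meta decorator used above (hypothetical; the PR's own decorator may differ): it simply maps an aten overload to a shape-only implementation that can be called in place of a real CUDA kernel.

import torch

aten = torch.ops.aten
meta_table = {}

def register_meta(op):
    # record `fn` as the shape/dtype-only implementation for aten overload `op`
    def wrapper(fn):
        meta_table[op] = fn
        return fn
    return wrapper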

Comment on lines +381 to +391
def meta_native_dropout_default(input: torch.Tensor, p: float, train: bool = False):
    # notice that mask is bool
    output = torch.empty_like(input)
    mask = torch.empty_like(input, dtype=torch.bool)
    return output, mask


# https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Dropout.cpp
@register_meta(aten.native_dropout_backward.default)
def meta_native_dropout_backward_default(grad: torch.Tensor, mask: torch.Tensor, scale: float):
    return torch.empty_like(grad)
super-dainiu (Contributor Author):

Another sad truth is that autograd on CUDA uses native_dropout.
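
To see why this matters for the estimate (a worked example of mine, reusing the input shape from the scripts above and assuming float32 activations): native_dropout saves a bool mask of the same shape as the input for backward, adding one byte per element.

import torch

x = torch.empty(10000, 3, 224, 224, device='meta')
mask_bytes = x.numel() * torch.empty(0, dtype=torch.bool).element_size()    # 1 byte per element
print(f"dropout mask adds {mask_bytes / 2**20:.0f} MiB of saved memory")    # ~1436 MiB for this input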

Comment on lines +127 to +129
xbar += n.meta['fwd_mem_tmp']
if any(map(lambda x: x.meta['save_fwd_in'], n.users)):
    xbar += n.meta['fwd_mem_out']
super-dainiu (Contributor Author):

Not every node's fwd_mem_out needs counting; only nodes with a user that saves fwd_in contribute it.

@@ -224,7 +222,7 @@ def output(self, target: 'Target', args: Tuple[Argument, ...], kwargs: Dict[str,
             result (Any): The argument value that was retrieved
             meta_info (MetaInfo): The memory cost and FLOPs estimated with `MetaTensor`.
         """
-        return args[0], GraphInfo(fwd_mem_in=activation_size(args[0]))
+        return args[0], GraphInfo(save_fwd_in=True)
super-dainiu (Contributor Author):

The output node always saves the last fwd_mem_out.

elif is_phase(n, Phase.BACKWARD):
    if len(n.users):
        graph_info.bwd_mem_tmp = max(graph_info.bwd_mem_tmp, _peak_memory(deps))
    else:
        # TODO: some of the bwd_mem_out might be model parameters.
        # basically a backward node without user is a `grad_out` node
        graph_info.bwd_mem_out += activation_size(n.meta['out'])
        graph_info.bwd_mem_out += activation_size(n.meta['saved_tensor'])
    for input_n in n.all_input_nodes:
        if input_n in deps:
            deps[input_n] -= 1
super-dainiu (Contributor Author):

These analyses are still naive, so sad.
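
A hypothetical reading of the _peak_memory helper used above (my guess from context, not the PR's code): while a node still has unresolved users in deps, its saved tensors remain alive, so the step's peak is the total saved size of the live nodes.

def _peak_memory(deps) -> int:
    # `deps` maps each producing node to its count of not-yet-consumed users;
    # a node with a positive count still holds its saved tensors alive
    return sum(activation_size(n.meta['saved_tensor']) for n, count in deps.items() if count > 0)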

Comment on lines -9 to -84
if META_COMPATIBILITY:
    aten = torch.ops.aten

    WEIRD_OPS = [
        torch.where,
    ]

    INPLACE_ATEN = [
        aten.add_.Tensor,
        aten.sub_.Tensor,
        aten.div_.Tensor,
        aten.div_.Scalar,
        aten.mul_.Tensor,
        aten.bernoulli_.float,

        # inplace reshaping
        aten.copy_.default,
        aten.detach.default,
        aten.t.default,
        aten.transpose.int,
        aten.view.default,
        aten._unsafe_view.default,
    ]

    NORMALIZATION_ATEN = [
        aten.native_batch_norm.default,
        aten.native_layer_norm.default,
        # aten.max_pool2d_with_indices.default,
    ]

    CLONE_ATEN = [
        aten.clone.default,
    ]

    __all__ += ['INPLACE_ATEN', 'WEIRD_OPS', 'NORMALIZATION_ATEN', 'CLONE_ATEN']

else:
    # TODO fill out the inplace ops
    INPLACE_OPS = [
        add,
        sub,
        mul,
        floordiv,
        neg,
        pos,
        getitem,
        setitem,
        getattr,
        torch.Tensor.cpu,
    ]

    # TODO: list all call_methods that are inplace here
    INPLACE_METHOD = [
        'transpose',
        'permute',
        # TODO: reshape may return a copy of the data if the data is not contiguous
        'reshape',
        'dim',
        'flatten',
        'size',
        'view',
        'unsqueeze',
        'to',
        'type',
    ]

    # TODO: list all call_methods that are not inplace here
    NON_INPLACE_METHOD = [
        'chunk',
        'contiguous',
        'expand',
        'mean',
        'split',
    ]
    __all__ += ['INPLACE_OPS', 'INPLACE_METHOD', 'NON_INPLACE_METHOD']
super-dainiu (Contributor Author):

Moved these to constant.py.

@@ -201,6 +202,8 @@ def zero_flop_jit(*args):
     # normalization
     aten.native_batch_norm.default: batchnorm_flop_jit,
     aten.native_batch_norm_backward.default: batchnorm_flop_jit,
+    aten.cudnn_batch_norm.default: batchnorm_flop_jit,
+    aten.cudnn_batch_norm_backward.default: partial(batchnorm_flop_jit, training=True),
super-dainiu (Contributor Author):

Note that cudnn_batch_norm_backward is handled differently because it is only invoked when training=True.

Comment on lines +30 to +32
# super-dainiu:
# x.detach() will change the unique identifier of data_ptr
# we need to handle this in a stupid way
super-dainiu (Contributor Author):

x.detach() will change the unique identifier of data_ptr, so we need to handle this in a stupid way.
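
A small demonstration of the problem (my reproduction, not the PR's code): detach() returns a new Python tensor object, so the data_ptr bound method used as the identity key no longer matches, even though both tensors alias the same meta storage.

import torch

x = torch.rand(4, device='meta')
y = x.detach()
# data_ptr() itself is useless as a key here: on 'meta' it is always 0
print(x.data_ptr == x.data_ptr)    # True  - bound methods of the same object compare equal
print(x.data_ptr == y.data_ptr)    # False - detach() produced a distinct tensor object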

from .tensor import MetaTensor
from .opcount import flop_mapping

__all__ = ['profile_function', 'profile_module', 'profile_method']

# super-dainiu: this cache should be global, otherwise it cannot
# track duplicated tensors between nodes
cache = set()
super-dainiu (Contributor Author):

This cache should be global; otherwise it cannot track duplicated tensors between nodes.

Comment on lines +280 to +283
# still run the profiling but discard some results regarding `module`.
inplace = getattr(module, 'inplace', False)
if inplace:
    module.inplace = False
super-dainiu (Contributor Author):

Profiling is no longer skipped when inplace=True; the flag is temporarily set to False so the module can still be profiled.
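
The surrounding flow presumably looks like the following sketch (assumed; profile_module's exact signature is not shown in this hunk), with the user's inplace flag handed back once profiling is done.

# still run the profiling but discard some results regarding `module`.
inplace = getattr(module, 'inplace', False)
if inplace:
    module.inplace = False
try:
    meta_info = profile_module(module, *args)    # profile with out-of-place semantics
finally:
    if inplace:
        module.inplace = True    # restore the user's setting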

super-dainiu (Contributor Author): (result image attached to the original PR, omitted here)

@FrankLeeeee merged commit d967779 into hpcaitech:main on Sep 23, 2022.
@super-dainiu deleted the tuning/meta_backward branch on September 23, 2022 at 06:04.