[tensor] refactor colo-tensor #992
Conversation
ver217 commented May 17, 2022 (edited)
- ColoTensor inherits torch.Tensor.
- Test graph is broken and must be fixed in the next PR.
- Lazy init of ColoTensor is no longer supported.
- Remove torch.sum and torch.mean from the element-wise ops; they are not element-wise (see the snippet below).
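For the last bullet, a quick sanity check in plain PyTorch: sum and mean change the output shape, so they are reductions rather than element-wise ops.

import torch

x = torch.ones(4, 8)
print(torch.relu(x).shape)          # torch.Size([4, 8]) -> element-wise, shape preserved
print(torch.sum(x).shape)           # torch.Size([])     -> reduces over all elements
print(torch.mean(x, dim=1).shape)   # torch.Size([4])    -> reduces along dim 1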
colossalai/tensor/colo_tensor.py
Outdated
def replace_tensor_with_colo(func):
func = _COLOSSAL_OPS[func]
return super().__torch_function__(func, types, args, kwargs)
I'm confused about this line. Why not return _COLOSSAL_OPS[func](types, args, kwargs, None) as before?
super().__torch_function__ handles type conversion automatically and disables the torch function override in this context. See https://github.com/pytorch/pytorch/blob/0e351c7df9c41367bca94b824e20f68ea8f165b4/torch/_tensor.py#L1112
As ColoTensor inherits torch.Tensor, we must disable __torch_function__ inside our own implementations of torch functions. This avoids infinite recursion.
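For reference, a minimal sketch of this dispatch pattern (the registry and class names here are illustrative, not the actual Colossal-AI code):

import torch

_CUSTOM_OPS = {}    # hypothetical registry mapping torch functions to custom implementations

class MyColoTensor(torch.Tensor):

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        if func in _CUSTOM_OPS:
            # Swap in the custom (e.g. distributed-aware) implementation.
            func = _CUSTOM_OPS[func]
        # torch.Tensor.__torch_function__ runs func with the override disabled,
        # so calling it here does not recurse back into this method.
        return super().__torch_function__(func, types, args, kwargs)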
I see. You are smart.
from typing import Union, Optional
from colossalai.tensor import ColoTensor

GeneralTensor = Union[ColoTensor, torch.Tensor]
Why do we still need this Union?
I think it's a subset of torch.Tensor now.
Because the input tensor can be a plain torch.Tensor.
Yes, but their attributes and methods are slightly different. Keeping the Union helps auto-completion in the IDE.
I agree with Wesley. ColoTensor is a subclass of Tensor, so a Union of them is unnecessary. As for the auto-completion issue, confusing users for that reason may not be a good idea.
I still think we should distinguish them: a ColoTensor is a torch.Tensor, but a torch.Tensor is not a ColoTensor.
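For context, a sketch of the usage pattern under discussion (the class and helper names are simplified placeholders, not the real Colossal-AI API): op implementations may receive either kind of tensor and upgrade plain ones.

import torch
from typing import Union

class ColoTensorSketch(torch.Tensor):
    """Simplified stand-in for ColoTensor."""

GeneralTensor = Union[ColoTensorSketch, torch.Tensor]

def convert_to_colo_tensor(tensor: GeneralTensor) -> ColoTensorSketch:
    # A ColoTensor passes through unchanged; a plain torch.Tensor is upgraded.
    if isinstance(tensor, ColoTensorSketch):
        return tensor
    return tensor.as_subclass(ColoTensorSketch)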
spec: TensorSpec = TensorSpec(dist_spec.replicate())) -> 'ColoParameter':
    tensor = tensor.as_subclass(ColoParameter)
    tensor.__init__(tensor, requires_grad=requires_grad, spec=spec)
    return tensor
What about using a classmethod with as_subclass(cls) in ColoTensor?
ColoTensor.from_torch_tensor and ColoParameter.from_torch_tensor take different arguments. Although we could make from_torch_tensor a classmethod that takes *args and **kwargs, I override it to keep the signature explicit and friendly for IDE auto-completion.
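A rough sketch of the pattern being weighed here (class names and the spec attribute are simplified placeholders, not the real colossalai classes):

import torch

class MyColoTensor(torch.Tensor):

    @classmethod
    def from_torch_tensor(cls, tensor: torch.Tensor, spec=None) -> 'MyColoTensor':
        # as_subclass(cls) re-tags the tensor's Python class without copying data;
        # since cls is used, subclasses calling this classmethod get instances of themselves.
        colo = tensor.as_subclass(cls)
        colo._spec = spec
        return colo

class MyColoParameter(MyColoTensor):

    @classmethod
    def from_torch_tensor(cls, tensor: torch.Tensor, requires_grad: bool = True,
                          spec=None) -> 'MyColoParameter':
        # Overridden with an explicit signature (rather than *args/**kwargs) so the
        # extra requires_grad argument shows up in IDE auto-completion.
        param = tensor.as_subclass(cls)
        param.requires_grad = requires_grad
        param._spec = spec
        return param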
colossalai/tensor/_ops/addmm.py
Outdated
parallel_action = mat2.spec.get_action_by_compute_pattern(ComputePattern.TP1D)
# mat1:S[1] x mat2:S[0] = Output:P
# beta * input + alpha * All-Reduce(Output) = res

mat1.to_dist_spec(dist_spec.shard(mat2.spec.get_process_group(), [-1], [mat2.spec.get_process_group().size()]))
mat1 = mat1.convert_to_dist_spec(
    dist_spec.shard(mat2.spec.get_process_group(), [-1], [mat2.spec.get_process_group().size()]))
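A self-contained, single-process illustration of the sharding comments above; the cross-rank all-reduce is simulated here by summing per-shard partial products (in the real op this sum happens across ranks, e.g. via torch.distributed.all_reduce):

import torch

world_size = 2
beta, alpha = 1.0, 1.0
inp = torch.randn(4, 6)
mat1 = torch.randn(4, 8)
mat2 = torch.randn(8, 6)

# mat1 is sharded along its last dim (S[1]); mat2 along its first dim (S[0]).
mat1_shards = mat1.chunk(world_size, dim=1)
mat2_shards = mat2.chunk(world_size, dim=0)

# Each "rank" computes a partial output (P); summing the partials is the all-reduce.
partial = [m1 @ m2 for m1, m2 in zip(mat1_shards, mat2_shards)]
out = sum(partial)

res = beta * inp + alpha * out
assert torch.allclose(res, torch.addmm(inp, mat1, mat2, beta=beta, alpha=alpha), atol=1e-5)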
Why not add a new API, get_process_group_size()?
DONE
colossalai/tensor/_ops/addmm.py
Outdated
@@ -1,64 +1,60 @@
import torch
from typing import Union
from colossalai.tensor.op_wrapper import colo_op_impl
from colossalai.nn.layer.parallel_1d._utils import reduce_input, reduce_grad
from colossalai.tensor import ComputePattern, TensorSpec, ComputePattern, ParallelAction, ColoTensor
from colossalai.tensor import dist_spec
I keep confusing the package name with a variable name, because you tend to use short names with underscores for local variables. How about naming it distspec?
PEP 8 describes this clearly:
Package and Module Names
Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.
DONE
So do you mean module file names should have no underscores? I think there are many cases like that in the colo tensor files.
colossalai/tensor/_ops/embedding.py
Outdated
elif weight.spec.has_compute_pattern(ComputePattern.TP1D):  # Single Model Parallel Applied
    if weight.spec.is_1D_row():
        return colo_embedding_1Drow(input_tensor, weight, args, kwargs)
        return colo_embedding_1Drow(input_tensor,
The following code snippet is duplicated. Typically it is better to use a general API such as colo_tensor_1D(XXXX, type='col'|'row'); a rough sketch is below. But I know you would like to keep the style consistent with the other ops.
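A rough sketch of that suggestion (the dispatcher name is made up, and colo_embedding_1Dcol is assumed to exist alongside colo_embedding_1Drow):

def colo_embedding_1d(mode, input_tensor, weight, args, kwargs):
    # One entry point instead of duplicated row/col branches at every call site.
    assert mode in ('row', 'col')
    if mode == 'row':
        return colo_embedding_1Drow(input_tensor, weight, args, kwargs)
    return colo_embedding_1Dcol(input_tensor, weight, args, kwargs)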
DONE
ModelOutput.__post_init__ = _post_init_colotensor


class GPTLMModel(nn.Module):
Move the model definition to the commons.
The test_tensor folder has some redundant code that can be moved to utils. I will refactor it and move the GPT model to tests.components_to_test in the next PR.
attr = getattr(self._torch_tensor, name)
def convert_to_dist_spec_(self, dist_spec: _DistSpec) -> None:
    with DistSpecManager.no_grad():
I think users cannot understand why we need to call no_grad() here. The common use of no_grad is a forward pass without grad via torch.no_grad(), which is completely different from what you are doing here. Could we hide no_grad() from the user, e.g. just with DistSpecManager()?
It also means "forward without grad": in this context, the autograd Function is not applied.
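A minimal sketch of what such a context manager might look like (DistSpecManagerSketch is a simplified stand-in, not the real colossalai class):

from contextlib import contextmanager

class DistSpecManagerSketch:
    # When False, spec conversion runs the raw communication directly instead of
    # going through a custom autograd.Function, so no grad_fn is recorded.
    _use_autograd_function: bool = True

    @staticmethod
    @contextmanager
    def no_grad():
        old = DistSpecManagerSketch._use_autograd_function
        DistSpecManagerSketch._use_autograd_function = False
        try:
            yield
        finally:
            DistSpecManagerSketch._use_autograd_function = old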
OK, as a developer I understand that you do some communication inside trans_spec and do not want to trigger grad_fn or anything grad-related. But as a user, I would assume handle_trans_spec could involve grad-related ops?
colossalai/tensor/_ops/addmm.py
Outdated
@@ -1,64 +1,67 @@
from sqlalchemy import func
It seems like an auto-completion error...
removed
https://segmentfault.com/a/1190000041428523 for reference
BATCH_SIZE = 4
SEQ_LEN = 1024
VOCAB_SIZE = 50304
NUM_STEPS = 1
Upper case means global variables, not local ones.
I will refactor this file in the next PR.