[Distributed] all_reduce op and distributed info in graphs #284

soodoshll · 2023-06-19T22:31:11Z

add the all_reduce op
add nccl-related headers and libs when building tasks (as a new pass include_nccl_pass)
support grouping in distributed (by using ncclCommSplit)
We now have an example of all_reduce(relu(x * w)) in ./examples/distributed/test.py

soodoshll · 2023-06-22T03:53:06Z

@yaoyaoding this pr is ready for review :)

Merely assigning environment variables is no longer sufficient for setting up the dev environment. We need to run pip to install the hidet package in develop mode. Users still need to build the C++ source files manually. Consider integrating that into `setup.py` in the future?

yaoyaoding

Thanks @soodoshll !

I left some suggestions on the data organization and implementation.

yaoyaoding · 2023-06-23T00:31:24Z

python/hidet/cuda/nccl/comm.py

 def init_unique_id(unqie_id: NcclUniqueId) -> None:
+    if not nccl_available():
+        raise RuntimeError("NCCL is not available")
    nccl_runtime_api.get_unique_id(unqie_id)


Can we define init_unique_id(...) as

def create_unique_id() -> NcclUniqueId: ...

I feel the current API is not very intuitive.

The point here is that the NcclUniqueId now needs to be shared by all processes. The current solution is:

Create a shared NcclUniqueId object;

Launch multiple processes with the shared uniqueid object as one argument;

Initialize the shared uniqueid object in process 0, which needs a reference to the shared object.

If we create the NcclUniqueId in process 0 after the processes have been launched, it's not so easy to do the broadcast (if there's an elegant way of broadcasting, please let me know).

A workaround is to 1) create the shared object; 2) launch processes; 3) create a unique id object; 4) copy its value back to the shared object.
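The four steps of this workaround can be sketched as below. This is only an illustration under stated assumptions: `NcclUniqueId` here is a stand-in ctypes struct (128 bytes, matching the size of `ncclUniqueId` in `nccl.h`), `create_unique_id` is the hypothetical factory proposed above and fills the id with random bytes instead of calling NCCL, and the `fork` start method (POSIX only) is assumed so the shared object is inherited by the child processes.

```python
import ctypes
import multiprocessing as mp
import os
import zlib

NCCL_UNIQUE_ID_BYTES = 128  # size of ncclUniqueId.internal in nccl.h


class NcclUniqueId(ctypes.Structure):
    """Stand-in for hidet's NcclUniqueId (assumed to be a plain byte buffer)."""
    _fields_ = [("internal", ctypes.c_ubyte * NCCL_UNIQUE_ID_BYTES)]


def create_unique_id() -> NcclUniqueId:
    """Hypothetical factory. A real version would call ncclGetUniqueId;
    random bytes are used here so the sketch runs without NCCL."""
    uid = NcclUniqueId()
    ctypes.memmove(ctypes.addressof(uid), os.urandom(NCCL_UNIQUE_ID_BYTES), NCCL_UNIQUE_ID_BYTES)
    return uid


def worker(rank, shared_id, barrier, checksums):
    if rank == 0:
        uid = create_unique_id()  # step 3: create a unique id in process 0
        # step 4: copy its value back into the shared object
        ctypes.memmove(ctypes.addressof(shared_id), ctypes.addressof(uid), ctypes.sizeof(uid))
    barrier.wait()  # all ranks wait until rank 0 has published the id
    # Record a checksum of the id as seen by this rank.
    checksums[rank] = zlib.crc32(bytes(shared_id.internal))


def launch(nranks):
    ctx = mp.get_context("fork")
    shared_id = ctx.RawValue(NcclUniqueId)  # step 1: create the shared object
    checksums = ctx.RawArray("Q", nranks)
    barrier = ctx.Barrier(nranks)
    procs = [
        ctx.Process(target=worker, args=(rank, shared_id, barrier, checksums))
        for rank in range(nranks)
    ]
    for p in procs:  # step 2: launch the processes
        p.start()
    for p in procs:
        p.join()
    return list(checksums)
```

Calling `launch(3)` returns one checksum per rank; they should all be equal if every rank read the id that rank 0 wrote.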

yaoyaoding · 2023-06-23T00:35:55Z

python/hidet/graph/flow_graph.py

+        # For distributed graphs
+        self.nrank = nrank
+        self.rank = rank
+        self.groups = groups
+


Let's define a new class called FlowGraphAttrs and define these attributes in that class. Then add a field in FlowGraph with FlowGraphAttrs type.

something like

class FlowGraph:
    def __init__(..., attrs=None):
        ...
        self.attrs: FlowGraphAttrs = attrs if attrs else FlowGraphAttrs()
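A slightly fuller sketch of this suggestion, assuming the three distributed attributes from the diff are the only fields for now. The `FlowGraph` below is a reduced stand-in (not hidet's real class), and `is_distributed` is written as a module-level helper rather than a method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FlowGraphAttrs:
    """Hypothetical container for graph-level attributes."""
    nrank: Optional[int] = None
    rank: Optional[int] = None
    groups: Optional[List[List[int]]] = None


class FlowGraph:
    """Reduced stand-in for hidet's FlowGraph; only the attrs field is shown."""

    def __init__(self, outputs, attrs: Optional[FlowGraphAttrs] = None):
        self.outputs = outputs
        self.attrs: FlowGraphAttrs = attrs if attrs is not None else FlowGraphAttrs()


def is_distributed(graph: FlowGraph) -> bool:
    """Module-level helper mirroring the check in the original diff."""
    return graph.attrs.nrank is not None or graph.attrs.rank is not None
```

For example, `FlowGraph(outputs=[])` gets default attributes and is not distributed, while `FlowGraph(outputs=[], attrs=FlowGraphAttrs(nrank=2, rank=0))` is.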

yaoyaoding · 2023-06-23T00:38:10Z

python/hidet/graph/flow_graph.py

+    def is_distributed(self):
+        return self.nrank is not None or self.rank is not None
+
+    def set_dist_attrs(self, nrank: int, rank: int, groups: Optional[List[List[int]]] = None):
+        self.nrank = nrank
+        self.rank = rank
+        self.groups = groups
+


Let's define these functions in the module that uses this functionality, instead of defining them as FlowGraph methods.

I have replaced them with set_attrs

yaoyaoding · 2023-06-23T00:44:53Z

python/hidet/graph/ops/distributed.py

+        self.comm_id = comm_id
+        self.op = op
+
+        super().__init__('all_reduce', inputs=[x], outputs=[y], attributes={})


Better also add comm_id and op to attributes, so that the user can see the comm_id and op when compiling the task.

yaoyaoding · 2023-06-23T00:46:18Z

python/hidet/graph/ops/distributed.py

+        return f"all_reduce_{self.op}_{self.comm_id}"
+
+    def implement(self, target: Union[Target, str], working_dir: str) -> List[IRModule]:
+        # we may need current rank here to avoid duplicated working_dirs


Could you clarify the problem here? Thanks.

If we add the comm_id to attributes, then the op hash would be different.

If we run the compilation concurrently in multiple processes, there might be race conditions on the local filesystem for the same op.
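To make the concern concrete, here is a small sketch of how a working directory might be derived from a task's name and attributes. The function `task_working_dir` and its hashing scheme are hypothetical and are not hidet's actual cache layout; the sketch only shows that adding `comm_id` to the attributes changes the hash, and that mixing in the rank is one possible way to keep concurrent processes out of the same directory.

```python
import hashlib
import os
from typing import Dict, Optional


def task_working_dir(
    cache_root: str, name: str, attributes: Dict[str, object], rank: Optional[int] = None
) -> str:
    """Hypothetical: map (name, attributes[, rank]) to a cache directory."""
    # Sort attributes so the key does not depend on insertion order.
    key = name + repr(sorted(attributes.items()))
    if rank is not None:
        key += f"|rank={rank}"  # assumption: the rank is mixed in to avoid collisions
    digest = hashlib.sha256(key.encode()).hexdigest()[:16]
    return os.path.join(cache_root, digest)


attrs = {"comm_id": 0, "op": "sum"}
same_a = task_working_dir("cache", "all_reduce", attrs)
same_b = task_working_dir("cache", "all_reduce", attrs)
other_comm = task_working_dir("cache", "all_reduce", {"comm_id": 1, "op": "sum"})
rank0 = task_working_dir("cache", "all_reduce", attrs, rank=0)
rank1 = task_working_dir("cache", "all_reduce", attrs, rank=1)
```

Here `same_a == same_b`, which is exactly the case where two processes compiling the same op would write to the same directory, while `other_comm`, `rank0`, and `rank1` are all distinct paths.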

yaoyaoding · 2023-06-23T00:56:25Z

python/hidet/runtime/compiled_graph.py

+            comms_array = comms_to_array(self.nccl_comms)
+            runtime_api.set_nccl_comms(comms_array)


Let's create this when initializing the dist-related info, to avoid creating the comm Array repeatedly.

yaoyaoding · 2023-06-23T00:57:18Z

python/hidet/runtime/compiled_graph.py

@@ -105,6 +114,10 @@ def __init__(
        self.cuda_workspace: Optional[Storage] = None
        self.cpu_workspace: Optional[Storage] = None

+        # distributed properties
+        self.dist_info: Optional[GraphDistributedInfo] = dist_info


Better to put this in GraphMetaData.

I think a better idea is to put the FlowGraphAttr in the GraphMetaData as a whole, instead of reiterating all attributes. But then where should we put FlowGraphAttr? Putting it in flow_graph.py will cause circular import.

yaoyaoding · 2023-06-23T00:58:06Z

python/hidet/runtime/compiled_graph.py

@@ -105,6 +114,10 @@ def __init__(
        self.cuda_workspace: Optional[Storage] = None
        self.cpu_workspace: Optional[Storage] = None

+        # distributed properties
+        self.dist_info: Optional[GraphDistributedInfo] = dist_info
+        self.nccl_comms: List[NcclCommunicator] = []


Store it as an Array of NcclCommunicator directly, to avoid creating the Array repeatedly in run_async.

An Array of NcclCommunicator cannot be passed directly into C++. C++ needs an array of ncclComm_t, which is basically the handle of a NcclCommunicator. To prevent the NcclCommunicators from being released by the GC, we also need to maintain the list of NcclCommunicator objects. If we maintain the ncclComm_t array as well, we will have two redundant arrays that store almost the same values.
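One way to reconcile the two positions is to keep the Python list as the owner of the communicators and build the handle array lazily, rebuilding it only when the list changes. The sketch below uses plain ctypes and stand-in names (`NcclCommunicator` with an integer `handle`, and a hypothetical `CommRegistry`); hidet's own `Array` type and `comms_to_array` may look different.

```python
import ctypes
from typing import List, Optional


class NcclCommunicator:
    """Stand-in that owns an opaque ncclComm_t handle (an address-sized integer)."""

    def __init__(self, handle: int):
        self.handle = handle


def comms_to_array(comms: List[NcclCommunicator]) -> ctypes.Array:
    """Pack the handles into a contiguous array of void* that C++ can read."""
    array_type = ctypes.c_void_p * len(comms)
    return array_type(*(comm.handle for comm in comms))


class CommRegistry:
    """Hypothetical holder: the list keeps communicators alive, the array is a cache."""

    def __init__(self):
        self._comms: List[NcclCommunicator] = []
        self._array: Optional[ctypes.Array] = None

    def add(self, comm: NcclCommunicator) -> None:
        self._comms.append(comm)
        self._array = None  # invalidate the cached handle array

    def as_array(self) -> ctypes.Array:
        if self._array is None:
            self._array = comms_to_array(self._comms)
        return self._array
```

With this layout, repeated calls to `as_array()` (for example from run_async) return the same object until a new communicator is added.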

yaoyaoding · 2023-06-23T01:00:21Z

python/hidet/transforms/include_nccl.py

+    def _recursive_find(root: Stmt):
+        if isinstance(root, BlackBoxStmt):
+            if root.template_string.startswith('nccl'):
+                return True
+        for child in dir(root):
+            if isinstance(child, Stmt):
+                if _recursive_find(child):
+                    return True
+        return False
+
+    ret = _recursive_find(func.body)


Use hidet.ir.tools.collect to collect all BlackBoxStmt.
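For illustration, the idea of collecting nodes by type can be sketched with a few stand-in IR classes. None of the classes or the `collect` function below are hidet's real implementation; they only mimic the shape of the problem. Note that the loop in the quoted diff iterates over `dir(root)`, which yields attribute names (strings), so its `isinstance(child, Stmt)` check never matches; a generic collector avoids that pitfall by walking the actual field values.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import List, Optional


class Stmt:
    """Stand-in base class for IR statements."""


@dataclass
class BlackBoxStmt(Stmt):
    template_string: str


@dataclass
class SeqStmt(Stmt):
    seq: List[Stmt] = field(default_factory=list)


@dataclass
class IfStmt(Stmt):
    then_body: Stmt
    else_body: Optional[Stmt] = None


def collect(root, node_type) -> list:
    """Return every node of `node_type` reachable from `root` (stand-in for a real IR visitor)."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if isinstance(node, node_type):
            found.append(node)
        if isinstance(node, (list, tuple)):
            stack.extend(node)
        elif is_dataclass(node):
            # Walk field values, not attribute names.
            stack.extend(getattr(node, f.name) for f in fields(node))
    return found


def uses_nccl(body: Stmt) -> bool:
    """True if any black-box statement starts with 'nccl', as in the original check."""
    return any(s.template_string.startswith("nccl") for s in collect(body, BlackBoxStmt))
```

A body such as `SeqStmt([BlackBoxStmt("int x = 0;"), IfStmt(SeqStmt([BlackBoxStmt("ncclAllReduce(...);")]))])` is reported as using NCCL even though the call is nested inside the `IfStmt`.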

yaoyaoding · 2023-06-23T01:01:46Z

python/hidet/transforms/__init__.py

@@ -80,5 +81,6 @@ def lower(ir_module: IRModule) -> IRModule:
        rule_based_simplify_pass(),
        inline_let_stmt_pass(),
        simplify_stmt_pass(),
+        include_nccl_pass(),


Later, we will use a pass to add the header information. Let's make this pass a general one and give a name like "annotate_headers" or "annotate_include_headers". Or "annotate_header_and_libs".

Previously, if a primitive function calls a primitive function, the `instantiate_symbols` pass will update the corresponding `hidet.ir.primitives.func.PrimitiveFunctionRegistry.function` in-place (I am not sure exactly how it's done, but this is what I observed), adding symbol variables to its parameters. The primitive function pool is a global variable, therefore this effect is cumulative across tuning candidates. So while candidate 0 will have no problem, candidate 1 will have two extra copies of symbol params, and so on, leading to compile errors. Since primitive functions do not need symbol vars, a quick fix is just to not instantiate any symbols for them.

yaoyaoding

Thanks @soodoshll !

I left some comments.

python/hidet/distributed/distributed.py

python/hidet/distributed/group.py

yaoyaoding · 2023-06-28T20:51:55Z

python/hidet/distributed/group.py

+NCCL_COMMS = []
+_NCCL_ARRAY = None


Suggested change

-NCCL_COMMS = []
-_NCCL_ARRAY = None
+NCCL_COMMS: List[NcclCommunicator] = []
+_NCCL_ARRAY: 'Array' = None