initiate zero-copy output. #62
base: in_place_dynamic
Conversation
```cpp
using OutputAllocatorCPtr = std::shared_ptr<const OutputAllocator>;

class OutputMemoryMngr : public IMemoryMngr {
```
OutputMemoryMngr should inherit IMemoryMngrObserver to be used polymorphically with other IMemoryMngrObserver implementations.
src/plugins/intel_cpu/src/graph.cpp (outdated)
```cpp
for (auto& edge : edge_clusters[box.id]) {
    if (edge->getChild()->getType() == Type::Output) {
        isOutGrp++;
        break;
    }
```
Simply use std::any_of to check whether the cluster contains an output edge.
```cpp
if (output != outputNodesMap.end()) {
    auto parentEdge = output->second->getParentEdgeAt(0);

if (graph->hasDynamicInput()) { // TODO: internal dynamism
```
To check whether the graph is dynamic, one may use its status.
src/plugins/intel_cpu/src/graph.cpp (outdated)
```cpp
if (isOutGrp) {
    IE_ASSERT(isOutGrp == 1); // reuse_io_tensors false
    grpMemMngr =
        std::make_shared<OutputMemoryMngr>(std::unique_ptr<MemoryMngrWithReuse>(new MemoryMngrWithReuse()));
```
This memory manager should be stored in the graph in a container of objects of the specific OutputMemoryMngr type, so that the infer requests can access those memory managers.
```cpp
if (outMemMngr != nullptr) {
    outMemMngr->setMemDesc(desc);
}
```
As we discussed, this should not work that way. The memory descriptor may not reflect the memory size to be allocated; please consider inPlace concat chains. As agreed, the reallocation should happen using the size requested from the memory manager, via the wrapped OutputMemoryMngr::resize call.
```cpp
auto outblob = InferRequest::GetBlob(it.first);

outputAllocators[it.first] = std::make_shared<OutputAllocator>(outblob);
```
The dependency must be inverted. The blob should be initialized with the allocator, not vice versa.
```cpp
void OutputMemoryMngr::setExtBuff(void* ptr, size_t size) {
    if (m_allocator) {
        return;
```
Please avoid silent action skipping. Either replace the memory pointer in the allocator, or throw.
```cpp
namespace ov {
namespace intel_cpu {

class OutputAllocator : public Allocator {
```
Since we are going to use OutputMemoryMngr instead of Allocator, we do not need this class at all.
```cpp
std::map<std::string, NodePtr> inputNodesMap;
std::map<std::string, NodePtr> outputNodesMap;

std::map<std::string, MemoryMngrPtr> outputNodesMemMngrMap;
```
It is better to use unordered_map. It also makes more sense to store the specific OutputMemoryMngr type to avoid type conversions.
```cpp
    dynBatch = newDynBatch;
}

Status getDynStatus() const { return status; }
```
Please rename it to simply getStatus.
```cpp
MemoryMngrPtr grpMemMngr;
grpMemMngr =
    std::make_shared<DnnlMemoryMngr>(std::unique_ptr<MemoryMngrWithReuse>(new MemoryMngrWithReuse()));

// determine a group with outputs.
size_t isOutGrp = 0;
int64_t outBoxId = -1;
for (auto& box : group) {
    if (std::any_of(
        edge_clusters[box.id].begin(),
        edge_clusters[box.id].end(),
        [box](const ov::intel_cpu::EdgePtr edge) {
            return edge->getChild()->getType() == Type::Output;
        })) {
        isOutGrp++;
        outBoxId = box.id;
    }
}
if (isOutGrp) {
    IE_ASSERT(isOutGrp == 1); // reuse_io_tensors false
    grpMemMngr =
        std::make_shared<OutputMemoryMngr>(grpMemMngr);
    DEBUG_LOG(grpMemMngr);

    // Store the output memory managers,
    // so that the infer requests can get access to them.
    for (auto& edge : edge_clusters[outBoxId]) {
        const auto child = edge->getChild();
        if (child->getType() == Type::Output) {
            for (auto& output : outputNodesMap) {
                if (output.second == child) outputNodesMemMngrMap[output.first] = grpMemMngr;
            }
        }
    }
}
```
This part must be handled in a different way. Even before the static memory initialization on L814 we have to do the following:

- Search for all the Output edges.
- Allocate all the found edges with the status Edge::Status::NeedAllocation using the OutputMemoryMngr.
- Store the output memory managers used for initialization of the output edges, to provide access from infer requests.

The rest of the code frame may remain the same, as those output edges now have Allocated status and will be skipped during the static and dynamic memory initialization routines.
@maxnick There is one problem with your suggestion: an output edge might share memory with a base edge, in which case the output edge is marked Status::NotAllocated. Such an output edge will slip through and won't use OutputMemoryMngr.
```diff
  const auto &currDesc = edges[0]->getMemory().getDesc();
- if (currDesc.getShape().isStatic() && currDesc.getShape().getStaticDims() == newOutputShape)
+ if (currDesc.getShape().isStatic() && currDesc.getShape().getStaticDims() == newOutputShape && !forceUpdateShape)
```
The cornerstone of software development is the isolation of complexity. Using this flag, we expose the complexity of the memory relationship to the node. Let's try to avoid such a design flaw. It seems that we can handle the memory resize on the OutputMemoryDesc level: we can register the fact that the allocator was reset and call allocate either on demand (when get ptr is called) or when resize is called. That way we will not have to handle this nuance outside the memory manager implementation.
@maxnick yes, the existing design is simply a quick try and needs optimizing.
There is a problem here if we allocate when get ptr is called: there is no info in the manager on how much memory should be allocated, as there is no memdesc, no size from the previous allocator, etc., unless we invent a new interface in IMemoryMngr to get this information. Another problem is that getRawPtr() is const, so it shouldn't change the object itself.

> handle the memory resize on the OutputMemoryDesc level

I don't quite understand this. The output edge may share memory with other edges, and each edge has its own memory object and memory desc. How do we let all of them know about the updates from the manager? Or can we assume there is no need to update the memdesc?
Force-pushed from cd818cd to bfaef7c.
Force-pushed from 97b8873 to 8c87601.