Better memory management #22
In thinking on this specific issue... is this related to luxonis/depthai-core#253?

When NO... I see no harm in adding a custom allocator.

For an image frame... the current depthai+xlink codebase sends a single big chunk of data from device to host. There are probably smaller USB packets (and definitely smaller IP packets for PoE), yet eventually those are transmitted and collected into a big chunk. This is my understanding of the code so far. This chunk contains two blocks within it. For example... an RGB camera capture...
This whole chunk (image frame + metadata) is one allocation of memory. The image frame block and the metadata block are both inconsistent sizes, therefore the whole chunk is an inconsistent size... meaning the memory allocations vary in size from frame to frame. And... the allocations are not just the image data. The chunks are image data + metadata, and the layout is "depthai-specific". This means that whatever digests this single chunk of allocated memory has to support the "depthai-specific" memory layout. The only code that supports the "depthai-specific" layout is depthai itself. Read on for how this might relate to OpenCV...

When YES... OpenCV does not support the storage of image data (e.g. cv::UMat) and custom metadata in the same object. Discussed here: opencv/opencv#18792
There is no built-in way to pair the two. It is possible for an app to create its own class for that (I made my own).

Notes: If the depthai "data" and "metadata" are in the same memory allocation... in the same big chunk collected together after transmission... then I doubt the benefit of a custom allocator given the above. If they are separated... then a custom allocator is more interesting. The metadata is random, so keep using the internal XLink allocation system since the metadata and its layout is "depthai-specific". No need to expose that to the app code. But the data as a separate isolated block of allocated memory... that is useful, even if the size isn't the same from frame to frame. For example, the app could wrap that block in a cv::UMat. Or maybe there is an additional neural net on the host GPU and the app is waiting for a block of data from the device. The app's NN wants the device data to be written into a preallocated block of memory and passes that pointer to depthai+xlink, and the USB packets are then written into it. Given those two examples (UMat, NN)... what if they are the same app? What if the app runs a NN and also uses OpenCV image processing? In that case, there are two custom allocators. I can easily extend this chain of thought to have more custom allocators in the same app.
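To make the "app owns the allocator" idea concrete, here is a minimal sketch of an app-side pool allocator such an app could register. Everything here (`FramePool`, the 64 B alignment, worst-case sizing, the assumption that all buffers are released before the pool is destroyed) is illustrative, not depthai/XLink API:

```cpp
#include <cstddef>
#include <mutex>
#include <new>
#include <vector>

// Hypothetical fixed-size buffer pool an app might hand to depthai+xlink.
// Each buffer's capacity is the worst-case frame size, so every frame fits
// even though actual frame sizes vary.
class FramePool {
public:
    FramePool(std::size_t bufferSize, std::size_t count)
        : bufferSize_(bufferSize) {
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(::operator new(bufferSize_, std::align_val_t{64}));
    }
    // assumes every acquired buffer has been released before destruction
    ~FramePool() {
        for (void* p : free_) ::operator delete(p, std::align_val_t{64});
    }
    void* acquire() {
        std::lock_guard<std::mutex> lk(m_);
        if (free_.empty()) return nullptr; // pool exhausted
        void* p = free_.back();
        free_.pop_back();
        return p;
    }
    void release(void* p) {
        std::lock_guard<std::mutex> lk(m_);
        free_.push_back(p);
    }
    std::size_t bufferSize() const { return bufferSize_; }
private:
    std::size_t bufferSize_;
    std::vector<void*> free_;
    std::mutex m_;
};
```

A second pool (e.g. one for UMat-backed frames, one for NN input buffers) is just another instance, which is the "multiple custom allocators in the same app" scenario above.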
It is also related to the mentioned issue in core - but its main intention is to not require copying the memory. How this ties together with both issues is twofold:
The second one is partial in this case, as you've mentioned - you don't get to control the overall size precisely...
As much as the size of data is "guaranteed" (depending on your pipeline on the device side), the size of metadata is also capped; currently XLink is at 50 KiB. (To make this even more useful, that cap should be exposed.) So using this to have your own pool allocators might be a bit harder if the allocator were xlink-global; it might make more sense for it to be local to a stream. So if it's stream-local and we expose the maximum metadata size, one could still do pools more easily. But the main reason, I imagine, is being able to allocate memory where needed at specified alignments - so if we want the incoming 300x300 BGR image to be allocated 64B-aligned in a certain memory region, then that is possible also. Regarding the metadata that is at the end of the data, that can be ignored* - Example:
So at the end of the day, you are left with an ImgFrame that has a 300x300 BGR frame's worth of data which is, let's say, aligned, or in some specific memory area, etc. So to circle back: the main reason IMO is mostly zero copy, then alignment, and last the capability of doing pools and allocating in special memory areas, etc., if that helps the consumer of the data.

*Ignored as in not needing to access/use that serialized metadata, which is an implementation detail, but instead using the ImgFrame object to access metadata.
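The "64B-aligned allocation" part can be sketched without any depthai/XLink API at all (`roundUp64` and `allocAligned64` are my illustrative names): round the requested payload up to a 64 B multiple, then allocate it 64 B aligned.

```cpp
#include <cstddef>
#include <cstdlib>

// Round a byte count up to the next multiple of 64.
constexpr std::size_t roundUp64(std::size_t n) {
    return (n + 63) & ~std::size_t{63};
}

// std::aligned_alloc requires the size to be a multiple of the alignment,
// hence the rounding; free the result with std::free.
void* allocAligned64(std::size_t payload) {
    return std::aligned_alloc(64, roundUp64(payload));
}
```

For the 300x300 BGR example (270000 B of pixels), this yields a 270016 B allocation whose start address satisfies typical 64 B zero-copy requirements.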
The following is more discussion on challenges... I get your intention. This is clear and easy for host-only CPU memory. Imagine a depthai custom allocator that creates an OpenCL zero-copy buffer. Usually integrated GPUs support zero-copy while discrete GPUs support low-copy (still have to copy over PCIe). Most hardware requires the start address to be 64-byte aligned and the total allocated size to be a multiple of 64 bytes. Here are Intel's requirements: https://www.intel.com/content/www/us/en/developer/articles/training/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics.html#inpage-nav-2 A 300x300 BGR buffer will NOT satisfy the requirements for an OpenCL zero-copy buffer. OpenCV does not adjust striding: https://github.com/opencv/opencv/blob/d24befa0bc7ef5e73bf8b1402fa1facbdbf9febb/modules/core/src/ocl.cpp#L5426. Therefore, a 300x300 BGR image will result in many copies occurring within the OpenCL runtime. For this example, use instead a 320x320 BGR image. Perhaps this is a traditional allocator+context situation. Pseudocode...

// standard depthai interface for allocator results
struct Depthai_Alloc_Return {
    void* const data;    // xlink understands where to start writing
    const size_t size;   // xlink knows bounds
    void* const context; // xlink has no knowledge, always passes to next step
};
// custom struct defined and used only within my app
struct My_App_Context {
UMat gpuUmat;
Mat hostMat;
};
// this function is declared/defined in my app, and a pointer to it is passed to a depthai API call that cascades into XLink
Depthai_Alloc_Return customAllocator(size_t xlinkNeedByteSize) {
    // enforce size requirement
    assert(xlinkNeedByteSize % 64 == 0);
    // ugly raw new; requires xlink and later functions never fault without deleting this mem
    auto myContext = new My_App_Context();
    // must live beyond this custom allocator
    myContext->gpuUmat = UMat(1, xlinkNeedByteSize, CV_8UC1); // pseudocode: allocated in OpenCL device memory
    // also must live beyond this allocator because it is the
    // thing that maps gpu memory to the host. On destruct, it unmaps and the host loses access
    myContext->hostMat = myContext->gpuUmat.getMat(ACCESS_WRITE);
    // where do we put the metadata?
    // all OpenCV apis (and devs) expect gpuImage to work like this
    //   cv::UMat nv12gpuImage;
    //   cv::cvtColor(gpuImage, nv12gpuImage, COLOR_YUV2BGR_NV12);
    //
    // submatrix on cpu memory works well, submatrix on gpu memory is fragile -- have to test
    return {
        myContext->hostMat.data,
        myContext->hostMat.total() * myContext->hostMat.elemSize(),
        myContext
    };
}

First, there is the issue of lifetimes of the context and its Mat/UMat members. Second, the issue of size, striding, and the metadata. If XLink calls customAllocator, it only passes a byte count, but XLink has no knowledge of rows and columns. Who does the needed calculation to turn rows/columns/stride into that byte count? Pretend depthai parses the metadata and now passes ?something? to something in the main app:

UMat something(somethingParam param) {
    // get result block of memory that contains the image data, metadata, and context
    auto block = dai::getImage();
    My_App_Context myContext;
    try {
        myContext = *static_cast<My_App_Context*>(block.context);
    }
    catch (...) {
        delete static_cast<My_App_Context*>(block.context);
        return {};
    }
    delete static_cast<My_App_Context*>(block.context);
    // somehow separate the image data from the metadata
    // because the custom allocator allocated all the memory together in one block
    auto submatrix = cv::UMat(myContext.gpuUmat, Range(0, block.rows * block.stride));
    // reshape UMat based on the depthai metadata
    return submatrix.reshape(block.channels, block.rows);
    // destructor for Mat and UMat within myContext should run as scope ends here
}
I'm not familiar with how UMat/OpenCL/CUDA function in terms of memory management and available APIs, so read the following with that in mind.
Correct - but this is a data issue. To resolve it we must present means to set a stride on the device (or one strides the image on the host, but that is a "copy operation", and it's "easy" again - as we can "copy" to a needed destination memory location), to be able to e.g. resize an image to 300x300 and set a stride alignment of 64 (padding each 300-wide row to 320). So the resulting image would be: 320 * 300 * 3 (stride * height * planes) = 288000 B. This will arrive to XLink, into a specified buffer. After everything is processed as I've previously mentioned, you are given a zero-copy data ptr that points to the beginning of this image. If we allocate it 64B aligned, it can also start at a 64B-aligned address. Afterwards, one (UMat handwaving ahead) creates a UMat from the existing memory and uses that to compute upon, send to a GPU, etc... Same as in the pseudocode above.

As to metadata, no need to worry about it here yet - it gets transferred alongside the data, but afterwards the only thing relevant is the final Message. In this case, that'll be e.g. an ImgFrame with its data ptr and the rest of the information.

The above is possible IFF the recipient API supports viewing/taking ownership of existing memory (with the given limitations, it being aligned, etc...). If this isn't possible, then things get a bit harder, likely requiring a copy, to not go through all the pain otherwise. (A quick example would be std::vector, hah.) In the UMat/OpenCL/CUDA cases is such an approach possible, or do we always get memory from their API and that's what is available to work with? Or is the whole above premise false, as we're always just given memory by their API and have to work with that?
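The stride arithmetic above can be written down directly (names are mine, for illustration): pad the row width to the stride alignment, then multiply by height and plane count.

```cpp
#include <cstddef>

// Round n up to the next multiple of a.
constexpr std::size_t alignUp(std::size_t n, std::size_t a) {
    return (n + a - 1) / a * a;
}

// Total bytes for a planar image whose rows are padded to strideAlign.
constexpr std::size_t paddedFrameBytes(std::size_t width, std::size_t height,
                                       std::size_t planes, std::size_t strideAlign) {
    return alignUp(width, strideAlign) * height * planes;
}

// Matches the example in the comment: 320 * 300 * 3 = 288000 B.
static_assert(paddedFrameBytes(300, 300, 3, 64) == 288000, "stride example");
```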
All you wrote is CPU-centric (which you caveat'd). The overall thinking doesn't apply to OpenCL, CUDA, and (from what I understand) Vulkan. You describe something that can already be done today with depthai-core. There is no need for a custom allocator... because depthai-core doesn't need such a feature. depthai-core can use the aligned boost vector to make the "existing memory". Perhaps you do this in stages. Start with adding alignment and UMat support.
Then next year, consider stage 2. From my perspective, a custom allocator is when my app provides the allocator, my allocator provides depthai-core a pointer, and depthai-core must write the frame data to that address. Not the opposite. No "from existing". The app owns the allocator, the app's allocator provides the pointer, therefore the app owns the memory. That's luxonis/depthai-core#253.

More on OpenCL and OpenCV

I have OpenCL experience (OpenCV's default gpu/device api) so I'll write from that perspective. I will "personify" OpenCL for brevity and ignore the less-common scenarios. OpenCL has a device (gpu, vpu, etc.) view of the world. By default... it is always about the device. The host (traditional cpu) only exists to service the device. When OpenCL speaks about memory, it is, by default, speaking about its own (device) memory. It is in control of memory and only deals with host memory via explicit apis. A block of memory... means a block of device memory. This memory is accessible only to the device. The host (cpu) has no access. Naturally, there is a method for data to be exchanged between device<->host. The device is in control of that. The following process is similar in CUDA and Vulkan, and roughly follows the pseudocode I wrote above.

OpenCL default process -> Device allocates+owns memory
That is the default process that works across the widest set of Device+Host+OS combinations. It has the most opportunities for that ecosystem to self-optimize, zero-copies, etc. And it is often the easiest for an app to use -when compared- to the code and testing needed to reach the same level of optimization with the alternate data exchange methods. [Sidebar: OpenCL has no idea of "striding". It is just a block of memory. It is the responsibility of the host code -and- the device code/kernels to manage any memory layout concerns like striding.]

Alternate 1 -> Device still allocates+owns memory yet copies Host memory to Device

One alternate data exchange process is one you allude to in your post immediately above: "...creates UMat from existing memory". The OpenCL runtime will do what is requested and will make as many copies of the whole 1MB of data as needed to make it happen. Optimization is no longer a priority. To aid in maybe optimizing (not promised) there are strict requirements that are different for each manufacturer, and even different models of the same manufacturer. Intel's requirements I linked above. Requirements like the 64-byte start alignment and 64-byte size multiples.
Notice step 2. It is the same as the default process. Meaning... the "source" Host pointer given to OpenCL in step 1 is not the same pointer given back to the Host in step 2. It might be, it might not be. The OpenCL runtime will copy data to any location it chooses, when it chooses. There are many caveats and options and conditions. The Host can only read/write to the pointer location the Device gives it on "map" in step 2. This is why it is best to use the default process, because that has the most opportunities for optimization and no copies.

Alternate 2 -> Device still allocates+owns memory and Host wants the same Host memory address

The second alternate data exchange process for "...creates UMat from existing memory". The OpenCL runtime will consider what is requested and may deny it. When the Host requests the use of the same Host memory address there is more possibility for out-of-memory scenarios. This is a special and limited kind of memory, often called "pinned memory". If too much "pinned memory" is being used (by the entire OS), then OpenCL will deny the request and the codepath cannot proceed. The amount of "pinned memory" available differs based on device models, device manufacturers, device drivers, OS settings, and host hardware platform. For example, I suspect a raspberry pi has less possible pinned memory compared to a giant gaming Nvidia desktop. One shouldn't fear using pinned memory, but instead be responsible with its use. Unfortunately, OpenCV is a carnival of code with respect to this Alternate 2 with pinned memory. It is fragile, often does needless copies of the buffer... it is a mess. I've fixed all the issues in my private OpenCV -- running great for months. I gave up working with the OpenCV maintainer (paired with me) for reasons not relevant to this discussion, so the public OpenCV releases are problematic. So this leads again to my recommendation to use the default approach at the top.
OpenCV's OpenCL support still has race conditions in the mapping process, is not thread-safe, etc... but with this alternative it also adds needless copies... in addition to any copies OpenCL might do itself. [Same as Alternate 1: to aid in maybe optimizing (not promised) there are strict requirements that differ per manufacturer and even per model; Intel's requirements I linked above.]
Thanks for the insight. As you've mentioned, I agree it's best split into 2 stages. Afterwards, to improve the OpenCL, CUDA and other usecases, we can move into making it possible to do a zero copy into a memory buffer provided by one of the mentioned APIs (as long as the unnecessary, already-consumed serialized metadata at the end of the buffer won't cause issues). When stage 1 is complete, we immediately get a 1-copy reduction even in OpenCL/CUDA usecases, as we can then do the 1 remaining copy into the needed buffer. Also, in cases where good optimizations from these APIs aren't available and a memory transfer to the device is made anyway, this would yield the same "magnitude of performance". So, I think having stage 1 done is a great step forward in both cases. Although for complete proper integration later in stage 2, I think it might have to be redone a bit - we'll see about that afterwards. I'll try to get a PoC of zero copy done by the end of the week.
Zero Copy definition?

I think we are both saying "zero copy" but meaning slightly different things. I'd like to check and confirm... "zero copy" to me means there are no bulk full-image-frame copies on the host or the OpenCL device. Is that what you mean when you write "zero copy"?

I'm ok with stage 1, meaning depthai+xlink does the single host cpu memory allocation with the 64/64 restrictions.

My code for threads, move, and alignment

I'll push branches (not PRs). I can't do PRs because both of these need the XLink move semantics from PR #29. When/if that is merged in, then I can open PRs.
I do - but I was using it wrongly, hah. Did some additional digging around yesterday, and it turns out we could be doing zero copy from USB DMA -> XLink as well :D Which in turn could be USB DMA -> "GPU memory". But it turns out that's kinda out of scope, as there are too many restrictions to go along with (memory management would get hard fast). Anyway, the "zerocopy" in mind here is the XLink->core and core parsing scope, as you mentioned above. For stage 1, I intend to move the "std::vector" out of RawBuffer and put a different data holder in its place. And as part of that, StreamPacketParser doesn't have to copy the data; it just deserializes the last bits of the buffer into an arrived-type Message. Regarding alignment and data size, XLink by default already does 64B address and 64B size alignment, so that will be handled as well. Thanks a lot for the commits - I've already integrated #29 into a temporary branch.
Got it. Since there are STL "container" standards/rules, perhaps take a 2nd look at that holder's interface. Loop me in if/when you want my look at something. 👍
To provide better memory management the following changes are proposed:

- Add XLinkSetAllocator(alloc, dealloc, ...)
- Modify XLinkPlatformAllocateData & XLinkPlatformDeallocateData to call the callbacks instead
- Rename XLinkReleaseData and XLinkReleaseSpecificData to XLinkFreeData and XLinkFreeSpecificData respectively
- Add a ReleaseData equivalent, which performs everything the same with the exception of actually freeing the data
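The proposed hooks could look roughly like the following. The exact signatures are not specified in this issue, so the typedefs and default behavior here are assumptions for illustration only:

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical callback types for the proposed XLinkSetAllocator.
using XLinkAllocFn   = void* (*)(std::size_t size);
using XLinkDeallocFn = void  (*)(void* ptr);

namespace {
// Defaults preserve today's behavior when no custom allocator is set.
XLinkAllocFn   g_alloc   = std::malloc;
XLinkDeallocFn g_dealloc = std::free;
}

void XLinkSetAllocator(XLinkAllocFn alloc, XLinkDeallocFn dealloc) {
    g_alloc = alloc;
    g_dealloc = dealloc;
}

// XLinkPlatformAllocateData / XLinkPlatformDeallocateData would then route
// through the registered callbacks instead of calling malloc/free directly.
void* XLinkPlatformAllocateData(std::size_t size) { return g_alloc(size); }
void  XLinkPlatformDeallocateData(void* ptr)      { g_dealloc(ptr); }
```

With this shape, the renamed XLinkFreeData/XLinkFreeSpecificData would call XLinkPlatformDeallocateData, while the new ReleaseData equivalent would skip that final deallocation so the app-owned buffer survives.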