[BUG] Multi-GPU Memory Allocation #1093
Comments
So this looks like two separate problems. (1) You are trying to reserve 8000 MiB of framebuffer memory, but your device only has ~7850 MiB available. Try reducing to (2) The
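For point (1), a minimal sketch of how one might pick an --fbmem value that fits on every GPU. It assumes nvidia-smi is on the path and reports free memory in MiB; the helper names and the 500 MiB headroom are hypothetical choices for illustration, not part of legate.

```python
import subprocess


def parse_free_mib(smi_output):
    """Parse one integer (free MiB) per non-empty line of nvidia-smi output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def safe_fbmem(free_mib_per_gpu, headroom_mib=500):
    """Largest per-GPU reservation that fits on every GPU, minus some headroom.

    headroom_mib is an arbitrary illustrative margin, not a legate requirement.
    """
    return max(min(free_mib_per_gpu) - headroom_mib, 0)


def query_free_mib():
    """Ask nvidia-smi for the free framebuffer memory of each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_free_mib(result.stdout)
```

On a machine with GPUs, safe_fbmem(query_free_mib()) gives a number to pass as --fbmem in place of 8000.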
(1) Reducing the memory reservation fixed the first problem; the program now gives the same error as the legate interactive console. (2)
Conda list:
That is surprising; I would expect to see something like
Can you also try For reference, here is what I see on my machine:
Just to confirm, did you build in an environment created using
Well, with that second command I get: If I run objdump without the grep from inside the lib folder, I get the file below. With the pipe to grep, nothing pops up. Yes, I used an environment file; here's the actual file, and I'm pretty sure the command was:
OK, I suspect what happened is that you have cusolver somewhere on your system, and nvcc was able to find it and link to it at build time, but no link to I think the best thing to do is just add cusolver to your conda environment and rebuild. I think a top-of-tree pull of legate.core will give you a
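As a rough illustration of the kind of check discussed above, here is a sketch that lists the NEEDED entries from the dynamic section printed by objdump -p and looks for a given library name. This is not necessarily the exact command used in this thread; the helper names are hypothetical, and the path of the shared object to inspect is left to the caller.

```python
import subprocess


def needed_libraries(objdump_output):
    """Return the library names from 'NEEDED <lib>' lines of `objdump -p` output."""
    libs = []
    for line in objdump_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "NEEDED":
            libs.append(parts[1])
    return libs


def links_against(objdump_output, name):
    """True if any NEEDED entry contains `name` (for example 'cusolver')."""
    return any(name in lib for lib in needed_libraries(objdump_output))


def dump_dynamic_section(shared_object_path):
    """Run `objdump -p` on a shared object and return its text output."""
    result = subprocess.run(
        ["objdump", "-p", shared_object_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

If links_against(dump_dynamic_section(path), "cusolver") is False for the built library, the binary records no dynamic dependency on cusolver, which would match the empty grep result described above.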
Ok, I'll nuke everything and try that. Edit: Things compiled. While I'm here, is there a way to make a cunumeric program an executable or linked library so I can call cunumeric routines from other code bases? My long-term use case will require very many calls to the same function (literally just one numpy func), and I'd like to avoid all the JIT or whatever is happening inside cunumeric every time I call the function.
We haven't looked at ways to "ahead-of-time compile" a cuNumeric program. You could try an accelerated Python interpreter like Depending on your application, there may be ways to "keep alive" a Python interpreter, to send cuNumeric commands to. The easiest being to just drive the whole application through Python. There is some work going on in the direction of reducing cuNumeric's overheads, e.g. reusing Python data structures between operations instead of allocating them fresh, caching the dependence analysis for pieces of the code that get executed repeatedly, and moving more functionality to C++. However, these are still concerned with speeding up the invocation of cuNumeric as part of a Python program.
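To make the "keep alive" idea concrete, here is a minimal sketch of a long-lived Python process that reads one JSON request per line and writes one JSON reply per line, so the interpreter (and whatever cuNumeric has initialized) is started only once. The line-based JSON protocol, the function names, and the choice of tensordot are assumptions made for illustration; whether this works well under the legate launcher has not been verified here.

```python
import json


def serve(handler, requests, responses):
    """Answer requests until the input stream ends.

    Each request is a JSON object {"a": ..., "b": ...} on its own line;
    each reply is a JSON object {"result": ...} on its own line.
    """
    for line in requests:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between requests
        payload = json.loads(line)
        result = handler(payload["a"], payload["b"])
        responses.write(json.dumps({"result": result}) + "\n")
        responses.flush()


def make_handler():
    """Build a handler that contracts two arrays over one axis.

    Assumption: cunumeric is importable; otherwise fall back to NumPy.
    """
    try:
        import cunumeric as np
    except ImportError:
        import numpy as np

    def handler(a, b):
        return np.tensordot(np.array(a), np.array(b), axes=1).tolist()

    return handler


# Typical use from a script kept running by the calling application:
#     import sys
#     serve(make_handler(), sys.stdin, sys.stdout)
```

The calling program would start this script once, keep its stdin and stdout pipes open, and send a new line for each call, paying the interpreter start-up cost a single time.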
Ok, good to know. I am using Julia, which has the ability to start a Python interpreter alongside the Julia instance and pass Julia objects to that Python instance. I am unsure how that would play with legate though. Is the This part is mostly me dreaming, but it would be cool to have bindings to cunumeric functions in Julia. I have literally no clue how legion/cunumeric works under the hood, but it's extremely easy to wrap C/C++/Fortran/Python libraries in Julia.
Yes, you can run cuNumeric programs using a vanilla Do note that distributed launching becomes harder in that case (simply passing
Non-Python bindings have been discussed, but they are not near the top of our priority list at the moment.
I too think doing a Legate Julia would be interesting and it's something we've discussed in the past, but we have limited resources at the moment. I know that folks at LANL have considered this as well. @pmccormick for visibility.
Software versions
Command is not on path (not sure why). Here's output of
legate --info
Jupyter notebook / Jupyter Lab version
N/A
Expected behavior
I am trying to use cunumeric to split up a large tensor contraction across multiple GPUs, as it is too big to fit in the memory of one GPU. This is exactly what cunumeric is built for as far as I can tell, but I am having issues getting it to work correctly. I have built legate-core and cunumeric from source using the install.py scripts. I expected it to kind of just work, but something must have gone subtly wrong in the build.
Observed behavior
When I try to run my cunumeric program from the command line with
legate --gpus 2 --fbmem 8000 --sysmem 40000
I get this error:
If I type just
legate
and activate the interactive prompt and try to import cunumeric, I get this error, which seems to suggest that cunumeric cannot find one of the CUDA libraries. Clearly something is very broken, but I am unsure how to troubleshoot this, as the build of both legate and cunumeric looked successful.
Example code or instructions
Stack traceback or browser console output
No response