
[BUG] Multi-GPU Memory Allocation #1093

Open
ejmeitz opened this issue Nov 28, 2023 · 10 comments

@ejmeitz commented Nov 28, 2023

Software versions

The legate-issue command is not on my path (not sure why). Here's the output of legate --info instead:

Legate build configuration:
  build_type : Release
  use_openmp : False
  use_cuda   : True
  networks   : ucx
  conduit    :

Jupyter notebook / Jupyter Lab version

N/A

Expected behavior

I am trying to use cunumeric to split a large tensor contraction across multiple GPUs, since it is too big to fit in the memory of a single GPU. As far as I can tell, this is exactly what cunumeric is built for, but I am having trouble getting it to work correctly. I built legate-core and cunumeric from source using the install.py scripts and expected it to mostly just work, but something must have gone subtly wrong in the build.
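
For context, the program is essentially of this shape (a minimal sketch with placeholder shapes and a stand-in einsum contraction, not my actual code):

  import cunumeric as np

  # Stand-in for the real workload: a contraction whose operands are too
  # large to fit in a single GPU's framebuffer.
  a = np.ones((512, 512, 512))
  b = np.ones((512, 512))
  out = np.einsum("ijk,kl->ijl", a, b)
  print(out.shape)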

Observed behavior

When I try to run my cunumeric program from the command line with legate --gpus 2 --fbmem 8000 --sysmem 40000 I get this error:

(screenshot of the error)

If I launch the interactive prompt with just legate and try to import cunumeric, I get the error below, which suggests that cunumeric cannot find one of the CUDA libraries. Clearly something is very broken, but I am unsure how to troubleshoot this, since the builds of both legate and cunumeric appeared to succeed.

(screenshot of the import error)

Example code or instructions

Not this kind of bug.

Stack traceback or browser console output

No response

@manopapad (Contributor)

So this looks like two separate problems.

(1) You are trying to reserve 8000 MiB of framebuffer memory, but your device only has ~7850 MiB available. Try reducing to --fbmem 7500.
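
For example, keeping the rest of your invocation the same (prog.py standing in for your actual program):

  legate --gpus 2 --fbmem 7500 --sysmem 40000 prog.py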

(2) The cuSolver dependency is missing. This should have automatically been included in the environment that you built using generate-conda-envs.py, but possibly something is wrong with that. Can you provide the output from these commands, to help debug the issue?

objdump -p /home/emeitz/.conda/envs/legate/lib/libcunumeric.so | grep PATH
ldd /home/emeitz/.conda/envs/legate/lib/libcunumeric.so | grep solv
conda list

@ejmeitz (Author) commented Nov 28, 2023

(1) Reserving less framebuffer memory fixed the first error; now the command-line run just gives the same error as the legate interactive console.

(2)
objdump -p /home/emeitz/.conda/envs/legate/lib/libcunumeric.so | grep PATH:

  • RPATH /home/emeitz/.conda/envs/legate/lib:$ORIGIN

ldd /home/emeitz/.conda/envs/legate/lib/libcunumeric.so | grep solv

  • Gives nothing, so yes, cuSolver is likely missing. I have it installed with CUDA outside of the conda env, so maybe something got confused there.

Conda list:
cl.txt

@manopapad (Contributor)

ldd [...] gives nothing

That is surprising; I would expect to see something like

libcusolver.so.11 => not found

Can you also try objdump -p libcunumeric.so | grep solv?

For reference, here is what I see on my machine:

(noneditable) iblis:~/noneditable/env> ldd lib/libcunumeric.so  | grep solv
	libcusolver.so.11 => /home/mpapadakis/noneditable/env/lib/libcusolver.so.11 (0x00007f163f000000)
(noneditable) iblis:~/noneditable/env> objdump -p lib/libcunumeric.so  | grep solv
  NEEDED               libcusolver.so.11
  required from libcusolver.so.11:
    0x09a2e521 0x00 11 libcusolver.so.11

Just to confirm, did you build in an environment created using generate-conda-envs.py --ctk 12.0? That script should have included a number of packages that I don't see in your conda list: https://github.com/nv-legate/legate.core/blob/branch-24.01/scripts/generate-conda-envs.py#L80-L83.

@ejmeitz (Author) commented Nov 29, 2023

With that second command I get:

  objdump: 'libcunumeric.so': No such file

The file does exist at /home/emeitz/software/cunumeric/build/lib, but that directory is not on my LD_LIBRARY_PATH, and I guess the conda env isn't picking it up either, even though cunumeric appears in conda list.

If I run objdump without the grep from inside that lib folder, I get the output attached below. With the pipe to grep, nothing shows up.
objdump.txt

Yes, I used an environment file; the actual file is attached below, and I'm fairly sure the command was:
./scripts/generate-conda-envs.py --python 3.11 --ctk 12.0 --os linux --no-compilers --no-openmpi --ucx
environment-test-linux-py3.11-cuda12.0-ucx.zip

@manopapad (Contributor)

OK, I suspect what happened is that you have cuSolver somewhere on your system, and nvcc was able to find it and link against it at build time, but no NEEDED entry for libcusolver.so was recorded in the resulting library.

I think the best thing to do is add cusolver to your conda environment and rebuild. A top-of-tree pull of legate.core should give you a scripts/generate-conda-envs.py that works correctly under CUDA 12.0 and includes the libcusolver-dev package.
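
For example, something along these lines should pull it in (the exact package name and channel may vary with your setup; the generated environment file is the authoritative source):

  conda install -c nvidia libcusolver-dev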

@ejmeitz (Author) commented Dec 1, 2023

OK, I'll nuke everything and try that. Edit: everything compiled, legate-issue runs, and so does my program.

While I'm here: is there a way to compile a cunumeric program into an executable or linked library, so I can call cunumeric routines from other code bases? My long-term use case will make very many calls to the same function (literally just one numpy function), and I'd like to avoid whatever JIT-like startup work happens inside cunumeric on every call.

@manopapad (Contributor)

We haven't looked at ways to "ahead-of-time compile" a cuNumeric program. You could try an accelerated Python interpreter like pypy, but we haven't tried that, so no guarantees that it will work. :-)

Depending on your application, there may be ways to "keep alive" a Python interpreter to send cuNumeric commands to; the easiest is to just drive the whole application through Python.
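
For illustration, a minimal sketch of that pattern (contract and the shapes are placeholders for your actual function): one long-lived Python process pays the startup cost once and reuses the runtime across calls:

  import cunumeric as np

  def contract(a, b):
      # the single numpy-style function that gets called repeatedly
      return np.einsum("ijk,kl->ijl", a, b)

  a = np.ones((256, 256, 256))
  b = np.ones((256, 256))
  for _ in range(1000):
      out = contract(a, b)  # each call reuses the same interpreter and runtime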

There is some work going on toward reducing cuNumeric's overheads, e.g. reusing Python data structures between operations instead of allocating them fresh, caching the dependence analysis for pieces of code that execute repeatedly, and moving more functionality to C++. However, all of these still concern speeding up the invocation of cuNumeric from within a Python program.

@ejmeitz (Author) commented Dec 1, 2023

Ok good to know.

I am using Julia, which can start a Python interpreter alongside the Julia instance and pass Julia objects to that Python instance. I am unsure how that would interact with legate, though. Is the legate command just a custom Python interpreter? If so, it might just work.

This part is mostly me dreaming, but it would be cool to have Julia bindings to cunumeric functions. I have no idea how legion/cunumeric works under the hood, but it's extremely easy to wrap C/C++/Fortran/Python libraries in Julia.

@manopapad (Contributor)

Is the legate command just a custom python interpreter?

Yes, you can run cuNumeric programs using a vanilla python interpreter, and (fundamentally) any compatible interpreter. Most options that you pass through the legate wrapper (e.g. --gpus and --fbmem) need to be passed through the LEGATE_CONFIG environment variable (e.g. LEGATE_CONFIG='--gpus 2 --fbmem 4000' python prog.py).
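
Spelled out, the same run without the wrapper looks like this (prog.py standing in for your program):

  export LEGATE_CONFIG='--gpus 2 --fbmem 4000'
  python prog.py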

Do note that distributed launching becomes harder in that case (simply passing --nodes 2 and --launcher mpirun to LEGATE_CONFIG won't work), and we will likely need more details on your workflow to assist with that.

it would be cool to have bindings to cunumeric functions in Julia

Non-Python bindings have been discussed, but they are not near the top of our priority list at the moment.

@lightsighter (Contributor)

This part is mostly me dreaming, but it would be cool to have bindings to cunumeric functions in Julia.

I too think a Legate Julia would be interesting; it's something we've discussed in the past, but we have limited resources at the moment. I know that folks at LANL have considered this as well. @pmccormick for visibility.
