
Updated vis_gpu API to match vis_cpu #50

Merged: 91 commits from clean-gpu into main on Aug 11, 2022
Conversation

@steven-murray (Contributor) commented on Jul 26, 2022

This PR is quite large, and should generate a new major version.

Fixes #42
Fixes #14

Changes

It removes the following:

  • Ability to use an (l,m) grid for beam interpolation, in both CPU and GPU versions
  • Ability to pass anything except UVBeam or AnalyticBeam objects in the beam list.

It adds the following:

  • GPU version now performs beam interpolation on a regular theta/phi grid (but can also use direct function evaluation if the beam is an AnalyticBeam); a minimal sketch of these two paths follows this list.
  • Beam polarization now fully supported in GPU.
  • A new CLI command, viscpu profile, that runs a simulation with arbitrary numbers of frequencies, times, sources, antennas and beams, and outputs line-profiling information.
  • Updated tests for both CPU and GPU, comparing both against each other and pyuvsim.
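
For reference, here is a minimal numpy sketch of the two beam-evaluation paths mentioned above (bilinear interpolation on a regular theta/phi grid vs. direct analytic evaluation). All function and variable names here are hypothetical illustrations only; the actual GPU implementation does the interpolation in CUDA (see src/vis_cpu/gpu_src/beam_interpolation.cu).

```python
import numpy as np

def interp_beam_regular_grid(beam_grid, theta_grid, phi_grid, theta_src, phi_src):
    """Bilinearly interpolate a beam tabulated on a regular (theta, phi) grid."""
    dth = theta_grid[1] - theta_grid[0]
    dph = phi_grid[1] - phi_grid[0]

    # Fractional indices of each source position within the grid.
    it = np.clip((theta_src - theta_grid[0]) / dth, 0, len(theta_grid) - 2)
    ip = np.clip((phi_src - phi_grid[0]) / dph, 0, len(phi_grid) - 2)
    i0, j0 = it.astype(int), ip.astype(int)
    ft, fp = it - i0, ip - j0

    # Standard bilinear weights on the four surrounding grid points.
    return (
        beam_grid[i0, j0] * (1 - ft) * (1 - fp)
        + beam_grid[i0 + 1, j0] * ft * (1 - fp)
        + beam_grid[i0, j0 + 1] * (1 - ft) * fp
        + beam_grid[i0 + 1, j0 + 1] * ft * fp
    )

def eval_analytic_beam(theta_src, sigma=0.1):
    """Direct evaluation, e.g. a simple Gaussian beam standing in for an AnalyticBeam."""
    return np.exp(-(theta_src**2) / (2 * sigma**2))
```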

It improves the following:

  • CPU version made ~10x faster by switching from an einsum to a matrix product (illustrated by the short sketch below).
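
To make the einsum-to-matrix-product change concrete, here is a schematic sketch; the array names and shapes are illustrative only, not the actual vis_cpu variables. Both lines compute the same contraction over the source axis, but the matrix product dispatches to BLAS and is typically much faster.

```python
import numpy as np

nant, nsrc = 100, 10000
rng = np.random.default_rng(0)
z = rng.standard_normal((nant, nsrc)) + 1j * rng.standard_normal((nant, nsrc))

# Original approach: explicit einsum over the source axis.
vis_einsum = np.einsum("ik,jk->ij", z, z.conj())

# New approach: the same contraction expressed as a matrix product.
vis_matmul = z @ z.conj().T

assert np.allclose(vis_einsum, vis_matmul)
```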

Performance Measurements

GPU vs. original (einsum) CPU

[figure]

CPU: einsum vs. matprod (new)

[figure]

CPU: performance vs. Nthreads

[figure]

Caveats / Things to do

  • The GPU version has not yet been tested against varying numbers of threads, maximum memory, etc.
  • The CPU version has no "chunking", which would allow it to run within smaller memory allocations. Since increasing Ncores doesn't improve performance much, the best way to break up the simulation seems to be by time and frequency, running a single time and frequency per core. However, since that memory wouldn't be shared, we'd need a way to run with low per-core memory (a rough sketch of this chunking idea follows this list).
  • The dominant cost in the CPU version is the element-wise product and exponentiation of the fringe. Further speedups may be possible here via memory layout, and possibly via threading if we tried hard (which would also reduce the need for memory chunking).
  • The viscpu profile script works really well for line-profiling the CPU version, but the GPU version has to be run inside nvprof, which makes the workflow slightly non-uniform. This could be improved.
  • viscpu profile always runs with a particular coarse beam, which is not really representative of real problems. We should check the effect of this.
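
As a rough sketch of the chunking idea above, the loop structure could look something like the following. Here simulate_vis and its signature are hypothetical stand-ins for the real simulator call; the point is only the per-chunk structure and the memory implication noted in the comments.

```python
import numpy as np

def simulate_in_chunks(freqs, times, simulate_vis, nfreq_chunk=1, ntime_chunk=1):
    """Run one (time, frequency) block at a time to bound per-process memory."""
    vis = {}
    for i in range(0, len(freqs), nfreq_chunk):
        for j in range(0, len(times), ntime_chunk):
            fchunk = freqs[i : i + nfreq_chunk]
            tchunk = times[j : j + ntime_chunk]
            # Each chunk is independent, so these calls could be farmed out to
            # separate cores/processes -- but then each process needs its own
            # copy of the source/beam data, hence the per-core memory concern.
            vis[(i, j)] = simulate_vis(freqs=fchunk, times=tchunk)
    return vis
```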

We may not need to do all of the above in this PR, but I thought I'd lay it out for discussion.

@steven-murray (Contributor, Author) commented:
This is currently failing the single-precision test on GPU. For some reason it reports:

pycuda._driver.LaunchError: cuLaunchKernel failed: too many resources requested for launch

This would seem to indicate that the kernel requires more threads/shared memory/registers than are available, yet it works perfectly fine in double precision, which makes no sense to me. Anyone with CUDA knowledge, please step in! Maybe @AaronParsons could help here.
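
One way to dig into this error is to query the compiled kernel's resource usage with pycuda and compare it against the device limits; if the single-precision build happens to use more registers per thread or more shared memory per block than the double-precision one, the same launch configuration could fail only in single precision. This is purely a diagnostic sketch: the kernel source and the name "vis_kernel" below are placeholders, not the real vis_gpu kernel.

```python
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Placeholder kernel; substitute the actual single-precision kernel source here.
kernel_source = r"""
__global__ void vis_kernel(float *out) { out[threadIdx.x] = 0.0f; }
"""
mod = SourceModule(kernel_source)
func = mod.get_function("vis_kernel")

# Per-kernel resource usage as reported by the compiler/driver.
attr = drv.function_attribute
print("registers per thread :", func.get_attribute(attr.NUM_REGS))
print("shared mem per block :", func.get_attribute(attr.SHARED_SIZE_BYTES))
print("max threads per block:", func.get_attribute(attr.MAX_THREADS_PER_BLOCK))

# Corresponding device limits for comparison.
dev = drv.Context.get_device()
print("device max threads/block:", dev.get_attribute(drv.device_attribute.MAX_THREADS_PER_BLOCK))
print("device regs per block   :", dev.get_attribute(drv.device_attribute.MAX_REGISTERS_PER_BLOCK))
```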

Other than that, everything seems to be working fine, and is tested at 100% coverage. To get coverage on GPU I set up a self-hosted runner on my own laptop (clearly not sustainable, but I'll transfer this over to our ASU cluster when I can).

@philbull (Collaborator) left a comment:
Pretty major changes here! Looks like a lot of improvements have been made, which look good to me overall. There are some API breakages that should be documented in the release notes, and I can't really give useful comments on the CUDA code. Only a few minor comments to address, mostly about documentation, plus getting the tests to pass.

Review comments were left on:

  • src/vis_cpu/cli.py
  • src/vis_cpu/conversions.py
  • src/vis_cpu/gpu_src/beam_interpolation.cu
  • src/vis_cpu/vis_cpu.py
  • src/vis_cpu/vis_gpu.py
  • src/vis_cpu/wrapper.py

@philbull (Collaborator) left a comment:

Looks good to me. I think this can be merged (with a version bump) once the tests are passing again.

@steven-murray merged commit c741675 into main on Aug 11, 2022
@steven-murray deleted the clean-gpu branch on Aug 11, 2022 at 18:06
Successfully merging this pull request may close these issues:

  • [Feature Req] Get vis_gpu up to API parity with vis_cpu
  • vis_gpu() arguments don't match how it is called