Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Vulkan queue allocator #685

Closed
ehntoo opened this issue Sep 10, 2022 · 13 comments
Closed

Improve Vulkan queue allocator #685

ehntoo opened this issue Sep 10, 2022 · 13 comments
Assignees

Comments

@ehntoo
Copy link
Contributor

ehntoo commented Sep 10, 2022

I started experimenting with compiling the ngscopeclient app on my MacBook. I almost have things building, and I ran into an architectural limitation that may have a larger impact on the scopehal codebase for eventual macOS / Apple Silicon support.

Under MoltenVK, each hardware queue type only has a single queue associated with it: https://github.com/KhronosGroup/MoltenVK/blob/master/MoltenVK/MoltenVK/API/vk_mvk_moltenvk.h#L378-L384

Metal does not distinguish functionality between queues, which would normally mean only a single general-purpose queue family with multiple queues is needed. However, Vulkan associates command buffers with a queue family, whereas Metal associates command buffers with a specific Metal queue. In order to allow a Metal command buffer to be prefilled before is is formally submitted to a Vulkan queue, each Vulkan queue family can support only a single Metal queue. As a result, in order to provide parallel queue operations, MoltenVK provides multiple queue families, each with a single queue.

On my M1 Max MacBook Pro, that results in this output from VulkanInit():

% /Users/ehntoo/projects/scopehal-apps/build/tests/Filters/Filters "Filter_FFT"
Initializing Vulkan
    VK_KHR_get_physical_device_properties2: supported
    VK_EXT_debug_utils: supported
    Loader/API support available for Vulkan 1.3
    Vulkan 1.2 support available, requesting it
    Initializing glfw 3.3.8 Cocoa NSGL EGL OSMesa dynamic
    GLFW required extensions:
        VK_KHR_surface
        VK_EXT_metal_surface
    Physical devices:
        Device 0: Apple M1 Max
            API version:            0x004010e0 (0.1.1.224)
            Driver version:         0x0000277f (0.0.2.1919)
            Vendor ID:              106b
            Device ID:              c0503ef
            Device type:            Integrated GPU
            int64:                  yes
            int16:                  yes (allowed in SSBOs)
            int8:                   yes (allowed in SSBOs)
            Max image dim 2D:       16384
            Max storage buf range:  4095 MB
            Max mem alloc:          1024 MB
            Max compute shared mem: 32 KB
            Max compute grp count:  1073741824 x 1073741824 x 1073741824
            Max compute invocs:     1024
            Max compute grp size:   1024 x 1024 x 1024
            Memory types:
                Type 0
                    Heap index: 0
                    Device local
                Type 1
                    Heap index: 0
                    Device local
                    Host visible
                    Host coherent
                    Host cached
                Type 2
                    Heap index: 0
                    Device local
                    Host visible
                    Host cached
                Type 3
                    Heap index: 0
                    Device local
                    Lazily allocated
            Memory heaps:
                Heap 0
                    Size: 64 GB
                    Device local
        Selected device 0
            Queue families (4 total)
                Queue type 0
                    Queue count:          1
                    Timestamp valid bits: 64
                    Graphics
                    Compute
                    Transfer
                    Using this queue type for compute
                    Using this queue type for rendering
                Queue type 1
                    Queue count:          1
                    Timestamp valid bits: 64
                    Graphics
                    Compute
                    Transfer
                Queue type 2
                    Queue count:          1
                    Timestamp valid bits: 64
                    Graphics
                    Compute
                    Transfer
                Queue type 3
                    Queue count:          1
                    Timestamp valid bits: 64
                    Graphics
                    Compute
                    Transfer
            Enabling 64-bit integer support
            Enabling 16-bit integer support
            Enabling 16-bit integer support for SSBOs
            Enabling 8-bit integer support
            Enabling 8-bit integer support for SSBOs
            Using type 1 for pinned host memory
            Using type 0 for card-local memory
zsh: segmentation fault  /Users/ehntoo/projects/scopehal-apps/build/tests/Filters/Filters "Filter_FFT"

The segmentation fault occurs at the end of VulkanInit() while attempting to allocate a "Compute" queue for g_vkFFTQueue and accessing a nonexistent queueIndex of 1 for queue type 0.

I know "proper" macOS/Apple Silicon support is still a future item, but thought this might be a useful thing to be aware of in advance.

@azonenberg
Copy link
Collaborator

Good to know, this will definitely hit us soon. It shouldn't be too hard to ifdef some stuff and act differently with MoltenVK.

  • @lainy since she's working on the Apple side of things.

@azonenberg
Copy link
Collaborator

azonenberg commented Sep 10, 2022

@ehntoo I think this should be as simple as making the queue allocators always return 0 (i.e. every VkQueue points to the same "hardware queue") on MoltenVK. Can you try this and see if the unit tests work then?

@azonenberg
Copy link
Collaborator

A more optimal long term fix would divide the four Metal queues up among compute worker threads etc. But the workaround i mentioned above should suffice to at least make things run.

@ehntoo
Copy link
Contributor Author

ehntoo commented Sep 10, 2022

Overriding the queue allocators to always return 0 definitely gets closer, Acceleration, Filter_FrequencyMeasurement and Primitive_SampleOnRisingEdges pass with that change. Looks like I'll need to properly figure out shader compilation for several others, as they're failing with things like ReadDataFile: Could not open file "shaders/Convert16BitSamples.spv".

I'll try and get my changes cleaned up enough to push to a fork before end-of-day in case they're of any interest.

@ehntoo
Copy link
Contributor Author

ehntoo commented Sep 10, 2022

Made some changes to replace references to /proc/self/exe and things are looking much better. I suspect the remaining two failures that are bailing with SIGILL will probably take some more investigation.

ctest -j12 -C Debug -T test --output-on-failure
   Site: feirefiz
   Build name: Darwin-clang++
Test project /Users/ehntoo/projects/scopehal-apps/build
      Start  3: Filter_DeEmbed
      Start  5: Filter_FFT
      Start  4: Filter_FIR
      Start  7: Filter_Upsample
      Start  6: Filter_Subtract
      Start  8: Filter_FrequencyMeasurement
      Start 10: Primitive_Convert16BitSamples
      Start  9: Primitive_Convert8BitSamples
      Start 11: Primitive_SampleOnRisingEdges
      Start  2: Buffers_CpuGpu
      Start  1: Buffers_CpuOnly
 1/11 Test  #1: Buffers_CpuOnly ..................   Passed    0.19 sec
 2/11 Test  #2: Buffers_CpuGpu ...................   Passed    0.21 sec
 3/11 Test  #5: Filter_FFT .......................***Exception: Illegal  0.25 sec
Filters: Filter_FFT
Randomness seeded to: 4219064763
Iteration 0

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Filters is a Catch2 v3.1.0 host application.
Run with -? for options

-------------------------------------------------------------------------------
Filter_FFT
  Iteration 0
-------------------------------------------------------------------------------
/Users/ehntoo/projects/scopehal-apps/tests/Filters/Filter_FFT.cpp:77
...............................................................................

/Users/ehntoo/projects/scopehal-apps/tests/Filters/Filter_FFT.cpp:77: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGILL - Illegal instruction signal

===============================================================================
test cases: 1 | 1 failed
assertions: 2 | 1 passed | 1 failed


 4/11 Test  #3: Filter_DeEmbed ...................***Exception: Illegal  0.40 sec
Filters: Filter_DeEmbed
Randomness seeded to: 1552133182
Iteration 0

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Filters is a Catch2 v3.1.0 host application.
Run with -? for options

-------------------------------------------------------------------------------
Filter_DeEmbed
  Iteration 0
-------------------------------------------------------------------------------
/Users/ehntoo/projects/scopehal-apps/tests/Filters/Filter_DeEmbed.cpp:89
...............................................................................

/Users/ehntoo/projects/scopehal-apps/tests/Filters/Filter_DeEmbed.cpp:89: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGILL - Illegal instruction signal

===============================================================================
test cases: 1 | 1 failed
assertions: 2 | 1 passed | 1 failed


 5/11 Test #11: Primitive_SampleOnRisingEdges ....   Passed    0.46 sec
 6/11 Test  #9: Primitive_Convert8BitSamples .....   Passed    1.40 sec
 7/11 Test #10: Primitive_Convert16BitSamples ....   Passed    1.48 sec
 8/11 Test  #8: Filter_FrequencyMeasurement ......   Passed    2.97 sec
 9/11 Test  #6: Filter_Subtract ..................   Passed    7.52 sec
10/11 Test  #7: Filter_Upsample ..................   Passed   25.58 sec
11/11 Test  #4: Filter_FIR .......................   Passed   66.18 sec

82% tests passed, 2 tests failed out of 11

Total Test time (real) =  66.19 sec

The following tests FAILED:
	  3 - Filter_DeEmbed (ILLEGAL)
	  5 - Filter_FFT (ILLEGAL)
Errors while running CTest

ehntoo added a commit to ehntoo/scopehal that referenced this issue Sep 11, 2022
@ehntoo
Copy link
Contributor Author

ehntoo commented Sep 11, 2022

@azonenberg
Copy link
Collaborator

Yeah we use procfs by default on POSIX platforms to find data files.

If I had to guess, the SIGILL is in JIT code generated by FFTS since both of the tests involving FFTs - and only those - are failing. What happens if you comment out the code in the test case that sets g_gpuFilterEnabled false?

Now you'll be using vkFFT only. This means you no longer have a second FFT implementation to cross-check your results against, but if it doesn't crash then that's a good sign. The spectrogram filter (which does not yet have a unit test) currently uses FFTS and does not have a vkFFT backend, so that is likely to crash too. But fixing that is on my to-do list.

@azonenberg
Copy link
Collaborator

You can also ignore the unit test failures for the purposes of ngscopeclient, because while ngscopeclient can connect to an instrument it doesn't yet display waveform data or let you configure filter blocks. So there's no way to actually reach the crashing code path.

What I'm more interested in at this point is making ngscopeclient run, render a GUI, and allow basic interaction. If that works, at this stage, I'm happy.

ehntoo added a commit to ehntoo/scopehal that referenced this issue Sep 11, 2022
@ehntoo
Copy link
Contributor Author

ehntoo commented Sep 11, 2022

with ehntoo/scopehal-apps@e70659b, I've made... progress?
Screen Shot 2022-09-11 at 5 19 46 PM

Three issues are immediately clear:

  1. Input handling is not correct - mouse movements only seem to be tracked when the window is not focused
  2. Whatever's going on with the font texture rendering on top of the left side panes (expected)
  3. No app entry showing up in the system tray/task switcher? (Related to 1?)

Should I move this thread of discussion to another issue for macOS bring-up?

@ehntoo
Copy link
Contributor Author

ehntoo commented Sep 11, 2022

Remaining basic issues running ngscopeclient on macOS fixed in ehntoo@d19c507. I'll tidy up a bit and submit some PRs.

ehntoo added a commit to ehntoo/scopehal that referenced this issue Sep 11, 2022
@azonenberg azonenberg changed the title Vulkan Queue Management on MoltenVK Improve Vulkan queue allocator Sep 12, 2022
@azonenberg
Copy link
Collaborator

azonenberg commented Sep 12, 2022

Per discussion in dev chat, this issue is now specifically for the problem of "we try to allocate a lot of queues of the first type that fits, and run out". Please file new issues for any other OSX or MoltenVK portability issues.

@lainy
Copy link
Collaborator

lainy commented Sep 27, 2022

Currently working on this.

Also, for reference the ngscopeclient output from my M1 MacBook Pro is below.
A quick diff shows this is identical to the M1 Max information in the original post, other than a difference in total heap size and device IDs.

Initializing Vulkan
    VK_KHR_get_physical_device_properties2: supported
    VK_EXT_debug_utils: supported
    Loader/API support available for Vulkan 1.3
    Vulkan 1.2 support available, requesting it
    Initializing glfw 3.3.8 Cocoa NSGL EGL OSMesa dynamic
    GLFW required extensions:
        VK_KHR_surface
        VK_EXT_metal_surface
    Physical devices:
        Device 0: Apple M1
            API version:            0x004010e0 (0.1.1.224)
            Driver version:         0x0000277f (0.0.2.1919)
            Vendor ID:              106b
            Device ID:              c0603ef
            Device type:            Integrated GPU
            int64:                  yes
            int16:                  yes (allowed in SSBOs)
            int8:                   yes (allowed in SSBOs)
            Max image dim 2D:       16384
            Max storage buf range:  4095 MB
            Max mem alloc:          1024 MB
            Max compute shared mem: 32 KB
            Max compute grp count:  1073741824 x 1073741824 x 1073741824
            Max compute invocs:     1024
            Max compute grp size:   1024 x 1024 x 1024
            Memory types:
                Type 0
                    Heap index: 0
                    Device local
                Type 1
                    Heap index: 0
                    Device local
                    Host visible
                    Host coherent
                    Host cached
                Type 2
                    Heap index: 0
                    Device local
                    Host visible
                    Host cached
                Type 3
                    Heap index: 0
                    Device local
                    Lazily allocated
            Memory heaps:
                Heap 0
                    Size: 16 GB
                    Device local
        Selected device 0
            Queue families (4 total)
                Queue type 0
                    Queue count:          1
                    Timestamp valid bits: 64
                    Graphics
                    Compute
                    Transfer
                    Using this queue type for compute
                    Using this queue type for rendering
                Queue type 1
                    Queue count:          1
                    Timestamp valid bits: 64
                    Graphics
                    Compute
                    Transfer
                Queue type 2
                    Queue count:          1
                    Timestamp valid bits: 64
                    Graphics
                    Compute
                    Transfer
                Queue type 3
                    Queue count:          1
                    Timestamp valid bits: 64
                    Graphics
                    Compute
                    Transfer
            Enabling 64-bit integer support
            Enabling 16-bit integer support
            Enabling 16-bit integer support for SSBOs
            Enabling 8-bit integer support
            Enabling 8-bit integer support for SSBOs
            Using type 1 for pinned host memory
            Using type 0 for card-local memory

@azonenberg
Copy link
Collaborator

Fixed ages ago and we forgot to close the ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants