Low frame rates on Starcraft 2 while CPU and GPU utilization is low #333

Open
Venemo opened this issue Jan 13, 2019 · 63 comments
Comments

@Venemo

Venemo commented Jan 13, 2019

Long story short: I observed a scenario where I get low frame rates even though CPU and GPU utilization are low. This is a continuation of this issue, where we started discussing a performance problem with Starcraft 2 when running on Gallium Nine standalone.

Hardware details: the system is a Dell XPS 13 9370 with a 4-core / 8-thread Intel 8550U, a Sapphire AMD Radeon RX 570 ITX card is connected through Thunderbolt 3 (which is roughly equivalent to a x4 PCI-E connection), and a 4K Dell U2718Q screen is connected to the RX 570's DisplayPort output.

I didn't measure the frame rates in a scientific way, but the difference is very noticeable:

  • On the same setup I easily get 60-80 fps on all high settings on Windows 10
  • Running Gallium Nine standalone on Fedora 29's Wine 3.21 on medium settings results in about 30-40 fps ― the culprit seems to be the shaders; if I turn shaders down to low, it gets above 60 fps.
  • If I run Wine 3.13 with the PBA patch from the espionage724 repo, then I get about 50-60 fps (again on medium settings)

Some observations from the other thread that are worth mentioning here:

  • The frame rate doesn't seem to be related to resolution. Setting the system to use 1080p (using Gnome settings) results in similar numbers when using Nine.
  • According to GALLIUM_HUD CPU utilization is around 30% and GPU utilization is around 50%
  • According to perf top, the majority of time is spent in functions like si_set_constant_buffer, amdgpu_mm_rreg and NineDevice9_SetIndices, from which I got the impression that some buffers are copied to/from the GPU and that may be a problem.

@axeldavy

On the first campaign, after the videos, I get about 140 fps on full HD, everything maxed out, on my Radeon RX 480. The GPU load is about 90%. My 4 CPU threads are all around 60-70%.
I used GALLIUM_HUD=GPU-load,fps,cpu0+cpu1+cpu2+cpu3,shader-clock+memory-clock+temperature tearfree_discard=true WINEDEBUG=-all

That does sound awesome, and makes me think that the problem may be with my setup, or that maybe my setup triggers some sort of corner case within Nine. After thinking about it more, I've got three ideas:

Idea 1. Maybe there's something wrong with my kernel command line? Basically, I disable all the spectre/meltdown mitigations, enable the PowerPlay features in amdgpu, and blacklist the i915 module.

resume=UUID=0dc28d3d-cf9a-4a1d-b980-e7f78ad7aaee rd.luks.uuid=luks-98f2c2d3-77e1-444a-b3f2-d3396b53e16e rhgb quiet mem_sleep_default=deep pti=off spectre_v2=off l1tf=off nospec_store_bypass_disable no_stf_barrier amdgpu.ppfeaturemask=0xffffffff i915.enable_guc=0 module_blacklist=i915 3

Idea 2. Maybe somehow Nine performs more IO through PCI-E than wined3d, and the Thunderbolt 3 port is the bottleneck. Is it possible that there are some operations within Nine that are not noticeable when using a PCI-E x16 connection but become problematic when running over PCI-E x4? I don't know how to verify this theory, but maybe there is a way to check the PCI-E port usage.

Idea 3. It occurs to me that I installed DirectX 9 from winetricks in this Wine prefix before I started using Nine. Is it possible that this interferes with Nine somehow?

@iiv3

do you compile your own kernel? If so, you might have some additional tracing enabled.

No, I've got 4.19 from Fedora: 4.19.13-300.fc29.x86_64

If not...
Could you try disabling the IOMMU ("iommu=off"), in case you are hitting some "soft" emulation that involves page faults.

Good idea, I will try that.

If this doesn't help either, try entering "amdgpu_mm_rreg" in perf top and see which instructions are getting the most heat.

Sure, I'll take a look, though I'm afraid I'm not an expert in perf.

Sorry for the long post!

@iiv3

iiv3 commented Jan 15, 2019

Actually most motherboards have very few PCIE x16 slots, usually one or two. While there may be more full-sized PCIE slots, they often actually operate as x4 ones.
So maybe @axeldavy could try testing his video card in an x4 PCIE slot.

@Venemo,
You could try the radeontop program, which shows the utilization of specific GPU subsystems. It doesn't sound like the bottleneck is there, but it might give some useful hint.

You could also take a look at what options are available to you in R600_DEBUG; despite its name, it works on the latest Radeons too. Run R600_DEBUG=help glxgears and you will get all the available options (some drivers have more). Try everything that is not about dumping stuff to the console, one option at a time.

Thunderbolt 3 definitely complicates things.
I would add testing all working iommu types, in case you do need hardware support for it.

I had a similar slowdown caused by the fact that my distribution had an i586-optimized glibc, which used a memcpy() with a hack for the original Pentium: it would read from the destination memory before overwriting it with real data. When used over my PCIE x16 link it would show something like 25% CPU in perf top and 50 fps instead of 70 fps (benchmark average).

This reminds me...
You had two separate perf top logs that showed si_set_constant_buffer and amdgpu_mm_rreg as the top functions... but I could not quite understand what you did differently to get one result or the other.
It's not compilation or inlining, as one is from mesa3d and the other is from the kernel.

If you are compiling your own mesa3d, try using -O1 or -Og; it might give some hints. Actually, compiling with debug support would be a very good idea.
When you run perf top you can select a function with the arrow keys. Press enter once, then press it a second time to enter Annotate mode. There it will show you how hot each instruction in that function is. If you have debug support it might even show you the source.

Also, I've tried to make a line that would show all graphs on the screen, try it in case it provides some hint.

export GALLIUM_HUD=\
".dfps+cpu+GPU-load+temperature,.dGTT-usage+VRAM-usage,num-compilations+num-shaders-created;"\
"primitives-generated+draw-calls,samples-passed+ps-invocations+vs-invocations,buffer-wait-time;"\
"CS-thread-busy+gallium-thread-busy,dma-calls+cp-dma-calls,num-bytes-moved;"\
"num-vs-flushes+num-ps-flushes+num-cs-flushes+num-CB-cache-flushes+num-DB-cache-flushes"

You can check whether you have extra graphs with GALLIUM_HUD=help glxgears and add them too.

@Venemo

Venemo commented Jan 15, 2019

@iiv3 Thanks for the suggestions! I'll try them.

I had a similar slowdown caused by the fact that my distribution had i586 optimized glibc, that used a memcpy() with hack for original Pentium. It would read from the destination memory, before overwriting it with real data.

What was the solution to this one? How do I check if this is the case?

You had two separate perf top logs that showed si_set_constant_buffer and amdgpu_mm_rreg as top usage functions... but I could not quite understand what have you done differently to get one or the other result.

I honestly don't know, it is possible that I have messed up something in that WINEPREFIX. I will re-install the game in a clean prefix and use that for further testing.

@axeldavy

Every frame, vertices must be sent to the graphics card for the draw calls. There are also the draw commands sent by the driver. Good games don't upload anything else.

As mentioned already, the vertices for SC2 are stored in a permanently mapped buffer. The allocation is done in GTT, i.e. graphics memory in CPU RAM. At draw time, the graphics card fetches the data from there.
The thing is, if that were the limiting factor, the GPU load would be higher: the GPU is running shaders while fetching this vertex data, and that should count as GPU running. Still, you can try using VRAM for these allocations (the size of the mappable VRAM is restricted to 64MB, but it should be enough here). Replace all PIPE_USAGE_STREAM in buffer9.c and nine_buffer_upload.c with PIPE_USAGE_DEFAULT. Possibly in your case it would be faster, as the data would make it to the GPU sooner.

As for the draw commands, there's not much to do.

Maybe, as iive suggested, you have wrong flags somewhere which lead to suboptimal code paths.

@Venemo

Venemo commented Jan 15, 2019

Okay, so on Fedora 29 I upgraded to kernel 5.0-rc1 and wine staging 4.0-rc1. Then I deleted the wine prefix and reinstalled the game in a clean prefix.

If this doesn't help either, try entering "amdgpu_mm_rreg" in perf top and see which instructions are getting the most heat.

There is a mov instruction which takes ~96% of the time spent in that function.

You could try radeon top program, that shows utilization of specific GPU subsystems. It doesn't sound like the bottleneck is there, but it might give some useful hint.

Here is the output. I couldn't copy-paste the bars from the terminal, but the percentages are clearly visible.

                         Graphics pipe  39,17% │
───────────────────────────────────────────────┼─
                 Event          Engine   0,00% │
                                               │
           Vertex Grouper + Tesselator  10,00% │
                                               │
                     Texture Addresser  24,17% │
                                               │
                         Shader Export  31,67% │
           Sequencer Instruction Cache   8,33% │
                   Shader Interpolator  33,33% │
                                               │
                        Scan Converter  34,17% │
                    Primitive Assembly  10,83% │
                                               │
                           Depth Block  30,83% │
                           Color Block  30,00% │
                                               │
                    2815M / 3995M VRAM  70,44% │
                       73M / 4093M GTT   1,79% │
            1,75G / 1,75G Memory Clock 100,00% │
            1,24G / 1,24G Shader Clock 100,00% │
                                

You could also take a look what options are available for you in R600_DEBUG

I took a look, and tried a few options, but some of them caused a crash, while others caused a noticeable performance decrease, so I didn't get too far with it. Is there a specific option you want me to try?

I would add testing all working iommu types, in case you do need a hardware support for it.

With iommu=off my wifi card is not recognized, which is a problem. With iommu=on intel_iommu=on I get roughly +5 fps, and the "gallium thread busy" line from GALLIUM_HUD goes to zero. So I guess that is a good thing at least.

If you are compiling your own mesa3d try to use -O1 or -Og, it might give some hints. Actually compiling with debug support would be very good idea.

I just use the mesa that comes with Fedora. I'll also try the version from the "che" repo which is supposed to be a newer version with more optimizations enabled, will report back if it helps.

Also, I've tried to make a line that would show all graphs on the screen, try it in case it provides some hint.

Here is a screenshot from the graphs with intel_iommu=on:


@Venemo Venemo closed this as completed Jan 15, 2019
@Venemo Venemo reopened this Jan 15, 2019
@Venemo

Venemo commented Jan 15, 2019

Oops, I accidentally hit "close and comment", sorry. Reopening it now.

@axeldavy

(screenshot: performance graphs)

@Venemo

Venemo commented Jan 15, 2019

@axeldavy Thanks! How come the samples passed and the ps/vs invocations are so low on your machine? Also, the primitives generated seem to be far fewer than mine.

@axeldavy

It depends on the scene and resolution.

@Venemo

Venemo commented Jan 16, 2019

Okay, I tried the patch suggested by @axeldavy and replaced PIPE_USAGE_STREAM with PIPE_USAGE_DEFAULT in the specified files. It definitely helps, but does not solve the problem.

Here are the graphs from the patched nine:

Graphs from patched nine

Hotspots from perf top:

  13,66%  d3dadapter9.so.1.0.0      [.] si_set_constant_buffer
   4,88%  anonmap.WTz65d (deleted)  [.] 0x000000000023f415
   2,69%  d3dadapter9.so.1.0.0      [.] NineDevice9_SetIndices
   2,57%  d3dadapter9.so.1.0.0      [.] amdgpu_do_add_real_buffer
   2,19%  d3dadapter9.so.1.0.0      [.] amdgpu_cs_add_buffer
   1,30%  dsound.dll.so             [.] 0x000000000002997a
   1,07%  anonmap.WTz65d (deleted)  [.] 0x000000000023f403
   0,63%  libc-2.28.so              [.] __memmove_avx_unaligned_erms
   0,59%  anonmap.WTz65d (deleted)  [.] 0x0000000001272ed6
   0,58%  dsound.dll.so             [.] 0x00000000000298e1

Here is the hot spot from the annotated si_set_constant_buffer:

       │     si_descriptors.c:1212
  0,01 │       mov    (%rsi),%rax
  0,00 │       lea    (%rax,%r14,8),%rdx
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:139
  0,07 │       mov    (%rdx),%rsi
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:79
  0,01 │       cmp    %rsi,%r8
       │     ↓ je     8c
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:87
  0,00 │       test   %rsi,%rsi
       │     ↓ je     8c
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:89
 98,08 │       lock   subl   $0x1,(%rsi)
  1,16 │     ↓ jne    8c
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:146
  0,00 │ 4d:   mov    0x20(%rsi),%rax
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:148
       │       mov    0x28(%rsi),%rcx
       │       mov    %rdx,0x8(%rsp)

I also examined the same graph while running wined3d with the PBA patch:

Graphs from wined3d PBA

Hotspots according to perf top:

  16,67%  radeonsi_dri.so           [.] u_upload_alloc
   3,12%  radeonsi_dri.so           [.] cso_set_vertex_buffers
   2,27%  radeonsi_dri.so           [.] tc_call_draw_vbo
   1,84%  radeonsi_dri.so           [.] amdgpu_cs_add_buffer
   1,68%  libc-2.28.so              [.] __memmove_avx_unaligned_erms
   1,53%  radeonsi_dri.so           [.] amdgpu_do_add_real_buffer
   1,25%  wined3d.dll.so            [.] 0x000000000005a390
   1,20%  wined3d.dll.so            [.] 0x000000000005b5e4
   1,17%  wined3d.dll.so            [.] 0x000000000005b870
   0,90%  libpthread-2.28.so        [.] __pthread_mutex_lock
   0,79%  ntdll.dll.so              [.] RtlEnterCriticalSection
   0,73%  wined3d.dll.so            [.] 0x000000000005a3af
   0,66%  wined3d.dll.so            [.] 0x000000000005a39a
   0,64%  wined3d.dll.so            [.] 0x000000000005b52d
   0,63%  wined3d.dll.so            [.] 0x000000000005b5d0
   0,63%  wined3d.dll.so            [.] 0x000000000005b670
   0,61%  wined3d.dll.so            [.] 0x000000000005b4a6
   0,61%  ntdll.dll.so              [.] RtlLeaveCriticalSection
   0,59%  [amdgpu]                  [k] amdgpu_mm_rreg
   0,57%  wined3d.dll.so            [.] 0x000000000005aad0
   0,57%  libpthread-2.28.so        [.] __pthread_mutex_unlock_usercnt
   0,53%  wined3d.dll.so            [.] 0x000000000005b5ed
   0,51%  radeonsi_dri.so           [.] _mesa_unmarshal_dispatch_cmd
   0,50%  wined3d.dll.so            [.] 0x000000000005aabd

Which means that wined3d+PBA still beats nine by a dozen or so FPS. What stands out to me is:

  • wined3d-pba uses 10 times as much GTT, but only half the VRAM
  • nine uses significantly less CPU, but wined3d-pba manages to load the GPU about 10% more than nine
  • the ps/vs invocations are in the same ballpark, though wined3d-pba has somewhat fewer samples passed

Further notes:

  • Both screenshots were taken from approx. the same point in the same replay on the same graphical settings.
  • For the sake of this experiment I downgraded to Wine 3.13, because that is the version I could easily get for Fedora with the PBA patch included. I recompiled Nine standalone for this Wine version.
  • mesa 18.2.8 was patched: the patch was added to mesa.spec and the whole x86_64 mesa package was rebuilt, but I only installed the mesa-d3d packages from the build output. (I assume the change doesn't affect the rest of the packages.)

Where do we go from here?

@Venemo

Venemo commented Jan 17, 2019

Here is what lspci -vvv has to say about the GPU:

08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef) (prog-if 00 [VGA controller])
	Subsystem: Sapphire Technology Limited Device e343
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 128 bytes
	Interrupt: pin A routed to IRQ 156
	Region 0: Memory at 60000000 (64-bit, prefetchable) [size=256M]
	Region 2: Memory at 70000000 (64-bit, prefetchable) [size=2M]
	Region 4: I/O ports at 2000 [size=256]
	Region 5: Memory at ac000000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at ac040000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <1us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
			 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee006b8  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [200 v1] Resizable BAR <?>
	Capabilities: [270 v1] Secondary PCI Express <?>
	Capabilities: [2b0 v1] Address Translation Service (ATS)
		ATSCap:	Invalidate Queue Depth: 00
		ATSCtl:	Enable-, Smallest Translation Unit: 00
	Capabilities: [2c0 v1] Page Request Interface (PRI)
		PRICtl: Enable- Reset-
		PRISta: RF- UPRGI- Stopped+
		Page Request Capacity: 00000020, Page Request Allocation: 00000000
	Capabilities: [2d0 v1] Process Address Space ID (PASID)
		PASIDCap: Exec+ Priv+, Max PASID Width: 10
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [320 v1] Latency Tolerance Reporting
		Max snoop latency: 71680ns
		Max no snoop latency: 71680ns
	Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [370 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

Here is what amdgpu's sysfs has to say:

[root@timur-xps ~]# cat /sys/class/drm/card1/device/pp_dpm_pcie 
0: 2.5GT/s, x8 
1: 8.0GT/s, x16 *

Reducing the bandwidth using echo 0 > /sys/class/drm/card1/device/pp_dpm_pcie did not seem to have any effect on the frame rate, but I'm honestly not sure how reliable pp_dpm_pcie is. Or maybe the TB3 bottleneck is more severe than any restriction pp_dpm_pcie can impose.

Trying to determine the available bandwidth using clpeak is not possible due to the missing PCI-E atomics support:

kfd kfd: skipped device 1002:67df, PCI rejects atomics

However this egpu.io page states the following: Intel throttles 32Gbps-TB3 to 22Gbps which benchmarks as 20.40Gbps (AMD) or 18.91Gbps (Nvidia) in the important H2D direction.

@Venemo

Venemo commented Jan 17, 2019

I took a look at the hot spots for Nine in perf top and investigated a bit:

si_set_constant_buffer -> pipe_resource_reference -> pipe_reference_described -> p_atomic_dec_zero, which is a call to either __atomic_sub_fetch or __sync_sub_and_fetch (preferring the __atomic version if available).

Looking at the other items like NineDevice9_SetIndices and amdgpu_do_add_real_buffer, all of them wait on a similar synchronization primitive.

@dhewg

dhewg commented Jan 18, 2019

Cannot reproduce here either

  • Xeon E3-1231 v3 @ 3.40GHz
  • R9 380X
  • Linux 4.19.13
  • mesa master from ~2 weeks ago
  • wine 4.0~rc6
  • sc2 set to 1080p with ultra gfx settings

without vsync:
screenshot_2019-01-18_13-10-30
and with vsync:
screenshot_2019-01-18_13-56-08

@Venemo

Venemo commented Jan 18, 2019

@dhewg @axeldavy Which distro do you guys use? Just asking because I might just try running that distro as an experiment. Maybe there is some package that is better optimized there than here, and maybe that makes the difference. Or maybe mesa itself is compiled with different flags?

@dhewg

dhewg commented Jan 18, 2019

I'm on Debian, @axeldavy on Arch iirc.
I built mesa myself though, using CFLAGS="-march=native" CXXFLAGS="-march=native" meson ~/src/mesa --prefix /opt/andre/mesa -Ddri-drivers= -Dgallium-drivers=radeonsi,swrast -Dvulkan-drivers=amd -Dgallium-nine=true -Dosmesa=gallium

@Venemo

Venemo commented Jan 18, 2019

Okay, so @axeldavy suggested an approach that helps determine whether the constant data is the bottleneck or not:

08:20 mannerov: in nine_state.c
08:20 mannerov: prepare_vs_constants_userbuf
08:20 mannerov: when cb.buffer_size is set
08:21 mannerov: set it to the maximum value instead, that is sizeof(float[4]) * NINE_MAX_CONST_ALL
08:21 mannerov: if it doesn't decrease fps, it is not limiting

Here is how the patch file looks:

From debd5413b1d14a28f26ae40c1d907df621044c8a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Timur=20Krist=C3=B3f?= <venemo@msn.com>
Date: Fri, 18 Jan 2019 14:38:49 +0100
Subject: [PATCH] Change prepare_vs_constants_userbuf to always use the maximum
 size.

---
 src/gallium/state_trackers/nine/nine_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/gallium/state_trackers/nine/nine_state.c b/src/gallium/state_trackers/nine/nine_state.c
index bf39ebc9b4..209518ac2e 100644
--- a/src/gallium/state_trackers/nine/nine_state.c
+++ b/src/gallium/state_trackers/nine/nine_state.c
@@ -431,7 +431,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
     struct pipe_constant_buffer cb;
     cb.buffer = NULL;
     cb.buffer_offset = 0;
-    cb.buffer_size = context->vs->const_used_size;
+    cb.buffer_size = sizeof(float[4]) * NINE_MAX_CONST_ALL;
     cb.user_buffer = context->vs_const_f;
 
     if (context->swvp) {
-- 
2.20.1

When I run the game with this patch:

  • I get a worse frame rate than before; it's about 10 fps lower in the scene where I took the two screenshots above.
  • VRAM usage is hugely worse than before, nearing 3 GB.
  • There is a new hot spot in perf top.

Now the perf top looks like this:

  17,21%  d3dadapter9.so.1.0.0      [.] si_decompress_textures
  12,59%  d3dadapter9.so.1.0.0      [.] si_set_constant_buffer
   4,72%  d3dadapter9.so.1.0.0      [.] nine_context_draw_indexed_primitive_priv
   1,58%  d3dadapter9.so.1.0.0      [.] amdgpu_do_add_real_buffer
   1,31%  dsound.dll.so             [.] 0x000000000002997a
   1,06%  d3dadapter9.so.1.0.0      [.] amdgpu_cs_add_buffer
   0,74%  anonmap.qaDqCx (deleted)  [.] 0x000000000023f415
   0,65%  d3dadapter9.so.1.0.0      [.] si_set_active_descriptors_for_shader
   0,55%  dsound.dll.so             [.] 0x00000000000298e1
   0,55%  libc-2.28.so              [.] __memmove_avx_unaligned_erms
   0,54%  d3dadapter9.so.1.0.0      [.] amdgpu_add_fence_dependencies_bo_list
   0,49%  d3dadapter9.so.1.0.0      [.] si_bind_vs_shader
   0,49%  [unknown]                 [.] 0000000000000000
   0,45%  d3dadapter9.so.1.0.0      [.] nine_update_state
   0,44%  d3dadapter9.so.1.0.0      [.] amdgpu_lookup_buffer
   0,42%  dsound.dll.so             [.] 0x0000000000029971

Interesting how si_decompress_textures now appears in there. I also annotated it:

       │     si_blit.c:780                                                                                                                                                                                          ▒
  1,71 │ 18:   push   %r15                                                                                                                                                                                          ▒
 44,26 │       push   %r14                                                                                                                                                                                          ▒
  7,00 │       push   %r13                                                                                                                                                                                          ▒
 30,18 │       push   %r12                                                                                                                                                                                          ▒
 16,11 │       push   %rbp                                                                                                                                                                                          ▒
  0,02 │       push   %rbx                                                                                                                                                                                          ▒
       │       mov    %rdi,%rbx                                                                                                                                                                                     ▒
  0,01 │       sub    $0x28,%rsp   

@Venemo
Copy link
Author

Venemo commented Jan 18, 2019

@dhewg Is there anything in here that stands out to you? https://src.fedoraproject.org/rpms/mesa/blob/f29/f/mesa.spec#_365

@Venemo
Copy link
Author

Venemo commented Jan 18, 2019

I also tried the other suggestion by @axeldavy - here are the results.

23:43 mannerov: Venemo_j: another thing you could try is in si_pipe.c in the const_uploader = u_upload_create call to replace DEFAULT by STREAM

Here is what I experience when I run the game with this:

  • In the same scene, the fps is increased to ~80
  • GTT usage increased to ~400 MB
  • VRAM usage is about ~1.5 GB
  • GPU usage is reported between 85% - 99%
  • I haven't noticed any visual glitches or rendering errors so far.
  • Mesa has disappeared from the hot spots in perf top; only amdgpu_mm_rreg remains.
  • If I run the game at 1080p with this patch, I get ~110 fps on the same scene with the same settings. However, if I set everything to Ultra, it's hardly 60 fps.

This is the patch:

From 7025c93aca84713c25b3a73ff9e2db91493f217c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Timur=20Krist=C3=B3f?= <venemo@msn.com>
Date: Fri, 18 Jan 2019 14:38:49 +0100
Subject: [PATCH] Change si_create_context to use STREAM for const_uploader.

---
 src/gallium/drivers/radeonsi/si_pipe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 6b36893698..ce77a4fd4c 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -409,7 +409,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
 		goto fail;
 
 	sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
-						   0, PIPE_USAGE_DEFAULT,
+						   0, PIPE_USAGE_STREAM,
 						   SI_RESOURCE_FLAG_32BIT |
 						   (sscreen->cpdma_prefetch_writes_memory ?
 							    0 : SI_RESOURCE_FLAG_READ_ONLY));
-- 
2.20.1

perf top result:

  12,26%  [amdgpu]                  [k] amdgpu_mm_rreg
   1,98%  dsound.dll.so             [.] 0x000000000002997a
   0,92%  d3dadapter9.so.1.0.0      [.] amdgpu_cs_add_buffer
   0,81%  dsound.dll.so             [.] 0x00000000000298e1
   0,78%  libc-2.28.so              [.] __memmove_avx_unaligned_erms
   0,65%  [kernel]                  [k] update_blocked_averages
   0,64%  anonmap.8QZ4Mi (deleted)  [.] 0x0000000001c16777
   0,63%  dsound.dll.so             [.] 0x0000000000029971
   0,60%  dsound.dll.so             [.] 0x00000000000298c5
   0,60%  d3dadapter9.so.1.0.0      [.] amdgpu_lookup_buffer
   0,57%  d3dadapter9.so.1.0.0      [.] nine_context_set_texture
   0,54%  [kernel]                  [k] psi_task_change
   0,50%  d3dadapter9.so.1.0.0      [.] NineDevice9_SetTexture

@Venemo
Copy link
Author

Venemo commented Jan 18, 2019

Oops, looks like I forgot to revert the other patch before applying this one.

Will do another test with just the si_pipe patch.

@iiv3
Copy link

iiv3 commented Jan 18, 2019

Cannot reproduce here either:

  • Xeon E3-1231 v3 @ 3.40GHz
  • R9 380X
  • Linux 4.19.13
  • mesa master from 2 weeks ago
  • wine 4.0-rc6
  • SC2 set to 1080p with ultra gfx settings

@Venemo's setup involves Thunderbolt to run the card outside the laptop, which limits it to PCIe x4 speeds.

To recreate his setup, you should at least move your video card to a PCIe x4 slot. Most motherboards have only 1-2 PCIe x16 slots; the rest are x4, even though they use the full x16 connector.

You may try that if it won't void your PC's warranty.

@Venemo
Copy link
Author

Venemo commented Jan 18, 2019

Rebuilt mesa with the correct patch and updated my last comment.

@dhewg
Copy link

dhewg commented Jan 19, 2019

@Venemo: No idea about rpm spec stuff, but nothing stands out, I guess. I just tried against the Mesa packages from Debian, and it looks like the fps is roughly the same in SC2.

@Venemo
Copy link
Author

Venemo commented Jan 19, 2019

While the si_pipe patch does help a lot, it's not perfect. Even though Gallium says that the GPU utilization is 99%, radeontop disagrees, and says the utilization is ~50% (up from the previous ~30%).

@iiv3
Copy link

iiv3 commented Jan 19, 2019

@dhewg , to reproduce the issue you need to move the video card to PCIE x4 slot.

As I said in my previous comment to you, a lot of motherboards have full-sized slots that are x4.

@dhewg
Copy link

dhewg commented Jan 19, 2019

@iiv3 yeah, I get that it's a bottleneck that I don't run into. Just confirming that it's not a general issue

@iiv3
Copy link

iiv3 commented Jan 19, 2019

I ported the change to r600_pipe_common.c::706 and ran a benchmark.
Without the change, L4D2 got 83.55 and 83.72 fps.
After the change, it got 84.27 and 83.92 fps.

I'd say it is a consistent improvement, even though ~0.5% is around the margin of error.

@axeldavy
Copy link

[screenshot]
I get better performance with the constant compacting code, even though it's still not 100% usage.

@dhewg
Copy link

dhewg commented Jan 24, 2019

With vsync disabled in-game, and with thread_submit=true as well as tearfree_discard=true, my GPU load is maxed out. That is with upstream Mesa, so none of those new patches.

@Venemo
Copy link
Author

Venemo commented Jan 30, 2019

@marekolsak

I've run a series of tests for you. I'm posting the patches here too. All of these were applied on top of the ixit/master branch. In every case the same SC2 replay was used and I looked at the FPS counter at the same moment during the replay.

It's possible that const_uploader needs to use a bigger size.

The maximum size of the buffer that comes from prepare_vs_constants_userbuf is 4096 bytes (the maximum number of shader constants in Nine is 256, and each of them has a maximum size of 4 * sizeof(float)). To confirm, I added an assert(cb.buffer_size <= 4096); below the u_upload_alloc call, and it is never bigger than that.

I'd like to identify the reason why stream_uploader is faster.

@axeldavy has a better understanding than I do, so maybe he can chime in with a better explanation, but I think it isn't really faster; this is just a symptom of an inefficiency somewhere in the code. For example, if there are a lot of unused shader constants, then the const_uploader would needlessly upload all of them, but the stream_uploader would not (as far as I understand).

In fact, changing everything to use STREAM is slower than changing just the VS constants to use STREAM.

Increasing the default size of the const uploader

The following patch gives ~50 fps, so no noticable improvement.

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..8c610152942 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -434,7 +434,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
        if (!sctx->b.stream_uploader)
                goto fail;
 
-       sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
+       sctx->b.const_uploader = u_upload_create(&sctx->b, 1024 * 1024,
                                                   0, PIPE_USAGE_DEFAULT,
                                                   SI_RESOURCE_FLAG_32BIT |
                                                   (sscreen->cpdma_prefetch_writes_memory ?

Changing the const uploader to STREAM

Gives ~75-80 fps.

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..939bf31e46f 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -435,7 +435,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
                goto fail;
 
        sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
-                                                  0, PIPE_USAGE_DEFAULT,
+                                                  0, PIPE_USAGE_STREAM,
                                                   SI_RESOURCE_FLAG_32BIT |
                                                   (sscreen->cpdma_prefetch_writes_memory ?
                                                            0 : SI_RESOURCE_FLAG_READ_ONLY));

Changing the const uploader to STREAM while also increasing its default size

Gives ~75-80 fps, so the same as before.

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..1ede2de6580 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -434,8 +434,8 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
        if (!sctx->b.stream_uploader)
                goto fail;
 
-       sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
-                                                  0, PIPE_USAGE_DEFAULT,
+       sctx->b.const_uploader = u_upload_create(&sctx->b, 1024 * 1024,
+                                                  0, PIPE_USAGE_STREAM,
                                                   SI_RESOURCE_FLAG_32BIT |
                                                   (sscreen->cpdma_prefetch_writes_memory ?
                                                            0 : SI_RESOURCE_FLAG_READ_ONLY));

Hack that makes the VS constants use the stream uploader

This one yields ~90 fps.

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..ac001ddd7b7 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -430,7 +430,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
 
        sctx->b.stream_uploader = u_upload_create(&sctx->b, 1024 * 1024,
                                                    0, PIPE_USAGE_STREAM,
-                                                   SI_RESOURCE_FLAG_READ_ONLY);
+                                                   SI_RESOURCE_FLAG_32BIT | SI_RESOURCE_FLAG_READ_ONLY);
        if (!sctx->b.stream_uploader)
                goto fail;
 
diff --git a/src/gallium/state_trackers/nine/nine_state.c b/src/gallium/state_trackers/nine/nine_state.c
index 40c6ed10446..34d7720f4cb 100644
--- a/src/gallium/state_trackers/nine/nine_state.c
+++ b/src/gallium/state_trackers/nine/nine_state.c
@@ -485,7 +485,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
      * to have efficient WC. Thus for this case we really want
      * that intermediate buffer. */
 
-    u_upload_alloc(context->pipe->const_uploader,
+    u_upload_alloc(context->pipe->stream_uploader,
                   0,
                   cb.buffer_size,
                   256, /* Be conservative about alignment */
@@ -509,7 +509,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
         }
     }
 
-    u_upload_unmap(context->pipe->const_uploader);
+    u_upload_unmap(context->pipe->stream_uploader);
     cb.user_buffer = NULL;
 
     /* Free previous resource */

Hacking VS and PS constants to use the stream uploader

This one yields ~80-85 fps, so worse than using just the VS constants with the stream uploader.

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..9056268cd53 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -430,11 +430,11 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
 
        sctx->b.stream_uploader = u_upload_create(&sctx->b, 1024 * 1024,
                                                    0, PIPE_USAGE_STREAM,
-                                                   SI_RESOURCE_FLAG_READ_ONLY);
+                                                   SI_RESOURCE_FLAG_32BIT | SI_RESOURCE_FLAG_READ_ONLY);
        if (!sctx->b.stream_uploader)
                goto fail;
 
-       sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
+       sctx->b.const_uploader = u_upload_create(&sctx->b, 4 * 1024,
                                                   0, PIPE_USAGE_DEFAULT,
                                                   SI_RESOURCE_FLAG_32BIT |
                                                   (sscreen->cpdma_prefetch_writes_memory ?
diff --git a/src/gallium/state_trackers/nine/nine_state.c b/src/gallium/state_trackers/nine/nine_state.c
index 40c6ed10446..1bebdf440a5 100644
--- a/src/gallium/state_trackers/nine/nine_state.c
+++ b/src/gallium/state_trackers/nine/nine_state.c
@@ -485,7 +485,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
      * to have efficient WC. Thus for this case we really want
      * that intermediate buffer. */
 
-    u_upload_alloc(context->pipe->const_uploader,
+    u_upload_alloc(context->pipe->stream_uploader,
                   0,
                   cb.buffer_size,
                   256, /* Be conservative about alignment */
@@ -494,6 +494,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
                   (void**)&upload_ptr);
 
     assert(cb.buffer && upload_ptr);
+    assert(cb.buffer_size <= 4096);
 
     if (!context->cso_shader.vs_const_ranges) {
         memcpy(upload_ptr, cb.user_buffer, cb.buffer_size);
@@ -509,7 +510,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
         }
     }
 
-    u_upload_unmap(context->pipe->const_uploader);
+    u_upload_unmap(context->pipe->stream_uploader);
     cb.user_buffer = NULL;
 
     /* Free previous resource */
@@ -573,7 +574,7 @@ prepare_ps_constants_userbuf(struct NineDevice9 *device)
     if (!cb.buffer_size)
         return;
 
-    u_upload_alloc(context->pipe->const_uploader,
+    u_upload_alloc(context->pipe->stream_uploader,
                   0,
                   cb.buffer_size,
                   256, /* Be conservative about alignment */
@@ -597,7 +598,7 @@ prepare_ps_constants_userbuf(struct NineDevice9 *device)
         }
     }
 
-    u_upload_unmap(context->pipe->const_uploader);
+    u_upload_unmap(context->pipe->stream_uploader);
     cb.user_buffer = NULL;
 
     /* Free previous resource */

@axeldavy
Copy link

Just a few notes:

  • If a shader uses indirect addressing, loops and/or booleans, you can have slightly more than 256 constants. I guess that doesn't happen for this game.
  • With this git tree, there are patches to compress the constant ranges, so Nine only uploads what is going to be used by the shader, except in the case of constant indirect addressing (which can only happen with VS), where we have to upload all 256 constant slots. It's likely that not all of them end up being used, which may be why STREAM performs better (the transfer over the PCIe bus would then only cover the used constants). That said, it doesn't fully explain why the stream_uploader would be faster.

@Venemo
Copy link
Author

Venemo commented Jan 30, 2019

@axeldavy Is it possible to somehow check if the game uses indirect addressing? Just to confirm or rule out that possibility.

@axeldavy
Copy link

The game uses indirect addressing.
I don't know the ratio of 'constants sent because the shader uses indirect addressing' to 'all constants sent', but given that a shader usually needs fewer than 25 float[4] slots, even if only a few shaders use indirect addressing, the ratio can be high.
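To put rough numbers on that, here is a back-of-the-envelope sketch using only the figures from this comment (25 typical slots vs. the full 256-slot range; illustrative, not measured data):

```shell
# Back-of-the-envelope upload sizes for one constant buffer update.
slot=16                      # sizeof(float[4]): one constant slot in bytes
typical=$(( 25 * slot ))     # what a typical shader actually uses
indirect=$(( 256 * slot ))   # full range uploaded when indirect addressing forces it
echo "typical: ${typical} B, indirect: ${indirect} B, ratio: $(( indirect / typical ))x"
```

So a draw whose shader uses indirect addressing can upload roughly 10x more constant data than a typical draw, which is consistent with the ratio being high even if only a few shaders do it.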

@Venemo
Copy link
Author

Venemo commented Jan 30, 2019

Can we somehow eliminate the indirect addressing?

@marekolsak
Copy link

Thanks. I think the problem with DEFAULT is that more buffers are competing for the visible VRAM window, which makes the memory manager move buffers more often.

If you use STREAM too much, you might decrease GPU performance, because PCIe is slower.

@axeldavy
Copy link

The same game performs well on PCIe x16 but not on Thunderbolt. It seems surprising to me that the culprit would be the memory manager moving buffers within VRAM, since otherwise PCIe x16 would be affected as well. Unless the buffers get moved to GTT? That would explain a lot.

@marekolsak
Copy link

Can you test this branch?
https://cgit.freedesktop.org/~mareko/mesa/log/?h=const-uploads-sdma

@Venemo
Copy link
Author

Venemo commented Jan 31, 2019

@marekolsak Wow, thank you. I merged your branch with the ixit master, compiled a release build and ran the game with SDMA=true. Now in the same scene I'm getting ~85-90 fps. GPU usage is 98% according to GALLIUM_HUD and 95% according to radeontop.

perf top result:

   9,56%  [amdgpu]                  [k] amdgpu_mm_rreg
   1,12%  libc-2.28.so              [.] __memmove_avx_unaligned_erms
   0,99%  d3dadapter9.so.1.0.0      [.] nine_update_state
   0,86%  d3dadapter9.so.1.0.0      [.] amdgpu_lookup_or_add_slab_buffer
   0,84%  d3dadapter9.so.1.0.0      [.] amdgpu_cs_add_buffer
   0,68%  [kernel]                  [k] update_blocked_averages
   0,68%  d3dadapter9.so.1.0.0      [.] amdgpu_add_fence_dependencies_bo_list
   0,61%  libc-2.28.so              [.] __memcmp_avx2_movbe
   0,45%  d3dadapter9.so.1.0.0      [.] nine_context_set_texture
   0,40%  [kernel]                  [k] psi_task_change
   0,37%  d3dadapter9.so.1.0.0      [.] si_emit_spi_map
   0,36%  d3dadapter9.so.1.0.0      [.] NineDevice9_SetTexture

Some more experiments that I tried:

  • Increasing or decreasing the default size of the const_uploader did not affect the FPS in a substantial way.
  • Changing the VS constants to use the stream uploader no longer seems to yield better performance.
  • Combining your branch with the suggestion from this comment gives a consistent improvement of ~5-10 FPS.
  • Previously, the frame rate was not that different between 1080p and 4K. Now I actually get more FPS on 1080p. Note that other people got ~130 fps on 1080p when every setting was set to the highest. If I do that I'm getting a modest ~70 fps. So there is still a bottleneck somewhere.

@marekolsak
Copy link

Thanks for adding SDMA=true; I forgot to tell you about it.

I've updated the branch with a performance fix. It should now reach 100% GPU load:
https://cgit.freedesktop.org/~mareko/mesa/log/?h=const-uploads-sdma

@iiv3
Copy link

iiv3 commented Feb 1, 2019

@marekolsak, would it be possible to implement the same changes for the r600 driver?

@Venemo
Copy link
Author

Venemo commented Feb 1, 2019

@marekolsak I've run another series of tests for you.

Branch                     FPS      GTT    VRAM    GPU use  Hot spot
ixit/master                55       60 MB  1.7 GB  45-50%   25% nine_update_state, 20% amdgpu_mm_rreg
mareko/const-uploads-sdma  105      70 MB  1.7 GB  100%     10% amdgpu_mm_rreg
merged                     105-110  70 MB  1.7 GB  100%     10% amdgpu_mm_rreg
merged + this patch        110      45 MB  1.7 GB  100%     10% amdgpu_mm_rreg

Conclusions:

  • Using the SDMA for const uploads definitely has a huge impact on this system. Would be interesting if somebody could test if it has a similar impact on a non-Thunderbolt setup as well.
  • All other patches have a small, but measurable impact.
  • amdgpu_mm_rreg is now the hot spot according to perf top. It would be interesting to find out why.

@marekolsak
Copy link

@marekolsak, would it be possible to implement the same changes for the r600 driver?

It would be possible, but the performance improvement wouldn't be as high, because the DMA engine is not asynchronous on radeon. For r600, I recommend changing const_uploader from DEFAULT to STREAM.

@Venemo My series decreases performance for trivial benchmarks by 20%, so I need some amdgpu kernel changes to get the performance back.

@axeldavy
Copy link

axeldavy commented Feb 1, 2019

Surprising. Any idea what bad kernel behaviour is causing the performance drop ?

@marekolsak
Copy link

Surprising. Any idea what bad kernel behaviour is causing the performance drop ?

We submit 2 CS ioctls. We need to merge them into 1.

@Venemo
Copy link
Author

Venemo commented Feb 1, 2019

@marekolsak Can you shed some light on what amdgpu_mm_rreg is and why it is the hot spot? As far as I understand, it has to do with memory-mapped registers, but I couldn't quite figure out why it is eating 10% of CPU time in the game.

@marekolsak
Copy link

amdgpu_mm_rreg is used for getting the GPU load.

@iiv3
Copy link

iiv3 commented Feb 4, 2019

@marekolsak, would it be possible to implement the same changes for the r600 driver?

It would be possible, but the performance improvement wouldn't be as high, because the DMA engine is not asynchronous on radeon. For r600, I recommend changing const_uploader from DEFAULT to STREAM.

Is the limitation in hardware or in the radeon kernel driver?

@marekolsak
Copy link

Is the limitation in hardware or in the radeon kernel driver?

It's a limitation of the radeon kernel driver.

@montvid
Copy link

montvid commented Feb 10, 2019

Can someone update this thread when the patch lands upstream? I have an r3g v1.1 PCIe 3.0 x16 to M.2 x4 adapter + RX 580, so I could test, but I don't know how to build Mesa or apply these out-of-tree patches.

@Venemo
Copy link
Author

Venemo commented Feb 11, 2019

@montvid Sure thing! You'll hear about it when it gets upstream. I'm curious though, how does it perform on your setup?

@montvid
Copy link

montvid commented Feb 12, 2019

Native Direct3D 9 v0.3.0.184-devel is active.

It seems SC2 is free to play, so I got it from battle.net. Interestingly, during the first campaign the CPU (i3-7100U) and GPU (RX 580 8GB) usage is always ~40%, fps is around ~60 (a bit more with esync enabled), GTT is ~100 MB, VRAM ~1200 MB (1900 MB with all settings on ultra/extreme), and buffer wait time is at most 20 µs (Assassin's Creed 1 can get up to 20 ms! on ultra settings, so it runs worse). I am using tkg 4.1 Wine (Lutris) with a PCIe 3.0 x4 adapter, so it is not a Thunderbolt issue.

@Venemo I was hoping for better game performance from my eGPU setup, but Thunderbolt is even worse according to egpu.io. At least I have 8 GB of VRAM to store all the ultra textures. :P

@Venemo
Copy link
Author

Venemo commented Feb 19, 2019

@montvid Is that with or without the SDMA patch? If without, then it is roughly in line with what I experience here.

@montvid
Copy link

montvid commented Feb 19, 2019

Without the patch.

@dhewg
Copy link

dhewg commented Feb 21, 2019

Looks like the sdma patches landed on mesa master. I guess this can be closed then?

@Venemo
Copy link
Author

Venemo commented Feb 21, 2019

Even though GPU utilization is near 100% now the game is still slower on my setup than yours. I'd prefer to keep this issue open until I manage to do a proper comparison between nine and windows on this hardware. If there is no significant difference then let's close it, otherwise we should try to find the next bottleneck.

@Venemo
Copy link
Author

Venemo commented May 31, 2019

Here is an interesting tidbit. In kernel 5.1 amdgpu adds a way to examine the PCI Express bandwidth usage: /sys/class/drm/card1/device/pcie_bw (you can substitute card1 with your card). This gives you the number of PCIe packets received and sent by the GPU as well as the maximum size of a PCIe packet.

So I ran watch -n 1 "cat /sys/class/drm/card1/device/pcie_bw" and looked at the output. When I run SC2, I get the following numbers (roughly):

Version      Received by GPU      Sent by GPU          Packet size  Estimated max bandwidth use
Mesa 19.0    around 2500000       around 50000         128          2.5 Gbit/sec
Mesa master  between 10000-25000  between 10000-50000  128          75 Mbit/sec

EDIT: fixed mistake in calculating the Gbit/sec.

There are two interesting conclusions here:

  • In mesa master, the bandwidth use went down by a whopping two orders of magnitude. This is a bit too good to be true.
  • Even in mesa 19.0, the bandwidth use was well within the TB3 spec, so it should NOT have caused a problem. I suspect that there is an issue somewhere with TB3 bandwidth on Linux.
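For reference, the estimate in the table can be reproduced with simple arithmetic. A sketch using the Mesa 19.0 sample values (as I understand it, the pcie_bw counters are sampled over roughly one second, so packets times max payload approximates bytes per second):

```shell
# Rough bandwidth estimate from the pcie_bw counters.
recv=2500000   # packets received by the GPU (Mesa 19.0 row above)
mps=128        # max payload size per packet, in bytes
mbit=$(( recv * mps * 8 / 1000000 ))   # packets/s * bytes * 8 bits -> Mbit/s
echo "received: ${mbit} Mbit/s"        # ~2.5 Gbit/s, matching the table
```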

@marekolsak What do you think about my numbers, do they make sense? Am I interpreting them correctly?

@marekolsak
Copy link

Yes, the numbers look correct.

@marekolsak
Copy link

It's not just the bandwidth that matters. It's also the latency that PCIe incurs.

@Venemo
Copy link
Author

Venemo commented May 31, 2019

@marekolsak True. But 2.5 Gbit / sec is still less than 10% of what it should be capable of. There is definitely a problem in there somewhere.
