Netbox query performance degradation #22383
Replies: 4 comments 11 replies
-
Thanks for the detailed report and benchmark data, and for tracking down the earlier discussion threads! One thing to flag: the issue was reported against v4.5.8, but the current stable release is v4.6.2 (a full minor version ahead). It would be helpful to know whether the performance degradation under concurrent load is still present on v4.6.2, since a number of performance-related changes may have landed in the interim. If you are able to reproduce the issue on the latest release, please update the issue with that confirmation; it will help maintainers prioritize.
I am an automated triage assistant. A human maintainer will follow up.
-
There is already a discussion thread here: #22023. I simply noticed #21256 and understood (perhaps wrongly) that you want to track performance problems as issues of this type; that's why I raised #22382 as such.
-
Just to confirm: after upgrading to 4.6.2, the results are still quite similar. Attaching the redacted test script.
-
Hi @weakcamel, I had a look at your test script.
Your "failures" are almost certainly client timeouts, not server errors. The script's
So the real question is why requests queue up, and your numbers answer that too. Going from 10 to 100 threads, throughput stays basically flat (roughly 18 to 30 rps) while latency grows almost linearly with thread count. If the database or the application code were the bottleneck you'd expect throughput to collapse. Flat rps plus linear queueing latency is the textbook signature of a fixed-size worker pool that is fully saturated, with extra requests waiting in line. You can even estimate the pool size from your own data: throughput times latency, ~20 rps x ~0.5s = about 10 concurrent workers, which matches exactly where your problems start.
Things I'd check, in order:
1. How many WSGI processes is the pod actually running?
The netbox-docker image serves NetBox through nginx-unit, and the number of application processes is a fixed config value; it does not scale with your 3 CPUs. Exec into the pod and count the unit application processes (
If you still have manifests from the old deployment, compare this value between the two. Chart and image upgrades can silently change process defaults, and that alone would fully explain "this never happened before on the same hardware". Given your numbers I'd put my money here rather than on a code regression.
2. CPU throttling, not CPU usage
"Monitoring never shows even half usage" is averaged usage, which hides CFS throttling. With a CPU limit set, short bursts get throttled even when the average looks low, and that shows up as exactly this kind of latency tail. Check
3. Confirm the database is bored
While the 50 thread test runs, watch
The fix is then straightforward: raise the unit process count per pod (with CPU to match) and/or add replicas, so total processes across the deployment comfortably exceed your expected parallel client count.
As a rough rule, to survive 50 parallel Ansible lookups at ~0.5s each you want 25 to 30 processes total. That also matches your Ansible experience: lookups "started failing" once forks exceeded the worker pool, because the netbox.netbox collection has its own request timeouts. And once confirmed, that gives you the "specific proposed change" the maintainers asked for: a chart/docs change for process count defaults, which is a much easier ask than a general performance report.
Would be curious what the unit config comparison between your old and new deployment shows.
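For anyone who wants to redo the pool-size arithmetic with their own benchmark numbers, here is a minimal Python sketch of the estimate used above (throughput times latency, i.e. Little's law). The function names are made up for illustration, and the inputs are the rough figures quoted in this thread (~20 rps, ~0.5s), not new measurements. It also assumes a closed loop where every client thread immediately sends its next request, which may not match the real test script.

```python
def estimate_busy_workers(throughput_rps: float, latency_s: float) -> float:
    """Little's law: average requests in flight = throughput x latency."""
    if throughput_rps < 0 or latency_s < 0:
        raise ValueError("throughput and latency must be non-negative")
    return throughput_rps * latency_s


def saturated_throughput(workers: int, service_time_s: float) -> float:
    """Upper bound on requests/second once every worker is busy."""
    if workers <= 0 or service_time_s <= 0:
        raise ValueError("workers and service time must be positive")
    return workers / service_time_s


def closed_loop_latency(clients: int, throughput_rps: float) -> float:
    """Per-request latency when `clients` threads share a fixed throughput.

    With throughput pinned at the pool's ceiling, latency grows linearly
    with the number of client threads (again Little's law, rearranged).
    """
    if clients <= 0 or throughput_rps <= 0:
        raise ValueError("clients and throughput must be positive")
    return clients / throughput_rps


if __name__ == "__main__":
    # Rough figures from the comment above, used purely as an illustration.
    print("estimated pool size:", estimate_busy_workers(20, 0.5))
    print("throughput ceiling for 10 workers:", saturated_throughput(10, 0.5))
    for threads in (10, 20, 50, 100):
        print(threads, "threads ->", closed_loop_latency(threads, 20), "s per request")
```

If measured latency at higher thread counts tracks `closed_loop_latency` reasonably well while rps stays flat, that supports the saturated worker pool reading; if throughput actually drops as threads increase, something else is going on and this model does not apply.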
-
NetBox Version
v4.5.8
Python Version
3.12
Area(s) of Concern
Observations
Note: originally spotted on v4.5.4, still persists on 4.5.8. This issue was originally reported in #21352 (reply in thread), then moved to discussion #22023 and confirmed by another user (@Markethh).
We have indeed upgraded recently to NetBox 4.5.4 (unfortunately I'm not sure what the previous version was, likely an early 4.x). Our deployment is running on Kubernetes (Azure cluster). Unfortunately, we're noticing a performance degradation rather than an improvement. Since the upgrade, our lookups (from Ansible, using netbox.netbox collection 3.22.0) have started failing when the number of parallel queries reaches ~10; in the past this just didn't happen at all, no matter the parallelism.
Details of the deployment:
Observations from running a simple AI-generated script which just runs some queries in parallel:
10 parallel queries for a specific IP address in IPAM (Edit: device) in NetBox are doing fine. 20 of them have a 95% success rate, and it gets worse and worse as the number of parallel queries grows.
Proposed Changes
Not sure, sorry, I'm not familiar with the NetBox code base :-(
#21352 didn't gain much traction.