Netbox query performance degradation #22383

weakcamel · 2026-06-04T10:55:21Z

weakcamel
Jun 4, 2026

NetBox Version

v4.5.8

Python Version

3.12

Area(s) of Concern

Observations

Note: originally spotted on v4.5.4, still persists on 4.5.8. This issue was originally reported in #21352 (reply in thread), then moved to discussion #22023 and confirmed by another user (@Markethh).

We have indeed upgraded recently to netbox 4.5.4 (unfortunately I'm not sure what was the previous version, likely an early 4.x), our deployment is running on Kubernetes (Azure cluster). Unfortunately, we're noticing a performance degradation rather than improvement. Since the upgrade, our lookups (from Ansible, using netbox.netbox collection 3.22.0) have started failing when the number of parallel queries reaches ~10; in the past this just didn't happen at all, no matter the parallelism.

Details of the deployment:

netbox pod: 3 CPU, 6 GB RAM (monitoring never shows usage reaching even half of that)
netbox-worker pod: same as netbox

Observations from running a simple AI-generated script which just runs some queries in parallel:

$ ./netbox_perf.py --api-key "$API_KEY" --threads 10 --duration 10
Running test against https://netbox.example.com
Target hostname: gpu002.mgmt.example.com
Duration: 10s, Threads: 10

=== NetBox Performance Summary ===
req=201 ok=201 (100.0%) rps=19.3
latency[s]: avg=0.5070 min=0.2845 max=1.4392 (p50=0.4366 p95=0.9352 p99=1.2315)

$ ./netbox_perf.py --api-key "$API_KEY" --threads 20 --duration 10
Running test against https://netbox.example.com
Target hostname: gpu002.mgmt.example.com
Duration: 10s, Threads: 20

=== NetBox Performance Summary ===
req=250 ok=239 (95.6%) rps=18.5
latency[s]: avg=0.8636 min=0.2835 max=5.0042 (p50=0.5764 p95=1.9896 p99=5.0039)

$ ./netbox_perf.py --api-key "$API_KEY" --threads 50 --duration 10
Running test against https://netbox.example.com
Target hostname: gpu002.mgmt.example.com
Duration: 10s, Threads: 50

=== NetBox Performance Summary ===
req=301 ok=234 (77.7%) rps=21.9
latency[s]: avg=1.7822 min=0.2799 max=5.1338 (p50=0.8995 p95=5.1039 p99=5.1293)

$ ./netbox_perf.py --api-key "$API_KEY" --threads 100 --duration 10
Running test against https://netbox.example.com
Target hostname: gpu002.mgmt.example.com
Duration: 10s, Threads: 100

=== NetBox Performance Summary ===
req=429 ok=259 (60.4%) rps=29.8
latency[s]: avg=2.5274 min=0.2826 max=5.2704 (p50=1.1716 p95=5.2287 p99=5.2611)

10 parallel queries for a specific IP address in IPAM (Edit: device) in Netbox are doing fine. With 20 of them the success rate drops to about 95%, and it gets worse and worse as the number of parallel queries grows.

Proposed Changes

Not sure, sorry but I'm not familiar with Netbox code base :-(

#21352 didn't gain much traction.

2026-06-04T10:56:30Z

github-actions[bot]
Bot Jun 4, 2026

Thanks for the detailed report and benchmark data, and for tracking down the earlier discussion threads!

One thing to flag: the issue was reported against v4.5.8, but the current stable release is v4.6.2 (a full minor version ahead). It would be helpful to know whether the performance degradation under concurrent load is still present on v4.6.2, since a number of performance-related changes may have landed in the interim. If you are able to reproduce the issue on the latest release, please update the issue with that confirmation — it will help maintainers prioritize.

I am an automated triage assistant. A human maintainer will follow up.

0 replies

weakcamel · 2026-06-04T12:31:28Z

weakcamel
Jun 4, 2026
Author

Hi @jeremystretch

There is already a discussion thread here: #22023

I simply noticed #21256 and understood it (perhaps wrongly) that you want to track performance problems as issues of this type, that's why I raised #22382 as such.

4 replies

jeremystretch Jun 4, 2026
Maintainer

This issue was converted to a discussion because it lacks sufficient detail to be actionable. Please work with members of the community to determine your issue, and if appropriate, you can then submit a detailed performance issue proposing specific improvements to NetBox that may alleviate the identified issue.

weakcamel Jun 4, 2026
Author

This issue was converted to a discussion because it lacks sufficient detail to be actionable.

I see. I'm a bit confused to be honest. What you're describing matches the New Feature flow as outlined here: https://github.com/netbox-community/netbox/blob/main/CONTRIBUTING.md#bug-reporting-bugs

Performance degradation is, however, not a new feature but rather a bug: a regression in the software. Since my observations (and not just mine) show that the expected 4.5.4 performance improvements are in fact (at least in some respects) having the opposite effect, I thought you might want to know about that. Unfortunately, I'm unable to dedicate much more time to fixing the performance problems beyond helping to reproduce them.

jeremystretch Jun 4, 2026
Maintainer

Issues are not for troubleshooting performance problems, which is why the template specifically asks you to detail the changes you're proposing. If you can't do that, you'll need to work with others in the community to determine the root problem.

weakcamel Jun 4, 2026
Author

Personally I don't understand the requests for feedback on performance if later on they're being ignored (unless you come with a plan and design to fix them) but fair enough, noted.

weakcamel · 2026-06-04T15:20:15Z

weakcamel
Jun 4, 2026
Author

Just to confirm: after upgrade to 4.6.2 the results are still quite similar.

$ ./netbox_perf.py --api-key "$KEY" --threads 10 --duration 10
Running test against https://netbox.example.com
Target hostname: gpu002.mgmt.example.com
Duration: 10s, Threads: 10

=== NetBox Performance Summary ===
req=187 ok=187 (100.0%) rps=17.9
latency[s]: avg=0.5424 min=0.3014 max=1.2137 (p50=0.4868 p95=0.9075 p99=1.0991)

$ ./netbox_perf.py --api-key "$KEY" --threads 20 --duration 10
Running test against https://netbox.example.com
Target hostname: gpu002.mgmt.example.com
Duration: 10s, Threads: 20

=== NetBox Performance Summary ===
req=244 ok=236 (96.7%) rps=16.8
latency[s]: avg=0.9065 min=0.3870 max=5.0071 (p50=0.6696 p95=1.4856 p99=5.0066)

$ ./netbox_perf.py --api-key "$KEY" --threads 50 --duration 10
Running test against https://netbox.example.com
Target hostname: gpu002.mgmt.example.com
Duration: 10s, Threads: 50

=== NetBox Performance Summary ===
req=291 ok=233 (80.1%) rps=20.1
latency[s]: avg=1.9451 min=0.4035 max=6.5909 (p50=1.1597 p95=5.9592 p99=6.5039)

$ ./netbox_perf.py --api-key "$KEY" --threads 100 --duration 10
Running test against https://netbox.example.com
Target hostname: gpu002.mgmt.example.com
Duration: 10s, Threads: 100

=== NetBox Performance Summary ===
req=361 ok=210 (58.2%) rps=22.8
latency[s]: avg=3.5946 min=0.3982 max=8.3361 (p50=2.9968 p95=7.5023 p99=8.0412)

Attaching the redacted test script netbox_perf_redacted.py in case someone wants to try it for themselves.

0 replies

DanjalZockt · 2026-06-10T19:43:13Z

DanjalZockt
Jun 10, 2026

Hi @weakcamel,

I had a look at your netbox_perf_redacted.py and I think the numbers are telling a different story than "requests are failing".

Your "failures" are almost certainly client timeouts, not server errors. The script's worker() defaults to timeout=5, and none of your published runs pass --timeout. That's exactly why your first batch of results has max=5.0042 and p99=5.0039, the failures are pinned to the 5 second mark because that's where requests gives up and raises, and the script counts the exception as a failure. NetBox most likely never returned an error at all. You can confirm this in one run: add --verbose (the script already supports it) and you should see ReadTimeout exceptions rather than HTTP 5xx, or just rerun with --timeout 30 and watch the success rate go back to ~100% while latency keeps climbing.

So the real question is why requests queue up, and your numbers answer that too. Going from 10 to 100 threads, throughput stays basically flat (roughly 18 to 30 rps) while latency grows almost linearly with thread count. If the database or the application code were the bottleneck you'd expect throughput to collapse. Flat rps plus linear queueing latency is the textbook signature of a fixed-size worker pool that is fully saturated, with extra requests waiting in line.
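The "flat rps plus linear latency" signature can be reproduced with a very small idealized model. This is only a sketch under strong assumptions: a closed-loop client (each thread sends its next request as soon as the previous one returns), a fixed service time, and no timeouts. The pool size of 10 and the 0.5 s service time are illustrative values, not measurements.

```python
def closed_loop_model(
    clients: int, workers: int, service_time_s: float
) -> tuple[float, float]:
    """Return (throughput in rps, mean latency in s) for an idealized worker pool."""
    # Only min(clients, workers) requests can be served at the same time.
    busy = min(clients, workers)
    throughput = busy / service_time_s
    # In a closed loop every client always has one request in the system,
    # so mean latency = clients / throughput (Little's law).
    latency = clients / throughput
    return throughput, latency


for n in (10, 20, 50, 100):
    rps, lat = closed_loop_model(n, workers=10, service_time_s=0.5)
    print(f"threads={n:3d} rps={rps:.1f} latency={lat:.2f}s")
```

With these assumed values the model holds throughput at 20 rps while latency rises from 0.5 s at 10 threads to 5 s at 100 threads, which is the same shape as the published runs.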

You can even estimate the pool size from your own data: throughput times latency, ~20 rps x ~0.5s = about 10 concurrent workers. Which matches exactly where your problems start.
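The estimate above is an application of Little's law (mean requests in the system = throughput x mean latency). A short sketch using the figures from the first batch of results; the helper name `in_flight` is hypothetical.

```python
def in_flight(rps: float, avg_latency_s: float) -> float:
    """Little's law: mean requests in the system = throughput x mean latency."""
    return rps * avg_latency_s


# (rps, avg latency in s) per thread count, copied from the v4.5.8 runs above.
runs = {
    10: (19.3, 0.5070),
    20: (18.5, 0.8636),
    50: (21.9, 1.7822),
    100: (29.8, 2.5274),
}

for threads, (rps, avg) in runs.items():
    print(f"threads={threads:3d} in_flight={in_flight(rps, avg):.1f}")
```

One caveat worth keeping in mind: the product counts queued requests as well as those being served, so it grows with the client thread count. The 10-thread run on its own therefore gives an upper bound on the pool size rather than an exact figure; counting the processes in the pod is the more direct check.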

Things I'd check, in order:

1. How many WSGI processes is the pod actually running?

The netbox-docker image serves NetBox through nginx-unit, and the number of application processes is a fixed config value, it does not scale with your 3 CPUs. Exec into the pod and count the unit application processes (ps aux inside the container), or dump the live unit config and look at the processes setting. Each sync Django process handles exactly one request at a time, so 10 processes means 10 requests in flight, full stop.

If you still have manifests from the old deployment, compare this value between the two. Chart and image upgrades can silently change process defaults, and that alone would fully explain "this never happened before on the same hardware". Given your numbers I'd put my money here rather than on a code regression.

2. CPU throttling, not CPU usage

"Monitoring never shows even half usage" is averaged usage, which hides CFS throttling. With a CPU limit set, short bursts get throttled even when the average looks low, and that shows up as exactly this kind of latency tail. Check container_cpu_cfs_throttled_periods_total for the netbox pod during the test. If it climbs, raise or remove the CPU limit and retest.

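If the Prometheus metric is not at hand, the same counters can be read from the container's cgroup `cpu.stat` file. The sketch below assumes the file exposes `nr_periods` and `nr_throttled` (true for cgroup v1 and v2) and that it is sampled once before and once after the load test. The path in the usage note is an assumption and varies with the cgroup version and runtime.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(before: dict[str, int], after: dict[str, int]) -> float:
    """Fraction of CFS periods between two samples in which the cgroup was throttled."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    return throttled / periods if periods > 0 else 0.0
```

Usage would be roughly: read `/sys/fs/cgroup/cpu.stat` (assumed cgroup v2 path inside the container) before the test, read it again afterwards, and pass both parsed samples to `throttled_fraction`. A value well above zero would point at throttling even when average CPU usage looks low.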
3. Confirm the database is bored

While the 50 thread test runs, watch pg_stat_activity on the Postgres side. My bet is you'll see only ~10 active queries, each finishing fast, which confirms the bottleneck sits in front of the database, not in it.
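A minimal sketch of that check, assuming a DB-API 2.0 cursor (for example from psycopg, which uses the `%s` placeholder style) and a database named `netbox`; both are assumptions about the deployment. The helper name `sessions_by_state` is hypothetical.

```python
from collections import Counter

# pg_stat_activity has one row per server process; "state" is e.g. 'active' or 'idle'.
ACTIVITY_SQL = "SELECT state FROM pg_stat_activity WHERE datname = %s"


def sessions_by_state(cursor, dbname: str) -> Counter:
    """Count sessions on the given database, grouped by their current state."""
    cursor.execute(ACTIVITY_SQL, (dbname,))
    # state can be NULL for some processes, so map None to a readable label.
    return Counter(row[0] or "unknown" for row in cursor.fetchall())
```

Calling `sessions_by_state(cursor, "netbox")` a few times during the 50 thread run and watching the `active` count should show whether the database ever sees more than a handful of concurrent queries.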

The fix is then straightforward: raise the unit process count per pod (with CPU to match) and/or add replicas, so total processes across the deployment comfortably exceed your expected parallel client count. As a rough rule, to survive 50 parallel Ansible lookups at ~0.5s each you want 25 to 30 processes total. That also matches your Ansible experience: lookups "started failing" once forks exceeded the worker pool because the netbox.netbox collection has its own request timeouts.

And once confirmed, that gives you the "specific proposed change" the maintainers asked for: a chart/docs change for process count defaults, which is a much easier ask than a general performance report.

Would be curious what the unit config comparison between your old and new deployment shows.

7 replies

Markethh Jun 12, 2026

Thanks @DanjalZockt! I believe I tried with various different values of GRANIAN_WORKERS during my testing on 4.5 (I found various different "recommendations" online), all of which did not resolve the problem - I still observed timeouts on 10+ parallel requests.

From memory, I tried 32 workers, 11 workers (going off of a 'CPU * 2 + 1' recommendation), and also matched my CPU count with 5 workers. With each of these, I still observed the timeouts. It will be interesting if @weakcamel can either verify the same behaviour I saw, or see better results.

Markethh Jun 15, 2026

Update from my side: I set the env vars GRANIAN_WORKERS & GRANIAN_BACKPRESSURE to values of "CPU * 2 + 1" and I am seeing a significant improvement on Netbox 4.5.9. Running 10 parallel queries is now comparable in performance to the same queries running against Netbox version 4.4.10. I think last time I tried I was unknowingly still running on the default 4 workers.
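For reference, the "CPU * 2 + 1" rule mentioned above works out as in the sketch below. The helper name `granian_env` is hypothetical, and using the same value for both variables simply mirrors what is described in this comment; it is not an official recommendation.

```python
def granian_env(cpus: int) -> dict[str, str]:
    """Build the two env vars from the 'CPU * 2 + 1' rule of thumb."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    value = str(cpus * 2 + 1)
    return {"GRANIAN_WORKERS": value, "GRANIAN_BACKPRESSURE": value}


# Example: a pod limited to 3 CPUs, as in the original report.
print(granian_env(3))
```

The CPU count is passed in explicitly because `os.cpu_count()` inside a container typically reports the node's CPUs rather than the pod's CPU limit.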

There seem to be differing opinions on what these values should be for optimal performance/balance. As Netbox has only very recently moved to Granian, it would be good to have some clarification on the recommended values for both of these, and for any other new configuration options.

weakcamel Jun 15, 2026
Author

Many thanks for the test plan! :-)

I finally got some time to give it a spin. As you predicted, there were 4 worker processes before the change and 8 after.

Baseline (serial run) is almost identical. With increased parallelism, the success rate is significantly better at 50 threads and slightly worse at 100 threads.

Serial run (1 thread)

before

$ ./netbox_perf.py --threads 1 --api-key "$API_KEY" --url https://netbox.lab.example.com --threads 1
Duration: 30s, Threads: 1

=== NetBox Performance Summary ===
req=101 ok=101 (100.0%) rps=3.4
latency[s]: avg=0.2975 min=0.2669 max=0.5693 (p50=0.2814 p95=0.3793 p99=0.5373)

after

$ ./netbox_perf.py --threads 1 --api-key "$API_KEY"   --url https://netbox.lab.example.com --threads 1
Duration: 30s, Threads: 1

=== NetBox Performance Summary ===
req=103 ok=103 (100.0%) rps=3.4
latency[s]: avg=0.2917 min=0.2672 max=0.4833 (p50=0.2792 p95=0.3619 p99=0.4531)

10 threads

before

$ ./netbox_perf.py --threads 1 --api-key "$API_KEY" --url https://netbox.lab.example.com --threads 10
Duration: 30s, Threads: 10

=== NetBox Performance Summary ===
req=655 ok=655 (100.0%) rps=21.6
latency[s]: avg=0.4609 min=0.2792 max=1.0772 (p50=0.4122 p95=0.7683 p99=0.8676)

after

$ ./netbox_perf.py --threads 1 --api-key "$API_KEY" --url https://netbox.lab.example.com --threads 10
Duration: 30s, Threads: 10

=== NetBox Performance Summary ===
req=602 ok=599 (99.5%) rps=19.9
latency[s]: avg=0.5009 min=0.2858 max=5.0067 (p50=0.4219 p95=0.8180 p99=1.1086)

50 threads

before

$ ./netbox_perf.py --threads 1 --api-key "$API_KEY" --url https://netbox.lab.example.com --threads 50
Duration: 30s, Threads: 50

=== NetBox Performance Summary ===
req=883 ok=722 (81.8%) rps=25.3
latency[s]: avg=1.7682 min=0.4144 max=6.4183 (p50=1.0508 p95=5.1049 p99=6.1644)

after

$ ./netbox_perf.py --threads 1 --api-key "$API_KEY" --url https://netbox.lab.example.com --threads 50
Duration: 30s, Threads: 50

=== NetBox Performance Summary ===
req=588 ok=548 (93.2%) rps=18.3
latency[s]: avg=2.6396 min=0.6234 max=6.6728 (p50=2.4180 p95=5.0065 p99=6.4204)

100 threads

before

$ ./netbox_perf.py --threads 1 --api-key "$API_KEY" --url https://netbox.lab.speechmatics.io --threads 100
Duration: 30s, Threads: 100

=== NetBox Performance Summary ===
req=960 ok=562 (58.5%) rps=27.5
latency[s]: avg=3.2559 min=0.5185 max=8.3738 (p50=1.9276 p95=7.2271 p99=8.0267)

after

$ ./netbox_perf.py --threads 1 --api-key "$API_KEY" --url https://netbox.lab.speechmatics.io --threads 100
Duration: 30s, Threads: 100

=== NetBox Performance Summary ===
req=733 ok=414 (56.5%) rps=21.1
latency[s]: avg=4.3030 min=0.2565 max=8.4188 (p50=4.9886 p95=6.2657 p99=8.0672)

(Edited: removed redundant info, added 100 threads case)

weakcamel Jun 15, 2026
Author

Due to the surprising results for the 100 threads case, I re-ran it without and with the extra workers, and this time the success rate dropped even further.

before

$ ./netbox_perf.py --threads 1 --api-key "$API_KEY" --url https://netbox.lab.example.com --threads 100
Duration: 30s, Threads: 100

=== NetBox Performance Summary ===
req=960 ok=562 (58.5%) rps=27.5
latency[s]: avg=3.2559 min=0.5185 max=8.3738 (p50=1.9276 p95=7.2271 p99=8.0267)

after

$ ./netbox_perf.py --threads 1 --api-key "$API_KEY" --url https://netbox.lab.example.com --threads 100
Duration: 30s, Threads: 100

=== NetBox Performance Summary ===
req=682 ok=278 (40.8%) rps=19.5
latency[s]: avg=4.5770 min=1.0157 max=7.2689 (p50=5.0073 p95=5.8816 p99=7.1144)

weakcamel Jun 22, 2026
Author

In the interim, I've added the env vars to our production deployment and it seems to be performing much better indeed!
