GoldenConfigListView does not scale well with N GoldenConfigSettings objects #270

Closed
whitej6 opened this issue Jun 14, 2022 · 8 comments · Fixed by #277
Comments

@whitej6
Contributor

whitej6 commented Jun 14, 2022

Environment

  • Python version: 3.7.12
  • Nautobot version: 1.2.0 (should apply to all)
  • nautobot-golden-config version: 1.0.2

When navigating to /plugins/golden-config/golden/, the page load time can exceed the default 60-second NGINX proxy timeout, causing a 504 Gateway Timeout.

Steps to Reproduce

  1. Configure N GoldenConfigSettings with unique scopes
  2. Ensure a large number of devices fall within those scopes
  3. Attempt to load /plugins/golden-config/golden/

Expected Behavior

The table view loads within a reasonable amount of time.

Observed Behavior

A 504 Gateway Timeout.

Observed Issue

Two issues have been discovered.

  1. For each scope, the initial query adds an OR clause to the SQL, caused by joining querysets together in a for loop (querysets are lazy and only execute when they are evaluated). The cost of this query grows with every GoldenConfigSetting added.
  2. For each row in the rendered table, a nearly identical query is issued, which compounds the above issue.
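A minimal sketch of how the two issues compound (hypothetical values; plain Python standing in for the generated SQL, not the plugin's actual code):

```python
# Each GoldenConfigSetting contributes one scope condition; OR-joining
# querysets in a loop (qs = qs | Device.objects.filter(**scope)) emits
# one OR'd condition per setting in the generated WHERE clause.
scope_conditions = [
    "U1.slug = 'arista_eos'",
    "U1.slug = 'cisco_ios'",
]
where_clause = " OR ".join(scope_conditions)  # grows with every setting

# The second issue multiplies the first: one near-identical GoldenConfig
# lookup is issued for every row the table renders (a classic N+1).
rows_rendered = 100
total_queries = 1 + rows_rendered  # 1 list query + 1 query per table row
```

With 100 rows rendered, the page issues 101 queries, each of which must evaluate the ever-growing OR chain.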

Proposed Change

Move the view to be based directly on the GoldenConfig model rather than on Devices filtered via GoldenConfigSettings. This would also limit the table view to ONLY devices that a plugin job has been run against, excluding devices that MAY be in scope but have not yet had a job run against them.
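The behavioral trade-off in this proposal can be illustrated with a toy example (device names are hypothetical):

```python
# Devices matched by GoldenConfigSettings scopes vs. devices that actually
# have a GoldenConfig record because a plugin job ran against them.
devices_in_scope = {"rtr1", "rtr2", "rtr3", "sw1"}
devices_with_job_results = {"rtr1", "sw1"}

# Current view: one row per in-scope device, resolved through the costly
# per-scope OR query.
device_based_rows = sorted(devices_in_scope)

# Proposed view: one row per GoldenConfig record, a single cheap query,
# but in-scope devices that never had a job run no longer appear.
goldenconfig_based_rows = sorted(devices_with_job_results)
missing_from_new_view = sorted(devices_in_scope - devices_with_job_results)
```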

@whitej6
Contributor Author

whitej6 commented Jun 14, 2022

Query for the first observed issue:

SELECT "dcim_device"."id",
       "dcim_device"."created",
       "dcim_device"."last_updated",
       "dcim_device"."_custom_field_data",
       "dcim_device"."status_id",
       "dcim_device"."local_context_data",
       "dcim_device"."local_context_schema_id",
       "dcim_device"."local_context_data_owner_content_type_id",
       "dcim_device"."local_context_data_owner_object_id",
       "dcim_device"."device_type_id",
       "dcim_device"."device_role_id",
       "dcim_device"."tenant_id",
       "dcim_device"."platform_id",
       "dcim_device"."name",
       "dcim_device"."_name",
       "dcim_device"."serial",
       "dcim_device"."asset_tag",
       "dcim_device"."site_id",
       "dcim_device"."rack_id",
       "dcim_device"."position",
       "dcim_device"."face",
       "dcim_device"."primary_ip4_id",
       "dcim_device"."primary_ip6_id",
       "dcim_device"."cluster_id",
       "dcim_device"."virtual_chassis_id",
       "dcim_device"."vc_position",
       "dcim_device"."vc_priority",
       "dcim_device"."comments",
       "dcim_device"."secrets_group_id"
  FROM "dcim_device"
 WHERE "dcim_device"."id" IN (
        SELECT DISTINCT U0."id"
          FROM "dcim_device" U0
         INNER JOIN "dcim_platform" U1
            ON (U0."platform_id" = U1."id")
         WHERE (U1."slug" = 'arista_eos' OR U1."slug" = 'cisco_ios') -- This OR operator is costly
       )
 ORDER BY "dcim_device"."_name" ASC
 LIMIT 100

@whitej6
Contributor Author

whitej6 commented Jun 14, 2022

Query for the second observed issue:

SELECT "nautobot_golden_config_goldenconfig"."id",
       "nautobot_golden_config_goldenconfig"."created",
       "nautobot_golden_config_goldenconfig"."last_updated",
       "nautobot_golden_config_goldenconfig"."_custom_field_data",
       "nautobot_golden_config_goldenconfig"."device_id",
       "nautobot_golden_config_goldenconfig"."backup_config",
       "nautobot_golden_config_goldenconfig"."backup_last_attempt_date",
       "nautobot_golden_config_goldenconfig"."backup_last_success_date",
       "nautobot_golden_config_goldenconfig"."intended_config",
       "nautobot_golden_config_goldenconfig"."intended_last_attempt_date",
       "nautobot_golden_config_goldenconfig"."intended_last_success_date",
       "nautobot_golden_config_goldenconfig"."compliance_config",
       "nautobot_golden_config_goldenconfig"."compliance_last_attempt_date",
       "nautobot_golden_config_goldenconfig"."compliance_last_success_date"
  FROM "nautobot_golden_config_goldenconfig"
 INNER JOIN "dcim_device"
    ON ("nautobot_golden_config_goldenconfig"."device_id" = "dcim_device"."id")
 WHERE "nautobot_golden_config_goldenconfig"."device_id" = '2020d6bf-a59f-4276-a03c-9ed9903c109b'::uuid -- Only this line changes; the rest is repeated for each instance displayed.
 ORDER BY "dcim_device"."_name" ASC
 LIMIT 1

@itdependsnetworks
Contributor

The device-model view approach was originally used because users found it confusing when devices they had changed to be "in scope" were not reflected in the view. It seems that feature, combined with the multiple-repository feature, caused this. That said, we will have to investigate which options are least impactful, both for performance and for changing functionality.

@vladanmilosevic73

Would it be possible to refresh the repos in a staggered, controlled manner, and do a single on-demand device refresh when needed? I doubt users back up their devices every hour... but I may not have enough exposure to say.

@jathanism

> Query for the first observed issue:
> [SQL query quoted in full from the comment above]

Can you please run this and the other query from a dbshell prefixed with EXPLAIN ANALYZE? And then paste the SQL query and the output from EXPLAIN ANALYZE SELECT into https://explain.depesz.com and provide the output of those reports?

This seems like an indexing problem to me, so we should rule that out first.
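For readers who want to practice reading query plans locally, here is a self-contained stand-in using SQLite's EXPLAIN QUERY PLAN (the actual request above is for Postgres EXPLAIN ANALYZE from a Nautobot dbshell; the table and index names here are illustrative only):

```python
import sqlite3

# Minimal demonstration that a plan reveals whether a filter column is
# served by an index: a full table SCAN before, an index SEARCH after.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dcim_platform (id INTEGER PRIMARY KEY, slug TEXT)")

def plan(sql):
    """Return SQLite's query-plan detail strings as one joined string."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id FROM dcim_platform WHERE slug = 'cisco_ios'"
before = plan(query)  # no usable index: plan reports a full table SCAN

conn.execute("CREATE INDEX idx_platform_slug ON dcim_platform (slug)")
after = plan(query)   # plan switches to SEARCH ... USING ... INDEX
```

Postgres's EXPLAIN ANALYZE additionally executes the query and reports actual row counts and timings, which is what explain.depesz.com visualizes.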

@itdependsnetworks
Contributor

Thanks @jathanism!! I suspect there is a solution here, I just don't know all of the alternates. I have asked @mzbroch to take a look as well.

whitej6 added a commit to whitej6/nautobot-plugin-golden-config that referenced this issue Jun 27, 2022
@whitej6
Contributor Author

whitej6 commented Jun 27, 2022

                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=45.24..45.49 rows=100 width=853) (actual time=0.972..0.980 rows=100 loops=1)
   ->  Sort  (cost=45.24..45.52 rows=111 width=853) (actual time=0.971..0.975 rows=100 loops=1)
         Sort Key: dcim_device._name
         Sort Method: top-N heapsort  Memory: 74kB
         ->  Hash Join  (cost=23.55..41.47 rows=111 width=853) (actual time=0.302..0.477 rows=382 loops=1)
               Hash Cond: (dcim_device.id = u0.id)
               ->  Seq Scan on dcim_device  (cost=0.00..16.89 rows=389 width=853) (actual time=0.007..0.029 rows=389 loops=1)
               ->  Hash  (cost=22.16..22.16 rows=111 width=16) (actual time=0.288..0.288 rows=382 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 26kB
                     ->  HashAggregate  (cost=19.94..21.05 rows=111 width=16) (actual time=0.223..0.251 rows=382 loops=1)
                           Group Key: u0.id
                           Batches: 1  Memory Usage: 77kB
                           ->  Hash Join  (cost=1.13..19.66 rows=111 width=16) (actual time=0.018..0.147 rows=382 loops=1)
                                 Hash Cond: (u0.platform_id = u1.id)
                                 ->  Seq Scan on dcim_device u0  (cost=0.00..16.89 rows=389 width=32) (actual time=0.002..0.060 rows=389 loops=1)
                                 ->  Hash  (cost=1.10..1.10 rows=2 width=16) (actual time=0.009..0.010 rows=2 loops=1)
                                       Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                       ->  Seq Scan on dcim_platform u1  (cost=0.00..1.10 rows=2 width=16) (actual time=0.006..0.007 rows=2 loops=1)
                                             Filter: (((slug)::text = 'arista_eos'::text) OR ((slug)::text = 'cisco_ios'::text))
                                             Rows Removed by Filter: 5
 Planning Time: 1.060 ms
 Execution Time: 1.042 ms
(22 rows)

whitej6 added a commit to whitej6/nautobot-plugin-golden-config that referenced this issue Jun 28, 2022
whitej6 added a commit to whitej6/nautobot-plugin-golden-config that referenced this issue Jun 28, 2022
@jathanism

I think we arrived at a highly optimized implementation here: construct a single Q object by merging all of the scope fields together, then annotate the resulting queryset with all of the related objects that are needed, so everything is fetched in a single query instead of multiple per-row queries.
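A hedged sketch of the merge idea described above (the field names and the shape of a GoldenConfigSetting scope are assumptions for illustration, not the plugin's exact code):

```python
from collections import defaultdict

# Instead of OR-joining one queryset per GoldenConfigSetting, collect
# every setting's scope into one mapping of field -> set of values...
scopes = [
    {"platform__slug": "arista_eos"},
    {"platform__slug": "cisco_ios"},
]

merged = defaultdict(set)
for scope in scopes:
    for field, value in scope.items():
        merged[field].add(value)

# ...then build a single __in lookup per field, which renders as one SQL
# IN clause instead of N OR'd subqueries. In the real fix this feeds one
# Q object, and the queryset is annotated with the per-row GoldenConfig
# data so the table renders without issuing a query per row.
filters = {f"{field}__in": sorted(values) for field, values in merged.items()}
```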

itdependsnetworks pushed a commit that referenced this issue Jun 29, 2022
…eryset annotation. (#277)

* Optimize query generation using dictionary merges and queryset annotation. (#277)
jmpettit pushed a commit to jmpettit/nautobot-app-golden-config that referenced this issue Jan 30, 2024
…s and queryset annotation. (nautobot#277)

* Optimize query generation using dictionary merges and queryset annotation. (nautobot#277)
4 participants