GoldenConfigListView does not scale well with N GoldenConfigSettings objects #270

Closed
whitej6 opened this issue Jun 14, 2022 · 8 comments · Fixed by #277
Comments

@whitej6
Contributor

whitej6 commented Jun 14, 2022

Environment

  • Python version: 3.7.12
  • Nautobot version: 1.2.0 (should apply to all)
  • nautobot-golden-config version: 1.0.2

When navigating to /plugins/golden-config/golden/, the page load time can exceed the default 60-second NGINX proxy timeout, causing a 504 Gateway Timeout.

Steps to Reproduce

  1. Configure N GoldenConfigSettings with unique scopes
  2. Ensure a large number of devices fall within those scopes
  3. Attempt to load /plugins/golden-config/golden/

Expected Behavior

The table view loads within a reasonable amount of time.

Observed Behavior

A 504 Gateway Timeout.

Observed Issue

Two issues have been discovered.

  1. For each scope, the initial query adds an OR clause to the SQL, caused by joining querysets together in a for loop (querysets are lazy and only execute when they are evaluated). The cost of this query grows with every GoldenConfigSetting added.
  2. For each row in the rendered table, a nearly identical query is issued, which compounds the above issue.
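A minimal sketch of how the two issues compound (hypothetical values; plain Python standing in for the generated SQL, not the plugin's actual code):

```python
# Each GoldenConfigSetting contributes one scope condition; OR-joining
# querysets in a loop (qs = qs | Device.objects.filter(**scope)) emits
# one OR'd condition per setting in the generated WHERE clause.
scope_conditions = [
    "U1.slug = 'arista_eos'",
    "U1.slug = 'cisco_ios'",
]
where_clause = " OR ".join(scope_conditions)  # grows with every setting

# The second issue multiplies the first: one near-identical GoldenConfig
# lookup is issued for every row the table renders (a classic N+1).
rows_rendered = 100
total_queries = 1 + rows_rendered  # 1 list query + 1 query per table row
```

With 100 rows rendered, the page issues 101 queries, each of which must evaluate the ever-growing OR chain.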

Proposed Change

Move the view to be based directly on the GoldenConfig model rather than on Devices filtered via GoldenConfigSettings. This would also limit the table view to ONLY devices that a plugin job has been run against, excluding devices that MAY be in scope but have not yet had a job run against them.
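The behavioral trade-off in this proposal can be illustrated with a toy example (device names are hypothetical):

```python
# Devices matched by GoldenConfigSettings scopes vs. devices that actually
# have a GoldenConfig record because a plugin job ran against them.
devices_in_scope = {"rtr1", "rtr2", "rtr3", "sw1"}
devices_with_job_results = {"rtr1", "sw1"}

# Current view: one row per in-scope device, resolved through the costly
# per-scope OR query.
device_based_rows = sorted(devices_in_scope)

# Proposed view: one row per GoldenConfig record, a single cheap query,
# but in-scope devices that never had a job run no longer appear.
goldenconfig_based_rows = sorted(devices_with_job_results)
missing_from_new_view = sorted(devices_in_scope - devices_with_job_results)
```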

@whitej6
Contributor Author

whitej6 commented Jun 14, 2022

Query for the first observed issue:

SELECT "dcim_device"."id",
       "dcim_device"."created",
       "dcim_device"."last_updated",
       "dcim_device"."_custom_field_data",
       "dcim_device"."status_id",
       "dcim_device"."local_context_data",
       "dcim_device"."local_context_schema_id",
       "dcim_device"."local_context_data_owner_content_type_id",
       "dcim_device"."local_context_data_owner_object_id",
       "dcim_device"."device_type_id",
       "dcim_device"."device_role_id",
       "dcim_device"."tenant_id",
       "dcim_device"."platform_id",
       "dcim_device"."name",
       "dcim_device"."_name",
       "dcim_device"."serial",
       "dcim_device"."asset_tag",
       "dcim_device"."site_id",
       "dcim_device"."rack_id",
       "dcim_device"."position",
       "dcim_device"."face",
       "dcim_device"."primary_ip4_id",
       "dcim_device"."primary_ip6_id",
       "dcim_device"."cluster_id",
       "dcim_device"."virtual_chassis_id",
       "dcim_device"."vc_position",
       "dcim_device"."vc_priority",
       "dcim_device"."comments",
       "dcim_device"."secrets_group_id"
  FROM "dcim_device"
 WHERE "dcim_device"."id" IN (
        SELECT DISTINCT U0."id"
          FROM "dcim_device" U0
         INNER JOIN "dcim_platform" U1
            ON (U0."platform_id" = U1."id")
         WHERE (U1."slug" = 'arista_eos' OR U1."slug" = 'cisco_ios') -- This OR operator is costly
       )
 ORDER BY "dcim_device"."_name" ASC
 LIMIT 100

@whitej6
Contributor Author

whitej6 commented Jun 14, 2022

Query for the second observed issue:

SELECT "nautobot_golden_config_goldenconfig"."id",
       "nautobot_golden_config_goldenconfig"."created",
       "nautobot_golden_config_goldenconfig"."last_updated",
       "nautobot_golden_config_goldenconfig"."_custom_field_data",
       "nautobot_golden_config_goldenconfig"."device_id",
       "nautobot_golden_config_goldenconfig"."backup_config",
       "nautobot_golden_config_goldenconfig"."backup_last_attempt_date",
       "nautobot_golden_config_goldenconfig"."backup_last_success_date",
       "nautobot_golden_config_goldenconfig"."intended_config",
       "nautobot_golden_config_goldenconfig"."intended_last_attempt_date",
       "nautobot_golden_config_goldenconfig"."intended_last_success_date",
       "nautobot_golden_config_goldenconfig"."compliance_config",
       "nautobot_golden_config_goldenconfig"."compliance_last_attempt_date",
       "nautobot_golden_config_goldenconfig"."compliance_last_success_date"
  FROM "nautobot_golden_config_goldenconfig"
 INNER JOIN "dcim_device"
    ON ("nautobot_golden_config_goldenconfig"."device_id" = "dcim_device"."id")
 WHERE "nautobot_golden_config_goldenconfig"."device_id" = '2020d6bf-a59f-4276-a03c-9ed9903c109b'::uuid -- Only this line changes; the rest is repeated for each instance displayed.
 ORDER BY "dcim_device"."_name" ASC
 LIMIT 1

@itdependsnetworks
Contributor

The device-model view approach was originally used because users found it confusing when devices they had changed to be "in scope" were not reflected in the view. It seems that feature, combined with the multiple-repository feature, caused this. That said, we will have to investigate which options are least impactful, both for performance and for changing functionality.

@vladanmilosevic73

Would it be possible to refresh the repos in a staggered, controlled manner, and do a single on-demand device refresh when needed? I doubt users back up their devices every hour... but I may not have enough exposure to say.

@jathanism

> Query for the first observed issue:
> [SQL query quoted in full from the comment above]

Can you please run this and the other query from a dbshell prefixed with EXPLAIN ANALYZE? And then paste the SQL query and the output from EXPLAIN ANALYZE SELECT into https://explain.depesz.com and provide the output of those reports?

This seems like an indexing problem to me, so we should rule that out first.
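For readers who want to practice reading query plans locally, here is a self-contained stand-in using SQLite's EXPLAIN QUERY PLAN (the actual request above is for Postgres EXPLAIN ANALYZE from a Nautobot dbshell; the table and index names here are illustrative only):

```python
import sqlite3

# Minimal demonstration that a plan reveals whether a filter column is
# served by an index: a full table SCAN before, an index SEARCH after.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dcim_platform (id INTEGER PRIMARY KEY, slug TEXT)")

def plan(sql):
    """Return SQLite's query-plan detail strings as one joined string."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id FROM dcim_platform WHERE slug = 'cisco_ios'"
before = plan(query)  # no usable index: plan reports a full table SCAN

conn.execute("CREATE INDEX idx_platform_slug ON dcim_platform (slug)")
after = plan(query)   # plan switches to SEARCH ... USING ... INDEX
```

Postgres's EXPLAIN ANALYZE additionally executes the query and reports actual row counts and timings, which is what explain.depesz.com visualizes.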

@itdependsnetworks
Contributor

Thanks @jathanism!! I suspect there is a solution here, I just don't know all of the alternates. I have asked @mzbroch to take a look as well.

whitej6 added a commit to whitej6/nautobot-plugin-golden-config that referenced this issue Jun 27, 2022
@whitej6
Contributor Author

whitej6 commented Jun 27, 2022

                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=45.24..45.49 rows=100 width=853) (actual time=0.972..0.980 rows=100 loops=1)
   ->  Sort  (cost=45.24..45.52 rows=111 width=853) (actual time=0.971..0.975 rows=100 loops=1)
         Sort Key: dcim_device._name
         Sort Method: top-N heapsort  Memory: 74kB
         ->  Hash Join  (cost=23.55..41.47 rows=111 width=853) (actual time=0.302..0.477 rows=382 loops=1)
               Hash Cond: (dcim_device.id = u0.id)
               ->  Seq Scan on dcim_device  (cost=0.00..16.89 rows=389 width=853) (actual time=0.007..0.029 rows=389 loops=1)
               ->  Hash  (cost=22.16..22.16 rows=111 width=16) (actual time=0.288..0.288 rows=382 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 26kB
                     ->  HashAggregate  (cost=19.94..21.05 rows=111 width=16) (actual time=0.223..0.251 rows=382 loops=1)
                           Group Key: u0.id
                           Batches: 1  Memory Usage: 77kB
                           ->  Hash Join  (cost=1.13..19.66 rows=111 width=16) (actual time=0.018..0.147 rows=382 loops=1)
                                 Hash Cond: (u0.platform_id = u1.id)
                                 ->  Seq Scan on dcim_device u0  (cost=0.00..16.89 rows=389 width=32) (actual time=0.002..0.060 rows=389 loops=1)
                                 ->  Hash  (cost=1.10..1.10 rows=2 width=16) (actual time=0.009..0.010 rows=2 loops=1)
                                       Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                       ->  Seq Scan on dcim_platform u1  (cost=0.00..1.10 rows=2 width=16) (actual time=0.006..0.007 rows=2 loops=1)
                                             Filter: (((slug)::text = 'arista_eos'::text) OR ((slug)::text = 'cisco_ios'::text))
                                             Rows Removed by Filter: 5
 Planning Time: 1.060 ms
 Execution Time: 1.042 ms
(22 rows)

whitej6 added a commit to whitej6/nautobot-plugin-golden-config that referenced this issue Jun 28, 2022
whitej6 added a commit to whitej6/nautobot-plugin-golden-config that referenced this issue Jun 28, 2022
@jathanism

I think we arrived at a highly optimized implementation here: construct a single Q object by merging all of the scope fields together, then annotate the resulting queryset with all of the related objects that are needed, so everything is fetched in a single query instead of multiple per-row queries.
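A hedged sketch of the merge idea described above (the field names and the shape of a GoldenConfigSetting scope are assumptions for illustration, not the plugin's exact code):

```python
from collections import defaultdict

# Instead of OR-joining one queryset per GoldenConfigSetting, collect
# every setting's scope into one mapping of field -> set of values...
scopes = [
    {"platform__slug": "arista_eos"},
    {"platform__slug": "cisco_ios"},
]

merged = defaultdict(set)
for scope in scopes:
    for field, value in scope.items():
        merged[field].add(value)

# ...then build a single __in lookup per field, which renders as one SQL
# IN clause instead of N OR'd subqueries. In the real fix this feeds one
# Q object, and the queryset is annotated with the per-row GoldenConfig
# data so the table renders without issuing a query per row.
filters = {f"{field}__in": sorted(values) for field, values in merged.items()}
```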

itdependsnetworks pushed a commit that referenced this issue Jun 29, 2022
…eryset annotation. (#277)

* Optimize query generation using dictionary merges and queryset annotation. (#277)
jmpettit pushed a commit to jmpettit/nautobot-app-golden-config that referenced this issue Jan 30, 2024
…s and queryset annotation. (nautobot#277)

* Optimize query generation using dictionary merges and queryset annotation. (nautobot#277)
4 participants