Kong 1.0 first time invoking any unique route path slow loading reported in access_phase #4164

Closed
jeremyjpj0916 opened this issue Jan 7, 2019 · 24 comments

@jeremyjpj0916
Contributor

commented Jan 7, 2019

Summary

Have to run in 5 min, so posting this quick with more followup later tonight.

Basically my last vetting of 1.0 has one outstanding issue. Running 1.0 against the same existing C* database nodes I have used with 0.14.1 without major issue, any exposed route that has not been called yet takes a really long time to respond on its first call. OpenTracing shows the time is spent in the access_phase, StatsD confirms the duration, and even a curl client run locally from the Kong node itself sees the same delay. After the first call everything is super fast, as expected out of this awesome application.

Steps To Reproduce

  1. Create a few routes and services on Kong 1.0; I have 2 or so consumers with oauth2/jwt creds too.
  2. Try calling a given route for the first time (note this proxy has no custom plugins or anything on it, just a bare route and service; I even tried disabling the global zipkin/statsd plugins as a test, but that didn't help either). You should see a major delay on the first call, and every subsequent call is fast. I see it taking anywhere from 11-14 seconds on the first call; after that we go to 48 milliseconds or less end to end.

Picture for some ammo/proof:
[screenshot: Kong first client request to a given route]

Edit - And if it helps: when I call a route protected by oauth/jwt, the slowdown occurs too, before I even get back an "unauthorized". That still makes me think it is something about the service/router building or local caching. Note I also never made any change to the config after Kong started up, no edits on the fly or anything.

Additional Details & Logs

  • Kong version 1.0
  • Kong debug-level startup/debug logs: reveal nothing
  • Kong error logs: nothing in the error logs
  • Kong configuration: will post tonight; it's the same as in every other GitHub issue I post, though
  • Operating system: Alpine 3.6

@jeremyjpj0916 jeremyjpj0916 changed the title Kong 1.0 first time invoking any unique route slow loading in access_phase Kong 1.0 first time invoking any unique route path slow loading in access_phase Jan 7, 2019

@jeremyjpj0916 jeremyjpj0916 changed the title Kong 1.0 first time invoking any unique route path slow loading in access_phase Kong 1.0 first time invoking any unique route path slow loading reported in access_phase Jan 7, 2019

@p0pr0ck5

Contributor

commented Jan 7, 2019

This is pretty well-known and expected behavior. When Kong is first started, most of the db entities (consumers, plugin configs) are not cached in memory, so they require database access. With 1.0, setting C* read consistency to a weaker level can reduce database access times, but unless data is pre-fetched from the database on node initialization, the behavior noted above is fundamental to Kong's architecture.
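
For anyone unfamiliar with the pattern being described, this is a cache-aside lookup: Kong's datastore cache (built on lua-resty-mlcache) only hits the database inside a loader callback when the key is missing from the worker and shm layers. A minimal sketch assuming an OpenResty context with a lua_shared_dict declared; the consumer loader, key layout, and DAO call are illustrative, not Kong's actual internals:

-- Minimal cache-aside sketch (assumes lua-resty-mlcache is installed and an
-- nginx.conf containing: lua_shared_dict kong_db_cache_shm 12m).
local mlcache = require "resty.mlcache"

local cache = assert(mlcache.new("db_cache", "kong_db_cache_shm", {
  lru_size = 1000, -- per-worker LRU slots
  ttl      = 3600, -- positive hits cached for 1 hour
  neg_ttl  = 30,   -- misses (nil) cached for 30 seconds
}))

-- Hypothetical loader: only runs on a cache miss, i.e. the first request
-- that needs this consumer after the node starts (the slow path).
local function fetch_consumer(db, username)
  return db.consumers:select_by_username(username)
end

local function get_consumer(db, username)
  -- extra arguments after the callback are forwarded to it on a miss
  return cache:get("consumers:" .. username, nil, fetch_consumer, db, username)
end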

@jeremyjpj0916

Contributor Author

commented Jan 8, 2019

I do understand the general caching process Kong performs on that first tx (since service/route building is global). I have not yet checked whether your or Bungle's proposed PRs address this further, but I kinda feel they may (will probably give each a go against my 1.0 sandbox tomorrow). What I can say is that on Kong 0.14.1, with 100's of service/route/consumer/plugin resources, the additional latency on the first invoke was around 200-300 extra milliseconds; jumping to 13-14 full seconds (on 1.0 in a sandbox with 8 or fewer service/route pairs, 1-2 consumers, 6 disabled plugins, 1 upstream) seems a bit odd. My Kong C* config is directed to use the local cluster with only 3-node local consensus, so a query generally takes anywhere from 15-50 milliseconds Kong <-> Cassandra node; my Kong nodes run in the same DC as my C* nodes.

The other unexplained part is how each individual route on Kong 1.0 shows that latency on its first invoke. If the service/router building is indeed global and not granular, then when I call /api/myroute/serviceone, Kong under the hood builds out all service/route pairs and uses worker events to ensure the other ngx worker processes also pick up the values. After the router build is complete I should be able to call /api/myroute/servicetwo 20-30 seconds later and have it load fast, since the other route was already called and the router would be built across workers. Instead I am finding that each individual route has that massive lag on its first call; that behavior should not be the case, right? Honestly, if the PRs out there fix the bigger delay I will just close this issue, but on pure 1.0 without the PRs, the additional latency (12700 extra milliseconds) on the first call on a Kong 1.0 node that has essentially no resources, versus a 0.14.1 node that has tons, strikes me as odd.
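
For context on the worker-events mechanism mentioned above, here is a rough illustration of the pattern using lua-resty-worker-events (which Kong builds on). The event names, the shm name, and the router_version variable are made up for the example and are not Kong's actual internals:

-- Illustrative only: broadcast a "router rebuilt" notification to sibling
-- nginx workers so they invalidate their local copy instead of each paying
-- the rebuild cost on their own first request.
local worker_events = require "resty.worker.events"

-- must run once per worker, typically in init_worker_by_lua;
-- "my_events_shm" would need a matching lua_shared_dict in nginx.conf
assert(worker_events.configure({ shm = "my_events_shm" }))

local router_version = 0  -- hypothetical per-worker state

-- every worker listens for the (made-up) event
worker_events.register(function(data)
  router_version = data.version  -- mark local router stale; rebuild lazily
end, "example-source", "router_rebuilt")

-- the worker that rebuilt the router announces the new version
local ok, err = worker_events.post("example-source", "router_rebuilt", { version = 2 })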

Edit - Reducing to 4 service/route pairs, deleting 1 of the 2 consumers along with their creds/acls, removing all but 2 plugins, and keeping 1 upstream has gotten the initial call to a single route down to about 2 seconds, with 12 total resources on the gateway (still 1600-1700 ms more than 0.14.1's behavior with 1000's of resources). What would probably help most is a file I could drop into the core libs that traces where I perceive the bottleneck to be, with logging around querying and cache building. But I think giving the pending PRs a go may be more worthwhile, to know whether what I'm seeing is fixed by those changes.

Attaching a Konga (next branch: https://github.com/pantsel/konga/tree/next) snapshot of the bare-bones Kong 1.0 setup if you want to import it into your own fresh Kong node:
snapshot_4.txt (you may have to rename the file to .json; I don't know if Konga needs a .json extension, but GitHub won't let me upload a json file).

@bungle

Member

commented Jan 8, 2019

I think we can improve two things:

  1. startup time (as reported #4138)
  2. initial performance (as reported here)

One thing we could do is to cache all the core entities, or at least the most important ones (services, routes, plugins, consumers, upstreams, targets, snis, certificates — or just some of those).

The problem is that caching and warming up the cache may increase startup time and memory pressure. So we would need an SHM for each, with easily tunable sizes (and a somewhat more intelligent warmup that stops if the shm is already full). One Kong user may have millions of consumers, while another may not have any. Still, we should have some "sane" defaults (if possible). I guess this is something we want to do soon because of the talk about running Kong without an active db connection in service mesh deployment scenarios.
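
To make the warmup idea concrete, here is a rough sketch of what priming the node cache for one entity could look like. The dao:each() paging iterator and cache_key() follow the shape of Kong's DAOs, but safe_set() here is just an assumed non-evicting write that errors when the shm is full; none of this is the actual implementation:

-- Hypothetical warmup pass for a single entity type, run at startup or from
-- a timer: page through all rows and prime the node-level cache, stopping if
-- the shm runs out of space rather than evicting entries already cached.
local function warmup_entity(db, cache, name, page_size)
  local dao = db[name]

  for row, err in dao:each(page_size) do
    if err then
      return nil, "failed paging " .. name .. ": " .. err
    end

    local ok, set_err = cache:safe_set(dao:cache_key(row.id), row)
    if not ok then
      -- e.g. "no memory": the shm for this entity is full, stop warming up
      return nil, set_err
    end
  end

  return true
end

-- e.g. warmup_entity(kong.db, kong.cache, "services", 1000)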

I have seen those roughly 15-second jumps with Cassandra, and perhaps DNS, and I think there could be something related to how Cassandra calls next_coordinator that could be further investigated. @jeremyjpj0916 does this occur only on Cassandra, or do you see the same behavior with Postgres? Also, is there a difference between a single Cassandra node vs. a cluster?

The PR I sent (#4102) adds cache warmup for services, so that could speed up the initial request a bit (it also adds a config parameter to make the router rebuild happen only in the background), while adding a little to the startup time. We could perhaps use a timer or some other way to warm up without adding to startup time. But I think it is about finding a good balance.

@jeremyjpj0916

Contributor Author

commented Jan 8, 2019

@bungle I will investigate different C* configurations in the sandbox as well as test Postgres. My current C* config in the sandbox was as follows:

            - name: KONG_CASSANDRA_PORT
              value: '9042'
            - name: KONG_CASSANDRA_KEYSPACE
              value: kong_dev
            - name: KONG_CASSANDRA_TIMEOUT
              value: '5000'
            - name: KONG_CASSANDRA_SSL
              value: 'on'
            - name: KONG_CASSANDRA_SSL_VERIFY
              value: 'on'
            - name: KONG_CASSANDRA_USERNAME
              value: ****
            - name: KONG_CASSANDRA_PASSWORD
              value: ****
            - name: KONG_CASSANDRA_CONSISTENCY
              value: LOCAL_QUORUM
            - name: KONG_CASSANDRA_LB_POLICY
              value: RequestDCAwareRoundRobin
            - name: KONG_CASSANDRA_LOCAL_DATACENTER
              value: DC2
            - name: KONG_DB_UPDATE_FREQUENCY
              value: '5'
            - name: KONG_DB_UPDATE_PROPAGATION
              value: '5'
            - name: KONG_DB_CACHE_TTL
              value: '0'
            - name: KONG_DNS_HOSTSFILE
              value: /etc/hosts
            - name: KONG_DNS_STALE_TTL
              value: '4'
            - name: KONG_DNS_NOT_FOUND_TTL
              value: '30'
            - name: KONG_DNS_ERROR_TTL
              value: '1'
            - name: KONG_DNS_NO_SYNC
              value: 'off'
            - name: KONG_LUA_SSL_TRUSTED_CERTIFICATE
              value: /usr/local/kong/ssl/kongcert.pem
            - name: KONG_LUA_SSL_VERIFY_DEPTH
              value: '3'
            - name: KONG_CASSANDRA_CONTACT_POINTS
              value: >-
                server00466,server00467,server00468,server00469,server00470,server00471 (So 3 of these would get leveraged and 3 are in the other dc).

Let me try tweaking these and see what happens too:

            - name: KONG_CASSANDRA_CONSISTENCY
              value: LOCAL_QUORUM
            - name: KONG_CASSANDRA_LB_POLICY
              value: RequestDCAwareRoundRobin

As well as trying against a single Postgres node or a single C* node. Overall, I think the way to achieve the best perf across all db implementations would be querying 1000 records at a time from the various tables and building the relations locally in memory. Any path that looks up 1 record, which triggers a lookup of another single record, and then another record in another table just to fully configure a single Kong-built resource is pretty inefficient. I would think a method of getting all the data and then mapping it together locally in Lua land based on keys and such is the best way.
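
To illustrate the bulk-fetch-and-join approach described above, a rough sketch follows; the page() return shape is assumed to roughly match Kong's paging DAOs, and the join itself is plain Lua. None of this is existing Kong code:

-- Illustrative only: pull whole tables in large pages, then stitch routes to
-- their services locally by id instead of issuing one lookup per record.
local function load_all(dao, page_size)
  local all = {}
  local rows, err, offset

  repeat
    rows, err, offset = dao:page(page_size, offset)  -- return shape assumed
    if err then
      return nil, err
    end

    for _, row in ipairs(rows) do
      all[#all + 1] = row
    end
  until not offset

  return all
end

local function routes_with_services(db)
  local services = assert(load_all(db.services, 1000))
  local routes   = assert(load_all(db.routes, 1000))

  local by_id = {}
  for _, service in ipairs(services) do
    by_id[service.id] = service
  end

  -- attach the full service object to each route in memory, no extra queries
  for _, route in ipairs(routes) do
    if route.service then
      route.service = by_id[route.service.id] or route.service
    end
  end

  return routes
end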

But right now, with this specific issue, I think there is actually some sort of problem. It makes no sense that a sandbox with so few resources that it barely requires any querying of the db takes so much longer than a 0.14.1 node with 1000x the resources in my environments, against the same underlying Cassandra cluster. It makes me think some underlying change in 1.0, with a specific C*-based setup, may have a bug causing the delay, though that is yet to be proven.

@bungle

Member

commented Jan 8, 2019

@jeremyjpj0916, if you have time, it would be interesting to see if this has any effect:
#4171

@jeremyjpj0916

Contributor Author

commented Jan 8, 2019

@bungle let me study the files and what to drop in where, and give it a go. I am going to rebuild my sandbox back up to more resources than the bare bones it has now (about 4 services and routes, 2 plugins, and a consumer, which got that first proxy call down to 2-3 seconds on the first call of each of the 4 individual routes tied to each service). To give a sense of how it scaled: when it was at around 8-9 services and routes, 10 or so plugins, and 2 consumers with creds/acls, it was in the 13-14 second range on the first call. Then I went back to my Stage env on 0.14.1, which now has 1000+ services/routes and consumers and plugins, and called a poc endpoint I knew had not been consumed today: it took 200 milliseconds flat xD. So yeah, some strangeness indeed.

@jeremyjpj0916

Contributor Author

commented Jan 8, 2019

@bungle Testing on 1.0 GA throws this for some reason with your patch files dropped in.

2019/01/08 22:25:24 [error] 22#0: init_by_lua error: /usr/local/share/lua/5.1/kong/db/init.lua:125: attempt to concatenate local 'err' (a nil value)
stack traceback:
	/usr/local/share/lua/5.1/kong/db/init.lua:125: in function 'prefix_err'
	/usr/local/share/lua/5.1/kong/db/init.lua:190: in function 'close'
	/usr/local/share/lua/5.1/kong/init.lua:360: in function 'init'
	init_by_lua:3: in main chunk
nginx: [error] init_by_lua error: /usr/local/share/lua/5.1/kong/db/init.lua:125: attempt to concatenate local 'err' (a nil value)
stack traceback:
	/usr/local/share/lua/5.1/kong/db/init.lua:125: in function 'prefix_err'
	/usr/local/share/lua/5.1/kong/db/init.lua:190: in function 'close'
	/usr/local/share/lua/5.1/kong/init.lua:360: in function 'init'
	init_by_lua:3: in main chunk


@p0pr0ck5

Contributor

commented Jan 8, 2019

My first thought here is the plugins iterator: the migration of the plugins schema to the new DAO could explain a change in latencies.

To avoid polluting this thread I am sending jeremy a patch that will generate debug tracing data that should shed light on the jump in latencies in cache-miss situations. Will report the results here once they’re in.

@p0pr0ck5

Contributor

commented Jan 9, 2019

Did a small bit of digging and it seems that, with a fresh 1.0 cluster, this call never returns true: https://github.com/Kong/kong/blob/master/kong/db/dao/plugins.lua#L317

So we end up having to make an expensive lookup of the schema state with every plugins iteration, which is causing the latency noted here, as it's quite chatty. From what I can find, the schema state created in https://github.com/Kong/kong/blob/master/kong/db/migrations/state.lua#L283 doesn't align with what's expected in the state's is_migration_executed, so throughout the request Kong never sees that the migrations are up to date. @kikito would you be able to spare a minute to see if something's off here?
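
For readers following along, the check in question conceptually boils down to walking the executed-migrations list in the schema state. A simplified sketch of that idea (field names here are illustrative, not the real state.lua structure), which shows why a mismatch in how the state is built would make the check always return false:

-- Conceptual sketch only: report whether a named migration for a subsystem
-- appears in the schema state's list of executed migrations. If the state is
-- built with a different structure than this walk expects, the answer is
-- always false and the expensive schema_state() lookup keeps happening on
-- every request.
local function is_migration_executed(schema_state, subsystem, name)
  for _, subsys in ipairs(schema_state.executed_migrations or {}) do
    if subsys.subsystem == subsystem then
      for _, migration in ipairs(subsys.migrations or {}) do
        if migration.name == name then
          return true
        end
      end
    end
  end

  return false
end

-- e.g. is_migration_executed(state, "core", "001_14_to_15")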

@kikito

Member

commented Jan 9, 2019

Hi @jeremyjpj0916, I have implemented a fix for the problem described by @p0pr0ck5. It's in #4179. If you can, please give it a try.

@jeremyjpj0916

Contributor Author

commented Jan 9, 2019

@kikito you got it, going to drop that in and give it a go on my 1.0 sandbox. Will report back in a few minutes on the perf impact.

@jeremyjpj0916

Contributor Author

commented Jan 9, 2019

Tried it; dropping that file in on my Kong startup did not bring any improvement. 8572 ms taken on the first call.

@p0pr0ck5

Contributor

commented Jan 9, 2019

Gave this a try, and I don't notice any change in behavior either. I do see that the function https://github.com/Kong/kong/blob/master/kong/db/dao/plugins.lua#L317 now returns true, but the adjustment to the translator function doesn't have any effect (https://github.com/Kong/kong/blob/master/kong/db/dao/plugins.lua#L318): we still end up in the path where is_migration_executed is called every time a plugin config is loaded via select_by_cache_key (https://github.com/Kong/kong/blob/master/kong/db/dao/plugins.lua#L306).

@p0pr0ck5

Contributor

commented Jan 9, 2019

BTW, this completely-unacceptable-and-very-dirty-hack does solve the underlying problem of the cost of always checking the schema state:

diff --git a/kong/db/dao/plugins.lua b/kong/db/dao/plugins.lua
index 3b17c860..d2b36ba7 100644
--- a/kong/db/dao/plugins.lua
+++ b/kong/db/dao/plugins.lua
@@ -308,6 +308,7 @@ function Plugins:load_plugin_schemas(plugin_set)
 end
 
 
+local s
 function Plugins:select_by_cache_key(key)
 
   -- first try new way
@@ -316,7 +317,13 @@ function Plugins:select_by_cache_key(key)
   if not new_err then -- the step above didn't fail
     -- we still need to check whether the migration is done,
     -- because the new table may be only partially full
-    local schema_state = assert(self.db:schema_state())
+    local schema_state
+    if not s then
+      schema_state = assert(self.db:schema_state())
+      s = schema_state
+    else
+      schema_state = s
+    end
 
     -- if migration is complete, disable this translator function and return
     if schema_state:is_migration_executed("core", "001_14_to_15") then

In my local environment (kong-vagrant with local C*), this reduces subsequent calls to load_plugin_config from around 35 ms to under 1 ms (averaging 500 or so us, roughly just the cost of the CQL query itself). Obviously this isn't an appropriate solution, but maybe we can consider checking this migration state on node boot, and in the optimistic case where migrations are up to date, avoid this check on the request path?

hishamhm added a commit that referenced this issue Jan 9, 2019

fix(db) disable migrating version of plugins:select_by_cache_key on both object and class

We need to patch both the class and the current instance of the
Plugins object in order to prevent the "migrating" version of
`select_by_cache_key` to be used again once we determine that it
is not necessary.

See #4164.

kikito added a commit that referenced this issue Jan 9, 2019

fix(db) disable migrating version of plugins:select_by_cache_key on both object and class

We need to patch both the class and the current instance of the
Plugins object in order to prevent the "migrating" version of
`select_by_cache_key` to be used again once we determine that it
is not necessary.

See #4164.
@kikito

Member

commented Jan 9, 2019

@jeremyjpj0916 my coworkers pointed out that there were in fact two changes needed. Please give #4179 another look when you can.

@jeremyjpj0916

Contributor Author

commented Jan 9, 2019

@kikito already ahead of yah, saw the latest PR changes flood in 😄. Happy to report the massive latency has been dramatically reduced. Currently taking 784ms proxy_latency on the first call (as opposed to 8572ms) with a node that has these resources on the gateway:

10 global plugins, only 1 enabled currently
Service/Route pairs: 5
Consumers: 2
Upstreams: 1 with 2 targets

I still see a perf hit on 1.0 compared to 0.14.1 (it takes around 200-300ms proxy_latency for the first call in my environment with 1000+ service/route/consumer/plugin/acl resources, whereas my sandbox has minimal resources and is still 2-3x slower on that first load). There is also the fact that the first-call proxy_latency seems to occur on every unique route on 1.0, whereas on 0.14.1 only one route got hit with the first-call slowdown and other unique routes to different services weren't impacted by the router build, since it's global.

I did get @p0pr0ck5's tracing logic; I will run a trace with your patches applied too, which may point to other areas of optimization, and send it his way for further review by the Kong squad.

I also fully intend to add 1000s of mock resources to 1.0 and report back findings on perf at scale; it may help reveal hot paths that are inefficient or lacking at the moment.

@hishamhm

Member

commented Jan 9, 2019

I just merged #4179 and it looks like there's a bit more to do before we close this one!

@jeremyjpj0916, just to confirm: I assume the numbers you got (784ms) are with that latest code (basically equivalent to what was merged)?

maybe we can consider checking this migration state on node boot, and in the optimistic case where migrations are up to date, avoid this check on the request path?

@p0pr0ck5 indeed, we do check the migration state on boot! Here's an attempt at doing what you suggested:

diff --git a/kong/db/dao/plugins.lua b/kong/db/dao/plugins.lua
index c38c900e..1ddc828e 100644
--- a/kong/db/dao/plugins.lua
+++ b/kong/db/dao/plugins.lua
@@ -304,6 +304,15 @@ end
 
 
 function Plugins:select_by_cache_key(key)
+  local schema_state = assert(self.db:last_schema_state())
+
+  -- if migration is complete, disable this translator function
+  -- and use the regular function
+  if schema_state:is_migration_executed("core", "001_14_to_15") then
+    self.select_by_cache_key = self.super.select_by_cache_key
+    Plugins.select_by_cache_key = nil
+    return self.super.select_by_cache_key(self, key)
+  end
 
   -- first try new way
   local entity, new_err = self.super.select_by_cache_key(self, key)
diff --git a/kong/db/init.lua b/kong/db/init.lua
index 4b350003..128b8a53 100644
--- a/kong/db/init.lua
+++ b/kong/db/init.lua
@@ -411,8 +411,18 @@ do
   local MigrationHelpers = require "kong.db.migrations.helpers"
   local MigrationsState = require "kong.db.migrations.state"
 
+
+  local last_schema_state
+
+
   function DB:schema_state()
-    return MigrationsState.load(self)
+    last_schema_state = MigrationsState.load(self)
+    return last_schema_state
+  end
+
+
+  function DB:last_schema_state()
+    return last_schema_state or self:schema_state()
   end
 
 

If this further improves the initial request I'll cut a PR with this one.

@jeremyjpj0916

Contributor Author

commented Jan 9, 2019

@hishamhm yes. I am also going to add the 2 smaller bungle PRs around the db and init stuff that just got merged, which should probably help 1.0 when I test it at scale too (query 1000 vs 100). But even in my small sandbox there is still the difference between 0.14.1 taking 200-300ms with tons of resources vs 1.0 taking 784ms with a few resources; hopefully a trace reveals what else makes up the difference.

@p0pr0ck5

Contributor

commented Jan 10, 2019

I spent a bit of time digging into some granular data generated by @jeremyjpj0916 from a request that reported 726 ms of latency. This was a cold-cache request: the cache entries needed by the plugins iterator were all misses. A few high-level data points:

  • DNS resolution called by balancer.toip: 10 ms
  • Shortest-running CQL query: 5.5 ms
  • Longest-running CQL query: 121 ms
  • Sum of CQL query timings: 386 ms
  • Average CQL timing: ~12.9 ms; factoring out the 121 ms outlier (the second longest-running query was 22 ms), the average CQL timing is 8.9 ms
  • Time to validate schema state: 315 ms

The sum of schema state validation, CQL queries, and DNS lookup (which constitutes all of the network I/O in this request) was 711 of the 726 ms of proxy latency noted.

Removing the schema state validation, as @hishamhm's #4182 does, would cut this almost in half, bringing things pretty close to the cold-cache latencies noted by jeremy in 0.14.1.

@hishamhm

Member

commented Jan 10, 2019

Great to hear! Thanks @p0pr0ck5!

@jeremyjpj0916

Contributor Author

commented Jan 10, 2019

@p0pr0ck5 thanks for taking the time to investigate all this with me! Glad to see it yielded core improvements. Onward and upward towards a bulletproof Kong 🐒. Haven't tried @hishamhm's latest PR change yet, so I will give it a go when I get some time, report back, and we can hopefully close up this issue! Still hoping something makes it into router perf/stability in 1.1 from past conversations, as it's the other piece of my struggles; or maybe I'll just run the open PRs on the latest minor version and I can help y'all feel good about them after they've been running in our stage/prod for a month lol.

@jeremyjpj0916

Contributor Author

commented Jan 12, 2019

Tested with @hishamhm's latest pending PR included; the before and after show a nice cold-cache shave indeed in the wild on 1.0:

Before that PR on 1.0(with the other recent prs still there):

X-Upstream-Latency: 10
X-Proxy-Latency: 757

After that PR on 1.0:

X-Upstream-Latency: 5
X-Proxy-Latency: 379

I am comfortable with that number; it seems more in line with what 0.14.1 dished out, still slightly higher than 0.14.1 at a similar resource count. I did just give our dev dmz env running 0.14.1 a quick spin (it has 1,500 service/route/acl/consumer resources on it right now, granted, so it is much larger than my 1.0 sandbox), and it took this long running cold:

X-Upstream-Latency: 10
X-Proxy-Latency: 377

So as I go back to adding resources to the 1.0 sandbox and scaling it up, let's see what happens.

Edit-

Added 1k full "proxy" resources (services/routes/consumers/creds/acls/oauth2) to 1.0 and it still says X-Proxy-Latency @ 379 on the first call, then drops to about 0-2 ms on each subsequent call, with a few spikes to 10-14 ms, so that is pretty great in my book.

I think the last question that remains unanswered is why, since the router rebuild is global, each individual route has that "cold cache" latency on its first call (with each route's subsequent calls then being fast).

@p0pr0ck5

Contributor

commented Jan 13, 2019

I think the last question that remains unanswered is why, since the router rebuild is global, each individual route has that "cold cache" latency on its first call (with each route's subsequent calls then being fast).

As noted before, this is a function of the plugins iterator, not the db calls needed to rebuild the router. The plugins iterator loads plugin configs based on the tuple of route/service/consumer present in the request. So for requests to a route that Kong has not previously seen, a separate set of queries is issued to attempt to load any plugin configs that may be applicable to the request. This is expected behavior.
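
For a concrete picture of what that means, here is a simplified sketch of the kind of scope combinations the plugins iterator probes for a request. The exact set, order, and cache-key layout differ in Kong's real code, and the cache:get/db call shapes below are illustrative:

-- Illustrative only: for each configured plugin, every scope combination that
-- could apply to the current request is looked up. On a route nobody has hit
-- yet, all of these are cache misses, so each falls through to a datastore
-- query; that is the one-time "cold cache" cost per unique route.
local function plugin_scopes(route_id, service_id, consumer_id)
  return {
    { route = route_id, service = service_id, consumer = consumer_id },
    { route = route_id, service = service_id },
    { route = route_id, consumer = consumer_id },
    { service = service_id, consumer = consumer_id },
    { route = route_id },
    { service = service_id },
    { consumer = consumer_id },
    {},  -- global plugin, no scope at all
  }
end

local function load_plugin_config(cache, db, plugin_name, scope)
  -- hypothetical cache key layout; Kong's real keys are built differently
  local key = table.concat({
    "plugins", plugin_name,
    scope.route or "", scope.service or "", scope.consumer or "",
  }, ":")

  return cache:get(key, nil, function()
    return db.plugins:select_by_cache_key(key)  -- the datastore round trip
  end)
end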

@jeremyjpj0916

Contributor Author

commented Jan 13, 2019

Sounds good to me then. Closing up the issue, as the PRs have been merged and proven effective.
