Oathkeeper (v0.14.2_oryOS.10) returning empty reply on slow/long distance database calls #178

Closed

dietb opened this issue Apr 25, 2019 · 8 comments

@dietb commented Apr 25, 2019

Describe the bug
I have Oathkeeper running in an East US Kubernetes cluster, and the DB currently lives in West US. There's about 70 ms of latency between the two sites.

When running the following curl from a pod in the same k8s cluster as Oathkeeper, I see this:

curl:
root@shellpod-bd-55c5b5cc74-nlmrh:/# curl "http://oathkeeper-api.hydra.svc.cluster.local:4456/rules?limit=50000&offset=0"
curl: (52) Empty reply from server

[oathkeeper-api-7f4cc5cb9f-dbmr6] time="2019-04-24T18:50:58Z" level=info msg="started handling request" method=GET remote="10.39.0.47:60820" request="/rules?limit=50000&offset=0"
[oathkeeper-api-7f4cc5cb9f-dbmr6] time="2019-04-24T18:51:24Z" level=info msg="completed handling request" measure#oathkeeper-api.latency=26073727174 method=GET remote="10.39.0.47:60820" request="/rules?limit=50000&offset=0" status=200 text_status=OK took=26.073727174s

Running what I believe to be the same query (taken from the postgres logs and from the code at https://github.com/ory/oathkeeper/blob/master/rule/manager_sql.go#L79), from the same pod that issues the curl requests, returns within a few ms.

There are currently around 50 rules configured. When I poll with a limit below 20 (specifically, 19) it works. Once I try 20, it times out at just over 10 seconds.

To Reproduce
Steps to reproduce the behavior:

  1. Have an oathkeeper instance with a 70ms lag between the instance and the database
  2. Either manually poll oathkeeper rules or set up oathkeeper-proxy

Expected behavior
For things to not time out :-)

Version:

  • Environment: Kubernetes 1.12.7
  • Version: oryd/oathkeeper:v0.14.2_oryOS.10

Additional context
I can reproduce this pretty easily if needed.
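
One mechanism that would produce exactly this pair of symptoms (curl reports an empty reply, yet the server log later records a 200 after ~26 s) is an HTTP write timeout that expires before the handler finishes. Below is a minimal standalone net/http sketch of that behavior; the 10-second WriteTimeout is a hypothetical value, and nothing in this thread confirms that Oathkeeper's API server is configured this way.

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/rules", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(15 * time.Second) // stand-in for a few hundred sequential 70 ms queries
		fmt.Fprintln(w, "[]")        // written after the deadline; the client never sees it
	})
	srv := &http.Server{
		Addr:         ":4456",
		Handler:      mux,
		WriteTimeout: 10 * time.Second, // hypothetical value, not confirmed in this thread
	}
	// curl http://localhost:4456/rules -> curl: (52) Empty reply from server,
	// even though the handler runs to completion and a logging middleware
	// wrapping it would still record status 200.
	log.Fatal(srv.ListenAndServe())
}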

@aeneasr (Member) commented Apr 27, 2019

Could you show the (truncated) db connection string?

@aeneasr (Member) commented Apr 27, 2019

Also, we might not fix this because there are some changes incoming with the next release: #177

@dietb (Author) commented Apr 27, 2019

Nothing fancy.

postgres://:@haproxy.default.svc.cluster.local:5432/auth?sslmode=verify-ca

I tried to up the max_conns but didn't see any positive results. I dug through the code and it doesn't look like that'd help a ton anyway.

Once work starts on the issue referenced, I may be able to dedicate some time to helping out with it.

@aeneasr (Member) commented Apr 27, 2019

Ok, damn - that was what I thought could work. Maybe there is a timeout based on the transaction size? I don't really know...

What do you think of the proposal laid out in #177 ? Would it suit your use case?

@dietb (Author) commented Apr 27, 2019

I can keep poking around and add some logging when I get a chance. I assume part of it is that it's making multiple queries, and 70 ms × N adds up. Writing those as a single nested query would probably be faster overall, but...

#177 is definitely a big step in the right direction. Localized reads stored in memory will be a lot faster than constant db reads. Things would get a little trickier with hot reloads because you'd have to read repeatedly, but the odds of a KV store being faster than DB lookups are pretty strong.

I have some other thoughts about #177 that probably aren't appropriate for here. If I get time, I'll hop on discord to rant.
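
As a rough sketch of the "localized reads stored in memory" idea discussed above: request-path lookups hit local memory, and hot reloads become periodic background re-reads of the database. This is an illustration only, not the design proposed in #177; it reuses the Rule type and the SQLManager.ListRules signature from the snippet quoted later in this thread, and the sync/time imports are omitted for brevity.

type cachedRuleStore struct {
	mu    sync.RWMutex
	rules []Rule
}

// Rules serves reads from local memory, so request latency no longer depends
// on the 70 ms round trip to the database.
func (c *cachedRuleStore) Rules() []Rule {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.rules
}

// refreshLoop pays the slow cross-region cost once per interval instead of on
// every request; this is where "hot reload" turns into repeated reads.
func (c *cachedRuleStore) refreshLoop(m *SQLManager, every time.Duration) {
	for {
		if rules, err := m.ListRules(50000, 0); err == nil {
			c.mu.Lock()
			c.rules = rules
			c.mu.Unlock()
		}
		time.Sleep(every)
	}
}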

@aeneasr (Member) commented Apr 27, 2019

Please do! Also feel free to post them in #177 !

@ridhamtarpara commented Apr 30, 2019

I was having the same problem with only 15 rules, due to the same cross-region issue between europe-east and asia-south, and traced the reason. In my case it was happening because of the sequential queries in the ListRules logic:

func (s *SQLManager) ListRules(limit, offset int) ([]Rule, error) {
	// One round trip to fetch the page of rule ids...
	var ids []string
	if err := s.db.Select(&ids, s.db.Rebind("SELECT surrogate_id FROM oathkeeper_rule ORDER BY id LIMIT ? OFFSET ?"), limit, offset); err == sql.ErrNoRows {
		return []Rule{}, nil
	} else if err != nil {
		return nil, errors.WithStack(err)
	}

	// ...then one GetRule per id, each of which issues four more queries below.
	rules := make([]Rule, len(ids))
	for k, id := range ids {
		d, err := s.GetRule(id)
		if err != nil {
			return nil, errors.WithStack(err)
		}
		rules[k] = *d
	}

	return rules, nil
}

func (s *SQLManager) GetRule(id string) (*Rule, error) {
	d := sqlRule{
		Authenticators:     []sqlRuleHandler{},
		Authorizers:        []sqlRuleHandler{},
		CredentialsIssuers: []sqlRuleHandler{},
	}
	if err := s.db.Get(&d, s.db.Rebind("SELECT * FROM oathkeeper_rule WHERE surrogate_id=?"), id); err == sql.ErrNoRows {
		return nil, errors.WithStack(helper.ErrResourceNotFound)
	} else if err != nil {
		return nil, errors.WithStack(err)
	}

	// Three more round trips per rule, one per handler table.
	if err := s.getHandler("oathkeeper_rule_authorizer", id, &d.Authorizers); err != nil {
		return nil, errors.WithStack(err)
	}
	if err := s.getHandler("oathkeeper_rule_authenticator", id, &d.Authenticators); err != nil {
		return nil, errors.WithStack(err)
	}
	if err := s.getHandler("oathkeeper_rule_credentials_issuer", id, &d.CredentialsIssuers); err != nil {
		return nil, errors.WithStack(err)
	}

	return d.toRule()
}

func (s *SQLManager) getHandler(table, id string, rh *sqlRuleHandlers) (err error) {
	if err := s.db.Select(rh, s.db.Rebind(fmt.Sprintf("SELECT * FROM %s WHERE rule_id=?", table)), id); err != nil {
		return errors.WithStack(err)
	}
	return
}

So I changed the query to use SQL joins, which worked for me. It was necessary because we are already running Oathkeeper in production, so it may be a workaround for you too; once #177 lands this shouldn't be an issue anymore.
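
The joined query itself was not posted in the thread. For illustration only, here is one way the per-rule handler lookups could be collapsed, using batching with github.com/jmoiron/sqlx's In helper rather than the join actually used above. It reuses the types from the snippet above, assumes s.db is a *sqlx.DB (which the Rebind/Select calls suggest), and guesses that sqlRuleHandler exposes its rule_id column as a field called RuleID. Back of the envelope, the original path costs 1 + 4*N round trips: with ~50 rules that is about 201 sequential queries, or roughly 14 s of pure network latency at 70 ms each, before any query execution time.

// Hypothetical sketch, not the query from this comment: fetch the handlers for
// a whole page of rules in one round trip per handler table instead of one per
// rule. RuleID is a guessed field name for the rule_id column.
func (s *SQLManager) getHandlersBatch(table string, ids []string) (map[string][]sqlRuleHandler, error) {
	if len(ids) == 0 {
		return map[string][]sqlRuleHandler{}, nil
	}
	query, args, err := sqlx.In(fmt.Sprintf("SELECT * FROM %s WHERE rule_id IN (?)", table), ids)
	if err != nil {
		return nil, errors.WithStack(err)
	}
	var rows []sqlRuleHandler
	if err := s.db.Select(&rows, s.db.Rebind(query), args...); err != nil {
		return nil, errors.WithStack(err)
	}
	byRule := make(map[string][]sqlRuleHandler, len(ids))
	for _, r := range rows {
		byRule[r.RuleID] = append(byRule[r.RuleID], r)
	}
	return byRule, nil
}

A reworked ListRules could then issue one paged SELECT over oathkeeper_rule plus three of these batched calls: four round trips in total regardless of how many rules the page contains, instead of 1 + 4*N.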

@dietb (Author) commented Apr 30, 2019

@ridhamtarpara Would you mind posting the query? I'd like to take a look at it.

@aeneasr closed this in 6b509ad on May 17, 2019
