Add hashed settings to cache keys #59

Merged: 6 commits, Jul 14, 2022

Conversation

@sarahzinger (Member) commented Jul 7, 2022

Relates to grafana/athena-datasource#144

The issue:

  • a user goes to create an AWS grafana plugin instance using secrets/keys as auth
  • user mistypes those secrets/keys
  • user tries to fetch catalogs with those credentials
  • we create a session with those credentials on their behalf
  • we store that session with a cache key that does not update when the secrets/keys update
  • we create an api instance using that session
  • we store that api instance using a "connection key" that also does not update when the secrets/keys update
  • aws returns an error saying the credentials are invalid
  • the user then updates their credentials
  • we successfully save those new credentials to the datasource
  • but we do not expire the cacheKey or connectionKey, so we will continue to return a broken session and api until we restart the server or update a field that does change the cache/connection keys

Proposed solution:

  • we take the entire settings config, hash it, and make that the cache/connection key used to access this data, rather than a non-hashed selection of less sensitive settings.
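For concreteness, a minimal sketch of the idea (illustrative names only; the real models.Settings struct and key helpers in this repo differ):

```go
package example

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// Settings stands in for the plugin's full settings struct; the real
// models.Settings has more fields than this.
type Settings struct {
	Region    string `json:"region"`
	AccessKey string `json:"accessKey"`
	SecretKey string `json:"secretKey"`
}

// cacheKey hashes the entire settings object, secrets included, so any
// credential change produces a new key and stale cached entries are bypassed.
func cacheKey(id int64, s Settings) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%d-%x", id, sha256.Sum256(b)), nil
}
```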

Open questions/things I'd love feedback on:

  • does this feel like a secure solution to you? I think it is, but I'd love someone more familiar with security/cryptography/etc to double check my thinking
  • does anyone know if it really makes sense to have these 2 caches in the first place? I don't recall exactly why they're necessary. It seems reasonable, I suppose, though I'm not sure this additional complication buys us much; I'm sort of tempted to say we should not cache the api and should only cache sessions, but I suspect there were reasons why this made sense in the first place and I'm just jumping in late to the party.
  • how do we feel about the fact that we store stale sessions/api clients on people's servers and never remove them? I assume it's probably "ok" since that's what we're currently doing, as best as I can tell, but it does feel like it could be pretty inefficient. I wonder if instead we should destroy stale sessions/apis when the settings hash changes?

@CLAassistant commented Jul 7, 2022

CLA assistant check: all committers have signed the CLA.

@sarahzinger requested review from a team, iwysiu, and kevinwcyu, and removed the request for a team, on July 7, 2022 at 21:05
```diff
@@ -46,13 +46,13 @@ func (s *AWSDatasource) createDB(id int64, args sqlds.Options, settings models.S
 	return db, nil
 }
 
-func (d *AWSDatasource) storeAPI(id int64, args sqlds.Options, dsAPI api.AWSAPI) {
-	key := connectionKey(id, args)
+func (d *AWSDatasource) storeAPI(id int64, args sqlds.Options, settings models.Settings, dsAPI api.AWSAPI) {
```
@andresmgot (Contributor) commented Jul 8, 2022:

I don't think you need the changes in this package. If the settings change, the session will change, so this will be a new connection. But I may be wrong; it's worth double-checking. You can do this by changing the access key and trying to reload databases, for example. The second connection, using the wrong key, should fail (and not reuse the previous session's connection).

@sarahzinger (Member, Author) replied:

Hey @andresmgot, I'm not sure I follow what you mean when you say you don't think I need the changes in this package. Do you mean you don't think the changes are necessary in datasource.go but they are necessary in sessions.go?

I believe we need the changes here in addition to the session changes, because when we first create an api instance we load it with a particular session and store it. While we no longer use that session directly, the stored api instance holds onto the out-of-date session. When I remove my changes from datasource.go I see the bug reappear.
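To illustrate the shape of the problem as I understand it (a hypothetical type, not the actual one in this repo):

```go
package example

import "github.com/aws/aws-sdk-go/aws/session"

// The cached API client keeps a reference to the session it was created with,
// so even after the session cache starts producing fresh sessions, a client
// stored earlier under an unchanged connection key still carries the old one.
type cachedAPI struct {
	sess *session.Session // captured at creation; stale once credentials change
}
```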

Let me know if I'm misunderstanding tho!

@andresmgot (Contributor) replied:

> Do you mean you don't think the changes are necessary in datasource.go but they are necessary in sessions.go?

yes, that was what I meant.

> the stored api instance holds onto the out-of-date session.

That is the part I don't fully understand 🤔 What I thought should be happening whenever you click "Save" in the data source configuration is:

  1. Call sqlds NewDatasource function
  2. That calls Connect in the data source code.
  3. That calls GetDB from this package
  4. That calls createAPI, which finally calls the New method of the data source API package, which creates an AWS session and stores the result, overwriting whatever API instance was there already.

If that is not happening, I may have gotten that flow wrong, but it feels weird that this package cares about AWS settings, since that should have been solved at a lower level. Can you maybe debug what's happening here? This code is a bit complicated, but it's important to understand it well since it can cause many errors.

@sarahzinger (Member, Author) replied:

Hmm, a really good question that I'm not totally sure I know the answer to just yet, but I did try looking yesterday. I'm going to keep digging today if I can, since it feels like a worthy thing to understand.

Here's what I've found so far, just in case it's helpful to read my debugging process:

  • I can tell you that without this code the error still persists. But to your point, maybe that means we aren't overwriting the original value properly or something, and there's a better way to solve this problem.
  • I can tell you that we do call Connect/GetDb/createAPI in an effort to overwrite the api with the right settings, and that we do this before calling getAPI, so I don't think it's an issue of a race condition.
  • I can tell you that we use this sync map store method: https://cs.opensource.google/go/go/+/refs/tags/go1.18.3:src/sync/map.go;l=137 to store the new api
  • I find this line in particular very interesting in that source code:
```go
// tryStore stores a value if the entry has not been expunged.
//
// If the entry is expunged, tryStore returns false and leaves the entry
// unchanged.
```

I don't really know what makes something expunged yet, but I do sort of wonder if it's possible that we are calling store but not actually storing anything and leaving the original entry somehow? This feels like a wild theory, but I suppose it's possible?
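For what it's worth, a quick standalone check (mine, not code from this PR) shows that sync.Map's Store always overwrites an existing entry; expunging is an internal detail that only comes into play around deletions:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var m sync.Map
	m.Store("conn-key", "api built with old credentials")
	m.Store("conn-key", "api built with new credentials") // Store overwrites unconditionally
	v, _ := m.Load("conn-key")
	fmt.Println(v) // prints: api built with new credentials
}
```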

@andresmgot (Contributor) replied:

"expunge" is a new word for me but it seems that only applies when you remove stuff from the map (which we don't) but in the debugger you should be able to see if the entry changes in memory. It's more likely I did something wrong there or it's not calling what we think it's calling.

If you don't spot the issue, feel free to send me a meeting invite for tomorrow and we can take a look at this together; I am curious what's going on 🤔

@sarahzinger (Member, Author) replied:

Thanks Andrés, I think I'm ok, I'm just writing out thoughts for myself at this point, sorry lol, but if you have opinions or corrections feel free to chime in.

Learned a few more things today; logging them so I don't forget for tomorrow.

  • definitely was on a totally wrong track with that expunge stuff lol 😅
  • our cacheKey was hashing a settings map that had hidden fields, so we weren't hashing the secret key, for example. I have fixed that in a local pr.
  • but interestingly, this still doesn't work without the hashed connectionKey (when it totally should). This is because, when we update the datasource and store the api, we store it with a connectionKey that follows this format:

id-map-of-args-that-could-affect-the-api-like-region

and when we update the datasource and call Connect, we do not send along a region as a query arg (which is fine and fair imo), and so when we create the api we store it with a connectionKey of

id-emptymap[]

and then, when we query for catalogs right afterwards, we try to find the api we just saved, but we make this call with the default region, so we use a connection key like this:

id-map-with-region

which now contains out of date information
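A rough illustration of the mismatch, using a simplified stand-in for connectionKey (the real helper formats the id and args differently):

```go
package main

import "fmt"

// connectionKey is a simplified stand-in for the real key helper.
func connectionKey(id int64, args map[string]string) string {
	return fmt.Sprintf("%d-%v", id, args)
}

func main() {
	stored := connectionKey(4, map[string]string{})                       // key used when Connect stores the api
	lookedUp := connectionKey(4, map[string]string{"region": "default"}) // key used when fetching catalogs
	fmt.Println(stored, lookedUp, stored == lookedUp)                    // 4-map[] 4-map[region:default] false
}
```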

So I think the thing we need to do is either update Connect (in athena and other datasources) to somehow get the default region and pass it along to getDB, or maybe alter create/storeAPI (in aws-sdk) to ensure that if we're trying to store with an empty region we use the default.

Going to give that a try tomorrow. Sorry I'm very slow at debugging this, but it feels worth struggling through hahaha

@andresmgot (Contributor) replied:

gotcha, that makes more sense! It works when you include the settings in the key because getApi fails to get the instance from the cache and creates a new one. The default thing has given me so many headaches since the beginning 😓

> So I think the thing we need to do is either update Connect (in athena and other datasources) to somehow get the default region and pass it along to getDB, or maybe alter create/storeAPI (in aws-sdk) to ensure that if we're trying to store with an empty region we use the default.

mmh, it seems there are two issues here. In the Connect method we are using args = nil so, as you say, the map is empty; but for all the resources we send as "args" elements that may not be relevant to the connection. For example, for retrieving Columns we use args = {"region": "default", "catalog": "default", "database": "default", "table": "default"}, which are all treated as "connection args" since we don't make a distinction in the backend. This probably means we end up with many copies of the same connection.

Since for Athena the only argument that really affects the connection is the region, I would modify the Connect method to fill the args with {"region": "default"} if it's empty, and in the rest of the methods (catalogs, databases, tables...) strip the args down to only the region (see the sketch at the end of this comment).

In the case of Redshift we don't even allow changing the region, so I would just use nil (or empty) args when getting the API instances in all the methods.

BTW, note that we are literally using default as the value (it will be resolved afterwards), so you don't need to worry about what the default value actually is. It's done here.

In any case, those changes are for the data sources code, not something that you need to change here.
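If it helps, a sketch of that suggestion (assuming sqlds.Options is a map of strings; the function names here are illustrative, not the real athena-datasource code):

```go
package example

// normalizeConnectArgs fills empty args in Connect so the api is stored under
// the same key the query paths will later use to look it up.
func normalizeConnectArgs(args map[string]string) map[string]string {
	if len(args) == 0 {
		// "default" is a sentinel resolved to the configured region later
		return map[string]string{"region": "default"}
	}
	return args
}

// connectionArgs strips the args in the resource methods (catalogs, databases,
// tables, ...) down to the only field that actually affects the connection.
func connectionArgs(args map[string]string) map[string]string {
	return map[string]string{"region": args["region"]}
}
```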

@sarahzinger (Member, Author) replied:

@andresmgot good to know that this is probably not an issue with, say, redshift and is just isolated to athena. I won't put the fix in this repo then! I did try your method of passing region: default in Connect and got back this error from our sdk. We could try to support it here, but as you say, it sounds like for now this is isolated to athena?

So I kept the fix in Connect, but manually grabbed the default region from the config (which hasn't been set yet), which involved unmarshalling twice. That's odd, but I suppose ok? You can see it here: grafana/athena-datasource#168

Let me know if this seems like a good path forward to you!

@andresmgot (Contributor) replied:

I have replied in your PR 👍

> that this is probably not an issue with say redshift

Sorry, no, this is probably an issue in Redshift too. My only point was that in Redshift we don't allow changing the region from the QueryEditor, so we don't need to use different args to keep track of different API connections. Right now it is wrongly using other parameters (like the db name) to store different copies of the same (or the wrong) thing.

@sarahzinger (Member, Author) replied:

ahh ok, I think that does make sense, thanks for explaining! Let me know if you think this seems good to go now!

@andresmgot (Contributor) commented:

> does this feel like a secure solution to you? I think it is, but I'd love someone more familiar with security/cryptography/etc to double check my thinking

Yes, this should be fine. For example, see this discussion.

> does anyone know if it really makes sense to have these 2 caches in the first place? I don't recall why they are necessary exactly, but it seems reasonable I suppose, tho I'm not sure if this additional complication provides us many wins, I'm sort of tempted to say we should not cache the api and should only cache sessions, but I suspect there were reasons why this made sense in the first place and I'm just jumping in late to the party.

So we are caching two different things here. The first is the AWS API session; the second is the connection to the resource, for example the connection to a Redshift database. We should not open new connections for every request, since that's less performant and we can hit simultaneous-connection limits.
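Conceptually, the two layers look something like this (a simplified sketch; the actual structs in this repo differ):

```go
package example

import "sync"

// Two separate caches, as described above: one for AWS API sessions, one for
// open resource connections (e.g. a Redshift database connection).
type awsDatasourceCaches struct {
	sessions    sync.Map // hashed settings key -> cached AWS session
	connections sync.Map // connection key -> cached API / db connection
}
```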

> how do we feel about the fact that we are storing stale sessions/api clients on people's servers and never remove them? I assume it's probably "ok" since that's what we're currently doing as best as I can tell, but it does feel like it could potentially be pretty inefficient? I wonder if instead we should destroy stale session/apis if the settings hash changes?

We are actually expiring session clients. See:

```go
if env.expiration.After(time.Now().UTC()) {
```
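For context, a simplified sketch of the logic around that line (an assumed shape; the real GetSession code differs):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// envelope is a hypothetical wrapper pairing a cached value with its expiry,
// matching the shape implied by the line quoted above.
type envelope struct {
	session    string // stands in for the real *session.Session
	expiration time.Time
}

var sessions sync.Map

// getSession reuses a cached session only while its expiration is in the
// future; otherwise the caller builds and stores a new one.
func getSession(key string) (string, bool) {
	if cached, ok := sessions.Load(key); ok {
		env := cached.(envelope)
		if env.expiration.After(time.Now().UTC()) {
			return env.session, true
		}
	}
	return "", false
}

func main() {
	sessions.Store("k", envelope{"sess", time.Now().UTC().Add(-time.Minute)})
	_, ok := getSession("k")
	fmt.Println(ok) // false: the cached session has expired
}
```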

Since we are expiring the session, the resource connection should change as well.

It's true, though, that if a session is never used again there will be some garbage left in memory, but I do not think it's something that we need to worry about.

@fridgepoet (Member) commented:

Tested it manually in Athena (after using aws-sdk-react 0.0.37) and it is working as expected.

@andresmgot (Contributor) left a comment:

LGTM! Thank you for your patience in this PR :)

```diff
@@ -185,7 +188,7 @@ func (sc *SessionCache) GetSession(c SessionConfig) (*session.Session, error) {
 
 	var regionCfg *aws.Config
 	if c.Settings.Region == defaultRegion {
-		backend.Logger.Warn("Region is set to \"default\", which is unsupported")
+		backend.Logger.Warn("Region is set to \"default\", which is unsupported, please use sqlModels.DefaultKey")
```
@andresmgot (Contributor) commented:

I think this is a different thing (and not specific to sql), so I would leave the message as it was.

@sarahzinger (Member, Author) replied:

ah, yeah fair point
