Add hashed settings to cache keys #59

Merged: 6 commits, Jul 14, 2022

Conversation

@sarahzinger (Member) commented Jul 7, 2022

Relates to grafana/athena-datasource#144

The issue:

  • a user goes to create an AWS grafana plugin instance using secrets/keys as auth
  • user mistypes those secrets/keys
  • user tries to fetch catalogs with those credentials
  • we create a session with those credentials on their behalf
  • we store that session with a cache key that does not update when the secrets/keys update
  • we create an api instance using that session
  • we store that api instance using a "connection key" that also does not update when the secrets/keys update
  • aws returns an error saying the credentials are invalid
  • the user then updates their credentials
  • we successfully save those new credentials to the datasource
  • but we do not expire the cacheKey or connectionKey, so we will continue to return a broken session and api until we restart the server or update a field that does change the cache/connection keys

Proposed solution:

  • we take the entire settings config, hash it, and make that the cache/connection key used to access this data, rather than a non-hashed selection of less sensitive settings.
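For concreteness, a minimal sketch of the idea (illustrative names only; the real models.Settings struct and key helpers in this repo differ):

```go
package example

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// Settings stands in for the plugin's full settings struct; the real
// models.Settings has more fields than this.
type Settings struct {
	Region    string `json:"region"`
	AccessKey string `json:"accessKey"`
	SecretKey string `json:"secretKey"`
}

// cacheKey hashes the entire settings object, secrets included, so any
// credential change produces a new key and stale cached entries are bypassed.
func cacheKey(id int64, s Settings) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%d-%x", id, sha256.Sum256(b)), nil
}
```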

Open questions/things I'd love feedback on:

  • does this feel like a secure solution to you? I think it is, but I'd love someone more familiar with security/cryptography/etc to double check my thinking
  • does anyone know if it really makes sense to have these 2 caches in the first place? I don't recall exactly why they're necessary. It seems reasonable, I suppose, though I'm not sure this additional complication buys us much; I'm sort of tempted to say we should not cache the api and should only cache sessions, but I suspect there were reasons why this made sense in the first place and I'm just jumping in late to the party.
  • how do we feel about the fact that we store stale sessions/api clients on people's servers and never remove them? I assume it's probably "ok" since that's what we're currently doing, as best as I can tell, but it does feel like it could be pretty inefficient. I wonder if instead we should destroy stale sessions/apis when the settings hash changes?

@CLAassistant commented Jul 7, 2022

CLA assistant check: all committers have signed the CLA.

@sarahzinger requested review from a team, iwysiu, and kevinwcyu, and removed the request for a team, on July 7, 2022 at 21:05
```diff
@@ -46,13 +46,13 @@ func (s *AWSDatasource) createDB(id int64, args sqlds.Options, settings models.S
 	return db, nil
 }
 
-func (d *AWSDatasource) storeAPI(id int64, args sqlds.Options, dsAPI api.AWSAPI) {
-	key := connectionKey(id, args)
+func (d *AWSDatasource) storeAPI(id int64, args sqlds.Options, settings models.Settings, dsAPI api.AWSAPI) {
```
@andresmgot (Contributor) commented Jul 8, 2022:

I don't think you need the changes in this package. If the settings change, the session will change, so this will be a new connection. But I may be wrong; it's worth double-checking. You can do this by changing the access key and trying to reload databases, for example. The second connection, using the wrong key, should fail (and not reuse the previous session's connection).

@sarahzinger (Member, Author) replied:

Hey @andresmgot, I'm not sure I follow what you mean when you say you don't think I need the changes in this package. Do you mean you don't think the changes are necessary in datasource.go but they are necessary in sessions.go?

I believe we need the changes here in addition to the session changes, because when we first create an api instance we load it with a particular session and store it. While we no longer use that session directly, the stored api instance holds onto the out-of-date session. When I remove my changes from datasource.go I see the bug reappear.
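To illustrate the shape of the problem as I understand it (a hypothetical type, not the actual one in this repo):

```go
package example

import "github.com/aws/aws-sdk-go/aws/session"

// The cached API client keeps a reference to the session it was created with,
// so even after the session cache starts producing fresh sessions, a client
// stored earlier under an unchanged connection key still carries the old one.
type cachedAPI struct {
	sess *session.Session // captured at creation; stale once credentials change
}
```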

Let me know if I'm misunderstanding tho!

@andresmgot (Contributor) replied:

> Do you mean you don't think the changes are necessary in datasource.go but they are necessary in sessions.go?

yes, that was what I meant.

> the stored api instance holds onto the out-of-date session.

That is the part I don't fully understand 🤔 What I thought should be happening whenever you click "Save" in the data source configuration is:

  1. Call sqlds NewDatasource function
  2. That calls Connect in the data source code.
  3. That calls GetDB from this package
  4. That calls createAPI, which finally calls the New method of the data source API package, which creates an AWS session and stores the result, overwriting whatever API instance was there already.

If that is not happening, I may have gotten that flow wrong, but it feels weird that this package cares about AWS settings, since that should have been solved at a lower level. Can you maybe debug what's happening here? This code is a bit complicated, but it's important to understand it well since it can cause many errors.

@sarahzinger (Member, Author) replied:

Hmm, a really good question that I'm not totally sure I know the answer to just yet, but I did try looking yesterday. I'm going to keep digging today if I can, since it feels like a worthy thing to understand.

Here's what I've found so far, just in case it's helpful to read my debugging process:

  • I can tell you that without this code the error still persists. But to your point, maybe that means we aren't overwriting the original value properly or something, and there's a better way to solve this problem.
  • I can tell you that we do call Connect/GetDb/createAPI in an effort to overwrite the api with the right settings, and that we do this before calling getAPI, so I don't think it's an issue of a race condition.
  • I can tell you that we use this sync map store method: https://cs.opensource.google/go/go/+/refs/tags/go1.18.3:src/sync/map.go;l=137 to store the new api
  • I find this line in particular very interesting in that source code:
```go
// tryStore stores a value if the entry has not been expunged.
//
// If the entry is expunged, tryStore returns false and leaves the entry
// unchanged.
```

I don't really know what makes something expunged yet, but I do sort of wonder if it's possible that we are calling store but not actually storing anything and leaving the original entry somehow? This feels like a wild theory, but I suppose it's possible?
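For what it's worth, a quick standalone check (mine, not code from this PR) shows that sync.Map's Store always overwrites an existing entry; expunging is an internal detail that only comes into play around deletions:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var m sync.Map
	m.Store("conn-key", "api built with old credentials")
	m.Store("conn-key", "api built with new credentials") // Store overwrites unconditionally
	v, _ := m.Load("conn-key")
	fmt.Println(v) // prints: api built with new credentials
}
```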

@andresmgot (Contributor) replied:

"expunge" is a new word for me but it seems that only applies when you remove stuff from the map (which we don't) but in the debugger you should be able to see if the entry changes in memory. It's more likely I did something wrong there or it's not calling what we think it's calling.

If you don't spot the issue, feel free to send me a meeting invite for tomorrow and we can take a look at this together; I am curious what's going on 🤔

@sarahzinger (Member, Author) replied:

Thanks Andrés, I think I'm ok, I'm just writing out thoughts for myself at this point, sorry lol, but if you have opinions or corrections feel free to chime in.

Learned a few more things today; logging them so I don't forget for tomorrow.

  • definitely was on a totally wrong track with that expunge stuff lol 😅
  • our cacheKey was hashing a settings map that had hidden fields, so we weren't hashing the secret key, for example. I have fixed that in a local pr.
  • but interestingly, this still doesn't work without the hashed connectionKey (when it totally should). This is because, when we update the datasource and store the api, we store it with a connectionKey that follows this format:

id-map-of-args-that-could-affect-the-api-like-region

and when we update the datasource and call Connect, we do not send along a region as a query arg (which is fine and fair imo), and so when we create the api we store it with a connectionKey of

id-emptymap[]

and then, when we query for catalogs right afterwards, we try to find the api we just saved, but we make this call with the default region, so we use a connection key like this:

id-map-with-region

which now contains out of date information
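A rough illustration of the mismatch, using a simplified stand-in for connectionKey (the real helper formats the id and args differently):

```go
package main

import "fmt"

// connectionKey is a simplified stand-in for the real key helper.
func connectionKey(id int64, args map[string]string) string {
	return fmt.Sprintf("%d-%v", id, args)
}

func main() {
	stored := connectionKey(4, map[string]string{})                       // key used when Connect stores the api
	lookedUp := connectionKey(4, map[string]string{"region": "default"}) // key used when fetching catalogs
	fmt.Println(stored, lookedUp, stored == lookedUp)                    // 4-map[] 4-map[region:default] false
}
```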

So I think the thing we need to do is either update Connect (in athena and other datasources) to somehow get the default region and pass it along to getDB, or maybe alter create/storeAPI (in aws-sdk) to ensure that if we're trying to store with an empty region we use the default.

Going to give that a try tomorrow. Sorry I'm very slow at debugging this, but it feels worth struggling through hahaha

@andresmgot (Contributor) replied:

gotcha, that makes more sense! It works when you include the settings in the key because getApi fails to get the instance from the cache and creates a new one. The default thing has given me so many headaches since the beginning 😓

> So I think the thing we need to do is either update Connect (in athena and other datasources) to somehow get the default region and pass it along to getDB, or maybe alter create/storeAPI (in aws-sdk) to ensure that if we're trying to store with an empty region we use the default.

mmh, it seems there are two issues here. In the Connect method we are using args = nil so, as you say, the map is empty; but for all the resources we send as "args" elements that may not be relevant to the connection. For example, for retrieving Columns we use args = {"region": "default", "catalog": "default", "database": "default", "table": "default"}, which are all treated as "connection args" since we don't make a distinction in the backend. This probably means we end up with many copies of the same connection.

Since for Athena the only argument that really affects the connection is the region, I would modify the Connect method to fill the args with {"region": "default"} if it's empty, and in the rest of the methods (catalogs, databases, tables...) strip the args down to only the region (see the sketch at the end of this comment).

In the case of Redshift we don't even allow changing the region, so I would just use nil (or empty) args when getting the API instances in all the methods.

BTW, note that we are literally using default as the value (it will be resolved afterwards), so you don't need to worry about what the default value actually is. It's done here.

In any case, those changes are for the data sources code, not something that you need to change here.
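If it helps, a sketch of that suggestion (assuming sqlds.Options is a map of strings; the function names here are illustrative, not the real athena-datasource code):

```go
package example

// normalizeConnectArgs fills empty args in Connect so the api is stored under
// the same key the query paths will later use to look it up.
func normalizeConnectArgs(args map[string]string) map[string]string {
	if len(args) == 0 {
		// "default" is a sentinel resolved to the configured region later
		return map[string]string{"region": "default"}
	}
	return args
}

// connectionArgs strips the args in the resource methods (catalogs, databases,
// tables, ...) down to the only field that actually affects the connection.
func connectionArgs(args map[string]string) map[string]string {
	return map[string]string{"region": args["region"]}
}
```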

@sarahzinger (Member, Author) replied:

@andresmgot good to know that this is probably not an issue with, say, redshift and is just isolated to athena. I won't put the fix in this repo then! I did try your method of passing region: default in Connect and got back this error from our sdk. We could try to support it here, but as you say, it sounds like for now this is isolated to athena?

So I kept the fix in Connect, but manually grabbed the default region from the config (which hasn't been set yet), which involved unmarshalling twice. That's odd, but I suppose ok? You can see it here: grafana/athena-datasource#168

Let me know if this seems like a good path forward to you!

@andresmgot (Contributor) replied:

I have replied in your PR 👍

> that this is probably not an issue with say redshift

Sorry, no, this is probably an issue in Redshift too. My only point was that in Redshift we don't allow changing the region from the QueryEditor, so we don't need to use different args to keep track of different API connections. Right now it is wrongly using other parameters (like the db name) to store different copies of the same (or the wrong) thing.

@sarahzinger (Member, Author) replied:

ahh ok, I think that does make sense, thanks for explaining! Let me know if you think this seems good to go now!

@andresmgot (Contributor) commented:

> does this feel like a secure solution to you? I think it is, but I'd love someone more familiar with security/cryptography/etc to double check my thinking

Yes, this should be fine. For example, see this discussion.

> does anyone know if it really makes sense to have these 2 caches in the first place? I don't recall why they are necessary exactly, but it seems reasonable I suppose, tho I'm not sure if this additional complication provides us many wins, I'm sort of tempted to say we should not cache the api and should only cache sessions, but I suspect there were reasons why this made sense in the first place and I'm just jumping in late to the party.

So we are caching two different things here. The first is the AWS API session; the second is the connection to the resource, for example the connection to a Redshift database. We should not open new connections for every request, since that's less performant and we can hit simultaneous-connection limits.
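Conceptually, the two layers look something like this (a simplified sketch; the actual structs in this repo differ):

```go
package example

import "sync"

// Two separate caches, as described above: one for AWS API sessions, one for
// open resource connections (e.g. a Redshift database connection).
type awsDatasourceCaches struct {
	sessions    sync.Map // hashed settings key -> cached AWS session
	connections sync.Map // connection key -> cached API / db connection
}
```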

> how do we feel about the fact that we are storing stale sessions/api clients on people's servers and never remove them? I assume it's probably "ok" since that's what we're currently doing as best as I can tell, but it does feel like it could potentially be pretty inefficient? I wonder if instead we should destroy stale session/apis if the settings hash changes?

We are actually expiring session clients. See:

```go
if env.expiration.After(time.Now().UTC()) {
```
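For context, a simplified sketch of the logic around that line (an assumed shape; the real GetSession code differs):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// envelope is a hypothetical wrapper pairing a cached value with its expiry,
// matching the shape implied by the line quoted above.
type envelope struct {
	session    string // stands in for the real *session.Session
	expiration time.Time
}

var sessions sync.Map

// getSession reuses a cached session only while its expiration is in the
// future; otherwise the caller builds and stores a new one.
func getSession(key string) (string, bool) {
	if cached, ok := sessions.Load(key); ok {
		env := cached.(envelope)
		if env.expiration.After(time.Now().UTC()) {
			return env.session, true
		}
	}
	return "", false
}

func main() {
	sessions.Store("k", envelope{"sess", time.Now().UTC().Add(-time.Minute)})
	_, ok := getSession("k")
	fmt.Println(ok) // false: the cached session has expired
}
```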

Since we are expiring the session, the resource connection should change as well.

It's true, though, that if a session is never used again there will be some garbage left in memory, but I do not think it's something that we need to worry about.

@fridgepoet (Member) commented:

Tested it manually in Athena (after using aws-sdk-react 0.0.37) and it is working as expected.

@andresmgot (Contributor) left a comment:

LGTM! Thank you for your patience in this PR :)

```diff
@@ -185,7 +188,7 @@ func (sc *SessionCache) GetSession(c SessionConfig) (*session.Session, error) {
 
 	var regionCfg *aws.Config
 	if c.Settings.Region == defaultRegion {
-		backend.Logger.Warn("Region is set to \"default\", which is unsupported")
+		backend.Logger.Warn("Region is set to \"default\", which is unsupported, please use sqlModels.DefaultKey")
```
@andresmgot (Contributor) commented:

I think this is a different thing (and not specific to sql), so I would leave the message as it was.

@sarahzinger (Member, Author) replied:

ah, yeah fair point
