
proposal: x/crypto/argon2: add API variants to support a buffer pool #36634

Open
networkimprov opened this issue Jan 18, 2020 · 31 comments
@networkimprov commented Jan 18, 2020

argon2.Key() & .IDKey() appear to thrash memory by allocating a (typically) large buffer and using it once.
https://github.com/golang/crypto/blob/master/argon2/argon2.go#L157

This might not scale well where every login on a high-traffic site calls this API to check a password.

Could API variants be added which take a config struct that optionally contains a buffer, so we can use a buffer pool?

@aead @mark-rushakoff @bradfitz

@gopherbot gopherbot added this to the Unreleased milestone Jan 18, 2020
@aead (Contributor) commented Jan 19, 2020

In general, that sounds like a legit request to me. The argon2 doc contains a recommendation for such scenarios - however, it is not always possible to follow this advice.

Unfortunately, we cannot just use an argon2-internal global buffer pool - similar to
ioutil.Discard - since we have to allocate a memory-cost dependent buffer. One option may be a function that returns a "derive-function":

New(time, memory uint32, threads uint8) func(password []byte, salt []byte, keyLen uint32) []byte

that returns a closure with e.g. a sync.Pool. The returned closure may be typed
(e.g. type DeriveFunc func([]byte, []byte, uint32) []byte).

Calling code would use this as follows:

f := argon2.New(1, 64 * 1024, 4)
key1 := f(password, salt, 32)
key2 := f(password2, salt2, 32)

/cc @FiloSottile

@networkimprov (Author) commented Jan 19, 2020

I'd want to manage the buffer pool in my code. Roughly:

config := argon2.New(time, memory, threads)
mypool.Put(config.MakeBlockset())
...
go func() { // login handler
   buf := mypool.Get().(argon2.Blockset) // type Blockset []block
   key := config.Key(password, salt, keylen, buf)  
}()
@aead (Contributor) commented Jan 19, 2020

I'd want to manage the buffer pool in my code

@networkimprov Any specific reason for that? It sounds like a significantly more complex API.

@networkimprov (Author) commented Jan 19, 2020

Well your argon2.New() is essentially 2 methods; I've added one more, plus a named type. And generally x.New() returns an object.

There are many ways one might manage a buffer pool, not just sync.Pool; we can't lock in one scheme. Also some apps would need to expand or shrink the pool.

@aead (Contributor) commented Jan 19, 2020

Well your argon2.New() is essentially 2 methods; I've added one more, plus a named type. And generally x.New() returns an object.

There is now a Config type (or whatever argon2.New returns) which itself has an API (MakeBlockset) plus the Blockset and block (which we either need to export as Block or we change the underlying implementation to use []byte/[128]byte directly). Further, we require that users pass a valid / properly sized Blockset as buffer.

There are many ways one might manage a buffer pool, not just sync.Pool; we can't lock in one scheme.

I don't see why this should be an issue. For instance ioutil.Discard does that.

Also some apps would need to expand or shrink the pool.

Applications that require this behavior can still achieve this by calling argon2.New again after invoking f (f = argon2.New(...)) more than n times or after t seconds.

@networkimprov (Author) commented Jan 19, 2020

ioutil.Discard must look like any io.Writer; that it uses a sync.Pool is an implementation detail for an object fulfilling io.Writer. Our topic doesn't have a predefined contract.

Re shrink/expand pool "by calling argon2.New again after invoking f ... more than n times or after t seconds." That creates another sync.Pool. Now I have busy & idle f() widgets; how do I pick an idle one?

EDIT: f() can crash with OOM, unless New() allocates all pool buffers, causing a quantum leap in RAM use.

I want to see where I allocate memory and where I can block. And many apps already have a pool of resources which get allocated to connections. A package like this provides a toolkit, not an engine.

In my model, we can drop type Blockset:

config := argon2.New(time, memory, threads)
mypool.Put(config.MakeBuffer())
...
go func() { // login handler
   buf := mypool.Get() // type interface{}
   key := config.Key(password, salt, keylen, buf) // won't panic with OOM
   mypool.Put(buf)
}()
@aead (Contributor) commented Jan 19, 2020

ioutil.Discard must look like any io.Writer; that it uses a sync.Pool is an implementation detail for an object fulfilling io.Writer. Our topic doesn't have a predefined contract.

Sorry, but I don't see the distinction here. Why is it not acceptable to implement a (potential) argon2.New function with sync.Pool when e.g. the ioutil.Discard implementation is fine?

Re shrink/expand pool "by calling argon2.New again after invoking f ... more than n times or after t seconds." That creates another sync.Pool. Now I have busy & idle f() widgets; how do I pick an idle one?

The sync.Pool will expand automatically to the max. number of concurrent calls of f since each invocation of f will request a new buffer from the pool (pool.Get(...)) and put it back in the pool once the argon2 hash has been computed:

func New(time, memory uint32, threads uint8) func([]byte, []byte, uint32) []byte {
   pool := sync.Pool{ // one pool per derive function, sized for this memory cost
      New: func() interface{} { b := make([]block, memory); return &b },
   }
   return func(password, salt []byte, keyLen uint32) []byte {
      b := pool.Get().(*[]block)
      defer pool.Put(b)
      // compute the argon2 hash using *b and return the derived key
   }
}

When re-assigning f = argon2.New(...) you start with an empty pool again, and the previous pool will eventually be collected by the GC. Do you want more fine-grained control over when/how memory is allocated?

I want to see where I allocate memory and where I can block. And many apps already have a pool of resources which get allocated to connections. A package like this provides a toolkit, not an engine.

Maybe I misunderstood the proposal, but AFAICS your initial request was about a (potential) performance problem w.r.t. memory allocation. For example, a request handler that hashes a password: the fact that Argon2 is a memory-hard function eventually becomes a performance problem when that request handler becomes a "hot" path. (High traffic)
Therefore, I inferred that you're seeking a way to reduce the number of memory allocations caused by the argon2 package - i.e. by pooling the allocated buffers. Are you additionally requesting an API that not just reduces the number of memory allocations but also allows calling code to control when memory is allocated and how it gets pooled?

The implementation you've shown does not solve a fundamental problem:

config := argon2.New(time, memory, threads)
mypool.Put(config.MakeBuffer())
...
go func() { // login handler
   buf := mypool.Get() // type interface{}
   key := config.Key(password, salt, keylen, buf) // won't panic with OOM
   mypool.Put(buf)
}()

Let's assume mypool.Get() cannot return the buffer you put into the pool before (via mypool.Put(config.MakeBuffer())) because there are 2 concurrent login handler calls. What does mypool.Get() do in this case? Block? Return a new buffer - of what size?

An implementation that uses sync.Pool would need to do the following:

config := argon2.New(time, memory, threads)
mypool := sync.Pool{
   New: func() interface{} {
       return config.MakeBuffer()
   },
}
mypool.Put(config.MakeBuffer())
...
go func() { // login handler
    buf := mypool.Get() // type interface{}
    key := config.Key(password, salt, keylen, buf)
    mypool.Put(buf)
}()

That looks unreasonably more complex to me than the following code - since I don't see why it is necessary to tightly control the allocations to (significantly) reduce the GC pressure:

f := argon2.New(time, memory, threads)
go func() { // login handler
    key := f(password, salt, keylen)
}()

Also, you cannot avoid a potential OOM panic in the general case (i.e. with sync.Pool) since you don't know whether the pool returns an existing element or allocates a new one using its New function.

@FiloSottile (Member) commented Jan 19, 2020

Is there a benchmark showing that the allocation is significant next to the password hashing work?

If yes, why can't we use a global sync.Pool? It gets drained in two GC cycles, and most applications will use a consistent buffer size (and if not they will tend to grow to the max size, which is fine?).

@FiloSottile FiloSottile changed the title x/crypto/argon2: add API variants to support a buffer pool proposal: x/crypto/argon2: add API variants to support a buffer pool Jan 19, 2020
@gopherbot gopherbot added the Proposal label Jan 19, 2020
@FiloSottile FiloSottile modified the milestones: Unreleased, Proposal Jan 19, 2020
@aead (Contributor) commented Jan 19, 2020

If yes, why can't we use a global sync.Pool? It gets drained in two GC cycles, and most applications will use a consistent buffer size (and if not they will tend to grow to the max size, which is fine?).

Basically my point. However, a global sync.Pool does not work since you may allocate a 64 MiB buffer for one password and a 128 MiB buffer for another. There is no way (AFAICS) to tell the sync.Pool.New function to allocate buffers of different sizes, nor to request a buffer of a specific size from the pool.

@balasanjay (Contributor) commented Jan 19, 2020

Couldn't we just have an array of 32 pools internally, one for each power of two and use the appropriate one transparently?

(Conceptually, that is. I think we can have even fewer since people are unlikely to need 1 byte buffers, or 4 gig buffers)

@aead (Contributor) commented Jan 19, 2020

An implementation of the idea mentioned above shows no significant performance improvement compared to always allocating the buffers. However, this is an isolated benchmark, and it's possible that continuously allocating larger chunks of memory may cause performance degradation in a more complex system...

benchmark                                                   old ns/op     new ns/op     delta
BenchmarkArgon2id/_Time:_3,_Memory:_32_MB,_Threads:_1-4     98211824      96041183      -2.21%
BenchmarkArgon2id/_Time:_4,_Memory:_32_MB,_Threads:_1-4     130097021     129017218     -0.83%
BenchmarkArgon2id/_Time:_5,_Memory:_32_MB,_Threads:_1-4     159752346     161418792     +1.04%
BenchmarkArgon2id/_Time:_3,_Memory:_64_MB,_Threads:_4-4     68492977      80733770      +17.87%
BenchmarkArgon2id/_Time:_4,_Memory:_64_MB,_Threads:_4-4     90304361      92262618      +2.17%
BenchmarkArgon2id/_Time:_5,_Memory:_64_MB,_Threads:_4-4     110350844     127906065     +15.91%
BenchmarkArgon2id/_Time:_3,_Memory:_32_MB,_Threads:_1-4     96031422      94935380      -1.14%
BenchmarkArgon2id/_Time:_4,_Memory:_32_MB,_Threads:_1-4     126912084     126494319     -0.33%
BenchmarkArgon2id/_Time:_5,_Memory:_32_MB,_Threads:_1-4     159629421     158409139     -0.76%
BenchmarkArgon2id/_Time:_3,_Memory:_64_MB,_Threads:_4-4     64857493      62216787      -4.07%
BenchmarkArgon2id/_Time:_4,_Memory:_64_MB,_Threads:_4-4     87004452      102140203     +17.40%
BenchmarkArgon2id/_Time:_5,_Memory:_64_MB,_Threads:_4-4     110170128     111634540     +1.33%

benchmark                                                   old allocs     new allocs     delta
BenchmarkArgon2id/_Time:_3,_Memory:_32_MB,_Threads:_1-4     28             27             -3.57%
BenchmarkArgon2id/_Time:_4,_Memory:_32_MB,_Threads:_1-4     32             32             +0.00%
BenchmarkArgon2id/_Time:_5,_Memory:_32_MB,_Threads:_1-4     36             36             +0.00%
BenchmarkArgon2id/_Time:_3,_Memory:_64_MB,_Threads:_4-4     40             39             -2.50%
BenchmarkArgon2id/_Time:_4,_Memory:_64_MB,_Threads:_4-4     45             44             -2.22%
BenchmarkArgon2id/_Time:_5,_Memory:_64_MB,_Threads:_4-4     48             48             +0.00%
BenchmarkArgon2id/_Time:_3,_Memory:_32_MB,_Threads:_1-4     28             27             -3.57%
BenchmarkArgon2id/_Time:_4,_Memory:_32_MB,_Threads:_1-4     32             32             +0.00%
BenchmarkArgon2id/_Time:_5,_Memory:_32_MB,_Threads:_1-4     36             36             +0.00%
BenchmarkArgon2id/_Time:_3,_Memory:_64_MB,_Threads:_4-4     40             39             -2.50%
BenchmarkArgon2id/_Time:_4,_Memory:_64_MB,_Threads:_4-4     46             44             -4.35%
BenchmarkArgon2id/_Time:_5,_Memory:_64_MB,_Threads:_4-4     48             48             +0.00%

benchmark                                                   old bytes     new bytes     delta
BenchmarkArgon2id/_Time:_3,_Memory:_32_MB,_Threads:_1-4     33558691      2800560       -91.65%
BenchmarkArgon2id/_Time:_4,_Memory:_32_MB,_Threads:_1-4     33558758      4198778       -87.49%
BenchmarkArgon2id/_Time:_5,_Memory:_32_MB,_Threads:_1-4     33558822      4798048       -85.70%
BenchmarkArgon2id/_Time:_3,_Memory:_64_MB,_Threads:_4-4     67115854      4480949       -93.32%
BenchmarkArgon2id/_Time:_4,_Memory:_64_MB,_Threads:_4-4     67116140      9594164       -85.71%
BenchmarkArgon2id/_Time:_5,_Memory:_64_MB,_Threads:_4-4     67115988      11192085      -83.32%
BenchmarkArgon2id/_Time:_3,_Memory:_32_MB,_Threads:_1-4     33558692      2800560       -91.65%
BenchmarkArgon2id/_Time:_4,_Memory:_32_MB,_Threads:_1-4     33558770      4198776       -87.49%
BenchmarkArgon2id/_Time:_5,_Memory:_32_MB,_Threads:_1-4     33558822      4798048       -85.70%
BenchmarkArgon2id/_Time:_3,_Memory:_64_MB,_Threads:_4-4     67115843      3735409       -94.43%
BenchmarkArgon2id/_Time:_4,_Memory:_64_MB,_Threads:_4-4     67116508      5599674       -91.66%
BenchmarkArgon2id/_Time:_5,_Memory:_64_MB,_Threads:_4-4     67115960      11192085      -83.32%

@balasanjay That sounds like a potential solution, too - with the advantage that we don't need a new API.

@networkimprov (Author) commented Jan 19, 2020

When re-assigning f = argon2.New(...) you start with an empty pool again and the previous pool will be collected eventually

That will cause an unnecessary memory spike: first pool1, then pool1 + pool2, then pool2

@FiloSottile I'm concerned about memory consumption and OOM panics, not allocation performance.

For example, using 32MB memory per hash calc, 1000 concurrent logins consume 32GB. That's not very many logins.

assume mypool.Get() cannot return the buffer you put into the pool before ... because there are 2 concurrent login handler calls. What does mypool.Get() do in this case?

It waits. Otherwise it could OOM. Note that sync.Pool.New is optional, and only useful if you don't care about OOM, or limit pool use by external means.

@aead (Contributor) commented Jan 20, 2020

That will cause an unnecessary memory spike: first pool1, then pool1 + pool2, then pool2

Yes, for applications that want to shrink the pool explicitly - which I assume is an edge case and should not be a problem...

@FiloSottile I'm concerned about memory consumption and OOM panics, not allocation performance.
For example, using 32MB memory per hash calc, 1000 concurrent logins consume 32GB. That's not very many logins.

If you're not concerned about the alloc performance/GC pressure, then I don't understand why you want to block when allocating the memory for Argon2 rather than just limit the number of parallel requests - the effect will be the same AFAICS, right?

EDIT:

It waits. Otherwise it could OOM. Note that sync.Pool.New is optional, and only useful if you don't care about OOM, or limit pool use by external means.

The sync.Pool does not wait - it returns nil if there is no element in the pool: https://play.golang.org/p/ERnIJuGC1xA

@FiloSottile (Member) commented Jan 20, 2020

For example, using 32MB memory per hash calc, 1000 concurrent logins consume 32GB. That's not very many logins.

1000 concurrent logins will always consume 32GB. All we can do is try to help with sequential logins that happen between GC cycles, but I would expect the GC to notice the memory pressure and start being pretty aggressive.

Have we seen it be a problem in practice?

@networkimprov (Author) commented Jan 20, 2020

That will cause an unnecessary memory spike: first pool1, then pool1 + pool2, then pool2

Yes, for applications that want to shrink the pool explicitly

It causes a spike on expand as well. Also f() needs to return an error when its pool is exhausted:

var f = argon2.New(...)
...
go func() {
   var k []byte
   var err error
   for {
      k, err = f(...)
      if err == nil {
         break
      }
      f = argon2.New(...) // oops, concurrency mistake
   }
}()

The sync.Pool does not wait - it returns nil if there is no element in the pool

Ah; so I'd need a custom pool (e.g. via buffered channel) to avoid OOM. Since memory is the factor limiting logins, it makes sense to use a memory pool to implement a concurrency limit.

There may not be much reason to enhance the API if the pool would be internal.

@aead (Contributor) commented Jan 20, 2020

There may not be much reason to enhance the API if the pool would be internal.

Therefore, I guess: proposal declined/abandoned - since rate limiting has to be done by the application itself?

@networkimprov (Author) commented Jan 20, 2020

Your API proposal is abandoned? I proposed one that's flexible and idiomatic, which can use a sync.Pool for an unlimited pool or a buffered channel for a limited one.

@aead (Contributor) commented Jan 20, 2020

@networkimprov The only thing we can do here is help reduce the number of sequential allocations.
As @FiloSottile already mentioned, N concurrent key derivations will always consume e.g. N * 32 MB of memory. We can only reduce the number of new allocations (allocate the memory once and reuse it on every subsequent sequential call).

If you want to limit the amount of memory used, you have to limit the number of concurrent calls of argon2.IDKey(...). A very naive approach can be found here: https://play.golang.org/p/yWdtOfpFpAC

If you want to limit the amount of memory used for argon2 to e.g. 4 GB and use 32 MB per derivation, then you can handle 128 requests concurrently.

@renthraysk commented Jan 20, 2020

Given that the recommended computation time varies between 0.5s (argon2 devs) and 1s (libsodium devs), if a derivation can't be started immediately, it should probably fail immediately? No waiting.

@networkimprov (Author) commented Jan 20, 2020

@renthraysk can you give references and/or elaborate? Why would deferring the password-hash computation for a login (i.e. not the original hash creation) make the hash less secure?

@renthraysk commented Jan 20, 2020

@networkimprov I was just thinking with regard to response times: what's the point of waiting up to a second to start an argon2 derivation which will take another 1s?

So if we don't wait, the code for limiting the number of argon2 derivations in flight becomes simpler.

https://play.golang.org/p/XktJ5Slgn6P

@aead (Contributor) commented Jan 20, 2020

@networkimprov Do we agree that the application has to deal with limiting the amount of (concurrently) allocated memory - such that no argon2 API change is necessary? If so, this proposal can be closed.

@networkimprov (Author) commented Jan 20, 2020

@renthraysk you can't reject login requests just because there's a spike in concurrent ones. Is that what you suggested, or did I miss something?

@renthraysk commented Jan 20, 2020

@networkimprov
Why can't you? At some rate of login requests you have to.

@networkimprov (Author) commented Jan 20, 2020

If the system has the capacity to handle another logged-in user, disallowing login seems wrong. Better to post a note on the login page:

"Due to pervasive criminality on the Internet, your login may take up to 10 minutes to complete. While you're waiting, please phone your legislative representative and request that your nation begin performing customs inspections on inbound international Internet traffic."

:-)

@networkimprov (Author) commented Jan 20, 2020

@aead why does your benchmark in #36634 (comment) show no difference in allocations? It looks like nothing was pooled. Can you post your code?

@aead (Contributor) commented Jan 20, 2020

@networkimprov It does. It reduces the amount of allocated memory dramatically:
See:
BenchmarkArgon2id/_Time:_3,_Memory:_32_MB,_Threads:_1-4 33558691 2800560 -91.65%

You may be referring to the number of allocations per operation. The benchmark does one allocation fewer per operation - for instance:
BenchmarkArgon2id/_Time:_3,_Memory:_32_MB,_Threads:_1-4 28 27 -3.57%

The difference is just one allocation per operation, but that's the large 32 MB buffer - which is exactly what I would expect from using a pool.
The patch: https://gist.github.com/aead/2f032134c1d08a3a05cb0e679a541416

@networkimprov (Author) commented Jan 20, 2020

Could we benchmark this with concurrent argon2 ops and allocations of random size amidst them? Memory fragmentation would produce different results. I wouldn't expect better performance from a pool in this benchmark.

@aead (Contributor) commented Jan 20, 2020

Feel free to do so. At the moment it's pure speculation whether the memory allocation is significant next to the hashing. As long as there is no benchmark proving it, I would consider this proposal not needed b/c it's either out of scope (see app rate limiting) or there's no evidence of a performance problem (caused by mem allocs).

@networkimprov (Author) commented Jan 22, 2020

I'll test this in my app when I have a chance.

@gopherbot please add Proposal-Hold

EDIT: cc @dmitshur re gopherbot failure (also tried the label in quotes)

@rsc rsc added this to Incoming in Proposals Jan 22, 2020
@rsc (Contributor) commented Jan 22, 2020

Proposal labels are restricted from Gopherbot. Not a bug.
We're also using the project statuses more than the labels now for the main (non-Go2) proposals.
In any event, this proposal does seem like it needs compelling performance numbers.

We don't add new API to fix performance problems that aren't measured,
and even for measured problems, we try as hard as possible to fix them
without new API. (See for example #33136.)

If you want to bring this proposal back, you need both a clear performance win
and an explanation for why you can't get that win with purely internal changes
like sync.Pool or a couple sync.Pools.
