fix races in etcdclient #801
Conversation
@tjungblu: This pull request references Bugzilla bug 2077833, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/payload 4.11 ci informing
@tjungblu: trigger 4 jobs of type informing for the ci release of OCP 4.11
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8d291500-c553-11ec-9922-b75bcdbb766d-0
@tjungblu: This pull request references Bugzilla bug 2077833, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest-required
getting some more data here, sorry for the additional cost :/
@tjungblu: An error was encountered. No known errors were detected, please see the full error message for details. Full error message.
could not resolve jobs for 4.11 ci required: job type is not supported: required
Please contact an administrator to resolve this issue.
/payload 4.11 ci informing
@tjungblu: trigger 4 jobs of type informing for the ci release of OCP 4.11
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/85902f60-c574-11ec-9478-b916e1f81898-0
/retest-required
/lgtm
/lgtm
/payload 4.11 ci blocking
@tjungblu: trigger 5 jobs of type blocking for the ci release of OCP 4.11
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9ce83b00-c5f4-11ec-8237-3b5624e6e5ea-0
Makes sense, but just a question on the placement of the existing Lock() in getEtcdClient().
@tjungblu: This pull request references Bugzilla bug 2077833, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest-required
@tjungblu: trigger 5 jobs of type blocking for the ci release of OCP 4.11
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/779eb580-c7e9-11ec-9e9d-eee4fe11ed2b-0
/retest-required
/retest-required
/lgtm
/hold cancel
grpclog.SetLoggerV2(grpclog.NewLoggerV2(ioutil.Discard, ioutil.Discard, os.Stderr))
clientOpts, err := newClientOpts(opts...)
if err != nil {
-	return nil, err
+	return nil, fmt.Errorf("error during clientOpts: %w", err)
}
Good addition of clearer error messages throughout. Much appreciated!
thanks @Elbehery!
/lgtm
SGTM, just a couple nits and a question on what the tickets actually represent.
pkg/etcdcli/etcdcli_pool.go
Outdated
returnTicket(p.availableTickets)
err := p.closeFunc(client)
if err != nil {
	klog.Errorf("could not close cache client exceeding pool capacity: %v", err)
Slightly confusing at first read. I thought initially the error was because we had exceeded the client pool capacity.
A bit wordy but maybe something like this?
klog.Errorf("could not close cache client exceeding pool capacity: %v", err) | |
klog.Errorf("failed to close extra etcd client which is not being re-added in the client pool: %v", err) |
great suggestion, let me add this
assert.Errorf(t, err, "too many active cache clients, rejecting to create new one")
// returning one should unlock the get again
poolRecorder.pool.Return(clients[0])
client, err = poolRecorder.pool.Get()
Missing error check.
client, err = poolRecorder.pool.Get()
+	require.NoError(t, err)
done
pkg/etcdcli/etcdcli_pool_test.go
Outdated
}
assert.Equal(t, maxNumCachedClients, poolRecorder.numCloseCalls)

// now we should be able to get the full amount of tickets again
-	// now we should be able to get the full amount of tickets again
+	// now we should be able to get the full amount of clients again
I think you mean clients and not tickets here as the number of available tickets is just 5 at this point.
This threw me off initially as it made me think the tickets were returned once the client was as well.
So I expected the following unit test to succeed.
func TestAllTicketsAvailableForFullyCachedPool(t *testing.T) {
	integration.BeforeTestExternal(t)
	testServer := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 3})
	defer testServer.Terminate(t)
	poolRecorder := newTestPool(testServer)

	var clients []*clientv3.Client
	for i := 0; i < maxNumCachedClients; i++ {
		client, err := poolRecorder.pool.Get()
		require.NoError(t, err)
		assert.NotNil(t, client)
		clients = append(clients, client)
	}

	// return all clients to the client pool and check that none are closed
	for _, client := range clients {
		poolRecorder.pool.Return(client)
	}
	assert.Equal(t, maxNumCachedClients, poolRecorder.numNewCalls)
	assert.Equal(t, maxNumCachedClients, poolRecorder.numEndpointCalls)
	assert.Equal(t, maxNumCachedClients, poolRecorder.numHealthCalls)
	assert.Equal(t, 0, poolRecorder.numCloseCalls)

	// The tickets should equal the max available (10) since all clients are ready to be taken from the cached pool
	assert.Equal(t, maxNumClientTickets, len(poolRecorder.pool.availableTickets))

	// Grab a cached client from the pool
	client, err := poolRecorder.pool.Get()
	require.NoError(t, err)
	assert.NotNil(t, client)
	assert.Equal(t, maxNumClientTickets-1, len(poolRecorder.pool.availableTickets))

	// Return it back and all tickets should be available again
	poolRecorder.pool.Return(client)
	assert.Equal(t, maxNumClientTickets, len(poolRecorder.pool.availableTickets))
}
But looking through the code it seems tickets used == number of clients currently open. So just wanted to confirm this.
indeed, there are two limits:
- pool size, how many active clients we are keeping cached at max
- tickets, how many clients we are handing out in general
The first should ensure we are not creating new clients all the time but keep them cached, because the TLS handshakes are expensive. It is still bounded by maxNumCachedClients, as too many idle connections to etcd are also not helpful.
The second should ensure that on connectivity issues we do not overwhelm etcd additionally by creating more connections. The behavior in a previous version before the ticketing was:
- I have to get the members, can I get a cached client?
- pool is filled, let me grab one and try to health check
- none of the healthy clients are available, let's create a new one
The last bullet was basically creating hundreds of new clients while there was no connectivity for some time period. The ticketing rate-limits this and ensures an explicit failure when it happens. Imagine going to a full amusement park: you have to wait for somebody to come out before you can get in.
> But looking through the code it seems tickets used == number of clients currently open. So just wanted to confirm this.

yep, good summary. Shall we rename that ticketing aspect to numOpenClients / numActiveClients / numCreatedClients?
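To make the two limits concrete, here is a minimal, hypothetical Go sketch of the ticketing idea. The constant names and the availableTickets field follow the test code in this PR (the values 5 and 10 are inferred from the discussion above), but the rest of the structure — newClientPool, newFunc, closeFunc wiring — is an assumption for illustration, not the actual implementation:

package etcdcli

import (
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
	"k8s.io/klog/v2"
)

const (
	maxNumCachedClients = 5  // at most this many idle clients are kept cached (assumed value)
	maxNumClientTickets = 10 // at most this many clients may be open overall (assumed value)
)

// clientPool sketches the pooling idea: a buffered channel of cached clients
// plus a buffered channel of tickets that bounds how many clients are open.
type clientPool struct {
	pool             chan *clientv3.Client
	availableTickets chan struct{}
	newFunc          func() (*clientv3.Client, error)
	closeFunc        func(*clientv3.Client) error
}

func newClientPool(newFunc func() (*clientv3.Client, error), closeFunc func(*clientv3.Client) error) *clientPool {
	p := &clientPool{
		pool:             make(chan *clientv3.Client, maxNumCachedClients),
		availableTickets: make(chan struct{}, maxNumClientTickets),
		newFunc:          newFunc,
		closeFunc:        closeFunc,
	}
	// fill the ticket bucket; every open client consumes exactly one ticket
	for i := 0; i < maxNumClientTickets; i++ {
		p.availableTickets <- struct{}{}
	}
	return p
}

// Get consumes a ticket, then prefers a cached client over creating a new one.
// When no tickets are left it fails explicitly instead of opening yet another
// connection to an already struggling etcd.
func (p *clientPool) Get() (*clientv3.Client, error) {
	select {
	case <-p.availableTickets:
	default:
		return nil, fmt.Errorf("too many active cache clients, rejecting to create new one")
	}
	select {
	case client := <-p.pool:
		// health checking / endpoint refresh of the cached client would happen here
		return client, nil
	default:
		client, err := p.newFunc()
		if err != nil {
			p.availableTickets <- struct{}{} // hand the ticket back, nothing was opened
			return nil, err
		}
		return client, nil
	}
}

// Return always hands the ticket back; the client itself is only cached when
// there is room in the pool, otherwise it is closed.
func (p *clientPool) Return(client *clientv3.Client) {
	p.availableTickets <- struct{}{}
	select {
	case p.pool <- client:
	default:
		if err := p.closeFunc(client); err != nil {
			klog.Errorf("failed to close extra etcd client which is not being re-added in the client pool: %v", err)
		}
	}
}

Modelling tickets as a buffered channel means both the rate limit and the explicit "park is full" failure fall out of a single non-blocking channel receive.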
pkg/etcdcli/etcdcli_pool_test.go
Outdated
}

// replenish the maxNumCachedClients that were closed earlier
assert.Equal(t, maxNumClientTickets+maxNumCachedClients, poolRecorder.numNewCalls)
Can we add a check for the number of available tickets too, just to make it explicit that available tickets are the capacity to create more clients?
assert.Equal(t, maxNumClientTickets+maxNumCachedClients, poolRecorder.numNewCalls)
+	assert.Equal(t, maxNumClientTickets-maxNumCachedClients, len(poolRecorder.pool.availableTickets))
yep, in this case the tickets are zero as all clients are handed out:
// no tickets are available anymore, as we have handed out all clients
assert.Equal(t, 0, len(poolRecorder.pool.availableTickets))
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dusk125, EmilyM1, geliu2016, hasbro17, tjungblu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest-required
Please review the full test history for this PR and help us cut down flakes.
@tjungblu: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/cherry-pick 4.10
@tjungblu: cannot checkout
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherry-pick release-4.10
@tjungblu: #801 failed to apply on top of branch "release-4.10":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
There is a bug while caching the client: the order in which the endpoints are returned is not deterministic. That causes the cached client to be constantly recreated from scratch, invalidating the old one (potentially while it is still in use).
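As an illustration of how such order-dependent comparisons can be avoided, here is a hypothetical, minimal sketch of an order-insensitive endpoint comparison; the helper name endpointsEqual is an assumption and is not taken from this PR:

import "sort"

// endpointsEqual treats two endpoint lists as equal regardless of the
// (non-deterministic) order they were returned in, so a cached client is not
// rebuilt just because the ordering changed between calls.
func endpointsEqual(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as := append([]string(nil), a...)
	bs := append([]string(nil), b...)
	sort.Strings(as)
	sort.Strings(bs)
	for i := range as {
		if as[i] != bs[i] {
			return false
		}
	}
	return true
}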
The new client works in a pooling fashion, which handles the mutual exclusion between returning an existing client and creating a new one. Health checks and member updates are handled during retrieval, as is the close. Once a consumer is done, they simply return the client to the pool.
Some additional logging was added as well.
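For completeness, a usage sketch of the consumer side described above, building on the clientPool sketch earlier in this thread; pool.Get/pool.Return follow the naming used in this PR's tests, while listMembers and the surrounding wiring are assumptions for illustration:

import (
	"context"
	"fmt"

	"go.etcd.io/etcd/api/v3/etcdserverpb"
)

// listMembers is a hypothetical consumer: borrow a client from the pool, use
// it, and always hand it back so its ticket (and cache slot) is freed again.
func listMembers(ctx context.Context, pool *clientPool) ([]*etcdserverpb.Member, error) {
	client, err := pool.Get()
	if err != nil {
		return nil, fmt.Errorf("failed to get etcd client from pool: %w", err)
	}
	defer pool.Return(client)

	resp, err := client.MemberList(ctx)
	if err != nil {
		return nil, fmt.Errorf("failed to list etcd members: %w", err)
	}
	return resp.Members, nil
}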