
apiserver/storage: test caching and indexing layers at the unit level #109831

Open · 46 tasks done
stevekuznetsov opened this issue May 5, 2022 · 24 comments
Labels: kind/feature · sig/api-machinery · triage/accepted

@stevekuznetsov (Contributor) commented May 5, 2022

What would you like to be added?

Today, unit-level tests exist only for *etcd3.store{} and cannot be run against any other implementation of storage.Interface. In general, these tests define what it means to be a correct implementation of storage.Interface and use only exported functionality. We aim to refactor the tests to accept any storage.Interface, which would let the entire exported surface be exercised against an etcd storage layer with whatever caching and indexing layers sit on top.
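For illustration, a ported test could take roughly the following shape; this is a minimal sketch, assuming the shared helpers live in a hypothetical storagetesting package and reuse the example.Pod fixture type from the existing etcd3 tests (the key layout is illustrative; only the storage.Interface calls are existing API):

package storagetesting // hypothetical home for the shared tests

import (
	"context"
	"testing"

	"k8s.io/apiserver/pkg/apis/example"
	"k8s.io/apiserver/pkg/storage"
)

// RunTestGet uses only exported storage.Interface behavior, so the same body
// can be driven against *etcd3.store{}, the cacher, or any other
// implementation that the caller constructs and passes in.
func RunTestGet(ctx context.Context, t *testing.T, store storage.Interface) {
	out := &example.Pod{}
	key := "/pods/test-ns/test-pod" // illustrative key
	if err := store.Get(ctx, key, storage.GetOptions{IgnoreNotFound: true}, out); err != nil {
		t.Fatalf("Get failed: %v", err)
	}
	// ... assert on `out` against previously created fixtures ...
}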

The following tests can be ported as-is, since they use only storage.Interface methods:

In #109833:

  • TestCreateWithTTL
  • TestCreateWithKeyExist
  • TestGet
  • TestUnconditionalDelete
  • TestConditionalDelete
  • TestDeleteWithSuggestion
  • TestDeleteWithSuggestionAndConflict
  • TestDeleteWithSuggestionOfDeletedObject
  • TestValidateDeletionWithSuggestion
  • TestPreconditionalDeleteWithSuggestion
  • TestGetListNonRecursive
  • TestGuaranteedUpdateWithTTL
  • TestGuaranteedUpdateWithConflict
  • TestGuaranteedUpdateWithSuggestionAndConflict
  • TestCount
  • TestWatch
  • TestDeleteTriggerWatch
  • TestWatchFromNoneZero
  • TestWatchInitializationSignal

Other:

  • TestProgressNotify (requires that the underlying etcd is configured to send bookmarks)

The following tests currently require the etcd client, and need to be re-thought or adapted:

  • TestCreate uses the etcd client to: storage/testing: move creation test to generic package #109909
    • make sure nothing exists at the key before creation; it is unclear what this enforces, and it could perhaps use storage.Interface.Get() instead (a sketch of this appears after this list)
    • check invariants about the data in the store; this should be relegated to an *etcd3.store{}-specific test only
  • TestGuaranteedUpdate uses the etcd client to:
    • check invariants about the data in the store; this should be relegated to an *etcd3.store{}-specific test only
  • TestGuaranteedUpdateChecksStoredData uses the etcd client to:
    • put data into the store using a different codec
  • TestList uses the etcd client to: storage: split paginated and non-paginated list tests, make them generic #110024
    • share a single underlying storage between two *etcd3.store{} implementations (with and without paging)
  • TestListContinuation uses the etcd client to:
    • track the number of calls made to etcd
  • TestListPaginationRareObject uses the etcd client to:
    • track the number of calls made to etcd
  • TestListContinuationWithFilter uses the etcd client to:
    • track the number of calls made to etcd
  • TestListInconsistentContinuation uses the etcd client to:
    • cause a compaction in order to test clients asking for compacted revisions
  • TestWatchFromZero uses the etcd client to:
    • cause a compaction in order to test clients asking for compacted revisions
  • TestWatchError uses the etcd client to:
    • share a single underlying storage between two *etcd3.store{} implementations (with a valid and invalid codec)
  • TestWatchDeleteEventObjectHaveLatestRV uses the etcd client to:
    • create a raw watch; we check that the mod revision at which the raw watch reports the delete matches the revision the storage.Interface uses to report the deletion ... but if we know nothing else is touching our storage, why not just use storage.Interface.Get() (or the value returned from creating) to check that the last RV we saw is the one it's deleted at?
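As a hedged sketch of the first TestCreate point above (checking that nothing exists at the key via storage.Interface.Get() rather than the raw etcd client): the helper below is illustrative and expectKeyAbsent is a hypothetical name, while Get, GetOptions, and IsNotFound are existing exported API of the storage package.

package storagetesting // hypothetical, as in the sketch above

import (
	"context"
	"testing"

	"k8s.io/apiserver/pkg/apis/example"
	"k8s.io/apiserver/pkg/storage"
)

// expectKeyAbsent asserts pre-Create emptiness through storage.Interface
// rather than through the raw etcd client.
func expectKeyAbsent(ctx context.Context, t *testing.T, store storage.Interface, key string) {
	t.Helper()
	out := &example.Pod{}
	err := store.Get(ctx, key, storage.GetOptions{}, out)
	if err == nil {
		t.Fatalf("expected no object at %q before Create, got: %#v", key, out)
	}
	if !storage.IsNotFound(err) {
		t.Fatalf("expected a not-found error at %q, got: %v", key, err)
	}
}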

The following tests currently touch internals of *etcd3.store{}:

  • TestGuaranteedUpdate touches:
    • store.transformer in order to inject staleness to cause an update when the semantic value of the data does not change (a possible non-invasive hook for this is sketched after this list)
  • TestGuaranteedUpdateChecksStoredData touches:
    • store.transformer in order to inject staleness to cause an update when the semantic value of the data does not change
  • TestTransformationFailure touches:
    • store.transformer in order to test behavior under transformation failure
  • TestListContinuation touches:
    • store.transformer to count the number of items read from etcd
  • TestListPaginationRareObject touches:
    • store.transformer to count the number of items read from etcd
  • TestListContinuationWithFilter touches:
    • store.transformer to count the number of items read from etcd
  • TestConsistentList touches:
    • store.transformer to ... cause side-effect reads during a paginated LIST call? @wojtek-t, you wrote this; perhaps there's a less invasive way to do it
  • TestLeaseMaxObjectCount touches:
    • store.leaseManager - this is etcd-specific, so perhaps it remains in the *etcd3.store{} level?
  • TestWatchFromZero touches:
    • store.versioner, but that is a singleton and there's no real reason it needs to be an instance variable of the store as opposed to an exported package-level function?
  • TestWatchContextCancel touches: storage/testing: move cancelled watch test to generic package #109914
    • store.watcher.Watch(), but we're testing user-facing behavior of the watch method, so it's not clear why calling the watcher specifically is necessary here
  • TestWatchErrResultNotBlockAfterCancel touches:
    • store.watcher.createWatchChan() - this looks like an implementation detail of the watcher?
  • TestWatchDeleteEventObjectHaveLatestRV touches:
    • store.versioner, but that is a singleton and there's no real reason it needs to be an instance variable of the store as opposed to an exported package-level function?
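One possible way to make the transformer-dependent tests above reusable without reaching into *etcd3.store{}: have the test-facing constructor expose an optional hook, which other implementations could satisfy by delegating to the store they wrap. This is purely illustrative, not an existing upstream API; value.Transformer is the real type the etcd3 store already uses.

package storagetesting // hypothetical

import "k8s.io/apiserver/pkg/storage/value"

// TransformerWrapper is an illustrative, optional capability a test-facing
// constructor could expose so generic tests can inject staleness, force
// transformation failures, or count reads without touching store internals.
type TransformerWrapper interface {
	// WrapTransformer replaces the current transformer with wrap(current)
	// and returns a function that restores the original.
	WrapTransformer(wrap func(value.Transformer) value.Transformer) (restore func())
}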

Next steps:

  • migrate all of the tests that can be used as-is into *_tests.go files, change their signatures to accept a storage.Interface, and import them to test the caching and indexing layers (storage/etcd3: factor tests to accept storage.Interface #109833)
  • decide, for each test that touches the etcd client, whether it could instead use the storage.Interface itself, etc.
  • similarly, figure out whether the tests that touch internals of *etcd3.store{} are useful for testing other storage.Interface implementations and, if so, how they can be adapted

/sig api-machinery
/assign @stevekuznetsov @wojtek-t
/cc @liggitt @lavalamp

Why is this needed?

To improve our ability to test the caching and indexing layers that sit on top of the etcd storage implementation.

@stevekuznetsov stevekuznetsov added the kind/feature Categorizes issue or PR as related to a new feature. label May 5, 2022
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 5, 2022
@leilajal (Contributor) commented May 5, 2022

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 5, 2022
@wojtek-t (Member) commented May 6, 2022

A couple of thoughts from me:

TestCreate uses the etcd client to:

Indeed, I would switch to storage.Interface.Get instead.
For checking invariants about the data, I would make a dedicated etcd3-specific test (that just does Create + check invariants).

TestGuaranteedUpdate uses the etcd client to:

Making a generic test from that seems trivial, but I'm not yet sure how to make a simple etcd-specific additional test...

TestList uses the etcd client to:

Can we just make the test take two implementations of storage.Interface as input? Then it trivially generalizes, no?
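A sketch of that shape, with illustrative names; the assumption is that the caller constructs both stores over the same underlying data (for etcd3, one with and one without paging), so writes through one are visible through the other:

package storagetesting // hypothetical

import (
	"context"
	"testing"

	"k8s.io/apiserver/pkg/storage"
)

// RunTestList is an illustrative signature: the test only talks to the two
// stores through exported calls and compares their LIST behavior, leaving it
// to the caller to decide how the pair is constructed.
func RunTestList(ctx context.Context, t *testing.T, pagingStore, nonPagingStore storage.Interface) {
	// Seed objects through pagingStore, then LIST through both stores and
	// assert that results and continuations are consistent.
}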

TestListContinuation uses the etcd client to:
TestListPaginationRareObject uses the etcd client to:
TestListContinuationWithFilter uses the etcd client to:

Hmm, it's similar to TestGuaranteedUpdate: it seems trivial to generalize (by removing those calls), but I'm not sure how to test that functionality without more or less duplicating the same test...

TestListInconsistentContinuation uses the etcd client to:
TestWatchFromZero uses the etcd client to:

I'm wondering if we would be able to somehow pass a compaction callback or something like that to the test.
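Sketched concretely (names are illustrative): the generic test takes a compaction hook that each caller wires up itself, e.g. real etcd compaction for *etcd3.store{}, and a skip or a suitable equivalent for other implementations.

package storagetesting // hypothetical

import (
	"context"
	"testing"

	"k8s.io/apiserver/pkg/storage"
)

// Compaction is an illustrative hook: compact the backend past the given
// resource version, however the concrete implementation does that.
type Compaction func(ctx context.Context, t *testing.T, resourceVersion string)

// RunTestListInconsistentContinuation relies on the hook instead of the etcd
// client, keeping the compaction mechanism implementation-specific.
func RunTestListInconsistentContinuation(ctx context.Context, t *testing.T, store storage.Interface, compact Compaction) {
	if compact == nil {
		t.Skip("no compaction hook provided; skipping compaction-dependent test")
	}
	// ... LIST with a continue token, compact past it via compact(...), and
	// assert the expired-continuation behavior defined by storage.Interface.
}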

TestWatchError uses the etcd client to:

Can we pass two stores to the test?

TestWatchDeleteEventObjectHaveLatestRV uses the etcd client to:

Roughly what you suggested...

@wojtek-t (Member) commented May 6, 2022

Thinking more about TestGuaranteedUpdate (and similar cases where we just check simple invariants).

Can we pass an additional callback to the test function that will be called at the appropriate point? In the etcd test we can just pass checkStorageInvariants; in the watchcache ones we can basically pass a no-op function...
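Sketched concretely (names are illustrative): the etcd3 caller passes its existing invariant check as the callback, while the watchcache caller passes nil or a no-op.

package storagetesting // hypothetical

import (
	"context"
	"testing"

	"k8s.io/apiserver/pkg/storage"
)

// KeyValidation is an illustrative callback for implementation-specific checks
// on the stored data; the etcd3 caller would implement it with its client and
// transformer, other callers can pass nil.
type KeyValidation func(ctx context.Context, t *testing.T, key string)

func RunTestGuaranteedUpdate(ctx context.Context, t *testing.T, store storage.Interface, validate KeyValidation) {
	key := "/pods/test-ns/test-pod" // illustrative key
	// ... drive store.GuaranteedUpdate through the exported interface ...
	if validate != nil {
		validate(ctx, t, key) // e.g. checkStorageInvariants for the etcd3 run
	}
}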

@liggitt (Member) commented May 6, 2022

TestWatchDeleteEventObjectHaveLatestRV uses the etcd client to:

  • create a raw watch, we check that the mod revision at which the raw watch shows us the delete is the same as the revision the storage.Interface uses to report the deletion ... but if we know nothing else is touching our storage, why not just use storage.Interface.Get() (or the returned value from creating) to check that the last RV we saw is what it's deleted at?

Maybe I'm misunderstanding, but the RV returned from a Get or from the Create is explicitly not what we want to be the RV in a delete watch event or delete response. The RV on a delete response or watch event must be distinct from, and later than, the RV of the last write to the object.

instrumenting the current test:

diff --git a/staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher_test.go b/staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher_test.go
index 273bb1c4389..21f42ec1f06 100644
--- a/staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher_test.go
+++ b/staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher_test.go
@@ -293,6 +293,7 @@ func TestWatchDeleteEventObjectHaveLatestRV(t *testing.T) {
 		t.Fatalf("Watch failed: %v", err)
 	}
 	rv, err := APIObjectVersioner{}.ObjectResourceVersion(storedObj)
+	t.Log("created",rv)
 	if err != nil {
 		t.Fatalf("failed to parse resourceVersion on stored object: %v", err)
 	}
@@ -310,12 +311,14 @@ func TestWatchDeleteEventObjectHaveLatestRV(t *testing.T) {
 		t.Fatalf("timed out waiting for watch event")
 	}
 	deletedRV, err := deletedRevision(watchCtx, etcdW)
+	t.Log("deleted",deletedRV)
 	if err != nil {
 		t.Fatalf("did not see delete event in raw watch: %v", err)
 	}
 	watchedDeleteObj := e.Object.(*example.Pod)
 
 	watchedDeleteRev, err := store.versioner.ParseResourceVersion(watchedDeleteObj.ResourceVersion)
+	t.Log("watched",watchedDeleteRev)
 	if err != nil {
 		t.Fatalf("ParseWatchResourceVersion failed: %v", err)
 	}

you get

go test k8s.io/apiserver/pkg/storage/etcd3 -run TestWatchDeleteEventObjectHaveLatestRV -v
=== RUN   TestWatchDeleteEventObjectHaveLatestRV
    watcher_test.go:296: created 2
    watcher_test.go:314: deleted 3
    watcher_test.go:321: watched 3

@stevekuznetsov (Contributor, Author) commented

@liggitt thanks. It seems I had a mistaken assumption that etcdWatch.Events[0].Kv.ModRevision would be the revision of the last modification, but etcd sets it to the logical time at which the kv was deleted. This test can then only exist at the etcd unit level unless we make it less strict and allow any revision greater than the previous one.

@stevekuznetsov (Contributor, Author) commented

@liggitt mentioned this test plan in a Slack thread, wanted to add it here for posterity:

  1. create an object, record RV (createdRV)
  2. start a watch on the object
  3. delete the object, record RV (deletedRV)
  4. assert deletedRV != createdRV
  5. wait for watch to observe delete (tolerating any intermediate updates from other actors, I guess)
  6. assert deleted watch event is received
  7. assert deleted watch event RV == deletedRV
  8. assert deleted watch event RV is different from any intermediate update watch event RVs

Unfortunately, we can't do step 3: clients definitely do not get anything back from a DELETE call, and it looks like the current implementation for etcd puts the resourceVersion of the value used for preconditions into the out object passed in. @logicalhan mentioned that it would be useful and more complete to let clients know the resourceVersion at which their DELETE was committed, but this seems like an orthogonal bit of work, and the test should, I guess, remain etcd-specific until that point?

@logicalhan (Member) commented

@liggitt mentioned this test plan in a Slack thread, wanted to add it here for posterity: [...] the test should, I guess, remain etcd-specific until that point?

This is the type of test that should become possible once we introduce the interface layer right above etcdClient, so I think it's okay to omit this for now on the presumption that it will be tested in the interface we're proposing.

@stevekuznetsov (Contributor, Author) commented

@logicalhan would it be sufficient to test that the storage provides this semantic to the storage.Interface implementation? As this is end-client-visible, I would have thought that we want to make sure we continue to forward this property up the stack.

@logicalhan (Member) commented

As this is end-client-visible I would have thought that we want to make sure that we continue to forward this property up the stack

I would be okay with just testing this on the inner layer; it should transitively retain the properties on the outer layer. But that's just me. Porting this logic to the outer layer involves patching storage.Interface and the implementation, which is considerably more involved.

@k8s-triage-robot commented

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2022
@k8s-triage-robot commented

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 22, 2022
@wojtek-t (Member) commented

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 23, 2022
@MadhavJivrajani (Contributor) commented

reusing those tests for watchcache

I can take a stab at this if no one is already working on it!

@wojtek-t (Member) commented Nov 7, 2022

I can take a stab at this if anyone isn't already working on it!

A bit of that already happened in #113427 and #113588.

I have a bit more started in my local branch (#113666 is a preparation for making that possible), but my work (for now) is limited to lists.

If you want to help with making that for other tests (in particular Create, GuaranteedUpdate and Delete tests), that would definitely be useful.

@wojtek-t (Member) commented

I have a bit more started in my local branch (#113666 is a preparation for making that possible), but my work (for now) is limited to lists.

and watches

@k8s-triage-robot commented

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2023
@wojtek-t (Member) commented Feb 8, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2023
@MadhavJivrajani (Contributor) commented Feb 9, 2023

@wojtek-t @stevekuznetsov I see from the issue description that those tests are taken care of; is there anything else left that I can help with here?

@wojtek-t (Member) commented Feb 9, 2023

@MadhavJivrajani - the next step I would like to take here (which has been started, but not much has happened yet) is to ensure that all the tests we generalized will actually also be run against the watchcache:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/tests/cacher_test.go

There are three of them already there (and one more on its way in #114656), but we need to enable all of the tests.

If you have capacity to help with this, that would be great.

The next bit (or ideally something that can be taken on simultaneously): we need to audit the tests that exist in that file and either:

  • confirm those cases are already covered by the generic tests, or
  • extend the generic tests if the cacher-specific tests cover more cases,

and then get rid of the cacher-specific tests from that file. Everything there should be generic. A sketch of the reuse pattern follows.
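As a rough sketch of that reuse pattern (assuming the generalized helpers take the RunTestX(ctx, t, store) shape introduced by #109833 and live in the storage/testing package, imported here as storagetesting; the wrapper name and package are illustrative):

package cachertesting // illustrative

import (
	"context"
	"testing"

	"k8s.io/apiserver/pkg/storage"
	storagetesting "k8s.io/apiserver/pkg/storage/testing"
)

// runGenericTests shows the intended shape: every test generalized under this
// issue is invoked against a cacher-backed storage.Interface constructed by
// the caller, exactly as the etcd3 package runs it against *etcd3.store{}.
func runGenericTests(ctx context.Context, t *testing.T, cacher storage.Interface) {
	t.Run("Get", func(t *testing.T) { storagetesting.RunTestGet(ctx, t, cacher) })
	t.Run("CreateWithTTL", func(t *testing.T) { storagetesting.RunTestCreateWithTTL(ctx, t, cacher) })
	// ... and so on for each generalized test.
}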

@k8s-triage-robot commented

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 9, 2024
@wojtek-t (Member) commented Feb 9, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 9, 2024