Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor cache metadata interface. #2316

Merged
merged 1 commit into from
Aug 25, 2021
Merged

Refactor cache metadata interface. #2316

merged 1 commit into from
Aug 25, 2021

Conversation

sipsma
Copy link
Collaborator

@sipsma sipsma commented Aug 17, 2021

There are a few goals with this refactor:

  1. Remove external access to fields that no longer make sense and/or
    won't make sense soon due to other potential changes. For example,
    there can now be multiple blobs associated with a ref (for different
    compression types), so the fact that you could access the "Blob"
    field from the Info method on Ref incorrectly implied there was just
    a single blob for the ref. This is on top of the fact that there is
    no need for external access to blob digests.
  2. Centralize use of cache metadata inside the cache package.
    Previously, many parts of the code outside the cache package could
    obtain the bolt storage item for any ref and read/write it directly.
    This made it hard to understand what fields are used and when. Now,
    the Metadata method has been removed from the Ref interface and
    replaced with getters+setters for metadata fields we want to expose
    outside the package, which makes it much easier to track and
    understand. Similar changes have been made to the metadata search
    interface.
  3. Use a consistent getter+setter interface for metadata, replacing
    the mix of interfaces like Metadata(), Size(), Info() and other
    inconsistencies.

Signed-off-by: Erik Sipsma erik@sipsma.dev

@sipsma sipsma requested a review from tonistiigi August 17, 2021 20:50

const keyDeleted = "cache.deleted"
type MetadataStore interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't like this. Cache/metadata package should not know about every single possible caller it might have. A new caller can define a unique name for itself and get access to store information. If there are helper methods they should be on caller side. Same for GetSharedKey/GetGitRemote/GetGitSnapshot/Etag methods.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked it because it made it made it possible to easily know all the metadata that's being tracked for a given cache record by just checking one place (in the cache package) rather than searching all over the codebase. Understanding what metadata is being tracked for a given record can obviously be important when you are making behavioral changes to the cache package.

If you disagree with the approach, what would you think of exposing the generic methods for setting/getting arbitrary values and then having outside callers just use those? While not as convenient, you can at least still find usages of those generic getters/setters to track down metadata that's being associate with records.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

know all the metadata that's being tracked

There shouldn't be any need to manage this together. A caller only knows about its use case, it should not have dependency/knowledge of other callers. If new callers are added or removed it should not concern cache pkg or other callers. Cache package just provides an arbitrary amount of callers access to persistent storage for a ref.

what would you think of exposing the generic methods for setting/getting arbitrary values

Yes, generic helpers without fixed keys are fine. Helpers with fixed keys are also fine on caller side (should be private though if they are only used by one package).

return si.SetValue(b, keyDiffID, v)
})
return nil
type Metadata interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more like RefMetadata now. No changes needed if you disagree.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that clarifies that it's specific to a record rather than metadata for the cache as a whole, will update

cache/manager.go Outdated
if err := MigrateV2(
context.TODO(),
filepath.Join(opt.MetadataStoreRoot, "metadata.db"),
filepath.Join(opt.MetadataStoreRoot, "metadata_v2.db"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this change. Why would the cache database need to have a constant name from the package standpoint?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that the cache package is the only one with direct access to the raw store and then just provides an interface for interacting with it to other packages (as opposed to before where a bunch of different packages had direct access to the raw store). It still needs to be told where to put the data on disk though, so I switched to this parameter.

This approach makes more sense to me but if you disagree I don't have any issue with switching back to passing the store as long as NewManager is the only place it gets passed to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't make sense to me that you can't create two instances of manager with metadata.db in same dir. It is some internal structure limitation that leaks to the public API. I'm ok with for example just adding MetadataDBV1 or smth in manageropt if you want to move migration here. It is bit weird though that all other stores are initialized and passed in as interface, and these are now passed as strings.

cache/refs.go Outdated
@@ -32,33 +30,19 @@ import (
// Ref is a reference to cacheable objects.
type Ref interface {
Mountable
ID() string
Metadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change needed, but I kind of liked it as separate from the rest of the implementation. Guess it can still be nullable if implementation just defines embedded interface as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found that having all the getters/setters as methods on refs rather than detached functions made it way easier to write code and remember what metadata exists. It works much better with IDE autocompletion and is slightly less verbose.

Release(context.Context) error
Size(ctx context.Context) (int64, error)
Metadata() *metadata.StorageItem
IdentityMapping() *idtools.IdentityMapping
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is IdentityMapping removed. It is not fully connected (used atm only in moby) but seems like required data

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this can be accessed after Mount is called from snapshot.Mountable. Not sure if this is enough for all cases or something might need this info without actually doing a mount.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I didn't see that it was ever used so I got rid of it. If moby needs it though then I'll just re-add it, not a big deal at all.

@sipsma sipsma force-pushed the rm-info branch 2 times, most recently from 0336182 to 8cbffe8 Compare August 23, 2021 23:15
@sipsma sipsma requested a review from tonistiigi August 24, 2021 00:46
GetString(string) string
GetExternal(string) ([]byte, error)

SetValue(string, interface{}, string) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this used where it requires to be public and interface{}? Current interface is weird as there is GetString/SetValue that don't have a matching getter/setter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed it didn't make sense anymore now that the methods were public, changed to SetString

GetExternal(string) ([]byte, error)

SetValue(string, interface{}, string) error
SetExternal(string, []byte) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pair with GetExternal

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@sipsma
Copy link
Collaborator Author

sipsma commented Aug 25, 2021

There was a single random test failure in CI: https://gist.github.com/sipsma/a171fd146fa01cee69175e5a69ca793c

Failure is from this line:

checkAllRemoved(t, c, sb)

I can't reproduce locally, even running with -count=20 so not sure if it's related to any changes here or a pre-existing rare failure. Retrying CI to see if it happens again there for some reason.

@sipsma sipsma requested a review from tonistiigi August 25, 2021 01:09
There are a few goals with this refactor:
1. Remove external access to fields that no longer make sense and/or
   won't make sense soon due to other potential changes. For example,
   there can now be multiple blobs associated with a ref (for different
   compression types), so the fact that you could access the "Blob"
   field from the Info method on Ref incorrectly implied there was just
   a single blob for the ref. This is on top of the fact that there is
   no need for external access to blob digests.
2. Centralize use of cache metadata inside the cache package.
   Previously, many parts of the code outside the cache package could
   obtain the bolt storage item for any ref and read/write it directly.
   This made it hard to understand what fields are used and when. Now,
   the Metadata method has been removed from the Ref interface and
   replaced with getters+setters for metadata fields we want to expose
   outside the package, which makes it much easier to track and
   understand. Similar changes have been made to the metadata search
   interface.
3. Use a consistent getter+setter interface for metadata, replacing
   the mix of interfaces like Metadata(), Size(), Info() and other
   inconsistencies.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma
Copy link
Collaborator Author

sipsma commented Aug 25, 2021

Rebased. There was another ephemeral test failure on a previous run, different from the one I mentioned previously: https://gist.github.com/sipsma/593107a30f4886ff2b3dbb02a4862361

I looked around and can't see any clear way those failure could be related to the changes here and can't reproduce them locally after many runs. I also looked through past CI runs and found similar failures from before this PR:

So I don't think it's likely those ephemeral failures are related to the change here.

@tonistiigi
Copy link
Member

@ktock #2316 (comment)

@tonistiigi tonistiigi merged commit 89ebbe5 into moby:master Aug 25, 2021
@crazy-max crazy-max added this to the v0.10.0 milestone Feb 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants