Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add image soft prune (api object yes, registry no) #17480

Merged

Conversation

gabemontero
Copy link
Contributor

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 27, 2017
@gabemontero
Copy link
Contributor Author

@gabemontero
Copy link
Contributor Author

results from make update pushed in separate commit to facilitate review

will squash once all feedback on code changes is addressed

@jim-minter
Copy link
Contributor

I'm dubious about the flag name --leave-registry-alone, otherwise looking good. Does @openshift/cli-review need to sign off?

@gabemontero
Copy link
Contributor Author

gabemontero commented Nov 27, 2017 via email

@gabemontero
Copy link
Contributor Author

unit test failure was the recently discovered flake /race condition #17472

}
})

g.It("should prune old image with config but leave registry alone", func() { testSoftPruneImages(oc) })
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/ with config//

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

part of next push

@@ -162,6 +165,7 @@ func NewCmdPruneImages(f *clientcmd.Factory, parentName, name string, out io.Wri
cmd.Flags().StringVar(&opts.CABundle, "certificate-authority", opts.CABundle, "The path to a certificate authority bundle to use when communicating with the managed Docker registries. Defaults to the certificate authority data from the current user's config file. It cannot be used together with --force-insecure.")
cmd.Flags().StringVar(&opts.RegistryUrlOverride, "registry-url", opts.RegistryUrlOverride, "The address to use when contacting the registry, instead of using the default value. This is useful if you can't resolve or reach the registry (e.g.; the default is a cluster-internal URL) but you do have an alternative route that works. Particular transport protocol can be enforced using '<scheme>://' prefix.")
cmd.Flags().BoolVar(&opts.ForceInsecure, "force-insecure", opts.ForceInsecure, "If true, allow an insecure connection to the docker registry that is hosted via HTTP or has an invalid HTTPS certificate. Whenever possible, use --certificate-authority instead of this dangerous option.")
cmd.Flags().BoolVar(opts.LeaveRegistryAlone, "leave-registry-alone", *opts.LeaveRegistryAlone, "If true, the prune operation will clean up image API objects in master and etcd, but none of the associated content in the registry is removed.")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about --metadata-only?
s/in master and etcd/in etcd/
I'd add a recommendation to run this only in case of enormously large etcd/too many images and errors from pruning blobs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recommendation / string change part of next push

yeah, to elaborate on my thought process here, I considered things like --soft-prune and --metadata-only but felt like they were terminology that speaks to use and developers/owners of the code, but perhaps wouldn't to users with less familiarity / background

where as --leave-registry-alone, klunky as it might be speaks of the registry concept which the user should be familiar with.... I also considered --do-not-prune-registry. At the time, I thought that was not as good, but in hindsight, perhaps it is better than --leave-registry-alone.

@miminar @jim-minter @bparees thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"If true, the prune operation will clean up image API objects in master and etcd, but none of the associated content in the registry is removed."

->

"If true, the prune operation will clean up image API objects, but the none of the associated content in the registry is removed."

As for the flag name, "--metadata-only", "--skip-registry", "--do-not-prune-registry", or "--skip-blobs" would all be preferred for me, in the end this flag is going to come down to how good the help+doc are(the flag name is never going to convey enough information), so i'd rather have a "clean" name than a clunky one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll go with --skip-registry

Change of first sentence of help to @bparees recommendation part of next push

re: the recommendation from @miminar on when to run, my next push currently has:
This option make most sense when there is either a large amount of data in etcd and/or a large number of images or blobs. You would then use this option to clean up etcd, and use the 'hard prune' administrative task (https://docs.openshift.org/latest/admin_guide/pruning_resources.html#hard-pruning-registry) to clean up the registry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure where you're intending for that text to appear, but i'd suggest it should really only be part of the documentation. It's too wordy to convey in CLI help. (Also direct links to the doc in CLI output are problematic because the OCP version of the CLI should send users to the OCP docs, not the origin docs (and in particular, to the specific OCP docs for their version).

We need a general image pruning recommendation/best-practices guide (should also cover what pruning does (what will be pruned under what conditions, what references are respected, etc) and this would belong there imho.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I was proposing putting that text in the CLI help in response to @miminar 's suggestion

I'll remove and just stick to the first sentence.

Of course the best practice guide you mention sounds useful. Based on the exchanges in the card, I had only planned on making the necessary (and smaller in scope) changes specific to the new option in the existing https://docs.openshift.org/latest/admin_guide/pruning_resources.html#pruning-images where a broader best practice guide would be beyond that scope. That said, perhaps this new sentence would be part of that update.

Thoughts?

@@ -412,6 +417,10 @@ func (o PruneImagesOptions) Run() error {
fmt.Fprintln(o.ErrOut, "Dry run enabled - no modifications will be made. Add --confirm to remove images")
}

if o.LeaveRegistryAlone != nil && *o.LeaveRegistryAlone {
fmt.Fprintln(o.Out, "Soft prune enabled - only API objects will be removed. No modifications to the image registry will be made.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, don't call it soft prune. It doesn't make any sense.

LeaveRegistryAlone makes the kube objects pruner more robust, but it still may remove too much, it doesn't make the pruner safer.

We have pruners for kubernetes objects (and I think we should split it in two smaller parts: for image streams and for images) and for a registry storage. Both of them are pretty dangerous, but both of them must be used by customers in some cases. None of them should be emphasised as safer and more delicate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on the "it is not safer" point @dmage - while that was in the card's description originally, that discussion point came up during planning as the comment was removed from the card description.

I'm fine for not using "soft prune" ... this does IMO tie in with the discussion / struggle with what the name of the parameter should be (see my comments above where I explicitly did not using --soft-prune).

I'm open to suggestions on the message wording .... the pattern does not mimic the Dry run enabled - .. message above, but how ab out simply removing Soft prune enabled - and start with Only API objects will be... ?

On the splitting things up to an option for only deleting image streams and an option for only deleting images, I don't think I can assess the strength of the potentially decreased usability (i.e. more options) vs. the increased safety of further dissecting things. I would defer to you and the other SMEs here, though think I should wait on @bparees take before committing.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't mind keeping the soft prune naming when accompanied by a strong warning that the registry will be left with unprunable objects. Prunable only via hard-prune. I like the idea of hard prune being a complement of soft prune. What should we call the regular prune 😄?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: pruning imagestreams and images separately:
removing images w/o updating imagestreams just creates a bunch of broken content.
removing imagestreams w/o removing images just creates a bunch of hidden/mostly-inaccessible content

so while I appreciate the idea from a perspective of reducing the granularity and potentially making each operation cleaner/safer, I don't think there's a compelling motivation for it right now. Either way it would be a follow on to this effort if we did it at all, so i think we can table it for now.

For the text, i'm ok w/ just removing the soft prune part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10-4 part of next push thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmage when i said "image" i meant the image etcd object, not the registry data. So given that understanding:

removing images w/o updating imagestreams just creates a bunch of broken content.
removing imagestreams w/o removing images just creates a bunch of hidden/mostly-inaccessible content
I didn't suggest this. This breaks consistency ans it's a wrong way.

isn't that exactly what you're suggestring here (remove image etcd objects and imagestream etcd objects independently):

We have pruners for kubernetes objects (and I think we should split it in two smaller parts: for image streams and for images)

and:

removing imagestreams w/o removing images just creates a bunch of hidden/mostly-inaccessible content
Yep, exactly what we are doing in this PR. Though, these hidden content can be reused later.

no, what we're doing in this PR is removing the image+imagestreamtag etcd data but not removing the registry blob data.

So to summarize, there are 3 types of things that get pruned:

  1. registry blob data
  2. image etcd definitions
  3. imagestreamtag references to image etcd definitions

This PR is going to allow 1 and 2/3 to be done independently.

It sounds like you were suggesting 2+3 should also be possible to perform independently, and I am suggesting that doing 2+3 independently doesn't add value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you remove image streams and images from etcd, you create a bunch of hidden/mostly-inaccessible content (I mean blobs on the storage).

Removing images from etcd (2) is just a garbage collection routine (remove images that are not referenced in any image stream), in theory it should be absolutely safe and it can be performed autonomously. It doesn't visible for a regular user, only cluster admins care about it.

On the other side, removing tags from image streams and removing image streams from etcd (3) may affect users: it makes some data inaccessible. And, as the integrated registry may be exposed, only the user may remove items from his image streams in a safe way. We cannot guarantee safety of this prune in general, but it still useful for a lot of users.

Therefore, 2 and 3 provide different safety levels. That's why I suggest to split this functionality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i can see the value in being able to prune just the un-referenced image etcd objects. we should consider it for the future.

Copy link

@miminar miminar Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that 2 is completely:

  1. pruner decides to prune image I since it's not referenced
  2. user re-pushes I to his repository
  3. pruner deletes I
  4. user has imagestreamtag pointing to the deleted I

I don't think we can put the burden of image stream pruning on all the cluster users - even if we do, there will always be irresponsible users building many images without pruning (e.g. online). Admins will want to prune their namespaces as well saying "each user can have up to 3 image revisions per istag, up to 5 is, etc...". With istag pruning immunity in place, users can request it for their imagestream(tag)s and be safe from the dangerous pruning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miminar that could only happen if the user re-pushes image I w/ the exact same imageid and does so at the same time that pruning is running and selected image I for pruning because it was not referenced by an imagestream. So a risk, but a pretty small one.

Anyway again, these are all discussion points for the larger "how do we rethink pruning" conversation, but independent of the incremental improvement in this PR.

@gabemontero
Copy link
Contributor Author

latest responses to comments pushed in new commit

@@ -5285,6 +5285,8 @@ _oc_adm_prune_images()
local_nonpersistent_flags+=("--keep-tag-revisions=")
flags+=("--keep-younger-than=")
local_nonpersistent_flags+=("--keep-younger-than=")
flags+=("--leave-registry-alone")
local_nonpersistent_flags+=("--leave-registry-alone")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to regen

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep - thanks - will push when make update finishes

@gabemontero
Copy link
Contributor Author

unit test failure was #17472 again

@miminar
Copy link

miminar commented Nov 29, 2017

/retest

@gabemontero
Copy link
Contributor Author

the gce conformance test failure was because it could not provision the test cluster

given the certificate issue Michalis recently noted on the mailing list, I'm going to hold until I see an announcement there before kicking off a redo of the test

@gabemontero
Copy link
Contributor Author

/test extended_conformance_gce

@bparees
Copy link
Contributor

bparees commented Nov 29, 2017

/cc @openshift/cli-review

@@ -118,6 +119,9 @@ type PrunerOptions struct {
PruneOverSizeLimit *bool
// AllImages considers all images for pruning, not just those pushed directly to the registry.
AllImages *bool
// LeaveRegistryAlone allows for pruning of the API Objects in etcd, but bypass any associated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix godoc

@bparees bparees self-assigned this Nov 29, 2017
@bparees
Copy link
Contributor

bparees commented Nov 29, 2017

/unassign @dcbw

@gabemontero
Copy link
Contributor Author

opened flake #17518 for the extended networking minimal failure

@gabemontero
Copy link
Contributor Author

opened flake #17519 for first conformance install flake

@gabemontero
Copy link
Contributor Author

@gabemontero
Copy link
Contributor Author

there was a yum mirror failure on the conformance gce

@gabemontero
Copy link
Contributor Author

/retest

@bparees
Copy link
Contributor

bparees commented Nov 30, 2017

@openshift/api-review @openshift/cli-review please approve the new argument being added to pruning.

Copy link

@miminar miminar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -162,6 +165,7 @@ func NewCmdPruneImages(f *clientcmd.Factory, parentName, name string, out io.Wri
cmd.Flags().StringVar(&opts.CABundle, "certificate-authority", opts.CABundle, "The path to a certificate authority bundle to use when communicating with the managed Docker registries. Defaults to the certificate authority data from the current user's config file. It cannot be used together with --force-insecure.")
cmd.Flags().StringVar(&opts.RegistryUrlOverride, "registry-url", opts.RegistryUrlOverride, "The address to use when contacting the registry, instead of using the default value. This is useful if you can't resolve or reach the registry (e.g.; the default is a cluster-internal URL) but you do have an alternative route that works. Particular transport protocol can be enforced using '<scheme>://' prefix.")
cmd.Flags().BoolVar(&opts.ForceInsecure, "force-insecure", opts.ForceInsecure, "If true, allow an insecure connection to the docker registry that is hosted via HTTP or has an invalid HTTPS certificate. Whenever possible, use --certificate-authority instead of this dangerous option.")
cmd.Flags().BoolVar(opts.SkipRegistry, "skip-registry", *opts.SkipRegistry, "If true, the prune operation will clean up image API objects, but the none of the associated content in the registry is removed.")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My ingliš is not perfect, so I may be easily wrong, but the the between but and none seems superfluous to me.

@juanvallejo
Copy link
Contributor

I'm dubious about the flag name --leave-registry-alone, otherwise looking good.

The current name skip-registry SGTM - it is consistent with what we have elsewhere.

LGTM from a cli perspective.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 1, 2017
@deads2k
Copy link
Contributor

deads2k commented Dec 2, 2017

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@gabemontero
Copy link
Contributor Author

image registry, conformance install - prob building openshift-ansible
conformance gce - prob provisioning cluster

/retest

@@ -1006,9 +1015,11 @@ func (p *pruner) Prune(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prunableComponents := getPrunableComponents(p.g, prunableImageIDs)

What is the reason in calculating components if they are not used afterwards in case of algorithm.pruneRegistry == true ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed / good catch / part of next push

@@ -1006,9 +1015,11 @@ func (p *pruner) Prune(

var errs []error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move errs after algorithm.pruneRegistry == true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no that causes a compile error since errs is referenced both within that algorithm.pruneRegistry == true block as well as after that block

@@ -412,6 +417,10 @@ func (o PruneImagesOptions) Run() error {
fmt.Fprintln(o.ErrOut, "Dry run enabled - no modifications will be made. Add --confirm to remove images")
}

if o.PruneRegistry != nil && !*o.PruneRegistry {
fmt.Fprintln(o.Out, "Only API objects will be removed. No modifications to the image registry will be made.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a warning that re-doing the pruning without this option (or with --prune-registry=true) will not lead to pruning the storage of image registry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point it's too late to show a warning for the first time, the user won't be asked for any additional confirmations. Should we add this warning to help?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were previous comments around keeping the help text concise, so that gives me pause, but yeah if we are going to warn, I see no other place to put it where the warning would do some good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I previously had a reference to hard prune in the help text before and @bparees had me remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well that was prior to my understanding that once you run this, hard prune is the only way to finish cleaning up the registry. I'm ok w/ a concise warning in the help text that references the need to run hard prune (not regular prune) to clean up the registry after doing this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning to help added and pushed

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 4, 2017
@openshift-merge-robot openshift-merge-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 4, 2017
@@ -162,6 +165,7 @@ func NewCmdPruneImages(f *clientcmd.Factory, parentName, name string, out io.Wri
cmd.Flags().StringVar(&opts.CABundle, "certificate-authority", opts.CABundle, "The path to a certificate authority bundle to use when communicating with the managed Docker registries. Defaults to the certificate authority data from the current user's config file. It cannot be used together with --force-insecure.")
cmd.Flags().StringVar(&opts.RegistryUrlOverride, "registry-url", opts.RegistryUrlOverride, "The address to use when contacting the registry, instead of using the default value. This is useful if you can't resolve or reach the registry (e.g.; the default is a cluster-internal URL) but you do have an alternative route that works. Particular transport protocol can be enforced using '<scheme>://' prefix.")
cmd.Flags().BoolVar(&opts.ForceInsecure, "force-insecure", opts.ForceInsecure, "If true, allow an insecure connection to the docker registry that is hosted via HTTP or has an invalid HTTPS certificate. Whenever possible, use --certificate-authority instead of this dangerous option.")
cmd.Flags().BoolVar(opts.PruneRegistry, "prune-registry", *opts.PruneRegistry, "If false, the prune operation will clean up image API objects, but the none of the associated content in the registry is removed. Note, if only image API objects are cleaned up through use of this flag, the only means for subsequently cleaning up registry data corresponding to those image API objects is to employ the 'hard prune' administrative task.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump @legionus @miminar - you guys good with the warning and other updates ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gabemontero this warning looks good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool - thanks @legionus

@gabemontero
Copy link
Contributor Author

conformance install failure was flake #17595

errs = append(errs, pruneImageComponents(p.g, p.registryClient, p.registryURL, prunableComponents, layerLinkPruner)...)
errs = append(errs, pruneBlobs(p.g, p.registryClient, p.registryURL, prunableComponents, blobPruner)...)
errs = append(errs, pruneManifests(p.g, p.registryClient, p.registryURL, prunableImageNodes, manifestPruner)...)
}

if len(errs) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gabemontero how about this condition ? please move it too inside if p.algorithm.pruneRegistry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep part of next push

@gabemontero
Copy link
Contributor Author

update for @legionus from this AM pushed

@legionus @miminar @bparees let me know if y'all are good now and I'll squash the commits

@bparees
Copy link
Contributor

bparees commented Dec 5, 2017

lgtm and thanks for doing the revisions in separate commits!

Copy link
Contributor

@legionus legionus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gabemontero
Copy link
Contributor Author

thanks @legionus @bparees @miminar

I'll go ahead and squash.

@gabemontero
Copy link
Contributor Author

commits squashed

@gabemontero
Copy link
Contributor Author

last conformance install failure at https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17480/test_pull_request_origin_extended_conformance_install/3530/ looks like flake #17605 but minus the downward.api piece ... presuming same root cause, just a slightly different flavor

@bparees
Copy link
Contributor

bparees commented Dec 5, 2017

/hold cancel
/lgtm

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Dec 5, 2017
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bparees, gabemontero

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@gabemontero
Copy link
Contributor Author

cmd test hit flake #17574

@gabemontero
Copy link
Contributor Author

conformance install hit flake #17605 again

@gabemontero
Copy link
Contributor Author

/retest

@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue.

@openshift-merge-robot openshift-merge-robot merged commit 05b898d into openshift:master Dec 6, 2017
@gabemontero gabemontero deleted the prune-img-api-obj-only branch December 6, 2017 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet