Skip to content
This repository has been archived by the owner on Oct 28, 2019. It is now read-only.

Add always_download to DeclarativeArtifact #50

Merged
merged 1 commit into from Feb 21, 2019

Conversation

mdellweg
Copy link
Member

@mdellweg mdellweg commented Feb 8, 2019

@mdellweg mdellweg force-pushed the non_lazy_artifacts branch 2 times, most recently from a2471d6 to b9e5998 Compare February 8, 2019 15:00
@codecov
Copy link

codecov bot commented Feb 8, 2019

Codecov Report

Merging #50 into master will increase coverage by 0.32%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #50      +/-   ##
==========================================
+ Coverage   42.53%   42.85%   +0.32%     
==========================================
  Files          22       22              
  Lines         710      707       -3     
==========================================
+ Hits          302      303       +1     
+ Misses        408      404       -4
Impacted Files Coverage Δ
pulpcore/plugin/stages/artifact_stages.py 49.05% <0%> (ø) ⬆️
pulpcore/plugin/stages/models.py 75% <100%> (+0.49%) ⬆️
pulpcore/plugin/stages/declarative_version.py 37.93% <50%> (+4.59%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 527f2cd...a691387. Read the comment docs.

@mdellweg mdellweg force-pushed the non_lazy_artifacts branch 2 times, most recently from 60d8943 to a923477 Compare February 12, 2019 19:21
args: unused positional arguments passed along to :class:`~pulpcore.plugin.stages.Stage`.
kwargs: unused keyword arguments passed along to :class:`~pulpcore.plugin.stages.Stage`.
"""

def __init__(self, max_concurrent_content=200, *args, **kwargs):
def __init__(self, max_concurrent_content=200, download_artifacts=True, *args, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see why this is useful, but this seems a bit confusing: ArtifactDownloader(download_artifacts=False). We could avoid this by the plugin writers setting it on all versus some of the DeclarativeContent objects? I think that is almost as easy and would make this all less confusing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The attribute would be need to be renamed for this change to make sense. Maybe da.no_download and those are just no-opped by these stages.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think, it were less confusing if it was named lazy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That name would be clearer. I'm concerned though that we have two mechanisms when we can do it almost as easily with one. A plugin writer who has the "mixed content/metadata" needs and wants to add lazy will have to mark some portion of the DeclarativeArtifacts attributes but not others. Having to additionally specify that functionality should be "disabled" seems like more work and more to maintain when it could be not set by the plugin writer like they did before adding "lazy" support.

Does this concern make sense or maybe I'm misunderstanding the need?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking more about the easy plugin with no special needs.
But you are right, that having two mechanisms, when one could do seems odd.
How should the field on d_artifact be named? lazy, no_download, ...?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to no_download. I think it's the most self-descriptive name. Things that can't be downloaded can't be saved, so this captures the intent the most clearly to me.

As an aside on the "easy plugin" case, they have some other options already. They can either use DeclarativeVersion with the existing download_artifacts parameter, or a custom pipeline that omits the Artifact handling stages. I documented those options here https://docs.pulpproject.org/en/pulpcore-plugin/nightly/reference/lazy-support.html#adding-support-when-using-declarativeversion

Copy link
Member Author

@mdellweg mdellweg Feb 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to only have one way to do this:
Always include the artifact_stages and rely on the no_download member. So i am going to remove the download_artifacts option from the DeclarativeVersion.
Otherwise we would have the the confusion, what download_artifacts combined with no_download means.

Another option were to rename download_artifacts to include_artifact_stages, but i think, this is not instructive to a user without broad understanding of the stages.

@mdellweg
Copy link
Member Author

@bmbouter I now have reduced it to only having a no_download flag on d_artifact. But i am not convinced, that this gives a nice API.
I think i would like the api better if you could specify the d_artifacts with supports_lazy and have a global gating flag sync_lazy on the d_version. And thinking more about it i would really go for naming them with 'lazy' because we have a dedicated section in the documentation called 'Lazy Support'.

@bmbouter
Copy link
Member

@mdellweg let me try to share some info on the design considerations that brought us here. With all the info on the table we can make a good decision together.

Consider the performance implications for plugins that would include Artifact stages that don't need them. I haven't measured it in practice, but architecturally having three stages in pipelines where they aren't needed will perform slower. My feeling with pipelines is that they are highly re-combinable and probably shouldn't be the same plugin-to-plugin. I believe a design will only work for everyone if it is performing the best for everyone.**

I think the majority of plugins are not in the situation that Docker, Debian, and Maven are in, i.e. they want to download some but not all Artifacts during lazy. So having all plugins include these stages when many of them won't need them during lazy sync doesn't seem the best.

What do you think about these considerations and how they inform this design? Please make a case for the design you think is the best.

**As a short aside on performance and the "one-pipeline" approach, overall I try to avoid subjective choices about which plugins we should optimize the pipeline for. I want to avoid global designs that optimize for some plugins but sub-optimize for others. That is a non-pareto efficient solution. The beauty of the pipeline architecture is that each plugin can use the pipeline that is right for them, creating a pareto efficient solution by avoiding a single global optimization.

@mdellweg mdellweg force-pushed the non_lazy_artifacts branch 3 times, most recently from edef5af to f182905 Compare February 14, 2019 16:22
@mdellweg
Copy link
Member Author

@bmbouter
This is how prefer it.
The parameter is now called sync_lazy because it is not anymore directly about downloading artifacts or not. The indirection being, that the this flag activates a mechanism to decide for each artifact individually whether downloading is in order.

@mdellweg mdellweg force-pushed the non_lazy_artifacts branch 2 times, most recently from 576797e to 10053f6 Compare February 14, 2019 16:39
@bmbouter
Copy link
Member

@mdellweg I'm concerned about the feature flipping when we don't necessarily need it. It seems simpler to leave the checking of no_download to be enabled all the time, and let the plugin code only set da.no_download=True if the current sync mode is lazy. That way we don't carry more features (like the flipping of them). That is my opinion at least.

@mdellweg
Copy link
Member Author

@bmbouter Let me try to explain, why i prefer this version:
We cannot build a one-size-fits-all pipeline, but we can try to build one that works for most plugins with the least surprise.
Having the possibility to specify that only specific content units may not be downloaded and then excluding the artifact stages altogether (by accident?) breaks expectations. Including the Artifact stages at all times with only the one flag on the d_artifact imposes the duty to decide whether an individual artifact should be downloaded on all plugin writers.
On the other hand, this design allows the plugin writer to specify at a central point, that pipeline should work in 'lazy' mode, while he only needs to classify the individual artifacts to be special (independent of the operation mode). While trying to envision the documentation for this, i think it is easier.
In the case of the file_plugin, there is also the possibility to build an optimized individual pipeline without downloading/saving artifacts and without future resolving.

Copy link
Contributor

@asmacdo asmacdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one nitpicky var name suggestion, but on the whole I think this is the right pattern.

Flags vs Leaving out stages

Simplicity

The complexity of leaving out stages is that a plugin writer is essentially operating on the entire pipeline as a single entity. Instead, the flag design lets the plugins practically ignore lazy-- they create declarative content and artifacts with 1 extra piece of useful metadata. Those declarative artifacts are handled differently by a core stage, which plugin writers don't need to worry about because the effect of the variables is obvious. The key point is that there is only needs to be 1 pipeline per sync task, which means the plugin writer does not need an in depth understanding of all of the core stages.

Plugin similarity

Another major advantage to this approach is that all the plugins will be able to use it. Though some plugins might work just fine by leaving certain stages out it would force us to figure out an entirely different pattern for deb, docker, maven, npm, etc. We should make 1 interface if at all possible.

Performance

I don't think this will have very much of an impact on performance because the affected stages will skip the object without executing anything expensive. Clearly, they will operate a little slower, but I'd bet that the database will still be the bottleneck and therefore these no-ops will not cause a noticeable slowdown.

these stages included, no Artifact downloading will occur. Content unit saving will occur, which
will correctly create the lazy content units.
passing the `sync_lazy` parameter to the `ArtifactDownloader` and `ArtifactSaver` stages.
They will then function selectively on the `support_lazy` flag of `DeclarativeArtifact`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a naming suggestion that I think would be a little more obvious. (Current naming is pretty good though)

s/sync_lazy/defer_downloads/
s/support_lazy/deferrable/

I think this is just a little bit more explicit because it indicates what will happen. ie, if defer_downloads and deferrable are both true, that download will be deferred.

@bmbouter
Copy link
Member

Let's make this conversation a bit simpler by removing one of the areas of debate. I'm ok to include stages even when they are not needed. This should allow us all to talk about the user experience and not worry about the performance stuff yet. Including those stages does make it easy to talk about pipelines in a consistent way and custom pipelines can still be built if plugin writers really want better performance.

Now that all stages can be included w/ the default pipeline, I want to make a case that the data of the DeclarativeArtifact should drive the pipeline for simplicity and we should avoid "behavior by configuration" aka 'modes'.

I don't think feature flipping behaviors of individual stages is helping our user experience; it's making things more confusing. It would be simpler for the Artifact stages respect no_download all the time. Why can't it? It's simpler by itself (to explain, to document) and we don't have to 'flip' it. A single attribute is simpler than an attribute you still have to set and then also turn on. This design only creates more work and confusion.

As an architectural point, if you want a stage to have a significantly different behavior, my belief is that we should have a different stage with a different implementation. The pipeline is a form of functional programming. It's defined by what you include, not how you are procedurally configuring each thing. Note that I'm talking about configuring individual stages not about a factory like DeclarativeVersion which does configure the pipeline with pipeline-wide options. Still, the lazy option doesn't need to be added to DeclarativeVersion either because the no_download attribute can be respected at all times.

Also look at the many other features in the pipeline. Notice how we didn't take a 'modes' approach with the "futures resolution" feature. Plugin writer's either use it or they don't. It's always there if they need it. They don't have to 'turn it on'. We should do the same here with no_download.

@asmacdo
Copy link
Contributor

asmacdo commented Feb 19, 2019

As I read it, @bmbouter's most important point is:

As an architectural point, if you want a stage to have a significantly different behavior, my belief is that we should have a different stage with a different implementation.

However, either solution violates this concept. The 2 var solution (da.deferrable and defer_downloads) causes the pulpcore stages to behave differently depending on the situation. The 1 var solution (no_download) forces each plugin writer to implement their FirstStage to behave differently depending on whether it is lazy sync or not.

From the docs, "The plugin writer needs to specify a first_stage that will create a DeclarativeContent object for each Content unit that should exist in the new RepositoryVersion." IMO, the responsibility of the first stage is to download and parse metadata and should explicitly not be making any decisions about how the rest of the pipeline should behave. (ie, the FirstStage should not need to care whether the sync is lazy or not)

The DeclarativeVersion is the object that is responsible for creating the pipeline to act on the declarative content, and I would expect to tell that object to behave differently for alternative scenarios. (We already configure the declarative version to behave differently whether the sync is mirror or not.) This puts the code branching where the differences matter, (ArtifactDownloader and ArtifactSaver) and abstracts that branching so the plugin writer shouldn't need to worry about it.

To sum up, FirstStage is "what to sync" the DeclarativeVersion takes care of "how to sync".

@mdellweg
Copy link
Member Author

So this means we would need two different implementations of ArtifactDownloader? One being sepecific and one downloading everything?

@bmbouter
Copy link
Member

Well my point was not to make two, but more that we should bake this behavioral aspect directly into ArtifactDownloader so that it's always there. Then when (for any reason lazy or otherwise) a plugin writer needs to have an Artifact not downloaded but have its RemoteArtifacts created it can set no_download.

@mdellweg
Copy link
Member Author

@bmbouter @asmacdo I think, this is the version we agreed upon.

I tried to give a shot at the lazy documentation. It wasn't so obvious. Maybe, if we have a first go with pulp_file we can start copy the example and describe it. Speaking of which. This is another breaking change...

@bmbouter
Copy link
Member

@mdellweg I can come through and update the docs in a separate piece of work.

Copy link
Member

@bmbouter bmbouter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me. Thanks @mdellweg and @asmacdo for all the discussion and this final resolution. 👍

@bmbouter
Copy link
Member

This doesn't have to be a breaking change if we don't remove the download_artifact aspect to DeclarativeVersion. Also Travis is failing due to pulp_file tests failing I think for that same reason. Are we sure we want to make this that big of a breaking change. I'm ok if so, we just need to fix pulp_file then too.

@mdellweg
Copy link
Member Author

It feels like not going all the way when leaving the download_artifact flag on the DeclarativeVersion. And despite the fact, that we concluded the name wasn't ideal, it were yet another feature to support in the future. The plugin writer always has the possibility to build a custom Pipeline to exclude the artifact stages (and ResolveContentFuture, as it probably looses its purpose then).

@bmbouter
Copy link
Member

@mdellweg I'll go on to make another PR fixing the docs. Would you be willing to update https://pulp.plan.io/issues/4209 with our final solution?

Copy link
Member

@bmbouter bmbouter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Thank you @mdellweg and @asmacdo for all the discussion. It's great collaborating with you both. 🥇 👍 🎆

@bmbouter bmbouter merged commit ea03cd9 into pulp:master Feb 21, 2019
@mdellweg mdellweg deleted the non_lazy_artifacts branch February 21, 2019 15:44
@mdellweg
Copy link
Member Author

@bmbouter Seems, i cannot edit the Message body. Only add a comment. So feel free to copy the text upstairs and adjust it. Thanks for all the discussions.

@bmbouter
Copy link
Member

bmbouter commented Feb 21, 2019

I checked the permissions and I think they are there. I can handle this one. Redmine is confusing in this area though. It's kind of silly. After clicking edit you have to click 'edit' again to show the original body to edit.

@bmbouter
Copy link
Member

@mdellweg ty for leaving this comment https://pulp.plan.io/issues/4209#note-10

daviddavis pushed a commit to daviddavis/pulp_file that referenced this pull request May 14, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
3 participants