Conversation
Hello @bmbouter! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on May 31, 2018 at 22:28 UTC
Force-pushed from b6558e5 to 3114b8c
This replaces ChangeSets as the interface.
Force-pushed from 3114b8c to ab10be1
```python
        )
        [i for i in changeset.apply()]

    def _mirror(self, new_version):
```
The ChangeSet applies the additions, then removals.
Suggest:
```python
def _mirror(self):
    retained = set()

    def additions():
        for content in self.pending_content:
            key = content.model.natural_key()
            retained.add(key)
            yield content

    def removals():
        base_version = RepositoryVersion.latest(self.repository)
        for content in base_version.content:
            key = content.cast().natural_key()
            if key not in retained:
                yield content

    changeset = ChangeSet(
        self.remote,
        self.new_version,
        additions=additions(),
        removals=removals()
    )
    [i for i in changeset.apply()]
```
```python
class PendingVersion:

    def __init__(self, pending_content, repository, remote, sync_mode='mirror'):
```
The plugin should interpret the sync_policy name for 2 reasons:
- This object will only be able to deal with either mirror or not-mirror (additive).
- The plugin can accept any policy name in addition to 'mirror' and 'additive'.
Recommend: (bool) mirror instead of (str) sync_policy.
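A minimal sketch of that recommendation (names here are illustrative, not the actual Pulp API): the plugin interprets its own policy names and hands the core object a boolean.

```python
# Illustrative only: the plugin maps its (possibly plugin-specific) policy names
# onto a boolean, so the core object only has to handle mirror vs. not-mirror.
def interpret_policy(sync_policy):
    return sync_policy == 'mirror'

assert interpret_policy('mirror') is True
assert interpret_policy('additive') is False
assert interpret_policy('some_plugin_specific_policy') is False
```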
```python
        self.remote = remote
        self.sync_mode = sync_mode

    def apply(self):
```
I think create() is more appropriate than apply(). I also think it would be appropriate for this method to return the newly created RepositoryVersion to the plugin.
A more general observation is that the goal was to provide the differencing logic (here) for the plugin writer. I'm concerned that, for additions, the differencing is not done at all.
Please consider implementing the differencing logic for additions. This should involve simple logic in the additions generator to omit pending content by natural key when it is already in the repository.
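A rough sketch of what that additions differencing could look like, following the shape of the _mirror() suggestion above (the names and attributes mirror that snippet and are not a definitive implementation):

```python
def additions():
    # Natural keys already present in the base repository version.
    base_version = RepositoryVersion.latest(self.repository)
    existing = {content.cast().natural_key() for content in base_version.content}
    for content in self.pending_content:
        key = content.model.natural_key()
        retained.add(key)
        if key not in existing:
            # Only content not already in the repository is passed on as an addition.
            yield content
```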
I see what you are saying about the differencing code. With bringing back the differencing code required to make this correct, we should check in on the approach overall. I see the goal as slightly different from what you've written, so I'll restate my understanding: the goal is to make plugin writers' lives easier by creating the PendingVersion with the interface described here. In order to do that, we have a few options.
We're using option 1 currently. The reasoning is that it minimizes effort since the ChangeSet already exists. From the coding I've done so far, I believe option 2 would result in significantly less code to maintain and is now also the lowest-effort option. It also architecturally eliminates any redundancy since they are one thing. Option 3 is also viable, but we would need a great reason to do that over option 2, which I don't see currently. @jortel what do you think we should do?
The ChangeSet already provides a lot of this functionality. (Option 1), having the PendingVersion use the ChangeSet internally, minimizes effort. (Options 2 & 3) would be significantly more effort and not result in any code de-duplication. We should stick with (option 1) as planned.
I'd anticipated something simple like this:
After thinking about this some more, I think option 1 (keeping the ChangeSet and wrapping it) leads to less efficient db performance, more code, and higher memory usage.
Here is an example. Assume that we have the unit keys in memory that we want to remove, like in my PR here. The easiest thing to do would be to find-and-delete those using the db directly, with code just like these few lines here. Implementing that delete through the ChangeSet instead turns one mass delete into many individual operations.
You can also see the extra work and complexity option 1 brings in various ways. One is the differencing code that would still need to be added. Also look at this additional closure here; if the main loop draining the iterator lived in the PendingVersion, that closure would not be needed.
This gets at something that you asked also. I think we need to get input and perspectives and reach an agreement before more coding is done. Discussing on the issue has been great. Also, I think this API change will be very beneficial to plugin writers; for example, look at how much the code in pulp_file drops when using the declarative interface.
I don't think your assertions that (option 1) "leads to a less efficient db performance, more code, and higher memory usage" are substantiated.
DB performance: The example of DB performance re: removals is not correct.
More code: I just don't see it. The differencing logic needs to be added somewhere; using the ChangeSet doesn't change that.
Higher memory usage: I don't see this substantiated either.
Next: Agreed. Let's get more people involved in the discussion.
I want to look at the design more to talk through these concerns. To simplify things, let's just look at the DB concerns. I wanted to be sure my understanding of how the 1000 DELETE example loads the database was right, so I asked in #postgresql on Freenode. They confirmed a few things:
I was looking at the ChangeSet code, and I see a lot of ^ as a design mismatch. One concern you mentioned, and I recognize, is the separation of concerns. What if we have fewer classes and have shorter, well-documented methods on a single class? This would still keep the separation of concerns at the method level on PendingVersion.
What do you think we should do about these ideas? What are the problems with them? What can we do that is better? Also, what about the key question of what we are offering plugin writers?
http://www.postgresql.org/docs/current/static/mvcc.html
I don't want to overrun with comments, but I just thought of something that was significant enough to write out.
One point of value I always saw in the ChangeSet was how it handles these situations for the plugin writer. With the core changes mentioned, the above situations aren't issues for plugin writers anymore.
I'm convinced the single delete statement matching 1000 records is more efficient. Mainly it reduces the statement parsing and round-trips to the DB. I'm not sure the difference is significant, but I agree we should always try to be as efficient as possible.
Grouping the methods within a single class by separation of concerns is a yellow flag and confirms they should be 2 separate classes. It's analogous to suggesting that one big function containing separated code blocks is somehow better than having the same functionality separated into multiple functions, each doing one thing (having a single concern).
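For illustration, here is a hedged sketch (using made-up Django model fields, not Pulp's actual schema) of the two removal strategies being compared: one DELETE per unit versus a single DELETE matching all of them.

```python
# Illustrative only: 'version.content' stands in for the queryset of content
# associated with a repository version; the natural_key field is hypothetical.
from django.db import transaction

def remove_one_at_a_time(version, natural_keys):
    # One DELETE statement per unit: many statement parses and round-trips.
    with transaction.atomic():
        for key in natural_keys:
            version.content.filter(natural_key=key).delete()

def remove_in_one_statement(version, natural_keys):
    # A single DELETE matching every row at once.
    version.content.filter(natural_key__in=natural_keys).delete()
```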
I don't think the analogy that multiple methods on a single class is similar to multiple code blocks in a single function holds up, in at least two ways. A single function requires you to run all code blocks at once, while small methods can be run in a fine-grained way. Also look at the testing differences: each method can be clearly and easily tested, while a large monolithic function with code blocks cannot. I'm pointing this out because monolithic functions are a well-known bad practice. I think having one object with various simple methods would allow each method to have a separation of concerns. What I'm advocating is to move some of the methods from Changeset to PendingVersion and remove the rest of the Changeset.
The function/code-blocks analogy was meant only as a separation-of-concerns analogy, not a modularity argument. Just as functions should do one thing and delegate other functionality to other functions, classes (objects) should be one thing and delegate other behavior to other objects.
Ah, so you are only talking about merging the ChangeSet into the PendingVersion.
Let me suggest a path forward. The main goal is to provide the PendingVersion interface to plugin writers. How does this sound?
I should be clearer about the scope. I think we should look to simplify the code of the whole changeset package with this transition to the declarative interface. Consider the 'report' package: if Changeset isn't in the plugin API, then we don't need formal reporting outside of the Progress Reporting recorded on the task, so we can remove report entirely. The iterator package could be simplified a good amount too, and we can also remove a lot of the boilerplate getter/setter/property code.
Overall I think improving the efficiency while reducing maintenance is a goal worth the effort. It would be higher performing for all the reasons discussed on the PR so far, since it would be custom built to implement the PendingVersion interface.
If an experimental branch was approved to be made, we would better know exactly what the changes are, quantify how much more efficient it is, and show how much code we would be saving for the same feature set. How do we solicit input on going forward with something like this? What are the problems with this approach?
@jortel I agree the most important user-facing change is the PendingVersion interface. I plan to have removals handled by one mass delete inside of PendingVersion itself; that is the exact point that got this PR discussion started, actually, and it would be the most efficient thing to do. Will that work for you?
Also, this only leaves PendingContent, PendingVersion, and PendingArtifact in the plugin API and removes Changeset and all other objects from the API. They are still left as-is internally in the code. Will this work for you?
Finally, having those things live in pulpcore.plugin.changeset is a little strange; I think they could come from a package with a different name.
Here you stated "I see the goal as slightly different from what you've written. I'll restate my understanding. The goal is to make plugin writer's lives easier by creating the PendingVersion with the interface described here." Has this changed?
When discussing the single statement vs. multiple statements on #postgresql, I asked them to quantify the difference in expected performance and efficiency between the two methods. They recommended that we time both approaches, so I did. I tested removing 5000 content units from a repository, which results in deleting 5000 content records plus their repository associations.
Measurements: multiple (10k) statements within a transaction averaged 25 seconds. I found these results curious in several ways:
Implementing the removals in the PendingVersion seems fine. Any changes to the changeset package (except merging #3464) or the plugin API other than adding the PendingVersion feel out of scope to me.
I'd just like to point out that if we still hope to regain support for sqlite in the future (which seems increasingly unlikely at this point), we can't assume that we can make very large mass creations or deletions, due to the expression tree size limit: https://pulp.plan.io/issues/3499. Sqlite would limit us to working with 600-800 items at a time.
With respect to the timing results: it could be that the Django ORM is emitting a really unoptimized query vs. the one that you're producing. Perhaps we should compare? https://stackoverflow.com/questions/1074212/how-can-i-see-the-raw-sql-queries-django-is-running
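One way to do that comparison, for example, is to let Django record the SQL it executes and print it (this only works when DEBUG=True, since Django records queries in debug mode):

```python
# Inspect the raw SQL Django actually emits for the removal code under test.
from django.db import connection, reset_queries

reset_queries()
# ... run the removal code being measured here ...
for query in connection.queries:
    print(query['time'], query['sql'])
```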
I want to go back to the high level with the use cases. I think understanding what use cases we are solving and how we are fulfilling them is the best way for us to navigate how we serve our users. From that perspective, I don't think we're committed to keeping any of the code we have now due to sunk-cost reasoning. If what we have is the best way to fulfil the use cases then great; if not, then we need to make an adjustment. Here are the use cases I see the plugin writers needing from us:
I think these are the 3 use cases that plugin writers need in this area. Are there more requirements? Do people agree or disagree with aspects of these? In terms of approach, after the use cases are known, we'll have a solid foundation to move into a discussion about implementation options.
The organization into 3 use cases (some having details: bullets) does not resonate well with me. Rather, it seems more like 2 use cases with details. Terms:
As a plugin writer, I want the plugin API to provide an object that can be used to create a new repository version.
As a plugin writer, I want the plugin API to provide an object that can be used to create a new repository version.
@gmbnomis, one option would be for your plugin to do the extra metadata fetching after the new version's content has been determined. Thoughts?
@gmbnomis and @jortel I've also been thinking about that same use case. It's where the plugin writer already knows the items they want to add or remove, but they don't know if Pulp already has them downloaded and created.
We should consider integrating this functionality with RepositoryVersion. I think having fewer objects that users need to learn, and having this functionality available on RepositoryVersion, would be good.
That interface would effectively provide "batching", so every time you call it, that group is treated as a batch.
@jortel: Doing this at that point in time feels a little bit odd. But there is also a real drawback to doing REST calls there. For example, in an initial sync (i.e. the Pulp repo is empty), there will be many PendingContent units to add. If the REST calls for metadata happen in synchronous code, they will be serialized.
One could go one step further and do the metadata enrichment asynchronously as part of the PendingVersion streaming. But I think computing a set of content to add and then just fetching additional metadata for the set is much simpler.
Good point. We'll keep brainstorming, but it looks like this needs more thought.
@gmbnomis I want to go back to your use case. I see the similar "two layers of metadata" pattern in pulp_ansible and pulp_maven. The solution you outlined would work, but I want to look at a failure scenario. The outlined solution has two phases: Phase 1 uses the declarative interface to determine what to add, and Phase 2 adds extra metadata for created content units. What about this failure scenario though:
Code could be added to handle these failures, but if we handled that extra data fetching prior to saving, and as part of the stream, it would be a safer architecture for this type of processing. I believe having a capability to update a "content unit that is in memory but not saved" as a step in the stream processing would be broadly useful. It could also inspect the locally downloaded data if that is what it needed to do to discover the data it needs to write. To accomplish this extra "step" we could allow a plugin writer to specify a generator that yields updated content units and receives the content units created by the stream. What do others think about this as a way to resolve the two-layer metadata problem in the architecture itself (with an optional step)?
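A rough sketch of such an optional step, with made-up names (the unit objects and fetch helper are illustrative, not Pulp's API): a generator that receives in-memory, not-yet-saved units and yields them back enriched.

```python
# Illustrative only: enrich units with extra metadata before they are saved.
def fetch_extra_metadata(unit):
    # Stand-in for a REST call to the remote, or for inspecting the unit's
    # already-downloaded artifact.
    return {'description': 'filled in during stream processing'}

def enrich_step(in_memory_units):
    """Receive content units that exist only in memory and yield updated ones."""
    for unit in in_memory_units:
        extra = fetch_extra_metadata(unit)
        unit.description = extra['description']
        yield unit
```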
@bmbouter I like this idea, but I want to take it a step further. The solution you proposed is good for a 2-stage unit creation process. What happens when the plugin writer needs a 3-stage process? What about an n-stage process? Supporting a pipeline of work would be useful.
@bmbouter wrote:
The goal is to provide the plugin writer with an opportunity to supplement the metadata for only the content that is actually being added.
The issue with this approach is that it limits concurrency. We can observe the problem with the Chef example.
Here are the issues this creates; they are two halves of the same coin. While the synchronous code is blocking on a metadata download, nothing else runs. In addition to the Python code being blocked, other downloads are also not occurring, specifically the batch artifact downloaders. And when the batch artifact downloaders are running, the metadata downloads are not occurring.
What can we do about this problem?
@bmbouter, I'd anticipated the plugin code running in the event loop.
Basically, all these approaches are 'pipelining' generators in one way or another (the output of one generator is the input to the next). I am wondering if that goes well with using async. If one generator A uses the event loop to wait on downloads and another generator B does as well, I don't know what happens exactly in this case. Usually A and B will use the same event loop. Already scheduled futures may even continue to run while the other generator is driving the loop.
I think we need more than pipelined iterators. Ideas that come to mind:
@bmbouter: Just saw your comment. I think we both describe the same problem here.
Thought of another approach. The ChangeSet could support plugin-specified metadata downloaders and run them concurrently along with the artifact downloads. I think this would be very straightforward to add. Thoughts?
@jortel: I think this is similar to the extension point idea I mentioned earlier.
To address @bmbouter's concern about error handling: IMHO, the sync logic should not add a content unit to the add set if its metadata download raises/returns an error (and it must not be removed from the new repo version).
Re naming: I would not call them metadata downloaders, as they may also get metadata from the downloaded artifact. I am bad at naming; no alternative proposal from my side 😀
Agreed.
Fair enough about the naming. I leaned toward downloader, but we could go with something future-related. The main thing is that each be an asyncio future/coroutine.
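For illustration, a rough sketch of that shape (all names here are invented): the plugin supplies one coroutine per pending unit, and the core awaits them together with its artifact downloads so both kinds of I/O overlap.

```python
import asyncio

async def fetch_extra_metadata(unit):
    await asyncio.sleep(0.1)      # stand-in for an HTTP metadata request
    unit['summary'] = 'fetched concurrently'

async def download_artifact(unit):
    await asyncio.sleep(0.1)      # stand-in for the artifact download
    unit['downloaded'] = True

async def process(units):
    # The core gathers the plugin-supplied coroutines alongside its own downloads.
    await asyncio.gather(*[fetch_extra_metadata(u) for u in units],
                         *[download_artifact(u) for u in units])

units = [{'name': 'a'}, {'name': 'b'}]
asyncio.get_event_loop().run_until_complete(process(units))
```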
@jortel, that would work around the problem for this specific use case, but my concern is that the concurrency model of Changesets is incompatible with the concurrency of the Downloaders and asyncio more broadly. More content types will come, with more use cases that involve Downloaders and asyncio-based concurrency in various ways during stream processing. This convo has been super useful because I think we've uncovered an even larger problem than I realized. I believe we need to explore and possibly prototype an architectural solution.
Also consider that if we don't fix this, plugin writers can't use anything else asyncio-based in the pipeline either (not just downloaders). For instance, this would block us from using asyncpg while benefiting from its concurrency (which is the reason to use it). Also any of these things too. That's a lot of things to not be able to use during our stream processing.
We could get out of this situation if we adopted option A from @gmbnomis's comment. In practice I think this will be easy as long as the components have well-defined types going in and out. Here is a SO post w/ some example code just like this. There are also libraries for this kind of thing, I believe. This is not unlike what @dkliban was alluding to.
The Plugin API can provide the asyncio-compatible raw components of the pipeline and also some pre-built pipelines from those components. If plugin writers only need the functionality of PendingVersion, there would be a pre-built pipeline they could use. Anyone wanting to do something custom (e.g. this use case from @gmbnomis) could inject their steps at any point in the stream without any involvement from us. This would make the plugin API stream-processing based, fully concurrent, and asyncio compatible.
@bmbouter: I think I see your point. I thought we could support stages with the extension point I discussed with @jortel: Pulp core could run e.g. 10 of these and the artifact downloads concurrently. The 'stages' would run sequentially for each content unit. But the latter may already be the problem: what if it is more efficient for the plugin to request metadata for more than one content unit at a time (i.e. it wants to include multiple content units in a request to the remote)? Or, in the extreme case, the plugin might need the entire "add set" before progressing. Also, I forgot about the iterator that feeds PendingVersion, which has the same problem as the stages after PendingVersion.
OTOH, I am wondering whether we ask too much from plugin writers if the Pulp plugin API offers a very powerful but hard-to-grasp, bleeding-edge, asynchronously coupled pipelining framework...
@gmbnomis I agree that it's important to strike a balance between performance and ease of use. I am confident that we can achieve both in this case. A lot of developers may not be familiar with asyncio right now, but that will change with time.
@gmbnomis and @dkliban I also believe we can deliver the power of asyncio-based concurrent streaming without the plugin writer having to understand how it works underneath, or even asyncio. An alternative to hiding the event loop would be to have the plugin writer schedule the coroutines on the event loop themselves. I find hiding the event loop a bit magical, so I prefer not hiding it (explicit > implicit), but I'm ok with either. What is more preferable between these two choices?
In terms of use cases, this would allow plugin writers to rebuild the pipeline easily, as their content type needs, with the same steps used by the default pipeline. @jortel what do you think about these things? I'm going to post some use cases here that reflect these requirements around recomposability. All responses and ideas are welcome.
FWIW, I looked into the pipelining options a bit and cobbled together this script. The script's intention is to explore how async stages could be implemented; I did not bother to think about good abstractions and did not care about errors for most parts. It requires Python 3.6 (async generators are supported beginning with Python 3.6; alternatively, I could have used async iterators, which are supported in 3.5 already, but they are not as elegant).
The basic idea is that we have a simple async worker function that handles a single content unit. The stages implement different methods to schedule the work (streaming sequentially, streaming concurrently, gathering all content units first). In the simplest case, the plugin writer would implement the worker function (or most probably class) and plug it into a pre-defined stage scheduler. At the beginning of the pipeline, it would be a function/generator that generates units. In more complex cases, the plugin writer could implement a stage. There are two implementations of stages in the script: one based on async generators and one based on queues.
Async generators are "pull" based: by default, the upstream iterator does not run if the downstream iterator can't process input. I think this is a nice feature. Probably, async iterators are a little bit faster than the queue-based implementation (units are passed directly between stages instead of via put/get methods on queues). In all other respects I like the queue-based approach better:
@bmbouter: Regarding 'hiding' the event loop: I don't know how much code we need outside the event loop (e.g. for handling exceptions that bubble up, reporting, and, if there is a use case for it, cancellation).
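For context, here is a minimal sketch (not the referenced script, just an illustration with invented names) of the queue-based flavor of stages: each stage reads units from an inbox queue and writes results to an outbox queue, with None used as an end-of-stream marker.

```python
import asyncio

async def produce(outbox):
    for unit in ['unit-a', 'unit-b', 'unit-c']:
        await outbox.put(unit)
    await outbox.put(None)                  # end-of-stream marker

async def stage(worker, inbox, outbox):
    while True:
        unit = await inbox.get()
        if unit is None:
            await outbox.put(None)          # propagate end-of-stream downstream
            break
        await outbox.put(await worker(unit))

async def enrich(unit):
    await asyncio.sleep(0.1)                # stand-in for a metadata download
    return unit + '+metadata'

async def consume(inbox):
    while True:
        unit = await inbox.get()
        if unit is None:
            break
        print(unit)

async def pipeline():
    q1, q2 = asyncio.Queue(maxsize=10), asyncio.Queue(maxsize=10)
    await asyncio.gather(produce(q1), stage(enrich, q1, q2), consume(q2))

asyncio.get_event_loop().run_until_complete(pipeline())
```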
Incompatible how? I don't see how this is a workaround. The proposal provides a solution for this use case and others like it. It provides an opportunity for plugin writers to specify one or more asyncio futures/coroutines used to do anything concurrently, like downloading additional metadata used to complete the pending content.
What larger problem?
Why would plugin writers use asyncpg?
The proposal supports plugin-writer-defined asyncio futures/coroutines. How would plugin writers be unable to use aio libs?
Background: stream processing is synchronous and serial; aio is asynchronous and parallel.
Questions:
@gmbnomis Thanks so much for posting your script example. It looks like what I was thinking of also. Today I'm going to try to build a prototype on top of it that will provide the PendingVersion feature set using only stages as its design.
It took me a long time to make it, but I finally finished the DeclarativeVersion code based on your prototype, @gmbnomis. It still needs a few things, and I'm tracking that stuff as part of Issue 3844. A diagram of the proposed solution: https://i.imgur.com/7cEXC5e.png
This work was the inspiration for DeclarativeVersion. That code is now merged and available, so I'm closing this PR. Thanks for all the discussion!
https://pulp.plan.io/issues/3570
Builds on PR #3464. Those diffs appear here as well but are not part of this PR.