Skip to content
This repository has been archived by the owner on Dec 7, 2022. It is now read-only.

Proposed download package and ChangeSet #2876

Closed
wants to merge 4 commits into from
Closed

Proposed download package and ChangeSet #2876

wants to merge 4 commits into from

Conversation

jortel
Copy link
Contributor

@jortel jortel commented Dec 6, 2016

Introduction

This proposal attempts to provide tooling intended to reduce the effort required for plugin writers to contribute a plugin.

It focus narrowly on:

  • File downloading.
  • Creating content and artifacts.
  • Content association with a repository.
  • Deferred (Lazy) downloading for-free on minimal burden to plugin writer.
  • Support the streamer.
    • Twisted integration.
    • Support validation when streaming the bits.
    • Each download is autonomous and provides protocol independent interface.
  • Keeping the plugin API lean-and-mean.

The proposed solution is layered.

Layer 1

Concurrent Downloading.
Plugin writers are free to use any lib they want to download files. However, I don't know of any that support concurrency and validation. The proposed solution is an abstract Download (noun) object that can also be executed concurrently within a Batch. A basic HttpDownload is provided. The Download hierarchy is intended to be extended to support additional protocols such asFTP or BitTorrent in the future and work nicely with the Batch. A basic file size and checksum validation is included. Additional validations can be supplied by both platform and plugin writers.

Example:

from pulp.plugin import Batch, HttpDownload

url = 'http://redhat.com/content/great.rpm'
path = '/tmp/working/great.rpm'
downloads = (HttpDownload(url, path) for _ in range(10))

with Batch(downloads) as batch:
    for plan in batch():
        if plan.succeeded:
            # Use the downloaded file
            pass
        else:
            # Log something
            pass

Layer 2

Content Changes To A Repository.
The goal is for the plugin writer to only be concerned with determining what content needs to be added/deleted from a repository. Once determined, a generator of RemoteContent to be added and a generator of Content to be removed is used to create a ChangeSet. The term remote content refers to content that exists in the remote repository but does not yet exist locally (in pulp). The ChangeSet is then applied to the repository. The result is a generator of ChangeReport. Each report contains an action (ADDED | DELETED) and the content. The plugin can iterate these reports to determine overall success for failure.

The ChangeSet is capable of detecting the Importer download policy and act accordingly. When content downloading is deferred, catalog entries are created instead of downloading artifact files. Concurrent downloading is handled by the ChangeSet using the layer1 download lib.

The plugin writer does not need to be concerned with whether a content (unit) already exists (or not). The
ChangeSet determines this and acts appropriately.

The concept of RemoteContent and RemoteArtifact is introduced. The term remote is used because it describes content and artifacts that exist in the remote repository but does not exist in the local Pulp repository.

Example:

# determine what needs to be added to the repository
wanted = <generator>

# determine what needs to be delete from the repository
unwanted = <generator>

# apply
changeset = ChangeSet(importer, adds=wanted, deletes=unwanted)
for report in changeset.apply():
    ...

Discussion

The deferred (Lazy) catalog cannot contain everything needed to create a Download object.
The thing that's missing is logic. It may require a specialized Download class
with special error handling that knows to get an authentication token (or something).
Here are a few options:

  1. Add a method the importer. Like: Importer.get_artifact_download().
    The base class implementation would be sufficient for most cases and would rarely need
    to be overridden by a plugin writer. I'm leaning this way.
  2. Provide a mechanism whereby a plugin writer registers a DownloadBuilder.
  3. Other?

Both options make it the responsibility of the plugin writer but #2 may keep the Importer API cleaner.

FAQ

  • Question: Is layer1 just another Nectar?
    Answer: Sort of but no. I really hope not. It is far simpler and delegates all of the
    heavy lifting to requests and concurrent.futures.

  • Question: Why have layer1 at all? Wny not just use requests?
    Answer: Plugin writers are free to use whatever they want. However, requests does not provide
    concurrent downloading. Also, requests is only HTTP. Since downloading needs to be part of the
    plugin API, we need a formal way for plugins to participate. Supporting The Streamer is one example.

  • Question: Is the ChangeSet another Step framework?
    Answer: No. Unlike the step framework, the ChangeSet provides support for a very specific
    part of the importer synchronization workflow.

  • Question: Does the ChangeSet do progress reporting?
    Answer: Yes.

  • Question: How is the memory footprint constrained?
    Answer: Every component is designed to work-with or be a generator. Even a component
    using a Queue will restrict the queue size.

  • Question: For plugins that need to import one kind of content to determine additional content
    that needs to be imported, how would this work?
    Answer: The plugin could use (or chain) multiple ChangeSets. The plugin can iterate the reports
    yielded by the ChangeSet.apply() to determine the content that needs to be added using a subsequent ChangeSet.

  • Question: For docker, downloads need to get and share auth token. How would this work?
    Answer: The Download has a dictionary (context) for sharing information. The Batch ensures
    that all downloads share the same context. The first docker request that detects that it needs
    a new token can obtain it and put it in the context to be shared with other downloads. The docker
    plugin will need to use a subclass of HttpDownload with a custom on_error(). But should only
    require a few lines of code. This is why the plugin is responsible for providing the download object.

  • Question: How are fatal errors handled within any of this machinery?
    Answer: Regardless of which thread or component occurs, an exception is propagated and
    raised to the caller.

  • Question: What if my plugin is downloading with something other than regular HTTP?
    Answer: The Download is a callable and can be implemented to support any protocol.
    The Batch will provided batched, concurrent downloading for any kind of Download.

  • Question What if the plugin needs to download the file before creating the Content.
    For example: In RPM - all RPMs are stored using a SHA256 as part of the unit key even when the
    metadata only includes SHA1. This would mean that the ChangeSet would need to download the file (rpm) and generate the SHA256 before creating the DB records. Or, provide a way for the plugin to participate in this flow. Right?

    Answer Well, this may be flawed to begin with. What about deferred (Lazy) downloading.
    When the download policy is deferred the file will not be available to create the SHA256.
    I think we need to challenge the practice of normalizing the checksum to SHA256 to start with.

  • Question How would mirror lists be handled?

    Answer A subclass of HttpDownload would be provided by the plugin. It would implement resolving the list of mirrors and stash it in the context to be shared with other downloads involved in the same batch download. The last mirror used could also be stored in the context to support round-robin.

This includes a sanity test but eventually needs some unit tests and some combination of functional and perhaps smash tests. The sanity test will not be included in the final PR as-is.

TODO

  • The ChangeSet only has logging roughed in.
  • Test and optimize using real model/DB.
  • Requests (mainly the Session) and urllib3 seem to have known thread safety issues. Look into keep the Session in a thread local to avoid using concurrently.
  • Add DB transaction management in the ChangeSet.

@jortel jortel changed the title Proposed Download API Proposed download package and ChangeSet Dec 6, 2016
@jortel
Copy link
Contributor Author

jortel commented May 9, 2017

This PR replaced by #3006. Easier to rebase to reorganized 3.0-dev using a new branch.

@jortel jortel closed this May 9, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
1 participant