Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync now skips most steps if the feed metadata has not changed. #676

Merged
merged 1 commit into from
Apr 21, 2015

Conversation

mhrivnak
Copy link
Contributor

@beav beav self-assigned this Apr 20, 2015
return metadata_files
scratchpad = self.sync_conduit.get_scratchpad() or {}
previous_revision = scratchpad.get(constants.REPOMD_REVISION_KEY, 0)
previous_skip_set = set(scratchpad.get(constants.PREVIOUS_SKIP_LIST, []))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is that if a previous sync has the skip_list set to skip a type, and then the user removes that from the skip list and does a sync, we don't want to skip the sync just because the metadata didn't change.

So if anything was removed from the previous skip list, we will do a full sync.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, thanks.

@beav
Copy link
Contributor

beav commented Apr 20, 2015

assigning back to @mhrivnak per IRC convo

@beav beav assigned mhrivnak and unassigned beav Apr 20, 2015
@beav
Copy link
Contributor

beav commented Apr 20, 2015

@mhrivnak Would it make more sense to check timestamp from the primary data instead of revision? That might be what we are looking for. This is also what yum uses when examining metadata to determine if something is new.

for example:

<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
 <revision>1429551055</revision>
<data type="filelists">
  <checksum type="sha256">5cc15042c277bfbf8b04e8b1eadc9cf998016529663d504e93a79e0bff0246fa</checksum>
  <open-checksum type="sha256">df79c6949d62ecaff4f53bd4758b02b445f7eddd3a5b92eb8e6bdccfcd6203dd</open-checksum>
  <location href="repodata/5cc15042c277bfbf8b04e8b1eadc9cf998016529663d504e93a79e0bff0246fa-filelists.xml.gz"/>
  <timestamp>1429551057</timestamp>
  <size>312</size>
  <open-size>522</open-size>
</data>

revision is the epoch time but can be freeform text passed in by the user to createrepo. However, the timestamp might be a better candidate for what we are doing.

Another thought is that we could just save a hash of the repomd.xml itself and save it in the scratch pad 🐨. If the hash doesn't change, it would be OK to skip the other steps. This would free us from having to worry about specifics of repomd.xml creation.

@mhrivnak
Copy link
Contributor Author

Good thoughts. I didn't go with the timestamp on primary.xml because it only covers RPMs. In a case where there are drpms, errata, groups, or any other custom metadata files, we need to know if any of those changed either.

revision seems like the "right" field for this. It is the part of the schema that is clearly meant to express whether or not the content has changed in a meaningful way (although docs would be nice to confirm the intent). I agree that using the file's checksum to detect change would work and could even be more reliable, but it removes the opportunity for someone to use the revision field appropriately. For example, someone could generate repo metadata without sqlite files, and then add them later, keeping the same revision.

In any case, I can't find an example of a yum repo where the revision is not an integer. I think we should give that a try, and if we find a use case where someone is committed to doing something different, we can consider fall-back strategies, like using the checksum.

We could additionally use cache-related HTTP headers to determine when this file, or others, hasn't changed, but that would be gravy.

@beav
Copy link
Contributor

beav commented Apr 21, 2015

👍 sounds good, I am OK with using revision.

I don't remember if we need more changes or not; just assign back to me when it's time to re-review.

@mhrivnak
Copy link
Contributor Author

Great. I am adding the improvement that it will default to full-sync if the revision can't be parsed as an int.

@beav
Copy link
Contributor

beav commented Apr 21, 2015

I like the "if zero then always sync" approach. LGTM

@beav beav added the LGTM label Apr 21, 2015
mhrivnak added a commit that referenced this pull request Apr 21, 2015
Sync now skips most steps if the feed metadata has not changed.
@mhrivnak mhrivnak merged commit 3f52ca9 into pulp:2.6-dev Apr 21, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants