Sync now skips most steps if the feed metadata has not changed. #676

mhrivnak · 2015-04-20T19:52:36Z

closes #2

https://pulp.plan.io/issues/2

mhrivnak · 2015-04-20T20:19:44Z

plugins/pulp_rpm/plugins/importers/yum/sync.py

-        return metadata_files
+        scratchpad = self.sync_conduit.get_scratchpad() or {}
+        previous_revision = scratchpad.get(constants.REPOMD_REVISION_KEY, 0)
+        previous_skip_set = set(scratchpad.get(constants.PREVIOUS_SKIP_LIST, []))


The idea here is that if a previous sync has the skip_list set to skip a type, and then the user removes that from the skip list and does a sync, we don't want to skip the sync just because the metadata didn't change.

So if anything was removed from the previous skip list, we will do a full sync.

That makes sense, thanks.

beav · 2015-04-20T21:35:14Z

assigning back to @mhrivnak per IRC convo

beav · 2015-04-20T21:50:30Z

@mhrivnak Would it make more sense to check timestamp from the primary data instead of revision? That might be what we are looking for. This is also what yum uses when examining metadata to determine if something is new.

for example:

<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
 <revision>1429551055</revision>
<data type="filelists">
  <checksum type="sha256">5cc15042c277bfbf8b04e8b1eadc9cf998016529663d504e93a79e0bff0246fa</checksum>
  <open-checksum type="sha256">df79c6949d62ecaff4f53bd4758b02b445f7eddd3a5b92eb8e6bdccfcd6203dd</open-checksum>
  <location href="repodata/5cc15042c277bfbf8b04e8b1eadc9cf998016529663d504e93a79e0bff0246fa-filelists.xml.gz"/>
  <timestamp>1429551057</timestamp>
  <size>312</size>
  <open-size>522</open-size>
</data>

revision is the epoch time but can be freeform text passed in by the user to createrepo. However, the timestamp might be a better candidate for what we are doing.

Another thought is that we could just save a hash of the repomd.xml itself and save it in the scratch pad 🐨. If the hash doesn't change, it would be OK to skip the other steps. This would free us from having to worry about specifics of repomd.xml creation.

mhrivnak · 2015-04-21T13:52:26Z

Good thoughts. I didn't go with the timestamp on primary.xml because it only covers RPMs. In a case where there are drpms, errata, groups, or any other custom metadata files, we need to know if any of those changed either.

revision seems like the "right" field for this. It is the part of the schema that is clearly meant to express whether or not the content has changed in a meaningful way (although docs would be nice to confirm the intent). I agree that using the file's checksum to detect change would work and could even be more reliable, but it removes the opportunity for someone to use the revision field appropriately. For example, someone could generate repo metadata without sqlite files, and then add them later, keeping the same revision.

In any case, I can't find an example of a yum repo where the revision is not an integer. I think we should give that a try, and if we find a use case where someone is committed to doing something different, we can consider fall-back strategies, like using the checksum.

We could additionally use cache-related HTTP headers to determine when this file, or others, hasn't changed, but that would be gravy.

beav · 2015-04-21T15:46:34Z

👍 sounds good, I am OK with using revision.

I don't remember if we need more changes or not; just assign back to me when it's time to re-review.

mhrivnak · 2015-04-21T17:26:28Z

Great. I am adding the improvement that it will default to full-sync if the revision can't be parsed as an int.

beav · 2015-04-21T19:06:34Z

I like the "if zero then always sync" approach. LGTM

closes pulp#2

Sync now skips most steps if the feed metadata has not changed.

mhrivnak added the enhancement label Apr 20, 2015

beav self-assigned this Apr 20, 2015

mhrivnak reviewed Apr 20, 2015
View reviewed changes

beav assigned mhrivnak and unassigned beav Apr 20, 2015

mhrivnak force-pushed the sync-opt-backport branch from db2169f to 9a140c7 Compare April 21, 2015 18:41

mhrivnak assigned beav and unassigned mhrivnak Apr 21, 2015

beav added the LGTM label Apr 21, 2015

Sync now skips most steps if the feed metadata has not changed.

183cb1e

closes pulp#2

mhrivnak force-pushed the sync-opt-backport branch from 9a140c7 to 183cb1e Compare April 21, 2015 20:07

mhrivnak added a commit that referenced this pull request Apr 21, 2015

Merge pull request #676 from mhrivnak/sync-opt-backport

3f52ca9

Sync now skips most steps if the feed metadata has not changed.

mhrivnak merged commit 3f52ca9 into pulp:2.6-dev Apr 21, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sync now skips most steps if the feed metadata has not changed. #676

Sync now skips most steps if the feed metadata has not changed. #676

mhrivnak commented Apr 20, 2015

mhrivnak Apr 20, 2015

beav Apr 20, 2015

beav commented Apr 20, 2015

beav commented Apr 20, 2015

mhrivnak commented Apr 21, 2015

beav commented Apr 21, 2015

mhrivnak commented Apr 21, 2015

beav commented Apr 21, 2015

Sync now skips most steps if the feed metadata has not changed. #676

Sync now skips most steps if the feed metadata has not changed. #676

Conversation

mhrivnak commented Apr 20, 2015

mhrivnak Apr 20, 2015

Choose a reason for hiding this comment

beav Apr 20, 2015

Choose a reason for hiding this comment

beav commented Apr 20, 2015

beav commented Apr 20, 2015

mhrivnak commented Apr 21, 2015

beav commented Apr 21, 2015

mhrivnak commented Apr 21, 2015

beav commented Apr 21, 2015