Add support for taking multiple snapshots of websites over time #179

Open · pirate opened this Issue Mar 19, 2019 · 2 comments

pirate (Owner) commented Mar 19, 2019

This is by far the most requested feature.

People want an easy way to take multiple snapshots of websites over time.

This will be easier to do once we've added pywb support, since we'll be able to use timestamped, de-duped WARCs to save each snapshot: #130

For people finding this issue via Google or incoming links: if you want a hacky way to take a second snapshot of a site, you can add the link with a new hash fragment. It will be treated as a new page and a new snapshot will be taken:

echo https://example.com/some/page.html#archivedate=2019-03-18 | ./archive
# then to re-snapshot it on another day...
echo https://example.com/some/page.html#archivedate=2019-03-22 | ./archive
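
To automate that workaround, a rough sketch (assuming the same ./archive entry point as above and a POSIX shell with the standard date utility) is to append the current date as the hash, so each run is treated as a new page:

# hypothetical one-liner: today's date becomes the hash fragment, so the link always looks new
echo "https://example.com/some/page.html#archivedate=$(date +%F)" | ./archive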
n0ncetonic (Contributor) commented Mar 19, 2019

Looking forward to this feature. Thanks for the hacky workaround as well. I have a few pages I'd like to keep monitoring for new content, but I was worried that my current backup could be overwritten by a 404 page if the content went down.

pirate (Owner, Author) commented Mar 19, 2019

I just updated the README to make the current behavior clearer as well:

Running ./archive adds only new, unique links into your data folder on each run. Because it will ignore duplicates and only archive each link the first time you add it, you can schedule it to run on a timer and re-import all your feeds multiple times a day. It will run quickly even if the feeds are large, because it's only archiving the newest links since the last run. For each link, it runs through all the archive methods. Methods that fail will save None and be automatically retried on the next run; methods that succeed save their output into the data folder and are never retried or overwritten by subsequent runs. Support for saving multiple snapshots of each site over time will be added soon (along with the ability to view diffs of the changes between runs).
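
As an example of that scheduling pattern, a minimal sketch (the crontab entry, paths, and feed URL below are placeholders, not from the README) could look like:

# re-import a bookmarks feed every 6 hours; links that are already archived are skipped
0 */6 * * *  cd /path/to/ArchiveBox && curl -s https://example.com/bookmarks.rss | ./archive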
