
Compatibility Web archives (WaybackMachines) #372

Closed
7 tasks done
boogheta opened this issue Dec 12, 2019 · 4 comments
Comments

@boogheta
Member

boogheta commented Dec 12, 2019

To allow crawling the past through some kind of Internet Archive relying on OpenWayback (such as web.archive.org), only a few changes should be required:

  • in the configuration, we would need an archive_host_prefix (e.g. https://web.archive.org/web/) and an archive_timestamp (such as 20190319191212) around which pages should be crawled
  • the crawler's spider shall rewrite all URLs to crawl by prefixing them with archive_host_prefix/archive_timestamp/
  • the crawler's spider shall transparently follow redirections to the timestamp actually available for each page
  • the crawler's spider shall rewrite the URLs of all links collected during the crawl by removing the prefix archive_host_prefix/\d{14}/
  • the crawler's spider shall save, in the metadata of crawled pages, a field with the final timestamp of each crawled page
  • remember to drastically reduce the number of allowed parallel crawls, since they will all run against the same server
  • there should be some way for the frontend to access not the real URLs but the archived ones with the prefix (but maybe we don't want to rewrite it all; typically, since the BNF's archives are not publicly accessible, having such URLs would make no sense)
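The URL rewriting described above could be sketched as follows (the function and constant names are hypothetical, not actual Hyphe code; the config values mirror the examples given in the list):

```python
import re

# Hypothetical config values mirroring the proposed settings
ARCHIVE_HOST_PREFIX = "https://web.archive.org/web/"
ARCHIVE_TIMESTAMP = "20190319191212"

# Prefix to strip from collected links: archive_host_prefix followed by a
# 14-digit timestamp (optionally with a Wayback modifier such as "id_")
ARCHIVE_PREFIX_RE = re.compile(
    re.escape(ARCHIVE_HOST_PREFIX) + r"\d{14}(?:[a-z_]+)?/"
)

def to_archive_url(url):
    """Prefix a live URL so it is fetched from the archive."""
    return "%s%s/%s" % (ARCHIVE_HOST_PREFIX, ARCHIVE_TIMESTAMP, url)

def to_live_url(url):
    """Strip the archive prefix from a link collected during the crawl."""
    return ARCHIVE_PREFIX_RE.sub("", url)
```

The redirection mentioned above means the timestamp in the final response URL may differ from ARCHIVE_TIMESTAMP; that final timestamp is what the spider would store as page metadata.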
@paulgirard
Member

Nice.
Note: the Wayback Machines add a header in the HTML to indicate the available dates. This header is added by JavaScript, so it will not be present when crawled by a no-JS spider in Hyphe, which is a good thing for text analysis.

@boogheta
Member Author

boogheta commented Dec 13, 2019

The BNF's archives work a bit differently and would require different changes. Identified so far:

  • set up the proxy and a timestamp in the configuration
  • when creating a corpus, call http://archivesinternet.bnf.fr/DESIREDTIMESTAMP/http://www.bnf.fr and collect session info (some SESSIONID to reuse as a header later; present in the HTTP headers of the response?), then call the desired timestamp for all URLs using the same forged SESSIONID
  • add to all of the crawler's requests the proper HTTP header "BnF-OSWM-Username: SESSIONID"
  • collect metadata from the banner added in the HTML and store the timestamp of each archived page received
  • skip URL rewriting (it is a proxy, so there is no rewriting in the BNF archives)
  • remove the BNF banner at the top (included within the HTML, not added by JS)
  • enable/disable the BNF archives from Hyphe's global config, since they are inaccessible from outside the BNF
  • use timerange URLs? e.g. http://pfcarchivesinternet.bnf.fr/20161125000000-20171201000000/http://www.medialab.sciencespo.fr/ Warning: this returns the snapshot closest to the upper limit, not the middle
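Forging the BNF proxy requests could look like the sketch below. The header name comes from the list above; the host, function names and the exact session mechanism are assumptions (the note itself is unsure where the SESSIONID comes from):

```python
# Hypothetical host for the BNF archives proxy (taken from the notes above)
BNF_PROXY = "http://archivesinternet.bnf.fr"

def bnf_archive_url(url, start, end=None):
    """Prefix a live URL with a timestamp, or a timerange when `end` is
    given. Warning: with a timerange, the proxy returns the snapshot
    closest to the upper limit, not the middle."""
    spec = start if end is None else "%s-%s" % (start, end)
    return "%s/%s/%s" % (BNF_PROXY, spec, url)

def bnf_headers(session_id):
    """HTTP header to attach to every one of the crawler's requests."""
    return {"BnF-OSWM-Username": session_id}
```

Since the proxy serves pages under their original URLs, no link rewriting is needed on the way back, unlike in the OpenWayback case.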

boogheta added a commit that referenced this issue May 26, 2021
…in html + rewrite all links found + store archive date as meta + skip internal archive links (WIP #372)
boogheta added a commit that referenced this issue May 26, 2021
@boogheta
Member Author

boogheta commented May 26, 2021

Complementary ideas or todos:

  • fix bad 302s not being followed, for instance from sciencespo on regardscitoyens.org
  • add a daterange setting to only keep archives within the daterange
  • add a disclaimer in the frontend regarding slow crawls caused by archives
  • display in the crawl's summary and details in the frontend whether an archive was used
  • allow enabling and configuring archives crawl by crawl, not only globally in the backend
  • add options in the frontend to set up archives for a single crawl
  • use a datepicker
  • use the datepicker in the crawl-by-crawl settings
  • adjust the crawl lookup tests in the frontend
  • check whether the crawler's resolver should deal with anything new
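The daterange setting mentioned above could amount to a simple filter on the 14-digit archive timestamps (a sketch; the function name and setting names are hypothetical):

```python
from datetime import datetime

def within_daterange(timestamp, date_min, date_max):
    """Keep an archived page only if its 14-digit timestamp
    (YYYYMMDDHHMMSS) falls within [date_min, date_max]."""
    # Fixed-width digit strings would also compare correctly
    # lexicographically, but parsing validates the format
    ts = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    lo = datetime.strptime(date_min, "%Y%m%d%H%M%S")
    hi = datetime.strptime(date_max, "%Y%m%d%H%M%S")
    return lo <= ts <= hi
```

A page whose timestamp falls outside the range would then be trashed rather than stored, as the June 18 commit message above suggests.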

@boogheta boogheta changed the title Compatibility WaybackMachines Compatibility Web archives (WaybackMachines) Jun 1, 2021
boogheta added a commit that referenced this issue Jun 7, 2021
boogheta pushed a commit that referenced this issue Jun 10, 2021
boogheta added a commit that referenced this issue Jun 10, 2021
boogheta pushed a commit that referenced this issue Jun 17, 2021
boogheta pushed a commit that referenced this issue Jun 17, 2021
boogheta added a commit that referenced this issue Jun 18, 2021
…imestamp and trash it if outside desired range (#372)
boogheta added a commit that referenced this issue Jun 22, 2021
boogheta added a commit that referenced this issue Jun 22, 2021
boogheta pushed a commit that referenced this issue Jun 25, 2021
boogheta pushed a commit that referenced this issue Jun 25, 2021
boogheta pushed a commit that referenced this issue Jun 25, 2021
boogheta added a commit that referenced this issue Jun 30, 2021
boogheta added a commit that referenced this issue Jul 6, 2021
…r single archives crawls in corpus not set for archives (#372)
boogheta added a commit that referenced this issue Jul 6, 2021
boogheta pushed a commit that referenced this issue Jul 7, 2021
boogheta added a commit that referenced this issue Jul 7, 2021
boogheta added a commit that referenced this issue Jul 9, 2021
@boogheta
Member Author

Ideas left aside:

  • optionally auto-retry failed crawls on the archives with the last available date? (i.e. date = today + range = infinity)
  • what to do when recrawling an entity over both live web and archive: overwrite or complete?
  • add metadata on the archive date in the traph
