
Compatibility Web archives (WaybackMachines) #372

Closed
7 tasks done
boogheta opened this issue Dec 12, 2019 · 4 comments
Comments

@boogheta
Member

boogheta commented Dec 12, 2019

To allow crawling the past through some kind of Internet Archive relying on OpenWayback (such as web.archive.org), only a few changes should be required:

  • in the configuration, we would need an archive_host_prefix (e.g. https://web.archive.org/web/) and an archive_timestamp (such as 20190319191212) around which pages should be crawled
  • the crawler's spider shall rewrite all URLs to crawl by prefixing them with archive_host_prefix/archive_timestamp/
  • the crawler's spider shall transparently follow redirections to the timestamp actually available for each page
  • the crawler's spider shall rewrite the URLs of all links collected during the crawl by removing the prefix archive_host_prefix/\d{14}/
  • the crawler's spider shall save, in the metadata of crawled pages, a field with the final timestamp of each crawled page
  • remember to drastically reduce the number of allowed parallel crawls, since they will all run against the same server
  • there should be some way for the frontend to access not the real URLs but the archived ones with the prefix (but maybe we don't want to rewrite it all; typically, since the BNF's archives are not publicly accessible, having such URLs would make no sense)
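The URL rewriting described above could be sketched as follows (the function and constant names are hypothetical, not actual Hyphe code; the config values mirror the examples given in the list):

```python
import re

# Hypothetical config values mirroring the proposed settings
ARCHIVE_HOST_PREFIX = "https://web.archive.org/web/"
ARCHIVE_TIMESTAMP = "20190319191212"

# Prefix to strip from collected links: archive_host_prefix followed by a
# 14-digit timestamp (optionally with a Wayback modifier such as "id_")
ARCHIVE_PREFIX_RE = re.compile(
    re.escape(ARCHIVE_HOST_PREFIX) + r"\d{14}(?:[a-z_]+)?/"
)

def to_archive_url(url):
    """Prefix a live URL so it is fetched from the archive."""
    return "%s%s/%s" % (ARCHIVE_HOST_PREFIX, ARCHIVE_TIMESTAMP, url)

def to_live_url(url):
    """Strip the archive prefix from a link collected during the crawl."""
    return ARCHIVE_PREFIX_RE.sub("", url)
```

The redirection mentioned above means the timestamp in the final response URL may differ from ARCHIVE_TIMESTAMP; that final timestamp is what the spider would store as page metadata.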
@paulgirard
Member

Nice.
Note: the Wayback Machines add a header in the HTML to indicate the available dates. This header is added by JavaScript, so it will not be present when crawled by a no-JS spider in Hyphe, which is a good thing for text analysis.

@boogheta
Member Author

boogheta commented Dec 13, 2019

The BNF's archives work a bit differently and would require different changes. Identified so far:

  • set up the proxy and a timestamp in the configuration
  • when creating a corpus, call http://archivesinternet.bnf.fr/DESIREDTIMESTAMP/http://www.bnf.fr and collect session info (some SESSIONID to reuse as a header later; present in the HTTP headers of the response?), then call the desired timestamp for all URLs using the same forged SESSIONID
  • add to all of the crawler's requests the proper HTTP header "BnF-OSWM-Username: SESSIONID"
  • collect metadata from the banner added in the HTML and store the timestamp of each archived page received
  • skip URL rewriting (it is a proxy, so there is no rewriting in the BNF archives)
  • remove the BNF banner at the top (included within the HTML, not added by JS)
  • enable/disable the BNF archives from Hyphe's global config, since they are inaccessible from outside the BNF
  • use timerange URLs? e.g. http://pfcarchivesinternet.bnf.fr/20161125000000-20171201000000/http://www.medialab.sciencespo.fr/ Warning: this returns the snapshot closest to the upper limit, not the middle
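Forging the BNF proxy requests could look like the sketch below. The header name comes from the list above; the host, function names and the exact session mechanism are assumptions (the note itself is unsure where the SESSIONID comes from):

```python
# Hypothetical host for the BNF archives proxy (taken from the notes above)
BNF_PROXY = "http://archivesinternet.bnf.fr"

def bnf_archive_url(url, start, end=None):
    """Prefix a live URL with a timestamp, or a timerange when `end` is
    given. Warning: with a timerange, the proxy returns the snapshot
    closest to the upper limit, not the middle."""
    spec = start if end is None else "%s-%s" % (start, end)
    return "%s/%s/%s" % (BNF_PROXY, spec, url)

def bnf_headers(session_id):
    """HTTP header to attach to every one of the crawler's requests."""
    return {"BnF-OSWM-Username": session_id}
```

Since the proxy serves pages under their original URLs, no link rewriting is needed on the way back, unlike in the OpenWayback case.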

boogheta added a commit that referenced this issue May 26, 2021
…in html + rewrite all links found + store archive date as meta + skip internal archive links (WIP #372)
boogheta added a commit that referenced this issue May 26, 2021
@boogheta
Member Author

boogheta commented May 26, 2021

Complementary ideas or todos:

  • fix bad 302s not being followed, for instance from sciencespo on regardscitoyens.org
  • add a daterange setting to only keep archives within the daterange
  • add a disclaimer in the frontend regarding slow crawls caused by archives
  • display in the crawl's summary and details in the frontend whether an archive was used
  • allow enabling and configuring archives crawl by crawl, not only globally in the backend
  • add options in the frontend to set up archives for a single crawl
  • use a datepicker
  • use the datepicker in the crawl-by-crawl settings
  • adjust the crawl lookup tests in the frontend
  • check whether the crawler's resolver should deal with anything new
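The daterange setting mentioned above could amount to a simple filter on the 14-digit archive timestamps (a sketch; the function name and setting names are hypothetical):

```python
from datetime import datetime

def within_daterange(timestamp, date_min, date_max):
    """Keep an archived page only if its 14-digit timestamp
    (YYYYMMDDHHMMSS) falls within [date_min, date_max]."""
    # Fixed-width digit strings would also compare correctly
    # lexicographically, but parsing validates the format
    ts = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    lo = datetime.strptime(date_min, "%Y%m%d%H%M%S")
    hi = datetime.strptime(date_max, "%Y%m%d%H%M%S")
    return lo <= ts <= hi
```

A page whose timestamp falls outside the range would then be trashed rather than stored, as the June 18 commit message above suggests.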

@boogheta boogheta changed the title Compatibility WaybackMachines Compatibility Web archives (WaybackMachines) Jun 1, 2021
boogheta added a commit that referenced this issue Jun 7, 2021
boogheta pushed a commit that referenced this issue Jun 10, 2021
boogheta added a commit that referenced this issue Jun 10, 2021
boogheta pushed a commit that referenced this issue Jun 17, 2021
boogheta pushed a commit that referenced this issue Jun 17, 2021
boogheta added a commit that referenced this issue Jun 18, 2021
…imestamp and trash it if outside desired range (#372)
boogheta added a commit that referenced this issue Jun 22, 2021
boogheta added a commit that referenced this issue Jun 22, 2021
boogheta pushed a commit that referenced this issue Jun 25, 2021
boogheta pushed a commit that referenced this issue Jun 25, 2021
boogheta pushed a commit that referenced this issue Jun 25, 2021
boogheta added a commit that referenced this issue Jun 30, 2021
boogheta added a commit that referenced this issue Jul 6, 2021
…r single archives crawls in corpus not set for archives (#372)
boogheta added a commit that referenced this issue Jul 6, 2021
boogheta pushed a commit that referenced this issue Jul 7, 2021
boogheta added a commit that referenced this issue Jul 7, 2021
boogheta added a commit that referenced this issue Jul 9, 2021
@boogheta
Member Author

Ideas left aside:

  • optionally auto-retry failed crawls on the archives with the last available date? (i.e. date = today + range = infinity)
  • what to do when recrawling an entity over both live web and archive: overwrite or complete?
  • add metadata on the archive date in the traph
