httpreserve/tikalinkextract

tika-httpreserve

Tika client for httpreserve

Demo

asciicast

Use with Wget

Extract the links from your files using the -seeds option:

./tikalinkextract -seeds -file archives-nz-demo/ > transferlinks.txt
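Before handing the seeds to Wget, it can be worth tidying the list. The sketch below assumes transferlinks.txt holds one link per line (the form wget -i reads). The clean_seeds function name is hypothetical and is not part of tikalinkextract.

```shell
# Hypothetical helper, not part of tikalinkextract.
# Assumption: the seeds file contains one link per line.
clean_seeds() {
  # Strip carriage returns, keep only lines that start with http:// or
  # https://, then sort and remove duplicates.
  tr -d '\r' < "$1" | grep -E '^https?://' | LC_ALL=C sort -u
}
```

For example, clean_seeds transferlinks.txt > seeds-clean.txt writes a deduplicated list that can be passed to wget -i in place of the original file.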

Use the seeds to generate a WARC file:

wget --page-requisites --span-hosts --convert-links --execute robots=off --adjust-extension --no-directories --directory-prefix=output --warc-cdx --warc-file=accession.warc --wait=0.1 --user-agent=httpreserve-wget/0.0.1 -i transferlinks.txt
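The same command can be spelled out one option at a time, which makes it easier to see what each flag does. This is a sketch only: the make_warc function and its three arguments are hypothetical, while the flags are the ones used above. Note that Wget normally appends its own .warc or .warc.gz suffix to the --warc-file value, so a bare prefix such as accession may be enough.

```shell
# Hypothetical wrapper around the wget command shown above.
# $1: seeds file, $2: WARC name prefix, $3: directory for downloaded files.
make_warc() {
  seeds=$1
  prefix=$2
  outdir=$3
  set --                                    # start with an empty argument list
  set -- "$@" --page-requisites             # also fetch images, CSS and scripts
  set -- "$@" --span-hosts                  # allow requisites hosted elsewhere
  set -- "$@" --convert-links               # rewrite links for local viewing
  set -- "$@" --execute robots=off          # do not honour robots.txt
  set -- "$@" --adjust-extension            # add suffixes such as .html where needed
  set -- "$@" --no-directories              # do not recreate remote directories
  set -- "$@" "--directory-prefix=$outdir"  # where downloaded files are written
  set -- "$@" --warc-cdx                    # write a CDX index next to the WARC
  set -- "$@" "--warc-file=$prefix"         # name prefix of the WARC output
  set -- "$@" --wait=0.1                    # pause between requests, in seconds
  set -- "$@" --user-agent=httpreserve-wget/0.0.1
  set -- "$@" -i "$seeds"                   # read the links from the seeds file
  wget "$@"
}
```

Calling make_warc transferlinks.txt accession output should be equivalent to the one-line command, apart from the WARC name prefix.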

Known Issues

  • HTTP links that are split across lines in the source file, and therefore contain a newline (\n) character, are a known problem.
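To illustrate the kind of input involved, the sketch below runs a simple line-oriented pattern over text in which a link wraps onto a second line. It is an illustration only and does not reproduce how tikalinkextract itself finds links; the extract_links function and the example.com URL are made up for the demonstration.

```shell
# Illustration only: a naive, line-oriented link matcher (hypothetical,
# not the tikalinkextract implementation).
extract_links() {
  grep -Eo 'https?://[^[:space:]]+'
}

# A link wrapped across two lines: the match stops at the newline, so only
# the first part of the link is reported.
printf 'See http://example.com/a/very/long/\npath/page.html for details.\n' |
  extract_links
```

Links recovered this way are truncated, so seeds of this kind are worth checking by hand before they are passed to Wget.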

Resources that might help

License

Tika is licensed as follows: http://www.apache.org/licenses/

This tool is licensed under the GNU General Public License, Version 3.