tika-httpreserve

Tika client for httpreserve

Demo

asciicast

Use with Wget

Extract the links from your files using the -seeds option:

./tikalinkextract -seeds -file archives-nz-demo/ > transferlinks.txt

Use the seeds to generate a WARC file:

wget --page-requisites --span-hosts --convert-links --execute robots=off --adjust-extension --no-directories --directory-prefix=output --warc-cdx --warc-file=accession.warc --wait=0.1 --user-agent=httpreserve-wget/0.0.1 -i transferlinks.txt

Known Issues

  • HTTP links that are split across lines, and therefore contain a newline (\n) character, may not be extracted correctly.

Resources that might help

License

Tika is licensed as follows: http://www.apache.org/licenses/

This tool is licensed under the GNU General Public License, Version 3.