martinsbalodis/warc-content

warc-content

A simple WARC archive content browser. This tool takes WARC archives as input, indexes them, and serves a simple web page where you can browse the crawled URLs in a tree grid.
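
To illustrate the tree grid idea, here is a minimal sketch of how crawled URLs could be grouped into a tree by host and path segment. It is not taken from warccontent.py: the function names `build_url_tree` and `print_tree` and the sample URLs are assumptions made for this example, and it is written for Python 3.

```python
from urllib.parse import urlsplit


def build_url_tree(urls):
    """Group URLs into a nested dict keyed by host, then by path segment."""
    tree = {}
    for url in urls:
        parts = urlsplit(url)
        # The host is the top-level key, followed by each non-empty path segment.
        keys = [parts.netloc] + [seg for seg in parts.path.split("/") if seg]
        node = tree
        for key in keys:
            node = node.setdefault(key, {})
    return tree


def print_tree(node, indent=0):
    """Print the nested dict as an indented outline, sorted by name."""
    for name in sorted(node):
        print("  " * indent + name)
        print_tree(node[name], indent + 1)


# Hypothetical crawled URLs, for illustration only.
sample_urls = [
    "http://example.com/news/article-1",
    "http://example.com/news/article-2",
    "http://example.com/calendar/2012/05",
]
print_tree(build_url_tree(sample_urls))
```

Query strings are ignored in this sketch, so URLs that differ only in their parameters land on the same node.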

I personally use this tool to locate useless links in crawled pages, for example calendars, print pages, and image generators.

This is how the webpage looks:

warc content webpage example

Usage

./warccontent.py ~/warcs/*.warc.gz

Wait until the data is indexed, then open http://localhost:8080/ in your browser.

Features to add in the future

  • content size counter
  • regex tool to test against URLs
  • multi-core support for those gzipped archives
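
As a rough idea of what the planned regex tool might do, the sketch below tests a regular expression against a list of URLs and returns the matches. The function name `match_urls`, the sample URLs, and the pattern are all assumptions for illustration; none of this exists in the tool yet.

```python
import re


def match_urls(pattern, urls):
    """Return the URLs that the regular expression matches anywhere."""
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]


# Hypothetical crawled URLs, for illustration only.
urls = [
    "http://example.com/news/article-1",
    "http://example.com/news/article-1?print=1",
    "http://example.com/calendar/2012/05",
]

# Hypothetical pattern covering print pages and calendar pages.
for url in match_urls(r"[?&]print=1|/calendar/", urls):
    print(url)
```

Using `search` instead of `match` lets a pattern hit anywhere in the URL, which suits exclusion rules that target a path fragment or a query parameter.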

Known issues

  • The warc-tools library doesn't handle large files within archives well. Large files can cause a MemoryError.
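
A general way to keep memory use bounded is to read compressed data in fixed-size chunks instead of loading it all at once. The sketch below shows that technique with Python's standard `gzip` module only. It does not use or change warc-tools, and the function name `decompressed_size` and the 64 KiB chunk size are assumptions chosen for this example.

```python
import gzip
import io


def decompressed_size(fileobj, chunk_size=64 * 1024):
    """Return the decompressed size of a gzip stream, reading it in chunks."""
    total = 0
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as stream:
        # iter() with a sentinel stops once read() returns b"" at end of stream.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            total += len(chunk)
    return total


# Demo with an in-memory archive; a real run would pass open(path, "rb").
demo = io.BytesIO(gzip.compress(b"x" * 200000))
print(decompressed_size(demo))
```

Only one chunk is held in memory at a time, so the peak stays near `chunk_size` regardless of how large the archived file is.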

License

GPLv3
