jaygattuso/WARC_dumper

WARC_dumper

Rudimentary dumper of WARC files via the python warcio lib

This is a very basic script.

You need two things.

A valid WARC file (and its full system path)

A system path for where you want the resulting files to go.


You give it a WARC file and it tries to extract all the binaries into one of two folders.
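As a rough sketch of the extraction step, this is how records can be read with warcio. The function name iter_response_payloads is made up for illustration, and only keeping 'response' records is an assumption; the actual script may look at other record types too.

```python
def iter_response_payloads(warc_path):
    """Yield (target_uri, payload_bytes) for each response record in a WARC file."""
    # Imported inside the function so warcio (pip install warcio) is only
    # needed when the function is actually called.
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # Assumption: only HTTP responses carry the binaries we want.
            if record.rec_type != "response":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            # content_stream() gives the body without the HTTP headers.
            yield uri, record.content_stream().read()
```

Calling it as `for uri, payload in iter_response_payloads("/full/path/to/file.warc")` walks the file one record at a time.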

The filename is pulled from record.rec_headers.get_header('WARC-Target-URI').
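One way to turn that header value into a filename is to take the last segment of the URL path. This is only a sketch: filename_from_uri is a hypothetical helper, and dropping the query string is an assumption, not necessarily what the script does.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_uri(target_uri):
    """Return the last path segment of a URI, or '' if there is none."""
    # urlsplit separates the path from the query string and fragment,
    # so "cat.jpg?size=large" becomes just "cat.jpg".
    path = urlsplit(target_uri).path
    return posixpath.basename(path)


print(filename_from_uri("http://example.com/images/cat.jpg?size=large"))
```

A URI that ends in a slash gives an empty string, which is one of the cases where no usable filename exists.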

media_dump is where any file goes whose name looks like it ends with one of [.jpg, .pdf, .png, .mp4] (add your own extensions to suit).
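The choice between the two folders can be sketched like this. MEDIA_EXTENSIONS and choose_folder are illustrative names, and matching case-insensitively is an assumption on top of what is described here.

```python
import os

# Extend this tuple to route more file types into media_dump.
MEDIA_EXTENSIONS = (".jpg", ".pdf", ".png", ".mp4")


def choose_folder(filename, output_root):
    """Return the folder under output_root that a file should be written to."""
    # str.endswith accepts a tuple, so one call checks every extension.
    if filename.lower().endswith(MEDIA_EXTENSIONS):
        return os.path.join(output_root, "media_dump")
    return os.path.join(output_root, "file_dump")
```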

file_dump is where everything else goes, or at least tries to go. If record.rec_headers.get_header('WARC-Target-URI') results in a bad filename (too long, invalid characters, etc.), the file doesn't get written and the failure is logged on screen.
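The "skip and log" behaviour for bad filenames might look something like the sketch below. write_record is a hypothetical helper and the log message format is made up; the script's real output may differ.

```python
import os


def write_record(folder, filename, payload):
    """Write payload to folder/filename. Return True on success, False if skipped."""
    try:
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, filename), "wb") as handle:
            handle.write(payload)
        return True
    except (OSError, ValueError) as error:
        # OSError covers names that are too long, empty, or contain characters
        # the filesystem rejects; ValueError covers embedded null bytes.
        print(f"Could not write {filename!r}: {error}")
        return False
```

Returning a boolean rather than raising lets the caller carry on with the next record after a failure.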
