WebWar is a proof-of-concept web archival tool that saves all data flowing through an HTTP proxy.
Right now I'm just cheap and using mitmproxy for the heavy lifting. B3
- Edit `mitm_archive_http.py` and change the `DB` to where you want to save stuff
- Run `mitmproxy` with `--script` set as the path to `mitm_archive_http.py`
- Set up your web browser to use the proxy (probably `127.0.0.1` port `8080`)
- Browse around to save your shit :3
I wrote a really shitty browser that can read back the saved files. Edit the path in there to point to your DB and then run `python ./netwar_browser.py` :)
You can then visit sites like `http://localhost:8000/https://furaffinity.net/user/knot126`.
Note that you need to get the page name exactly right (e.g. `www.furaffinity.net` != `furaffinity.net` and `example.com/` != `example.com`; there will be some way to correct this later).
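
To show why the page name has to match exactly, here is a minimal sketch of how a viewer path could be split into a domain folder and an archived URL. The function name `archived_url_from_path` is made up for this example, and the real `netwar_browser.py` may well do it differently.

```python
from urllib.parse import urlsplit


def archived_url_from_path(request_path: str) -> tuple[str, str]:
    """Split a viewer path like '/https://example.com/' into (domain, url).

    Hypothetical helper: assumes the archived URL is everything after the
    first '/' of the request path, and that the domain folder is the host.
    """
    url = request_path.removeprefix("/")
    domain = urlsplit(url).hostname
    if domain is None:
        raise ValueError(f"no host found in {url!r}")
    return domain, url


# No normalisation happens, so these end up in different domain folders
# and refer to different URLs.
with_www = archived_url_from_path("/https://www.furaffinity.net/user/knot126")
without_www = archived_url_from_path("/https://furaffinity.net/user/knot126")
```

Because nothing is normalised, `with_www` and `without_www` differ in both parts, which is the mismatch described above.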
The archive "database" is a simple content-addressed storage system, sorted per domain, with a `map.json` file mapping URIs and time of archival to content and headers. Content files are named after the hex of their SHA-256 hash and stored in the domain folder - that is, alongside the `map.json`.
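
A minimal sketch of the content-addressed part, assuming the domain folder is named after the URL's host. `store_blob` and its parameters are hypothetical names for illustration, not what `mitm_archive_http.py` actually uses.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlsplit


def store_blob(archive_root: Path, url: str, data: bytes) -> str:
    """Save data in the domain folder for url and return its SHA-256 hex name."""
    domain = urlsplit(url).hostname  # assumption: folder name == host
    if domain is None:
        raise ValueError(f"no host found in {url!r}")

    digest = hashlib.sha256(data).hexdigest()
    domain_dir = archive_root / domain
    domain_dir.mkdir(parents=True, exist_ok=True)

    blob_path = domain_dir / digest
    if not blob_path.exists():
        # Identical content hashes to the same name, so it is written once.
        blob_path.write_bytes(data)
    return digest
```

Since the file name is derived from the bytes, saving the same response twice does not create a second copy.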
`map.json` is a simple array of objects with the following properties:
- `url`: URL for this capture
- `time`: UNIX timestamp of the capture
- `content`: Hash of the saved contents
- `headers`: Hash of the response headers (should be optional but currently required for browser)
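
As a rough sketch of working with this format, the snippet below appends a capture to a domain's `map.json` and looks up the newest capture for a URL. The helper names are hypothetical, and storing `time` as integer seconds is an assumption; the format above only says "UNIX timestamp".

```python
import json
import time
from pathlib import Path


def load_map(domain_dir: Path) -> list:
    """Read map.json from a domain folder, or return [] if it does not exist."""
    map_path = domain_dir / "map.json"
    return json.loads(map_path.read_text()) if map_path.exists() else []


def append_capture(domain_dir: Path, url: str, content_hash: str,
                   headers_hash: str, timestamp=None) -> dict:
    """Add one capture entry with the four properties listed above."""
    entries = load_map(domain_dir)
    entry = {
        "url": url,
        "time": int(time.time()) if timestamp is None else timestamp,
        "content": content_hash,
        "headers": headers_hash,
    }
    entries.append(entry)
    (domain_dir / "map.json").write_text(json.dumps(entries, indent=2))
    return entry


def latest_capture(domain_dir: Path, url: str):
    """Return the newest entry whose url matches exactly, or None."""
    matches = [e for e in load_map(domain_dir) if e["url"] == url]
    return max(matches, key=lambda e: e["time"]) if matches else None
```

The lookup compares URLs as plain strings, so `https://example.com/` and `https://example.com` are treated as different captures.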
<archive root>
    /www.furaffinity.net
        /f0e4c2f76c58916ec258f246851bea091d14d4247a2fc3e18694461b1816e13b
        /13954213a197701957f334ace6845c1ebcd0a329053c790a8b31c47bc18c83de
        /b0eb9b2e16cd79eb4471af9f7d34de90b69d79b5de4177604e0109f82a83bc54
        / ...
        /map.json
    /example.com
        /a379a6f6eeafb9a55e378c118034e2751e682fab9f2d30ab13d2125586ce1947
        /0efb0ab6e3a4e54c1a3ed2633c8a542125a9945498ae491dfb5d15d9648342d1
        /map.json
- For portability, archives can be compressed into a ZIP file. Domain folders should be stored directly at the root of the archive, and the resulting ZIP file should retain a `.zip` file extension.
- One major pillar of this design is that most of the formats should be easy to understand and based on widely known standards, so that even if this spec document were lost, it would be easy to get content out of the archive files. After all, an archive is useless if it can't be understood!
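
The ZIP packaging described above could be sketched with Python's standard `zipfile` module. `pack_archive` is a hypothetical helper name, and using Deflate compression is an assumption; nothing above mandates a particular method.

```python
import zipfile
from pathlib import Path


def pack_archive(archive_root: Path, zip_path: Path) -> None:
    """Pack an archive so that domain folders sit at the root of the ZIP."""
    if zip_path.suffix.lower() != ".zip":
        raise ValueError("the archive file should keep a .zip extension")

    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(archive_root.rglob("*")):
            if not path.is_file():
                continue
            if path.resolve() == zip_path.resolve():
                continue  # do not pack the ZIP into itself
            # Paths relative to the archive root keep domain folders top-level.
            zf.write(path, arcname=path.relative_to(archive_root).as_posix())
```

Entries then look like `example.com/map.json`, so any ordinary unzip tool gives back the same folder layout.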