ArchiveKit manages data and documents during ETL processes, either on a local file system or on S3.

archivekit provides a mechanism for storing a (large) set of immutable documents and data files in an organized way. Transformed versions of each file can be stored alongside the original data in order to reflect a complete processing chain. Metadata is kept with the data as a YAML file.

This library is inspired by OFS, BagIt and Pairtree. It replaces a previous project, docstash.


Installation

The easiest way to install archivekit is via PyPI:

$ pip install archivekit

Alternatively, check out the repository from GitHub and install it locally:

$ git clone
$ cd archivekit
$ python setup.py develop


Usage

archivekit manages Packages, which contain one or several Resources and their associated metadata. Each Package is part of a Collection.

from archivekit import open_collection, Source

# open a collection of packages
collection = open_collection('file', path='/tmp')

# or via S3:
collection = open_collection('s3', aws_key_id='..', aws_secret='..',
                             bucket_name='..')

# import a file from the local working directory:
collection.ingest('README.md')

# import an http resource:
collection.ingest('http://example.com/index.html')
# ingest will also accept file objects and httplib/urllib/requests responses

# iterate through each document and set a metadata value:
for package in collection:
    for source in package.all(Source):
        with source.fh() as fh:
            source.meta['body_length'] = len(fh.read())

The code for this library is very compact; go check it out.


Configuration

If AWS credentials are not supplied for an S3-based collection, the application will attempt to use the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. AWS_BUCKET_NAME is also supported.
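
For example, the credentials could be provided via the environment instead of as keyword arguments; the values below are placeholders, not real credentials:

```shell
# Placeholder values -- substitute your own AWS credentials and bucket.
export AWS_ACCESS_KEY_ID='your-key-id'
export AWS_SECRET_ACCESS_KEY='your-secret-key'
export AWS_BUCKET_NAME='your-archive-bucket'
```

With these variables set, an S3-based collection can be opened without passing credentials explicitly.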


License

archivekit is open source, licensed under a standard MIT license (included in this repository as LICENSE).