archivekit provides a mechanism for storing a (large) set of immutable documents and data files in an organized way. Transformed versions of each file can be stored alongside the original data in order to reflect a complete processing chain. Metadata is kept with the data as a YAML file.
The easiest way to install archivekit is via PyPI:

```shell
$ pip install archivekit
```
Alternatively, check out the repository from GitHub and install it locally:
```shell
$ git clone https://github.com/pudo/archivekit.git
$ cd archivekit
$ python setup.py develop
```
archivekit stores data as Packages, which contain one or several Resources and their associated metadata. Each Package is part of a Collection.
```python
from archivekit import open_collection, Source

# open a collection of packages
collection = open_collection('file', path='/tmp')

# or via S3:
collection = open_collection('s3', aws_key_id='..', aws_secret='..',
                             bucket_name='test.pudo.org')

# import a file from the local working directory:
collection.ingest('README.md')

# import an http resource:
collection.ingest('http://pudo.org/index.html')
# ingest will also accept file objects and httplib/urllib/requests responses

# iterate through each document and set a metadata value:
for package in collection:
    for source in package.all(Source):
        with source.fh() as fh:
            source.meta['body_length'] = len(fh.read())
    package.save()
```
The code for this library is very compact; go check it out.
If AWS credentials are not supplied for an S3-based collection, the library will attempt to use the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. AWS_BUCKET_NAME is also supported.
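With the environment-variable fallback, credentials can be kept out of the code entirely. A typical shell setup might look like the following (placeholder values; the variable names assume the standard AWS-style naming mentioned above):

```shell
# set AWS credentials for the current shell session so that
# open_collection('s3', ...) can be called without key arguments
$ export AWS_ACCESS_KEY_ID='your-key-id'
$ export AWS_SECRET_ACCESS_KEY='your-secret-key'
$ export AWS_BUCKET_NAME='test.pudo.org'
```

With these set, `open_collection('s3', bucket_name=...)` (or, with the bucket variable set as well, `open_collection('s3')`) should pick up the credentials from the environment.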
archivekit is open source, licensed under a standard MIT license (included in this repository as LICENSE).