ArchiveKit manages data and documents during ETL processes, either on a local file system or on S3.

archivekit


archivekit provides a mechanism for storing a (large) set of immutable documents and data files in an organized way. Transformed versions of each file can be stored alongside the original data to reflect the complete processing chain. Metadata is kept with the data as a YAML file.

This library is inspired by OFS, BagIt and Pairtree. It replaces a previous project, docstash.

Installation

The easiest way to install archivekit is from PyPI:

$ pip install archivekit

Alternatively, check out the repository from GitHub and install it locally:

$ git clone https://github.com/pudo/archivekit.git
$ cd archivekit
$ python setup.py develop

Example

archivekit manages Packages which contain one or several Resources and their associated metadata. Each Package is part of a Collection.

from archivekit import open_collection, Source

# open a collection of packages
collection = open_collection('file', path='/tmp')

# or via S3:
collection = open_collection('s3', aws_key_id='..', aws_secret='..',
                             bucket_name='test.pudo.org')

# import a file from the local working directory:
collection.ingest('README.md')

# import an http resource:
collection.ingest('http://pudo.org/index.html')
# ingest will also accept file objects and httplib/urllib/requests responses

# iterate through each package and set a metadata value
# on each of its source documents:
for package in collection:
    for source in package.all(Source):
        with source.fh() as fh:
            source.meta['body_length'] = len(fh.read())
    package.save()

The code for this library is very compact; go check it out.

Configuration

If AWS credentials are not supplied for an S3-based collection, archivekit will attempt to read them from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. The bucket name can likewise be supplied through AWS_BUCKET_NAME.
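
The fallback described above can be illustrated with a small standalone sketch. It does not reproduce archivekit's internal code: resolve_s3_settings is a hypothetical helper written for this example, and the assumption that explicitly passed values take precedence over the environment follows from the wording above rather than from the library source. The keyword names are those used in the S3 example earlier in this README.

```python
import os


def resolve_s3_settings(aws_key_id=None, aws_secret=None, bucket_name=None,
                        environ=None):
    """Hypothetical helper (not part of archivekit): fill in any missing
    S3 setting from the environment variables named in this section."""
    # Allow a custom mapping to be passed in so the behaviour can be
    # demonstrated without touching the real process environment.
    env = os.environ if environ is None else environ
    return {
        # Assumption: an explicit argument wins over the environment.
        'aws_key_id': aws_key_id or env.get('AWS_ACCESS_KEY_ID'),
        'aws_secret': aws_secret or env.get('AWS_SECRET_ACCESS_KEY'),
        'bucket_name': bucket_name or env.get('AWS_BUCKET_NAME'),
    }


# Credentials come from the (simulated) environment, the bucket name
# is given explicitly.
settings = resolve_s3_settings(
    bucket_name='test.pudo.org',
    environ={'AWS_ACCESS_KEY_ID': 'key-from-env',
             'AWS_SECRET_ACCESS_KEY': 'secret-from-env'},
)
print(settings)
```

In practice none of this is needed: per the description above, it should be enough to export the variables in the shell and omit the corresponding arguments when opening the S3 collection.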

License

archivekit is open source, licensed under a standard MIT license (included in this repository as LICENSE).