Data Object Registry Connect
Switch branches/tags
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
bin
dos_connect
services
test
util
.dockerignore
.gitignore
.gitmodules
README.md
docker-compose-swift.yml
docker-compose-webserver.yml
docker-compose.yml.example
requirements.txt

README.md

DOS connect

concept

Exercise the GA4GH data-object-schemas

image

epics

  • As a researcher, in order to maximize the amount of data I can process from disparate repositories, I can use DOS to harmonize those repositories
  • As a researcher, in order to minimize cloud costs and processing time, I can use DOS' to harmonized data to make decisions about what platform/region I should use to download from or where my code should execute.
  • As a informatician, in order to injest from disparate repositories, I need to injest an existing repository into DOS
  • As a informatician, in order to keep DOS up to date, I need to observe changes to the repository and automatically update DOS
  • As a developer, in order to enable DOS, I need to integrate DOS into my backend stack

capabilities

This project provides two high level capabilities:

  • observation: long lived services to observe the object store and populate a webserver with data-object-schema records. These observations catch add, moves and deletes to the object store.
  • inventory: on demand commands to capture a snapshot of the object store using data-object-schema records.

image

customizations

The data-object-schema is 'unopinionated' in several areas:

  • authentication and authorization is unspecified.
  • no specific backend is specified.
  • 'system-of-record' for id, if unspecified, is driven by the client.

dos_connect addresses these on the server and client by utilizing plugin duck-typing

Server plugins:

  • BACKEND: for storage. Implementations: in-memory and elasticsearch. e.g. BACKEND=dos_connect.server.elastic_backend
  • AUTHORIZER: for AA. noop, keystone, and basic. e.g. AUTHORIZER=dos_connect.server.keystone_api_key_authorizer
  • REPLICATOR: for downstream consumers. noop, keystone e.g. REPLICATOR=dos_connect.server.kafka_replicator

Client plugins:

All observers and inventory tasks leverage a middleware plugin capability.

  • user_metadata(): customize the collection of metadata
  • before_store(): modify the data_object before persisting
  • md5sum(): calculate the md5 of the file
  • id(): customize id e.g. CUSTOMIZER=dos_connect.apps.aws_customizer

To specify your own customizer, set the CUSTOMIZER environmental variable.

For example: AWS S3 returns a special hash for multipart files. The aws_customizer uses a lambda to calculate the true md5 hash of multipart files. Other client customizers include noop, url_as_id, and smmart (obfuscates paths and associates user metadata)

setup

see here

server

Setup: .env file

# ******* webserver
# http port
DOS_CONNECT_WEBSERVER_PORT=<port-number>
# configure backend
BACKEND=dos_connect.server.elasticsearch_backend
ELASTIC_URL=<url>
# configure authorizer
AUTHORIZER=dos_connect.server.keystone_api_key_authorizer
# (/v3)
DOS_SERVER_OS_AUTH_URL=<url>
AUTHORIZER_PROJECTS=<project_name>
# replicator
REPLICATOR=dos_connect.server.kafka_replicator
KAFKA_BOOTSTRAP_SERVERS=<url>
KAFKA_DOS_TOPIC=<topic-name>

Server Startup:

$ alias web='docker-compose -f docker-compose-webserver.yml'
$ web build ; web up -d

Client Startup: note: execute source <openstack-openrc.sh> first

# webserver endpoint
export DOS_SERVER=<url>
# sleep in between inventory runs
export SLEEP=<seconds-to-sleep>
# bucket to monitor
export BUCKET_NAME=<existing-bucket-name>
$ alias client='docker-compose -f docker-compose-swift.yml'
$ client build; client up -d

ohsu implementation:

  • see swagger

  • note: you will need to belong to openstack and provide a token from openstack token issue

image

  • see kafak topic 'dos-events' for stream

  • the kafka queue is populated with

    {'method': method, 'doc': doc}
    

    where doc is a data_object and method is one of ['CREATE', 'UPDATE', 'DELETE']

next steps

  • testing
  • evangelization
  • swagger improvements (403, 401 status codes)