dgit - Lightweight "Git Wrapper for Datasets"

Join the chat at https://gitter.im/pingali/dgit

Note: Code is alpha. It is being improved. Feedback welcome.

dgit is an application on top of git.

Data scientists spend much of their time generating, shaping, and using datasets. dgit makes it possible to organize and use datasets with minimal effort.

dgit uses git for version management but structures the repository content and interface to suit data management tasks.

dgit is agnostic to the form and content of the datasets and post-processing scripts. It tries to stay in sync with the best available dataset standards (WIP).

Read documentation

Slides on Scaling Data Science with dgit, presented at the R Data Science Meetup, Bangalore

Setup

Note that only Python 3 and Ubuntu are supported for now.

# Dependencies (Ubuntu commands for lxml dependency)
$ sudo apt-get install libxml2-dev libxslt1-dev python3-dev git zlib1g-dev

# Prepare the environment
$ virtualenv -p /usr/bin/python3 env
$ . env/bin/activate

# Install dgit
$ pip install dgit

# Optional
$ pip install dgit_extensions

# Generate overall configuration file
$ dgit config init
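
Before running the setup commands, it can help to confirm the basic prerequisites. The following is a minimal sketch, not part of dgit: the function name check_prerequisites is hypothetical, and the platform check only confirms a Linux kernel, not Ubuntu specifically.

```python
import platform
import shutil
import sys


def check_prerequisites():
    """Return a dict of prerequisite name -> bool for the setup steps above."""
    return {
        # Only Python 3 is supported for now.
        "python3": sys.version_info.major == 3,
        # git must be on PATH, since dgit uses git for version management.
        "git": shutil.which("git") is not None,
        # Checks the kernel family only; it cannot tell Ubuntu from other distros.
        "linux": platform.system() == "Linux",
    }
```

Printing the returned dict shows which prerequisite, if any, is missing before pip install dgit is attempted.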

Tutorial

We show how to create a simple dataset that is a git repo with s3 as the backend.

dgit has an auto mode in which it tries to do as much work as possible using a combination of configuration and intelligent defaults. The first time you run it, it asks a few questions and uses the answers to generate a configuration file. You can edit this file at any time. When we run dgit auto, it uses the configuration to determine what to do.

# One command to rule them all!
$ dgit auto

dgit scans the working directory for changes and automatically commits them to the dataset.
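
To make the idea of scanning for changes concrete, here is a minimal sketch of one way to detect new or modified files that match a set of include patterns. It is an illustration only and not dgit's implementation: the names scan_changes and file_digest, the use of SHA-256 digests, and the glob-style patterns are all assumptions.

```python
import fnmatch
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as fd:
        for chunk in iter(lambda: fd.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def scan_changes(root, include, known):
    """Return relative paths under root that match an include pattern and
    are new or modified relative to `known` (a dict of path -> digest)."""
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control internals.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if not any(fnmatch.fnmatch(rel, pat) for pat in include):
                continue
            # A file counts as changed if it is unknown or its digest differs.
            if known.get(rel) != file_digest(os.path.join(root, rel)):
                changed.append(rel)
    return sorted(changed)
```

For example, scan_changes(".", ["*.txt"], {}) lists every matching text file as new, because nothing has been recorded yet.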

1. Clone/create a model directory (may contain scripts and other files)

$ git clone https://gitlab.com/pingali/simple-regression.git
$ cd simple-regression
$ ls
regression2.py  regression.py

# setup for regression
$ pip install numpy pandas statsmodels

2. Create a dgit configuration file

$ dgit auto
Let us know a few details about your data repository
Please specify username [pingali]
Please specify repo name [simple-regression]
Please specify remote URL [s3://mybucket/git/pingali/simple-regression.git]
One line summary of your repo: Simple regression model
Add any more details:

Updated dataset specific config file: dgit.json
Please edit it and rerun dgit auto.
Tip: Consider committing dgit.json to the code repository.

3. Bootstrap the dataset. This captures any files that match the include pattern

$ dgit auto
Repo doesnt exist. Should I create one? [yN]y
Adding: datapackage.json
Adding: .gitignore

4. Run the model and update the dataset

$ ./regression.py
$ ls
dgit.json  regression2.py  regression.py model-results.txt

$ dgit auto
Adding: model-results.txt
Quick summary of changes? One run of the model

5. If a dataset metadata server is enabled, the previous command also posts to the server.

...
Collecting all the required metadata to post
Adding preview for  model-results.txt
Add commit data for model.py
Added platform information
Adding validation information
No dependencies
Computing diffs
Posting to http://<server>

6. Explicit push to s3/backend. This can be enabled automatically through dgit.json if needed.

...
remote: upload: hooks/post-update.sample to s3://appsloka/git/pingali/simple-regression.git/hooks/post-update.sample
remote: upload: refs/heads/master to s3://appsloka/git/pingali/simple-regression.git/refs/heads/master
remote: upload: ./config to s3://appsloka/git/pingali/simple-regression.git/config
To /home/pingali/.dgit/git/pingali/simple-regression.git
  * [new branch]      master -> master
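
The transcript above suggests a naming pattern for the remote URL and the local repository path. The sketch below reproduces that pattern; it is inferred only from the sample output, so dgit's actual defaults may differ, and both function names are hypothetical.

```python
import os


def default_remote_url(bucket, username, reponame):
    """Build an s3 remote URL shaped like the default offered in the transcript."""
    return f"s3://{bucket}/git/{username}/{reponame}.git"


def local_repo_path(username, reponame, home=None):
    """Build a local repo path shaped like the one shown in the push output."""
    # Fall back to the current user's home directory when none is given.
    home = home or os.path.expanduser("~")
    return os.path.join(home, ".dgit", "git", username, f"{reponame}.git")
```

With bucket mybucket, user pingali, and repo simple-regression, the first function yields the same URL that the configuration prompt offers as its default.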

Usage

Read documentation for details on the commands supported.

$ dgit
Usage: dgit [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  add-files  Add files to the repo
  auto       Auto mode of operation
  clone      Clone a git URL
  commit     Commit repo data
  config     Create configuration file (~/.dgit.ini)
  diff       Show the diff between two commits
  drop       Drop dataset
  init       Bootstrap a new dataset (a git repo+s3...
  list       List datasets
  log        Gather the log details
  plugins    Plugin management
  post       Post metadata (only) to thirdparty server
  push       Gather the log details
  remote     Manage remote
  rm         Delete files from repo
  sh         Run generic shell commands in repo
  show       Show details of commit
  stash      Trash all the changes in the dataset
  status     Status of the repo
  transform  Transform content of the repo
  validate   Validate the content of the repository
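
The commands above can also be driven from a script. Here is a minimal sketch of a thin wrapper around the dgit executable using Python's subprocess module. The helper names build_command and run_dgit are hypothetical, the set of subcommands is copied from the help output above, and no options beyond those shown in this README are assumed.

```python
import subprocess

# Subcommands copied from the help output above.
KNOWN_COMMANDS = {
    "add-files", "auto", "clone", "commit", "config", "diff", "drop",
    "init", "list", "log", "plugins", "post", "push", "remote", "rm",
    "sh", "show", "stash", "status", "transform", "validate",
}


def build_command(command, *args):
    """Return the argv list for a dgit subcommand, rejecting unknown ones."""
    if command not in KNOWN_COMMANDS:
        raise ValueError(f"unknown dgit command: {command}")
    return ["dgit", command, *args]


def run_dgit(command, *args):
    """Run dgit and return its standard output.

    Raises subprocess.CalledProcessError if dgit exits with a non-zero status.
    """
    result = subprocess.run(
        build_command(command, *args),
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

For instance, run_dgit("plugins", "list") runs the same command as $ dgit plugins list. Commands that prompt for input, such as dgit auto on its first run, are better run interactively.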

Available Plugins

This is the base set of plugins supported by the default dgit repo. More extensions are part of dgit-extensions.

$ dgit plugins list
 ========
 backend
 ========
 local (v0) : Local Filesystem Backend
 s3 (v0) : S3 backend

 ========
 repomanager
 ========
 git (v0) : Git-based Repository Manager

 ========
 metadata
 ========
 basic-metadata (v0) : Basic metadata server

 ========
 validator
 ========
 metadata-validator (v0) : Validate integrity of the dataset metadata
 regression-quality-validator (v0) : Check R2 of regression model

 ========
 instrumentation
 ========
 content (v0) : Basic content analysis
 platform (v0) : Execution platform information
 executable (v0) : Executable analysis

 # from dgit_extensions module
 ========
 transformer
 ========
 simple-file-encryptor (v0) : Simple encryptor of files
 mysql-generator (v0) : Materialize queries in dataset
 simple-table-anonymizer (v0) : Simple anonymizer for tables
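
If the plugin listing needs to be consumed by a script, its text layout can be parsed. The following is a minimal sketch that assumes the output looks exactly like the listing above (a category name between rows of = characters, followed by name (version) : description lines). The function name parse_plugin_list is hypothetical, and the format is not a documented interface.

```python
import re

# Matches lines such as "s3 (v0) : S3 backend".
ENTRY = re.compile(r"^(?P<name>\S+) \((?P<version>[^)]+)\) : (?P<description>.+)$")


def parse_plugin_list(text):
    """Parse plugin listing text into a dict of category -> list of plugin dicts."""
    plugins = {}
    category = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, shell prompts, comments, and the ======== separator rows.
        if not line or line[0] in "$#" or set(line) == {"="}:
            continue
        match = ENTRY.match(line)
        if match is None:
            # Any other non-entry line is treated as a category heading.
            category = line
            plugins.setdefault(category, [])
        elif category is not None:
            plugins[category].append(match.groupdict())
    return plugins
```

Passing the listing above to parse_plugin_list gives one key per category, such as backend or validator, each mapping to the plugins listed under it.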

Security and Privacy

Some basic principles adhered to by dgit:

  1. dgit code is open source to enable auditing if needed.
  2. No data ever leaves organizational premises (or even the local machine) without explicit action.
  3. When pushing a data repo to a backend such as s3, it is done using credentials stored on the local machine. Nobody outside the organization can access the repo.
  4. When metadata is posted to any server to enable search, lineage computation, etc., you control what is posted, when, and where.
  5. When data leaves the premises (e.g., dgit post), it is only metadata by default (filenames, timestamps, etc.). Previews, schemas, and similar information can be included, but they must be explicitly added. All metadata being posted is stored in a standard location (datapackage.json) within the data repo. Posting raw data is not supported by design.
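
The metadata-only principle can be illustrated with a short sketch. This is not dgit's implementation: the function names, the resources/path/bytes/modified/preview keys, and the preview length are all assumptions chosen for illustration, and they are not claimed to match the contents of dgit's datapackage.json.

```python
import itertools
import json
import os
import time


def collect_metadata(root, preview_files=(), preview_lines=3):
    """Describe files under root by name, size, and modification time only.

    File contents are read only for paths explicitly listed in preview_files.
    """
    resources = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control internals and keep traversal order stable.
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            info = os.stat(full)
            entry = {
                "path": rel,
                "bytes": info.st_size,
                "modified": time.strftime(
                    "%Y-%m-%dT%H:%M:%SZ", time.gmtime(info.st_mtime)
                ),
            }
            # Previews are opt-in per file, never included by default.
            if rel in preview_files:
                with open(full, errors="replace") as fd:
                    head = itertools.islice(fd, preview_lines)
                    entry["preview"] = [line.rstrip("\n") for line in head]
            resources.append(entry)
    return {"resources": resources}


def to_json(metadata):
    """Serialize the metadata so it can be inspected before anything is posted."""
    return json.dumps(metadata, indent=2, sort_keys=True)
```

Calling collect_metadata(".") never opens a file for reading; content is touched only for files named in preview_files, which mirrors the explicit opt-in described in principle 5.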

Background

Dataset Management Problem

Some persistent problems for data scientists include:

  • Tracking which dataset was used to generate a result
  • Tracing how a dataset was produced in the first place
  • Finding analyses that will be affected by a change in the version of a dataset

The data science domain needs a tool, no more complex than git, to manage these problems. Such a tool:

  • Is simple to deploy and use, and does not impose a certain way of doing things.
  • Does not require coordination with people if there is only one user, but does not prevent coordination and collaboration.
  • Addresses the needs of dataset versioning, including metadata content and representation, and the use of third-party versioning or storage services such as s3 and instabase.

Use case

  • A single code repo may generate many datasets, each of which may have one or more files, during many runs
  • There are usually a large number of small files
  • Datasets are used by non-technical teams including business teams
  • Datasets may be generated outside git repos (e.g., acquisition from third party, software such as simulators)
  • Datasets may be rawdata or data generator scripts
  • Files may be added to datasets over time
  • Datasets may not be able to leave premises
  • Data analysis projects tend to be relatively short (one day to a few months) and are executed by relatively isolated teams (one individual to a few).
  • Auditability and shareability are required, but sharing is not as extensive as in software development. People tend to work on different business problems.

We could force these into one or more git repos, run a git server locally, and/or use GitHub LFS or GitLab annex. We felt that the use case is slightly different from that of software repos.

License

MIT license.

Copyright (c) 2016, Venkata Pingali All rights reserved.

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Contributors

Venkata Pingali (pingali@gmail.com)