dgit - Lightweight "Git Wrapper for Datasets"
Note: The code is alpha and being actively improved. Feedback is welcome.
dgit is an application on top of git.
A lot of data scientists' time goes towards generating, shaping, and using datasets. dgit enables organizing and using datasets with minimal effort.

dgit uses git for version management, but structures the repository content and interface to suit data management tasks.

dgit is agnostic to the form and content of the datasets and post-processing scripts. It tries to stay in sync with the best available dataset standards (work in progress).
See the slides on Scaling Data Science with dgit presented at the R Data Science Meetup, Bangalore.
Note that only Python 3 and Ubuntu are supported for now.
    # Dependencies (Ubuntu commands for lxml dependency)
    $ sudo apt-get install libxml2-dev libxslt1-dev python3-dev git zlib1g-dev

    # Prepare the environment
    $ virtualenv -p /usr/bin/python3 env
    $ . env/bin/activate

    # Install dgit
    $ pip install dgit

    # Optional
    $ pip install dgit_extensions

    # Generate overall configuration file
    $ dgit config init
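To verify that the installation worked, print the usage summary; the full command list appears later in this document:

    # Quick sanity check of the install; prints the dgit usage summary
    $ dgit --help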
We show how to create a simple dataset that is a git repo with s3 as the backend.
dgit has an auto mode in which it tries to do as much work as possible using a combination of configuration and intelligent defaults. When you run it for the first time, it asks a few questions and uses the answers to generate a configuration file, which can be edited at any time. On subsequent runs, dgit auto uses this configuration to determine what to do.
    # One command to rule them all!
    $ dgit auto
dgit scans the working directory for changes and automatically commits them to the dataset.
1. Clone/create a model directory (may contain scripts and other files)
    $ git clone https://gitlab.com/pingali/simple-regression.git
    $ cd simple-regression
    $ ls
    regression2.py  regression.py

    # setup for regression
    $ pip install numpy pandas statsmodels
2. Create a dgit configuration file
    $ dgit auto
    Let us know a few details about your data repository
    Please specify username [pingali]
    Please specify repo name [simple-regression]
    Please specify remote URL [s3://mybucket/git/pingali/simple-regression.git]
    One line summary of your repo: Simple regression model
    Add any more details:
    Updated dataset specific config file: dgit.json
    Please edit it and rerun dgit auto.
    Tip: Consider committing dgit.json to the code repository.
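For illustration only, the generated dgit.json records the answers above along with repo-level settings such as the include pattern used in the next step. The key names below are assumptions, not the documented schema; inspect the generated file for the authoritative layout:

    # Illustrative sketch only -- the actual dgit.json key names may differ
    $ cat dgit.json
    {
        "username": "pingali",
        "reponame": "simple-regression",
        "remoteurl": "s3://mybucket/git/pingali/simple-regression.git",
        "title": "Simple regression model",
        "description": "",
        "include": ["*.txt"]
    }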
3. Bootstrap the dataset. It will capture any files that match the include pattern.
    $ dgit auto
    Repo doesnt exist. Should I create one? [yN]y
    Adding: datapackage.json
    Adding: .gitignore
4. Run the model and update the dataset
    $ ./regression.py
    $ ls
    dgit.json  regression2.py  regression.py  model-results.txt
    $ dgit auto
    Adding: model-results.txt
    Quick summary of changes? One run of the model
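At any point, the repo's state and history can be inspected with the corresponding commands from the command list further below (the exact arguments may vary; output omitted here):

    # Inspect what dgit has recorded so far (output not shown)
    $ dgit status          # status of the repo
    $ dgit log             # log of dataset changes
    $ dgit show <commit>   # details of a specific commit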
5. If a dataset metadata server is enabled, the previous command will also post the collected metadata to the server.
    ...
    Collecting all the required metadata to post
    Adding preview for model-results.txt
    Add commit data for model.py
    Added platform information
    Adding validation information
    No dependencies
    Computing diffs
    Posting to http://<server>
6. Explicitly push to the s3 backend. This can also be made to happen automatically through dgit.json if needed.
    ...
    remote: upload: hooks/post-update.sample to s3://appsloka/git/pingali/simple-regression.git/hooks/post-update.sample
    remote: upload: refs/heads/master to s3://appsloka/git/pingali/simple-regression.git/refs/heads/master
    remote: upload: ./config to s3://appsloka/git/pingali/simple-regression.git/config
    To /home/pingali/.dgit/git/pingali/simple-regression.git
     * [new branch]      master -> master
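To independently confirm that the objects reached the bucket, the standard AWS CLI can list them. This assumes the AWS CLI is installed and configured with the same credentials; the bucket path is the one from the transcript above:

    # List the pushed repository objects in the s3 backend (requires a configured AWS CLI)
    $ aws s3 ls --recursive s3://appsloka/git/pingali/simple-regression.git/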
Read the documentation for details on the supported commands.
    $ dgit
    Usage: dgit [OPTIONS] COMMAND [ARGS]...

    Options:
      --help  Show this message and exit.

    Commands:
      add-files  Add files to the repo
      auto       Auto mode of operation
      clone      Clone a git URL
      commit     Commit repo data
      config     Create configuration file (~/.dgit.ini)
      diff       Show the diff between two commits
      drop       Drop dataset
      init       Bootstrap a new dataset (a git repo+s3...
      list       List datasets
      log        Gather the log details
      plugins    Plugin management
      post       Post metadata (only) to thirdparty server
      push       Gather the log details
      remote     Manage remote
      rm         Delete files from repo
      sh         Run generic shell commands in repo
      show       Show details of commit
      stash      Trash all the changes in the dataset
      status     Status of the repo
      transform  Transform content of the repo
      validate   Validate the content of the repository
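As a rough, hedged sketch of the consumer-side flow (the exact arguments to each command are not spelled out in this summary), an existing dataset can be discovered, cloned, and inspected:

    # Hedged example of consuming an existing dataset (exact arguments may differ)
    $ dgit list                 # list known datasets
    $ dgit clone <dataset-url>  # clone a dataset from its git URL
    $ dgit log                  # inspect the dataset's history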
This is the base set of plugins supported by the default dgit repo. More extensions are part of dgit-extensions.
    $ dgit plugins list
    ======== backend ========
    local (v0) : Local Filesystem Backend
    s3 (v0) : S3 backend
    ======== repomanager ========
    git (v0) : Git-based Repository Manager
    ======== metadata ========
    basic-metadata (v0) : Basic metadata server
    ======== validator ========
    metadata-validator (v0) : Validate integrity of the dataset metadata
    regression-quality-validator (v0) : Check R2 of regression model
    ======== instrumentation ========
    content (v0) : Basic content analysis
    platform (v0) : Execution platform information
    executable (v0) : Executable analysis

    # from dgit_extensions module
    ======== transformer ========
    simple-file-encryptor (v0) : Simple encryptor of files
    mysql-generator (v0) : Materialize queries in dataset
    simple-table-anonymizer (v0) : Simple anonymizer for tables
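The validator and transformer plugins are presumably exercised through the validate and transform commands listed earlier; a hedged sketch, with per-plugin flags and configuration omitted:

    # Hedged sketch: run the configured validators / transformers on the dataset
    $ dgit validate     # e.g., metadata-validator, regression-quality-validator
    $ dgit transform    # e.g., a transformer such as simple-file-encryptor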
Security and Privacy
Some basic principles adhered to by dgit:
- dgit code is open source to enable auditing if needed.
- No data ever leaves organizational premises (or even local machine) without explicit actions.
- When pushing a data repo to a backend such as s3, the push uses credentials stored on the local machine. Nobody outside the organization can access the repo.
- When metadata is posted to any server to enable search, lineage computation, etc., the parameters are controlled: what is posted, when, and where.
- When data leaves the premises (e.g., dgit post), it is only metadata by default (filenames, timestamps, etc.). Previews, schemas, and the like can be added, but that information must be included explicitly. All metadata being posted is stored in a standard location (datapackage.json) within the data repo, as sketched below. Posting raw data is not supported by design.
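For illustration, a heavily trimmed sketch of the kind of entries datapackage.json may hold. The top-level name/resources layout follows the usual Data Package convention; any fields beyond that are assumptions, and no file contents appear:

    # Illustrative sketch only -- not the exact schema dgit writes
    $ cat datapackage.json
    {
        "name": "pingali/simple-regression",
        "resources": [
            {
                "name": "model-results.txt",
                "path": "model-results.txt",
                "timestamp": "<commit timestamp>"
            }
        ]
    }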
Dataset Management Problem
Some persistent problems for data scientists include:
- Tracking which dataset was used to generate a given result.
- Knowing how we arrived at the dataset to begin with.
- Finding analyses that will be impacted by a change in the version of a dataset.
To manage these problems, the data science domain needs a tool that is no more complex than git. Such a tool:
- Is simple to deploy and use, and does not impose a certain way of doing things.
- Does not require coordination with other people when there is only one user, but does not prevent coordination and collaboration.
- Addresses the needs of dataset versioning, including metadata content and representation, and the use of third-party versioning or storage services such as s3 and instabase.
Several characteristics of datasets and data analysis workflows shape these requirements:

- A single code repo may generate many datasets, each of which may have one or more files, over many runs.
- There is usually a large number of small files.
- Datasets are used by non-technical teams, including business teams.
- Datasets may be generated outside git repos (e.g., acquired from a third party or produced by software such as simulators).
- Datasets may be raw data or data-generator scripts.
- Files may be added to datasets over time
- Datasets may not be able to leave premises
- Data analysis projects tend to have a relatively short duration (one day to a few months) and are executed by relatively isolated teams (from one individual to a few).
- Auditability and shareability are required, but sharing is not as extensive as in software development; people tend to work on different business problems.
We could force-fit these into one or more git repos, run a git server locally, and/or use GitHub LFS or GitLab annex. We felt that the use case is slightly different from that of software repos.
Copyright (c) 2016, Venkata Pingali
All rights reserved.
Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.