local blobstore? global? #2

jbenet · 2013-12-19T14:08:29Z

The data-blob doc, which describes blobs in more detail, is copied below.

Should there be a local blobstore separate from the working directory datasets?
Should it be global?

Implications:

no local blobstore (current):
- pro: space saving? only one blob copy per file
- pro: no surprises (no random extra data repositories lying about. wysiwyg.)
- con: blobs are stored as the files they represent, so they can be deleted easily.
local blobstore:
- pro: keeping working directory and repository separate confers git-like safety
- con: duplicates all data on the filesystem. bad, as some datasets will be massive.
local blobstore (global, 1 location per user, like go workspace):
- pro: caching of blobs across all projects on the machine.
- pro: saves space
- pro: fast
- con: random (heavy) files added to a global spot on the machine
- con: requires settings to configure the global blobstore
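
To make the caching claim for the global option concrete, here is a minimal Python sketch. It assumes blobs are stored in a single directory under their sha1 hash, as the doc below describes; the store location, the `put` helper, and the project names are all hypothetical and are not data's actual implementation.

```python
import hashlib
import os
import shutil
import tempfile


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never fully loaded in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def put(store_dir, path):
    """Copy the file at `path` into `store_dir`, named by its sha1.

    If a blob with that hash is already present, nothing is copied.
    """
    digest = sha1_of_file(path)
    dest = os.path.join(store_dir, digest)
    if not os.path.exists(dest):
        os.makedirs(store_dir, exist_ok=True)
        shutil.copyfile(path, dest)
    return digest


def demo():
    """Add the same file from two projects to one shared store."""
    root = tempfile.mkdtemp()
    store = os.path.join(root, "global-blobstore")  # hypothetical location
    hashes = []
    for project in ("project-a", "project-b"):  # hypothetical projects
        os.makedirs(os.path.join(root, project))
        path = os.path.join(root, project, "data.csv")
        with open(path, "wb") as f:
            f.write(b"x,y\n1,2\n")
        hashes.append(put(store, path))
    return hashes, os.listdir(store)
```

Both projects produce the same hash, so the shared store ends up with a single blob. A per-project blobstore would hold one copy in each project instead.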

data blob - Manage blobs in the blobstore.

    Managing blobs means:

      put <hash>    Upload blob named by <hash> to blobstore.
      get <hash>    Download blob named by <hash> from blobstore.
      check <hash>  Verify blob contents named by <hash> match <hash>.
      show <hash>   Output blob contents named by <hash>.


    What is a blob?

    Datasets are made up of files, which are made up of blobs.
    (For now, 1 file is 1 blob. Chunking to be implemented)
    Blobs are basically blocks of data, which are checksummed
    (for integrity, de-duplication, and addressing) using a crypto-
    graphic hash function (sha1, for now). If git comes to mind,
    that's exactly right.

    Local Blobstores

    data stores blobs in blobstores. Every local dataset has a
    blobstore (local caching with links TBI). Like in git, the blobs
    are stored safely in the blobstore (different directory) and can
    be used to reconstruct any corrupted/deleted/modified dataset files.

    Remote Blobstores

    data uses remote blobstores to distribute datasets across users.
    The datadex service includes a blobstore (currently an S3 bucket).
    By default, the global datadex blobstore is where things are
    uploaded to and retrieved from.

    Since blobs are uniquely identified by their hash, maintaining one
    global blobstore helps reduce data redundancy. However, users can
    run their own datadex service. (The index and blobstore are tied
    together to ensure consistency. Please do not publish datasets to
    an index if their blobs aren't in that index's blobstore.)

    data can use any remote blobstore you wish. (For now, this
    requires recompiling; in the future, it will not.) Just change the
    datadex configuration variable. Or pass in "-s <url>" per command.

    (data-blob is part of the plumbing, lower level tools.
    Use it directly if you know what you're doing.)
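
The separate-blobstore idea in the doc above (blobs named by hash, verified with check, and used to reconstruct damaged dataset files) can be sketched in a few lines of Python. This is only an illustration under assumptions: a plain local directory stands in for the blobstore, and the function names merely echo the subcommands. The `restore` helper is hypothetical and is not part of data-blob.

```python
import hashlib
import os
import tempfile


def blob_hash(data: bytes) -> str:
    """Name a blob by the sha1 of its contents."""
    return hashlib.sha1(data).hexdigest()


def put(store_dir: str, data: bytes) -> str:
    """Write `data` into the store under its hash and return the hash."""
    digest = blob_hash(data)
    os.makedirs(store_dir, exist_ok=True)
    with open(os.path.join(store_dir, digest), "wb") as f:
        f.write(data)
    return digest


def show(store_dir: str, digest: str) -> bytes:
    """Return the contents of the blob named by `digest`."""
    with open(os.path.join(store_dir, digest), "rb") as f:
        return f.read()


def check(store_dir: str, digest: str) -> bool:
    """Verify that the stored contents still hash to `digest`."""
    return blob_hash(show(store_dir, digest)) == digest


def restore(store_dir: str, digest: str, path: str) -> None:
    """Rewrite a working-directory file from its blob (hypothetical helper)."""
    if not check(store_dir, digest):
        raise ValueError("blob %s is corrupted" % digest)
    with open(path, "wb") as f:
        f.write(show(store_dir, digest))
```

Because the blob lives in a different directory from the working file, deleting or modifying the working file leaves the blob intact, and `restore` can rebuild it. With no local blobstore, the working file is the only copy.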
