local blobstore? global? #2

Open

jbenet opened this issue Dec 19, 2013 · 0 comments

jbenet (Owner) commented Dec 19, 2013

The data-blob doc, which describes this in more detail, is copied below.

Should there be a local blobstore separate from the working directory datasets?
Should it be global?

Implications:

  • no local blobstore (current):
    • pro: space saving? only one blob copy per file
    • pro: no surprises (no random extra data repositories lying about; WYSIWYG)
    • con: blobs are stored as the files they represent, so they can be deleted easily
  • local blobstore (per dataset):
    • pro: keeping the working directory and repository separate confers git-like safety
    • con: duplicates all data on the filesystem; bad, as some datasets will be massive
  • local blobstore (global, 1 location per user, like a go workspace; see the sketch after this list):
    • pro: caches blobs across all projects on the machine
    • pro: saves space
    • pro: fast
    • con: random (heavy) files added to a global spot on the machine
    • con: requires settings around the global blobstore's location
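
A minimal Go sketch of how that third option could resolve the blobstore location; the DATA_HOME variable, the ~/.data default, and the .data-blobs directory name are all illustrative assumptions, not data's actual layout:

    // blobstorePath picks a global per-user blobstore (like a go
    // workspace) when useGlobal is set, and a per-dataset directory
    // otherwise. All names here are hypothetical.
    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    func blobstorePath(datasetDir string, useGlobal bool) string {
        if useGlobal {
            if dir := os.Getenv("DATA_HOME"); dir != "" {
                return filepath.Join(dir, "blobs")
            }
            home, _ := os.UserHomeDir()
            return filepath.Join(home, ".data", "blobs")
        }
        return filepath.Join(datasetDir, ".data-blobs")
    }

    func main() {
        fmt.Println(blobstorePath("/work/my-dataset", false))
        fmt.Println(blobstorePath("/work/my-dataset", true))
    }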
data blob - Manage blobs in the blobstore.

    Managing blobs means:

      put <hash>    Upload blob named by <hash> to blobstore.
      get <hash>    Download blob named by <hash> from blobstore.
      check <hash>  Verify blob contents named by <hash> match <hash>.
      show <hash>   Output blob contents named by <hash>.
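
    For example, a round trip through these commands might look like
    this (a hypothetical session; <hash> stands for a real sha1):

      data blob put <hash>    # upload the blob
      data blob check <hash>  # verify its contents still match
      data blob show <hash>   # print its contents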


    What is a blob?

    Datasets are made up of files, which are made up of blobs.
    (For now, 1 file is 1 blob. Chunking is to be implemented.)
    Blobs are basically blocks of data, which are checksummed
    (for integrity, de-duplication, and addressing) using a
    cryptographic hash function (sha1, for now). If git comes
    to mind, that's exactly right.
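
    As a sketch of this addressing scheme in Go (whether data hashes
    the raw bytes or adds a header, as git does, is an assumption
    here; this hashes the raw bytes):

      // blobhash.go - name a blob by the sha1 of its contents.
      // usage: go run blobhash.go <file>
      package main

      import (
          "crypto/sha1"
          "fmt"
          "io"
          "os"
      )

      func blobHash(path string) (string, error) {
          f, err := os.Open(path)
          if err != nil {
              return "", err
          }
          defer f.Close()
          h := sha1.New()
          if _, err := io.Copy(h, f); err != nil {
              return "", err
          }
          return fmt.Sprintf("%x", h.Sum(nil)), nil
      }

      func main() {
          hash, err := blobHash(os.Args[1])
          if err != nil {
              fmt.Fprintln(os.Stderr, err)
              os.Exit(1)
          }
          fmt.Println(hash)
      }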

    Local Blobstores

    data stores blobs in blobstores. Every local dataset has a
    blobstore (local caching with links is to be implemented). Like
    in git, the blobs are stored safely in the blobstore (a separate
    directory) and can be used to reconstruct any corrupted, deleted,
    or modified dataset files.
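
    A sketch of what check-style verification against a local
    blobstore could look like, assuming blobs live at
    <blobstore>/<hash> (the on-disk layout is an assumption; this
    builds on blobHash from the sketch above and needs
    "path/filepath" imported):

      // checkBlob recomputes the hash of the stored blob and compares
      // it to the name it is stored under; a mismatch means the blob
      // was corrupted or tampered with.
      func checkBlob(blobstore, hash string) error {
          got, err := blobHash(filepath.Join(blobstore, hash))
          if err != nil {
              return err
          }
          if got != hash {
              return fmt.Errorf("blob %s is corrupt (hashes to %s)", hash, got)
          }
          return nil
      }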

    Remote Blobstores

    data uses remote blobstores to distribute datasets across users.
    The datadex service includes a blobstore (currently an S3 bucket).
    By default, blobs are uploaded to and retrieved from the global
    datadex blobstore.
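
    As a sketch, fetching a blob from a remote blobstore could be a
    plain HTTP GET keyed by hash; the <url>/<hash> layout is a guess
    at the protocol, not datadex's actual API (needs "io" and
    "net/http" imported):

      // getBlob downloads the blob named by hash from a remote
      // blobstore; the caller should re-hash the bytes (as in
      // checkBlob above) before trusting them.
      func getBlob(base, hash string) ([]byte, error) {
          resp, err := http.Get(base + "/" + hash)
          if err != nil {
              return nil, err
          }
          defer resp.Body.Close()
          if resp.StatusCode != http.StatusOK {
              return nil, fmt.Errorf("blobstore returned %s", resp.Status)
          }
          return io.ReadAll(resp.Body)
      }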

    Since blobs are uniquely identified by their hash, maintaining one
    global blobstore helps reduce data redundancy. However, users can
    run their own datadex service. (The index and blobstore are tied
    together to ensure consistency. Please do not publish datasets to
    an index if their blobs aren't in that index's blobstore.)

    data can use any remote blobstore you wish: just change the
    datadex configuration variable (for now, this requires
    recompiling; in the future, it won't), or pass in "-s <url>"
    per command.
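
    For instance (a hypothetical invocation; only the "-s <url>" flag
    comes from the text above, and the URL is illustrative):

      data blob get <hash> -s http://my-datadex.example.com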

    (data-blob is part of the plumbing, lower level tools.
    Use it directly if you know what you're doing.)