Should there be a local blobstore separate from the working directory datasets?
Should it be global?
Implications:
no local blobstore (current):
pro: space saving? only one blob copy per file
pro: no surprises (no random extra data repositories lying about. wysiwyg.)
con: blobs are stored as the files they represent. can be deleted easily.
local blobstore:
pro: keeping working directory and repository separate confers git-like safety
con: duplicates all data on filesystem. bad as some will be massive.
local blobstore (global, 1 location per user, like go workspace):
pro: caching of blobs across all projects in machine.
pro: saves space
pro: fast
con: random (heavy) files added to a global spot in the machine
con: needs configuration settings for the global blobstore
data blob - Manage blobs in the blobstore.
Managing blobs means:
    put <hash>      Upload blob named by <hash> to blobstore.
    get <hash>      Download blob named by <hash> from blobstore.
    check <hash>    Verify blob contents named by <hash> match <hash>.
    show <hash>     Output blob contents named by <hash>.
What is a blob?
Datasets are made up of files, which are made up of blobs.
(For now, 1 file is 1 blob. Chunking to be implemented)
Blobs are basically blocks of data, which are checksummed
(for integrity, de-duplication, and addressing) using a
cryptographic hash function (sha1, for now). If git comes to mind,
that's exactly right.
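As a rough illustration of the addressing described above, the following Python sketch names a blob by the SHA-1 of its contents and verifies it, which is roughly the idea behind `check <hash>`. The function names `blob_hash` and `check_blob` are hypothetical and are not part of data itself.

```python
import hashlib


def blob_hash(contents: bytes) -> str:
    """Return the hex SHA-1 digest, used here as the blob's name (hypothetical helper)."""
    return hashlib.sha1(contents).hexdigest()


def check_blob(contents: bytes, expected_hash: str) -> bool:
    """Return True if the contents still match the hash they are named by."""
    return blob_hash(contents) == expected_hash


name = blob_hash(b"hello world\n")

# Identical contents always produce the same name, which is what enables de-duplication.
assert name == blob_hash(b"hello world\n")

# Any modification changes the hash, which is what enables integrity checking.
assert not check_blob(b"hello world!\n", name)
```

Because the name is derived only from the contents, two datasets that contain the same file refer to the same blob.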
Local Blobstores
data stores blobs in blobstores. Every local dataset has a
blobstore (local caching with links TBI). Like in git, the blobs
are stored safely in the blobstore (different directory) and can
be used to reconstruct any corrupted/deleted/modified dataset files.
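The recovery behaviour described above can be sketched as follows. This is a minimal illustration that assumes one blob per file and a flat directory of blobs named by hash; the `LocalBlobstore` class and its `put` and `restore` methods are hypothetical and do not reflect data's actual on-disk layout.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


class LocalBlobstore:
    """Hypothetical blobstore: a directory of files named by their SHA-1 hash."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, path: Path) -> str:
        """Copy a working-directory file into the blobstore and return its hash."""
        blob = hashlib.sha1(path.read_bytes()).hexdigest()
        dest = self.root / blob
        if not dest.exists():  # identical contents are stored only once
            shutil.copyfile(path, dest)
        return blob

    def restore(self, blob: str, path: Path) -> None:
        """Overwrite a working-directory file with the stored blob contents."""
        shutil.copyfile(self.root / blob, path)


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp) / "dataset"
    workdir.mkdir()
    store = LocalBlobstore(Path(tmp) / "blobstore")

    data_file = workdir / "values.csv"
    data_file.write_text("a,b\n1,2\n")
    blob = store.put(data_file)

    # Simulate accidental modification of the dataset file, then recover it.
    data_file.write_text("corrupted")
    store.restore(blob, data_file)
    assert data_file.read_text() == "a,b\n1,2\n"
```

Note that this layout keeps a second copy of every file, which is the duplication cost listed among the cons of a separate local blobstore.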
Remote Blobstores
data uses remote blobstores to distribute datasets across users.
The datadex service includes a blobstore (currently an S3 bucket).
By default, the global datadex blobstore is where things are
uploaded to and retrieved from.
Since blobs are uniquely identified by their hash, maintaining one
global blobstore helps reduce data redundancy. However, users can
run their own datadex service. (The index and blobstore are tied
together to ensure consistency. Please do not publish datasets to
an index if their blobs aren't in that index's blobstore.)
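One way to picture the consistency requirement is a pre-publish check that every blob a dataset refers to is already present in the blobstore tied to the target index. The sketch below assumes a manifest that maps file paths to blob hashes and a set of hashes known to the blobstore; both structures and the `missing_blobs` name are hypothetical, not data's actual format or API.

```python
from typing import Dict, List, Set


def missing_blobs(manifest: Dict[str, str], available: Set[str]) -> List[str]:
    """Return the paths whose blobs are absent from the blobstore, sorted by path."""
    return sorted(path for path, blob in manifest.items() if blob not in available)


# Hypothetical manifest: file path -> blob hash (shortened here for readability).
manifest = {"train.csv": "aaa111", "test.csv": "bbb222"}
available = {"aaa111"}

# Publishing should be refused while any referenced blob is missing.
assert missing_blobs(manifest, available) == ["test.csv"]
```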
data can use any remote blobstore you wish. In the future, you will
be able to just change the datadex configuration variable, or pass in
"-s <url>" per command. (For now, you have to recompile.)
(data-blob is part of the plumbing, lower level tools.
Use it directly if you know what you're doing.)
data-blob doc describing more copied below.