
Add mcclient archive command to generate archive tarball #185

Merged
merged 7 commits on Feb 23, 2017

Conversation

yusefnapora
Contributor

I figured it would be easier to just make the tarball in JS using the tar-stream module rather than writing some shell or Python scripts, since we already have the code to extract and request the associated objects.

This adds an mcclient archive <queryString> command that writes a gzipped tarball to stdout (or to a file given with the --output|-o flag). The tarball contains a stmt/<statementId> entry for each statement and a data/<objectId> entry for each data object. The statements are stringified JSON objects, but we could easily do protobufs instead.

Happy to tweak the archive format in the morning (multiple statements per entry is probably a good idea).
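
For reference, a minimal sketch of the tar-stream + gzip plumbing described above (the helper name, statement shape, and objects map here are assumptions for illustration, not the actual mcclient code):

const tar = require('tar-stream')
const zlib = require('zlib')
const fs = require('fs')

// statements: array of statement objects; objects: Map of objectId -> Buffer
function writeArchive (statements, objects, outputPath) {
  const pack = tar.pack()
  const out = outputPath ? fs.createWriteStream(outputPath) : process.stdout
  pack.pipe(zlib.createGzip()).pipe(out)

  // one stmt/<statementId> entry per statement, serialized as JSON
  for (const stmt of statements) {
    const content = Buffer.from(JSON.stringify(stmt), 'utf-8')
    pack.entry({ name: `stmt/${stmt.id}`, size: content.length }, content)
  }

  // one data/<objectId> entry per data object
  for (const [objectId, bytes] of objects) {
    pack.entry({ name: `data/${objectId}`, size: bytes.length }, bytes)
  }

  pack.finalize()
}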

Contributor

@parkan parkan left a comment

So this is kind of what I was thinking with my "just a tarball of stuff" approach. My memory of the tar format is kinda rusty but it looks like the "file" entry headers are 500 bytes, which is a bunch but not a catastrophic amount, especially with compound statements. I am tempted to say let's use this as is and leave batching as a possible enhancement.

queryStream.on('data', obj => {
  let stmt
  try {
    stmt = Statement.fromProtobuf(obj)
Contributor

I would be inclined to write the protobufs directly -- there's definitely a human readability cost but it is, after all, a serialization format.
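
A hedged sketch of what that could look like, writing the raw protobuf bytes as the stmt/ entry instead of JSON (Statement.toProtobuf and stmt.id are assumptions made for symmetry with the fromProtobuf call in the diff; writeToTarball is the helper used elsewhere in this change):

function writeStatementEntry (tarball, stmt) {
  const bytes = stmt.toProtobuf()   // hypothetical inverse of Statement.fromProtobuf
  writeToTarball(tarball, `stmt/${stmt.id}`, bytes)
}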

Contributor Author

Yeah, I think I agree

if (dataResult == null || typeof dataResult !== 'object' || dataResult.data == null) return

const bytes = Buffer.from(dataResult.data, 'base64')
writeToTarball(tarball, `data/${key}`, bytes)
Contributor

though maybe if we're batching anyway we might as well write the batch as a single "file"?

Contributor

I think it's preferable to have objects in their own individual files, as it makes it easy to quickly check for (or seek to) an object in an archive.

@vyzo
Contributor

vyzo commented Feb 22, 2017

I think it might be a mistake to have each statement in its own file.
Firstly, it costs space (those 500 bytes are 2x overhead), and it also makes the load process slow, as statements need to be read from file one by one (and it doesn't take advantage of the streaming nature of the import API).

I think the better approach is to batch statements together as JSON, in one (or more) ndjson files under stmt/.
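
A rough sketch of that batching approach, anticipating the writeStatementBatch code shown further down in the diff (batch size and helper names here are illustrative, not the final implementation):

const STMT_BATCH_SIZE = 1000
let stmtBatch = []
let stmtBatchNumber = 0

function addStatement (tarball, stmt) {
  stmtBatch.push(JSON.stringify(stmt))
  if (stmtBatch.length >= STMT_BATCH_SIZE) flushStatementBatch(tarball)
}

function flushStatementBatch (tarball) {
  if (stmtBatch.length === 0) return
  // one ndjson entry per batch: statements joined with newlines
  const content = Buffer.from(stmtBatch.join('\n'), 'utf-8')
  writeToTarball(tarball, `stmt/${leftpad(stmtBatchNumber.toString(), 8)}.ndjson`, content)
  stmtBatch = []
  stmtBatchNumber += 1
}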

writeToTarball(tarball, name, content)

for (const id of stmt.objectIds) {
  objectIds.push(id)
Contributor

@vyzo vyzo Feb 22, 2017

we need to deduplicate objects (and dependencies!) for the entire archive.

Contributor Author

Yeah, true. Using a Set instead of an Array for the refs should do the trick, although if we're willing to accept some small amount of duplication we could request the objects in batches while the statement stream is still coming in, instead of waiting until the end to fully deduplicate them. We'd probably end up with duplicate schema objects, though.
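
A minimal sketch of that Set-based deduplication (illustrative only; it reuses names from the diff such as Statement.fromProtobuf, writeDataObjectsToTarball, client, and tarball):

const objectIds = new Set()

queryStream.on('data', obj => {
  const stmt = Statement.fromProtobuf(obj)
  for (const id of stmt.objectIds) {
    objectIds.add(id)   // a Set silently drops duplicate refs
  }
})

queryStream.on('end', () => {
  // fetch each referenced object exactly once, after the statement stream ends
  writeDataObjectsToTarball(client, tarball, objectIds)
})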

@yusefnapora
Contributor Author

Also, again if we're willing to increase the archive size somewhat, the dead simplest thing to ingest at load time would be to use collections of ndjson batches for both statements and objects, where the object ndjson is of the form {data: "base64-encoded-object"}. Then we could just feed the ndjson directly into the concat API with curl or mcclient or whatever.

With gzipping, the extra overhead from the JSON might not be too bad, although we'd take a hit from base64, of course.
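
For concreteness, a sketch of how an object line in that scheme might be produced (the {data: "<base64>"} shape is the one floated above; the helper name is hypothetical):

function objectToNdjsonLine (bytes) {
  // bytes is a Buffer holding the raw object; base64 adds roughly 33% before gzip
  return JSON.stringify({ data: bytes.toString('base64') })
}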

@parkan
Contributor

parkan commented Feb 22, 2017

@vyzo

Firstly, it costs space (those 500 bytes are 2x overhead), and it also makes the load process slow, as statements need to be read from file one by one (and it doesn't take advantage of the streaming nature of the import API).

OK, that's fair, though I wouldn't call this "its own file" -- it's a big file with large-ish record delimiters, not separate files. The advantage is potentially higher recoverability/seeking due to the presence of headers, but since we're treating this as all-or-nothing, I can live with big ol' ndjson.

@yusefnapora

Also, again if we're willing to increase the archive size somewhat, the dead simplest thing to ingest at load time would be to use collections of ndjson batches for both statements and objects, where the object ndjson is of the form {data: "base64-encoded-object"}

Might as well just store a base64 object per line then? Don't really need the extra markup

@@ -15,6 +16,10 @@ const TAR_ENTRY_OPTS = {
  gname: 'staff'
}

function leftpad (str, length, char = '0') {
Contributor

LOL

Contributor Author

😛

function writeStatementBatch (force: boolean = false) {
  if (force || stmtBatch.length >= STMT_BATCH_SIZE) {
    const content = Buffer.from(stmtBatch.join('\n'), 'utf-8')
    const filename = `stmt/${leftpad(stmtBatchNumber.toString(), 8)}.ndjson`
Contributor

Just .json should work too for the extension? It's common for .json files to actually contain ndjson.

Contributor

We use .ndjson elsewhere; it's descriptive and reasonably commonly used.

}

function writeDataObjectsToTarball (client: RestClient, tarball: Object, objectIds: Set<string>): Promise<*> {
  if (objectIds.size < 1) return Promise.resolve()
Contributor

@vyzo vyzo Feb 22, 2017

that's so javascript :)
in a compiled language you would write the comparison as == 0, since it compiles directly to test and jz/jnz ops

Contributor

@parkan parkan Feb 22, 2017

and then you get runaway loops in concurrency scenarios with multiple workers 😉 (or with bad math)

Contributor Author

I always try to be as paranoid as possible with js, and assume that some demented dependency could have monkey patched .size to return a negative number 😛

Contributor

right, javascript, the land of crazy!

.then(stream => new Promise((resolve, reject) => {
  stream.on('data', dataResult => {
    const key = objectIds.shift()
    if (dataResult == null || typeof dataResult !== 'object' || dataResult.data == null) return
Contributor

So if an object is missing from the datastore (or some other error condition occurs), we just silently stop adding objects and still produce an archive?
I think we should at least notify the user that there was an error, and possibly keep going and add the remaining objects.

Contributor Author

yeah, that's a good point... perhaps we could be strict by default and abort on errors, with a flag to just warn about them?

Contributor

yeah, let's error out by default and allow the user to specify a lax mode that keeps going.

Contributor

with warnings :)

Contributor

+1 for erroring out
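
A sketch of the agreed behaviour: abort the archive on a missing object by default, and only warn and continue when the user opts into a lax mode. The allowErrors flag name and the reject branch are assumptions here; the snippet further down in this conversation shows the corresponding allowErrors handling from the actual change.

stream.on('data', dataResult => {
  const key = objectIds.shift()
  if (dataResult == null || typeof dataResult !== 'object' || dataResult.data == null) {
    const msg = (dataResult && dataResult.error) ? dataResult.error : 'Unknown error'
    if (allowErrors) {
      printlnErr(`Error fetching object for ${key}: ${msg}`)   // warn and keep going
      return
    }
    return reject(new Error(`Error fetching object for ${key}: ${msg}`))
  }
  writeToTarball(tarball, `data/${key}`, Buffer.from(dataResult.data, 'base64'))
})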

@@ -83,6 +83,12 @@ module.exports = {
    envelope: [ SIMPLE_STMT_1.publisher ],
    envelopeEmpty: [ ENVELOPE_STMT.publisher ]
  },
  expectedDeps: {
    simple: [ new Set(['dep1', 'dep2']), new Set(['dep1', 'dep3']) ],
    compound: [ new Set() ],
Contributor

we should test that it retrieves dependencies correctly in compound, rather than envelopes; we don't have any of the latter yet.

Contributor Author

yeah, good call. I'll update the test

const msg = (dataResult && dataResult.error) ? dataResult.error : 'Unknown error'
if (allowErrors) {
  printlnErr(`Error fetching object for ${key}: ${msg}`)
  return
Contributor

so the promise stays unresolved here; is this desired behaviour?

Contributor Author

It will still resolve when the stream ends (in the stream.on('end', ...) below). That will probably happen immediately after the error, since concat will close the stream after the error.
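
For clarity, the pattern being described is roughly this (a sketch, not the exact code; handleDataResult stands in for the 'data' handler shown above):

.then(stream => new Promise((resolve, reject) => {
  stream.on('data', handleDataResult)   // may return early on per-object errors
  stream.on('error', reject)
  stream.on('end', () => resolve())     // still fires after concat closes the stream
}))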
