Database: Single YAML file vs multiple line-oriented files? #3

Closed
pfalcon opened this issue Apr 29, 2015 · 13 comments

pfalcon (Owner) commented Apr 29, 2015

From the current README:

> Currently uses multiple files for "database", each storing particular type of information. Switch to a single YAML file instead?

This ticket is for discussion of this issue; opinions welcome!

pfalcon (Owner) commented Apr 29, 2015

From #1:

> Regarding the question of individual files vs. a combined database in one file, I guess that individual files might be better for github?

I wouldn't say it's "better for github"; it's more likely better for a human reviewer of VCS changes. My idea is to integrate the "save" operation with a git commit; then a whole "disassembly" project can be backed by a repo containing everything one needs to continue the work, so one can clone it, do some useful work, and submit it to a maintainer for inclusion in the centralized repo. A maintainer will of course want to review changes for sanity, and making sure they're focused surely helps to do that quicker. For example, the first thing one would be interested in is the label list - what was added, what was renamed. Then how many comments were added, then which instruction arguments were disambiguated. Ease of reviewing all that separately is why I implemented the multi-file format so far.

But well, the next thing after a high-level review is a more detailed one, and that would require cross-referencing individual items. For example, a function was renamed from "do_x" to "do_y". How so? There must be a comment for that function detailing what it really does then - with YAML, everything related to the same address would be together.

I don't know where the golden middle is. I'm afraid it may require actual experience with both (or more) ways to do it.

But any arguments, even speculative ones, are surely welcome.

pfalcon (Owner) commented Jul 20, 2015

Ok, I gave this not just a good thinking-over, but also some experimentation. Results? No silver bullet. As a quick fact, loading a 7MB YAML file with a pure-Python parser (PyYAML) takes... about forever; I lost patience after 5 min. So, the idea of having a single YAML file isn't viable. Of course, I could write a fast custom parser, but that's just too big a file for a not-so-big project. I have the ambition to support projects 2 orders of magnitude larger. And all this is intended to be used with git, so gigantic files aren't a good idea.
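
For reference, a minimal timing sketch of the kind of comparison described above (the db.yaml filename is assumed; note that PyYAML's optional libyaml-backed CSafeLoader, when built in, is typically an order of magnitude faster than the pure-Python loader):

```python
# Timing sketch: compare PyYAML's pure-Python loader against the
# optional libyaml-backed one. "db.yaml" is an assumed filename.
import time
import yaml

def time_load(loader_cls):
    with open("db.yaml") as f:
        start = time.time()
        yaml.load(f, Loader=loader_cls)
    return time.time() - start

print("SafeLoader (pure Python): %.1fs" % time_load(yaml.SafeLoader))
if hasattr(yaml, "CSafeLoader"):  # present only if PyYAML was built with libyaml
    print("CSafeLoader (libyaml):    %.1fs" % time_load(yaml.CSafeLoader))
```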

So, at the very least, it should be the older .aspace file for byte flags, and then everything else in .yaml. With such a setup (and with enough abbreviation inside), the YAML is ~700K. 2 orders of magnitude would be 70M, but .aspace is about the same size, so that's as good as it gets without going the route of a binary database.

There's still the issue of saving speed: the initial implementation took 1s to write that file, and again, that's not good for the integration as I envision it. But that can, and needs to, be optimized.

Otherwise, stuff like:

```yaml
40003a14:
 f: C 01
 l: Uart_Init
 xref:
 - 40000ffb: c
 - 40001294: c
40003a4a:
 f: C 01
 l:
 xref:
 - 40003b2a: j
```

Hopefully this looks better to a human than the same info spread across multiple files.

Summing up:

  1. File size is an issue. But I had an idea to dump them memory segment by memory segment, so there's still a way to tackle it.
  2. Processing speed is an issue too, and tackling it requires noticeable refactoring.

pfalcon (Owner) commented Jul 20, 2015

Well, of course the above example is semantically broken YAML; actually, it should be:

```yaml
0x40003a14:
 f: C 01
 l: Uart_Init
 xref:
 - 0x40000ffb: c
 - 0x40001294: c
```

Which makes the above ~700K grow to ~750K...

pfalcon (Owner) commented Jul 20, 2015

Which brings up the question: having a combined data format is certainly good, but how important is it for it to be (compliant) YAML? The initial idea for using YAML was to let people (re)use it without needing to write their own parser, and to show adherence to open standards. But:

  1. A parser would be just a few dozen lines of code (see the sketch after this list), and would even work better than a random YAML parser, as mentioned above.
  2. One good thing about standards is that there are many of them. So there's no "standard standard" anyway, and if some aspects of a standard get in your way, you can optimize them away; there will be just yet another "standard", and no kittens died.
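
As a sketch of point 1, a hand-rolled loader for the abbreviated subset shown above might look something like this (hypothetical code, not what ScratchABit actually ships):

```python
# Hand-rolled loader for the abbreviated format above: top-level
# "0xADDR:" keys, one-space-indented "key: value" properties, and
# "- 0xADDR: type" items under "xref". Hypothetical sketch only.
def load_db(path):
    db = {}
    entry = None
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            if not line.startswith(" "):
                # "0x40003a14:" starts a new per-address entry
                entry = db[int(line.rstrip(":"), 0)] = {}
            elif line.lstrip().startswith("- "):
                # " - 0x40000ffb: c" is one cross-reference
                addr, kind = line.lstrip()[2:].split(":", 1)
                entry["xref"].append((int(addr, 0), kind.strip()))
            else:
                # " f: C 01", " l: Uart_Init", or " xref:" (list follows)
                key, _, value = line.lstrip().partition(":")
                entry[key] = [] if key == "xref" else value.strip()
    return db
```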

Tough choices; I really could use a second opinion.

projectgus (Contributor) commented

Hi Paul,

For my 2c, I think sticking with YAML is a good goal even if you end up hand-writing a parser and/or save routine for the subset of YAML used here. It's still more likely that people can throw it into other tools, make quick scripts to iterate over it, etc.

I don't think there's going to be much more of a win on the file size thing while keeping a human-readable ASCII format (which I agree is a great goal).

Regarding single file vs multiple files, maybe this is something you could specify in the definition files? (For an address region, have an optional filename suffix to split that address region off to a different database file if desired - see the sketch below.) For instance, I think it'd be sensible for esp8266 to keep ROM addresses in a different database file, but splitting the IRAM and IROM regions seems less useful. There are probably similar project-specific divisions in other RE projects; it's hard to cater for all the possibilities upfront.
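
Hypothetically, such a per-region annotation might look something like the following; the db_file key is invented purely for illustration, not an existing ScratchABit option:

```yaml
# Hypothetical sketch only - db_file is an invented key, not a real
# ScratchABit option. Regions without db_file go to the main database.
regions:
- name: bootrom
  start: 0x40000000
  size: 0x10000
  db_file: bootrom.yaml   # split this region off into its own file
- name: irom
  start: 0x40240000
  size: 0x40000
```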

FWIW also, I think I would find a git-commit-on-save feature difficult to use. Keeping databases in git is an excellent idea, but having them manually versioned with meaningful messages for the commits ("investigated X", "traced through Y", etc.) would add a lot of value to the versioning. Especially if people end up collaborating on them requiring merging, etc. Trying to decipher intention from a sequence of many automatically generated commits (did they mean to do this? did they revert the same change a few saves later?) while merging would be hard, I think.

I know it'd be possible to rebase or reset and squash the auto-generated commits into manual ones with meaningful messages, but at least for me that'd mean I was working against the tool rather than with it.

One last suggestion, put some kind of metadata line or block at the top of the files (if you haven't already), under the YAML directive, with a version number in it. And have the tool check for any newer version and fail out. So if you do want to add tweaks later, you have a clean upgrade path to a newer format (apart from the directive & metadata block, I guess if you dump YAML entirely then it'd be a totally new format).

pfalcon (Owner) commented Jul 21, 2015

> For my 2c, I think sticking with YAML is a good goal even if you end up hand-writing a parser and/or save routine for the subset of YAML used here. It's still more likely that people can throw it into other tools, make quick scripts to iterate over it, etc.

Even if that can potentially cause (performance) problems with git versioning? esp8266 SDK + bootrom (~320K of raw binary code) leads to two ~800K files. I don't commonly use git on files of such size, but sure, it'll be ok. But what if there's 2 orders of magnitude more binary code (32MB, pretty realistic and not too big)? I don't believe git (even with the bigfile support which has gotten more traction recently) will make working with such files comfortable. Then they will need to be split, and then it all seems to be going backwards: I wanted YAML to keep everything in one place, and now I need to split it up (albeit "along another axis") because of YAML. Well, I needed a quick reality/logic check on this. If you think that YAML is still worth it (given the benefits it gives, and those were the reasons to choose it), then I'm not going mad with this ;-).

> I don't think there's going to be much more of a win on the file size thing while keeping a human-readable ASCII format (which I agree is a great goal).

There wouldn't be a big win, but there can be big bloat if it's done carelessly (and I'd like to write a practical, usable tool, not yet another toy).

> Regarding single file vs multiple files, maybe this is something you could specify in the definition files?

I assume here you're talking about "sharding" YAML files by address range, not the current per-data-type multi-file scheme, which will surely go away once YAML is selected. Well, I'd like to avoid doing too much configuration without real need. I'd rather write a generic loader which can load any number of files with minimal constraints (perhaps validating that the regions in each don't overlap; a sketch of such a check follows). As for saving, splitting the bootrom region from the ELF regions in esp8266 projects, like you say, makes sense, but the easiest way to achieve it is to just split files per region, voila. Later, an option to limit a single file to N MB of binary content can be added (if really needed).
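
A sketch of that minimal constraint, assuming each loaded file reports the (start, end) address range it covers; names are illustrative:

```python
# Sketch of the "regions don't overlap" check: given
# (start, end, filename) tuples collected while loading, verify no two
# database files claim overlapping address ranges.
def check_no_overlap(regions):
    regions = sorted(regions)  # sort by start address
    for (s1, e1, f1), (s2, e2, f2) in zip(regions, regions[1:]):
        if s2 < e1:  # next region starts before the previous one ends
            raise ValueError(
                "overlap: %s [0x%x..0x%x) vs %s [0x%x..0x%x)"
                % (f1, s1, e1, f2, s2, e2)
            )
```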

> FWIW also, I think I would find a git-commit-on-save feature difficult to use.

Well, show me an alternative ;-). I can tell you about two: one company used to provide a plugin, and a service, for IDA to collaboratively RE some stuff. The company was bought by Google, apparently for the sole purpose of shutting down this service. There's another open-source project which provides a (rather primitive, imho) interactive disassembler with a central collaboration server. It actually works so "smart" that when you start that disassembler, it makes a network connection without letting you know, sending something, receiving something (and yes, the author refuses to add "trojan" to its name ;-) ).

So, both of these are counter-examples of what I want to achieve. I want pushing/pulling work to be explicit, and changes to come in units, be attributed, be reviewable, and be revertible if needed. git does all that, and it wasn't even written by me, so people can trust it (or not) independently.
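
A minimal sketch of the commit-on-save integration described here, assuming the database files live in a git working tree (the function name and commit message are illustrative):

```python
# Commit-on-save sketch: stage the just-written database files and
# commit, skipping the commit when nothing actually changed.
import subprocess

def save_and_commit(db_files, message="scratchabit: autosave"):
    subprocess.check_call(["git", "add", "--"] + list(db_files))
    # `git diff --cached --quiet` exits non-zero iff something is staged
    if subprocess.call(["git", "diff", "--cached", "--quiet"]) != 0:
        subprocess.check_call(["git", "commit", "-m", message])
```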

> but having them manually versioned with meaningful messages for the commits ("investigated X", "traced through Y", etc.) would add a lot of value to the versioning.

Yep, that would be a nice extra feature. If people ask for it, I'll implement it. But even without this extra feature, git adheres to the basic requirements outlined above, by having an author name, a date, and a culture of doing small changes per commit, not rewriting half of the project in one.

> Trying to decipher intention from a sequence of many automatically generated commits (did they mean to do this? did they revert the same change a few saves later?) while merging would be hard, I think.

Well, everyone who maintains an open-source project knows that sometimes it's hard to decipher intention from a submitted source code patch ;-). But the solution is simple: patches are not merged until they're well described and in good shape. The same basic idea applies here, and git rebase --interactive works here just the same! Perhaps it will be harder to review such changes, but that's why I'm spending so much time on this ticket, and seeking more opinions, on how to make it more human-readable (while still efficient).

pfalcon (Owner) commented Jul 21, 2015

> One last suggestion, put some kind of metadata line or block at the top of the files (if you haven't already), under the YAML directive, with a version number in it. And have the tool check for any newer version and fail out. So if you do want to add tweaks later, you have a clean upgrade path to a newer format (apart from the directive & metadata block, I guess if you dump YAML entirely then it'd be a totally new format).

Well, one of the reasons for choosing YAML was that it's a generic structured data format, so new stuff can be added later w/o pain (and tools which don't know some kind of data can just skip it). Again, that's why I'd prefer to have a good design phase and collect feedback. What I plan is to add a comment which explains to a human the meaning of the heavily abbreviated (because they're so frequent) YAML keys, like "l" is "label". A version number? You know the irony: projects which think well about things, up to providing a version number, oftentimes get stuck at 1.0, and vice versa - where distinguishing versions would directly help is where it's not available ;-). So, I'd prefer to do the good-thinking part, but perhaps explicitly sticking to version 1.0 isn't bad either ;-).

pfalcon (Owner) commented Jul 21, 2015

Thanks for finding the time to provide detailed feedback on several aspects, much appreciated!

projectgus (Contributor) commented

> But what if there's 2 orders of magnitude more binary code (32MB, pretty realistic and not too big)? I don't believe git (even with the bigfile support which has gotten more traction recently) will make working with such files comfortable.

Yeah, I see. I don't really have any good suggestions on this.

One idea, I don't know how feasible: is it possible to only write database data when it has varied from whatever the tool automatically infers from the currently loaded definition? That way it may be possible to eventually work up to a tens-of-megabytes database, but only if people manually put in all of the work to get there.
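
A sketch of that idea, assuming both the live database and the analyzer's inferred baseline are dicts keyed by address (names are hypothetical; the catch, as the reply below notes, is the cost of recomputing the baseline):

```python
# Sketch of "save only what differs from the inferred baseline": only
# human-modified entries survive into the saved database.
def delta(db, baseline):
    return {addr: props for addr, props in db.items()
            if baseline.get(addr) != props}
```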

> So, both of these are counter-examples of what I want to achieve. I want pushing/pulling work to be explicit, and changes to come in units, be attributed, be reviewable, and be revertible if needed. git does all that, and it wasn't even written by me, so people can trust it (or not) independently.

Absolutely. But you can get all of this with git without baking it into the tool. If people aren't going to think collaboratively when using scratchabit, then having it automatically create git commits behind the scenes doesn't necessarily help with that (it's a technological solution to a non-technological problem). It may actively hinder people who do want to think collaboratively, because now they have to process dozens or hundreds of meaningless machine-generated commits.

I could be wrong, maybe you'll try it and it'll be a really novel solution, but I think there's a reason people don't normally configure their text editor or IDE to commit on every save.

> You know the irony: projects which think well about things, up to providing a version number, oftentimes get stuck at 1.0, and vice versa - where distinguishing versions would directly help is where it's not available ;-). So, I'd prefer to do the good-thinking part, but perhaps explicitly sticking to version 1.0 isn't bad either ;-).

I think it's good that you're doing design up front and thinking about it first, but I promise you won't get it 100% right this time. There are things the collaborators so far haven't thought of, and if the project becomes successful then new people will turn up with use cases you haven't thought of at all.

I really recommend doing the best version 1.0 you can, but have a facility to provide a version 2.0 without old versions of the software just crashing out!

pfalcon (Owner) commented Jul 23, 2015

> One idea, I don't know how feasible: is it possible to only write database data when it has varied from whatever the tool automatically infers from the currently loaded definition?

Do you mean taking what the loader, and mostly the disassembly analyzer, constructs as a baseline, and not saving that? But reconstructing such a baseline on each load is rather expensive; for example, the initial analysis of the 320K mentioned above takes ~30s. The whole reason a DB is needed is to avoid re-analyzing it all the time ;-).

> It may actively hinder people who do want to think collaboratively, because now they have to process dozens or hundreds of meaningless machine-generated commits.

Again, "meaningness" is not the cornerstone here. Accountability and measure of progress is. For example, if turned out that dude X's contributions to project was in fact sabotage (which may be not immediately visible), then not the whole project is contaminated, but only data touched by him, and his commits can be reverted (yes, possible with a lot of pain, which is still probably less pain than losing everything).

> I could be wrong, maybe you'll try it and it'll be a really novel solution, but I think there's a reason people don't normally configure their text editor or IDE to commit on every save.

"Normally" is a keyword here. It's just well-know as not being a best practice. A lot of people still use essentially that (commits at arbitrary times with dummy or empty commit messages). And figure what - it's still better than giving out to the world just the single latest snapshot ;-).

But you're right that I don't know how useful that will be for actually fostering review of contributions, or the contributions themselves. What I'm sure of is that adding natural-language descriptions of functions to a wiki at a pace of 1 func per week isn't the way to go, so I'm trying wild ideas on how to streamline that ;-).

pfalcon (Owner) commented Aug 20, 2015

Ok, so I just tagged 0.8.9, which has initial support for YAML. Actually, the way it is implemented, there's a new data structure too, and the old and new data structures, as well as the old and new save formats, are used in parallel, to make it possible to compare and validate the new stuff. Caveat: currently it cannot load older saved state, only start from scratch and save. That's why it's called 0.8.9. 0.9 is going to have migration of the old format, but I find it a tad non-exciting to implement, knowing that neither I nor, likely, anyone else needs it. I'm going to finish this exercise, but I opened #9 to discuss how to handle migrations in the future.

pfalcon (Owner) commented Aug 21, 2015

Ok, using the new YAML-based format, but migrating the old format if detected, is now in master. The only thing left before tagging 0.9 is adding a format version, as suggested by @projectgus. The proposed syntax is:

```yaml
header:
 version: 1.0
```

More things can be put into "header" later.
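
A sketch of the check @projectgus suggested, against this header block (SUPPORTED_VERSION and the failure behavior are illustrative):

```python
# Sketch of the version check: refuse to load databases newer than the
# format this tool understands.
import yaml

SUPPORTED_VERSION = 1.0

def load_checked(path):
    with open(path) as f:
        data = yaml.safe_load(f)
    version = data.get("header", {}).get("version", 0)
    if version > SUPPORTED_VERSION:
        raise SystemExit("%s uses format version %s; this tool supports "
                         "only up to %s" % (path, version, SUPPORTED_VERSION))
    return data
```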

pfalcon (Owner) commented Aug 21, 2015

The 0.9 tag has been cut. https://github.com/pfalcon/ScratchABit/wiki/DBMigration has been added as a DB migration reference.

pfalcon closed this as completed Aug 21, 2015