Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An API for load/edit/save .conf and .json files #272

Closed
havocp opened this issue Mar 3, 2015 · 12 comments
Closed

An API for load/edit/save .conf and .json files #272

havocp opened this issue Mar 3, 2015 · 12 comments

Comments

@havocp
Copy link
Collaborator

havocp commented Mar 3, 2015

Continuing the discussion from #149 , once we merge #271 we will have all of the information we need to save the file again in memory.

Some initial thoughts...

The currently-private API in https://github.com/typesafehub/config/blob/master/config/src/main/java/com/typesafe/config/impl/Tokens.java is pretty awful as a public API, with no real thought to clarity, namespacing, extensibility, or anything of that nature.

I also recently messed with SimpleConfigOrigin in #267 and discovered that it's quite a can of worms to change that thing. Not impossible but not much fun either. I'm also a little concerned about the memory footprint of hanging a bunch of tokens off of every ConfigOrigin, when most apps won't use the info at all. (A potential solution would be to drop the tokens at the first use of withFallback.)

But, I'm not sure we even need the tokens on the origin, because if we have a ConfigDocument kind of API, it could simply support getting the nodes at a given path.

What I don't know yet is what the ConfigDocument or ConfigNode API should look like, exactly.

It could be very limited, supporting only setting paths to values, parsing the doc into a config, and rendering:

interface ConfigDocument {
   // replaces tokens at path with new tokens representing the given value
   ConfigDocument setValue(String path, ConfigValue newValue);
   // parses tokens into a Config and returns it
   ConfigObject parse(ConfigParseOptions options);
   // render doc by rendering tokens as originally formatted
   String render();
}

Questions already arise:

  • where do we put the value if the setValue path wasn't in the doc already?
  • what happens if the path was in the doc more than once?
  • ConfigFactory has all kinds of overloads for the parse() method to support parsing files, urls, resources, etc.; can we avoid having all of these for ConfigDocument, maybe we only support parsing Reader and File?

Still, a minimal API like that is potentially good enough for load/edit/save ? If so we can go with YAGNI and not do a lot more to start ...

The thing I'm wondering whether we need, would be a ConfigNode kind of thing - perhaps slightly higher-level than tokens, or perhaps almost the same as our current set of tokens, I'm not sure - this would allow looking at substitution expressions, comments, and other stuff and manipulating them sort of like the DOM. This may be needed for something like #225, but it's a lot more API surface area to get wrong.
You would also need an API like this to write a HOCON formatter, for example.

It could be worth tackling the basic "load, change some values, save" use-case before the general AST use-cases, but I don't know.

@MSLilah
Copy link
Contributor

MSLilah commented Mar 4, 2015

@havocp @cprice404 So here's what I'm thinking:

Step 1: We create a new ConfigDocument class. This class will have a constructor that takes an input to be parsed. It will then proceed to tokenize its input and produce a TokenIterator.

Step 2: Create a new ConfigNode class. This will likely be an abstract class, with various subclasses representing certain types of values. These classes will contain the relevant tokens so that they can be modified or replaced as necessary. Child nodes will be stored in an ArrayList to preserve ordering.

Step 3: Write a new Parser. This Parser will be less complex than the current Parser. It will take all the tokens and construct an AST. This will allow us to find the relevant nodes when we call the setValue() method.

In cases like the below:

foo: {
  bar: "bar"
}
foo.baz = "baz"

We would NOT parse baz into being a child node of foo, and instead would have a foo.baz node on the same level as foo. This will require a more sophisticated searching algorithm but I think we can do it.

Step 4: Write the setValue() method. This will find the relevant ConfigNode and replace the necessary token. We'll probably need a setValue() method on each subclass of ConfigNode, as a number of them will likely need to be handled differently.

Step 5: Write the render() method, which just traverses the tree, rendering all the tokens into a single string. Ordering should be preserved since child nodes will be stored in an ArrayList.

How does this sound? I'm thinking this would likely only support File at first. Also, when a path that hasn't been seen before is added, it should be rendered in the correct place by the tree traversal. I'm still unsure how we would handle the case in which the path appears more than once in the same file, though.

@havocp
Copy link
Collaborator Author

havocp commented Mar 4, 2015

Here's what I think the "separation of concerns" could be:

  • ConfigValue: has withFallback() and resolve(), so supports merging multiple files. Has the "typed" getters (getInt, getString, etc.). Does not expose substitutions or concatenations or multiple values at a path or include statements, except that it can be in an "unresolved" state where these things potentially haven't been processed yet.
  • ConfigNode: Group of tokens representing a logical unit within a config file. Represents: "include" statements, substitutions, multiple nodes at the same path, and syntatic details such as hash-comments vs. slash-comments, or different types of string quoting. Does not support: merging/withFallback, resolve, typed getters. Doesn't process include statements, just has an "include" node.
  • ConfigDocument: Represents a single file or stream as a list or tree (?) of ConfigNode, where nodes in turn contain tokens.
  • Tokenizer: goes from a Reader to tokens
  • Parser1/NodeParser: goes from tokens to ConfigNode
  • Parser2/ValueParser: goes from ConfigNode to ConfigValue (or tokens to ConfigValue?). currently called just "Parser"

Re: the steps you mention, some thoughts:

  • Step 1: nitpick, I wouldn't stick the Tokenizer or Parser1 in this class's constructor, make them static methods in their own files then construct ConfigDocument from the result.
  • Step 2: Yes, though I'm not sure yet whether ConfigNode is public, or at least not sure which methods on it are public
  • Step 3: I think this is Parser1. At least at first, you could make it parallel to the existing Parser2 rather than modifying Parser2 in-place. Changing Parser2 from tokens=>ConfigValue to ConfigNode=>ConfigValue could be considered a separate task.
  • Step 4: yes. I'm not sure what the complexities of splicing in replacement tokens will turn out to be. we have to be careful that the result is always a valid file, while keeping the formatting as close as possible to the original.
  • Step 5: yes

As far as how to proceed - I think it might work best to avoid modifying the existing parser (parser2) at first, though you could borrow liberally from it; instead, focus on Parser1 and defining ConfigNode (if you like, make it a class AbstractConfigNode in the impl package, and we'll factor out a public config.ConfigNode interface later, using the same pattern as ConfigValue/AbstractConfigValue). You could make yourself a new test file ConfigNodeTest or whatever to test Parser1 and ConfigNode.

If you start there, then we can write tests and work out what the kinds of ConfigNode are and generally know what's involved here. Reading through Parser.java as it exists, it looks quite hard to refactor to a Tokens=>ConfigNode parser, but maybe pretty easy to refactor so that it parses ConfigNode instead of parsing Tokens. That is, most of the "shape" of Parser.java looks to me like it's driven by the needs of generating ConfigValue. I could be wrong as always.

Remember that the file could be considered either JSON or HOCON, and if it's JSON we need to enforce JSON rules and also ensure we create valid JSON.

A thorny issue for ConfigNode/ConfigDocument, perhaps, is how to arrange a document like this so it's easy to edit a value using the API:

foo {
  bar = 10
  bar = 11 // bar duplicated
  baz = 12
}
foo.bar = 13 // bar duplicated again, but with the path
foo.baz = 14 // and baz duplicated
foo {
  bar = 15 // not again!
}

So from the standpoint of recreating this file, you probably need a tree that matches how it's written above... a node foo with three children bar, bar, baz, then a node foo.bar, then a node foo.baz, then a node foo with one child bar.

However, that syntax-retaining tree will be annoying from an editing standpoint, and the setValue on individual nodes could be quite misleading. The "right" way to set foo.bar=42 in a file could be one of the following:

  • remove all nodes that are at the path foo.bar and then append a new node foo.bar = 42
  • remove all nodes except the last one and then change the last one's value to 42
  • leave all nodes but change the value of the last one to 42
  • simply append foo.bar=42 and leave all the other ones (problem with this is that it isn't idempotent, i.e. your file can grow and grow if you keep doing it)
  • leave all nodes, change the value of the last one to 42 unless it's a substitution, if it's a substitution append a new node instead
  • ... more possibilities no doubt exist

This sounds more terrible than it is, since in practice values and files you edit by machine probably won't be that complicated. So a simple rule like "remove all nodes but the last one and change the last one's value" is likely fine. That could become a method like ConfigDocument replaceAll(String path, ConfigValue) ... we do need a ConfigNode.setValue to implement this I think, but in most cases people should prefer to use a document-level setter.

Implementation-wise, remember that we use an all-immutable convention. For trees, one implication of this is that nodes can't have a parent() (since it would change every time you changed any node in the tree).

@MSLilah
Copy link
Contributor

MSLilah commented Mar 4, 2015

@havocp Hmm, alright, that makes sense. So it sounds like maybe I should start by defining a ConfigNode class, and then writing Parser1, or whatever it ends up being called, going one type of ConfigNode at a time.

In terms of inserting into the tree, I'm thinking we should leave all the nodes but change the value of the last one to 42 to preserve the original text of the file as close as possible. However, I also think removing all nodes except the last one and then changing the last one's value to 42 sounds like it might work well.

@cprice404
Copy link

For our use cases, I think we pretty much frown upon (or just outright don't support) cases where someone defines the same value multiple times in the same file. So from our perspective I'd love to see us end up in a state where modifying a value would effectively remove dupes from the file... but assuming we're not the only target audience for these changes :), we'd obviously be happy to defer to whatever algorithm you think is best, @havocp .

@havocp
Copy link
Collaborator Author

havocp commented Mar 4, 2015

I think the document.replaceAll kind of method will effectively just be a convenience method - you could also iterate over the nodes by hand and do what you want - but we should try to make it conveniently do what we think people will want. I think I agree with removing dups and changing the last one.

Starting with parser1/ConfigNode and a test file for same sounds good to me.

@MSLilah
Copy link
Contributor

MSLilah commented Mar 5, 2015

@havocp Here's what I'm thinking for the ConfigNode implementation:

There will be an abstract AbstractConfigNode class from which all nodes inherit, and each individual token will be wrapped in a ConfigNode of some sort, and then put in a tree structure. There will be around 5 different types of ConfigNode (class names are not final).

  • BasicConfigNode: This will wrap all your basic tokens, like comments, whitespace, and various keywords/symbols (things like +=, =, :, ,, include, etc.). These nodes will NOT be modifiable. They only contain one method, render(), which renders the child token back into its original text.
  • ConfigNodeSetting: This will represent the name of a HOCON setting. This is functionally the same as BasicConfigNode, but is a separate type to indicate this is a key in a map. This does not need a reference to its value because ConfigNodeComplexValue preserves the ordering of its child nodes, so the next value node appearing after ConfigNodeSetting is the value node attached to that setting.
  • ConfigNodeValue: This node will be parsed whenever a value attached to a key/setting is parsed. This node DOES NOT CONTAIN ANY TOKENS. Instead, it contains another config node that can be one of two types, ConfigNodeSimpleValue or ConfigNodeComplexValue, as well as a method to replace its child node. This seemed weird to me, but doing it this way allows us to replace a simple value with a complex value and vice versa.
  • ConfigNodeSimpleValue: This node represents a simple setting value (any value type that is not a data structure like a map or an array). This is functionally the same as BasicConfigNode, as replacing settings will involve replacing the entire value node rather than the token associated with it.
  • ConfigNodeComplexValue: This node represents values that are data structures, like maps or arrays. The top-level node is an instance of ConfigNodeComplexValue. This node contains an ArrayList of ConfigNodes representing its children. This type of node is rendered by rendering all of its child nodes in order. New nodes can be inserted into this node type. Nodes can be removed or replaced as well.

Does this sound sane/like what you were thinking? Also, @cprice404 you might be interested in this

@MSLilah
Copy link
Contributor

MSLilah commented Mar 6, 2015

The ConfigNodeComplexValue class will do the heavy lifting of replacing settings based on a given path. It will also have a method, children(), which will return the list of children so the user can modify it how they want.

@havocp
Copy link
Collaborator Author

havocp commented Mar 6, 2015

It sounds reasonable to me; I would mostly have naming nitpicks I think. I guess I'd also say, try it and maybe we'll have to change it after we see how it goes, but it sounds like a good first thing to try.

Naming thoughts (not the final word, just ideas):

  • be consistent about prefix vs. suffix, so ConfigNodeBasic, ConfigNodeSimpleValue, etc. probably. Also, for private/internal class names it's fine to use NodeSimpleValue with no Config, and we can add the Config for the public interfaces later. (As you may have noticed, the public API is almost all interfaces, but to get this working I think you could skip the public API at first and then we'll factor it out. Eclipse can even do that automatically.)
  • BasicConfigNode - it sounds like these are nodes which are just passthrough text. So maybe NodeSyntax or NodePassthrough or NodeToken ...
  • ConfigNodeSetting - NodeFieldName perhaps?
  • ConfigNodeValue - you probably don't need a "holder" node for simple/complex, in an immutable setup. it would be enough to have a base class for the two kinds of value, and a way to replace the value within whatever normally contains it.
  • Simple/Complex - maybe NodeValue, NodePrimitive extends NodeValue, NodeContainer extends NodeValue, NodeObject extends NodeContainer, NodeList extends NodeContainer ? not sure. Current codebase uses Simple to mean "concrete default implementation of an abstract class or interface" - I don't especially like that convention, but we shouldn't make Simple mean anything else probably.

Don't get too hung up on the names; we can easily search-and-replace later. The main reason I bring it up is that I think naming can be helpful for conceptual clarity.

Mostly I'd say just give it a shot and see how it turns out, it's hard for me anyway to predict this too much a priori.

@MSLilah
Copy link
Contributor

MSLilah commented Mar 31, 2015

@havocp So I'm getting to work now on indentation when adding to a ConfigDocument, and wanted to get your feedback on approaches. Here's my idea:

This has two phases: indentation calculation and indentation addition to a value node.

To calculate the indentation, we can loop through the list of children in the object, and check for newline characters. If there is no newline character, we add the new setting onto the same line. If there IS a newline character, we calculate the indentation on the last line of the object, and use that indentation, adding the setting on its own line.

We will then have to add the indentation to a value node. This should not be particularly difficult; we can have a new protected method on ConfigNodeComplexValue that will take in an indentation string and insert it after every newline it sees. This will ensure consistent indentation if the parsed value was a multi-line object or array.

The only problem I can see with this approach is the case in which we're inserting a multi-line value into a single-line object. So in a case like this:

a { b { c { } } }, if we wanted to insert

{
    e : 12
    f : 13
}

at path a.b.c.d, I'm not sure of a good way to handle that indentation. We COULD add a new-line to the object at c, but then the question becomes how much do we indent that? What if the map we want is contained entirely on a single line, but a parent map a couple levels up is indented?

Mostly I just wanted to run this approach by you, and see if you had any suggestions/feedback, or if you'd prefer a different approach.

Thanks!

@havocp
Copy link
Collaborator Author

havocp commented Mar 31, 2015

Philosophically I think this can be a pretty simple best-effort algorithm - if the file is nicely-formatted already in a reasonable way, then we try to match that, but if the file is too creative then too bad. As long as we produce a valid file it's ok to be a little ugly in corner cases.

For a { b { c { } } } if a itself is indented, you could use the a indent to decide on the spaces-per-nesting-level, and then indent based on that ? like if a is indented two spaces and is nested once, then we have two-space indentation; so b should be indented 4 spaces, c indented 6 spaces, etc. ?

I don't mean to imply you'd reformat the existing stuff; just if you add fields inside c you'd use this to decide on their indentation?

Anyway I think what you describe sounds fine. Let's just do something simple that works in common cases (and always produces a valid file, even in the corner cases where it gets the indentation a little weird).

@MSLilah
Copy link
Contributor

MSLilah commented Mar 31, 2015

Sounds good! Thanks for the feedback as always!

@havocp
Copy link
Collaborator Author

havocp commented Apr 2, 2015

I think we can declare victory on this issue and close it! There are some more APIs we want to do, but I'll make new issues for those (or if I don't, anyone is welcome to of course). Thanks so much for the work on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants