Preserve as much of the original structure as possible #47

Open
gitonthescene opened this issue May 29, 2020 · 13 comments

@gitonthescene

Hello there,

Thanks again for such an awesome project. It would be great to have an orga-stringify utility to fit more completely into the unified ecosystem and open ourselves up to using more transform tools. Then we could parse org files to an AST, transform them and then re-render the org. Ideally minimal transformations would re-render something pretty close to the original. To do that, we'd need to preserve as much of the original structure as possible.

I propose something like these changes. I'm after the effect more than the approach so I'm happy to discuss/modify/whatever. If you'd like me to make this a pull request, please let me know.

My thinking is that the extra structure in the AST can always be stripped when not needed. For instance, you could filter out whitespace/keyword nodes as well as trim() inner text if desired. But having it in the AST allows us to (nearly) faithfully re-render the original org file.
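
To make the "strip it when not needed" idea concrete, here is a minimal sketch of such a filter. The node types `whitespace` and `text` and the `value`/`children` fields are assumptions for illustration, not the schema orga actually emits, and `stripDetail` is a hypothetical helper name.

```javascript
// Sketch only: node types and field names below are assumed, not orga's real schema.
function stripDetail(node, dropTypes = new Set(["whitespace"])) {
  // Shallow-copy so the detailed tree stays available for faithful re-rendering.
  const copy = { ...node };
  if (copy.type === "text" && typeof copy.value === "string") {
    copy.value = copy.value.trim();
  }
  if (Array.isArray(node.children)) {
    copy.children = node.children
      .filter((child) => !dropTypes.has(child.type))
      .map((child) => stripDetail(child, dropTypes));
  }
  return copy;
}

const detailed = {
  type: "paragraph",
  children: [
    { type: "whitespace", value: "  " },
    { type: "text", value: " hello " },
    { type: "whitespace", value: "\n" },
  ],
};

const lean = stripDetail(detailed);
console.log(JSON.stringify(lean));
// prints {"type":"paragraph","children":[{"type":"text","value":"hello"}]}
```

The original tree is left untouched, so a consumer can keep the detailed version for stringifying and hand the lean one to tools that don't care about whitespace.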

Included there is a separate commit with the changes to the snapshot files, in case you just want to see the effect on the AST. I think in a couple of cases it even renders a bit more accurately.

Again, more than happy to discuss.

Thanks again,
-Doug

P.S. I have a prototype for orga-stringify as well which I'll add to my fork as soon as I figure out how lerna works.

@gitonthescene
Author

I've now incorporated orga-stringify into my fork. It's just plain JavaScript currently. But when you run the following code on this sample org file, the output differs from the original only by a single trailing newline.

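// Parse an org file to an AST and render it straight back to org text.
const unified = require("unified");
const vfile = require("to-vfile");
const parse = require("orga-unified");
const render = require("orga-stringify");

// With toJSON: false the output is org text rather than a JSON dump of the tree.
const processor = unified().use(parse).use(render, { toJSON: false });

function main() {
  processor
    .process(vfile.readSync("/sample/orgfile.txt"))
    .then(
      (file) => {
        process.stdout.write(String(file));
      },
      (err) => {
        // Report failures on stderr so they don't mix with the rendered output.
        console.error(String(err));
      }
    );
}
main();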

It optionally just spits out the JSON version of the tree using your getCircularReplacer() function.
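
For readers unfamiliar with the problem: a tree whose nodes point back to their parents cannot be passed to JSON.stringify directly. The sketch below shows the common WeakSet-based replacer pattern; it is an assumption that orga's getCircularReplacer() works along these lines, and the `circularReplacer` name and node shapes here are hypothetical.

```javascript
// Sketch of a circular-reference replacer; orga's own helper may differ.
function circularReplacer() {
  const seen = new WeakSet();
  return (key, value) => {
    if (typeof value === "object" && value !== null) {
      // Drop any object we have already serialized (e.g. a parent pointer).
      if (seen.has(value)) return undefined;
      seen.add(value);
    }
    return value;
  };
}

// A tiny tree with a parent back-reference, which would otherwise throw.
const root = { type: "document", children: [] };
root.children.push({ type: "text", value: "hi", parent: root });

console.log(JSON.stringify(root, circularReplacer()));
// prints {"type":"document","children":[{"type":"text","value":"hi"}]}
```

Note that this pattern also drops legitimate repeated references to the same object, not only true cycles, which is fine for a debugging dump but not for a lossless export.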

@gitonthescene
Author

The head version of my fork now handles the trailing newline. Moreover, it exactly reproduces all of the test examples except three: it renumbers two list examples where the numbers are out of order, and it reformats a raggedly entered table into a more rectangular one.

@gitonthescene
Author

Hey there,

Not that it needs to be a goal for these to line up, but for curiosity's sake I wrote the following tiny elisp function to have a look at what the Emacs internal syntax tree looks like for a given org buffer:

(defun grab-org-nodes (node)
  (list (if (listp node) (car node)) (-map 'grab-org-nodes (om-get-children node))))

You need to package-install both dash.el and om.el to run it. It's just a general outline of the tree. Non-node types show up as nil.

Regards,
-Doug

@gitonthescene
Author

gitonthescene commented Jun 12, 2020

Also, to align with the unified structure maybe orga-unified should be called orga-parse sort of like remark-parse and there can be another package with a frozen parser like remark. Or maybe just make orga-unified have the processor.

@gitonthescene
Author

It would be great to get a reply here. The more full-featured the tools are, the more likely they are to be used.

@boj

boj commented Jul 27, 2020

@gitonthescene I was perusing this project and wanted to say that this all seems to be on the right track. The ability to convert to and from the source material without altering it would be a great fit for the toy I have in mind.

@gitonthescene
Author

Thanks. You're welcome to play with my fork. I'm happy to answer any questions you might have.

@xiaoxinghu
Collaborator

@gitonthescene orga-stringify looks amazing. I was busy working on v2; part of the reason is that with a strongly typed codebase it's much easier to collaborate and keep to a set of conventions. Can you have a look at the current master and see if you can adopt the new style? Also, with v2 we now have Position in nodes. It's extra information that might be useful for faithfully re-rendering the org-mode text. I'd like to help with any issues.

@xiaoxinghu
Collaborator

xiaoxinghu commented Aug 10, 2020

I'd like your opinion here. We now have the ability to tokenize everything, including whitespace. Do you think it's a good idea to include all tokens in the AST? I was worried that it would be too verbose, which is why I currently skip all whitespace. We can easily change that now. We do have a newline token, but it's not included in the final syntax tree. What are your thoughts?

@gitonthescene
Author

Hey, thanks for getting back. I think it makes sense to include all the tokens until they become a performance problem, and even then to make the level of detail optional. I say this because some people may want the full detail in order to "edit" the tree and then stringify it. That was my use case. The only potential problem I see with the extra detail is processing performance, but as Knuth says, "premature optimization is the root of all evil". You can always transform a detailed tree into a less detailed one, but you can't go the other way around. It might even be worth providing a transformer or two that strips whitespace or similar, just to demonstrate. I'm happy to contribute code.
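
A small sketch of why the direction matters: if every character ends up in some token, stringifying is just concatenation, while a tree that has already dropped whitespace cannot reproduce the source. This is a toy tokenizer written for illustration, not orga's lexer, and the token type names are made up.

```javascript
// Sketch only: a toy lossless tokenizer, not orga's lexer.
function tokenize(source) {
  // Runs of newlines, runs of other whitespace, runs of everything else:
  // together these alternatives cover every character of the input.
  const pattern = /\n+|[^\S\n]+|\S+/g;
  return Array.from(source.matchAll(pattern), (m) => {
    const first = m[0][0];
    const type = first === "\n" ? "newline" : /\s/.test(first) ? "whitespace" : "text";
    return { type, value: m[0] };
  });
}

function stringify(tokens) {
  return tokens.map((token) => token.value).join("");
}

const source = "* TODO  Heading\n\n  some   text\n";
const tokens = tokenize(source);
console.log(stringify(tokens) === source); // prints true: full detail round-trips

// Detailed -> less detailed is a simple filter, but the result is lossy.
const lean = tokens.filter((token) => token.type !== "whitespace");
console.log(stringify(lean) === source); // prints false: the spacing is gone
```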

I'll have a look at the master and try to rework orga-stringify. As I said in one of these issues, I was more after the effect than insisting on an approach. I'm a big believer in programming "for effect" (i.e. to an API) since you can always revisit the code later. Plus shipping results helps keep users interested.

Thanks again,
-Doug

P.S. Since most use cases for this run at build time, I'd bet most people aren't that performance sensitive.

@xiaoxinghu
Collaborator

xiaoxinghu commented Aug 10, 2020

> Also, to align with the unified structure maybe orga-unified should be called orga-parse sort of like remark-parse and there can be another package with a frozen parser like remark. Or maybe just make orga-unified have the processor.

My intention is for orga to be standalone. Even though it is heavily modelled after remark, the orga package itself is self-contained. As for package naming: remark is a unified processor, but orga is not; it's basically a function that parses a string into a syntax tree. So I am thinking of renaming orga-unified to orga-unified-parse, because we are going to add more plugins to the ecosystem, like orga-unified-toc etc., and the prefix hints that these packages are meant to be used within the unifiedjs ecosystem. They are just wrappers around packages like oast-to-hast, which is standalone (the only "dependency" is the HAST definition, which is more of a standard convention than a dependency). orga-unified-toc should be a thin wrapper around oast-toc, just like remark-toc is to mdast-util-toc. What do you think?
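
A rough sketch of the "standalone utility plus thin unified wrapper" split described above. Both functions are hypothetical stand-ins (`collectToc` for something like oast-toc, `unifiedToc` for something like orga-unified-toc), and the `headline` node type with `level` and `title` fields is an assumption. The only thing taken from unified is the plugin shape: a function that returns a transformer receiving the tree and the file.

```javascript
// Standalone utility: works on a plain tree and knows nothing about unified.
// The "headline" type and its `level`/`title` fields are hypothetical.
function collectToc(tree) {
  const entries = [];
  const visit = (node) => {
    if (node.type === "headline") {
      entries.push({ level: node.level, title: node.title });
    }
    (node.children || []).forEach(visit);
  };
  visit(tree);
  return entries;
}

// Thin wrapper in the shape of a unified plugin: it returns a transformer
// and delegates all real work to the standalone function.
function unifiedToc() {
  return (tree, file) => {
    file.data = file.data || {};
    file.data.toc = collectToc(tree);
  };
}

// Exercise the transformer by hand, without pulling in unified itself.
const tree = {
  type: "document",
  children: [
    {
      type: "headline",
      level: 1,
      title: "Intro",
      children: [{ type: "headline", level: 2, title: "Details" }],
    },
    { type: "paragraph" },
  ],
};
const file = { data: {} };
unifiedToc()(tree, file);
console.log(file.data.toc);
```

With this split, users outside the unified ecosystem can call the standalone function directly, while unified users register the wrapper with `.use(...)`.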

Take a look at PR #62

@xiaoxinghu added the "discussion" label on Aug 10, 2020
@gitonthescene
Author

gitonthescene commented Aug 11, 2020

If you mean you want to keep the unified wrappers separate from a core orga library, I think that makes sense. One of the things I like about the unified setup is that it tends to be made up of a lot of small packages, so you only have to pull in what you need, much like the UNIX philosophy. If that's the plan, then consistent naming for the unified wrappers also makes sense, and I think your suggestions sound good. (FWIW, I wasn't really sure what names to use when I made the suggestion above.) Also, @wooorm seems like a really helpful guy.

I do kind of like reorg- as a prefix, though. If nothing else, it's less typing.

@tconfrey

@gitonthescene @xiaoxinghu did anything ever come of this?

For my application I'm only concerned with the header, paragraph text and link elements. I was originally dropping all other elements and writing out the header/paragraph/link elements in an application-specific manner. Most recently I've updated to v2 and am now using the position attributes to save the original text and mirror it back out.
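
For anyone wanting to try the same approach, here is a minimal sketch of recovering a node's original text from its position. It assumes unist-style positions (1-based `line` and `column`, with `end` pointing just past the node); whether orga v2 follows that convention exactly should be checked. The `toOffset` and `originalText` helpers and the sample node are hypothetical.

```javascript
// Sketch only: assumes unist-style points (1-based line and column).
function toOffset(source, point) {
  const lines = source.split("\n");
  let offset = 0;
  for (let i = 0; i < point.line - 1; i++) {
    offset += lines[i].length + 1; // +1 for the newline that split() removed
  }
  return offset + point.column - 1;
}

// Slice the untouched source text that a node covers.
function originalText(source, node) {
  const { start, end } = node.position;
  return source.slice(toOffset(source, start), toOffset(source, end));
}

const linkText = "[[https://example.com][a link]]";
const source = "* Heading\nSee " + linkText + " here.\n";

// Hand-built node: the link starts at column 5 of line 2, after "See ".
const link = {
  type: "link",
  position: {
    start: { line: 2, column: 5 },
    end: { line: 2, column: 5 + linkText.length },
  },
};

console.log(originalText(source, link) === linkText); // prints true
```

Elements the application doesn't understand can then be written back verbatim from the saved slice instead of being re-rendered.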
