Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Denormalise 'shared' formulas #7

Closed
nacnudus opened this issue Jan 21, 2017 · 0 comments
Closed

Denormalise 'shared' formulas #7

nacnudus opened this issue Jan 21, 2017 · 0 comments
Milestone

Comments

@nacnudus
Copy link
Owner

nacnudus commented Jan 21, 2017

The deal here is that Excel doesn't write every formula to disk. Formulas that differ only by the cell references, (e.g. =A1 and =A2 are basically the same, bar a row-offset), are only written once, and other cells refer to that one instance. The application is expected to parse the 'master' formula, and then reconstruct the others based on their relative position.

That's fine for spreadsheet applications, which have to parse the formula anyway for calculation, but for tidyxl, which leaves the formulas as strings, it's a pain.

There are a few open-source parsers around, which I refer to in comments below. Most are handwritten. The only one that uses a parser generator and a grammar is XLParser, which targets analysis rather than calculation. Microsoft publishes a grammar, Excel (.xlsx) extensions to the office openxml spreadsheetml file format p.24, but it is 25 pages long, and horrible.

The minimum-viable parser for tidyxl would simply separate cell references from the rest of the formula, offset them, and put the pieces back together. I've written the grammar, so just need to design a formula object to hold the pieces and handle the offsetting.

Parsers

  • C++ (MIT licence), which I have forked for a couple of fixes, and wrapped in an R package.
  • C# parser-generator grammar and my attempt to port it to C++/Rcpp using the PEGTL header-only C++ parser generator. Another parser generator I could use is Boost C++ included in the BH package and used in readr, with similar downsides to PEGTL (it backtracks after 'successful' intermediate matches have acted their side-effects, which is too late). My port now also has a much simpler parser that simply extracts cell references from the rest of a formula.
  • Handwritten Apache POI parser in Java.
  • Gnumeric parser. Gnumeric is a goldmine of well-commented C++ code:
  • Python for pretty-printing Excel formulas (could be an opportunity to use crayon?)
@nacnudus nacnudus modified the milestone: 0.3 Jan 28, 2017
@nacnudus nacnudus changed the title Propagate formulas across groups Denormalise 'shared' formulas (tokenize them first) Aug 1, 2017
@nacnudus nacnudus changed the title Denormalise 'shared' formulas (tokenize them first) Denormalise 'shared' formulas Aug 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant