Code restructuring and collaboration #34

Open
teucer opened this issue Jul 3, 2020 · 4 comments

@teucer

teucer commented Jul 3, 2020

@gpoore

I went through the code; please find some suggestions below. My goal is not at all to criticize the time and work that you have dedicated to the project. Codebraid fills a very useful niche and I am grateful for your efforts. I just think a more novice-friendly code base might attract more people to contribute. Please let me know what you think.

  1. Some functions are very long, e.g. _run. It would be helpful to have more modules and smaller functions, cf. the Unix philosophy.
  2. Related to the above, a lot of effort seems to be spent on reading and processing the cell configuration. It looks like a DSL; maybe that could live in a separate module.
  3. It would be helpful to use type hinting.
  4. The Pandoc JSON/AST might not be ideal to process. Maybe another parser/format could be used to pre-process and execute the code (marked and cmark have nicer ASTs).
  5. It would be helpful to remove Pandoc from the code base: Codebraid would only convert from input Markdown to executed Markdown.
  6. Maybe consider supporting just Jupyter kernels to leverage their capabilities. I don't know the implications in terms of performance.
@gpoore
Owner

gpoore commented Jul 4, 2020

Thanks for the feedback!

  1. Yes, some of the code could be improved. With _run(), part of the issue is that I've rewritten it a half-dozen times or so at this point and still have a lot of features to add, so it's nowhere near final form. The other area that's particularly complex is the Pandoc AST handling. In that case, further simplification/refactoring may be limited by performance considerations.

  2. Yes, there are many ways things could potentially be reorganized.

  3. There are some type hints in function definitions. There will be more type hints once Python 3.5 support is dropped.

  4. Pandoc has some disadvantages. It affects performance since it must be run several times and there is a lot of serialization/deserialization. However, I don't believe there are any other Markdown variants that have comparable built-in features. If you want Pandoc's power, it's best to use Pandoc from the beginning. At a minimum, Codebraid requires built-in support for a standardized attribute syntax for both code blocks and inline code, which I believe is still lacking in CommonMark after all these years. Another advantage is that if everything is based off the Pandoc AST, then in principle with a few changes Codebraid can add support for LaTeX and other formats supported by Pandoc.

  5. Codebraid's Markdown support is fundamentally built on Pandoc. Without Pandoc, Codebraid would have to parse Markdown to locate code. That would require a Markdown parser, which goes back to point 4. Many programs have done things similar to Codebraid by using regex to locate code, but that will fail at times unless at least a partial Markdown parser has been implemented, which again goes back to point 4. Codebraid could just use Pandoc to convert to Markdown, but if Pandoc is being used anyway it is more efficient to go straight to the final output format. (A rough sketch of locating code via the Pandoc JSON AST follows this list.)

  6. Jupyter kernels are great when used in the notebook, because you can get results back quickly without rerunning previous code or reloading data (though this can lead to problems when previous code should be re-executed but is not). Codebraid always starts a kernel, runs all code for a session, and then stops the kernel (see the second sketch after this list). At least for a lot of what I do, the built-in system can finish before a Jupyter kernel can even start. I have thought about adding a different mode that would keep kernels running in the background.

    There are other limitations of Jupyter kernels. When you run code with a typical kernel, you are getting the result of the language plus the kernel. You aren't getting the result of sending the code through the language's standard compiler or interpreter. When I'm writing about code and giving example output, I sometimes want the default result without any Jupyter modifications or add-ons.

    Also, my impression is that creating a Jupyter kernel for some languages, especially compiled languages, can be very complex. The built-in system just needs a few lines of template code to add a new language.
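
For reference on points 4 and 5, here is a minimal, hypothetical sketch of what locating code via Pandoc's JSON AST involves. It relies only on the standard `pandoc --to json` output; the function and variable names are illustrative, not Codebraid's actual internals.

```python
import json
import subprocess

def find_code_nodes(markdown_text):
    """Run pandoc once and collect inline Code and CodeBlock elements from its JSON AST."""
    result = subprocess.run(
        ["pandoc", "--from", "markdown", "--to", "json"],
        input=markdown_text, capture_output=True, text=True, check=True,
    )
    ast = json.loads(result.stdout)
    code_nodes = []

    def walk(node):
        if isinstance(node, dict):
            # Code and CodeBlock elements carry [[identifier, classes, key-value pairs], text],
            # so the attributes arrive already parsed.
            if node.get("t") in ("Code", "CodeBlock"):
                (ident, classes, keyvals), text = node["c"]
                code_nodes.append({"classes": classes, "keyvals": keyvals, "text": text})
            else:
                walk(node.get("c"))
        elif isinstance(node, list):
            for child in node:
                walk(child)

    walk(ast["blocks"])
    return code_nodes
```

The real AST handling does far more (nested elements, session grouping, reinserting output), but it shows the point above: Pandoc hands back attributes like `{.python .cb.run example=true}` as structured data rather than raw text.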
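And for point 6, a hedged sketch of the start-run-stop lifecycle described above, using jupyter_client (the helper name and chunk handling are illustrative assumptions, not Codebraid's code):

```python
from jupyter_client.manager import start_new_kernel

def run_session(code_chunks, kernel_name="python3"):
    """Start a kernel, run every chunk of one session in order, then shut the kernel down."""
    km, kc = start_new_kernel(kernel_name=kernel_name)  # startup latency is paid here
    statuses = []
    try:
        for chunk in code_chunks:
            reply = kc.execute_interactive(chunk, timeout=60)
            statuses.append(reply["content"]["status"])  # 'ok' or 'error'
    finally:
        kc.stop_channels()
        km.shutdown_kernel()
    return statuses
```

Keeping kernels alive in the background, as mentioned above, would amortize the `start_new_kernel` cost across runs instead of paying it for every document.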

@teucer
Author

teucer commented Jul 6, 2020

Some further thoughts:

  1. I don't know whether it would help to think about the architecture now or to wait for the "final" solution before refactoring. There are pros and cons to both approaches. Maybe there is a middle ground?

  2. I was wondering whether decorators could be used internally to achieve that.

  3. Ok.

  4. For comparison, here is how marked parses the input. The line

`print(1 + 2)`{.python .cb.run example=true} 

is parsed as

{type:"space", raw:"\n\n"}
{type:"paragraph", raw:"`print(1 + 2)`{.python .cb.run example=true}", text:"`print(1 + 2)`{.python .cb.run example=true}", tokens:[
  {type:"codespan", raw:"`print(1 + 2)`", text:"print(1 + 2)"}
  {type:"text", raw:"{.python .cb.run example=true}", text:"{.python .cb.run example=true}"}
]}

cmark has a similar AST. I believe this is much easier to work with than Pandoc's AST. Besides, with cmark (which has a Python wrapper) everything stays in memory; there is no need to call Pandoc, save the file, and reload it as JSON (see the sketch at the end of this comment).

  5. This creates an internal dependency on Pandoc. For example, I have seen that you need to add Pandoc's parameters manually. Having a single-purpose tool is better, imho.

  6. Ok.
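
To illustrate the in-memory point, here is a rough sketch using the pure-Python commonmark package (which follows the cmark spec); this is only an approximation of what a cmark-based pipeline might look like:

```python
import commonmark

source = "`print(1 + 2)`{.python .cb.run example=true}"

parser = commonmark.Parser()
ast = parser.parse(source)  # everything stays in memory, no external process

# Walk the AST and print the nodes that carry code or text content.
walker = ast.walker()
event = walker.nxt()
while event is not None:
    node, entering = event["node"], event["entering"]
    if entering and node.t in ("code", "code_block", "text"):
        print(node.t, repr(node.literal))
    event = walker.nxt()
```

Note that the code span and the attribute string come back as separate nodes, and the `{.python .cb.run example=true}` part is an ordinary text node that the caller would still have to parse itself.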

@gpoore
Owner

gpoore commented Jul 6, 2020

For 1 and 2: I agree that some of the code could be organized better, but I am hesitant to do any refactoring simply on that basis. I prefer to refactor when limitations of the current, working code become apparent, or when refactoring has objective, concrete, well-defined goals. At least in my experience, refactoring as part of adding functionality results in code that mirrors functionality. I've wasted a lot of time in the past by refactoring to make things look nicer, which ultimately produced an architecture that was incompatible with adding future functionality and thus eventually resulted in even more refactoring.

For 4: The ASTs produced by marked and cmark are simpler because they don't do nearly as much. Neither parses code block attributes and neither supports inline code attributes, so Codebraid would then have to perform additional AST modifications to get its own, custom AST that contains all the necessary information. Pandoc's AST is more complex, but that's because it does more and as a result is ready to use immediately. Also, everything already works with Pandoc's AST, and being compatible with Pandoc brings a lot of possibilities for the future.

For 5: I think we simply have philosophical differences here. I'm primarily concerned with Codebraid being easy to use. Letting people run code in their Pandoc documents by adding codebraid in front of the command they would normally use is about as simple as it gets. Having users pipe codebraid output through pandoc makes things more complicated for users, and also adds a lot of extra overhead when speed is already a concern. Also, this is already the way things work, so changing things would mean that everyone currently using Codebraid would suddenly have to change their workflows.

@mfhepp

mfhepp commented Oct 25, 2022

+1 for keeping Pandoc as a core dependency.
