
Fenced code blocks with multiple children (spans from highlighters) not converted #45

Closed
sanzoghenzo opened this issue Jan 28, 2024 · 3 comments

Comments

@sanzoghenzo

Hi, first of all many thanks for your work. I'm using this library in my Android app and it's working really well!

Unfortunately, a user of the app opened an issue because some code blocks in a webpage don't get converted: only the first line is displayed.

That specific webpage was created with Jekyll from a markdown source, so I expect that many other websites could be affected.

This is an excerpt from the page:

<div class="language-python highlighter-rouge">
  <div class="highlight">
<pre class="highlight"><code><span class="c1">#! /usr/bin/env python3
</span>
<span class="kn">import</span> <span class="nn">tika</span>
<span class="kn">from</span> <span class="nn">tika</span> <span class="kn">import</span> <span class="n">parser</span>

<span class="n">fileIn</span> <span class="o">=</span> <span class="s">"berk011veel01_01.epub"</span>
<span class="n">fileOut</span> <span class="o">=</span> <span class="s">"berk011veel01_01.txt"</span>

<span class="n">parsed</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">from_file</span><span class="p">(</span><span class="n">fileIn</span><span class="p">)</span>
<span class="n">content</span> <span class="o">=</span> <span class="n">parsed</span><span class="p">[</span><span class="s">"content"</span><span class="p">]</span>

<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">fileOut</span><span class="p">,</span> <span class="s">'w'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
</code></pre>
  </div>
</div>

The issue is at this line: only the first child of the code tag is read.

@sanzoghenzo
Author

I tried with a custom rule: replacing the offending line with childNodes().map((e) => e.textContent).join() renders all the code.

I'm not sure how to solve the language identification: that info appears two divs up instead of on the first child. I don't know if it is a standard of some kind or a particular case of this website (it uses Jekyll and a Bootstrap theme).
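
For anyone who wants to see the difference outside the library, here is a minimal sketch that uses package:html directly rather than html2md's own node wrapper. The helper names firstChildText and allChildText are made up for illustration, and the sample markup is a shortened version of the excerpt above.

```dart
import 'package:html/parser.dart' show parseFragment;

/// Text of only the first child of <code> (what the current rule ends up with).
String firstChildText(String html) {
  final code = parseFragment(html).querySelector('code')!;
  return code.nodes.first.text ?? '';
}

/// Text of every child of <code>, joined with no separator.
String allChildText(String html) {
  final code = parseFragment(html).querySelector('code')!;
  return code.nodes.map((n) => n.text ?? '').join();
}

void main() {
  // Shortened, hypothetical sample in the shape Rouge produces.
  const html = '<pre class="highlight"><code>'
      '<span class="c1">#! /usr/bin/env python3\n</span>\n'
      '<span class="kn">import</span> <span class="nn">tika</span>\n'
      '</code></pre>';

  print(firstChildText(html)); // only the shebang line
  print(allChildText(html)); // the shebang line plus the import line
}
```

The highlighter wraps each token in its own span, so the first child holds only the first comment line. Joining the text of all child nodes restores the whole block.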

@jarontai
Owner

jarontai commented Feb 2, 2024

Hi sanzoghenzo, I don't think that's a standard for code blocks.

In the custom rule, you can get the parents, or any other elements you want, by using the DOM API.
For example (may not work): node.asElement()?.parent?.parent?.classes;

@sanzoghenzo
Author

Thanks @jarontai, your example missed a parent, but it was very helpful!

I've turned it into a generic "walker" of all the parents. I'll leave it here for posterity:

// Returns the language named by a `language-xxx` class on the first child
// of `node` or on any of its ancestors, or '' if none is found.
String getLanguage(node) {
  final regex = RegExp(r'language-(\S+)');
  // Usual case: the class sits on the first child of the node.
  final firstChildClass = node.firstChild?.className ?? '';
  final direct = regex.firstMatch(firstChildClass)?.group(1);
  if (direct != null) {
    return direct;
  }
  // Otherwise walk up the ancestors and check each of their classes.
  var element = node.asElement()?.parent;
  while (element != null) {
    for (final name in element.classes) {
      final matched = regex.firstMatch(name)?.group(1);
      if (matched != null) {
        return matched;
      }
    }
    element = element.parent;
  }
  return '';
}
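
The same ancestor walk can be tried outside html2md. Below is a small typed sketch written against package:html directly; the function name languageFromAncestors is hypothetical, and unlike the function above it also checks the starting element itself. The sample markup mirrors the nesting of the excerpt in the first comment.

```dart
import 'package:html/dom.dart' show Element;
import 'package:html/parser.dart' show parseFragment;

final _languageClass = RegExp(r'language-(\S+)');

/// Returns the language named by the nearest `language-*` class on [start]
/// or one of its ancestors, or '' when none is found.
String languageFromAncestors(Element start) {
  for (Element? el = start; el != null; el = el.parent) {
    for (final name in el.classes) {
      final match = _languageClass.firstMatch(name)?.group(1);
      if (match != null) return match;
    }
  }
  return '';
}

void main() {
  // Hypothetical sample with the same wrapper divs as the Jekyll page.
  const html = '<div class="language-python highlighter-rouge">'
      '<div class="highlight">'
      '<pre class="highlight"><code><span class="c1">#!</span></code></pre>'
      '</div></div>';
  final code = parseFragment(html).querySelector('code')!;
  print(languageFromAncestors(code)); // python
}
```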
