Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Desktop: Fix #5626: When web clipper clipping code blocks, keep code in multiline and delete code number lines #10126

Merged
merged 38 commits into from Apr 20, 2024

Conversation

wljince007
Copy link
Contributor

@wljince007 wljince007 commented Mar 15, 2024

Desktop: Fix #5626: when clipper, keep code in multiline and delete code number lines

See an example, clipper from https://blog.csdn.net/qq_42025798/article/details/121727781,

html look:

image

befor this modify, web clipper clipping code blocks, code blocks are in one line:

image

after this modify, web clipper clipping code blocks, code blocks are in multiline, readability is good:

image

@laurent22
Copy link
Owner

We would need test units for this, but even then I don't understand what is being fixed at the moment

@personalizedrefrigerator
Copy link
Collaborator

personalizedrefrigerator commented Mar 15, 2024

what is being fixed at the moment

From the description, this seems to fix #5626 (closed due to inactivity).

Edit: For reference, tests for existing Webclipper HTML->MD logic can be found here.

@wljince007
Copy link
Contributor Author

Yes it is. this PR try to fix #5626, I well trying to add test units.

@wljince007 wljince007 changed the title Desktop: when clipper, keep code in multiline and delete code number lines Desktop: Fix #5626: when clipper, keep code in multiline and delete code number lines Mar 15, 2024
Copy link
Contributor

github-actions bot commented Mar 16, 2024

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@wljince007
Copy link
Contributor Author

@laurent22 Add test units packages/app-cli/tests/html_to_md/{code_multiline_1.html,code_multiline_2.html} run ok.

Comment on lines 269 to 271
// delete code number lines
code = code.replace(/(^-\s+\d+\s*$\r?\n)+/gm, '');

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that this may also remove information that shouldn't be removed from certain code blocks. In the comment above could clarify exactly what strings are being removed? (as I'm unable to read that regex)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regular expression removes consecutive lines of code that start with a - followed by a number. The regular expression marker /gm means that each line break is separated by a line, and can be perfectly matched with a line using(^ - \ s+\ d+\ s * $\ r? \ n)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@laurent22 Match the lines:
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuous lines of code starting with - followed by numbers are basically useless information.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make more sense to ignore code-block children of type ul with class pre-numbering. This should also match line numbers in the second example provided.

This would allow the web clipper (and perhaps also the Rich Text Editor) to save blocks that contain lists of numbers, while also removing line numbers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your patient guidance. I will try to modify the code using the method you mentioned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@personalizedrefrigerator Already edit rules.list as your guidance. Test units run ok. Thanks a lot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that this may also remove information that shouldn't be removed from certain code blocks.

@laurent22 This regular replacement code has been removed.


content = content.replace(/\r?\n|\r/g, '\n')
// if code is multiline and in codeBlock, just return it, codeBlock well add fence(default is ```)
if (content.indexOf('\n') != -1 && node.parentNode && isCodeBlock(node.parentNode)){
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please always use strict inequality !==

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed, submitted.

@personalizedrefrigerator
Copy link
Collaborator

I've noticed that the web clipper has trouble with code blocks on Joplin's website (e.g. the table of contents plugin tutorial page). Do you know if this pull request would also fix processing of those code blocks?

(Thank you for taking the time to work on this!)

@wljince007
Copy link
Contributor Author

wljince007 commented Mar 25, 2024

I've noticed that the web clipper has trouble with code blocks on Joplin's website

Add packages/turndown/src/commonmark-rules.js/rules.joplinOrgTokenLineSpan to fix this.

Copy first code blocks from: toc_plugin

html looks:

image

befor rules.joplinOrgTokenLineSpan added, on Joplin 2.13.15, save md looks:

image

after rules.joplinOrgTokenLineSpan added, save md looks:

image

Comment on lines 495 to 511
// Fix: Web clipper has trouble with code blocks on Joplin's website.
// See https://github.com/laurent22/joplin/pull/10126#issuecomment-2016523281 .
// Web clipper clippering contents is all in oneline. The format features are: <pre ...><code ...><span class="token-line">. Span with class of "token-line" represent one line.
// Test case: packages/app-cli/tests/html_to_md/code_multiline_3.html .
rules.joplinOrgTokenLineSpan = {
filter: function (node) {
const grandfather = node.parentNode.parentNode?? null;
return node.nodeName === 'SPAN' && node.getAttribute('class') === 'token-line' && grandfather && isCodeBlock(grandfather);
},

replacement: function (content, node, options) {
content = content.replace(/\r?\n|\r/g, '');
// If content replaced '\r\n' is empty, it indicates that the line is empty line; if not, keep the leading before and after content.
return content === '' ? '\n\n' : node.flankingWhitespace.leading + content + node.flankingWhitespace.trailing + '\n';
}
}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to review this, I don't really understand what it does

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@personalizedrefrigerator, was this for the bug you mentioned, and does the fix make sense to you?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@personalizedrefrigerator, was this for the bug you mentioned, and does the fix make sense to you?

This was the bug I mentioned, however I'm not sure that the fix makes sense to me. I think the issue here is that the <br/>s at the end of Docusaurus code blocks aren't being converted to linebreaks by Turndown.

For context, I've attached simplified HTML for a Docusaurus code block (taken from here):

<code class="codeBlockLines_FJaf">
    <span class="token-line" style="color: rgb(212, 212, 212);">
        <span class="token front-matter-block punctuation" style="color: rgb(212, 212, 212);">---</span>
        <br>
    </span>
    <span class="token-line" style="color: rgb(212, 212, 212);">
        <span class="token front-matter-block"></span>
        <span class="token front-matter-block front-matter yaml language-yaml key atrule" style="color: rgb(86, 156, 214);">id</span>
        <span class="token front-matter-block front-matter yaml language-yaml punctuation" style="color: rgb(212, 212, 212);">:</span>
        [...]
        <br>
    </span>
    [...]
    <span class="token-line" style="color: rgb(212, 212, 212);">
        <span class="token front-matter-block"></span>
        <span class="token front-matter-block punctuation" style="color: rgb(212, 212, 212);">---</span>
        <br>
    </span>
</code> 

Observe that each (inline) token-line span ends with a <br> that previously wasn't detected by Turndown.

Copy link
Contributor Author

@wljince007 wljince007 Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove packages/turndown/src/commonmark-rules.js/rules.joplinOrgTokenLineSpan. Analyze the current code, find the reasons for incorrect formatting in the code block, modify code for preserve the line break formatting for code block. Web clipper clipping code blocks on Joplin's website run ok. Test case run ok.

@wljince007 wljince007 changed the title Desktop: Fix #5626: when clipper, keep code in multiline and delete code number lines Desktop: Fix #5626: When web clipper clipping code blocks, keep code in multiline and delete code number lines Mar 28, 2024
@wljince007
Copy link
Contributor Author

One of the main goals of Joplin's 3.0 version development is to add mobile plugin functionality. Can we take this opportunity to add a user script type for ContentScriptType.TurndownPlugin? So that users can write plugins to affect the process of 'HtmlToMd'. I think this feature is missing in the current Joplin plugin system.

Copy link
Collaborator

@personalizedrefrigerator personalizedrefrigerator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few comments based on a quick read through the code.

Thank you for working on this!

packages/app-cli/tests/html_to_md/code_multiline_1.html Outdated Show resolved Hide resolved
packages/turndown/src/commonmark-rules.js Outdated Show resolved Hide resolved
packages/turndown/src/turndown.js Outdated Show resolved Hide resolved
packages/turndown/src/commonmark-rules.js Outdated Show resolved Hide resolved
wljince007 and others added 4 commits April 9, 2024 17:06
Coding style fix

Co-authored-by: Henry Heino <46334387+personalizedrefrigerator@users.noreply.github.com>
fix typo

Co-authored-by: Henry Heino <46334387+personalizedrefrigerator@users.noreply.github.com>
Optimize regular expressions: .replace(/( *\t*\n)+( *\t*)*$/, '') equivalent to .trimEnd()

Co-authored-by: Henry Heino <46334387+personalizedrefrigerator@users.noreply.github.com>
@wljince007
Copy link
Contributor Author

We could discuss it but you might want to create a thread on the forum about it. If you could mention example plugins that could benefit from it that could help decide if we add this or not

Add Feature request: Add content script type of ContentScriptType.TurndownPlugin

Copy link
Collaborator

@personalizedrefrigerator personalizedrefrigerator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested this locally with Joplin's website and it works!!!

I've left a few minor stylistic/commenting suggestions. For the most part, I think this is ready to merge!

packages/app-cli/tests/html_to_md/code_multiline_5.html Outdated Show resolved Hide resolved
packages/turndown/src/commonmark-rules.js Outdated Show resolved Hide resolved
packages/turndown/src/commonmark-rules.js Outdated Show resolved Hide resolved
packages/turndown/src/turndown.js Outdated Show resolved Hide resolved
packages/turndown/src/turndown.js Outdated Show resolved Hide resolved
wljince007 and others added 5 commits April 18, 2024 14:16
fix typo

Co-authored-by: Henry Heino <46334387+personalizedrefrigerator@users.noreply.github.com>
better comments

Co-authored-by: Henry Heino <46334387+personalizedrefrigerator@users.noreply.github.com>
better comments

Co-authored-by: Henry Heino <46334387+personalizedrefrigerator@users.noreply.github.com>
add space

Co-authored-by: Henry Heino <46334387+personalizedrefrigerator@users.noreply.github.com>
remove space

Co-authored-by: Henry Heino <46334387+personalizedrefrigerator@users.noreply.github.com>
@wljince007
Copy link
Contributor Author

Thank you for the revision. My English proficiency is relatively weak and annotations are not well expressed. Thank you for your guidance.

@laurent22 laurent22 merged commit 294cc4a into laurent22:dev Apr 20, 2024
10 checks passed
@laurent22
Copy link
Owner

Thanks for implementing this @wljince007 and thanks for the review @personalizedrefrigerator!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Web clipper not clipping code snippets properly
3 participants