TinyMCE Paste Filter doesn't carry over formatting from MS Word or Google Docs #1866

Open
jlahijani opened this issue Jan 10, 2024 · 13 comments

Comments

@jlahijani

jlahijani commented Jan 10, 2024

When using TinyMCE with the default paste filter settings, it doesn't include the formatting when pasting from MS Word or Google Docs.

However it does work correctly when copying and pasting from something like the rich text editor here (which I'm telling clients to paste to then copy from as a temporary work-around):
https://html-cleaner.com/

@ryancramerdesign
Member

@jlahijani I don't have MS Word, but I did try with Google Docs and wasn't able to duplicate it. When I copy/paste a Google Docs document full of headlines, lists and links, they all come through properly in TinyMCE. Is it possible that the document you are copying/pasting uses, for example, font sizes rather than the editor's heading styles for things like headlines?

@jlahijani
Author

With the update to 3.0.235 (which updated TinyMCE to the newest version), it looks like the issues with MS Word (in Windows, didn't test macOS) are now fixed.

Google Docs is still imperfect. I will look into it further.

@jlahijani
Author

@ryancramerdesign I made a video demonstrating a bug with pastefilter and some other considerations when pasting from MS Word:

https://www.youtube.com/watch?v=wMmmzeVi4cY

@jlahijani
Author

I made a couple mistakes in my video:

  1. I kept saying "unordered" list when I meant to say "ordered" list. However the same applies to both.
  2. In Word, I have text that says "Headline 3", when it's been actually set to be formatted as H2.

@jlahijani
Author

jlahijani commented Jan 24, 2024

Diving deeper into this... based on some quick research, there doesn't seem to be any open-source JS code that handles this stupid age-old problem. However, I experimented with ChatGPT to write the necessary JS and it looks promising.

This is what I got with two very basic prompts which were:

Can you write a javascript function that cleans up messy code when pasting from word and converts it to clean html?
...
How about unordered lists?

Result...

function cleanWordHTML(input) {
    let doc = new DOMParser().parseFromString(input, 'text/html');

    // Handle unordered lists first, while the style attributes still exist
    // (stripping [style] before this step would leave nothing to match).
    // Word often uses <p> or other tags with special styles or classes for lists
    const isListItem = el => !!el &&
        (el.getAttribute('style') || '').indexOf('mso-list') !== -1;

    doc.querySelectorAll('p[style*="mso-list"]').forEach(first => {
        // Skip paragraphs already moved into a list by an earlier iteration
        if (!first.parentNode) return;

        // Create elements in the parsed document, not the live page
        let ul = doc.createElement('ul');
        first.parentNode.insertBefore(ul, first);

        let p = first;
        while (isListItem(p)) {
            let li = doc.createElement('li');
            while (p.firstChild) {
                li.appendChild(p.firstChild);
            }

            ul.appendChild(li);
            let nextP = p.nextElementSibling;
            p.parentNode.removeChild(p);
            p = nextP;
        }
    });

    // Remove Word's "Mso" classes and style attributes
    doc.querySelectorAll('[class^="Mso"], [class^="mso"], [style]').forEach(el => {
        el.removeAttribute('class');
        el.removeAttribute('style');
    });

    // Remove remaining <span> tags but keep their content
    doc.querySelectorAll('span').forEach(span => {
        let parent = span.parentNode;
        while (span.firstChild) parent.insertBefore(span.firstChild, span);
        parent.removeChild(span);
    });

    // Return the cleaned HTML
    return doc.body.innerHTML;
}

// Usage example
document.getElementById('yourContentEditableDiv').addEventListener('paste', (event) => {
    event.preventDefault();
    const text = (event.clipboardData || window.clipboardData).getData('text/html');
    const cleanHTML = cleanWordHTML(text);
    document.execCommand('insertHTML', false, cleanHTML);
});

Anyway, that is to say the tricky stuff with regard to detecting a list and wrapping it in a ul tag... GPT knows how to program that and probably all the other silliness with Word formatting, which may be helpful.

Remember it can simply be asked to convert it to jQuery style as well.

@jlahijani
Author

One other library that may be helpful is Summernote Cleaner, which is a third-party plugin for the Summernote rich text editor. I'm sure their cleaner is pretty advanced, although I haven't tested it. May be worth looking into:

https://github.com/DiemenDesign/summernote-cleaner

ryancramerdesign added a commit to processwire/processwire that referenced this issue Jan 26, 2024
…option to the Markup Toggle settings. Plus refactoring of the pasteFilter JS in attempt to fix processwire/processwire-issues#1866 which should improve pasting from MS Word.
@ryancramerdesign
Member

Thanks @jlahijani. That video was helpful. While I don't have MS Word to duplicate the issue, I was able to copy the Word markup out of your video and substitute it in pasteFilter to see how it would clean it up. I found that it cleaned it up reasonably well but left the conditional comments and <o:p> tags, and it didn't convert the bold or italic tags like you mentioned. I have made some updates which should fix all of that... at least they did in my testing here. Can you confirm that it also fixes it there?
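
To illustrate the kind of cleanup described above, here is a minimal sketch that works on the raw pasted string only. It is not the actual pasteFilter code: the function name `stripWordDebris` and the exact patterns are assumptions made for illustration, and real Word markup may need more cases.

```javascript
// Hypothetical helper, NOT the real pasteFilter implementation.
// Works purely on the raw HTML string, without touching the DOM.
function stripWordDebris(html) {
  return html
    // Remove HTML comments, which covers <!--[if ...]> ... <![endif]--> blocks
    .replace(/<!--[\s\S]*?-->/g, '')
    // Remove bare conditionals such as <![if !supportLists]> and <![endif]>
    .replace(/<!\[(?:if\b[^\]]*|endif)\]>/gi, '')
    // Remove <o:p> and </o:p> tags, keeping any text between them
    .replace(/<\/?o:p\b[^>]*>/gi, '')
    // Convert <b>/<i> to <strong>/<em>; the lookahead avoids <br>, <img>, etc.
    .replace(/<(\/?)b(?=[\s>])/gi, '<$1strong')
    .replace(/<(\/?)i(?=[\s>])/gi, '<$1em');
}

const sample =
  '<p><b>Bold</b> and <i>italic</i><o:p></o:p></p>' +
  '<!--[if gte mso 9]><xml>x</xml><![endif]--><br>';
console.log(stripWordDebris(sample));
```

The lookahead `(?=[\s>])` is what keeps `<br>` and `<img>` from being mistaken for `<b>` and `<i>`.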

@ryancramerdesign
Member

Btw, I don't think we can do anything with the Word ordered/unordered lists, as it's MS Word that's converting them to <p> elements, and without any begin/end list tags present, we can't very easily convert them to a proper ul/ol list. But the latest pasteFilter update leaves them as plain <p>List item</p> values, so it's a simple matter at that point to select which items should be in the list and then click the UL or OL icon to convert them when needed.

@ryancramerdesign
Member

Regarding the other conversion methods, those rely on having the markup in the DOM. In our case, we are operating on the raw HTML/text, as that's what TinyMCE gives us, plus it's probably not safe to place into the DOM at this stage. Once TinyMCE inserts it into the editor, we could always go back and manipulate as DOM elements, which is possible, but probably outside the scope of the pasteFilter.
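
For context, TinyMCE has a `paste_preprocess` option whose callback receives the pasted content as a string in `args.content` before it is inserted. Whether pasteFilter is wired up exactly this way is an assumption; the sketch below only shows what string-level filtering at that stage can look like, and `filterRawHtml` is a hypothetical stand-in rather than the real filter.

```javascript
// Hypothetical stand-in for a string-based filter; not the real pasteFilter.
function filterRawHtml(html) {
  // Drop the <meta charset> tag seen at the start of the pasted markup
  return html.replace(/<meta\b[^>]*>/gi, '');
}

// Options object that would be passed to tinymce.init() in a browser.
const editorOptions = {
  selector: 'textarea',
  // Called with the pasted HTML as a string in args.content,
  // before the content is parsed and inserted into the editor.
  paste_preprocess: (editor, args) => {
    args.content = filterRawHtml(args.content);
  },
};

// Simulate a paste without a browser: only the string is touched.
const args = { content: '<meta charset="utf-8"><p>Hi</p>' };
editorOptions.paste_preprocess(null, args);
console.log(args.content);
```

Because the callback only ever sees and returns a string, nothing untrusted is placed into the page's DOM at this stage.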

@jlahijani
Author

> Regarding the other conversion methods, those rely on having the markup in the DOM. In our case, we are operating on the raw HTML/text, as that's what TinyMCE gives us, plus it's probably not safe to place into the DOM at this stage. Once TinyMCE inserts it into the editor, we could always go back and manipulate as DOM elements, which is possible, but probably outside the scope of the pasteFilter.

That's a good point and one I didn't consider.

I will test the changes a bit further when time permits as well as Google Docs (and a little more Word). I will also provide the raw HTML that gets pasted so you don't have to rewrite that by hand.

@jlahijani
Author

I made two videos about Google Docs:
https://www.youtube.com/watch?v=qDbRsOYGvBk
https://www.youtube.com/watch?v=VO5SGquoXEc

Raw code video 1:

<meta charset="utf-8"><b id="docs-internal-guid-39578836-7fff-ffe8-df71-0199fecdd34e"><p dir="ltr"><span>This is </span><span>bold</span><span> text.</span></p><br /><p dir="ltr"><span>This is normal text but </span><span>this is italic</span><span>.</span></p><br /><p dir="ltr"><span>A line</span></p><p dir="ltr"><span>Another line without hitting enter twice.</span></p><br /><p dir="ltr"><span>What about </span><span>bold italic</span><span>?</span></p><h2 dir="ltr"><span>This is headline 2.</span></h2><br /><p dir="ltr"><span>This is a bullet list:</span></p><br /><ul><li dir="ltr" aria-level="1"><p dir="ltr" role="presentation"><span>one</span></p></li><li dir="ltr" aria-level="1"><p dir="ltr" role="presentation"><span>two is italic</span></p></li><li dir="ltr" aria-level="1"><p dir="ltr" role="presentation"><span>three</span></p></li></ul><br /><p dir="ltr"><span>Another line of text.</span></p></b>

Raw code video 2:

<meta charset="utf-8"><b id="docs-internal-guid-e372d8f2-7fff-6b68-3080-4c08a524fa8d"><p dir="ltr"><span>bla bla bla&nbsp;</span></p><br /><p dir="ltr"><span>this is a line of text, then the [enter] key is pressed</span></p><p dir="ltr"><span>here is the second line</span></p><br /><p dir="ltr"><span>this is a line of text, then the [shift+enter] keys are pressed</span><span><br /></span><span>here is the second line</span></p><br /><p dir="ltr"><span>bla bla bla</span></p></b>

@ryancramerdesign
Member

@jlahijani Thanks. Just looking at the first example to start. But here is the input markup from Google Docs. It's strange because it doesn't seem like there's any bold or italic retained in it, and instead the entire batch of markup is wrapped in a <b> tag. So it looks like any info about bold or italic was removed prior to pasteFilter even seeing it?

  1. Original input
<meta charset="utf-8">
<b id="docs-internal-guid-39578836-7fff-ffe8-df71-0199fecdd34e">
	<p dir="ltr"><span>This is </span><span>bold</span><span> text.</span></p>
	<br />
	<p dir="ltr"><span>This is normal text but </span><span>this is italic</span><span>.</span></p>
	<br />
	<p dir="ltr"><span>A line</span></p>
	<p dir="ltr"><span>Another line without hitting enter twice.</span></p>
	<br />
	<p dir="ltr"><span>What about </span><span>bold italic</span><span>?</span></p>
	<h2 dir="ltr"><span>This is headline 2.</span></h2>
	<br />
	<p dir="ltr"><span>This is a bullet list:</span></p>
	<br />
	<ul>
		<li dir="ltr" aria-level="1">
			<p dir="ltr" role="presentation"><span>one</span></p>
		</li>
		<li dir="ltr" aria-level="1">
			<p dir="ltr" role="presentation"><span>two is italic</span></p>
		</li>
		<li dir="ltr" aria-level="1">
			<p dir="ltr" role="presentation"><span>three</span></p>
		</li>
	</ul>
	<br />
	<p dir="ltr"><span>Another line of text.</span></p>
</b>
  2. Here it is after pasteFilter has been applied:
<strong>
	<p>This is bold text.</p>
	<br>
	<p>This is normal text but this is italic.</p>
	<br>
	<p>A line</p>
	<p>Another line without hitting enter twice.</p>
	<br>
	<p>What about bold italic?</p>
	<h2>This is headline 2.</h2>
	<br>
	<p>This is a bullet list:</p>
	<br>
	<ul>
		<li>
			<p>one</p>
		</li>
		<li>
			<p>two is italic</p>
		</li>
		<li>
			<p>three</p>
		</li>
	</ul>
	<br>
	<p>Another line of text.</p>
</strong>

And here it is after TinyMCE inserts it into the editor. Meaning, it's gone through TinyMCE's content filtering rules, which disallow things like block-level elements wrapped in inline elements. That's why the wrapping <strong> is gone, though TinyMCE still applied it inside the empty paragraphs:

<p>This is bold text.</p>
<p><strong>&nbsp;</strong></p>
<p>This is normal text but this is italic.</p>
<p><strong>&nbsp;</strong></p>
<p>A line</p>
<p>Another line without hitting enter twice.</p>
<p><strong>&nbsp;</strong></p>
<p>What about bold italic?</p>
<h2>This is headline 2.</h2>
<p><strong>&nbsp;</strong></p>
<p>This is a bullet list:</p>
<p><strong>&nbsp;</strong></p>
<ul>
	<li>
		<p>one</p>
	</li>
	<li>
		<p>two is italic</p>
	</li>
	<li>
		<p>three</p>
	</li>
</ul>
<p><strong>&nbsp;</strong></p>
<p>Another line of text.</p>

The part that we've got some control over is what converts the original input (1) to the pasteFilter output (2) above. But it looks to me like we might have a garbage-in-garbage-out scenario here, at least with regard to the bold and italic.
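
One thing a string-level filter could do with this input is drop the wrapping `<b id="docs-internal-guid-…">` element, so that no `<strong>` presumably remains for TinyMCE to apply to the empty paragraphs. The sketch below is a hypothetical helper, not part of pasteFilter, and it does not recover the bold or italic information that is already missing from the input.

```javascript
// Hypothetical helper: removes the <b id="docs-internal-guid-..."> element
// that wraps the whole Google Docs paste, keeping everything inside it.
function unwrapGoogleDocsWrapper(html) {
  const open = /<b\b[^>]*\bid\s*=\s*["']docs-internal-guid-[^"']*["'][^>]*>/i;
  const match = open.exec(html);
  if (!match) return html; // no wrapper found, leave the markup untouched

  const before = html.slice(0, match.index);
  const rest = html.slice(match.index + match[0].length);

  // Assumes the wrapper's closing tag is the last </b> in the fragment,
  // which holds for the two samples posted in this thread.
  const close = rest.toLowerCase().lastIndexOf('</b>');
  if (close === -1) return before + rest;
  return before + rest.slice(0, close) + rest.slice(close + '</b>'.length);
}

const pasted =
  '<meta charset="utf-8"><b id="docs-internal-guid-39578836">' +
  '<p dir="ltr"><span>A line</span></p><br /></b>';
console.log(unwrapGoogleDocsWrapper(pasted));
```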

I'll have a look at the second bit of code next.

@ryancramerdesign
Member

@jlahijani Here's the same data for example 2:

  1. Original input
<meta charset="utf-8">
<b id="docs-internal-guid-e372d8f2-7fff-6b68-3080-4c08a524fa8d">
  <p dir="ltr"><span>bla bla bla&nbsp;</span></p><br />
  <p dir="ltr"><span>this is a line of text, then the [enter] key is pressed</span></p>
  <p dir="ltr"><span>here is the second line</span></p><br />
  <p dir="ltr">
    <span>this is a line of text, then the [shift+enter] keys are pressed</span>
    <span><br /></span><span>here is the second line</span>
  </p><br />
  <p dir="ltr"><span>bla bla bla</span></p>
</b>
  2. After pasteFilter:
<strong>
  <p>bla bla bla </p><br>
  <p>this is a line of text, then the [enter] key is pressed</p>
  <p>here is the second line</p><br>
  <p>
    this is a line of text, then the [shift+enter] keys are pressed<br>
    here is the second line
  </p><br>
  <p>bla bla bla</p>
</strong>
  3. After TinyMCE applies its rules:
<p>bla bla bla</p>
<p><strong>&nbsp;</strong></p>
<p>this is a line of text, then the [enter] key is pressed</p>
<p>here is the second line</p>
<p><strong>&nbsp;</strong></p>
<p>
  this is a line of text, then the [shift+enter] keys are pressed<br>
  here is the second line
</p>
<p><strong>&nbsp;</strong></p>
<p>bla bla bla</p>

I'm thinking pasteFilter should replace </p><br> with </p>, which should hopefully prevent TinyMCE from inserting those empty paragraphs.
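
As a rough sketch of that replacement on the raw string (the function name `dropBreakAfterParagraph` is hypothetical, and the real pasteFilter change may differ), allowing for whitespace and the `<br />` spelling:

```javascript
// Hypothetical helper: removes a <br> that directly follows a closing </p>.
function dropBreakAfterParagraph(html) {
  // Matches </p>, optional whitespace, then <br>, <br/> or <br />
  return html.replace(/<\/p>\s*<br\s*\/?>/gi, '</p>');
}

const filtered = dropBreakAfterParagraph(
  '<p>bla bla bla</p><br>\n<p>first line<br>second line</p>\n<br />'
);
console.log(filtered);
```

A `<br>` inside a paragraph, such as the shift+enter line break in example 2, is left alone because it is not preceded by `</p>`.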
