Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

html-serializer doesn't work with nested blocks #1497

Closed
nghuuphuoc opened this issue Jan 2, 2018 · 15 comments
Closed

html-serializer doesn't work with nested blocks #1497

nghuuphuoc opened this issue Jan 2, 2018 · 15 comments

Comments

@nghuuphuoc
Copy link

Do you want to request a feature or report a bug?

A bug

What's the current behavior?

const BLOCK_TAGS = {
  blockquote: 'quote',
  p: 'paragraph',
  //div: 'div'
}

const rules = [
  {
    deserialize(el, next) {
      const type = BLOCK_TAGS[el.tagName.toLowerCase()]
      if (!type) return
      return {
        kind: 'block',
        type: type,
        nodes: next(el.childNodes)
      }
    }
  }
]

const pureHtml = '<blockquote><div>a text<blockquote>inner quote</blockquote></div></blockquote>'
const initialValue = new HtmlSerializer({ rules: rules }).deserialize(pureHtml);

It only renders a text element, and I couldn't see inner quote.
See https://jsfiddle.net/oj53q1n2/26/

What's the expected behavior?

We should see both text and quote.

@nghuuphuoc
Copy link
Author

nghuuphuoc commented Jan 2, 2018

If I add div to BLOCK_TAGS:

const pureHtml = '<blockquote><div>a text</div><blockquote>inner quote</blockquote></blockquote>'
const pureHtml = '<blockquote><div>a text<blockquote>inner quote</blockquote></div></blockquote>'

@zhujinxuan
Copy link
Contributor

My solution about 'div':

if-div
  return next(children) 

@nghuuphuoc
Copy link
Author

Do you mean the following approach?

const rules = [
  {
    deserialize(el, next) {
      const tag = el.tagName.toLowerCase()
      if (tag == 'div') {
      	return next(el.childNodes)
      }
      const type = BLOCK_TAGS[tag]
      if (!type) return
      return {
        kind: 'block',
        type: type,
        nodes: next(el.childNodes)
      }
    }
  }
]

@zhujinxuan
Copy link
Contributor

zhujinxuan commented Jan 3, 2018 via email

@bengotow
Copy link
Contributor

bengotow commented Jan 8, 2018

Hey folks—I've messed with the JSFiddle and created what I think is a minimal example of this problem (https://jsfiddle.net/oj53q1n2/29/).

Input HTML:
<div>aa<div>missing</div></div>

Output Value:

{
  "object": "value",
  "document": {
    "object": "document",
    "data": {},
    "nodes": [
      {
        "object": "block",
        "type": "paragraph",
        "isVoid": false,
        "data": {},
        "nodes": [
          {
            "object": "block",
            "type": "div",
            "isVoid": false,
            "data": {},
            "nodes": [
              {
                "object": "text",
                "leaves": [
                  {
                    "object": "leaf",
                    "text": "aa",
                    "marks": []
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

I'm not sure why the second text (missing) isn't present in the result, but I can't see why this shouldn't be valid?

@bengotow
Copy link
Contributor

bengotow commented Jan 8, 2018

After a bit more digging, the problem appears to be this constraint in the core Slate schema (which is enforced by Value.fromJSON inside the HTML deserializer):

/**
 * Only allow block nodes or inline and text nodes in blocks.
 *
 * @type {Object}
 */

{
  validateNode: function validateNode(node) {
    if (node.object != 'block') return;
    var first = node.nodes.first();
    if (!first) return;
    var objects = first.object == 'block' ? ['block'] : ['inline', 'text'];
    var invalids = node.nodes.filter(function (n) {
      return !objects.includes(n.object);
    });
    if (!invalids.size) return;

    return function (change) {
      invalids.forEach(function (child) {
        change.removeNodeByKey(child.key, { normalize: false });
      });
    };
  }
},

If the input HTML contains <div> tags and the Serializer rules convert those div to blocks rather than ignoring them, it's easy to create a structure that will be ripped apart by the schema validation after parsing, because Slate does not allow blocks to have both block children and text / inline children and this is a very common <div> case.

My solution is here: https://gist.github.com/bengotow/f5408e9cb543f22409d033df58e34579. Before running the HTML deserializer, I traverse the DOM tree and ensure that divs, blockquotes, and other nodes converted to Slate blocks contain either text + inline children OR block children, wrapping children into blocks as necessary. Curious whether this would be welcomed as default behavior in some way (cc @ianstormtaylor).

@ianstormtaylor
Copy link
Owner

Hey @bengotow good digging! I'd be open to a way to make the default behavior more helpful. I'm not sure if there's a way to do it that doesn't get too restrictive though, but if you have ideas I'd love to hear them.

I'd also be open to a PR that adding warning logging for these kinds of "shouldn't really be encountered often" normalization rules, so that it's more obvious what's going on.

@Kornil
Copy link
Contributor

Kornil commented Mar 11, 2018

I feel this issue is really important and should be addressed by slatejs:
At the moment pasting a nested list does not return and error, just copy the non-nested content, users have no way to identify the omission if not manually going back to check the pasted content.

In light of this I'd suggest to use part of the gist by @bengotow (which I adapted already to a private project to fix the same issue) to make blocks that contains BOTH text and other blocks, readable by slate by adding a div (or p) node around the text.

@bengotow
Copy link
Contributor

bengotow commented Mar 11, 2018

Hey @Kornil! After a bit more polish, I actually ended up switching to an approach that adds wrapping blocks, etc. to the resulting Slate graph before passing it through the normalizer, rather than changing the HTML before converting it. I think that's preferable because it works with any HTML <> Slate mapping rather than relying on an assumed set of conversions.

You can find the latest code I'm using here: https://github.com/Foundry376/Mailspring/blob/master/app/src/components/composer-editor/conversion.jsx#L172. I also wrote code to join adjacent text nodes rather then letting Slate do it during normalization, which sped things up a LOT because it's a simple transform and Slate "assumes the worst" when it runs a normalization step (and spends time re-finding the nodes, etc.)

@pvande
Copy link

pvande commented Apr 11, 2018

I'd be curious to hear more about why that validation rule exists at all. While I agree that there's a conceptual correctness to it, there's no such restriction in HTML. In addition to the example provided by @nghuuphuoc, the following is valid HTML that is "unrepresentable" in Slate.

<ul>
  <li>
    Text Content
    <ul>
      <li>Nested Text Content</li>
    </ul>
  </li>
</ul>

The implicit behavior of silently destroying content feels like it needs a strong justification and prominent documentation.

@crisward
Copy link
Contributor

crisward commented Oct 3, 2018

For what it's worth, I'm also having issues with mixed inline / block level elements in slate with

<figure>
  <img src="" />
  <figcaption>Some Caption</figcaption>
</figure>

This is pretty standard html. I can get around this by wrapping the image in a div, but that's pretty nasty.

@ianstormtaylor
Copy link
Owner

Hey folks, this is by design. Slate does not allow you to have mixed inline and block level content in the same node. A block can either contain all block nodes, or it can contain inline and text nodes. This is enforced in the core editor-level schema.

The reason for this is that it makes implementing editing behaviors much simpler. It allows you to avoid a whole class of issues and questions that crop up related to intermingling. I realize there are no restrictions on HTML, but that's also what makes the native contenteditable behaviors so hard to standardize and predict.

If someone wants to open a pull request with a specific improvement to the docs for this, I'd be happy to merge it. I'm going to close this otherwise, since it's not something that is a bug that we can address.

@q1998763
Copy link

@crisward I have the same problem. Have you found a solution?

@crisward
Copy link
Contributor

crisward commented Dec 11, 2018

Bit of a hack, when I convert the source to slate I wrap inlines in blocks. The remove them again on save eg

<figure>
  <div class="inlinewrapper">
    <img src="" />
  </div>
  <figcaption>Some Caption</figcaption>
</figure>

Not ideal, but easy enough to remove.

@crisward
Copy link
Contributor

crisward commented Dec 11, 2018

Rough code here - https://gist.github.com/crisward/b61bd926d44c1e58d05f0c0c472262a4
There is a bit of sanitisation code mixed in with that method, I was using it when pulling in content from our older cms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants