Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getAggregateText function sticks together text between different nodes #29

Closed
ilya-khorev opened this issue Mar 24, 2015 · 21 comments
Closed
Assignees

Comments

@ilya-khorev
Copy link

Hi

Please, look at my test case
https://jsfiddle.net/weirrrdy/xLcfnrpk/

screenshot_1

I'd like to match the exact word "Highlighted", that's why there is a special metacharacter "\b" wrapping the regex expression. Semantically it doesn't necessary to put '\b' in this case but I'm using it in my application and it's required for the test case.

The expected behaviour here would be selecting both 'highlighted' words, however only the second phrase has matched.

The problem is the getAggregateText function has combined text of first two paragraphs into the following text: "HighlightedPhrase". Therefore the current regexp expression has skipped the word "Highlighted".

I would really appreciate if you fix this.

@padolsey padolsey self-assigned this Apr 20, 2015
@padolsey
Copy link
Owner

Sorry for the massive delay on this. Am investigating now.

@ethantw
Copy link
Contributor

ethantw commented Apr 20, 2015

A paragraph can be styled presentationally as an inline element in CSS. Likewise, an emphasis text run can be styled as a block element too.

Two adjacent ps should be handled just like two adjacent ems by default. Maybe it's better to add an additional option (element/selector-wise) for such cases.

@padolsey
Copy link
Owner

This is a tough one...

I'm happy to make it configurable, but it seems like the default behaviour should be what people typically expect, and people probably don't typically expect <p> elements to be treated as inline. But I'm very wary of making this assumption because it could lead people to a lot of undesirable behaviour.

@padolsey
Copy link
Owner

So, yeh, defaulting to "everything is inline" makes sense. Just trying to figure out how to make this easily configurable and obvious.

@padolsey
Copy link
Owner

The alternative is to assume everything is a block-level element by default, and then provide the opportunity to configure inline elements.

@ethantw
Copy link
Contributor

ethantw commented Apr 20, 2015

The alternative is to assume everything is a block-level element by default, and then provide the opportunity to configure inline elements.

Do you mean with the boundary matcher \b only?

@padolsey
Copy link
Owner

Ok, so, this is what I've come up with: (Please let me know what you both think)


PROPOSAL

Everything stays the same as it currently is, by default. The findAndReplaceDOMText function will assume that it can bleed matches through ALL elements.

If someone wants to explicitly disallow this, e.g. in the case of block-level <p> elements, then they can set a custom boundary function that returns true in the case that either of its arguments are paragraphs (the boundary function will always be passed two elements so you can decide whether to create a boundary between them):

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  boundary: function(a, b) {
    return a.nodeName.toLowerCase() === 'p' || b.nodeName.toLowerCase() === 'p';
  },
  wrap: 'em'
});

You can also set boundary to true which will mean that NO matches bleed through elements, e.g. a match must exist in contiguous text nodes.

Additionally there will be an easy way to set ALL block-level borders as boundaries:

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  boundary: findAndReplaceDOMText.BLOCK_LEVEL_ELEMENTS,
  wrap: 'em'
});

(I imagine this will be the most common usage)

@ethantw
Copy link
Contributor

ethantw commented Apr 20, 2015

Would you care to use Element.prototype.matches method instead of the fairly long nodeName comparison?

https://github.com/ethantw/fibre.js/blob/6857deac21/dist/fibre.js#L18-28

@padolsey
Copy link
Owner

Oo, yep, didn't realize that had such good support. Still a shame about IE8.

@ilya-khorev
Copy link
Author

Thanks for the answer.

The second option is the best one, at least to me. That's what I'm looking for, if I understood you correctly.
boundary: findAndReplaceDOMText.BLOCK_LEVEL_ELEMENTS,

padolsey added a commit that referenced this issue Apr 22, 2015
their own matching contexts.

E.g. useful with block-level elements like P and DIV.
CC #29
@padolsey
Copy link
Owner

Some progress: d41ae22 introduces something similar to my proposal above. Instead of 'boundaries' I've chosen the term 'contexts' and specifically a new option: forceContext which accepts either a boolean or a function that returns a boolean. With this concept you can prescribe that any specific element creates a new matching context (so that no matches bleed through the element in either direction).

So, for the block-level use-case, you can do:

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  forceContext: findAndReplaceDOMText.BLOCK_LEVEL_MATCH,
  wrap: 'em'
});

This means that any block-level elements will create new matching contexts. The findAndReplaceDOMText.BLOCK_LEVEL_MATCH function returns true if passed a traditionally block-level element.

You can set forceContext to your own function too, and even use Element.matches to figure out which elements to force a new context on.


If this looks good to you, I'll merge into master and push a new release. (Shall timeout in 48h and assume we're all happy to go ahead with it if I don't hear anything).

@weirdy: Here's your JSFiddle with the change added: https://jsfiddle.net/8dcqgcoa/1/

@ilya-khorev
Copy link
Author

Awesome! Thank you, good work.

@ethantw
Copy link
Contributor

ethantw commented Apr 22, 2015

Great! Thanks for the significant fix. I have two questions though.

  • Even without the \b matcher they'll still be considered as another context?
  • Also, I believe that br, details, dt, img, li, summary, tbody, td, th, thead, tr (, etc) are accidentally missed out?

The br and img (like hr) are necessary for cases like,

Highlight<img src="./path"><span>xxx</span>

https://developer.mozilla.org/en/docs/Web/HTML/Element

@padolsey
Copy link
Owner

Ah, good point; I'll review the list of block elements. Actually, I think what we want is not a list of block elements, per se, but instead a list of all elements that are not inline textual elements (including <img> and <br> and <hr> etc.). I'll work on this.

@ethantw
Copy link
Contributor

ethantw commented Apr 22, 2015

Agree. 👍

@padolsey
Copy link
Owner

Until I find a better name for this it's gonna be called "NON_INLINE_PROSE":

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  forceContext: findAndReplaceDOMText.NON_INLINE_PROSE,
  wrap: 'em'
});

This will force a context on all of the following elements:

NON_PROSE_ELEMENTS = {
    // Block Elements
    address:1, article:1, aside:1, blockquote:1, canvas:1, dd:1, div:1,
    dl:1, fieldset:1, figcaption:1, figure:1, footer:1, form:1, h1:1, h2:1, h3:1,
    h4:1, h5:1, h6:1, header:1, hgroup:1, hr:1, main:1, nav:1, noscript:1, ol:1,
    output:1, p:1, pre:1, section:1, ul:1,
    // Other misc. elements that are not part of continuous inline prose:
    br:1, li: 1, summary: 1, dt:1, details:1,
    // Media / Source elements:
    script:1, style:1, img:1, video:1, audio:1, canvas:1, svg:1, map:1, object:1,
    // Input elements
    input:1, textarea:1, select:1, option:1, optgroup: 1, button:1,
    // Table related elements:
    table:1, tbody:1, thead:1, th:1, tr:1, td:1, caption:1, col:1, tfoot:1, colgroup:1
};

It's probably missing a few (let me know if you notice any). Thanks!

PR happening in #30 (shall merge soon I hope). I also need to add a chunk to the readme to explain what on earth this is all about :P

@ethantw
Copy link
Contributor

ethantw commented Apr 25, 2015

Until I find a better name for this it's gonna be called "NON_INLINE_PROSE":

Awesome, I think that's comprehensible enough. 👍

It's probably missing a few (let me know if you notice any).

I would like to suggest we add rp, rt and rtc, but exclude ruby and rb. The rt element is used to wrap the annotation for its corresponding inline text run, while rp is used to provide fallback. Their box display attribute would not be inline as well. For example,

He invented <ruby>WWW<rp>(<rt>world wide web<rp>)</ruby> that helps people connect to one another.

2015-04-26 00 58 46

Well, maybe they belong to those who need ignoring instead?

As for ruby and rb, they are both inline and part of the main content. Would make more sense for them to remain outside of the list.

And what do you think about the elements designed for web component such as template?

@padolsey
Copy link
Owner

Cool, I've added rp, rt and rtc.

Right now there's no easy way to filter out non-prose elements completely. So, even though they can now have "forced contexts" via NON_INLINE_PROSE, things like <style> and <script> are, by default, still searched.

So... I think it'd be worth adding a "prose" preset, like:

findAndReplaceDOMText(..., {
  preset: 'prose'
});

And it could do the following:

  • Only searches textual elements (not STYLE, SCRIPT, OBJECT, etc.)
    • I.e. they're completely avoided/skipped
  • Forces contexts using the NON_INLINE_PROSE set of elements

It would also avoid <template>.

Seems like this is the most sensible/common usage of the lib, and it wouldn't be a breaking change since it's a brand new option. Thoughts?

@ilya-khorev
Copy link
Author

Excellent!

I thought styles, scripts, etc are already skipped by your function. If not, it would be a good option.
It's worth making this as a default behavior.

@ethantw
Copy link
Contributor

ethantw commented Apr 26, 2015

Seems like this is the most sensible/common usage of the lib, and it wouldn't be a breaking change since it's a brand new option. Thoughts?

Agree. 🌈

@padolsey
Copy link
Owner

Ok, all shipped (0.4.3). 🐑

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants