innerText implementation #1245

Open
vsemozhetbyt opened this Issue Sep 25, 2015 · 19 comments

@vsemozhetbyt
Contributor

vsemozhetbyt commented Sep 25, 2015

jsdom is a great tool for web scraping. However, textContent is a very inconvenient way to get readable text for html2text conversion.

There is a wonderful article about how useful the neglected innerText is in many cases:

http://perfectionkills.com/the-poor-misunderstood-innerText/

The author suggests getSelection().toString() as a very slow workaround, but getSelection is not implemented in jsdom yet.

Could you consider implementing innerText in jsdom? The author has explored it thoroughly and has even added a simple spec at the end.

@vsemozhetbyt

Contributor Author

vsemozhetbyt commented Sep 25, 2015

And what a pity that the rangy Selection and innerText library is not compatible with jsdom: timdown/rangy#348

@domenic

Member

domenic commented Sep 25, 2015

So, innerText is not standard, and not implemented in at least one major engine (Firefox). Without a standard, I don't think we should implement it.

@Sebmaster

Member

Sebmaster commented Oct 9, 2015

Looks like there's some movement in this whole thing with a draft spec here. See also all the references. There are no issues on the repo though, so I wonder how complete it already is / how quick progress will be.

@vsemozhetbyt

Contributor Author

vsemozhetbyt commented Jan 25, 2016

Firefox has implemented it: https://bugzilla.mozilla.org/show_bug.cgi?id=264412

WHATWG seems to approve: whatwg/compat#5 (comment)

@inikulin

Contributor

inikulin commented Jan 25, 2016

From the spec, it seems like we can't implement innerText properly without basic layout support.

@domenic

Member

domenic commented Jan 25, 2016

Yeah, this is not really going to be implementable in jsdom anyway, without a lot of infrastructure work... nobody get their hopes up :(.

@vsemozhetbyt

Contributor Author

vsemozhetbyt commented Jan 30, 2016

As to layout support requirement: rocallahan/innerText-spec#2

r4j4h added a commit to r4j4h/jasmine-phantom-utils that referenced this issue Jun 17, 2016

Added istanbul/jsdom for code coverage and leaner tests. Changed `innerText` usage to `textContent` based on [this discussion](jsdom/jsdom#1245). Added tests for many evaluators.

@domenic domenic added the feature label Jul 2, 2016

@vsemozhetbyt

Contributor Author

vsemozhetbyt commented Aug 27, 2016

Is there any plan to implement it now that WHATWG has adopted it?

@domenic

Member

domenic commented Aug 27, 2016

Yeah... Although the spec requires a lot of stuff jsdom doesn't have, around CSS boxes :(. Not sure what to do.

@vsemozhetbyt

Contributor Author

vsemozhetbyt commented Aug 27, 2016

Is there any lib for this to plug along with jsdom?

@snuggs

Contributor

snuggs commented Aug 29, 2016

@domenic care to drop some knowledge on why this is such an infrastructure overhaul? We thought the 800lb gorilla in the room would leave low-key. But it looks like it's not going anywhere. As you know, I have been wrapping my head around the innards of jsdom. Where would be a great place in the repo for a jsdom newb to start reviewing code?

Thanks in advance 🙏 /cc @vsemozhetbyt

@dmethvin

Contributor

dmethvin commented Aug 29, 2016

The primary issue is the fact that innerText leans on the layout engine for guidance, and jsdom has no layout engine. See https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute and
http://perfectionkills.com/the-poor-misunderstood-innerText/ . From the second link:

Notice how innerText almost precisely represents exactly how text appears on the page. textContent, on the other hand, does something strange — it ignores newlines created by
<br> and around styled-as-block elements (<span> in this case). But it preserves spaces as they are defined in the markup.
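
To make the quoted textContent behaviour concrete, here is a minimal sketch. It assumes plain objects as stand-ins for DOM nodes (so it runs without jsdom), and the helper names `plainTextContent`, `text` and `el` are hypothetical, not part of jsdom or the DOM. It only mimics the core of textContent, which concatenates the data of descendant text nodes in tree order:

```javascript
// Node-type constants with the values the DOM standard assigns.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: mimics the core of textContent by concatenating the
// data of every descendant text node in tree order. Elements add nothing.
function plainTextContent(node) {
  if (node.nodeType === TEXT_NODE) {
    return node.data;
  }
  let out = '';
  for (const child of node.childNodes || []) {
    out += plainTextContent(child);
  }
  return out;
}

// Tiny builders for the stand-in tree (illustration only).
const text = (data) => ({ nodeType: TEXT_NODE, data });
const el = (tagName, ...childNodes) => ({ nodeType: ELEMENT_NODE, tagName, childNodes });

// Stand-in for: <div>foo<br>bar<span style="display:block">baz</span>qux</div>
const sample = el('DIV', text('foo'), el('BR'), text('bar'), el('SPAN', text('baz')), text('qux'));

console.log(JSON.stringify(plainTextContent(sample))); // prints "foobarbazqux"
```

Neither the `<br>` nor the block-styled `<span>` contributes a line break here, whereas a rendering browser's innerText would be expected to break the line at those points. Knowing where those breaks fall is the part that needs layout information.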

@vsemozhetbyt

Contributor Author

vsemozhetbyt commented Apr 26, 2017

Still out of scope and no workaround?

@coreh

coreh commented May 24, 2017

Apparently the spec says:

If this element is not being rendered, or if the user agent is a non-CSS user agent, [emphasis added] then return the same value as the textContent IDL attribute on this element.

I think a workaround would be then to simply return textContent.

@domenic

Member

domenic commented May 25, 2017

We implement enough CSS that I don't think that applies. We just don't implement the layout parts...

@Suzii

Suzii commented Jan 24, 2018

Hi guys, any news on this one?

@Bnaya

Bnaya commented Jan 25, 2018

Just use headless chrome :)

@Janpot

Janpot commented Aug 5, 2018

@domenic from that spec that @coreh mentioned:
https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute

If this element is not being rendered, or if the user agent is a non-CSS user agent, then return the same value as the textContent IDL attribute on this element.

https://html.spec.whatwg.org/multipage/rendering.html#being-rendered

An element is being rendered if it has any associated CSS layout boxes, SVG layout boxes, or some equivalent in other styling languages.

If jsdom doesn't implement the layout parts, doesn't that mean "not being rendered" applies?

@bennypowers

bennypowers commented Dec 10, 2018

This message is for anyone reaching this GitHub thread who just wants a way to get their tests passing without changing their function implementations.

copypasta for the top of your test files:

const { JSDOM } = require('jsdom');

// Expose JSDOM Element constructor
global.Element = (new JSDOM()).window.Element;
// 'Implement' innerText in JSDOM: https://github.com/jsdom/jsdom/issues/1245
// Getter only: reads fall back to textContent, so no layout-aware line breaks.
Object.defineProperty(global.Element.prototype, 'innerText', {
  get() {
    return this.textContent;
  },
});

Naturally, caveats from the above discussion apply.
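
If plain textContent is too lossy for your tests, a slightly closer approximation is possible without layout. What follows is a rough sketch, not the spec algorithm and not anything jsdom provides: `approximateInnerText`, `BLOCK_TAGS` and `SKIPPED_TAGS` are hypothetical names, the tag lists are an assumption standing in for real CSS `display` values, and stylesheets, `display: none` and `white-space` are ignored entirely. It only reads `nodeType`, `tagName`, `data` and `childNodes`, so it is shown here on plain stand-in objects:

```javascript
const NODE_TYPE = { ELEMENT: 1, TEXT: 3 };

// Assumption: tags treated as block-level because no layout is available.
// The list is illustrative, not exhaustive.
const BLOCK_TAGS = new Set(['DIV', 'P', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6',
  'UL', 'OL', 'LI', 'PRE', 'BLOCKQUOTE', 'SECTION', 'ARTICLE', 'TABLE', 'TR']);
// Assumption: tags whose text is never shown to the user.
const SKIPPED_TAGS = new Set(['SCRIPT', 'STYLE', 'TEMPLATE', 'NOSCRIPT']);

// Walk the tree, pushing text and line-break markers into `chunks`.
function collectChunks(node, chunks) {
  if (node.nodeType === NODE_TYPE.TEXT) {
    // Collapse whitespace runs the way default CSS white-space handling would.
    chunks.push(node.data.replace(/\s+/g, ' '));
    return;
  }
  if (node.nodeType !== NODE_TYPE.ELEMENT) return;
  const tag = node.tagName.toUpperCase();
  if (SKIPPED_TAGS.has(tag)) return;
  if (tag === 'BR') {
    chunks.push('\n');
    return;
  }
  const isBlock = BLOCK_TAGS.has(tag);
  if (isBlock) chunks.push('\n');
  for (const child of node.childNodes) collectChunks(child, chunks);
  if (isBlock) chunks.push('\n');
}

// Hypothetical helper: heuristic, layout-free stand-in for innerText.
function approximateInnerText(element) {
  const chunks = [];
  collectChunks(element, chunks);
  return chunks.join('')
    .replace(/ *\n */g, '\n') // drop spaces next to line breaks
    .replace(/\n{2,}/g, '\n') // merge break runs (simplification: also merges <br><br>)
    .trim();
}

// Stand-in for: <div><p>Hello <b>world</b></p><p>second<br>line</p><script>x()</script></div>
const t = (data) => ({ nodeType: NODE_TYPE.TEXT, data });
const h = (tagName, ...childNodes) => ({ nodeType: NODE_TYPE.ELEMENT, tagName, childNodes });
const tree = h('DIV',
  h('P', t('Hello '), h('B', t('world'))),
  h('P', t('second'), h('BR'), t('line')),
  h('SCRIPT', t('x()')));

console.log(JSON.stringify(approximateInnerText(tree))); // prints "Hello world\nsecond\nline"
```

To try it in the copypasta above, the getter body could return `approximateInnerText(this)` instead of `this.textContent`. Expect differences from real browsers, for example a browser would normally leave a blank line between the two paragraphs.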

yoshihara added a commit to yoshihara/yyyymmddesa that referenced this issue Dec 15, 2018
