Implement HTML escaping for arbitrary string input #31

Open
guidedways opened this issue May 10, 2018 · 6 comments


@guidedways

guidedways commented May 10, 2018

This looks like a powerful library for navigating HTML nodes. However, what would be the simplest way to obtain cleaned-up plain text from HTML input? I'd like it to preserve any 'invalid' non-HTML tags such as John Do <john@do.com> rather than trying to parse them, as NSAttributedString's initWithHTML does.

@guidedways
Author

Okay, the following seems to fail:

let element:HTMLElement = HTMLElement(tagName: "div")
element.innerHTML = "This is an <b>email</b>: John Do <john@do.com>"
print("\(element.textContent)")

outputs: This is an email: John Do

What do I have to do to make this work so that it ignores anything that doesn't look like HTML?

@iabudiab
Owner

@guidedways Hey there. Let me see if I understood you correctly.

You want to pass in an HTML string and have all HTML tags stripped, so that "This is an <b>email</b>: John Do <john@do.com>" returns "This is an email: John Do <john@do.com>"?

If so, the easiest way to do it is to escape all HTML reserved characters so that they are not interpreted as HTML. In your case:

let element: HTMLElement = HTMLElement(tagName: "div")
element.innerHTML = "This is an <b>email</b>: John Do &lt;john@do.com&gt;"
print("\(element.textContent)")
// This is an email: John Do <john@do.com>
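
For anyone who wants to do that escaping in code rather than by hand, here is a minimal sketch. Note that `escapeHTML` is a hypothetical helper written for illustration; it is not part of HTMLKit, and it escapes every reserved character, so it only suits input that contains no real markup.

```swift
import Foundation

/// Hypothetical helper (not HTMLKit API): escapes the HTML reserved
/// characters so a parser treats them as text rather than markup.
func escapeHTML(_ input: String) -> String {
    var result = ""
    for character in input {
        switch character {
        case "&": result += "&amp;"
        case "<": result += "&lt;"
        case ">": result += "&gt;"
        case "\"": result += "&quot;"
        case "'": result += "&#39;"
        default: result.append(character)
        }
    }
    return result
}

print(escapeHTML("John Do <john@do.com>"))
// John Do &lt;john@do.com&gt;
```

The escaped string can then be concatenated with trusted markup before it is assigned to innerHTML.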

Some Details

innerHTML in HTMLKit behaves as it would in a browser: it sets the HTML content of an element to the string that is passed. The string is interpreted as an HTML fragment and parsed with the element as its context.

What does that mean? Well, your input gets parsed into this DOM:

<div>This is an <b>email</b>: John Do <john@do.com></john@do.com></div>

Take a look here for more info: MDN Element.innerHTML

Does this answer your question? Do you have any follow-up questions?

@guidedways
Author

Yes, that is the output I'm after, but I am not in control of the string received from the user. It could contain anything, such as <some strange non-html tag>. I need the library to handle this for me by escaping < as &lt; wherever it does not start a real tag. Can HTMLKit find and escape non-HTML 'tags' for me?

@guidedways
Author

I should explain. I'm receiving input directly from the user as notes. The notes could be actual HTML, or partial or invalid HTML. There's no way to tell, since users are free to type in whatever they wish. What I need is to parse the HTML and extract a plain-text version of whatever they entered, while retaining any odd entries, links, etc. that were not entered as HTML.

@iabudiab
Owner

@guidedways I see. Currently, HTMLKit does not provide this functionality. I'll see if I can implement this in the next couple of days. Will let you know as soon as I have something.

I'll rename the issue then and mark it as a feature request.
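
In the meantime, a rough pre-processing workaround might look like the sketch below. It is an assumption-laden heuristic, not HTMLKit API: `knownTags` and `escapeUnknownTags` are hypothetical names, the allowlist is deliberately tiny, and the scan does not handle comments, doctypes, or a ">" inside an attribute value. It escapes every "<" that does not begin a tag from the allowlist and leaves recognised tags untouched.

```swift
import Foundation

// Assumption: a small illustrative allowlist; extend it to the tags you expect.
let knownTags: Set<String> = ["a", "b", "i", "u", "p", "br", "div", "span",
                              "em", "strong", "ul", "ol", "li"]

/// Hypothetical pre-processing step (not HTMLKit API): escapes every "<"
/// that does not start an opening or closing tag listed in `knownTags`.
func escapeUnknownTags(_ input: String) -> String {
    let chars = Array(input)
    var result = ""
    var i = 0
    while i < chars.count {
        guard chars[i] == "<" else {
            result.append(chars[i])
            i += 1
            continue
        }
        // Read an optional "/" followed by an ASCII alphanumeric tag name.
        var j = i + 1
        if j < chars.count, chars[j] == "/" { j += 1 }
        let nameStart = j
        while j < chars.count, chars[j].isASCII,
              chars[j].isLetter || chars[j].isNumber {
            j += 1
        }
        let name = String(chars[nameStart..<j]).lowercased()
        // A plausible tag name must be followed by whitespace, "/" or ">".
        let terminated = j < chars.count
            && (chars[j] == ">" || chars[j] == "/" || chars[j].isWhitespace)
        if knownTags.contains(name), terminated,
           let close = chars[j...].firstIndex(of: ">") {
            result.append(contentsOf: chars[i...close]) // keep the recognised tag
            i = close + 1
        } else {
            result += "&lt;" // treat this "<" as literal text
            i += 1
        }
    }
    return result
}

print(escapeUnknownTags("This is an <b>email</b>: John Do <john@do.com>"))
// This is an <b>email</b>: John Do &lt;john@do.com>
```

Assigning the result to innerHTML and then reading textContent should give the plain text with the address intact, although I have not verified that against every kind of input.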

@iabudiab changed the title from "Plain text?" to "Implement HTML escaping for arbitrary string input" on May 10, 2018

@guidedways
Author

Thank you, that would be extremely helpful!
