String#unescapeHTML() decodes entities after stripping tags, reintroducing markup #371

@knogineer

Description

String#unescapeHTML() calls stripTags() first and only then decodes entities. Entity-encoded markup is not a real tag when stripTags() runs, so it survives stripping, and the decode step then turns it back into live markup. Any code that assumes the output is tag-free is therefore wrong.

Current implementation (around line 439 of src/prototype/lang/string.js):

```javascript
function unescapeHTML() {
  return this.stripTags().replace(/&lt;/g,'<').replace(/&gt;/g,'>').replace(/&amp;/g,'&');
}
```

Reproduction:

```javascript
'&lt;img src=x onerror=alert(1)&gt;'.unescapeHTML();
// stripTags() leaves the entity text alone (there is no real tag yet),
// then the decode step produces a live tag:
// => '<img src=x onerror=alert(1)>'
```
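For anyone who wants to check the ordering problem without loading Prototype, here is a minimal standalone sketch. `stripTagsSketch` and `unescapeStripFirst` are hypothetical names, and the tag pattern is a simplified stand-in for the library's stripTags(), not its actual regex. Only the strip-then-decode order mirrors the implementation quoted above.

```javascript
// Simplified stand-in for Prototype's stripTags(); this pattern is an
// assumption for illustration, not the library's real regex.
function stripTagsSketch(text) {
  return text.replace(/<\/?\w+[^>]*>/g, '');
}

// Same order as the implementation quoted in this issue:
// strip tags first, then decode the three entities.
function unescapeStripFirst(text) {
  return stripTagsSketch(text)
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&');
}

const payload = '&lt;img src=x onerror=alert(1)&gt;';
const result = unescapeStripFirst(payload);

// Check whether anything tag-shaped is present in the output.
const stillHasTag = /<\w+[^>]*>/.test(result);

console.log(result);      // '<img src=x onerror=alert(1)>'
console.log(stillHasTag); // true
```

Real tags in the input are removed as expected (`stripTagsSketch('<b>hi</b>')` gives `'hi'`); only the entity-encoded form slips through, which is the case described above.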

If a developer relies on unescapeHTML() to produce safe, tag-free text before inserting it into the page, the decode step reintroduces executable markup, which is a path to XSS.

Suggested fix: decode entities first and then strip, or use a single normalization pass that does not leave decoded markup behind. It would also help to document that the result is not safe to insert into the DOM as HTML.
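A rough sketch of the decode-first ordering, to make the suggestion concrete. The function names are hypothetical and the tag pattern is again a simplified stand-in for stripTags(), so treat this as an illustration of the ordering rather than a proposed patch.

```javascript
// Simplified stand-in for Prototype's stripTags(); the pattern is an
// assumption for illustration only.
function stripTagsSketch(text) {
  return text.replace(/<\/?\w+[^>]*>/g, '');
}

// Decode the same three entities the current implementation handles.
// &amp; is decoded last so that '&amp;lt;' becomes '&lt;', not '<'.
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&');
}

// Proposed order: decode first, then strip whatever tags the decode produced.
function unescapeThenStrip(text) {
  return stripTagsSketch(decodeEntities(text));
}

const payloadOut = unescapeThenStrip('&lt;img src=x onerror=alert(1)&gt;');
const plainOut = unescapeThenStrip('Tom &amp; Jerry');
const doubleOut = unescapeThenStrip('&amp;lt;b&amp;gt;');

console.log(JSON.stringify(payloadOut)); // ""
console.log(plainOut);                   // Tom & Jerry
console.log(doubleOut);                  // &lt;b&gt;
```

Even with this ordering, regex-based stripping is not a sanitizer. For example, an unterminated `<img src=x onerror=alert(1)` has no closing `>` and would pass through this pattern. The result should still be inserted as text (for instance via `textContent`) rather than as HTML, in line with the documentation note suggested above.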

Refs: CWE-79, CWE-116, OWASP ASVS V5.3.3.
