Russian text support #29

Closed
ihor-shubin opened this issue Jun 5, 2015 · 6 comments

@ihor-shubin

I have a problem with Russian text:

Code:

[Test]
public void TestRussianText()
{
    // Arrange
    var s = new HtmlSanitizer();

    // Act
    var htmlFragment = "Тест";
    var actual = s.Sanitize(htmlFragment);

    // Assert
    var expected = htmlFragment;
    Assert.That(actual, Is.EqualTo(expected).IgnoreCase);
}

Test result:

 Expected string length 4 but was 28. Strings differ at index 0.
  Expected: "Тест", ignoring case
  But was:  "&#1058;&#1077;&#1089;&#1090;"
  -----------^
@mganss
Owner

mganss commented Jun 5, 2015

This is because the default output formatter encodes any character with a Unicode code point greater than 160 as &#nnnn;. If you don't want that, use a custom output formatter, e.g.

[Test]
public void TestRussianText()
{
    // Arrange
    var s = new HtmlSanitizer();

    // Act
    var htmlFragment = "Тест";
    var outputFormatter = new CsQuery.Output.FormatDefault(DomRenderingOptions.RemoveComments | DomRenderingOptions.QuoteAllAttributes, HtmlEncoders.Minimum);
    var actual = s.Sanitize(htmlFragment, "", outputFormatter);

    // Assert
    var expected = htmlFragment;
    Assert.That(actual, Is.EqualTo(expected).IgnoreCase);
}
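
To make the length mismatch in the failing test concrete, here is a minimal sketch of the encoding rule described above. `EncodeAbove160` is a hypothetical helper written for illustration, not the CsQuery implementation, and it assumes a C# 9+ top-level program. It shows why the four-character string comes back as 28 characters, and that the encoded form decodes to the same text.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper: replaces every character above U+00A0 (160)
// with a decimal numeric character reference such as &#1058;.
// Illustration only; it works per UTF-16 code unit, so it is only
// meaningful for characters in the Basic Multilingual Plane.
static string EncodeAbove160(string text)
{
    var sb = new StringBuilder();
    foreach (char c in text)
    {
        if (c > 160)
            sb.Append("&#").Append((int)c).Append(';');
        else
            sb.Append(c);
    }
    return sb.ToString();
}

string encoded = EncodeAbove160("Тест");
Console.WriteLine(encoded);         // &#1058;&#1077;&#1089;&#1090;
Console.WriteLine(encoded.Length);  // 28
// A browser (or any HTML decoder) turns the references back into the original text.
Console.WriteLine(WebUtility.HtmlDecode(encoded) == "Тест"); // True
```

So the sanitized output in the failing test is equivalent HTML; only the string comparison in the test sees a difference.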

@mganss mganss closed this as completed Jun 5, 2015
@mganss
Owner

mganss commented Jun 5, 2015

I can't repro. The CsQuery docs state:

HtmlEncoders.Minimum: Only encodes ampersand, left-caret and right-caret, the minimum needed to produce valid HTML.

Which version of CsQuery are you using (1.3.4 here)?

@ihor-shubin
Author

Which version of CsQuery are you using (1.3.4 here)?

Yep.

I changed all

sanitizer.Sanitize(htmlFragment);

and

sanitizer.Sanitize(html);

to

FormatDefault outputFormatter = new CsQuery.Output.FormatDefault(DomRenderingOptions.RemoveComments | DomRenderingOptions.QuoteAllAttributes, HtmlEncoders.Minimum);
string actual = sanitizer.Sanitize(htmlFragment, "", outputFormatter);

Now 14 tests in Tests.cs fail, and some of the failures look dangerous. For example:

  Expected string length 74 but was 54. Strings differ at index 5.
  Expected: "<div>&quot; SRC=&quot;http://ha.ckers.org/xss.js&quot;&gt;&qu...", ignoring case
  But was:  "<div>" SRC="http://ha.ckers.org/xss.js"&gt;"&gt;</div>"
  ----------------^

   at NUnit.Framework.Assert.That(Object actual, IResolveConstraint expression, String message, Object[] args)
   at Ganss.XSS.Tests.HtmlSanitizerTests.DivHtmlQuotesEncapsulation1XSSTest() in Tests.cs: line 1541

@mganss
Owner

mganss commented Jun 8, 2015

Since the tests check for exact string equality, some of them will fail if the output formatting changes, but that doesn't automatically mean the output isn't clean. I don't see an XSS problem with the output in the above test. Which other ones do you believe are dangerous?
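
As a rough illustration of that point, the sketch below compares two fragments that differ only in whether a double quote in text content is written as `&quot;` or as a literal character. The fragments are shortened, hypothetical stand-ins for the test output above, and `NormalizeQuotes` is an assumed helper, not part of HtmlSanitizer. The normalization is only safe for quotes in text content; inside an attribute value an unescaped quote does matter.

```csharp
using System;

// Hypothetical helper: rewrites &quot; as a literal double quote so that
// two serializations of the same text content compare equal.
static string NormalizeQuotes(string html) => html.Replace("&quot;", "\"");

string withEntity = "<div>&quot;&gt;</div>";   // quote written as an entity
string withLiteral = "<div>\"&gt;</div>";      // quote written literally

// Exact string equality reports a difference...
Console.WriteLine(withEntity == withLiteral); // False
// ...although both describe the same text inside the div.
Console.WriteLine(NormalizeQuotes(withEntity) == NormalizeQuotes(withLiteral)); // True
```

Note that `&gt;` stays encoded in both versions, which is what keeps the text from being parsed as markup.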

@RickBlacker

RickBlacker commented Jun 1, 2021

@mganss
Do you by chance have an example of how to properly configure and use this and/or AngleSharp to allow Chinese characters? We have an app with a text box that can contain HTML or plain text in Chinese. Like others, our text is getting converted to other characters rather than being left alone.

I'm using HtmlSanitizer version 5.0.404 inside a .NET Core API.

@mganss
Owner

mganss commented Jun 2, 2021

@RickBlacker This used to be an issue until we switched to AngleSharp years ago. There's no specific configuration necessary in HtmlSanitizer or AngleSharp. It's likely an encoding issue at an earlier stage in your processing pipeline. Can you post an example string and/or code?
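
For readers hitting the same symptom, here is a minimal sketch of one common way an earlier pipeline stage can corrupt Chinese text before it ever reaches the sanitizer: UTF-8 bytes decoded with the wrong character set. This is an assumption about a typical cause, not a diagnosis of the case above; the sample string and the choice of ISO-8859-1 as the wrong encoding are illustrative.

```csharp
using System;
using System.Text;

string original = "中文";
byte[] utf8Bytes = Encoding.UTF8.GetBytes(original); // 3 bytes per character here

// Decoding with the same encoding preserves the text.
string roundTripped = Encoding.UTF8.GetString(utf8Bytes);

// Decoding the same bytes as ISO-8859-1 yields one wrong character per byte.
string misread = Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);

Console.WriteLine(roundTripped == original); // True
Console.WriteLine(misread == original);      // False
Console.WriteLine(misread.Length);           // 6
```

If the string is already wrong at this point, no sanitizer setting will bring the original characters back, so it is worth checking the request and storage encodings first.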
