p:escape-markup not properly escaping special characters (e.g. Arabic characters) #220

Closed
davidfox-ap opened this Issue Jul 30, 2015 · 15 comments

davidfox-ap commented Jul 30, 2015

Escaping XML containing special characters--in particular, foreign alphabets--does not preserve these characters in the escaped version. A sample pipeline is below:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step" version="1.0">
    <p:serialization port="result" encoding="utf-8"/>
    <p:input port="source">
        <p:inline>
            <row id="1">
                <field id="2">Edita Food Industries</field>
                <field id="3">ايديتا للصناعات الغذائية</field>
            </row>
        </p:inline>
    </p:input>
    <p:output port="result"/>
    <p:escape-markup/>
    <p:wrap-sequence wrapper="test"/>
    <p:unescape-markup/>
</p:declare-step>

Output with latest version of Calabash:

<test>
    <field id="2">Edita Food Industries</field>
    <field id="4">ايديتا للصناØ
        ¹Ø§Øª الغذائية</field>  
</test>

Expected output:

<test>
                <field id="2">Edita Food Industries</field>
                <field id="3">ايديتا للصناعات الغذائية</field>
</test>

I am escaping and unescaping here, only to make it easier to read. The results are the same if you skip the final 'unescape-markup' step.

Owner

ndw commented Aug 3, 2015

I can't reproduce this. What version of Java are you using, and what version of XML Calabash?

Is the pipeline encoded in UTF-8?

It's not immediately obvious to me how this is related to the escape and unescape markup steps. If you just run the data through p:identity, do you get the same results?

davidfox-ap commented Aug 3, 2015

Thanks for your reply. My Java version:

java version "1.8.0_40"
Java(TM) SE Runtime Environment (build 1.8.0_40-b26)

And I'm using Calabash: 1.1.4-95

In my actual workflow I am calling an external XML file, but for testing purposes I put the problem data inline. This test XProc pipeline is encoded in UTF-8, and when I swap out the escape function for p:identity the characters look good.

I've created two versions of the test pipeline (one escaped and one using identity) with their outputs:

https://www.dropbox.com/sh/iqdutzi15m1196i/AACDMIwqokebDiyCsljqUNw7a?dl=0

I'm new to XProc and Calabash so it's possible I'm just doing something incorrectly. If it doesn't seem like a bug to you I can move this discussion elsewhere. Thanks!

davidfox-ap commented Aug 13, 2015

Hi Norm, after we spoke earlier this week I thought to try the same files, with the latest Calabash, on a Linux machine (rather than Windows, where I started). The files I shared above both worked fine there. So I wonder if it is a Java version issue particular to Windows. My application will eventually run on Linux, so I'm not sure why I began by testing on Windows. :-)

Thank you for your help!

Owner

ndw commented Aug 25, 2015

I can't conveniently test on Windows so I'm going to close this out. If it becomes important to you (or someone else reading this) later, please feel free to re-open it and I'll investigate.

@ndw ndw closed this Aug 25, 2015

raducoravu commented Jun 30, 2016

👍 @wendellpiez reported the same problem. When you run encoding tests on Linux you can specify in the Java command line a parameter like this:

    -Dfile.encoding=ASCII

to force set the default platform encoding to a very restrictive encoding.
Could this issue be reopened?
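
To illustrate why a restrictive default encoding exposes this kind of bug, here is a minimal sketch using only the JDK. It does not call XML Calabash; the class name `DefaultCharsetCheck` and the helper `decodeAs` are made up for this example. It decodes UTF-8 bytes with an explicitly named charset, which imitates what happens when bytes are turned into a string under a mismatched platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetCheck {

    // Hypothetical helper: encode the text as UTF-8, then decode the bytes
    // with the given charset. A mismatch between the two corrupts the text.
    static String decodeAs(String text, Charset charset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, charset);
    }

    public static void main(String[] args) {
        // The first word of the Arabic sample from the issue, as escapes.
        String arabic = "\u0627\u064A\u062F\u064A\u062A\u0627";

        // Shows which default the JVM picked up (affected by -Dfile.encoding).
        System.out.println("default charset: " + Charset.defaultCharset());

        // Same charset on both sides: the text survives.
        System.out.println("UTF-8 round trip intact: "
                + arabic.equals(decodeAs(arabic, StandardCharsets.UTF_8)));

        // Decoding the same bytes as US-ASCII: the text does not survive.
        System.out.println("US-ASCII round trip intact: "
                + arabic.equals(decodeAs(arabic, StandardCharsets.US_ASCII)));
    }
}
```

Naming the charset explicitly keeps the sketch deterministic on any platform; running a real pipeline with `-Dfile.encoding=ASCII` exercises the same mismatch through the JVM default instead.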

raducoravu commented Jun 30, 2016

In this particular case the problem is in the method: com.xmlcalabash.library.EscapeMarkup.run() at line:

            String data = outstr.toString();

where you are converting a stream of bytes to a string without specifying an encoding. This means that the default platform encoding will be used.
In this case I would do something like this:

            StringWriter sw = new StringWriter();
            serializer.setOutputWriter(sw);
            S9apiUtils.serialize(runtime, child, serializer);
            String data = sw.toString();

so I would instruct the serializer to save directly to a character stream.
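
The difference between the two approaches can be sketched with plain JDK classes. This is not the Calabash code: `serializeTo` is a hypothetical stand-in for the serializer, and the class name is invented for the example. The first method writes to a character stream, so no decoding step exists; the second goes through bytes and therefore has to name the same charset on both the encoding and the decoding side.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class CharacterStreamSketch {

    static final String MARKUP =
            "<field id=\"3\">\u0627\u064A\u062F\u064A\u062A\u0627</field>";

    // Hypothetical stand-in for a serializer: writes markup to any Writer.
    static void serializeTo(Writer out) throws IOException {
        out.write(MARKUP);
        out.flush();
    }

    // Character stream: the text never becomes bytes, so no charset is involved.
    static String viaStringWriter() throws IOException {
        StringWriter sw = new StringWriter();
        serializeTo(sw);
        return sw.toString();
    }

    // Byte stream: safe only because UTF-8 is named for encoding AND decoding.
    static String viaByteStream() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        serializeTo(new OutputStreamWriter(bytes, StandardCharsets.UTF_8));
        return bytes.toString("UTF-8"); // not the no-argument toString()
    }

    public static void main(String[] args) throws IOException {
        System.out.println("StringWriter intact: " + MARKUP.equals(viaStringWriter()));
        System.out.println("byte stream intact:  " + MARKUP.equals(viaByteStream()));
    }
}
```

The character-stream variant is the smaller change, since it removes the place where a default encoding could be picked up at all.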

raducoravu commented Jun 30, 2016

Other places where you have potential similar problems:

  com.xmlcalabash.extensions.Zip.storeJSON(FileToZip, XdmNode, OutputStream)
  com.xmlcalabash.library.Store.storeJSON(XdmNode, String, String, String)
  com.xmlcalabash.util.NodeToBytes.storeJSON(XdmNode, OutputStream)

As JSON is UTF-8 encoded you would need to explicitly give UTF-8 encoding when creating the print writer (which writes in chars) over the output stream (which writes in bytes):

new PrintWriter(new OutputStreamWriter(os, "UTF8"));
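
A self-contained sketch of that pattern follows. It does not reproduce the `storeJSON` methods listed above; the class name and the `writeJson` helper are hypothetical, and the JSON text is a made-up sample. It uses `StandardCharsets.UTF_8` rather than the string name, which avoids the checked `UnsupportedEncodingException`.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8JsonWriterSketch {

    // Hypothetical helper: writes JSON text to a byte stream through a
    // PrintWriter whose underlying encoder is explicitly UTF-8.
    static byte[] writeJson(String json) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        PrintWriter pw = new PrintWriter(
                new OutputStreamWriter(os, StandardCharsets.UTF_8));
        pw.print(json);
        pw.flush(); // push buffered characters through the encoder
        return os.toByteArray();
    }

    public static void main(String[] args) {
        // Sample JSON containing non-ASCII text, written with escapes.
        String json = "{\"name\": \"\u0627\u064A\u062F\u064A\u062A\u0627\"}";
        byte[] written = writeJson(json);

        // The bytes match a direct UTF-8 encoding of the same text.
        System.out.println("bytes are UTF-8: "
                + Arrays.equals(written, json.getBytes(StandardCharsets.UTF_8)));
    }
}
```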

@ndw ndw reopened this Jun 30, 2016

Owner

ndw commented Jun 30, 2016

Thanks. I'll take a look. I'm hoping to do some work on XML Calabash next week.

raducoravu commented Jun 30, 2016

Great, we are in no hurry. If you fix this we'll probably ship the fix with Oxygen 18.1 in a couple of months.

wendellpiez commented Sep 30, 2016

This is so awesome thanks!

raducoravu commented Oct 3, 2016

👍

raducoravu commented Oct 25, 2016

The problem I reported in "com\xmlcalabash\library\EscapeMarkup" still persists in 1.1.12. Was the fix made after the kit was built?

raducoravu commented Dec 15, 2016

@ndw could you give me an answer to the question I asked above? Because the problem with EscapeMarkup still persists...

Owner

ndw commented Dec 15, 2016

Indeed. I overlooked that case. I thought I had a test...grrr. Will fix now and publish 1.1.15 asap.

@ndw ndw reopened this Dec 15, 2016

@ndw ndw closed this in a456a52 Dec 15, 2016

ndw added a commit that referenced this issue Dec 15, 2016

raducoravu commented Dec 16, 2016

Thanks @ndw, much appreciated.
