Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML Entities converted during parsing, but not restored during serializing #175

Closed
nickkitto opened this Issue May 18, 2015 · 8 comments

Comments

Projects
None yet
3 participants
@nickkitto
Copy link

nickkitto commented May 18, 2015

In XMLUtil the VALID_ENTITY_NAMES allows for entities to be converted to their values during parsing. However they are not converted back during serialization.

For example, having a resource with a text>div tag like so

<div xmlns="http://www.w3.org/1999/xhtml">&sect;</div>

and running it through the xml parser

IParser parser = fhirContext.newXmlParser();
IResource res = parser.parseResource(...);
String output = parser.encodeResourceToString(res);

Results in the following text instead of the input text.

<div xmlns="http://www.w3.org/1999/xhtml">§</div>

Is this expected? I expected it to return the same value as the input, since this effectively changes the value.

Note: This is on HAPI-FHIR 0.7. Apologies if this is no longer applicable, but having a brief look at the code it seems to be.

@nickkitto

This comment has been minimized.

Copy link
Author

nickkitto commented May 18, 2015

I think the easiest solution to this problem is to either test if the namespace is xhtml, or test for that specific element, and use a different StringEscapeUtils escape method (maybe escapeHtml4). Or perhaps to stop expansion of entities in the text>div tag.

@jamesagnew

This comment has been minimized.

Copy link
Owner

jamesagnew commented May 20, 2015

Hi Nick,

This behaviour is sort of by design. Entities such as &sect; are valid in HTML/XHTML but not in XML by default, and since a FHIR document is not an XHTML document (even if it contains parts that are) it is not valid to use entities outside of what's allowable in XML

We had some complaints about HTML that was parsed in containing &sect; entities failing to validate with XML validators when the entites were preserved so we coded it to unescape these. Given that FHIR requires the use of UTF-8, this seemed like a safe thing to do.

Is this causing issues on your end? I'm definitely not against changing this behaviour if you have specific issues it's causing, although I don't know what the right behaviour would be..

@grahamegrieve

This comment has been minimized.

Copy link
Collaborator

grahamegrieve commented May 21, 2015

I believe the existing functionality is correct. Since this is an error commonly made by implementers, it's useful for HAPI to read html entities, but it should not produce them

@nickkitto nickkitto closed this May 24, 2015

@nickkitto nickkitto reopened this Jun 2, 2015

@nickkitto

This comment has been minimized.

Copy link
Author

nickkitto commented Jun 2, 2015

Reopening this issue as this fails in one of our examples.
A bit of info of our usage: we are leveraging HAPI-FHIR to convert from XML->JSON and JSON->XML.

Looking through the examples for DSTU1 there is the v3-codesystems. In the examples it has the entity '&trade;' in the JSON and in the symbol '™' in the XML. However using the HAPI-FHIR library this element is dropped as it does not exist in the mapping table.

@jamesagnew

This comment has been minimized.

Copy link
Owner

jamesagnew commented Jun 4, 2015

Hi Nick,

Thanks for the info. It does look like we're missing some entities, including &trade.

I'm completely unable to reproduce what you're seeing though. How are you doing the parsing? For reference I came up with the following unit test, but it fails on the parse step with an exception about there being an unknown entity (which is what I'd expect).

    @Test
    public void testParseTextWithUnknownEntity() {
        String msg = "<Patient xmlns=\"http://hl7.org/fhir\"><text><status value=\"generated\"/>"
                + "<div xmlns=\"http://www.w3.org/1999/xhtml\">Trade &trade;</div></text></Patient>";
        Patient pt = ourCtx.newXmlParser().parseResource(Patient.class, msg);

        ourLog.info(pt.getText().getDiv().getValueAsString());
        assertThat(pt.getText().getDiv().getValueAsString(), containsString("Trade ™"));

        String enc = ourCtx.newXmlParser().encodeResourceToString(pt);
        ourLog.info(enc);
        assertThat(enc, containsString("Trade ™"));

    }

jamesagnew added a commit that referenced this issue Jun 4, 2015

@nickkitto

This comment has been minimized.

Copy link
Author

nickkitto commented Jun 4, 2015

Hey James,
Our test has it being parsed from JSON, then being encoded into XML.

So the input would be something like

"{\"resourceType\":\"Supply\",\"text\":{\"status\":\"generated\",\"div\":\"<div xmlns=\\\"http://www.w3.org/1999/xhtml\\\">&trade;</div>\"}}"

And using the JSON parser. This then throws the exception (but this is valid, as v3-codesystems.json has this). This then shouldn't throw the exception and this should be added to the list of entities.

I think it was dropped due to some minor modifications we made to the library to get around this problem, so ignore that. Reverting to the vanilla library it throws the exception which is what you said.

@grahamegrieve

This comment has been minimized.

Copy link
Collaborator

grahamegrieve commented Jun 4, 2015

actually, just because v3-codesystems.json has it doesn't make it valid.
And I don't think it is; it's an html entity, not valid in xhtml.

On Fri, Jun 5, 2015 at 7:58 AM, Nick Kitto notifications@github.com wrote:

Hey James,
Our test has it being parsed from JSON, then being encoded into XML.

So the input would be something like

"{"resourceType":"Supply","text":{"status":"generated","div":"<div xmlns=\"http://www.w3.org/1999/xhtml\\\">™"}}"

And using the JSON parser. This then throws the exception (but this is
valid, as v3-codesystems.json has this). This then shouldn't throw the
exception and this should be added to the list of entities.

I think it was dropped due to some minor modifications we made to the
library to get around this problem, so ignore that. Reverting to the
vanilla library it throws the exception which is what you said.


Reply to this email directly or view it on GitHub
#175 (comment)
.


http://www.healthintersections.com.au / grahame@healthintersections.com.au
/ +61 411 867 065

@jamesagnew

This comment has been minimized.

Copy link
Owner

jamesagnew commented Jun 5, 2015

Perfect, I'll add an update list of entities to convert.

@jamesagnew jamesagnew closed this in 4d9b2c6 Jun 5, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.