Provide ability to ignore namespaces when parsing #130

Open
jpd236 opened this issue Mar 28, 2023 · 16 comments

@jpd236
Contributor

jpd236 commented Mar 28, 2023

I'm dealing with parsing XML in the wild that is occasionally inconsistent about specifying the correct/expected namespace for certain tags, or for that matter, any namespace at all. While I'd like serialization to include the correct namespace, when deserializing, I really only want to look at the tag name and can safely ignore the namespace value altogether. (Most other applications parsing this particular type of XML file already do so).

I'm not seeing an easy way to accomplish this; I don't think an unknown child handler can work here because the child isn't treated as unknown; it matches the expected @Serializable class with that tag name but then fails because the namespace doesn't match the expected one for that class, resulting in this error.

Did I miss an API for this? If not, this would be a helpful feature request. In the meantime, I've resorted to manually doing find-and-replace tweaks to the raw content to try to normalize the namespaces before parsing.

@pdvrieze
Owner

pdvrieze commented Apr 2, 2023

The intended way to achieve this is to implement your own subtype of the policy and override handleUnknownContentRecovering:

    @ExperimentalXmlUtilApi
    @Suppress("DirectUseOfResultType", "DEPRECATION")
    public fun handleUnknownContentRecovering(
        input: XmlReader,
        inputKind: InputKind,
        descriptor: XmlDescriptor,
        name: QName?,
        candidates: Collection<Any>
    ): List<XML.ParsedData<*>> {
        handleUnknownContent(input, inputKind, name, candidates)
        return emptyList()
    }

But I'll give you that this is not the cleanest way to solve your particular issue (where you just want to provide the name to use given a particular input and context). I'll keep this open to consider a nicer way to handle that case. Of course you could use a filtering XmlReader instead that would just do the mapping outside of serialization.

@yschimke

yschimke commented May 1, 2023

I couldn't get it to work for ignoring or changing the namespace.

This is for the following PR, where the input uses a versioned namespace (1, 5, or 7):
takahirom/roborazzi#56

I settled for

        val content = archive.readByteString(entry.realSize.toInt()).utf8()

        val withSingleNamespace = content.replace(
          "http://schemas.android.com/sdk/devices/\\d+".toRegex(),
          "http://schemas.android.com/sdk/devices/1"
        )

        val devices: Devices = xml.decodeFromString(withSingleNamespace)

@pdvrieze
Owner

pdvrieze commented May 2, 2023

For this particular case I would go with a filter before getting to serialization, something along the lines of /examples/DYNAMIC_TAG_NAMES.md (perhaps you only need the reader here, not the other "magic" that makes it work transparently). This would allow you to "filter" the XML input/output to remove namespaces. Using the fallback can work too, but would be more complex and less efficient.

@yschimke

yschimke commented May 2, 2023

I can't justify that, so I'll stick with a search and replace.

Thanks for the example.

@pdvrieze
Owner

pdvrieze commented May 2, 2023

Actually the filter can be fairly simple, much simpler than that example. You just handle tags and "rename" them; it's effectively a structural search and replace. Most of the complexity in "https://github.com/pdvrieze/xmlutil/blob/master/examples/src/main/kotlin/net/devrieze/serialization/examples/dynamictagnames/DynamicTagReader.kt" comes from dynamically introducing attributes, which gets messy. In your case you could just override the namespaceURI property to replace certain namespaces with a standard one. To use it, get a reader from XmlStreaming (or create one directly using the reader object names/aliases), wrap it in the filter, and pass the result as the input to serialization.

@yschimke

yschimke commented May 2, 2023

I'll take a look when I get some time.

Thanks for providing the Ktor XML implementation, it's saved me twice now.

@pdvrieze
Owner

pdvrieze commented May 2, 2023

You're welcome. Btw. for ktor, when going to version 2, use the binding provided by ktor. The module in my project is now officially deprecated (left around only for those still using older versions).

@jpd236
Contributor Author

jpd236 commented May 3, 2023

The problem I'm seeing with the approach in #130 (comment) is that it doesn't seem to propagate the overridden namespace to child elements - I'd have to do that manually.

That is, with a simple delegating reader like:

    private class NamespaceNormalizingReader(reader: XmlReader) : XmlDelegatingReader(reader) {
        override val namespaceURI: String
            get() = when (localName) {
                "some-tag" -> NORMALIZED_NAMESPACE
                else -> super.namespaceURI
            }
    }

the override happens successfully for <some-tag> itself. But if there is a child tag which does not specify its own namespace in the XML, super.namespaceURI still returns the original namespace of the parent element, not the one I provided in the delegating reader. As a result, while parsing "succeeds", the resulting element is empty, because none of the children (specified with @XmlSerialName, which has to include the namespace) end up matching the parsed XML.

I guess I could do this by maintaining a stack of namespaces, pushing and popping at the start and stop of each element, respectively. But that feels more complicated than the find-and-replace I have now.

@pdvrieze
Owner

pdvrieze commented May 3, 2023

The way to do it is to have a mapping from QName to QName (probably only the namespace part). Mapping from the local name alone leads to all kinds of issues. The parser will (or should) present the correct namespace even for child types. If you want to handle "triggers", you'll have to do that based upon the depth of the reader (and reset it on an end tag at the initial depth).
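
As a rough illustration of keying the mapping on the original namespace rather than the local name, here is a minimal sketch in plain Kotlin. It borrows the versioned devices namespace from the earlier comment; `CANONICAL_NS`, `VERSIONED_NS` and `normalizeNamespace` are hypothetical names made up for this example and are not part of xmlutil.

```kotlin
// Hypothetical canonical namespace, borrowed from the earlier search-and-replace example.
const val CANONICAL_NS = "http://schemas.android.com/sdk/devices/1"

// Assumed pattern: any numbered version of the devices namespace.
val VERSIONED_NS = Regex("""http://schemas\.android\.com/sdk/devices/\d+""")

/**
 * Maps the namespace reported by the parser to the one the serializer expects.
 * Namespaces that do not match the pattern are returned unchanged.
 */
fun normalizeNamespace(original: String): String =
    if (VERSIONED_NS.matches(original)) CANONICAL_NS else original
```

In a delegating reader like the one shown above, the `namespaceURI` getter could then presumably return `normalizeNamespace(super.namespaceURI)`. Because the decision depends only on the namespace the parser reports, child elements that inherit the namespace get the same treatment without extra bookkeeping.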

@jpd236
Contributor Author

jpd236 commented May 4, 2023

You're referring to XmlReader#name? Overriding this doesn't seem to do anything - it isn't invoked even on successful parses when I attach a debugger. I just get the failure in XmlReader#require which checks localName and namespaceURI directly.

@pdvrieze
Owner

pdvrieze commented May 5, 2023

I mean to override namespaceURI (as well as name) to return whatever you want it to (or an empty string). But you do this by matching the original namespaceURI, not merely the local name. But you need to be consistent.

@jpd236
Contributor Author

jpd236 commented May 5, 2023

I'm not 100% sure I follow, but I think what you're essentially saying is that we need to override namespaceURI for the child elements, not just the root elements where the namespace is specified in the XML - by matching by the "invalid" namespace in namespaceURI's getter, rather than just using the localName for the impacted root elements.

If so, I think this falls apart in my case because there are multiple hierarchies, e.g. the root tag has expected namespace A, but it has two children, one with expected namespace B and one with expected namespace C. And one of the scenarios that I'm trying to handle is that the namespaces are just omitted from the XML entirely. So if I just map from namespaceURI, and it is blank, I can't know whether to return B or C unless I also track the namespace I returned most recently from a parent, which means I have to maintain a stack of namespaces.

@pdvrieze
Owner

pdvrieze commented May 6, 2023

Basically the filter works at quite a low level. It doesn't retain any scope. So if the namespace to use is variable you must handle that in each place that namespace is returned (potentially even on attributes). Tracking the namespace wouldn't be too difficult (store it in relation to depth and remove it when the depth is lower than the recorded depth of initialisation). An alternative is to just unify the namespaces (which effectively ignores it) by (for example) always returning the same namespace. If you want none at all (effectively ignoring namespaces) you would just return the empty string for all namespaces and all prefixes.
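
To make the depth-based tracking concrete, here is a small sketch in plain Kotlin that models only the bookkeeping, independent of any reader class. `NamespaceScopeTracker` and its `triggers` map are hypothetical names for illustration. It assumes it is called once per start tag, and that the depth passed in counts the element itself.

```kotlin
/**
 * Tracks which namespace is in scope at each element depth.
 * [triggers] maps a local name to the namespace that should apply to that
 * element and its descendants (a hypothetical mapping for this sketch).
 */
class NamespaceScopeTracker(private val triggers: Map<String, String>) {
    // Each entry pairs a namespace with the depth of the element that introduced it.
    private val scopes = ArrayDeque<Pair<String, Int>>()

    /** Returns the namespace to report for this start tag, or [fallback] if none is in scope. */
    fun onStartElement(localName: String, depth: Int, fallback: String = ""): String {
        // A start tag at this depth means every element previously opened at the
        // same or a greater depth has ended, so its scope no longer applies.
        while (scopes.isNotEmpty() && scopes.last().second >= depth) {
            scopes.removeLast()
        }
        triggers[localName]?.let { scopes.addLast(it to depth) }
        return scopes.lastOrNull()?.first ?: fallback
    }
}
```

With triggers for a root element and two differently namespaced children, an unprefixed grandchild picks up whichever child scope is currently open, and a later sibling without a trigger falls back to the root's namespace.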

@jpd236
Contributor Author

jpd236 commented May 16, 2023

I think the alternative you suggest doesn't work because then the data classes used for serialization would need to not declare any namespaces either. I want to be accepting (ignore namespaces) when parsing, but strict (provide valid namespaces) when writing (per the robustness principle).

This appears to be working, per your first suggestion. I'm probably making some simplifying assumptions based on the XML I expect to see. I'm not sure if there's a simpler way, or if there's a gap here I'm not seeing:

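    private class NamespaceNormalizingReader(reader: XmlReader) : XmlDelegatingReader(reader) {
        // Namespaces currently in scope, each paired with the depth at which it was introduced.
        private val namespaceDepthStack = ArrayDeque<Pair<String, Int>>()

        override val namespaceURI: String
            get() {
                // Discard namespaces introduced by elements deeper than the current one.
                while ((namespaceDepthStack.lastOrNull()?.second ?: 0) > depth) {
                    namespaceDepthStack.removeLast()
                }
                // Known tags start a new namespace scope for themselves and their children.
                val newNamespace = when (localName) {
                    "crossword-compiler-applet" -> CCA_NS
                    "crossword-compiler" -> CC_NS
                    "rectangular-puzzle" -> PUZZLE_NS
                    else -> null
                }
                newNamespace?.let {
                    namespaceDepthStack.addLast(it to depth)
                }
                // Fall back to the parsed namespace if no known tag has been seen yet,
                // rather than throwing on an empty stack.
                return namespaceDepthStack.lastOrNull()?.first ?: super.namespaceURI
            }
    }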

I'm on the fence as to whether this is better than the simple find-and-replace I had before, but it's probably a bit cleaner and more robust. On the other hand, it's not as simple as the ideal API: either a simple boolean "ignore namespaces when parsing", or a way to override namespaces on a per-tag basis and have that propagate to child tags/attributes unless a new namespace appears.

@pdvrieze
Owner

For ignoring the namespace when parsing, you could use a different policy that doesn't give namespaces in any case (for reading only). I'm not sure whether that would also fit the particular problem (if namespaces are actually needed). Dealing with broken XML is always a mess.

@chumpa

chumpa commented May 20, 2023

@jpd236 thank you for the code snippet. I ran into the same problem just today.
