detecting the version of WXR in namespace-aware parsing #117

pbiron · 2017-06-09T00:58:17Z

I'm working on a patch to make WXR_Importer namespace-aware. I have a version that basically works, as well as some unit tests for "hairy namespaced" WXR files such as Example 1.

Example 1

<rss xmlns:wxr='http://wordpress.org/export/1.1/'>
   ...
   <wxr_version xmlns='http://wordpress.org/export/1.1/'>1.1</wxr_version>
   ...
   <random_prefix:category xmlns:random_prefix='http://wordpress.org/export/1.1/'>
      <random_prefix:term_id>8</random_prefix:term_id>
      <random_prefix:category_nicename>alpha</random_prefix:category_nicename>
      <random_prefix:category_parent></random_prefix:category_parent>
      <random_prefix:cat_name><![CDATA[alpha]]></random_prefix:cat_name>
   </random_prefix:category>
   ...
   <item>
      <wxr:post_id>123</wxr:post_id>
      ...
      <ns1:post_meta xmlns:ns1='http://wordpress.org/export/1.1/'>
         <ns1:meta_key>some_key</ns1:meta_key>
         <meta_value xmlns='http://wordpress.org/export/1.1/'>some value</meta_value>
      </ns1:post_meta>
   </item>
   ...
</rss>

Granted, the built-in exporter is never going to produce anything like that, but plugins might; see "wordpress-importer's lack of understanding of XML Namespaces causing compatibility issues".

The code I have at this point handles Example 1 just fine...and doesn't add too much complexity to the parsing.
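To make "namespace-aware" concrete, here is a minimal sketch of the idea, written in Python with `xml.etree.ElementTree` purely for illustration (it is not the PHP code in the patch). It matches elements by namespace URI and local name and ignores whatever prefix the file happens to use. The helper names `split_name` and `wxr_elements` are hypothetical, and the sample is a trimmed copy of Example 1.

```python
import xml.etree.ElementTree as ET

WXR_NS = "http://wordpress.org/export/1.1/"

# Trimmed copy of Example 1: one namespace, four different spellings.
SAMPLE = """<rss xmlns:wxr='http://wordpress.org/export/1.1/'>
  <wxr_version xmlns='http://wordpress.org/export/1.1/'>1.1</wxr_version>
  <random_prefix:category xmlns:random_prefix='http://wordpress.org/export/1.1/'>
    <random_prefix:term_id>8</random_prefix:term_id>
  </random_prefix:category>
  <item>
    <wxr:post_id>123</wxr:post_id>
    <ns1:post_meta xmlns:ns1='http://wordpress.org/export/1.1/'>
      <ns1:meta_key>some_key</ns1:meta_key>
      <meta_value xmlns='http://wordpress.org/export/1.1/'>some value</meta_value>
    </ns1:post_meta>
  </item>
</rss>"""


def split_name(tag):
    """Split ElementTree's '{uri}local' tag form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag  # element is in no namespace


def wxr_elements(xml_text, ns=WXR_NS):
    """Return (localName, text) for every element in namespace `ns`,
    in document order, regardless of the prefix used in the file."""
    root = ET.fromstring(xml_text)
    found = []
    for el in root.iter():
        uri, local = split_name(el.tag)
        if uri == ns:
            found.append((local, (el.text or "").strip()))
    return found


if __name__ == "__main__":
    for name, text in wxr_elements(SAMPLE):
        print(name, repr(text))
```

Because the comparison is on the URI, `wxr:post_id`, `ns1:meta_key`, and the default-namespaced `meta_value` are all treated as the same vocabulary, while `rss` and `item` (which are in no namespace) are left out.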

But, I have a couple of questions before finalizing the code and submitting the pull request:

What should happen when something like Example 2 is encountered? That is, when the namespace URI of the <wxr_version> element "suggests" a different WXR version than its value? I would think the import should abort with an error, but wanted to get an opinion before coding that. Note: to those coming from an XML background, the <wxr_version> element is at best superfluous, and at worst bad markup design (i.e., one should use either an element/attribute value to detect the markup version OR a namespace URI, but not both), but I won't go there...because we're stuck with it.
What should happen if "potential" WXR elements are encountered before the <wxr_version> element? (Here "potential" means the namespace URI of the element is one of the known URIs for the various versions of WXR and the localName of the element is one of the "known" WXR elements; see Example 3.) I see at least 3 options:
1. Process all "potential" WXR elements, regardless of their namespace URIs. This option might make the most sense to the average WP user/developer who is not well-versed in XML Namespaces but, personally, I don't think it makes sense from an XML Namespaces perspective.
2. Use the namespace URI of the 1st "potential" WXR element encountered to decide which other WXR elements to process (i.e., essentially ignoring the <wxr_version> element if it doesn't occur before other "potential" WXR elements). I'm not really sure who the "target audience" for this option would be.
3. Ignore any "potential" WXR elements that come before <wxr_version> and then use the namespace URI/value of the <wxr_version> to decide which following WXR elements to process. I tend to lean towards this option, but could be convinced otherwise.
4. ???

Example 2

<rss>
   ...
   <wxr_version xmlns='http://wordpress.org/export/1.1/'>1.2</wxr_version>
   ...
</rss>
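For the Example 2 case, the "abort with an error" behaviour could look roughly like the sketch below. It is Python for illustration only, not the importer's PHP. The URI-to-version table is an assumption built from the two URIs that appear in these examples, and `WXRVersionMismatch` and `detect_version` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Assumption: only the two namespace URIs used in the examples are listed.
KNOWN_WXR_NAMESPACES = {
    "http://wordpress.org/export/1.1/": "1.1",
    "http://wordpress.org/export/1.2/": "1.2",
}


class WXRVersionMismatch(ValueError):
    """The <wxr_version> value disagrees with its namespace URI."""


def detect_version(xml_text):
    """Return the WXR version of the first namespaced <wxr_version>,
    or None if there is none; raise if URI and value disagree."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        if not el.tag.startswith("{"):
            continue  # element is in no namespace
        uri, local = el.tag[1:].split("}", 1)
        if local != "wxr_version" or uri not in KNOWN_WXR_NAMESPACES:
            continue
        declared = (el.text or "").strip()
        implied = KNOWN_WXR_NAMESPACES[uri]
        if declared != implied:
            raise WXRVersionMismatch(
                f"namespace {uri!r} implies WXR {implied}, "
                f"but <wxr_version> says {declared!r}"
            )
        return declared
    return None
```

Fed the Example 2 document, `detect_version` raises `WXRVersionMismatch`; fed a document whose URI and value agree, it returns the version string.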

Example 3

<rss>
   ...
   <base_site_url xmlns='http://wordpress.org/export/1.1/'>http://example.com</base_site_url>
   <wxr_version xmlns='http://wordpress.org/export/1.2/'>1.2</wxr_version>
   ...
</rss>
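Option 3 from the list above can be sketched as a small state machine: nothing is accepted until <wxr_version> has been seen, and after that only elements in its namespace are accepted. Again this is illustrative Python rather than the PHP patch. It walks a parsed tree in document order, which is the same order a stream-based parser would report elements in. The names `KNOWN_WXR_URIS` and `select_elements` are hypothetical, and the check against a list of "known" local names is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Assumption: only the two namespace URIs used in the examples are listed.
KNOWN_WXR_URIS = {
    "http://wordpress.org/export/1.1/",
    "http://wordpress.org/export/1.2/",
}


def select_elements(xml_text):
    """Option 3: skip 'potential' WXR elements until <wxr_version> is
    seen, then accept only elements in that element's namespace.

    Returns (active_namespace, accepted_local_names, skipped_local_names).
    """
    root = ET.fromstring(xml_text)
    active_ns = None
    accepted, skipped = [], []
    for el in root.iter():  # document order
        if not el.tag.startswith("{"):
            continue  # not namespaced, so not a "potential" WXR element
        uri, local = el.tag[1:].split("}", 1)
        if uri not in KNOWN_WXR_URIS:
            continue
        if active_ns is None:
            if local == "wxr_version":
                active_ns = uri  # version found; start accepting
            else:
                skipped.append(local)  # came before <wxr_version>
        elif uri == active_ns:
            accepted.append(local)
        else:
            skipped.append(local)  # known WXR URI, but a different version
    return active_ns, accepted, skipped
```

With Example 3 as input, `base_site_url` ends up in the skipped list, because it precedes <wxr_version> and is also in the 1.1 namespace while the version element is in the 1.2 namespace.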

Both of these questions (but especially the 2nd) stem from the fact that there is no schema for WXR. I see that there was some discussion a while back of validating against a schema during import. Personally, I don't think that's worth it (especially with a stream-based parser like XMLReader). But run-time validation is not the only use for a schema: schemas also serve as 1) documentation, and 2) "contracts" for both producers and consumers to code against and use during testing.

I've started on a bare-bones XML Schema for WXR 1.2 (as one of the editors of the W3C XML Schema spec, I know a thing or two about that :-)...but it's far from "ready for prime-time", mostly because of questions like the above. I realize the development of this plugin isn't the place to have the general discussion of writing a schema for WXR, but: 1) I would like tentative answers to the questions above so that I can finish my 1st pass at a namespace-aware WXR_Importer and submit a pull request for it, and 2) I'm wondering if anyone associated with this plugin can put me in contact with the Core folks I could talk to about nailing down schemas for the various versions of WXR.
