XML file cannot be loaded by DHF #882

Closed
danielholgate opened this issue Apr 12, 2018 · 19 comments

@danielholgate

danielholgate commented Apr 12, 2018

DHF Quick Start 2.0.4 running on Mac (High Sierra), MarkLogic 9.0.4 backend on a Vagrant box running CentOS 7

I have a batch of XML documents similar to the one attached which break the load flow and cannot be loaded into the DHF.

This is the document:
39eae888-9349-4604-8678-d8df4b0696e6.xml.zip

I create an Input Flow which generates the following mlcp command (the mlcp-test directory contains only the one XML document ):

mlcp.sh import -mode "local" -host "dev-ml1" -port "8010" -username -password "*****" -input_file_path "/space/software/github/sport1/sport1_data/mlcp-test" -input_file_type "documents" -output_collections "Opta,LoadOptaData,input" -output_permissions "rest-reader,read,rest-writer,update" -output_uri_replace "/space/software/github/sport1/sport1_data/mlcp-test,''" -document_type "xml" -transform_module "/com.marklogic.hub/mlcp-flow-transform.xqy" -transform_namespace "http://marklogic.com/data-hub/mlcp-flow-transform" -transform_param "entity-name=Opta,flowname=Load%20Opta%20Data"

When I run the flow it then generates this error:

17:28:12.414 [main] INFO c.m.contentpump.LocalJobRunner - Content type: XML
17:28:12.750 [main] INFO c.marklogic.contentpump.ContentPump - Job name: local_2138110744_1
17:28:12.766 [main] INFO c.m.c.FileAndDirectoryInputFormat - Total input paths to process : 1
17:28:13.797 [pool-1-thread-1] WARN c.m.contentpump.TransformWriter - Failed document /39eae888-9349-4604-8678-d8df4b0696e6.xml
17:28:13.797 [pool-1-thread-1] WARN c.m.contentpump.TransformWriter - <error:format-string xmlns:error="http://marklogic.com/xdmp/error" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">XDMP-AS: (err:XPTY0004) $content as item()? – Invalid coercion: (, ,
<SoccerDocument .../>...) as item()?</error:format-string>
17:28:13.801 [Thread-4] INFO c.m.contentpump.LocalJobRunner - completed 100%
17:28:13.807 [main] INFO c.m.contentpump.LocalJobRunner - com.marklogic.mapreduce.MarkLogicCounter:
17:28:13.807 [main] INFO c.m.contentpump.LocalJobRunner - INPUT_RECORDS: 1
17:28:13.807 [main] INFO c.m.contentpump.LocalJobRunner - OUTPUT_RECORDS: 1
17:28:13.807 [main] INFO c.m.contentpump.LocalJobRunner - OUTPUT_RECORDS_COMMITTED: 0
17:28:13.807 [main] INFO c.m.contentpump.LocalJobRunner - OUTPUT_RECORDS_FAILED: 1
17:28:13.807 [main] INFO c.m.contentpump.LocalJobRunner - Total execution time: 1 sec

@popzip
Contributor

popzip commented Apr 12, 2018

@danielholgate thanks for logging this. If this is a blocker for an existing customer, please also raise it through your support contact.

@grechaw

grechaw commented Apr 12, 2018

For some reason your XML document is being treated as a sequence rather than a single node. Very odd. What happens if you remove the declaration and comments from the top, so that the contents parse as a single item?

in other words take this out:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Copyright 2001-2018 Opta Sportsdata Ltd. All rights reserved. -->

<!-- PRODUCTION HEADER
     produced on:        valde-jobq-a03.nexus.opta.net
     production time:    20180316T200933,494Z
     production module:  Opta::Feed::XML::Soccer::F9
-->

I'm not suggesting this is a workaround, just trying to learn a little.

@danielholgate
Author

danielholgate commented Apr 13, 2018

I will try removing the comments section from the document.

Btw, I was able to load this document with MLCP when the DHF transform module (/com.marklogic.hub/mlcp-flow-transform.xqy) is not used, so this seems to be a DHF problem rather than the MLCP problem I initially suspected.

@grechaw

grechaw commented Apr 13, 2018

Thanks for the bug report -- I think there's enough here for us to act on.

@sashamitrovich

It seems I have a very similar bug, too. Also using DHF 2.0.4.

10:48:49.492 [main] INFO  c.m.contentpump.LocalJobRunner - Content type: XML
10:48:49.852 [main] INFO  c.marklogic.contentpump.ContentPump - Job name: local_1769849562_1
10:48:49.873 [main] INFO  c.m.c.FileAndDirectoryInputFormat - Total input paths to process : 2
10:48:51.140 [pool-1-thread-4] WARN  c.m.contentpump.TransformWriter - Failed document /g/truven-patient-0F194AF73670.xml
10:48:51.140 [pool-1-thread-4] WARN  c.m.contentpump.TransformWriter - <error:format-string xmlns:error="http://marklogic.com/xdmp/error" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">XDMP-AS: (err:XPTY0004) $entity-name as xs:string -- Invalid coercion: () as xs:string</error:format-string>
10:48:51.140 [pool-1-thread-4] WARN  c.m.contentpump.TransformWriter - Failed document /g/truven-patient-3A649FC8D027.xml
10:48:51.140 [pool-1-thread-4] WARN  c.m.contentpump.TransformWriter - <error:format-string xmlns:error="http://marklogic.com/xdmp/error" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">XDMP-AS: (err:XPTY0004) $entity-name as xs:string -- Invalid coercion: () as xs:string</error:format-string>
10:48:51.140 [pool-1-thread-4] WARN  c.m.contentpump.TransformWriter - Failed document /g/truven-patient-703990A93BA7.xml
10:48:51.140 [pool-1-thread-4] WARN  c.m.contentpump.TransformWriter - <error:format-string xmlns:error="http://marklogic.com/xdmp/error" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">XDMP-AS: (err:XPTY0004) $entity-name as xs:string -- Invalid coercion: () as xs:string</error:format-string>
...

The document:
truven-patient-703990A93BA7.xml.zip

@danielholgate
Author

Update: I removed the comment section from the top of my document and DHF was then able to load it successfully, so there seems to be an error in the transform module logic around top-level comments.
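For anyone who needs to load similar files in the meantime, the manual edit described here could in principle be scripted. The following is only a minimal sketch using Python's standard library, not part of DHF or MLCP; the function name `strip_prolog_nodes` and the sample document are hypothetical. It re-serializes only the root element, which drops the XML declaration and any comments or processing instructions that precede it.

```python
from xml.dom import minidom


def strip_prolog_nodes(xml_bytes):
    """Serialize only the root element, dropping everything before it.

    Comments nested inside the root element are left untouched.
    """
    doc = minidom.parseString(xml_bytes)
    return doc.documentElement.toxml()


# Hypothetical sample shaped like the attached file.
cleaned = strip_prolog_nodes(b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- header comment -->
<SoccerDocument><Team id="t1"/></SoccerDocument>
""")
print(cleaned)
```

This is a stopgap for pre-processing input files, not a fix for the transform itself.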

@damonfeldman
Contributor

Is the stack trace for these errors logged and available? I see one message referencing "$content as item()?" so it's probably the function return from some .xqy, but no stack.

It's probably easy enough to find the line in this case (almost certainly the data from one plugin being passed to the next as $content?) but in general it's important that errors are caught and logged.

@sashamitrovich

@damonfeldman, what do you mean by stack trace? I only have the log output from ML, which is what I attached; there's no additional stack trace with it.

@marklogic-builder
Contributor

➤ Alexander Ebadirad commented:

It turns out this is an invalid XML document that gets turned into a sequence, because you can't have root-level comments or processing instructions (MarkLogic will throw an error). The content therefore ends up being treated as a sequence, and since that sequence holds several items rather than a single node, it fails with the corresponding coercion error. This is related to #1000 for the same reason.

Now we need to decide whether to filter out the root-level processing instructions and comments, or to improve error reporting so that it warns about an XML structure MarkLogic does not accept as a single node. Either option involves a minor performance hit.
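The two options can be illustrated outside MarkLogic. The sketch below is an assumption-laden illustration in Python's standard library, not the DHF implementation: the hypothetical helper `root_element_with_warnings` returns the single root element (the "filter" option) together with one message per skipped top-level node (the "warn" option).

```python
from xml.dom import minidom


def root_element_with_warnings(xml_bytes):
    """Return the root element plus a warning for each non-element top-level node."""
    doc = minidom.parseString(xml_bytes)
    warnings = [
        f"ignored top-level {node.nodeName} node"
        for node in doc.childNodes
        if node.nodeType != minidom.Node.ELEMENT_NODE
    ]
    return doc.documentElement, warnings


# Hypothetical input: one comment and one processing instruction before the root.
root, warnings = root_element_with_warnings(
    b'<!-- header --><?xml-stylesheet href="x.xsl"?><SoccerDocument/>'
)
print(root.tagName)
for message in warnings:
    print(message)
```

Either branch requires walking the top-level nodes of every document, which is where the performance cost mentioned above would come from.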

@marklogic-builder
Contributor

➤ Daniel Holgate commented:

[~aebadira] I was, however, able to load this document into MarkLogic using plain old MLCP (separate from DHF). Can you clarify what you mean by "MarkLogic throws an error"?

@marklogic-builder
Contributor

➤ Alexander Ebadirad commented:

When built and passed around in memory:

[1.0-ml] XDMP-UNEXPECTED: (err:XPST0003) Unexpected token syntax error, unexpected XMLCommentStart_

Do a raw xdmp:document-insert of your document (copy and paste it into qconsole) and you'll see the issue. DHF treats the multiple roots as a sequence instead of as a single document with multiple root nodes, because valid XML has only one root node.

Load your document via MLCP, then run fn:doc('uriofthisdoc')/node() in qconsole. You'll see what is happening in the transform, why DHF thinks this is a sequence instead of a single document, and why the extra nodes throw errors.
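The same effect can be reproduced without a MarkLogic instance. As a rough analogue of listing the document's child nodes, the following sketch (Python standard library only; the sample document and the helper name `top_level_nodes` are made up for illustration) lists the children of the document node for a file that has comments ahead of its root element.

```python
from xml.dom import minidom

# Hypothetical sample shaped like the attached file: comments precede the root element.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- Copyright notice -->
<!-- PRODUCTION HEADER -->
<SoccerDocument/>
"""


def top_level_nodes(xml_bytes):
    """Return (node kind, node name) for each child of the document node."""
    kinds = {
        minidom.Node.COMMENT_NODE: "comment",
        minidom.Node.PROCESSING_INSTRUCTION_NODE: "processing-instruction",
        minidom.Node.ELEMENT_NODE: "element",
    }
    doc = minidom.parseString(xml_bytes)
    return [(kinds.get(n.nodeType, "other"), n.nodeName) for n in doc.childNodes]


for kind, name in top_level_nodes(SAMPLE):
    print(kind, name)
```

The document node has three children here, not one, which mirrors the multi-item sequence that the transform ends up receiving.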

@marklogic-builder
Contributor

➤ Charles Greer commented:

est is to close loop – work is recorded in 872

@marklogic-builder
Contributor

➤ Srinath Sambasubramanian commented:

PR #1070 created

@marklogic-builder
Contributor

➤ Srinath Sambasubramanian commented:

Assigning to [~rvudutal] for verification

@marklogic-builder
Contributor

➤ Srinath Sambasubramanian commented:

[~rvudutal],

Please make sure you replace the data-hub-framework/* in Modules dir with server-side/* in Marklogic-data-hub dir before you test

@marklogic-builder
Contributor

➤ Srinath Sambasubramanian commented:

The ingestion flow works, but during testing I realized that there are implications for harmonize flows, because we now allow processing instructions and comments in the input document. Passing this to [~aebadira] to handle those scenarios.

@danielholgate
Author

I have gone back to the original dataset I raised this ticket for and can now load them with DHF 4.0.0 👍

@aebadirad
Contributor

Excellent, thanks for closing the loop :)

@marklogic-builder
Contributor

➤ Srinath Sambasubramanian commented:

Verified that harmonize flows work fine with XML docs containing processing instructions/comments, and added tests.
