XML file cannot be loaded by DHF #882
Comments
@danielholgate thanks for logging. If this is a blocker for an existing customer, please also raise it through your support contact.
For some reason your XML document is being treated as a sequence rather than a single node. Very odd. What happens if you remove the declaration and comments from the top, so that the contents parse as a single item? In other words, take this out:
I'm not suggesting this is a workaround, just trying to learn a little.
I will try removing the comments section from the document. By the way, I was able to load this document with MLCP when the DHF transform module (/com.marklogic.hub/mlcp-flow-transform.xqy) is not used, so this seems to be a DHF problem rather than the MLCP problem I initially suspected.
Thanks for the bug report -- I think there's enough here for us to act on.
It seems I have a very similar bug, too. Also using DHF 2.0.4.
The document:
Update: I removed the comment section from the top of my document and DHF was then able to load it successfully, so there seems to be an error in the transform module's handling of comments at the top of a document.
Is the stack trace for these errors logged and available? I see one message referencing "$content as item()?", so it's probably the function return from some .xqy, but there is no stack. It's probably easy enough to find the line in this case (almost certainly the data from one plugin being passed to the next as $content?), but in general it's important that errors are caught and logged.
@damonfeldman, what do you mean by stack trace? I only have the log output from ML, and that's what I attached; there's no additional stack trace with it.
➤ Alexander Ebadirad commented: Turns out this is an invalid XML document that gets turned into a sequence, because you can't have root-level comments or processing instructions (MarkLogic will throw an error). It ends up being taken as a sequence, and since that sequence cannot be coerced to the single item the transform expects ($content as item()?), it fails with the appropriate coercion error. This is related to #1000 for the same reason. Now we're left to determine whether we want to filter out the root-level processing instructions and comments, or improve the error reporting to warn about an XML structure that is not valid for MarkLogic. Either option will involve a minor performance hit.
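The first option mentioned above (filtering out root-level comments and processing instructions) can be sketched outside MarkLogic. This is only an illustration in Python using the standard xml.dom.minidom module, not the DHF transform or the fix that was eventually merged; the sample document and the function name keep_root_element_only are hypothetical.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Hypothetical sample: a comment and a processing instruction ahead of the root.
SAMPLE_WITH_PROLOG = """<?xml version="1.0"?>
<!-- feed header comment -->
<?feed-tool version="1"?>
<SoccerDocument><Team name="A"/></SoccerDocument>"""


def keep_root_element_only(xml_text):
    """Serialize only the root element, dropping document-level comments and PIs."""
    doc = minidom.parseString(xml_text)
    # Keep just the element node(s) that sit directly under the document node.
    elements = [n for n in doc.childNodes if n.nodeType == Node.ELEMENT_NODE]
    if len(elements) != 1:
        # Assumption: anything other than exactly one root element is reported, not guessed at.
        raise ValueError("expected exactly one root element, found %d" % len(elements))
    return elements[0].toxml()


cleaned = keep_root_element_only(SAMPLE_WITH_PROLOG)
print(cleaned)
```

The extra parse and filter step on every document is the kind of cost the "minor performance hit" remark refers to.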
➤ Daniel Holgate commented: [~aebadira] I was, however, able to load this document into MarkLogic using plain old MLCP (separate from DHF). Can you clarify what you mean by "MarkLogic throws an error"?
➤ Alexander Ebadirad commented: When built and passed around in memory: [1.0-ml] XDMP-UNEXPECTED: (err:XPST0003) Unexpected token syntax error, unexpected XMLCommentStart_
Do a raw xdmp:document-insert of your document (copy and paste it into qconsole) and you'll see the issue. DHF is treating the multiple roots as a sequence instead of as a single document with multiple root nodes, because valid XML only has one root node. Load your document via MLCP, then in qconsole run fn:doc('uriofthisdoc')/node(). You'll see what is happening in the transform, why DHF thinks this is a sequence instead of a single document, and why the extra nodes throw errors.
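What fn:doc('uriofthisdoc')/node() reveals can be illustrated with a rough analogy outside MarkLogic. The sketch below is not DHF or MarkLogic code: it uses Python's standard xml.dom.minidom, and the sample document and the helper name document_level_kinds are made up for illustration. It lists the kinds of nodes that sit directly under the document node when comments precede the root element.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Hypothetical sample: two comments ahead of the root element,
# similar in shape to the document discussed in this issue.
SAMPLE = """<?xml version="1.0"?>
<!-- generated by feed -->
<!-- second comment -->
<SoccerDocument><Team name="A"/></SoccerDocument>"""

# Human-readable labels for the node types we expect at document level.
KIND_NAMES = {
    Node.ELEMENT_NODE: "element",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing-instruction",
}


def document_level_kinds(xml_text):
    """Return the kind of each node directly under the document node, in order."""
    doc = minidom.parseString(xml_text)
    return [KIND_NAMES.get(node.nodeType, "other") for node in doc.childNodes]


kinds = document_level_kinds(SAMPLE)
print(kinds)
```

The document node here has three children rather than one, which mirrors why a value typed as item()? (at most one item) cannot accept the result of selecting every document-level node.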
➤ Charles Greer commented: Best is to close the loop; work is recorded in 872.
➤ Srinath Sambasubramanian commented: PR #1070 created |
➤ Srinath Sambasubramanian commented: Assigning to [~rvudutal] for verification |
➤ Srinath Sambasubramanian commented: [~rvudutal], please make sure you replace the data-hub-framework/* in the Modules dir with server-side/* from the Marklogic-data-hub dir before you test.
➤ Srinath Sambasubramanian commented: The ingestion flow works, but during testing I realized that there are implications for harmonize flows, because we allow processing instructions and comments in the input document. Passing this to [~aebadira] to handle those scenarios.
I have gone back to the original dataset I raised this ticket for and can now load them with DHF 4.0.0 👍
Excellent, thanks for closing the loop :)
➤ Srinath Sambasubramanian commented: Verified that harmonize flows work fine with XML docs containing processing instructions or comments, and added tests.
DHF Quick Start 2.0.4 running on Mac (High Sierra), MarkLogic 9.0.4 backend on a Vagrant box running CentOS 7.
I have a batch of XML documents similar to the one attached which break the load flow and cannot be loaded into the DHF.
This is the document:
39eae888-9349-4604-8678-d8df4b0696e6.xml.zip
I create an Input Flow which generates the following mlcp command (the mlcp-test directory contains only the one XML document):
mlcp.sh import \
  -mode "local" \
  -port "8010" \
  -username \
  -password "*****" \
  -host "dev-ml1" \
  -input_file_path "/space/software/github/sport1/sport1_data/mlcp-test" \
  -input_file_type "documents" \
  -output_collections "Opta,LoadOptaData,input" \
  -output_permissions "rest-reader,read,rest-writer,update" \
  -output_uri_replace "/space/software/github/sport1/sport1_data/mlcp-test,''" \
  -document_type "xml" \
  -transform_module "/com.marklogic.hub/mlcp-flow-transform.xqy" \
  -transform_namespace "http://marklogic.com/data-hub/mlcp-flow-transform" \
  -transform_param "entity-name=Opta,flow-name=Load%20Opta%20Data"
When I run the flow it then generates this error:
17:28:12.414 [main] INFO c.m.contentpump.LocalJobRunner - Content type: XML
17:28:12.750 [main] INFO c.marklogic.contentpump.ContentPump - Job name: local_2138110744_1
17:28:12.766 [main] INFO c.m.c.FileAndDirectoryInputFormat - Total input paths to process : 1
17:28:13.797 [pool-1-thread-1] WARN c.m.contentpump.TransformWriter - Failed document /39eae888-9349-4604-8678-d8df4b0696e6.xml
17:28:13.797 [pool-1-thread-1] WARN c.m.contentpump.TransformWriter - <error:format-string xmlns:error="http://marklogic.com/xdmp/error" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">XDMP-AS: (err:XPTY0004) $content as item()? – Invalid coercion: (, ,
<SoccerDocument .../>...) as item()?</error:format-string>
17:28:13.801 [Thread-4] INFO c.m.contentpump.LocalJobRunner - completed 100%
17:28:13.807 [main] INFO c.m.contentpump.LocalJobRunner - com.marklogic.mapreduce.MarkLogicCounter:
17:28:13.807 [main] INFO c.m.contentpump.LocalJobRunner - INPUT_RECORDS: 1
17:28:13.807 [main] INFO c.m.contentpump.LocalJobRunner - OUTPUT_RECORDS: 1
17:28:13.807 [main] INFO c.m.contentpump.LocalJobRunner - OUTPUT_RECORDS_COMMITTED: 0
17:28:13.807 [main] INFO c.m.contentpump.LocalJobRunner - OUTPUT_RECORDS_FAILED: 1
17:28:13.807 [main] INFO c.m.contentpump.LocalJobRunner - Total execution time: 1 sec