apoc.load.xml on large file generates OoM Errors #2723

Closed
Sandalorian opened this issue Apr 8, 2022 · 1 comment · Fixed by #2841

Feature description

apoc.load.xml should be able to stream when loading large XML files.

Currently, a relatively small XML file of 1 GB can cause OoM heap errors, even when used with a simple Cypher statement. For example:

    call apoc.load.xml("file:///var/lib/neo4j/import/small.xml")
    yield value as releases
    unwind releases._children as release
    with release.id as releaseID,
         [item in release._children where item._type = "title"][0] as title
    MERGE (r:Release {id: releaseID})
    SET r.title = title._text;
Discussed with @jexp over Slack. This is in contrast to apoc.import.xml(file,config), which does stream.

jexp (Member) commented Apr 8, 2022

Our comments from Slack:

You could try to use an XPath instead; then it should stream.
About the iterate one: you should do the unwind releases._children in the source statement. Otherwise you pass the complete XML JSON object to the next query if this XML is a single XML structure…

Hmm, odd. I was sure that load.xml streams, but it doesn't seem so.
Sorry for that.
Only apoc.import.xml(file,config) seems to stream.

    private Stream<MapResult> parse(InputStream data, boolean simpleMode, String path, boolean failOnError) throws Exception {
        List<MapResult> result = new ArrayList<>();
        try {
            DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
            documentBuilderFactory.setNamespaceAware(true);
            documentBuilderFactory.setIgnoringElementContentWhitespace(true);
            documentBuilderFactory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
            documentBuilder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = documentBuilder.parse(data);
            XPathFactory xPathFactory = XPathFactory.newInstance();
            XPath xPath = xPathFactory.newXPath();
            path = StringUtils.isEmpty(path) ? "/" : path;
            XPathExpression xPathExpression = xPath.compile(path);
            NodeList nodeList = (NodeList) xPathExpression.evaluate(doc, XPathConstants.NODESET);
            for (int i = 0; i < nodeList.getLength(); i++) {
                final Deque<Map<String, Object>> stack = new LinkedList<>();
                handleNode(stack, nodeList.item(i), simpleMode);
                for (int index = 0; index < stack.size(); index++) {
                    result.add(new MapResult(stack.pollFirst()));
                }
            }
        }
        catch (FileNotFoundException e){
            if(!failOnError)
                return Stream.of(new MapResult(Collections.emptyMap()));
            else
                throw e;
        }
        catch (Exception e){
            if(!failOnError)
                return Stream.of(new MapResult(Collections.emptyMap()));
            else
                throw e;
        }
        return result.stream();
    }

We should change that, especially when an XPath is given, so that it operates on a stream, similar to import.xml.
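
For illustration only, here is a rough sketch of what "operating on a stream" could look like using the JDK's StAX pull parser instead of building a full DOM. This is not the APOC implementation and not the change made in the fix; the class name StreamingXmlSketch, the method forEachRootChild, and the flattened map layout (attributes plus the text of direct child elements keyed by element name) are assumptions made up for this example. It ignores XPath entirely and only handles the "one row per child of the root element" case from the Cypher query above.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not APOC code: emits one small map per direct child
// of the root element, so memory use is bounded by one child at a time.
final class StreamingXmlSketch {

    /** Calls sink once per direct child of the root; returns how many were emitted. */
    static int forEachRootChild(InputStream data, Consumer<Map<String, Object>> sink) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Similar hardening intent as the quoted DOM code: no DTDs, no external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        int emitted = 0;
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(data);
            try {
                int depth = 0;                       // 1 = root, 2 = child, 3 = grandchild
                Map<String, Object> current = null;  // map for the child being read
                StringBuilder text = null;           // text of the current grandchild
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        depth++;
                        if (depth == 2) {
                            current = new LinkedHashMap<>();
                            current.put("_type", reader.getLocalName());
                            for (int i = 0; i < reader.getAttributeCount(); i++) {
                                current.put(reader.getAttributeLocalName(i),
                                            reader.getAttributeValue(i));
                            }
                        } else if (depth == 3) {
                            text = new StringBuilder();
                        }
                    } else if ((event == XMLStreamConstants.CHARACTERS
                            || event == XMLStreamConstants.CDATA) && depth == 3) {
                        text.append(reader.getText());
                    } else if (event == XMLStreamConstants.END_ELEMENT) {
                        if (depth == 3) {
                            // Simplification: store grandchild text under its element name.
                            current.put(reader.getLocalName(), text.toString().trim());
                            text = null;
                        } else if (depth == 2) {
                            sink.accept(current);    // hand over and forget this child
                            emitted++;
                            current = null;
                        }
                        depth--;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
        return emitted;
    }
}
```

With the releases file from the issue, each release element would be passed to the sink as soon as its closing tag is read, and nothing from earlier releases is retained. Supporting arbitrary XPath expressions on top of a pull parser is a harder problem that this sketch does not attempt.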

vga91 added a commit to vga91/neo4j-apoc-procedures that referenced this issue May 3, 2022
vga91 added a commit to vga91/neo4j-apoc-procedures that referenced this issue May 11, 2022