NumberFormatException when extracting text from docx file #148

Closed
akostajti opened this issue Jun 24, 2015 · 6 comments

Comments

@akostajti

I'm extracting text from a docx file using TextUtils.extractText(Object o, Writer w). For a certain document (generated with an older version of Google Docs) I get this exception:

2015-06-21 05:55:14,999 ERROR openpackaging.parts.JaxbXmlPartXPathAware - For input string: "9360.0" [DefaultQuartzScheduler_Worker-10] {}
java.lang.NumberFormatException: For input string: "9360.0"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:492)
	at java.math.BigInteger.<init>(BigInteger.java:338)
	at java.math.BigInteger.<init>(BigInteger.java:476)
	at com.sun.xml.internal.bind.DatatypeConverterImpl._parseInteger(DatatypeConverterImpl.java:72)
	at com.sun.xml.internal.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$21.parse(RuntimeBuiltinLeafInfoImpl.java:766)
	at com.sun.xml.internal.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$21.parse(RuntimeBuiltinLeafInfoImpl.java:764)
	at com.sun.xml.internal.bind.v2.runtime.reflect.TransducedAccessor$CompositeTransducedAccessorImpl.parse(TransducedAccessor.java:230)
	at com.sun.xml.internal.bind.v2.runtime.unmarshaller.StructureLoader.startElement(StructureLoader.java:194)
	at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext._startElement(UnmarshallingContext.java:486)
	at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext.startElement(UnmarshallingContext.java:465)
	at com.sun.xml.internal.bind.v2.runtime.unmarshaller.InterningXmlVisitor.startElement(InterningXmlVisitor.java:60)
	at com.sun.xml.internal.bind.v2.runtime.unmarshaller.SAXConnector.startElement(SAXConnector.java:135)
	at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:229)
	at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:266)
	at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:235)
	at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:266)
	at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:235)
	at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:266)
	at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:235)
	at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:266)
	at com.sun.xml.internal.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:235)
	at com.sun.xml.internal.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:112)
	at com.sun.xml.internal.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:95)
	at com.sun.xml.internal.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:88)
	at com.sun.xml.internal.bind.v2.runtime.BinderImpl.associativeUnmarshal(BinderImpl.java:146)
	at com.sun.xml.internal.bind.v2.runtime.BinderImpl.unmarshal(BinderImpl.java:117)
	at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unwrapUsually(JaxbXmlPartXPathAware.java:283)
	at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unmarshal(JaxbXmlPartXPathAware.java:333)
	at org.docx4j.openpackaging.parts.JaxbXmlPart.getContents(JaxbXmlPart.java:147)

Is there a way to prevent this exception?

@plutext
Owner

plutext commented Jun 24, 2015 via email

@akostajti
Author

Sorry, I forgot it. You can download the file here: https://drive.google.com/file/d/0B6qA3QZEFwTKaXdlNE9PRGJhRVU/view?usp=sharing.

@lukateras

lukateras commented Dec 16, 2016

@plutext Any updates? I had a very similar issue with the latest version of docx4j:

Unhandled java.lang.NumberFormatException
   For input string: "9576.0"

NumberFormatException.java:   65  java.lang.NumberFormatException/forInputString
              Integer.java:  580  java.lang.Integer/parseInt
           BigInteger.java:  470  java.math.BigInteger/<init>
           BigInteger.java:  606  java.math.BigInteger/<init>
DatatypeConverterImpl.java:   76  com.sun.xml.internal.bind.DatatypeConverterImpl/_parseInteger
RuntimeBuiltinLeafInfoImpl.java:  779  com.sun.xml.internal.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$22/parse
RuntimeBuiltinLeafInfoImpl.java:  777  com.sun.xml.internal.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$22/parse
   TransducedAccessor.java:  230  com.sun.xml.internal.bind.v2.runtime.reflect.TransducedAccessor$CompositeTransducedAccessorImpl/parse
      StructureLoader.java:  195  com.sun.xml.internal.bind.v2.runtime.unmarshaller.StructureLoader/startElement
 UnmarshallingContext.java:  559  com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext/_startElement
 UnmarshallingContext.java:  538  com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext/startElement
         SAXConnector.java:  153  com.sun.xml.internal.bind.v2.runtime.unmarshaller.SAXConnector/startElement
    AbstractSAXParser.java:  509  com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser/startElement
AbstractXMLDocumentParser.java:  182  com.sun.org.apache.xerces.internal.parsers.AbstractXMLDocumentParser/emptyElement
XMLNSDocumentScannerImpl.java:  351  com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl/scanStartElement
XMLDocumentFragmentScannerImpl.java: 2784  com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver/next
XMLDocumentScannerImpl.java:  602  com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl/next
XMLNSDocumentScannerImpl.java:  112  com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl/next
XMLDocumentFragmentScannerImpl.java:  505  com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl/scanDocument
   XML11Configuration.java:  841  com.sun.org.apache.xerces.internal.parsers.XML11Configuration/parse
   XML11Configuration.java:  770  com.sun.org.apache.xerces.internal.parsers.XML11Configuration/parse
            XMLParser.java:  141  com.sun.org.apache.xerces.internal.parsers.XMLParser/parse
    AbstractSAXParser.java: 1213  com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser/parse
        SAXParserImpl.java:  643  com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser/parse
     UnmarshallerImpl.java:  243  com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl/unmarshal0
     UnmarshallerImpl.java:  214  com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl/unmarshal
AbstractUnmarshallerImpl.java:  157  javax.xml.bind.helpers.AbstractUnmarshallerImpl/unmarshal
AbstractUnmarshallerImpl.java:  125  javax.xml.bind.helpers.AbstractUnmarshallerImpl/unmarshal
             XmlUtils.java:  540  org.docx4j.XmlUtils/unmarshalString
             XmlUtils.java:  589  org.docx4j.XmlUtils/unmarshallFromTemplate
          JaxbXmlPart.java:  266  org.docx4j.openpackaging.parts.JaxbXmlPart/variableReplace
NativeMethodAccessorImpl.java:   -2  sun.reflect.NativeMethodAccessorImpl/invoke0
NativeMethodAccessorImpl.java:   62  sun.reflect.NativeMethodAccessorImpl/invoke
DelegatingMethodAccessorImpl.java:   43  sun.reflect.DelegatingMethodAccessorImpl/invoke
               Method.java:  498  java.lang.reflect.Method/invoke
            Reflector.java:   93  clojure.lang.Reflector/invokeMatchingMethod
            Reflector.java:   28  clojure.lang.Reflector/invokeInstanceMethod
            ...

@plutext
Owner

plutext commented Dec 19, 2016

Please post your docx at http://ndoc.it

Which version of docx4j?

Generally such issues are handled by the code at https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/jaxb/mc-preprocessor.xslt#L89
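
As a rough illustration of what handling this in a pre-processing step can look like, here is a minimal sketch in plain Java DOM. It is not a copy of the linked stylesheet, and it does not use any docx4j API: the class name `SpaceAttributeNormalizer`, its methods, and the choice to round half-up are all assumptions made for the example. It rewrites decimal `w:space` values to whole numbers before any integer-typed binding would see them.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import java.math.RoundingMode;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of docx4j.
final class SpaceAttributeNormalizer {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Parse an XML string into a namespace-aware DOM document.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Walk the tree and round any decimal w:space value to a whole number.
    // Rounding half-up is an assumption; truncation would be equally defensible.
    static void normalize(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Attr space = ((Element) node).getAttributeNodeNS(W_NS, "space");
            if (space != null && space.getValue().matches("\\d+\\.\\d+")) {
                BigDecimal rounded = new BigDecimal(space.getValue())
                        .setScale(0, RoundingMode.HALF_UP);
                space.setValue(rounded.toPlainString());
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            normalize(children.item(i));
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:pBdr xmlns:w=\"" + W_NS + "\">"
                + "<w:top w:sz=\"7\" w:space=\"9360.0\" w:val=\"single\"/>"
                + "</w:pBdr>";
        Document doc = parse(xml);
        normalize(doc.getDocumentElement());
        Element top = (Element) doc.getDocumentElement().getFirstChild();
        System.out.println(top.getAttributeNS(W_NS, "space"));
    }
}
```

The sketch only touches `w:space`; other attributes that turn out to carry decimals would need to be added explicitly, which is one reason a stylesheet with per-attribute rules can be easier to maintain.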

@tingley

tingley commented Jul 27, 2019

Another example attached.

border.docx.zip

In this case, it's triggered by the decimal value of 1.8 in w:space:

        <w:pBdr>
          <w:top w:sz="7" w:space="1.8" w:color="#333437" w:val="single"/>
          <w:left w:sz="7" w:space="0" w:color="#000000" w:val="single"/>
          <w:bottom w:sz="3" w:space="7.2" w:color="#323539" w:val="double"/>
          <w:right w:sz="7" w:space="0" w:color="#000000" w:val="single"/>
        </w:pBdr>

According to the schema, w:space should be of type ST_PointMeasure, and docx4j parses it as a BigInteger, so this document may actually be invalid against the schema. However, other tools open it fine (LibreOffice Writer silently corrects the value; I haven't tested in Word). I do not know what tool generated this document.

Stack trace follows.

Caused by: java.lang.NumberFormatException: For input string: "1.8"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_212]
	at java.lang.Integer.parseInt(Integer.java:580) ~[?:1.8.0_212]
	at java.math.BigInteger.<init>(BigInteger.java:470) ~[?:1.8.0_212]
	at java.math.BigInteger.<init>(BigInteger.java:606) ~[?:1.8.0_212]
	at com.sun.xml.bind.DatatypeConverterImpl._parseInteger(DatatypeConverterImpl.java:91) ~[jaxb-runtime-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$22.parse(RuntimeBuiltinLeafInfoImpl.java:800) ~[jaxb-runtime-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$22.parse(RuntimeBuiltinLeafInfoImpl.java:798) ~[jaxb-runtime-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.v2.runtime.reflect.TransducedAccessor$CompositeTransducedAccessorImpl.parse(TransducedAccessor.java:245) ~[jaxb-runtime-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.v2.runtime.unmarshaller.StructureLoader.startElement(StructureLoader.java:212) ~[jaxb-runtime-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext._startElement(UnmarshallingContext.java:577) ~[jaxb-runtime-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext.startElement(UnmarshallingContext.java:556) ~[jaxb-runtime-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.v2.runtime.unmarshaller.InterningXmlVisitor.startElement(InterningXmlVisitor.java:75) ~[jaxb-runtime-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.v2.runtime.unmarshaller.SAXConnector.startElement(SAXConnector.java:168) ~[jaxb-runtime-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:244) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:281) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:250) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:281) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:250) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:281) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:250) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:281) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:250) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:281) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:250) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:127) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:110) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:103) ~[jaxb-core-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.v2.runtime.BinderImpl.associativeUnmarshal(BinderImpl.java:161) ~[jaxb-runtime-2.3.0.jar:2.3.0]
	at com.sun.xml.bind.v2.runtime.BinderImpl.unmarshal(BinderImpl.java:132) ~[jaxb-runtime-2.3.0.jar:2.3.0]
	at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unmarshal(JaxbXmlPartXPathAware.java:574) ~[docx4j-6.0.1.jar:?]
	at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unmarshal(JaxbXmlPartXPathAware.java:355) ~[docx4j-6.0.1.jar:?]
	at org.docx4j.openpackaging.parts.JaxbXmlPart.getContents(JaxbXmlPart.java:194) ~[docx4j-6.0.1.jar:?]
	... 27 more
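
The failure in this trace can be reproduced without docx4j, since the innermost frames are just the `java.math.BigInteger` string constructor rejecting a value with a decimal point. The sketch below shows that, next to a lenient alternative that goes through `BigDecimal` first. `LenientIntegerParsing` and `parseLenient` are hypothetical names invented for this example, and rounding half-up is an assumption; this is not how docx4j or any other tool is known to correct the value.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

// Hypothetical helper for illustration only; not part of docx4j or JAXB.
final class LenientIntegerParsing {

    // Accept "7", "1.8" or "9360.0" and return the nearest whole number.
    static BigInteger parseLenient(String raw) {
        return new BigDecimal(raw.trim())
                .setScale(0, RoundingMode.HALF_UP)
                .toBigInteger();
    }

    public static void main(String[] args) {
        try {
            // Same constructor that appears near the top of the stack trace.
            new BigInteger("1.8");
        } catch (NumberFormatException e) {
            System.out.println("strict parse failed: " + e.getMessage());
        }
        System.out.println(parseLenient("1.8"));    // prints 2
        System.out.println(parseLenient("9360.0")); // prints 9360
        System.out.println(parseLenient("7"));      // prints 7
    }
}
```

Whether to round or truncate is a judgment call: swapping `RoundingMode.HALF_UP` for `RoundingMode.DOWN` would turn 1.8 into 1 instead of 2.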

@plutext
Owner

plutext commented Jul 29, 2019

Should be fixed by bc652c5

Will be in a new release this week.

Anybody else who encounters a similar issue but on some other attribute, please open your own issue, clearly showing what XML structure is at issue.

plutext closed this as completed Jul 29, 2019