Performance issue when parsing large base64Binary data #50

Closed
J20S opened this Issue Feb 10, 2016 · 4 comments


@J20S
J20S commented Feb 10, 2016

Hello,
I have experienced performance issues when trying to upload large files.
For example, say we have the following schema:
<xsd:complexType name="uploadFileRequest">
  <xsd:sequence>
    <xsd:element name="file" type="xsd:base64Binary" minOccurs="1" maxOccurs="1"/>
  </xsd:sequence>
</xsd:complexType>

When the file is large, say 7 MB, I notice a significant performance issue. I have traced the problem to the CreateFromDocument() function generated by PyXB, which is used to "Parse the given XML and use the document element to create a Python instance".

More specifically, it is the following line in that method which takes the majority of the time to execute:
saxer.parse(io.BytesIO(xmld))
where xmld is the XML string passed into the function.
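
A rough way to reproduce and time the slowdown (the generated module name, the root element name, and the absence of a namespace are placeholders for whatever pyxbgen produced from the real schema):

import base64
import cProfile

import uploadfile  # hypothetical pyxbgen-generated binding module

# Build a ~7 MB base64 payload comparable to the uploads described above.
payload = base64.b64encode(b'\x00' * (7 * 1024 * 1024)).decode('ascii')
xmld = ('<uploadFileRequest><file>%s</file></uploadFileRequest>' % payload).encode('utf-8')

# For a document this size, nearly all of the time is spent inside CreateFromDocument.
cProfile.run('uploadfile.CreateFromDocument(xmld)')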

I originally posted this issue on SourceForge;
thanks @pabigot for pointing out that it is the regex match that costs the majority of the time.

# This is what it costs to try to be a validating processor.
if cls.__Lexical_re.match(xmlt) is None:
    raise SimpleTypeValueError(cls, xmlt)

If we comment this code block out, the issue goes away. However, since this code exists because, as Peter put it, "as PyXB is a validating processor it must check whether the incoming encoded data is a valid XML representation", it would be good to have a proper workaround in a future release.

Thanks a lot for your help! @pabigot

Cheers,
James

@pabigot
Owner
pabigot commented Feb 11, 2016

The check exists because:

# base64 is too lenient: it accepts 'ZZZ=' as an encoding of 'e', while
# the required XML Schema production requires 'ZQ=='.  Define a regular
# expression per section 3.2.16.
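
As a quick sketch of that leniency (assuming CPython's default, non-strict base64 decoder):

import base64

# Python's decoder accepts this literal without complaint...
raw = base64.b64decode('ZZZ=')

# ...but re-encoding the result does not give 'ZZZ=' back, i.e. the literal was
# not in the canonical form the XML Schema lexical space requires, which is why
# PyXB applies its own regular expression on top of the base64 module.
assert base64.b64encode(raw) != b'ZZZ='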

The proposed workaround is to add an API that lets the user specify a maximum size for base64 literals that will be validated against the XML lexical requirement (which disallows the lenient forms Python's base64 module accepts). Setting this to zero would disable the extra check; setting it to (say) 64 would keep the check for small values while avoiding it for file uploads.

The default would be None, meaning the validation would always be performed; applications that use large files would have to disable the check intentionally.
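
A minimal sketch of the guard this implies (names and the call site are illustrative, not PyXB internals):

# Illustrative module-level setting: None = always validate, 0 = never validate,
# N = validate only literals no longer than N characters.
_VALIDATED_MAX_LENGTH = None

def _check_base64_lexical(xmlt, lexical_re):
    limit = _VALIDATED_MAX_LENGTH
    if limit is not None and (limit == 0 or len(xmlt) > limit):
        return  # skip the expensive regex for disabled or oversized literals
    if lexical_re.match(xmlt) is None:
        raise ValueError('not a valid XML Schema base64Binary literal')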

This should be in the next release, whenever that happens.

@pabigot pabigot added this to the PyXB 1.2.5 milestone Feb 11, 2016
@pabigot pabigot added a commit that closed this issue Sep 18, 2016
@pabigot fix #50: performance issue for large base64Binary data
Allow the application to set an upper limit on the length of an XML
literal that will be checked for violations of the XML base64 lexical
space requirements that are not detected by Python's base64 module.
0092d65
@pabigot pabigot closed this in 0092d65 Sep 18, 2016
@J20S
J20S commented Sep 21, 2016

Hi Peter,

Thanks for providing this feature in the new release!

I expect the intended usage of this feature is that we manually add something like:
pyxb.binding.datatypes.base64Binary.XsdValidateLength(-1)
to the binding file generated by the pyxbgen command in order to disable the validation.

I would really like to automate that step. Other than writing extra scripts for it, is there any chance we could configure this through a command-line option?

I understand this might be a separate feature request, but if there is an existing workaround, that would be awesome!

Cheers,
James

@pabigot
Owner
pabigot commented Sep 21, 2016

The setting doesn't go into the binding file; it's a configuration change that affects validation globally, so just disable the validation once in the application that uses the bindings. If you have base64Binary values that you still want fully validated, you'll need to set and clear the limit in the application depending on whether the specific document is likely to be affected. There is no way to limit the validation to specific elements or namespaces.
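
For example, a minimal application-side sketch built around the XsdValidateLength call quoted above (the -1 and 64 argument values follow the discussion in this thread; the generated module name is a placeholder):

import pyxb.binding.datatypes

import uploadfile  # hypothetical pyxbgen-generated binding module

def parse_upload(xml_bytes, expect_large_payload=True):
    # Per the discussion above, -1 disables the base64Binary lexical check,
    # while a small threshold such as 64 keeps it for short literals only.
    limit = -1 if expect_large_payload else 64
    pyxb.binding.datatypes.base64Binary.XsdValidateLength(limit)
    try:
        return uploadfile.CreateFromDocument(xml_bytes)
    finally:
        # Restore checking of small literals so later documents are still validated.
        pyxb.binding.datatypes.base64Binary.XsdValidateLength(64)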

@J20S
J20S commented Sep 21, 2016

Thanks Peter, I got it now!
