Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

馃悰 Issue with Bulk Update Submission - Parse XML with namespace #4283

Closed
yvalentin opened this issue Feb 17, 2023 · 5 comments
Closed

馃悰 Issue with Bulk Update Submission - Parse XML with namespace #4283

yvalentin opened this issue Feb 17, 2023 · 5 comments
Assignees
Labels
Back end bug Things broken and not working as expected

Comments

@yvalentin
Copy link

Description

On the bulk_update_submissions function from kpi/deployment_backends/kobocat_backend.py if the xml from kobocat have a namespace this part of code failed with an 500 error.

The following part of code throw the error:

# kpi/kpi/deployment_backends/kobocat_backend.py
deprecated_id_or_new = (
                deprecated_id
                if deprecated_id is not None
                else etree.SubElement(xml_parsed.find('meta'), 'deprecatedID')
            )

etree can not find anything if the xml have a namespace.

On our test/dev env, we use a tool (https://docs.getodk.org/briefcase-intro/) to import/initialize some data. So this tool, add a namespace (xmlns="http://opendatakit.org/submissions") on our data:

<a3avQEp6mZv3vMpQovXJnn id="a3avQEp6mZv3vMpQovXJnn" instanceID="uuid:e7024c32f6144c35807b0139cf78ed46" submissionDate="2022-08-27T15:45:22.560218+00:00" xmlns="http://opendatakit.org/submissions">
    <formhub>
        <uuid>uuid:001e77b5-0d53-46ce-9d6d-dc76323b1a68</uuid>
    </formhub>
    <today />
    <start>2022-08-07T14:11:56.705+03:00</start>
    ...
</a3avQEp6mZv3vMpQovXJnn>

Expected behavior

The Bulk Update Submission should worked with or without namespace on submission xml

Actual behavior

Curl Request:

curl --location --request PATCH 'http://kf.kobo.local/api/v2/assets/aC2mBVJSmiYFcKfW8as9be/data/bulk.json' \
--header 'Authorization: Token 20413acd58b13736a72c3f0f5c8019d876fdbf98' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'payload={"submission_ids":["4"],"data":{"BENEFICIARY_INFO/Age":"23"}}'

With a submission who have a namespace, we got an 500 status code error.
On mode dev/debugger activate, the error detail is:

TypeError: Argument '_parent' has incorrect type (expected lxml.etree._Element, got NoneType)

Stack trace (interesting part):

File "/srv/src/kpi/kpi/views/v2/data.py", line 372, in bulk
                        'uuid': submission['_uuid'],
                    }
                ))
 
        # Send request to KC
        json_response = action_(bulk_actions_validator.data, request.user)
 
        # If requests has succeeded, let's log deletions (if any)
        if json_response['status'] == status.HTTP_200_OK and audit_logs:
            AuditLog.objects.bulk_create(audit_logs)
 
File "/srv/src/kpi/kpi/deployment_backends/kobocat_backend.py", line 207, in bulk_update_submissions
 
 
            deprecated_id_or_new = (
                deprecated_id
                if deprecated_id is not None
                else etree.SubElement(xml_parsed.find('meta'), 'deprecatedID')
            )
            deprecated_id_or_new.text = instance_id.text
            instance_id.text = uuid_formatted
            etree.dump(xml_parsed)
            # If the form has been updated with new fields and earlier

TypeError: Argument '_parent' has incorrect type (expected lxml.etree._Element, got NoneType)

Additional details

I can provide a merge request with an updated version of this part of code to remove the namespace before to search something on the xml_parsed

for elem in xml_parsed.getiterator():
    # Skip comments and processing instructions,
    # because they do not have names
    if not (
        isinstance(elem, etree._Comment)
        or isinstance(elem, etree._ProcessingInstruction)
    ):
        # Remove a namespace URI in the element's name
        elem.tag = etree.QName(elem).localname

# Remove unused namespace declarations
etree.cleanup_namespaces(xml_parsed)
@jnm
Copy link
Member

jnm commented Mar 3, 2023

Hi @yvalentin, thanks so much for the detailed bug report! We would be happy to accept a pull request, but I think it should have the following鈥攚hich may be too much to ask of a casual contribution 馃槆 Let me know what you think.

  1. Take the opportunity to switch this particular piece of XML processing to defusedxml in order to mitigate certain XML vulnerabilities;
    • That'd mean promoting defusedxml to a top-level requirement of our application. It does already appear in dependencies/pip/requirements.txt, but that's because it's required by packages that we depend upon. It'd need to go into dependencies/pip/requirements.in.
  2. Instead of stripping away all of the namespaces, use the {*} syntax to match any namespace. https://docs.python.org/3/library/xml.etree.elementtree.html#supported-xpath-syntax has this helpful example:

    {*}spam selects tags named spam in any (or no) namespace

  3. Avoid the SubElement() call altogether as it seems that it could be replaced with a single XPath like {*}meta/{*}deprecatedID, although @joshuaberetta could perhaps correct me if I'm wrong;
  4. Include a unit test for bulk editing namespaced XML submissions.

Reproducing the problem

>>> fun_xml = '''<a3avQEp6mZv3vMpQovXJnn id="a3avQEp6mZv3vMpQovXJnn" instanceID="uuid:e7024c32f6144c35807b0139cf78ed46" submissionDate="2022-08-27T15:45:22.560218+00:00" xmlns="http://opendatakit.org/submissions">
...     <formhub>
...         <uuid>uuid:001e77b5-0d53-46ce-9d6d-dc76323b1a68</uuid>
...     </formhub>
...     <today />
...     <start>2022-08-07T14:11:56.705+03:00</start>
...     <meta>
...         <instanceID>uuid:02c65031-53e1-4705-bf28-05afb05e4543</instanceID>
...         <deprecatedID>uuid:a351a3ca-d663-4b17-a430-e5abd22700dd</deprecatedID>
...     </meta>
... </a3avQEp6mZv3vMpQovXJnn>'''
>>> from xml.etree import ElementTree as lxml_ET
>>> lxml_tree = lxml_ET.fromstring(fun_xml)
>>> lxml_ET.SubElement(lxml_tree.find('meta'), 'deprecatedID')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: SubElement() argument 1 must be xml.etree.ElementTree.Element, not None

Simplest workaround

>>> lxml_ET.SubElement(lxml_tree.find('{*}meta'), 'deprecatedID')
<Element 'deprecatedID' at 0x7fcb326c3dd0>

Alternative with single XPath

>>> lxml_tree.find('{*}meta/{*}deprecatedID')
<Element '{http://opendatakit.org/submissions}deprecatedID' at 0x7fcb324f67a0>

Preferred, with security improvement and single XPath

>>> from defusedxml import ElementTree as defused_ET
>>> defused_tree = defused_ET.fromstring(fun_xml)
>>> defused_tree.find('{*}meta/{*}deprecatedID')
<Element '{http://opendatakit.org/submissions}deprecatedID' at 0x7fcb32524130>
>>> defused_tree.find('{*}meta/{*}deprecatedID').text
'uuid:a351a3ca-d663-4b17-a430-e5abd22700dd'

(note that defused_ET and the like are for clarity in the example and not names that I would use in the application code)

@JacquelineMorrissette, fyi, since this mentions defusedxml, which is related to the static analysis security work you've been doing.

@jnm jnm self-assigned this Mar 3, 2023
@jnm jnm added bug Things broken and not working as expected Back end labels Mar 3, 2023
@joshuaberetta
Copy link
Member

Note that switching to defusedxml to parse the XML string also resolves the issue when the submission XML begins with <?xml version=\'1.0\' encoding=\'UTF-8\' ?>.

@JacquelineMorrissette
Copy link
Contributor

Work is beginning on the switch to defusedxml

@yvalentin
Copy link
Author

yvalentin commented Mar 15, 2023

Ok ! Thanks for your feedback.

We have plan to work on this and come back by the end of the next week
Please keep providing us with ongoing process

@jnm
Copy link
Member

jnm commented Aug 15, 2023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Back end bug Things broken and not working as expected
Projects
None yet
Development

No branches or pull requests

4 participants