
Non-ASCII characters get scrambled when retrieving objects from the REST interface #204

@RKrahl

String attributes that contain non-ASCII characters get scrambled when an object is retrieved through ICAT's REST interface.

Example, starting with an empty ICAT test server:

>>> # Prepare some strings having non-ascii characters:
>>> fac_fullname = "Beispielzentrum f\xfcr Materialien und Energie"
>>> fac_fullname
'Beispielzentrum für Materialien und Energie'
>>> inv_title = ("\u201CTest\u201D \u2013 an investigation "
...              "having non-ascii chars in the title \u2026")
>>> inv_title
'“Test” – an investigation having non-ascii chars in the title …'

>>> # Create some objects in ICAT:
>>> facility = client.new("facility", name="BZME", fullName=fac_fullname)
>>> facility.create()
>>> inv_type = client.new("investigationType", facility=facility, name="Standard")
>>> inv_type.create()
>>> investigation = client.new("investigation", facility=facility, type=inv_type,
...                            name="test", visitId="none", title=inv_title)
>>> investigation.create()

>>> # Retrieve these objects back from ICAT, verify that the non-ascii
... # characters are still intact (note that this uses the SOAP interface):
>>> query = ("SELECT i FROM Investigation i "
...          "WHERE i.name = 'test' "
...          "INCLUDE i.facility")
>>> client.assertedSearch(query)[0]
(investigation){
   createId = "simple/root"
   createTime = 2018-07-23 13:51:08+00:00
   id = 2
   modId = "simple/root"
   modTime = 2018-07-23 13:51:08+00:00
   facility = 
      (facility){
         createId = "simple/root"
         createTime = 2018-07-23 13:51:08+00:00
         id = 3
         modId = "simple/root"
         modTime = 2018-07-23 13:51:08+00:00
         fullName = "Beispielzentrum für Materialien und Energie"
         name = "BZME"
      }
   name = "test"
   title = "“Test” – an investigation having non-ascii chars in the title …"
   visitId = "none"
 }

So far, so good: everything works. But this was python-icat, which uses the SOAP interface. At least it proves that the object attributes are intact in the database. Now retrieve the same objects with the same query through the ICAT REST interface:

>>> import json
>>> from urllib.parse import urlencode
>>> import urllib.request
>>> import ssl

>>> # This is only needed for a test server having no valid certificate
>>> ssl_context = ssl.create_default_context()
>>> if not conf.checkCert:
...     ssl_context.check_hostname = False
...     ssl_context.verify_mode = ssl.CERT_NONE

>>> url = 'https://icat.example.org/icat'
>>> parameters = { "sessionId": client.sessionId, "query": query, }
>>> req = urllib.request.Request(url + "/entityManager?" + urlencode(parameters))
>>> with urllib.request.urlopen(req, context=ssl_context) as f:
...     result = json.loads(f.read().decode('utf-8'))

>>> result
[{'Investigation': {'id': 2, 'createId': 'simple/root', 'createTime': '2018-07-23T13:51:08.000Z', 'modId': 'simple/root', 'modTime': '2018-07-23T13:51:08.000Z', 'datasets': [], 'facility': {'id': 3, 'createId': 'simple/root', 'createTime': '2018-07-23T13:51:08.000Z', 'modId': 'simple/root', 'modTime': '2018-07-23T13:51:08.000Z', 'applications': [], 'datafileFormats': [], 'datasetTypes': [], 'facilityCycles': [], 'fullName': 'Beispielzentrum f��r Materialien und Energie', 'instruments': [], 'investigationTypes': [], 'investigations': [], 'name': 'BZME', 'parameterTypes': [], 'sampleTypes': []}, 'investigationGroups': [], 'investigationInstruments': [], 'investigationUsers': [], 'keywords': [], 'name': 'test', 'parameters': [], 'publications': [], 'samples': [], 'shifts': [], 'studyInvestigations': [], 'title': '���Test��� ��� an investigation having non-ascii chars in the title ���', 'visitId': 'none'}}]

>>> # We see, there is a problem with the non-ascii characters:
>>> result[0]['Investigation']['facility']['fullName']
'Beispielzentrum f��r Materialien und Energie'

>>> # The U+00FC "LATIN SMALL LETTER U WITH DIAERESIS" is replaced by two
... # Unicode U+FFFD: "REPLACEMENT CHARACTER":
>>> result[0]['Investigation']['facility']['fullName'][17] == '\ufffd'
True
>>> result[0]['Investigation']['facility']['fullName'][18] == '\ufffd'
True

>>> # Same in the investigation title:
>>> result[0]['Investigation']['title']
'���Test��� ��� an investigation having non-ascii chars in the title ���'
>>> result[0]['Investigation']['title'][0] == '\ufffd'
True
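
The session above decodes the response body with a strict `decode('utf-8')`, which would raise on invalid bytes rather than insert U+FFFD itself, so the replacement characters must already be part of what the server sends. A minimal sketch of how this could be checked on the raw bytes follows. The helper name `count_replacement_chars` and the sample body are hypothetical and not part of python-icat or icat.server; in a real check, the bytes from `f.read()` would be passed in before decoding.

```python
# UTF-8 encoding of U+FFFD "REPLACEMENT CHARACTER": b'\xef\xbf\xbd'
REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")


def count_replacement_chars(raw: bytes) -> int:
    """Count U+FFFD characters already present in a raw JSON body.

    Both spellings are counted: the literal UTF-8 bytes and the
    JSON escape sequence \\ufffd (matched case-insensitively).
    """
    literal = raw.count(REPLACEMENT_UTF8)
    escaped = raw.lower().count(b"\\ufffd")
    return literal + escaped


# Hypothetical stand-in for the bytes returned by f.read()
sample_body = '{"fullName": "Beispielzentrum f\ufffd\ufffdr"}'.encode("utf-8")
print(count_replacement_chars(sample_body))
```

A non-zero count on the undecoded body would confirm that the damage happens before the response leaves the server.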

In the result, each non-ASCII character is replaced by as many U+FFFD replacement characters as its UTF-8 representation has bytes: two for "ü", three for each of the curly quotes, the dash, and the ellipsis. It looks like the REST interface code in icat.server takes an already UTF-8 encoded string from the database and tries to UTF-8 encode it once again, but fails to recognize the encoded bytes and replaces them with U+FFFD.
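
The one-replacement-per-byte pattern can be reproduced in plain Python. The sketch below is only an illustration of the symptom, not a claim about what icat.server actually does: it assumes some step interprets the UTF-8 bytes with a decoder that rejects every byte above 0x7F (ASCII is used here as a stand-in). The function name `scramble` is made up for this example.

```python
def scramble(text: str) -> str:
    """Mimic the symptom: read UTF-8 bytes as if they were ASCII."""
    raw = text.encode("utf-8")
    # The ASCII decoder rejects every byte >= 0x80; with
    # errors="replace" it substitutes one U+FFFD per rejected byte.
    return raw.decode("ascii", errors="replace")


fac_fullname = "Beispielzentrum f\xfcr Materialien und Energie"
scrambled = scramble(fac_fullname)

# U+00FC occupies two bytes in UTF-8, so two U+FFFD appear in its
# place, at the same positions (17 and 18) as in the REST result.
print(scrambled[17] == "\ufffd", scrambled[18] == "\ufffd")
print(scrambled.count("\ufffd"))
```

Any decoder that treats the individual bytes as invalid would give the same picture, so this narrows down the kind of fault without pinpointing where in the server it occurs.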
