Non-ASCII characters get scrambled when retrieving objects from the REST interface #204

Any string attribute of an object that contains non-ASCII characters gets scrambled when the object is retrieved through ICAT's REST interface.

Example: start with an empty ICAT test server.
```python
>>> # Assumes `client` is a logged-in python-icat client and `conf` its
... # configuration.  Prepare some strings having non-ascii characters:
>>> fac_fullname = "Beispielzentrum f\xfcr Materialien und Energie"
>>> fac_fullname
'Beispielzentrum für Materialien und Energie'
>>> inv_title = ("\u201CTest\u201D \u2013 an investigation "
...              "having non-ascii chars in the title \u2026")
>>> inv_title
'“Test” – an investigation having non-ascii chars in the title …'

>>> # Create some objects in ICAT:
>>> facility = client.new("facility", name="BZME", fullName=fac_fullname)
>>> facility.create()
>>> inv_type = client.new("investigationType", facility=facility, name="Standard")
>>> inv_type.create()
>>> investigation = client.new("investigation", facility=facility, type=inv_type,
...                            name="test", visitId="none", title=inv_title)
>>> investigation.create()

>>> # Retrieve these objects back from ICAT, verify that the non-ascii
... # characters are still intact (note that this uses the SOAP interface):
>>> query = ("SELECT i FROM Investigation i "
...          "WHERE i.name = 'test' "
...          "INCLUDE i.facility")
>>> client.assertedSearch(query)[0]
(investigation){
   createId = "simple/root"
   createTime = 2018-07-23 13:51:08+00:00
   id = 2
   modId = "simple/root"
   modTime = 2018-07-23 13:51:08+00:00
   facility = 
      (facility){
         createId = "simple/root"
         createTime = 2018-07-23 13:51:08+00:00
         id = 3
         modId = "simple/root"
         modTime = 2018-07-23 13:51:08+00:00
         fullName = "Beispielzentrum für Materialien und Energie"
         name = "BZME"
      }
   name = "test"
   title = "“Test” – an investigation having non-ascii chars in the title …"
   visitId = "none"
 }
```

So far, so good: everything works. But this was python-icat, which uses the SOAP interface. At least it proves that the object attributes are intact in the database. Now retrieve the same objects with the same query from the ICAT REST interface:

```python
>>> import json
>>> from urllib.parse import urlencode
>>> import urllib.request
>>> import ssl

>>> # This is only needed for a test server having no valid certificate
>>> ssl_context = ssl.create_default_context()
>>> if not conf.checkCert:
...     ssl_context.check_hostname = False
...     ssl_context.verify_mode = ssl.CERT_NONE

>>> url = 'https://icat.example.org/icat'
>>> parameters = { "sessionId": client.sessionId, "query": query, }
>>> req = urllib.request.Request(url + "/entityManager?" + urlencode(parameters))
>>> with urllib.request.urlopen(req, context=ssl_context) as f:
...     result = json.loads(f.read().decode('utf-8'))

>>> result
[{'Investigation': {'id': 2, 'createId': 'simple/root', 'createTime': '2018-07-23T13:51:08.000Z', 'modId': 'simple/root', 'modTime': '2018-07-23T13:51:08.000Z', 'datasets': [], 'facility': {'id': 3, 'createId': 'simple/root', 'createTime': '2018-07-23T13:51:08.000Z', 'modId': 'simple/root', 'modTime': '2018-07-23T13:51:08.000Z', 'applications': [], 'datafileFormats': [], 'datasetTypes': [], 'facilityCycles': [], 'fullName': 'Beispielzentrum f��r Materialien und Energie', 'instruments': [], 'investigationTypes': [], 'investigations': [], 'name': 'BZME', 'parameterTypes': [], 'sampleTypes': []}, 'investigationGroups': [], 'investigationInstruments': [], 'investigationUsers': [], 'keywords': [], 'name': 'test', 'parameters': [], 'publications': [], 'samples': [], 'shifts': [], 'studyInvestigations': [], 'title': '���Test��� ��� an investigation having non-ascii chars in the title ���', 'visitId': 'none'}}]

>>> # We see, there is a problem with the non-ascii characters:
>>> result[0]['Investigation']['facility']['fullName']
'Beispielzentrum f��r Materialien und Energie'

>>> # The U+00FC "LATIN SMALL LETTER U WITH DIAERESIS" is replaced by two
... # Unicode U+FFFD: "REPLACEMENT CHARACTER":
>>> result[0]['Investigation']['facility']['fullName'][17] == '\ufffd'
True
>>> result[0]['Investigation']['facility']['fullName'][18] == '\ufffd'
True

>>> # Same in the investigation title:
>>> result[0]['Investigation']['title']
'���Test��� ��� an investigation having non-ascii chars in the title ���'
>>> result[0]['Investigation']['title'][0] == '\ufffd'
True
```

The result shows that every non-ASCII character is replaced by as many `U+FFFD` replacement characters as its UTF-8 representation has bytes: two for `ü`, and three each for the quotation marks, the dash, and the ellipsis in the title. It looks like the REST interface code in icat.server takes an already UTF-8 encoded string from the database and tries to UTF-8 encode it once again, but fails to recognize the encoded bytes and replaces them with `U+FFFD`.
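The one-replacement-character-per-byte pattern can be reproduced in plain Python. This is only a sketch of the symptom, not of what icat.server does internally: the helper `scramble` is hypothetical, and using the ASCII codec as the stand-in for the wrong charset is an assumption made purely for illustration.

```python
def scramble(text):
    """Illustration only: encode as UTF-8, then decode the bytes with a
    codec that cannot represent them (ASCII is an assumed stand-in),
    replacing every undecodable byte with U+FFFD."""
    return text.encode("utf-8").decode("ascii", errors="replace")


fac_fullname = "Beispielzentrum f\xfcr Materialien und Energie"
inv_title = ("\u201CTest\u201D \u2013 an investigation "
             "having non-ascii chars in the title \u2026")

scrambled_name = scramble(fac_fullname)
scrambled_title = scramble(inv_title)

# Each non-ASCII character turns into one U+FFFD per UTF-8 byte.
for original, scrambled in [(fac_fullname, scrambled_name),
                            (inv_title, scrambled_title)]:
    non_ascii_bytes = sum(len(c.encode("utf-8")) for c in original if ord(c) > 127)
    print(repr(scrambled), scrambled.count("\ufffd"), non_ascii_bytes)
```

For both strings the number of `U+FFFD` characters equals the number of UTF-8 bytes used by the non-ASCII characters, which matches the pattern seen in the REST response above.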
