Unicode in queries #28

Closed
zelenikotao opened this Issue Sep 11, 2014 · 12 comments

zelenikotao commented Sep 11, 2014

Querying a db field with a unicode string raises a UnicodeEncodeError exception.
The line that causes the exception:

db.get(where('name') == u'žir')

Inserting unicode data went without problems:

db.insert({'name': 'žir'})

I have made a quick hack which fixes the problem for my little hobby project, but I will examine it more when I find time.

In queries.py, I've changed the body of Query._update_repr to:

self._repr = u'\'{0}\' {1} {2}'.format(self._key, operator, value)

and Query.__hash__ to:

return hash(repr(unicode(self)))

Basically, this adds the string prefix "u" in _update_repr and a "unicode" call in __hash__.

Using TinyDB from git on Python 2.7.6, Ubuntu 14.04.
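
The workaround above comes down to building the repr from a unicode template instead of a byte-string one. Here is a minimal sketch, assuming a hypothetical helper named build_repr that is not part of TinyDB. It only mirrors the format string quoted above and should run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-


def build_repr(key, operator, value):
    """Hypothetical stand-in for the string built in Query._update_repr."""
    # The u prefix makes the template a unicode string, so formatting a
    # non-ASCII value never goes through the implicit ASCII codec.
    return u"'{0}' {1} {2}".format(key, operator, value)


# The name from the report, written with an escape to avoid source-encoding issues.
query_repr = build_repr(u'name', u'==', u'\u017eir')
print(repr(query_repr))
```

On Python 3 every str is already unicode, so the prefix is a no-op there. It only matters on Python 2.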

eugene-eeo Sep 12, 2014

Contributor

Is it possible to normalize the data before inserting it? I know there is a function called unicodedata.normalize that should help. Then you can query easily with:

db.get(where('name') == 'zir')

Can you provide the full traceback information? (Just copy + paste from your Python interpreter session)

msiemens Sep 15, 2014

Owner

@zelenikotao Can you please post a full traceback?

zelenikotao Sep 15, 2014

Sorry for not responding earlier, I didn't have any free time over the weekend.

@eugene-eeo I have tried unicodedata.normalize, and the result is the same.

@eugene-eeo @msiemens, here is the traceback:

$ python example.py 

Traceback (most recent call last):
  File "example.py", line 13, in <module>
    db.get(where('name') == unicodedata.normalize('NFKC', u'žir'))
  File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 184, in __eq__
    self._update_repr('==', other)
  File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 310, in _update_repr
    self._repr = '\'{0}\' {1} {2}'.format(self._key, operator, value)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017e' in position 0: ordinal not in range(128)

Here is the test script which causes the exception:
https://gist.github.com/zelenikotao/b23d79edc80bcea3b511.js
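
The traceback suggests an implicit ASCII encoding step: under Python 2, formatting a unicode value into a byte-string template encodes the value with the ASCII codec. The sketch below is not taken from the gist. It performs that encoding step explicitly, as an assumed stand-in for what happens implicitly, so the same error can be inspected on Python 3 as well:

```python
value = u'\u017eir'  # the string from the report, written with an escape

try:
    value.encode('ascii')  # explicit version of the implicit conversion
    error = None
except UnicodeEncodeError as exc:
    error = exc

# error.start is the index of the first character that could not be encoded.
if error is not None:
    print(error.encoding, error.start, repr(value[error.start]))
```

The position reported here matches the "position 0" in the traceback, since the offending character is the first one in the string.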

eugene-eeo added a commit to eugene-eeo/tinydb that referenced this issue Sep 15, 2014

msiemens Sep 16, 2014

Owner

@zelenikotao You've mixed up unicode strings and byte strings. It should work if you use byte strings only, e.g.:

db.insert({'name': 'žir'})
db.search(where('name') == 'žir')

@eugene-eeo I wouldn't recommend normalizing the data that way as you will lose information. Say you insert both {'name': 'zir'} and {'name': 'žir'}, TinyDB will regard them as equal while they probably shouldn't be.
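
To make the information-loss point concrete, here is a small sketch using only the standard library. strip_accents is a hypothetical helper written for this illustration, not something TinyDB provides. It also shows that NFKC on its own leaves the caron in place, which is consistent with the earlier report that normalizing changed nothing:

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: decompose, then drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


name = u'\u017eir'  # "zir" with a precomposed z-caron as its first letter

# NFKC keeps the caron, so it does not turn the name into plain ASCII.
nfkc = unicodedata.normalize('NFKC', name)

# Stripping the marks does, but then two different names become identical.
collides = strip_accents(name) == strip_accents(u'zir')
print(repr(nfkc), repr(strip_accents(name)), collides)
```

Once the marks are stripped there is no way to tell the two names apart, so a query for one would also match the other.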

zelenikotao Sep 16, 2014

@msiemens When I use byte strings as you proposed, the db holds a unicode string as the value of the inserted document, and search raises this warning:

/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  self._cmp = lambda value: value == other

Of course, the search doesn't return the document I was looking for, only a None value.

msiemens Sep 16, 2014

Owner

What's the exact code you've used? If I use byte strings for both inserting and searching, it works...

>>> from tinydb import TinyDB, where
>>> from tinydb.storages import MemoryStorage
>>> db = TinyDB(storage=MemoryStorage)
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
[{'name': '?ir'}]

(Note: the question mark in the db.search result is caused by the Windows CMD terminal and shouldn't be a bug in TinyDB.)

zelenikotao Sep 16, 2014

This is the code I've used:

>>> from tinydb import TinyDB, where
>>> db = TinyDB('db.json')
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  self._cmp = lambda value: value == other
[]

But I've tried MemoryStorage as in your example, and it works. Could the problem be somewhere in the file storage handling?
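
One plausible reading of the difference, offered as an assumption rather than a confirmed diagnosis, is that a JSON round trip hands back unicode strings while the query still holds the raw bytes typed at the prompt. The sketch below uses the standard json module as a stand-in for the file storage and runs on Python 3, where the mismatched comparison is silently False instead of raising a UnicodeWarning:

```python
import json

# Stand-in for writing the document to a JSON file and reading it back.
stored = json.loads(json.dumps({'name': u'\u017eir'}))

as_text = u'\u017eir'
# Assumed equivalent of a Python 2 byte-string literal typed in a UTF-8 terminal.
as_bytes = as_text.encode('utf-8')

text_matches = stored['name'] == as_text
bytes_match = stored['name'] == as_bytes
decoded_match = stored['name'] == as_bytes.decode('utf-8')
print(text_matches, bytes_match, decoded_match)
```

An in-memory store never serializes the value, so the byte string that was inserted is the same object type that gets compared later, which would explain why that case works.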

msiemens Sep 16, 2014

Owner

Could be, I'm investigating.

EDIT: This doesn't seem to have a trivial non-hacky solution. I'll work on this a bit.

eugene-eeo Sep 16, 2014

Contributor

@msiemens I think you should read this as well http://stackoverflow.com/questions/11759070/python-json-loads-dumps-break-unicode#11759156

UPDATE: It works:

>>> from ujson import dumps, loads
>>> d = dumps({"name": "ålpha"}, ensure_ascii=False)
>>> d
'{"name":"\xc3\xa5lpha"}'
>>> loads(d)
{u'name': u'\xe5lpha'}
>>> 
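
For comparison, here is a sketch of the same round trip with the standard library json module, swapped in for ujson as an assumption. It runs on Python 3 and is not the change made in TinyDB itself. It shows that ensure_ascii only changes how the text is written out, not what comes back after loading:

```python
import json

doc = {'name': u'\u00e5lpha'}

escaped = json.dumps(doc)                  # non-ASCII written as \uXXXX escapes
raw = json.dumps(doc, ensure_ascii=False)  # non-ASCII characters kept as-is

# Both serialized forms decode to the same unicode document.
same_after_loading = json.loads(escaped) == json.loads(raw) == doc
print(escaped)
print(same_after_loading)
```

Either form is valid JSON. The unescaped one is easier to read in the file but requires the file to be written with an encoding that can represent the characters.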

msiemens Sep 17, 2014

Owner

I was wrong, there is a trivial solution, see 6b518b8. Test cases for unicode data included.

@zelenikotao Could you test if it works in the latest development version?

zelenikotao Sep 17, 2014

Tested it, works great for me!
Thanks!

msiemens closed this Sep 17, 2014

msiemens Sep 17, 2014

Owner

Thanks for reporting!
