On thread safety #18

Closed
eugene-eeo opened this Issue Aug 7, 2014 · 9 comments

4 participants
@eugene-eeo
Contributor

eugene-eeo commented Aug 7, 2014

This is an observation from prototyping a storage system similar in semantics to TinyDB's ConcurrencyMiddleware: although the current approach of TinyDB is thread-safe, writes may still be discarded. Consider a hypothetical situation where I have a table with the ConcurrencyMiddleware in use, i.e.:

s = ConcurrencyMiddleware(JSONStorage)
db = TinyDB('data.json', storage=s).table('table')

If we spawn, say, three threads, the lock guarantees that each individual write and each individual read is atomic, but some writes may still be discarded due to the way TinyDB works, e.g. in the insert function:

def insert(self, element):
    """
    Insert a new element into the table.
    """

    current_id = self._last_id + 1
    self._last_id = current_id

    # Unsynchronized read-modify-write: another thread can read the same
    # on-disk state before this thread's write below reaches the file.
    data = self._read()
    data[current_id] = element

    self._write(data)

Setting the key on a new dictionary (reconstructed every time _read() is called) is atomic on its own, but multiple threads may call this function concurrently, and unless they happen to execute their database calls serially, there is no guarantee that the previous write is still reflected when the data is written to disk: whichever write request hits the file last wins, i.e. a race condition. Solving this problem requires the lock to be held for the entirety of the function call, i.e.:

def insert(self, element):
    with self.lock:
        ...
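
For illustration, a minimal sketch of the fully locked version, assuming the same _read()/_write()/_last_id internals as the snippet above and a self.lock created as a threading.Lock():

def insert(self, element):
    with self.lock:
        # The ID bump, the read and the write now form one critical
        # section, so no concurrent insert can interleave between them.
        current_id = self._last_id + 1
        self._last_id = current_id

        data = self._read()
        data[current_id] = element
        self._write(data)
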
@msiemens
Owner

msiemens commented Aug 17, 2014

I think we should remove the ConcurrencyMiddleware, as we can't override insert() etc. from within a middleware. To implement concurrency properly, one could subclass TinyDB and add locks to all functions that read or write.
What do you think?

@eugene-eeo
Contributor

eugene-eeo commented Aug 18, 2014

Yes, that is actually what I had in mind. Do you mean that we do not include something like a ThreadSafeTable, and let users implement it themselves?
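
For illustration, a minimal sketch of what such a ThreadSafeTable could look like, written as a wrapper around an existing table (the class and its choice of wrapped methods are hypothetical, not part of TinyDB):

import threading

class ThreadSafeTable:
    """Serialize all access to an underlying TinyDB table behind one lock."""

    def __init__(self, table):
        self._table = table  # the TinyDB table being protected
        self._lock = threading.RLock()

    def insert(self, element):
        with self._lock:
            return self._table.insert(element)

    def search(self, cond):
        with self._lock:
            return self._table.search(cond)

    def remove(self, cond):
        with self._lock:
            return self._table.remove(cond)

Since every operation that touches the storage goes through the same lock, one thread's read-modify-write cycle can no longer interleave with another's.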

@msiemens
Owner

msiemens commented Aug 18, 2014

I'd say so. It shouldn't be hard to implement, but as long as I don't know whether someone would actually use it, I wouldn't add it to TinyDB itself.

@eugene-eeo
Contributor

eugene-eeo commented Aug 18, 2014

Actually, I don't think we need a lock for reading though. Correct me if I'm wrong.

@paul-hammant
Contributor

paul-hammant commented Oct 14, 2017

I have a case where thread one has all the writes and thread two only has reads, and I frequently encounter:

    rows = my_table.search(Query().foo == 'bar')
  File "/home/paul/.local/lib/python3.5/site-packages/tinydb/database.py", line 478, in search
    docs = [doc for doc in self.all() if cond(doc)]
  File "/home/paul/.local/lib/python3.5/site-packages/tinydb/database.py", line 357, in all
    return list(itervalues(self._read()))
  File "/home/paul/.local/lib/python3.5/site-packages/tinydb/database.py", line 330, in _read
    return self._storage.read()
  File "/home/paul/.local/lib/python3.5/site-packages/tinydb/database.py", line 62, in read
    raw_data = (self._storage.read() or {})[self._table_name]
  File "/home/paul/.local/lib/python3.5/site-packages/tinydb/storages.py", line 106, in read
    return json.load(self._handle)
  File "/usr/lib/python3.5/json/__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None

The app was working well without this exception before I decided to move a time-consuming background activity to a thread.

@msiemens
Owner

msiemens commented Oct 17, 2017

@paul-hammant That's actually not surprising to me as TinyDB itself doesn't do any locking to prevent problems with concurrent access (as documented). To prevent these problems, you'll have to either roll your own locking (see Python's threading.RLock) or use an extension like tinyrecord.
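
For illustration, a minimal sketch of the roll-your-own option, with a single lock shared by the writer and the reader threads (the file name, the query, and the function names are made up for this example):

import threading

from tinydb import TinyDB, Query

db = TinyDB('data.json')
db_lock = threading.RLock()  # one lock, shared by every thread that touches db

def write_document(doc):
    with db_lock:
        db.insert(doc)

def read_documents():
    # Reads need the lock too: without it, json.load may run while the
    # writer has only partially flushed the file, producing exactly the
    # JSONDecodeError shown in the traceback above.
    with db_lock:
        return db.search(Query().foo == 'bar')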

@paul-hammant
Contributor

paul-hammant commented Oct 17, 2017

I did roll my own in the end. But something else was causing the threads to lock up (watchdog, I think), so I pulled back from multi-threading, and now it works well.

@devxpy

devxpy commented May 4, 2018

I am trying to solve this using my multiprocessing library.

I made a storage backend that supports concurrent file IO.

I tested it by manually accessing my storage backend's read/write methods from 100 processes at the same time.

It seems to work great.

Here is the process, of which 100 replicas were spawned:
def child(state: zproc.ZeroState, props):
    storage = zproc.misc.get_tinydb_storage(state)('test') # get a storage class and init with filename='test'
    storage.write({'foo': props})  # dump random data

But when I plug that into TinyDB, I notice that TinyDB only calls read() when the db instance is created.

Here is the process, modified for TinyDB:
def child(state: zproc.ZeroState, props):
    storage = zproc.misc.get_tinydb_storage(state) # get a storage class
    db = tinydb.TinyDB('test', storage=storage) # pass it to tinydb
    db.insert({'foo': props})  # dump random data

As a result, TinyDB isn't aware of the content already in the file and generates data that overwrites the old data.

I also tried doing a dict.update() on the existing data, but it doesn't work since the counter key is always "1".

Is there any way to force TinyDB to read the file again on every write operation and update the existing contents, instead of reading only once at the start?

@msiemens
Owner

msiemens commented May 10, 2018

Actually, TinyDB does read the storage on every operation. For example:

from tinydb import TinyDB, Storage


class MyStorage(Storage):
    def read(self):
        print('reading')
        return {}

    def write(self, data):
        pass


db = TinyDB(storage=MyStorage)
db.insert({})

will print reading twice: once during creation and once during db.insert().
