Potential problems with distributed lock manager (DLM) #1

Open
stIncMale opened this issue Dec 9, 2019 · 24 comments

@stIncMale

stIncMale commented Dec 9, 2019

Generally speaking, locks may be used for two different purposes:

  • Efficiency. Saving work when it is safe to do something concurrently, but doing so wastes some work, e.g. because only one of multiple concurrent transactions is allowed to commit, and the others are rolled back and have to retry.
  • Correctness. When a resource protected by a lock fails to behave correctly if accessed concurrently.

Here I am going to talk only about DLMs used to achieve correctness in an asynchronous model. This means that safety properties of algorithms cannot rely on any timing assumptions; liveness properties still can rely on such assumptions; see [1]. All real networks are as poor as the asynchronous model (unless you are Google, who built physical infrastructure for their Cloud Spanner to be able to guarantee, for example, that the clock skew on different nodes is bounded by a known limit; see "Spanner, TrueTime and the CAP Theorem" by Eric Brewer, Spanner: Google’s Globally-Distributed Database). Moreover, I will hardly say anything new compared to what has already been said in the following sources:

  1. "How to do distributed locking" by Martin Kleppmann (although the article was inspired by the Redis Redlock algorithm, it is about DLMs in general).
  2. A response from Salvatore Sanfilippo (a.k.a. antirez) to the criticism above.
  3. The "Leader Election and External Resources" paragraph in the "ZooKeeper. Distributed Process Coordination" book by Benjamin Reed, Flavio Junqueira.

The idea of DLM, as opposed to a centralized locking manager (which is easy to implement), is to provide a lock manager that is more available than a lock manager with a single point of failure (SPOF). This means that a DLM must be able to release a lock that is being held by an unresponsive process. If a DLM does not have such a feature, then a stuck DLM client that fails to release its lock in a timely manner effectively renders the whole DLM unavailable (all of this is mentioned in [2]). In order to provide this DLM-side lock release feature, the majority (all?) of DLMs introduce lock expiry.

The problem with expiring locks is that as soon as we use them, the DLM alone can no longer guarantee mutual exclusion, and the resource protected by a lock has to take an active part in preserving it. Both [1] and [3] propose using fencing tokens as a solution: together with acquiring a lock from a DLM, a DLM client also receives a unique fencing token that it sends to the resource protected by the lock together with a request. The protected resource handles the request only in the following situations (otherwise the request is rejected):

  • it has not seen this token before;
  • the token is the same as the last token the protected resource has received (here we assume that concurrent requests to a protected resource may come only from multiple clients, and a single client never sends concurrent requests: it either receives a response from the resource before sending another request or, if the request times out, acquires a new lock & token before sending the next request).

It appears that the simplest way of achieving this is to use always-increasing token values (as mentioned in [1]), but [2] also mentions that sufficiently large random tokens are fine to guarantee uniqueness (with which I agree, but with this approach, the protected resource has to remember and search through all the token values it has ever seen, as opposed to remembering only the last one).
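
To make the always-increasing-token variant concrete, here is a minimal sketch (my own illustration, not code from any of the systems referenced above) of a protected resource that accepts a request only if its fencing token is at least as new as the last token it has seen, performing the token check and the state change as one atomic step:

```java
/**
 * Editor's sketch of a fenced protected resource, assuming the DLM hands out
 * strictly increasing fencing tokens together with each lock.
 */
final class FencedResource {
    private long lastSeenToken = Long.MIN_VALUE;

    /**
     * Applies the state change only if the request's token is not older than
     * the newest token seen so far. The token check and the state change are
     * performed atomically inside the resource; this is the "active
     * participation" of the protected resource mentioned above.
     */
    synchronized boolean handleRequest(long fencingToken, Runnable stateChange) {
        if (fencingToken < lastSeenToken) {
            return false; // request from a stale lock holder, reject it
        }
        lastSeenToken = fencingToken;
        stateChange.run();
        return true;
    }
}
```

In a real data store the same rule would be expressed as a conditional write that stores the token together with the data, which is exactly the CAS requirement discussed below.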

Implementing this technique requires the following:

  • an ability to modify the behaviour of (to program) the protected resource;
  • an atomic compare-and-set (CAS) operation (not necessarily lock-free) supported by the resource (I believe this is what the phrase "If your data store can always accept the write only if your token is greater than all the past tokens, than it’s a linearizable store" in [2] is meant to say).

Let's look at the situation again. In order to correctly use a DLM to exclude concurrent access to a resource, we must:

  • have the support of atomic CAS operations by the protected resource;
  • be able to program this resource or at least hide it behind a proxy that we develop (these conditions are equivalent);
  • use a complex 3rd-party DLM implementation, which is likely incorrect.

It appears to me that once we have the first two conditions satisfied, we could have programmed the correct handling of concurrent requests in the resource itself, which would be simpler and more robust than correctly achieving the same with a DLM.

As a result of all of the above, I think that using a DLM for correctness is a hypothetical idea that has little to no practical value. I expect that in the majority of situations where a DLM is used for correctness, the solution is incorrect but appears to work most of the time. [3] mentions a specific bug in Apache HBase caused by exactly the scenario that fencing tokens are supposed to help deal with. And at https://aphyr.com/tags/jepsen one may find many stories of how distributed DBs often miserably fail to work correctly.

It would be interesting to see/discuss scenarios where a DLM can be applied correctly and at the same time its usage is necessary. I am not talking here about the coordination mechanisms used inside a distributed DBMS (i.e. for accessing the state stored in it), like Apache ZooKeeper or Cassandra (for LWTs), but rather about situations when a DLM implemented in one system is used by another system for correctness. @juliofalbo May I ask you to please describe in detail a scenario in which a DLM is used in this project?

Update 2021-Feb-10: I posted a comment a while ago about the practical usability of the fencing approach for DLM explained in "How to do distributed locking" by Martin Kleppmann. Recently someone else posted another comment expressing the same concern, and Martin replied (I don't want to copy-paste the text of the concerns and of the reply here, so I linked all the relevant comments and the reply). Essentially, while trying to come up with a correct DLM algorithm, we may end up with just one more consensus algorithm.

@juliofalbo
Owner

juliofalbo commented Dec 9, 2019

Hi @stIncMale .

First I would like to thank you for the thread and for the amazing comment!

I'm using Redis as a DLM to avoid 2 bookings for the same room on the same days.

So, let's imagine this situation:

Request 1: Book room A from 01/01/2020 to 10/01/2020

Request 2: Book room A from 08/01/2020 to 15/01/2020.

Now let's imagine that we will receive those requests in parallel at the same time.

The lock will reject one of those requests, since the method responsible for booking a room validates that the room is available.
I'm using the room id as the key of the lock, which means that every request for one specific room will try to acquire the same lock.
You can check the code here.

So, to handle the locks I'm using a lib created by @alturkovic and it is working fine, as you can check using this JMeter file.
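
For readers following along, the pattern described above roughly looks like the following sketch. This is not the actual code of this repository or of the library mentioned; the DistributedLock and BookingRepository interfaces and all names are purely hypothetical, used only to illustrate the lock-per-room-id idea:

```java
import java.time.LocalDate;
import java.util.concurrent.TimeUnit;

// Hypothetical lock abstraction standing in for the Redis-backed library.
interface DistributedLock {
    boolean tryAcquire(String key, long timeout, TimeUnit unit);
    void release(String key);
}

// Hypothetical repository: checks availability and persists bookings.
interface BookingRepository {
    boolean isAvailable(long roomId, LocalDate from, LocalDate to);
    void saveBooking(long roomId, LocalDate from, LocalDate to);
}

final class BookingService {
    private final DistributedLock lock;
    private final BookingRepository repository;

    BookingService(DistributedLock lock, BookingRepository repository) {
        this.lock = lock;
        this.repository = repository;
    }

    boolean book(long roomId, LocalDate from, LocalDate to) {
        String key = "room-lock:" + roomId; // one lock per room
        if (!lock.tryAcquire(key, 5, TimeUnit.SECONDS)) {
            return false; // another booking for the same room is in progress
        }
        try {
            // The availability check and the write must both happen while the
            // lock is held, otherwise two requests could pass the check concurrently.
            if (!repository.isAvailable(roomId, from, to)) {
                return false;
            }
            repository.saveBooking(roomId, from, to);
            return true;
        } finally {
            lock.release(key);
        }
    }
}
```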

Do you think it is possible to reproduce the GC error?

Maybe we can use some Chaos Engineering tool to do it, like Perses (a tool that I co-created with 2 friends).

@stIncMale
Author

stIncMale commented Dec 9, 2019

I have a suspicion that BookingDatesRepository works on top of an ACID DBMS (most likely PostgreSQL in the case of this project), and this DBMS has a single node that handles writes and stores all the bookings. Is this correct? If it is, then there is no reason to use a DLM for booking, as any ACID DBMS provides simple and robust ways of dealing with concurrency and atomic operations.

If this is not the case, i.e. bookings are stored not in an ACID DBMS, or sharding is used in such a way that bookings for the same room are stored in different shards (I would be highly surprised to discover that this is the case), then could you please describe the booking storage?

@juliofalbo
Owner

So, we have only a single node that handles writes and stores all the bookings. We also have a replica of this DB, but only for read-only purposes.

About your suggestion: any ACID DBMS provides simple and robust ways of dealing with concurrency and atomic operations.

Are you talking about handling this lock inside Postgres instead of Redis? If so, in this case Postgres will be my DLM and I'll have the same issue, correct?

@stIncMale
Author

stIncMale commented Dec 9, 2019

Are you talking about handling this lock inside Postgres instead of Redis?

Yes, you may be able to do this even with the guarantees given to you by the A (atomicity) and I (isolation) properties (see https://www.postgresql.org/docs/current/transaction-iso.html) without resorting to explicit locking (but if you want to, PostgreSQL has advisory locks).

in this case Postgres will be my DLM and I'll have the same issue, correct?

No, your single writable PG node is going to act as a centralized locking manager (as opposed to a distributed one), which is not only much less likely to have correctness issues, but is also combined with the data store, meaning that the asynchrony of the network cannot cause trouble. Room booking is one of the most basic examples of what an ACID DBMS can help to solve easily. And since you already store bookings there, there is no reason whatsoever to control concurrency by means of any external system.

@juliofalbo
Owner

juliofalbo commented Dec 9, 2019

Yes, I agree that it is possible to get this behavior using, for example, the SERIALIZABLE isolation level in the @Transactional annotation provided by Spring, but in this case the transaction will be rejected and not "enqueued".
So the thrown exception will be (in this case) a JpaException, but I don't want that; I want my transaction to wait for the execution of the previous one.

@stIncMale
Author

If you don't want transactions to fail as a result of concurrent requests (this is a downside of MVCC), you can use explicit PostgreSQL advisory locks.

A centralized LM implemented by the DBMS that stores the booking data is enough here, because there is no need for an LM to be available when the DBMS is not available (who needs a lock that protects a resource that is unavailable anyway?). I see downsides and do not see any upside of using a DLM for concurrency control in a single instance of an ACID DBMS.
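
A minimal sketch (my own illustration, not this project's code) of what that could look like with a PostgreSQL transaction-level advisory lock via plain JDBC; the booking table and column names are made up. The key point is that taking the lock, checking availability and inserting the booking all happen in the same transaction, and the lock is released automatically on commit or rollback:

```java
import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class AdvisoryLockBooking {
    /**
     * Books a room under a transaction-level advisory lock keyed by roomId.
     * The lock, the availability check and the insert all happen in the same
     * transaction; the lock is released automatically on commit or rollback.
     */
    static boolean book(Connection conn, long roomId, Date from, Date to) throws SQLException {
        conn.setAutoCommit(false);
        try {
            // Serializes concurrent bookings of the same room only; other rooms are unaffected.
            try (PreparedStatement lock = conn.prepareStatement("SELECT pg_advisory_xact_lock(?)")) {
                lock.setLong(1, roomId);
                lock.execute();
            }
            try (PreparedStatement check = conn.prepareStatement(
                    "SELECT count(*) FROM booking WHERE room_id = ? AND from_date <= ? AND to_date >= ?")) {
                check.setLong(1, roomId);
                check.setDate(2, to);
                check.setDate(3, from);
                try (ResultSet rs = check.executeQuery()) {
                    rs.next();
                    if (rs.getLong(1) > 0) {
                        conn.rollback(); // releases the advisory lock too
                        return false;    // an overlapping booking already exists
                    }
                }
            }
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO booking (room_id, from_date, to_date) VALUES (?, ?, ?)")) {
                insert.setLong(1, roomId);
                insert.setDate(2, from);
                insert.setDate(3, to);
                insert.executeUpdate();
            }
            conn.commit(); // releases the advisory lock
            return true;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```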

@juliofalbo
Owner

Good point @stIncMale !

So, the lib that I'm using already supports this lock; I'll create another branch and test this new implementation using Postgres to handle the locks.

But I really want to continue the discussion around the RedLock issue!

Maybe we can create a scenario that makes sense to use a DLM. What do you think?

@stIncMale
Author

the lib that I'm using already supports this lock

Just make sure that the locks you acquire (whether at session level or transaction level) are acquired in the same session/transaction that does the checkTheRoomIsFree&Book action. Otherwise, the solution may have the same problem as described above.

But I really want to continue the discussion around the RedLock issue!

It's not so much about Redlock specifically as it is about the feasibility of using a DLM for correctness in general.

Maybe we can create a scenario that makes sense to use a DLM. What do you think?

Well, I probably do not have enough imagination/experience to imagine one. I tried asking Martin in the comments under his article, and tried connecting with him via LinkedIn - no luck. I'll try asking Kyle Kingsbury, a.k.a. "Aphyr", what he thinks about this topic (he is the author of Jepsen), and I'll try asking a question on Stack Overflow (if it fits the format).

@juliofalbo
Owner

juliofalbo commented Dec 10, 2019

@toktarev, thank you very much for your contribution!

I don't have experience with consensus protocols, but I think ZooKeeper is the most used consensus system that we have today (Kafka, Hadoop, Mesos, Cassandra, Solr, etc. are using ZooKeeper).

So, your suggestion for this case is: Change Redis to Zookeeper, correct?

And about your note: The best lock is when we don't have lock.

I totally agree, but I cannot see another option here.

@juliofalbo
Owner

I added the distributed lock to handle this situation: #1 (comment)

@juliofalbo
Owner

juliofalbo commented Dec 10, 2019

If I do this I will ignore the load balancer, and I cannot see how it will solve the problem. Can you explain this, please?

@juliofalbo
Owner

juliofalbo commented Dec 10, 2019

Hmm, I'll try to implement something like this to validate the idea. But I cannot see how to handle the requests inside my bucket. Do you have an example?

@juliofalbo
Owner

About the GC problem, I have a point!

When we talk about Java 11+, we can use ZGC as the garbage collector, and the GC pauses will not be a problem anymore, since the limit for those pauses is 10 ms.

The Z Garbage Collector, also known as ZGC, is a scalable low latency garbage collector designed to meet the following goals:

  • Pause times do not exceed 10ms
  • Pause times do not increase with the heap or live-set size
  • Handle heaps ranging from a few hundred megabytes to multi terabytes in size

@juliofalbo
Owner

juliofalbo commented Dec 10, 2019

To be honest I don't like this approach. It will limit our ability to scale.
We can use the same strategy using RabbitMQ and only 1 consumer. But if we need to scale we'll have a problem.

@juliofalbo
Owner

I liked the approach of using Zookeeper and I'll create a branch with this change!

Again, thank you very much for your contribution @toktarev !

@juliofalbo
Owner

juliofalbo commented Dec 10, 2019

So, for now we have 5 ways to handle this distributed lock:

1 - Using Redis as a DLM

2 - Using Zookeeper as a DLM

3 - Using the isolation level of a relational DB like Postgres (ACID)

4 - Configuring the load balancer to send the requests for the same room to one instance and synchronizing the code

5 - Using a message broker and sending the messages to only 1 consumer, since the default behavior is FIFO

Now it would be really nice to have the pros and cons of each approach.

@juliofalbo
Owner

@toktarev amazing!!!!!

So, now we have 6 ideas:

1 - Using Redis as a DLM

2 - Using Zookeeper as a DLM

3 - Using the isolation level of a relational DB like Postgres

4 - Configuring the load balancer to send the requests for the same room to one instance and synchronizing the code

5 - Using a message broker and sending the messages to only 1 consumer, since the default behavior is FIFO

6 - Lock-free with Hazelcast or Apache Flink

And I really liked this last option! I agree when you say: The best lock is when we don't have lock.

Can you help me to implement this strategy in this scenario?

I'm working on a branch with Zookeeper and I really think that with all the options implemented we can have a great sandbox!

Again, thank you very much for bringing amazing ideas to this issue!

@juliofalbo
Owner

No problem! I'll implement all options and when I finish I'll mark you in the PR!

@stIncMale
Author

stIncMale commented Dec 11, 2019

@toktarev

Why can't we send all requests with the same roomId to the same node?

Alex, there are no multiple nodes here, there is a single writable PostgreSQL instance that stores bookings (see #1 (comment)).

So essentially the system is not distributed (in fact, this is a very primitive situation). Using any way of handling booking conflicts that is external to the DBMS in this specific situation does the following:

  • drastically complicates the solution;
  • makes the booking functionality less available by introducing more places where it can fail;
  • makes the performance of the booking actions worse by introducing more communications required to book a room.

I suspect that the reason it was not implemented via the PostgreSQL instance where the booking data is written is unfamiliarity with basic ACID DBMS functionality.

@juliofalbo

About the GC problem, I have a point!

Júlio, what do you mean by the "GC problem"? If you are referring to the example that Martin mentioned to illustrate how an arbitrary delay may happen in a non-hard-realtime system, then this is not the point you need to pay attention to. The actual point is that we develop algorithms and software for systems that do not have hard realtime guarantees (both hardware and software) and networks that are asynchronous. This means that you cannot make any assumptions about timing when reasoning about correctness of an algorithm.

So, now we have 6 ideas:
1 - Using Redis as a DLM
2 - Using Zookeeper as a DLM
...

The point I made is not about a specific DLM implementation, it is about how DLM must be used to actually guarantee mutual exclusion and whether such usage makes practical sense.

I don't like this approach. It will limit our ability to scale.

There is a single PostgreSQL instance accepting writes in the application, which makes it an obvious point that cannot be scaled horizontally (and also a SPOF). Therefore the booking functionality cannot be scaled horizontally regardless of the way conflicts resulting from concurrent requests are handled.

@juliofalbo
Owner

juliofalbo commented Dec 11, 2019

@stIncMale

I think I misunderstood; when I read "node" I was thinking about the service node, not the Postgres node.

I suspect that the reason it was not implemented via the PostgreSQL instance where the booking data is written is unfamiliarity with basic ACID DBMS functionality.

About this, it could be possible!
The way that I found to do this was using the Isolation level (SERIALIZABLE). And I didn't like the solution, since I need to lock the whole table.

But I agree, maybe there is a better way to do this using ACID DBMS functionality. Can you help me to implement it?

@juliofalbo
Owner

juliofalbo commented Dec 11, 2019

@toktarev , here is a diagram of the architecture: https://github.com/juliofalbo/complete-microservices-env/blob/master/HotelBookingSystemArchitecture.png?raw=true

I think the solution is simpler than what we are discussing here, but the goal is to provide a great discussion with possible solutions for the community.

@GabrielAmazonas

GabrielAmazonas commented Dec 11, 2019

If I understood @stIncMale and @toktarev correctly, both point out that relying on the RDBMS's default transaction and concurrency mechanism (MVCC) would solve 99% of the possible problems (including the room booking example). Please, guys, correct me if I'm wrong.

A common pattern IMO would be to let it handle the concurrency itself and scale horizontally (adding read replicas) in the case of a read-heavy application, and this would probably already be reliable and safe enough for the huge majority of possible scenarios.

@juliofalbo
Owner

juliofalbo commented Dec 11, 2019

I was reading about MVCC on Postgres and I found this explanation:

The main advantage of using the MVCC model of concurrency control rather than locking is that in MVCC locks acquired for querying (reading) data do not conflict with locks acquired for writing data, and so reading never blocks writing and writing never blocks reading. PostgreSQL maintains this guarantee even when providing the strictest level of transaction isolation through the use of an innovative Serializable Snapshot Isolation (SSI) level.

Reference: https://www.postgresql.org/docs/9.1/mvcc-intro.html

Another nice article about MVCC: https://vladmihalcea.com/how-does-mvcc-multi-version-concurrency-control-work/

So, I think it is a good idea to implement this approach but my question is: How can I implement this using JPA (Spring Data)?

@stIncMale
Author

stIncMale commented Dec 12, 2019

@juliofalbo

The way that I found to do this was using the Isolation level (SERIALIZABLE). And I didn't like the solution, since I need to lock the whole table.

If you are using the serializable transaction isolation level in PostgreSQL, then you do not need to explicitly lock anything at all. The specific DBMS implementation is important, as not all of them implement serializability correctly; e.g. Oracle provides snapshot isolation (the same isolation level that PostgreSQL provides at repeatable read) instead of serializable, but calls it serializable.

In fact, using the serializable isolation level to correctly implement booking is a no-brainer in this system. And because this approach is the simplest one, I suggest using it. Once it is successfully implemented, and if you find any performance problems with it, only then would I suggest looking at other approaches (like using relaxed isolation levels, potentially with explicit locks in PostgreSQL, either directly or via the API that your ORM provides).

Can you help me to implement it?

Just do all booking actions (checking that a room is available during the requested period and marking it as unavailable) related to a single booking request in a serializable transaction. And if the transaction fails with SQLSTATE 40001 (serialization_failure in PG), then retry it (no changes needed, just do exactly the same actions in a new transaction). Of course, the retry must be transparent to the user and thus should be done by the same application that started the failed transaction, not by the user who requested the booking.

How can I implement this using JPA (Spring Data)?

Whether you are using JPA, or JDBC, or any other way of accessing PostgreSQL (see https://www.postgresql.org/docs/current/external-interfaces.html), the answer is the same as mentioned above.
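
To illustrate the recipe above in a Spring setting, here is a minimal sketch with made-up class and method names; it is an assumption of how this could be wired, not this project's code. PostgreSQL serialization failures (SQLSTATE 40001) typically surface through Spring as a subclass of ConcurrencyFailureException, and the retry loop must live outside the @Transactional method so that each attempt runs in a fresh transaction:

```java
import java.time.LocalDate;

import org.springframework.dao.ConcurrencyFailureException;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Isolation;
import org.springframework.transaction.annotation.Transactional;

@Service
class SerializableBookingService {

    /** Check that the room is free and save the booking; repository calls omitted in this sketch. */
    @Transactional(isolation = Isolation.SERIALIZABLE)
    public boolean bookRoom(long roomId, LocalDate from, LocalDate to) {
        // check availability and insert the booking via the usual Spring Data repositories
        throw new UnsupportedOperationException("omitted in this sketch");
    }
}

/** The retry is done by a caller so that each attempt starts a new transaction. */
class BookingFacade {
    private final SerializableBookingService service;

    BookingFacade(SerializableBookingService service) {
        this.service = service;
    }

    boolean bookWithRetry(long roomId, LocalDate from, LocalDate to) {
        for (int attempt = 0; attempt < 5; attempt++) {
            try {
                return service.bookRoom(roomId, from, to);
            } catch (ConcurrencyFailureException e) {
                // serialization failure: repeat exactly the same actions in a new transaction
            }
        }
        throw new IllegalStateException("could not book the room after 5 attempts");
    }
}
```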

@GabrielAmazonas

If I understood @stIncMale and @toktarev correctly, both point out that relying on the RDBMS's default transaction and concurrency mechanism (MVCC) would solve 99% of the possible problems (including the room booking example).

I was talking in general and about a different thing when I filed this issue, but the discussion turned out to be almost completely unrelated to what I was saying in the issue, and it is rather about the booking implementation in this project.

I am saying that in this specific system where all writes caused by booking go to a single PostgreSQL instance, there is no reason to not use the concurrency control provided by this instance and instead invent anything else. I am also saying (again, for this specific situation) that any other approach to handle concurrency control would be much more complicated and would involve additional network communications that could only result in higher latencies or lower throughput of processing booking requests.

A common pattern IMO would be to let it handle the concurrency itself and scale horizontally (adding read replicas) in the case of a read-heavy application, and this would probably already be reliable and safe enough for the huge majority of possible scenarios.

I agree completely provided that by "it" you mean the writable instance of PostgreSQL, and I proposed exactly this in the very beginning of this discussion (#1 (comment)) and have been advocating for this approach ever since. It is good that you emphasized that only reads can be horizontally scaled in this system. As a general note, serializable (and linearizable for that matter) read-write operations cannot be scaled horizontally in principle, and cannot be made CAP-available. This is the sole reason why databases with more relaxed consistency models exist.
