-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tickets/dm 43385 #852
Tickets/dm 43385 #852
Conversation
7f3571b
to
8b7d91a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. See the comments and suggestions.
src/czar/CzarRegistry.h
Outdated
TIMEPOINT const updateTime; ///< "update-time-ms" entry. | ||
|
||
/// Return true if all members, aside from updateTime, are equal. | ||
bool same(WorkerContactInfo const& other) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not define the usual operator?
bool operator==(WorkerContactInfo const& other) const;
Ignoring the timestamp from the comparison should be fine here since the timestamp is AKA "metadata" for the payload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't include an updateTime comparison, intentionally. There's another comparison that will probably need to be made with updateTime, so the comparisons should be identifiable by function name. So, I'm renaming this sameConnectInfo for clarity.
|
||
std::shared_ptr<cconfig::CzarConfig> const _czarConfig; ///< Pointer to the CzarConfig. | ||
|
||
std::atomic<bool> _loop{true}; ///< Threads will continue to run until this is set false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really needed? CzarRegistry
gets created by Czar
, and they both would be gone when Czar
is gone. The only reason for Czar to be destroyed is when a process is about to die. Who would care about a couple of 2 extra threads when there could be hundreds of other query-processing threads? The threads won't prevent the process from dying, and there is nothing useful these threads could do before the end. The controlled termination of threads only makes sense if/when there is a plan for cleaning up resources or recreating CzarRegistry
while Czar
is still alive.
Nothing is bad about the implementation. Just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like using detached threads as I've I have had real trouble with race condiftions in unit tests. The detached thread accesses something and you get segfault depending on what stopped first. Using _loop and join() results in repeatable termination conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you will design your class as:
class CzarRegistry : public std::enable_shared_from_this<CzarRegistry> {
And use the shared pointers instead of the low-level this
then you won't have any segfaults or race conditions.
@@ -52,6 +52,7 @@ | |||
#include "ccontrol/UserQueryType.h" | |||
#include "css/CssAccess.h" | |||
#include "css/KvInterfaceImplMem.h" | |||
#include "czar/Czar.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You seem to need this to get access to the chunk map and to the interface class for the REgistry as per:
auto cz = czar::Czar::getCzar();
auto czarChunkMap = cz->getCzarChunkMap();
auto czarRegistry = cz->getCzarRegistry();
I'm not sure this is right from the dependency standpoint as Czar
is the front-end for handling incoming queries. Czar
is also responsible for creating a branch of resources and services needed for processing queries. These resources are hosted by UserQuerySharedResources
. Hence, I would argue that you could add both pointers (chunk map and the registry API) to that collection to make it available to the downstream classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll delete these lines for now as they aren't being used until the next ticket. I'll need to think about it.
You bring up a good point. Part of why I did made the CzarRegistry class is that it could be destoryed and recreated to stop and restart the communications with the registry if there were problems. I don't know if it useful at this time, but restarting that communications seems like a good option to have.
Having lots of copies of that shared pointer could make destroying the existing object tricky, while one pointer accessed only during Job creation, and dropped at the end of Job creation makes it easily managable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having lots of copies of that shared pointer could make destroying the existing object tricky...
Not at all. The standard pattern here is to introduce the "shutdown" (or "stop") method:
void CzarRegistry::shutdown() {
_loop = false;
}
In this case, you could repurpose your flag for telling threads when it's time to equity.
This would also help with the unit testing of the service.
A good example is the qhttp
server:
class Server : public std::enable_shared_from_this<Server> {
public:
using Ptr = std::shared_ptr<Server>;
static Ptr create(...);
...
//----- start() opens the server listening socket and installs the head of the asynchronous event
// handler chain onto the asio::io_service provided when the Server instance was constructed.
// Server execution may be halted either calling stop(), or by calling asio::io_service::stop()
// on the associated asio::io_service.
void start();
//----- stop() shuts down the server by closing all active sockets, including the server listening
// socket. No new connections will be accepted, and handlers in progress will err out the next
// time they try to read/write from/to their client sockets. A call to start() will be needed to
// resume server operation.
void stop();
This interface works really well for unit testing.
src/czar/CzarRegistry.cc
Outdated
// Begin periodically updating worker's status in the Replication System's registry | ||
// in the detached thread. This will continue before the application gets terminated. | ||
thread _registryUpdateThread(&CzarRegistry::_registryUpdateLoop, this); | ||
_czarHeartbeatThrd = move(_registryUpdateThread); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of a local variable _registryUpdateThread
doesn't comply with the LSST C++ style guide. It should not start with _
. Besides, you don't need to introduce such a name as you can make the temporary and move in a single line:
_czarHeartbeatThrd = move(thread((&CzarRegistry::_registryUpdateLoop, this)));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a private member variable so _loop and join can be used and other options can be available later for deciding what to do if communications have problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment was regarding this variable:
thread _registryUpdateThread
src/czar/CzarRegistry.cc
Outdated
_czarHeartbeatThrd = move(_registryUpdateThread); | ||
|
||
thread _registryWorkerUpdateThread(&CzarRegistry::_registryWorkerInfoLoop, this); | ||
_czarWorkerInfoThrd = move(_registryWorkerUpdateThread); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment for the name of the local variable. You may consider replacing two lines with one:
_czarWorkerInfoThrd = move(thread(&CzarRegistry::_registryWorkerInfoLoop, this));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second comment regarding these threads is that there is a safer design. Yould could redefine the class as:
class CzarRegistry : public std:: enable_shared_from_this <CzarRegistry> {
And then create threads in the factory method:
static Ptr create(std::shared_ptr<cconfig::CzarConfig> const& czarConfig) {
auto ptr = Ptr(new CzarRegistry(czarConfig));
ptr-> _czarHeartbeatThrd = std::move(std::thread([ptr]() {
ptr->_registryUpdateLoop();
});
ptr->_czarWorkerInfoThrd = std::move(std::thread([ptr]() {
ptr->_registryWorkerInfoLoop();
});
return ptr;
}
The main advantage of this approach is that you no longer need to care about this
from inside either method _registryUpdateLoop
or _registryWorkerInfoLoop
since ptr
will be captured by threads by copy. The copy of the shared pointer will guarantee the life expectancy of CzarRegistry
for your threads. This also eliminates any need in _loop
. See my other comments on this subject downstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from being able to stop the loop and call join for the threads, I'm also not sure what the exact behavior should be if communications between the czar and workers becomes unstable. The original behavior was to abort()
which I thought was a bit drastic. However, failed registry contact attempts and failed connections to workers for a few may be the point to terminate the czar. I'm note sure where that line should be drawn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing unstable connections usually is not a good reason for terminating a process. It's not so uncommon to run into those during a startup of a complex distributed application (Qserv, etc.):
- during the initial start time of the application
- during rolling upgrades (in Kubernetes)
- due to intermittent network problems
- due to crashes and restarts of individual services
Moreover, the main intent here is to make each component as resilient as possible. A decision to abort the application should be left to the human operators.
Good reasons for terminating a process (to count just a few):
- misconfiguration
- problems with resources
- programming bugs triggering assertions, divisions by 0, logic errors, etc.
No description provided.