Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tickets/dm 43385 #852

Merged
merged 4 commits into from
Apr 15, 2024
Merged

Tickets/dm 43385 #852

merged 4 commits into from
Apr 15, 2024

Conversation

jgates108
Copy link
Contributor

No description provided.

Copy link
Contributor

@iagaponenko iagaponenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. See the comments and suggestions.

TIMEPOINT const updateTime; ///< "update-time-ms" entry.

/// Return true if all members, aside from updateTime, are equal.
bool same(WorkerContactInfo const& other) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not define the usual operator?

bool operator==(WorkerContactInfo const& other) const;

Ignoring the timestamp from the comparison should be fine here since the timestamp is AKA "metadata" for the payload.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't include an updateTime comparison, intentionally. There's another comparison that will probably need to be made with updateTime, so the comparisons should be identifiable by function name. So, I'm renaming this sameConnectInfo for clarity.


std::shared_ptr<cconfig::CzarConfig> const _czarConfig; ///< Pointer to the CzarConfig.

std::atomic<bool> _loop{true}; ///< Threads will continue to run until this is set false.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really needed? CzarRegistry gets created by Czar, and they both would be gone when Czar is gone. The only reason for Czar to be destroyed is when a process is about to die. Who would care about a couple of 2 extra threads when there could be hundreds of other query-processing threads? The threads won't prevent the process from dying, and there is nothing useful these threads could do before the end. The controlled termination of threads only makes sense if/when there is a plan for cleaning up resources or recreating CzarRegistry while Czar is still alive.

Nothing is bad about the implementation. Just curious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like using detached threads as I've I have had real trouble with race condiftions in unit tests. The detached thread accesses something and you get segfault depending on what stopped first. Using _loop and join() results in repeatable termination conditions.

Copy link
Contributor

@iagaponenko iagaponenko Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you will design your class as:

class CzarRegistry : public std::enable_shared_from_this<CzarRegistry> {

And use the shared pointers instead of the low-level this then you won't have any segfaults or race conditions.

src/czar/Czar.h Outdated Show resolved Hide resolved
src/czar/Czar.h Outdated Show resolved Hide resolved
@@ -52,6 +52,7 @@
#include "ccontrol/UserQueryType.h"
#include "css/CssAccess.h"
#include "css/KvInterfaceImplMem.h"
#include "czar/Czar.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seem to need this to get access to the chunk map and to the interface class for the REgistry as per:

   auto cz = czar::Czar::getCzar();
    auto czarChunkMap = cz->getCzarChunkMap();
    auto czarRegistry = cz->getCzarRegistry();

I'm not sure this is right from the dependency standpoint as Czar is the front-end for handling incoming queries. Czar is also responsible for creating a branch of resources and services needed for processing queries. These resources are hosted by UserQuerySharedResources. Hence, I would argue that you could add both pointers (chunk map and the registry API) to that collection to make it available to the downstream classes.

Copy link
Contributor Author

@jgates108 jgates108 Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll delete these lines for now as they aren't being used until the next ticket. I'll need to think about it.

You bring up a good point. Part of why I did made the CzarRegistry class is that it could be destoryed and recreated to stop and restart the communications with the registry if there were problems. I don't know if it useful at this time, but restarting that communications seems like a good option to have.

Having lots of copies of that shared pointer could make destroying the existing object tricky, while one pointer accessed only during Job creation, and dropped at the end of Job creation makes it easily managable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having lots of copies of that shared pointer could make destroying the existing object tricky...

Not at all. The standard pattern here is to introduce the "shutdown" (or "stop") method:

void CzarRegistry::shutdown() {
    _loop = false;
}

In this case, you could repurpose your flag for telling threads when it's time to equity.
This would also help with the unit testing of the service.

A good example is the qhttp server:

class Server : public std::enable_shared_from_this<Server> {
public:
    using Ptr = std::shared_ptr<Server>;

    static Ptr create(...);

    ...

    //----- start() opens the server listening socket and installs the head of the asynchronous event
    //      handler chain onto the asio::io_service provided when the Server instance was constructed.
    //      Server execution may be halted either calling stop(), or by calling asio::io_service::stop()
    //      on the associated asio::io_service.

    void start();

    //----- stop() shuts down the server by closing all active sockets, including the server listening
    //      socket.  No new connections will be accepted, and handlers in progress will err out the next
    //      time they try to read/write from/to their client sockets.  A call to start() will be needed to
    //      resume server operation.

    void stop();

This interface works really well for unit testing.

// Begin periodically updating worker's status in the Replication System's registry
// in the detached thread. This will continue before the application gets terminated.
thread _registryUpdateThread(&CzarRegistry::_registryUpdateLoop, this);
_czarHeartbeatThrd = move(_registryUpdateThread);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of a local variable _registryUpdateThread doesn't comply with the LSST C++ style guide. It should not start with _. Besides, you don't need to introduce such a name as you can make the temporary and move in a single line:

 _czarHeartbeatThrd = move(thread((&CzarRegistry::_registryUpdateLoop, this)));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a private member variable so _loop and join can be used and other options can be available later for deciding what to do if communications have problems.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment was regarding this variable:

thread _registryUpdateThread

_czarHeartbeatThrd = move(_registryUpdateThread);

thread _registryWorkerUpdateThread(&CzarRegistry::_registryWorkerInfoLoop, this);
_czarWorkerInfoThrd = move(_registryWorkerUpdateThread);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment for the name of the local variable. You may consider replacing two lines with one:

_czarWorkerInfoThrd = move(thread(&CzarRegistry::_registryWorkerInfoLoop, this));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second comment regarding these threads is that there is a safer design. Yould could redefine the class as:

class CzarRegistry : public std:: enable_shared_from_this <CzarRegistry> {

And then create threads in the factory method:

    static Ptr create(std::shared_ptr<cconfig::CzarConfig> const& czarConfig) {
        auto ptr = Ptr(new CzarRegistry(czarConfig));
        ptr-> _czarHeartbeatThrd = std::move(std::thread([ptr]() {
            ptr->_registryUpdateLoop();
        });
        ptr->_czarWorkerInfoThrd = std::move(std::thread([ptr]() {
            ptr->_registryWorkerInfoLoop();
        });
        return ptr;
    }

The main advantage of this approach is that you no longer need to care about this from inside either method _registryUpdateLoop or _registryWorkerInfoLoop since ptr will be captured by threads by copy. The copy of the shared pointer will guarantee the life expectancy of CzarRegistry for your threads. This also eliminates any need in _loop. See my other comments on this subject downstream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from being able to stop the loop and call join for the threads, I'm also not sure what the exact behavior should be if communications between the czar and workers becomes unstable. The original behavior was to abort() which I thought was a bit drastic. However, failed registry contact attempts and failed connections to workers for a few may be the point to terminate the czar. I'm note sure where that line should be drawn.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing unstable connections usually is not a good reason for terminating a process. It's not so uncommon to run into those during a startup of a complex distributed application (Qserv, etc.):

  1. during the initial start time of the application
  2. during rolling upgrades (in Kubernetes)
  3. due to intermittent network problems
  4. due to crashes and restarts of individual services

Moreover, the main intent here is to make each component as resilient as possible. A decision to abort the application should be left to the human operators.

Good reasons for terminating a process (to count just a few):

  1. misconfiguration
  2. problems with resources
  3. programming bugs triggering assertions, divisions by 0, logic errors, etc.

@jgates108 jgates108 merged commit 4626bde into tickets/DM-43715 Apr 15, 2024
8 checks passed
@jgates108 jgates108 deleted the tickets/DM-43385 branch April 15, 2024 23:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants