-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the trpc thread exit procedure #68
Closed
jennifer-richards
wants to merge
66
commits into
milestone/monitoring
from
jennifer/peer_conn_cleanup
Closed
Fix the trpc thread exit procedure #68
jennifer-richards
wants to merge
66
commits into
milestone/monitoring
from
jennifer/peer_conn_cleanup
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Works, but not yet integrated with the build system.
This is not actually used for building the trust router!
* add response encoder * add partial test of response encoder * move tr_mon.h to include directory * move code common to req/resp from tr_mon_req.c to tr_mon.c * fix a couple warnings
This better matches other protocol submodule naming (tid_, trp_, gss_)
* Move tr_gss.[ch] to tr_gss_names.[ch], that is what the files contain * Add new tr_gss.[ch] containing generalized GSS request/response code * Refactor tids request handlers to use generalized code * First steps towards a monitoring interface handler, not functional * Rename listen_on_all_addrs() to tr_sock_listen_all() * Make better use of talloc in a few places * Clean up a few missing or unused #includes * Fix a few data types for the sake of pedantry
The trust router now builds, but the monitoring parser tests do not. * Eliminate extra layer of auth callback when using tr_gss.c, services using it now need only one auth callback * Document tr_gss.c's intended usage * Flesh out the MONS_INSTANCE structure * Fix a couple more pedantic data typing errors
Trust router now builds and opens monitoring port
* Actually encode the TID response! * Do not directly send responses from tids_req_handler(), set the properties in the response and return with an error code * Add hostname to MONS_INSTANCE * Update tids hostname after configuration change * Add a tid_resp_cpy() function to duplicate a TID_RESP into a struct that already exists
At this point, if you hack tr_mons_auth_handler() to always return 0 (success), then trmon can connect to the trust router's monitoring port and retrieve a test message. That counts as first contact, I guess. Actual functionality is still to come. * Create basic trmon utility based closely on tidc * Temporarily use void pointers for trps/tids handles in the MON_INSTANCE structure - there is a header file cycle that prevents compliation. Need to sort that out, but this works for the moment. * Fill in tr_msg handlers for monitoring message encoders/decoders * Revert to the monitoring msg decoder working from json, not a string, since that is what we need. This breaks the test programs for now.
* Implement minimal decoding of monitoring responses * Add tr_gss_client.[ch] to house GSS req/resp message exchange * Always use 'payload' as the key for MON_RESP payload, don't name it after the command that it is responding to * Use better reference count behavior for MON_RESP payload * Move typedefs out of mon_internal.h to mon.h to avoid cyclic header dependencies * Fix some minor integer type mismatches in option parser * Update various test programs to use extra argument to tr_msg_(en/de)code methods
I had assumed in a few places that TR_MSGs and the various message payload types were always allocated dynamically via talloc(). This is not a safe assumption - in a few places, we use stack-allocated TR_MSGs and these are all used outside our code via the libtr_tid library. We now use talloc when we can (i.e., when we have encoded or decoded a message and know we used talloc), but otherwise leave it to the calling code to properly manage memory.
Also some further cleanup of header files and data types.
* Keep a list of handlers as part of MONS_INSTANCE - each handles a command/opt_type pair - registered via mons_register_handler() * Scan the list of handlers when servicing a monitoring request * Add handlers for version and uptime, registered through tr_main.c (probably need to move these, but this works as a demo)
* Add a separate source file for TID-related monitoring handlers * Increment tids->req_count in the main process, otherwise it will always seem to be zero. This does mean any connection to the TID port is counted as a tid request, which is not perfect. *
* Track TID processes by pid * Add handlers for the TID req counts Still only check for terminated TID processes after the next one comes in, should either periodically sweep or check this after a child terminates and sends SIGCHLD
* Route forwarded request based on mapped APC, not the original COI * Refactor COI/APC mapping code out of tr_tids_req_handler(), which remains in desperate need of refactoring for clarity * Use accessors instead of direct reference to structure elements in a few places (still more to convert) * Don't assume TR_NAME buf is null-terminated (it always is AFAIK, but is not required by the data structure). Still more of these to fix * Rename tid_req_set_rp_orig_coi() to _set_orig_coi(). It's not exported as part of the public API and was not used in our code. I think this was originally a copy/paste error. This resolves https://bugs.launchpad.net/moonshot-tr/+bug/1765681
No functional changes
Some refactoring here and there, too.
* change show "serial" to "config_files" to reflect its function * suppress display of empty strings for unset / irrelevant values when returning routes / communities
# Conflicts: # tr/tr_tid.c # tr/tr_trp.c
The messaging between the main thread and the trpc (outgoing connection) threads allowed the trpc data to be cleaned up before the message queue was empty, causing incorrect mutex behavior and seg faults. This is (I hope!) solved adding an additional shutdown phase in which the main thread indicates that it has recognized that the trpc thread is done and that the trpc thread can safely exit. So far, I have not seen a failure of the system to handle a peer disconnecting. Prior to these changes, it failed every time with my current setup.
This was referenced Apr 27, 2018
The old iterator was completely broken, which was causing incomplete cleanup of realms that should have been expired. This may have been leaving the community membership table in an inconsistent state.
Replace tr_comm_memb_iter_all methods with ones that actually work
This should not be merged. It was superseded by later work. It will be part of the history via other pull requests. Closing. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The messaging between the main thread and the trpc (outgoing connection)
threads allowed the trpc data to be cleaned up before the message queue
was empty, causing incorrect mutex behavior and seg faults.
This is (I hope!) solved by adding an additional shutdown phase in which
the main thread indicates that it has recognized that the trpc thread
is done and that the trpc thread can safely exit.
So far, I have not seen a failure of the system to handle a peer
disconnecting. Prior to these changes, it failed every time with my
current setup.
-- I forgot to add this in the git comment, but I also found that I was mixing up the purpose of the
gssname
andpeer
in theTRP_CONNECTION
structure. These are bothTR_NAME
s. The former is the GSS service name of the local trust router (i.e., how we identified ourselves on this particular connection). The latter is the GSS name of the remote peer. During cleanup of incoming threads, the gssname rather than the peer name was used to decide which peer to mark as disconnected. This caused the search to fail, and peers were never marked as disconnected.This should not be merged until after jennifer/monitoring. Assigning to myself to resolve any conflicts once that happens.