Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the trpc thread exit procedure #68

Closed

Conversation

jennifer-richards
Copy link
Collaborator

@jennifer-richards jennifer-richards commented Apr 27, 2018

The messaging between the main thread and the trpc (outgoing connection)
threads allowed the trpc data to be cleaned up before the message queue
was empty, causing incorrect mutex behavior and seg faults.

This is (I hope!) solved by adding an additional shutdown phase in which
the main thread indicates that it has recognized that the trpc thread
is done and that the trpc thread can safely exit.

So far, I have not seen a failure of the system to handle a peer
disconnecting. Prior to these changes, it failed every time with my
current setup.

-- I forgot to add this in the git comment, but I also found that I was mixing up the purpose of the gssname and peer in the TRP_CONNECTION structure. These are both TR_NAMEs. The former is the GSS service name of the local trust router (i.e., how we identified ourselves on this particular connection). The latter is the GSS name of the remote peer. During cleanup of incoming threads, the gssname rather than the peer name was used to decide which peer to mark as disconnected. This caused the search to fail, and peers were never marked as disconnected.

This should not be merged until after jennifer/monitoring. Assigning to myself to resolve any conflicts once that happens.

Works, but not yet integrated with the build system.
This is not actually used for building the trust router!
  * add response encoder
  * add partial test of response encoder
  * move tr_mon.h to include directory
  * move code common to req/resp from tr_mon_req.c to tr_mon.c
  * fix a couple warnings
This better matches other protocol submodule naming (tid_, trp_, gss_)
  * Move tr_gss.[ch] to tr_gss_names.[ch], that is what the files contain
  * Add new tr_gss.[ch] containing generalized GSS request/response code
  * Refactor tids request handlers to use generalized code

  * First steps towards a monitoring interface handler, not functional
  * Rename listen_on_all_addrs() to tr_sock_listen_all()
  * Make better use of talloc in a few places
  * Clean up a few missing or unused #includes
  * Fix a few data types for the sake of pedantry
The trust router now builds, but the monitoring parser tests do not.

  * Eliminate extra layer of auth callback when using tr_gss.c, services
    using it now need only one auth callback
  * Document tr_gss.c's intended usage
  * Flesh out the MONS_INSTANCE structure
  * Fix a couple more pedantic data typing errors
Trust router now builds and opens monitoring port
  * Actually encode the TID response!
  * Do not directly send responses from tids_req_handler(), set the
    properties in the response and return with an error code
  * Add hostname to MONS_INSTANCE
  * Update tids hostname after configuration change
  * Add a tid_resp_cpy() function to duplicate a TID_RESP into a struct
    that already exists
At this point, if you hack tr_mons_auth_handler() to always return 0
(success), then trmon can connect to the trust router's monitoring port
and retrieve a test message. That counts as first contact, I guess.
Actual functionality is still to come.

  * Create basic trmon utility based closely on tidc
  * Temporarily use void pointers for trps/tids handles in the MON_INSTANCE
    structure - there is a header file cycle that prevents compliation.
    Need to sort that out, but this works for the moment.
  * Fill in tr_msg handlers for monitoring message encoders/decoders
  * Revert to the monitoring msg decoder working from json, not a string,
    since that is what we need. This breaks the test programs for now.
  * Implement minimal decoding of monitoring responses
  * Add tr_gss_client.[ch] to house GSS req/resp message exchange
  * Always use 'payload' as the key for MON_RESP payload, don't name it
    after the command that it is responding to
  * Use better reference count behavior for MON_RESP payload
  * Move typedefs out of mon_internal.h to mon.h to avoid cyclic header
    dependencies
  * Fix some minor integer type mismatches in option parser
  * Update various test programs to use extra argument to
    tr_msg_(en/de)code methods
I had assumed in a few places that TR_MSGs and the various message
payload types were always allocated dynamically via talloc(). This is
not a safe assumption - in a few places, we use stack-allocated TR_MSGs
and these are all used outside our code via the libtr_tid library.
We now use talloc when we can (i.e., when we have encoded or decoded
a message and know we used talloc), but otherwise leave it to the calling
code to properly manage memory.
Also some further cleanup of header files and data types.
  * Keep a list of handlers as part of MONS_INSTANCE
    - each handles a command/opt_type pair
    - registered via mons_register_handler()
  * Scan the list of handlers when servicing a monitoring request
  * Add handlers for version and uptime, registered through tr_main.c
    (probably need to move these, but this works as a demo)
  * Add a separate source file for TID-related monitoring handlers
  * Increment tids->req_count in the main process, otherwise it will
    always seem to be zero. This does mean any connection to the TID
    port is counted as a tid request, which is not perfect.
  *
  * Track TID processes by pid
  * Add handlers for the TID req counts

Still only check for terminated TID processes after the next one comes
in, should either periodically sweep or check this after a child
terminates and sends SIGCHLD
  * Route forwarded request based on mapped APC, not the original COI
  * Refactor COI/APC mapping code out of tr_tids_req_handler(), which
    remains in desperate need of refactoring for clarity
  * Use accessors instead of direct reference to structure elements in a
    few places (still more to convert)
  * Don't assume TR_NAME buf is null-terminated (it always is AFAIK, but
    is not required by the data structure). Still more of these to fix
  * Rename tid_req_set_rp_orig_coi() to _set_orig_coi(). It's not exported
    as part of the public API and was not used in our code. I think this
    was originally a copy/paste error.

This resolves https://bugs.launchpad.net/moonshot-tr/+bug/1765681
Some refactoring here and there, too.
  * change show "serial" to "config_files" to reflect its function
  * suppress display of empty strings for unset / irrelevant values when
    returning routes / communities
The messaging between the main thread and the trpc (outgoing connection)
threads allowed the trpc data to be cleaned up before the message queue
was empty, causing incorrect mutex behavior and seg faults.

This is (I hope!) solved adding an additional shutdown phase in which
the main thread indicates that it has recognized that the trpc thread
is done and that the trpc thread can safely exit.

So far, I have not seen a failure of the system to handle a peer
disconnecting. Prior to these changes, it failed every time with my
current setup.
The old iterator was completely broken, which was causing incomplete
cleanup of realms that should have been expired. This may have been
leaving the community membership table in an inconsistent state.
Replace tr_comm_memb_iter_all methods with ones that actually work
@jennifer-richards
Copy link
Collaborator Author

This should not be merged. It was superseded by later work. It will be part of the history via other pull requests. Closing.

@jennifer-richards jennifer-richards deleted the jennifer/peer_conn_cleanup branch May 7, 2018 16:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant