Fix the trpc thread exit procedure #68

jennifer-richards · 2018-04-27T02:38:58Z

The messaging between the main thread and the trpc (outgoing connection)
threads allowed the trpc data to be cleaned up before the message queue
was empty, causing incorrect mutex behavior and seg faults.

This is (I hope!) solved by adding an additional shutdown phase in which
the main thread indicates that it has recognized that the trpc thread
is done and that the trpc thread can safely exit.

So far, I have not seen a failure of the system to handle a peer
disconnecting. Prior to these changes, it failed every time with my
current setup.
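
For anyone reading along, the extra shutdown phase can be pictured roughly as below. This is only a sketch using plain pthreads, not the actual trust router code: `demo_conn_state`, `demo_worker()` and `demo_shutdown()` are hypothetical names, and the real message queue is reduced to a counter. The worker announces that it is done, then waits for the main thread to acknowledge before it exits, so shared state is never cleaned up while messages are still queued.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state shared by the main and trpc threads. */
typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t cond;
  int queued;        /* messages sent by the worker, not yet read by main */
  bool worker_done;  /* phase 1: worker says "I am finished" */
  bool exit_ok;      /* phase 2: main says "seen it, you may exit" */
} demo_conn_state;

static void *demo_worker(void *arg)
{
  demo_conn_state *cs = arg;
  assert(cs != NULL);
  pthread_mutex_lock(&cs->lock);
  cs->queued++;                 /* final "disconnected" message */
  cs->worker_done = true;
  pthread_cond_broadcast(&cs->cond);
  while (!cs->exit_ok)          /* do not exit until main acknowledges */
    pthread_cond_wait(&cs->cond, &cs->lock);
  pthread_mutex_unlock(&cs->lock);
  return NULL;
}

/* Runs one worker through the handshake. Returns the number of messages
 * still queued at cleanup time (0 if the ordering held), or -1 on error. */
int demo_shutdown(void)
{
  demo_conn_state cs = { .queued = 0, .worker_done = false, .exit_ok = false };
  pthread_t tid;
  int leftover;

  pthread_mutex_init(&cs.lock, NULL);
  pthread_cond_init(&cs.cond, NULL);
  if (pthread_create(&tid, NULL, demo_worker, &cs) != 0) {
    pthread_cond_destroy(&cs.cond);
    pthread_mutex_destroy(&cs.lock);
    return -1;
  }

  pthread_mutex_lock(&cs.lock);
  while (!cs.worker_done)       /* wait for phase 1 */
    pthread_cond_wait(&cs.cond, &cs.lock);
  cs.queued = 0;                /* drain everything the worker sent */
  cs.exit_ok = true;            /* phase 2: release the worker */
  pthread_cond_broadcast(&cs.cond);
  pthread_mutex_unlock(&cs.lock);

  pthread_join(tid, NULL);      /* only now is it safe to clean up */
  leftover = cs.queued;
  pthread_cond_destroy(&cs.cond);
  pthread_mutex_destroy(&cs.lock);
  return leftover;
}
```

The point of the second flag is ordering: the main thread drains the queue before setting `exit_ok`, and destroys the mutex only after `pthread_join()` returns.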

-- I forgot to add this in the git comment, but I also found that I was mixing up the purpose of the gssname and peer in the TRP_CONNECTION structure. These are both TR_NAMEs. The former is the GSS service name of the local trust router (i.e., how we identified ourselves on this particular connection). The latter is the GSS name of the remote peer. During cleanup of incoming threads, the gssname rather than the peer name was used to decide which peer to mark as disconnected. This caused the search to fail, and peers were never marked as disconnected.
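
To illustrate the gssname/peer mix-up, here is a small sketch of the lookup. The types are hypothetical stand-ins (`demo_name` for TR_NAME, `demo_conn` for TRP_CONNECTION, `demo_peer` for a peer table entry), not the real structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for TR_NAME: buffer plus explicit length. */
typedef struct {
  const char *buf;
  size_t len;
} demo_name;

/* Hypothetical stand-in for TRP_CONNECTION. */
typedef struct {
  demo_name gssname;  /* GSS service name of the local trust router */
  demo_name peer;     /* GSS name of the remote peer */
} demo_conn;

/* Hypothetical peer table entry. */
typedef struct {
  demo_name name;
  int connected;
} demo_peer;

static int demo_name_eq(const demo_name *a, const demo_name *b)
{
  return a->len == b->len && memcmp(a->buf, b->buf, a->len) == 0;
}

/* Mark the entry for the connection's remote peer as disconnected.
 * Returns 1 if an entry was found, 0 otherwise. */
int demo_mark_disconnected(demo_peer *table, size_t n, const demo_conn *conn)
{
  assert(table != NULL && conn != NULL);
  for (size_t i = 0; i < n; i++) {
    /* Search by conn->peer. Searching by conn->gssname (our own service
     * name) would normally match nothing in a table of remote peers. */
    if (demo_name_eq(&table[i].name, &conn->peer)) {
      table[i].connected = 0;
      return 1;
    }
  }
  return 0;
}
```

With the search keyed on `conn->gssname` instead, the loop falls through and returns 0, which matches the symptom described above: the peer is never marked as disconnected.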

This should not be merged until after jennifer/monitoring. Assigning to myself to resolve any conflicts once that happens.


* Move tr_gss.[ch] to tr_gss_names.[ch], that is what the files contain
* Add new tr_gss.[ch] containing generalized GSS request/response code
* Refactor tids request handlers to use generalized code
* First steps towards a monitoring interface handler, not functional
* Rename listen_on_all_addrs() to tr_sock_listen_all()
* Make better use of talloc in a few places
* Clean up a few missing or unused #includes
* Fix a few data types for the sake of pedantry

The trust router now builds, but the monitoring parser tests do not.

* Eliminate extra layer of auth callback when using tr_gss.c, services using it now need only one auth callback
* Document tr_gss.c's intended usage
* Flesh out the MONS_INSTANCE structure
* Fix a couple more pedantic data typing errors


* Actually encode the TID response!
* Do not directly send responses from tids_req_handler(), set the properties in the response and return with an error code
* Add hostname to MONS_INSTANCE
* Update tids hostname after configuration change
* Add a tid_resp_cpy() function to duplicate a TID_RESP into a struct that already exists

At this point, if you hack tr_mons_auth_handler() to always return 0 (success), then trmon can connect to the trust router's monitoring port and retrieve a test message. That counts as first contact, I guess. Actual functionality is still to come.

* Create basic trmon utility based closely on tidc
* Temporarily use void pointers for trps/tids handles in the MON_INSTANCE structure - there is a header file cycle that prevents compilation. Need to sort that out, but this works for the moment.
* Fill in tr_msg handlers for monitoring message encoders/decoders
* Revert to the monitoring msg decoder working from json, not a string, since that is what we need. This breaks the test programs for now.

* Implement minimal decoding of monitoring responses
* Add tr_gss_client.[ch] to house GSS req/resp message exchange
* Always use 'payload' as the key for MON_RESP payload, don't name it after the command that it is responding to
* Use better reference count behavior for MON_RESP payload
* Move typedefs out of mon_internal.h to mon.h to avoid cyclic header dependencies
* Fix some minor integer type mismatches in option parser
* Update various test programs to use extra argument to tr_msg_(en/de)code methods

I had assumed in a few places that TR_MSGs and the various message payload types were always allocated dynamically via talloc(). This is not a safe assumption - in a few places, we use stack-allocated TR_MSGs and these are all used outside our code via the libtr_tid library. We now use talloc when we can (i.e., when we have encoded or decoded a message and know we used talloc), but otherwise leave it to the calling code to properly manage memory.


* Keep a list of handlers as part of MONS_INSTANCE - each handles a command/opt_type pair - registered via mons_register_handler()
* Scan the list of handlers when servicing a monitoring request
* Add handlers for version and uptime, registered through tr_main.c (probably need to move these, but this works as a demo)
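
As a rough picture of the handler list described above, here is a minimal sketch. It is not the real MONS_INSTANCE or the real mons_register_handler() signature: every `demo_` name, the string-typed command/opt_type keys, and the fixed-size array are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_HANDLERS 16

/* Hypothetical handler signature: writes a reply into out, returns 0 on success. */
typedef int (*demo_handler_fn)(char *out, size_t outlen);

typedef struct {
  const char *command;   /* e.g. "show" */
  const char *opt_type;  /* e.g. "version" */
  demo_handler_fn fn;
} demo_handler;

/* Hypothetical stand-in for the handler list kept in the instance. */
typedef struct {
  demo_handler handlers[DEMO_MAX_HANDLERS];
  size_t n_handlers;
} demo_instance;

/* Register a handler for one command/opt_type pair. Returns 0, or -1 if full. */
int demo_register_handler(demo_instance *inst, const char *command,
                          const char *opt_type, demo_handler_fn fn)
{
  assert(inst != NULL && command != NULL && opt_type != NULL && fn != NULL);
  if (inst->n_handlers >= DEMO_MAX_HANDLERS)
    return -1;
  inst->handlers[inst->n_handlers++] = (demo_handler){ command, opt_type, fn };
  return 0;
}

/* Scan the list and call the first handler matching the pair.
 * Returns the handler's result, or -1 if no handler matches. */
int demo_dispatch(const demo_instance *inst, const char *command,
                  const char *opt_type, char *out, size_t outlen)
{
  for (size_t i = 0; i < inst->n_handlers; i++) {
    const demo_handler *h = &inst->handlers[i];
    if (strcmp(h->command, command) == 0 && strcmp(h->opt_type, opt_type) == 0)
      return h->fn(out, outlen);
  }
  return -1;
}

/* Example handler; the reply text is a placeholder, not a real version. */
int demo_version_handler(char *out, size_t outlen)
{
  snprintf(out, outlen, "%s", "demo-version");
  return 0;
}
```

A linear scan is enough here because the number of command/opt_type pairs is small; a request with no registered handler simply gets an error code back.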


* Track TID processes by pid
* Add handlers for the TID req counts

Still only check for terminated TID processes after the next one comes in, should either periodically sweep or check this after a child terminates and sends SIGCHLD.

* Route forwarded request based on mapped APC, not the original COI
* Refactor COI/APC mapping code out of tr_tids_req_handler(), which remains in desperate need of refactoring for clarity
* Use accessors instead of direct reference to structure elements in a few places (still more to convert)
* Don't assume TR_NAME buf is null-terminated (it always is AFAIK, but is not required by the data structure). Still more of these to fix
* Rename tid_req_set_rp_orig_coi() to _set_orig_coi(). It's not exported as part of the public API and was not used in our code. I think this was originally a copy/paste error.

This resolves https://bugs.launchpad.net/moonshot-tr/+bug/1765681
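
On the null-termination point, one way to print a length-delimited name safely is the `%.*s` conversion, which takes the maximum number of bytes to read as an `int` argument. The sketch below assumes a hypothetical `demo_name` type (a buffer plus a length) in place of TR_NAME; `demo_name_format()` is likewise made up for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for TR_NAME: buf need not be null-terminated. */
typedef struct {
  const char *buf;
  int len;
} demo_name;

/* Format a name into out, reading at most name->len bytes from name->buf.
 * Returns the snprintf() result: the length the full output would have. */
int demo_name_format(char *out, size_t outlen, const demo_name *name)
{
  assert(out != NULL && name != NULL && name->len >= 0);
  /* The precision argument (.*) bounds the read, so a missing '\0' is harmless. */
  return snprintf(out, outlen, "peer=%.*s", name->len, name->buf);
}
```

A plain `%s` here would keep reading until it happened to find a zero byte, which is the assumption the commit is moving away from.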


jennifer-richards · 2018-05-03T21:50:30Z

This should not be merged. It was superseded by later work. It will be part of the history via other pull requests. Closing.

jennifer-richards added 30 commits April 10, 2018 22:05

First pass at monitoring request encoder/decoder and tests

48283a0

Works, but not yet integrated with the build system.

Add CMakeLists.txt for CLion integration

1543f89

This is not actually used for building the trust router!

Add req encode/decode tests to make system, move from test/ to tests/

29bc5ee

Add encoder for monitoring responses

36a4712

* add response encoder
* add partial test of response encoder
* move tr_mon.h to include directory
* move code common to req/resp from tr_mon_req.c to tr_mon.c
* fix a couple warnings

Change tr_mon_ prefix to mon_, no functional changes

34a48da

This better matches other protocol submodule naming (tid_, trp_, gss_)

Factor out identical tids_listen/trps_listen functions into shared copy

94592a2

Rename tr_gss.[ch] to tr_gss_names.[ch]

3fcf109

Fix accidentally changed variable name in function prototype

d253bd9

Remove several unused parameters and clean up some lint warnings

cdb5040

Add stub of handler for monitoring requests

2664b9d

Trust router now builds and opens monitoring port

Move internal config parser to a separate file

5d6bc31

Refactor to eliminate repeated code in tr_cfg_parse_internal()

97e8a58

Parse monitoring port from internal configuration

44c8ebb

Enclose macro arguments in parentheses

b9cb3ff

Make better use of talloc for TR_MSG handling

d48e5dd

Fix makefile, full make now succeeds

4d39775

Use TR_MSG instead of encoded strings in GSS request handler interface

f55d550

Also some further cleanup of header files and data types.

First steps toward actually handling monitoring requests

cbe31f4

First functional monitoring server - can return the trust router version

3d17524

Get rid of CLion warnings about undefined PACKAGE_* macros

c75a2de

Collect return codes from monitoring handlers and indicate errors

58571ae

Add TID_REQ_COUNT handler

283699f

* Add a separate source file for TID-related monitoring handlers
* Increment tids->req_count in the main process, otherwise it will always seem to be zero. This does mean any connection to the TID port is counted as a tid request, which is not perfect.

jennifer-richards added 17 commits April 20, 2018 13:28

Support 'show serial' monitoring request

d664bc9

Break tr_config.c into smaller chunks

827e90d

No functional changes

Bump version number (but not shared library version yet). Now 3.3.1~1

06dbc40

Bump version number (but not shared library version yet). Now 3.3.1~1

52b8604

Read GSS credentials for monitoring service

d0cb62b

Some refactoring here and there, too.

Rename acceptor_realm/name to _hostname/service, add some debug output

bafa2aa

Clean up monitoring format/naming

86f808d

* change show "serial" to "config_files" to reflect its function
* suppress display of empty strings for unset / irrelevant values when returning routes / communities

Check in changes that were accidentally omitted

e03b7d9

Fix lines that were swapped accidentally

f53a6f7

Merge remote-tracking branch 'origin/v3.3.0' into jennifer/monitoring

5214f20

# Conflicts:
#	tr/tr_tid.c
#	tr/tr_trp.c

Bump versions to 3.4.0~1 (did not update ABI version yet)

0bc07f7

Add missing %.*s so debug message includes GSS name

f1c739d

Correctly display RP realms in the 'show communities' response

bad2cb0

Don't display "last_connection_attempt" if there is not one

77b6d2f

Add some comments, a bit of code clean up

b8efc0d

jennifer-richards added this to the Monitoring Interface milestone Apr 27, 2018

jennifer-richards self-assigned this Apr 27, 2018

This was referenced Apr 27, 2018

Trust router getting stuck when a peer disconnects #67

Closed

Runaway trust_router processes #64

Closed

Segfault when preparing community update #63

Closed

jennifer-richards added 2 commits April 27, 2018 16:20

Replace tr_comm_memb_iter_all methods with ones that actually work

74efd26

The old iterator was completely broken, which was causing incomplete cleanup of realms that should have been expired. This may have been leaving the community membership table in an inconsistent state.

Merge pull request #70 from painless-security/jennifer/comm_sweep_fix

5dbba4e

Replace tr_comm_memb_iter_all methods with ones that actually work

jennifer-richards closed this May 3, 2018

jennifer-richards deleted the jennifer/peer_conn_cleanup branch May 7, 2018 16:06
