Concurrent Session Control #4138

Merged
merged 4 commits into from Sep 17, 2020

Contributor

@fspmarshall fspmarshall commented Jul 28, 2020

Support for limiting concurrent sessions (#2938), implemented via new semaphore API.

EDIT: See updated PR description here.

Old PR Description

TODOs

  • Primary:

    • Integration tests to verify concurrent session limits are correctly applied.
    • Integration tests to verify termination of SSH connections when lease cannot complete before timeout.
    • Unit testing of the SemaphoreLock abstraction.
    • Audit event for users who attempt to exceed max_concurrent_sessions.
    • Make sessctl lease expiry a cluster-level config option.
  • Secondary:

    • Unique error variant to trigger early SemaphoreLock release on failed renewal (useful for backwards compatibility in the event we end up using the semaphore system as the basis for Session Termination in the future).
    • Track kube proxy connections in addition to ssh connections (likely requires updating github.com/gravitational/oxy/forward to support cancellation).
    • Improve UX around SSH connections that are closed due to loss of a previously held lock (potentially difficult).

Overview

The role resource has been updated to support a new max_concurrent_sessions field:

kind: role
metadata:
  name: some-role
  # ...
spec:
  options:
    max_concurrent_sessions: 3
  # ...
version: v3

When a node encounters a new inbound SSH connection from a user who has a max_concurrent_sessions limit defined in one or more of their roles, it takes the minimum of the defined limits as the maximum for that user. The node then attempts to acquire a lease for the corresponding semaphore (sessctl/<username>) by invoking the auth server's semaphore API. If the number of existing leases is already greater than or equal to the derived max_concurrent_sessions value, the ssh connection is rejected.
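The limit derivation and admission check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names effectiveMaxSessions and admit are hypothetical, and treating 0 as "no limit defined by this role" is an assumption made for the sketch.

```go
package main

import "fmt"

// effectiveMaxSessions returns the smallest positive limit among the
// per-role values. A value of 0 (or less) is treated as "this role
// defines no limit", which is an assumption of this sketch. The second
// return value is false when no role defines a limit at all.
func effectiveMaxSessions(roleLimits []int64) (int64, bool) {
	var lowest int64
	found := false
	for _, l := range roleLimits {
		if l <= 0 {
			continue
		}
		if !found || l < lowest {
			lowest = l
			found = true
		}
	}
	return lowest, found
}

// admit reports whether one more connection fits under the limit,
// given the number of leases that already exist.
func admit(existingLeases, max int64) bool {
	return existingLeases < max
}

func main() {
	limit, ok := effectiveMaxSessions([]int64{5, 0, 3})
	fmt.Println(limit, ok)       // 3 true
	fmt.Println(admit(2, limit)) // true
	fmt.Println(admit(3, limit)) // false
}
```

Taking the minimum means that the most restrictive role always wins, regardless of how many roles a user holds.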

When using tsh, the rejection looks like this:

$ tsh ssh alice@example.com
error: ssh: rejected: administratively prohibited (cannot acquire semaphore sessctl/alice (max leases reached))

When using openssh, the rejection looks like this:

$ ssh alice@example.com
channel 0: open failed: administratively prohibited: cannot acquire semaphore sessctl/alice (max leases reached)
Connection to example.com closed.

If a node does successfully acquire a semaphore lease, then the ssh connection continues normally, and the lease is periodically renewed in the background.

Semaphore leases have an expiry, and the node must successfully renew the lease before this expiry. Nodes begin attempting to renew a lease when it is halfway to its expiry point. If a node fails to successfully renew before the expiry time, the node will terminate the ssh connection.

Existing leases can be viewed like so: (see this comment for updated syntax)

$ tctl sem ls
Kind    Name  LeaseID                              Holder                               Expires             
------- ----- ------------------------------------ ------------------------------------ ------------------- 
sessctl alice a0585fc3-9770-44b7-b836-4d9c01751ee5 dc43a481-3063-4476-a9c7-505d1fffc077 28 Jul 20 00:35 UTC 
sessctl alice 0f3e3359-af4f-4a6b-9a36-ebb906e062c6 dc43a481-3063-4476-a9c7-505d1fffc077 28 Jul 20 00:35 UTC

Or, more verbosely:

$ tctl sem ls --format=json
[
  {
    "kind": "semaphore",
    "sub_kind": "sessctl",
    "version": "v3",
    "metadata": {
      "name": "alice",
      "expires": "2020-07-28T00:35:26.828932214Z"
    },
    "spec": {
      "leases": [
        {
          "lease_id": "a0585fc3-9770-44b7-b836-4d9c01751ee5",
          "expires": "2020-07-28T00:35:26.828932214Z",
          "holder": "dc43a481-3063-4476-a9c7-505d1fffc077"
        },
        {
          "lease_id": "0f3e3359-af4f-4a6b-9a36-ebb906e062c6",
          "expires": "2020-07-28T00:35:22.230492119Z",
          "holder": "dc43a481-3063-4476-a9c7-505d1fffc077"
        }
      ]
    }
  }
]

The semaphore API allows lease_id and holder to be arbitrary strings (though lease_id must be unique). When a node acquires a semaphore lease for the purposes of limiting concurrent SSH connections, lease_id will be a random UUID, and holder will be the UUID of the node in question.
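As a rough illustration of how the JSON output above could be consumed, the sketch below decodes it and counts leases per holder. The struct and function names (lease, semaphoreDoc, leasesPerHolder) are hypothetical; only the JSON field names are taken from the output shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// lease mirrors one entry of spec.leases in the JSON output.
type lease struct {
	LeaseID string    `json:"lease_id"`
	Expires time.Time `json:"expires"`
	Holder  string    `json:"holder"`
}

// semaphoreDoc mirrors the subset of fields this sketch needs.
type semaphoreDoc struct {
	SubKind  string `json:"sub_kind"`
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Spec struct {
		Leases []lease `json:"leases"`
	} `json:"spec"`
}

// sample is a trimmed copy of the output shown above.
const sample = `[{"kind":"semaphore","sub_kind":"sessctl","version":"v3",
 "metadata":{"name":"alice","expires":"2020-07-28T00:35:26.828932214Z"},
 "spec":{"leases":[
  {"lease_id":"a0585fc3-9770-44b7-b836-4d9c01751ee5","expires":"2020-07-28T00:35:26.828932214Z","holder":"dc43a481-3063-4476-a9c7-505d1fffc077"},
  {"lease_id":"0f3e3359-af4f-4a6b-9a36-ebb906e062c6","expires":"2020-07-28T00:35:22.230492119Z","holder":"dc43a481-3063-4476-a9c7-505d1fffc077"}]}}]`

// leasesPerHolder counts how many leases each holder (node UUID) has.
func leasesPerHolder(raw string) (map[string]int, error) {
	var docs []semaphoreDoc
	if err := json.Unmarshal([]byte(raw), &docs); err != nil {
		return nil, err
	}
	counts := map[string]int{}
	for _, d := range docs {
		for _, l := range d.Spec.Leases {
			counts[l.Holder]++
		}
	}
	return counts, nil
}

func main() {
	counts, err := leasesPerHolder(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(counts["dc43a481-3063-4476-a9c7-505d1fffc077"]) // 2
}
```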

A pair of additional hidden tctl commands, tctl sem cancel and tctl sem rm, allowed removal of individual leases and entire semaphores respectively. These commands effectively constitute a form of asynchronous session termination, but do not prevent new sessions from being created. Update: these two functions have been merged into a single tctl sessctl rm subcommand. It remains hidden to avoid confusion: removing individual sessions may be needed in some debugging scenarios, but it is not a substitute for proper session termination.

Gotchas/Clarifications

  • Since Concurrent Session Control is applied at the authenticated SSH connection level, multiple individual "sessions" may appear in the audit log for each individual lease (in practice, this should only occur when connection reuse systems such as ControlMaster are in play).
  • Semaphores do not store a global lease maximum as part of their state. Instead, a lease acquisition attempt includes a parameter specifying the limit to be observed by the auth server for that specific request. This mechanism is designed to avoid problems with global configuration sync and unexpected behavior due to distributed ordering issues. The result of this is that if state within a teleport cluster is out of sync, each node will individually refuse to create connections if the number of concurrent connections would exceed the max_concurrent_sessions value as seen by that node.
  • Semaphore leases are renewed only by ID, and a single lease can theoretically be concurrently renewed by multiple separate entities (so long as the second entity is willing to overlook the AlreadyExists error). This is an intentional design choice since some "session" concepts that we may want to track in the future won't exist on a unique machine (kubernetes API, AAP, etc...).
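The "no stored maximum" design from the list above can be sketched with a toy in-memory semaphore. This is an assumption-laden illustration, not the auth server's implementation: the type and method names are hypothetical, and lease expiry and renewal are omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

var errMaxLeases = errors.New("max leases reached")

// semaphore holds only the current leases; it stores no maximum.
type semaphore struct {
	leases map[string]string // lease ID -> holder
}

func newSemaphore() *semaphore {
	return &semaphore{leases: map[string]string{}}
}

// acquire adds a lease if fewer than maxLeases are currently held.
// The limit is supplied by the caller on every request, so two callers
// with different views of the limit are each held to their own value.
func (s *semaphore) acquire(leaseID, holder string, maxLeases int) error {
	if len(s.leases) >= maxLeases {
		return errMaxLeases
	}
	if _, exists := s.leases[leaseID]; exists {
		return errors.New("lease already exists")
	}
	s.leases[leaseID] = holder
	return nil
}

// cancel removes a single lease by ID.
func (s *semaphore) cancel(leaseID string) {
	delete(s.leases, leaseID)
}

func main() {
	s := newSemaphore()
	fmt.Println(s.acquire("lease-a", "node-1", 3)) // <nil>
	fmt.Println(s.acquire("lease-b", "node-1", 3)) // <nil>
	// node-2 still sees an older, stricter limit of 2 and is refused.
	fmt.Println(s.acquire("lease-c", "node-2", 2)) // max leases reached
	// node-1 sees the limit as 3 and is admitted.
	fmt.Println(s.acquire("lease-c", "node-1", 3)) // <nil>
}
```

In this sketch, a node whose view of the role is stale never admits more connections than the limit it knows about, which matches the behavior described in the second bullet.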

@fspmarshall fspmarshall force-pushed the fspmarshall/session-control-2 branch from 7c6c98c to 53e1ab3 Compare July 28, 2020 20:22
@benarent
Contributor

benarent commented Jul 29, 2020

Thanks for the awesome PR Description.

Old PR Feedback

My initial gut reaction, is that we're leaking the implementation detail of semaphores to customers. It feels like a new concept, and I'm proposing for tctl we stick to sessions vs sems?

# lists active sessions
$ tctl sem ls
+ $ tctl sessions ls

# Cancel vs rm? Aren't these the same?
$ tctl sem cancel
+ $ tctl session deny

# Removes / kills session
$ tctl sem rm 
+ $ tctl session rm

Also for $ tctl sem ls Can you find which node that users is accessing?

@fspmarshall
Contributor Author

fspmarshall commented Jul 29, 2020

My initial gut reaction, is that we're leaking the implementation detail of semaphores to customers. It feels like a new concept, and I'm proposing for tctl we stick to sessions vs sems?

You are absolutely right that we are leaking the semaphore concept... I'm hesitant to make a tctl command called sessions however, because the audit log has already given users an intuitive understanding of what a tctl sessions ls would return... and it doesn't quite map to the result of the current tctl sem ls. The Session Control feature is limiting the number of concurrent authenticated connections, whereas the UI and audit log treat individual ssh exec/shell operations as sessions.

Even if we wanted to, I don't see how we could make the session construct used by the audit log match up with the NIST control, because the audit log's session construct doesn't have a 1-to-1 mapping to an authenticated user (due to the "session joining" feature).

We could create a tctl sessctl subcommand to provide a more tailored experience for specifically managing the Session Control feature, rather than using the general semaphore concept. Perhaps something like this:

$ tctl sessctl ls
User  Sem ID                               Node        Node ID
----- ------------------------------------ ----------- ------------------------------------
alice a0585fc3-9770-44b7-b836-4d9c01751ee5 main-db-345 dc43a481-3063-4476-a9c7-505d1fffc077
alice 0f3e3359-af4f-4a6b-9a36-ebb906e062c6 main-db-345 dc43a481-3063-4476-a9c7-505d1fffc077

Do you think that is a better UX? Definitely a little more self-explanatory.


# Cancel vs rm? Aren't these the same?

Each semaphore contains some number of leases; if alice is under Session Control and has two currently active connections, then a semaphore named sessctl/alice will exist, and it will contain two leases (one for each connection). The cancel operation removes a specific lease, whereas the rm operation removes the entire semaphore (which is equivalent to removing all leases).

Effectively, cancel will terminate a specific connection, whereas rm will terminate all sessions related to the user. As it exists now, the rm operation can also be used to kill all connections in the cluster which are under Concurrent Session Control, but it won't let you do that unless you supply -f/--force.

If we went with the tctl sessctl subcommand strategy discussed above, we could probably unify these two concepts into a single subcommand:

$ tctl sessctl rm [--force] [--user=<username>] [<id>]...

Also for $ tctl sem ls Can you find which node that users is accessing?

The Holder column contains the UUID of the node which holds the lease. Using that value, one can run tctl get nodes/<uuid> to find out all the details of the node that the user is accessing.

@benarent
Contributor

I edited the contents of @fspmarshall's comment based on our call yesterday, just some tweaks to the output of tctl sessctl ls.

@fspmarshall fspmarshall force-pushed the fspmarshall/session-control-2 branch from e0319e9 to 652da5f Compare August 3, 2020 19:20
@fspmarshall fspmarshall force-pushed the fspmarshall/session-control-2 branch 3 times, most recently from 140a163 to ea85056 Compare August 11, 2020 18:39
@fspmarshall
Contributor Author

Updated tctl subcommand:

Regular:

$ tctl sessctl ls
User  LeaseID                              Host 
----- ------------------------------------ ----------- 
alice 8c7ffdb8-6cf6-4688-943e-0694ac456302 example.com 
alice ac2fafef-2519-4e74-a6e4-6f7b44870fc3 example.com 

Verbose:

$ tctl sessctl ls -v
User  LeaseID                              NodeID                               Expires    
----- ------------------------------------ ------------------------------------ -------------------  
alice 8c7ffdb8-6cf6-4688-943e-0694ac456302 dc43a481-3063-4476-a9c7-505d1fffc077 11 Aug 20 18:46 UTC 
alice ac2fafef-2519-4e74-a6e4-6f7b44870fc3 dc43a481-3063-4476-a9c7-505d1fffc077 11 Aug 20 18:46 UTC 

Json:

$ tctl sessctl ls --format=json
[
  {
    "kind": "semaphore",
    "sub_kind": "sessctl",
    "version": "v3",
    "metadata": {
      "name": "alice",
      "expires": "2020-08-11T18:46:48.479388046Z"
    },
    "spec": {
      "leases": [
        {
          "lease_id": "8c7ffdb8-6cf6-4688-943e-0694ac456302",
          "expires": "2020-08-11T18:46:43.710496567Z",
          "holder": "dc43a481-3063-4476-a9c7-505d1fffc077"
        },
        {
          "lease_id": "ac2fafef-2519-4e74-a6e4-6f7b44870fc3",
          "expires": "2020-08-11T18:46:48.479388046Z",
          "holder": "dc43a481-3063-4476-a9c7-505d1fffc077"
        }
      ]
    }
  }
]

@fspmarshall fspmarshall marked this pull request as ready for review August 11, 2020 18:57
lib/config/fileconf.go (outdated, resolved)
lib/services/local/presence.go (outdated, resolved)
lib/services/local/presence.go (outdated, resolved)
lib/services/semaphore.go (outdated, resolved)
lib/services/semaphore.go (resolved)
lib/services/types.proto (outdated, resolved)
lib/srv/regular/sshserver.go (outdated, resolved)
tool/tctl/common/semaphore_command.go (outdated, resolved)
tool/tctl/common/semaphore_command.go (outdated, resolved)
lib/auth/proto/auth.proto (outdated, resolved)
@webvictim
Contributor

As a relative newbie to Go, I like this implementation and think it seems pretty clean. I'll leave the more hardcore reviewing to backend engineers.

@fspmarshall fspmarshall force-pushed the fspmarshall/session-control-2 branch 2 times, most recently from 2e0f393 to 0745ea4 Compare August 12, 2020 20:37
integration/integration_test.go (resolved)
lib/auth/api.go (outdated, resolved)
lib/auth/proto/auth.proto (outdated, resolved)
lib/defaults/defaults.go (outdated, resolved)
lib/services/clusterconfig.go (outdated, resolved)
lib/services/suite/suite.go (outdated, resolved)
lib/srv/authhandlers.go (resolved)
lib/utils/retry.go (outdated, resolved)
tool/tctl/common/semaphore_command.go (outdated, resolved)
tool/tctl/common/semaphore_command.go (outdated, resolved)
tool/tctl/common/semaphore_command.go (outdated, resolved)
lib/sshutils/server.go (outdated, resolved)
}

// SemaphoreLeaseRef identifies an existent lease.
message SemaphoreLeaseRef {
Contributor

I think it's because each SemaphoreSpecV3 can maintain multiple leases?


// SessionControlLimit fires when a user's attempt to create an authenticated
// session has been rejected due to exceeding `max_concurrent_sessions`.
SessionControlLimitEvent = "sessctl.limit"
Contributor

This should probably be named something else. @benarent?

Contributor

Was this renamed to something else?

lib/services/semaphore.go (outdated, resolved)
lib/services/semaphore.go (resolved)
@fspmarshall fspmarshall force-pushed the fspmarshall/session-control-2 branch 7 times, most recently from 68a8293 to 0f17b60 Compare August 24, 2020 19:11
@benarent benarent added this to the 4.4 "Rome" milestone Aug 25, 2020
@benarent benarent added the feature-request Used for new features in Teleport, improvements to current should be #enhancements label Aug 25, 2020
@fspmarshall fspmarshall force-pushed the fspmarshall/session-control-2 branch 2 times, most recently from dbbd51c to 1d45431 Compare August 25, 2020 20:27
@fspmarshall
Contributor Author

fspmarshall commented Aug 25, 2020

Updated PR Description

RBAC

A role's options block now supports two new optional configuration values:

kind: role
metadata:
  name: some-role
  # ...
spec:
  options:
    max_connections: 2
    max_sessions: 2
  # ...
version: v3

max_connections

This value limits the total number of concurrent SSH connections that a user may establish to nodes within a cluster. No distinction is made between connections to the same node and connections to different nodes, and only connections to teleport nodes are covered (connections to openssh nodes are not counted).

In order to enforce this control, nodes acquire semaphore leases from the auth server corresponding to the teleport username that is attempting to connect. This means that users who have limits placed on concurrent connections cannot connect to nodes while the auth server is offline.

In tsh, attempting to exceed max_connections will result in an error that looks something like this:

$ tsh ssh node.example.com
error: ssh: rejected: administratively prohibited (too many concurrent ssh connections for user "alice" (max=2))

The same error in openssh will look like this:

$ ssh -J proxy.example.com node.example.com
channel 0: open failed: administratively prohibited: too many concurrent ssh connections for user "alice" (max=2)

max_sessions

This value limits the total number of session channels which can be established across a single SSH connection (typically used for interactive terminals or remote exec operations). It is essentially equivalent to the MaxSessions configuration value accepted by sshd. Under normal usage via tsh a user will never hit this limit as separate invocations of tsh produce separate connections. This limit will affect users or software that leverage connection-reuse systems such as the ControlMaster feature of openssh.
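A per-connection limit of this kind can be sketched as a counter scoped to a single connection. The connSessions type and its methods are hypothetical names for illustration and are not taken from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// connSessions tracks the session channels open on ONE connection.
// Every connection gets its own counter, so reaching the limit on one
// connection does not affect any other connection.
type connSessions struct {
	mu   sync.Mutex
	open int64
	max  int64
}

// tryOpen reserves a slot for a new session channel or returns an error
// if the per-connection limit has been reached.
func (c *connSessions) tryOpen() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.open >= c.max {
		return fmt.Errorf("too many session channels (max=%d)", c.max)
	}
	c.open++
	return nil
}

// release frees a slot when a session channel closes.
func (c *connSessions) release() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.open > 0 {
		c.open--
	}
}

func main() {
	conn := &connSessions{max: 2}
	fmt.Println(conn.tryOpen()) // <nil>
	fmt.Println(conn.tryOpen()) // <nil>
	fmt.Println(conn.tryOpen()) // too many session channels (max=2)

	conn.release()
	fmt.Println(conn.tryOpen()) // <nil>

	// A second connection starts with a fresh counter.
	other := &connSessions{max: 2}
	fmt.Println(other.tryOpen()) // <nil>
}
```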

In openssh, attempting to exceed max_sessions will result in an error message something like this:

channel 5: open failed: administratively prohibited: too many session channels for user "alice" (max=2)

Cluster Configuration

A new session_control_timeout configuration value has been added to the auth_service configuration block of the teleport config file:

auth_service:
  session_control_timeout: 2m # default
# ...

This parameter controls the expiration time of the semaphore leases used to enforce max_connections. More generally, this value represents the upper limit of how long a node can go without being able to successfully contact the auth server before it terminates a controlled connection. Because semaphores are synchronized cluster-wide state, very short timeouts may be costly performance-wise. The default value is two minutes, but clusters with extremely high numbers of concurrent controlled connections may benefit from longer timeouts.


The Semaphore Resource

In order to facilitate basic debugging/introspection, support for the semaphore resource has been added to the tctl get and tctl rm commands:

$ tctl get semaphores
kind: semaphore
metadata:
  expires: "2020-08-25T21:05:31.846078263Z"
  name: alice
spec:
  leases:
  - expires: "2020-08-25T21:05:31.846078263Z"
    holder: dc43a481-3063-4476-a9c7-505d1fffc077
    lease_id: bf36059c-7986-4f44-937e-a43cbda2b1f5
  - expires: "2020-08-25T21:05:30.726874902Z"
    holder: dc43a481-3063-4476-a9c7-505d1fffc077
    lease_id: b488bcf3-d6fa-4e31-a363-08ea1488de5d
sub_kind: connection
version: v3

$ tctl rm semaphores/connection/alice
semaphore 'connection/alice' has been deleted

Audit Events

Attempts to exceed the max_connections and max_sessions restraints result in connection and session variants of the sessctl.limit audit event respectively:

{
    "code": "T1006W",
    "event": "sessctl.limit",
    "max": 2,
    "proto": "ssh",
    "server_id": "dc43a481-3063-4476-a9c7-505d1fffc077",
    "sessctl_kind": "connection",
    "time": "2020-08-25T20:55:28.548Z",
    "uid": "adfad7cd-e884-47dd-8663-4d83de30f7cc",
    "user": "alice"
}
{
    "code": "T1006W",
    "event": "sessctl.limit",
    "max": 2,
    "proto": "ssh",
    "server_id": "dc43a481-3063-4476-a9c7-505d1fffc077",
    "sessctl_kind": "session",
    "time": "2020-08-25T20:55:28.498Z",
    "uid": "9ad61a77-de3d-44c8-bafa-dc21e63a3150",
    "user": "alice"
}
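For anyone consuming these events programmatically, a minimal sketch of decoding them might look like the following. The limitEvent struct and countByKind function are hypothetical; the JSON field names come from the sample events above, and the input is assumed to be a stream of concatenated JSON objects as shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// limitEvent mirrors the fields of the sessctl.limit events shown above.
type limitEvent struct {
	Code     string `json:"code"`
	Event    string `json:"event"`
	Max      int64  `json:"max"`
	Proto    string `json:"proto"`
	ServerID string `json:"server_id"`
	Kind     string `json:"sessctl_kind"`
	User     string `json:"user"`
}

// sampleEvents is a trimmed copy of the two events shown above.
const sampleEvents = `
{"code":"T1006W","event":"sessctl.limit","max":2,"proto":"ssh","sessctl_kind":"connection","user":"alice"}
{"code":"T1006W","event":"sessctl.limit","max":2,"proto":"ssh","sessctl_kind":"session","user":"alice"}
`

// countByKind decodes a stream of JSON events and counts the
// sessctl.limit events by their sessctl_kind value.
func countByKind(stream string) (map[string]int, error) {
	dec := json.NewDecoder(strings.NewReader(stream))
	counts := map[string]int{}
	for {
		var ev limitEvent
		err := dec.Decode(&ev)
		if err == io.EOF {
			return counts, nil
		}
		if err != nil {
			return nil, err
		}
		if ev.Event != "sessctl.limit" {
			continue // ignore unrelated events
		}
		counts[ev.Kind]++
	}
}

func main() {
	counts, err := countByKind(sampleEvents)
	if err != nil {
		panic(err)
	}
	fmt.Println(counts) // map[connection:1 session:1]
}
```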

@klizhentas
Contributor

@fspmarshall does it have to be ssh specific? Can we have just max_connections and max_sessions that will limit total overall amount of connections and sessions across different protocol? Otherwise we would have to add a new option per protocol?

@fspmarshall
Contributor Author

@fspmarshall does it have to be ssh specific? Can we have just max_connections and max_sessions that will limit total overall amount of connections and sessions across different protocol? Otherwise we would have to add a new option per protocol?

@klizhentas I started off using the more general terminology, but was concerned that it might cause confusion since not all connection/session concepts have the hierarchical relationship that ssh connections/sessions have (i.e. a session is scoped to a specific connection). For things like the kubernetes API and (I think) AAP, a single session can span multiple connections. What do you think? I suppose as long as we don't promise to preserve the current hierarchical relationship, it isn't a huge deal :/

@klizhentas
Contributor

I would still remove ssh part, precisely because we don't promise to preserve the relationship, but we would actually limit the amount of connections and sessions across all ssh/k8s/app protocols - that's AC requirement anyways.

@fspmarshall
Contributor Author

Updated description to reflect removal of ssh prefixing.

@fspmarshall fspmarshall force-pushed the fspmarshall/session-control-2 branch 2 times, most recently from b1da6e8 to f4791be Compare August 27, 2020 19:22
Contributor

@awly awly left a comment

Note: I didn't re-review lib/services/semaphore.go, hit my mental capacity after ~2hrs of reviewing.

Regarding max_connections vs max_sessions: it looks like these separate controls are required by FIPS, right?
I suspect this will be confusing for some customers, vs a single global max_sessions-like limit.
We should at least make the docs very clear about how these relate @benarent

integration/integration_test.go (outdated, resolved)
Comment on lines +1056 to +1057
option (gogoproto.goproto_stringer) = false;
option (gogoproto.stringer) = false;
Contributor

why are these options set?

Contributor Author

These options are set for all top-level Resource types. We implement custom String() methods that display compact summaries of the important state of the resource.

lib/services/types.proto (outdated, resolved)
@@ -304,6 +304,32 @@ func (s *SrvSuite) TestAgentForwardPermission(c *C) {
c.Assert(strings.Contains(string(output), "SSH_AUTH_SOCK"), Equals, false)
}

// TestMaxSessions makes sure that MaxSessions RBAC rules prevent
// too many concurrent sessions.
func (s *SrvSuite) TestMaxSessions(c *C) {
Contributor

also check that the limit doesn't affect new connections

lib/sshutils/ctx.go (outdated, resolved)
lib/sshutils/server.go (outdated, resolved)
tool/tctl/common/collection.go (resolved)
fspmarshall and others added 3 commits September 16, 2020 16:19
Adds support for Concurrent Session Control and a new
semaphore API.  Roles now support two new configuration
options, `max_ssh_connections` and `max_ssh_sessions`
which correspond to the total number of authenticated
ssh connections per cluster, and the number of ssh sessions
within a connection respectively.  Attempting to exceed
these limits generates variants of the `session.rejected`
audit event and cause the connection/session to be
rejected.
@fspmarshall
Contributor Author

retest this please

@fspmarshall fspmarshall merged commit 1d20053 into master Sep 17, 2020
@fspmarshall fspmarshall deleted the fspmarshall/session-control-2 branch September 17, 2020 18:03