Initial commit

Part of #8318
projectnessie · Apr 12, 2024 · commit 57493b6

Showing 11 changed files with 1,672 additions and 0 deletions.
diff --git a/0.79.0/docs/authentication.md b/0.79.0/docs/authentication.md
@@ -0,0 +1,48 @@
+---
+title: "Authentication"
+---
+
+# Authentication
+
+By default, Nessie servers run with authentication disabled and all requests are processed under the "anonymous"
+user identity. In Nessie clients this authentication type is known as `NONE`.
+
+When a Nessie server runs as an AWS Lambda, access to its API is controlled by AWS authentication settings.
+In this case there is no need to configure any additional authentication in the Nessie server.
+In Nessie clients this authentication type is known as `AWS`.  
+
+When a Nessie API is exposed to clients without any external authentication layer, the server itself can be
+configured to authenticate clients using [OpenID tokens](https://openid.net/specs/openid-connect-core-1_0.html)
+as described in the section below. On the client side this authentication type is known as `BEARER` authentication.
+
+For client-side authentication settings refer to the following pages:
+
+* [Nessie CLI](cli.md)
+* [Java Client](../develop/java.md)
+* [Authentication in Tools](client_config.md)
+
+## OpenID Bearer Tokens
+
+Nessie supports bearer tokens and uses [OpenID Connect](https://openid.net/connect/) for validating them.
+
+To enable bearer authentication, set the following [configuration](configuration.md) properties
+for the Nessie Server process:
+
+* `nessie.server.authentication.enabled=true`
+* `quarkus.oidc.auth-server-url=<OpenID Server URL>`
+* `quarkus.oidc.client-id=<Client ID>`
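+
+As a minimal sketch, the same three properties could be placed together in a Quarkus `application.properties` file.
+The URL and client ID below are hypothetical placeholders, not values from a real deployment:
+
+```
+# Hypothetical values; replace with your OpenID provider's URL and your client ID
+nessie.server.authentication.enabled=true
+quarkus.oidc.auth-server-url=https://idp.example.com/realms/nessie
+quarkus.oidc.client-id=nessie-server
+```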
+
+When using Nessie [Docker](../guides/docker.md) images, the authentication options can be specified on
+the `docker` command line as environment variables, for example:
+
+```bash
+$ docker run -p 19120:19120 \
+  -e QUARKUS_OIDC_CLIENT_ID=$YOUR_OIDC_Client_ID \
+  -e QUARKUS_OIDC_AUTH_SERVER_URL=$YOUR_OpenID_Server_URL \
+  -e NESSIE_SERVER_AUTHENTICATION_ENABLED=true \
+  --network host \
+  ghcr.io/projectnessie/nessie
+```
+
+Note the use of the `host` Docker network. This example assumes that the OpenID Server
+is reachable on the host network. More advanced network setups are possible.
diff --git a/0.79.0/docs/authorization.md b/0.79.0/docs/authorization.md
@@ -0,0 +1,196 @@
+---
+title: "Metadata Authorization"
+---
+
+# Metadata Authorization
+
+## Authorization scope
+Nessie does not store data itself, only data locations and other metadata.
+
+As a consequence, the Nessie authorization layer can only control access to **metadata**; it cannot prevent the data itself from being accessed directly, without going through Nessie.
+Another system is therefore expected to control access to the data itself, so that unauthorized access isn't possible.
+
+The same is true for access to historical data, which is one of Nessie's main features.
+For example, it might seem safe to commit a change that removes sensitive data and then restrict access to only the latest version of the dataset.
+However, the sensitive data may still exist on the data lake and be accessible by other means
+(similar to how redacting a PDF by drawing black boxes over sensitive information usually does not prevent people from reading what is written beneath).
+The only safe way to remove this data is to remove it from the table (e.g. via `DELETE` statements) and then run the [Garbage Collection](https://projectnessie.org/features/management/#garbage-collection) algorithm to ensure the data has been removed from Nessie history and deleted from the data lake.
+
+## Stories
+Here's a list of common authorization scenarios:
+
+* Alice attempts to execute a query against the table `Foo` on branch `prod`. As she has read access to the table on this branch, Nessie allows the execution engine to get the table details.
+* Bob attempts to execute a query against the table `Foo` on branch `prod`. However, Bob does not have read access to the table. Nessie returns an authorization error, and the execution engine refuses to execute the query.
+* Carol has access to the content on branch `prod`, but not to the table `Foo` on this branch. Carol creates a new reference named `carol-branch` with the same hash as `prod`, and attempts to change permissions on table `Foo`. However, the request is denied and Carol cannot access the content of `Foo`.
+* Dave has access to the content on branch `prod`, and wants to update the content of the table `Foo`. He creates a new reference named `dave-experiment`, and executes several queries against this branch to modify table `Foo`. Each modification is a commit against the `dave-experiment` branch, which the Nessie server approves. When all the desired modifications are done, Dave attempts to merge the changes back to the `prod` branch. However, Dave doesn't have the rights to modify the `prod` branch, so Nessie denies the request.
+
+## Access control model
+Any object in Nessie can be designated by a pair of coordinates (**reference**, **path**), therefore access control is also designed around those two concepts.
+
+### Access control against references
+References (branches and tags) are designated by name, and several operations can be performed on them:
+
+* view/list available references
+* create a new named reference
+* assign a hash to a reference
+* delete a reference
+* list objects present in the tree
+* read objects content in the tree
+* commit a change against the reference
+
+Note that a user needs to be able to view a reference in order to list objects on that reference.
+
+### Access control against paths
+For a specific reference, an **entity** is designated by its path, so a simple way to perform access control is to apply restrictions on paths.
+
+Several operations can be performed on an **entity**:
+
+ * create a new entity
+ * delete an entity
+ * update an entity's content
+
+Note that these operations combine with the reference operations. For example, to update the content of an entity, a user needs permission both to perform the update AND to commit the change against the reference where the change will be stored.
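+
+As an illustration of how the two levels combine, the following sketch uses the rule syntax of the reference implementation described later on this page. The rule names, the role `etl_writer`, the branch `dev`, and the path prefix `sales.` are all hypothetical:
+
+```
+# Hypothetical reference-level rule: 'etl_writer' may view and commit against 'dev'
+nessie.server.authorization.rules.etl_dev_reference=\
+  op in ['VIEW_REFERENCE', 'COMMIT_CHANGE_AGAINST_REFERENCE'] && ref=='dev' && role=='etl_writer'
+
+# Hypothetical entity-level rule: 'etl_writer' may update entities under 'sales.' on 'dev'
+nessie.server.authorization.rules.etl_dev_update=\
+  op in ['UPDATE_ENTITY'] && ref=='dev' && path.startsWith('sales.') && role=='etl_writer'
+```
+
+With only the second rule in place, the update itself would be permitted but the commit that stores it would not.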
+
+## Service Provider Interface
+The SPI is named [AccessChecker](https://github.com/projectnessie/nessie/blob/main/servers/services/src/main/java/org/projectnessie/services/authz/AccessChecker.java) and uses [AccessContext](https://github.com/projectnessie/nessie/blob/main/servers/services/src/main/java/org/projectnessie/services/authz/AccessContext.java), which carries information about the overall context of the operation.
+Implementers of `AccessChecker` are completely free to define their own way of creating/updating/checking authorization rules.
+
+### ContentId Usage
+Note that there is a `contentId` parameter in some methods of the [AccessChecker](https://github.com/projectnessie/nessie/blob/main/servers/services/src/main/java/org/projectnessie/services/authz/AccessChecker.java), which allows checking specific rules for a given entity at a given point in time.
+The `contentId` parameter refers to the ID of a `Content` object and its contract is defined [here](https://projectnessie.org/develop/spec/#content-id).
+
+This is similar to how permissions work in Google Docs: some permissions are specific to the parent folder and others to the doc itself. When a doc is moved from one folder to another, it inherits the permissions of the new parent folder.
+However, the doc-specific permissions are carried over with the doc and still apply.
+The same is true for entities: some rules apply to an entity globally, and it is also possible to define rules specific to the `contentId` of an entity.
+
+## Reference implementation for Metadata Authorization
+
+The reference implementation allows defining authorization rules via [application.properties](https://github.com/projectnessie/nessie/blob/main/servers/quarkus-server/src/main/resources/application.properties) and is therefore dependent on [Quarkus](https://quarkus.io/).
+Nessie's metadata authorization can be enabled via `nessie.server.authorization.enabled=true`.
+
+### Authorization Rules
+Authorization rules are defined using **Common Expression Language (CEL)** expressions (an intro to CEL can be found at https://github.com/google/cel-spec/blob/master/doc/intro.md).
+
+Rule definitions are of the form `nessie.server.authorization.rules.<ruleId>=<rule_expression>`, where `<ruleId>` is a unique identifier for the rule.
+
+`<rule_expression>` is a CEL expression string, which can flexibly combine conditions over a given set of variables.
+
+Available variables within the `<rule_expression>` are: **'op'** / **'role'** / **'ref'** / **'path'**.
+
+* The **'op'** variable in the `<rule_expression>` refers to the type of operation and can be any of the following.
+  See [BatchAccessChecker](https://github.com/projectnessie/nessie/blob/main/servers/services/src/main/java/org/projectnessie/services/authz/BatchAccessChecker.java)
+  and [Check](https://github.com/projectnessie/nessie/blob/main/servers/services/src/main/java/org/projectnessie/services/authz/Check.java) types.
+  * Arguments to the CEL script for the following check-types:
+    * **'role'** refers to the user's role and can be any string.
+    * **'ref'** refers to a string representing a branch/tag name or `DETACHED` for direct access to a commit id.
+    * **'path'** refers to the [content key](https://github.com/projectnessie/nessie/blob/main/model/src/main/java/org/projectnessie/model/ContentKey.java) for the contents of an object and can be any string.
+  * `VIEW_REFERENCE`
+  * `CREATE_REFERENCE`
+  * `DELETE_REFERENCE`
+  * `READ_ENTRIES`
+  * `READ_CONTENT_KEY`
+  * `LIST_COMMIT_LOG`
+  * `COMMIT_CHANGE_AGAINST_REFERENCE`
+  * `ASSIGN_REFERENCE_TO_HASH`
+  * `CREATE_ENTITY`
+  * `UPDATE_ENTITY`
+  * `READ_ENTITY_VALUE`
+  * `DELETE_ENTITY`
+* The following values of **'op'** apply to repository configuration checks:
+  * `READ_REPOSITORY_CONFIG`
+  * `UPDATE_REPOSITORY_CONFIG`
+  * Arguments to the CEL script for the repository config check-types:
+    * **'type'** refers to the repository config type to be retrieved or updated.
+
+Since all available authorization rule variables are strings, the most relevant CEL features are:
+
+* [equality and inequality](https://github.com/google/cel-spec/blob/master/doc/langdef.md#equality-and-ordering)
+* [regular expressions](https://github.com/google/cel-spec/blob/master/doc/langdef.md#regular-expressions)
+* [operators & functions](https://github.com/google/cel-spec/blob/master/doc/langdef.md#list-of-standard-definitions)
+
+### Example authorization rules
+
+Below are some basic examples that show how to grant permission for a particular operation. In practice, one would want to keep the number of authorization rules for a single user/role low and grant permissions for all required operations through as few rules as possible.
+
+* allows viewing the branch/tag starting with the name `allowedBranch` for the role that starts with the name `test_`:
+```
+nessie.server.authorization.rules.allow_branch_listing=\
+  op=='VIEW_REFERENCE' && role.startsWith('test_') && ref.startsWith('allowedBranch')
+```
+
+* allows creating branches/tags that match the regex `.*allowedBranch.*` for the role `test_user`:
+```
+nessie.server.authorization.rules.allow_branch_creation=\
+  op=='CREATE_REFERENCE' && role=='test_user' && ref.matches('.*allowedBranch.*')
+```
+
+* allows deleting branches/tags that end with `allowedBranch` for the role named `test_user123`:
+```
+nessie.server.authorization.rules.allow_branch_deletion=\
+  op in ['VIEW_REFERENCE', 'DELETE_REFERENCE'] && role=='test_user123' && ref.endsWith('allowedBranch')
+```
+
+* allows listing the commit log for all branches/tags starting with `dev`:
+```
+nessie.server.authorization.rules.allow_listing_commitlog=\
+  op in ['VIEW_REFERENCE', 'LIST_COMMIT_LOG'] && ref.startsWith('dev')
+```
+
+* allows reading the entity value where the `path` starts with `allowed.` for the role `test_user`:
+```
+nessie.server.authorization.rules.allow_reading_entity_value=\
+  op in ['VIEW_REFERENCE', 'READ_ENTITY_VALUE'] && role=='test_user' && path.startsWith('allowed.')
+```
+
+* allows deleting the entity where the `path` starts with `dev.` for all roles:
+```
+nessie.server.authorization.rules.allow_deleting_entity=\
+  op in ['VIEW_REFERENCE', 'DELETE_ENTITY'] && path.startsWith('dev.')
+```
+
+* allows listing reflog for the role `admin_user`:
+```
+nessie.server.authorization.rules.allow_listing_reflog=\
+  op=='VIEW_REFLOG' && role=='admin_user'
+```
+
+### Example authorization rules from Stories section
+
+As mentioned in the Stories section, a few common scenarios that are possible are:
+
+* Alice attempts to execute a query against the table `Foo` on branch `prod`. As she has read access to the table on this branch, Nessie allows the execution engine to get the table details.
+* Bob attempts to execute a query against the table `Foo` on branch `prod`. However, Bob does not have read access to the table. Nessie returns an authorization error, and the execution engine refuses to execute the query.
+* Carol has access to the content on branch `prod`, but not to the table `Foo` on this branch. Carol creates a new reference named `carol-branch` with the same hash as `prod`, and attempts to change permissions on table `Foo`. However, the request is denied and Carol cannot access the content of `Foo`.
+* Dave has access to the content on branch `prod`, and wants to update the content of the table `Foo`. He creates a new reference named `dave-experiment`, and executes several queries against this branch to modify table `Foo`. Each modification is a commit against the `dave-experiment` branch, which the Nessie server approves. When all the desired modifications are done, Dave attempts to merge the changes back to the `prod` branch. However, Dave doesn't have the rights to modify the `prod` branch, so Nessie denies the request.
+
+Below are the respective authorization rules for these scenarios:
+```
+# read access for all on the prod branch
+nessie.server.authorization.rules.prod=\
+  op in ['VIEW_REFERENCE'] && ref=='prod' && role in ['Alice', 'Bob', 'Carol', 'Dave']
+
+# alice & dave can read Foo  
+nessie.server.authorization.rules.reading_foo_on_prod=\
+  op in ['READ_ENTITY_VALUE'] && ref=='prod' && path=='Foo' && role in ['Alice', 'Dave']
+
+# specific rules for carol on her branch
+nessie.server.authorization.rules.carol-branch=\
+  op in ['VIEW_REFERENCE', 'CREATE_REFERENCE', 'DELETE_REFERENCE', 'COMMIT_CHANGE_AGAINST_REFERENCE'] && ref=='carol-branch' && role=='Carol'
+
+# specific rules for dave on his branch
+nessie.server.authorization.rules.dave-experiment=\
+  op in ['VIEW_REFERENCE', 'CREATE_REFERENCE', 'DELETE_REFERENCE', 'COMMIT_CHANGE_AGAINST_REFERENCE'] && ref=='dave-experiment' && role=='Dave'
+
+# bob can read/update/delete BobsBar only
+nessie.server.authorization.rules.bob=\
+  op in ['READ_ENTITY_VALUE', 'UPDATE_ENTITY', 'DELETE_ENTITY'] && path=='BobsBar' && role=='Bob'
+
+# carol can read/update/delete CarolsSecret
+nessie.server.authorization.rules.carol=\
+  op in ['READ_ENTITY_VALUE', 'UPDATE_ENTITY', 'DELETE_ENTITY'] && path=='CarolsSecret' && role=='Carol'
+
+# dave can read/update/delete DavesHiddenX
+nessie.server.authorization.rules.dave=\
+  op in ['READ_ENTITY_VALUE', 'UPDATE_ENTITY', 'DELETE_ENTITY'] && path=='DavesHiddenX' && role=='Dave'
+
+```
diff --git a/0.79.0/docs/cli.md b/0.79.0/docs/cli.md
@@ -0,0 +1,85 @@
+---
+title: "Nessie CLI"
+---
+
+# Nessie CLI
+
+The Nessie CLI is an easy way to get started with Nessie. It supports multiple branch
+and tag management capabilities. It is distributed as the `pynessie` package and installed via `pip install pynessie`.
+Additional information about `pynessie` and release notes can be found on the [PyPI](https://pypi.org/project/pynessie/) site.
+
+## Installation
+
+```
+# python 3 required
+pip install pynessie
+```
+
+## Usage 
+All the REST API calls are exposed via the command line interface. To see a list of what is available run:
+
+``` bash
+$ nessie --help
+``` 
+
+The full CLI documentation can be found [here](https://nessie.readthedocs.io/en/latest/cli.html).
+
+## Configuration
+
+You can configure the Nessie CLI by creating a configuration file in one of the following locations:
+
+* macOS: `~/.config/nessie` and `~/Library/Application Support/nessie`
+* Other Unix: `~/.config/nessie` and `/etc/nessie`
+* Windows: `%APPDATA%\nessie` where the `APPDATA` environment variable falls
+  back to `%HOME%\AppData\Roaming` if undefined
+* Via the environment variable `DREMIO_CLIENTDIR`
+
+The default config file is as follows:
+
+``` yaml
+auth:
+    # Authentication type can be: none, bearer or aws
+    type: none
+
+    # OpenID token for the "bearer" authentication type
+    # token: <OpenID token>
+
+    timeout: 10
+
+# Nessie endpoint
+endpoint: http://localhost/api/v1
+
+# whether to verify SSL certificates (set to false to skip verification)
+verify: true
+```
+
+Possible values for the `auth.type` property are:
+
+* `none` (default)
+* `bearer`
+* `aws`
+
+When configuring authentication type `bearer`, the `auth.token` parameter should be set to a valid
+[OpenID token](https://openid.net/specs/openid-connect-core-1_0.html). The token can be set in the Nessie
+configuration file, as an environment variable (details below), or by the `--auth-token <TOKEN>` command
+line option (for each command).
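+
+For instance, a configuration file using bearer authentication might look like the following sketch. It reuses only keys from the default file above; the token value is a placeholder:
+
+``` yaml
+auth:
+    type: bearer
+    # Placeholder; substitute a real OpenID token
+    token: <OpenID token>
+    timeout: 10
+
+endpoint: http://localhost/api/v1
+verify: true
+```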
+
+When configuring authentication type `aws`, the client delegates to the [Boto](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) 
+library. You can configure credentials using any of the standard [Boto AWS methods](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#configuring-credentials).
+Additionally, the Nessie `auth.region` parameter should be set to the relevant AWS region.
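+
+A sketch of the corresponding configuration file, assuming `auth.region` nests under `auth` like the other `auth.*` keys; the region name is only an example:
+
+``` yaml
+auth:
+    type: aws
+    # Example region; use the region of your Nessie deployment
+    region: us-west-2
+```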
+
+The command line interface can be configured with most of the above parameters via flags or by setting
+a config directory. The relevant settings can also be provided via environment variables, which take precedence.
+An environment variable name is formed by prepending `NESSIE_` to the config parameter, with nested keys separated by `_`. For
+example, `NESSIE_AUTH_TIMEOUT` maps to `auth.timeout` in the default configuration file above.
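+
+Following that naming rule, bearer authentication could be configured entirely through the environment. This is a sketch: `NESSIE_AUTH_TYPE` and `NESSIE_AUTH_TOKEN` are derived from the rule above rather than taken from the CLI reference, and the token is a placeholder:
+
+``` bash
+# Derived names (assumption): auth.type -> NESSIE_AUTH_TYPE, auth.token -> NESSIE_AUTH_TOKEN
+$ export NESSIE_AUTH_TYPE=bearer
+$ export NESSIE_AUTH_TOKEN="<OpenID token>"
+$ export NESSIE_AUTH_TIMEOUT=30
+$ nessie --json branch -l
+```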
+
+
+## Working with JSON
+
+The Nessie CLI can return data in JSON format and can be used effectively with [`jq`](https://stedolan.github.io/jq/). For example:
+
+``` bash
+$ nessie --json branch -l | jq .
+```
+
+The Nessie CLI is built on the great Python [Click](https://click.palletsprojects.com) library. It requires Python 3.x.