Fix/documentation acl #276

Merged (3 commits) on Nov 8, 2023
82 changes: 63 additions & 19 deletions DOCUMENTATION.md

## Setting public namespaces for a Dataset

By default, the context object in data hub responses lists all available namespace mappings in the data hub. When there is a large number of datasets with many namespaces in the data hub, this can be undesirable.
Therefore, it is possible to configure a limited list of namespaces per dataset to be used in response contexts.

### Creating datasets with public namespaces
In this example, we add two namespaces as `publicNamespaces` to dataset `namespaces.Test`:

```
mim dataset store core.Dataset --filename=update.json
```
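
As a purely hypothetical sketch of what `update.json` could contain — the entity id, namespace prefixes, and property layout below are assumptions made for illustration — the dataset's entry in `core.Dataset` would carry the `publicNamespaces` list:

```json
[
  {
    "id": "@context",
    "namespaces": {
      "ns0": "http://data.mimiro.io/core/"
    }
  },
  {
    "id": "ns0:namespaces.Test",
    "props": {
      "ns0:publicNamespaces": [
        "http://data.mimiro.io/test/",
        "http://example.io/people/"
      ]
    }
  }
]
```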

Now, when we retrieve entities from `namespaces.Test`, the data hub will supply only `publicNamespaces` as context:

```
> mim dataset entities namespaces.Test
```
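
For illustration, assuming the usual response shape with a leading context object (the namespace URIs below are made up), the context returned for `namespaces.Test` would then list only the configured public namespaces:

```json
[
  {
    "id": "@context",
    "namespaces": {
      "ns0": "http://data.mimiro.io/test/",
      "ns1": "http://example.io/people/"
    }
  }
]
```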

To look up a single entity:

- using `mim`, the data hub CLI
```shell
> mim query --id="http://data.mimiro.io/people/homer"
```

To fetch related entities for a given entity:

- using `mim`, the data hub CLI

```shell
> mim query --entity="http://data.mimiro.io/people/homer" \
```

And to get entities referencing a given entity, e.g. all entities of type person:

- using `mim`, the data hub CLI
```shell
> mim query --entity="http://data.mimiro.io/schema/person" \
--via="http://www.w3.org/1999/02/22-rdf-syntax-ns#type" \
```

## Jobs

Jobs are used to fetch data from remote datalayers into the data hub, to connect and transform data within the data hub, and to send data to remote datalayers.

A Job is defined with three components: a source, an optional transform and a sink. Jobs that pull from or push to external datalayers execute on a schedule; jobs that move data between datasets can either run on a schedule or be triggered as data arrives in the source dataset.
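
To make the three components concrete, here is a hedged sketch of a complete job definition. The trigger fields (`triggers`, `triggerType`, `jobType`, `schedule`) and the `DatasetSource`/`JavascriptTransform` type names are assumptions for illustration, as is the sink URL:

```json
{
  "id": "people-to-crm",
  "title": "people-to-crm",
  "triggers": [
    {
      "triggerType": "cron",
      "jobType": "incremental",
      "schedule": "@every 600s"
    }
  ],
  "source": { "Type": "DatasetSource", "Name": "people.Employees" },
  "transform": { "Type": "JavascriptTransform", "Code": "<base64-encoded transform script>" },
  "sink": { "Type": "HttpDatasetSink", "Url": "https://crm-layer.example.io/datasets/contacts/entities" }
}
```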


#### Union Dataset Source

A union dataset source can be used to consume multiple datasets in the data hub.
All configured datasets are read sequentially, as if their contents were concatenated.

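As an illustrative sketch — the `UnionDatasetSource` type string and the `DatasetSources` field name are assumptions, not confirmed by this document — a union source that concatenates two datasets could be configured like this:

```json
"source": {
  "Type": "UnionDatasetSource",
  "DatasetSources": [
    { "Type": "DatasetSource", "Name": "people.Employees" },
    { "Type": "DatasetSource", "Name": "people.Contractors" }
  ]
},
```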

#### Testing a Transform on a given entity

It is also possible to test a transform on a known entity in the data hub by running a query and applying the transformation to the returned entity. The following command runs the same transform as above, but on the entity `http://data.mimiro.io/people/bob`. The data is fetched from the dataset, the script is executed locally, and the output is displayed.

```shell
mim query --id "http://data.mimiro.io/people/bob" --via="*" --json | mim transform test --file transform1.js
```

There are two main security models for the data hub.

1. No security / API gateway secured. All calls are allowed at the data hub API level. This mode can be used either when developing or when the data hub API is protected behind an API gateway that implements secure access.

2. Data Hub Security. This involves a data hub allowing for the registration of clients and a public key. The client (often in this model another data hub) retrieves a JWT access token by sending a request (signed with a private key) to authenticate.

In secured mode, it is also possible to configure an OPA endpoint. OPA is used to authorize requests. Authorization is then based on a union of data hub ACL rules and OPA policy rules for the current user.

Additionally, an external authentication provider can be configured to validate JWT tokens. This has to be an OAuth2 provider. The data hub will then validate JWT tokens against both the built-in provider and the external provider.

The following environment variables can be set to configure the data hub security.


`ADMIN_LOCAL_ONLY=false`

If set to true, admin access is only available from the local machine / container where the data hub is running. (coming soon)

`AUTHORIZATION_MIDDLEWARE=noop`
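
Taken together, a secured deployment might export something like the following. The values are illustrative; the non-`noop` middleware value and the storage path are assumptions:

```shell
# Illustrative values only - adjust to your deployment.
export NODE_ID=hub-a                                    # this data hub's identity; other hubs use it as the client id
export ADMIN_USERNAME=admin
export ADMIN_PASSWORD='use-a-strong-secret'
export ADMIN_LOCAL_ONLY=false
export AUTHORIZATION_MIDDLEWARE=local                   # assumption: a non-noop value enables the built-in ACL checks
export SECURITY_STORAGE_LOCATION=/var/datahub/security  # assumption: directory where node keys and client registrations live
```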


Assume there are two data hubs, and the goal is to have one data hub run a job that accesses a dataset on the other.

To register clients and ACLs it is first necessary to log into the data hub with admin permissions.

To log in with admin credentials, create a new login alias. Notice the type is 'admin'. The clientId and clientSecret should align with the data hub environment variables ADMIN_USERNAME and ADMIN_PASSWORD.

```shell
mim login add \
    --alias localadmin \
    --type admin \
    --server "https://localhost:8080" \
    --audience "https://localhost:8080" \
    --authorizer "https://auth.localhost:8081" \
    --clientId "ADMIN_USERNAME" \
    --clientSecret "ADMIN_PASSWORD"
```

Then get the client id and public key from the data hub that will be connecting to this data hub. The client id is the NODE_ID of the data hub that will be a client. The public key can be found in the location given by the SECURITY_STORAGE_LOCATION environment variable of the client data hub. Ensure you only share the public key.

Register the client data hub (its client id and public key) on this data hub.

To add ACLs, a JSON file can be created from this template:

```json
[
  {
    "Resource": "/datasets/your.Dataset*",
    "Action": "read",
    "Deny": false
  }
]
```
After this file has been created, it should be uploaded to the data hub:

```
mim acl add <client-id> -f acls.json
```

Then get, edit and update the ACL for the client:

```
mim acl get <client-id> client23-acl.json
```

The resource patterns are either exact matches or '*' wildcard matches. A wildcard matches any subpart of the URL and is not restricted to path segments, e.g. '/datasets/core.*' can be used to secure all datasets whose names start with 'core.'. This also means that access can be granted to only /changes or /entities respectively.
Keep in mind that a pattern like '/datasets/core.Dataset*' grants access not only to both endpoints of 'core.Dataset', but potentially also to '/datasets/core.DatasetWithAdditionalInfo'.

To grant full access to the client, edit the ACL file so it looks like:

```json
[{ "Resource": "/*", "Action": "write", "Deny": false }]
```
Other options can look like this:

```json
[
  { "Resource": "/datasets/core.*", "Action": "read", "Deny": false },
  { "Resource": "/datasets/test.Users*", "Action": "read", "Deny": false },
  { "Resource": "/datasets/test.Places/changes", "Action": "read", "Deny": false }
]
```

Then upload the updated config:

```
mim acl add <client-id> -f acls.json
```

On the client data hub it is necessary to upload a provider config that can be referenced from jobs that need to access the remote data hub.

This can be done with a POST to `/provider/logins` with a body like the following:

```json
{
  "name": "remote-data-hub-name-provider",
  "type": "nodebearer",
  "endpoint": {
    "type": "text",
    "value": "URL-of-remote-data-hub/security/token"
  },
  "audience": {
    "type": "text",
    "value": "the name (NODE_ID) of the remote data hub you want to read from"
  }
}
```
The name of this provider is then used in a job configuration to specify which provider the data hub should use for that job.

#### Remote data hub as source config
```json
"source": {
"Type": "HttpDatasetSource",
"Url": "URL-of-remote-data-hub/datasets/some.Dataset/changes",
"TokenProvider": "remote-data-hub-name-provider"
},
```

#### Remote data hub as sink config
```json
"sink": {
"Type": "HttpDatasetSink",
"Url": "URL-of-remote-data-hub/datasets/some.Dataset/entities",
"TokenProvider": "remote-data-hub-name-provider"
},
```

### Working with security providers
