Add new documentation site for DataHub (#327)
czgu committed Nov 30, 2020
1 parent cd5a67a commit 4f052e2
Showing 102 changed files with 8,252 additions and 3,518 deletions.
59 changes: 58 additions & 1 deletion .gitattributes
@@ -1 +1,58 @@
**/yarn.lock -diff
*.lock text -diff
package-lock.json text -diff

# Graphics
*.ai binary
*.bmp binary
*.eps binary
*.gif binary
*.gifv binary
*.ico binary
*.jng binary
*.jp2 binary
*.jpg binary
*.jpeg binary
*.jpx binary
*.jxr binary
*.pdf binary
*.png binary
*.psb binary
*.psd binary
# SVG treated as an asset (binary) by default.
*.svg text
# If you want to treat it as binary,
# use the following line instead.
# *.svg binary
*.svgz binary
*.tif binary
*.tiff binary
*.wbmp binary
*.webp binary

# Audio
*.kar binary
*.m4a binary
*.mid binary
*.midi binary
*.mp3 binary
*.ogg binary
*.ra binary

# Video
*.3gpp binary
*.3gp binary
*.as binary
*.asf binary
*.asx binary
*.fla binary
*.flv binary
*.m4v binary
*.mng binary
*.mov binary
*.mp4 binary
*.mpeg binary
*.mpg binary
*.ogv binary
*.swc binary
*.swf binary
*.webm binary
19 changes: 17 additions & 2 deletions .gitignore
@@ -1,12 +1,21 @@
.DS_Store
.idea/*
*.iml
.env.local
.env.development.local
.env.test.local
.env.production.local


package-lock.json
# jenkins job should not delete .env
.env/

yarn-error.log
# Frontend logs
npm-debug.log*
yarn-debug.log*
yarn-error.log*


# Build files
dist/
@@ -22,10 +31,16 @@ __pycache__/
celerybeat.pid

# ignore for testing/custom settings
node_modules/*
**/node_modules
.vscode/*
*.pyc
.vscode.env

# Jest coverage files
datahub/webapp/coverage/


# Documentation site
.docusaurus
.cache-loader
/build
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -7,6 +7,7 @@ repos:
- id: end-of-file-fixer
- id: check-yaml
- id: check-added-large-files
args: ['--maxkb=10000']
- repo: https://github.com/psf/black
rev: 19.10b0
hooks:
80 changes: 2 additions & 78 deletions CONTRIBUTING.md
@@ -14,82 +14,6 @@
limitations under the License.
-->

# Contributing
## Thanks for taking the time to contribute!

First off, thanks for taking the time to contribute! This guide will answer
some common questions about how this project works.

While this is a Pinterest open source project, we welcome contributions from
everyone.

## Code of Conduct

Please be sure to read and understand our [`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md).
We work hard to ensure that our projects are welcoming and inclusive to as many
people as possible.

## Reporting Issues

If you have a bug report, please provide as much information as possible so that
we can help you out:

- Version of the project you're using.
- Code (or, even better, a whole project) that reproduces the issue.
- Steps which reproduce the issue.
- Screenshots, GIFs or videos (if relevant).
- Stack traces for crashes.
- Any logs produced.

## Fixing Bugs

We welcome and appreciate bug fixes from anyone!
- You can fix bugs as per "Making Changes" and send for code review!

## Adding New Features

We welcome and appreciate new feature contributions to DataHub!
Following is the current process:
- Please create a GitHub issue proposing your new feature, including what and why. It can be brief, one or two paragraphs is ok.
- The project maintainers will then approve the new feature proposal.
- You can then briefly describe your intended technical design for the new feature.
- The project maintainers will then approve the technical design and/or request changes.
- Then you can implement the new feature as per "Making Changes" and send for code review!

## Making Changes

Please first check "Fixing Bugs" or "Adding New Features" as appropriate.

1. Create a new branch off of master. (We can't currently enable forking while the repo is in private beta)
2. Make your changes and verify that tests pass
3. Commit your work and push to origin your new branch
4. Submit a pull request to merge to master
5. Ensure your code passes both linter and unit tests
6. Participate in the code review process by responding to feedback

Once there is agreement that the code is in good shape, one of the project's
maintainers will merge your contribution.

To increase the chances that your pull request will be accepted:

- Follow the coding style
- Write tests for your changes
- Write a good commit message

## Help

Start by reading the [developer setup guide](docs/developer_guide/developer_setup.md) to set up DataHub.
If you're having trouble using this project, please check the [developer guides](docs/developer_guide/)
and search for solutions in the existing open and closed issues.

You can also reach out to us at datahub@pinterest.com or on our [Slack](https://join.slack.com/t/datahubchat/shared_invite/zt-dpr988af-9VwGkjcmPhqTmRoA2Tm3gg).

## Security

If you've found a security issue in one of our open source projects,
please report it at [Bugcrowd](https://bugcrowd.com/pinterest); you may even
make some money!

## License

By contributing to this project, you agree that your contributions will be
licensed under its [license](LICENSE).
Please check out the [guide on contribution](docs/contributing/overview.md) before you start.
35 changes: 33 additions & 2 deletions datahub/server/lib/change_log.py
@@ -7,6 +7,34 @@
__change_logs = None


def generate_change_log(raw_text: str) -> str:
    # Since the markdown is used for the documentation site,
    # we need to preprocess it so that it is presentable
    # within DataHub.

    # TODO: either move the changelog completely to the documentation site
    # or come up with a solution that is compatible with both

    lines = raw_text.split("\n")
    filtered_lines = []

    # Remove --- front-matter blocks from the markdown
    inside_comment = False
    for line in lines:
        if line.startswith("---"):
            inside_comment = not inside_comment
            continue  # drop the --- delimiter lines themselves
        if not inside_comment:
            filtered_lines.append(line)

    # Add the "static" prefix to image paths
    filtered_lines = [
        line.replace("![](/changelog/", "![](/static/changelog/")
        for line in filtered_lines
    ]

    return markdown2.markdown("\n".join(filtered_lines))


def load_all_change_logs():
    # Eventually there will be too many changelogs
    # TODO: add a maximum number of change logs to load
@@ -16,10 +44,13 @@ def load_all_change_logs():
    change_log_files = sorted(os.listdir(CHANGE_LOG_PATH), reverse=True)
    for filename in change_log_files:
        with open(os.path.join(CHANGE_LOG_PATH, "./{}".format(filename))) as f:
            changelog_date = filename.split(".")[0]
            __change_logs.append(
                {
                    "date": filename.split(".")[0],
                    "content": markdown2.markdown(f.read()),
                    "date": changelog_date,
                    "content": generate_change_log(
                        f"{changelog_date}\n" + f.read()
                    ),
                }
            )
    return __change_logs
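
For reference, here is a minimal usage sketch of `generate_change_log`; the sample changelog text is made up, and the `lib.change_log` import path is an assumption based on this file's location:

```py
# Hypothetical usage sketch -- the import path is assumed from the file
# location datahub/server/lib/change_log.py.
from lib.change_log import generate_change_log

sample = "\n".join([
    "---",
    "id: changelog-20201130",  # front-matter block, stripped out
    "---",
    "## New documentation site",
    "![](/changelog/20201130/site.png)",  # rewritten to /static/changelog/
])

# The result is HTML with the front-matter removed and the image path
# pointing at /static/changelog/20201130/site.png.
print(generate_change_log(sample))
```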
Binary file modified datahub/static/changelog/20200121/board1.gif
Binary file modified datahub/static/changelog/20200121/board2.gif
Binary file modified datahub/static/changelog/20200121/board3.gif
Binary file modified datahub/static/changelog/20200121/board4.gif
Binary file modified datahub/static/changelog/20200121/graph1.gif
Binary file modified datahub/static/changelog/20200121/graph2.gif
2 changes: 1 addition & 1 deletion docs/admin_guide/add_custom_jobs.md
@@ -9,7 +9,7 @@ Please use the plugin repo to add custom jobs.

When adding a job, use the following format:

```
```py
'[name of the task]': {
    'task': '[import path of the task]',
    'schedule': '0 0 * * *', # cron schedule
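    # For instance, a filled-in entry might look like the hypothetical
    # sketch below; the task name and import path are illustrative,
    # not from the DataHub codebase.
    'my_nightly_cleanup': {
        'task': 'tasks.my_nightly_cleanup.run',  # illustrative import path
        'schedule': '0 0 * * *',  # cron schedule: every day at midnight
    },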
6 changes: 3 additions & 3 deletions docs/admin_guide/deployment_guide.md
@@ -7,9 +7,9 @@ sidebar_label: Deployment Guide
While there are many ways to deploy DataHub to production, there are some general principles
that are recommended when setting up your own production deployment.

1. Please make sure the web server, the beat scheduler, and the celery worker are using the production Docker image. Please refer to `containers/docker-compose.prod.yml` to see how to launch these images for production.
1. Please make sure the web server, the celery beat scheduler, and the celery worker are using the production Docker image. Please refer to `containers/docker-compose.prod.yml` to see how to launch these images for production.
2. To get logs, make sure /var/log/datahub/ is mounted as a Docker volume so that the logs can be moved to the host machine.
3. Use the /ping/ endpoint for health checks.
1. During deployments, you can create a file that has the path `/tmp/datahub/deploying` to make the health check return 503 and remove it after completion.
4. Please make sure the celery worker is run with concurrent mode as it is the only mode that can have a memory limit.
1. During deployments, you can create a file that has the path `/tmp/datahub/deploying` to make the health check endpoint /ping/ return 503, and remove it after completion (see the sketch after this list).
4. Please make sure the celery worker is run with concurrent mode as it is the only mode that can have a memory limit per worker.
5. During worker deployments, you can run the following first to make the celery worker stop receiving new tasks and exit once all current tasks are finished: `celery multi stopwait datahub_worker@%h -A tasks.all_tasks --pidfile=/opt/celery_%n.pid`. This will make the deployment take much longer, but users' running queries won't be killed.
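
As a concrete sketch of the health-check trick in step 3 (the file path and /ping/ behavior are the ones described above; the surrounding steps are illustrative):

```sh
# Mark the instance as deploying so the /ping/ health check returns 503
# and the load balancer drains traffic away from it.
touch /tmp/datahub/deploying

# ... roll out the new containers here (illustrative step) ...

# Remove the file once the deployment completes so /ping/ passes again.
rm /tmp/datahub/deploying
```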
6 changes: 6 additions & 0 deletions docs/admin_guide/general_config.md
@@ -12,6 +12,10 @@ The first user that installs DataHub gets the "Admin" role. Admins can add other

In the next section we will go over different things that can be configured in the Admin tools.

:::info
Check out [Sharing & Security](./sharing_and_security.md) to learn how to configure access permissions for these entities.
:::

#### Environment

Environments ensure users on DataHub can only access the information and queries they have permission to see. All DataDocs and Query Engines are attached to environments.
@@ -23,6 +27,8 @@ Here are all the fields needed when creating an environment:
- Archived: Once archived, environments cannot be accessed on the website.
- Environment description: A short description of the environment which will appear as a tooltip on the environment picker.
- Logo Url: By default an environment's icon is shown as a square button with the first letter of the environment's name in it. You can also supply a custom image.
- Hidden: If turned on, the environment is hidden from users who do not have access to it.
- Shareable: If turned off, DataDocs will be private by default and query executions can only be viewed by the owner.

Once an environment is created, you can use `Add/Remove User` to add/remove user access to an environment.

4 changes: 3 additions & 1 deletion docs/admin_guide/infra_config.md
@@ -6,7 +6,9 @@ sidebar_label: Infra Config

## Overview

<b>THIS GUIDE IS ONLY FOR INFRASTRUCTURE SETUP, PLEASE READ [GENERAL CONFIG](../admin_guide/general_config.md) FOR GENERIC CONFIGS</b>
:::caution
THIS GUIDE IS ONLY FOR INFRASTRUCTURE SETUP, PLEASE READ THE [GENERAL CONFIG](../admin_guide/general_config.md) TO CONFIGURE ENTITIES SUCH AS QUERY ENGINE & ACCESS PERMISSION.
:::

Even though DataHub can be launched without any configuration, configuration is absolutely required for more powerful infrastructure and flexible customization. In this guide we will walk through the different kinds of environment settings you can set for DataHub. You can see all possible options and default values in this repo by checking out `datahub/config/datahub_default_config.yaml`.

3 changes: 2 additions & 1 deletion docs/admin_guide/quick_start.md
@@ -2,11 +2,12 @@
id: quick_start
title: Quick Start
sidebar_label: Quick Start
slug: /
---

After cloning the repo, run the following

```
```sh
make
```

36 changes: 36 additions & 0 deletions docs/admin_guide/sharing_and_security.md
@@ -0,0 +1,36 @@
---
id: sharing_and_security
title: Sharing & Security
sidebar_label: Sharing & Security
---

To understand how security & access restrictions work, let's first go over the core entities of DataHub. They are:

- Environments
- DataDocs
- Tables & Schemas
- Metastore
- Query Engines
- Query Executions

All of these entities can be connected to each other as a tree where the environment is the root node. All DataDocs are required to be inside a single environment, whereas query engines have many-to-many relationships with environments. Each query engine can belong to a single metastore, and every metastore is associated with 0 or more tables/schemas. Lastly, each query execution must belong to a query engine.

When checking whether a user has access to a certain entity, DataHub walks up the tree all the way to the environments. Since there are many-to-many relationships, an entity may be related to multiple environments. If the user can access any one of them, then they can access the entity.
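
To make the walk concrete, here is a minimal sketch of the access check described above; the models and field names are illustrative, not DataHub's actual schema:

```py
from dataclasses import dataclass, field
from typing import List, Set

# Illustrative models only -- not DataHub's actual SQLAlchemy schema.
@dataclass
class Environment:
    public: bool = False
    acl_user_ids: Set[int] = field(default_factory=set)

@dataclass
class QueryEngine:
    # Query engines have a many-to-many relationship with environments.
    environments: List[Environment] = field(default_factory=list)

@dataclass
class QueryExecution:
    # Each query execution must belong to a query engine.
    engine: QueryEngine

def user_can_access_execution(user_id: int, execution: QueryExecution) -> bool:
    # Walk up the tree: execution -> engine -> environments (the roots).
    # Access to any one related environment grants access to the entity.
    return any(
        env.public or user_id in env.acl_user_ids
        for env in execution.engine.environments
    )

# A private environment with only user 42 on its ACL:
env = Environment(public=False, acl_user_ids={42})
execution = QueryExecution(engine=QueryEngine(environments=[env]))
assert user_can_access_execution(42, execution)
assert not user_can_access_execution(7, execution)
```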

The granularity of access can be configured further with environment configs. Currently, here are all the options to configure an Environment in DataHub:

- Public
- Hidden
- Shareable

A public environment means anyone who has access to the DataHub tool can access this environment. To only allow certain users to access the environment, you need to change the environment to private and add users one by one to the environment ACL. This can be done either in the Admin UI manually or through a dynamic script that runs automatically via the jobs plugin.

A hidden environment means that a user will not see the environment if they do not have access to it. Sometimes it is useful to turn that option off to let users know an environment exists even though they do not have access to it.

The shareable option is the most complex environment configuration in DataHub. By default, all DataDocs created in an environment are public DataDocs, so all users who can access that environment can view them. Similarly, all users in that environment can access all query executions associated with it. The shareable option is on by default as it reduces the number of operations required to share a DataDoc or a query execution with someone else. If the shareable option is turned off, then all DataDocs created within the environment are private by default, and a query execution can only be viewed by the user who executed it or by anyone who has access to the DataDoc that contains the execution. The owner can still invite others to view by sharing the DataDoc or execution manually.

:::note
For a public DataDoc, users cannot edit it unless they are invited with edit permission. Furthermore, DataDocs in a shareable environment can still be converted to private so they are not accessible to the public.
:::

As a footnote, access permission for DataDoc search is verified at the environment level, table search is verified at the metastore level, and user search is available to all users on DataHub. Both public and private DataDocs can be searched, but users only see search results that they have access to.
2 changes: 1 addition & 1 deletion docs/admin_guide/troubleshoot.md
@@ -11,7 +11,7 @@ Please run migrations when making changes to SQLAlchemy schema definitions
If DataDocs are not showing up in search results and,
when you run `make`, you get the following message in the bash console:

```
```sh
elasticsearch | [2019-03-27T20:35:00,273][INFO ][o.e.x.m.p.NativeController] [kcqBkjB] Native controller process has stopped - no new native processes can be started
```
