Add new documentation site for DataHub (#327)
czgu committed Nov 30, 2020
1 parent cd5a67a commit 4f052e2
Showing 102 changed files with 8,252 additions and 3,518 deletions.
59 changes: 58 additions & 1 deletion .gitattributes
@@ -1 +1,58 @@
**/yarn.lock -diff
*.lock text -diff
package-lock.json text -diff

# Graphics
*.ai binary
*.bmp binary
*.eps binary
*.gif binary
*.gifv binary
*.ico binary
*.jng binary
*.jp2 binary
*.jpg binary
*.jpeg binary
*.jpx binary
*.jxr binary
*.pdf binary
*.png binary
*.psb binary
*.psd binary
# SVG treated as an asset (binary) by default.
*.svg text
# If you want to treat it as binary,
# use the following line instead.
# *.svg binary
*.svgz binary
*.tif binary
*.tiff binary
*.wbmp binary
*.webp binary

# Audio
*.kar binary
*.m4a binary
*.mid binary
*.midi binary
*.mp3 binary
*.ogg binary
*.ra binary

# Video
*.3gpp binary
*.3gp binary
*.as binary
*.asf binary
*.asx binary
*.fla binary
*.flv binary
*.m4v binary
*.mng binary
*.mov binary
*.mp4 binary
*.mpeg binary
*.mpg binary
*.ogv binary
*.swc binary
*.swf binary
*.webm binary
19 changes: 17 additions & 2 deletions .gitignore
@@ -1,12 +1,21 @@
.DS_Store
.idea/*
*.iml
.env.local
.env.development.local
.env.test.local
.env.production.local


package-lock.json
# jenkins job should not delete .env
.env/

yarn-error.log
# Frontend logs
npm-debug.log*
yarn-debug.log*
yarn-error.log*


# Build files
dist/
@@ -22,10 +31,16 @@ __pycache__/
celerybeat.pid

# ignore for testing/custom settings
node_modules/*
**/node_modules
.vscode/*
*.pyc
.vscode.env

# Jest coverage files
datahub/webapp/coverage/


# Documentation site
.docusaurus
.cache-loader
/build
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -7,6 +7,7 @@ repos:
- id: end-of-file-fixer
- id: check-yaml
- id: check-added-large-files
args: ['--maxkb=10000']
- repo: https://github.com/psf/black
rev: 19.10b0
hooks:
80 changes: 2 additions & 78 deletions CONTRIBUTING.md
@@ -14,82 +14,6 @@
limitations under the License.
-->

# Contributing
## Thanks for taking the time to contribute!

First off, thanks for taking the time to contribute! This guide will answer
some common questions about how this project works.

While this is a Pinterest open source project, we welcome contributions from
everyone.

## Code of Conduct

Please be sure to read and understand our [`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md).
We work hard to ensure that our projects are welcoming and inclusive to as many
people as possible.

## Reporting Issues

If you have a bug report, please provide as much information as possible so that
we can help you out:

- Version of the project you're using.
- Code (or, even better, a whole project) that reproduces the issue.
- Steps which reproduce the issue.
- Screenshots, GIFs or videos (if relevant).
- Stack traces for crashes.
- Any logs produced.

## Fixing Bugs

We welcome and appreciate bug fixes from anyone!
- You can fix bugs as per "Making Changes" and send for code review!

## Adding New Features

We welcome and appreciate new feature contributions to DataHub!
Following is the current process:
- Please create a GitHub issue proposing your new feature, including what and why. It can be brief, one or two paragraphs is ok.
- The project maintainers will then approve the new feature proposal.
- You can then briefly describe your intended technical design for the new feature.
- The project maintainers will then approve the technical design and/or request changes.
- Then you can implement the new feature as per "Making Changes" and send for code review!

## Making Changes

Please first check "Fixing Bugs" or "Adding New Features" as appropriate.

1. Create a new branch off of master. (We can't currently enable forking while the repo is in private beta)
2. Make your changes and verify that tests pass
3. Commit your work and push to origin your new branch
4. Submit a pull request to merge to master
5. Ensure your code passes both linter and unit tests
6. Participate in the code review process by responding to feedback

Once there is agreement that the code is in good shape, one of the project's
maintainers will merge your contribution.

To increase the chances that your pull request will be accepted:

- Follow the coding style
- Write tests for your changes
- Write a good commit message

## Help

Start by reading the [developer setup guide](docs/developer_guide/developer_setup.md) to set up DataHub.
If you're having trouble using this project, please check the [developer guides](docs/developer_guide/)
and search for solutions in the existing open and closed issues.

You can also reach out to us at datahub@pinterest.com or on our [Slack](https://join.slack.com/t/datahubchat/shared_invite/zt-dpr988af-9VwGkjcmPhqTmRoA2Tm3gg).

## Security

If you've found a security issue in one of our open source projects,
please report it at [Bugcrowd](https://bugcrowd.com/pinterest); you may even
make some money!

## License

By contributing to this project, you agree that your contributions will be
licensed under its [license](LICENSE).
Please check out the [guide on contribution](docs/contributing/overview.md) before you start.
35 changes: 33 additions & 2 deletions datahub/server/lib/change_log.py
@@ -7,6 +7,34 @@
__change_logs = None


def generate_change_log(raw_text: str) -> str:
    # Since the markdown is used for the documentation site,
    # we need to preprocess it so that it is presentable
    # within DataHub.

    # TODO: either move the changelog completely to the documentation site
    # or come up with a solution that is compatible with both

    lines = raw_text.split("\n")
    filtered_lines = []

    # Remove --- front-matter blocks from the markdown
    inside_comment = False
    for line in lines:
        if line.startswith("---"):
            inside_comment = not inside_comment
            continue  # drop the --- delimiter lines themselves
        if not inside_comment:
            filtered_lines.append(line)

    # Add the "static" prefix to image paths
    filtered_lines = [
        line.replace("![](/changelog/", "![](/static/changelog/")
        for line in filtered_lines
    ]

    return markdown2.markdown("\n".join(filtered_lines))


def load_all_change_logs():
    # Eventually there will be too many changelogs
    # TODO: add a maximum number of change logs to load
@@ -16,10 +44,13 @@ def load_all_change_logs():
    change_log_files = sorted(os.listdir(CHANGE_LOG_PATH), reverse=True)
    for filename in change_log_files:
        with open(os.path.join(CHANGE_LOG_PATH, "./{}".format(filename))) as f:
            changelog_date = filename.split(".")[0]
            __change_logs.append(
                {
                    "date": filename.split(".")[0],
                    "content": markdown2.markdown(f.read()),
                    "date": changelog_date,
                    "content": generate_change_log(
                        f"{changelog_date}\n" + f.read()
                    ),
                }
            )
    return __change_logs
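
For reference, here is a minimal usage sketch of `generate_change_log`; the sample changelog text is made up, and the `lib.change_log` import path is an assumption based on this file's location:

```py
# Hypothetical usage sketch -- the import path is assumed from the file
# location datahub/server/lib/change_log.py.
from lib.change_log import generate_change_log

sample = "\n".join([
    "---",
    "id: changelog-20201130",  # front-matter block, stripped out
    "---",
    "## New documentation site",
    "![](/changelog/20201130/site.png)",  # rewritten to /static/changelog/
])

# The result is HTML with the front-matter removed and the image path
# pointing at /static/changelog/20201130/site.png.
print(generate_change_log(sample))
```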
Binary file modified datahub/static/changelog/20200121/board1.gif
Binary file modified datahub/static/changelog/20200121/board2.gif
Binary file modified datahub/static/changelog/20200121/board3.gif
Binary file modified datahub/static/changelog/20200121/board4.gif
Binary file modified datahub/static/changelog/20200121/graph1.gif
Binary file modified datahub/static/changelog/20200121/graph2.gif
2 changes: 1 addition & 1 deletion docs/admin_guide/add_custom_jobs.md
@@ -9,7 +9,7 @@ Please use the plugin repo to add custom jobs.

When adding a job, use the following format:

```
```py
'[name of the task]': {
    'task': '[import path of the task]',
    'schedule': '0 0 * * *', # cron schedule
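    # For instance, a filled-in entry might look like the hypothetical
    # sketch below; the task name and import path are illustrative,
    # not from the DataHub codebase.
    'my_nightly_cleanup': {
        'task': 'tasks.my_nightly_cleanup.run',  # illustrative import path
        'schedule': '0 0 * * *',  # cron schedule: every day at midnight
    },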
6 changes: 3 additions & 3 deletions docs/admin_guide/deployment_guide.md
@@ -7,9 +7,9 @@ sidebar_label: Deployment Guide
While there are many ways to deploy DataHub to production, there are some general principles
that are recommended when setting up your own production deployment.

1. Please make sure the web server, the beat scheduler, and the celery worker are using the production Docker image. Please refer to `containers/docker-compose.prod.yml` to see how to launch these images for production.
1. Please make sure the web server, the celery beat scheduler, and the celery worker are using the production Docker image. Please refer to `containers/docker-compose.prod.yml` to see how to launch these images for production.
2. To get logs, make sure /var/log/datahub/ is mounted as a Docker volume so that the logs can be moved to the host machine.
3. Use the /ping/ endpoint for health checks.
1. During deployments, you can create a file that has the path `/tmp/datahub/deploying` to make the health check return 503 and remove it after completion.
4. Please make sure the celery worker is run with concurrent mode as it is the only mode that can have a memory limit.
1. During deployments, you can create a file that has the path `/tmp/datahub/deploying` to make the health check endpoint /ping/ return 503, and remove it after completion (see the sketch after this list).
4. Please make sure the celery worker is run with concurrent mode as it is the only mode that can have a memory limit per worker.
5. During worker deployments, you can run the following first to make the celery worker stop receiving new tasks and exit once all current tasks are finished: `celery multi stopwait datahub_worker@%h -A tasks.all_tasks --pidfile=/opt/celery_%n.pid`. This will make the deployment take much longer, but users' running queries won't be killed.
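
As a concrete sketch of the health-check trick in step 3 (the file path and /ping/ behavior are the ones described above; the surrounding steps are illustrative):

```sh
# Mark the instance as deploying so the /ping/ health check returns 503
# and the load balancer drains traffic away from it.
touch /tmp/datahub/deploying

# ... roll out the new containers here (illustrative step) ...

# Remove the file once the deployment completes so /ping/ passes again.
rm /tmp/datahub/deploying
```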
6 changes: 6 additions & 0 deletions docs/admin_guide/general_config.md
@@ -12,6 +12,10 @@ The first user that installs DataHub gets the "Admin" role. Admins can add other

In the next section we will go over different things that can be configured in the Admin tools.

:::info
Check out [Sharing & Security](./sharing_and_security.md) to learn how to configure access permissions for these entities.
:::

#### Environment

Environments ensure users on DataHub can only access the information and queries they have permission to see. All DataDocs and Query Engines are attached to environments.
@@ -23,6 +27,8 @@ Here are all the fields needed when creating an environment:
- Archived: Once archived, environments cannot be accessed on the website.
- Environment description: A short description of the environment which will appear as a tooltip on the environment picker.
- Logo Url: By default an environment's icon is shown as a square button with the first letter of the environment's name in it. You can also supply a custom image.
- Hidden: If turned on, the environment is hidden from users who do not have access to it.
- Shareable: If turned off, DataDocs will be private by default and query executions can only be viewed by the owner.

Once an environment is created, you can use `Add/Remove User` to add/remove user access to an environment.

4 changes: 3 additions & 1 deletion docs/admin_guide/infra_config.md
@@ -6,7 +6,9 @@ sidebar_label: Infra Config

## Overview

<b>THIS GUIDE IS ONLY FOR INFRASTRUCTURE SETUP, PLEASE READ [GENERAL CONFIG](../admin_guide/general_config.md) FOR GENERIC CONFIGS</b>
:::caution
THIS GUIDE IS ONLY FOR INFRASTRUCTURE SETUP, PLEASE READ THE [GENERAL CONFIG](../admin_guide/general_config.md) TO CONFIGURE ENTITIES SUCH AS QUERY ENGINE & ACCESS PERMISSION.
:::

Even though DataHub can be launched without any configuration, configuration is absolutely required for more powerful infrastructure and flexible customization. In this guide we will walk through the different kinds of environment settings you can set for DataHub. You can see all possible options and default values in this repo by checking out `datahub/config/datahub_default_config.yaml`.

3 changes: 2 additions & 1 deletion docs/admin_guide/quick_start.md
@@ -2,11 +2,12 @@
id: quick_start
title: Quick Start
sidebar_label: Quick Start
slug: /
---

After cloning the repo, run the following

```
```sh
make
```

36 changes: 36 additions & 0 deletions docs/admin_guide/sharing_and_security.md
@@ -0,0 +1,36 @@
---
id: sharing_and_security
title: Sharing & Security
sidebar_label: Sharing & Security
---

To understand how security & access restrictions work, let's first go over the core entities of DataHub. They are:

- Environments
- DataDocs
- Tables & Schemas
- Metastore
- Query Engines
- Query Executions

All of these entities can be connected to each other as a tree where the environment is the root node. All DataDocs are required to be inside a single environment, whereas query engines have many-to-many relationships with environments. Each query engine can belong to a single metastore, and every metastore is associated with 0 or more tables/schemas. Lastly, each query execution must belong to a query engine.

When checking whether a user has access to a certain entity, DataHub walks up the tree all the way to the environments. Since there are many-to-many relationships, an entity may be related to multiple environments. If the user can access any one of them, then they can access the entity.
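
To make the walk concrete, here is a minimal sketch of the access check described above; the models and field names are illustrative, not DataHub's actual schema:

```py
from dataclasses import dataclass, field
from typing import List, Set

# Illustrative models only -- not DataHub's actual SQLAlchemy schema.
@dataclass
class Environment:
    public: bool = False
    acl_user_ids: Set[int] = field(default_factory=set)

@dataclass
class QueryEngine:
    # Query engines have a many-to-many relationship with environments.
    environments: List[Environment] = field(default_factory=list)

@dataclass
class QueryExecution:
    # Each query execution must belong to a query engine.
    engine: QueryEngine

def user_can_access_execution(user_id: int, execution: QueryExecution) -> bool:
    # Walk up the tree: execution -> engine -> environments (the roots).
    # Access to any one related environment grants access to the entity.
    return any(
        env.public or user_id in env.acl_user_ids
        for env in execution.engine.environments
    )

# A private environment with only user 42 on its ACL:
env = Environment(public=False, acl_user_ids={42})
execution = QueryExecution(engine=QueryEngine(environments=[env]))
assert user_can_access_execution(42, execution)
assert not user_can_access_execution(7, execution)
```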

The granularity of access can be configured further with environment configs. Currently, here are all the options to configure an Environment in DataHub:

- Public
- Hidden
- Shareable

A public environment means anyone who has access to the DataHub tool can access this environment. To only allow certain users to access the environment, you need to change the environment to private and add users one by one to the environment ACL. This can be done either in the Admin UI manually or through a dynamic script that runs automatically via the jobs plugin.

A hidden environment means that a user will not see the environment if they do not have access to it. Sometimes it is useful to turn that option off to let users know an environment exists even though they do not have access to it.

The shareable option is the most complex environment configuration in DataHub. By default, all DataDocs created in an environment are public DataDocs, so all users who can access that environment can view them. Similarly, all users in that environment can access all query executions associated with it. The shareable option is on by default as it reduces the number of operations required to share a DataDoc or a query execution with someone else. If the shareable option is turned off, then all DataDocs created within the environment are private by default, and a query execution can only be viewed by the user who executed it or by anyone who has access to the DataDoc that contains the execution. The owner can still invite others to view by sharing the DataDoc or execution manually.

:::note
For a public DataDoc, users cannot edit it unless they are invited with edit permission. Furthermore, DataDocs in a shareable environment can still be converted to private so they are not accessible to the public.
:::

As a footnote, access permission for DataDoc search is verified at the environment level, table search is verified at the metastore level, and user search is available to all users on DataHub. Both public and private DataDocs can be searched, but users only see search results that they have access to.
2 changes: 1 addition & 1 deletion docs/admin_guide/troubleshoot.md
@@ -11,7 +11,7 @@ Please run migrations when making changes to SQLAlchemy schema definitions
If DataDocs are not showing up in search results and,
when you run `make`, you get the following message in the bash console:

```
```sh
elasticsearch | [2019-03-27T20:35:00,273][INFO ][o.e.x.m.p.NativeController] [kcqBkjB] Native controller process has stopped - no new native processes can be started
```
