Add table lineage model #126

feng-tao · 2019-08-06T18:28:21Z

Summary of Changes

Currently we don't have a table lineage extractor. Providing a data model is the first step towards this goal.

Users could build a generic or specific lineage extractor with this model.

Tests

Yes

Documentation

What documentation did you add or modify and why? Add any relevant links then remove this line

CheckList

Make sure you have checked all steps below to ensure a timely review.

PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.
PR includes a summary of changes.
PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain docstrings that explain what it does
PR passes make test
I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message"

jinhyukchang

Awesome! Left one comment.

jinhyukchang · 2019-08-06T22:16:27Z

databuilder/models/table_lineage.py

+        for downstream_tab in self.downstream_deps:
+            # every deps should follow '{db}://{cluster}.{schema}/{table}'
+            # todo: if we change the table uri, we should change here.
+            m = re.match('(\w+)://(\w+)\.(\w+)\/(\w+)', downstream_tab)


I know a regex would work, but our key is a URI after all. Do you think we could use URL parsing instead?
https://docs.python.org/3/library/urllib.parse.html#url-parsing

Hey @jinhyukchang, I played around with the lib:

In [2]: a='hive://gold.test_s/test_t'
In [3]: o=urlparse(a)
In [4]: o
Out[4]: ParseResult(scheme='hive', netloc='gold.test_s', path='/test_t', params='', query='', fragment='')
In [5]: o=urlparse('hive.test_s.test_t')
In [6]: o
Out[6]: ParseResult(scheme='', netloc='', path='hive.test_s.test_t', params='', query='', fragment='')

I'm trying to understand the value of using urllib, as I see two issues:

- Even with urlparse, we would still have to parse the result to get the cluster name and schema name.

- urlparse's interface is different for py2 and py3.

So I would like to understand more about the benefit of this lib. Is the concern that the string could be encoded differently in other environments?

Yep, it still needs some extra processing after urlparse. It looks cleaner to me, but I am fine with the regex as well. Will leave it to you!
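For readers comparing the two options discussed above, here is a minimal sketch of both, assuming keys follow the '{db}://{cluster}.{schema}/{table}' format from the diff comment. The function names and the `ValueError` handling are hypothetical and are not taken from the merged code; the anchored pattern is also a choice made for this sketch.

```python
import re

try:
    from urllib.parse import urlparse  # Python 3 location
except ImportError:
    from urlparse import urlparse  # Python 2 location

# Raw string avoids invalid-escape warnings; ^ and $ anchors are an
# assumption of this sketch, not part of the regex in the diff.
TABLE_KEY_PATTERN = re.compile(r'^(\w+)://(\w+)\.(\w+)/(\w+)$')


def split_with_regex(table_key):
    """Return (db, cluster, schema, table) using a single regex match."""
    m = TABLE_KEY_PATTERN.match(table_key)
    if m is None:
        raise ValueError('Unexpected table key: {}'.format(table_key))
    return m.groups()


def split_with_urlparse(table_key):
    """Return (db, cluster, schema, table) using urlparse plus string splits."""
    parsed = urlparse(table_key)
    # urlparse only separates scheme, netloc and path; the netloc still
    # has to be split into cluster and schema by hand.
    cluster, sep, schema = parsed.netloc.partition('.')
    table = parsed.path.lstrip('/')
    if not (parsed.scheme and sep and cluster and schema and table):
        raise ValueError('Unexpected table key: {}'.format(table_key))
    return parsed.scheme, cluster, schema, table
```

Both helpers return the same four parts for a well-formed key and reject a key such as 'hive.test_s.test_t', which has no scheme. The urlparse variant shows the extra splitting step mentioned in the thread.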

codecov-io · 2019-08-06T23:08:59Z

Codecov Report

Merging #126 into master will decrease coverage by 0.11%.
The diff coverage is 75%.

@@            Coverage Diff             @@
##           master     #126      +/-   ##
==========================================
- Coverage   83.16%   83.05%   -0.12%     
==========================================
  Files          55       56       +1     
  Lines        2745     2785      +40     
  Branches      283      285       +2     
==========================================
+ Hits         2283     2313      +30     
- Misses        372      381       +9     
- Partials       90       91       +1

Impacted Files	Coverage Δ
databuilder/models/table_lineage.py	`75% <75%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2a73a9a...a9dffdf.

feng-tao requested a review from jinhyukchang August 6, 2019 18:28

jinhyukchang reviewed Aug 6, 2019

jinhyukchang approved these changes Aug 6, 2019

Add table lineage model

a9dffdf

feng-tao force-pushed the tfeng_add_lineage_model branch from e99889f to a9dffdf Compare August 6, 2019 23:06

feng-tao merged commit fdb53bf into master Aug 6, 2019

feng-tao deleted the tfeng_add_lineage_model branch August 6, 2019 23:13

jornh mentioned this pull request Aug 12, 2019

Add Dashboard and Metrics #120

Merged

6 tasks

jornh mentioned this pull request Aug 21, 2019

Support for showing lineage of table across ETL's in a data warehouse amundsen-io/amundsen#69

Closed
