Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add table lineage model #126

Merged
merged 1 commit into from Aug 6, 2019
Merged

Add table lineage model #126

merged 1 commit into from Aug 6, 2019

Conversation

feng-tao
Copy link
Member

@feng-tao feng-tao commented Aug 6, 2019

Summary of Changes

Currently we don't have table lineage extractor. Provide a data model is the first step towards this goal.

User could build a generic / specific . lineage extractor with this model.

Tests

Yes

Documentation

What documentation did you add or modify and why? Add any relevant links then remove this line

CheckList

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
  • PR passes make test
  • I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message"

Copy link
Contributor

@jinhyukchang jinhyukchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Left one comment.

for downstream_tab in self.downstream_deps:
# every deps should follow '{db}://{cluster}.{schema}/{table}'
# todo: if we change the table uri, we should change here.
m = re.match('(\w+)://(\w+)\.(\w+)\/(\w+)', downstream_tab)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know REGEX would work, but our key is uri after all. Do you think we can use urlparsing?
https://docs.python.org/3/library/urllib.parse.html#url-parsing

Copy link
Member Author

@feng-tao feng-tao Aug 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @jinhyukchang , played around with the lib:


In [2]: a='hive://gold.test_s/test_t'

In [3]: o=urlparse(a)

In [4]: o
Out[4]: ParseResult(scheme='hive', netloc='gold.test_s', path='/test_t', params='', query='', fragment='')

In [5]: o=urlparse('hive.test_s.test_t')

In [6]: o
Out[6]: ParseResult(scheme='', netloc='', path='hive.test_s.test_t', params='', query='', fragment='')

Try to understand the value of using urllib as I see two issues:

  1. we will still parse the path to get cluster name and schema name even with urlparse.
  2. urlparse's interface is different for py2 and py3.

So I would like to understand more with this lib. Is it due to the string could be encoded in other env?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @jinhyukchang , played around with the lib:


In [2]: a='hive://gold.test_s/test_t'

In [3]: o=urlparse(a)

In [4]: o
Out[4]: ParseResult(scheme='hive', netloc='gold.test_s', path='/test_t', params='', query='', fragment='')

In [5]: o=urlparse('hive.test_s.test_t')

In [6]: o
Out[6]: ParseResult(scheme='', netloc='', path='hive.test_s.test_t', params='', query='', fragment='')

Try to understand the value of using urllib as I see two issues:

  1. we will still parse the path to get cluster name and schema name even with urlparse.
  2. urlparse's interface is different for py2 and py3.

So I would like to understand more with this lib. Is it due to the string could be encoded in other env?

Yep, it still needs some operation after urlparse. It looks cleaner to me but I am fine with REGEX as well. Will leave it to you!

@codecov-io
Copy link

codecov-io commented Aug 6, 2019

Codecov Report

Merging #126 into master will decrease coverage by 0.11%.
The diff coverage is 75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #126      +/-   ##
==========================================
- Coverage   83.16%   83.05%   -0.12%     
==========================================
  Files          55       56       +1     
  Lines        2745     2785      +40     
  Branches      283      285       +2     
==========================================
+ Hits         2283     2313      +30     
- Misses        372      381       +9     
- Partials       90       91       +1
Impacted Files Coverage Δ
databuilder/models/table_lineage.py 75% <75%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a73a9a...a9dffdf. Read the comment docs.

@feng-tao feng-tao merged commit fdb53bf into master Aug 6, 2019
@feng-tao feng-tao deleted the tfeng_add_lineage_model branch August 6, 2019 23:13
@jornh jornh mentioned this pull request Aug 12, 2019
6 tasks
PabTorre pushed a commit to PabTorre/amundsendatabuilder that referenced this pull request May 29, 2020
…#126)

Bumps [amundsenfrontendlibrary](https://github.com/lyft/amundsenfrontendlibrary) from `675a29d` to `5a90e8a`.
- [Release notes](https://github.com/lyft/amundsenfrontendlibrary/releases)
- [Commits](amundsen-io/amundsenfrontendlibrary@675a29d...5a90e8a)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants