Add table lineage model #126
Awesome! Left one comment.
for downstream_tab in self.downstream_deps:
    # every deps should follow '{db}://{cluster}.{schema}/{table}'
    # todo: if we change the table uri, we should change here.
    m = re.match('(\w+)://(\w+)\.(\w+)\/(\w+)', downstream_tab)
I know a regex would work, but our key is a URI after all. Do you think we could use URL parsing instead?
https://docs.python.org/3/library/urllib.parse.html#url-parsing
Hey @jinhyukchang, I played around with the lib:
In [2]: a='hive://gold.test_s/test_t'
In [3]: o=urlparse(a)
In [4]: o
Out[4]: ParseResult(scheme='hive', netloc='gold.test_s', path='/test_t', params='', query='', fragment='')
In [5]: o=urlparse('hive.test_s.test_t')
In [6]: o
Out[6]: ParseResult(scheme='', netloc='', path='hive.test_s.test_t', params='', query='', fragment='')
I'm trying to understand the value of using urllib, as I see two issues:
- We would still have to split the parsed result to get the cluster name and schema name, even with urlparse (both end up in netloc above).
- urlparse's interface is different for py2 and py3.
So I would like to understand more about the reason for using this lib. Is it because the string could be encoded differently in other environments?
Yep, it still needs some extra handling after urlparse. It looks cleaner to me, but I am fine with the regex as well. Will leave it to you!
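For anyone comparing the two options discussed in this thread, here is a minimal sketch of both, assuming the dependency key always follows '{db}://{cluster}.{schema}/{table}'. The function names split_with_regex and split_with_urlparse are hypothetical and are not part of this PR; the pattern is the one from the diff, written as a raw string.

```python
import re
from urllib.parse import urlparse  # Python 3; on Python 2 the function lives in the `urlparse` module

# Pattern from the diff under review, written as a raw string.
TABLE_URI_PATTERN = re.compile(r'(\w+)://(\w+)\.(\w+)/(\w+)')


def split_with_regex(table_uri):
    """Return (db, cluster, schema, table), or None if the key does not match."""
    m = TABLE_URI_PATTERN.match(table_uri)
    return m.groups() if m else None


def split_with_urlparse(table_uri):
    """Same result via urlparse; netloc and path still need manual splitting."""
    parsed = urlparse(table_uri)
    # netloc holds '{cluster}.{schema}', path holds '/{table}'.
    cluster, sep, schema = parsed.netloc.partition('.')
    table = parsed.path.lstrip('/')
    if not (parsed.scheme and cluster and sep and schema and table):
        return None
    return (parsed.scheme, cluster, schema, table)
```

Both helpers return the same tuple for 'hive://gold.test_s/test_t' and None for 'hive.test_s.test_t', which matches the IPython session above: urlparse saves the scheme handling, but the cluster/schema split is still manual.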
e99889f to a9dffdf
Codecov Report
@@            Coverage Diff             @@
##           master     #126      +/-   ##
==========================================
- Coverage   83.16%   83.05%   -0.12%
==========================================
  Files          55       56       +1
  Lines        2745     2785      +40
  Branches      283      285       +2
==========================================
+ Hits         2283     2313      +30
- Misses        372      381       +9
- Partials       90       91       +1
Summary of Changes
Currently we don't have a table lineage extractor. Providing a data model is the first step towards this goal.
Users can build a generic or specific lineage extractor with this model.
Tests
Yes
Documentation
What documentation did you add or modify and why? Add any relevant links then remove this line
CheckList
Make sure you have checked all steps below to ensure a timely review.
make test