Generic Input/Output of DataFrames #475

Merged: 12 commits, Aug 13, 2022

@windoze (Member) commented Jul 15, 2022

This PR includes:

  1. Extended the former InputLocation classes to support both reading and writing, and renamed them to DataLocation to reflect this change.
  2. Added a GenericLocation that supports all Spark confs, modes, and options, so it can be used with virtually any connector supported by Spark.
  3. Added a format-specific patching mechanism to GenericLocation to work around the quirks of individual connectors, e.g. CosmosDb requires every row to have an id column with a unique value.
  4. Updated FeathrGenJob and FeathrJoinJob so that they accept a JSON-encoded DataLocation string, instead of a plain path, as the input and output target.
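
To illustrate item 4, here is a minimal sketch in plain Python of how a job argument might be interpreted as either a JSON-encoded location or a plain path. It is not the code in this PR (Feathr core is not written in Python). The class names mirror the description above, but the JSON field names (`type`, `format`, `mode`, `options`, `conf`), the `SimplePath` class, and the `parse_location` helper are assumptions made for the example.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class GenericLocation:
    """Hypothetical stand-in for a location that can describe any Spark connector."""
    format: str                                             # Spark data source name
    mode: Optional[str] = None                              # save mode used when writing
    options: Dict[str, str] = field(default_factory=dict)   # reader/writer options
    conf: Dict[str, str] = field(default_factory=dict)      # session-level Spark confs


@dataclass
class SimplePath:
    """Hypothetical stand-in for the former plain-path location."""
    path: str


def parse_location(text: str) -> Union[GenericLocation, SimplePath]:
    """Accept either a JSON-encoded location or a plain path string."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not JSON at all: treat the argument as a plain path, as before.
        return SimplePath(text)
    if not isinstance(data, dict):
        # Valid JSON but not an object (e.g. a bare number): still a path.
        return SimplePath(text)
    if data.get("type") == "generic":
        return GenericLocation(
            format=data["format"],
            mode=data.get("mode"),
            options=dict(data.get("options", {})),
            conf=dict(data.get("conf", {})),
        )
    return SimplePath(data["path"])
```

Falling back to a plain path when the argument is not a JSON object keeps existing job invocations working, which seems to be the intent of accepting both forms.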

Theoretically, Feathr core can support all Spark connectors with this patch, but we still need to run a series of compatibility tests to confirm the final list.
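
The format-specific patching from item 3 can be pictured as a registry of per-format fix-ups applied to the rows before they are written. The sketch below models rows as plain Python dictionaries rather than a Spark DataFrame. The `PATCHES` registry, the `ensure_string_id` and `patch_for_format` helpers, and the `"cosmos.oltp"` key are illustrative assumptions, not the implementation in this PR.

```python
import uuid
from typing import Callable, Dict, List

Row = Dict[str, object]


def ensure_string_id(rows: List[Row]) -> List[Row]:
    """Return copies of the rows in which every row has a string `id`.

    Rows without an id get a freshly generated UUID; existing ids are
    converted to strings. Existing ids are not checked for duplicates.
    """
    patched = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's rows are left untouched
        if new_row.get("id") is None:
            new_row["id"] = str(uuid.uuid4())
        else:
            new_row["id"] = str(new_row["id"])
        patched.append(new_row)
    return patched


# Hypothetical registry: output format name -> fix-up applied before writing.
PATCHES: Dict[str, Callable[[List[Row]], List[Row]]] = {
    "cosmos.oltp": ensure_string_id,
}


def patch_for_format(fmt: str, rows: List[Row]) -> List[Row]:
    """Apply the fix-up registered for `fmt`, or return the rows unchanged."""
    patch = PATCHES.get(fmt)
    return patch(rows) if patch else rows
```

Keying the fix-ups by format keeps connector quirks out of the generic read/write path: formats with no registered patch pass through untouched.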

NOTE: This PR only involves Feathr core; the corresponding Feathr Client changes will come in upcoming PRs.

@xiaoyongzhu (Member) commented

This PR looks good to me and is a good way to extend support to other sources in the future. Thanks @windoze for the work!

@windoze added the "safe to test" label (Tag to execute build pipeline for a PR from forked repo) on Jul 15, 2022
@windoze merged commit 671bae3 into main on Aug 13, 2022
@xiaoyongzhu deleted the windoze/generic-io branch on August 22, 2022 17:15
ahlag pushed a commit to ahlag/feathr that referenced this pull request Aug 26, 2022
* GenericLocation for DataFrame read/write

* WIP

* Generate id column

* Fix unit test

* Parse string into DataLocation

* Id column must be string

* Fix auth logic

* Fix unit test

* Fix id column generation

* CosmosDb Sink