Will Great Expectations support Dask for parallel data integrity validation? #478

Open
vnnw opened this issue May 27, 2019 · 2 comments

Comments

@vnnw
commented May 27, 2019

Since Pandas is now used as one of the underlying workhorses for data validation, I think it should be quite possible to use Dask as a parallel alternative in cases where PySpark is inconvenient to deploy.

@abegong

Contributor

commented May 27, 2019

Dask would be a great execution engine!

I'm not deeply familiar with the Dask internals. At first glance, it looks like we could use the same pattern that we used with python. The main strategy would be to create a DaskDataSet class that inherits from both dask.DataFrame and great_expectations.DataSet.

Does that sound sensible?

Refs:

  • DataFrame docs here
  • DataFrame class here
  • from_pandas method here
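
To make the inheritance idea concrete, here is a rough, library-free sketch of the pattern. Everything in it is a hypothetical stand-in: `ListFrame` plays the role of the backend DataFrame, `ExpectationMixin` plays the role of the expectations base class, and the result dictionary is simplified. It does not use the real Dask or Great Expectations APIs.

```python
class ExpectationMixin:
    """Hypothetical stand-in for the expectations base class.

    It only assumes the host class provides column_values(column).
    """

    def expect_column_values_to_not_be_null(self, column):
        # Count missing entries using the backend's column accessor.
        values = list(self.column_values(column))
        unexpected = sum(1 for v in values if v is None)
        return {"success": unexpected == 0, "unexpected_count": unexpected}


class ListFrame:
    """Toy stand-in for a DataFrame backend, storing columns as lists."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def column_values(self, column):
        return self.columns[column]


class ListDataSet(ExpectationMixin, ListFrame):
    """Combines the expectations and the backend via multiple inheritance."""


ds = ListDataSet({"id": [1, 2, 3], "name": ["a", None, "c"]})
result_id = ds.expect_column_values_to_not_be_null("id")
result_name = ds.expect_column_values_to_not_be_null("name")
```

A Dask-backed version would presumably differ in one important way: Dask DataFrames are lazy, so each expectation would need to trigger a computation (for example through `.compute()`) before it can report a result.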
@vnnw

Author

commented Jun 5, 2019

datatable may be a valuable option too.
