[HOPSWORKS-1982] Deequ statistics for Feature Groups/Training Datasets #96

Merged: 10 commits, Sep 25, 2020

moritzmeister
Contributor

This needs a rebase: either this PR or the Python PR should be merged first, and then the other one needs to be rebased.

Comment on lines +168 to +170
if (statisticsEnabled) {
  statisticsEngine.computeStatistics(this, featureData);
}
Contributor

Shouldn't this call the computeStatistics() method? Otherwise you might end up computing feature statistics for the online feature store, which is not bad per se in this case, as you are not querying NDB, but it might confuse users.

Contributor Author

I agree with the confusion part. I just wanted to reuse the DataFrame we already have instead of rereading it; I am not sure Spark is smart enough to recognize that it's already there.

On the other hand, this way the user always has statistics from the very first creation of the feature group, even if it is purely online.
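
The trade-off discussed in this thread can be sketched in plain Java. This is only an illustration under stated assumptions: a `List<Integer>` stands in for the Spark DataFrame, a row count stands in for the Deequ statistics, and `StatisticsOnSaveSketch`, `saveReusingData`, `saveThenReread`, and `offlineReader` are hypothetical names that do not appear in the PR.

```java
import java.util.List;
import java.util.OptionalLong;
import java.util.function.Supplier;

// Hypothetical stand-ins for illustration only. A List<Integer> plays the role
// of the Spark DataFrame, and a row count plays the role of the statistics.
final class StatisticsOnSaveSketch {

  // Stand-in for the statistics computation: it just counts rows.
  static long computeStatistics(List<Integer> featureData) {
    return featureData.size();
  }

  // Option A (the shape of the diff): reuse the data that save() already holds.
  static OptionalLong saveReusingData(List<Integer> featureData,
                                      boolean statisticsEnabled) {
    // ... writing featureData to the feature store would happen here ...
    if (!statisticsEnabled) {
      return OptionalLong.empty();
    }
    return OptionalLong.of(computeStatistics(featureData));
  }

  // Option B (assumed reading of the review comment): read the stored data
  // back through a reader before computing statistics.
  static OptionalLong saveThenReread(List<Integer> featureData,
                                     boolean statisticsEnabled,
                                     Supplier<List<Integer>> offlineReader) {
    // ... writing featureData to the feature store would happen here ...
    if (!statisticsEnabled) {
      return OptionalLong.empty();
    }
    return OptionalLong.of(computeStatistics(offlineReader.get()));
  }
}
```

In this sketch, if the reader returns nothing for a purely online feature group, option B computes statistics over zero rows, while option A still sees the data passed to save. That mirrors the second point above, but whether the real engine behaves this way is an assumption.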

*/
@JsonIgnore
public Statistics getStatistics() throws FeatureStoreException, IOException {
  return statisticsEngine.getLast(this);
Contributor

This validates my point about the Python API. Here we return an object containing commit_time and content, which I think is good.

Contributor Author

The object in Python also contains content and commit time as its only accessible members.
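
An object exposing only a commit time and content, as described in this thread, might look roughly like the following. This is a minimal sketch, not the class from the PR: the name `StatisticsSketch` is hypothetical, and treating both fields as `String` is an assumption.

```java
// Hypothetical sketch; field types are assumptions, not taken from the PR.
final class StatisticsSketch {
  private final String commitTime; // when the statistics were committed
  private final String content;    // serialized statistics payload

  StatisticsSketch(String commitTime, String content) {
    this.commitTime = commitTime;
    this.content = content;
  }

  // The two getters are the only accessible members.
  String getCommitTime() {
    return commitTime;
  }

  String getContent() {
    return content;
  }
}
```

Keeping the fields final and exposing only getters makes the returned object read-only for callers.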

Contributor

SirOibaf left a comment

LGTM, just change the Deequ groupId.

moritzmeister
Contributor Author

Changed the groupId. We should merge this Java PR first, and then I need to rebase #84 so that it also has the right groupId.

SirOibaf merged commit 43c2e1c into logicalclocks:master on Sep 25, 2020
moritzmeister deleted the deequ-java branch on September 28, 2020 15:18