feat: add filter and timestamp splits #549

sirtorry · 2021-07-16T03:34:10Z

Changes to training splits:

Modifies default behaviour of fraction split
Raises error when multiple split types passed
Adds filter splits (b/172365904) to image, text and video
Adds timestamp split (b/172368070) to tabular

Future:

The next PR will change forecasting training jobs to extend tabular, thereby adding the timestamp split to forecasting.

succeeds #210

google/cloud/aiplatform/training_jobs.py

ivanmkc · 2021-07-16T22:06:19Z

From the docstring, it's unclear to me what happens when None is passed for training_fraction_split.
Followup question, can we default to None?

Also, can we remove the defaults from private methods and only include on public methods? Seems cleaner.

sirtorry · 2021-07-16T22:44:02Z

> From the docstring, it's unclear to me what happens when None is passed for training_fraction_split.
> Followup question, can we default to None?
>
> Also, can we remove the defaults from private methods and only include on public methods? Seems cleaner.

I have no objections to either. @sasha-gitg what are your thoughts?

sasha-gitg · 2021-07-20T19:30:02Z

google/cloud/aiplatform/training_jobs.py

+                    key=timestamp_split_column_name,
+                )
+
+            fraction_split = None


I agree with the comment that the *_fraction_split arguments need to default to None throughout. Otherwise we cannot differentiate between values the user set explicitly and values that were defaulted, which matters because we need to determine whether we received an invalid set of arguments.

If we agree on that, then we also need to update the documentation to state that if no fraction_split, timestamp_split, filter_split, or predefined_split is set, we default to a 0.8, 0.1, 0.1 fraction_split.

There needs to be additional logic at the end here so that, if all splits are None, we fall back to a 0.8, 0.1, 0.1 fraction split.

I don't think this is a breaking change as we moved to setting the argument as Optional.
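To make the point about defaults concrete, here is a minimal sketch. The two helper functions are hypothetical and exist only for illustration; they are not part of the SDK. With a numeric default, an explicit 0.8 looks identical to an omitted argument, whereas a None default lets the code detect what the caller actually passed.

```python
from typing import Optional

# Hypothetical helpers for illustration only; these are not SDK functions.


def user_set_fraction_numeric_default(training_fraction_split: float = 0.8) -> bool:
    """Guess whether the caller set the value when the default is numeric."""
    # Comparing against the default is the best available heuristic here,
    # and it misreports a caller who explicitly passes 0.8.
    return training_fraction_split != 0.8


def user_set_fraction_none_default(
    training_fraction_split: Optional[float] = None,
) -> bool:
    """With a None default, any non-None value must have come from the caller."""
    return training_fraction_split is not None


print(user_set_fraction_numeric_default(0.8))  # explicit 0.8 goes undetected
print(user_set_fraction_none_default(0.8))     # explicit 0.8 is detected
print(user_set_fraction_none_default())        # omitted argument reads as unset
```

Only the second form allows an "invalid combination of arguments" check to treat an explicitly passed fraction as a conflict with another split type.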

sirtorry · 2021-07-23 (edited)

Everything now defaults to None and falls back to 0.8, 0.1, 0.1, with the exception of video, where 0.8, None, 0.2 is used. We'll have to update the docs to reflect all of this.

sasha-gitg · 2021-07-30T16:19:38Z

google/cloud/aiplatform/training_jobs.py

+                        test_fraction_split is None,
+                    ]
+                ):
+                    raise ValueError(


This doesn't seem to be the case: https://github.com/googleapis/python-aiplatform/blob/master/google/cloud/aiplatform_v1/types/training_pipeline.py#L319

It seems like this entire fraction split block can be simplified to:

    if any([training_fraction_split, validation_fraction_split, test_fraction_split]):
        fraction_split = gca_training_pipeline.FractionSplit(
            training_fraction=training_fraction_split,
            validation_fraction=validation_fraction_split,
            test_fraction=test_fraction_split,
        )
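One detail worth checking in a condition like this is how `any` treats falsy values. The sketch below uses two hypothetical helper functions (not SDK code) to compare a truthiness test with an explicit `is not None` test. Whether an explicit 0.0 fraction is a case the SDK needs to handle is an assumption here, not something stated in this thread.

```python
from typing import Optional

# Hypothetical helpers for illustration only; these are not SDK functions.


def requested_by_truthiness(
    training: Optional[float],
    validation: Optional[float],
    test: Optional[float],
) -> bool:
    """Treats 0.0 the same as None because both are falsy."""
    return any([training, validation, test])


def requested_by_none_check(
    training: Optional[float],
    validation: Optional[float],
    test: Optional[float],
) -> bool:
    """Treats any value other than None as explicitly requested."""
    return any(value is not None for value in (training, validation, test))


# An explicit 0.0 is invisible to the truthiness check but not to the None check.
print(requested_by_truthiness(0.0, None, None))
print(requested_by_none_check(0.0, None, None))

# When nothing is passed, both checks agree that no fraction split was requested.
print(requested_by_truthiness(None, None, None))
print(requested_by_none_check(None, None, None))
```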

I think it's also cleaner to create the default fraction split at the very end instead of sharing state between the fraction_split creation and the timeseries_split creation, i.e.:

    if split_configs_count > 1:
        raise ValueError(
            """Can only specify one of:
                1. training_filter_split, validation_filter_split, test_filter_split OR
                2. predefined_split_column_name OR
                3. timestamp_split_column_name, training_fraction_split, validation_fraction_split, test_fraction_split OR
                4. training_fraction_split, validation_fraction_split, test_fraction_split"""
        )

    # default
    if split_configs_count == 0:
        fraction_split = gca_training_pipeline.FractionSplit(
            training_fraction=0.8,
            validation_fraction=0.1,
            test_fraction=0.1,
        )
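The suggested flow can be sketched end to end as below. This is a self-contained approximation, not the code from this PR: the dataclasses are simplified stand-ins for the `gca_training_pipeline` split messages (the `FilterSplit` and `PredefinedSplit` field names are assumptions), and `resolve_split` is a hypothetical function name. The sketch checks `is not None` rather than truthiness, builds at most one candidate per split type, rejects more than one, and creates the default fraction split only at the very end.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


# Simplified stand-ins for the split messages; field names are assumptions.
@dataclass
class FractionSplit:
    training_fraction: Optional[float]
    validation_fraction: Optional[float]
    test_fraction: Optional[float]


@dataclass
class FilterSplit:
    training_filter: Optional[str]
    validation_filter: Optional[str]
    test_filter: Optional[str]


@dataclass
class PredefinedSplit:
    key: str


@dataclass
class TimestampSplit:
    training_fraction: Optional[float]
    validation_fraction: Optional[float]
    test_fraction: Optional[float]
    key: str


Split = Union[FractionSplit, FilterSplit, PredefinedSplit, TimestampSplit]


def resolve_split(
    training_fraction_split: Optional[float] = None,
    validation_fraction_split: Optional[float] = None,
    test_fraction_split: Optional[float] = None,
    training_filter_split: Optional[str] = None,
    validation_filter_split: Optional[str] = None,
    test_filter_split: Optional[str] = None,
    predefined_split_column_name: Optional[str] = None,
    timestamp_split_column_name: Optional[str] = None,
) -> Split:
    """Hypothetical helper: return exactly one split config or raise ValueError."""
    fractions = (training_fraction_split, validation_fraction_split, test_fraction_split)
    filters = (training_filter_split, validation_filter_split, test_filter_split)

    candidates: List[Split] = []
    if any(value is not None for value in filters):
        candidates.append(FilterSplit(*filters))
    if predefined_split_column_name is not None:
        candidates.append(PredefinedSplit(key=predefined_split_column_name))
    if timestamp_split_column_name is not None:
        # Fractions passed alongside a timestamp column belong to the timestamp split.
        candidates.append(TimestampSplit(*fractions, key=timestamp_split_column_name))
    elif any(value is not None for value in fractions):
        candidates.append(FractionSplit(*fractions))

    if len(candidates) > 1:
        raise ValueError("Can only specify one split configuration.")
    if not candidates:
        # Default is created last, so no state is shared with the branches above.
        return FractionSplit(0.8, 0.1, 0.1)
    return candidates[0]


print(resolve_split())
print(resolve_split(timestamp_split_column_name="ts", training_fraction_split=0.7))
```

Collecting candidates in a list plays the role of `split_configs_count` in the suggestion above while keeping the chosen split and the count in one place.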

sasha-gitg · 2021-07-30T17:13:50Z

google/cloud/aiplatform/training_jobs.py

@@ -1653,14 +1829,35 @@ def run(
            accelerator_count (int):
                The number of accelerators to attach to a worker replica.
            training_fraction_split (float):
-                The fraction of the input data that is to be
-                used to train the Model. This is ignored if Dataset is not provided.
+                Optional. The fraction of the input data that is to be used to train


There should be additional documentation at the top of all public run method docstrings describing the different splits. Right now, it's just the fraction split that has that treatment.

sasha-gitg · 2021-07-30T17:15:07Z

google/cloud/aiplatform/training_jobs.py

+                Optional. The fraction of the input data that is to be used to evaluate
+                the Model. This is ignored if Dataset is not provided.
+            training_filter_split (str):
+                Optional. A filter on DataItems of the Dataset. DataItems that match


It would be worth going back through the protos and ensuring that any additional information that is split specific rather than argument specific is surfaced, like the minus sign usage here: https://github.com/googleapis/python-aiplatform/blob/master/google/cloud/aiplatform_v1/types/training_pipeline.py#L348. This can only be done on public run methods.

Good catch. You can put it below the data fraction splits information in the run method docstring.

tests/unit/aiplatform/test_automl_video_training_jobs.py

sirtorry · 2021-08-07T12:03:13Z

I'll make the final changes sometime over the next few days.

sasha-gitg · 2021-08-18T16:39:20Z

Closing in favor of #627

sirtorry added 2 commits July 15, 2021 21:14

initial commit

5a72c86

add tests

0a5971e

sirtorry added the "do not merge" label Jul 16, 2021

sirtorry self-assigned this Jul 16, 2021

product-auto-label bot added the "api: aiplatform" label Jul 16, 2021

google-cla bot added the "cla: yes" label Jul 16, 2021

sirtorry marked this pull request as draft July 16, 2021 03:34

sirtorry changed the title from "feat: add filter and timestamp splits for custom training" to "feat: add filter and timestamp splits" Jul 16, 2021

fix arrays

4875cd1

sirtorry requested review from ivanmkc, sasha-gitg, morgandu and vinnysenthil July 16, 2021 20:47

sirtorry commented Jul 16, 2021

google/cloud/aiplatform/training_jobs.py (outdated)

fix tests

9315fea

fix test

b04dca6

sirtorry removed the "do not merge" label Jul 17, 2021

sirtorry marked this pull request as ready for review July 17, 2021 00:13

sasha-gitg requested changes Jul 20, 2021

sirtorry added 9 commits July 22, 2021 21:52

change default split behaviour

5b78ee6

modify tests

c0b38f8

fix tests

99aa367

tests for multi-split

2ef9cbd

split counter bug fix

5f8a7b1

fixes multi split issue

d43ef98

bug fixes

a45d837

bug fix

3577e9e

fix non-sync tests

864d0e2

sirtorry added 2 commits July 28, 2021 23:19

change yield to return

e31531f

fix tabular test

20ea593

sirtorry requested a review from sasha-gitg July 29, 2021 21:03

sasha-gitg requested changes Jul 30, 2021

ivanmkc reviewed Aug 2, 2021

tests/unit/aiplatform/test_automl_video_training_jobs.py

ivanmkc reviewed Aug 2, 2021

tests/unit/aiplatform/test_automl_video_training_jobs.py

ivanmkc mentioned this pull request Aug 13, 2021

feat: add filter and timestamp splits #627

Merged

sasha-gitg closed this Aug 18, 2021
