[FEATURE] Splitting data assets into batches using datetime columns in pandas #4982

Merged
39 commits
d66f95c
WIP starting to pull together tests
anthonyburdi Apr 25, 2022
c12676d
Merge branch 'develop' into FEATURE/GREAT-727/GREAT-733/splitting_dat…
anthonyburdi Apr 25, 2022
5bedcec
Remove integration test case that is handled in unit tests.
anthonyburdi Apr 26, 2022
36a724b
Combine removed test into existing test case.
anthonyburdi Apr 26, 2022
0e43d82
Partial stub of end to end test.
anthonyburdi Apr 26, 2022
2d5de06
First pass spark split_on_date_parts, new methods to be DRY between s…
anthonyburdi Apr 26, 2022
3d128d3
Use SparkDataSplitter in SparkDFExecutionEngine, other cleanup.
anthonyburdi Apr 26, 2022
02bcd2c
WIP Spark specific integration tests
anthonyburdi Apr 26, 2022
d31002c
Move all splitting methods to SparkDataSplitter
anthonyburdi Apr 27, 2022
849a8e6
Move shared test cases to separate file, move splitter test to separa…
anthonyburdi Apr 27, 2022
7054d04
Code consolidation into parameterized test since it will be reused fo…
anthonyburdi Apr 27, 2022
1cbd142
Adding unit tests for spark splitters
anthonyburdi Apr 27, 2022
7419a09
Adding unit tests for spark splitters, single and multiple date parts…
anthonyburdi Apr 27, 2022
abeddc7
Unit tests for unsupported batch identifiers in splitter_kwargs (not …
anthonyburdi Apr 27, 2022
51891a0
Use pre-built fixture to speed up creation of 10 trips from each mont…
anthonyburdi Apr 27, 2022
8c977ad
Add missing docstring.
anthonyburdi Apr 27, 2022
c2044fc
Cleanup and add test cases for SparkDataSplitter.date_part access
anthonyburdi Apr 27, 2022
6c31869
Remove spark integration test from test_script_runner.py
anthonyburdi Apr 27, 2022
0d47bf0
Merge branch 'develop' into FEATURE/GREAT-727/GREAT-733/splitting_dat…
anthonyburdi Apr 27, 2022
2b7d182
Merge branch 'develop' into FEATURE/GREAT-727/GREAT-733/splitting_dat…
anthonyburdi Apr 27, 2022
d1253fa
Clarify comment from PR review
anthonyburdi Apr 27, 2022
02785c9
Fix items from PR Review
anthonyburdi Apr 27, 2022
e1a7c15
Add annotations to new dunder methods
anthonyburdi Apr 27, 2022
7713f04
Merge branch 'develop' into FEATURE/GREAT-727/GREAT-733/splitting_dat…
anthonyburdi Apr 27, 2022
457a547
Add docstrings to moved methods.
anthonyburdi Apr 27, 2022
b3d2645
Remove index from fixture
anthonyburdi Apr 27, 2022
0441538
Merge branch 'develop' into FEATURE/GREAT-727/GREAT-733/splitting_dat…
anthonyburdi Apr 27, 2022
3722ca2
Merge branch 'develop' into FEATURE/GREAT-727/GREAT-733/splitting_dat…
anthonyburdi Apr 27, 2022
8be75e5
Typo in docstring
anthonyburdi Apr 27, 2022
526ff57
Use imported DataFrame in test
anthonyburdi Apr 27, 2022
eb50727
Merge branch 'develop' into FEATURE/GREAT-727/GREAT-733/splitting_dat…
anthonyburdi Apr 28, 2022
bd118c4
Initial implementation and some tests.
anthonyburdi Apr 28, 2022
03545c6
Unit and integration tests.
anthonyburdi Apr 28, 2022
dd7d081
Cleanup
anthonyburdi Apr 28, 2022
3e0b899
Update docs
anthonyburdi Apr 28, 2022
8c06815
Merge branch 'develop' into FEATURE/GREAT-727/GREAT-734/splitting_dat…
anthonyburdi Apr 28, 2022
ea44295
Update docstrings
anthonyburdi Apr 28, 2022
5a7e9ba
Merge branch 'develop' into FEATURE/GREAT-727/GREAT-734/splitting_dat…
anthonyburdi Apr 28, 2022
d3f6a91
Add type hints
anthonyburdi Apr 28, 2022
@@ -103,38 +103,32 @@ Finally, confirm the expected number of batches was retrieved and the reduced si

Available `Splitting` methods and their configuration parameters:

+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| **Method** | **Parameters** | **Returned Batch Data** |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| _split_on_whole_table | N/A | identical to original |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| _split_on_column_value | column_name='col', batch_identifiers={ 'col': value } | rows where value of column_name are equal to value specified |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| _split_on_converted_datetime | column_name='col', date_format_string=<'%Y-%m-%d'>, batch_identifiers={ 'col': matching_string } | rows where value of column_name converted to datetime using the given date_format_string are equal to matching string provided for the column_name specified |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| _split_on_divided_integer | column_name='col', divisor=<int>, batch_identifiers={ 'col': matching_divisor } | rows where value of column_name divided (using integral division) by the given divisor are equal to matching_divisor provided for the column_name specified |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| _split_on_mod_integer | column_name='col', mod=<int>, batch_identifiers={ 'col': matching_mod_value } | rows where value of column_name divided (using modular division) by the given mod are equal to matching_mod_value provided for the column_name specified |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| _split_on_multi_column_values | column_names='<list[col]>', batch_identifiers={ 'col_0': value_0, 'col_1': value_1, 'col_2': value_2, ... } | rows where values of column_names are equal to values corresponding to each column name as specified |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| _split_on_hashed_column | column_name='col', hash_digits=<int>, hash_function_name=<'md5'> batch_identifiers={ 'hash_value': value } | rows where value of column_name hashed (using specified hash_function_name) and retaining the stated number of hash_digits are equal to hash_value provided for the column_name specified |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
Note: Splitter methods can be specified with or without a preceding underscore.

| Method | Parameters | Returned Batch Data |
|---------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| split_on_whole_table | N/A | Identical to original |
| split_on_column_value | `column_name='col', batch_identifiers={ 'col': value }` | Rows where value of column_name is equal to the value specified |
| split_on_year | `column_name='col'` | Rows where the year of a datetime column is equal to the specified value |
| split_on_year_and_month | `column_name='col'` | Rows where the year and month of a datetime column are equal to the specified value |
| split_on_year_and_month_and_day | `column_name='col'` | Rows where the year, month and day of a datetime column are equal to the specified value |
| split_on_date_parts | `column_name='col', date_parts='<list[DatePart]>'` | Rows where the date parts of a datetime column are equal to the specified value. Date parts can be specified as DatePart objects or as their string equivalent e.g. "year", "month", "week", "day", "hour", "minute", or "second" |
| split_on_divided_integer | `column_name='col', divisor=<int>, batch_identifiers={ 'col': matching_divisor }` | Rows where the value of column_name divided (using integer division) by the given divisor is equal to the matching_divisor provided for the specified column_name |
| split_on_mod_integer | `column_name='col', mod=<int>, batch_identifiers={ 'col': matching_mod_value }` | Rows where the value of column_name modulo the given mod is equal to the matching_mod_value provided for the specified column_name |
| split_on_multi_column_values | `column_names='<list[col]>', batch_identifiers={ 'col_0': value_0, 'col_1': value_1, 'col_2': value_2, ... }` | Rows where the values of column_names are equal to the values specified for each column name |
| split_on_converted_datetime | `column_name='col', date_format_string=<'%Y-%m-%d'>, batch_identifiers={ 'col': matching_string }` | Rows where the value of column_name, converted to datetime and formatted using the given date_format_string, is equal to the matching_string provided for the specified column_name |
| split_on_hashed_column | `column_name='col', hash_digits=<int>, hash_function_name=<'md5'>, batch_identifiers={ 'hash_value': value }` | Rows where the value of column_name, hashed (using the specified hash_function_name) and truncated to the stated number of hash_digits, is equal to the hash_value provided for the specified column_name |
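
As a rough illustration of what the date-based splitters in the table above select, here is a minimal pandas sketch. It is not the Great Expectations implementation: `filter_on_date_parts` is a hypothetical helper, the nested shape of `batch_identifiers` and the sample column names are assumptions, and `"week"` is left out because in recent pandas it requires `Series.dt.isocalendar()` rather than a plain `.dt` attribute.

```python
import pandas as pd


def filter_on_date_parts(df, column_name, date_parts, batch_identifiers):
    """Hypothetical helper: keep rows whose datetime column matches every
    requested date part (for example "year" and "month")."""
    # Assumed shape: {column_name: {"year": 2018, "month": 1}}
    wanted = batch_identifiers[column_name]
    mask = pd.Series(True, index=df.index)
    for part in date_parts:
        # Series.dt exposes year, month, day, hour, minute and second
        mask &= getattr(df[column_name].dt, part) == wanted[part]
    return df[mask]


# Illustrative data only; the column names are made up for this sketch
df = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(
            ["2018-01-15 08:30:00", "2018-02-01 09:00:00", "2019-01-20 10:00:00"]
        ),
        "fare": [10.0, 12.5, 8.0],
    }
)

batch = filter_on_date_parts(
    df,
    column_name="pickup_datetime",
    date_parts=["year", "month"],
    batch_identifiers={"pickup_datetime": {"year": 2018, "month": 1}},
)
```

Under these assumptions, a splitter such as `split_on_year_and_month` would correspond to calling the helper with `date_parts=["year", "month"]`, and `split_on_year` to `date_parts=["year"]`.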


Available `Sampling` methods and their configuration parameters:

+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| **Method** | **Parameters** | **Returned Batch Data** |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| _sample_using_random | p=fraction | rows selected at random, whose number amounts to selected fraction of total number of rows in batch |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| _sample_using_mod | column_name='col', mod=<int> | take the mod of named column, and only keep rows that match the given value |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| _sample_using_a_list | column_name='col', value_list=<list[val]> | match the values in the named column against value_list, and only keep the matches |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| _sample_using_hash | column_name='col', hash_digits=<int>, hash_value=<str>, hash_function_name=<'md5'> | hash the values in the named column (using specified hash_function_name), and only keep rows that match the given hash_value |
+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
| Method | Parameters | Returned Batch Data |
|----------------------|--------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|
| _sample_using_random | `p=fraction` | Rows selected at random, whose number amounts to selected fraction of total number of rows in batch |
| _sample_using_mod | `column_name='col', mod=<int>` | Take the mod of named column, and only keep rows that match the given value |
| _sample_using_a_list | `column_name='col', value_list=<list[val]>` | Match the values in the named column against value_list, and only keep the matches |
| _sample_using_hash | `column_name='col', hash_digits=<int>, hash_value=<str>, hash_function_name=<'md5'>` | Hash the values in the named column (using the specified hash_function_name), and only keep rows that match the given hash_value |
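
The row filters described for the sampling methods can be sketched in plain pandas as follows. These functions are illustrative stand-ins, not the library's code. Two details are assumptions: the `value` argument of `sample_using_mod` (the table mentions "the given value" without listing it as a parameter) and the choice to compare the last `hash_digits` characters of the hex digest.

```python
import hashlib

import pandas as pd


def sample_using_mod(df, column_name, mod, value):
    """Keep rows where column_name % mod equals value (value is assumed)."""
    return df[df[column_name] % mod == value]


def sample_using_a_list(df, column_name, value_list):
    """Keep rows whose column_name value appears in value_list."""
    return df[df[column_name].isin(value_list)]


def sample_using_hash(df, column_name, hash_digits, hash_value,
                      hash_function_name="md5"):
    """Keep rows whose hashed value ends with hash_value.

    Assumption: the trailing hash_digits characters of the hex digest
    are the ones compared against hash_value.
    """
    hash_func = getattr(hashlib, hash_function_name)
    suffixes = df[column_name].map(
        lambda v: hash_func(str(v).encode("utf-8")).hexdigest()[-hash_digits:]
    )
    return df[suffixes == hash_value]


# Illustrative data only
df = pd.DataFrame({"id": range(10)})

every_fifth = sample_using_mod(df, "id", mod=5, value=0)
picked = sample_using_a_list(df, "id", value_list=[1, 3])
hashed = sample_using_hash(df, "id", hash_digits=1, hash_value="0")
```

Because the hash of a value is deterministic, `sample_using_hash` returns the same rows on every run, which is the main reason to prefer it over a random sample when reproducibility matters.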



To view the full script used in this page, see it on GitHub: