Feature interval identifier #2

mansenfranzen · 2019-03-15T12:16:00Z

Add interface and pandas implementations for IntervalIdentifier wrangler:

An interval is defined as a range of values beginning with an opening
marker and ending with a closing marker (e.g. the interval daylight may be
defined as all events/values occurring between sunrise and sunset).

The interval identification wrangler assigns ids to values such that values
belonging to the same interval share the same interval id. For example, all
values of the first daylight interval are assigned id 1, all values of the
second daylight interval are assigned id 2, and so on.
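The id assignment described above can be illustrated with a minimal sketch in plain Python. It is not the implementation from this PR: the function name `identify_intervals` is hypothetical, and the sketch assumes that values outside any closed interval get id 0, that an interval which is never closed also gets id 0, and that a repeated opening marker restarts the interval.

```python
def identify_intervals(markers, start, end):
    """Return one interval id per value; 0 means "not inside a closed interval"."""
    markers = list(markers)
    ids = [0] * len(markers)
    next_id = 1
    open_at = None  # position of the currently open start marker, if any

    for pos, marker in enumerate(markers):
        if marker == start:
            # Assumption: a repeated start marker restarts the interval.
            open_at = pos
        elif marker == end and open_at is not None:
            # The interval is closed: label everything from start to end.
            for i in range(open_at, pos + 1):
                ids[i] = next_id
            next_id += 1
            open_at = None

    return ids


events = ["sunrise", "noon", "sunset", "midnight", "sunrise", "noon", "sunset"]
print(identify_intervals(events, start="sunrise", end="sunset"))
# → [1, 1, 1, 0, 2, 2, 2]
```

Because the function accepts any iterable, a pandas column could be passed in directly and the result assigned back as a new column.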

TODO:

Update changelog
Add cumsum based implementation
Unify start/end marker notation across tests and implementations
Remove flake8 exclude for test folders
Open issue to address possible performance increases with numba.jit and pure numpy
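For the cumsum item in the list above, the general idea can be sketched as follows. This is an illustration only, not the `VectorizedCumSum` code added in this PR: the function name `identify_intervals_cumsum` is hypothetical, and the sketch assumes well-formed input in which every start marker is closed by an end marker before the next start marker appears. An interval left open at the end would still receive an id here.

```python
import pandas as pd


def identify_intervals_cumsum(markers: pd.Series, start, end) -> pd.Series:
    """Vectorized interval ids for well-formed marker sequences (assumption)."""
    is_start = (markers == start).astype(int)
    is_end = (markers == end).astype(int)

    # Running count of start markers seen so far: the candidate interval id.
    candidate = is_start.cumsum()
    # Number of end markers seen strictly before each row.
    ends_before = is_end.cumsum() - is_end
    # A row lies inside an interval while more starts than ends precede it.
    inside = candidate > ends_before

    # Rows outside any interval get id 0.
    return candidate.where(inside, 0)


events = pd.Series(
    ["sunrise", "noon", "sunset", "midnight", "sunrise", "noon", "sunset"]
)
print(identify_intervals_cumsum(events, "sunrise", "sunset").tolist())
# → [1, 1, 1, 0, 2, 2, 2]
```

The appeal of this approach is that it avoids a Python-level loop over rows; handling unclosed or repeated markers would need extra masking on top of it.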

caused wrong result when begin marker was left open and was not closed. Reorder sequence of test parameters for `test_pandas_interval_identifier` to have the wrangler first and the test case second, which increases test output readability.

not have wheels on PyPI because building from source is time consuming and would even require additional Travis CI configuration. As a trade-off, combinations of old pandas versions with newer Python versions are dropped in favor of maintainability and speed with Travis CI.

convenient functions taking care of optional parameters. Add `PandasSingleNoFit` subclass which can be used as a default for wranglers that require no fitting and take only a single dataframe as input and output. Update types to be used from `util.types`.

mansenfranzen added 30 commits March 4, 2019 10:36

Minor rst adjustment.

4f75af2

Add util.sanitizer module including ensure_tuples helper function.

d32dde8

Add BaseIntervalIdentifier. Allow parameter parsing/conversion in `__init__` method.

12ea727

Handle None correctly.

f7de85b

Move IntervalIdentifier into separate interfaces module.

f320858

Move tests for IntervalIdentifier into separate test module.

38cbcf9

Add PandasWrangler providing methods common to all pandas based wranglers.

37a6826

Add test data for interval identifier wrangler.

dcee237

Add NaiveIterator for interval identification for pandas engine for testing purposes.

f65cf3f

Exclude test modules with imported pytest fixtures because flake8 will always report F401 and F811 due to how pytest fixtures work. Putting all tests in `conftest.py` is not possible because all wrangler tests would finally be located there which would soon run into namespace issues.

4373963

Refactor tests. Add single interval and spanning interval tests.

8027309

Add target_column_name to IntervalIdentifier interface.

dd29360

Add actual implementation for NaiveIterator.

8fdf640

Refactor tests to use pytest.mark.parametrize instead of fixture parametrization to run all test cases against all available algorithms. In addition, tests include different marker types (string, int, float) and shuffling.

5827364

Fix incorrect type annotation.

5c11582

Add note for groupby columns that they should reference partition keys for distributed computation engines. Clarify shortest valid interval more precisely.

78c500a

Add type hints.

fb0cd30

Refine is_valid_begin and is_invalid_begin doc strings.

6f5e592

Add groupby tests for IntervalIdentifier.

7113aaa

Add tests for multiple order/groupby columns, invalid start and end markers for `IntervalIdentifier`. Rearrange and label test input for better readability.

67c6935

Refine doc string for NaiveIterator.

2319933

Rename sort_order to ascending and change string parameter type to bool to comply with pandas and spark convention.

7d8aa28

Fix bug which caused wrong result when begin marker was left open and was not closed.

54e93eb

Add VectorizedCumSum implementation for interval identification in pandas.

b91e93b

Activate all pandas versions.

dc932a9

Add pandas versions to tox dependencies.

6427276

Add pandas versions to tox dependencies.

183bfea

Test excluding specific environments in TravisCI to remove builds which do not have wheels.

86ca01b

Simplify doc strings. Unify naming of start and end markers across tests and implementations for interval identifier.

a16de5f

mansenfranzen added the new feature label Mar 15, 2019

mansenfranzen self-assigned this Mar 15, 2019

mansenfranzen added 10 commits March 15, 2019 13:22

Remove exclude in flake8 config which is no longer necessary due to dropping import of pytest fixtures.

9655afe

Update changelog in regard to additions of IntervalIdentifier.

f12ff63

Test against all versions.

263aa84

Exclude python 3.7 and pandas 0.22.0 due to missing wheels.

acf4f07

Move commonly used types into separate module.

8e0800b

Remove redundant type Any and use types from util.types.

28d3bdc

Refactor via removing duplicated code while adding `BaseIntervalIdentifier`. Add input and output validation.

1ac20e7

Streamline naming convention for wrangler and add tests for unused `order_columns` and `groupby_columns` parameters.

ccc65e1

Mark _BaseIntervalIdentifier as private and add NotImplementedError for `_transform`.

fc94238

mansenfranzen merged commit e4aad01 into master Mar 16, 2019

mansenfranzen deleted the feature_interval_identfier branch March 16, 2019 15:44
