Feature interval identfier #2

Merged
merged 41 commits into from Mar 16, 2019

mansenfranzen
Owner

@mansenfranzen mansenfranzen commented Mar 15, 2019

Add interface and pandas implementations for IntervalIdentifier wrangler:

An interval is defined as a range of values beginning with an opening
marker and ending with a closing marker (e.g. the interval daylight may be
defined as all events/values occurring between sunrise and sunset).

The interval identification wrangler assigns ids to values such that values
belonging to the same interval share the same interval id. For example, all
values of the first daylight interval are assigned id 1, all values of
the second daylight interval are assigned id 2, and so on.
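
To make the id assignment concrete, here is a minimal pure-Python sketch. It is not the pandas implementation added in this PR. The function name `identify_intervals` and the convention that values outside a closed interval receive id 0 are assumptions made for illustration only.

```python
def identify_intervals(markers, start, end):
    """Assign increasing ids (1, 2, ...) to values between start and end markers.

    Assumption for this sketch: values outside any closed interval get id 0,
    and an opening marker that is never closed does not form an interval.
    """
    ids = [0] * len(markers)
    open_at = None  # position of the most recent unclosed opening marker
    next_id = 1

    for pos, marker in enumerate(markers):
        if marker == start:
            # A repeated opening marker restarts the candidate interval.
            open_at = pos
        elif marker == end and open_at is not None:
            # The interval is closed: label every value from start to end.
            for i in range(open_at, pos + 1):
                ids[i] = next_id
            next_id += 1
            open_at = None

    return ids


events = ["night", "sunrise", "noon", "sunset",
          "night", "sunrise", "sunset", "sunrise"]
print(identify_intervals(events, "sunrise", "sunset"))
# [0, 1, 1, 1, 0, 2, 2, 0]
```

The trailing `sunrise` stays at 0 because no `sunset` follows it.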

TODO:

  • Update changelog
  • Add cumsum based implementation
  • Unify start/end marker notation across tests and implementations
  • Remove flake8 exclude for test folders
  • Open issue to address possible performance increases with numba.jit and pure numpy
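
For the cumsum item above, the idea can be sketched as follows. This is only an illustration of the general technique, written with `itertools.accumulate` as a stand-in for a vectorized cumulative sum, and it is not the implementation planned for this repository. The name `identify_intervals_cumsum` is hypothetical, and the sketch assumes well-formed input in which every opening marker is closed before the next one appears.

```python
from itertools import accumulate


def identify_intervals_cumsum(markers, start, end):
    """Cumulative-sum sketch; assumes every opening marker is properly closed."""
    # Running count of opening markers seen so far (this becomes the interval id).
    started = list(accumulate(int(m == start) for m in markers))
    # Running count of closing markers seen so far.
    ended = list(accumulate(int(m == end) for m in markers))
    # Shift by one position so that the closing marker itself still counts
    # as part of its interval.
    ended_before = [0] + ended[:-1]
    # A value lies inside an interval while more intervals were opened than closed.
    return [s if s > e else 0 for s, e in zip(started, ended_before)]


events = ["night", "sunrise", "noon", "sunset", "night", "sunrise", "sunset"]
print(identify_intervals_cumsum(events, "sunrise", "sunset"))
# [0, 1, 1, 1, 0, 2, 2]
```

Unlike an iterative approach, this version would also label an opening marker that is never closed, so unclosed intervals would need separate handling.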

always report F401 and F811 due to how pytest fixtures work. Putting all
tests in `conftest.py` is not possible because all wrangler tests would
end up there, which would soon run into namespace issues.

parametrization to run all test cases against all available algorithms.
In addition, tests include different marker types (string, int, float)
and shuffling.

for distributed computation engines. Clarify shortest valid interval
more precisely.

markers for `IntervalIdentifier`. Rearrange and label test input for
better readability.

bool to comply with pandas and Spark convention.

caused wrong result when begin marker was left open and was not closed.

Reorder sequence of test parameters for
`test_pandas_interval_identifier` to have wrangler first and test case
second, which increases test output readability.

not have wheels on PyPI because building from source is time consuming
and would even require additional Travis CI configuration. As a
trade-off, combinations of old pandas versions with newer Python versions
are dropped in favor of maintainability and speed with Travis CI.

@mansenfranzen mansenfranzen added the new feature Add a new feature label Mar 15, 2019
@mansenfranzen mansenfranzen self-assigned this Mar 15, 2019

convenient functions taking care of optional parameters. Add
`PandasSingleNoFit` subclass which can be used as a default for
wranglers without fitting required and only a single dataframe as input
and output. Update types to be used from `util.types`.

`BaseIntervalIdentifier`. Add input and output validation.

`order_columns` and `groupby_columns` parameters.

@mansenfranzen mansenfranzen merged commit e4aad01 into master Mar 16, 2019
@mansenfranzen mansenfranzen deleted the feature_interval_identfier branch March 16, 2019 15:44