Add log names and types #397

david26694 · 2020-07-28T11:52:36Z

I feel that by copying the log_step function many times and slightly changing the logging section I'm repeating a lot of code. Do you have any suggestion to avoid this?

This is WIP, some TODOs:

rename log_step -> log_shape.
tests, docs.
log_dtypes

pim-hoeven · 2020-07-28T14:27:34Z

One solution might be to make log_names an argument of log_step:

Something like

def log_step(func=None, *, log_names=False, level=logging.INFO):
    ...
    names_str = f"columns = {X.columns.tolist()} " if log_names else ""
    logger.log(
            level,
            f"[{func.__name__}(df{func_args_str})] "
            names_str  # Will be an empty string if log_names is False
            f"time={time_taken}",
    )

That could easily be extended to include log_time, log_dtypes, log_shape, etc.

pim-hoeven · 2020-07-28T14:32:55Z

We could even extend this more such that any arbitrary function that takes a df and produces a string could be used. For example if I want to count the number of cats in the column animal:

def cat_counter(df):
    return str(df[df["animal"]=="cat"].sum())

def log_step(func=None, *, extra_log_func=None, level=logging.INFO):
    ...
    extra_string = extra_log_func(result) if extra_log_func else ""
    logger.log(
            level,
            f"[{func.__name__}(df{func_args_str})] "
            extra_string# Will be an empty string if extra_log_func is None
            f"time={time_taken}",
    )

@log_step(extra_log_func=cat_counter)
def do_something(df):
    ...

koaning · 2020-07-29T08:05:19Z

One thing to be mindful of: pandas just had a new release and I think this is causing some of the old tests to fail. Will have a look today but it's good to be aware.

koaning · 2020-07-29T08:38:25Z

It turns out there's one failing test due to numpy, it shouldn't influence this PR.

into pandas-pipes

david26694 · 2020-07-29T10:02:43Z

I was working towards the direction indicated by @pim-hoeven, I didn't find a trivial way to create parametrize decorators. Finally went for the first answer of this question on stackoverflow. I've renamed the log_step to log_step_factory, I am not very convinced about the change though.

Also, tests are failing but they are also failing on other PRs, but I don't understand the reason.

koaning · 2020-07-29T10:27:58Z

The tests were failing because of a breaking pytest update. I've changed the settings in CI and we're back to the old version. I just merged the changes back into this branch.

pim-hoeven · 2020-07-30T10:45:27Z

I don't know how @koaning feels about this, so please don't implement this before his approval, but I think it would be really nice to have some standard options that we can switch on or off as arguments and an extra option where the user can supply their own logging function. This would keep usage simple and clean (we only need a single decorator per function).

def log_step(func=None, *, time_taken=True, shape=True, names=False, dtypes=False, extra=None, log_dtypes=False, level=logging.INFO):
    ...
    names_str = f"columns = {X.columns.tolist()} " if log_names else ""
    logger.log(
            level,
            f"[{func.__name__}(df{func_args_str})] "
            names_str  # Will be an empty string if log_names is False
            f"time={time_taken}",
    )

Where `log_extra` are user defined log functions. Would be nice if we could have multiple there as well.

koaning · 2020-07-31T08:23:47Z

I'd like to get @MBrouns in on this one but when I think about one of the big life lessons of api design:

Keep the complicated things complicated if that means the simple things remain simple.

Then I might propose a slight edit (again, would appreciate @MBrouns' opinion here).

Maybe we can generalize into two functions.

def log_step(func, *, time_taken=True, shape=True, names=False, dtypes=False, level=logging.INFO):
    ...

def log_step_custom(func, extra, level=logging.INFO):
    ...

I'm assuming that most people won't need the custom part. And if they need it, it most likely
won't be for each individual step. We could therefore keep the log_step simpler by moving the
responsibility of the custom stuff to another decorator.

Another part of the thinking here is that log_step should maybe not be the most "general" in this
scenario rather the most "common" scenario.

koaning · 2020-08-08T14:32:54Z

@MBrouns (who I think is now back from holiday) got an opinion on this? After mulling it over I think this might be what is best;

def log_step(func, *, time_taken=True, shape=True, names=False, dtypes=False, level=logging.INFO):
    ...

def log_step_custom(func, level=logging.INFO, **kwargs):
    ...

The idea with **kwargs here is that you might be able to do something like.

@log_step_custom(n_user=lambda d: d['uid'].nunique(), n_sess=lambda d: d['session'].nunique())
def remove_outliers(dataf, max_days_per_user=100, max_rows_per_session=100):
    ...

This way you also have the name (the key) as well as what to log (the lambda function). You can then have a logline like;

INFO remove_outliers n_user=12314 n_sess=1245322

koaning · 2020-09-07T15:30:10Z

It's been a while (I got distracted) but I just want to check @david26694 are you still interested in working on this feature? I have a bit more bandwith now and would like to get this PR in :)

david26694 · 2020-09-12T10:09:15Z

Hi, sorry for the delay, I'm coming back from holidays. I'm a bit busier now, so if you want to get it in soon it might be better for somebody else to work on it.

pim-hoeven · 2020-09-14T14:53:10Z

@koaning I would like to work on this coming Wednesday or Thursday

koaning · 2020-09-14T18:57:33Z

@pim-hoeven cool! Is it possible for you to start a new branch from the branch from this branch made by @david26694. There's been some appreciated effort on his part and that way his name also makes it to the contributor list.

david26694 · 2020-09-19T10:47:41Z

aw, thanks for that!

koaning · 2020-09-24T17:12:34Z

This got fixed in another PR.

Add log names

08df0a1

david26694 added 4 commits July 29, 2020 09:26

Split logging functions

1bb38d9

type -> dtype

57b53e9

Original lags formatting

fa373cf

Change testing function

2571482

Merge branch 'master' into pandas-pipes

d21c81b

david26694 added 2 commits July 29, 2020 11:55

Parametrize log step, add tests

2eb4fa0

Merge branch 'pandas-pipes' of https://github.com/david26694/scikit-lego

03eea24

into pandas-pipes

david26694 changed the title ~~[WIP] Add log names and types~~ Add log names and types Jul 29, 2020

Merge branch 'master' into pandas-pipes

0c34bbe

Correct format

4082edf

Merge branch 'master' into pandas-pipes

418a503

pim-hoeven mentioned this pull request Aug 13, 2020

[FEATURE] Log shape delta #407

Closed

3 tasks

Merge branch 'master' into pandas-pipes

a69ba76

pim-hoeven mentioned this pull request Sep 17, 2020

Feature/improve logging #411

Merged

koaning closed this Sep 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add log names and types #397

Add log names and types #397

david26694 commented Jul 28, 2020

pim-hoeven commented Jul 28, 2020 •

edited

pim-hoeven commented Jul 28, 2020

koaning commented Jul 29, 2020

koaning commented Jul 29, 2020

david26694 commented Jul 29, 2020

koaning commented Jul 29, 2020

pim-hoeven commented Jul 30, 2020

koaning commented Jul 31, 2020

koaning commented Aug 8, 2020

koaning commented Sep 7, 2020

david26694 commented Sep 12, 2020

pim-hoeven commented Sep 14, 2020

koaning commented Sep 14, 2020 •

edited

david26694 commented Sep 19, 2020

koaning commented Sep 24, 2020

Add log names and types #397

Add log names and types #397

Conversation

david26694 commented Jul 28, 2020

pim-hoeven commented Jul 28, 2020 • edited

pim-hoeven commented Jul 28, 2020

koaning commented Jul 29, 2020

koaning commented Jul 29, 2020

david26694 commented Jul 29, 2020

koaning commented Jul 29, 2020

pim-hoeven commented Jul 30, 2020

koaning commented Jul 31, 2020

koaning commented Aug 8, 2020

koaning commented Sep 7, 2020

david26694 commented Sep 12, 2020

pim-hoeven commented Sep 14, 2020

koaning commented Sep 14, 2020 • edited

david26694 commented Sep 19, 2020

koaning commented Sep 24, 2020

pim-hoeven commented Jul 28, 2020 •

edited

koaning commented Sep 14, 2020 •

edited