Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add log names and types #397

Closed
wants to merge 12 commits into from
Closed

Add log names and types #397

wants to merge 12 commits into from

Conversation

david26694
Copy link
Contributor

I feel that by copying the log_step function many times and slightly changing the logging section I'm repeating a lot of code. Do you have any suggestion to avoid this?

This is WIP, some TODOs:

  • rename log_step -> log_shape.
  • tests, docs.
  • log_dtypes

@pim-hoeven
Copy link
Contributor

pim-hoeven commented Jul 28, 2020

One solution might be to make log_names an argument of log_step:

Something like

def log_step(func=None, *, log_names=False, level=logging.INFO):
    ...
    names_str = f"columns = {X.columns.tolist()} " if log_names else ""
    logger.log(
            level,
            f"[{func.__name__}(df{func_args_str})] "
            names_str  # Will be an empty string if log_names is False
            f"time={time_taken}",
    )

That could easily be extended to include log_time, log_dtypes, log_shape, etc.

@pim-hoeven
Copy link
Contributor

We could even extend this more such that any arbitrary function that takes a df and produces a string could be used. For example if I want to count the number of cats in the column animal:

def cat_counter(df):
    return str(df[df["animal"]=="cat"].sum())

def log_step(func=None, *, extra_log_func=None, level=logging.INFO):
    ...
    extra_string = extra_log_func(result) if extra_log_func else ""
    logger.log(
            level,
            f"[{func.__name__}(df{func_args_str})] "
            extra_string# Will be an empty string if extra_log_func is None
            f"time={time_taken}",
    )

@log_step(extra_log_func=cat_counter)
def do_something(df):
    ...

@koaning
Copy link
Owner

koaning commented Jul 29, 2020

One thing to be mindful of: pandas just had a new release and I think this is causing some of the old tests to fail. Will have a look today but it's good to be aware.

@koaning
Copy link
Owner

koaning commented Jul 29, 2020

It turns out there's one failing test due to numpy, it shouldn't influence this PR.

@david26694
Copy link
Contributor Author

I was working towards the direction indicated by @pim-hoeven, I didn't find a trivial way to create parametrize decorators. Finally went for the first answer of this question on stackoverflow. I've renamed the log_step to log_step_factory, I am not very convinced about the change though.

Also, tests are failing but they are also failing on other PRs, but I don't understand the reason.

@david26694 david26694 changed the title [WIP] Add log names and types Add log names and types Jul 29, 2020
@koaning
Copy link
Owner

koaning commented Jul 29, 2020

The tests were failing because of a breaking pytest update. I've changed the settings in CI and we're back to the old version. I just merged the changes back into this branch.

@pim-hoeven
Copy link
Contributor

I don't know how @koaning feels about this, so please don't implement this before his approval, but I think it would be really nice to have some standard options that we can switch on or off as arguments and an extra option where the user can supply their own logging function. This would keep usage simple and clean (we only need a single decorator per function).

def log_step(func=None, *, time_taken=True, shape=True, names=False, dtypes=False, extra=None, log_dtypes=False, level=logging.INFO):
    ...
    names_str = f"columns = {X.columns.tolist()} " if log_names else ""
    logger.log(
            level,
            f"[{func.__name__}(df{func_args_str})] "
            names_str  # Will be an empty string if log_names is False
            f"time={time_taken}",
    )

Where `log_extra` are user defined log functions. Would be nice if we could have multiple there as well.
    

@koaning
Copy link
Owner

koaning commented Jul 31, 2020

I'd like to get @MBrouns in on this one but when I think about one of the big life lessons of api design:

Keep the complicated things complicated if that means the simple things remain simple.

Then I might propose a slight edit (again, would appreciate @MBrouns' opinion here).

Maybe we can generalize into two functions.

def log_step(func, *, time_taken=True, shape=True, names=False, dtypes=False, level=logging.INFO):
    ...

def log_step_custom(func, extra, level=logging.INFO):
    ...

I'm assuming that most people won't need the custom part. And if they need it, it most likely
won't be for each individual step. We could therefore keep the log_step simpler by moving the
responsibility of the custom stuff to another decorator.

Another part of the thinking here is that log_step should maybe not be the most "general" in this
scenario rather the most "common" scenario.

@koaning
Copy link
Owner

koaning commented Aug 8, 2020

@MBrouns (who I think is now back from holiday) got an opinion on this? After mulling it over I think this might be what is best;

def log_step(func, *, time_taken=True, shape=True, names=False, dtypes=False, level=logging.INFO):
    ...

def log_step_custom(func, level=logging.INFO, **kwargs):
    ...

The idea with **kwargs here is that you might be able to do something like.

@log_step_custom(n_user=lambda d: d['uid'].nunique(), n_sess=lambda d: d['session'].nunique())
def remove_outliers(dataf, max_days_per_user=100, max_rows_per_session=100):
    ...

This way you also have the name (the key) as well as what to log (the lambda function). You can then have a logline like;

INFO remove_outliers n_user=12314 n_sess=1245322

@pim-hoeven pim-hoeven mentioned this pull request Aug 13, 2020
3 tasks
@koaning
Copy link
Owner

koaning commented Sep 7, 2020

It's been a while (I got distracted) but I just want to check @david26694 are you still interested in working on this feature? I have a bit more bandwith now and would like to get this PR in :)

@david26694
Copy link
Contributor Author

Hi, sorry for the delay, I'm coming back from holidays. I'm a bit busier now, so if you want to get it in soon it might be better for somebody else to work on it.

@pim-hoeven
Copy link
Contributor

@koaning I would like to work on this coming Wednesday or Thursday

@koaning
Copy link
Owner

koaning commented Sep 14, 2020

@pim-hoeven cool! Is it possible for you to start a new branch from the branch from this branch made by @david26694. There's been some appreciated effort on his part and that way his name also makes it to the contributor list.

@david26694
Copy link
Contributor Author

aw, thanks for that!

@koaning
Copy link
Owner

koaning commented Sep 24, 2020

This got fixed in another PR.

@koaning koaning closed this Sep 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants