Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] allow for all kwargs when using @log_step #552

Open
sephib opened this issue Dec 18, 2022 · 7 comments
Open

[FEATURE] allow for all kwargs when using @log_step #552

sephib opened this issue Dec 18, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@sephib
Copy link

sephib commented Dec 18, 2022

Hi,

When using @log_step in debugging a Pandas Pipeline, the current function must accept a single argument of df:pd.Dataframe.

However if the user sends all the parameters as kwargs there is an error .

It would be useful if the @log_step will check the first kwargs and if it is a pd.Dataframe then it will convert it into an arg - possible implementation before running the def wrapper() as follows

    _kwargs = {**kwargs}
    first_arg= next(iter(_kwargs))
    if isinstance(_kwargs[first_arg],pd.DataFrame) and len(args)==0:
        args=args+(_kwargs.pop[first_arg],)
@sephib sephib added the enhancement New feature or request label Dec 18, 2022
@MBrouns
Copy link
Collaborator

MBrouns commented Dec 18, 2022

I took a brief look, but the only thing I could imagine failing is the shape_delta when its set to true. Is that indeed what's happening?

@sephib
Copy link
Author

sephib commented Dec 18, 2022

yes

@sephib
Copy link
Author

sephib commented Dec 18, 2022

this currently is what i've implements

        if shape_delta:
            if len(args)==0 and len(kwargs)>0:
                _kwargs = {**kwargs}
                first_arg= next(iter(_kwargs))
                if isinstance(_kwargs[first_arg],pd.DataFrame):
                    args=args+(_kwargs.pop[first_arg],)
            old_shape = args[0].shape
        tic = dt.datetime.now()

@sephib
Copy link
Author

sephib commented Dec 19, 2022

If you think this is a good direction I'll be happy to submit a PR

@MBrouns
Copy link
Collaborator

MBrouns commented Dec 19, 2022

I'm not sure if changing args directly is a great way to solve this because we use it later in result = func(*args, **kwargs)

Maybe something like this?

if shape_delta:
    if len(args) > 0:
        old_shape = args[0].shape
    else: 
        for v in kwargs.values:
              if isinstance(v, pd.DataFrame):
                  old_shape = v.shape
                  break
         else:
               raise ValueError("shape_delta was set to true, but no DataFrame instance was found as either the first non-keyword argument or in the keyword arguments")

@sephib
Copy link
Author

sephib commented Dec 25, 2022

@MBrouns should I provide a PR?

@FBruzzesi
Copy link
Collaborator

FBruzzesi commented Sep 25, 2023

It may be an overkill, but a way to do it is to allow to specify the index or name of the argument that is the dataframe to track and then use inspect.signature.

Example assuming the new argument is called arg_to_track:

if shape_delta:
    func_args = (
        inspect.signature(func).bind(*args, **kwargs).arguments  # type: ignore
    )
    if isinstance(arg_to_track, int) and arg_to_track >= 0:
        _, input_frame = tuple(func_args.items())[arg_to_track]
    elif isinstance(arg_to_track, str):
        input_frame = func_args[arg_to_track]
    else:
        raise ValueError("arg_to_track should be a string or a positive integer")
    
    old_shape = input_frame .shape

...
result = func(*args, **kwargs)
...

Remark that both getters could result in errors if

  • int is out of range
  • string is not an argument name

Edit: commenting and expanding on this approach
From one side this implementation would be as flexible as possible without any misinterpretation (*); on the other hand, it moves responsibility to the user.

(*) The snippet

for v in kwargs.values:
    if isinstance(v, pd.DataFrame):
        old_shape = v.shape
        break

is assuming the the first dataframe passed is the one we want to track, but it could be terribly wrong if a function takes more than one dataframe as input (e.g. if one wants to do a merge inside it)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants