ENH: Make DataFrame.insert more flexible #34365

Open
topper-123 opened this issue May 25, 2020 · 4 comments
Labels: Enhancement, Needs Discussion

Comments

@topper-123
Contributor

topper-123 commented May 25, 2020

DataFrame.insert is very inflexible today and cannot really be used in pipes. I'd like to make several changes to the method:

  • add an inplace parameter, so the result of the operation can be returned.
  • allow a callable for value, so the new values can be computed from existing column values.
  • allow a dict for value, so that several columns can be inserted simultaneously.
  • add insert_after and insert_before parameters to allow label-based insertion location.
  • move loc to after value and give it a default value of None, i.e. insert the new column at the end of the frame if no location is specified.
  • deprecate DataFrame.assign. Its functionality would be covered by the changed DataFrame.insert method.

The above change would allow us to be quite flexible when creating new columns in pipes. For example we could do

(
    df.insert("formal_name", lambda x: "Mr. " + x["last_name"], insert_after="last_name")
    .pipe(...)
)

The above is obviously quite a large change. It could be discussed whether it would be better to make it a new method instead of changing an existing one...
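
The proposal above can be approximated with today's API. The following is only a minimal sketch: `insert_column` and its `after`/`before` parameters are hypothetical names chosen for illustration, not part of pandas, and the sketch assumes column labels are unique so that `Index.get_loc` returns an integer.

```python
import pandas as pd


def insert_column(df, name, value, after=None, before=None):
    """Hypothetical helper (not a pandas API): return a copy of df with one new column."""
    if after is not None and before is not None:
        raise ValueError("pass at most one of 'after' and 'before'")
    out = df.copy()
    # Evaluate a callable against the frame, similar to how DataFrame.assign treats callables.
    if callable(value):
        value = value(out)
    if after is not None:
        loc = out.columns.get_loc(after) + 1  # assumes a unique label
    elif before is not None:
        loc = out.columns.get_loc(before)  # assumes a unique label
    else:
        loc = len(out.columns)  # default: append at the end
    out.insert(loc, name, value)
    return out


df = pd.DataFrame({"first_name": ["Ada"], "last_name": ["Lovelace"], "born": [1815]})
result = df.pipe(
    insert_column, "formal_name", lambda x: "Mr. " + x["last_name"], after="last_name"
)
```

Because the helper works on a copy and returns it, it can be chained with `.pipe(...)`, while the original `df` is left unchanged.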

@Gabriel-ROBIN

If I understand correctly, this new insert would be exactly like assign (in assign, you can pass a callable or a dict of callables). The only things assign is missing are the location of the columns and the inplace option (and I am not sure the latter is a good idea).
It would make more sense, IMO, to add the location argument to assign, no?
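
For reference, here is a small example of what assign already supports: a callable as the value, and several columns at once by unpacking a dict into keyword arguments. The column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"first_name": ["Ada"], "last_name": ["Lovelace"]})

# A callable receives the frame that assign is operating on.
out = df.assign(formal_name=lambda x: "Mr. " + x["last_name"])

# Several columns at once: unpack a dict into keyword arguments.
new_cols = {
    "full_name": lambda x: x["first_name"] + " " + x["last_name"],
    "name_length": lambda x: x["last_name"].str.len(),
}
out2 = df.assign(**new_cols)
```

In both cases the new columns are appended at the end and `df` itself is not modified.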

@jreback
Contributor

jreback commented May 25, 2020

DataFrame.insert is very inflexible today and cannot really be used in pipes. I'd like to make several changes to the method:

  • add an inplace parameter, so the result of the operation can be returned.
    yes, this could have an inplace parameter since it's like update, but we haven't changed that either, so changing both would be OK for consistency (though we should simply deprecate inplace anyhow), so -0 on making this change
  • allow a callable for value, so the new values can be computed from existing column values.
    +0
  • allow a dict for value, so that several columns can be inserted simultaneously.
  • add insert_after and insert_before parameters to allow label-based insertion location.
    -1, not sure more methods make sense
  • move loc to after value and give it a default value of None, i.e. insert the new column at the end of the frame if no location is specified.
    -0, not sure this adds much value
  • deprecate DataFrame.assign. Its functionality would be covered by the changed DataFrame.insert method.
    -100, we already have a method to do this; I don't think adding .insert here is appropriate, rather I would deprecate .insert

The above change would allow us to be quite flexible when creating new columns in pipes. For example we could do

(
    df.insert("formal_name", lambda x: "Mr. " + x["last_name"], insert_after="last_name")
    .pipe(...)
)

The above is obviously quite a large change. It could be discussed whether it would be better to make it a new method instead of changing an existing one...

so you can see I am basically -1 on any additional functionality for .insert and would actually rather see us simply deprecate .insert and .update

@jorisvandenbossche
Member

@topper-123 I think some of the functionality you mention would be interesting, but I am also wondering why we would make insert the "go-to" method for adding columns, rather than using assign for this.
Apart from the insertion location, what is it that assign is missing? (We could also try to improve assign then.)

@topper-123
Contributor Author

topper-123 commented May 25, 2020

@jreback, yeah, I don't like the inplace param in general either, though in cases where it makes sense I'm in favor of keeping it (e.g. for performance reasons).

I do think piping is a great concept when transforming data and should be encouraged. I would like insert to have inplace=False as the default, but that would break backwards compatibility, so it's not possible in insert, sadly.
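
To illustrate the current behavior being discussed: DataFrame.insert modifies the frame in place and returns None, which is why it cannot be chained. The data below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# insert mutates df and returns None, so nothing can be chained after it.
returned = df.insert(1, "c", [5, 6])
```

After this call, `returned` is None and `df` has its columns in the order a, c, b.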

add insert_after and insert_before parameters to allow label-based insertion location.

The point of this would be to make insertion based on label location possible. Currently only insertion by integer location is possible. Insertion by label location is more robust and easier to understand at a glance, IMO. The proposed parameter names could maybe be improved, and I would ideally have just one parameter, but I don't think a single parameter could express both insertion before and insertion after a label.
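
As a rough illustration of the robustness point, using only the existing API: resolving the position from a label with `Index.get_loc` places the new column next to `last_name` regardless of column order, whereas a hard-coded integer would not. The frames are made up, and the sketch assumes unique column labels.

```python
import pandas as pd

a = pd.DataFrame({"first_name": ["Ada"], "last_name": ["Lovelace"], "born": [1815]})
b = a[["born", "first_name", "last_name"]].copy()  # same data, different column order

for frame in (a, b):
    # Label-based position; get_loc returns an int when the label is unique.
    loc = frame.columns.get_loc("last_name") + 1
    frame.insert(loc, "formal_name", "Mr. " + frame["last_name"])
```

In both frames `formal_name` ends up directly after `last_name`; with a fixed `loc=2` it would land before `last_name` in `b`.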

move loc to after value and give it a default value of None, i.e. insert the new column at the end of the frame if no location is specified.

This is so that we can write the simpler df.insert(ser): the series would be inserted as the last column if no loc is specified, and the label would be taken from the inserted series. With loc at the beginning of the signature, we always have to supply it and write df.insert(1, "name", ser), i.e. pick an integer location and a column name, which isn't so nice. The changed method could always be extended, e.g. df.insert(lambda x: x["a"] + x["b"], name="new_col", insert_after="old_col", inplace=False).pipe(...).pipe(...) etc.

The above functionality could do things that the current insert and assign can't do without being more complex.

deprecate DataFrame.assign. Its functionality would be covered by the changed DataFrame.insert method.

There's a large overlap between insert and assign. The differences between the methods are technical rather than in substance, i.e. one works inplace and the other doesn't, one takes dicts and the other doesn't, one lets the user select an insertion point and the other doesn't and one allows callables and the other doesn't. IMO it would be better to have one unified method.

wrt assign:

The assign signature is just **kwargs, so it can't be extended with new keyword parameters without breakage, AFAICS. I think that's a bad design, and we should use a dict or array-like as the first argument of an insertion method rather than **kwargs. That would make it more powerful than the current implementation, and also more similar to DataFrame instantiation, i.e.:

>>> DataFrame({"a": range(10), ...}, extra_params)
>>> df.insert({"a": range(10), ...}, extra_params)  # would add to existing df, but better control than df.assign

Adding the new columns at the end of the frame, as assign does, is a fine default, but we should allow them to be placed elsewhere, as insert does, and be able to use labels to find the insertion point, not only an integer location as in insert.
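
A small sketch of the **kwargs concern: today every keyword passed to assign is treated as a column name, so any keyword that might later be turned into a parameter already works as a column. The name `loc` below is only a hypothetical example of such a parameter.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Every keyword is interpreted as a column name, so this creates a column
# called "loc". Reserving "loc" as a parameter would change this behavior.
out = df.assign(loc=lambda x: x["a"] * 2)
```

Here `out` has the columns a and loc, and the new column is reachable as `out["loc"]`.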
