ENH: Make DataFrame.insert more flexible #34365

topper-123 · 2020-05-25T14:09:51Z

DataFrame.insert is very inflexible today and cannot really be used in pipes. I'd like to make several changes to the method:

Add an inplace parameter, so the result from the op can be returned.
allow a callable for value, so the new values can be computed from existing column values.
allow a dict for value, so that several columns can be inserted simultaneously,
add insert_after and insert_before parameters to allow label-based insertion location.
move loc to after the value parameter and let it have a default value of None, i.e. insert the new column at the end of the frame if location is not specified.
deprecate DataFrame.assign. Its functionality would be covered by the changed DataFrame.insert method.

The above change would allow us to be quite flexible when creating new columns in pipes. For example we could do

df.insert("formal_name", lambda x: "Mr. " + x["last_name"], insert_after="last_name")
.pipe(...)

The above is obviously quite a large change. It could be discussed whether it would be better to make it a new method instead of changing an existing one...


Gabriel-ROBIN · 2020-05-25T21:21:10Z

If I understand correctly, this new insert would be exactly like assign (in assign, you can pass a callable or a dict of callables). The only things assign is missing are the location of the columns (and the inplace option, but I'm not sure that is a good idea).
It would make more sense, imo, to add the location argument to assign, no?
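For reference, a minimal sketch of what assign already covers in current pandas. The frame and column names are made up for illustration:

```python
import pandas as pd

# Illustrative data; the column names are assumptions, not from the issue.
df = pd.DataFrame({"first_name": ["Ann", "Bob"], "last_name": ["Lee", "Ray"]})

# assign returns a new frame, accepts callables, and appends the new column last.
out = df.assign(formal_name=lambda x: "Mr. " + x["last_name"])

# A dict of callables can be unpacked into keyword arguments.
new_cols = {"initial": lambda x: x["first_name"].str[0]}
out2 = df.assign(**new_cols)
```

The original df is left untouched, so this already chains in a pipe; what it cannot do is choose where the new column lands.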

jreback · 2020-05-25T22:21:21Z

DataFrame.insert is very inflexible today and cannot really be used in pipes. I'd like to make several changes to the method:

Add an inplace parameter, so the result from the op can be returned.
Yes, this could have an inplace parameter, as it's like update, but we haven't changed that either, so changing both would be ok for consistency (though we should simply deprecate inplace anyhow), so -0 on making this change.

allow a callable for value, so the new values can be computed from existing column values.
+0

allow a dict for value, so that several columns can be inserted simultaneously,

add insert_after and insert_before parameters to allow label-based insertion location.
-1, not sure more methods make sense

move loc to after the value parameter and let it have a default value of None, i.e. insert the new column at the end of the frame if location is not specified.
-0 not sure this adds much value

deprecate DataFrame.assign. Its functionality would be covered by the changed DataFrame.insert method.
-100, we already have a method to do this; I don't think adding this to .insert is appropriate. Rather, I would deprecate .insert

The above change would allow us to be quite flexible when creating new columns in pipes. For example we could do
df.insert("formal_name", lambda x: "Mr. " + x["last_name"], insert_after="last_name")
.pipe(...)
The above is obviously quite a large change. It could be discussed whether it would be better to make it a new method instead of changing an existing one...

So you can see I am basically -1 on any additional functionality for .insert and would actually rather see us simply deprecate .insert and .update.

jorisvandenbossche · 2020-05-25T22:24:29Z

@topper-123 I think some of the functionalities you mention would be interesting, but I am also wondering why we would make insert the "go-to" method for adding columns, rather than using assign for this?
Apart from the insertion location, what is it that assign is missing? (we could also try to improve assign then)

topper-123 · 2020-05-25T23:34:02Z

@jreback, yeah, I don't like the inplace param in general either, though in cases where it makes sense I'm in favor of keeping it (e.g. for performance reasons).

I do think piping is a great concept when transforming data and should be encouraged. I would like insert to have inplace=False as default, but that would break backwards compat, so not possible in insert, sadly.

add insert_after and insert_before parameters to allow label-based insertion location.

The point of this would be to make insertion based on label location possible. Currently only insertion by integer location is possible. Insertion by label location is more robust and easier to understand at a glance IMO. The proposed parameter names could maybe be improved, and I would ideally just have one parameter, but that couldn't be combined with insertion points before/after a label, I think.
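To illustrate the status quo, here is a small sketch of label-based insertion with the current API, assuming unique column labels. The frame and column names are made up for illustration:

```python
import pandas as pd

# Illustrative data; the column names are assumptions, not from the issue.
df = pd.DataFrame(
    {"first_name": ["Ann", "Bob"], "last_name": ["Lee", "Ray"], "age": [30, 40]}
)

# The label has to be translated to an integer position by hand.
# Index.get_loc returns an int when the column labels are unique.
loc = df.columns.get_loc("last_name") + 1

# insert mutates df in place and returns None, so it cannot be chained.
result = df.insert(loc, "formal_name", "Mr. " + df["last_name"])
```

The manual get_loc step and the None return value are the two things that make this awkward to use in a pipe.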

move loc to the after value and let it have a default value of None, i.e. insert the new column at the end of the frame if location is not specified.

This is so we can simply do df.insert(ser): the series would be inserted as the last column if no loc is specified, and the label would be taken from the inserted series. With loc at the beginning of the signature, we always have to supply it and do df.insert(1, "name", ser), i.e. pick an integer location and a column name, which isn't so nice. The changed method could always be extended, e.g. df.insert(lambda x: x["a"] + x["b"], name="new_col", insert_after="old_col", inplace=False).pipe(...).pipe(...) etc.

The above functionality could do things that the current insert and assign can't do without being more complex.

deprecate DataFrame.assign. It's functionality would be covered by the changed DataFrame.insert method.

There's a large overlap between insert and assign. The differences between the methods are technical rather than in substance, i.e. one works inplace and the other doesn't, one takes dicts and the other doesn't, one lets the user select an insertion point and the other doesn't and one allows callables and the other doesn't. IMO it would be better to have one unified method.

wrt `assign`:

The assign signature is just **kwargs, so it can't be extended with new keyword parameters without breakage, AFAIKS. I think that's a bad design and we should use a dict or array-like as the first argument in an insertion method rather than **kwargs. That would make it more powerful than the current impl., and also make it more similar to DataFrame instantiation, i.e.:

>>> DataFrame({"a": range(10), ...}, extra_params)
>>> df.insert({"a": range(10), ...}, extra_params)  # would add to existing df, but better control than df.assign

Adding the new columns at the end of the frame as in assign is a fine default, but we should allow them to be set elsewhere as in insert, and be able to use labels to find the insertion point, not only integer location as in insert.
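As a rough sketch of the behavior being asked for, something like the helper below can be written on top of assign today. Note that insert_after here is a hypothetical standalone function, not an existing pandas API, and it assumes unique column labels and that the anchor label is not one of the new columns:

```python
import pandas as pd


def insert_after(df, label, **new_columns):
    """Hypothetical helper: add columns via assign, then place them after `label`."""
    out = df.assign(**new_columns)  # new frame; callables are evaluated by assign
    # Existing columns in their original order, minus any that are being replaced.
    cols = [c for c in df.columns if c not in new_columns]
    pos = cols.index(label) + 1
    order = cols[:pos] + list(new_columns) + cols[pos:]
    return out[order]


# Illustrative data; the column names are assumptions, not from the issue.
df = pd.DataFrame(
    {"first_name": ["Ann", "Bob"], "last_name": ["Lee", "Ray"], "age": [30, 40]}
)

# Non-mutating, so it composes with pipe.
out = df.pipe(insert_after, "last_name", formal_name=lambda x: "Mr. " + x["last_name"])
```

This keeps the **kwargs style of assign, so the extensibility concern above still applies to the helper itself; it only shows that label-based placement and piping can be combined.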

topper-123 added the Enhancement and Needs Triage labels May 25, 2020

jorisvandenbossche added the Needs Discussion label and removed the Needs Triage label May 25, 2020
