set_axis with callable #29145

Open
markxwang opened this issue Oct 22, 2019 · 7 comments
Labels
DataFrame DataFrame data structure Enhancement MultiIndex Needs Discussion Requires discussion from core team before further action

Comments

@markxwang

Hi,
I've been using method chaining to write most of my data wrangling processes, and one thing that bothers me a bit is modifying column names as part of the chain.

With a static list of column names, df.set_axis can do the work. But in the following cases, I have to use df.columns=xxx to modify the names.

  1. Header comes from a row
  2. Combine multi-indexed header into a single-index (e.g. concatenate levels)
  3. Expand single-indexed header to multi-indexed (e.g. split via some delimiter)

I wonder if these could be achieved by allowing df.set_axis to take callables, something similar to df.assign.

@WillAyd
Member

WillAyd commented Oct 22, 2019

Can you provide code samples for these use cases, with expected outputs?

@WillAyd WillAyd added the DataFrame DataFrame data structure label Oct 22, 2019
@markxwang
Author

Usage examples:

  1. Set the first row as column names:
     df.set_axis(lambda x: x.iloc[0], axis=1)
  2. Merge MultiIndex columns into a single index, say ('a','1'), ('b','2') into 'a_1', 'b_2':
     df.set_axis(lambda x: map('_'.join, x.columns), axis=1)
  3. Split a single index into a MultiIndex, say 'a_1', 'b_2' into ('a','1'), ('b','2'):
     df.set_axis(lambda x: x.columns.str.split('_', expand=True), axis=1)
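
Until something like this exists, the three calls above can be approximated with a small helper passed to .pipe. This is only a sketch: `set_axis_with` is a hypothetical name, not pandas API, and it assumes the callable returns either an Index or any iterable of labels.

```python
import pandas as pd


def set_axis_with(df, func, axis=1):
    """Hypothetical helper: derive new labels from the frame, then set them."""
    labels = func(df)
    if not isinstance(labels, pd.Index):
        # Materialise Series or iterators such as map(...) into an Index.
        labels = pd.Index(list(labels))
    return df.set_axis(labels, axis=axis)


# 1. Header comes from a row (the row itself still has to be dropped).
row_df = pd.DataFrame([["a", "b"], [1, 2]])
from_row = row_df.pipe(set_axis_with, lambda x: x.iloc[0]).drop(0)

# 2. Merge MultiIndex columns into a single index.
multi_df = pd.DataFrame(
    [[1, 2]], columns=pd.MultiIndex.from_tuples([("a", "1"), ("b", "2")])
)
merged = multi_df.pipe(set_axis_with, lambda x: map("_".join, x.columns))

# 3. Split a single index into a MultiIndex.
flat_df = pd.DataFrame([[1, 2]], columns=["a_1", "b_2"])
split = flat_df.pipe(set_axis_with, lambda x: x.columns.str.split("_", expand=True))
```

Because the helper takes the frame as its first argument, it slots into a chain without naming the intermediate result.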

@WillAyd
Member

WillAyd commented Oct 22, 2019

A couple of things to note:

Number 1 is already possible, though it requires you to drop the row after setting the labels. So maybe we need a keyword argument for that (it's called drop in the set_index world).

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.arange(10, 16).reshape((-1, 2)))
>>> df.set_axis(df.iloc[0], axis=1).drop(0)
0  10  11
1  12  13
2  14  15

For number two, there is already a MultiIndex.to_flat_index method you can use:

>>> df = pd.DataFrame(np.arange(10, 16).reshape((-1, 2)), columns=pd.MultiIndex.from_product((("a",), ("b", "c"))))
>>> df.set_axis(df.columns.to_flat_index(), axis=1)
   (a, b)  (a, c)
0      10      11
1      12      13
2      14      15
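
If joined strings such as 'a_b' are wanted instead of tuples, one option is to map a join function over the column index. A minimal sketch, assuming every level holds strings:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product((("a",), ("b", "c")))
df = pd.DataFrame(np.arange(10, 16).reshape((-1, 2)), columns=columns)

# Each MultiIndex entry is a tuple of strings, so "_".join flattens it.
flat = df.set_axis(df.columns.map("_".join), axis=1)
print(flat.columns.tolist())
```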

For number three you can also use pd.MultiIndex.from_tuples

>>> df = pd.DataFrame(np.arange(10, 16).reshape((-1, 2)), columns=[("a", "b"), ("a", "c")])
>>> df.set_axis(pd.MultiIndex.from_tuples(df.columns), axis=1)
    a
    b   c
0  10  11
1  12  13
2  14  15
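
When the labels are still delimited strings, as in the original request, Index.str.split with expand=True should give the MultiIndex directly. A small sketch, assuming '_' is the delimiter and every label splits into the same number of parts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10, 16).reshape((-1, 2)), columns=["a_1", "b_2"])

# expand=True turns the split parts into MultiIndex levels.
split = df.set_axis(df.columns.str.split("_", expand=True), axis=1)
print(split.columns.tolist())
```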

So out of these I think the most actionable thing may be to add a drop argument to set_axis akin to set_index, but let's see what others think
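
To make the drop idea concrete, here is roughly what it could look like as user code today. The helper name `promote_row_to_header` and its `drop` parameter are hypothetical; set_axis itself has no such argument.

```python
import numpy as np
import pandas as pd


def promote_row_to_header(df, row=0, drop=True):
    """Hypothetical helper: use the values of one row (by position) as column labels."""
    out = df.set_axis(df.iloc[row], axis=1)
    if drop:
        # Remove the promoted row, loosely mirroring set_index(drop=True).
        out = out.drop(index=df.index[row])
    return out


df = pd.DataFrame(np.arange(10, 16).reshape((-1, 2)))
result = df.pipe(promote_row_to_header)
kept = df.pipe(promote_row_to_header, drop=False)
```

With drop=False the promoted row stays in the data, which matches what plain set_axis does now.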

Note that this ties into the discussion in #24046, but I think this is an actionable item.

@markxwang
Author

markxwang commented Oct 22, 2019

I understand that all of these are possible, but they are difficult to use with method chaining because you have to create df first. I would like to do something like:

df = (pd.read_file(file)
        .set_axis(lambda x:x...)
        .do_a()
        .do_b())

@hwalinga
Contributor

If you don't know it yet, have a look at .pipe.

Pandas isn't designed for method chaining. (I do think it is a good idea to have more methods like this that tie in with the method chaining pattern, but for now you can ease the pain with .pipe.)
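
For the header-from-row case, a chain using .pipe might look like the sketch below. The CSV text is made-up sample data, read with header=None so the first row arrives as ordinary data.

```python
import io

import pandas as pd

raw = io.StringIO("x,y\n1,2\n3,4\n")  # made-up sample data

df = (
    pd.read_csv(raw, header=None)
    # pipe hands the intermediate frame to the lambda, so no named variable is needed.
    .pipe(lambda d: d.set_axis(d.iloc[0], axis=1))
    .iloc[1:]
    .reset_index(drop=True)
)
```

Note that the remaining values stay as strings here, because the header row was parsed together with the data.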

@markxwang
Author

@hwalinga I do use pipe as a last resort, and I wonder whether it is possible to bypass pipe when it comes to header assignment.

@giuliobeseghi

giuliobeseghi commented May 19, 2020

> @hwalinga I do use pipe as a last resort, and I wonder whether it is possible to bypass pipe when it comes to header assignment.

Also, there are many cases where you can use either pipe or a callable argument. For example:

df = do_stuff().pipe(lambda x: x.assign(new_col=x.old_col + 1))
# or
df = do_stuff().assign(new_col=lambda x: x.old_col + 1)

Same with .loc[lambda s: ...].

I wonder what the preferred method is when both options are available.
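
For what it's worth, the two spellings give the same result on a toy frame, so the choice looks mostly stylistic. A runnable sketch, where `base` stands in for the do_stuff() result and the column names are illustrative:

```python
import pandas as pd

base = pd.DataFrame({"old_col": [1, 2, 3]})  # stand-in for do_stuff()

via_pipe = base.pipe(lambda x: x.assign(new_col=x.old_col + 1))
via_assign = base.assign(new_col=lambda x: x.old_col + 1)

# .loc also accepts a callable that receives the frame being indexed.
via_loc = base.loc[lambda s: s.old_col > 1]

print(via_pipe.equals(via_assign))
```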

@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action and removed API Design labels Jul 22, 2021
5 participants