API: Add DataFrame.assign method #9239

Merged
merged 1 commit into from Mar 1, 2015

Conversation

7 participants
@TomAugspurger
Contributor

TomAugspurger commented Jan 13, 2015

Closes #9229

signature: DataFrame.transform(**kwargs)

  • the keyword is the name of the new column (existing columns are overwritten if there's a name conflict, as in dplyr)
  • the value is either
    • called on self if it's callable. The callable should be a function of one argument: the DataFrame it's being called on.
    • inserted otherwise
In [7]: df.head()
Out[7]: 
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

In [8]: (df.query('species == "virginica"')
           .transform(sepal_ratio=lambda x: x.sepal_length / x.sepal_width)
           .head())
Out[8]: 
     sepal_length  sepal_width  petal_length  petal_width    species  \
100           6.3          3.3           6.0          2.5  virginica   
101           5.8          2.7           5.1          1.9  virginica   
102           7.1          3.0           5.9          2.1  virginica   
103           6.3          2.9           5.6          1.8  virginica   
104           6.5          3.0           5.8          2.2  virginica   

     sepal_ratio  
100     1.909091  
101     2.148148  
102     2.366667  
103     2.172414  
104     2.166667  

My question now is

  • How strict should we be on the shape of the transformed DataFrame? Should we do any kind of checking on the index or columns?
@shoyer


pandas/core/frame.py
+ """
+
+ """
+ data = self.copy()

@shoyer

shoyer Jan 13, 2015

Member

This should do a shallow copy if possible.

@jreback

jreback Jan 13, 2015

Contributor

we never shallow copy

much more trouble than it's worth
only indexes are shallow

@TomAugspurger

TomAugspurger Jan 13, 2015

Contributor

I like the unintentional formatting on that :)

@jreback

jreback Jan 13, 2015

Contributor

hahha

the old iphone

@shoyer

shoyer Jan 13, 2015

Member

I misunderstood pandas's data model, presuming that you could do a shallow copy and then update a column without modifying the original (like a dict). I see now that I was mistaken :(.

@shoyer


pandas/core/frame.py
+ """
+ data = self.copy()
+
+ if not len(kwargs) == 1:

@shoyer

shoyer Jan 13, 2015

Member

is this necessary?

@TomAugspurger

TomAugspurger Jan 13, 2015

Contributor

I'm planning to add a helpful error message with how to call .transform here if something else is passed in. I wanted to disallow passing in multiple kwargs to avoid the issue with computations relying on each other.

@shoyer

shoyer Jan 13, 2015

Member

Hmm. I don't think this is that much of a trap and it is certainly very convenient to be able to use a single call to transform. It's enough, I think, to simply ensure that all functions are evaluated on the original untransformed dataframe.

@jreback jreback added the API Design label Jan 13, 2015

@jreback jreback added this to the 0.16.0 milestone Jan 13, 2015

@jreback

Contributor

jreback commented Jan 13, 2015

I like transform as the name; no-one in favor of mutate? any other possibilities?

only minus against transform is the use in groupby is somewhat different.

@shoyer

Member

shoyer commented Jan 13, 2015

I like mutate better because:

  1. we already have transform on groupby
  2. we might want mutate as grouped operation -- dplyr has it
@mrocklin

Contributor

mrocklin commented Jan 14, 2015

If you intend to keep the function pure (as it is currently) then the term mutate might be misleading.

Neither transform nor mutate are very descriptive. It may make sense to seek out a better term, even if it means breaking from tradition.

@shoyer

Member

shoyer commented Jan 14, 2015

FWIW, dplyr's mutate is also pure, despite the misleading name.

@mrocklin

Contributor

mrocklin commented Jan 14, 2015

le sigh

@TomAugspurger

Contributor

TomAugspurger commented Jan 14, 2015

Augment? Enhance? I'll keep thinking.

@jreback

Contributor

jreback commented Jan 14, 2015

I think we should go a slightly different route here.

I would change the signature to df.update(*args, **kwargs). Then you can detect if it's the 'new' mode or the 'original' mode (you can simply look at the args/kwargs and figure this out unambiguously).

I would vote to effectively deprecate the original .update mode (of course ATM this would just show a FutureWarning). This function is really not necessary except in very rare circumstances, and is not well implemented internally.

df.update(A=df.B/df.C) looks like a winner to me. (same idea in that it IS pure, but has a more suggestive name than transform/mutate. Further we can add the same method to groupby).
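The unambiguous mode detection described here could be sketched like this (a hypothetical dispatcher for illustration only; the function name and mode strings are made up and this is not how pandas actually implements anything):

```python
def update_dispatch(*args, **kwargs):
    # Hypothetical sketch: decide which mode a dual-purpose
    # .update(*args, **kwargs) was called in, purely from the call shape.
    if args and not kwargs:
        return "original"   # e.g. df.update(other_df): the in-place update
    if kwargs and not args:
        return "new"        # e.g. df.update(A=df.B / df.C): pure assignment
    raise TypeError("pass either a positional 'other' or keyword "
                    "assignments, not both")
```

Because positional and keyword calls never overlap, the two behaviors can coexist during a deprecation period, though (as discussed below) a forgotten keyword silently falls into the in-place mode.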

@mrocklin

Contributor

mrocklin commented Jan 14, 2015

That sounds pretty clean.

@jorisvandenbossche

Member

jorisvandenbossche commented Jan 14, 2015

I wanted to say that df.add_column() is descriptive, but an ugly name .. but I like update more!
Only, the combination with the existing use seems a bit confusing to users I think (but I have to look in more detail)

@shoyer

Member

shoyer commented Jan 14, 2015

+1 for update.

I still vote for allowing multiple variables at once :).

@shoyer

Member

shoyer commented Jan 14, 2015

OK, another idea: df.assign?

Update is not very useful in pandas, but it is part of the standard mapping API, where it's known as a method that does an in-place operation.

@TomAugspurger

Contributor

TomAugspurger commented Jan 16, 2015

Thanks for the input. My favorite is df.assign, it conveys the meaning well. I could live with update though. I only worry about the cognitive clash with dict.update(), which is inplace.

I'll also allow multiple variables I think. I'm going to do all of the calculations before the assignment so that we don't run into issues with one calculation depending on another, and having the success or failure of the call dependent upon the dict ordering.

More tests and docs coming soon.
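The evaluation strategy Tom describes (compute everything against the original frame first, then insert) can be sketched roughly as follows. This is a simplified model written for this discussion, not the actual implementation; `assign_sketch` is a made-up name:

```python
import pandas as pd

def assign_sketch(df, **kwargs):
    # Evaluate every value against the ORIGINAL frame first, so no kwarg
    # can observe another kwarg's result, then insert into a copy.
    # Sorted key order mirrors the fact that **kwargs ordering was
    # undefined on the Pythons of the day.
    results = {k: (v(df) if callable(v) else v) for k, v in kwargs.items()}
    out = df.copy()
    for key in sorted(results):
        out[key] = results[key]
    return out

df = pd.DataFrame({"A": [1.0, 2.0]})
out = assign_sketch(df, B=lambda x: x["A"] * 2)
```

Under this model a call like `assign_sketch(df, C=lambda x: x["B"]…)` fails deterministically whenever `B` is created in the same call, rather than depending on dict ordering.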

@TomAugspurger

Contributor

TomAugspurger commented Jan 17, 2015

Updated with some docs and handling multiple assigns. I wasn't sure about the best place for the docs so I threw it in basics.rst. I still need to build them to make sure everything looks good.

I am worried about people hitting subtle bugs with assigning multiple columns in one assign since the order won't be preserved. I've tried to document that.

I've got a few more things to clean up and then I'll ping for review.

@TomAugspurger TomAugspurger changed the title from API: Add DataFrame.transform method to API: Add DataFrame.assign method Jan 18, 2015

@TomAugspurger

Contributor

TomAugspurger commented Jan 18, 2015

Ok, ready for feedback.

Just a summary,

  • I went with assign, but update could work
  • keyword arguments only (potentially multiple)
  • if the value is callable, it's called on self
  • if the value is not callable, it's inserted
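The summarized behavior can be illustrated with a small example (column names are invented for the illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

out = df.assign(
    B=[10, 20, 30],              # not callable: inserted as-is
    C=lambda x: x["A"] + 1,      # callable: called on the DataFrame (self)
)
```

Both keyword forms land in the same returned frame, and `df` itself is left unchanged.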
@sinhrks

Member

sinhrks commented Jan 18, 2015

Nice feature, and some points to be considered:

  • inplace option like other functions (though it means we couldn't create a column named inplace)
  • Should Series also have assign to make a DataFrame, for consistency?
  • Better to take care with partial string slicing if the DataFrame has a DatetimeIndex or PeriodIndex, which can output unexpected results.
@TomAugspurger

Contributor

TomAugspurger commented Jan 18, 2015

  • I'm -1 on an inplace option. Overall, I think we're discouraging users from using inplace these days. It also kills what I think is the main use of assign: inside a chain of operations.
  • Series could have an assign. I didn't include it yet since that would necessarily involve transforming a Series to a DataFrame, which we have the to_frame method for.
  • For now I'm taking the approach that people need to be very careful when using assign. I'm not doing any checking of the results to ensure that your computation hasn't caused a reindexing that creates a bunch of NaNs.
@jreback

Contributor

jreback commented Jan 18, 2015

I really think this should be called .update. Adding another function is just confusing.

@shoyer

Member

shoyer commented Jan 18, 2015

@jreback I'm all for deprecating .update, but I think it is clearer to give them a distinct name, given that the behavior is different and also different from the update method on dicts.

Another name possibly worth considering is .set.

@jreback

Contributor

jreback commented Jan 18, 2015

@shoyer we already have way too many update/set/filter etc methods. I think some consolidation is in order. I don't think adding another new method is useful at all.

And you don't actually need to deprecate the current functionality of .update (e.g. other is a dict-like).

@shoyer

Member

shoyer commented Jan 18, 2015

@jreback I agree on the problem but not the solution (in this specific case). Taking the deprecation and eventual removal of the current meaning of update as a given, I would rather have an assign method than an update method, because the latter will always have the confusing association with dict.update (the issue is similar to the name mutate but worse). But this is mostly bike shedding.

@shoyer

Member

shoyer commented Jan 18, 2015

@jreback To follow up on your edit, I am pretty opposed to functions that do very different things depending on how you call them, except as a transitional step. That is poor API design :).

@jreback

Contributor

jreback commented Jan 18, 2015

@shoyer fair enough, and I DO like .set :)

@jreback

Contributor

jreback commented Jan 18, 2015

@shoyer update to my update (pun intended). I am pushing this because .update functionality is currently completely encompassed by .loc (maybe with a couple of edge cases), and is not handled correctly internally (e.g. it's a giant hack right now).

So I would view this as a net positive to remove it and simplify the API.

@jorisvandenbossche

Member

jorisvandenbossche commented Jan 19, 2015

.update functionality is currently completely encompassed by .loc (maybe with a couple of edge cases), and is not handled correctly internally (e.g. it's a giant hack right now).

How would you exactly do such an update with loc? Say df1.update(df2) is equivalent to df1.loc[df2.index, df2.columns] = df2. I don't know if that is so obvious for users to do?
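One of those edge cases is worth spelling out: `.update` skips NaNs in the other frame, while the `.loc` form assigns every aligned cell. A minimal sketch of the difference (assuming `df2`'s labels are a subset of `df1`'s, since `.loc` does not enlarge here):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})
df2 = pd.DataFrame({"B": [40.0, np.nan]}, index=[0, 2])

# .update modifies in place and ignores NaNs in the other frame
upd = df1.copy()
upd.update(df2)

# the .loc "equivalent" assigns every aligned cell, NaN included
loc = df1.copy()
loc.loc[df2.index, df2.columns] = df2
```

Here `upd["B"]` keeps `6.0` at label 2 (the NaN was skipped), while `loc["B"]` ends up with NaN there — so the two are only equivalent when `df2` has no missing values.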

@TomAugspurger

Contributor

TomAugspurger commented Jan 20, 2015

This could be an argument from authority, but Wes does briefly use DataFrame.update in The Book (page 338), as an alternative to .combine_first. I think we're trying to avoid breaking examples from that.

@TomAugspurger

Contributor

TomAugspurger commented Jan 24, 2015

Picking this up again today.

I'm even more down on update than before, since the couple of releases where there's overlap will be very muddled. With df.update(other_df) being inplace since other_df is positional, but df.update(a=other_thing) being the new behavior since it's a keyword. Just in my testing and putting this together I've accidentally forgotten the keyword argument multiple times. Having that unexpectedly update my data, potentially without me realizing, would suck.

I was thinking maybe it'd be OK since DataFrames aren't dict-like, so DataFrame.update shouldn't have too much cognitive dissonance with dict.update, but we also have a Series.update, and Series are dict-like.

So I guess what I'm saying is I want to go with .assign. (Or maybe enhance (NSFWish I guess, it's the Super Troopers reference) for the comedic value of chaining a bunch of enhances together, 😆 ).

@shoyer


doc/source/basics.rst
+
+.. warning::
+
+ Since the function signature of ``assign`` is ``**kwargs**``, a dictionary,

@shoyer

shoyer Jan 25, 2015

Member

I think you want **kwargs here (no trailing **)?

@shoyer


doc/source/basics.rst
+.. warning::
+
+ Since the function signature of ``assign`` is ``**kwargs**``, a dictionary,
+ the order of the columns in the result DataFrame cannot be guarunteed.

@shoyer

shoyer Jan 25, 2015

Member

spelling: guarunteed -> guaranteed

also: I would qualify "the order of any new columns"

@shoyer


doc/source/basics.rst
+
+ .. code-block:: python
+
+ >>># Don't do this

@shoyer

shoyer Jan 25, 2015

Member

I think we usually use IPython style In [1] prompts rather than >>>.

Also, there's a nice IPython verbatim sphinx directive that will number lines properly even if they aren't evaluated, e.g., write something like:

.. ipython::
    :verbatim:

    In [1]: df['not_found']
    Out[1]: KeyError

Probably best to try building the docs to make sure it looks right.

@shoyer


doc/source/basics.rst
+
+.. ipython:: python
+
+ iris.assign(sepal_ratio = lambda x: x['SepalWidth'] /

@shoyer

shoyer Jan 25, 2015

Member

It would be best to pick a style here for spacing around = and stick to it for the docs. Personally, I like the extra spaces (even though it goes against the usual PEP8) because it's used for assignment, and it looks very unbalanced otherwise with the lambda.

Also: for these lines, I would consider using parentheses like this to group the lambda statement:

lambda x: (x['SepalWidth'] /
           x['SepalLength'])

(just looks better to me)

@shoyer


pandas/core/frame.py
+ Examples
+ ========
+
+ df = DataFrame({'A': range(1, 10), 'B': np.random.randn(10)})

@shoyer

shoyer Jan 25, 2015

Member

I think sphinx formats these better if you preface by >>>

@jorisvandenbossche

jorisvandenbossche Jan 25, 2015

Member

yes, indeed, blocks that start with >>> are automatically converted to code blocks

@jorisvandenbossche


pandas/core/frame.py
+ Assign new columns to a DataFrame.
+
+ Parameters
+ ==========

@jorisvandenbossche

jorisvandenbossche Jan 25, 2015

Member

can you use a single line ---- here? (for consistency with our other docstrings)

also we generally use no blankline after a heading

(see https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt for an overview)

@TomAugspurger

Contributor

TomAugspurger commented Jan 26, 2015

Thanks. I'm building the docs now.

@jreback thoughts on keeping it at .assign vs. changing to .update?

+ 8 9 0.549296 2.197225
+ 9 10 -0.758542 2.302585
+ """
+ data = self.copy()

@jorisvandenbossche

jorisvandenbossche Jan 28, 2015

Member

I think there has been a previous comment about this, but two things:

  • Is this actually necessary? (But I probably also do not yet fully understand pandas' data model.) E.g. does df['a'] = .. always copy?
  • In the probable case of misunderstanding (so this is my actual comment :-), I would maybe add a note about this in the docstring? DataFrame.append has this, in the sense that it says it returns a new object

@jreback

jreback Jan 28, 2015

Contributor

This would violate pandas' data model. The assign method would then have side effects (without it being obvious that it does), and further, intuition on chaining would be very difficult to reason about.

e.g. If you allowed inplace chaining

df.assign(C=df.A/df.C)

would then add C to the ORIGINAL frame. (I have some commentary on this later)

@jreback

jreback Jan 28, 2015

Contributor

@jorisvandenbossche
df['a'] = ... NEVER copies. That is the point: it's an inplace assignment.

@jorisvandenbossche

jorisvandenbossche Jan 28, 2015

Member

@jreback Thanks for explaining! (this does make me think we should really have some better docs about the internals .. but of course, someone has to write them (and keep them up to date))

So assigning with df['a'] = .. adds a new block and does not consolidate it with another block if one exists of that type? Why not have the same approach here? What are the side effects you are talking about with df['a'] = .. ?

@jreback

jreback Jan 28, 2015

Contributor

no, I was saying side-effects meaning that df IS modified, as opposed to df.assign(...) which returns a NEW object. df['a'] = .. is just like it says, an assignment INPLACE.

whether this creates a new block and/or consolidates is an implementation detail (it actually creates a new block if it's a new dtype, then consolidates)
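The distinction being drawn here (in-place item assignment vs. a pure method returning a new object) can be demonstrated in a few lines; the column names are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
view = df                      # another name for the SAME object

df["C"] = df["A"] + df["B"]    # in-place: mutates the original object
assert "C" in view.columns     # 'view' sees it too; nothing was copied

out = df.assign(D=lambda x: x["A"] * 2)   # pure: returns a NEW DataFrame
assert "D" in out.columns
assert "D" not in df.columns   # the original is untouched
```

This is exactly why assign composes safely inside a method chain while `df['a'] = ..` does not.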

@jorisvandenbossche

jorisvandenbossche Jan 28, 2015

Member

yep, thinking more about it now, it is indeed logical, if you are chaining, that it returns a new object. It should just be clear from the docs, as @TomAugspurger adapted them now.

@jorisvandenbossche

jorisvandenbossche commented Jan 28, 2015

Member

@TomAugspurger So I did my round of nitpicks :-)

@jreback

jreback commented Jan 28, 2015

Contributor

So it's clear that .assign MUST be pure and not have side effects.

That said, I think it might be ok for the following to be equivalent:

df.assign(C=df.A/df.B, inplace=True)
df['C'] = df.A/df.B

But I DON'T think we should actually offer the inplace kw. It will be confusing, and more room for abuse (e.g. chained assignment that non-explicitly modifies the original data).

@TomAugspurger I would also move the doc section to AFTER 'regular' assignment. And maybe call it chained-assignment.

@TomAugspurger

TomAugspurger commented Jan 28, 2015

Contributor

Thanks again all.

@jreback good call, I think that's a better location for the docs. I named the section Assigning New Columns in Method Chains (a bit wordy, meh). I really don't want to confuse newish users who have heard about the evils of chained assignment. Is method-chaining distinct enough?

@jankatins

jankatins commented Jan 28, 2015

Contributor

Maybe give it a new doc and only talk about chained operations there? In that context, mutate even makes sense, because it mutates a dataframe "in the chain".

+
+ # ... and then assign
+ for k, v in results.items():
+ data[k] = v

@sinhrks

sinhrks Feb 2, 2015

Member

Better to use .loc here, __setitem__ can behave unexpectedly depending on input.

@jreback

jreback Feb 2, 2015

Contributor

no, this is correct; this is by definition a string setting of a column.
maybe just assert that the keys are strings (I think the function call would raise beforehand if they were not, in any event)

@TomAugspurger

TomAugspurger Feb 28, 2015

Contributor

The keys in **kwargs are required to be strings by Python. No need to check.
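That constraint is easy to verify directly (a throwaway sketch, not code from the PR; `collect` is a made-up stand-in for any `**kwargs` function):

```python
def collect(**kwargs):
    # kwargs keys arriving here are always str, enforced by the interpreter
    return kwargs

# CPython rejects non-string keyword names at call time, so assign
# never needs to validate the column names it receives via **kwargs.
try:
    collect(**{1: 'x'})
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```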

@jreback

jreback commented Feb 24, 2015

Contributor

@TomAugspurger

I think implementing this for groupby (e.g. #9545) should be straightforward.

@TomAugspurger

TomAugspurger commented Feb 24, 2015

Contributor

Cool. I'm going to clean this up and merge on Saturday.

@TomAugspurger

TomAugspurger commented Feb 28, 2015

Contributor

Travis is running. Will merge in a few hours, assuming no objections.

+ iris = read_csv('data/iris.data')
+ iris.head()
+
+ (iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])

@jreback

jreback Feb 28, 2015

Contributor

you don't need the parens here

@TomAugspurger

TomAugspurger Feb 28, 2015

Contributor

I call .head on the next line.

+ iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] /
+ x['SepalLength'])).head()
+
+``assign`` **always** returns a copy of the data, leaving the original

@jreback

jreback Feb 28, 2015

Contributor

maybe explain why you would use a callable (as opposed to straight assignment)

@TomAugspurger

TomAugspurger Feb 28, 2015

Contributor

I do that down on line 498, but agreed. I'll put a sentence here.

+.. ipython:: python
+
+ @savefig basics_assign.png
+ (iris.query('SepalLength > 5')

@jreback

jreback Feb 28, 2015

Contributor

I don't believe you need the parens here either

@TomAugspurger

TomAugspurger Feb 28, 2015

Contributor

I'm using parens instead of \ to do line-continuation.

+ .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
+
+Since a function is passed in, the function is computed on the DataFrame
+being assigned to. Importantly, this is the DataFrame that's been filtered

@jreback

jreback Feb 28, 2015

Contributor

make it clear that this is a deferred operation (so that the purpose is to have the filtering happen first)
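A minimal sketch of the deferred behavior on toy data (not the iris example from the docs): a precomputed value is evaluated against the original frame, while a callable is deferred and receives the already-filtered frame.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})

# The lambda is not evaluated until assign runs, so it sees the
# two-row frame produced by query, not the original four-row df.
out = df.query('a > 2').assign(ratio=lambda x: x['b'] / x['a'])
print(out)
```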

@jreback

View changes

doc/source/dsintro.rst
+to be inserted (for example, a ``Series`` or NumPy array), or a function
+of one argument to be called on the ``DataFrame``. The new values are inserted,
+and the entire DataFrame (with all original and new columns) is returned.
+

@jreback

jreback Feb 28, 2015

Contributor

make it clear that it's a copy

+ iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()
+
+Above was an example of inserting a precomputed value. We can also pass in
+a function to be evaluated.

@jreback

jreback Feb 28, 2015

Contributor

say that the purpose of the callable is to have a deferred operation

@jreback

View changes

pandas/core/frame.py
+ kwargs : keyword, value pairs
+ keywords are the column names, and values are either
+ inserted into the DataFrame, or called on the DataFrame
+ and inserted if the value is callable.

@jreback

jreback Feb 28, 2015

Contributor

the results of the callable are assigned

@jreback

jreback commented Feb 28, 2015

Contributor

@TomAugspurger lgtm

  • some minor doc comments
  • pls open another issue (for 0.16.0 hopefully!) to add examples / docs for use of .assign with .groupby(), e.g. df.groupby('A').assign(B = lambda x: x.C+1).max(). I think you need the .assign to interact (and make a copy of the internal self.obj of the grouper), but might be a bit trickier.
@TomAugspurger

TomAugspurger commented Feb 28, 2015

Contributor

@jreback I'll need to think about how this interacts with groupby. E.g. say we wanted the group-wise mean of the ratio:

In [7]: gr.apply(lambda x: x.assign(r=x.sepal_width / x.sepal_length).mean())
Out[7]: 
            sepal_length  sepal_width  petal_length  petal_width         r
species                                                                   
setosa             5.006        3.428         1.462        0.246  0.684248
versicolor         5.936        2.770         4.260        1.326  0.467680
virginica          6.588        2.974         5.552        2.026  0.453396

Obviously not really what we want; assign returns the entire frame. But assigning and then grouping works fine:

In [10]: df.assign(r=lambda x: x.sepal_length / x.sepal_width).groupby('species').r.mean()
Out[10]: 
species
setosa        1.470188
versicolor    2.160402
virginica     2.230453
Name: r, dtype: float64

I suppose it'd be needed for operations where the assign depends on some group-wise computation. Like df.groupby('species').assign(r_ = lambda x: (x.sepal_length / x.sepal_width * len(x)))?
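The assign-then-group pattern discussed above can be reproduced on a tiny made-up frame (column names mimic the iris example; the values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'species': ['a', 'a', 'b', 'b'],
                   'sepal_length': [5.0, 6.0, 6.0, 8.0],
                   'sepal_width': [2.5, 3.0, 2.0, 2.0]})

# Add the ratio first, then group: the new column takes part in the groupby.
res = (df.assign(r=lambda x: x.sepal_length / x.sepal_width)
         .groupby('species').r.mean())
print(res)
```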

@TomAugspurger

TomAugspurger commented Feb 28, 2015

Contributor

💣s away?

@jorisvandenbossche

View changes

doc/source/basics.rst
+ matplotlib.style.use('ggplot')
+ except AttributeError:
+ options.display.mpl_style = 'default'
+

@jorisvandenbossche

jorisvandenbossche Mar 1, 2015

Member

Are these imports still needed? As you don't seem to use a plotting example anymore now?

@TomAugspurger

TomAugspurger Mar 1, 2015

Contributor

Not needed now. Forgot to remove them.

+ """
+ Assign new columns to a DataFrame, returning a new object
+ (a copy) with all the original columns in addition to the new ones.
+

@jorisvandenbossche

jorisvandenbossche Mar 1, 2015

Member

Can you add here a versionadded as well?

@TomAugspurger

TomAugspurger Mar 1, 2015

Contributor

Good idea. Done.

@jorisvandenbossche

jorisvandenbossche commented Mar 1, 2015

Member

I added two small doc remarks, for the rest, bombs away!

ENH: Add assign method to DataFrame
Creates a new method for DataFrame, based off dplyr's mutate.
Closes #9229

TomAugspurger added a commit that referenced this pull request Mar 1, 2015

@TomAugspurger TomAugspurger merged commit c88b0ba into pandas-dev:master Mar 1, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@TomAugspurger

TomAugspurger commented Mar 1, 2015

Contributor

Ok, thanks everyone. We can do follow-ups as needed.

@shoyer

shoyer commented Mar 1, 2015

Member

Woohoo! Well done 👍

@jreback

jreback commented Mar 1, 2015

Contributor

very nice @TomAugspurger

small issue: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html

the plot seems 'too big' size-wise. is this controlled somewhere?

@jorisvandenbossche

jorisvandenbossche commented Mar 2, 2015

Member

You can control this size in the image directive of rst (see http://docutils.sourceforge.net/docs/ref/rst/directives.html#image), the other option is just to include a smaller figure in the sources and include it as it is in the docs (this is what is done for the other images included in _static in the docs I think).
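For reference, the docutils ``image`` directive accepts ``width``/``height`` options; a sketch of how the whatsnew figure could be sized (the file path here is illustrative, not taken from the PR):

```rst
.. image:: _static/whatsnew_assign.png
   :width: 4in
```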

@TomAugspurger

TomAugspurger commented Mar 2, 2015

Contributor

Thanks. I'll follow up tonight.

@jreback

jreback commented Mar 9, 2015

Contributor

@TomAugspurger I remember you did a PR for the size of the plot in the whatsnew: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html. Did it get merged?

@TomAugspurger

TomAugspurger commented Mar 9, 2015

Contributor

Yeah: #9575

Still having problems? I can change the actual image size if that pull didn't work.

@jreback

jreback commented Mar 9, 2015

Contributor

hmm still seems much larger than the rest of the page to me

@shoyer

shoyer commented Mar 9, 2015

Member

Yes, that didn't seem to work for some reason.

@TomAugspurger

TomAugspurger commented Mar 9, 2015

Contributor

I opened up #9619 to fix this.

@TomAugspurger TomAugspurger deleted the TomAugspurger:dfTransform branch Apr 5, 2017
