DOC: Reindexing behaviour of dataframe column-assignment missing #39845

Open
fish-face opened this issue Feb 16, 2021 · 8 comments
Labels: Docs, good first issue, Indexing (related to indexing on series/frames, not to indexes themselves)

Comments

@fish-face

Location of the documentation

pandas.core.indexing.IndexingMixin.loc
pandas.DataFrame.__setitem__

Documentation problem

When assigning a Series through df[...] = ... or df.loc[...] = ..., the Series is reindexed to conform to the DataFrame's index, and its values are then placed according to their labels:

In [1]: df
Out[1]: 
   a
0  1
1  2
2  3

In [2]: df['b'] = pd.Series({1: 'b'})
In [3]: df
Out[3]: 
   a    b
0  1  NaN
1  2    b
2  3  NaN

In [4]: se = pd.Series({2: 'zero', 1: 'one', 0: 'two'})      
# NOTE: Order is preserved and reflected in the Series' order
In [5]: se
Out[5]: 
2    zero
1     one
0     two
dtype: object
In [6]: df['d'] = se
# NOTE: values have been reordered according to the df's index
In [7]: df
Out[7]: 
   a    b     d
0  1  NaN   two
1  2    b   one
2  3  NaN  zero

(But in contrast:

In [4]: df['c'] = {1: 'c'} 
<traceback omitted>
ValueError: Length of values (1) does not match length of index (3)

)
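For anyone who wants to reproduce the Series part of the session above, here is a minimal self-contained sketch. It assumes df starts as a single integer column a with the default 0..2 index, which the session does not show being built; the names b_missing and d_values are only for illustration.

```python
import pandas as pd

# Assumed starting frame: one column "a" on the default index 0, 1, 2.
df = pd.DataFrame({"a": [1, 2, 3]})

# The Series only has label 1, so rows 0 and 2 of the new column are missing.
df["b"] = pd.Series({1: "b"})
b_missing = df["b"].isna().tolist()

# The Series is built in the order 2, 1, 0, but assignment goes by label,
# so the values land in the DataFrame's index order.
se = pd.Series({2: "zero", 1: "one", 0: "two"})
df["d"] = se
d_values = df["d"].tolist()

print(df)
```

The point of interest is d_values: it follows the DataFrame's index, not the order in which se was constructed.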

As far as I can tell, this is not really documented. For __setitem__ there is no API documentation at all, so one is left with only the examples in the "Indexing and selecting data" user guide. For .loc, the documentation mentions that when a Series is used as the key, "The index of the key will be aligned before masking," but that describes the key rather than the assigned value, so it is not what is happening here. Neither set of examples shows the behaviour when adding a new column or part of a column: the only hints I could find in the guide, under setting with enlargement, add a Series whose index is the same as the existing index. As a result, it is not clear in what order the data will end up in the DataFrame or where NaNs will be inserted.

For .loc in general, the API documentation exists but is fairly scant. There is a link to the user guide, but personally I think this behaviour is important enough to document in the reference.
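The .loc side of this can be sketched in the same way. The example below is only an illustration under assumed data (a float column x and a new column y, neither of which appears in the report): it assigns a Series to part of an existing column and then creates a new column through .loc, and in both cases the values are placed by label.

```python
import pandas as pd

# Assumed example frame; a float column is used so that missing values
# can be stored without changing the dtype.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

# Partial column assignment: rows 0 and 1 are selected, and the Series
# supplies those labels in the opposite order.
df.loc[[0, 1], "x"] = pd.Series({1: 20.0, 0: 10.0})
partial = df["x"].tolist()

# New column through .loc: the Series only covers label 2,
# so the other rows are filled with NaN.
df.loc[:, "y"] = pd.Series({2: 30.0})
y_missing = df["y"].isna().tolist()

print(df)
```

Row 2 of x is untouched because it was not selected, and 10.0 ends up in row 0 even though it was listed second in the Series.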

Suggested fix for documentation

  • Add documentation for __setitem__ and in particular the behaviour of reindexing Series.
  • Update documentation for .loc to more completely describe the behaviour obtained when assigning to .loc[], and include at least one example of assigning to a partial column. Alternatively add this to the user guide. Perhaps something like the following, plus an example like those above:

When assigning a Series to a DataFrame, either via .loc or via the [] operator, the values of the Series are placed in the DataFrame according to their index labels. Values in the Series whose label does not appear in the DataFrame's index are not added, and labels missing from the Series' index are filled with NaN. This also means that the order in which data appears in the resulting DataFrame can differ from the order in the Series.
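A short sketch of the two cases in the proposed wording, using made-up labels and values (the columns e and f are hypothetical). The last step is not part of the suggestion above: it is one way to opt out of alignment, by converting the Series to a NumPy array so that values are assigned by position.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Label 5 does not exist in df, so its value is not added and no row is
# created for it; labels 0 and 1 are absent from the Series and become NaN.
df["e"] = pd.Series({2: "kept", 5: "dropped"})
e_missing = df["e"].isna().tolist()

# Converting to an array discards the Series' index, so the values are
# assigned positionally (the length must then match the DataFrame).
df["f"] = pd.Series({12: "x", 11: "y", 10: "z"}).to_numpy()
f_values = df["f"].tolist()

print(df)
```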

fish-face added the Docs and Needs Triage labels Feb 16, 2021

@0xpranjal
Contributor

@fish-face Are you working on this issue?

@fish-face
Author

@Bhard27 I'm not confident of finding the best place in the documentation to add this, so was hoping to leave my contribution at the example and suggested wording above. But if someone can provide feedback on that, and on the wording, I might be able to produce more...

@mzeitlin11
Member

Thanks for writing this up @fish-face! I agree that an example in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html would be valuable since there is no current example (and this seems like an issue that might be commonly encountered). Your example seems perfect for this purpose. The same example could be added to the indexing section of the user guide if nothing like it exists.

(I'm not sure about the __setitem__ portion of your question. I assume there is a reason it's not documented like loc, but I don't know the history there.)

mzeitlin11 added the good first issue and Indexing labels and removed the Needs Triage label Mar 20, 2021
mzeitlin11 added this to the Contributions Welcome milestone Mar 20, 2021

@mmarconi

Take. I am with a group of student developers from Allegheny College, and we are looking to contribute to this issue.

@mzeitlin11
Member

@mmarconi I saw a question about this on Gitter. A good starting point would be adding an example for loc, and adding one to the indexing guide as well if there is nothing like it there. __setitem__ could be handled as a follow-up (it's nice to keep PRs small and targeted).

@salomondush
Contributor

Hello, I would like to work on this issue if it's not entirely finished! I noticed that it's still open.

@marco-georgaklis

take

jreback modified the milestones: Contributions Welcome, 1.5 Feb 26, 2022
mroeschke removed this from the 1.5 milestone Aug 15, 2022

@rsm-23
Contributor

rsm-23 commented Sep 11, 2023

@mroeschke is this still open?


9 participants