DOC: Document that DataFrame.from_records()'s columns argument also acts as "include" #59670

Closed
1 task done
cjerdonek opened this issue Aug 30, 2024 · 5 comments · Fixed by #59723
Labels: Docs, good first issue, IO Data

Comments

@cjerdonek

cjerdonek commented Aug 30, 2024

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_records.html

Documentation problem

Currently, it's not clear from the DataFrame.from_records() docs that the columns argument also does what an include (or usecols) argument would do, namely limit the result to the listed columns. Indeed, the current wording once led someone to file a feature request asking for an include argument to be added: #15319
However, the request was later closed when the maintainers realized the columns argument already does this (but it's not documented, hence this issue).
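
For illustration, here is a minimal sketch of the behavior in question, using a list of dictionaries as input. The record contents and key names (a, b, c) are made up for this example.

```python
import pandas as pd

# Hypothetical records; every dict has the same three keys.
records = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 4, "b": 5, "c": 6},
]

# Ask for only two of the three keys, in a different order.
df = pd.DataFrame.from_records(records, columns=["c", "a"])

# Only the requested keys are kept, in the requested order.
print(list(df.columns))
```

With dictionary input, columns acts as both a selector and an ordering, which is the part the current wording does not spell out.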

Suggested fix for documentation

Add a sentence or phrase to the documentation of the columns argument that the argument also has the effect of limiting the DataFrame to including only the columns specified. This isn't implied by the current wording, or it's at least a little ambiguous.

@cjerdonek added the Docs and Needs Triage labels Aug 30, 2024
@rhshadrach added the IO Data and good first issue labels and removed the Needs Triage label Sep 4, 2024
@rhshadrach
Member

Thanks for the report! PRs to improve the docs here are welcome!

@ammar-qazi
Contributor

take

@ammar-qazi
Contributor

@rhshadrach
Unfortunately, the columns parameter behaves differently depending on the input type. With dictionaries I saw the behavior described in the other thread, but the results were different when I used a NumPy structured array or a list of tuples.

  • NumPy structured array: The columns parameter doesn't appear to have any filtering effect on a structured array. All columns are included in the original order, regardless of what's specified in columns.


  • List of tuples: The columns parameter only assigns names to the columns. It doesn't filter or reorder the data. Additionally, it raises an error if fewer column names are provided than the number of columns in the input data.
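
A short sketch of the list-of-tuples case described above. The data and the names x, y, z are invented for illustration, and the exception type is caught broadly because the exact class may differ between pandas versions.

```python
import pandas as pd

# Hypothetical positional data: tuples carry no column names of their own.
rows = [(1, 2, 3), (4, 5, 6)]

# With one name per position, columns simply labels the data.
named = pd.DataFrame.from_records(rows, columns=["x", "y", "z"])
print(list(named.columns))

# With fewer names than positions, pandas refuses to guess.
error = None
try:
    pd.DataFrame.from_records(rows, columns=["x", "y"])
except (ValueError, AssertionError) as exc:
    error = exc
    print(type(exc).__name__, exc)
```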


For documentation purposes, I'm considering treating the dictionary-based input as the ideal case and instructing users to convert their data into dictionaries if they want to filter or reorder columns using the columns parameter.
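
As a rough sketch of that conversion, assuming a structured array with made-up fields a, b and c: each row is turned into a dictionary keyed by field name, and the resulting list is passed to from_records() with the desired columns.

```python
import numpy as np
import pandas as pd

# Hypothetical structured array with three named fields.
arr = np.array(
    [(1, 2.5, 10), (2, 3.5, 20)],
    dtype=[("a", "i4"), ("b", "f8"), ("c", "i4")],
)

# arr.tolist() yields plain Python tuples; pair each one with the field names.
records = [dict(zip(arr.dtype.names, row)) for row in arr.tolist()]

# With dict records, columns selects and orders the output.
df = pd.DataFrame.from_records(records, columns=["c", "a"])
print(df)
```

Building the frame from the array first and then selecting with df[["c", "a"]] would be another way to get the same result without the intermediate dictionaries.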

However, as this is my first time contributing to the documentation, I'd greatly appreciate your guidance on the best approach to document these nuances.

@cjerdonek
Author

cjerdonek commented Sep 5, 2024

NumPy structured array: The columns parameter doesn't appear to have any filtering effect on a structured array. All columns are included in the original order, regardless of what's specified in columns.

This could be a bug, because it seems to contradict the current documentation (which says the argument reorders the columns if column names are provided). If this case is fixed for reordering, the behavior for filtering could be fixed at the same time.
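
A small check like the following could be used to see how a given pandas installation treats this case. The array and its field names are made up, and no expected output is shown because the result depends on the pandas version in use.

```python
import numpy as np
import pandas as pd

# Hypothetical structured array with three named fields.
arr = np.array(
    [(1, 2.5, 10), (2, 3.5, 20)],
    dtype=[("a", "i4"), ("b", "f8"), ("c", "i4")],
)

requested = ["c", "a"]
df = pd.DataFrame.from_records(arr, columns=requested)

# Compare what was requested with what came back.
print("pandas   :", pd.__version__)
print("requested:", requested)
print("got      :", list(df.columns))
```

If the printed column list matches the requested one, the reordering and filtering apply to structured arrays in that version; if all fields come back in their original order, the behavior reported above is still present.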

List of tuples: The columns parameter only assigns names to the columns. It doesn't filter or reorder the data. Additionally, it raises an error if fewer column names are provided than the number of columns in the input data.

This seems to be consistent with the current documentation because the documentation only says that reordering occurs if names are provided (and this limitation can be the same for filtering). As for raising an error when fewer names are provided, the documentation doesn't explicitly say an error will happen in that case, but it's not surprising it would, IMO.

@ammar-qazi
Contributor

Yeah, that's understandable. If the bug is fixed, we can say that the columns parameter reorders data for dicts and NumPy structured arrays. For a list of tuples, it behaves as a names parameter, which should be clear because a list of tuples doesn't have existing column names.

Thanks for the insights, @cjerdonek.
