DOC: Document that DataFrame.from_records()'s columns argument also acts as "include" #59670

cjerdonek · 2024-08-30T19:34:56Z

Pandas version checks

I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_records.html

Documentation problem

Currently, it's not clear from the DataFrame.from_records() docs that the columns argument to from_records() also has the effect of what an include argument (or usecols) would do. Indeed, the current wording once led someone to file a feature request asking for an include argument to be added: #15319
However, the request was later closed when the maintainers realized the columns argument already does this (but it's not documented, hence this issue).

Suggested fix for documentation

Add a sentence or phrase to the documentation of the columns argument stating that it also limits the DataFrame to only the columns specified. The current wording doesn't imply this, or is at least a little ambiguous.
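To make the behavior in question concrete, here is a minimal sketch using a list of dictionaries. The sample records and variable names are made up for illustration; only `DataFrame.from_records()` and its `columns` argument come from the discussion.

```python
import pandas as pd

# Illustrative records; the keys "a", "b", "c" are assumptions for this sketch.
records = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 4, "b": 5, "c": 6},
]

# Without columns, every key becomes a column.
full = pd.DataFrame.from_records(records)

# With columns, only the named keys are kept, in the order given,
# so the argument acts like an "include" list for dict records.
subset = pd.DataFrame.from_records(records, columns=["c", "a"])

print(list(full.columns))    # ['a', 'b', 'c']
print(list(subset.columns))  # ['c', 'a']
```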


rhshadrach · 2024-09-04T20:56:21Z

Thanks for the report! PRs to improve the docs here are welcome!

ammar-qazi · 2024-09-04T21:51:15Z

take

ammar-qazi · 2024-09-04T23:06:17Z

@rhshadrach
Unfortunately, the columns parameter behaves differently depending on the input type. With dictionaries I saw the intended behavior described in the other thread, but the results were different when I used a NumPy structured array or a list of tuples.

NumPy structured array: The columns parameter doesn't appear to have any filtering effect on a structured array. All columns are included in the original order, regardless of what's specified in columns.

List of tuples: The columns parameter only assigns names to the columns. It doesn't filter or reorder the data. Additionally, it raises an error if fewer column names are provided than the number of columns in the input data.
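The two cases above can be reproduced with a short sketch like the following. The sample data and names are assumptions for illustration. The structured-array result is printed rather than asserted, since it may differ between pandas versions.

```python
import numpy as np
import pandas as pd

rows = [(1, 2, 3), (4, 5, 6)]  # illustrative data

# List of tuples: columns only supplies names, position by position.
# The first tuple element lands under "c"; nothing is reordered or filtered.
named = pd.DataFrame.from_records(rows, columns=["c", "a", "b"])
print(named["c"].tolist())  # [1, 4]

# Fewer names than tuple fields is rejected with an error.
try:
    pd.DataFrame.from_records(rows, columns=["a", "b"])
    too_few_raised = False
except ValueError as exc:
    too_few_raised = True
    print("error:", exc)

# Structured array: the fields already have names, so columns could in
# principle filter and reorder. Whether it does depends on the pandas version.
arr = np.array(rows, dtype=[("a", "i8"), ("b", "i8"), ("c", "i8")])
from_struct = pd.DataFrame.from_records(arr, columns=["c", "a"])
print(list(from_struct.columns))
```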

For documentation purposes, I'm considering treating the dictionary-based input as the ideal case and instructing users to convert their data into dictionaries if they want to filter or reorder columns using the columns parameter.
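As a rough sketch of that conversion, assuming the field names are known up front (`field_names` and the sample rows are hypothetical), tuples can be zipped into dictionaries before calling `from_records()`:

```python
import pandas as pd

field_names = ["a", "b", "c"]   # hypothetical names for the tuple fields
rows = [(1, 2, 3), (4, 5, 6)]   # illustrative data

# Pair each tuple with the field names so every record carries its own keys.
dict_rows = [dict(zip(field_names, row)) for row in rows]

# With dict records, columns selects and orders by key.
df = pd.DataFrame.from_records(dict_rows, columns=["c", "a"])
print(df.values.tolist())  # [[3, 1], [6, 4]]
```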

However, as this is my first time contributing to the documentation, I'd greatly appreciate your guidance on the best approach to document these nuances.

cjerdonek · 2024-09-05T03:59:38Z

> NumPy structured array: The columns parameter doesn't appear to have any filtering effect on a structured array. All columns are included in the original order, regardless of what's specified in columns.

It seems like this could be a bug because it seems to contradict the current documentation (that the argument reorders the columns if column names are provided). If this case is fixed for reordering, the behavior for filtering could be fixed at the same time.

> List of tuples: The columns parameter only assigns names to the columns. It doesn't filter or reorder the data. Additionally, it raises an error if fewer column names are provided than the number of columns in the input data.

This seems to be consistent with the current documentation because the documentation only says that reordering occurs if names are provided (and this limitation can be the same for filtering). As for raising an error when fewer names are provided, the documentation doesn't explicitly say an error will happen in that case, but it's not surprising it would, IMO.

ammar-qazi · 2024-09-05T04:28:43Z

Yeah, that's understandable. If the bug is fixed, we can say that the columns parameter reorders data for dicts and NumPy structured arrays. For a list of tuples, it behaves as a names parameter, which should be clear because a list of tuples doesn't have existing column names.

Thanks for the insights, @cjerdonek.

cjerdonek added the Docs and Needs Triage labels on Aug 30, 2024

rhshadrach added the IO Data and good first issue labels and removed the Needs Triage label on Sep 4, 2024

github-actions bot assigned ammar-qazi Sep 4, 2024

StaticAccess mentioned this issue Sep 5, 2024

fixed issue#59670. DOC #59714

Closed

1 task

This was referenced Sep 5, 2024

BUG: DataFrame.from_records()'s columns argument doesn't work on Numpy's structured array #59717

Closed

Resolves #59670 by documenting that DataFrame.from_records()'s columns filters (includes) data. #59723

Merged

mroeschke closed this as completed in #59723 Sep 6, 2024

mroeschke closed this as completed in 5a07ed5 Sep 6, 2024
