@topper-123 topper-123 commented Nov 17, 2022

If read_stata was called with index_col=None, the constructed DataFrame was given an index based on np.arange, i.e. (pre pandas 2.0) an Int64Index.

np.arange returns dtype np.int_, which is like np.intp except that it is always 32-bit on Windows. That makes it awkward to use in tests once indexes can hold all numpy numeric dtypes (as they can after #49560), so I'm looking into how arange is used in #49560. One place I found is read_stata, where it is better to use a range, so that read_stata(index_col=None) gives a RangeIndex instead of an Index[int_].
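To illustrate the difference, here is a minimal sketch (not code from this PR) that builds one index from np.arange and one from a plain range. The variable names are made up for the example, and the exact class and dtype printed for the arange-based index depend on the pandas version and platform, so no output is shown.

```python
import numpy as np
import pandas as pd

n = 3  # illustrative length, not taken from the PR

# Built from a materialized NumPy integer array; its dtype follows np.arange.
arange_index = pd.Index(np.arange(n))

# Built from a Python range; pandas keeps it as a lazy RangeIndex.
range_index = pd.Index(range(n))

print(type(arange_index).__name__, arange_index.dtype)
print(type(range_index).__name__, range_index.dtype)
```

Both indexes hold the same labels; only the second is a RangeIndex, which stores start/stop/step rather than an array.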

This is a slight change in API, so I've separated it out into its own PR here, so that #49560, which is a large PR, can stay as focused as possible.
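A rough way to check the behavior locally is a small round trip through a temporary .dta file. This is only a sketch: the file name and column are invented for the example, and the index type printed depends on the installed pandas version (a RangeIndex is expected only on versions that include this change).

```python
import os
import tempfile

import pandas as pd

# Hypothetical example data; any small frame would do.
df = pd.DataFrame({"x": [1.5, 2.5, 3.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dta")
    # Do not write the index, so the file contains only the "x" column.
    df.to_stata(path, write_index=False)
    # index_col is left at its default of None.
    result = pd.read_stata(path)

# With this change applied, this should print "RangeIndex".
print(type(result.index).__name__)
```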

@mroeschke added the "IO Stata" (read_stata, to_stata) and "Index" (Related to the Index class or subclasses) labels Nov 17, 2022
@@ -340,6 +340,7 @@ Other API changes
- Passing strings that cannot be parsed as datetimes to :class:`Series` or :class:`DataFrame` with ``dtype="datetime64[ns]"`` will raise instead of silently ignoring the keyword and returning ``object`` dtype (:issue:`24435`)
- Passing a sequence containing a type that cannot be converted to :class:`Timedelta` to :func:`to_timedelta` or to the :class:`Series` or :class:`DataFrame` constructor with ``dtype="timedelta64[ns]"`` or to :class:`TimedeltaIndex` now raises ``TypeError`` instead of ``ValueError`` (:issue:`49525`)
- Changed behavior of :class:`Index` constructor with sequence containing at least one ``NaT`` and everything else either ``None`` or ``NaN`` to infer ``datetime64[ns]`` dtype instead of ``object``, matching :class:`Series` behavior (:issue:`49340`)
- If no parameter ``index_col`` is given to :func:`read_stata`, the index will be a :class:`RangeIndex` Previously the index would have been a :class:`Int64Index` (:issue:`49745`)
Member

I think the Performance improvement note can be used instead of this one

@@ -594,6 +595,7 @@ Performance improvements
- Memory improvement in :meth:`RangeIndex.sort_values` (:issue:`48801`)
- Performance improvement in :class:`DataFrameGroupBy` and :class:`SeriesGroupBy` when ``by`` is a categorical type and ``sort=False`` (:issue:`48976`)
- Performance improvement in :class:`DataFrameGroupBy` and :class:`SeriesGroupBy` when ``by`` is a categorical type and ``observed=False`` (:issue:`49596`)
- Performance improvement in :func:`read_stata` with parameter ``index_col`` set to ``None``(the default). Now the index will be a :class:`RangeIndex` instead of :class:`Int64Index` (:issue:`49745`)
Member

I think the docbuild is complaining about this line

@topper-123
Contributor Author

Updated.

@mroeschke mroeschke added this to the 2.0 milestone Nov 18, 2022
@mroeschke mroeschke merged commit c37dfc1 into pandas-dev:main Nov 18, 2022
@mroeschke
Member

Thanks @topper-123

@topper-123 topper-123 deleted the read_stata_index_col branch November 18, 2022 18:39
mliu08 pushed a commit to mliu08/pandas that referenced this pull request Nov 27, 2022
…dev#49745)

* API: read_stata with index_col=None return RangeIndex

* fix comments

* fix comments II

Co-authored-by: Terji Petersen <terjipetersen@Terjis-MacBook-Air.local>
Co-authored-by: Terji Petersen <terjipetersen@Terjis-Air.fritz.box>