BUG: Index dtype may not be applied properly #11017

Closed

Member

sinhrks commented Sep 7, 2015

Fixed 2 problems:

  • A specified dtype is not applied when the data is an iterable other than an ndarray (for example, a list).
import numpy as np
import pandas as pd

pd.Index([1, 2, 3], dtype=int)
# Index([1, 2, 3], dtype='object')
  • Specifying dtype='category' for ndarray-like data raises a TypeError.
pd.Index(np.array([1, 2, 3]), dtype='category')
# TypeError: data type "category" not understood
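
For reference, a minimal sketch of the behavior the two fixes aim for, assuming a pandas release that already includes this change. The exact integer width can differ by platform, so only the dtype kind is inspected; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A list with an explicit integer dtype should give an integer index,
# not an object-dtype one.
int_idx = pd.Index([1, 2, 3], dtype=int)
print(int_idx.dtype.kind)  # 'i' is expected once the fix is in place

# dtype='category' on ndarray input should build a CategoricalIndex
# instead of raising TypeError.
cat_idx = pd.Index(np.array([1, 2, 3]), dtype='category')
print(type(cat_idx).__name__)
```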

sinhrks added the Bug, Indexing, and Dtypes labels and removed the Indexing label Sep 7, 2015

sinhrks added this to the 0.17.0 milestone Sep 7, 2015

Contributor

jreback commented Sep 7, 2015

Please run a perf check on this.

Contributor

jreback commented Sep 9, 2015

@sinhrks can you see if this affects perf?

jreback commented on an outdated diff Sep 9, 2015

pandas/core/index.py
@@ -114,17 +114,21 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None, fastpath=False,
         if fastpath:
             return cls._simple_new(data, name)
+        if is_categorical_dtype(data) or is_categorical_dtype(dtype):
+            return CategoricalIndex(data, copy=copy, name=name, **kwargs)
+
+        from pandas.tseries.index import DatetimeIndex

jreback Sep 9, 2015

Contributor

I would leave all of the imports where they were; there is no reason to import them until they are used (see below).

jreback Sep 9, 2015

Contributor

Actually, on second thought, this is fine.

Member

sinhrks commented Sep 10, 2015

The following is an asv result. I will consider a better approach.

    before     after       ratio
  [789e07d9] [ce89f9d1]
    33.94μs    44.57μs      1.31  ctors.index_from_series_ctor.time_index_from_series_ctor
   121.86μs   173.21μs      1.42  frame_methods.frame_get_dtype_counts.time_frame_get_dtype_co
    16.45μs    27.07μs      1.65  index_object.index_float64_construct.time_index_float64_construct
Contributor

jreback commented Sep 10, 2015

This is probably just from the imports (e.g. a Float64Index doesn't care about DatetimeIndex), so the import adds the extra time (or the check for the import, anyhow).
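
A rough way to see the kind of per-call overhead described here is to time a function that executes an import statement on every call against one that does not. This is only a sketch: it uses the standard-library `json` module as a stand-in, the function names are hypothetical, and it is not the pandas constructor code.

```python
import timeit


def with_local_import(x):
    # Stand-in for an import statement executed on every call. After the
    # first call the module is cached in sys.modules, but the statement
    # still has to look it up and bind the name each time.
    import json  # noqa: F401
    return x


def without_import(x):
    # Same work, minus the import statement.
    return x


n = 200_000
t_import = timeit.timeit(lambda: with_local_import(1), number=n)
t_plain = timeit.timeit(lambda: without_import(1), number=n)
print(f"with local import: {t_import / n * 1e6:.3f} us per call")
print(f"without import:    {t_plain / n * 1e6:.3f} us per call")
```

The absolute numbers depend on the machine and Python version, so no expected output is given. The point is only that the difference can be visible when the surrounding call takes tens of microseconds.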

Member

sinhrks commented Sep 11, 2015

@jreback Thanks, the suggested changes improve perf a little. I assume the remaining slowdown is caused by the categorical check having been moved to the top.

All benchmarks:

    before     after       ratio
  [4d4a2e33] [c3eaeb07]
    33.85μs    41.24μs      1.22  ctors.index_from_series_ctor.time_index_from_series_ctor
   129.48μs   160.86μs      1.24  frame_methods.frame_get_dtype_counts.time_frame_get_dtype_counts
    16.36μs    22.88μs      1.40  index_object.index_float64_construct.time_index_float64_construct
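
As a sanity check on the two asv tables, the ratio column appears to be the after timing divided by the before timing, rounded to two decimals. The short sketch below recomputes it from the reported numbers. The dictionary keys are shortened benchmark labels chosen for illustration only.

```python
# (before, after) timings in microseconds, copied from the asv output above.
first_run = {
    "index_from_series_ctor": (33.94, 44.57),
    "frame_get_dtype_counts": (121.86, 173.21),
    "index_float64_construct": (16.45, 27.07),
}
second_run = {
    "index_from_series_ctor": (33.85, 41.24),
    "frame_get_dtype_counts": (129.48, 160.86),
    "index_float64_construct": (16.36, 22.88),
}


def ratios(run):
    # ratio = after / before, rounded to two decimals
    return {name: round(after / before, 2) for name, (before, after) in run.items()}


first_ratios = ratios(first_run)
second_ratios = ratios(second_run)

for name in first_run:
    print(f"{name}: {first_ratios[name]:.2f} -> {second_ratios[name]:.2f}")
```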
Contributor

jreback commented Sep 11, 2015

Benchmark on master (this is index_object.index_float64_construct.time_index_float64_construct):

In [1]: arr = np.arange(1000000.0)
In [2]: %timeit Index(arr)
The slowest run took 12.23 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 15.7 µs per loop

This branch:

In [2]: %timeit Index(arr)
The slowest run took 7.49 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 19.8 µs per loop

With the fix in jreback/pandas@29c0325:

In [2]: %timeit Index(arr)
The slowest run took 11.74 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 13.8 µs per loop

Will merge in a bit. Thanks, @sinhrks.

Member

sinhrks commented Sep 11, 2015

Wow, great. Thanks!

Contributor

jreback commented Sep 11, 2015

merged via ead3ca8 (my change in another commit)

thanks!

I don't believe we had an issue associated with this, correct?

jreback closed this Sep 11, 2015

Member

sinhrks commented Sep 11, 2015

Although this was based on a Gitter chat, #5196 refers to the same issue. Closed.

sinhrks deleted the sinhrks:index_dtype branch Sep 11, 2015

Contributor

jreback commented Sep 11, 2015

@sinhrks awesome thanks!
