Categorical(vals, cats) bad performance with NaNs #12077

Closed
Winand opened this Issue Jan 18, 2016 · 4 comments

Winand commented Jan 18, 2016

NaNs (NaT) in datetime64 values greatly reduce the performance of Categorical(values, cats), by roughly 40x in the timings below:

import pandas as pd
tmp=pd.Series(pd.DatetimeIndex(pd.np.datetime64('1995-01-01 00:00')+i for i in range(1000000)))
to = tmp.astype('category')
cats = to.cat.categorical._categories.values

%timeit pd.Categorical(tmp.values, cats)
1 loops, best of 3: 250 ms per loop

tmp[500000] = pd.NaT
%timeit pd.Categorical(tmp.values, cats)
1 loops, best of 3: 10.1 s per loop

tmp[tmp.isnull()] = pd.np.datetime64('0')
%timeit pd.Categorical(tmp.values, cats)
1 loops, best of 3: 251 ms per loop

Small issue with printing Categorical datetime64:

ds = pd.Categorical([pd.np.datetime64("2014-01-01"), pd.NaT])
>>> ds
[2014-01-01, 2014-01-01] <--- NO, -1 is NOT a category:-)
Categories (1, datetime64[ns]): [2014-01-01]
>>> ds.astype('datetime64')
array(['2014-01-01T03:00:00.000000000+0300', 'NaT'], dtype='datetime64[ns]')
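For comparison, here is a minimal sketch of the behavior one would expect. It assumes a recent pandas release rather than the 0.17.1 used in this report, and the variable names are only illustrative. The NaT entry should get the code -1 and should not show up as a category:

```python
import numpy as np
import pandas as pd

# One real timestamp and one missing value (NaT).
values = np.array(["2014-01-01", "NaT"], dtype="datetime64[ns]")
cat = pd.Categorical(values)

codes = cat.codes.tolist()        # integer code per element; -1 marks a missing value
n_categories = len(cat.categories)  # NaT is not counted as a category
missing = cat.isna().tolist()     # element-wise missing mask

print(codes, n_categories, missing)
```

If the second element prints as a repeat of the first category instead of NaT, the repr is indexing the categories with the -1 code, which is what the output above suggests.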

Versions:

commit: None
python: 3.4.3.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en

pandas: 0.17.1
nose: 1.3.7
pip: 7.1.2
setuptools: 18.7.1
Cython: 0.23.4
numpy: 1.9.3
scipy: 0.16.1
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.3.3
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2015.7
blosc: 1.2.8
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.0
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.7
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
Jinja2: 2.8

jreback commented Jan 18, 2016

Please show a copy-pastable example and the output of
pd.show_versions()


jreback commented Jan 18, 2016

Can you show the timings using
%%timeit in IPython instead?
It's much easier to read.


Winand commented Jan 18, 2016

At first I tried to initialize like this:

to=pd.Series(pd.DatetimeIndex(range(1000000))).astype('category')
cats = to.cat.categorical._categories.values
tmp=pd.Series(pd.DatetimeIndex(range(1000000)))

but it gives wrong results in the first case (a bug?):

>>>c1
[1970-01-01, 1970-01-01, 1970-01-01, 1970-01-01, 1970-01-01, ..., 1970-01-01 00:00:00.000999, 1970-01-01 00:00:00.000999, 1970-01-01 00:00:00.000999, 1970-01-01 00:00:00.000999, 1970-01-01 00:00:00.000999]
Length: 1000000
Categories (1000000, datetime64[ns]): [1970-01-01 00:00:00.000000000, 1970-01-01 00:00:00.000000001,
                                       1970-01-01 00:00:00.000000002, 1970-01-01 00:00:00.000000003, ...,
                                       1970-01-01 00:00:00.000999996, 1970-01-01 00:00:00.000999997,
                                       1970-01-01 00:00:00.000999998, 1970-01-01 00:00:00.000999999]

>>>c2
[1970-01-01 00:00:00.000000000, 1970-01-01 00:00:00.000000001, 1970-01-01 00:00:00.000000002, 1970-01-01 00:00:00.000000003, 1970-01-01 00:00:00.000000004, ..., 1970-01-01 00:00:00.000999995, 1970-01-01 00:00:00.000999996, 1970-01-01 00:00:00.000999997, 1970-01-01 00:00:00.000999998, 1970-01-01 00:00:00.000999999]
Length: 1000000
Categories (1000000, datetime64[ns]): [1970-01-01 00:00:00.000000000, 1970-01-01 00:00:00.000000001,
                                       1970-01-01 00:00:00.000000002, 1970-01-01 00:00:00.000000003, ...,
                                       1970-01-01 00:00:00.000999996, 1970-01-01 00:00:00.000999997,
                                       1970-01-01 00:00:00.000999998, 1970-01-01 00:00:00.000999999]
equal? False

jreback added a commit to jreback/pandas that referenced this issue Jan 25, 2016

PERF: add support for NaT in hashtable factorizers, improving Categorical construction with NaT, #12077
e1385d8

jreback commented Jan 25, 2016

@Winand

#12128 should fix a multitude of Categorical issues with NaT, including the performance problem.

pandas was converting the values to object dtype under the hood (bad) and not treating NaT like NaN.
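To illustrate the idea of a hashtable factorizer that treats a missing value like NaN, here is a rough pure-Python sketch. It is not the pandas implementation: the `factorize` function, the `NA_CODE` name, and the use of `None` as a stand-in for NaT are all assumptions made for this example. Missing values are given the code -1 and are never added to the table of uniques:

```python
from datetime import datetime

NA_CODE = -1  # hypothetical name; follows the -1 convention for missing values


def factorize(values, is_missing=lambda v: v is None):
    """Map each value to an integer code via a hash table (dict).

    Missing values get NA_CODE and are not stored as uniques.
    """
    table = {}    # value -> code
    codes = []
    uniques = []
    for v in values:
        if is_missing(v):
            # Skip the hash table entirely for missing values.
            codes.append(NA_CODE)
            continue
        code = table.get(v)
        if code is None:
            # First time this value is seen: assign the next code.
            code = len(uniques)
            table[v] = code
            uniques.append(v)
        codes.append(code)
    return codes, uniques


# None stands in for NaT in this sketch.
values = [datetime(2014, 1, 1), None, datetime(2014, 1, 2), datetime(2014, 1, 1)]
codes, uniques = factorize(values)
print(codes)
print(uniques)
```

Handling the missing value inside the typed lookup loop is what lets the data stay in its native dtype rather than falling back to generic Python objects.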

jreback closed this in 81bb972 Jan 25, 2016
