Inconsistent dtype of category in empty Series between dict and list input #18515

Closed
toobaz opened this Issue Nov 27, 2017 · 7 comments

@toobaz
Member

toobaz commented Nov 27, 2017

Code Sample, a copy-pastable example if possible

In [2]: pd.Series([], dtype='category')
Out[2]: 
Series([], dtype: category
Categories (0, object): [])

In [3]: pd.Series({}, dtype='category')
Out[3]: 
Series([], dtype: category
Categories (0, float64): [])

In [4]: pd.Series(dtype='category')
Out[4]: 
Series([], dtype: category
Categories (0, float64): [])

Problem description

The categories dtype differs (object for list input, float64 for dict input or no data), and the difference is unjustified.

Expected Output

The same categories dtype in all three cases. Probably that of Out[4] (float64), which is also (implicitly) tested.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.22.0.dev0+241.gf745e52e1
pytest: 3.0.6
pip: 9.0.1
setuptools: 33.1.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.2.2
sphinx: None
patsy: 0.4.1+dev
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml: 3.7.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback

Contributor

jreback commented Nov 27, 2017

dupe of #17261

@jreback jreback closed this Nov 27, 2017

@jreback jreback added this to the No action milestone Nov 27, 2017

@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Next Major Release Nov 30, 2017

@jorisvandenbossche

Member

jorisvandenbossche commented Nov 30, 2017

This takes a separate fix, so let's keep this open?
I suppose this is due to this inconsistency in the Categorical constructor:

In [13]: pd.Categorical(np.array([]))
Out[13]: [], Categories (0, float64): []

In [14]: pd.Categorical([])
Out[14]: [], Categories (0, object): []
@jorisvandenbossche

Member

jorisvandenbossche commented Nov 30, 2017

Actually, the above is maybe not that wrong: in the first case you actually pass it a float array, so it's OK that the categories are float.
So then it does need to be fixed in the Series init, which I think @toobaz is doing in #18496.
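
For readers following along, here is a minimal sketch of the point made above: an empty NumPy array created without an explicit dtype is a float64 array, while an empty list carries no dtype, so `Categorical` has different information to work with in the two cases. The variable names are illustrative only, and the printed categories dtypes may vary by pandas version.

```python
import numpy as np
import pandas as pd

# An empty ndarray created without an explicit dtype defaults to float64,
# whereas an empty Python list carries no dtype information at all.
empty_array = np.array([])
print(empty_array.dtype)  # float64

# Categorical infers its categories dtype from whatever it is given,
# so the two inputs below are not equivalent from its point of view.
from_array = pd.Categorical(empty_array)
from_list = pd.Categorical([])
print(from_array.categories.dtype)
print(from_list.categories.dtype)
```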

@toobaz

Member

toobaz commented Nov 30, 2017

> I suppose this is due to this inconsistency in the Categorical constructor:

The inconsistency is precisely the one described in my opening example. Which is what I'm fixing in #18496 (and has nothing to do with passing an array - which would rightly keep its dtype).

But you are right that this can be considered separate from #17261, which doesn't involve categories. @jreback probably just viewed this as included in that (which was fine with me, but the fix is distinct).

So OK with reopening, I will just mention in #18496 that it closes this.

@jorisvandenbossche

Member

jorisvandenbossche commented Nov 30, 2017

> The inconsistency is precisely the one described in my opening example.

No, you are describing it with Series, which is something different as with Categorical :-)

Anyhow, yes consider this as a separate issue, and mention it in #18496 (+ adding new tests, whatsnew note)

@toobaz

Member

toobaz commented Nov 30, 2017

> No, you are describing it with Series, which is something different as with Categorical :-)

Indeed. This issue has nothing to do with (non-Series) Categorical (the inconsistency you are describing is unrelated to mine, and to this bug).

> Anyhow, yes consider this as a separate issue, and mention it in #18496 (+ adding new tests, whatsnew note)

OK

@jorisvandenbossche

Member

jorisvandenbossche commented Nov 30, 2017

> the inconsistency you are describing is unrelated to mine

No, the inconsistency I describe is the underlying reason for the bug you reported in this issue. Of course, the actual cause is the implementation detail of how Series handles no data or an empty dict: previously it was passed as np.array([]), and now you changed that to be passed as []. Which you are fixing, so perfect!
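
A quick way to check whether a given pandas version shows the inconsistency reported in this issue is to compare the categories dtype across the three construction paths from the opening example. This is only a sketch: `empty_category_dtypes` is a hypothetical helper name, not part of pandas, and the actual dtypes printed depend on the installed version.

```python
import pandas as pd


def empty_category_dtypes():
    """Return the categories dtype for three ways of building an empty categorical Series."""
    # Hypothetical helper for illustration; mirrors In [2], In [3] and In [4] above.
    candidates = {
        "list": pd.Series([], dtype="category"),
        "dict": pd.Series({}, dtype="category"),
        "no data": pd.Series(dtype="category"),
    }
    return {label: s.cat.categories.dtype for label, s in candidates.items()}


dtypes = empty_category_dtypes()
for label, dtype in dtypes.items():
    print(label, dtype)

# On a version affected by this issue, more than one distinct dtype shows up.
print("consistent:", len(set(dtypes.values())) == 1)
```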

toobaz added a commit to toobaz/pandas that referenced this issue Nov 30, 2017

toobaz added a commit to toobaz/pandas that referenced this issue Dec 1, 2017

@jreback jreback modified the milestones: Next Major Release, 0.22.0 Dec 1, 2017

jreback added a commit that referenced this issue Dec 1, 2017
