Dataframe column dtype changed from int8 to int64 when setting complete column #11638

Closed
cpaulik opened this Issue Nov 18, 2015 · 12 comments

Contributor

cpaulik commented Nov 18, 2015

The following example should explain:

Python 2.7.10 |Continuum Analytics, Inc.| (default, Oct 19 2015, 18:04:42) 
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: None
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: None
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

In [4]: df = pd.DataFrame({'one': np.full(10, 0, dtype=np.int8)})

In [5]: df
Out[5]: 
   one
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0

In [6]: df.dtypes
Out[6]: 
one    int8
dtype: object

In [7]: df.loc[1, 'one'] = 6

In [8]: df
Out[8]: 
   one
0    0
1    6
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0

In [9]: df.dtypes
Out[9]: 
one    int8
dtype: object

In [10]: df.one = np.int8(7)

In [11]: df.dtypes
Out[11]: 
one    int64
dtype: object

In [12]: df
Out[12]: 
   one
0    7
1    7
2    7
3    7
4    7
5    7
6    7
7    7
8    7
9    7

So the dtype is preserved when a single element or slice of the column is changed, but setting the whole column changes the dtype to int64, even when the assigned value is explicitly an np.int8.
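The session above can be condensed into a script that records the column dtype after each step. This is a minimal sketch: the function name `column_dtypes_after_assignment` and the step labels are illustrative, and the last step (assigning a full-length int8 array) is not part of the original report and is only included for comparison. The result of the scalar step depends on the installed pandas version, so no output is shown.

```python
import numpy as np
import pandas as pd


def column_dtypes_after_assignment():
    """Record the dtype of column 'one' after each kind of assignment."""
    steps = {}

    df = pd.DataFrame({"one": np.full(10, 0, dtype=np.int8)})
    steps["initial"] = df["one"].dtype

    # Setting a single element, as in In [7] of the report.
    df.loc[1, "one"] = 6
    steps["single element"] = df["one"].dtype

    # Setting the whole column to a NumPy scalar, as in In [10].
    df["one"] = np.int8(7)
    steps["whole column, np.int8 scalar"] = df["one"].dtype

    # Comparison step (an addition, not in the report): a full-length
    # array that already carries the int8 dtype.
    df["one"] = np.full(len(df), 7, dtype=np.int8)
    steps["whole column, int8 array"] = df["one"].dtype

    return steps


report = column_dtypes_after_assignment()
for step, dtype in report.items():
    print(step, "->", dtype)
```

Running it on the pandas version in question shows at which step, if any, the column stops being int8.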

Contributor

jreback commented Nov 18, 2015

Hmm, I recall seeing almost the same issue, but can't locate it ATM. Yep, looks buggy. Pull requests to fix are welcome.

jreback added this to the Next Major Release milestone Nov 18, 2015

Contributor

varun-kr commented Nov 19, 2015

@jreback In common.py, int8 is being upcast to int64:

    # provide implicit upcast on scalars
    elif is_integer(val):
        dtype = np.int64

    elif is_float(val):
        dtype = np.float64

If there is no specific requirement for upcasting, I can open a PR.
My proposed solution is to use np.issubdtype inside the is_integer and is_float functions to support all int and float types.
Please advise.
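To make the np.issubdtype idea concrete, here is a minimal sketch of what such checks could look like. The helper names `is_integer_like` and `is_float_like` are hypothetical and are not the pandas functions. Note that a check like this only classifies the value; it does not by itself decide which width the resulting dtype gets.

```python
import numpy as np


def is_integer_like(val):
    """True for Python ints and any NumPy integer scalar (hypothetical helper)."""
    return bool(np.issubdtype(np.asarray(val).dtype, np.integer))


def is_float_like(val):
    """True for Python floats and any NumPy floating scalar (hypothetical helper)."""
    return bool(np.issubdtype(np.asarray(val).dtype, np.floating))


for value in (7, np.int8(7), 7.0, np.float32(7.0)):
    print(type(value).__name__, is_integer_like(value), is_float_like(value))
```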

Contributor

jreback commented Nov 19, 2015

This is actually tricky. Do lib.isscalar on the val first; if it's a scalar, use the defaults, else you should 'assume' it's a zero-dim scalar and use val.dtype.

alternatively you can do

if isinstance(val, ndarray):
    dtype = val.dtype
else:
    dtype = np.int64

etc (this might be better)
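For reference when reading the check above, NumPy distinguishes between zero-dimensional arrays and NumPy scalars. The sketch below uses only public NumPy calls and the variable names are illustrative. Both kinds of value carry a `dtype`, but only the zero-dimensional array is an `ndarray`, so an `isinstance(val, ndarray)` test treats them differently.

```python
import numpy as np

numpy_scalar = np.int8(7)               # instance of np.generic, not ndarray
zero_dim = np.array(7, dtype=np.int8)   # ndarray with shape ()

checks = {
    "scalar is ndarray": isinstance(numpy_scalar, np.ndarray),
    "zero-dim is ndarray": isinstance(zero_dim, np.ndarray),
    "scalar is np.generic": isinstance(numpy_scalar, np.generic),
    "np.isscalar(scalar)": np.isscalar(numpy_scalar),
    "np.isscalar(zero-dim)": np.isscalar(zero_dim),
}
for name, result in checks.items():
    print(name, "->", result)

# Both objects expose the narrow dtype they were created with.
print(numpy_scalar.dtype, zero_dim.dtype)
```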

Contributor

varun-kr commented Nov 19, 2015

We are calling _infer_dtype_from_scalar(val), which is already doing this job. How does it solve the problem? Am I missing something?

Contributor

jreback commented Nov 19, 2015

right......so must be someplace else then (because _infer_dtype_from_scalar is doing the correct job)

Contributor

varun-kr commented Nov 19, 2015

As pointed out earlier, how about modifying _infer_dtype_from_scalar(val) and using np.issubdtype inside the is_integer(val) and is_float(val) functions to support all int and float types? I assume the same problem occurs for float types as well.

Contributor

jreback commented Nov 19, 2015

how would that help?
these are already caught above

Contributor

varun-kr commented Nov 19, 2015

If we modify the integer and float conditions in _infer_dtype_from_scalar(val) like this:

    elif is_integer(val):
        if isinstance(val, int):
            dtype = np.int64
        else:
            dtype = type(val)

    elif is_float(val):
        if isinstance(val, float):
            dtype = np.float64
        else:
            dtype = type(val)

it will resolve the discrepancy without breaking anything. Please advise.

Contributor

jreback commented Nov 19, 2015

ahh, the problem is this:

In [1]: x = np.int8(7)

In [2]: isinstance(x, np.ndarray)
Out[2]: False

In [9]: pd.core.common.is_integer(x)
Out[9]: 1

You can try the change you are suggesting above and see what breaks (and of course add a test for this behavior).

It looks like it should work.
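As a rough illustration of the change being discussed, the sketch below reimplements the idea as a standalone function. It is not the pandas implementation: `infer_dtype_from_scalar_sketch` is a hypothetical name, the `isinstance` checks against `np.integer` and `np.floating` stand in for pandas' `is_integer` and `is_float`, and it is written for Python 3, whereas the thread uses Python 2.7.

```python
import numpy as np


def infer_dtype_from_scalar_sketch(val):
    """Pick a dtype for a scalar, keeping the width of NumPy scalars.

    Hypothetical stand-in for the inference step discussed in this thread.
    """
    # bool is a subclass of int, so it has to be handled first.
    if isinstance(val, (bool, np.bool_)):
        return np.dtype(bool)

    if isinstance(val, (int, np.integer)):
        if isinstance(val, int):
            # Plain Python ints get the default upcast.
            return np.dtype(np.int64)
        # NumPy integer scalars keep their own type, e.g. np.int8.
        return np.dtype(type(val))

    if isinstance(val, (float, np.floating)):
        if isinstance(val, float):
            # Plain Python floats get the default upcast. np.float64
            # subclasses float, so it lands here with the same result.
            return np.dtype(np.float64)
        return np.dtype(type(val))

    return np.dtype(object)


for value in (7, np.int8(7), 1.5, np.float32(1.5)):
    print(repr(value), "->", infer_dtype_from_scalar_sketch(value))
```

A test for the behavior would assert that `np.int8(7)` maps to int8 while a plain `7` still maps to int64, and likewise for `np.float32` versus `float`.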

@jreback jreback modified the milestone: 0.17.1, Next Major Release Nov 19, 2015

@varun-kr varun-kr added a commit to varun-kr/pandas that referenced this issue Nov 19, 2015

@varun-kr varun-kr BUG #11638 return correct dtype for int and float
Added test case TestInferDtype
c61020e

@jreback jreback added a commit to jreback/pandas that referenced this issue Nov 20, 2015

@varun-kr @jreback varun-kr + jreback BUG: #11638 return correct dtype for int and float 24cdf45
Contributor

jreback commented Nov 20, 2015

closed by #11644

jreback closed this Nov 20, 2015

@jreback jreback added a commit to jreback/pandas that referenced this issue Nov 20, 2015

@jreback jreback COMPAT: compat of scalars on all platforms, xref #11638 b52b7fc
Contributor

cpaulik commented Nov 20, 2015

Wow, thank you for the quick fix. I was going to try but got lost in the pandas internals. Maybe next time 😄

@jreback jreback added a commit that referenced this issue Nov 20, 2015

@jreback jreback Merge pull request #11662 from jreback/scalar
COMPAT: compat of scalars on all platforms, xref #11638
a3fd834
Contributor

jreback commented Nov 20, 2015

thanks @varun-kr!

@yarikoptic yarikoptic added a commit to neurodebian/pandas that referenced this issue Dec 3, 2015

@yarikoptic yarikoptic Merge tag 'v0.17.1' into debian
Version 0.17.1

* tag 'v0.17.1': (168 commits)
  add nbviewer link
  Revert "DOC: fix sponsor notice"
  DOC: a few touchups
  DOC: fix sponsor notice
  DOC: warnings and remove HTML
  COMPAT: compat of scalars on all platforms, xref #11638
  DOC: fix build errors/warnings
  DOC: whatsnew edits
  DOC: fix link syntax
  DOC: update release.rst / whatsnew edits
  BUG: fix col iteration in DataFrame.round, #11611
  DOC: Clarify foramtting
  BUG: #11638 return correct dtype for int and float
  BUG: #11637 fix to_csv incorrect output.
  DOC: sponsor notice
  BUG: indexing with a range , #11652
  Fix link to numexpr
  ENH: fixup tilde expansion, xref #11438
  ENH: tilde expansion for write output formatting functions, #11438
  DOC: fix up doc-string creations in generic.py
  ...
9b2e35f