pandas converts int32 to int64 #622

Closed
gdementen opened this Issue Jan 13, 2012 · 14 comments

7 participants
Contributor

gdementen commented Jan 13, 2012

Is this intended? I had hoped no copying at all would happen in that case.

In [65]: a = np.random.randint(10, size=1e6)

In [66]: a.dtype
Out[66]: dtype('int32')

In [67]: b = np.random.randint(2, size=1e6)

In [68]: df = pandas.DataFrame({'a': a, 'b': b})

In [69]: df.dtypes
Out[69]:
a int64
b int64
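
For comparison, here is a minimal sketch of the same check, assuming a pandas release that includes the fix referenced at the end of this thread (older releases print int64, as above). The arrays are cast to int32 explicitly because the default integer dtype of `np.random.randint` depends on the platform and NumPy version, and the size is written as an integer because newer NumPy releases reject a float such as `1e6`.

```python
import numpy as np
import pandas as pd

n = 10**6  # integer size (assumption: same length as in the report above)

# Force int32 explicitly so the example does not depend on the platform default.
a = np.random.randint(10, size=n).astype(np.int32)
b = np.random.randint(2, size=n).astype(np.int32)

df = pd.DataFrame({'a': a, 'b': b})

# Inspect whether the constructor kept the 32-bit integers.
print(df.dtypes)
```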

Contributor

adamklein commented Jan 13, 2012

Yes, pandas has only four dtypes right now: int64, float64, bool, and object. This is in the interest of making it user-friendly, but at the expense of memory conservation obviously. In the future it might make sense to add more as long as it doesn't complicate the user-facing API.

Contributor

jseabold commented Oct 8, 2012

Just got bit by this, upcasting from float32, int8 and int16.

Member

cpcloud commented Dec 11, 2012

I actually like the fact that the dtypes are simpler when using pandas. Also, if you don't use a dict, then the dtype is preserved.

In practice is this a big deal? Maybe I'm a bit green, but I've never run into a situation using pandas where it really mattered whether I used int32 vs int64.

It matters for things like reading raw bytes from binary files, but if you're creating arrays large enough that the distinction between 32 and 64-bit width numbers matters, you'd be better off just getting more RAM.

For example, even if you had 4GB of RAM on your machine and a 2GB array of 32-bit integers, you would still need another 2GB to do any non-destructive arithmetic on that array, thus maxing out your system's RAM.
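
The arithmetic behind that estimate can be checked on a small scale. This is only an illustrative sketch with an arbitrary array length of one million elements: it compares the footprint of int32 and int64 storage and shows that a non-destructive operation allocates a second array the same size as its input.

```python
import numpy as np

n = 10**6  # arbitrary length, chosen only for illustration

a32 = np.zeros(n, dtype=np.int32)
a64 = a32.astype(np.int64)

print(a32.nbytes)  # 4 bytes per element
print(a64.nbytes)  # 8 bytes per element

# A non-destructive operation returns a new array, so input and result
# are both alive at the same time.
result = a32 + 1
bytes_in_use = a32.nbytes + result.nbytes
print(bytes_in_use)
```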

Point is, doesn't seem like this is a bug. Just my two cents.

Owner

wesm commented Dec 11, 2012

I agree that the simplicity is good: you don't have to write down the dtype of a DataFrame like you do with a structured array. I think the design should be: have simple defaults, but when a data type is already set (e.g. int32), it's OK to "live and let live".

adamsd5 commented Jan 17, 2013

I am new to Pandas, but would like to put in my vote for supporting all ndarray types. From my testing, Series already will support other types, but DataFrame will not. I have two arguments... memory and speed. cpcloud suggested that you can always buy more memory, which is a reasonable suggestion. However, systems do have memory limits, and there are computation tasks that will use all of it (yes, even 256GB or more). Being able to fit twice as many samples on the system, regardless of how much memory you have, is a good thing.

On the speed front, I want to load binary files quickly into memory and process them with pandas. I wrote a C++ module for this purpose. I don't want to copy the memory after reading it from disk. For the processing we are doing, this would double the number of memory operations, which slows down the processing by almost 1/2. Unfortunately, after reading the binary into memory, I need to iterate over it and copy the int32 array into an int64 array. It is even worse than just a large memory copy because it also must up-cast each value to int64.

I like wesm's suggestion.

Member

cpcloud commented Jan 17, 2013

@adamsd5 You might try numpy's memmap ndarray subclass. It allows you to treat a file like an in-memory array. Of course, if your file is not just an array then this might be tricky. You could then pass the memmap to the pandas dataframe constructor and the dtype should be preserved. I agree with you that in the long run dtype preservation is desirable. Just out of curiosity, what kind of data are you working with?
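
A rough sketch of the memmap idea, assuming the file is nothing but a flat run of int32 values. The temporary file and the column name `a` are made up for the example. Whether the resulting dtype is kept depends on the pandas version (see the rest of this thread), and the sketch makes no claim about whether pandas copies the data.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Write a small raw int32 file to stand in for the real binary data.
path = os.path.join(tempfile.mkdtemp(), 'values.bin')
np.arange(1000, dtype=np.int32).tofile(path)

# Map the file read-only; with no shape given, NumPy infers a 1-D array
# from the file size and the dtype.
mm = np.memmap(path, dtype=np.int32, mode='r')

df = pd.DataFrame({'a': mm})
print(df['a'].dtype)
```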

Contributor

jreback commented Jan 17, 2013

@adamsd5 sounds like what you really want is out-of-core computation (similar to what @cpcloud suggested).
that is, your data is represented on disk, then slices are put in memory as needed and computed. I know @wesm has this as a goal as well. this will allow you to not even worry about the memory issue at all

HDFStore supports this now, though in a somewhat non-transparent manner.

Here's what you could do

  1. store your data on-disk using HDFStore in a table format (could be a series of append operations from say csv files, or wherever you have now)
  2. iterate over either a) a series of queries, or b) the indices of the 'mapped frame'
  3. compute and repeat

so imagine this pseudo code (this is the 2 b) part):

from pandas import HDFStore

store = HDFStore('a_big_file.h5')

# pretend we have a store of the frame as a table 'df';
# process_chunk is a placeholder for your own function
nrows = store.get_storer('df').nrows
chunk_size = 100000

# step through the table one chunk of rows at a time
for start_i in range(0, nrows, chunk_size):
    stop_i = min(start_i + chunk_size, nrows)

    data_for_this_chunk = store.select('df', start=start_i, stop=stop_i)
    store.append('df_result', process_chunk(data_for_this_chunk))

store.close()

this would essentially give you a transformation operation, similar to process_chunk(df),
but processed in chunks. it should be fairly insensitive to memory and speed constraints, and could easily be parallelized.
reduction operations are even simpler (as they can be accumulated in memory)

not that hard to create a wrapper around this....
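
As an illustration of the reduction case, here is a small sketch of such a wrapper. The `chunked_sum` helper and its `select(start, stop)` callable are hypothetical names, not part of pandas. With an HDFStore, `select` could wrap `store.select('df', start=start, stop=stop)`; the demo below uses an in-memory frame so it runs without an HDF5 file.

```python
import numpy as np
import pandas as pd


def chunked_sum(select, nrows, chunk_size):
    """Sum every column by accumulating per-chunk sums.

    `select(start, stop)` is a hypothetical callable that returns
    rows [start, stop) as a DataFrame.
    """
    total = None
    for start in range(0, nrows, chunk_size):
        stop = min(start + chunk_size, nrows)
        partial = select(start, stop).sum()
        # Only the running total stays in memory between chunks.
        total = partial if total is None else total + partial
    return total


# In-memory stand-in for the on-disk table.
frame = pd.DataFrame({'x': np.arange(10), 'y': np.ones(10)})
result = chunked_sum(lambda start, stop: frame.iloc[start:stop],
                     len(frame), chunk_size=3)
print(result)
```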

adamsd5 commented Jan 18, 2013

cpcloud, does pandas.DataFrame treat such memmap ndarrays differently? You've presented a technique that I might use, but I think the DataFrame will still convert all int32 into int64.

I'm not actually trying to process things out of memory. I'm happy loading the entire Data Frame into memory. However, I would like to minimize the memory operations. Once the bytes are loaded from disk (and alas, I have no control over the format they are written), I do not want to copy them around at all (and I don't want pandas to make a copy for me either). From what I can tell, Pandas will always up-convert Int32 to Int64, which is a slow operation.

Member

cpcloud commented Jan 18, 2013

@adamsd5 A cursory glance at frame.py suggests that the dtype is preserved with instances of ndarray (isinstance tests for subclasses) that are not record arrays. You can also pass the dtype in the constructor.
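
A minimal sketch of passing the dtype to the constructor, assuming a recent pandas release. On the versions discussed in this thread the integer result may differ, as the next comment explains.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)

# Request 32-bit integers explicitly instead of relying on the default.
df = pd.DataFrame(values, columns=['a', 'b'], dtype=np.int32)
print(df.dtypes)
```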

Member

cpcloud commented Jan 20, 2013

@adamsd5 I was wrong. It seems that floating point types are preserved in the DataFrame constructor, but integer types are not. E.g.,

[image: df-dtypes]

The issue still stands. I poked around in pandas/core/internals.py and saw that the function make_block converts any integer subtypes to int64, but preserves other types. Is there any reason to suspect that getting rid of the call to values.astype('i8') would break anything? Either way I'll try it and report back.
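
To show what a call like `values.astype('i8')` costs, here is a plain NumPy sketch. It is not the pandas internals, only the array operation named above: the result is a new int64 array that shares no memory with the int32 input and takes twice the space.

```python
import numpy as np

values = np.arange(5, dtype=np.int32)
upcast = values.astype('i8')  # 'i8' is NumPy shorthand for int64

print(upcast.dtype)
# The upcast array is a fresh allocation, not a view of the input.
print(np.shares_memory(values, upcast))
print(values.nbytes, upcast.nbytes)
```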

Contributor

jreback commented Jan 20, 2013

@cpcloud see PR #2705. This is a bit more complicated than it first appears; this change will appear in 0.10.2. The existing implementation will upcast for most operations, e.g. even though you can create a float32 (or int32) frame, most operations will not preserve it.

Contributor

jreback commented Jan 22, 2013

what dtypes should pandas fully support - this means all types of pad,fill,take,diff operations - there are specific cython functions created for each of the dtypes - the following are currently supported
float64,int64,int32,datetime64[ns],bool,object

float32 should be added clearly
what about float16,int16,int8,uint64,uint32,uint16,uint8?

you can always store these other dtypes, but certain operations will raise (or can auto upcast them)
eg say we don't support int16, can upcast to int32 and perform the ops
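
A sketch of that upcast-then-operate idea in plain NumPy. The `diff_via_int32` helper is hypothetical and only stands in for a routine that has an int32 code path but no int16 one; it is not how pandas implements this.

```python
import numpy as np


def diff_via_int32(values):
    """First difference of an int16 array, computed in int32.

    Hypothetical helper: widens the input first, then runs the op.
    """
    widened = values.astype(np.int32)
    return np.diff(widened)


small = np.array([30000, -30000, 100], dtype=np.int16)
result = diff_via_int32(small)
print(result.dtype, result)
```

The first difference here is -60000, which does not fit in int16, so the wider dtype also avoids overflow in the result.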

downside of adding more fully supported dtypes is additional compile time on installation and testing
and after a certain point prob should move to code generation (rather than copy/paste of the functions)
comments?

adamsd5 commented Jan 22, 2013

For my purposes, int32 and float32 would suffice. I see value in the smaller types for some people. If operations mean an upcast during the operation, the value is diminished. A use case would be a huge time series DataFrame on disk that has many int8 columns (perhaps factors), and the user wants to load, then filter based on time stamp and save a sub range of time. None of the int8 columns should be up converted. Just my ideas, hope it is helpful.
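
A small sketch of that use case, assuming a pandas release with the dtype support discussed in this thread. The column name `factor` and the dates are invented for the example.

```python
import numpy as np
import pandas as pd

index = pd.date_range('2013-01-01', periods=10, freq='D')
df = pd.DataFrame({'factor': np.arange(10, dtype=np.int8)}, index=index)

# Keep only a sub-range of time stamps.
subset = df[(df.index >= '2013-01-03') & (df.index < '2013-01-06')]

# Check that the int8 column was not up-converted by the filter.
print(subset.dtypes)
```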

Darryl


@jreback jreback added a commit to jreback/pandas that referenced this issue Feb 8, 2013

@jreback jreback ENH: allow propagation and coexistence of numeric dtypes (closes GH #622)
     construction of multi numeric dtypes with other types in a dict
     validated get_numeric_data returns correct dtypes
     added blocks attribute (and as_blocks()) method that returns a dict of dtype -> homogeneous Frame to DataFrame
     added keyword 'raise_on_error' to astype, which can be set to false to exclude non-numeric columns
     fixed merging to correctly merge on multiple dtypes with blocks (e.g. float64 and float32 in other merger)
     changed implementation of get_dtype_counts() to use .blocks
     revised DataFrame.convert_objects to use blocks to be more efficient
     added Dtype printing to show on default with a Series
     added convert_dates='coerce' option to convert_objects, to force conversions to datetime64[ns]
     where can upcast integer to float as needed (on inplace ops #2793)
     added fully cythonized support for int8/int16
     no support for float16 (it can exist, but no cython methods for it)

TST: fixed test in test_from_records_sequencelike (dict orders can be different on different arch!)
       NOTE: using tuples will remove dtype info from the input stream (using a record array is ok though!)
     test updates for merging (multi-dtypes)
     added tests for replace (but skipped for now, algos not set for float32/16)
     tests for astype and convert in internals
     fixes for test_excel on 32-bit
     fixed test_resample_median_bug_1688 I believe
     separated out test_from_records_dictlike
     testing of panel constructors (GH #797)
     where ops now have a full test suite
     allow slightly less sensitive decimal tests for less precise dtypes

BUG: fixed GH #2778, fillna on empty frame causes seg fault
     fixed bug in groupby where types were not being casted to original dtype
     respect the dtype of non-natural numeric (Decimal)
     don't upcast ints/bools to floats (if you say were agging on len, you can get an int)
DOC: added astype conversion examples to whatsnew and docs (dsintro)
     updated RELEASE notes
     whatsnew for 0.10.2
     added upcasting gotchas docs

CLN: updated convert_objects to be more consistent across frame/series
     moved most groupby functions out of algos.pyx to generated.pyx
     fully support cython functions for pad/bfill/take/diff/groupby for float32
     moved more block-like conversion loops from frame.py to internals.py (created apply method)
       (e.g. diff,fillna,where,shift,replace,interpolate,combining), to top-level methods in BlockManager
166a80d

@wesm wesm added a commit that referenced this issue Feb 10, 2013

@wesm wesm Merge remote branch 'jreback/dtypes'
* jreback/dtypes:
  ENH: allow propagation and coexistence of numeric dtypes (closes GH #622)
8ad9598

@jreback jreback was assigned Feb 10, 2013

Owner

wesm commented Feb 10, 2013

Boom. resolved by #2708, merged to master today

@wesm wesm closed this Feb 10, 2013
