MemoryError with more than 1E9 rows #8252
Comments
You can try creating the Series separately (one per column) first, then putting them into a dict and creating the frame from that. However, you might be running into a problem finding contiguous memory.
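A minimal sketch of that suggestion, scaled down to 1E6 rows; the column names and dtypes here are illustrative, not taken from the issue:

```python
import numpy as np
import pandas as pd

n = 1_000_000  # scaled down from 1.5E9 for illustration

# Build each column as its own Series first, so each one is a
# separate allocation rather than part of one large request for
# contiguous memory.
columns = {
    "id": pd.Series(np.random.randint(0, 100, n)),
    "value": pd.Series(np.random.randn(n)),
}

# Then assemble the frame from the dict of Series.
df = pd.DataFrame(columns)
print(df.shape)
```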
This was referenced Sep 12, 2014
jreback added the Performance and Reshaping labels — Sep 20, 2014
jreback added this to the 0.15.0 milestone — Sep 20, 2014
Finally had time to look at this. I think there was an extra copy going on in certain cases, so try this out using master (once I merge this change). This seems to scale much better, and works with the following slightly modified code:
jreback closed this in #8331 — Sep 20, 2014
@mattdowle I updated the example to give a pretty simplified version, which gives pretty good memory performance (e.g. it is just a bit over 1X the final data size) by not trying to create everything at once.
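One way to read "not trying to create everything at once" is to allocate the frame column by column. A hypothetical sketch of that pattern (the column names and count are illustrative, not jreback's actual updated example):

```python
import numpy as np
import pandas as pd

n = 1_000_000  # scaled down for illustration

# Start from an empty frame with the final index, then add columns
# one at a time; only one new column's worth of memory is needed at
# each step, keeping peak usage close to 1X the final data size.
df = pd.DataFrame(index=np.arange(n))
for name in ("a", "b", "c"):
    df[name] = np.random.randn(n)
print(df.shape)
```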
mattdowle referenced this issue in Rdatatable/data.table — Sep 21, 2014 (Open)
Rerun pandas 2E9 benchmark from dev #823
mattdowle commented Sep 12, 2014
I have 240GB of RAM. Nothing else running on the machine. I'm trying to create 1.5E9 rows, which I think should create a data frame of around 100GB, but getting this MemoryError. This works fine with 1E9 but not 1.5E9. I could understand a limit at about 2^31 (2E9) or 2^32 (4E9) but all 240GB seems exhausted (according to htop) at somewhere between 1E9 and 1.5E9 rows. Any ideas? Thanks.
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2494.070
BogoMIPS:              5054.21
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

$ free -h
             total       used       free     shared    buffers     cached
Mem:          240G       2.3G       237G       364K        66M       632M
-/+ buffers/cache:       1.6G       238G
Swap:           0B         0B         0B

An earlier question on S.O. is here: http://stackoverflow.com/questions/25631076/is-this-the-fastest-way-to-group-in-pandas
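As a sanity check on the "around 100GB" estimate, a back-of-the-envelope calculation; the column count here is an assumption, since the issue does not state it:

```python
n_rows = 1.5e9
bytes_per_value = 8  # one float64/int64 value
n_cols = 8           # assumed number of columns; not stated above

# Raw data size, ignoring index and block-manager overhead.
total_bytes = n_rows * bytes_per_value * n_cols
print(total_bytes / 2**30)  # GiB, in the right ballpark for ~100GB
```

At this size, any intermediate copy of the data would add another ~90 GiB of peak usage, which is consistent with all 240GB being exhausted well before 1.5E9 rows.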