CLN: revisit & simplify core data structures #6744

immerrr · 2014-03-30T17:52:30Z

This is an attempt to simplify/streamline the internal API that has been brewing in my head for quite a while. It does mean a significant overhaul and may take time, but it may prove worthwhile. I'm putting it here for discussion ahead of time to make sure the effort isn't wasted by going in the wrong direction.

Idea

The idea is simple: "augmented take" operation — with -1 taking from nowhere and creating a new column — is enough to express any reindexing/merging/joining that may happen at Index level. So, lower levels of API that do the heavy lifting may be relieved from the burden of operating on labels and keeping them in sync. This will make them more self-contained with the following benefits:

- simplify the implementation
- regularize data-handling operations, making them more dependable: no more "oh no, I have a duplicate/timestamp/period/multiindex/etc. label in the index, all slicing operations are now 10x slower" (this has happened to me more times than I'm proud of)
- simplify contributions from new developers: no more tracking down all the code paths up to the public API to fix one small check
- make the code more test- and benchmark-friendly: no more exponential growth in the number of tests/benchmarks for each new feature
- declaring an API will simplify developing and maintaining "unconventional" storages (sparse, categorical, compressed, etc.)
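To make the "augmented take" idea concrete, here is a minimal pure-Python sketch. It is not pandas code: `take_with_insert` and `get_indexer` are hypothetical helpers written for illustration, columns are modelled as plain lists, and labels are assumed to be unique. The point is only that the label lookup happens once, up front, and the data layer sees nothing but integer positions.

```python
from typing import Any, Hashable, List, Sequence


def get_indexer(old_labels: Sequence[Hashable],
                new_labels: Sequence[Hashable]) -> List[int]:
    """Index-level step: translate labels to positions, -1 for missing labels.

    Assumes old_labels are unique (illustration only).
    """
    positions = {label: i for i, label in enumerate(old_labels)}
    return [positions.get(label, -1) for label in new_labels]


def take_with_insert(columns: Sequence[Sequence[Any]],
                     indexer: Sequence[int],
                     length: int,
                     fill_value: Any = None) -> List[List[Any]]:
    """Data-level step: a take where -1 "takes from nowhere".

    A -1 entry creates a new column of `length` fill values; any other
    entry copies the column at that position. No labels are involved.
    """
    out = []
    for loc in indexer:
        if loc == -1:
            out.append([fill_value] * length)
        else:
            out.append(list(columns[loc]))
    return out


# Reindexing columns ["a", "b", "c"] to ["c", "x", "a"].
old_labels = ["a", "b", "c"]
columns = [[1, 2], [3, 4], [5, 6]]
new_labels = ["c", "x", "a"]

indexer = get_indexer(old_labels, new_labels)
reindexed = take_with_insert(columns, indexer, length=2)
```

Here `indexer` is `[2, -1, 0]`, so the result holds a copy of column "c", a freshly created empty column for "x", and a copy of column "a".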

There's a three-year-old ticket that mentions a similar (if not the same) idea. As mentioned there, this may break pickles and other deserialization, and thus it will require a separate legacy deserialization compatibility layer.

Another ticket mentions moving Block & BlockManager internals to the Cython level; dropping the axes dependency will definitely facilitate that.

Goals

The end goal is to have internals layered as follows:

Block: a proper homogeneous ndarray
- think numpy.ndarray with all necessary fixes/workarounds
- datatype inference
- support for custom pandas datatypes
- typical (slice, concatenate) and pandas-specific (take-with-insert) operations
RemappableBlock: homogeneous ndarray that supports "remapping" one of its axes
- think Block + ref_locs
- ref_locs should be Int64Index (platform-int-index?, also RangeIndex will help)
- Block instances may be shared between RemappableBlocks
- optimizations for no-remapping mode of work (think SingleBlockManager)
- FIXME: better name?
BlockManager: a proper heterogeneous ndarray
- external interface is similar to that of Block
- can share RemappableBlocks
NDFrame: labeled heterogeneous ndarray
- more or less equivalent to current NDFrames
- merging/joining/reindexing only appears at this level
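The two lowest layers could be sketched roughly as follows. This is a toy model under stated assumptions, not the proposed implementation: storage is a list of rows rather than an ndarray, `ref_locs[i]` is assumed to mean "the position in the owning manager of the block's i-th row", and the `get` method is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class Block:
    """Homogeneous storage with no labels: one inner list per row."""
    rows: List[List[Any]]

    def take(self, indexer: Sequence[int], fill_value: Any = None) -> "Block":
        """Take-with-insert: -1 creates a new row of fill values."""
        width = len(self.rows[0]) if self.rows else 0
        return Block([
            [fill_value] * width if loc == -1 else list(self.rows[loc])
            for loc in indexer
        ])


@dataclass
class RemappableBlock:
    """A Block plus ref_locs, remapping block rows to manager positions."""
    block: Block
    ref_locs: List[int]  # assumed: ref_locs[i] is the manager position of block row i

    def get(self, manager_loc: int) -> List[Any]:
        """Look up a row by its position in the manager (hypothetical helper)."""
        return self.block.rows[self.ref_locs.index(manager_loc)]


# One Block shared by two RemappableBlocks with different remappings.
shared = Block([[1.0, 2.0], [3.0, 4.0]])
first = RemappableBlock(shared, ref_locs=[0, 2])
second = RemappableBlock(shared, ref_locs=[1, 0])
```

Both wrappers point at the same underlying `Block`; only the integer remapping differs, which is what makes sharing cheap.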

Deliverables

Stage 1 DONE
- make ref_locs primary source of information (leaving items/ref_items in place to back it up and avoid breakage)
- port merging/joining internals to loc-based implementation (there's quite a number of hacks a.t.m. that make this non-trivial)
- drop Block items & ref_items fields (ensure io backward compatibility!)
- fix performance issues & integrate with mainline
Stage 2: TBA
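As an illustration of what a loc-based join could look like, the sketch below resolves an outer join of two label lists into one pair of integer indexers, after which each side is aligned with a plain take. The function names are hypothetical, labels are assumed unique, and values are modelled as flat lists; the actual pandas internals are considerably more involved.

```python
from typing import Any, Hashable, List, Sequence, Tuple


def outer_join_indexers(
    left_labels: Sequence[Hashable],
    right_labels: Sequence[Hashable],
) -> Tuple[List[Hashable], List[int], List[int]]:
    """Label-level step: compute joined labels and one indexer per side.

    Assumes labels are unique on each side (illustration only).
    """
    left_pos = {label: i for i, label in enumerate(left_labels)}
    right_pos = {label: i for i, label in enumerate(right_labels)}
    # Left labels first, then right-only labels in their original order.
    joined = list(left_labels) + [l for l in right_labels if l not in left_pos]
    left_indexer = [left_pos.get(label, -1) for label in joined]
    right_indexer = [right_pos.get(label, -1) for label in joined]
    return joined, left_indexer, right_indexer


def take_with_fill(values: Sequence[Any], indexer: Sequence[int],
                   fill_value: Any = None) -> List[Any]:
    """Data-level step: positions only, -1 means "insert a fill value"."""
    return [fill_value if loc == -1 else values[loc] for loc in indexer]


joined, left_indexer, right_indexer = outer_join_indexers(["a", "b"], ["b", "c"])
left_aligned = take_with_fill([10, 20], left_indexer)
right_aligned = take_with_fill([1, 2], right_indexer)
```

Once the indexers exist, nothing below the label level needs to know that a join took place.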


jtratner · 2014-03-30T22:55:39Z

interesting idea, seems like a nice and clean separation

immerrr · 2014-03-31T13:35:14Z

A point to consider, which was brought up in the patch discussion: don't forget to move non-essential parts of the API into separate functions to keep the basic API to a bare minimum.

jbrockmendel · 2020-04-15T02:47:08Z

Closeable? It looks like this has gone as far as it's gonna go

mroeschke · 2020-04-15T04:49:18Z

Sure. We can revisit this if it becomes active again

This was referenced Mar 30, 2014

Wrong repo... immerrr/pandas#1

Closed

CLN: revisit & simplify Block/BlockManager, remove axes #6745

Merged

jreback added Indexing labels Mar 30, 2014

jreback added this to the 0.15.0 milestone Apr 25, 2014

immerrr mentioned this issue Jul 2, 2014

Index API proposal: unified axis label lookup #7651

Closed

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jbrockmendel removed the Indexing label (Related to indexing on series/frames, not to indexes themselves) Apr 14, 2020

mroeschke closed this as completed Apr 15, 2020
