CLN: revisit & simplify core data structures #6744

Closed
immerrr opened this issue Mar 30, 2014 · 4 comments
Labels
Internals Related to non-user accessible pandas implementation

Comments

@immerrr
Contributor

immerrr commented Mar 30, 2014

This is an attempt to simplify/streamline the internal API that has been brewing inside my head for quite a while. It does mean a significant overhaul and may take time, but it may prove worthwhile. I'm putting it here for discussion ahead of time to make sure the effort isn't wasted by going in the wrong direction.

Idea

The idea is simple: an "augmented take" operation (with -1 taking from nowhere and creating a new column) is enough to express any reindexing/merging/joining that may happen at the Index level. So, the lower levels of the API that do the heavy lifting may be relieved from the burden of operating on labels and keeping them in sync. This will make them more self-contained, with the following benefits:

  • simplify implementation
  • regularize data-handling operations, making them more dependable: no more "oh no, I have a duplicate/timestamp/period/multiindex/etc. label in the index, all slicing operations are now 10x slower". This has happened to me more times than I'm proud of.
  • simplify contributions from new developers, no more tracking down all the code paths up to the public API to fix one small check.
  • make code more test- and benchmark-friendly, no more exponential growth in number of tests/benchmarks for each new feature
  • declaring an API will simplify developing and maintaining "unconventional" storages (sparse, categorical, compressed, etc.)

There's a three-year-old ticket that mentions a similar (if not the same) idea. As mentioned there, this may break pickles and other deserialization, and thus it will require a separate legacy deserialization compatibility layer.

Another ticket mentions moving Block & BlockManager internals to the Cython level; dropping the axes dependency will definitely facilitate that.

Goals

The end goal is to have internals layered as follows:

  • Block: a proper homogeneous ndarray
    • think numpy.ndarray with all necessary fixes/workarounds
    • datatype inference
    • support for custom pandas datatypes
    • typical (slice, concatenate) and pandas-specific (take-with-insert) operations
  • RemappableBlock: homogeneous ndarray that supports "remapping" one of its axes
    • think Block + ref_locs
    • ref_locs should be Int64Index (platform-int-index?, also RangeIndex will help)
    • Block instances may be shared between RemappableBlocks
    • optimizations for no-remapping mode of work (think SingleBlockManager)
    • FIXME: better name?
  • BlockManager: a proper heterogeneous ndarray
    • external interface is similar to that of Block
    • can share RemappableBlocks
  • NDFrame: labeled heterogeneous ndarray
    • more or less equivalent to current NDFrames
    • merging/joining/reindexing only appears at this level
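
The layering above can be sketched roughly as follows. This is only an illustration of the proposal, not existing pandas code: the class names come from the list, but every attribute and method (`values`, `ref_locs` as a plain integer array, `get`) is an assumption, and the NDFrame layer with its labels is left out.

```python
import numpy as np

class Block:
    """Hypothetical: a homogeneous 2-D array, one row per item."""
    def __init__(self, values):
        self.values = np.asarray(values)

class RemappableBlock:
    """Hypothetical: Block + ref_locs, where ref_locs[i] is the position
    in the parent container of the block's i-th item."""
    def __init__(self, block, ref_locs):
        self.block = block
        self.ref_locs = np.asarray(ref_locs, dtype=np.intp)
        if len(self.ref_locs) != len(block.values):
            raise ValueError("ref_locs must have one entry per block item")

class BlockManager:
    """Hypothetical: a heterogeneous container addressed by integer
    position only; it holds no labels."""
    def __init__(self, rblocks):
        self.rblocks = list(rblocks)

    def get(self, loc):
        # Linear scan for clarity; a real implementation would cache a lookup.
        for rb in self.rblocks:
            hits = np.flatnonzero(rb.ref_locs == loc)
            if len(hits):
                return rb.block.values[hits[0]]
        raise IndexError(loc)

floats = Block([[1.0, 2.0], [3.0, 4.0]])
ints = Block([[10, 20]])

# Items 0 and 2 are float, item 1 is int.
mgr = BlockManager([RemappableBlock(floats, [0, 2]),
                    RemappableBlock(ints, [1])])

# The same Block shared under a different mapping, without copying data.
swapped = BlockManager([RemappableBlock(floats, [2, 0]),
                        RemappableBlock(ints, [1])])

print(mgr.get(2), swapped.get(2))
```

In this sketch, reordering items only rewrites `ref_locs`; the underlying `Block` data is untouched and can be shared, which is the point of separating the two layers.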

Deliverables

  • Stage 1 DONE
    • make ref_locs primary source of information (leaving items/ref_items in place to back it up and avoid breakage)
    • port merging/joining internals to loc-based implementation (there's quite a number of hacks a.t.m. that make this non-trivial)
    • drop Block items & ref_items fields (ensure io backward compatibility!)
    • fix performance issues & integrate with mainline
  • Stage 2: TBA
@jtratner
Contributor

interesting idea, seems like a nice and clean separation

@immerrr
Contributor Author

immerrr commented Mar 31, 2014

A point to consider, which was brought up in the patch discussion: don't forget to put non-essential parts of the API into separate functions to keep the basic API to a bare minimum.

@jreback jreback added this to the 0.15.0 milestone Apr 25, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jbrockmendel jbrockmendel removed the Indexing Related to indexing on series/frames, not to indexes themselves label Apr 14, 2020
@jbrockmendel
Member

Closeable? It looks like this has gone as far as it's gonna go

@mroeschke
Member

Sure. We can revisit this if it becomes active again
