Experiments to mirror Python `pandas` using Haskell `conduit`.

conduit-pandas

Experiments that try to replicate Python's pandas library using our Glorious Haskell language and conduit's stream processing.

DISCLAIMER: THIS IS NOT A BINDING FOR PYTHON PANDAS!

Motivation

Pandas vs. type-safety

Pandas is widely adopted in the data processing community thanks to its expansive library capabilities and the low barrier to entry that comes with Python's popularity. However, there are limitations:

  • Everything is dynamic, with no type safety, which makes scripts/applications a runtime-exception minefield.
  • Several operations (like joins) mutate record labels and create new ones with a magic suffix (for example, merge appends _x and _y to overlapping column names by default).
  • Several features have to be used the "pandas way" to get high performance (e.g. map, filter).
  • Datasets are loaded into memory, which is expensive for "big data".

Haskell + streaming > Pandas?

No idea. This repo tries to port common use cases that I come across to Haskell land. Potential improvements over Pandas:

  • Static typing can ensure data transformations are safe and errors are handled explicitly (e.g. missing values, parsing failures).
  • Join results can be expressed more richly with tuples or These, so the type says exactly what the operation produced.
  • The core of the functionality and types can be vanilla Haskell and still play nicely with streaming/processing libraries.
  • Streaming also means only a fraction of the dataset is held in memory at any time.
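
To make the points above concrete, here is a minimal sketch in vanilla Haskell (base and containers only). It is not code from this repo: the User and Order records, ParseError, parseUser, outerJoinOn and innerJoinOn are all hypothetical names made up for illustration, and These is a local stand-in with the same shape as the type in the these package, so that the sketch needs no extra dependencies. It works on plain lists; the same pure functions could be called from a streaming pipeline.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set
import Text.Read (readMaybe)

-- Local stand-in with the same shape as These from the "these" package.
data These a b = This a | That b | These a b
  deriving (Eq, Show)

-- Hypothetical records for illustration; a missing name is an explicit Nothing.
data User = User { userId :: Int, userName :: Maybe String }
  deriving (Eq, Show)

data Order = Order { orderUserId :: Int, orderTotal :: Double }
  deriving (Eq, Show)

-- Parsing failures are ordinary values, not runtime exceptions.
data ParseError
  = BadField String String  -- field name, offending text
  | WrongArity Int          -- number of columns actually found
  deriving (Eq, Show)

-- Parse one already-split row of two columns: id, name.
parseUser :: [String] -> Either ParseError User
parseUser [rawId, rawName] =
  case readMaybe rawId of
    Nothing  -> Left (BadField "userId" rawId)
    Just uid -> Right (User uid (if null rawName then Nothing else Just rawName))
parseUser row = Left (WrongArity (length row))

-- Full outer join on a key. Every input row shows up in the result:
-- This = left row without a match, That = right row without a match,
-- These = a matched pair. No labels are renamed along the way.
outerJoinOn :: Ord k => (a -> k) -> (b -> k) -> [a] -> [b] -> [These a b]
outerJoinOn keyL keyR xs ys = concatMap joinLeft xs ++ rightOnly
  where
    -- Index the right side by key, keeping rows in input order.
    rightIndex = Map.fromListWith (flip (++)) [(keyR y, [y]) | y <- ys]
    leftKeys   = Set.fromList (map keyL xs)
    joinLeft x = case Map.lookup (keyL x) rightIndex of
      Nothing      -> [This x]
      Just matches -> [These x y | y <- matches]
    rightOnly  = [That y | y <- ys, Set.notMember (keyR y) leftKeys]

-- An inner join is the matched pairs of the outer join, as plain tuples.
innerJoinOn :: Ord k => (a -> k) -> (b -> k) -> [a] -> [b] -> [(a, b)]
innerJoinOn keyL keyR xs ys = [(x, y) | These x y <- outerJoinOn keyL keyR xs ys]

exampleUsers :: [User]
exampleUsers = [User 1 (Just "ada"), User 2 Nothing]

exampleOrders :: [Order]
exampleOrders = [Order 1 9.5, Order 1 20.0, Order 3 5.0]
```

Loading this in GHCi and evaluating outerJoinOn userId orderUserId exampleUsers exampleOrders is one way to try it. The design choice is that the caller has to pattern match on This, That and These (or on Left and Right for parsing), so unmatched rows and bad input cannot be ignored by accident.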

How to build+run

  • Get stack

  • Compile

    $ stack build --pedantic --ghc-options '-O2'
  • Run

    $ time stack exec conduit-pandas-exe

TODO

  • Explore more use-cases and examples
  • Add benchmarks with pandas and sqlite

Results

Hardware

| stat | value |
|------|-------|
| CPU | 2.3 GHz Intel Core i5 |
| Memory | 16 GB 2133 MHz LPDDR3 |

Findings

  • Inner join for 1000x1000 records runs in 28 seconds, roughly 36K rows/second (1,000,000 / 28).

Related reading