Experiments to (try) replicate/mirror Python's
pandas library using our Glorious Haskell language and
conduit's stream processing.
DISCLAIMER: THIS IS NOT A BINDING FOR PYTHON PANDAS!
Pandas vs. type-safety
Pandas is widely adopted and used in the data processing community due to its expansive library capabilities and low barrier to entry thanks to Python's popularity. However there are limitations:
- Absolutely dynamic without any type safety make scripts/applications a runtime exception minefield
- Several operations (like joins) mutate record labels and create new ones with a magic suffix.
- Several functionalities require doing things the "pandas way" to exploit high performance. (Ex: map, filter)
- Datasets are loaded into memory. This is an expensive endeavor for "big data".
Haskell + streaming > Pandas pandas ?
No idea. This repo intends to try and port common use cases that I come across to haskell land. Improvements over Pandas:
- Static typing can ensure data transformations are safe and errors are explicitly handled (ex: Missing values, parsing failures etc).
- Join results can be expressed more richly with tuples or
Theseto know exactly what the results of the operation were.
- The core of the functionality and types can be vanilla Haskell and still play nicely with streaming/processing libraries.
- Streaming also means only a fraction of the dataset is ever stored in memory at a time.
How to build+run
$ stack build --pedantic --ghc-options '-O2'
$ time stack exec conduit-pandas-exe
- Explore more use-cases and examples
- Add benchmarks with pandas and sqlite
|CPU||2.3 GHz Intel Core i5|
|Memory||16 GB 2133 MHz LPDDR3|
- Inner join for 1000x1000 records runs in 28 seconds, rougly 36K rows/second.