- Author: Ben Du
- Date: 2020-11-26 14:24:00
- Title: DataFrame Implementations in Python
- Slug: scaling-pandas
- Category: Computer Science
- Tags: Computer Science, programming, Python, DataFrame, pandas, PySpark, Vaex, Modin, Dask, RAPIDS, cudf, cylon, big data
- Modified: 2021-04-26 14:24:00


 ** Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement! **  

## Tips and Traps

There are multiple ways to handle big data in Python,
among which 
[vaex](https://github.com/vaexio/vaex)
and PySpark are the most popular ones.
Dask is NOT a good option compared to vaex and PySpark.

1. If you have relatively large memory, 
    say more than 20G, 
    on your (single) machine, 
    you can handle (filter, sort, merge, etc.) 
    millions (on the order of 1E6) of rows in a pandas DataFrame without trouble. 
    When you have more than 10 million rows 
    or the memory on your (single) machine is limited,
    you should consider using big data tools such as 
    [vaex](https://github.com/vaexio/vaex)
    and PySpark.

2. Do NOT use the Jupyter/Lab plugin `jupyterlab-lsp` 
    if you work on big data in Jupyter/Lab.
    The plugin `jupyterlab-lsp` has issues with large DataFrames 
    (both with pandas and PySpark DataFrames)
    and can easily crash your Jupyter/Lab server 
    even if you have enough memory.
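
To make the rule of thumb in the first tip concrete,
the sketch below measures how much memory a pandas DataFrame actually occupies
and compares it with a memory budget.
The helper names `memory_gib` and `fits_in_memory`,
the 20 GiB default budget,
and the 3x headroom factor (a guess at the temporary copies that sorting and merging create)
are illustrative assumptions, not measured thresholds.

```python
import numpy as np
import pandas as pd


def memory_gib(df: pd.DataFrame) -> float:
    """Return the in-memory size of a pandas DataFrame in GiB."""
    # deep=True also counts the Python objects behind object (e.g. string) columns.
    return float(df.memory_usage(deep=True).sum()) / 1024**3


def fits_in_memory(df: pd.DataFrame, budget_gib: float = 20.0, headroom: float = 3.0) -> bool:
    """Rough check: does the DataFrame, times a headroom factor, fit in the budget?

    Both defaults are assumptions chosen for illustration.
    """
    return memory_gib(df) * headroom <= budget_gib


# A toy DataFrame with one million rows and two 8-byte numeric columns.
n_rows = 1_000_000
df = pd.DataFrame(
    {
        "id": np.arange(n_rows, dtype="int64"),
        "value": np.random.default_rng(0).random(n_rows),
    }
)
size = memory_gib(df)
print(f"{n_rows:,} rows take about {size:.3f} GiB")
print("fits in a 20 GiB budget:", fits_in_memory(df))
```

Two 8-byte numeric columns cost roughly 16 bytes per row,
so millions of such rows are small.
Wide tables and string columns grow much faster,
which is why it is worth measuring instead of guessing.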


[Benchmarking Python Distributed AI Backends with Wordbatch](https://towardsdatascience.com/benchmarking-python-distributed-ai-backends-with-wordbatch-9872457b785c)
has a detailed comparison among Dask, Ray and PySpark.
Dask does not fare well in it. 
Both Ray and PySpark scale well, 
with Ray having a slight performance advantage over PySpark.
Also, Ray is easier to configure than Spark.
Notice that [modin](https://github.com/modin-project/modin)
is a project aiming at scaling pandas workflows by changing one line of code
and it can run on Ray.
It will probably provide better performance than Dask if you work with data frames.



## [pandas DataFrame](https://github.com/pandas-dev/pandas)

pandas DataFrame is the most popular DataFrame implementation in Python.

## [vaex](https://github.com/vaexio/vaex)

**Vaex is currently the best alternative DataFrame implementation to pandas DataFrame.**


Vaex is a high performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), 
to visualize and explore big tabular datasets. 
It calculates statistics such as mean, sum, count, standard deviation, etc., 
on an N-dimensional grid for more than a billion (10^9) samples/rows per second. 
Visualization is done using histograms, density plots and 3d volume rendering, 
allowing interactive exploration of big data. 
Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted).



## [cylon](https://github.com/cylondata/cylon)

Cylon uses similar technologies (C++, Apache Arrow, etc.) to vaex.
However,
it doesn't seem to be as mature as vaex. 
A few advantages of Cylon compared to vaex are:

- Cylon supports multiple language APIs (C++, Python, Java, etc.)
- Cylon is distributed while vaex is single-machine only
 
Cylon is a fast, scalable distributed memory data parallel library for processing structured data. 
Cylon implements a set of relational operators to process data. 
While "Core Cylon" is implemented using system level C/C++, 
multiple language interfaces (Python and Java) are provided to seamlessly integrate with existing applications, 
enabling both data and AI/ML engineers to invoke data processing operators in a familiar programming language. 
By default it works with MPI for distributing the applications.
Internally Cylon uses Apache Arrow to represent the data in a column format.

## [polars](https://github.com/pola-rs/polars)

Polars is a blazingly fast DataFrame library implemented in Rust using Apache Arrow as its memory model.

## PySpark DataFrame

PySpark DataFrame is another good option (besides vaex) if you have to work on relatively large data on a single machine,
especially if you have some Spark knowledge. 

## [cudf](https://github.com/rapidsai/cudf)

cuDF (developed by RAPIDS) is a GPU DataFrame library built on the Apache Arrow columnar memory format. 
It supports loading, joining, aggregating, filtering, and otherwise manipulating data.
cuDF provides a pandas-like API that will be familiar to data engineers & data scientists, 
so they can use it to easily accelerate their workflows without going into the details of CUDA programming.

## [modin](https://github.com/modin-project/modin)

Modin is a drop-in replacement for pandas, powered by either Ray or Dask.
After installing Modin and one of these backends,
you might see a significant benefit from changing just a single line
(`import pandas as pd` to `import modin.pandas as pd`).
Unlike the other tools, Modin aims for full compatibility with pandas.
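
The one-line change can be sketched as below.
The `try`/`except` fallback to plain pandas is my own addition
so that the snippet still runs on a machine without Modin installed;
it is not something Modin requires.
The toy column names are made up for illustration.

```python
try:
    # The one-line swap: Modin exposes the pandas API under modin.pandas.
    import modin.pandas as pd
except ImportError:
    # Fallback (an assumption for portability): use plain pandas instead.
    import pandas as pd

# Everything below is ordinary pandas code and does not change with the import.
df = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
totals = df.groupby("group")["value"].sum()
print(totals.to_dict())
```

Because only the import changes,
it is cheap to try Modin on an existing pandas workflow and switch back if it does not help.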



## [dask.DataFrame](https://github.com/dask/dask)

`dask.dataframe` is not as good as the other DataFrame implementations presented here. 
I'd suggest you try other alternatives (vaex or PySpark DataFrame).

Dask is a low-level scheduler and a high-level partial Pandas replacement, 
geared toward running code on compute clusters.
Dask provides `dask.dataframe`,
a higher-level, Pandas-like library that can help you deal with out-of-core datasets.
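
To make "out-of-core" concrete,
here is a minimal sketch that uses only pandas (NOT the Dask API).
It streams a CSV in chunks and combines per-chunk partial sums,
which is roughly the partition-then-combine pattern that `dask.dataframe` automates across many pandas DataFrames.
The function name `chunked_group_sum`, the column names and the chunk size are hypothetical.

```python
import io

import pandas as pd


def chunked_group_sum(csv_source, key, value, chunksize):
    """Sum `value` per `key` while holding only one chunk in memory at a time."""
    partials = []
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        partials.append(chunk.groupby(key)[value].sum())
    # Combine the per-chunk partial sums into the final result.
    return pd.concat(partials).groupby(level=0).sum()


# A tiny in-memory CSV stands in for a file that is too big to load at once.
csv_text = "key,value\na,1\nb,2\na,3\nb,4\na,5\n"
totals = chunked_group_sum(io.StringIO(csv_text), "key", "value", chunksize=2)
print(totals.to_dict())
```

This only works directly for aggregations that can be merged from partial results (sum, count, min, max);
others, such as an exact median, need more care.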


## TODO

1. compare PySpark DataFrame vs Vaex on a single machine ...

## References

- [Beyond Pandas: Spark, Dask, Vaex and other big data technologies battling head to head](https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13)

- [7 reasons why I love Vaex for data science](https://towardsdatascience.com/7-reasons-why-i-love-vaex-for-data-science-99008bc8044b)

- [Vaex: Out of Core Dataframes for Python and Fast Visualization](https://towardsdatascience.com/vaex-out-of-core-dataframes-for-python-and-fast-visualization-12c102db044a)

- [ML impossible: Train 1 billion samples in 5 minutes on your laptop using Vaex and Scikit-Learn](https://towardsdatascience.com/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your-9e2968e6f385)

- [Scaling Pandas: Comparing Dask, Ray, Modin, Vaex, and RAPIDS](https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray)

- [RIP Pandas 2.0: Time For DASK After VAEX !!!](https://towardsdatascience.com/dask-vs-vaex-for-big-data-38cb66728747)

- [High performance Computing in Python](http://www.legendu.net/misc/blog/high-performance-computing-in-python)

- [Hands on the Python Module dask](http://www.legendu.net/misc/blog/hands-on-the-python-module-dask/)

- http://www.legendu.net/misc/blog/tips-on-pyspark/

- http://www.legendu.net/misc/blog/pyspark-optimus-data-profiling/

- https://www.dataquest.io/blog/pandas-big-data/
