<h2>Pandas vs Polars Speed</h2>

https://pypi.org/project/polars-lts-cpu/

* `Polars` is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model.

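* Run an inline SQL query from the shell with the `polars` command-line tool (provided by the separate polars-cli project, not by the Python package):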
```bash
> polars -c "SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/assets/data/iris.csv') GROUP BY species;"
```

* `Polars` can handle larger-than-RAM data.
* If your data does not fit into memory, Polars' query engine can process the query (or parts of it) in a streaming fashion. This drastically reduces memory requirements, so you might be able to process a 250GB dataset on a laptop. Call `collect(streaming=True)` on a lazy query to run it in streaming mode (recent Polars releases spell this `collect(engine="streaming")`). Streaming might be a little slower, but it is still very fast.

* Benchmark testing vs. pandas, duckDB, PySpark, Dask:
  * https://pola.rs/posts/benchmarks/
---
<h3>Why is Polars better than Pandas?</h3>

Polars is able to work with datasets larger than RAM through a combination of design choices and optimizations that differ significantly from how pandas operates. Here are some key factors:

1. Lazy Evaluation & Query Optimization:
Polars supports lazy evaluation, meaning that it builds an execution plan rather than executing operations immediately. This allows it to optimize the computation (e.g., by fusing operations) and process data in chunks, which is much more memory-efficient.

2. Columnar Memory Layout:
Polars stores data in a columnar format based on the Apache Arrow memory model. This layout allows for better CPU cache utilization and efficient vectorized operations, minimizing memory overhead.

3. Memory Mapping and Out-of-Core Processing:
Polars can use memory mapping to work with data on disk as if it were in memory, which is beneficial for handling large datasets. This approach allows you to work with files larger than the available RAM by loading only necessary parts into memory as needed.

4. Parallel Processing:
Being built in Rust, Polars leverages modern hardware capabilities more effectively, allowing for parallel processing of data. This means that multiple CPU cores can work on different parts of the data concurrently, improving both speed and memory utilization.

5. Efficient Data Structures:
pandas is built on NumPy arrays, but it falls back to Python objects for some data (for example, `object`-dtype columns) and runs some operations, such as `apply()` with a Python function, in interpreted Python code. Polars is written in Rust and is designed to avoid unnecessary memory copies, reducing the overall memory footprint.

In contrast, pandas is primarily designed for in-memory analysis. It loads entire datasets into RAM, which makes it straightforward for small to moderately sized datasets but problematic when the dataset size exceeds the available memory.

Overall, Polars' design is oriented towards modern data processing requirements, including handling larger-than-RAM datasets efficiently through lazy evaluation, memory mapping, and highly optimized, parallelized computations.

---
<h3>Why is Pandas better than Polars?</h3>

* Pandas still has some advantages over Polars:

1. Mature Ecosystem & Extensive Community Support
* Pandas has been around for over a decade and is widely used in data science, finance, and academia.
* There are more tutorials, Stack Overflow answers, and third-party integrations compared to Polars.
* Many data science libraries (e.g., scikit-learn, seaborn, statsmodels) work seamlessly with pandas.

2. More Built-in Functionality

* Pandas has more statistical functions and built-in support for functions like:
`df.corr(), df.cov(), df.rank(), df.interpolate(), df.rolling()`

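* Polars covers several of these as expressions (e.g., `pl.corr`, `rank`, `interpolate`, `rolling_mean`) but has fewer DataFrame-level statistical helpers overall (though you can use NumPy/SciPy with it).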
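
As a quick illustration of the pandas helpers listed above, the following sketch runs a few of them on a toy DataFrame. The data is invented and chosen so that `y` is exactly twice `x` once the gap is filled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "x": [1.0, 2.0, np.nan, 4.0, 5.0],
        "y": [2.0, 4.0, 6.0, 8.0, 10.0],
    }
)

# Linear interpolation fills the missing x value from its neighbours.
filled = df.interpolate()

# Pairwise Pearson correlation between all numeric columns.
corr = filled.corr()

# Rank of each value within the column (1 = smallest).
ranks = filled["x"].rank()

# Mean over a sliding window of 2 rows; the first entry has no full window.
rolling_mean = filled["y"].rolling(window=2).mean()

print(filled, corr, ranks, rolling_mean, sep="\n\n")
```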

3. More Flexible Data Structures

* Pandas supports heterogeneous data types within a single column (e.g., a mix of strings, numbers, and NaNs), stored with the `object` dtype.
* Polars enforces a strict columnar format, so every column must have a single data type.
* Pandas also supports multi-indexing, which can be useful for hierarchical data.
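
The two pandas features mentioned above can be sketched as follows. The values and index labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# A single column holding a string, an integer, and a NaN.
mixed = pd.Series(["apple", 42, np.nan])
print(mixed.dtype)  # pandas stores this as a column of Python objects

# A hierarchical (multi-level) index: year on the outer level, quarter inside.
index = pd.MultiIndex.from_tuples(
    [("2024", "Q1"), ("2024", "Q2"), ("2025", "Q1")],
    names=["year", "quarter"],
)
sales = pd.Series([100, 120, 130], index=index)

# Selecting on the outer level returns the rows for that year,
# indexed by the remaining level (quarter).
sales_2024 = sales.loc["2024"]
print(sales_2024)
```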

4. Easier for Small Datasets & Quick Exploratory Analysis

* If you’re working with a small dataset (< 1M rows), pandas is often "fast enough" and has a more intuitive API.
* The interactive experience with Jupyter notebooks is more natural in pandas.

5. Better Support for Row-wise Operations

* Both libraries store data by column, but the pandas API offers direct row-wise tools (e.g., `apply(axis=1)`, `iterrows()`), so row-wise operations are more straightforward to write in pandas, even though they are slow.
* Polars discourages row-wise operations and instead pushes for vectorized, columnar computations.

<b>TL;DR: Pandas is still better for small-scale data science, flexible data structures, and statistics-heavy tasks. But for performance, scalability, and large datasets, Polars is the clear winner.</b>

---
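* Note that importing `polars` can fail on Apple Silicon (M1) Macs after installing the `polars` library directly. This typically happens when Python itself runs under Rosetta x86-64 emulation.
* https://github.com/pola-rs/polars/issues/11650
* In that case, install the `polars-lts-cpu` package instead: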



```python
import sys

!{sys.executable} -m pip install polars-lts-cpu
```

```python
import time

import numpy as np
import pandas as pd
import polars as pl

# Generate a large dataset (1 billion rows).
# Stored as int64, the array alone needs roughly 8 GB of RAM.
N = 1_000_000_000
data = {"col1": np.random.randint(0, 100, size=N)}

# Benchmark pandas (the timing includes DataFrame construction)
pandas_start = time.perf_counter()
df_pandas = pd.DataFrame(data)
sum_pandas = df_pandas["col1"].sum()
pandas_time = time.perf_counter() - pandas_start

# Benchmark Polars (the timing includes DataFrame construction)
polars_start = time.perf_counter()
df_polars = pl.DataFrame(data)
sum_polars = df_polars["col1"].sum()
polars_time = time.perf_counter() - polars_start

# Print results
print(f"Pandas sum: {sum_pandas} (Time: {pandas_time:.5f} sec)")
print(f"Polars sum: {sum_polars} (Time: {polars_time:.5f} sec)")
```

```
Pandas sum: 49499968634 (Time: 9.29749 sec)
Polars sum: 49499968634 (Time: 3.34252 sec)
```


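* With 1B rows, Polars is consistently about 3x faster than pandas in this test (9.30 sec vs. 3.34 sec in the run above). Note that both timings include DataFrame construction, not just the sum.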