In [None]:
!pip install polars

In [1]:
import polars as pl

### Optimizations and Performance in Polars

Polars is built for high-performance data manipulation, leveraging parallelism and lazy execution. In this section, we'll cover:

- LazyFrame vs. DataFrame: When to Use Each?
- Parallel Execution and Automatic Optimizations
- Working with Large Datasets (>1GB)

### 1. LazyFrame vs. DataFrame: When to Use Each?
Polars offers two main data structures:

1. `pl.DataFrame` (Eager Execution)
- Similar to a pandas DataFrame.
- Executes operations immediately.
- Best for small to medium datasets when quick results are needed.

2. `pl.LazyFrame` (Lazy Execution)
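- Works like a SQL query planner: each call adds a step to a query plan instead of running it.
- Operations are deferred until `collect()` is called, so the whole plan can be optimized first.
- Best for large datasets or multi-step queries where performance matters.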

In [24]:
df = pl.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'value': [10, 20, 30, 40, 50]
})

# Applying a filter (Eager execution - runs immediately)
eager_result = df.filter(pl.col('value') > 20)

print('Eager Execution:\n', eager_result)

# ----------

# Sample LazyFrame (Lazy Execution)
lf = df.lazy()

# Applying a filter (Lazy execution - does not run immediately)
lazy_result = lf.filter(pl.col('value') > 20)

# Must call collect() to execute LazyFrame
final_result = lazy_result.collect()

print('Lazy Execution (after collect()):\n', final_result)

Eager Execution:
 shape: (3, 2)
┌─────┬───────┐
│ id  ┆ value │
│ --- ┆ ---   │
│ i64 ┆ i64   │
╞═════╪═══════╡
│ 3   ┆ 30    │
│ 4   ┆ 40    │
│ 5   ┆ 50    │
└─────┴───────┘
Lazy Execution (after collect()):
 shape: (3, 2)
┌─────┬───────┐
│ id  ┆ value │
│ --- ┆ ---   │
│ i64 ┆ i64   │
╞═════╪═══════╡
│ 3   ┆ 30    │
│ 4   ┆ 40    │
│ 5   ┆ 50    │
└─────┴───────┘


### 2. Parallel Execution and Automatic Optimizations

Polars is designed for multi-threading and automatically optimizes lazy queries before running them.

**Polars' Optimization Techniques**
- `Predicate Pushdown`: Filters are applied as early as possible (ideally while the data is being read), so later steps process fewer rows.
- `Projection Pushdown`: Only the columns the query actually needs are read and carried through the plan.
- `Parallel Execution`: Work is split across multiple CPU cores.

In [22]:
# Large dataset simulation
df = pl.DataFrame({
    'id': range(1, 1000001),
    'value': range(1000000, 0, -1)
})

# Convert to LazyFrame
lf = df.lazy()

# Apply filtering and selection; the optimizer decides where each step runs
optimized_query = (
    lf
    .filter(pl.col('value') > 500000)  # Eligible for predicate pushdown
    .select(['id', 'value'])           # Eligible for projection pushdown
)

# Execute query
result = optimized_query.collect()

print(result)

shape: (500_000, 2)
┌────────┬─────────┐
│ id     ┆ value   │
│ ---    ┆ ---     │
│ i64    ┆ i64     │
╞════════╪═════════╡
│ 1      ┆ 1000000 │
│ 2      ┆ 999999  │
│ 3      ┆ 999998  │
│ 4      ┆ 999997  │
│ 5      ┆ 999996  │
│ 6      ┆ 999995  │
│ 7      ┆ 999994  │
│ 8      ┆ 999993  │
│ 9      ┆ 999992  │
│ 10     ┆ 999991  │
│ 11     ┆ 999990  │
│ 12     ┆ 999989  │
│ 13     ┆ 999988  │
│ 14     ┆ 999987  │
│ 15     ┆ 999986  │
│ 16     ┆ 999985  │
│ 17     ┆ 999984  │
│ 18     ┆ 999983  │
│ 19     ┆ 999982  │
│ 20     ┆ 999981  │
│ …      ┆ …       │
│ 499981 ┆ 500020  │
│ 499982 ┆ 500019  │
│ 499983 ┆ 500018  │
│ 499984 ┆ 500017  │
│ 499985 ┆ 500016  │
│ 499986 ┆ 500015  │
│ 499987 ┆ 500014  │
│ 499988 ┆ 500013  │
│ 499989 ┆ 500012  │
│ 499990 ┆ 500011  │
│ 499991 ┆ 500010  │
│ 499992 ┆ 500009  │
│ 499993 ┆ 500008  │
│ 499994 ┆ 500007  │
│ 499995 ┆ 500006  │
│ 499996 ┆ 500005  │
│ 499997 ┆ 500004  │
│ 499998 ┆ 500003  │
│ 499999 ┆ 500002  │
│ 500000 ┆ 500001  │
└────────┴─────────┘

### 3. Working with Large Datasets (>1GB)

**When handling large datasets, you can:**
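- Use a LazyFrame (for example via `pl.scan_csv`) so that only the rows and columns the query needs are loaded into memory.
- Read data in chunks to process it efficiently.
- Use Parquet instead of CSV: Parquet is a compressed, columnar format, so Polars can read just the required columns.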

In [21]:
# Dealing with a CSV larger than 1 GB

# Read CSV file lazily to avoid high memory usage
lf = pl.scan_csv('sales.csv')

# Perform filtering and aggregation lazily
query = (
    lf.filter(pl.col('sale_price') > 900)
    .group_by('product')
    .agg(pl.col('sale_price').sum().alias('total'))
)

result = query.collect()

# Round the totals to 2 decimals, then cast to string with .cast(pl.Utf8)
result = result.with_columns(pl.col('total').round(2).cast(pl.Utf8))

print(result)

shape: (50, 2)
┌───────────┬─────────────┐
│ product   ┆ total       │
│ ---       ┆ ---         │
│ str       ┆ str         │
╞═══════════╪═════════════╡
│ enter     ┆ 19220181.75 │
│ happy     ┆ 19349138.12 │
│ apply     ┆ 19041936.07 │
│ would     ┆ 19214605.68 │
│ cold      ┆ 19350622.0  │
│ floor     ┆ 19146144.96 │
│ important ┆ 19165248.78 │
│ charge    ┆ 18806296.98 │
│ church    ┆ 19508403.31 │
│ certain   ┆ 18976284.17 │
│ whom      ┆ 19321754.75 │
│ speak     ┆ 18944193.08 │
│ least     ┆ 19525448.62 │
│ reason    ┆ 19389997.12 │
│ but       ┆ 19201797.88 │
│ half      ┆ 19123176.29 │
│ what      ┆ 19118490.49 │
│ long      ┆ 19102148.33 │
│ budget    ┆ 19158962.61 │
│ place     ┆ 19102064.75 │
│ …         ┆ …           │
│ service   ┆ 19263025.68 │
│ water     ┆ 19332875.84 │
│ bit       ┆ 19039997.64 │
│ figure    ┆ 19199389.04 │
│ that      ┆ 19110237.85 │
│ employee  ┆ 19254682.0  │
│ fly       ┆ 18975999.31 │
│ then      ┆ 19357373.76 │
│ health    ┆ 19100247.42 │
│ sav

### 📌 Summary
| Feature | Recommendation |
| ------- | -------------- |
| Small Datasets (<1 GB) | Use `pl.DataFrame` |
| Large Datasets (>1 GB) | Use `pl.LazyFrame` with `scan_csv` or `scan_parquet` |
| Performance Optimization | Use lazy execution to enable query optimizations |
| File Formats | Prefer Parquet over CSV for large datasets |

Polars' **parallel execution and lazy optimizations** make it ideal for handling large-scale data efficiently.