# Polars — Comprehensive Tutorial (Beginner → Advanced)
This notebook contains explanations, notes, internal workings, keywords, and runnable examples.
Each code cell includes short comments explaining what it does and notes on internals where relevant.

## Setup: install and import
**What this cell does:** installs polars and imports libraries.
**Internal notes:** The `polars` core is written in Rust and exposed through Python bindings; `pip install` downloads a prebuilt wheel on common platforms.
**Keywords:** eager, lazy, expressions, scan_csv

In [None]:
# Install polars if not present (uncomment if needed)
# !pip install polars

import polars as pl
import pandas as pd
import matplotlib.pyplot as plt

print('polars version:', pl.__version__)
# Create a tiny DF to check
print(pl.DataFrame({'a':[1,2,3]}))

## Creating DataFrames (Eager)
**What this cell does:** shows different ways to create DataFrames.
**Internal notes:** Eager `pl.DataFrame` materializes immediately in memory (Arrow memory layout).

In [None]:
# From Python dict (eager)
df = pl.DataFrame({'name': ['Alice','Bob'], 'age':[25,30]})
print(df)

# From pandas DataFrame
pdf = pd.DataFrame({'x':[1,2,3], 'y':[4,5,6]})
df2 = pl.from_pandas(pdf)
print(df2)

# From CSV (eager read)
df_csv = pl.read_csv('/mnt/data/polars_samples/sales.csv')
print('read_csv rows:', df_csv.height)

## Expressions & Selecting
**What this cell does:** demonstrates `pl.col` and how expressions (`pl.Expr` objects) are built and used.
**Internal notes:** An expression is a lazily built description of a computation. In an eager context (`df.select`, `df.with_columns`) it is evaluated immediately; in a lazy context it becomes part of a query plan.
**Keywords:** `pl.col`, `pl.lit`, `.select()`, `.with_columns()`

In [None]:
# Use expressions to transform columns
print(df_csv.select([
    pl.col('customer'),
    (pl.col('amount') * 1.1).alias('amount_with_tax')
]))

# Add derived column with with_columns
print(df_csv.with_columns((pl.col('amount')/pl.col('amount').max()).alias('amount_norm')).head())

## Filter, GroupBy & Aggregation
**What this cell does:** basic filtering and aggregation.
**Internals:** GroupBy triggers parallel aggregation using multiple threads and Arrow buffers.
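**Keywords:** `.filter()`, `.group_by()` (spelled `.groupby()` before Polars 0.19), `.agg()`

In [None]:
# Keep only rows with amount above 50
print(df_csv.filter(pl.col('amount') > 50).head())
# Total amount per customer; on Polars older than 0.19 use .groupby() instead
print(df_csv.group_by('customer').agg(pl.col('amount').sum().alias('total')))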
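
The note on parallel aggregation can be pictured with a plain-Python sketch: each worker sums its own slice of the data, and the partial results are merged at the end. This is only one common way to parallelise a grouped sum, not a description of Polars's actual Rust implementation. The helper names (`partial_sums`, `merge_sums`) and the sample rows are made up for illustration, and Python threads will not give a real speed-up here.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sums(rows):
    """Sum amounts per customer within one slice of the data."""
    sums = {}
    for customer, amount in rows:
        sums[customer] = sums.get(customer, 0) + amount
    return sums

def merge_sums(partials):
    """Combine the per-slice dictionaries into final totals."""
    totals = {}
    for part in partials:
        for customer, amount in part.items():
            totals[customer] = totals.get(customer, 0) + amount
    return totals

# Made-up (customer, amount) pairs, split into two slices
sales = [('alice', 80), ('bob', 20), ('alice', 15),
         ('carol', 65), ('bob', 5), ('carol', 10)]
slices = [sales[:3], sales[3:]]

# Each slice is aggregated independently, then the partial results are merged
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(partial_sums, slices))

group_totals = merge_sums(partials)
print(group_totals)
```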

## Joins
**What this cell does:** demonstrates join types and notes on performance.
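**Internals:** Polars uses hash joins by default for equality joins; joins can be expensive when the inputs are large or the key columns are wide (for example long strings).
**Keywords:** `.join()`, `how='inner' | 'left' | 'full' | 'cross' | 'semi' | 'anti'` (`'full'` was called `'outer'` in older releases)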

In [None]:
left = pl.DataFrame({'id':[1,2,3], 'v':[10,20,30]})
right = pl.DataFrame({'id':[2,3,4], 'w':[200,300,400]})
print(left.join(right, on='id', how='inner'))
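
To make the hash-join note concrete, here is a conceptual sketch in plain Python of an inner equality join: build a hash table on one side, then probe it with rows from the other. The function `hash_inner_join` and the row-as-dict representation are illustrative assumptions; Polars's real implementation is columnar, written in Rust, and parallelised.

```python
from collections import defaultdict

def hash_inner_join(left_rows, right_rows, key):
    """Conceptual inner hash join over lists of dicts (hypothetical helper, not Polars API)."""
    # Build phase: index the right side by join key
    table = defaultdict(list)
    for row in right_rows:
        table[row[key]].append(row)
    # Probe phase: look up each left row's key in the table
    joined = []
    for lrow in left_rows:
        for rrow in table.get(lrow[key], []):
            joined.append({**lrow, **rrow})
    return joined

# Same data as the `left` and `right` frames above
left_rows = [{'id': 1, 'v': 10}, {'id': 2, 'v': 20}, {'id': 3, 'v': 30}]
right_rows = [{'id': 2, 'w': 200}, {'id': 3, 'w': 300}, {'id': 4, 'w': 400}]

joined_rows = hash_inner_join(left_rows, right_rows, 'id')
print(joined_rows)
```

Only ids 2 and 3 appear on both sides, so the result has two rows, which matches what an inner join on `id` should return.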

## Lazy Mode & scan_*() vs read_*()
**What this cell does:** shows lazy API using `scan_csv` and explains difference.
**Internals:** `scan_csv` builds a lazy plan without loading data; `read_csv` loads immediately. Lazy plans allow predicate pushdown, projection pushdown, and query optimization.
**Keywords:** `scan_csv`, `.lazy()`, `.collect()`, predicate pushdown

In [None]:
lf = pl.scan_csv('/mnt/data/polars_samples/sales.csv')  # lazy scan, no IO executed yet
print(type(lf))

# Build a lazy pipeline: filter then aggregate; nothing runs until collect()
# (on Polars older than 0.19, use .groupby() instead of .group_by())
plan = lf.filter(pl.col('amount') > 50).group_by('customer').agg(pl.col('amount').sum())
print('lazy plan repr:')
print(plan)
print('collect() executes plan and returns eager DataFrame:')
print(plan.collect())
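
The idea of "build a plan now, run it on `collect()`" can be sketched with a toy class in plain Python. `TinyLazy` is a hypothetical illustration, not part of Polars: it only records steps, and the data source is touched exactly once, inside `collect()`.

```python
class TinyLazy:
    """Hypothetical toy class: records operations and runs them only on collect()."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that loads the rows
        self._steps = tuple(steps)   # recorded (kind, argument) pairs

    def filter(self, predicate):
        return TinyLazy(self._source, self._steps + (('filter', predicate),))

    def select(self, names):
        return TinyLazy(self._source, self._steps + (('select', tuple(names)),))

    def describe(self):
        return ' -> '.join(['scan'] + [kind for kind, _ in self._steps])

    def collect(self):
        rows = self._source()        # the only place data is loaded
        for kind, arg in self._steps:
            if kind == 'filter':
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{n: r[n] for n in arg} for r in rows]
        return rows

loads = []

def load_rows():
    loads.append(1)  # record every time the source is actually read
    return [{'customer': 'alice', 'amount': 80}, {'customer': 'bob', 'amount': 20}]

toy_plan = TinyLazy(load_rows).filter(lambda r: r['amount'] > 50).select(['customer'])
loads_before = len(loads)            # still 0: building the plan read nothing
toy_result = toy_plan.collect()
print(toy_plan.describe(), loads_before, len(loads), toy_result)
```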

## Predicate Pushdown, Projection Pushdown & Query Optimization
**What this cell does:** demonstrates how placing filters before projections can reduce IO in lazy mode.
**Internals:** The optimizer rewrites the query DAG: it moves filters as close to the scan as possible (predicate pushdown) and drops columns that are never used (projection pushdown).
**Keywords:** optimizer, DAG, pushdown, physical plan

In [None]:
# Demonstration: filter placed early vs late in a lazy pipeline
lf = pl.scan_csv('/mnt/data/polars_samples/sales.csv')
# Filter early (good)
lazy_early = lf.filter(pl.col('amount') > 50).select(['customer','amount'])
# Filter late (suboptimal as written, but the optimizer usually pushes the filter down to the scan)
lazy_late = lf.select(['customer','amount']).filter(pl.col('amount') > 50)
# Compare the optimized plans, then execute both
print(lazy_early.explain())
print(lazy_late.explain())
print(lazy_early.collect())
print(lazy_late.collect())
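
Why pushdown saves work can be shown without Polars at all. The sketch below, in plain Python with the standard `csv` module, compares materialising every cell and filtering afterwards against applying the filter and the column selection while scanning. The CSV text, the function names, and the "cells kept" counter are made up for illustration; Polars's optimizer works on a query plan, not on code like this.

```python
import csv
import io

# Made-up sample data with the columns used in this notebook
CSV_TEXT = (
    "customer,amount,date\n"
    "alice,80,2024-01-01\n"
    "bob,20,2024-01-02\n"
    "carol,65,2024-01-03\n"
)

def scan_naive(text):
    """Materialise every column of every row, then filter and project afterwards."""
    rows = list(csv.DictReader(io.StringIO(text)))
    cells_kept = sum(len(r) for r in rows)
    result = [{'customer': r['customer'], 'amount': float(r['amount'])}
              for r in rows if float(r['amount']) > 50]
    return result, cells_kept

def scan_pushed_down(text):
    """Apply the filter and the column selection while scanning."""
    result, cells_kept = [], 0
    for r in csv.DictReader(io.StringIO(text)):
        amount = float(r['amount'])
        if amount > 50:  # predicate pushdown: drop rows at the source
            # projection pushdown: keep only the two columns that are needed
            result.append({'customer': r['customer'], 'amount': amount})
            cells_kept += 2
    return result, cells_kept

naive_result, naive_cells = scan_naive(CSV_TEXT)
pushed_result, pushed_cells = scan_pushed_down(CSV_TEXT)
print(naive_result == pushed_result, naive_cells, pushed_cells)
```

Both functions return the same rows, but the pushed-down version keeps far fewer cells in memory along the way.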

## Fold (Expression-level Reduce)
**What this cell does:** demonstrates `fold` for iterative expression aggregation.
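**Internals:** `fold` reduces a list of expressions into a single expression by applying `function` to an accumulator and each expression in turn (useful when there are many columns or their number is not known in advance).
**Keywords:** `pl.fold`, `acc`, `function`, `exprs`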

In [None]:
# Example: sum many columns via fold
df_multi = pl.DataFrame({'a':[1,2], 'b':[3,4], 'c':[5,6]})
exprs = [pl.col(c) for c in df_multi.columns]
sum_expr = pl.fold(acc=pl.lit(0), function=lambda a, b: a + b, exprs=exprs).alias('total')
print(df_multi.select(sum_expr))
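
The semantics of the fold are those of a left reduce over columns. As a plain-Python analogy (not Polars code), `functools.reduce` can combine an all-zero accumulator with each column in turn; the helper `add_columns` is an assumption made for this sketch.

```python
from functools import reduce

# Same values as df_multi above, as plain Python lists
columns = {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}

def add_columns(acc, col):
    """Element-wise addition of two equal-length columns."""
    return [x + y for x, y in zip(acc, col)]

# Start from zeros (the role played by acc=pl.lit(0)),
# then combine the accumulator with each column in turn
initial = [0] * len(columns['a'])
total = reduce(add_columns, columns.values(), initial)
print(total)  # [9, 12]
```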

## HStack & VStack (Concatenation)
**What this cell does:** stack DataFrames horizontally and vertically.
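**Internals:** `hstack` adds the new columns without copying the existing ones; `vstack` appends the other frame's chunks without copying, so call `.rechunk()` afterwards if you need contiguous memory (`.extend()` copies the data instead and may reallocate).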
**Keywords:** `.hstack()`, `.vstack()`, `pl.concat()`

In [None]:
df_a = pl.DataFrame({'a':[1,2]})
df_b = pl.DataFrame({'b':[3,4]})
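# Horizontal: add df_b's columns next to df_a's
print(df_a.hstack(df_b))
# Vertical: append rows (schemas must match)
print(df_a.vstack(df_a))
# pl.concat stacks a list of frames vertically by default
print(pl.concat([df_a, df_a]))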

## Struct & List columns (Nested types)
**What this cell does:** shows creating and manipulating nested columns.
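**Internals:** Arrow stores a list column as an offsets buffer plus a flat child array, and a struct column as one child array per field; operations on them are optimized but can be more costly to serialize.
**Keywords:** `pl.Struct`, `pl.List`, `.explode()`, `.list` (named `.arr` before Polars 0.18), `.struct`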

In [None]:
# Create list and struct columns
df_nested = pl.DataFrame({'id':[1,2], 'lst':[[1,2],[3]], 'meta':[{'a':1},{'b':2}]})
print(df_nested)
# Explode list column: one output row per list element
print(df_nested.explode('lst'))
# List operations live in the .list namespace (named .arr before Polars 0.18)
print(df_nested.select(pl.col('lst').list.sum().alias('lst_sum')))
# Struct fields are accessed with .struct.field(name)
struct_df = df_nested.select(pl.struct(['id','lst']).alias('s'))
print(struct_df.select(pl.col('s').struct.field('id').alias('id_from_struct'), pl.col('s')))
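
The offsets-based layout of a list column can be sketched in plain Python. The snippet below stores `[[1, 2], [3]]` as a flat values buffer plus an offsets buffer and shows how an explode falls out of that layout. It is a simplified picture that ignores nulls, empty lists, and validity bitmaps, and the variable names are illustrative.

```python
# The list column [[1, 2], [3]] stored Arrow-style
values = [1, 2, 3]       # flat child array holding every element
offsets = [0, 2, 3]      # row i spans values[offsets[i]:offsets[i + 1]]
ids = [1, 2]             # the sibling 'id' column

def get_list(i):
    """Reconstruct the list stored in row i from the two buffers."""
    return values[offsets[i]:offsets[i + 1]]

list_rows = [get_list(i) for i in range(len(offsets) - 1)]

# Explode: repeat each id once per list element; the flat values buffer is used as-is
exploded_ids = [ids[i] for i in range(len(ids)) for _ in get_list(i)]
exploded = list(zip(exploded_ids, values))

print(list_rows)  # [[1, 2], [3]]
print(exploded)   # [(1, 1), (1, 2), (2, 3)]
```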


## UDFs & Performance
**What this cell does:** compares a Python UDF (`map_elements`, named `apply` before Polars 0.19) with a vectorized expression and shows the performance implications.
**Internals:** `.map_elements()` calls a Python function once per element and is slow; prefer built-in expressions, which are executed in Rust and parallelized.
**Keywords:** `map_elements` (formerly `apply`), `map_batches` (formerly `map`), `expr`, vectorized

In [None]:
import time

df_perf = pl.DataFrame({'x': list(range(1000))})
# Slow: Python function called once per element
# (on Polars older than 0.19, use .apply() instead of .map_elements())
start = time.perf_counter()
res = df_perf.with_columns(pl.col('x').map_elements(lambda v: v*2, return_dtype=pl.Int64).alias('x2'))
print('map_elements time (approx):', time.perf_counter()-start)
# Fast: vectorized expression (Rust)
start = time.perf_counter()
res2 = df_perf.with_columns((pl.col('x')*2).alias('x2'))
print('expr time (approx):', time.perf_counter()-start)
print(res2.head())

## Streaming & Scanning Many Files
**What this cell does:** shows `scan_csv` and `streaming=True` for memory constrained pipelines.
**Internals:** Streaming mode processes data in chunks and avoids fully materializing intermediate results.
**Keywords:** `streaming=True`, `scan_csv`, memory footprint

In [None]:
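# For demonstration we use scan_csv; streaming is most useful when writing straight to a file sink.
lf = pl.scan_csv('/mnt/data/polars_samples/sales.csv')  # same sample file as in the earlier cells
# Recent Polars releases spell this collect(engine='streaming'); streaming=True is the older form
res = lf.filter(pl.col('amount') > 50).select(['date','amount']).collect(streaming=True)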
print(res)
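
Chunked processing can be illustrated with a plain-Python sketch: read a few rows at a time, update a running result, and discard the chunk. The chunk size, the sample data, and the helper names are assumptions made for illustration; Polars's streaming engine is far more sophisticated, but the memory argument is the same.

```python
import csv
import io
from itertools import islice

def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from an iterator."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

def streaming_sum(text, chunk_size=2):
    """Sum amounts over 50 while holding only one chunk in memory at a time."""
    reader = csv.DictReader(io.StringIO(text))
    total, largest_chunk = 0.0, 0
    for chunk in iter_chunks(reader, chunk_size):
        largest_chunk = max(largest_chunk, len(chunk))
        total += sum(float(r['amount']) for r in chunk if float(r['amount']) > 50)
    return total, largest_chunk

# Made-up sample data
sample = (
    "date,amount\n"
    "2024-01-01,80\n"
    "2024-01-02,20\n"
    "2024-01-03,65\n"
    "2024-01-04,55\n"
    "2024-01-05,10\n"
)

stream_total, largest_chunk = streaming_sum(sample)
print(stream_total, largest_chunk)
```

However many rows the input has, no more than `chunk_size` of them are held at once, which is why streaming suits pipelines that end in an aggregation or a file sink.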

## Performance Tips (Practical)
- Prefer expressions over `apply`.
- Use lazy + scan for large files.
- Use predicate & projection pushdown.
- Rechunk (`.rechunk()`) if a frame has become fragmented into many chunks, e.g. after repeated vertical stacking (not often needed).
- Use Arrow IPC or Parquet for faster IO.

**Keywords:** probe, predicate pushdown, mutex, threads, rayon.

In [None]:
# Quick tip: reading Parquet (if available) is usually faster than reading CSV
# pl.read_parquet('file.parquet')

## Visual Diagrams
Below are two example diagrams included in the package: sales timeseries and Polars architecture.

In [None]:
from IPython.display import Image, display
display(Image(filename='/mnt/data/polars_samples/sales_timeseries.png'))
display(Image(filename='/mnt/data/polars_samples/polars_architecture.png'))

## Reading Newline-delimited JSON (example: nested types)
**What this cell does:** reads the sample nested JSON created earlier.
**Keywords:** `read_ndjson`, `read_json`, nested

In [None]:
# read_ndjson expects one JSON object per line, whatever the file extension
print(pl.read_ndjson('/mnt/data/polars_samples/nested.csv'))

## Writing Files (CSV, Parquet)
**What this cell does:** demonstrates writing results.
**Internals:** Parquet stores the schema, including nested types; CSV is plain text, so dtypes are re-inferred on read and nested columns (lists, structs) cannot be written directly.
**Keywords:** `.write_parquet()`, `.write_csv()`

In [None]:
df_csv = pl.read_csv('/mnt/data/polars_samples/sales.csv')
df_csv.write_csv('/mnt/data/polars_samples/out_sales.csv')
# For parquet (example)
# df_csv.write_parquet('/mnt/data/polars_samples/out_sales.parquet')
print('wrote out_sales.csv')

## Closing Notes & Further Reading
- Polars is well suited to data-heavy workloads where speed matters.
- Use the official docs and API reference for the latest specialized APIs.

**End of notebook.**