# PySpark mini-curriculum
## Goals
- Create a SparkSession and read CSV/Parquet
- DataFrame operations: select/withColumn/filter/groupBy
- UDFs vs built-ins; when to avoid UDFs
- Joins, window functions, and aggregations
- Spark MLlib basics: feature assemblers, estimators/transformers, pipelines
- Performance tips: partitions, caching, explain plans

## Step-by-step path
1) Setup & mindset
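   - Install Spark locally, for example with `pip install pyspark` (a Java runtime is required), or use Databricks/EMR; set `SPARK_HOME` if you use a standalone Spark download, then test with `SparkSession.builder.getOrCreate()`.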
   - Favor built-in functions over Python UDFs; keep schemas explicit to avoid surprises.
2) SparkSession + IO
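   - Create a session with sensible defaults (e.g., `.master("local[*]")` and `.appName(...)` on the builder).
   - Read CSV/Parquet with `spark.read.option(...).csv(...)` or `.parquet(...)`; check the schema with `printSchema()`. Use `inferSchema` cautiously for CSV: it costs an extra pass over the data and can guess the wrong type, so prefer an explicit `schema(...)`.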
3) DataFrame basics
   - Column ops: `select`, `withColumn`, `filter`/`where`, `drop`, `distinct`.
   - Expressions: use `pyspark.sql.functions` (e.g., `col`, `lit`, `when`, string/date functions) instead of UDFs when possible.
4) Aggregations
   - Grouped stats with `groupBy().agg(...)`; handle nulls with `na.fill`/`na.drop`.
   - Use `approx_count_distinct`, `percentile_approx` for scalable metrics.
5) Joins
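   - Inner/left/anti/semi joins (`how="inner"`, `"left"`, `"left_anti"`, `"left_semi"`); beware that duplicate join keys multiply matching rows and change row counts.
   - Control the join strategy with hints (`broadcast`, `merge`) and confirm the chosen strategy with `explain()`.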
6) Window functions
   - Define `Window.partitionBy(...).orderBy(...)`; apply `row_number`, `lag/lead`, running sums.
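   - Use frame specs (`rowsBetween`) for rolling windows; make the ordering deterministic by adding a tie-breaker column to `orderBy`, since ties leave the order of `row_number` and `lag`/`lead` results unspecified.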
7) Data cleaning & types
   - Cast columns explicitly; parse timestamps with formats/timezones.
   - Handle skewed categories and outliers; consider bucketing or clipping before heavy joins.
8) Spark MLlib pipelines
   - Assemble features with `VectorAssembler`; handle categoricals with `StringIndexer` + `OneHotEncoder`.
   - Build `Pipeline` with transformers + estimators (e.g., `LogisticRegression`, `RandomForestClassifier`).
9) Evaluation & tuning
   - Split with `randomSplit` (pass a `seed` for reproducibility); use `BinaryClassificationEvaluator`/`RegressionEvaluator`.
   - Tune with `CrossValidator` or `TrainValidationSplit`; log `avgMetrics` (`validationMetrics` for `TrainValidationSplit`) and the best params.
10) Performance & stability
    - Inspect physical plans with `explain(mode="formatted")`; avoid unnecessary shuffles (they show up as `Exchange` nodes in the plan).
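    - Repartition/coalesce intentionally: `repartition` triggers a full shuffle, while `coalesce` merges existing partitions without one.
    - Cache only DataFrames that are reused, and monitor jobs via the Spark UI.
    - Write Parquet with `partitionBy(...)`; note that `mode="overwrite"` replaces existing data at the target path.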

## Suggested exercises
- Start a local Spark session, read a CSV, and compute grouped stats + a filtered subset.
- Add a window function (e.g., `row_number` over a partition) and validate ordering and row counts.
- Build a small ML pipeline (StringIndexer -> OneHotEncoder -> VectorAssembler -> LogisticRegression), tune 2-3 hyperparameters, and evaluate on a holdout.
