# 🧠 PySpark Essentials — Categorized Imports & Functions Cheat Sheet

This guide groups the PySpark imports and functions most often used in ETL and data engineering workflows by category. Each section stands alone, so it can be reused as an onboarding reference, a cheat sheet, or DataGym documentation.

---

## 🔧 1. Core Setup

```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("MyApp").getOrCreate()
```

---

## 📦 2. DataFrame Creation & Schema

```python
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, DoubleType, BooleanType, DateType, TimestampType
```

- `StructType`, `StructField`: Define custom schemas
- `Row`: Create row objects manually

---

## 📊 3. DataFrame Operations

```python
from pyspark.sql.functions import col, lit, expr, when, isnan, isnull, desc, asc
```

- `col("column")`: Column reference
- `lit(value)`: Literal value
- `expr("SQL expression")`: Inline SQL
- `when(condition, value)`: Conditional logic
- `isnan`, `isnull`: NaN check (floating-point columns) and null check, respectively
- `asc`, `desc`: Sorting

---

## 🔍 4. Aggregations & Grouping

```python
# Caution: sum, min, and max shadow Python's built-ins after this import
from pyspark.sql.functions import count, sum, avg, min, max, mean, approx_count_distinct
```

- `groupBy(...).agg(...)`: Aggregation
- `approx_count_distinct`: Fast cardinality estimate

---

## 🧮 5. Window Functions

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank, lead, lag
```

- `Window.partitionBy(...).orderBy(...)`: Define window
- `row_number`, `rank`, `dense_rank`: Ranking
- `lead`, `lag`: Read a value from a following or preceding row in the window

---

## 🧼 6. Null Handling & Cleaning

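```python
# fillna, dropna, and na are DataFrame methods, not functions to import
from pyspark.sql.functions import coalesce
```

- `coalesce(col1, col2, ...)`: First non-null value per row
- `df.fillna(...)`, `df.dropna(...)`: DataFrame methods for replacing or dropping nulls
- `df.na`: Accessor exposing the same operations as `df.na.fill(...)`, `df.na.drop(...)`, and `df.na.replace(...)`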

---

## 🧪 7. Date & Time Functions

```python
from pyspark.sql.functions import current_date, current_timestamp, datediff, date_add, date_sub, year, month, dayofmonth, hour, minute
```

- `datediff`, `date_add`, `date_sub`: Date math
- `year`, `month`, `dayofmonth`: Extract parts

---

## 🔡 8. String Functions

```python
from pyspark.sql.functions import lower, upper, trim, lpad, rpad, substring, regexp_replace, regexp_extract, concat_ws
```

- `lower`, `upper`, `trim`: Case and cleanup
- `regexp_replace`, `regexp_extract`: Regex-based replacement and capture-group extraction
- `concat_ws`: Join strings with delimiter

---

## 🧠 9. Advanced & Miscellaneous

```python
from pyspark.sql.functions import explode, split, array, struct, to_json, from_json, udf
```

- `explode`, `split`: Array and string ops
- `to_json`, `from_json`: JSON handling
- `udf`: User-defined functions

---

## 🧵 10. Unified Import Block (Optional)

If you want a single import block for most use cases:

```python
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.window import Window
```

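> ⚠️ **Note**: `from pyspark.sql.functions import *` is convenient, but it shadows Python built-ins such as `sum`, `min`, `max`, `abs`, and `round` with their Spark column equivalents. In production code, prefer explicit imports or a module alias such as `from pyspark.sql import functions as F`.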

