# ETL Developer Assessment

Build a PySpark ETL pipeline that processes agricultural crop yield and crop abandonment data.

See `README.md` for full requirements.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = (
    SparkSession.builder
    .appName("AgricultureETL")
    .master("local[*]")
    .getOrCreate()
)

print(f"Spark version: {spark.version}")

## Step 1: Data Ingestion

Read the partitioned Parquet datasets:
- `data/crop_yield/`
- `data/county_crop_abandonment/`

In [None]:
# Read crop yield data


In [None]:
# Read abandonment data


## Step 2: Data Validation & Quality

Check for and handle:
- Missing/null values
- Invalid data ranges (for example, negative `planted_area` or `abandonment_percent` outside 0 to 100)
- Duplicate records
- Referential integrity between the two datasets (records with no match on the join keys)

*Optional: Great Expectations is available if you prefer using it for validation.*
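
Before translating the four checks to PySpark, it can help to run them on a handful of plain-Python rows. The sketch below is illustrative only. The `validate_rows` helper, the sample rows, the split of columns between the two datasets, and the join keys `(harvest_year, fips_cd, crop_name)` are all assumptions; the real schema is defined in `README.md`.

```python
# Illustrative sketch: column placement and join keys are assumptions, not the README schema.
KEYS = ("harvest_year", "fips_cd", "crop_name")  # assumed join keys


def validate_rows(yield_rows, abandonment_rows):
    """Return (clean yield rows, rejected yield rows grouped by reason)."""
    # Only abandonment rows with a percentage in [0, 100] count as valid join targets.
    valid_abandonment = [
        r for r in abandonment_rows
        if r.get("abandonment_percent") is not None
        and 0 <= r["abandonment_percent"] <= 100
    ]
    known_keys = {tuple(r[k] for k in KEYS) for r in valid_abandonment}

    rejected = {"null": [], "range": [], "duplicate": [], "orphan": []}
    clean, seen = [], set()
    for row in yield_rows:
        fingerprint = tuple(sorted(row.items()))  # identifies exact duplicates
        if any(row.get(c) is None for c in KEYS + ("planted_area", "yield")):
            rejected["null"].append(row)
        elif row["planted_area"] < 0 or row["yield"] < 0:
            rejected["range"].append(row)
        elif fingerprint in seen:
            rejected["duplicate"].append(row)
        elif tuple(row[k] for k in KEYS) not in known_keys:
            rejected["orphan"].append(row)  # no valid abandonment record to join to
        else:
            clean.append(row)
        seen.add(fingerprint)
    return clean, rejected


# Made-up sample rows, one per failure mode plus one clean row.
base = {"harvest_year": 2023, "fips_cd": "19153", "crop_name": "corn"}
abandonment = [
    {**base, "abandonment_percent": 5.0},
    {**base, "fips_cd": "19169", "abandonment_percent": 140.0},  # out of range
]
yields = [
    {**base, "planted_area": 100.0, "yield": 180.0},                      # clean
    {**base, "planted_area": 100.0, "yield": 180.0},                      # exact duplicate
    {**base, "planted_area": -5.0, "yield": 180.0},                       # negative area
    {**base, "planted_area": None, "yield": 175.0},                       # missing value
    {**base, "fips_cd": "19169", "planted_area": 80.0, "yield": 170.0},   # orphan
]

clean, rejected = validate_rows(yields, abandonment)
print(len(clean), {reason: len(rows) for reason, rows in rejected.items()})
```

In PySpark the same rules would typically map onto `DataFrame.dropna(subset=...)`, `DataFrame.filter(...)`, `DataFrame.dropDuplicates()`, and a join with `how="left_anti"` to list unmatched records. Whether bad rows are dropped, repaired, or written to a quarantine output is a design decision the sketch does not settle.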

In [None]:
# Data quality checks and cleaning


## Step 3: Field-Level Calculations

Join datasets and compute:
- `abandoned_area = planted_area * (abandonment_percent / 100)`
- `crop_production = (planted_area - abandoned_area) * yield`
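
As a quick sanity check, the two formulas above can be worked through in plain Python for a single record. The `field_metrics` helper and the input values are hypothetical, chosen only so the arithmetic is easy to follow; units are whatever the source data uses.

```python
def field_metrics(planted_area, abandonment_percent, crop_yield):
    """Apply the two Step 3 formulas to one field record."""
    abandoned_area = planted_area * (abandonment_percent / 100)
    crop_production = (planted_area - abandoned_area) * crop_yield
    return abandoned_area, crop_production


# Hypothetical field: 200 area units planted, 25% abandoned, yield of 150 per harvested unit.
abandoned_area, crop_production = field_metrics(200.0, 25.0, 150.0)
print(abandoned_area)   # 50.0
print(crop_production)  # 22500.0
```

One practical note for the PySpark version: `yield` is a reserved word in Python, so attribute access such as `df.yield` is a syntax error. Referring to the column as `F.col("yield")` or `df["yield"]` avoids the problem.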

In [None]:
# Join and calculate field-level metrics


## Step 4: County-Level Rollup

Aggregate by `harvest_year`, `fips_cd`, `crop_name`:
- `total_planted_area` (sum)
- `total_abandoned_area` (sum)
- `total_production` (sum)
- `county_yield = total_production / (total_planted_area - total_abandoned_area)` (undefined when the harvested area is zero, so guard the division)
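
The rollup above is a ratio of sums, which is not the same as averaging the per-field yields. The plain-Python sketch below illustrates the difference on made-up rows. The `county_rollup` helper, the sample values, and the choice to report `None` when no area was harvested are assumptions for illustration.

```python
from collections import defaultdict


def county_rollup(field_rows):
    """Sum field-level metrics per (harvest_year, fips_cd, crop_name) and derive county_yield."""
    totals = defaultdict(lambda: {
        "total_planted_area": 0.0,
        "total_abandoned_area": 0.0,
        "total_production": 0.0,
    })
    for row in field_rows:
        key = (row["harvest_year"], row["fips_cd"], row["crop_name"])
        totals[key]["total_planted_area"] += row["planted_area"]
        totals[key]["total_abandoned_area"] += row["abandoned_area"]
        totals[key]["total_production"] += row["crop_production"]

    result = {}
    for key, t in totals.items():
        harvested = t["total_planted_area"] - t["total_abandoned_area"]
        # Assumption: report None rather than divide by zero when nothing was harvested.
        county_yield = t["total_production"] / harvested if harvested > 0 else None
        result[key] = {**t, "county_yield": county_yield}
    return result


# Made-up field rows: two fields in one county (yields 100 and 200), one fully abandoned field in another.
fields = [
    {"harvest_year": 2023, "fips_cd": "19153", "crop_name": "corn",
     "planted_area": 100.0, "abandoned_area": 0.0, "crop_production": 10000.0},
    {"harvest_year": 2023, "fips_cd": "19153", "crop_name": "corn",
     "planted_area": 300.0, "abandoned_area": 0.0, "crop_production": 60000.0},
    {"harvest_year": 2023, "fips_cd": "19169", "crop_name": "corn",
     "planted_area": 50.0, "abandoned_area": 50.0, "crop_production": 0.0},
]

rollup = county_rollup(fields)
print(rollup[(2023, "19153", "corn")]["county_yield"])  # 175.0, not the simple mean of 150
print(rollup[(2023, "19169", "corn")]["county_yield"])  # None
```

In PySpark, the equivalent would usually be a `groupBy(...).agg(...)` with `F.sum` for the three totals, followed by a `withColumn` that wraps the division in `F.when(...)` so that a zero harvested area yields null.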

In [None]:
# County-level aggregation


## Step 5: Output

Save both result sets as Parquet:
- `output/field_level_production/` (field-level results from Step 3)
- `output/county_rollup/` (county-level results from Step 4)

In [None]:
# Write outputs


In [None]:
spark.stop()