# **NOAA GHCN Project Using Spark Medallion Architecture**

## **Project Overview**

* Collect daily climate observations from the **NOAA GHCN-Daily dataset (2010–2025)**
* Dataset source: [Global Historical Climatology Network – Daily (GHCN-D)](https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily)
* Process the data through the **Bronze → Silver → Gold** Medallion Architecture using Spark
* Perform exploratory analysis to study **historical climate trends** in temperature and precipitation
* Use machine-learning methods to identify and predict:

  * Temperature anomalies
  * Rainfall patterns
  * Regional climate clusters
  * Potential extreme events (e.g., heatwaves)

---

# **Bronze Layer**

In the Bronze layer we ingest the raw GHCN-Daily text file and convert it into a structured, append-only dataset using Spark.

**Key steps:**
- Download the selected years from the dataset (one-time step)
- Read the combined raw file as a text DataFrame (one line per record)
- Parse each line into a structured schema with a safe parser that:
  - labels well-formed records as `_status = "valid"`
  - labels malformed lines as `_status = "parse_error"` and keeps the original raw line in `_raw_data`
- Write the Bronze dataset as Parquet, partitioned by `year`, to the configured `BRONZE_OUT` path
- Inspect the Bronze output by printing the schema, showing sample rows, and listing the distinct `element` values present in the raw data

## 01. Download GHCN Data and Metadata

In [8]:
# Download the GHCN files to a local folder and combine them into one text file using the helper script.
import os, subprocess

years = list(range(2010, 2026))  # Select years (2010-2025) to download
target_dir = os.environ.get("NOAA_DIR", "/home/ubuntu/spark-notebooks/project/data/raw")
os.makedirs(target_dir, exist_ok=True)

env = os.environ.copy()
env["DATA_DIR"] = target_dir

try:
    # Use the helper script in the project's scripts directory
    cmd = ["bash", "/home/ubuntu/spark-notebooks/project/scripts/ghcn_download.sh", *[str(y) for y in years]]
    print("Running:", cmd)
    subprocess.run(cmd, check=True, env=env)
except Exception as e:
    print("Error downloading GHCN data:", e)

Running: ['bash', '/home/ubuntu/spark-notebooks/project/scripts/ghcn_download.sh', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023', '2024', '2025']
Downloading to: /home/ubuntu/spark-notebooks/project/data/raw
[wget] 2010


--2025-11-26 16:46:49--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2010.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.171, 205.167.25.167, 205.167.25.178, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.171|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 189374410 (181M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2010.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2010.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2010.txt
[wget] 2011


--2025-11-26 16:48:05--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2011.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.177, 205.167.25.178, 205.167.25.171, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.177|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 180028624 (172M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2011.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2011.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2011.txt
[wget] 2012


--2025-11-26 16:49:22--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2012.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.168, 205.167.25.178, 205.167.25.177, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.168|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 177151573 (169M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2012.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2012.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2012.txt
[wget] 2013


--2025-11-26 16:49:55--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2013.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.168, 205.167.25.171, 205.167.25.172, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.168|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 171983250 (164M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2013.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2013.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2013.txt
[wget] 2014


--2025-11-26 16:50:25--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2014.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.168, 205.167.25.172, 205.167.25.178, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.168|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170399448 (163M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2014.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2014.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2014.txt
[wget] 2015


--2025-11-26 16:51:03--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2015.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.171, 205.167.25.167, 205.167.25.177, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.171|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 172916783 (165M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2015.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2015.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2015.txt
[wget] 2016


--2025-11-26 16:51:34--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2016.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.167, 205.167.25.172, 205.167.25.171, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.167|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 174383727 (166M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2016.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2016.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2016.txt
[wget] 2017


--2025-11-26 16:52:05--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2017.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.171, 205.167.25.172, 205.167.25.177, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.171|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 173999236 (166M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2017.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2017.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2017.txt
[wget] 2018


--2025-11-26 16:52:43--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2018.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.168, 205.167.25.172, 205.167.25.177, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.168|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 174062423 (166M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2018.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2018.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2018.txt
[wget] 2019


--2025-11-26 16:53:12--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2019.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.177, 205.167.25.167, 205.167.25.171, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.177|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 172913241 (165M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2019.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2019.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2019.txt
[wget] 2020


--2025-11-26 16:53:45--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2020.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.178, 205.167.25.168, 205.167.25.171, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.178|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 173966561 (166M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2020.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2020.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2020.txt
[wget] 2021


--2025-11-26 16:54:23--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2021.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.171, 205.167.25.177, 205.167.25.167, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.171|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 176958020 (169M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2021.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2021.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2021.txt
[wget] 2022


--2025-11-26 16:54:54--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2022.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.172, 205.167.25.178, 205.167.25.177, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.172|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 177110662 (169M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2022.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2022.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2022.txt
[wget] 2023


--2025-11-26 16:55:57--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2023.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.172, 205.167.25.167, 205.167.25.178, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.172|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 177272595 (169M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2023.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2023.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2023.txt
[wget] 2024


--2025-11-26 16:56:50--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2024.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.171, 205.167.25.178, 205.167.25.177, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.171|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 172391471 (164M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2024.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2024.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2024.txt
[wget] 2025


--2025-11-26 16:57:43--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/2025.csv.gz
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.168, 205.167.25.178, 205.167.25.172, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.168|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 127202688 (121M) [application/gzip]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/raw/2025.csv.gz’


[decompress] /home/ubuntu/spark-notebooks/project/data/raw/2025.csv.gz -> /home/ubuntu/spark-notebooks/project/data/raw/2025.txt
20G	/home/ubuntu/spark-notebooks/project/data/raw

Combining per-year TXT files into one file...
[combine] adding 2010.txt
[combine] adding 2011.txt
[combine] adding 2012.txt
[combine] adding 2013.txt
[combine] adding 2014.txt
[combine] adding 2015.txt
[combine] adding 2016.txt
[combine] adding 2017.txt
[combine] adding 2018.txt
[combine] adding 2019.txt
[combine] adding 2020.txt
[combine] adding 2021.txt
[combine] adding 2022.txt
[combine] adding 2023.txt
[combine] adding 2024.txt
[combine] adding 2025.txt
[done] combined 16 files -> /home/ubuntu/spark-notebooks/project/data/raw/ghcn_all_years.txt

Done. Summary:
40G	/home/ubuntu/spark-notebooks/project/data/raw
total 40G
-rw-rw-r-- 1 ubuntu ubuntu 1.4G Nov 26 16:48 2010.txt
-rw-rw-r-- 1 ubuntu ubuntu 1.3G Nov 26 16:49 2011.txt
-rw-rw-r-- 1 ubuntu ubuntu 1.3G Nov 26 16:49 2012.txt
-rw-rw-r-- 1 ubuntu ubuntu 
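
Because the download cell only prints an error and continues when the helper script fails, it can be worth confirming that every requested year produced a decompressed file before reading the combined file. The following is a minimal sketch, assuming the script writes one `<year>.txt` per year into the target directory (as the log above suggests); `missing_year_files` is a hypothetical helper and not part of the project.

```python
import os

def missing_year_files(data_dir, years):
    """Return the years whose <year>.txt file is absent or empty in data_dir."""
    missing = []
    for year in years:
        path = os.path.join(data_dir, f"{year}.txt")
        # Treat zero-byte files as missing: an interrupted download can leave one behind.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(year)
    return missing
```

Calling `missing_year_files(target_dir, years)` after the download cell should return an empty list when all sixteen files are in place.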

In [None]:
# Download GHCN metadata files (stations & inventory)
import os, subprocess

# Data Path
meta_dir = "/home/ubuntu/spark-notebooks/project/data/meta"
os.makedirs(meta_dir, exist_ok=True)

# Metadata URLs
urls = {
    "ghcnd-stations.txt": "https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
    "ghcnd-inventory.txt": "https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-inventory.txt"
}

for filename, url in urls.items():
    dest_path = os.path.join(meta_dir, filename)
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        print(f"[skip] {filename} already exists at {dest_path}")
        continue
    try:
        print(f"[download] {filename} from {url}")
        subprocess.run(["wget", "-O", dest_path, url], check=True)
        print(f"[done] Saved to {dest_path}")
    except Exception as e:
        print(f"[error] Could not download {filename}: {e}")

# Show downloaded files
print("\nMetadata files:")
!ls -lh {meta_dir}

[download] ghcnd-stations.txt from https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt


--2025-11-26 17:01:38--  https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 205.167.25.177, 205.167.25.168, 205.167.25.167, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|205.167.25.177|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11150588 (11M) [text/plain]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/meta/ghcnd-stations.txt’


[done] Saved to /home/ubuntu/spark-notebooks/project/data/meta/ghcnd-stations.txt
[download] ghcnd-inventory.txt from https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-inventory.txt


HTTP request sent, awaiting response... 200 OK
Length: 35288486 (34M) [text/plain]
Saving to: ‘/home/ubuntu/spark-notebooks/project/data/meta/ghcnd-inventory.txt’


[done] Saved to /home/ubuntu/spark-notebooks/project/data/meta/ghcnd-inventory.txt

Metadata files:
total 45M
-rw-rw-r-- 1 ubuntu ubuntu 34M Nov 12 09:47 ghcnd-inventory.txt
-rw-rw-r-- 1 ubuntu ubuntu 11M Nov 12 09:47 ghcnd-stations.txt
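
Unlike the by-year observation files, `ghcnd-stations.txt` is a fixed-width file, so it has to be sliced by column position rather than split on commas. The sketch below is illustrative only: the column offsets are an assumption based on the layout described in NOAA's GHCN-Daily readme and should be verified against that file, `Station` and `parse_station_line` are hypothetical names, and the sample line is synthetic rather than copied from the real data.

```python
from typing import NamedTuple, Optional

class Station(NamedTuple):
    station: str
    latitude: float
    longitude: float
    elevation: float
    state: Optional[str]
    name: str

def parse_station_line(line: str) -> Station:
    """Parse one fixed-width station line (column offsets are assumed, see above)."""
    return Station(
        station=line[0:11].strip(),
        latitude=float(line[12:20]),
        longitude=float(line[21:30]),
        elevation=float(line[31:37]),
        state=line[38:40].strip() or None,  # blank outside the US and Canada
        name=line[41:71].strip(),
    )

# Synthetic line built to match the assumed layout (not a real station).
sample = f"{'XX000000001':<11} {12.5:8.4f} {-45.25:9.4f} {100.0:6.1f} {'':2} {'EXAMPLE STATION':<30}"
station = parse_station_line(sample)
print(station)
```

The same slicing approach would carry over to Spark with `substring` on a text DataFrame, which keeps the metadata join in later layers free of Python UDFs.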


## 02. Spark Config

In [10]:
# Create Spark session
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("NOAA-GHCN-Bronze")
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")
print("Spark version:", spark.version)
spark

Spark version: 3.5.0


## 03. Data Path Config

In [11]:
# Define input/output paths
import os

# Local folder with raw .txt files
RAW_DIR = os.environ.get("NOAA_DIR", "/home/ubuntu/spark-notebooks/project/data/raw")

# Bronze Parquet outputs
BRONZE_OUT = os.environ.get("BRONZE_OUT", "/home/ubuntu/spark-notebooks/project/data/bronze")

print("Input dir:", RAW_DIR)
print("Bronze out:", BRONZE_OUT)

Input dir: /home/ubuntu/spark-notebooks/project/data/raw
Bronze out: /home/ubuntu/spark-notebooks/project/data/bronze


## 04. Read raw .txt file as a Spark text DataFrame

In [13]:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
import os

# Strictly use the combined TXT file
input_path = os.path.join(RAW_DIR, "ghcn_all_years.txt")
if not os.path.isfile(input_path):
    raise FileNotFoundError(f"Expected combined TXT file at {input_path}. Run the download cell to generate it.")

df_text = spark.read.text(input_path)
print("Reading from:", input_path)
print("Raw line count:", df_text.count())
df_text.show(3, truncate=False)

Reading from: /home/ubuntu/spark-notebooks/project/data/raw/ghcn_all_years.txt




Raw line count: 587797263
+----------------------------------+
|value                             |
+----------------------------------+
|ASN00010568,20100101,TMAX,320,,,a,|
|ASN00010568,20100101,TMIN,120,,,a,|
|ASN00010568,20100101,PRCP,0,,,a,  |
+----------------------------------+
only showing top 3 rows
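
Each raw line carries eight comma-separated fields, which the Bronze schema later names `station`, `date_str`, `element`, `raw_value`, `mflag`, `qflag`, `sflag` and `obstime`. As a plain-Python illustration of how one of the rows above decomposes, here is a small sketch; `describe_line` is a hypothetical helper, and the date conversion is shown only for orientation, since Bronze keeps `date_str` as a string.

```python
from datetime import datetime

FIELDS = ["station", "date_str", "element", "raw_value", "mflag", "qflag", "sflag", "obstime"]

def describe_line(line: str) -> dict:
    """Split one by-year CSV line into named fields; empty strings become None."""
    parts = line.split(",")
    record = {name: (value or None) for name, value in zip(FIELDS, parts)}
    # Illustrative only: typed dates belong to a later layer, not to Bronze.
    record["obs_date"] = datetime.strptime(record["date_str"], "%Y%m%d").date()
    return record

rec = describe_line("ASN00010568,20100101,TMAX,320,,,a,")
print(rec["element"], rec["raw_value"], rec["sflag"], rec["obs_date"])
# prints: TMAX 320 a 2010-01-01
```

GHCN-Daily reports temperatures in tenths of a degree Celsius, so the `320` in this row corresponds to a maximum of 32.0 °C.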



                                                                                

## 05. Safe Parsing of Raw Text Data

In [14]:
# Bronze-safe parsing: read raw lines and convert them into structured records

import time
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Schema for the Bronze layer: raw fields + ingestion metadata
bronze_schema = StructType([
    StructField("station", StringType(), True),
    StructField("date_str", StringType(), True),
    StructField("element", StringType(), True),
    StructField("raw_value", StringType(), True),
    StructField("mflag", StringType(), True),
    StructField("qflag", StringType(), True),
    StructField("sflag", StringType(), True),
    StructField("obstime", StringType(), True),
    StructField("year", StringType(), True),                    # first four characters of date_str, used for partitioning
    StructField("_ingestion_timestamp", DoubleType(), False),   # Unix epoch seconds from time.time()
    StructField("_source", StringType(), False),
    StructField("_status", StringType(), False),
    StructField("_raw_data", StringType(), True),
])

SOURCE_TAG = "ghcn_txt"

# Safe parser that captures all fields or records a parse error
def parse_line_safe(line: str):
    ts = time.time()
    try:
        parts = line.split(",")
        if len(parts) < 8:
            raise ValueError("not enough columns")
        station = parts[0] or None
        date_str = parts[1] or None
        element = parts[2] or None
        raw_value = parts[3] or None
        mflag = parts[4] or None
        qflag = parts[5] or None
        sflag = parts[6] or None
        obstime = parts[7] or None
        year = (date_str[:4] if date_str and len(date_str) >= 4 else None)
        return Row(
            station=station,
            date_str=date_str,
            element=element,
            raw_value=raw_value,
            mflag=mflag,
            qflag=qflag,
            sflag=sflag,
            obstime=obstime,
            year=year,
            _ingestion_timestamp=ts,
            _source=SOURCE_TAG,
            _status="valid",
            _raw_data=None,
        )
    except Exception:
        # Preserve the full raw line when parsing fails
        return Row(
            station=None,
            date_str=None,
            element=None,
            raw_value=None,
            mflag=None,
            qflag=None,
            sflag=None,
            obstime=None,
            year=None,
            _ingestion_timestamp=ts,
            _source=SOURCE_TAG,
            _status="parse_error",
            _raw_data=line,
        )

# Convert text input into RDD and apply safe parser
rdd = df_text.rdd.map(lambda r: parse_line_safe(r["value"]))

# Build the Bronze DataFrame from parsed rows
df_bronze = spark.createDataFrame(rdd, schema=bronze_schema)

print("Bronze schema:")
df_bronze.printSchema()

print("Bronze sample:")
df_bronze.show(5, truncate=False)

print("Bronze rows:", df_bronze.count())

Bronze schema:
root
 |-- station: string (nullable = true)
 |-- date_str: string (nullable = true)
 |-- element: string (nullable = true)
 |-- raw_value: string (nullable = true)
 |-- mflag: string (nullable = true)
 |-- qflag: string (nullable = true)
 |-- sflag: string (nullable = true)
 |-- obstime: string (nullable = true)
 |-- year: string (nullable = true)
 |-- _ingestion_timestamp: double (nullable = false)
 |-- _source: string (nullable = false)
 |-- _status: string (nullable = false)
 |-- _raw_data: string (nullable = true)

Bronze sample:
+-----------+--------+-------+---------+-----+-----+-----+-------+----+--------------------+--------+-------+---------+
|station    |date_str|element|raw_value|mflag|qflag|sflag|obstime|year|_ingestion_timestamp|_source |_status|_raw_data|
+-----------+--------+-------+---------+-----+-----+-----+-------+----+--------------------+--------+-------+---------+
|ASN00010568|20100101|TMAX   |320      |NULL |NULL |a    |NULL   |2010|1.764176707462



Bronze rows: 587797263
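
The parser's contract can be exercised without Spark. The sketch below mirrors only the status logic of `parse_line_safe` (the field-count check and the catch-all `except`) in a hypothetical `classify` helper. It shows that the check is purely structural: any line with at least eight fields is labelled `valid` regardless of what the fields contain, so value-level validation is left to later layers.

```python
from collections import Counter

def classify(line):
    """Mirror the status logic of parse_line_safe: only the field count is checked."""
    try:
        if len(line.split(",")) < 8:
            raise ValueError("not enough columns")
        return "valid"
    except Exception:
        # Also reached when line is None (AttributeError on .split).
        return "parse_error"

samples = [
    "ASN00010568,20100101,TMAX,320,,,a,",  # well-formed row from the raw file
    "ASN00010568,20100101,TMAX",           # truncated: too few fields
    "",                                    # empty line
    None,                                  # missing value
    "x,not-a-date,??,abc,,,,",             # eight fields, but nonsense content
]
counts = Counter(classify(s) for s in samples)
print(counts)
```

On the real Bronze DataFrame, the equivalent check would be a `groupBy("_status").count()`.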


                                                                                

## 06. Write Bronze as Parquet

In [15]:
# Write Bronze as Parquet, partitioned by year.
# Note: mode("overwrite") replaces any existing output under BRONZE_OUT.
df_bronze.write.mode("overwrite").partitionBy("year").parquet(BRONZE_OUT)
print("Bronze Parquet written to:", BRONZE_OUT)

25/11/26 17:21:30 WARN MemoryManager: Total allocation exceeds 95.00% (925,918,811 bytes) of heap memory
Scaling row group sizes to 98.55% for 7 writers
25/11/26 17:21:30 WARN MemoryManager: Total allocation exceeds 95.00% (925,918,811 bytes) of heap memory
Scaling row group sizes to 86.23% for 8 writers

Bronze Parquet written to: /home/ubuntu/spark-notebooks/project/data/bronze
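
`partitionBy("year")` writes one Hive-style subdirectory per distinct value, named `year=<value>`. With Spark's default settings, rows whose `year` is null (here, the parse errors) should land in `year=__HIVE_DEFAULT_PARTITION__`. A small sketch for checking which partitions were produced follows; `list_partitions` is a hypothetical helper and assumes the output lives on a local filesystem, as it does in this notebook.

```python
import os

def list_partitions(out_dir, column="year"):
    """Return the sorted partition values found as '<column>=<value>' subdirectories."""
    prefix = f"{column}="
    values = []
    for entry in os.listdir(out_dir):
        # Skip marker files such as _SUCCESS; only partition directories count.
        if entry.startswith(prefix) and os.path.isdir(os.path.join(out_dir, entry)):
            values.append(entry[len(prefix):])
    return sorted(values)
```

Running `list_partitions(BRONZE_OUT)` after the write should list the years 2010 through 2025, plus the default partition if any lines failed to parse.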


                                                                                

## 07. Inspect elements present

In [16]:
elements = df_bronze.select("element").distinct().orderBy("element")
print("Distinct element count:", elements.count())
elements.show(50, truncate=False)

                                                                                

Distinct element count: 113




+-------+
|element|
+-------+
|ADPT   |
|ASLP   |
|ASTP   |
|AWBT   |
|AWDR   |
|AWND   |
|DAEV   |
|DAPR   |
|DASF   |
|DATN   |
|DATX   |
|DAWM   |
|DWPR   |
|EVAP   |
|FMTM   |
|MDEV   |
|MDPR   |
|MDSF   |
|MDTN   |
|MDTX   |
|MDWM   |
|MNPN   |
|MXPN   |
|PGTM   |
|PRCP   |
|PSUN   |
|RHAV   |
|RHMN   |
|RHMX   |
|SN02   |
|SN03   |
|SN11   |
|SN12   |
|SN13   |
|SN14   |
|SN21   |
|SN22   |
|SN23   |
|SN31   |
|SN32   |
|SN33   |
|SN34   |
|SN35   |
|SN36   |
|SN51   |
|SN52   |
|SN53   |
|SN54   |
|SN55   |
|SN56   |
+-------+
only showing top 50 rows
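
Of the 113 codes, GHCN-Daily documents five as its core elements: `PRCP`, `SNOW`, `SNWD`, `TMAX` and `TMIN`. Since this project focuses on temperature and precipitation, it may help to separate those from the rest once the distinct list has been collected to the driver. A minimal sketch, where `CORE_ELEMENTS` and `split_core` are illustrative names and the units should be checked against the GHCN-Daily readme:

```python
CORE_ELEMENTS = {
    "PRCP": "precipitation (tenths of mm)",
    "SNOW": "snowfall (mm)",
    "SNWD": "snow depth (mm)",
    "TMAX": "maximum temperature (tenths of degrees C)",
    "TMIN": "minimum temperature (tenths of degrees C)",
}

def split_core(codes):
    """Partition element codes into (core, other), each sorted alphabetically."""
    core = sorted(c for c in codes if c in CORE_ELEMENTS)
    other = sorted(c for c in codes if c not in CORE_ELEMENTS)
    return core, other

core, other = split_core(["ADPT", "PRCP", "TMAX", "AWND", "TMIN"])
print(core)   # ['PRCP', 'TMAX', 'TMIN']
print(other)  # ['ADPT', 'AWND']
```

In the notebook, the input list could come from `[r["element"] for r in elements.collect()]`, which is small enough to bring to the driver.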



                                                                                