# **NOAA GHCN Project Using Spark Medallion Architecture**

## **Project Overview**

* Collect daily climate observations from the **NOAA GHCN-Daily dataset (2010–2025)**
* Dataset source: [Global Historical Climatology Network – Daily (GHCN-D)](https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily)
* Process the data through the **Bronze → Silver → Gold** Medallion Architecture using Spark
* Perform exploratory analysis to study **historical climate trends** in temperature and precipitation
* Use machine-learning methods to identify and predict:

  * Temperature anomalies
  * Rainfall patterns
  * Regional climate clusters
  * Potential extreme events (e.g., heatwaves)

---

# **Bronze Layer**

In the Bronze layer we ingest the raw GHCN-Daily text file and convert it into a structured, append-only dataset using Spark.

**Key steps:**
- Download the selected years of the dataset (one-time step)
- Read the combined raw file as a text DataFrame (one line per record)
- Parse each line into a structured schema with a safe parser that:
  - labels well-formed records as `_status = "valid"`
  - routes malformed lines to `_status = "parse_error"` and keeps the original raw line in `_raw_data`
- Write the Bronze dataset as Parquet, partitioned by `year`, to the configured `BRONZE_OUT` path.
- Inspect the Bronze output by printing the schema, showing sample rows, and listing distinct `element` values present in the raw data.
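
The valid/parse_error routing in the steps above can be illustrated without Spark. The following is a minimal plain-Python sketch, not the notebook's parser: `route_line` and `FIELDS` are hypothetical names, and the function returns a dict rather than a Spark `Row`. It applies the same rule as the Bronze parser, treating any line with fewer than eight comma-separated fields as malformed.

```python
from typing import Optional

# Column order of one raw GHCN-Daily CSV line, as used in the Bronze schema.
FIELDS = ["station", "date_str", "element", "raw_value",
          "mflag", "qflag", "sflag", "obstime"]

def route_line(line: Optional[str]) -> dict:
    """Hypothetical helper: split one raw line and label it valid or parse_error."""
    # Start from the parse_error shape so every early return is already correct.
    record = {name: None for name in FIELDS}
    record.update({"year": None, "_status": "parse_error", "_raw_data": line})
    if line is None:
        return record
    parts = line.split(",")
    if len(parts) < len(FIELDS):
        return record  # malformed: the raw line is kept for later inspection
    # Empty strings become None so that missing flags read as nulls downstream.
    for name, value in zip(FIELDS, parts):
        record[name] = value or None
    date_str = record["date_str"]
    record["year"] = date_str[:4] if date_str and len(date_str) >= 4 else None
    record["_status"] = "valid"
    record["_raw_data"] = None
    return record

good = route_line("ASN00010195,20100101,PRCP,0,,,a,")
bad = route_line("ASN00010195,20100101")
print(good["_status"], good["year"], good["sflag"])
print(bad["_status"], bad["_raw_data"])
```

Keeping the malformed line in `_raw_data` means nothing is dropped at ingestion, so bad records can be counted and examined later instead of silently disappearing.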

## 01. Download GHCN Data and Metadata

In [None]:
# Download GHCN files to a local folder and combine them into a single text file using the helper script.
#import os, subprocess
#
#years = list(range(2010, 2026))  # e.g., 2010..2025
#target_dir = os.environ.get("NOAA_DIR", "/home/ubuntu/spark-notebooks/project/data/raw")
#os.makedirs(target_dir, exist_ok=True)
#
#env = os.environ.copy()
#env["DATA_DIR"] = target_dir
#
#try:
#    # Use the helper script located at project root
#    cmd = ["bash", "/home/ubuntu/spark-notebooks/project/scripts/ghcn_download.sh", *[str(y) for y in years]]
#    print("Running:", cmd)
#    subprocess.run(cmd, check=True, env=env)
#except Exception as e:
#    print("Error downloading GHCN data:", e)

In [None]:
# # Download GHCN metadata files (stations & inventory)
# import os, subprocess

# # Define where metadata will be stored
# meta_dir = "/home/ubuntu/spark-notebooks/project/data/meta"
# os.makedirs(meta_dir, exist_ok=True)

# # URLs for metadata files
# urls = {
#     "ghcnd-stations.txt": "https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
#     "ghcnd-inventory.txt": "https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-inventory.txt"
# }

# for filename, url in urls.items():
#     dest_path = os.path.join(meta_dir, filename)
#     if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
#         print(f"[skip] {filename} already exists at {dest_path}")
#         continue
#     try:
#         print(f"[download] {filename} from {url}")
#         subprocess.run(["wget", "-O", dest_path, url], check=True)
#         print(f"[done] Saved to {dest_path}")
#     except Exception as e:
#         print(f"[error] Could not download {filename}: {e}")

# # Show downloaded files
# print("\nMetadata files:")
# !ls -lh /home/ubuntu/spark-notebooks/project/data/meta

## 02. Spark Config

In [3]:
# Create Spark session
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("NOAA-GHCN-Bronze")
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")
print("Spark version:", spark.version)
spark

Spark version: 3.5.0


## 03. Data Path Config

In [4]:
# Define input/output paths
import os

# Local folder with raw .txt files
RAW_DIR = os.environ.get("NOAA_DIR", "/home/ubuntu/spark-notebooks/project/data/raw")

# Bronze Parquet outputs
BRONZE_OUT = os.environ.get("BRONZE_OUT", "/home/ubuntu/spark-notebooks/project/data/bronze")

print("Input dir:", RAW_DIR)
print("Bronze out:", BRONZE_OUT)

Input dir: /home/ubuntu/spark-notebooks/project/data/raw
Bronze out: /home/ubuntu/spark-notebooks/project/data/bronze


## 04. Read Raw .txt File as a Spark Text DataFrame

In [5]:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
import os

# Strictly use the combined TXT file
input_path = os.path.join(RAW_DIR, "ghcn_all_years.txt")
if not os.path.isfile(input_path):
    raise FileNotFoundError(f"Expected combined TXT file at {input_path}. Run the download cell to generate it.")

df_text = spark.read.text(input_path)
print("Reading from:", input_path)
print("Raw line count:", df_text.count())
df_text.show(3, truncate=False)

Reading from: /home/ubuntu/spark-notebooks/project/data/raw/ghcn_all_years.txt
Raw line count: 585758447
+--------------------------------+
|value                           |
+--------------------------------+
|ASN00010195,20100101,PRCP,0,,,a,|
|ASN00010160,20100101,PRCP,0,,,a,|
|ASN00010163,20100101,PRCP,0,,,a,|
+--------------------------------+
only showing top 3 rows



## 05. Safe Parsing of Raw Text Data

In [6]:
# Bronze-safe parsing: read raw lines and convert them into structured records

import time
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Schema for the Bronze layer: raw fields + ingestion metadata
bronze_schema = StructType([
    StructField("station", StringType(), True),
    StructField("date_str", StringType(), True),
    StructField("element", StringType(), True),
    StructField("raw_value", StringType(), True),
    StructField("mflag", StringType(), True),
    StructField("qflag", StringType(), True),
    StructField("sflag", StringType(), True),
    StructField("obstime", StringType(), True),
    StructField("year", StringType(), True),                    # extracted year for partitioning
    StructField("_ingestion_timestamp", DoubleType(), False),
    StructField("_source", StringType(), False),
    StructField("_status", StringType(), False),
    StructField("_raw_data", StringType(), True),
])

SOURCE_TAG = "ghcn_txt"

# Safe parser that captures all fields or records a parse error
def parse_line_safe(line: str):
    ts = time.time()
    try:
        parts = line.split(",")
        if len(parts) < 8:
            raise ValueError("not enough columns")
        station = parts[0] or None
        date_str = parts[1] or None
        element = parts[2] or None
        raw_value = parts[3] or None
        mflag = parts[4] or None
        qflag = parts[5] or None
        sflag = parts[6] or None
        obstime = parts[7] or None
        year = (date_str[:4] if date_str and len(date_str) >= 4 else None)
        return Row(
            station=station,
            date_str=date_str,
            element=element,
            raw_value=raw_value,
            mflag=mflag,
            qflag=qflag,
            sflag=sflag,
            obstime=obstime,
            year=year,
            _ingestion_timestamp=ts,
            _source=SOURCE_TAG,
            _status="valid",
            _raw_data=None,
        )
    except Exception:
        # Preserve the full raw line when parsing fails
        return Row(
            station=None,
            date_str=None,
            element=None,
            raw_value=None,
            mflag=None,
            qflag=None,
            sflag=None,
            obstime=None,
            year=None,
            _ingestion_timestamp=ts,
            _source=SOURCE_TAG,
            _status="parse_error",
            _raw_data=line,
        )

# Convert text input into RDD and apply safe parser
rdd = df_text.rdd.map(lambda r: parse_line_safe(r["value"]))

# Build the Bronze DataFrame from parsed rows
df_bronze = spark.createDataFrame(rdd, schema=bronze_schema)

print("Bronze schema:")
df_bronze.printSchema()

print("Bronze sample:")
df_bronze.show(5, truncate=False)

print("Bronze rows:", df_bronze.count())

Bronze schema:
root
 |-- station: string (nullable = true)
 |-- date_str: string (nullable = true)
 |-- element: string (nullable = true)
 |-- raw_value: string (nullable = true)
 |-- mflag: string (nullable = true)
 |-- qflag: string (nullable = true)
 |-- sflag: string (nullable = true)
 |-- obstime: string (nullable = true)
 |-- year: string (nullable = true)
 |-- _ingestion_timestamp: double (nullable = false)
 |-- _source: string (nullable = false)
 |-- _status: string (nullable = false)
 |-- _raw_data: string (nullable = true)

Bronze sample:

+-----------+--------+-------+---------+-----+-----+-----+-------+----+--------------------+--------+-------+---------+
|station    |date_str|element|raw_value|mflag|qflag|sflag|obstime|year|_ingestion_timestamp|_source |_status|_raw_data|
+-----------+--------+-------+---------+-----+-----+-----+-------+----+--------------------+--------+-------+---------+
|ASN00010195|20100101|PRCP   |0        |NULL |NULL |a    |NULL   |2010|1.7640004042068837E9|ghcn_txt|valid  |NULL     |
|ASN00010160|20100101|PRCP   |0        |NULL |NULL |a    |NULL   |2010|1.764000404206972E9 |ghcn_txt|valid  |NULL     |
|ASN00010163|20100101|PRCP   |0        |NULL |NULL |a    |NULL   |2010|1.764000404206985E9 |ghcn_txt|valid  |NULL     |
|ASN00010192|20100101|PRCP   |0        |NULL |NULL |a    |NULL   |2010|1.7640004042070355E9|ghcn_txt|valid  |NULL     |
|ASN00010111|20100101|TMAX   |330      |NULL |NULL |a    |NULL   |2010|1.7640004042070467E9|ghcn_txt|valid  |NULL     |
+-----------+--------+-------+---------+-----+-----+-----+-------+----+--------------------+--------+-------+---------+



Bronze rows: 585758447
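
The parser above derives `year` from the first four characters of `date_str` without checking that the field is a real date, so a line with eight fields but a malformed date would still be labeled valid and could produce an unexpected partition value. One possible stricter check is sketched below. `is_valid_date_str` is a hypothetical helper that is not part of the notebook, and whether to apply such a check in Bronze or defer it to Silver is a design choice.

```python
from datetime import datetime
from typing import Optional

def is_valid_date_str(date_str: Optional[str]) -> bool:
    """Return True only for an eight-digit YYYYMMDD string naming a real calendar date."""
    # The length and digit checks come first because strptime alone would
    # accept shorter inputs such as "2010011".
    if date_str is None or len(date_str) != 8:
        return False
    if not (date_str.isascii() and date_str.isdigit()):
        return False
    try:
        datetime.strptime(date_str, "%Y%m%d")
    except ValueError:
        return False  # e.g. month 13 or February 30
    return True

for candidate in ["20100101", "20100230", "2010011", "abcd0101", None]:
    print(candidate, is_valid_date_str(candidate))
```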



## 06. Write Bronze as Parquet

In [5]:
# Write Bronze as Parquet, partitioned by year
df_bronze.write.mode("overwrite").partitionBy("year").parquet(BRONZE_OUT)
print("Bronze Parquet written to:", BRONZE_OUT)

25/11/11 23:07:07 WARN MemoryManager: Total allocation exceeds 95.00% (964,270,478 bytes) of heap memory
Scaling row group sizes to 89.80% for 8 writers
25/11/11 23:08:33 WARN MemoryManager: Total allocation exceeds 95.00% (964,270,478 bytes) of heap memory
Scaling row group sizes to 89.80% for 8 writers
25/11/11 23:08:45 WARN MemoryManager: Total allocation exceeds 95.00% (964,270,478 bytes) of heap memory
Scaling row group sizes to 89.80% for 8 writers
25/11/11 23:09:40 WARN MemoryManager: Total allocation exceeds 95.00% (964,270,478 bytes) of heap memory
Scaling row group sizes to 89.80% for 8 writers

Bronze Parquet written to: /home/ubuntu/spark-notebooks/data/bronze
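
`partitionBy("year")` writes one Hive-style subdirectory per distinct value, named `year=<value>`, and Spark places rows whose partition value is null under `year=__HIVE_DEFAULT_PARTITION__`. Because the parser sets `year` to null for `parse_error` rows, any such rows would be expected in that directory. A quick way to see which partitions exist is sketched below. `list_year_partitions` is a hypothetical helper, and the sketch assumes the output path is a local directory rather than HDFS or object storage.

```python
import tempfile
from pathlib import Path
from typing import List

def list_year_partitions(bronze_dir: str) -> List[str]:
    """Return the sorted partition values found as year=<value> subdirectories."""
    prefix = "year="
    root = Path(bronze_dir)
    if not root.is_dir():
        return []
    return sorted(
        child.name[len(prefix):]
        for child in root.iterdir()
        if child.is_dir() and child.name.startswith(prefix)
    )

# Demo on a throwaway directory that mimics the layout Spark produces.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["year=2011", "year=2010", "year=__HIVE_DEFAULT_PARTITION__"]:
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "_SUCCESS").touch()  # marker file, ignored by the helper
    demo_partitions = list_year_partitions(tmp)

print(demo_partitions)
```

In this notebook the call would be `list_year_partitions(BRONZE_OUT)`.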



## 07. Inspect Elements Present

In [None]:
elements = df_bronze.select("element").distinct().orderBy("element")
print("Distinct element count:", elements.count())
elements.show(50, truncate=False)

Distinct element count: 113

+-------+
|element|
+-------+
|ADPT   |
|ASLP   |
|ASTP   |
|AWBT   |
|AWDR   |
|AWND   |
|DAEV   |
|DAPR   |
|DASF   |
|DATN   |
|DATX   |
|DAWM   |
|DWPR   |
|EVAP   |
|FMTM   |
|MDEV   |
|MDPR   |
|MDSF   |
|MDTN   |
|MDTX   |
|MDWM   |
|MNPN   |
|MXPN   |
|PGTM   |
|PRCP   |
|PSUN   |
|RHAV   |
|RHMN   |
|RHMX   |
|SN02   |
|SN03   |
|SN11   |
|SN12   |
|SN13   |
|SN14   |
|SN21   |
|SN22   |
|SN23   |
|SN31   |
|SN32   |
|SN33   |
|SN34   |
|SN35   |
|SN36   |
|SN51   |
|SN52   |
|SN53   |
|SN54   |
|SN55   |
|SN56   |
+-------+
only showing top 50 rows
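
Of the 113 element codes, the GHCN-Daily documentation singles out five core elements: `PRCP`, `SNOW`, `SNWD`, `TMAX`, and `TMIN`. Precipitation and temperature are stored as integers in tenths of a unit, which is why the `TMAX` value `330` in the Bronze sample corresponds to 33.0 °C. The sketch below shows one way this scaling could be applied in a later layer. `CORE_ELEMENTS` and `scale_value` are hypothetical names, not part of the notebook.

```python
from typing import Optional

# Core GHCN-Daily elements: (description, unit after scaling, divisor for the raw integer).
CORE_ELEMENTS = {
    "PRCP": ("precipitation", "mm", 10),
    "SNOW": ("snowfall", "mm", 1),
    "SNWD": ("snow depth", "mm", 1),
    "TMAX": ("maximum temperature", "degC", 10),
    "TMIN": ("minimum temperature", "degC", 10),
}

def scale_value(element: Optional[str], raw_value: Optional[str]) -> Optional[float]:
    """Convert a raw Bronze string to physical units, or None if it cannot be converted."""
    if element not in CORE_ELEMENTS or raw_value is None:
        return None  # non-core elements are left for separate handling
    try:
        raw = int(raw_value)
    except ValueError:
        return None  # non-numeric raw value
    divisor = CORE_ELEMENTS[element][2]
    return raw / divisor

print(scale_value("TMAX", "330"))
print(scale_value("PRCP", "5"))
print(scale_value("AWND", "10"))
```

Bronze keeps `raw_value` as a string on purpose, so a conversion like this fits more naturally in the Silver layer, where typed columns are introduced.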


