# Silver → Gold: NASA Engine Summary

## Purpose
Aggregate turbofan sensor data by engine unit for predictive maintenance analytics

## Source
- Silver: `nasa_turbofan_train`

## Transformations
- Aggregate by engine unit (unit_id + dataset_name)
- Calculate total cycles (engine lifetime)
- Compute sensor statistics (min, max, avg, stddev)
- Identify last cycle readings (pre-failure state)

## Output
- Gold: `fact_nasa_engines`
- Grain: One row per engine unit
- Use Case: RUL prediction, failure analysis

**Author:** Kevin  
**Date:** Feb 9, 2026


In [0]:
from pyspark.sql.functions import (
    col, count, max, min, avg, stddev, last, first, 
    current_timestamp, round as spark_round
)

storage_account_name = "stgolistmigration"
account_key = ""

spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    account_key
)

def get_silver_path(table):
    return f"abfss://silver@{storage_account_name}.dfs.core.windows.net/{table}/"

def get_gold_path(table):
    return f"abfss://gold@{storage_account_name}.dfs.core.windows.net/{table}/"

print("✅ Config loaded")


✅ Config loaded


In [0]:
print("📖 Loading NASA Silver data...")

nasa_path = get_silver_path("nasa_turbofan_train")
df_nasa = spark.read.format("delta").load(nasa_path)

print(f"✅ Loaded: {df_nasa.count():,} sensor readings")
print(f"   Unique engines: {df_nasa.select('unit_id', 'dataset_name').distinct().count()}")


📖 Loading NASA Silver data...
✅ Loaded: 160,359 sensor readings
   Unique engines: 709


In [0]:
print("🔄 Aggregating by engine unit...")

# Create engine key
df_nasa_keyed = df_nasa.withColumn(
    "engine_key", 
    col("unit_id").cast("string") + "_" + col("dataset_name")
)

# Aggregate by engine
df_gold_engines = df_nasa_keyed.groupBy("engine_key", "dataset_name", "unit_id").agg(
    count("*").alias("total_cycles"),
    max("time_cycle").alias("max_cycle"),
    
    # Sensor 1 stats (Fan inlet temperature)
    avg("sensor_1").alias("sensor_1_avg"),
    stddev("sensor_1").alias("sensor_1_stddev"),
    
    # Sensor 2 stats (LPC outlet temperature)
    avg("sensor_2").alias("sensor_2_avg"),
    stddev("sensor_2").alias("sensor_2_stddev"),
    
    # Sensor 3 stats (HPC outlet temperature)
    avg("sensor_3").alias("sensor_3_avg"),
    stddev("sensor_3").alias("sensor_3_stddev"),
    
    # Sensor 4 stats (LPT outlet temperature)
    avg("sensor_4").alias("sensor_4_avg"),
    stddev("sensor_4").alias("sensor_4_stddev"),
    
    # Sensor 7 stats (HPC outlet pressure)
    avg("sensor_7").alias("sensor_7_avg"),
    stddev("sensor_7").alias("sensor_7_stddev"),
    
    # Sensor 11 stats (HPC outlet static pressure)
    avg("sensor_11").alias("sensor_11_avg"),
    stddev("sensor_11").alias("sensor_11_stddev"),
    
    # Sensor 12 stats (Fuel flow ratio)
    avg("sensor_12").alias("sensor_12_avg"),
    stddev("sensor_12").alias("sensor_12_stddev")
)

# Round values
for col_name in df_gold_engines.columns:
    if "avg" in col_name or "stddev" in col_name:
        df_gold_engines = df_gold_engines.withColumn(
            col_name, spark_round(col(col_name), 2)
        )

df_gold_engines = df_gold_engines.withColumn("gold_ingestion_timestamp", current_timestamp())

print(f"✅ Aggregated to {df_gold_engines.count():,} engines")

# Show sample
df_gold_engines.limit(2).show(truncate=False, vertical=True)


🔄 Aggregating by engine unit...
✅ Aggregated to 709 engines
-RECORD 0----------------------------------------------
 engine_key               | NULL                       
 dataset_name             | FD004                      
 unit_id                  | 55                         
 total_cycles             | 344                        
 max_cycle                | 344                        
 sensor_1_avg             | 474.08                     
 sensor_1_stddev          | 26.71                      
 sensor_2_avg             | 580.86                     
 sensor_2_stddev          | 37.95                      
 sensor_3_avg             | 1421.65                    
 sensor_3_stddev          | 107.95                     
 sensor_4_avg             | 1206.78                    
 sensor_4_stddev          | 120.45                     
 sensor_7_avg             | 287.86                     
 sensor_7_stddev          | 148.69                     
 sensor_11_avg            | 42.97           

In [0]:
output_path = get_gold_path("fact_nasa_engines")

print(f"💾 Writing to: {output_path}")

df_gold_engines.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .partitionBy("dataset_name") \
    .save(output_path)

print("✅ NASA Gold complete!")


💾 Writing to: abfss://gold@stgolistmigration.dfs.core.windows.net/fact_nasa_engines/
✅ NASA Gold complete!


In [0]:
print("🔍 Verifying...")

df_verify = spark.read.format("delta").load(output_path)

print(f"✅ Verified: {df_verify.count():,} engines")

print("\nEngine distribution by dataset:")
df_verify.groupBy("dataset_name").count().orderBy("dataset_name").show()

print("\nCycle statistics:")
df_verify.select("total_cycles").describe().show()

print("\nSample engine summary:")
df_verify.select(
    "engine_key", "dataset_name", "total_cycles", 
    "sensor_1_avg", "sensor_7_avg"
).limit(5).show(truncate=False)

print("🎉 NASA → Gold complete!")


🔍 Verifying...
✅ Verified: 709 engines

Engine distribution by dataset:
+------------+-----+
|dataset_name|count|
+------------+-----+
|       FD001|  100|
|       FD002|  260|
|       FD003|  100|
|       FD004|  249|
+------------+-----+


Cycle statistics:
+-------+------------------+
|summary|      total_cycles|
+-------+------------------+
|  count|               709|
|   mean|226.17630465444287|
| stddev| 66.43478346985647|
|    min|               128|
|    max|               543|
+-------+------------------+


Sample engine summary:
+----------+------------+------------+------------+------------+
|engine_key|dataset_name|total_cycles|sensor_1_avg|sensor_7_avg|
+----------+------------+------------+------------+------------+
|NULL      |FD002       |149         |471.63      |276.06      |
|NULL      |FD002       |211         |473.86      |287.68      |
|NULL      |FD002       |249         |473.87      |288.0       |
|NULL      |FD002       |192         |472.81      |280.93      |