# Bronze → Silver: NASA Turbofan Engine Data

## Purpose
Process NASA C-MAPSS turbofan engine degradation sensor data

## Dataset
- **Source:** NASA Prognostics Center of Excellence
- **Files:** train_FD001-004.txt (training data with run-to-failure cycles)
- **Sensors:** 21 sensor readings + operational settings
- **Use Case:** Predictive maintenance, RUL prediction

## Transformations
- Parse space-delimited files
- Add proper column names
- Filter invalid readings
- Normalize sensor IDs
- Partition by engine unit

**Author:** Kevin  
**Date:** Feb 9, 2026


In [0]:
from pyspark.sql.functions import (
    col, monotonically_increasing_id, current_timestamp, lit
)
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

storage_account_name = "stgolistmigration"
account_key = ""

spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    account_key
)

def get_bronze_path(filename):
    return f"abfss://bronze@{storage_account_name}.dfs.core.windows.net/{filename}"

def get_silver_path(table):
    return f"abfss://silver@{storage_account_name}.dfs.core.windows.net/{table}/"

print("✅ Config loaded")


✅ Config loaded


In [0]:
# NASA turbofan column schema
# Format: unit_id, time_cycle, 3 operational settings, 21 sensor measurements

column_names = [
    "unit_id",           # Engine unit number
    "time_cycle",        # Operational cycle
    "setting_1",         # Operational setting 1
    "setting_2",         # Operational setting 2  
    "setting_3",         # Operational setting 3
    "sensor_1",          # Fan inlet temperature
    "sensor_2",          # LPC outlet temperature
    "sensor_3",          # HPC outlet temperature
    "sensor_4",          # LPT outlet temperature
    "sensor_5",          # Fan inlet pressure
    "sensor_6",          # Bypass-duct pressure
    "sensor_7",          # HPC outlet pressure
    "sensor_8",          # Physical fan speed
    "sensor_9",          # Physical core speed
    "sensor_10",         # Engine pressure ratio
    "sensor_11",         # HPC outlet static pressure
    "sensor_12",         # Ratio of fuel flow to Ps30
    "sensor_13",         # Corrected fan speed
    "sensor_14",         # Corrected core speed
    "sensor_15",         # Bypass ratio
    "sensor_16",         # Burner fuel-air ratio
    "sensor_17",         # Bleed enthalpy
    "sensor_18",         # Demanded fan speed
    "sensor_19",         # Demanded corrected fan speed
    "sensor_20",         # HPT coolant bleed
    "sensor_21"          # LPT coolant bleed
]

print(f"✅ Schema defined: {len(column_names)} columns")


✅ Schema defined: 26 columns


In [0]:
print("📖 Reading NASA training data...")

# Read all 4 training files
datasets = []
max_cols = 0

# First pass: find max columns
for i in range(1, 5):
    file_path = get_bronze_path(f"train_FD00{i}.txt")
    df_temp = spark.read \
        .option("delimiter", " ") \
        .option("inferSchema", "true") \
        .csv(file_path)
    
    # Remove empty columns
    actual_cols = [c for c in df_temp.columns if df_temp.select(c).distinct().count() > 1]
    max_cols = max(max_cols, len(actual_cols))

print(f"Max columns found: {max_cols}")

# Second pass: read and align all to max columns
for i in range(1, 5):
    file_path = get_bronze_path(f"train_FD00{i}.txt")
    
    print(f"   Reading FD00{i}...")
    
    df_temp = spark.read \
        .option("delimiter", " ") \
        .option("inferSchema", "true") \
        .csv(file_path)
    
    # Remove empty columns
    actual_cols = [c for c in df_temp.columns if df_temp.select(c).distinct().count() > 1]
    df_temp = df_temp.select(actual_cols)
    
    # Rename to standard column names
    for idx in range(len(actual_cols)):
        if idx < len(column_names):
            df_temp = df_temp.withColumnRenamed(df_temp.columns[idx], column_names[idx])
    
    # Add missing columns with null values
    for idx in range(len(actual_cols), len(column_names)):
        if column_names[idx] not in df_temp.columns:
            df_temp = df_temp.withColumn(column_names[idx], col("unit_id").cast("double") * 0)  # Add as 0.0
    
    # Add dataset identifier
    df_temp = df_temp.withColumn("dataset_name", lit(f"FD00{i}"))
    
    # Ensure same column order
    df_temp = df_temp.select(column_names + ["dataset_name"])
    
    datasets.append(df_temp)
    print(f"      Loaded: {df_temp.count():,} cycles, {len(df_temp.columns)} columns")

# Union all datasets
df_nasa_bronze = datasets[0]
for df in datasets[1:]:
    df_nasa_bronze = df_nasa_bronze.union(df)

print(f"\n✅ Total cycles: {df_nasa_bronze.count():,}")
print(f"   Columns: {len(df_nasa_bronze.columns)}")


📖 Reading NASA training data...
Max columns found: 26
   Reading FD001...
      Loaded: 20,631 cycles, 27 columns
   Reading FD002...
      Loaded: 53,759 cycles, 27 columns
   Reading FD003...
      Loaded: 24,720 cycles, 27 columns
   Reading FD004...
      Loaded: 61,249 cycles, 27 columns

✅ Total cycles: 160,359
   Columns: 27


In [0]:
print("🔍 Data Quality Check")
print("=" * 80)

print("\n1️⃣ RECORD COUNTS:")
df_nasa_bronze.groupBy("dataset_name") \
    .count() \
    .orderBy("dataset_name") \
    .show(truncate=False)

print("\n2️⃣ UNIQUE ENGINES:")
total_engines = df_nasa_bronze.select("unit_id", "dataset_name").distinct().count()
print(f"Total unique engines: {total_engines}")

print("\n3️⃣ CYCLE RANGE:")
df_nasa_bronze.agg(
    {"time_cycle": "min"}
).show()

df_nasa_bronze.agg(
    {"time_cycle": "max"}
).show()

print("\n4️⃣ SAMPLE DATA:")
df_nasa_bronze.limit(3).show(5, truncate=False, vertical=True)


🔍 Data Quality Check

1️⃣ RECORD COUNTS:
+------------+-----+
|dataset_name|count|
+------------+-----+
|FD001       |20631|
|FD002       |53759|
|FD003       |24720|
|FD004       |61249|
+------------+-----+


2️⃣ UNIQUE ENGINES:
Total unique engines: 709

3️⃣ CYCLE RANGE:
+---------------+
|min(time_cycle)|
+---------------+
|              1|
+---------------+

+---------------+
|max(time_cycle)|
+---------------+
|            543|
+---------------+


4️⃣ SAMPLE DATA:
-RECORD 0---------------
 unit_id      | 1       
 time_cycle   | 1       
 setting_1    | -7.0E-4 
 setting_2    | -4.0E-4 
 setting_3    | 641.82  
 sensor_1     | 1589.7  
 sensor_2     | 1400.6  
 sensor_3     | 21.61   
 sensor_4     | 554.36  
 sensor_5     | 2388.06 
 sensor_6     | 9046.19 
 sensor_7     | 47.47   
 sensor_8     | 521.66  
 sensor_9     | 2388.02 
 sensor_10    | 8138.62 
 sensor_11    | 8.4195  
 sensor_12    | 392.0   
 sensor_13    | 39.06   
 sensor_14    | 23.419  
 sensor_15    | 0.0     


In [0]:
print("🔄 Transforming NASA data...")

df_nasa_silver = df_nasa_bronze \
    .filter(col("unit_id").isNotNull()) \
    .filter(col("time_cycle").isNotNull()) \
    .withColumn("record_id", monotonically_increasing_id()) \
    .withColumn("ingestion_timestamp", current_timestamp())

silver_count = df_nasa_silver.count()

print(f"✅ Transformation complete")
print(f"   Silver rows: {silver_count:,}")


🔄 Transforming NASA data...
✅ Transformation complete
   Silver rows: 160,359


In [0]:
output_path = get_silver_path("nasa_turbofan_train")

print(f"💾 Writing to: {output_path}")

df_nasa_silver.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .partitionBy("dataset_name") \
    .save(output_path)

print("✅ NASA Silver complete!")


💾 Writing to: abfss://silver@stgolistmigration.dfs.core.windows.net/nasa_turbofan_train/
✅ NASA Silver complete!


In [0]:
print("🔍 Verifying...")

df_verify = spark.read.format("delta").load(output_path)

print(f"✅ Verified: {df_verify.count():,} sensor readings")
print(f"   Unique engines: {df_verify.select('dataset_name').distinct().count()}")

print("\nDataset distribution:")
df_verify.groupBy("dataset_name") \
    .count() \
    .orderBy("dataset_name") \
    .show(truncate=False)

print("=" * 80)
print("🎉 NASA Bronze → Silver complete!")


🔍 Verifying...
✅ Verified: 160,359 sensor readings
   Unique engines: 4

Dataset distribution:
+------------+-----+
|dataset_name|count|
+------------+-----+
|FD001       |20631|
|FD002       |53759|
|FD003       |24720|
|FD004       |61249|
+------------+-----+

🎉 NASA Bronze → Silver complete!
