# NASA Dashlink Flight Data EDA

This notebook performs exploratory data analysis on NASA Dashlink flight data.

## Data Structure
- Zip files contain .mat (MATLAB) files
- Each .mat file contains flight telemetry data
- Files are organized by Tail number (aircraft identifier)

## Library Recommendation
**Polars** is recommended for this analysis because:
- Much faster for large datasets (written in Rust)
- Better memory efficiency
- Lazy evaluation for optimal performance
- Modern API design

**Pandas** is also available if you prefer more ecosystem support.


In [5]:
import os
import zipfile
from pathlib import Path
import scipy.io
import numpy as np
import polars as pl
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print(f"Polars version: {pl.__version__}")
print(f"Pandas version: {pd.__version__}")


Polars version: 1.36.0
Pandas version: 2.3.3


## Step 1: Unzip Files (if needed)


In [6]:
def unzip_flight_data(data_dir='data', extract_to_subdirs=True):
    """
    Unzip all zip files in the data directory.
    
    Parameters:
    -----------
    data_dir : str
        Directory containing zip files
    extract_to_subdirs : bool
        If True, extract each zip to its own subdirectory (recommended)
    """
    data_path = Path(data_dir)
    zip_files = list(data_path.glob('*.zip'))
    
    if not zip_files:
        print("No zip files found. Files may already be extracted.")
        return
    
    print(f"Found {len(zip_files)} zip files")
    
    for zip_file in tqdm(zip_files, desc="Unzipping"):
        try:
            # Extract to a subdirectory named after the zip file (without .zip)
            if extract_to_subdirs:
                extract_dir = data_path / zip_file.stem
                extract_dir.mkdir(exist_ok=True)
            else:
                extract_dir = data_path
            
            with zipfile.ZipFile(zip_file, 'r') as zip_ref:
                zip_ref.extractall(extract_dir)
            
        except zipfile.BadZipFile:
            print(f"Warning: {zip_file.name} appears to be corrupted or not a valid zip file")
        except Exception as e:
            print(f"Error extracting {zip_file.name}: {e}")

# Uncomment to unzip files
unzip_flight_data()


Found 210 zip files


Unzipping:   3%|▎         | 7/210 [00:28<13:11,  3.90s/it]



Unzipping:   6%|▌         | 12/210 [00:43<11:36,  3.52s/it]



Unzipping:   9%|▉         | 19/210 [01:02<09:57,  3.13s/it]



Unzipping:  11%|█▏        | 24/210 [01:12<07:32,  2.43s/it]



Unzipping:  16%|█▌        | 34/210 [01:48<10:36,  3.62s/it]



Unzipping:  19%|█▊        | 39/210 [01:57<07:13,  2.54s/it]



Unzipping:  23%|██▎       | 49/210 [02:36<11:37,  4.34s/it]



Unzipping:  26%|██▌       | 54/210 [02:58<12:21,  4.75s/it]



Unzipping:  31%|███▏      | 66/210 [03:47<09:56,  4.15s/it]



Unzipping:  37%|███▋      | 77/210 [04:29<09:50,  4.44s/it]



Unzipping:  48%|████▊     | 100/210 [05:58<07:09,  3.90s/it]



Unzipping:  50%|████▉     | 104/210 [06:11<06:23,  3.62s/it]



Unzipping:  51%|█████     | 107/210 [06:20<05:42,  3.32s/it]



Unzipping:  61%|██████▏   | 129/210 [07:52<05:58,  4.42s/it]



Unzipping:  62%|██████▏   | 131/210 [07:56<04:27,  3.38s/it]



Unzipping:  64%|██████▍   | 134/210 [08:05<04:07,  3.26s/it]



Unzipping:  71%|███████   | 149/210 [08:57<03:54,  3.85s/it]



Unzipping:  75%|███████▌  | 158/210 [09:23<02:29,  2.88s/it]



Unzipping:  78%|███████▊  | 164/210 [09:43<02:40,  3.48s/it]



Unzipping:  83%|████████▎ | 175/210 [10:21<02:13,  3.81s/it]



Unzipping:  88%|████████▊ | 185/210 [10:56<01:36,  3.86s/it]



Unzipping:  89%|████████▉ | 187/210 [11:01<01:09,  3.04s/it]



Unzipping:  90%|█████████ | 190/210 [11:08<00:56,  2.81s/it]



Unzipping:  94%|█████████▍| 198/210 [11:36<00:46,  3.85s/it]



Unzipping:  96%|█████████▌| 201/210 [11:40<00:22,  2.49s/it]



Unzipping:  97%|█████████▋| 204/210 [11:44<00:11,  1.98s/it]



Unzipping: 100%|██████████| 210/210 [12:05<00:00,  3.46s/it]


## Step 2: Explore Data Structure


In [7]:
def explore_data_structure(data_dir='data'):
    """Explore the structure of the flight data."""
    data_path = Path(data_dir)
    
    # Find all .mat files
    mat_files = list(data_path.rglob('*.mat'))
    print(f"Total .mat files found: {len(mat_files)}")
    
    # Find all directories
    dirs = [d for d in data_path.iterdir() if d.is_dir()]
    print(f"Total directories: {len(dirs)}")
    
    # Sample a .mat file to understand structure
    if mat_files:
        sample_file = mat_files[0]
        print(f"\nExamining sample file: {sample_file}")
        
        try:
            mat_data = scipy.io.loadmat(str(sample_file))
            print(f"\nKeys in .mat file: {list(mat_data.keys())}")
            
            # Show structure of each key
            for key in mat_data.keys():
                if not key.startswith('__'):  # Skip metadata keys
                    value = mat_data[key]
                    print(f"\n{key}:")
                    print(f"  Type: {type(value)}")
                    if isinstance(value, np.ndarray):
                        print(f"  Shape: {value.shape}")
                        print(f"  Dtype: {value.dtype}")
                        if value.size > 0 and value.size < 100:
                            print(f"  Sample values:\n{value}")
        except Exception as e:
            print(f"Error reading {sample_file}: {e}")
    
    return mat_files, dirs

mat_files, dirs = explore_data_structure()


Total .mat files found: 104505
Total directories: 209

Examining sample file: data/Tail_677_2/677200202040430.mat

Keys in .mat file: ['__header__', '__version__', '__globals__', 'VAR_1107', 'VAR_2670', 'VAR_5107', 'VAR_6670', 'FPAC', 'BLAC', 'CTAC', 'TH', 'MH', 'EGT_1', 'EGT_2', 'EGT_3', 'EGT_4', 'IVV', 'GS', 'TRK', 'TRKM', 'DA', 'POVT', 'WS', 'MW', 'DFGS', 'WD', 'ALT', 'NSQT', 'RALT', 'ALTR', 'FQTY_1', 'OIT_1', 'OIT_2', 'AOA1', 'AOA2', 'PTCH', 'FF_1', 'PSA', 'FF_2', 'FF_3', 'ROLL', 'FF_4', 'N1_1', 'N1_2', 'MACH', 'CAS', 'APFD', 'PH', 'CASM', 'TAS', 'VRTG', 'LATG', 'PI', 'PS', 'N1_3', 'EVNT', 'MRK', 'VIB_1', 'PT', 'VHF1', 'VHF2', 'LGDN', 'LGUP', 'VIB_2', 'VHF3', 'PUSH', 'SHKR', 'MSQT_2', 'VIB_3', 'LONG', 'PLA_1', 'N1_4', 'HYDY', 'HYDG', 'SMOK', 'CALT', 'VIB_4', 'PLA_2', 'PLA_3', 'PLA_4', 'GMT_HOUR', 'GMT_MINUTE', 'GMT_SEC', 'ACMT', 'FQTY_2', 'OIT_3', 'OIT_4', 'DATE_YEAR', 'DATE_MONTH', 'DATE_DAY', 'DVER_1', 'ACID', 'BLV', 'EAI', 'PACK', 'AOAI', 'AOAC', 'BAL1', 'APUF', 'TOCW', 'BAL2', 

In [14]:
# Examine the full structure of a sample .mat file to see all available fields
sample_file = mat_files[0]
print(f"Examining: {sample_file}\n")

mat_data = scipy.io.loadmat(str(sample_file), squeeze_me=True, struct_as_record=False)

# Get first few non-metadata keys to understand structure
count = 0
for key in mat_data.keys():
    if not key.startswith('__'):
        value = mat_data[key]
        print(f"Signal: {key}")
        print(f"  Type: {type(value)}")

        # If it's a struct, show all its fields
        if hasattr(value, '__dict__'):
            print(f"  Fields available:")
            for field_name, field_value in value.__dict__.items():
                if hasattr(field_value, 'shape'):
                    print(f"    - {field_name}: shape={field_value.shape}, dtype={field_value.dtype}")
                else:
                    print(f"    - {field_name}: {type(field_value).__name__} = {field_value}")
        print()

        count += 1
        if count >= 3:  # Show first 3 signals as example
            break

# Count total signals
signal_count = sum(1 for k in mat_data.keys() if not k.startswith('__'))
print(f"Total signals in file: {signal_count}")

Examining: data/Tail_677_2/677200202040430.mat

Signal: VAR_1107
  Type: <class 'scipy.io.matlab._mio5_params.mat_struct'>
  Fields available:
    - _fieldnames: list = ['data', 'Rate', 'Units', 'Description', 'Alpha']
    - data: shape=(682,), dtype=uint16
    - Rate: float = 0.25
    - Units: str = <units>
    - Description: str = SYNC WORD FOR SUBFRAME 1
    - Alpha: str = 1107

Signal: VAR_2670
  Type: <class 'scipy.io.matlab._mio5_params.mat_struct'>
  Fields available:
    - _fieldnames: list = ['data', 'Rate', 'Units', 'Description', 'Alpha']
    - data: shape=(682,), dtype=uint16
    - Rate: float = 0.25
    - Units: str = <units>
    - Description: str = SYNC WORD FOR SUBFRAME 2
    - Alpha: str = 2670

Signal: VAR_5107
  Type: <class 'scipy.io.matlab._mio5_params.mat_struct'>
  Fields available:
    - _fieldnames: list = ['data', 'Rate', 'Units', 'Description', 'Alpha']
    - data: shape=(682,), dtype=uint16
    - Rate: float = 0.25
    - Units: str = <units>
    - Descriptio

## What each .mat file contains

  One .mat file = one complete flight recording

  The filename 677200202040430 tells you:
  - Aircraft: Tail 677
  - Date: 2002-02-04 (Feb 4, 2002)
  - Flight ID: 0430

  What's inside

  It's continuous sensor recordings, not discrete events. Think of it like a flight data recorder (black box) that's constantly logging:

  - Engine parameters (EGT, N1, N2, fuel flow) - sampled every second or faster
  - Flight dynamics (altitude, airspeed, pitch, roll) - continuous streams
  - System states (flaps, gear, autopilot) - logged throughout

  Each signal is a time series recorded at a specific sampling rate (e.g., 1 Hz, 4 Hz, 8 Hz) for the entire duration of the flight - from engine start to shutdown.

  So the structure is:

  Tail_677_2/
    └── 677200202040430.mat  (one flight)
          ├── ALT.data:  [alt_0, alt_1, alt_2, ...] (altitude every second)
          ├── EGT_1.data: [temp_0, temp_1, temp_2, ...] (engine temp every second)
          ├── N1_1.data: [rpm_0, rpm_1, rpm_2, ...] (fan speed)
          └── ... (186 parameters total)

  The index column (0, 1, 2, 3...) represents sequential time samples, and combined with the rate field, gives you actual time into the flight.

  
  "Times and dates are anonymized"

  So the 0430 in the filename 677200202040430 may not be the actual flight time - it could be shifted or just a sequence ID.

  What we know for sure:

  | Source                 | What it tells you                                      |
  |------------------------|--------------------------------------------------------|
  | time_seconds column    | Relative time from start of recording (accurate)       |
  | Filename date 20020204 | Likely the actual date (dates appear to be real)       |
  | Filename 0430          | Possibly anonymized - could be time or flight sequence |

## Step 3: Load and Convert .mat Files to DataFrame


In [35]:
def load_mat_file(mat_path):
    """
    Load a .mat file and convert to a more usable format.
    Returns a dictionary with the data.
    """
    try:
        mat_data = scipy.io.loadmat(str(mat_path), squeeze_me=True, struct_as_record=False)

        # Remove metadata keys (keys starting with __)
        data = {k: v for k, v in mat_data.items() if not k.startswith('__')}

        return data
    except Exception as e:
        print(f"Error loading {mat_path}: {e}")
        return None

# Sync words to filter out (frame synchronization markers, not actual telemetry)
SYNC_WORDS = {'VAR_1107', 'VAR_2670', 'VAR_5107', 'VAR_6670'}

def mat_to_dataframe(mat_path, use_polars=True):
    """
    Convert a .mat file to a Polars or Pandas DataFrame.
    
    Parameters:
    -----------
    mat_path : Path or str
        Path to .mat file
    use_polars : bool
        If True, return Polars DataFrame (recommended), else Pandas
    """
    data = load_mat_file(mat_path)
    if data is None:
        return None

    # Extract filename info
    filename = Path(mat_path).stem
    tail_num = Path(mat_path).parent.name if Path(mat_path).parent.name.startswith('Tail_') else 'unknown'

    dfs = []

    for key, value in data.items():
        # Skip sync words
        if key in SYNC_WORDS:
            continue

        # Get category info for this signal
        category, priority, category_desc = categorize_signal(key)

        # Handle structured data (MATLAB structs with .data, .rate, .units fields)
        if hasattr(value, '__dict__'):
            struct_data = value.__dict__

            # Check if this struct has a 'data' field (time series)
            if 'data' in struct_data:
                signal_data = struct_data['data']

                # Get sampling rate (default to 1 Hz if not present)
                rate = struct_data.get('Rate', struct_data.get('rate', 1))
                if hasattr(rate, 'item'):
                    rate = rate.item()

                # Get units if available
                units = struct_data.get('Units', struct_data.get('units', ''))
                if hasattr(units, 'item'):
                    units = str(units)

                # Get signal description from the .mat file if available
                signal_desc = struct_data.get('Description', struct_data.get('description', ''))
                if hasattr(signal_desc, 'item'):
                    signal_desc = str(signal_desc)

                if isinstance(signal_data, np.ndarray) and signal_data.ndim == 1:
                    try:
                        float_value = signal_data.astype(np.float64)
                    except (ValueError, TypeError):
                        continue

                    n_samples = len(signal_data)
                    indices = np.arange(n_samples, dtype=np.int64)
                    time_seconds = indices / rate

                    df_dict = {
                        'tail_number': tail_num,
                        'filename': filename,
                        'signal_name': key,
                        'category': category,
                        'priority': priority,
                        'category_desc': category_desc,
                        'signal_desc': str(signal_desc) if signal_desc else '',
                        'value': float_value,
                        'index': indices,
                        'time_seconds': time_seconds,
                        'rate_hz': float(rate),
                        'units': str(units) if units else 'unknown'
                    }

                    if use_polars:
                        df = pl.DataFrame(df_dict)
                    else:
                        df = pd.DataFrame(df_dict)
                    dfs.append(df)

        # Handle raw arrays (no struct wrapper)
        elif isinstance(value, np.ndarray):
            if value.ndim == 1:
                try:
                    float_value = value.astype(np.float64)
                except (ValueError, TypeError):
                    continue

                n_samples = len(value)
                indices = np.arange(n_samples, dtype=np.int64)

                df_dict = {
                    'tail_number': tail_num,
                    'filename': filename,
                    'signal_name': key,
                    'category': category,
                    'priority': priority,
                    'category_desc': category_desc,
                    'signal_desc': '',
                    'value': float_value,
                    'index': indices,
                    'time_seconds': indices.astype(np.float64),
                    'rate_hz': 1.0,
                    'units': 'unknown'
                }

                if use_polars:
                    df = pl.DataFrame(df_dict)
                else:
                    df = pd.DataFrame(df_dict)
                dfs.append(df)

    if not dfs:
        return None

    if use_polars:
        return pl.concat(dfs, how="vertical_relaxed")
    else:
        return pd.concat(dfs, ignore_index=True)

# Test with a sample file
if mat_files:
    sample_df = mat_to_dataframe(mat_files[0], use_polars=True)
    if sample_df is not None:
        with pl.Config(tbl_rows=1000, tbl_cols=1000):
            print("Sample DataFrame (Polars):")
            print(sample_df.head(10))
            print(f"\nShape: {sample_df.shape}")
            print(f"\nSchema: {sample_df.schema}")

            # Show category distribution
            print("\n" + "="*60)
            print("SIGNALS BY CATEGORY")
            print("="*60)
            print(sample_df.group_by(['priority', 'category']).agg([
                pl.n_unique('signal_name').alias('num_signals')
            ]).sort('priority'))


Sample DataFrame (Polars):
shape: (10, 12)
┌────────┬────────┬────────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ tail_n ┆ filena ┆ signal ┆ categ ┆ prior ┆ categ ┆ signa ┆ value ┆ index ┆ time_ ┆ rate_ ┆ units │
│ umber  ┆ me     ┆ _name  ┆ ory   ┆ ity   ┆ ory_d ┆ l_des ┆ ---   ┆ ---   ┆ secon ┆ hz    ┆ ---   │
│ ---    ┆ ---    ┆ ---    ┆ ---   ┆ ---   ┆ esc   ┆ c     ┆ f64   ┆ i64   ┆ ds    ┆ ---   ┆ str   │
│ str    ┆ str    ┆ str    ┆ str   ┆ i32   ┆ ---   ┆ ---   ┆       ┆       ┆ ---   ┆ f64   ┆       │
│        ┆        ┆        ┆       ┆       ┆ str   ┆ str   ┆       ┆       ┆ f64   ┆       ┆       │
╞════════╪════════╪════════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╡
│ Tail_6 ┆ 677200 ┆ FPAC   ┆ fligh ┆ 3     ┆ Aircr ┆ FLIGH ┆ 0.0   ┆ 0     ┆ 0.0   ┆ 16.0  ┆ G     │
│ 77_2   ┆ 202040 ┆        ┆ t_per ┆       ┆ aft   ┆ T     ┆       ┆       ┆       ┆       ┆       │
│        ┆ 430    ┆        ┆ forma ┆       ┆ sta

## Step 3a. Explore and add signal categories

In [31]:
# Signal criticality categories for T&E analysis
SIGNAL_CATEGORIES = {
    # TIER 1: Flight Safety Critical - Monitor these first
    'safety_critical': {
        'signals': [
            'ALT', 'RALT',      # Altitude (pressure & radio)
            'BAL1', 'BAL2',     # Barometric corrected altitude
            'CAS', 'MACH',      # Airspeed
            'AOA1', 'AOA2', 'AOAC', 'AOAI',  # Angle of attack
            'SHKR',             # Stick shaker (stall warning)
            'PUSH',             # Stick pusher (stall protection)
            'GPWS',             # Ground proximity warning
            'WSHR',             # Windshear warning
            'WOW',              # Weight on wheels
            'FIRE_1', 'FIRE_2', 'FIRE_3', 'FIRE_4',  # Engine fire
            'APUF',             # APU fire warning
            'MW',               # Master warning
            'TCAS',             # Traffic collision avoidance
            'CALT',             # Cabin high altitude warning
            'POVT',             # Pylon overheat
            'SMOK',             # Smoke warning
            'SMKB',             # Animal bay smoke
            'TOCW',             # Takeoff configuration warning
        ],
        'description': 'Immediate safety-of-flight indicators',
        'priority': 1
    },

    # TIER 2: Engine Health - Critical for powerplant analysis
    'engine_health': {
        'signals': [
            'N1_1', 'N1_2', 'N1_3', 'N1_4',      # Fan speed
            'N2_1', 'N2_2', 'N2_3', 'N2_4',      # Core speed
            'EGT_1', 'EGT_2', 'EGT_3', 'EGT_4',  # Exhaust gas temp
            'FF_1', 'FF_2', 'FF_3', 'FF_4',      # Fuel flow
            'OIP_1', 'OIP_2', 'OIP_3', 'OIP_4',  # Oil pressure
            'OIT_1', 'OIT_2', 'OIT_3', 'OIT_4',  # Oil temperature
            'VIB_1', 'VIB_2', 'VIB_3', 'VIB_4',  # Engine vibration
            'N1T', 'N1C', 'N1CO',                # N1 target/command/compensation
            'OIPL',             # Low oil pressure all engines
            'FADF',             # FADEC fail all engines
            'FADS',             # FADEC status all engines
            'ECYC_1', 'ECYC_2', 'ECYC_3', 'ECYC_4',  # Engine cycles
            'EHRS_1', 'EHRS_2', 'EHRS_3', 'EHRS_4',  # Engine hours
        ],
        'description': 'Engine performance and health monitoring',
        'priority': 2
    },

    # TIER 3: Flight Performance - Core flight dynamics
    'flight_performance': {
        'signals': [
            'PTCH',             # Pitch angle
            'ROLL',             # Roll angle
            'MH', 'TH',         # Magnetic/True heading
            'IVV', 'ALTR',      # Vertical speed
            'GS', 'TAS',        # Ground speed, true airspeed
            'VRTG',             # Vertical acceleration (G)
            'LATG',             # Lateral acceleration
            'LONG', 'BLAC',     # Longitudinal acceleration
            'FPAC',             # Flight path acceleration
            'CTAC',             # Cross track acceleration
            'TRK', 'TRKM',      # Track angle
            'DA',               # Drift angle
        ],
        'description': 'Aircraft state and performance metrics',
        'priority': 3
    },

    # TIER 4: Flight Controls - Pilot inputs & control surfaces
    'flight_controls': {
        'signals': [
            'CCPC', 'CCPF',     # Control column position (Capt/FO)
            'CWPC', 'CWPF',     # Control wheel position
            'RUDP',             # Rudder pedal position
            'AIL_1', 'AIL_2',   # Aileron position
            'ELEV_1', 'ELEV_2', # Elevator position
            'RUDD',             # Rudder position
            'FLAP',             # Flap position
            'SPL_1', 'SPL_2',   # Spoiler position
            'SPLG', 'SPLY',     # Spoiler deploy green/yellow
            'ABRK',             # Airbrake
            'PTRM',             # Pitch trim
            'PLA_1', 'PLA_2', 'PLA_3', 'PLA_4',  # Power lever angle
        ],
        'description': 'Control inputs and surface deflections',
        'priority': 4
    },

    # TIER 5: Autopilot/Flight Management
    'automation': {
        'signals': [
            'APFD',             # Autopilot/Flight Director status
            'ATEN',             # Autothrottle engage
            'A_T',              # Thrust automatic on
            'LMOD',             # Lateral mode
            'VMODE',            # Vertical mode
            'TMODE',            # Thrust mode
            'DFGS',             # Digital flight guidance
            'FGC3',             # DFGS status 3
            'ALTS',             # Selected altitude
            'CASS', 'CASM',     # Selected/max airspeed
            'MNS',              # Selected Mach
            'HDGS',             # Selected heading
            'VSPS',             # Selected vertical speed
        ],
        'description': 'Autopilot and flight management status',
        'priority': 5
    },

    # TIER 6: Navigation
    'navigation': {
        'signals': [
            'LATP', 'LONP',     # Lat/Long position
            'LOC',              # Localizer deviation
            'GLS',              # Glideslope deviation
            'DWPT',             # Distance to waypoint
            'CRSS',             # Selected course
            'ILSF',             # ILS frequency
            'MRK',              # Marker beacons
            'TMAG',             # True/magnetic heading select
        ],
        'description': 'Navigation and approach guidance',
        'priority': 6
    },

    # TIER 7: Aircraft Systems
    'systems': {
        'signals': [
            'HYDG', 'HYDY',     # Hydraulic pressure (green/yellow)
            'BPGR_1', 'BPGR_2', # Brake pressure green
            'BPYR_1', 'BPYR_2', # Brake pressure yellow
            'LGDN', 'LGUP',     # Landing gear position
            'MSQT_1', 'MSQT_2', 'NSQT',  # Squat switches
            'PACK',             # Air conditioning
            'BLV',              # Bleed air valves
            'FQTY_1', 'FQTY_2', 'FQTY_3', 'FQTY_4',  # Fuel quantity
            'EAI',              # Engine anti-ice
            'TAI',              # Tail anti-ice
            'WAI_1', 'WAI_2',   # Wing anti-ice
            'VHF1', 'VHF2', 'VHF3',  # VHF radio keying
            'HF1', 'HF2',       # HF radio keying
        ],
        'description': 'Aircraft systems status',
        'priority': 7
    },

    # TIER 8: Environmental
    'environmental': {
        'signals': [
            'SAT', 'TAT',       # Static/Total air temperature
            'WS', 'WD',         # Wind speed/direction
            'PS', 'PT', 'PI', 'PSA',  # Pressures (static, total, impact, avg)
        ],
        'description': 'Atmospheric conditions',
        'priority': 8
    },

    # TIER 9: Flight Phase & Recording Metadata
    'metadata': {
        'signals': [
            'PH',               # Flight phase
            'EVNT',             # Event marker
            'FRMC',             # Frame counter
            'ACMT',             # ACMS timing
            'ACID',             # Aircraft ID
            'SNAP',             # Manual snapshot switch
            'GMT_HOUR', 'GMT_MINUTE', 'GMT_SEC',  # Time
            'DATE_YEAR', 'DATE_MONTH', 'DATE_DAY',  # Date
            'DVER_1', 'DVER_2', # Database version
            'ESN_1', 'ESN_2', 'ESN_3', 'ESN_4',  # Engine serial numbers
        ],
        'description': 'Flight phase, timing, and recording metadata',
        'priority': 9
    },
}

def categorize_signal(signal_name):
    """Return category and priority for a signal name."""
    # Normalize: remove .data suffix, handle underscores vs dots
    clean_name = signal_name.replace('.data', '').replace('.', '_').upper()

    for category, info in SIGNAL_CATEGORIES.items():
        for sig in info['signals']:
            # Match with flexible formatting
            if clean_name == sig.upper() or clean_name.startswith(sig.upper()):
                return category, info['priority'], info['description']

    return 'uncategorized', 99, 'Not yet classified'

print("Signal categories defined. Use categorize_signal('SIGNAL_NAME') to classify signals.")
print(f"\nCategories available: {list(SIGNAL_CATEGORIES.keys())}")
print(f"\nTotal signals categorized: {sum(len(c['signals']) for c in SIGNAL_CATEGORIES.values())}")

Signal categories defined. Use categorize_signal('SIGNAL_NAME') to classify signals.

Categories available: ['safety_critical', 'engine_health', 'flight_performance', 'flight_controls', 'automation', 'navigation', 'systems', 'environmental', 'metadata']

Total signals categorized: 182


In [32]:
# Apply categorization to the sample dataframe
if sample_df is not None:
    # Add category and priority columns
    df_categorized = sample_df.with_columns([
        pl.col('signal_name').map_elements(
            lambda x: categorize_signal(x)[0], 
            return_dtype=pl.Utf8
        ).alias('category'),
        pl.col('signal_name').map_elements(
            lambda x: categorize_signal(x)[1], 
            return_dtype=pl.Int64
        ).alias('priority'),
    ])
    
    # Summary by category
    print("="*60)
    print("SIGNAL DISTRIBUTION BY CATEGORY")
    print("="*60)
    category_summary = df_categorized.group_by(['category', 'priority']).agg([
        pl.n_unique('signal_name').alias('num_signals')
    ]).sort('priority')
    print(category_summary)
    
    # Show signals in each category
    print("\n" + "="*60)
    print("SIGNALS BY CATEGORY (Tier 1-3 Focus)")
    print("="*60)
    for priority in [1, 2, 3]:
        cat_data = df_categorized.filter(pl.col('priority') == priority)
        if cat_data.height > 0:
            cat_name = cat_data.select('category').unique().item()
            signals = cat_data.select('signal_name').unique().to_series().to_list()
            desc = SIGNAL_CATEGORIES[cat_name]['description']
            print(f"\n[TIER {priority}] {cat_name.upper()}")
            print(f"  Description: {desc}")
            print(f"  Signals found: {signals}")
    
    # Show uncategorized signals (may need to add to categories)
    uncategorized = df_categorized.filter(pl.col('category') == 'uncategorized')
    if uncategorized.height > 0:
        print("\n" + "="*60)
        print("UNCATEGORIZED SIGNALS (may need classification)")
        print("="*60)
        uncategorized_signals = uncategorized.select('signal_name').unique().to_series().to_list()
        print(f"  {uncategorized_signals[:20]}")  # Show first 20
        if len(uncategorized_signals) > 20:
            print(f"  ... and {len(uncategorized_signals) - 20} more")

SIGNAL DISTRIBUTION BY CATEGORY
shape: (9, 3)
┌────────────────────┬──────────┬─────────────┐
│ category           ┆ priority ┆ num_signals │
│ ---                ┆ ---      ┆ ---         │
│ str                ┆ i64      ┆ u32         │
╞════════════════════╪══════════╪═════════════╡
│ safety_critical    ┆ 1        ┆ 31          │
│ engine_health      ┆ 2        ┆ 42          │
│ flight_performance ┆ 3        ┆ 19          │
│ flight_controls    ┆ 4        ┆ 21          │
│ automation         ┆ 5        ┆ 11          │
│ navigation         ┆ 6        ┆ 9           │
│ systems            ┆ 7        ┆ 26          │
│ environmental      ┆ 8        ┆ 8           │
│ metadata           ┆ 9        ┆ 15          │
└────────────────────┴──────────┴─────────────┘

SIGNALS BY CATEGORY (Tier 1-3 Focus)

[TIER 1] SAFETY_CRITICAL
  Description: Immediate safety-of-flight indicators
  Signals found: ['FIRE_3', 'WOW', 'MACH', 'AOA1', 'CASM', 'RALT', 'APUF', 'AOA2', 'PUSH', 'ALTS', 'FIRE_2', 'WSHR', 

In [30]:
uncategorized_signals = df_categorized.filter(
    pl.col('category') == 'uncategorized'
).select('signal_name').unique().sort('signal_name')
with pl.Config(tbl_rows=1000, tbl_width_chars=1000):
    print(uncategorized_signals)

shape: (48, 1)
┌─────────────┐
│ signal_name │
│ ---         │
│ str         │
╞═════════════╡
│ ACID        │
│ APUF        │
│ A_T         │
│ BAL1        │
│ BAL2        │
│ BPYR_1      │
│ BPYR_2      │
│ CALT        │
│ CTAC        │
│ DVER_1      │
│ DVER_2      │
│ EAI         │
│ ECYC_1      │
│ ECYC_2      │
│ ECYC_3      │
│ ECYC_4      │
│ EHRS_1      │
│ EHRS_2      │
│ EHRS_3      │
│ EHRS_4      │
│ ESN_1       │
│ ESN_2       │
│ ESN_3       │
│ ESN_4       │
│ FADF        │
│ FADS        │
│ FGC3        │
│ GMT_HOUR    │
│ GMT_MINUTE  │
│ GMT_SEC     │
│ HF1         │
│ HF2         │
│ MNS         │
│ OIPL        │
│ POVT        │
│ SMKB        │
│ SMOK        │
│ SNAP        │
│ SPLG        │
│ SPLY        │
│ TAI         │
│ TMAG        │
│ TOCW        │
│ VHF1        │
│ VHF2        │
│ VHF3        │
│ WAI_1       │
│ WAI_2       │
└─────────────┘


## Step 3b: Distribution of sampling rates and signals

In [22]:
# See the distribution of sampling rates
print(sample_df.group_by('rate_hz').agg([
    pl.n_unique('signal_name').alias('num_signals'),
    pl.col('signal_name').alias('example_signal')
]).sort('rate_hz', descending=True))


shape: (6, 3)
┌─────────┬─────────────┬─────────────────────────────────┐
│ rate_hz ┆ num_signals ┆ example_signal                  │
│ ---     ┆ ---         ┆ ---                             │
│ f64     ┆ u32         ┆ list[str]                       │
╞═════════╪═════════════╪═════════════════════════════════╡
│ 16.0    ┆ 4           ┆ ["FPAC", "FPAC", … "IVV"]       │
│ 8.0     ┆ 4           ┆ ["RALT", "RALT", … "VRTG"]      │
│ 4.0     ┆ 49          ┆ ["TH", "TH", … "N1C"]           │
│ 2.0     ┆ 18          ┆ ["PSA", "PSA", … "CWPF"]        │
│ 1.0     ┆ 88          ┆ ["POVT", "POVT", … "ILSF"]      │
│ 0.25    ┆ 19          ┆ ["DATE_YEAR", "DATE_YEAR", … "… │
└─────────┴─────────────┴─────────────────────────────────┘


In [23]:
# which signals have which rate:

  # List signals by rate
for rate in sample_df.select('rate_hz').unique().sort('rate_hz', descending=True).to_series():
    signals = sample_df.filter(pl.col('rate_hz') == rate).select('signal_name').unique().to_series().to_list()
    print(f"\n{rate} Hz ({len(signals)} signals):")
    print(f"  {signals}...")  # Show first 5


16.0 Hz (4 signals):
  ['CTAC', 'IVV', 'BLAC', 'FPAC']...

8.0 Hz (4 signals):
  ['PTCH', 'ROLL', 'RALT', 'VRTG']...

4.0 Hz (49 signals):
  ['FF_3', 'N2_3', 'PLA_3', 'EGT_3', 'WS', 'PLA_1', 'N2_2', 'AOA2', 'WD', 'VIB_2', 'LONG', 'TAS', 'FF_4', 'CAS', 'AOAI', 'VIB_1', 'MH', 'PLA_4', 'N1_4', 'CASM', 'PLA_2', 'N1C', 'ALTR', 'EGT_4', 'FF_1', 'VIB_3', 'MACH', 'ALT', 'EGT_2', 'DA', 'AOAC', 'FF_2', 'BAL2', 'VIB_4', 'N1_1', 'GS', 'TRK', 'AOA1', 'LATG', 'N1_2', 'BAL1', 'TH', 'TRKM', 'N2_1', 'N2_4', 'N1_3', 'EGT_1', 'NSQT', 'N1T']...

2.0 Hz (18 signals):
  ['RUDD', 'SHKR', 'GMT_SEC', 'APUF', 'MSQT_1', 'CCPF', 'MSQT_2', 'TOCW', 'GMT_HOUR', 'PS', 'PSA', 'CWPC', 'GMT_MINUTE', 'PT', 'PI', 'RUDP', 'CCPC', 'CWPF']...

1.0 Hz (88 signals):
  ['N1CO', 'A_T', 'ELEV_1', 'MW', 'OIT_2', 'SAT', 'BPGR_1', 'TAT', 'SPLG', 'VSPS', 'LMOD', 'OIP_3', 'FQTY_1', 'VHF3', 'LOC', 'LGUP', 'CRSS', 'TAI', 'LONP', 'OIT_1', 'FIRE_3', 'WAI_1', 'TCAS', 'BLV', 'ALTS', 'APFD', 'ELEV_2', 'LATP', 'TMODE', 'OIP_1', 'SPL_1', 'FAD

In [1]:

 # Or to see which signals have which rate:

  # List signals by rate
for rate in sample_df.select('rate_hz').unique().sort('rate_hz', descending=True).to_series():
  signals = sample_df.filter(pl.col('rate_hz') == rate).select('signal_name').unique().to_series().to_list()
  print(f"\n{rate} Hz ({len(signals)} signals):")
  print(f"  {signals[:5]}...")  # Show first 5

NameError: name 'sample_df' is not defined

## Step 4: Load Multiple Files (Using Polars - Recommended)


In [None]:
def load_multiple_mat_files(mat_files, max_files=None, use_polars=True):
    """
    Load multiple .mat files and combine into a single DataFrame.
    
    Parameters:
    -----------
    mat_files : list
        List of paths to .mat files
    max_files : int, optional
        Limit number of files to load (for testing)
    use_polars : bool
        If True, use Polars (recommended for large datasets)
    """
    if max_files:
        mat_files = mat_files[:max_files]
    
    print(f"Loading {len(mat_files)} files...")
    
    dfs = []
    failed = 0
    
    for mat_file in tqdm(mat_files, desc="Loading files"):
        df = mat_to_dataframe(mat_file, use_polars=use_polars)
        if df is not None:
            dfs.append(df)
        else:
            failed += 1
    
    if failed > 0:
        print(f"Warning: {failed} files failed to load")
    
    if not dfs:
        print("No data loaded!")
        return None
    
    print(f"Combining {len(dfs)} DataFrames...")
    
    if use_polars:
          # Use vertical_relaxed to handle type mismatches across files
          combined_df = pl.concat(dfs, how="vertical_relaxed")

          # Filter out sync words (frame synchronization markers, not actual telemetry)
          sync_words = ['VAR_1107', 'VAR_2670', 'VAR_5107', 'VAR_6670']
          before_filter = combined_df.shape[0]
          combined_df = combined_df.filter(
              ~pl.col('signal_name').str.contains('|'.join(sync_words))
          )
          print(f"Filtered out {before_filter - combined_df.shape[0]} sync word rows")
    else:
          combined_df = pd.concat(dfs, ignore_index=True)

          # Filter out sync words
          sync_words = ['VAR_1107', 'VAR_2670', 'VAR_5107', 'VAR_6670']
          mask = ~combined_df['signal_name'].str.contains('|'.join(sync_words))
          combined_df = combined_df[mask]

    print(f"Combined DataFrame shape: {combined_df.shape}")
    return combined_df

# Load a sample of files for EDA (start with a small number)
# Uncomment and adjust max_files as needed
# df = load_multiple_mat_files(mat_files, max_files=10, use_polars=True)


## Step 5: Exploratory Data Analysis


In [10]:
import polars as pl

# === DATA LOADING & TRANSFORMATION (Use Polars) ===

# Load and process all flights
df = pl.scan_parquet('../data/processed/telemetry/tail_number=652/batch=1/flight_001.parquet') \
    .collect()

print(f"Loaded {len(df)} rows in Polars")

Loaded 31552 rows in Polars


In [12]:
def perform_eda(df):
    """
    Perform exploratory data analysis on the flight data.
    """
    if df is None:
        print("No data to analyze")
        return
    
    print("=" * 60)
    print("EXPLORATORY DATA ANALYSIS")
    print("=" * 60)
    
    # Basic info
    print("\n1. BASIC INFORMATION")
    print("-" * 60)
    
    print(f"Shape: {df.shape}")
    print(f"\nSchema:")
    print(df.schema)
    print(f"\nColumn names: {df.columns}")
    print(f"\nFirst few rows:")
    print(df.head())
    print(f"\nSummary statistics:")
    print(df.describe())


In [13]:
perform_eda(df)

EXPLORATORY DATA ANALYSIS

1. BASIC INFORMATION
------------------------------------------------------------
Shape: (31552, 183)

Schema:
Schema({'sample_index': Int32, 'ABRK': Float32, 'ACID': Float32, 'ACMT': Float32, 'AIL_1': Float32, 'AIL_2': Float32, 'ALT': Float32, 'ALTR': Float32, 'ALTS': Float32, 'AOA1': Float32, 'AOA2': Float32, 'AOAC': Float32, 'AOAI': Float32, 'APFD': Float32, 'APUF': Float32, 'ATEN': Float32, 'A_T': Float32, 'BAL1': Float32, 'BAL2': Float32, 'BLAC': Float32, 'BLV': Float32, 'BPGR_1': Float32, 'BPGR_2': Float32, 'BPYR_1': Float32, 'BPYR_2': Float32, 'CALT': Float32, 'CAS': Float32, 'CASM': Float32, 'CASS': Float32, 'CCPC': Float32, 'CCPF': Float32, 'CRSS': Float32, 'CTAC': Float32, 'CWPC': Float32, 'CWPF': Float32, 'DA': Float32, 'DATE_DAY': Float32, 'DATE_MONTH': Float32, 'DATE_YEAR': Float32, 'DFGS': Float32, 'DVER_1': Float32, 'DVER_2': Float32, 'DWPT': Float32, 'EAI': Float32, 'ECYC_1': Float32, 'ECYC_2': Float32, 'ECYC_3': Float32, 'ECYC_4': Float32, 'E

In [14]:
# === DATA LOADING & TRANSFORMATION (Use Polars) ===

# Load and process all flights
df = pl.scan_parquet('../data/processed/flight_metadata.parquet') \
    .collect()

print(f"Loaded {len(df)} rows in Polars")

Loaded 1 rows in Polars


In [16]:
perform_eda(df)

EXPLORATORY DATA ANALYSIS

1. BASIC INFORMATION
------------------------------------------------------------
Shape: (1, 18)

Schema:
Schema({'tail_number': String, 'year': String, 'month': String, 'day': String, 'hour': String, 'minute': String, 'date': String, 'time': String, 'flight_id': String, 'num_samples': Int64, 'num_signals': Int64, 'file_path': String, 'batch': Int64, 'flight_number': Int64, 'source_folder': String, 'signal_metadata_json': String, 'num_batches': Int64, 'num_flights': Int64})

Column names: ['tail_number', 'year', 'month', 'day', 'hour', 'minute', 'date', 'time', 'flight_id', 'num_samples', 'num_signals', 'file_path', 'batch', 'flight_number', 'source_folder', 'signal_metadata_json', 'num_batches', 'num_flights']

First few rows:
shape: (1, 18)
┌─────────────┬──────┬───────┬─────┬───┬───────────────┬───────────────┬─────────────┬─────────────┐
│ tail_number ┆ year ┆ month ┆ day ┆ … ┆ source_folder ┆ signal_metada ┆ num_batches ┆ num_flights │
│ ---         ┆ --

In [25]:
import json
python_object = json.loads(df['signal_metadata_json'][0])

print(json.dumps(python_object, indent=4))

{
    "FPAC": {
        "rate_hz": 16.0,
        "units": "G",
        "description": "FLIGHT PATH ACCELERATION",
        "category": "flight_performance",
        "priority": 3
    },
    "BLAC": {
        "rate_hz": 16.0,
        "units": "G",
        "description": "BODY LONGITUDINAL ACCELERATION",
        "category": "flight_performance",
        "priority": 3
    },
    "CTAC": {
        "rate_hz": 16.0,
        "units": "G",
        "description": "CROSS TRACK ACCELERATION",
        "category": "flight_performance",
        "priority": 3
    },
    "TH": {
        "rate_hz": 4.0,
        "units": "DEG",
        "description": "TRUE HEADING LSP",
        "category": "flight_performance",
        "priority": 3
    },
    "MH": {
        "rate_hz": 4.0,
        "units": "DEG",
        "description": "MAGNETIC HEADING LSP",
        "category": "flight_performance",
        "priority": 3
    },
    "EGT_1": {
        "rate_hz": 4.0,
        "units": "DEG",
        "description": "EXHA

In [None]:
def perform_eda(df, use_polars=True):
    """
    Perform exploratory data analysis on the flight data.
    """
    if df is None:
        print("No data to analyze")
        return
    
    print("=" * 60)
    print("EXPLORATORY DATA ANALYSIS")
    print("=" * 60)
    
    # Basic info
    print("\n1. BASIC INFORMATION")
    print("-" * 60)
    if use_polars:
        print(f"Shape: {df.shape}")
        print(f"\nSchema:")
        print(df.schema)
        print(f"\nFirst few rows:")
        print(df.head())
        print(f"\nSummary statistics:")
        print(df.describe())
        
        # Count by tail number
        print(f"\n2. DATA BY TAIL NUMBER")
        print("-" * 60)
        tail_counts = df.group_by('tail_number').agg([
            pl.count().alias('file_count'),
            pl.n_unique('filename').alias('unique_files')
        ])
        print(tail_counts)
        
        # Count by signal name
        print(f"\n3. SIGNAL TYPES")
        print("-" * 60)
        signal_counts = df.group_by('signal_name').agg([
            pl.count().alias('count'),
            pl.n_unique('filename').alias('files_with_signal')
        ]).sort('count', descending=True)
        print(signal_counts.head(20))
        
    else:
        print(f"Shape: {df.shape}")
        print(f"\nColumns: {df.columns.tolist()}")
        print(f"\nData types:")
        print(df.dtypes)
        print(f"\nFirst few rows:")
        print(df.head())
        print(f"\nSummary statistics:")
        print(df.describe())
        
        # Count by tail number
        print(f"\n2. DATA BY TAIL NUMBER")
        print("-" * 60)
        tail_counts = df.groupby('tail_number').agg({
            'filename': 'nunique'
        }).reset_index()
        tail_counts.columns = ['tail_number', 'unique_files']
        print(tail_counts)
        
        # Count by signal name
        print(f"\n3. SIGNAL TYPES")
        print("-" * 60)
        signal_counts = df.groupby('signal_name').agg({
            'filename': 'nunique'
        }).reset_index()
        signal_counts.columns = ['signal_name', 'files_with_signal']
        signal_counts = signal_counts.sort_values('files_with_signal', ascending=False)
        print(signal_counts.head(20))

# Uncomment to run EDA
# perform_eda(df, use_polars=True)


## Step 6: Visualizations


In [None]:
def create_visualizations(df, use_polars=True):
    """
    Create visualizations for the flight data.
    """
    if df is None:
        print("No data to visualize")
        return
    
    # Convert to pandas for easier plotting (Polars can be converted)
    if use_polars:
        df_pd = df.to_pandas()
    else:
        df_pd = df.copy()
    
    # 1. Files per tail number
    plt.figure(figsize=(12, 6))
    tail_counts = df_pd.groupby('tail_number')['filename'].nunique().sort_values(ascending=False)
    tail_counts.head(20).plot(kind='bar')
    plt.title('Number of Files per Tail Number (Top 20)')
    plt.xlabel('Tail Number')
    plt.ylabel('Number of Files')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    # 2. Signal distribution
    plt.figure(figsize=(12, 6))
    signal_counts = df_pd.groupby('signal_name').size().sort_values(ascending=False)
    signal_counts.head(20).plot(kind='bar')
    plt.title('Most Common Signal Types (Top 20)')
    plt.xlabel('Signal Name')
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    # 3. Sample time series (if we have value and index columns)
    if 'value' in df_pd.columns and 'index' in df_pd.columns:
        # Get a sample signal
        sample_signals = df_pd['signal_name'].value_counts().head(5).index
        
        fig, axes = plt.subplots(len(sample_signals), 1, figsize=(14, 3*len(sample_signals)))
        if len(sample_signals) == 1:
            axes = [axes]
        
        for idx, signal in enumerate(sample_signals):
            signal_data = df_pd[df_pd['signal_name'] == signal]
            if len(signal_data) > 0:
                # Take first occurrence
                first_occurrence = signal_data.iloc[0]
                if isinstance(first_occurrence['value'], (list, np.ndarray)):
                    values = np.array(first_occurrence['value'])
                    axes[idx].plot(values[:1000])  # Limit to first 1000 points
                    axes[idx].set_title(f'Sample Time Series: {signal} (from {first_occurrence["filename"]})')
                    axes[idx].set_xlabel('Sample Index')
                    axes[idx].set_ylabel('Value')
                    axes[idx].grid(True)
        
        plt.tight_layout()
        plt.show()

# Uncomment to create visualizations
# create_visualizations(df, use_polars=True)


## Notes

### Unzipping Files
- Run `unzip_flight_data()` to extract all zip files
- Files will be extracted to subdirectories named after each zip file

### Loading Data
- Start with a small number of files (`max_files=10`) to understand the data structure
- Adjust the `mat_to_dataframe()` function based on the actual structure of your .mat files
- NASA Dashlink data structure may vary, so you may need to customize the loading function

### Polars vs Pandas
- **Polars** (recommended): Faster, more memory-efficient, better for large datasets
- **Pandas**: More ecosystem support, easier to find examples online
- You can convert between them: `df_polars.to_pandas()` or `pl.from_pandas(df_pandas)`

### Next Steps
1. Understand the actual structure of your .mat files
2. Customize the loading function based on the data structure
3. Load more files as needed
4. Perform domain-specific analysis based on flight data characteristics
