# 01. Data Exploration: NASA C-MAPSS FD001

This notebook performs the first step of the project: loading and exploring the raw `train_FD001.txt` dataset.

**Goals:**
1.  Load the space-delimited text file into a `pandas` DataFrame.
2.  Assign the correct column names based on the `readme.txt`.
3.  Perform initial verification (check shape, dtypes, null values).
4.  Visualize the run-to-failure nature of the data.

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for plots
sns.set_style('whitegrid')

## 1. Data Loading and Column Naming

Based on the `readme.txt`, the data is space-delimited and has no header. It contains 26 columns:
* `unit_number`
* `time_in_cycles`
* 3 operational settings (`op_setting_1` to `op_setting_3`)
* 21 sensor readings (`sensor_1` to `sensor_21`)

We will define these column names and use them to load the data.
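
To illustrate why a whitespace regex is a reasonable separator for this kind of file, here is a minimal sketch on two synthetic rows. The values are made up, and the trailing spaces at the end of each row are an assumption about how such files are often formatted, not something stated in the `readme.txt`. `sample_df` is a hypothetical name used only for this illustration.

```python
import io

import pandas as pd

# Same 26 column names as described above
op_settings = [f"op_setting_{i}" for i in range(1, 4)]
sensors = [f"sensor_{i}" for i in range(1, 22)]
cols = ["unit_number", "time_in_cycles"] + op_settings + sensors

# Two synthetic rows (made-up values), each ending in trailing spaces
row = " ".join(["1", "1"] + ["0.5"] * 24) + "  \n"
sample = io.StringIO(row + row.replace("1 1 ", "1 2 ", 1))

# sep=r"\s+" treats any run of whitespace as one delimiter, so the
# trailing spaces do not produce extra empty columns
sample_df = pd.read_csv(sample, sep=r"\s+", header=None, names=cols)
print(sample_df.shape)
print(sample_df[["unit_number", "time_in_cycles"]])
```

With a single-space separator (`sep=" "`), the same rows would be split on every space, so the trailing spaces would show up as additional, empty columns.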

In [10]:
# Define the column names based on the readme.txt
op_settings = [f'op_setting_{i+1}' for i in range(3)]
sensors = [f'sensor_{i+1}' for i in range(21)]
cols = ['unit_number', 'time_in_cycles'] + op_settings + sensors

# Filepath
data_path = '../data/CMAPSSData/train_FD001.txt'

# Read the data
try:
    df = pd.read_csv(data_path, sep=r"\s+", header=None)

    # Assign the column names
    df.columns = cols

except FileNotFoundError:
    print(f"Error: Data file not found at {data_path}")
    print("Please download the 'train_FD001.txt' file and place it in the '../data/CMAPSSData/' directory.")
    # Re-raise so later cells do not fail with a confusing NameError on `df`
    raise

## 2. Initial Data Verification

We will check the first few rows, the data types, and the null value counts to ensure the data was loaded correctly.
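
The same checks can also be written as assertions, so that a bad load fails loudly instead of relying on a visual scan of printed output. The sketch below is one possible way to do this. `verify_frame` is a hypothetical helper, not part of the original notebook, and it is shown on a toy frame so it runs without the data file.

```python
import pandas as pd


def verify_frame(frame, expected_columns, expected_rows=None):
    """Hypothetical helper: raise AssertionError if the frame looks wrong."""
    # Column names must match exactly, in order
    assert list(frame.columns) == list(expected_columns), "unexpected column names"
    # Optional row-count check
    if expected_rows is not None:
        assert len(frame) == expected_rows, (
            f"expected {expected_rows} rows, got {len(frame)}"
        )
    # No missing values anywhere in the frame
    assert int(frame.isnull().sum().sum()) == 0, "frame contains null values"
    # Every column should have parsed as a numeric dtype
    non_numeric = frame.select_dtypes(exclude="number").columns.tolist()
    assert not non_numeric, f"non-numeric columns: {non_numeric}"
    return True


# Toy frame with made-up values, standing in for the real data
toy = pd.DataFrame(
    {"unit_number": [1, 1], "time_in_cycles": [1, 2], "sensor_1": [518.67, 518.67]}
)
print(verify_frame(toy, ["unit_number", "time_in_cycles", "sensor_1"]))
```

In this notebook, the equivalent call would be something like `verify_frame(df, cols, expected_rows=20631)`, using the row count expected in the cell below.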

In [None]:
# Check the first 5 rows
print("--- Data Head ---")
print(df.head())

# Check the data types and look for null values
# We expect 20631 entries, 26 columns, and 0 nulls
print("\n--- Data Info ---")
df.info()

# Check for any nulls (should be 0)
print(f"\nTotal Null Values: {df.isnull().sum().sum()}")

# Get basic descriptive statistics
print("\n--- Data Description ---")
print(df.describe())

   unit_number  time_in_cycles  op_setting_1  op_setting_2  op_setting_3  \
0            1               1       -0.0007       -0.0004         100.0   
1            1               2        0.0019       -0.0003         100.0   
2            1               3       -0.0043        0.0003         100.0   
3            1               4        0.0007        0.0000         100.0   
4            1               5       -0.0019       -0.0002         100.0   

   sensor_1  sensor_2  sensor_3  sensor_4  sensor_5  ...  sensor_12  \
0    518.67    641.82   1589.70   1400.60     14.62  ...     521.66   
1    518.67    642.15   1591.82   1403.14     14.62  ...     522.28   
2    518.67    642.35   1587.99   1404.20     14.62  ...     522.42   
3    518.67    642.35   1582.79   1401.87     14.62  ...     522.86   
4    518.67    642.37   1582.85   1406.22     14.62  ...     522.19   

   sensor_13  sensor_14  sensor_15  sensor_16  sensor_17  sensor_18  \
0    2388.02    8138.62     8.4195       0.03