#Spacecraft Anomaly Detection

#Problem Statement :    
Space agencies like NASA and private firms like SpaceX rely on spacecraft telemetry data to monitor system health. Anomalies (unexpected deviations from normal behavior) in spacecraft systems can indicate potential failures or malfunctions.

The goal of this project is to detect anomalies in spacecraft telemetry data to ensure early identification of system failures and improve mission reliability.

#Objective :   

Build an unsupervised machine learning model that can automatically detect anomalies in spacecraft telemetry data. We aim to analyze time-series data of spacecraft sensor readings to find contextual anomalies (gradual drifts) and point anomalies (sudden spikes/drop-offs).




#Step.1 : Data Ingestion



In [1]:
!wget https://github.com/khundman/telemanom/archive/refs/heads/master.zip -O telemanom-master.zip


--2025-02-14 12:55:51--  https://github.com/khundman/telemanom/archive/refs/heads/master.zip
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/khundman/telemanom/zip/refs/heads/master [following]
--2025-02-14 12:55:51--  https://codeload.github.com/khundman/telemanom/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.112.10
Connecting to codeload.github.com (codeload.github.com)|140.82.112.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘telemanom-master.zip’

telemanom-master.zi     [ <=>                ] 466.36K  --.-KB/s    in 0.1s    

2025-02-14 12:55:51 (3.55 MB/s) - ‘telemanom-master.zip’ saved [477551]



In [2]:
import zipfile

with zipfile.ZipFile('telemanom-master.zip', 'r') as zip_ref:
    zip_ref.extractall('telemanom-master')


In [3]:
import os

os.listdir('telemanom-master/telemanom-master')


['example.py',
 'requirements.txt',
 '.gitignore',
 'README.md',
 'telemanom',
 'Dockerfile',
 'config.yaml',
 '.dockerignore',
 'results',
 'labeled_anomalies.csv',
 'LICENSE.txt']

In [4]:
import pandas as pd

# Load the 'labeled_anomalies.csv' file
df = pd.read_csv('telemanom-master/telemanom-master/labeled_anomalies.csv')
df.head()


Unnamed: 0,chan_id,spacecraft,anomaly_sequences,class,num_values
0,P-1,SMAP,"[[2149, 2349], [4536, 4844], [3539, 3779]]","[contextual, contextual, contextual]",8505
1,S-1,SMAP,"[[5300, 5747]]",[point],7331
2,E-1,SMAP,"[[5000, 5030], [5610, 6086]]","[contextual, contextual]",8516
3,E-2,SMAP,"[[5598, 6995]]",[point],8532
4,E-3,SMAP,"[[5094, 8306]]",[point],8307
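The anomaly_sequences column is read from the CSV as plain strings rather than Python lists. The following is a minimal sketch of parsing one such value and deriving per-window durations, using the P-1 row shown above. The variable names are illustrative only, and the reading of each pair as [start, end] indices into the channel's series is an assumption based on the column name.

```python
import ast

# Value copied from the P-1 row above; pandas loads it as a string.
raw = "[[2149, 2349], [4536, 4844], [3539, 3779]]"

# ast.literal_eval only evaluates Python literals, so it is safer than eval().
sequences = ast.literal_eval(raw)

# Assumption: each pair is [start, end] within the channel's time series.
durations = [end - start for start, end in sequences]
mean_duration = sum(durations) / len(durations)

print(durations)      # [200, 308, 240]
print(mean_duration)
```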


#Step.2 : Data Processing including Data Pre-Processing and Feature Engineering

In [5]:
df.shape

(82, 5)

#Observation :    

There are 82 rows and 5 columns in our dataset

In [6]:
df.columns

Index(['chan_id', 'spacecraft', 'anomaly_sequences', 'class', 'num_values'], dtype='object')

In [7]:
#Duplicate rows detection
df.duplicated().sum()

0

#Observation :   

There are no duplicated rows

In [8]:
#Checking datatypes of all the features
df.dtypes

Unnamed: 0,0
chan_id,object
spacecraft,object
anomaly_sequences,object
class,object
num_values,int64


#Observation  :     
All features except num_values have dtype object; num_values is int64.

In [9]:
#Examining unique values of all the columns
def print_unique_values_with_counts(column_name, df):
    if column_name in df.columns:
        value_counts = df[column_name].value_counts()
        print(f"Unique values and their counts in column '{column_name}':")
        for value, count in value_counts.items():
            print(f"{value}: {count}")
    else:
        print(f"Column '{column_name}' does not exist in the DataFrame.")

In [10]:
print_unique_values_with_counts('spacecraft', df)

Unique values and their counts in column 'spacecraft':
SMAP: 55
MSL: 27


In [11]:
print_unique_values_with_counts('anomaly_sequences', df)

Unique values and their counts in column 'anomaly_sequences':
[[1250, 1500]]: 2
[[1110, 2250]]: 2
[[2149, 2349], [4536, 4844], [3539, 3779]]: 1
[[1172, 1240]]: 1
[[900, 910]]: 1
[[1850, 2030]]: 1
[[5600, 5640]]: 1
[[4569, 8433]]: 1
[[4569, 8374]]: 1
[[5300, 6420]]: 1
[[5070, 5230]]: 1
[[6200, 8600]]: 1
[[1890, 1930]]: 1
[[2750, 2800]]: 1
[[4510, 4590]]: 1
[[4950, 6600]]: 1
[[3650, 3750], [5050, 5100], [7560, 7675]]: 1
[[5600, 5700]]: 1
[[5060, 5130]]: 1
[[4590, 4720]]: 1
[[1200, 1225]]: 1
[[5300, 5747]]: 1
[[1630, 1650], [1800, 2000]]: 1
[[940, 1040]]: 1
[[600, 1250]]: 1
[[1500, 2140]]: 1
[[1778, 1898], [1238, 1344]]: 1
[[870, 930], [1330, 1370]]: 1
[[4575, 4755]]: 1
[[780, 810], [890, 970]]: 1
[[3550, 3700]]: 1
[[1250, 1450], [2670, 2790], [3325, 3425]]: 1
[[2700, 2770]]: 1
[[690, 790], [1900, 2050]]: 1
[[630, 750]]: 1
[[290, 390], [1540, 1575]]: 1
[[550, 750], [2100, 2210]]: 1
[[1390, 1410]]: 1
[[1250, 1550]]: 1
[[5178, 7917]]: 1
[[4270, 4330]]: 1
[[2098, 2180], [5200, 5300]]: 1
[[55

In [None]:
print_unique_values_with_counts('class', df)

Unique values and their counts in column 'class':
[point]: 47
[contextual]: 17
[contextual, contextual]: 6
[point, point]: 4
[contextual, contextual, contextual]: 3
[point, contextual]: 3
[point, point, point]: 1
[contextual, point, contextual]: 1


In [12]:
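#Feature Creation
import ast
import numpy as np

# Parse anomaly_sequences from its string form into a list of [start, end] pairs
# (ast.literal_eval only evaluates literals, which is safer than eval)
df["anomaly_sequences"] = df["anomaly_sequences"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x.startswith("[") else [])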

# Feature Creation: Total Anomalies
df["total_anomalies"] = df["anomaly_sequences"].apply(lambda x: len(x) if isinstance(x, list) else 0)

# Feature Creation: Average Anomaly Duration
df["avg_anomaly_duration"] = df["anomaly_sequences"].apply(
    lambda x: np.mean([end - start for start, end in x]) if x else 0
)

# Feature Creation: Maximum Anomaly Duration
df["max_anomaly_duration"] = df["anomaly_sequences"].apply(
    lambda x: np.max([end - start for start, end in x]) if x else 0
)

# Feature Creation: Anomaly Duration Difference
df['anomaly_duration'] = df['max_anomaly_duration'] - df['avg_anomaly_duration']

# Function to Calculate Gaps Between Anomalies
def calculate_gap(anomaly_sequences):
    """
    Calculates the time gap between consecutive anomalies.
    If there's only one anomaly, gap is set to NaN.
    """
    if not anomaly_sequences or len(anomaly_sequences) < 2:
        return np.nan  # No gap if there's only one or no anomalies

    # Extract start & end times of anomalies
    start_times = [seq[0] for seq in anomaly_sequences]
    end_times = [seq[1] for seq in anomaly_sequences]

    # Compute gaps as difference between start of new anomaly and end of previous anomaly
    gaps = [start_times[i] - end_times[i-1] for i in range(1, len(start_times))]

    return np.mean(gaps) if gaps else np.nan  # Store average gap if available

# Apply the function safely
df['gap_between_anomalies'] = df['anomaly_sequences'].apply(calculate_gap)

# Check results
print(df[['total_anomalies', 'avg_anomaly_duration', 'max_anomaly_duration', 'anomaly_duration', 'gap_between_anomalies']].head())



   total_anomalies  avg_anomaly_duration  max_anomaly_duration  \
0                3            249.333333                   308   
1                1            447.000000                   447   
2                2            253.000000                   476   
3                1           1397.000000                  1397   
4                1           3212.000000                  3212   

   anomaly_duration  gap_between_anomalies  
0         58.666667                  441.0  
1          0.000000                    NaN  
2        223.000000                  580.0  
3          0.000000                    NaN  
4          0.000000                    NaN  
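The 441.0 gap reported for row 0 comes from windows that are not listed in chronological order ([4536, 4844] appears before [3539, 3779]), so one of the two gaps is negative. The sketch below compares the gap computed in the stored order with the gap computed after sorting by start index. The helper name mean_gap and the sort_first flag are hypothetical, and whether sorting is the intended behavior is an assumption.

```python
def mean_gap(sequences, sort_first=True):
    """Mean gap between consecutive [start, end] windows; None if fewer than two."""
    if len(sequences) < 2:
        return None
    # Sorting lists of pairs orders them by start index first.
    ordered = sorted(sequences) if sort_first else list(sequences)
    gaps = [ordered[i][0] - ordered[i - 1][1] for i in range(1, len(ordered))]
    return sum(gaps) / len(gaps)


# Windows of channel P-1, in the order stored in labeled_anomalies.csv.
p1 = [[2149, 2349], [4536, 4844], [3539, 3779]]

unsorted_gap = mean_gap(p1, sort_first=False)  # gaps: 2187 and -1305
sorted_gap = mean_gap(p1)                      # gaps: 1190 and 757

print(unsorted_gap)  # 441.0
print(sorted_gap)    # 973.5
```

With sorting, every gap is non-negative as long as the windows do not overlap, which makes the feature easier to interpret as a distance between anomalies.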


In [13]:
#Creating Aggregated Features
# Define rolling window size
window_size = 3  # Adjust as needed

# Create Moving Average & Rolling Standard Deviation for key anomaly-related features
df['moving_avg_anomalies'] = df['total_anomalies'].rolling(window=window_size, min_periods=1).mean()
df['rolling_std_anomalies'] = df['total_anomalies'].rolling(window=window_size, min_periods=1).std()

df['moving_avg_duration'] = df['avg_anomaly_duration'].rolling(window=window_size, min_periods=1).mean()
df['rolling_std_duration'] = df['avg_anomaly_duration'].rolling(window=window_size, min_periods=1).std()

df[['total_anomalies', 'moving_avg_anomalies', 'rolling_std_anomalies']].head()  # Preview result


Unnamed: 0,total_anomalies,moving_avg_anomalies,rolling_std_anomalies
0,3,3.0,
1,1,2.0,1.414214
2,2,2.0,1.0
3,1,1.333333,0.57735
4,1,1.333333,0.57735
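The empty rolling_std_anomalies value in row 0 arises because the sample standard deviation (ddof=1, the pandas default) is undefined for a single observation, even with min_periods=1. The sketch below reproduces the preview on the first five total_anomalies values and shows, purely as an illustration, what ddof=0 would give instead. Note also that each row of df is a different channel, so the window runs over neighbouring rows of the CSV rather than over time.

```python
import numpy as np
import pandas as pd

# First five total_anomalies values from the preview above.
total = pd.Series([3, 1, 2, 1, 1])
roll = total.rolling(window=3, min_periods=1)

moving_avg = roll.mean()
sample_std = roll.std()            # ddof=1: NaN when the window holds one value
population_std = roll.std(ddof=0)  # ddof=0: defined (0.0) for a single value

print(pd.DataFrame({
    "moving_avg": moving_avg,
    "sample_std": sample_std,
    "population_std": population_std,
}))
```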


In [14]:
df.head()

Unnamed: 0,chan_id,spacecraft,anomaly_sequences,class,num_values,total_anomalies,avg_anomaly_duration,max_anomaly_duration,anomaly_duration,gap_between_anomalies,moving_avg_anomalies,rolling_std_anomalies,moving_avg_duration,rolling_std_duration
0,P-1,SMAP,"[[2149, 2349], [4536, 4844], [3539, 3779]]","[contextual, contextual, contextual]",8505,3,249.333333,308,58.666667,441.0,3.0,,249.333333,
1,S-1,SMAP,"[[5300, 5747]]",[point],7331,1,447.0,447,0.0,,2.0,1.414214,348.166667,139.77144
2,E-1,SMAP,"[[5000, 5030], [5610, 6086]]","[contextual, contextual]",8516,2,253.0,476,223.0,580.0,2.0,1.0,316.444444,113.07929
3,E-2,SMAP,"[[5598, 6995]]",[point],8532,1,1397.0,1397,0.0,,1.333333,0.57735,699.0,612.218915
4,E-3,SMAP,"[[5094, 8306]]",[point],8307,1,3212.0,3212,0.0,,1.333333,0.57735,1620.666667,1492.126112


The anomaly_sequences feature is no longer needed after feature extraction, because its information has already been converted into numerical features.

In [15]:
df.drop(columns=["anomaly_sequences"], inplace=True)


In [16]:
# One-Hot Encoding for spacecraft and class
df = pd.get_dummies(df, columns=['spacecraft', 'class'], drop_first=True)

In [17]:
#df after One Hot Encoding
df.head()

Unnamed: 0,chan_id,num_values,total_anomalies,avg_anomaly_duration,max_anomaly_duration,anomaly_duration,gap_between_anomalies,moving_avg_anomalies,rolling_std_anomalies,moving_avg_duration,rolling_std_duration,spacecraft_SMAP,"class_[contextual, contextual]","class_[contextual, point, contextual]",class_[contextual],"class_[point, contextual]","class_[point, point, point]","class_[point, point]",class_[point]
0,P-1,8505,3,249.333333,308,58.666667,441.0,3.0,,249.333333,,True,False,False,False,False,False,False,False
1,S-1,7331,1,447.0,447,0.0,,2.0,1.414214,348.166667,139.77144,True,False,False,False,False,False,False,True
2,E-1,8516,2,253.0,476,223.0,580.0,2.0,1.0,316.444444,113.07929,True,True,False,False,False,False,False,False
3,E-2,8532,1,1397.0,1397,0.0,,1.333333,0.57735,699.0,612.218915,True,False,False,False,False,False,False,True
4,E-3,8307,1,3212.0,3212,0.0,,1.333333,0.57735,1620.666667,1492.126112,True,False,False,False,False,False,False,True
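One-hot encoding the class column treats each distinct list string as its own category, which yields seven dummy columns here (eight distinct values, one dropped by drop_first). A possible alternative, shown only as a sketch and not as the approach used in this notebook, is to count how many point and contextual labels each row carries. The helper count_classes and the output keys n_point and n_contextual are hypothetical names.

```python
def count_classes(label):
    """Count 'point' and 'contextual' entries in a string such as '[point, contextual]'."""
    items = [item.strip() for item in label.strip("[]").split(",") if item.strip()]
    return {
        "n_point": items.count("point"),
        "n_contextual": items.count("contextual"),
    }


# Three of the class values listed earlier in the notebook.
examples = ["[point]", "[contextual, contextual]", "[contextual, point, contextual]"]
counts = [count_classes(label) for label in examples]

for label, count in zip(examples, counts):
    print(label, count)
```

Two count columns carry the same information as the combination strings apart from the order of the labels, and they do not grow with the number of distinct combinations.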


In [18]:
df.shape

(82, 19)

#Observation :  
With only 82 rows, the dataset has extremely few samples, so a sliding-window method is needed; we will implement it later.
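A sliding-window approach usually means cutting each channel's time series into overlapping fixed-length segments, so that one channel yields many training samples. The sketch below illustrates the idea on a placeholder array. The window length, the step, and the use of a raw 1-D telemetry array as input are all assumptions, since the notebook has not yet defined them.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Placeholder for one channel's telemetry values (hypothetical data).
signal = np.arange(10, dtype=float)

window_length = 4  # hypothetical number of time steps per sample
step = 2           # hypothetical stride between consecutive windows

# All length-4 windows, then keep every second one.
windows = sliding_window_view(signal, window_length)[::step]

print(windows.shape)  # (4, 4)
print(windows[1])     # [2. 3. 4. 5.]
```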

In [None]:
df.columns

Index(['chan_id', 'num_values', 'total_anomalies', 'avg_anomaly_duration',
       'max_anomaly_duration', 'anomaly_duration', 'gap_between_anomalies',
       'moving_avg_anomalies', 'rolling_std_anomalies', 'moving_avg_duration',
       'rolling_std_duration', 'spacecraft_SMAP',
       'class_[contextual, contextual]',
       'class_[contextual, point, contextual]', 'class_[contextual]',
       'class_[point, contextual]', 'class_[point, point, point]',
       'class_[point, point]', 'class_[point]'],
      dtype='object')

In [19]:
columns_list = ['chan_id',
                'num_values',
                'total_anomalies',
                'avg_anomaly_duration',
                'max_anomaly_duration',
                'spacecraft_SMAP',
                'class_[contextual, contextual]',
                'class_[contextual, point, contextual]',
                'class_[contextual]',
                'class_[point, contextual]',
                'class_[point, point, point]',
                'class_[point, point]',
                'class_[point]']

In [20]:
#Checking missing values
df.isna().sum()

Unnamed: 0,0
chan_id,0
num_values,0
total_anomalies,0
avg_anomaly_duration,0
max_anomaly_duration,0
anomaly_duration,0
gap_between_anomalies,64
moving_avg_anomalies,0
rolling_std_anomalies,1
moving_avg_duration,0


#Observation :    
Only three features have missing values: gap_between_anomalies (64), rolling_std_anomalies (1) and rolling_std_duration (1).





In [21]:
#Handling missing values
from warnings import filterwarnings

# Ignore all warnings
filterwarnings('ignore')

# Median imputation for gaps (NaN where a channel has a single anomaly)
df['gap_between_anomalies'] = df['gap_between_anomalies'].fillna(df['gap_between_anomalies'].median())
# Backward fill: the first row's rolling std is NaN and has no earlier row to fill from
df['rolling_std_anomalies'] = df['rolling_std_anomalies'].bfill()
df['rolling_std_duration'] = df['rolling_std_duration'].bfill()

# Verifying if missing values are handled
print(df.isna().sum())

chan_id                                  0
num_values                               0
total_anomalies                          0
avg_anomaly_duration                     0
max_anomaly_duration                     0
anomaly_duration                         0
gap_between_anomalies                    0
moving_avg_anomalies                     0
rolling_std_anomalies                    0
moving_avg_duration                      0
rolling_std_duration                     0
spacecraft_SMAP                          0
class_[contextual, contextual]           0
class_[contextual, point, contextual]    0
class_[contextual]                       0
class_[point, contextual]                0
class_[point, point, point]              0
class_[point, point]                     0
class_[point]                            0
dtype: int64
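The rolling standard deviation columns are missing only in row 0, and the calls above use a backward fill (bfill). The sketch below shows, on the first few values from the earlier previews, why a forward fill would leave that value missing, and what median imputation does to the gap column. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First rows of rolling_std_anomalies from the earlier preview.
rolling_std = pd.Series([np.nan, 1.414214, 1.0])
forward = rolling_std.ffill()   # nothing precedes row 0, so the NaN stays
backward = rolling_std.bfill()  # row 0 takes the value from row 1

# First rows of gap_between_anomalies from the earlier preview.
gaps = pd.Series([441.0, np.nan, 580.0, np.nan, np.nan])
filled = gaps.fillna(gaps.median())  # median ignores NaN: (441 + 580) / 2

print(forward.tolist())
print(backward.tolist())
print(filled.tolist())
```

Because 64 of the 82 rows have no gap, median imputation assigns the same value to most of the column, which is worth keeping in mind when interpreting this feature later.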


In [22]:
# Outlier Detection:

def calculate_outlier_percentage(df):
    outlier_percentages = {}

    # Iterate over each numeric column in the DataFrame
    for column_name in df.select_dtypes(include=[np.number]).columns:
        column = df[column_name]  # Access the column data

        # Calculate Q1 (25th percentile) and Q3 (75th percentile)
        Q1 = column.quantile(0.25)
        Q3 = column.quantile(0.75)

        # Calculate the Interquartile Range (IQR)
        IQR = Q3 - Q1

        # Define outlier criteria
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Identify outliers
        outliers = (column < lower_bound) | (column > upper_bound)

        # Count the number of outliers
        num_outliers = np.sum(outliers)

        # Calculate the percentage of outliers
        outlier_percentage = (num_outliers / len(column)) * 100

        outlier_percentages[column_name] = outlier_percentage

    return outlier_percentages

outlier_percent = calculate_outlier_percentage(df)
print(outlier_percent)

{'num_values': 0.0, 'total_anomalies': 21.951219512195124, 'avg_anomaly_duration': 13.414634146341465, 'max_anomaly_duration': 12.195121951219512, 'anomaly_duration': 21.951219512195124, 'gap_between_anomalies': 21.951219512195124, 'moving_avg_anomalies': 1.2195121951219512, 'rolling_std_anomalies': 0.0, 'moving_avg_duration': 6.097560975609756, 'rolling_std_duration': 0.0}
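The 21.95% figure for total_anomalies can be understood from the IQR rule itself: when more than three quarters of the values are identical, Q1 equals Q3, the IQR is zero, and every other value is flagged. The sketch below rebuilds the column from the class counts printed earlier (64 single-label rows, 13 double, 5 triple), assuming each class list holds one label per anomaly window.

```python
import numpy as np

# Reconstructed from the class value counts (assumption: one label per window).
total_anomalies = np.array([1] * 64 + [2] * 13 + [3] * 5)

q1, q3 = np.percentile(total_anomalies, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# With iqr == 0 both bounds collapse onto 1, so every count above 1 is flagged.
outliers = (total_anomalies < lower) | (total_anomalies > upper)
outlier_pct = outliers.mean() * 100

print(q1, q3, iqr)
print(int(outliers.sum()), round(outlier_pct, 2))  # 18 21.95
```

The same 18 rows are the ones with a defined gap_between_anomalies, which would explain why that column reports an identical percentage.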


In [23]:
# Function to calculate skewness of only numerical columns
from scipy.stats import skew
def calculate_skewness(df):
    numeric_df = df.select_dtypes(include=[np.number])
    skewness = numeric_df.apply(skew, axis=0)
    return skewness

# Calculate skewness
skewness_values = calculate_skewness(df)
print(skewness_values)

num_values              -0.753204
total_anomalies          1.909836
avg_anomaly_duration     1.713625
max_anomaly_duration     1.678019
anomaly_duration         6.547964
gap_between_anomalies    1.887319
moving_avg_anomalies     1.462888
rolling_std_anomalies    0.815832
moving_avg_duration      1.535817
rolling_std_duration     0.597189
dtype: float64
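A common response to right skew in a count feature is a log1p transform. The sketch below estimates how much it would help for total_anomalies, again rebuilding the column from the class counts (an assumption). It uses a hand-written biased skewness, which is the default estimator of scipy.stats.skew, so that the example needs only NumPy. The helper name sample_skewness is hypothetical.

```python
import numpy as np


def sample_skewness(values):
    """Biased skewness m3 / m2**1.5 (the default estimator of scipy.stats.skew)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return m3 / m2 ** 1.5


# Reconstructed from the class value counts: 64 single, 13 double, 5 triple.
total_anomalies = np.array([1] * 64 + [2] * 13 + [3] * 5)

skew_raw = sample_skewness(total_anomalies)
skew_log = sample_skewness(np.log1p(total_anomalies))

print(round(skew_raw, 4))  # compare with the 1.909836 reported above
print(round(skew_log, 4))
```

With only three distinct values, the transform compresses the upper values but cannot remove the skew entirely, so the reduction is modest.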


In [24]:
from scipy.stats import kurtosis
# Function to calculate kurtosis of only numerical columns
def calculate_kurtosis(df):
    numeric_df = df.select_dtypes(include=[np.number])
    kurtosis_values = numeric_df.apply(kurtosis, axis=0)
    return kurtosis_values

# Calculate kurtosis
kurtosis_values = calculate_kurtosis(df)
print(kurtosis_values)

num_values               -1.235726
total_anomalies           2.536716
avg_anomaly_duration      1.626394
max_anomaly_duration      1.536214
anomaly_duration         47.434787
gap_between_anomalies    11.517474
moving_avg_anomalies      2.678821
rolling_std_anomalies    -0.653203
moving_avg_duration       1.944358
rolling_std_duration     -1.167838
dtype: float64


In [25]:
#Outlier Handling

# Log Transformation for `total_anomalies` (to reduce skewness)
df['total_anomalies'] = np.log1p(df['total_anomalies'])  # log1p prevents log(0) issues

# Capping `moving_avg_duration` at the 1.5 * IQR fences (winsorization-style clipping of extreme values)
Q1 = df['moving_avg_duration'].quantile(0.25)
Q3 = df['moving_avg_duration'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df['moving_avg_duration'] = np.clip(df['moving_avg_duration'], lower_bound, upper_bound)

# Keeping all anomaly-related columns (`anomaly_duration`, `max_anomaly_duration`, etc.)
# No removal or modification since they are critical for the project

# Verifying changes
print(df.describe())


        num_values  total_anomalies  avg_anomaly_duration  \
count    82.000000        82.000000             82.000000   
mean   6314.195122         0.799693            742.656504   
std    2701.236724         0.211071           1117.788959   
min    1096.000000         0.693147             10.000000   
25%    2997.500000         0.693147             81.000000   
50%    7895.500000         0.693147            157.500000   
75%    8472.000000         0.693147           1031.750000   
max    8640.000000         1.386294           4217.000000   

       max_anomaly_duration  anomaly_duration  gap_between_anomalies  \
count              82.00000         82.000000              82.000000   
mean              765.54878         22.892276             632.402439   
std              1114.58383         89.317057             488.810312   
min                10.00000          0.000000           -1141.000000   
25%                88.00000          0.000000             572.000000   
50%               

In [None]:
df.dtypes


Unnamed: 0,0
chan_id,object
num_values,float64
total_anomalies,float64
avg_anomaly_duration,float64
max_anomaly_duration,float64
spacecraft_SMAP,bool
"class_[contextual, contextual]",bool
"class_[contextual, point, contextual]",bool
class_[contextual],bool
"class_[point, contextual]",bool
