In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly
import math
import shap
from catboost import CatBoostRegressor
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from ydata_profiling import ProfileReport
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from IPython.display import display
plotly.io.renderers.default = "notebook"
%matplotlib inline

# Table of Contents
1. [Feature Descriptions and Profiling](#Feature-Descriptions-and-Profiling)
2. [Time Series Visual Analysis](#Time-Series-Vizual-Analysis)
3. [Feature Dependency Analysis](#Feature-Dependency-Analysis)
4. [Data Cleaning and Grouping](#Data-Cleaning-and-Grouping)
5. [Mahalanobis distance study](#Mahalanobis-distance-study)
6. [Noise study](#Noise-study)
7. [Conclusions](#Conclusions)

# Feature Descriptions and Profiling

In [None]:
df = pd.read_parquet('../data/01_raw/df_train_test.parquet')

In [None]:
df.shape

In [None]:
df.head(3)

The data are sampled every 10 minutes.

The DataFrame index is not a timestamp yet. For time series analysis it is better to index by timestamp, so let's convert it.

This is especially important when plotting time series and studying how different phenomena evolve over time.
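As a minimal sketch of how the 10-minute sampling claim could be checked once the index is a `DatetimeIndex`: the toy frame below is a hypothetical stand-in for the raw data (its values and the deliberate gap are assumptions), and the idea is simply to look at the differences between consecutive timestamps.

```python
import pandas as pd

# Toy frame standing in for the raw data: a string "Timestamps" column
# at 10-minute steps, with one sample deliberately missing (00:30).
toy = pd.DataFrame({
    "Timestamps": ["2024-01-01 00:00", "2024-01-01 00:10",
                   "2024-01-01 00:20", "2024-01-01 00:40"],
    "Power": [100.0, 120.0, 110.0, 90.0],
})

# Promote the column to a DatetimeIndex
toy.index = pd.to_datetime(toy["Timestamps"])
toy = toy.drop(columns=["Timestamps"])

# Time difference between consecutive rows; the most common value
# is taken as the sampling interval.
steps = toy.index.to_series().diff().dropna()
sampling_interval = steps.mode().iloc[0]

# Steps longer than the sampling interval indicate missing samples
n_gaps = int((steps > sampling_interval).sum())

print(sampling_interval, n_gaps)
```

The same two lines (`diff()` and `mode()`) can be applied to the real `df.index` after the conversion to see whether the series has gaps.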

In [None]:
df.index = pd.to_datetime(df['Timestamps'])
df.drop(columns=['Timestamps'], inplace=True)

Before doing any analysis, it's critical to understand what each raw feature means, so let's describe them.

In [None]:
df.columns

| **Feature** | **Description** | **Typical Units** | **Category** |
|--------------|----------------|-------------------|---------------|
| **WindSpeed** | Mean wind speed measured over the recording interval by the anemometer on the nacelle. | m/s | Meteorological |
| **WindDirAbs** | Absolute wind direction, measured with respect to geographic North. | ° (degrees) | Meteorological |
| **WindDirRel** | Relative wind direction: difference between wind direction and nacelle yaw position (turbine facing direction). | ° | Meteorological |
| **Power** | Electrical power output delivered by the generator during the sampling interval (average). | kW | Performance |
| **Pitch** | Average blade pitch angle (rotation of blades around their longitudinal axis to control aerodynamic load). | ° | Control |
| **GenRPM** | Generator rotational speed (after gearbox). | rpm | Mechanical |
| **RotorRPM** | Rotor rotational speed (before gearbox). | rpm | Mechanical |
| **EnvirTemp** | Ambient environmental temperature near the nacelle. | °C | Environmental |
| **NacelTemp** | Temperature measured inside the nacelle (housing on top of the tower). | °C | Environmental |
| **GearOilTemp** | Temperature of gearbox lubricating oil (indicator of mechanical load and thermal stress). | °C | Mechanical / Condition Monitoring |
| **GearBearTemp** | Temperature of the main gearbox bearing. | °C | Mechanical / Condition Monitoring |
| **GenPh1Temp** | Temperature of generator winding Phase 1. | °C | Electrical / Condition Monitoring |
| **GenBearTemp** | Temperature of generator bearing (critical indicator of bearing wear or lubrication issues). | °C | Condition Monitoring |

## Let's make a quick dataset description

In [None]:
df.describe()

**We see that:**
- Some features have implausible negative minimum values.
- Some parameters have a high std / mean ratio, so there is high variability within the data.
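The std / mean ratio mentioned above is the coefficient of variation. A small sketch of how it could be read off `describe()`, using a toy frame with assumed column names and distributions rather than the turbine data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data (assumption): one steady signal and one highly variable signal
toy = pd.DataFrame({
    "steady": rng.normal(loc=50.0, scale=1.0, size=1000),
    "variable": rng.exponential(scale=50.0, size=1000),
})

# describe() gives mean and std per column; their ratio is the
# coefficient of variation (relative spread of each feature).
stats = toy.describe().T
stats["cv"] = stats["std"] / stats["mean"].abs()

print(stats[["mean", "std", "min", "cv"]])
```

Note that the ratio is only meaningful for features whose mean is well away from zero.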

However, the basic description does not tell us much.

Let's use the data profiler.

In [None]:
profile = ProfileReport(df)
profile

**We see that:**

**Observations from values and distributions**
- There are outliers, including negative values, for many parameters. These definitely need to be cleaned out.
- A lot of the values are zero, which means that the turbine is idle much of the time (this may also be turbine downtime). These regimes will most likely need to be removed before analyzing relationships, correlations, etc.
- Power, a very important parameter, is highly skewed. If we use it as a model target, this can be a problem.
- GenRPM is relatively evenly distributed but also has some distinct peaks, which seem to be the main operating regimes.
- The temperatures seem to be moderate and do not have many negative values.

# Time Series Vizual Analysis

In [None]:
def plot_time_series(df, columns, step=10, rolling_window=None):
    """
    Clean and fast Plotly plot:
    - Subplots stacked vertically
    - Optional rolling median trend (black)
    - Plots only every `step`-th sample

    Parameters
    ----------
    df : pd.DataFrame
        Time series dataframe with a DatetimeIndex.
    columns : list of str
        List of column names to plot.
    step : int
        Subsampling step for faster plotting.
    rolling_window : int or None
        Window size for rolling median.
        If None → no rolling median plotted.
    """

    # Subsample for speed
    df_small = df.iloc[::step]

    # Create subplot layout
    fig = make_subplots(
        rows=len(columns),
        cols=1,
        shared_xaxes=True,
        vertical_spacing=0.01,
        subplot_titles=columns
    )

    # Precompute rolling only if requested
    if rolling_window is not None:
        df_roll = (
            df[columns]
            .rolling(rolling_window, min_periods=1)
            .median()
            .iloc[::step]
        )

    for i, col in enumerate(columns, start=1):

        # Original series
        fig.add_trace(
            go.Scatter(
                x=df_small.index,
                y=df_small[col],
                mode="lines",
                name=col,
                line=dict(width=1)
            ),
            row=i, col=1
        )

        # Rolling trend only if rolling_window is given
        if rolling_window is not None:
            fig.add_trace(
                go.Scatter(
                    x=df_roll.index,
                    y=df_roll[col],
                    mode="lines",
                    name=f"{col} (rolling median)",
                    line=dict(width=1, color="black"),
                ),
                row=i, col=1
            )

    fig.update_xaxes(tickmode="auto")

    fig.update_layout(
        height=250 * len(columns),
        showlegend=False,
        title_text="Time Series Overview",
        margin=dict(l=50, r=30, t=50, b=50)
    )

    fig.show()

In [None]:
df.columns

In [None]:
cols_to_plot = [
    'Power', 'WindSpeed', 'GenRPM', 'RotorRPM', 
    'WindDirAbs', 'WindDirRel', 'Pitch',
    'EnvirTemp', 'NacelTemp', 'GearOilTemp',
    'GearBearTemp', 'GenPh1Temp', 'GenBearTemp'
]
plot_time_series(df, cols_to_plot, step=1)

- Most of the parameters have strong outlying values. This can be a problem when fitting a model.
- There is a data chunk where all the values are zero. This corresponds to the turbine shutdown.
- The operation of the turbine is unsteady, which is expected because wind is turbulent and intermittent by nature.
- Many signals have quite A LOT of noise, at least visually. It might be a good idea to analyze it and denoise if possible.
- Some signals show seasonality, especially the temperature-related ones, which makes sense.
- We don't observe any strongly anomalous behavior before the shutdown.

Let's also plot the time series with a rolling median, which helps us see the trends.

In [None]:
cols_to_plot = [
    'Power', 'WindSpeed', 'GenRPM', 'RotorRPM', 
    'WindDirAbs', 'WindDirRel', 'Pitch',
    'EnvirTemp', 'NacelTemp', 'GearOilTemp',
    'GearBearTemp', 'GenPh1Temp', 'GenBearTemp'
]
plot_time_series(df, cols_to_plot, step=1, rolling_window=15)

- When zooming in, it's possible to see that the parameters contain a lot of noise that can be smoothed out.
- Clearly, in addition to the noise, even the rolling median values have high variability.
- We can also see that the median filter can serve as an outlier removal step.
- The smoothed values show clearer relationships between the Power, WindSpeed, GenRPM and RotorRPM parameters, especially when there are big ups and downs. These parameters might have good feature <--> target relationships.
- We can see the same for GenPh1Temp and GenBearTemp, and for NacelTemp and EnvirTemp.
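To illustrate why a rolling median can double as an outlier filter, here is a minimal sketch on a synthetic series with two injected spikes. The series, the window of 5 and the use of `center=True` are assumptions for illustration; the plots above use a trailing window of 15.

```python
import numpy as np
import pandas as pd

# Smooth synthetic signal (assumption) with two injected spikes
t = np.arange(50)
clean = pd.Series(np.sin(t / 8.0))
spiky = clean.copy()
spiky.iloc[[10, 30]] = [25.0, -25.0]

# A centered rolling median ignores isolated extreme values, because a
# single spike can never be the middle value of a 5-sample window.
smoothed = spiky.rolling(window=5, center=True, min_periods=1).median()

max_err_raw = (spiky - clean).abs().max()
max_err_smoothed = (smoothed - clean).abs().max()

print(f"max error with spikes:      {max_err_raw:.2f}")
print(f"max error after the median: {max_err_smoothed:.2f}")
```

A rolling mean with the same window would instead smear each spike over its neighbours, which is the main reason to prefer the median here.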

# Feature Dependency Analysis

In [None]:
# Compute correlation matrix
corr = df.corr()

# Plot heatmap with annotations
plt.figure(figsize=(15, 15))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0, square=True, annot_kws={"size": 12})
plt.title("Feature Correlation Heatmap")
plt.tight_layout()

- Some parameters have VERY strong correlations, e.g. Power <--> GenRPM with corr=0.88.
- Some parameters are barely correlated with anything, e.g. Pitch.
- There are many moderate to strong correlation values.
- What is odd is that in the time series plots we have NOT seen such strong relationships, especially taking the noise into account.

Let's take Power as an example and sort correlations from highest to lowest.

In [None]:
# Compute correlation of all columns with the target
corr = df.corr()['Power'].drop('Power')

# Sort by absolute correlation value
correlations_sorted = corr.reindex(corr.abs().sort_values(ascending=False).index)
correlations_sorted

From the Time Series plots, we haven't seen such a strong correlation. Let's check the scatter-like plots.

In [None]:
def plot_relationships(x, y, x_label='X', y_label='Y'):
    """
    Plots x-y relationships in different formats (Regression, KDE and HexBin plots)
    """
    fig, axs = plt.subplots(1, 3, figsize=(18, 5))
    
    # 1. Regression Plot (via seaborn)
    sns.regplot(x=x, y=y, ax=axs[0], scatter_kws={'s': 20}, line_kws={'color': 'red'})
    axs[0].set_title('Regression Plot')
    axs[0].set_xlabel(x_label)
    axs[0].set_ylabel(y_label)

    # 2. KDE Plot (Seaborn joint density)
    sns.kdeplot(x=x, y=y, fill=True, cmap="mako", ax=axs[1], thresh=0.01)
    axs[1].set_title('KDE Plot')
    axs[1].set_xlabel(x_label)
    axs[1].set_ylabel(y_label)

    # 3. Hexbin Plot (via Matplotlib)
    axs[2].hexbin(x, y, gridsize=30, cmap='viridis', mincnt=1)
    axs[2].set_title('Hexbin Plot')
    axs[2].set_xlabel(x_label)
    axs[2].set_ylabel(y_label)

    plt.tight_layout(rect=[0, 0, 1, 0.95])
    plt.show()

In [None]:
v1 = 'GenRPM'
v2 = 'Power'
n = 10
df_local = df.copy() # df[(df[v1] > 800)]
plot_relationships(df_local[v1][::n], df_local[v2][::n], v1, v2)

These variables do NOT look well correlated.

The correlation value is strongly affected by the outliers, which in this case make it high!

As for the outliers, we can see some clouds of points that are separated from the main relationship.

There is a big cloud of points around zero, which drives the high correlation.
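The effect of a cloud of points at zero can be reproduced on purely synthetic data. In the sketch below (all ranges and sample counts are assumptions), the "operating" points are generated independently, so their correlation is close to zero, yet adding a block of (0, 0) points produces a high Pearson coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Operating" points: x and y are drawn independently,
# so their true correlation is approximately zero.
n_operating = 2000
x_op = rng.uniform(800, 1600, n_operating)
y_op = rng.uniform(500, 2000, n_operating)

# "Downtime" points: both variables sit exactly at zero
n_zero = 1000
x_all = np.concatenate([x_op, np.zeros(n_zero)])
y_all = np.concatenate([y_op, np.zeros(n_zero)])

corr_operating = np.corrcoef(x_op, y_op)[0, 1]
corr_with_zeros = np.corrcoef(x_all, y_all)[0, 1]

print(f"operating points only: {corr_operating:.2f}")
print(f"with the zero cluster: {corr_with_zeros:.2f}")
```

The zero cluster and the operating cloud act like two distant groups, and the line joining them dominates the coefficient regardless of what happens inside each group.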

Let's check for some more variables.

In [None]:
v1 = 'WindSpeed'
v2 = 'Power'
n = 10
df_local = df.copy()
plot_relationships(df_local[v1][::n], df_local[v2][::n], v1, v2)

We see the same picture here. We also see that there are MANY points that look like a single point at zero.

This is NOT possible to see in the scatter plot, but we can see it in the KDE and Hexbin plots.

We clearly see that the correlation values are strongly influenced by the outliers, especially the zeros.

Let's first remove the zeros (here, every row with Power <= 20).

In [None]:
df_no_zero = df[df['Power'] > 20]

In [None]:
# Compute correlation of all columns with the target
corr = df_no_zero.corr()['Power'].drop('Power')

# Sort by absolute correlation value
correlations_sorted = corr.reindex(corr.abs().sort_values(ascending=False).index)
correlations_sorted

In [None]:
# Raw data correlations
# GenRPM          0.879374
# GenPh1Temp      0.828008
# GearOilTemp     0.743265
# WindSpeed       0.705276
# RotorRPM        0.703314
# GearBearTemp    0.703189
# GenBearTemp     0.677292
# WindDirAbs      0.419142
# NacelTemp       0.313068
# EnvirTemp       0.249844
# Pitch           0.104456
# WindDirRel     -0.015779

Now we see much smaller correlations.

Let's check how it looks in the scatter plots.

In [None]:
v1 = 'GenRPM'
v2 = 'Power'
n = 10
df_local = df_no_zero.copy()
plot_relationships(df_local[v1][::n], df_local[v2][::n], v1, v2)

In [None]:
v1 = 'RotorRPM'
v2 = 'Power'
n = 10
df_local = df_no_zero.copy()
plot_relationships(df_local[v1][::n], df_local[v2][::n], v1, v2)

We still see the influence of the outliers.

And now it's hard to say if the outliers increase or decrease the correlations.

But more importantly, it's hard to see the relationships clearly and truly understand the data.

Let's look at the distributions more closely.

In [None]:
# Select numeric columns only
numeric_cols = df_no_zero.select_dtypes(include=np.number).columns
n_cols = 3
n_rows = int(np.ceil(len(numeric_cols) / n_cols))

plt.figure(figsize=(15, 10))

for i, col in enumerate(numeric_cols, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.histplot(df_no_zero[col].dropna(), bins=50, kde=True)
    plt.title(col, fontsize=10)
    plt.xlabel('')
    plt.ylabel('')

plt.tight_layout()

We can see long tails in most of the features, but the number of data points in them is not that big.

Let's also check whether we can see them in the multidimensional space reduced to 2 dimensions.

In [None]:
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df) # df_no_zero

In [None]:
# Apply PCA
pca = PCA(n_components=len(df.columns))
X_pca = pca.fit_transform(X_scaled)

# Create DataFrame of first 2 components
pca_df = pd.DataFrame(data=X_pca[:, :2], columns=["PC1", "PC2"])

In [None]:
# Plot 1: PCA Scatter Plot (2D)
plt.figure(figsize=(8, 5))
sns.scatterplot(x="PC1", y="PC2", data=pca_df[::1])
plt.title("PCA: First 2 Principal Components")
plt.grid(True)
plt.tight_layout()
plt.show()

We can see that there are outlying values.

If we check this for **df_no_zero**, we will not see them.

So, this means that these PCA outlying values are zeros.

Also, let's check how representative the first two principal components are.

In [None]:
# Plot 2: Explained Variance (Scree Plot)
explained_variance = pca.explained_variance_ratio_
cumulative_variance = explained_variance.cumsum()
cumulative_variance

In [None]:
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='-')
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Cumulative Explained Variance")
plt.grid(True)
plt.tight_layout()
plt.show()

Also, we see that a lot of information is missing from the first two components, so PCA plots might not be very representative.
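One way to quantify how representative a given number of components is: count how many are needed to reach a target share of variance. The sketch below uses toy data and an assumed 90% target. It relies on the fact that the explained variance ratios of PCA on standardized features equal the normalized eigenvalues of the correlation matrix, so plain NumPy is enough.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): six features driven by two latent factors plus noise
n = 500
latent = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(n, 6))

# Eigenvalues of the correlation matrix, sorted from largest to smallest
corr = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]

explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches the target
target = 0.90
n_components = int(np.searchsorted(cumulative, target) + 1)

print(cumulative.round(3), n_components)
```

With the fitted scikit-learn object from the cells above, the same count could be obtained from `pca.explained_variance_ratio_.cumsum()`.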

# Data Cleaning and Grouping

### Z-Score Filter

Let's try to remove the outliers with a simple Z-score filter.

In [None]:
def remove_outliers_zscore(df, threshold=3, nan_treatment='ffill'):
    """
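    Replace outliers (based on z-score) with NaN for each numeric column,
    treat the resulting NaNs, and report how many values were replaced.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame (numeric or mixed).
    threshold : float, optional
        Z-score threshold. Default = 3.
    nan_treatment : {'ffill', 'drop'}, optional
        How to treat the NaNs left by the outliers: forward-fill them
        or drop the affected rows. Default = 'ffill'.

    Returns
    -------
    df_masked : pd.DataFrame
        DataFrame with outlier values forward-filled or dropped.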
    """
    df_masked = df.copy()
    numeric_cols = df.select_dtypes(include=np.number).columns

    total_replaced = 0
    replaced_per_column = {}

    for col in numeric_cols:
        mean = df[col].mean()
        std = df[col].std(ddof=0)
        z_score = np.abs((df[col] - mean) / std)

        outlier_mask = z_score > threshold
        n_replaced = outlier_mask.sum()

        df_masked.loc[outlier_mask, col] = np.nan

        replaced_per_column[col] = n_replaced
        total_replaced += n_replaced

    if nan_treatment == 'ffill':
        df_masked = df_masked.ffill()
    elif nan_treatment =='drop':
        df_masked = df_masked.dropna()
    else:
        raise ValueError(f'{nan_treatment} nan_treatment is not recognized')

    print(f"Replaced {total_replaced} values total (|z| > {threshold}).")
    print("Per column replacements:")
    for col, n in replaced_per_column.items():
        print(f"  {col}: {n}")

    return df_masked

In [None]:
df_clean = remove_outliers_zscore(df_no_zero, nan_treatment='ffill', threshold=3) # Check with 2-3

In [None]:
# Select numeric columns only
numeric_cols = df_clean.select_dtypes(include=np.number).columns
n_cols = 3
n_rows = int(np.ceil(len(numeric_cols) / n_cols))

plt.figure(figsize=(15, 10))

for i, col in enumerate(numeric_cols, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.histplot(df_clean[col].dropna(), bins=25, kde=True)
    plt.title(col, fontsize=10)
    plt.xlabel('')
    plt.ylabel('')

plt.tight_layout()

We see that we cut the outliers quite well. Now, let's check the X-Y plots.

In [None]:
v1 = 'GenRPM'
v2 = 'Power'
n = 10
df_local = df_clean.copy()
plot_relationships(df_local[v1][::n], df_local[v2][::n], v1, v2)

In [None]:
v1 = 'WindSpeed'
v2 = 'Power'
n = 10
df_local = df_clean.copy()
plot_relationships(df_local[v1][::n], df_local[v2][::n], v1, v2)

Now, we can better see the relationships. 

However, due to the noisy nature of the signals, it's still hard.

Let's try one trick - let's plot grouped data.

## Grouped data plots

In [None]:
df_gr = df_clean.resample('1D').mean()

In [None]:
v1 = 'GenRPM'
v2 = 'Power'
n = 1
df_local = df_gr.copy()
plot_relationships(df_local[v1][::n], df_local[v2][::n], v1, v2)

In [None]:
v1 = 'WindSpeed'
v2 = 'Power'
n = 1
df_local = df_gr.copy()
plot_relationships(df_local[v1][::n], df_local[v2][::n], v1, v2)

In [None]:
v2 = "Power"         # fixed y-variable
df_local = df_gr.copy()
n = 1                # subsampling step

for col in df_local.columns:
    if col == v2:
        continue     # skip Power itself
    print(f"Plotting: {col} vs {v2}")
    plot_relationships(
        df_local[col][::n],
        df_local[v2][::n],
        col,
        v2
    )

Now we can see that ON AVERAGE, for some variables like WindSpeed or GenRPM vs Power, the relationships are quite linear.

We can take this into account, because it may be useful when creating the model and choosing the time horizon for the model prediction.

Some variables are close to a linear dependency, but the variance of the relationship is very high, e.g. EnvirTemp vs Power.

Pitch has a very strange relationship with Power; however, in some data ranges it can be a useful predictor.

Let's see how it looks as a time series.

In [None]:
cols_to_plot = [
    'Power', 'WindSpeed', 'GenRPM', 'RotorRPM', 
    'WindDirAbs', 'WindDirRel', 'Pitch',
    'EnvirTemp', 'NacelTemp', 'GearOilTemp',
    'GearBearTemp', 'GenPh1Temp', 'GenBearTemp'
]

plot_time_series(df_gr, cols_to_plot, step=1)

We can see and study the time series more clearly after grouping.

We can see how the big ups and downs of the highly correlated features match the ups and downs of Power.

However, we can't really see any specific behavior before the downtime.

We need to do something else.

### PCA on Grouped Data

Let's apply PCA to the grouped data.

In [None]:
df_pca = df_gr.copy().dropna()

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_pca)

pca = PCA(n_components=len(df_pca.columns))
X_pca = pca.fit_transform(X_scaled)


pca_df = pd.DataFrame(
    data=X_pca[:, :2],
    index=df_pca.index,
    columns=["PC1", "PC2"]
)

In [None]:
# Plot 1: PCA Scatter Plot (2D)
plt.figure(figsize=(8, 5))
sns.scatterplot(x="PC1", y="PC2", data=pca_df[::1])
plt.title("PCA: First 2 Principal Components")
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# Plot 2: Explained Variance (Scree Plot)
explained_variance = pca.explained_variance_ratio_
cumulative_variance = explained_variance.cumsum()
cumulative_variance

- Even after grouping, the first 2 PCs still do not explain enough of the variance.
- We can see a "cluster" of points for PC1 > 5, but it's not very well separated.
- Still, let's mark these points on the time series.

In [None]:
pc1_thr = 4
pc1 = pca_df["PC1"]

mask_red = pc1 > pc1_thr   # positions where PC1 is "high"

cols_to_plot = [
    'Power', 'WindSpeed', 'GenRPM', 'RotorRPM', 
    'WindDirAbs', 'WindDirRel', 'Pitch',
    'EnvirTemp', 'NacelTemp', 'GearOilTemp',
    'GearBearTemp', 'GenPh1Temp', 'GenBearTemp'
]

for col in cols_to_plot:
    plt.figure(figsize=(10, 3))

    # base time series (grey line)
    plt.plot(
        df_pca.index,
        df_pca[col],
        color="grey",
        linewidth=1,
        label=col
    )

    # mark PC1 > threshold in red
    plt.scatter(
        df_pca.index[mask_red],
        df_pca.loc[mask_red, col],
        color="red",
        s=10,
        label=f"PC1 > {pc1_thr}"
    )

    plt.title(f"{col} – points where PC1 > {pc1_thr} in red")
    plt.xlabel("Time")
    plt.ylabel(col)
    plt.grid(True)
    plt.legend()
    plt.tight_layout()
    plt.show()

- We see that PCA "outliers" are just the points with the highest values of Power, WindSpeed, GenRPM, etc.
- These values occurred long before the downtime and again soon after it.
- So, these points can hardly be considered anomalies in terms of downtime detection.

# Mahalanobis distance study

Mahalanobis Distance measures how far a point is from the center of a multivariate distribution while accounting for correlations between features.

Unlike Euclidean distance, it adjusts for:
- different variances in each feature
- correlations between features
- scale differences across dimensions


Mahalanobis Distance is defined as:

$$
D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}
$$

Where:

- $x$ is the data point  
- $\mu$ is the mean vector  
- $\Sigma^{-1}$ is the inverse covariance matrix  


**Mahalanobis Distance answers:**

"How many multivariate standard deviations away is this point from the center?"

It is commonly used for:
- multivariate anomaly detection
- identifying outliers in high-dimensional data
- measuring how unusual a point is relative to a distribution
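A minimal sketch of the definition above on two strongly correlated toy features (the covariance matrix and the test points are assumptions). Both test points are equally far from the center in the Euclidean sense, but the one that breaks the correlation gets a much larger Mahalanobis distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated toy features
cov_true = np.array([[1.0, 0.9],
                     [0.9, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=5000)

# Estimate the mean vector and the inverse covariance from the sample
mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))


def mahalanobis(point):
    """sqrt((x - mu)^T Sigma^{-1} (x - mu)) for a single point."""
    diff = np.asarray(point, dtype=float) - mu
    return float(np.sqrt(diff @ cov_inv @ diff))


along = np.array([2.0, 2.0])     # follows the correlation
against = np.array([2.0, -2.0])  # breaks the correlation

d_along = mahalanobis(along)
d_against = mahalanobis(against)

print(f"Euclidean:   {np.linalg.norm(along):.2f} vs {np.linalg.norm(against):.2f}")
print(f"Mahalanobis: {d_along:.2f} vs {d_against:.2f}")
```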

In [None]:
# 1) Split
split_point = 30000
df_mah_train = df_clean.iloc[:split_point].dropna().copy()
df_mah_test  = df_clean.iloc[split_point:].dropna().copy()

# 2) Arrays
x_train_raw = df_mah_train.values
x_test_raw  = df_mah_test.values

# 3) Scale
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train_raw)
x_test  = scaler.transform(x_test_raw)

# 4) Covariance
mu = np.mean(x_train, axis=0)
cov = np.cov(x_train, rowvar=False)
cov_inv = np.linalg.pinv(cov)

# 5) Mahalanobis
diff = x_test - mu
precision_proj = diff @ cov_inv
mah_components = precision_proj * diff

d2 = np.sum(mah_components, axis=1)
df_mah_test["mahalanobis_distance"] = np.sqrt(d2)

In [None]:
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=df_mah_test.index,
    y=df_mah_test["mahalanobis_distance"],
    mode="lines",
    name="Mahalanobis distance",
))

fig.add_trace(go.Scatter(
    x=df_mah_test.index,
    y=df_mah_test["mahalanobis_distance"].rolling(50).median(),
    mode="lines",
    name="Rolling median (50)",
    line=dict(color="black", width=2)
))

fig.update_layout(
    title="Mahalanobis distance (test segment)",
    xaxis_title="Time",
    yaxis_title="Distance",
    template="plotly_white",
    legend=dict(
        x=0.01,
        y=0.99,
        xanchor="left",
        yanchor="top",
        bgcolor="rgba(255,255,255,0)",
        bordercolor="rgba(0,0,0,0)",
        font=dict(size=11),
        orientation="h",
    )
)

fig.show()

- Finally, we can see some signs of anomaly around the downtime.
- Around the end of May, the distance starts rising considerably. This indicates that these data points are moving away from the normal range, INCLUDING the covariance between the features.
- So, the main conclusion is that THERE ARE some anomalies around the downtime.
- It's tempting to use the Mahalanobis distance to monitor anomalies in production; however, this method can be numerically unstable.
- Also, we do not really understand which parameters drove the distance (and the turbine) to behave abnormally.

Let's try to explain what contributes the most to the rise in distance.

### Computing distances per feature

In [None]:
# Assume that above the Mahalanobis distance threshold, the data are abnormal
threshold = 4.5
anom_mask = df_mah_test["mahalanobis_distance"] >= threshold

# Select anomaly points
mah_components_anom = mah_components[anom_mask]     # (n_anom, n_features)

feature_cols = df_mah_test.columns.drop("mahalanobis_distance")

df_contrib_anom = pd.DataFrame(
    mah_components_anom,
    columns=feature_cols,
    index=df_mah_test.loc[anom_mask].index
)

df_contrib_anom.head(5)

In [None]:
# Compute feature importance across anomalies (mean absolute contribution)
feature_importance = (
    df_contrib_anom.abs().mean().sort_values(ascending=False)
)

print("Feature contribution ranking:")
print(feature_importance)

- Nice! We clearly see 3 dominating features that contributed the most to the anomaly region.
- We have to note, however, that these features are correlated, which may be why all of them dominate at the same time.
- Before drawing conclusions, let's try some more tricks to explain the rise in the Mahalanobis distance.
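For context on the per-feature terms used above: the element-wise product of `diff @ cov_inv` and `diff` splits the squared Mahalanobis distance into one term per feature. The sketch below, on toy data with an assumed covariance, checks that the terms sum to the squared distance and shows that individual terms can be negative when features are correlated, which is why the ranking above takes absolute values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (assumption): features 0 and 1 are correlated, feature 2 is independent
cov_true = np.array([[1.0, 0.8, 0.0],
                     [0.8, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
X = rng.multivariate_normal(mean=np.zeros(3), cov=cov_true, size=2000)

mu = X.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))

diff = X - mu
contrib = (diff @ cov_inv) * diff   # per-feature terms, shape (n_samples, n_features)
d2 = contrib.sum(axis=1)            # squared Mahalanobis distance per sample

# Direct quadratic form for comparison
d2_direct = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Share of per-feature terms that are negative
share_negative = float((contrib < 0).mean())

print(np.allclose(d2, d2_direct), round(share_negative, 3))
```

Because each term mixes a feature with its correlated neighbours through the inverse covariance, a large term for one feature does not by itself mean that this feature alone is abnormal.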

### Explanation using ML Model Feature Analysis

Let's fit a CatBoost model with mostly default parameters where:

- **Features** - the distances per feature that we computed above.
- **Target** - Mahalanobis distance.

In [None]:
# Use true per-feature Mahalanobis contributions as model features
df_contrib_all = pd.DataFrame(
    mah_components,
    index=df_mah_test.index,
    columns=feature_cols
)

X = df_contrib_all
y = df_mah_test["mahalanobis_distance"]

# Select anomaly subset
threshold = 4.5
anom_mask = y >= threshold

X_anom = X[anom_mask]
y_anom = y[anom_mask]

# Train CatBoost on all test data
# (Model learns mapping: per-feature-contrib → anomaly score)

model = CatBoostRegressor(
    depth=6,
    learning_rate=0.05,
    iterations=600,
    loss_function="RMSE",
    verbose=False,
    random_seed=42
)

model.fit(X, y)

First, let's check the default feature importance of CatBoost.

In [None]:
feature_importance = model.get_feature_importance()
df_catboost_importance = pd.DataFrame({
    "feature": X.columns,
    "importance": feature_importance
}).sort_values("importance", ascending=False)

print(df_catboost_importance)

Interestingly, we see the same TOP features; however, Power is now TOP 1.

Now, let's try to use SHAP values.

We will check SHAP values for:
- Anomaly data (Mahalanobis distance >= 4.5)
- The entire test dataset.

**First, let's check anomalies only**

In [None]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_anom)   # explain anomalies only

# Global ranking
shap_importance = np.abs(shap_values).mean(axis=0)
shap_ranking = pd.Series(shap_importance, index=X.columns).sort_values(ascending=False)

print("\nSHAP Feature Importance (Anomaly Region):")
print(shap_ranking)

**Now, let's check the entire data used to train CatBoost**

In [None]:
explainer = shap.TreeExplainer(model)

# Compute SHAP for entire dataset
shap_values_all = explainer.shap_values(X)

# Global SHAP importance (full dataset)
shap_importance_all = np.abs(shap_values_all).mean(axis=0)

shap_ranking_all = (
    pd.Series(shap_importance_all, index=X.columns)
    .sort_values(ascending=False)
)

print("\n=== Global SHAP Feature Importance (Full Dataset) ===\n")
print(shap_ranking_all)

We see similar importances.

## Mahalanobis study conclusions

- Using 4 different approaches, we found that Power, GenRPM and GenPh1Temp contribute the most to the anomalous region.
- When using the SHAP explainer on the CatBoost model, the Power feature dominates.
- From the technological perspective, Power is the resulting variable of the wind turbine's operation.
- **From both the data-driven and the technological point of view, it's proposed to monitor Power as the main signal for identifying the anomaly using regression models.**

# Noise study

During the entire EDA, we observed that the data seem to have quite a lot of noise, so this section studies the noise in more detail using the Fast Fourier Transform.

FFT (Fast Fourier Transform) is an algorithm that converts a time-series signal into its frequency components.
Instead of looking at how the signal changes over time, FFT shows which frequencies are present and how strong they are.

This is useful because:

- Noise often appears as high-frequency components
- Trends and slow variations appear as low-frequency components
- Periodic behavior becomes clearly visible in the frequency domain

By analyzing the frequency spectrum, we can better understand the underlying structure of the signal and identify which parts of the variation come from meaningful patterns versus random noise.

To quickly get an idea, let's create a signal of 2 sine waves of different frequencies + noise.

We need to define a couple of parameters.

**dt = sampling interval** - The time between two samples, expressed in any unit you choose (seconds, minutes, hours).

If we want to express time in hours and we have data points every 10 mins, we get:

dt = 10 mins / 60 mins ≈ 0.1667 h

**f = Frequency** - Number of cycles per unit time (e.g., per minute or per hour).
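These two parameters also fix which frequencies the FFT can resolve. As a short sketch (the one-week record length is an assumption), the sampling interval sets the highest visible frequency (the Nyquist frequency, 1 / (2·dt)), and the record length sets the spacing between frequency bins.

```python
import numpy as np

dt = 10 / 60                 # sampling interval in hours
n_samples = 6 * 24 * 7       # one week of 10-minute samples (assumed length)

# Frequency grid of the one-sided FFT, in cycles per hour
freqs = np.fft.rfftfreq(n_samples, d=dt)

nyquist = 1 / (2 * dt)               # highest resolvable frequency
resolution = 1 / (n_samples * dt)    # spacing between neighbouring bins

# A frequency converts to a period as 1 / f, e.g. the daily cycle
daily_freq = 1 / 24                  # cycles per hour
daily_period_hours = 1 / daily_freq

print(nyquist, resolution, daily_period_hours)
```

For 10-minute data this means that nothing faster than 3 cycles per hour (a 20-minute period) can be seen in the spectrum.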

In [None]:
dt = 10/60          # 10 minutes in hours
t = np.arange(0, 480, dt)   # 480 hours (20 days)

freq1 = 0.5     # slow wave
freq2 = 1.5     # fast wave

signal = (
    3 * np.sin(2*np.pi*freq1*t) +
    1 * np.sin(2*np.pi*freq2*t)
)

noise = 1.5 * np.random.randn(len(t))

df_demo = pd.DataFrame({
    "demo_signal": signal + noise
})

In [None]:
plt.figure(figsize=(12, 5))
plt.plot(df_demo["demo_signal"][:100])

In [None]:
def compute_fft(df, col, dt=10/60):
    """
    Pure FFT-based power spectrum.

    Parameters
    ----------
    df  : DataFrame with the signal
    col : column name (string)
    dt  : sampling interval for the selected unit (e.g. 10/60 for 10 min in hours)

    Returns
    -------
    freqs    : array, frequencies in cycles per unit of dt (e.g. cycles/hour)
    power    : array, power at each frequency (|FFT|^2 / N)
    fft_vals : array, complex one-sided FFT of the mean-removed signal
    x_mean   : float, mean of the signal removed before the FFT
    """
    x = df[col].astype(float).interpolate().bfill().ffill().values
    x_mean = np.mean(x)         # DC offset (zero-frequency component - mean value)
    x = x - x_mean              # remove it before the FFT

    N = len(x)
    fft_vals = np.fft.rfft(x)   # one-sided FFT
    freqs = np.fft.rfftfreq(N, d=dt)

    power = (np.abs(fft_vals) ** 2) / N   # simple power spectrum

    return freqs, power, fft_vals, x_mean

In [None]:
freqs, power, fft_vals, x_mean = compute_fft(df_demo, "demo_signal", dt=dt)

In [None]:
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=freqs,
    y=power,
    mode="lines",
    line=dict(width=2),
    name="FFT Power"
))

fig.update_layout(
    title="FFT Power Spectrum",
    xaxis_title="Frequency (cycles per hour)",
    yaxis_title="Power",
    height=500,
    width=900
)

fig.show()

Here, we see that after transforming the signal to the frequency domain, we can clearly identify the 2 main frequencies in the signal, 0.5 and 1.5 cycles per hour.

If zooming in, we can see that the other frequencies carry only small noise fluctuations.

Best of all, we can take the inverse transform and reconstruct the original signal.

In [None]:
x_inverse = np.fft.irfft(fft_vals, n=df_demo.shape[0]) + x_mean   # inverse FFT (no filtering)

In [None]:
plt.figure(figsize=(12, 5))
plt.plot(df_demo['demo_signal'][:100], label="Original", linewidth=2)
plt.plot(x_inverse[:100], "o", markersize=10, alpha=0.6, label="Inverse FFT signal")
plt.title("Original vs Inverse FFT Reconstruction (No Filtering)")
plt.xlabel("Time")
plt.ylabel("Amplitude")
plt.legend()

This gives us the opportunity to denoise the signal.

We can simply keep only the frequencies (or the part of the power spectrum) that we want and cut the rest, which we consider noise.

The result can then be transformed back to the time domain, and ideally it should represent the signal with little to no noise.

### Denoising
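As a minimal sketch of this idea on the synthetic two-sine signal: the helper `fft_denoise` and its `keep_fraction` parameter are hypothetical names introduced here, and keeping the strongest 1% of frequency bins is just one possible rule (a fixed cutoff frequency would be another).

```python
import numpy as np

rng = np.random.default_rng(0)

# Same kind of demo signal as above: two sine waves plus Gaussian noise
dt = 10 / 60
t = np.arange(0, 480, dt)
clean = 3 * np.sin(2 * np.pi * 0.5 * t) + np.sin(2 * np.pi * 1.5 * t)
noisy = clean + 1.5 * rng.standard_normal(len(t))


def fft_denoise(x, keep_fraction=0.01):
    """Keep only the strongest frequency bins and invert back to the time domain."""
    mean = x.mean()
    spectrum = np.fft.rfft(x - mean)
    power = np.abs(spectrum) ** 2

    # Power of the weakest bin that is still kept
    n_keep = max(1, int(keep_fraction * len(power)))
    threshold = np.sort(power)[-n_keep]

    # Zero out every bin below the threshold, then invert
    filtered = np.where(power >= threshold, spectrum, 0)
    return np.fft.irfft(filtered, n=len(x)) + mean


denoised = fft_denoise(noisy)

rmse_noisy = np.sqrt(np.mean((noisy - clean) ** 2))
rmse_denoised = np.sqrt(np.mean((denoised - clean) ** 2))

print(f"RMSE before: {rmse_noisy:.2f}, after: {rmse_denoised:.2f}")
```

This works well here because the demo signal is exactly periodic. Real turbine signals are not, so the fraction of bins to keep would have to be chosen by inspecting the spectra computed below.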

In [None]:
def compute_fft_power(df, col, dt=10/60):
    """
    Pure FFT-based power spectrum.

    Parameters
    ----------
    df  : DataFrame with the signal
    col : column name (string)
    dt  : sampling interval for the selected unit (e.g. 10/60 for 10 min in hours)

    Returns
    -------
    freqs : array, frequencies in 1/dt units (e.g. cycles/hour)
    power : array, power at each frequency (|FFT|^2 / N)
    """
    x = df[col].astype(float).interpolate().bfill().ffill().values
    x = x - np.mean(x)          # remove DC

    N = len(x)
    fft_vals = np.fft.rfft(x)   # one-sided FFT
    freqs = np.fft.rfftfreq(N, d=dt)

    power = (np.abs(fft_vals) ** 2) / N   # simple power spectrum

    return freqs, power

In [None]:
dt = 10/60  # sampling interval: 10 minutes expressed in hours, so frequencies come out in cycles/hour
fft_dict = {}

for col in df.columns:
    freqs, power = compute_fft_power(df_no_zero, col, dt=dt)
    fft_dict[col] = {"freqs": freqs, "power": power}

In [None]:
col = "WindSpeed"   # pick the column you want

freqs = fft_dict[col]["freqs"]
power = fft_dict[col]["power"]

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=freqs,
    y=power,
    mode="lines",
    line=dict(width=2),
    name=col
))

fig.update_layout(
    title=f"FFT Power Spectrum: {col}",
    xaxis_title="Freq (cycles/hour)",
    yaxis_title="Power",
    height=500,
    width=900,
    showlegend=False
)

fig.show()

In [None]:
n_cols = 3
n_plots = len(df.columns)
n_rows = math.ceil(n_plots / n_cols)

fig = make_subplots(
    rows=n_rows,
    cols=n_cols,
    subplot_titles=df.columns,
    vertical_spacing=0.05,
    horizontal_spacing=0.05
)

# ---- Add each FFT plot ----
for idx, col in enumerate(df.columns):
    row = idx // n_cols + 1
    col_pos = idx % n_cols + 1
    
    fig.add_trace(
        go.Scatter(
            x=fft_dict[col]["freqs"],
            y=fft_dict[col]["power"],
            mode="lines",
            name=col
        ),
        row=row, col=col_pos
    )

    fig.update_xaxes(title_text="Freq (cycles/hour)", row=row, col=col_pos)
    fig.update_yaxes(title_text="Power", row=row, col=col_pos)


# ---- Set global layout ----
fig.update_layout(
    height=350 * n_rows,
    width=1200,
    showlegend=False,
    title_text="FFT Power Spectra for All Columns"
)

fig.show()

Now, let's create a function that removes all frequencies above a chosen cutoff. This is called a low-pass filter.
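
To put a cutoff value in context, it helps to know the range and spacing of the frequency axis. The sketch below is a rough illustration for 10-minute sampling; the series length `n_samples` is hypothetical (one week of data) and the cutoff of 1 cycle/hour is just an example value.

```python
import numpy as np

dt = 10 / 60          # sampling interval in hours (10 minutes)
n_samples = 1008      # hypothetical length: one week of 10-minute samples

freqs = np.fft.rfftfreq(n_samples, d=dt)

nyquist = 1 / (2 * dt)             # highest resolvable frequency
resolution = 1 / (n_samples * dt)  # spacing between frequency bins

cutoff = 1.0                       # example cutoff in cycles/hour
kept = freqs <= cutoff

print(f"Nyquist frequency: {nyquist:.1f} cycles/hour")
print(f"Frequency resolution: {resolution:.4f} cycles/hour")
print(f"Shortest period kept: {60 / cutoff:.0f} minutes")
print(f"Bins kept: {kept.sum()} of {freqs.size}")
```

With 10-minute sampling the axis ends at 3 cycles/hour, so a cutoff of 1 cycle/hour discards fluctuations with periods shorter than one hour.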

In [None]:
def fft_lowpass_filter(x, dt, cutoff):
    """
    Apply a low-pass FFT filter.

    Parameters
    ----------
    x : array-like
        Raw time-series signal.
    dt : float
        Sampling interval in chosen time units (e.g. 10/60 for cycles/hour).
    cutoff : float
        Cutoff frequency in same units as FFT output (e.g. cycles/hour).

    Returns
    -------
    x_filtered : np.array
        Filtered time-series (mean added back).
    freqs : np.array
        Frequency axis.
    fft_filtered : np.array
        Filtered FFT values.
    """

    # Copy into a float array so the NaN filling below does not modify the caller's data
    x_clean = np.array(x, dtype=float)

    # Fill NaNs if needed
    if np.isnan(x_clean).any():
        nans = np.isnan(x_clean)
        x_clean[nans] = np.interp(np.flatnonzero(nans),
                                  np.flatnonzero(~nans),
                                  x_clean[~nans])

    # Store original mean
    mean_val = np.mean(x_clean)

    # Detrend (remove mean for FFT)
    x_detrended = x_clean - mean_val

    N = len(x_detrended)

    # FFT
    fft_vals = np.fft.rfft(x_detrended)
    freqs = np.fft.rfftfreq(N, d=dt)

    # Low-pass mask
    mask = freqs <= cutoff
    fft_filtered = fft_vals * mask

    # Inverse FFT + ADD MEAN BACK
    x_filtered = np.fft.irfft(fft_filtered, n=N) + mean_val

    return x_filtered, freqs, fft_filtered

In [None]:
cutoff = 1
col = "WindSpeed"

x = df_no_zero[col].values

x_filt, freqs, fft_filt = fft_lowpass_filter(x, dt, cutoff)
df_no_zero.loc[:, col + "_filt"] = x_filt

In [None]:
plt.figure(figsize=(12, 5))
plt.plot(df_no_zero[col][:100], label='Raw Signal')
plt.plot(df_no_zero[col + "_filt"][:100], label='FFT filtered signal')
# plt.plot(df_no_zero[col][:100].rolling(3).mean(), label='Rolling mean signal')
plt.legend()

Let's compute the correlations before and after filtering.

In [None]:
df_no_zero['Power'].corr(df_no_zero['WindSpeed'])

In [None]:
df_no_zero['Power'].corr(df_no_zero['WindSpeed_filt'])

The correlation is higher after filtering, which is consistent with the removed high-frequency content being mostly noise.

This makes low-pass filtering a promising step for later data cleaning and for improving the ML model.
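
How much the correlation changes depends on the cutoff. The sketch below shows one possible way to scan several cutoffs. It runs on synthetic stand-in data (an assumption, not the turbine dataset) and uses a compact hypothetical helper `lowpass` that mirrors the idea of the filter above.

```python
import numpy as np

def lowpass(x, dt, cutoff):
    """Zero all FFT bins above `cutoff` and transform back (mean preserved)."""
    x = np.asarray(x, dtype=float)
    mean_val = x.mean()
    spectrum = np.fft.rfft(x - mean_val)
    freqs = np.fft.rfftfreq(x.size, d=dt)
    return np.fft.irfft(spectrum * (freqs <= cutoff), n=x.size) + mean_val

# Synthetic stand-ins (assumptions): a slow "true" driver, a noisy measurement
# of it, and a target that depends on the true driver.
rng = np.random.default_rng(2)
dt = 10 / 60
t = np.arange(2000) * dt
driver = 8 + 2 * np.sin(2 * np.pi * 0.05 * t) + np.sin(2 * np.pi * 0.3 * t)
measured = driver + 1.5 * rng.standard_normal(t.size)
target = 100 * driver + 20 * rng.standard_normal(t.size)

corr_raw = np.corrcoef(measured, target)[0, 1]
print(f"raw: corr = {corr_raw:.3f}")

# Correlation between the filtered measurement and the target for each cutoff.
corr_by_cutoff = {}
for cutoff in (0.5, 1.0, 2.0):
    filtered = lowpass(measured, dt, cutoff)
    corr_by_cutoff[cutoff] = np.corrcoef(filtered, target)[0, 1]
    print(f"cutoff = {cutoff} cycles/hour: corr = {corr_by_cutoff[cutoff]:.3f}")
```

In this synthetic setup the signal lives entirely at low frequencies, so lower cutoffs help. On real data a cutoff that is too low would also remove genuine dynamics, so the scan is a guide rather than a rule.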

# Conclusions

## Executive Summary

This report analyzes the multivariate behavior of a wind turbine using correlations, time-series exploration, PCA, anomaly analysis (Mahalanobis distance and model-based explainability with CatBoost + SHAP), and a noise study.

Despite heavy noise, three variables consistently dominate anomaly formation:

**Key Drivers of Abnormal Behavior**
- GenRPM  
- GenPh1Temp  
- Power  

These features jointly describe mechanical–thermal stress states inside the turbine. Environmental variables (WindSpeed, WindDirAbs/Rel, Pitch) do not explain anomaly formation, which suggests that faults are internal rather than wind-driven.

**As the next step, we proceed with modeling Power as the target in a regression model and then analyzing the deviation of the predicted values from the observed values.**

---

## Detailed Conclusions

1. Many parameters contain strong outliers, including negative values and long tails, which must be cleaned before analysis or modeling.

2. Large blocks of zeros represent full turbine shutdown periods; these distort correlations, PCA structure, and feature distributions, so they must be excluded from relationship analysis.

3. KDE and hexbin plots reveal dense point clouds at zero values that cannot be seen in scatter plots but heavily distort distribution shapes and correlations.

4. Noise is present across many signals, especially temperature and RPM, and visually overwhelms real structure; denoising significantly clarifies relationships.

5. After FFT smoothing, outlier removal, and zero removal, correlations become physically meaningful, showing clearer dependencies between Power, WindSpeed, GenRPM, and RotorRPM. Resampling the data to daily mean values confirms this.

6. Some variables show partially linear relationships with Power (e.g., WindSpeed, GenRPM), while others behave inconsistently (e.g., Pitch), suggesting differing predictive value.

7. Despite noise reduction using a median filter, several variables still show high variability even in their median values, indicating multiple operational regimes and non-stationarity.

8. PCA reveals outlying values driven primarily by zeros rather than true anomalies, and the first two components do not explain enough variance for reliable representation.

9. PCA applied to grouped data still fails to separate anomalies clearly; high PCA scores correspond mostly to high operational values rather than true abnormal behavior.

10. Time-series inspection shows that ups and downs of highly correlated features follow Power closely, but no clear anomaly pattern appears before shutdowns using raw or grouped data.

11. Mahalanobis distance clearly rises before the downtime, showing true multivariate deviation that includes covariance structure, unlike PCA or raw correlations.

12. Across four independent approaches (feature contributions, smoothed relationships, Mahalanobis decomposition, CatBoost+SHAP) applied to Mahalanobis distance, the same three features dominate anomaly formation: Power, GenRPM, and GenPh1Temp.

13. These three features are correlated, so their dominance likely represents a joint mechanical–thermal stress regime rather than three independent root causes.

14. Outlier and noise removal is essential for reliable modeling because raw data distort the physical relationships between variables; only after cleaning do true dependencies become visible and usable for predictive modeling.

15. Low-pass filtering (FFT cutoff) can be used to isolate the low-frequency structure from high-frequency noise, revealing clearer physical relationships between Power, RPM, and temperature variables and improving the interpretability of correlations and trends.