# PCA Analysis for Charpy Temperature Prediction

## Why PCA for Charpy Temperature?

**Problem**: The dataset contains 50+ features including chemical composition, welding parameters, and process conditions. Many of these features are highly correlated.

**Why PCA is necessary**:

1. **Multicollinearity**: Chemical elements like C, Mn, Si are often correlated due to steel composition requirements. This creates redundant information and unstable model coefficients.

2. **Curse of Dimensionality**: With 50+ features and a limited number of samples, models overfit easily. PCA reduces the feature count while retaining 90-95% of the variance, which can improve generalization.

3. **Noise Reduction**: Minor measurement variations contribute little to prediction but add noise. Dropping low-variance components keeps the systematic variation and discards much of that noise, assuming the noise lies mainly in low-variance directions.

4. **Computational Efficiency**: Training models on 15-20 principal components is faster than 50+ original features, especially for GridSearchCV with cross-validation.
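
The multicollinearity point can be checked directly before running PCA. The following is a minimal sketch on synthetic stand-in data, not the weld database: the column names, coefficients, and the helper `correlated_pairs` are illustrative assumptions. It lists feature pairs whose absolute Pearson correlation exceeds a threshold.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic stand-in data (hypothetical values): Mn and Si track C closely,
# while Heat_Input is generated independently of composition.
rng = np.random.default_rng(0)
c = rng.normal(0.08, 0.02, size=300)
demo = pd.DataFrame({
    "C": c,
    "Mn": 15 * c + rng.normal(0, 0.02, size=300),
    "Si": 4 * c + rng.normal(0, 0.01, size=300),
    "Heat_Input": rng.normal(1.5, 0.4, size=300),
})

high_corr = correlated_pairs(demo, threshold=0.8)
print(high_corr)
```

On the real feature matrix, a long list of such pairs would support the case for PCA; a short list would suggest that less of the dimensionality is redundant.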

**About Charpy Temperature**: Charpy Temperature (°C) is the temperature at which the Charpy impact test is performed. Because toughness depends strongly on temperature, it is a key parameter when assessing materials, particularly for applications in cold environments.

In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import os

In [None]:
os.makedirs('pca_model', exist_ok=True)
os.makedirs('data', exist_ok=True)
os.makedirs('figures', exist_ok=True)

In [None]:
df = pd.read_csv('../welddatabase/welddb_pca.csv')
print(f"Prepared dataset shape: {df.shape}")

In [None]:
# Check Charpy Temperature data availability
charpy_temp_available = df['Charpy_Temp_C'].notna().sum()
print(f"Charpy Temperature data available: {charpy_temp_available} samples ({charpy_temp_available/len(df)*100:.1f}%)")

In [None]:
target_columns = ['Yield_Strength_MPa', 'UTS_MPa', 'Elongation_%', 
                  'Reduction_Area_%', 'Charpy_Temp_C', 'Charpy_Energy_J',
                  'Hardness_kg_mm2', 'FATT_50%', 'Primary_Ferrite_%',
                  'Ferrite_2nd_Phase_%', 'Acicular_Ferrite_%', 'Martensite_%',
                  'Ferrite_Carbide_%']

df_charpy_temp = df[df['Charpy_Temp_C'].notna()].copy()
y = df_charpy_temp['Charpy_Temp_C'].copy()
X = df_charpy_temp.drop(columns=target_columns)

print(f"Samples with Charpy Temperature: {len(df_charpy_temp)}")
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print("\nCharpy Temperature statistics:")
print(f"  Mean: {y.mean():.2f}°C")
print(f"  Std: {y.std():.2f}°C")
print(f"  Range: [{y.min():.2f}°C, {y.max():.2f}°C]")

## Feature Imputation

**Why KNN Imputer?**

KNN Imputer is a reasonable choice for this dataset because:

1. **Non-parametric approach**: Makes no assumptions about the shape of the feature distributions. Like most imputers, it is most reliable when values are missing completely at random (MCAR) or at random (MAR); under MNAR, imputed values can be biased.
2. **Local similarity**: Imputes based on similar samples, preserving relationships between features
3. **Handles correlations**: Leverages multivariate structure naturally present in welding data
4. **Distance-based weighting**: With `weights='distance'`, closer neighbors contribute more, reducing the influence of dissimilar samples
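
Because `KNNImputer` selects neighbors by a NaN-aware Euclidean distance, features measured in large units can dominate the neighbor search. The sketch below standardizes before imputing. It uses synthetic stand-in data, not the weld database, and the variable names are illustrative. Whether `welddb_pca.csv` is already scaled is not stated here, so treat this as an optional step to consider rather than a required change.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values): features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(0.08, 0.02, size=200),   # small-magnitude feature
    rng.normal(1500, 300, size=200),    # large-magnitude feature
    rng.normal(0, 1, size=200),
])

# Remove about 10% of entries at random; column 0 stays fully observed
# so that no row is entirely missing.
missing = rng.random(X_demo.shape) < 0.10
missing[:, 0] = False
X_missing = X_demo.copy()
X_missing[missing] = np.nan

# StandardScaler ignores NaNs when fitting and passes them through,
# so it can run before the imputer.
scale_then_impute = make_pipeline(
    StandardScaler(),
    KNNImputer(n_neighbors=5, weights="distance"),
)
X_filled = scale_then_impute.fit_transform(X_missing)

print(f"Missing before: {np.isnan(X_missing).sum()}, after: {np.isnan(X_filled).sum()}")
```

Note that the output of this pipeline is in standardized units, which is also a common input convention for PCA.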

In [None]:
imputer = KNNImputer(n_neighbors=5, weights='distance')
X_imputed = imputer.fit_transform(X)

print(f"Missing values before imputation: {X.isnull().sum().sum()}")
print(f"Missing values after imputation: {pd.DataFrame(X_imputed).isnull().sum().sum()}")
print(f"Shape after imputation: {X_imputed.shape}")

In [None]:
joblib.dump(imputer, 'pca_model/imputer.pkl')
print("Imputer saved to: pca_model/imputer.pkl")

## PCA Transformation

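PCA is fit with `n_components=0.90`, which makes scikit-learn keep the smallest number of components whose cumulative explained variance reaches 90%. This reduces dimensionality while preserving most of the variance in the features.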
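
To illustrate what a fractional `n_components` does, the sketch below fits a full PCA on synthetic stand-in data (not the weld database), picks the component count by hand from the cumulative explained variance, and compares it with scikit-learn's own selection. The data are standardized first because PCA is variance-based and would otherwise be dominated by large-scale features; whether that step is needed for the prepared dataset is an assumption to verify.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 20 features driven by 4 latent factors.
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 4))
mixing = rng.normal(size=(4, 20))
X_demo = latent @ mixing + 0.3 * rng.normal(size=(500, 20))

X_scaled = StandardScaler().fit_transform(X_demo)

# Fit all components once, then find the smallest k whose cumulative
# explained-variance ratio exceeds the 90% target.
full_pca = PCA(svd_solver="full").fit(X_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
k_manual = int(np.searchsorted(cumulative, 0.90, side="right") + 1)

# Passing the fraction directly lets scikit-learn make the same selection.
auto_pca = PCA(n_components=0.90, svd_solver="full").fit(X_scaled)

print(f"Manual selection: {k_manual} components")
print(f"PCA(n_components=0.90): {auto_pca.n_components_} components")
```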

In [None]:
pca = PCA(n_components=0.90, random_state=42)
X_pca = pca.fit_transform(X_imputed)

print(f"Original features: {X_imputed.shape[1]}")
print(f"PCA components: {X_pca.shape[1]}")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.4f}")
print(f"Dimensionality reduction: {(1 - X_pca.shape[1]/X_imputed.shape[1])*100:.1f}%")

In [None]:
joblib.dump(pca, 'pca_model/pca_model.pkl')
print("PCA model saved to: pca_model/pca_model.pkl")
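
The saved imputer and PCA objects are meant to be reused on new samples with `transform` only, never refit. The sketch below shows that round trip on synthetic stand-in data in a temporary directory; the `demo_*` names and the train/new split are illustrative assumptions. If a downstream model is evaluated on a held-out split, fitting both transformers on the training rows only, as done here, avoids leaking information from the held-out rows.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some missing entries (NOT the weld database).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10))
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

X_train, X_new = train_test_split(X_demo, test_size=0.2, random_state=42)

# Fit both transformers on the training rows only.
demo_imputer = KNNImputer(n_neighbors=5, weights="distance")
demo_pca = PCA(n_components=0.90, svd_solver="full")
train_pcs = demo_pca.fit_transform(demo_imputer.fit_transform(X_train))

# Persist and reload, mirroring the joblib.dump calls in this notebook.
with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(demo_imputer, Path(tmp) / "imputer.pkl")
    joblib.dump(demo_pca, Path(tmp) / "pca_model.pkl")
    loaded_imputer = joblib.load(Path(tmp) / "imputer.pkl")
    loaded_pca = joblib.load(Path(tmp) / "pca_model.pkl")

# Apply the fitted transformers to unseen rows: transform only, no refitting.
new_pcs = loaded_pca.transform(loaded_imputer.transform(X_new))
print(f"Train PCs: {train_pcs.shape}, new PCs: {new_pcs.shape}")
```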

## Save Transformed Data

In [None]:
pc_columns = [f'PC{i+1}' for i in range(X_pca.shape[1])]
df_pca = pd.DataFrame(X_pca, columns=pc_columns, index=df_charpy_temp.index)
df_pca['Charpy_Temp_C'] = y.values

df_pca.to_csv('data/welddb_pca_charpy_temp.csv', index=False)
print("Transformed data saved to: data/welddb_pca_charpy_temp.csv")
print(f"Shape: {df_pca.shape}")
df_pca.head()

## Visualization 1: Explained Variance

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Cumulative variance
cumsum = np.cumsum(pca.explained_variance_ratio_)
axes[0].plot(range(1, len(cumsum)+1), cumsum, 'bo-', linewidth=2)
axes[0].axhline(y=0.90, color='r', linestyle='--', linewidth=2, label='90% threshold')
axes[0].set_xlabel('Number of Components', fontsize=12)
axes[0].set_ylabel('Cumulative Explained Variance', fontsize=12)
axes[0].set_title('Cumulative Variance Explained', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Individual variance
n_show = min(15, len(pca.explained_variance_ratio_))
axes[1].bar(range(1, n_show+1), pca.explained_variance_ratio_[:n_show], 
           alpha=0.7, color='steelblue', edgecolor='black')
axes[1].set_xlabel('Component', fontsize=12)
axes[1].set_ylabel('Explained Variance Ratio', fontsize=12)
axes[1].set_title(f'Individual Component Variance (Top {n_show})', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('figures/explained_variance.png', dpi=300, bbox_inches='tight')
plt.show()

## Visualization 2: PCA Scatter Plot and Target Correlations

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 2D scatter
scatter = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', 
                         alpha=0.6, edgecolors='k', linewidth=0.5)
axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})', fontsize=12)
axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})', fontsize=12)
axes[0].set_title('First Two Principal Components', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)
cbar = plt.colorbar(scatter, ax=axes[0])
cbar.set_label('Charpy Temp (°C)', fontsize=11)

# Correlation with target
correlations = df_pca.drop('Charpy_Temp_C', axis=1).corrwith(df_pca['Charpy_Temp_C']).abs().sort_values(ascending=False)
n_top = min(10, len(correlations))
axes[1].barh(range(n_top), correlations.head(n_top).values, color='coral', alpha=0.7, edgecolor='black')
axes[1].set_yticks(range(n_top))
axes[1].set_yticklabels(correlations.head(n_top).index)
axes[1].set_xlabel('Absolute Correlation', fontsize=12)
axes[1].set_title(f'Top {n_top} PC Correlations with Charpy Temp', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='x')
axes[1].invert_yaxis()

plt.tight_layout()
plt.savefig('figures/pca_scatter_correlation.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"\nStrongest correlation: {correlations.iloc[0]:.3f} ({correlations.index[0]})")