# Model Training for Charpy Energy Prediction Without PCA

This notebook trains and compares several regression models on the original Charpy energy dataset, without applying Principal Component Analysis (PCA) for dimensionality reduction. The goal is to evaluate model performance directly on the untransformed input features and to determine whether skipping PCA yields better predictive accuracy. Charpy energy (in joules) is the energy a specimen absorbs during fracture under impact; it measures the toughness of the welded material.

The candidate models are Linear Regression, Ridge, Lasso, ElasticNet, Decision Tree, Random Forest, Gradient Boosting, and SVR, each tuned with GridSearchCV using 5-fold cross-validation. In the cells below, a hyperparameter grid is defined for Gradient Boosting, but the model is not included in the `models` dictionary, so it is not trained in this run. Performance is reported as R², MAE, and RMSE on the held-out test set, and the train, cross-validation, and test R² values are compared to check for overfitting and to identify the most robust model.
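As a quick reference for the three metrics, here is a minimal sketch on made-up values. The arrays `y_true` and `y_pred` are hypothetical and are not taken from the weld dataset; only the metric calls mirror what the notebook uses later.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical Charpy energies in joules (illustrative values only)
y_true = np.array([100.0, 50.0, 80.0, 120.0])
y_pred = np.array([90.0, 60.0, 85.0, 110.0])

mae = mean_absolute_error(y_true, y_pred)            # mean of |error|, in joules
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalises large errors more, in joules
r2 = r2_score(y_true, y_pred)                        # share of variance explained, unitless

print(f"MAE: {mae:.4f} | RMSE: {rmse:.4f} | R²: {r2:.4f}")
```

MAE and RMSE are in the same unit as the target (joules), so they can be read directly as typical prediction errors, while R² is relative to predicting the mean.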

In [54]:
import os

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

In [55]:
os.makedirs('trained_models', exist_ok=True)

In [56]:
df = pd.read_csv('../../welddatabase/welddb_new.csv')
print(f"Dataset shape: {df.shape}")

Dataset shape: (1652, 52)


In [57]:
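# Preliminary split on the raw data (rows with a missing target are still present here).
# X, y and the split are redefined after cleaning and imputation in the following cells.
X = df.drop('Charpy_Energy_J', axis=1)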
y = df['Charpy_Energy_J']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

Training samples: 1321
Testing samples: 331


In [58]:
# --- 1️⃣ Drop rows whose target is missing ---
df_clean = df.dropna(subset=['Charpy_Energy_J'])
print(f"✅ Rows kept after dropping NaN targets: {len(df_clean)} / {len(df)}")

# --- 2️⃣ Drop rows with 50% or more missing values ---
df_filtered = df_clean[df_clean.isnull().mean(axis=1) < 0.5]
print(f"✅ Rows remaining after filtering: {len(df_filtered)}")

# --- 3️⃣ Separate X and y ---
X = df_filtered.drop(columns=['Charpy_Energy_J'])
y = df_filtered['Charpy_Energy_J']

# --- 4️⃣ Select the numeric columns ---
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
print(f"📊 Numeric columns: {len(numeric_cols)}")

# --- 5️⃣ Drop numeric columns that have no observed values ---
# KNNImputer discards all-NaN columns by default. Removing them explicitly keeps
# the column names aligned with the imputed array (truncating the name list
# would attach the wrong names to the columns that follow a dropped one).
numeric_cols = [c for c in numeric_cols if X[c].notna().any()]

# --- 6️⃣ Apply KNNImputer to the numeric columns only ---
imputer = KNNImputer(n_neighbors=5, weights='distance')
X_num_imputed = imputer.fit_transform(X[numeric_cols])

# --- 7️⃣ Rebuild a DataFrame with matching column names and index ---
X_imputed = pd.DataFrame(X_num_imputed, columns=numeric_cols, index=X.index)

print(f"✅ Imputation succeeded. Final shape: {X_imputed.shape}")
print(f"Total missing values remaining: {X_imputed.isna().sum().sum()}")
print(f"NaN values in y: {y.isna().sum()}")


✅ Rows kept after dropping NaN targets: 879 / 1652
✅ Rows remaining after filtering: 879
📊 Numeric columns: 51
✅ Imputation succeeded. Final shape: (879, 50)
Total missing values remaining: 0
NaN values in y: 0
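Note that the imputer above is fitted on all rows before the train/test split, so the imputed training values can borrow information from rows that later land in the test set. A minimal sketch of a leakage-free alternative is shown below, assuming the split is done first and the imputer is fitted on the training rows only. The data is synthetic and the column names `f1` to `f4` are hypothetical stand-ins for the weld features.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weld features (hypothetical columns, not the real data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])
X_demo = X_demo.mask(rng.random(X_demo.shape) < 0.1)  # blank out roughly 10% of the cells
y_demo = pd.Series(rng.normal(size=100))

# Split first, so the test rows never influence the imputer
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

imputer_demo = KNNImputer(n_neighbors=5, weights="distance")
X_tr_imp = pd.DataFrame(imputer_demo.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_imp = pd.DataFrame(imputer_demo.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_imp.shape, X_te_imp.shape)
```

Whether this changes the reported scores for the weld data is not something the notebook measures; the sketch only shows the ordering of the steps.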


In [60]:
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.2, random_state=42
)

print(f"📊 Train set: {X_train.shape[0]} samples, Test set: {X_test.shape[0]} samples")


📊 Train set: 703 samples, Test set: 176 samples


# Defining the Models and Their Hyperparameter Grids

In [65]:
# --- Define all regression models ---
models = {
    'LinearRegression': LinearRegression(),
    'RidgeRegression': Ridge(),
    'LassoRegression': Lasso(),
    'ElasticNetRegression': ElasticNet(),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=42),
    'RandomForestRegressor': RandomForestRegressor(random_state=42, n_jobs=-1),
    'SVR': SVR(),
}

# --- Define the hyperparameter grids for each model ---
param_grids = {
    'LinearRegression': {},  # No hyperparameters to tune
    'RidgeRegression': {'alpha': [0.1, 1.0, 10.0, 100.0]},
    'LassoRegression': {'alpha': [0.1, 1.0, 10.0, 100.0]},
    'ElasticNetRegression': {'alpha': [0.1, 1.0, 10.0], 'l1_ratio': [0.2, 0.5, 0.8]},
    'DecisionTreeRegressor': {'max_depth': [5, 10, 20, None], 'min_samples_split': [2, 10, 20]},
    'RandomForestRegressor': {'n_estimators': [100, 200], 'max_depth': [10, 20, None]},
    # Unused in this run: 'GradientBoostingRegressor' is not a key in `models`
    'GradientBoostingRegressor': {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1, 0.2], 'max_depth': [3, 5, 10]},
    'SVR': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
}
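The grids above tune SVR and the linear models directly on unscaled features. SVR and regularised linear models are sensitive to feature scale, so one common option is to wrap the estimator in a `Pipeline` with a `StandardScaler`; grid keys then take a `<step>__<param>` prefix. The sketch below is an assumption about how that could look, not the notebook's method, and it runs on synthetic data with deliberately mismatched feature scales.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data with very different feature scales (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5)) * np.array([1.0, 10.0, 100.0, 1000.0, 0.1])
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

# The scaler is refitted inside every CV fold, so no fold sees statistics from its validation part
svr_pipeline = Pipeline([("scaler", StandardScaler()), ("svr", SVR())])

# Same values as the SVR grid above, with the pipeline step name as prefix
svr_grid = {"svr__C": [0.1, 1, 10], "svr__kernel": ["linear", "rbf"]}

search = GridSearchCV(svr_pipeline, svr_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```

A pipeline like this can be dropped into the `models` dictionary in place of the bare estimator, provided the matching grid uses the prefixed parameter names.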


# Model Training and Hyperparameter Search

In [66]:
# --- 2️⃣ Model training and evaluation ---
results = []

for name, model in models.items():
    print(f"\n🚀 Training {name}...")

    grid_search = GridSearchCV(
        model,
        param_grids[name],
        cv=5,
        scoring='r2',
        n_jobs=-1,
        verbose=0
    )

    # --- Train on the imputed dataset ---
    grid_search.fit(X_train, y_train)

    # --- Predictions on training and testing sets ---
    y_train_pred = grid_search.predict(X_train)
    y_test_pred = grid_search.predict(X_test)

    # --- Compute metrics ---
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

    # --- Store model results ---
    results.append({
        'Model': name,
        'Best_Params': str(grid_search.best_params_),
        'CV_R2': grid_search.best_score_,
        'Train_R2': train_r2,
        'Test_R2': test_r2,
        'MAE_Test': test_mae,
        'RMSE_Test': test_rmse
    })

    # --- Save the best model with clear naming ---
    model_filename = f"trained_models/{name}_without_PCA_model.pkl"
    joblib.dump(grid_search.best_estimator_, model_filename)

    # --- Display quick summary ---
    print(f"✅ Best Parameters: {grid_search.best_params_}")
    print(f"CV R²: {grid_search.best_score_:.4f} | Train R²: {train_r2:.4f} | Test R²: {test_r2:.4f}")
    print(f"MAE: {test_mae:.4f} | RMSE: {test_rmse:.4f}")
    print(f"📁 Model saved as: {model_filename}")

print("\n🏁 Training completed for all models (WITHOUT PCA)!")

# --- 3️⃣ Create and display organized results table ---
results_df = pd.DataFrame(results).sort_values(by='Test_R2', ascending=False).reset_index(drop=True)

print("\n📊 MODEL PERFORMANCE SUMMARY (WITHOUT PCA)")
print("=" * 90)
display(results_df.style.set_properties(**{
    'background-color': '#1e1e1e',
    'color': 'white',
    'border-color': 'gray',
    'text-align': 'center'
}).format({
    'CV_R2': "{:.4f}",
    'Train_R2': "{:.4f}",
    'Test_R2': "{:.4f}",
    'MAE_Test': "{:.4f}",
    'RMSE_Test': "{:.4f}"
}))



🚀 Training LinearRegression...
✅ Best Parameters: {}
CV R²: 0.5168 | Train R²: 0.6324 | Test R²: 0.5957
MAE: 26.3282 | RMSE: 33.5351
📁 Model saved as: trained_models/LinearRegression_without_PCA_model.pkl

🚀 Training RidgeRegression...
✅ Best Parameters: {'alpha': 0.1}
CV R²: 0.5175 | Train R²: 0.6181 | Test R²: 0.6003
MAE: 26.4690 | RMSE: 33.3459
📁 Model saved as: trained_models/RidgeRegression_without_PCA_model.pkl

🚀 Training LassoRegression...
✅ Best Parameters: {'alpha': 0.1}
CV R²: 0.5020 | Train R²: 0.5926 | Test R²: 0.6061
MAE: 26.5304 | RMSE: 33.1038
📁 Model saved as: trained_models/LassoRegression_without_PCA_model.pkl

🚀 Training ElasticNetRegression...


✅ Best Parameters: {'alpha': 0.1, 'l1_ratio': 0.8}
CV R²: 0.4837 | Train R²: 0.5614 | Test R²: 0.5895
MAE: 27.2128 | RMSE: 33.7907
📁 Model saved as: trained_models/ElasticNetRegression_without_PCA_model.pkl

🚀 Training DecisionTreeRegressor...
✅ Best Parameters: {'max_depth': 10, 'min_samples_split': 20}
CV R²: 0.5852 | Train R²: 0.8635 | Test R²: 0.5512
MAE: 22.2042 | RMSE: 35.3319
📁 Model saved as: trained_models/DecisionTreeRegressor_without_PCA_model.pkl

🚀 Training RandomForestRegressor...
✅ Best Parameters: {'max_depth': 20, 'n_estimators': 200}
CV R²: 0.7584 | Train R²: 0.9699 | Test R²: 0.8283
MAE: 15.5922 | RMSE: 21.8531
📁 Model saved as: trained_models/RandomForestRegressor_without_PCA_model.pkl

🚀 Training SVR...


KeyboardInterrupt: 
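
Because the run was interrupted during SVR, the summary table at the end of the cell was never produced. The overfitting comparison can still be sketched from the six results printed in the log above. The values below are copied from that log; the column name `Gap_Train_Test` is introduced here for illustration.

```python
import pandas as pd

# R² values copied from the training log above (SVR did not finish)
logged = pd.DataFrame(
    [
        ("LinearRegression", 0.5168, 0.6324, 0.5957),
        ("RidgeRegression", 0.5175, 0.6181, 0.6003),
        ("LassoRegression", 0.5020, 0.5926, 0.6061),
        ("ElasticNetRegression", 0.4837, 0.5614, 0.5895),
        ("DecisionTreeRegressor", 0.5852, 0.8635, 0.5512),
        ("RandomForestRegressor", 0.7584, 0.9699, 0.8283),
    ],
    columns=["Model", "CV_R2", "Train_R2", "Test_R2"],
)

# A large positive gap means the model fits the training set much better than unseen data
logged["Gap_Train_Test"] = logged["Train_R2"] - logged["Test_R2"]
logged = logged.sort_values("Gap_Train_Test", ascending=False).reset_index(drop=True)
print(logged.round(4))
```

On these logged numbers, the decision tree shows the largest train/test gap (about 0.31), while the random forest has the highest test R² (0.8283) with a gap of about 0.14. The four linear models have gaps under 0.04 but plateau near a test R² of 0.6.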