<div style="background-color:#e6f0ff;padding:16px;border-radius:8px;border:1px solid #c6ddff;color:#1a1a33;font-family:Arial, sans-serif;line-height:1.6">
  <h1 style="margin:0;padding:0;color:#002b80">Data Preparation</h1>

  <p style="margin-top:8px">
    This notebook prepares the <strong>welding quality</strong> dataset for <strong>semi-supervised learning</strong> experiments.
    The dataset includes several mechanical property targets, each representing a different quality aspect of the weld.
  </p>

  <p>
    Combining all properties into a single target isn't feasible: few rows contain <strong>all targets simultaneously</strong>,
    so requiring every label would severely reduce the usable data. Instead, we <strong>treat each target independently</strong> and build
    a separate prediction task for each property.
  </p>

  <p>
    For every target, the data is split into:
  </p>
  <ul>
    <li><strong>Training set</strong>: Labeled + unlabeled samples (semi-supervised setup)</li>
    <li><strong>Validation set</strong>: Labeled samples (for tuning)</li>
    <li><strong>Test set</strong>: Labeled samples (for evaluation)</li>
  </ul>

  <p style="margin-top:4px">
    Labeled data is divided <strong>60/20/20</strong> for train/val/test, and all unlabeled samples are added to the training set.
    Each target has its own directory with separate <code style="color:#004080;background:#f5f9ff;padding:2px 4px;border-radius:4px">X</code> (features)
    and <code style="color:#004080;background:#f5f9ff;padding:2px 4px;border-radius:4px">y</code> (labels) files.
  </p>
</div>
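To make the argument for independent targets concrete, here is a minimal sketch on a made-up toy frame (the values are invented for illustration and are not taken from the weld dataset). It compares the number of labels available per target with the number of rows where every target is present at once.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weld data (hypothetical values): three targets,
# each missing on different rows.
toy = pd.DataFrame({
    "yield_strength_MPa": [400.0, np.nan, 510.0, np.nan, 450.0],
    "uts_MPa":            [520.0, 610.0, np.nan, np.nan, 560.0],
    "elongation_pct":     [np.nan, 25.0, 30.0, np.nan, 28.0],
})

# Labels available when each target is treated independently.
per_target = toy.notna().sum()

# Rows usable if all targets had to be present simultaneously.
all_present = int(toy.notna().all(axis=1).sum())

print(per_target.to_dict())
print("Rows with every target:", all_present)
```

On this toy frame each target keeps three labeled rows, while only one row has all three labels. The same comparison can be run on the real DataFrame once it is loaded.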


In [35]:
import sys
assert sys.version_info >= (3, 6)  # f-strings used below require Python 3.6+
import os
import sklearn
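# Compare numerically; a plain string comparison can misorder version numbers.
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 20)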
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline 
import matplotlib as mpl
import matplotlib.pyplot as plt
# Style options for plots.
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Ignore useless warnings (see SciPy issue #5998).
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

### Loading data
We load the cleaned data locally. If you don't have the CSV file, run the notebook `01.dataset_cleaning.ipynb` first.

In [36]:
from sklearn.utils import shuffle

# Load the dataset
weld_df = pd.read_csv("../../data/clean_weld_quality_dataset.csv")

# Shuffle the data to avoid bias when deleting labels
weld_df = shuffle(weld_df, random_state=1)

Let's take a look at the columns:

In [40]:
weld_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1652 entries, 161 to 1061
Data columns (total 55 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   carbon_wt_pct             1652 non-null   float64
 1   silicon_wt_pct            1652 non-null   float64
 2   manganese_wt_pct          1652 non-null   float64
 3   sulfur_wt_pct             1641 non-null   float64
 4   phosphorus_wt_pct         1642 non-null   float64
 5   nickel_wt_pct             697 non-null    float64
 6   chromium_wt_pct           784 non-null    float64
 7   molybdenum_wt_pct         791 non-null    float64
 8   vanadium_wt_pct           620 non-null    float64
 9   copper_wt_pct             564 non-null    float64
 10  cobalt_wt_pct             108 non-null    float64
 11  tungsten_wt_pct           63 non-null     float64
 12  oxygen_ppm                1256 non-null   float64
 13  titanium_ppm              865 non-null    float64
 14  nitrogen_pp

### Defining target columns

In [41]:
TARGET_COLS = [
    "yield_strength_MPa",  # Stress at which plastic deformation begins; measures material strength.
    "uts_MPa",             # Ultimate tensile strength; maximum stress material can withstand before fracture.
    "elongation_pct",      # Percent elongation; measure of ductility (total strain before fracture).
    "reduction_area_pct",  # Percent reduction in cross-sectional area after fracture; another ductility measure.
    "charpy_temp_C",       # Test temperature for Charpy impact test; defines testing condition.
    "charpy_toughness_J",  # Charpy impact energy absorbed; indicates toughness and resistance to brittle fracture.
]

# Potential target columns dropped from the dataset because they have too many missing values
TODROP = [
    "hardness_kgmm2",      # Surface hardness; correlates with strength and wear resistance.
    "fatt50_C"             # 50% Fracture Appearance Transition Temperature; temperature where 50% brittle fracture occurs.
]


In [42]:
df = weld_df.drop(columns=TODROP)
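One way to check that a column has too many missing values to serve as a target is to look at its missing fraction. The sketch below uses a hypothetical helper, `missing_fraction`, on a small invented frame. It is not part of the original notebook, and the numbers do not reflect the real dataset.

```python
import numpy as np
import pandas as pd

def missing_fraction(frame, columns):
    """Return the fraction of missing values per column, largest first."""
    return frame[columns].isna().mean().sort_values(ascending=False)

# Invented toy data: one mostly complete target and one mostly missing one.
toy = pd.DataFrame({
    "uts_MPa":        [520.0, 610.0, np.nan, 580.0],
    "hardness_kgmm2": [np.nan, np.nan, np.nan, 210.0],
})

fractions = missing_fraction(toy, ["uts_MPa", "hardness_kgmm2"])
print(fractions)
```

On the real data, the equivalent call would be `missing_fraction(weld_df, TARGET_COLS + TODROP)`, which ranks all candidate targets by how sparse they are.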

### Semi-Supervised Train/Val/Test Splits for Each Target

For each target, we'll create:
- **Train set**: Combination of labeled + unlabeled data (semi-supervised)
- **Validation set**: Only labeled data (for tuning)
- **Test set**: Only labeled data (for final evaluation)

Strategy:
1. Split labeled data into train_labeled (60%), val (20%), test (20%)
2. Add ALL unlabeled data to training set
3. Save X and y separately for each target

In [43]:
from sklearn.model_selection import train_test_split
import os

# Create directory for splits
splits_dir = "../../data/data_splits"
os.makedirs(splits_dir, exist_ok=True)

# Define features (all columns except targets)
feature_cols = [col for col in df.columns if col not in TARGET_COLS]

print("=" * 80)
print("CREATING SEMI-SUPERVISED SPLITS FOR EACH TARGET")
print("=" * 80)
print(f"\nTotal samples: {len(df):,}")
print(f"Features: {len(feature_cols)}")
print(f"Targets: {len(TARGET_COLS)}")

for target in TARGET_COLS:
    print(f"\n{'='*80}")
    print(f"--> Processing target: {target}")
    print(f"{'='*80}")
    
    # Separate labeled and unlabeled data for this target
    labeled_mask = df[target].notna()
    unlabeled_mask = df[target].isna()
    
    df_labeled = df[labeled_mask].copy()
    df_unlabeled = df[unlabeled_mask].copy()
    
    n_labeled = len(df_labeled)
    n_unlabeled = len(df_unlabeled)
    
    print(f"\n  Labeled samples:   {n_labeled:>6,} ({n_labeled/len(df)*100:>5.1f}%)")
    print(f"  Unlabeled samples: {n_unlabeled:>6,} ({n_unlabeled/len(df)*100:>5.1f}%)")
    
    # Split labeled data: 60% train, 20% val, 20% test
    X_labeled = df_labeled[feature_cols]
    y_labeled = df_labeled[[target]]
    
    # First split: 60% train, 40% temp (val+test)
    X_train_labeled, X_temp, y_train_labeled, y_temp = train_test_split(
        X_labeled, y_labeled, test_size=0.4, random_state=42
    )
    
    # Second split: split temp into 50% val, 50% test (each 20% of total labeled)
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.5, random_state=42
    )
    
    # Prepare unlabeled data for training
    X_train_unlabeled = df_unlabeled[feature_cols]
    y_train_unlabeled = df_unlabeled[[target]]  # Will contain NaN values
    
    # Combine labeled and unlabeled for training set
    X_train = pd.concat([X_train_labeled, X_train_unlabeled], axis=0, ignore_index=True)
    y_train = pd.concat([y_train_labeled, y_train_unlabeled], axis=0, ignore_index=True)
    
    print(f"\n  Split sizes:")
    print(f"    Train (labeled):     {len(X_train_labeled):>6,} samples")
    print(f"    Train (unlabeled):   {len(X_train_unlabeled):>6,} samples")
    print(f"    Train (total):       {len(X_train):>6,} samples ({len(X_train_labeled)/len(X_train)*100:.1f}% labeled)")
    print(f"    Validation:          {len(X_val):>6,} samples (100% labeled)")
    print(f"    Test:                {len(X_test):>6,} samples (100% labeled)")
    
    # Create target-specific directory
    target_dir = os.path.join(splits_dir, target)
    os.makedirs(target_dir, exist_ok=True)
    
    # Save splits to CSV
    X_train.to_csv(os.path.join(target_dir, "X_train.csv"), index=False)
    y_train.to_csv(os.path.join(target_dir, "y_train.csv"), index=False)
    
    X_val.to_csv(os.path.join(target_dir, "X_val.csv"), index=False)
    y_val.to_csv(os.path.join(target_dir, "y_val.csv"), index=False)
    
    X_test.to_csv(os.path.join(target_dir, "X_test.csv"), index=False)
    y_test.to_csv(os.path.join(target_dir, "y_test.csv"), index=False)
    
    print(f"\n  --> Saved to: {target_dir}/")
    print(f"    - X_train.csv, y_train.csv (with NaN for unlabeled)")
    print(f"    - X_val.csv, y_val.csv")
    print(f"    - X_test.csv, y_test.csv")

print(f"\n{'='*80}")
print(f"===> ALL SPLITS CREATED SUCCESSFULLY")
print(f"{'='*80}")
print(f"\nOutput directory: {splits_dir}/")
print(f"  Each target has its own subfolder with 6 files:")
print(f"  - X_train.csv, y_train.csv (semi-supervised)")
print(f"  - X_val.csv, y_val.csv (fully labeled)")
print(f"  - X_test.csv, y_test.csv (fully labeled)")

CREATING SEMI-SUPERVISED SPLITS FOR EACH TARGET

Total samples: 1,652
Features: 47
Targets: 6

--> Processing target: yield_strength_MPa

  Labeled samples:      780 ( 47.2%)
  Unlabeled samples:    872 ( 52.8%)

  Split sizes:
    Train (labeled):        468 samples
    Train (unlabeled):      872 samples
    Train (total):        1,340 samples (34.9% labeled)
    Validation:             156 samples (100% labeled)
    Test:                   156 samples (100% labeled)

  --> Saved to: ../../data/data_splits\yield_strength_MPa/
    - X_train.csv, y_train.csv (with NaN for unlabeled)
    - X_val.csv, y_val.csv
    - X_test.csv, y_test.csv

--> Processing target: uts_MPa

  Labeled samples:      738 ( 44.7%)
  Unlabeled samples:    914 ( 55.3%)

  Split sizes:
    Train (labeled):        442 samples
    Train (unlabeled):      914 samples
    Train (total):        1,356 samples (32.6% labeled)
    Validation:             148 samples (100% labeled)
    Test:                   148 samples 
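The split sizes reported above can be checked by hand. The sketch below assumes that `train_test_split`, when given a float `test_size` and no `train_size`, rounds the test share up and assigns the remainder to the training side. The helper name `expected_split_sizes` is hypothetical.

```python
import math

def expected_split_sizes(n_labeled, n_unlabeled):
    """Reproduce the 60/20/20 split arithmetic for one target."""
    # First split: 40% of the labeled rows go to the temporary val+test pool.
    n_temp = math.ceil(0.4 * n_labeled)
    n_train_labeled = n_labeled - n_temp
    # Second split: half of the pool becomes the test set, the rest validation.
    n_test = math.ceil(0.5 * n_temp)
    n_val = n_temp - n_test
    # All unlabeled rows are added to the training set.
    n_train_total = n_train_labeled + n_unlabeled
    return {
        "train_labeled": n_train_labeled,
        "train_total": n_train_total,
        "val": n_val,
        "test": n_test,
        "pct_labeled": round(100 * n_train_labeled / n_train_total, 1),
    }

print(expected_split_sizes(780, 872))  # yield_strength_MPa
print(expected_split_sizes(738, 914))  # uts_MPa
```

For `yield_strength_MPa` this gives 468 labeled training rows out of 1,340 (34.9% labeled) and 156 rows each for validation and test. For `uts_MPa` it gives 442 out of 1,356 (32.6% labeled) and 148 validation rows. Both agree with the output printed by the notebook.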

### Example: Loading Splits for a Specific Target

In [44]:
# Example: Load splits for yield_strength_MPa
target_name = "yield_strength_MPa"
target_dir = os.path.join("../../data/data_splits", target_name)

print("=" * 80)
print(f"LOADING SPLITS FOR: {target_name}")
print("=" * 80)

# Load training data (semi-supervised)
X_train = pd.read_csv(os.path.join(target_dir, "X_train.csv"))
y_train = pd.read_csv(os.path.join(target_dir, "y_train.csv"))

# Load validation data (fully labeled)
X_val = pd.read_csv(os.path.join(target_dir, "X_val.csv"))
y_val = pd.read_csv(os.path.join(target_dir, "y_val.csv"))

# Load test data (fully labeled)
X_test = pd.read_csv(os.path.join(target_dir, "X_test.csv"))
y_test = pd.read_csv(os.path.join(target_dir, "y_test.csv"))

print(f"\n--> Training set:")
print(f"  X_train shape: {X_train.shape}")
print(f"  y_train shape: {y_train.shape}")
print(f"  Labeled samples: {y_train[target_name].notna().sum():,}")
print(f"  Unlabeled samples: {y_train[target_name].isna().sum():,}")

print(f"\n--> Validation set:")
print(f"  X_val shape: {X_val.shape}")
print(f"  y_val shape: {y_val.shape}")
print(f"  Missing values: {y_val[target_name].isna().sum()}")

print(f"\n--> Test set:")
print(f"  X_test shape: {X_test.shape}")
print(f"  y_test shape: {y_test.shape}")
print(f"  Missing values: {y_test[target_name].isna().sum()}")

print("\n" + "=" * 80)

# Show distribution of target values
print(f"\nTarget distribution (labeled training data):")
print(y_train[target_name].describe())

print(f"\nFirst few rows of y_train (showing labeled and unlabeled):")
print(y_train.head(10))

LOADING SPLITS FOR: yield_strength_MPa

--> Training set:
  X_train shape: (1340, 47)
  y_train shape: (1340, 1)
  Labeled samples: 468
  Unlabeled samples: 872

--> Validation set:
  X_val shape: (156, 47)
  y_val shape: (156, 1)
  Missing values: 0

--> Test set:
  X_test shape: (156, 47)
  y_test shape: (156, 1)
  Missing values: 0


Target distribution (labeled training data):
count    468.000000
mean     507.790812
std       93.097066
min      315.000000
25%      443.000000
50%      494.000000
75%      554.250000
max      920.000000
Name: yield_strength_MPa, dtype: float64

First few rows of y_train (showing labeled and unlabeled):
   yield_strength_MPa
0               417.0
1               509.0
2               427.0
3               430.0
4               594.0
5               555.0
6               502.0
7               620.0
8               546.0
9               533.0
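The first rows of `y_train` are all labeled because the labeled rows are concatenated before the unlabeled ones when the training set is built. Downstream code will usually need to separate the two groups again. The helper below, `split_labeled_unlabeled`, is a hypothetical sketch of one way to do that, shown on invented toy data rather than the saved files.

```python
import numpy as np
import pandas as pd

def split_labeled_unlabeled(X, y, target):
    """Split a semi-supervised training set into labeled and unlabeled parts."""
    # Rows with a non-missing target are the labeled ones.
    mask = y[target].notna().to_numpy()
    X_labeled = X.loc[mask]
    y_labeled = y.loc[mask, target]
    X_unlabeled = X.loc[~mask]
    return X_labeled, y_labeled, X_unlabeled

# Invented toy data in the same layout as X_train.csv / y_train.csv.
X_toy = pd.DataFrame({"carbon_wt_pct": [0.05, 0.07, 0.06, 0.08]})
y_toy = pd.DataFrame({"yield_strength_MPa": [417.0, 509.0, np.nan, np.nan]})

X_lab, y_lab, X_unlab = split_labeled_unlabeled(X_toy, y_toy, "yield_strength_MPa")
print(len(X_lab), "labeled rows,", len(X_unlab), "unlabeled rows")
```

With the real files, the call would be `split_labeled_unlabeled(X_train, y_train, target_name)`. The labeled part can feed a supervised baseline, and the unlabeled part can be passed to whichever semi-supervised method is being evaluated.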
