<h2>Capstone Two: Pre-processing and Training Data Development</h2>

The goal of this step is to prepare a clean, structured dataset for model training. This includes encoding categorical variables, standardizing numerical features, and splitting the data into training and testing sets. Most of the cleaning was performed in the previous notebooks. Alloys with low plate counts (500 or fewer) are removed here. The final dataframe contains only the predictor columns and the target column.

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import joblib
from sklearn.compose import ColumnTransformer

In [4]:
df = pd.read_parquet('../data/eda_data.parquet')

Identify alloys whose plate counts are too low.

In [6]:
df['alloy'].value_counts()

alloy
XR7AT0K50    129044
XVKKT7H51    100586
XR6AT0K61     83875
XR7AT0K75     47386
XR7AT1K75     23302
XVKKT7H55     16569
XVKKT7H52     16157
XR2AT1K24     14350
XR7AT4K75     13436
XR2AT0K24      9597
XVKKT6H69      8581
XVKKT2H12      7773
XVKKT6H61      6553
XR2AT2K19      3629
XVKKT7H77      2184
XVKKT2H14      1165
XVKKT7H71       807
XVKKT7H90       608
XR6AT1K01       321
XR2AT0K14       186
XVKKT7H80       124
XVKKT7H91        64
XR7AT0K17        20
XVKKT2H81        17
XVKKT2H71         9
XR2AT0K50         5
Name: count, dtype: int64

In [7]:
# Count plates per alloy and drop every alloy with 500 or fewer plates
alloy_counts = df['alloy'].value_counts()
alloys_to_drop = alloy_counts[alloy_counts <= 500].index
df.drop(df[df['alloy'].isin(alloys_to_drop)].index, inplace=True)
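
The count-based filter can be checked on a small example. The sketch below uses made-up alloy labels and a hypothetical threshold of 2 (the names `toy`, `min_count`, and `filtered` are not from the notebook); it expresses the same rule as a boolean mask instead of an in-place drop.

```python
import pandas as pd

# Toy data: labels and counts are invented for illustration only.
toy = pd.DataFrame({"alloy": ["A"] * 5 + ["B"] * 2 + ["C"] * 1})

min_count = 2  # hypothetical threshold; the notebook uses 500

counts = toy["alloy"].value_counts()
# Keep alloys with strictly more rows than the threshold,
# which mirrors dropping those with counts <= threshold.
keep = counts[counts > min_count].index
filtered = toy[toy["alloy"].isin(keep)].copy()

print(filtered["alloy"].value_counts())
```

Here only alloy "A" survives, because "B" sits exactly at the threshold and "C" is below it.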

In [8]:
df.dtypes

alloy                           object
job                             object
lot                             object
piece                           object
gauge                          float64
width                          float64
length                         float64
delay                          float64
prestretch_force                 int64
stretch_target                 float64
stretch_actual                 float64
max_force                        int64
step_duration          timedelta64[ns]
pre_stretch_length             float64
post_stretch_length            float64
yield_point                      int64
discharge               datetime64[ns]
step_start_datetime     datetime64[ns]
key                             object
unitized_PF                    float64
flow_stress                      int64
max_force_Mlbs                 float64
max_force_mlbs                 float64
dtype: object

Keep only the predictor columns (values known before a plate is stretched) and the target column ('flow_stress').

In [10]:
# .copy() avoids a SettingWithCopyWarning when columns are reassigned later
df = df[['alloy','gauge','width','length','stretch_target','flow_stress']].copy()

In [11]:
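# Round each dimension to the nearest fixed increment:
# gauge to 0.1, width to 0.25, length to 0.5
df['gauge'] = ((df['gauge'] / 0.1).round() * 0.1).round(1)
df['width'] = ((df['width'] / 0.25).round() * 0.25).round(2)
df['length'] = ((df['length'] / 0.5).round() * 0.5).round(2)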
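
The three lines above share one pattern: divide by the increment, round to an integer, multiply back, then round away floating-point noise. A minimal sketch of that pattern as a reusable function follows; `round_to_step` and the sample values are hypothetical and not part of the notebook.

```python
import pandas as pd

def round_to_step(values: pd.Series, step: float, decimals: int) -> pd.Series:
    """Round each value to the nearest multiple of `step`."""
    return ((values / step).round() * step).round(decimals)

# Made-up sample values, rounded to the nearest 0.25
sample = pd.Series([3.96, 3.84, 56.30, 56.13])
rounded = round_to_step(sample, step=0.25, decimals=2)

print(rounded.tolist())  # [4.0, 3.75, 56.25, 56.25]
```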

In [12]:
df.to_csv('../data/prepped_data.csv')

In [13]:
df

            alloy  gauge  width  length  stretch_target  flow_stress
0       XR7AT0K50    4.0  56.25   356.0            2.00        37668
1       XR7AT0K50    4.0  56.25   358.0            2.00        38721
2       XR7AT0K50    3.8  56.50   343.0            2.00        38154
3       XR7AT0K50    3.8  56.50   344.0            2.00        38175
4       XR7AT0K50    4.0  56.00   355.0            2.00        39459
...           ...    ...    ...     ...             ...          ...
486568  XR7AT4K75    3.0  55.75   240.0            2.00        35936
486569  XR7AT4K75    4.0  55.75   340.5            2.00        35639
486570  XVKKT7H51    2.8  69.50   296.5            2.25        41976
486571  XVKKT7H51    2.8  69.50   294.0            2.25        42350

Separate the predictors from the target, then split them into training and testing sets.

In [15]:
X = df.drop(columns='flow_stress')  
y = df['flow_stress']

Stratify the split on alloy so that each alloy appears in the training and testing sets in the same proportion.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=X['alloy'])
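
One way to confirm that stratification behaves as intended is to compare the alloy shares in the two splits. The sketch below does this on toy data (the frame `toy`, the target values, and the variable names are made up); the same comparison could be run on `X_train` and `X_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 rows of alloy "A" and 20 rows of alloy "B" (invented values).
toy = pd.DataFrame({"alloy": ["A"] * 80 + ["B"] * 20, "gauge": range(100)})
target = pd.Series(range(100), name="flow_stress")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, target, test_size=0.2, random_state=42, stratify=toy["alloy"]
)

# Share of each alloy in the training and testing splits
train_share = X_tr["alloy"].value_counts(normalize=True).sort_index()
test_share = X_te["alloy"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

With a stratified split both columns show the same 80/20 mix; without `stratify`, a rare alloy could end up over- or under-represented in the test set.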

In [18]:
# Define features
numeric_features = ['gauge', 'width', 'length', 'stretch_target']
categorical_features = ['alloy']

Define a preprocessing transformer so the same scaling and encoding can be reapplied when checking predictions in the modeling steps.

In [20]:
# Define preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

Fit the preprocessor on the training set only, then use it to transform both the training and testing sets.

In [22]:
preprocessor.fit(X_train)

X_train_transformed = preprocessor.transform(X_train)
X_test_transformed = preprocessor.transform(X_test)

In [23]:
# Save preprocessor
joblib.dump(preprocessor, 'preprocessor.pkl')

['preprocessor.pkl']
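
A saved transformer can be reloaded later with `joblib.load` and applied to new rows without refitting. The following is a minimal sketch of that round trip, assuming a lone `StandardScaler` on toy data and a temporary directory in place of the notebook's fitted `preprocessor` and its working directory.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the fitted preprocessor (invented values).
train = pd.DataFrame({"gauge": [1.0, 2.0, 3.0]})
scaler = StandardScaler().fit(train)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessor.pkl")
    joblib.dump(scaler, path)      # save the fitted object
    restored = joblib.load(path)   # reload it, e.g. in a later notebook

# The restored object carries the fitted statistics and needs no refit.
new_rows = pd.DataFrame({"gauge": [2.0]})
scaled = restored.transform(new_rows)
print(scaled)
```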

Reset the indices, convert the transformed features to DataFrames, and save the features and targets to new files.

In [25]:
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

In [26]:
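# Convert the transformed matrices to DataFrames with named columns.
# Note: .toarray() assumes the ColumnTransformer returned a sparse matrix.
feature_names = preprocessor.get_feature_names_out()

X_train = pd.DataFrame(X_train_transformed.toarray(), columns=feature_names)
X_test = pd.DataFrame(X_test_transformed.toarray(), columns=feature_names)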

X_train.to_parquet('../data/X_train_data.parquet')
X_test.to_parquet('../data/X_test_data.parquet')
y_train.to_frame().to_parquet('../data/y_train_data.parquet')
y_test.to_frame().to_parquet('../data/y_test_data.parquet')