<a href="https://colab.research.google.com/github/ikechukwuUE/steel-plate-defect/blob/master/steel_defect_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# STEEL DEFECT PREDICTION - Kaggle

## Table of Contents

1. **Introduction**
2. **Data Preparation**
3. **Model Construction**
    - **Traditional Machine Learning Models**
    - **Neural Networks**
4. **Ensemble and Tuning**
5. **Execution**
6. **Conclusion**


## Introduction

### Project Overview
- **Objective:** Develop a machine learning model that predicts the probability of each defect type on steel plates, using both the competition dataset and the original Steel Plates Faults dataset from UCI.
- **Methodology:** Apply extensive feature engineering, using Principal Component Analysis (PCA) to reduce the dimensionality of the dataset, and train neural networks alongside traditional models for defect prediction.
- **Expected Outcome:** A CSV file with predicted probabilities for each defect category for each id in the test set, evaluated using the area under the ROC curve (AUC) for each category.
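
As a minimal sketch of the metric described above: the snippet computes one ROC AUC per defect category and then averages them. The unweighted mean, the helper name `per_category_auc`, and the toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_category_auc(y_true, y_pred, names):
    """Return {category: AUC} for each column of a multi-label problem."""
    return {name: roc_auc_score(y_true[:, i], y_pred[:, i])
            for i, name in enumerate(names)}

# Toy labels and predicted probabilities for two of the defect categories
names = ["Pastry", "Stains"]
y_true = np.array([[0, 1], [1, 0], [0, 1], [1, 0]])
y_pred = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.4], [0.6, 0.7]])

aucs = per_category_auc(y_true, y_pred, names)
mean_auc = float(np.mean(list(aucs.values())))  # assumed: unweighted mean over categories
print(aucs, mean_auc)
```

Scoring each category separately makes it easy to see which defect types drive the overall score.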

### Version Details
- **Version Number:** 1.0
- **Configuration Parameters:** Detailed in the Configuration Parameters section.

## Imports

In [3]:
%%time

# Installing select libraries
!pip install -q lightgbm==4.3.0 --force-reinstall
!pip install --force-reinstall scikit-learn
!pip install catboost
!pip install colorama
!pip install category_encoders

# General library imports
from gc import collect
from warnings import filterwarnings
filterwarnings('ignore')
from IPython.display import display_html, clear_output, Image, Markdown
clear_output()

import xgboost as xgb
import lightgbm as lgb
import catboost as cb
import sklearn as sk
import pandas as pd
print(f"---> XGBoost = {xgb.__version__} | LightGBM = {lgb.__version__} | Catboost = {cb.__version__}")
print(f"---> Sklearn = {sk.__version__}| Pandas = {pd.__version__}\n\n")
collect()

# Data manipulation and visualization
from copy import deepcopy
import numpy as np
import re
from scipy.stats import mode, kstest, normaltest, shapiro, anderson, jarque_bera
from collections import Counter
from itertools import product
from colorama import Fore, Style, init
init(autoreset=True)
import joblib
import os

from tqdm.notebook import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap as LCM
%matplotlib inline

from pprint import pprint
from functools import partial

# Model and pipeline specifics
from category_encoders import OrdinalEncoder, OneHotEncoder
from sklearn.preprocessing import RobustScaler, MinMaxScaler, StandardScaler, FunctionTransformer as FT, PowerTransformer
from sklearn.impute import SimpleImputer as SI
from sklearn.model_selection import RepeatedStratifiedKFold as RSKF, StratifiedKFold as SKF, StratifiedGroupKFold as SGKF, KFold, RepeatedKFold as RKF, cross_val_score, cross_val_predict
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import mutual_info_classif, RFE
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer

# ML Model training
from sklearn.metrics import accuracy_score, roc_auc_score, make_scorer
from xgboost import DMatrix, XGBClassifier as XGBC
from lightgbm import log_evaluation, early_stopping, LGBMClassifier as LGBMC
from catboost import CatBoostClassifier as CBC, Pool
from sklearn.ensemble import HistGradientBoostingClassifier as HGBC, RandomForestClassifier as RFC

# Neural networks
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Ensemble and tuning
import optuna
from optuna import Trial, trial, create_study
from optuna.pruners import HyperbandPruner
from optuna.samplers import TPESampler, CmaEsSampler
# Call set_verbosity; assigning to it would overwrite the function and leave logging unchanged
optuna.logging.set_verbosity(optuna.logging.ERROR)

---> XGBoost = 2.0.3 | LightGBM = 4.3.0 | Catboost = 1.2.3
---> Sklearn = 1.4.1.post1| Pandas = 1.5.3


CPU times: user 6.01 s, sys: 1.2 s, total: 7.21 s
Wall time: 1min 17s


In [4]:
# Setting rc parameters in seaborn for plots and graphs
sns.set({"axes.facecolor"       : "#ffffff",
         "figure.facecolor"     : "#ffffff",
         "axes.edgecolor"       : "#000000",
         "grid.color"           : "#ffffff",
         "font.family"          : ['Cambria'],
         "axes.labelcolor"      : "#000000",
         "xtick.color"          : "#000000",
         "ytick.color"          : "#000000",
         "grid.linewidth"       : 0.75,
         "grid.linestyle"       : "--",
         "axes.titlecolor"      : '#0099e6',
         'axes.titlesize'       : 8.5,
         'axes.labelweight'     : "bold",
         'legend.fontsize'      : 7.0,
         'legend.title_fontsize': 7.0,
         'font.size'            : 7.5,
         'xtick.labelsize'      : 7.5,
         'ytick.labelsize'      : 7.5,
        })


In [5]:
# Color printing
def PrintColor(text: str, color = Fore.BLUE, style = Style.BRIGHT):
    "Print text (typically an f-string) in the given colorama color and style"
    print(style + color + text + Style.RESET_ALL)

# Make sklearn transformers return pandas DataFrames
from sklearn import set_config
set_config(transform_output = "pandas")
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

print()
collect()





0

In [6]:
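# PCA, roc_curve and auc are used by the helpers below
from sklearn.decomposition import PCA
from sklearn.metrics import roc_curve, auc

# Function to load and preprocess data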
def load_and_preprocess_data(train_path, test_path):
    # Load datasets
    train_data = pd.read_csv(train_path)
    test_data = pd.read_csv(test_path)

    # Preprocessing steps (e.g., handling missing values, encoding categorical variables)
    # Example: train_data = train_data.fillna(train_data.mean())
    # Example: test_data = test_data.fillna(test_data.mean())

    return train_data, test_data

# Function to split data into features and target
def split_data(data, target_column):
    X = data.drop(target_column, axis=1)
    y = data[target_column]
    return X, y

# Function to apply PCA
def apply_pca(X, n_components, pca=None):
    # Fit a new PCA when none is supplied; otherwise reuse the fitted one
    # so that train and test are projected onto the same components
    if pca is None:
        pca = PCA(n_components=n_components).fit(X)
    X_pca = pca.transform(X)
    return X_pca, pca

# Function to train and evaluate a model
def train_and_evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy

# Function to plot ROC curve
def plot_roc_curve(y_test, y_pred_proba):
    # Column 1 of predict_proba holds the positive-class probability
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba[:, 1])
    roc_auc = auc(fpr, tpr)
    plt.figure()
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()

# Example usage
# train_data, test_data = load_and_preprocess_data('path/to/train_data.csv', 'path/to/test_data.csv')
# X_train, y_train = split_data(train_data, 'target_column')
# X_test, y_test = split_data(test_data, 'target_column')
# X_train_pca, pca = apply_pca(X_train, n_components=10)
# X_test_pca, _ = apply_pca(X_test, n_components=10, pca=pca)  # reuse the PCA fitted on train
# model = RFC()  # RandomForestClassifier is imported above as RFC
# accuracy = train_and_evaluate_model(model, X_train_pca, y_train, X_test_pca, y_test)
# print(f"Accuracy: {accuracy}")
# y_pred_proba = model.predict_proba(X_test_pca)
# plot_roc_curve(y_test, y_pred_proba)

## Data Preparation


### Plan

#### Data Exploration
- **Objective:** Become familiar with the datasets and conduct initial exploratory data analysis (EDA) to understand their structure and distributions.
- **Tasks:**
    - Load and inspect the datasets.
    - Perform basic statistical analysis.
    - Visualize data distributions.
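
The exploration tasks above could be sketched as follows. This is only an illustration on a tiny stand-in frame whose values come from a few training rows; the variable names (`summary`, `missing`, `prevalence`) are hypothetical.

```python
import pandas as pd

# Tiny stand-in frame; the column names mirror a few columns of the competition data
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "Pixels_Areas": [16, 433, 11388, 210],
    "Steel_Plate_Thickness": [50, 80, 40, 40],
    "Stains": [1, 0, 0, 0],
    "K_Scatch": [0, 0, 1, 1],
})
targets = ["Stains", "K_Scatch"]
features = [c for c in df.columns if c not in targets + ["id"]]

summary = df[features].describe().T   # count, mean, std and quartiles per feature
missing = df.isna().sum()             # number of missing values per column
prevalence = df[targets].mean()       # share of positive rows per defect category

print(summary[["mean", "min", "max"]])
print(missing[missing > 0])
print(prevalence)
```

For the visual part, `df[features].hist()` gives a quick first look at each feature's distribution.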


In [9]:
# load and inspect the datasets
df_train, df_test = load_and_preprocess_data(
    train_path='https://raw.githubusercontent.com/ikechukwuUE/steel-plate-defect/master/data/train.csv',
    test_path='https://raw.githubusercontent.com/ikechukwuUE/steel-plate-defect/master/data/test.csv',
)

In [14]:
display("train dataset", df_train)
print("")
display("test dataset", df_test)

'train dataset'

Unnamed: 0,id,X_Minimum,X_Maximum,Y_Minimum,Y_Maximum,Pixels_Areas,X_Perimeter,Y_Perimeter,Sum_of_Luminosity,Minimum_of_Luminosity,Maximum_of_Luminosity,Length_of_Conveyer,TypeOfSteel_A300,TypeOfSteel_A400,Steel_Plate_Thickness,Edges_Index,Empty_Index,Square_Index,Outside_X_Index,Edges_X_Index,Edges_Y_Index,Outside_Global_Index,LogOfAreas,Log_X_Index,Log_Y_Index,Orientation_Index,Luminosity_Index,SigmoidOfAreas,Pastry,Z_Scratch,K_Scatch,Stains,Dirtiness,Bumps,Other_Faults
0,0,584,590,909972,909977,16,8,5,2274,113,140,1358,0,1,50,0.7393,0.4000,0.5000,0.0059,1.0000,1.0000,0.0,1.2041,0.9031,0.6990,-0.5000,-0.0104,0.1417,0,0,0,1,0,0,0
1,1,808,816,728350,728372,433,20,54,44478,70,111,1687,1,0,80,0.7772,0.2878,0.2581,0.0044,0.2500,1.0000,1.0,2.6365,0.7782,1.7324,0.7419,-0.2997,0.9491,0,0,0,0,0,0,1
2,2,39,192,2212076,2212144,11388,705,420,1311391,29,141,1400,0,1,40,0.0557,0.5282,0.9895,0.1077,0.2363,0.3857,0.0,4.0564,2.1790,2.2095,-0.0105,-0.0944,1.0000,0,0,1,0,0,0,0
3,3,781,789,3353146,3353173,210,16,29,3202,114,134,1387,0,1,40,0.7202,0.3333,0.3333,0.0044,0.3750,0.9310,1.0,2.3222,0.7782,1.4314,0.6667,-0.0402,0.4025,0,0,1,0,0,0,0
4,4,1540,1560,618457,618502,521,72,67,48231,82,111,1692,0,1,300,0.1211,0.5347,0.0842,0.0192,0.2105,0.9861,1.0,2.7694,1.4150,1.8808,0.9158,-0.2455,0.9998,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19214,19214,749,757,143210,143219,17,4,4,2193,122,140,1360,0,0,50,0.8950,0.1500,0.8571,0.0044,1.0000,0.8000,0.0,1.2305,0.7782,0.6021,-0.1429,0.0044,0.2901,0,0,0,1,0,0,0
19215,19215,723,735,2488529,2488541,231,17,26,27135,104,133,1652,1,0,70,0.9243,0.3254,0.2778,0.0065,0.7333,0.9216,1.0,2.3636,1.0414,1.4150,0.7222,-0.0989,0.5378,0,0,0,0,0,0,1
19216,19216,6,31,1578055,1578129,780,114,98,71112,41,94,1358,0,1,200,0.0148,0.4331,0.2281,0.0199,0.1862,0.9554,1.0,2.8921,1.4314,1.8692,0.7719,-0.4283,0.9997,1,0,0,0,0,0,0
19217,19217,9,18,1713172,1713184,126,13,26,14808,88,132,1692,1,0,60,0.0192,0.2361,0.0390,0.0068,0.7692,1.0000,1.0,2.1004,1.0414,1.4150,0.9610,-0.1162,0.3509,0,0,0,0,0,0,1





'test dataset'

Unnamed: 0,id,X_Minimum,X_Maximum,Y_Minimum,Y_Maximum,Pixels_Areas,X_Perimeter,Y_Perimeter,Sum_of_Luminosity,Minimum_of_Luminosity,Maximum_of_Luminosity,Length_of_Conveyer,TypeOfSteel_A300,TypeOfSteel_A400,Steel_Plate_Thickness,Edges_Index,Empty_Index,Square_Index,Outside_X_Index,Edges_X_Index,Edges_Y_Index,Outside_Global_Index,LogOfAreas,Log_X_Index,Log_Y_Index,Orientation_Index,Luminosity_Index,SigmoidOfAreas
0,19219,1015,1033,3826564,3826588,659,23,46,62357,67,127,1656,0,1,150,0.3877,0.4896,0.3273,0.0095,0.5652,1.0000,1.0,2.8410,1.1139,1.6628,0.6727,-0.2261,0.9172
1,19220,1257,1271,419960,419973,370,26,28,39293,92,132,1354,0,1,40,0.1629,0.4136,0.0938,0.0047,0.2414,1.0000,1.0,2.5682,0.9031,1.4472,0.9063,-0.1453,0.9104
2,19221,1358,1372,117715,117724,289,36,32,29386,101,134,1360,0,1,40,0.0609,0.6234,0.4762,0.0155,0.6000,0.7500,0.0,2.4609,1.3222,1.3222,-0.5238,-0.0435,0.6514
3,19222,158,168,232415,232440,80,10,11,8586,107,140,1690,1,0,100,0.4439,0.3333,0.8182,0.0037,0.8000,1.0000,1.0,1.9031,0.6990,1.0414,0.1818,-0.0738,0.2051
4,19223,559,592,544375,544389,140,19,15,15524,103,134,1688,1,0,60,0.8191,0.2619,0.4286,0.0158,0.8421,0.5333,0.0,2.1461,1.3222,1.1461,-0.5714,-0.0894,0.4170
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12809,32028,1101,1116,447943,447992,313,32,37,21603,79,103,1353,0,1,70,0.2853,0.3050,0.2339,0.0126,0.4063,0.9194,1.0,2.4955,1.2305,1.6335,0.7661,-0.3109,0.8894
12810,32029,1289,1306,3149494,3149542,59,9,18,5249,113,141,1362,0,1,40,0.0106,0.2778,0.2778,0.0052,0.7778,1.0000,1.0,1.7708,0.8451,1.2553,0.7222,-0.0448,0.1954
12811,32030,41,210,1587535,1587191,16584,796,522,1858162,24,143,1400,0,1,40,0.0557,0.5644,0.9371,0.1236,0.2199,0.4097,0.0,4.2525,2.2504,2.2672,-0.0629,-0.0801,1.0000
12812,32031,1329,1340,702237,702267,386,43,34,36875,66,124,1364,0,1,40,0.0133,0.1814,0.1539,0.0095,0.2407,1.0000,1.0,2.5866,1.1139,1.5911,0.8461,-0.2629,0.7844


In [15]:
# Columns present in the training set but absent from the test set
absent_columns = set(df_train.columns) - set(df_test.columns)

absent_columns

{'Bumps',
 'Dirtiness',
 'Edges_Index',
 'Edges_X_Index',
 'Edges_Y_Index',
 'Empty_Index',
 'K_Scatch',
 'Length_of_Conveyer',
 'LogOfAreas',
 'Log_X_Index',
 'Log_Y_Index',
 'Luminosity_Index',
 'Maximum_of_Luminosity',
 'Minimum_of_Luminosity',
 'Orientation_Index',
 'Other_Faults',
 'Outside_Global_Index',
 'Outside_X_Index',
 'Pastry',
 'Pixels_Areas',
 'SigmoidOfAreas',
 'Square_Index',
 'Stains',
 'Steel_Plate_Thickness',
 'Sum_of_Luminosity',
 'TypeOfSteel_A300',
 'TypeOfSteel_A400',
 'X_Maximum',
 'X_Minimum',
 'X_Perimeter',
 'Y_Maximum',
 'Y_Minimum',
 'Y_Perimeter',
 'Z_Scratch',
 'id'}

#### Data Integration
- **Objective:** Assess the reliability of the data, consider ethical implications, and plan for data integration.
- **Tasks:**
    - Merge datasets if necessary.
    - Handle missing values.
    - Ensure data consistency.
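
One possible shape for these integration steps is sketched below, assuming the competition and original (UCI) tables share feature and target columns. The two small frames, the `source` column, the median imputation, and the de-duplication rule are all illustrative assumptions, not choices taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the competition and original (UCI) tables
competition = pd.DataFrame({
    "id": [0, 1, 2],
    "Pixels_Areas": [16.0, 433.0, None],
    "Stains": [1, 0, 0],
})
original = pd.DataFrame({
    "Pixels_Areas": [210.0, 16.0],
    "Stains": [0, 1],
})

# Merge: tag each row with its origin, then stack on the shared columns only
competition["source"] = "competition"
original["source"] = "original"
shared = [c for c in competition.columns if c in original.columns]
combined = pd.concat([competition[shared], original[shared]], ignore_index=True)

# Handle missing values: fill numeric gaps with the column median
numeric = combined.select_dtypes("number").columns
combined[numeric] = combined[numeric].fillna(combined[numeric].median())

# Ensure consistency: drop rows whose feature and target values repeat an earlier row
value_cols = [c for c in shared if c != "source"]
combined = combined.drop_duplicates(subset=value_cols, ignore_index=True)
print(combined)
```

Keeping a `source` column makes it possible to weight or filter the original rows later, for example when validating only on competition data.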

In [None]:
# Example code for data integration
# Merge datasets if necessary
# Handle missing values
# Ensure data consistency

#### Feature Engineering
- **Objective:** Perform extensive feature engineering using PCA to reduce the dimensionality of the dataset.
- **Tasks:**
    - Select relevant features.
    - Apply PCA to reduce dimensionality.
    - Evaluate the impact of PCA on model performance.

In [None]:
# Example code for feature engineering

# Select relevant features
# Apply PCA to reduce dimensionality
# Evaluate the impact of PCA on model performance

## Model Construction

### Construct

#### Traditional Machine Learning Models
- **Objective:** Train and evaluate traditional machine learning models.
- **Tasks:**
    - Select appropriate machine learning algorithms.
    - Define model architecture.
    - Train the model.
    - Evaluate the model.

In [None]:
# Example code for training and evaluating traditional machine learning models

#### Neural Networks
- **Objective:** Train and evaluate neural network models for defect prediction using Keras and TensorFlow.
- **Tasks:**
    - Define neural network architecture.
    - Train the neural network model.
    - Evaluate the neural network model.


In [None]:
# Define neural network architecture
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100)) # Input dimension should match the number of features after PCA
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid')) # Assuming binary classification

In [None]:
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# Train the neural network model
# model.fit(X_train, y_train, epochs=10, batch_size=32)

In [None]:
# Evaluate the neural network model
# loss, accuracy = model.evaluate(X_test, y_test)

## Ensemble and Tuning

### Execute

#### Ensemble Strategy
- **Objective:** Combine multiple models to improve prediction accuracy.
- **Tasks:**
    - Define ensemble strategy.
    - Train ensemble models.
    - Evaluate ensemble performance.
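
One simple ensemble strategy is a weighted average of the base models' predicted probabilities. The sketch below illustrates that idea on synthetic data; the two base models and the weights are assumptions and are not tuned.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}
# Validation-set probabilities of the positive class from each base model
preds = {name: m.fit(X_tr, y_tr).predict_proba(X_va)[:, 1] for name, m in models.items()}

# Weighted average of probabilities; the weights are illustrative, not tuned
weights = {"rf": 0.6, "logreg": 0.4}
blend = sum(weights[name] * preds[name] for name in models)

for name, p in preds.items():
    print(name, round(roc_auc_score(y_va, p), 4))
print("blend", round(roc_auc_score(y_va, blend), 4))
```

Comparing the blend's AUC with each base model's AUC on held-out data shows whether the ensemble actually helps. Weights should be chosen on validation data, never on the test set.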

#### Hyperparameter Tuning
- **Objective:** Optimize model parameters to improve model performance.
- **Tasks:**
    - Set up hyperparameter search space.
    - Conduct hyperparameter tuning.
    - Evaluate tuning results.

In [None]:
# Example code for hyperparameter tuning

## Execution

### Model Execution
- **Objective:** Apply the trained model to the test dataset to make predictions.
- **Tasks:**
    - Load the test dataset.
    - Apply the model to make predictions.
    - Prepare the submission file.
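
Preparing the submission could be sketched as below, assuming the file needs an `id` column followed by one probability column per defect category. The helper `make_submission`, the output name `submission.csv`, and the placeholder probabilities are hypothetical.

```python
import numpy as np
import pandas as pd

# Target columns as they appear in the training data
TARGETS = ["Pastry", "Z_Scratch", "K_Scatch", "Stains", "Dirtiness", "Bumps", "Other_Faults"]

def make_submission(test_ids, probabilities):
    """Assemble an id column plus one probability column per defect category."""
    probabilities = np.asarray(probabilities)
    if probabilities.shape != (len(test_ids), len(TARGETS)):
        raise ValueError("expected one row per id and one column per target")
    submission = pd.DataFrame(probabilities, columns=TARGETS)
    submission.insert(0, "id", list(test_ids))
    return submission

# Placeholder predictions for three test ids; real values would come from the model
ids = [19219, 19220, 19221]
probs = np.full((3, len(TARGETS)), 0.5)
submission = make_submission(ids, probs)
submission.to_csv("submission.csv", index=False)  # assumed output file name
print(submission.head())
```

The shape check catches the common mistake of passing class labels or a single column instead of a full probability matrix.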

In [None]:
# Example code for model execution

### Business Recommendations
- **Objective:** Propose business recommendations based on the model's predictions.
- **Tasks:**
    - Analyze model predictions.
    - Propose actionable recommendations.

In [None]:
# Example code for business recommendations

### Ethical Considerations
- **Objective:** Address the ethical implications of the model and of how its predictions are used.
- **Tasks:**
    - Review ethical considerations.
    - Ensure model fairness and transparency.

In [None]:
# Example code for ethical considerations

## Conclusion

### Final Thoughts
- **Objective:** Summarize the project's achievements and lessons learned.
- **Tasks:**
    - Reflect on the project's successes and challenges.
    - Discuss the impact of the project on the field of steel plate defect prediction.

### Future Work
- **Objective:** Identify areas for future research and improvement.
- **Tasks:**
    - Suggest potential improvements to the model.
    - Identify new datasets or features to explore.