<a href="https://colab.research.google.com/github/ikechukwuUE/steel-plate-defect/blob/master/steel_plate_prediction_starter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Enhancing Steel Plate Defect Prediction with Advanced Feature Engineering and Neural Networks

## Table of Contents

1. **Introduction**
2. **Data Preparation**
3. **Model Construction**
    - **Traditional Machine Learning Models**
    - **Neural Networks**
4. **Ensemble and Tuning**
5. **Execution**
6. **Conclusion**
7. **Appendices**


## Introduction

### Project Overview
- **Objective:** Develop a sophisticated machine learning model to predict the probability of various defects on steel plates using both the competition dataset and the original Steel Plates Faults dataset from UCI.
- **Methodology:** Focus on extensive feature engineering using Principal Component Analysis (PCA) to reduce the dimensionality of the dataset and incorporate neural networks for defect prediction.
- **Expected Outcome:** A CSV file with predicted probabilities for each defect category for each id in the test set, evaluated using the area under the ROC curve (AUC) for each category.

### Version Details
- **Version Number:** 1.0
- **Configuration Parameters:** Detailed in the Configuration Parameters section.

## Imports

In [None]:
%%time
# Imports (the %%time cell magic has to be the first line of the cell)

# Installing select libraries
!pip install -q lightgbm==4.3.0 --force-reinstall
!pip install --force-reinstall scikit-learn --no-index --find-links=file:///kaggle/input/scikit-learn-1-4-0/

# General library imports
from gc import collect
from warnings import filterwarnings
filterwarnings('ignore')
from IPython.display import display_html, clear_output
clear_output()

import xgboost as xgb
import lightgbm as lgb
import catboost as cb
import sklearn as sk
import pandas as pd
print(f"---> XGBoost = {xgb.__version__} | LightGBM = {lgb.__version__} | Catboost = {cb.__version__}")
print(f"---> Sklearn = {sk.__version__} | Pandas = {pd.__version__}\n\n")
collect()

# Data manipulation and visualization
from copy import deepcopy
import numpy as np
import re
from scipy.stats import mode, kstest, normaltest, shapiro, anderson, jarque_bera
from collections import Counter
from itertools import product
from colorama import Fore, Style, init
init(autoreset=True)
import joblib
import os

from tqdm.notebook import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap as LCM
%matplotlib inline

from pprint import pprint
from functools import partial

# Model and pipeline specifics
from category_encoders import OrdinalEncoder, OneHotEncoder
from sklearn.preprocessing import RobustScaler, MinMaxScaler, StandardScaler, FunctionTransformer as FT, PowerTransformer
from sklearn.impute import SimpleImputer as SI
from sklearn.model_selection import RepeatedStratifiedKFold as RSKF, StratifiedKFold as SKF, StratifiedGroupKFold as SGKF, KFold, RepeatedKFold as RKF, cross_val_score, cross_val_predict
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import mutual_info_classif, RFE
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer

# ML Model training
from sklearn.metrics import accuracy_score, roc_auc_score, make_scorer
from xgboost import DMatrix, XGBClassifier as XGBC
from lightgbm import log_evaluation, early_stopping, LGBMClassifier as LGBMC
from catboost import CatBoostClassifier as CBC, Pool
from sklearn.ensemble import HistGradientBoostingClassifier as HGBC, RandomForestClassifier as RFC

# Ensemble and tuning
import optuna
from optuna import Trial, trial, create_study
from optuna.pruners import HyperbandPruner
from optuna.samplers import TPESampler, CmaEsSampler
optuna.logging.set_verbosity(optuna.logging.ERROR)

In [None]:
# Setting rc parameters in seaborn for plots and graphs
sns.set_theme(rc={"axes.facecolor"       : "#ffffff",
                  "figure.facecolor"     : "#ffffff",
                  "axes.edgecolor"       : "#000000",
                  "grid.color"           : "#ffffff",
                  "font.family"          : ['Cambria'],
                  "axes.labelcolor"      : "#000000",
                  "xtick.color"          : "#000000",
                  "ytick.color"          : "#000000",
                  "grid.linewidth"       : 0.75,
                  "grid.linestyle"       : "--",
                  "axes.titlecolor"      : '#0099e6',
                  'axes.titlesize'       : 8.5,
                  'axes.labelweight'     : "bold",
                  'legend.fontsize'      : 7.0,
                  'legend.title_fontsize': 7.0,
                  'font.size'            : 7.5,
                  'xtick.labelsize'      : 7.5,
                  'ytick.labelsize'      : 7.5,
                 })


In [None]:
# Color printing
def PrintColor(text:str, color = Fore.BLUE, style = Style.BRIGHT):
    "Prints text (typically an f-string) in the given colorama color and style"
    print(style + color + text + Style.RESET_ALL)

# Making sklearn pipeline outputs as dataframe
from sklearn import set_config
set_config(transform_output = "pandas")
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

print()
collect()


In [None]:
# Function to load and preprocess data
def load_and_preprocess_data(train_path, test_path):
    # Load datasets
    train_data = pd.read_csv(train_path)
    test_data = pd.read_csv(test_path)

    # Preprocessing steps (e.g., handling missing values, encoding categorical variables)
    # Example: train_data = train_data.fillna(train_data.mean())
    # Example: test_data = test_data.fillna(test_data.mean())

    return train_data, test_data

# Function to split data into features and target
def split_data(data, target_column):
    X = data.drop(target_column, axis=1)
    y = data[target_column]
    return X, y

# Function to apply PCA
from sklearn.decomposition import PCA

def apply_pca(X_train, X_test, n_components):
    # Fit PCA on the training features only, then apply the same fitted
    # projection to the test features so both share the same components
    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)
    return X_train_pca, X_test_pca

# Function to train and evaluate a model
def train_and_evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy

# Function to plot ROC curve
from sklearn.metrics import roc_curve, auc

def plot_roc_curve(y_test, y_pred_proba):
    # Column 1 of predict_proba holds the probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba[:, 1])
    roc_auc = auc(fpr, tpr)
    plt.figure()
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()

# Example usage
# train_data, test_data = load_and_preprocess_data('path/to/train_data.csv', 'path/to/test_data.csv')
# X_train, y_train = split_data(train_data, 'target_column')
# X_test, y_test = split_data(test_data, 'target_column')  # needs a labelled hold-out set
# X_train_pca, X_test_pca = apply_pca(X_train, X_test, n_components=10)
# model = RFC()  # RandomForestClassifier is imported as RFC above
# accuracy = train_and_evaluate_model(model, X_train_pca, y_train, X_test_pca, y_test)
# print(f"Accuracy: {accuracy}")
# y_pred_proba = model.predict_proba(X_test_pca)
# plot_roc_curve(y_test, y_pred_proba)

## Data Preparation

### Plan

#### Data Exploration
- **Objective:** Become familiar with the datasets and conduct initial exploratory data analysis (EDA) to understand their structure and distributions.
- **Tasks:**
    - Load and inspect the datasets.
    - Perform basic statistical analysis.
    - Visualize data distributions.
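
The exploration tasks above could be sketched roughly as follows. This is a minimal illustration on synthetic data: the frame `train_df`, its column names (`feature_a`, `feature_b`, `defect_x`), and the helper `summarize` are all hypothetical stand-ins, not the actual competition schema.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data; all column names are hypothetical.
rng = np.random.default_rng(42)
train_df = pd.DataFrame({
    "id": np.arange(200),
    "feature_a": rng.normal(size=200),
    "feature_b": rng.exponential(size=200),
    "defect_x": rng.integers(0, 2, size=200),
})
train_df.loc[::25, "feature_b"] = np.nan  # inject a few missing values

def summarize(df: pd.DataFrame, target_cols: list[str]) -> pd.DataFrame:
    """Print basic statistics and return dtype, missing and unique counts per column."""
    print(f"Shape: {df.shape}")
    print(df.describe().T[["mean", "std", "min", "max"]])
    print("Positive rate per target:")
    print(df[target_cols].mean())
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })

summary = summarize(train_df, ["defect_x"])
print(summary)
```

In a notebook, `train_df.hist(bins=30)` is one quick way to look at the feature distributions afterwards.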

#### Data Integration
- **Objective:** Assess the reliability of the data, consider ethical implications, and plan for data integration.
- **Tasks:**
    - Merge datasets if necessary.
    - Handle missing values.
    - Ensure data consistency.

# Example code for data integration
# Merge datasets if necessary
# Handle missing values
# Ensure data consistency
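
One possible shape for the integration step is sketched below, assuming the competition data and the original UCI data share most columns. The toy frames, the column names, the helper `integrate`, and the choices of median imputation and de-duplicating the original data are illustrative assumptions rather than the project's settled approach.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the competition data and the original UCI data.
competition = pd.DataFrame({
    "id": [0, 1, 2],
    "feature_a": [1.0, np.nan, 3.0],
    "defect_x": [0, 1, 0],
})
original = pd.DataFrame({
    "feature_a": [2.0, 2.0, 5.0],
    "defect_x": [1, 1, 0],
})

def integrate(comp: pd.DataFrame, orig: pd.DataFrame, target_cols: list[str]) -> pd.DataFrame:
    """Stack two sources on their shared columns, flag the origin, impute features."""
    # Data consistency: keep only the columns present in both sources
    shared = [c for c in comp.columns if c in orig.columns]
    comp_part = comp[shared].assign(source="competition")
    orig_part = orig[shared].drop_duplicates().assign(source="original")
    merged = pd.concat([comp_part, orig_part], ignore_index=True)
    # Missing values: median imputation on feature columns only, never on targets
    feature_cols = [c for c in shared if c not in target_cols]
    merged[feature_cols] = merged[feature_cols].fillna(merged[feature_cols].median())
    return merged

merged = integrate(competition, original, target_cols=["defect_x"])
print(merged)
```

Keeping a `source` column makes it possible to weight or filter the original rows later, for example when validating only on competition rows.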

#### Feature Engineering
- **Objective:** Perform extensive feature engineering using PCA to reduce the dimensionality of the dataset.
- **Tasks:**
    - Select relevant features.
    - Apply PCA to reduce dimensionality.
    - Evaluate the impact of PCA on model performance.

# Example code for feature engineering

# Select relevant features
# Apply PCA to reduce dimensionality
# Evaluate the impact of PCA on model performance
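
A hedged sketch of how the impact of PCA might be evaluated: the same classifier is cross-validated with and without a PCA step and the mean AUC is compared. The synthetic data from `make_classification`, the logistic regression model, and the 95% explained-variance threshold are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem standing in for one defect category
X, y = make_classification(n_samples=500, n_features=20, n_informative=6, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scaling sits inside the pipeline so it is refit on each training fold
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
with_pca = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),  # keep enough components to explain 95% of the variance
    LogisticRegression(max_iter=1000),
)

auc_baseline = cross_val_score(baseline, X, y, cv=cv, scoring="roc_auc").mean()
auc_pca = cross_val_score(with_pca, X, y, cv=cv, scoring="roc_auc").mean()

# Refit on all data to inspect how many components were retained
with_pca.fit(X, y)
n_kept = with_pca.named_steps["pca"].n_components_

print(f"AUC without PCA: {auc_baseline:.4f}")
print(f"AUC with PCA ({n_kept} components): {auc_pca:.4f}")
```

Placing PCA inside the pipeline means the components are learned from the training folds only, which avoids leaking validation data into the projection.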

## Model Construction

### Construct

#### Traditional Machine Learning Models
- **Objective:** Train and evaluate traditional machine learning models.
- **Tasks:**
    - Select appropriate machine learning algorithms.
    - Define model architecture.
    - Train the model.
    - Evaluate the model.

# Example code for training and evaluating traditional machine learning models
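
Since the submission is scored by AUC per defect category, one straightforward setup is to train a separate binary classifier for each target and score its out-of-fold probabilities. The sketch below assumes that setup; the synthetic data, the target names (`defect_a` to `defect_c`), and the choice of a random forest are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic multi-target data: three binary targets driven by linear signals plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 8))
W = rng.normal(size=(8, 3))
Y = ((X @ W + rng.normal(scale=0.5, size=(400, 3))) > 0).astype(int)
target_names = ["defect_a", "defect_b", "defect_c"]  # hypothetical names

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_auc = {}
for j, name in enumerate(target_names):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    # Out-of-fold probabilities: each row is predicted by a model that never saw it
    proba = cross_val_predict(model, X, Y[:, j], cv=cv, method="predict_proba")[:, 1]
    oof_auc[name] = roc_auc_score(Y[:, j], proba)

mean_auc = float(np.mean(list(oof_auc.values())))
for name, score in oof_auc.items():
    print(f"{name}: AUC = {score:.4f}")
print(f"Mean AUC: {mean_auc:.4f}")
```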

#### Neural Networks
- **Objective:** Train and evaluate neural network models for defect prediction using Keras and TensorFlow.
- **Tasks:**
    - Define neural network architecture.
    - Train the neural network model.
    - Evaluate the neural network model.

# Import necessary libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# Placeholder value: should match the number of features after PCA
n_features = 100

# Define neural network architecture
model = Sequential([
    Input(shape=(n_features,)),
    Dense(32, activation='relu'),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid'),  # Assuming binary classification (a single defect category)
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the neural network model
# model.fit(X_train, y_train, epochs=10, batch_size=32)

# Evaluate the neural network model (returns the loss followed by each compiled metric)
# loss, accuracy = model.evaluate(X_test, y_test)

## Ensemble and Tuning

### Execute

#### Ensemble Strategy
- **Objective:** Combine multiple models to improve prediction accuracy.
- **Tasks:**
    - Define ensemble strategy.
    - Train ensemble models.
    - Evaluate ensemble performance.

#### Hyperparameter Tuning
- **Objective:** Optimize model parameters to improve model performance.
- **Tasks:**
    - Set up hyperparameter search space.
    - Conduct hyperparameter tuning.
    - Evaluate tuning results.

# Example code for hyperparameter tuning

## Execution

### Model Execution
- **Objective:** Apply the trained model to the test dataset to make predictions.
- **Tasks:**
    - Load the test dataset.
    - Apply the model to make predictions.
    - Prepare the submission file.

# Example code for model execution
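
The execution tasks above might look like the following sketch, which fits one model per target, predicts probabilities for the test rows, and writes an `id` plus one probability column per defect category. The feature and target names, the logistic regression model, and the output location are hypothetical; the synthetic frames stand in for the real train and test files.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = ["f1", "f2", "f3", "f4"]   # hypothetical feature columns
target_names = ["defect_a", "defect_b"]    # hypothetical defect categories

# Synthetic training data: each target depends on one feature plus noise
X_train = pd.DataFrame(rng.normal(size=(200, 4)), columns=feature_names)
Y_train = pd.DataFrame({
    t: (X_train.iloc[:, i] + rng.normal(scale=0.5, size=200) > 0).astype(int)
    for i, t in enumerate(target_names)
})

# Synthetic test data with an id column and no targets
test_df = pd.DataFrame(rng.normal(size=(50, 4)), columns=feature_names)
test_df.insert(0, "id", np.arange(1000, 1050))

# One probability column per defect category, keyed by id
submission = pd.DataFrame({"id": test_df["id"]})
for t in target_names:
    model = LogisticRegression().fit(X_train, Y_train[t])
    submission[t] = model.predict_proba(test_df[feature_names])[:, 1]

out_path = Path(tempfile.gettempdir()) / "submission.csv"
submission.to_csv(out_path, index=False)
print(submission.head())
```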

### Business Recommendations
- **Objective:** Propose business recommendations based on the model's predictions.
- **Tasks:**
    - Analyze model predictions.
    - Propose actionable recommendations.

# Example code for business recommendations

### Ethical Considerations
- **Objective:** Address the ethical implications of the model and of how its predictions are used.
- **Tasks:**
    - Review ethical considerations.
    - Ensure model fairness and transparency.

# Example code for ethical considerations

## Conclusion

### Final Thoughts
- **Objective:** Summarize the project's achievements and lessons learned.
- **Tasks:**
    - Reflect on the project's successes and challenges.
    - Discuss the impact of the project on the field of steel plate defect prediction.

### Future Work
- **Objective:** Identify areas for future research and improvement.
- **Tasks:**
    - Suggest potential improvements to the model.
    - Identify new datasets or features to explore.