<a href="https://www.kaggle.com/code/orestasdulinskas/steel-plate-defect-prediction?scriptVersionId=187948080" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

![Steel Plates](https://www.dsstainlesssteel.com/wp-content/uploads/2018/05/Stainless-Steel-Plate-Sheet-1.jpg)

# Steel Plate Defect Prediction
---
## Background
Steel manufacturers need reliable quality control: faults in steel plates must be identified and corrected quickly to meet the demand for high-quality products. This project applies machine learning to the Steel Plates Faults dataset to build models that classify the types of faults present in a plate. Accurate fault prediction can help manufacturers streamline quality assurance, reduce downtime, and lower production costs, and it lays the foundation for more proactive fault detection.

## Objective
The primary objective of this project is to develop robust predictive models capable of accurately predicting the presence and type of faults in steel plates. By utilizing machine learning algorithms, the project aims to predict the occurrence of various types of faults, including Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, and Other_Faults, based on the features provided in the dataset. These predictive models will enable steel manufacturers to automate and enhance their quality control processes, allowing for timely identification and rectification of faults in steel plates during the manufacturing process.

## Data
The Steel Plates Faults dataset comprises various features related to steel plates and different types of faults associated with them. The dataset includes the following features:

* **id**: Unique identifier for each observation.
* **X_Minimum**: Minimum x-coordinate of the defect.
* **X_Maximum**: Maximum x-coordinate of the defect.
* **Y_Minimum**: Minimum y-coordinate of the defect.
* **Y_Maximum**: Maximum y-coordinate of the defect.
* **Pixels_Areas**: Area of the defect in pixels.
* **X_Perimeter**: Perimeter of the defect in the x-direction.
* **Y_Perimeter**: Perimeter of the defect in the y-direction.
* **Sum_of_Luminosity**: Sum of luminosity values within the defect area.
* **Minimum_of_Luminosity**: Minimum luminosity value within the defect area.
* **Maximum_of_Luminosity**: Maximum luminosity value within the defect area.
* **Length_of_Conveyer**: Length of the conveyer belt during manufacturing.
* **TypeOfSteel_A300**: Indicator variable for type of steel (A300).
* **TypeOfSteel_A400**: Indicator variable for type of steel (A400).
* **Steel_Plate_Thickness**: Thickness of the steel plate.
* **Edges_Index**: Ratio of perimeter to the length of the defect.
* **Empty_Index**: Ratio of empty pixels to the total number of pixels in the defect.
* **Square_Index**: Ratio of area to the square of the perimeter of the defect.
* **Outside_X_Index**: Ratio of pixels outside the defect in the x-direction to the total number of pixels in the defect.
* **Edges_X_Index**: Ratio of horizontal edges to the total number of edges in the defect.
* **Edges_Y_Index**: Ratio of vertical edges to the total number of edges in the defect.
* **Outside_Global_Index**: Ratio of pixels outside the defect to the total number of pixels in the image.
* **LogOfAreas**: Logarithm of the defect area.
* **Log_X_Index**: Logarithm of the maximum length of the defect in the x-direction.
* **Log_Y_Index**: Logarithm of the maximum length of the defect in the y-direction.
* **Orientation_Index**: Index representing the orientation of the defect.
* **Luminosity_Index**: Index representing the luminosity of the defect.
* **SigmoidOfAreas**: Sigmoid function applied to the defect area.

* **Pastry**: Binary indicator variable for the presence of the 'Pastry' fault.
* **Z_Scratch**: Binary indicator variable for the presence of the 'Z_Scratch' fault.
* **K_Scatch**: Binary indicator variable for the presence of the 'K_Scatch' fault.
* **Stains**: Binary indicator variable for the presence of the 'Stains' fault.
* **Dirtiness**: Binary indicator variable for the presence of the 'Dirtiness' fault.
* **Bumps**: Binary indicator variable for the presence of the 'Bumps' fault.
* **Other_Faults**: Binary indicator variable for the presence of other types of faults not specified above.

These features provide comprehensive information about the characteristics of steel plates and the types of faults they may exhibit, facilitating the development of predictive models for fault detection and classification.

# Data Cleaning
---

In [None]:
# Import dataset
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

path = r'/kaggle/input/playground-series-s4e3/train.csv'
df = pd.read_csv(path)
df.set_index('id', inplace=True)

df_org   = pd.read_csv('/kaggle/input/faulty-steel-plates/faults.csv')
df = pd.concat([df, df_org], ignore_index=True)

df.head()

In [None]:
import seaborn as sns
sns.set_theme()

df.info() # Summary of DataFrame information

print('\nNumber of unique values in each column')
for i in df.columns:
    print(f'{i} - {df[i].nunique()}')

print('\nNumber of duplicated rows\n', df.duplicated().sum())    
    
print('\nNumber of missing values in each column\n', df.isnull().sum())

sns.heatmap(df.isnull())

In [None]:
df.describe()

**Summary**: Since I am using a cleaned dataset, there are no missing or duplicate values.

# Feature Selection
---

In [None]:
# Defining training and target features
target = ['Stains', 'Dirtiness', 'Z_Scratch', 'K_Scatch', 'Pastry', 'Bumps', 'Other_Faults']

In [None]:
import plotly.express as px

px.imshow(df.corr()[target].drop(target, axis=0).T, zmin=0, zmax=1, title='Correlation Matrix', template='ggplot2')

In [None]:
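features = {}
# Feature-vs-target correlations, computed once instead of once per target
corr = df.corr()[target].drop(target, axis=0)
for i in target:
    # Keep the three features with the highest correlation to each target
    features[i] = corr[i].sort_values().tail(3).index.to_list()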

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Train the model
model = RandomForestClassifier()
model.fit(df.drop(target, axis=1), df[target])

# Get feature importances
importances = model.feature_importances_

# Convert the importances into a DataFrame
feature_importance_df = pd.DataFrame({
    'Feature': df.drop(target, axis=1).columns,
    'Importance': importances
})

# Sort the DataFrame to plot
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

px.bar(feature_importance_df, x='Feature', y='Importance', template='ggplot2', title='Feature Importances')

# Exploratory Data Analysis (EDA)
---

In [None]:
subject = pd.DataFrame()
for i in target:
    subject[i] = df[i].value_counts()
    
px.bar(subject.T, x=subject.T.index, y=subject.T.columns,
       title='Label distribution', template='ggplot2',
       labels={'values':'Count','index':'Target features', 'Stains':'Labels'},
      text_auto=True)

**Summary**:
* **Low Incidence Faults**: Stains and Dirtiness are relatively rare.
* **Moderate Faults**: Z_Scratch and Pastry faults occur more frequently.
* **Significant Faults**: K_Scatch and Bumps are more common.
* **Most Common Fault**: Other Faults category is the most prevalent.
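
The same comparison can be made numerically rather than read off the bar chart. The sketch below uses a small made-up DataFrame (`toy`, not the competition data) to show how the share of positive rows per label and the number of labels per row could be computed; the column subset and values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the target columns (values are invented for illustration).
toy = pd.DataFrame({
    "Stains":       [0, 0, 0, 1],
    "Bumps":        [1, 0, 1, 0],
    "Other_Faults": [0, 1, 0, 0],
})

# Share of rows where each fault is present (the mean of a 0/1 column).
prevalence = toy.mean().sort_values()
print(prevalence)

# Number of faults flagged per row; values other than 1 mean the row
# is either unlabeled or carries more than one fault.
labels_per_row = toy.sum(axis=1)
print(labels_per_row.value_counts())
```

On the real data, replacing `toy` with `df[target]` would give the same two summaries for all seven labels.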

In [None]:
import matplotlib.pyplot as plt

for j in features.keys():
    print('\n<<', j, '>>\n')
    for i in features[j]:
        # Draw on the axes created here instead of relying on the implicit current axes
        fig, ax = plt.subplots(figsize=(15, 4))
        sns.histplot(data=df, x=i, hue=j, bins=50, kde=True, ax=ax)
        ax.set_title(i)
        ax.grid(True)
        plt.show()

In [None]:
from itertools import combinations

for t in target:
    print('\n<<', t, '>>\n')
    for i, j in combinations(features[t], 2):
        plt.figure(figsize=(15, 6))
        sns.scatterplot(data=df, x=i, y=j, hue=t, alpha=0.5, edgecolor=None)
        plt.title(f"{i} and {j}")
        plt.show()

# Pre-Processing
---

In [None]:
#df_copy = df.copy()

### Removing Outliers

In [None]:
"""print('Length of data before removing outliers:', len(df))
for i in df.drop(target, axis=1).columns:
    Q1 = df[i].quantile(0.15)
    Q3 = df[i].quantile(0.85)
    IQR = Q3 - Q1
    df = df[(df[i] >= Q1 - 1.5*IQR) & (df[i] <= Q3 + 1.5*IQR)]
    
print('Length of data after removing outliers:', len(df))"""

### Feature Engineering

In [None]:
import numpy as np

def feature_engineering(df):
    epsilon = 1e-6
    
    # Calculate area
    df['X_Distance'] = df['X_Maximum'] - df['X_Minimum']
    df['Y_Distance'] = df['Y_Maximum'] - df['Y_Minimum']
    df['Area'] = (df['X_Distance']) * (df['Y_Distance'])
    
    # Density Feature
    #df['Density'] = df['Pixels_Areas'] / (df['X_Perimeter'] + df['Y_Perimeter'])

    # Calculate perimeter
    df['Perimeter'] = 2 * ((df['X_Maximum'] - df['X_Minimum']) + (df['Y_Maximum'] - df['Y_Minimum']))
    
    # Relative Perimeter Feature
    df['Relative_Perimeter'] = df['X_Perimeter'] / (df['X_Perimeter'] + df['Y_Perimeter'] + epsilon)
    
    # Circularity Feature
    #df['Circularity'] = df['Pixels_Areas'] / (df['X_Perimeter'] ** 2)
    
    # Combined Geometric Index Feature
    df['Combined_Geometric_Index'] = df['Edges_Index'] * df['Square_Index']
    
    # Symmetry Index Feature
    df['Symmetry_Index'] = np.abs(df['X_Distance'] - df['Y_Distance']) / (df['X_Distance'] + df['Y_Distance'] + epsilon)
    
    # Compute mean, median, and standard deviation
    df['Mean_Luminosity'] = df[['Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity']].mean(axis=1)
    df['Median_Luminosity'] = df[['Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity']].median(axis=1)
    df['Std_Luminosity'] = df[['Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity']].std(axis=1)
    
    # Aspect ratio is computed further down with a guard against zero-length sides
    # (an unguarded Y/X ratio here was overwritten by that later assignment)
    
    # Apply logarithmic transformation
    df['Log_Pixels_Areas'] = np.log(df['Pixels_Areas'])
    
    # Interaction Term Feature
    df['X_Distance*Pixels_Areas'] = df['X_Distance'] * df['Pixels_Areas']
    
    # Create composite feature
    df['Luminosity_Index_Product'] = df['Luminosity_Index'] * df['Sum_of_Luminosity']
    
    # Color Contrast Feature
    df['Color_Contrast'] = df['Maximum_of_Luminosity'] - df['Minimum_of_Luminosity']
    
    # Average Luminosity Feature
    df['Average_Luminosity'] = (df['Sum_of_Luminosity'] + df['Minimum_of_Luminosity']) / 2
    
    # Generate interaction feature
    df['Area_Pixels_Interact'] = df['Area'] * df['Pixels_Areas']
    
    # Additional Features
    df['sin_orientation'] = np.sin(df['Orientation_Index'])
    df['Edges_Index2'] = np.exp(df['Edges_Index'] + epsilon)
    df['X_Maximum2'] = np.sin(df['X_Maximum'])
    df['Y_Minimum2'] = np.sin(df['Y_Minimum'])
    df['Aspect_Ratio_Pixels'] = np.where(df['Y_Perimeter'] == 0, 0, df['X_Perimeter'] / df['Y_Perimeter'])
    df['Aspect_Ratio'] = np.where(df['Y_Distance'] == 0, 0, df['X_Distance'] / df['Y_Distance'])

    # Normalized Steel Thickness Feature
    df['Normalized_Steel_Thickness'] = (df['Steel_Plate_Thickness'] - df['Steel_Plate_Thickness'].min()) / (df['Steel_Plate_Thickness'].max() - df['Steel_Plate_Thickness'].min())

    # Logarithmic Features
    df['Log_Perimeter'] = np.log(df['X_Perimeter'] + df['Y_Perimeter'] + epsilon)
    df['Log_Luminosity'] = np.log(df['Sum_of_Luminosity'] + epsilon)
    df['Log_Aspect_Ratio'] = np.log(df['Aspect_Ratio'] ** 2 + epsilon)

    # Statistical Features
    df['Combined_Index'] = df['Orientation_Index'] * df['Luminosity_Index']
    df['Sigmoid_Areas'] = 1 / (1 + np.exp(-df['LogOfAreas'] + epsilon))
    
    return df

In [None]:
# Applying feature engineering to the dataframe
df = feature_engineering(df)
df.shape

### Data Scaling

In [None]:
#df = df_copy.copy()

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, Normalizer

"""scaler = StandardScaler()
df[df.drop(target, axis=1).columns] = scaler.fit_transform(df.drop(target, axis=1))"""

scaler = MinMaxScaler()
df[df.drop(target, axis=1).columns] = scaler.fit_transform(df.drop(target, axis=1))

"""scaler = MaxAbsScaler()
df[df.drop(target, axis=1).columns] = scaler.fit_transform(df.drop(target, axis=1))"""

"""scaler = Normalizer()
df[df.drop(target, axis=1).columns] = scaler.fit_transform(df.drop(target, axis=1))"""
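
Whichever scaler is chosen, the usual pattern is to learn its statistics from the training data and reuse them unchanged on any later data. The sketch below shows this with `MinMaxScaler` on two tiny made-up arrays; `scaler_demo`, `train`, and `test` are illustrative names, not objects from this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two features, three training rows, one test row.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns per-column min/max from train only
test_scaled = scaler_demo.transform(test)        # reuses the training min/max, no refit

print(test_scaled)
```

Values outside the training range map outside [0, 1] (the second test feature here), which is expected and keeps train and test on the same scale.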

### Feature Selection

In [None]:
"""from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import hamming_loss

X = df[df.drop(target, axis=1).columns[selector.support_]]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initialize the RandomForestClassifier
rf_classifier = RandomForestClassifier()

# Fit the model
rf_classifier.fit(X_train, y_train)

# Predict labels
y_pred = rf_classifier.predict(X_test)

# Ensure y_test is a DataFrame or 2D array
if isinstance(y_test, pd.Series):
    y_test = y_test.to_frame()

# Calculate Hamming Loss
hamming = hamming_loss(y_test, y_pred)
print(f"Hamming Loss: {hamming}")"""

In [None]:
#comp = pd.DataFrame(columns=['Hamming Loss'])

In [None]:
#comp.loc['minmax rescaler + feat engineer + feat select'] = hamming
#comp

### Results of comparing different pre-processing methods
Hamming loss for each pre-processing variant (lower is better):
* **base**: 0.105286
* **removed 5% outliers**: 0.105719
* **removed 10% outliers**: 0.107542
* **removed 15% outliers**: 0.122124
* **standard rescaler**: 0.103902
* **minmax rescaler**: 0.103666
* **maxabs rescaler**: 0.104206
* **minmax rescaler + feat engineer**: 0.104240
* **minmax rescaler + feat engineer + feat select**: 0.108797
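
For reference, Hamming loss is the fraction of individual label slots that are predicted incorrectly, averaged over all rows and labels. The sketch below computes it with `sklearn.metrics.hamming_loss` on a small made-up example (three labels, four rows) and checks it against a manual calculation; the arrays are illustrative, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Made-up multi-label ground truth and predictions (rows = plates, columns = faults).
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],   # one wrong slot
                   [0, 0, 1],
                   [1, 0, 0]])  # two wrong slots

loss = hamming_loss(y_true, y_pred)
manual = np.mean(y_true != y_pred)  # 3 wrong slots out of 12
print(loss, manual)
```

Because every slot counts equally, a model that predicts "no fault" everywhere can still score a low Hamming loss on rare labels, which is worth keeping in mind when reading the figures above.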

In [None]:
"""from sklearn.feature_selection import RFECV

X = df.drop(target, axis=1)
y = df[target]

estimator = RandomForestClassifier()

# Initialize RFE with cross-validation
selector = RFECV(estimator, step=1, cv=5)

# Fit RFE to the data
selector.fit(X, y)

# Explore the results (e.g., selected features, accuracy)
print("Number of selected features:", selector.n_features_)
print("Selected features indices:", selector.support_)
print("Ranking of features:", selector.ranking_)"""

In [None]:
"""cv_results = pd.DataFrame(selector.cv_results_)
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Mean test accuracy")
plt.errorbar(
    x=cv_results.index,
    y=cv_results["mean_test_score"],
    yerr=cv_results["std_test_score"],
)
plt.title("Recursive Feature Elimination \nwith correlated features")
plt.show()"""

In [None]:
features = ['X_Minimum',
 'X_Maximum',
 'Y_Maximum',
 'Pixels_Areas',
 'Length_of_Conveyer',
 'Steel_Plate_Thickness',
 'Empty_Index',
 'Outside_X_Index',
 'Log_X_Index',
 'Orientation_Index',
 'Luminosity_Index',
 'X_Distance',
 'Combined_Geometric_Index',
 'Std_Luminosity',
 'X_Distance*Pixels_Areas',
 'Luminosity_Index_Product',
 'Color_Contrast',
 'Combined_Index',
 'Sigmoid_Areas']

# Model Training
---

In [None]:
def compare_models(model_scores):
    # Bar chart of one score per model (scores are expected on a 0-100 scale)
    subject = pd.DataFrame.from_dict(model_scores, orient='index', columns=['Score'])
    fig = px.bar(subject, x=subject.index, y='Score', template='ggplot2', text_auto=True,
                 range_y=[0, 100], labels={'index':'Models'})
    fig.show()

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from xgboost import XGBClassifier

classifiers = {
    "Logisitic Regression": {'model':LogisticRegression()},
    
    "Weighted Logisitic Regression": {'model':LogisticRegression(class_weight='balanced')},
        
    "Decision Tree Classifier": {'model':DecisionTreeClassifier()},
    
    "Random Forest Classifier": {'model':RandomForestClassifier()},
    
    "Extra Trees Classifier": {'model':ExtraTreesClassifier()},
    
    "Naive Bayes": {'model':GaussianNB()},
    
    "Voting Classifier": {'model':VotingClassifier(estimators=[
                        ('lr', LogisticRegression()),
                        ('rf', RandomForestClassifier()),
                        ('gnb', GaussianNB())
                    ], voting='soft')},
    
    "Gradient Boosting Classifier": {'model':GradientBoostingClassifier()},
    
    "XGBoost Classifier": {'model':XGBClassifier(objective='binary:logistic')}
}

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import roc_auc_score

best_models = {}

model_scores = {}

# Iterating through each target
for t in target:
    
    print('\n<<' ,t, '>>\n')
    
    model_scores = {}
    
    X = df.drop(target, axis=1)[features]
    y = df[t]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Iterating through models
    for key, classifier in classifiers.items():

        # Fitting the model; estimators whose fit() does not accept eval_metric
        # raise TypeError and are fitted again without it
        try:
            classifier['model'].fit(X_train, y_train, eval_metric='auc')
        except TypeError:
            classifier['model'].fit(X_train, y_train)

        # Evaluating model performance on the held-out split
        pred = classifier['model'].predict_proba(X_test)[:, 1]  # probability of the positive class
        test_score = roc_auc_score(y_test, pred)
        model_scores[key] = round(test_score * 100, 2)

    # Saving best performing model for current label
    best_models[t] = [item[0] for item in sorted(model_scores.items(), key=lambda item: item[1], reverse=True)[:1]]

    # Comparing model performance for current label
    compare_models(model_scores)
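
The comparison above scores each model on a single random 80/20 split, so the ranking can shift between runs. A possible refinement, sketched below on synthetic data, is to average ROC-AUC over stratified folds with `cross_val_score`. The data from `make_classification`, the 90/10 class balance, and the names `X_demo`, `y_demo`, and `cv` are assumptions for illustration, not part of this notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem standing in for one fault label.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Stratified folds keep the share of positives similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="roc_auc"
)
print(f"ROC-AUC: {scores.mean() * 100:.2f} +/- {scores.std() * 100:.2f}")
```

The fold-to-fold standard deviation gives a rough sense of whether the gap between two models is larger than the noise from the split.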

# Model Optimization
---

In [None]:
# Defining choose model function that sets up model for training
def choose_model(params):
    match best_models[t][0]:
        case "Logisitic Regression":
            model = LogisticRegression(**params)
            
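        case "Weighted Logisitic Regression":
            # params may already contain class_weight from the Optuna trial;
            # merging avoids passing the same keyword twice (TypeError)
            model = LogisticRegression(**{'class_weight': 'balanced', **params})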

        case "Decision Tree Classifier":
            model = DecisionTreeClassifier(**params)

        case "Random Forest Classifier":
            model = RandomForestClassifier(**params)
            
        case "Extra Trees Classifier":
            model = ExtraTreesClassifier(**params)
            
        case "Naive Bayes":
            model = GaussianNB(**params)
            
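        case "Voting Classifier":
            # The search space encodes weights as a string such as '2,1,1';
            # VotingClassifier expects a sequence of numbers
            params = dict(params)
            if isinstance(params.get('weights'), str):
                params['weights'] = [int(w) for w in params['weights'].split(',')]
            model = VotingClassifier(estimators=[
                        ('lr', LogisticRegression()),
                        ('rf', RandomForestClassifier()),
                        ('gnb', GaussianNB())
                    ], voting='soft', **params)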

        case "Gradient Boosting Classifier":
            model = GradientBoostingClassifier(**params)

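        case "XGBoost Classifier":
            model = XGBClassifier(objective='binary:logistic', **params)

        case _:
            # Fail loudly instead of returning an undefined variable
            raise ValueError(f"Unknown model name: {best_models[t][0]}")

    return model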

In [None]:
# Defining objective function for Optuna optimization
def objective(trial):
    
    # Specifying hyperparameters

    classifiers = {
        "Logisitic Regression": {'model':LogisticRegression(), 'params':{
            'C': trial.suggest_float('C', 0.01, 10.0, log=True),
            'penalty': trial.suggest_categorical('penalty', ['l1', 'l2']),
            'solver': trial.suggest_categorical('solver', ['liblinear', 'saga']),
        }},
        
        "Weighted Logisitic Regression": {'model':LogisticRegression(), 'params':{
            'class_weight': trial.suggest_categorical('class_weight', ['balanced', None]),
            'C': trial.suggest_float('C', 0.01, 10.0, log=True),
            'penalty': trial.suggest_categorical('penalty', ['l1', 'l2']),
            'solver': trial.suggest_categorical('solver', ['liblinear', 'saga']),
        }},
        
        "Decision Tree Classifier": {
        'model': DecisionTreeClassifier(),
        'params': {
            'max_depth': trial.suggest_int('max_depth', 1, 100, log=False, step=1),
            'min_samples_split': trial.suggest_int('min_samples_split', 2, 20, log=False, step=1),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20, log=False, step=1)
            }
        },
        
        "Random Forest Classifier": {
            'model': RandomForestClassifier(),
            'params': {
                'min_samples_split': trial.suggest_int('min_samples_split', 2, 20, log=False, step=1),
                'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20, log=False, step=1),
                'max_depth': trial.suggest_int('max_depth', 1, 100, log=False, step=1),
                'max_features': trial.suggest_categorical('max_features', [1, 'sqrt', 'log2', None]),
                'n_estimators': trial.suggest_int('n_estimators', 100, 2000)
            }
        },
        
        "Extra Trees Classifier": {
            'model': ExtraTreesClassifier(),
            'params': {
                'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
                'max_depth': trial.suggest_int('max_depth', 1, 100, log=False, step=1),
                'min_samples_split': trial.suggest_int('min_samples_split', 2, 20, log=False, step=1),
                'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20, log=False, step=1),
                'max_features': trial.suggest_categorical('max_features', [1, 'sqrt', 'log2', None]),
            }
        },
        
        "Naive Bayes": {
            'model': GaussianNB(),
            'params': {
                'var_smoothing': trial.suggest_float('var_smoothing', 1e-9, 1e-3, log=True),
            }
        },
        
        "Voting Classifier": {
            'model': VotingClassifier(estimators=[
                        ('lr', LogisticRegression()),
                        ('rf', RandomForestClassifier()),
                        ('gnb', GaussianNB())
                    ], voting='soft'),
            'params': {
                'weights': trial.suggest_categorical('weights', ['1,1,1', '2,1,1', '1,2,1']),
            }
        },
        
        "Gradient Boosting Classifier": {
            'model': GradientBoostingClassifier(),
            'params': {
                'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
                'learning_rate': trial.suggest_float('learning_rate', 0.001, 1.0, log=False),
                'max_depth': trial.suggest_int('max_depth', 1, 100, log=False, step=1),
                'subsample': trial.suggest_float('subsample', 0.1, 1),
            }
        },
        "XGBoost Classifier": {
            'model': XGBClassifier(objective='binary:logistic'),
            'params': {
                'learning_rate': trial.suggest_float('learning_rate', 0.001, 1.0, log=False),
                'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),
                'gamma': trial.suggest_float('gamma', 0, 1),
                'subsample': trial.suggest_float('subsample', 0.1, 1),
                'max_depth': trial.suggest_int('max_depth', 1, 100, log=False, step=1),
                'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
                "booster": "gbtree",
                "reg_alpha": trial.suggest_float('reg_alpha', 0.1, 1),
                "reg_lambda": trial.suggest_float('reg_lambda', 0, 1),
                "colsample_bytree": trial.suggest_float('colsample_bytree', 0.1, 1)
            }
        }
    }
    
    # Selecting best performing model for current label
    model = choose_model(classifiers[best_models[t][0]]['params'])

    model.fit(X_train, y_train) # Fitting training data to the model
    
    pred = model.predict_proba(X_test)[:, 1] # Probability of the positive class
    score = roc_auc_score(y_test, pred) # Evaluating model performance using ROC-AUC score
    
    return score

In [None]:
import optuna
import plotly

model_grid_scores = {} # Saving models scores
best_grid_models = {} # Saving best performing models
test_datasets = {} # Saving test datasets specific to target trained for later evaluations

# Iterating through targets
for t in target:
    
    print('\n<<' ,t, '>>\n')
    
    X = df.drop(target, axis=1)[features]
    y = df[t]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    test_datasets[t] = {'X':X_test, 'y':y_test}

    # Initializing Optuna study
    study = optuna.create_study(direction='maximize')

    # Performing Hyperparameter optimization using optuna objective function
    print('Training', best_models[t][0], '\n')
    study.optimize(objective, n_trials=10) # Number of trials

    print('Best trial parameters:', study.best_trial.params)
    print('\n Best ROC-AUC score:', study.best_trial.value)

    # Selecting best hyperparameter combination
    best_trial = study.best_trial # Getting best trial
    best_params = best_trial.params # Getting best trial parameters
    parameters = set(classifiers[best_models[t][0]]['model'].get_params().keys()) & set(best_params.keys()) # Getting model parameter keys
    best_params = {key: best_params[key] for key in parameters} # Choosing parameters appropriate for selected model
    best = choose_model(best_params) # Choosing the model and assigning its parameters
    best.fit(X_train, y_train) # Fitting the model

    best_grid_models[t] = best # Saving the model in a dictionary

    # Evaluating model performance
    pred = best.predict_proba(X_test)[:, 1] # Probability of the positive class
    score = roc_auc_score(y_test, pred)
    model_grid_scores[t] = round(score * 100, 2) # Scale first, then round, to avoid float artifacts

    optuna.visualization.plot_optimization_history(study).show() # Visualising Optimization history
    optuna.visualization.plot_param_importances(study).show() # Visualising Parameter importances

In [None]:
# Evaluating model performances for each label
compare_models(model_grid_scores)

# Model Evaluation
---

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

def performance_metrics(model):
    
    print('\n<<' ,t, '>>\n')
    
    preds = model.predict(test_datasets[t]['X'])
    
    print(classification_report(test_datasets[t]['y'], preds), '\n')

    cf_matrix = confusion_matrix(test_datasets[t]['y'], preds, normalize='all')
    fig = px.imshow(pd.DataFrame(cf_matrix), 
          template='ggplot2', title='Confusion Matrix', aspect='auto', text_auto=True, zmin=0,zmax=1)
    fig.show()

In [None]:
for t in target:
    performance_metrics(best_grid_models[t])

# Submission
---

In [None]:
# Importing test data
path = r'/kaggle/input/playground-series-s4e3/test.csv'
df_test = pd.read_csv(path)
df_test.set_index('id', inplace=True)
id = df_test.index
df_test.head()

In [None]:
# Applying feature engineering
df_test = feature_engineering(df_test)

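In [None]:
# Scaling data with the scaler already fitted on the training features.
# transform (not fit_transform) keeps train and test on the same scale;
# feature_names_in_ is available because the scaler was fitted on a DataFrame.
scaled_cols = list(scaler.feature_names_in_)
df_test[scaled_cols] = scaler.transform(df_test[scaled_cols])

In [None]:
# Keeping selected features
df_test = df_test[features]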

In [None]:
# Creating submission dataframe with the index same as test data
submission = pd.DataFrame(index=id)

# Iterating through dataframes of each label
for t in target:
    
    print('\n<<', t, '>>\n')

    X = df.drop(target, axis=1)[features]
    y = df[t]
    
    # Training model on full data
    model = best_grid_models[t]
    model.fit(X, y)
    
    # Predicting the probability of the positive class and storing it under the current label
    results = model.predict_proba(df_test)[:, 1]
    submission[t] = results

In [None]:
submission.head()

In [None]:
submission.to_csv('submission.csv')