![Steel Plates](https://www.dsstainlesssteel.com/wp-content/uploads/2018/05/Stainless-Steel-Plate-Sheet-1.jpg)

# Steel Plate Defect Prediction
---
## Background
The Steel Plates Faults dataset project stems from the need to enhance quality control measures in steel manufacturing processes. With the continuous demand for high-quality steel products across various industries, it becomes imperative to identify and rectify faults in steel plates efficiently. By leveraging machine learning and data science techniques, this project aims to develop predictive models capable of accurately classifying and detecting various types of faults present in steel plates. By doing so, manufacturers can streamline their quality assurance processes, reduce downtime, minimize production costs, and ultimately deliver superior-quality steel products to their customers. This project not only addresses immediate operational challenges but also lays the foundation for implementing proactive fault detection strategies to improve overall productivity and competitiveness in the steel industry.

## Objective
The primary objective of this project is to develop robust predictive models capable of accurately predicting the presence and type of faults in steel plates. By utilizing machine learning algorithms, the project aims to predict the occurrence of various types of faults, including Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, and Other_Faults, based on the features provided in the dataset. These predictive models will enable steel manufacturers to automate and enhance their quality control processes, allowing for timely identification and rectification of faults in steel plates during the manufacturing process.

## Data
The Steel Plates Faults dataset comprises various features related to steel plates and different types of faults associated with them. The dataset includes the following features:

* **id**: Unique identifier for each observation.
* **X_Minimum**: Minimum x-coordinate of the defect.
* **X_Maximum**: Maximum x-coordinate of the defect.
* **Y_Minimum**: Minimum y-coordinate of the defect.
* **Y_Maximum**: Maximum y-coordinate of the defect.
* **Pixels_Areas**: Area of the defect in pixels.
* **X_Perimeter**: Perimeter of the defect in the x-direction.
* **Y_Perimeter**: Perimeter of the defect in the y-direction.
* **Sum_of_Luminosity**: Sum of luminosity values within the defect area.
* **Minimum_of_Luminosity**: Minimum luminosity value within the defect area.
* **Maximum_of_Luminosity**: Maximum luminosity value within the defect area.
* **Length_of_Conveyer**: Length of the conveyer belt during manufacturing.
* **TypeOfSteel_A300**: Indicator variable for type of steel (A300).
* **TypeOfSteel_A400**: Indicator variable for type of steel (A400).
* **Steel_Plate_Thickness**: Thickness of the steel plate.
* **Edges_Index**: Ratio of perimeter to the length of the defect.
* **Empty_Index**: Ratio of empty pixels to the total number of pixels in the defect.
* **Square_Index**: Ratio of area to the square of the perimeter of the defect.
* **Outside_X_Index**: Ratio of pixels outside the defect in the x-direction to the total number of pixels in the defect.
* **Edges_X_Index**: Ratio of horizontal edges to the total number of edges in the defect.
* **Edges_Y_Index**: Ratio of vertical edges to the total number of edges in the defect.
* **Outside_Global_Index**: Ratio of pixels outside the defect to the total number of pixels in the image.
* **LogOfAreas**: Logarithm of the defect area.
* **Log_X_Index**: Logarithm of the maximum length of the defect in the x-direction.
* **Log_Y_Index**: Logarithm of the maximum length of the defect in the y-direction.
* **Orientation_Index**: Index representing the orientation of the defect.
* **Luminosity_Index**: Index representing the luminosity of the defect.
* **SigmoidOfAreas**: Sigmoid function applied to the defect area.

* **Pastry**: Binary indicator variable for the presence of the 'Pastry' fault.
* **Z_Scratch**: Binary indicator variable for the presence of the 'Z_Scratch' fault.
* **K_Scatch**: Binary indicator variable for the presence of the 'K_Scatch' fault.
* **Stains**: Binary indicator variable for the presence of the 'Stains' fault.
* **Dirtiness**: Binary indicator variable for the presence of the 'Dirtiness' fault.
* **Bumps**: Binary indicator variable for the presence of the 'Bumps' fault.
* **Other_Faults**: Binary indicator variable for the presence of other types of faults not specified above.

These features provide comprehensive information about the characteristics of steel plates and the types of faults they may exhibit, facilitating the development of predictive models for fault detection and classification.
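
As a quick sanity check on the seven indicator columns, the sketch below counts how many fault flags are set per row and derives a single label where exactly one is set. The toy rows are invented for illustration, and whether every row of the real data carries exactly one flag is an assumption worth verifying, not something stated above.

```python
import pandas as pd

target = ['Pastry', 'Z_Scratch', 'K_Scatch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults']

# Toy rows (invented): one fault, one fault, no fault, two faults
toy = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 0, 0, 1]],
    columns=target,
)

# Number of fault flags set in each row
flags_per_row = toy[target].sum(axis=1)

# Single label where exactly one flag is set, otherwise a placeholder
fault_type = toy[target].idxmax(axis=1).where(flags_per_row == 1, 'none_or_multiple')

print(flags_per_row.value_counts().sort_index())
print(fault_type.tolist())
```

Rows with zero or several flags are worth inspecting before treating the task as single-label classification.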

# Data Cleaning
---

In [None]:
# Import dataset
import pandas as pd

path = r'/kaggle/input/playground-series-s4e3/train.csv'
df = pd.read_csv(path)
df.set_index('id', inplace=True)

df_org   = pd.read_csv('/kaggle/input/faulty-steel-plates/faults.csv')
df = pd.concat([df, df_org], ignore_index=True)

df.head()

In [None]:
df.info() # Summary of DataFrame information

print('\nNumber of unique values in each column')
for i in df.columns:
    print(f'{i} - {df[i].nunique()}')

print('\nNumber of missing values in each column\n', df.isnull().sum())

print('\nNumber of duplicated rows\n', df.duplicated().sum())

In [None]:
df.describe()

Summary: Since I am using a cleaned dataset, there are no missing or duplicate values.
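
Because two sources are concatenated, it may be worth re-running the same checks on the combined frame. A minimal sketch, assuming a hypothetical helper `quality_report` and two invented toy frames standing in for the real files:

```python
import pandas as pd

def quality_report(frame):
    """Return counts of missing cells and fully duplicated rows."""
    return {
        'missing_cells': int(frame.isnull().sum().sum()),
        'duplicated_rows': int(frame.duplicated().sum()),
    }

# Toy frames standing in for the two sources (values invented)
a = pd.DataFrame({'X_Minimum': [1, 2], 'Pastry': [0, 1]})
b = pd.DataFrame({'X_Minimum': [2, 3], 'Pastry': [1, 0]})
combined = pd.concat([a, b], ignore_index=True)

report = quality_report(combined)
print(report)

# Dropping exact duplicates, should any appear after concatenation
deduplicated = combined.drop_duplicates().reset_index(drop=True)
print(len(deduplicated))
```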

# Exploratory Data Analysis (EDA)
---

In [None]:
# Defining training and target features
target = ['Stains', 'Dirtiness', 'Z_Scratch', 'K_Scatch', 'Pastry', 'Bumps', 'Other_Faults']
features = df.drop(target, axis=1).columns

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
plt.style.use('seaborn-v0_8-bright')
colors = sns.color_palette('bright')
warnings.filterwarnings("ignore")

fig, axes = plt.subplots(1, 7, figsize=(20, 3))  # 1 row, 7 columns

# Plot pie charts for each target label
for i, colname in enumerate(target):
    counts = df[colname].value_counts()
    # Use the index of value_counts() so each label matches its slice
    axes[i].pie(counts, labels=counts.index, autopct='%1.1f%%')
    axes[i].set_title(colname)

# Display the plot
plt.tight_layout()
plt.show()

**Summary**:
* **Low Incidence Faults**: Stains and Dirtiness are relatively rare.
* **Moderate Faults**: Z_Scratch and Pastry faults occur more frequently.
* **Significant Faults**: K_Scatch and Bumps are more common.
* **Most Common Fault**: Other Faults category is the most prevalent.
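
The same ranking can be read off numerically rather than from the pie charts. A small sketch on an invented toy frame (the counts below are made up and only mimic the ordering described in the summary):

```python
import pandas as pd

target = ['Stains', 'Dirtiness', 'Z_Scratch', 'K_Scatch', 'Pastry', 'Bumps', 'Other_Faults']

# Invented number of positive rows per label, out of 10 toy rows
positives = {'Stains': 1, 'Dirtiness': 1, 'Z_Scratch': 2, 'Pastry': 2,
             'K_Scatch': 3, 'Bumps': 3, 'Other_Faults': 4}
toy = pd.DataFrame({name: [1] * k + [0] * (10 - k) for name, k in positives.items()})

# The mean of a 0/1 column is the share of positive rows
prevalence = toy[target].mean().mul(100).round(1).sort_values()
print(prevalence)
```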

In [None]:
# Defining categorical and continuous features
categorical = ['Outside_Global_Index', 'TypeOfSteel_A400' , 'TypeOfSteel_A300']
continous = df.drop(categorical+target, axis=1)

In [None]:
# Visualizing Categorical Data Distribution and Target Association
for tar in target:
    print('\n',tar, '\n')
    
    for cat in categorical:
        fig, axs = plt.subplots(1, 2, figsize=(15, 4), gridspec_kw={'width_ratios': [1.2, 1.8]})

        # Pie Chart        
        ax1 = axs[0]
        # rename_axis keeps the category column name stable across pandas versions
        subject = df[cat].value_counts().rename_axis(cat).reset_index(name='Count')
        ax1.pie(subject['Count'], labels=subject[cat], autopct='%1.1f%%', radius=1.2, startangle=30, wedgeprops=dict(width=0.3, edgecolor='w'))   
        ax1.set_title(f'Share of {cat}')

        # Bar Chart
        ax2 = axs[1]
        subject = df.groupby([cat, tar]).size().reset_index(name='Count')
        pivot_df = subject.pivot(index=cat, columns=tar, values='Count')
        pivot_df.plot(kind='bar', stacked=True, ax=ax2)
        ax2.legend([f'{tar}= 0', f'{tar}= 1'], loc="upper right")
        ax2.set_title(f'{tar} by {cat}')
        ax2.set_xlabel(cat)
        ax2.set_ylabel('Count')
        ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)
        ax2.grid(True)

        plt.tight_layout()
        plt.show()

In [None]:
# Histogram Analysis of Continuous Variables by Target Class
for tar in target:
    print('\n', tar, '\n')
    for con in continous:
        fig, ax = plt.subplots(figsize=(15, 4))
        sns.histplot(data=df, x=con, hue=tar, bins=50, kde=True)
        ax.set_title(f'{con} Count by {tar}')
        ax.grid(True)
        plt.show()

# Pre-Processing
---

## Data Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[features] = scaler.fit_transform(df[features])
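
The cell above fits the scaler on the whole frame. A common alternative, sketched here on invented toy data, is to fit on the training split only and reuse the fitted scaler on held-out rows with `transform`, so that no statistics from the held-out rows leak into training. The column values and split sizes below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Pixels_Areas': rng.integers(1, 1000, size=50),
    'Steel_Plate_Thickness': rng.integers(40, 300, size=50),
})

train, test = train_test_split(toy, test_size=0.2, random_state=0)

scaler = MinMaxScaler()
# fit_transform learns min/max from the training rows only
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns, index=train.index)
# transform reuses those training statistics on the held-out rows
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(train_scaled.describe().loc[['min', 'max']])
```

Held-out values can fall slightly outside [0, 1] with this approach, which is expected.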

## Feature Engineering

In [None]:
import numpy as np

def feature_engineering(df):
    epsilon = 1e-6
    
    # Calculate area
    df['X_Distance'] = df['X_Maximum'] - df['X_Minimum']
    df['Y_Distance'] = df['Y_Maximum'] - df['Y_Minimum']
    df['Area'] = (df['X_Distance']) * (df['Y_Distance'])
    
    # Density Feature (epsilon guards against a zero denominator)
    df['Density'] = df['Pixels_Areas'] / (df['X_Perimeter'] + df['Y_Perimeter'] + epsilon)

    # Calculate perimeter
    df['Perimeter'] = 2 * ((df['X_Maximum'] - df['X_Minimum']) + (df['Y_Maximum'] - df['Y_Minimum']))
    
    # Relative Perimeter Feature
    df['Relative_Perimeter'] = df['X_Perimeter'] / (df['X_Perimeter'] + df['Y_Perimeter'] + epsilon)
    
    # Circularity Feature (epsilon guards against a zero denominator)
    df['Circularity'] = df['Pixels_Areas'] / (df['X_Perimeter'] ** 2 + epsilon)
    
    # Combined Geometric Index Feature
    df['Combined_Geometric_Index'] = df['Edges_Index'] * df['Square_Index']
    
    # Symmetry Index Feature
    df['Symmetry_Index'] = np.abs(df['X_Distance'] - df['Y_Distance']) / (df['X_Distance'] + df['Y_Distance'] + epsilon)
    
    # Compute mean, median, and standard deviation
    df['Mean_Luminosity'] = df[['Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity']].mean(axis=1)
    df['Median_Luminosity'] = df[['Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity']].median(axis=1)
    df['Std_Luminosity'] = df[['Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity']].std(axis=1)
    
    # Calculate aspect ratio
    df['Aspect_Ratio'] = (df['Y_Maximum'] - df['Y_Minimum']) / (df['X_Maximum'] - df['X_Minimum'])
    
    # Apply logarithmic transformation (epsilon avoids log(0))
    df['Log_Pixels_Areas'] = np.log(df['Pixels_Areas'] + epsilon)
    
    # Interaction Term Feature
    df['X_Distance*Pixels_Areas'] = df['X_Distance'] * df['Pixels_Areas']
    
    # Create composite feature
    df['Luminosity_Index_Product'] = df['Luminosity_Index'] * df['Sum_of_Luminosity']
    
    # Color Contrast Feature
    df['Color_Contrast'] = df['Maximum_of_Luminosity'] - df['Minimum_of_Luminosity']
    
    # Average Luminosity Feature
    df['Average_Luminosity'] = (df['Sum_of_Luminosity'] + df['Minimum_of_Luminosity']) / 2
    
    # Generate interaction feature
    df['Area_Pixels_Interact'] = df['Area'] * df['Pixels_Areas']
    
    # Additional Features
    df['sin_orientation'] = np.sin(df['Orientation_Index'])
    df['Edges_Index2'] = np.exp(df['Edges_Index'] + epsilon)
    df['X_Maximum2'] = np.sin(df['X_Maximum'])
    df['Y_Minimum2'] = np.sin(df['Y_Minimum'])
    df['Aspect_Ratio_Pixels'] = np.where(df['Y_Perimeter'] == 0, 0, df['X_Perimeter'] / df['Y_Perimeter'])
    df['Aspect_Ratio'] = np.where(df['Y_Distance'] == 0, 0, df['X_Distance'] / df['Y_Distance'])

    # Normalized Steel Thickness Feature
    df['Normalized_Steel_Thickness'] = (df['Steel_Plate_Thickness'] - df['Steel_Plate_Thickness'].min()) / (df['Steel_Plate_Thickness'].max() - df['Steel_Plate_Thickness'].min())

    # Logarithmic Features
    df['Log_Perimeter'] = np.log(df['X_Perimeter'] + df['Y_Perimeter'] + epsilon)
    df['Log_Luminosity'] = np.log(df['Sum_of_Luminosity'] + epsilon)
    df['Log_Aspect_Ratio'] = np.log(df['Aspect_Ratio'] ** 2 + epsilon)

    # Statistical Features
    df['Combined_Index'] = df['Orientation_Index'] * df['Luminosity_Index']
    df['Sigmoid_Areas'] = 1 / (1 + np.exp(-df['LogOfAreas'] + epsilon))
    
    return df

In [None]:
# Applying feature engineering to the dataframe
df = feature_engineering(df)
df.shape
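
Min-max scaling maps the smallest value of every column to 0, so ratios computed afterwards can hit a zero denominator. One way to keep such features finite is sketched below; `safe_ratio` is a hypothetical helper, not part of the notebook, and the toy values are invented.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator, fill=0.0):
    """Element-wise numerator / denominator, with `fill` where the denominator is zero."""
    numerator = np.asarray(numerator, dtype=float)
    denominator = np.asarray(denominator, dtype=float)
    out = np.full_like(numerator, fill)
    # Division is only carried out where the denominator is non-zero
    np.divide(numerator, denominator, out=out, where=denominator != 0)
    return out

toy = pd.DataFrame({'Pixels_Areas': [0.5, 0.2, 0.0], 'X_Perimeter': [0.5, 0.0, 0.0]})
toy['Circularity'] = safe_ratio(toy['Pixels_Areas'], toy['X_Perimeter'] ** 2)
print(toy)
```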

In [None]:
# Checking Correlation Matrix with engineered features
plt.figure(figsize=(15,4))

corr = df.corr()[target].drop(target, axis=0).T  # compute the correlation matrix once
plt.imshow(corr, vmin=-1, vmax=1)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45, ha='right')
plt.yticks(range(len(target)), target)
plt.colorbar()
plt.grid(True)
plt.title('Correlation Matrix')

plt.tight_layout()
plt.show()

## Feature Selection

In [None]:
from sklearn.feature_selection import RFECV
from xgboost import XGBClassifier

# Create a XGBoost Classifier
model = XGBClassifier()

X = df[features]
y = df[target]

# Create an RFECV selector
rfecv = RFECV(model, step=1, cv=5, scoring='roc_auc')  # Use 5-fold cross-validation

# Fit the RFECV selector
rfecv.fit(X, y)

# Get the selected feature names
features = [X.columns[i] for i in rfecv.get_support(indices=True)]
print(f"Selected features ({len(features)}) :", features)
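
To see what `RFECV` returns without the full dataset, here is a small sketch on synthetic data. A decision tree is swapped in for `XGBClassifier` to keep it light, and a single binary target is used instead of the seven-column `y` above; both are simplifications for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 8 features, of which 3 are informative
X_arr, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                               n_redundant=0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(8)])

# Recursive feature elimination, scored by cross-validated ROC-AUC
selector = RFECV(DecisionTreeClassifier(random_state=0), step=1, cv=3, scoring='roc_auc')
selector.fit(X, y)

# support_ is a boolean mask over the input columns
selected = list(X.columns[selector.support_])
print(selector.n_features_, selected)
```

`selector.ranking_` gives rank 1 to every kept feature and larger ranks to features eliminated earlier.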

In [None]:
# Keeping only selected and target features
eng_features = features + target

df = df[eng_features]
df.shape

## Label balancing

In [None]:
fig, axes = plt.subplots(1, 7, figsize=(15, 4))  # 1 row, 7 columns

# Plot pie charts for each column in 'target'
for i, colname in enumerate(target):
    counts = df[colname].value_counts()
    # Use the index of value_counts() so each label matches its slice
    axes[i].pie(counts, labels=counts.index, autopct='%1.1f%%')
    axes[i].set_title(colname)

# Display the plot
plt.tight_layout()
plt.show()

In [None]:
from sklearn.model_selection import train_test_split

# Splitting shuffled data into training and test splits (80/20)
df_train, df_test = train_test_split(df, test_size=0.2, shuffle=True)

In [None]:
# Applying SMOTE over-sampling technique
from imblearn.over_sampling import SMOTE

def label_balancing(df):
    
    # Feature matrix shared by every label
    X = df.drop(target, axis=1)
    
    # Initialising SMOTE model
    sm = SMOTE()
    
    # Resampling each label separately, in the order of the 'target' list
    balanced = []
    for label in target:
        X_res, y_res = sm.fit_resample(X, df[label])
        resampled = pd.DataFrame(X_res, columns=X.columns)
        resampled[label] = np.asarray(y_res)
        balanced.append(resampled)

    # Order: Stains, Dirtiness, Z_Scratch, K_Scatch, Pastry, Bumps, Other_Faults
    return tuple(balanced)

In [None]:
# Applying label balancing function to each target feature
Stains, Dirtiness, Z_Scratch, K_Scatch, Pastry, Bumps, Other_Faults = label_balancing(df_train)

# Creating a list of dataframes
target_dfs = [Stains, Dirtiness, Z_Scratch, K_Scatch, Pastry, Bumps, Other_Faults]

In [None]:
# Visualising pie chart of each label after balancing
fig, axes = plt.subplots(1, 7, figsize=(15, 4))  # 1 row, 7 columns

for i, (dfname, name) in enumerate(zip(target_dfs, target)):
    counts = dfname[name].value_counts()
    # Use the index of value_counts() so each label matches its slice
    axes[i].pie(counts, labels=counts.index, autopct='%1.1f%%', startangle=45)
    axes[i].set_title(f'{name} \n {dfname.shape}')

# Display the plot
plt.tight_layout()
plt.show()
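
For intuition about what the over-sampling step does, the sketch below hand-rolls the core SMOTE idea: each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not imbalanced-learn's implementation; `smote_like` and the toy points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # k + 1 neighbours, because a point is normally its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbours = nn.kneighbors(X_minority)
    # Pick a random minority sample for each synthetic point
    base = rng.integers(0, len(X_minority), size=n_new)
    # Pick one of its k neighbours (column 0 is skipped)
    picked = neighbours[base, rng.integers(1, k + 1, size=n_new)]
    # Random position along the segment between the two points
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(minority, n_new=6)
print(synthetic)
```

Because every new point is an interpolation, the synthetic samples stay inside the region spanned by the existing minority points.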

# Model Training
---

In [None]:
# Defining function that visualises each trained model performance for comparison
def compare_models(model_scores):
    fig, ax = plt.subplots(figsize=(12, 4))
    bars = ax.bar(model_scores.keys(), model_scores.values())
    ax.bar_label(bars, label_type="edge")
    plt.ylim(0, 100)
    plt.grid(True)
    plt.show()

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# Initialising models
classifiers = {
    "Random Forest Classifier": {'model':RandomForestClassifier()},
    
    "GradientBoostingClassifier": {'model':GradientBoostingClassifier()},
    
    "XGBoost Classifier": {'model':XGBClassifier(objective='binary:logistic')}
    }

best_models = {}

# Iterating through dataframes of each target label
for i, j in zip(target_dfs, target):
    
    print('\n', j, '\n')
    
    # Setting up data
    X_train = i[features]
    y_train = i[j]
    
    X_test = df_test[features]
    y_test = df_test[j]
    
    model_scores = {}
    
    # Iterating through models
    for key, classifier in classifiers.items():
        print('Training', key)
        
        # Fitting the model
        try:
            # Early stopping is only accepted by some estimators/versions;
            # anything else raises TypeError and falls back to a plain fit
            classifier['model'].fit(X_train, y_train, 
                    early_stopping_rounds=250,
                    eval_metric='auc',
                    eval_set=[(X_test, y_test)])
        except TypeError:
            classifier['model'].fit(X_train, y_train)
            
        # Evaluating model performance on the held-out split
        pred = classifier['model'].predict_proba(X_test)[:, 1]  # probability of the positive class
        test_score = roc_auc_score(y_test, pred)
        model_scores[key] = round(test_score * 100, 2)
        print(key, "has a test ROC-AUC of", model_scores[key], "% \n")
    
    # Saving best performing model for current label
    best_models[j] = [item[0] for item in sorted(model_scores.items(), key=lambda item: item[1], reverse=True)[:1]]
    
    # Comparing model performance for current label
    compare_models(model_scores)
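
ROC-AUC ranks predictions, so it should be given the positive-class probability rather than thresholded labels. A tiny sketch with invented numbers shows how the two can differ:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
# Columns: P(class 0), P(class 1), as returned by predict_proba
proba = np.array([[0.9, 0.1], [0.6, 0.4], [0.55, 0.45], [0.2, 0.8]])

scores = proba[:, 1]
auc_from_scores = roc_auc_score(y_true, scores)
# Thresholding at 0.5 discards the ranking between 0.4 and 0.45
auc_from_labels = roc_auc_score(y_true, (scores >= 0.5).astype(int))

print(auc_from_scores, auc_from_labels)
```

Here every positive is ranked above every negative, so the score-based AUC is perfect, while the thresholded version loses that information.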

# Model Optimization
---

In [None]:
# Defining choose model function that sets up model for training
def choose_model(params):
    if best_models[j][0] == "Random Forest Classifier":
        model = RandomForestClassifier(**params)
    elif best_models[j][0] == "GradientBoostingClassifier":
        model = GradientBoostingClassifier(**params)
    else:
        model = XGBClassifier(objective='binary:logistic', **params)
        
    return model

In [None]:
# Defining objective function for Optuna optimization
def objective(trial):
    
    # Specifying hyperparameters only for the best model of the current label,
    # so that names shared between models (e.g. 'max_depth') are sampled
    # from the range intended for that model
    name = best_models[j][0]

    if name == "Random Forest Classifier":
        params = {
            'min_samples_split': trial.suggest_int('min_samples_split', 2, 10),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
            'max_depth': trial.suggest_int('max_depth', 10, 100),
            'max_features': trial.suggest_categorical('max_features', [1, 'sqrt', 'log2', None]),
            'n_estimators': trial.suggest_int('n_estimators', 100, 2000)
        }
    elif name == "GradientBoostingClassifier":
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 1),
            'max_depth': trial.suggest_int('max_depth', 1, 20),
            'subsample': trial.suggest_float('subsample', 0.1, 1),
        }
    else:  # "XGBoost Classifier"
        params = {
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.5),
            'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
            'gamma': trial.suggest_float('gamma', 0, 1),
            'subsample': trial.suggest_float('subsample', 0.1, 1),
            'max_depth': trial.suggest_int('max_depth', 1, 20),
            'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
            "booster": "gbtree",
            "reg_alpha": trial.suggest_float('reg_alpha', 0.1, 1),
            "reg_lambda": trial.suggest_float('reg_lambda', 0, 1),
            "colsample_bytree": trial.suggest_float('colsample_bytree', 0.1, 1)
        }
    
    # Setting up the selected model with the suggested hyperparameters
    model = choose_model(params)

    model.fit(X_train, y_train) # Fitting training data to the model
    
    pred = model.predict_proba(X_test)
    pred = [proba[1] for proba in pred]
    pred = np.array(pred)
    score = roc_auc_score(y_test, pred) # Evaluating model perfomance using ROC-AUC score
    
    return score

In [None]:
import optuna

model_grid_scores = {}
best_grid_models = {}

# Iterating through dataframes of each label
for i, j in zip(target_dfs, target):
    
    print('\n', j, '\n')

    # Setting up data
    X_train = i[features]
    y_train = i[j]
    
    X_test = df_test[features]
    y_test = df_test[j]
    
    # Initializing Optuna study
    study = optuna.create_study(direction='maximize')
    
    # Performing Hyperparameter optimization using optuna objective function
    print('Training', best_models[j][0], '\n')
    study.optimize(objective, n_trials=10) # Number of trials set to 10

    print('Best trial parameters:', study.best_trial.params)
    print('\n Best ROC-AUC score:', study.best_trial.value)
    
    # Selecting best hyperparameter combination
    best_trial = study.best_trial # Getting best trial
    best_params = best_trial.params # Getting best trial parameters
    parameters = set(classifiers[best_models[j][0]]['model'].get_params().keys()) & set(best_params.keys()) # Getting model parameter keys
    best_params = {key: best_params[key] for key in parameters} # Choosing parameters appropriate for selected model
    best = choose_model(best_params) # Choosing the model and assigning its parameters
    best.fit(X_train, y_train) # Fitting the model
    
    best_grid_models[j] = best # Saving the model in a dictionary
    
    # Evaluating model performance
    pred = best.predict_proba(X_test)[:, 1]  # probability of the positive class
    score = roc_auc_score(y_test, pred)
    model_grid_scores[j] = round(score * 100, 2)
    
    optuna.visualization.plot_optimization_history(study).show() # Visualising Optimization history
    optuna.visualization.plot_param_importances(study).show() # Visualising Parameter importances

# Evaluating model performances for each label
compare_models(model_grid_scores)

# Model Evaluation
---

In [None]:
# Performing Cross validation using ROC-AUC score for each label
from sklearn.model_selection import cross_val_predict

# Using test data for validation
X = df_test[features]

# Iterating through dataframe of each label
for i, j in zip(target_dfs, target):
    
    print('\n', j, '\n')
    
    y = df_test[j]
    
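    # ROC-AUC needs scores, so request class probabilities instead of hard labels
    pred = cross_val_predict(best_grid_models[j], X, y, cv=5, method='predict_proba', verbose=2)[:, 1]
    print('Cross validation ROC-AUC score:', roc_auc_score(y, pred))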
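
As a self-contained illustration of out-of-fold ROC-AUC, the sketch below runs `cross_val_predict` with `method='predict_proba'` on synthetic data. Logistic regression stands in for the tuned models purely to keep the example light; the data and model choice are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each row is predicted by a model that never saw it during fitting
oof_proba = cross_val_predict(model, X, y, cv=5, method='predict_proba')[:, 1]
oof_auc = roc_auc_score(y, oof_proba)
print(round(oof_auc, 3))
```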

In [None]:
# Performing Classification report for each label
from sklearn.metrics import classification_report

# Using test data for validation
X = df_test[features]

# Iterating through dataframe of each label
for i, j in zip(target_dfs, target):
    
    print('\n', j, '\n')
    
    y = df_test[j]
    
    pred = best_grid_models[j].predict(X)
    print(classification_report(y, pred), '\n')

# Submission
---

In [None]:
# Importing test data
path = r'/kaggle/input/playground-series-s4e3/test.csv'
df_test = pd.read_csv(path)
df_test.set_index('id', inplace=True)
id = df_test.index
df_test.head()

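In [None]:
# Scaling the raw test columns with the scaler already fitted on the training data
# (same order as in training: scaling first, then feature engineering)
scaled_cols = list(scaler.feature_names_in_)
df_test[scaled_cols] = scaler.transform(df_test[scaled_cols])

In [None]:
# Applying feature engineering
df_test = feature_engineering(df_test)

In [None]:
# Keeping selected features
df_test = df_test[features]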

In [None]:
# Creating submission dataframe with the index same as test data
submission = pd.DataFrame(index=id)

# Creating new dataframes for each label with label balancing using all data available
Stains, Dirtiness, Z_Scratch, K_Scatch, Pastry, Bumps, Other_Faults = label_balancing(df)

# Putting all dataframes into a list
target_dfs = [Stains, Dirtiness, Z_Scratch, K_Scatch, Pastry, Bumps, Other_Faults]

# Iterating through dataframes of each label
for i, j in zip(target_dfs, target):
    
    print('\n', j, '\n')
    
    X = i[features]
    y = i[j]
    
    # Training model on full training data
    model = best_grid_models[j]
    model.fit(X, y)
    
    # Predicting probability
    results = model.predict_proba(df_test)
    results = [proba[1] for proba in results]
    results = np.array(results)
    submission[j] = results.T

In [None]:
submission.head()

In [None]:
submission.to_csv('submission.csv')