**CW2:Machine Learning Case Study**

This notebook harnesses the machine learning techniques for analyzing Steel Plates Faults dataset, aiming to conduct a study focused on regression utilizing scikit-learn.

**Import** **Libraries**

Import necessary libraries for building a Random Forest Regressor model, preprocessing data, and performing evaluation using various regression metrics.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor # for building random forest models
from sklearn.preprocessing import StandardScaler # for data preprocessing
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error, r2_score #Import Accuracy Metrics
from sklearn.model_selection import GridSearchCV

**Exploratory Data Analysis**

Import dataset by reading CSV files, print data head to display initial rows, display column names, inspect data to check data properties and null values,
display summary statistics to show data distribution and visualize data by plotting pairwise relationships.

In [None]:
# Import the dataset
train_data = pd.read_csv("/content/drive/MyDrive/ML & NN/playground-series/train.csv")
test_data = pd.read_csv("/content/drive/MyDrive/ML & NN/playground-series/test.csv")

In [None]:
# Print data head
train_data.head()

Unnamed: 0,id,X_Minimum,X_Maximum,Y_Minimum,Y_Maximum,Pixels_Areas,X_Perimeter,Y_Perimeter,Sum_of_Luminosity,Minimum_of_Luminosity,...,Orientation_Index,Luminosity_Index,SigmoidOfAreas,Pastry,Z_Scratch,K_Scatch,Stains,Dirtiness,Bumps,Other_Faults
0,0,584,590,909972,909977,16,8,5,2274,113,...,-0.5,-0.0104,0.1417,0,0,0,1,0,0,0
1,1,808,816,728350,728372,433,20,54,44478,70,...,0.7419,-0.2997,0.9491,0,0,0,0,0,0,1
2,2,39,192,2212076,2212144,11388,705,420,1311391,29,...,-0.0105,-0.0944,1.0,0,0,1,0,0,0,0
3,3,781,789,3353146,3353173,210,16,29,3202,114,...,0.6667,-0.0402,0.4025,0,0,1,0,0,0,0
4,4,1540,1560,618457,618502,521,72,67,48231,82,...,0.9158,-0.2455,0.9998,0,0,0,0,0,0,1


In [16]:
# Print test data
test_data.head()

Unnamed: 0,id,X_Minimum,X_Maximum,Y_Minimum,Y_Maximum,Pixels_Areas,X_Perimeter,Y_Perimeter,Sum_of_Luminosity,Minimum_of_Luminosity,...,Outside_X_Index,Edges_X_Index,Edges_Y_Index,Outside_Global_Index,LogOfAreas,Log_X_Index,Log_Y_Index,Orientation_Index,Luminosity_Index,SigmoidOfAreas
0,19219,1015,1033,3826564,3826588,659,23,46,62357,67,...,0.0095,0.5652,1.0,1.0,2.841,1.1139,1.6628,0.6727,-0.2261,0.9172
1,19220,1257,1271,419960,419973,370,26,28,39293,92,...,0.0047,0.2414,1.0,1.0,2.5682,0.9031,1.4472,0.9063,-0.1453,0.9104
2,19221,1358,1372,117715,117724,289,36,32,29386,101,...,0.0155,0.6,0.75,0.0,2.4609,1.3222,1.3222,-0.5238,-0.0435,0.6514
3,19222,158,168,232415,232440,80,10,11,8586,107,...,0.0037,0.8,1.0,1.0,1.9031,0.699,1.0414,0.1818,-0.0738,0.2051
4,19223,559,592,544375,544389,140,19,15,15524,103,...,0.0158,0.8421,0.5333,0.0,2.1461,1.3222,1.1461,-0.5714,-0.0894,0.417


In [17]:
# Train Data Columns
train_data.columns

Index(['id', 'X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum',
       'Pixels_Areas', 'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity',
       'Minimum_of_Luminosity', 'Maximum_of_Luminosity', 'Length_of_Conveyer',
       'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
       'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index',
       'Edges_X_Index', 'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas',
       'Log_X_Index', 'Log_Y_Index', 'Orientation_Index', 'Luminosity_Index',
       'SigmoidOfAreas', 'Pastry', 'Z_Scratch', 'K_Scatch', 'Stains',
       'Dirtiness', 'Bumps', 'Other_Faults'],
      dtype='object')

In [19]:
# Test Data Columns
test_data.columns

Index(['id', 'X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum',
       'Pixels_Areas', 'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity',
       'Minimum_of_Luminosity', 'Maximum_of_Luminosity', 'Length_of_Conveyer',
       'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
       'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index',
       'Edges_X_Index', 'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas',
       'Log_X_Index', 'Log_Y_Index', 'Orientation_Index', 'Luminosity_Index',
       'SigmoidOfAreas'],
      dtype='object')

In [20]:
# Inspect data
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19219 entries, 0 to 19218
Data columns (total 35 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     19219 non-null  int64  
 1   X_Minimum              19219 non-null  int64  
 2   X_Maximum              19219 non-null  int64  
 3   Y_Minimum              19219 non-null  int64  
 4   Y_Maximum              19219 non-null  int64  
 5   Pixels_Areas           19219 non-null  int64  
 6   X_Perimeter            19219 non-null  int64  
 7   Y_Perimeter            19219 non-null  int64  
 8   Sum_of_Luminosity      19219 non-null  int64  
 9   Minimum_of_Luminosity  19219 non-null  int64  
 10  Maximum_of_Luminosity  19219 non-null  int64  
 11  Length_of_Conveyer     19219 non-null  int64  
 12  TypeOfSteel_A300       19219 non-null  int64  
 13  TypeOfSteel_A400       19219 non-null  int64  
 14  Steel_Plate_Thickness  19219 non-null  int64  
 15  Ed

In [None]:
# Summary statistics
train_data.describe()

Unnamed: 0,id,X_Minimum,X_Maximum,Y_Minimum,Y_Maximum,Pixels_Areas,X_Perimeter,Y_Perimeter,Sum_of_Luminosity,Minimum_of_Luminosity,...,Orientation_Index,Luminosity_Index,SigmoidOfAreas,Pastry,Z_Scratch,K_Scatch,Stains,Dirtiness,Bumps,Other_Faults
count,19219.0,19219.0,19219.0,19219.0,19219.0,19219.0,19219.0,19219.0,19219.0,19219.0,...,19219.0,19219.0,19219.0,19219.0,19219.0,19219.0,19219.0,19219.0,19219.0,19219.0
mean,9609.0,709.854675,753.857641,1849756.0,1846605.0,1683.987616,95.654665,64.124096,191846.7,84.808419,...,0.102742,-0.138382,0.571902,0.076279,0.059837,0.178573,0.029554,0.025235,0.247828,0.341225
std,5548.191747,531.544189,499.836603,1903554.0,1896295.0,3730.319865,177.821382,101.054178,442024.7,28.800344,...,0.487681,0.120344,0.332219,0.26545,0.23719,0.383005,0.169358,0.156844,0.431762,0.474133
min,0.0,0.0,4.0,6712.0,6724.0,6.0,2.0,1.0,250.0,0.0,...,-0.9884,-0.885,0.119,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4804.5,49.0,214.0,657468.0,657502.0,89.0,15.0,14.0,9848.0,70.0,...,-0.2727,-0.1925,0.2532,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,9609.0,777.0,796.0,1398169.0,1398179.0,168.0,25.0,23.0,18238.0,90.0,...,0.1111,-0.1426,0.4729,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,14413.5,1152.0,1165.0,2368032.0,2362511.0,653.0,64.0,61.0,67978.0,105.0,...,0.5294,-0.084,0.9994,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,19218.0,1705.0,1713.0,12987660.0,12987690.0,152655.0,7553.0,903.0,11591410.0,196.0,...,0.9917,0.6421,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
# Visualise the data
sns.pairplot(train_data)

Output hidden; open in https://colab.research.google.com to view.

**Data Pre-processing**

Separate features and target variables,split the data into training and testing sets, standardize features using StandardScaler.

In [None]:
# Separate target variable from features
y_train  = train_data[["Pastry", "Z_Scratch", "K_Scatch", "Stains", "Dirtiness",\
                       "Bumps", "Other_Faults"]]

X_train = train_data.drop(['id', 'X_Minimum', 'X_Maximum', 'Y_Maximum', 'Y_Minimum',\
                   'Maximum_of_Luminosity', 'Edges_Index', 'Empty_Index',\
                   'Edges_X_Index', 'Outside_Global_Index', 'Luminosity_Index',\
                   'Pastry', 'Z_Scratch', 'K_Scatch', 'Stains', 'Dirtiness',\
                   'Bumps', 'Other_Faults'], axis=1)

X_test = test_data.drop(['id', 'X_Minimum', 'X_Maximum', 'Y_Maximum', 'Y_Minimum',\
                   'Maximum_of_Luminosity', 'Edges_Index', 'Empty_Index',\
                   'Edges_X_Index', 'Outside_Global_Index', 'Luminosity_Index'], axis=1)

In [None]:
# Print training data shape
print(y_train.shape)
print(X_train.shape)

(19219, 7)
(19219, 17)


In [None]:
# Print test data shape
print(X_test.shape)

(12814, 17)


In [None]:
# Scale features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

**Model Definition**

Define Random Forest Regressor model, specify hyperparameters for tuning, perform grid search for hyperparameter tuning, use the best hyperparameters found.

In [None]:
# Define the RandomForestRegressor model
rf_regressor = RandomForestRegressor(random_state=42)

# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Define the scoring function
scoring = {'MSE': make_scorer(mean_squared_error)}

**Model Training**

Build and fit the Random Forest model using the best hyperparameters obtained from grid search

In [None]:
# Perform Grid Search
grid_search = GridSearchCV(estimator=rf_regressor, param_grid=param_grid,\
                           scoring=scoring, cv=5, refit='MSE', verbose=2)
grid_search.fit(X_train_scaled, y_train)

# Print the best hyperparameters
print("\nBest Hyperparameters found by Grid Search:", grid_search.best_params_)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
[CV] END .max_depth=10, min_samples_split=2, n_estimators=50; total time=   8.3s
[CV] END .max_depth=10, min_samples_split=2, n_estimators=50; total time=  10.1s
[CV] END .max_depth=10, min_samples_split=2, n_estimators=50; total time=   6.5s
[CV] END .max_depth=10, min_samples_split=2, n_estimators=50; total time=   4.7s
[CV] END .max_depth=10, min_samples_split=2, n_estimators=50; total time=   5.0s
[CV] END max_depth=10, min_samples_split=2, n_estimators=100; total time=  10.3s
[CV] END max_depth=10, min_samples_split=2, n_estimators=100; total time=  11.7s
[CV] END max_depth=10, min_samples_split=2, n_estimators=100; total time=  10.4s
[CV] END max_depth=10, min_samples_split=2, n_estimators=100; total time=   9.8s
[CV] END max_depth=10, min_samples_split=2, n_estimators=100; total time=  10.0s
[CV] END max_depth=10, min_samples_split=2, n_estimators=150; total time=  15.1s
[CV] END max_depth=10, min_samples_split=2, n_e

**Prediction**

Make predictions on the test set using the trained Random Forest model, generate predictions on the scaled test data using the trained Random Forest model, define the target columns for prediction and create a DataFrame containing the predictions, with columns labeled according to the target categories.

In [None]:
# Use the best estimator to make predictions on the test set
best_rf_model = grid_search.best_estimator_
predictions = best_rf_model.predict(X_test_scaled)

In [None]:
# Select targert columns for prediction
columns = ['Pastry', 'Z_Scratch', 'K_Scatch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults']

In [None]:
# Show prediction
pred = pd.DataFrame(predictions, columns=columns)
pred.insert(0, 'id', test_data['id'])
pred

Unnamed: 0,id,Pastry,Z_Scratch,K_Scatch,Stains,Dirtiness,Bumps,Other_Faults
0,19219,0.34,0.000000,0.000000,0.000000,0.02,0.240000,0.400000
1,19220,0.12,0.020000,0.000000,0.000000,0.18,0.120000,0.500000
2,19221,0.00,0.042618,0.021006,0.026667,0.00,0.286606,0.584764
3,19222,0.24,0.000000,0.000000,0.000000,0.02,0.300000,0.440000
4,19223,0.02,0.000000,0.000000,0.000000,0.00,0.420000,0.460000
...,...,...,...,...,...,...,...,...
12809,32028,0.12,0.080000,0.000000,0.000000,0.14,0.200000,0.380000
12810,32029,0.12,0.000000,0.020000,0.060000,0.08,0.280000,0.400000
12811,32030,0.00,0.000000,0.879730,0.000000,0.00,0.000000,0.120433
12812,32031,0.22,0.000000,0.020000,0.000000,0.14,0.260000,0.320000


**Evaluation**

Evaluate the model using regression metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2).

In [None]:
# Evaluate the model
X_train_split, X_test_split, y_train_split, y_test_split = train_test_split(X_train_scaled, y_train, test_size=0.3, random_state=42)
rf_fit = RandomForestRegressor(n_estimators=50, max_depth=30, min_samples_split=2, random_state=42)
rf_fit.fit(X_train_split, y_train_split)
y_pred = rf_fit.predict(X_test_split)

# Evaluate the model on the test set
mae_test = mean_absolute_error(y_test_split, y_pred)
mse_test = mean_squared_error(y_test_split, y_pred)
rmse_test = np.sqrt(mse_test)
r2_test = r2_score(y_test_split, y_pred)

# Print evaluation metrics for test set
print("\nTest Set Evaluation Metrics:")
print("Mean Absolute Error (MAE):", mae_test)
print("Mean Squared Error (MSE):", mse_test)
print("Root Mean Squared Error (RMSE):", rmse_test)
print("R-squared (R2):", r2_test)


Test Set Evaluation Metrics:
Mean Absolute Error (MAE): 0.14547542120126483
Mean Squared Error (MSE): 0.07620604160119124
Root Mean Squared Error (RMSE): 0.27605441782589035
R-squared (R2): 0.2911610147697704
