# 🏭 Detecting Manufacturing Anomalies

Anomaly detection matters in product settings because it enables early issue identification, better quality assurance, stronger security, performance optimization, fraud prevention, predictive maintenance, and an improved customer experience.

In this problem, we have a dataset comprising 1558 attributes from a manufacturing machine for a handful of product samples, along with a label saying whether each product is good or anomalous (e.g., has a defect). By training a supervised machine learning model on this data, we can infer whether a new product is good or anomalous given its 1558 attributes.

In [2]:
%load_ext autoreload
%autoreload 2

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


#### Data Ingestion

In [3]:
dataset = pd.read_csv('wafer.csv')
# print the shape of the dataset
print(f"Dataset has {dataset.shape[0]} rows and {dataset.shape[1]} columns")

dataset.head(10)

Dataset has 1763 rows and 1559 columns


Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,...,feature_1550,feature_1551,feature_1552,feature_1553,feature_1554,feature_1555,feature_1556,feature_1557,feature_1558,Class
0,100,160,1.6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,20,83,4.15,1,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
2,99,150,1.5151,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,40,40,1.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,12,234,19.5,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,90,90,1.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1,1,2.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,15,80,5.3333,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,100,190,1.9,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,1,1,2.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
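The first column in the preview is named `Unnamed: 0`, which is the name pandas gives to a header-less column, typically a row index that was saved into the CSV. The sketch below uses a tiny made-up CSV (not `wafer.csv`) to show how `index_col=0` keeps such a column out of the feature matrix. Whether that applies to this dataset is an assumption worth checking before training.

```python
import io

import pandas as pd

# Tiny stand-in for a CSV that was saved together with its row index (hypothetical data).
csv_text = ",feature_1,feature_2,Class\n0,100,160,0\n1,20,83,1\n"

# Default read: the header-less first column becomes a regular column named "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the row index, so it is never treated as a feature.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

If the column really is a row counter, leaving it in gives the model a feature that only encodes row order.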


#### Split the Dataset

In [4]:
# column split: features vs. label
x_data = dataset.drop(columns=['Class'])
y_data = dataset['Class']

# convert to NumPy arrays
x_data, y_data = x_data.to_numpy(), y_data.to_numpy()

# map labels from {0, 1} to {-1, 1}
y_data = np.where(y_data==0, -1, 1)

# stratified train-validation split
x_train, x_val, y_train, y_val = train_test_split(x_data, y_data, test_size=0.2,
                                                  random_state=0, stratify=y_data)
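To see what `stratify=y_data` does, here is a minimal sketch on made-up labels (90 good, 10 anomalous; these counts are an assumption, not taken from `wafer.csv`). It checks that the fraction of positives is the same in the training and validation parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 good (-1) and 10 anomalous (+1) samples.
y_demo = np.array([-1] * 90 + [1] * 10)
x_demo = np.arange(100).reshape(-1, 1)

# Same call pattern as in the cell above, applied to the toy data.
_, _, y_tr, y_va = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("positive fraction, train:", np.mean(y_tr == 1))
print("positive fraction, val:  ", np.mean(y_va == 1))
```

Without stratification, a small validation set could end up with very few anomalous samples, which would make the F1 score on it unreliable.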

## 😎 Machine Learning Engineering Time

Let's try our Adaboost library:

In [5]:
from Adaboost import Adaboost
from sklearn.metrics import f1_score

ada_model = Adaboost(T=1000)
ada_model.fit(x_train, y_train)
ada_y_pred = ada_model.predict(x_val)
ada_accuracy = ada_model.score(x_val, y_val)
print("AdaBoost Accuracy:", round(ada_accuracy, 3))

f1 = f1_score(y_val, ada_y_pred)
print("F1 Score:", round(f1, 3))

AdaBoost Accuracy: 0.921
F1 Score: 0.417


Let's try another model to judge Adaboost's performance more fairly:

In [10]:
from sklearn.naive_bayes import GaussianNB

gnb_model = GaussianNB()
gnb_model.fit(x_train, y_train)
gnb_y_pred = gnb_model.predict(x_val)
gnb_accuracy = gnb_model.score(x_val, y_val)

print("Gaussian Naive Bayes Accuracy:", round(gnb_accuracy, 3))

f1 = f1_score(y_val, gnb_y_pred)
print("F1 Score:", round(f1, 3))

Gaussian Naive Bayes Accuracy: 0.779
F1 Score: 0.339


- Why is there a discrepancy between the accuracy and the F1-score for both models? In your explanation, write code that illustrates any points that need to be made and decide which of them is the right choice for this problem.

In [12]:
'''
f1 = tp / (tp + 1/2 * (fp + fn))
Accuracy counts every correct prediction, so if defects are rare it is dominated by the majority class (good products).
F1 ignores true negatives, so a low F1 indicates that the model performs poorly on the positive class (having a defect).
That makes F1 the more informative metric when the goal is to detect defects.
'''
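The gap between the two metrics can be reproduced on made-up predictions. The sketch below assumes a validation set with 90 good and 10 anomalous products and a hypothetical classifier that finds 3 defects, misses 7, and raises 2 false alarms. None of these numbers come from the models above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical validation labels: 90 good (-1), 10 anomalous (+1).
y_true = np.array([-1] * 90 + [1] * 10)

# Hypothetical predictions: 7 defects missed, 2 good products flagged.
y_pred = y_true.copy()
y_pred[90:97] = -1   # false negatives
y_pred[:2] = 1       # false positives

# With labels=[-1, 1] the matrix is [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
manual_f1 = tp / (tp + 0.5 * (fp + fn))

# Baseline that labels every product as good.
always_good = np.full_like(y_true, -1)

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1 (sklearn):", f1_score(y_true, y_pred))
print("F1 (manual): ", manual_f1)
print("always-good accuracy:", accuracy_score(y_true, always_good))
```

In this toy setup the classifier is 91% accurate yet has an F1 of 0.4, and a model that never predicts a defect still reaches 90% accuracy. Checking the actual class counts in `y_val` would confirm whether the same effect explains the results above.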

- Why is Gaussian Naive Bayes faster than Adaboost and why is it less performant?

In [11]:
'''
speed:
    Adaboost is an iterative algorithm: it starts with a weak learner and then improves the model by adding more weak learners, one round at a time (T=1000 here).
    GNB is a simpler model that assumes the features are independent of each other given the class, so fitting it only requires a mean and a variance per feature per class.
performance:
    GNB is limited by its strong assumption of feature independence,
    while Adaboost is more flexible and can capture complex patterns in the data.
'''
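The speed argument can be made concrete: fitting Gaussian Naive Bayes amounts to computing per-class statistics in a single pass over the data. The sketch below uses random made-up data to check that the per-class feature means computed by hand match scikit-learn's fitted `theta_` attribute. The internals of the notebook's own `Adaboost` class are not shown, so no direct comparison with it is attempted here.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical data: 200 samples, 5 features, labels in {-1, 1}.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 5))
y_demo = np.where(x_demo[:, 0] > 0.5, 1, -1)

gnb = GaussianNB().fit(x_demo, y_demo)

# Per-class feature means computed by hand, one row per class in gnb.classes_.
manual_means = np.vstack([x_demo[y_demo == c].mean(axis=0) for c in gnb.classes_])

print(np.allclose(manual_means, gnb.theta_))
```

A boosting model with 1000 rounds has to fit 1000 weak learners in sequence, whereas this closed-form estimate touches each sample once.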

#### 👨🏻‍🔬 Now let's try PCA on the features:

We won't be so greedy. Let's start by attempting to reduce the dimensionality from 1558D to 3D.

In [7]:
from PCA import PCA

pca = PCA(new_dim=3)
x_train_pca = pca.fit_transform(x_train)
x_val_pca = pca.transform(x_val)            # ask yourself: why did we not use pca.fit_transform(x_val)?

print(f"Training data has {x_train_pca.shape[0]} rows and {x_train_pca.shape[1]} columns")

shape of U  (1558, 1558)
test [30.27943508 28.86898365 26.4599253  ...  0.          0.
  0.        ]
shape of U  (1558, 1558)
(3, 1558)
Training data has 1410 rows and 3 columns
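The fit/transform split can be sketched as follows. `SimplePCA` is a hypothetical stand-in written for illustration, not the code of the `PCA` module imported above. It uses an SVD of the centred training data, which may differ from how that module computes its components.

```python
import numpy as np


class SimplePCA:
    """Hypothetical stand-in for the notebook's PCA class, not its actual code."""

    def __init__(self, new_dim):
        self.new_dim = new_dim

    def fit(self, x):
        # Centre with the training mean and keep it for later transforms.
        self.mean_ = x.mean(axis=0)
        # Rows of vt are principal directions, ordered by decreasing variance.
        _, _, vt = np.linalg.svd(x - self.mean_, full_matrices=False)
        self.components_ = vt[: self.new_dim]
        return self

    def transform(self, x):
        # Reuse the stored training mean and directions; nothing is re-estimated.
        return (x - self.mean_) @ self.components_.T

    def fit_transform(self, x):
        return self.fit(x).transform(x)


# Made-up data standing in for x_train and x_val.
rng = np.random.default_rng(0)
x_tr, x_va = rng.normal(size=(80, 10)), rng.normal(size=(20, 10))

pca_demo = SimplePCA(new_dim=3)
z_tr = pca_demo.fit_transform(x_tr)
z_va = pca_demo.transform(x_va)
print(z_tr.shape, z_va.shape)
```

Calling `fit_transform` on the validation data would estimate a different mean and different directions, so the validation points would land in a different coordinate system from the one the classifier was trained in.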


Now let's try Adaboost on the data with reduced dimensionality:

In [8]:
from Adaboost import Adaboost
from sklearn.metrics import f1_score

ada_model = Adaboost(T=1000)
ada_model.fit(x_train_pca, y_train)
ada_y_pred = ada_model.predict(x_val_pca)
ada_accuracy = ada_model.score(x_val_pca, y_val)
print("AdaBoost Accuracy:", round(ada_accuracy, 3))

f1 = f1_score(y_val, ada_y_pred)
print("F1 Score:", round(f1, 3))

AdaBoost Accuracy: 0.921
F1 Score: 0.417


Let's try Gaussian Naive Bayes

In [9]:
# import gaussian naive bayes
from sklearn.naive_bayes import GaussianNB

gnb_model = GaussianNB()
gnb_model.fit(x_train_pca, y_train)
gnb_y_pred = gnb_model.predict(x_val_pca)
gnb_accuracy = gnb_model.score(x_val_pca, y_val)

print("Gaussian Naive Bayes Accuracy:", round(gnb_accuracy, 3))

f1 = f1_score(y_val, gnb_y_pred)
print("F1 Score:", round(f1, 3))

Gaussian Naive Bayes Accuracy: 0.929
F1 Score: 0.444


- In view of the results above, explain what benefit we got from using PCA for each of Adaboost and Gaussian Naive Bayes (whether in terms of speed or performance). What effect in machine learning was responsible for making Gaussian Naive Bayes perform much worse when the dimensionality was high?

In [None]:
'''
Answer
'''

<div align="center" >
    <img src="https://media1.tenor.com/m/pWeXVj_pjM4AAAAC/perfection-michael-fassbender.gif" width=800>
</div>