# 🏭 Detecting Manufacturing Anomalies

Anomaly detection is pivotal in manufacturing and in products more broadly: it enables early issue identification, better quality assurance, stronger security, performance optimization, fraud prevention, predictive maintenance, and a better customer experience.

In this problem, we have a dataset of 1558 attributes recorded from a manufacturing machine for a handful of product samples, along with a label indicating whether each product is good or anomalous (e.g., has a defect). By training a supervised machine learning model on this data, we can infer whether a new product is good or anomalous given its 1558 attributes.

In [None]:
%load_ext autoreload
%autoreload 2

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

#### Data Ingestion

In [None]:
dataset = pd.read_csv('wafer.csv')
# print the shape of the dataset
print(f"Dataset has {dataset.shape[0]} rows and {dataset.shape[1]} columns")

dataset.head(10)

#### Split the Dataset

In [None]:
# column split: features vs. label
x_data = dataset.drop(columns=['Class'])
y_data = dataset['Class']

# convert both to NumPy arrays
x_data, y_data = x_data.to_numpy(), y_data.to_numpy()

# map the labels from {0, 1} to {-1, 1}
y_data = np.where(y_data==0, -1, 1)

# stratified train-validation split (80% / 20%)
x_train, x_val, y_train, y_val = train_test_split(x_data, y_data, test_size=0.2,
                                                  random_state=0, stratify=y_data)
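
The `stratify=y_data` argument is meant to keep the class proportions roughly equal in the training and validation sets. The sketch below checks this on synthetic labels rather than on `wafer.csv`; the 90/10 class ratio and every `*_demo` name are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real labels: 900 samples of class -1, 100 of class +1.
# (The actual class ratio in wafer.csv is not assumed here.)
rng = np.random.default_rng(0)
y_demo = np.array([-1] * 900 + [1] * 100)
x_demo = rng.normal(size=(y_demo.size, 5))

x_tr_demo, x_va_demo, y_tr_demo, y_va_demo = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

def positive_fraction(labels):
    """Fraction of samples labelled +1."""
    return float(np.mean(labels == 1))

train_frac = positive_fraction(y_tr_demo)
val_frac = positive_fraction(y_va_demo)
print(f"+1 fraction in train: {train_frac:.3f}, in validation: {val_frac:.3f}")
```

Running the same `positive_fraction` check on `y_train` and `y_val` would show how imbalanced the real data is.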

## 😎 Machine Learning Engineering Time

Let's try our Adaboost library:

In [None]:
from Adaboost import Adaboost
from sklearn.metrics import f1_score

ada_model = Adaboost(T=1000)
ada_model.fit(x_train, y_train)
ada_y_pred = ada_model.predict(x_val)
ada_accuracy = ada_model.score(x_val, y_val)
print("AdaBoost Accuracy:", round(ada_accuracy, 3))

f1 = f1_score(y_val, ada_y_pred)
print("F1 Score:", round(f1, 3))

Let's try another model to judge Adaboost's performance more fairly:

In [None]:
from sklearn.naive_bayes import GaussianNB

gnb_model = GaussianNB()
gnb_model.fit(x_train, y_train)
gnb_y_pred = gnb_model.predict(x_val)
gnb_accuracy = gnb_model.score(x_val, y_val)

print("Gaussian Naive Bayes Accuracy:", round(gnb_accuracy, 3))

f1 = f1_score(y_val, gnb_y_pred)
print("F1 Score:", round(f1, 3))
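
When reading the two metrics side by side, it can help to see how they behave on imbalanced labels. The sketch below uses made-up labels (95 of class -1, 5 of class +1) and a made-up prediction vector; none of it comes from the wafer data. It also assumes +1 is the class of interest, which is `f1_score`'s default `pos_label`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth: 95 samples of class -1 followed by 5 of class +1.
y_true_demo = np.array([-1] * 95 + [1] * 5)

# Hypothetical classifier that predicts -1 almost everywhere
# and catches only one of the five +1 samples.
y_pred_demo = np.full_like(y_true_demo, -1)
y_pred_demo[95] = 1

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
f1_demo = f1_score(y_true_demo, y_pred_demo)  # pos_label defaults to 1

print("Accuracy:", round(acc_demo, 3))
print("F1 Score:", round(f1_demo, 3))
```

Here 96 of 100 predictions are correct, so accuracy is high, while recall on the +1 class is only 1/5, which pulls the F1-score down.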

- Why is there a discrepancy between the accuracy and the F1-score for both models? In your explanation, write code that illustrates any points that need to be made, and decide which of the two metrics is the right choice for this problem.

In [None]:
'''
Answer
'''

- Why is Gaussian Naive Bayes faster than Adaboost, and why does it perform worse?

In [None]:
'''
Answer
'''

#### 👨🏻‍🔬 Now let's try PCA on the features:

We won't be too greedy. Let's start by attempting to reduce the dimensionality from 1558D to 3D.

In [None]:
from PCA import PCA

pca = PCA(new_dim=3)
x_train_pca = pca.fit_transform(x_train)
x_val_pca = pca.transform(x_val)            # ask yourself: why didn't we use pca.fit_transform(x_val)?

print(f"Training data has {x_train_pca.shape[0]} rows and {x_train_pca.shape[1]} columns")
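
The fit/transform distinction used above can be sketched in plain NumPy. This is not the notebook's `PCA` module, whose internals are not shown here; it is a minimal eigendecomposition-based sketch on random data, and all names (`x_tr_demo`, `components`, and so on) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x_tr_demo = rng.normal(size=(200, 10))   # stand-in for training features
x_va_demo = rng.normal(size=(50, 10))    # stand-in for validation features
k = 3                                    # target dimensionality

# "fit": learn the mean and the top-k principal directions from training data only
train_mean = x_tr_demo.mean(axis=0)
centered_train = x_tr_demo - train_mean
cov = np.cov(centered_train, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
top_k = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
components = eigvecs[:, top_k]               # shape (10, k)

# "transform": project any data using the statistics learned from training data
x_tr_proj = centered_train @ components
x_va_proj = (x_va_demo - train_mean) @ components

print(x_tr_proj.shape, x_va_proj.shape)
```

The validation data is projected with the training mean and training components, so both sets live in the same 3D coordinate system and no validation statistics leak into the fitted transform.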

Now let's try Adaboost on the data with reduced dimensionality:

In [None]:
from Adaboost import Adaboost
from sklearn.metrics import f1_score

ada_model = Adaboost(T=1000)
ada_model.fit(x_train_pca, y_train)
ada_y_pred = ada_model.predict(x_val_pca)
ada_accuracy = ada_model.score(x_val_pca, y_val)
print("AdaBoost Accuracy:", round(ada_accuracy, 3))

f1 = f1_score(y_val, ada_y_pred)
print("F1 Score:", round(f1, 3))

Now let's try Gaussian Naive Bayes on the reduced data:

In [None]:
# import gaussian naive bayes
from sklearn.naive_bayes import GaussianNB

gnb_model = GaussianNB()
gnb_model.fit(x_train_pca, y_train)
gnb_y_pred = gnb_model.predict(x_val_pca)
gnb_accuracy = gnb_model.score(x_val_pca, y_val)

print("Gaussian Naive Bayes Accuracy:", round(gnb_accuracy, 3))

f1 = f1_score(y_val, gnb_y_pred)
print("F1 Score:", round(f1, 3))

- In view of the results above, explain what benefit PCA gave each of Adaboost and Gaussian Naive Bayes, whether in terms of speed or performance. What effect in machine learning was responsible for making Gaussian Naive Bayes perform much worse when the dimensionality was high?

In [None]:
'''
Answer
'''

<div align="center" >
    <img src="https://media1.tenor.com/m/pWeXVj_pjM4AAAAC/perfection-michael-fassbender.gif" width=800>
</div>