Detecting anomalies is difficult, especially with labeled datasets, because human judgment introduces some bias when the final product is labeled as anomalous or good. These giant manufacturing systems need to be monitored every 10 milliseconds to capture their behavior, which produces a large volume of data; this is what we call the Industrial IoT (IIoT). Manufacturers also rarely produce an anomalous product, so anomalies are like a needle in a haystack, which makes the dataset significantly imbalanced.

Fitting a machine learning model to such a dataset and getting it to generalize is an interesting challenge. In this competition, we bring such a use case from one of India's leading manufacturers of wafers (semiconductors). The collected dataset was anonymized to hide the feature names, and it has 1558 features that would require serious domain knowledge to interpret.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [2]:
train_data=pd.read_csv(r"C:\Users\user\Desktop\Data sets\Anomalies detection\Participants_Data_WH18\Train.csv")
test_data=pd.read_csv(r"C:\Users\user\Desktop\Data sets\Anomalies detection\Participants_Data_WH18\Test.csv")

In [3]:
train_data.head()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,...,feature_1550,feature_1551,feature_1552,feature_1553,feature_1554,feature_1555,feature_1556,feature_1557,feature_1558,Class
0,100,160,1.6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,20,83,4.15,1,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
2,99,150,1.5151,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,40,40,1.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,12,234,19.5,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
test_data.head()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,...,feature_1549,feature_1550,feature_1551,feature_1552,feature_1553,feature_1554,feature_1555,feature_1556,feature_1557,feature_1558
0,60.0,468.0,7.8,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,108.0,179.0,1.6574,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1.0,1.0,2.0,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,60.0,468.0,7.8,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,60.0,120.0,2.0,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1763 entries, 0 to 1762
Columns: 1559 entries, feature_1 to Class
dtypes: float64(1), int64(1558)
memory usage: 21.0 MB


In [6]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 756 entries, 0 to 755
Columns: 1558 entries, feature_1 to feature_1558
dtypes: float64(4), int64(1554)
memory usage: 9.0 MB


In [7]:
# Count missing values per column
train_data.isnull().sum()


feature_1       0
feature_2       0
feature_3       0
feature_4       0
feature_5       0
               ..
feature_1555    0
feature_1556    0
feature_1557    0
feature_1558    0
Class           0
Length: 1559, dtype: int64

In [8]:
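# Find duplicated feature columns: transpose so columns become rows, drop the duplicate rows, transpose back.
# Despite its name, `duplicate` holds the names of the columns that are kept (the unique ones).
duplicate=train_data.drop('Class',axis=1).T.drop_duplicates().T.columns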
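
To see what the transpose trick in the cell above does, here is a minimal sketch on a toy frame. The frame `toy` and its column names are made up for illustration and are not part of the competition data.

```python
import pandas as pd

# Toy frame (hypothetical): column "b" is an exact copy of column "a"
toy = pd.DataFrame({
    "a": [1, 0, 1],
    "b": [1, 0, 1],
    "c": [0, 0, 1],
})

# Same idiom as the notebook: columns become rows, duplicate rows are dropped,
# and the surviving row labels are the unique column names.
unique_cols = toy.T.drop_duplicates().T.columns
print(list(unique_cols))

# An equivalent selection that avoids transposing back
deduped = toy.loc[:, (~toy.T.duplicated()).to_numpy()]
print(list(deduped.columns))
```

Only the first of each group of identical columns is kept, which is why the same column list can then be applied to both the train and test frames.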

In [9]:
# Keep the target aside so it can be re-attached after the column selection
Class = train_data['Class']


In [10]:
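# Keep only the unique feature columns in both sets, then re-attach the target.
# .copy() makes each result an independent frame, so later assignments and in-place drops are safe.
test_data=test_data[duplicate].copy()
train_data=train_data[duplicate].copy()
train_data['Class'] = Class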


In [11]:
test_data.shape,train_data.shape

((756, 729), (1763, 730))

Let's eliminate more useless features. Let's check whether any columns consist (almost) entirely of zeros.

In [12]:
train_data.head()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,...,feature_1549,feature_1550,feature_1551,feature_1552,feature_1553,feature_1554,feature_1555,feature_1557,feature_1558,Class
0,100,160,1.6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,20,83,4.15,1,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
2,99,150,1.5151,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,40,40,1.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,12,234,19.5,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
# Count the number of zeros in each column; columns that are almost entirely zero are dropped below
zero = pd.DataFrame((train_data == 0).astype(int).sum(axis=0))

In [14]:
zero

Unnamed: 0,0
feature_1,0
feature_2,0
feature_3,0
feature_4,486
feature_5,1758
...,...
feature_1554,1756
feature_1555,1736
feature_1557,1746
feature_1558,1761


In [15]:
# Columns with more than 1761 zeros out of 1763 rows, i.e. at most one non-zero value
all_zero = zero[zero[0]>1761].index

In [16]:
all_zero

Index(['feature_46', 'feature_57', 'feature_150', 'feature_369', 'feature_568',
       'feature_1007', 'feature_1013'],
      dtype='object')

In [17]:
train_data.drop(all_zero,axis=1,inplace=True)

In [18]:
test_data.drop(all_zero,axis=1,inplace=True)

In [19]:
train_data.shape,test_data.shape

((1763, 723), (756, 722))

In [20]:
train_data['Class'].value_counts()

0    1620
1     143
Name: Class, dtype: int64
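
The counts above make the imbalance concrete. As a quick check, the arithmetic below uses only the two numbers printed by `value_counts()`; the variable names are ours.

```python
# Class counts copied from the value_counts() output above
n_normal, n_anomalous = 1620, 143

anomaly_share = n_anomalous / (n_normal + n_anomalous)
imbalance_ratio = n_normal / n_anomalous

print(f"anomalous share: {anomaly_share:.1%}")              # 8.1%
print(f"normal-to-anomalous ratio: {imbalance_ratio:.2f}")  # 11.33
```

The XGBoost parameter documentation mentions sum(negative) / sum(positive) as a typical starting value for `scale_pos_weight`, which would be about 11 here. The model in this notebook uses 5, so this ratio is only a reference point, not the value that was tuned.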

In [21]:
features=train_data.drop(['Class'],axis=1)
label=train_data['Class']

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=0)
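
With only 143 positives, a purely random split can leave the hold-out set with a slightly different share of anomalies than the training set. One option, which is not what the cell above does, is to pass `stratify` to `train_test_split`. The sketch below uses a synthetic label vector with the same class counts instead of the real data; `X` is just a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the training data (1620 vs 143)
y = np.array([0] * 1620 + [1] * 143)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, not the real ones

# stratify=y keeps the class proportions (nearly) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("overall positive rate:", round(y.mean(), 4))
print("train positive rate:  ", round(y_tr.mean(), 4))
print("test positive rate:   ", round(y_te.mean(), 4))
```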

In [23]:
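from xgboost import XGBClassifier
# Note: `silent` is not used by this XGBoost version (see the warning below); `verbosity` is its replacement.
boost=XGBClassifier(silent=True,booster='gbtree',
                      scale_pos_weight=5,        # up-weight the rare anomalous class
                      learning_rate=0.02,
                      colsample_bytree = 0.7,    # fraction of features sampled per tree
                      subsample = 0.5,           # fraction of rows sampled per tree
                      max_delta_step = 3,
                      reg_lambda = 2, objective='binary:logistic',
                      n_estimators=818, 
                      max_depth=8)
# The hold-out split is passed as eval_set, so its log loss is printed after every boosting round
boost.fit(X_train,y_train,eval_metric=['logloss'],eval_set=[(X_test,y_test)])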

Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[0]	validation_0-logloss:0.68022
[1]	validation_0-logloss:0.66768
[2]	validation_0-logloss:0.65500
[3]	validation_0-logloss:0.64295
[4]	validation_0-logloss:0.63053
[5]	validation_0-logloss:0.61992
[6]	validation_0-logloss:0.60864
[7]	validation_0-logloss:0.59741
[8]	validation_0-logloss:0.58791
[9]	validation_0-logloss:0.57809
[10]	validation_0-logloss:0.56772
[11]	validation_0-logloss:0.55927
[12]	validation_0-logloss:0.55043
[13]	validation_0-logloss:0.54061
[14]	validation_0-logloss:0.53171
[15]	validation_0-logloss:0.52324
[16]	validation_0-logloss:0.51550
[17]	validation_0-logloss:0.50772
[18]	validation_0-logloss:0.50025
[19]	validation_0-logloss:0.49284
[20]	validation_0-logloss:0.48533
...

[227]	validation_0-logloss:0.21037
[228]	validation_0-logloss:0.21035
[229]	validation_0-logloss:0.20997
[230]	validation_0-logloss:0.20958
[231]	validation_0-logloss:0.20972
[232]	validation_0-logloss:0.20955
[233]	validation_0-logloss:0.20888
[234]	validation_0-logloss:0.20878
[235]	validation_0-logloss:0.20892
[236]	validation_0-logloss:0.20926
[237]	validation_0-logloss:0.20908
[238]	validation_0-logloss:0.20833
[239]	validation_0-logloss:0.20829
[240]	validation_0-logloss:0.20775
[241]	validation_0-logloss:0.20817
[242]	validation_0-logloss:0.20802
[243]	validation_0-logloss:0.20789
[244]	validation_0-logloss:0.20765
[245]	validation_0-logloss:0.20732
[246]	validation_0-logloss:0.20686
[247]	validation_0-logloss:0.20687
[248]	validation_0-logloss:0.20631
[249]	validation_0-logloss:0.20606
[250]	validation_0-logloss:0.20636
[251]	validation_0-logloss:0.20625
[252]	validation_0-logloss:0.20624
[253]	validation_0-logloss:0.20579
[254]	validation_0-logloss:0.20579
...

[462]	validation_0-logloss:0.20437
[463]	validation_0-logloss:0.20440
[464]	validation_0-logloss:0.20467
[465]	validation_0-logloss:0.20434
[466]	validation_0-logloss:0.20435
[467]	validation_0-logloss:0.20434
[468]	validation_0-logloss:0.20429
[469]	validation_0-logloss:0.20479
[470]	validation_0-logloss:0.20473
[471]	validation_0-logloss:0.20431
[472]	validation_0-logloss:0.20419
[473]	validation_0-logloss:0.20457
[474]	validation_0-logloss:0.20438
[475]	validation_0-logloss:0.20412
[476]	validation_0-logloss:0.20434
[477]	validation_0-logloss:0.20438
[478]	validation_0-logloss:0.20444
[479]	validation_0-logloss:0.20439
[480]	validation_0-logloss:0.20482
[481]	validation_0-logloss:0.20494
[482]	validation_0-logloss:0.20477
[483]	validation_0-logloss:0.20483
[484]	validation_0-logloss:0.20493
[485]	validation_0-logloss:0.20518
[486]	validation_0-logloss:0.20545
[487]	validation_0-logloss:0.20524
[488]	validation_0-logloss:0.20533
[489]	validation_0-logloss:0.20546
...

[697]	validation_0-logloss:0.20928
[698]	validation_0-logloss:0.20916
[699]	validation_0-logloss:0.20921
[700]	validation_0-logloss:0.20905
[701]	validation_0-logloss:0.20943
[702]	validation_0-logloss:0.20936
[703]	validation_0-logloss:0.20942
[704]	validation_0-logloss:0.20939
[705]	validation_0-logloss:0.20947
[706]	validation_0-logloss:0.20952
[707]	validation_0-logloss:0.20950
[708]	validation_0-logloss:0.20940
[709]	validation_0-logloss:0.20976
[710]	validation_0-logloss:0.20960
[711]	validation_0-logloss:0.20962
[712]	validation_0-logloss:0.20971
[713]	validation_0-logloss:0.20959
[714]	validation_0-logloss:0.20960
[715]	validation_0-logloss:0.20986
[716]	validation_0-logloss:0.20979
[717]	validation_0-logloss:0.20994
[718]	validation_0-logloss:0.20963
[719]	validation_0-logloss:0.20954
[720]	validation_0-logloss:0.20921
[721]	validation_0-logloss:0.20906
[722]	validation_0-logloss:0.20904
[723]	validation_0-logloss:0.20936
[724]	validation_0-logloss:0.20958
...

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.7, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.02, max_delta_step=3, max_depth=8,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=818, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=2, scale_pos_weight=5, silent=True,
              subsample=0.5, tree_method='exact', validate_parameters=1,
              verbosity=None)
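
In the visible part of the log, the validation log loss bottoms out around round 475 and then drifts upward, which hints that the later rounds are starting to overfit. The sketch below picks the best round from a few values copied from the log and shows a simple patience rule. The helper `best_iteration` is hypothetical, and because the log above is truncated, the true minimum may sit in a round that is not shown.

```python
def best_iteration(losses, patience):
    """Return (position, loss) of the best value found before `patience`
    steps pass without improvement, or the overall best if that never happens."""
    best_idx, best_loss = 0, losses[0]
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        elif i - best_idx >= patience:
            break  # no improvement for `patience` steps: stop looking
    return best_idx, best_loss

# A few validation log-loss values copied from the log above, keyed by boosting round
logged = {20: 0.48533, 249: 0.20606, 475: 0.20412, 489: 0.20546, 722: 0.20904}

best_round = min(logged, key=logged.get)
print("lowest logged loss at round", best_round)

# Patience rule applied to the same sparse sequence (positions, not round numbers)
print(best_iteration(list(logged.values()), patience=2))
```

XGBoost has an `early_stopping_rounds` option that applies this kind of rule during training; where it is passed depends on the installed version, so check the documentation for the version in use.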

In [24]:
predictions = boost.predict_proba(X_test)[:,-1]

In [25]:

from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, predictions)

0.9264878852034816

In [31]:
p1 = boost.predict_proba(test_data)[:,-1]
# One predicted anomaly probability per test row, in a single 'Class' column
# (assumed submission layout; adjust if the competition expects a different format)
submission = pd.DataFrame({'Class': p1})
#submission.to_csv('submission.csv',header=True, index=False)

In [33]:
p1

array([3.55602950e-01, 1.28597463e-03, 4.68074381e-01, 8.59046340e-01,
       9.48031127e-01, 1.51054515e-02, 6.44609332e-01, 3.12881242e-03,
       8.83398484e-03, 3.23213427e-03, 7.02054024e-01, 7.29980990e-02,
       7.71079538e-03, 1.26866242e-02, 3.59754008e-03, 9.50869464e-04,
       9.09366906e-01, 7.01964796e-01, 1.43386424e-01, 5.67318872e-03,
       1.40425109e-03, 2.96033435e-02, 2.11234856e-03, 5.01817511e-03,
       7.94919208e-03, 1.10104673e-01, 3.94124165e-03, 1.94207244e-02,
       1.87407553e-01, 3.40785056e-01, 1.12139655e-03, 5.32607315e-03,
       8.63909628e-03, 8.56347475e-03, 1.36742997e-03, 3.78770404e-03,
       6.71622111e-04, 1.19645591e-03, 2.98353378e-03, 9.66496766e-01,
       8.70113261e-04, 3.65531282e-03, 1.89614284e-03, 8.58181342e-03,
       7.67647266e-01, 7.49112805e-03, 6.97468920e-03, 7.12291121e-01,
       7.21228658e-04, 4.08818992e-03, 9.13286746e-01, 3.60360835e-03,
       9.37511921e-01, 3.43328007e-02, 6.09250972e-03, 1.80932600e-02,
      