# Class Imbalance Problem in Software Defect Prediction

### Abstract

Software defect prediction is a field that has encouraged researchers to develop sophisticated detection techniques using machine learning methods. The issue is that the available data is often highly imbalanced, which makes it difficult for classifiers to detect defects. Several methods have been proposed to overcome this issue, and they fall mainly into three categories:

#### 1. Resampling-based methods:
These methods use undersampling or oversampling techniques to transform the imbalanced dataset into a balanced one.
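
As a minimal sketch of the oversampling idea (not the method used later in this notebook), minority rows can be duplicated at random until both classes have the same count. The function name `random_oversample`, the toy arrays, and the convention that label `1` is the minority class are assumptions made for illustration.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal counts.

    Assumes binary labels where 1 is the minority class and 0 the majority class.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    # number of extra minority rows needed to match the majority count
    n_extra = len(majority_idx) - len(minority_idx)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra])
    return X[keep], y[keep]

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))  # class counts after balancing
```

Undersampling would be the mirror image: drop randomly chosen majority rows instead of duplicating minority rows.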

#### 2. Cost-sensitive learning based methods:
These methods consider the cost associated with misclassifying examples and make the classifier favor the minority class by adding different cost factors into the algorithms.
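
One common way to choose such cost factors is to weight each class inversely to its frequency. The sketch below computes `n_samples / (n_classes * class_count)`, the heuristic that scikit-learn documents for `class_weight="balanced"`. The helper name `balanced_class_weights` and the toy labels are assumptions for illustration, and this weighting is not applied elsewhere in this notebook.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    counts = np.bincount(y)            # samples per class, indexed by label
    return len(y) / (len(counts) * counts)

# toy labels: 90 non-defective (0) and 10 defective (1) modules
y_toy = np.array([0] * 90 + [1] * 10)
weights = balanced_class_weights(y_toy)
print(weights)  # the rarer class receives the larger weight
```

With these weights, misclassifying a defective module costs nine times as much as misclassifying a non-defective one in this toy example.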

#### 3. Ensemble learning:
These methods try to improve performance on the imbalanced dataset by combining multiple classifiers.

In [7]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import sklearn
import random

In [8]:
#cm1 = pd.read_csv("data/CM1.csv")
#print(sum(cm1.Defective == "Y"),cm1.shape[0],sum(cm1.Defective == "Y")/cm1.shape[0] * 100)
#jm1 = pd.read_csv("data/JM1.csv")
#print(sum(jm1.label == "Y"),jm1.shape[0],sum(jm1.label == "Y")/jm1.shape[0] * 100)
#kc1 = pd.read_csv("data/KC1.csv")
#print(sum(kc1.Defective == "Y"),kc1.shape[0],sum(kc1.Defective == "Y")/kc1.shape[0] * 100)
datasets = ['CM1','JM1','KC1','KC3','KC4','MC1','MC2','MW1','PC1','PC2','PC3','PC4','PC5']
d = {}
df = pd.DataFrame(columns = ["Project","Number of Defective instances","Total Number of instances","Percentage of Defective Instances"])

for i in range(len(datasets)):
    d[i] = pd.read_csv("data/"+datasets[i]+".csv")
    try:
        # the last column holds the defect label; "Y" marks a defective module
        n_defective = sum(d[i][d[i].columns[-1]] == "Y")
        n_total = d[i].shape[0]
        percentage = n_defective/n_total * 100
        df.loc[len(df)] = [datasets[i],n_defective,n_total,percentage]
        print(datasets[i],n_defective,n_total,percentage)
    except Exception:
        # skip datasets whose summary cannot be computed
        continue

CM1 42 344 12.209302325581394
JM1 1759 9591 18.340110520279428
KC1 325 2095 15.513126491646778
KC3 36 200 18.0
MC1 68 8737 0.7782991873640838
MC2 44 125 35.199999999999996
MW1 27 263 10.26615969581749
PC1 61 735 8.299319727891156
PC2 16 1493 1.0716677829872738
PC3 138 1099 12.556869881710647
PC4 178 1379 12.907904278462654
PC5 502 16962 2.9595566560547106


In [9]:
df

Unnamed: 0,Project,Number of Defective instances,Total Number of instances,Percentage of Defective Instances
0,CM1,42,344,12.209302
1,JM1,1759,9591,18.340111
2,KC1,325,2095,15.513126
3,KC3,36,200,18.0
4,MC1,68,8737,0.778299
5,MC2,44,125,35.2
6,MW1,27,263,10.26616
7,PC1,61,735,8.29932
8,PC2,16,1493,1.071668
9,PC3,138,1099,12.55687


Generally, a dataset whose imbalance ratio exceeds **10:1** can be regarded as a **highly imbalanced dataset**. Ordinary classifiers tend to fail under these conditions, so more sophisticated techniques are required. In this analysis we try to develop a heuristic for which methods work best under which conditions. We work on the following datasets:

| Group | Project | Language | Number of Instances | Percentage of Defective Instances |
|-------|---------|----------|---------------------|-----------------------------------|
| NASA  |   PC5   |    C     |        16962        |              2.96%                |
| NASA  |   PC4   |    C     |        1379         |              12.9%                |
| NASA  |   JM1   |    C     |        9591         |              18.34%               |
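
To relate these percentages to the 10:1 rule of thumb, the imbalance ratio can be computed as the number of non-defective instances per defective instance. The sketch below uses the defective and total counts printed earlier in this notebook; the helper name `imbalance_ratio` is an assumption for illustration.

```python
def imbalance_ratio(n_defective, n_total):
    """Number of non-defective instances per defective instance."""
    return (n_total - n_defective) / n_defective

# (defective, total) counts taken from the summary printed earlier
projects = {"PC5": (502, 16962), "PC4": (178, 1379), "JM1": (1759, 9591)}
ratios = {name: imbalance_ratio(n_def, n_tot) for name, (n_def, n_tot) in projects.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}:1")
```

By this measure PC5 lies well above 10:1, while PC4 and JM1 are imbalanced to a milder degree.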

We will apply the following techniques to the selected datasets:

1. Ordinary Algorithms with Hyper-parameter tuning (if possible):
    - Support Vector Machine
    - Logistic Regression
2. Ensemble Based Methods:
    - Random Forests
    - Bagging
    - Boosting
3. Cluster-Based Over-Sampling with Filtering
    - Support Vector Machine
    - Logistic Regression
    - Random Forest
    - Bagging
    - Boosting

Note: we will apply **PCA** to reduce over-fitting where required.

In [4]:
PC4 = d[11]

In [10]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

## Analysis for PC4

PC4.dtypes

LOC_BLANK                            int64
BRANCH_COUNT                         int64
CALL_PAIRS                           int64
LOC_CODE_AND_COMMENT                 int64
LOC_COMMENTS                         int64
CONDITION_COUNT                      int64
CYCLOMATIC_COMPLEXITY                int64
CYCLOMATIC_DENSITY                 float64
DECISION_COUNT                       int64
DECISION_DENSITY                   float64
DESIGN_COMPLEXITY                    int64
DESIGN_DENSITY                     float64
EDGE_COUNT                           int64
ESSENTIAL_COMPLEXITY                 int64
ESSENTIAL_DENSITY                  float64
LOC_EXECUTABLE                       int64
PARAMETER_COUNT                      int64
HALSTEAD_CONTENT                   float64
HALSTEAD_DIFFICULTY                float64
HALSTEAD_EFFORT                    float64
HALSTEAD_ERROR_EST                 float64
HALSTEAD_LENGTH                      int64
HALSTEAD_LEVEL                     float64
HALSTEAD_PR

In [6]:
## Correcting Defective label to int

PC4['Defective'] = (PC4['Defective'] == "Y")*1
PC4.dtypes

LOC_BLANK                            int64
BRANCH_COUNT                         int64
CALL_PAIRS                           int64
LOC_CODE_AND_COMMENT                 int64
LOC_COMMENTS                         int64
CONDITION_COUNT                      int64
CYCLOMATIC_COMPLEXITY                int64
CYCLOMATIC_DENSITY                 float64
DECISION_COUNT                       int64
DECISION_DENSITY                   float64
DESIGN_COMPLEXITY                    int64
DESIGN_DENSITY                     float64
EDGE_COUNT                           int64
ESSENTIAL_COMPLEXITY                 int64
ESSENTIAL_DENSITY                  float64
LOC_EXECUTABLE                       int64
PARAMETER_COUNT                      int64
HALSTEAD_CONTENT                   float64
HALSTEAD_DIFFICULTY                float64
HALSTEAD_EFFORT                    float64
HALSTEAD_ERROR_EST                 float64
HALSTEAD_LENGTH                      int64
HALSTEAD_LEVEL                     float64
HALSTEAD_PR

In [11]:
##### Since the number of defective instances is very small, we use stratified 5-fold cross-validation

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import recall_score,balanced_accuracy_score

kf = StratifiedKFold(n_splits=5)
X = PC4[PC4.columns[:-1]]
y = PC4[PC4.columns[-1]]
yo = y.copy()

X = (X-X.mean())/X.std()
X.describe()

Unnamed: 0,LOC_BLANK,BRANCH_COUNT,CALL_PAIRS,LOC_CODE_AND_COMMENT,LOC_COMMENTS,CONDITION_COUNT,CYCLOMATIC_COMPLEXITY,CYCLOMATIC_DENSITY,DECISION_COUNT,DECISION_DENSITY,...,MULTIPLE_CONDITION_COUNT,NODE_COUNT,NORMALIZED_CYLOMATIC_COMPLEXITY,NUM_OPERANDS,NUM_OPERATORS,NUM_UNIQUE_OPERANDS,NUM_UNIQUE_OPERATORS,NUMBER_OF_LINES,PERCENT_COMMENTS,LOC_TOTAL
count,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,...,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0
mean,-1.679424e-16,-1.581202e-16,1.052256e-16,-1.227042e-15,-1.898409e-16,-5.15823e-16,8.252202e-17,2.898334e-18,6.966468e-16,5.772516e-17,...,-2.982064e-16,-1.489422e-17,2.407228e-16,9.47393e-17,2.5923990000000003e-17,-2.547314e-16,-1.078824e-16,1.852519e-16,-1.779094e-15,-3.703025e-16
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-0.7376029,-0.655989,-0.8154636,-0.4089595,-0.5289664,-0.5567631,-0.6451747,-1.430837,-0.5707811,-0.9389322,...,-0.535724,-0.6942747,-1.024888,-0.5529072,-0.6325653,-0.6536946,-1.954849,-0.8017939,-0.8619863,-0.8461845
25%,-0.6373511,-0.655989,-0.4650353,-0.4089595,-0.5289664,-0.5567631,-0.6451747,-0.6848729,-0.5707811,-0.9389322,...,-0.535724,-0.536026,-0.5734919,-0.4285166,-0.4748065,-0.3658923,-0.6775513,-0.5615962,-0.8619863,-0.5705292
50%,-0.3365958,-0.2732883,-0.4650353,-0.4089595,-0.4254697,-0.5567631,-0.2804705,-0.2372947,-0.5707811,-0.9389322,...,-0.535724,-0.2722781,-0.2913696,-0.2764837,-0.2874679,-0.2219911,-0.1985647,-0.3650708,-0.3529046,-0.3736325
75%,0.2649147,0.1094124,0.2358212,-0.041562,0.09201373,0.05733513,0.08423372,0.4589381,0.1211812,0.8323593,...,0.04770218,0.1497185,0.2164506,0.06904575,0.116789,0.1617453,0.5997464,0.2245055,0.6301336,0.2170575
max,8.285056,12.92989,7.594815,12.08255,7.543775,8.501186,12.48418,3.542255,8.078748,3.489296,...,11.71623,14.2866,4.617559,18.8382,16.00113,28.1745,4.112315,12.86764,3.494479,7.423476


In [12]:
sum(yo == 1)

178

In [13]:
# Now we'd use PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
XT = pca.fit_transform(X)
pca.explained_variance_ratio_*100

array([44.45323967, 11.95419553,  9.27769965,  5.35220998,  4.56140915,
        4.19499958,  3.55462455,  2.88734001,  2.31394959,  1.91315067])
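
The share of variance retained by the first components can be checked by accumulating the ratios printed above. This is a small sketch that copies those printed values into an array named `ratios_percent` (a name chosen here for illustration) rather than reading them from the fitted `pca` object.

```python
import numpy as np

# explained variance ratios (in percent) as printed above
ratios_percent = np.array([44.45323967, 11.95419553, 9.27769965, 5.35220998, 4.56140915,
                           4.19499958, 3.55462455, 2.88734001, 2.31394959, 1.91315067])

# running total: entry i is the variance covered by the first i+1 components
cumulative = np.cumsum(ratios_percent)
print(cumulative.round(2))
```

The last entry of `cumulative` is the total variance covered by all 10 components, roughly 90%.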

In [14]:
df = pd.DataFrame()
for i in range(pca.explained_variance_ratio_.shape[0]):
    df["pc%i" % (i+1)] = XT[:,i]
df.head()

Unnamed: 0,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10
0,3.144432,-2.302698,-0.893383,-1.106093,-0.887244,2.117538,0.807772,-0.141952,0.028326,-0.109218
1,-2.415338,0.686656,-0.762651,-1.030448,0.534132,-0.146047,-0.493514,-0.242101,0.808496,-0.557066
2,-1.542967,-0.218857,-0.82451,-0.649179,-0.5941,0.47949,-0.000753,-0.41024,0.831409,-0.265242
3,-1.418571,-0.387407,-1.194695,-0.928402,-0.364667,0.433959,-0.030995,0.112886,0.731877,0.046696
4,-3.250346,0.909308,-0.256969,-0.290185,0.837061,0.2283,-0.585172,-0.931123,0.464226,-0.181306


In [13]:
#params_grid = {'C':[100,1000,10000],'gamma':[0.1,0.01,0.001],'kernel':['rbf']}

bal = []
recall = []

kf = StratifiedKFold(n_splits=5)

for train,test in kf.split(df,y):
    Xtrain,Xtest = df.values[train],df.values[test]
    ytrain,ytest = y.values[train],y.values[test]
    svm = SVC(kernel = "poly",gamma = "auto")
    #grid = GridSearchCV(svm,params_grid,refit = True,verbose = 0,scoring = 'balanced_accuracy',cv = 5)
    svm.fit(Xtrain,ytrain)
    #print(grid.best_estimator_)
    grid_predictions = svm.predict(Xtest)
    r = recall_score(ytest,grid_predictions)
    b = balanced_accuracy_score(ytest,grid_predictions)
    print("Recall:",r)
    print("Batch Balanced Accuracy:",b)
    bal += [b]
    recall += [r]

bal = np.array(bal)
recall = np.array(recall)
print("===============")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Recall: 0.2222222222222222
Batch Balanced Accuracy: 0.5945136007376671
Recall: 0.3055555555555556
Batch Balanced Accuracy: 0.6215277777777778
Recall: 0.1388888888888889
Batch Balanced Accuracy: 0.5611111111111111
Recall: 0.34285714285714286
Batch Balanced Accuracy: 0.669345238095238
Recall: 0.2857142857142857
Batch Balanced Accuracy: 0.6324404761904762
Average Balanced Accuracy: 0.6157876407824541
Std Balanced Accuracy: 0.03638898186676437
Average Recall Score: 0.259047619047619
Std Recall Score: 0.0716831442324107


In [18]:
#params_grid = {'C':[100,1000,10000],'gamma':[0.1,0.01,0.001],'kernel':['rbf']}

bal = []
recall = []

kf = StratifiedKFold(n_splits=5)

for train,test in kf.split(df,y):
    Xtrain,Xtest = df.values[train],df.values[test]
    ytrain,ytest = y.values[train],y.values[test]
    svm = SVC(kernel = "rbf",gamma = "auto")
    #grid = GridSearchCV(svm,params_grid,refit = True,verbose = 0,scoring = 'balanced_accuracy',cv = 5)
    svm.fit(Xtrain,ytrain)
    #print(grid.best_estimator_)
    grid_predictions = svm.predict(Xtest)
    r = recall_score(ytest,grid_predictions)
    b = balanced_accuracy_score(ytest,grid_predictions)
    print("Recall:",r)
    print("Batch Balanced Accuracy:",b)
    bal += [b]
    recall += [r]

bal = np.array(bal)
recall = np.array(recall)
print("===============")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Recall: 0.1388888888888889
Batch Balanced Accuracy: 0.563220378054403
Recall: 0.19444444444444445
Batch Balanced Accuracy: 0.576388888888889
Recall: 0.19444444444444445
Batch Balanced Accuracy: 0.5972222222222222
Recall: 0.2
Batch Balanced Accuracy: 0.5979166666666667
Recall: 0.17142857142857143
Batch Balanced Accuracy: 0.5836309523809524
Average Balanced Accuracy: 0.5836758216426267
Std Balanced Accuracy: 0.013098345011952712
Average Recall Score: 0.17984126984126986
Std Recall Score: 0.022718938438430866


In [16]:
#params_grid = {'C':[100,1000,10000],'gamma':[0.1,0.01,0.001],'kernel':['rbf']}

bal = []
recall = []

kf = StratifiedKFold(n_splits=5)

for train,test in kf.split(df,y):
    Xtrain,Xtest = df.values[train],df.values[test]
    ytrain,ytest = y.values[train],y.values[test]
    svm = SVC(kernel = "linear",gamma = "auto")
    #grid = GridSearchCV(svm,params_grid,refit = True,verbose = 0,scoring = 'balanced_accuracy',cv = 5)
    svm.fit(Xtrain,ytrain)
    #print(grid.best_estimator_)
    grid_predictions = svm.predict(Xtest)
    r = recall_score(ytest,grid_predictions)
    b = balanced_accuracy_score(ytest,grid_predictions)
    print("Recall:",r)
    print("Batch Balanced Accuracy:",b)
    bal += [b]
    recall += [r]

bal = np.array(bal)
recall = np.array(recall)
print("===============")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Recall: 0.19444444444444445
Batch Balanced Accuracy: 0.5868487782388198
Recall: 0.25
Batch Balanced Accuracy: 0.6104166666666666
Recall: 0.027777777777777776
Batch Balanced Accuracy: 0.5138888888888888
Recall: 0.02857142857142857
Batch Balanced Accuracy: 0.5142857142857142
Recall: 0.0
Batch Balanced Accuracy: 0.4979166666666667
Average Balanced Accuracy: 0.5446713429493513
Std Balanced Accuracy: 0.04507378240143474
Average Recall Score: 0.10015873015873016
Std Recall Score: 0.10172287580396641


In [17]:
#params_grid = {'C':[100,1000,10000],'gamma':[0.1,0.01,0.001],'kernel':['rbf']}

bal = []
recall = []

kf = StratifiedKFold(n_splits=5)

for train,test in kf.split(df,y):
    Xtrain,Xtest = df.values[train],df.values[test]
    ytrain,ytest = y.values[train],y.values[test]
    svm = SVC(kernel = "sigmoid",gamma = "auto")
    #grid = GridSearchCV(svm,params_grid,refit = True,verbose = 0,scoring = 'balanced_accuracy',cv = 5)
    svm.fit(Xtrain,ytrain)
    #print(grid.best_estimator_)
    grid_predictions = svm.predict(Xtest)
    r = recall_score(ytest,grid_predictions)
    b = balanced_accuracy_score(ytest,grid_predictions)
    print("Recall:",r)
    print("Batch Balanced Accuracy:",b)
    bal += [b]
    recall += [r]

bal = np.array(bal)
recall = np.array(recall)
print("===============")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Recall: 0.19444444444444445
Batch Balanced Accuracy: 0.5225334255417243
Recall: 0.25
Batch Balanced Accuracy: 0.5541666666666667
Recall: 0.3611111111111111
Batch Balanced Accuracy: 0.6243055555555556
Recall: 0.08571428571428572
Batch Balanced Accuracy: 0.5011904761904762
Recall: 0.2
Batch Balanced Accuracy: 0.5708333333333333
Average Balanced Accuracy: 0.5546058914575511
Std Balanced Accuracy: 0.04243445680505266
Average Recall Score: 0.21825396825396828
Std Recall Score: 0.08931163644867243


We used 10 principal components, which together cover around 90% of the total variance.
Since different values of C and gamma are optimal in each fold, we conclude that further hyper-parameter tuning would not improve the results much and would mainly over-fit the data. Hence we report the results on the complete dataset as follows:

- *Balanced Accuracy*: **0.83** (inflated by the absence of defective samples in certain folds)
- *Recall Score*: **0.143**
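
For reference, both metrics can be derived from the four confusion counts: recall is the fraction of defective samples that are detected, and balanced accuracy in the binary case is the mean of recall and specificity. The sketch below is a small NumPy re-implementation for illustration only; the notebook itself uses `recall_score` and `balanced_accuracy_score` from scikit-learn, and the helper name and toy labels are assumptions.

```python
import numpy as np

def recall_and_balanced_accuracy(y_true, y_pred):
    """Return (recall, balanced accuracy) for binary labels, 1/True = defective."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)      # defective, predicted defective
    fn = np.sum(y_true & ~y_pred)     # defective, missed
    tn = np.sum(~y_true & ~y_pred)    # clean, predicted clean
    fp = np.sum(~y_true & y_pred)     # clean, flagged as defective
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return recall, (recall + specificity) / 2

y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
rec, bal_acc = recall_and_balanced_accuracy(y_true_toy, y_pred_toy)
print(rec, bal_acc)
```

A classifier that predicts almost everything as non-defective keeps a high specificity, which is why balanced accuracy can stay above 0.5 even when recall is very low.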

Next we try **Logistic Regression** on the same PCA-transformed data.

In [19]:
bal = []
recall = []

#param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }

for train,test in kf.split(df,y):
    Xtrain,Xtest = df.values[train],df.values[test]
    ytrain,ytest = y.values[train],y.values[test]
    lr = LogisticRegression(solver = 'lbfgs')
    #grid = GridSearchCV(lr,param_grid,refit = True,verbose = 0,scoring = 'balanced_accuracy',cv = 5)
    lr.fit(Xtrain,ytrain)
    #print(grid.best_estimator_)
    y_pred = lr.predict(Xtest)
    r = recall_score(ytest,y_pred)
    b = balanced_accuracy_score(ytest,y_pred)
    print("Recall:",r)
    print("Batch Balanced Accuracy:",b)
    bal += [b]
    recall += [r]

print("============")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Recall: 0.25
Batch Balanced Accuracy: 0.591804979253112
Recall: 0.3611111111111111
Batch Balanced Accuracy: 0.6618055555555555
Recall: 0.3333333333333333
Batch Balanced Accuracy: 0.6520833333333333
Recall: 0.42857142857142855
Batch Balanced Accuracy: 0.7080357142857143
Recall: 0.2
Batch Balanced Accuracy: 0.59375
Average Balanced Accuracy: 0.641495916485543
Std Balanced Accuracy: 0.04404765898440592
Average Recall Score: 0.3146031746031746
Std Recall Score: 0.08100620681748974


Next we try ensemble-based models such as:
- Random Forest
- Bagging

In [20]:
from sklearn.ensemble import RandomForestClassifier

bal = []
recall = []

for train,test in kf.split(df,y):
    Xtrain,Xtest = df.values[train],df.values[test]
    ytrain,ytest = y.values[train],y.values[test]
    rf = RandomForestClassifier(n_estimators = 100)
    #grid = GridSearchCV(lr,param_grid,refit = True,verbose = 0,scoring = 'balanced_accuracy',cv = 5)
    rf.fit(Xtrain,ytrain)
    #print(grid.best_estimator_)
    y_pred = rf.predict(Xtest)
    r = recall_score(ytest,y_pred)
    b = balanced_accuracy_score(ytest,y_pred)
    print("Recall:",r)
    print("Batch Balanced Accuracy:",b)
    bal += [b]
    recall += [r]

bal = np.array(bal)
recall = np.array(recall)
print("===========")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Recall: 0.1388888888888889
Batch Balanced Accuracy: 0.5486975564776395
Recall: 0.2777777777777778
Batch Balanced Accuracy: 0.5993055555555555
Recall: 0.16666666666666666
Batch Balanced Accuracy: 0.5729166666666666
Recall: 0.2857142857142857
Batch Balanced Accuracy: 0.6303571428571428
Recall: 0.14285714285714285
Batch Balanced Accuracy: 0.5610119047619048
Average Balanced Accuracy: 0.5824577652637818
Std Balanced Accuracy: 0.02921954005092046
Average Recall Score: 0.20238095238095238
Std Recall Score: 0.06554229467321454


In [21]:
from sklearn.naive_bayes import GaussianNB

bal = []
recall = []

for train,test in kf.split(df,y):
    Xtrain,Xtest = df.values[train],df.values[test]
    ytrain,ytest = y.values[train],y.values[test]
    nb = GaussianNB()
    #grid = GridSearchCV(lr,param_grid,refit = True,verbose = 0,scoring = 'balanced_accuracy',cv = 5)
    nb.fit(Xtrain,ytrain)
    #print(grid.best_estimator_)
    y_pred = nb.predict(Xtest)
    r = recall_score(ytest,y_pred)
    b = balanced_accuracy_score(ytest,y_pred)
    print("Recall:",r)
    print("Batch Balanced Accuracy:",b)
    bal += [b]
    recall += [r]

bal = np.array(bal)
recall = np.array(recall)
print("===========")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Recall: 0.0
Batch Balanced Accuracy: 0.487551867219917
Recall: 0.3055555555555556
Batch Balanced Accuracy: 0.6111111111111112
Recall: 0.1388888888888889
Batch Balanced Accuracy: 0.5569444444444445
Recall: 0.11428571428571428
Batch Balanced Accuracy: 0.5321428571428571
Recall: 0.02857142857142857
Batch Balanced Accuracy: 0.5059523809523809
Average Balanced Accuracy: 0.5387405321741421
Std Balanced Accuracy: 0.04314569437098532
Average Recall Score: 0.11746031746031746
Std Recall Score: 0.10728209647342812


In [23]:
import tensorflow as tf
from os_elm import OS_ELM

In [24]:
def softmax(a):
    c = np.max(a, axis=-1).reshape(-1, 1)
    exp_a = np.exp(a - c)
    sum_exp_a = np.sum(exp_a, axis=-1).reshape(-1, 1)
    return exp_a / sum_exp_a

In [26]:
result = []
for j in range(10,20,1):
    mbal = []
    mrecall = []
    for k in range(10):
        tf.reset_default_graph()
        n_input_nodes = 10
        n_hidden_nodes = j
        n_output_nodes = 2
        bal = []
        recall = []

        for train,test in kf.split(df,y):
            Xtrain,Xtest = df.values[train],df.values[test]
            ytrain,ytest = y.values[train],y.values[test]
            ytrain = ytrain.reshape(-1,1)
            ytest = ytest.reshape(-1,1)
            ytrain = np.hstack((ytrain == False,ytrain))
            os_elm1 = OS_ELM(
                # the number of input nodes.
                n_input_nodes=n_input_nodes,
                # the number of hidden nodes.
                n_hidden_nodes=n_hidden_nodes,
                # the number of output nodes.
                n_output_nodes=n_output_nodes,
                # loss function.
                # the default value is 'mean_squared_error'.
                # for the other functions, we support
                # 'mean_absolute_error', 'categorical_crossentropy', and 'binary_crossentropy'.
                loss='binary_crossentropy',
                # activation function applied to the hidden nodes.
                # the default value is 'sigmoid'.
                # for the other functions, we support 'linear' and 'tanh'.
                # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
                activation='sigmoid',
            )
            border = int(2*n_hidden_nodes)
            Xtrain_init = Xtrain[:border]
            Xtrain_seq = Xtrain[border:]
            ytrain_init = ytrain[:border]
            ytrain_seq = ytrain[border:]
            os_elm1.init_train(Xtrain_init, ytrain_init)
            batch_size = 64
            for i in range(0, len(Xtrain_seq), batch_size):
                x_batch = Xtrain_seq[i:i+batch_size]
                t_batch = ytrain_seq[i:i+batch_size]
                os_elm1.seq_train(x_batch, t_batch)
            n_classes = n_output_nodes
            y_pred = os_elm1.predict(Xtest)
            y_pred = softmax(y_pred)
            res = []
            for ys in y_pred:
                res.append(np.argmax(ys))
            res = np.array(res)
            #print(res)
            #print(ytest)
            r = recall_score(ytest,res)
            b = balanced_accuracy_score(ytest,res)
            #print("Recall:",r)
            #print("Batch Balanced Accuracy:",b)
            bal += [b]
            recall += [r]
            tf.reset_default_graph()

        bal = np.array(bal)
        recall = np.array(recall)
        mbal.append(np.mean(bal))
        mrecall.append(np.mean(recall))
    result.append((j,np.mean(mbal),np.std(mbal),np.mean(mrecall),np.std(mrecall)))
        #print("=============")
        #print("Average Balanced Accuracy:",np.mean(bal))
        #print("Std Balanced Accuracy:",np.std(bal))
        #print("Average Recall Score:",np.mean(recall))
        #print("Std Recall Score:",np.std(recall))
        #print(j)
        #print()

In [27]:
for i in result:
    print("J:",i[0])
    print("Average Balanced Accuracy:",i[1])
    print("Std Balanced Accuracy:",i[2])
    print("Average Recall:",i[3])
    print("Std Recall:",i[4])

J: 10
Average Balanced Accuracy: 0.511593089310413
Std Balanced Accuracy: 0.007613016449780367
Average Recall: 0.029095238095238098
Std Recall: 0.017924711670924156
J: 11
Average Balanced Accuracy: 0.5116683955739972
Std Balanced Accuracy: 0.010140355992995327
Average Recall: 0.02733333333333333
Std Recall: 0.021949092478273525
J: 12
Average Balanced Accuracy: 0.5141418280313508
Std Balanced Accuracy: 0.010848244610680884
Average Recall: 0.032444444444444435
Std Recall: 0.023049477383649288
J: 13
Average Balanced Accuracy: 0.5251242343410394
Std Balanced Accuracy: 0.016054996434536943
Average Recall: 0.06023809523809524
Std Recall: 0.041647554195548525
J: 14
Average Balanced Accuracy: 0.5187456118685372
Std Balanced Accuracy: 0.009321795979175342
Average Recall: 0.04373015873015874
Std Recall: 0.020644106624458116
J: 15
Average Balanced Accuracy: 0.5289099486267536
Std Balanced Accuracy: 0.010906190757156415
Average Recall: 0.06447619047619048
Std Recall: 0.023637985239502064
J: 16
Ave

In [21]:
from sklearn.mixture import GaussianMixture

bal = []
recall = []

for train,test in kf.split(df,y):
    Xtrain,Xtest = df.values[train],df.values[test]
    ytrain,ytest = y.values[train],y.values[test]
    gmm = GaussianMixture(n_components=2)
    #grid = GridSearchCV(lr,param_grid,refit = True,verbose = 0,scoring = 'balanced_accuracy',cv = 5)
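    # GaussianMixture is unsupervised: fit() ignores ytrain, so the component
    # indices returned by predict() are not guaranteed to line up with the class labels
    gmm.fit(Xtrain,ytrain)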
    #print(grid.best_estimator_)
    y_pred = gmm.predict(Xtest)
    r = recall_score(ytest,y_pred)
    b = balanced_accuracy_score(ytest,y_pred)
    print("Recall:",r)
    print("Batch Balanced Accuracy:",b)
    bal += [b]
    recall += [r]

bal = np.array(bal)
recall = np.array(recall)
print("===========")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Recall: 0.6111111111111112
Batch Balanced Accuracy: 0.3926924850161365
Recall: 0.4166666666666667
Batch Balanced Accuracy: 0.4854166666666667
Recall: 0.6388888888888888
Batch Balanced Accuracy: 0.6715277777777777
Recall: 0.6857142857142857
Batch Balanced Accuracy: 0.6241071428571429
Recall: 0.6571428571428571
Batch Balanced Accuracy: 0.7348214285714285
Average Balanced Accuracy: 0.5817131001778305
Std Balanced Accuracy: 0.1251462244365645
Average Recall Score: 0.6019047619047619
Std Recall Score: 0.09575073669548127


Random Forest produces results similar to the SVMs.

In the runs above, however, the average recall (0.20) and balanced accuracy (0.58) of **Random Forest** are lower than those of **Logistic Regression** (0.31 and 0.64) and the polynomial-kernel **SVM** (0.26 and 0.62).

Next we use the **KMFOS** technique, which generates synthetic defective samples to balance the dataset.
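
The over-sampling step in this notebook's `overSamplingM` function works on pairs of clusters of defective instances: each pair receives a share of the new samples proportional to its combined size, and each synthetic sample is a size-weighted combination of one point from each cluster. The sketch below isolates those two formulas on toy numbers; the helper names `pair_allocation` and `interpolate` are hypothetical and are not part of the notebook.

```python
import numpy as np

def pair_allocation(cluster_sizes, n_new):
    """Share of the n_new synthetic samples assigned to each pair of clusters."""
    k = len(cluster_sizes)
    n_plus = sum(cluster_sizes)        # total number of defective instances
    alloc = {}
    for i in range(k):
        for j in range(i + 1, k):
            pair_size = cluster_sizes[i] + cluster_sizes[j]
            # every cluster appears in k-1 pairs, so the shares sum to n_new
            alloc[(i, j)] = pair_size / ((k - 1) * n_plus) * n_new
    return alloc

def interpolate(p, q, n_i, n_j):
    """Combine point p (from cluster i) and q (from cluster j), weighted by cluster size."""
    alpha = n_i / (n_i + n_j)
    beta = n_j / (n_i + n_j)
    return alpha * np.asarray(p, dtype=float) + beta * np.asarray(q, dtype=float)

alloc = pair_allocation([5, 3, 2], n_new=20)
print(alloc)
synthetic = interpolate([1.0, 2.0], [3.0, 6.0], n_i=3, n_j=1)
print(synthetic)
```

Because the weights follow the cluster sizes, a synthetic sample lies closer to the point drawn from the larger cluster.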

In [15]:
# Code for KMFOS

from sklearn.neighbors import NearestNeighbors

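# Noise filtering step

def clni(n):
    """Return the row indices of compData whose label disagrees with most of their neighbours.

    Relies on the global DataFrame compData. Each row's n nearest neighbours
    include the row itself, so n-1 other rows vote.
    """
    nbrs = NearestNeighbors(n_neighbors = n, algorithm = 'auto', p = 2).fit(compData)
    distances,indices = nbrs.kneighbors(compData)
    dropped = []
    for i in range(len(compData)):
        t,f = 0,0
        # indices[i][0] is normally the row itself, so start from the second neighbour
        for j in range(n-1):
            if compData['defects'][indices[i][j+1]] == True:
                t += 1
            else:
                f += 1
        # drop the row when the majority of its neighbours carry the opposite label
        if t>f and compData['defects'][i] == False:
            dropped.append(i)
        elif f>t and compData['defects'][i] == True:
            dropped.append(i)
    return dropped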

# Over-sampling step

def overSamplingM(clusters,d,nplus,n):
    """Append synthetic defective samples to d and return the extended DataFrame.

    clusters: dict of DataFrames, one per cluster of defective instances
    d:        training DataFrame whose last column is 'defects'
    nplus:    number of defective instances in d
    n:        number of synthetic samples to generate
    """
    k = len(clusters)
    for i in range(k):
        ni = len(clusters[i])
        for j in range(i+1,k):
            nj = len(clusters[j])
            if ni + nj == 0:
                continue
            alpha = ni/(ni+nj)
            beta = nj/(ni+nj)
            # share of the n new samples assigned to this pair of clusters
            total = ((ni+nj)/((k-1)*nplus))*n
            r = int(total)
            if(total-r > 0.5):
                r = r+1
            for l in range(r):
                # draw one random sample from each non-empty cluster
                p = clusters[i].sample() if ni else None
                q = clusters[j].sample() if nj else None
                if p is not None and q is not None:
                    m = alpha*p[p.columns[:-1]] + beta*q[q.columns[:-1]].values
                elif p is not None:
                    m = alpha*p[p.columns[:-1]]
                else:
                    m = beta*q[q.columns[:-1]]
                m['defects'] = True
                # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
                d = pd.concat([d,m],ignore_index = True)
    return d


# Code for performing initial clustering

def InitialClustering(dat,k):
    kmeans = KMeans(n_clusters = k, init = 'k-means++',max_iter=300,n_init = 10,random_state = 0)
    kmeans.fit(dat)
    clusters = {}
    for i in range(len(kmeans.labels_)):
        if kmeans.labels_[i] in clusters:
            clusters[kmeans.labels_[i]].append(dat.index[i])
        else:
            clusters[kmeans.labels_[i]] = [dat.index[i]]
    for key in clusters.keys():
        clusters[key] = np.array(clusters[key])
    return clusters

In [16]:
sum(y == 1)

178

In [17]:
df
data = df
data['defects'] = (y == 1)
sum(data['defects'])

178

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedKFold

kf5 = StratifiedKFold(n_splits=5)

folds = []
for train,test in kf5.split(df[df.columns[:-1]],df[df.columns[-1]]):
    Xtrain = df.loc[train]
    Xtest,ytest = df.values[:,:-1][test],df.values[:,-1][test]
    test = (Xtest,ytest)
    params = {3:[5,15,20],5:[5,15,20],20:[5,15,20],50:[5,15,20]}
    result = {}
    key = 3
    val = 5
    for key in params.keys():
        for val in params[key]:
            X = Xtrain.copy()
            temp = Xtrain.defects
            X.index = [i for i in range(len(X))]
            Np = sum(X['defects'] == True)
            Nm = sum(X['defects'] == False)
            N = Nm - Np
            D = X.loc[X['defects'] == True]
            clusters = InitialClustering(D,key)
            for i in clusters.keys():
                clusters[i] = X.loc[clusters[i]]
            compData = overSamplingM(clusters,X,Np,N)
            compData = compData.dropna()
            #print("DefectiveInstances:",len(compData.groupby("defects").groups[True]))
            #print("NonDefectiveInstances:",len(compData.groupby("defects").groups[False]))
            dropped = clni(val)
            result[(key,val)] = compData.drop(index = dropped)
            #print("Defect:",len(result[(key,val)].groupby('defects').groups[True]))
            #print("NonDefect:",len(result[(key,val)].groupby('defects').groups[False]))
    print("Complete")
    folds.append((result,test))

Complete
Complete
Complete
Complete
Complete


The KMFOS method has two parameters: **k**, the initial number of clusters, and **kn**, the number of neighbors used in the noise-filtering step. For more details, refer to the paper. We fit the model on the balanced dataset for each combination of parameter values and average the results. We use an **80-20** train/test split for each dataset, which is what each fold of the 5-fold cross-validation provides.
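
The role of **kn** can be illustrated with a small stand-alone sketch of the noise-filtering idea: a sample is flagged when most of its nearest neighbours carry the opposite label. This version uses brute-force NumPy distances instead of `NearestNeighbors`, and here `kn` counts neighbours excluding the sample itself; the function name `noisy_indices` and the toy data are assumptions for illustration.

```python
import numpy as np

def noisy_indices(X, y, kn):
    """Indices of samples whose label disagrees with the majority of their kn nearest neighbours."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=bool)
    # pairwise Euclidean distances between all samples
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)     # a sample is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :kn]
    flagged = []
    for i, nbrs in enumerate(nearest):
        t = int(y[nbrs].sum())         # neighbours labelled defective
        f = kn - t                     # neighbours labelled non-defective
        if (t > f and not y[i]) or (f > t and y[i]):
            flagged.append(i)
    return flagged

# four non-defective corner points with one defective point in their middle,
# plus a separate group of defective points far away
X_toy = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
         [10, 10], [10, 11], [11, 10], [11, 11]]
y_toy = [False, False, False, False, True, True, True, True, True]
print(noisy_indices(X_toy, y_toy, kn=3))
```

Only the defective point surrounded by non-defective neighbours is flagged; the distant defective group is left untouched.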

In [19]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        lr = LogisticRegression(solver = "lbfgs")
        lr.fit(X,Y)
        y_pred = lr.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]        

        
mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7651344980131286
Std Balanced Accuracy: 0.07166135565446739
Average Recall Score: 0.7636772486772487
Std Recall Score: 0.08957114283043367
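
For reference, the two scores printed after each cell can be written in terms of confusion-matrix counts. This assumes the defective class is the positive one, which matches the boolean `defects` labels. With $TP$, $FN$, $TN$ and $FP$ denoting true positives, false negatives, true negatives and false positives:

$$
\text{Recall} = \frac{TP}{TP + FN},
\qquad
\text{Balanced accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)
$$

As a made-up example, a test fold with $TP = 2$, $FN = 8$, $TN = 88$, $FP = 2$ has a plain accuracy of $90/100 = 0.90$, but a recall of only $2/10 = 0.2$ and a balanced accuracy of $(0.2 + 88/90)/2 \approx 0.589$. This is why these two scores, rather than plain accuracy, are the informative ones on imbalanced defect data.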


In [20]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        svm = SVC(gamma = 'auto',kernel = 'rbf')
        svm.fit(X,Y)
        y_pred = svm.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7265754050582888
Std Balanced Accuracy: 0.05370289988647389
Average Recall Score: 0.609920634920635
Std Recall Score: 0.06775946485270895


In [21]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        svm = SVC(gamma = 'auto',kernel = 'linear')
        svm.fit(X,Y)
        y_pred = svm.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7620714162220905
Std Balanced Accuracy: 0.07507615214295076
Average Recall Score: 0.7749603174603175
Std Recall Score: 0.09293406045662977


In [22]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        svm = SVC(gamma = 'auto',kernel = 'poly')
        svm.fit(X,Y)
        y_pred = svm.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7473892127159762
Std Balanced Accuracy: 0.04621690101978553
Average Recall Score: 0.7337433862433862
Std Recall Score: 0.04723123237445138


In [23]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        svm = SVC(gamma = 'auto',kernel = 'sigmoid')
        svm.fit(X,Y)
        y_pred = svm.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.6763527739357615
Std Balanced Accuracy: 0.06585167282327455
Average Recall Score: 0.698425925925926
Std Recall Score: 0.11453976606704763


In [26]:
from sklearn.ensemble import RandomForestClassifier
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        rf = RandomForestClassifier(n_estimators=100)
        rf.fit(X,Y)
        y_pred = rf.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7356835152802477
Std Balanced Accuracy: 0.05955554988040364
Average Recall Score: 0.6412301587301588
Std Recall Score: 0.06284247452796791


In [27]:
from sklearn.naive_bayes import GaussianNB
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        nb = GaussianNB()
        nb.fit(X,Y)
        y_pred = nb.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7144671672265034
Std Balanced Accuracy: 0.04782584579168045
Average Recall Score: 0.6097619047619048
Std Recall Score: 0.11841147347244109


### OS-ELM

In [35]:
n_hidden_nodes = 20
n_output_nodes = 2
n_input_nodes = 10

In [33]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects'].values
        tf.reset_default_graph()
        # Supported losses: 'mean_squared_error' (default), 'mean_absolute_error',
        # 'categorical_crossentropy', and 'binary_crossentropy'.
        # Supported activations: 'sigmoid' (default), 'linear', and 'tanh'.
        # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
        os_elm1 = OS_ELM(
            n_input_nodes=10,
            n_hidden_nodes=20,
            n_output_nodes=2,
            loss='binary_crossentropy',
            activation='sigmoid',
        )
        Y = Y.reshape(-1,1)
        #print(sum(Y),Y.shape)
        Y = np.hstack((Y==False,Y))
        #print(Y)
        border = int(2*n_hidden_nodes)
        Xtrain_init = X[:border]
        Xtrain_seq = X[border:]
        ytrain_init = Y[:border]
        ytrain_seq = Y[border:]
        os_elm1.init_train(Xtrain_init, ytrain_init)
        batch_size = 64
        for i in range(0, len(Xtrain_seq), batch_size):
            x_batch = Xtrain_seq[i:i+batch_size]
            t_batch = ytrain_seq[i:i+batch_size]
            os_elm1.seq_train(x_batch, t_batch)
        n_classes = n_output_nodes
        y_pred = os_elm1.predict(fold[1][0])
        #print(y_pred)
        y_pred = softmax(y_pred)
        res = []
        for ys in y_pred:
            res.append(np.argmax(ys))
        res = np.array(res)
        #print(y_pred,res)
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,res)
        r = recall_score(l,res)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7372936631429889
Std Balanced Accuracy: 0.058316959838506495
Average Recall Score: 0.7309920634920636
Std Recall Score: 0.06641413864902
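
The OS-ELM cell above turns the boolean `defects` column into a two-column target and turns the network outputs back into a class with `argmax`. The sketch below isolates those two steps on made-up values. The `softmax` shown here is an assumed row-wise implementation, since the notebook's own `softmax` is defined elsewhere, and the raw outputs are hypothetical.

```python
import numpy as np

# Boolean defect labels for five hypothetical training instances.
Y = np.array([True, False, False, True, False]).reshape(-1, 1)

# Column 0 flags "not defective", column 1 flags "defective".
Y_onehot = np.hstack((Y == False, Y)).astype(float)

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical raw network outputs for three test instances.
raw = np.array([[0.9, 0.1], [0.2, 0.7], [0.4, 0.6]])
probs = softmax(raw)

# The predicted class is the index of the larger column: 1 means defective.
pred = np.argmax(probs, axis=1)
print(Y_onehot)
print(pred)
```

Because softmax preserves the ordering within each row, taking `argmax` of the raw outputs would give the same classes. The softmax step only matters if the probabilities themselves are needed.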


### Max-Vote SVM + LR + OS-ELM

In [36]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects'].values
        tf.reset_default_graph()
        # Supported losses: 'mean_squared_error' (default), 'mean_absolute_error',
        # 'categorical_crossentropy', and 'binary_crossentropy'.
        # Supported activations: 'sigmoid' (default), 'linear', and 'tanh'.
        # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
        os_elm1 = OS_ELM(
            n_input_nodes=n_input_nodes,
            n_hidden_nodes=9,
            n_output_nodes=2,
            loss='binary_crossentropy',
            activation='sigmoid',
        )
        Y = Y.reshape(-1,1)
        Yo = np.hstack((Y==False,Y))
        # NOTE: n_hidden_nodes here is the notebook-level variable, not the 9 passed to OS_ELM above.
        border = int(2*n_hidden_nodes)
        Xtrain_init = X[:border]
        Xtrain_seq = X[border:]
        ytrain_init = Yo[:border]
        ytrain_seq = Yo[border:]
        os_elm1.init_train(Xtrain_init, ytrain_init)
        batch_size = 64
        for i in range(0, len(Xtrain_seq), batch_size):
            x_batch = Xtrain_seq[i:i+batch_size]
            t_batch = ytrain_seq[i:i+batch_size]
            os_elm1.seq_train(x_batch, t_batch)
        n_classes = n_output_nodes
        y_predo = os_elm1.predict(fold[1][0])
        y_predo = softmax(y_predo)
        res = []
        for ys in y_predo:
            res.append(np.argmax(ys))
        res = np.array(res)
        
        svm = SVC(gamma = "auto",kernel = "rbf")
        svm.fit(X,Y.ravel())
        y_preds = svm.predict(fold[1][0])
        y_preds = y_preds.reshape(-1,1)
        
        lr = LogisticRegression(solver = "lbfgs")
        lr.fit(X,Y.ravel())
        y_predl = lr.predict(fold[1][0])
        y_predl = y_predl.reshape(-1,1)
        res = res.reshape(-1,1)
        y_pred = ((y_predl + y_preds + res) >= 2)*1
        y_pred = y_pred.ravel()
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7449138427846934
Std Balanced Accuracy: 0.06040398557176943
Average Recall Score: 0.6791666666666668
Std Recall Score: 0.0849880231724003
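
One caveat about the vote `(y_predl + y_preds + res) >= 2`: if the logistic regression and SVM predictions are boolean arrays, NumPy's `+` on two boolean arrays is an element-wise OR rather than a count. Boolean predictions are what one would expect when `defects` is a boolean column, but that dtype is an assumption here. In that case the first two votes collapse into one. The sketch below uses made-up predictions to show the difference and one way to count each classifier exactly once.

```python
import numpy as np

# Hypothetical predictions from three classifiers on four test instances.
y_predl = np.array([True, True, False, False])  # logistic regression (bool)
y_preds = np.array([True, False, True, False])  # SVM (bool)
res = np.array([0, 0, 0, 1])                    # OS-ELM argmax (int)

# bool + bool is an element-wise OR in NumPy, so two agreeing votes count as one.
naive_votes = (y_predl + y_preds) + res

# Casting to int first makes each classifier contribute exactly one vote.
votes = y_predl.astype(int) + y_preds.astype(int) + res

# Defective when at least two of the three classifiers agree.
y_pred = votes >= 2
print(naive_votes, votes, y_pred)
```

In the first instance both LR and SVM predict a defect, yet the naive sum is 1, so `>= 2` would reject it. With integer votes the same instance is correctly labelled defective.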


In [37]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects'].values
        tf.reset_default_graph()
        # Supported losses: 'mean_squared_error' (default), 'mean_absolute_error',
        # 'categorical_crossentropy', and 'binary_crossentropy'.
        # Supported activations: 'sigmoid' (default), 'linear', and 'tanh'.
        # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
        os_elm1 = OS_ELM(
            n_input_nodes=n_input_nodes,
            n_hidden_nodes=9,
            n_output_nodes=2,
            loss='binary_crossentropy',
            activation='sigmoid',
        )
        Y = Y.reshape(-1,1)
        Yo = np.hstack((Y==False,Y))
        # NOTE: n_hidden_nodes here is the notebook-level variable, not the 9 passed to OS_ELM above.
        border = int(2*n_hidden_nodes)
        Xtrain_init = X[:border]
        Xtrain_seq = X[border:]
        ytrain_init = Yo[:border]
        ytrain_seq = Yo[border:]
        os_elm1.init_train(Xtrain_init, ytrain_init)
        batch_size = 64
        for i in range(0, len(Xtrain_seq), batch_size):
            x_batch = Xtrain_seq[i:i+batch_size]
            t_batch = ytrain_seq[i:i+batch_size]
            os_elm1.seq_train(x_batch, t_batch)
        n_classes = n_output_nodes
        y_predo = os_elm1.predict(fold[1][0])
        y_predo = softmax(y_predo)
        res = []
        for ys in y_predo:
            res.append(np.argmax(ys))
        res = np.array(res)
        
        svm = SVC(gamma = "auto",kernel = "linear")
        svm.fit(X,Y.ravel())
        y_preds = svm.predict(fold[1][0])
        y_preds = y_preds.reshape(-1,1)
        
        lr = LogisticRegression(solver = "lbfgs")
        lr.fit(X,Y.ravel())
        y_predl = lr.predict(fold[1][0])
        y_predl = y_predl.reshape(-1,1)
        res = res.reshape(-1,1)
        y_pred = ((y_predl + y_preds + res) >= 2)*1
        y_pred = y_pred.ravel()
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7433521592131551
Std Balanced Accuracy: 0.06158221289013565
Average Recall Score: 0.6740740740740742
Std Recall Score: 0.09311106728998358


In [38]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects'].values
        tf.reset_default_graph()
        # Supported losses: 'mean_squared_error' (default), 'mean_absolute_error',
        # 'categorical_crossentropy', and 'binary_crossentropy'.
        # Supported activations: 'sigmoid' (default), 'linear', and 'tanh'.
        # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
        os_elm1 = OS_ELM(
            n_input_nodes=n_input_nodes,
            n_hidden_nodes=9,
            n_output_nodes=2,
            loss='binary_crossentropy',
            activation='sigmoid',
        )
        Y = Y.reshape(-1,1)
        Yo = np.hstack((Y==False,Y))
        # NOTE: n_hidden_nodes here is the notebook-level variable, not the 9 passed to OS_ELM above.
        border = int(2*n_hidden_nodes)
        Xtrain_init = X[:border]
        Xtrain_seq = X[border:]
        ytrain_init = Yo[:border]
        ytrain_seq = Yo[border:]
        os_elm1.init_train(Xtrain_init, ytrain_init)
        batch_size = 64
        for i in range(0, len(Xtrain_seq), batch_size):
            x_batch = Xtrain_seq[i:i+batch_size]
            t_batch = ytrain_seq[i:i+batch_size]
            os_elm1.seq_train(x_batch, t_batch)
        n_classes = n_output_nodes
        y_predo = os_elm1.predict(fold[1][0])
        y_predo = softmax(y_predo)
        res = []
        for ys in y_predo:
            res.append(np.argmax(ys))
        res = np.array(res)
        
        svm = SVC(gamma = "auto",kernel = "poly")
        svm.fit(X,Y.ravel())
        y_preds = svm.predict(fold[1][0])
        y_preds = y_preds.reshape(-1,1)
        
        lr = LogisticRegression(solver = "lbfgs")
        lr.fit(X,Y.ravel())
        y_predl = lr.predict(fold[1][0])
        y_predl = y_predl.reshape(-1,1)
        res = res.reshape(-1,1)
        y_pred = ((y_predl + y_preds + res) >= 2)*1
        y_pred = y_pred.ravel()
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.750285825429757
Std Balanced Accuracy: 0.047469850149202276
Average Recall Score: 0.7065476190476191
Std Recall Score: 0.05534597657193278


In [39]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects'].values
        tf.reset_default_graph()
        # Supported losses: 'mean_squared_error' (default), 'mean_absolute_error',
        # 'categorical_crossentropy', and 'binary_crossentropy'.
        # Supported activations: 'sigmoid' (default), 'linear', and 'tanh'.
        # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
        os_elm1 = OS_ELM(
            n_input_nodes=n_input_nodes,
            n_hidden_nodes=9,
            n_output_nodes=2,
            loss='binary_crossentropy',
            activation='sigmoid',
        )
        Y = Y.reshape(-1,1)
        Yo = np.hstack((Y==False,Y))
        # NOTE: n_hidden_nodes here is the notebook-level variable, not the 9 passed to OS_ELM above.
        border = int(2*n_hidden_nodes)
        Xtrain_init = X[:border]
        Xtrain_seq = X[border:]
        ytrain_init = Yo[:border]
        ytrain_seq = Yo[border:]
        os_elm1.init_train(Xtrain_init, ytrain_init)
        batch_size = 64
        for i in range(0, len(Xtrain_seq), batch_size):
            x_batch = Xtrain_seq[i:i+batch_size]
            t_batch = ytrain_seq[i:i+batch_size]
            os_elm1.seq_train(x_batch, t_batch)
        n_classes = n_output_nodes
        y_predo = os_elm1.predict(fold[1][0])
        y_predo = softmax(y_predo)
        res = []
        for ys in y_predo:
            res.append(np.argmax(ys))
        res = np.array(res)
        
        svm = SVC(gamma = "auto",kernel = "sigmoid")
        svm.fit(X,Y.ravel())
        y_preds = svm.predict(fold[1][0])
        y_preds = y_preds.reshape(-1,1)
        
        lr = LogisticRegression(solver = "lbfgs")
        lr.fit(X,Y.ravel())
        y_predl = lr.predict(fold[1][0])
        y_predl = y_predl.reshape(-1,1)
        res = res.reshape(-1,1)
        y_pred = ((y_predl + y_preds + res) >= 2)*1
        y_pred = y_pred.ravel()
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7383262942106305
Std Balanced Accuracy: 0.056923274689157435
Average Recall Score: 0.7165079365079364
Std Recall Score: 0.07629689483433559


In [53]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects'].values
        tf.reset_default_graph()
        # Supported losses: 'mean_squared_error' (default), 'mean_absolute_error',
        # 'categorical_crossentropy', and 'binary_crossentropy'.
        # Supported activations: 'sigmoid' (default), 'linear', and 'tanh'.
        # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
        os_elm1 = OS_ELM(
            n_input_nodes=n_input_nodes,
            n_hidden_nodes=9,
            n_output_nodes=2,
            loss='binary_crossentropy',
            activation='sigmoid',
        )
        Y = Y.reshape(-1,1)
        Yo = np.hstack((Y==False,Y))
        # NOTE: n_hidden_nodes here is the notebook-level variable, not the 9 passed to OS_ELM above.
        border = int(2*n_hidden_nodes)
        Xtrain_init = X[:border]
        Xtrain_seq = X[border:]
        ytrain_init = Yo[:border]
        ytrain_seq = Yo[border:]
        os_elm1.init_train(Xtrain_init, ytrain_init)
        batch_size = 64
        for i in range(0, len(Xtrain_seq), batch_size):
            x_batch = Xtrain_seq[i:i+batch_size]
            t_batch = ytrain_seq[i:i+batch_size]
            os_elm1.seq_train(x_batch, t_batch)
        n_classes = n_output_nodes
        y_predo = os_elm1.predict(fold[1][0])
        y_predo = softmax(y_predo)
        res = []
        for ys in y_predo:
            res.append(np.argmax(ys))
        res = np.array(res)
        
        svm = SVC(gamma = "auto",kernel = "rbf")
        svm.fit(X,Y.ravel())
        y_preds = svm.predict(fold[1][0])
        y_preds = y_preds.reshape(-1,1)
        
        lr = LogisticRegression(solver = "lbfgs")
        lr.fit(X,Y.ravel())
        y_predl = lr.predict(fold[1][0])
        y_predl = y_predl.reshape(-1,1)
        res = res.reshape(-1,1)
        y_pred = ((y_predl + y_preds + res) >= 2)*1
        y_pred = y_pred.ravel()
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7472250076840325
Std Balanced Accuracy: 0.05893185007351958
Average Recall Score: 0.6856481481481481
Std Recall Score: 0.07457009848778103


### Bagging SVM

In [40]:
from sklearn.ensemble import BaggingClassifier

mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        clf = BaggingClassifier(base_estimator=SVC(kernel = "linear",gamma = "auto"),n_estimators=10,random_state = 0)
        clf.fit(X,Y)
        y_pred = clf.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7636088196228238
Std Balanced Accuracy: 0.07364543883703369
Average Recall Score: 0.7716534391534392
Std Recall Score: 0.0915131772676298
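
The bagging idea behind these cells is to fit several copies of a base classifier on bootstrap resamples of the training data and combine their predictions. The sketch below shows it in plain NumPy. The `NearestCentroid` base learner, the `bagging_predict` function and the toy data are assumptions for illustration. scikit-learn's `BaggingClassifier` differs in details; for example, it averages predicted probabilities when the base estimator provides them.

```python
import numpy as np

class NearestCentroid:
    """Toy base learner: assigns each point to the class with the closer mean."""
    def fit(self, X, y):
        self.c0_ = X[~y].mean(axis=0)  # centroid of the non-defective class
        self.c1_ = X[y].mean(axis=0)   # centroid of the defective class
        return self

    def predict(self, X):
        d0 = np.linalg.norm(X - self.c0_, axis=1)
        d1 = np.linalg.norm(X - self.c1_, axis=1)
        return d1 < d0

def bagging_predict(X_train, y_train, X_test, n_estimators=10, seed=0):
    """Majority vote over base learners fitted on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = np.zeros(len(X_test), dtype=int)
    for _ in range(n_estimators):
        # Bootstrap: draw n row indices with replacement; redraw if a class is missing.
        idx = rng.integers(0, n, size=n)
        while y_train[idx].all() or not y_train[idx].any():
            idx = rng.integers(0, n, size=n)
        model = NearestCentroid().fit(X_train[idx], y_train[idx])
        votes += model.predict(X_test).astype(int)
    # Defective when more than half of the learners say so.
    return votes * 2 > n_estimators

# Two well-separated toy clusters: non-defective near the origin, defective near (5.5, 5.5).
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                    [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
y_train = np.array([False] * 4 + [True] * 4)
X_test = np.array([[0.5, 0.5], [5.5, 5.5]])

bagged = bagging_predict(X_train, y_train, X_test)
print(bagged)
```

Each learner sees a slightly different sample, so the vote tends to be more stable than any single fit. With a stable base learner and rebalanced data the gain may be small.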


In [41]:
from sklearn.ensemble import BaggingClassifier

mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        clf = BaggingClassifier(base_estimator=SVC(kernel = "poly",gamma = "auto"),n_estimators=10,random_state = 0)
        clf.fit(X,Y)
        y_pred = clf.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7418419174954444
Std Balanced Accuracy: 0.05037688840165136
Average Recall Score: 0.716058201058201
Std Recall Score: 0.05335729297730338


In [42]:
from sklearn.ensemble import BaggingClassifier

mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        clf = BaggingClassifier(base_estimator=SVC(kernel = "sigmoid",gamma = "auto"),n_estimators=10,random_state = 0)
        clf.fit(X,Y)
        y_pred = clf.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.6670409133570441
Std Balanced Accuracy: 0.05636988881637473
Average Recall Score: 0.647142857142857
Std Recall Score: 0.09419304089027483


In [43]:
from sklearn.ensemble import BaggingClassifier

mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        clf = BaggingClassifier(base_estimator=SVC(kernel = "rbf",gamma = "auto"),n_estimators=10,random_state = 0)
        clf.fit(X,Y)
        y_pred = clf.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7236979887044721
Std Balanced Accuracy: 0.051116873158549764
Average Recall Score: 0.6033333333333334
Std Recall Score: 0.06453701670369726


### Bagging LR

In [44]:
from sklearn.ensemble import BaggingClassifier

mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        clf = BaggingClassifier(base_estimator=LogisticRegression(solver = "lbfgs"),n_estimators=10,random_state = 0)
        clf.fit(X,Y)
        y_pred = clf.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.763119387911919
Std Balanced Accuracy: 0.07169393846185035
Average Recall Score: 0.7604100529100529
Std Recall Score: 0.08965659400561928
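
The bagging and boosting cells pass the base model as `base_estimator`. scikit-learn 1.2 renamed that argument to `estimator`, and 1.4 removed the old name, so the cells as written only run on older releases. A minimal sketch of a version-tolerant constructor follows; `make_ensemble` is a hypothetical helper, not part of scikit-learn or of this notebook.

```python
import inspect

from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

def make_ensemble(ensemble_cls, base, **kwargs):
    """Build an ensemble, passing the base model under whichever keyword
    the installed scikit-learn version accepts."""
    params = inspect.signature(ensemble_cls.__init__).parameters
    key = "estimator" if "estimator" in params else "base_estimator"
    return ensemble_cls(**{key: base}, **kwargs)

clf = make_ensemble(
    BaggingClassifier,
    LogisticRegression(solver="lbfgs"),
    n_estimators=10,
    random_state=0,
)
print(type(clf).__name__)
```

The same helper should work for `AdaBoostClassifier`, which went through the same rename.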


### LR + OS-ELM Hybrid

In [51]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects'].values
        tf.reset_default_graph()
        os_elm1 = OS_ELM(
            # the number of input nodes.
            n_input_nodes=n_input_nodes,
            # the number of hidden nodes.
            n_hidden_nodes=9,
            # the number of output nodes.
            n_output_nodes=2,
            # loss function (default: 'mean_squared_error'); also supported:
            # 'mean_absolute_error', 'categorical_crossentropy', 'binary_crossentropy'.
            loss='binary_crossentropy',
            # activation applied to the hidden nodes (default: 'sigmoid');
            # 'linear' and 'tanh' are also supported.
            # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
            activation='sigmoid',
        )
        Y = Y.reshape(-1,1)
        Yo = np.hstack((Y==False,Y))
        border = int(2*n_hidden_nodes)
        Xtrain_init = X[:border]
        Xtrain_seq = X[border:]
        ytrain_init = Yo[:border]
        ytrain_seq = Yo[border:]
        os_elm1.init_train(Xtrain_init, ytrain_init)
        batch_size = 64
        for i in range(0, len(Xtrain_seq), batch_size):
            x_batch = Xtrain_seq[i:i+batch_size]
            t_batch = ytrain_seq[i:i+batch_size]
            os_elm1.seq_train(x_batch, t_batch)
        n_classes = n_output_nodes
        y_predo = os_elm1.predict(fold[1][0])
        y_predo = softmax(y_predo)
        res = []
        for ys in y_predo:
            res.append(ys[1])
        res = np.array(res)
        #print("Res:",res)
        lr = LogisticRegression(solver = "lbfgs")
        lr.fit(X,Y.ravel())
        y_predl = lr.predict_proba(fold[1][0])
        #print("Y_predl:",y_predl)
        res2 = []
        for ys in y_predl:
            res2.append(ys[1])
        res2 = np.array(res2)
        #print("Res2:",res2)
        # Soft voting: average the defective-class probabilities from OS-ELM
        # (res) and logistic regression (res2), then threshold the mean at 0.5.
        y_pred = (res+res2)/2
        y_pred = (y_pred > 0.5)*1
        y_pred = y_pred.ravel()
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7673545591011878
Std Balanced Accuracy: 0.06796610641993406
Average Recall Score: 0.7725661375661377
Std Recall Score: 0.08233832573729914
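
The hybrid above is a two-model soft vote: each model contributes a probability for the defective class, the two are averaged, and the mean is thresholded. A minimal sketch of that step in isolation, with made-up probabilities standing in for the OS-ELM and LR outputs (`soft_vote` and its `threshold` parameter are illustrative names, not from the notebook):

```python
import numpy as np

def soft_vote(prob_a, prob_b, threshold=0.5):
    """Average two positive-class probability vectors and threshold the mean."""
    prob_a = np.asarray(prob_a, dtype=float)
    prob_b = np.asarray(prob_b, dtype=float)
    return ((prob_a + prob_b) / 2 > threshold).astype(int)

# Illustrative probabilities only.
p_oselm = np.array([0.9, 0.4, 0.2, 0.55])
p_lr = np.array([0.6, 0.7, 0.1, 0.40])
print(soft_vote(p_oselm, p_lr))
```

Lowering `threshold` would flag more modules as defective, which generally raises recall at the cost of specificity.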


### Max Vote with 5 classifiers (OS-ELM, SVM, LR, Random Forest, Naive Bayes)

In [46]:
# Majority vote; the SVM member uses a linear kernel.
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects'].values
        tf.reset_default_graph()
        os_elm1 = OS_ELM(
            # the number of input nodes.
            n_input_nodes=n_input_nodes,
            # the number of hidden nodes.
            n_hidden_nodes=9,
            # the number of output nodes.
            n_output_nodes=2,
            # loss function (default: 'mean_squared_error'); also supported:
            # 'mean_absolute_error', 'categorical_crossentropy', 'binary_crossentropy'.
            loss='binary_crossentropy',
            # activation applied to the hidden nodes (default: 'sigmoid');
            # 'linear' and 'tanh' are also supported.
            # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
            activation='sigmoid',
        )
        Y = Y.reshape(-1,1)
        Yo = np.hstack((Y==False,Y))
        border = int(2*n_hidden_nodes)
        Xtrain_init = X[:border]
        Xtrain_seq = X[border:]
        ytrain_init = Yo[:border]
        ytrain_seq = Yo[border:]
        os_elm1.init_train(Xtrain_init, ytrain_init)
        batch_size = 64
        for i in range(0, len(Xtrain_seq), batch_size):
            x_batch = Xtrain_seq[i:i+batch_size]
            t_batch = ytrain_seq[i:i+batch_size]
            os_elm1.seq_train(x_batch, t_batch)
        n_classes = n_output_nodes
        y_predo = os_elm1.predict(fold[1][0])
        y_predo = softmax(y_predo)
        res = []
        for ys in y_predo:
            res.append(np.argmax(ys))
        res = np.array(res)
        
        svm = SVC(gamma = "auto",kernel = "linear")
        svm.fit(X,Y.ravel())
        y_preds = svm.predict(fold[1][0])
        y_preds = y_preds.reshape(-1,1)
        
        lr = LogisticRegression(solver = "lbfgs")
        lr.fit(X,Y.ravel())
        y_predl = lr.predict(fold[1][0])
        y_predl = y_predl.reshape(-1,1)
        
        rf = RandomForestClassifier(n_estimators=100)
        rf.fit(X,Y.ravel())
        y_predr = rf.predict(fold[1][0])
        y_predr = y_predr.reshape(-1,1)
        
        nb = GaussianNB()
        nb.fit(X,Y.ravel())
        y_predn = nb.predict(fold[1][0])
        y_predn = y_predn.reshape(-1,1)
        
        res = res.reshape(-1,1)
        y_pred = ((y_predl + y_preds + res + y_predr + y_predn) >= 3)*1
        y_pred = y_pred.ravel()
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7403808192825309
Std Balanced Accuracy: 0.05264722735905099
Average Recall Score: 0.6430291005291006
Std Recall Score: 0.068391457437368


In [47]:
# Majority vote; the SVM member uses a polynomial kernel.
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects'].values
        tf.reset_default_graph()
        os_elm1 = OS_ELM(
            # the number of input nodes.
            n_input_nodes=n_input_nodes,
            # the number of hidden nodes.
            n_hidden_nodes=9,
            # the number of output nodes.
            n_output_nodes=2,
            # loss function (default: 'mean_squared_error'); also supported:
            # 'mean_absolute_error', 'categorical_crossentropy', 'binary_crossentropy'.
            loss='binary_crossentropy',
            # activation applied to the hidden nodes (default: 'sigmoid');
            # 'linear' and 'tanh' are also supported.
            # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
            activation='sigmoid',
        )
        Y = Y.reshape(-1,1)
        Yo = np.hstack((Y==False,Y))
        border = int(2*n_hidden_nodes)
        Xtrain_init = X[:border]
        Xtrain_seq = X[border:]
        ytrain_init = Yo[:border]
        ytrain_seq = Yo[border:]
        os_elm1.init_train(Xtrain_init, ytrain_init)
        batch_size = 64
        for i in range(0, len(Xtrain_seq), batch_size):
            x_batch = Xtrain_seq[i:i+batch_size]
            t_batch = ytrain_seq[i:i+batch_size]
            os_elm1.seq_train(x_batch, t_batch)
        n_classes = n_output_nodes
        y_predo = os_elm1.predict(fold[1][0])
        y_predo = softmax(y_predo)
        res = []
        for ys in y_predo:
            res.append(np.argmax(ys))
        res = np.array(res)
        
        svm = SVC(gamma = "auto",kernel = "poly")
        svm.fit(X,Y.ravel())
        y_preds = svm.predict(fold[1][0])
        y_preds = y_preds.reshape(-1,1)
        
        lr = LogisticRegression(solver = "lbfgs")
        lr.fit(X,Y.ravel())
        y_predl = lr.predict(fold[1][0])
        y_predl = y_predl.reshape(-1,1)
        
        rf = RandomForestClassifier(n_estimators=100)
        rf.fit(X,Y.ravel())
        y_predr = rf.predict(fold[1][0])
        y_predr = y_predr.reshape(-1,1)
        
        nb = GaussianNB()
        nb.fit(X,Y.ravel())
        y_predn = nb.predict(fold[1][0])
        y_predn = y_predn.reshape(-1,1)
        
        res = res.reshape(-1,1)
        y_pred = ((y_predl + y_preds + res + y_predr + y_predn) >= 3)*1
        y_pred = y_pred.ravel()
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.741451177577993
Std Balanced Accuracy: 0.05396348298828909
Average Recall Score: 0.6528703703703704
Std Recall Score: 0.06476605173550425


In [48]:
# Majority vote; the SVM member uses a sigmoid kernel.
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects'].values
        tf.reset_default_graph()
        os_elm1 = OS_ELM(
            # the number of input nodes.
            n_input_nodes=n_input_nodes,
            # the number of hidden nodes.
            n_hidden_nodes=9,
            # the number of output nodes.
            n_output_nodes=2,
            # loss function (default: 'mean_squared_error'); also supported:
            # 'mean_absolute_error', 'categorical_crossentropy', 'binary_crossentropy'.
            loss='binary_crossentropy',
            # activation applied to the hidden nodes (default: 'sigmoid');
            # 'linear' and 'tanh' are also supported.
            # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
            activation='sigmoid',
        )
        Y = Y.reshape(-1,1)
        Yo = np.hstack((Y==False,Y))
        border = int(2*n_hidden_nodes)
        Xtrain_init = X[:border]
        Xtrain_seq = X[border:]
        ytrain_init = Yo[:border]
        ytrain_seq = Yo[border:]
        os_elm1.init_train(Xtrain_init, ytrain_init)
        batch_size = 64
        for i in range(0, len(Xtrain_seq), batch_size):
            x_batch = Xtrain_seq[i:i+batch_size]
            t_batch = ytrain_seq[i:i+batch_size]
            os_elm1.seq_train(x_batch, t_batch)
        n_classes = n_output_nodes
        y_predo = os_elm1.predict(fold[1][0])
        y_predo = softmax(y_predo)
        res = []
        for ys in y_predo:
            res.append(np.argmax(ys))
        res = np.array(res)
        
        svm = SVC(gamma = "auto",kernel = "sigmoid")
        svm.fit(X,Y.ravel())
        y_preds = svm.predict(fold[1][0])
        y_preds = y_preds.reshape(-1,1)
        
        lr = LogisticRegression(solver = "lbfgs")
        lr.fit(X,Y.ravel())
        y_predl = lr.predict(fold[1][0])
        y_predl = y_predl.reshape(-1,1)
        
        rf = RandomForestClassifier(n_estimators=100)
        rf.fit(X,Y.ravel())
        y_predr = rf.predict(fold[1][0])
        y_predr = y_predr.reshape(-1,1)
        
        nb = GaussianNB()
        nb.fit(X,Y.ravel())
        y_predn = nb.predict(fold[1][0])
        y_predn = y_predn.reshape(-1,1)
        
        res = res.reshape(-1,1)
        y_pred = ((y_predl + y_preds + res + y_predr + y_predn) >= 3)*1
        y_pred = y_pred.ravel()
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7431310786186305
Std Balanced Accuracy: 0.058697265523007996
Average Recall Score: 0.6627513227513229
Std Recall Score: 0.069987563733091


In [49]:
# Majority vote; the SVM member uses an RBF kernel.
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects'].values
        tf.reset_default_graph()
        os_elm1 = OS_ELM(
            # the number of input nodes.
            n_input_nodes=n_input_nodes,
            # the number of hidden nodes.
            n_hidden_nodes=9,
            # the number of output nodes.
            n_output_nodes=2,
            # loss function (default: 'mean_squared_error'); also supported:
            # 'mean_absolute_error', 'categorical_crossentropy', 'binary_crossentropy'.
            loss='binary_crossentropy',
            # activation applied to the hidden nodes (default: 'sigmoid');
            # 'linear' and 'tanh' are also supported.
            # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
            activation='sigmoid',
        )
        Y = Y.reshape(-1,1)
        Yo = np.hstack((Y==False,Y))
        border = int(2*n_hidden_nodes)
        Xtrain_init = X[:border]
        Xtrain_seq = X[border:]
        ytrain_init = Yo[:border]
        ytrain_seq = Yo[border:]
        os_elm1.init_train(Xtrain_init, ytrain_init)
        batch_size = 64
        for i in range(0, len(Xtrain_seq), batch_size):
            x_batch = Xtrain_seq[i:i+batch_size]
            t_batch = ytrain_seq[i:i+batch_size]
            os_elm1.seq_train(x_batch, t_batch)
        n_classes = n_output_nodes
        y_predo = os_elm1.predict(fold[1][0])
        y_predo = softmax(y_predo)
        res = []
        for ys in y_predo:
            res.append(np.argmax(ys))
        res = np.array(res)
        
        svm = SVC(gamma = "auto",kernel = "rbf")
        svm.fit(X,Y.ravel())
        y_preds = svm.predict(fold[1][0])
        y_preds = y_preds.reshape(-1,1)
        
        lr = LogisticRegression(solver = "lbfgs")
        lr.fit(X,Y.ravel())
        y_predl = lr.predict(fold[1][0])
        y_predl = y_predl.reshape(-1,1)
        
        rf = RandomForestClassifier(n_estimators=100)
        rf.fit(X,Y.ravel())
        y_predr = rf.predict(fold[1][0])
        y_predr = y_predr.reshape(-1,1)
        
        nb = GaussianNB()
        nb.fit(X,Y.ravel())
        y_predn = nb.predict(fold[1][0])
        y_predn = y_predn.reshape(-1,1)
        
        res = res.reshape(-1,1)
        y_pred = ((y_predl + y_preds + res + y_predr + y_predn) >= 3)*1
        y_pred = y_pred.ravel()
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7413658299304047
Std Balanced Accuracy: 0.053499667921518924
Average Recall Score: 0.6482010582010581
Std Recall Score: 0.0703797359205126
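
The four max-vote cells count votes by adding the prediction arrays and comparing the sum with 3. One thing to check: in NumPy, `+` on two boolean arrays is a logical OR, so if the scikit-learn predictions come back as `bool` (which depends on the dtype of the `defects` column, not shown in this section), two agreeing votes can be counted as one. The sketch below casts every vote to `int` before summing; `majority_vote` is an illustrative helper and the arrays are made up.

```python
import numpy as np

def majority_vote(*predictions):
    """Hard vote over binary predictions: 1 where more than half say 1."""
    votes = np.stack([np.asarray(p).ravel().astype(int) for p in predictions])
    return (votes.sum(axis=0) * 2 > len(predictions)).astype(int)

# Illustrative votes from five models for three test instances.
a = np.array([True, True, False])
b = np.array([True, False, False])
c = np.array([1, 0, 0])            # e.g. argmax output, already integer
d = np.array([False, True, True])
e = np.array([False, True, False])

voted = majority_vote(a, b, c, d, e)
naive = ((a + b + c + d + e) >= 3) * 1   # a + b is a logical OR here
print(voted, naive)
```

For the first instance three of five models vote 1, yet the uncast sum only reaches 2. If the predictions are already integer 0/1, both expressions agree.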


### AdaBoost Classifier

In [50]:
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost with logistic regression as the base estimator.
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        abc = AdaBoostClassifier(base_estimator=LogisticRegression(solver = "lbfgs"))
        abc.fit(X,Y)
        y_pred = abc.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7678189422380294
Std Balanced Accuracy: 0.06838827897436792
Average Recall Score: 0.7575396825396826
Std Recall Score: 0.08958879580260765


In [54]:
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost with a linear-kernel SVM as the base estimator.
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        abc = AdaBoostClassifier(base_estimator=SVC(kernel = "linear",gamma = "auto",probability = True))
        abc.fit(X,Y)
        y_pred = abc.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.6946991015170475
Std Balanced Accuracy: 0.029415309828667514
Average Recall Score: 0.7700661375661376
Std Recall Score: 0.08034566839494957


In [55]:
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost with a polynomial-kernel SVM as the base estimator.
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        abc = AdaBoostClassifier(base_estimator=SVC(kernel = "poly",gamma = "auto",probability = True))
        abc.fit(X,Y)
        y_pred = abc.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.6732084197786998
Std Balanced Accuracy: 0.03876089739895646
Average Recall Score: 0.7741666666666667
Std Recall Score: 0.06734206878768302


### Gradient Boosting

In [52]:
from sklearn.ensemble import GradientBoostingClassifier

mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        gb = GradientBoostingClassifier()
        gb.fit(X,Y)
        y_pred = gb.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.7399251012645722
Std Balanced Accuracy: 0.07231241665880418
Average Recall Score: 0.6626190476190476
Std Recall Score: 0.08485515615014357
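
The scikit-learn cells above share one evaluation loop and differ only in the classifier they build. A sketch of how that loop could be factored into a single function is given below. It assumes the fold layout used in the cells (`fold[0]` maps keys to training DataFrames whose last column is `defects`, `fold[1]` holds the test features and labels); `evaluate`, `toy_frame`, and the synthetic data are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, recall_score


def evaluate(make_clf, folds):
    """Summarise balanced accuracy and recall over folds.

    make_clf is a zero-argument callable returning a fresh classifier.
    """
    mbal, mrecall = [], []
    for fold in folds:
        train_sets, X_test, y_test = fold[0], fold[1][0], fold[1][1]
        y_true = np.array([bool(v) for v in y_test])
        bal, recall = [], []
        for dat in train_sets.values():
            clf = make_clf()
            clf.fit(dat.iloc[:, :-1], dat["defects"])
            y_pred = clf.predict(X_test)
            bal.append(balanced_accuracy_score(y_true, y_pred))
            recall.append(recall_score(y_true, y_pred))
        # Average over the training sets of this fold, as the cells do.
        mbal.append(np.mean(bal))
        mrecall.append(np.mean(recall))
    return {
        "balanced_accuracy_mean": float(np.mean(mbal)),
        "balanced_accuracy_std": float(np.std(mbal)),
        "recall_mean": float(np.mean(mrecall)),
        "recall_std": float(np.std(mrecall)),
    }


# Synthetic stand-in for the notebook's `folds` (illustrative data only).
rng = np.random.default_rng(0)

def toy_frame(n):
    X = rng.normal(size=(n, 3))
    y = (X[:, 0] + 0.5 * rng.normal(size=n)) > 0
    frame = pd.DataFrame(X, columns=["f0", "f1", "f2"])
    frame["defects"] = y
    return frame

toy_folds = []
for _ in range(3):
    test = toy_frame(60)
    train_sets = {0: toy_frame(120), 1: toy_frame(120)}
    toy_folds.append((train_sets, (test.iloc[:, :-1], test["defects"])))

summary = evaluate(lambda: LogisticRegression(solver="lbfgs"), toy_folds)
for name, value in summary.items():
    print(name, round(value, 3))
```

With such a function, each experiment reduces to one call, for example `evaluate(lambda: GradientBoostingClassifier(), folds)`, which keeps the cells from drifting apart.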
