# Class Imbalance Problem in Software Defect Prediction

### Abstract

Software defect prediction is a field that has encouraged researchers to develop sophisticated detection techniques using machine learning methods. The issue is that the available data is often highly imbalanced, which makes it difficult for classifiers to detect defects. Several methods have been proposed to overcome this issue, and they fall mainly into three categories:

#### 1. Resampling based methods:
These methods use undersampling or oversampling techniques to transform the imbalanced dataset into a balanced one.

#### 2. Cost Sensitive Learning based models:
These methods consider the cost associated with misclassifying examples and try to make the classifier favor the minority class by adding different cost factors into the algorithms.

#### 3. Ensemble Learning:
These methods combine several classifiers to improve predictive performance on the imbalanced dataset.
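To make the first category concrete, here is a minimal sketch of plain random oversampling. It assumes binary labels with `1` as the minority class; the function name `random_oversample` and the toy arrays are hypothetical and are not part of this notebook. It is simpler than the cluster-based method used later and is only meant to show the idea of duplicating minority rows until the classes match.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows (label 1) until both classes have equal size."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    # Number of extra minority rows needed; never negative.
    n_extra = max(len(majority) - len(minority), 0)
    extra = rng.choice(minority, size=n_extra, replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

# Toy data: 8 non-defective rows and 2 defective rows.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X_demo, y_demo)
print(X_bal.shape, int((y_bal == 1).sum()), int((y_bal == 0).sum()))
```

Resampling of this kind should only be applied to the training portion of each fold, otherwise duplicated rows can leak into the test set.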

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import sklearn
import random

In [2]:
#cm1 = pd.read_csv("data/CM1.csv")
#print(sum(cm1.Defective == "Y"),cm1.shape[0],sum(cm1.Defective == "Y")/cm1.shape[0] * 100)
#jm1 = pd.read_csv("data/JM1.csv")
#print(sum(jm1.label == "Y"),jm1.shape[0],sum(jm1.label == "Y")/jm1.shape[0] * 100)
#kc1 = pd.read_csv("data/KC1.csv")
#print(sum(kc1.Defective == "Y"),kc1.shape[0],sum(kc1.Defective == "Y")/kc1.shape[0] * 100)
datasets = ['CM1','JM1','KC1','KC3','KC4','MC1','MC2','MW1','PC1','PC2','PC3','PC4','PC5']
d = {}
df = pd.DataFrame(columns = ["Project","Number of Defective instances","Total Number of instances","Percentage of Defective Instances"]
)
#df.columns = ["Project","Number of Defective instances","Total Number of instances","Percentage of Defective Instances"]

for i, name in enumerate(datasets):
    d[i] = pd.read_csv("data/" + name + ".csv")
    try:
        # The label is taken from the last column, where "Y" marks a defective module.
        n_defective = sum(d[i][d[i].columns[-1]] == "Y")
        n_total = d[i].shape[0]
        pct = n_defective / n_total * 100
        df.loc[len(df)] = [name, n_defective, n_total, pct]
        print(name, n_defective, n_total, pct)
    except Exception:
        # Skip datasets for which the summary cannot be computed
        # (KC4 does not appear in the output, so it presumably fails here).
        continue

CM1 42 344 12.209302325581394
JM1 1759 9591 18.340110520279428
KC1 325 2095 15.513126491646778
KC3 36 200 18.0
MC1 68 8737 0.7782991873640838
MC2 44 125 35.199999999999996
MW1 27 263 10.26615969581749
PC1 61 735 8.299319727891156
PC2 16 1493 1.0716677829872738
PC3 138 1099 12.556869881710647
PC4 178 1379 12.907904278462654
PC5 502 16962 2.9595566560547106


In [3]:
df

Unnamed: 0,Project,Number of Defective instances,Total Number of instances,Percentage of Defective Instances
0,CM1,42,344,12.209302
1,JM1,1759,9591,18.340111
2,KC1,325,2095,15.513126
3,KC3,36,200,18.0
4,MC1,68,8737,0.778299
5,MC2,44,125,35.2
6,MW1,27,263,10.26616
7,PC1,61,735,8.29932
8,PC2,16,1493,1.071668
9,PC3,138,1099,12.55687


Generally, a dataset whose imbalance ratio is more than **10:1** can be regarded as a **highly imbalanced dataset**. Ordinary classifiers tend to fail under these conditions, so more sophisticated techniques are required. In this analysis we'd try to develop a heuristic about which methods work best under which conditions. We'd work on the following datasets:

| Group | Project | Language | Number of Instances | Percentage of Defective Instances |
|-------|---------|----------|---------------------|-----------------------------------|
| NASA  |   PC5   |    C     |        16962        |              2.95%                |
| NASA  |   PC4   |    C     |        1379         |              12.9%                |
| NASA  |   JM1   |    C     |        9591         |              18.34%               |
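The imbalance ratio mentioned above can be computed directly from the counts printed earlier. The sketch below assumes the ratio is defined as non-defective to defective instances; the helper name `imbalance_ratio` is hypothetical.

```python
def imbalance_ratio(n_defective, n_total):
    """Ratio of non-defective to defective instances."""
    n_clean = n_total - n_defective
    return n_clean / n_defective

# (defective, total) counts taken from the summary printed earlier in this notebook.
counts = {"PC5": (502, 16962), "PC4": (178, 1379), "JM1": (1759, 9591)}
ratios = {name: imbalance_ratio(d, t) for name, (d, t) in counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}:1")
```

Under this definition only PC5 exceeds the 10:1 threshold; PC4 and JM1 are imbalanced but less severely so.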

We'd use the following techniques in order to deal with the selected datasets:

1. Ordinary Algorithms with Hyper-parameter tuning (if possible):
    - Support Vector Machine
    - Logistic Regression
2. Ensemble Based Methods:
    - Random Forests
    - Bagging
    - Boosting
3. Cluster-Based Over-Sampling with Filtering
    - Support Vector Machine
    - Logistic Regression
    - Random Forest
    - Bagging
    - Boosting

Note: we'd use **PCA** in order to reduce the over-fitting if required.

In [4]:
KC1 = d[2]
PC4 = d[11]
PC1 = d[8]

In [5]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

## Analysis for PC1

PC1.dtypes

LOC_BLANK                            int64
BRANCH_COUNT                         int64
CALL_PAIRS                           int64
LOC_CODE_AND_COMMENT                 int64
LOC_COMMENTS                         int64
CONDITION_COUNT                      int64
CYCLOMATIC_COMPLEXITY                int64
CYCLOMATIC_DENSITY                 float64
DECISION_COUNT                       int64
DECISION_DENSITY                   float64
DESIGN_COMPLEXITY                    int64
DESIGN_DENSITY                     float64
EDGE_COUNT                           int64
ESSENTIAL_COMPLEXITY                 int64
ESSENTIAL_DENSITY                  float64
LOC_EXECUTABLE                       int64
PARAMETER_COUNT                      int64
HALSTEAD_CONTENT                   float64
HALSTEAD_DIFFICULTY                float64
HALSTEAD_EFFORT                    float64
HALSTEAD_ERROR_EST                 float64
HALSTEAD_LENGTH                      int64
HALSTEAD_LEVEL                     float64
HALSTEAD_PR

In [6]:
## Correcting Defective label to int

PC1['Defective'] = (PC1['Defective'] == "Y")*1
PC1.dtypes

LOC_BLANK                            int64
BRANCH_COUNT                         int64
CALL_PAIRS                           int64
LOC_CODE_AND_COMMENT                 int64
LOC_COMMENTS                         int64
CONDITION_COUNT                      int64
CYCLOMATIC_COMPLEXITY                int64
CYCLOMATIC_DENSITY                 float64
DECISION_COUNT                       int64
DECISION_DENSITY                   float64
DESIGN_COMPLEXITY                    int64
DESIGN_DENSITY                     float64
EDGE_COUNT                           int64
ESSENTIAL_COMPLEXITY                 int64
ESSENTIAL_DENSITY                  float64
LOC_EXECUTABLE                       int64
PARAMETER_COUNT                      int64
HALSTEAD_CONTENT                   float64
HALSTEAD_DIFFICULTY                float64
HALSTEAD_EFFORT                    float64
HALSTEAD_ERROR_EST                 float64
HALSTEAD_LENGTH                      int64
HALSTEAD_LEVEL                     float64
HALSTEAD_PR

In [7]:
##### Since the number of defective instances is very small, we'd use stratified 5-fold cross-validation

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import recall_score,balanced_accuracy_score

kf = StratifiedKFold(n_splits=5)
X = PC1[PC1.columns[:-1]]
y = PC1[PC1.columns[-1]]
yo = y.copy()

X = (X-X.mean())/X.std()
X.describe()

Unnamed: 0,LOC_BLANK,BRANCH_COUNT,CALL_PAIRS,LOC_CODE_AND_COMMENT,LOC_COMMENTS,CONDITION_COUNT,CYCLOMATIC_COMPLEXITY,CYCLOMATIC_DENSITY,DECISION_COUNT,DECISION_DENSITY,...,MULTIPLE_CONDITION_COUNT,NODE_COUNT,NORMALIZED_CYLOMATIC_COMPLEXITY,NUM_OPERANDS,NUM_OPERATORS,NUM_UNIQUE_OPERANDS,NUM_UNIQUE_OPERATORS,NUMBER_OF_LINES,PERCENT_COMMENTS,LOC_TOTAL
count,735.0,735.0,735.0,735.0,735.0,735.0,735.0,735.0,735.0,735.0,...,735.0,735.0,735.0,735.0,735.0,735.0,735.0,735.0,735.0,735.0
mean,2.454575e-16,1.291484e-16,8.398422e-17,-5.078326e-16,4.32987e-16,-3.353327e-17,1.586033e-16,1.603139e-15,2.646409e-16,-3.037026e-15,...,-4.0179500000000004e-17,-1.296015e-16,-6.45742e-16,-1.7823990000000002e-17,-1.0422500000000001e-17,-2.785376e-16,6.797284e-18,3.791374e-17,-1.62642e-15,-1.084544e-16
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-0.6124197,-0.5629526,-0.8935369,-0.3310945,-0.4776055,-0.5498607,-0.6450375,-1.969935,-0.5307145,-0.4355453,...,-0.5497144,-0.5829778,-1.631291,-0.662031,-0.6380924,-0.7516631,-1.578774,-0.6980089,-0.8519218,-0.6823274
25%,-0.5431159,-0.4484934,-0.6147807,-0.3310945,-0.4776055,-0.4151224,-0.4372345,-0.7191149,-0.5307145,-0.4355453,...,-0.4159263,-0.4556125,-0.7676455,-0.4954752,-0.4877675,-0.5099205,-0.6864537,-0.5487981,-0.8519218,-0.484892
50%,-0.3352043,-0.3340341,-0.3360245,-0.3310945,-0.3959683,-0.280384,-0.333333,-0.2500576,-0.2466102,-0.4355453,...,-0.2821382,-0.2964059,-0.3358229,-0.3080999,-0.2978834,-0.268178,-0.1765563,-0.3498503,-0.3870507,-0.2874566
75%,0.1499226,0.009343612,0.2214879,-0.08302676,0.09385504,0.05646177,0.08227303,0.4535285,0.03749403,0.09596026,...,0.05233208,0.05384866,0.6141869,0.07706038,0.09770851,0.2153071,0.4608154,0.180677,0.6589093,0.1320937
max,14.98095,12.77155,5.796612,11.57616,12.50271,12.25028,13.38167,3.73693,12.68013,12.85209,...,12.16016,13.49089,3.464216,11.21548,12.29776,15.44509,10.53129,13.60965,3.376843,14.10065
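The standardization above uses the mean and standard deviation of the whole dataset, so the test folds influence the scaling. A more conservative option, sketched here under the assumption that plain NumPy arrays are used, is to compute the statistics on the training fold only. The function name `standardize_fold` and the synthetic data are hypothetical.

```python
import numpy as np

def standardize_fold(X_train, X_test):
    """Scale both parts using statistics estimated on the training part only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)  # ddof=1 matches the pandas default used above
    std[std == 0] = 1.0                # avoid division by zero for constant columns
    return (X_train - mean) / std, (X_test - mean) / std

# Synthetic stand-in for a feature matrix (not the PC1 data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
X_tr, X_te = standardize_fold(X_demo[:80], X_demo[80:])
print(X_tr.mean(axis=0).round(6), X_tr.std(axis=0, ddof=1).round(6))
```

The test part will generally not have exactly zero mean and unit variance after this transformation, which is expected.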


In [8]:
sum(PC1['Defective'])

61
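With 61 defective modules out of 735, a cost-sensitive classifier needs to weight the minority class heavily. The sketch below reproduces the heuristic that scikit-learn documents for `class_weight="balanced"`, namely `n_samples / (n_classes * np.bincount(y))`. The helper name `balanced_class_weights` is hypothetical and the label vector is rebuilt from the counts rather than read from the data.

```python
import numpy as np

def balanced_class_weights(y):
    """Per-class weights inversely proportional to class frequencies."""
    y = np.asarray(y)
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

# 674 non-defective (0) and 61 defective (1) labels, matching the PC1 counts.
y_demo = np.array([0] * 674 + [1] * 61)
weights = balanced_class_weights(y_demo)
print(weights)
```

Each defective example therefore counts roughly eleven times as much as a non-defective one (674 / 61).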

In [9]:
# Now we'd use PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=8)
XT = pca.fit_transform(X)
pca.explained_variance_ratio_*100

array([56.57075357, 10.07727277,  6.1846134 ,  4.51684265,  4.1101382 ,
        3.21640136,  2.70358739,  2.07406149])
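To judge whether 8 components are enough, it helps to look at the cumulative explained variance. This short sketch reuses the percentages printed above; the 80% threshold is an arbitrary illustration, not a value used in this notebook.

```python
import numpy as np

# Explained variance percentages copied from the PCA output above.
ratios = np.array([56.57075357, 10.07727277, 6.1846134, 4.51684265,
                   4.1101382, 3.21640136, 2.70358739, 2.07406149])
cumulative = np.cumsum(ratios)
# Smallest number of components whose cumulative share reaches 80%.
n_for_80 = int(np.searchsorted(cumulative, 80.0) + 1)
print(cumulative.round(2), n_for_80)
```

The eight retained components together explain about 89.5% of the variance of the standardized features.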

In [10]:
df = pd.DataFrame()
for i in range(pca.explained_variance_ratio_.shape[0]):
    df["pc%i" % (i+1)] = XT[:,i]
df.head()

Unnamed: 0,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8
0,-2.521388,0.822144,1.708837,0.134342,-0.037248,-0.616636,-0.110241,0.972536
1,-3.137713,0.592765,1.712132,-1.220505,0.55042,1.360461,0.101228,0.349711
2,3.982195,-2.613912,-2.798527,1.742143,-1.129623,0.240516,0.199956,0.75389
3,-2.358768,-1.901431,-0.383355,-0.207482,-0.372655,1.554141,1.134323,-0.197235
4,8.600955,-4.332267,-0.980958,3.393093,2.598393,0.395955,-0.022758,-0.793893


In [11]:
#params_grid = {'C':[100,1000,10000],'gamma':[0.1,0.01,0.001],'kernel':['rbf']}

bal = []
recall = []

kf = StratifiedKFold(n_splits=5)

for train,test in kf.split(df,y):
    Xtrain,Xtest = df.values[train],df.values[test]
    ytrain,ytest = y.values[train],y.values[test]
    svm = SVC(kernel = "rbf",gamma = "auto",class_weight="balanced")
    #grid = GridSearchCV(svm,params_grid,refit = True,verbose = 0,scoring = 'balanced_accuracy',cv = 5)
    svm.fit(Xtrain,ytrain)
    #print(grid.best_estimator_)
    grid_predictions = svm.predict(Xtest)
    r = recall_score(ytest,grid_predictions)
    b = balanced_accuracy_score(ytest,grid_predictions)
    print("Recall:",r)
    print("Batch Balanced Accuracy:",b)
    bal += [b]
    recall += [r]

bal = np.array(bal)
recall = np.array(recall)
print("===============")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Recall: 0.15384615384615385
Batch Balanced Accuracy: 0.41396011396011395
Recall: 0.25
Batch Balanced Accuracy: 0.5657407407407408
Recall: 0.3333333333333333
Batch Balanced Accuracy: 0.5592592592592592
Recall: 0.16666666666666666
Batch Balanced Accuracy: 0.4907407407407407
Recall: 0.08333333333333333
Batch Balanced Accuracy: 0.5043532338308457
Average Balanced Accuracy: 0.5068108177063402
Std Balanced Accuracy: 0.054971110828370924
Average Recall Score: 0.19743589743589746
Std Recall Score: 0.08613629247302987
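The two metrics reported above can be written in terms of confusion-matrix counts: recall is TP / (TP + FN), and balanced accuracy is the mean of recall and specificity, TN / (TN + FP). The sketch below uses hypothetical counts chosen to be consistent with the first fold's printed scores; they are reconstructed for illustration, not read from the notebook.

```python
def recall(tp, fn):
    """Fraction of defective modules that were detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of non-defective modules that were correctly left alone."""
    return tn / (tn + fp)

def balanced_accuracy(tp, fn, tn, fp):
    """Mean of recall and specificity."""
    return 0.5 * (recall(tp, fn) + specificity(tn, fp))

# Assumed counts: 13 defective and 135 non-defective modules in the test fold.
tp, fn, tn, fp = 2, 11, 91, 44
fold_recall = recall(tp, fn)
fold_bal = balanced_accuracy(tp, fn, tn, fp)
print(fold_recall, fold_bal)
```

A balanced accuracy below 0.5, as in this fold, means the classifier does worse than predicting both classes at random.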


In [12]:
bal = []
recall = []

#param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }

for train,test in kf.split(df,y):
    Xtrain,Xtest = df.values[train],df.values[test]
    ytrain,ytest = y.values[train],y.values[test]
    lr = LogisticRegression(solver = 'lbfgs')
    #grid = GridSearchCV(lr,param_grid,refit = True,verbose = 0,scoring = 'balanced_accuracy',cv = 5)
    lr.fit(Xtrain,ytrain)
    #print(grid.best_estimator_)
    y_pred = lr.predict(Xtest)
    r = recall_score(ytest,y_pred)
    b = balanced_accuracy_score(ytest,y_pred)
    print("Recall:",r)
    print("Batch Balanced Accuracy:",b)
    bal += [b]
    recall += [r]

print("============")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Recall: 0.07692307692307693
Batch Balanced Accuracy: 0.5236467236467236
Recall: 0.08333333333333333
Batch Balanced Accuracy: 0.537962962962963
Recall: 0.08333333333333333
Batch Balanced Accuracy: 0.537962962962963
Recall: 0.08333333333333333
Batch Balanced Accuracy: 0.5416666666666666
Recall: 0.08333333333333333
Batch Balanced Accuracy: 0.5416666666666666
Average Balanced Accuracy: 0.5365811965811965
Std Balanced Accuracy: 0.006675974217155054
Average Recall Score: 0.08205128205128204
Std Recall Score: 0.0025641025641025606


Now we'd try further models:
- Random Forest (ensemble-based)
- Gaussian Naive Bayes
- OS-ELM (Online Sequential Extreme Learning Machine)

In [13]:
from sklearn.ensemble import RandomForestClassifier

bal = []
recall = []

for train,test in kf.split(df,y):
    Xtrain,Xtest = df.values[train],df.values[test]
    ytrain,ytest = y.values[train],y.values[test]
    rf = RandomForestClassifier(n_estimators = 100)
    #grid = GridSearchCV(lr,param_grid,refit = True,verbose = 0,scoring = 'balanced_accuracy',cv = 5)
    rf.fit(Xtrain,ytrain)
    #print(grid.best_estimator_)
    y_pred = rf.predict(Xtest)
    r = recall_score(ytest,y_pred)
    b = balanced_accuracy_score(ytest,y_pred)
    print("Recall:",r)
    print("Batch Balanced Accuracy:",b)
    bal += [b]
    recall += [r]

bal = np.array(bal)
recall = np.array(recall)
print("===========")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Recall: 0.15384615384615385
Batch Balanced Accuracy: 0.550997150997151
Recall: 0.16666666666666666
Batch Balanced Accuracy: 0.5796296296296296
Recall: 0.0
Batch Balanced Accuracy: 0.5
Recall: 0.08333333333333333
Batch Balanced Accuracy: 0.537962962962963
Recall: 0.08333333333333333
Batch Balanced Accuracy: 0.5416666666666666
Average Balanced Accuracy: 0.542051282051282
Std Balanced Accuracy: 0.025602083438791598
Average Recall Score: 0.09743589743589742
Std Recall Score: 0.059777144753203934


In [14]:
from sklearn.naive_bayes import GaussianNB

bal = []
recall = []

for train,test in kf.split(df,y):
    Xtrain,Xtest = df.values[train],df.values[test]
    ytrain,ytest = y.values[train],y.values[test]
    nb = GaussianNB()
    #grid = GridSearchCV(lr,param_grid,refit = True,verbose = 0,scoring = 'balanced_accuracy',cv = 5)
    nb.fit(Xtrain,ytrain)
    #print(grid.best_estimator_)
    y_pred = nb.predict(Xtest)
    r = recall_score(ytest,y_pred)
    b = balanced_accuracy_score(ytest,y_pred)
    print("Recall:",r)
    print("Batch Balanced Accuracy:",b)
    bal += [b]
    recall += [r]

bal = np.array(bal)
recall = np.array(recall)
print("===========")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Recall: 0.38461538461538464
Batch Balanced Accuracy: 0.6737891737891738
Recall: 0.16666666666666666
Batch Balanced Accuracy: 0.575925925925926
Recall: 0.25
Batch Balanced Accuracy: 0.6101851851851852
Recall: 0.08333333333333333
Batch Balanced Accuracy: 0.5231481481481481
Recall: 0.16666666666666666
Batch Balanced Accuracy: 0.568407960199005
Average Balanced Accuracy: 0.5902912786494876
Std Balanced Accuracy: 0.050122068929377955
Average Recall Score: 0.21025641025641026
Std Recall Score: 0.1018726693606099


In [15]:
df

Unnamed: 0,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8
0,-2.521388,0.822144,1.708837,0.134342,-0.037248,-0.616636,-0.110241,0.972536
1,-3.137713,0.592765,1.712132,-1.220505,0.550420,1.360461,0.101228,0.349711
2,3.982195,-2.613912,-2.798527,1.742143,-1.129623,0.240516,0.199956,0.753890
3,-2.358768,-1.901431,-0.383355,-0.207482,-0.372655,1.554141,1.134323,-0.197235
4,8.600955,-4.332267,-0.980958,3.393093,2.598393,0.395955,-0.022758,-0.793893
...,...,...,...,...,...,...,...,...
730,16.753565,-1.873700,1.117572,1.402874,-4.386335,4.051415,-1.815742,3.185775
731,-1.076460,-1.780674,0.148264,0.493565,-0.837638,0.239646,-0.121430,-0.659616
732,1.544871,-0.540507,1.109615,0.381071,-1.213654,-0.612453,-1.611542,-0.651792
733,-0.676700,0.647131,-0.698357,0.126723,0.575452,-0.997832,-0.006964,0.421220


In [17]:
import tensorflow as tf
from os_elm import OS_ELM
import tqdm

In [18]:
def softmax(a):
    c = np.max(a, axis=-1).reshape(-1, 1)
    exp_a = np.exp(a - c)
    sum_exp_a = np.sum(exp_a, axis=-1).reshape(-1, 1)
    return exp_a / sum_exp_a
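Subtracting the row-wise maximum before exponentiating keeps `softmax` numerically stable without changing the result. The short check below repeats the function so that it runs on its own; the input values are made up for illustration.

```python
import numpy as np

def softmax(a):
    # Shift each row by its maximum so that np.exp never overflows.
    c = np.max(a, axis=-1).reshape(-1, 1)
    exp_a = np.exp(a - c)
    sum_exp_a = np.sum(exp_a, axis=-1).reshape(-1, 1)
    return exp_a / sum_exp_a

# The first row would overflow a naive exp(); the shifted version handles it.
logits = np.array([[1000.0, 1001.0], [-2.0, 0.5]])
probs = softmax(logits)
print(probs, probs.sum(axis=1))
```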

In [19]:
tf.reset_default_graph()
n_input_nodes = 8
n_hidden_nodes = 4
n_output_nodes = 1
bal = []
recall = []

for train,test in kf.split(df,y):
    Xtrain,Xtest = df.values[train],df.values[test]
    ytrain,ytest = y.values[train],y.values[test]
    ytrain = ytrain.reshape(-1,1)
    ytest = ytest.reshape(-1,1)
    os_elm1 = OS_ELM(
        # the number of input nodes.
        n_input_nodes=n_input_nodes,
        # the number of hidden nodes.
        n_hidden_nodes=n_hidden_nodes,
        # the number of output nodes.
        n_output_nodes=n_output_nodes,
        # loss function.
        # the default value is 'mean_squared_error'.
        # for the other functions, we support
        # 'mean_absolute_error', 'categorical_crossentropy', and 'binary_crossentropy'.
        loss='binary_crossentropy',
        # activation function applied to the hidden nodes.
        # the default value is 'sigmoid'.
        # for the other functions, we support 'linear' and 'tanh'.
        # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
        activation='tanh',
    )
    border = int(2*n_hidden_nodes)
    Xtrain_init = Xtrain[:border]
    Xtrain_seq = Xtrain[border:]
    ytrain_init = ytrain[:border]
    ytrain_seq = ytrain[border:]
    os_elm1.init_train(Xtrain_init, ytrain_init)
    batch_size = 64
    for i in range(0, len(Xtrain_seq), batch_size):
        x_batch = Xtrain_seq[i:i+batch_size]
        t_batch = ytrain_seq[i:i+batch_size]
        os_elm1.seq_train(x_batch, t_batch)
    n_classes = n_output_nodes
    y_pred = os_elm1.predict(Xtest)
    y_pred = (1/(1+np.exp(-1*y_pred))>0.5)*1
    r = recall_score(ytest,y_pred)
    b = balanced_accuracy_score(ytest,y_pred)
    print("Recall:",r)
    print("Batch Balanced Accuracy:",b)
    bal += [b]
    recall += [r]
    tf.reset_default_graph()
print("================")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Recall: 0.9230769230769231
Batch Balanced Accuracy: 0.6170940170940171
Recall: 0.4166666666666667
Batch Balanced Accuracy: 0.27870370370370373
Recall: 0.75
Batch Balanced Accuracy: 0.6675925925925925
Recall: 0.75
Batch Balanced Accuracy: 0.6305555555555555
Recall: 0.5833333333333334
Batch Balanced Accuracy: 0.6349502487562189
Average Balanced Accuracy: 0.5657792235404175
Std Balanced Accuracy: 0.14449368717245942
Average Recall Score: 0.6846153846153846
Std Recall Score: 0.17173745691938824


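Random Forest produces results similar to Logistic Regression and SVM.

Its average balanced accuracy (0.542) is slightly higher than that of **SVM** (0.507) and **Logistic Regression** (0.537). Its average recall (0.097) is higher than that of Logistic Regression (0.082) but lower than that of SVM (0.197). Gaussian Naive Bayes reaches the highest balanced accuracy (0.590), and OS-ELM the highest recall (0.685), though with a large spread across folds.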

Now we'd use the **KMFOS** technique (the cluster-based over-sampling with filtering listed earlier), which generates extra defective samples to balance the dataset.

In [20]:
# Code for KMFOS

from sklearn.cluster import KMeans  # needed by InitialClustering below
from sklearn.neighbors import NearestNeighbors

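# Noise filtering step: flag every instance whose label disagrees with the
# majority label of its n-1 nearest neighbours.
# NOTE: clni reads the global DataFrame `compData` and assumes its index is 0..N-1.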

def clni(n):
    nbrs = NearestNeighbors(n_neighbors = n, algorithm = 'auto', p = 2).fit(compData)
    distances,indices = nbrs.kneighbors(compData)
    dropped = []
    i = 0
    while i <len(compData):
        t,f = 0,0
        for j in range(n-1):
            if compData['defects'][indices[i][j+1]] == True:
                t += 1
            else:
                f += 1
        if t>f and compData['defects'][i] == False:
            dropped.append(i)
        elif f>t and compData['defects'][i] == True:
            dropped.append(i)
        i+=1
    return dropped

# Over-sampling step

def overSamplingM(clusters,d,nplus,n):
    k = len(clusters)
    t = 0
    s = 0
    first = False
    second = False
    for i in range(k):
        first = False
        second = False
        ni = len(clusters[i])
        for j in range(i+1,k):
            nj = len(clusters[j])
            if ni + nj == 0:
                continue
            alpha = ni/(ni+nj)
            beta = nj/(ni+nj)
            total = ((ni+nj)/((k-1)*nplus))*n
            t += total
            r = int(total)
            if(total-r > 0.5):
                r = r+1
            s += r
            for l in range(r):
                if ni:
                    p = clusters[i].sample()
                    #print("P:",p)
                    first = True
                if nj:
                    q = clusters[j].sample()
                    #print("Q:",q)
                    second = True
                if first and second:
                    m = alpha*p[p.columns[:-1]] + beta*q[q.columns[:-1]].values
                    m['defects'] = True
                    #print("M:",m)
                    d = d.append(m,ignore_index = True)
                elif first:
                    m = alpha*p[:-1]
                    m['defects'] = True
                    #print(m)
                    d = d.append(m,ignore_index = True)
                elif second:
                    m = beta*q[:-1]
                    m['defects'] = True
                    #print(m)
                    d = d.append(m,ignore_index = True)  
    #print(s)
    #print(t)
    return d

def overSampling(clusters,d,nplus,n):
    k = len(clusters)
    t = 0
    s = 0
    first = False
    second = False
    for i in range(k):
        if len(clusters[i].groupby('defects').groups) == 1:
            first = False
            if clusters[i].defects.values[0]:
                first = True
                ni = len(clusters[i])
            else:
                ni = 0
        else:
            first = True
            ni = len(clusters[i].groupby('defects').groups[True])
        for j in range(i+1,k):
            if len(clusters[j].groupby('defects').groups) == 1:
                second = False
                if clusters[j].defects.values[0]:
                    second = True
                    nj = len(clusters[j])
                else:
                    nj = 0
            else:
                second = True
                nj = len(clusters[j].groupby('defects').groups[True])
            if ni+nj == 0:
                continue
            alpha = ni/(ni+nj)
            beta = nj/(ni+nj)
            total = ((ni+nj)/((k-1)*nplus))*n
            r = int(total)
            if(total-r > 0.5):
                r = r+1
            for m in range(r):
                # Remove last column
                if first:
                    p = d.iloc[random.choice(clusters[i].groupby('defects').groups[True])]
                if second:
                    q = d.iloc[random.choice(clusters[j].groupby('defects').groups[True])]
                if first and second:
                    m = alpha*p[:-1] + beta*q[:-1]
                    m['defects'] = True
                    d = d.append(m,ignore_index = True)
                elif first:
                    m = alpha*p[:-1]
                    m['defects'] = True
                    d = d.append(m,ignore_index = True)
                elif second:
                    m = beta*q[:-1]
                    m['defects'] = True
                    d = d.append(m,ignore_index = True)
    return d

# Code for performing initial clustering

def InitialClustering(dat,k):
    kmeans = KMeans(n_clusters = k, init = 'k-means++',max_iter=300,n_init = 10,random_state = 0)
    kmeans.fit(dat)
    clusters = {}
    for i in range(len(kmeans.labels_)):
        if kmeans.labels_[i] in clusters:
            clusters[kmeans.labels_[i]].append(dat.index[i])
        else:
            clusters[kmeans.labels_[i]] = [dat.index[i]]
    for key in clusters.keys():
        clusters[key] = np.array(clusters[key])
    return clusters
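The over-sampling step in `overSamplingM` assigns each pair of clusters a share of the synthetic samples and builds each sample as a weighted average of one point from each cluster. The sketch below isolates those two formulas in plain Python. The names `pair_quotas` and `interpolate` are hypothetical, the cluster sizes are made up, and the rounding applied in the notebook code is left out.

```python
from itertools import combinations

def pair_quotas(cluster_sizes, n_synthetic):
    """Unrounded number of synthetic samples for every cluster pair (i, j), i < j."""
    k = len(cluster_sizes)          # requires k >= 2
    n_plus = sum(cluster_sizes)     # total number of defective instances
    quotas = {}
    for i, j in combinations(range(k), 2):
        ni, nj = cluster_sizes[i], cluster_sizes[j]
        quotas[(i, j)] = (ni + nj) / ((k - 1) * n_plus) * n_synthetic
    return quotas

def interpolate(p, q, ni, nj):
    """Weighted average of points p and q, weighted by their cluster sizes."""
    alpha = ni / (ni + nj)
    beta = nj / (ni + nj)
    return [alpha * a + beta * b for a, b in zip(p, q)]

quotas = pair_quotas([30, 20, 5], n_synthetic=500)
point = interpolate([0.0, 0.0], [4.0, 8.0], ni=3, nj=1)
print(quotas, sum(quotas.values()), point)
```

Every cluster takes part in k-1 pairs, so the pair sizes add up to (k-1) times the number of defective instances and the unrounded quotas sum to the requested number of synthetic samples.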

In [21]:
sum(y == 1)

61

In [22]:
df
data = df
data['defects'] = (y == 1)

In [23]:
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedKFold

kf10 = StratifiedKFold(n_splits=10)

folds = []
for train,test in kf10.split(df[df.columns[:-1]],df[df.columns[-1]]):
    Xtrain = df.loc[train]
    Xtest,ytest = df.values[:,:-1][test],df.values[:,-1][test]
    test = (Xtest,ytest)
    params = {3:[5,15,20],5:[5,15,20],20:[5,15,20],50:[5,15,20]}
    result = {}
    key = 3
    val = 5
    for key in params.keys():
        for val in params[key]:
            X = Xtrain.copy()
            temp = Xtrain.defects
            X.index = [i for i in range(len(X))]
            Np = sum(X['defects'] == True)
            Nm = sum(X['defects'] == False)
            N = Nm - Np
            D = X.loc[X['defects'] == True]
            clusters = InitialClustering(D,key)
            for i in clusters.keys():
                clusters[i] = X.loc[clusters[i]]
            compData = overSamplingM(clusters,X,Np,N)
            compData = compData.dropna()
            #print("DefectiveInstances:",len(compData.groupby("defects").groups[True]))
            #print("NonDefectiveInstances:",len(compData.groupby("defects").groups[False]))
            dropped = clni(val)
            result[(key,val)] = compData.drop(index = dropped)
            #print("Defect:",len(result[(key,val)].groupby('defects').groups[True]))
            #print("NonDefect:",len(result[(key,val)].groupby('defects').groups[False]))
    print("Complete")    
    folds.append((result,test))

Complete
Complete
Complete
Complete
Complete
Complete
Complete
Complete
Complete
Complete


The parameters of the KMFOS method are **k**, the initial number of clusters formed, and **kn**, the number of neighbors used in the noise filtering step. For more details, refer to the paper. Now we'd simply fit each model on the balanced dataset for each combination of parameter values and average the results. The folds built above come from stratified **10-fold** cross-validation, so each model is trained on roughly 90% of the data and tested on the remaining 10%.
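The role of **kn** can be illustrated with a self-contained version of the noise filter that does not depend on global state. This is a brute-force NumPy sketch under the assumption that Euclidean distance is used and that, as in `clni`, the kn-1 nearest neighbours other than the point itself vote. The name `noisy_indices` and the toy data are hypothetical.

```python
import numpy as np

def noisy_indices(X, y, kn):
    """Indices of points whose label disagrees with the majority of their kn-1 nearest neighbours."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=bool)
    # Pairwise squared Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    flagged = []
    for i in range(len(X)):
        nearest = np.argsort(dist[i])[:kn - 1]
        pos = int(y[nearest].sum())
        neg = len(nearest) - pos
        if (pos > neg and not y[i]) or (neg > pos and y[i]):
            flagged.append(i)
    return flagged

# Two well-separated groups; the point at 0.15 carries the "wrong" label.
X_demo = [[0.0], [0.1], [0.2], [0.15], [5.0], [5.1], [5.2]]
y_demo = [False, False, False, True, True, True, True]
flagged = noisy_indices(X_demo, y_demo, kn=4)
print(flagged)
```

Larger values of kn make the vote smoother but can also remove genuine minority instances that sit inside a majority region.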

In [24]:
bal = []
recall = []
for fold in folds:
    result = fold[0]
    #bal = []
    #recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        lr = LogisticRegression(solver = "lbfgs")
        lr.fit(X,Y)
        y_pred = lr.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
        
bal = np.array(bal)
recall = np.array(recall)
print("=============")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Average Balanced Accuracy: 0.6363424038769736
Std Balanced Accuracy: 0.11874368687356067
Average Recall Score: 0.5146825396825396
Std Recall Score: 0.29563119205063243


In [25]:
bal = []
recall = []
for fold in folds:
    result = fold[0]
    #bal = []
    #recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        svm = SVC(gamma = 'auto',kernel = 'rbf')
        svm.fit(X,Y)
        y_pred = svm.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    #bal = np.array(bal)
    #recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    #mbal += [np.mean(bal)]
    #mrecall += [np.mean(recall)]

bal = np.array(bal)
recall = np.array(recall)
print("=============")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Average Balanced Accuracy: 0.583952131847764
Std Balanced Accuracy: 0.11122502454999865
Average Recall Score: 0.3198412698412698
Std Recall Score: 0.25823060157116445


In [26]:
bal = []
recall = []
for fold in folds:
    result = fold[0]
    #bal = []
    #recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        rf = RandomForestClassifier(n_estimators=100)
        rf.fit(X,Y)
        y_pred = rf.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    #bal = np.array(bal)
    #recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    #mbal += [np.mean(bal)]
    #mrecall += [np.mean(recall)]

bal = np.array(bal)
recall = np.array(recall)
print("=============")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Average Balanced Accuracy: 0.6032635753306297
Std Balanced Accuracy: 0.11443241336719867
Average Recall Score: 0.3664682539682539
Std Recall Score: 0.2536821770471334


In [27]:
bal = []
recall = []
for fold in folds:
    result = fold[0]
    #bal = []
    #recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
        nb = GaussianNB()
        nb.fit(X,Y)
        y_pred = nb.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    #bal = np.array(bal)
    #recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    #mbal += [np.mean(bal)]
    #mrecall += [np.mean(recall)]

bal = np.array(bal)
recall = np.array(recall)
print("=============")
print("Average Balanced Accuracy:",np.mean(bal))
print("Std Balanced Accuracy:",np.std(bal))
print("Average Recall Score:",np.mean(recall))
print("Std Recall Score:",np.std(recall))

Average Balanced Accuracy: 0.6134594447231628
Std Balanced Accuracy: 0.11353742563428275
Average Recall Score: 0.42420634920634925
Std Recall Score: 0.25663867631143666


In [28]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects'].values
        tf.reset_default_graph()
        os_elm1 = OS_ELM(
        # the number of input nodes.
        n_input_nodes=n_input_nodes,
        # the number of hidden nodes.
        n_hidden_nodes=n_hidden_nodes,
        # the number of output nodes.
        n_output_nodes= 2,
        # loss function.
        # the default value is 'mean_squared_error'.
        # for the other functions, we support
        # 'mean_absolute_error', 'categorical_crossentropy', and 'binary_crossentropy'.
        loss='binary_crossentropy',
        # activation function applied to the hidden nodes.
        # the default value is 'sigmoid'.
        # for the other functions, we support 'linear' and 'tanh'.
        # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
        activation='sigmoid',
    )
        Y = Y.reshape(-1,1)
        #print(sum(Y),Y.shape)
        Y = np.hstack((Y==False,Y))
        #print(Y)
        border = int(2*n_hidden_nodes)
        Xtrain_init = X[:border]
        Xtrain_seq = X[border:]
        ytrain_init = Y[:border]
        ytrain_seq = Y[border:]
        os_elm1.init_train(Xtrain_init, ytrain_init)
        batch_size = 64
        for i in range(0, len(Xtrain_seq), batch_size):
            x_batch = Xtrain_seq[i:i+batch_size]
            t_batch = ytrain_seq[i:i+batch_size]
            os_elm1.seq_train(x_batch, t_batch)
        n_classes = n_output_nodes
        y_pred = os_elm1.predict(fold[1][0])
        #print(y_pred)
        y_pred = softmax(y_pred)
        # Predicted class index per test sample (1 = defective).
        res = np.argmax(y_pred, axis=1)
        #print(y_pred,res)
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,res)
        r = recall_score(l,res)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.6281864487088368
Std Balanced Accuracy: 0.13449505707542075
Average Recall Score: 0.5359126984126985
Std Recall Score: 0.2969674745910376
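
The cell above encodes the boolean `defects` column as a two-column target and decodes the network output with an argmax. A minimal sketch of that round trip follows; the `scores` array and the `softmax_rows` helper are made up for illustration, and it assumes the notebook's own `softmax` works row by row.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax (hypothetical stand-in for the notebook's softmax)."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Boolean labels as a column vector, True = defective.
Y = np.array([False, True, False, False]).reshape(-1, 1)

# Column 0 marks "not defective", column 1 marks "defective".
Y_onehot = np.hstack((Y == False, Y))
print(Y_onehot.astype(int))

# Made-up network outputs: one row of two scores per sample.
scores = np.array([[0.9, 0.1], [0.2, 0.7], [0.6, 0.4], [0.3, 0.8]])

# The index of the larger score is the predicted class (1 = defective).
pred = np.argmax(scores, axis=1)
print(pred)  # [0 1 0 1]

# A row-wise softmax keeps the ordering within each row, so the argmax is the same.
assert np.array_equal(np.argmax(softmax_rows(scores), axis=1), pred)
```

Because the softmax does not change which column is largest, it only matters here if the class probabilities themselves are needed.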


In [37]:
mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects'].values
        tf.reset_default_graph()
        os_elm1 = OS_ELM(
            # the number of input nodes.
            n_input_nodes=n_input_nodes,
            # the number of hidden nodes.
            n_hidden_nodes=n_hidden_nodes,
            # the number of output nodes.
            n_output_nodes=2,
            # loss function.
            # the default value is 'mean_squared_error'.
            # for the other functions, we support
            # 'mean_absolute_error', 'categorical_crossentropy', and 'binary_crossentropy'.
            loss='binary_crossentropy',
            # activation function applied to the hidden nodes.
            # the default value is 'sigmoid'.
            # for the other functions, we support 'linear' and 'tanh'.
            # NOTE: OS-ELM can apply an activation function only to the hidden nodes.
            activation='sigmoid',
        )
        Y = Y.reshape(-1,1)
        Yo = np.hstack((Y==False,Y))
        border = int(2*n_hidden_nodes)
        Xtrain_init = X[:border]
        Xtrain_seq = X[border:]
        ytrain_init = Yo[:border]
        ytrain_seq = Yo[border:]
        os_elm1.init_train(Xtrain_init, ytrain_init)
        batch_size = 64
        for i in range(0, len(Xtrain_seq), batch_size):
            x_batch = Xtrain_seq[i:i+batch_size]
            t_batch = ytrain_seq[i:i+batch_size]
            os_elm1.seq_train(x_batch, t_batch)
        n_classes = n_output_nodes
        y_predo = os_elm1.predict(fold[1][0])
        y_predo = softmax(y_predo)
        # Predicted class index per test sample (1 = defective).
        res = np.argmax(y_predo, axis=1)
        
        svm = SVC(gamma = "auto",kernel = "rbf")
        svm.fit(X,Y.ravel())
        y_preds = svm.predict(fold[1][0])
        y_preds = y_preds.reshape(-1,1)
        
        lr = LogisticRegression(solver = "lbfgs")
        lr.fit(X,Y.ravel())
        y_predl = lr.predict(fold[1][0])
        y_predl = y_predl.reshape(-1,1)
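        res = res.reshape(-1,1)
        # Majority vote over LR, SVM and OS-ELM. Cast to int before summing:
        # adding two boolean NumPy arrays is a logical OR, not a vote count.
        votes = y_predl.astype(int) + y_preds.astype(int) + res
        y_pred = (votes >= 2)*1
        y_pred = y_pred.ravel()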
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.6338565401284892
Std Balanced Accuracy: 0.07626298673688779
Average Recall Score: 0.4533730158730159
Std Recall Score: 0.20090155281983765
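
The cell above combines OS-ELM, SVM and logistic regression by hard majority voting. When votes are summed with NumPy, boolean predictions need some care, because `+` on two boolean arrays acts as a logical OR. The sketch below uses made-up predictions to show the difference; whether it matters in practice depends on the dtype of the labels, which is assumed to be boolean here.

```python
import numpy as np

# Made-up predictions for four test samples.
svm_pred = np.array([True, True, False, False])   # boolean, as from boolean labels
lr_pred = np.array([True, True, True, False])     # boolean
elm_pred = np.array([1, 0, 0, 1])                 # integer class index from argmax

# Summing booleans directly: lr_pred + svm_pred is a logical OR, so two
# agreeing "defective" votes count only once.
naive = ((lr_pred + svm_pred + elm_pred) >= 2) * 1
print(naive)     # [1 0 0 0]

# Casting to int first gives a true vote count per sample.
votes = lr_pred.astype(int) + svm_pred.astype(int) + elm_pred
majority = (votes >= 2).astype(int)
print(majority)  # [1 1 0 0]
```

The second sample is flagged by two of the three classifiers, yet only the integer version counts it as a majority.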


In [34]:
from sklearn.ensemble import BaggingClassifier

mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
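        # Bagging ensemble of 10 RBF-kernel SVMs.
        # Note: scikit-learn 1.2 and later rename `base_estimator` to `estimator`.
        clf = BaggingClassifier(base_estimator=SVC(kernel = "rbf",gamma = "auto"),n_estimators=10,random_state = 0)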
        clf.fit(X,Y)
        y_pred = clf.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.5840322198531154
Std Balanced Accuracy: 0.07872853822513894
Average Recall Score: 0.3192460317460318
Std Recall Score: 0.16719598957272064


In [33]:
from sklearn.ensemble import BaggingClassifier

mbal = []
mrecall = []
for fold in folds:
    result = fold[0]
    bal = []
    recall = []
    for key in result.keys():
        dat = result[key]
        X = dat.iloc[:,:-1]
        Y = dat['defects']
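        # Bagging ensemble of 10 logistic regression models.
        # Note: scikit-learn 1.2 and later rename `base_estimator` to `estimator`.
        clf = BaggingClassifier(base_estimator=LogisticRegression(solver = "lbfgs"),n_estimators=10,random_state = 0)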
        clf.fit(X,Y)
        y_pred = clf.predict(fold[1][0])
        l = np.array([bool(i) for i in fold[1][1]])
        b = balanced_accuracy_score(l,y_pred)
        r = recall_score(l,y_pred)
        bal += [b]
        recall += [r]
    bal = np.array(bal)
    recall = np.array(recall)
    #print("Recall:",np.mean(recall))
    #print("Balanced Accuracy:",np.mean(bal))
    #print("====================")
    mbal += [np.mean(bal)]
    mrecall += [np.mean(recall)]

mbal = np.array(mbal)
mrecall = np.array(mrecall)
print("=============")
print("Average Balanced Accuracy:",np.mean(mbal))
print("Std Balanced Accuracy:",np.std(mbal))
print("Average Recall Score:",np.mean(mrecall))
print("Std Recall Score:",np.std(mrecall))

Average Balanced Accuracy: 0.6357648469835696
Std Balanced Accuracy: 0.08242535751875663
Average Recall Score: 0.5148809523809523
Std Recall Score: 0.2292611294351741
