**Data Preprocessing**

In this section, we are looking to:
1. choose an appropriate feature scaling method for our dataset
2. narrow down the number of predicted mislabelled indices
3. use several machine learning models to help identify the mislabelled indices
4. swap the mislabelled data back to their original values

In [37]:
import pandas as pd
import numpy as np

We read the top rows of the data to confirm that the diagnosis values have been changed from "B" to 0 and "M" to 1.

In [38]:
dataset = pd.read_csv("../data/dataset_exploratory.csv")
dataset.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


We split the dataset into X and y: X holds every column except diagnosis, and y holds the diagnosis column. Note that X therefore carries the Unnamed: 0 index column alongside the 30 features. We also store the column names in header for future reference.

In [39]:
X = dataset.loc[:, dataset.columns != "diagnosis"]
y = dataset.loc[:, dataset.columns == "diagnosis"].to_numpy()
header = list(dataset.columns)

We will scale the features by standardising them, since the features have widely varying ranges and the dataset contains outliers, which would compress most values into a narrow band under min-max scaling.

In [40]:
from sklearn.preprocessing import StandardScaler

In [41]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
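
As a quick sanity check of what standardisation does, the sketch below applies StandardScaler to a small made-up array (the values are illustrative and not taken from our dataset) and confirms that each column ends up with mean 0 and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales (illustrative values only).
toy = np.array([[1.0, 1000.0],
                [2.0, 2000.0],
                [3.0, 3000.0],
                [4.0, 4000.0]])

toy_scaled = StandardScaler().fit_transform(toy)

# Each column is centred on 0 and has unit (population) standard deviation.
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
print(col_means, col_stds)
```

After scaling, the two toy columns are identical, which shows that the difference in range no longer matters.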

Previously, in our exploratory data analysis, we were unable to identify any distinct clusters among the features we selected. Now, we will use K-means clustering to see whether it can split the benign and malignant points into two clusters, and then identify the mislabelled points from there.

In [42]:
from sklearn.cluster import KMeans

Since the result of K-means depends on how the two centroids are initialised, we try 50 different initialisations by running the algorithm 50 times with a different random_state in each iteration.

Within each iteration, we obtain the indices of the rows that produced a false positive (actual = 0, predicted = 1) or a false negative (actual = 1, predicted = 0). Cluster numbers are arbitrary, so we first swap the cluster labels whenever they disagree with the diagnosis more often than they agree.

The number of times each index occurs is stored in two separate dictionaries: false_positives, which counts indices with actual = 0, predicted = 1, and false_negatives, which counts indices with actual = 1, predicted = 0.

In [43]:
maxIter = 50

false_positives = {}
false_negatives = {}

for i in range(maxIter):
    kmeans = KMeans(n_clusters = 2, random_state = i*10)
    kmeans.fit(X_scaled)

    dataset["cluster"] = kmeans.labels_

    cross_tab = pd.crosstab(dataset["diagnosis"], dataset["cluster"])
    
    if (cross_tab[0][0] + cross_tab[1][1]) < (cross_tab[0][1] + cross_tab[1][0]):
        dataset["cluster"] = dataset["cluster"].map({0: 1, 1: 0})

    mislabel_B = dataset[(dataset["diagnosis"] == 0) & (dataset["cluster"] == 1)]
    mislabel_M = dataset[(dataset["diagnosis"] == 1) & (dataset["cluster"] == 0)]

    indices_B = mislabel_B.index
    indices_M = mislabel_M.index

    for j in range(len(indices_B)):
        if indices_B[j] in false_positives:
            false_positives[indices_B[j]] += 1
        else:
            false_positives[indices_B[j]] = 1

    for j in range(len(indices_M)):
        if indices_M[j] in false_negatives:
            false_negatives[indices_M[j]] += 1
        else:
            false_negatives[indices_M[j]] = 1
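
The same idea can be sketched more compactly on synthetic data. The example below is only an illustration under stated assumptions: the blobs from make_blobs, the three flipped rows and the names y_noisy, disagreements and stable are all made up for this sketch, and collections.Counter replaces the manual dictionary bookkeeping.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Two well-separated synthetic blobs stand in for the scaled features.
X_toy, y_true = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                           cluster_std=1.0, random_state=0)

# Flip three labels to imitate mislabelled rows (indices chosen arbitrarily).
y_noisy = y_true.copy()
y_noisy[[3, 17, 42]] = 1 - y_noisy[[3, 17, 42]]

n_runs = 5
disagreements = Counter()

for seed in range(n_runs):
    labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X_toy)
    # Cluster ids are arbitrary: flip them if they mostly oppose the labels.
    if np.mean(labels == y_noisy) < 0.5:
        labels = 1 - labels
    # Count each row whose cluster disagrees with its recorded label.
    disagreements.update(np.flatnonzero(labels != y_noisy).tolist())

# Keep the rows that were flagged in every run.
stable = sorted(idx for idx, n in disagreements.items() if n == n_runs)
print(stable)  # [3, 17, 42]
```

On data this cleanly separated every run flags exactly the flipped rows; on real data, such as ours, correctly labelled points near the cluster boundary are flagged as well.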

We store the occurrences in a dictionary to see whether any indices appear less frequently than others. To make this easy to read, we sort each dictionary by its values in ascending order.

In [44]:
freq_false_positives = {k: v for k, v in sorted(false_positives.items(),
                                  key=lambda item: item[1])}
freq_false_negatives = {k: v for k, v in sorted(false_negatives.items(), 
                                  key=lambda item: item[1])}

In [45]:
print(freq_false_positives)
print(freq_false_negatives)

{176: 22, 247: 22, 505: 22, 541: 22, 47: 50, 53: 50, 65: 50, 68: 50, 81: 50, 89: 50, 112: 50, 128: 50, 152: 50, 161: 50, 214: 50, 242: 50, 290: 50, 318: 50, 368: 50, 376: 50, 421: 50, 465: 50, 485: 50, 504: 50}
{43: 28, 13: 50, 16: 50, 38: 50, 39: 50, 40: 50, 41: 50, 44: 50, 54: 50, 63: 50, 73: 50, 74: 50, 86: 50, 99: 50, 100: 50, 119: 50, 124: 50, 126: 50, 135: 50, 143: 50, 171: 50, 182: 50, 184: 50, 186: 50, 195: 50, 205: 50, 207: 50, 234: 50, 255: 50, 261: 50, 263: 50, 268: 50, 274: 50, 277: 50, 284: 50, 293: 50, 333: 50, 350: 50, 385: 50, 414: 50, 431: 50, 435: 50, 444: 50, 450: 50, 456: 50, 470: 50, 483: 50, 489: 50, 497: 50, 511: 50, 536: 50}


Some indices do not appear in every run (four benign-labelled indices appear in 22 runs and one malignant-labelled index appears in 28 runs), while the majority of the indices appear in all 50 iterations. We will keep only the indices that appear in all 50 iterations.

In [46]:
def remove_infreq(dct, maxIter):
    output = {}
    for key in dct:
        if dct[key] == maxIter:
            output[key] = maxIter
    return output

In [47]:
freq_B = remove_infreq(freq_false_positives, maxIter)
freq_M = remove_infreq(freq_false_negatives, maxIter)
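
For reference, the same filter can be written as a dictionary comprehension. This is just an equivalent sketch; the names counts, max_iter and always_flagged are ours, and the sample counts are a few pairs copied from the output above.

```python
# A few (index, count) pairs copied from the printed dictionaries.
counts = {176: 22, 247: 22, 47: 50, 53: 50}
max_iter = 50

# Keep only the indices that were flagged in every run.
always_flagged = {idx: n for idx, n in counts.items() if n == max_iter}
print(always_flagged)  # {47: 50, 53: 50}
```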

We then obtain all the indices that are predicted to be mislabelled by combining the keys of both dictionaries.

In [48]:
kmeans_predicted_indices = []

for index in freq_B:
    kmeans_predicted_indices.append(index)
for index in freq_M:
    kmeans_predicted_indices.append(index)

We will now validate the indices by comparing them to the actual swapped indices.

In [49]:
actual_indices = np.load("../data/changed_label_row_inds.npy")
actual_indices

array([ 10,  47,  53,  63,  65,  74,  91, 124, 143, 161, 195, 214, 234,
       268, 284, 293, 297, 333, 350, 368, 431, 450, 456, 470, 483, 497,
       511, 514])

The two functions below are taken from the exploratory Jupyter notebook, as we are performing the same type of comparison as before.

In [50]:
def get_matching_indices(predicted_indices, actual_indices):
    matching_indices = []

    for ind in predicted_indices:
        if ind in actual_indices:
            matching_indices.append(ind)

    return sorted(matching_indices)

In [51]:
def get_missing_indices(matching_indices, actual_indices):
    missing_indices = []

    for ind in actual_indices:
        if ind not in matching_indices:
            missing_indices.append(ind)

    return sorted(missing_indices)
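
Both comparisons are plain set operations, so NumPy's np.intersect1d and np.setdiff1d could be used instead of the loops. The sketch below uses short hypothetical index lists (predicted and actual are made up for illustration) rather than our real ones.

```python
import numpy as np

# Hypothetical index lists, used only to illustrate the comparison.
predicted = [13, 47, 53, 99]
actual = np.array([10, 47, 53, 91])

# Indices present in both lists (returned sorted and unique).
matching = np.intersect1d(predicted, actual)

# Actual indices that the prediction failed to recover.
missing = np.setdiff1d(actual, matching)

print(matching)  # [47 53]
print(missing)   # [10 91]
```

Unlike the loop versions, these functions return NumPy arrays and drop duplicates, which makes no difference here because every index appears only once.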

We then obtain the matching indices, i.e. the K-means predictions that are actually mislabelled, and the missing indices, i.e. the actual mislabelled indices that K-means did not predict.

In [52]:
kmeans_matching_indices = get_matching_indices(kmeans_predicted_indices, actual_indices)
kmeans_missing_indices = get_missing_indices(kmeans_matching_indices, actual_indices)

As seen below, the ratio of matching to predicted indices (24:70) is roughly the same as the 26:76 obtained previously in the exploratory data analysis from the feature engineered radius.

In [53]:
print(f'predicted indices: {len(kmeans_predicted_indices)}\n{np.array(kmeans_predicted_indices)}\n')
print(f'matching indices: {len(kmeans_matching_indices)}\n{np.array(kmeans_matching_indices)}\n')
print(f'missing indices: {len(kmeans_missing_indices)}\n{np.array(kmeans_missing_indices)}')

predicted indices: 70
[ 47  53  65  68  81  89 112 128 152 161 214 242 290 318 368 376 421 465
 485 504  13  16  38  39  40  41  44  54  63  73  74  86  99 100 119 124
 126 135 143 171 182 184 186 195 205 207 234 255 261 263 268 274 277 284
 293 333 350 385 414 431 435 444 450 456 470 483 489 497 511 536]

matching indices: 24
[ 47  53  63  65  74 124 143 161 195 214 234 268 284 293 333 350 368 431
 450 456 470 483 497 511]

missing indices: 4
[ 10  91 297 514]
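
To put these counts in more familiar terms, they can be expressed as precision and recall. The short calculation below only uses the three counts printed above; the variable names are ours.

```python
n_predicted = 70   # indices flagged by K-means in all 50 runs
n_matching = 24    # flagged indices that were truly swapped
n_actual = 28      # total number of swapped labels

precision = n_matching / n_predicted   # share of flagged rows that are real
recall = n_matching / n_actual         # share of real swaps that were found

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
# precision = 0.343, recall = 0.857
```

In other words, K-means already recovers most of the swapped rows, and the remaining work is mainly to remove the false alarms.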


We will now load the indices predicted from the feature engineered radius.

In [54]:
feg_predicted_indices = np.load("../data/feg_predicted_indices.npy")
feg_predicted_indices

array([ 10,  12,  13,  14,  36,  38,  39,  41,  47,  53,  63,  65,  74,
        81,  83,  86,  92,  99, 122, 124, 135, 143, 146, 161, 177, 190,
       194, 195, 197, 200, 202, 204, 212, 213, 214, 215, 225, 234, 268,
       277, 284, 293, 329, 333, 340, 347, 350, 351, 359, 368, 372, 385,
       389, 414, 430, 431, 432, 450, 451, 456, 465, 470, 472, 476, 479,
       481, 483, 497, 509, 511, 513, 514, 518, 532, 536, 549])

There seems to be some overlap between kmeans_predicted_indices and feg_predicted_indices. Thus, we decided to keep only the indices that appear in both, by converting each to a set, taking the intersection, and converting the result back to a sorted list.

In [55]:
intersect_predicted_indices = sorted(list(set(feg_predicted_indices).intersection(set(kmeans_predicted_indices))))
intersect_matching_indices = get_matching_indices(intersect_predicted_indices, actual_indices)
intersect_missing_indices = get_missing_indices(intersect_matching_indices, actual_indices)

As seen below, the number of matching indices did not change, but the number of predicted indices dropped from 70 to 37. This suggests that we are on the right track, since only non-matching indices were removed.

In [56]:
print(f'predicted indices: {len(intersect_predicted_indices)}\n{np.array(intersect_predicted_indices)}\n')
print(f'matching indices: {len(intersect_matching_indices)}\n{np.array(intersect_matching_indices)}\n')
print(f'missing indices: {len(intersect_missing_indices)}\n{np.array(intersect_missing_indices)}')

predicted indices: 37
[ 13  38  39  41  47  53  63  65  74  81  86  99 124 135 143 161 195 214
 234 268 277 284 293 333 350 368 385 414 431 450 456 465 470 483 497 511
 536]

matching indices: 24
[ 47  53  63  65  74 124 143 161 195 214 234 268 284 293 333 350 368 431
 450 456 470 483 497 511]

missing indices: 4
[ 10  91 297 514]


Next, we decided to apply a logistic regression model to our dataset, since this is a binary classification problem.

In [57]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

We use the rows whose indices are in the predicted indices as the testing data, and every other row as the training data. This way, the model is trained on rows we believe are correctly labelled and tested on the rows we suspect are mislabelled. The four swapped rows that were missed so far remain in the training data.

In [58]:
all_indices = np.arange(X.shape[0])
train_indices = np.setdiff1d(all_indices, intersect_predicted_indices)

X_train = X_scaled[train_indices]
y_train = y[train_indices]
X_test = X_scaled[intersect_predicted_indices]
y_test = y[intersect_predicted_indices]

We then fitted the logistic regression model to the training data, obtained its predictions for the testing data, and computed the confusion matrix.

In [59]:
logreg = LogisticRegression(max_iter = 1000, random_state = 23)
logreg.fit(X_train, np.ravel(y_train))

y_pred = logreg.predict(X_test)
conf_matrix = confusion_matrix(y_pred, y_test)

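Our confusion matrix is dominated by its off-diagonal entries: 31 of the 37 predictions (25 + 6) disagree with the recorded label. This is what we were hoping to obtain, as there should be a substantial amount of mislabelled data in the testing data. Because confusion_matrix is called with the predictions first, its rows correspond to predicted labels and its columns to recorded labels, which is the transpose of scikit-learn's usual layout.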

In [60]:
print("Confusion Matrix:\n", conf_matrix)

Confusion Matrix:
 [[ 2 25]
 [ 6  4]]
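
A note on reading this matrix: scikit-learn documents the call as confusion_matrix(y_true, y_pred), with rows for the true labels and columns for the predictions. The sketch below, on made-up labels (recorded and predicted are illustrative names), shows that passing the predictions first simply transposes the matrix, so the count of disagreements is unaffected.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only.
recorded = np.array([0, 0, 0, 1, 1, 1, 1])
predicted = np.array([0, 1, 1, 1, 1, 1, 0])

# Documented order: rows are the recorded (true) labels, columns the predictions.
cm_standard = confusion_matrix(recorded, predicted)

# Passing the predictions first, as in the cell above, gives the transpose.
cm_swapped = confusion_matrix(predicted, recorded)

# Either way, the off-diagonal entries count the disagreements.
n_disagree = cm_swapped.sum() - np.trace(cm_swapped)

print(cm_standard)
print(cm_swapped)
print(n_disagree)  # 3
```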


We then extracted the indices of the rows that produced the false positives and false negatives.

In [61]:
logreg_predicted_indices = []

for i in range(y_pred.shape[0]):
    if y_pred[i] != y_test[i]:
        logreg_predicted_indices.append(intersect_predicted_indices[i])

logreg_matching_indices = get_matching_indices(logreg_predicted_indices, actual_indices)
logreg_missing_indices = get_missing_indices(logreg_matching_indices, actual_indices)
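
The whole filtering step can be illustrated end to end on synthetic data. The sketch below is not our actual pipeline: the blobs, the flipped rows and the names suspects, trusted and still_flagged are assumptions made for the example. It trains on the rows outside the suspect set and keeps a suspect only if the model disagrees with its recorded label.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two well-separated synthetic blobs stand in for the scaled features.
X_toy, y_true = make_blobs(n_samples=300, centers=[[-5, -5], [5, 5]],
                           cluster_std=1.0, random_state=0)

# Flip a handful of labels to imitate mislabelling.
flipped = np.array([5, 50, 120, 200])
y_noisy = y_true.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Suspect set: the flipped rows plus some correctly labelled ones.
suspects = np.array([5, 7, 50, 80, 120, 150, 200, 250])
trusted = np.setdiff1d(np.arange(len(y_noisy)), suspects)

# Train only on the rows that are not under suspicion.
model = LogisticRegression(max_iter=1000).fit(X_toy[trusted], y_noisy[trusted])

# A suspect stays flagged only if the model disagrees with its recorded label.
disagrees = model.predict(X_toy[suspects]) != y_noisy[suspects]
still_flagged = suspects[disagrees]
print(still_flagged)  # [  5  50 120 200]
```

Here the correctly labelled suspects are cleared and only the flipped rows remain. On our dataset the classes overlap, so some correctly labelled rows stay flagged.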

As seen below, we have successfully narrowed down the number of predicted indices from 37 to 31 while keeping the matching indices the same.

In [62]:
print(f'predicted indices: {len(logreg_predicted_indices)}\n{np.array(logreg_predicted_indices)}\n')
print(f'matching indices: {len(logreg_matching_indices)}\n{np.array(logreg_matching_indices)}\n')
print(f'missing indices: {len(logreg_missing_indices)}\n{np.array(logreg_missing_indices)}')

predicted indices: 31
[ 13  38  41  47  53  63  65  74  86  99 124 135 143 161 195 214 234 268
 284 293 333 350 368 385 431 450 456 470 483 497 511]

matching indices: 24
[ 47  53  63  65  74 124 143 161 195 214 234 268 284 293 333 350 368 431
 450 456 470 483 497 511]

missing indices: 4
[ 10  91 297 514]


We then decided to repeat this same process once more to see if it could narrow down the number of predicted indices.

In [63]:
all_indices = np.arange(X.shape[0])
train_indices = np.setdiff1d(all_indices, logreg_predicted_indices)

X_train = X_scaled[train_indices]
y_train = y[train_indices]
X_test = X_scaled[logreg_predicted_indices]
y_test = y[logreg_predicted_indices]

In [64]:
logreg = LogisticRegression(max_iter = 1000, random_state = 23)
logreg.fit(X_train, np.ravel(y_train))

y_pred = logreg.predict(X_test)
conf_matrix = confusion_matrix(y_pred, y_test)

In [65]:
print("Confusion Matrix:\n", conf_matrix)

Confusion Matrix:
 [[ 0 22]
 [ 6  3]]


In [66]:
logreg2_predicted_indices = []

for i in range(y_pred.shape[0]):
    if y_pred[i] != y_test[i]:
        logreg2_predicted_indices.append(logreg_predicted_indices[i])

logreg2_matching_indices = get_matching_indices(logreg2_predicted_indices, actual_indices)
logreg2_missing_indices = get_missing_indices(logreg2_matching_indices, actual_indices)

As seen below, the second pass narrows the number of predicted indices down further, from 31 to 28. There are 28 mislabelled rows in total and we now predict exactly 28, although only 24 of them match the actual mislabelled rows.

In [67]:
print(f'predicted indices: {len(logreg2_predicted_indices)}\n{np.array(logreg2_predicted_indices)}\n')
print(f'matching indices: {len(logreg2_matching_indices)}\n{np.array(logreg2_matching_indices)}\n')
print(f'missing indices: {len(logreg2_missing_indices)}\n{np.array(logreg2_missing_indices)}')

predicted indices: 28
[ 38  47  53  63  65  74  99 124 135 143 161 195 214 234 268 284 293 333
 350 368 385 431 450 456 470 483 497 511]

matching indices: 24
[ 47  53  63  65  74 124 143 161 195 214 234 268 284 293 333 350 368 431
 450 456 470 483 497 511]

missing indices: 4
[ 10  91 297 514]


We will now swap the mislabelled data back to their original values, so that we can perform our classification task. We use the actual mislabelled indices rather than our predicted ones, so that we do not harm the quality of the dataset.

In [68]:
for index in actual_indices:
    y[index] = 1 - y[index]
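
Since the labels are 0 and 1, subtracting from 1 flips them, and NumPy can do this for all rows at once. The sketch below shows the vectorised form on a made-up label column (y_toy and swapped_rows are illustrative names).

```python
import numpy as np

# Toy label column with the same (n, 1) shape as y; values are made up.
y_toy = np.array([[0], [1], [1], [0], [1]])
swapped_rows = np.array([1, 3])

# 1 - y maps 0 -> 1 and 1 -> 0; fancy indexing flips all listed rows at once.
y_toy[swapped_rows] = 1 - y_toy[swapped_rows]

print(y_toy.ravel())  # [0 0 1 1 1]
```

This only works as intended when each index appears once in the list, which is the case for our swapped indices.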

Since we split the dataset into X and y earlier, we merge them back into one dataframe to be saved in a CSV file.

In [69]:
# Reuse X's own column names so that they line up with the columns of X_scaled.
dataset_modified_X = pd.DataFrame(X_scaled, columns=X.columns)
dataset_modified_Y = pd.DataFrame(y, columns = ["diagnosis"])
dataset_modified = pd.concat([dataset_modified_Y, dataset_modified_X], axis = 1)

In [70]:
dataset_modified.to_csv("../data/dataset_preprocessed.csv", index = False)

Since we applied feature scaling to the dataset, we save the fitted scaler so that future data can be scaled in the same way before predictions are made on it.

In [71]:
import pickle

In [72]:
with open("scaler.bin", "wb") as f:
    pickle.dump(scaler, f)
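
For completeness, here is a minimal sketch of how a saved scaler could be reused later. It is only an illustration: the scaler is fitted on made-up data and round-tripped through pickle in memory instead of reading scaler.bin, and the names fitted, restored and new_rows are ours.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted scaler (fitted here on made-up data).
fitted = StandardScaler().fit(np.array([[1.0, 10.0], [3.0, 30.0]]))

# Round-trip through pickle, as if writing and then re-reading scaler.bin.
restored = pickle.loads(pickle.dumps(fitted))

# New rows must have the same columns, in the same order, as the training data.
new_rows = np.array([[2.0, 20.0]])
scaled_new = restored.transform(new_rows)

print(scaled_new)  # [[0. 0.]]
```

The new row equals the column means of the toy training data, so it is scaled to zero. In practice the file should be unpickled with the same scikit-learn version that created it, and only if it comes from a trusted source.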