# LR Breast Cancer Diagnoser

Using a dataset of features computed from digitized images of the cell nuclei in breast masses, let's see whether we can use machine learning to build a model that diagnoses a breast mass as benign or malignant.

## Data Exploration and Preprocessing

Let's begin by loading in the necessary packages and checking to see what file(s) we have in our dataset.

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import fbeta_score, confusion_matrix
import eli5
from eli5.sklearn import PermutationImportance
from pdpbox import pdp
print(os.listdir("../input"))

Let's read our data file into a pandas dataframe and take a peek at its contents.

In [None]:
data = pd.read_csv("../input/data.csv", index_col="id")
data.head()

As we can see, each row describes a breast mass through many features computed from its digitized cell nuclei image, such as the mean radius or texture. As the dataset description states, this is a binary classification problem, with each mass labeled as either benign (represented by the label 'B') or malignant (represented by the label 'M').

Let's gather some statistical information about our dataset before proceeding.

In [None]:
data.describe(include="all")

Viewing the above dataset description, we clearly have an unnecessary column named 'Unnamed: 32' that contains no data. We can confidently drop this column from our dataset.

In [None]:
data.drop("Unnamed: 32", axis=1, inplace=True)

The dataset description also shows that we're thankfully missing no data points, which is helpful since there are only 569 instances in our dataset. The features span very different ranges of values, so the data will have to be standardized before training a machine learning model on it. Let's view the class distribution in our dataset.

In [None]:
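# Pass the column by keyword and fix the bar order so that the
# tick labels below are guaranteed to match the bars.
ax = sns.countplot(x=data.diagnosis, order=["M", "B"])
ax.set_xticklabels(["Malignant", "Benign"])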
ax.set_xlabel("Diagnosis")
ax.set_ylabel("Number of Data Samples")

As we can see from the above plot, we have more 'benign' than 'malignant' samples in our dataset, but the imbalance doesn't seem large enough to be of much concern.

Let's now separate our dataset into a train and test set, with 80% of the data samples placed into the training set and 20% placed into the test set, stratifying the split so that the class distribution stays nearly the same in both sets.

In [None]:
X = data.drop("diagnosis", axis=1)
y = data.diagnosis
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=0,
                                                    stratify=y)
X_train = X_train.copy()
y_train = y_train.copy()
X_test = X_test.copy()
y_test = y_test.copy()
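
To see what `stratify` buys us, here is a minimal sketch on made-up labels. The `toy_` names and the 63/37 class mix are assumptions chosen for illustration, not values read from the dataset. It compares the share of malignant labels in each split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels (an assumption for illustration): 63 benign, 37 malignant.
toy_y = pd.Series(["B"] * 63 + ["M"] * 37)
toy_X = pd.DataFrame({"feature": np.arange(len(toy_y))})

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y)

# Share of malignant labels in each split; stratification keeps them close.
train_share = (toy_y_train == "M").mean()
test_share = (toy_y_test == "M").mean()
print(f"train: {train_share:.3f}, test: {test_share:.3f}")
```

Without `stratify`, the two shares could drift apart by chance, which matters more on a dataset this small.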

Let's also standardize the values in both the train and test sets using a standardizer fit to the training data.

In [None]:
scaler = StandardScaler()
X_train[X_train.columns] = scaler.fit_transform(X_train)
X_test[X_test.columns] = scaler.transform(X_test)
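
As a rough sketch of what the scaler is doing, the snippet below standardizes a toy test row by hand using statistics taken from the toy training rows only, then compares the result with `StandardScaler`. The arrays are invented values, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (assumed values for illustration only).
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[4.0, 400.0]])

toy_scaler = StandardScaler().fit(toy_train)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default.
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)
manual_test = (toy_test - train_mean) / train_std

scaled_test = toy_scaler.transform(toy_test)
print(np.allclose(manual_test, scaled_test))
```

Fitting on the training data alone keeps information about the test set from leaking into preprocessing.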

With the data satisfactorily preprocessed, let's proceed to build a supervised machine learning model to diagnose breast cancer.

## Diagnosing Breast Cancer with Machine Learning

### Training a Model

Let's train several different machine learning models to diagnose a breast mass as either benign or malignant. We will begin by training a logistic regression (LR) model for this task, using cross-validation to identify the best hyperparameters to use.

In [None]:
lr_param_grid = dict(class_weight=(None, "balanced"),
                     penalty=("l1", "l2"),
                     C=np.logspace(-3, 3, 7))
lr_cv = GridSearchCV(LogisticRegression(solver="liblinear", random_state=0),
                     lr_param_grid,
                     cv=5)
best_lr_params = lr_cv.fit(X_train, y_train).best_params_
lr_model = LogisticRegression(penalty=best_lr_params["penalty"],
                              C=best_lr_params["C"],
                              class_weight=best_lr_params["class_weight"],
                              solver="liblinear",
                              random_state=0)
lr_model.fit(X_train, y_train)
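
As an aside, `GridSearchCV` refits the best configuration on all of the data it was given by default (`refit=True`), so the winning model is also available as `best_estimator_`. The sketch below shows this on synthetic data; the `toy_` names and the small grid are assumptions for illustration, and this is an alternative to the explicit refit above, not a replacement for it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (illustration only).
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

toy_cv = GridSearchCV(LogisticRegression(solver="liblinear", random_state=0),
                      {"C": [0.1, 1.0, 10.0], "penalty": ["l1", "l2"]},
                      cv=5)
toy_cv.fit(toy_X, toy_y)

# With the default refit=True, best_estimator_ has already been trained
# on all of toy_X using best_params_.
toy_model = toy_cv.best_estimator_
print(toy_cv.best_params_, toy_model.score(toy_X, toy_y))
```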

Support vector machines (SVMs) are known to perform well on small datasets. Let's use cross-validation to find the best hyperparameters for an SVM on this task as well, including which kernel works best.

In [None]:
svm_param_grid = dict(class_weight=(None, "balanced"),
                      C=np.logspace(-3, 3, 7),
                      kernel=("linear", "poly", "rbf", "sigmoid"),
                      degree=(2, 3),
                      gamma=("auto", "scale"),
                      shrinking=(True, False))
svm_cv = GridSearchCV(SVC(random_state=0),
                      svm_param_grid,
                      cv=5)
best_svm_params = svm_cv.fit(X_train, y_train).best_params_
print(f"A {best_svm_params['kernel']} SVM kernel should be used for this task.")

It seems that an SVM with a linear kernel performs the best in cross-validation. Since scikit-learn has a dedicated LinearSVC class optimized for linear SVMs, let's perform further cross-validation with it to train a linear SVM and hopefully achieve superior results.

In [None]:
lsvm_param_grid = dict(class_weight=(None, "balanced"),
                       penalty=("l1", "l2"),
                       loss=("hinge", "squared_hinge"),
                       dual=(True, False),
                       C=np.logspace(-3, 3, 7))
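# Some penalty/loss/dual combinations are unsupported and fail to fit;
# error_score=np.nan scores them as NaN so the search can continue.
lsvm_cv = GridSearchCV(LinearSVC(random_state=0),
                       lsvm_param_grid,
                       cv=5,
                       error_score=np.nan)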
best_lsvm_params = lsvm_cv.fit(X_train, y_train).best_params_
svm_model = LinearSVC(class_weight=best_lsvm_params["class_weight"],
                      penalty=best_lsvm_params["penalty"],
                      loss=best_lsvm_params["loss"],
                      dual=best_lsvm_params["dual"],
                      C=best_lsvm_params["C"],
                      random_state=0)
svm_model.fit(X_train, y_train)

### Model Testing & Analysis

Let's now see the accuracy scores achieved by these two models on the test set.

In [None]:
print(f"The test accuracy score of the LR model is {lr_model.score(X_test, y_test)}.")
print("The test accuracy score of the linear SVM model is "
      f"{svm_model.score(X_test, y_test)}.")

Though both models diagnose breast masses with high accuracy, both still make a few errors. When selecting a machine learning model, it's important to consider not just how many errors it makes but which types. In this context, a false positive corresponds to diagnosing a benign tumor as malignant, while a false negative corresponds to diagnosing a malignant tumor as benign. We would clearly much rather have false positives than false negatives, since we don't want any cancerous tumors to get past our machine learning model. Recall is the standard metric for measuring false negatives, but I personally tend to avoid relying on it alone, since a model that simply classified every sample as positive would achieve perfect recall. I will therefore use the F2-Score instead, a weighted harmonic mean of precision and recall that lends more weight to recall without ignoring a model's precision.
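
For reference, the F-beta score combines precision P and recall R as (1 + beta^2) * P * R / (beta^2 * P + R), so beta = 2 weights recall more heavily than precision. The sketch below checks that formula against scikit-learn on a handful of made-up labels; `toy_true` and `toy_pred` are hypothetical and unrelated to the models above.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical labels and predictions, invented for illustration:
# 3 true positives, 1 false negative, 1 false positive, 5 true negatives.
toy_true = ["M", "M", "M", "M", "B", "B", "B", "B", "B", "B"]
toy_pred = ["M", "M", "M", "B", "M", "B", "B", "B", "B", "B"]

precision = precision_score(toy_true, toy_pred, pos_label="M")
recall = recall_score(toy_true, toy_pred, pos_label="M")

# F-beta computed directly from precision and recall.
beta = 2
manual_f2 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
library_f2 = fbeta_score(toy_true, toy_pred, beta=beta, pos_label="M")
print(manual_f2, library_f2)
```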

In [None]:
lr_preds = lr_model.predict(X_test)
svm_preds = svm_model.predict(X_test)
print("The test F2-Score of the LR model is "
      f"{fbeta_score(y_test, lr_preds, beta=2, pos_label='M')}.")
print("The test F2-Score of the linear SVM model is "
      f"{fbeta_score(y_test, svm_preds, beta=2, pos_label='M')}.")

As we can see, the logistic regression model again performs better than the linear support vector machine, and will therefore be selected as the machine learning model to use as a breast cancer diagnoser in this work.

To gain a better understanding of the types of predictions made by this logistic regression model on the test set, let's view a confusion matrix of its predictions.

In [None]:
confusion = pd.DataFrame(confusion_matrix(y_test, lr_preds))
confusion = confusion.div(confusion.sum().sum())
confusion.columns = ["Predicted Negative", "Predicted Positive"]
confusion.index = ["Actual Negative", "Actual Positive"]
ax = sns.heatmap(confusion, vmin=0, vmax=1, annot=True, fmt=".0%")
ax.set_yticklabels(ax.get_yticklabels(), rotation=0)
ax.collections[0].colorbar.set_ticks((0, .25, .5, .75, 1))
ax.collections[0].colorbar.set_ticklabels(("0%", "25%", "50%", "75%", "100%"))

Viewing the above confusion matrix, we can see that of all diagnoses made by the logistic regression model, around 97% of them are correct, while around 1% of predictions are false positives and 2% are false negatives.
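
If the raw counts are wanted rather than percentages, they can be unpacked from the confusion matrix directly. This is a small sketch on invented labels (the `toy_` lists are hypothetical); passing `labels=["B", "M"]` makes the negative-then-positive ordering explicit instead of relying on the default sort order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, invented for illustration.
toy_true = ["M", "M", "M", "M", "B", "B", "B", "B", "B", "B"]
toy_pred = ["M", "M", "M", "B", "M", "B", "B", "B", "B", "B"]

# Rows are actual classes, columns are predicted classes, both ordered
# as "B" (negative) then "M" (positive).
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=["B", "M"]).ravel()
total = tn + fp + fn + tp
print(f"accuracy: {(tn + tp) / total:.0%}")
print(f"false positives: {fp / total:.0%}, false negatives: {fn / total:.0%}")
```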

To gain a better understanding of which breast mass features are the most significant, let's visualize the permutation importance of the features in our dataset.

In [None]:
permutations = PermutationImportance(lr_model, random_state=0).fit(X_test, y_test)
eli5.show_weights(permutations, top=None, feature_names=X_test.columns.tolist())

As we can see, the worst texture of the digitized cell nuclei image of a breast mass seems to be the most significant feature, while the standard error of the image concavity seems to be the least significant.
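
For intuition, permutation importance can be sketched by hand: shuffle one column, re-score the model, and record how far the score falls. The snippet below is a simplified illustration on synthetic data, not eli5's implementation; `permutation_drop` and the `toy_` names are hypothetical helpers introduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustration only): the label depends on column 0 alone,
# so column 1 is pure noise.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(300, 2))
toy_y = (toy_X[:, 0] > 0).astype(int)
toy_model = LogisticRegression(solver="liblinear", random_state=0).fit(toy_X, toy_y)

def permutation_drop(model, X, y, column, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling one column of X."""
    shuffle_rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = []
    for _ in range(n_repeats):
        X_shuffled = X.copy()
        X_shuffled[:, column] = shuffle_rng.permutation(X_shuffled[:, column])
        drops.append(baseline - model.score(X_shuffled, y))
    return float(np.mean(drops))

drops = [permutation_drop(toy_model, toy_X, toy_y, c) for c in range(toy_X.shape[1])]
print(drops)
```

A feature the model leans on heavily produces a large drop when shuffled, while an uninformative one produces a drop near zero.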

Let's build partial dependence plots of the 5 most significant features to view their impact on breast tumor malignancy.

In [None]:
feature_names = X_test.columns
top_features = ("texture_worst", "radius_se", "perimeter_se",
                "area_se", "compactness_se")
for feature in top_features:
    pdp_feature = pdp.pdp_isolate(lr_model, X_test, feature_names, feature)
    pdp.pdp_plot(pdp_feature, feature)

As the above partial dependence plots show, the predicted probability of breast tumor malignancy steadily increases with the worst texture and with the standard errors of the radius, perimeter, and area, while it decreases as the standard error of the compactness goes up.
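
The idea behind a partial dependence curve can be sketched in a few lines: force one feature to a fixed value for every sample, average the predicted probabilities, and repeat over a grid of values. This is a simplified illustration on synthetic data, not pdpbox's implementation; `partial_dependence_curve` and the `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustration only): larger values of column 0 make
# class 1 more likely; column 1 is noise.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(300, 2))
toy_y = (toy_X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
toy_model = LogisticRegression(solver="liblinear", random_state=0).fit(toy_X, toy_y)

def partial_dependence_curve(model, X, column, grid):
    """Average predicted probability of class 1 with one column forced to each grid value."""
    curve = []
    for value in grid:
        X_forced = X.copy()
        X_forced[:, column] = value
        curve.append(model.predict_proba(X_forced)[:, 1].mean())
    return np.array(curve)

grid = np.linspace(-2, 2, 5)
curve = partial_dependence_curve(toy_model, toy_X, 0, grid)
print(curve)
```

A rising curve, as with the first four features above, means that increasing the feature pushes the average prediction toward the positive class.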

## Final Remarks

By training a logistic regression model on features computed from digitized cell nuclei images of breast masses, we were able to build a breast cancer diagnoser with a test accuracy of about 97% and a test F2-Score of about 96%. Using this model, we found that the worst texture and the standard errors of the radius, perimeter, area, and compactness were the best individual indicators of breast tumor malignancy, while features such as the worst smoothness and the standard error of the concavity were the least informative. Despite the high accuracy and F2-Score achieved, there is certainly room to further reduce the model's errors, most importantly its false negatives. Building on the feature analysis above, one could remove unnecessary features and/or engineer new ones to perhaps improve upon these results.