# Logistic Regression with scikit-learn

## Goal: Correctly classify a sample tumor as malignant or benign using the data in the Breast Cancer Wisconsin (Diagnostic) dataset from scikit-learn [1].

### Logistic Regression:
Logistic regression is a model used to predict the probability of a sample being in a particular class or to classify a sample as being in a particular class. Its hypothesis function is
<center>$z = WX_{train} + B$ &emsp;&emsp;[Eq.1]</center>
<center>$h = f(z) = \frac{1}{1 + e^{-z}}$ &emsp;&emsp;[Eq.2]</center>
where $f(z) = P(positive|X_{train})$ is the vector of probabilities that a given sample is positive and f is the sigmoid, or the eponymous *logistic* function; W is the matrix of weights which is optimized by the model during training; and B is a matrix of biases tuned to model how data is distributed around the origin (e.g., if the data is centered around the origin, the biases should be close to 

In [3]:
import numpy as np
import matplotlib.pyplot as plt

z = np.arange(-50, 50)
h = 1/(1+ np.exp(-z))
plt.plot(z, h)
plt.title("Sigmoid Function (h=f(z))")
plt.show()

[Figure: plot of the sigmoid function h = f(z), an S-shaped curve rising from 0 to 1]

Because the sigmoid function constrains its outputs to values only ranging between 0 and 1, it is suitable in most instances to view this output as a probability. Logistic regression can be used for classification purposes by designating a threshold value called a decision boundary, often 0.5, over which the sample is labeled positive, and below which the sample is labeled negative.
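
As a minimal sketch of Eq. 1, Eq. 2, and the thresholding step, the snippet below pushes three made-up samples through the sigmoid and labels them with a 0.5 threshold. The names `X_demo`, `w_demo`, and `b_demo` and their values are assumptions chosen for illustration, not parameters learned from the dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function from Eq. 2."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs for illustration only: 3 samples, 2 features each.
X_demo = np.array([[0.5, 1.0],
                   [-1.0, -2.0],
                   [2.0, 1.0]])
w_demo = np.array([1.0, 1.0])  # assumed weights, not fitted values
b_demo = 0.0                   # assumed bias

z_demo = X_demo @ w_demo + b_demo         # Eq. 1, one z per sample
h_demo = sigmoid(z_demo)                  # Eq. 2, probability of the positive class
labels_demo = (h_demo > 0.5).astype(int)  # 1 = positive, 0 = negative

print(h_demo)
print(labels_demo)  # [1 0 1]
```

The second sample has a negative z, so its probability falls below 0.5 and it is labeled negative; the other two are labeled positive.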

### Step 1: Load the dataset
scikit-learn bundles this dataset, so we load it with `load_breast_cancer` and separate the feature matrix `X` from the target vector `y`.

In [4]:
from sklearn.datasets import load_breast_cancer
raw = load_breast_cancer()

X = raw.data
y = raw.target

#### Examine the dataset.
To get a better idea of the data we're dealing with, let's look at some informative aspects of the dataset.

In [5]:
# Show dataset description
print(raw.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

In [6]:
# Show feature names
features = list(raw.feature_names)
print("Number of features:", len(features))
print("All", len(features), "Features:\n" + str(features))

Number of features: 30
All 30 Features:
['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']


In [7]:
# Show target/class label names
print("Class labels:", list(raw.target_names))

Class labels: ['malignant', 'benign']


In [8]:
# Show number of samples
print("Number of samples:", len(X))

Number of samples: 569


In [9]:
# Show dimension of X (the array of samples), e.g., (# of samples, # of features)
print("Dimension of X:", X.shape)

Dimension of X: (569, 30)


In [10]:
# Show dimension of y (the vector of target labels, one per sample)
print("Dimension of y:", y.shape)

Dimension of y: (569,)


### Step 2: Split the data into training and test sets
A model evaluated on the same data it was trained on can look more accurate than it really is, because it may have overfit: it has memorized quirks of the training samples that do not hold for unseen data (in other words, it has a high generalization error). To get an honest estimate of performance on unseen data, we split the data such that a certain percentage is dedicated to training, and the rest is held out for testing.

In [11]:
# Split X, y into X_train, X_test, y_train, y_test with 7:3 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
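
Because `train_test_split` shuffles randomly, each run of the cell above produces a different split, so the numbers reported later will vary from run to run. A hedged variant, assuming reproducibility and matching class proportions are wanted, fixes `random_state` and passes `stratify`. The seed value 0 and the `_s`-suffixed names are arbitrary choices for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the features and labels directly as arrays.
X_all, y_all = load_breast_cancer(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the benign/malignant
# proportions roughly equal in the training and test sets.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all
)

print(X_train_s.shape, X_test_s.shape)
print("Fraction benign (train, test):", y_train_s.mean(), y_test_s.mean())
```

With 569 samples and `test_size=0.3`, the test set holds 171 samples and the training set 398.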

### Step 3: Fit the model to the training data using a logistic model
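To fit the model, a cost function is minimized over the training samples. For logistic regression it is something like the cross-entropy (log loss), where $y$ is the true label (0 or 1):
<center>$Cost(h(z), y) = -y\log(h(z)) - (1 - y)\log(1 - h(z))$</center>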
The logistic model class provided by scikit-learn has a `solver` parameter, which designates the optimization algorithm. The scikit-learn documentation [2] notes that the "liblinear" solver is a good choice for small datasets (like the one used here). By default, the model uses L2 regularization (L2 meaning the L2 norm, i.e., the Euclidean length of the weight vector), which penalizes large weights.

In [12]:
# Build a logistic regression model of solver='liblinear' with X_train, y_train
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear').fit(X_train, y_train)
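
Fitting amounts to choosing the weights that minimize a cost over the training samples. The sketch below assumes the usual cross-entropy (log loss) form of that cost and evaluates it by hand on made-up probabilities. `mean_log_loss` and the demo arrays are hypothetical names, and scikit-learn's solvers additionally include a regularization term that is left out here.

```python
import numpy as np

def mean_log_loss(y_true, h, eps=1e-15):
    """Average cross-entropy between true labels (0/1) and predicted probabilities."""
    h = np.clip(h, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(h) + (1 - y_true) * np.log(1 - h))))

# Made-up labels and probabilities for illustration only.
y_demo = np.array([1, 0, 1])
good_probs = np.array([0.9, 0.1, 0.8])  # confident and correct
bad_probs = np.array([0.1, 0.9, 0.2])   # confident and wrong

loss_good = mean_log_loss(y_demo, good_probs)
loss_bad = mean_log_loss(y_demo, bad_probs)
print(loss_good, loss_bad)
print(loss_good < loss_bad)  # True
```

Confident wrong predictions are penalized heavily, which is what pushes the optimizer toward weights that assign high probability to the correct class.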

### Step 4: Predict labels for the test data
In order to evaluate the performance of a trained model, it must be used on unseen data to predict target labels for each of the test samples.

In [13]:
# Predict y_pred from X_test
y_pred = model.predict(X_test)

In [14]:
# Show first ten predictions
print([raw.target_names[i] for i in y_pred[0:10]])

['benign', 'malignant', 'benign', 'benign', 'malignant', 'benign', 'malignant', 'benign', 'malignant', 'benign']
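
`predict` returns hard labels, but the fitted model also exposes the underlying probabilities through `predict_proba`. The sketch below refits its own model on a seeded split so that it runs standalone (the seed 0 and the `_d`-suffixed names are assumptions). It compares the labels obtained by thresholding the benign-class probability at 0.5 with the output of `predict`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_d, y_d = load_breast_cancer(return_X_y=True)
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_d, y_d, test_size=0.3, random_state=0
)
model_d = LogisticRegression(solver="liblinear").fit(X_train_d, y_train_d)

# Columns follow model_d.classes_: column 0 is label 0 (malignant),
# column 1 is label 1 (benign).
proba_d = model_d.predict_proba(X_test_d)

# Thresholding the benign probability at 0.5 should reproduce predict().
labels_from_proba = (proba_d[:, 1] > 0.5).astype(int)
agreement = np.mean(labels_from_proba == model_d.predict(X_test_d))

print(proba_d[:3].round(3))
print("Agreement with predict():", agreement)
```

Working with probabilities rather than hard labels makes it possible to pick a threshold other than 0.5, for example to reduce the number of malignant tumors labeled benign.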


### Step 5: Measure the model's performance

#### Confusion Matrix:
A confusion matrix shows the relationship between the ground truth and the model's predictions, and it also serves as the basis on which other performance measures are calculated. In scikit-learn's output, the rows are the actual classes and the columns are the predicted classes, both ordered by label (0 = malignant, 1 = benign), and the metric functions used below treat label 1 (benign) as the positive class by default. Hence, the entries starting from the top left and going clockwise are:
* TN: the number of true negatives (the model guessed correctly that the sample was negative), 
* FP: the number of false positives (the model guessed that the sample was positive and it was actually negative), 
* TP: the number of true positives (the model guessed correctly that the sample was positive), and 
* FN: the number of false negatives (the model guessed that the sample was negative and it was actually positive).

In [17]:
# Show confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
print("Confusion Matrix:\n" + str(confusion_matrix(y_test, y_pred)))

Confusion Matrix:
[[ 51   7]
 [  6 107]]
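
To make the layout of this output concrete, the sketch below unpacks the four counts from the matrix printed above. It relies on scikit-learn's convention that rows are actual labels and columns are predicted labels, both in sorted label order, so with label 1 (benign) as the positive class the flattened order is TN, FP, FN, TP. The array is typed in by hand from the output, so the numbers apply only to that particular run.

```python
import numpy as np

# Confusion matrix copied from the output above (one particular random split).
cm = np.array([[51, 7],
               [6, 107]])

# Row-major flattening gives: [actual 0, predicted 0], [actual 0, predicted 1],
# [actual 1, predicted 0], [actual 1, predicted 1].
tn, fp, fn, tp = cm.ravel()

print("Malignant predicted malignant (TN):", tn)
print("Malignant predicted benign    (FP):", fp)
print("Benign predicted malignant    (FN):", fn)
print("Benign predicted benign       (TP):", tp)
print("Total test samples:", cm.sum())
```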


#### Other Performance Measures:
Accuracy: $\frac{TP + TN}{TP + FN + TN + FP}$<br>
Precision: $\frac{TP}{TP + FP}$<br>
Recall/Sensitivity/True Positive Rate (*TPR*): $\frac{TP}{TP + FN}$<br>
Specificity/True Negative Rate (*TNR*): $\frac{TN}{TN + FP} = 1 - FPR$, where the False Positive Rate is $FPR = \frac{FP}{FP + TN}$<br>

(There are a number of other metrics, even just consisting of the values found in the confusion matrix, but here we will only calculate these four.)
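
As a worked check of these formulas, the following sketch plugs in the counts from the confusion matrix printed earlier (TN = 51, FP = 7, FN = 6, TP = 107, taking label 1, benign, as the positive class). The counts come from that one run and will differ for another random split.

```python
# Counts read off the confusion matrix shown earlier (one particular split).
tn, fp, fn, tp = 51, 7, 6, 107

accuracy = (tp + tn) / (tp + fn + tn + fp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)   # true negative rate
fpr = fp / (fp + tn)           # false positive rate

print(f"Accuracy:    {accuracy:.4f}")     # 0.9240
print(f"Precision:   {precision:.4f}")    # 0.9386
print(f"Recall:      {recall:.4f}")       # 0.9469
print(f"Specificity: {specificity:.4f}")  # 0.8793
print(f"1 - FPR:     {1 - fpr:.4f}")      # 0.8793
```

Specificity equals one minus the false positive rate, not one minus the recall.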

In [21]:
# Show accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.9239766081871345


In [22]:
# Show precision
print("Precision:", precision_score(y_test, y_pred))

Precision: 0.9385964912280702


In [26]:
# Show recall
recall = recall_score(y_test, y_pred)
print("Recall:", recall)

Recall: 0.9469026548672567


In [27]:
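# Show specificity (true negative rate), computed from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Specificity:", tn / (tn + fp))

Specificity: 0.8793103448275862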


### Resources

1: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
<br>
2: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html