# Logistic Regression on the Breast Cancer dataset

In [6]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

In [7]:
logreg = LogisticRegression().fit(X_train, y_train)

In [8]:
print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))

Training set score: 0.953
Test set score: 0.958


The default value of C=1 provides quite good performance, with 95% accuracy on both the training and the test set. But because training and test set performance are very close, we are likely underfitting. Let's try increasing C, which weakens the regularization, to fit a more flexible model:

In [11]:
logreg100 = LogisticRegression(C=100).fit(X_train, y_train)

print("Training set score: {:.3f}".format(logreg100.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg100.score(X_test, y_test)))

Training set score: 0.972
Test set score: 0.965


Using C=100 results in higher training set accuracy and a slightly higher test set accuracy, confirming our intuition that a more complex model should perform better on this dataset.
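To compare several regularization strengths side by side, the two experiments above can be folded into a small loop. The following is a minimal sketch, not part of the original notebook: `sweep_C` is a hypothetical helper name, the value C=0.01 is added only for illustration, and `solver="liblinear"` is an assumption made because older scikit-learn releases used it as the default while newer ones default to `lbfgs`. Exact scores may differ between scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same split as in the cells above.
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)


def sweep_C(C_values):
    """Hypothetical helper: fit one model per C and return {C: (train_acc, test_acc)}."""
    scores = {}
    for C in C_values:
        # solver="liblinear" is an assumption (the default in older scikit-learn).
        model = LogisticRegression(C=C, solver="liblinear").fit(X_train, y_train)
        scores[C] = (model.score(X_train, y_train), model.score(X_test, y_test))
    return scores


# Smaller C means stronger regularization, larger C means weaker regularization.
scores = sweep_C([0.01, 1, 100])
for C, (train_acc, test_acc) in scores.items():
    print("C={:<6} train: {:.3f}  test: {:.3f}".format(C, train_acc, test_acc))
```

Reading the training and test accuracies together for each C makes it easier to see whether a given setting is underfitting (both scores low and close) or starting to overfit (training score rising while the test score falls).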