# Logistic Regression & Cross-Validation

In this exercise, you will use logistic regression to classify breast cancer tumors as either malignant or benign. First, run the code below to print the description of the data set, then read it.

In [None]:
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load the data set and print its keys and description
DataCancer = load_breast_cancer()
print(DataCancer.keys())
print(DataCancer.DESCR)

# Feature matrix and class labels
X_features = DataCancer.data
Y_targetClass = DataCancer.target

### A) Fit a logistic regression model with ridge (L2) regularization and the tuning parameter set to 1. Scale the features to have zero mean and unit variance, then find the accuracy of the model.
- Use random_state = 0 in the train_test_split.
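As a minimal sketch of what "zero mean and unit variance" means in practice, the snippet below standardizes a small synthetic array by hand. The toy data and the names `toy_train`, `toy_test`, `mu`, and `sigma` are hypothetical and not part of the exercise. The statistics come from the training rows only and are then reused for the test rows, which mirrors fitting `StandardScaler` on the training set and calling `transform` on both sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 8 samples, 3 features on very different scales
toy = rng.normal(loc=[10.0, 200.0, -3.0], scale=[1.0, 50.0, 0.1], size=(8, 3))
toy_train, toy_test = toy[:6], toy[6:]

# Per-feature statistics computed from the training rows only
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same shift and scale to both sets
toy_train_scaled = (toy_train - mu) / sigma
toy_test_scaled = (toy_test - mu) / sigma

print("train means:", toy_train_scaled.mean(axis=0))
print("train stds: ", toy_train_scaled.std(axis=0))
```

The scaled test rows will generally not have exactly zero mean or unit variance, because they are transformed with the training statistics rather than their own.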

In [2]:
# write your code here
from sklearn.datasets import load_breast_cancer
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# Load the data set and print its description
DataCancer = load_breast_cancer()
print(DataCancer.keys())
print(DataCancer.DESCR)

X_features = DataCancer.data
Y_targetClass = DataCancer.target

# Split first so that the scaler only ever sees the training data
X_train, X_test, Y_train, Y_test = train_test_split(X_features, Y_targetClass, random_state=0)

# Fit the scaler on the training set, then apply it to both sets
scaler = preprocessing.StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The default penalty is L2 (ridge); C is the inverse of the regularization strength
LogModel = LogisticRegression(C=1)
LogModel.fit(X_train_scaled, Y_train)

# Accuracy on the held-out test set
print('The score is ', LogModel.score(X_test_scaled, Y_test))


dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Ra

### B) For the same problem, select the best tuning parameter for ridge regularization in logistic regression. Try the values [0.01, 0.1, 1, 10, 100] using five-fold cross-validation. Find the best tuning parameter in this set and the test accuracy of the model fitted with it.
- Scale the features as in part A.
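One possible way to organize this search, shown here only as a sketch, is to wrap the scaler and the classifier in a scikit-learn `Pipeline` and let `GridSearchCV` run the five-fold cross-validation. This is an alternative formulation, not the approach used in the cell below: the scaler is refit inside each fold, so the scores may differ slightly from those obtained by scaling the whole training set once. The names `pipe`, `param_grid`, and `search`, and the `max_iter=1000` setting, are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
# max_iter=1000 is an assumption to give the solver room to converge.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

# Five-fold CV over the grid; the best model is refit on all training data
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print("Best C:", search.best_params_["logisticregression__C"])
print("Best mean CV accuracy:", search.best_score_)
print("Test accuracy:", search.score(X_te, y_te))
```

Keeping the scaler inside the pipeline means the validation part of each fold never contributes to the scaling statistics, which is the main reason to prefer this layout when the extra refitting cost is acceptable.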

In [3]:
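# write your code here
kfolds = 5
candidateC = [0.01, 0.1, 1, 10, 100]

highScore = 0
bestC = 0

for candidate in candidateC:
    Model = LogisticRegression(C=candidate)

    # Mean accuracy over the five cross-validation folds of the training set
    scores = cross_val_score(Model, X_train_scaled, Y_train, cv=kfolds)
    score = np.mean(scores)
    print('This score of C=' + str(candidate) + ':' + str(score))

    # Keep the candidate with the highest mean cross-validation accuracy
    if score > highScore:
        bestC = candidate
        highScore = score

# Refit on the full training set with the selected C, then evaluate on the test set
SelectedModel = LogisticRegression(C=bestC).fit(X_train_scaled, Y_train)
test_score = SelectedModel.score(X_test_scaled, Y_test)

# highScore is the best mean cross-validation accuracy, not a test score
print('Final score is', highScore)
print('Best C is', bestC)
print('Test score is ', test_score)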


This score of C=0.01:0.9719008533646016
This score of C=0.1:0.9835561201224676
This score of C=1:0.9811751677415153
This score of C=10:0.9765520161552994
This score of C=100:0.9671949710116605
Final score is 0.9835561201224676
Best C is 0.1
Test score is  0.965034965034965
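
The mean cross-validation scores above lie within about two percentage points of each other, so it can help to look at the spread across folds before reading much into the ranking. The sketch below reports the standard deviation of the fold scores next to each mean. It reloads the data so that it runs on its own; the names `summary` and `fold_scores` and the `max_iter=1000` setting are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale the training set once, as in part A
X_tr_scaled = StandardScaler().fit_transform(X_tr)

summary = {}
for C in [0.01, 0.1, 1, 10, 100]:
    # max_iter=1000 is an assumption to give the solver room to converge
    model = LogisticRegression(C=C, max_iter=1000)
    fold_scores = cross_val_score(model, X_tr_scaled, y_tr, cv=5)
    summary[C] = (fold_scores.mean(), fold_scores.std())
    print(f"C={C}: mean={fold_scores.mean():.4f}, std={fold_scores.std():.4f}")
```

If the standard deviations are comparable to the gaps between the means, several candidates may be practically indistinguishable on this data set.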
