# Instructor Task
## Dataset
- [Here](https://s3-us-west-2.amazonaws.com/ga-dat-2015-suneel/datasets/breast-cancer.csv) is the dataset.
- [Here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names) is a description of the data. Ignore column 0 as it is merely the ID of a patient record.

In [13]:
import pandas as pd

## 1. Read in the data

In [14]:
url = "https://s3-us-west-2.amazonaws.com/ga-dat-2015-suneel/datasets/breast-cancer.csv"
df = pd.read_csv(url, header=None)

In [15]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## 2. Separate the data into feature and target.

In [16]:
y = df[1]
X = df[df.columns[2:]]

## 3. Create and evaluate logistic regression using cross_val_score and 5 folds.
- What is the mean accuracy?
- What is the standard deviation of accuracy?

In [17]:
from sklearn.cross_validation import cross_val_score
from sklearn.linear_model import LogisticRegression

reg_model = LogisticRegression()
scores = cross_val_score(reg_model, X, y, cv=5)

In [18]:
scores.mean()

0.95090419392073855

In [19]:
scores.std()

0.015935038978124323

## 4. Get a classification report to identify type 1, type 2 errors.
- Use train_test_split to run logistic regression once, with a test size of 0.33
- Make predictions on the test set
- Compare the predictions to the answers to determine the classification report

In [20]:
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

model = LogisticRegression()

# Fit the model
model.fit(X_train, y_train)

# Produce the predictions
y_pred = model.predict(X_test)

print(classification_report(y_pred, y_test))

             precision    recall  f1-score   support

          B       0.96      0.98      0.97       122
          M       0.95      0.92      0.94        66

avg / total       0.96      0.96      0.96       188



## 5. Scale the data and see if that improves the score

In [21]:
from sklearn import preprocessing

X_scaled = preprocessing.scale(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.33)

model = LogisticRegression()

# Fit the model
model.fit(X_train, y_train)

# Produce the predictions
y_pred = model.predict(X_test)

print(classification_report(y_pred, y_test))

             precision    recall  f1-score   support

          B       1.00      1.00      1.00       111
          M       1.00      1.00      1.00        77

avg / total       1.00      1.00      1.00       188



## 6. Tune the model using automated parametric grid search via LogisticRegressionCV and explain the intution behind what is being tuned.

Intuition: We're going to tune C, the regularization parameter, which should prevent overfitting so our model generalizes well to the test data.

In [23]:
from sklearn.linear_model import LogisticRegressionCV

grid_search_model = LogisticRegressionCV()

grid_search_model.fit(X_train, y_train)

y_pred = grid_search_model.predict(X_test)

print(classification_report(y_pred, y_test))

             precision    recall  f1-score   support

          B       1.00      0.98      0.99       113
          M       0.97      1.00      0.99        75

avg / total       0.99      0.99      0.99       188



What was the best C?

In [24]:
grid_search_model.C_

array([ 0.35938137])