# Instructor Task
## Dataset
- The dataset: [breast-cancer.csv](https://s3-us-west-2.amazonaws.com/ga-dat-2015-suneel/datasets/breast-cancer.csv).
- A description of the data: [wdbc.names](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names). Ignore column 0, as it is merely the ID of a patient record.

In [2]:
import pandas as pd

## 1. Read in the data

In [135]:
url = 'https://s3-us-west-2.amazonaws.com/ga-dat-2015-suneel/datasets/breast-cancer.csv'
# Fetch the CSV from the URL and read it into a DataFrame.
# The file has no header row; column 0 (the patient ID) is skipped.
dta = pd.read_csv(url, sep=',', usecols=range(1, 32), header=None)

In [136]:
dta.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,22,23,24,25,26,27,28,29,30,31
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## 2. Separate the data into feature and target.

In [137]:
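# Columns 2-31 hold the 30 numeric features; column 1 is the diagnosis (M or B).
# DataFrame.as_matrix has been removed from pandas, so select by label and take .values.
X = dta.loc[:, 2:31].values
y = dta[1].values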

## 3. Create and evaluate using cross_val_score and 5 folds.
- What is the mean accuracy?
- What is the standard deviation of accuracy?

In [138]:
from sklearn.linear_model import LogisticRegression
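from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
model = LogisticRegression()
model = model.fit(X, y)
# Accuracy on the training data itself, so this is an optimistic estimate
print(model.score(X, y))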


0.959578207381


In [139]:
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=5)
print (scores)

[ 0.93043478  0.93913043  0.97345133  0.94690265  0.96460177]


In [140]:
print (scores.mean())
print (scores.std())

0.950904193921
0.0159350389781
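
The fold scores above depend on how the rows happen to be split. As a minimal sketch of a reproducible variant, the cell below uses shuffled, stratified folds with a fixed seed. It assumes scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) as a stand-in for the CSV, and the names `X_demo`, `y_demo`, and `cv_scores` are illustrative only; the exact numbers will differ from the output above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: scikit-learn's bundled WDBC copy stands in for the CSV used above.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Shuffled, stratified folds with a fixed seed make the scores repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# max_iter is raised because the unscaled features slow down the default solver.
cv_scores = cross_val_score(LogisticRegression(max_iter=5000), X_demo, y_demo,
                            scoring="accuracy", cv=cv)

print("mean accuracy: %.3f" % cv_scores.mean())
print("std of accuracy: %.3f" % cv_scores.std())
```
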


## 4. Get a classification report to identify type 1, type 2 errors.
- Use train_test_split to run your model once, with a test size of 0.33
- Make predictions on the test set
- Compare the predictions to the answers to determine the classification report

In [142]:
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation

Logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
Logreg.fit(X_train, y_train)
prediction = Logreg.predict(X_test)
print ('\n*classification report: \n', classification_report(y_test, prediction))



*classification report: 
              precision    recall  f1-score   support

          B       0.94      0.98      0.96       120
          M       0.97      0.88      0.92        68

avg / total       0.95      0.95      0.95       188
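
The report gives precision and recall, but not the raw error counts. One way to read off type 1 errors (false positives) and type 2 errors (false negatives) is a confusion matrix. The sketch below treats malignant (M) as the positive class and, as an assumption, uses scikit-learn's bundled copy of the data with its 0/1 targets mapped back to M/B. The names `X_tr`, `X_te`, `y_tr`, `y_te`, and `pred` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_demo = data.data
# scikit-learn encodes malignant as 0 and benign as 1; map back to the CSV's M/B labels.
y_demo = np.where(data.target == 0, "M", "B")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0, stratify=y_demo)
pred = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict(X_te)

# labels=["B", "M"] fixes the layout: rows are true classes, columns are predictions.
tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=["B", "M"]).ravel()
print("type 1 errors (benign predicted as malignant):", fp)
print("type 2 errors (malignant predicted as benign):", fn)
```

Recall for M in the classification report corresponds to `tp / (tp + fn)`, so a low recall for M signals many type 2 errors.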



## 5. Scale the data and see if that improves the score.

In [88]:
from sklearn import preprocessing

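# preprocessing.scale returns the standardized array (zero mean, unit variance
# per column), not a scaler object, despite the variable name.
scaler = preprocessing.scale(X)
scores = cross_val_score(LogisticRegression(), scaler, y, cv=5)
scores.mean()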

0.97891496729511351
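
Scaling the full array before cross-validation lets each test fold influence the means and variances used for scaling. A hedged alternative is to put the scaler inside a pipeline so it is refit on the training portion of every fold. The sketch below assumes scikit-learn's bundled copy of the data; `pipe` and `pipe_scores` are illustrative names, and the resulting score may differ slightly from the one above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled WDBC copy stands in for the CSV used above.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# The scaler is fit only on the training part of each fold, then applied to the test part.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(pipe_scores.mean())
```
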

## 6. Tune the model using automated parameter grid search via LogisticRegressionCV. Explain your intuition behind what is being tuned.

### Q: What should we do to prevent overfitting so our model generalizes well to the test data?

In [91]:
from sklearn.linear_model import LogisticRegressionCV

model1 = LogisticRegressionCV(Cs=30,cv=5)
model1.fit(scaler, y)

LogisticRegressionCV(Cs=30, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)
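
The parameter being tuned is `C`, the inverse of the regularization strength: a smaller `C` penalizes large coefficients more heavily, which is the usual lever against overfitting. As a rough illustration, the sketch below fits the same model at three values of `C` and prints the size of the coefficient vector. It assumes scikit-learn's bundled copy of the data, and `X_std`, `clf_c`, and `coef_norms` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled WDBC copy stands in for the CSV used above.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X_demo)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # Default L2 penalty; a larger C means weaker regularization.
    clf_c = LogisticRegression(C=C, max_iter=10000).fit(X_std, y_demo)
    coef_norms[C] = float(np.linalg.norm(clf_c.coef_))
    print("C=%g  ||w||=%.2f" % (C, coef_norms[C]))
```

The coefficient norm grows as `C` grows, which is why the search looks for a `C` that is large enough to fit the signal but small enough to keep the weights in check.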

In [94]:
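import numpy as np
from sklearn.model_selection import GridSearchCV  # formerly sklearn.grid_search
from sklearn.svm import SVC
svc = SVC(C=30, kernel='linear')
# Quick hold-out check: train on all but the last 100 rows, score on those 100
svc.fit(X[:-100], y[:-100]).score(X[-100:], y[-100:])
# Search 10 values of C spaced logarithmically between 1e-6 and 1e-1
Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), n_jobs=-1)
# The dataset has only 569 rows, so X[:1000] selects all of them
clf.fit(X[:1000], y[:1000])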

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=30, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': array([  1.00000e-06,   3.59381e-06,   1.29155e-05,   4.64159e-05,
         1.66810e-04,   5.99484e-04,   2.15443e-03,   7.74264e-03,
         2.78256e-02,   1.00000e-01])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

### Q: What was the best C?

In [92]:
model1.C_[0]

0.72789538439831458

In [96]:
clf.best_estimator_.C  

0.02782559402207126

## 7. Create Two Plots that describe the data and discuss your results

In [27]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

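# Synthetic two-feature data (not the breast cancer set), so the surface can be drawn in 2D
X, y = make_classification(n_samples=569, n_features=2, n_informative=2,
                           n_redundant=0, weights=[.5, .5], random_state=15)
clf = LogisticRegression().fit(X[:100], y[:100])  # fit on the first 100 points only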


In [28]:
xx, yy = np.mgrid[-5:5:.01, -5:5:.01]
grid = np.c_[xx.ravel(), yy.ravel()]
probs = clf.predict_proba(grid)[:, 1].reshape(xx.shape)

In [34]:
f, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(xx, yy, probs, 25, cmap="RdBu",
                      vmin=0, vmax=1)
ax_c = f.colorbar(contour)
ax_c.set_label("$P(y = 1)$")
ax_c.set_ticks([0, .25, .5, .75, 1])

ax.scatter(X[100:,0], X[100:, 1], c=y[100:], s=50,
           cmap="RdBu", vmin=-.2, vmax=1.2,
           edgecolor="white", linewidth=1)

ax.set(aspect="equal",
       xlim=(-5, 5), ylim=(-5, 5),
       xlabel="$X_1$", ylabel="$X_2$")
plt.show()

In [125]:
from sklearn import linear_model, datasets
iris = datasets.load_iris()
X = iris.data[:, 2:4]  # petal length and petal width (iris has only 4 feature columns)
Y = iris.target

h = .05  # step size in the mesh

logreg = linear_model.LogisticRegression(C=1e5)

# Fit the logistic regression classifier created above to the data.
logreg.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Petal length (cm)')
plt.ylabel('Petal width (cm)')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

plt.show()

## 8. Provide a one-sentence summary for a non-technical audience. Then provide a longer paragraph-length technical explanation.