# MMCi Computational Assignment 1, Part 1:
*Visualizing features of breast cancer samples*

**Please note that this assignment may be completed in groups.**

---
In this assignment, we'll be loading, describing, and visualizing the [Wisconsin Breast Cancer Diagnosis Dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)).

This well-known dataset has been explored and discussed extensively:
- [Discussion and examples on Kaggle](https://www.kaggle.com/shubamsumbria/breast-cancer-prediction)
- [Medium article similar to this assignment](https://medium.com/analytics-vidhya/breast-cancer-diagnostic-dataset-eda-fa0de80f15bd)

The goals of this assignment are as follows:

- Become more comfortable with the Jupyter notebook format
- 

We'll begin by importing a few required libraries using an `import` statement. Each of them extends the basic functionality of Python. By importing `as X` (e.g. `as np`), we can shorten subsequent calls to the library in our code.

- `numpy` for efficient math operations
- `pandas` for working with tabular data as dataframes
- `matplotlib` for visualization/plotting
- `sklearn` will give us a convenient way to load our dataset. In later assignments, we'll also be using this library to define models, train them, and evaluate their performance.

**Note** that there are many libraries for visualization and plotting in Python. For this assignment, we'll go with the one that's *simplest* to use -- `matplotlib` -- rather than the one that gives us the *prettiest* figures (e.g. `seaborn`, `plotly`).

In [59]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We can now use `sklearn` to load the dataset we'll be working with. Typically you might load from `.csv` with `pd.read_csv()`, from `.xlsx` with `pd.read_excel()`, and so on, but the result would be the same: you'd end up with a `pandas` dataframe. In this case, `sklearn` gives us a nice way to load this dataframe without having to find and download a `.csv` file on our own.

In [60]:
from sklearn.datasets import load_breast_cancer

df, y = load_breast_cancer(return_X_y=True, as_frame=True)
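To illustrate the point that loading from a `.csv` ends in the same kind of object, here is a minimal sketch that writes the dataframe to an in-memory CSV and reads it back with `pd.read_csv()`. The names `buffer` and `df_from_csv` are made up for this illustration, and the in-memory buffer is only a stand-in for a real file on disk.

```python
import io

import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the predictors as a dataframe, as in the cell above.
df, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Write the dataframe to an in-memory CSV "file" (a stand-in for a .csv on disk).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Reading the CSV back gives a pandas dataframe with the same shape and columns.
df_from_csv = pd.read_csv(buffer)
print(type(df_from_csv).__name__, df_from_csv.shape)
```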

We now have two objects: a dataframe `df` of predictors, and a single *series* (i.e. column) `y` of the associated outcomes. In this dataset, the possible outcomes are (0) malignant and (1) benign. Let's use the `.head()` method of our `pandas` dataframe `df` to take a look at its first few rows:

In [66]:
df.head(5)

|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


Since the dataframe has 30 features in total, `pandas` abbreviates the display with `...`, so not all of the columns are visible here.
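If you want to see the columns that the preview hides, one option is sketched below. It lists every column name and then temporarily lifts the `pandas` column display limit. The variable `column_names` is a made-up name for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

df, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Collect every column name; 30 names are short enough to read in full.
column_names = df.columns.tolist()
print(len(column_names))
print(column_names)

# Temporarily remove the column display limit so .head() renders all columns.
with pd.option_context("display.max_columns", None):
    print(df.head(5))
```

The `with` block restores the usual display settings as soon as it ends, so later cells are unaffected.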

In [63]:
y.value_counts()

1    357
0    212
Name: target, dtype: int64
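The integer codes in these counts are easier to read with class names attached. The sketch below looks up the names that `sklearn` ships with the dataset (its `target_names` attribute) and repeats the count with labels. The variables `data`, `target_names`, and `named_counts` are made-up names for this illustration.

```python
from sklearn.datasets import load_breast_cancer

# Without return_X_y, the loader returns a bundle that also carries metadata.
data = load_breast_cancer(as_frame=True)
y = data.target

# target_names[i] is the label for the integer code i.
target_names = [str(name) for name in data.target_names]
print(target_names)

# Replace the integer codes with their names before counting.
named_counts = y.map(dict(enumerate(target_names))).value_counts()
print(named_counts)
```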

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

In [35]:
df.shape

(569, 30)

In [7]:
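# Use the first 450 rows for training and hold out the remaining 119 for testing.
# Note that the rows are split in their original order, without shuffling.
X_train = df.values[:450]
X_test = df.values[450:]

y_train = y[:450]
y_test = y[450:]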
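The slices above keep the rows in their original order. As an alternative sketch, not the split used for the results in this notebook, `train_test_split` from `sklearn` can shuffle the rows and keep the class proportions similar in both parts. The names `X_tr`, `X_te`, `y_tr`, `y_te` and the choice `random_state=0` are assumptions made for this illustration, chosen so the notebook's own `X_train`/`X_test` are not overwritten.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

df, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Hold out 119 rows (the same test size as slicing at row 450), but shuffle
# first and keep the class proportions similar in both parts via stratify=y.
X_tr, X_te, y_tr, y_te = train_test_split(
    df.values, y, test_size=119, stratify=y, random_state=0
)

print(X_tr.shape, X_te.shape)
# Fraction of class 1 in each part; stratification keeps these close.
print(y_tr.mean(), y_te.mean())
```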

In [39]:
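# Fit a logistic regression; max_iter is raised from the default of 100 to give the solver more iterations.
lr = LogisticRegression(max_iter=10000).fit(X_train, y_train)
# AUC uses the predicted probability of class 1; accuracy uses the predicted labels.
auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])
acc = accuracy_score(y_test, lr.predict(X_test))
print('AUC = %.3f, Accuracy = %.3f' % (auc, acc))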

AUC = 0.993, Accuracy = 0.933
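The cell above passes probabilities to `roc_auc_score` but hard labels to `accuracy_score`. The toy sketch below shows why the two metrics can disagree. The labels and probabilities are invented for illustration and do not come from the dataset, and the 0.5 cut-off is an assumption made for this example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.6, 0.7, 0.9])

# AUC only asks whether positives are ranked above negatives.
# Here every positive scores higher than every negative.
toy_auc = roc_auc_score(y_true, proba)

# Accuracy needs hard labels, so the probabilities are cut at 0.5.
# The second sample (a true 0 with probability 0.6) is misclassified.
labels = (proba >= 0.5).astype(int)
toy_acc = accuracy_score(y_true, labels)

print('AUC = %.3f, Accuracy = %.3f' % (toy_auc, toy_acc))
```

In this toy case the ranking is perfect while one of the four thresholded labels is wrong, which is one way a model can have a higher AUC than accuracy.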


In [57]:
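# Fit a neural network with one hidden layer of 1000 units. early_stopping=True holds out
# part of the training data for validation; no random_state is set, so results vary between runs.
mlp = MLPClassifier(hidden_layer_sizes=(1000, ), early_stopping=True, max_iter=10000).fit(X_train, y_train)
auc = roc_auc_score(y_test, mlp.predict_proba(X_test)[:, 1])
acc = accuracy_score(y_test, mlp.predict(X_test))
print('AUC = %.3f, Accuracy = %.3f' % (auc, acc))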

AUC = 0.944, Accuracy = 0.899
