# Activity 12: Cross Validation

**Please note that this assignment may optionally be completed in groups of 2 students.**

---
In this assignment, we'll develop models that predict malignancy in the [Wisconsin Breast Cancer Diagnosis Dataset][1] we explored last time.

As before, please note that there are a number of resources related to this dataset, including the following:
- [discussion and examples on kaggle][2]
- [Medium article similar to this assignment][3]

Goals are as follows:

- Continue to gain familiarity with Python and the Jupyter notebook format
- See how to prepare a simple, modeling-friendly dataset for model development
- See how logistic regression and multilayer perceptron models can be trained and evaluated in `sklearn`
- Begin to interpret and contextualize model performance

Each of our computational assignments will begin by importing a few required libraries using an `import` statement. These libraries extend the basic functionality of Python. By importing `as X` (e.g. `as np`), we can shorten subsequent calls to the library in our code.

- `numpy` for efficient math operations
- `pandas` for dataframes and dataframe operations
- `matplotlib` for visualization/plotting
- `sklearn` gives us a convenient way to load our dataset, as before, but this time we'll also use it to develop our models! The vast majority of "standard" machine learning models are implemented in this library. When working with specific neural network architectures, on the other hand -- this includes convolutional and recurrent neural networks -- we'll need to use a more customizable machine learning library like `tensorflow` or `pytorch`.

[1]: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
[2]: https://www.kaggle.com/shubamsumbria/breast-cancer-prediction
[3]: https://medium.com/analytics-vidhya/breast-cancer-diagnostic-dataset-eda-fa0de80f15bd

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Loading the data

As in assignment 1, we'll use `sklearn` to load the dataset. Typically you might load from `.csv` with `pd.read_csv()`, from `.xlsx` with `pd.read_excel()`, etc., but the result would be the same: you'd end up with a `pandas` dataframe. In this case, `sklearn` gives us a nice way to load this dataframe without having to find and download a `.csv` file on our own.

**Important**: if you encounter an error with `load_breast_cancer`, try upgrading `sklearn` by adding a code block with the following:
> `!pip install --upgrade scikit-learn`

In [2]:
from sklearn.datasets import load_breast_cancer
df, y_true = load_breast_cancer(return_X_y=True, as_frame=True)
y_true = 1 - y_true # let's set benign to 0 and malignant to 1, in keeping with usual conventions

It turns out predicting `y` from these data is a bit too easy. To illustrate a few important concepts, such as overfitting, we need to make the problem more difficult by adding some randomness to the labels.

We can do this by creating a function, `flip_some_labels`, that will flip a portion of the labels `y_true` at random, resulting in the (noisier) labels `y`. In a subsequent assignment, we'll return to the true labels `y_true` to see how well we can actually predict malignancy from these data.

In [3]:
def flip_some_labels(labels, flip_rate=.1, random_seed=0):
    return (labels + (np.random.RandomState(random_seed).rand(len(labels)) < flip_rate)) % 2

y = flip_some_labels(y_true)
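
To see what `flip_some_labels` is doing, the sketch below applies the same expression to a hypothetical all-zero label array (`toy_labels` is illustrative only, not part of the assignment) and checks that roughly `flip_rate` of the entries change.

```python
import numpy as np

def flip_some_labels(labels, flip_rate=.1, random_seed=0):
    # Draw one uniform number per label; entries below flip_rate get flipped (0 <-> 1)
    flip_mask = np.random.RandomState(random_seed).rand(len(labels)) < flip_rate
    return (labels + flip_mask) % 2

toy_labels = np.zeros(1000, dtype=int)   # hypothetical labels, all 0
noisy_labels = flip_some_labels(toy_labels)

# Fraction of labels that changed; this should land close to flip_rate
flipped_fraction = (noisy_labels != toy_labels).mean()
print(flipped_fraction)
```

Because the random seed is fixed, the same labels are flipped every time the function is called with the same arguments.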

We now have two objects: a dataframe `df` of predictors, and a single *series* (i.e. column) `y` of the associated outcomes. Since we explored this dataset last time, we can skip the descriptive statistics and plots. Still, let's do a quick `.head()` check just to make sure nothing has changed.

In [4]:
df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


We'll also check our outcomes with `.value_counts()`. Compared to the previous exercise, about 10% of the labels have been flipped.

In [5]:
y.value_counts()

0    340
1    229
Name: target, dtype: int64

Since we explored this dataset in our previous assignment, we can now take a few more simple steps to prepare it for model development. First, note that there are 569 patients in our dataset. We can determine this by adding the value counts for `y`, or with `len(df)`. To keep things simple, let's **use the first 400 samples (i.e. rows) to *train* our models, and the remaining 169 to *test* them**. We will *not* be defining a validation set in this exercise. Later on, we'll be using a validation set to select a best model or tune hyperparameters, but we don't need to worry about this just yet.

The final step prior to modeling will be to *standardize* our data by shifting it so that the mean is 0, then scaling it so that the standard deviation is 1. This step will make our model coefficients more interpretable and keep them all in the same range; the latter is particularly important for neural networks, and for models in which large coefficients are penalized.

All the features in this dataset are numeric, so (a) we won't have to worry about preparing categorical features for modeling, and (b) all of the features can (and should) be standardized.

We need to standardize both our training set and our test set. However, **the test set should be standardized using the mean and standard deviation from the *training* set**. It is useful to think of standardization as part of our model, and we don't want to use **any** information from the test set -- not even its mean or standard deviation -- in the model development process.
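
As a rough illustration of this idea, the sketch below standardizes a small, made-up dataframe (`toy_df` and its columns are hypothetical) using statistics computed from the training rows only. Note that `pandas` computes the sample standard deviation by default; other conventions are possible.

```python
import pandas as pd

# Hypothetical toy data: 6 rows, 2 numeric features
toy_df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0],
                       "b": [10.0, 20.0, 30.0, 40.0, 0.0, 5.0]})

toy_train = toy_df[:4]   # first 4 rows for training
toy_test = toy_df[4:]    # remaining rows for testing

# Statistics come from the training rows only
train_mean = toy_train.mean()
train_sd = toy_train.std()

toy_train_scaled = (toy_train - train_mean) / train_sd
toy_test_scaled = (toy_test - train_mean) / train_sd   # reuse the training statistics

print(toy_train_scaled.mean())   # approximately 0 for each column
print(toy_train_scaled.std())    # approximately 1 for each column
print(toy_test_scaled.mean())    # generally not 0, and that's expected
```

The standardized test set will usually not have a mean of exactly 0 or a standard deviation of exactly 1, since its own statistics were never used.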

## Exercise 1: Partition and Standardize

In the following block, you should:
1. divide the data and labels into a training set and test set. Note that `df[:N]` selects the first N rows, and `df[N:]` selects the remaining rows.
2. standardize both sets of data using the mean and standard deviation *from the training set*, with the same technique you used in last week's exercises.

In [19]:
### DIVIDE THE DATA INTO FOLDS ###

def create_fold_index(N_samples, N_folds, random_state=0):
    return np.random.RandomState(random_state).permutation(N_samples) * N_folds // N_samples

fold_idx = create_fold_index(len(df), 5)
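
As a quick sanity check on the fold assignment above, the following sketch repeats the same helper and counts how many samples land in each fold (`toy_fold_idx` and `fold_sizes` are illustrative names, not part of the assignment).

```python
import numpy as np

def create_fold_index(N_samples, N_folds, random_state=0):
    # Shuffle 0..N_samples-1, then map each value to a fold number 0..N_folds-1
    return np.random.RandomState(random_state).permutation(N_samples) * N_folds // N_samples

toy_fold_idx = create_fold_index(569, 5)   # 569 matches the number of patients

# Number of samples assigned to each fold
fold_sizes = np.bincount(toy_fold_idx, minlength=5)
print(fold_sizes)
```

Every sample receives exactly one fold number, and the fold sizes differ by at most one.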

In [None]:
for i in range(5):
    
    x_train = df[fold_idx != i]
    x_test = df[fold_idx == i]
    
    y_train = y[fold_idx != i]
    y_test = y[fold_idx == i]
    
    ### STANDARDIZE THE DATA ###
    
    
    ### TRAIN THE MODEL ON THE TRAINING SET ###
    
    
    ### EVALUATE PERFORMANCE ON THE TEST SET ###
    
    
    ### SAVE THE PERFORMANCE FIGURES ###
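
For orientation, here is a minimal sketch of how a loop like the one above might look once filled in. It runs on synthetic stand-in data from `make_classification` rather than on `df`, uses logistic regression with its default penalty, and stores one AUC per fold. All variable names are hypothetical, and this is only one of several reasonable ways to structure the loop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data (hypothetical; not the breast cancer dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

n_folds = 5
toy_fold_idx = np.random.RandomState(0).permutation(len(X_toy)) * n_folds // len(X_toy)

fold_aucs = []
for i in range(n_folds):
    X_tr, X_te = X_toy[toy_fold_idx != i], X_toy[toy_fold_idx == i]
    y_tr, y_te = y_toy[toy_fold_idx != i], y_toy[toy_fold_idx == i]

    # Standardize with statistics from this fold's training rows only
    mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
    X_tr, X_te = (X_tr - mu) / sigma, (X_te - mu) / sigma

    # Train on the training rows, then score the held-out fold
    model = LogisticRegression(random_state=0, max_iter=10000)  # default penalty
    model.fit(X_tr, y_tr)
    fold_aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# Summarize performance across the folds
print(np.mean(fold_aucs), np.std(fold_aucs))
```

Reporting both the mean and the spread across folds gives a sense of how much the estimate depends on which samples were held out.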

## Training a First Model

We're finally ready to train and evaluate our first model: logistic regression. There are only a few lines of code in the block below, but each one is important.
- In the first line, we create a `LogisticRegression()` model object. This is our model; we can train it, then use it to make predictions. All information about the model and its parameters is stored within the object. See [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for further details.
- We'll pass a `random_state` parameter when initializing *all* of our models to ensure that we'll get a consistent result in cases where there would otherwise be randomness in training and/or the initialization of parameters. We'll also specify that model parameters should not be penalized in any way by passing `penalty='none'`, and ensure the model has sufficient time to finish training by passing `max_iter=10000`. Note that `scikit-learn` 1.2 and later deprecate the string `'none'` in favor of `penalty=None`, so you may need to adjust this argument depending on your installed version.
- In the second line, we will use the `.fit()` method on our model object to fit the model to our training set. In other words, this very short line does all the work of actually training the model. For more complex models, this line may take some time to run.
- In the third line, we predict the *probability* that `y` is 1 for each of the samples in our test set. By default, `sklearn` returns two columns corresponding to the predicted probability that y is 0 and 1, respectively. We only need the latter, so we'll select this column with `[:, 1]`.
- In the fourth line, we predict the *value* of `y` based on this probability. Specifically, we'll predict that `y` is 1 whenever the predicted probability is greater than 0.5, otherwise we'll predict that `y` is 0. Using 0.5 as our threshold is not always the best idea, but it'll work for now.

We'll repeat these steps, with minor variations, each time we train a model.
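
The four steps described above can be sketched on a small synthetic dataset as follows. The data and all `toy_` names are hypothetical, and the model here keeps `sklearn`'s default penalty rather than the unpenalized setting used in the exercise.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, with labels that mostly increase with it
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_toy_train, X_toy_test = X_toy[:150], X_toy[150:]
y_toy_train, y_toy_test = y_toy[:150], y_toy[150:]

# First line: create the model object (default penalty, unlike the exercise)
toy_model = LogisticRegression(random_state=0, max_iter=10000)

# Second line: fit the model to the training set
toy_model.fit(X_toy_train, y_toy_train)

# Third line: predicted probability that y is 1 (second column of predict_proba)
toy_pred_proba = toy_model.predict_proba(X_toy_test)[:, 1]

# Fourth line: predicted label, using 0.5 as the threshold
toy_pred_label = (toy_pred_proba > .5).astype(int)

print(toy_pred_proba[:5])
print(toy_pred_label[:5])
```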

## Exercise 2: Fit and Predict
Modify the following block to train your model on the training set you created, then make predictions on the test set. The changes you need to make here are minor, but understanding what is happening in these lines is important, and this syntax will be repeated in subsequent exercises. Please consult the explanations above as you work through this block.

In [7]:
from sklearn.linear_model import LogisticRegression

### FIRST LINE: No changes needed
lr_model = LogisticRegression(random_state=0, penalty='none', max_iter=10000)

### SECOND LINE: change X_train and y_train to the variable names you've created, then uncomment and run
# lr_model.fit(X_train, y_train)

### THIRD LINE: change X_test to the variable name you used, then uncomment and run
# y_test_pred_proba = lr_model.predict_proba(X_test)[:, 1]

### FOURTH LINE: No changes needed; simply uncomment
# y_test_pred_label = (y_test_pred_proba > .5).astype(int)

## Evaluating Performance

We can now evaluate our model by comparing its predictions to the true labels in the test set. Here we'll focus on two measures of performance:
- Accuracy, which has some important limitations but is convenient and easy to calculate
- Area under the receiver operating characteristic curve (AUC or AUROC)

We'll be learning more about the AUC in upcoming lectures and exercises. For now, what's important is that:
1. This is a common and useful performance metric.
2. It's based on comparing model-predicted probabilities (`y_test_pred_proba`) to the test-set labels (`y_test`).
3. We can calculate it using the `roc_auc_score` function from `sklearn.metrics`.
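
To make the two metrics concrete, here is a tiny worked example with four made-up samples (the arrays are hypothetical). Accuracy compares thresholded labels to the true labels, while the AUC uses the probabilities directly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

toy_labels = np.array([0, 0, 1, 1])
toy_proba = np.array([0.1, 0.4, 0.35, 0.8])

# Threshold at 0.5 to get predicted labels: [0, 0, 0, 1]
toy_pred = (toy_proba > .5).astype(int)

# Accuracy: 3 of the 4 predicted labels match the true labels
toy_accuracy = (toy_pred == toy_labels).mean()

# AUC: 3 of the 4 (positive, negative) pairs are ranked correctly
toy_auc = roc_auc_score(toy_labels, toy_proba)

print(toy_accuracy, toy_auc)   # 0.75 0.75
```

The two numbers happen to coincide here, but in general they measure different things: accuracy depends on the chosen threshold, while the AUC depends only on how the predicted probabilities rank the samples.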

## Exercise 3: Accuracy and AUC
In the following block, you should:
- calculate the accuracy by comparing predicted labels (`y_test_pred_label`) to the test-set labels (`y_test`), counting how many times they match, and dividing by the total number of labels
- calculate the AUC by applying the `roc_auc_score` function to `y_test` and `y_test_pred_proba`. The documentation for `roc_auc_score` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html) if needed

In [8]:
from sklearn.metrics import roc_auc_score

### CALCULATE THE ACCURACY ON THE TEST SET AND PRINT THE RESULT ###


### CALCULATE THE AUC ON THE TEST SET AND PRINT THE RESULT ###
# roc_auc_score(y_test, y_test_pred_proba)

## Exercise 4: MLP

In the following block, you will train a new model: a multilayer perceptron with a single, wide hidden layer.
- The model itself has been defined in the first two lines. We're using the `hidden_layer_sizes` parameter to tell `MLPClassifier` that we want an MLP with a single hidden layer of size 1000. If we wanted two hidden layers each of size 100, we'd pass `hidden_layer_sizes=(100, 100)`.
- As before, we use the `random_state` parameter to ensure we get a consistent result even if we run the block twice, for instance. Passing `max_iter=10000` ensures the model has enough time to finish training.
- Adapt the code from exercise 2 (above) to train the new model, then make predictions (both probability and label) on the *test* set.
- Adapt your code from exercise 3 (above) to evaluate the accuracy and AUC of these predictions
- In one sentence, state whether the MLP performed better than logistic regression, then list at least one characteristic of the model or data that may partly explain why this is the case.
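
As a rough sketch of the same workflow with an MLP, the block below fits a smaller network (one hidden layer of size 100, chosen here only to keep the example fast) to synthetic data with some label noise, then prints accuracy and AUC on both the training and test portions. The data and all `toy_` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data with some label noise (hypothetical; not the assignment's data)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, flip_y=0.1, random_state=0)
X_toy_train, X_toy_test = X_toy[:200], X_toy[200:]
y_toy_train, y_toy_test = y_toy[:200], y_toy[200:]

# One hidden layer of size 100; the exercise uses a much wider layer
toy_mlp = MLPClassifier(hidden_layer_sizes=(100,), random_state=0, max_iter=10000)
toy_mlp.fit(X_toy_train, y_toy_train)

toy_results = {}
for name, X_part, y_part in [("train", X_toy_train, y_toy_train),
                             ("test", X_toy_test, y_toy_test)]:
    proba = toy_mlp.predict_proba(X_part)[:, 1]   # probability that y is 1
    label = (proba > .5).astype(int)              # threshold at 0.5
    toy_results[name] = ((label == y_part).mean(), roc_auc_score(y_part, proba))
    print(name, "accuracy:", toy_results[name][0], "AUC:", toy_results[name][1])
```

Comparing the training and test figures side by side is one simple way to spot overfitting: a flexible model can fit noisy training labels closely without that performance carrying over to unseen data.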

In [9]:
from sklearn.neural_network import MLPClassifier

mlp_model = MLPClassifier(hidden_layer_sizes=(1000,), random_state=0, max_iter=10000)


### TRAIN THE MODEL ON THE TRAINING SET, THEN MAKE PREDICTIONS ON THE TEST SET ###



### CALCULATE ACCURACY AND AUC ON THE TEST SET, THEN PRINT THE RESULT ###



## Once you've completed these exercises, please turn in the assignment as follows:

If you're using Anaconda on your local machine:
1. download your notebook as html (see `File > Download as > HTML (.html)`)
2. .zip the file (i.e. place it in a .zip archive)
3. submit the .zip file in Talent LMS

If you're using Google Colab:
1. download your notebook as .ipynb (see `File > Download > Download .ipynb`)
2. if you have nbconvert installed, convert it to .html; if not, leave it as .ipynb
3. .zip the file (i.e. place it in a .zip archive)
4. submit the .zip file in Talent LMS