Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Crossvalidation and Nested crossvalidation: Problem solving

In this session, you will apply nested crossvalidation to logistic lasso regression, to find the optimal regularization parameter (or penalty term) for predicting breast cancer.

The [data](https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset) consists of the following variables, each reported as a mean, a standard error, and a "worst" value (the mean of the three largest values), computed from a digitized image of a biopsy.

| Variable | Type | Description |
|:-------|:-------|:-------|
|radius | Ratio | mean of distances from center to points on the perimeter|
|texture | Ratio | standard deviation of gray-scale values|
|perimeter | Ratio | perimeter of cancer|
|area | Ratio | area of cancer|
|smoothness | Ratio | local variation in radius lengths|
|compactness | Ratio |  perimeter^2 / area - 1.0|
|concavity | Ratio |  severity of concave portions of the contour|
|concave points | Ratio |  number of concave portions of the contour|
|symmetry | Ratio | symmetry of cancer|
|fractal dimension | Ratio | "coastline approximation" - 1|

<div style="text-align:center;font-size: smaller">
    <b>Source:</b> This dataset was taken from the <a href="https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)">UCI Machine Learning Repository</a>
</div>
<br>


In addition to these predictors is the class label:

| Variable | Type | Description |
|:-------|:-------|:-------|
| Target | Nominal (binary) | malignant (1) or benign (0) |


The goal is to predict `Target`, i.e. the presence of breast cancer.

### Load data

Import `pandas` so we can load a dataframe.

Load the dataframe with `datasets/cancer.csv`.
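
As a minimal sketch of the loading step, the block below reads CSV text with `pd.read_csv`. So that it runs anywhere, it reads a hypothetical three-row stand-in from memory; the column names are illustrative assumptions and may not match the real file. In the notebook you would pass the path `datasets/cancer.csv` instead.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for datasets/cancer.csv; the column names
# are illustrative and may not match the real file.
csv_text = """mean radius,mean texture,Target
17.99,10.38,1
,17.77,1
13.54,14.36,0
"""

# In the notebook: dataframe = pd.read_csv("datasets/cancer.csv")
dataframe = pd.read_csv(io.StringIO(csv_text))
print(dataframe)
```

The empty field in the second row is read as NaN, which is the kind of missing value to look for in the real data.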

### Explore data

Since there are clearly some NaN values, call `dropna` and store the result back into your dataframe.

Check that the data makes sense using the five-number summary.
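
The two steps above might look like the following sketch. It uses a hypothetical four-row dataframe (the column names and values are assumptions for illustration) so that the effect of `dropna` and `describe` is easy to see.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataframe with one missing value
dataframe = pd.DataFrame({
    "mean radius": [17.99, np.nan, 13.54, 12.45],
    "mean texture": [10.38, 17.77, 14.36, 15.70],
    "Target": [1, 1, 0, 0],
})

# Drop every row that contains a NaN and store the result back
dataframe = dataframe.dropna()

# describe() reports count, mean, std, min, quartiles, and max per column
summary = dataframe.describe()
print(summary)
```

Note that `describe` reports the mean and standard deviation alongside the min, quartiles, and max, so it covers everything the question below asks about.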

-----------
**QUESTION:**

Do the min, mean, and max look reasonable to you?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

What percentage of the data has `Target=1` and `Target=0`?

**ANSWER: (click here to edit)**


<hr>

To look at the correlations between variables, create a correlation heatmap.

First import `plotly.express`.

And create a correlation matrix.

Show a correlation heatmap with row/column labels.

-----------
**QUESTION:**

What can you say about the correlations amongst the variables?

**ANSWER: (click here to edit)**


<hr>


**QUESTION:**

Are there any more plots you'd want to do at this point? Why or why not?

**ANSWER: (click here to edit)**


<hr>

### Prepare train/test sets

Separate our predictors (`X`) from our class label (`Y`), putting each into its own dataframe.
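
A minimal sketch of the split, assuming the label column is named `Target` as in the table above; the predictor columns and values here are hypothetical.

```python
import pandas as pd

# Hypothetical miniature dataframe; only the Target column name comes from the text
dataframe = pd.DataFrame({
    "mean radius": [17.99, 13.54, 12.45],
    "mean texture": [10.38, 14.36, 15.70],
    "Target": [1, 0, 0],
})

# Everything except the label is a predictor
X = dataframe.drop(columns=["Target"])

# Double brackets keep Y as a one-column dataframe rather than a Series
Y = dataframe[["Target"]]
```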

### Train model with nested crossvalidation

Import libraries for:

- Logistic regression (`sklearn.linear_model`)
- Crossvalidation
- `ravel`
- Scale (lasso regression is very sensitive to the scale of the predictors, so we standardize them)
- Pipeline (to combine scaling and modeling)
- Metrics (for evaluation)
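
The imports listed above could be written as follows. Which crossvalidation helpers to import is an assumption based on the later steps (`GridSearchCV` and `cross_val_predict`). The last lines show what `ravel` does to a hypothetical one-column label dataframe.

```python
import pandas as pd
from numpy import ravel
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ravel flattens a one-column dataframe into the 1-D array scikit-learn expects
Y = pd.DataFrame({"Target": [1, 0, 1]})  # hypothetical labels
flat = ravel(Y)
print(flat.shape)
```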

Create a pipeline to scale and train in one step:
- Call stage 1 `"scale"` and use `StandardScaler`
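- Call stage 2 `"lasso"` and use `LogisticRegression` with `penalty="l1"` (lasso is the L1 penalty; the default `lbfgs` solver does not support it, so also set a solver that does, such as `solver="liblinear"`)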
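
The pipeline might be sketched as below. Lasso corresponds to the L1 penalty, so this sketch uses `penalty="l1"`. The choice of `solver="liblinear"` is an assumption; it is one of the scikit-learn solvers that supports L1.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stage names matter: "lasso" becomes the prefix for grid search parameters
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", LogisticRegression(penalty="l1", solver="liblinear")),
])
```

Because scaling happens inside the pipeline, the scaler is refit on each training fold, so no information from the held-out fold leaks into training.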


Create a grid search with:
- `'lasso__C': [.25, .50, .75, 1.0]`
- `cv=10`
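
A sketch of the grid search under the same assumptions as before (an L1-penalized `LogisticRegression` with the `liblinear` solver in a stage named `"lasso"`). The `lasso__C` key reads as "the `C` parameter of the `lasso` stage".

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", LogisticRegression(penalty="l1", solver="liblinear")),
])

# Candidate values for C, the inverse of the regularization strength
param_grid = {"lasso__C": [.25, .50, .75, 1.0]}

# Inner loop: 10-fold crossvalidation over each candidate value
grid = GridSearchCV(pipeline, param_grid, cv=10)
```

Nothing is trained yet at this point; the grid search only runs when it is fit, either directly or inside `cross_val_predict`.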

Create predictions using `cross_val_predict` with the grid search, the data, and `cv=10`.
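
The nested procedure could look like the following sketch. As a stand-in for `datasets/cancer.csv`, it loads the copy of the dataset bundled with scikit-learn, whose label coding may differ from the course file; the solver choice is also an assumption.

```python
import pandas as pd
from numpy import ravel
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled copy of the breast cancer dataset
X, labels = load_breast_cancer(return_X_y=True)
Y = pd.DataFrame({"Target": labels})

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", LogisticRegression(penalty="l1", solver="liblinear")),
])
grid = GridSearchCV(pipeline, {"lasso__C": [.25, .50, .75, 1.0]}, cv=10)

# Outer loop: for each of 10 folds, the whole grid search (inner loop) is run
# on the other 9 folds, and the winning model predicts the held-out fold
predictions = cross_val_predict(grid, X, ravel(Y), cv=10)
```

Every row receives exactly one prediction, made by a model that never saw that row during tuning or training.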

### Evaluate the model

Get the accuracy by comparing the predictions to *all* of `Y`.

Similarly we can get the recall and precision using all of `Y`.
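
To illustrate the three metrics, the sketch below uses six hypothetical labels and predictions rather than the real ones, so the numbers can be checked by hand. With the notebook's data you would pass `Y` and your predictions instead.

```python
from sklearn import metrics

# Hypothetical true labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# 4 of 6 predictions match
accuracy = metrics.accuracy_score(y_true, y_pred)

# 3 true positives, 1 false negative: 3 / (3 + 1)
recall = metrics.recall_score(y_true, y_pred)

# 3 true positives, 1 false positive: 3 / (3 + 1)
precision = metrics.precision_score(y_true, y_pred)

print(accuracy, recall, precision)
```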

### Hyperparameter

Use `fit` on the grid search to learn the best overall hyperparameter value for `C`.

Display the best hyperparameter value for `C`.
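
A sketch of this last step, again using scikit-learn's bundled copy of the dataset as a stand-in for the course file and assuming the `liblinear` solver. Fitting the grid search on all of the data runs the inner crossvalidation once and records the winning value.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled copy of the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", LogisticRegression(penalty="l1", solver="liblinear")),
])
grid = GridSearchCV(pipeline, {"lasso__C": [.25, .50, .75, 1.0]}, cv=10)

# Fit on all the data to pick one overall value of C
grid.fit(X, y)

# Dictionary holding the best value found for each tuned parameter
print(grid.best_params_)
```

The value found on the stand-in data need not match what you get with the course file.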

-----------
**QUESTION:**

In the previous lasso exercise, we used `C=.75`, but our accuracy was 2% worse.
Why do you think our results are better now?

**ANSWER: (click here to edit)**


<hr>