In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab5.ipynb")

---

<h1><center>SDSE Lab 5 <br><br> Logistic regression and Performance metrics </center></h1>

---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", message="Precision is ill-defined and being set to 0.0")

import sklearn.datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.model_selection import cross_validate

rng_seed = 454

Description of the lab assignment...

# Load the data

In [None]:
data = sklearn.datasets.load_breast_cancer()
print(data.DESCR)

# Put the data into a pandas DataFrame

In [None]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data['target']
N = len(df['target'])
del data
df.head()

## Flip the target value

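The scikit-learn dataset encodes a benign tumor as a 1 and a malignant tumor as a 0. This clashes with the usual terminology, in which the "positive" class is the condition being detected, so let's flip it so that 1 means malignant.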

In [None]:
df['target'] = 1-df['target']

# Data exploration

In [None]:
df.info()

The pandas DataFrame provides the `corr()` method, which computes the correlation matrix. Good predictor variables are characterized by a large correlation with the output but a small correlation with the other predictors.

In [None]:
df.corr()

Focus on the correlations with the target variable. Sort them from largest to smallest in absolute value. The ones at the top of the list are good candidates to include in our model. But maybe not, if for example they are highly correlated with each other. We will use Lasso regularization for feature selection.

In [None]:
df.corr()['target'].abs().sort_values(ascending=False)
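To illustrate why a high correlation with the target is not sufficient on its own, here is a minimal sketch on synthetic data. The DataFrame `toy` and its columns `a`, `a_copy`, and `b` are hypothetical and are not part of the lab dataset. The column `a_copy` ranks highly against the target, yet it is almost perfectly correlated with `a`, so it adds little new information.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical columns, not the lab's dataset).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.05 * rng.normal(size=200),  # nearly duplicates "a"
    "b": rng.normal(size=200),                  # independent of "a"
})
toy["target"] = (toy["a"] + 0.5 * toy["b"] > 0).astype(int)

# Rank the predictors by absolute correlation with the target.
ranked = toy.corr()["target"].drop("target").abs().sort_values(ascending=False)

# Absolute correlations among the predictors themselves.
pred_corr = toy.drop(columns="target").corr().abs()

print(ranked)
print(pred_corr.round(2))
```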

# Training and testing data

The first step is to split the dataset into training and test sets. We will reserve 20% of the data for testing. The code below uses scikit-learn's `train_test_split` function to generate the training and testing datasets.

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(df.iloc[:,:-1],
                                                df['target'], 
                                                test_size=0.2, 
                                                random_state=rng_seed )

len(Xtrain), len(Xtest)

# Train a logistic regression model

Our first model will be based on the most highly correlated feature only: `worst concave points`. 
The following code creates a logistic regression object. To compute the coefficients of the model, we pass the training data to the `fit` method.

In [None]:
Xtrain1 = Xtrain[['worst concave points']]

model = LogisticRegression()
model.fit(Xtrain1,ytrain) 

# 1. Compute the confusion matrix

Our next step is to assess the performance of the model by building its confusion matrix. This can be done easily with scikit-learn's `confusion_matrix()` function. However, here we will build it by hand.

The `compute_confusion_matrix` function takes as parameters the trained model, along with the training or testing data (`X` and `y`).

It should return a dictionary with keys `{'TP', 'FP', 'TN', 'FN'}` corresponding to the true positives, false positives, true negatives, and false negatives obtained by predicting the response for `X` and comparing it to `y`.

In our case a 'positive' outcome is `y==1` (a malignant tumor).
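As a reminder of what the four counts mean, here is a small sketch on hand-made label arrays. The arrays `y_true` and `y_pred` are invented for illustration and do not come from the model or the dataset.

```python
import numpy as np

# Hand-made labels for illustration only (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

counts = {
    "TP": int(np.sum((y_pred == 1) & (y_true == 1))),  # predicted 1, truly 1
    "FP": int(np.sum((y_pred == 1) & (y_true == 0))),  # predicted 1, truly 0
    "TN": int(np.sum((y_pred == 0) & (y_true == 0))),  # predicted 0, truly 0
    "FN": int(np.sum((y_pred == 0) & (y_true == 1))),  # predicted 0, truly 1
}
print(counts)
```

The four counts always sum to the number of samples, which is a quick sanity check for any implementation.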

In [None]:
def compute_confusion_matrix(model,X,y):
    cm = dict.fromkeys({'TP', 'FP', 'TN', 'FN'})
    ...
    return cm

In [None]:
cm_train = compute_confusion_matrix(model,Xtrain1,ytrain)
cm_train

In [None]:
grader.check("q1")

# 2. Compute accuracy

`compute_accuracy` takes a dictionary returned by `compute_confusion_matrix` and returns the scalar value of the accuracy, found with:

$$ \text{accuracy} = \frac{TP+TN}{TP+TN+FP+FN} $$

Use `compute_accuracy` to find the training and testing accuracy for the model.

In [None]:
def compute_accuracy(cm):
    ...

In [None]:
acc_train = compute_accuracy(cm_train)
acc_train

In [None]:
grader.check("q2")

# 3. Compute precision and recall

Repeat part 2 but for precision and recall. 
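
For reference, the standard definitions of the two metrics, written with the same confusion-matrix counts as the accuracy formula, are:

$$ \text{precision} = \frac{TP}{TP+FP}, \qquad \text{recall} = \frac{TP}{TP+FN} $$

Precision is the fraction of predicted positives that are truly positive, and recall is the fraction of true positives that the model detects.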

In [None]:
def compute_precision(cm):
    ...

def compute_recall(cm):
    ...

In [None]:
prn_train = compute_precision(cm_train)
rcl_train = compute_recall(cm_train)

prn_train, rcl_train

In [None]:
grader.check("q3")

# 4. L1 regularized logistic regression

We now repeat the previous exercise, but instead of choosing the features by their correlation with the output, we will use the LASSO regularizer. 

## 4.1 Create a pipeline

Scaling input features is not a theoretical necessity for logistic regression. However, it can be helpful for a) improving the numerical search and b) making comparisons amongst the trained coefficients. Use a `Pipeline` to combine a `StandardScaler` with logistic regression. Use these hyperparameters for the logistic regression:

``` python
C=1
penalty='l1'
solver='liblinear'
max_iter=1000
random_state=rng_seed
```

Fit the model using the pipeline's `fit` method and using the full training data.
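
To see what the scaling step does inside a pipeline, here is a minimal sketch on synthetic data with default logistic regression settings (not the hyperparameters listed above). The arrays `X_toy` and `y_toy` are made up for illustration. The step names `'scaler'` and `'logreg'` follow the names used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with very different feature scales (illustration only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2)) * np.array([1.0, 1000.0])
y_toy = (X_toy[:, 0] + X_toy[:, 1] / 1000.0 > 0).astype(int)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("logreg", LogisticRegression())])
pipe.fit(X_toy, y_toy)

# The scaler learns its mean and scale from the data passed to `fit`.
scaler = pipe.named_steps["scaler"]
X_scaled = scaler.transform(X_toy)

print(scaler.mean_, scaler.scale_)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, each column has zero mean and unit standard deviation, so the fitted coefficients refer to features measured in comparable units.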

In [None]:

model = Pipeline(...)
model.fit(...) 

In [None]:
grader.check("q4p1")

## 4.2 Cross validation

In the next part we will select features by sweeping over values of the regularization constant. We need a validation strategy for evaluating the performance of each level of regularization. We will use scikit-learn's `cross_validate` function for this.

Read the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) for an explanation of the input arguments. 

Then run `cross_validate` on the pipeline model we trained in the previous part. Use `cv=3` and record the accuracy, precision and recall by passing in `scoring=('accuracy','precision','recall')`.

Save the mean of the three test metrics as `acc_cv`, `prn_cv`, and `rcl_cv`.
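
As a hint about the structure of the return value, here is a sketch on synthetic data. The names `X_toy`, `y_toy`, `toy_scores`, and `toy_mean_acc` are hypothetical, and the estimator is a plain `LogisticRegression` rather than the pipeline from the previous part. When several metrics are requested, `cross_validate` returns a dictionary with one array per metric under the key `test_<metric>`, holding one value per fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data (illustration only).
X_toy, y_toy = make_classification(n_samples=150, n_features=5, random_state=0)

toy_scores = cross_validate(LogisticRegression(), X_toy, y_toy, cv=3,
                            scoring=("accuracy", "precision", "recall"))

# Each "test_<metric>" entry holds one score per fold.
print(sorted(toy_scores.keys()))
toy_mean_acc = toy_scores["test_accuracy"].mean()
print(toy_mean_acc)
```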

In [None]:

scores = cross_validate(...)
acc_cv = ...
prn_cv = ...
rcl_cv = ...


In [None]:
grader.check("q4p2")

## 4.3 Sweep over the regularization weight

We will now use the regularization parameter $\lambda$ to shrink the coefficients. As we increase $\lambda$ we should see the coefficients for the less useful features shrink to zero. In scikit-learn, the regularization parameter is called `C`, and is passed into the constructor for `LogisticRegression`. `C` actually equals  $1/\lambda$, so increasing regularization strength (shrinking the parameters) corresponds to decreasing `C`.
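
The effect of `C` on sparsity can be sketched on synthetic data. Everything here (`X_toy`, `y_toy`, the three values of `C`, and the dictionary `nonzero`) is an assumption made for illustration and is separate from the sweep you will write below. With an L1 penalty, a very small `C` (strong regularization) should leave few or no non-zero coefficients, while a large `C` should leave more.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only a few of them informative.
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=3, random_state=0)
X_toy = StandardScaler().fit_transform(X_toy)

# Count the non-zero coefficients at three regularization levels.
nonzero = {}
for C in (0.001, 0.1, 100.0):
    clf = LogisticRegression(C=C, penalty="l1", solver="liblinear",
                             max_iter=1000, random_state=0)
    clf.fit(X_toy, y_toy)
    nonzero[C] = int(np.count_nonzero(clf.coef_))

print(nonzero)
```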

In [None]:
numCs = 40

Cs = np.logspace(-3,2,numCs)
D = Xtrain.shape[1]

coefs = np.empty((numCs,D))

# Initialize the performance arrays to `np.empty(numCs)`.
acc_cv = ...
prn_cv = ...
rcl_cv = ...

for c, C in enumerate(Cs):   
    
    print(c)
    
    # Create and fit a pipeline, as you did in the previous part.
    ...
    
    # Extract the trained coefficients from the model and store them in the `coefs` array.
    coefs[c,:] = model.named_steps['logreg'].coef_[0,:]
    
    # Use the same code from the previous part to compute cross-validation scores
    scores = cross_validate(...)
    acc_cv[c] = ...
    prn_cv[c] = ...
    rcl_cv[c] = ...


In [None]:
grader.check("q4p3")

# Plots

The plot below should show that C=0.1 is amongst the lowest values that maximize the cross-validated performance metrics.

In [None]:
plt.subplots(figsize=(10,10),nrows=2)

plt.subplot(211)
plt.semilogx(Cs, acc_cv,'b--', label='cv acc')
plt.semilogx(Cs, prn_cv,'m--', label='cv pre')
plt.semilogx(Cs, rcl_cv,'g--', label='cv rec')
plt.legend(fontsize=12)
plt.ylim(0.8,1.02)
plt.grid()
plt.ylabel('performance',fontsize=16)

plt.subplot(212)
plt.semilogx(Cs, coefs)
plt.ylim(-10,10)
plt.grid()
plt.ylabel('coefficients',fontsize=16)

plt.xlabel('C',fontsize=20)

# Final model

The plot above shows that the best model occurs near C=0.1. We will take Cs[16]=0.11 to be the best value. Next, we sort and plot the absolute values of the coefficients for that model. Notice that only seven features have a non-zero coefficient.

In [None]:
best_C_ind = 16
abs_coef = np.abs(coefs[best_C_ind,:])
sorted_coeff_ind = np.argsort(abs_coef)

plt.figure(figsize=(10,3))
plt.stem(abs_coef[sorted_coeff_ind])

Let's see which seven features were selected.

In [None]:
feature_names = np.array(df.columns[:-1])
best_features = feature_names[sorted_coeff_ind[-1:-8:-1]]
best_features
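
The slice `[-1:-8:-1]` used above walks backwards through the ascending `argsort` result, so it returns the indices of the seven largest values, largest first. A small sketch with a made-up array (`toy_abs_coef` is hypothetical) shows the same pattern for the top three:

```python
import numpy as np

# Made-up absolute coefficients for five features.
toy_abs_coef = np.array([0.0, 2.5, 0.3, 0.0, 1.2])

order = np.argsort(toy_abs_coef)  # indices in ascending order of value
top3 = order[-1:-4:-1]            # last three indices, largest value first

print(top3)                                  # [1 4 2]
print(int(np.count_nonzero(toy_abs_coef)))   # 3
```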

Suppose that we can only keep 4 features, perhaps because we are constrained by the time and cost of performing medical examinations. We select the top 4 for our final model.
+ `worst radius`
+ `worst concave points`
+ `mean concave points`
+ `worst texture`

In [None]:
best_four_features = ['worst radius', 'worst concave points', 'mean concave points', 'worst texture']
Xtrain2 = Xtrain[best_four_features]

model = Pipeline([('scaler', StandardScaler()), 
                  ('logreg', LogisticRegression())])

model.fit(Xtrain2,ytrain) 

Finally, we calculate and report the test performance.

In [None]:
Xtest2 = Xtest[best_four_features]

acc_test = accuracy_score(ytest,model.predict(Xtest2)) 
prn_test = precision_score(ytest,model.predict(Xtest2)) 
rcl_test = recall_score(ytest,model.predict(Xtest2)) 
acc_test, prn_test, rcl_test

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)