In [1]:
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

# Your own logistic regression model

This time it's your turn to create a logistic regression model. The [dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) we're going to work with contains diagnostic information about breast cancer. Medical features of individual tumors, such as size, shape and smoothness, were measured, and each tumor was labeled as malignant (0) or benign (1).

In [2]:
data = load_breast_cancer(as_frame=True)
features = data["data"]
feature_names = data["feature_names"]
labels = data["target"]
label_names = data["target_names"]

print(features.columns)
print(labels[15:20])
print(label_names)

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')
15    0
16    0
17    0
18    0
19    1
Name: target, dtype: int32
['malignant' 'benign']


## Data exploration

Let's have a closer look at the distribution of the data. How many tumors are there in the dataset overall? Also find out how many tumors were classified as malignant and how many as benign. Is this a well-balanced dataset, or is one class overrepresented?
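One possible way to approach these questions, as a minimal sketch: count the labels with pandas `value_counts` and turn the counts into shares. The names `n_total`, `class_counts` and `class_share` are illustrative and not prescribed by the exercise.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
labels = data["target"]
label_names = data["target_names"]

# Total number of tumors in the dataset
n_total = len(labels)

# Count tumors per class and replace the numeric codes (0, 1) with readable names
class_counts = labels.value_counts().sort_index()
class_counts.index = [label_names[code] for code in class_counts.index]

# Share of each class; values far from 0.5 point to an imbalanced dataset
class_share = class_counts / n_total

print(f"Total tumors: {n_total}")
print(class_counts)
print(class_share.round(2))
```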

Next, we should look at the features and decide whether we have to preprocess them or can use them as they are. What are the datatypes of the features? If they are non-numerical, we need to convert them to numbers first. Are there any missing values or NaNs (not-a-number)?
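A short sketch of how these checks could look with pandas. The variable names `dtype_counts`, `missing_per_column` and `n_missing` are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer

features = load_breast_cancer(as_frame=True)["data"]

# How many columns are there per datatype?
dtype_counts = features.dtypes.value_counts()

# Number of missing values per column, and in the whole table
missing_per_column = features.isna().sum()
n_missing = int(missing_per_column.sum())

print(dtype_counts)
print(f"Missing values in total: {n_missing}")
```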

Have a look at a specific column. What are the mean and the standard deviation of the "mean radius" of the tumors? Can you test whether "mean radius", "mean perimeter" and "mean area" are correlated with each other?

Hint: You can access a sub-dataframe of a pandas DataFrame by indexing it with a list of column names:

```my_subframe = my_dataframe[["column1","column2","column3"]]```
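Following the hint, a possible sketch for the summary statistics and the correlation check. It uses pandas' default Pearson correlation, which is one choice among several; the names `size_features` and `correlations` are illustrative.

```python
from sklearn.datasets import load_breast_cancer

features = load_breast_cancer(as_frame=True)["data"]

# Mean and sample standard deviation (pandas uses ddof=1 by default)
radius_mean = features["mean radius"].mean()
radius_std = features["mean radius"].std()
print(f"mean radius: mean = {radius_mean:.2f}, std = {radius_std:.2f}")

# Pairwise Pearson correlation between the three size-related columns
size_features = features[["mean radius", "mean perimeter", "mean area"]]
correlations = size_features.corr()
print(correlations.round(3))
```

Values close to 1 in the off-diagonal entries would indicate that the three columns carry largely the same information.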

## Data preparation

To validate your model in the end, you will need a separate test set. Therefore, you should now split your data into two random subsets for training and testing. Your test set should contain 15% of your total dataset. Also make sure that your subsets have the expected sample sizes.
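The split could be sketched as follows with `train_test_split`. The fixed `random_state=42` is an assumption added here only to make the split reproducible; any seed (or none) works.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
features = data["data"]
labels = data["target"]

# Hold out 15% of the samples for testing (random_state is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.15, random_state=42
)

# Sanity check: both subsets together must cover the whole dataset
assert len(X_train) + len(X_test) == len(features)
print(f"Training samples: {len(X_train)}, test samples: {len(X_test)}")
print(f"Test share: {len(X_test) / len(features):.3f}")
```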

Standardize the training data and the test data using the mean and the standard deviation of the training data.
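A minimal sketch of this step with `StandardScaler`: the scaler is fitted on the training data only and then applied to both subsets, so no information from the test set leaks into the preprocessing. The split parameters (`random_state=42`) are assumptions for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], test_size=0.15, random_state=42
)

# Learn mean and standard deviation from the training data only
scaler = StandardScaler()
scaler.fit(X_train)

# Apply the same training statistics to both subsets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The training columns now have mean 0 and standard deviation 1;
# the test columns will only be close to that, which is expected
print(np.round(X_train_scaled.mean(axis=0)[:3], 3))
print(np.round(X_test_scaled.mean(axis=0)[:3], 3))
```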

## Model training

Now it's finally time to create your model and fit it to the data in your training-set.
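One way this could look, as a sketch rather than the reference solution: a `LogisticRegression` with default regularization, fitted on the standardized training data. The value `max_iter=1000` and the split seed are assumptions chosen here, not requirements of the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], test_size=0.15, random_state=42
)

# Standardize with the statistics of the training data
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create the model and fit it to the training set
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# One learned weight per feature, plus an intercept
print(model.coef_.shape, model.intercept_.shape)
```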

Evaluate your model with a metric of your choice.
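As an illustration, the sketch below reports two possible metrics: overall accuracy, and recall for the malignant class (label 0), i.e. the share of malignant tumors the model actually finds. Which metric fits best is your choice; the model settings and split seed are assumptions.

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], test_size=0.15, random_state=42
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000)
model.fit(scaler.transform(X_train), y_train)

# Predict the class of every tumor in the held-out test set
y_pred = model.predict(scaler.transform(X_test))

# Accuracy: share of correctly classified tumors
accuracy = metrics.accuracy_score(y_test, y_pred)

# Recall for malignant tumors (label 0): share of malignant tumors detected
malignant_recall = metrics.recall_score(y_test, y_pred, pos_label=0)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall (malignant): {malignant_recall:.2f}")
```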

Finally, visualize the results of your prediction in a confusion matrix.
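A possible sketch using `confusion_matrix` and `ConfusionMatrixDisplay`. The colormap, model settings and split seed are arbitrary assumptions. In a notebook the figure renders automatically; in a script you would additionally call `matplotlib.pyplot.show()`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
label_names = data["target_names"]
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], test_size=0.15, random_state=42
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000)
model.fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

# Rows are true labels, columns are predicted labels (0 = malignant, 1 = benign)
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Plot the matrix with readable class names on both axes
display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_names)
display.plot(cmap="Blues")
```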

Right now we assign each tumor to whichever class has the higher probability, so the decision threshold is 0.5, which makes sense in most cases. In other cases we can improve the model by adjusting the decision threshold. For example, we could require a probability of at least 90% for a tumor to be classified as benign and classify it as malignant otherwise. This might make sense even if it lowers the accuracy of the model, especially in this example. Why?

The next code box offers a way to play around with different decision thresholds. Uncomment the variables, replace the placeholders with your own variable names, and have a look at how the threshold affects the prediction outcome.

In [None]:
# X_test_scaled = your_X_test_scaled
# y_test = your_y_test
# model = your_model


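# Predicted probability per class; column 1 is the probability of "benign" (label 1)
pred_proba = model.predict_proba(X_test_scaled)
threshold_list = [0.1, 0.3, 0.5, 0.7, 0.9]
for threshold in threshold_list:
    print(f"\n******** For threshold = {threshold:.1f} ******")
    # Classify as benign (1) only if the benign probability exceeds the threshold
    y_test_pred = (pred_proba[:, 1] > threshold).astype(int)
    test_accuracy = metrics.accuracy_score(y_test, y_test_pred)
    print(f"Our testing accuracy is {test_accuracy:.2f}")

    # Rows are true labels, columns are predicted labels (0 = malignant, 1 = benign)
    print(confusion_matrix(y_test, y_test_pred))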

**Congratulations on building your own logistic regression model!**