
#PADL Week 5 Practical: Logistic Regression

##Logistic regression with scikit-learn

**Initial reading:**

Reading and understanding the scikit-learn examples on logistic regression is a good way to get started. There are no fewer than 5 examples given in the [logistic regression section](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) of the scikit-learn User Guide. Feel free to look at all five, but be sure to look at the first two: L1 Penalty and Sparsity in Logistic Regression and Regularization path of L1- Logistic Regression. You will see that by default scikit-learn uses an L2 penalty (like in ridge regression) but it is also possible to use an L1 penalty (like in lasso regression). Built-in cross-validation support for choosing the ‘right’ value for the complexity parameter is also available via the [LogisticRegressionCV class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html).
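
As a rough illustration of the points above, the sketch below fits an L2-penalised model, an L1-penalised model, and a cross-validated model. It uses synthetic data from `make_classification` as a stand-in (an assumption made purely so the snippet runs on its own; it is not the data used in this practical), and the values of `C`, `Cs` and `cv` are arbitrary choices. Depending on your scikit-learn version you may see warnings about how the penalty is specified.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Synthetic stand-in data (assumption: for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, random_state=0)

# Default behaviour: L2 penalty. C is the INVERSE regularisation strength,
# so smaller C means a stronger penalty.
l2_model = LogisticRegression(C=0.1, max_iter=1000).fit(X_demo, y_demo)

# L1 penalty: needs a solver that supports it, such as liblinear or saga
l1_model = LogisticRegression(penalty='l1', C=0.1,
                              solver='liblinear').fit(X_demo, y_demo)

# Count the coefficients that are exactly zero under each penalty
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
print('zero coefficients: L2 =', n_zero_l2, ', L1 =', n_zero_l1)

# Cross-validated choice of C from a grid of 10 candidate values
cv_model = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000).fit(X_demo, y_demo)
print('C chosen by cross-validation:', cv_model.C_)
```

Comparing the two zero-coefficient counts is a quick way to see the sparsity effect of the L1 penalty that the first scikit-learn example discusses.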

**Diagnosing breast cancer:**

Go to the Breast Cancer Wisconsin (Diagnostic) data set [webpage](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)). If you click on the ‘Data Folder’ link near the top of the page, then you will be able to get the data. It is the file [breast-cancer-wisconsin.data](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/). There are two ways to make it available to your notebook. You can upload it to the session storage for your colab notebook, but it will be lost each time your session times out. To avoid re-uploading, add `!wget -q https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data` to your script so that the file is downloaded to session storage each time it runs (as done below). Alternatively, you can mount a google drive folder and store the file there.

To save you the hassle of working out how to get this data into a Python program, here is some code that reads the data in and then removes any datapoints containing missing values:

In [1]:
!wget -q https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

import numpy as np
from sklearn.linear_model import LogisticRegression

data = np.genfromtxt('breast-cancer-wisconsin.data',delimiter=',',missing_values='?')
data = data[~np.isnan(data).any(axis=1)]
X = data[:,1:-1] # ignore first column and omit class variable at the end
y = data[:,-1]

**To do:**

Now use logistic regression to build a model to predict either malignant or benign. In fact, I would like you to build a number of logistic regression models where you *vary the size of the training data* and where you *vary the complexity parameter setting*. In all cases use whatever data you have excluded from training as a test set, and compute the score.

Check that training on more data increases predictive accuracy and compare the performance of different complexity parameter settings on smaller training sets.
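
One possible way to organise this experiment is sketched below. The helper `holdout_score` is a hypothetical name, not part of scikit-learn, and the demo runs on synthetic data from `make_classification` (an assumption so that the snippet is self-contained). To use it for the practical you would pass in the `X` and `y` arrays loaded above instead. The particular training sizes and `C` values are arbitrary starting points.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def holdout_score(X, y, n_train, C):
    """Hypothetical helper: train on the first n_train rows, score on the rest."""
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X[:n_train], y[:n_train])
    # score() returns the mean accuracy on the held-out rows
    return model.score(X[n_train:], y[n_train:])

# Synthetic stand-in with the same shape as the cleaned data (assumption)
X_demo, y_demo = make_classification(n_samples=683, n_features=9,
                                     n_informative=5, n_redundant=2,
                                     random_state=0)

results = {}
for n_train in (50, 200, 500):
    for C in (0.01, 1.0, 100.0):
        results[(n_train, C)] = holdout_score(X_demo, y_demo, n_train, C)
        print('n_train = {:3d}, C = {:6.2f}, test accuracy = {:.3f}'.format(
            n_train, C, results[(n_train, C)]))
```

Remember that in scikit-learn a smaller `C` means stronger regularisation, so the most heavily penalised model here is the one with `C = 0.01`.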

##Logistic Regression in PyTorch

Below is a straightforward re-implementation of logistic regression in PyTorch. Nearly all of this should now be very familiar to you. We define our logistic regression model as a subclass of `torch.nn.Module`. The model itself consists of a linear layer mapping 9 input features to 1 output. This is then passed through a sigmoid layer so that the model outputs the probability of one of the two classes. Since this is binary classification, we use binary cross entropy as our loss, via `torch.nn.BCELoss` (note: the sigmoid is applied inside the model, so we don't use the version of the loss function that combines sigmoid and BCE, although that would be a perfectly valid alternative). We train as normal and evaluate on the test set. This time, however, we threshold the output probabilities to make our final hard class decisions and compute the percentage classified correctly.
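
The alternative mentioned in the note is `torch.nn.BCEWithLogitsLoss`, which takes the raw output of the linear layer and applies the sigmoid internally. The minimal sketch below, using random inputs and made-up targets (assumptions for illustration only), checks that both routes give the same loss value up to floating point error.

```python
import torch

torch.manual_seed(0)

linear = torch.nn.Linear(9, 1)           # same layer shape as in the model
x = torch.randn(4, 9)                    # 4 made-up samples with 9 features
target = torch.tensor([[0.], [1.], [1.], [0.]])

logits = linear(x)                       # raw scores, before any sigmoid

# Route A: explicit sigmoid followed by BCELoss (as in this practical)
loss_a = torch.nn.BCELoss()(torch.sigmoid(logits), target)

# Route B: BCEWithLogitsLoss applies the sigmoid internally
loss_b = torch.nn.BCEWithLogitsLoss()(logits, target)

print(loss_a.item(), loss_b.item())
```

If you switch to route B in a model, `forward` should return the logits, and you need to apply the sigmoid yourself (or threshold the logits at 0) when making predictions.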

**To do:**

Read and understand this code block. Run it. Print out the shapes of the tensors as they pass through the `logisticRegression` model (all tensors have a `shape` attribute). Make sure the shapes correspond with your understanding of what each layer is doing. Try changing the training set size, number of training iterations and learning rate and see the effect.
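
As a hint for the shape-printing step, here is a small standalone sketch. `ShapeDemo` is a hypothetical class that mirrors the layer structure described above, with `print` calls added to `forward`; the batch of 5 random samples is an arbitrary choice.

```python
import torch

class ShapeDemo(torch.nn.Module):
    """Hypothetical stand-in: a linear layer followed by a sigmoid."""
    def __init__(self, input_size):
        super().__init__()
        self.linear = torch.nn.Linear(input_size, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        print('input  :', x.shape)   # (batch size, number of features)
        y = self.linear(x)
        print('linear :', y.shape)   # (batch size, 1)
        y = self.sigmoid(y)
        print('sigmoid:', y.shape)   # elementwise, so the shape is unchanged
        return y

batch = torch.randn(5, 9)            # 5 made-up samples with 9 features
out = ShapeDemo(9)(batch)
```

The same `print` calls can be dropped into the `forward` method of the real model while you experiment, and removed again before training for many iterations.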

In [17]:
import torch

class logisticRegression(torch.nn.Module):
    def __init__(self, inputSize):
        # Call superclass constructor
        super(logisticRegression, self).__init__()
        # Initialise components of model:
        # 1. Linear layer
        self.linear = torch.nn.Linear(inputSize, 1)
        # 2. Sigmoid layer
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        # Forward pass through the model:
        # 1. Apply linear layer to input
        y = self.linear(x)
        # 2. Apply sigmoid to output of linear layer
        y = self.sigmoid(y)
        return y

# Instantiate logistic regression model with 9 input features
model = logisticRegression(9)
# Instantiate loss function (binary cross entropy loss - sigmoid applied inside model)
criterion = torch.nn.BCELoss()
# Setup optimiser
optim = torch.optim.SGD(model.parameters(), lr=0.1)
print(len(X))
training_size = 500
epochs = 1200

# Convert labels to binary 0/1 classes as expected by PyTorch
y_01 = np.array([0 if x==2 else 1 for x in data[:,-1]])

# Split train/test and convert to PyTorch tensors
X_train_tensor = torch.from_numpy(np.float32(X[:training_size]))
Y_train_tensor = torch.from_numpy(np.float32(y_01[:training_size])).unsqueeze(1)
X_test_tensor = torch.from_numpy(np.float32(X[training_size:]))
Y_test_tensor = torch.from_numpy(np.float32(y_01[training_size:])).unsqueeze(1)

# Main training loop
for epoch in range(epochs):
    # Pass training data through model
    y_predict = model(X_train_tensor)
    # Compute BCE loss
    loss = criterion(y_predict,Y_train_tensor)
    # Backward pass and gradient step
    optim.zero_grad()
    loss.backward()
    optim.step()
    if not epoch % 10:
        # Print out the loss every 10 iterations
        print('epoch {}, loss {}'.format(epoch, loss.item()))

# Pass training set through model
y_predict = model(X_train_tensor)
# Threshold probabilities to binary classes
predictions = (y_predict>0.5).float()
# Compare predicted classes to labels
correct = (predictions == Y_train_tensor).float().sum()
print("Percent training set correctly classified: {:.2f}%".format(100*correct/training_size))

# Pass test set through model
y_predict = model(X_test_tensor)
# Threshold probabilities to binary classes
predictions = (y_predict>0.5).float()
# Compare predicted classes to labels
correct = (predictions == Y_test_tensor).float().sum()
print("Percent test set correctly classified: {:.2f}%".format(100*correct/X_test_tensor.shape[0]))

683
epoch 0, loss 0.8508651256561279
epoch 10, loss 0.4974907636642456
epoch 20, loss 0.4363153278827667
epoch 30, loss 0.4002538323402405
epoch 40, loss 0.37596169114112854
epoch 50, loss 0.35786762833595276
epoch 60, loss 0.34333762526512146
epoch 70, loss 0.33104950189590454
epoch 80, loss 0.32029756903648376
epoch 90, loss 0.3106766939163208
epoch 100, loss 0.30193689465522766
epoch 110, loss 0.2939133942127228
epoch 120, loss 0.2864910960197449
epoch 130, loss 0.2795857787132263
epoch 140, loss 0.2731332778930664
epoch 150, loss 0.2670827805995941
epoch 160, loss 0.26139312982559204
epoch 170, loss 0.2560299336910248
epoch 180, loss 0.2509641945362091
epoch 190, loss 0.24617086350917816
epoch 200, loss 0.24162805080413818
epoch 210, loss 0.23731648921966553
epoch 220, loss 0.23321904242038727
epoch 230, loss 0.2293204516172409
epoch 240, loss 0.22560681402683258
epoch 250, loss 0.2220657914876938
epoch 260, loss 0.21868588030338287
epoch 270, loss 0.21545684337615967
epoch 280, lo