Install scikit-learn into your computational methods environment with `conda install scikit-learn`.

In [146]:
import numpy as np
import torch
import torch.nn as nn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Part 1: Let's get this data, fam

In [147]:
TESTING = True
data = load_breast_cancer()
n = data.data.shape[0]
valFrac = 0.2
X = data.data
Y = data.target.reshape([X.shape[0], 1])
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size = valFrac, random_state = 1876)
if TESTING:
    # Debug mode: keep only 10 training rows and reuse them as the validation set
    X_train = X_train[0:10,:]
    X_val = X_train
    Y_train = Y_train[0:10]
    Y_val = Y_train
X_train = torch.from_numpy(X_train).float()
X_val = torch.from_numpy(X_val).float()
Y_train = torch.from_numpy(Y_train).float()#.long()
Y_val = torch.from_numpy(Y_val).float()#.long()
print("n Train: {}\nn Val: {}".format(X_train.shape[0], X_val.shape[0]))

n Train: 10
n Val: 10


# Part 2: Classification Functions

## Logistic Regression
Logistic regression is your bread-and-butter baseline classification model. Before implementing anything too fancy, you should see how far logistic regression can get you. When should you use logistic regression? Whenever you're performing supervised learning and have a binary response variable (only two classes). Luckily for us, most of the math for a logistic regression model has already been worked out in Lab 3 with linear regression. There's just one problem: linear regression is unbounded, and if we're doing classification we want our model to predict the probability of an observation belonging to one class or the other. So instead of letting our model's outputs range between $-\infty$ and $\infty$, we want to constrain them to lie between $0$ and $1$.

To blatantly plagiarize another TA's work, in least squares linear regression we have a feature matrix $X$ and a set of corresponding outcomes $Y$. We assume $Y = X^\top \beta + \epsilon$ with $\mathbb{E}[\epsilon] = 0$, and the goal is to learn a $\beta$ such that the prediction $\hat{Y} = X^\top \beta$ minimizes the loss function $\sum_i (Y_i - \hat{Y}_i)^2$.

Using the input $X$ and our model parameters $\beta$ we'll convert linear regression into logistic regression. First, why do we want to bound our model between $0$ and $1$? Because we're doing classification, we need an easy way to decide when our prediction is for one class or the other. If our model can only output probabilities, then we can use a cutoff (say $0.5$) and bin every observation into a class. Squashing inputs between $0$ and $1$ is done using the sigmoid function $\sigma(a) = \frac{1}{1 + \exp(-a)}$. Using our inputs $X$, our learned parameters $\beta$ and $\sigma(\cdot)$ we have the makings of greatness, or at least some kind of baseline model. We write the probability of our 'positive' class (an arbitrary designation) as $$p(Y = 1|X;\beta) = \sigma(\beta^{T}X)$$ and our 'negative' class as $$p(Y = 0|X;\beta) = 1 - \sigma(\beta^{T}X).$$ For simplicity's sake, let $a = \beta^{T}X$ for the remainder of this cell.
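As a minimal sketch of the squashing and cutoff steps described above, the snippet below evaluates the sigmoid on a few scores and bins them at $0.5$. The function names and the example scores are hypothetical, and plain Python is used instead of PyTorch only to keep the arithmetic visible.

```python
import math

def sigmoid(a):
    """Logistic sigmoid for a single float, written to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    # For negative a, exp(a) is small, so this form never calls exp on a large number
    ea = math.exp(a)
    return ea / (1.0 + ea)

def predict_class(a, cutoff=0.5):
    """Return 1 if the predicted probability reaches the cutoff, else 0."""
    return 1 if sigmoid(a) >= cutoff else 0

# Hypothetical values of a = beta^T x, including two extreme ones
scores = [-1000.0, -2.0, 0.0, 2.0, 1000.0]
probs = [sigmoid(a) for a in scores]
labels = [predict_class(a) for a in scores]
print(probs)
print(labels)
```

A score of exactly $0$ maps to a probability of $0.5$, so whether it lands in the positive class depends on whether the cutoff comparison is inclusive; the sketch treats it as inclusive.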

The loss function for logistic regression is similar to what we used in Lab 2 for MLE. For a single observation $x_i$ with response $y_i$, and writing $a_i = \beta^{T}x_i$, the likelihood of our prediction is $L(\beta;y_i, x_i) = \sigma(a_i)^{y_i}(1 - \sigma(a_i))^{1-y_i}$. For an entire dataset of $n$ examples, taking the log and flipping the sign (so that a better fit means a smaller value) gives our loss

$$-\sum_{i=1}^n \left[ y_i\log(\sigma(a_i)) + (1-y_i)\log(1-\sigma(a_i)) \right]$$
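A minimal sketch of this log-likelihood, negated so that it can be minimized and computed directly from the raw scores $a_i$. The helper names are hypothetical. It relies on the identity $\log(1 - \sigma(a)) = \log\sigma(-a)$ and on a rearranged form of $\log\sigma(a)$ that avoids overflow for large $|a|$.

```python
import math

def log_sigmoid(a):
    """log(sigma(a)) computed as min(a, 0) - log(1 + exp(-|a|))."""
    return min(a, 0.0) - math.log1p(math.exp(-abs(a)))

def nll(scores, targets):
    """Summed negative log-likelihood for targets in {0, 1} and raw scores a_i."""
    total = 0.0
    for a, y in zip(scores, targets):
        # log(1 - sigma(a)) equals log(sigma(-a))
        total -= y * log_sigmoid(a) + (1 - y) * log_sigmoid(-a)
    return total

# An uninformative score of 0 costs log(2) per observation
print(nll([0.0, 0.0], [1, 0]))
# Confident, correct scores cost almost nothing
print(nll([50.0, -50.0], [1, 0]))
```

Note that this sums over observations, whereas `nn.BCEWithLogitsLoss` averages by default, so the two differ by a factor of $n$.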

Using our usual tools of gradient descent and stochastic gradient descent, we can now learn the parameters $\beta$ which minimize our loss.
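To make the gradient descent step concrete, here is a minimal full-batch sketch in plain Python. It uses the standard result that the gradient of the negative log-likelihood with respect to $\beta$ is $\sum_i (\sigma(a_i) - y_i)x_i$. The toy data, learning rate, and function names are all hypothetical and are not taken from this lab.

```python
import math

def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    ea = math.exp(a)
    return ea / (1.0 + ea)

def fit_logistic(xs, ys, lr=0.1, epochs=500):
    """Full-batch gradient descent on the mean negative log-likelihood."""
    n_feats = len(xs[0])
    beta = [0.0] * n_feats
    for _ in range(epochs):
        grad = [0.0] * n_feats
        for x, y in zip(xs, ys):
            a = sum(b * xj for b, xj in zip(beta, x))
            err = sigmoid(a) - y          # d(loss_i)/d(a_i)
            for j in range(n_feats):
                grad[j] += err * x[j] / len(xs)
        # Step against the gradient
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data: the first column is a constant 1 acting as the intercept
xs = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
ys = [0, 0, 1, 1]
beta = fit_logistic(xs, ys)
preds = [1 if sigmoid(sum(b * xj for b, xj in zip(beta, x))) >= 0.5 else 0 for x in xs]
print(beta, preds)
```

Stochastic gradient descent would replace the inner loop over all observations with a single observation or a small batch per update.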


In [171]:
class binaryClassifier(nn.Module):
    
    # The class constructor defines the parameters (ie layers) of the neural network
    def __init__(self, nFeats, activationFunction = None):#nn.LogSoftmax(dim = 1)):
        super(binaryClassifier, self).__init__()
        # What type of parameters do we need to add?
        self.linear = nn.Linear(in_features = nFeats, out_features = 1)
        self.activationFunction = activationFunction        

    # The forward method ties the layers together to build the network.
    # We take the gradient of this composite function using backpropagation
    def forward(self, x):
        if self.activationFunction is None:
            out = self.linear(x)
        else:
            out = self.activationFunction(self.linear(x))
        return(out)

In [176]:
m = nn.Sigmoid()
logisticRegressionModel = binaryClassifier(nFeats = X_train.shape[1], activationFunction = None)
lossFunction = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(logisticRegressionModel.parameters(), lr = .0001, momentum = 0.9)

In [177]:
epochs = 1000

for epoch in range(epochs):
        
    # Estimate Y_hat with the current model
    
    Y_hat = logisticRegressionModel(X_train)
    
    # Compute the loss
    loss = lossFunction(Y_hat, Y_train)
        
    # Compute the gradient of the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 100 == 0:
        print("loss {}".format(loss))

loss 89.2001724243164
loss 6.186859536683187e-06
loss 2.622601300572569e-07
loss 2.622601300572569e-07
loss 2.50339240892572e-07
loss 2.3841832330617763e-07
loss 2.3841832330617763e-07
loss 2.3841832330617763e-07
loss 2.26497419930638e-07
loss 2.26497419930638e-07


In [178]:
Y_train

tensor([[1.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.],
        [1.]])

In [179]:
m = nn.Sigmoid()
m(Y_hat)

tensor([[1.0000e+00],
        [1.0000e+00],
        [1.0000e+00],
        [3.6419e-21],
        [0.0000e+00],
        [8.8964e-09],
        [3.7559e-08],
        [1.0000e+00],
        [1.0000e+00],
        [1.0000e+00]], grad_fn=<SigmoidBackward>)
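The cells above leave the model without an activation, pair it with `nn.BCEWithLogitsLoss`, and only apply `nn.Sigmoid` afterwards to read off probabilities. As a hedged sketch of why that works, the snippet below checks on made-up logits that `nn.BCEWithLogitsLoss` on raw scores agrees with `nn.BCELoss` on sigmoid outputs, then turns the probabilities into hard labels with the $0.5$ cutoff. The tensors here are hypothetical and are not the lab data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(6, 1)  # hypothetical raw model outputs
targets = torch.tensor([[1.], [0.], [1.], [1.], [0.], [0.]])

# Route 1: the loss applies the sigmoid internally
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)
# Route 2: apply the sigmoid first, then the plain binary cross-entropy
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Probabilities and hard labels for reporting
probs = torch.sigmoid(logits)
labels = (probs >= 0.5).float()
accuracy = (labels == targets).float().mean().item()
print(loss_from_logits.item(), loss_from_probs.item(), accuracy)
```

The logits route is generally preferred for training because it avoids taking the log of a probability that has already rounded to $0$ or $1$.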

## Probit Model

We defined $p(Y = 1|a) = \sigma(a)$ in logistic regression, but the more general form of a linear model is $p(Y = 1|X;\beta) = f(\beta^{T}X)$, where $f(\cdot)$ is known as an activation function. Another activation function we could have used is the inverse probit function, which is the cumulative distribution function of a standard normal, $\Phi(a) = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{a}{\sqrt{2}}\right)\right)$. (Did you hear that? That was the sound of an absolute ton of details being skipped over. For more information about probit regression, see page 210 [here](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf).) Here $\operatorname{erf}(\cdot)$ is the error function. All the same steps apply as in the logistic regression example, except that instead of $\sigma(\cdot)$ we use $\Phi(\cdot)$, so with $a_i = \beta^{T}x_i$ the loss (the negative log-likelihood) is

$$-\sum_{i=1}^n \left[ y_i\log(\Phi(a_i)) + (1-y_i)\log(1-\Phi(a_i)) \right]$$
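As a quick sanity check on the relationship between $\Phi$ and the error function, the sketch below compares $\frac{1}{2}(1 + \operatorname{erf}(a/\sqrt{2}))$, using the standard definition of $\operatorname{erf}$ from Python's `math` module, against the standard normal CDF from the standard library. The function name is hypothetical, and this is only a check of the formula, not a replacement for a library CDF inside a model.

```python
import math
from statistics import NormalDist

def probit_cdf(a):
    """Phi(a) = 0.5 * (1 + erf(a / sqrt(2))) for a standard normal."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

standard_normal = NormalDist(mu=0.0, sigma=1.0)
for a in (-2.0, -0.5, 0.0, 0.5, 2.0):
    # The two columns should agree to floating-point precision
    print(a, probit_cdf(a), standard_normal.cdf(a))
```

Like the sigmoid, $\Phi$ maps $0$ to $0.5$ and satisfies $\Phi(-a) = 1 - \Phi(a)$, but its tails approach $0$ and $1$ much faster.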



Tips:
1. For the love of your weekend, don't try to implement the CDF of a normal distribution yourself. Search around and find out how you can get the CDF of different distributions using PyTorch.

In [133]:
# The inverse probit activation is the CDF of a standard normal distribution
normDist = torch.distributions.normal.Normal(0.0, 1.0)

def probitActFunc(a):
    return normDist.cdf(a)

In [134]:
logisticRegressionModel = binaryClassifier(nFeats = X_train.shape[1], activationFunction = normDist.cdf)
lossFunction = nn.BCELoss()
optimizer = torch.optim.SGD(logisticRegressionModel.parameters(), lr = .0001, momentum = 0.9)

In [135]:
epochs = 1000

for epoch in range(epochs):
        
    # Estimate Y_hat with the current model
    
    Y_hat = logisticRegressionModel(X_train)
    
    # Compute the loss
    loss = lossFunction(Y_hat, Y_train)
        
    # Compute the gradient of the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 100 == 0:
        print("loss {}".format(loss))

loss 16.57861328125
loss 16.57861328125
loss 16.57861328125
loss 16.57861328125
loss 16.57861328125
loss 16.57861328125
loss 16.57861328125
loss 16.57861328125
loss 16.57861328125
loss 16.57861328125


## Hinge Loss
We can also swap the loss function we use in logistic regression for the hinge loss once we reconfigure how we view the data. To do this we formulate our response variable as either $-1$ or $1$. Now, $p(Y = 1|X;\beta) = \sigma(a)$ remains unchanged, and $p(Y = -1|X;\beta) = 1 - \sigma(a) = \sigma(-a)$. Because $Y \in \{-1, 1\}$, both cases can be written as $p(Y = y|X;\beta) = \sigma(ya)$.

Using the log likelihood and our updated probability functions, with $a_i = \beta^{T}x_i$, our loss (the negative log-likelihood) now becomes:

$$-\sum_{i=1}^{n}\log\sigma(y_i a_i) = \sum_{i=1}^{n}\log(1 + \exp(-y_i a_i))$$

The hinge loss keeps the same margin $y_i a_i$ but replaces each term with $\max(0, 1 - y_i a_i)$.
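Both the logistic loss and the hinge loss can be written as functions of the margin $ya$, where $y \in \{-1, 1\}$. The sketch below evaluates the negative log of $\sigma(ya)$, which equals $\log(1 + \exp(-ya))$, next to the standard hinge loss $\max(0, 1 - ya)$ on a few hypothetical margins. The function names are assumptions made for illustration.

```python
import math

def logistic_loss(margin):
    """-log(sigma(margin)) = log(1 + exp(-margin)), computed without overflow."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def hinge_loss(margin):
    """Zero once the margin reaches 1, linear below that."""
    return max(0.0, 1.0 - margin)

# Hypothetical values of y * a: negative means the score has the wrong sign
margins = [-2.0, 0.0, 0.5, 1.0, 3.0]
for m in margins:
    print(m, logistic_loss(m), hinge_loss(m))
```

The hinge loss is exactly zero for margins of at least $1$, while the logistic loss stays positive and only decays towards zero.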

In [None]:
# torch.nn.HingeEmbeddingLoss

In [199]:
m = nn.LogSigmoid()
logisticRegressionModel = binaryClassifier(nFeats = X_train.shape[1], activationFunction = m)
lossFunction = nn.HingeEmbeddingLoss()
optimizer = torch.optim.SGD(logisticRegressionModel.parameters(), lr = .000001, momentum = 0.9)

In [200]:
Y_train_hinge = Y_train.clone().detach()
Y_train_hinge[Y_train_hinge == 0] = -1

In [201]:
epochs = 1000
for epoch in range(epochs):
        
    # Estimate Y_hat with the current model
    
    Y_hat = logisticRegressionModel(X_train)
    
    # Compute the loss
    loss = lossFunction(Y_hat, Y_train_hinge)
        
    # Compute the gradient of the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 100 == 0:
        print("loss {}".format(loss))

loss 39.34615707397461
loss -8.638486862182617
loss -9.689288139343262
loss -10.695261001586914
loss -11.700035095214844
loss -12.704723358154297
loss -13.70942211151123
loss -14.714147567749023
loss -15.718876838684082
loss -16.723583221435547


In [202]:
Y_train_hinge

tensor([[ 1.],
        [ 1.],
        [ 1.],
        [-1.],
        [-1.],
        [-1.],
        [-1.],
        [ 1.],
        [ 1.],
        [ 1.]])

In [204]:
torch.exp(Y_hat)

tensor([[1.2955e-17],
        [8.3004e-16],
        [3.7414e-20],
        [1.1391e-01],
        [1.0000e+00],
        [1.4588e-04],
        [4.6898e-13],
        [4.9082e-18],
        [2.0393e-16],
        [3.9848e-12]], grad_fn=<ExpBackward>)
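The negative losses and the near-zero values for the positive rows above may be explained by how `nn.HingeEmbeddingLoss` is defined. As I read its documentation, the per-example loss is $x$ when $y = 1$ and $\max(0, \text{margin} - x)$ when $y = -1$, which is not the margin-based hinge loss $\max(0, 1 - ya)$. The plain-Python sketch below applies that rule to log-sigmoid inputs. Both the rule as coded and the example scores are assumptions for illustration, not a reproduction of the lab run.

```python
import math

def log_sigmoid(a):
    """log(sigma(a)) computed as min(a, 0) - log(1 + exp(-|a|))."""
    return min(a, 0.0) - math.log1p(math.exp(-abs(a)))

def hinge_embedding(x, y, margin=1.0):
    """Per-example rule assumed from the nn.HingeEmbeddingLoss documentation."""
    return x if y == 1 else max(0.0, margin - x)

# Positive example: the loss is log(sigma(a)) itself, so it keeps
# decreasing without bound as the score a becomes more negative.
pos_losses = [hinge_embedding(log_sigmoid(a), 1) for a in (5.0, 0.0, -5.0, -50.0)]

# Negative example: the loss is 1 - log(sigma(a)), which is at least 1
# and is smallest when the score a is large and positive.
neg_losses = [hinge_embedding(log_sigmoid(a), -1) for a in (-5.0, 0.0, 5.0, 50.0)]

print(pos_losses)
print(neg_losses)
```

Under this reading, minimizing the loss pushes $\sigma(a)$ towards $0$ for the rows labelled $1$, which would be consistent with the printed output. One possible alternative is to apply a margin hinge such as `torch.clamp(1 - y * a, min=0).mean()` to the raw scores.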

# Part 3: Outliers