In [1]:
import torch
import torch.nn as nn

### Linear regression

In [2]:
n = 10
d = 1
model = nn.Linear(n, d)

In [3]:
print(model)

Linear(in_features=10, out_features=1, bias=True)


In [5]:
print(model.weight)
print(model.bias)

Parameter containing:
tensor([[-0.1317,  0.0311,  0.2894, -0.0874,  0.0068, -0.2635, -0.2890, -0.2572,
         -0.2682, -0.2943]], requires_grad=True)
Parameter containing:
tensor([-0.0242], requires_grad=True)


In [18]:
x = torch.randn(n)

In [19]:
x

tensor([ 1.0906,  0.3910,  0.0868,  0.0980,  0.0124,  0.3860, -0.6620,  0.4804,
         0.0680,  1.3457])

In [20]:
model(x)

tensor([-0.5873], grad_fn=<ViewBackward0>)
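
As a quick sanity check, the output of `nn.Linear` should match the affine map $Wx + b$ computed by hand. The sketch below assumes a fresh, randomly initialized layer, so its numbers differ from the output above. The names `check_model`, `x_check`, and `manual` are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed only so the check is repeatable

check_model = nn.Linear(10, 1)  # same shapes as the model above
x_check = torch.randn(10)

# nn.Linear stores its weight with shape (out_features, in_features),
# so the forward pass is weight @ x + bias.
manual = check_model.weight @ x_check + check_model.bias

print(torch.allclose(check_model(x_check), manual, atol=1e-6))
```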

### Logistic regression

Linear binary classification:
$$f_\theta(x) = \sigma(Wx+b)$$
$$\sigma(z) = \frac{1}{1+e^{-z}}$$

Maps $$f_\theta: \R^{n} \rightarrow [0,1]$$

Examples, writing $z = Wx+b$ for the input to the sigmoid:
- $z$ approaches negative infinity: $z \rightarrow -\infty \Rightarrow e^{-z} \rightarrow \infty \Rightarrow \sigma(z) \rightarrow 0$
- $z$ equals 0: $\sigma(0) = \frac{1}{1+e^{0}} = \frac{1}{1+1} = \frac{1}{2}$
- $z$ approaches infinity: $z \rightarrow \infty \Rightarrow e^{-z} \rightarrow 0 \Rightarrow \sigma(z) \rightarrow \frac{1}{1+0} = 1$
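
The three limiting cases can be checked numerically. This is a small sketch that uses $\pm 20$ as stand-ins for "very negative" and "very positive" inputs; those values are an arbitrary choice.

```python
import torch

# Evaluate the sigmoid at a very negative input, at zero,
# and at a very positive input.
z = torch.tensor([-20.0, 0.0, 20.0])
s = torch.sigmoid(z)

print(s)  # close to 0, exactly one half, close to 1
```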

In [22]:
class LinearClassifier(torch.nn.Module):
    """Logistic regression classifier."""
    def __init__(self, n, d):
        super().__init__()
        self.model = nn.Linear(n, d)

    def forward(self, x):
        """We first calculate the linear combination of the features and
        weights and then apply a sigmoid function."""
        regression_output = self.model(x)
        # NOTE: when training, it is usually better to return the raw
        # regression output and use a loss such as nn.BCEWithLogitsLoss,
        # which applies the sigmoid internally in a numerically stable way.
        return nn.functional.sigmoid(regression_output)


In [23]:
lc = LinearClassifier(n, d)

In [31]:
lc.model.weight

Parameter containing:
tensor([[ 0.3162,  0.2263, -0.0120,  0.1590,  0.1616, -0.0768, -0.2428, -0.2046,
         -0.1925,  0.2249]], requires_grad=True)

In [29]:
x = torch.randn(n)
x

tensor([ 0.0563, -0.9167, -1.0453,  0.0662,  0.3473,  0.0155,  1.1827, -0.0678,
         1.1243, -1.3446])

In [30]:
lc(x)

tensor([0.3444], grad_fn=<SigmoidBackward0>)
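
The note in `forward` hints that applying the sigmoid inside the model can be numerically fragile during training. One common alternative, sketched here under the assumption that training uses a binary cross-entropy loss, is to return the raw score (the logit) and let `nn.BCEWithLogitsLoss` apply the sigmoid internally. The class name `LogitClassifier` and the target value are made up for illustration.

```python
import torch
import torch.nn as nn

class LogitClassifier(nn.Module):
    """Hypothetical variant that returns logits instead of probabilities."""
    def __init__(self, n, d):
        super().__init__()
        self.model = nn.Linear(n, d)

    def forward(self, x):
        return self.model(x)  # raw score; no sigmoid here

torch.manual_seed(0)
clf = LogitClassifier(10, 1)
x_in = torch.randn(10)
target = torch.tensor([1.0])  # assumed label, for illustration only

logit = clf(x_in)
# BCEWithLogitsLoss combines the sigmoid and binary cross-entropy in one step.
loss = nn.BCEWithLogitsLoss()(logit, target)
loss.backward()  # gradients flow to the weight and bias

# At inference time the probability is still available via the sigmoid.
prob = torch.sigmoid(logit)
print(loss.item(), prob.item())
```

The model's predictions are unchanged; only the place where the sigmoid is applied moves from the module to the loss (for training) or to the caller (for inference).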

### Linear multi-class classification

Multi-class classification uses softmax.
$$f_\theta(x) = \text{softmax}(Wx+b)$$
$$\text{softmax}(v)_i = \frac{e^{v_i}}{\sum_{j=1}^{d} e^{v_j}}$$

The probability that a given observation is in the $i^{th}$ class is given by $e^{v_i}$ divided by the sum of the exponentiated scores across all $d$ classes, $\sum_{j=1}^{d} e^{v_j}$.

I found this walkthrough quite useful: https://www.pinecone.io/learn/softmax-activation/

The softmax does the following to a vector $v=[v_1, ..., v_d]^{T} \in \R^{d}$:
1. Exponentiates the vector: $e^{v} = [e^{v_1}, e^{v_2}, ..., e^{v_d}]$
2. Normalizes that vector by multiplying it by a factor of $$\frac{1}{\sum_{i=1}^{d}e^{v_i}}$$

Let's work through an example. Let's say that we have $v=[1, -3, 5, -6, 10]$.
1. Exponentiate the vector: $e^{v} = [e^1, e^{-3}, e^5, e^{-6}, e^{10}]$.
2. Normalize the vector:
$$\text{softmax}(v) = [e^1, e^{-3}, e^5, e^{-6}, e^{10}] \cdot \frac{1}{e^1 + e^{-3} + e^5 + e^{-6} + e^{10}}$$

We can calculate this explicitly:

In [37]:
v = torch.tensor([1, -3, 5, -6, 10])
print(v)

tensor([ 1, -3,  5, -6, 10])


In [38]:
v_exp = torch.exp(v)
print(v_exp)

tensor([2.7183e+00, 4.9787e-02, 1.4841e+02, 2.4788e-03, 2.2026e+04])


In [40]:
normalizing_coefficient = torch.sum(v_exp)
print(normalizing_coefficient)

tensor(22177.6484)


In [41]:
normalized_v = v_exp / normalizing_coefficient
print(normalized_v)

tensor([1.2257e-04, 2.2449e-06, 6.6920e-03, 1.1177e-07, 9.9318e-01])


We can see several things here:
1. The softmax preserves the order of its inputs.
2. The larger an input value, the larger its output probability. The largest input here (the last element, 10) gets the largest probability.

Given this, we can now take the argmax across this array in order to get our multiclass label.

In [42]:
label = torch.argmax(normalized_v)
print(f"Predicted class label: {label}")

Predicted class label: 4
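
The manual computation can be compared against PyTorch's built-in `torch.softmax`. The sketch below also includes a shifted variant that subtracts the maximum before exponentiating. That is a common trick for avoiding overflow with large inputs; it is an addition here, not part of the steps above.

```python
import torch

v = torch.tensor([1.0, -3.0, 5.0, -6.0, 10.0])

# Manual softmax: exponentiate, then normalize by the sum.
manual = torch.exp(v) / torch.exp(v).sum()

# Shifted version: subtracting the max leaves the result unchanged
# but keeps the exponentials from overflowing for large inputs.
shifted = torch.exp(v - v.max())
stable = shifted / shifted.sum()

builtin = torch.softmax(v, dim=0)

print(torch.allclose(manual, builtin, atol=1e-6))
print(torch.allclose(stable, builtin, atol=1e-6))
print(builtin.sum())  # the probabilities sum to 1 (up to rounding)
```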


##### Linear multi-class classification example

To use this for inference, the model has to output one value per class.

For example, let's do weather prediction:

- Input: $x$ = day of the week
- Output: $f(x)$ = precipitation (rain, snow, hail, sun)

Prediction:
- $\mathbb{P}(\text{rain}) = f_\theta(x)_1 = \text{softmax}(Wx+b)_1$
- $\mathbb{P}(\text{snow}) = f_\theta(x)_2 = \text{softmax}(Wx+b)_2$
- $\mathbb{P}(\text{hail}) = f_\theta(x)_3 = \text{softmax}(Wx+b)_3$
- $\mathbb{P}(\text{sun}) = f_\theta(x)_4 = \text{softmax}(Wx+b)_4$

We would have 4 different weight vectors and 4 different bias values.

If, in the single-output case, we had $W \in \R^{1 \times 10}$, in this 4-class case we now have $W \in \R^{4 \times 10}$.

Our bias, which was a single scalar $b \in \R$, is now a vector $b \in \R^{4}$.
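
For reference, PyTorch's `nn.Linear` stores its weight with shape `(out_features, in_features)`, so a layer with 10 inputs and 4 classes has one row of weights per class. A minimal check (the name `layer` is illustrative):

```python
import torch.nn as nn

layer = nn.Linear(10, 4)   # 10 input features, 4 classes

print(layer.weight.shape)  # torch.Size([4, 10])
print(layer.bias.shape)    # torch.Size([4])
```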

In [43]:
n = 10

Let's initialize our vectors for each class

In [52]:
w_rain = torch.randn(n)
w_snow = torch.randn(n)
w_hail = torch.randn(n)
w_sun = torch.randn(n)

b_rain = torch.randn(1)
b_snow = torch.randn(1)
b_hail = torch.randn(1)
b_sun = torch.randn(1)

print(f"Shape of rain tensor: {w_rain.shape}\t Shape of rain bias: {b_rain.shape}")
print(f"Rain tensor: {w_rain}\t Rain bias: {b_rain}")
print(f"Shape of snow tensor: {w_snow.shape}\t Shape of snow bias: {b_snow.shape}")
print(f"Snow tensor: {w_snow}\t Snow bias: {b_snow}")
print(f"Shape of hail tensor: {w_hail.shape}\t Shape of hail bias: {b_hail.shape}")
print(f"Hail tensor: {w_hail}\t Hail bias: {b_hail}")
print(f"Shape of sun tensor: {w_sun.shape}\t Shape of sun bias: {b_sun.shape}")
print(f"Sun tensor: {w_sun}\t Sun bias: {b_sun}")

Shape of rain tensor: torch.Size([10])	 Shape of rain bias: torch.Size([1])
Rain tensor: tensor([ 0.8005, -0.2246, -1.7709,  0.9279,  0.8335, -1.9097, -0.0217,  0.2062,
         0.2626, -0.1394])	 Rain bias: tensor([-0.4068])
Shape of snow tensor: torch.Size([10])	 Shape of snow bias: torch.Size([1])
Snow tensor: tensor([-0.7112,  0.0871, -0.6663,  0.3715,  0.7649, -1.5910, -0.5670,  0.2699,
         0.4071,  0.3503])	 Snow bias: tensor([0.1228])
Shape of hail tensor: torch.Size([10])	 Shape of hail bias: torch.Size([1])
Hail tensor: tensor([ 1.8832, -0.2612, -0.3553,  0.3730, -1.8256, -0.2604,  1.2536, -0.2062,
         0.3728, -2.1392])	 Hail bias: tensor([1.4636])
Shape of sun tensor: torch.Size([10])	 Shape of sun bias: torch.Size([1])
Sun tensor: tensor([ 0.5999, -0.1561, -0.5780, -1.9524,  0.4827,  0.1182, -0.9592, -1.4935,
        -1.2073,  0.9927])	 Sun bias: tensor([0.3712])


Let's now make it 4-D. Each of our weight vectors is a "row vector" that we can stack vertically.

We need them to be row vectors for our shapes to work out as intended, as our inputs are column vectors by convention.

After this, we'll get $W \in \R^{4 \times 10}$ and, once the biases are stacked, $b \in \R^{4 \times 1}$.

We expect our input vectors to be $x \in \R^{10}$ (a column vector by convention, stored as a 1-D tensor of length 10).

As a result, $Wx \in \R^{4}$ and $y = Wx + b \in \R^{4}$. For this to work, we want our bias to be a 4-D vector of constants rather than a $4 \times 1$ matrix: adding a tensor of shape (4, 1) to one of shape (4) would broadcast to a (4, 4) result, so we squeeze $b$ down to shape (4).

In [74]:
W = torch.stack([w_rain, w_snow, w_hail, w_sun], dim=0)
b = torch.stack([b_rain, b_snow, b_hail, b_sun], dim=0).squeeze()

print(f"Shape of W: {W.shape}\t Shape of b: {b.shape}")

Shape of W: torch.Size([4, 10])	 Shape of b: torch.Size([4])
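
To see why the `.squeeze()` matters, here is a small sketch of what happens when the bias is left as a $4 \times 1$ matrix. The tensors are random stand-ins, not the weather weights above.

```python
import torch

Wx = torch.randn(4)        # stand-in for W @ x, shape (4,)
b_col = torch.randn(4, 1)  # bias kept as a 4 x 1 matrix
b_vec = b_col.squeeze()    # bias as a plain 4-element vector

print((Wx + b_col).shape)  # torch.Size([4, 4]): broadcast, not what we want
print((Wx + b_vec).shape)  # torch.Size([4])
```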


Now let's initialize a random vector $x$. Suppose it represents a vector of the temperatures throughout the day.

In [66]:
x_temps = torch.randn(n)

We can now do inference on this vector.

In [75]:
print(f"Shape of x_temps: {x_temps.shape}")
print(f"Shape of W: {W.shape}")
print(f"Shape of b: {b.shape}")

Shape of x_temps: torch.Size([10])
Shape of W: torch.Size([4, 10])
Shape of b: torch.Size([4])


In [77]:
y = torch.matmul(W, x_temps) + b
print(y)
print(y.shape)

tensor([ 0.6637, -0.5460,  4.4479, -2.3486])
torch.Size([4])


We now have our result! Now what?

We can take the argmax to get the predicted class:

In [79]:
label = torch.argmax(y)
print(f"Predicted class label: {label}")

Predicted class label: 2


However, we also want to take the softmax. Why?

1. The raw results can't be interpreted as probabilities. For classification, we want outputs that let us compare the probability of each class against the others.
2. Because softmax lets us interpret the results as probabilities, we can use them in loss functions, such as cross-entropy, to measure how well the predicted probabilities match the class labels.
3. Backpropagation: softmax is a smooth, differentiable function, so we can backpropagate through it, whereas argmax is not differentiable.
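
To make the second point concrete, the sketch below computes a cross-entropy loss from the scores above in two ways: by hand, as the negative log of the softmax probability of the true class, and with `F.cross_entropy`. The true class (`target = 2`) is assumed purely for illustration.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([0.6637, -0.5460, 4.4479, -2.3486])  # scores from above
target = torch.tensor(2)  # assumed true class, for illustration only

probs = F.softmax(logits, dim=0)
manual_loss = -torch.log(probs[target])

# F.cross_entropy takes raw scores and applies log-softmax internally,
# so it expects a batch dimension: input (1, 4), target (1,).
builtin_loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

print(manual_loss.item(), builtin_loss.item())
```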

Let's take the softmax of our values.

In [81]:
softmax_y = torch.nn.functional.softmax(y, dim=0)
print(softmax_y)

tensor([0.0221, 0.0066, 0.9703, 0.0011])


As we see, the softmax gives us an interpretable set of values, where each output is the probability that the model assigns the input to the $i^{th}$ class.

Now let's encapsulate this in a torch module:

In [89]:
class LinearClassifier(torch.nn.Module):
    """Linear multi-class classifier (softmax regression)."""
    def __init__(self, n, d):
        super().__init__()
        self.model = nn.Linear(n, d)

    def forward(self, x):
        """We first calculate the linear combination of the features and
        weights and then apply a softmax over the class dimension."""
        regression_output = self.model(x)
        # NOTE: when training, it is usually better to return the raw
        # regression output and use a loss such as nn.CrossEntropyLoss,
        # which applies log-softmax internally in a numerically stable way.
        return nn.functional.softmax(regression_output, dim=-1)

Let's try this again with our previous problem.

In [90]:
n = 10
d = 4

In [91]:
lc = LinearClassifier(n, d)
x = torch.randn(n)

In [92]:
y = lc(x)
print(y)

tensor([0.4456, 0.1840, 0.2596, 0.1108], grad_fn=<SoftmaxBackward0>)


In [94]:
label = torch.argmax(y)
print(f"Predicted class label: {label}")

Predicted class label: 0
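
For training, a common pattern is to keep the model's output as raw scores and pair it with `nn.CrossEntropyLoss`. The sketch below runs one gradient step on made-up data; the batch size, learning rate, and labels are all assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

linear = nn.Linear(10, 4)             # returns raw scores, no softmax
loss_fn = nn.CrossEntropyLoss()       # applies log-softmax internally
optimizer = torch.optim.SGD(linear.parameters(), lr=0.1)

x_batch = torch.randn(8, 10)          # made-up batch of 8 inputs
y_batch = torch.randint(0, 4, (8,))   # made-up class labels in {0, 1, 2, 3}

loss_before = loss_fn(linear(x_batch), y_batch)
optimizer.zero_grad()
loss_before.backward()
optimizer.step()
loss_after = loss_fn(linear(x_batch), y_batch)

print(loss_before.item(), loss_after.item())

# Probabilities are recovered with a softmax only when they are needed.
probs = torch.softmax(linear(x_batch), dim=-1)
print(probs.sum(dim=-1))  # each row sums to 1 (up to rounding)
```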


#### Multi-class vs. multiple binary classification

We use the softmax for multiclass classification. But what if we want to predict the probability of multiple classes at the same time? Let's say that instead of predicting the weather as rain, snow, hail, or sun, we say that a day can have any of those labels, even multiple at the same time.

In this case, the problem becomes a multiple binary classification problem, since we're predicting the probability that there was rain on a given day, there was snow on a given day, there was hail on a given day, and/or there was sun on a given day.

In this case, we would change our function from a softmax to a sigmoid. Using a sigmoid treats each category as an independent binary classification problem, so several categories can be active at once. The probabilities are not normalized across classes (i.e., they don't sum up to one), and so the results are used for multi-label tagging.

You can have an output like $y=[0.2, 0.6, 0.7, 0.1]$, which clearly doesn't sum up to 1. With a threshold of 0.5, the labels we'd assign are the indices $[1, 2]$ (snow and hail).
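
Turning per-class sigmoid outputs into a set of labels can be sketched as a simple threshold. The 0.5 cutoff is an assumption; in practice it is a tunable choice.

```python
import torch

classes = ["rain", "snow", "hail", "sun"]
probs = torch.tensor([0.2, 0.6, 0.7, 0.1])  # per-class sigmoid outputs from the example

threshold = 0.5  # assumed cutoff, for illustration only
predicted = torch.nonzero(probs > threshold).flatten().tolist()

print(predicted)                        # [1, 2]
print([classes[i] for i in predicted])  # ['snow', 'hail']
```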