# Problem: Implement the Softmax function from scratch

### Problem Statement
You are tasked with implementing the **softmax activation function** in PyTorch, which computes the following operation:
 
$ \text{softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)} $ 

Once implemented, this activation function will be used in a simple multiclass classification model.
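
As a quick sanity check of the formula above, here is a minimal sketch that computes softmax by hand and compares it against PyTorch's built-in `torch.softmax`. The helper name `softmax_reference` and the example values are illustrative assumptions, not part of the original task.

```python
import torch


def softmax_reference(x: torch.Tensor) -> torch.Tensor:
    """Direct translation of softmax(x_i) = exp(x_i) / sum_j exp(x_j)."""
    exp_x = torch.exp(x)
    # Normalize over the last dimension so each row sums to 1.
    return exp_x / exp_x.sum(dim=-1, keepdim=True)


# Two example rows: one with distinct scores, one with identical scores.
logits = torch.tensor([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])

manual = softmax_reference(logits)
builtin = torch.softmax(logits, dim=-1)

print(manual)
print("Matches torch.softmax:", torch.allclose(manual, builtin))
```

Identical scores should produce a uniform distribution, and every row should sum to 1.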

~~I guess it would've been better to compare implemented softmax values with the PyTorch one as a task...~~

<details>
  <summary>💡 Hint</summary>
  In this example, if you use CrossEntropyLoss as the loss, you should <strong>not</strong> pass the model's softmax outputs to it as-is.
</details>

<details>
  <summary>😜 Spoiler</summary>
  In most cases like this example, you should <strong>not</strong> apply softmax yourself if you use CrossEntropyLoss, because it already applies (log-)softmax internally.
  <a href="https://stackoverflow.com/a/55675428">See this StackOverflow reply.</a>
</details>
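
The point about CrossEntropyLoss can be checked numerically. The sketch below uses made-up logits and targets to compare four ways of computing the loss: on raw logits, via `NLLLoss` on log-softmax, on the log of softmax outputs, and (incorrectly) on softmax outputs directly. All values are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical raw scores (logits) for 2 samples and 3 classes.
logits = torch.tensor([[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]])
targets = torch.tensor([0, 1])

criterion = nn.CrossEntropyLoss()

# Intended usage: pass raw logits.
ce_on_logits = criterion(logits, targets)

# Equivalent formulation: log-softmax followed by negative log-likelihood.
nll_on_log_softmax = nn.NLLLoss()(F.log_softmax(logits, dim=-1), targets)

# Passing log-probabilities also works, because log_softmax(log p) = log p
# when p already sums to 1.
ce_on_log_probs = criterion(torch.log(F.softmax(logits, dim=-1)), targets)

# Pitfall: passing probabilities applies softmax a second time.
ce_on_probs = criterion(F.softmax(logits, dim=-1), targets)

print(f"logits: {ce_on_logits.item():.4f}")
print(f"NLL on log-softmax: {nll_on_log_softmax.item():.4f}")
print(f"log-probabilities: {ce_on_log_probs.item():.4f}")
print(f"probabilities (double softmax): {ce_on_probs.item():.4f}")
```

The first three values should agree up to floating-point error, while the last one differs because the probabilities get squashed by a second softmax.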

In [15]:
import torch.nn as nn
import torch.optim as optim

import torch

In [16]:
# Generate synthetic data
torch.manual_seed(42)
X = torch.rand(100, 2) * 10  # 100 data points between 0 and 10, with two features
y = torch.round(torch.rand(100) * 4).long()  # Integer labels from 0 to 4 (5 classes)
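
One side effect of generating labels with `torch.round(torch.rand(...) * 4)` is that the edge classes 0 and 4 each cover only half as much of the interval as the middle classes, so they are drawn about half as often. The sketch below illustrates this with a larger sample and shows `torch.randint` as one possible alternative. The sample size and seed are arbitrary assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
n_samples = 10_000  # Larger than the notebook's 100, to make the skew visible

# Same recipe as the notebook: round a uniform value in [0, 4).
y_rounded = torch.round(torch.rand(n_samples) * 4).long()

# Alternative: draw each class 0..4 with equal probability.
y_uniform = torch.randint(low=0, high=5, size=(n_samples,))

counts_rounded = torch.bincount(y_rounded, minlength=5)
counts_uniform = torch.bincount(y_uniform, minlength=5)

print("round(rand * 4):", counts_rounded.tolist())
print("randint(0, 5):  ", counts_uniform.tolist())
```

Both recipes produce valid labels for a 5-class model, so this only matters if balanced classes are wanted.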

In [17]:
class SoftmaxFromScratch(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return torch.exp(x) / (
            torch.exp(x).sum(dim=-1, keepdim=True) + 1e-9
        )  # Small value to avoid zero division.
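
The direct formula can overflow for large inputs, since `torch.exp` returns `inf` once its argument is large enough in float32. A common remedy is to subtract the row-wise maximum before exponentiating, which leaves the result mathematically unchanged. The class below is a hypothetical variant sketched under that assumption, not the notebook's implementation.

```python
import torch
import torch.nn as nn


class StableSoftmax(nn.Module):
    """Hypothetical softmax variant that subtracts the row-wise max first."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shifting by the max makes the largest exponent exp(0) = 1,
        # so exp() cannot overflow and the denominator is at least 1.
        shifted = x - x.max(dim=-1, keepdim=True).values
        exp_x = torch.exp(shifted)
        return exp_x / exp_x.sum(dim=-1, keepdim=True)


# Inputs large enough that exp() overflows to inf in float32.
large = torch.tensor([[1000.0, 1001.0, 1002.0]])

naive = torch.exp(large) / torch.exp(large).sum(dim=-1, keepdim=True)
stable = StableSoftmax()(large)

print("naive: ", naive)   # inf / inf yields nan
print("stable:", stable)
```

Because the shifted denominator is always at least 1, this variant does not need an added epsilon to avoid division by zero.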

In [18]:
class ClassificationModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.fc1 = nn.Linear(2, 8)
        self.fc2 = nn.Linear(8, 16)
        self.fc3 = nn.Linear(16, 5)

        self.relu = nn.ReLU()
        self.custom_softmax = SoftmaxFromScratch()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)

        return self.custom_softmax(x)

In [19]:
model = ClassificationModel()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=model.parameters(), lr=0.01)

In [20]:
epochs: int = 1000

for epoch in range(epochs):
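    # Forward pass
    # The model outputs probabilities, but CrossEntropyLoss applies log-softmax
    # itself. Taking the log first compensates: log_softmax(log p) = log p
    # when p sums to 1.
    pred = torch.log(model(X))
    loss = criterion(pred, y)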

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Log progress every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f"Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}")

Epoch [100/1000], Loss: 1.3182
Epoch [200/1000], Loss: 1.1652
Epoch [300/1000], Loss: 1.0619
Epoch [400/1000], Loss: 1.0103
Epoch [500/1000], Loss: 0.9676
Epoch [600/1000], Loss: 0.9287
Epoch [700/1000], Loss: 0.8994
Epoch [800/1000], Loss: 0.8783
Epoch [900/1000], Loss: 0.8556
Epoch [1000/1000], Loss: 0.8469


In [21]:
# Testing on new data
X_test = torch.tensor([[4.0, 2.0], [7.0, 3.33]])
with torch.no_grad():
    predictions = model(X_test).argmax(dim=1)
    print(f"Predictions for {X_test.tolist()}: {predictions.tolist()}")

Predictions for [[4.0, 2.0], [7.0, 3.3299999237060547]]: [1, 1]
