# <font color = 'indianred'> **Logistic Regression using Minibatch Stochastic Gradient Descent with PyTorch**

## <font color = 'indianred'>**Data**

In [None]:
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

In [None]:
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

In [None]:
X, y = make_moons(n_samples=1000, noise=0.05, random_state=0)

In [None]:
preprocessor = StandardScaler()
X = preprocessor.fit_transform(X)

In [None]:
print(X.shape, y.shape)

(1000, 2) (1000,)


In [None]:
X[0:5]

array([[ 1.75081891,  0.48393978],
       [ 1.35606443, -0.90706929],
       [-0.90150442,  1.22470664],
       [-0.60117222, -0.14688373],
       [ 0.0048725 , -1.28700543]])

In [None]:
print(y[0:10])

[1 1 0 1 1 1 0 0 0 1]


## <font color = 'indianred'>**Dataset and Data Loaders**

In [None]:
# Create Tensors from numpy
x_tensor = torch.tensor(X).float()
y_tensor = torch.tensor(y.reshape(-1, 1)).float()

# create a PyTorch Dataset from x and y tensors
train_dataset = TensorDataset(x_tensor, y_tensor)

# Create DataLoader from Dataset
train_loader = DataLoader(dataset=train_dataset, batch_size=16, shuffle=True)
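
To make the batching concrete, here is a minimal sketch of what iterating over such a loader yields. It uses random tensors as a stand-in for the scaled moons data (the names `features`, `labels`, and `loader` are illustrative, not part of the notebook), so only the shapes and batch counts carry over.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the scaled moons data: 1000 samples, 2 features
torch.manual_seed(0)
features = torch.randn(1000, 2)
labels = torch.randint(0, 2, (1000, 1)).float()

dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Each iteration yields one minibatch of (inputs, targets)
batch_x, batch_y = next(iter(loader))
print(batch_x.shape, batch_y.shape)

# 1000 samples at 16 per batch: 62 full batches plus one smaller final batch
num_batches = len(loader)
last_batch_size = len(dataset) - (num_batches - 1) * 16
print(num_batches, last_batch_size)
```

Because the last batch is smaller than the others, averaging per-batch losses over `len(loader)` slightly over-weights its samples; for this dataset the effect is negligible.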

## <font color = 'indianred'>**Loss function**

**Binary Classification - using Sigmoid**

<img src ="https://drive.google.com/uc?export=view&id=1DT7n0Lmbt1hzUvSH9AS9SSydEvOJEPbE" >

### <font color = 'indianred'>**nn.BCELoss()**

In [None]:
torch.manual_seed(42)
input = torch.randn(6, requires_grad=True).view(3,2)
print(f'input.shape:{input.shape}\n')
target = torch.tensor([1.0, 0., 1.0]).view(3,1)
print(f'target.shape:{target.shape}\n')

# model - with sigmoid activation for output layer
model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
output = model(input)
print(f'output:{output}\n')

# BCE Loss function
loss_fn = nn.BCELoss()
loss = loss_fn(output, target)
print(f'loss:{loss}')

input.shape:torch.Size([3, 2])

target.shape:torch.Size([3, 1])

output:tensor([[0.6293],
        [0.6190],
        [0.4345]], grad_fn=<SigmoidBackward0>)

loss:0.7539259791374207


### <font color = 'indianred'>**nn.BCEWithLogitsLoss()**

In [None]:
torch.manual_seed(42)
input = torch.randn(6, requires_grad=True).view(3,2)
print(f'input.shape:{input.shape}\n')
target = torch.tensor([1.0, 0., 1.0]).view(3,1)
print(f'target.shape:{target.shape}\n')

# model - without sigmoid activation for output layer
model = nn.Linear(2, 1)
output = model(input)
print(f'output:{output}\n')

# BCE with logits loss function
loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(output, target)
print(f'loss:{loss}')

input.shape:torch.Size([3, 2])

target.shape:torch.Size([3, 1])

output:tensor([[ 0.5292],
        [ 0.4855],
        [-0.2635]], grad_fn=<AddmmBackward0>)

loss:0.7539259791374207


<font color= 'indianred' size = 5>**Both `nn.BCELoss()` and `nn.BCEWithLogitsLoss()` are used for binary classification problems** </font>, but they differ in the input they expect and in how they compute the loss, which affects their numerical stability and performance.

### nn.BCELoss():

1. **Input Range**: Expects probabilities in the range $[0, 1]$, typically the output of a sigmoid activation function.
2. **Target**: Binary labels (0 or 1), stored as floats with the same shape as the input.
3. **Standalone**: Computes only the Binary Cross Entropy loss; the sigmoid must already be part of the model.

### nn.BCEWithLogitsLoss():

1. **Input Range**: Accepts raw scores (logits) without any activation function applied.
2. **Target**: Binary labels (0 or 1), stored as floats with the same shape as the input.
3. **Combined**: Applies both the sigmoid activation function and the Binary Cross Entropy loss in a single, more numerically stable step.
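
As a quick check of the two descriptions above, the sketch below feeds the same logits through both routes. The tensors are random illustrative values, not the notebook's data; for moderate logits the two losses should agree up to floating-point error.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 1)                      # raw scores, e.g. from a linear layer
targets = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

# Route 1: explicit sigmoid, then BCELoss on probabilities
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Route 2: BCEWithLogitsLoss applies the sigmoid internally
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_from_probs.item(), loss_from_logits.item())
```

This mirrors the two cells above, where the same seed produces the same loss value with either combination of model and loss function.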

#### Why nn.BCEWithLogitsLoss() is Generally Preferred:

1. **Numerical Stability**: `nn.BCEWithLogitsLoss()` is more numerically stable than a separate sigmoid followed by `nn.BCELoss()`. Fusing the two steps lets the loss be computed with the log-sum-exp trick, so it never takes the log of a sigmoid that has saturated to exactly 0 or 1 for large-magnitude logits.

2. **Performance**: A single fused operation can be somewhat cheaper in the forward and backward passes than two separate operations, although the difference is usually small.

3. **Memory Efficiency**: The sigmoid output does not have to be kept as a separate intermediate tensor for backpropagation, which can save memory in large-scale models.

So, for better numerical stability, and potentially better performance and memory efficiency, `nn.BCEWithLogitsLoss()` is generally the recommended choice.
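
The stability argument can be illustrated without PyTorch. The sketch below is a plain-Python version of the standard stable rewrite of binary cross entropy on a logit, `max(z, 0) - z*y + log(1 + exp(-|z|))`; the function names are hypothetical and this is not PyTorch's actual implementation, only the same algebraic idea.

```python
import math

def naive_bce(logit, target):
    """Sigmoid first, then cross entropy: breaks when the sigmoid saturates."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def stable_bce(logit, target):
    """Algebraically equal form that never takes the log of 0."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Moderate logit: both forms agree
moderate_naive = naive_bce(2.0, 1.0)
moderate_stable = stable_bce(2.0, 1.0)
print(moderate_naive, moderate_stable)

# Large logit with the opposite label: sigmoid(40) rounds to exactly 1.0,
# so the naive form evaluates log(0) and fails
try:
    naive_result = naive_bce(40.0, 0.0)
except ValueError:
    naive_result = None
stable_result = stable_bce(40.0, 0.0)
print(naive_result, stable_result)
```

In PyTorch the failure is less visible because `nn.BCELoss()` clamps its log outputs rather than raising, but the reported loss for a saturated sigmoid is then a clamped value, not the true one.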

<br><br><br><br>

**Binary Classification - using Softmax**

<img src ="https://drive.google.com/uc?export=view&id=1DVSgL5tEvWRYt4XhlqwEEAhu8FOqt9wK" >



### <font color = 'indianred'>**nn.NLLLoss()**

In [None]:
torch.manual_seed(42)
input = torch.randn(6, requires_grad=True).view(3,2)
print(f'input.shape:{input.shape}\n')
target = torch.tensor([1.0, 0., 1.0]).long()
print(f'target.shape:{target.shape}\n')

# model - with LogSoftmax activation for output layer
model = nn.Sequential(nn.Linear(2, 2), nn.LogSoftmax(dim=1))
output = model(input)
print(f'output:{output}\n')

# Negative Log Likelihood Function
loss_fn = nn.NLLLoss()
loss = loss_fn(output, target)
print(f'loss:{loss}')

input.shape:torch.Size([3, 2])

target.shape:torch.Size([3])

output:tensor([[-0.4640, -0.9909],
        [-0.4635, -0.9917],
        [-0.5980, -0.7984]], grad_fn=<LogSoftmaxBackward0>)

loss:0.7509231567382812


### <font color = 'indianred'>**nn.CrossEntropyLoss()**

In [None]:
torch.manual_seed(42)
input = torch.randn(6, requires_grad=True).view(3,2)
print(f'input.shape:{input.shape}\n')
target = torch.tensor([1.0, 0., 1.0]).long()
print(f'target.shape:{target.shape}\n')

# model - without softmax activation for output layer
model = nn.Linear(2, 2)
output = model(input)
print(f'output:{output}\n')

# CrossEntropy Function
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(output, target)
print(f'loss:{loss}')

input.shape:torch.Size([3, 2])

target.shape:torch.Size([3])

output:tensor([[ 0.7333,  0.2065],
        [ 0.6896,  0.1615],
        [-0.0593, -0.2597]], grad_fn=<AddmmBackward0>)

loss:0.7509231567382812


<font color= 'indianred' size = 5>**Both `nn.NLLLoss()` and `nn.CrossEntropyLoss()` are used for classification tasks** </font>, but they operate on different kinds of inputs and perform different computations. Here's a breakdown:

### nn.NLLLoss():

1. **Input**: Expects log probabilities, typically the output of a `log_softmax` function.
2. **Target**: Requires class labels as integers.
3. **Operation**: Applies the Negative Log Likelihood (NLL) loss, essentially indexing into the log probabilities based on target labels and negating the values.
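
The "indexing and negating" description can be written out by hand. This is a small sketch on random illustrative logits (the variable names are not from the notebook), comparing the manual computation with `nn.NLLLoss()` under its default mean reduction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 2)
log_probs = F.log_softmax(logits, dim=1)      # what nn.NLLLoss expects as input
targets = torch.tensor([1, 0, 1])             # integer class indices

# Pick the log probability of the true class in each row, negate, average
picked = log_probs[torch.arange(3), targets]
manual_loss = -picked.mean()

builtin_loss = nn.NLLLoss()(log_probs, targets)
print(manual_loss.item(), builtin_loss.item())
```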

### nn.CrossEntropyLoss():

1. **Input**: Expects raw scores (logits), without any activation function applied.
2. **Target**: Requires class labels as integers.
3. **Operation**: Combines both the `log_softmax` and `nn.NLLLoss()` in a single operation, making it a more convenient and numerically stable option.
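
A minimal sketch of that equivalence, again on random illustrative logits: applying `nn.LogSoftmax` followed by `nn.NLLLoss` should give the same value as `nn.CrossEntropyLoss` applied directly to the logits.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5, 3)                    # 5 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2, 0])

# Two explicit steps: log probabilities, then negative log likelihood
two_step = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

# One step: cross entropy directly on the logits
one_step = nn.CrossEntropyLoss()(logits, targets)
print(two_step.item(), one_step.item())
```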

#### Why `nn.CrossEntropyLoss()` is Generally Preferred:

1. **Numerical Stability**: Similar to the advantage of `nn.BCEWithLogitsLoss()` over `nn.BCELoss()`, `nn.CrossEntropyLoss()` improves numerical stability by combining `log_softmax` and `nn.NLLLoss()` into a single operation.

2. **Code Simplification**: You don't have to explicitly include a softmax or log_softmax layer before the loss layer, making the code simpler and less prone to errors.

3. **Performance**: The combined operation can result in a performance gain during both forward and backward passes, as some computations can be merged.

4. **Memory Efficiency**: No need to store intermediate values from the softmax operation when using `nn.CrossEntropyLoss()`, potentially saving memory.

In summary, `nn.CrossEntropyLoss()` is often the more convenient and numerically stable option for classification tasks. It encapsulates the functionalities of `log_softmax` and `nn.NLLLoss()` in a single class, offering advantages in terms of code clarity, performance, and memory efficiency.

### <font color = 'indianred'>**Summary Loss Functions**


|                              | <font size = 4> BCE Loss| <font size = 4> BCE With Logits Loss| <font size = 4> NLL Loss| <font size = 4> Cross Entropy Loss|
|--------------------------|-------------|-----------|---------|-----------|
|<font size = 4> Classification|<font size = 4> binary| <font size = 4> binary| <font size = 4> multiclass|<font size = 4> multiclass|
|<font size = 4> Input (each data point)| <font size = 4> probability|<font size = 4> logit|<font size = 4> array of log probabilities|<font size = 4> array of logits|
|<font size = 4> Label (each data point)| <font size = 4> float (0.0 or 1.0)|<font size = 4> float (0.0 or 1.0)|<font size = 4> long (class index)|<font size = 4> long (class index)|
|<font size = 4> Model's last layer| <font size = 4> Sigmoid |<font size = 4> - | <font size = 4> LogSoftmax|<font size = 4> - |

### <font color = 'indianred'>**Recommendation**
- **nn.BCEWithLogitsLoss() for binary classification**
- **nn.CrossEntropyLoss() for multi-class classification**
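
For a two-class problem the two recommended losses describe the same model: a softmax over two logits equals a sigmoid of their difference. The sketch below checks this on random illustrative values; the variable names are assumptions for the example only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
two_logits = torch.randn(6, 2)                # one score per class
labels = torch.tensor([1, 0, 1, 1, 0, 0])

# Softmax route: class indices as long integers
ce = nn.CrossEntropyLoss()(two_logits, labels)

# Sigmoid route: a single logit per sample, float targets of the same shape
single_logit = (two_logits[:, 1] - two_logits[:, 0]).unsqueeze(1)
bce = nn.BCEWithLogitsLoss()(single_logit, labels.float().unsqueeze(1))
print(ce.item(), bce.item())
```

The sigmoid route needs only one output unit, which is why it is the usual choice for binary problems.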



## <font color = 'indianred'>**Model**

In [None]:
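# Specify your model: a single linear layer that outputs one logit per sample
model = nn.Sequential(nn.Linear(2, 1))

In [None]:
# specify your loss function (expects logits, following the recommendation above)
loss_function = nn.BCEWithLogitsLoss()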

## <font color = 'indianred'>**Optimizer**

In [None]:
# specify optimizer
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr = learning_rate )

## <font color = 'indianred'>**Initialization**

Create a function to initialize weights.
- Initialize weights using a normal distribution with mean = 0 and std = 0.05
- Initialize the bias term with zeros

In [None]:
def init_weights(layer):
  # only touch Linear layers; other modules keep their defaults
  if isinstance(layer, nn.Linear):
    torch.nn.init.normal_(layer.weight, mean = 0, std = 0.05)
    torch.nn.init.zeros_(layer.bias)
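
A short usage sketch, assuming a hypothetical wide layer (`demo_model`) chosen only so that the sample standard deviation is a meaningful check; the notebook's own model has just two weights. `nn.Module.apply` calls the function on every submodule, so non-linear layers such as `nn.ReLU` are visited and skipped.

```python
import torch
import torch.nn as nn

def init_weights(layer):
    # Same rule as above: N(0, 0.05) weights and zero bias, Linear layers only
    if isinstance(layer, nn.Linear):
        nn.init.normal_(layer.weight, mean=0.0, std=0.05)
        nn.init.zeros_(layer.bias)

torch.manual_seed(0)
# Hypothetical layer with 100,000 weights, used only for this check
demo_model = nn.Sequential(nn.Linear(1000, 100), nn.ReLU())
demo_model.apply(init_weights)   # apply() visits every submodule, including ReLU

weight_std = demo_model[0].weight.std().item()
bias_abs_max = demo_model[0].bias.abs().max().item()
print(f'weight std: {weight_std:.4f}, max |bias|: {bias_abs_max}')
```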

## <font color = 'indianred'>**Training Loop**

**Model Training** involves the following steps:

- Step 0: Randomly initialize parameters / weights
- Step 1: Compute model's predictions - forward pass
- Step 2: Compute loss
- Step 3: Compute the gradients
- Step 4: Update the parameters
- Step 5: Repeat steps 1 - 4
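
Steps 1 to 4 can be sketched on a single minibatch. The example below uses random illustrative data and hypothetical `demo_` names so it does not clash with the notebook's variables; it also checks that, for plain SGD without momentum or weight decay, `optimizer.step()` matches the update `w - lr * grad`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_model = nn.Linear(2, 1)
demo_loss_fn = nn.BCEWithLogitsLoss()
lr = 0.1
demo_optimizer = torch.optim.SGD(demo_model.parameters(), lr=lr)

inputs = torch.randn(16, 2)                          # one minibatch
targets = torch.randint(0, 2, (16, 1)).float()

output = demo_model(inputs)                          # Step 1: forward pass
loss = demo_loss_fn(output, targets)                 # Step 2: loss
demo_optimizer.zero_grad()                           # clear stale gradients
loss.backward()                                      # Step 3: gradients

# What plain SGD should produce: w_new = w - lr * grad
expected_weight = demo_model.weight.detach() - lr * demo_model.weight.grad
demo_optimizer.step()                                # Step 4: update
update_matches = torch.allclose(demo_model.weight.detach(), expected_weight)
print(update_matches)
```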

Model training repeats this process for many **epochs**.

We specify the number of ***epochs***; during each epoch we iterate over the complete dataset in minibatches and update the parameters after every batch.

***Learning rate*** and ***epochs*** are hyperparameters. Their values are tuned based on performance on a validation dataset.

We will now implement steps 1 to 4 in the training loop.

In [None]:
torch.manual_seed(100)
epochs = 5
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

# move model to the selected device (GPU if available)
model.to(device)

# Step 0: Randomly initialize parameters / weights

model.apply(init_weights)

for epoch in range(epochs):

  # Training Loop
  # Initialize train_loss at the start of the epoch
  running_train_loss = 0
  running_train_correct = 0

  # Iterate on batches from the dataset using train_loader
  for input_, targets in train_loader:

    # move inputs and targets to the same device as the model
    input_ = input_.to(device)
    targets = targets.to(device)

    # Step 1: Forward Pass: Compute model's predictions
    output = model(input_)

    # Step 2: Compute loss
    loss = loss_function(output, targets)

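    # Correct predictions (assumes the model outputs logits:
    # a logit >= 0 corresponds to a predicted probability >= 0.5)
    y_pred = (output >= 0).float() # get the predicted class
    correct = (y_pred == targets).sum().item()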

    # Step 3: Backward pass - compute the gradients
    optimizer.zero_grad()
    loss.backward()

    # Step 4: Update the parameters
    optimizer.step()

    # Add train loss of a batch
    running_train_loss += loss.item()

    # Add correct counts of a batch
    running_train_correct += correct

  # Calculate mean train loss for the whole dataset for a particular epoch
  train_loss = running_train_loss/len(train_loader)

  # Calculate accuracy for the whole dataset for a particular epoch
  train_acc = running_train_correct/len(train_loader.dataset)

  # Print the train loss and accuracy for the current epoch
  print(f'Epoch : {epoch+1} / {epochs}')
  print(f'Train Loss: {train_loss : .4f} | Train Accuracy: {train_acc * 100 : .4f}%')

cuda:0
Epoch : 1 / 5
Train Loss:  0.4049 | Train Accuracy:  86.3000%
Epoch : 2 / 5
Train Loss:  0.2908 | Train Accuracy:  86.6000%
Epoch : 3 / 5
Train Loss:  0.2648 | Train Accuracy:  87.0000%
Epoch : 4 / 5
Train Loss:  0.2562 | Train Accuracy:  87.5000%
Epoch : 5 / 5
Train Loss:  0.2502 | Train Accuracy:  87.8000%


In [None]:
# print the estimated weight and bias term
for name, param in model.named_parameters():
  print(name, param.data)

0.weight tensor([[ 0.9944, -2.3283]], device='cuda:0')
0.bias tensor([-0.0195], device='cuda:0')
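
The learned parameters define a linear decision boundary `w . x + b = 0`. As a rough illustration, the plain-Python sketch below plugs the weights and bias printed above into a sigmoid and scores the five scaled samples shown in the Data section; the helper name `predict_proba` is hypothetical.

```python
import math

# Parameters and samples copied from the outputs printed in this notebook
w = [0.9944, -2.3283]
b = -0.0195
samples = [
    [1.75081891, 0.48393978],
    [1.35606443, -0.90706929],
    [-0.90150442, 1.22470664],
    [-0.60117222, -0.14688373],
    [0.0048725, -1.28700543],
]
labels = [1, 1, 0, 1, 1]

def predict_proba(x):
    """sigmoid(w . x + b): the model's estimate of P(y = 1 | x)."""
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-z))

probs = [predict_proba(x) for x in samples]
preds = [int(p >= 0.5) for p in probs]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(preds, accuracy)
```

The fourth sample falls on the wrong side of the line, which is expected: the moons are not linearly separable, so a single linear layer cannot reach 100% training accuracy.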
