
# **01-Introduction-to-Deep-Learning-with-PyTorch**


In [62]:
!git clone https://github.com/mohd-faizy/Developing-Large-Language-Models.git

Cloning into 'Developing-Large-Language-Models'...
remote: Enumerating objects: 55, done.
remote: Counting objects: 100% (55/55), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 55 (delta 16), reused 35 (delta 7), pack-reused 0
Receiving objects: 100% (55/55), 13.54 MiB | 17.63 MiB/s, done.
Resolving deltas: 100% (16/16), done.


## **1️⃣Introduction to PyTorch, a Deep Learning Library**

**Tensors: the building blocks of networks in PyTorch**

In [33]:
# Load from list
import torch

lst = [[1, 2, 3], [4, 5, 6]]
tensor = torch.tensor(lst)

print(tensor)
print(type(tensor))

tensor([[1, 2, 3],
        [4, 5, 6]])
<class 'torch.Tensor'>


In [34]:
# Load from NumPy array
import numpy as np

array = [[1, 2, 3], [4, 5, 6]]
np_array = np.array(array)
print(np_array)

print("\n")

np_tensor = torch.tensor(np_array)
print(np_tensor)

[[1 2 3]
 [4 5 6]]


tensor([[1, 2, 3],
        [4, 5, 6]])


In [35]:
# Tensor attributes
import torch

lst = [[1, 2, 3], [4, 5, 6]]
tensor = torch.tensor(lst)

print(tensor.shape)
print(tensor.dtype)
print(tensor.device)

torch.Size([2, 3])
torch.int64
cpu


> By default, `torch.tensor` infers the data type from the input: Python integers become `torch.int64` (as in the output above) and Python floats become `torch.float32`. To get `32-bit` floats from integer data, specify the dtype explicitly during creation:

In [36]:
import torch

lst_32 = [[1, 2, 3], [4, 5, 6]]
tensor = torch.tensor(lst_32, dtype=torch.float32)

print(tensor.dtype)

torch.float32
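
As a small sketch of how the dtype is chosen, the snippet below creates one tensor from integers and one from floats, then converts an existing tensor. The variable names are illustrative only.

```python
import torch

int_tensor = torch.tensor([[1, 2, 3], [4, 5, 6]])      # built from Python ints
float_tensor = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # built from Python floats

print(int_tensor.dtype)           # torch.int64
print(float_tensor.dtype)         # torch.float32 (the default floating-point dtype)
print(torch.get_default_dtype())  # torch.float32 unless it has been changed

# Convert an existing tensor instead of re-creating it
as_float = int_tensor.to(torch.float32)  # int_tensor.float() is equivalent
print(as_float.dtype)                    # torch.float32
```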


**Tensor Operations**

In [37]:
a = torch.tensor([[1, 1],
                 [2, 2]])

b = torch.tensor([[2, 2],
                  [3, 3]])

c = torch.tensor([[1, 1, 4],
                  [2, 2, 5]])

In [38]:
print(a+b)

tensor([[3, 3],
        [5, 5]])


In [39]:
print(a-b)

tensor([[-1, -1],
        [-1, -1]])


In [40]:
try:
  print(a + c)
except RuntimeError as e:
  print("Error:", e)

Error: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1


In [41]:
print(a*b)

tensor([[2, 2],
        [6, 6]])
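
The `*` operator above multiplies element by element. As a short sketch for comparison, the snippet below contrasts it with matrix multiplication (`@`) and shows a case where shapes differ but broadcasting still applies. The tensor `row` is a made-up example value.

```python
import torch

a = torch.tensor([[1, 1],
                  [2, 2]])
b = torch.tensor([[2, 2],
                  [3, 3]])

elementwise = a * b  # multiplies matching entries
matmul = a @ b       # rows of a times columns of b

print(elementwise)   # tensor([[2, 2], [6, 6]])
print(matmul)        # tensor([[ 5,  5], [10, 10]])

# Broadcasting: a (2, 2) tensor plus a (2,) tensor works because the
# smaller tensor is stretched along the missing dimension.
row = torch.tensor([10, 20])  # hypothetical example values
broadcast_sum = a + row
print(broadcast_sum)          # tensor([[11, 21], [12, 22]])
```

Adding `a` (shape `2 × 2`) to `c` (shape `2 × 3`) fails because neither mismatched dimension has size 1, so there is nothing to stretch.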


**Neural Network (NN) Vs. Convolutional Neural Network (CNN)**
             


| Feature             | Neural Network (NN)                                            | Convolutional Neural Network (CNN)                                     |
|---------------------|-----------------------------------------------------------------|-------------------------------------------------------------------------|
| Type                | General purpose                                                 | Subtype of NN, specialized for grid-like data (images)                  |
| Architecture        | Fully-connected layers                                          | Convolutional layers, pooling layers, and fully-connected layers        |
| Data Processing     | Each neuron in a layer connected to all neurons in the previous layer | Local connections between neurons, exploiting spatial relationships in data |
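| Parameter Efficiency | Less efficient: every connection has its own weight, so the parameter count grows quickly with input size | More efficient: convolution kernels share their weights across spatial positions |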
| Strengths           | Flexible, works well for various data types                     | Excellent for image recognition, computer vision                         |
| Applications        | Classification, regression, pattern recognition                 | Image classification, object detection, image segmentation              |


**Creating the Neural Network**

In [42]:
# Single Layer NN
import torch.nn as nn

# Create input_tensor with three features
input_tensor = torch.tensor([[0.3471, 0.4547, -0.2356]])

# Define our first linear layer
# linear_layer = nn.Linear(3, 2)
linear_layer = nn.Linear(in_features=3, out_features=2)

# Pass input through linear layer
output = linear_layer(input_tensor)

print(output)
print(linear_layer.weight) # Each layer has a weight
print(linear_layer.bias)   # and bias property

tensor([[-0.4295,  0.3468]], grad_fn=<AddmmBackward0>)
Parameter containing:
tensor([[-0.3432,  0.3303, -0.3077],
        [ 0.3601, -0.4120, -0.2342]], requires_grad=True)
Parameter containing:
tensor([-0.5330,  0.3540], requires_grad=True)


`output = linear_layer(input_tensor)`

For input $X$, weights $W_0$, and bias $b_0$, the linear layer computes

$y_0 = W_0 \cdot X + b_0$

- **In PyTorch**:
    - Weights and bias are initialized randomly
    - They are not useful until they are tuned
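
The formula can be checked by hand. The sketch below recomputes the layer output from `weight` and `bias`; the fixed seed is an assumption added only to make the run repeatable. Since `nn.Linear` stores its weight with shape `(out_features, in_features)`, the batched form is `X @ W.T + b`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed for repeatability only

input_tensor = torch.tensor([[0.3471, 0.4547, -0.2356]])  # shape 1 x 3
linear_layer = nn.Linear(in_features=3, out_features=2)

output = linear_layer(input_tensor)

# weight has shape (2, 3), so it is transposed before the product
manual = input_tensor @ linear_layer.weight.T + linear_layer.bias

print(linear_layer.weight.shape)       # torch.Size([2, 3])
print(torch.allclose(output, manual))  # True
```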


**Our two-layer network summary**

- Input dimensions: `1 × 3`
- Linear layer arguments:
    - in_features = `3`
    - out_features = `2`
- Output dimensions: `1 × 2`
- Networks with only linear layers are called **fully connected**.
- Each neuron in one layer is connected to each neuron in the next layer.

In [43]:
# Stacking layer with nn.Sequential() - Multiple Layer NN

import torch.nn as nn

# Create input_tensor with three features
input_tensor = torch.tensor([[0.3471, 0.4547, -0.2356]]) # Dim:1x3

# Define the model with the first layer matching input features
model = nn.Sequential(
    nn.Linear(3, 5),
    nn.Linear(5, 7),
    nn.Linear(7, 5)
)

# Pass input through the model
output_tensor = model(input_tensor)
print(output_tensor)

tensor([[-0.5969,  0.3930,  0.1064, -0.4350, -0.4013]],
       grad_fn=<AddmmBackward0>)


In [44]:
# Stacking layer with Activation function
import torch
from torch import nn

# Create input_tensor with three features
input_tensor = torch.tensor([[0.3471, 0.4547, -0.2356]]) # Dim:1x3

# Define the model using nn.Sequential
model = nn.Sequential(
    nn.Linear(3, 5),    # First linear layer with 3 input features and 5 hidden units
    nn.ReLU(),          # Activation function (ReLU in this case)
    nn.Linear(5, 7),    # Second linear layer with 5 input features (from previous layer) and 7 hidden units
    nn.ReLU(),          # Activation function (ReLU again)
    nn.Linear(7, 2)     # Output layer with 7 input features (from previous layer) and 2 output units
)

# Print the model architecture
print(model)

# Pass input through the model
output_tensor = model(input_tensor)
print(output_tensor)

Sequential(
  (0): Linear(in_features=3, out_features=5, bias=True)
  (1): ReLU()
  (2): Linear(in_features=5, out_features=7, bias=True)
  (3): ReLU()
  (4): Linear(in_features=7, out_features=2, bias=True)
)
tensor([[ 0.1882, -0.1405]], grad_fn=<AddmmBackward0>)


**Discovering Activation functions**

- Non-linearity allows neural networks to learn and represent complex patterns and relationships in data.

- Activation functions introduce non-linearity by transforming the input signal into an output signal.

- Without activation functions, neural networks would only be able to model linear relationships, limiting their capacity to learn and generalize.

- Depending on the function, activations also provide useful properties such as bounded outputs (`sigmoid`, `tanh`) and differentiability almost everywhere, which aid optimization during training.

- With non-linear activations and enough hidden units, neural networks can approximate a very wide class of functions, making them powerful tools for various tasks such as classification, regression, and reinforcement learning.

- Common activation functions include `sigmoid`, `tanh`, `ReLU` (Rectified Linear Unit), and variants like Leaky `ReLU` and `ELU` (Exponential Linear Unit).

These functions introduce complexity and flexibility, enabling neural networks to capture intricate data patterns and improve model performance.
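
To illustrate why non-linearity matters, the sketch below shows that two stacked linear layers with no activation in between are equivalent to one linear layer. The layer sizes, seed, and random input are assumptions chosen for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed for repeatability only

layer1 = nn.Linear(3, 5)
layer2 = nn.Linear(5, 2)
x = torch.randn(4, 3)  # 4 made-up samples with 3 features

stacked = layer2(layer1(x))

# Fold both layers into a single linear map: W = W2 W1, b = W2 b1 + b2
W = layer2.weight @ layer1.weight              # shape (2, 3)
b = layer2.weight @ layer1.bias + layer2.bias  # shape (2,)
collapsed = x @ W.T + b

print(torch.allclose(stacked, collapsed, atol=1e-5))  # True

# With a ReLU in between, the network is no longer a single linear map
with_relu = layer2(torch.relu(layer1(x)))
print(with_relu.shape)  # torch.Size([4, 2])
```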

```python
# Sigmoid

input_tensor = torch.tensor([[0.8]])

# Create a sigmoid function and apply it on input_tensor
sigmoid = nn.Sigmoid()
probability = sigmoid(input_tensor)
print(probability)
```


```python
# Softmax

input_tensor = torch.tensor([[1.0, -6.0, 2.5, -0.3, 1.2, 0.8]])

# Create a softmax function and apply it on input_tensor
softmax = nn.Softmax(dim=-1)
probabilities = softmax(input_tensor)
print(probabilities)
```
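
As a sketch of what these two modules compute, the snippet below applies the sigmoid and softmax formulas by hand and compares them with the built-in versions. It reuses the input values from the two cells above.

```python
import torch
import torch.nn as nn

# Sigmoid: 1 / (1 + exp(-x))
x = torch.tensor([[0.8]])
manual_sigmoid = 1 / (1 + torch.exp(-x))
builtin_sigmoid = nn.Sigmoid()(x)
print(manual_sigmoid, builtin_sigmoid)  # both approximately 0.6900

# Softmax: exp(z_i) / sum_j exp(z_j), computed along the last dimension
z = torch.tensor([[1.0, -6.0, 2.5, -0.3, 1.2, 0.8]])
manual_softmax = torch.exp(z) / torch.exp(z).sum(dim=-1, keepdim=True)
builtin_softmax = nn.Softmax(dim=-1)(z)
print(torch.allclose(manual_softmax, builtin_softmax))  # True
print(manual_softmax.sum(dim=-1))                       # sums to 1
```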

**Sigmoid**

In [45]:
# Neural Network

import torch
import torch.nn as nn

# Define the NN model
model = nn.Sequential(
    nn.Linear(6, 4),  # First linear layer
    nn.Linear(4, 1),  # Second linear layer
    nn.Sigmoid()      # Sigmoid activation function
)

# Example usage
input_tensor = torch.tensor([[6.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
output = model(input_tensor)
print(output)

tensor([[0.2785]], grad_fn=<SigmoidBackward0>)


In [46]:
# Convolutional Neural Network

import torch
import torch.nn as nn

# Define the CNN model
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 128),  # 32 channels, 56x56 image size after pooling
    nn.ReLU(),
    nn.Linear(128, 1),             # Output layer for binary classification
    nn.Sigmoid()                   # Sigmoid activation function
)

# Example usage
input_tensor = torch.randn(1, 3, 224, 224)  # Example input tensor with shape (batch_size, channels, height, width)
output = model(input_tensor)
print(output)

tensor([[0.4766]], grad_fn=<SigmoidBackward0>)


- A neural network with a `single linear layer` followed by a `sigmoid` activation is similar to a **logistic regression model**.

- The `input` dimension of a linear layer must be **equal** to the `output` dimension of the previous layer.
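
One way to find the input size of the first linear layer after convolution and pooling is to pass a dummy tensor through the feature extractor and read off the flattened size. The sketch below does this for the same convolutional layers as the CNN above; the name `features` is illustrative.

```python
import torch
import torch.nn as nn

# Convolutional part of the CNN above, ending with Flatten
features = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),   # 224 -> 112
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),   # 112 -> 56
    nn.Flatten(),
)

# A dummy batch with one 3-channel 224x224 image
with torch.no_grad():
    n_flat = features(torch.zeros(1, 3, 224, 224)).shape[1]

print(n_flat)  # 100352, i.e. 32 * 56 * 56
```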

**Softmax**

In [47]:
import torch
import torch.nn as nn

# Define the NN model
model = nn.Sequential(
    nn.Linear(3, 4),   # First linear layer
    nn.Linear(4, 3),   # Second linear layer, output size 3 for softmax
    nn.Softmax(dim=-1) # -1 indicates softmax is applied to the input tensor's last dimension
)                      # nn.Softmax() can be used as last step in nn.Sequential()

# Example usage
input_tensor = torch.tensor([[4.3, 6.1, 2.3]])
output = model(input_tensor)
print(output)

tensor([[0.0276, 0.7649, 0.2074]], grad_fn=<SoftmaxBackward0>)


## **2️⃣Training Our First Neural Network with PyTorch**

**Is there also a backward pass?**

The backward pass, also known as **backpropagation**, plays a crucial role in updating the weights and biases of a neural network during training. It is an essential step in the optimization process that allows the network to learn from the provided data.

In the **training loop**, several key steps occur:

1. **Forward Propagation**: Initially, data is propagated forward through the neural network. Each layer of the network applies transformations to the input data until the final output is generated.

2. **Comparison to Ground-Truth**: Once the forward propagation is complete, the output produced by the network is compared to the true values, often referred to as ground-truth. This step is crucial for assessing the performance of the model.

3. **Backpropagation for Weight and Bias Updates**: Following the comparison step, backpropagation is employed to update the model's weights and biases. During backpropagation, the gradients of the loss function with respect to each weight and bias are computed, and these gradients are then used to adjust the parameters of the network through techniques like gradient descent.

4. **Iteration and Tuning**: The process of forward propagation, comparison, and backpropagation is repeated iteratively until the weights and biases of the model are tuned to produce useful outputs. This iterative process allows the neural network to progressively improve its performance on the given task.

The backward pass is a fundamental aspect of training neural networks, enabling them to learn from data and adapt their parameters to make accurate predictions.
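
The steps above can be sketched with a single trainable weight, so that the gradient is easy to verify by hand. All values here (`w`, `x`, `target`, `lr`) are made-up numbers for illustration.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # the only trainable parameter
x = torch.tensor(3.0)
target = torch.tensor(10.0)

pred = w * x                 # forward pass: 6.0
loss = (pred - target) ** 2  # comparison to ground truth: 16.0
loss.backward()              # backward pass: d(loss)/dw = 2 * (pred - target) * x

print(w.grad)  # tensor(-24.)

# One gradient descent step with a hypothetical learning rate
lr = 0.01
with torch.no_grad():
    w -= lr * w.grad

print(w)  # tensor(2.2400, requires_grad=True)
```

The weight moves from 2.0 towards the value that would make `w * x` equal the target, which is the effect repeated over many iterations during training.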


**Why is Backpropagation Important?**

- **Learning from Mistakes**: Backpropagation allows neural networks to learn from their mistakes by adjusting their parameters based on the errors made during prediction. By iteratively updating the weights and biases in the direction that minimizes the error, the network gradually improves its performance.

- **Efficient Training**: Without backpropagation, training neural networks would be extremely challenging and inefficient. Backpropagation efficiently computes the gradients needed to update the network's parameters, making it feasible to train deep neural networks with many layers and parameters.

- **Adaptability**: Neural networks trained using backpropagation can adapt to complex and non-linear relationships in the data. They can learn to recognize patterns, make predictions, and solve a wide range of tasks across various domains, including image recognition, natural language processing, and reinforcement learning.

- **Generalization**: Backpropagation by itself does not prevent overfitting, since it only minimizes the loss on the training data. Whether the network captures the underlying patterns rather than memorizing noise or outliers depends on additional measures such as validation data, regularization, and early stopping.


**Binary classification: forward pass**

In [55]:
import torch
import torch.nn as nn

# Create input data of shape 5x6 - (5 samples, 6 features)
input_data = torch.tensor([
     [-0.4421,  1.5207,  2.0607, -0.3647,  0.4691,  0.0946],
     [-0.9155, -0.0475, -1.3645,  0.6336, -1.9520, -0.3398],
     [ 0.7406,  1.6763, -0.8511,  0.2432,  0.1123, -0.0633],
     [-1.6630, -0.0718, -0.1285,  0.5396, -0.0288, -0.8622],
     [-0.7413,  1.7920, -0.0883, -0.6685,  0.4745, -0.4245]
])

n_classes = 1 # one output for binary classification; use more than 1 for multi-class

# Create binary classification model
model = nn.Sequential(
    nn.Linear(6, 4), # First linear layer
    nn.Linear(4, n_classes), # Second linear layer
    nn.Sigmoid()     # Sigmoid activation function
)

# Pass input data through model
output = model(input_data)
print(output.shape)
print(output) # five probabilities between zero and one
              # one value for each sample (row) in data

torch.Size([5, 1])
tensor([[0.4739],
        [0.4578],
        [0.2866],
        [0.4935],
        [0.3632]], grad_fn=<SigmoidBackward0>)


**Multi-class classification: forward pass**

In [49]:
import torch
import torch.nn as nn

# Create input data of shape 5x6 - (5 samples, 6 features).
input_data = torch.tensor([
     [-0.4421,  1.5207,  2.0607, -0.3647,  0.4691,  0.0946],
     [-0.9155, -0.0475, -1.3645,  0.6336, -1.9520, -0.3398],
     [ 0.7406,  1.6763, -0.8511,  0.2432,  0.1123, -0.0633],
     [-1.6630, -0.0718, -0.1285,  0.5396, -0.0288, -0.8622],
     [-0.7413,  1.7920, -0.0883, -0.6685,  0.4745, -0.4245]
])

# Specify model has three classes
n_classes = 3

# Create multi-class classification model
model = nn.Sequential(
    nn.Linear(6, 4),         # First linear layer
    nn.Linear(4, n_classes), # Second linear layer
    nn.Softmax(dim=-1)
)

# Pass input data through model
output = model(input_data)
print(output.shape)
print(output)

torch.Size([5, 3])
tensor([[0.1129, 0.6338, 0.2533],
        [0.1831, 0.5304, 0.2865],
        [0.0971, 0.6695, 0.2334],
        [0.1819, 0.5236, 0.2945],
        [0.1013, 0.6489, 0.2498]], grad_fn=<SoftmaxBackward0>)


**Outputs:**
- The output dimension is `5 × 3`
- Each row sums to one.
- Value with highest probability is assigned predicted label in each row.
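- The class names are illustrative (e.g. class 1 = mammal, class 3 = reptile). In the output printed above, the second column (index 1) holds the highest probability in every row, so all five samples would receive the same label; because the weights are untrained and random, these predictions are arbitrary and change from run to run.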
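
As a sketch of how a predicted label is read off each row, the snippet below copies the probabilities printed above into a fixed tensor and applies `argmax`. The returned indices are zero-based column positions.

```python
import torch

# Probabilities copied from the printed output above
probs = torch.tensor([
    [0.1129, 0.6338, 0.2533],
    [0.1831, 0.5304, 0.2865],
    [0.0971, 0.6695, 0.2334],
    [0.1819, 0.5236, 0.2945],
    [0.1013, 0.6489, 0.2498],
])

row_sums = probs.sum(dim=-1)      # each row sums to (approximately) one
predicted = probs.argmax(dim=-1)  # index of the largest value in each row

print(row_sums)
print(predicted)  # tensor([1, 1, 1, 1, 1])
```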

$\color{red}{\textbf{Note:}}$

In the binary classification forward pass, if we change the second linear layer from `nn.Linear(4, 1)` to `nn.Linear(4, 3)`, the model produces three outputs per sample, as needed for multiclass classification.

However, using the sigmoid activation function (`nn.Sigmoid()`) with three output units is not recommended for multiclass classification. Here’s why:

- The sigmoid function produces independent probabilities for each class, but they are not guaranteed to sum up to 1.
- It treats each class independently, which can lead to incorrect predictions when dealing with multiple classes.

For multiclass tasks, it’s better to use the softmax activation function (`nn.Softmax()`), which ensures proper normalization across all classes.
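
The difference is easy to see on a single row of raw scores. In this sketch the `logits` values are made up for illustration.

```python
import torch

logits = torch.tensor([[2.0, 1.0, 0.1]])  # hypothetical raw scores for 3 classes

sigmoid_out = torch.sigmoid(logits)         # each score squashed independently
softmax_out = torch.softmax(logits, dim=-1) # scores normalized against each other

print(sigmoid_out, sigmoid_out.sum().item())  # sum is well above 1
print(softmax_out, softmax_out.sum().item())  # sum is 1 (up to rounding)
```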


**One Hot Encoding | Cross Entropy in PyTorch**

In [58]:
import torch
import torch.nn as nn

# Create input data of shape 5x6 - (5 samples, 6 features).
input_data = torch.tensor([
    [-0.4421, 1.5207, 2.0607, -0.3647, 0.4691, 0.0946],
    [-0.9155, -0.0475, -1.3645, 0.6336, -1.9520, -0.3398],
    [0.7406, 1.6763, -0.8511, 0.2432, 0.1123, -0.0633],
    [-1.6630, -0.0718, -0.1285, 0.5396, -0.0288, -0.8622],
    [-0.7413, 1.7920, -0.0883, -0.6685, 0.4745, -0.4245]
])

# Specify the number of classes
n_classes = 3

# Create a multi-class classification model
model = nn.Sequential(
    nn.Linear(6, 4),  # First linear layer
    nn.Linear(4, n_classes),  # Second linear layer
    nn.Softmax(dim=-1)
)

# Pass input data through the model
output = model(input_data)
print("Output shape:", output.shape)
print("Output probabilities:\n", output)

# Transform labels with one-hot encoding
import torch.nn.functional as F

print("One-hot encoding for class 0:", F.one_hot(torch.tensor(0), num_classes=n_classes))
print("One-hot encoding for class 1:", F.one_hot(torch.tensor(1), num_classes=n_classes))
print("One-hot encoding for class 2:", F.one_hot(torch.tensor(2), num_classes=n_classes))

# Compute cross-entropy loss in PyTorch
from torch.nn import CrossEntropyLoss

scores = torch.tensor([[-0.1211, 0.1059]]) # Prediction
one_hot_target = torch.tensor([[1, 0]])    # Target

criterion = CrossEntropyLoss()
loss = criterion(scores.double(), one_hot_target.double()) # `.double()` converts a tensor to double-precision floating-point (64-bit)
print("Cross-entropy loss:", loss)

Output shape: torch.Size([5, 3])
Output probabilities:
 tensor([[0.3142, 0.3485, 0.3373],
        [0.3312, 0.3928, 0.2760],
        [0.3070, 0.4371, 0.2558],
        [0.3649, 0.2882, 0.3470],
        [0.3153, 0.3784, 0.3063]], grad_fn=<SoftmaxBackward0>)
One-hot encoding for class 0: tensor([1, 0, 0])
One-hot encoding for class 1: tensor([0, 1, 0])
One-hot encoding for class 2: tensor([0, 0, 1])
Cross-entropy loss: tensor(0.8131, dtype=torch.float64)
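
As a cross-check of the loss value, the sketch below recomputes it from the same scores using a class-index target instead of a one-hot vector, and compares it with a manual `-log_softmax` lookup. `nn.CrossEntropyLoss` applies log-softmax internally, so it expects raw scores rather than probabilities that have already been passed through `nn.Softmax`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

scores = torch.tensor([[-0.1211, 0.1059]])  # raw scores (logits), as above
true_class = 0                               # same target as the one-hot [1, 0]
target = torch.tensor([true_class])

loss = nn.CrossEntropyLoss()(scores, target)

# Manual version: negative log-probability of the true class
manual = -F.log_softmax(scores, dim=-1)[0, true_class]

print(loss.item(), manual.item())  # both approximately 0.8131
```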


**Using derivatives to update model parameters (Backpropagation)**

In [59]:
import torch
import torch.nn as nn
import torch.optim as optim  # Import the optimizer module

# Create the model and run a forward pass
model = nn.Sequential(
    nn.Linear(16, 8),
    nn.Linear(8, 4),
    nn.Linear(4, 2)
)
sample = torch.randn(1, 16)  # Example input data
prediction = model(sample)

# Calculate the loss and compute the gradients
target = torch.tensor([0])  # Example target (class label)
criterion = nn.CrossEntropyLoss()
loss = criterion(prediction, target)
loss.backward()

# Access each layer's gradients
print("Gradients for layer 0:")
print("Weight gradient:", model[0].weight.grad)
print("Bias gradient:", model[0].bias.grad)

print("Gradients for layer 1:")
print("Weight gradient:", model[1].weight.grad)
print("Bias gradient:", model[1].bias.grad)

print("Gradients for layer 2:")
print("Weight gradient:", model[2].weight.grad)
print("Bias gradient:", model[2].bias.grad)

# Updating model parameters manually:
# subtract the local gradients scaled by the learning rate.
# Note: the assignments below create new tensors for illustration only;
# they do not modify the parameters stored in the model.
lr = 0.001  # Learning rate (typically small)

# Update the weights
weight = model[0].weight
weight_grad = model[0].weight.grad
weight = weight - lr * weight_grad

# Update the biases
bias = model[0].bias
bias_grad = model[0].bias.grad
bias = bias - lr * bias_grad

# Gradient descent
# Neural network loss functions are generally non-convex, so they are
# minimized with an iterative process such as gradient descent.
# In PyTorch, an optimizer takes care of weight updates.
# A common optimizer is stochastic gradient descent (SGD).

# Create the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)

# The optimizer updates the model parameters (weights and biases)
# using the gradients computed by loss.backward()
optimizer.step()

Gradients for layer 0:
Weight gradient: tensor([[ 0.0338, -0.0135,  0.0515, -0.0179, -0.0081, -0.0350, -0.0051,  0.0138,
         -0.0448, -0.0301,  0.0274,  0.0077, -0.0399, -0.0198,  0.0286, -0.0226],
        [-0.0628,  0.0251, -0.0956,  0.0332,  0.0150,  0.0650,  0.0095, -0.0257,
          0.0831,  0.0559, -0.0509, -0.0143,  0.0741,  0.0367, -0.0532,  0.0420],
        [-0.0171,  0.0068, -0.0260,  0.0090,  0.0041,  0.0177,  0.0026, -0.0070,
          0.0226,  0.0152, -0.0138, -0.0039,  0.0201,  0.0100, -0.0144,  0.0114],
        [ 0.1200, -0.0479,  0.1825, -0.0633, -0.0286, -0.1241, -0.0182,  0.0491,
         -0.1588, -0.1068,  0.0972,  0.0273, -0.1414, -0.0701,  0.1015, -0.0802],
        [ 0.1221, -0.0488,  0.1857, -0.0645, -0.0291, -0.1263, -0.0186,  0.0499,
         -0.1616, -0.1086,  0.0989,  0.0277, -0.1439, -0.0713,  0.1033, -0.0817],
        [ 0.0620, -0.0247,  0.0943, -0.0327, -0.0148, -0.0641, -0.0094,  0.0253,
         -0.0820, -0.0551,  0.0502,  0.0141, -0.0731, -0.0362,  
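
As a sketch of what `optimizer.step()` does for plain SGD, the snippet below updates one copy of a model by hand (in place, under `torch.no_grad()`) and another copy with `optim.SGD`, then checks that the parameters agree. The names `model_a` and `model_b`, the seed, and the random sample are assumptions for the demonstration.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # assumption: fixed seed for repeatability only

model_a = nn.Sequential(nn.Linear(16, 8), nn.Linear(8, 4), nn.Linear(4, 2))
model_b = copy.deepcopy(model_a)  # identical starting parameters

sample = torch.randn(1, 16)
target = torch.tensor([0])
criterion = nn.CrossEntropyLoss()
lr = 0.001

# Manual update: modify every parameter in place
criterion(model_a(sample), target).backward()
with torch.no_grad():
    for param in model_a.parameters():
        param -= lr * param.grad

# Same update performed by the optimizer
optimizer = optim.SGD(model_b.parameters(), lr=lr)
criterion(model_b(sample), target).backward()
optimizer.step()

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)  # True
```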

**Writing our first training loop**

In [93]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler

# Load the diabetes dataset
diabetes = load_diabetes()
features = diabetes.data
target = diabetes.target

# Normalize features
scaler = StandardScaler()
features = scaler.fit_transform(features)

# Create the dataset and dataloader
dataset = TensorDataset(torch.tensor(features).float(), torch.tensor(target).float())
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

# Create a simple model
model = nn.Sequential(
    nn.Linear(10, 5),  # Input layer to hidden layer
    nn.ReLU(),         # Activation function (ReLU)
    nn.Linear(5, 1)    # Hidden layer to output layer
)

# Create the loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)  # Lower learning rate

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    for data in dataloader:
        features, target = data        # Get feature and target from the data loader
        target = target.unsqueeze(1)   # Unsqueeze the target tensor to match the shape expected by the loss function

        optimizer.zero_grad()          # Set the gradients to zero to avoid accumulation
        pred = model(features)         # Run a forward pass to get predictions
        loss = criterion(pred, target) # Compute loss between predictions and actual targets
        loss.backward()                # Compute gradients using backpropagation
        optimizer.step()               # Update the parameters (weights and biases) of the model based on gradients

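# Evaluate after training
# Note: `features` and `target` were reassigned inside the loop, so they now hold
# only the last mini-batch; the loss below is for that batch, not the entire dataset.
with torch.no_grad():
    predictions = model(features)
    loss = criterion(predictions, target)
    print(f"Overall Loss: {loss.item():.4f}")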

Overall Loss: 17.7421
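
To measure the loss over every sample rather than one batch, the per-batch errors can be accumulated and divided by the dataset size. The sketch below assumes synthetic stand-in tensors with the same shape as the diabetes data (442 samples, 10 features) and an untrained model; `all_features` and `all_targets` are hypothetical names.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)  # assumption: fixed seed for repeatability only

# Assumption: synthetic data standing in for the real feature/target tensors
all_features = torch.randn(442, 10)
all_targets = torch.randn(442, 1)

model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))
loader = DataLoader(TensorDataset(all_features, all_targets), batch_size=4)

# Sum the squared errors per batch, then divide by the number of samples
criterion = nn.MSELoss(reduction="sum")
model.eval()
total = 0.0
with torch.no_grad():
    for batch_features, batch_targets in loader:
        total += criterion(model(batch_features), batch_targets).item()
mean_loss = total / len(loader.dataset)

# The same quantity computed in a single pass over the full tensors
with torch.no_grad():
    full_loss = nn.MSELoss()(model(all_features), all_targets).item()

print(f"Batched: {mean_loss:.4f}, single pass: {full_loss:.4f}")
```

Keeping the full tensors under names that the training loop does not reuse avoids accidentally evaluating on the last batch only.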


**Dataset**:
- **Dataset** is an abstract class representing a dataset. It allows you to access individual samples in the dataset and optionally apply transformations to the data.
- PyTorch provides a utility class called **TensorDataset** which is a subclass of **Dataset**. It takes one or more tensors as input and combines them into a single dataset. In the code above, `TensorDataset(torch.tensor(features).float(), torch.tensor(target).float())` combines the input features and target values into a dataset.

**DataLoader**:
- **DataLoader** is responsible for creating batches of data from a dataset. It provides options for shuffling the data, specifying batch size, and parallelizing data loading.
- By using a **DataLoader**, you can iterate over batches of data during training or evaluation, which is efficient and convenient.
- In the code above, `DataLoader(dataset, batch_size=4, shuffle=True)` creates a **DataLoader** from the dataset created earlier. It specifies a batch size of 4, meaning each iteration provides a batch of up to 4 samples (the last batch may be smaller). Additionally, `shuffle=True` means the data is reshuffled at the start of each epoch, so batches differ between epochs.
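
The batching behaviour can be inspected directly. This sketch uses random tensors with the same number of samples as the diabetes dataset (442), which is an assumption made so the example runs without scikit-learn.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Assumption: random stand-in data, 442 samples with 10 features each
dataset = TensorDataset(torch.randn(442, 10), torch.randn(442))
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

# Individual samples are accessed by index
sample_features, sample_target = dataset[0]
print(sample_features.shape)  # torch.Size([10])

# Iterating the loader yields (features, target) batches
batch_sizes = [batch_features.shape[0] for batch_features, _ in dataloader]
print(len(dataloader))   # 111 batches per epoch
print(batch_sizes[-1])   # 2: the last batch holds the remaining samples
```

Passing `drop_last=True` to `DataLoader` would discard that smaller final batch.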


## **3️⃣Neural Network Architecture and Hyperparameters**

## **4️⃣Evaluating and Improving Models**

#### $\color{skyblue}{\textbf{Connect with me:}}$


[<img align="left" src="https://cdn4.iconfinder.com/data/icons/social-media-icons-the-circle-set/48/twitter_circle-512.png" width="32px"/>][twitter]
[<img align="left" src="https://cdn-icons-png.flaticon.com/512/145/145807.png" width="32px"/>][linkedin]
[<img align="left" src="https://cdn2.iconfinder.com/data/icons/whcompare-blue-green-web-hosting-1/425/cdn-512.png" width="32px"/>][Portfolio]

[twitter]: https://twitter.com/F4izy
[linkedin]: https://www.linkedin.com/in/mohd-faizy/
[Portfolio]: https://mohdfaizy.com/


## ⭐⭐⭐**Quick BrushUp - ML**⭐⭐⭐

### **Activation Functions and their use cases**


| **Activation Function** | **Formula** | **Range** | **Use Case** |
|-------------------------|-------------|-----------|--------------|
| Sigmoid (Logistic)      | $$f(x) = \frac{1}{1 + e^{-x}}$$ | $(0, 1)$ | Binary classification (e.g., spam detection) |
| ReLU (Rectified Linear Unit) | $$f(x) = \max(0, x)$$ | $[0, \infty)$ | Hidden layers in most neural networks |
| Softmax                 | Given an input vector $$\mathbf{z} = (z_1, z_2, \ldots, z_k)$$, the softmax function computes: $$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^k e^{z_j}}$$ | $(0, 1)$ per component; components sum to 1 | Multiclass classification (e.g., image recognition) |
| Linear                  | $$f(x) = x$$ | $(-\infty, \infty)$ | Regression problems |
| Tanh (Hyperbolic Tangent) | $$f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$ | $(-1, 1)$ | Hidden layers in certain architectures |
| Leaky ReLU              | $$f(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha x, & \text{otherwise} \end{cases}$$ | $(-\infty, \infty)$, with a small slope $\alpha$ for negative inputs | Hidden layers to prevent "dying ReLU" problem |
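
As a quick sketch of how these functions behave, the snippet below applies four of them to the same inputs and prints the smallest and largest outputs. The input grid and the `negative_slope` value are illustrative choices.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-5.0, 5.0, steps=11)  # -5, -4, ..., 5

outputs = {
    "sigmoid": torch.sigmoid(x),
    "tanh": torch.tanh(x),
    "relu": torch.relu(x),
    "leaky_relu": F.leaky_relu(x, negative_slope=0.01),
}

for name, y in outputs.items():
    print(f"{name:>10}: min={y.min().item():.4f}, max={y.max().item():.4f}")
```

Sigmoid and tanh stay inside their bounds, ReLU clips negative inputs to zero, and Leaky ReLU lets a small negative value through.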




### **Loss Functions:**

**A loss function quantifies how far a model's predictions are from the target values. It matters for several reasons:**

- **Performance Measurement:** The loss function provides a clear metric to evaluate a model’s performance. It quantifies the difference between the model’s predictions and the actual target values. Essentially, it acts as a scorecard, allowing us to gauge how well the model is doing.

- **Direction for Improvement:** When training a machine learning model, we want it to learn and improve iteratively. The loss function guides this improvement process. By assessing the error margin, it directs the algorithm to adjust parameters (such as weights) to minimize the loss and enhance predictions.

- **Balancing Bias and Variance:** The loss function by itself does not control the trade-off between bias (underfitting from an oversimplified model) and variance (overfitting), but regularization terms added to it, such as an L2 penalty on the weights, help strike that balance. Achieving it is crucial for the model's ability to generalize well to new, unseen data.

- **Influencing Model Behavior:** Different loss functions can impact the model’s behavior. Some are more robust against outliers, while others prioritize specific types of errors. Choosing the right loss function can shape how the model learns and adapts.

Loss functions guide the learning process within a model, ensuring it moves in the right direction and continually improves.


- **Classification Loss Functions**

| **Loss Function** | **Formula** | **Use Case** |
|-------------------|-------------|--------------|
| Cross-Entropy | $$L(y, \hat{y}) = -\sum_{i=1}^N y_i \log(\hat{y}_i)$$ | - Binary Classification: Used when predicting probabilities for two classes (e.g., spam vs. not spam). - Multiclass Classification: Suitable for scenarios with more than two classes (e.g., image recognition). |
| Hinge Loss | $$L(y, \hat{y}) = \max(0, 1 - y \cdot \hat{y})$$ | - Support Vector Machines (SVM): Used for binary classification. - Emphasizes correct classification and margin maximization. |
| Log Loss (Logistic Loss) | $$L(y, \hat{y}) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$$ | - Binary Classification: Commonly used with logistic regression. - Measures the difference between predicted probabilities and true labels. |
| Focal Loss | $$L(y, \hat{y}) = -(1 - \hat{y})^\gamma \cdot y \log(\hat{y})$$ | - Imbalanced Data: Addresses class imbalance by downweighting easy examples. - Widely used in object detection tasks. |
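| Sparsemax Loss | $$L(\mathbf{z}, k) = -z_k + \frac{1}{2}\sum_{j \in S(\mathbf{z})} \left(z_j^2 - \tau^2(\mathbf{z})\right) + \frac{1}{2}$$ where $\mathbf{z}$ are the raw scores, $k$ is the true class, $S(\mathbf{z})$ is the support of $\text{sparsemax}(\mathbf{z})$, and $\tau(\mathbf{z})$ is its threshold | - Multiclass Classification: Provides sparse probability distributions. - Suitable for cases where only a few classes should be activated. |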
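
As a sketch of the log loss row, the snippet below evaluates the formula by hand, averaged over samples, and compares it with `nn.BCELoss`. The labels and predicted probabilities are made-up values.

```python
import torch
import torch.nn as nn

y_true = torch.tensor([1.0, 0.0, 1.0])  # hypothetical labels
y_prob = torch.tensor([0.9, 0.2, 0.6])  # hypothetical predicted probabilities

# -[y log(p) + (1 - y) log(1 - p)], averaged over the samples
manual = -(y_true * torch.log(y_prob)
           + (1 - y_true) * torch.log(1 - y_prob)).mean()
builtin = nn.BCELoss()(y_prob, y_true)

print(manual.item(), builtin.item())  # both approximately 0.2798
```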

- **Regression Loss Functions**

| **Loss Function** | **Formula** | **Use Case** |
|-------------------|-------------|--------------|
| Mean Square Error (MSE) | $$L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2$$ | - Regression: Widely used for continuous target variables. - Measures the average squared difference between predicted and true values. |
| Mean Absolute Error (MAE) | $$L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N |y_i - \hat{y}_i|$$ | - Regression: Measures average absolute difference between predicted and true values. |
| Huber Loss | $$L(y, \hat{y}) = \begin{cases} \frac{1}{2}(y_i - \hat{y}_i)^2, & \text{if } |y_i - \hat{y}_i| \leq \delta \\ \delta |y_i - \hat{y}_i| - \frac{1}{2}\delta^2, & \text{otherwise} \end{cases}$$ | - Regression: Combines MSE and MAE, robust to outliers. |
| Quantile Loss (Pinball Loss) | $$L(y, \hat{y}) = \begin{cases} \tau(y_i - \hat{y}_i), & \text{if } y_i \geq \hat{y}_i \\ (\tau - 1)(y_i - \hat{y}_i), & \text{otherwise} \end{cases}$$ | - Regression: Estimates conditional quantiles of the target variable. |
| Log-Cosh Loss | $$L(y, \hat{y}) = \log(\cosh(y_i - \hat{y}_i))$$ | - Regression: Smooth approximation of MAE, robust to outliers. |
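
A few of these losses can be compared on the same predictions. The sketch below uses made-up targets and predictions with errors of 0.5, 0, and 2, and assumes a PyTorch version that provides `nn.HuberLoss`.

```python
import torch
import torch.nn as nn

y_true = torch.tensor([1.0, 2.0, 3.0])  # hypothetical targets
y_pred = torch.tensor([1.5, 2.0, 5.0])  # hypothetical predictions

mse = nn.MSELoss()(y_pred, y_true)              # (0.25 + 0 + 4) / 3
mae = nn.L1Loss()(y_pred, y_true)               # (0.5 + 0 + 2) / 3
huber = nn.HuberLoss(delta=1.0)(y_pred, y_true) # (0.125 + 0 + 1.5) / 3

print(f"MSE:   {mse.item():.4f}")    # 1.4167
print(f"MAE:   {mae.item():.4f}")    # 0.8333
print(f"Huber: {huber.item():.4f}")  # 0.5417
```

The single large error (2) dominates MSE, while Huber treats it linearly and lands closer to MAE.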




### **Types of Machine Learning Tasks**

- **1.Classification**

    - **Classification** is the task of assigning categories (or classes) to given instances automatically. The machine learning model that has been trained to achieve such a goal is known as a **classifier**. Here are the main types of classification scenarios:

    1. **Binary Classification**:
        - An instance must belong to exactly one of two categories.
        - Examples: Spam vs. not spam emails, disease vs. healthy patients.

    2. **Multi-class Classification**:
        - An instance must belong to exactly one of many (more than two) categories.
        - Categories are mutually exclusive.
        - Examples: Image recognition (identifying objects in images), natural language processing (sentiment analysis).

    3. **Multi-labeled Classification**:
        - An instance may simultaneously belong to more than one category.
        - Categories are not mutually exclusive.
        - Examples: Document classification (a document can be about multiple topics), tagging images with multiple labels.

- **2.Regression**

    - **Regression** deals with predicting continuous target variables, which represent numerical values. It aims to map input features to a continuous numerical value. Here are some regression algorithms:

    1. **Linear Regression**:
        - Fits a linear relationship between input features and the target variable.
        - Used for predicting quantities like house prices, stock prices, etc.

    2. **Polynomial Regression**:
        - Extends linear regression by fitting higher-degree polynomial functions.
        - Useful when the relationship between features and target is nonlinear.

    3. **Ridge Regression** and **Lasso Regression**:
        - Variants of linear regression that handle multicollinearity and overfitting.

    4. **Decision Trees**:
        - Nonlinear regression method that splits data into segments based on feature thresholds.
        - Useful for complex relationships.



- **3.Clustering**
    - Clustering algorithms group similar data points together based on their features.
    - Commonly used for customer segmentation, image segmentation, and anomaly detection.

- **4.Dimensionality Reduction**
    - Techniques that reduce the number of features while preserving relevant information.
    - Examples include Principal Component Analysis (PCA) and t-SNE.

- **5.Recommendation Systems**
    - Used to recommend items (products, movies, etc.) to users based on their preferences and behavior.
    - Collaborative filtering and content-based filtering are common approaches.

- **6.Time Series Forecasting**
    - Predicting future values based on historical data.
    - Used in financial forecasting, weather prediction, and stock market analysis.

- **7.Natural Language Processing (NLP)**
    - Deals with understanding and generating human language.
    - Sentiment analysis, machine translation, and chatbots fall under NLP.

- **8.Anomaly Detection**
    - Identifying rare or unusual patterns in data.
    - Used for fraud detection, network security, and fault detection.

