### Mini-batch Gradient Descent and Data Management with PyTorch

#### Theory

Mini-batch gradient descent is an optimization algorithm used to train machine learning models. It combines the advantages of batch gradient descent and stochastic gradient descent by dividing the training data into small batches. Each batch is used to compute a gradient estimate and update the model parameters.

#### Key Concepts

1. **Batch Gradient Descent:**
   - Uses the entire dataset to calculate the gradient.
   - Pros: Stable convergence.
   - Cons: Memory-intensive, and only one parameter update per full pass over the data.

2. **Stochastic Gradient Descent (SGD):**
   - Uses a single data point to calculate the gradient.
   - Pros: Faster updates.
   - Cons: Noisy convergence.

3. **Mini-batch Gradient Descent:**
   - Splits the dataset into smaller batches for gradient calculation.
   - Pros:
     - Efficient memory usage.
     - More parameter updates per epoch than batch gradient descent, which usually speeds up convergence.
     - Reduces noise compared to SGD.
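
To make the comparison concrete, here is a minimal pure-Python sketch in which the three variants differ only in `batch_size`. The function name `gradient_descent`, the toy data, the learning rate, and the epoch count are all assumptions chosen for illustration and are not part of the original notebook.

```python
import random

def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w * x + b by minimizing mean squared error (illustrative only)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # One parameter update per batch of indices.
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the squared-error gradient over the samples in this batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noise-free toy data generated from y = 3x + 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [3 * x + 1 for x in xs]

full_batch = gradient_descent(xs, ys, batch_size=len(xs))  # batch gradient descent
stochastic = gradient_descent(xs, ys, batch_size=1)        # stochastic gradient descent
mini_batch = gradient_descent(xs, ys, batch_size=4)        # mini-batch gradient descent
```

With `batch_size=len(xs)` there is one update per epoch, with `batch_size=1` there are eight, and with `batch_size=4` there are two. On this toy problem all three should end up close to `w = 3`, `b = 1`.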

#### Data-Handling Problems in Mini-batch Training

1. No standard interface for data handling (e.g., image classification datasets stored in different folders).
   - **Example:** In image classification, raw datasets may be organized differently across projects, such as images stored in class-specific directories or requiring specific preprocessing pipelines.

2. Difficulty in applying transformations (e.g., data augmentation).
   - **Example:** Applying transformations like cropping, flipping, or normalizing images may require manual implementation for every dataset.

3. Shuffling and sampling issues.
   - **Example:** Randomly shuffling large datasets to ensure generalization can be challenging without automated tools.

4. Managing batches and parallelization.
   - **Example:** Creating efficient batches while utilizing multiple cores for parallel data loading can be complex and error-prone.


#### Solution: PyTorch `Dataset` and `DataLoader`

PyTorch provides two essential classes to address these issues:

1. **`Dataset`:**
   - A standard interface for accessing and transforming data.
   - Custom datasets can be created by subclassing `torch.utils.data.Dataset`.

2. **`DataLoader`:**
   - Handles batching, shuffling, and parallelized data loading.
   - Simplifies the process of preparing data for training.

By using `Dataset` and `DataLoader`, we can:
- Apply transformations easily.
- Handle large datasets efficiently.
- Shuffle and sample data for better generalization.
- Parallelize data loading to speed up training.

The following sections describe these two classes in more detail and then walk through a practical implementation.


# Dataset and DataLoader in PyTorch

PyTorch provides two core abstractions, `Dataset` and `DataLoader`, that **decouple how you define your data** from **how you efficiently iterate over it in training loops**.

---

### Dataset Class

- The `Dataset` class acts as a blueprint (an abstract base class). When you create a custom `Dataset`, you decide how the data is loaded and returned. It defines:
  - `__init__()`: Specifies how the data is loaded.
  - `__len__()`: Returns the total number of samples in the dataset (not the batch size).
  - `__getitem__(index)`: Fetches the data (and label) for a given index.
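
As a small illustration of these three methods, the sketch below defines a dataset that also accepts an optional per-sample transform. The class name `TransformedDataset`, the `transform` argument, and the toy tensors are hypothetical and only meant to show one common way of wiring a transformation into `__getitem__`.

```python
import torch
from torch.utils.data import Dataset

class TransformedDataset(Dataset):
    """Hypothetical example: in-memory features and labels plus an optional transform."""

    def __init__(self, features, labels, transform=None):
        # Store references to the data; nothing is copied or batched here.
        self.features = features
        self.labels = labels
        self.transform = transform

    def __len__(self):
        # Total number of samples, independent of any batch size.
        return len(self.features)

    def __getitem__(self, index):
        # Fetch one sample and apply the transform only to the features.
        sample = self.features[index]
        if self.transform is not None:
            sample = self.transform(sample)
        return sample, self.labels[index]

features = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
labels = torch.tensor([0, 1, 0])

# Assumed toy transform: double every feature value.
doubled = TransformedDataset(features, labels, transform=lambda t: t * 2)
sample, label = doubled[1]
```

Here `doubled[1]` returns the second row with the transform applied, `tensor([6., 8.])`, together with its label. Because the transform runs inside `__getitem__`, it is applied lazily, one sample at a time.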

---

### DataLoader Class

- The DataLoader wraps a Dataset and:
  - Handles **batching**.
  - Manages **shuffling**.
  - Enables **parallelized data loading** for faster processing.
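
The short sketch below shows how `batch_size` and `drop_last` affect the number of batches. It uses `TensorDataset` as a convenient stand-in dataset, and the sizes (10 samples, batches of 4) are assumptions chosen so that the last batch is incomplete. Parallel loading is controlled by the `num_workers` argument, which is left at its default of `0` here.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Assumed toy data: 10 samples with one feature each, labels 0..9.
dataset = TensorDataset(
    torch.arange(10, dtype=torch.float32).unsqueeze(1),
    torch.arange(10),
)

# Default behaviour: the final, smaller batch is kept.
loader = DataLoader(dataset, batch_size=4, shuffle=True)
batch_sizes = [len(batch_labels) for _, batch_labels in loader]

# With drop_last=True the incomplete final batch is discarded.
loader_drop = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)
```

With 10 samples and `batch_size=4`, `loader` yields three batches of sizes 4, 4, and 2, while `loader_drop` yields only the two full batches.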

---

### DataLoader Control Flow

1. **Shuffling indices.** At the start of each epoch, the DataLoader (with `shuffle=True`) shuffles the sample indices.
   - Example:
     ```
     Initial indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
     Shuffled indices: [4, 5, 8, 9, 7, 2, 6, 1, 3, 0]
     ```
2. **Dividing into chunks.** It divides these indices into chunks of the specified `batch_size`.
   - Example, with `batch_size=2`:
     ```
     [[4, 5], [8, 9], [7, 2], [6, 1], [3, 0]]
     ```
3. **Fetching data samples.** For each index in a chunk, the corresponding sample is fetched from the Dataset object.
   - Example:
     - Chunk: `[4, 5]`
     - `Dataset.__getitem__(4)` retrieves the data for index 4.
     - `Dataset.__getitem__(5)` retrieves the data for index 5.
4. **Combining samples into a batch.** The fetched samples are combined into a batch by the `collate_fn` function, which determines how individual samples are grouped together.
   - Example:
     ```
     Batch: [Sample_4, Sample_5]
     ```
5. **Returning the batch.**
   - The batch is returned to the main training loop for processing.
   - This process is repeated until all chunks (batches) in the epoch are consumed.
6. **Iterating over epochs.**
   - Once all batches have been processed, the process repeats for the next epoch, starting again with shuffling the indices.
   - Example workflow for an epoch:
     ```
     Epoch Start -> Shuffle Indices -> Divide into Batches -> Fetch Samples -> Collate Samples -> Return Batch
     ```
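
The control flow above can be imitated in a few lines of plain Python. This is a simplified sketch, not how `DataLoader` is actually implemented (the real class works through sampler objects and optional worker processes). The function name `iterate_batches` and the toy dataset are assumptions for illustration.

```python
import random

def iterate_batches(dataset, batch_size, shuffle=True, collate_fn=list, rng=None):
    """Yield batches from any object that supports len() and integer indexing."""
    rng = rng or random.Random()
    # Step 1: build the indices and shuffle them if requested.
    indices = list(range(len(dataset)))
    if shuffle:
        rng.shuffle(indices)
    # Step 2: divide the indices into chunks of batch_size.
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        # Step 3: fetch each sample by index.
        samples = [dataset[i] for i in chunk]
        # Steps 4 and 5: combine the samples and hand the batch to the caller.
        yield collate_fn(samples)

# A plain list stands in for a Dataset, since it has len() and indexing.
toy_dataset = [f"Sample_{i}" for i in range(10)]
rng = random.Random(0)

# Step 6: each epoch starts a fresh iteration, so the indices are reshuffled.
for epoch in range(2):
    batches = list(iterate_batches(toy_dataset, batch_size=2, rng=rng))
    print(f"epoch {epoch}: {batches}")
```

Each epoch produces five batches of two samples, and every sample appears exactly once per epoch. Only the order changes between epochs.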



---
#### Flow of Data Management

```plaintext
Raw data (in memory)
    |
    |
    v
Dataset class (defines where the data is stored and fetches samples one by one)
    |
    |
    v
DataLoader class (manages batching and defines the number of rows per batch)
```

Below is a practical implementation of these concepts:


In [2]:
import torch
from sklearn.datasets import make_classification

In [3]:
x, y = make_classification(
    n_samples=10,      # number of samples
    n_features=2,      # number of features
    n_informative=2,   # number of informative features
    n_redundant=0,     # number of redundant features
    n_classes=2,       # number of classes
    random_state=42,   # random seed for reproducibility
)

In [4]:
x

array([[ 1.06833894, -0.97007347],
       [-1.14021544, -0.83879234],
       [-2.8953973 ,  1.97686236],
       [-0.72063436, -0.96059253],
       [-1.96287438, -0.99225135],
       [-0.9382051 , -0.54304815],
       [ 1.72725924, -1.18582677],
       [ 1.77736657,  1.51157598],
       [ 1.89969252,  0.83444483],
       [-0.58723065, -1.97171753]])

In [5]:
# Convert the NumPy arrays to PyTorch tensors
x = torch.tensor(x, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

In [6]:
x

tensor([[ 1.0683, -0.9701],
        [-1.1402, -0.8388],
        [-2.8954,  1.9769],
        [-0.7206, -0.9606],
        [-1.9629, -0.9923],
        [-0.9382, -0.5430],
        [ 1.7273, -1.1858],
        [ 1.7774,  1.5116],
        [ 1.8997,  0.8344],
        [-0.5872, -1.9717]])

In [7]:
y

tensor([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

In [9]:
from torch.utils.data import Dataset, DataLoader

In [10]:
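class CustomDataset(Dataset):
    def __init__(self, features, labels):
        # Keep references to the feature and label tensors
        self.features = features
        self.labels = labels

    def __len__(self):
        # Total number of samples (rows) in the dataset
        return self.features.shape[0]

    def __getitem__(self, idx):
        # Return one (features, label) pair for the given index
        return self.features[idx], self.labels[idx]

In [11]:
dataset = CustomDataset(x, y)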

In [12]:
len(dataset)

10

In [13]:
dataset[0]

(tensor([ 1.0683, -0.9701]), tensor(1))

In [16]:
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

In [17]:
for batch_features, batch_labels in dataloader:
    print(batch_features)
    print(batch_labels)
    print("-" * 50)

tensor([[-0.9382, -0.5430],
        [-0.5872, -1.9717]])
tensor([1, 0])
--------------------------------------------------
tensor([[ 1.7273, -1.1858],
        [-2.8954,  1.9769]])
tensor([1, 0])
--------------------------------------------------
tensor([[-1.9629, -0.9923],
        [ 1.7774,  1.5116]])
tensor([0, 1])
--------------------------------------------------
tensor([[-1.1402, -0.8388],
        [-0.7206, -0.9606]])
tensor([0, 0])
--------------------------------------------------
tensor([[ 1.0683, -0.9701],
        [ 1.8997,  0.8344]])
tensor([1, 1])
--------------------------------------------------
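
To connect the batches back to mini-batch gradient descent, the following sketch runs one parameter update per batch. It is self-contained, so it uses random synthetic data and `TensorDataset` rather than the notebook's `make_classification` output. The model (`nn.Linear`), the learning rate, the number of epochs, and the labeling rule are all assumptions for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 10 samples, 2 features, label 1 when the first feature is positive.
features = torch.randn(10, 2)
labels = (features[:, 0] > 0).long()

loader = DataLoader(TensorDataset(features, labels), batch_size=2, shuffle=True)

model = nn.Linear(2, 2)                      # 2 input features -> 2 class logits
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(20):
    total = 0.0
    for batch_features, batch_labels in loader:
        optimizer.zero_grad()                # clear gradients from the previous batch
        loss = loss_fn(model(batch_features), batch_labels)
        loss.backward()                      # gradient computed from this mini-batch only
        optimizer.step()                     # one parameter update per mini-batch
        total += loss.item() * len(batch_labels)
    # Average loss per sample over the epoch.
    epoch_losses.append(total / len(loader.dataset))
```

With 10 samples and `batch_size=2`, each epoch performs five parameter updates, and the `DataLoader` reshuffles the samples at the start of every epoch.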

