<p style="color:#153462; 
          font-weight: bold; 
          font-size: 30px; 
          font-family: Gill Sans, sans-serif; 
          text-align: center;">
          Simple Dataset in PyTorch</p>

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       <p>
            In PyTorch, a dataset is a fundamental concept used for handling and organizing
            data for machine learning and deep learning tasks. A dataset typically represents
            a collection of data samples, where each sample is an input-output pair used for training,
            validation, or testing. Creating datasets in PyTorch serves several important purposes:
       </p>
       <ul>
            <li>
                <strong>Data Organization:</strong> Datasets help you organize your data in a structured manner. 
                They provide a convenient way to store and manage data samples, making it easier to load and
                process data during training.
            </li>
            <li>
                <strong>Data Loading:</strong> PyTorch's DataLoader class works seamlessly with datasets to
                load data in batches. This is essential for efficient training, especially when working with
                large datasets that do not fit entirely in memory.
            </li>
            <li>
                <strong>Data Augmentation:</strong> Datasets can be augmented by applying transformations
                to the data samples. For example, you can apply random rotations, translations, or other
                transformations to image data to increase the diversity of training examples.
            </li>
            <li>
                <strong>Customization:</strong> You can create custom datasets tailored to your specific
                machine learning or deep learning task. This is useful when working with non-standard data
                formats or data sources.
            </li>
            <li>
                <strong>Compatibility:</strong> PyTorch datasets are compatible with various data sources,
                including images, text, time series, and more. You can easily adapt existing data sources 
                to work with PyTorch datasets.
            </li>
            <li>
                <strong>Data Splitting:</strong> Datasets can be split into training, validation, and test sets,
                ensuring that your machine learning model is evaluated properly on independent data. This is crucial 
                for estimating model generalization.
            </li>
      </ul>
   </font>
</p>
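
The data-splitting point above can be illustrated with `torch.utils.data.random_split`. This is a minimal sketch, not a recommended split recipe: the toy tensors, the 6/2/2 split sizes, and the seed are assumptions chosen only for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy data (assumed for illustration): 10 samples with 2 features each
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10)
full_dataset = TensorDataset(features, labels)

# A seeded generator makes the random split reproducible
generator = torch.Generator().manual_seed(0)
train_set, val_set, test_set = random_split(
    full_dataset, [6, 2, 2], generator=generator
)

print(len(train_set), len(val_set), len(test_set))  # 6 2 2
```

Each returned subset is itself a dataset, so it can be indexed or passed to a `DataLoader` like the original.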

### Importing Required Modules

In [1]:
import torch
from torch.utils.data import Dataset

### Creating Simple Dataset Class

In [2]:
class SimpleDataset(Dataset):
    def __init__(self, length=100, transform=None):
        # Build `length` samples so the tensors agree with __len__
        self.x = 2 * torch.ones(length, 2)
        self.y = torch.ones(length)
        self.len = length
        self.transform = transform

    def __getitem__(self, index):
        """
        Return the (x, y) sample at the given index, applying the
        transform if one was provided. Python calls __getitem__
        whenever an instance is indexed, e.g. data[0].
        """
        sample = self.x[index], self.y[index]
        if self.transform:
            sample = self.transform(sample)
        return sample
    
    def __len__(self):
        """
        Return the number of samples in the dataset. Python calls
        __len__ whenever len(instance) is evaluated.
        """
        return self.len

In [3]:
# Creating an object
data = SimpleDataset()

In [4]:
len(data)

100

In [5]:
# Accessing dataset elements
data[0]

(tensor([2., 2.]), tensor(1.))

In [6]:
for i in range(3):
    x, y = data[i]
    print(i, f"x:{x}", f"y:{y}")

0 x:tensor([2., 2.]) y:1.0
1 x:tensor([2., 2.]) y:1.0
2 x:tensor([2., 2.]) y:1.0


### Performing Transformations

In [7]:
# Creating a transform class
class AddMulti(object):
    def __init__(self, addx=1, multy=2):
        self.addx = addx
        self.multy = multy
    
    def __call__(self, sample):
        """
        Add `addx` to x and multiply y by `multy`. Python invokes
        __call__ when the instance itself is called like a function,
        e.g. tran_obj(sample), so no separate method name is needed.
        """
        x = sample[0]
        y = sample[1]
        x_ = x + self.addx
        y_ = y * self.multy
        sample = x_, y_
        return sample

In [8]:
tran_obj = AddMulti()

In [9]:
# Untransformed (x, y) sample from the dataset created above
data[0]

(tensor([2., 2.]), tensor(1.))

In [10]:
# Example-1
tran_obj(data[0])

(tensor([3., 3.]), tensor(2.))

In [11]:
# Example-2
tran_obj([4, 5])

(5, 10)

In [12]:
# Passing the transform object to the SimpleDataset class
dataset_ = SimpleDataset(transform=tran_obj)

In [13]:
dataset_[0]

(tensor([3., 3.]), tensor(2.))
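
Because the dataset only ever calls `self.transform(sample)`, any callable can serve as a transform, including a plain function. The sketch below assumes a hypothetical stand-in class `PairDataset` (so the block runs on its own) and a hypothetical function `negate_target`; neither is part of the notebook above.

```python
import torch
from torch.utils.data import Dataset


class PairDataset(Dataset):
    """Hypothetical stand-in for SimpleDataset: constant (x, y) pairs."""

    def __init__(self, length=4, transform=None):
        self.x = 2 * torch.ones(length, 2)
        self.y = torch.ones(length)
        self.transform = transform

    def __getitem__(self, index):
        sample = self.x[index], self.y[index]
        # Any callable is accepted here, not only class instances
        return self.transform(sample) if self.transform else sample

    def __len__(self):
        return len(self.x)


def negate_target(sample):
    """Plain-function transform: flip the sign of y, leave x unchanged."""
    x, y = sample
    return x, -y


negated = PairDataset(transform=negate_target)
x0, y0 = negated[0]
print(x0, y0)  # tensor([2., 2.]) tensor(-1.)
```

A class with `__call__` is still the better fit when the transform needs configurable state, such as `addx` and `multy` in `AddMulti`.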

### Transform Compose

`Compose` chains several transforms so that they run in series: the output of each transform becomes the input of the next.

In [14]:
from torchvision.transforms import Compose

In [15]:
# Creating a second transform class, similar to AddMulti above
class Multi(object):
    def __init__(self, mult=100):
        self.mult = mult
    
    def __call__(self, sample):
        """
        Multiply both x and y by `mult`. Python invokes __call__
        when the instance itself is called like a function.
        """
        x = sample[0]
        y = sample[1]
        x_ = x * self.mult
        y_ = y * self.mult
        sample = x_, y_
        return sample

In [16]:
compose_obj = Compose([AddMulti(), Multi()])

In [17]:
# Untransformed (x, y) sample from the dataset created above
data[0]

(tensor([2., 2.]), tensor(1.))

In [18]:
# output of AddMulti(): (tensor([3., 3.]), tensor(2.))
# output of Multi(): (tensor([3., 3.])*100, tensor(2.)*100) => (tensor([300., 300.]), tensor(200.))
compose_obj(data[0])

(tensor([300., 300.]), tensor(200.))
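
To make the chaining behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not torchvision's implementation: `chain`, `add_mul`, and `scale` are hypothetical names, and plain numbers stand in for tensors.

```python
from functools import reduce


def chain(transforms):
    """Return a callable that applies each transform in order (hypothetical helper)."""
    def apply_all(sample):
        # Feed the output of one transform into the next
        return reduce(lambda acc, transform: transform(acc), transforms, sample)
    return apply_all


def add_mul(sample, addx=1, multy=2):
    """Mirror of AddMulti: add to x, multiply y."""
    x, y = sample
    return x + addx, y * multy


def scale(sample, mult=100):
    """Mirror of Multi: multiply both x and y."""
    x, y = sample
    return x * mult, y * mult


forward = chain([add_mul, scale])((2, 1))
backward = chain([scale, add_mul])((2, 1))
print(forward)   # (300, 200)
print(backward)  # (201, 200)
```

The two results differ in x, which shows that the order of the transforms in the list matters.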

### TensorDataset

`TensorDataset` is a dataset that wraps one or more tensors.
Each sample is retrieved by indexing every tensor along its first dimension. <br>
<b>Parameters</b>:<br>
<code>*tensors</code> (Tensor): tensors whose first dimension has the same size.

In [19]:
from torch.utils.data import TensorDataset

In [20]:
input_feat = torch.tensor([[1, 2, 3], [5, 4, 1], [5, 4, 2]])
input_feat

tensor([[1, 2, 3],
        [5, 4, 1],
        [5, 4, 2]])

In [21]:
target_labels = torch.tensor([0, 1, 2])
target_labels

tensor([0, 1, 2])

In [22]:
dataset = TensorDataset(input_feat, target_labels)
dataset

<torch.utils.data.dataset.TensorDataset at 0x21f35751d90>

In [23]:
dataset[0]

(tensor([1, 2, 3]), tensor(0))
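
The requirement that all tensors share the same first-dimension size can be checked directly. This is a small sketch with made-up tensors; it assumes the mismatch surfaces as an `AssertionError`, which is how the PyTorch versions I am aware of report it.

```python
import torch
from torch.utils.data import TensorDataset

features = torch.zeros(3, 4)         # 3 samples, 4 features each
labels = torch.tensor([0, 1, 2])     # 3 labels: sizes match

ds = TensorDataset(features, labels)
print(len(ds))  # 3

# Only 2 labels for 3 samples: construction is expected to fail
try:
    TensorDataset(features, torch.tensor([0, 1]))
    mismatch_rejected = False
except AssertionError:
    mismatch_rejected = True

print(mismatch_rejected)
```

The dataset length equals the shared first-dimension size, which is why the tensors must agree on it.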

### DataLoader

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       The DataLoader class in PyTorch loads data from a dataset, such as a TensorDataset, during the training or testing of a machine learning model. It groups samples into batches and can optionally shuffle them. <br>
       <b>Parameters:</b><br>
       DataLoader has many parameters; the most important ones are:<br>
       <i>dataset</i>: The dataset from which samples are loaded.<br>
       <i>batch_size</i>: The number of samples in each batch.<br>
       <i>shuffle</i>: If shuffle=True, the data is reshuffled at every epoch before the batches are formed.
    </font>
</p>

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       By using DataLoader, you can easily load and iterate over the data in batches, making it convenient to feed the data to your machine learning model for training or testing purposes.
   </font>
</p>

In [24]:
from torch.utils.data import TensorDataset, DataLoader

# Build a small TensorDataset called "dataset"
dataset = TensorDataset(
    torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),  # input features
    torch.tensor([0, 1, 0])  # target labels
)

# Create a DataLoader from the dataset; with shuffle=True the batch contents vary between runs
batch_size = 2
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Iterate over the data in batches
for inputs, targets in dataloader:
    print("Batch inputs:", inputs)
    print("Batch targets:", targets)
    print()

Batch inputs: tensor([[1, 2, 3],
        [7, 8, 9]])
Batch targets: tensor([0, 0])

Batch inputs: tensor([[4, 5, 6]])
Batch targets: tensor([1])
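
The last batch above is smaller because 3 samples do not divide evenly by a batch size of 2. The sketch below uses assumed toy data (5 samples) to show how the `drop_last` parameter from the signature changes the number of batches; `shuffle=False` keeps the result deterministic.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy data (assumed for illustration): 5 samples with 2 features each
dataset = TensorDataset(torch.arange(10).reshape(5, 2), torch.arange(5))

# Default behaviour keeps the final, smaller batch
keep_last = DataLoader(dataset, batch_size=2, shuffle=False)
# drop_last=True discards the incomplete final batch
drop_last = DataLoader(dataset, batch_size=2, shuffle=False, drop_last=True)

print(len(keep_last), len(drop_last))  # 3 2

batch_sizes = [len(targets) for _, targets in keep_last]
print(batch_sizes)  # [2, 2, 1]
```

Dropping the last batch is useful when a model or loss expects every batch to have the same size.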

