# Dataloaders in PyTorch

Dataloaders in PyTorch are used to efficiently handle data during model training by breaking the dataset into smaller chunks, called *batches*, which helps improve both memory usage and training speed. Instead of loading the entire dataset at once, which can overwhelm memory, dataloaders load a batch at a time, making it easier to work with large datasets. They also shuffle the data automatically (if needed), preventing the model from learning any accidental patterns in the data order. Additionally, dataloaders streamline the training loop by automating tasks like batching, shuffling, and parallel data loading, making your code simpler and faster without requiring you to write custom logic for these tasks.


In [225]:
import torch 
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load in and split data

In [226]:
digits = load_digits()
X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create the Custom Dataset Class
The `CustomDataset` class in PyTorch is a custom dataset handler that simplifies data processing for machine learning tasks. It initializes with feature data and target labels, converting them into PyTorch tensors with appropriate data types (`float32` for features and `long` for labels). The `__len__` method returns the number of samples in the dataset, which allows PyTorch's DataLoader to understand the dataset size. The `__getitem__` method retrieves individual samples based on an index, returning a dictionary containing both the feature data and corresponding label. This class facilitates easy integration with PyTorch’s DataLoader for efficient data loading and batching during model training and evaluation.


In [227]:
class CustomDataset(Dataset):
    def __init__(self,data,target): #essentially x and y
        self.data = torch.tensor(data,dtype=torch.float32)
        self.target = torch.tensor(target,dtype=torch.long)

    def __len__ (self):
        return len(self.data) #return the length of data
    
    def __getitem__(self, index):
        sample = {'data': self.data[index], 'target': self.target[index]} #create a dictionary
        return sample

# Create instances of class

In [228]:
train_dataset = CustomDataset(X_train, y_train)
test_dataset = CustomDataset(X_test, y_test)

# Dataloaders

In [229]:
train_dataloader = DataLoader(dataset=train_dataset, batch_size = 32, shuffle=True)
test_dataloader = DataLoader(dataset=train_dataset, batch_size = 32, shuffle=False)

# Create Neural Network class and inherit from nn.Module

In [230]:
class SimpleNN(nn.Module):
    def __init__(self, input_size,hidden_size, output_size) :
        super(SimpleNN, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x
    

In [231]:
input_size = X_train.shape[1]
hidden_size = 64
output_size = len(set(y_train))


# Create optimiser, criterion and instance of model.

In [232]:
model = SimpleNN(input_size, hidden_size, output_size)
criterion = nn.CrossEntropyLoss()
optimiser = optim.Adam(model.parameters(), lr = 0.001)

# Create Training loop

In [233]:
num_epochs = 10

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0

    for batch in train_dataloader:
        inputs = batch['data'] #same as X
        targets = batch['target'] #similar with y

        optimiser.zero_grad()
        outputs = model(inputs)

        loss = criterion(outputs, targets)
        loss.backward()
        optimiser.step()

        running_loss += loss.item()


   # print(f'epoch: {epoch+1}/{num_epochs}, loss: {running_loss/len(train_dataloader)}')



# Evaluation

In [234]:
model.eval()
all_predictions =[]
all_targets = []

with torch.no_grad():
    for batch in test_dataloader:
        inputs = batch['data']
        targets = batch['target']

        outputs = model(inputs)
        predictions = torch.argmax(outputs,dim = 1)

        all_predictions.extend(predictions.cpu().numpy())
        all_targets.extend(targets.cpu().numpy())

        accuracy = accuracy_score(all_targets, all_predictions)

print(f'accuracy test: {accuracy*100: .3f}')

accuracy test:  98.817
