# Preparing the Fashion MNIST Dataset for Modeling

Fashion MNIST provides 60,000 training images (plus 10,000 test images) of clothing items, and our goal is to classify them into ten classes using a computer vision model. A key design decision is whether the model should use only linear layers or also incorporate non-linear activation functions. This choice can significantly affect how accurately the model classifies the diverse items of clothing in the dataset.

## Data Preparation Steps

Before diving into model building, it's essential to prepare our dataset for efficient processing:

1. **Understanding Data Format**: Our data starts out as a PyTorch `Dataset`. This object gives access to every sample in the dataset, but only one sample at a time, whereas model training works on mini-batches.

2. **Conversion to DataLoader**: To make the dataset ready for training, we wrap it in a `DataLoader`. The `DataLoader` is a Python iterable that groups samples into batches, which is a more efficient format for model training.

3. **Why Batch Data?**:
    - **Computational Efficiency**: Handling the entire dataset at once can be demanding on memory and computational resources. By breaking the dataset into smaller batches, we can make the training process more manageable and efficient.
    - **Frequent Gradient Updates**: Batching lets the model compute gradients and update its weights more often, which usually speeds up learning. Instead of updating the weights once per epoch (as would happen if the entire dataset were processed as a single batch), the model updates them once per batch, so a batch size of 32 means one update for every 32 images.
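
To make the update-frequency point concrete, here is a minimal sketch that counts weight updates per epoch for a few batch sizes. It assumes one optimizer step per batch and that a final partial batch is kept; `updates_per_epoch` is a hypothetical helper written for this illustration, not part of PyTorch.

```python
import math


def updates_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches (and so weight updates) in one pass over the data.

    Assumes one update per batch and that a final partial batch is kept.
    """
    return math.ceil(num_samples / batch_size)


num_train = 60_000  # size of the Fashion MNIST training split

# Compare full-batch training with two mini-batch sizes.
for bs in (num_train, 256, 32):
    print(f"batch_size={bs:>6}: {updates_per_epoch(num_train, bs)} update(s) per epoch")
```

With the whole training split as one batch there is a single update per epoch, while a batch size of 32 gives 1,875 updates over the same data.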

## Practical Steps for Batching

- **Batch Size**: A common starting point for batch size is 32, but this can vary based on the specific requirements of the dataset and the computational resources available.
- **Shuffling**: It's good practice to shuffle the training data before batching, so that the model does not learn unintended patterns from the order of the samples. The test data does not need shuffling, because the order of evaluation does not affect the results.
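
As a rough illustration of what shuffling and batching mean, the sketch below splits a toy list of ten integers into batches of four, with and without shuffling. The `iter_batches` helper and its `seed` parameter are hypothetical and written in plain Python; this is not how `DataLoader` is implemented, only a conceptual stand-in.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(
    data: Sequence[T], batch_size: int, shuffle: bool, seed: int = 0
) -> Iterator[List[T]]:
    """Yield consecutive batches of `data`, optionally in shuffled order."""
    indices = list(range(len(data)))
    if shuffle:
        # A fixed seed keeps this toy example reproducible.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


samples = list(range(10))  # stand-in for ten dataset samples

ordered = list(iter_batches(samples, batch_size=4, shuffle=False))
shuffled = list(iter_batches(samples, batch_size=4, shuffle=True))

print("ordered: ", ordered)
print("shuffled:", shuffled)
```

Both runs cover every sample exactly once and end with a smaller final batch of two items; only the order differs. Unlike this fixed-seed sketch, a `DataLoader` created with `shuffle=True` draws a new order each epoch.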

## Implementing Data Batching

Below, we use PyTorch's `DataLoader` from `torch.utils.data` to batch our dataset. This involves specifying the dataset, the batch size, and whether to shuffle the data:

```python
# Import necessary libraries
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Set transformation to convert images to PyTorch tensors
transform = transforms.Compose([transforms.ToTensor()])

# Load the Fashion MNIST dataset
train_data = datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
test_data = datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

# Create DataLoaders for both training and testing datasets
batch_size = 32  # Common batch size for training

train_data_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_data_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)

# Print out some information to verify DataLoader functionality
print(f"Number of batches in train_data_loader: {len(train_data_loader)}")
print(f"Number of batches in test_data_loader: {len(test_data_loader)}")

# Example of iterating over the train_data_loader
for images, labels in train_data_loader:
    print(f"Batch images shape: {images.shape}")
    print(f"Batch labels shape: {labels.shape}")
    break  # Break after the first batch to keep the output concise
```

Running the code prints:

```
Number of batches in train_data_loader: 1875
Number of batches in test_data_loader: 313
Batch images shape: torch.Size([32, 1, 28, 28])
Batch labels shape: torch.Size([32])
```
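
The batch counts follow from the split sizes: 60,000 divides evenly by 32, while 10,000 does not, so the last test batch is smaller than the rest. The sketch below works out the size of that final batch, assuming partial batches are kept; `last_batch_size` is a hypothetical helper for illustration only.

```python
def last_batch_size(num_samples: int, batch_size: int) -> int:
    """Size of the final batch when a partial last batch is kept."""
    remainder = num_samples % batch_size
    # A remainder of zero means the data divides evenly into full batches.
    return remainder if remainder else batch_size


for name, num_samples in (("train", 60_000), ("test", 10_000)):
    print(f"{name}: last batch holds {last_batch_size(num_samples, 32)} samples")
```

The final test batch therefore holds 16 images rather than 32. If every batch must have the same size, `DataLoader` accepts `drop_last=True`, which discards the incomplete batch.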
