# Building Data
In this demo, we walk through one example process for building a custom dataset that we can use to train our model.

We will cover some new concepts and best practices as well as review some concepts we have learned over the last few lessons.



# Data Cleaning and Preprocessing
Data cleaning and preprocessing help make sure that only clear, useful images are used to train the model, which leads to better accuracy. These steps remove any messy or confusing data so the model can learn patterns more easily.

There is no right or wrong way to do this, but it is important to examine the data.

In [None]:
# View all images in our dataset
import glob
import matplotlib.pyplot as plt
from PIL import Image

# Get a list of all .jpg images, one folder per class
images_list = glob.glob("images/*/*.jpg")

# Open each image
for image in images_list:
    # Set the title
    plt.title(image)
    # Open the image
    img = Image.open(image)
    plt.axis("on")
    plt.imshow(img)
    plt.show()



In [None]:
# Print our image list
print(images_list)
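Viewing every image catches visual problems, but some files may not open at all. The following is a minimal sketch of an automated check, assuming Pillow is installed. `find_unreadable_images` is a hypothetical helper and the two demo files are throwaway examples, not part of the dataset above.

```python
import os
import tempfile

from PIL import Image


def find_unreadable_images(paths):
    """Return the paths that Pillow cannot open or verify."""
    unreadable = []
    for path in paths:
        try:
            with Image.open(path) as img:
                img.verify()  # Checks file integrity without decoding the pixels
        except (OSError, SyntaxError):
            unreadable.append(path)
    return unreadable


# Demo on two throwaway files: one real image and one text file named like an image
with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "good.png")
    bad_path = os.path.join(tmp_dir, "bad.jpg")
    Image.new("RGB", (8, 8)).save(good_path)
    with open(bad_path, "w") as handle:
        handle.write("not an image")
    unreadable = find_unreadable_images([good_path, bad_path])

print([os.path.basename(path) for path in unreadable])
```

Any path this returns for `images_list` is a candidate for removal before the annotations file is written.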

In [None]:
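# Remove images from our list that we don't want
# list.remove() needs the item to remove, so collect the unwanted paths first
unwanted_images = []  # Placeholder: add the paths of the images to drop
images_list = [image for image in images_list if image not in unwanted_images]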

In [39]:
# We are using an annotations file, so let's write the final list to a CSV with each image's class
import os
import pandas as pd

data = []

for file_path in images_list:
    # Extract the class label from the path ie: dog or cat
    label = os.path.basename(os.path.dirname(file_path))
    # Append path and label 
    data.append({"file_path": file_path, "label": label})

# Save DF as CSV file
df = pd.DataFrame(data)
df.to_csv("image_data.csv", index=False)

# Here we created our initial annotations file
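Before splitting, it can help to check how many images each class has, since a heavily unbalanced dataset may need more cleaning or more data. This is a small sketch using only the standard library. `count_labels` and the example paths are hypothetical and only assume the `images/<label>/<file>` layout used above.

```python
import os
from collections import Counter


def count_labels(file_paths):
    """Count images per class, taking the class from the parent folder name."""
    return Counter(os.path.basename(os.path.dirname(path)) for path in file_paths)


# Hypothetical paths that follow the images/<label>/<file>.jpg layout
example_paths = ["images/cat/a.jpg", "images/cat/b.jpg", "images/dog/c.jpg"]
label_counts = count_labels(example_paths)
print(label_counts)
```

Calling `count_labels(images_list)` on the real list gives the same kind of summary for the full dataset.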


# Create an Initial PyTorch Dataset
Once our data is cleaned up and ready, we need to create an initial Dataset that consists of all eligible images.

This is so we can split our data into Training, Validation and Testing subsets.

In [40]:
import pandas as pd
from torch.utils.data import Dataset


class InitialDataset(Dataset):
    def __init__(self, annotations_file):
        self.img_labels = pd.read_csv(annotations_file)

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = self.img_labels.iloc[idx, 0]
        label = self.img_labels.iloc[idx, 1]
        return img_path, label

In [41]:
# Create a PyTorch Dataset
dataset = InitialDataset(annotations_file='image_data.csv')

In [None]:
# Print the annotations
dataset.img_labels

# Random Split
Review:

The `random_split` function in PyTorch divides your dataset into different parts, such as training, validation and testing sets, by randomly selecting samples for each part.

This is important because splitting data lets you train the model on one part and test it on another, helping you see how well the model performs on new, unseen data. 

Randomly splitting the data helps give each set a representative mix of samples, making the model’s evaluation more reliable.

Data Splits:

**Training data** is the largest portion, and it’s what the model learns from by finding patterns in the data. 

**Validation data** is used during training to tune the model’s settings, helping prevent overfitting so the model doesn’t just memorize the training data. 

**Testing data** is used after training to check how well the model performs on completely new data. This setup ensures the model can make accurate predictions on data it hasn’t seen before, making it more useful and reliable.

In [43]:
# Import random_split from PyTorch's data utilities
from torch.utils.data import random_split

In [44]:
# Define the size of the training data: 70% of the full dataset
train_size = int(0.7 * len(dataset))

In [45]:
# Define the size of the validation data: 15% of the full dataset
val_size = int(0.15 * len(dataset))

In [46]:
# Finally, define the rest as test data (roughly 15%)
test_size = len(dataset) - train_size - val_size
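The test size is computed as the remainder because `int()` truncates, so the three sizes would not always add up to the dataset length otherwise. The worked example below uses a made-up dataset length of 101 to show this. `split_sizes` is a hypothetical helper that repeats the arithmetic above.

```python
def split_sizes(n_samples, train_frac=0.7, val_frac=0.15):
    """Return (train, val, test) sizes that always sum to n_samples."""
    train_size = int(train_frac * n_samples)  # Truncates toward zero
    val_size = int(val_frac * n_samples)      # Truncates toward zero
    test_size = n_samples - train_size - val_size  # Absorbs the leftovers
    return train_size, val_size, test_size


sizes = split_sizes(101)
print(sizes)  # (70, 15, 16)
```

Here the test set gets 16 samples rather than 15, and the total still matches the 101 samples that `random_split` expects.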

In [47]:
# Create a training, validation and testing dataset by splitting the full dataset by size
# Here we use random_split
train_dataset, val_dataset, test_dataset = random_split(dataset, [train_size, val_size, test_size])
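As written, the split changes every time the cell runs. One way to make it repeatable is the `generator` argument of `random_split`. The sketch below assumes a seed of 42 and a toy list in place of the real dataset. `seeded_split` is a hypothetical wrapper.

```python
import torch
from torch.utils.data import random_split


def seeded_split(dataset, lengths, seed=42):
    """Split a dataset with a fixed seed so the result is repeatable."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, lengths, generator=generator)


# Toy stand-in for the real dataset: 20 samples split 14 / 3 / 3
toy_dataset = list(range(20))
first_split = seeded_split(toy_dataset, [14, 3, 3])
second_split = seeded_split(toy_dataset, [14, 3, 3])

# Both runs select exactly the same training indices
print(first_split[0].indices == second_split[0].indices)
```

A fixed seed means the same rows land in each subset on every run, as long as the dataset itself has not changed.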

In [None]:
# Take a look at the outputs we get from the subsets
print(train_dataset.indices, val_dataset.indices, test_dataset.indices)

The output of PyTorch's `random_split` is a list of `Subset` objects, each representing a portion of the original dataset. Each `Subset` stores the positions of its samples in its `indices` attribute.

We can use these lists of indices to retrieve samples from our original dataset.
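To make the relationship concrete, here is a small sketch with a toy list standing in for the dataset. The letters and the index list are made up for illustration.

```python
from torch.utils.data import Subset

# Toy stand-in for the full dataset
toy_dataset = ["a", "b", "c", "d", "e"]

# A Subset only stores a reference to the dataset and a list of positions
subset = Subset(toy_dataset, [3, 0, 4])

# Item 0 of the subset is the item at position 3 of the original dataset
print(subset[0])                       # d
print(toy_dataset[subset.indices[0]])  # d
```

Indexing the subset and indexing the original dataset through `indices` return the same sample.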

In [None]:
# Compare the original dataset to our index
# Print the annotations
dataset.img_labels

In [None]:
# Print the item in the dataset at the first index of the train_dataset
dataset.img_labels.loc[train_dataset.indices[0]]
# We can do the same for other indexes and for validation datasets and testing datasets

# Data Versioning and Tracking
As we covered in the video, there are multiple ways to version your data, and we won't cover any particular method in this course.

However, the reason we are covering this is because it is a best practice.

Versioning and tracking make your work more reliable and allow you to reproduce results consistently, even if the data changes over time.
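As one lightweight illustration (not a method from this course), you could record a fingerprint of each annotation file and compare it later to see whether the file has changed. The sketch below uses only the standard library. `file_sha256` is a hypothetical helper and the demo file is a throwaway.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway CSV with the same header as our annotation files
demo_bytes = b"file_path,label\nimages/cat/a.jpg,cat\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "wb") as handle:
        handle.write(demo_bytes)
    fingerprint = file_sha256(demo_path)

print(fingerprint)
```

If the fingerprint stored alongside a trained model no longer matches the file, the data has changed since that training run.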

In [51]:
# Let's write annotation files for each of our subsets of data.
# This method can be used for other forms of model training outside of images such as text, audio, etc
import pandas as pd

data = []

# For each index in the training indices 
for idx in train_dataset.indices:
    # Extract the file_path and the label from the original dataset 
    img_path = dataset.img_labels['file_path'].loc[idx]
    label = dataset.img_labels['label'].loc[idx]
    # Append path and label 
    data.append({"file_path": img_path, "label": label})

# Save DF as CSV file
df = pd.DataFrame(data)
df.to_csv("training_data.csv", index=False)
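The row-by-row loop can also be expressed as a single positional selection with pandas `iloc`. This is a sketch on a toy DataFrame. `select_rows`, the example rows and the example indices are hypothetical stand-ins for `dataset.img_labels` and `train_dataset.indices`.

```python
import pandas as pd


def select_rows(annotations, indices):
    """Return the rows at the given positions as a new DataFrame."""
    return annotations.iloc[list(indices)].reset_index(drop=True)


# Toy stand-ins for dataset.img_labels and train_dataset.indices
annotations = pd.DataFrame({
    "file_path": ["images/cat/a.jpg", "images/dog/b.jpg", "images/cat/c.jpg"],
    "label": ["cat", "dog", "cat"],
})
subset_df = select_rows(annotations, [2, 0])
print(subset_df)
```

Calling `subset_df.to_csv(..., index=False)` on the result would write the same kind of annotation file as the loop.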

In [52]:
# Do the same thing for our validation and testing sets
import pandas as pd

data = []

# For each index in the validation indices 
for idx in val_dataset.indices:
    # Extract the file_path and the label from the original dataset 
    img_path = dataset.img_labels['file_path'].loc[idx]
    label = dataset.img_labels['label'].loc[idx]
    # Append path and label 
    data.append({"file_path": img_path, "label": label})

# Save DF as CSV file
df = pd.DataFrame(data)
df.to_csv("validation_data.csv", index=False)

# Reset the list so the test file does not also contain the validation rows
data = []

# For each index in the test indices 
for idx in test_dataset.indices:
    # Extract the file_path and the label from the original dataset 
    img_path = dataset.img_labels['file_path'].loc[idx]
    label = dataset.img_labels['label'].loc[idx]
    # Append path and label 
    data.append({"file_path": img_path, "label": label})

# Save DF as CSV file
df = pd.DataFrame(data)
df.to_csv("testing_data.csv", index=False)
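Each annotation file should hold a disjoint set of samples, because a sample shared between training and evaluation data would inflate the reported accuracy. A quick sanity check could look like the sketch below. `find_overlap` is a hypothetical helper and the index lists are made up.

```python
def find_overlap(*index_lists):
    """Return the indices that appear in more than one of the given lists."""
    seen = set()
    overlap = set()
    for indices in index_lists:
        current = set(indices)
        overlap |= seen & current  # Anything already seen is shared
        seen |= current
    return overlap


# Made-up index lists: the first call is clean, the second shares index 1
clean_overlap = find_overlap([0, 1], [2, 3], [4])
leaky_overlap = find_overlap([0, 1], [1, 2])
print(clean_overlap, leaky_overlap)
```

With the real subsets, the three `indices` lists from `random_split` can be passed in directly, and an empty set means the splits do not overlap.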

# Define Transformations
Let's go ahead and define transformations for our subsets of data.

Remember that training data can have different transforms than validation data. Random transforms such as cropping and flipping present a more diverse sample to the model during training, while validation uses fixed transforms so that its results are comparable from run to run.

In [53]:
# Begin by importing transforms
from torchvision.transforms import v2

In [54]:
# Training Pipeline
import torch

train_transform = v2.Compose([
    v2.Resize((128, 128)), # Resize the image
    v2.RandomCrop(size=(75, 75)), # Random Crop
    v2.RandomHorizontalFlip(p=.7), # Randomly flip horizontally
    # Convert to tensor
    v2.ToImage(), 
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) # Normalize
])

In [55]:
# Validation Pipeline
val_transform = v2.Compose([
    v2.Resize((128, 128)), # Resize to a fixed size
    # Convert to tensor
    v2.ToImage(), 
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) # Normalize
])

# Define Datasets and DataLoaders
Now that we have subsets of our data and individual transformations, we can create PyTorch Datasets from our subsets of data and DataLoaders for those subsets to load them into our model.

In [56]:
# Let's begin by defining our custom Dataset
import os

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class CustomImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform, target_transform):
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = lambda y: target_transform[y]

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        # Convert to RGB so every image has the 3 channels that Normalize expects
        image = Image.open(img_path).convert("RGB")
        label = self.img_labels.iloc[idx, 1]
        # Transform the image
        image = self.transform(image)
        # Get the label
        label = self.target_transform(label)
        
        return image, label

In [57]:
# Create the label encoding
label_encoding = {"cat": 0, "dog": 1}
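With more than a couple of classes, the mapping can be derived from the labels instead of typed by hand. This is a minimal sketch. `build_label_encoding` is a hypothetical helper, and sorting the labels is an assumption made so that the same labels always map to the same integers.

```python
def build_label_encoding(labels):
    """Map each distinct label to an integer, in alphabetical order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


# Made-up label column; in the notebook this could come from the annotations file
example_labels = ["dog", "cat", "dog", "cat", "cat"]
derived_encoding = build_label_encoding(example_labels)
print(derived_encoding)  # {'cat': 0, 'dog': 1}
```

For the two classes here, the result matches the hand-written `label_encoding`.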

In [58]:
# Create a training dataset 
train_dataset = CustomImageDataset(
    annotations_file='training_data.csv', 
    img_dir="./", 
    transform=train_transform, 
    target_transform=label_encoding
)

In [None]:
# Display the training data
train_dataset.img_labels

In [None]:

# Label encoding
train_dataset.target_transform('dog')

In [None]:
# Transformations
train_dataset.transform

In [62]:
# Create the validation dataset
val_dataset = CustomImageDataset(
    annotations_file='validation_data.csv', 
    img_dir="./", 
    transform=val_transform, 
    target_transform=label_encoding
)

In [None]:
# Show the transforms
val_dataset.transform

In [64]:
# Create the DataLoaders for each PyTorch Dataset
# Import DataLoader
from torch.utils.data import DataLoader


In [65]:
# Create the training DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

In [None]:
# Fetch one batch and print its shape: (batch size, channels, height, width)
features, labels = next(iter(train_loader))
print(f"Features shape: {features.size()}")
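The first dimension of the shape is the batch size, but the last batch can be smaller when the dataset length is not a multiple of `batch_size`. The sketch below shows this with a made-up dataset of 70 zero tensors in place of real images.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up dataset: 70 blank "images" with matching labels
toy_features = torch.zeros(70, 3, 8, 8)
toy_labels = torch.zeros(70, dtype=torch.long)
toy_loader = DataLoader(TensorDataset(toy_features, toy_labels), batch_size=32, shuffle=False)

# Collect the size of every batch in one pass over the loader
batch_sizes = [batch_features.shape[0] for batch_features, _ in toy_loader]
print(batch_sizes)  # [32, 32, 6]
```

`DataLoader` also accepts `drop_last=True` if a smaller final batch is not wanted.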

In [67]:
# Create the validation DataLoader. Notice shuffle=False: validation order does not need to change between epochs
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

In [None]:
# Fetch one validation batch and print its shape
features, labels = next(iter(val_loader))
print(f"Features shape: {features.size()}")

#### You are now ready to begin training a model!