# Modern Problem Requires Modern Solution

If you're working on a classification problem with your own dataset, or with a dataset that is only available in its native format (jpg, bmp, etc.), and PyTorch is your main weapon, you'll most likely feel that **DatasetFolder** and **ImageFolder** are not good enough. Neither is vanilla **torch.utils.data.Dataset**. This library attempts to bridge that gap and effectively Extract, Transform, and Load your data by extending **torch.utils.data.Dataset**.

---------------------------------------------------------------------------------------------------------

The first step in training a classifier is to prepare the dataset. By default, PyTorch (through torchvision) provides an abstraction for this process in **DatasetFolder** and **ImageFolder**. However, they require us to arrange our folders this way:
##### root/class_x/xxx.ext
##### root/class_x/xxy.ext
##### root/class_x/xxz.ext
##### root/class_y/123.ext
##### root/class_y/nsdf3.ext
##### root/class_y/asd932_.ext

Most of the time, we don't have that, especially if we're using our own dataset (might be from scraping or a gift from someone). So we have to do something.
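To make the mismatch concrete, here is a minimal sketch of the folder-name convention those loaders rely on: every immediate subdirectory of the root is treated as one class. The helper `infer_classes` is a hypothetical name written for this illustration with the standard library only; it is not torchvision code.

```python
import tempfile
from pathlib import Path


def infer_classes(root):
    """Mimic the folder-name convention: each immediate subdirectory is a class."""
    classes = sorted(entry.name for entry in Path(root).iterdir() if entry.is_dir())
    return {name: index for index, name in enumerate(classes)}


with tempfile.TemporaryDirectory() as tmp:
    # Layout the loaders expect: root/<class>/<file>
    flat = Path(tmp) / "flat"
    for name in ("attack", "real"):
        (flat / name).mkdir(parents=True)
    flat_classes = infer_classes(flat)
    print(flat_classes)    # {'attack': 0, 'real': 1}

    # Nested layout like the one used later in this tutorial: root/ori/<class>/...
    nested = Path(tmp) / "nested"
    for name in ("attack", "real"):
        (nested / "ori" / name / "hand").mkdir(parents=True)
    nested_classes = infer_classes(nested)
    print(nested_classes)  # {'ori': 0}
```

With the nested layout, the only class found is the intermediate folder, so the real labels are lost unless the files are rearranged.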

The first, naive way is to move the folders manually or with a script (Python, Bash, or anything else you're comfortable with).

It's kind of cool until we
    1. Have terabytes of dataset    
    2. Want to partition them into train, validation, and/or test
    3. Want to explore/debug our dataset  
    4. Want others to reproduce our results    
    
The second option is to make your own custom Dataset, subclassed from **torch.utils.data.Dataset**. This approach is much simpler and cleaner than the naive way, but we would still have problems 2, 3, and 4
    2. Want to partition them into train, validation, and/or test
    3. Want to explore/debug our dataset
    4. Want others to reproduce our results
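As a rough illustration of this second option, the sketch below defines a custom map-style dataset that walks a nested directory and derives each label from a folder name in the path. The class name `NestedFolderDataset` and its labeling rule are assumptions made for this example, and the import falls back to `object` only so that the sketch also runs where PyTorch is not installed.

```python
import tempfile
from pathlib import Path

try:
    from torch.utils.data import Dataset
except ImportError:
    # Fallback so the sketch still runs where PyTorch is not installed.
    Dataset = object


class NestedFolderDataset(Dataset):
    """Hypothetical map-style dataset that labels files by a folder name in their path."""

    def __init__(self, parent_directory, extension, labels):
        parent_directory = Path(parent_directory)
        self.samples = []
        for path in sorted(parent_directory.rglob(f"*.{extension}")):
            folders = path.relative_to(parent_directory).parent.parts
            for encoded, name in enumerate(labels):
                if name in folders:
                    self.samples.append((path, encoded))
                    break

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # A real dataset would decode the image here; the sketch returns the path.
        return self.samples[idx]


with tempfile.TemporaryDirectory() as tmp:
    for label in ("attack", "real"):
        folder = Path(tmp) / "ori" / label / "hand"
        folder.mkdir(parents=True)
        (folder / "0.jpg").touch()
    dataset = NestedFolderDataset(tmp, "jpg", ["attack", "real"])
    print(len(dataset), dataset[0][1], dataset[1][1])
```

Nothing in this class records which samples belong to train, validation, or test, and there is no file to open when something looks wrong, which is why problems 2 to 4 remain.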
    
These problems led to the design of ETL, a library that does Extract, Transform, and Load on the fly. In a nutshell, Extract reads all the images under a parent directory, partitions them into train, validation, and test sets, and stores each set in a csv file with two columns: image path and encoded label. After that, TransformAndLoad ingests those samples efficiently and feeds them to your classifier.
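The Extract step described above can be sketched with the standard library alone. This is not torchetl's implementation: the function name `extract_to_csv`, the rule that a label is matched against folder names in the path, and the choice to halve the non-training remainder into validation and test are all assumptions of this sketch.

```python
import csv
import random
import tempfile
from pathlib import Path


def extract_to_csv(parent_directory, extension, labels, train_size,
                   random_state, save_path, file_prefix):
    """Scan for files, encode labels from the path, split, and write three CSVs."""
    parent_directory, save_path = Path(parent_directory), Path(save_path)

    # Collect (path, encoded_label) rows; the label is the index of the first
    # entry in `labels` that matches a folder name in the file's relative path.
    rows = []
    for path in sorted(parent_directory.rglob(f"*.{extension}")):
        folders = path.relative_to(parent_directory).parent.parts
        for encoded, name in enumerate(labels):
            if name in folders:
                rows.append((str(path), encoded))
                break

    # Shuffle with a fixed seed so the same split can be reproduced later.
    random.Random(random_state).shuffle(rows)

    # train_size goes to train; the remainder is halved into validation and test
    # (the half/half choice is an assumption of this sketch).
    n_train = int(len(rows) * train_size)
    n_validation = (len(rows) - n_train) // 2
    splits = {
        "train": rows[:n_train],
        "validation": rows[n_train:n_train + n_validation],
        "test": rows[n_train + n_validation:],
    }

    save_path.mkdir(parents=True, exist_ok=True)
    for split_name, split_rows in splits.items():
        with open(save_path / f"{file_prefix}_{split_name}.csv", "w", newline="") as f:
            csv.writer(f).writerows(split_rows)
    return {name: len(split_rows) for name, split_rows in splits.items()}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    for label in ("attack", "real"):
        folder = root / "ori" / label / "hand"
        folder.mkdir(parents=True)
        for i in range(10):
            (folder / f"{i}.jpg").touch()
    counts = extract_to_csv(root, "jpg", ["attack", "real"], 0.8, 69,
                            Path(tmp) / "out", "combined")
    print(counts)  # {'train': 16, 'validation': 2, 'test': 2}
```

Because no file is moved and the split is driven by a seed, anyone with the same directory and the same arguments can regenerate the same three CSV files.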

By the way, the reason I use csv rather than txt is that pandas can parse CSV flawlessly. Using txt is a little more complicated, but maybe I'll add that feature in the future.

# Pt. 1 Extract

In [1]:
from pathlib import Path
from torchvision import transforms
from torchetl.etl import Extract, TransformAndLoad
import pandas as pd

In [2]:
parent_directory = Path.cwd().parent / 'data' 
print(parent_directory)

/Users/jedi/Repo/GitHub/torchetl/data


In [3]:
combined_dataset = Extract(parent_directory=parent_directory,
                           extension='jpg',
                           labels=['attack', 'real'],
                           train_size=0.8,
                           random_state=69,
                           verbose=True)

In [4]:
help(combined_dataset)

Help on Extract in module torchetl.etl object:

class Extract(torchetl.base.dataset.BaseDataset)
 |  Extract(parent_directory: str, extension: str, labels: List[str], train_size: float, random_state: int, verbose: bool) -> None
 |  
 |  Method resolution order:
 |      Extract
 |      torchetl.base.dataset.BaseDataset
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, parent_directory: str, extension: str, labels: List[str], train_size: float, random_state: int, verbose: bool) -> None
 |      Class for creating csv files of train, validation, and test
 |      
 |      Parameters
 |      ----------
 |      parent_directory
 |              The parent_directory folder path. It is highly recommended to use Pathlib
 |      extension
 |              The extension we want to include in our search from the parent_directory directory
 |      labels
 |      
 |      Returns
 |      -------
 |      None
 |  
 |  extract(self, file_prefix: str, save_path: str, is_random

From the above we know that Extract inherits the show_files method from BaseDataset. Let's print the first 5 files.

In [5]:
combined_dataset.show_files(5)

[PosixPath('/Users/jedi/Repo/GitHub/torchetl/data/ori/attack/hand/112/4187.jpg'),
 PosixPath('/Users/jedi/Repo/GitHub/torchetl/data/ori/attack/hand/112/10099.jpg'),
 PosixPath('/Users/jedi/Repo/GitHub/torchetl/data/ori/attack/hand/112/3159.jpg'),
 PosixPath('/Users/jedi/Repo/GitHub/torchetl/data/ori/attack/hand/112/6021.jpg'),
 PosixPath('/Users/jedi/Repo/GitHub/torchetl/data/ori/attack/hand/112/1028.jpg')]

Now that we know everything went well, it's time to save the splits to our desired path

In [6]:
save_path = Path.cwd().parent / "data" / "combined"
save_path

PosixPath('/Users/jedi/Repo/GitHub/torchetl/data/combined')

In [7]:
combined_dataset.extract(file_prefix="combined", save_path=save_path, is_random_sampling=False)

Finished creating whole dataset array
Finished splitting dataset into train, validation, and test
Finished writing combined_train.csv into /Users/jedi/Repo/GitHub/torchetl/data/combined
Finished writing combined_validation.csv into /Users/jedi/Repo/GitHub/torchetl/data/combined
Finished writing combined_test.csv into /Users/jedi/Repo/GitHub/torchetl/data/combined
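Once the three files exist, a quick sanity check is to confirm that no image path appears in more than one split. The sketch below assumes header-less CSV files with the image path in the first column and the `<prefix>_train/validation/test.csv` naming shown in the log above; `read_paths` and `splits_are_disjoint` are hypothetical helpers, not part of torchetl.

```python
import csv
import tempfile
from pathlib import Path


def read_paths(csv_file):
    """Return the set of first-column values (image paths) of a header-less CSV."""
    with open(csv_file, newline="") as f:
        return {row[0] for row in csv.reader(f) if row}


def splits_are_disjoint(save_path, file_prefix):
    """True when no path appears in more than one of the three split files."""
    save_path = Path(save_path)
    splits = [read_paths(save_path / f"{file_prefix}_{name}.csv")
              for name in ("train", "validation", "test")]
    total = sum(len(paths) for paths in splits)
    return len(set().union(*splits)) == total


with tempfile.TemporaryDirectory() as tmp:
    # Tiny made-up split files, only to exercise the check.
    rows = {"train": [("a.jpg", 0), ("b.jpg", 1)],
            "validation": [("c.jpg", 0)],
            "test": [("d.jpg", 1)]}
    for name, split_rows in rows.items():
        with open(Path(tmp) / f"combined_{name}.csv", "w", newline="") as f:
            csv.writer(f).writerows(split_rows)
    disjoint = splits_are_disjoint(tmp, "combined")
    print(disjoint)  # True
```

In the tutorial's setting the call would be `splits_are_disjoint(save_path, "combined")`.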


# Pt. 2 Transform and Load

In [8]:
combined_dataset = Path.cwd() / 'data' / 'combined'
train_dataset_path = combined_dataset / 'combined_trains.csv'

In [9]:
data_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
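For intuition about the last two steps of this pipeline: ToTensor scales 8-bit pixel values into the range [0, 1], and Normalize then computes (x - mean) / std per channel, here with the commonly used ImageNet channel statistics. The plain-Python sketch below reproduces that arithmetic for a single RGB pixel; `normalize_pixel` and the pixel values are made up for illustration.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply the ToTensor scaling (divide by 255), then (x - mean) / std per channel."""
    scaled = [value / 255 for value in rgb]
    return [(x - m) / s for x, m, s in zip(scaled, MEAN, STD)]


# A pixel close to the channel means maps to values near zero.
print([round(v, 3) for v in normalize_pixel((124, 116, 104))])

# A black pixel maps to -mean / std in each channel.
print([round(v, 3) for v in normalize_pixel((0, 0, 0))])
```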

In [10]:
train_dataset = TransformAndLoad(parent_directory=parent_directory, 
                                extension="jpg", 
                                csv_file=train_dataset_path, 
                                transform=data_transform)

/Users/jedi/Repo/GitHub/torchetl/tutorial/data/combined/combined_trains.csv does not exist


FileNotFoundError. Looks like we slipped up a little: the file written by Extract is combined_train.csv, not combined_trains.csv.

In [14]:
# Fix the file name, and build the path from save_path: the CSVs were written to
# Path.cwd().parent / 'data' / 'combined', not Path.cwd() / 'data' / 'combined'
train_dataset_path = save_path / 'combined_train.csv'

train_dataset = TransformAndLoad(parent_directory=parent_directory, 
                                extension="jpg", 
                                csv_file=train_dataset_path, 
                                transform=data_transform)

Finally!!!

Now let's see what we can do with train_dataset

In [12]:
help(train_dataset)

Help on TransformAndLoad in module torchetl.etl object:

class TransformAndLoad(torch.utils.data.dataset.Dataset)
 |  TransformAndLoad(parent_directory: str, extension: str, csv_file: str, transform: Callable = None) -> None
 |  
 |  An abstract class representing a Dataset.
 |  
 |  All other datasets should subclass it. All subclasses should override
 |  ``__len__``, that provides the size of the dataset, and ``__getitem__``,
 |  supporting integer indexing in range from 0 to len(self) exclusive.
 |  
 |  Method resolution order:
 |      TransformAndLoad
 |      torch.utils.data.dataset.Dataset
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __getitem__(self, idx: int) -> Tuple[numpy.ndarray, numpy.ndarray]
 |      Return the X and y of a specific instance based on the index
 |      
 |      Parameters
 |      ----------
 |      idx
 |              The index of the instance 
 |      
 |      Returns
 |      -------
 |      Tuple of X and y of a specific instance
 |  


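So by indexing the dataset, which calls its getitem method, we can inspect the X and y of any instance

In [13]:
train_dataset[0]

AttributeError: 'TransformAndLoad' object has no attribute 'csv_file'

This AttributeError most likely comes from a run in which the CSV file was not found (the "does not exist" message above), so the dataset never stored its csv_file. With a train_dataset_path that points at the existing combined_train.csv, the call returns the tuple of X and y described in the help output.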

Since we have the csv file, we can easily inspect our training dataset

In [None]:
import pandas as pd

train_df = pd.read_csv(train_dataset_path, header=None)

In [None]:
train_df.head()

This is very handy. For instance, we may want to make sure that every file whose path contains "attack" has a consistent label

In [None]:
# Each row's path should contain 'attack' exactly when its label is 0
(train_df[0].str.contains('attack') == (train_df[1] == 0)).all()

In [None]:
len(train_df[train_df[0].str.contains('attack')][0]) == len(train_df[train_df[1] == 0][0])

Now we have confirmed that the files whose path contains "attack" have a consistent label.

In [None]:
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

With this, we have successfully created a train DataLoader. While this may look like a long process, in practice we're saving quite a lot of time because we're not moving any files whatsoever. We also have a more consistent and reproducible dataset. On top of that, debugging the dataset is much easier than with the naive method.