# Datasets

PyTorch provides ready-made datasets through its domain libraries, such as:
- **`torchvision`** module:
  - Image datasets: MNIST, FashionMNIST, ImageNet etc.
  - Video datasets: HMDB51 (Human Motion Database), Moving MNIST etc.
- **`torchaudio`** module:
  - Audio datasets: YESNO
- **`torchtext`** module:
  - Text datasets: AG_NEWS, IMDb, WikiText etc.

Because these datasets can be very large, we use PyTorch's `DataLoader` to iterate through the data in batches.

In [1]:
import torch
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets
from torchvision import transforms
from torchvision.transforms import v2

## Downloading the data

In [2]:
train_data = datasets.MNIST(root = "train_data", 
                            train = True, # load the training split
                            download = True, # downloads the data into the `root` directory if it is not already there
                            transform = transforms.ToTensor()) # converts each PIL image to a float tensor in [0, 1]

test_data = datasets.MNIST(root = "test_data",
                           train = False, # load the test split
                           download = True, # downloads the data into the `root` directory if it is not already there
                           transform = transforms.ToTensor()) # converts each PIL image to a float tensor in [0, 1]

## Applying transforms on the data

Let's say we want to classify MNIST with a fully connected network (DNN) instead of a CNN. A fully connected layer expects each sample as a 1-D feature vector, but every MNIST image is a 28 x 28 grid. So we first convert each image to a tensor and then flatten it.

For this task we build a custom transform with `Compose()`, which chains the tensor conversion and the flattening step.

In [3]:
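image_flatten = v2.Compose([
    v2.PILToTensor(), # PIL image -> uint8 tensor of shape (C, H, W)
    v2.ToDtype(torch.float64, scale = False), # cast only; pixel values stay in [0, 255]
    v2.Lambda(lambda x: x.flatten()) # flatten to 1-D; unlike view(-1), flatten() also works on non-contiguous tensors
])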

To construct a preprocessing pipeline for image data, we use `Compose()`, which chains several transforms (both preprocessing steps and data augmentations) and applies them in order.

One example of a preprocessing pipeline is:

```python
import torch
from torchvision.datasets import FakeData
from torchvision.transforms import v2


NUM_CLASSES = 100


preproc = v2.Compose([
    v2.PILToTensor(),
    v2.RandomResizedCrop(size=(224, 224), antialias=True),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ToDtype(torch.float32, scale=True),  # to float32 in [0, 1]
    v2.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),  # typically from ImageNet
])

dataset = FakeData(size=1000, num_classes=NUM_CLASSES, transform=preproc)

img, label = dataset[0]
print(f"{type(img) = }, {img.dtype = }, {img.shape = }, {label = }")
```
Source: https://pytorch.org/vision/stable/auto_examples/transforms/plot_cutmix_mixup.html#sphx-glr-auto-examples-transforms-plot-cutmix-mixup-py

## Custom Dataset

Let's say we want to work with data from other sources (like datasets from sklearn, Kaggle etc.). We can do so with the following steps.

> **Note**: If the data is not preprocessed, it's better to implement a pipeline that preprocesses the data and converts it into a NumPy array. This NumPy array can then be converted to a PyTorch tensor and loaded using a `DataLoader`.

If you are already familiar with data preprocessing and pipelines using sklearn and pandas, you can skip this code.

In [4]:
import numpy as np
import pandas as pd

In [5]:
data = pd.read_csv("./datasets/titanic_data.csv")
display(data)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | 0 | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
| 414 | 1306 | 1 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
| 415 | 1307 | 0 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
| 416 | 1308 | 0 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


In [7]:
data.drop(columns=["Name"], inplace=True)
pid = data.pop("PassengerId")
data

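| | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 1 | 3 | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 0 | 2 | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 0 | 3 | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 1 | 3 | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 0 | 3 | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
| 414 | 1 | 1 | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
| 415 | 0 | 3 | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
| 416 | 0 | 3 | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |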


In [8]:
labels = data.pop("Survived")
data

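| | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 3 | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 2 | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 3 | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 3 | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 3 | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
| 414 | 1 | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
| 415 | 3 | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
| 416 | 3 | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |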


In [9]:
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

In [10]:
set_config(display="diagram")

In [11]:
column_transformer = ColumnTransformer(
    transformers=[("one_hot_encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["Sex", "Ticket", "Cabin", "Embarked"]),
                  ("imputer", KNNImputer(), ["Age", "Fare"])],
    remainder = "passthrough"
)

preprocessing_pipeline = Pipeline(
    steps=[
        ('column_transformer', column_transformer),
        ("scaler", MinMaxScaler()),
    ]
)

In [12]:
preprocessing_pipeline

In [13]:
X_train = preprocessing_pipeline.fit_transform(data, labels)
display(X_train, X_train.shape)

array([[0.        , 1.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [1.        , 0.        , 0.        , ..., 1.        , 0.125     ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.5       , 0.        ,
        0.        ],
       ...,
       [0.        , 1.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 1.        , 0.125     ,
        0.11111111]])

(418, 450)

### Creating the custom dataset

In [14]:
X_train = torch.from_numpy(X_train).to(torch.float64)
y_train = torch.from_numpy(np.array(labels)).to(torch.float64)

display(X_train.shape, y_train.shape)

torch.Size([418, 450])

torch.Size([418])

In [15]:
class TitanicDataset(Dataset):
    def __init__(self, features, labels) -> None:
        self.features = features
        self.labels = labels
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

titanic_data = TitanicDataset(features= X_train, labels=y_train)
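A map-style dataset only has to implement `__len__` and `__getitem__`; everything else builds on those two methods. The sketch below shows them being called on a toy dataset. The class name `PairDataset` and the toy values are hypothetical and stand in for `TitanicDataset` and the real features.

```python
import torch
from torch.utils.data import Dataset


class PairDataset(Dataset):
    """Hypothetical stand-in with the same structure as TitanicDataset."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


# Toy data: 6 samples with 3 features each (arbitrary values).
toy_features = torch.arange(18, dtype=torch.float64).reshape(6, 3)
toy_labels = torch.tensor([0, 1, 0, 1, 1, 0], dtype=torch.float64)
toy_data = PairDataset(toy_features, toy_labels)

print(len(toy_data))   # calls __len__
x0, y0 = toy_data[0]   # calls __getitem__(0)
print(x0, y0)
```

For tensors that already fit in memory, `torch.utils.data.TensorDataset` offers the same indexing behaviour without a custom class; writing the class by hand becomes useful once samples need to be loaded or transformed on access.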

# Dataloader

`DataLoader` helps us iterate through the data by wrapping an iterable around the dataset object. It also provides features such as:

- Dividing the data into batches.
- Shuffling the data.
- Sampling and batch sampling the data.

and many more.

We can use a `DataLoader` to load the data batch by batch and feed it to the model.
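The features listed above can be sketched on a small toy dataset. All values and variable names here are illustrative assumptions, and `WeightedRandomSampler` is just one of the samplers available in `torch.utils.data`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy dataset: 10 samples, 4 features, imbalanced binary labels (arbitrary values).
features = torch.randn(10, 4)
labels = torch.tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
toy_data = TensorDataset(features, labels)

# Batching: 10 samples with batch_size=4 give batches of 4, 4 and 2.
batched = DataLoader(toy_data, batch_size=4)
batch_sizes = [len(y) for _, y in batched]
print(batch_sizes)

# Shuffling: the sample order is re-drawn every time the loader is iterated.
shuffled = DataLoader(toy_data, batch_size=4, shuffle=True)

# Sampling: draw minority-class samples more often.
# A custom sampler cannot be combined with shuffle=True.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(toy_data), replacement=True)
balanced = DataLoader(toy_data, batch_size=4, sampler=sampler)
```

Passing `drop_last=True` would discard the final, smaller batch, which can help when a model expects a fixed batch size.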

In [16]:
titanic_dataloader = DataLoader(titanic_data, batch_size=5, shuffle=True)

In [17]:
# Fetch only the first batch to inspect its shapes, without overwriting X_train / y_train
X_batch, y_batch = next(iter(titanic_dataloader))
display(X_batch.shape, y_batch.shape)

torch.Size([5, 450])

torch.Size([5])