#  PyTorch ```Dataset``` {#sec-pytorch-dataset}

## <a name="overview"></a> Overview

In [chapter @sec-pytorch-tensors] we discussed tensors, the building blocks for data in PyTorch. In this chapter we go a step
further and discuss how to represent a dataset in PyTorch.
This is done using the abstract ```Dataset``` class.
A custom dataset should inherit from ```Dataset``` and override the following two methods [1]:


- ```__len__```, so that ```len(dataset)``` returns the size of the dataset
- ```__getitem__```, to support indexing so that ```dataset[i]``` returns the $i$-th sample

The latter should return the item and its corresponding label.
Before looking into how to create a custom dataset, we will discuss how to work with images, text and tabular data in PyTorch.

## Working with images

In this section we work with images and see how to handle them in PyTorch. The first thing to
be aware of is that PyTorch modules dealing with image data require tensors to be laid out as
$C \times H \times W$, i.e. (channels, height, width). Images loaded through PIL and NumPy are typically laid out as
$H \times W \times C$ instead. In that case we can use the ```tensor.permute``` method to reorder the dimensions.

In [19]:
import PIL.Image as PILImage
import numpy as np
import torch
from pathlib import Path
import os

In [20]:
pil_image = PILImage.open(Path("../../data/images/random/img1.jpeg"))
img = np.array(pil_image)
img.shape

(598, 1600, 3)

In [21]:
torch_img = torch.from_numpy(img)
out = torch_img.permute(2, 0, 1)
out.shape

torch.Size([3, 598, 1600])
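
One detail worth keeping in mind is that ```permute``` does not copy data; it returns a view of the same storage with rearranged strides. The following minimal sketch illustrates this on a small random array that stands in for an image (the array, its size and the variable names are illustrative assumptions, not the image loaded above).

```python
import numpy as np
import torch

# Hypothetical stand-in for an H x W x C image (4 x 6 pixels, 3 channels).
rng = np.random.default_rng(0)
fake_img = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

hwc = torch.from_numpy(fake_img)   # shares memory with the NumPy array
chw = hwc.permute(2, 0, 1)         # view with shape C x H x W

# The pixel at row 1, column 2 keeps its channel values after the permute.
same_pixel = torch.equal(hwc[1, 2, :], chw[:, 1, 2])

# permute only changes strides, so the result here is not contiguous in memory;
# .contiguous() makes a copy that is laid out in C x H x W order.
chw_contiguous = chw.contiguous()
print(chw.shape, chw.is_contiguous(), chw_contiguous.is_contiguous())
```

Most operations accept non-contiguous tensors, so the extra copy is only needed when an operation explicitly requires contiguous memory.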

Typically, we feed a deep neural network with a batch of images. Let's see how we can create one. When storing images
in a batch, the first dimension is the number of images in it, i.e. $N \times C \times H \times W$. There are two
ways to create a batch of images: we can use the ```torch.stack``` function we have seen in the previous chapter, or
we can preallocate a tensor of the appropriate size and fill it with images loaded from a directory. The second approach is shown below.



In [28]:
batch_size = 4
channels = 3
height = 256
width = 256
# The layout is N x C x H x W, so height comes before width
batch = torch.zeros(batch_size, channels, height, width, dtype=torch.uint8)

In [29]:
data_dir = Path('../../data/images/cracks')
filenames = [name for name in os.listdir(data_dir)
             if os.path.splitext(name)[-1] == '.jpg']

# Fill the preallocated batch with the first batch_size images.
# Each image is assumed to be an RGB image of size height x width.
for i, filename in enumerate(filenames[:batch_size]):
    img_arr = np.array(PILImage.open(os.path.join(data_dir, filename)))
    img_t = torch.from_numpy(img_arr)
    # Reorder H x W x C into C x H x W before storing the image
    batch[i] = img_t.permute(2, 0, 1)
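
The other option mentioned above, ```torch.stack```, can be sketched as follows. The random tensors below are hypothetical stand-ins for images that have already been loaded and permuted to $C \times H \times W$; note that ```torch.stack``` requires all tensors to have the same shape.

```python
import torch

# Hypothetical stand-ins for four decoded images, each already C x H x W.
images = [torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8)
          for _ in range(4)]

# torch.stack adds a new leading dimension, giving N x C x H x W.
stacked = torch.stack(images, dim=0)
print(stacked.shape)
```

Stacking is convenient when the images are already in a list, whereas preallocation avoids holding both the list and the batch in memory at the same time.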

Now that we have our batch available, let's see how we can normalize the images. Normalization is something that we typically want to
perform when dealing with deep neural networks. We will do this per channel, using the mean and standard deviation computed over all images in the batch:

$$\text{normalized img channel} = \frac{\text{channel values} - \text{channel mean}}{\text{channel std}}$$

In [30]:

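# Convert the batch from uint8 to float so that the mean and std can be computed
batch = batch.float()
# Optionally scale the pixel values to [0, 1] first
# batch /= 255.0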

n_channels = batch.shape[1]
for c in range(n_channels):
    mean = torch.mean(batch[:, c])
    std = torch.std(batch[:, c])
    batch[:, c] = (batch[:, c] - mean) / std
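
The per-channel loop can also be written without an explicit loop by reducing over the batch, height and width dimensions at once. The sketch below compares the two on a small random batch (the batch itself and the variable names are illustrative assumptions).

```python
import torch

# Hypothetical batch of 4 RGB images (N x C x H x W) with random pixel values.
torch.manual_seed(0)
demo_batch = torch.randint(0, 256, (4, 3, 8, 8)).float()

# Reference: per-channel loop, following the same steps as the cell above.
looped = demo_batch.clone()
for c in range(looped.shape[1]):
    mean_c = torch.mean(looped[:, c])
    std_c = torch.std(looped[:, c])
    looped[:, c] = (looped[:, c] - mean_c) / std_c

# Vectorized: reduce over the batch, height and width dimensions at once.
# keepdim=True keeps a 1 x C x 1 x 1 shape so that broadcasting lines up.
mean = demo_batch.mean(dim=(0, 2, 3), keepdim=True)
std = demo_batch.std(dim=(0, 2, 3), keepdim=True)
vectorized = (demo_batch - mean) / std

print(torch.allclose(looped, vectorized, atol=1e-5))
```

After normalization each channel has approximately zero mean and unit standard deviation across the batch.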

## Working with text

At the time of writing, natural language processing (NLP) is a very popular subfield of artificial intelligence.
In this section we therefore discuss how to import and process text-based data with PyTorch. The overall idea is to transform
text into numbers and feed these numbers to PyTorch.

In [11]:
from pathlib import Path
import torch

In [12]:
austine_jane_path = Path("../../data/text/austine_jane.txt")

In [13]:
with open(austine_jane_path, encoding='utf8') as f:
    text = f.read()

We will use one-hot encoding to represent text, starting at the character level.

In [14]:

lines = text.split('\n')
line = lines[200]
line

'for there was a distinctly feminine element in “Mr. Spectator,” and in'

In [15]:
print(len(line))

70


Let’s create a tensor that can hold the total number of one-hot-encoded characters for the whole line:

In [16]:
letter_t = torch.zeros(len(line), 128)
letter_t.shape

torch.Size([70, 128])

Note that ```letter_t``` holds one one-hot-encoded character per row. We now just have to set a one in the correct position of each row so that the row represents the correct character. The index where the one has to be set corresponds to the index of the character in the encoding, here its ASCII code.

In [17]:
for i, letter in enumerate(line.lower().strip()):
    # Non-ASCII characters, such as the curly quotes in this line, are mapped to index 0
    letter_index = ord(letter) if ord(letter) < 128 else 0
    letter_t[i][letter_index] = 1
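
The same encoding can be obtained without an explicit loop over rows by using ```torch.nn.functional.one_hot```. This is a minimal sketch on a short sample string (the string and the variable names are illustrative assumptions); as above, non-ASCII characters are sent to index 0.

```python
import torch
import torch.nn.functional as F

# Hypothetical short line used only for illustration.
sample = "Mr. Spectator"

# Map each character to its ASCII code, sending non-ASCII characters to 0.
indices = torch.tensor([ord(ch) if ord(ch) < 128 else 0
                        for ch in sample.lower()])

# one_hot expects an integer (int64) tensor and returns one row per character.
encoded = F.one_hot(indices, num_classes=128).float()
print(encoded.shape)
```

Each row of ```encoded``` contains exactly one nonzero entry, located at the ASCII code of the corresponding character.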

### One-hot encoding whole words

We have one-hot encoded our sentence into a representation that a neural network could digest. 
Word-level encoding can be done the same way by establishing a vocabulary 
and one-hot encoding sentences—sequences of words—along the rows of our tensor. 
Since a vocabulary has many words, this will produce very wide encoded vectors, which may not be practical. We will see in the next section that there 
is a more efficient way to represent text at the word level, using embeddings. 
For now, let’s stick with one-hot encodings and see what happens.

In [30]:
def clean_words(in_str):
    # Punctuation to strip from both ends of each word, including curly quotes
    punctuation = '.,;:"!?”“_-'
    word_list = in_str.lower().replace('\n',' ').split()
    word_list = [word.strip(punctuation) for word in word_list]
    return word_list

In [31]:
words_in_line = clean_words(line)
line, words_in_line


('for there was a distinctly feminine element in “Mr. Spectator,” and in',
 ['for',
  'there',
  'was',
  'a',
  'distinctly',
  'feminine',
  'element',
  'in',
  'mr',
  'spectator',
  'and',
  'in'])
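
With a list of cleaned words at hand, the word-level encoding described at the start of this section can be sketched as follows. The tiny corpus, the sentence and the name ```word2index``` are illustrative assumptions; a real vocabulary would be built from the full text.

```python
import torch

# Hypothetical corpus and sentence, already lower-cased and free of punctuation.
corpus = "the cat sat on the mat the dog sat too"
sentence = "the dog sat on the mat"

# Vocabulary: every distinct word gets an integer index (sorted for determinism).
word2index = {word: i for i, word in enumerate(sorted(set(corpus.split())))}

# One row per word in the sentence, one column per vocabulary entry.
words = sentence.split()
word_t = torch.zeros(len(words), len(word2index))
for i, word in enumerate(words):
    word_t[i][word2index[word]] = 1

print(word_t.shape)
```

The number of columns equals the vocabulary size, which is why this representation becomes very wide for realistic vocabularies.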

### Text embeddings
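
As a minimal sketch of the idea, and not a full treatment, an embedding maps each word index to a dense, trainable vector of fixed size instead of a wide one-hot row. In PyTorch this can be done with ```torch.nn.Embedding```. The vocabulary and the embedding size below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary; in practice it would be built from the corpus.
word2index = {"for": 0, "there": 1, "was": 2, "a": 3, "distinctly": 4}

# Each of the 5 words maps to a trainable vector of size 8 (an arbitrary choice).
embedding = nn.Embedding(num_embeddings=len(word2index), embedding_dim=8)

# The input is a tensor of word indices, not a one-hot matrix.
indices = torch.tensor([word2index[w] for w in ["there", "was", "a"]])
vectors = embedding(indices)
print(vectors.shape)
```

The embedding vectors are initialized randomly and are meant to be learned together with the rest of the model.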

## PyTorch ```Dataset```

Now that we know how to handle various types of data with PyTorch, let's define our own ```Dataset```.

In [31]:
import numpy as np
from torch.utils.data import Dataset

In [32]:
X = np.array([[0.0, 1.0], 
              [2.0, 1.0], 
              [4.0, 5.5]])
y = np.array([0, 1, 2])

In [33]:
class ExampleDataSet(Dataset):
    
    def __init__(self, X, y, transform=None):
        self._X = X
        self._y = y
        self._transform = transform
        
    def __getitem__(self, index):
        """
        Returns the index-th training example and label,
        after applying the optional transform
        """
        x = self._X[index]
        y = self._y[index]

        if self._transform is not None:
            x, y = self._transform(x, y)

        return x, y
    
    def __len__(self):
        """
        Returns how many items are in the dataset
        """
        return self._X.shape[0]
    

In [34]:
dataset = ExampleDataSet(X=X, y=y)
print("Number of training examples={0}".format(len(dataset)))
print("The first training example is={0} with label={1}".format(dataset[0][0], dataset[0][1]))

Number of training examples=3
The first training example is=[0. 1.] with label=0
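
The ```transform``` argument can be used, for example, to convert the NumPy data into tensors on access. The sketch below is self-contained: ```PairDataset``` is a hypothetical, compact dataset with the same structure as the one above, and ```to_tensors``` is a hypothetical transform that takes and returns a sample together with its label.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class PairDataset(Dataset):
    """Hypothetical minimal dataset mirroring the structure used above."""

    def __init__(self, X, y, transform=None):
        self._X, self._y, self._transform = X, y, transform

    def __getitem__(self, index):
        x, y = self._X[index], self._y[index]
        if self._transform is not None:
            x, y = self._transform(x, y)
        return x, y

    def __len__(self):
        return self._X.shape[0]


def to_tensors(x, y):
    """Hypothetical transform: convert a NumPy sample and its label to tensors."""
    return torch.from_numpy(x).float(), torch.tensor(int(y))


X = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 5.5]])
y = np.array([0, 1, 2])

tensor_dataset = PairDataset(X, y, transform=to_tensors)
sample, label = tensor_dataset[2]
print(sample, label)
```

Keeping the conversion in a transform leaves the stored data untouched and lets the same dataset class be reused with different preprocessing steps.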


### PyTorch ```DataLoader```

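A ```DataLoader``` takes in a dataset and defines rules for successively generating batches of data, such as the batch size and whether the samples are shuffled. For example:

In [35]:
from torch.utils.data import DataLoader

# The dataset has only 3 examples, so a single batch of size 3 is produced
dataloader = DataLoader(dataset, batch_size=60, shuffle=True)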
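
Iterating over a ```DataLoader``` yields one batch at a time. The following minimal sketch uses ```TensorDataset``` with made-up data (the tensors and names below are illustrative assumptions) to show how the samples are split into batches.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples with 2 features each, plus integer labels.
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10)
toy_dataset = TensorDataset(features, labels)

# batch_size=4 over 10 samples yields batches of 4, 4 and 2 samples.
loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

batch_sizes = []
for x_batch, y_batch in loader:
    batch_sizes.append(x_batch.shape[0])
    print(x_batch.shape, y_batch.shape)
```

The last batch is smaller because 10 is not a multiple of 4; passing ```drop_last=True``` to the ```DataLoader``` discards such an incomplete batch.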

## Summary

## <a name="refs"></a> References

1. <a href="https://pytorch.org/tutorials/beginner/data_loading_tutorial.html">Writing Custom Datasets, DataLoaders and Transforms</a>