How to represent different types of data with tensors. Types covered: images, tabular, time series, text.

## Working with images

#### Loading

In [1]:
import imageio

img_arr = imageio.imread("../data/p1ch4/image-dog/bobby.jpg")
img_arr.shape # H x W x C

(1280, 855, 3)

In [2]:
import torch

img = torch.from_numpy(img_arr)
out = img.permute(2, 0, 1) # Permute to fit C x H x W dimension ordering
out.shape

# Note: permute doesn't copy the image data. It returns a view that shares the original
# tensor's storage and only changes the size and stride information.

torch.Size([3, 1280, 855])
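A minimal sketch of the view behaviour noted above, using a small made-up tensor in place of the dog image (the values and the 2 x 4 x 3 shape are assumptions for illustration only). It checks that the permuted tensor points at the same memory and that a write through the original is visible through the view.

```python
import torch

# Small stand-in for an H x W x C image (arbitrary values, not the real photo).
img = torch.arange(24).reshape(2, 4, 3)   # H=2, W=4, C=3
out = img.permute(2, 0, 1)                # C x H x W view of the same storage

# Only the size/stride bookkeeping differs between the two tensors.
print(img.shape, img.stride())
print(out.shape, out.stride())

# Both tensors start at the same memory address.
shares_storage = img.data_ptr() == out.data_ptr()

# Writing through the original is visible through the permuted view.
img[0, 0, 0] = 100
changed = out[0, 0, 0].item()
print(shares_storage, changed)
```

If an independent copy with C x H x W memory layout is needed, calling `.contiguous()` or `.clone()` on the permuted tensor should produce one.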

In [3]:
"""
To create a dataset of multiple images to use as input for our neural networks, we store the images
in a batch along the first dimension to obtain an N x C x H x W tensor.
"""

# An efficient way to create a batch is pre-allocation followed by loading from a directory

batch_size = 3
batch = torch.zeros(batch_size, 3, 256, 256, dtype=torch.uint8)

In [4]:
import os

data_dir = '../data/p1ch4/image-cats/'
filenames = [name for name in os.listdir(data_dir)
             if os.path.splitext(name)[-1] == ".png"] # Condition ensures images used are of a desired format.

for i, filename in enumerate(filenames):

    img_arr = imageio.imread(os.path.join(data_dir, filename))
    img_t = torch.from_numpy(img_arr)
    img_t = img_t.permute(2, 0, 1)
    img_t = img_t[:3]
    batch[i] = img_t
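The loop above depends on files on disk, so here is a self-contained sketch of the same pre-allocate-and-fill pattern. The `fake_images` list is a hypothetical stand-in for decoded PNGs (random RGBA arrays of size 256 x 256), not the cat images themselves.

```python
import numpy as np
import torch

rng = np.random.default_rng(0)

# Hypothetical stand-ins for decoded PNG files: H x W x 4 (RGBA), uint8.
fake_images = [
    rng.integers(0, 256, size=(256, 256, 4), dtype=np.uint8) for _ in range(3)
]

# Pre-allocate the N x C x H x W batch once, then fill it in place.
batch = torch.zeros(len(fake_images), 3, 256, 256, dtype=torch.uint8)

for i, arr in enumerate(fake_images):
    img_t = torch.from_numpy(arr).permute(2, 0, 1)  # H x W x C -> C x H x W
    batch[i] = img_t[:3]                            # keep RGB, drop the alpha channel

print(batch.shape, batch.dtype)
```

Note that this pattern assumes every image already has the same height and width as the pre-allocated batch; images of other sizes would need resizing first.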

#### Normalising

In [5]:
batch[0]

tensor([[[ 90,  91,  93,  ..., 191, 191, 191],
         [ 91,  91,  93,  ..., 191, 191, 191],
         [ 91,  92,  93,  ..., 192, 192, 192],
         ...,
         [206, 210, 213,  ..., 220, 219, 218],
         [209, 214, 214,  ..., 221, 220, 219],
         [212, 212, 212,  ..., 219, 218, 218]],

        [[108, 109, 111,  ..., 201, 201, 201],
         [109, 109, 111,  ..., 201, 201, 201],
         [109, 110, 111,  ..., 202, 202, 202],
         ...,
         [198, 202, 205,  ..., 214, 213, 212],
         [201, 206, 206,  ..., 213, 212, 211],
         [204, 204, 204,  ..., 211, 210, 210]],

        [[120, 121, 123,  ..., 210, 210, 210],
         [121, 121, 123,  ..., 210, 210, 210],
         [121, 122, 123,  ..., 211, 211, 211],
         ...,
         [198, 202, 205,  ..., 214, 213, 212],
         [201, 206, 206,  ..., 213, 212, 211],
         [204, 204, 204,  ..., 211, 210, 210]]], dtype=torch.uint8)

In [6]:
# Neural networks tend to train best when input values lie roughly in the range [0, 1] or [-1, 1]

batch = batch.float()
batch /= 255.0
batch[0]

tensor([[[0.3529, 0.3569, 0.3647,  ..., 0.7490, 0.7490, 0.7490],
         [0.3569, 0.3569, 0.3647,  ..., 0.7490, 0.7490, 0.7490],
         [0.3569, 0.3608, 0.3647,  ..., 0.7529, 0.7529, 0.7529],
         ...,
         [0.8078, 0.8235, 0.8353,  ..., 0.8627, 0.8588, 0.8549],
         [0.8196, 0.8392, 0.8392,  ..., 0.8667, 0.8627, 0.8588],
         [0.8314, 0.8314, 0.8314,  ..., 0.8588, 0.8549, 0.8549]],

        [[0.4235, 0.4275, 0.4353,  ..., 0.7882, 0.7882, 0.7882],
         [0.4275, 0.4275, 0.4353,  ..., 0.7882, 0.7882, 0.7882],
         [0.4275, 0.4314, 0.4353,  ..., 0.7922, 0.7922, 0.7922],
         ...,
         [0.7765, 0.7922, 0.8039,  ..., 0.8392, 0.8353, 0.8314],
         [0.7882, 0.8078, 0.8078,  ..., 0.8353, 0.8314, 0.8275],
         [0.8000, 0.8000, 0.8000,  ..., 0.8275, 0.8235, 0.8235]],

        [[0.4706, 0.4745, 0.4824,  ..., 0.8235, 0.8235, 0.8235],
         [0.4745, 0.4745, 0.4824,  ..., 0.8235, 0.8235, 0.8235],
         [0.4745, 0.4784, 0.4824,  ..., 0.8275, 0.8275, 0.
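The cell above scales to [0, 1]. As a sketch of the other range mentioned, the same values can be mapped to [-1, 1] with one more affine step. The tiny `batch_u8` tensor is an assumption standing in for real pixel data.

```python
import torch

# Stand-in pixel values covering the full uint8 range.
batch_u8 = torch.tensor([[0, 128, 255]], dtype=torch.uint8)

unit = batch_u8.float() / 255.0   # values in [0, 1]
signed = unit * 2.0 - 1.0         # values in [-1, 1]

print(unit)
print(signed)
```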

In [7]:
"""
We may also want to compute the mean and standard deviation of the input data and scale it
so that the output has zero mean and unit standard deviation across each channel

Torch provides functions for calculating these for tensors
"""

n_channels = batch.shape[1]
for c in range(n_channels):
    mean = torch.mean(batch[:, c])
    std = torch.std(batch[:, c])
    batch[:, c] = (batch[:, c] - mean) / std
    
# NOTE: it's good practice to compute the mean and stdev on all training data in
# advance and then subtract and divide by these fixed, precomputed quantities
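One way the precomputed-statistics idea might look, sketched with random tensors (`train` and `new_batch` are hypothetical stand-ins for a training set and a later batch). The per-channel statistics are computed once over the N, H and W dimensions and then reused through broadcasting instead of a Python loop.

```python
import torch

torch.manual_seed(0)

# Hypothetical data: N x C x H x W tensors with values in [0, 1).
train = torch.rand(8, 3, 16, 16)
new_batch = torch.rand(2, 3, 16, 16)

# Per-channel statistics, computed once on the training data.
# keepdim=True gives shape 1 x C x 1 x 1, which broadcasts over any batch.
mean = train.mean(dim=(0, 2, 3), keepdim=True)
std = train.std(dim=(0, 2, 3), keepdim=True)

train_norm = (train - mean) / std
new_norm = (new_batch - mean) / std   # reuse the fixed training statistics

print(mean.flatten(), std.flatten())
```

Only `train_norm` is guaranteed to have zero mean and unit standard deviation per channel; `new_norm` will be close to that only if the new data resembles the training data.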

### 3D images

In some domains, such as CT scans, sequences of 2D image slices are stacked along the head-to-foot axis.

By stacking individual 2D slices into a 3D tensor, we can build _volumetric data_ representing the 3D anatomy of a subject. Storing volumetric data is just like storing image data, except that an extra dimension, _depth_, comes after the standard channel dimension, resulting in a 5D tensor of shape `N x C x D x H x W`.

In [8]:
# Loading the specialised format
# Volumetric data can be downloaded from: https://github.com/deep-learning-with-pytorch/dlwpt-code/tree/master/data/p1ch4/volumetric-dicom/2-LUNG%203.0%20%20B70f-04083

import imageio
dir_path = "../data/p1ch4/volumetric-dicom/2-LUNG 3.0  B70f-04083"
vol_arr = imageio.volread(dir_path, 'DICOM')
vol_arr.shape

Reading DICOM (examining files): 99/99 files (100.0%)
  Found 1 correct series.
Reading DICOM (loading data): 99/99  (100.0%)


(99, 512, 512)
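The loaded array has shape D x H x W with no channel or batch dimension. A sketch of how it could be brought into the `N x C x D x H x W` layout described earlier is shown here. The zero-filled `vol_arr` is a stand-in with a reduced 64 x 64 slice size (an assumption to keep the example light), not the actual scan.

```python
import numpy as np
import torch

# Stand-in for the DICOM volume: 99 slices, downsized to 64 x 64 for this sketch.
vol_arr = np.zeros((99, 64, 64), dtype=np.int16)

vol = torch.from_numpy(vol_arr).float()   # D x H x W, float32
vol = vol.unsqueeze(0)                    # add channel dim -> C x D x H x W
vol_batch = vol.unsqueeze(0)              # add batch dim   -> N x C x D x H x W

print(vol.shape, vol_batch.shape)
```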

### Representing tabular data

In [15]:
# As pandas is recommended, I used pandas instead of numpy (which is used in the book),
# as I need to get comfortable with it.

import pandas as pd

wine_path = "../data/p1ch4/tabular-wine/winequality-white.csv"
wineq_df = pd.read_csv(wine_path, delimiter=";")
wineq_df[:10]

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 5 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 6 | 6.2 | 0.32 | 0.16 | 7.0 | 0.045 | 30.0 | 136.0 | 0.9949 | 3.18 | 0.47 | 9.6 | 6 |
| 7 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 8 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 9 | 8.1 | 0.22 | 0.43 | 1.5 | 0.044 | 28.0 | 129.0 | 0.9938 | 3.22 | 0.45 | 11.0 | 6 |


In [36]:
wineq_df.shape, wineq_df.columns

((4898, 12),
 Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
        'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
        'pH', 'sulphates', 'alcohol', 'quality'],
       dtype='object'))

In [37]:
wineq = torch.from_numpy(wineq_df.values)
wineq.shape, wineq.dtype

(torch.Size([4898, 12]), torch.float64)
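The tensor comes out as `float64` because NumPy's default floating-point type is double precision, while PyTorch layers create `float32` parameters by default. A small sketch of the conversion, using a two-row array with made-up wine-like values rather than the real CSV:

```python
import numpy as np
import torch

# Stand-in for wineq_df.values: NumPy floats default to float64.
arr = np.array([[7.0, 0.27], [6.3, 0.30]])

t64 = torch.from_numpy(arr)     # shares memory with arr, dtype float64
t32 = t64.to(torch.float32)     # new tensor, dtype float32

print(t64.dtype, t32.dtype)
```

Whether to convert is a judgment call; float32 halves the memory footprint and matches the default parameter dtype, which avoids dtype-mismatch errors when the data is later fed to a model.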

In [38]:
"""
Choices with the "quality" score:

1. Treat as a continuous variable, keep it as a real number, perform a regression task
    (by definition real number prediction)
2. Treat it as a label and try to guess the label from the chemical analysis in a classification task

Regardless, we separate this out so it's not an input which influences model training,
as this is what we're trying to predict.
"""

data = wineq[:, :-1] # Selects all rows and all columns except the last
data, data.shape

(tensor([[ 7.0000,  0.2700,  0.3600,  ...,  3.0000,  0.4500,  8.8000],
         [ 6.3000,  0.3000,  0.3400,  ...,  3.3000,  0.4900,  9.5000],
         [ 8.1000,  0.2800,  0.4000,  ...,  3.2600,  0.4400, 10.1000],
         ...,
         [ 6.5000,  0.2400,  0.1900,  ...,  2.9900,  0.4600,  9.4000],
         [ 5.5000,  0.2900,  0.3000,  ...,  3.3400,  0.3800, 12.8000],
         [ 6.0000,  0.2100,  0.3800,  ...,  3.2600,  0.3200, 11.8000]],
        dtype=torch.float64),
 torch.Size([4898, 11]))

In [45]:
target = wineq[:, -1] # Selects all rows and the last column
target, target.shape

(tensor([6., 6., 6.,  ..., 6., 7., 6.], dtype=torch.float64),
 torch.Size([4898]))

In [48]:
# Convert the targets to an integer vector of scores, which we then one-hot encode
# (see target.unsqueeze() below). We could also keep the targets as this integer vector.

target = wineq[:, -1].long()
target

tensor([6, 6, 6,  ..., 6, 7, 6])

In [50]:
# We can now treat the targets as an integer vector of scores, or one-hot encode them.
# One-hot encoding has the advantage of not implying (and thus embedding) ordering or distance between targets

target_onehot = torch.zeros(target.shape[0], 10)
target_onehot.scatter_(1, target.unsqueeze(1), 1.0) # trailing underscore == in place operation
target_onehot

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
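As a cross-check on the `scatter_` approach, here is a sketch comparing it with `torch.nn.functional.one_hot` on a short made-up target vector (the four scores are assumptions, not rows from the dataset). Both should yield the same encoding once the result of `one_hot` is cast to float.

```python
import torch
import torch.nn.functional as F

# Hypothetical quality scores; must be int64 for both approaches.
target = torch.tensor([6, 5, 7, 6])

# scatter_: write 1.0 into column target[i] of row i, in place.
onehot_scatter = torch.zeros(target.shape[0], 10)
onehot_scatter.scatter_(1, target.unsqueeze(1), 1.0)

# one_hot returns an int64 tensor, so cast it to float for comparison.
onehot_fn = F.one_hot(target, num_classes=10).float()

print(torch.equal(onehot_scatter, onehot_fn))
print(onehot_fn.argmax(dim=1))   # recovers the original scores
```

`target.unsqueeze(1)` is needed for `scatter_` because the index tensor must have the same number of dimensions as the tensor being written to.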