How to represent different types of data with tensors. Types covered: images, tabular, time series, text.

## Working with images

#### Loading

In [1]:
import imageio

img_arr = imageio.imread("../data/p1ch4/image-dog/bobby.jpg")
img_arr.shape # H x W x C

(1280, 855, 3)

In [2]:
import torch

img = torch.from_numpy(img_arr)
out = img.permute(2, 0, 1) # Permute to fit C x H x W dimension ordering
out.shape

# Note: permute doesn't copy the image data. It returns a new tensor that shares the same storage
# as the original and only changes the size and stride information.

torch.Size([3, 1280, 855])
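To make the note about `permute` concrete, here is a minimal sketch on a small synthetic tensor (a stand-in for the dog image, not the real file). It suggests that the permuted tensor shares storage with the original and that only the shape and strides differ.

```python
import torch

# Hypothetical stand-in for an H x W x C image (4 x 5 pixels, 3 channels)
img = torch.zeros(4, 5, 3, dtype=torch.uint8)
out = img.permute(2, 0, 1)  # C x H x W view of the same data

# Writing through the original is visible through the permuted tensor
img[0, 0, 0] = 255

print(out.shape)                         # torch.Size([3, 4, 5])
print(img.stride(), out.stride())        # (15, 3, 1) (1, 15, 3)
print(out[0, 0, 0].item())               # 255
print(img.data_ptr() == out.data_ptr())  # True
```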

In [3]:
"""
To create a dataset of multiple images to use as input for our neural networks, we store the images
in a batch along the first dimension to obtain an N x C x H x W tensor.
"""

# An efficient way to create a batch is pre-allocation followed by loading from a directory

batch_size = 3
batch = torch.zeros(batch_size, 3, 256, 256, dtype=torch.uint8)

In [4]:
import os

data_dir = '../data/p1ch4/image-cats/'
filenames = [name for name in os.listdir(data_dir)
             if os.path.splitext(name)[-1] == ".png"] # Condition ensures images used are of a desired format.

for i, filename in enumerate(filenames):

    img_arr = imageio.imread(os.path.join(data_dir, filename))
    img_t = torch.from_numpy(img_arr)
    img_t = img_t.permute(2, 0, 1)
    img_t = img_t[:3]  # Keep only the first three (RGB) channels, dropping any alpha channel
    batch[i] = img_t

#### Normalising

In [5]:
batch[0]

tensor([[[ 90,  91,  93,  ..., 191, 191, 191],
         [ 91,  91,  93,  ..., 191, 191, 191],
         [ 91,  92,  93,  ..., 192, 192, 192],
         ...,
         [206, 210, 213,  ..., 220, 219, 218],
         [209, 214, 214,  ..., 221, 220, 219],
         [212, 212, 212,  ..., 219, 218, 218]],

        [[108, 109, 111,  ..., 201, 201, 201],
         [109, 109, 111,  ..., 201, 201, 201],
         [109, 110, 111,  ..., 202, 202, 202],
         ...,
         [198, 202, 205,  ..., 214, 213, 212],
         [201, 206, 206,  ..., 213, 212, 211],
         [204, 204, 204,  ..., 211, 210, 210]],

        [[120, 121, 123,  ..., 210, 210, 210],
         [121, 121, 123,  ..., 210, 210, 210],
         [121, 122, 123,  ..., 211, 211, 211],
         ...,
         [198, 202, 205,  ..., 214, 213, 212],
         [201, 206, 206,  ..., 213, 212, 211],
         [204, 204, 204,  ..., 211, 210, 210]]], dtype=torch.uint8)

In [6]:
# Neural networks tend to train best when input values fall roughly in the range [0, 1] or [-1, 1]

batch = batch.float()
batch /= 255.0
batch[0]

tensor([[[0.3529, 0.3569, 0.3647,  ..., 0.7490, 0.7490, 0.7490],
         [0.3569, 0.3569, 0.3647,  ..., 0.7490, 0.7490, 0.7490],
         [0.3569, 0.3608, 0.3647,  ..., 0.7529, 0.7529, 0.7529],
         ...,
         [0.8078, 0.8235, 0.8353,  ..., 0.8627, 0.8588, 0.8549],
         [0.8196, 0.8392, 0.8392,  ..., 0.8667, 0.8627, 0.8588],
         [0.8314, 0.8314, 0.8314,  ..., 0.8588, 0.8549, 0.8549]],

        [[0.4235, 0.4275, 0.4353,  ..., 0.7882, 0.7882, 0.7882],
         [0.4275, 0.4275, 0.4353,  ..., 0.7882, 0.7882, 0.7882],
         [0.4275, 0.4314, 0.4353,  ..., 0.7922, 0.7922, 0.7922],
         ...,
         [0.7765, 0.7922, 0.8039,  ..., 0.8392, 0.8353, 0.8314],
         [0.7882, 0.8078, 0.8078,  ..., 0.8353, 0.8314, 0.8275],
         [0.8000, 0.8000, 0.8000,  ..., 0.8275, 0.8235, 0.8235]],

        [[0.4706, 0.4745, 0.4824,  ..., 0.8235, 0.8235, 0.8235],
         [0.4745, 0.4745, 0.4824,  ..., 0.8235, 0.8235, 0.8235],
         [0.4745, 0.4784, 0.4824,  ..., 0.8275, 0.8275, 0.

In [7]:
"""
We may also want to compute the mean and stdev of the input data and scale it
so that the output has zero mean and unit stdev across each channel.

Torch provides functions for calculating these for tensors.
"""

n_channels = batch.shape[1]
for c in range(n_channels):
    mean = torch.mean(batch[:, c])
    std = torch.std(batch[:, c])
    batch[:, c] = (batch[:, c] - mean) / std
    
# NOTE: it's good practice to compute the mean and stdev on all training data in
# advance and then subtract and divide by these fixed, precomputed quantities.
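
One way to follow this advice is sketched below, using random tensors as stand-ins for a training batch and a later batch (the names `train_batch` and `new_batch` are illustrative and not from the original). The per-channel statistics are computed once over the batch, height, and width dimensions and then reused.

```python
import torch

torch.manual_seed(0)
train_batch = torch.rand(8, 3, 16, 16)  # stand-in N x C x H x W training data
new_batch = torch.rand(2, 3, 16, 16)    # stand-in data seen later

# One mean and stdev per channel; keepdim=True gives shape 1 x C x 1 x 1 for broadcasting
channel_mean = train_batch.mean(dim=(0, 2, 3), keepdim=True)
channel_std = train_batch.std(dim=(0, 2, 3), keepdim=True)

# Reuse the same fixed statistics for every batch
train_normalised = (train_batch - channel_mean) / channel_std
new_normalised = (new_batch - channel_mean) / channel_std

print(channel_mean.shape)  # torch.Size([1, 3, 1, 1])
```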

### 3D images

In some domains, such as CT scanning, sequences of images are stacked along the head-to-foot axis, with each image a slice of the body.

By stacking individual 2D slices into a 3D tensor, we can build _volumetric data_ representing the 3D anatomy of a subject. Storing volumetric data is just like storing image data, except that an extra dimension, _depth_, comes after the standard channel dimension, resulting in a 5D tensor of shape `N x C x D x H x W`.

In [8]:
# Loading the specialised format
# Volumetric data can be downloaded from: https://github.com/deep-learning-with-pytorch/dlwpt-code/tree/master/data/p1ch4/volumetric-dicom/2-LUNG%203.0%20%20B70f-04083

import imageio
dir_path = "../data/p1ch4/volumetric-dicom/2-LUNG 3.0  B70f-04083"
vol_arr = imageio.volread(dir_path, 'DICOM')
vol_arr.shape

Reading DICOM (examining files): 99/99 files (100.0%)
  Found 1 correct series.
Reading DICOM (loading data): 99/99  (100.0%)


(99, 512, 512)
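
To reach the `N x C x D x H x W` layout described above, the loaded volume still needs channel and batch dimensions. A minimal sketch, assuming a single-channel volume and using a zero-filled NumPy array as a stand-in for `vol_arr` (with smaller height and width to keep it light):

```python
import numpy as np
import torch

# Stand-in for vol_arr: D x H x W, same depth as the loaded CT volume
vol_arr = np.zeros((99, 64, 64), dtype=np.float32)

vol = torch.from_numpy(vol_arr)
vol = vol.unsqueeze(0)  # add a channel dimension: 1 x D x H x W
vol = vol.unsqueeze(0)  # add a batch dimension:   1 x 1 x D x H x W

print(vol.shape)  # torch.Size([1, 1, 99, 64, 64])
```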

## Representing tabular data

In [27]:
# As pandas is recommended, I used pandas instead of numpy (which is used in the book),
# as I need to get comfortable with it.

import pandas as pd

wine_path = "../data/p1ch4/tabular-wine/winequality-white.csv"
wineq_df = pd.read_csv(wine_path, delimiter=";")
col_list = list(wineq_df.columns)
wineq_df[:10]

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 5 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 6 | 6.2 | 0.32 | 0.16 | 7.0 | 0.045 | 30.0 | 136.0 | 0.9949 | 3.18 | 0.47 | 9.6 | 6 |
| 7 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 8 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 9 | 8.1 | 0.22 | 0.43 | 1.5 | 0.044 | 28.0 | 129.0 | 0.9938 | 3.22 | 0.45 | 11.0 | 6 |


In [28]:
wineq_df.shape, wineq_df.columns

((4898, 12),
 Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
        'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
        'pH', 'sulphates', 'alcohol', 'quality'],
       dtype='object'))

In [29]:
wineq = torch.from_numpy(wineq_df.values)
wineq.shape, wineq.dtype

(torch.Size([4898, 12]), torch.float64)

In [30]:
"""
Choices with the "quality" score:

1. Treat as a continuous variable, keep it as a real number, perform a regression task
    (by definition real number prediction)
2. Treat it as a label and try to guess the label from the chemical analysis in a classification task

Regardless, we separate this out so it's not an input which influences model training,
as this is what we're trying to predict.
"""

data = wineq[:, :-1] # Selects all rows and all columns except the last
data, data.shape

(tensor([[ 7.0000,  0.2700,  0.3600,  ...,  3.0000,  0.4500,  8.8000],
         [ 6.3000,  0.3000,  0.3400,  ...,  3.3000,  0.4900,  9.5000],
         [ 8.1000,  0.2800,  0.4000,  ...,  3.2600,  0.4400, 10.1000],
         ...,
         [ 6.5000,  0.2400,  0.1900,  ...,  2.9900,  0.4600,  9.4000],
         [ 5.5000,  0.2900,  0.3000,  ...,  3.3400,  0.3800, 12.8000],
         [ 6.0000,  0.2100,  0.3800,  ...,  3.2600,  0.3200, 11.8000]],
        dtype=torch.float64),
 torch.Size([4898, 11]))

In [31]:
target = wineq[:, -1] # Selects all rows and the last column
target, target.shape

(tensor([6., 6., 6.,  ..., 6., 7., 6.], dtype=torch.float64),
 torch.Size([4898]))

In [32]:
# Convert the targets to an integer vector of scores, which we one-hot encode below (using target.unsqueeze()).
# We could also keep the scores as plain integer labels.

target = wineq[:, -1].long()
target

tensor([6, 6, 6,  ..., 6, 7, 6])

### One-hot encoding

In [33]:
# We can now treat the targets as an integer vector of scores, or one-hot encode them.
# One-hot encoding has the advantage of not implying (and thus embedding) ordering or distance between targets

target_onehot = torch.zeros(target.shape[0], 10)
target_onehot.scatter_(1, target.unsqueeze(1), 1.0) # trailing underscore == in place operation
target_onehot

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

What `scatter_` does:

For each row, it takes the value in the index tensor (here the score itself, since the score doubles as the class index)
and uses it as the column index (hence the first argument, dim=1) at which to write the value 1.0.

The result is a one-hot encoded tensor.

Note: unsqueeze is necessary because the index tensor is required to have the same number of dimensions
as the tensor we scatter into. Unsqueeze just adds a singleton dimension for this purpose. It doesn't
affect the existing values of the index tensor.
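
A small worked example may help. The sketch below uses three made-up labels and three classes (not the wine data) and compares the result with `torch.nn.functional.one_hot`, which should produce the same encoding as an integer tensor.

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 2, 1])  # made-up class indexes
index = labels.unsqueeze(1)       # shape 3 x 1: same number of dims as the tensor we scatter into

onehot = torch.zeros(labels.shape[0], 3)
onehot.scatter_(1, index, 1.0)    # row i gets a 1.0 in column labels[i]

print(onehot)
# tensor([[1., 0., 0.],
#         [0., 0., 1.],
#         [0., 1., 0.]])

# Cross-check against the built-in helper (it returns int64, so cast to float)
same = torch.equal(onehot, F.one_hot(labels, num_classes=3).float())
print(same)  # True
```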

In [34]:
# Obtaining mean and standard deviations for each column

data_mean = torch.mean(data, dim=0)
data_mean

tensor([6.8548e+00, 2.7824e-01, 3.3419e-01, 6.3914e+00, 4.5772e-02, 3.5308e+01,
        1.3836e+02, 9.9403e-01, 3.1883e+00, 4.8985e-01, 1.0514e+01],
       dtype=torch.float64)

In [35]:
data_var = torch.var(data, dim=0) # variance
data_var

# At this point we can normalize the data by subtracting the mean and dividing by the stdev, which
# helps the learning process.

tensor([7.1211e-01, 1.0160e-02, 1.4646e-02, 2.5726e+01, 4.7733e-04, 2.8924e+02,
        1.8061e+03, 8.9455e-06, 2.2801e-02, 1.3025e-02, 1.5144e+00],
       dtype=torch.float64)

In [36]:
# Normalising data
data_normalised = (data - data_mean) / torch.sqrt(data_var)
data_normalised

# NOTE: you can also use torch.std to get the stdev immediately.

tensor([[ 1.7208e-01, -8.1762e-02,  2.1326e-01,  ..., -1.2468e+00,
         -3.4915e-01, -1.3930e+00],
        [-6.5743e-01,  2.1587e-01,  4.7996e-02,  ...,  7.3995e-01,
          1.3417e-03, -8.2419e-01],
        [ 1.4756e+00,  1.7450e-02,  5.4378e-01,  ...,  4.7505e-01,
         -4.3677e-01, -3.3663e-01],
        ...,
        [-4.2043e-01, -3.7940e-01, -1.1915e+00,  ..., -1.3130e+00,
         -2.6153e-01, -9.0545e-01],
        [-1.6054e+00,  1.1666e-01, -2.8253e-01,  ...,  1.0049e+00,
         -9.6251e-01,  1.8574e+00],
        [-1.0129e+00, -6.7703e-01,  3.7852e-01,  ...,  4.7505e-01,
         -1.4882e+00,  1.0448e+00]], dtype=torch.float64)
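
As the note in the cell says, `torch.std` gives the standard deviation directly. The sketch below checks this on a small random tensor standing in for `data`. Both `torch.var` and `torch.std` apply Bessel's correction by default, so the two routes should agree.

```python
import torch

torch.manual_seed(0)
sample = torch.rand(100, 4, dtype=torch.float64)  # stand-in for the wine data

sample_mean = sample.mean(dim=0)
via_var = (sample - sample_mean) / torch.sqrt(sample.var(dim=0))
via_std = (sample - sample_mean) / sample.std(dim=0)

print(torch.allclose(via_var, via_std))  # True
```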

### Finding thresholds

In [37]:
# Find wines with scores <= 3
bad_indexes = target <= 3
bad_indexes.shape, bad_indexes.dtype, bad_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(20))

In [38]:
# With advanced indexing, we can use this bool tensor (4898 entries, 20 of them True) to index the data tensor.
# This works like filtering -> give me only items corresponding to True in the indexing tensor.

bad_data = data[bad_indexes] # super neat
bad_data.shape

# So, done quickly... bad_data = data[target <= 3]

torch.Size([20, 11])
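
The same filtering on a tiny made-up tensor shows the mechanics: the mask has one entry per row, and only the rows where it is `True` are kept.

```python
import torch

scores = torch.tensor([6, 3, 7, 2])   # made-up targets
rows = torch.tensor([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0],
                     [4.0, 40.0]])    # made-up data, one row per score

mask = scores <= 3                    # True for the second and fourth rows
selected = rows[mask]

print(mask.shape, mask.sum().item())  # torch.Size([4]) 2
print(selected.shape)                 # torch.Size([2, 2])
```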

In [39]:
# Get information about wine grouped into good, middle, and bad categories

bad_data = data[target <= 3]
mid_data = data[(target > 3) & (target < 7)]
good_data = data[target >= 7]

bad_mean = torch.mean(bad_data, dim=0)
mid_mean = torch.mean(mid_data, dim=0)
good_mean = torch.mean(good_data, dim=0)

for i, args in enumerate(zip(col_list, bad_mean, mid_mean, good_mean)):
    print('{:2} {:20} {:6.2f} {:6.2f} {:6.2f}'.format(i, *args))

 0 fixed acidity          7.60   6.89   6.73
 1 volatile acidity       0.33   0.28   0.27
 2 citric acid            0.34   0.34   0.33
 3 residual sugar         6.39   6.71   5.26
 4 chlorides              0.05   0.05   0.04
 5 free sulfur dioxide   53.33  35.42  34.55
 6 total sulfur dioxide 170.60 141.83 125.25
 7 density                0.99   0.99   0.99
 8 pH                     3.19   3.18   3.22
 9 sulphates              0.47   0.49   0.50
10 alcohol               10.35  10.26  11.42


In [44]:
# Bad wines seem to have a higher total sulfur dioxide.
# A threshold on this could be a crude criterion for telling good wines from bad ones

total_sulfur_threshold = 141.83
total_sulfur_data = data[:, 6] # All rows of column 6.
# Indexing is [row, column]: dim=0 runs over rows and dim=1 over columns. Reducing over dim=0
# (as in torch.mean(data, dim=0)) collapses the rows and leaves one value per column.

predicted_indexes = torch.lt(total_sulfur_data, total_sulfur_threshold)
predicted_indexes.shape, predicted_indexes.dtype, predicted_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(2727))

In [45]:
actual_indexes = target > 5
actual_indexes.shape, actual_indexes.dtype, actual_indexes.sum()
# There are about 500 more good wines than the threshold predicted

(torch.Size([4898]), torch.bool, tensor(3258))

In [49]:
# How well did we do relative to the actual ranking?

n_matches = torch.sum(actual_indexes & predicted_indexes).item() # .item() extracts the Python number from a one-element tensor
n_predicted = torch.sum(predicted_indexes).item()
n_actual = torch.sum(actual_indexes).item()

n_matches, n_matches / n_predicted, n_matches / n_actual

(2018, 0.74000733406674, 0.6193984039287906)

Just over 2,000 wines were correctly predicted as good: 74% of the wines we predicted as good really are good (precision), and we identified about 62% of the actual good wines (recall).

Slightly better than random, but ultimately we know that multiple variables in fact contribute to wine quality, and their relationship with the outcome is likely more complicated than a threshold on a single value can account for.

A simple neural network could overcome this limitation (to come later).
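
The two ratios computed above are commonly called precision (`n_matches / n_predicted`) and recall (`n_matches / n_actual`). A minimal sketch on made-up boolean tensors, using a hypothetical helper `precision_recall` that is not part of the original notebook:

```python
import torch

def precision_recall(predicted, actual):
    """Return (precision, recall) for two boolean tensors of equal shape.

    Assumes at least one True in each tensor; otherwise the division fails.
    """
    n_matches = torch.sum(predicted & actual).item()
    n_predicted = torch.sum(predicted).item()
    n_actual = torch.sum(actual).item()
    return n_matches / n_predicted, n_matches / n_actual

# Made-up predictions and ground truth for five items
predicted = torch.tensor([True, True, False, True, False])
actual = torch.tensor([True, False, True, True, True])

precision, recall = precision_recall(predicted, actual)
print(precision, recall)  # 2 of 3 predictions are right; 2 of 4 actual positives are found
```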

## Working with time series

We now switch to the Washington, D.C. bike-sharing system dataset.

It reports the hourly count of rental bikes in the Capital Bikeshare system for 2011-2012, along with weather and seasonal information.

In [51]:
import numpy as np

bikes_numpy = np.loadtxt(
    "../data/p1ch4/bike-sharing-dataset/hour-fixed.csv",
    dtype=np.float32,
    delimiter=",",
    skiprows=1,
    # Convert date strings to numbers corresponding to the day of the month in column 1
    converters={1: lambda x: float(x[8:10])}
)

bikes = torch.from_numpy(bikes_numpy)
bikes

tensor([[1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 3.0000e+00, 1.3000e+01,
         1.6000e+01],
        [2.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 8.0000e+00, 3.2000e+01,
         4.0000e+01],
        [3.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 5.0000e+00, 2.7000e+01,
         3.2000e+01],
        ...,
        [1.7377e+04, 3.1000e+01, 1.0000e+00,  ..., 7.0000e+00, 8.3000e+01,
         9.0000e+01],
        [1.7378e+04, 3.1000e+01, 1.0000e+00,  ..., 1.3000e+01, 4.8000e+01,
         6.1000e+01],
        [1.7379e+04, 3.1000e+01, 1.0000e+00,  ..., 1.2000e+01, 3.7000e+01,
         4.9000e+01]])

For every **hour**, the dataset reports the following variables:

- Index of record: instant
- Day of month: day
- Season: season (1: spring, 2: summer, 3: fall, 4: winter)
- Year: yr (0: 2011, 1: 2012)
- Month: mnth (1 to 12)
- Hour: hr (0 to 23)
- Holiday status: holiday
- Day of the week: weekday
- Working day status: workingday
- Weather situation: weathersit (1: clear, 2: mist, 3: light rain/snow, 4: heavy rain/snow)
- Temperature in °C: temp
- Perceived temperature in °C: atemp
- Humidity: hum
- Wind speed: windspeed
- Number of casual (non-registered) users: casual
- Number of registered users: registered
- Count of rental bikes: cnt

Rows represent successive time points: this is the dimension along which they are ordered. Ordering provides an opportunity to exploit causal temporal relationships.

In [52]:
# Shaping the data into 24-hour chunks (time periods). 24-hour is chosen based on the size of the dataset.
# Result will be daily sequences of ride counts and other exogenous variables.
# torch.sort would order appropriately if not already

bikes.shape, bikes.stride() # 17520 hours, 17 columns.

(torch.Size([17520, 17]), (17, 1))

In [54]:
# Reshape the data to have 3 axes - day, hour, then 17 columns

daily_bikes = bikes.view(-1, 24, bikes.shape[1])
daily_bikes.shape, daily_bikes.stride()
# Result: N sequences of L hours each, with C channels (here the channels are the 17 columns).

(torch.Size([730, 24, 17]), (408, 17, 1))

NOTE: `view` changes how a tensor looks at the data already held in storage. It returns a new tensor with a different shape and stride information, without copying or changing the storage, so the rearrangement is essentially free.

We have to provide the shape of the new tensor to `view`. -1 is used as a placeholder for "however many indexes are left, given the other dimensions and the original number of elements."
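
A minimal sketch of `view` with a `-1` placeholder, on a small made-up tensor rather than the bike data:

```python
import torch

hours = torch.arange(48 * 3).reshape(48, 3)  # 48 made-up "hours", 3 columns
days = hours.view(-1, 24, hours.shape[1])    # -1 is inferred as 48 / 24 = 2

print(days.shape, days.stride())             # torch.Size([2, 24, 3]) (72, 3, 1)

# Both tensors look at the same storage, so no data was copied
print(hours.data_ptr() == days.data_ptr())   # True
```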

In [55]:
# For N x C x L ordering, transpose
daily_bikes = daily_bikes.transpose(1, 2)
daily_bikes.shape, daily_bikes.stride()

(torch.Size([730, 17, 24]), (408, 1, 17))
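
The strides above reflect that the transposed tensor is no longer contiguous in memory. A brief sketch on a made-up tensor: `contiguous()` makes a copy laid out in the new order, which some operations (such as `view`) require.

```python
import torch

x = torch.arange(2 * 4 * 3).reshape(2, 4, 3)  # made-up N x L x C tensor
xt = x.transpose(1, 2)                        # N x C x L, same storage as x

print(xt.shape, xt.stride())                  # torch.Size([2, 3, 4]) (12, 1, 3)
print(xt.is_contiguous())                     # False

xc = xt.contiguous()                          # copies the data into N x C x L order
print(xc.stride(), torch.equal(xt, xc))       # (12, 4, 1) True
```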

The weather situation variable is ordinal: 1 is the best weather and 4 the worst. Treating it as categorical requires one-hot encoding. For the first day:

In [56]:
first_day = bikes[:24].long()
weather_onehot = torch.zeros(first_day.shape[0], 4)
first_day[:, 9]

tensor([1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2])

In [57]:
# Scatter ones into the matrix according to the corresponding level at each row.

weather_onehot.scatter_(
    dim=1,
    index=first_day[:,9].unsqueeze(1).long() - 1,
    value=1.0
)

tensor([[1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.]])

In [58]:
# Now concatenate the matrix to the original dataset
torch.cat((bikes[:24], weather_onehot), 1)[:1]

tensor([[ 1.0000,  1.0000,  1.0000,  0.0000,  1.0000,  0.0000,  0.0000,  6.0000,
          0.0000,  1.0000,  0.2400,  0.2879,  0.8100,  0.0000,  3.0000, 13.0000,
         16.0000,  1.0000,  0.0000,  0.0000,  0.0000]])

In [62]:
# Alternative: one-hot encode every day at once, along the channel dimension of daily_bikes

daily_weather_onehot = torch.zeros(daily_bikes.shape[0], 4, daily_bikes.shape[2])
daily_weather_onehot.shape

torch.Size([730, 4, 24])

In [63]:
# Subtract 1 from the index tensor so that weather levels 1-4 map to channel indexes 0-3
daily_weather_onehot.scatter_(1, daily_bikes[:, 9, :].long().unsqueeze(1) - 1, 1.0)
daily_weather_onehot.shape

torch.Size([730, 4, 24])
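
For reference, here is a self-contained sketch of the per-day encoding on synthetic data (a random stand-in for `daily_bikes`, with weather levels 1 to 4 placed in channel 9). The `- 1` is applied to the index tensor so that levels 1 to 4 map to channel indexes 0 to 3, and the one-hot channels are then concatenated along the channel dimension.

```python
import torch

torch.manual_seed(0)
n_days, n_channels, n_hours = 5, 17, 24

# Stand-in for daily_bikes (N x C x L) with made-up weather levels 1..4 in channel 9
daily = torch.rand(n_days, n_channels, n_hours)
daily[:, 9, :] = torch.randint(1, 5, (n_days, n_hours)).float()

# Index tensor must have the same number of dims as the target: N x 1 x L, values 0..3
weather_index = daily[:, 9, :].long().unsqueeze(1) - 1

daily_onehot = torch.zeros(n_days, 4, n_hours)
daily_onehot.scatter_(1, weather_index, 1.0)

# Append the four one-hot channels to the original 17
daily_with_weather = torch.cat((daily, daily_onehot), dim=1)
print(daily_with_weather.shape)  # torch.Size([5, 21, 24])
```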
