# Real world data
In this notebook we will learn how to handle different types of data: 2D images, 3D images, tabular data, time series and text.

## 2D images

In [None]:
!git clone https://github.com/deep-learning-with-pytorch/dlwpt-code.git

Cloning into 'dlwpt-code'...
remote: Enumerating objects: 703, done.
remote: Total 703 (delta 0), reused 0 (delta 0), pack-reused 703
Receiving objects: 100% (703/703), 176.00 MiB | 21.27 MiB/s, done.
Resolving deltas: 100% (309/309), done.
Checking out files: 100% (228/228), done.


In [None]:
cd /content/dlwpt-code/

/content/dlwpt-code


In [None]:
import numpy as np
import torch
torch.set_printoptions(edgeitems=2, threshold=50)

In [None]:
import imageio

img_arr = imageio.imread('data/p1ch4/image-dog/bobby.jpg')
img_arr.shape

(720, 1280, 3)

Several Python packages represent an image as a three-dimensional NumPy array, in this case laid out as [height, width, channels]. PyTorch modules that work with images expect the channels in the first dimension, followed by height and width: [channels, height, width]. To work with the image in PyTorch we therefore have to change its layout by moving the channel dimension from the last position to the first. `permute` does this by changing only the tensor's metadata; the underlying pixel data is not copied.

In [None]:
img = torch.from_numpy(img_arr)
out = img.permute(2, 0, 1)
out.shape

torch.Size([3, 720, 1280])
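As a small check of the statement that only the metadata changes, the sketch below uses a tiny synthetic array instead of the dog picture (the 4 x 5 size and the variable contents are assumptions made for illustration) and compares shapes, strides and storage before and after `permute`.

```python
import numpy as np
import torch

# Hypothetical stand-in for a decoded image: 4 x 5 pixels, 3 channels (H x W x C)
img_arr = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)

img = torch.from_numpy(img_arr)   # shares memory with img_arr
out = img.permute(2, 0, 1)        # C x H x W, still the same storage

print(img.shape, img.stride())    # torch.Size([4, 5, 3]) (15, 3, 1)
print(out.shape, out.stride())    # torch.Size([3, 4, 5]) (1, 15, 3)

# Writing through the permuted view changes the original array too
out[0, 0, 0] = 255
print(img_arr[0, 0, 0])           # 255
```

Because the storage is shared, `out` should be cloned first if the original array must stay unchanged.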

We want to create a batch of three RGB images. To do so we create a four-dimensional tensor of zeros, [batch size, channels, height, width], that we will fill with images 256 pixels high and 256 pixels wide.

In [None]:
batch_size = 3
channels = 3
height = 256
width = 256
batch = torch.zeros(batch_size, channels, height, width, dtype=torch.uint8)
batch.shape

torch.Size([3, 3, 256, 256])

In [None]:
import os

data_dir = 'data/p1ch4/image-cats/'
filenames = [name for name in os.listdir(data_dir)
             if os.path.splitext(name)[-1] == '.png']
for i, filename in enumerate(filenames):
    img_arr = imageio.imread(os.path.join(data_dir, filename))
    img_t = torch.from_numpy(img_arr)
    img_t = img_t.permute(2, 0, 1)
    img_t = img_t[:3]  # keep only the first three (RGB) channels, dropping e.g. an alpha channel
    batch[i] = img_t

In [None]:
batch = batch.float()
batch /= 255.0

In [None]:
n_channels = batch.shape[1]
for c in range(n_channels):
    mean = torch.mean(batch[:, c])
    std = torch.std(batch[:, c])
    batch[:, c] = (batch[:, c] - mean) / std
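The loop above standardizes each channel so that it has zero mean and unit standard deviation over the whole batch. A minimal sketch of the same idea without the loop follows; the random batch and its 8 x 8 size are assumptions made for illustration, not the cat images.

```python
import torch

torch.manual_seed(0)
# Hypothetical batch of 3 RGB images of 8 x 8 pixels with values in [0, 1]
batch = torch.rand(3, 3, 8, 8)

# One mean and one standard deviation per channel, kept broadcastable
mean = batch.mean(dim=(0, 2, 3), keepdim=True)   # shape [1, 3, 1, 1]
std = batch.std(dim=(0, 2, 3), keepdim=True)     # shape [1, 3, 1, 1]
normalized = (batch - mean) / std

print(normalized.mean(dim=(0, 2, 3)))  # close to 0 for every channel
print(normalized.std(dim=(0, 2, 3)))   # close to 1 for every channel
```

Reducing over dimensions 0, 2 and 3 leaves one statistic per channel, which is what the loop computes one channel at a time.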

## 3D images
3D images are used in the medical domain, for example in Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), where a scan is stored as a stack of 2D images: each image is a slice of the body and the index along the stack corresponds to the position along the body. They are also used in geophysics, where one dimension can be the height and the other two represent a surface or layer. CT scans are single-band images, that is, they have only one channel. A complete dataset can be represented as a five-dimensional tensor:   
[batch size, channels, depth, height, width]  

We read the 99 DICOM files of one series and stack them into a single NumPy array.

In [None]:
data_dir = 'data/p1ch4/volumetric-dicom/LUNG-IMAGES/'
vol_arr = imageio.volread(data_dir, 'DICOM')
vol_arr.shape

Reading DICOM (examining files): 99/99 files (100.0%)
  Found 1 correct series.
Reading DICOM (loading data): 99/99  (100.0%)


(99, 512, 512)

We convert the array of integers into a PyTorch tensor of floats and add a leading dimension for the channel, so that we get one channel with depth=99 and images of height=512 and width=512.

In [None]:
vol = torch.from_numpy(vol_arr).float()
vol = torch.unsqueeze(vol, 0)
vol.shape

torch.Size([1, 99, 512, 512])
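To reach the five-dimensional layout [batch size, channels, depth, height, width] one more dimension is needed. The sketch below uses a small synthetic volume (an assumption, standing in for the CT scan) to show how `unsqueeze` and `torch.stack` build that layout.

```python
import torch

# Hypothetical small volume standing in for a CT scan: depth=4, height=8, width=8
vol_arr = torch.arange(4 * 8 * 8, dtype=torch.int16).reshape(4, 8, 8)

vol = vol_arr.float().unsqueeze(0)   # add the channel dimension -> [1, 4, 8, 8]
batch = vol.unsqueeze(0)             # add the batch dimension   -> [1, 1, 4, 8, 8]

# Several volumes of the same size can be stacked along a new batch dimension
two_scans = torch.stack([vol, vol + 1.0])   # -> [2, 1, 4, 8, 8]
print(vol.shape, batch.shape, two_scans.shape)
```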

## Tabular data
We open a tabular dataset that contains values for several characteristics used to assess the quality of wines. We import the data into a NumPy array and then into a PyTorch tensor.

In [None]:
wine_path = 'data/p1ch4/tabular-wine/winequality-white.csv'
wineq_numpy = np.loadtxt(wine_path, dtype=np.float32, delimiter=";", skiprows=1)
wineq_numpy.shape

(4898, 12)

In [None]:
wineq = torch.from_numpy(wineq_numpy)
wineq.shape, wineq.dtype

(torch.Size([4898, 12]), torch.float32)

We read the column names from the header row of the CSV file.

In [None]:
import csv
col_list = next(csv.reader(open(wine_path), delimiter=';'))
col_list

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality']

We want to see how the chemical measurements, such as the sulphates content, relate to the quality of the wines. Predicting the quality score from these measurements can be treated as a regression task. We separate the data from the labels, which in this case are the values of the 'quality' column.

In [None]:
data = wineq[:, :-1]
data.shape

torch.Size([4898, 11])

In [None]:
label = wineq[:, -1].long()
label.shape

torch.Size([4898])

### Continuous values
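The label, that is the quality of the wine, is an integer between 3 and 9. Treating it as a continuous value keeps the ordering of the scores and makes sense if we assume that the distance between scores is meaningful, for example that a wine with quality=9 is as much better than a wine with quality=6 as that wine is better than one with quality=3.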

In [None]:
label.min()

tensor(3)

In [None]:
label.max()

tensor(9)

### Categorical values
In other cases the label might only indicate that an item belongs to a certain class, without any ordering or distance between classes. In this case we can use one-hot encoding, which maps each value to a vector whose elements are all 0 except for a single 1 at the position of its category.

In [None]:
label_onehot = torch.zeros(label.shape[0], 10)
label_onehot.scatter_(1, label.unsqueeze(1), 1.0)
label_onehot[0]

tensor([0., 0., 0., 0., 0., 0., 1., 0., 0., 0.])
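`scatter_` writes the value 1.0 into row `i` at the column given by `label[i]`. The sketch below repeats the step on four made-up scores (an assumption for illustration) and compares the result with `torch.nn.functional.one_hot`, which performs the same encoding.

```python
import torch
import torch.nn.functional as F

label = torch.tensor([6, 3, 9, 5])   # hypothetical quality scores

# Same recipe as above: one row per sample, one column per possible score
label_onehot = torch.zeros(label.shape[0], 10)
label_onehot.scatter_(1, label.unsqueeze(1), 1.0)

# Built-in alternative; it returns integers, so we convert to float to compare
same = F.one_hot(label, num_classes=10).float()
print(torch.equal(label_onehot, same))   # True
print(label_onehot.argmax(dim=1))        # tensor([6, 3, 9, 5])
```

The `argmax` along the columns recovers the original scores, which confirms that no information is lost by the encoding.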

We compute the mean and the variance of each column of our data.

In [None]:
data_mean = data.mean(dim=0)
data_mean

tensor([6.8548e+00, 2.7824e-01, 3.3419e-01, 6.3914e+00, 4.5772e-02, 3.5308e+01,
        1.3836e+02, 9.9403e-01, 3.1883e+00, 4.8985e-01, 1.0514e+01])

In [None]:
data_var = data.var(dim=0)
data_var

tensor([7.1211e-01, 1.0160e-02, 1.4646e-02, 2.5726e+01, 4.7733e-04, 2.8924e+02,
        1.8061e+03, 8.9455e-06, 2.2801e-02, 1.3025e-02, 1.5144e+00])

We can normalize the data by subtracting the mean and dividing by the standard deviation.

In [None]:
data_normalized = (data - data_mean) / torch.sqrt(data_var)
data_normalized

tensor([[ 1.7208e-01, -8.1761e-02,  ..., -3.4915e-01, -1.3930e+00],
        [-6.5743e-01,  2.1587e-01,  ...,  1.3422e-03, -8.2419e-01],
        ...,
        [-1.6054e+00,  1.1666e-01,  ..., -9.6251e-01,  1.8574e+00],
        [-1.0129e+00, -6.7703e-01,  ..., -1.4882e+00,  1.0448e+00]])
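The normalization relies on broadcasting: the per-column statistics have shape [11] and are applied to every row of the [4898, 11] data tensor. A minimal sketch on a made-up table (sizes, scales and offsets are assumptions) checks that each column ends up with mean close to 0 and variance close to 1.

```python
import torch

torch.manual_seed(0)
# Hypothetical table: 100 samples, 3 columns with very different scales
scale = torch.tensor([0.1, 10.0, 100.0])
offset = torch.tensor([1.0, 50.0, -20.0])
data = torch.randn(100, 3) * scale + offset

data_mean = data.mean(dim=0)                 # shape [3], one value per column
data_var = data.var(dim=0)                   # shape [3]
data_normalized = (data - data_mean) / torch.sqrt(data_var)   # broadcast over rows

print(data_normalized.mean(dim=0))   # close to 0 in every column
print(data_normalized.var(dim=0))    # close to 1 in every column
```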

We count the wines with quality less than or equal to 3. Only 20 wines out of 4898 have such a low quality.

In [None]:
bad_indexes = label <= 3
bad_indexes.shape, bad_indexes.dtype, bad_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(20))


We can select these wines from the data tensor by indexing it with the boolean tensor bad_indexes.

In [None]:
bad_data = data[bad_indexes]
bad_data.shape

torch.Size([20, 11])
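Indexing with a boolean tensor keeps only the rows where the mask is `True`. The sketch below shows the mechanism on five made-up samples (values chosen for illustration only).

```python
import torch

# Hypothetical data: 5 samples with 2 attributes each, and their scores
data = torch.tensor([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0],
                     [4.0, 40.0],
                     [5.0, 50.0]])
label = torch.tensor([3, 6, 2, 8, 3])

mask = label <= 3      # boolean tensor, one entry per row
print(mask.sum())      # tensor(3): True counts as 1
print(data[mask])      # rows 0, 2 and 4 of data
```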

It might be interesting to see which attribute values are associated with such a low quality. To compare, we divide the wines into three categories: bad (quality 3 or less), medium (quality from 4 to 6) and good (quality 7 or more).

In [None]:
bad_data = data[label <= 3]
mid_data = data[(label > 3) & (label < 7)]
good_data = data[label >= 7]

In [None]:
bad_mean = torch.mean(bad_data, dim=0)
mid_mean = torch.mean(mid_data, dim=0)
good_mean = torch.mean(good_data, dim=0)

We can print the mean value of each of the 11 attributes for the three categories. The means suggest an inverse relation between sulfur dioxide and quality: on average, a good wine contains less total sulfur dioxide than a medium wine, and a medium wine contains less than a bad one.

In [None]:
for i, args in enumerate(zip(col_list, bad_mean, mid_mean, good_mean)):
  print('{:2} {:20} {:6.2f} {:6.2f} {:6.2f}'.format(i, *args))

 0 fixed acidity          7.60   6.89   6.73
 1 volatile acidity       0.33   0.28   0.27
 2 citric acid            0.34   0.34   0.33
 3 residual sugar         6.39   6.71   5.26
 4 chlorides              0.05   0.05   0.04
 5 free sulfur dioxide   53.33  35.42  34.55
 6 total sulfur dioxide 170.60 141.83 125.25
 7 density                0.99   0.99   0.99
 8 pH                     3.19   3.18   3.22
 9 sulphates              0.47   0.49   0.50
10 alcohol               10.34  10.26  11.42


## Time series

In [None]:
bike_path = 'data/p1ch4/bike-sharing-dataset/hour-fixed.csv'
bikes_numpy = np.loadtxt(bike_path, dtype=np.float32, delimiter=',', skiprows=1, converters={1: lambda x: float(x[8:10])})
bikes_numpy[0]

array([ 1.    ,  1.    ,  1.    ,  0.    ,  1.    ,  0.    ,  0.    ,
        6.    ,  0.    ,  1.    ,  0.24  ,  0.2879,  0.81  ,  0.    ,
        3.    , 13.    , 16.    ], dtype=float32)

In [None]:
bikes = torch.from_numpy(bikes_numpy)
bikes

tensor([[1.0000e+00, 1.0000e+00,  ..., 1.3000e+01, 1.6000e+01],
        [2.0000e+00, 1.0000e+00,  ..., 3.2000e+01, 4.0000e+01],
        ...,
        [1.7378e+04, 3.1000e+01,  ..., 4.8000e+01, 6.1000e+01],
        [1.7379e+04, 3.1000e+01,  ..., 3.7000e+01, 4.9000e+01]])

We want to have a look at the attributes.

In [None]:
col_list = next(csv.reader(open(bike_path), delimiter=','))
col_list

['instant',
 'dteday',
 'season',
 'yr',
 'mnth',
 'hr',
 'holiday',
 'weekday',
 'workingday',
 'weathersit',
 'temp',
 'atemp',
 'hum',
 'windspeed',
 'casual',
 'registered',
 'cnt']

In [None]:
bikes.shape, bikes.stride()

(torch.Size([17520, 17]), (17, 1))

We want to add a dimension so that days and hours are separated: one dimension for the day (730 days), one for the hour (24 hours) and one for the attributes (17 attributes). We do this by creating a new view of the data. A view only creates new metadata (shape and strides) for the same underlying storage, so the same data is interpreted in a different way without being copied.

In [None]:
daily_bikes = bikes.view(-1, 24, bikes.shape[1])
daily_bikes.shape, daily_bikes.stride()

(torch.Size([730, 24, 17]), (408, 17, 1))
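The sketch below reproduces the reshaping on a tiny made-up table (2 days and 3 attributes are assumptions) to show that the view shares its storage with the original tensor and that each day is a block of 24 consecutive rows.

```python
import torch

# Hypothetical hourly table: 2 days x 24 hours = 48 rows, 3 attributes per row
hourly = torch.arange(48 * 3, dtype=torch.float32).reshape(48, 3)

daily = hourly.view(-1, 24, hourly.shape[1])
print(daily.shape, daily.stride())   # torch.Size([2, 24, 3]) (72, 3, 1)

# Both tensors are backed by the same storage
print(daily.data_ptr() == hourly.data_ptr())   # True

# Hour 5 of day 1 is row 24 + 5 of the original table
print(torch.equal(daily[1, 5], hourly[29]))    # True
```

The stride of 72 in the first dimension is 24 hours times 3 attributes: moving to the next day skips one full day of values in storage.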

Each day is represented by a sequence of 24 hourly rows of attributes. The dataset contains N = 17520 / 24 = 730 sequences; each sequence has C=17 channels (attributes) and L=24 ordered data points. We transpose the last two dimensions so that the layout becomes N x C x L.

In [None]:
daily_bikes = daily_bikes.transpose(1, 2)
daily_bikes.shape, daily_bikes.stride()

(torch.Size([730, 17, 24]), (408, 1, 17))
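As the strides show, `transpose` swaps two entries of the metadata and leaves the storage alone, so the result is no longer contiguous. A minimal sketch on made-up data (sizes are assumptions) shows this and how `contiguous()` produces a copy laid out in the new order when an operation requires it.

```python
import torch

# Hypothetical data in N x L x C layout: 2 days, 24 hours, 3 attributes
daily = torch.arange(2 * 24 * 3, dtype=torch.float32).reshape(2, 24, 3)

ncl = daily.transpose(1, 2)          # N x C x L, no data copied
print(ncl.shape, ncl.stride())       # torch.Size([2, 3, 24]) (72, 1, 3)
print(ncl.is_contiguous())           # False

ncl_copy = ncl.contiguous()          # new storage laid out as N x C x L
print(ncl_copy.stride())             # (72, 24, 1)
print(torch.equal(ncl, ncl_copy))    # True: same values, different layout
```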

In [None]:
first_day = bikes[:24].long()
weather_onehot = torch.zeros(first_day.shape[0], 4)
first_day[:,9]

tensor([1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2])

In [None]:
weather_onehot.scatter_(dim=1, index=first_day[:,9].unsqueeze(1).long() - 1, value=1.0)

tensor([[1., 0., 0., 0.],
        [1., 0., 0., 0.],
        ...,
        [0., 1., 0., 0.],
        [0., 1., 0., 0.]])

In [None]:
torch.cat((bikes[:24], weather_onehot), 1)[:1]

tensor([[ 1.0000,  1.0000,  1.0000,  0.0000,  1.0000,  0.0000,  0.0000,  6.0000,
          0.0000,  1.0000,  0.2400,  0.2879,  0.8100,  0.0000,  3.0000, 13.0000,
         16.0000,  1.0000,  0.0000,  0.0000,  0.0000]])

In [None]:
daily_weather_onehot = torch.zeros(daily_bikes.shape[0], 4, daily_bikes.shape[2])

In [None]:
daily_weather_onehot.shape

torch.Size([730, 4, 24])

In [None]:
daily_weather_onehot.scatter_(1, daily_bikes[:,9,:].long().unsqueeze(1) - 1, 1.0)
daily_weather_onehot.shape

torch.Size([730, 4, 24])
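Here `scatter_` works along dimension 1 of a three-dimensional tensor: for every day and hour it writes a 1 into the channel selected by the weather code minus one. The sketch below applies the same call to a made-up 2 x 3 table of codes (an assumption for illustration) so the result can be read by eye.

```python
import torch

# Hypothetical weather codes (1..4) for 2 days x 3 hours
weather = torch.tensor([[1, 2, 4],
                        [3, 3, 1]])

onehot = torch.zeros(weather.shape[0], 4, weather.shape[1])   # N x 4 x L
onehot.scatter_(1, weather.unsqueeze(1) - 1, 1.0)

print(onehot[0])
# tensor([[1., 0., 0.],
#         [0., 1., 0.],
#         [0., 0., 0.],
#         [0., 0., 1.]])

# Every (day, hour) pair has exactly one active channel
print(onehot.sum(dim=1))
```

Each column of `onehot[0]` is one hour of the first day, and the row holding the 1 is the weather code minus one.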

In [None]:
daily_bikes = torch.cat((daily_bikes, daily_weather_onehot), dim=1)

In [None]:
daily_bikes[:, 9, :] = (daily_bikes[:, 9, :] - 1.0) / 3.0
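The last cell treats the weather code as an ordered value and rescales it from the range 1..4 to the range 0..1. A minimal sketch of this kind of rescaling follows; the temperature values are made up, and computing the range from the data (min-max scaling) is an assumption for illustration rather than something the notebook does.

```python
import torch

# Weather codes 1..4 mapped to [0, 1] with the known bounds, as in the cell above
weather = torch.tensor([1.0, 2.0, 3.0, 4.0])
weather_scaled = (weather - 1.0) / 3.0
print(weather_scaled)

# Hypothetical variable with unknown bounds: take them from the data instead
temp = torch.tensor([0.24, 0.22, 0.50, 0.80])
temp_minmax = (temp - temp.min()) / (temp.max() - temp.min())
print(temp_minmax)
```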

## Text
In this section we will see how to represent text in a way that is suitable for processing by a neural network, for tasks such as machine translation and other Natural Language Processing (NLP) tasks. Text can be processed at the character level or at the word level. In both cases the units can be represented as vectors with one-hot encoding. Characters are stored as sequences of bits according to an encoding scheme. The simplest is ASCII, which uses 7 bits per character and can therefore represent 128 different characters. With one-hot encoding each ASCII character becomes a vector with 128 elements.

In [None]:
with open('data/p1ch4/jane-austen/1342-0.txt', encoding='utf8') as f:
  text = f.read()

In [None]:
lines = text.split('\n')
len(lines)

13428

In [None]:
line = lines[200]
line

'“Impossible, Mr. Bennet, impossible, when I am not acquainted with him'

In [None]:
line[1]

'I'

### One-hot encoding of characters
We create a 2D tensor from the characters in the line. For each character (row) we will write a '1' in the column that matches the character's ASCII code.

In [None]:
letter_t = torch.zeros(len(line), 128)
letter_t.shape

torch.Size([70, 128])

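We loop over the characters of the line, converted to lower case, and set a 1 in the column given by each character's code. Characters outside the ASCII range, such as the directional quotation mark at the start of the line, are mapped to index 0.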

In [None]:
for i, letter in enumerate(line.lower().strip()):
  letter_index = ord(letter) if ord(letter) < 128 else 0
  letter_t[i][letter_index] = 1

We can see that row 1 of the tensor represents the first letter of the line, "I" (the character at index 0 is the opening quotation mark), changed to lower case "i" (ASCII code 105). The row is the one-hot encoding of the character "i" in ASCII. The tensor represents a single line of the text; the whole text can be represented as a collection of such tensors.

In [None]:
torch.nonzero(letter_t[1])

tensor([[105]])
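Since every row holds a single 1 at the position of the character code, the encoding can be inverted with `argmax`. The sketch below encodes a short made-up line (an assumption, not taken from the novel) and decodes it again.

```python
import torch

line = 'Hi, Mr. Bennet'   # hypothetical short line
letter_t = torch.zeros(len(line), 128)
for i, letter in enumerate(line.lower()):
    # Non-ASCII characters would be mapped to index 0, as in the cell above
    letter_index = ord(letter) if ord(letter) < 128 else 0
    letter_t[i][letter_index] = 1

# Each row contains exactly one 1; argmax recovers the character code
codes = letter_t.argmax(dim=1)
decoded = ''.join(chr(c) for c in codes.tolist())
print(decoded)   # hi, mr. bennet
```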

### One-hot encoding of words
We can use the words of a vocabulary as the basis of the vector space. A vocabulary usually has thousands of words, so one-hot vectors are very long and a line is represented by a large, sparse tensor. As before we convert the text to lower case, and we also strip punctuation from the words.

In [None]:
def clean_words(input_str):
  punctuation = '.,;:"!?”“_-'
  word_list = input_str.lower().replace('\n',' ').split()
  word_list = [word.strip(punctuation) for word in word_list]
  return word_list

In [None]:
words_in_line = clean_words(line)
line, words_in_line

('“Impossible, Mr. Bennet, impossible, when I am not acquainted with him',
 ['impossible',
  'mr',
  'bennet',
  'impossible',
  'when',
  'i',
  'am',
  'not',
  'acquainted',
  'with',
  'him'])

We build a dictionary of the words in the text: the keys are the words and the values are their integer indexes in the sorted vocabulary.

In [None]:
word_list = sorted(set(clean_words(text)))
word2index_dict = {word: i for (i, word) in enumerate(word_list)}

Our dictionary has 7261 words. This means that we need vectors of length 7261 to one-hot encode each word.

In [None]:
len(word2index_dict)

7261

We can look up the index associated with a word.

In [None]:
word2index_dict['impossible']

3394

We can now represent a line as a 2D tensor with one one-hot row per word.

In [None]:
word_t = torch.zeros(len(words_in_line), len(word2index_dict))

for i, word in enumerate(words_in_line):
  word_index = word2index_dict[word]
  word_t[i][word_index] = 1
  print('{:2} {:4} {}'.format(i, word_index, word))

print(word_t.shape)

 0 3394 impossible
 1 4305 mr
 2  813 bennet
 3 3394 impossible
 4 7078 when
 5 3315 i
 6  415 am
 7 4436 not
 8  239 acquainted
 9 7148 with
10 3215 him
torch.Size([11, 7261])
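Each one-hot row is fully determined by a single integer, the word index. The sketch below builds both representations for a made-up toy corpus (an assumption for illustration) and shows that the one-hot tensor can be filled directly from the list of indexes.

```python
import torch

text = 'the dog saw the cat and the cat saw the dog'   # hypothetical toy corpus
words = text.split()
word_list = sorted(set(words))                 # ['and', 'cat', 'dog', 'saw', 'the']
word2index = {word: i for i, word in enumerate(word_list)}

# Compact representation: one integer per word
indices = torch.tensor([word2index[w] for w in words])

# Sparse representation: one row per word, one column per vocabulary entry
word_t = torch.zeros(len(words), len(word2index))
word_t[torch.arange(len(words)), indices] = 1

print(indices)        # tensor([4, 2, 3, 4, 1, 0, 4, 1, 3, 4, 2])
print(word_t.shape)   # torch.Size([11, 5])
```

With a vocabulary of thousands of words the index list stays the same length, while the one-hot tensor grows by one column per vocabulary entry.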


### Text embeddings
Even using grammar rules to reduce the number of distinct terms in a text, we end up with vectors with thousands of zeros and a single one. We can improve the situation by representing each word as a vector of floating-point numbers. This technique is called embedding. The idea is to map words into a relatively small space, say 100 dimensions, so that related words lie closer to each other than to unrelated words. For example, a sea and a lake are related because both are bodies of water, and a German shepherd and a collie are related because both are dogs. We do not want to build the embeddings by hand; they can be learned by a neural network that clusters words that are related in some way. We will not address embeddings further in this notebook. More generally, embeddings are an alternative to one-hot encoding for categorical data.
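For orientation only, the sketch below shows how an embedding relates to one-hot encoding, using PyTorch's `torch.nn.Embedding` layer. The vocabulary size, the embedding dimension and the word indexes are assumptions, and the weights are random and untrained, so the vectors carry no meaning yet.

```python
import torch

torch.manual_seed(0)
vocab_size, embedding_dim = 10, 4     # hypothetical sizes, far smaller than a real vocabulary
embedding = torch.nn.Embedding(vocab_size, embedding_dim)

word_indices = torch.tensor([3, 7, 3])
vectors = embedding(word_indices)     # shape [3, 4], dense float vectors

# The lookup is equivalent to multiplying one-hot rows by the weight matrix
onehot = torch.nn.functional.one_hot(word_indices, vocab_size).float()
print(torch.allclose(vectors, onehot @ embedding.weight))   # True
```

The embedding layer is therefore a table with one learnable row per word: training adjusts the rows, and the same word index always returns the same row.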
