
## Real-world data representation using tensors

This chapter covers
- Representing real-world data as PyTorch tensors
- Working with a range of data types
- Loading data from a file
- Converting data to tensors
- Shaping tensors so they can be used as inputs for neural network models

## Continuous, ordinal, and categorical values

We should be aware of three different kinds of numerical values as we attempt to make sense of our data. The first kind is `continuous` values. These are the most intuitive when represented as numbers. They are strictly ordered, and a difference between various values has a strict meaning. Stating that package `A` is 2 kilograms heavier than package `B`, or that package `B` came from 100 miles farther away than `A` has a fixed meaning, regardless of whether package `A` is 3 kilograms or 10, or if `B` came from 200 miles away or 2,000. If you’re counting or measuring something with units, it’s probably a `continuous` value.
The literature actually divides continuous values further: in the previous examples, it makes sense to say something is twice as heavy or three times farther away, so those values are said to be on a ratio scale. The time of day, on the other hand, does have the notion of difference, but it is not reasonable to claim that 6:00 is twice as late as 3:00; so time of day only offers an interval scale.


Next we have `ordinal` values. The strict ordering we have with continuous values remains, but the fixed relationship between values no longer applies. A good example of this is ordering a small, medium, or large drink, with small mapped to the value 1, medium 2, and large 3. The large drink is bigger than the medium, in the same way that 3 is bigger than 2, but it doesn’t tell us anything about how much bigger. If we were to convert our 1, 2, and 3 to the actual volumes (say, 8, 12, and 24 fluid ounces), then they would switch to being interval values. It’s important to remember that we can’t “do math” on the values outside of ordering them; trying to average large = 3 and small = 1 does not result in a medium drink!


Finally, `categorical` values have neither ordering nor numerical meaning to their values. These are often just enumerations of possibilities assigned arbitrary numbers. Assigning water to 1, coffee to 2, soda to 3, and milk to 4 is a good example. There’s no real logic to placing water first and milk last; they simply need distinct values to differentiate them. We could assign coffee to 10 and milk to –3, and there would be no significant change (though assigning values in the range `0..N – 1` will have advantages for one-hot encoding and the embeddings we’ll discuss later). Because the numerical values bear no meaning, they are said to be on a `nominal` scale.
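
As a minimal sketch of how the three kinds of values might end up in tensors, consider the following. The specific numbers and the name-to-index dictionaries are made up for illustration.

```python
import torch

# Continuous (ratio scale): package weights in kilograms.
# Differences and ratios between values are meaningful.
weights_kg = torch.tensor([3.0, 5.0, 10.0])
weight_diff = weights_kg[1] - weights_kg[0]  # the second package is 2 kg heavier

# Ordinal: drink sizes mapped to integers. Only the ordering is meaningful.
size_to_rank = {"small": 1, "medium": 2, "large": 3}
sizes = torch.tensor([size_to_rank[s] for s in ["small", "large", "medium"]])
is_bigger = sizes[1] > sizes[2]  # large > medium, but not "how much" bigger

# Categorical (nominal scale): arbitrary distinct indices in the range 0..N-1.
drink_to_index = {"water": 0, "coffee": 1, "soda": 2, "milk": 3}
drinks = torch.tensor([drink_to_index[d] for d in ["coffee", "milk", "water"]])
```

Only the first tensor supports arithmetic that means anything; the other two are integer codes whose values we should compare (ordinal) or merely tell apart (categorical).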

## Working with images

### Loading an image file

Images come in several different file formats, but luckily there are plenty of ways to load images in Python. Let’s start by loading an image using the imageio module ([code/p1ch4/1_image_dog.ipynb](https://github.com/deep-learning-with-pytorch/dlwpt-code/blob/master/p1ch4/1_image_dog.ipynb)).



In [38]:
import numpy as np
import torch
torch.set_printoptions(edgeitems=2, threshold=50)

In [39]:
import imageio
img_arr = imageio.imread('/content/bobby.jpg')
img_arr.shape

(720, 1280, 3)

At this point, `img_arr` is a NumPy array-like object with three dimensions: two spatial dimensions, height and width; and a third dimension corresponding to the red, green, and blue channels. Any library that outputs a NumPy array will suffice to obtain a PyTorch tensor. The only thing to watch out for is the layout of the dimensions. PyTorch modules dealing with image data require tensors to be laid out as C × H × W : `channels`, `height`, and `width`, respectively.

We can use the tensor’s `permute` method, passing for each new dimension the index of the old dimension it comes from, to get to an appropriate layout. Given an input tensor H × W × C as obtained previously, we get a proper layout by putting dimension 2 (the channels) first, followed by dimensions 0 and 1:

In [40]:
img = torch.from_numpy(img_arr)
out = img.permute(2, 0, 1)
out, out.shape

(tensor([[[ 77,  77,  ..., 117, 116],
          [ 75,  76,  ..., 117, 116],
          ...,
          [215, 216,  ..., 174, 174],
          [215, 216,  ..., 158, 158]],
 
         [[ 45,  45,  ...,  77,  76],
          [ 43,  44,  ...,  77,  76],
          ...,
          [165, 166,  ..., 124, 124],
          [165, 166,  ..., 107, 107]],
 
         [[ 22,  22,  ...,  51,  51],
          [ 20,  21,  ...,  51,  50],
          ...,
          [ 78,  79,  ...,  55,  55],
          [ 78,  79,  ...,  41,  41]]], dtype=torch.uint8),
 torch.Size([3, 720, 1280]))
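
It is worth keeping in mind that `permute` does not copy the data: it returns a view over the same storage with the dimensions reordered. The following is a small sketch on a made-up stand-in tensor (not the dog image) that illustrates this.

```python
import torch

# Stand-in for a loaded image laid out as H × W × C (sizes are made up).
img = torch.zeros(4, 6, 3, dtype=torch.uint8)
out = img.permute(2, 0, 1)  # reorder to C × H × W

print(out.shape)  # torch.Size([3, 4, 6])

# Writing to the original tensor is visible through the permuted view.
img[0, 0, 2] = 255
print(out[2, 0, 0])  # tensor(255, dtype=torch.uint8)

# Both tensors start at the same memory address; only the strides differ.
shares_storage = out.data_ptr() == img.data_ptr()
```

If an independent, contiguous copy is needed, calling `out.contiguous()` or `out.clone()` would produce one.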

As a slightly more efficient alternative to using stack to build up the tensor, we can pre-allocate a tensor of appropriate size and fill it with images loaded from a directory, like so:


In [41]:
batch_size = 3
batch = torch.zeros(batch_size, 3, 256, 256, dtype=torch.uint8)

This indicates that our batch will consist of three RGB images 256 pixels in height and 256 pixels in width. Notice the type of the tensor: we’re expecting each color to be represented as an 8-bit integer, as in most photographic formats from standard consumer cameras. We can now load all PNG images from an input directory and store them in the tensor:

In [42]:
import os
data_dir = '/content/image-cats/'
filenames = [name for name in os.listdir(data_dir) if os.path.splitext(name)[-1] == '.png']

for i, filename in enumerate(filenames):
  img_arr = imageio.imread(os.path.join(data_dir, filename))
  img_t = torch.from_numpy(img_arr)
  img_t = img_t.permute(2, 0, 1) #H × W × C -> C × H × W
  img_t = img_t[:3] #only keep first 3 channels
  batch[i] = img_t
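
For comparison, the `torch.stack` approach mentioned above could look like the following sketch. It uses random stand-in images rather than files from disk, and assumes every image already has the same C × H × W shape.

```python
import torch

# Three stand-in RGB images, already laid out as C × H × W (random data).
images = [
    torch.randint(0, 256, (3, 256, 256)).to(torch.uint8)
    for _ in range(3)
]

# torch.stack creates a new leading dimension: N × C × H × W.
batch_stacked = torch.stack(images)
print(batch_stacked.shape)  # torch.Size([3, 3, 256, 256])
```

Stacking needs the whole list of image tensors in memory before the batch is built, which is why filling a pre-allocated tensor can be slightly more efficient.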

### Normalizing the data

We mentioned earlier that neural networks usually work with floating-point tensors as their input. Neural networks exhibit the best training performance when the input data ranges roughly from `0` to `1`, or from `-1` to `1` (this is an effect of how their building blocks are defined).


So a typical thing we’ll want to do is cast a tensor to floating-point and normalize the values of the pixels. Casting to floating-point is easy, but normalization is trickier, as it depends on what range of the input we decide should lie between `0` and `1` (or `-1` and `1`). One possibility is to just divide the values of the pixels by `255` (the maximum representable number in 8-bit unsigned):

In [43]:
print(batch)
batch = batch.float()
batch /= 255.0  #normalize between 0 and 1
print(batch)

tensor([[[[156, 152,  ..., 149, 158],
          [174, 134,  ..., 136, 138],
          ...,
          [129, 130,  ..., 121, 114],
          [129, 123,  ..., 121, 120]],

         [[139, 135,  ..., 135, 147],
          [160, 119,  ..., 122, 124],
          ...,
          [111, 111,  ..., 112, 105],
          [111, 104,  ..., 110, 111]],

         [[129, 123,  ..., 132, 145],
          [155, 110,  ..., 119, 121],
          ...,
          [108, 108,  ..., 117, 110],
          [107,  98,  ..., 115, 116]]],


        [[[238, 238,  ..., 215, 215],
          [238, 238,  ..., 215, 215],
          ...,
          [214, 213,  ..., 190, 192],
          [214, 213,  ..., 190, 192]],

         [[195, 195,  ..., 175, 175],
          [195, 195,  ..., 175, 175],
          ...,
          [128, 127,  ..., 103, 105],
          [128, 127,  ..., 103, 105]],

         [[137, 137,  ..., 126, 126],
          [137, 137,  ..., 126, 126],
          ...,
          [ 79,  78,  ...,  69,  71],
          [ 79,  78,  ..

Another possibility is to compute the *mean* and *standard deviation* of the input data and scale it so that the output has `zero` mean and unit standard deviation across each channel:

In [44]:
n_channels = batch.shape[1]
for c in range(n_channels):
  mean = torch.mean(batch[:, c])
  std = torch.std(batch[:, c])
  batch[:, c] = (batch[:, c] - mean) / std

Here, we normalize just a single batch of images because we do not know yet how to operate on an entire dataset. In working with images, it is good practice to compute the *mean* and *standard deviation* on all the training data in advance and then subtract and divide by these fixed, precomputed quantities.
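
The per-channel loop can also be written without an explicit loop by reducing over the batch and spatial dimensions and relying on broadcasting. This is a minimal sketch on a random stand-in batch; with real data, `mean` and `std` would be the quantities precomputed over the training set.

```python
import torch

torch.manual_seed(0)
# Stand-in batch laid out as N × C × H × W, values in [0, 1] (random data).
batch = torch.rand(3, 3, 8, 8)

# Per-channel statistics over the batch, height, and width dimensions.
# keepdim=True leaves shape 1 × C × 1 × 1, which broadcasts against the batch.
mean = batch.mean(dim=(0, 2, 3), keepdim=True)
std = batch.std(dim=(0, 2, 3), keepdim=True)

normalized = (batch - mean) / std
```

Unlike the loop, this version does not modify `batch` in place; it produces a new tensor.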

We can perform several other operations on inputs, such as geometric transformations like rotations, scaling, and cropping. These may help with training or may be required to make an arbitrary input conform to the input requirements of a network, like the size of the image.
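
A few of these operations can be sketched with plain tensor functions, as below. The image is random stand-in data, and the crop window and output size are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

# Stand-in image laid out as C × H × W (random data, made-up size).
img = torch.rand(3, 64, 48)

# Cropping: slice the spatial dimensions to a 32 × 32 window.
cropped = img[:, 16:48, 8:40]

# Horizontal flip: reverse the width dimension.
flipped = torch.flip(img, dims=[2])

# Rotation by 90 degrees in the H-W plane; height and width swap.
rotated = torch.rot90(img, k=1, dims=(1, 2))

# Scaling: interpolate expects a batch dimension, so add and remove one.
resized = F.interpolate(
    img.unsqueeze(0), size=(32, 32), mode="bilinear", align_corners=False
).squeeze(0)
```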

## 3D images: Volumetric data

### Loading a specialized format

Let’s load a sample CT scan using the `volread` function in the imageio module, which takes a directory as an argument and assembles all Digital Imaging and Communications in Medicine (DICOM) files in a series in a NumPy 3D array [(code/p1ch4/2_volumetric_ct.ipynb)](https://github.com/deep-learning-with-pytorch/dlwpt-code/blob/master/p1ch4/2_volumetric_ct.ipynb).


In [45]:
import imageio
dir_path = "/content/2-LUNG 3.0  B70f-04083"
vol_arr = imageio.volread(dir_path, 'DICOM')
vol_arr.shape

Reading DICOM (examining files): 1/99 files (1.0%)99/99 files (100.0%)
  Found 1 correct series.
Reading DICOM (loading data): 65/99  (65.7%)99/99  (100.0%)


(99, 512, 512)


The layout is different from what PyTorch expects, due to having no channel information. So we’ll have to make room for the channel dimension using `unsqueeze`:

In [46]:
vol = torch.from_numpy(vol_arr).float()
vol = torch.unsqueeze(vol, 0)
vol.shape

torch.Size([1, 99, 512, 512])


At this point we could assemble a 5D dataset by stacking multiple volumes along the `batch` direction, just as we did in the previous section.
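
A minimal sketch of that stacking step, using two small random stand-in volumes instead of real CT scans, might look like this:

```python
import torch

# Two stand-in volumes shaped C × D × H × W with a single channel
# (random data, much smaller than the real 99 × 512 × 512 scan).
vols = [torch.rand(1, 9, 16, 16) for _ in range(2)]

# Stacking along a new leading dimension gives N × C × D × H × W.
batch_5d = torch.stack(vols)
print(batch_5d.shape)  # torch.Size([2, 1, 9, 16, 16])
```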

## Representing tabular data

The simplest form of data we’ll encounter on a machine learning job is sitting in a spreadsheet, CSV file, or database. Whatever the medium, it’s a table containing one row per sample (or record), where columns contain one piece of information about our sample.

At first we are going to assume there’s no meaning to the order in which samples appear in the table: such a table is a collection of independent samples, unlike a time series, for instance, in which samples are related by a time dimension.

Columns may contain numerical values, like temperatures at specific locations; or labels, like a string expressing an attribute of the sample, like “blue.” Therefore, tabular data is typically not homogeneous: different columns don’t have the same type. We might have a column showing the weight of apples and another encoding their color in a label.
PyTorch tensors, on the other hand, are homogeneous. Information in PyTorch is typically encoded as a number, typically `floating-point` (though `integer` types and `Boolean` are supported as well). This numeric encoding is deliberate, since neural networks are mathematical entities that take real numbers as inputs and produce real numbers as output through successive application of matrix multiplications and nonlinear functions.

### Using a real-world dataset

Our first job as deep learning practitioners is to encode heterogeneous, real-world data into a tensor of floating-point numbers, ready for consumption by a neural network. A large number of tabular datasets are freely available on the internet; see, for instance, https://github.com/caesar0301/awesome-public-datasets.

Let’s start with something fun: wine! The Wine Quality dataset is a freely available table containing chemical characterizations of samples of vinho verde, a wine from north Portugal, together with a sensory quality score. The dataset for white wines can be downloaded here: http://mng.bz/90Ol. For convenience, we also created a copy of the dataset on the Deep Learning with PyTorch Git repository, under data/p1ch4/tabular-wine.


The file contains a semicolon-separated collection of values organized in 12 columns preceded by a header line containing the column names. The first 11 columns contain values of chemical variables, and the last column contains the sensory quality score from 0 (very bad) to 10 (excellent). These are the column names in the order they appear in the dataset:

  - fixed acidity
  - volatile acidity
  - citric acid
  - residual sugar
  - chlorides
  - free sulfur dioxide
  - total sulfur dioxide
  - density
  - pH
  - sulphates
  - alcohol
  - quality
  

A possible machine learning task on this dataset is predicting the quality score from chemical characterization alone. We’re hoping to find a relationship between one of the chemical columns in our data and the quality column. Here, we’re expecting to see quality increase as sulfur decreases.

### Loading a wine data tensor

We need to be able to examine the data in a more usable way than opening the file in a text editor. Let’s see how we can load the data using Python and then turn it into a PyTorch tensor. Python offers several options for quickly loading a CSV file. Three popular options are


- The csv module that ships with Python
- NumPy
- Pandas

Let's use NumPy for this example. We can load our file and turn the resulting NumPy array into a PyTorch tensor [(code/p1ch4/3_tabular_wine.ipynb)](https://github.com/deep-learning-with-pytorch/dlwpt-code/blob/master/p1ch4/3_tabular_wine.ipynb).

In [47]:
import csv
wine_path = "/content/winequality-white.csv"
wineq_numpy = np.loadtxt(wine_path, dtype=np.float32, delimiter=";", skiprows=1)
wineq_numpy

array([[ 7.  ,  0.27,  0.36, ...,  0.45,  8.8 ,  6.  ],
       [ 6.3 ,  0.3 ,  0.34, ...,  0.49,  9.5 ,  6.  ],
       [ 8.1 ,  0.28,  0.4 , ...,  0.44, 10.1 ,  6.  ],
       ...,
       [ 6.5 ,  0.24,  0.19, ...,  0.46,  9.4 ,  6.  ],
       [ 5.5 ,  0.29,  0.3 , ...,  0.38, 12.8 ,  7.  ],
       [ 6.  ,  0.21,  0.38, ...,  0.32, 11.8 ,  6.  ]], dtype=float32)

Here we just prescribe what the type of the 2D array should be (`32-bit floating-point`), the `delimiter` used to separate values in each row, and the fact that the first line should not be read since it contains the column names. Let’s check that all the data has been read:

In [48]:
col_list = next(csv.reader(open(wine_path), delimiter=';'))
wineq_numpy.shape, col_list

((4898, 12),
 ['fixed acidity',
  'volatile acidity',
  'citric acid',
  'residual sugar',
  'chlorides',
  'free sulfur dioxide',
  'total sulfur dioxide',
  'density',
  'pH',
  'sulphates',
  'alcohol',
  'quality'])

We then proceed to convert the `NumPy` array to a `PyTorch` tensor:

In [49]:
wineq = torch.from_numpy(wineq_numpy)
wineq.shape, wineq.dtype

(torch.Size([4898, 12]), torch.float32)

At this point, we have a floating-point `torch.Tensor` containing all the columns, including the last, which refers to the `quality` score.
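
For comparison, the same kind of load can be done with only the `csv` module from the standard library. The sketch below reads from an in-memory string rather than the real file; the two data rows and the reduced set of three columns are made up for illustration.

```python
import csv
import io

import torch

# A tiny stand-in for the wine file: same semicolon-separated layout,
# but only three columns and two rows (illustrative values).
sample = '"fixed acidity";"volatile acidity";"quality"\n7.0;0.27;6\n6.3;0.30;6\n'

reader = csv.reader(io.StringIO(sample), delimiter=";")
header = next(reader)  # first line holds the column names
rows = [[float(value) for value in row] for row in reader]

sample_t = torch.tensor(rows, dtype=torch.float32)
print(header, sample_t.shape)
```

This route involves more manual work than `np.loadtxt`, but it makes each step (skipping the header, parsing values, building the tensor) explicit.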

### Representing scores

We could treat the score as a `continuous` variable, keep it as a real number, and perform a regression task, or treat it as a `label` and try to guess the `label` from the chemical analysis in a classification task. In both approaches, we will typically remove the score from the tensor of input data and keep it in a separate tensor, so that we can use the score as the ground truth without it being input to our model:

In [50]:
data = wineq[:, :-1] #Select all rows and all columns except the last
data, data.shape

(tensor([[ 7.0000,  0.2700,  ...,  0.4500,  8.8000],
         [ 6.3000,  0.3000,  ...,  0.4900,  9.5000],
         ...,
         [ 5.5000,  0.2900,  ...,  0.3800, 12.8000],
         [ 6.0000,  0.2100,  ...,  0.3200, 11.8000]]), torch.Size([4898, 11]))

In [51]:
target = wineq[:, -1] #Select all rows and the last column
target, target.shape

(tensor([6., 6.,  ..., 7., 6.]), torch.Size([4898]))

If we want to transform the `target` tensor into a tensor of labels, we have two options, depending on the strategy or what we use the categorical data for. One is simply to treat labels as an integer vector of scores:

In [52]:
target = wineq[:, -1].long()
target

tensor([6, 6,  ..., 7, 6])

If targets were string labels, like wine color, assigning an integer number to each string would let us follow the same approach.
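
As a small sketch of that idea, the following maps string labels to integer indices. The list of colors and the name `label_to_index` are hypothetical; the dataset used here has no such column.

```python
import torch

# Made-up string labels for illustration.
colors = ["white", "red", "white", "rose", "red"]

# Sorting the distinct labels makes the label-to-index assignment deterministic.
label_to_index = {label: i for i, label in enumerate(sorted(set(colors)))}

color_target = torch.tensor([label_to_index[c] for c in colors])
print(label_to_index)  # {'red': 0, 'rose': 1, 'white': 2}
print(color_target)    # tensor([2, 0, 2, 1, 0])
```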


### One-hot encoding

The other approach is to build a one-hot encoding of the scores: that is, encode each of the 10 scores in a vector of 10 elements, with all elements set to 0 but one, at a different index for each score. This way, a score of `1` could be mapped onto the vector `(1,0,0,0,0,0,0,0,0,0)`, a score of `5` onto `(0,0,0,0,1,0,0,0,0,0)`, and so on. Note that the fact that the score corresponds to the index of the nonzero element is purely incidental: we could shuffle the assignment, and nothing would change from a classification standpoint.

Keeping wine quality scores in an integer vector of scores induces an ordering on the scores, which might be totally appropriate in this case, since a score of 1 is lower than a score of 4. It also induces some sort of distance between scores: that is, the distance between 1 and 3 is the same as the distance between 2 and 4. If this holds for our quantity, then great. If, on the other hand, scores are purely discrete, like grape variety, one-hot encoding will be a much better fit, as there’s no implied ordering or distance. One-hot encoding is also appropriate for quantitative scores when fractional values in between integer scores, like `2.4`, make no sense for the application: that is, when the score is either *this* or *that*.

We can achieve one-hot encoding using the `scatter_` method, which fills the tensor with values from a source tensor along the indices provided as arguments:


In [53]:
target_onehot = torch.zeros(target.shape[0], 10)
target_onehot.scatter_(1, target.unsqueeze(1), 1.0)

tensor([[0., 0.,  ..., 0., 0.],
        [0., 0.,  ..., 0., 0.],
        ...,
        [0., 0.,  ..., 0., 0.],
        [0., 0.,  ..., 0., 0.]])

Let’s see what `scatter_` does. First, we notice that its name ends with an underscore. As you learned in the previous chapter, this is a convention in PyTorch that indicates the method will not return a new tensor, but will instead modify the tensor in place. The arguments for `scatter_` are as follows:
- The dimension along which the following two arguments are specified
- A column tensor indicating the indices of the elements to scatter
- A tensor containing the elements to scatter or a single scalar to scatter (`1.0` in this case)

In other words, the previous invocation reads, “For each row, take the index of the target label (which coincides with the score in our case) and use it as the column index to set the value 1.0.” The end result is a tensor encoding categorical information.
The second argument of `scatter_`, the `index` tensor, is required to have the same number of dimensions as the tensor we scatter into. Since `target_onehot` has two dimensions `(4,898 × 10)`, we need to add an extra dummy dimension to target using `unsqueeze`:


In [56]:
target_unsqueezed = target.unsqueeze(1)
target_unsqueezed

tensor([[6],
        [6],
        ...,
        [7],
        [6]])

The call to unsqueeze adds a `singleton` dimension, from a 1D tensor of `4,898` elements to a 2D tensor of size `(4,898 × 1)`, without changing its contents—no extra elements are added; we just decided to use an extra index to access the elements. That is, we access the first element of target as `target[0]` and the first element of its unsqueezed counterpart as `target_unsqueezed[0,0]`.
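
Because the printed output of the full tensor elides most columns, it can help to see `scatter_` on a tensor small enough to print in full. The sketch below uses three made-up class indices and four classes, and compares the result with `torch.nn.functional.one_hot`, which offers a shorter route to the same encoding.

```python
import torch
import torch.nn.functional as F

# Three made-up class indices, with four possible classes.
scores = torch.tensor([2, 0, 3])

onehot = torch.zeros(scores.shape[0], 4)
# For each row, write 1.0 into the column given by the index tensor.
onehot.scatter_(1, scores.unsqueeze(1), 1.0)
print(onehot)
# Rows: [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]

# one_hot returns an integer tensor, so cast it to float to compare.
same = F.one_hot(scores, num_classes=4).float()
print(torch.equal(onehot, same))  # True
```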

PyTorch allows us to use class indices directly as targets while training neural networks. However, if we wanted to use the score as a categorical input to the network, we would have to transform it to a one-hot-encoded tensor.
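
As a sketch of what using class indices as targets can look like, `nn.CrossEntropyLoss` accepts a tensor of integer class indices alongside the model's raw outputs. The logits here are random stand-ins for a model's output, and the four target values are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in model outputs: 4 samples, 10 classes (random values).
logits = torch.randn(4, 10)
# Class indices as targets; no one-hot encoding needed. dtype is torch.int64.
target = torch.tensor([6, 6, 7, 5])

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, target)

# The same value computed by hand: mean negative log-probability
# of the correct class for each sample.
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(4), target].mean()
```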