# Preprocessing Data

This notebook covers reading tabular data (a file of rows, each holding input features and an associated label) into PyTorch tensors. It also includes operations to inspect the resulting tensors.

In particular, the dataset describes various features of white wine, such as acidity, sugar, and pH. Each row ends with a label: an integer quality score of 1, 2, ..., 10.

In [59]:
import csv
import numpy as np
import torch as T

In [60]:
# Read data into np array. 
wine_path = "C:\\Users\\kylec\\data_dump\\winequality-white.csv"
wine_data_np = np.loadtxt(wine_path, dtype=np.float32, delimiter=";", skiprows=1)

In [61]:
print(wine_data_np)
# 4898 rows, 12 cols (11 input, 1 class)
print(wine_data_np.shape)

[[ 7.    0.27  0.36 ...  0.45  8.8   6.  ]
 [ 6.3   0.3   0.34 ...  0.49  9.5   6.  ]
 [ 8.1   0.28  0.4  ...  0.44 10.1   6.  ]
 ...
 [ 6.5   0.24  0.19 ...  0.46  9.4   6.  ]
 [ 5.5   0.29  0.3  ...  0.38 12.8   7.  ]
 [ 6.    0.21  0.38 ...  0.32 11.8   6.  ]]
(4898, 12)
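
The `csv` module is imported above but not used in the cells shown. One plausible use is recovering the column names from the header row that `skiprows=1` discards. The following is a minimal sketch under that assumption; the in-memory sample, its column names, and its values are made up for illustration and stand in for the real file.

```python
import csv
import io

# Made-up stand-in for the first lines of a semicolon-delimited file;
# the column names and values here are illustrative only.
sample = io.StringIO(
    '"fixed acidity";"pH";"quality"\n'
    "7.0;3.0;6\n"
    "6.3;3.3;6\n"
)

reader = csv.reader(sample, delimiter=";")
col_names = next(reader)          # header row, which np.loadtxt skips via skiprows=1
n_rows = sum(1 for _ in reader)   # count the remaining data rows

print(col_names)
print(n_rows)
```

With a real file, `open(wine_path, newline="")` would take the place of the `io.StringIO` object.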


In [62]:
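"""
Read data into a tensor, then separate it into input and label tensors.
"""
wine_data_t = T.from_numpy(wine_data_np)
x = wine_data_t[:, :-1]              # all columns except the last: input features
label_t = wine_data_t[:, -1].long()  # last column: quality score, cast to int64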
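
To make the slicing concrete, here is a small sketch of the same split on a made-up 3-row array. The names `toy_np`, `toy_t`, `toy_x`, and `toy_label` are hypothetical and exist only for this illustration.

```python
import numpy as np
import torch as T

# Tiny made-up array: 3 rows, 2 feature columns plus 1 label column.
toy_np = np.array([[7.0, 0.27, 6.0],
                   [6.3, 0.30, 5.0],
                   [8.1, 0.28, 7.0]], dtype=np.float32)

toy_t = T.from_numpy(toy_np)      # shares memory with toy_np, no copy is made
toy_x = toy_t[:, :-1]             # every column except the last
toy_label = toy_t[:, -1].long()   # last column, cast to int64

print(toy_x.shape, toy_x.dtype)
print(toy_label.shape, toy_label.dtype)
```

The cast to `long` matters later: the labels are used as indices, and index tensors must have an integer dtype.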

In [63]:
"""
Represent labels as one-hot vectors. Encode each of the 10 classes as
one of 10 vectors, each of which has a single element with value 1 and
0s everywhere else.
"""
y = T.zeros(label_t.shape[0], 10).long()
y.scatter_(1, label_t.unsqueeze(1), 1.0)

y.shape

torch.Size([4898, 10])
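
As a cross-check, the scatter-based encoding can be compared with `torch.nn.functional.one_hot`. This is a sketch on made-up labels (`toy_label` is a hypothetical name), not a replacement for the cell above.

```python
import torch as T
import torch.nn.functional as F

toy_label = T.tensor([6, 5, 7, 6])   # made-up scores

# Scatter-based encoding, mirroring the approach used in the notebook.
y_scatter = T.zeros(toy_label.shape[0], 10, dtype=T.long)
y_scatter.scatter_(1, toy_label.unsqueeze(1), 1)

# Built-in equivalent; num_classes fixes the width at 10 columns.
y_builtin = F.one_hot(toy_label, num_classes=10)

print(T.equal(y_scatter, y_builtin))
```

Both approaches treat the label as a column index, so with 10 columns only scores 0 through 9 fit; a score of 10 would need an 11th column.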

In [72]:
# Compute mean and variance of features. 

x_mean = T.mean(x, dim=0)
x_var = T.var(x, dim=0)
print("Mean: ", x_mean)
print("Variance: ", x_var)

Mean:  tensor([  6.8548,   0.2782,   0.3342,   6.3914,   0.0458,  35.3081, 138.3607,
          0.9940,   3.1883,   0.4898,  10.5142])
Variance:  tensor([7.1211e-01, 1.0160e-02, 1.4646e-02, 2.5726e+01, 4.7733e-04, 2.8924e+02,
        1.8061e+03, 8.9455e-06, 2.2801e-02, 1.3025e-02, 1.5144e+00])


In [75]:
# Normalize data: subtract each feature's mean, divide by its standard deviation
x_norm = (x - x_mean) / T.sqrt(x_var)
x_norm

tensor([[ 0.1721, -0.0818,  0.2133,  ..., -1.2468, -0.3491, -1.3930],
        [-0.6574,  0.2159,  0.0480,  ...,  0.7399,  0.0013, -0.8242],
        [ 1.4756,  0.0174,  0.5438,  ...,  0.4750, -0.4368, -0.3366],
        ...,
        [-0.4204, -0.3794, -1.1915,  ..., -1.3131, -0.2615, -0.9054],
        [-1.6054,  0.1167, -0.2825,  ...,  1.0048, -0.9625,  1.8574],
        [-1.0129, -0.6770,  0.3785,  ...,  0.4750, -1.4882,  1.0448]])
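
A quick sanity check for this kind of normalization is that every column ends up with mean near 0 and variance near 1. The sketch below runs that check on made-up random features; `toy_x` and the derived names are hypothetical.

```python
import torch as T

T.manual_seed(0)
toy_x = T.rand(100, 3) * 10.0       # made-up features on an arbitrary scale

toy_mean = T.mean(toy_x, dim=0)
toy_var = T.var(toy_x, dim=0)       # unbiased estimate by default
toy_norm = (toy_x - toy_mean) / T.sqrt(toy_var)

# Each column should now have mean close to 0 and variance close to 1.
print(T.mean(toy_norm, dim=0))
print(T.var(toy_norm, dim=0))
```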

In [71]:
# Count number of data points with label score less than 5
bad = label_t <= 4
bad.shape, bad.dtype, bad.sum()

(torch.Size([4898]), torch.uint8, tensor(183))
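
The comparison result can also be used as a mask to select the matching rows. The sketch below shows this on made-up data (`toy_label`, `toy_x`, and `bad_x` are hypothetical names). Note that recent PyTorch versions report the mask dtype as `torch.bool` rather than the `torch.uint8` shown above.

```python
import torch as T

toy_label = T.tensor([6, 3, 7, 4, 5])                 # made-up scores
toy_x = T.arange(10, dtype=T.float32).reshape(5, 2)   # made-up features

bad = toy_label <= 4          # mask with one entry per row
bad_x = toy_x[bad]            # keeps only the rows where the mask is True

print(bad.sum().item())
print(bad_x)
```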