In [None]:
# Comment the following lines if you're not in colab:
from google.colab import drive
drive.mount('/content/drive')
# If you're in colab, cd to your own working directory here:
%cd ..//..//content//drive//MyDrive//Colab-Notebooks//HY-673-Tutorials//Tutorial-2

Mounted at /content/drive
/content/drive/MyDrive/Colab-Notebooks/HY-673-Tutorials/Tutorial-2


## <u>From CSV Files to Datasets</u>

We will prepare a PyTorch Dataset for a very simple classification problem, i.e., distinguishing between different types of Iris flowers.

The dataset is stored as a CSV file at `"data/iris_data.csv"`. In order to prepare a PyTorch Dataset, we will need to go through the following steps:
- Load the file with Pandas and perform any preprocessing that's necessary
- Split the data into a training set and a test set (and sometimes a validation set)
- Create a PyTorch Dataset by extending the `torch.utils.data.Dataset` class

In [None]:
import os
import pandas as pd

## <u>Pandas</u>

A useful tool to preprocess your data in Python is Pandas. You can use it to manipulate data files such as CSV, XLSX, and others, which are read as `DataFrame` objects.

In order to read such data, you should specify either their absolute or relative path. I personally recommend using relative paths, since they make the code easier to run on other people's machines.

You can specify the path to a file simply as a string (e.g., `"data/iris_data.csv"`). However, it is better to use the `os.path.join()` function, which inserts the correct path separator for the platform the code runs on:

In [None]:
datapath = os.path.join('data', 'iris_data.csv')
print("Local Path:", datapath)

Local Path: data/iris_data.csv
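
As an alternative to `os.path.join()`, the standard-library `pathlib` module offers the same portability through path objects. The following is only a small sketch; the variable name `datapath_alt` is made up for illustration and is not used elsewhere in this tutorial:

```python
from pathlib import Path

# Build the same relative path; the "/" operator joins path components
# using the separator of the current platform.
datapath_alt = Path("data") / "iris_data.csv"

print("Local Path:", datapath_alt)
print("File name: ", datapath_alt.name)    # last component of the path
print("Extension: ", datapath_alt.suffix)  # file extension, including the dot
```

Functions such as `pd.read_csv()` accept path objects as well as plain strings, so either style should work for the rest of the tutorial.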


Now, let's read the CSV file `"data/iris_data.csv"` with Pandas. We can do that using the `pd.read_csv()` function. Next, let's print its contents:

In [None]:
df = pd.read_csv(datapath)
print(df)

     sepal_length  sepal_width  petal_length  petal_width           label
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]


Individual columns of a `DataFrame` are called `Series` objects. Typically the operations that you can perform on DataFrames and Series are similar, but it is a good idea to always check the documentation.

In [None]:
print(f"Object is of type: {type(df)}")
print(f"\nThe columns are of type: {type(df['sepal_length'])}")
print(f"\nLabels are of type: {type(df['label'][0])}")

Object is of type: <class 'pandas.core.frame.DataFrame'>

The columns are of type: <class 'pandas.core.series.Series'>

Labels are of type: <class 'str'>
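
To make the DataFrame/Series similarity concrete, here is a small sketch on a made-up table (the `toy` DataFrame and its columns `a` and `b` are illustrative, not part of the Iris data). The same method name, `mean()`, exists on both types but returns different things:

```python
import pandas as pd

# Tiny made-up table, only used to compare the two types.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

col_mean = toy["a"].mean()  # Series.mean(): a single number
all_means = toy.mean()      # DataFrame.mean(): one value per column, as a Series

print(type(col_mean), col_mean)
print(type(all_means))
print(all_means)
```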


You can see that the `label` column of the Iris dataset contains strings. However, in order to train a classifier, we need to encode the labels as integer values, i.e., define a string-to-integer mapping.

One simple way to do that is to create a dictionary, with keys and values that represent this mapping. Let's start by checking which labels occur in our data. We can do that by using the `unique()` method of Pandas Series:

In [None]:
print(df['label'].unique())

['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']


We can see that there are 3 (unique) classes in total. We can now define our dictionary for the Iris labels as follows:

In [None]:
label_dict = {
    'Iris-setosa': 0,
    'Iris-versicolor': 1,
    'Iris-virginica': 2
}
print(f"The dictionary: {label_dict}")
print(f"Class that corresponds to 'Iris-versicolor' is {label_dict['Iris-versicolor']}.")

The dictionary: {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
Class that corresponds to 'Iris-versicolor' is 1.


However, this is only workable for simple cases, because such a method is not scalable and is prone to errors. What if, say, we have 100 classes, or we accidentally misspell `Iris-versicolor`? The `enumerate()` built-in function comes in handy here:

In [None]:
for i, j in enumerate(['ab', 'cdef', 'gh', 'i', 'jkl']):
  print(f"i = {i}, j = {j}")

i = 0, j = ab
i = 1, j = cdef
i = 2, j = gh
i = 3, j = i
i = 4, j = jkl


This will help us create the same dictionary but in a more automated way, i.e., without the need to hardcode it. We can do so with this beautiful one-liner:

In [None]:
label_dict = {label: i for i, label in enumerate(df['label'].unique())}
print(label_dict)

{'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
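
One caveat: `unique()` returns the labels in order of first appearance, so the integer assigned to each class depends on how the rows happen to be ordered. A possible variation, sketched below on a plain Python list with made-up ordering, is to sort the unique labels first and to also keep the inverse mapping for turning predictions back into names. The names `label_to_int` and `int_to_label` are illustrative only:

```python
# Made-up label list whose order differs from the CSV file.
labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

# Sorting the unique labels makes the mapping independent of row order.
label_to_int = {label: i for i, label in enumerate(sorted(set(labels)))}

# Inverse mapping, handy for converting predicted integers back to names.
int_to_label = {i: label for label, i in label_to_int.items()}

print(label_to_int)
print(int_to_label)
```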


Now we just need to "apply" our dictionary to the original dataset labels in order to change them from strings to integers. We can do that with the `map()` method of Pandas Series (there are also other ways, e.g., using `apply()` with a lambda expression). Note that `map()` returns a new `Series`; it is the assignment back to `df['label']` that overwrites the column in the original `df` object:

In [None]:
df['label'] = df['label'].map(label_dict)
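# Equivalent for labels that are present in the dictionary
# (for unknown labels, map() gives NaN while this raises a KeyError):
# df['label'] = df['label'].apply(lambda x: label_dict[x])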
print(f"New dataset with string labels replaced by integers:\n\n{df}")

New dataset with string labels replaced by integers:

     sepal_length  sepal_width  petal_length  petal_width  label
0             5.1          3.5           1.4          0.2      0
1             4.9          3.0           1.4          0.2      0
2             4.7          3.2           1.3          0.2      0
3             4.6          3.1           1.5          0.2      0
4             5.0          3.6           1.4          0.2      0
..            ...          ...           ...          ...    ...
145           6.7          3.0           5.2          2.3      2
146           6.3          2.5           5.0          1.9      2
147           6.5          3.0           5.2          2.0      2
148           6.2          3.4           5.4          2.3      2
149           5.9          3.0           5.1          1.8      2

[150 rows x 5 columns]
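
It is worth checking that every label was actually found in the dictionary. The sketch below uses a made-up Series containing a deliberate typo to show how `map()` and `apply()` behave differently in that case; the names `raw`, `mapped`, `n_missing`, and `apply_failed` are illustrative:

```python
import pandas as pd

label_dict = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

# Made-up labels; the second entry is misspelled on purpose.
raw = pd.Series(["Iris-setosa", "Iris-versicolr", "Iris-virginica"])

# map() silently turns unknown labels into NaN ...
mapped = raw.map(label_dict)
n_missing = int(mapped.isna().sum())
print(f"Unmapped labels: {n_missing}")

# ... whereas a dictionary lookup inside apply() raises a KeyError.
try:
    raw.apply(lambda x: label_dict[x])
    apply_failed = False
except KeyError as err:
    apply_failed = True
    print(f"apply() failed on unknown label: {err}")
```

A quick `isna().sum()` after mapping is a cheap way to catch misspelled or unexpected classes before training.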


If we want, we can now save our updated DataFrame using the `to_csv()` method:

(In most cases you want to set the parameter `index` to `False`. Otherwise, the CSV file will contain an extra column with the index values of the DataFrame. I have no idea why this parameter is set to `True` by default.)

In [None]:
datapath_new = os.path.join("data", "iris_new.csv")
df.to_csv(datapath_new, index=False)
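
To see what the `index` parameter changes, here is a small round-trip sketch on a made-up two-row table. It writes to an in-memory string instead of a file, so nothing on disk is touched; the names `toy`, `with_index`, and `without_index` are illustrative:

```python
import io

import pandas as pd

toy = pd.DataFrame({"sepal_length": [5.1, 4.9], "label": [0, 0]})

# With no path argument, to_csv() returns the CSV text as a string.
with_index = pd.read_csv(io.StringIO(toy.to_csv(index=True)))
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# The version saved with the index gains an extra unnamed column on reload.
print(list(with_index.columns))
print(list(without_index.columns))
```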

## <u>Train/Test Split</u>

If you have any experience with machine learning, you know that you should split your dataset into a *train set*, a *test set*, and, optionally, a *validation set*. Here, we will just split it into a train and a test set. The model must never be trained on the test set; it should never "see" any part of it in any way.

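First of all, let's convert our DataFrame to a NumPy array, and separate the input data from the ground-truth labels. Since `to_numpy()` returns a single array with one common dtype, the integer labels are stored as floats at this point: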

In [None]:
all_data = df.to_numpy()
x_data = all_data[:, :-1]  # inputs
y_data = all_data[:, -1]   # ground truth (label)

print(f"total number of data points  = {x_data.shape[0]}")
print(f"dimensionality of each point = {x_data.shape[1]}")

print(f"\n10 inputs:\n {x_data[:10]}\n10 ground truths:\n {y_data[:10]}")

total number of data points  = 150
dimensionality of each point = 4

10 inputs:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
10 ground truths:
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


Now it is time to actually split the dataset into training and test data. We'll hold out one third of the data, meaning that roughly 67% of our data will be used for training and 33% for testing (100 and 50 points, respectively).

We can do that by using the `train_test_split()` function of Scikit-Learn. You can find the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

I have two main recommendations for you regarding this function:
- Make the split reproducible by setting a `random_state` (e.g., 42)
- Use the `stratify` option with respect to the labels: this will ensure that the training and test data have roughly the same class distribution

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=1/3, random_state=42, stratify=y_data
)

Let's print some information to check that everything has gone smoothly:

In [None]:
print(f"x_train shape: {x_train.shape}, y_train shape: {y_train.shape}")
print(f"x_test shape:  {x_test.shape},  y_test shape:  {y_test.shape}")
print(f"\nx_train[:3] = \n{x_train[:3]} \ny_train[:3]:\n {y_train[:3]}")
print(f"\nx_test[:3] = \n{x_test[:3]} \ny_test[:3]:\n {y_test[:3]}")

x_train shape: (100, 4), y_train shape: (100,)
x_test shape:  (50, 4),  y_test shape:  (50,)

x_train[:3] = 
[[6.3 3.4 5.6 2.4]
 [5.1 3.5 1.4 0.3]
 [5.8 2.7 5.1 1.9]] 
y_train[:3]:
 [2. 0. 2.]

x_test[:3] = 
[[6.3 2.8 5.1 1.5]
 [6.3 3.3 4.7 1.6]
 [5.  3.4 1.5 0.2]] 
y_test[:3]:
 [2. 1. 0.]
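
If a validation set is also needed, one common approach is to call `train_test_split()` twice: first to hold out the test set, then to carve a validation set out of what remains. The sketch below uses synthetic data with the same shape as Iris (150 points, 3 balanced classes); the split sizes of 90/30/30 and all variable names are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Iris arrays: 150 unique rows, 3 balanced classes.
x = np.arange(150 * 4, dtype=float).reshape(150, 4)
y = np.repeat([0, 1, 2], 50)

# Step 1: hold out 30 points as the test set (an int means an absolute count).
x_trainval, x_te, y_trainval, y_te = train_test_split(
    x, y, test_size=30, random_state=42, stratify=y
)

# Step 2: split the remaining 120 points into 90 training and 30 validation.
x_tr, x_val, y_tr, y_val = train_test_split(
    x_trainval, y_trainval, test_size=30, random_state=42, stratify=y_trainval
)

print(f"train: {x_tr.shape}, val: {x_val.shape}, test: {x_te.shape}")

# Class counts per split, to check the effect of stratification.
for name, labels in [("train", y_tr), ("val", y_val), ("test", y_te)]:
    print(name, np.bincount(labels))
```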


## <u>Pickle</u>

In cases like this, preprocessing the data does not take much time computation-wise. So, perhaps there is no problem preprocessing the data every time we run the entire code. But what if we cannot afford that, because, say, it is too expensive to do so every single time?

In Python, we can use the built-in `pickle` module to save and load groups of variables in a serialized binary format.

Important: Use Pickle only to save and load variables within your own workspace! The format can vary depending on the version, so it is bad practice to use Pickle files to share data with others. In addition, never load Pickle files from untrusted sources, since unpickling can execute arbitrary code.

In [None]:
import pickle

pickledatapath = os.path.join('data', 'iris_data.pkl')

with open(pickledatapath, 'wb') as f:
    pickle.dump([x_train, x_test, y_train, y_test], f)
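
Loading works the same way in reverse, with the file opened in `'rb'` mode and `pickle.load()` returning the object that was dumped. Below is a self-contained sketch that writes two made-up arrays to a temporary directory and reads them back; the file name `toy_data.pkl` and the array contents are illustrative only:

```python
import os
import pickle
import tempfile

import numpy as np

# Made-up arrays standing in for the real splits.
arrays = [np.arange(6).reshape(3, 2), np.array([0, 1, 2])]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_data.pkl")

    # Save: binary write mode.
    with open(path, "wb") as f:
        pickle.dump(arrays, f)

    # Load: binary read mode; unpack in the same order used when saving.
    with open(path, "rb") as f:
        x_loaded, y_loaded = pickle.load(f)

print(x_loaded.shape, y_loaded.shape)
```

For the file saved above, the analogous call would open `pickledatapath` and unpack four values, in the order `x_train, x_test, y_train, y_test`.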

## <u>PyTorch Datasets and DataLoaders</u>

The best practice for feeding data to your machine learning model is to store them as a dataset and then divide them into batches with a `DataLoader`.

In order to define a Dataset in PyTorch, you need to extend the `Dataset` class and specify 3 methods:
- `__init__()`: the class constructor
- `__len__()`: the method that returns the dataset length
- `__getitem__()`: the method that defines how samples are retrieved

Let's create a Dataset class for Iris data:

In [None]:
import torch as tc
from torch.utils.data import Dataset

class MyIrisDataset(Dataset): # Need to extend PyTorch's Dataset class

    def __init__(self, x_data, y_data):
        super().__init__()
        # can throw in an assert to check that
        # inputs and labels have the same size:
        assert len(x_data) == len(y_data)
        self.x_data = tc.tensor(x_data, dtype=tc.float64)
        self.y_data = tc.tensor(y_data, dtype=tc.int64)

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]
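
The three methods are what make `len(dataset)` and `dataset[i]` work. The sketch below checks this on a tiny stand-in class with two made-up samples; `TinyDataset` is a hypothetical name that mirrors the structure of the class above and is not meant to replace it:

```python
import torch
from torch.utils.data import Dataset


class TinyDataset(Dataset):  # hypothetical stand-in with the same three methods
    def __init__(self, x, y):
        super().__init__()
        assert len(x) == len(y)
        self.x = torch.tensor(x, dtype=torch.float64)
        self.y = torch.tensor(y, dtype=torch.int64)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        return self.x[index], self.y[index]


ds = TinyDataset([[5.1, 3.5, 1.4, 0.2], [6.3, 3.4, 5.6, 2.4]], [0, 2])

n = len(ds)      # calls __len__()
x0, y0 = ds[0]   # calls __getitem__(0)

print(f"length = {n}, x0 shape = {tuple(x0.shape)}, y0 = {y0.item()}")
```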

Now, we can use the `MyIrisDataset` class to create the train and test dataset objects:

In [None]:
train_dataset = MyIrisDataset(x_train, y_train)
test_dataset = MyIrisDataset(x_test, y_test)

print(f"{train_dataset}\n{test_dataset}")

<__main__.MyIrisDataset object at 0x7fcea03cbc40>
<__main__.MyIrisDataset object at 0x7fcea03cb340>


PyTorch Datasets can be fed to `DataLoader` objects, which are essentially iterables that divide the data into batches of some defined size, a.k.a. the *batch size*.

A very good and common practice is to shuffle your training data. There are many reasons for it, such as avoiding order bias and achieving better generalization.

The batch size is commonly set to a power of 2. This is partly a historical convention and partly motivated by how memory is organized and allocated on the hardware, which can make power-of-2 sizes more convenient for low-level optimizations. Any speed-up is not guaranteed, so treat this as a rule of thumb.

In [None]:
from torch.utils.data import DataLoader

batch_size = 2**4
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

Before doing anything else, let's first see what a batch looks like:

In [None]:
x_batch, y_batch = next(iter(train_loader))

print(f"x_batch shape: {x_batch.shape}\ny_batch shape: {y_batch.shape}")
print(f"\nx_batch[3:6, 0] = {x_batch[3:6, 0]}\ny_batch[3:6] = {y_batch[3:6]}")

x_batch shape: torch.Size([16, 4])
y_batch shape: torch.Size([16])

x_batch[3:6, 0] = tensor([7.9000, 4.8000, 6.3000], dtype=torch.float64)
y_batch[3:6] = tensor([2, 0, 1])


Usually, the number of data points is not a multiple of the batch size. In that case, the last batch is smaller and has size `len(data) % batch_size` (unless the `DataLoader` is created with `drop_last=True`, which discards it):

In [None]:
print(f"Last iteration will have a batch of size {x_train.shape[0] % batch_size}:\n")

for x_batch, y_batch in train_loader:
    print(f"x_batch.shape = {x_batch.shape}")
    # here we would feed the input batch
    # of this iteration into our model etc.


Last iteration will have a batch of size 4:

x_batch.shape = torch.Size([16, 4])
x_batch.shape = torch.Size([16, 4])
x_batch.shape = torch.Size([16, 4])
x_batch.shape = torch.Size([16, 4])
x_batch.shape = torch.Size([16, 4])
x_batch.shape = torch.Size([16, 4])
x_batch.shape = torch.Size([4, 4])
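
The number of batches follows directly from the dataset size: it is the ceiling of `len(data) / batch_size`, or the floor when the incomplete batch is dropped. The sketch below checks this on a dummy dataset of 100 all-zero points built with `TensorDataset`; the names `toy_dataset`, `loader`, and `loader_drop` are illustrative:

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

n_points, batch_size = 100, 16

# Dummy dataset with the same sizes as our training set (values are all zeros).
toy_dataset = TensorDataset(
    torch.zeros(n_points, 4), torch.zeros(n_points, dtype=torch.int64)
)

loader = DataLoader(toy_dataset, batch_size=batch_size, shuffle=True)
loader_drop = DataLoader(
    toy_dataset, batch_size=batch_size, shuffle=True, drop_last=True
)

# len() of a DataLoader is the number of batches per pass over the data.
print(len(loader), math.ceil(n_points / batch_size))
print(len(loader_drop), n_points // batch_size)

sizes = [len(x_b) for x_b, _ in loader]
sizes_drop = [len(x_b) for x_b, _ in loader_drop]
print(sizes)
print(sizes_drop)
```

Dropping the last batch can be useful when a model or loss is sensitive to very small batches, at the cost of skipping a few samples per pass.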
