# Programming for Data Science and Artificial Intelligence

## Datasets with PyTorch

In [2]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Loading data from files
We've seen how to load NumPy arrays into PyTorch, and anyone familiar with <tt>pandas.read_csv()</tt> can use it to prepare data before forming tensors.

In [3]:
df = pd.read_csv('data/iris.csv')
df.head()

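|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |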


In [4]:
df.shape

(100, 6)

The iris dataset consists of 50 samples each from three species of Iris (<em>Iris setosa</em>, <em>Iris virginica</em> and <em>Iris versicolor</em>), for 150 total samples. We have four features (sepal length & width, petal length & width) and three unique labels:
0. <em>Iris setosa</em>
1. <em>Iris versicolor</em>
2. <em>Iris virginica</em>
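If the CSV stores the species as strings in a <tt>Species</tt> column, as in the preview above, an integer <tt>target</tt> column has to be derived before the tensors can be built. The following is a minimal sketch under that assumption. The frame <tt>demo</tt>, the dictionary <tt>species_to_label</tt>, and the integer codes it assigns are illustrative choices, not something the data file is known to contain.

```python
import pandas as pd

# Tiny stand-in frame with the same column names as the preview above.
demo = pd.DataFrame({
    'Id': [1, 2, 3],
    'SepalLengthCm': [5.1, 7.0, 6.3],
    'SepalWidthCm': [3.5, 3.2, 3.3],
    'PetalLengthCm': [1.4, 4.7, 6.0],
    'PetalWidthCm': [0.2, 1.4, 2.5],
    'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})

# Assumed encoding: adjust the codes to whatever your labels should be.
species_to_label = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

# Add the integer target, then drop the columns that are not features.
demo['target'] = demo['Species'].map(species_to_label)
demo = demo.drop(columns=['Id', 'Species'])

print(demo.columns.tolist())
print(demo['target'].tolist())
```

After this step the frame holds four feature columns plus <tt>target</tt>, which is the layout the cells that call <tt>df.drop('target', axis=1)</tt> expect.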

### The classic method for building train/test split tensors
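Before introducing PyTorch's Dataset and DataLoader classes, we'll take a quick look at the alternative: splitting the data with scikit-learn's <tt>train_test_split</tt> and converting the resulting arrays to tensors by hand.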

In [36]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(df.drop('target',axis=1).values,
                                                    df['target'].values, test_size=0.2,
                                                    random_state=33)

X_train = torch.FloatTensor(train_X)
X_test  = torch.FloatTensor(test_X)
y_train = torch.LongTensor(train_y).reshape(-1, 1)
y_test  = torch.LongTensor(test_y).reshape(-1, 1)

In [24]:
print(f'Training size: {len(y_train)}')
labels, counts = y_train.unique(return_counts=True)
print(f'Labels: {labels}\nCounts: {counts}')

Training size: 120
Labels: tensor([0, 1, 2])
Counts: tensor([42, 42, 36])
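The counts above show that a purely random split does not keep the three classes balanced. One possible remedy, sketched here on synthetic stand-in data rather than the real iris file, is the <tt>stratify</tt> argument of <tt>train_test_split</tt>. The names <tt>features</tt>, <tt>targets</tt>, and <tt>strat_y_train</tt> are assumptions made for the example.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 150 rows, 4 features, 50 rows per class.
rng = np.random.default_rng(0)
features = rng.normal(size=(150, 4)).astype(np.float32)
targets = np.repeat([0, 1, 2], 50)

# stratify=targets asks scikit-learn to preserve class proportions in both splits.
tr_X, te_X, tr_y, te_y = train_test_split(
    features, targets, test_size=0.2, random_state=33, stratify=targets)

strat_y_train = torch.from_numpy(tr_y).long()
labels_s, counts_s = strat_y_train.unique(return_counts=True)
print(f'Labels: {labels_s}\nCounts: {counts_s}')
```

With equal class sizes and a 20% test fraction, each class contributes the same number of rows to the training set.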


## Using PyTorch's Dataset and DataLoader classes
A far better alternative is to leverage PyTorch's <a href='https://pytorch.org/docs/stable/data.html'><strong><tt>Dataset</tt></strong></a> and <a href='https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader'><strong><tt>DataLoader</tt></strong></a> classes.

Usually, to set up a Dataset specific to our investigation we would define our own custom class that inherits from <tt>torch.utils.data.Dataset</tt> (we'll do this in the CNN section). For now, we can use the built-in <a href='https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset'><strong><tt>TensorDataset</tt></strong></a> class.
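For reference, a custom map-style dataset only needs to implement <tt>__len__</tt> and <tt>__getitem__</tt>. The sketch below is one minimal way to do that. The class name <tt>ArrayDataset</tt> and the toy rows are hypothetical and are not part of this course's code.

```python
import torch
from torch.utils.data import Dataset


class ArrayDataset(Dataset):
    """Hypothetical minimal dataset wrapping in-memory features and labels."""

    def __init__(self, features, targets):
        self.features = torch.as_tensor(features, dtype=torch.float32)
        self.targets = torch.as_tensor(targets, dtype=torch.long)
        if len(self.features) != len(self.targets):
            raise ValueError('features and targets must have the same length')

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.targets)

    def __getitem__(self, idx):
        # Return one (features, label) pair.
        return self.features[idx], self.targets[idx]


toy = ArrayDataset([[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4]], [0, 1])
print(len(toy))
print(toy[1])
```

For tensors that already fit in memory, <tt>TensorDataset</tt> provides the same behavior without a custom class, which is why it is used in the cells that follow.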

In [40]:
from torch.utils.data import TensorDataset, DataLoader

data = df.drop('target',axis=1).values
labels = df['target'].values

iris = TensorDataset(torch.FloatTensor(data),torch.LongTensor(labels))

In [41]:
len(iris)

150

In [42]:
type(iris)

torch.utils.data.dataset.TensorDataset
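A dataset object can also be split into train and test subsets without leaving PyTorch. The sketch below uses <tt>torch.utils.data.random_split</tt> on a synthetic stand-in dataset. The 120/30 sizes mirror the 80/20 split used earlier, and the names <tt>full</tt>, <tt>train_ds</tt>, and <tt>test_ds</tt> are assumptions for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic stand-in with the same shape as the iris tensors: 150 rows, 4 features.
full = TensorDataset(torch.randn(150, 4), torch.arange(150) % 3)

# A seeded generator makes the split repeatable.
gen = torch.Generator().manual_seed(33)
train_ds, test_ds = random_split(full, [120, 30], generator=gen)

print(len(train_ds), len(test_ds))
```

Note that <tt>random_split</tt> does not stratify by label, so class counts in each subset can be uneven.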

In [43]:
for i in iris:
    print(i)

(tensor([5.1000, 3.5000, 1.4000, 0.2000]), tensor(0))
(tensor([4.9000, 3.0000, 1.4000, 0.2000]), tensor(0))
(tensor([4.7000, 3.2000, 1.3000, 0.2000]), tensor(0))
(tensor([4.6000, 3.1000, 1.5000, 0.2000]), tensor(0))
(tensor([5.0000, 3.6000, 1.4000, 0.2000]), tensor(0))
(tensor([5.4000, 3.9000, 1.7000, 0.4000]), tensor(0))
(tensor([4.6000, 3.4000, 1.4000, 0.3000]), tensor(0))
(tensor([5.0000, 3.4000, 1.5000, 0.2000]), tensor(0))
(tensor([4.4000, 2.9000, 1.4000, 0.2000]), tensor(0))
(tensor([4.9000, 3.1000, 1.5000, 0.1000]), tensor(0))
(tensor([5.4000, 3.7000, 1.5000, 0.2000]), tensor(0))
(tensor([4.8000, 3.4000, 1.6000, 0.2000]), tensor(0))
(tensor([4.8000, 3.0000, 1.4000, 0.1000]), tensor(0))
(tensor([4.3000, 3.0000, 1.1000, 0.1000]), tensor(0))
(tensor([5.8000, 4.0000, 1.2000, 0.2000]), tensor(0))
(tensor([5.7000, 4.4000, 1.5000, 0.4000]), tensor(0))
(tensor([5.4000, 3.9000, 1.3000, 0.4000]), tensor(0))
(tensor([5.1000, 3.5000, 1.4000, 0.3000]), tensor(0))
(tensor([5.7000, 3.8000, 1.7

Once we have a dataset we can wrap it in a DataLoader, which handles batching and shuffling and provides single- or multi-process iterators over the dataset.

In [50]:
iris_loader = DataLoader(iris, batch_size=15, shuffle=True)
print(iris_loader)

<torch.utils.data.dataloader.DataLoader object at 0x0000022D2CF71978>


In [55]:
for i_batch, sample_batched in enumerate(iris_loader):
    print(i_batch, sample_batched)

0 [tensor([[6.4000, 2.8000, 5.6000, 2.2000],
        [6.4000, 3.1000, 5.5000, 1.8000],
        [5.8000, 2.7000, 5.1000, 1.9000],
        [6.8000, 3.2000, 5.9000, 2.3000],
        [5.7000, 4.4000, 1.5000, 0.4000],
        [5.6000, 3.0000, 4.5000, 1.5000],
        [6.5000, 3.2000, 5.1000, 2.0000],
        [5.1000, 3.8000, 1.5000, 0.3000],
        [5.9000, 3.0000, 4.2000, 1.5000],
        [5.9000, 3.2000, 4.8000, 1.8000],
        [7.0000, 3.2000, 4.7000, 1.4000],
        [4.9000, 3.1000, 1.5000, 0.1000],
        [5.1000, 3.8000, 1.6000, 0.2000],
        [5.9000, 3.0000, 5.1000, 1.8000],
        [7.2000, 3.0000, 5.8000, 1.6000]]), tensor([2, 2, 2, 2, 0, 1, 2, 0, 1, 1, 1, 0, 0, 2, 2])]
1 [tensor([[7.7000, 2.8000, 6.7000, 2.0000],
        [6.2000, 2.9000, 4.3000, 1.3000],
        [6.7000, 3.0000, 5.2000, 2.3000],
        [6.9000, 3.2000, 5.7000, 2.3000],
        [6.7000, 2.5000, 5.8000, 1.8000],
        [6.7000, 3.1000, 4.4000, 1.4000],
        [4.6000, 3.2000, 1.4000, 0.2000],
        [5.70

In [56]:
list(iris_loader)[0][1].bincount()

tensor([3, 6, 6])
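Because the loader was created with <tt>shuffle=True</tt>, every new pass over it draws a fresh permutation, so the class counts in the first batch change from run to run. If repeatable batches are needed, one option is to pass a seeded <tt>torch.Generator</tt> to the loader. This is a sketch on synthetic data, and <tt>toy_ds</tt> and <tt>first_batch_labels</tt> are hypothetical names.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

toy_ds = TensorDataset(torch.randn(150, 4), torch.arange(150) % 3)


def first_batch_labels(seed):
    # A fresh generator with a fixed seed fixes the shuffle order.
    gen = torch.Generator().manual_seed(seed)
    loader = DataLoader(toy_ds, batch_size=15, shuffle=True, generator=gen)
    _, y = next(iter(loader))
    return y


a = first_batch_labels(0)
b = first_batch_labels(0)
print(torch.equal(a, b))
print(a.bincount(minlength=3))
```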

In [57]:
next(iter(iris_loader))

[tensor([[4.6000, 3.2000, 1.4000, 0.2000],
         [7.2000, 3.2000, 6.0000, 1.8000],
         [6.9000, 3.1000, 4.9000, 1.5000],
         [5.6000, 3.0000, 4.5000, 1.5000],
         [5.1000, 3.7000, 1.5000, 0.4000],
         [5.0000, 3.5000, 1.6000, 0.6000],
         [5.0000, 3.4000, 1.5000, 0.2000],
         [5.6000, 3.0000, 4.1000, 1.3000],
         [6.6000, 2.9000, 4.6000, 1.3000],
         [4.3000, 3.0000, 1.1000, 0.1000],
         [5.1000, 3.3000, 1.7000, 0.5000],
         [4.9000, 3.0000, 1.4000, 0.2000],
         [5.8000, 2.6000, 4.0000, 1.2000],
         [5.5000, 3.5000, 1.3000, 0.2000],
         [4.9000, 2.4000, 3.3000, 1.0000]]),
 tensor([0, 2, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1])]
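Each batch is a two-element list, so it can be unpacked directly into features and labels. The sketch below shows the shapes and dtypes to expect, using a synthetic stand-in for the iris tensors. The names <tt>stand_in</tt>, <tt>X_batch</tt>, and <tt>y_batch</tt> are assumptions.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

stand_in = TensorDataset(torch.randn(150, 4), torch.arange(150) % 3)
stand_in_loader = DataLoader(stand_in, batch_size=15, shuffle=True)

# Unpack one batch into a feature matrix and a label vector.
X_batch, y_batch = next(iter(stand_in_loader))
print(X_batch.shape, X_batch.dtype)
print(y_batch.shape, y_batch.dtype)
```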

In [66]:
len(list(iris_loader))

10
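The result of 10 follows from 150 samples divided into batches of 15. When the batch size does not divide the dataset evenly, the final batch is smaller unless <tt>drop_last=True</tt> is set. The following sketch illustrates this on synthetic data, and the loader names are hypothetical.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

ds = TensorDataset(torch.randn(150, 4), torch.arange(150) % 3)

even = DataLoader(ds, batch_size=15)                      # 150 / 15 batches
ragged = DataLoader(ds, batch_size=20)                    # last batch is partial
trimmed = DataLoader(ds, batch_size=20, drop_last=True)   # partial batch discarded

# len() on a DataLoader gives the number of batches without iterating.
print(len(even), len(ragged), len(trimmed))

# Size of the final batch when the partial batch is kept.
last_size = [len(y) for _, y in ragged][-1]
print(last_size)
```

Calling <tt>len()</tt> on the loader avoids materializing every batch, which <tt>len(list(loader))</tt> does.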

In [73]:
print(list(iris_loader))

[[tensor([[5.6000, 2.5000, 3.9000, 1.1000],
        [5.4000, 3.4000, 1.7000, 0.2000],
        [6.7000, 2.5000, 5.8000, 1.8000],
        [4.6000, 3.1000, 1.5000, 0.2000],
        [5.9000, 3.2000, 4.8000, 1.8000],
        [4.7000, 3.2000, 1.6000, 0.2000],
        [6.7000, 3.3000, 5.7000, 2.1000],
        [5.6000, 3.0000, 4.1000, 1.3000],
        [6.7000, 3.1000, 4.7000, 1.5000],
        [6.1000, 2.6000, 5.6000, 1.4000],
        [5.8000, 2.6000, 4.0000, 1.2000],
        [6.7000, 3.0000, 5.0000, 1.7000],
        [6.7000, 3.0000, 5.2000, 2.3000],
        [7.7000, 3.0000, 6.1000, 2.3000],
        [5.1000, 3.8000, 1.5000, 0.3000]]), tensor([1, 0, 2, 0, 1, 0, 2, 1, 1, 2, 1, 1, 2, 2, 0])], [tensor([[5.8000, 2.7000, 4.1000, 1.0000],
        [6.5000, 3.0000, 5.2000, 2.0000],
        [4.3000, 3.0000, 1.1000, 0.1000],
        [5.1000, 3.7000, 1.5000, 0.4000],
        [6.7000, 3.3000, 5.7000, 2.5000],
        [5.2000, 4.1000, 1.5000, 0.1000],
        [5.0000, 3.3000, 1.4000, 0.2000],
        [6.2000


## A Quick Note on Torchvision
PyTorch offers another powerful dataset tool called <a href='https://pytorch.org/docs/stable/torchvision/index.html'><tt><strong>torchvision</strong></tt></a>, which is useful when working with image data. We'll go into a lot more detail in the Convolutional Neural Network (CNN) section. For now, just know that torchvision offers built-in image datasets like <a href='https://en.wikipedia.org/wiki/MNIST_database'>MNIST</a> and <a href='https://en.wikipedia.org/wiki/CIFAR-10'>CIFAR-10</a>, as well as tools for transforming images into tensors.