# Dataloading 01

In this notebook, we'll figure out how to use PyTorch's DataLoader class to load our massive files without reading the entirety of them into memory

In [1]:
import dask.dataframe as dd
import pandas as pd 
import torch
import linecache 
import csv
import numpy as np
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

We'll first design a custom dataset to use with PyTorch's `DataLoader` class

In [2]:
class GeneExpressionData(Dataset):
    def __init__(self, filename, labelname):
        self._filename = filename
        self._labelname = labelname
        self._total_data = 0
        
        with open(filename, "r") as f:
            self._total_data = len(f.readlines()) - 1
    
    def __getitem__(self, idx):
        if idx == 0:
            return self.__getitem__(1)
        
        line = linecache.getline(self._filename, idx + 1)
        label = linecache.getline(self._labelname, idx + 1)
        
        csv_data = csv.reader([line])
        csv_label = csv.reader([label])
        
        data = [x for x in csv_data][0]
        label = [x for x in csv_label][0]
        
        return (
            torch.from_numpy(np.array([float(x) for x in data])),
            torch.from_numpy(np.array([int(float(x)) for x in label])),
        )
    
    def __len__(self):
        return self._total_data
    
    def num_labels(self):
        return pd.read_csv(self._labelname)['# label'].nunique()
    
    def num_features(self):
        return len(self.__getitem__(0)[0])

Since PyTorch loss functions require classes in $[0, C]$, we'll first add $1$ to the labels and re-write it out so we can use it for training

In [3]:
def fix_labels(file):
    labels = pd.read_csv(file)
    labels['# label'] = labels['# label'].astype(int) + 1
    labels.to_csv('fixed_' + file.split('/')[-1], index=False)

# fix_labels('../data/processed/primary_reduction_neighbors_10_components_3.csv')

Let's test this quickly and then continue

In [4]:
fixed_labels = pd.read_csv('fixed_primary_labels_neighbors_15_components_100_clust_size_100.csv')
fixed_labels['# label'].unique()

array([ 0,  6,  5,  9,  3,  1,  2, 14, 15, 18, 16, 10, 13, 12,  8,  4,  7,
       17, 11, 19])

Great, we now continue as normal

In [5]:
t = GeneExpressionData(
    filename='../data/processed/primary_reduction_neighbors_10_components_3.csv',
    labelname='fixed_primary_labels_neighbors_15_components_100_clust_size_100.csv'
)

In [6]:
t.num_features()

3

Before we train our model, we need to split our data into training and testing sets, in order to get an unbiased evaluation of our model's performance. Likely, we will initially overfit the training set since we provide no regularization.

In [7]:
train_size = int(0.8 * len(t))
test_size = len(t) - train_size

train, test = torch.utils.data.random_split(t, [train_size, test_size])

In [8]:
traindata = DataLoader(train, batch_size = 64, num_workers = 0)

Now that we've defined our `DataLoader`, let's test it when training a simple Neural Network

In [9]:
class NN(nn.Module):
    def __init__(self, N_features, N_labels):
        super().__init__()
        
        self.network = nn.Sequential(
            nn.Linear(in_features=N_features, out_features=16),
            nn.ReLU(),
            nn.Linear(in_features=16, out_features=32),
            nn.ReLU(),
            nn.Linear(in_features=32, out_features=16),
            nn.ReLU(),
            nn.Linear(in_features=16, out_features=N_labels),
        )
        
    def forward(self, x):
        return self.network(x)

In [10]:
network = NN(
    N_features=t.num_features(),
    N_labels=t.num_labels()
)

Now we can define our criterion, optimization method and train our model on our dataset

In [11]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(network.parameters(), lr = 0.01)
loss_arr = []

And finally train our model

In [12]:
epochs = 50

for i in range(1, epochs):
    if i > 1 and i %% 10 == 0:
        print(f'Epoch {i} has loss {loss_arr[i]}')
        
    for X, y in traindata:
        yhat = network(X.float())
        loss = criterion(yhat, y.flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        loss_arr.append(loss.item())

Epoch 2 has loss 2.9951248168945312
Epoch 3 has loss 2.928852081298828
Epoch 4 has loss 2.815742015838623
Epoch 5 has loss 2.760014772415161
Epoch 6 has loss 2.55064058303833
Epoch 7 has loss 2.5633928775787354
Epoch 8 has loss 2.4664502143859863
Epoch 9 has loss 2.2520246505737305
Epoch 10 has loss 2.168370246887207
Epoch 11 has loss 2.09975266456604
Epoch 12 has loss 2.21954607963562
Epoch 13 has loss 2.2362663745880127
Epoch 14 has loss 2.0091419219970703
Epoch 15 has loss 1.973549723625183
Epoch 16 has loss 1.9382860660552979
Epoch 17 has loss 2.110527515411377
Epoch 18 has loss 1.9396253824234009
Epoch 19 has loss 2.0526015758514404
Epoch 20 has loss 1.9217842817306519
Epoch 21 has loss 1.8541680574417114
Epoch 22 has loss 1.7285218238830566
Epoch 23 has loss 1.870560646057129
Epoch 24 has loss 1.8828296661376953
Epoch 25 has loss 1.7172486782073975
Epoch 26 has loss 1.8873571157455444
Epoch 27 has loss 1.723275899887085
Epoch 28 has loss 1.7409210205078125
Epoch 29 has loss 1.811

Now, let's test our model on the test set and evaluate our results

In [44]:
test_loss = []

for X, y in test:
    output = network(X.float())
#     output = output.data.numpy()
    
    prediction = int(torch.max(output.data, 1)[1].numpy())
    print(prediction)
    

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)