# PyTorch Basics

In PyTorch, all data is stored in `Tensor`s. These are much like NumPy ndarrays, but they can also take part in computational graphs. Tensors can be created directly or from NumPy arrays.

In [None]:
import torch
import numpy as np

# Creating Tensors from numpy arrays:
matrix_np = np.random.normal(0, 0.1, (3,4))
matrix_t = torch.from_numpy(matrix_np)
matrix_t
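
One detail worth knowing about `torch.from_numpy`: the resulting tensor shares memory with the source array and keeps its dtype (`float64` for NumPy's default). The following is a minimal sketch; the variable names are illustrative only.

```python
import numpy as np
import torch

arr = np.zeros((2, 3))            # NumPy defaults to float64
shared_t = torch.from_numpy(arr)  # no copy: the tensor views the same memory

arr[0, 0] = 7.0                   # modify the array in place...
print(shared_t[0, 0].item())      # ...and the tensor sees the change

# Most torch.nn layers hold float32 parameters, so a cast is often needed.
float_t = shared_t.float()
print(shared_t.dtype, float_t.dtype)
```

If an independent copy is wanted instead, `torch.tensor(arr)` copies the data.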

In [None]:
# Creating Tensors from lists:
matrix_t = torch.Tensor([[3.2, 4.5, 1.2], [1.2, 3.4, 1.2]])
matrix_t

In [None]:
# Creating Tensors directly in PyTorch
matrix_t = torch.rand(3, 4) # Uniform in [0,1] range
matrix2_t = torch.normal(torch.Tensor([-1, 1]), torch.Tensor([1.2, 1.2])) # Normal, with per-element mean and stddev
print(matrix_t)
print(matrix2_t)

PyTorch has many built-in math operations that can be performed on Tensors.

In [None]:
# Math operations:
W = torch.normal(torch.zeros([10]), 0.01).view([5,2]) # 5x2 matrix, each element i.i.d. mean=0, stddev=0.01
b = torch.zeros(5)
x = torch.Tensor([1.0, 2.0])
y = torch.sigmoid(torch.matmul(W, x) + b)
y
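
To make the computation above concrete, here is a small sketch that compares `torch.sigmoid` against the formula written out by hand. The names `W_demo`, `b_demo` and `x_demo` are stand-ins introduced for this example, not part of the notebook.

```python
import torch

torch.manual_seed(0)
W_demo = torch.randn(5, 2) * 0.01      # 5x2 matrix, small random values
b_demo = torch.zeros(5)
x_demo = torch.tensor([1.0, 2.0])

# (5x2 matrix) @ (vector of length 2) -> vector of length 5
z = torch.matmul(W_demo, x_demo) + b_demo
y_builtin = torch.sigmoid(z)
y_manual = 1.0 / (1.0 + torch.exp(-z))  # sigmoid written out by hand

print(z.shape)
print(torch.allclose(y_builtin, y_manual))
```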

Each tensor has a shape (just like in NumPy) and a number of dimensions.

In [None]:
print(f"W.shape: {W.shape}")
print(f"W.dim(): {W.dim()}")
print(f"b.shape: {b.shape}")
print(f"y.shape: {y.shape}")

Tensors representing scalars have an empty shape (`torch.Size([])`) and zero dimensions. These are called zero-dimensional tensors.

In [None]:
y_sum = torch.sum(y)
print(f"y_sum.shape: {y_sum.shape}")
print(f"y_sum.dim(): {y_sum.dim()}")
y_sum

To recover a Python scalar from a zero-dimensional `Tensor`, we use the method `item()`.

In [None]:
print(y_sum.item())   # a plain Python float
y_sum.detach() + 5    # still a zero-dimensional Tensor, but detached from the graph
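
The difference between a zero-dimensional tensor and a one-element vector is easy to miss. The sketch below uses illustrative names (`scalar_t`, `vector_t`) to show both, and that `item()` works on any tensor holding exactly one element but fails otherwise.

```python
import torch

scalar_t = torch.tensor(3.5)     # zero-dimensional
vector_t = torch.tensor([3.5])   # one-dimensional, with a single element

print(scalar_t.shape, scalar_t.dim())
print(vector_t.shape, vector_t.dim())

# item() accepts any tensor with exactly one element
value = vector_t.item()
print(type(value), value)

# With more than one element, item() raises an error
try:
    torch.tensor([1.0, 2.0]).item()
    item_failed = False
except (RuntimeError, ValueError) as err:
    item_failed = True
    print("item() failed:", err)
```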

## Computational Graphs in PyTorch

PyTorch uses __eager execution__: each operation is evaluated __immediately__, as soon as it is added to the computational graph, and the graph is kept in memory so that a backward pass can be performed later. DyNet, in contrast, uses __lazy execution__ by default.

This is great for debugging, because an error raises an exception at exactly the line where the error occurred, making it easy to track down.

**Careful:** When performing operations on the GPU, PyTorch is still eager, but CUDA is asynchronous. This means that operations on the GPU are __queued__ at the point at which they are called, but may finish later. As a result, an error may be reported only at a later CUDA call. Setting the environment variable `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, which helps when debugging.
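
One practical consequence of asynchronous CUDA execution is that wall-clock timings can be misleading unless the GPU is synchronized first. The sketch below shows one way to handle this; `timed_matmul` is a hypothetical helper written for this example, and it falls back to the CPU when no GPU is available.

```python
import time
import torch

def timed_matmul(n=256):
    """Multiply two n x n matrices and return (result, elapsed seconds)."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    a = torch.rand(n, n, device=device)
    b = torch.rand(n, n, device=device)

    start = time.perf_counter()
    c = a @ b
    if device.type == "cuda":
        # Wait for the queued kernel to finish before stopping the clock
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return c, elapsed

result, seconds = timed_matmul()
print(result.shape, seconds)
```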

Each `Tensor` has an attribute `requires_grad`. If it is set to `True`, then PyTorch will compute gradients with respect to this `Tensor` during the backward pass. To demonstrate:

In [None]:
# Create 3 tensors:
W = torch.normal(torch.zeros([10]), 0.01).view([5,2]) # 5x2 matrix, each element i.i.d. mean=0, stddev=0.01
b = torch.zeros(5)
x = torch.Tensor([1.0, 2.0])

# Tell PyTorch that we want to compute gradients with respect to W
W.requires_grad = True

# Perform the computation
y = torch.sigmoid(torch.matmul(W, x) + b)
sum_y = torch.sum(y)

# Run the backward pass (without arguments, this can only be called on scalar tensors)
sum_y.backward()

# Print the gradients (b.grad is None, because requires_grad was never set on b):
print(f"W gradients: {W.grad}")
print(f"b gradients: {b.grad}")
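
As a sanity check on what `backward()` computes, the gradient of `sum(sigmoid(Wx + b))` with respect to `W[i, j]` can be derived by hand as `y[i] * (1 - y[i]) * x[j]`. The sketch below compares that formula with autograd; the `_chk` names are introduced only for this example.

```python
import torch

torch.manual_seed(0)
W_chk = (torch.randn(5, 2) * 0.01).requires_grad_()
b_chk = torch.zeros(5)                 # requires_grad stays False
x_chk = torch.tensor([1.0, 2.0])

y_chk = torch.sigmoid(torch.matmul(W_chk, x_chk) + b_chk)
y_chk.sum().backward()

# Hand-derived gradient: d sum(y) / d W[i, j] = y[i] * (1 - y[i]) * x[j]
with torch.no_grad():
    sig_grad = y_chk * (1 - y_chk)                          # shape (5,)
    expected = sig_grad.unsqueeze(1) * x_chk.unsqueeze(0)   # shape (5, 2)

print(torch.allclose(W_chk.grad, expected))
print(b_chk.grad)   # no gradient was tracked for b_chk
```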

## Model parameters in PyTorch

Most functionality useful for neural network development lives in the `torch.nn` package.
`torch.nn.Parameter` is a subclass of `Tensor` that is typically used to represent model parameters. It has two useful features:
* All `torch.nn.Parameter`s have `requires_grad=True` by default.
* When assigned as attributes of a `torch.nn.Module`, they are registered automatically, so they can be easily found and given to the optimizer.

In [None]:
from torch.nn import Parameter
weights_parameter = Parameter(torch.zeros(5))
weights_parameter.requires_grad
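
To illustrate the second point, here is a minimal sketch with a hypothetical module, `TinyModule`, that holds one `Parameter` and one plain tensor. Only the `Parameter` is picked up by `named_parameters()`.

```python
import torch
import torch.nn as nn

class TinyModule(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(5))  # registered as a parameter
        self.scratch = torch.zeros(5)                # plain tensor: not registered

tiny = TinyModule()
param_names = [name for name, _ in tiny.named_parameters()]
print(param_names)
```

An optimizer built from `tiny.parameters()` would therefore update `weights` but never touch `scratch`.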

# Training a model in PyTorch

## Preparing data (copied from DyNet notebook)

In [None]:
import os
import nltk

def load_data(directory):
    examples = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            # Each file holds one review; only its first line is used
            with open(os.path.join(directory, filename)) as f:
                words = nltk.word_tokenize(f.readline().lower())
            examples.append({"id": filename, "data": words})
    return examples

negative_examples = load_data("neg/")
positive_examples = load_data("pos/")

print("num per class: negative " + str(len(negative_examples)) + "; positive " + str(len(positive_examples)))

dev_examples = [ ]
train_examples = [ ]

import random
random.shuffle(negative_examples)
random.shuffle(positive_examples)

for example in negative_examples:
    example["label"] = 0
    randnum = random.random()
    if randnum < 0.8:
        train_examples.append(example)
    else:
        dev_examples.append(example)
        
for example in positive_examples:
    example["label"] = 1
    randnum = random.random()
    if randnum < 0.8:
        train_examples.append(example)
    else:
        dev_examples.append(example)
        
print("lengths: train " + str(len(train_examples)) + "; dev " + str(len(dev_examples)))

vocab = list(set([word for example in train_examples for word in example["data"]])) + ["UNK"]
print(f"vocab size: {len(vocab)}")
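
Since `vocab` is a plain list, looking up a word with `vocab.index(word)` scans the list each time. One common alternative is a dictionary from word to index. The sketch below uses a toy vocabulary, and the names `toy_vocab`, `word_to_index` and `tokenize` are hypothetical, not part of the notebook.

```python
# Toy vocabulary standing in for the real one; "UNK" is last, as above
toy_vocab = ["good", "bad", "movie", "UNK"]

# Build the word -> index mapping once
word_to_index = {word: i for i, word in enumerate(toy_vocab)}
unk_index = word_to_index["UNK"]

def tokenize(words):
    """Map each word to its index, falling back to UNK for unseen words."""
    return [word_to_index.get(w, unk_index) for w in words]

ids = tokenize(["good", "movie", "zzz"])
print(ids)
```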

## Implementing simple neural network the PyTorch way

We already saw how we can implement neural networks in PyTorch by writing the mathematical operations manually and then calling `backward` to compute gradients with respect to model parameters. This is great, but can get out of hand when building large models with millions or billions of parameters.

PyTorch adopts a modular programming paradigm. Every neural network is a subclass of the `nn.Module` base class.
`nn.Module` represents a node of arbitrary complexity in a computational graph. Each `nn.Module` can internally consist of many other `nn.Module`s, allowing us to easily re-use building blocks.

When subclassing `nn.Module`, we have to implement the `__init__` and `forward` methods. We do not need to write a backward pass: autograd derives it from the operations performed in `forward`.

Let's implement a simple two-layer neural network `ReviewClassifier` and train it to classify online review sentiment. This is the same classifier as was used in the `DyNet` tutorial.

In [None]:
import torch.nn as nn
from torch.optim import Adam

class ReviewClassifier(nn.Module):
    
    def __init__(self, hidden_size, num_classes, vocab_size):
        super(ReviewClassifier, self).__init__() # Unfortunately we have to call the superclass __init__ method
        
        self.embeddings = nn.Embedding(vocab_size, hidden_size)
        self.layer1 = nn.Linear(hidden_size, hidden_size, bias=True)
        self.layer2 = nn.Linear(hidden_size, num_classes, bias=True)
        
    def forward(self, tokenized_sentence):
        word_embeds = self.embeddings(tokenized_sentence)
        sentence_embed = torch.mean(word_embeds, dim=1)
        h = self.layer1(sentence_embed)
        h = torch.tanh(h)
        out = self.layer2(h)
        return out
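
It can help to trace the tensor shapes through the `forward` pass. The sketch below rebuilds the same three layers outside of a class with small, made-up sizes and a made-up token sequence, purely to show how the shapes change.

```python
import torch
import torch.nn as nn

# Made-up sizes for illustration
vocab_size, hidden_size, num_classes = 50, 8, 2

embeddings = nn.Embedding(vocab_size, hidden_size)
layer1 = nn.Linear(hidden_size, hidden_size)
layer2 = nn.Linear(hidden_size, num_classes)

tokens = torch.tensor([[4, 17, 3, 42, 0, 9, 1]])   # (batch=1, length=7)
word_embeds = embeddings(tokens)                   # (1, 7, hidden_size)
sentence_embed = word_embeds.mean(dim=1)           # (1, hidden_size)
out = layer2(torch.tanh(layer1(sentence_embed)))   # (1, num_classes)

print(word_embeds.shape, sentence_embed.shape, out.shape)
```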


## Optimizing parameters

Let's train our `ReviewClassifier` on the same task: online review classification.

In [None]:
import time
import torch.optim as optim

HIDDEN_SIZE = 64
NUM_CLASSES = 2

model = ReviewClassifier(hidden_size=HIDDEN_SIZE, num_classes=NUM_CLASSES, vocab_size=len(vocab))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

epoch_start_time = time.time()

for i, example in enumerate(train_examples[:20]):
    
    # First, tokenize the input sentence
    tok_sentence = []
    for word in example["data"]:
        if word in vocab:
            tok_sentence.append(vocab.index(word))
        else:
            tok_sentence.append(vocab.index("UNK"))
    
    # Convert the sentence and labels to Tensors of type long
    # (built directly as integers, without a detour through float32).
    tok_sentence = torch.tensor(tok_sentence, dtype=torch.long)
    label = torch.tensor([example["label"]], dtype=torch.long)
    
    # Add an additional batch dimension
    tok_sentence = tok_sentence[np.newaxis, :]
    
    # Run the forward pass and compute loss
    score = model(tok_sentence)
    loss = criterion(score, label)
    
    # Zero the gradients, run backward pass, and take a gradient step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    print(loss.item())
print("total time: " + str(time.time() - epoch_start_time))
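
Note that `nn.CrossEntropyLoss` expects raw, unnormalized scores (logits) and applies log-softmax internally, which is why the model has no softmax layer. The sketch below checks this on made-up scores; the values are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

scores = torch.tensor([[1.5, -0.5]])   # made-up logits for one example
label = torch.tensor([0])              # the correct class

loss = nn.CrossEntropyLoss()(scores, label)

# The same value by hand: negative log-probability of the correct class
manual = -F.log_softmax(scores, dim=1)[0, label.item()]

# The predicted class is the index of the largest score
predicted_class = scores.argmax(dim=1).item()

print(loss.item(), manual.item(), predicted_class)
```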

### Batching

In PyTorch with eager execution, we need to manually pack our training examples into batches before feeding them to the network.

Most operations in `torch.nn` assume that the first dimension of all input and output tensors is the batch dimension. Basic math operations on `torch.Tensor` or in `torch` do not make this assumption.

To use batching for the review classifier, we need to stack together multiple reviews into a single tensor, so that we can simultaneously apply our model on all reviews in the batch.

Different reviews have different lengths. We can't just stack them together into a single tensor.

What we can do is make sure there is space for the longest review, and simply pad the shorter reviews with zeros. Then we need to keep track of how long each review is, so that we avoid treating these padded zeros as words that are part of the review. A simple way to do this is with a mask, where 1 indicates that the corresponding token (word) is part of a review and 0 indicates that it is a padded value and should be ignored.

An alternative way to do this is to keep a list of review lengths around, but the masking approach is easier in this case.

In [None]:
class ReviewClassifierWithBatching(nn.Module):
    
    def __init__(self, hidden_size, num_classes, vocab_size):
        super(ReviewClassifierWithBatching, self).__init__() # Unfortunately we have to call the superclass __init__ method
        
        self.embeddings = nn.Embedding(vocab_size, hidden_size)
        self.layer1 = nn.Linear(hidden_size, hidden_size, bias=True)
        self.layer2 = nn.Linear(hidden_size, num_classes, bias=True)
        
    def forward(self, tokenized_sentences, mask):
        # tokenized_sentences is a tensor of size BxL where L is the length of the longest review
        # mask is also a tensor of size BxL, but only stores zeros and ones, indicating whether the 
        #  corresponding word in tokenized_sentences is a real word and part of a review, or a padded value.
        word_embeds = self.embeddings(tokenized_sentences)
        
        # multiplying word_embeds with the mask zeroes out the word embeddings corresponding to padded zeroes
        sentence_embed = torch.sum(word_embeds * mask[:, :, np.newaxis], dim=1) / \
            (torch.sum(mask, 1)[:, np.newaxis] + 1e-32)
        
        h = self.layer1(sentence_embed)
        h = torch.tanh(h)  # same nonlinearity as in ReviewClassifier
        out = self.layer2(h)
        return out
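
To see why the mask matters, the sketch below checks that the masked average used in `forward` matches the plain average over the real tokens only, while a naive average over the padded positions does not. All tensors here are made-up stand-ins for word embeddings.

```python
import torch

torch.manual_seed(0)
hidden = 4
embeds_a = torch.randn(3, hidden)   # a "review" with 3 tokens
embeds_b = torch.randn(1, hidden)   # a "review" with 1 token

# Padded positions hold arbitrary values, just as the embedding of the
# padding index 0 would in the real model.
padded = torch.randn(2, 3, hidden)
padded[0, :3] = embeds_a
padded[1, :1] = embeds_b
mask = torch.tensor([[1.0, 1.0, 1.0],
                     [1.0, 0.0, 0.0]])

# Zero out padded positions, then divide by the true length of each review
masked_mean = (padded * mask[:, :, None]).sum(dim=1) / mask.sum(dim=1, keepdim=True)
naive_mean = padded.mean(dim=1)

print(torch.allclose(masked_mean[1], embeds_b[0]))
print(torch.allclose(naive_mean[1], embeds_b[0]))
```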

We also need to make changes to our data pre-processing code to create batches and masks. We create a new function `examples_to_batch` that takes multiple training examples from our dataset as input and produces a batch of tokenized input reviews, labels, and masks.

In [None]:
BATCH_SIZE = 100

def examples_to_batch(batch_of_examples):
    sentence_list = []
    label_list = []
    max_len = 0

    for example in batch_of_examples:
        # First, tokenize the input review
        tok_sentence = []
        for word in example["data"]:
            if word in vocab:
                tok_sentence.append(vocab.index(word))
            else:
                tok_sentence.append(vocab.index("UNK"))

        # Update length of longest sentence
        if len(tok_sentence) > max_len:
            max_len = len(tok_sentence)
        
        # Convert the review and labels to Tensors of type long and add to the list
        tok_sentence = torch.tensor(tok_sentence, dtype=torch.long)
        sentence_list.append(tok_sentence)
        label = torch.tensor([example["label"]], dtype=torch.long)
        label_list.append(label)
    
    # Stack all labels into a batch
    batch_of_labels = torch.cat(label_list)
    # For reviews it's not that easy since they have variable length.
    # Create a new tensor that has enough space
    batch_of_sentences = torch.zeros([len(sentence_list), max_len]).long()
    mask = torch.zeros([len(sentence_list), max_len])
    for i,sentence in enumerate(sentence_list):
        batch_of_sentences[i,:len(sentence)] = sentence
        mask[i, :len(sentence)] = 1.0
    
    return batch_of_sentences, batch_of_labels, mask
    

def epoch_train(train_examples):
    epoch_start_time = time.time()
    random.shuffle(train_examples)
    for i in range(0, len(train_examples), BATCH_SIZE):
        batch_of_examples = train_examples[i:i+BATCH_SIZE]
        reviews, labels, mask = examples_to_batch(batch_of_examples)

        # Run the forward pass and compute loss
        scores = model(reviews, mask)
        loss = criterion(scores, labels)

        # Zero the gradients, run backward pass, and take a gradient step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        #print(loss.item())
    print("total time: " + str(time.time() - epoch_start_time))
    
def evaluate_accuracy(dev_examples):
    num_correct = 0
    num_total = 0
    for i in range(0, len(dev_examples), BATCH_SIZE):
        batch_of_examples = dev_examples[i:i+BATCH_SIZE]
        sentences, labels, mask = examples_to_batch(batch_of_examples)
        with torch.no_grad():  # gradients are not needed during evaluation
            scores = model(sentences, mask)
        
        num_correct += sum(np.argmax(scores.numpy(), 1) == labels.numpy())
        num_total += len(batch_of_examples)
    return float(num_correct) / (num_total + 1e-32)
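
The manual padding loop in `examples_to_batch` could also be written with `torch.nn.utils.rnn.pad_sequence`, with the mask derived from the review lengths. This is a sketch of that alternative on made-up token ids, not a drop-in replacement for the function above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Made-up token id sequences of different lengths
sentence_list = [torch.tensor([5, 2, 9]), torch.tensor([7]), torch.tensor([3, 3])]

# Pad to the longest sequence; batch_first=True gives shape (batch, length)
batch = pad_sequence(sentence_list, batch_first=True, padding_value=0)

# Build the mask from the lengths: position p is real if p < length
lengths = torch.tensor([len(s) for s in sentence_list])
positions = torch.arange(batch.shape[1])
mask = (positions[None, :] < lengths[:, None]).float()

print(batch)
print(mask)
```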

In [None]:
NUM_EPOCHS = 25

model = ReviewClassifierWithBatching(hidden_size=HIDDEN_SIZE, num_classes=NUM_CLASSES, vocab_size=len(vocab))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

max_accuracy = 0.
best_epoch = 0
start_time = time.time()
for i in range(NUM_EPOCHS):
    epoch_train(train_examples)
    accuracy = evaluate_accuracy(dev_examples)
    print("epoch " + str(i) + " accuracy: " + str(accuracy))
    if accuracy > max_accuracy:
        print("improved!")
        with open("model-epoch" + str(i) + ".pytorch", "wb") as fp:
            torch.save(model.state_dict(), fp)
        best_epoch = i
        max_accuracy = accuracy

total_time = time.time() - start_time
print("total training time: " + str(total_time) + "; " + str(float(total_time) / NUM_EPOCHS) + " per epoch")
print("loading from model at epoch " + str(best_epoch))
with open("model-epoch" + str(best_epoch) + ".pytorch", "rb") as fp:
    model.load_state_dict(torch.load(fp))
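
The checkpointing pattern above saves only the `state_dict` (parameter names mapped to tensors), so loading requires a model of the same architecture to be constructed first. The sketch below shows the round trip on a small stand-in layer, using an in-memory buffer instead of a file so that nothing is written to disk.

```python
import io
import torch
import torch.nn as nn

source = nn.Linear(4, 2)            # stand-in for a trained model

# Save the parameters into an in-memory buffer
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)
buffer.seek(0)

# A freshly initialized model of the same architecture receives the weights
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer))

same = all(torch.equal(p, q)
           for p, q in zip(source.parameters(), restored.parameters()))
print(list(source.state_dict().keys()), same)
```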