# <center>Lab 1: Introduction to PyTorch</center>

We introduce the syntax of the [PyTorch](https://pytorch.org/) library for deep learning and optimization. Libraries such as PyTorch enable us to run massive computations on the GPU, which is essential for building large-scale neural models. A PyTorch project usually consists of the following components. While these are not very different from the typical components of a machine learning pipeline, they must be implemented in a specific way. 

- **Tensors**: The PyTorch `Tensor` can be thought of as analogous to the `numpy.ndarray`. We will often have tensors with more than two dimensions.
- **Device**: The *device* refers to either the CPU or a GPU, depending on where models/tensors are stored and where computation happens.
- **Dataset / Dataloader**: We instantiate (and sometimes implement) a `Dataset` class which specifies how to index the training, validation, and test data. This is then used to create a `DataLoader` class which specifies a scheme for loading batches of data, used both in training (e.g. stochastic gradient descent) and validation (e.g. batch-level accuracy). 
- **Architecture / Forward Pass**: We use a `Module` class to specify an *architecture* for our neural network (layers, activations, etc.). The term *forward pass* refers to the sequence of computations that the network runs on an input given its parameters. Crucially, you implement only the forward pass in the `forward` method, and the software uses automatic differentiation to compute the gradient, so you do not have to hand-code it.
- **Loss**: This is the final tensor which results from pushing the output of the network and the true labels through a function, usually called `criterion`. After this step, we call `backward` on the loss tensor in order to compute the gradient via the backpropagation algorithm (this is called the *backward pass*).
- **Optimizer**: The `Optimizer` abstraction allows one to compute steps of iterative algorithms such as stochastic gradient descent. We call `step` on this object after the backward pass, which automatically runs the update step.
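
As a rough orientation, the components above fit together as in the following minimal sketch. The data, the model `TinyNet`, and all hyperparameters are made up for illustration and are not part of the lab material.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Device: use the GPU if one is available, otherwise the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors + Dataset / DataLoader: 64 made-up examples with 5 features each.
X = torch.randn(64, 5)
y = (X.sum(dim=1) > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

# Architecture / forward pass: a single linear layer (hypothetical model).
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 2)

    def forward(self, x):
        return self.linear(x)

model = TinyNet().to(device)
criterion = nn.CrossEntropyLoss()                        # Loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # Optimizer

for X_batch, y_batch in loader:
    X_batch, y_batch = X_batch.to(device), y_batch.to(device)
    optimizer.zero_grad()                      # clear old gradients
    loss = criterion(model(X_batch), y_batch)  # forward pass + loss
    loss.backward()                            # backward pass
    optimizer.step()                           # parameter update
```

Each of these pieces is covered in more detail in the sections that follow.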

<center> Adapted from teaching material created by Alec Greaves-Tunnell and Ronak Mehta, https://pytorch.org/  and https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html </center>


In [1]:
import torch
import numpy as np

# 1 PyTorch Element: Tensors
`Tensors` are specialized data structures and the building blocks of the PyTorch package. The `Tensor` can be thought of as analogous to the `numpy.ndarray`. We will often have tensors with more than two dimensions, and tensors can be stored and operated on using GPUs.
Now, we will show some examples using tensors instead of NumPy arrays.

## 1.1 Creating Tensors

In [2]:
# Create data manually
data = [[1, 2], [3, 4]]

# Using numpy array
data_np = np.array(data)
print(data_np)

# Using tensors
x_data = torch.tensor(data)
print(x_data)

[[1 2]
 [3 4]]
tensor([[1, 2],
        [3, 4]])


In [3]:
# Create data from numpy array
np_array = np.array(data)
x_np = torch.from_numpy(np_array)
print(x_np)

tensor([[1, 2],
        [3, 4]])
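
One detail worth knowing when converting from NumPy: `torch.from_numpy` shares memory with the source array, while `torch.tensor` makes a copy. The short sketch below illustrates the difference; the variable names are chosen only for this example.

```python
import numpy as np
import torch

source = np.array([[1, 2], [3, 4]])
shared = torch.from_numpy(source)  # shares memory with `source`
copied = torch.tensor(source)      # independent copy of the data

source[0, 0] = 100                 # modify the NumPy array in place

print(shared[0, 0].item())  # 100: the change is visible through the shared tensor
print(copied[0, 0].item())  # 1: the copy is unaffected
```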


In [4]:
# Create tensor of all 1's or 0's or random
shape = (2, 3)

zeros_tensor = torch.zeros(shape)
print(f"Zeros Tensor: \n {zeros_tensor}")

ones_tensor = torch.ones(shape)
print(f"Ones Tensor: \n {ones_tensor} \n")

rand_tensor = torch.rand(shape)
print(f"Random Tensor: \n {rand_tensor} \n")

# Create tensor using the shape of another tensor
x_zeros = torch.zeros_like(x_data) # retains the properties of x_data
print(f"Zeros Tensor: \n {x_zeros} \n")

x_ones = torch.ones_like(x_data, dtype=torch.float) # overrides the datatype of x_data
print(f"Ones Tensor: \n {x_ones} \n")

x_rand = torch.rand_like(x_data, dtype=torch.float) # overrides the datatype of x_data
print(f"Random Tensor: \n {x_rand} \n")

Zeros Tensor: 
 tensor([[0., 0., 0.],
        [0., 0., 0.]])
Ones Tensor: 
 tensor([[1., 1., 1.],
        [1., 1., 1.]]) 

Random Tensor: 
 tensor([[0.9916, 0.9160, 0.4153],
        [0.1395, 0.8509, 0.8796]]) 

Zeros Tensor: 
 tensor([[0, 0],
        [0, 0]]) 

Ones Tensor: 
 tensor([[1., 1.],
        [1., 1.]]) 

Random Tensor: 
 tensor([[0.8083, 0.6783],
        [0.2024, 0.4519]]) 



Tensors have attributes similar to numpy arrays such as shape, datatype, and which device they are stored on. 

In [5]:
tensor = torch.rand(3, 4)

print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

# You can change the dtype or device with tensor.to(), e.g. tensor.to(torch.float64) or tensor.to("cuda")

Shape of tensor: torch.Size([3, 4])
Datatype of tensor: torch.float32
Device tensor is stored on: cpu
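
As a small sketch of changing these attributes, `Tensor.to` returns a tensor with the requested dtype or device and leaves the original untouched. The names `t32`, `t64`, and `t_dev` are illustrative only, and the GPU is used only if one is available.

```python
import torch

t32 = torch.rand(3, 4)            # float32 on the CPU by default
t64 = t32.to(torch.float64)       # new tensor with a different dtype

# Pick the GPU if available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_dev = t32.to(device)            # tensor stored on the chosen device

print(t32.dtype, t64.dtype, t_dev.device)
```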


## 1.2 Tensor Operations
Almost all NumPy-style operations can be performed on tensors, and the API closely follows NumPy's. We will cover just a few operations, but a full list can be found [here](https://pytorch.org/docs/stable/torch.html).

In [6]:
# Indexing and Slicing
tensor = torch.ones(4, 4)
print("Full Tensor: ", tensor)
print("Partial Tensor: ", tensor[:,3])

# Replace the second column (index 1) with 0's
tensor[:,1] = 0
print("Edited Tensor:", tensor)

Full Tensor:  tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])
Partial Tensor:  tensor([1., 1., 1., 1.])
Edited Tensor: tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]])


In [7]:
# Joining tensors: this can be done along a specific dimension
join_tensor_dim0 = torch.cat([tensor, tensor, tensor], dim=0)
print("Join on Dim 0:", join_tensor_dim0)

join_tensor_dim1 = torch.cat([tensor, tensor, tensor], dim=1)
print("Join on Dim 1:",join_tensor_dim1)

Join on Dim 0: tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]])
Join on Dim 1: tensor([[1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
        [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
        [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
        [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.]])


You can multiply tensors either element-wise or using matrix multiplication.

In [8]:
# This computes the element-wise product
print(f"tensor.mul(tensor) \n {tensor.mul(tensor)} \n")

# Alternative syntax:
print(f"tensor * tensor \n {tensor * tensor}")

tensor.mul(tensor) 
 tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]]) 

tensor * tensor 
 tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]])


In [9]:
# This computes the matrix multiplication between two tensors
print(f"tensor.matmul(tensor.T) \n {tensor.matmul(tensor.T)} \n")

# Alternative syntax:
print(f"tensor @ tensor.T \n {tensor @ tensor.T}")

tensor.matmul(tensor.T) 
 tensor([[3., 3., 3., 3.],
        [3., 3., 3., 3.],
        [3., 3., 3., 3.],
        [3., 3., 3., 3.]]) 

tensor @ tensor.T 
 tensor([[3., 3., 3., 3.],
        [3., 3., 3., 3.],
        [3., 3., 3., 3.],
        [3., 3., 3., 3.]])
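
Because the example above uses a square tensor, the shape rules are easy to miss. The following sketch, with made-up tensors `a` and `b`, shows that matrix multiplication combines a `(2, 3)` tensor with a `(3, 4)` tensor into a `(2, 4)` result, whereas element-wise multiplication keeps the shape (broadcasting the smaller operand if needed).

```python
import torch

a = torch.ones(2, 3)
b = torch.full((3, 4), 2.0)

# Matrix multiplication: inner dimensions (3 and 3) must match.
product = a @ b
print(product.shape)   # torch.Size([2, 4]); every entry is 1*2 + 1*2 + 1*2 = 6

# Element-wise multiplication: the length-3 vector is broadcast across both rows.
scaled = a * torch.tensor([1.0, 2.0, 3.0])
print(scaled)
```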


Operations that modify a tensor directly are called "in-place" operations and are denoted with the suffix `_`. They change the tensor they are called on.

In [10]:
print("Original Tensor:", tensor, "\n")
print("Add 5: ", tensor.add_(5))
print("Transpose: ", tensor.t_())



Original Tensor: tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]]) 

Add 5:  tensor([[6., 5., 6., 6.],
        [6., 5., 6., 6.],
        [6., 5., 6., 6.],
        [6., 5., 6., 6.]])
Transpose:  tensor([[6., 6., 6., 6.],
        [5., 5., 5., 5.],
        [6., 6., 6., 6.],
        [6., 6., 6., 6.]])
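
To make the contrast explicit, here is a minimal sketch comparing `add` (which returns a new tensor) with `add_` (which modifies its tensor). The variable names are chosen for this example only.

```python
import torch

t = torch.ones(2, 2)

out = t.add(5)          # out-of-place: returns a new tensor, t is unchanged
print(t[0, 0].item())   # 1.0
print(out[0, 0].item()) # 6.0

t.add_(5)               # in-place: t itself is modified
print(t[0, 0].item())   # 6.0
```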


## 2.1 Datasets and Dataloaders

As mentioned, data needs to be in a specific form to use some of the machine learning functionality of PyTorch. To illustrate this, we will use a simple simulated dataset $(x_1, y_1), \ldots, (x_n, y_n)$ with each $y_i \sim \text{Bernoulli}\left(\frac{1}{2}\right)$, and given $y$,
$$
x \sim \begin{cases}
\mathcal{N}(\mu 1_d, \sigma_1 I_d) &\text{ if } y = 1\\
\mathcal{N}(-\mu 1_d, \sigma_0 I_d) &\text{ if } y = 0
\end{cases}. 
$$
To implement a `Dataset` class, you must implement a `__len__` method (which returns the number of examples) and a `__getitem__` method (which returns the example at a particular index). Helper classes such as `TensorDataset` can make it so that you do not have to implement the class yourself.

In [11]:
import torch
from torch.utils.data import Dataset

class SimulatedDataset(Dataset):
    def __init__(self, n, d, mean_scale=0.62, cov_scales=(1.0, 0.5)):
        self.n = n
        # Draw each label from Bernoulli(1/2) and store it as an integer
        self.labels = torch.bernoulli(0.5 * torch.ones(n)).long()
        # distributions[0] generates examples with label 0, distributions[1] with label 1
        distributions = [
            torch.distributions.MultivariateNormal(
                -mean_scale * torch.ones(d),
                covariance_matrix=cov_scales[0] * torch.eye(d),
            ),
            torch.distributions.MultivariateNormal(
                mean_scale * torch.ones(d),
                covariance_matrix=cov_scales[1] * torch.eye(d),
            ),
        ]
        # Sample each example from the distribution that matches its label
        self.examples = []
        for i in range(n):
            self.examples.append(distributions[int(self.labels[i])].sample())

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        return self.examples[index], self.labels[index]

In [12]:
n = 100
d = 10
batch_size = 32
val_size = 0.4

dataset = SimulatedDataset(n, d)
print(dataset[0])

(tensor([-0.3561,  0.2879,  0.1981,  0.1062, -0.0160,  0.3194,  0.4970, -0.1615,
         0.3289,  0.2793]), tensor(1))
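
When the data already lives in tensors, `TensorDataset` provides `__len__` and `__getitem__` without a custom class. The sketch below uses made-up features and labels (not the simulated distribution above) purely to show the interface.

```python
import torch
from torch.utils.data import TensorDataset

torch.manual_seed(0)

# Made-up data: 100 examples with 10 features each, and binary labels.
features = torch.randn(100, 10)
targets = torch.bernoulli(0.5 * torch.ones(100)).long()

tensor_dataset = TensorDataset(features, targets)

print(len(tensor_dataset))   # number of examples
x0, y0 = tensor_dataset[0]   # indexing returns an (example, label) tuple
print(x0.shape, y0)
```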


After we have a `Dataset`, we can create a `DataLoader`, which is an iterable that cycles through batches of data. Two common schemes for sampling batches are the `RandomSampler` and the `SequentialSampler`. Each batch is a tuple whose structure follows the return value of `__getitem__`.

In [13]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, random_split

val_len = int(len(dataset) * val_size)
train_len = len(dataset) - val_len

training_data, val_data = random_split(dataset, [train_len, val_len])

train_dataloader = DataLoader(training_data, sampler=RandomSampler(training_data), batch_size=batch_size)
val_dataloader = DataLoader(val_data, sampler=SequentialSampler(val_data), batch_size=batch_size)

print("Train Batches")
for i, batch in enumerate(train_dataloader):
    X_batch, y_batch = batch
    
    print("Batch %d:" % i)
    print("\t X shape:", X_batch.shape)
    print("\t y shape:", y_batch.shape)


Train Batches
Batch 0:
	 X shape: torch.Size([32, 10])
	 y shape: torch.Size([32])
Batch 1:
	 X shape: torch.Size([28, 10])
	 y shape: torch.Size([28])
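
A common shorthand, sketched here on a made-up dataset of 60 examples, is to pass `shuffle=True` instead of constructing a sampler by hand; the `DataLoader` then chooses a `RandomSampler` itself, and a `SequentialSampler` otherwise. The `drop_last=True` option discards the final, smaller batch.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset

# Made-up dataset with 60 examples, matching the training split size above.
toy_data = TensorDataset(torch.randn(60, 10), torch.zeros(60, dtype=torch.long))

shuffled = DataLoader(toy_data, batch_size=32, shuffle=True)
ordered = DataLoader(toy_data, batch_size=32)
full_only = DataLoader(toy_data, batch_size=32, drop_last=True)

print(type(shuffled.sampler).__name__, type(ordered.sampler).__name__)
batch_sizes = [len(y_batch) for _, y_batch in shuffled]
print(batch_sizes, len(full_only))
```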


# 3 Model Architecture, Loss, and Optimization



### Example: Bag-of-Words Language Binary Classifier (Logistic Regression)

Our model will map a sparse BoW representation to log probabilities over
labels. We will use this to fit a binary (logistic) regression model. 

#### BoW
We assign each word in the vocab an index. For example, say our
entire vocab is two words "hello" and "world", with indices 0 and 1
respectively. The BoW vector for the sentence "hello hello hello hello"
is

$[ 4, 0 ]$

For "hello world world hello", it is

$[ 2, 2 ]$

etc. In general, it is

$[ \text{Count}(\text{hello}), \text{Count}(\text{world})] $
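
The counting rule above can be sketched in a few lines of plain Python. The helper `bow_counts` is a hypothetical name used only for this illustration; the lab's own tensor-based version appears later.

```python
from collections import Counter

vocab = {"hello": 0, "world": 1}

def bow_counts(sentence, vocab):
    """Return word counts ordered by each word's index in the vocab."""
    counts = Counter(sentence.split())
    ordered_words = sorted(vocab, key=vocab.get)
    return [counts[word] for word in ordered_words]

print(bow_counts("hello hello hello hello", vocab))  # [4, 0]
print(bow_counts("hello world world hello", vocab))  # [2, 2]
```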

#### Example Data

In this example we will have a list of sentences in either Spanish or English
that are labeled with the correct language. We denote the BoW vector for each sentence as $x$.
The output of our network is:

$\log \text{Softmax}(Ax + b)$

That is, we pass the input through an affine map and then apply log
softmax.
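
To make the formula concrete, the sketch below computes $\log \text{Softmax}(Ax + b)$ by hand for randomly chosen `A` and `b` (placeholders, not trained parameters) and checks the result against `torch.log_softmax`. Exponentiating the output gives probabilities that sum to one.

```python
import torch

torch.manual_seed(0)

A = torch.randn(2, 26)   # placeholder weights: 2 labels, vocab of 26 words
b = torch.randn(2)       # placeholder bias
x = torch.zeros(26)
x[:6] = 1.0              # BoW vector of a sentence using the first six words once

scores = A @ x + b                                    # affine map
log_probs = scores - torch.logsumexp(scores, dim=0)   # log softmax by hand
reference = torch.log_softmax(scores, dim=0)          # built-in version

print(log_probs, reference)
print(log_probs.exp().sum())   # probabilities sum to 1 (up to rounding)
```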




In [14]:
# The Data
data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]

label_to_ix = {"SPANISH": 0, "ENGLISH": 1}

# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of words vector
word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

VOCAB_SIZE = len(word_to_ix)
print("Vocab Size: ", VOCAB_SIZE)
NUM_LABELS = 2

{'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}
Vocab Size:  26


In [15]:
# Transform sentence into BoW
def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)


def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])

# Example
print("Sample Input:", data[0][0])
print(make_bow_vector(data[0][0], word_to_ix))
print(make_bow_vector(data[0][0], word_to_ix).size())

print("")

print("Sample Output:", data[0][1])
print(make_target(data[0][1], label_to_ix))
print(make_target(data[0][1], label_to_ix).size())

Sample Input: ['me', 'gusta', 'comer', 'en', 'la', 'cafeteria']
tensor([[1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0.]])
torch.Size([1, 26])

Sample Output: SPANISH
tensor([0])
torch.Size([1])


Now that we have the data, we can create the logistic regression model. In PyTorch, we generally subclass
`torch.nn.Module` to build neural networks. The `torch.nn` package provides many layer classes
to help build complex neural networks (`Linear`, `Conv1d`, `Transformer`, etc.). We will be using
`nn.Linear`, but documentation on the other choices can be found [here](https://pytorch.org/docs/stable/nn.html#linear-layers).

A `torch.nn.Module` subclass typically defines an `__init__` method and a `forward` method. Also, for each layer of your neural network you will need to provide the input and output sizes. More information on `torch.nn.Module` can be found [here](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module).


In [33]:
import torch.nn as nn

class BinaryClassifier(nn.Module):  # using nn.Module class!

    def __init__(self, num_labels, vocab_size):
        # Calls the init function of nn.Module.  
        # It allows you to call the methods from inside the nn.Module parent class.
        super(BinaryClassifier, self).__init__()

        # Define the parameters that you will need.  In this case, we need A and b,
        # the parameters of the linear mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        # Why is the input dimension vocab_size and the output dimension num_labels?
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax (since we are doing logistic regression).
        # Many non-linearities and other functions are in torch.nn.functional
        return nn.functional.log_softmax(self.linear(bow_vec), dim=1)

model = BinaryClassifier(NUM_LABELS, VOCAB_SIZE)

# The model knows its parameters since we used nn.Linear().  The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function of a module, as was done with the line
# self.linear = nn.Linear(...), your module (in this case, BinaryClassifier) will store knowledge of the nn.Linear's parameters.
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[ 0.1655,  0.1671,  0.1390,  0.1662, -0.1932, -0.0908,  0.0526, -0.0791,
          0.1503,  0.1018, -0.1878,  0.0487,  0.0638,  0.0197, -0.0379,  0.1740,
          0.0890, -0.1647, -0.1136,  0.1922,  0.0229,  0.0801, -0.0531, -0.1287,
          0.1897, -0.1934],
        [ 0.0682, -0.0177,  0.1369, -0.1165, -0.1008,  0.1811, -0.1784, -0.0171,
          0.1763, -0.0209, -0.1126, -0.1333,  0.1108,  0.1652, -0.0051, -0.0956,
          0.1906,  0.1741, -0.0285,  0.1515, -0.1415,  0.0501,  0.1052,  0.1945,
          0.0713, -0.1223]], requires_grad=True)
Parameter containing:
tensor([-0.0286, -0.0707], requires_grad=True)
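
If you want to see which parameter is which, `named_parameters()` returns names alongside the tensors. The sketch below uses a stand-in class, `TwoLabelClassifier`, that mirrors the structure of the classifier above so that the example runs on its own.

```python
import torch.nn as nn

class TwoLabelClassifier(nn.Module):  # hypothetical stand-in for the classifier above
    def __init__(self, num_labels, vocab_size):
        super().__init__()
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, bow_vec):
        return nn.functional.log_softmax(self.linear(bow_vec), dim=1)

demo_model = TwoLabelClassifier(num_labels=2, vocab_size=26)

# Map each parameter name to its shape: the weight is A, the bias is b.
shapes = {name: tuple(p.shape) for name, p in demo_model.named_parameters()}
print(shapes)
```

The weight has shape `(num_labels, vocab_size)`, so each column holds the two per-label weights of one vocabulary word.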


In [34]:
# To run the model, pass in a BoW vector
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    sample = data[0]
    bow_vector = make_bow_vector(sample[0], word_to_ix)
    log_probs = model(bow_vector)
    print("Sample:", data[0])
    print("Output:", log_probs)

Sample: (['me', 'gusta', 'comer', 'en', 'la', 'cafeteria'], 'SPANISH')
Output: tensor([[-0.5782, -0.8230]])
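
Log probabilities can be turned back into probabilities with `exp`, and the predicted label is the index of the largest entry. The sketch below reuses the numbers printed above; the dictionary `ix_to_label` is introduced here for illustration and is simply the inverse of `label_to_ix`.

```python
import torch

log_probs_example = torch.tensor([[-0.5782, -0.8230]])  # copied from the output above
ix_to_label = {0: "SPANISH", 1: "ENGLISH"}               # inverse of label_to_ix

probs = log_probs_example.exp()                  # probabilities, roughly summing to 1
predicted_ix = log_probs_example.argmax(dim=1).item()
predicted_label = ix_to_label[predicted_ix]

print(probs, predicted_label)
```

Since the model has not been trained yet, this prediction reflects only the random initialization.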


So let's train! To do this, we pass instances through the model to get log
probabilities, compute a loss function, compute the gradient of the loss
function, and then update the parameters with a gradient step.
Loss functions are provided by PyTorch in the `nn` package. We will use `nn.NLLLoss()`, which is the
negative log likelihood loss. Other loss functions can be found [here](https://pytorch.org/docs/stable/nn.html#loss-functions).

PyTorch also provides optimization
algorithms in `torch.optim`. Here, we will just use SGD. Other optimizers can be found [here](https://pytorch.org/docs/stable/optim.html).
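
For intuition, a step of plain SGD (no momentum or weight decay) subtracts the learning rate times the gradient from each parameter. The sketch below, using a made-up parameter `w` and a toy loss, checks that `optimizer.step()` matches this manual update.

```python
import torch

torch.manual_seed(0)

lr = 0.1
w = torch.randn(3, requires_grad=True)   # made-up parameter vector
sgd = torch.optim.SGD([w], lr=lr)

toy_loss = (w ** 2).sum()
toy_loss.backward()                      # w.grad now holds 2 * w

# Manual gradient step, computed before the optimizer changes w.
expected = (w - lr * w.grad).detach().clone()

sgd.step()                               # updates w in place
print(torch.allclose(w, expected))
```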

Note that the *input* to NLLLoss is a vector of log probabilities, and a
target label. It doesn't compute the log probabilities for us. This is
why the last layer of our network is log softmax. The loss function
`nn.CrossEntropyLoss()` is the same as `NLLLoss()`, except it does the log
softmax for you.
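
This relationship can be checked numerically. In the sketch below, the scores and targets are random placeholders; applying `NLLLoss` to log softmax outputs gives the same value as applying `CrossEntropyLoss` to the raw scores.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

raw_scores = torch.randn(4, 2)               # placeholder scores for 4 examples, 2 labels
true_labels = torch.tensor([0, 1, 1, 0])     # placeholder targets

nll = nn.NLLLoss()(nn.functional.log_softmax(raw_scores, dim=1), true_labels)
ce = nn.CrossEntropyLoss()(raw_scores, true_labels)

print(nll.item(), ce.item())
```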




In [35]:
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

In [36]:
# Run on test data before we train, just to see a before-and-after
print("Original Test Output")
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Print the matrix column corresponding to "creo"
print("Weights for 'creo':", next(model.parameters())[:, word_to_ix["creo"]])

Original Test Output
tensor([[-0.7215, -0.6656]])
tensor([[-0.7263, -0.6610]])
Weights for 'creo': tensor([-0.1878, -0.1126], grad_fn=<SelectBackward0>)


Now we train the model. 

In [37]:

# Usually you want to pass over the training data several times.
# 100 epochs is far more than you would use on a real dataset, but real datasets have more than
# four instances.  Usually, somewhere between 5 and 30 epochs is reasonable.
for epoch in range(100):
    for instance, label in data:
        # Step 1. PyTorch accumulates gradients.
        # We need to clear them out before each instance - we will discuss this more later
        model.zero_grad()

        # Step 2. Make our BOW vector and also we must wrap the target in a
        # Tensor as an integer. For example, if the target is SPANISH, then
        # we wrap the integer 0. The loss function then knows that the 0th
        # element of the log probabilities is the log probability
        # corresponding to SPANISH
        bow_vec = make_bow_vector(instance, word_to_ix)
        target = make_target(label, label_to_ix)

        # Step 3. Run our forward pass.
        log_probs = model(bow_vec)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(log_probs, target)
        loss.backward() # computes the gradient of the loss
        optimizer.step() # updates parameters using the gradient of the loss
print("True Test Values: ", test_data)
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# The weight for Spanish goes up, the weight for English goes down!
print("Weights for 'creo':", next(model.parameters())[:, word_to_ix["creo"]])

True Test Values:  [(['Yo', 'creo', 'que', 'si'], 'SPANISH'), (['it', 'is', 'lost', 'on', 'me'], 'ENGLISH')]
tensor([[-0.1734, -1.8377]])
tensor([[-2.8461, -0.0598]])
Weights for 'creo': tensor([ 0.2199, -0.5202], grad_fn=<SelectBackward0>)


We got the right answer! You can see that the log probability for
Spanish is much higher in the first example, and the log probability for
English is much higher in the second for the test data, as it should be.
Also, we see that the weight for the word "creo" is now much larger for Spanish
than for English.
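
The training loop calls `model.zero_grad()` because PyTorch adds new gradients to any that are already stored. A minimal sketch with a single made-up parameter `w` and the toy loss $w^2$ shows the accumulation and the effect of resetting.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w ** 2).backward()
first = w.grad.item()         # d(w^2)/dw = 2w = 4.0

(w ** 2).backward()           # no reset: the new gradient is added to the old one
accumulated = w.grad.item()   # 4.0 + 4.0 = 8.0

w.grad.zero_()                # clear the stored gradient
(w ** 2).backward()
after_reset = w.grad.item()   # back to 4.0

print(first, accumulated, after_reset)
```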

Now you see how to make a PyTorch component, pass some data through it
and do gradient updates. We are ready to dig deeper into what PyTorch
has to offer.


