<a href="https://colab.research.google.com/github/kingb12/nlp244-section1-test/blob/main/NLP244PytorchRefresher.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [10]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
%ls /content/drive/MyDrive/colab

Mounted at /content/drive
nlp244-section1-test/


In [11]:
import torch
assert torch.cuda.is_available(), "not on a GPU runtime!"
print(torch.__version__)
torch.manual_seed(1234)

1.13.1+cu116


<torch._C.Generator at 0x7f1176cd4e10>

In [12]:
# some imports I use in my solution, not all are needed in yours
import json
from torch import nn, Tensor
from typing import List, Dict, Union, Set
from typing_extensions import TypedDict


### Cloning *from* GitHub

In [13]:
! git clone https://github.com/kingb12/nlp244-section1-test.git
%cd nlp244-section1-test/
! git fetch && git pull

Cloning into 'nlp244-section1-test'...
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (22/22)

In [14]:
%pwd

'/content/nlp244-section1-test/nlp244-section1-test'

### Saving *to* GitHub

File > Save a copy in GitHub! Note that this is a distinct version from the one in your Drive, and it cannot be edited. You need to save your Drive version to GitHub any time you want to preserve changes that you share via GitHub!

### Some useful boiler-plate for exercises

In [6]:
from torch import nn, Tensor

def describe(x: Tensor) -> None:
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values: \n{}".format(x))

describe(torch.randn(size=(3, 4)))

Type: torch.FloatTensor
Shape/size: torch.Size([3, 4])
Values: 
tensor([[ 0.0461,  0.4024, -1.0115,  0.2167],
        [-0.6123,  0.5036,  0.2310,  0.6931],
        [-0.2669,  2.1785,  0.1021, -0.2590]])


## Finally, some exercises:

#### Simple tensor and autograd example:

The following represents a linear equation with scalar variables:

$$
y = 2x + 3
$$

In [7]:
# Create tensors.
x = torch.tensor(1., requires_grad=True)
w = torch.tensor(2., requires_grad=True)
b = torch.tensor(3., requires_grad=True)

# Build a computational graph.
y = w * x + b    # y = 2 * x + 3

**Q:** For $x=1$, how would one compute $\frac{dy}{dx}$? What about $\frac{dy}{dw}$?

In [8]:
# Compute gradients.
y.backward()
# Verify the gradients.
assert x.grad == 2, x.grad
assert w.grad == 1, w.grad
print(x.grad, w.grad)

tensor(2.) tensor(1.)
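As a cross-check on the question above, the same derivatives can be requested with `torch.autograd.grad`, which returns the gradients directly instead of accumulating them into `.grad`. This is a minimal sketch that rebuilds the graph with fresh tensors holding the same values; including `dy/db` is an addition for illustration.

```python
import torch

# Rebuild y = w * x + b with fresh leaf tensors (same values as above).
x = torch.tensor(1.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
y = w * x + b

# torch.autograd.grad returns one gradient per requested input.
dy_dx, dy_dw, dy_db = torch.autograd.grad(y, [x, w, b])

# Analytically: dy/dx = w, dy/dw = x, dy/db = 1.
print(dy_dx, dy_dw, dy_db)
```

Because `y` is linear in each variable, `dy/dx` equals the value of `w` and `dy/dw` equals the value of `x`, which is what the assertions in the cell above check.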


## Building our own `nn.Linear` layer:

As you probably know, a linear layer generalizes the computation above to support input vectors and weight matrices:

$$
y = Wx + b
$$

Let's build our own! Create an `nn.Module` called `MyLinear` which, on a forward pass, takes a `Tensor x` and computes `y` from a learned `weight` matrix and a learned `bias` vector. Any initialization method is ok.

In [9]:
# need to sub-class nn.Module
from torch import nn, Tensor


class MyLinear(nn.Module):

    # need to implement initialization:
    # what parameters do we need to accept? What fields do we use them to define?
    def __init__(self):
        super().__init__()
        pass

    # calculate our output y = Wx + b and return it
    def forward(self, x: Tensor) -> Tensor:
        pass

# verify:
my_linear = MyLinear(4, 5)
real_linear = nn.Linear(4, 5)
# surgery for equivalent weight and bias
my_linear.weight, my_linear.bias = real_linear.weight, real_linear.bias
x = torch.randn(4,)
assert torch.equal(my_linear(x), real_linear(x))

TypeError: ignored
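The skeleton above raises a `TypeError` until `__init__` accepts the two sizes passed in the verification code. One possible solution is sketched below; it is not the only valid one. The parameter names `in_features` and `out_features` and the uniform initialization are assumptions borrowed from the conventions of `nn.Linear`. The check uses `torch.allclose` rather than `torch.equal`, on the assumption that a hand-written matrix product may differ from the built-in layer in the last floating-point bits.

```python
import math

import torch
from torch import nn, Tensor


class MyLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        # Store the weight as (out_features, in_features), like nn.Linear,
        # so parameters can be swapped between the two layers.
        bound = 1.0 / math.sqrt(in_features)
        self.weight = nn.Parameter(
            torch.empty(out_features, in_features).uniform_(-bound, bound)
        )
        self.bias = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))

    def forward(self, x: Tensor) -> Tensor:
        # x has shape (..., in_features); the result has shape (..., out_features).
        return x @ self.weight.T + self.bias


my_linear = MyLinear(4, 5)
real_linear = nn.Linear(4, 5)
# surgery for equivalent weight and bias
my_linear.weight, my_linear.bias = real_linear.weight, real_linear.bias
x = torch.randn(4)
assert torch.allclose(my_linear(x), real_linear(x))
```

Wrapping the tensors in `nn.Parameter` is what registers them with the module, so they show up in `my_linear.parameters()` and receive gradients during training.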

## A longer exercise (borrowed from [Stanford CS 224n](https://colab.research.google.com/drive/13HGy3-uIIy1KD_WFhG4nVrxJC-3nUUkP?usp=sharing))


## Word Window Classification

So far in this notebook, we have learned the fundamentals of PyTorch and built our own linear layer. Now we will attempt to solve an example NLP task. Here are the things we will learn:

1. Data: Creating a Dataset of Batched Tensors
2. Modeling
3. Training
4. Prediction

In this section, our goal is to train a model that finds the words in a sentence that correspond to a `LOCATION`. A location will always have a span of `1` (meaning that `San Francisco` won't be recognized as a `LOCATION`). Our task is called `Word Window Classification` for a reason: instead of letting our model look at only one word in each forward pass, we would like it to consider the context of the word in question. That is, for each word, we want our model to be aware of the surrounding words. Let's dive in!
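To make the idea of a word window concrete, here is a small sketch that builds one window per word from a tokenized sentence. The helper name `word_windows` and the example sentence are hypothetical; the `<pad>` token and the notion of a `window_size` follow the dataset instructions later in this notebook.

```python
from typing import List


def word_windows(tokens: List[str], window_size: int, pad: str = "<pad>") -> List[List[str]]:
    """Return one window of 2 * window_size + 1 tokens centered on each word."""
    # Pad both ends so the first and last words also get a full window.
    padded = [pad] * window_size + tokens + [pad] * window_size
    width = 2 * window_size + 1
    return [padded[i:i + width] for i in range(len(tokens))]


windows = word_windows("we always come to paris".split(), window_size=2)
for window in windows:
    print(window)
```

Each window has the word being classified at its center, so a model can score that word using `window_size` words of context on either side.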

**Q:** What is our output space?

### Part 1: Loading Data as a [Custom Dataset](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class)

Let's say I give you two files, `train_data.json` and `test_data.json`. Each file contains a `List` of dictionaries, each with a `sentence` and a `label`, where `sentence` is a string and `label` is a list of integers labelling each word in the sentence (as determined by whitespace) as being a location or not.

Write a custom dataset class `WWCDataset` for loading and using this data with PyTorch, accepting one of the files as input. For example:

`train_dataset: WWCDataset = WWCDataset("train_data.json")` should load the data as expected by PyTorch. Results from `__getitem__` should be preprocessed using the provided function. **Since this dataset is small, you can load it into memory in its entirety as needed**.

In [None]:
# ================================================================== #
#                Input pipeline for custom dataset                 #
# ================================================================== #

# You should build your custom dataset as below.
import torch


def preprocess_sentence(sentence):
    return sentence.lower().split()


class WWCDataset(torch.utils.data.Dataset):
    def __init__(self):
        # TODO: Load the data from the target file
        # TODO: Construct a Vocabulary: a dictionary of all (processed) words to an integer between 0 and |V|
        # TODO: Add tokens <unk> and <pad> to this vocabulary. Set <pad> to have its value be zero in the dict!
        pass

    def __getitem__(self, index: int):
        # TODO: Select the item from our set for index
        # TODO: preprocess sentence with the function above
        # TODO: add window_size <pad> tokens to the left and right of the sentence
        # TODO: convert our sentence to a tensor of integers using the vocabulary
        # TODO: return a dict {"sentence" : Tensor, "label": Tensor}
        pass

    def __len__(self):
        # TODO: change 0 to the total size of your dataset.
        return 0

if __name__ == '__main__':
    train_dataset: WWCDataset = WWCDataset("data/train_data.json")
    print(train_dataset[11])
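One way the TODOs above could be filled in is sketched below; treat it as an illustration, not the reference solution. The class name `WWCDatasetSketch`, the `window_size` constructor argument, the `long` label dtype, and the two example records are all assumptions made for this sketch. The records are written to a temporary file so the block runs without the course data.

```python
import json
import os
import tempfile
from typing import Dict, List

import torch
from torch import Tensor
from torch.utils.data import Dataset


def preprocess_sentence(sentence: str) -> List[str]:
    return sentence.lower().split()


class WWCDatasetSketch(Dataset):
    def __init__(self, path: str, window_size: int = 2) -> None:
        # The dataset is small, so load the whole file into memory.
        with open(path) as f:
            self.data: List[dict] = json.load(f)
        self.window_size = window_size
        # <pad> must map to 0; <unk> gets 1; corpus words follow in sorted order.
        words = sorted(
            {w for item in self.data for w in preprocess_sentence(item["sentence"])}
        )
        self.vocab: Dict[str, int] = {"<pad>": 0, "<unk>": 1}
        for word in words:
            self.vocab[word] = len(self.vocab)

    def __getitem__(self, index: int) -> Dict[str, Tensor]:
        item = self.data[index]
        pad = ["<pad>"] * self.window_size
        tokens = pad + preprocess_sentence(item["sentence"]) + pad
        # Words missing from the vocabulary fall back to <unk>.
        ids = [self.vocab.get(t, self.vocab["<unk>"]) for t in tokens]
        return {
            "sentence": torch.tensor(ids, dtype=torch.long),
            "label": torch.tensor(item["label"], dtype=torch.long),
        }

    def __len__(self) -> int:
        return len(self.data)


# Hypothetical records in the format described above, written to a temp file.
records = [
    {"sentence": "We always come to Paris", "label": [0, 0, 0, 0, 1]},
    {"sentence": "Kuwait is nice", "label": [1, 0, 0]},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(records, f)
    tmp_path = f.name

sketch_dataset = WWCDatasetSketch(tmp_path, window_size=2)
os.remove(tmp_path)
print(len(sketch_dataset), sketch_dataset[1])
```

Only the sentence is padded here, so each returned `sentence` tensor is `2 * window_size` longer than its `label` tensor.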

In [None]:
# You can test with the following:
train_dataset: WWCDataset = WWCDataset("data/train_data.json")
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                            batch_size=1,
                                            shuffle=True)
for item in train_loader:
    print(item)

## Aside: 

Notice while the above works, it only works for a batch size of 1. Trying a larger batch size fails!

In [None]:
try:
    # You can test with the following:
    train_dataset: WWCDataset = WWCDataset("data/train_data.json")
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                            batch_size=8,
                                            shuffle=True)
    for item in train_loader:
        print(item)
except RuntimeError as e:
    print(type(e), e)

**We need to pad our sequences!** To do this, we'll have to define a **custom collate function**. For now though, we'll skip over this and just get a working prototype with batch sizes of 1.
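For reference, a custom collate function could look roughly like the sketch below. It uses `torch.nn.utils.rnn.pad_sequence` to pad every sequence in a batch to the longest one. The function name `pad_collate`, the extra `length` field, and the two example items are assumptions for illustration; the items mimic the `{"sentence": Tensor, "label": Tensor}` dicts returned by `__getitem__`.

```python
from typing import Dict, List

import torch
from torch import Tensor
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def pad_collate(batch: List[Dict[str, Tensor]]) -> Dict[str, Tensor]:
    """Pad every sentence and label in the batch to the longest one."""
    sentences = pad_sequence(
        [item["sentence"] for item in batch], batch_first=True, padding_value=0
    )
    labels = pad_sequence(
        [item["label"] for item in batch], batch_first=True, padding_value=0
    )
    # Keep the true label lengths so padded positions can be masked later.
    lengths = torch.tensor([len(item["label"]) for item in batch])
    return {"sentence": sentences, "label": labels, "length": lengths}


# Two hypothetical items with different lengths (a list works as a map-style dataset).
items = [
    {"sentence": torch.tensor([0, 0, 5, 4, 6, 0, 0]), "label": torch.tensor([1, 0, 0])},
    {"sentence": torch.tensor([0, 0, 9, 2, 3, 8, 7, 0, 0]), "label": torch.tensor([0, 0, 0, 0, 1])},
]
loader = DataLoader(items, batch_size=2, collate_fn=pad_collate)
batch = next(iter(loader))
print(batch["sentence"].shape, batch["label"].shape, batch["length"])
```

Padding labels with `0` makes padded positions look like non-location words, which is why this sketch also returns the true lengths; a training loop would need them (or a mask) to exclude padding from the loss.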

## Part 2: Training

Now that we have a dataloader that works for batch size 1, we can try to train a model.

Our model will work as follows:
- for each word in a sentence, take the two words to the left and