Name:

Student ID:

In this exercise, you should develop a character-level RNN language model.

You are free to choose the architecture, but you must use GRUs, not LSTMs. One example architecture is a linear embedding layer (output size 64), followed by a 2-layer GRU (hidden size 128, dropout 0.1) and a linear classifier head.

You should generate some example outputs using beam search.

Some parts of the code have been written for you. You need to implement the parts that raise `NotImplementedError`.

Index zero is reserved for the padding token/character. Subtracting one from a token index therefore yields the character's ASCII code (and maps the padding index to `-1`).

The model's classification head should directly predict character codes (256 possibilities; because the text is encoded as ASCII, only the first 128 actually occur in the data). It should not predict any special tokens, such as padding, start, or end.
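The indexing convention described above can be illustrated with a small sketch. The helpers `encode` and `decode` and the constant `PAD_INDEX` are hypothetical names introduced only for this illustration; they are not part of the provided code.

```python
import numpy as np

PAD_INDEX = 0  # assumption: named constant for the reserved padding index

def encode(text):
    """Map a string to token indices: ASCII code + 1 (hypothetical helper)."""
    raw = np.frombuffer(text.encode("ascii", errors="ignore"), dtype=np.int8)
    return raw.astype(np.int32) + 1

def decode(token_ids):
    """Map token indices back to a string, skipping padding (hypothetical helper)."""
    return "".join(chr(int(i) - 1) for i in token_ids if i != PAD_INDEX)

ids = encode("Harry ")
print(ids.tolist())  # [73, 98, 115, 115, 122, 33]

# Padding positions are dropped when decoding.
padded = np.concatenate([ids, [PAD_INDEX, PAD_INDEX]])
print(decode(padded) == "Harry ")  # True
```

Because the classifier predicts character codes rather than token indices, a predicted class `c` corresponds to the token index `c + 1`.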

# Bootstrap

# Install

In [None]:
! pip install -U torch datasets pyperclip icecream numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch
  Downloading torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl (619.9 MB)
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting pyperclip
  Downloading pyperclip-1.8.2.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Collecting icecream
  Downloading icecream-2.1.3-py2.py3-none-any.whl (8.4 kB)
Collecting numpy
  Downloading numpy-1.24.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting nvidia-cudnn-cu11==8.5.0.96
  Download

# Download the Data

In [None]:
!wget https://files.lilf.ir/Black%20Luminary.txt

--2023-03-31 21:40:02--  https://files.lilf.ir/Black%20Luminary.txt
Resolving files.lilf.ir (files.lilf.ir)... 82.102.11.148
Connecting to files.lilf.ir (files.lilf.ir)|82.102.11.148|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3148450 (3.0M) [text/plain]
Saving to: ‘Black Luminary.txt’


2023-03-31 21:40:03 (3.97 MB/s) - ‘Black Luminary.txt’ saved [3148450/3148450]



In [None]:
! ls -lh

total 3.1M
-rw-r--r-- 1 root root 3.1M Oct 14  2021 'Black Luminary.txt'
drwxr-xr-x 1 root root 4.0K Mar 30 13:53  sample_data


In [None]:
! realpath *.txt

/content/Black Luminary.txt


# User Config

In [None]:
data_paths = [
    '/content/Black Luminary.txt',
    ]

## imports

In [None]:
import pyperclip

In [None]:
from icecream import ic

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
device = torch.device("cpu")
#: We will set device again in the training loop.

In [None]:
import datasets as D

In [None]:
import numpy
np = numpy

import statistics

# Utils

In [None]:
class NumpyPrintOptions:
    def __init__(self, **kwargs):
        self.options = kwargs
        self.original_options = np.get_printoptions()

    def __enter__(self):
        np.set_printoptions(**self.options)

    def __exit__(self, exc_type, exc_value, traceback):
        np.set_printoptions(**self.original_options)

class NoTruncationNumpyPrintOptions(NumpyPrintOptions):
    def __init__(self):
        super().__init__(
            threshold=np.inf, 
            linewidth=200, 
            suppress=True, 
            precision=4
        )

In [None]:
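import jax

def torch_shape_get(input):
    def h_shape_get(x):
        return x.dtype, x.shape

    #: jax.tree_map is deprecated in recent JAX releases; jax.tree_util.tree_map is the stable spelling
    return jax.tree_util.tree_map(h_shape_get, input)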

In [None]:
def has_nan(tensor):
    return torch.any(torch.isnan(tensor))

In [None]:
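class ModelEvalMode:
    def __init__(self, model):
        self.model = model

    def __enter__(self):
        #: remember the current mode so that it can be restored on exit
        self.was_training = self.model.training
        self.model.eval()

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.model.train(self.was_training)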

# Data

In [None]:
d = D.load_dataset("text",
                         data_files=data_paths, sample_by="paragraph")
d

Downloading and preparing dataset text/default to /root/.cache/huggingface/datasets/text/default-751f099585d07f63/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-751f099585d07f63/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 18423
    })
})

In [None]:
d = d['train']
d

Dataset({
    features: ['text'],
    num_rows: 18423
})

In [None]:
d[1000:1010]

{'text': ['Professor Snape threw him backwards, and Harry stumbled, but just managed to keep standing.',
  "'I understand, sir. I apologise if my careless words have offended.'",
  "'Get going then!' The man turned around and marched towards his desk.",
  'Not keen on his company for the moment, Harry hastily made his way towards the hall.',
  "Merlin, did he have to grab my arm like that? If being implicated in murder is child's play to him, then I can indeed do without partaking in his 'problems'…",
  '~BLHD~',
  "Harry paused before entering the Great Hall and forced his countenance as hard as he could into a blank expression. He must not let up; he must not relent for one second. Weakness would not help him here. It might also be prudent to distance himself a bit from other people for a while to limit the damage to their reputation and family; depending on how the whole situation turned out, the political fallout could be immense. With a sense of foreboding, he imagined Daphne's re

In [None]:
def str_to_np(s, dtype=np.int8):
    s = s.encode('ascii', errors='ignore')
    return np.frombuffer(s, dtype=dtype)

str_to_np('hello')

array([104, 101, 108, 108, 111], dtype=int8)

In [None]:
def str_to_onehot(s):
    return np.eye(256)[str_to_np(s)]

In [None]:
dc = d.map(lambda batch: {'input': [str_to_np(t).astype(np.int32) + 1 for t in batch['text']]}, batched=True) #: added one to the char indices to make zero available for the pad token
dc

Map:   0%|          | 0/18423 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'input'],
    num_rows: 18423
})

In [None]:
dc = dc.filter(lambda x: (len(x['input']) > 30 and len(x['text'].split()) > 4), batched=False)
dc

Filter:   0%|          | 0/18423 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'input'],
    num_rows: 16371
})

In [None]:
dc.set_format("torch", columns=["input",])

In [None]:
torch_shape_get(dc[1000:1010]['input'])

[(torch.int64, torch.Size([1126])),
 (torch.int64, torch.Size([727])),
 (torch.int64, torch.Size([163])),
 (torch.int64, torch.Size([232])),
 (torch.int64, torch.Size([106])),
 (torch.int64, torch.Size([88])),
 (torch.int64, torch.Size([82])),
 (torch.int64, torch.Size([69])),
 (torch.int64, torch.Size([127])),
 (torch.int64, torch.Size([64]))]

In [None]:
dc = dc.shuffle()

In [None]:
dcs = dc.train_test_split(test_size=0.2)
dcs

DatasetDict({
    train: Dataset({
        features: ['text', 'input'],
        num_rows: 13096
    })
    test: Dataset({
        features: ['text', 'input'],
        num_rows: 3275
    })
})

# Model

- [GRU --- PyTorch 2.0 documentation](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html)

- [torch.nn.utils.rnn.pack_sequence --- PyTorch 2.0 documentation](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_sequence.html#torch.nn.utils.rnn.pack_sequence) (not necessarily needed)

- [Embedding --- PyTorch 2.0 documentation](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)

- [torch.nn.utils.rnn.pad_sequence --- PyTorch 2.0 documentation](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html)


In [None]:
class Model(nn.Module):
    def __init__(self):
        super().__init__()

        raise NotImplementedError()

    
    def forward(self, x, hidden=None):
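        #: x: list of 1-D token-index tensors shaped (seq_length,), with values in [1, 256]
        #: The seq_length will not necessarily be equal for all list items!
        #: Should return logits shaped (B, T, 256) and the final hidden state.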
        ##
        raise NotImplementedError()
        
        return x, hidden
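As a rough illustration of the example architecture from the introduction, the following is a minimal sketch and not a reference solution. The class name `SketchCharGRU` and its constructor arguments are hypothetical, `nn.Embedding` stands in for the "linear embedding layer" (it is equivalent to a linear layer applied to one-hot inputs), and packing the sequences is just one of several ways to handle variable lengths.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

class SketchCharGRU(nn.Module):
    # Hypothetical name and defaults; sizes follow the example architecture.
    def __init__(self, classes_n=256, embed_dim=64, hidden_dim=128, layers_n=2, dropout=0.1):
        super().__init__()
        # Index 0 is padding, so the embedding table needs classes_n + 1 rows.
        self.embed = nn.Embedding(classes_n + 1, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=layers_n,
                          dropout=dropout, batch_first=True)
        # The head predicts character codes only, no special tokens.
        self.head = nn.Linear(hidden_dim, classes_n)

    def forward(self, x, hidden=None):
        # x: list of 1-D long tensors of token indices, possibly of different lengths.
        lengths = torch.tensor([len(seq) for seq in x])  # must live on the CPU
        padded = pad_sequence(x, batch_first=True, padding_value=0)  # (B, T)
        embedded = self.embed(padded)  # (B, T, embed_dim)
        packed = pack_padded_sequence(embedded, lengths, batch_first=True,
                                      enforce_sorted=False)
        out, hidden = self.gru(packed, hidden)
        out, _ = pad_packed_sequence(out, batch_first=True,
                                     total_length=padded.shape[1])  # (B, T, hidden_dim)
        return self.head(out), hidden  # logits: (B, T, classes_n)

model = SketchCharGRU()
logits, hidden = model([torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6])])
print(logits.shape, hidden.shape)
```

Packing keeps the GRU from running over padded positions; simply feeding the padded batch to the GRU and masking the loss afterwards is a simpler alternative.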

In [None]:
def loss_fn(y, y_hat):
    #: y: list of 1-D target-index tensors shaped (seq_length,), with values in [0, classes_n]; 0 marks padding
    #: y_hat: tensor of logits shaped (B, T, classes_n)
    #:
    #: Be sure to skip padding!
    ##

    raise NotImplementedError()

    return average_loss
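One possible way to skip padding is to pad the targets with zero, subtract one so that padding becomes `-1`, and let cross-entropy ignore that value. The sketch below assumes the targets are 1-D index tensors and that `y_hat` has the same time dimension as the longest target; `sketch_loss_fn` is a hypothetical name.

```python
import math

import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

def sketch_loss_fn(y, y_hat):
    # y: list of 1-D long tensors with values in [0, classes_n]; 0 is padding.
    # y_hat: logits shaped (B, T, classes_n).
    padded = pad_sequence(y, batch_first=True, padding_value=0)  # (B, T)
    targets = padded - 1  # character codes; padding becomes -1
    return F.cross_entropy(
        y_hat.reshape(-1, y_hat.shape[-1]),  # (B*T, classes_n)
        targets.reshape(-1),                 # (B*T,)
        ignore_index=-1,                     # padded positions contribute nothing
    )

# With all-zero logits the prediction is uniform, so the loss is log(256).
y = [torch.tensor([2, 3, 0]), torch.tensor([5, 0])]
y_hat = torch.zeros(2, 3, 256)
print(sketch_loss_fn(y, y_hat).item(), math.log(256))
```

The result is the mean over non-padded positions only, so long and short sequences in a batch are weighted per character rather than per sequence.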

In [None]:
def shift_left(tensor_list, pad_value=0):
    shifted_tensors = []
    for tensor in tensor_list:
        raise NotImplementedError()
    
    return shifted_tensors

# Example usage:
input_ids = [torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6, 7, 8])]
print("Input Ids:")
print(input_ids)

target_ids = shift_left(input_ids)
print("Shifted Left (Target Ids):")
print(target_ids)

Input Ids:
[tensor([1, 2, 3, 4]), tensor([5, 6, 7, 8])]
Shifted Left (Target Ids):
[tensor([2, 3, 4, 0]), tensor([6, 7, 8, 0])]
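A minimal sketch that reproduces the expected output above: each tensor drops its first element and gains one padding value at the end, so position `t` of the target holds the token at position `t + 1` of the input. The name `sketch_shift_left` is hypothetical.

```python
import torch

def sketch_shift_left(tensor_list, pad_value=0):
    shifted_tensors = []
    for tensor in tensor_list:
        # new_full keeps the dtype and device of the input tensor.
        pad = tensor.new_full((1,), pad_value)
        shifted_tensors.append(torch.cat((tensor[1:], pad), dim=0))
    return shifted_tensors

print(sketch_shift_left([torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6, 7, 8])]))
# [tensor([2, 3, 4, 0]), tensor([6, 7, 8, 0])]
```

The trailing pad value doubles as the padding index, so the last position of every sequence is skipped by a loss that ignores padding.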


# Beam Search Generation

In [None]:
import torch
import heapq

def tensor_to_string(tensor):
    chars = [chr(c) for c in tensor]
    return ''.join(chars)

def tensor_append_scalar(tensor, scalar):
    scalar_tensor = torch.tensor(scalar).view(1)  # Add a dimension to match the original tensor's dimensions
    scalar_tensor = scalar_tensor.to(device)

    # Append the scalar to the original tensor
    result = torch.cat((tensor, scalar_tensor), dim=0)
    return result


def generate_next_top_k(model, input_sequence, k):
    logits, _ = model([input_sequence])
    logits = logits[0, -1, :]
    # ic(torch_shape_get(logits))

    #: log_softmax is numerically safer than softmax followed by log
    log_probs = torch.log_softmax(logits, dim=-1)
    # ic(torch_shape_get(log_probs))

    top_k_values, top_k_indices = torch.topk(log_probs, k)

    #: add one to map the predicted character code back to a token index
    return [(tensor_append_scalar(input_sequence, idx.item() + 1), log_prob.item()) for idx, log_prob in zip(top_k_indices, top_k_values)]

def beam_search(model, desired_length, starting_string, k=5):
    with ModelEvalMode(model), torch.no_grad():
      input_sequence = torch.tensor(str_to_np(starting_string).astype(np.int32) + 1, dtype=torch.long)
      input_sequence = input_sequence.to(device)
      # ic(torch_shape_get(input_sequence))
      
      log_prob = 0.0

      beam = [(input_sequence, log_prob)]

      while len(beam[0][0]) < desired_length:
          new_beam = []
          for seq, log_prob in beam:
              next_top_k = generate_next_top_k(model, seq, k)
              new_beam.extend([(new_seq, new_log_prob + log_prob) for new_seq, new_log_prob in next_top_k])

          beam = heapq.nlargest(k, new_beam, key=lambda x: x[1])

      return [tensor_to_string(seq - 1) for seq, _ in beam]

In [None]:
a = str_to_np("Harry ").astype(np.int32) + 1
ic(a)
tensor_to_string(a - 1)

ic| a: array([ 73,  98, 115, 115, 122,  33], dtype=int32)


'Harry '

In [None]:
def eval_gen(*args, display=999999, **kwargs):
    generated_texts = beam_search(
        *args, **kwargs,
    )

    for idx, text in enumerate(generated_texts):
        if idx >= display:
            break
        
        print(f"Generated text {idx + 1}: {text}")

# Train

In [None]:
dt = dcs['train']
dt

Dataset({
    features: ['text', 'input'],
    num_rows: 13096
})

In [None]:
#: Training Loop

if torch.cuda.is_available():
	device = 'cuda'
	non_blocking = True
elif True:
	device = 'cpu'
	non_blocking = False
else:
	#: causes NaNs
	device = 'mps' 
	non_blocking = False

i = 0

#: Feel free to edit these hyperparameters or the optimizer
#: You might want to use a learning-rate scheduler, such as
#: [ReduceLROnPlateau --- PyTorch 2.0 documentation](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html)
epochs = 400
batch_size = 4096
learning_rate = 0.01
max_len = 0

m = Model().to(device=device, non_blocking=non_blocking)
m.train()

optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)

counter = 0
for epoch in range(epochs):
	dt = dt.shuffle()
	for i in range(0, len(dt), batch_size):
		batch = dt[i:i+batch_size]
		inputs = batch['input']

		if max_len > 0:
			inputs = list(map(lambda x: x[:max_len] if len(x) > max_len else x, inputs))

		lens = [len(seq) for seq in inputs]
		current_max_len = max(lens)
		mean_len = statistics.mean(lens)
		# ic(current_max_len, mean_len)

		inputs = list(map(lambda x: x.to(device, non_blocking=non_blocking), inputs))
		# ic(torch_shape_get(inputs))

		targets = shift_left(inputs)
		# ic(torch_shape_get(targets))

		total_loss = 0

		raise NotImplementedError()
	
		l = total_loss / mean_len

		if counter % 1 == 0:
			l = l.item()
			print(f"loss: {l:>7f}  [{counter:>5d}, epoch={epoch}]")
		
		counter += 1

	# print(f"loss: {l:>7f}  [{counter-1:>5d}, epoch={epoch} finished!]")
	if epoch % 15 == 0:
		eval_gen(display=3, model=m, desired_length=100, starting_string="Harry ", k=32)
	
None
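The part of the training loop left unimplemented is essentially one optimization step per batch. A rough sketch of such a step follows, under the assumption that the model returns `(logits, hidden)` and the loss function takes `(targets, logits)` as in the templates above. The function name `sketch_train_step` and the gradient-clipping threshold `max_grad_norm` are hypothetical additions, not part of the assignment.

```python
import torch
import torch.nn as nn

def sketch_train_step(model, optimizer, loss_fn, inputs, targets, max_grad_norm=1.0):
    optimizer.zero_grad(set_to_none=True)  # clear gradients from the previous step
    logits, _ = model(inputs)              # forward pass; the hidden state is unused here
    loss = loss_fn(targets, logits)
    loss.backward()                        # backpropagate through time
    # Assumption: clipping is optional, but it often helps RNNs with long sequences.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.detach()
```

The returned value is detached from the graph, so it can be accumulated or logged without keeping the computation graph of every batch in memory.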

In [None]:
eval_gen(display=3, model=m, desired_length=100, starting_string="Harry ", k=32)

In [None]:
eval_gen(display=50, model=m, desired_length=250, starting_string="Harry ", k=100)

In [None]:
eval_gen(display=50, model=m, desired_length=250, starting_string="Arcturus ", k=100)

In [None]:
eval_gen(display=50, model=m, desired_length=150, starting_string="Draco ", k=100)

In [None]:
eval_gen(display=50, model=m, desired_length=150, starting_string="Harry looked at ", k=100)