## Wavenet(ish)

we successively combine layer outputs, beginning with the raw character input, to form larger and larger contexts

consider a list:

[1,2,3,4,5,6,7,8]

our first layer combines

(1,2), (3,4), (5,6), (7,8)

our second

((1,2), (3,4)), ((5,6), (7,8))

and our output layer reduces the whole combined list to a single output
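
The grouping above can be sketched in plain Python. This only illustrates the order in which neighbours are paired, not the network itself; `pair_up` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
def pair_up(items):
    """Group neighbouring items into tuples of two (length must be even)."""
    if len(items) % 2 != 0:
        raise ValueError("need an even number of items to pair")
    return [(items[i], items[i + 1]) for i in range(0, len(items), 2)]


level = [1, 2, 3, 4, 5, 6, 7, 8]
levels = []
# keep pairing until a single nested item is left
while len(level) > 1:
    level = pair_up(level)
    levels.append(level)

for depth, grouped in enumerate(levels, start=1):
    print(depth, grouped)
```

Eight inputs need three rounds of pairing to reach one item, which is why the models below stack three grouping layers for a context of 8.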

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [2]:
%run ../lib/bookreader.py

### Open the book

and create an example 3D embedding for the characters

In [3]:
alice = BookReader()
alice.read("../resources/alice.txt")

vocab_size = len(alice.itos)
embedding_size = 3
context_length = 8
ex_em = torch.randn(vocab_size, embedding_size)

lowercase only


### Take some text and embed it

In [4]:
encoded = [alice.stoi[c] for c in alice.train[:40]]
print(torch.tensor(encoded).view(5, context_length))
embedded_text = ex_em[encoded].view(5, context_length, embedding_size)
print("first sample:", embedded_text[0])

tensor([[ 6, 11,  4, 19, 23,  8, 21,  1],
        [12,  1,  0,  7, 18, 26, 17,  1],
        [23, 11,  8,  1, 21,  4,  5,  5],
        [12, 23,  1, 11, 18, 15,  8,  0],
        [ 0,  0,  4, 15, 12,  6,  8,  1]])
first sample: tensor([[-2.3582,  0.4293, -0.2912],
        [ 0.6305, -2.3464, -0.2017],
        [ 0.1161,  1.7447, -1.7238],
        [-0.4493,  0.3968, -0.8936],
        [ 0.7610, -1.5595,  2.1578],
        [ 0.5193, -2.1500,  0.4494],
        [-1.6135,  0.1362,  0.6421],
        [-0.7695, -0.3563,  0.4386]])


### Split into character pairs

to do this we can just view the tensor differently, so that the 8 characters of our first sample are grouped in 2s (each character of dimension 3)

In [5]:
pairs = embedded_text.view(5, 4, 2, embedding_size)
pairs[0]

tensor([[[-2.3582,  0.4293, -0.2912],
         [ 0.6305, -2.3464, -0.2017]],

        [[ 0.1161,  1.7447, -1.7238],
         [-0.4493,  0.3968, -0.8936]],

        [[ 0.7610, -1.5595,  2.1578],
         [ 0.5193, -2.1500,  0.4494]],

        [[-1.6135,  0.1362,  0.6421],
         [-0.7695, -0.3563,  0.4386]]])

### Create a layer

We'll return to building our own layers to demonstrate what's happening inside

In [6]:
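class ViewDilaltion(nn.Module):
    # groups every n consecutive time steps into one, concatenating their channels

    def __init__(self, n):
        super().__init__()
        self.n = n
        self.type = 'flatten'

    def __call__(self, x):
        B, T, C = x.shape
        # (B, T, C) -> (B, T // n, n * C); T must be divisible by n
        x = x.view(B, -1, self.n * C)
        
        self.out = x
        return self.out

    def parameters(self):
        # nothing to learn, this layer only reshapes
        return []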

### Not quite the same

we want three dimensions: the batch, the number of combined tokens, and the features of each combined token

our 3rd dimension now holds both input characters side by side instead of keeping them separate

#### The dilation needs a linear layer

just changing the view doesn't do much

what we want is this view followed by a linear layer; the linear layer makes the paired characters talk *only* to each other

if our linear layer had a single neuron then the two input characters (in their embedded form) would be represented by a single new *fused* value

In [7]:
fl = ViewDilaltion(2)

out = fl(embedded_text)

print(out.shape, out[0])

ll = torch.randn(6, 1)

ll_out = out @ ll

ll_out.shape, ll_out

torch.Size([5, 4, 6]) tensor([[-2.3582,  0.4293, -0.2912,  0.6305, -2.3464, -0.2017],
        [ 0.1161,  1.7447, -1.7238, -0.4493,  0.3968, -0.8936],
        [ 0.7610, -1.5595,  2.1578,  0.5193, -2.1500,  0.4494],
        [-1.6135,  0.1362,  0.6421, -0.7695, -0.3563,  0.4386]])


(torch.Size([5, 4, 1]),
 tensor([[[-0.9607],
          [-2.2880],
          [ 2.1361],
          [-1.1945]],
 
         [[-0.5021],
          [ 0.5503],
          [-2.7257],
          [ 0.2443]],
 
         [[ 1.3186],
          [ 1.7463],
          [-3.8608],
          [ 0.3361]],
 
         [[ 3.4059],
          [ 0.2266],
          [ 1.9840],
          [ 4.7713]],
 
         [[ 4.4159],
          [ 2.0203],
          [-2.9771],
          [ 1.7463]]]))
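
A small check of the "talk *only* to each other" claim, as a sketch under assumptions: random data stands in for `embedded_text`, and the names `x`, `w` and `moved` are introduced here for illustration. We nudge one character and see which fused outputs react.

```python
import torch

torch.manual_seed(0)

B, T, C = 5, 8, 3                      # batch, characters, embedding size
x = torch.randn(B, T, C)               # stand-in for embedded_text
w = torch.randn(2 * C, 1)              # one "neuron" over a flattened pair

fused = x.view(B, T // 2, 2 * C) @ w   # shape (B, 4, 1)

# Change only the first character of the first sample.
x_changed = x.clone()
x_changed[0, 0] += 1.0
fused_changed = x_changed.view(B, T // 2, 2 * C) @ w

# Which of sample 0's four fused outputs moved?
moved = ~torch.isclose(fused[0], fused_changed[0]).squeeze(-1)
print(moved)
```

Only the first entry should be `True`: the other three pairs never see the changed character, and neither do the other samples in the batch.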

## Create a model

we'll create a sequential model to successively combine our characters into an output

we pull in our old homemade layers to build it

In [8]:
%run ../lib/nn_layers.py

In [9]:
n_hidden = 40

In [10]:
model = Sequential([
    Embedding(vocab_size, embedding_size),
    ViewDilaltion(2),
    Linear(embedding_size * 2, n_hidden, bias=False), 
    Tanh(),
    ViewDilaltion(2),
    Linear(n_hidden * 2, n_hidden, bias=False), 
    Tanh(),
    ViewDilaltion(2),
    Linear(n_hidden * 2, n_hidden, bias=False), 
    Tanh(),
    Linear(n_hidden, vocab_size, bias=True),
])

## Run the model

if we run our first basic sample through the network we get what we want

i.e. for each sample in the batch we get a single vector of logits over our vocabulary

so this looks like our [sequence](../sequences/sequence.ipynb) network from before

* ...e  -> m
* ..em  -> m
* .emm  -> a
* emma  -> .
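
The windows in the list above can be produced with a few lines of plain Python. This is a sketch: `context_windows` is a hypothetical helper, and the context length of 4 is chosen only to match the `emma` example (the notebook itself uses 8). It also emits the very first window, `....  -> e`, which the list leaves out.

```python
def context_windows(name, context_length=4, pad="."):
    """Return every (context, next_character) pair for one padded name."""
    padded = pad * context_length + name + pad
    return [
        (padded[i:i + context_length], padded[i + context_length])
        for i in range(len(padded) - context_length)
    ]


for context, target in context_windows("emma"):
    print(context, "->", target)
```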

In [11]:
model_out = model(torch.tensor(encoded).view(5, 8))
model_out.shape, model_out[0]

(torch.Size([5, 1, 30]),
 tensor([[ 0.5829, -5.1795, -2.3836, -0.2851, -0.8202, -4.9881, -3.6234,  2.9232,
           1.5306, -0.9459,  2.2703,  2.0584, -6.6450, -7.0613, -2.0522, -6.1872,
           3.5458,  1.3866, 12.6264,  4.4885, -0.7386, -3.1797, -7.6663,  3.9959,
          -6.1365,  6.0381, -1.2655, 14.0181,  0.4901, -6.0630]]))
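
To see why the output has exactly one time step, here is a sketch that follows the `(batch, time, channels)` shape through the stack without running any tensors. `trace_shapes` is a hypothetical helper, and the vocabulary size of 30 is read off the output shape above rather than from the book reader.

```python
def trace_shapes(batch, context_length, embedding_size, hidden_size,
                 vocab_size, n_blocks=3):
    """Follow (batch, time, channels) through embed -> [group, linear] * n -> logits."""
    shapes = [("embed", (batch, context_length, embedding_size))]
    time, channels = context_length, embedding_size
    for block in range(n_blocks):
        if time % 2 != 0:
            raise ValueError(f"cannot pair up {time} time steps")
        # grouping by 2 halves the time steps and doubles the channels
        time, channels = time // 2, channels * 2
        shapes.append((f"group_{block}", (batch, time, channels)))
        # the linear layer (followed by tanh) maps each fused pair to hidden_size
        channels = hidden_size
        shapes.append((f"linear_{block}", (batch, time, channels)))
    shapes.append(("logits", (batch, time, vocab_size)))
    return shapes


for name, shape in trace_shapes(5, 8, 3, 40, 30):
    print(f"{name:10s} {shape}")
```

The time dimension goes 8, 4, 2, 1, so the final shape `(5, 1, 30)` agrees with the `torch.Size([5, 1, 30])` printed above.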

## So is it the same?

Or at least does it work in a similar way.

Let's go right back to [sequence](../sequences/sequence.ipynb) and see if we get a similar result

In [12]:
context_length = 8

In [13]:
from random import randrange, randint

import string
letters = [l for l in string.ascii_lowercase]

itos = {0: "."}
stoi = {".": 0}

for i, l in enumerate(letters):
    offset = i+1
    stoi[l] = offset
    itos[offset] = l

with open("../resources/names.txt", "r") as r:
    encoded_names = [[stoi[c] for c in f] for f in r.read().split()]

names_length = len(encoded_names)

def sample(size=5, context_length=8):
    prepend = [0 for _ in range(context_length)]
    xs = []
    ys = []
    for i in range(size):
        ni = randrange(names_length)  # randrange already excludes the upper bound
        name = prepend + encoded_names[ni] + [0] 
        offset = randint(0, len(name)-context_length-1)
        xs.append(name[offset:offset+context_length])
        ys.append(name[offset+context_length])
    return xs, ys

In [14]:
print(sample())

([[0, 0, 0, 0, 0, 0, 1, 12], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 18, 1], [0, 0, 1, 1, 12, 1, 14, 1], [0, 0, 0, 0, 0, 0, 0, 0]], [9, 8, 25, 0, 11])


## Compare
let's do roughly the same amount of learning as [sequence](../sequences/sequence.ipynb)

In [15]:
epochs = 120
batch_size = 400
learning_rate = .2
samples = 1000

In [16]:
model = Sequential([
    Embedding(vocab_size, embedding_size),
    ViewDilaltion(2),
    Linear(embedding_size * 2, n_hidden, bias=False), 
    Tanh(),
    ViewDilaltion(2),
    Linear(n_hidden * 2, n_hidden, bias=False), 
    Tanh(),
    ViewDilaltion(2),
    Linear(n_hidden * 2, n_hidden, bias=False), 
    Tanh(),
    Linear(n_hidden, vocab_size, bias=True),
])

for p in model.parameters():
    p.requires_grad = True

In [17]:
for ep in range(epochs):
    epoch_loss = 0
    for s in range(samples):
        x, y = sample(batch_size)
        Y = torch.tensor(y)
        X = torch.tensor(x)

        logits = model(X)

        loss = F.cross_entropy(logits.view(-1, vocab_size), Y) # loss function
        
        with torch.no_grad():
            epoch_loss += loss

        # again stuff on parameters should probably be in model?
        for p in model.parameters():
            p.grad = None
        loss.backward()

        for p in model.parameters():
            p.data -= learning_rate * p.grad

    #just keep any epoch stuff in a no grad block
    with torch.no_grad():
        if ep % 10 == 0:
            print(epoch_loss/samples)
            learning_rate *= .92
            print(ep, learning_rate)

print(epoch_loss/samples)

tensor(2.9918)
0 0.18400000000000002
tensor(2.4656)
10 0.16928000000000004
tensor(2.4518)
20 0.15573760000000003
tensor(2.4285)
30 0.14327859200000004
tensor(2.4033)
40 0.13181630464000005
tensor(2.3909)
50 0.12127100026880006
tensor(2.3769)
60 0.11156932024729606
tensor(2.3678)
70 0.10264377462751238
tensor(2.3581)
80 0.0944322726573114
tensor(2.3525)
90 0.08687769084472649
tensor(2.3445)
100 0.07992747557714837
tensor(2.3361)
110 0.0735332775309765
tensor(2.3289)


### Not so promising

now, we haven't initialized our weights carefully or done any optimization, but it's worse than our basic sequence model

one thing we have now is a deeper network

if gradients explode or go to zero in one layer this will affect the surrounding layers

so here's where we probably want to start looking at normalization techniques for our network

#### start slowly

let's use torch to create our model and apply the basic initialization techniques from before

In [18]:
from collections import OrderedDict

class Wavenetish(nn.Module):

    def __init__(self, vocab_size, embedding_size, context_length, hidden_size):
        super().__init__()

        layers = OrderedDict([
            ('embed', nn.Embedding(vocab_size, embedding_size)),
            ('flatten_a', ViewDilaltion(2)),
            ('feed_forward_a', nn.Linear(embedding_size*2, hidden_size, bias=True)),
            ('non_linearity_a', nn.Tanh()),
            ('flatten_b', ViewDilaltion(2)),
            ('feed_forward_b', nn.Linear(hidden_size*2, hidden_size, bias=True)),
            ('non_linearity_b', nn.Tanh()),
            ('flatten_c', ViewDilaltion(2)),
            ('feed_forward_c', nn.Linear(hidden_size*2, hidden_size, bias=True)),
            ('non_linearity_c', nn.Tanh()),
            ('logits', nn.Linear(hidden_size, vocab_size, bias=True)),
        ])

        nn.init.xavier_uniform_(layers['feed_forward_a'].weight, gain=5/3)
        nn.init.zeros_(layers['feed_forward_a'].bias)

        nn.init.xavier_uniform_(layers['feed_forward_b'].weight, gain=5/3)
        nn.init.zeros_(layers['feed_forward_b'].bias)

        nn.init.xavier_uniform_(layers['feed_forward_c'].weight, gain=5/3)
        nn.init.zeros_(layers['feed_forward_c'].bias)

        nn.init.zeros_(layers['logits'].bias)

        self.model = nn.Sequential(layers)

    def forward(self, idx, targets=None):
        logits = self.model(idx)

        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.view(-1, vocab_size), targets)

        return logits, loss

In [19]:
epochs = 120
batch_size = 800
learning_rate = .2
samples = 1000

model = Wavenetish(vocab_size, embedding_size, context_length, n_hidden)

optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)

In [20]:
for ep in range(epochs):
    epoch_loss = 0
    for s in range(samples):
        x, y = sample(batch_size)
        Y = torch.tensor(y)
        X = torch.tensor(x)

        logits, loss = model.forward(X, Y)
        
        epoch_loss += loss.detach()
        model.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    #just keep any epoch stuff in a no grad block
    with torch.no_grad():
        if ep % 10 == 0:
            print(epoch_loss/samples)
            # note: this only rescales the local variable; the optimizer keeps the lr it was built with
            learning_rate *= .92
            print(ep, learning_rate)

print(epoch_loss/samples)

tensor(2.2917)
0 0.18400000000000002
tensor(2.1066)
10 0.16928000000000004
tensor(2.0942)
20 0.15573760000000003
tensor(2.0882)
30 0.14327859200000004
tensor(2.0847)
40 0.13181630464000005
tensor(2.0821)
50 0.12127100026880006
tensor(2.0809)
60 0.11156932024729606
tensor(2.0795)
70 0.10264377462751238
tensor(2.0800)
80 0.0944322726573114
tensor(2.0759)
90 0.08687769084472649
tensor(2.0756)
100 0.07992747557714837
tensor(2.0777)
110 0.0735332775309765
tensor(2.0753)
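
One caveat about the loop above: `learning_rate *= .92` rebinds a Python variable, while `optim.SGD` keeps the rate it was constructed with, so the printed values do not reach the optimizer. A minimal sketch of one way to pass the decay on, assuming `torch.optim.lr_scheduler.StepLR` is an acceptable stand-in; `toy` is a placeholder model, not the notebook's, and this schedule first decays after 10 epochs rather than at epoch 0.

```python
import torch
import torch.nn as nn
import torch.optim as optim

toy = nn.Linear(4, 2)                       # placeholder model for illustration
optimizer = optim.SGD(toy.parameters(), lr=0.2, momentum=0.9)
# multiply the optimizer's lr by 0.92 every 10 scheduler steps (epochs here)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.92)

lrs = []
for ep in range(30):
    # stand-in for one epoch of training steps
    loss = toy(torch.randn(8, 4)).pow(2).mean()
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    scheduler.step()                        # once per epoch, after the optimizer steps
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs[0], lrs[9], lrs[19], lrs[29])
```

Reading the rate back from `optimizer.param_groups` is the reliable way to confirm what the optimizer is actually using.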


### Better already

some initialization again seems to do a lot of the necessary work - our large batch does some of the normalization for us

still it's worth looking at using a Normalization technique (we'll use LayerNorm straight away because BatchNorm is sloooow)

note that we remove the bias from the linear layers, as LayerNorm has its own learnable bias term that should do the same job
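
As a sketch of what `nn.LayerNorm` does to each position (the random `x` and the names `manual`, `mean` and `var` are illustrative only): it normalizes over the feature dimension of every `(batch, time)` position independently of the rest of the batch, then applies its own learnable weight and bias.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 40
# (batch, time, features), deliberately off-centre and widely spread
x = torch.randn(5, 4, hidden_size) * 3 + 2

layer_norm = nn.LayerNorm(hidden_size)   # weight starts at ones, bias at zeros

# The same computation by hand: statistics over the last dimension only.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)
manual = manual * layer_norm.weight + layer_norm.bias

print(torch.allclose(layer_norm(x), manual, atol=1e-5))
print(layer_norm.bias.shape)
```

Because the statistics come from a single position, the result does not depend on the batch size, which is one reason a smaller batch becomes workable.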

In [21]:
class NormalWavenetish(nn.Module):

    def __init__(self, vocab_size, embedding_size, context_length, hidden_size):
        super().__init__()

        layers = OrderedDict([
            ('embed', nn.Embedding(vocab_size, embedding_size)),
            ('flatten_a', ViewDilaltion(2)),
            ('feed_forward_a', nn.Linear(embedding_size*2, hidden_size, bias=False)),
            ('layer_norm_a', nn.LayerNorm(hidden_size)),
            ('non_linearity_a', nn.Tanh()),
            ('flatten_b', ViewDilaltion(2)),
            ('feed_forward_b', nn.Linear(hidden_size*2, hidden_size, bias=False)),
            ('layer_norm_b', nn.LayerNorm(hidden_size)),
            ('non_linearity_b', nn.Tanh()),
            ('flatten_c', ViewDilaltion(2)),
            ('feed_forward_c', nn.Linear(hidden_size*2, hidden_size, bias=False)),
            ('layer_norm_c', nn.LayerNorm(hidden_size)),
            ('non_linearity_c', nn.Tanh()),
            ('logits', nn.Linear(hidden_size, vocab_size, bias=True)),
        ])

        nn.init.xavier_uniform_(layers['feed_forward_a'].weight, gain=5/3)
        nn.init.xavier_uniform_(layers['feed_forward_b'].weight, gain=5/3)
        nn.init.xavier_uniform_(layers['feed_forward_c'].weight, gain=5/3)

        nn.init.zeros_(layers['logits'].bias)

        self.model = nn.Sequential(layers)

    def forward(self, idx, targets=None):
        logits = self.model(idx)

        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.view(-1, vocab_size), targets)

        return logits, loss

In [22]:
epochs = 120
batch_size = 100
learning_rate = .1
samples = 4000

model = NormalWavenetish(vocab_size, embedding_size, context_length, n_hidden)

optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)

sum(p.nelement() for p in model.parameters())

8200

In [23]:
for ep in range(epochs):
    epoch_loss = 0
    for s in range(samples):
        x, y = sample(batch_size)
        Y = torch.tensor(y)
        X = torch.tensor(x)

        logits, loss = model.forward(X, Y)
        
        epoch_loss += loss.detach()
        model.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    #just keep any epoch stuff in a no grad block
    with torch.no_grad():
        if ep % 10 == 0:
            print(epoch_loss/samples)
            # note: this only rescales the local variable; the optimizer keeps the lr it was built with
            learning_rate *= .92
            print(ep, learning_rate)

print(epoch_loss/samples)

tensor(2.3080)
0 0.09200000000000001
tensor(2.1408)
10 0.08464000000000002
tensor(2.1189)
20 0.07786880000000002
tensor(2.1094)
30 0.07163929600000002
tensor(2.1030)
40 0.06590815232000002
tensor(2.0999)
50 0.06063550013440003
tensor(2.0938)
60 0.05578466012364803
tensor(2.0885)
70 0.05132188731375619
tensor(2.0868)
80 0.0472161363286557
tensor(2.0841)
90 0.043438845422363245
tensor(2.0801)
100 0.039963737788574184
tensor(2.0816)
110 0.03676663876548825
tensor(2.0785)


### Faster

we were able to make our batch size much smaller and achieve the same learning

In [34]:
def generate(num_names):
    for _ in range(num_names):
        
        out = []
        ix = [[0 for _ in range(context_length)]]
    
        for _ in range(14):
            # no gradients needed while sampling
            with torch.no_grad():
                logits, _ = model(torch.tensor(ix))
            
            p = F.softmax(logits.view(-1, vocab_size), dim=1)

            prediction = torch.multinomial(p, num_samples=1).item()
    
            # slide the context window one step and append the new character
            ix[0] = ix[0][1:] + [prediction]
    
            if prediction == 0:
                break
            out.append(itos[prediction])
        print("".join(out))

    # only the characters of the last generated name are returned
    return out

In [25]:
generate(10)

gbellah
colli
adam
seii
anaid
keni
rowice
rulei
branton
kyvan


['k', 'y', 'v', 'a', 'n']

## More thoughts

there are a few things to consider here

the hidden layer size - what are we trying to do with it?

it's operating on pairs:
* initially it's taking in characters and combining them in different ways to form fused representations
* these fused representations are then themselves combined

do we have an intuition that these pairings operate in similarly sized spaces?

how complex does that representation need to be?

(I'd suggest the necessary representation space increases layer by layer: we can write more words than characters, and more sentences than words)

what if we change those parameter sizes?

In [26]:
class VariableWavenetish(nn.Module):

    def __init__(self, vocab_size, embedding_size, context_length, hidden_sizes):
        super().__init__()

        layers = OrderedDict([
            ('embed', nn.Embedding(vocab_size, embedding_size)),
            ('flatten_a', ViewDilaltion(2)),
            ('feed_forward_a', nn.Linear(embedding_size*2, hidden_sizes[0], bias=False)),
            ('layer_norm_a', nn.LayerNorm(hidden_sizes[0])),
            ('non_linearity_a', nn.Tanh()),
            ('flatten_b', ViewDilaltion(2)),
            ('feed_forward_b', nn.Linear(hidden_sizes[0]*2, hidden_sizes[1], bias=False)),
            ('layer_norm_b', nn.LayerNorm(hidden_sizes[1])),
            ('non_linearity_b', nn.Tanh()),
            ('flatten_c', ViewDilaltion(2)),
            ('feed_forward_c', nn.Linear(hidden_sizes[1]*2, hidden_sizes[2], bias=False)),
            ('layer_norm_c', nn.LayerNorm(hidden_sizes[2])),
            ('non_linearity_c', nn.Tanh()),
            ('logits', nn.Linear(hidden_sizes[2], vocab_size, bias=True)),
        ])

        nn.init.xavier_uniform_(layers['feed_forward_a'].weight, gain=5/3)
        nn.init.xavier_uniform_(layers['feed_forward_b'].weight, gain=5/3)
        nn.init.xavier_uniform_(layers['feed_forward_c'].weight, gain=5/3)

        nn.init.zeros_(layers['logits'].bias)

        self.model = nn.Sequential(layers)

    def forward(self, idx, targets=None):
        logits = self.model(idx)

        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.view(-1, vocab_size), targets)

        return logits, loss

In [27]:
epochs = 120
batch_size = 100
learning_rate = .1
samples = 4000

model = VariableWavenetish(vocab_size, embedding_size, context_length, [10, 30, 40])

optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)

sum(p.nelement() for p in model.parameters())

4540

### Parameters

that's freed up a lot of parameters to play around with (4540 against 8200 before) - let's increase our embedding dimensions, we'll still have a bunch of parameters to spare

In [28]:
epochs = 120
batch_size = 100
learning_rate = .1
samples = 4000
hidden_states = [10, 30, 40]
embedding_size = 12

model = VariableWavenetish(vocab_size, embedding_size, context_length, hidden_states)

optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)

sum(p.nelement() for p in model.parameters())

4990
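
The counts printed above can be reproduced by hand. A sketch, assuming a vocabulary of 30 (inferred from the logits shape earlier, not read from the book reader) and the block layout used in these classes: a bias-free linear layer over a fused pair, followed by a LayerNorm. `count_parameters` is a hypothetical helper.

```python
def count_parameters(vocab_size, embedding_size, hidden_sizes):
    """Parameter count for embed -> [Linear(no bias) + LayerNorm] blocks -> logits."""
    total = vocab_size * embedding_size              # embedding table
    fan_in = embedding_size
    for fan_out in hidden_sizes:
        total += (2 * fan_in) * fan_out              # linear over a fused pair, no bias
        total += 2 * fan_out                         # LayerNorm weight and bias
        fan_in = fan_out
    total += fan_in * vocab_size + vocab_size        # logits layer with bias
    return total


print(count_parameters(30, 3, [40, 40, 40]))    # uniform hidden size of 40
print(count_parameters(30, 3, [10, 30, 40]))    # growing hidden sizes
print(count_parameters(30, 12, [10, 30, 40]))   # same, with a 12-dimensional embedding
```

Most of the saving comes from the first two blocks: a linear layer over a fused pair costs `2 * fan_in * fan_out`, so shrinking the early layers is cheap to try.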

## Or allow us to increase our context length

In [None]:
class DeeperWavenetish(nn.Module):

    def __init__(self, vocab_size, embedding_size, context_length, hidden_sizes):
        super().__init__()

        layers = OrderedDict([
            ('embed', nn.Embedding(vocab_size, embedding_size)),
        ])

        hidden_sizes = [embedding_size] + hidden_sizes

        suffixes = ['_a', '_b', '_c', '_d', '_e', '_f', '_g', '_h']
        for i in range(len(hidden_sizes)-1):
            self.add_block(layers, suffixes[i], hidden_sizes[i], hidden_sizes[i+1])
        
        layers['logits'] = nn.Linear(hidden_sizes[-1], vocab_size, bias=True)
        nn.init.zeros_(layers['logits'].bias)

        self.model = nn.Sequential(layers)

    def add_block(self, layer_dict, suffix="_a", fan_in=10, fan_out=20):
        linear_layer = nn.Linear(fan_in*2, fan_out, bias=False)
        nn.init.xavier_uniform_(linear_layer.weight, gain=5/3)
        layer_dict['flatten'+suffix] = ViewDilaltion(2)
        layer_dict['feed_forward'+suffix] = linear_layer
        layer_dict['layer_norm'+suffix] = nn.LayerNorm(fan_out)
        layer_dict['non_linearity'+suffix] = nn.Tanh()
        print("added block", suffix, fan_in, fan_out)
        
    def forward(self, idx, targets=None):
        logits = self.model(idx)

        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.view(-1, vocab_size), targets)

        return logits, loss

In [30]:
vocab_size = 29
epochs = 120
batch_size = 100
learning_rate = .1
samples = 4000
hidden_states = [30, 30, 30, 40]
embedding_size = 12

context_length = 2**len(hidden_states)

model = DeeperWavenetish(vocab_size, embedding_size, context_length, hidden_states)

optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)

context_length, sum(p.nelement() for p in model.parameters())

added block _a 12 30
added block _b 30 30
added block _c 30 30
added block _d 30 40


(16, 8517)

In [31]:
for ep in range(epochs):
    epoch_loss = 0
    for s in range(samples):
        x, y = sample(batch_size, context_length)
        
        Y = torch.tensor(y)
        X = torch.tensor(x)

        logits, loss = model.forward(X, Y)
        
        epoch_loss += loss.detach()
        model.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    #just keep any epoch stuff in a no grad block
    with torch.no_grad():
        if ep % 10 == 0:
            print(epoch_loss/samples)
            # note: this only rescales the local variable; the optimizer keeps the lr it was built with
            learning_rate *= .92
            print(ep, learning_rate)

print(epoch_loss/samples)

tensor(2.2849)
0 0.09200000000000001
tensor(2.1224)
10 0.08464000000000002
tensor(2.1027)
20 0.07786880000000002
tensor(2.0944)
30 0.07163929600000002
tensor(2.0828)
40 0.06590815232000002
tensor(2.0825)
50 0.06063550013440003
tensor(2.0813)
60 0.05578466012364803
tensor(2.0805)
70 0.05132188731375619
tensor(2.0775)
80 0.0472161363286557
tensor(2.0743)
90 0.043438845422363245
tensor(2.0697)
100 0.039963737788574184
tensor(2.0702)
110 0.03676663876548825
tensor(2.0705)


In [40]:
generate(10)

ealynn
jurashi
join
merris
sena
shaus
banden
keena
caslyn
camarah


['c', 'a', 'm', 'a', 'r', 'a', 'h']

### Longer context

we've achieved the same learning and given ourselves a longer context

things haven't got noticeably better, but we're still using the names dataset - let's read Alice again and see what our results look like in the [next chapter](wavenet_alice.ipynb)