# Training Tests for Part Of Speech tagging

This notebook is dedicated to start working with the PoS dataset already pre-processed and the column networks that I'm creating.

The network will be constructed from small parts, each will be trained on top of the previous one, adding a new column and decoder.


In [1]:
from datetime import datetime
import numpy as np
from langmodels.models import *
import langmodels.utf8codec as utf8codec
from langmodels.utils.tools import *
from langmodels.utils.preprocess_conllu import UPOS, DEPREL
import torch.nn.functional as F
import torch.nn as nn
import torch



Loading faiss with AVX2 support.
Loading faiss.


Load the embeddings first

In [2]:
# load the codebook and all the dictionaries mapping the data
# utf8codes, txt2code, code2txt, txt2num, num2txt = utf8codec._load_codebook()
utf8codes = np.load("./utf8-codes/utf8_codebook_overfit_matrix_2seg_dim64.npy")

In [3]:
utf8codes = utf8codes.reshape(1987,64)

In [4]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [5]:
net = Conv1DPoS(utf8codes)
net = net.to(device)

In [6]:
count_parameters(net)

1949540

for the original network 11266152 parameters, I have cut the number of features and dimensions to make it smaller

for nlayers = 5 of dim 5 is 6912424 and 6846888 trainable

for the following Conv1DPartOfSpeech the number of parameters is: 2161960 where 2096424 are trainable

    nchannels_in=[64, 128, 256, 512, 256],
    nchannels_out=[128, 256, 512, 256, 96],
    kernels=[3, 3, 3, 3, 3],
    nlayers=[6, 6, 4, 4, 3],
    groups=[1, 4, 8, 4, 1],
    
And LinearUposDeprelDecoder params are:

    lin_in_dim=96, 
    lin_hidd_dim=768,
    upos_dim=18, 
    deprel_dim=278,

In [7]:
count_trainable_parameters(net)

1884004

Datasets are the one that are heavy, so I'll just load them and check what happens

In [8]:
dataset_train = "/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.4/traindev_np_batches_779000x3x1024_uint16.npy"

In [9]:
data_train = np.load(dataset_train)

In [10]:
data_train.shape

(779000, 3, 1024)

In [11]:
len(data_train)

779000

In [12]:
data_train.dtype

dtype('uint16')

In [13]:
dta_train_txt = data_train[:,0,:]
dta_train_upos = data_train[:,1,:]
dta_train_deprel = data_train[:,2,:]

In [14]:
x = torch.from_numpy(dta_train_txt[:50].astype("int64")).to(device)

In [15]:
x

tensor([[113, 117, 111,  ...,   0,   0,   0],
        [ 77, 111, 110,  ...,   0,   0,   0],
        [ 65, 117, 223,  ...,   0,   0,   0],
        ...,
        [ 68,  97, 103,  ...,   0,   0,   0],
        [ 67, 111, 109,  ...,   0,   0,   0],
        [113, 117, 105,  ...,   0,   0,   0]], device='cuda:0')

In [16]:

txtcode, positions, latent, dec = net(x)
last_latent = latent[-1]
upos, deprel = dec

In [17]:
txtcode.shape, positions.shape, last_latent.shape, upos.shape, #  deprel.shape

(torch.Size([50, 64, 1024]),
 torch.Size([1, 1024]),
 torch.Size([50, 96, 1024]),
 torch.Size([50, 1024, 18]))

In [18]:
# out = torch.cat([upos,deprel], dim=2)

In [19]:
# out.shape

In [20]:
# upos and deprel data are given by indices, this keeps memory as low as possible, but they need to be encoded
upos_eye = torch.eye(len(UPOS))
# deprel_eye = torch.eye(len(DEPREL))

upos_emb = nn.Embedding(*upos_eye.shape)
upos_emb.weight.data.copy_(upos_eye)
upos_emb.to(device)

# deprel_emb = nn.Embedding(*deprel_eye.shape)
# deprel_emb.weight.data.copy_(deprel_eye)
# deprel_emb.to(device)


Embedding(18, 18)

In [21]:

# from https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
def chunks(data, n, dim=0):
    """Yield successive n-sized chunks from data by the dimension dim"""
    for i in range(0, data.shape[dim], n):
        yield data_train[i:i + n,:,:]


In [22]:
def loss_function(upos, deprel, target_upos, target_deprel):

    # TODO check a more sofisticated loss function, for the moment only the sum to see if it runs
    # the issue is that upos is easier than deprel (18 vs 278 classes)
#     upos_loss = F.mse_loss(upos, target_upos)
#     deprel_loss = F.mse_loss(deprel, target_deprel)
    # issue with the size of target and tensors for cross_entropy ... I don't understand
#     upos_loss = F.cross_entropy(upos, target_upos)
#     deprel_loss = F.cross_entropy(deprel, target_deprel)
#     print(upos.shape, target_upos.shape, deprel.shape, target_deprel.shape)
#     upos_loss = F.nll_loss(upos, target_upos)
#     deprel_loss = F.nll_loss(deprel, target_deprel)
#     upos_loss = F.kl_div(upos, target_upos)
#     deprel_loss = F.kl_div(deprel, target_deprel)
#     loss = upos_loss + deprel_loss
    loss = F.kl_div(torch.cat([upos, deprel], dim=-1).contiguous(), torch.cat([target_upos, target_deprel], dim=-1).contiguous())
    return loss


In [70]:

def train(model, optimizer, loss_function, batches, epoch, ndatapoints, device, log_interval=100):
#     model.train()
    train_loss = 0
#     batch_loss = []
    batch_idx = 0
    for b_data in batches:
        torch.cuda.empty_cache()  # make sure the cache is emptied to begin the nexxt batch
        b_train = torch.from_numpy(b_data[:,0,:].astype("int64")).squeeze().to(device).long()
        b_upos = torch.from_numpy(b_data[:,1,:].astype("int64")).squeeze().to(device).long()
#         b_deprel = torch.from_numpy(b_data[:,2,:].astype("int64")).squeeze().to(device).long()
#         tensor_data = torch.from_numpy(bdata).to(device).long()  #.double()  #.float()
        
        optimizer.zero_grad()
        txtcode, positions, latent, dec = model(b_train)
        last_latent = latent[-1]
        upos, deprel = dec
#         print(emb.shape,emb.dtype, res.shape, res.dtype)
#         loss = loss_function(upos, deprel, upos_emb(b_upos), deprel_emb(b_deprel))
        # Untill I make it work, work only with the UPOS PoS as it will be faster MUCH faster
        loss = F.kl_div(upos, upos_emb(b_upos), reduction="batchmean")
#         loss = F.mse_loss(upos, upos_emb(b_upos))
        
        loss.backward()
        train_loss += loss.data.item()  # [0]
        optimizer.step()
        
        if batch_idx % log_interval == 0:
            print('Timestamp {} Train Epoch: {} [{}/{} ]\tLoss: {:.6f}'.format(
                datetime.now(),
                epoch, batch_idx , (ndatapoints//len(b_data)),
                loss.data.item() / b_data.shape[0]))
#             batch_loss.append(loss)
        batch_idx += 1
        del(b_train)
        del(b_upos)
#         del(b_deprel)
    print('====> Timestamp {} Epoch: {} Average loss: {:.8f}'.format(datetime.now(), epoch, train_loss / ndatapoints))
    return train_loss


# def test(model, test_data, epoch, device):
#     model.eval()
#     test_loss = 0
#     for d in test_data:
#         tensor_data = torch.from_numpy(d).to(device)
#         res = model(data)
#         test_loss += loss_function(tensor_data, res).data.item()  # [0]

#     test_loss /= len(test_data)
#     print('epoch: {}====> Test set loss: {:.4f}'.format(epoch, test_loss))


In [24]:
model = net

In [57]:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-6, weight_decay=0, amsgrad=False )
# optimizer = torch.optim.AdamW(model.parameters())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

In [26]:
data_train.shape, data_train.shape[0]//50

((779000, 3, 1024), 15580)

In [71]:
batch_size = 50
data = data_train[-1000*batch_size:,:,:]  # just for the trials, use the last 1000 batches only

In [72]:
print(data.shape)

(50000, 3, 1024)


In [73]:
batches = chunks(data, batch_size, dim=0)

In [74]:
%%time
epoch_count = 2
# epochs = range(1)
# for e in epochs:
eloss = train(model, optimizer, loss_function, batches, epoch_count, len(data), device, log_interval=100)
#     epoch_count+=1
#     if epoch_count == 20:
#         print("epoch {} decreasing learning_rate to {}".format(epoch_count, 1e-5))
#         optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)
#     epoch_loss.append(eloss)

Timestamp 2019-11-26 23:43:53.870421 Train Epoch: 2 [0/1000 ]	Loss: -18.555200
Timestamp 2019-11-26 23:44:16.984003 Train Epoch: 2 [100/1000 ]	Loss: -18.852000
Timestamp 2019-11-26 23:44:40.097733 Train Epoch: 2 [200/1000 ]	Loss: -18.846400
Timestamp 2019-11-26 23:45:03.317705 Train Epoch: 2 [300/1000 ]	Loss: -18.587999
Timestamp 2019-11-26 23:45:26.615978 Train Epoch: 2 [400/1000 ]	Loss: -18.891200
Timestamp 2019-11-26 23:45:49.917358 Train Epoch: 2 [500/1000 ]	Loss: -18.941599
Timestamp 2019-11-26 23:46:13.255728 Train Epoch: 2 [600/1000 ]	Loss: -18.807600
Timestamp 2019-11-26 23:46:36.497815 Train Epoch: 2 [700/1000 ]	Loss: -18.763600
Timestamp 2019-11-26 23:46:59.805018 Train Epoch: 2 [800/1000 ]	Loss: -18.970399
Timestamp 2019-11-26 23:47:23.135896 Train Epoch: 2 [900/1000 ]	Loss: -18.632000
====> Timestamp 2019-11-26 23:47:46.196081 Epoch: 2 Average loss: -18.75240836
CPU times: user 2min 34s, sys: 1min 17s, total: 3min 51s
Wall time: 3min 52s


In [75]:
model.network.save_model("./trained_models/conv1dcol", "conv1dcol_kl-div+1000batches-mse-loss_epoch-3")

Tried different Nx50 sizes for batches but the only one that works is 50, it seems will be the maximum number of samples in each batch for the training in my GPU

The issue is that training does not seem to work correctly.

All training losses (kl_div, mse_loss) seem to learn well only the first 100 batches and then nothing, it oscilates. After several different initializations with kl_div it worked better (the first loss was about initialized to -1 ... ) so initialization seems to take an important role here.

I need to write a test function now to be able to measure with the test datasets and see the real accuracy



In [65]:
torch.cuda.memory_allocated()

2153514496

In [66]:
torch.cuda.memory_cached()

4483710976

In [67]:
torch.cuda.empty_cache()

In [68]:
torch.cuda.memory_allocated()

2153514496

In [69]:
torch.cuda.memory_cached()

2304770048