# Training Tests for Part-of-Speech Tagging

This notebook starts working with the pre-processed PoS dataset and the column networks that I am building.

The network is built from small parts: each part is trained on top of the previous one, adding a new column and decoder.


In [1]:
from datetime import datetime
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter

from langmodels.models import *
import langmodels.utf8codec as utf8codec
from langmodels.utils.tools import *
from langmodels.utils.preprocess_conllu import *





Load the embeddings first

In [2]:
# load the codebook and all the dictionaries mapping the data
# utf8codes, txt2code, code2txt, txt2num, num2txt = utf8codec._load_codebook()
utf8codes = np.load("./utf8-codes/utf8_codebook_overfit_matrix_2seg_dim64.npy")

In [3]:
utf8codes = utf8codes.reshape(1987,64)

In [4]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [5]:
net = Conv1DPoS(utf8codes)
net = net.to(device)

In [6]:
count_parameters(net)

1949540

The original network had 11266152 parameters; I have cut the number of features and dimensions to make it smaller.

With nlayers = 5 of dim 5 the count is 6912424, of which 6846888 are trainable.

For the following Conv1DPartOfSpeech configuration the count is 2161960, of which 2096424 are trainable:

    nchannels_in=[64, 128, 256, 512, 256],
    nchannels_out=[128, 256, 512, 256, 96],
    kernels=[3, 3, 3, 3, 3],
    nlayers=[6, 6, 4, 4, 3],
    groups=[1, 4, 8, 4, 1],
    
And LinearUposDeprelDecoder params are:

    lin_in_dim=96, 
    lin_hidd_dim=768,
    upos_dim=18, 
    deprel_dim=278,

In [7]:
count_trainable_parameters(net)

1884004
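
The two helpers above come from `langmodels.utils.tools`, whose source is not shown here. A minimal sketch of how such counts are usually computed in PyTorch, assuming they simply sum `numel()` over the parameters (the `_sketch` function names and the toy model are hypothetical):

```python
import torch.nn as nn


def count_parameters_sketch(model):
    """Total number of parameters (hypothetical stand-in for count_parameters)."""
    return sum(p.numel() for p in model.parameters())


def count_trainable_parameters_sketch(model):
    """Number of parameters that receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Toy model: a frozen embedding followed by a trainable linear layer.
toy = nn.Sequential(nn.Embedding(1024, 64), nn.Linear(64, 18))
toy[0].weight.requires_grad_(False)  # freeze the embedding table

total = count_parameters_sketch(toy)
trainable = count_trainable_parameters_sketch(toy)
print(total, trainable, total - trainable)
```

In the counts reported above, the gap between total and trainable parameters is 1949540 - 1884004 = 65536, and the two earlier configurations show the same gap. That is what a frozen 1024 x 64 table would contribute; whether that is the actual source of the gap in `Conv1DPoS` is a guess.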

The datasets are the heavy part, so I'll just load them and check what happens.

In [8]:
dataset_train = "/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.4/traindev_np_batches_779000x3x1024_uint16.npy"

In [9]:
data_train = np.load(dataset_train)

In [10]:
data_train.shape

(779000, 3, 1024)

In [11]:
len(data_train)

779000

In [12]:
data_train.dtype

dtype('uint16')

In [13]:
dta_train_txt = data_train[:,0,:]
dta_train_upos = data_train[:,1,:]
dta_train_deprel = data_train[:,2,:]

In [14]:
# x = torch.from_numpy(dta_train_txt[:50].astype("int64")).to(device)

In [15]:
# txtcode, positions, latent, dec = net(x)
# last_latent = latent[-1]
# upos, deprel = dec

In [16]:
# txtcode.shape, positions.shape, last_latent.shape, upos.shape, #  deprel.shape

In [17]:
# out = torch.cat([upos,deprel], dim=2)

In [18]:
# out.shape

In [19]:
# upos and deprel data are stored as indices to keep memory as low as possible, so they need to be one-hot encoded
upos_eye = torch.eye(len(UPOS))
deprel_eye = torch.eye(len(DEPREL))
with torch.no_grad():
    upos_emb = nn.Embedding(*upos_eye.shape)
    upos_emb.weight.data.copy_(upos_eye)
    upos_emb = upos_emb.to(device)

    deprel_emb = nn.Embedding(*deprel_eye.shape)
    deprel_emb.weight.data.copy_(deprel_eye)
    deprel_emb.to(device)
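
Because the embedding weights are identity matrices, a lookup returns the one-hot encoding of each index. A small check of that equivalence against `torch.nn.functional.one_hot`, assuming 18 UPOS classes (the `upos_dim=18` quoted above; the real `UPOS` list lives in `langmodels` and is not used here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes = 18  # assumption: matches upos_dim=18
eye = torch.eye(n_classes)

# Identity-initialised, frozen embedding, equivalent to the cell above.
emb = nn.Embedding.from_pretrained(eye, freeze=True)

indices = torch.tensor([[0, 3, 17], [5, 5, 1]])  # (batch, sequence) of class indices
via_embedding = emb(indices)                      # (2, 3, 18), float
via_one_hot = F.one_hot(indices, num_classes=n_classes).float()

print(torch.equal(via_embedding, via_one_hot))
```

One-hot targets are only needed for losses such as `F.mse_loss` or `F.kl_div`; `F.nll_loss` and `F.cross_entropy` take the integer indices directly.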


In [20]:

# from https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
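def chunks(data, n, dim=0):
    """Yield successive n-sized chunks from data by the dimension dim"""
    for i in range(0, data.shape[dim], n):
        # slice the argument itself (not the global data_train) along the requested dimension
        index = [slice(None)] * data.ndim
        index[dim] = slice(i, i + n)
        yield data[tuple(index)]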
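
The training loop further down uses this generator twice: once to cut the dataset into "epoch" chunks and once to cut each chunk into batches. A small NumPy sketch of that nesting on a toy array, assuming the generator slices the array it receives (`iter_chunks` is a hypothetical name used to avoid clashing with the cell above):

```python
import numpy as np


def iter_chunks(data, n):
    """Yield successive n-sized chunks of data along the first axis."""
    for i in range(0, data.shape[0], n):
        yield data[i:i + n]


# Toy stand-in for the (779000, 3, 1024) array: 25 samples, 3 rows, 8 columns.
toy = np.arange(25 * 3 * 8, dtype=np.uint16).reshape(25, 3, 8)

for epoch_idx, epoch_chunk in enumerate(iter_chunks(toy, 10), start=1):
    batch_shapes = [b.shape for b in iter_chunks(epoch_chunk, 4)]
    print(epoch_idx, epoch_chunk.shape, batch_shapes)
```

With the real sizes, `epoch_size = 10000` and `batch_size = 50` give 200 batches per chunk, and 779000 samples give 78 chunks, the last one holding 9000 samples.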


In [21]:
def loss_function(upos, deprel, target_upos, target_deprel):

    # TODO check a more sophisticated loss function; for the moment only the sum, to see if it runs
    # the issue is that upos is easier than deprel (18 vs 278 classes)
#     upos_loss = F.mse_loss(upos, target_upos)
#     deprel_loss = F.mse_loss(deprel, target_deprel)
    # issue with the size of target and tensors for cross_entropy ... I don't understand
#     upos_loss = F.cross_entropy(upos, target_upos)
#     deprel_loss = F.cross_entropy(deprel, target_deprel)
#     print(upos.shape, target_upos.shape, deprel.shape, target_deprel.shape)
    upos_loss = F.nll_loss(upos, target_upos)
    deprel_loss = F.nll_loss(deprel, target_deprel)
#     upos_loss = F.kl_div(upos, target_upos)
#     deprel_loss = F.kl_div(deprel, target_deprel)
    loss = upos_loss + deprel_loss
#     loss = F.kl_div(torch.cat([upos, deprel], dim=-1).contiguous(), torch.cat([target_upos, target_deprel], dim=-1).contiguous())
    return loss
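
The size mismatch noted in the comments is probably a matter of layout: `F.cross_entropy` and `F.nll_loss` expect the class dimension at position 1 (or a 2-D `(N, C)` input) and integer class indices as targets, not one-hot vectors. A sketch with random tensors, assuming the decoder emits `(batch, sequence, classes)`; the sizes are illustrative, not taken from the real network:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, n_classes = 4, 16, 18      # illustrative sizes
logits = torch.randn(batch, seq_len, n_classes)          # assumed decoder layout
targets = torch.randint(0, n_classes, (batch, seq_len))  # integer class indices

# Option 1: flatten to (N, C) and (N,), as the training loop does with view().
loss_flat = F.cross_entropy(logits.reshape(-1, n_classes), targets.reshape(-1))

# Option 2: move the class dimension to position 1 -> (batch, classes, sequence).
loss_perm = F.cross_entropy(logits.permute(0, 2, 1), targets)

# nll_loss gives the same value only when it is fed log-probabilities.
loss_nll = F.nll_loss(F.log_softmax(logits, dim=-1).reshape(-1, n_classes),
                      targets.reshape(-1))

print(loss_flat.item(), loss_perm.item(), loss_nll.item())
```

If the decoder already ends in `log_softmax`, `F.nll_loss` is the right choice; if it returns raw scores, `F.cross_entropy` applies the `log_softmax` itself.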


In [22]:
writer = SummaryWriter()

In [23]:
# indata = torch.from_numpy(data_train[-2:,0,:].astype("int64")).to(device)

In [24]:
# indata.shape

In [25]:
# %%time
# testing tensorboard add_graph to see if the network graph is drawn correctly ;)
# indata = torch.from_numpy(data_train[-2:,0,:].astype("int64")).to(device)
# writer.add_graph(net, indata)
# Kernel dies when I do this ... so ... :O

In [26]:
def train(model, optimizer, loss_function, batches, epoch, ndatapoints, device, log_interval=100):
    model.train()
    train_loss = 0
#     batch_loss = []
    batch_idx = 1
    for b_data in batches:
        torch.cuda.empty_cache()  # make sure the cache is emptied to begin the next batch
        b_train = torch.from_numpy(b_data[:,0,:].astype("int64")).squeeze().to(device).long()
        b_upos = torch.from_numpy(b_data[:,1,:].astype("int64")).squeeze().to(device).long()
#         b_deprel = torch.from_numpy(b_data[:,2,:].astype("int64")).squeeze().to(device).long()
#         tensor_data = torch.from_numpy(bdata).to(device).long()  #.double()  #.float()
        
        optimizer.zero_grad()
        txtcode, positions, latent, dec = model(b_train)
        last_latent = latent[-1]
        upos, deprel = dec
#         print(emb.shape,emb.dtype, res.shape, res.dtype)
#         print(upos.shape, b_upos.shape)
#         loss = loss_function(upos, deprel, upos_emb(b_upos), deprel_emb(b_deprel))
#         loss = loss_function(upos, deprel, b_upos, b_deprel)
        # Until I make it work, train only on UPOS, as it will be MUCH faster
#         loss = F.kl_div(upos, upos_emb(b_upos), reduction="batchmean")
        loss = F.nll_loss(upos.view([-1,18]),b_upos.view([-1]))
#         loss = F.cross_entropy(upos.view([-1,18]),b_upos.view([-1]))
#         loss = F.cross_entropy(upos,b_upos)
#         loss = F.mse_loss(upos, upos_emb(b_upos))
        
        loss.backward()
        train_loss += loss.data.item()  # [0]
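        # the global step must grow monotonically across epochs (epoch and batch_idx both start at 1)
        writer.add_scalar("Loss/train", loss.data.item(), global_step=(epoch - 1) * (ndatapoints // len(b_data)) + batch_idx)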
        optimizer.step()
        
        if batch_idx % log_interval == 0:
            print('Timestamp {} Train Epoch: {} [{}/{} ]\tLoss: {:.6f}'.format(
                datetime.now(),
                epoch, batch_idx , (ndatapoints//len(b_data)),
                loss.data.item() / b_data.shape[0]))
#             batch_loss.append(loss)
        batch_idx += 1
        del(b_train)
        del(b_upos)
#         del(b_deprel)
        torch.cuda.empty_cache()
    # batch_idx is one past the number of processed batches
    writer.add_scalar("EpochLoss/train", train_loss / max(batch_idx - 1, 1), epoch)
    print('====> Timestamp {} Epoch: {} Average loss: {:.8f}'.format(datetime.now(), epoch, train_loss / ndatapoints))
    return train_loss

In [27]:
# load ALL the testing data
base_dir = "/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.4"
# get all file paths for testing
all_fnames = get_all_files_recurse(base_dir)
fnames = [f for f in all_fnames if "test-charse" in f and f.endswith(".npy")]


In [28]:
len(fnames)

117

In [29]:
# load all test files 
test_data = []
for f in fnames:
    data = np.load(f)
    lang_name = path_leaf(f).split("-ud")[0]
    test_data.append((lang_name, data))

In [30]:
@torch.no_grad()  # evaluation needs no gradients; skipping the autograd graph saves GPU memory
def test(model, loss_function, test_data, epoch, device, max_data=100):
    model.eval()
    test_loss = 0
    for lang, d in test_data:
        torch.cuda.empty_cache()  # make sure the cache is emptied to begin the next batch
        b_test = torch.from_numpy(d[:max_data,0,:].astype("int64")).squeeze().to(device).long()
        b_upos = torch.from_numpy(d[:max_data,1,:].astype("int64")).squeeze().to(device).long()
#         b_deprel = torch.from_numpy(d[:,2,:].astype("int64")).squeeze().to(device).long()
        _, _, _, dec = model(b_test)
#         last_latent = latent[-1]
        upos, _ = dec
        loss = loss_function(upos.view([-1,18]),b_upos.view([-1]))
#         loss =  loss_function(res, tensor_data).data.item()  # [0]
        test_loss += loss.data.item()
        writer.add_scalar("LangLoss/test/"+lang, loss.data.item(), global_step=epoch)
        del(b_test)
        del(b_upos)
        torch.cuda.empty_cache()
    test_loss /= len(test_data)  # although this is not fair, as different languages give different results
    writer.add_scalar("EpochLangLoss/test/", test_loss, global_step=epoch)
    print('epoch: {}====> Test set loss: {:.8f}'.format(epoch, test_loss))

In [31]:
# reload model from saved state:
# net.network.load_model("./trained_models/conv1dcol", "conv1dcol_kl-div+1000batches-mse-loss_epoch-3_001")

In [32]:
model = net.to(device)

In [33]:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-6, weight_decay=0, amsgrad=False )
# optimizer = torch.optim.AdamW(model.parameters())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

In [34]:
loss_function = F.nll_loss

In [35]:
data_train.shape, data_train.shape[0]//50

((779000, 3, 1024), 15580)

In [36]:
epoch_size = 10000
batch_size = 50
# data = data_train[-1000*batch_size:,:,:]  # just for the trials, use the last 1000 batches only
data = data_train

In [37]:
print(data.shape)

(779000, 3, 1024)


In [38]:
epochs = chunks(data, epoch_size, dim=0)
# batches = chunks(data, batch_size, dim=0)

In [39]:
# %%time
# epoch_count = 0
# test(model, loss_function, test_data, epoch_count, device, max_data=50)

In [None]:
%%time
epoch_count = 1
for e in epochs:
    batches = chunks(e, batch_size, dim=0)
    eloss = train(model, optimizer, loss_function, batches, epoch_count, epoch_size, device, log_interval=10)
    test(model, loss_function, test_data, epoch_count, device, max_data=50)
    epoch_count+=1


Timestamp 2019-11-27 18:32:16.779965 Train Epoch: 1 [10/200 ]	Loss: 0.053510
Timestamp 2019-11-27 18:32:19.232719 Train Epoch: 1 [20/200 ]	Loss: 0.033693
Timestamp 2019-11-27 18:32:21.648737 Train Epoch: 1 [30/200 ]	Loss: 0.016544
Timestamp 2019-11-27 18:32:24.090981 Train Epoch: 1 [40/200 ]	Loss: 0.009193
Timestamp 2019-11-27 18:32:26.545630 Train Epoch: 1 [50/200 ]	Loss: 0.006553
Timestamp 2019-11-27 18:32:28.921684 Train Epoch: 1 [60/200 ]	Loss: 0.006662
Timestamp 2019-11-27 18:32:31.285176 Train Epoch: 1 [70/200 ]	Loss: 0.006000
Timestamp 2019-11-27 18:32:33.650147 Train Epoch: 1 [80/200 ]	Loss: 0.005756
Timestamp 2019-11-27 18:32:36.014267 Train Epoch: 1 [90/200 ]	Loss: 0.005029
Timestamp 2019-11-27 18:32:38.388077 Train Epoch: 1 [100/200 ]	Loss: 0.005981
Timestamp 2019-11-27 18:32:40.753877 Train Epoch: 1 [110/200 ]	Loss: 0.004569
Timestamp 2019-11-27 18:32:43.120740 Train Epoch: 1 [120/200 ]	Loss: 0.004130
Timestamp 2019-11-27 18:32:45.476898 Train Epoch: 1 [130/200 ]	Loss: 0.00

In [None]:
# %%time
# epoch_count = 2
# eloss = train(model, optimizer, loss_function, batches, epoch_count, len(data), device, log_interval=20)


In [None]:
model.network.save_model("./trained_models/conv1dcol", "conv1dcol_nll-loss_epoch-{}".format(epoch_count))

I tried different batch sizes in multiples of 50, but only 50 works; it seems to be the maximum number of samples per batch that fits on my GPU during training.

The issue is that training does not seem to work correctly.

All the training losses tried (kl_div, mse_loss) seem to learn well only for the first 100 batches and then stall; the loss just oscillates. After several different initializations kl_div worked better (the first loss started at about -1 ...), so initialization seems to play an important role here.
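
One possible explanation for the negative first loss: `F.kl_div` expects log-probabilities as input and probabilities as target. With one-hot targets it reduces to the negative log-likelihood, which cannot be negative for log-probabilities, so a negative value suggests the input was not passed through `log_softmax`. A sketch with random tensors (sizes and the shift of 5.0 are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_samples, n_classes = 32, 18
scores = torch.randn(n_samples, n_classes)            # raw, unnormalised outputs
targets = torch.randint(0, n_classes, (n_samples,))
one_hot = F.one_hot(targets, num_classes=n_classes).float()

log_probs = F.log_softmax(scores, dim=-1)

# With log-probabilities as input, KL against one-hot targets equals the NLL.
kl = F.kl_div(log_probs, one_hot, reduction="sum")
nll = F.nll_loss(log_probs, targets, reduction="sum")

# Feeding raw scores (shifted up to make the effect obvious) breaks that
# guarantee: the "loss" becomes negative.
kl_raw = F.kl_div(scores + 5.0, one_hot, reduction="sum")

print(kl.item(), nll.item(), kl_raw.item())
```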

I now need to write a test function that measures on the test datasets, to see the real accuracy.
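
A minimal sketch of a token-level accuracy measure, assuming the decoder output has shape `(batch, sequence, classes)`. The function name and the optional `pad_index` argument are hypothetical; which label, if any, marks padding in this dataset is not confirmed here:

```python
import torch


def token_accuracy(scores, targets, pad_index=None):
    """Fraction of positions whose arg-max class equals the target.

    scores:  (batch, sequence, classes) scores or log-probabilities
    targets: (batch, sequence) integer class indices
    pad_index: optional label to exclude from the count (hypothetical)
    """
    predictions = scores.argmax(dim=-1)
    correct = predictions == targets
    if pad_index is not None:
        mask = targets != pad_index
        correct = correct & mask
        total = mask.sum().item()
    else:
        total = targets.numel()
    return correct.sum().item() / max(total, 1)


# Tiny hand-made example: 1 sentence, 4 positions, 3 classes.
scores = torch.tensor([[[0.1, 0.8, 0.1],
                        [0.7, 0.2, 0.1],
                        [0.2, 0.2, 0.6],
                        [0.9, 0.05, 0.05]]])
targets = torch.tensor([[1, 0, 1, 0]])
print(token_accuracy(scores, targets))
print(token_accuracy(scores, targets, pad_index=0))
```

Inside the evaluation loop this could be called on `upos` and `b_upos` per language and logged next to the loss.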



In [None]:
torch.cuda.memory_allocated()

In [None]:
torch.cuda.memory_cached()

In [None]:
torch.cuda.empty_cache()

In [None]:
torch.cuda.memory_allocated()

In [None]:
torch.cuda.memory_cached()