# Training Tests for Part Of Speech tagging

This notebook starts working with the already pre-processed PoS dataset and the column networks that I'm creating.

The network will be built from small parts, each one trained on top of the previous one by adding a new column and decoder.


In [1]:
from datetime import datetime
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter

from langmodels.models import *
import langmodels.utf8codec as utf8codec
from langmodels.utils.tools import *
from langmodels.utils.helpers import *
from langmodels.utils.preprocess_conllu import *
from langmodels.train import *

Loading faiss with AVX2 support.
Loading faiss.


In [2]:
import random

In [3]:
# torch.manual_seed(42)
# torch.manual_seed(random.randint(0,1000))
torch.manual_seed(0)

<torch._C.Generator at 0x7f69ada1db10>

In [4]:
# %%time
# main_conv1dcolnet()

There is a weird thing: the training curves jump up at certain moments. In one try the loss went up at epochs 11 and 12, and in the other at epochs 6, 7, 8 and again at 72, 73, 74, 75; after those it goes down again. I will have to look into that.

Also, the upos loss is lower than the deprel loss. This makes sense: deprel is a much more difficult task (many more classes) than upos, and we are using the same network for both and giving both tasks the same importance in the loss. This serves the purpose of pre-training (starting) the networks, but nothing else. A better loss must be created in order to improve the accuracy (which I'm not yet measuring).
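One possible way to rebalance the two tasks, sketched under assumptions: a uniform guess over C classes has a cross-entropy of ln(C), so dividing each task's loss by ln(C) puts both on a comparable scale before weighting them. The class counts below (17 UPOS tags and 37 universal relations, as in Universal Dependencies v2 without subtypes) and the names `normalized_multitask_loss`, `w_upos` and `w_deprel` are assumptions for illustration, not what `langmodels.train` does.

```python
import math

N_UPOS = 17    # assumption: UD v2 universal POS tags
N_DEPREL = 37  # assumption: UD v2 universal relations, no subtypes


def normalized_multitask_loss(upos_nll, deprel_nll, w_upos=0.5, w_deprel=0.5):
    """Weighted sum of per-task NLL values, each divided by its chance-level loss ln(C)."""
    upos_scaled = upos_nll / math.log(N_UPOS)
    deprel_scaled = deprel_nll / math.log(N_DEPREL)
    return w_upos * upos_scaled + w_deprel * deprel_scaled


# A model guessing uniformly on both tasks scores 1.0 when the weights sum to 1
chance = normalized_multitask_loss(math.log(N_UPOS), math.log(N_DEPREL))
```

With this scaling a value below 1.0 means the model beats uniform guessing on the weighted mix, and the weights can then be shifted towards deprel if it keeps lagging behind.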


Note: the TensorBoard names of these 2 training runs are:
*Dec04_11-52-54_labestia* and *Dec04_13-56-43_labestia*


In [5]:
print(torch.cuda.current_device())
# torch.cuda.set_device(1)  # device 0 is gpu 1 and vice-versa so: RTX2080ti is GPU:1 but cuda:0
# print(torch.cuda.current_device())
print(0, torch.cuda.get_device_name(0))
print(1, torch.cuda.get_device_name(1))

0
0 GeForce RTX 2080 Ti
1 GeForce GTX 1080
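The index mismatch noted in the comment above is most likely a device ordering issue: by default CUDA enumerates devices fastest first, while `nvidia-smi` lists them in PCI bus order. A minimal sketch, assuming the variable is set before anything initializes CUDA (ideally at the very top of the notebook, before `import torch`):

```python
import os

# Ask CUDA to enumerate GPUs in PCI bus order, the same order nvidia-smi uses.
# This only has an effect if it runs before CUDA is initialized in the process.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

device_order = os.environ["CUDA_DEVICE_ORDER"]
```

After that, the `cuda:N` indices should match the GPU numbers reported by `nvidia-smi`.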


In [6]:
torch.cuda.memory_allocated()

0

In [7]:
torch.cuda.memory_cached()

0

In [8]:
torch.cuda.empty_cache()

In [9]:
torch.cuda.memory_allocated()

0

In [10]:
torch.cuda.memory_cached()

0

Now I'll do a first try with the pretrained networks that show the jumps, then I'll try again with a newly trained conv1dcolnet, and then without convolutional pre-training, training everything from scratch.

This will take a lot of time unless my new rtx2080ti arrives... it'll take less time there, but still a lot of time -> ARRIVED!!!

In [None]:
%%time
main_convattnet()

Starting training for model with column type ConvAttNetCol and pretrained Conv1dColNet
Parameter model details: 
conv1d_encoder parameters 2173824 from which 2173824 are trainable 
ConvAttColNet parameters 13016064 from which 13016064 are trainable 
decoder parameters 378832 from which 378832 are trainable 
Total model parameters 14037712 from which 14037712 are trainable 
====> Timestamp 2019-12-06 14:42:07.453593 Epoch: 1 Average loss: 0.04023628
epoch: 1====> Test set loss: 1.70771577
====> Timestamp 2019-12-06 14:43:35.835037 Epoch: 2 Average loss: 0.03531876
epoch: 2====> Test set loss: 1.71208881
====> Timestamp 2019-12-06 14:45:04.130045 Epoch: 3 Average loss: 0.03149185
epoch: 3====> Test set loss: 1.58856511
====> Timestamp 2019-12-06 14:46:32.542815 Epoch: 4 Average loss: 0.02931030
epoch: 4====> Test set loss: 1.96122572
====> Timestamp 2019-12-06 14:48:00.516503 Epoch: 5 Average loss: 0.02930105
epoch: 5====> Test set loss: 1.61341013
====> Timestamp 2019-12-06 14:49:27.289

In [None]:
# utf8codes = np.load(utf8codematrix)
# # utf8codes = utf8codes.reshape(1987, 324)
# # the convolutional encoder must NOT be retrained (that is what I'm trying to test)
# # with torch.no_grad():
# #     conv1d_encoder = Conv1DColNet(transpose_output=False)  # use default parameters
# #     conv1d_decoder = LinearUposDeprelDecoder(transpose_input=False)
# #     conv1d_model = NetContainer(utf8codes, conv1d_encoder, conv1d_decoder)
# #     # load pre-trained conv1dcolnet
# #     # conv1d_model.load_checkpoint(conv1d_pretrain_file)
# #     # cleanup things that we'll not use, we just need the encoder
# #     del conv1d_model
# #     del conv1d_decoder
# #     torch.cuda.empty_cache()
# conv1d_encoder = Conv1DColNet(transpose_output=False)  # use default parameters
# encoder = ConvAttColNet(conv1d_encoder, transpose_output=False)
# decoder = LinearUposDeprelDecoder(transpose_input=False)
# model = NetContainer(utf8codes, encoder, decoder)
# print("Starting training for model with column type ConvAttNetCol and pretrained Conv1dColNet")
# print("Parameter model details: ")
# print("conv1d_encoder parameters {} from which {} are trainable ".
#       format(count_parameters(conv1d_encoder), count_parameters(conv1d_encoder)))
# print("ConvAttColNet parameters {} from which {} are trainable ".
#       format(count_parameters(encoder), count_parameters(encoder)))
# print("decoder parameters {} from which {} are trainable ".
#       format(count_parameters(decoder), count_parameters(decoder)))
# print("Total model parameters {} from which {} are trainable ".
#       format(count_parameters(model), count_parameters(model)))
# path = "./trained_models/ConvAttNet"
# base_name = "ConvAttNet_nll-loss"

In [None]:
# device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
# print(device)

In [None]:
# model.to(device)

In [None]:
# %%time
# train_test(model, path, base_name, max_seq_len=384, max_data=60)

Training now goes smoothly with the new rtx2080ti card; it takes about 2 hours for the entire network from scratch.

There is an issue with the training: there are a couple of spikes in the training (and testing) results (epochs 33, 34 and 63), although the loss comes back down later and ends up better. This behaviour already showed up in some other tests on the Conv1dColNet training today on the GTX1080 card.

Results from Conv1dColNet and ConvAttColNet are not directly comparable due to the shape of the vectors (the latter outputs only a part of the result, the last 384 elements). It might be interesting to use this pretrained network to do some fine-tuning for a full-length (1024) Attention output. This might be faster if I train only the last (big) layer first and then fine-tune the previous ones (as in the fast.ai results on ULMFiT, for example, although the method needs to be tweaked for the network architecture here, which is a bit more complex).
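To locate such spikes without reading the curves by eye, a small helper could flag epochs where the loss jumps relative to the previous epoch. This is only a sketch: `find_spikes` and the 15% threshold are hypothetical choices, and the sample values are the five test set losses printed in the `main_convattnet()` log earlier in this notebook.

```python
def find_spikes(losses, rel_jump=0.15):
    """Return 1-based epochs whose loss exceeds the previous epoch's loss by more than rel_jump."""
    spikes = []
    for i in range(1, len(losses)):
        if losses[i] > losses[i - 1] * (1.0 + rel_jump):
            spikes.append(i + 1)  # epochs are numbered from 1
    return spikes


# Test set losses for epochs 1 to 5, copied from the main_convattnet() log
test_losses = [1.70771577, 1.71208881, 1.58856511, 1.96122572, 1.61341013]
spikes = find_spikes(test_losses)
```

On these five values it flags epoch 4. Running the same check on the full per-epoch history would give the spike epochs directly, which makes it easier to compare runs and cards.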
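The staged fine-tuning idea above could look roughly like this in PyTorch. This is a toy sketch: the small `encoder` and `head` are hypothetical stand-ins with made-up sizes, not the ConvAttColNet architecture, and `set_trainable` and `count_trainable` are illustrative helpers rather than functions from `langmodels`.

```python
import torch.nn as nn


def set_trainable(module, trainable):
    """Enable or disable gradient updates for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable


def count_trainable(module):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Hypothetical stand-ins: a pretrained encoder and a new, larger output layer
encoder = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16))
head = nn.Linear(16, 4)
model = nn.Sequential(encoder, head)

# Stage 1: freeze the pretrained part and train only the new last layer
set_trainable(encoder, False)
stage1 = count_trainable(model)

# Stage 2: unfreeze the layer just below the head and fine-tune both
set_trainable(encoder[2], True)
stage2 = count_trainable(model)
```

When building the optimizer for each stage, passing only the parameters with `requires_grad=True` keeps frozen weights out of the update, and later stages would typically use a smaller learning rate for the older layers.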


The next stage is actually measuring the accuracy (I need to create the measurement).
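A minimal sketch of what that measurement could be, assuming predictions and gold labels are available as lists of integer tag ids per sequence and that padding positions carry the id `0` in the gold labels. Both the function name `token_accuracy` and the `pad_id` convention are assumptions, not part of the existing code.

```python
def token_accuracy(predicted, gold, pad_id=0):
    """Fraction of non-padding positions where the predicted tag id equals the gold tag id."""
    correct = 0
    total = 0
    for pred_seq, gold_seq in zip(predicted, gold):
        for p, g in zip(pred_seq, gold_seq):
            if g == pad_id:
                continue  # skip padding, it would inflate the score
            total += 1
            if p == g:
                correct += 1
    return correct / total if total else 0.0


# Two padded sequences: 5 real positions, 3 of them predicted correctly
pred = [[3, 5, 7, 0], [2, 2, 0, 0]]
gold = [[3, 5, 1, 0], [2, 4, 0, 0]]
acc = token_accuracy(pred, gold)
```

The same function can be called once with the upos outputs and once with the deprel outputs, so each task gets its own accuracy instead of a single mixed number.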

Later I need to work on:

* Training on supervised tasks where the input is noisy (something like the denoising autoencoders used for Language Modeling)
* Language Model
* language encoding (having the list of languages and making it as a new input to the network for the output language selection)
* input language detection
* being able to add more context
* ... Many Many more things...
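For the noisy-input item in the list above, one very simple form of noise is dropping characters at random before encoding the text. The sketch below assumes noise is applied to the raw string; `drop_chars` and the 10% rate are hypothetical, and in a real supervised setup the per-character labels would have to be filtered in the same way to stay aligned.

```python
import random


def drop_chars(text, drop_prob, rng):
    """Return text with each character independently removed with probability drop_prob."""
    return "".join(ch for ch in text if rng.random() >= drop_prob)


rng = random.Random(0)  # fixed seed so the corruption is reproducible
clean = "The quick brown fox"
noisy = drop_chars(clean, 0.1, rng)
```

Other corruptions (character swaps, substitutions, case changes) could be added in the same style and mixed per batch.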


Training seems unstable... the loss oscillates around a (bad, really bad) asymptote; in one case the test set loss even seems to stabilize at a value that is bigger than the original one.