# Training Tests for Part-of-Speech Tagging

This notebook is a first pass at working with the pre-processed PoS dataset and the column networks that I'm creating.

The network will be built from small parts: each part is trained on top of the previous one, adding a new column and decoder.


In [1]:
import numpy as np
from langmodels.models import *
import langmodels.utf8codec as utf8codec
import torch.nn.functional as F
import torch.nn as nn
import torch


Loading faiss with AVX2 support.
Loading faiss.


Define helpers to count parameters, then load the embeddings.

In [2]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())


def count_trainable_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
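
As a quick sanity check of the two helpers, here is a minimal sketch on a toy model (not the network used in this notebook). The embedding table is frozen with `freeze=True`, so it counts toward the total but not toward the trainable parameters. The layer sizes are arbitrary assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


def count_parameters(model):
    return sum(p.numel() for p in model.parameters())


def count_trainable_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Toy model (hypothetical sizes): a frozen 10x4 embedding plus a 4->3 linear layer.
toy = nn.Sequential(
    nn.Embedding.from_pretrained(torch.zeros(10, 4), freeze=True),
    nn.Linear(4, 3),
)

total = count_parameters(toy)                # 10*4 + (4*3 + 3)
trainable = count_trainable_parameters(toy)  # 4*3 + 3
print(total, trainable)
```

The difference between the two numbers is exactly the size of the frozen table.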


In [3]:
# load the codebook and all the dictionaries mapping the data
# utf8codes, txt2code, code2txt, txt2num, num2txt = utf8codec._load_codebook()
utf8codes = np.load("./utf8-codes/utf8_codebook_overfit_matrix_2seg_dim64.npy")

In [4]:
utf8codes = utf8codes.reshape(1987, 64)  # one 64-dimensional row per code
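
A matrix of this shape can serve as a fixed embedding table, where each integer code selects one row. The sketch below uses a random stand-in with the same shape, since the real `.npy` file is not assumed to be available. The input codes are made up, and treating 0 as padding is an assumption.

```python
import numpy as np

# Stand-in for the real codebook (assumption: random values, same shape).
rng = np.random.default_rng(0)
codebook = rng.standard_normal((1987, 64)).astype(np.float32)

# Two padded sequences of integer codes (made-up values, 0 assumed to be padding).
codes = np.array([[113, 117, 111, 0],
                  [77, 111, 110, 0]], dtype=np.int64)

# Integer-array indexing looks up one 64-dimensional row per code.
embedded = codebook[codes]                 # (batch, seq_len, 64)

# Channels-first layout, which 1D convolution layers usually expect.
embedded_cf = embedded.transpose(0, 2, 1)  # (batch, 64, seq_len)
print(embedded.shape, embedded_cf.shape)
```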

In [5]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [6]:
net = Conv1DPoS(utf8codes)
net = net.to(device)

In [12]:
count_parameters(net)

2161960

The original network had 11266152 parameters; I have cut the number of features and dimensions to make it smaller.

With nlayers = 5 of dim 5, the count is 6912424, of which 6846888 are trainable.

For the following Conv1DPartOfSpeech configuration, the number of parameters is 2161960, of which 2096424 are trainable:

    nchannels_in=[64, 128, 256, 512, 256],
    nchannels_out=[128, 256, 512, 256, 96],
    kernels=[3, 3, 3, 3, 3],
    nlayers=[6, 6, 4, 4, 3],
    groups=[1, 4, 8, 4, 1],
    
And LinearUposDeprelDecoder params are:

    lin_in_dim=96, 
    lin_hidd_dim=768,
    upos_dim=18, 
    deprel_dim=278,

In [8]:
count_trainable_parameters(net)

2096424
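
The gap between the two counts is the number of frozen parameters. A small arithmetic check, using only the figures reported in this notebook:

```python
# Counts reported in this notebook for the current configuration.
total_params = 2161960
trainable_params = 2096424

frozen_params = total_params - trainable_params
print(frozen_params)                 # 65536
print(frozen_params / total_params)  # fraction of parameters that are frozen

# The larger nlayers = 5 configuration reported earlier shows the same gap.
assert 6912424 - 6846888 == frozen_params
```

The gap is identical in both configurations, which suggests that the frozen part does not depend on the column sizes.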

The datasets are the heavy part, so I'll just load them and check what happens.

In [9]:
dataset_train = "/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.4/traindev_np_batches_779000x3x1024_uint16.npy"

In [14]:
data_train = np.load(dataset_train)

In [15]:
data_train.shape

(779000, 3, 1024)

In [16]:
data_train.dtype

dtype('uint16')
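
An array of this shape and dtype takes roughly 4.46 GiB of memory. If that turns out to be too heavy, `np.load` accepts `mmap_mode="r"`, which maps the file instead of reading it in full. The sketch below shows the size arithmetic and demonstrates memory mapping on a small temporary file, a hypothetical stand-in for the real dataset.

```python
import os
import tempfile

import numpy as np

# Size of the full training array: 779000 x 3 x 1024 uint16 values.
n_bytes = 779000 * 3 * 1024 * np.dtype("uint16").itemsize
print(n_bytes, round(n_bytes / 2**30, 2))

# Small stand-in file with the same layout (assumption: only 8 samples).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy_batches.npy")
np.save(path, np.zeros((8, 3, 1024), dtype=np.uint16))

# mmap_mode="r" maps the file read-only; slices are read from disk on access.
data = np.load(path, mmap_mode="r")
first_txt = np.asarray(data[:4, 0, :])  # materialize only the slice we need
print(type(data).__name__, first_txt.shape)
```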

In [17]:
dta_train_txt = data_train[:,0,:]
dta_train_upos = data_train[:,1,:]
dta_train_deprel = data_train[:,2,:]
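
Since the three channels share the first axis, a batch of inputs and its targets can be taken with the same slice. Here is a minimal sketch of a batch iterator, using a small random array in place of the real data. `iterate_batches` is a hypothetical helper, not part of `langmodels`.

```python
import numpy as np


def iterate_batches(data, batch_size):
    """Yield (txt, upos, deprel) int64 arrays for consecutive batches."""
    for start in range(0, data.shape[0], batch_size):
        # Convert only the current batch, not the whole array.
        batch = data[start:start + batch_size].astype("int64")
        yield batch[:, 0, :], batch[:, 1, :], batch[:, 2, :]


# Stand-in data (assumption): 120 samples, 3 channels, sequence length 1024.
rng = np.random.default_rng(0)
toy = rng.integers(0, 18, size=(120, 3, 1024)).astype(np.uint16)

batches = list(iterate_batches(toy, batch_size=50))
print(len(batches), batches[0][0].shape, batches[-1][0].shape)
```

Each yielded array can then be passed to `torch.from_numpy(...)` and moved to the device.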

In [18]:
x = torch.from_numpy(dta_train_txt[:50].astype("int64")).to(device)

In [19]:
x

tensor([[113, 117, 111,  ...,   0,   0,   0],
        [ 77, 111, 110,  ...,   0,   0,   0],
        [ 65, 117, 223,  ...,   0,   0,   0],
        ...,
        [ 68,  97, 103,  ...,   0,   0,   0],
        [ 67, 111, 109,  ...,   0,   0,   0],
        [113, 117, 105,  ...,   0,   0,   0]], device='cuda:0')

In [20]:

txtcode, positions, latent, dec = net(x)
last_latent = latent[-1]
upos, deprel = dec

In [21]:
txtcode.shape, positions.shape, last_latent.shape, upos.shape, deprel.shape

(torch.Size([50, 64, 1024]),
 torch.Size([1, 1024]),
 torch.Size([50, 96, 1024]),
 torch.Size([50, 1024, 18]),
 torch.Size([50, 1024, 278]))

In [23]:
out = torch.cat([upos,deprel], dim=2)

In [24]:
out.shape

torch.Size([50, 1024, 296])
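
The concatenated tensor can be split back into its two heads along the last axis, and the predicted tag at each position is the argmax over each head. The following is a NumPy sketch with random values standing in for the network output; in PyTorch, `torch.split(out, [18, 278], dim=2)` performs the same split.

```python
import numpy as np

# Random stand-in for the concatenated output (assumption: batch of 2).
rng = np.random.default_rng(0)
out = rng.standard_normal((2, 1024, 296))

# Split at index 18: the first 18 channels are UPOS, the remaining 278 are DEPREL.
upos, deprel = np.split(out, [18], axis=2)

# Predicted class index at each sequence position.
upos_pred = upos.argmax(axis=2)
deprel_pred = deprel.argmax(axis=2)
print(upos.shape, deprel.shape, upos_pred.shape)
```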