# Training Preparation

This document states the goals for the training.

## Goals
For the current work, the training focuses on learning language representations that are:
* multi-lingual
* multi-task
* Text-to-Text (similar to T5 from Google)

With the goal of:

* [Few-Shot|Zero-Shot] learning of new tasks
* Zero-Shot translation for language pairs
* Being able to add new knowledge without catastrophic forgetting

The particularity of the trained models is that they are intended to recognize and solve the given tasks solely from the description in the input. The models should perform well enough without any retraining or fine-tuning. Any retraining or fine-tuning should also be possible without hindering (much?) the previously trained tasks.

Another important point is that all this work focuses on keeping the input as RAW as possible (no text normalization, separation, etc.) and having NO Out-of-Vocabulary (OOV) input, so all languages should be feasible as input.


## Model Outputs

In particular, the model should have multiple outputs, namely:

1. Origin Language (language detection) - Language Name (would be less confusing and more informative than code)
~~2. Destination Language (the original one if none given) - Language Name~~
~~3. Task Language (The language in which the task was described) - Language CODE (2 or 3 character code)~~
4. The task at hand
5. **The TARGET output** of the task for the current input

~~5. PoS UPOS tagging, check if needed or useful?~~


## Training

The current work follows the example of [ERNIE 2.0](https://arxiv.org/abs/1907.12412) in that it uses supervised training to improve performance. That paper shows that supervised tasks allow better training than a Denoising LM target with unsupervised learning alone.
In this work we leverage many other tasks and meta-learning to take advantage of existing mono and multi-lingual training datasets.

The training will be done in the following stages:

1. A general pre-training with many datapoints on all the tasks
    - Evaluation on different validation sets
2. A few-shot meta-learning training based on the pre-trained weights
    - Evaluation on different validation sets
3. Add architectural changes to the network (parallel columns and memory) and train for tasks with the meta-learned 
    - Evaluation on different validation sets and tasks
4. Overfit new input or input that fails -> create an entry in the memory for it.
    - Evaluate
5. ... ?


The intuition behind all this is:
 - Having a strong statistical baseline with many tasks
 - Meta-learning to be able to quickly learn new tasks
 - Use the few-shot learning ability to add new knowledge to the memory, as in [Large Memory Layers with Product Keys](https://arxiv.org/abs/1907.05242)

##  Tasks

These are the tasks to prepare for the training. Many of them are intended solely for training, and there is no interest in a validation dataset for comparing them with other models, as they are not available in the literature. The current work does not intend to add new tasks or metrics.

The default task is to denoise and correct the input (if for example there are multiple spaces together, or diacritics missing, uppercases, etc).

Many of the tasks can be taken from standard text pre-processing tasks; the idea here is to use available weakly-supervised data with minimal transformations of the input.

The tasks can be separated into:

### [Un|Weakly]-supervised

* MLM (Masked Language Model) BERT|mBERT|BART|ERNIE training objective (TODO) 
* Denoising (see previous point instead)
* Capitalization
* to-Lowercase
* to-UPPERCASE
* Add diacritics
* Remove Diacritics
* character shuffling, character deletion, character addition (close in vocab to the ones in the sentence), character duplication, ...
~~* Text Normalization: [NFD|NFKC|...]~~ This is done before passing the text to the NN.
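The character-level corruptions listed above can be sketched with the standard library only. This is a minimal illustration, not the noise code used by the data loader: the function names, the probabilities and the fixed seed are all assumptions made for the example.

```python
import random
import unicodedata


def remove_diacritics(text):
    """Strip combining marks after NFD decomposition (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def delete_chars(text, prob, rng):
    """Drop each character independently with probability `prob`."""
    return "".join(ch for ch in text if rng.random() >= prob)


def duplicate_chars(text, prob, rng):
    """Repeat each character once with probability `prob`."""
    return "".join(ch * 2 if rng.random() < prob else ch for ch in text)


def shuffle_window(text, start, length, rng):
    """Shuffle the characters inside text[start:start + length]."""
    window = list(text[start:start + length])
    rng.shuffle(window)
    return text[:start] + "".join(window) + text[start + length:]


rng = random.Random(0)  # fixed seed (assumption) so the corruption is reproducible
clean = "Perú logró dos goles"
print(remove_diacritics(clean))
print(delete_chars(clean, 0.1, rng))
print(duplicate_chars(clean, 0.1, rng))
print(shuffle_window(clean, 5, 5, rng))
```

Each function takes the clean string and returns the corrupted one, so the (corrupted, clean) pair can be used directly as an (input, target) example for the denoising task.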


### Supervised

* [GLUE](https://gluebenchmark.com/)
* [SuperGLUE](https://super.gluebenchmark.com/); [Paper](https://w4ngatang.github.io/static/papers/superglue.pdf) 
* Grammar tagging and recognition (PoS, NER, Dependency detection, ...) [UD-Treebank v2.5](https://universaldependencies.org/)
* Translation [WikiMatrix](https://ai.facebook.com/blog/wikimatrix/); [Paper](https://arxiv.org/abs/1907.05791); [Github](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix)
* Language Detection (the goal would be to have this as an extra output)

#### TODO but easy to make during data preparation:

* Add accents to ASCII strings and remove accents (for French, Spanish and similar languages): see this [answer on Stack Overflow](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) and [open science](https://www.science-emergence.com/Articles/How-to-remove-string-accents-using-python-3/), or use [Gensim](https://radimrehurek.com/gensim/utils.html#gensim.utils.deaccent). This should also cover Remove Diacritics.
* Fill the placeholder (for example eliminate only one word and make it fill it ...?)

#### TODO - prepare and do the following too:

* Tatoeba: WikiMatrix is nice, but this one has different kinds of phrases (questions, answers and some other things)
* [EuroParliament](http://www.statmt.org/europarl/)
* [Wikipedia Translation Dataset](http://opus.nlpl.eu/Wikipedia.php); [WikiExtractor](https://github.com/tatuylonen/wiktextract)
* [ConceptNET](http://conceptnet.io/); [Github](https://github.com/commonsense/conceptnet5/wiki) 
* [Open Multilingual WordNet](http://compling.hss.ntu.edu.sg/omw/) and [Global WordNet Association](http://globalwordnet.org/resources/wordnets-in-the-world/)


* Conjugate verbs for different languages -> do as with the language textbooks !!!
* Give the dictionary definition of word ...


While the starting datasets and tasks are given here, there are many others like multi-hop search and question answering that I would love to add later.

## Input Corruption techniques

### BART pre-training details - Noise Generation Techniques:

* Token Masking
* Token Deletion
* Text Infilling
~~* Sentence Permutation~~ Can't use in the current setting, need more dev for this
~~* Document Rotation~~ Can't use in the current setting, need more dev for this
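A rough sketch of the three BART-style corruptions that remain usable here. In the BART paper the infilling span lengths are sampled from a Poisson distribution; in this sketch the span is passed in explicitly to keep the example short. The `MASK` symbol and the function names are placeholders, not the tokens or helpers used in this project.

```python
import random

MASK = "<mask>"  # placeholder symbol; the real mask token is an assumption


def token_masking(tokens, prob, rng):
    """Replace each token by MASK independently with probability `prob`."""
    return [MASK if rng.random() < prob else t for t in tokens]


def token_deletion(tokens, prob, rng):
    """Remove each token independently with probability `prob`."""
    return [t for t in tokens if rng.random() >= prob]


def text_infilling(tokens, start, span_len):
    """Replace tokens[start:start + span_len] by a single MASK.

    A span of length 0 simply inserts a MASK at `start`.
    """
    return tokens[:start] + [MASK] + tokens[start + span_len:]


rng = random.Random(0)  # fixed seed (assumption) for reproducibility
tokens = list("kooperiert")  # character-level tokens, as in this work
print(token_masking(tokens, 0.2, rng))
print(token_deletion(tokens, 0.2, rng))
print(text_infilling(tokens, 2, 3))
```

With masking the model knows where something is missing, with deletion it also has to find the position, and with infilling it has to guess how many tokens the single mask stands for.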

### Other Techniques
* Explained 

### Add a noise layer every N layers in the NN

TODO, this is a technique that should be tested too, this might be enough for another paper

There is one on that subject: [Adaptive Noise Injection: A Structure-Expanding Regularization for RNN](https://arxiv.org/abs/1907.10885)

## Neural Networks to test

The Neural Networks to test will be the following:

* Vanilla CNN (autoencoder like, non seq2seq but the input length can change if fully convolutional), this is for easy baseline
* ~~Vanilla LSTM ?? -> might be slow/heavy to train~~
* Vanilla Transformer Encoder only (non seq2seq)
* Vanilla Transformer Decoder only (non seq2seq??)
* Transformer (Encoder-Decoder Seq2Seq architecture - limit the number of parameters here, to study)
* [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451)
* Hybrid Seq2Seq (description below - goal to cut as many params as possible and ):
    - CNN + Light Dynamic Convolutions at input level  (if possible compress the time dimension)
    - Encoder [Transformer|Reformer] layers With Large Memory 
    - Decoder [Transformer|Reformer] Layers with Large Memory
    - Deconv CNN +  Light Dynamic Convolution (to decompress the input time dimension into tokens)


### Modifications to network and training

All networks should also test the following modifications (though I am inclined to do it only in the hybrid architecture due to the expense):

* Use REPTILE (better for resource usage than MAML) for meta-training
* Use Plastic (hebbian?) + Modulated Plastic meta-learning
* Use NeuralDB
* Use Overfitting for new tasks and for examples where there are errors, overfit in a new memory position and later do attention over the memory outputs
* The encoder just as a circular convolution of the inputs instead of making some complex NN ??
* Use [The Evolved Transformer](https://arxiv.org/abs/1901.11117) instead of the base Transformer (need to be implemented in pytorch)
* Use Relative Positional Encoding and keeping past activations from TransformerXL
* Use Compressive Transformers' ideas
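To make the REPTILE item concrete, here is a toy sketch of its update rule on a one-parameter regression problem: run a few SGD steps on one sampled task, then move the meta-weights a fraction of the way towards the adapted weights. Everything here (the scalar model, the task family, the learning rates) is an assumption chosen to keep the example tiny; it is not the training loop of this project.

```python
import random


def sgd_steps(w, task_slope, xs, lr, k):
    """Run k SGD steps on the task loss mean((w*x - task_slope*x)^2)."""
    for _ in range(k):
        grad = sum(2 * (w * x - task_slope * x) * x for x in xs) / len(xs)
        w -= lr * grad
    return w


def reptile(w, task_slopes, xs, inner_lr=0.1, meta_lr=0.5, k=5,
            iterations=200, seed=0):
    """Reptile outer loop: w <- w + meta_lr * (adapted_w - w)."""
    rng = random.Random(seed)
    for _ in range(iterations):
        slope = rng.choice(task_slopes)             # sample one task
        phi = sgd_steps(w, slope, xs, inner_lr, k)  # adapt to that task
        w += meta_lr * (phi - w)                    # interpolate towards it
    return w


xs = [-1.0, -0.5, 0.5, 1.0]
w_meta = reptile(0.0, [1.0, 2.0, 3.0], xs)
print("meta-learned weight:", w_meta)
```

Unlike MAML, the outer update needs no gradient through the inner loop, only the difference between the adapted and the initial weights, which is where the resource saving mentioned above comes from.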


After finding a couple of AMAZING papers from Google, the goal should be to work on:
* [Universal Transformers](https://arxiv.org/abs/1807.03819)
* [MEMO: A Deep Network for Flexible Combination of Episodic Memories](https://arxiv.org/abs/2001.10913)

And try to build the NeuralDB from these two (adding circular convolution compression and/or Reptile and plasticity)

## Evaluation Tasks and Metrics

These are the tasks that will be given importance at validation time and should be used for comparison with other SoTA models available in the literature.

* GLUE: standard by the site's validation set
* SuperGLUE: standard by the site's validation set
* Translation: [BLEU](https://en.wikipedia.org/wiki/BLEU), [F-Score](https://en.wikipedia.org/wiki/F1_score), [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) ... ?
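As a reminder of what the F-Score measures on text outputs, here is a small bag-of-tokens F1 between a predicted and a reference sequence. This is only one simple variant, written for illustration; the function name is made up and the official GLUE/SuperGLUE or translation scorers should be used for any reported number.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Bag-of-tokens F1: harmonic mean of token precision and recall."""
    pred_counts, ref_counts = Counter(predicted), Counter(reference)
    # Multiset intersection: each token counts at most as often as in both
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


pred = "Microsoft kooperieren mit unknown".split()
ref = "Microsoft kooperieren mit Verte".split()
print(token_f1(pred, ref))
```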


## Architecture Description and decisions


* As with [BART](https://arxiv.org/pdf/1910.13461.pdf) use a *Bidirectional Encoder* with an *Autoregressive Decoder* 

* The model will be a Seq2Seq based on the transformer architecture; the input and output sizes should be of dynamic dimension. The input dimension will, if possible, be compressed by the network (for example, from an original input of 1024 down to a dimension of 384 or 256).
* The dimensions will be divisible by 2, and mostly powers of 2

* The model will take as input a character-level representation of 128 dimensions based on the 3-segment UTF-8 text encoder developed previously by the same author. No OOV characters should exist for most languages and input types (only emojis and extended lost languages such as Egyptian will not be represented, since they are all encoded with the 4th UTF-8 segment; the last code point used will be U+FFFF). This makes it possible to encode most (if not all) of the languages in the datasets available for the current work. Using a 2-segment coding for UTF-8 would be much more memory- and processor-savvy, but would leave the CJK languages OOV, which we don't want.

* The decoding from vector to text will be done at character-level based on the [FAISS facebook's library](https://github.com/facebookresearch/faiss)
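The segment argument behind the 3-segment UTF-8 input can be checked directly: UTF-8 needs at most 3 bytes for every code point up to U+FFFF, while code points above that need 4. The snippet below only illustrates this property of UTF-8; it is not the encoder used in this work, and the helper names are made up for the example.

```python
def utf8_segments(ch):
    """Number of bytes (segments) UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


def fits_three_segments(text):
    """True if every character is at or below U+FFFF, i.e. needs <= 3 bytes."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# ASCII, Latin with diacritic, Cyrillic, CJK, Hangul, emoji, Egyptian hieroglyph
for ch in ["A", "é", "Я", "中", "한", "😀", "\U00013000"]:
    print(f"U+{ord(ch):04X} needs {utf8_segments(ch)} segment(s)")

print(fits_three_segments("日本語 and español"))
print(fits_three_segments("emoji 😀"))
```

A 2-segment limit would stop at U+07FF, which covers Latin, Greek, Cyrillic, Hebrew and Arabic but none of the CJK blocks.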

## Training Tools

* Development in PyTorch (easiest for research and the one the authors know best)
* Tensorboard with torch-tensorboard to check during training
* Mixed Precision training with [NVIDIA Apex](https://github.com/NVIDIA/apex)

## Training Techniques:

Most of the issues in training will be due to memory, so for this there are the following techniques to use:

* [Gradient Checkpointing](https://arxiv.org/abs/1604.06174); [Fitting larger networks into memory](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9) in [Pytorch](https://qywu.github.io/2019/05/22/explore-gradient-checkpointing.html)
* [Reformer - Google](https://arxiv.org/pdf/2001.04451.pdf)
* [Reversible Residual Network: Backpropagation Without Storing Activations](https://arxiv.org/abs/1707.04585) [some](https://github.com/tbung/pytorch-revnet) [sources](https://github.com/renmengye/revnet-public)


## Limitations and Specifications

The entire development, pre-processing and training is to be limited to the following software and HW:

    System Ubuntu 19.04
    Python 3.7
    PyTorch>=1.4


    CPU = 8 core intel i7700
    GPU_1 = RTX 2080 Ti - 11GB RAM
    GPU_2 = GTX 1080 - 8 GB RAM ( -1GB for the system that uses it)
    RAM = 64 GB
    Local Disk =  1TB NVMe
    Remote NFS mounted = RAID1 4TB (on Raspberry-PI 4+ 4GB) 


### Byte Pair Encoding

Although feasible with the proposed encoding, [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) compression will not be dealt with in the current work and is left as future work. The possibilities for BPE are:
 * Circular Convolution (in the given order) <- I would bet on this encoding type: in PyTorch this is `padding_mode='circular'`
 * Additive (although this one can be a problem)
 * Other ....
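For reference, this is what circular padding does to a 1-D convolution, written in plain Python so that the wrap-around indexing is visible. It is only a sketch of the operation itself (in the cross-correlation form used by deep learning libraries); how it would be used to compose BPE codes is left open above and is not shown here.

```python
def circular_conv1d(signal, kernel):
    """1-D cross-correlation where indices wrap around the signal ends."""
    n, k = len(signal), len(kernel)
    half = k // 2  # the kernel is centred on the output position
    return [
        sum(kernel[j] * signal[(i + j - half) % n] for j in range(k))
        for i in range(n)
    ]


signal = [1, 2, 3, 4]
print(circular_conv1d(signal, [1, 0, 0]))  # picks the left neighbour, wrapping
print(circular_conv1d(signal, [1, 1, 1]))  # sum over a wrapped 3-wide window
```

The output has the same length as the input, and the first and last positions see each other as neighbours instead of seeing zeros.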

In [1]:
import torch
from models import *
from trainer_helpers import *
from data_loader import *
# from prepare_data import *
from utils import *
from tools import *



In [2]:
BASE_PATH = '/home/leo/projects/Datasets/text/selected_monofile/partitions'

fpaths = get_all_files_recurse(BASE_PATH) 

train_files = [f for f in fpaths if 'train' in f]
dev_files = [f for f in fpaths if 'dev' in f]
valid_files = [f for f in fpaths if 'valid' in f]

train_glue_files = [f for f in train_files if 'glue-' in f]
dev_glue_files = [f for f in dev_files if 'glue-' in f]

train_all_files = [f for f in train_files if 'all' in f]
test_all_files = [f for f in dev_files if 'all' in f]

In [3]:
train_files = [ f for f in train_files if 'glue' not in f and 'all' not in f]
test_files = [ f for f in dev_files if 'glue' not in f and 'all' not in f]
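`get_all_files_recurse` comes from the project's own `utils` module and is not shown in this notebook. A hypothetical stand-in, plus a split that filters on the file name only (so a directory called `dev` or `train` somewhere in the path cannot leak into the wrong partition), could look like this:

```python
import os


def get_all_files_recurse_sketch(root):
    """Hypothetical stand-in: return every file path below `root`, sorted."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def split_partitions(paths):
    """Split paths into (train, dev) by looking at the file name only."""
    train = [p for p in paths if "train" in os.path.basename(p)]
    dev = [p for p in paths if "dev" in os.path.basename(p)]
    return train, dev


example = ["/data/pos_tasks-train.shuf.txt-00", "/data/pos_tasks-dev.shuf.txt-00"]
print(split_partitions(example))
```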

In [4]:
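codebook_name = "codes/adhoc-codebook-1871.pkl"
import pickle
codebook_path = '/home/leo/projects/mix_nlp/utf8/codes/adhoc-codebook-1871.pkl'
# The pickle stores a (codebook, char2int, int2char) tuple; the context manager
# closes the file handle once it has been read
with open(codebook_path, 'rb') as f:
    codebook, char2int, int2char = pickle.load(f)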

In [5]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = ConvModel(codebook)

In [6]:
count_parameters(model), count_trainable_parameters(model)

(21561546, 21381930)

In [7]:
device

device(type='cuda', index=0)

In [8]:
model = model.to(device)

In [9]:
# commented to avoid accidental training while 
# %%time
# main(model, train_files, test_files, codebook_path,
# #      batch_size=10, 
# #      batch_size=175, # with opt_level=O1 this is the max
#      batch_size=185, # this one works with opt_level=O2
# #     optimizer='FusedAdam',  # Adam goes down really fast but then starts giving losses as NaN
#      optimizer='FusedLAMB',  # Fused lamb decreases slowly but steady and goes to better loss than Adam. NaN after 21730 batches, 13h30m36s
# #     optimizer='FusedNovoGrad', # is definitely the slowest one at the beginning, stabilizes at the worst value
#      opt_level='O2',
#      add_str_noise_to_input=True,
#      test_period=-1,  # No tests, as I don't know why they are not called ... FIXME
# #      checkpoint_period=10,
#      checkpoint_period=200,
#      checkpoint_path="/media/nfs/mix_nlp/checkpoints"
#     )

First complete try with the Language Model (interrupted due to time constraints)

    Batch Id: 1 | Timestamp 2020-02-14T16:49:10.407032
    TEST Batch Id: 1 | Timestamp 2020-02-14T16:50:29.512620
    TEST Batch Id: 62 | Timestamp 2020-02-14T18:11:37.447121
    Batch Id: 1249 | Timestamp 2020-02-14T18:12:09.529383
    

In [10]:
# checkpoint = {
#                 'model': model.state_dict(),
#                 'optimizer': optimizer.state_dict(),
#                 'amp': amp.state_dict()
#             }
import os
chkp_path = "/media/nfs/mix_nlp/checkpoints"
# chkp = "amp-checkpoint_opt-O2_batch-800_loss-0.006_2020-02-18T10:52:05.255814.pt"
# chkp = "amp-checkpoint_opt-O2_batch-9300_loss-0.534_2020-02-19T00:11:20.645807.pt"
# chkp = "amp-checkpoint_opt-O2_batch-100_loss-0.006_2020-02-19T10:18:54.487049.pt"
chkp = "selected/amp-checkpoint_opt-O2_batch-11200_loss-0.544_2020-02-19T18:08:42.975215.pt"
chkp_fname = os.path.join(chkp_path, chkp)

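def load_checkpoint(clean_model, fname, optimizer=None, amp=None):
    """Restore model (and optionally optimizer and amp) state from a checkpoint file."""
    chkp = torch.load(fname)
    clean_model.load_state_dict(chkp['model'])
    # Compare against None explicitly instead of relying on object truthiness
    if optimizer is not None:
        optimizer.load_state_dict(chkp['optimizer'])
    if amp is not None:
        amp.load_state_dict(chkp['amp'])
    return chkp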

In [11]:
load_checkpoint(model, chkp_fname)

{'model': OrderedDict([('embeds.weight',
               tensor([[0., 0., 0.,  ..., 0., 0., 0.],
                       [0., 1., 0.,  ..., 1., 1., 1.],
                       [0., 0., 1.,  ..., 1., 1., 1.],
                       ...,
                       [0., 0., 1.,  ..., 0., 0., 0.],
                       [1., 0., 0.,  ..., 0., 0., 0.],
                       [0., 1., 0.,  ..., 0., 0., 0.]], device='cuda:0')),
              ('lin.0.bias',
               tensor([ 0.0978, -0.0657, -0.0116,  ...,  0.0131, -0.0188, -0.0254],
                      device='cuda:0')),
              ('lin.0.weight_g', tensor([[0.8159],
                       [0.8960],
                       [0.5674],
                       ...,
                       [0.4946],
                       [0.6289],
                       [0.3179]], device='cuda:0')),
              ('lin.0.weight_v',
               tensor([[-0.0048,  0.0940, -0.0135,  ..., -0.0754, -0.0672,  0.0368],
                       [ 0.0357,  0.0747, -0.

In [12]:
embeds = model.embeds

In [13]:
embweights = embeds.weight

In [14]:
embweights.shape

torch.Size([1871, 96])

In [15]:
embweights = embweights.detach().cpu().numpy()

In [16]:
embweights.shape

(1871, 96)

Something is WRONG during training: the Embedding layer, which I DON'T want to touch, is being trained even though its requires_grad is False.

There is a discussion [here](https://discuss.pytorch.org/t/why-is-it-when-i-call-require-grad-false-on-all-my-params-my-weights-in-the-network-would-still-update/22126/15) about it

In [17]:
embeds.weight.requires_grad

False

When preloading a trained model, FusedAdam is less stable and achieves worse performance than FusedLAMB.


In [18]:
chkp_path = "/media/nfs/mix_nlp/checkpoints"
# chkp_fname = os.path.join(chkp_path, "amp-checkpoint_opt-O2_loss-0.003_2020-02-17T13:12:18.881112.pt")
# chkp_fname = os.path.join(chkp_path, "amp-checkpoint_opt-O2_loss-0.004_2020-02-17T13:39:38.887943.pt")

In [19]:
# chkp = load_checkpoint(model, chkp_fname)

In [20]:
model.embeds.weight.requires_grad

False

Training without touching the embeddings gives worse results than training them.... this is crap, so I need to find some solution without touching the embeddings.

### Training Notes:

Starting the training with FusedAdam until it fails (NaN) accelerates training a lot.
Afterwards I reloaded these weights and used FusedLAMB; it starts better, getting to a better loss faster.

Nevertheless, adding the Language Modeling tasks doubles the training data volume (data augmentation), hence the training time doubles too.

The learning curves advance better for language determination. As for the language model training, it seems to reach a stable level after about 1K batches, though the language determination on LM tasks keeps improving slowly.

The general loss (task + language determination and NO Language Model) draws an S rotated by 90 degrees: first going down very fast (after Adam warmup), then going horizontal, and then back again, parallel to the original FusedLAMB training (the original one was WITHOUT the language model task).

I'll do this training, then a Dynamic Convolution column training with the same kind of output as the ConvModel (which also contains one transformer layer).

After that I'll do an encoder-decoder architecture fusing both pre-trained models and adding a few memory layers on top (one or two memory encoder layers and maybe 3-5 memory decoder layers) .... I still have to understand correctly where this is going and how to make it happen ....

It must also be noted that, as the training set is gigantic (for the current setup and network), the training datapoints are never repeated, so the loss of the train evaluation is basically equivalent to the dev evaluation.


Training with the Language Model as well as with the tasks seems to give a worse loss than training without it before getting to a NaN. Nevertheless, after this pre-training, restarting the task training without the Language Model alterations seems to drastically improve the loss. Later during training, though, it breaks again, jumping to a high loss and producing NaNs once more.

In [21]:
test_files

['/home/leo/projects/Datasets/text/selected_monofile/partitions/pos_tasks-dev.shuf.txt-04',
 '/home/leo/projects/Datasets/text/selected_monofile/partitions/pos_tasks-dev.shuf.txt-02',
 '/home/leo/projects/Datasets/text/selected_monofile/partitions/pos_tasks-dev.shuf.txt-00',
 '/home/leo/projects/Datasets/text/selected_monofile/partitions/pos_tasks-dev.shuf.txt-05',
 '/home/leo/projects/Datasets/text/selected_monofile/partitions/pos_tasks-dev.shuf.txt-03',
 '/home/leo/projects/Datasets/text/selected_monofile/partitions/pos_tasks-dev.shuf.txt-01',
 '/home/leo/projects/Datasets/text/selected_monofile/partitions/pos_tasks-dev.shuf.txt-06']

In [41]:
batch_size=10
num_workers=6
max_seq_len=512
add_noise_to_task=True
add_str_noise_to_input=True

test_dataset = Txt2TxtDataset(test_files, char2int, max_len=max_seq_len, add_noise_to_task=add_noise_to_task,
                              add_str_noise_to_input=add_str_noise_to_input)
test_data_loader = DataLoader(test_dataset, batch_size=batch_size,
                              pin_memory=True,
                              num_workers=num_workers, worker_init_fn=Txt2TxtDataset.worker_init_fn)

# chkp = load_checkpoint(model, chkp_fname)

In [42]:
datas = []
data_count = 0

for d in test_dataset:
    datas.append(d)
    data_count+=1
    if data_count > 10:
        break

In [43]:
datas

[(array([  2,  26,  68,  73, 101,  26,  97, 109,  26,  80, 114,  26, 106,
          69,  84,  26,  98, 101, 116, 101,  73, 108,  73, 103, 116, 101,
         110,  32, 108, 101, 103, 101, 110,  32,  87, 101, 114, 116,  32,
         100,  97, 114,  97,  26, 102,  32,  44,  32, 100, 100,  97, 115,
          83,  32, 105,  26,  32,  26,  68,  97, 116,  26, 110,  98,  97,
         110, 107,  32, 107, 101,  26, 110, 101,  32, 112, 111, 114, 110,
          26,  26,  26,  97,  70,  73, 115,  99,  72,  26,  26, 110,  32,
         105,  78,  72, 108,  26, 116, 101,  32,  69, 116,  78,  72,  97,
         108,  84, 101, 110,  32,  83, 111, 108,   3,   4,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,

In [44]:
%%time
batches = []
for l in test_data_loader:
    batches.append(l)
    if len(batches) > 10:
        break

CPU times: user 11.2 ms, sys: 148 ms, total: 159 ms
Wall time: 350 ms


In [45]:
b0 = batches[0]

In [46]:
len(b0)

5

In [47]:
model.eval()

In [48]:
%%time
results = []
for batch_data in batches:
    batch = []
    for d in batch_data:
        batch.append(d.to(device))
    batch_data = batch
    noise_masked, noise_target, source, target, target_lang = batch_data
    with torch.no_grad():  # inference only: skip building the autograd graph
        msk_res, msk_lang_res = model(noise_masked)
        tsk_res, tsk_lang_res = model(source)
    
    msk_res = msk_res.cpu().detach().numpy()
    msk_lang_res = msk_lang_res.cpu().detach().numpy()
    tsk_res = tsk_res.cpu().detach().numpy()
    tsk_lang_res = tsk_lang_res.cpu().detach().numpy()
    
    del(noise_masked)
    del(noise_target)
    del(source)
    del(target)
    del(target_lang)
        
    res = (msk_res, msk_lang_res, tsk_res, tsk_lang_res)
    results.append(res)
    torch.cuda.empty_cache()
    


CPU times: user 419 ms, sys: 520 ms, total: 939 ms
Wall time: 940 ms


In [49]:
# batch_count

In [50]:
len(results)

11

In [51]:
# r0 = results[0]

In [52]:
# r0[1].shape

In [53]:
# lmlangr0 = r0[1][0,:,:].reshape(60,1871)

In [54]:
import numpy as np
# amax = np.argmax(lmlangr0, 0)

In [55]:
def batch2txt(batch):
    all_txt = []
    for i in range(batch.shape[0]):
        txt = code2str(batch[i,:].reshape(-1), int2char)
        all_txt.append(txt)
    return all_txt

def result_batch2txt(batch):
    all_txt = []
#     print(batch.shape)
    idxs = np.argmax(batch, axis=-1)
    for i in range(idxs.shape[0]):
        idx = idxs[i,:]
#         print(idx.shape)
        txt = code2str(idx, int2char)
        all_txt.append(txt)
    return all_txt
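The decoding step in `result_batch2txt` is a plain argmax over the last axis: for every position take the index with the highest score and map it back to a character. The same idea without NumPy, on a made-up 3-symbol vocabulary (`code2str` is a project helper that is not shown here, so the mapping below is only a stand-in for it):

```python
def argmax_decode(scores, int2char):
    """Pick the highest-scoring symbol index at each position and map it to a char."""
    chars = []
    for position_scores in scores:
        best = max(range(len(position_scores)), key=position_scores.__getitem__)
        chars.append(int2char[best])
    return "".join(chars)


toy_int2char = {0: "◌", 1: "a", 2: "b"}  # hypothetical vocabulary, 0 = padding
toy_scores = [[0.1, 0.7, 0.2], [0.0, 0.3, 0.7], [0.9, 0.05, 0.05]]
decoded = argmax_decode(toy_scores, toy_int2char)
print(decoded.replace("◌", ""))  # drop the padding symbol, as done further down
```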

In [56]:
b00 = batches[0][0][0,:].reshape(-1)


In [57]:
code2str(b00.cpu().detach().numpy(), int2char)

'◂Di ▒m projeKT BE▒▒ilIgTen▒lEgeГnW▒rt darauf ▒▒das SiH▒eD aTenbanK ▒ne▒ pOrnoGraafi▒CHhen ▒nhalTe eNtὺALtEN  sOL▸▶◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌'

In [58]:
txt_compare = []

for batch_data, batch_results in zip(batches, results):
    txt = []
    for d in batch_data:
        txt.append(batch2txt(d.cpu().detach().numpy()))
    for d in batch_results:
        txt.append(result_batch2txt(d))
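    res = zip(*txt)
    # NOTE: plain assignment overwrites txt_compare on every iteration, so only
    # the last batch survives; txt_compare.extend(res) would keep all batches
    txt_compare = res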
    


In [59]:
orig = []
for batch_data in batches:
    txt = []
    for d in batch_data:
        d = d.cpu().detach().numpy()
        txt.append(batch2txt(d))
    # NOTE: as above, only the last batch is kept in `orig`
    orig = zip(*txt)

In [60]:
len(batches)

11

In [61]:
txt_compare = list(txt_compare)

In [62]:
# clean null characters
txt_nonull = [t.replace("◌","") for b in txt_compare for t in b]


In [63]:
txt_nonull[0:9]

['◂mi▒rsoO▒t Koop▒rieRt miț VErT▸▶',
 '◂Microsoft kooperieren mit unknown▸▶',
 '◀Lemmatization◂MicrsOft Kooperiert  mit VERTe▸▶',
 '◂Microsoft kooperieren mit unknown▸▶',
 '◂German▸',
 '◀Universal Dependencies OPOS Tagging◂◂iiisoott koopprrert  m  ertt▸▶',
 '◂Latie▸h',
 '◀Universal Depen◂encies ZPOS Taggiii◂fe kotizitttn  imirrtt kooppret   t  ▸',
 '◂Lorin▸']
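Judging only from the strings printed above, the task description seems to sit between `◀` and `◂`, and the text payload between `◂` and `▸`. That layout is an inference from this output, not a documented format, and `parse_sample` is a made-up helper; under that assumption, splitting a decoded sample could look like this:

```python
def parse_sample(text):
    """Split '◀task◂payload▸▶' into (task, payload).

    `task` is None when the string has no task prefix; `payload` is None
    when the payload markers are missing.
    """
    task = None
    if text.startswith("◀") and "◂" in text:
        task, rest = text[1:].split("◂", 1)
        text = "◂" + rest
    start = text.find("◂")
    end = text.rfind("▸")
    if start == -1 or end == -1 or end < start:
        return task, None
    return task, text[start + 1:end]


print(parse_sample("◀Lemmatization◂MicrsOft Kooperiert  mit VERTe▸▶"))
print(parse_sample("◂German▸"))
```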

In [64]:
# txt_nonull[18]
txt_nonull[9:38]

['◂SIn embaRo▒, Per úEn ap▒NA▒ DoS Minuto▒ ▒ogrÓ S▒Nndos gol▒s  Que d▒jaBA sE▒TenCiada▒RáCtc▒🏽menTe ▒▒eliMinatorI▒a Pese▒a loS esf▒▒rzos de los▒hO▒dUREñosׅPAra remOntAR AlGo ▒uE▒ya▒▒e veíA pr▒ctiCa▒EnTe▒ ▒MPOs▸▶',
 '◂sin embargo , Perú en apenas dos minuto lograr sendos gol que dejar sentenciado prácticamente el eliminatoria pese a el esfuerzo de el hondureño para remontar algo que ya él ver prácticamente imposible .▸▶',
 '◀Lemmatization◂Sin embaroO, Perú En apeNAS dos minutos logró Sendos goles  Que dejaBA sentenCiada ráCtciamente la eliminatorIa Pese a loS esfueRzos de los hondUREños para remontar AlGo QuE ya se veíA prácticamentee iMPosiB▸▶',
 '◂sin embargo , Perú en apenas dos minuto lograr sendos gol que dejar sentenciado prácticamente el eliminatoria pese a el esfuerzo de el hondureño para remontar algo que ya él ver prácticamente imposible .▸▶',
 '◂Spanish▸',
 '◀Universal Dependencies OnOS Tagging◂g◂n   eaaoo,,ppe     appna  ooo   nnnoo  ogg     nnds gglls  queeddjjaa ssteeeeiia

In [65]:
# code2str(lmlang, int2char)

I made a mistake in the training data generation ... it is already fixed, so inference should do better now (after the new training ... of an entire day)

First results of training with LM + Tasks on the ConvModel: it does NOT work well. Language detection is good, but the tasks are not answered well; the network seems to try to do text cleanup instead of doing the task correctly. Now I'm training a new 

In [66]:
# np.argmax?

Training with mixed precision always ends up in NaN later on, while training with float32 needs more memory (and time) and thus smaller batches, which leads to oscillation without really learning anything new beyond what was pre-trained before.....

I need to do gradient accumulation or loss accumulation over many mini-batches to create a virtual big batch and see if that makes the problem go away.
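The arithmetic behind the virtual big batch, on a toy one-parameter model in plain Python: summing the micro-batch gradients, each weighted by its share of the big batch, gives the same gradient as the big batch itself. In PyTorch this usually corresponds to calling `backward()` on each micro-batch loss divided by the number of accumulation steps and calling `optimizer.step()` only once per virtual batch. The model, data and function names below are assumptions for the illustration only.

```python
def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) over `batch` with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Accumulate micro-batch gradients, each scaled by its share of the data."""
    total = 0.0
    for i in range(0, len(data), micro_batch_size):
        micro = data[i:i + micro_batch_size]
        total += grad_mse(w, micro) * len(micro) / len(data)
    return total


data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]
full = grad_mse(0.0, data)                             # one big batch
accum = accumulated_grad(0.0, data, micro_batch_size=2)  # three micro-batches
print(full, accum)
```

Only one micro-batch has to fit in GPU memory at a time, so the effective batch size can grow without the float32 memory cost mentioned above; whether that removes the oscillation still has to be tested.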
