# Data Preparation

This notebook develops the data preparation for text-to-text learning on supervised datasets (like T5 from Google). It extends the T5 approach to more tasks and is developed with PyTorch.

The source code is open-sourced.

The processed text will be released when/if I get the resources to host it in the open (the data volume is large).



## Dataset preparation

One of the ideas of this process is to do less pre-processing and use the least pre-processed text possible. Uppercase, punctuation and other symbols carry information that some pre-processing loses. This might not be too problematic for English or some other languages, but it certainly is for German (and might be for others).

Because of this, many of the available pre-processed (tokenized) datasets are discarded and the data preparation is done from raw data (for example for the GLUE and SuperGLUE benchmarks).

Data preparation would be much faster with Scala on Spark than with Python, but for ease of portability and usage I'll be using Python. Also, the data preparation is a one-off; there is no need to re-process once it is done.

Even when working with Python, choosing the right libraries matters. For JSON we use [orjson](https://github.com/ijl/orjson). For CSV there seems to be a [faster library](https://github.com/juancarlospaco/faster-than-csv), but it has few users and little community around it, so we stay with the standard csv library, which is the fastest of the remaining options.
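
As a minimal sketch of that library choice, the snippet below reads tab-separated text with the standard csv module and serializes each row to one JSON line. The function name `tsv_to_json_lines` and the sample row are illustrative assumptions, not part of the project code; the fallback to the standard json module is only there so the sketch runs when orjson is not installed.

```python
import csv
import io

try:
    import orjson  # JSON library chosen in the text above

    def dumps(obj):
        return orjson.dumps(obj)  # orjson returns bytes
except ImportError:  # fallback so the sketch still runs without orjson
    import json

    def dumps(obj):
        return json.dumps(obj, ensure_ascii=False).encode("utf-8")


def tsv_to_json_lines(tsv_text):
    """Convert TSV text with a header row into a list of JSON-encoded byte strings."""
    # QUOTE_NONE leaves quote characters in the raw text untouched.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dumps(row) for row in reader]


sample = "sentence\tlabel\nDer Hund schläft.\t1\n"
for line in tsv_to_json_lines(sample):
    print(line.decode("utf-8"))
```

Both serializers are configured to keep non-ASCII characters as UTF-8 rather than escaping them, which matches the goal of keeping the text as close to raw as possible.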

### Text Task Description

In the original T5 paper the tasks are described in English and with a single representation, for example:
 
    Source String: "translate {}"
    Target String: "to {}"
 
In this work we add a few variations to this. The first variation is that the task will be described in multiple languages, to start with:

* English
* Spanish
* French
* German

TODO: The second change is that instead of a single description of the task there will be several, and one of them will be chosen at random.

Examples for language translation:
 
    " Cómo se dice: {} en {} ?"
    " Cómo se escribe: {} en {} ?"
    " Escribe: {} en {} ?"
    " Traducir: {} al {}."
    " Por favor traduce: {} al {}"
    " Traduce: {} al {}"

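The random choice of a task description could be sketched as follows. The table `TRANSLATION_TEMPLATES`, the function `describe_translation_task` and the English template are hypothetical names and values used only for illustration; only the Spanish templates come from the list above.

```python
import random

# Hypothetical template table; only the Spanish entries come from the list above.
TRANSLATION_TEMPLATES = {
    "es": [
        " Cómo se dice: {} en {} ?",
        " Cómo se escribe: {} en {} ?",
        " Traducir: {} al {}.",
        " Traduce: {} al {}",
    ],
    "en": [
        "translate {} to {}",  # illustrative English variant, not from the source
    ],
}


def describe_translation_task(text, target_lang_name, desc_lang, rng=random):
    """Pick one template at random for the description language and fill it in."""
    template = rng.choice(TRANSLATION_TEMPLATES[desc_lang])
    return template.format(text, target_lang_name)


rng = random.Random(0)  # seeded only to make the sketch reproducible
print(describe_translation_task("good morning", "español", "es", rng))
```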


## Datasets List to process/analyze

* ~~MUSE~~ Issue downloading data, only multilang dictionaries available
* GLUE
    - [CoLA](https://nyu-mll.github.io/CoLA/); [Neural Network Acceptability Judgments ](https://arxiv.org/abs/1805.12471); [Source Code](https://github.com/nyu-mll/CoLA-baselines)
    - [MNLI](https://www.nyu.edu/projects/bowman/multinli/); [Paper](https://arxiv.org/abs/1704.05426); [Baseline](https://github.com/nyu-mll/multiNLI/blob/master/README.md)
    - MRPC [Paper](https://pdfs.semanticscholar.org/13d7/cbe9035abbb0f243a5e63e19d9c01bcf69d8.pdf); [Original Dataset](https://www.microsoft.com/en-us/download/details.aspx?id=52398&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F607d14d9-20cd-47e3-85bc-a2f65cd28042%2F)
    - QNLI [Paper](https://www.nyu.edu/projects/bowman/glue.pdf) 
    - QQP
    - RTE
    - SNLI
    - SST-2
    - STS-B
    - WNLI
* [SuperGLUE](https://w4ngatang.github.io/static/papers/superglue.pdf) 
    - BoolQ
    - CB
    - COPA
    - MultiRC
    - ReCoRD
    - RTE
    - WiC
    - WSC
* [XNLI](https://github.com/facebookresearch/XNLI) <- this one is interesting
* UD-Treebank v2.5 <- this one is interesting
* [SWAG](http://rowanzellers.com/swag/); [Paper](https://arxiv.org/abs/1808.05326); [Source Code](https://github.com/rowanz/swagaf)
* [WikiMatrix](https://ai.facebook.com/blog/wikimatrix/); [Paper](https://arxiv.org/abs/1907.05791); [Github](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix)
* ~~[SETimes](http://nlp.ffzg.hr/resources/corpora/setimes/)~~ Not needed, WikiMatrix and UD-Treebank already provide many samples
* Tatoeba: WikiMatrix is nice, but this one has different kinds of phrases (questions, answers and some other things)
* [EuroParliament](http://www.statmt.org/europarl/)
* [Wikipedia Translation Dataset](http://opus.nlpl.eu/Wikipedia.php); [WikiExtractor](https://github.com/tatuylonen/wiktextract)
* [ConceptNET](http://conceptnet.io/); [Github](https://github.com/commonsense/conceptnet5/wiki) 
* [Open Multilingual WordNet](http://compling.hss.ntu.edu.sg/omw/) and [Global WordNet Association](http://globalwordnet.org/resources/wordnets-in-the-world/)


* [BabelNET](https://babelnet.org/) [Downloads](https://babelnet.org/download) seem proprietary ...
* [PanLex](https://panlex.org/)  Word-level translations for many (many) language pairs. [Downloads](https://panlex.org/source-list/) and [Vocabulary](https://vocab.panlex.org/)
* [ASJP Database](https://asjp.clld.org)
* Thesaurus [Some](https://old.datahub.io/dataset/open-data-thesaurus) [links](http://vocabulary.semantic-web.at/PoolParty/wiki/OpenData) [where](https://www.thesaurus.net/) [to](https://www.powerthesaurus.org/multilingual) find
* [bAbI](https://research.fb.com/downloads/babi/); [Code on Github](https://github.com/facebook/bAbI-tasks). Although it seems that there are [issues](https://www.reddit.com/r/MachineLearning/comments/3ohkt8/i_solved_facebooks_babi_and_found_lots_of_errors/) in the [dataset](http://jamesknighton.com/2015/babi/)
* [MALMO](https://www.microsoft.com/en-us/research/project/project-malmo/) Minecraft Artificial Intelligence; [Github](https://github.com/Microsoft/malmo)
* [FastText](https://fasttext.cc/docs/en/dataset.html)
* [DBPedia](https://wiki.dbpedia.org/develop/datasets)
* [W3C](https://www.w3.org/community/sentiment/wiki/Datasets)
* [Europarl](http://opus.nlpl.eu/Europarl.php)
* [Amazon Registry Open Data on AWS](https://registry.opendata.aws/)
* [Peter Jansen Cognitiveai.org Explanation Bank](http://cognitiveai.org/explanationbank/)
* [List of Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets#naturallanguage)
* [Emoji Database - Kaggle](https://www.kaggle.com/eliasdabbas/emoji-data-descriptions-codepoints)
* [Emoji Sentiment Data - Kaggle](https://www.kaggle.com/thomasseleck/emoji-sentiment-data)
* [EmojiNet - Kaggle](https://www.kaggle.com/rtatman/emojinet)
* [Twitter Emoji Prediction - Kaggle](https://www.kaggle.com/hariharasudhanas/twitter-emoji-prediction)
* [Sentiment Analysis multi-language - Kaggle](https://www.kaggle.com/weywenn/sentiment-analysis-multilanguage)
* [BigQuery public Dataset List](https://www.reddit.com/r/bigquery/wiki/datasets)

### Question Answering:

* XQuAD; [Paper](https://arxiv.org/abs/1910.11856) [Dataset](https://github.com/deepmind/xquad)
* XQA; [Paper](https://www.aclweb.org/anthology/P19-1227/)
* MLQA; [Paper](https://arxiv.org/abs/1910.07475)


### Many more datasets here:

* https://quantumstat.com/dataset/dataset.html

## Unsupervised Datasets

* Gutenberg
* [Wiktionary](https://dumps.wikimedia.org/enwiktionary/)
* Scholarpedia
* [Wikipedia](https://dumps.wikimedia.org/)
* ArXiv
* Wikitext-2
* Wikitext-103 

## Source Code (Programming) Datasets

* [Github data](https://medium.com/google-cloud/github-on-bigquery-analyze-all-the-code-b3576fd2b150); [Original Post](https://github.blog/2016-06-29-making-open-source-data-more-available/); [GitHub BigQuery](https://cloud.google.com/blog/products/gcp/github-on-bigquery-analyze-all-the-open-source-code);  [BigQuery Public Data](https://cloud.google.com/bigquery/public-data)
* [GHArchive - Github](https://www.gharchive.org/); [Analyzing Github repo](https://github.com/fhoffa/analyzing_github)

### CoLA




## MNLI - MultiNLI Dataset

More than one task is possible, as the dataset also contains the parse tree for each sentence, which is nice. So the output format of the JSON will be:

    {
        'input': "task: MNLI | Sentence 1: {} | Sentence 2: {}".format(sentence_1, sentence_2),
        'target': e['gold_label'],
        'input_sentence_1': "task: MNLI parse tree of: {}".format(sentence_1),
        'input_sentence_2': "task: MNLI parse tree of: {}".format(sentence_2),
        'parse_target_1': e['sentence1_parse'],
        'parse_target_2': e['sentence2_parse'],
    }
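
A minimal sketch of a function that builds this record from one parsed MNLI JSONL entry is shown below. The function name `mnli_to_txt2txt` is hypothetical, and the keys `sentence1` and `sentence2` are assumed to be the field names of the input entry; the output keys follow the format above.

```python
def mnli_to_txt2txt(e):
    """Build the text-to-text record for one MNLI entry (a dict parsed from a JSONL line)."""
    # Assumption: the raw sentences are stored under 'sentence1' and 'sentence2'.
    sentence_1, sentence_2 = e["sentence1"], e["sentence2"]
    return {
        "input": "task: MNLI | Sentence 1: {} | Sentence 2: {}".format(sentence_1, sentence_2),
        "target": e["gold_label"],
        "input_sentence_1": "task: MNLI parse tree of: {}".format(sentence_1),
        "input_sentence_2": "task: MNLI parse tree of: {}".format(sentence_2),
        "parse_target_1": e["sentence1_parse"],
        "parse_target_2": e["sentence2_parse"],
    }


example = {
    "sentence1": "The cat sleeps.",
    "sentence2": "An animal is resting.",
    "gold_label": "entailment",
    "sentence1_parse": "(ROOT (S (NP The cat) (VP sleeps)))",
    "sentence2_parse": "(ROOT (S (NP An animal) (VP is resting)))",
}
print(mnli_to_txt2txt(example)["input"])
```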

## MRPC 



This data consists of 5 columns:

    label: 0 Not equivalent, 1 semantically equivalent
    sentence 1 id
    sentence 2 id
    sentence 1 text
    sentence 2 text
    
    
    
Note that this dataset is already tokenized, meaning it is not the raw text. Nothing else will be done to the text.

## QNLI

The dataset download contains the following columns:

    index
    Question
    Sentence
    Label - [entailment|not_entailment]


## QQP

Columns in the dataset:

    id
    qid1
    qid2
    question1
    question2
    is_duplicate
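
A hedged sketch of how one QQP row could be turned into a text-to-text sample follows. The function name `qqp_row_to_txt2txt`, the output keys and the positive label text "Duplicates" are assumptions; the task name and the "Not duplicates" label mirror the decoded samples printed near the end of this notebook.

```python
def qqp_row_to_txt2txt(row):
    """Map one QQP row (dict keyed by the column names above) to a text-to-text sample."""
    # Assumption: is_duplicate is read from the TSV as the string "0" or "1".
    label = "Duplicates" if row["is_duplicate"] == "1" else "Not duplicates"
    return {
        "task": "QQP Duplication Detection",
        # The two questions are kept as raw text, one per line.
        "input": "{}\n{}".format(row["question1"], row["question2"]),
        "target": label,
    }


row = {
    "id": "0",
    "qid1": "1",
    "qid2": "2",
    "question1": "How do I learn the saxophone?",
    "question2": "How do you learn the saxophone quickly",
    "is_duplicate": "0",
}
print(qqp_row_to_txt2txt(row))
```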



In [1]:
from preprocess import process_glue, process_superglue, rename_files

In [2]:
# %time rename_files()

In [None]:
%time process_glue()

opening /home/leo/projects/Datasets/text/GLUE/MNLI/original/mnli_multinli_1.0_dev_mismatched.jsonl
opening /home/leo/projects/Datasets/text/GLUE/MRPC/mrpc_dev.tsv
opening /home/leo/projects/Datasets/text/GLUE/CoLA/cola_train.tsv
opening /home/leo/projects/Datasets/text/GLUE/MNLI/original/mnli_multinli_1.0_dev_matched.jsonl
opening /home/leo/projects/Datasets/text/GLUE/CoLA/cola_test.tsv
opening /home/leo/projects/Datasets/text/GLUE/MRPC/mrpc_test.tsv
saving to /home/leo/projects/Datasets/text/GLUE/CoLA/cola_test-txt2txt.json
opening /home/leo/projects/Datasets/text/GLUE/CoLA/cola_dev.tsv
opening /home/leo/projects/Datasets/text/GLUE/MNLI/original/mnli_multinli_1.0_train.jsonl
saving to /home/leo/projects/Datasets/text/GLUE/MRPC/mrpc_test-txt2txt.json
saving to /home/leo/projects/Datasets/text/GLUE/MRPC/mrpc_dev-txt2txt.json
saving to /home/leo/projects/Datasets/text/GLUE/CoLA/cola_dev-txt2txt.json
opening /home/leo/projects/Datasets/text/GLUE/MRPC/mrpc_dev_ids.tsv
opening /home/leo/pro

## SuperGLUE

In [None]:
%time process_superglue()

## SwagAF


## Universal Dependencies v2.5

In [None]:
# from preprocess_conllu import conllu_process
from preprocess_conllu import *

In [None]:
%%time
conllu_process()

In [None]:
all_wm = get_all_files_recurse("/media/nfs/Datasets/text/WikiMatrix/")

## WikiMatrix

File structure is:
 
    v1/*.gz - 65 GB
    v1/SMALL/*.gz - 4.6 GB
    
We can use all the big files for training and the small ones for validation. Checking the files shows that they contain different language pairs, so this can be used for zero-shot learning on translation pairs.



In [None]:
from utils import *
import pickle

In [None]:
WIKIMATRIX_BASEPATH = "/media/nfs/Datasets/text/WikiMatrix/v1"
# all_files = get_all_files_recurse(WIKIMATRIX_BASEPATH)

In [None]:
# all_files = [ f for f in all_files if 'txt2txt' in f]

In [None]:
# import os

# for f in all_files:
#     os.system('rm {}'.format(f))

In [None]:
from preprocess_wikimatrix import *

I'll first erase the data I'm sure I'll not be using. The original tar file is complete, so there is no issue with deleting individual gz files even if I need them later. This frees some space, and I can start checking the rest of the data to see if there is any encoding issue with the current codebook.



In [None]:
all_files = get_all_files_recurse(WIKIMATRIX_BASEPATH)
blacklist = [b + '-' for b in BLACKLIST_LANGS] + ['-' + b for b in BLACKLIST_LANGS]
to_remove = []

for f in all_files:
    for b in blacklist:
        if b in f:
            to_remove.append(f)
            break
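
Substring matching on `code-` and `-code` can also hit longer codes that merely start or end with a blacklisted code. A stricter alternative, sketched here under assumptions, parses the two language codes out of the file name and compares them exactly. `EXAMPLE_BLACKLIST`, `language_pair`, `is_blacklisted` and the file names are illustrative only; the real `BLACKLIST_LANGS` lives in the project code.

```python
import os

# Hypothetical blacklist for illustration; the real BLACKLIST_LANGS is defined in the project.
EXAMPLE_BLACKLIST = {"ar", "hi", "si"}


def language_pair(path):
    """Return the two language codes encoded in a WikiMatrix.<l1>-<l2>.tsv.gz file name."""
    name = os.path.basename(path)
    core = name.replace("WikiMatrix.", "").replace(".tsv.gz", "")
    return tuple(core.split("-"))


def is_blacklisted(path, blacklist):
    """True only if one of the two codes is exactly a blacklisted code."""
    return any(code in blacklist for code in language_pair(path))


files = [
    "/data/WikiMatrix.ar-en.tsv.gz",
    "/data/WikiMatrix.en-simple.tsv.gz",  # contains '-si' as a substring, but 'simple' != 'si'
    "/data/WikiMatrix.en-es.tsv.gz",
]
to_remove = [f for f in files if is_blacklisted(f, EXAMPLE_BLACKLIST)]
print(to_remove)
```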
            

In [None]:
'war' in BLACKLIST_LANGS

In [None]:
sorted(blacklist)

In [None]:
len(all_files), len(to_remove)

In [None]:
to_remove

In [None]:
import os
for f in to_remove:
    os.system("rm {}".format(f))

In [None]:
all_files = sorted(get_all_files_recurse(WIKIMATRIX_BASEPATH))

In [None]:
len(all_files)

This seems to be quite a big win for the pre-pre-processing.
Now I have to check the rest of the languages. To do this I could check 2 or 3 samples of each language instead of checking all the files. This would be faster, but it could miss characters from unrecognized languages in the input, so for the moment I'll process them all and check by hand later.
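
The trade-off between sampling and a full scan can be illustrated with a small sketch. The helper `unknown_chars` is hypothetical (it is not the project's `check_encoding_works`), and the demo file and character set are made up: it collects the characters of a gzipped UTF-8 file that are missing from a known set, optionally looking only at the first lines.

```python
import gzip
import os
import tempfile


def unknown_chars(path, known_chars, max_lines=None):
    """Return the characters of a gzipped UTF-8 text file that are not in known_chars."""
    missing = set()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if max_lines is not None and i >= max_lines:
                break  # sampling mode: stop after the first max_lines lines
            missing.update(set(line) - known_chars)
    return missing


# Tiny demo file: the unsupported character only appears on the second line.
known = set("abc\t\n")
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.tsv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("abc\tcab\nabd\n")
    sampled = unknown_chars(path, known, max_lines=1)
    full = unknown_chars(path, known)
print(sampled, full)
```

In this toy case the sampled check reports nothing while the full scan finds the extra character, which is the risk described above.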

In [None]:
# # get all remaining language codes:
# lang_codes = set([])
# for f in all_files:
#     codes = path_leaf(f).replace("WikiMatrix.","").replace(".tsv.gz","").split("-")
#     lang_codes.update(codes)

In [None]:
# codebook_path = 'codes/adhoc-codebook-'
# f = open(codebook_path, 'rb')
# codebook, char2int, int2char = pickle.load(f)

In [None]:
# all_files[0]

In [None]:
# all_files.remove('/media/nfs/Datasets/text/WikiMatrix/v1/SMALL/WikiMatrix.q*tsv.gz')
# all_files.index('/media/nfs/Datasets/text/WikiMatrix/v1/SMALL/WikiMatrix.pt-war.tsv.gz')

In [None]:
# checks = []

In [None]:
# %%time
# for f in all_files[844:]:
#     checks.append(check_encoding_works(char2int, f))

In [None]:
accepted = [c for c in checks if c[0]]

In [None]:
failed = [c for c in checks if c[0] is False]

In [None]:
len(accepted), len(failed)

In [None]:
failed = [f[1] for f in  failed]
accepted = [f[1] for f in  accepted]

In [None]:
accepted

In [None]:
failed

In [None]:
failedset = set([])
for f in failed:
    failedset.update(f)

In [None]:
accset = set([])
for a in accepted:
    accset.update(a)

In [None]:
failed = failedset.difference(accset)

In [None]:
failed

In [None]:
bl = set(['wuu', 'gom', 'lmo', 'mwl', 'ilo', 'ckb', "ar", "hi", "sh", "hu", "eo", "fo", "si",
                   "bn", "ml", "fa", "ne", "as", "azb", "ka", 'as', 'bn', 'fa', 'ka', 'ml', 'ne', 'si', 'zb',
                   # "sq", "he", maybe yes
                   # "hr", "br" ???
                   "ur", "id", "kk", "mr", "ta", "th", "hi", "zh", "ko", "tl", "vi", "te", "ja",'bp', 'ew', 'gu', 'pa', 'py'
                   ])

In [None]:
from preprocess_wikimatrix import *

In [None]:
%%time
wikimatrix_charset_process()

In [None]:
# def extract_charset(fname):
#     charset = set([])
#     with gzip.open(fname, 'rb') as f:
#         lines = f.readlines()
#         for txt in lines:
#             txt = txt.decode('utf-8')
#             charset.update(set(list(txt)))
#     saveto = fname.replace('.tsv.gz', '-charset.txt')
#     with gzip.open(saveto, 'wb') as f:
#         # print("saving to {}".format(saveto))
#         otxt = ''.join(list(charset)).encode('utf-8')
#         f.write(otxt)
#         f.flush()
#     return charset


In [None]:
# fail1 = '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.eu-tr-charset.txt'
# fail2 = '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.es-ro-charset.txt'
# fail3 = '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.ca-it-charset.txt'
# extract_charset(fail2)

In [None]:
# # obtain the entire charsets previously extracted into one file
# all_files = get_all_files_recurse(WIKIMATRIX_BASEPATH)
# all_files = [f for f in all_files if f.endswith(".txt")]

In [None]:
# len(all_files)

In [None]:
# %%time
# all_chars = set([])
# errors = []
# for fname in all_files:
#     with(open(fname, 'rb')) as f:
#         try:
# #             flines = f.readlines()
#             for line in f.readlines():  # flines:
#                 chars = list(line.decode('utf-8'))
#                 all_chars.update(chars)
#         except Exception as e:
#             errors.append(e)
#             print("error processing {} with e= {}".format(fname, e))

In [None]:
# len(errors)

In [None]:
# len(all_chars)

In [None]:
# sorted(list(all_chars))

In [None]:
%%time
wikimatrix_process()

In [None]:
# sum([1620, 1925])  # 1620 are the complete files, 1925 are the files in the SMALL dataset

In [None]:
# WIKIMATRIX_BASEPATH = "/media/nfs/Datasets/text/WikiMatrix/v1"

# allfiles = get_all_files_recurse(WIKIMATRIX_BASEPATH)

In [None]:
# t2t = [f for f in allfiles if 'txt2txt' in f]

In [None]:
# len(t2t)

3545 files processed and 3545 files existing, everything seems OK.

# Data preparation by length and task

This part checks some things that should work

In [None]:
from prepare_data import *

In [None]:
# tfile = '/home/leo/projects/Datasets/text/SuperGLUE/CB/val-txt2txt.json'
# fname = '/media/nfs/Datasets/text/WikiMatrix/v1/SMALL/WikiMatrix.ja-su-txt2txt.json.gz'

In [None]:
%%time
process()

In [None]:
# separate_by_strlen(fname)

In [None]:
%%time
prepare_select_all()

In [None]:
%%time
prepare_lm_data_wikimatrix()


In [None]:
# import gzip
# import orjson as json
# fname = '/home/leo/projects/Datasets/text/train_selected/WikiMatrix.arz-he-txt2txtmax-512.json.gz'

# f = gzip.open(fname, 'rb')

In [None]:
# flines = f.readlines()

In [None]:
# l0 = flines[0].decode('utf-8')
# import orjson as json

In [None]:
# l0 = json.loads(l0)

In [None]:
from pycountry import languages

In [None]:
languages.get(alpha_3='nds')

In [None]:
%%time
# OUTPUT_FNAME = '/home/leo/projects/Datasets/text/train_selected_monofile/monofile.txt'
# json2lines(ofile=OUTPUT_FNAME)

The output file of this has 87M samples (lines):

    $ wc -l monofile.txt 
    87235277 monofile.txt


In [None]:
# model_vocab_sizes = [32000, 64000, 96000, 128000]
# model_prefixes = ['all_34G_32k', 'all_34G_64k', 'all_34G_96k', 'all_34G_128k']
# model_types = ['unigram', 'bpe', 'word', 'char']
# input_sentence_size = [1e6, 1e7, 273332515]

# cmd = "spm_train --input={} --vocab_size={} --input_format=tsv --model_prefix={} --model_type={} --character_coverage=0.9995"
# cmd2 = "spm_train --input={} --input_sentence_size={} --vocab_size={} --input_format=tsv --model_prefix={} --model_type={} --character_coverage=0.9995 --shuffle_input_sentence"

# commands = []
# commands2 = []
# file = OUTPUT_FNAME
# for vs in model_vocab_sizes:
#     for t in model_types:
#         for pref in model_prefixes:
#             for ss in input_sentence_size:
#                 prefix = '-'.join((t,pref))
#                 c = cmd.format(file, vs, prefix,t )
#                 commands.append(c)
#                 c2 = cmd2.format(file, int(ss), vs, prefix,t )
#                 commands2.append(c2)


In [None]:
commands

SentencePiece

--input_sentence_size {} --vocab_size {} --input_format tsv --model_prefix {} --input {} --model_type {} --character_coverage=0.9995
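
A sketch of generating the training commands for a parameter grid is given below. It only reuses the `spm_train` flags that already appear in this notebook; the function name `build_spm_commands` and the prefix naming scheme are assumptions, chosen so that every combination writes to its own model file.

```python
from itertools import product

# Flags copied from the commands used in this notebook.
CMD_TEMPLATE = (
    "spm_train --input={input} --input_sentence_size={size} --vocab_size={vocab} "
    "--input_format=tsv --model_prefix={prefix} --model_type={mtype} "
    "--character_coverage=0.9995 --shuffle_input_sentence"
)


def build_spm_commands(input_path, vocab_sizes, model_types, sentence_sizes):
    """Return one command per (vocab size, model type, sentence sample size) combination."""
    commands = []
    for vocab, mtype, size in product(vocab_sizes, model_types, sentence_sizes):
        # Hypothetical naming scheme: the prefix encodes every parameter of the run.
        prefix = "{}-{}k-{}".format(mtype, vocab // 1000, int(size))
        commands.append(
            CMD_TEMPLATE.format(
                input=input_path, size=int(size), vocab=vocab, prefix=prefix, mtype=mtype
            )
        )
    return commands


commands = build_spm_commands("monofile.txt", [32000, 64000], ["unigram", "bpe"], [1e6, 1e7])
print(len(commands))
print(commands[0])
```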


BPEmb: Subword Embeddings in 275 Languages

BPEmb 

https://nlp.h-its.org/bpemb/
https://nlp.h-its.org/bpemb/multi/



In [None]:
import sentencepiece as spm

In [None]:
s = spm.SentencePieceProcessor()
# s.Load('/home/leo/projects/Datasets/text/sentencepiece/bpe-all_2G5_64k.model')
s.Load('/home/leo/projects/Datasets/text/sentencepiece/bpe-all_2G5_64k.model')

In [None]:
p = s.SampleEncodeAsPieces('New York', -1, 0.1)

In [None]:
s.EncodeAsPieces

In [None]:
s.SampleEncodeAsPieces?

In [None]:
for i in range(10):
    print(s.EncodeAsPieces('吾輩は猫である'), s.EncodeAsIds('吾輩は猫である'))
    print(s.EncodeAsPieces('New York'), s.EncodeAsIds('New York'))
    print(s.SampleEncodeAsPieces('New York', -1, 0.1))

In [None]:
s.SampleEncodeAsIds('New York', -1, 0.1)

In [None]:
s.DecodeIds([474, 13, 390, 776])

In [None]:
'U+2588', chr(0x2588)

In [None]:
'█'

In [None]:
from pycountry import languages

In [None]:
l = languages.get(alpha_2='es')

In [None]:
l.name

I don't like how SentencePiece encodes the text: it fails on some inputs, and I don't want issues with the single symbols of some languages.

For the moment I'll redo the entire decision about the encoding and the languages that we'll be able to represent. On one side this creates a problem, as I wanted something universally extendable, but on the other it simplifies many things and cuts the amount of data that I'll have to use. The languages to use will be mostly Western ones, based on the Latin, Greek and Cyrillic scripts.

Now sorting the datasets into train, dev and test (or train, test and validation, whatever names you prefer).

In [None]:
import os, sys

from utils import *


In [None]:
basedir = '/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5/'
all_ud_files = get_all_files_recurse(basedir)

In [None]:
t2t_ud_files = [f for f in all_ud_files if 'text2text' in f]
train = [f for f in t2t_ud_files if '-train-' in f]
test = [f for f in t2t_ud_files if '-test-' in f]
dev = [f for f in t2t_ud_files if '-dev-' in f]
to_delete = [f for f in all_ud_files if 'charseq' in f]

In [None]:
%%time

for f in train:
    os.system("mv {} /home/leo/projects/Datasets/text/train_selected".format(f))
    
for f in dev:
    os.system("mv {} /home/leo/projects/Datasets/text/dev_selected".format(f))
    
for f in test:
    os.system("mv {} /home/leo/projects/Datasets/text/validation_selected".format(f))
    
# for f in to_delete:
#     os.system("rm {}".format(f))
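
The same sorting can be sketched without shelling out to `mv`, using `shutil.move`. The helper names `split_destination` and `move_by_split` and the demo file names are hypothetical; the marker-to-folder mapping follows the cell above, where test files go to `validation_selected`.

```python
import os
import shutil
import tempfile

# Marker in the file name -> destination folder (mapping taken from the cell above).
SPLIT_DIRS = {
    "-train-": "train_selected",
    "-dev-": "dev_selected",
    "-test-": "validation_selected",
}


def split_destination(fname):
    """Return the destination folder name for a file name, or None if it has no split marker."""
    for marker, folder in SPLIT_DIRS.items():
        if marker in fname:
            return folder
    return None


def move_by_split(files, base_dir):
    """Move each file into base_dir/<split folder>; return a {source: destination} mapping."""
    moved = {}
    for f in files:
        folder = split_destination(os.path.basename(f))
        if folder is None:
            continue  # leave files without a split marker where they are
        dest_dir = os.path.join(base_dir, folder)
        os.makedirs(dest_dir, exist_ok=True)
        moved[f] = shutil.move(f, dest_dir)
    return moved


with tempfile.TemporaryDirectory() as base:
    src = os.path.join(base, "src")
    os.makedirs(src)
    names = ["es_gsd-ud-train-PoS-text2text.json", "es_gsd-ud-test-PoS-text2text.json", "notes.txt"]
    for n in names:
        open(os.path.join(src, n), "w").close()
    moved = move_by_split([os.path.join(src, n) for n in names], base)
    result = sorted(os.path.relpath(p, base) for p in moved.values())
print(result)
```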


In [None]:
all_wm_test_files = get_all_files_recurse(os.path.join(WIKIMATRIX_BASEPATH,'SMALL'))

In [None]:
all_wm_test_files =[ f for f in all_wm_test_files if f.endswith('.json.gz')]

In [None]:
# for f in all_wm_test_files:
#     os.system("cp {} /home/leo/projects/Datasets/text/dev_selected".format(f))

In [None]:
fnames = [path_leaf(f) for f in all_wm_test_files]

In [None]:
# %%time
# # clean from the previously copied files
# for f in fnames:
#     os.system("rm /home/leo/projects/Datasets/text/train_selected/{}".format(f.replace(".json.gz", "-langmodel.json.gz")))

In [None]:
# cleanup of the Universal Dependencies files that we can't encode due to chosen characters in the encoding settings
tfolder = "/home/leo/projects/Datasets/text/train_selected"
dfolder = "/home/leo/projects/Datasets/text/dev_selected/"
vfolder = "/home/leo/projects/Datasets/text/validation_selected/"

all_ud_files = get_all_files_recurse(tfolder) +  get_all_files_recurse(dfolder) + get_all_files_recurse(vfolder)

ud_to_remove = []
for f in all_ud_files:
    fname = path_leaf(f)
    for bl in BLACKLIST_LANGS:
        if fname.startswith(bl):
            ud_to_remove.append(f)
            break

In [None]:
len(all_ud_files), len(ud_to_remove)

In [None]:
for f in ud_to_remove:
    os.system("rm {}".format(f))

Some files in the Universal Dependencies are not good, so a manual check would be nice, but I only get to find some of them due to the nature of the checks, such as the following files:

    fr_ftb-ud-test-PoS-text2text-*
    en_esl-ud-test-PoS-text2text-*
    qhe_hiencs-ud-test-PoS-text2text-*

I also found some issues in the text of the JSON files (some old format), so I need to do a cleanup and redo all the UD treebank processing.

So there it goes


### Language Name length
Find the longest language name in the whole language list; this length will be the size of the tensor used for language detection in the models.

In [None]:
from pycountry import languages
langnames = [ l.name for l in list(languages)]  
max([len(l) for l in langnames])                                                                                                                      
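
To illustrate how a language name can fill a fixed-size field, here is a minimal sketch. The width of 60 is the size of the language tensor seen in the batch shapes printed later in this notebook; whether it equals the maximum computed above is not shown here. The function `encode_language_name`, the toy `char2int` and the padding id 0 are assumptions, and the start and end markers visible in the decoded samples are left out.

```python
LANG_FIELD_WIDTH = 60  # width of the language tensor in the batches printed later
PAD_ID = 0  # assumption: 0 is the padding id, as the zero-filled arrays suggest


def encode_language_name(name, char2int, width=LANG_FIELD_WIDTH, pad_id=PAD_ID):
    """Encode a language name as a fixed-length list of ids, padded on the right."""
    ids = [char2int[c] for c in name]
    if len(ids) > width:
        raise ValueError("language name longer than {} characters".format(width))
    return ids + [pad_id] * (width - len(ids))


# Toy codebook covering only the letters of the example, not the project codebook.
toy_char2int = {c: i + 1 for i, c in enumerate(sorted(set("English")))}
encoded = encode_language_name("English", toy_char2int)
print(len(encoded), encoded[:8])
```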


### String corruption and Masking

In [None]:
from constants import *
from data_loader import *
import numpy as np

In [None]:
txt = "El Ministerio chino de Asuntos Exteriores defendió hoy el resultado de las elecciones presidenciales celebradas en Perú y ofreció su apoyo al nuevo gobierno del presidente Alberto Fujimori."

In [None]:
''.join(add_str_noise(txt, dup_char_prob=0.01, del_char_prob=0.005)[0])
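
The idea behind character-level noise can be sketched as follows. This is not the project's `add_str_noise` (its implementation and return value are not shown in this notebook); `add_char_noise` is a hypothetical stand-in that only covers duplication and deletion with the same two probabilities.

```python
import random


def add_char_noise(text, dup_char_prob=0.01, del_char_prob=0.005, rng=None):
    """Randomly delete or duplicate characters; returns the noisy string."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        r = rng.random()
        if r < del_char_prob:
            continue  # delete this character
        out.append(ch)
        if r < del_char_prob + dup_char_prob:
            out.append(ch)  # duplicate this character
    return "".join(out)


sample = "El Ministerio chino de Asuntos Exteriores defendió hoy el resultado"
print(add_char_noise(sample, dup_char_prob=0.05, del_char_prob=0.05, rng=random.Random(0)))
```

Drawing a single random number per character keeps deletion and duplication mutually exclusive.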

In [None]:
# import timeit
# # code snippet to be executed only once 
# mysetup = "from data_loader import add_str_noise, generate_mask"
  
# # code snippet whose execution time is to be measured 
# mycode = 'add_str_noise("El Ministerio chino de Asuntos Exteriores defendió hoy el resultado de las elecciones presidenciales celebradas en Perú y ofreció su apoyo al nuevo gobierno del presidente Alberto Fujimori.", dup_char_prob=0.01, del_char_prob=0.005)'
  
# # timeit statement 
# print (timeit.timeit(setup = mysetup, 
#                     stmt = mycode, 
#                     number = 10000) )

In [None]:
10.8827264 / 10000

In [None]:
import pickle
fname = '/home/leo/projects/mix_nlp/utf8/codes/adhoc-codebook-2112.pkl'
f = open(fname, 'rb')
codebook, char2int, int2char = pickle.load(f)

In [None]:

def item2int(char):
    if char not in char2int:
        char = UNK[1]
    num = char2int[char]
    return num

def txt2tensor(txt):
    return np.array(list(map(item2int, txt)))


In [None]:
code = np.array([char2int[c] for c in txt])

In [None]:
code1 = txt2tensor(txt)

In [None]:
np.array_equal(code, code1)

In [None]:
code

In [None]:
msk, txt = generate_mask(code)

In [None]:
''.join([int2char[i] for i in msk])
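
A rough sketch of random masking over a sequence of character ids is shown below. It is not the project's `generate_mask`; the function `mask_ids`, the masking probability, the mask id and the protected ids are all assumptions made for illustration.

```python
import random


def mask_ids(ids, mask_id, mask_prob=0.15, protected=(), rng=None):
    """Replace ids by mask_id with probability mask_prob; return (masked, original)."""
    rng = rng or random.Random()
    masked = []
    for i in ids:
        if i not in protected and rng.random() < mask_prob:
            masked.append(mask_id)
        else:
            masked.append(i)  # protected ids (e.g. padding or markers) are never masked
    return masked, list(ids)


ids = [2, 72, 111, 119, 32, 100, 111, 3, 0, 0]
masked, original = mask_ids(ids, mask_id=999, mask_prob=0.3, protected={0, 2, 3}, rng=random.Random(0))
print(masked)
print(original)
```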

In [None]:
from prepare_data import *

In [None]:
%%time

TRAIN_PATH = os.path.join(BASEPATH, 'train_selected')
DEV_PATH = os.path.join(BASEPATH, 'dev_selected')
VALID_PATH = os.path.join(BASEPATH, 'validation_selected')

outpath = OUTPUT_FNAME = '/home/leo/projects/Datasets/text/selected_monofile/all_tasks-{}.txt'
# outpath = OUTPUT_FNAME = '/home/leo/projects/Datasets/text/selected_monofile/glue-pos_tasks-{}.txt'
# outpath = OUTPUT_FNAME = '/home/leo/projects/Datasets/text/selected_monofile/pos_tasks-{}.txt'
opaths = [outpath.format(t) for t in ['train', 'dev', 'valid']]
paths = [TRAIN_PATH, DEV_PATH, VALID_PATH]

for fpath, ofile in zip(paths, opaths):
    jsonfile2jsonlines(paths=[fpath], ofile=ofile)

The lines of the datasets are then shuffled to avoid ordering issues. This is done randomly, from the console.

In [None]:
cmd = "shuf {} > {}"
files = get_all_files_recurse("/home/leo/projects/Datasets/text/selected_monofile")

In [None]:
files

In [None]:
# %%time
# for f in files:
#     os.system(cmd.format(f, f.replace(".txt", ".shuf.txt")))
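
For small files the same shuffle can be done in pure Python, as in this sketch. The function `shuffle_lines` is hypothetical and loads the whole file into memory, so it is only an illustration and not a replacement for the console step on the full data.

```python
import os
import random
import tempfile


def shuffle_lines(src_path, dst_path, seed=None):
    """Write the lines of src_path to dst_path in random order; return the line count."""
    with open(src_path, encoding="utf-8") as f:
        # Make sure the last line ends with a newline so lines do not merge after shuffling.
        lines = [line if line.endswith("\n") else line + "\n" for line in f]
    random.Random(seed).shuffle(lines)
    with open(dst_path, "w", encoding="utf-8") as f:
        f.writelines(lines)
    return len(lines)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "all_tasks-dev.txt")
    dst = src.replace(".txt", ".shuf.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("a\nb\nc\nd")
    count = shuffle_lines(src, dst, seed=0)
    with open(dst, encoding="utf-8") as f:
        shuffled = f.read().splitlines()
print(count, sorted(shuffled))
```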


In [None]:
# # clean non shuffled files
# for f in files:
#     os.system("rm {}".format(f))

In [1]:
from torch.utils.data import DataLoader
from data_loader import *

In [2]:
# import orjson 
# import json
# import os, sys

In [3]:
import pickle
fname = '/home/leo/projects/mix_nlp/utf8/codes/adhoc-codebook-1871.pkl'
f = open(fname, 'rb')
codebook, char2int, int2char = pickle.load(f)

fpath = OUTPUT_FNAME = '/home/leo/projects/Datasets/text/selected_monofile/glue-pos_tasks-dev.shuf.txt'

In [4]:
# dataset = Txt2TxtDataset([fpath], char2int, max_len=128, add_noise_to_task=False)
dataset = Txt2TxtDataset([fpath], char2int, max_len=128, add_noise_to_task=True)

In [5]:
loader = DataLoader(dataset, batch_size=10)

In [6]:
loader

<torch.utils.data.dataloader.DataLoader at 0x7fa6d2b4c5f8>

In [7]:
data = []
for d in dataset:
    data.append(d)
    print(d)
    break

(array([  2,  26, 111,  26,  32,  26,  79,  32,  73, 617, 108,  26, 114,
        97, 110,  32, 116, 116,  72,  26,  32, 115,  97, 120,  79,  26,
        72, 111,  78, 101,  63,  10,  72, 111, 119,  32,  26, 111,  32,
        79, 111,  85,  32, 108,  26,  65, 110,  32, 116,  72, 101,  32,
       115,  97, 120,  26, 112, 104, 111, 110, 101,  32, 113, 117, 105,
        67, 107, 108,   3,   4,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0]), array([  2,  72, 111, 119,  32, 100,  79,  32,  73,  32, 108, 101, 114,
        97, 110,  32, 116, 116, 104, 101,  32, 115,  97, 120,  79, 112,
       104, 111,  78, 101,  63,  10,  72, 111, 119,  32, 100, 111,  32,
        79, 117,  32, 108, 101,  97, 114, 110,  32, 116, 104,  69,  32,

In [8]:
data[0][0].dtype

dtype('int64')

In [9]:
iterdata = dataset.__iter__()

In [10]:
iterdata

<generator object Txt2TxtDataset._get_stream at 0x7fa6d2d0d6d8>

In [11]:
# %%time
# ld = list(iterdata)

In [12]:
ld0 = iterdata.__next__()

In [13]:
for data in ld0:
    print(''.join([int2char[i] for i in data]))

◂how▒ dO I ▒eArn thesa▒phonne?▒h▒W▒do yOu▒▒earN the SxAOphOne qUcIKL▸▶◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌
◂How  do I leArn thesaXophone?
how do yOu learn the saxOphOne qUicKly▸▶◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌
◀QQP Duplication Detection◂How  do I leArn thesaXophone?
how do yOu learn the saxOphOne qUicKly▸▶◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌
◂Not duplicates▸▶◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌
◂English▸◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌


In [14]:
''.join([int2char[i] for i in data])

'◂English▸◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌'

In [15]:
len(int2char.keys()),len(char2int.keys())

(1871, 1878)

In [16]:
# for l in loader:
#     print(l)
#     break

In [17]:
# split files
BASE_PATH = '/home/leo/projects/Datasets/text/selected_monofile'
from utils import *
from prepare_data import *
import os
from torch.utils.data import DataLoader
from data_loader import *

In [18]:
# all_files = [ f for f in os.listdir(BASE_PATH) if f.endswith(".txt")]

In [19]:
# all_files

In [20]:
# cmd = "split -d -l 300000 {} {}"

# for f in sorted(all_files):
#      print(cmd.format(f, "partitions/"+f+'-'))

In [21]:
BASE_PATH = '/home/leo/projects/Datasets/text/selected_monofile/partitions'

In [22]:
fpaths = get_all_files_recurse(BASE_PATH) 

In [23]:
train_files = [f for f in fpaths if 'train' in f]
dev_files = [f for f in fpaths if 'dev' in f]
valid_files = [f for f in fpaths if 'valid' in f]

In [24]:
len(train_files), len(dev_files), len(valid_files)
train_glue_files = [f for f in train_files if 'glue-' in f]

In [25]:
len(train_glue_files)

12

In [26]:
dataset = Txt2TxtDataset(train_glue_files, char2int, max_len=512, add_noise_to_task=True)
loader = DataLoader(dataset, batch_size=1000, num_workers=10, worker_init_fn=Txt2TxtDataset.worker_init_fn)
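
With an iterable-style dataset and several workers, PyTorch gives each worker its own copy of the dataset, so the input has to be divided between workers to avoid yielding the same samples several times. A common approach is to assign the files round-robin by worker id. The pure-Python sketch below shows only that assignment; `files_for_worker` is a hypothetical helper, and whether `Txt2TxtDataset.worker_init_fn` works this way is an assumption.

```python
def files_for_worker(files, worker_id, num_workers):
    """Round-robin assignment: worker k reads files k, k + n, k + 2n, ..."""
    return files[worker_id::num_workers]


# 12 partition files and 10 workers, as in the cell above (file names are illustrative).
files = ["glue-train-{:02d}".format(i) for i in range(12)]
for worker_id in range(10):
    print(worker_id, files_for_worker(files, worker_id, 10))
```

With 12 files and 10 workers, two workers read two files each and the rest read one, so the load is uneven for small file counts.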

In [27]:
%%time
batches = []
for l in loader:
    batches.append(l)
    if len(batches) > 10:
        break

CPU times: user 31.7 ms, sys: 64 ms, total: 95.6 ms
Wall time: 15 s


In [28]:
len(batches)

11

In [29]:
b0 = batches[0]

In [30]:
len(b0)

5

In [31]:
msk, src, txt, tgt, lang = b0
print(msk.shape, src.shape, txt.shape, tgt.shape, lang.shape)

torch.Size([1000, 512]) torch.Size([1000, 512]) torch.Size([1000, 512]) torch.Size([1000, 512]) torch.Size([1000, 60])


In [32]:
for t in b0:
    for s in t[:10]:
        s = s.numpy()
        print(code2str(s, int2char))

◂ThEse ▒Esorts aRe A weLcOme▒escape froM tHe heaT▒of The City in Su▒Er a▒d o▒ָeeR▒ma▒Y actIvi▒iesn
Durni gTHE sU▒MeRt▒▒e▒, the ciit iS ▒▒ry H▸▶◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌
◂ti▒m▒of yEar iIS alSoA actor,t HE▒wInteR moNhTS▒▒rinGing he▒v▒eer s▒a▸▶◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌

In [33]:
type(b0[1])

torch.Tensor

In [34]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
b0t = b0[1].to(device)

In [35]:
b0t0 = b0[0].to(device)

In [36]:
type(b0t0.to(device))

torch.Tensor

In [37]:
b0t

tensor([[  2,  84, 104,  ...,   0,   0,   0],
        [  2,  84, 105,  ...,   0,   0,   0],
        [  2,  80,  80,  ...,   0,   0,   0],
        ...,
        [  2,  84, 104,  ...,   0,   0,   0],
        [  2,  99, 108,  ...,   0,   0,   0],
        [  2,  73,  70,  ...,   0,   0,   0]], device='cuda:0')