# Data Preparation

This notebook develops the data preparation for text-to-text learning on supervised datasets (like T5 from Google). It extends the T5 approach to more tasks and is developed with PyTorch.

The source code is open-sourced.

The processed text will be released when/if I get the resources to host it in the open (due to data volumes).



## Dataset preparation.

One of the ideas of this process is to do less pre-processing and use the least pre-processed text possible. Uppercase, punctuation and other symbols carry information that some pre-processing steps destroy. This might not be too problematic for English or some other languages, but it certainly is for German (and might be for others).

Because of this, many of the available pre-processed (tokenized) datasets are discarded and the data preparation is done from raw data (for example, for the GLUE and SuperGLUE benchmarks).

Data preparation would be much faster with Scala on Spark than with Python, but for ease of portability and usage I'll be using Python. The data preparation is also a one-off: there is no need to re-process once it is done.

Even when working with Python, choosing the right libraries matters. For JSON we use [orjson](https://github.com/ijl/orjson). For CSV there seems to be a [faster library](https://github.com/juancarlospaco/faster-than-csv), but it does not have many users or much of a community, so we keep the standard `csv` module, which is the fastest of the remaining options.
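
As a rough sketch of this library choice (not the project's actual `preprocess` code), the snippet below reads tab-separated text with the standard `csv` module and serialises each row with orjson, falling back to the standard `json` module when orjson is not installed. The function name `tsv_to_json_lines`, the sample data and the `QUOTE_NONE` setting are assumptions made for illustration.

```python
import csv
import io
import json

try:
    import orjson  # optional third-party dependency

    def dumps(obj):
        return orjson.dumps(obj)  # orjson returns bytes
except ImportError:
    def dumps(obj):
        # Standard-library fallback, encoded to bytes for a uniform return type.
        return json.dumps(obj, ensure_ascii=False).encode("utf-8")


def tsv_to_json_lines(tsv_text):
    """Convert TSV text with a header row into one JSON document per row."""
    # QUOTE_NONE leaves quote characters in the raw text untouched,
    # in line with the "least pre-processing possible" goal.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [dumps(row) for row in reader]


sample = 'sentence\tlabel\nShe said "Hallo, Welt!"\t1\n'
lines = tsv_to_json_lines(sample)
print(lines[0].decode("utf-8"))
```

Because both branches return bytes, the output can be written to a file opened in binary mode regardless of which JSON library is available.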

### Text Task Description

In the original T5 paper the tasks are described in English and with a single representation, for example:
 
    Source String: "translate {}"
    Target String: "to {}"
 
In this work we add a few variations to this. The first variation is that the task will be described in multiple languages, to start with:

* English
* Spanish
* French
* German

TODO The second change is that instead of a single description of the task, there will be multiple ones and they'll be chosen randomly.

Examples for language translation:
 
    " Cómo se dice: {} en {} ?"
    " Cómo se escribe: {} en {} ?"
    " Escribe: {} en {} ?"
    " Traducir: {} al {}."
    " Por favor traduce: {} al {}"
    " Traduce: {} al {}"
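
A minimal sketch of how the random choice of a task description could look, assuming the templates are stored per prompt language. The table `TRANSLATE_TEMPLATES` and the function `describe_translation` are hypothetical names; only the Spanish templates come from the list above.

```python
import random

# Hypothetical template table: prompt language -> task descriptions.
# The Spanish entries are copied from the examples above.
TRANSLATE_TEMPLATES = {
    "es": [
        " Cómo se dice: {} en {} ?",
        " Cómo se escribe: {} en {} ?",
        " Escribe: {} en {} ?",
        " Traducir: {} al {}.",
        " Por favor traduce: {} al {}",
        " Traduce: {} al {}",
    ],
}


def describe_translation(text, target_language, prompt_language, rng=random):
    """Pick one template at random and fill in the text and the target language."""
    template = rng.choice(TRANSLATE_TEMPLATES[prompt_language])
    return template.format(text, target_language)


rng = random.Random(0)  # seeded generator so a run can be reproduced
prompt = describe_translation("Guten Morgen", "inglés", "es", rng)
print(prompt)
```

Passing the random generator explicitly keeps the sampling reproducible, which helps when the same dataset has to be regenerated.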



## Datasets List to process/analyze

* ~~MUSE~~ Issue downloading data, only multilang dictionaries available
* GLUE
    - [CoLA](https://nyu-mll.github.io/CoLA/); [Neural Network Acceptability Judgments ](https://arxiv.org/abs/1805.12471); [Source Code](https://github.com/nyu-mll/CoLA-baselines)
    - [MNLI](https://www.nyu.edu/projects/bowman/multinli/); [Paper](https://arxiv.org/abs/1704.05426); [Baseline](https://github.com/nyu-mll/multiNLI/blob/master/README.md)
    - MRPC [Paper](https://pdfs.semanticscholar.org/13d7/cbe9035abbb0f243a5e63e19d9c01bcf69d8.pdf); [Original Dataset](https://www.microsoft.com/en-us/download/details.aspx?id=52398&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F607d14d9-20cd-47e3-85bc-a2f65cd28042%2F)
    - QNLI [Paper](https://www.nyu.edu/projects/bowman/glue.pdf) 
    - QQP
    - RTE
    - SNLI
    - SST-2
    - STS-B
    - WNLI
* [SuperGLUE](https://w4ngatang.github.io/static/papers/superglue.pdf) 
    - BoolQ
    - CB
    - COPA
    - MultiRC
    - ReCoRD
    - RTE
    - WiC
    - WSC
* [XNLI](https://github.com/facebookresearch/XNLI) <- this one is interesting
* UD-Treebank v2.5 <- this one is interesting
* [SWAG](http://rowanzellers.com/swag/); [Paper](https://arxiv.org/abs/1808.05326); [Source Code](https://github.com/rowanz/swagaf)
* [WikiMatrix](https://ai.facebook.com/blog/wikimatrix/); [Paper](https://arxiv.org/abs/1907.05791); [Github](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix)
* ~~[SETimes](http://nlp.ffzg.hr/resources/corpora/setimes/)~~ Not needed, WikiMatrix and UD-Treebank already provide many samples
* Tatoeba:  Wikimatrix is nice but this one has different kind of phrases (questions, answers and some other things)
* [EuroParliament](http://www.statmt.org/europarl/)
* [Wikipedia Translation Dataset](http://opus.nlpl.eu/Wikipedia.php); [WikiExtractor](https://github.com/tatuylonen/wiktextract)
* [ConceptNET](http://conceptnet.io/); [Github](https://github.com/commonsense/conceptnet5/wiki) 
* [Open Multilingual WordNet](http://compling.hss.ntu.edu.sg/omw/) and [Global WordNet Association](http://globalwordnet.org/resources/wordnets-in-the-world/)


* [BabelNET](https://babelnet.org/) [Downloads](https://babelnet.org/download) seem proprietary ...
* [PanLex](https://panlex.org/)  Word-level translations for many (many) language pairs. [Downloads](https://panlex.org/source-list/) and [Vocabulary](https://vocab.panlex.org/)
* [ASJP Database](https://asjp.clld.org)
* Thesaurus [Some](https://old.datahub.io/dataset/open-data-thesaurus) [links](http://vocabulary.semantic-web.at/PoolParty/wiki/OpenData) [where](https://www.thesaurus.net/) [to](https://www.powerthesaurus.org/multilingual) find
* [bAbI](https://research.fb.com/downloads/babi/); [Code on Github](https://github.com/facebook/bAbI-tasks). Although it seems that there are [issues](https://www.reddit.com/r/MachineLearning/comments/3ohkt8/i_solved_facebooks_babi_and_found_lots_of_errors/) in the [dataset](http://jamesknighton.com/2015/babi/)
* [MALMO](https://www.microsoft.com/en-us/research/project/project-malmo/) Minecraft Artificial Intelligence; [Github](https://github.com/Microsoft/malmo)
* [FastText](https://fasttext.cc/docs/en/dataset.html)
* [DBPedia](https://wiki.dbpedia.org/develop/datasets)
* [W3C](https://www.w3.org/community/sentiment/wiki/Datasets)
* [Europarl](http://opus.nlpl.eu/Europarl.php)
* [Amazon Registry Open Data on AWS](https://registry.opendata.aws/)
* [Peter Jansen Cognitiveai.org Explanation Bank](http://cognitiveai.org/explanationbank/)
* [List of Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets#naturallanguage)
* [Emoji Database - Kaggle](https://www.kaggle.com/eliasdabbas/emoji-data-descriptions-codepoints)
* [Emoji Sentiment Data - Kaggle](https://www.kaggle.com/thomasseleck/emoji-sentiment-data)
* [EmojiNet - Kaggle](https://www.kaggle.com/rtatman/emojinet)
* [Twitter Emoji Prediction - Kaggle](https://www.kaggle.com/hariharasudhanas/twitter-emoji-prediction)
* [Sentiment Analysis multi-language - Kaggle](https://www.kaggle.com/weywenn/sentiment-analysis-multilanguage)
* [BigQuery public Dataset List](https://www.reddit.com/r/bigquery/wiki/datasets)

### Question Answering:

* XQuAD;  [Paper](https://arxiv.org/abs/1910.11856) [Dataset](https://github.com/deepmind/xquad)
* XQA; [Paper](https://www.aclweb.org/anthology/P19-1227/)
* MLQA; [Paper](https://arxiv.org/abs/1910.07475)


### Many more datasets here:

* https://quantumstat.com/dataset/dataset.html

## Unsupervised Datasets

* Gutenberg
* [Wiktionary](https://dumps.wikimedia.org/enwiktionary/)
* Scholarpedia
* [Wikipedia](https://dumps.wikimedia.org/)
* ArXiv
* Wikitext-2
* Wikitext-103 

## Source Code (Programming) Datasets

* [Github data](https://medium.com/google-cloud/github-on-bigquery-analyze-all-the-code-b3576fd2b150); [Original Post](https://github.blog/2016-06-29-making-open-source-data-more-available/); [GitHub BigQuery](https://cloud.google.com/blog/products/gcp/github-on-bigquery-analyze-all-the-open-source-code);  [BigQuery Public Data](https://cloud.google.com/bigquery/public-data)
* [GHArchive - Github](https://www.gharchive.org/); [Analyzing Github repo](https://github.com/fhoffa/analyzing_github)

### CoLA




## MNLI - MultiNLI Dataset

More than one task is possible here, because the dataset also contains the parse tree of each sentence, which is nice. So the output format of the JSON will be:

    {
        'input': "task: MNLI | Sentence 1: {} | Sentence 2: {}".format(sentence_1, sentence_2),
        'target': e['gold_label'],
        'input_sentence_1': "task: MNLI parse tree of: {}".format(sentence_1),
        'input_sentence_2': "task: MNLI parse tree of: {}".format(sentence_2),
        'parse_target_1': e['sentence1_parse'],
        'parse_target_2': e['sentence2_parse'],
    }
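
The record above can be sketched as a small function over one parsed line of the MultiNLI `.jsonl` file. The function name `mnli_to_txt2txt` is hypothetical, the field names `sentence1` and `sentence2` are assumed to be the keys holding the two sentences, and the sample entry is made up for illustration.

```python
import json


def mnli_to_txt2txt(e):
    """Build the text-to-text record described above from one MultiNLI entry."""
    sentence_1, sentence_2 = e["sentence1"], e["sentence2"]  # assumed key names
    return {
        "input": "task: MNLI | Sentence 1: {} | Sentence 2: {}".format(sentence_1, sentence_2),
        "target": e["gold_label"],
        "input_sentence_1": "task: MNLI parse tree of: {}".format(sentence_1),
        "input_sentence_2": "task: MNLI parse tree of: {}".format(sentence_2),
        "parse_target_1": e["sentence1_parse"],
        "parse_target_2": e["sentence2_parse"],
    }


# Made-up entry in the shape of one .jsonl line.
raw_line = json.dumps({
    "sentence1": "A man eats.",
    "sentence2": "Someone eats.",
    "gold_label": "entailment",
    "sentence1_parse": "(ROOT (S (NP (DT A) (NN man)) (VP (VBZ eats))))",
    "sentence2_parse": "(ROOT (S (NP (NN Someone)) (VP (VBZ eats))))",
})
record = mnli_to_txt2txt(json.loads(raw_line))
print(record["input"], "->", record["target"])
```

One entry therefore yields three training pairs: the inference task itself and one parse-tree task per sentence.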

## MRPC 



This data consists of 5 columns:

    label: 0 Not equivalent, 1 semantically equivalent
    sentence 1 id
    sentence 2 id
    sentence 1 text
    sentence 2 text
    
    
    
Note that this dataset is already tokenized, meaning it is not the raw text. Nothing else will be done to the text.

## QNLI

The dataset download contains the following columns:

    index
    Question
    Sentence
    Label - [entailment|not_entailment]


## QQP

Columns in the dataset:

    id
    qid1
    qid2
    question1
    question2
    is_duplicate



In [1]:
from preprocess import process_glue, process_superglue, rename_files

In [2]:
%time rename_files()

/home/leo/projects/Datasets/text/GLUE/CoLA/cola_cola_train-txt2txtmax-256.json
/home/leo/projects/Datasets/text/GLUE/CoLA/cola_cola_train.tsv
/home/leo/projects/Datasets/text/GLUE/CoLA/cola_cola_dev-txt2txtmax-256.json
/home/leo/projects/Datasets/text/GLUE/CoLA/cola_cola_train-txt2txt.json
/home/leo/projects/Datasets/text/GLUE/CoLA/cola_cola_dev-txt2txt.json
/home/leo/projects/Datasets/text/GLUE/CoLA/cola_cola_dev.tsv
/home/leo/projects/Datasets/text/GLUE/CoLA/cola_cola_test.tsv
/home/leo/projects/Datasets/text/GLUE/CoLA/cola_cola_test-txt2txt.json
/home/leo/projects/Datasets/text/GLUE/CoLA/original/raw/cola_cola_out_of_domain_dev.tsv
/home/leo/projects/Datasets/text/GLUE/CoLA/original/raw/cola_cola_in_domain_dev.tsv
/home/leo/projects/Datasets/text/GLUE/CoLA/original/raw/cola_cola_in_domain_train.tsv
/home/leo/projects/Datasets/text/GLUE/CoLA/original/tokenized/cola_cola_out_of_domain_dev.tsv
/home/leo/projects/Datasets/text/GLUE/CoLA/original/tokenized/cola_cola_in_domain_dev.tsv
/ho

In [2]:
%time process_glue()

opening /home/leo/projects/Datasets/text/GLUE/CoLA/dev.tsv
opening /home/leo/projects/Datasets/text/GLUE/CoLA/test.tsv
opening /home/leo/projects/Datasets/text/GLUE/MNLI/original/multinli_1.0_dev_matched.jsonl
opening /home/leo/projects/Datasets/text/GLUE/MNLI/original/multinli_1.0_train.jsonl
opening /home/leo/projects/Datasets/text/GLUE/CoLA/train.tsv
opening /home/leo/projects/Datasets/text/GLUE/MRPC/dev_ids.tsv
opening /home/leo/projects/Datasets/text/GLUE/MNLI/original/multinli_1.0_dev_mismatched.jsonl
saving to /home/leo/projects/Datasets/text/GLUE/CoLA/test-txt2txt.json
opening /home/leo/projects/Datasets/text/GLUE/MRPC/test.tsv
saving to /home/leo/projects/Datasets/text/GLUE/CoLA/dev-txt2txt.json
opening /home/leo/projects/Datasets/text/GLUE/MRPC/dev.tsv
opening /home/leo/projects/Datasets/text/GLUE/MRPC/train.tsv
saving to /home/leo/projects/Datasets/text/GLUE/MRPC/dev-txt2txt.json
opening /home/leo/projects/Datasets/text/GLUE/QNLI/train.tsv
saving to /home/leo/projects/Datase

## SuperGLUE

In [3]:
%time process_superglue()

opening /home/leo/projects/Datasets/text/SuperGLUE/CB/val.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/CB/test.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/CB/train.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/BoolQ/test.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/BoolQ/val.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/BoolQ/train.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/COPA/test.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/COPA/val.jsonl
saving to /home/leo/projects/Datasets/text/SuperGLUE/COPA/val-txt2txt.json
saving to /home/leo/projects/Datasets/text/SuperGLUE/CB/train-txt2txt.json
saving to /home/leo/projects/Datasets/text/SuperGLUE/COPA/test-txt2txt.json
opening /home/leo/projects/Datasets/text/SuperGLUE/COPA/train.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/ReCoRD/val.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/ReCoRD/test.jsonl
saving to /home/leo/projects/Datasets

## SwagAF


## Universal Dependencies v2.5

In [4]:
# from preprocess_conllu import conllu_process
from preprocess_conllu import *

In [5]:
%%time
conllu_process()

CPU times: user 53.1 ms, sys: 38.6 ms, total: 91.8 ms
Wall time: 1min 39s


In [6]:
all_wm = get_all_files_recurse("/media/nfs/Datasets/text/WikiMatrix/")

## WikiMatrix

File structure is:
 
    v1/*.gz - 65 GB
    v1/SMALL/*.gz - 4.6 GB
    
We can use all the big files for training and the small ones for validation. On inspection, the two sets contain different language pairs, so the small files can be used for zero-shot learning on translation pairs.



In [1]:
from utils import *
import pickle

In [2]:
# WIKIMATRIX_BASEPATH = "/media/nfs/Datasets/text/WikiMatrix/v1"
# all_files = get_all_files_recurse(WIKIMATRIX_BASEPATH)

In [3]:
# all_files = [ f for f in all_files if 'txt2txt' in f]

In [4]:
# import os

# for f in all_files:
#     os.system('rm {}'.format(f))

In [5]:
from preprocess_wikimatrix import *


I'll first erase the data I'm sure I'll not be using. The original tar file is complete, so there is no issue with deleting individual gz files even if I need them later. This frees some space, and I can then start checking the rest of the data to see if there is any encoding issue with the current codebook.



In [6]:
all_files = get_all_files_recurse(WIKIMATRIX_BASEPATH)
blacklist = [b + '-' for b in BLACKLIST_LANGS] + ['-' + b for b in BLACKLIST_LANGS]
to_remove = []

for f in all_files:
    for b in blacklist:
        if b in f:
            to_remove.append(f)
            break
            

In [7]:
len(all_files), len(to_remove)

(1702, 23)
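
The filter above matches blacklist entries as substrings of the whole path, which can also catch a code that merely ends with a blacklisted one (for example `ar-` inside `war-zh`). A hedged alternative sketch is to parse the language pair out of the file name and compare whole codes. The helper names and the sample paths below are hypothetical and are not part of `preprocess_wikimatrix`.

```python
import os


def language_pair(path):
    """Return the language codes encoded in a WikiMatrix file name."""
    name = os.path.basename(path)
    # e.g. "WikiMatrix.pt-war.tsv.gz" -> ("pt", "war")
    middle = name.replace("WikiMatrix.", "").replace(".tsv.gz", "")
    return tuple(middle.split("-"))


def select_blacklisted(paths, blacklist_langs):
    """Keep only the paths whose language pair contains a blacklisted code."""
    blocked = set(blacklist_langs)
    return [p for p in paths if blocked.intersection(language_pair(p))]


# Made-up paths for illustration.
files = [
    "/data/WikiMatrix/v1/WikiMatrix.ar-en.tsv.gz",
    "/data/WikiMatrix/v1/WikiMatrix.war-zh.tsv.gz",
    "/data/WikiMatrix/v1/WikiMatrix.pt-war.tsv.gz",
]
to_remove_exact = select_blacklisted(files, ["ar"])
print(to_remove_exact)
```

Comparing whole codes also makes the result independent of directory names that happen to contain a hyphen.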

In [8]:
import os
for f in to_remove:
    os.remove(f)  # same effect as "rm", without shell quoting issues

In [9]:
all_files = sorted(get_all_files_recurse(WIKIMATRIX_BASEPATH))

In [10]:
len(all_files)

1679

This seems quite a big win for the pre-pre-processing.
Now I have to actually check the rest of the languages. To do this I could sample 2 or 3 files of each language instead of checking all files. That would be much faster, but it could miss characters from unrecognized languages in the input, so for the moment I'll process them all and check by hand later.

In [11]:
# get all remaining language codes:
lang_codes = set([])
for f in all_files:
    codes = path_leaf(f).replace("WikiMatrix.","").replace(".tsv.gz","").split("-")
    lang_codes.update(codes)

In [13]:
codebook_path = 'codes/adhoc-codebook-1916.pkl'
# Load the codebook and both lookup tables, closing the file afterwards.
with open(codebook_path, 'rb') as f:
    codebook, char2int, int2char = pickle.load(f)

In [24]:
all_files[0]

'/media/nfs/Datasets/text/WikiMatrix/v1/SMALL/WikiMatrix.an-bg.tsv.gz'

In [27]:
# all_files.remove('/media/nfs/Datasets/text/WikiMatrix/v1/SMALL/WikiMatrix.q*tsv.gz')
all_files.index('/media/nfs/Datasets/text/WikiMatrix/v1/SMALL/WikiMatrix.pt-war.tsv.gz')

844

In [None]:
checks = []

In [28]:
%%time
for f in all_files[844:]:
    checks.append(check_encoding_works(char2int, f))

Accept: True | chars set: 1323 | chars count: 536941.0000001 | unk_chars = 1682 | ratio = 0.003132560188176516  | fname = WikiMatrix.pt-war.tsv.gz
Accept: True | chars set: 749 | chars count: 745932.0000001 | unk_chars = 1554 | ratio = 0.002083299818213713  | fname = WikiMatrix.rm-ro.tsv.gz
Accept: True | chars set: 902 | chars count: 1325102.0000001001 | unk_chars = 5119 | ratio = 0.0038630988406927265  | fname = WikiMatrix.rm-ru.tsv.gz
Accept: True | chars set: 830 | chars count: 909385.0000001 | unk_chars = 1591 | ratio = 0.0017495340257424798  | fname = WikiMatrix.rm-sv.tsv.gz
Accept: True | chars set: 635 | chars count: 552455.0000001 | unk_chars = 1028 | ratio = 0.0018607850413152455  | fname = WikiMatrix.rm-tr.tsv.gz
Accept: True | chars set: 693 | chars count: 1057728.0000001001 | unk_chars = 3666 | ratio = 0.0034659194046103093  | fname = WikiMatrix.rm-uk.tsv.gz
Accept: True | chars set: 1048 | chars count: 646117.0000001 | unk_chars = 1872 | ratio = 0.0028973080726860776  | f
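
The logged `ratio` matches `unk_chars` divided by `chars count`. A minimal sketch of such a check is given below, assuming a file is accepted when the share of characters missing from `char2int` stays under a threshold. The function `check_encoding`, the threshold value and the toy codebook are assumptions, and the real `check_encoding_works` may differ.

```python
from collections import Counter


def check_encoding(text, char2int, max_unk_ratio=0.01):
    """Return (accept, distinct chars, total chars, unknown chars, ratio)."""
    counts = Counter(text)
    total = sum(counts.values())
    # Characters without an entry in the codebook count as unknown.
    unk = sum(n for ch, n in counts.items() if ch not in char2int)
    ratio = unk / total if total else 0.0
    return ratio <= max_unk_ratio, len(counts), total, unk, ratio


# Toy codebook covering only lowercase ASCII letters and the space.
toy_char2int = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
accept, n_chars, total, unk, ratio = check_encoding("hola señor", toy_char2int)
print(accept, n_chars, total, unk, ratio)
```

Counting characters once with `Counter` keeps the check linear in the text size, which matters for files with millions of characters.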

In [37]:
accepted = [c for c in checks if c[0]]

In [38]:
failed = [c for c in checks if c[0] is False]

In [39]:
len(accepted), len(failed)

(1649, 30)

In [41]:
failed = [f[1] for f in  failed]
accepted = [f[1] for f in  accepted]

In [42]:
failed

[('ce', 'en'),
 ('ce', 'ru'),
 ('ce', 'uk'),
 ('cv', 'en'),
 ('cv', 'ru'),
 ('dv', 'en'),
 ('en', 'ga'),
 ('en', 'gd'),
 ('en', 'ht'),
 ('en', 'hy'),
 ('en', 'ku'),
 ('en', 'mh'),
 ('en', 'mi'),
 ('en', 'mn'),
 ('en', 'mt'),
 ('en', 'ps'),
 ('en', 'sa'),
 ('en', 'su'),
 ('en', 'tk'),
 ('en', 'wa'),
 ('en', 'wa'),
 ('fr', 'ps'),
 ('hr', 'ru'),
 ('ba', 'en'),
 ('br', 'en'),
 ('en', 'jv'),
 ('en', 'tg'),
 ('en', 'tt'),
 ('en', 'ug'),
 ('fr', 'hy')]

In [33]:
failedset = set([])
for f in failed:
    failedset.update(f)

In [34]:
accset = set([])
for a in accepted:
    accset.update(a)

In [35]:
failed = failedset.difference(accset)

In [36]:
failed

set()

In [23]:
bl = set(['wuu', 'gom', 'lmo', 'mwl', 'ilo', 'ckb', "ar", "hi", "sh", "hu", "eo", "fo", "si",
                   "bn", "ml", "fa", "ne", "as", "azb", "ka", 'as', 'bn', 'fa', 'ka', 'ml', 'ne', 'si', 'zb',
                   # "sq", "he", maybe yes
                   # "hr", "br" ???
                   "ur", "id", "kk", "mr", "ta", "th", "hi", "zh", "ko", "tl", "vi", "te", "ja",'bp', 'ew', 'gu', 'pa', 'py'
                   ])

In [26]:
# sorted(list(bl))

In [3]:
%%time
wikimatrix_process()

Error processing file: /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.en-wuu.tsv.gz 
With error: 'NoneType' object has no attribute 'name'
Error processing file: /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.ru-wuu.tsv.gz 
With error: 'NoneType' object has no attribute 'name'
Error processing file: /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.es-gom.tsv.gz 
With error: 'NoneType' object has no attribute 'name'
Error processing file: /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.it-lmo.tsv.gz 
With error: 'NoneType' object has no attribute 'name'
Error processing file: /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.fr-wuu.tsv.gz 
With error: 'NoneType' object has no attribute 'name'
Error processing file: /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.it-wuu.tsv.gz 
With error: 'NoneType' object has no attribute 'name'
Error processing file: /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.en-mwl.tsv.gz 
With error: 'NoneType' object has no attribute 'name'
Error 

In [10]:
sum([1620, 1925])  # 1620 are the complete files, 1925 are the files in the SMALL dataset

3545

In [11]:
# WIKIMATRIX_BASEPATH = "/media/nfs/Datasets/text/WikiMatrix/v1"

# allfiles = get_all_files_recurse(WIKIMATRIX_BASEPATH)

In [12]:
# t2t = [f for f in allfiles if 'txt2txt' in f]

In [13]:
# len(t2t)

3545 files processed and 3545 files existing, everything seems OK.

# Data preparation by length and task

This part runs the preparation by length and task and checks that the steps work as expected.

In [8]:
from prepare_data import *

In [15]:
# tfile = '/home/leo/projects/Datasets/text/SuperGLUE/CB/val-txt2txt.json'
# fname = '/media/nfs/Datasets/text/WikiMatrix/v1/SMALL/WikiMatrix.ja-su-txt2txt.json.gz'

In [9]:
%%time
process()

CPU times: user 434 ms, sys: 127 ms, total: 562 ms
Wall time: 14min 23s


In [10]:
# separate_by_strlen(fname)

In [11]:
%%time
prepare_select_all()

Preparing 10109 files of max_len 512
CPU times: user 836 ms, sys: 1.13 s, total: 1.96 s
Wall time: 35.2 s


In [12]:
%%time
prepare_lm_data_wikimatrix()

CPU times: user 32min 59s, sys: 48.2 s, total: 33min 47s
Wall time: 33min 47s


In [14]:
%%time
# OUTPUT_FNAME = '/home/leo/projects/Datasets/text/train_selected_monofile/monofile.txt'
OUTPUT_FNAME = '/home/leo/projects/Datasets/text/train_selected_monofile/monofile_2.txt'

json2lines(separator='\t', ofile=OUTPUT_FNAME)

CPU times: user 12min 20s, sys: 58.7 s, total: 13min 19s
Wall time: 13min 20s


The output file contains about 273M samples (lines):

    wc -l 
    273332515 monofile_2.txt


In [24]:
# model_vocab_sizes = [32000, 64000, 96000, 128000]
# model_prefixes = ['all_34G_32k', 'all_34G_64k', 'all_34G_96k', 'all_34G_128k']
# model_types = ['unigram', 'bpe', 'word', 'char']
# input_sentence_size = [1e6, 1e7, 273332515]

# cmd = "spm_train --input={} --vocab_size={} --input_format=tsv --model_prefix={} --model_type={} --character_coverage=0.9995"
# cmd2 = "spm_train --input={} --input_sentence_size={} --vocab_size={} --input_format=tsv --model_prefix={} --model_type={} --character_coverage=0.9995 --shuffle_input_sentence"

# commands = []
# commands2 = []
# file = OUTPUT_FNAME
# for vs in model_vocab_sizes:
#     for t in model_types:
#         for pref in model_prefixes:
#             for ss in input_sentence_size:
#                 prefix = '-'.join((t,pref))
#                 c = cmd.format(file, vs, prefix,t )
#                 commands.append(c)
#                 c2 = cmd2.format(file, int(ss), vs, prefix,t )
#                 commands2.append(c2)


In [25]:
commands

['spm_train --input=/home/leo/projects/Datasets/text/train_selected_monofile/monofile_2.txt --vocab_size=32000 --input_format=tsv --model_prefix=unigram-all_34G_32k --model_type=unigram --character_coverage=0.9995',
 'spm_train --input=/home/leo/projects/Datasets/text/train_selected_monofile/monofile_2.txt --vocab_size=32000 --input_format=tsv --model_prefix=unigram-all_34G_32k --model_type=unigram --character_coverage=0.9995',
 'spm_train --input=/home/leo/projects/Datasets/text/train_selected_monofile/monofile_2.txt --vocab_size=32000 --input_format=tsv --model_prefix=unigram-all_34G_32k --model_type=unigram --character_coverage=0.9995',
 'spm_train --input=/home/leo/projects/Datasets/text/train_selected_monofile/monofile_2.txt --vocab_size=32000 --input_format=tsv --model_prefix=unigram-all_34G_64k --model_type=unigram --character_coverage=0.9995',
 'spm_train --input=/home/leo/projects/Datasets/text/train_selected_monofile/monofile_2.txt --vocab_size=32000 --input_format=tsv --mode

SentencePiece

--input_sentence_size {} --vocab_size {} --input_format tsv --model_prefix {} --input {} --model_type {} --character_coverage=0.9995
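
In the command list printed earlier, `--vocab_size=32000` appears together with the prefix `all_34G_64k` and identical commands are repeated, which suggests the nested loops combine every vocabulary size with every prefix. A hedged sketch that pairs each size with its prefix is shown below. It only uses the flags already present in the template above; the function name `build_spm_commands` and the sample arguments are assumptions.

```python
import shlex


def build_spm_commands(input_file, vocab_sizes, prefixes, model_types, sentence_size=None):
    """Build one spm_train command line per (vocab size, model type) pair."""
    commands = []
    # zip pairs each vocabulary size with its own prefix instead of crossing them.
    for vocab_size, prefix in zip(vocab_sizes, prefixes):
        for model_type in model_types:
            args = [
                "spm_train",
                "--input={}".format(input_file),
                "--vocab_size={}".format(vocab_size),
                "--input_format=tsv",
                "--model_prefix={}-{}".format(model_type, prefix),
                "--model_type={}".format(model_type),
                "--character_coverage=0.9995",
            ]
            if sentence_size is not None:
                # Optional subsampling of the input, as in the second template.
                args.append("--input_sentence_size={}".format(int(sentence_size)))
                args.append("--shuffle_input_sentence")
            commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


spm_commands = build_spm_commands(
    "monofile_2.txt",
    [32000, 64000],
    ["all_34G_32k", "all_34G_64k"],
    ["unigram", "bpe"],
    sentence_size=1e6,
)
for c in spm_commands:
    print(c)
```

With two sizes and two model types this produces four distinct commands, each with a unique `--model_prefix`, so no trained model overwrites another.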


BPEmb: Subword Embeddings in 275 Languages

BPEmb 

https://nlp.h-its.org/bpemb/
https://nlp.h-its.org/bpemb/multi/



In [1]:
import sentencepiece as spm

In [48]:
s = spm.SentencePieceProcessor()
# s.Load('/home/leo/projects/Datasets/text/sentencepiece/bpe-all_2G5_64k.model')
s.Load('/home/leo/projects/Datasets/text/sentencepiece/bpe-all_2G5_64k.model')

True

In [49]:
p = s.SampleEncodeAsPieces('New York', -1, 0.1)

In [45]:
s.EncodeAsPieces

<bound method SentencePieceProcessor.EncodeAsPieces of <sentencepiece.SentencePieceProcessor; proxy of <Swig Object of type 'sentencepiece::SentencePieceProcessor *' at 0x7f9f4b628120> >>

In [43]:
s.SampleEncodeAsPieces?

Signature: s.SampleEncodeAsPieces(input, nbest_size, alpha)
Docstring: <no docstring>
File:      ~/venv3/lib/python3.7/site-packages/sentencepiece.py
Type:      method


In [53]:
for i in range(10):
    print(s.EncodeAsPieces('吾輩は猫である'), s.EncodeAsIds('吾輩は猫である'))
    print(s.EncodeAsPieces('New York'), s.EncodeAsIds('New York'))
    print(s.SampleEncodeAsPieces('New York', -1, 0.1))

['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]


In [35]:
s.SampleEncodeAsIds('New York', -1, 0.1)

[803, 390, 7, 62657]

In [27]:
s.DecodeIds([474, 13, 390, 776])

'New York'

In [18]:
'U+2588', chr(0x2588)

('U+2588', '█')

In [21]:
'█'

'█'

In [3]:
from pycountry import languages

In [5]:
l = languages.get(alpha_2='es')

In [6]:
l.name

'Spanish'
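
One plausible cause of the `'NoneType' object has no attribute 'name'` errors seen for codes such as `wuu` (not confirmed by this notebook) is a lookup by `alpha_2` for a three-letter code, which finds nothing. The sketch below picks the key by code length and falls back to the code itself. `language_name` is a hypothetical helper, and `fake_lookup` is a stand-in with made-up entries so the example runs without pycountry; in practice `lookup` could be `pycountry.languages.get`.

```python
class _Lang:
    """Tiny stand-in for a pycountry language record."""

    def __init__(self, name):
        self.name = name


def fake_lookup(**kwargs):
    """Stub with the same keyword interface as a languages.get-style lookup."""
    table = {
        ("alpha_2", "es"): _Lang("Spanish"),
        ("alpha_3", "wuu"): _Lang("Wu Chinese"),  # made-up stub entry
    }
    key, value = next(iter(kwargs.items()))
    return table.get((key, value))


def language_name(code, lookup):
    """Return a readable language name, or the code itself when unknown."""
    key = "alpha_2" if len(code) == 2 else "alpha_3"
    try:
        entry = lookup(**{key: code})
    except LookupError:  # some lookups raise instead of returning None
        entry = None
    return entry.name if entry is not None else code


for code in ("es", "wuu", "zz"):
    print(code, "->", language_name(code, fake_lookup))
```

Falling back to the raw code keeps the processing loop running for unknown languages instead of aborting on a missing attribute.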

I don't like how SentencePiece is encoding the text: it fails on some inputs (in the output above, two pieces of the Japanese sample map to id 0 and the sampled encoding comes back empty), and I don't want issues with the single symbols of some languages.

For the moment I'll redo the entire decision: the encoding and the languages that we'll be able to represent. On one hand this creates a problem, as I wanted something universally extensible; on the other it simplifies many things and cuts the amount of data that I'll have to use. The languages used will mostly be Western ones, based on the Latin, Greek and Cyrillic scripts.