# Data Preparation

This notebook develops the data preparation for text-to-text learning on supervised datasets (like T5 from Google). It extends T5 to more tasks and is developed with PyTorch.

The source code is open-sourced.

The processed text will be made available when/if I get the resources to host it in the open (the data volume is large).



## Dataset preparation.

One idea behind this process is to pre-process as little as possible and use the least pre-processed text available. Uppercase, punctuation and other symbols carry information that some pre-processing destroys. This might not be too problematic for English or some other languages, but it certainly is for German (where, for example, all nouns are capitalized) and might be for others.

Because of this, many of the available pre-processed (tokenized) datasets are discarded and the data preparation is done from raw data (for example for the GLUE and SuperGLUE benchmarks).

Data preparation would be much faster with Scala on Spark than with Python, but for ease of portability and usage I'll be using Python. The data preparation is also a one-off: once done, there is no need to re-process.

Nevertheless, even when working with Python, choosing the right libraries matters. For JSON we use [orjson](https://github.com/ijl/orjson). For CSV there seems to be a [faster library](https://github.com/juancarlospaco/faster-than-csv), but it does not have many users or much of a community, so we stay with the standard `csv` library, which is the next fastest way of doing it.

### Text Task Description

In the original T5 paper the tasks are described in English and with a single representation, for example:
 
    Source String: "translate {}"
    Target String: "to {}"
 
In this work we add a few variations to this. The first variation is that the task will be described in multiple languages, starting with:

* English
* Spanish
* French
* German

TODO The second change is that instead of a single description of the task, there will be multiple ones and they'll be chosen randomly.

Examples for language translation:
 
    " Cómo se dice: {} en {} ?"
    " Cómo se escribe: {} en {} ?"
    " Escribe: {} en {} ?"
    " Traducir: {} al {}."
    " Por favor traduce: {} al {}"
    " Traduce: {} al {}"

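A minimal sketch of how the random choice of a task description could work. The Spanish templates are the ones listed above; the English templates, the `TRANSLATION_TEMPLATES` table and the `describe_translation` helper are hypothetical names used only for illustration, not the code in the repository.

```python
import random

# Task descriptions per language. The Spanish entries are the ones listed
# above; the English ones are hypothetical placeholders for illustration.
TRANSLATION_TEMPLATES = {
    "es": [
        " Cómo se dice: {} en {} ?",
        " Cómo se escribe: {} en {} ?",
        " Escribe: {} en {} ?",
        " Traducir: {} al {}.",
        " Por favor traduce: {} al {}",
        " Traduce: {} al {}",
    ],
    "en": [
        " Translate: {} to {}",
        " How do you say: {} in {} ?",
    ],
}


def describe_translation(text, target_language, description_lang, rng=random):
    """Pick one template at random and fill in the text and the target language."""
    template = rng.choice(TRANSLATION_TEMPLATES[description_lang])
    return template.format(text, target_language)


rng = random.Random(0)  # seeded only to make the example repeatable
print(describe_translation("Good morning", "alemán", "es", rng))
```

Passing the random generator as a parameter keeps the sampling reproducible when a fixed seed is wanted for debugging.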


## Datasets List to process/analyze

* ~~MUSE~~ Issue downloading data, only multilang dictionaries available
* GLUE
    - [CoLA](https://nyu-mll.github.io/CoLA/); [Neural Network Acceptability Judgments ](https://arxiv.org/abs/1805.12471); [Source Code](https://github.com/nyu-mll/CoLA-baselines)
    - [MNLI](https://www.nyu.edu/projects/bowman/multinli/); [Paper](https://arxiv.org/abs/1704.05426); [Baseline](https://github.com/nyu-mll/multiNLI/blob/master/README.md)
    - MRPC [Paper](https://pdfs.semanticscholar.org/13d7/cbe9035abbb0f243a5e63e19d9c01bcf69d8.pdf); [Original Dataset](https://www.microsoft.com/en-us/download/details.aspx?id=52398&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F607d14d9-20cd-47e3-85bc-a2f65cd28042%2F)
    - QNLI [Paper](https://www.nyu.edu/projects/bowman/glue.pdf) 
    - QQP
    - RTE
    - SNLI
    - SST-2
    - STS-B
    - WNLI
* [SuperGLUE](https://w4ngatang.github.io/static/papers/superglue.pdf) 
    - BoolQ
    - CB
    - COPA
    - MultiRC
    - ReCoRD
    - RTE
    - WiC
    - WSC
* [XNLI](https://github.com/facebookresearch/XNLI) <- this one is interesting
* UD-Treebank v2.5 <- this one is interesting
* [SWAG](http://rowanzellers.com/swag/); [Paper](https://arxiv.org/abs/1808.05326); [Source Code](https://github.com/rowanz/swagaf)
* [WikiMatrix](https://ai.facebook.com/blog/wikimatrix/); [Paper](https://arxiv.org/abs/1907.05791); [Github](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix)
* ~~[SETimes](http://nlp.ffzg.hr/resources/corpora/setimes/)~~ Not needed, there are already many samples in WikiMatrix and UD-Treebank
* Tatoeba: WikiMatrix is nice, but this one has different kinds of phrases (questions, answers and some other things)
* [EuroParliament](http://www.statmt.org/europarl/)
* [Wikipedia Translation Dataset](http://opus.nlpl.eu/Wikipedia.php); [WikiExtractor](https://github.com/tatuylonen/wiktextract)
* [ConceptNET](http://conceptnet.io/); [Github](https://github.com/commonsense/conceptnet5/wiki) 
* [Open Multilingual WordNet](http://compling.hss.ntu.edu.sg/omw/) and [Global WordNet Association](http://globalwordnet.org/resources/wordnets-in-the-world/)


* [BabelNET](https://babelnet.org/) [Downloads](https://babelnet.org/download) seem proprietary ...
* [PanLex](https://panlex.org/)  Word-level translations for many (many) language pairs. [Downloads](https://panlex.org/source-list/) and [Vocabulary](https://vocab.panlex.org/)
* [ASJP Database](https://asjp.clld.org)
* Thesaurus [Some](https://old.datahub.io/dataset/open-data-thesaurus) [links](http://vocabulary.semantic-web.at/PoolParty/wiki/OpenData) [where](https://www.thesaurus.net/) [to](https://www.powerthesaurus.org/multilingual) find
* [bAbI](https://research.fb.com/downloads/babi/); [Code on Github](https://github.com/facebook/bAbI-tasks). Although it seems that there are [issues](https://www.reddit.com/r/MachineLearning/comments/3ohkt8/i_solved_facebooks_babi_and_found_lots_of_errors/) in the [dataset](http://jamesknighton.com/2015/babi/)
* [MALMO](https://www.microsoft.com/en-us/research/project/project-malmo/) Minecraft Artificial Intelligence; [Github](https://github.com/Microsoft/malmo)
* [FastText](https://fasttext.cc/docs/en/dataset.html)
* [DBPedia](https://wiki.dbpedia.org/develop/datasets)
* [W3C](https://www.w3.org/community/sentiment/wiki/Datasets)
* [Europarl](http://opus.nlpl.eu/Europarl.php)
* [Amazon Registry Open Data on AWS](https://registry.opendata.aws/)
* [Peter Jansen Cognitiveai.org Explanation Bank](http://cognitiveai.org/explanationbank/)
* [List of Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets#naturallanguage)
* [Emoji Database - Kaggle](https://www.kaggle.com/eliasdabbas/emoji-data-descriptions-codepoints)
* [Emoji Sentiment Data - Kaggle](https://www.kaggle.com/thomasseleck/emoji-sentiment-data)
* [EmojiNet - Kaggle](https://www.kaggle.com/rtatman/emojinet)
* [Twitter Emoji Prediction - Kaggle](https://www.kaggle.com/hariharasudhanas/twitter-emoji-prediction)
* [Sentiment Analysis multi-language - Kaggle](https://www.kaggle.com/weywenn/sentiment-analysis-multilanguage)
* [BigQuery public Dataset List](https://www.reddit.com/r/bigquery/wiki/datasets)

### Question Answering:

* XQuAD;  [Paper](https://arxiv.org/abs/1910.11856) [Dataset](https://github.com/deepmind/xquad)
* XQA; [Paper](https://www.aclweb.org/anthology/P19-1227/)
* MLQA; [Paper](https://arxiv.org/abs/1910.07475)


### Many more datasets here:

* https://quantumstat.com/dataset/dataset.html

## Unsupervised Datasets

* Gutenberg
* [Wiktionary](https://dumps.wikimedia.org/enwiktionary/)
* Scholarpedia
* [Wikipedia](https://dumps.wikimedia.org/)
* ArXiv
* Wikitext-2
* Wikitext-103 

## Source Code (Programming) Datasets

* [Github data](https://medium.com/google-cloud/github-on-bigquery-analyze-all-the-code-b3576fd2b150); [Original Post](https://github.blog/2016-06-29-making-open-source-data-more-available/); [GitHub BigQuery](https://cloud.google.com/blog/products/gcp/github-on-bigquery-analyze-all-the-open-source-code);  [BigQuery Public Data](https://cloud.google.com/bigquery/public-data)
* [GHArchive - Github](https://www.gharchive.org/); [Analyzing Github repo](https://github.com/fhoffa/analyzing_github)

### CoLA




## MNLI - MultiNLI Dataset

More than one task is possible here, since the dataset also contains the parse tree for each sentence, which is nice. So the output format of the JSON will be:

    {
        'input': "task: MNLI | Sentence 1: {} | Sentence 2: {}".format(sentence_1, sentence_2),
        'target': e['gold_label'],
        'input_sentence_1': "task: MNLI parse tree of: {}".format(sentence_1),
        'input_sentence_2': "task: MNLI parse tree of: {}".format(sentence_2),
        'parse_target_1': e['sentence1_parse'],
        'parse_target_2': e['sentence2_parse'],
    }
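A runnable sketch of the record above as a function. It assumes each MultiNLI example `e` is a dict read from the JSONL file with the fields `sentence1`, `sentence2`, `gold_label`, `sentence1_parse` and `sentence2_parse`; the function name `mnli_to_txt2txt` and the sample values are made up for illustration.

```python
def mnli_to_txt2txt(e):
    """Build the text-to-text record for one MultiNLI example (a dict)."""
    sentence_1 = e["sentence1"]
    sentence_2 = e["sentence2"]
    return {
        "input": "task: MNLI | Sentence 1: {} | Sentence 2: {}".format(sentence_1, sentence_2),
        "target": e["gold_label"],
        "input_sentence_1": "task: MNLI parse tree of: {}".format(sentence_1),
        "input_sentence_2": "task: MNLI parse tree of: {}".format(sentence_2),
        "parse_target_1": e["sentence1_parse"],
        "parse_target_2": e["sentence2_parse"],
    }


# Made-up example, only to show the shape of the data.
example = {
    "sentence1": "A cat sleeps.",
    "sentence2": "An animal rests.",
    "gold_label": "entailment",
    "sentence1_parse": "(ROOT (S (NP (DT A) (NN cat)) (VP (VBZ sleeps)) (. .)))",
    "sentence2_parse": "(ROOT (S (NP (DT An) (NN animal)) (VP (VBZ rests)) (. .)))",
}
record = mnli_to_txt2txt(example)
print(record["input"])
```

One source example therefore yields three possible training pairs: the entailment task and one parsing task per sentence.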

## MRPC 



This data consists of 5 columns:

    label: 0 Not equivalent, 1 semantically equivalent
    sentence 1 id
    sentence 2 id
    sentence 1 text
    sentence 2 text
    
    
    
Note that this dataset is already tokenized, meaning it is not the raw text. Nothing else will be done to the text.

## QNLI

The dataset download contains the following columns:

    index
    Question
    Sentence
    Label - [entailment|not_entailment]


## QQP

Columns in the dataset:

    id
    qid1
    qid2
    question1
    question2
    is_duplicate
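As an illustration of how a tab-separated GLUE file with these columns could be turned into text-to-text records, here is a small sketch using the standard `csv` library. The input format string follows the MNLI one above and is an assumption, as is the helper name `qqp_rows_to_txt2txt`; standard `json` is used here only to keep the example dependency-free.

```python
import csv
import io
import json


def qqp_rows_to_txt2txt(tsv_file):
    """Yield one text-to-text record per row of a QQP-style TSV file object."""
    # QUOTE_NONE keeps quotation marks in the questions untouched.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            "input": "task: QQP | Question 1: {} | Question 2: {}".format(
                row["question1"], row["question2"]
            ),
            "target": row["is_duplicate"],
        }


# In-memory sample with made-up content instead of the real file.
sample = io.StringIO(
    "id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "0\t1\t2\tHow do I learn Python?\tWhat is the best way to learn Python?\t1\n"
)
records = list(qqp_rows_to_txt2txt(sample))
print(json.dumps(records[0], ensure_ascii=False))
```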



In [2]:
from preprocess import process_glue, process_superglue, rename_files

In [14]:
# %time rename_files()

In [3]:
%time process_glue()

opening /home/leo/projects/Datasets/text/GLUE/CoLA/cola_train.tsv
opening /home/leo/projects/Datasets/text/GLUE/MNLI/original/mnli_multinli_1.0_dev_matched.jsonl
opening /home/leo/projects/Datasets/text/GLUE/CoLA/cola_dev.tsv
opening /home/leo/projects/Datasets/text/GLUE/CoLA/cola_test.tsv
opening /home/leo/projects/Datasets/text/GLUE/MNLI/original/mnli_multinli_1.0_train.jsonl
opening /home/leo/projects/Datasets/text/GLUE/MNLI/original/mnli_multinli_1.0_dev_mismatched.jsonl
opening /home/leo/projects/Datasets/text/GLUE/MRPC/mrpc_test.tsv
opening /home/leo/projects/Datasets/text/GLUE/MRPC/mrpc_dev.tsv
saving to /home/leo/projects/Datasets/text/GLUE/CoLA/cola_dev-txt2txt.json
saving to /home/leo/projects/Datasets/text/GLUE/MRPC/mrpc_dev-txt2txt.json
saving to /home/leo/projects/Datasets/text/GLUE/MRPC/mrpc_test-txt2txt.json
saving to /home/leo/projects/Datasets/text/GLUE/CoLA/cola_test-txt2txt.json
opening /home/leo/projects/Datasets/text/GLUE/QNLI/qnli_test.tsv
opening /home/leo/projec

## SuperGLUE

In [4]:
%time process_superglue()

opening /home/leo/projects/Datasets/text/SuperGLUE/CB/sg_cb_val.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/CB/sg_cb_train.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/BoolQ/sg_boolq_test.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/COPA/sg_copa_train.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/BoolQ/sg_boolq_train.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/BoolQ/sg_boolq_val.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/CB/sg_cb_test.jsonl
saving to /home/leo/projects/Datasets/text/SuperGLUE/CB/sg_cb_val-txt2txt.json
opening /home/leo/projects/Datasets/text/SuperGLUE/COPA/sg_copa_test.jsonl
saving to /home/leo/projects/Datasets/text/SuperGLUE/COPA/sg_copa_train-txt2txt.json
saving to /home/leo/projects/Datasets/text/SuperGLUE/CB/sg_cb_train-txt2txt.json
opening /home/leo/projects/Datasets/text/SuperGLUE/ReCoRD/sg_record_val.jsonl
saving to /home/leo/projects/Datasets/text/SuperGLUE/COPA/sg_copa_test-txt2txt.jso

## SwagAF


## Universal Dependencies v2.5

In [5]:
# from preprocess_conllu import conllu_process
from preprocess_conllu import *

In [6]:
%%time
conllu_process()

CPU times: user 57.9 ms, sys: 52.1 ms, total: 110 ms
Wall time: 1min 28s


In [19]:
all_wm = get_all_files_recurse("/media/nfs/Datasets/text/WikiMatrix/")

## WikiMatrix

File structure is:
 
    v1/*.gz - 65 GB
    v1/SMALL/*.gz - 4.6 GB
    
We can use all the big files for training and the small ones for validation. Checking the files shows that they cover different language pairs, so this can be used for zero-shot learning on translation pairs.



In [6]:
from utils import *
import pickle

In [7]:
WIKIMATRIX_BASEPATH = "/media/nfs/Datasets/text/WikiMatrix/v1"
# all_files = get_all_files_recurse(WIKIMATRIX_BASEPATH)

In [3]:
# all_files = [ f for f in all_files if 'txt2txt' in f]

In [4]:
# import os

# for f in all_files:
#     os.system('rm {}'.format(f))

In [8]:
from preprocess_wikimatrix import *

I'll first erase the data I'm sure I'll not be using. The original tar file is complete, so there is no issue with deleting individual gz files even if I need them later. This frees some space, and I can start checking the rest of the data to see if there is any encoding issue with the current codebook.



In [3]:
all_files = get_all_files_recurse(WIKIMATRIX_BASEPATH)
blacklist = [b + '-' for b in BLACKLIST_LANGS] + ['-' + b for b in BLACKLIST_LANGS]
to_remove = []

for f in all_files:
    for b in blacklist:
        if b in f:
            to_remove.append(f)
            break
            

In [7]:
'war' in BLACKLIST_LANGS

False
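The check above matters because the filter uses substring matching: `'ar-'` is also a substring of any name whose first code ends in `ar` (for example a hypothetical `WikiMatrix.bar-en.tsv.gz`), and `'-as'` would also match a name containing `-ast`. A sketch of a stricter alternative, assuming the file names always follow the `WikiMatrix.<lang1>-<lang2>...` pattern seen in this notebook; `lang_pair` and `is_blacklisted` are hypothetical helpers, not part of `utils`.

```python
import os


def lang_pair(path):
    """Return the two language codes of a name like 'WikiMatrix.arz-pt.tsv.gz'."""
    stem = os.path.basename(path).split(".")[1]  # e.g. 'arz-en-txt2txtmax-384'
    src, tgt = stem.split("-")[:2]
    return src, tgt


def is_blacklisted(path, blacklist_langs):
    """True only if one of the two codes is exactly a blacklisted code."""
    return any(code in blacklist_langs for code in lang_pair(path))


toy_blacklist = {"ar", "as"}  # tiny stand-in for BLACKLIST_LANGS
for name in ["WikiMatrix.ar-en.tsv.gz", "WikiMatrix.bar-en.tsv.gz", "WikiMatrix.an-ast-charset.txt"]:
    print(name, is_blacklisted(name, toy_blacklist))
```

Comparing whole codes avoids deleting files for languages that merely share letters with a blacklisted code.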

In [8]:
sorted(blacklist)

['-ar',
 '-arz',
 '-as',
 '-azb',
 '-ba',
 '-bn',
 '-bp',
 '-ce',
 '-ceb',
 '-ckb',
 '-cv',
 '-dv',
 '-eo',
 '-ew',
 '-fa',
 '-fo',
 '-gom',
 '-gu',
 '-hi',
 '-ht',
 '-hu',
 '-hy',
 '-id',
 '-ilo',
 '-ja',
 '-jv',
 '-ka',
 '-kk',
 '-ko',
 '-ku',
 '-lmo',
 '-mh',
 '-mi',
 '-ml',
 '-mr',
 '-mwl',
 '-ne',
 '-pa',
 '-ps',
 '-py',
 '-sh',
 '-si',
 '-su',
 '-ta',
 '-te',
 '-tg',
 '-th',
 '-tk',
 '-tl',
 '-tt',
 '-ug',
 '-ur',
 '-vi',
 '-wuu',
 '-yi',
 '-zb',
 '-zh',
 'ar-',
 'arz-',
 'as-',
 'azb-',
 'ba-',
 'bn-',
 'bp-',
 'ce-',
 'ceb-',
 'ckb-',
 'cv-',
 'dv-',
 'eo-',
 'ew-',
 'fa-',
 'fo-',
 'gom-',
 'gu-',
 'hi-',
 'ht-',
 'hu-',
 'hy-',
 'id-',
 'ilo-',
 'ja-',
 'jv-',
 'ka-',
 'kk-',
 'ko-',
 'ku-',
 'lmo-',
 'mh-',
 'mi-',
 'ml-',
 'mr-',
 'mwl-',
 'ne-',
 'pa-',
 'ps-',
 'py-',
 'sh-',
 'si-',
 'su-',
 'ta-',
 'te-',
 'tg-',
 'th-',
 'tk-',
 'tl-',
 'tt-',
 'ug-',
 'ur-',
 'vi-',
 'wuu-',
 'yi-',
 'zb-',
 'zh-']

In [4]:
len(all_files), len(to_remove)

(3753, 154)

In [5]:
to_remove

['/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.arz-pt.tsv.gz',
 '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.arz-it-txt2txt.json.gz',
 '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.arz-es.tsv.gz',
 '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.arz-de-charset.txt',
 '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.arz-en-txt2txtmax-384.json.gz',
 '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.arz-en-txt2txtmax-256.json.gz',
 '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.arz-fr-txt2txt.json.gz',
 '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.arz-pt-charset.txt',
 '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.arz-en-txt2txt.json.gz',
 '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.arz-es-txt2txt.json.gz',
 '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.arz-de-txt2txt.json.gz',
 '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.arz-de.tsv.gz',
 '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.arz-en-charset.txt',
 '/media/nfs/Datasets/te

In [9]:
import os
for f in to_remove:
    os.remove(f)  # no shell involved, so unusual characters in paths are safe

In [10]:
all_files = sorted(get_all_files_recurse(WIKIMATRIX_BASEPATH))

In [11]:
len(all_files)

3599

This seems quite a big win from the pre-pre-processing.
Now I have to actually check the rest of the languages. To do this I could pick 2 or 3 samples of each language instead of checking all files. This might be faster, but it would be an issue because there might be characters of unrecognized languages in the input, so for the moment I'll process them all and check by hand later.

In [11]:
# # get all remaining language codes:
# lang_codes = set([])
# for f in all_files:
#     codes = path_leaf(f).replace("WikiMatrix.","").replace(".tsv.gz","").split("-")
#     lang_codes.update(codes)

In [13]:
# codebook_path = 'codes/adhoc-codebook-'
# f = open(codebook_path, 'rb')
# codebook, char2int, int2char = pickle.load(f)

In [24]:
# all_files[0]

'/media/nfs/Datasets/text/WikiMatrix/v1/SMALL/WikiMatrix.an-bg.tsv.gz'

In [27]:
# all_files.remove('/media/nfs/Datasets/text/WikiMatrix/v1/SMALL/WikiMatrix.q*tsv.gz')
# all_files.index('/media/nfs/Datasets/text/WikiMatrix/v1/SMALL/WikiMatrix.pt-war.tsv.gz')

844

In [None]:
# checks = []

In [28]:
# %%time
# for f in all_files[844:]:
#     checks.append(check_encoding_works(char2int, f))

Accept: True | chars set: 1323 | chars count: 536941.0000001 | unk_chars = 1682 | ratio = 0.003132560188176516  | fname = WikiMatrix.pt-war.tsv.gz
Accept: True | chars set: 749 | chars count: 745932.0000001 | unk_chars = 1554 | ratio = 0.002083299818213713  | fname = WikiMatrix.rm-ro.tsv.gz
Accept: True | chars set: 902 | chars count: 1325102.0000001001 | unk_chars = 5119 | ratio = 0.0038630988406927265  | fname = WikiMatrix.rm-ru.tsv.gz
Accept: True | chars set: 830 | chars count: 909385.0000001 | unk_chars = 1591 | ratio = 0.0017495340257424798  | fname = WikiMatrix.rm-sv.tsv.gz
Accept: True | chars set: 635 | chars count: 552455.0000001 | unk_chars = 1028 | ratio = 0.0018607850413152455  | fname = WikiMatrix.rm-tr.tsv.gz
Accept: True | chars set: 693 | chars count: 1057728.0000001001 | unk_chars = 3666 | ratio = 0.0034659194046103093  | fname = WikiMatrix.rm-uk.tsv.gz
Accept: True | chars set: 1048 | chars count: 646117.0000001 | unk_chars = 1872 | ratio = 0.0028973080726860776  | f
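`check_encoding_works` lives in the preprocessing module and is not shown here. Judging only from the log lines above, it counts the distinct characters, the total characters and the characters missing from the codebook, then accepts the file when the unknown ratio is low enough. A minimal sketch of that idea, where the function name `check_encoding`, the returned tuple and the `max_unk_ratio` threshold are all assumptions:

```python
import gzip
import os
import tempfile


def check_encoding(char2int, fname, max_unk_ratio=0.01):
    """Return (accept, distinct chars, total chars, unknown chars, ratio)."""
    charset = set()
    total = 0
    unknown = 0
    with gzip.open(fname, "rt", encoding="utf-8") as f:
        for line in f:
            charset.update(line)
            total += len(line)
            unknown += sum(1 for ch in line if ch not in char2int)
    # Guard against empty files instead of adding a small epsilon to the count.
    ratio = unknown / total if total else 0.0
    return ratio <= max_unk_ratio, len(charset), total, unknown, ratio


# Tiny self-made file: 'z' is the only character outside the toy codebook.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.tsv.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("abc\nabz\n")
toy_char2int = {ch: i for i, ch in enumerate("abc\n")}
print(check_encoding(toy_char2int, sample_path))
```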

In [37]:
accepted = [c for c in checks if c[0]]

In [38]:
failed = [c for c in checks if c[0] is False]

In [39]:
len(accepted), len(failed)

(1649, 30)

In [41]:
failed = [f[1] for f in  failed]
accepted = [f[1] for f in  accepted]

In [9]:
accepted

NameError: name 'accepted' is not defined

In [42]:
failed

[('ce', 'en'),
 ('ce', 'ru'),
 ('ce', 'uk'),
 ('cv', 'en'),
 ('cv', 'ru'),
 ('dv', 'en'),
 ('en', 'ga'),
 ('en', 'gd'),
 ('en', 'ht'),
 ('en', 'hy'),
 ('en', 'ku'),
 ('en', 'mh'),
 ('en', 'mi'),
 ('en', 'mn'),
 ('en', 'mt'),
 ('en', 'ps'),
 ('en', 'sa'),
 ('en', 'su'),
 ('en', 'tk'),
 ('en', 'wa'),
 ('en', 'wa'),
 ('fr', 'ps'),
 ('hr', 'ru'),
 ('ba', 'en'),
 ('br', 'en'),
 ('en', 'jv'),
 ('en', 'tg'),
 ('en', 'tt'),
 ('en', 'ug'),
 ('fr', 'hy')]

In [33]:
failedset = set([])
for f in failed:
    failedset.update(f)

In [34]:
accset = set([])
for a in accepted:
    accset.update(a)

In [35]:
failed = failedset.difference(accset)

In [36]:
failed

set()
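The empty set means that every language appearing in a failed pair also appears in at least one accepted pair, so no language is lost completely. A toy illustration of the same set logic with made-up pairs (not the real results), where `ce` and `ga` show up only in failed pairs:

```python
# Made-up pairs, only to illustrate the set difference used above.
accepted_pairs = [("en", "es"), ("en", "ru"), ("es", "ru")]
failed_pairs = [("ce", "ru"), ("en", "ga")]


def languages_in(pairs):
    """Collect every language code that appears in a list of pairs."""
    langs = set()
    for pair in pairs:
        langs.update(pair)
    return langs


only_failed = languages_in(failed_pairs) - languages_in(accepted_pairs)
print(sorted(only_failed))
```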

In [23]:
bl = set(['wuu', 'gom', 'lmo', 'mwl', 'ilo', 'ckb', "ar", "hi", "sh", "hu", "eo", "fo", "si",
                   "bn", "ml", "fa", "ne", "as", "azb", "ka", 'as', 'bn', 'fa', 'ka', 'ml', 'ne', 'si', 'zb',
                   # "sq", "he", maybe yes
                   # "hr", "br" ???
                   "ur", "id", "kk", "mr", "ta", "th", "hi", "zh", "ko", "tl", "vi", "te", "ja",'bp', 'ew', 'gu', 'pa', 'py'
                   ])

In [9]:
from preprocess_wikimatrix import *

In [2]:
%%time
wikimatrix_charset_process()

Failed extracting chars from /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.gl-sk-charset.txt with error: 
 Not a gzipped file (b'X2')
Failed extracting chars from /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.el-hr-charset.txt with error: 
 Not a gzipped file (b'2b')
Failed extracting chars from /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.ru-sk-charset.txt with error: 
 Not a gzipped file (b'2\xd0')
Failed extracting chars from /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.ca-he-charset.txt with error: 
 Not a gzipped file (b'2\xd7')
Failed extracting chars from /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.ca-hr-charset.txt with error: 
 Not a gzipped file (b'2b')


Failed extracting chars from /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.hr-pl-charset.txt with error: 
 Not a gzipped file (b'2b')
Failed extracting chars from /media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.az-hr-charset.txt with error: 
 Not a gzipped file (b'2b')

Failed extracting chars from 

Process ForkPoolWorker-2:
Process ForkPoolWorker-7:
Process ForkPoolWorker-3:
Process ForkPoolWorker-6:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
Process ForkPoolWorker-5:
Process ForkPoolWorker-8:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/multiprocessing/pool.py", 

KeyboardInterrupt: 

In [4]:
# import gzip
#
# def extract_charset(fname):
#     charset = set([])
#     with gzip.open(fname, 'rb') as f:
#         for txt in f:
#             txt = txt.decode('utf-8')
#             # update inside the loop so every line contributes, not only the last one
#             charset.update(txt)
#     saveto = fname.replace('.tsv.gz', '-charset.txt')
#     with gzip.open(saveto, 'wb') as f:
#         # print("saving to {}".format(saveto))
#         otxt = ''.join(list(charset)).encode('utf-8')
#         f.write(otxt)
#         f.flush()
#     return charset
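The "Not a gzipped file" errors above are all reported for `-charset.txt` files, which suggests that previously written charset files were picked up again as inputs. A runnable sketch of the extraction that skips anything that is not a `.tsv.gz` file and collects the characters of every line; writing the result as plain (non-gzipped) UTF-8 text is an assumption made here, not necessarily what `preprocess_wikimatrix` does.

```python
import gzip
import os
import tempfile


def extract_charset(fname):
    """Collect the set of characters in a gzipped TSV file and save it next to it."""
    if not fname.endswith(".tsv.gz"):
        return None  # e.g. a '-charset.txt' file produced by an earlier run
    charset = set()
    with gzip.open(fname, "rt", encoding="utf-8") as f:
        for line in f:  # stream line by line instead of readlines()
            charset.update(line)
    saveto = fname[: -len(".tsv.gz")] + "-charset.txt"
    with open(saveto, "w", encoding="utf-8", newline="") as f:
        f.write("".join(sorted(charset)))
    return charset


# Small self-made file to exercise the function.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "WikiMatrix.xx-yy.tsv.gz")  # hypothetical name
with gzip.open(demo_file, "wt", encoding="utf-8") as f:
    f.write("ab\tcd\n")
print(sorted(extract_charset(demo_file)))
```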


In [8]:
# fail1 = '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.eu-tr-charset.txt'
# fail2 = '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.es-ro-charset.txt'
# fail3 = '/media/nfs/Datasets/text/WikiMatrix/v1/WikiMatrix.ca-it-charset.txt'
# extract_charset(fail2)

In [6]:
# # obtain the entire charsets previously extracted into one file
# all_files = get_all_files_recurse(WIKIMATRIX_BASEPATH)
# all_files = [f for f in all_files if f.endswith(".txt")]

In [7]:
# len(all_files)

1208

In [8]:
# %%time
# all_chars = set([])
# errors = []
# for fname in all_files:
#     with(open(fname, 'rb')) as f:
#         try:
# #             flines = f.readlines()
#             for line in f.readlines():  # flines:
#                 chars = list(line.decode('utf-8'))
#                 all_chars.update(chars)
#         except Exception as e:
#             errors.append(e)
#             print("error processing {} with e= {}".format(fname, e))

CPU times: user 89.3 ms, sys: 81.8 ms, total: 171 ms
Wall time: 476 ms


In [9]:
# len(errors)

0

In [10]:
# len(all_chars)

521

In [11]:
# sorted(list(all_chars))

['\t',
 '\n',
 ' ',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '=',
 '?',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '\\',
 ']',
 '^',
 '_',
 '`',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '{',
 '|',
 '}',
 '~',
 '\xa0',
 '£',
 '«',
 '\xad',
 '°',
 '²',
 '³',
 'µ',
 '·',
 'º',
 '»',
 '½',
 'Á',
 'Ä',
 'Å',
 'Ç',
 'È',
 'É',
 'Í',
 'Î',
 'Ò',
 'Ó',
 'Ö',
 '×',
 'Ú',
 'Ü',
 'Þ',
 'ß',
 'à',
 'á',
 'â',
 'ã',
 'ä',
 'å',
 'æ',
 'ç',
 'è',
 'é',
 'ê',
 'ë',
 'ì',
 'í',
 'î',
 'ï',
 'ð',
 'ñ',
 'ò',
 'ó',
 'ô',
 'õ',
 'ö',
 'ø',
 'ù',
 'ú',
 'û',
 'ü',
 'ý',
 'þ',
 'Ā',
 'ā',
 'ă',
 'ą',
 'ć',
 'ċ',
 'Č',
 'č',
 'ď',
 'Đ',
 'đ',
 'ė',
 'ę',
 'ě',
 

In [10]:
%%time
wikimatrix_process()

CPU times: user 402 ms, sys: 210 ms, total: 612 ms
Wall time: 13min 18s


In [10]:
# sum([1620, 1925])  # 1620 are the complete files, 1925 are the files in the SMALL dataset

3545

In [11]:
# WIKIMATRIX_BASEPATH = "/media/nfs/Datasets/text/WikiMatrix/v1"

# allfiles = get_all_files_recurse(WIKIMATRIX_BASEPATH)

In [12]:
# t2t = [f for f in allfiles if 'txt2txt' in f]

In [13]:
# len(t2t)

3545 files processed and 3545 files existing, everything seems OK.

# Data preparation by length and task

This part checks some things that should work

In [1]:
from prepare_data import *

In [4]:
# tfile = '/home/leo/projects/Datasets/text/SuperGLUE/CB/val-txt2txt.json'
# fname = '/media/nfs/Datasets/text/WikiMatrix/v1/SMALL/WikiMatrix.ja-su-txt2txt.json.gz'

In [2]:
%%time
process()

CPU times: user 314 ms, sys: 285 ms, total: 599 ms
Wall time: 9min 37s


In [10]:
# separate_by_strlen(fname)

In [3]:
%%time
prepare_select_all()

Preparing 3532 files of max_len 512
CPU times: user 320 ms, sys: 595 ms, total: 916 ms
Wall time: 2min 45s


In [15]:
%%time
prepare_lm_data_wikimatrix()


CPU times: user 26min 52s, sys: 40 s, total: 27min 32s
Wall time: 27min 32s


In [15]:
# import gzip
# import orjson as json
# fname = '/home/leo/projects/Datasets/text/train_selected/WikiMatrix.arz-he-txt2txtmax-512.json.gz'

# f = gzip.open(fname, 'rb')

In [16]:
# flines = f.readlines()

In [17]:
# l0 = flines[0].decode('utf-8')
# import orjson as json

In [18]:
# l0 = json.loads(l0)

In [22]:
from pycountry import languages

In [23]:
languages.get(alpha_3='nds')

Language(alpha_3='nds', inverted_name='German, Low', name='Low German', scope='I', type='L')

In [14]:
%%time
# OUTPUT_FNAME = '/home/leo/projects/Datasets/text/train_selected_monofile/monofile.txt'
# json2lines(ofile=OUTPUT_FNAME)

Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.sv-wa-txt2txtmax-384.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.hr-ro-txt2txtmax-512.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.ga-he-txt2txtmax-256.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.pl-wa-txt2txtmax-256.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.et-fi-txt2txtmax-512.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.es-mk-txt2txtmax-512.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.he-sr-txt2txtmax-384.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.pt-war-txt2txtmax-512.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.oc-pl-txt2txtmax-384.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.mk-tr-txt2txtmax-512.json.gz

The output file of this step has 87M samples (lines):

    $ wc -l monofile.txt 
    87235277 monofile.txt


In [24]:
# model_vocab_sizes = [32000, 64000, 96000, 128000]
# model_prefixes = ['all_34G_32k', 'all_34G_64k', 'all_34G_96k', 'all_34G_128k']
# model_types = ['unigram', 'bpe', 'word', 'char']
# input_sentence_size = [1e6, 1e7, 273332515]

# cmd = "spm_train --input={} --vocab_size={} --input_format=tsv --model_prefix={} --model_type={} --character_coverage=0.9995"
# cmd2 = "spm_train --input={} --input_sentence_size={} --vocab_size={} --input_format=tsv --model_prefix={} --model_type={} --character_coverage=0.9995 --shuffle_input_sentence"

# commands = []
# commands2 = []
# file = OUTPUT_FNAME
# for vs in model_vocab_sizes:
#     for t in model_types:
#         for pref in model_prefixes:
#             for ss in input_sentence_size:
#                 prefix = '-'.join((t,pref))
#                 c = cmd.format(file, vs, prefix,t )
#                 commands.append(c)
#                 c2 = cmd2.format(file, int(ss), vs, prefix,t )
#                 commands2.append(c2)


In [25]:
commands

['spm_train --input=/home/leo/projects/Datasets/text/train_selected_monofile/monofile_2.txt --vocab_size=32000 --input_format=tsv --model_prefix=unigram-all_34G_32k --model_type=unigram --character_coverage=0.9995',
 'spm_train --input=/home/leo/projects/Datasets/text/train_selected_monofile/monofile_2.txt --vocab_size=32000 --input_format=tsv --model_prefix=unigram-all_34G_32k --model_type=unigram --character_coverage=0.9995',
 'spm_train --input=/home/leo/projects/Datasets/text/train_selected_monofile/monofile_2.txt --vocab_size=32000 --input_format=tsv --model_prefix=unigram-all_34G_32k --model_type=unigram --character_coverage=0.9995',
 'spm_train --input=/home/leo/projects/Datasets/text/train_selected_monofile/monofile_2.txt --vocab_size=32000 --input_format=tsv --model_prefix=unigram-all_34G_64k --model_type=unigram --character_coverage=0.9995',
 'spm_train --input=/home/leo/projects/Datasets/text/train_selected_monofile/monofile_2.txt --vocab_size=32000 --input_format=tsv --mode
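The output above repeats the same command, because the innermost loop over `input_sentence_size` does not change `cmd`, and the model prefix is not tied to the vocabulary size (a `64k` prefix can end up with `--vocab_size=32000`). A sketch that builds one distinct command per combination, reusing only the flags from the cell above; the prefix naming scheme and the `build_spm_commands` helper are assumptions.

```python
from itertools import product

# Same flags as the commented cell above, with named placeholders.
CMD = (
    "spm_train --input={input} --input_sentence_size={sentences} "
    "--vocab_size={vocab} --input_format=tsv --model_prefix={prefix} "
    "--model_type={model_type} --character_coverage=0.9995 "
    "--shuffle_input_sentence"
)


def build_spm_commands(input_file, vocab_sizes, model_types, sentence_sizes, corpus_tag="all_34G"):
    """Return one command per (vocab size, model type, sentence size) combination."""
    commands = []
    for vocab, model_type, sentences in product(vocab_sizes, model_types, sentence_sizes):
        # Derive the prefix from the actual settings so that it never disagrees with them.
        prefix = "{}-{}_{}k-{}".format(model_type, corpus_tag, vocab // 1000, int(sentences))
        commands.append(
            CMD.format(
                input=input_file,
                sentences=int(sentences),
                vocab=vocab,
                prefix=prefix,
                model_type=model_type,
            )
        )
    return commands


commands = build_spm_commands("monofile.txt", [32000, 64000], ["unigram", "bpe"], [1e6, 1e7])
print(len(commands))
print(commands[0])
```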

SentencePiece

--input_sentence_size {} --vocab_size {} --input_format tsv --model_prefix {} --input {} --model_type {} --character_coverage=0.9995


BPEmb: Subword Embeddings in 275 Languages

BPEmb 

https://nlp.h-its.org/bpemb/
https://nlp.h-its.org/bpemb/multi/



In [1]:
import sentencepiece as spm

In [48]:
s = spm.SentencePieceProcessor()
# s.Load('/home/leo/projects/Datasets/text/sentencepiece/bpe-all_2G5_64k.model')
s.Load('/home/leo/projects/Datasets/text/sentencepiece/bpe-all_2G5_64k.model')

True

In [49]:
p = s.SampleEncodeAsPieces('New York', -1, 0.1)

In [45]:
s.EncodeAsPieces

<bound method SentencePieceProcessor.EncodeAsPieces of <sentencepiece.SentencePieceProcessor; proxy of <Swig Object of type 'sentencepiece::SentencePieceProcessor *' at 0x7f9f4b628120> >>

In [43]:
s.SampleEncodeAsPieces?

Signature: s.SampleEncodeAsPieces(input, nbest_size, alpha)
Docstring: <no docstring>
File:      ~/venv3/lib/python3.7/site-packages/sentencepiece.py
Type:      method


In [53]:
for i in range(10):
    print(s.EncodeAsPieces('吾輩は猫である'), s.EncodeAsIds('吾輩は猫である'))
    print(s.EncodeAsPieces('New York'), s.EncodeAsIds('New York'))
    print(s.SampleEncodeAsPieces('New York', -1, 0.1))

['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]
['▁', '吾輩', 'は', '猫', 'である'] [60452, 0, 60718, 0, 3515]
['▁New', '▁York'] [1176, 2076]
[]


In [35]:
s.SampleEncodeAsIds('New York', -1, 0.1)

[803, 390, 7, 62657]

In [27]:
s.DecodeIds([474, 13, 390, 776])

'New York'

In [18]:
'U+2588', chr(0x2588)

('U+2588', '█')

In [21]:
'█'

'█'

In [3]:
from pycountry import languages

In [5]:
l = languages.get(alpha_2='es')

In [6]:
l.name

'Spanish'

I don't like how SentencePiece is encoding: it fails on the examples above, and I don't want issues with single symbols from any language.

For the moment I'll redo the entire decision: the coding and the languages that we'll be able to represent. On the one hand this creates a problem, as I wanted something universally extendable; on the other hand it simplifies many things and cuts the amount of data that I'll have to use. The languages used will be mostly Western ones, based on the Latin, Greek and Cyrillic scripts.

Now sorting the datasets into train, dev and test (or train, test and validation, whatever naming you prefer).

In [4]:
import os, sys

from utils import *


In [5]:
basedir = '/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5/'
all_ud_files = get_all_files_recurse(basedir)

In [6]:
t2t_ud_files = [f for f in all_ud_files if 'text2text' in f]
train = [f for f in t2t_ud_files if '-train-' in f]
test = [f for f in t2t_ud_files if '-test-' in f]
dev = [f for f in t2t_ud_files if '-dev-' in f]
to_delete = [f for f in all_ud_files if 'charseq' in f]

In [7]:
%%time

for f in train:
    os.system("mv {} /home/leo/projects/Datasets/text/train_selected".format(f))
    
for f in dev:
    os.system("mv {} /home/leo/projects/Datasets/text/dev_selected".format(f))
    
for f in test:
    os.system("mv {} /home/leo/projects/Datasets/text/validation_selected".format(f))
    
# for f in to_delete:
#     os.system("rm {}".format(f))


CPU times: user 81.5 ms, sys: 189 ms, total: 271 ms
Wall time: 3.61 s
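The same sorting can also be sketched without shelling out to `mv`, which avoids quoting problems when a path contains spaces. The functions `group_by_split` and `move_groups` and the `SPLIT_FOLDERS` mapping are hypothetical names for this example; the markers and folder names mirror the ones used in the cells above. Unlike the list comprehensions above, each file is assigned to the first marker that matches only.

```python
import os
import shutil

# marker in the file path -> destination folder name (mirrors the cells above)
SPLIT_FOLDERS = {
    "-train-": "train_selected",
    "-dev-": "dev_selected",
    "-test-": "validation_selected",
}

def group_by_split(files):
    """Group text2text files by the split marker found in their path."""
    groups = {folder: [] for folder in SPLIT_FOLDERS.values()}
    for f in files:
        if "text2text" not in f:
            continue  # only the text2text files are selected
        for marker, folder in SPLIT_FOLDERS.items():
            if marker in f:
                groups[folder].append(f)
                break
    return groups

def move_groups(groups, target_root):
    """Move every grouped file into its split folder under target_root."""
    for folder, files in groups.items():
        dest = os.path.join(target_root, folder)
        os.makedirs(dest, exist_ok=True)
        for f in files:
            shutil.move(f, dest)
```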


In [23]:

all_wm_test_files = get_all_files_recurse(os.path.join(WIKIMATRIX_BASEPATH,'SMALL'))

In [27]:
all_wm_test_files =[ f for f in all_wm_test_files if f.endswith('.json.gz')]

In [29]:
# for f in all_wm_test_files:
#     os.system("cp {} /home/leo/projects/Datasets/text/dev_selected".format(f))

In [30]:
fnames = [path_leaf(f) for f in all_wm_test_files]

In [34]:
# %%time
# # clean from the previously copied files
# for f in fnames:
#     os.system("rm /home/leo/projects/Datasets/text/train_selected/{}".format(f.replace(".json.gz", "-langmodel.json.gz")))

CPU times: user 81.3 ms, sys: 94.3 ms, total: 176 ms
Wall time: 1.64 s


In [36]:
# cleanup of the Universal Dependencies files that we can't encode due to chosen characters in the encoding settings
tfolder = "/home/leo/projects/Datasets/text/train_selected"
dfolder = "/home/leo/projects/Datasets/text/dev_selected/"
vfolder = "/home/leo/projects/Datasets/text/validation_selected/"

all_ud_files = get_all_files_recurse(tfolder) +  get_all_files_recurse(dfolder) + get_all_files_recurse(vfolder)

ud_to_remove = []
for f in all_ud_files:
    fname = path_leaf(f)
    for bl in BLACKLIST_LANGS:
        if fname.startswith(bl):
            ud_to_remove.append(f)
            break

In [39]:
len(all_ud_files), len(ud_to_remove)

(8128, 429)

In [40]:
for f in ud_to_remove:
    os.system("rm {}".format(f))

Some of the Universal Dependencies files are not good, so a manual check would be nice. Because of the nature of the checks I only found some of them, such as the following files:

    fr_ftb-ud-test-PoS-text2text-*
    en_esl-ud-test-PoS-text2text-*
    qhe_hiencs-ud-test-PoS-text2text-*

I also found some issues in the text of the JSON files (some use an old format), so I need to do a cleanup and redo all the UD treebank processing.

So there it goes


### Language Name length
Finding the longest language name in the whole language list; this length will be the tensor space reserved for language detection in the models.

In [35]:
from pycountry import languages
langnames = [l.name for l in languages]
max(len(name) for name in langnames)


58
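One way to use that number is to reserve a fixed-size slot for the language name and pad shorter names up to it. The sketch below assumes a hypothetical helper `pad_lang_name` and a NUL padding character; neither comes from the project code, and the value 58 is simply the result computed above.

```python
MAX_LANG_NAME_LEN = 58  # longest pycountry language name, as computed above
PAD_CHAR = "\x00"       # assumed padding symbol, not taken from the project

def pad_lang_name(name, size=MAX_LANG_NAME_LEN, pad=PAD_CHAR):
    """Right-pad a language name so it fills a fixed-size slot."""
    if len(name) > size:
        raise ValueError("language name longer than the reserved space")
    return name + pad * (size - len(name))

padded = pad_lang_name("Spanish")
print(len(padded))  # 58
```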

### String corruption and Masking

In [74]:
from constants import *
from data_loader import *
import numpy as np

In [75]:
txt = "El Ministerio chino de Asuntos Exteriores defendió hoy el resultado de las elecciones presidenciales celebradas en Perú y ofreció su apoyo al nuevo gobierno del presidente Alberto Fujimori."

In [89]:
''.join(add_str_noise(txt, dup_char_prob=0.01, del_char_prob=0.005)[0])

'El MinIsterio chinO dE AsuntOs xterIores DeFEndio hooY el reSultAdo de laS elecciones preSiDEnciAles celebRaDass en PeRu Y ofrecio su apoyo al nuevo gobieRno del Presidente alberto Fujimori'
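To make the two probabilities concrete, here is a minimal sketch of character-level duplication and deletion. It is not the project's `add_str_noise`: the output above also shows case changes and dropped accents, which this sketch does not reproduce, and the name `noisy_chars` and its `rng` parameter are assumptions for the example.

```python
import random

def noisy_chars(text, dup_char_prob=0.01, del_char_prob=0.005, rng=None):
    """Return a list of characters with random duplications and deletions."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        r = rng.random()
        if r < del_char_prob:
            continue        # drop this character
        out.append(ch)
        if r < del_char_prob + dup_char_prob:
            out.append(ch)  # duplicate this character
    return out

print("".join(noisy_chars("El Ministerio chino de Asuntos Exteriores",
                          rng=random.Random(0))))
```

Drawing a single random number per character keeps deletion and duplication mutually exclusive, so each happens with exactly the probability given.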

In [61]:
# import timeit
# # code snippet to be executed only once 
# mysetup = "from data_loader import add_str_noise, generate_mask"
  
# # code snippet whose execution time is to be measured 
# mycode = 'add_str_noise("El Ministerio chino de Asuntos Exteriores defendió hoy el resultado de las elecciones presidenciales celebradas en Perú y ofreció su apoyo al nuevo gobierno del presidente Alberto Fujimori.", dup_char_prob=0.01, del_char_prob=0.005)'
  
# # timeit statement 
# print (timeit.timeit(setup = mysetup, 
#                     stmt = mycode, 
#                     number = 10000) )

In [16]:
10.8827264 / 10000

0.00108827264

In [62]:
import pickle
fname = '/home/leo/projects/mix_nlp/utf8/codes/adhoc-codebook-2112.pkl'
with open(fname, 'rb') as f:  # the file is closed once the codebook is loaded
    codebook, char2int, int2char = pickle.load(f)

In [63]:

def item2int(char):
    if char not in char2int:
        char = UNK[1]
    num = char2int[char]
    return num

def txt2tensor(txt):
    return np.array(list(map(item2int, txt)))


In [64]:
code = np.array([char2int[c] for c in txt])

In [65]:
code1 = txt2tensor(txt)

In [68]:
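# np.array_equal compares element-wise; `not False in code == code1` chains the comparisons and never checks equality
np.array_equal(code, code1)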

True

In [69]:
code

array([ 71, 110,  34,  79, 107, 112, 107, 117, 118, 103, 116, 107, 113,
        34, 101, 106, 107, 112, 113,  34, 102, 103,  34,  67, 117, 119,
       112, 118, 113, 117,  34,  71, 122, 118, 103, 116, 107, 113, 116,
       103, 117,  34, 102, 103, 104, 103, 112, 102, 107, 210,  34, 106,
       113, 123,  34, 103, 110,  34, 116, 103, 117, 119, 110, 118,  99,
       102, 113,  34, 102, 103,  34, 110,  99, 117,  34, 103, 110, 103,
       101, 101, 107, 113, 112, 103, 117,  34, 114, 116, 103, 117, 107,
       102, 103, 112, 101, 107,  99, 110, 103, 117,  34, 101, 103, 110,
       103, 100, 116,  99, 102,  99, 117,  34, 103, 112,  34,  82, 103,
       116, 217,  34, 123,  34, 113, 104, 116, 103, 101, 107, 210,  34,
       117, 119,  34,  99, 114, 113, 123, 113,  34,  99, 110,  34, 112,
       119, 103, 120, 113,  34, 105, 113, 100, 107, 103, 116, 112, 113,
        34, 102, 103, 110,  34, 114, 116, 103, 117, 107, 102, 103, 112,
       118, 103,  34,  67, 110, 100, 103, 116, 118, 113,  34,  7

In [98]:
msk, txt = generate_mask(code)

In [99]:
''.join([int2char[i] for i in msk])

'El Mi▒isterio ch▒no de Asuntos ExtȄrioresؠd▒fendió ▒oy ▒▒ re▒▒lta▒o d▒▒l▒s elecciones ▒residenciales celeb▒ada▒ ▒▒ P▒rú y ofre▒ió su▒apoys al nuevo gobierno del p▒eside▒te Alberto F▒ji▒or▒.'
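The basic idea of the masking step can be sketched as replacing a random subset of the integer codes with a mask code and keeping the original sequence as the target. This is only an approximation of `generate_mask`: the output above suggests the real function also substitutes some characters with random symbols. The name `mask_codes`, the `mask_id` argument and the 15% default rate are assumptions for the example.

```python
import random

def mask_codes(codes, mask_id, mask_prob=0.15, rng=None):
    """Replace a random subset of integer codes with mask_id.

    Returns the masked sequence and an untouched copy of the original,
    which serves as the target.
    """
    rng = rng or random.Random()
    masked = [mask_id if rng.random() < mask_prob else c for c in codes]
    return masked, list(codes)

masked, target = mask_codes([71, 110, 34, 79, 107], mask_id=1,
                            rng=random.Random(0))
print(masked, target)
```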

In [1]:
from prepare_data import *

In [None]:
%%time

TRAIN_PATH = os.path.join(BASEPATH, 'train_selected')
DEV_PATH = os.path.join(BASEPATH, 'dev_selected')
VALID_PATH = os.path.join(BASEPATH, 'validation_selected')

outpath = OUTPUT_FNAME = '/home/leo/projects/Datasets/text/selected_monofile/all_tasks-{}.txt'
# outpath = OUTPUT_FNAME = '/home/leo/projects/Datasets/text/selected_monofile/glue-pos_tasks-{}.txt'
# outpath = OUTPUT_FNAME = '/home/leo/projects/Datasets/text/selected_monofile/pos_tasks-{}.txt'
opaths = [outpath.format(t) for t in ['train', 'dev', 'valid']]
paths = [TRAIN_PATH, DEV_PATH, VALID_PATH]

for fpath, ofile in zip(paths, opaths):
    jsonfile2jsonlines(paths=[fpath], ofile=ofile)

Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.sv-wa-txt2txtmax-384.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.hr-ro-txt2txtmax-512.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.ga-he-txt2txtmax-256.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/mt_mudt-ud-train-PoS-text2text-deprel.json
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.pl-wa-txt2txtmax-256.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.et-fi-txt2txtmax-512.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.es-mk-txt2txtmax-512.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.he-sr-txt2txtmax-384.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.pt-war-txt2txtmax-512.json.gz
Processing: /home/leo/projects/Datasets/text/train_selected/WikiMatrix.oc-pl-txt2txtmax-384.json

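The lines of each dataset file are then shuffled to avoid ordering issues (the files above are written one source file after another). The shuffle is random and done from the console with `shuf`: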

In [None]:
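cmd = "shuf {} > {}"
mono_dir = "/home/leo/projects/Datasets/text/selected_monofile"
# os.listdir returns bare names, so join them with the folder to get usable paths
files = [os.path.join(mono_dir, f) for f in os.listdir(mono_dir)]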

In [None]:
files

In [None]:
%%time
for f in files:
    os.system(cmd.format(f, f.replace(".txt", ".shuf.txt")))
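For reference, the shuffle step can also be sketched in pure Python, which helps on machines without `shuf`. The function name `shuffle_lines` and its `seed` parameter are assumptions for this example, and it loads the whole file in memory, so it only suits files that fit in RAM.

```python
import random

def shuffle_lines(src_path, dst_path, seed=None):
    """Write the lines of src_path to dst_path in random order."""
    with open(src_path, encoding="utf-8") as src:
        lines = src.readlines()
    # make sure the last line ends with a newline so lines don't merge
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    random.Random(seed).shuffle(lines)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(lines)
```

Passing a fixed `seed` makes the shuffle reproducible, which the console command above does not guarantee.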


In [None]:
for f in files:
    os.system("rm {}".format(f))

In [1]:
from torch.utils.data import DataLoader
from data_loader import *


In [2]:
import orjson 
import json
import os, sys

In [3]:
import pickle
fname = '/home/leo/projects/mix_nlp/utf8/codes/adhoc-codebook-2112.pkl'
with open(fname, 'rb') as f:  # the file is closed once the codebook is loaded
    codebook, char2int, int2char = pickle.load(f)

fpath = OUTPUT_FNAME = '/home/leo/projects/Datasets/text/selected_monofile/glue-pos_tasks-dev.txt'

In [4]:
dataset = Txt2TxtDataset([fpath], char2int)

In [5]:
loader = DataLoader(dataset, batch_size=2)

In [6]:
loader

<torch.utils.data.dataloader.DataLoader at 0x7f6ac842b518>

In [8]:
for d in dataset:
    print(d)
    break

(array([3.534e+03, 7.900e+01, 8.400e+01, 8.200e+01, 6.900e+01, 3.400e+01,
       7.100e+01, 1.150e+02, 1.190e+02, 1.070e+02, 1.200e+02, 9.900e+01,
       1.100e+02, 1.100e+02, 1.030e+02, 1.120e+02, 1.010e+02, 1.230e+02,
       2.000e+00, 1.180e+02, 1.060e+02, 1.030e+02, 3.400e+01, 1.020e+02,
       1.030e+02, 7.800e+01, 1.030e+02, 1.050e+02, 9.900e+01, 1.180e+02,
       1.030e+02, 1.170e+02, 3.400e+01, 1.170e+02, 2.600e+01, 1.070e+02,
       1.020e+02, 3.400e+01, 1.160e+02, 9.900e+01, 1.070e+02, 1.170e+02,
       7.500e+01, 1.120e+02, 1.050e+02, 2.600e+01, 2.600e+01, 2.600e+01,
       1.020e+02, 3.400e+01, 2.600e+01, 1.070e+02, 1.170e+02, 1.180e+02,
       1.070e+02, 1.160e+02, 1.000e+02, 8.700e+01, 8.600e+01, 1.070e+02,
       2.600e+01, 7.300e+01, 3.400e+01, 1.040e+02, 8.700e+01, 1.120e+02,
       1.020e+02, 1.170e+02, 3.400e+01, 1.060e+02, 6.700e+01, 1.170e+02,
       3.400e+01, 1.000e+02, 1.030e+02, 2.600e+01, 2.600e+01, 3.400e+01,
       1.010e+02, 1.130e+02, 1.110e+02, 1.140e+02,




In [7]:
for l in loader:
    print(l)
    break

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 426 and 410 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:612
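The error above comes from the default collate step trying to stack samples of different lengths (426 and 410) into one tensor. A common way around it is to pad every sample in a batch to the length of the longest one. The sketch below shows only the padding logic in plain Python; `pad_batch` and the padding code `0` are assumptions, and the project may reserve a different code for padding.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length integer sequences to the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[71, 110, 34], [79, 107]])
print(batch)  # [[71, 110, 34], [79, 107, 0]]
```

A function along these lines, with the padded lists converted to tensors, could be passed to `DataLoader` through its `collate_fn` argument.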

In [16]:
iter