# Data Preparation

This notebook develops the data preparation for text-to-text learning on supervised datasets (as in T5 from Google). It extends the T5 approach to more tasks and is developed with PyTorch.

The source code is open-sourced.

The processed text will be published when/if I get the resources to host it in the open (the data volumes are large).



## Dataset preparation

One of the ideas of this process is to do as little pre-processing as possible and use the rawest text available. Uppercase letters, punctuation and other symbols carry information that some pre-processing steps discard. This might not be too problematic for English or some other languages, but it certainly is for German, where every noun is capitalized (and it might be for others).

Because of this, many of the available pre-processed (tokenized) datasets are discarded and the data preparation is done from raw data (for example, for the GLUE and SuperGLUE benchmarks).

Data preparation would be much faster with Scala on Spark than with Python, but for ease of portability and usage I'll be using Python. Also, the data preparation is a one-off: there is no need to re-process once it is done.

Nevertheless, even when working with Python, choosing the right libraries matters. This is why for JSON we choose [orjson](https://github.com/ijl/orjson). For CSV there seems to be a [faster library](https://github.com/juancarlospaco/faster-than-csv), but it does not have many users or much of a community, so we stay with the standard csv library, which is the fastest remaining option.
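As a minimal sketch of that choice, the snippet below reads a tab-separated file with the standard `csv` module and serializes the rows with orjson. The helper names (`read_tsv`, `dumps`) and the sample rows are made up for illustration. `csv.QUOTE_NONE` is an assumption, chosen so that quote characters in the raw text are left untouched, and the fallback to the standard `json` module is only there so the sketch runs when orjson is not installed.

```python
import csv
import io

try:
    import orjson

    def dumps(obj):
        # orjson.dumps returns UTF-8 encoded bytes
        return orjson.dumps(obj)
except ImportError:
    import json

    def dumps(obj):
        # Fallback so the sketch runs without orjson; also returns bytes
        return json.dumps(obj, ensure_ascii=False).encode("utf-8")


def read_tsv(handle):
    """Yield rows from a tab-separated file without interpreting quotes."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Hypothetical two-line sample standing in for a real .tsv file
sample = io.StringIO('1\tHe said "Guten Tag".\n0\tÜber den Wolken\n')
rows = list(read_tsv(sample))
payload = dumps([{"label": r[0], "text": r[1]} for r in rows])
```

With `QUOTE_NONE` the double quotes, the capital letters and the umlaut all reach the output unchanged, which matches the goal of keeping the text as raw as possible.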

### Text Task Description

In the original T5 paper the tasks are described in English and with a single representation, for example:
 
    Source String: "translate {}"
    Target String: "to {}"
 
In this work we add a few variations to this. The first variation is that the task will be described in multiple languages, starting with:

* English
* Spanish
* French
* German

TODO: The second change is that, instead of a single description of each task, there will be several, and one of them will be chosen at random.

Examples for language translation:
 
    " Cómo se dice: {} en {} ?"
    " Cómo se escribe: {} en {} ?"
    " Escribe: {} en {} ?"
    " Traducir: {} al {}."
    " Por favor traduce: {} al {}"
    " Traduce: {} al {}"

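Since the random choice of descriptions is still a TODO, the following is only a sketch of how it could work. The Spanish templates are copied from the examples above. The English templates, the dictionary `TRANSLATION_TEMPLATES` and the function `describe_translation_task` are hypothetical names introduced for illustration.

```python
import random

# Spanish templates are copied from the examples above;
# the English ones are hypothetical placeholders.
TRANSLATION_TEMPLATES = {
    "es": [
        " Cómo se dice: {} en {} ?",
        " Cómo se escribe: {} en {} ?",
        " Escribe: {} en {} ?",
        " Traducir: {} al {}.",
        " Por favor traduce: {} al {}",
        " Traduce: {} al {}",
    ],
    "en": [
        " Translate: {} to {}",
        " How do you say: {} in {} ?",
    ],
}


def describe_translation_task(text, target_language, description_language, rng=random):
    """Pick one template at random and fill in the text and the target language."""
    template = rng.choice(TRANSLATION_TEMPLATES[description_language])
    return template.format(text, target_language)


rng = random.Random(0)  # seeded only to make the sketch reproducible
example = describe_translation_task("Good morning", "alemán", "es", rng)
```

Passing the random generator as an argument keeps the choice reproducible when needed, while the default falls back to the module-level `random`.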


## Datasets List to process/analyze

* ~~MUSE~~ Issue downloading data, only multilang dictionaries available
* GLUE
    - [CoLA](https://nyu-mll.github.io/CoLA/); [Neural Network Acceptability Judgments ](https://arxiv.org/abs/1805.12471); [Source Code](https://github.com/nyu-mll/CoLA-baselines)
    - [MNLI](https://www.nyu.edu/projects/bowman/multinli/); [Paper](https://arxiv.org/abs/1704.05426); [Baseline](https://github.com/nyu-mll/multiNLI/blob/master/README.md)
    - MRPC [Paper](https://pdfs.semanticscholar.org/13d7/cbe9035abbb0f243a5e63e19d9c01bcf69d8.pdf); [Original Dataset](https://www.microsoft.com/en-us/download/details.aspx?id=52398&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F607d14d9-20cd-47e3-85bc-a2f65cd28042%2F)
    - QNLI [Paper](https://www.nyu.edu/projects/bowman/glue.pdf) 
    - QQP
    - RTE
    - SNLI
    - SST-2
    - STS-B
    - WNLI
* [SuperGLUE](https://w4ngatang.github.io/static/papers/superglue.pdf) 
    - BoolQ
    - CB
    - COPA
    - MultiRC
    - ReCoRD
    - RTE
    - WiC
    - WSC
* [XNLI](https://github.com/facebookresearch/XNLI) <- this one is interesting
* UD-Treebank v2.5 <- this one is interesting
* [SWAG](http://rowanzellers.com/swag/); [Paper](https://arxiv.org/abs/1808.05326); [Source Code](https://github.com/rowanz/swagaf)
* [WikiMatrix](https://ai.facebook.com/blog/wikimatrix/); [Paper](https://arxiv.org/abs/1907.05791); [Github](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix)
* ~~[SETimes](http://nlp.ffzg.hr/resources/corpora/setimes/)~~ No need for it, WikiMatrix and UD-Treebank already provide many samples
* Tatoeba: WikiMatrix is nice, but this one has different kinds of phrases (questions, answers and some other things)
* [EuroParliament](http://www.statmt.org/europarl/)
* [Wikipedia Translation Dataset](http://opus.nlpl.eu/Wikipedia.php); [WikiExtractor](https://github.com/tatuylonen/wiktextract)
* [ConceptNET](http://conceptnet.io/); [Github](https://github.com/commonsense/conceptnet5/wiki) 
* [Open Multilingual WordNet](http://compling.hss.ntu.edu.sg/omw/) and [Global WordNet Association](http://globalwordnet.org/resources/wordnets-in-the-world/)


* [BabelNET](https://babelnet.org/) [Downloads](https://babelnet.org/download) seem proprietary ...
* [PanLex](https://panlex.org/) Word-level translations for many (many) language pairs. [Downloads](https://panlex.org/source-list/) and [Vocabulary](https://vocab.panlex.org/)
* [ASJP Database](https://asjp.clld.org)
* Thesaurus [Some](https://old.datahub.io/dataset/open-data-thesaurus) [links](http://vocabulary.semantic-web.at/PoolParty/wiki/OpenData) [where](https://www.thesaurus.net/) [to](https://www.powerthesaurus.org/multilingual) find

* [bAbI](https://research.fb.com/downloads/babi/); [Code on Github](https://github.com/facebook/bAbI-tasks). Although it seems that there are [issues](https://www.reddit.com/r/MachineLearning/comments/3ohkt8/i_solved_facebooks_babi_and_found_lots_of_errors/) in the [dataset](http://jamesknighton.com/2015/babi/)
* [MALMO](https://www.microsoft.com/en-us/research/project/project-malmo/) Minecraft Artificial Intelligence; [Github](https://github.com/Microsoft/malmo)

### Question Answering:

* XQuAD; [Paper](https://arxiv.org/abs/1910.11856) [Dataset](https://github.com/deepmind/xquad)
* XQA; [Paper](https://www.aclweb.org/anthology/P19-1227/)
* MLQA; [Paper](https://arxiv.org/abs/1910.07475)



## Unsupervised Datasets

* Gutenberg
* [Wiktionary](https://dumps.wikimedia.org/enwiktionary/)
* Scholarpedia
* [Wikipedia](https://dumps.wikimedia.org/)
* ArXiv
* Wikitext-2
* Wikitext-103 

## Source Code (Programming) Datasets

* 

### CoLA




## MNLI - MultiNLI Dataset

More than one task is possible because the dataset also contains the parse tree for each sentence, which is nice. So the output format of the JSON will be:

    {
        'input': "task: MNLI | Sentence 1: {} | Sentence 2: {}".format(sentence_1, sentence_2),
        'target': e['gold_label'],
        'input_sentence_1': "task: MNLI parse tree of: {}".format(sentence_1),
        'input_sentence_2': "task: MNLI parse tree of: {}".format(sentence_2),
        'parse_target_1': e['sentence1_parse'],
        'parse_target_2': e['sentence2_parse'],
    }
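The format above can be wrapped in a small conversion function, sketched below. The function name `mnli_to_txt2txt` and the sample entry are made up for illustration, and reading the sentences from the `sentence1` and `sentence2` keys is an assumption about the MultiNLI JSON lines.

```python
def mnli_to_txt2txt(e):
    """Build the text-to-text record for one parsed MultiNLI example."""
    # Assumption: the raw sentences are stored under 'sentence1' and 'sentence2'
    sentence_1 = e["sentence1"]
    sentence_2 = e["sentence2"]
    return {
        "input": "task: MNLI | Sentence 1: {} | Sentence 2: {}".format(sentence_1, sentence_2),
        "target": e["gold_label"],
        "input_sentence_1": "task: MNLI parse tree of: {}".format(sentence_1),
        "input_sentence_2": "task: MNLI parse tree of: {}".format(sentence_2),
        "parse_target_1": e["sentence1_parse"],
        "parse_target_2": e["sentence2_parse"],
    }


# Hypothetical entry; real entries come from the multinli_1.0_*.jsonl files
entry = {
    "sentence1": "The cat sleeps.",
    "sentence2": "An animal is resting.",
    "gold_label": "entailment",
    "sentence1_parse": "(ROOT (S (NP (DT The) (NN cat)) (VP (VBZ sleeps)) (. .)))",
    "sentence2_parse": "(ROOT (S (NP (DT An) (NN animal)) (VP (VBZ is) (VP (VBG resting))) (. .)))",
}
record = mnli_to_txt2txt(entry)
```

Each source entry therefore yields material for three tasks: the inference label and one parse-tree target per sentence.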

## MRPC 



This data consists of 5 columns:

    label: 0 Not equivalent, 1 semantically equivalent
    sentence 1 id
    sentence 2 id
    sentence 1 text
    sentence 2 text
    
    
    
Note that the dataset is already tokenized, meaning it is not the raw text. Nothing else will be done to the text.
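A row with these five columns could be turned into a text-to-text sample as sketched below. The input prefix, the wording of the targets in `MRPC_LABELS` and the function name `mrpc_row_to_txt2txt` are assumptions for illustration, not the format actually produced by `process_glue`. A header row, if present, would have to be skipped first.

```python
import csv
import io

# Assumption: the textual targets are free to choose; these are placeholders
MRPC_LABELS = {"0": "not equivalent", "1": "equivalent"}


def mrpc_row_to_txt2txt(row):
    """Convert one 5-column MRPC row into an input/target pair."""
    label, _id_1, _id_2, sentence_1, sentence_2 = row
    return {
        "input": "task: MRPC | Sentence 1: {} | Sentence 2: {}".format(sentence_1, sentence_2),
        "target": MRPC_LABELS[label],
    }


# Hypothetical row; the sentences stay tokenized exactly as they come
sample = io.StringIO("1\t101\t102\tThe cat sat .\tA cat was sitting .\n")
reader = csv.reader(sample, delimiter="\t", quoting=csv.QUOTE_NONE)
records = [mrpc_row_to_txt2txt(row) for row in reader]
```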

## QNLI

The dataset download contains the following columns:

    index
    Question
    Sentence
    Label - [entailment|not_entailment]


## QQP

Columns in the dataset:

    id
    qid1
    qid2
    question1
    question2
    is_duplicate



In [1]:
from preprocess import process_glue, process_superglue

In [2]:
%time process_glue()

opening /home/leo/projects/Datasets/text/GLUE/MNLI/original/multinli_1.0_dev_mismatched.jsonl
opening /home/leo/projects/Datasets/text/GLUE/CoLA/dev.tsv
opening /home/leo/projects/Datasets/text/GLUE/CoLA/test.tsv
opening /home/leo/projects/Datasets/text/GLUE/CoLA/train.tsv
opening /home/leo/projects/Datasets/text/GLUE/MRPC/dev_ids.tsv
opening /home/leo/projects/Datasets/text/GLUE/MNLI/original/multinli_1.0_dev_matched.jsonl
saving to /home/leo/projects/Datasets/text/GLUE/CoLA/dev-txt2txt.json
saving to /home/leo/projects/Datasets/text/GLUE/MRPC/dev_ids-txt2txt.json
saving to /home/leo/projects/Datasets/text/GLUE/CoLA/test-txt2txt.json
opening /home/leo/projects/Datasets/text/GLUE/MRPC/test.tsv
opening /home/leo/projects/Datasets/text/GLUE/MNLI/original/multinli_1.0_train.jsonl
opening /home/leo/projects/Datasets/text/GLUE/MRPC/dev.tsv
opening /home/leo/projects/Datasets/text/GLUE/MRPC/train.tsv
saving to /home/leo/projects/Datasets/text/GLUE/MRPC/test-txt2txt.json
opening /home/leo/pro

## SuperGLUE

In [3]:
%time process_superglue()

opening /home/leo/projects/Datasets/text/SuperGLUE/BoolQ/test.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/CB/val.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/BoolQ/val.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/CB/test.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/BoolQ/train.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/ReCoRD/val.jsonl
saving to /home/leo/projects/Datasets/text/SuperGLUE/CB/val-txt2txt.json
saving to /home/leo/projects/Datasets/text/SuperGLUE/CB/test-txt2txt.json
opening /home/leo/projects/Datasets/text/SuperGLUE/RTE/val.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/ReCoRD/train.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/CB/train.jsonl
opening /home/leo/projects/Datasets/text/SuperGLUE/ReCoRD/test.jsonl
saving to /home/leo/projects/Datasets/text/SuperGLUE/CB/train-txt2txt.json
opening /home/leo/projects/Datasets/text/SuperGLUE/RTE/test.jsonl
saving to /home/leo/projects/Datasets/tex

## SwagAF


## Universal Dependencies v2.5

In [1]:
# from preprocess_conllu import conllu_process
from preprocess_conllu import *

In [5]:
%%time
conllu_process()

CPU times: user 40.3 ms, sys: 46.1 ms, total: 86.3 ms
Wall time: 1min 39s


In [8]:
all_wm = get_all_files_recurse("/media/nfs/Datasets/text/WikiMatrix/")

## WikiMatrix

File structure is:
 
    v1/*.gz - 65 GB
    v1/SMALL/*.gz - 4.6 GB
    
We can use all the big files for training and the small ones for validation. On inspection, the two sets cover different language pairs, so this can be used for zero-shot learning on translation pairs.
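The split described above could be sketched as follows. The function names, the sample file names and the assumed line layout (score, source sentence, target sentence, separated by tabs) are assumptions for illustration, and the files are tiny synthetic ones created in a temporary directory.

```python
import gzip
import os
import tempfile


def split_train_valid(paths):
    """Send files under a SMALL directory to validation, the rest to training."""
    train, valid = [], []
    for p in paths:
        parts = os.path.normpath(p).split(os.sep)
        (valid if "SMALL" in parts else train).append(p)
    return train, valid


def read_pairs(path):
    """Yield (source, target) sentence pairs from one gzipped TSV file.

    Assumes each line is: score <TAB> source sentence <TAB> target sentence.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 3:
                yield fields[1], fields[2]


# Tiny synthetic files, only to exercise the functions
tmp_dir = tempfile.mkdtemp()
small_dir = os.path.join(tmp_dir, "SMALL")
os.makedirs(small_dir)
big_file = os.path.join(tmp_dir, "WikiMatrix.en-es.tsv.gz")
small_file = os.path.join(small_dir, "WikiMatrix.an-ca.tsv.gz")
for path in (big_file, small_file):
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("1.05\tHello\tHola\n")

train_files, valid_files = split_train_valid([big_file, small_file])
pairs = list(read_pairs(train_files[0]))
```

Reading the archives line by line keeps memory use flat, which matters with tens of gigabytes of compressed text.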



In [8]:
from utils import *

In [1]:
from preprocess_wikimatrix import *

In [None]:
%%time
wikimedia_process()

In [14]:
sum([1620, 1925])  # 1620 are the complete files, 1925 are the files in the SMALL dataset

3545

In [9]:
WIKIMATRIX_BASEPATH = "/media/nfs/Datasets/text/WikiMatrix/v1"

allfiles = get_all_files_recurse(WIKIMATRIX_BASEPATH)

In [10]:
t2t = [f for f in allfiles if 'txt2txt' in f]

In [12]:
len(t2t)

3545

3545 files processed and 3545 files existing, everything seems OK.
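The helper `get_all_files_recurse` comes from `utils` and its implementation is not shown here. A minimal stand-in for the same kind of check, assuming it simply returns every file path below a directory, could be written with `os.walk`. The name `list_files_recursive` and the synthetic directory tree are hypothetical.

```python
import os
import tempfile


def list_files_recursive(base_path):
    """Return the sorted paths of all files below base_path."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(base_path):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)


# Synthetic tree: one source file and two processed files
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "SMALL"))
for rel in ("a.tsv.gz", "a-txt2txt.json", os.path.join("SMALL", "b-txt2txt.json")):
    open(os.path.join(base, rel), "w").close()

all_files = list_files_recursive(base)
# Same filter as in the cell above: keep the processed outputs only
processed = [f for f in all_files if "txt2txt" in f]
```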

# Data preparation by length and task

This part checks that the processed samples can be grouped by length, using one of the generated files as a test case.

In [1]:
import itertools
import numpy as np
import os
import ntpath
import orjson as json

try:
    from .utils import *
except ImportError:
    # hack to solve issue with ipython executing this import
    from utils import *


In [2]:
tfile = '/home/leo/projects/Datasets/text/SuperGLUE/CB/val-txt2txt.json'

In [5]:
# read the whole file and close it before parsing
with open(tfile, 'rb') as fh:
    f = json.loads(fh.read())


In [30]:
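from collections import OrderedDict

# Upper bounds (in characters) of the length buckets
MAX_LENS = (0, 64, 128, 256, 384, 512, 768, 1024, 2048)

def _group_foo(x, arr=np.array(MAX_LENS)):
    # Longest of the input and target strings, in characters
    ml = max(len(x["input"]), len(x["target"]))
    # Index of the first bound strictly greater than ml.
    # Caveat: np.argmax returns 0 when no bound is greater, so samples of
    # 2048 characters or more would end up under key 0.
    am = np.argmax(arr > ml)
    return arr[am]

groups = {}
uniquekeys = set()
# itertools.groupby only groups consecutive items, so runs sharing a key are merged here
for k, g in itertools.groupby(f, _group_foo):
    groups.setdefault(k, []).extend(g)
    uniquekeys.add(k)

groups = OrderedDict(sorted(groups.items()))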

In [26]:

uniquekeys

{128, 256, 384, 512, 768, 1024, 2048}

In [31]:
for k,v in groups.items():
    print(k, len(v))

128 1
256 12
384 19
512 14
768 6
1024 2
2048 2
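The same bucketing can be sketched with the standard library only. This is an alternative written for illustration, not the code used above: `length_bucket` is a hypothetical name, the samples are synthetic, and clamping over-long samples to the last bucket is an assumption (`np.argmax` on an all-false mask returns index 0 instead).

```python
from bisect import bisect_right
from collections import defaultdict

MAX_LENS = (0, 64, 128, 256, 384, 512, 768, 1024, 2048)


def length_bucket(sample, limits=MAX_LENS):
    """Return the smallest limit strictly greater than the longest field."""
    longest = max(len(sample["input"]), len(sample["target"]))
    idx = bisect_right(limits, longest)
    # Assumption: samples longer than every limit go to the last bucket
    return limits[min(idx, len(limits) - 1)]


# Synthetic samples of 10, 64 and 5000 characters
samples = [
    {"input": "a" * 10, "target": "yes"},
    {"input": "b" * 64, "target": "no"},
    {"input": "c" * 5000, "target": "no"},
]

# A dict of lists avoids the consecutive-items restriction of itertools.groupby
buckets = defaultdict(list)
for s in samples:
    buckets[length_bucket(s)].append(s)
```

A sample of exactly 64 characters lands in the 128 bucket because the comparison is strict, which mirrors `arr > ml` in the cell above.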


In [34]:
groups[768]

[{'input': "What is the relationship between: A: so I don't know if I wasn't drug tested based on that or because the man who hired me didn't request the drug test, because I know that my company does drug testing on occasion. B: Right. Well, for instance, does the company you worked for before have the right or do they have the ability to say, hey, we've already drug tested her and she came up negative. A: Well, no, I don't think they can force another company to not drug test me just by saying that I didn't, I mean, and they can force another company to not drug test her",
  'target': 'Contradiction'},
 {'input': "What is the relationship between: Whether the relationship had gone beyond friendship Dalgliesh would now never know. She had, apparently, spent little of the money on herself, had been a dependable benefactress of the few eccentric charities of which she approved, had remembered them in her will, but without egregious generosity, and had left the residue of her estate to h