# Data Preparation

This notebook develops the data preparation for text-to-text learning on supervised datasets, in the style of T5 from Google. It extends the T5 approach to more tasks and is developed with PyTorch.

The source code is open source.

The processed text will be released when/if I get the resources to host it publicly.



## Dataset preparation

One of the ideas behind this process is to pre-process as little as possible and use the rawest text available. Uppercase letters, punctuation and other symbols carry information that is lost with some kinds of pre-processing. This might not be too problematic for English or some other languages, but it certainly is for German (and might be for others).

Because of this, many of the available pre-processed (tokenized) datasets are discarded, and the data preparation is done from raw data (for example, for the GLUE and SuperGLUE benchmarks).
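To illustrate the point about lost information, the sketch below applies a typical aggressive normalization (lowercasing and stripping punctuation) to German words that differ only by capitalization. The `normalize` helper and the word pairs are illustrative assumptions and are not taken from any of the datasets.

```python
def normalize(text: str) -> str:
    """A typical aggressive normalization: lowercase and drop punctuation."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())

# German word pairs that differ only by capitalization:
# "Morgen" (noun, "morning") vs "morgen" (adverb, "tomorrow"),
# "Sie" (formal "you") vs "sie" ("she" / "they").
pairs = [("Morgen", "morgen"), ("Sie", "sie")]

# The raw forms are distinct, but they collapse after normalization.
distinct_raw = [a != b for a, b in pairs]
collapsed = [normalize(a) == normalize(b) for a, b in pairs]
```

Keeping the raw text leaves it to the model to learn which of these distinctions matter.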

### Text Task Description

In the original T5 paper the tasks are described in English and with a single representation, for example:
 
    Source String: "translate {}"
    Target String: "to {}"
 
In this work we add a few variations to this. The first variation is that the task is described in multiple languages, starting with:

* English
* Spanish
* French
* German

The second change is that instead of a single description of each task there are several, and one of them is chosen at random.

Examples for language translation:
 
    " Cómo se dice: {} en {} ?"
    " Traducir: {} al {}."
    " Por favor traduce: {} al {}"
    " Traduce: {} al {}"

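The two variations above can be sketched as a table of templates per description language, from which one is drawn at random. This is a minimal sketch, not the notebook's actual implementation: the function name, the language codes, and the English templates are assumptions for illustration; only the Spanish templates come from the examples above.

```python
import random

# Task descriptions per language. The Spanish templates are the examples
# listed above; the English ones are hypothetical placeholders.
TRANSLATION_TEMPLATES = {
    "es": [
        " Cómo se dice: {} en {} ?",
        " Traducir: {} al {}.",
        " Por favor traduce: {} al {}",
        " Traduce: {} al {}",
    ],
    "en": [
        " Translate: {} to {}.",
        " Please translate: {} to {}",
    ],
}

def describe_translation_task(text, target_language, description_language, rng=random):
    """Pick one template at random and fill in the text and target language."""
    template = rng.choice(TRANSLATION_TEMPLATES[description_language])
    return template.format(text, target_language)

rng = random.Random(0)  # seeded only to make the sketch reproducible
source = describe_translation_task("Good morning", "alemán", "es", rng)
```

Passing the random generator explicitly keeps the sampling reproducible when preparing a dataset.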


## List of datasets to process/analyze

* ~~MUSE~~ (issue downloading the data; only multilingual dictionaries are available)
* GLUE
    - [CoLA](https://nyu-mll.github.io/CoLA/); [Neural Network Acceptability Judgments](https://arxiv.org/abs/1805.12471); [Source Code](https://github.com/nyu-mll/CoLA-baselines)
    - MNLI
    - MRPC
    - QNLI
    - QQP
    - RTE
    - SNLI
    - SST-2
    - STS-B
    - WNLI
* SuperGLUE
    - BoolQ
    - CB
    - COPA
    - MultiRC
    - ReCoRD
    - RTE
    - WiC
    - WSC
* MultiNLI
* SNLI
* XNLI
* UD-Treebank v2.5
* SWAG
* WikiMatrix

## Unsupervised Datasets

* Gutenberg
* Wiktionary
* Wikipedia
* ArXiv
* Wikitext-2
* Wikitext-103

### CoLA




In [2]:
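import torch
# Note: Field and TabularDataset live in torchtext.data only in torchtext < 0.9;
# later releases moved them to torchtext.legacy.data and then removed them.
from torchtext.data import TabularDataset, Field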


In [None]:
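cola_path = "~/"
# Source field: the raw sentence, split into characters with no other pre-processing.
cola_src = Field(sequential=True,
                 tokenize=list,  # character-level tokenization
                 include_lengths=True,  # also return the length of each sequence
                 )
# Target field: the acceptability label.
cola_tgt = Field(is_target=True)
# Columns are assumed to be: source, label, original label, sentence.
# A (name, None) entry tells TabularDataset to skip that column.
mt_train = TabularDataset(path='/path/to/file.tsv', format="tsv",
                          fields=[("source", None), ("label", cola_tgt),
                                  ("orig_label", None), ("sentence", cola_src)])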
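As a library-independent sketch of the same step, the raw CoLA file can also be read with the standard `csv` module and turned directly into text-to-text pairs. This assumes the raw file is a headerless TSV with four columns (source, label, original annotation, sentence). The `read_cola` helper, the `"cola sentence: "` prefix, and the label strings are assumptions chosen for illustration, and the two sample rows are made up rather than taken from the dataset.

```python
import csv
import io

# Hypothetical mapping from the numeric label to a target string.
LABEL_TEXT = {"0": "unacceptable", "1": "acceptable"}

def read_cola(lines):
    """Yield (source_text, target_text) pairs from raw CoLA TSV lines."""
    # QUOTE_NONE keeps quotes and apostrophes in the sentences untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip rows that do not match the assumed layout
        _source, label, _orig_label, sentence = row
        yield "cola sentence: " + sentence, LABEL_TEXT[label]

# Two made-up rows in the assumed format, used instead of the real file.
sample = io.StringIO(
    "src1\t1\t\tOur friends won't buy this analysis.\n"
    "src1\t0\t*\tThey drank the pub.\n"
)
pairs = list(read_cola(sample))
```

For the real data, an open file handle can be passed in place of `sample`; the sentence is left untouched, in line with the goal of minimal pre-processing.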