# HuggingFace Dataset Import

In this notebook, we'll learn how to convert parallel text files into the HuggingFace dataset format. The example uses the Esperanto (epo) and English (eng) dataset, but the same method can be applied to your own dataset.

In [1]:
# Only load_dataset is used in this notebook.
from datasets import load_dataset

# Dataset Preparation

The dataset used here is the eng-epo data from the Tatoeba Challenge. All the available language pairs can be found [here](https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/Data.md).

## Download

In [2]:
!wget https://object.pouta.csc.fi/Tatoeba-Challenge/eng-epo.tar

--2021-11-21 15:15:20--  https://object.pouta.csc.fi/Tatoeba-Challenge/eng-epo.tar
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35747840 (34M) [application/x-tar]
Saving to: ‘eng-epo.tar’


2021-11-21 15:15:26 (10.5 MB/s) - ‘eng-epo.tar’ saved [35747840/35747840]



In [3]:
!tar -xvf  'eng-epo.tar'
!gunzip 'data/eng-epo/train.src.gz'
!gunzip 'data/eng-epo/train.trg.gz'

data/eng-epo/
data/eng-epo/train.src.gz
data/eng-epo/dev.trg
data/eng-epo/train.id.gz
data/eng-epo/test.trg
data/eng-epo/test.id
data/eng-epo/dev.src
data/eng-epo/dev.id
data/eng-epo/test.src
data/eng-epo/train.trg.gz
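
Where shell commands are not available, the decompression step can be approximated with the Python standard library. The following is a minimal sketch; `gunzip_file` is a hypothetical helper name, and unlike `gunzip` it leaves the original `.gz` file in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path):
    """Decompress `<name>.gz` to `<name>` and return the new path.

    Hypothetical helper: unlike the `gunzip` command, it keeps the .gz file.
    """
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
    # Stream the data so large corpora are never held fully in memory.
    with gzip.open(gz_path, "rb") as f_in, open(out_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return out_path
```

For example, `gunzip_file("data/eng-epo/train.src.gz")` would write `data/eng-epo/train.src`.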


In [4]:
!mv 'data/eng-epo/train.src' 'data/eng-epo/train.eng'
!mv 'data/eng-epo/train.trg' 'data/eng-epo/train.epo'
!mv 'data/eng-epo/dev.src' 'data/eng-epo/dev.eng'
!mv 'data/eng-epo/test.src' 'data/eng-epo/test.eng'
!mv 'data/eng-epo/dev.trg' 'data/eng-epo/dev.epo'
!mv 'data/eng-epo/test.trg' 'data/eng-epo/test.epo'

## Preprocessing
For each split, we will create a new text file in which every line holds one source sentence and its translation, separated by a tab.

In [7]:
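def create_parallel_text_files(src_file, trg_file, new_file):
    # Read both sides in step and write "source<TAB>target" pairs, one per line.
    with open(src_file, "r", encoding="utf-8") as src, open(
        trg_file, "r", encoding="utf-8"
    ) as trg, open(new_file, "w", encoding="utf-8") as new_f:
        for src_line, trg_line in zip(src, trg):
            # Strip the newline each input line already carries; otherwise the
            # pair is split across two lines and followed by a blank line.
            new_f.write(src_line.rstrip("\n") + "\t" + trg_line.rstrip("\n") + "\n")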
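
Because `zip` stops at the shorter input, a source and target file of different lengths would silently lose the trailing lines of the longer one. A line-count check run before merging can catch this. The sketch below is one possible way to do it; `count_lines` and `check_aligned` are hypothetical helper names, not part of the original notebook.

```python
def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it into memory."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_aligned(src_file, trg_file):
    """Return the shared line count, or raise ValueError if the counts differ."""
    n_src = count_lines(src_file)
    n_trg = count_lines(trg_file)
    if n_src != n_trg:
        raise ValueError(
            f"{src_file} has {n_src} lines but {trg_file} has {n_trg}"
        )
    return n_src
```

For example, `check_aligned("data/eng-epo/dev.eng", "data/eng-epo/dev.epo")` could be called before building the merged file for the dev split.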


In [8]:
create_parallel_text_files(
    "data/eng-epo/train.eng", "data/eng-epo/train.epo", "data/eng-epo/train.eng-epo.txt"
)


In [9]:
create_parallel_text_files(
    "data/eng-epo/dev.eng", "data/eng-epo/dev.epo", "data/eng-epo/dev.eng-epo.txt"
)


In [10]:
create_parallel_text_files(
    "data/eng-epo/test.eng", "data/eng-epo/test.epo", "data/eng-epo/test.eng-epo.txt"
)


## Loading the text files into the HuggingFace Dataset
Once these files exist, they can be loaded with the `load_dataset` function from the `datasets` library and then used for training. The `text` loader reads each line of a file as one example with a single `text` column.

In [None]:
dataset = load_dataset(
    "text",
    data_files={
        "train": "data/eng-epo/train.eng-epo.txt",
        "dev": "data/eng-epo/dev.eng-epo.txt",
        "test": "data/eng-epo/test.eng-epo.txt",
    },
)
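
Each loaded example is still a single tab-separated string, so a training pipeline will usually want the two sentences in separate fields. The sketch below assumes every row has the form `english<TAB>esperanto`; `split_pair` and the column names `eng` and `epo` are hypothetical choices, not something the notebook defines.

```python
def split_pair(example):
    """Split one "english<TAB>esperanto" row into two named fields.

    Assumes the row is in the `text` column. A row without a tab yields an
    empty `epo` field.
    """
    src, _, trg = example["text"].partition("\t")
    return {"eng": src, "epo": trg}
```

The function could then be applied to every split with `dataset = dataset.map(split_pair, remove_columns=["text"])`.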