Translate HuggingFace Dataset

This script can be used to translate any HuggingFace text dataset in any language to English (also to other languages, see Modifications). It uses Facebook's m2m100_1.2B model.

Installation

Clone this repo:

git clone https://github.com/marianna13/translate_dataset.git

And install all the required packages by running this in root directory:

pip install -r requirements.txt

Usage

To use this script run the following command in the terminal:

python translate_data.py [HuggingFace Dataset Name] [File Name] [start] [end]

For example, with this command you can translate one file from the LAION2B-multi Dataset from index 1000 to index 2000 (1000 samples at all):

python translate_data.py laion/laion2B-multi part-00000-fc82da14-99c9-4ff6-ab6a-ac853ac82819-c000.snappy.parquet 1000 2000

Modification

You can change target language (in the script tgt_lang="en"). Also you might wanna change batch_size and num_processes to make the translation more efficient.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
README.md		README.md
get_clip_embs_similarities.py		get_clip_embs_similarities.py
get_dataset_stats.py		get_dataset_stats.py
join_additional.py		join_additional.py
join_aesthetics.py		join_aesthetics.py
requirements.txt		requirements.txt
translate_data.py		translate_data.py
translate_data_nolang.py		translate_data_nolang.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

get_clip_embs_similarities.py

get_clip_embs_similarities.py

get_dataset_stats.py

get_dataset_stats.py

join_additional.py

join_additional.py

join_aesthetics.py

join_aesthetics.py

requirements.txt

requirements.txt

translate_data.py

translate_data.py

translate_data_nolang.py

translate_data_nolang.py

Repository files navigation

Translate HuggingFace Dataset

Installation

Usage

Modification

About

Releases

Packages

Contributors 2

Languages

marianna13/translate_dataset

Folders and files

Latest commit

History

Repository files navigation

Translate HuggingFace Dataset

Installation

Usage

Modification

About

Resources

Stars

Watchers

Forks

Languages