Data setup: WMT preprocessing in data_setup.py #104

@priyakasimbeg

The WMT datasets currently have to be downloaded manually, and there is no documentation with instructions.

Description

The WMT workload requires the `wmt17_translate/de-en` and `wmt14_translate/de-en` datasets, which have to be downloaded manually with `tfds.load`. The download can fail along the way: for example, when downloading `wmt14_translate/de-en` I ran into `ResourceExhausted` errors during the generating step. The solution was to manually increase the open file limit on the system as described here.
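As a rough illustration of the open file limit workaround, the soft limit can also be raised from inside the Python process before calling `tfds.load`. This is only a sketch and not necessarily the same fix as the one linked above: the helper name `raise_open_file_limit` and the target of 65536 are assumptions, it relies on the Unix-only standard-library `resource` module, and it cannot go past the hard limit without elevated privileges.

```python
import resource


def raise_open_file_limit(target=65536):
    """Try to raise the soft open-file limit of this process towards `target`.

    Hypothetical helper; `target` is an arbitrary example value.
    Returns the soft limit in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process may only raise the soft limit up to the hard limit.
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        new_soft = min(target, hard)

    if soft != resource.RLIM_INFINITY and new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            # Some systems cap the value below the reported hard limit;
            # in that case the existing limit is left untouched.
            pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Calling `raise_open_file_limit()` at the top of the session that runs `tfds.load` affects only that process and its children. If the hard limit itself is too low, it still has to be raised at the system level.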

Steps to Reproduce

VM: tf-agents/in-to-win
In a Python terminal, run `tfds.load('wmt14_translate/de-en')`.

Source or Possible Fix

We should either:

  • Document instructions to download the datasets for the WMT workload clearly, including workarounds for anticipated issues.
  • Automate the downloading of the data, perhaps as a prerequisite step.
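The automation option could look roughly like the sketch below, which loops over the two datasets named in the description and calls `tfds.load` for each. The function name `download_wmt`, the `data_dir` argument, and the injectable `load_fn` parameter are assumptions made for illustration, not existing code in `data_setup.py`.

```python
# Datasets named in this issue as required by the WMT workload.
WMT_DATASETS = ("wmt17_translate/de-en", "wmt14_translate/de-en")


def download_wmt(data_dir, datasets=WMT_DATASETS, load_fn=None):
    """Download and prepare each WMT dataset into `data_dir`.

    Hypothetical helper. `load_fn` defaults to `tfds.load` and exists only
    so the loop can be exercised without TensorFlow Datasets installed.
    Returns the list of dataset names that were processed.
    """
    if load_fn is None:
        # Imported lazily so the module can be imported without tfds.
        import tensorflow_datasets as tfds
        load_fn = tfds.load

    processed = []
    for name in datasets:
        # tfds.load downloads and prepares the dataset if it is not
        # already present under data_dir.
        load_fn(name, data_dir=data_dir)
        processed.append(name)
    return processed
```

A setup script could call `download_wmt('/path/to/data')` once as a prerequisite step. It would still hit the `ResourceExhausted` issue described above unless the open file limit has been raised first.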

Metadata

Labels

🚀 Launch Blocker: Issues that are blocking launch of benchmark
