
Document translation re revised #59

Open
HaukurPall wants to merge 20 commits into main from document-translation-re-revised

Conversation

HaukurPall (Collaborator):

A very large refactoring of the document translation code to:

  • Make it multilingual
  • Better support backtranslation data
  • Support training file inputs as prefixes

haukurb and others added 17 commits September 12, 2023 11:28
- Removing unnecessary code
- Adding plenty of comments
- Grouping task/component arguments into dataclasses
- Updating the task for fairseq version 0.12.1
- Removed interleaving caching (it was failing)
…ut to the encoding function is a string, not a list.
… translation. A large change in how the training data is loaded: now we read all the files in the data-dir and attempt to load based on a prefix string.
- Checking whether noised examples become too long (needs rework on List[examples] to make sense)
- Adding extra padding for max_seq_len when deciding if examples are too long
- Changing behaviour of max_seq_len validation in BT dataset
HaukurPall force-pushed the document-translation-re-revised branch from 21c2e4e to f7f2b89 on September 12, 2023 at 11:28
```python
# Experimental: add BT information
bt_info = self.encoder.encode("BT") if is_bt else torch.tensor([], dtype=torch.long)
with data_utils.numpy_seed(self.seed, self.epoch, index):
    insert_sep = np.random.randint(2, dtype=bool)
```
Member:

change this to "should_insert_sep"
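
A minimal sketch of the suggested rename, with the surrounding code as in the diff above:

```python
with data_utils.numpy_seed(self.seed, self.epoch, index):
    # randint(2) draws 0 or 1, i.e. a coin flip; the name now says what it decides.
    should_insert_sep = np.random.randint(2, dtype=bool)
```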


```python
# This language code handling is like the mBart-50 model and nllb-200
src_out = torch.cat(
    [torch.tensor([self.dictionary.index(src_langs[0])])]
```
Member:

I think this would be more readable with an explicit step with a named variable instead; like source_lang_code_as_bpe or something.

Member:

also applies to target language code

```python
    [torch.tensor([self.dictionary.index(tgt_langs[0])])] + tgt_out + [torch.tensor([self.dictionary.eos()])]
)
```
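
A hedged sketch of the suggested refactor, assuming src_out / tgt_out are lists of tensors at this point; the intermediate names (source_lang_code, target_lang_code, eos) are illustrative, not from the PR:

```python
# Named steps for the mBart-50 / nllb-200-style language-code handling.
source_lang_code = torch.tensor([self.dictionary.index(src_langs[0])])
target_lang_code = torch.tensor([self.dictionary.index(tgt_langs[0])])
eos = torch.tensor([self.dictionary.eos()])

src_out = torch.cat([source_lang_code] + src_out + [eos])
tgt_out = torch.cat([target_lang_code] + tgt_out + [eos])
```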

```python
if len(src_out) > self.max_seq_len or len(tgt_out) > self.max_seq_len:
```
Member:

Do we want truncation as the default behavior or is the default behavior 'undefined'?

Member:

I.e. we could supply this as a config parameter

Member:

Also not sure about this truncation strategy; is it better to remove the middle rather than the end?
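
A minimal sketch of making the strategy configurable, as these comments suggest; the truncate helper and the strategy parameter are hypothetical, not in this PR:

```python
import torch

def truncate(tokens: torch.Tensor, max_len: int, strategy: str = "end") -> torch.Tensor:
    """Truncate a 1-D token tensor to at most max_len tokens."""
    if len(tokens) <= max_len:
        return tokens
    if strategy == "end":
        # Drop the tail.
        return tokens[:max_len]
    if strategy == "middle":
        # Keep the head and the tail, drop the middle.
        head = max_len // 2
        tail = max_len - head
        return torch.cat([tokens[:head], tokens[-tail:]])
    raise ValueError(f"Unknown truncation strategy: {strategy}")
```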

```python
tgt_string = self.src_dict.string(example["target"])
print(f"{self.encoder.bpe.decode(src_string)}")
print(f"{self.encoder.bpe.decode(tgt_string)}")
print()
```
Member:

This seems to be superseded by log_example and could be removed?

```python
return np.cumsum(lengths) - lengths
```
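
For reference, this is the exclusive prefix sum: each item's start offset given the array of lengths. A quick illustration:

```python
import numpy as np

lengths = np.array([3, 5, 2])
np.cumsum(lengths)            # array([ 3,  8, 10])
np.cumsum(lengths) - lengths  # array([0, 3, 8]) -- the start offset of each item
```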


```python
class KEYS:
```
Member:

KEYS and many of the other definitions here should probably be moved to a separate file.

```python
data_dir_path = paths[(epoch - 1) % len(paths)]

# if a split contains a comma, we should crash - since that is no longer supported
assert "," not in split, "Split should not contain a comma"
```
Member:

But when split contains the CLI input of --train-subset, it is allowed to contain commas according to the function documentation.
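
If comma-separated subsets should be supported here, one illustrative way to reconcile this (the load_single_split helper and datasets list are hypothetical, purely for sketching):

```python
# Handle each comma-separated subset from --train-subset instead of asserting.
for sub_split in split.split(","):
    datasets.append(self.load_single_split(sub_split, data_dir_path))
```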

file_type = "src"
elif file_type == lang2:
file_type = "tgt"
return {
Member:

This would be better as a namedtuple (even if the namedtuple constructor is defined locally).

"tgt_path": tgt_dataset["path"],
"align_path": align_dataset["path"] if align_dataset is not None else None,
"is_bt": datasets[0]["name"].startswith(self.cfg.bt_subset),
}
Member:

This should also be a namedtuple.

Member:

Or dataclass
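
A minimal sketch of the dataclass alternative; the DatasetPaths name and the src_path field are assumptions based on the visible keys:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DatasetPaths:
    src_path: str
    tgt_path: str
    align_path: Optional[str]
    is_bt: bool

# Illustrative construction replacing the dict literal above:
example = DatasetPaths(
    src_path="data/train.src",
    tgt_path="data/train.tgt",
    align_path=None,
    is_bt=False,
)
```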


```python
if len(datasets) != 1:
    parallel_datasets = [dataset for dataset in datasets if not dataset.is_bt]
    bt_datasets = [dataset for dataset in datasets if dataset.is_bt]
```
Member:

Could this not be done inside the constructor of IndexedParallelBTDocumentsDataset?
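
A hedged sketch of that suggestion: let the constructor do the partition (only the partition logic is shown; other constructor arguments are omitted):

```python
class IndexedParallelBTDocumentsDataset:
    def __init__(self, datasets, *args, **kwargs):
        # Partition parallel vs. backtranslation datasets here, not at the call site.
        self.parallel_datasets = [d for d in datasets if not d.is_bt]
        self.bt_datasets = [d for d in datasets if d.is_bt]
```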

```python
else:
    dataset = datasets[0]

dataset.set_epoch(1)
```
Member:

Add comment that this can take a while since it causes an interleave datasets call.
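
In place, the suggested comment could look like:

```python
# Note: this can take a while, since it triggers an interleave-datasets call.
dataset.set_epoch(1)
```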
