
What inputs does a QUETCH model take? #2

Open
warlock2k opened this issue Apr 12, 2020 · 3 comments
warlock2k commented Apr 12, 2020

Could you point me to relevant documentation? If not, would you be kind enough to explain how QUETCH works with the WMT dataset and what kind of inputs are required? The documentation available online is vague and unclear.

juliakreutzer (Owner) commented

Hi @warlock2k, the pre-processing is described in the README. It works with data provided by WMT14 and WMT15. If the data format has changed since then, you need to adjust it accordingly. The additional pre-processing that is mentioned there uses the preprocessing scripts of the Moses decoder and fast-align for token alignments.
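
As a side note, fast-align expects one sentence pair per line in the form `source tokens ||| target tokens`. A minimal sketch of producing that input from tokenized, lowercased parallel files (the function name and file names here are illustrative, not part of this repo):

```python
# Illustrative sketch: build fast-align input from parallel files.
# fast-align reads one sentence pair per line: "source ||| target".
def write_fast_align_input(src_path, tgt_path, out_path):
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(out_path, "w", encoding="utf-8") as out:
        for s, t in zip(src, tgt):
            out.write(f"{s.strip()} ||| {t.strip()}\n")

# Hypothetical file names, one tokenized sentence per line:
write_fast_align_input("train.source.lc", "train.target.lc", "train.src-tgt")
```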

Please note that this implementation is based on a Theano version from five years ago, so I don't know whether it will work with newer versions.
For an up-to-date implementation in PyTorch, please use OpenKiwi.

warlock2k (Author) commented

Thanks for the response. However, from a consumer's perspective (please correct me if I am wrong):

One needs to use WMT data (source sentences and target translations) to train QUETCH, producing a model, and then apply this model to real MT output to generate a result.

I wanted to know what this result contains: a tag file with OK and BAD tags?

juliakreutzer (Owner) commented Apr 16, 2020

Hi @warlock2k, one needs WMT QE data (source sentences, target sentences) as provided in the shared task, and token alignments, preprocessed as described in the README:

- Training source data, lowercased (`WMT15-data/task2_en-es_train_comb/train.source.lc.comb`): `0 0 we *`, i.e. the sentence id, the word id, the source word, and a placeholder.

- Training target data, combined with features, lowercased (`WMT15-data/task2_en-es_train_comb/train.target.lc.comb.feat`): `0 0 sólo OK 6.0 5.0 1.2 sólo start utilizamos only we use 0 0 1 0 0`, i.e. sentence id, word id, target word, word-level label, and features (here: the WMT15 baseline features). The features are optional and not required for the QUETCH model.

- Source-to-target alignments (`WMT15-data/task2_en-es_train_comb/train.align`): `0	1-0 2-1 3-2 4-3 5-4`, i.e. the sentence id, separated by a tab from the source-target alignment index pairs.
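
To make the field layout concrete, here is a minimal reader sketch for these three files, assuming exactly the formats shown above (function names and paths are illustrative, not part of this repo):

```python
# Illustrative readers for the three preprocessed files described above.
def read_source(path):
    """Yield (sent_id, word_id, word) from e.g. train.source.lc.comb."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            sent_id, word_id, word, _placeholder = line.split()[:4]
            yield int(sent_id), int(word_id), word

def read_target(path):
    """Yield (sent_id, word_id, word, label, features) from e.g. train.target.lc.comb.feat."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            sent_id, word_id, word, label = fields[:4]
            features = fields[4:]  # optional WMT15 baseline features
            yield int(sent_id), int(word_id), word, label, features

def read_alignments(path):
    """Yield (sent_id, [(src_idx, tgt_idx), ...]) from e.g. train.align."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            sent_id, pairs = line.rstrip("\n").split("\t")
            align = [tuple(map(int, p.split("-"))) for p in pairs.split()]
            yield int(sent_id), align
```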

For testing, every MT output has to be processed in the same way. For each of the tokens, QUETCH will predict OK or BAD.
The exact output format is specified here: https://github.com/juliakreutzer/quetch/blob/master/src/QUETCH.py#L75 and here https://github.com/juliakreutzer/quetch/blob/master/src/QUETCH.py#L104, depending on the task (WMT14 or WMT15).
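
For orientation only, a sketch of what dumping per-token predictions to a tab-separated file might look like; the authoritative formats depend on the task and are defined at the `QUETCH.py` lines linked above, so treat the columns below as an assumption:

```python
# Illustrative only: write one prediction per line as
# sentence id <TAB> word id <TAB> word <TAB> OK/BAD.
def write_predictions(tokens, labels, out_path):
    with open(out_path, "w", encoding="utf-8") as out:
        for (sent_id, word_id, word), label in zip(tokens, labels):
            out.write(f"{sent_id}\t{word_id}\t{word}\t{label}\n")
```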
