Plans to support model trained in fairseq #11
Assuming fairseq trains compatible Transformer models (same computation graph), you can write a converter to extract the weights.
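As a rough illustration of that starting point, here is a minimal sketch (not part of this project) that loads a fairseq checkpoint with PyTorch and lists its parameter names, which is the first step toward mapping them onto the CTranslate2 Transformer weights. The checkpoint path is a placeholder.

```python
# Minimal sketch, assuming a standard fairseq checkpoint that stores its
# weights under the "model" key. The path is a placeholder.
import torch

checkpoint = torch.load("checkpoint_best.pt", map_location="cpu")
state_dict = checkpoint["model"]

# Print every weight name and shape to compare against the names expected
# by the CTranslate2 Transformer specification.
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```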
Which model are you referring to? The benchmark numbers come from a Transformer Base.
Users can simply apply distillation before using this project.
File "/home/user/.local/lib/python3.6/site-packages/ctranslate2/converters/opennmt_py.py", line 25, in _load I am trying to convert fairseq based transformer model |
fairseq models are not supported; see the README:
I wonder if there are plans to support conversion to CTranslate2 from fairseq Transformer models. It is a widely used training framework and supports some new pretrained models like BART as well. @guillaumekln
We will probably get to that. Do you want to help? We should first determine the architecture differences with the current Transformer implementation (if any) and then add a converter for these checkpoints.
@guillaumekln I took a quick look at the fairseq weight names and the ones in OpenNMT-py. I noticed that OpenNMT-py models have "encoder.layer_norm.a_2" and "encoder.layer_norm.b_2", which I don't think have a corresponding mapping in fairseq models. In fairseq, there are only (using the encoder as an example)
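For illustration, a hypothetical name-translation table for those layer norm weights might look like the sketch below. It assumes OpenNMT-py's a_2/b_2 are the LayerNorm gain and bias, corresponding to the weight/bias of fairseq's nn.LayerNorm; note also that a final encoder.layer_norm only exists in fairseq checkpoints trained with encoder_normalize_before=True (see the Pre-Norm/Post-Norm discussion below).

```python
# Hypothetical mapping sketch: names on the left assume fairseq's
# nn.LayerNorm parameters, names on the right OpenNMT-py's custom
# LayerNorm gain (a_2) and bias (b_2).
FAIRSEQ_TO_OPENNMT = {
    "encoder.layer_norm.weight": "encoder.layer_norm.a_2",
    "encoder.layer_norm.bias": "encoder.layer_norm.b_2",
}

def rename(state_dict):
    # Return a copy of a fairseq state dict using OpenNMT-py-style names
    # wherever a mapping is defined, leaving other names untouched.
    return {FAIRSEQ_TO_OPENNMT.get(k, k): v for k, v in state_dict.items()}
```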
Hi, there are architectural/implementation-level differences in fairseq's Transformer, for example: @guillaumekln, how should one start writing a converter when there are this many differences?
Each weight has a unique name. That's how the order is preserved between the bin format and the core implementation.
Dropout is not used during inference.
Here you are comparing Pre-Norm Transformers (OpenNMT default) and Post-Norm Transformers (fairseq default). The converter and core implementation support both variants since this commit: 316ed26. Additionally, the V2 branch slightly improves the converter design to make it easier to implement new converters. We hope to release this version in the coming days or weeks.
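Schematically, the two variants only differ in where the layer norm sits relative to the residual connection. A minimal sketch (not the CTranslate2 implementation), where `sublayer` stands for either self-attention or the feed-forward block:

```python
def pre_norm_block(x, sublayer, layer_norm):
    # Pre-Norm (OpenNMT-py default): normalize the input first, then apply
    # the sublayer. A final layer norm is needed after the last layer, which
    # explains the extra "encoder.layer_norm" weights mentioned above.
    return x + sublayer(layer_norm(x))

def post_norm_block(x, sublayer, layer_norm):
    # Post-Norm (fairseq default, as in the original Transformer paper):
    # apply the sublayer first, then normalize the residual sum.
    return layer_norm(x + sublayer(x))
```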
Let's reopen this issue for visibility. I'm currently trying to add a converter and see if we are missing anything else.
Hi @guillaumekln, I have used this converter with the pretrained model available from the page, but generation with the extracted weights is not producing correct translations. I have verified the variable names and counts and everything seems correct. Am I missing something? I am using the recently released v2.0.0 for this change.
I found that in fairseq the first token forwarded into the decoder is the EOS token. I implemented a converter using the fairseq API to support multiple checkpoint versions (see the PR above). I ran translations with their WMT19 model and it seems to work, but more testing would be required.
@gvskalyan Can you help with testing? You can grab the latest wheels from the CI: download the artifact
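For anyone testing, a minimal smoke test could look like the sketch below. It assumes an already converted model directory and pre-tokenized input; the tokens and paths are placeholders, and result access may differ slightly across CTranslate2 versions.

```python
import ctranslate2

# Placeholder path and tokens; the input must already be tokenized with
# the same subword model used during training.
translator = ctranslate2.Translator("converted_model/")
results = translator.translate_batch([["▁Hello", "▁world"]], beam_size=5)

# On recent versions results are TranslationResult objects; on older ones
# this may instead be results[0][0]["tokens"].
print(results[0].hypotheses[0])
```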
@guillaumekln I will be able to help test a case for myself as well
Hi @guillaumekln, the translation with the latest wheel from the CI looks correct. I tested it on the fairseq WMT19 models and observed the following. Are there any other validation tests to be done?
Update (edit 1): scores with --normalize_scores --disable_early_exit
Thanks for testing! I checked further and found the output is the same with greedy search (beam_size=1) but not with beam search. There are indeed small differences in the beam search implementation, for example:
We can easily add new translation options to match the fairseq behavior, but we will probably not change the default behavior. Since the model conversion is working fine, I will merge the PR and add these options in separate PRs. I will summarize here the options you can enable to close the gap with fairseq.
With the latest commit on master, you can change the following options to get almost the same output as fairseq:
(Note that disabling early exit slightly reduces the translation speed.) The few remaining differences are mostly due to small variations in the beam search implementation. |
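In the Python API, the equivalents of the CLI flags mentioned above would look roughly like this sketch; the keyword names normalize_scores and allow_early_exit are assumed from the v2-era options and may differ by version.

```python
import ctranslate2

translator = ctranslate2.Translator("converted_model/")  # placeholder path
results = translator.translate_batch(
    [["▁Hello", "▁world"]],   # placeholder tokens
    beam_size=5,
    normalize_scores=True,    # divide scores by length, as fairseq does
    allow_early_exit=False,   # assumed equivalent of --disable_early_exit
)
```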
Hi @gvskalyan, I just saw your updated table in #11 (comment). We don't expect this difference. Did you set the same beam size in this comparison?
For reference, #911 should fix the remaining differences. |
Can you please support models trained in fairseq? Alternatively, since fairseq is PyTorch-based, can the model be imported for inference and quantized?
Also, are the reported model sizes for transformer_big? If it were transformer_base, it would be around half that.
Please consider distilling the model into a smaller one; that would help with inference speed and size.
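For reference, once the fairseq converter discussed in this thread landed, the size concern can be addressed by quantizing during conversion. A hedged sketch, assuming the v2 converter API and placeholder paths:

```python
from ctranslate2.converters import FairseqConverter

# Placeholder paths; data_dir must contain the fairseq dictionaries.
converter = FairseqConverter("checkpoint_best.pt", "data-bin/")
converter.convert("converted_model/", quantization="int8")
```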