
Adding TAPEX to HuggingFace Transformers #6

Closed
NielsRogge opened this issue Oct 29, 2021 · 13 comments
NielsRogge commented Oct 29, 2021

Hi!

First of all, congratulations on the great paper and results! If you need any help converting the models to the HuggingFace API, let me know. My very first contribution to HuggingFace Transformers was actually TAPAS, the table question answering model from Google. We also created a table question answering task on the hub, as well as an inference widget that lets users try out TAPAS directly in the browser.

We could do the same for TAPEX! TAPEX also looks much simpler, as it's a generative model. The only thing required would be to write a conversion script (which I can help you with).
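Roughly, such a script would load the fairseq checkpoint, remap the parameter names into the HuggingFace BART layout, and save the result. A minimal sketch, with an illustrative path and a simplified key remapping (the real mapping has more cases, which Transformers' own BART conversion script covers):

import torch
from transformers import BartConfig, BartForConditionalGeneration

# Illustrative path to a fairseq TAPEX checkpoint.
fairseq_ckpt = torch.load("tapex.base/model.pt", map_location="cpu")
state = fairseq_ckpt["model"]

# fairseq stores encoder/decoder weights at the top level; HF BART nests
# them under "model.". A real script also handles a few renamed keys.
hf_state = {}
for name, tensor in state.items():
    if name.startswith(("encoder.", "decoder.")):
        hf_state["model." + name] = tensor

config = BartConfig.from_pretrained("facebook/bart-base", vocab_size=51201)
model = BartForConditionalGeneration(config)
# strict=False while iterating on the mapping; inspect the reported
# missing/unexpected keys to refine it.
missing, unexpected = model.load_state_dict(hf_state, strict=False)
model.save_pretrained("tapex-base-hf")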

Also, are you interested in joining the Microsoft organization on the hub?

Btw, the README is also very comprehensive and well written. Wish more authors did that ;)

Kind regards,

Niels
ML Engineer @ HuggingFace

SivilTaram commented Oct 29, 2021

Hi Niels! @NielsRogge Glad to hear that! Thank you very much for the kind words. I love and appreciate the great work you did integrating TAPAS into HuggingFace Transformers, which has drawn more attention to it from the community ❤️.

In fact, I have already tried converting the fairseq model checkpoints to HuggingFace Transformers and integrating them into the library. It will be a great honor for TAPEX to be integrated into 🤗 Transformers!

But there are three things slowing down the process: (i) these days I am catching up on a paper submission deadline of November 15; (ii) TAPEX is still under review, and I do not want to actively advertise it and influence the double-blind review; (iii) I have tried to train BART-large on the same dataset (e.g., WikiTableQuestions) using Transformers, but it did not give performance similar to fairseq, which confuses me. I'm still trying to figure out the reason (and may need your help in the near future :-D).

Thanks again for your attention and effort on our work! I joined the Microsoft org a few days ago.

Best,
Qian

NielsRogge commented Nov 20, 2021

Hi,

Great to hear :) So I've managed to convert the TAPEX-base checkpoint to its HuggingFace counterpart (in a BartForConditionalGeneration model). However, can you tell me a bit more about the BPE vocabulary? I see that this checkpoint has a vocab size of 51201.

Where can I find this vocabulary file? Is the same vocabulary used during pre-training and fine-tuning?
UPDATE: found them in the checkpoints themselves :)
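(For anyone following along: the vocab size can be read straight off the checkpoint's embedding matrix; the path below is illustrative.)

import torch

ckpt = torch.load("tapex.base/model.pt", map_location="cpu")  # illustrative path
emb = ckpt["model"]["encoder.embed_tokens.weight"]
print(emb.shape)  # expect something like torch.Size([51201, 768]) for the base model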

Similar to TAPAS, I'm considering making a TapexTokenizer that has the following API:

from transformers import TapexTokenizer, BartForConditionalGeneration
import pandas as pd

model_name = 'microsoft/tapex-base'
tokenizer = TapexTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
table = pd.DataFrame.from_dict(data)
queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
input_ids = tokenizer(table=table, queries=queries, padding='max_length', return_tensors='pt').input_ids

# generate answer autoregressively (this method includes parameters for beam search, top-k sampling, etc.)
outputs = model.generate(input_ids)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
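For context on what this tokenizer would do under the hood: TAPEX linearizes the table into a flat string (headers first, then the rows) before applying the regular BART BPE. A minimal sketch of that flattening step (flatten_table is a made-up name for illustration):

import pandas as pd

def flatten_table(table: pd.DataFrame) -> str:
    # TAPEX-style linearization: "col : ..." for headers, then "row i : ..." per row.
    parts = ["col : " + " | ".join(table.columns)]
    for i, row in enumerate(table.values.tolist(), start=1):
        parts.append(f"row {i} : " + " | ".join(str(cell) for cell in row))
    return " ".join(parts)

table = pd.DataFrame({'Actors': ["Brad Pitt", "George Clooney"], 'Number of movies': ["87", "69"]})
print(flatten_table(table))
# col : Actors | Number of movies row 1 : Brad Pitt | 87 row 2 : George Clooney | 69

The query is then prepended to this flattened string before encoding.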

SivilTaram commented Nov 22, 2021

@NielsRogge Hi Niels! It looks like you have found the BPE files. Right, they follow the same procedure as BART-base.

BTW, we have an initial plan to release T5-based models (or more) pre-trained with the TAPEX procedure, to facilitate the community. What model names do you think are appropriate for the current TAPEX, for future compatibility? For example, microsoft/tapex-bart-base or microsoft/tapex-base?

Thanks again for your effort!

Best Regards,
Qian

@NielsRogge

Oh great to hear!

Maybe it makes sense to include the model architecture in the name of each checkpoint, so microsoft/tapex-bart-base (and then microsoft/tapex-t5-base) sounds good.

@SivilTaram

@NielsRogge I agree with that. Thanks for your quick reply! If there is anything else I can help with, please ping me directly or email me (qian.liu@buaa.edu.cn). Any discussion about TAPEX itself is also warmly welcome! In the next few months, I will be relatively free lol.

NielsRogge commented Nov 24, 2021

Do you have time to add this model to the library?

The modeling part is done, as the model is just a BART model. The conversion notebook can be found here. However, it would be great if you could implement the TapexTokenizer.

I can help you along the way. It will also make you familiar with HuggingFace Transformers (it's like a behind-the-scenes look at the library).

Let me know if you're interested :)
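One possible shape for the tokenizer: a thin wrapper around BartTokenizer that linearizes the table and then defers to the regular BPE. A sketch under that assumption, not the implementation that was eventually merged:

import pandas as pd
from transformers import BartTokenizer

class TapexTokenizer(BartTokenizer):
    # Sketch: flatten the table, prepend each query, defer to BartTokenizer.

    @staticmethod
    def _flatten(table: pd.DataFrame) -> str:
        parts = ["col : " + " | ".join(table.columns)]
        for i, row in enumerate(table.values.tolist(), start=1):
            parts.append(f"row {i} : " + " | ".join(str(c) for c in row))
        return " ".join(parts)

    def __call__(self, table=None, queries=None, **kwargs):
        flat = self._flatten(table)
        texts = [f"{q} {flat}" for q in queries]
        return super().__call__(texts, **kwargs)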

SivilTaram commented Nov 24, 2021

@NielsRogge Sure! I'm interested in doing that (I'm an early fan of HuggingFace Transformers lol). I will add it to the library following the official guideline, which may take about a week. I'll come back here if I run into anything I can't handle, or when I think the pull request needs reviewing. Thanks in advance, Niels!

Best,
Qian

@NielsRogge

Ok great, let me set up a Slack channel with your email address so that we can communicate over there rather than here.

Is that OK with you?

@SivilTaram

@NielsRogge Sounds good to me!

vnik18 commented Dec 2, 2021

@SivilTaram Hi, I just wanted to check in on the status of adding these models to Huggingface. Thank you!

@SivilTaram

@vnik18 Hi! You can check the updated README to preview the fine-tuning script, and TAPEX may be merged into Transformers in the near future!

vnik18 commented Feb 21, 2022

@SivilTaram Thank you for letting me know!

@SivilTaram

TAPEX is merged into HuggingFace Transformers now. You can visit https://github.com/huggingface/transformers/tree/main/examples/research_projects/tapex to give it a try! Enjoy ☕.
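For reference, the merged API ended up close to the one proposed above. Usage along these lines (the checkpoint name assumes the hub's microsoft/tapex-* naming convention discussed earlier, here a WikiTableQuestions fine-tuned model):

from transformers import TapexTokenizer, BartForConditionalGeneration
import pandas as pd

name = "microsoft/tapex-base-finetuned-wtq"
tokenizer = TapexTokenizer.from_pretrained(name)
model = BartForConditionalGeneration.from_pretrained(name)

data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
table = pd.DataFrame.from_dict(data)
query = "How many movies has George Clooney played in?"

# The tokenizer flattens the table and prepends the query internally.
encoding = tokenizer(table=table, query=query, return_tensors="pt")
outputs = model.generate(**encoding)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))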
