
Using jiant to run BioBERT and SciBERT on SuperGLUE tasks #1306

Closed
shi-kejian opened this issue Apr 11, 2021 · 23 comments
@shi-kejian

Hi,
As always, thank you for this amazing contribution.

I am taking Prof. Bowman's class and attempting to run BioBERT https://github.com/dmis-lab/biobert-pytorch and SciBERT https://github.com/allenai/scibert with jiant. One of our objectives for the course project is to run BioBERT and SciBERT on common NLU tasks.

As far as I understand, it should be possible to add any Transformers encoder model to jiant, but both of those models will probably require a bit of code?

I am sketching out what I'd have to do in jiant vs. plain transformers. Will using jiant create more overhead (in getting it to support those models) than just following the standard fine-tuning process in transformers?

Any suggestions and pointers will be helpful. Thank you!

@jeswan
Collaborator

jeswan commented Apr 11, 2021

I think either fine-tuning with transformers or using jiant is suitable for your task. If you decide to use jiant, I will guide you through the process. BioBERT and SciBERT are available as pretrained models here: https://huggingface.co/models. Both of the models you mentioned appear to use the BERT architecture with different pretraining objectives. I am currently working on a feature branch (https://github.com/nyu-mll/jiant/tree/js/feature/easy_add_model - should be merged to main in the next week) that is suitable for your goal. Working off the feature branch would be a good place to start. You can simply pass the model name from the Transformers Model Hub as an argument (hf_pretrained_model_name_or_path="allenai/scibert_scivocab_cased") to the simple API to fine-tune the models.

What tasks are you planning on running? The following notebook is an example of how to fine-tune on tasks currently supported by jiant: https://colab.research.google.com/github/nyu-mll/jiant/blob/master/examples/notebooks/simple_api_fine_tuning.ipynb

List of supported tasks: https://github.com/nyu-mll/jiant/blob/master/guides/tasks/supported_tasks.md
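
For reference, here is a minimal sketch of what a simple API run could look like with a Hub model name, loosely following the fine-tuning notebook above. The paths and hyperparameters are placeholders, and the downloader/RunConfiguration argument names follow the example notebooks and may differ slightly between jiant versions (the current master notebook exposes the model via a MODEL_TYPE-style variable rather than hf_pretrained_model_name_or_path):

import jiant.proj.simple.runscript as simple_run
import jiant.scripts.download_data.runscript as downloader

EXP_DIR = "./exp"          # placeholder experiment directory
DATA_DIR = "./exp/tasks"   # placeholder task-data directory

# Download the task data (task names follow jiant's supported-tasks guide).
downloader.download_data(["boolq"], DATA_DIR)

# Fine-tune a Hub model on the task via the simple API.
args = simple_run.RunConfiguration(
    run_name="scibert_boolq",
    exp_dir=EXP_DIR,
    data_dir=DATA_DIR,
    hf_pretrained_model_name_or_path="allenai/scibert_scivocab_cased",
    tasks="boolq",
    train_batch_size=16,
    num_train_epochs=3,
)
simple_run.run_simple(args)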

@jeswan jeswan assigned jeswan and shi-kejian and unassigned jeswan Apr 11, 2021
@shi-kejian
Author

shi-kejian commented Apr 12, 2021

Thank you very much Jesse for the prompt reply!

Yeah, I think using jiant will definitely save us lots of code and time.

For the MLLU course project, we aim to measure the degree to which using a different pre-training corpus changes things, both in terms of the amount of bias learned and carried into downstream applications and in terms of the trade-off with pure NLU performance. I am planning to run BERT, RoBERTa, BioBERT, and SciBERT on ALL SuperGLUE tasks, because I don't think there is a way to clearly say a priori which tasks would be best at discriminating between the models.

Back to my question: the main constraint is that I need to finish these runs by the end of this week, since a partial draft is due next Thursday, so I am afraid I will not be able to use your amazing new feature :(. However, looking at your notebook examples, I think I can also use the current main branch of jiant, right? Just by loading different models with, say, MODEL_TYPE = "roberta-base". Or did I miss anything?

Again, thank you very much. And it would be much appreciated if I could ask more follow-up questions during the process (I am planning to install jiant on NYU Greene and run the tasks). @jeswan

edit: Sorry, I accidentally removed myself as an assignee for this thread. If you need it, please reassign me (I guess that's how you keep track of the issues?). Thank you so much. @jeswan

@shi-kejian shi-kejian removed their assignment Apr 12, 2021
@jeswan
Collaborator

jeswan commented Apr 13, 2021

jiant would be well-suited for what you described. You could use the main branch of jiant but may need to make some modifications for jiant to recognize SciBERT as a BERT model. As a starting point, I would use this notebook: https://colab.research.google.com/github/nyu-mll/jiant/blob/master/examples/notebooks/jiant_Basic_Example.ipynb. You will need to change the following:

import jiant.proj.main.export_model as export_model

# Export the SciBERT checkpoint and tokenizer into jiant's model directory layout.
export_model.export_model(
    hf_pretrained_model_name_or_path="allenai/scibert_scivocab_cased",
    output_base_path="./models/scibert",
)

In the subsequent steps, you will need to modify jiant to treat allenai/scibert_scivocab_cased the same as bert (this is not required in the feature branch mentioned above). Feel free to ask any additional questions during the process.
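
(For illustration only: on the main branch, the encoder architecture is inferred from the model name, so "treating allenai/scibert_scivocab_cased the same as bert" essentially means mapping that name onto the BERT architecture wherever jiant resolves model names. The helper below is a hypothetical, self-contained sketch of that idea, not jiant's actual code; the BioBERT model name is likewise only an example.)

# Hypothetical sketch of model-name -> architecture aliasing; jiant's real
# resolution logic lives elsewhere and uses different names.
ARCHITECTURE_ALIASES = {
    "allenai/scibert_scivocab_cased": "bert",
    "dmis-lab/biobert-base-cased-v1.1": "bert",
}

def resolve_architecture(model_name: str) -> str:
    """Map a Hugging Face model name to the encoder architecture to treat it as."""
    if model_name in ARCHITECTURE_ALIASES:
        return ARCHITECTURE_ALIASES[model_name]
    # Fall back to the prefix convention used by names like "bert-base-uncased".
    return model_name.split("-")[0]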

I previously assigned the issue to you to notify you that I required your feedback. For now, I'll close this issue but please reopen the issue and comment if you have more questions.

@jeswan jeswan closed this as completed Apr 13, 2021
@shi-kejian
Author

shi-kejian commented Apr 13, 2021

Thank you very much for the resource.

I would like to quickly follow up, though. When you mentioned "modifying jiant", did you mean we need to modify the source code of jiant after we install it on Greene HPC? Or do we just need to modify, say, the example notebook to make sure all configs are correct for each model and task?
Right now I am setting up jiant (master branch) on HPC.
Thank you for your time.

edit: Or maybe I should just install the feature branch?

@jeswan
Collaborator

jeswan commented Apr 13, 2021

Modify the source code. However, if you use the feature branch above (should be merged into master within a week), you should not need to modify the source code.

@shi-kejian
Author

shi-kejian commented Apr 13, 2021

Thank you.
I think I can first run BERT and RoBERTa (already supported) using the simple fine-tuning API and get those results.
Since it's only a partial draft due next week, and we already have other finished bias-detection experiments, I think I can leave running BioBERT and SciBERT on SuperGLUE for later next week, using your amazing new feature.

Thanks again.

@shi-kejian
Author

shi-kejian commented Apr 14, 2021

Hello, I actually ran into a problem. I installed jiant v2 on Greene HPC, and I am running roberta-large on the SuperGLUE tasks through the simple API in Python. However, it says that superglue_axg and superglue_axb are not supported. Here is the screenshot:
[screenshot: unsupported task error]

Thank you very much.

@jeswan
Collaborator

jeswan commented Apr 15, 2021

Please use superglue_broadcoverage_diagnostics and superglue_winogender_diagnostics. The following discussion may also be relevant: #1255
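
For example, a full SuperGLUE task list in jiant's naming (a sketch: the short task names follow the supported-tasks guide, and the download call assumes the same downloader module as in the example notebooks):

import jiant.scripts.download_data.runscript as downloader

# jiant task names for SuperGLUE; the two diagnostic sets use the names below
# rather than superglue_axb / superglue_axg.
SUPERGLUE_TASKS = [
    "boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc",
    "superglue_broadcoverage_diagnostics",  # AX-b
    "superglue_winogender_diagnostics",     # AX-g
]

downloader.download_data(SUPERGLUE_TASKS, "./exp/tasks")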

@shi-kejian
Author

Thanks for the prompt reply!

@shi-kejian
Author

shi-kejian commented Apr 18, 2021

Dear Jesse,
Just checking in -- will the easy_add_model branch be merged to master soon, as expected?
We were simply using the Simple API.
.....
..... (previous information deprecated)
Thank you for the amazing tool!! @jeswan

Update: Sorry, I think I was confused! So previously you actually meant the feature branch IS WORKING, right? I just need to check out that branch instead of master... So I don't need to wait until it is merged to main.

Update: I was working on the easy_add_model feature branch, just using the simple API. However, it seems like the data path handling is not compatible with the current master branch.

Here are the details:
I have a directory called "biobert_base" inside /jiant.
And here is my Python code using the simple API to run BioBERT on BoolQ:

[screenshot: simple API code for BioBERT on BoolQ]

Downloading the data and model seems good:
[screenshot: data and model download output]

However, right before it starts to tokenize, I get an error:

[screenshot: path error before tokenization]

So it seems like somewhere the path gets concatenated twice? I am trying to locate which files are out of date. If this sounds/looks familiar to you, any pointers would be appreciated. Thank you so much!

@shi-kejian
Author

Hey! I just saw that the easy_add_model branch is in a PR. Congrats! @jeswan
I think the error I described above may help you resolve some conflicts. Also, I would be more than happy to help you test the latest updates. I will use several other models to run both the simple and main APIs.

Thank you so much!

@jeswan
Collaborator

jeswan commented Apr 26, 2021

@gokejian I will try BioBERT with the PR branch and try to replicate the above error!

@jeswan jeswan reopened this Apr 26, 2021
@jeswan jeswan self-assigned this Apr 26, 2021
@jeswan
Collaborator

jeswan commented Apr 26, 2021

@gokejian Fixed the issue and will push the fix to the easy_add_model branch. If you want to fix the issue locally, use the following for RUN_NAME in the simple API notebook:

# Remove forward slashes so RUN_NAME can be used as path
MODEL_NAME = HF_PRETRAINED_MODEL_NAME.split("/")[-1]
RUN_NAME = f"simple_{TASK_NAME}_{MODEL_NAME}"
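
For example, HF_PRETRAINED_MODEL_NAME = "allenai/scibert_scivocab_cased" gives MODEL_NAME = "scibert_scivocab_cased", so RUN_NAME no longer contains a slash that would otherwise be treated as a nested directory in the output path.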

@shi-kejian
Author

Thank you. Now the path issue is fixed.

However, I am still getting some errors:

This is the code for BioBERT to run BoolQ:
[screenshot: BioBERT BoolQ run code]
Error file:

[screenshot: error output]

So I think the branch might be a few commits behind master.

Hope this helps! Thank you.

@shi-kejian
Author

Yeah, I think this error applies to other tasks as well.
Here is the CB task:

[screenshot: CB task error]

@jeswan
Collaborator

jeswan commented Apr 27, 2021

Ah, I know what is going on here. I'll push a fix for this shortly (I need to add a num_labels property to the Task object or slightly change how that label count is used).

@shi-kejian
Author

shi-kejian commented Apr 28, 2021

Cool, thanks! BTW, all tasks inherit from the Task object, right? Is a local fix possible? I am trying to download the updated feature branch of jiant.

@jeswan
Collaborator

jeswan commented Apr 29, 2021

Local fix: Change line 89 in heads.py to self.out_proj = nn.Linear(hidden_size, len(task.LABELS))
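
(To show what this change amounts to, here is a minimal, self-contained sketch of a classification head whose output size is derived from the task's label list rather than a num_labels attribute. This is not jiant's exact heads.py code; the class and argument names are illustrative.)

import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Sketch of a classification head sized from the task's label list."""

    def __init__(self, task, hidden_size, dropout_prob=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout_prob)
        # len(task.LABELS) avoids depending on a num_labels attribute
        # that not every Task class defines.
        self.out_proj = nn.Linear(hidden_size, len(task.LABELS))

    def forward(self, pooled):
        x = self.dropout(pooled)
        x = torch.tanh(self.dense(x))
        return self.out_proj(self.dropout(x))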

@shi-kejian
Author

shi-kejian commented Apr 29, 2021

Thank you. Here is the update:

1. No errors for BoolQ, CB, RTE, MultiRC, or COPA.

2. Running the WiC task yields the following error, and WSC has a similar error:

[screenshot: tokenizer error on WiC]

@jeswan
Collaborator

jeswan commented Apr 29, 2021

I'll also fix this :) Another task required switching back to slow tokenizers, but it looks like fast tokenizers are still referenced.
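
(For context, a slow tokenizer can be requested explicitly through transformers; the model name below is just an example:)

from transformers import AutoTokenizer

# use_fast=False selects the slow (Python) tokenizer implementation
# instead of the Rust-based "fast" tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "allenai/scibert_scivocab_cased", use_fast=False
)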

@shi-kejian
Author

Thanks! So if local fixes are possible, please let me know. Otherwise, I will pull the latest version of the branch.

@jeswan
Collaborator

jeswan commented Apr 29, 2021

Can you try the HEAD of the feature branch? I was not able to replicate your issue with the latest commit.

@shi-kejian
Author

Oh my bad. Yes, the current HEAD is working fine. Thanks.

jeswan added a commit that referenced this issue May 4, 2021
* Update to Transformers v4.3.3 (#1266)

* use default return_dict in taskmodels and remove hidden state context manager in models.

* return hidden states in output of model wrapper

* Switch to task model/head factories instead of embedded if-else statements (#1268)

* Use jiant transformers model wrapper instead of if-else. Use taskmodel and head factory instead of if-else.

* switch to ModelArchitectures enum instead of strings

* Refactor get_output_from_encoder() to be member of JiantTaskModel (#1283)

* refactor getting output from encoder to be member function of jiant model

* switch to explicit encode() in jiant transformers model

* fix simple runscript test

* update to tokenizer 0.10.1

* Add tests for flat_strip() (#1289)

* add flat_strip test

* add list to test cases flat_strip

* mlm_weights(), feat_spec(), flat_strip() if-else refactors (#1288)

* moves remaining if-else statments to jiant model or replaces with model agnostic method

* switch from jiant_transformers_model to encoder

* fix bug in flat_strip()

* Move tokenization logic to central JiantModelTransformers method (#1290)

* move model specific tokenization logic to JiantTransformerModels

* implement abstract methods for JiantTransformerModels

* fix tasks circular import (#1296)

* Add DeBERTa (#1295)

* Add DeBERTa with sanity test

* fix tasks circular import

* [WIP] add deberta tests

* Revert "fix tasks circular import"

This reverts commit f924640.

* deberta tests passing with transformers 6472d8

* switch to deberta-v2

* fix get_mlm_weights_dict() for deberta-v2

* update to transformers 4.5.0

* mark deberta test_export as slow

* Update test_tokenization_normalization.py

* add guide to add a model

* fix test_expor_model tests

* minor pytest fixes (add num_labels for rte, overnight flag fix)

* bugfix for simple api notebook

* bugfix for #1310

* bugfix for #1306: simple api notebook path name

* squad running

* 2nd bugfix for #1310: not all tasks have num_labels property

* simple api notebook back to roberta-base

* run test matrix for more steps to compare to master

* save last/best model test fix

Co-authored-by: Jesse Swanson <js11133@nyu.edu>
@jeswan jeswan closed this as completed May 10, 2021