SciBERT - NER #8

Open · ivalexander13 opened this issue Mar 19, 2021 · 15 comments
Labels: bug (Something isn't working), nlp, functional groups rejoice
@ivalexander13 commented Mar 19, 2021

Overview

We are doing this to compare SciBERT's performance on NER relative to text classification. SciBERT didn't provide a ChemProt dataset for NER, so we are taking the ChemProt dataset straight from its source (link here?) and reformatting it to fit the model's NER task.

Attempt (ongoing)

We are in the middle of converting the source ChemProt dataset: doing part-of-speech tagging on each word and tagging the relevant entities (substrate, product, and enzyme).
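For a sense of the target format, here's a minimal sketch of the span-to-BIO conversion (the token/span representation and the example sentence are hypothetical, not the actual ChemProt schema):

```python
# Hedged sketch: convert character-level entity spans into token-level BIO tags.
def to_bio(tokens, spans):
    """tokens: list of (word, start_char); spans: list of (start, end, label)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False
        for i, (word, tok_start) in enumerate(tokens):
            if start <= tok_start < end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags

tokens = [("Hexokinase", 0), ("phosphorylates", 11), ("glucose", 26), (".", 33)]
spans = [(0, 10, "ENZYME"), (26, 33, "SUBSTRATE")]
print(list(zip([w for w, _ in tokens], to_bio(tokens, spans))))
# [('Hexokinase', 'B-ENZYME'), ('phosphorylates', 'O'), ('glucose', 'B-SUBSTRATE'), ('.', 'O')]
```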

Plans

We will then run the full 75-epoch training on this dataset and see how it performs.

@mrunalimanj

Sina and I finally got our data reformatted after a couple of hours, in mar12_NER/20210326_set_up_NER_runs_with_dividers.ipynb -- the data was saved to data/ner/chemprot_sub_enzyme/clean/{dev, train, test}.txt

We ran it yesterday but keep getting low F1s, so I'm going to start looking at whether we can reuse bits and pieces of the SciBERT model to include class_weights - more coming
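(For reference, the class_weights idea is just a weighted cross-entropy over tags - a minimal PyTorch sketch with made-up tags and weights, not the actual AllenNLP wiring:)

```python
import torch
import torch.nn as nn

# Hypothetical tag set: downweight the dominant "O" tag, upweight entity tags.
tags = ["O", "B-ENZYME", "I-ENZYME", "B-SUBSTRATE", "I-SUBSTRATE", "B-PRODUCT", "I-PRODUCT"]
class_weights = torch.tensor([0.1, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# logits: (batch * seq_len, num_tags); labels: (batch * seq_len,)
logits = torch.randn(8, len(tags))
labels = torch.randint(0, len(tags), (8,))
loss = loss_fn(logits, labels)
```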

@mrunalimanj commented Apr 4, 2021

How we ran it for testing (didn't want to use compute hours):

```bash
source activate /global/home/groups/fc_igemcomp/software/scibert_env_NER
cd fc_igemcomp/2020_nlp/scibert
rm -R scripts/NER_output_26mar/
./scripts/train_allennlp_local_v3_NER_trial.sh ./scripts/NER_output_26mar/
```

@mrunalimanj commented Apr 4, 2021

Creating a new kernel:

```bash
source activate ~/fc_igemcomp/software/scibert_env_NER  # very important! can also use conda activate
# had to install ipykernel first:
conda install -p /global/home/groups/fc_igemcomp/software/scibert_env_NER ipykernel
# display name is what will show in Jupyter:
python -m ipykernel install --user --name python3.6.13_ner_scibert --display-name "Python 3.6.13 (scibert_env_NER)"
```

6:20pm: issue with some IProgress module, so ran these:

```bash
conda install -c conda-forge ipywidgets
jupyter nbextension enable --py widgetsnbextension
```

(cool! now TQDM works in-notebook)

@mrunalimanj commented Apr 4, 2021

Okay, my plan:

  • get a trained model of maybe 10 epochs
  • then get the logits out from the predict option of the finetuned model
  • and weight those accordingly to get better labels (rough sketch below)

uguguguguugugg we need to modify the loss if we want the model to LEARN these weights, though

train: `scripts/0403_train_allennlp_local_NER_few_epochs.sh scripts/NER_output_3apr/`
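Rough sketch of the reweighting step (weight values are placeholders; note this only reshapes predictions post hoc - the model itself doesn't learn anything new):

```python
import torch

# Hypothetical per-tag weights: downweight "O", upweight entity tags.
tag_weights = torch.tensor([0.1, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

def reweighted_labels(logits):
    """logits: (seq_len, num_tags) from the finetuned model's predict step."""
    # Adding log-weights to log-probs == multiplying probabilities by weights.
    log_probs = torch.log_softmax(logits, dim=-1)
    return (log_probs + tag_weights.log()).argmax(dim=-1)
```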

@mrunalimanj commented Apr 4, 2021

oof, okay, switching to local to make changes to AllenNLP - will try to set up a similar file structure on Savio and sync to GitHub

ah sike - we realized it's not bert_text_classifier that the NER setup uses, but rather the bert_crf_tagger.py file - will see if we can modify that to use class weights instead!

@mrunalimanj

kmkurn/pytorch-crf#47 is helpful; files to modify include ner_finetune.json, the allennlp CRF class, and the bert_crf_tagger.py file.

@sghandian commented Apr 6, 2021

Did some more digging into how people have fixed imbalanced-data issues in AllenNLP before. Seems like there is no generalized solution, according to this thread.

Mrunali's and my experiments with directly modifying the weights haven't made a big difference to performance so far; we might be missing something, though.

@mrunalimanj added the bug (Something isn't working), nlp, and functional groups rejoice labels on Apr 6, 2021
@mrunalimanj commented Apr 6, 2021

Looking into modifying CRFs to be weighted:

Mathy paper that basically says we should compute a double sum for the loss so that we can weight the classes: https://perso.uclouvain.be/michel.verleysen/papers/ieeetbe12gdl.pdf. Seems to have kind of decent results? Hadn't thought about L1 regularization.

[screenshot of the paper's weighted loss formulation]

From: allenai/allennlp#4619

Someone said, "I mean, I believe it can work in practice, but their theoretical motivation is not correct. If this is the case, we could do it with a much simpler approach (like weighted emission scores)." - which is what we did...: tensorflow/addons#817
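A minimal sketch of the weighted-emission-scores version using kmkurn's pytorch-crf (the weight values are placeholders; in our setup the real change would live inside bert_crf_tagger.py):

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf

num_tags = 7
crf = CRF(num_tags, batch_first=True)

# Hypothetical per-tag weights; scaling emissions changes how strongly each
# tag's score contributes to the CRF log-likelihood, without touching transitions.
tag_weights = torch.tensor([0.1, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

emissions = torch.randn(2, 10, num_tags)    # (batch, seq_len, num_tags)
tags = torch.randint(0, num_tags, (2, 10))  # gold tag sequences
loss = -crf(emissions * tag_weights, tags)  # weighted emissions into the NLL
```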

okay, I'm just going to keep a running list of updates in this comment as I find other ideas/potential implementations

{in any case can you tell how much fun I'm having with GitHub issues lmao}

@sghandian

This textbook chapter from my NLP class actually goes over what we've concluded is a good approach to this problem, which I thought was validating (i.e. NER/relation extraction + a semi-supervised approach): https://web.stanford.edu/~jurafsky/slp3/17.pdf

@ivalexander13 (author)

> This textbook chapter from my NLP class actually goes over what we've concluded is a good approach to this problem, which I thought was validating (i.e. NER/relation extraction + a semi-supervised approach): https://web.stanford.edu/~jurafsky/slp3/17.pdf

Is the semi-supervised approach the one you're/they're thinking of? It seems really cool and seems to have a decent track record, though we'd probably need to rewrite a lot of code. Do you think it's worth pursuing?

@sghandian

Yeah, take a look at 17.2.4 in there (distant supervision for relation extraction). It sounds very similar to the pattern-recognition technique we've been talking about, except it learns non-regex patterns for features (or aggregates data to be fed into the NN directly, without extracting features beforehand). The problem is that it generally has low precision, which matches the other paper we read using pattern matching, so I'm not sure what the best solution is for us.

@mrunalimanj

Tried rebalancing the data (with the 12apr/20210412 notebook + script) to remove any sentences without entities/labels of interest, but the F1 does not change considerably :(
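(The filter itself is roughly this, assuming CoNLL-style `token tag` lines with blank lines between sentences:)

```python
def keep_sentence(lines):
    # Drop sentences where every token is tagged "O" (no entities of interest).
    return any(line.split()[-1] != "O" for line in lines)

def filter_conll(in_path, out_path):
    with open(in_path) as f:
        sentences = [s.split("\n") for s in f.read().strip().split("\n\n")]
    kept = [s for s in sentences if keep_sentence(s)]
    with open(out_path, "w") as f:
        f.write("\n\n".join("\n".join(s) for s in kept) + "\n")
```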

@mrunalimanj

Praise Ivan, who modified a Hugging Face implementation (in his scratch folder, /global/scratch/ivalexander13/NLPChemExtractor/scibert-text-classification/main.ipynb, but also in /global/home/groups/fc_igemcomp/2020_nlp/scibert/apr16_huggingface_NER) - rough sketch after the TODO list below.

  • hyperparameter search doc: https://docs.google.com/spreadsheets/d/1jolvSI9tCqHZqBMtX1MAUjht2WuXyl_uFauhbvHMUtQ/edit?usp=sharing

TODO:
  • tune hyperparams
  • create a simpler model
    • would need labeled data from pattern dev to have more freedom
  • see if modifying the loss fn is necessary
  • relation-extraction equivalent? may be a better fit for our problem, but potentially not a huge deal if we can't do it
  • run for more epochs, see if that helps training
  • modify BRENDA data to be used as training data? or potentially use it for tagging, off of a semi-supervised model with ChemProt data
    • get labeled data from pattern development for some kind of benchmark?
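For anyone reading later, the Hugging Face version boils down to roughly this (a sketch, not Ivan's exact notebook; num_labels and the example sentence are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# allenai/scibert_scivocab_uncased is the released SciBERT checkpoint;
# num_labels=7 is a placeholder for B/I tags over {enzyme, substrate, product} + O.
model_name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=7)

# One dummy step to show the shape of the loop; real training iterates
# over the tokenized ChemProt NER splits with an optimizer.
enc = tokenizer("Hexokinase phosphorylates glucose .", return_tensors="pt")
labels = torch.zeros_like(enc["input_ids"])  # all-"O" placeholder labels
loss = model(**enc, labels=labels).loss
loss.backward()
```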

@mrunalimanj

revised TODOs:

@ivalexander13 (author)

> revised TODOs:

I'm working on this at #26
