
how to use extracted features in extract_features.py? #493

Closed
heslowen opened this issue Apr 16, 2019 · 17 comments
Labels
Discussion, wontfix

Comments

@heslowen

I extracted features as in the example in extract_features.py. But when I used these features (the last encoded_layers) as word embeddings in a text classification task, I got a worse result than with 300-d GloVe (all other parameters were the same). I also used these features to compute the cosine similarity for each word pair in the sentences and found that all values were around 0.6. So can these features be used like GloVe or word2vec embeddings? What exactly are these features?

@thomwolf added the BERT and Discussion labels on Apr 17, 2019
@thomwolf
Member

Without fine-tuning, BERT features are usually less useful than plain GloVe or word2vec, indeed.
They start to become interesting when you fine-tune a classifier on top of BERT.

See the recent study by Matthew Peters, Sebastian Ruder, Noah A. Smith (To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks) for some practical tips on that.

@heslowen
Author

thank you so much~

@joistick11

@heslowen could you please share the code for extracting features in order to use them for learning a classifier? Thanks.

@heslowen
Author

@joistick11 you can find a demo in extract_features.py

@joistick11

Could you please help me?
I was using bert-as-service (https://github.com/hanxiao/bert-as-service), which has an encode method that accepts a list of sentences and returns a list of the same size, each element containing a sentence embedding. All the embeddings have the same size.

  1. When I use extract_features.py, it returns an embedding for each recognized token in the sentence from the specified layers. I mean, instead of a sentence embedding it returns token embeddings. How should I use that, for instance, to train an SVM? I am using bert-base-multilingual-cased.
  2. Which layer's output should I use? The one with index -1?

Thank you very much!

@heslowen
Author

@joistick11 You want to embed a sentence as a single vector?
all_encoder_layers, pooled_output = model(input_ids, token_type_ids=None, attention_mask=input_mask); the pooled_output may help you.
I have no idea about using these features to train an SVM, although I know the theory behind SVMs.
For the second question, please refer to thomwolf's answer.
I used the top 4 encoder layers, but I did not get a better result than with GloVe.
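
A minimal sketch of that call with pytorch_pretrained_bert (assuming bert-base-uncased weights; this is an illustration, not code from the thread):

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

tokens = ['[CLS]'] + tokenizer.tokenize('I have a dog.') + ['[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
input_mask = torch.ones_like(input_ids)

with torch.no_grad():
    all_encoder_layers, pooled_output = model(
        input_ids, token_type_ids=None, attention_mask=input_mask)

# pooled_output has shape [1, 768]: the [CLS] vector passed through the pooler,
# usable as a fixed-size sentence embedding for a downstream classifier.
sentence_vector = pooled_output[0].numpy()
```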

@RomanShen

RomanShen commented Apr 29, 2019

@heslowen Hello, would you please help me? For a sequence like [CLS] I have a dog . [SEP], when I feed it to BERT and take the last hidden layer of the output, let's say the output is "vector", is vector[0] the embedding of [CLS], vector[1] the embedding of "I", and so on, with vector[-1] the embedding of [SEP]?

@rvoak

rvoak commented May 2, 2019

@heslowen How did you extract features after training a classifier on top of BERT? I've been trying to do the same, but I'm unable to do so.
Do I first follow run_classifier.py, and then extract the features from tf.Estimator?

@heslowen
Author

heslowen commented May 5, 2019

@rvoak I use PyTorch. I did it as in the demo in extract_features.py. It is easy: you just need to load a tokenizer and a BERT model, tokenize your sentences, and then run the model to get the encoded_layers.
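
A rough sketch of that workflow (load tokenizer and model, tokenize, run the model, keep encoded_layers), assuming pytorch_pretrained_bert and bert-base-uncased; adapt it to your own model:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

tokens = ['[CLS]'] + tokenizer.tokenize('how to use extracted features') + ['[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    encoded_layers, _ = model(input_ids)  # list of 12 tensors, each [1, seq_len, 768]

# Per-token features from the last layer, or a sum of the top 4 layers
# (the variant mentioned earlier in this thread).
last_layer = encoded_layers[-1][0]                          # [seq_len, 768]
top4_sum = torch.stack(encoded_layers[-4:]).sum(dim=0)[0]   # [seq_len, 768]
```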

@heslowen
Author

heslowen commented May 6, 2019

@RomanShen Yes, you're right.
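
To make the alignment concrete, a small illustration (same assumed pytorch_pretrained_bert setup as the sketches above): row i of the last hidden layer corresponds to input token i, so index 0 is [CLS] and the last index is [SEP].

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

tokens = ['[CLS]'] + tokenizer.tokenize('I have a dog.') + ['[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    encoded_layers, _ = model(input_ids)

vectors = encoded_layers[-1][0]          # [len(tokens), 768]
for i, tok in enumerate(tokens):
    print(i, tok, vectors[i].shape)      # vectors[0] -> [CLS], vectors[-1] -> [SEP]
```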

@RomanShen

@heslowen Thanks for your reply!

@stale

stale bot commented Jul 5, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 5, 2019
@stale stale bot closed this as completed Jul 12, 2019
@hungph-dev-ict

@heslowen Sorry about my English. I am now working on a sentence embedding task. I fine-tuned on my corpus with this library and got a config.json, vocab.txt and a model .bin file, but BERT's original docs only show how to extract features when loading from a TensorFlow ckpt checkpoint. According to your answer, I have to write the feature extraction for torch myself, is that right? Please help me.

@LysandreJik
Member

@hungph-dev-ict Do you mind opening a new issue with your problem? I'll try and help you out.

@hungph-dev-ict

@LysandreJik Thank you for your help. I will look for a solution to my problem; it uses the last hidden layer of BERT, but if you have a better solution, could you help me?
I also have a further concern about my corpus: with this library's code the tokenizer comes from the pretrained BERT model, but I want to use only the BasicTokenizer. Can you help me?

@Ovis85

Ovis85 commented Oct 15, 2019

How long should extract_features.py take to complete?

When using 'bert-large-uncased' it takes seconds, but it writes a blank file.
When using 'bert-base-uncased' it has been running for over 30 minutes.

Any advice?

the code I used:

!python extract_features.py \
  --input_file data/src_train.txt \
  --output_file data/output1.jsonl \
  --bert_model bert-base-uncased \
  --layers -1

@mykelismyname

You can look at what the BertForSequenceClassification model does in its forward method: https://github.com/huggingface/transformers/blob/3ba5470eb85464df62f324bea88e20da234c423f/pytorch_pretrained_bert/modeling.py#L867
The pooled_output obtained from self.bert would seem to be the features you are looking for.
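
For example, here is a rough sketch (assuming pytorch_pretrained_bert and scikit-learn; the helper and toy data are hypothetical) of feeding that pooled_output to an SVM, as asked earlier in the thread:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel
from sklearn.svm import SVC

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def sentence_features(sentence):
    # Fixed-size feature vector for one sentence: the pooled [CLS] output.
    tokens = ['[CLS]'] + tokenizer.tokenize(sentence) + ['[SEP]']
    input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    with torch.no_grad():
        _, pooled_output = model(input_ids, output_all_encoded_layers=False)
    return pooled_output[0].numpy()

# Hypothetical toy data; note that without fine-tuning, results may trail
# GloVe baselines, as discussed above.
sentences = ['great movie', 'terrible movie']
labels = [1, 0]
X = [sentence_features(s) for s in sentences]
clf = SVC().fit(X, labels)
```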
