
Representations of phrases #17

Closed
JiachengLi1995 opened this issue Oct 24, 2021 · 6 comments

Comments

@JiachengLi1995

Hi,

Thanks for the interesting project!

One question: If I want to get only phrase representations from your pre-trained model, how can I do that? I plan to use them as baselines. Thank you!

Best,
Jiacheng

@JiachengLi1995
Author

My input is some sentences, and the positions of all phrases are already recognized.
e.g.,
sentence: The name of the project is DensePhrases.
phrases: [6, 6, DensePhrases]

How can I get the vector of 'DensePhrases'?

Thanks for your help!

@jhyuklee
Member

jhyuklee commented Oct 24, 2021

Hi @JiachengLi1995,

If you have a sentence and want to extract one of the phrase representations from it, you can simply put your sentence into the JSON format used as the predict_file in this target:

DensePhrases/Makefile

Lines 135 to 150 in a64414f

gen-vecs:
	python generate_phrase_vecs.py \
		--model_type bert \
		--pretrained_name_or_path SpanBERT/spanbert-base-cased \
		--data_dir $(DATA_DIR)/single-qa \
		--cache_dir $(CACHE_DIR) \
		--predict_file $(DEV_DATA) \
		--do_dump \
		--max_seq_length 512 \
		--doc_stride 500 \
		--fp16 \
		--filter_threshold -2.0 \
		--append_title \
		--load_dir $(SAVE_DIR)/$(MODEL_NAME) \
		--output_dir $(SAVE_DIR)/$(MODEL_NAME) \
		$(OPTIONS)

Since you already have the exact positions you want to extract, you can remove the filter_threshold option so it falls back to its default value (a large negative value) and stores the entire set of token representations. Then, you can select the start and end positions from this set of representations to create the phrase representation.

You can make this process much simpler by slightly modifying the code. For instance, I would modify generate_phrase_vecs.py to take start and end positions as inputs and save only those specific phrase representations.
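A minimal sketch of what such a modification might compute (a hypothetical helper, not code from the repo), assuming the dump yields a (num_tokens, hidden_dim) array of token representations per document:

```python
import numpy as np

def phrase_vector(token_vecs: np.ndarray, start: int, end: int) -> np.ndarray:
    """Form a phrase representation by concatenating the start-token
    and end-token vectors (hypothetical helper for illustration)."""
    return np.concatenate([token_vecs[start], token_vecs[end]])

# Toy stand-in: 10 tokens, each with a 768-dim representation.
token_vecs = np.random.rand(10, 768).astype(np.float32)

# Single-token phrase at token position 6 (as in the example above).
vec = phrase_vector(token_vecs, 6, 6)
print(vec.shape)  # (1536,)
```

The resulting vector has twice the encoder's hidden size, since start and end vectors are concatenated.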

@JiachengLi1995
Author

Is 'SpanBERT/spanbert-base-cased' your pre-trained model for phrase embeddings?

@jhyuklee
Member

You should use princeton-nlp/densephrases-multi. Set load_dir to princeton-nlp/densephrases-multi.

@JiachengLi1995
Author

Thanks for your help!

For me, the outputs from outputs = model(**inputs) contain start_vecs, end_vecs, sft_logits, and eft_logits, and start_vecs is exactly equal to end_vecs. The start_vecs are the token vectors for the documents. So, if I want phrase embeddings, I just need to concatenate [start_vecs[phrase_start_pos], start_vecs[phrase_end_pos]], right? (phrase_start_pos is the token-level start index of the phrase.)

@jhyuklee
Member

Correct! The start and end representations are shared, so you can just use the start representations.
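Putting the thread together, the confirmed recipe can be sketched as follows (toy numbers; `start_vecs` here is a random stand-in for the (num_tokens, hidden_dim) array returned by the model):

```python
import numpy as np

hidden_dim = 768
# Stand-in for outputs["start_vecs"]; since start and end representations
# are shared in DensePhrases, start_vecs alone suffices for both ends.
start_vecs = np.random.rand(12, hidden_dim).astype(np.float32)

# Token-level span of the phrase (a single-token phrase in this example).
phrase_start_pos, phrase_end_pos = 6, 6

phrase_emb = np.concatenate(
    [start_vecs[phrase_start_pos], start_vecs[phrase_end_pos]]
)
print(phrase_emb.shape)  # (1536,)
```

Phrases of different lengths all map to vectors of the same size (2 * hidden_dim), which is what makes them usable as fixed-size baseline embeddings.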
