
Representations of phrases #17

Closed
JiachengLi1995 opened this issue Oct 24, 2021 · 6 comments

Comments

@JiachengLi1995

Hi,

Thanks for the interesting project!

One question: If I want to get only phrase representations from your pre-trained model, how can I do that? I plan to use them as baselines. Thank you!

Best,
Jiacheng

@JiachengLi1995
Author

My input is some sentences, and the positions of all phrases are already recognized.
e.g.,
sentence: The name of the project is DensePhrases.
phrases: [6, 6, DensePhrases]

How can I get the vector of 'DensePhrases'?

Thanks for your help!

@jhyuklee
Member

jhyuklee commented Oct 24, 2021

Hi @JiachengLi1995,

If you have a sentence and want to extract one of the phrase representations from it, you can simply put your sentence into the JSON format used as the predict_file in this target:

DensePhrases/Makefile

Lines 135 to 150 in a64414f

gen-vecs:
	python generate_phrase_vecs.py \
		--model_type bert \
		--pretrained_name_or_path SpanBERT/spanbert-base-cased \
		--data_dir $(DATA_DIR)/single-qa \
		--cache_dir $(CACHE_DIR) \
		--predict_file $(DEV_DATA) \
		--do_dump \
		--max_seq_length 512 \
		--doc_stride 500 \
		--fp16 \
		--filter_threshold -2.0 \
		--append_title \
		--load_dir $(SAVE_DIR)/$(MODEL_NAME) \
		--output_dir $(SAVE_DIR)/$(MODEL_NAME) \
		$(OPTIONS)

Since you already have the exact positions you want to extract, you can remove the filter_threshold option so it falls back to its default value (a large negative value) and stores the entire set of token representations. Then, you can select the start and end positions from this set of representations to create the phrase representation.

You can make this process much simpler by slightly modifying the code. For instance, I would modify generate_phrase_vecs.py to take start and end positions as inputs and save only those specific phrase representations.
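A minimal sketch of what such a modification might compute (a hypothetical helper, not code from the repo), assuming the dump yields a (num_tokens, hidden_dim) array of token representations per document:

```python
import numpy as np

def phrase_vector(token_vecs: np.ndarray, start: int, end: int) -> np.ndarray:
    """Form a phrase representation by concatenating the start-token
    and end-token vectors (hypothetical helper for illustration)."""
    return np.concatenate([token_vecs[start], token_vecs[end]])

# Toy stand-in: 10 tokens, each with a 768-dim representation.
token_vecs = np.random.rand(10, 768).astype(np.float32)

# Single-token phrase at token position 6 (as in the example above).
vec = phrase_vector(token_vecs, 6, 6)
print(vec.shape)  # (1536,)
```

The resulting vector has twice the encoder's hidden size, since start and end vectors are concatenated.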

@JiachengLi1995
Author

Is 'SpanBERT/spanbert-base-cased' your pre-trained model for phrase embeddings?

@jhyuklee
Member

You should use princeton-nlp/densephrases-multi. Set load_dir to princeton-nlp/densephrases-multi.

@JiachengLi1995
Author

Thanks for your help!

For me, the outputs from outputs = model(**inputs) contain start_vecs, end_vecs, sft_logits, and eft_logits, and start_vecs is exactly equal to end_vecs. The start_vecs are the token vectors for the documents. So, if I want phrase embeddings, I just need to concatenate [start_vecs[phrase_start_pos], start_vecs[phrase_end_pos]], right? (phrase_start_pos is the token-level start index of the phrase.)

@jhyuklee
Member

Correct! The start and end representations are shared, so you can just use the start representations.
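Putting the thread together, the confirmed recipe can be sketched as follows (toy numbers; `start_vecs` here is a random stand-in for the (num_tokens, hidden_dim) array returned by the model):

```python
import numpy as np

hidden_dim = 768
# Stand-in for outputs["start_vecs"]; since start and end representations
# are shared in DensePhrases, start_vecs alone suffices for both ends.
start_vecs = np.random.rand(12, hidden_dim).astype(np.float32)

# Token-level span of the phrase (a single-token phrase in this example).
phrase_start_pos, phrase_end_pos = 6, 6

phrase_emb = np.concatenate(
    [start_vecs[phrase_start_pos], start_vecs[phrase_end_pos]]
)
print(phrase_emb.shape)  # (1536,)
```

Phrases of different lengths all map to vectors of the same size (2 * hidden_dim), which is what makes them usable as fixed-size baseline embeddings.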
