
How to extract phrases from Wikipedia? #16

Closed

Albert-Ma opened this issue Oct 21, 2021 · 5 comments

Comments

@Albert-Ma

Hi!

First of all thanks a lot for this solid project!

I just want to figure out how to extract phrases from Wikipedia. Which script is the right one?
I am a little confused by the many scripts in the preprocess folder.

@jhyuklee
Member

Hi @Albert-Ma,

If you are looking to get phrase representations from documents, please refer here.

The code that extracts phrases is https://github.com/princeton-nlp/DensePhrases/blob/main/generate_phrase_vecs.py; also see `write_phrases`, which is used in generate_phrase_vecs.py:

```python
def write_phrases(all_examples, all_features, all_results, max_answer_length, do_lower_case, tokenizer, hdf5_path,
```
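To make the flow concrete, here is a minimal sketch of the general pattern a phrase-dumping step like `write_phrases` follows (collect per-document phrase start/end vectors and store them in an HDF5 file). The group/dataset names, array shapes, and the helper name are illustrative assumptions, not the actual DensePhrases output format:

```python
# Hypothetical sketch of dumping phrase vectors to HDF5.
# Names and shapes are assumptions; see generate_phrase_vecs.py for the real format.
import h5py
import numpy as np

def write_phrase_vecs_sketch(hdf5_path, doc_results):
    """doc_results: dict mapping doc_id -> dict with 'start', 'end',
    and 'start2end' numpy arrays (assumed layout)."""
    with h5py.File(hdf5_path, "w") as f:
        for doc_id, res in doc_results.items():
            grp = f.create_group(str(doc_id))
            grp.create_dataset("start", data=res["start"])           # [num_tokens, dim]
            grp.create_dataset("end", data=res["end"])               # [num_tokens, dim]
            grp.create_dataset("start2end", data=res["start2end"])   # start-to-end index map

# Example usage with random data standing in for model outputs:
doc_results = {
    0: {
        "start": np.random.randn(8, 16).astype(np.float32),
        "end": np.random.randn(8, 16).astype(np.float32),
        "start2end": np.arange(8),
    }
}
write_phrase_vecs_sketch("phrases.h5", doc_results)
```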

@Albert-Ma
Author

Albert-Ma commented Oct 22, 2021

Hi @jhyuklee,
I am looking for how to get phrases from raw documents like Wikipedia or SQuAD. This is the very first step of phrase retrieval, so I think it would happen before training the model. Or does phrase retrieval not extract phrases from documents explicitly, but do it on the fly?
I'll check the generate_phrase_vecs.py script.
Thanks.

@jhyuklee
Member

Phrase retrieval is trained with QA datasets that contain phrase-level answer annotations, so we don't need to explicitly extract phrases before training. After training, generate_phrase_vecs.py filters out irrelevant phrases (i.e., start/end tokens) and stores only the relevant phrases that can be used for downstream tasks. The filtering model was also trained on QA datasets so that the retained phrases serve as answer candidates. In embed_utils.py, there is a function that applies this filtering:

```python
def filter_metadata(metadata, threshold):
```

Here, metadata means the phrase-vector-related outputs for each document (phrase start/end vectors, the start2end mapper, etc.).
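To illustrate the idea of threshold-based filtering, here is a hypothetical sketch; the real `filter_metadata` in embed_utils.py operates on the actual metadata structure, so the keys and scores below are assumptions:

```python
import numpy as np

def filter_metadata_sketch(metadata, threshold):
    """Keep only phrase start/end vectors whose filter score passes the
    threshold. 'start_logits'/'end_logits' are assumed per-token scores
    from the trained filter; the keys are illustrative, not the real format."""
    keep_start = metadata["start_logits"] >= threshold
    keep_end = metadata["end_logits"] >= threshold
    return {
        "start": metadata["start"][keep_start],
        "end": metadata["end"][keep_end],
        "start_logits": metadata["start_logits"][keep_start],
        "end_logits": metadata["end_logits"][keep_end],
    }

# Example: random vectors with per-token filter scores.
metadata = {
    "start": np.random.randn(10, 16).astype(np.float32),
    "end": np.random.randn(10, 16).astype(np.float32),
    "start_logits": np.random.randn(10),
    "end_logits": np.random.randn(10),
}
filtered = filter_metadata_sketch(metadata, threshold=0.0)
print(filtered["start"].shape)  # typically fewer than 10 rows remain
```

Raising the threshold keeps fewer phrase vectors, shrinking the index at the cost of recall, which matches the role of the trained filter described above.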

@Albert-Ma
Author

Got it, thanks

@jhyuklee
Member

You can also check this issue: #17. I think it's related.
