
How to extract phrases from Wikipedia? #16

Closed

Albert-Ma opened this issue Oct 21, 2021 · 5 comments

Comments

@Albert-Ma

Hi!

First of all thanks a lot for this solid project!

I just want to figure out how to extract phrases from Wikipedia. Which script is the right one?
I am a little confused by the many scripts in the preprocess folder.

@jhyuklee
Member

Hi @Albert-Ma,

If you are looking to get phrase representations from documents, please refer here.

The code that extracts phrases is https://github.com/princeton-nlp/DensePhrases/blob/main/generate_phrase_vecs.py; also see `write_phrases`, which is used in generate_phrase_vecs.py:

```python
def write_phrases(all_examples, all_features, all_results, max_answer_length, do_lower_case, tokenizer, hdf5_path,
```
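To make the flow concrete, here is a minimal sketch of the general pattern a phrase-dumping step like `write_phrases` follows (collect per-document phrase start/end vectors and store them in an HDF5 file). The group/dataset names, array shapes, and the helper name are illustrative assumptions, not the actual DensePhrases output format:

```python
# Hypothetical sketch of dumping phrase vectors to HDF5.
# Names and shapes are assumptions; see generate_phrase_vecs.py for the real format.
import h5py
import numpy as np

def write_phrase_vecs_sketch(hdf5_path, doc_results):
    """doc_results: dict mapping doc_id -> dict with 'start', 'end',
    and 'start2end' numpy arrays (assumed layout)."""
    with h5py.File(hdf5_path, "w") as f:
        for doc_id, res in doc_results.items():
            grp = f.create_group(str(doc_id))
            grp.create_dataset("start", data=res["start"])           # [num_tokens, dim]
            grp.create_dataset("end", data=res["end"])               # [num_tokens, dim]
            grp.create_dataset("start2end", data=res["start2end"])   # start-to-end index map

# Example usage with random data standing in for model outputs:
doc_results = {
    0: {
        "start": np.random.randn(8, 16).astype(np.float32),
        "end": np.random.randn(8, 16).astype(np.float32),
        "start2end": np.arange(8),
    }
}
write_phrase_vecs_sketch("phrases.h5", doc_results)
```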

@Albert-Ma
Author

Albert-Ma commented Oct 22, 2021

Hi @jhyuklee,
I am looking for how to get phrases from raw documents like Wikipedia or SQuAD. This is the very first step of phrase retrieval, so I think it would happen before training the model. Or does phrase retrieval not extract phrases from documents explicitly, but do it on the fly?
I'll check the generate_phrase_vecs.py script.
Thanks.

@jhyuklee
Member

Phrase retrieval is trained with QA datasets that contain phrase-level answer annotations, so we don't need to explicitly extract phrases before training. After training, generate_phrase_vecs.py filters out irrelevant phrases (i.e., start/end tokens) and stores only the relevant phrases that can be used for downstream tasks. The filtering model was also trained on QA datasets so that the retained phrases serve as answer candidates. In embed_utils.py, there is a function that applies this filtering:

```python
def filter_metadata(metadata, threshold):
```

Here, metadata means the phrase-vector-related outputs for each document (phrase start/end vectors, the start2end mapper, etc.).
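To illustrate the idea of threshold-based filtering, here is a hypothetical sketch; the real `filter_metadata` in embed_utils.py operates on the actual metadata structure, so the keys and scores below are assumptions:

```python
import numpy as np

def filter_metadata_sketch(metadata, threshold):
    """Keep only phrase start/end vectors whose filter score passes the
    threshold. 'start_logits'/'end_logits' are assumed per-token scores
    from the trained filter; the keys are illustrative, not the real format."""
    keep_start = metadata["start_logits"] >= threshold
    keep_end = metadata["end_logits"] >= threshold
    return {
        "start": metadata["start"][keep_start],
        "end": metadata["end"][keep_end],
        "start_logits": metadata["start_logits"][keep_start],
        "end_logits": metadata["end_logits"][keep_end],
    }

# Example: random vectors with per-token filter scores.
metadata = {
    "start": np.random.randn(10, 16).astype(np.float32),
    "end": np.random.randn(10, 16).astype(np.float32),
    "start_logits": np.random.randn(10),
    "end_logits": np.random.randn(10),
}
filtered = filter_metadata_sketch(metadata, threshold=0.0)
print(filtered["start"].shape)  # typically fewer than 10 rows remain
```

Raising the threshold keeps fewer phrase vectors, shrinking the index at the cost of recall, which matches the role of the trained filter described above.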

@Albert-Ma
Author

Got it, thanks

@jhyuklee
Member

You can also check this issue: #17. I think it's related.
