Modifying num_clusters in index-vecs #20

Closed · light42 opened this issue Nov 20, 2021 · 11 comments

light42 commented Nov 20, 2021

I tried to run index-vecs using a custom wikidump, dataset, and model, but got this error:

[screenshot: faiss clustering error, with k shown as 256]

Modifying the num_clusters flag to 96 doesn't seem to help; the k in the error message is still 256.
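
For context, this failure can be reproduced outside the repo: faiss refuses to train an IVF index when the training set has fewer vectors than the requested number of centroids. A minimal sketch with hypothetical sizes (not the repo's actual call):

```python
import numpy as np
import faiss

d = 64                                   # vector dimension (hypothetical)
num_clusters = 256                       # nlist: coarse centroids to learn
xt = np.random.rand(100, d).astype('float32')  # only 100 training vectors

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFPQ(quantizer, d, num_clusters, 8, 8,
                         faiss.METRIC_INNER_PRODUCT)
index.train(xt)  # raises: k-means needs at least as many points as centroids (k=256)
```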

light42 commented Nov 20, 2021

Also, the phrase dump was made with gen-vecs-parallel instead of gen-vecs. I don't understand how to make gen-vecs use my custom wikidump as the predict_file.

jhyuklee (Member) commented

Hi, how big is your custom corpus? Is it split across multiple files? It seems the corpus is too small. Judging from the error message, num_clusters <= 96 should work.

light42 commented Nov 21, 2021

I'm using the latest Indonesian Wikipedia dump (idwiki-latest-pages-articles.xml.bz2) from https://dumps.wikimedia.org/idwiki/latest/.
I then preprocessed it with WikiExtractor and fed all of the JSON output files (around 1,145 of them) into gen_vecs_parallel.

The weird thing is that I keep getting the exact same error message even when I set num_clusters <= 96; the k in the error message still shows 256.

jhyuklee (Member) commented

Did you use the pretrained DensePhrases for Indonesian? Note that DensePhrases was trained on English Wikipedia.

light42 commented Nov 21, 2021

Yes, I'm using the pretrained indolem/indobert-base-uncased from Hugging Face.

After some digging, I have a gut feeling that the fifth parameter of the IndexIVFPQ constructor has something to do with the error, since 256 is 2^8.

sub_index = faiss.IndexIVFPQ(quantizer, ds, num_clusters, code_size, 8, faiss.METRIC_INNER_PRODUCT)

Unfortunately this parameter can't be changed at all; it must be set to 8 or else it breaks.

[screenshot: error raised after changing the fifth parameter]

^ My guess was wrong. I ran it in CPU mode, where I could change the parameter, yet I still got the same error. :(
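
The fifth argument is nbits, the number of bits per product-quantizer code, and 2^8 = 256 is indeed where that constant comes from: training an IndexIVFPQ runs a second k-means with 2^nbits = 256 centroids per PQ sub-quantizer. One plausible reading of the persistent k=256 (an inference, not confirmed in this thread) is that this PQ clustering, not the coarse num_clusters clustering, is what keeps failing:

```python
import numpy as np
import faiss

d, num_clusters, code_size, nbits = 64, 96, 8, 8   # hypothetical sizes
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFPQ(quantizer, d, num_clusters, code_size, nbits,
                         faiss.METRIC_INNER_PRODUCT)

xt = np.random.rand(100, d).astype('float32')  # 100 >= 96: the coarse step passes
index.train(xt)  # still fails: each PQ codebook needs >= 2**nbits = 256
                 # training points, so the error keeps reporting k=256
```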

jhyuklee (Member) commented

You should train your pre-trained LM on QA datasets for phrase retrieval. Was the model trained using our provided training script (with QA datasets in Indonesian)?

light42 commented Nov 22, 2021

Yes. In the end I left the fine_quant flag empty and now it works. Is it okay to leave it empty?

jhyuklee (Member) commented

It means that you are using the flat index, which is not optimal when you use the entire Wikipedia dump. How well does it work? How large is your phrase dump?
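
To make the trade-off concrete: an empty fine_quant falling back to a flat index means exact, training-free search, which is why the clustering error disappears. A minimal sketch (hypothetical sizes):

```python
import numpy as np
import faiss

d = 64
index = faiss.IndexFlatIP(d)   # exact inner-product search; no train() step,
                               # so no k-means and no "k=256" failure
xb = np.random.rand(1000, d).astype('float32')
index.add(xb)
scores, ids = index.search(xb[:5], 10)  # brute force: O(N) per query, which
                                        # gets slow at full-Wikipedia scale
```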

light42 commented Nov 22, 2021

It only generated 1,145 files, compared to wiki-20181220, which has 5,621 files.

light42 commented Nov 22, 2021

I found that #15 has a similar problem.
I realize that making the corpus bigger is the only way to go, but I can't make it bigger unless I draw from other sources.
Is there a script I could use to preprocess non-Wikipedia pages into dumps ready to be used with DensePhrases?
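
Since the phrase dump above was built from WikiExtractor's JSON output, one workaround is to write non-Wikipedia text in the same JSON-lines shape. A hedged sketch (the exact fields the pipeline needs are an assumption — verify against the repo's preprocessing code):

```python
import json

def to_wikiextractor_json(docs, out_path):
    """Write (title, text) pairs as WikiExtractor-style JSON lines:
    one {"id", "url", "title", "text"} object per line (assumed format)."""
    with open(out_path, 'w', encoding='utf-8') as f:
        for i, (title, text) in enumerate(docs):
            record = {'id': str(i), 'url': '', 'title': title, 'text': text}
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

to_wikiextractor_json([('Example', 'An example document.')], 'custom_00')
```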

light42 commented Nov 22, 2021

Finally solved it, after I changed doc_sample_ratio and vec_sample_ratio to 1.0.
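
For later readers, a plausible mechanism (inferred from the flag names, not from the repo's code): the two ratios subsample documents and then vectors to build the faiss training set, so on a small corpus the default sampling can leave fewer than the 256 training vectors the PQ step needs. A sketch of that kind of subsampling:

```python
import random

def sample_training_vecs(doc_vecs, doc_sample_ratio, vec_sample_ratio):
    """Subsample documents, then vectors, to form a faiss training set
    (hedged sketch; the repo's actual logic and defaults may differ)."""
    docs = random.sample(doc_vecs, int(len(doc_vecs) * doc_sample_ratio))
    vecs = [v for doc in docs for v in doc]
    return random.sample(vecs, int(len(vecs) * vec_sample_ratio))

# With ratios of 0.2 each, only ~4% of vectors survive: a corpus with ~5,000
# vectors leaves ~200 training points (< 256) and trips the k=256 error;
# setting both ratios to 1.0 keeps every vector.
```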

light42 closed this as completed Nov 22, 2021