Modifying num_clusters in index-vecs #20

Closed · light42 opened this issue Nov 20, 2021 · 11 comments

light42 commented Nov 20, 2021

I tried to run index-vecs using a custom wikidump, dataset, and model, but got this error:

[screenshot: faiss clustering error, with k shown as 256]

Modifying the num_clusters flag to 96 doesn't seem to help; the k in the error message is still 256.
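
For context, this failure can be reproduced outside the repo: faiss refuses to train an IVF index when the training set has fewer vectors than the requested number of centroids. A minimal sketch with hypothetical sizes (not the repo's actual call):

```python
import numpy as np
import faiss

d = 64                                   # vector dimension (hypothetical)
num_clusters = 256                       # nlist: coarse centroids to learn
xt = np.random.rand(100, d).astype('float32')  # only 100 training vectors

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFPQ(quantizer, d, num_clusters, 8, 8,
                         faiss.METRIC_INNER_PRODUCT)
index.train(xt)  # raises: k-means needs at least as many points as centroids (k=256)
```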

light42 commented Nov 20, 2021

Also, the phrase dump was made with gen-vecs-parallel instead of gen-vecs. I don't understand how to make gen-vecs use my custom wikidump as the predict_file.

jhyuklee (Member) commented

Hi, how big is your custom corpus? Is it split across multiple files? It seems the corpus is too small. Judging from the error message, num_clusters <= 96 should work.

light42 commented Nov 21, 2021

I'm using the latest Indonesian Wikipedia dump (idwiki-latest-pages-articles.xml.bz2) from https://dumps.wikimedia.org/idwiki/latest/.
I then preprocessed it with WikiExtractor and fed all of the JSON output files (around 1,145 of them) into gen_vecs_parallel.

The weird thing is that I keep getting the exact same error message even when I set num_clusters <= 96; the k in the error message still shows 256.

jhyuklee (Member) commented

Did you use the pretrained DensePhrases for Indonesian? Note that DensePhrases was trained on English Wikipedia.

light42 commented Nov 21, 2021

Yes, I'm using the pretrained indolem/indobert-base-uncased from Hugging Face.

After some digging, I have a gut feeling that the fifth parameter of the IndexIVFPQ constructor has something to do with the error, since 256 is 2^8.

sub_index = faiss.IndexIVFPQ(quantizer, ds, num_clusters, code_size, 8, faiss.METRIC_INNER_PRODUCT)

Unfortunately this parameter can't be changed at all; it must be set to 8 or else it breaks.

[screenshot: error raised after changing the fifth parameter]

^ My guess was wrong. I ran it in CPU mode, where I could change the parameter, yet I still got the same error. :(
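
The fifth argument is nbits, the number of bits per product-quantizer code, and 2^8 = 256 is indeed where that constant comes from: training an IndexIVFPQ runs a second k-means with 2^nbits = 256 centroids per PQ sub-quantizer. One plausible reading of the persistent k=256 (an inference, not confirmed in this thread) is that this PQ clustering, not the coarse num_clusters clustering, is what keeps failing:

```python
import numpy as np
import faiss

d, num_clusters, code_size, nbits = 64, 96, 8, 8   # hypothetical sizes
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFPQ(quantizer, d, num_clusters, code_size, nbits,
                         faiss.METRIC_INNER_PRODUCT)

xt = np.random.rand(100, d).astype('float32')  # 100 >= 96: the coarse step passes
index.train(xt)  # still fails: each PQ codebook needs >= 2**nbits = 256
                 # training points, so the error keeps reporting k=256
```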

jhyuklee (Member) commented

You should train your pre-trained LM on QA datasets for phrase retrieval. Was the model trained using our provided training script (with QA datasets in Indonesian)?

light42 commented Nov 22, 2021

Yes. In the end I left the fine_quant flag empty and now it works. Is it okay to leave it empty?

jhyuklee (Member) commented

It means that you are using the flat index, which is not optimal when you use the entire Wikipedia dump. How well does it work? How large is your phrase dump?
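
To make the trade-off concrete: an empty fine_quant falling back to a flat index means exact, training-free search, which is why the clustering error disappears. A minimal sketch (hypothetical sizes):

```python
import numpy as np
import faiss

d = 64
index = faiss.IndexFlatIP(d)   # exact inner-product search; no train() step,
                               # so no k-means and no "k=256" failure
xb = np.random.rand(1000, d).astype('float32')
index.add(xb)
scores, ids = index.search(xb[:5], 10)  # brute force: O(N) per query, which
                                        # gets slow at full-Wikipedia scale
```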

light42 commented Nov 22, 2021

It only generated 1,145 files, compared to wiki-20181220, which has 5,621 files.

light42 commented Nov 22, 2021

I found that #15 has a similar problem.
I realize that making the corpus bigger is the only way to go, but I can't make it bigger unless I draw from other sources.
Is there a script I could use to preprocess non-Wikipedia pages into dumps ready to be used with DensePhrases?
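
Since the phrase dump above was built from WikiExtractor's JSON output, one workaround is to write non-Wikipedia text in the same JSON-lines shape. A hedged sketch (the exact fields the pipeline needs are an assumption — verify against the repo's preprocessing code):

```python
import json

def to_wikiextractor_json(docs, out_path):
    """Write (title, text) pairs as WikiExtractor-style JSON lines:
    one {"id", "url", "title", "text"} object per line (assumed format)."""
    with open(out_path, 'w', encoding='utf-8') as f:
        for i, (title, text) in enumerate(docs):
            record = {'id': str(i), 'url': '', 'title': title, 'text': text}
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

to_wikiextractor_json([('Example', 'An example document.')], 'custom_00')
```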

light42 commented Nov 22, 2021

Finally solved it, after I changed doc_sample_ratio and vec_sample_ratio to 1.0.
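
For later readers, a plausible mechanism (inferred from the flag names, not from the repo's code): the two ratios subsample documents and then vectors to build the faiss training set, so on a small corpus the default sampling can leave fewer than the 256 training vectors the PQ step needs. A sketch of that kind of subsampling:

```python
import random

def sample_training_vecs(doc_vecs, doc_sample_ratio, vec_sample_ratio):
    """Subsample documents, then vectors, to form a faiss training set
    (hedged sketch; the repo's actual logic and defaults may differ)."""
    docs = random.sample(doc_vecs, int(len(doc_vecs) * doc_sample_ratio))
    vecs = [v for doc in docs for v in doc]
    return random.sample(vecs, int(len(vecs) * vec_sample_ratio))

# With ratios of 0.2 each, only ~4% of vectors survive: a corpus with ~5,000
# vectors leaves ~200 training points (< 256) and trips the k=256 error;
# setting both ratios to 1.0 keeps every vector.
```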

light42 closed this as completed Nov 22, 2021