Modifying num_clusters in index-vecs #20
Comments
And also, the phrase dump was generated by gen-vecs-parallel instead of gen-vecs. I don't understand how to make gen-vecs use my custom wikidump as predict_file.
Hi, how big is your custom corpus? Is it split across multiple files? It seems the corpus is too small; given the error message, num_clusters <= 96 should work.
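For context on the error being discussed: IVF-style index building runs k-means over a sample of the phrase vectors, and k-means needs at least as many training points as clusters. A minimal NumPy-only sketch of that constraint (the function name and error wording here are illustrative, not the project's actual code):

```python
import numpy as np

def check_kmeans_trainable(train_vecs, num_clusters):
    # k-means (as used when training an IVF index) needs at least as many
    # training points as clusters, otherwise index building fails with an
    # error like the one in this issue.
    n_train = train_vecs.shape[0]
    if n_train < num_clusters:
        raise ValueError(
            f"Number of training points ({n_train}) should be at least "
            f"as large as number of clusters ({num_clusters})"
        )
    return True

# A small corpus may yield too few sampled vectors for the default k:
vecs = np.random.rand(80, 128).astype("float32")
try:
    check_kmeans_trainable(vecs, 256)
except ValueError as e:
    print(e)
```

This is why lowering num_clusters (or enlarging the training sample, as happens later in this thread) resolves the error.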
I'm using the latest Indonesian Wikipedia dump (idwiki-latest-pages-articles.xml.bz2) from https://dumps.wikimedia.org/idwiki/latest/. The weird thing is that I keep getting the exact same error message even after changing num_clusters to <= 96. The k in the error message still shows the value 256.
Did you use the pretrained DensePhrases for Indonesian? Note that DensePhrases was trained on English Wikipedia.
Yes, I'm using the pretrained indolem/indobert-base-uncased from Hugging Face.
DensePhrases/build_phrase_index.py Line 115 in 4f35efe
^My guess was wrong. I ran it in CPU mode, where I could change the parameter, yet I still got the same error. :(
You should train your pre-trained LM on QA datasets for phrase retrieval. Is the model trained with our provided training script (with QA datasets in Indonesian)?
Yes. In the end I left the fine_quant flag empty, and now it works. Is it okay to leave it empty?
It means that you are using a flat index, which is not very optimal when you try to use the entire Wikipedia dump. How well does it work? How large is your phrase dump?
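To illustrate the trade-off mentioned above: a flat index does exact brute-force search, scoring the query against every stored vector. It has no clustering step (so no num_clusters constraint), but cost grows linearly with corpus size, which is why it becomes slow on a full Wikipedia-scale phrase dump. A NumPy sketch of flat inner-product search (function and variable names are hypothetical):

```python
import numpy as np

def flat_search(query, phrase_vecs, top_k=5):
    # Exact "flat" search: compute the inner product of the query with
    # every phrase vector, then keep the top_k highest-scoring entries.
    # No training/quantization needed, but every query touches all vectors.
    scores = phrase_vecs @ query
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]

rng = np.random.default_rng(0)
phrases = rng.standard_normal((10_000, 128)).astype("float32")
query = rng.standard_normal(128).astype("float32")
idx, sc = flat_search(query, phrases, top_k=3)
```

An IVF index avoids the full scan by clustering vectors and probing only a few clusters per query, which is where the num_clusters training requirement comes from.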
It only generated 1145 files, compared to wiki-20181220, which has 5621 files.
I found that #15 has a similar problem.
Finally solved it, after I changed doc_sample_ratio and vec_sample_ratio to 1.0.
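This fix makes sense given the earlier error: the index builder samples documents and then vectors to form the k-means training set, so on a small dump the default sampling ratios can shrink that set below num_clusters. A rough back-of-the-envelope sketch (the compounding formula is my assumption, not the project's exact sampling logic):

```python
def estimated_training_vectors(num_docs, vecs_per_doc,
                               doc_sample_ratio, vec_sample_ratio):
    # Assumed model: first sample documents, then sample vectors within
    # the sampled documents; the product gives the k-means training set.
    sampled_docs = int(num_docs * doc_sample_ratio)
    return int(sampled_docs * vecs_per_doc * vec_sample_ratio)

# With a small dump, sub-1.0 ratios can fall below num_clusters=256:
small = estimated_training_vectors(1000, 5, 0.2, 0.2)  # -> 200 (< 256)
full = estimated_training_vectors(1000, 5, 1.0, 1.0)   # -> 5000 (>= 256)
```

Setting both ratios to 1.0 uses every vector for training, which keeps the training set at or above num_clusters even for small corpora.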
I tried to run index-vecs using a custom wikidump, dataset, and model, but got this error.
Modifying the num_clusters flag to 96 doesn't seem to help; the k in the error message is still 256.