
Issue while creating faiss index, Command is not clear #15

Closed
SAIVENKATARAJU opened this issue Oct 15, 2021 · 14 comments

Comments

@SAIVENKATARAJU

Hi,

What is the `all` in this command? I get an unrecognized-command error when I remove `all`.

python build_phrase_index.py \
    $SAVE_DIR/densephrases-multi_sample/dump all \
    --replace \
    --num_clusters 32 \
    --fine_quant OPQ96 \
    --doc_sample_ratio 1.0 \
    --vec_sample_ratio 1.0 \
    --cuda

I worked around that by passing --dump_dir before it, but it's not creating anything. Please find the screenshot below.
Screenshot from 2021-10-15 12-49-51
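For context, `all` here is a positional argument (the build stage), not a flag. A minimal argparse sketch, with hypothetical argument names and stage choices, shows why omitting it makes the parser reject the command:

```python
import argparse

# Hypothetical sketch of how a build_phrase_index.py-style script might
# declare its arguments; the stage names below are illustrative, not the
# script's actual choices. Omitting the positional "stage" raises an error.
parser = argparse.ArgumentParser()
parser.add_argument("dump_dir")  # positional: path to the phrase dump
parser.add_argument("stage", choices=["all", "coarse", "fine", "add"])  # positional: which stage to run
parser.add_argument("--num_clusters", type=int, default=256)

args = parser.parse_args(["outputs/dump", "all", "--num_clusters", "32"])
print(args.stage, args.num_clusters)  # → all 32
```

Because `stage` is positional, `--dump_dir` cannot substitute for it; the value `all` must appear after the dump directory.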

@SAIVENKATARAJU
Author

Hi,
I understand the issue now: `all` is the value for the stage argument. However, with the following command the process runs forever, with no progress and no termination.
Screenshot from 2021-10-15 17-06-30

@jhyuklee
Member

Hi @SAIVENKATARAJU, which version of faiss are you using? Did you install faiss-gpu?
For a small phrase index like this, it shouldn't take more than a few minutes.

@SAIVENKATARAJU
Author

Hi @jhyuklee
Thanks for your comment. I am using faiss-gpu version 1.6.5.

@SAIVENKATARAJU
Author

Hi @jhyuklee
I debugged the code further and got this error; I'm not sure what I am missing. For debugging purposes I changed the script and set the arguments inside the code itself, like this. Hope this is helpful. I am attaching my article.json here.

args.dump_dir='./outputs/densephrases-multi_sample/dump'
args.stage='all'
args.replace=True
args.num_clusters=32
args.fine_quant='OPQ96'
args.doc_sample_ratio=1.0
args.vec_sample_ratio=1.0
args.cuda=True
args.index_filter=-1e8
args.index_name='start'
args.quantizer_path='quantizer.faiss'
args.trained_index_path='trained.faiss'
args.inv_path='merged.invdata'
args.subindex_name='index'
args.dump_paths='./densephrases-multi_sample/dump/phrase'
args.phrase_dir='phrase'
args.subindex_dir='./densephrases-multi_sample/dump/phrase/'
args.offset=0
args.norm_th=999

Screenshot from 2021-10-16 15-56-52

@jhyuklee
Member

I'm not sure why, but the error message says you are trying to use HNSW. The default setting for HNSW is off, so this shouldn't happen. Could you check whether the command reaches these lines?

opq_matrix = faiss.OPQMatrix(ds, code_size)  # learn an OPQ rotation over ds-dim vectors
opq_matrix.niter = 10                        # number of OPQ training iterations
sub_index = faiss.IndexIVFPQ(quantizer, ds, num_clusters, code_size, 8, faiss.METRIC_INNER_PRODUCT)  # IVF + 8-bit PQ codes
start_index = faiss.IndexPreTransform(opq_matrix, sub_index)  # apply the OPQ rotation before the IVFPQ index

@SAIVENKATARAJU
Author

Hi @jhyuklee

The HNSW-type error was resolved. However, the hang I posted in the second screenshot still persists; it appears to be a deadlock. Please find the screenshots below. I am attaching the sample article.json and the commands I used, if you have time to go through them.
Command 1:
python DensePhrases/generate_phrase_vecs.py --model_type bert --pretrained_name_or_path SpanBERT/spanbert-base-cased --data_dir ./ --cache_dir $CACHE_DIR --predict_file article_original.json --do_dump --max_seq_length 512 --doc_stride 500 --fp16 --filter_threshold -2.0 --append_title --load_dir $SAVE_DIR/densephrases-multi --output_dir $SAVE_DIR/densephrases-multi_sample

Command 2
python DensePhrases/build_phrase_index.py --dump_dir $SAVE_DIR/densephrases-multi_sample/dump --stage all --replace --num_clusters 32 --fine_quant OPQ96 --doc_sample_ratio 1.0 --vec_sample_ratio 1.0 --cuda

deadlock
dead_lock

article_original.zip

@jhyuklee
Member

Hi @SAIVENKATARAJU, I've downloaded your article_original.json and found that it is too small to train the index. I got the following error:

RuntimeError: Error in void faiss::Clustering::train_encoded(faiss::Clustering::idx_t, const uint8_t*, const faiss::Index*, faiss::Index&, const float*) at /__w/faiss-wheels/faiss-wheels/faiss/faiss/Clustering.cpp:275: Error: 'nx >= k' failed: Number of training points (29) should be at least as large as number of clusters (256)
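The 256 in the error comes from the PQ codebooks rather than from --num_clusters: with 8 bits per sub-quantizer, each PQ codebook is trained by k-means over 2^8 = 256 centroids, and k-means requires at least as many training points as centroids. A quick sanity check in plain Python (the helper name is illustrative):

```python
# Rough lower bound on the IVFPQ training-set size. The 8-bit PQ step runs
# k-means with 2**nbits = 256 centroids per sub-quantizer, so faiss needs at
# least that many training vectors; here only 29 were available.
def min_training_points(num_clusters: int, nbits: int = 8) -> int:
    # k-means needs nx >= k for both the coarse quantizer and the PQ codebooks
    return max(num_clusters, 2 ** nbits)

print(min_training_points(num_clusters=32))  # → 256: the PQ codebooks dominate
assert 29 < min_training_points(32)          # hence the 'nx >= k' failure above
```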

If you really want to build a phrase index for this small corpus, try exact search instead of IVFPQ.
You can do this by setting --fine_quant none for build_phrase_index, but it will also require fixing the relevant parts of https://github.com/princeton-nlp/DensePhrases/blob/main/densephrases/index.py, because you will be using faiss.IndexFlatIP rather than faiss.IndexIVFPQ. So I suggest increasing the size of the corpus.
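For reference, exact inner-product search (what faiss.IndexFlatIP computes) can be sketched in plain NumPy; the names and sizes below are illustrative, not DensePhrases code:

```python
import numpy as np

# Illustrative sketch of exact maximum-inner-product search, the brute-force
# alternative that faiss.IndexFlatIP implements. It is fine for a tiny corpus
# like a 29-vector dump, where IVFPQ training cannot even run.
rng = np.random.default_rng(0)
phrase_vecs = rng.standard_normal((29, 96)).astype("float32")  # tiny "index"
query = rng.standard_normal(96).astype("float32")

scores = phrase_vecs @ query    # inner product against every stored vector
top5 = np.argsort(-scores)[:5]  # ids of the highest-scoring phrases
print(top5, scores[top5])
```

Exact search scans every vector, so there is no training step and no minimum corpus size, at the cost of linear-time queries.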

I didn't encounter any problem with article.json (default file). By the way, thanks for correcting the command for build_phrase_index.py.

@SAIVENKATARAJU
Author

Hi @jhyuklee

Thanks for Your comments.

> I've downloaded your article_original.json and found that it is too small to train the index.

I wonder about this: the default file and article_original.json are the same; I just renamed it, since my own custom data is named article.json. How come you did not see any error for the default file but did for the renamed one?

Also, did you get any deadlock while building the phrase index, as shown above?

@jhyuklee
Member

jhyuklee commented Oct 20, 2021

No, I didn't. It seems article_original.json is much smaller than articles.json. If you want a more detailed error message, set start_index.verbose = True (it is currently set to False), and likewise gpu_index.verbose = True.
Also make sure you are using GPUs to create the index.

@jhyuklee
Member

jhyuklee commented Oct 20, 2021

Oh, it seems you just copied an existing file to create the JSON file.
You should use this one: https://github.com/princeton-nlp/DensePhrases/blob/main/examples/create-custom-index/articles.json.

@jhyuklee
Member

jhyuklee commented Nov 2, 2021

Hi @SAIVENKATARAJU, is your problem solved?

@SAIVENKATARAJU
Author

Hi ,
No, unfortunately I am still stuck with the deadlock mentioned above. I get the deadlock even with your articles.json. I saw your post on Haystack; is there any plan to integrate this into Haystack soon?

@jhyuklee
Member

jhyuklee commented Nov 3, 2021

Yes, but that will take some time. I think there could be a problem with the faiss installation. You can try re-installing faiss and pytorch. Their dependencies sometimes conflict.
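Before reinstalling, it may help to confirm which versions are actually installed. A small standard-library-only sketch (package names are the common PyPI ones; adjust if you installed via conda):

```python
import importlib.metadata as md

# Print installed versions of the packages whose dependencies can conflict;
# packages that are absent are reported rather than raising an error.
for pkg in ("faiss-gpu", "faiss-cpu", "torch"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
```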

@jhyuklee
Member

Keep an eye on this issue if you need: deepset-ai/haystack#1721
