# Using k-NN search on dense and dense-sparse representations for candidate generation

### First, we need to move to the top-level directory.

In [1]:
cd ../..

/home/leo/SourceTreeGit/FlexNeuART.refact2021


## Generating purely dense (ANCE) embeddings

### Exporting data (currently we have no convenience wrapper scripts)

In [7]:
!mkdir -p collections/wikipedia_dpr_nq_sample/derived_data/nmslib/ance

In [8]:
!target/appassembler/bin/ExportToNMSLIBDenseSparseFusion  \
    -extr_json collections/wikipedia_dpr_nq_sample/exper_desc.best/extractors/ance.json \
    -fwd_index_dir collections/wikipedia_dpr_nq_sample/forward_index/ \
    -out_file collections/wikipedia_dpr_nq_sample/derived_data/nmslib/ance/export.data

[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryMapDb - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index//dense.mapdb
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Writing the number of entries (774392) to the output file
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 10000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 20000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 30000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 40000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 50000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 60000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 70000 do

Next, compile the latest NMSLIB [query server (details)](https://github.com/nmslib/nmslib/blob/master/manual/query_server.md) and start it from the FlexNeuART source tree root as shown below. In this example, the server carries out a brute-force search. It is also possible to create an index using, e.g., HNSW:

```
<path to the query server>/query_server \
    -p 8000 \
     -s sparse_dense_fusion:weightfilename=\
collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib/ance/fusion_weights \
     -i collections/wikipedia_dpr_nq_sample/derived_data/nmslib/ance/export.data \
     -m brute_force
```
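
Before launching the benchmark, it can help to confirm that the query server is actually listening. The sketch below is not part of FlexNeuART or NMSLIB: `server_is_up` is a hypothetical helper, and the default host and port simply mirror the `-p 8000` option above and the `candProvURI:localhost:8000` setting shown in the experiment output.

```python
import socket


def server_is_up(host: str = "localhost", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only checks that something accepts connections on the port;
    it does not verify that the process is the NMSLIB query server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Calling `server_is_up()` from a notebook cell before `run_experiments.sh` gives a quick yes/no answer instead of a failed experiment.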

Now we can run a benchmark:

In [3]:
!scripts/exper/run_experiments.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/nmslib_ance.json \
    -test_part dev

Using collection root: collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib_ance.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:final_exper/ance_knn
candProv:nmslib
candProvAddConfParam:exper_desc.best/nmslib/ance/cand_prov.json
candProvURI:localhost:8000
candQty:1000
testOnly:1
Started a process 3153, working dir: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/ance_knn
Process log file: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/ance_knn/exper.log
Waiting for 1 child processes
Process with pid=3153 finished successfully.
Waiting for 0 child processes
1 experiments executed
0 experiments failed


__Top 100 report__:

In [4]:
!cat collections/wikipedia_dpr_nq_sample/results/dev/final_exper/ance_knn/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.705800
NDCG@20:  0.702400
NDCG@100: 0.728700
P20:      0.145400
MAP:      0.595500
MRR:      0.946100
Recall:   0.546620
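
To work with these numbers programmatically, the report can be read into a dictionary. This is a minimal sketch that assumes every report line has the `name: value` shape shown above; `parse_report` is a hypothetical helper, not a FlexNeuART function, and the sample text is copied from the output above.

```python
def parse_report(text: str) -> dict:
    """Parse 'name: value' lines of a report into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        # Split on the last colon so that names such as 'NDCG@10' stay intact.
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # not a 'name: value' line
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # value is not numeric, skip the line
    return metrics


SAMPLE = """\
# of queries:    2500
NDCG@10:  0.705800
NDCG@20:  0.702400
NDCG@100: 0.728700
P20:      0.145400
MAP:      0.595500
MRR:      0.946100
Recall:   0.546620
"""

metrics = parse_report(SAMPLE)
```

In practice the text would come from reading the `out_100.rep` file instead of the inline sample.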


## Generating mixed dense-sparse (BM25+ANCE) embeddings

### Exporting data (currently we have no convenience wrapper scripts)

In [11]:
!mkdir -p collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance

In [12]:
!target/appassembler/bin/ExportToNMSLIBDenseSparseFusion  \
    -extr_json collections/wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25_ance.json \
    -fwd_index_dir collections/wikipedia_dpr_nq_sample/forward_index/ \
    -out_file collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance/export.data

[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryMapDb - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index//text.mapdb
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryMapDb - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index//dense.mapdb
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Writing the number of entries (774392) to the output file
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 10000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 20000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 30000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 40000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 50000 docs
[main] INFO edu.cmu.lti

Next, compile the latest NMSLIB [query server (details)](https://github.com/nmslib/nmslib/blob/master/manual/query_server.md) and start it from the FlexNeuART source tree root as shown below. In this example, the server carries out a brute-force search. It is also possible to create an index using, e.g., HNSW:

```
<path to the query server>/query_server \
    -p 8000 \
     -s sparse_dense_fusion:weightfilename=\
collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib/bm25_ance/fusion_weights \
     -i collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance/export.data \
     -m brute_force
```
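
For intuition, a brute-force search scores every exported document against the query and keeps the best `k`. The sketch below uses toy vectors and a plain inner product. It is not NMSLIB's implementation and ignores the fusion weights; `brute_force_knn` is a hypothetical name.

```python
import heapq


def brute_force_knn(query, docs, k):
    """Return the k (doc_id, score) pairs with the largest inner product.

    query: list of floats
    docs:  dict mapping doc_id -> list of floats of the same length
    """
    scored = (
        (doc_id, sum(q * d for q, d in zip(query, vec)))
        for doc_id, vec in docs.items()
    )
    # nlargest scans all documents once, which is exactly what makes
    # brute-force search exact but linear in the collection size.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


toy_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top2 = brute_force_knn([2.0, 1.0], toy_docs, k=2)
```

An index such as HNSW trades this exhaustive scan for an approximate but much faster lookup.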

Now we can run a benchmark:

In [5]:
!scripts/exper/run_experiments.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/nmslib_bm25_ance.json \
    -test_part dev

Using collection root: collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib_bm25_ance.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:final_exper/bm25_ance_knn
candProv:nmslib
candProvAddConfParam:exper_desc.best/nmslib/bm25_ance/cand_prov.json
candProvURI:localhost:8000
candQty:1000
testOnly:1
Started a process 3391, working dir: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn
Process log file: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn/exper.log
Waiting for 1 child processes
Process with pid=3391 finished successfully.
Waiting for 0 child processes
1 experiments executed
0 

__Top 100 report__:

In [11]:
!cat collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.714900
NDCG@20:  0.712100
NDCG@100: 0.740300
P20:      0.151700
MAP:      0.606000
MRR:      0.945300
Recall:   0.577034
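
The effect of adding BM25 is easier to see as per-metric differences. The sketch below hard-codes the values from the two top-100 reports in this notebook (ANCE only and BM25+ANCE); `metric_deltas` is a hypothetical helper.

```python
# Values transcribed from the ANCE and BM25+ANCE top-100 reports in this notebook.
ance = {
    "NDCG@10": 0.7058, "NDCG@20": 0.7024, "NDCG@100": 0.7287,
    "P20": 0.1454, "MAP": 0.5955, "MRR": 0.9461, "Recall": 0.546620,
}
bm25_ance = {
    "NDCG@10": 0.7149, "NDCG@20": 0.7121, "NDCG@100": 0.7403,
    "P20": 0.1517, "MAP": 0.6060, "MRR": 0.9453, "Recall": 0.577034,
}


def metric_deltas(baseline, other):
    """Return other - baseline for every metric present in both reports."""
    return {
        name: round(other[name] - baseline[name], 6)
        for name in baseline
        if name in other
    }


deltas = metric_deltas(ance, bm25_ance)
for name, delta in deltas.items():
    print(f"{name:10s} {delta:+.6f}")
```

On these numbers, every metric except MRR moves up, with recall showing the largest absolute change.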


## Generating mixed dense-sparse (BM25+ANCE) embeddings in the interleaved format

Note: we expect this mode to be useful primarily for purely sparse mixes. Otherwise, its efficiency is not great.

### Exporting data (currently we have no convenience wrapper scripts)

In [15]:
!mkdir -p collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance_interleaved

In [16]:
!target/appassembler/bin/ExportToNMSLIBSparse  \
    -extr_json collections/wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25_ance.json \
    -model_file collections/wikipedia_dpr_nq_sample/exper_desc.best/models/bm25_ance.model \
    -fwd_index_dir collections/wikipedia_dpr_nq_sample/forward_index/ \
    -out_file collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance_interleaved/export.data

[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryMapDb - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index//text.mapdb
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryMapDb - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index//dense.mapdb
Writing the number of entries (774392) to the output file
Exported 10000 docs
Exported 20000 docs
Exported 30000 docs
Exported 40000 docs
Exported 50000 docs
Exported 60000 docs
Exported 70000 docs
Exported 80000 docs
Exported 90000 docs
Exported 100000 docs
Exported 110000 docs
Exported 120000 docs
Exported 130000 docs
Exported 140000 docs
Exported 150000 docs
Exported 160000 docs
Exported 170000 docs
Exported 180000 docs
Exported 190000 docs
Exported 200000 docs
Exported 210000 docs
Exported 220000 docs
Exported 230000 docs
Exported 240000 docs
Exported 250000 docs
Exported 260000 docs
Exported 270000 docs
Exported 280000 docs
Exported 290000 do

Next, compile the latest NMSLIB [query server (details)](https://github.com/nmslib/nmslib/blob/master/manual/query_server.md) and start it from the FlexNeuART source tree root as shown below. In this example, the server carries out a brute-force search. It is also possible to create an index using, e.g., HNSW. The __difference__ from the previous examples is that the similarity function here is a simple unweighted inner product between sparse vectors:

```
<path to the query server>/query_server \
    -p 8000 \
     -s negdotprod_sparse_bin_fast \
     -i collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance_interleaved/export.data \
     -m brute_force
```
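
Why can an unweighted inner product stand in for a weighted fusion? One plausible reading, which is an assumption and not something this notebook confirms, is that the export step (which takes `-model_file`) scales each component by its model weight and writes one combined vector per document. The sketch below only checks the underlying identity on toy data; all names are hypothetical, sparse vectors are written as plain lists for brevity, and the real export layout may differ.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def fused_score(q_sparse, d_sparse, q_dense, d_dense, w_sparse, w_dense):
    """Weighted sum of the sparse and dense inner products."""
    return w_sparse * dot(q_sparse, d_sparse) + w_dense * dot(q_dense, d_dense)


def combine(sparse, dense, w_sparse=1.0, w_dense=1.0):
    """Concatenate both parts into one vector, scaling each part by its weight."""
    return [w_sparse * x for x in sparse] + [w_dense * x for x in dense]


q_sparse, d_sparse = [1.0, 0.0, 2.0], [0.5, 3.0, 1.0]
q_dense, d_dense = [0.1, -0.2], [0.3, 0.4]
w_sparse, w_dense = 0.7, 1.3

weighted = fused_score(q_sparse, d_sparse, q_dense, d_dense, w_sparse, w_dense)
# Weights are folded into the document vector (assumed to happen at export time),
# so the query-time similarity needs no weights at all.
unweighted = dot(combine(q_sparse, q_dense), combine(d_sparse, d_dense, w_sparse, w_dense))

assert math.isclose(weighted, unweighted)
```

Under that assumption, both setups rank documents identically, and the choice between them is about storage and speed, not quality.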

Now we can run a benchmark:

In [9]:
!scripts/exper/run_experiments.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/nmslib_bm25_ance_interleaved.json \
    -test_part dev

Using collection root: collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib_bm25_ance_interleaved.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:final_exper/bm25_ance_knn_interleaved
candProv:nmslib
candProvAddConfParam:exper_desc.best/nmslib/bm25_ance_interleaved/cand_prov.json
candProvURI:localhost:8000
candQty:1000
testOnly:1
Started a process 3686, working dir: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn_interleaved
Process log file: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn_interleaved/exper.log
Waiting for 1 child processes
Process with pid=3686 finished successfu

__Top 100 report__:

In [6]:
!cat collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.714900
NDCG@20:  0.712100
NDCG@100: 0.740300
P20:      0.151700
MAP:      0.606000
MRR:      0.945300
Recall:   0.577034


## Generating mixed dense-sparse (BM25+averaged GloVe) embeddings

### Exporting data (currently we have no convenience wrapper scripts)

In [4]:
!mkdir -p collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_avgembed

In [6]:
!target/appassembler/bin/ExportToNMSLIBDenseSparseFusion  \
    -extr_json collections/wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25_avgembed.json \
    -fwd_index_dir collections/wikipedia_dpr_nq_sample/forward_index/ \
    -embed_dir collections/wikipedia_dpr_nq_sample/derived_data/embeddings \
    -out_file collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_avgembed/export.data

[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryMapDb - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index//text.mapdb
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryMapDb - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index//text_unlemm.mapdb
[main] INFO edu.cmu.lti.oaqa.flexneuart.letor.EmbeddingReaderAndRecoder - Loaded 50000 source word vectors from 'collections/wikipedia_dpr_nq_sample/derived_data/embeddings/glove/glove.6B.50d.txt.bz2'
[main] INFO edu.cmu.lti.oaqa.flexneuart.letor.EmbeddingReaderAndRecoder - Loaded 100000 source word vectors from 'collections/wikipedia_dpr_nq_sample/derived_data/embeddings/glove/glove.6B.50d.txt.bz2'
[main] INFO edu.cmu.lti.oaqa.flexneuart.letor.EmbeddingReaderAndRecoder - Loaded 150000 source word vectors from 'collections/wikipedia_dpr_nq_sample/derived_data/embeddings/glove/glove.6B.50d.txt.bz2'
[main] INFO edu.cmu.lti.oaqa.flexneuart.leto

[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 600000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 610000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 620000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 630000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 640000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 650000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 660000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 670000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 680000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 690000 docs
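
The dense part of this representation comes from averaging word embeddings. As a rough illustration, the sketch below computes an unweighted mean of the vectors of known tokens. This is an assumption about the general technique: the actual extractor configured in `bm25_avgembed.json` may weight or normalize vectors differently, and `average_embedding` is a hypothetical name.

```python
def average_embedding(tokens, word_vectors, dim):
    """Return the element-wise mean of the vectors of known tokens.

    tokens:       list of strings
    word_vectors: dict mapping token -> list of floats of length dim
    dim:          embedding dimensionality, used for the all-zero fallback
    """
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        # No token has an embedding: fall back to a zero vector.
        return [0.0] * dim
    return [sum(column) / len(vecs) for column in zip(*vecs)]


toy_vectors = {"dense": [1.0, 2.0], "search": [3.0, 4.0]}
doc_vec = average_embedding(["dense", "search", "unknownword"], toy_vectors, dim=2)
```

Out-of-vocabulary tokens are skipped here, so they do not pull the average toward zero.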


Next, compile the latest NMSLIB [query server (details)](https://github.com/nmslib/nmslib/blob/master/manual/query_server.md) and start it from the FlexNeuART source tree root as shown below. In this example, the server carries out a brute-force search. It is also possible to create an index using, e.g., HNSW:

```
<path to the query server>/query_server \
    -p 8000 \
     -s sparse_dense_fusion:weightfilename=\
collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib/bm25_avgembed/fusion_weights \
     -i collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_avgembed/export.data \
     -m brute_force
```

Now we can run a benchmark:

In [7]:
!scripts/exper/run_experiments.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/nmslib_bm25_avgembed.json \
    -test_part dev

Using collection root: collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib_bm25_avgembed.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:final_exper/bm25_avgembed_knn
candProv:nmslib
candProvAddConfParam:exper_desc.best/nmslib/bm25_avgembed/cand_prov.json
candProvURI:localhost:8000
candQty:1000
testOnly:1
Started a process 5616, working dir: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_avgembed_knn
Process log file: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_avgembed_knn/exper.log
Waiting for 1 child processes
Process with pid=5616 finished successfully.
Waiting for 0 child processes
1 exp

__Top 100 report__:

In [8]:
!cat collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_avgembed_knn/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.402300
NDCG@20:  0.434900
NDCG@100: 0.507800
P20:      0.165600
MAP:      0.345600
MRR:      0.486600
Recall:   0.820033
