# Using k-NN search on dense and dense-sparse representations for candidate generation

### First, we need to move to the top-level directory.

In [1]:
cd ../..

/home/leo/SourceTreeGit/FlexNeuART.refact2021


## Generating purely dense (ANCE) embeddings

### Exporting data (currently we have no convenience wrapper scripts)

In [2]:
!mkdir -p collections/wikipedia_dpr_nq_sample/derived_data/nmslib/ance

In [3]:
!target/appassembler/bin/ExportToNMSLIBDenseSparseFusion  \
    -extr_json collections/wikipedia_dpr_nq_sample/exper_desc.best/extractors/ance.json \
    -fwd_index_dir collections/wikipedia_dpr_nq_sample/forward_index/ \
    -out_file collections/wikipedia_dpr_nq_sample/derived_data/nmslib/ance/export.data

[main] INFO edu.cmu.lti.oaqa.flexneuart.resources.ResourceManager - Resource manager initialization. Resource root:
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.MapDbBackend - MapDB opened for reading: collections/wikipedia_dpr_nq_sample/forward_index/dense.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryDataDict - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index/dense.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.letor.FeatExtrDenseDocEmbedDotProdSimilarity - Index field name: dense normalize embeddings:? false
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Writing the number of entries (774392) to the output file
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 100000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDenseSparseFusion - Exported 200000 docs
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLIBDens

We need to compile the latest NMSLIB [query server (details)](https://github.com/nmslib/nmslib/blob/master/manual/query_server.md) and start it from the FlexNeuART source tree root, as shown below. In this example, the server carries out a brute-force search (`-m brute_force`). It is also possible to create an index using, e.g., HNSW.

```
<path to the query server>/query_server \
    -p 8000 \
     -s sparse_dense_fusion:weightfilename=\
collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib/ance/fusion_weights \
     -i collections/wikipedia_dpr_nq_sample/derived_data/nmslib/ance/export.data \
     -m brute_force
```
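
To make clear what brute-force search means here, the following is a minimal Python sketch, not the NMSLIB implementation: it scores every document against the query with an inner product and keeps the `k` best. The function names and the toy vectors are hypothetical and only serve the illustration.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two dense vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_knn(query: Sequence[float],
                    docs: Dict[str, Sequence[float]],
                    k: int) -> List[Tuple[str, float]]:
    """Score every document against the query and return the k best."""
    scored = ((doc_id, dot(query, vec)) for doc_id, vec in docs.items())
    # nlargest scans all documents but keeps only k candidates in memory.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy data (hypothetical): three 2-dimensional "embeddings".
docs = {"d1": [1.0, 0.0], "d2": [0.5, 0.5], "d3": [0.0, 1.0]}
top = brute_force_knn([1.0, 0.2], docs, k=2)
print(top)
```

The cost grows linearly with the collection size, which is why an approximate index such as HNSW becomes attractive for larger collections.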

Now we can run a benchmark:

In [6]:
!scripts/exper/run_experiments.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/nmslib_ance.json \
    -test_part dev

Using collection root: collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib_ance.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:final_exper/ance_knn
candProv:nmslib
candProvAddConf:exper_desc.best/nmslib/ance/cand_prov.json
candProvURI:localhost:8000
candQty:1000
testOnly:1
Started a process 2177, working dir: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/ance_knn
Process log file: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/ance_knn/exper.log
Waiting for 1 child processes
Process with pid=2177 finished successfully.
Waiting for 0 child processes
1 experiments executed
0 experiments failed


__Top 100 report__:

In [7]:
!cat collections/wikipedia_dpr_nq_sample/results/dev/final_exper/ance_knn/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.705800
NDCG@20:  0.702400
NDCG@100: 0.728700
P20:      0.145400
MAP:      0.595500
MRR:      0.946100
Recall:   0.546620
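
To compare several runs programmatically, the report can be loaded into a dictionary. Below is a small sketch that assumes the report keeps the `name: value` layout shown above; `parse_report` is a hypothetical helper, not part of FlexNeuART.

```python
from typing import Dict

# The report text copied from the cell output above.
REPORT = """\
# of queries:    2500
NDCG@10:  0.705800
NDCG@20:  0.702400
NDCG@100: 0.728700
P20:      0.145400
MAP:      0.595500
MRR:      0.946100
Recall:   0.546620
"""


def parse_report(text: str) -> Dict[str, float]:
    """Parse 'name: value' lines into a {name: float} dictionary."""
    metrics: Dict[str, float] = {}
    for line in text.splitlines():
        # Split on the last colon so that the value is the trailing number.
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # not a 'name: value' line
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # the value is not numeric; skip the line
    return metrics


report = parse_report(REPORT)
print(report["NDCG@10"], report["Recall"])
```

For a real run, the text would instead be read from the `out_100.rep` file printed by the cell above.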


## Generating mixed dense-sparse (BM25+ANCE) embeddings

### Exporting data (currently we have no convenience wrapper scripts)

In [8]:
!mkdir -p collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance

In [9]:
!target/appassembler/bin/ExportToNMSLIBDenseSparseFusion  \
    -extr_json collections/wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25_ance.json \
    -fwd_index_dir collections/wikipedia_dpr_nq_sample/forward_index/ \
    -out_file collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance/export.data

[main] INFO edu.cmu.lti.oaqa.flexneuart.resources.ResourceManager - Resource manager initialization. Resource root:
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.MapDbBackend - MapDB opened for reading: collections/wikipedia_dpr_nq_sample/forward_index/text.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryDataDict - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index/text.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.MapDbBackend - MapDB opened for reading: collections/wikipedia_dpr_nq_sample/forward_index/dense.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryDataDict - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index/dense.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.letor.FeatExtrDenseDocEmbedDotProdSimilarity - Index field name: dense normalize embeddings:? false
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLI

We need to compile the latest NMSLIB [query server (details)](https://github.com/nmslib/nmslib/blob/master/manual/query_server.md) and start it from the FlexNeuART source tree root, as shown below. In this example, the server carries out a brute-force search (`-m brute_force`). It is also possible to create an index using, e.g., HNSW.

```
<path to the query server>/query_server \
    -p 8000 \
     -s sparse_dense_fusion:weightfilename=\
collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib/bm25_ance/fusion_weights \
     -i collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance/export.data \
     -m brute_force
```
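
Conceptually, a fusion similarity combines a dense and a sparse inner product using per-component weights. The sketch below illustrates this idea under the assumption that the score is a plain weighted sum. The weight values are hypothetical; the format of the `fusion_weights` file and the exact NMSLIB computation are not shown in this notebook.

```python
from typing import Dict, Sequence


def dense_dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Inner product of two sparse vectors stored as {term_id: value}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(v * b.get(i, 0.0) for i, v in a.items())


def fusion_score(q_dense: Sequence[float], d_dense: Sequence[float],
                 q_sparse: Dict[int, float], d_sparse: Dict[int, float],
                 w_dense: float, w_sparse: float) -> float:
    """Weighted sum of the dense and the sparse inner products."""
    return (w_dense * dense_dot(q_dense, d_dense)
            + w_sparse * sparse_dot(q_sparse, d_sparse))


# Toy example with hypothetical weights (0.5 for dense, 2.0 for sparse).
score = fusion_score(q_dense=[1.0, 2.0], d_dense=[3.0, 4.0],
                     q_sparse={7: 2.0, 42: 1.0}, d_sparse={42: 3.0, 99: 5.0},
                     w_dense=0.5, w_sparse=2.0)
print(score)  # 0.5 * 11.0 + 2.0 * 3.0 = 11.5
```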

Now we can run a benchmark:

In [10]:
!scripts/exper/run_experiments.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/nmslib_bm25_ance.json \
    -test_part dev

Using collection root: collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib_bm25_ance.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:final_exper/bm25_ance_knn
candProv:nmslib
candProvAddConf:exper_desc.best/nmslib/bm25_ance/cand_prov.json
candProvURI:localhost:8000
candQty:1000
testOnly:1
Started a process 2345, working dir: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn
Process log file: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn/exper.log
Waiting for 1 child processes
Process with pid=2345 finished successfully.
Waiting for 0 child processes
1 experiments executed
0 exper

__Top 100 report__:

In [11]:
!cat collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.714900
NDCG@20:  0.712100
NDCG@100: 0.740300
P20:      0.151700
MAP:      0.606000
MRR:      0.945300
Recall:   0.577034
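
To see the effect of adding BM25 to ANCE at a glance, we can compute the per-metric differences between the two top-100 reports. The numbers below are copied from the two reports in this notebook; the script itself is only an illustrative sketch.

```python
# Metrics copied from the ANCE and the BM25+ANCE top-100 reports above.
ance = {"NDCG@10": 0.705800, "NDCG@20": 0.702400, "NDCG@100": 0.728700,
        "P20": 0.145400, "MAP": 0.595500, "MRR": 0.946100,
        "Recall": 0.546620}
bm25_ance = {"NDCG@10": 0.714900, "NDCG@20": 0.712100, "NDCG@100": 0.740300,
             "P20": 0.151700, "MAP": 0.606000, "MRR": 0.945300,
             "Recall": 0.577034}

# Absolute difference (BM25+ANCE minus ANCE) for every metric.
deltas = {name: bm25_ance[name] - ance[name] for name in ance}

for name, delta in deltas.items():
    print(f"{name:9s} {ance[name]:.4f} -> {bm25_ance[name]:.4f} ({delta:+.4f})")
```

In these two runs, every metric except MRR improves, and the largest absolute gain is in Recall.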


## Generating mixed dense-sparse (BM25+ANCE) embeddings in the interleaved sparse format

Note: we expect this mode to be useful primarily for purely sparse mixes. Otherwise, its efficiency is not great.

### Exporting data (currently we have no convenience wrapper scripts)

In [12]:
!mkdir -p collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance_interleaved

In [13]:
!target/appassembler/bin/ExportToNMSLIBSparse  \
    -extr_json collections/wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25_ance.json \
    -model_file collections/wikipedia_dpr_nq_sample/exper_desc.best/models/bm25_ance.model \
    -fwd_index_dir collections/wikipedia_dpr_nq_sample/forward_index/ \
    -out_file collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance_interleaved/export.data

[main] INFO edu.cmu.lti.oaqa.flexneuart.resources.ResourceManager - Resource manager initialization. Resource root:
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.MapDbBackend - MapDB opened for reading: collections/wikipedia_dpr_nq_sample/forward_index/text.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryDataDict - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index/text.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.MapDbBackend - MapDB opened for reading: collections/wikipedia_dpr_nq_sample/forward_index/dense.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryDataDict - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index/dense.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.letor.FeatExtrDenseDocEmbedDotProdSimilarity - Index field name: dense normalize embeddings:? false
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportToNMSLI

We need to compile the latest NMSLIB [query server (details)](https://github.com/nmslib/nmslib/blob/master/manual/query_server.md) and start it from the FlexNeuART source tree root, as shown below. In this example, the server carries out a brute-force search (`-m brute_force`). It is also possible to create an index using, e.g., HNSW. The __difference__ from the previous examples is that the similarity function here is a simple unweighted inner product between sparse vectors:

```
<path to the query server>/query_server \
    -p 8000 \
     -s negdotprod_sparse_bin_fast \
     -i collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance_interleaved/export.data \
     -m brute_force
```
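
A plain inner product can reproduce a weighted fusion score if the weights are folded into the exported document vectors. The sketch below illustrates this idea with one combined sparse vector per item. The layout (dense dimensions appended at an ID offset), the weights, and all names are assumptions made for illustration; the actual export format may differ.

```python
from typing import Dict, Sequence


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Unweighted inner product of two sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(v * b.get(i, 0.0) for i, v in a.items())


def to_combined_sparse(sparse: Dict[int, float], dense: Sequence[float],
                       w_sparse: float, w_dense: float,
                       dense_offset: int) -> Dict[int, float]:
    """Fold weights into one sparse vector; dense dimensions get IDs
    starting at dense_offset, which must exceed every sparse term ID."""
    combined = {i: w_sparse * v for i, v in sparse.items()}
    for j, v in enumerate(dense):
        combined[dense_offset + j] = w_dense * v
    return combined


OFFSET = 1000  # hypothetical: larger than any sparse term ID used below

# Weights (hypothetical) are applied on the document side only.
doc = to_combined_sparse({42: 3.0, 99: 5.0}, [3.0, 4.0],
                         w_sparse=2.0, w_dense=0.5, dense_offset=OFFSET)
query = to_combined_sparse({7: 2.0, 42: 1.0}, [1.0, 2.0],
                           w_sparse=1.0, w_dense=1.0, dense_offset=OFFSET)

score = sparse_dot(query, doc)
print(score)  # 2.0 * 3.0 + 0.5 * 11.0 = 11.5
```

Every dense dimension becomes a non-zero entry of the sparse vector, which is one plausible reason why this mode is less efficient when dense components are present.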

Now we can run a benchmark:

In [3]:
!scripts/exper/run_experiments.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/nmslib_bm25_ance_interleaved.json \
    -test_part dev \
    -thread_qty 4

Using collection root: collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        4
Experiment descriptor file:                                 collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib_bm25_ance_interleaved.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 4
Parsed experiment parameters:
experSubdir:final_exper/bm25_ance_knn_interleaved
candProv:nmslib
candProvAddConf:exper_desc.best/nmslib/bm25_ance_interleaved/cand_prov.json
candProvURI:localhost:8000
candQty:1000
testOnly:1
Started a process 3054, working dir: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn_interleaved
Process log file: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn_interleaved/exper.log
Waiting for 1 child processes
Process with pid=3054 finished successfully.


__Top 100 report__:

In [4]:
!cat collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.714900
NDCG@20:  0.712100
NDCG@100: 0.740300
P20:      0.151700
MAP:      0.606000
MRR:      0.945300
Recall:   0.577034


## Generating mixed dense-sparse (BM25 + averaged GloVe) embeddings

### Exporting data (currently we have no convenience wrapper scripts)

In [5]:
!mkdir -p collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_avgembed

In [6]:
!target/appassembler/bin/ExportToNMSLIBDenseSparseFusion  \
    -extr_json collections/wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25_avgembed.json \
    -fwd_index_dir collections/wikipedia_dpr_nq_sample/forward_index/ \
    -embed_dir collections/wikipedia_dpr_nq_sample/derived_data/embeddings \
    -out_file collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_avgembed/export.data

[main] INFO edu.cmu.lti.oaqa.flexneuart.resources.ResourceManager - Resource manager initialization. Resource root:
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.MapDbBackend - MapDB opened for reading: collections/wikipedia_dpr_nq_sample/forward_index/text.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryDataDict - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index/text.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.MapDbBackend - MapDB opened for reading: collections/wikipedia_dpr_nq_sample/forward_index/text_unlemm.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndexBinaryDataDict - Finished loading context from file: collections/wikipedia_dpr_nq_sample/forward_index/text_unlemm.mapdb_dataDict
[main] INFO edu.cmu.lti.oaqa.flexneuart.letor.EmbeddingReaderAndRecoder - Loaded 50000 source word vectors from 'collections/wikipedia_dpr_nq_sample/derived_data/embeddings/glove/glove.6
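
For readers unfamiliar with averaged word embeddings, here is a minimal sketch of the general idea: a text is represented by the mean of the vectors of its known words. This is a plain unweighted mean with toy vectors. The actual FlexNeuART extractor may weight words or normalize the result, so the sketch should be treated as an assumption rather than a description of its behavior.

```python
from typing import Dict, List, Sequence


def average_embedding(tokens: Sequence[str],
                      word_vectors: Dict[str, Sequence[float]],
                      dim: int) -> List[float]:
    """Mean of the vectors of all tokens found in the vocabulary."""
    total = [0.0] * dim
    found = 0
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is None:
            continue  # skip out-of-vocabulary words
        for i, x in enumerate(vec):
            total[i] += x
        found += 1
    if found == 0:
        return total  # all-zero vector when no word is known
    return [x / found for x in total]


# Toy 2-dimensional vectors (hypothetical, not real GloVe values).
vectors = {"nearest": [1.0, 0.0], "neighbor": [0.0, 1.0]}
doc_vec = average_embedding(["nearest", "neighbor", "unknownword"], vectors, dim=2)
print(doc_vec)
```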

We need to compile the latest NMSLIB [query server (details)](https://github.com/nmslib/nmslib/blob/master/manual/query_server.md) and start it from the FlexNeuART source tree root, as shown below. In this example, the server carries out a brute-force search (`-m brute_force`). It is also possible to create an index using, e.g., HNSW. Unlike the previous (interleaved) example, the similarity function is again the weighted `sparse_dense_fusion`:

```
<path to the query server>/query_server \
    -p 8000 \
     -s sparse_dense_fusion:weightfilename=\
collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib/bm25_avgembed/fusion_weights \
     -i collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_avgembed/export.data \
     -m brute_force
```

Now we can run a benchmark:

In [7]:
!scripts/exper/run_experiments.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/nmslib_bm25_avgembed.json \
    -test_part dev \
    -thread_qty 4

Using collection root: collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        4
Experiment descriptor file:                                 collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib_bm25_avgembed.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 4
Parsed experiment parameters:
experSubdir:final_exper/bm25_avgembed_knn
candProv:nmslib
candProvAddConf:exper_desc.best/nmslib/bm25_avgembed/cand_prov.json
candProvURI:localhost:8000
candQty:1000
testOnly:1
Started a process 3267, working dir: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_avgembed_knn
Process log file: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_avgembed_knn/exper.log
Waiting for 1 child processes
Process with pid=3267 finished successfully.
Waiting for 0 child processes
1 experime

__Top 100 report__:

In [8]:
!cat collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_avgembed_knn/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.402300
NDCG@20:  0.434900
NDCG@100: 0.507800
P20:      0.165600
MAP:      0.345600
MRR:      0.486600
Recall:   0.820033
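
This report has a much lower MRR but a higher Recall than the ANCE-based reports above. The sketch below shows the textbook definitions of both metrics for a single query, to illustrate how they can diverge. We assume these standard definitions; the evaluation tool behind the report may differ in details such as cutoffs and averaging.

```python
from typing import Sequence, Set


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents found among the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


# Toy ranking (hypothetical document IDs).
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

rr = reciprocal_rank(ranked, relevant)       # first relevant hit at rank 2
rec = recall_at_k(ranked, relevant, k=4)     # 2 of 3 relevant documents found
print(rr, rec)
```

MRR depends only on the position of the first relevant document, whereas Recall counts how many relevant documents appear anywhere in the top-k list, so a run can retrieve many relevant documents and still rank them low.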
