# Using k-NN search on dense and dense-sparse representation for candidate generation

### Two things to do before we start:
1. Point environment variable `COLLECT_ROOT` to the collection root.
2. Change directory to the location of installed scripts/binaries

In [1]:
%env COLLECT_ROOT=/home/leo/flexneuart_collections

env: COLLECT_ROOT=/home/leo/flexneuart_collections


In [2]:
cd /home/leo/flexneuart_scripts/

/home/leo/flexneuart_scripts


## Generating purely dense (ANCE) embeddings

### Exporting data

In [3]:
!mkdir -p $COLLECT_ROOT/wikipedia_dpr_nq_sample/derived_data/nmslib/ance

In [4]:
!./export_nmslib/export_nmslib_dense_sparse_fused.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/extractors/ance.json \
    nmslib/ance/export.data

Using collection root: /home/leo/flexneuart_collections
Collection directory:      /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample                                    
Extractor JSON:            exper_desc.best/extractors/ance.json                                      
Output file:               /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/derived_data/nmslib/ance/export.data                                       
Forward index directory:   forward_index                                   
Model 1 directory:         derived_data/giza                                  
Embedding directory:       derived_data/embeddings                                   
[main] INFO edu.cmu.lti.oaqa.flexneuart.resources.ResourceManager - Resource manager initialization. Resource root:/home/leo/flexneuart_collections/wikipedia_dpr_nq_sample
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.MapDbBackend - MapDB opened for reading: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/for

We need to compile the latest NMSLIB [query server (details)](https://github.com/nmslib/nmslib/blob/master/manual/query_server.md). Then it needs to be started from the FlexNeuART source tree root as shown below. In our example, the server is going to carry out the brute-force search. It is also possible to create an index using, e.g., HNSW:

```
export COLLECT_ROOT=/home/leo/flexneuart_collections
<path to the query server>/query_server \
    -p 8000 \
     -s sparse_dense_fusion:weightfilename=$COLLECT_ROOT/wikipedia_dpr_nq_sample/exper_desc.best/nmslib/ance/fusion_weights \
     -i $COLLECT_ROOT/wikipedia_dpr_nq_sample/derived_data/nmslib/ance/export.data \
     -m brute_force


```

Now we can run a benchmark:

In [7]:
!./exper/run_experiments.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/nmslib_ance.json \
    -test_part dev -clean

Using collection root: /home/leo/flexneuart_collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib_ance.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:final_exper/ance_knn
candProv:nmslib
candProvAddConf:exper_desc.best/nmslib/ance/cand_prov.json
candProvURI:localhost:8000
candQty:1000
testOnly:1
Experimental directory already exists (removing contents): /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/final_exper/ance_knn
Cleaning the experimental directory: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/final_exper/ance_knn
Started a process 15324, working dir: /home/l

__Top-100 report:__

In [9]:
!cat $COLLECT_ROOT/wikipedia_dpr_nq_sample/results/dev/final_exper/ance_knn/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.705800
NDCG@20:  0.702400
NDCG@100: 0.728700
P20:      0.145400
MAP:      0.595500
MRR:      0.946100
Recall:   0.546620


## Generating mixed dense-sparse (BM25+ANCE) embeddings

### Exporting data

In [10]:
!mkdir -p $COLLECT_ROOT/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance

In [11]:
!./export_nmslib/export_nmslib_dense_sparse_fused.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/extractors/bm25_ance.json \
    nmslib/bm25_ance/export.data

Using collection root: /home/leo/flexneuart_collections
Collection directory:      /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample                                    
Extractor JSON:            exper_desc.best/extractors/bm25_ance.json                                      
Output file:               /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance/export.data                                       
Forward index directory:   forward_index                                   
Model 1 directory:         derived_data/giza                                  
Embedding directory:       derived_data/embeddings                                   
[main] INFO edu.cmu.lti.oaqa.flexneuart.resources.ResourceManager - Resource manager initialization. Resource root:/home/leo/flexneuart_collections/wikipedia_dpr_nq_sample
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.MapDbBackend - MapDB opened for reading: /home/leo/flexneuart_collections/wikipedia_dpr_nq_

We need to compile the latest NMSLIB [query server (details)](https://github.com/nmslib/nmslib/blob/master/manual/query_server.md). Then it needs to be started from the FlexNeuART source tree root as shown below. In our example, the server is going to carry out the brute-force search. It is also possible to create an index using, e.g., HNSW:

```
<path to the query server>/query_server \
    -p 8000 \
     -s sparse_dense_fusion:weightfilename=\
$COLLECT_ROOT/wikipedia_dpr_nq_sample/exper_desc.best/nmslib/bm25_ance/fusion_weights \
     -i $COLLECT_ROOT/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance/export.data \
     -m brute_force
```

Now we can run a benchmark:

In [13]:
!./exper/run_experiments.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/nmslib_bm25_ance.json \
    -test_part dev

Using collection root: /home/leo/flexneuart_collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib_bm25_ance.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:final_exper/bm25_ance_knn
candProv:nmslib
candProvAddConf:exper_desc.best/nmslib/bm25_ance/cand_prov.json
candProvURI:localhost:8000
candQty:1000
testOnly:1
Started a process 15552, working dir: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn
Process log file: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn/exper.log
Waiting for 1 child processes
Process with pid=155

__Top 100 report__:

In [15]:
!cat $COLLECT_ROOT/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.714900
NDCG@20:  0.712100
NDCG@100: 0.740300
P20:      0.151700
MAP:      0.606000
MRR:      0.945300
Recall:   0.577034


## Generating mixed dense-sparse (BM25+ANCE) embeddings

## Note, we expect this mode is going primarily useful for purely sparse mixes. For mostly dense mixes, the efficiency is subpar compared to dense-sparse export

### Exporting data

In [16]:
!mkdir -p $COLLECT_ROOT/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance_interleaved

In [18]:
!./export_nmslib/export_nmslib_sparse.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/extractors/bm25_ance.json \
    nmslib/bm25_ance_interleaved/export.data  \
    -model_file exper_desc.best/models/bm25_ance.model 

Using collection root: /home/leo/flexneuart_collections
Collection directory:      /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample                                    
Extractor JSON:            exper_desc.best/extractors/bm25_ance.json                                      
Output file:               /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance_interleaved/export.data                                       
Forward index directory:   forward_index                                   
Model 1 directory:         derived_data/giza                                  
Embedding directory:       derived_data/embeddings                                   
Model file:                exper_desc.best/models/bm25_ance.model                                     
[main] INFO edu.cmu.lti.oaqa.flexneuart.resources.ResourceManager - Resource manager initialization. Resource root:/home/leo/flexneuart_collections/wikipedia_dpr_nq_sample
[main] INFO edu.cmu.lti.

We need to compile the latest NMSLIB [query server (details)](https://github.com/nmslib/nmslib/blob/master/manual/query_server.md). Then it needs to be started from the FlexNeuART source tree root as shown below. In our example, the server is going to carry out the brute-force search. It is also possible to create an index using, e.g., HNSW. The __difference__ from the previous examples is that the similarity function here is a simple unweighted inner product between sparse vectors:

```
<path to the query server>/query_server \
    -p 8000 \
     -s negdotprod_sparse_bin_fast \
     -i $COLLECT_ROOT/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_ance_interleaved/export.data \
     -m brute_force
```

Now we can run a benchmark:

In [20]:
!./exper/run_experiments.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/nmslib_bm25_ance_interleaved.json \
    -test_part dev \
    -thread_qty 4 \
    -clean

Using collection root: /home/leo/flexneuart_collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        4
Experiment descriptor file:                                 /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib_bm25_ance_interleaved.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 4
Parsed experiment parameters:
experSubdir:final_exper/bm25_ance_knn_interleaved
candProv:nmslib
candProvAddConf:exper_desc.best/nmslib/bm25_ance_interleaved/cand_prov.json
candProvURI:localhost:8000
candQty:1000
testOnly:1
Experimental directory already exists (removing contents): /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn_interleaved
Cleaning the experimental directory: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/de

__Top 100 report__:

In [21]:
!cat $COLLECT_ROOT/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_ance_knn/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.714900
NDCG@20:  0.712100
NDCG@100: 0.740300
P20:      0.151700
MAP:      0.606000
MRR:      0.945300
Recall:   0.577034


## Generating mixed dense-sparse (BM25+glove averaged) embeddings

### Exporting data

In [22]:
!mkdir -p $COLLECT_ROOT/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_avgembed

In [23]:
!./export_nmslib/export_nmslib_dense_sparse_fused.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/extractors/bm25_avgembed.json \
    nmslib/bm25_avgembed/export.data

Using collection root: /home/leo/flexneuart_collections
Collection directory:      /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample                                    
Extractor JSON:            exper_desc.best/extractors/bm25_avgembed.json                                      
Output file:               /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_avgembed/export.data                                       
Forward index directory:   forward_index                                   
Model 1 directory:         derived_data/giza                                  
Embedding directory:       derived_data/embeddings                                   
Model file:                                                     
[main] INFO edu.cmu.lti.oaqa.flexneuart.resources.ResourceManager - Resource manager initialization. Resource root:/home/leo/flexneuart_collections/wikipedia_dpr_nq_sample
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.MapDbBackend - Map

We need to compile the latest NMSLIB [query server (details)](https://github.com/nmslib/nmslib/blob/master/manual/query_server.md). Then it needs to be started from the FlexNeuART source tree root as shown below. In our example, the server is going to carry out the brute-force search. It is also possible to create an index using, e.g., HNSW. The __difference__ from the previous examples is that the similarity function here is a simple unweighted inner product between sparse vectors:

```
<path to the query server>/query_server \
    -p 8000 \
     -s sparse_dense_fusion:weightfilename=\
$COLLECT_ROOT/wikipedia_dpr_nq_sample/exper_desc.best/nmslib/bm25_avgembed/fusion_weights \
     -i $COLLECT_ROOT/wikipedia_dpr_nq_sample/derived_data/nmslib/bm25_avgembed/export.data \
     -m brute_force
```

Now we can run a benchmark:

In [24]:
!./exper/run_experiments.sh \
    wikipedia_dpr_nq_sample \
    exper_desc.best/nmslib_bm25_avgembed.json \
    -test_part dev \
    -thread_qty 4

Using collection root: /home/leo/flexneuart_collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        4
Experiment descriptor file:                                 /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/exper_desc.best/nmslib_bm25_avgembed.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 4
Parsed experiment parameters:
experSubdir:final_exper/bm25_avgembed_knn
candProv:nmslib
candProvAddConf:exper_desc.best/nmslib/bm25_avgembed/cand_prov.json
candProvURI:localhost:8000
candQty:1000
testOnly:1
Started a process 16166, working dir: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_avgembed_knn
Process log file: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_avgembed_knn/exper.log
Waiting for 1 child processes


__Top 100 report__:

In [25]:
!cat $COLLECT_ROOT/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25_avgembed_knn/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.402300
NDCG@20:  0.434900
NDCG@100: 0.507800
P20:      0.165600
MAP:      0.345600
MRR:      0.486600
Recall:   0.820033
